Python's ease of use means that programmers usually do not have to manage memory. However, when dealing with large numbers of values (e.g., astronomical catalogs), unnecessarily large memory usage can become a limiting factor. When the values only need to be iterated over once, generators can significantly improve memory performance.
For instance, let's create a list of one million random integers from 0 to 99 and filter it to check whether it contains all the numbers 0-9 (which is very likely given the large sample).
import random
import sys
n = int(1e6)
lst = [random.randint(0, 99) for i in range(n)]
print(sys.getsizeof(lst)) # 8697464
print({x for x in lst if x<10}) # {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
print({x for x in lst if x<10}) # {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
We first created the list of random integers using a list comprehension (the for syntax within square brackets). The standard library function sys.getsizeof() returns the size of the object in bytes. We then filtered the list with a set comprehension (the for syntax within curly brackets), keeping the elements smaller than 10; a set is used so that repeated elements appear only once. Finally, we filtered the same list a second time and obtained the same result, showing that we can iterate over the list several times because its values are stored in memory. Execution time is 0.8 s (Intel Core i7-8550U CPU, 1.80 GHz) and peak memory usage for the program is 29644 kB.
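How these figures were measured is not shown above; one possible way to obtain comparable numbers on Linux is sketched below, using time.perf_counter() for the elapsed time and the Unix-only resource module for the peak resident set size (on Linux, ru_maxrss is reported in kB; on other platforms the units may differ).
import random
import resource
import time

start = time.perf_counter()
n = int(1e6)
lst = [random.randint(0, 99) for i in range(n)]
print({x for x in lst if x < 10})
elapsed = time.perf_counter() - start
# Peak resident set size of the current process (kB on Linux).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"time: {elapsed:.2f} s, peak memory: {peak_kb} kB")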
Let's repeat the same steps, replacing the list comprehension with a generator expression (the for syntax within round brackets).
n = int(1e6)
gen = (random.randint(0, 99) for i in range(n))
print(sys.getsizeof(gen)) # 120
print({x for x in gen if x<10}) # {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
print({x for x in gen if x<10}) # set()
First, the size of the generator is much smaller than that of the previous list because values are generated on demand instead of being stored in memory. We iterated over the generator in the same way as over the list and obtained the same result. However, an important difference is that we can iterate only once: on the second iteration the generator is already exhausted and we get an empty set. Execution time is roughly the same as before, but peak memory usage is smaller, 21100 kB.
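To make the exhaustion explicit, here is a small sketch: once all values have been consumed, further iteration produces nothing and next() raises StopIteration.
gen = (random.randint(0, 99) for i in range(5))
print(list(gen))   # consumes all five values
print(list(gen))   # [] because the generator is now exhausted
try:
    next(gen)
except StopIteration:
    print("no values left")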
Increasing the number of random integers to 10 million, execution time is 7.7 s in both cases, but peak memory usage for the list comprehension example grows to 100768 kB, while for the generator it stays at 21100 kB. The gain in memory performance is now substantial.
If the generator expression syntax above is too restrictive, generators can also be created using the yield keyword. In the following function, yield is used where return would normally appear, and the function returns a generator instead of a single value.
def make_generator(n):
    rng = range(n)
    for i in rng:
        yield random.randint(0, 99)
n = int(1e6)
gen = make_generator(n)
print({x for x in gen if x<10}) # {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Note that the call make_generator(n) does not run the body of the function; it only returns the generator. The part of the body preceding the for loop runs only once, when the first value is requested from the generator. After that, only the code inside the for loop executes each time a new value is requested, with execution resuming right after the yield.
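This behaviour can be made visible with a small sketch (traced_generator is a hypothetical variant of make_generator with a print statement added to the setup code):
def traced_generator(n):
    print("setup code runs")      # executed on the first next(), not at call time
    for i in range(n):
        yield random.randint(0, 99)

gen = traced_generator(3)         # prints nothing: the body has not started yet
print(next(gen))                  # prints "setup code runs", then the first value
print(next(gen))                  # resumes after the yield; the setup does not run again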
Besides generators, Python's garbage collector can also be exploited, more or less directly, to improve memory performance. For example, as an indirect use of the garbage collector, a list comprehension that generates a large array at each iteration can be replaced by a plain for loop, so that each array is garbage collected at the end of its iteration (provided it is only created and processed within that iteration). In this case, however, using generators is preferable, as they are more reliable, more elegant and likely faster.
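The pattern just described could look like the following sketch, where make_chunk and process_chunk are hypothetical placeholders for code that produces and consumes a large array.
def make_chunk(k):
    return [random.random() for i in range(10**6)]   # stand-in for a large array

def process_chunk(chunk):
    return sum(chunk) / len(chunk)

# List comprehension: all ten chunks are alive in memory at the same time.
chunks = [make_chunk(k) for k in range(10)]
results = [process_chunk(c) for c in chunks]

# Plain for loop: each chunk becomes unreachable, and can be reclaimed,
# at the end of its iteration.
results = []
for k in range(10):
    results.append(process_chunk(make_chunk(k)))

# Generator expression: same memory behaviour, and usually preferable.
results = [process_chunk(c) for c in (make_chunk(k) for k in range(10))]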