Tibor's Musings

Python Garbage Collection Issue

For performance reasons, it may be interesting to switch off Python garbage collection in inner loops. Here is a story behind doing that in Invenio software.

For performance reasons, it may be interesting to switch off Python garbage collection in inner loops. Here is story behind doing that in Invenio software.

On the ADS instance of Invenio that contains about 10M of records, it was observed that run_sql() sometimes run much longer than other times, especially when memory is being consumed by other objects. Here is initial illustrative example posted by Benoit:

In [1]: from invenio.dbquery import run_sql

In [2]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 1.44 s, sys: 0.08 s, total: 1.52 s
Wall time: 1.92 s

In [3]: i = range(50000000)

In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 11.36 s, sys: 0.07 s, total: 11.43 s
Wall time: 11.67 s

In [5]: j = range(50000000)

In [6]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 21.21 s, sys: 0.06 s, total: 21.27 s
Wall time: 21.54 s

The issue was discussed on project-invenio-devel mailing list in December 2011 in the thread entitled "Slow MySQL queries with large data structures in memory". It was eventually tracked down to a Python GC issue. One graphical illustration of this issue can be found at http://is.runcode.us/q/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-in-a-loop-as-the-list-grows.

Here are my findings comparing behaviour of Python-2.6 and Python-2.7 in this respect.

The test code:

#!python
import sys
import time
import gc

class A:
    def __init__(self):
        self.x = 1
        self.y = 2
        self.why = 'no reason'

def time_to_append(size, append_list, item_gen):
    t0 = time.time()
    for i in xrange(0, size):
        append_list.append(item_gen())
    return time.time() - t0

def test():
    x = []
    count = 5000
    for i in xrange(0,1000):
        print len(x), time_to_append(count, x, lambda: A())

def test_nogc():
    x = []
    count = 5000
    for i in xrange(0,1000):
        gc.disable()
        print len(x), time_to_append(count, x, lambda: A())
        gc.enable()

if '--nogc' in sys.argv:
    test_nogc()
else:
    test()

The test results measured on an HP EliteBook 8440p laptop, with Python-2.6 and Python-2.7:

Python-2.6 GC issue Python-2.7 GC issue

One can see that the cyclic garbage collector jumps in more reasonably with Python-2.7 than with Python-2.6, and that when it jumps in, it terminates more rapidly as well. However, one can also see that it is still much more preferable to switch GC off in the inner loop, during the construction of tuples-of-tuples or lists-of-stuff objects. Hence the patch for run_sql() committed in https://github.com/inveniosoftware/invenio/commit/7a2b03d232a04c054751bb003f4facbc8e214992 regardless of Python-2.x version.

python