Tibor's Musings

JSON Select Speed Test with MongoDB and PostgreSQL

MongoDB is a popular JSON database. PostgreSQL has added nice JSON capabilities lately. How do the two databases compare in terms of JSON select speed when using Python connectors? Let's find out.

The tests were performed on an HP EliteBook 8440p using Python 2.7, MongoDB 2.4.8 with pymongo 2.6.3, and PostgreSQL 9.3 with psycopg2 2.5.1, all in their default Debian GNU/Linux configurations.

Test data set

A test data set consisted of 50,000 JSON records representing book metadata, looking like this:

[...]
{
  "recid": 1494701,
  "title": "The Feynman lectures on physics; New millennium ed.",
  "author": "Feynman, Richard Phillips",
  "isbn": "9780465024933",
  "subject": ["53"],
  "url": "http://www.feynmanlectures.caltech.edu/I_toc.html",
  "publisher": "Basic Books",
  "place": "New York, NY",
  "year": "2010"
},
[...]

The test queries will be performed on the year field, which is represented as a string (so the greater-than queries below compare years as strings).

Database definitions

The databases were created as follows:

import json
import psycopg2
from pymongo import MongoClient

pg_con = psycopg2.connect("dbname=simko user=simko")
pg_cur = pg_con.cursor()

mg_db = MongoClient().test_database
mg_col = mg_db.test_collection

def mongodb_create_table():
    mg_db.create_collection('test_collection')

def mongodb_drop_table():
    mg_db.drop_collection('test_collection')

def postgresql_create_table():
    pg_cur.execute("CREATE TABLE test (recid INTEGER PRIMARY KEY, data json);")
    pg_con.commit()

def postgresql_drop_table():
    pg_cur.execute("DROP TABLE IF EXISTS test;")
    pg_con.commit()

In some tests, a dedicated year index will be used:

def mongodb_create_index():
    mg_col.create_index('year')

def mongodb_drop_index():
    mg_col.drop_index('year_1')

def postgresql_create_index():
    pg_cur.execute("CREATE INDEX year ON test ((data->>'year'));")
    pg_con.commit()

def postgresql_drop_index():
    pg_cur.execute("DROP INDEX year;")
    pg_con.commit()

Database connections

One possibly important factor to consider is that PyMongo returns strings as Python Unicode strings, while psycopg2 can return either Python binary UTF-8 strings or Python Unicode strings, depending on the chosen settings. Therefore let us perform two series of tests, once letting psycopg2 return Python binary strings and once forcing psycopg2 to return Unicode strings. The latter can be achieved globally via:

import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
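
To see which string representation is currently in effect, one can inspect a returned value directly, for example like this (a small sketch, assuming the test table defined below is already populated):

pg_cur.execute("SELECT data->>'author' FROM test LIMIT 1;")
print repr(pg_cur.fetchone()[0])  # a binary str by default, a unicode string once the extensions above are registered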

Database statistics

Database statistics, such as table size, index size, and number of rows, can be obtained for the two databases as follows:

def mongodb_stats():
    print "** mongodb stats"
    stats = mg_col.database.command('collstats', 'test_collection')
    print "count %d" % stats['count']
    print "size %d" % stats['size']
    print "storage size %d" % stats['storageSize']
    print "index size %d" % stats['totalIndexSize']

def postgresql_stats():
    print "** postgresql stats"
    pg_cur.execute("select pg_relation_size('test'),pg_total_relation_size('test')")
    pg_relation_size, pg_total_relation_size = pg_cur.fetchone()
    pg_cur.execute("select count(*) from test")
    pg_row_count = pg_cur.fetchone()[0]
    print "count %d" % pg_row_count
    print "table storage size %d" % pg_relation_size
    print "index size %d" % (pg_total_relation_size - pg_relation_size)

Timing helpers

To measure the speed of the (slow) data loading step, I used IPython's %time facility.

To measure the speed of (fast) select queries, I used IPython's %timeit facility.
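
For illustration, the measurements were taken roughly along these lines (a sketch of the IPython calls, relying on the helper functions defined in the following sections; not the verbatim session):

%time postgresql_load_books()
%timeit x = postgresql_test_select_speed('2012')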

Test 1: JSON loading

The test data set of 50,000 JSON records was loaded as follows:

def mongodb_load_books():
    for book in json.loads(open('books.json', 'r').read()):
        mg_col.insert(book)

def postgresql_load_books():
    for book in json.loads(open('books.json', 'r').read()):
        pg_cur.execute('INSERT INTO test (recid,data) VALUES (%s,%s);', (book['recid'], json.dumps(book)))
    pg_con.commit()

It may be interesting to note that json.loads() already parses string values into Python Unicode strings internally:

{u'author': u'Rodi, Wolfgang',
 u'isbn': u'0080445446',
 u'place': u'San Diego, CA',
 u'publisher': u'Elsevier',
 u'recid': 6904,
 u'title': u'Engineering Turbulence Modelling and Experiments 6: ERCOFTAC International Symposium on Engineering Turbulence and Measurements - ETMM6',
 u'url': u'https://cds.cern.ch/auth.py?r=EBLIB_P_318118_0',
 u'year': u'2005'}

These are stored as UTF-8 strings in MongoDB and PostgreSQL.

Here are resulting database sizes:

============== ========== ==========
   database      MongoDB  PostgreSQL
-------------- ---------- ----------
item count [#]     50,000     50,000
table size [B] 14,219,776 15,269,888
index size [B]  1,627,024  1,171,456
   w/ year [B]  2,747,136  2,310,144
============== ========== ==========

We can see that both MongoDB and PostgreSQL lead to databases of relatively similar size (the "w/ year" row shows the index size after the dedicated year index is added). Note the extra recid column in PostgreSQL, which leads to a slightly larger table.

Here are timing results of the data loading step:

========= ========== ==========
operation   MongoDB  PostgreSQL
--------- ---------- ----------
data load  9.470 sec  2.570 sec
========= ========== ==========

The data load was significantly faster with PostgreSQL; this is probably due to the delayed commit at the end of the load process, which leads to batch-like insertion of all the books with PostgreSQL, while the inserts were done in a book-by-book manner with MongoDB. Introducing an explicit PostgreSQL commit after each insert would make PostgreSQL slower. However, I'm not especially interested in data set loading timings here, but rather in data set select timings later on; hence these were just tangential observations.
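
For illustration, a hypothetical per-record-commit variant of the PostgreSQL loader, which would be expected to be slower, might look like this (a sketch, not part of the measured runs):

def postgresql_load_books_commit_each():
    # hypothetical variant: commit after every insert instead of once at the end
    for book in json.loads(open('books.json', 'r').read()):
        pg_cur.execute('INSERT INTO test (recid,data) VALUES (%s,%s);',
                       (book['recid'], json.dumps(book)))
        pg_con.commit()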

Test 2: searching for years, returning record IDs back

Let us now search inside the JSON structure for a certain year, returning a Python list of matching record IDs:

def mongodb_test_select_speed(year):
    res = []
    for book in mg_col.find({'year':year},{'recid':1, '_id':0}):
        res.append(book['recid'])
    return res

def postgresql_test_select_speed(year):
    res = []
    pg_cur.execute("SELECT recid FROM test WHERE data->>'year'=%s;", (year,))
    for row in pg_cur.fetchall():
        res.append(row[0])
    return res

Let us also test queries for year greater than a certain value:

def mongodb_test_select_speed_greater(year):
    res = []
    for book in mg_col.find({'year': {'$gt': year}},{'recid':1, '_id':0}):
        res.append(book['recid'])
    return res

def postgresql_test_select_speed_greater(year):
    res = []
    pg_cur.execute("SELECT recid FROM test WHERE data->>'year'>%s;", (year,))
    for row in pg_cur.fetchall():
        res.append(row[0])
    return res

MongoDB connector internally returns a list of dictionaries where one dictionary looks like:

{u'recid': 1618033}

PostgreSQL connector internally returns a list of tuples where one tuple looks like:

(1618033,)

Let us do a test query once for a rare year (1970) and once for a frequent year (2012). Also, let us perform the test once without the dedicated year index and once with it.

(Also, let us verify that MongoDB and PostgreSQL both return the same results; see the assertion sketch after the timing code below.)

Here is example timing code:

x = mongodb_test_select_speed('1970'); x.sort()
print "*** mongodb year=1970 ...... %d [%d,%d,...,%d,%d]" % (len(x), x[0], x[1], x[-2], x[-1])
%timeit x = mongodb_test_select_speed('1970')
%timeit x = mongodb_test_select_speed('1970')

x = postgresql_test_select_speed('1970'); x.sort()
print "*** postgresql year=1970 ... %d [%d,%d,...,%d,%d]" % (len(x), x[0], x[1], x[-2], x[-1])
%timeit x = postgresql_test_select_speed('1970')
%timeit x = postgresql_test_select_speed('1970')
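
One simple way to verify that MongoDB and PostgreSQL indeed return the same record IDs is an assertion over sorted results (a sketch):

assert sorted(mongodb_test_select_speed('1970')) == sorted(postgresql_test_select_speed('1970'))
assert sorted(mongodb_test_select_speed_greater('2012')) == sorted(postgresql_test_select_speed_greater('2012'))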

Here are the complete results (uPostgreSQL denotes PostgreSQL accessed via the Unicode-enabled psycopg2 connection):

========= ======  ========== ========= ========== ========= =========== =========
  query    hits     MongoDB   w/index  PostgreSQL  w/index  uPostgreSQL  w/index
--------- ------  ---------- --------- ---------- --------- ----------- ---------
year=1970     63     28 ms     0.56 ms   133 ms     0.19 ms    134 ms     0.19 ms
year=2012  5,110     43 ms    22.2 ms    137 ms     4.56 ms    137 ms     4.57 ms
year>2012  5,306     43 ms    22.7 ms    154 ms     5.78 ms    148 ms     5.76 ms
year>1970 49,563    184 ms    205 ms     181 ms    45.2 ms     174 ms    45.2 ms
========= ======  ========== ========= ========== ========= =========== =========

We can make several interesting observations already:

  • MongoDB is faster than PostgreSQL for queries without an index, especially when a small number of hits is returned (a speed-up of about 4x). When a large number of hits is returned, the difference is negligible.
  • PostgreSQL is faster than MongoDB for queries with an index, both when a small and when a large number of hits is returned (a speed-up of about 4x).
  • Creating an index is much more helpful for PostgreSQL than for MongoDB.
  • PostgreSQL binary vs Unicode connections do not make much difference here, since we are returning integers, as expected. The observed differences in timings may serve as an indication of measurement errors, e.g. the laptop operating system being more or less busy when repeating the test in Unicode mode. (Naturally, I've run the tests in various orders a few times to make sure the main result trends are reproducible, even though the numbers may differ.)

How much of the difference is due to the database search speed itself, and how much to differences in connector object types? And how would the timings look if the query returned some other part of the JSON structure?

Test 3: searching for years, returning authors back

In the previous test, record IDs were returned from the JSON data in the case of MongoDB, and from another table column in the case of PostgreSQL. This more realistically simulates a mixed SQL/noSQL database application scenario when using PostgreSQL. However, would the results change if we returned data from the JSON only?

To test this use case, let us return a list of book authors instead of a list of record IDs:

def mongodb_test_select_speed_author(year):
    res = []
    for book in mg_col.find({'year':year},{'author':1, '_id':0}):
        res.append(book.get('author',None))
    return res

def postgresql_test_select_speed_author(year):
    res = []
    pg_cur.execute("SELECT data->>'author' FROM test WHERE data->>'year'=%s;", (year,))
    for row in pg_cur.fetchall():
        res.append(row[0])
    return res

def mongodb_test_select_speed_greater_author(year):
    res = []
    for book in mg_col.find({'year': {'$gt': year}},{'author':1, '_id':0}):
        res.append(book.get('author',None))
    return res

def postgresql_test_select_speed_greater_author(year):
    res = []
    pg_cur.execute("SELECT data->>'author' FROM test WHERE data->>'year'>%s;", (year,))
    for row in pg_cur.fetchall():
        res.append(row[0])
    return res

MongoDB connector internally returns a list of dictionaries:

{u'author': u'Wilf, Herbert S'} ... MongoDB

PostgreSQL connector internally returns a list of tuples:

('Wilf, Herbert S',) .... PostgreSQL, binary strings
(u'Wilf, Herbert S',) ... PostgreSQL, Unicode extension

Here are the results:

========= ======  ========== ========= ========== ========= =========== =========
  query    hits     MongoDB   w/index  PostgreSQL  w/index  uPostgreSQL  w/index
--------- ------  ---------- --------- ---------- --------- ----------- ---------
year=1970     63    28.3 ms    0.59 ms   133 ms     0.35 ms    133 ms     0.43 ms
year=2012  5,110    45.8 ms   24.1 ms    153 ms    18.2 ms     155 ms    22.3 ms
year>2012  5,306    50.5 ms   24.8 ms    163 ms    19.7 ms     167 ms    24.0 ms
year>1970 49,563    199 ms    223 ms     308 ms    184 ms      352 ms    216 ms
========= ======  ========== ========= ========== ========= =========== =========

The results when returning authors confirm the general trends found before: MongoDB is faster for queries without indexes; however, once the indexes are switched on, PostgreSQL becomes faster. The interesting phenomenon where index creation actually slows down MongoDB (199 ms -> 223 ms) is still present.

As for the role of the PostgreSQL connector using binary strings vs Unicode strings, we can see that Unicode connections lead to slightly slower performance, which is understandable due to the necessary UTF-8 conversions. It is interesting to note that for the indexed queries, the speed of PostgreSQL when using Unicode strings slows down to become virtually equal to that of MongoDB.

Test 4: searching for years, returning counts back

What if we returned only the number of hits satisfying the given query, instead of returning strings, and measured that? Then the connectors would not have to do various kinds of transformations, and we would be able to better measure the "raw" search speed.

def mongodb_test_select_speed_count(year):
    return mg_col.find({'year':year},{'recid':1, '_id':0}).count()

def postgresql_test_select_speed_count(year):
    pg_cur.execute("SELECT COUNT(*) FROM test WHERE data->>'year'=%s;", (year,))
    return pg_cur.fetchone()[0]
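
The analogous greater-than count helpers, used for the year>2012 and year>1970 rows below, look like this (a sketch, following the same pattern as the equality variants above):

def mongodb_test_select_speed_greater_count(year):
    return mg_col.find({'year': {'$gt': year}}, {'recid':1, '_id':0}).count()

def postgresql_test_select_speed_greater_count(year):
    pg_cur.execute("SELECT COUNT(*) FROM test WHERE data->>'year'>%s;", (year,))
    return pg_cur.fetchone()[0]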

MongoDB connector internally returns an integer:

63

PostgreSQL connector internally returns a tuple:

(63L,)

Here are the results:

================ ======  ========== ========= ========== ========= =========== =========
     query       result    MongoDB   w/index  PostgreSQL  w/index  uPostgreSQL  w/index
---------------- ------  ---------- --------- ---------- --------- ----------- ---------
COUNT(year=1970)     63    25.4 ms   0.293 ms   133 ms    0.161 ms    133 ms    0.163 ms
COUNT(year=2012)  5,110    25.5 ms   1.07 ms    134 ms    1.56 ms     134 ms    1.60 ms
COUNT(year>2012)  5,306    25.5 ms   1.09 ms    145 ms    2.68 ms     145 ms    2.75 ms
COUNT(year>1970) 49,563    26.2 ms   6.82 ms    143 ms   19.2 ms      144 ms   18.9 ms
================ ======  ========== ========= ========== ========= =========== =========

In this test, virtually no data is returned and converted by the database connector. What we are measuring is close to the pure JSON select speed, though obviously influenced by database cache, file system cache, logging system, and related settings. The tests confirm that MongoDB is faster for queries without indexes; moreover, for the first time, it also becomes faster than PostgreSQL for most queries with indexes (all except the rare-year count).

Test 5: MongoDB configuration

The above tests were performed with Debian GNU/Linux's default MongoDB and PostgreSQL configurations. This was deliberate, in order to have a common starting point for comparison. However, in real life, the configurations would be tweaked to fit the application.

As an example, let us note that MongoDB logs every query taking longer than 100 ms into its log file. How big an effect does this have on our particular tests?

Looking at the MongoDB log file, several slow queries can be seen logged:

$ tail /var/log/mongodb/mongodb.log
[...]
[conn1] getmore test_database.test_collection query: { year: { $gt: "1970" } } cursorid:1600782125584581 ntoreturn:0 keyUpdates:0 locks(micros) r:103319 nreturned:49462 reslen:791412 103ms
[conn1] getmore test_database.test_collection query: { year: { $gt: "1970" } } cursorid:1606323121267136 ntoreturn:0 keyUpdates:0 locks(micros) r:101010 nreturned:49462 reslen:791412 101ms
[conn1] getmore test_database.test_collection query: { year: { $gt: "1970" } } cursorid:1742429610105362 ntoreturn:0 keyUpdates:0 locks(micros) r:102457 nreturned:49462 reslen:1592469 102ms
[conn1] getmore test_database.test_collection query: { year: { $gt: "1970" } } cursorid:1746596008196226 ntoreturn:0 keyUpdates:0 locks(micros) r:101370 nreturned:49462 reslen:1592469 101ms

Let us raise the MongoDB slow-query logging threshold, say to 2 seconds, by adjusting the profiling settings:

$ mongo
> db.setProfilingLevel(1,2000)
{ "was" : 1, "slowms" : 100, "ok" : 1 }
> db.setProfilingLevel(1,2000)
{ "was" : 1, "slowms" : 2000, "ok" : 1 }

Let us re-test the "worst case scenario" query:

%timeit x = mongodb_test_select_speed_greater_author('1970')

and compare the results.

We see that there is no measurable effect on query timings with these more permissive query-profiling settings; for our application, the timings stay the same. This is simply because not much is written to the log file: queries faster than 100 ms are not logged at all, and the slow queries are not repeated frequently enough by IPython's timing runs for the logging to matter. Hence this particular query-logging effect is not important for our particular set of tests.

We could go further in this direction, e.g. studying the effect of query cache settings etc. However, let's stop here, because for our relatively small data set (50,000 records of about 15 MB total size) we have gathered enough insight into our particular task at hand.

Test 6: PostgreSQL returning SQL data columns vs JSON data fields

Let's perform one more test. Above, when querying for a year and returning the record IDs matching that year, the PostgreSQL queries were returning record IDs from the SQL table column, not from the JSON data itself. This was done in order to simulate a mixed SQL/noSQL application setup. Later, we tested returning JSON data from PostgreSQL as well, but for author names, not for record IDs.

A question arises: how exactly would PostgreSQL performance differ when returning record IDs from the SQL column versus from the JSON data itself?

A small IPython experiment reveals:

In [30]: %timeit pg_cur.execute("SELECT recid FROM test WHERE data->>'year'=%s;", ('2012',))
10 loops, best of 3: 137 ms per loop

In [31]: %timeit pg_cur.execute("SELECT data->>'recid' FROM test WHERE data->>'year'=%s;", ('2012',))
10 loops, best of 3: 151 ms per loop

Returning record IDs from the SQL table column is faster than returning record IDs from the JSON data by about 10 per cent.

Conclusions

Let us represent the above tables in terms of the relative MongoDB-time-over-PostgreSQL-time slowdown factor for each run: M/P without indexes, Mi/Pi with indexes, and M/uP and Mi/uPi for the Unicode PostgreSQL connection. Values below 1.0 mean that MongoDB was faster; for example, RCIDS(year=1970) without an index gives 28 ms / 133 ms ≈ 0.211.

================ ======= ======= ======= =======
      query        M/P    Mi/Pi    M/uP   Mi/uPi
---------------- ------- ------- ------- -------
RCIDS(year=1970)  0.211   2.947   0.209   2.947
RCIDS(year=2012)  0.314   4.868   0.314   4.858
RCIDS(year>2012)  0.279   3.927   0.291   3.941
RCIDS(year>1970)  1.017   4.535   1.057   4.535
---------------- ------- ------- ------- -------
AUTHS(year=1970)  0.213   1.686   0.213   1.372
AUTHS(year=2012)  0.299   1.324   0.295   1.081
AUTHS(year>2012)  0.310   1.259   0.302   1.033
AUTHS(year>1970)  0.646   1.212   0.565   1.032
---------------- ------- ------- ------- -------
COUNT(year=1970)  0.191   1.820   0.191   1.798
COUNT(year=2012)  0.190   0.686   0.190   0.669
COUNT(year>2012)  0.176   0.407   0.176   0.396
COUNT(year>1970)  0.183   0.355   0.182   0.361
================ ======= ======= ======= =======

We can conclude that:

  • MongoDB is almost always faster when returning query counts.
  • MongoDB is almost always faster for queries not using indexes.
  • PostgreSQL is almost always faster for queries using indexes.
  • It pays off to create indexes for often used JSON fields. Important gains may be expected, typically 1000 per cent.
  • It pays off to return UTF-8 binary strings rather than Unicode strings if the application supports it. Moderate gains may be expected, typically 30 per cent.
  • It pays off to use SQL columns for ID-like fields. Small gains may be expected, typically 10 per cent.

Functionality-wise, PostgreSQL offers the possibility to comfortably combine relational and non-relational data in the same database. The present tests showed that it achieves this in a very efficient manner, too. Coupled with its user-level transaction mechanism, PostgreSQL seems to be an excellent all-in-one choice for mixed RDBMS-JSON applications.

Further outlook: it would be interesting to compare timings for "ElasticSearch-as-a-DB" JSON storage technique as well.