Using MongoDB
Indexes
Learning Objectives
- discuss the utility of an index
- Create a single-field index
- Create a compound index
Indexes can support the efficient execution of queries. MongoDB automatically creates an index on the _id
field upon the creation of a collection.
To create an index on a field or fields, pass to the create_index()
method a list of field and index type pairs:
[ ( <field1>: <type1> ), ... ]
The format is that same as for sort()
ing, with 1
for an ascending index and -1
for a descending index.
create_index()
only creates an index if the index does not exist.
Earlier, we ran the distinct
aggregation method to get a list of crystal systems. Let’s time this operation:
%%timeit
db.materials.distinct("spacegroup.crystal_system")
10 loops, best of 3: 47.1 ms per loop
Not bad, but what if our collection was bigger? Right now our collection is ~20 MB, so everything is loaded into RAM. What happens when our collection is tens of GB (or more) and MongoDB needs to fetch documents on disk and swap out from RAM?
Indexes for databases are like indexes for books – rather than having to scan every page consecutively for what you want, you can flip to the index in back and quickly get a reference to page numbers (or documents) of interest.
Let’s ensure a single-field index for the aggregation above:
db.materials.create_index([("spacegroup.crystal_system", 1)])
'spacegroup.crystal_system_1'
pprint(db.materials.index_information())
{'_id_': {'key': [('_id', 1)], 'ns': 'swc.materials', 'v': 1},
'spacegroup.crystal_system_1': {'key': [('spacegroup.crystal_system', 1)],
'ns': 'swc.materials',
'v': 1}}
And let’s re-time things:
%%timeit
db.materials.distinct("spacegroup.crystal_system")
1000 loops, best of 3: 296 µs per loop
The method is now 100× faster. Note, however, that there is overhead for indexes. Every time a document is inserted or updated, indexes need to be updated. Thus, indexes in write-heavy contexts, or for small collections, may not result in a net performance boost. However, if you are mostly finding/aggregating and only occasionally inserting/updating for large-sized collections, indexes are a capital idea.
MongoDB supports compound indexes, which are indexes on multiple fields. The order of the fields determine how the index stores its keys.
Let’s say a common access pattern for our materials collection is to fetch a pretty formula given a material id:
%timeit db.materials.find_one({"material_id": "mp-49"}, {"pretty_formula": 1, "_id": 0})
100 loops, best of 3: 12.1 ms per loop
Let’s make a compound index to cover this query:
db.materials.create_index([("material_id", 1), ("pretty_formula", 1)])
'material_id_1_pretty_formula_1'
%timeit db.materials.find_one({"material_id": "mp-49"}, {"pretty_formula": 1, "_id": 0})
1000 loops, best of 3: 187 µs per loop
For the above query, MongoDB doesn’t actually hit the collection proper because the query is “covered” by the index – it is an index-only query. Even if a query isn’t covered by an index and MongoDB has to load full documents from the collection into working memory, index use is still often much more performant than full collection scans.
You can specify various properties for indexes, such as a unique constraint and a flag to build the index in the background. In the PyMongo documentation, see create_index()
for the available options.
Compound Index Ordering
Use drop_index()
to drop the [("material_id", 1), ("pretty_formula", 1)]
index and re-create it but with the key-direction pairs reversed. Time the benchmark query again. What happens? Why?
Cover the Queries
Earlier, we ran queries on the elements
and materials
collections to $set
a key on certain materials collection documents for later use in aggregation. Create two indexes, one on elements
and one on materials
, to cover those queries.