Using MongoDB

Insert Data

Learning Objectives

  • Introduce the JSON format
  • Review Python data types and relate to the JSON format
  • Insert a document and understand the purpose of its _id
  • Introduce MongoDB’s “Binary JSON” (BSON) format

JSON (JavaScript Object Notation) is a lightweight data-interchange format built on two universal data structures [1]:

  • a mapping of names to values. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
  • an ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

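In JSON, they take on the forms below. The examples are printed as Python literals, so strings appear in single quotes and JSON's true, false, and null appear as True, False, and None: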

  • An object is an unordered set of name/value pairs.

    {'material_id': 'mp-568345', 'nelements': 1, 'pretty_formula': 'Fe'}
  • An array is an ordered collection of values.

    [{'material_id': 'mp-568345', 'nelements': 1, 'pretty_formula': 'Fe'},
     {'material_id': 'mp-12671', 'nelements': 3, 'pretty_formula': 'Er2SO2'},
     {'material_id': 'mp-1703', 'nelements': 2, 'pretty_formula': 'YbZn'}]
  • A value can be a string in quotes, or a number, or true or false or null, or an object or an array. These structures can be nested.

    {'material_id': 'mp-2340',
     'chemsys': 'Na-O',
     'has_bandstructure': True,
     'elasticity': None,
     'elements': ['Na', 'O'],
     'nelements': 2,
     'pretty_formula': 'Na2O2',
     'spacegroup': {
       'crystal_system': 'hexagonal',
       'hall': 'P -6 -2',
       'number': 189,
       'point_group': '-6m2',
       'source': 'spglib',
       'symbol': 'P-62m'}}
  • A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.

    {'material_id': 'mp-2340',
     'cif': "#\\#CIF1.1\n##########################################################################\n#               Crystallographic Information Format file \n#               Produced by PyCifRW module\n# \n#  This is a CIF file.  CIF has been adopted by the International\n#  Union of Crystallography as the standard for data archiving and \n#  transmission.\n#\n#  For information on this file format, follow the CIF links at\n#  http://www.iucr.org\n##########################################################################\n\ndata_Na2O2\n_symmetry_space_group_name_H-M          'P 1'\n_cell_length_a                          6.27944383\n_cell_length_b                          6.27944383387\n_cell_length_c                          4.50693686\n_cell_angle_alpha                       90.0\n_cell_angle_beta                        90.0\n_cell_angle_gamma                       120.000000032\n_chemical_name_systematic               'Generated by pymatgen'\n_symmetry_Int_Tables_number             1\n_chemical_formula_structural            Na2O2\n_chemical_formula_sum                   'Na6 O6'\n_cell_volume                            153.905615363\n_cell_formula_units_Z                   3\nloop_\n  _symmetry_equiv_pos_site_id\n  _symmetry_equiv_pos_as_xyz\n   1  'x, y, z'\n \nloop_\n  _atom_site_type_symbol\n  _atom_site_label\n  _atom_site_symmetry_multiplicity\n  _atom_site_fract_x\n  _atom_site_fract_y\n  _atom_site_fract_z\n  _atom_site_attached_hydrogens\n  _atom_site_B_iso_or_equiv\n  _atom_site_occupancy\n   O  O1  1  0.666667  0.333333  0.327586  0  .  1\n   O  O2  1  0.666667  0.333333  0.672414  0  .  1\n   O  O3  1  0.333333  0.666667  0.672414  0  .  1\n   O  O4  1  0.333333  0.666667  0.327586  0  .  1\n   O  O5  1  0.000000  0.000000  0.829456  0  .  1\n   O  O6  1  0.000000  0.000000  0.170544  0  .  1\n   Na  Na7  1  0.000000  0.699801  0.500000  0  .  1\n   Na  Na8  1  0.300199  0.300199  0.500000  0  .  1\n   Na  Na9  1  0.699801  0.000000  0.500000  0  .  1\n   Na  Na10  1  0.000000  0.365556  0.000000  0  .  1\n   Na  Na11  1  0.634444  0.634444  0.000000  0  .  1\n   Na  Na12  1  0.365556  0.000000  0.000000  0  .  1\n \n",
    'doi_bibtex': '@misc{Kristin Persson_2014, place={United States}, title={Materials Data on Na2O2 (SG:189) by Materials Project}, url={http://www.osti.gov/dataexplorer/servlets/purl/1182584}, DOI={10.17188/1182584}, abstractNote={Computed materials data using density functional theory calculations. These calculations determine the electronic structure of bulk materials by solving approximations to the Schrodinger equation. For more information, see https://materialsproject.org/docs/calculations}, author={Kristin Persson}, year={2014}, month={Nov}}'}
  • A number is very much like a C or Java number, except that the octal and hexadecimal formats are not used. Exponent (scientific) notation is supported. JSON itself does not fix a precision, but most parsers store non-integers as IEEE 754 double-precision floating-point values.

    -1
    1.6e5
    1.432e-10
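
To relate these forms to Python data types, here is a minimal sketch using Python's standard json module. The document below is a shortened copy of the Na2O2 example above; nothing in it depends on MongoDB. It shows a dict becoming a JSON object, a list becoming an array, and True and None becoming true and null.

```python
import json

# A shortened version of the Na2O2 example, written as a Python dict.
material = {
    "material_id": "mp-2340",
    "has_bandstructure": True,   # Python bool  -> JSON true
    "elasticity": None,          # Python None  -> JSON null
    "elements": ["Na", "O"],     # Python list  -> JSON array
    "nelements": 2,              # Python int   -> JSON number
    "spacegroup": {"crystal_system": "hexagonal", "number": 189},  # dict -> object
}

# Serialize to JSON text: strings are always wrapped in double quotes.
text = json.dumps(material)
print(text)

# Parsing the text gives back an equal Python structure.
restored = json.loads(text)

# JSON numbers: no fractional part or exponent -> int, otherwise float.
small_int = json.loads("-1")
big_float = json.loads("1.6e5")
```

The round trip is lossless here because every value has a direct JSON counterpart. Types without one, such as dates, are where MongoDB's extensions come in.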

Now, let’s explore the mechanics of inserting some (fake) data into our collection. In the process, we’ll see how MongoDB extends JSON to allow representation of data types that are not part of the JSON specification.

> material = m = {}
> m.fake = true
> m.elements =  ["Na", "O"]
> m.band_gap = 1.736
> m.last_updated = new Date()
> m.spacegroup = {crystal_system: "hexagonal", number: 189}
> material
{
    "fake" : true,
    "elements" : [
        "Na",
        "O"
    ],
    "band_gap" : 1.736,
    "last_updated" : ISODate("2016-04-02T23:45:14.274Z"),
    "spacegroup" : {
        "crystal_system" : "hexagonal",
        "number" : 189
    }
}

We first create a JSON-like object that we will insert as a document into our database collection. Note that we have created a timestamp. Note also that we know nothing about the formatting of other documents in the collection. Do the other documents have a “fake” key? We don’t need to care, because MongoDB allows for a flexible schema: it won’t complain if some documents contain certain keys and others don’t. In fact, MongoDB has an $exists operator you can use in queries to filter for documents that do or do not contain a given key. When fetching a key’s value across a set of documents, if certain documents don’t contain the key, the null value will be returned for those documents.
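
The flexible-schema idea can be illustrated with plain Python dicts. This is only an analogy and not how MongoDB evaluates queries: the has_key helper and the two sample documents are hypothetical, and no database is involved. It mimics filtering on whether a key exists, and shows a missing key reading as None (Python's counterpart of null).

```python
# Two documents in the same "collection" with different keys.
docs = [
    {"pretty_formula": "Na2O2", "fake": True},
    {"pretty_formula": "Fe"},  # no "fake" key at all
]


def has_key(documents, key, exists=True):
    """Keep documents that do (or do not) contain `key`.

    Loosely mirrors a query such as {key: {"$exists": true}}.
    """
    return [d for d in documents if (key in d) == exists]


with_fake = has_key(docs, "fake")
without_fake = has_key(docs, "fake", exists=False)

# Reading the key from every document: missing keys come back as None.
values = [d.get("fake") for d in docs]
print(values)
```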

> db.materials.insert(material)
WriteResult({ "nInserted" : 1 })
> db.materials.find({fake: true})
{ "_id" : ObjectId("570059a8dcc375fbe671e3a5"), "fake" : true, "elements" : [ "Na", "O" ], "band_gap" : 1.736, "last_updated" : ISODate("2016-04-02T23:45:14.274Z"), "spacegroup" : { "crystal_system" : "hexagonal", "number" : 189 } }

The inserted document is assigned a unique _id that travels with the document. Note that the _id and last_updated values are displayed as special objects and not as plain JSON values (e.g. simple numbers or strings). That is because MongoDB stores documents as BSON (“Binary JSON”), which adds types such as ObjectId and dates that JSON lacks. These values can still be written as JSON: for example, the _id value looks something like {"$oid": "570059a8dcc375fbe671e3a5"}, i.e. a JSON object with a special $-prefixed key.
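
As a rough sketch of that $-prefixed convention, the Python snippet below serializes a datetime as a {"$date": ...} object using only the standard library. The to_extended_json function is a hypothetical helper written for this illustration, and the _id is entered by hand in its {"$oid": ...} form. Real drivers ship their own converters.

```python
import json
from datetime import datetime, timezone


def to_extended_json(value):
    """Fallback for json.dumps: wrap datetimes in a $-prefixed object."""
    if isinstance(value, datetime):
        # UTC, millisecond precision, with a trailing "Z".
        stamp = value.astimezone(timezone.utc).isoformat(timespec="milliseconds")
        return {"$date": stamp.replace("+00:00", "Z")}
    raise TypeError(f"Cannot serialize {type(value).__name__}")


doc = {
    "_id": {"$oid": "570059a8dcc375fbe671e3a5"},  # written by hand for this sketch
    "last_updated": datetime(2016, 4, 2, 23, 45, 14, 274000, tzinfo=timezone.utc),
}

text = json.dumps(doc, default=to_extended_json)
print(text)
```

Without the default= hook, json.dumps raises a TypeError on the datetime, which is the gap in plain JSON that the special keys are meant to fill.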

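It’s worth noting that the _id field is very, very likely to be unique (and is in fact guaranteed to be unique within the collection as long as a unique index is built for that field, which is the default). A generated _id is an ObjectId, a 12-byte BSON type constructed using:

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 3-byte machine identifier,
  • a 2-byte process id, and
  • a 3-byte counter, starting with a random value.

Newer MongoDB releases document a slightly different layout, in which a single 5-byte random value takes the place of the machine identifier and process id. The leading 4-byte timestamp and the trailing 3-byte counter are unchanged.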

This makes it reasonable to create documents in parallel without the race conditions that an increment-by-one id strategy would face. Furthermore, sorting on an _id field that stores ObjectId values is roughly equivalent to sorting by creation time (the order is not strict within a single second):

> db.materials.findOne()._id.getTimestamp()
ISODate("2016-04-05T16:38:07Z")
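
To make the byte layout concrete, here is a minimal Python sketch that splits an ObjectId's 24-character hex string into the four fields listed above. The split_object_id function is a hypothetical helper, not part of any driver. It assumes the 4 + 3 + 2 + 3 byte layout described in this lesson and reads each field as big-endian, so only the timestamp part should be relied on for ids generated by newer servers.

```python
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 24-character ObjectId hex string into its assumed fields."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    seconds = int.from_bytes(raw[0:4], "big")  # seconds since the Unix epoch
    return {
        "created": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),                   # 3-byte machine identifier
        "process": int.from_bytes(raw[7:9], "big"),   # 2-byte process id
        "counter": int.from_bytes(raw[9:12], "big"),  # 3-byte counter
    }


# The _id of the fake document inserted earlier in this lesson.
parts = split_object_id("570059a8dcc375fbe671e3a5")
print(parts["created"])  # 2016-04-02 23:45:44+00:00

# Because the timestamp comes first, sorting ids also sorts by creation second.
ids = ["570059a8dcc375fbe671e3a5", "56ff0000dcc375fbe671e3a4"]
oldest_first = sorted(ids)
```

The decoded time is about thirty seconds later than the document's last_updated value, which is consistent with the timestamp having been created in the shell shortly before the insert.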

Before we move on, let’s remove all fake documents from our collection (we’ll go over removal again later):

> db.materials.remove({fake: true})
WriteResult({ "nRemoved" : 1 })

About those _ids

What are some properties of a generated MongoDB document _id (choose zero or more of the following)?

A. It can double as a “created-at” timestamp

B. It is guaranteed to be unique in the scope of the collection

C. It is valid JSON

Multiple insertion

What happens when you try to re-create and insert the example (fake) material document again?


  1. The intro to JSON here was taken from json.org.