nosql – Marco Bonzanini

MongoDB is one of the popular NoSQL databases. It uses a document-oriented, JSON-like approach to represent data, making the integration of semi-structured data fairly easy.

This article is an introduction to PyMongo, the Python package for interacting with MongoDB, and covers the basic operations on the database.

MongoDB Basics

As mentioned in the opening paragraph, MongoDB is a document-oriented database. The “document” is hence the main concept in this family of databases. This means that the database stores and retrieves records called documents.

If modelled in a sensible fashion, documents are usually self-describing and self-contained, so everything you need to know about the document is represented within the document. MongoDB represents documents in a JSON-like format (stored internally in a binary form called BSON), which is a very convenient way to encapsulate data.

An example of a document in MongoDB:

{
    "title": "An article about MongoDB and Python",
    "content": "Some long discussion about MongoDB",
    "publication_date": "2015-09-07 10:00:00",
    "shares": {
        "twitter": 123,
        "facebook": 456,
        "linkedin": 789
    },
    "tags": ["python", "mongodb", "nosql"],
    "author": {
        "name": "Marco",
        "author_id": 1
    }
}
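
As a small illustration of what "self-contained" means in practice, the sketch below loads an abbreviated copy of the document above with Python's standard json module. The variable names are made up for this example; nothing here needs a running database.

```python
import json

# An abbreviated copy of the example document, as a JSON string.
raw = """
{
    "title": "An article about MongoDB and Python",
    "shares": {"twitter": 123, "facebook": 456, "linkedin": 789},
    "tags": ["python", "mongodb", "nosql"],
    "author": {"name": "Marco", "author_id": 1}
}
"""

article = json.loads(raw)  # a plain Python dict

# Nested fields are reached with ordinary dict and list access.
total_shares = sum(article["shares"].values())
author_name = article["author"]["name"]

print(total_shares)  # 1368
print(author_name)   # Marco
```

Everything needed to answer "who wrote this and how often was it shared" travels with the document itself, with no join against another table.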

Documents are grouped into collections. In a comparison with the relational world, if a document is a row/record, a collection would be something similar to a table. A group of collections can be hosted on the same database, and the concept of database here is similar to the relational one (sometimes referred to as schema).

Being a JSON-style data store, MongoDB is often referred to as schemaless: fields can dynamically be added to documents without the need to define an explicit schema.

Note: schema design is still a useful tool in complex applications, and some careful thought should go into designing our document representation. It may be more helpful to think of it as a dynamic schema rather than schemaless: the term still conveys the flexibility provided by document-oriented data stores, without forgetting the importance of some sort of structure.
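
One lightweight way to keep some structure is to check documents in application code before inserting them. The sketch below assumes that every article must have a title and an author; that rule, and the names REQUIRED_FIELDS and missing_fields, are invented for this illustration and are not something MongoDB enforces.

```python
# Hypothetical rule for this sketch: every article needs these fields.
REQUIRED_FIELDS = {"title", "author"}

def missing_fields(doc, required=REQUIRED_FIELDS):
    """Return the required field names that are absent from doc, sorted."""
    return sorted(required - set(doc))

complete = {"title": "Intro to MongoDB", "author": "Marco", "tags": ["nosql"]}
incomplete = {"title": "Untitled draft"}

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['author']
```

Extra fields such as tags pass through untouched, so the schema stays dynamic while the essentials are still guaranteed.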

Quick Installation

MongoDB is available for most package managers (e.g. in rpm or deb format). You can also install the binary tarball by simply unzipping it and making it visible from your $PATH. From the official website, download the tarball for your operating system. At the time of this writing, the latest version is 3.0.6:

tar zxf mongodb-osx-x86_64-3.0.6.tgz
ln -s mongodb-osx-x86_64-3.0.6 mongodb

You can immediately run the server:

cd mongodb
./bin/mongod

The MongoDB daemon is now listening on localhost:27017, the default settings.

MongoDB + Python = PyMongo

PyMongo is the Python driver for MongoDB, and it can be installed via pip:

pip install pymongo

This should install the latest 3.0 version of PyMongo (the interface for previous 2.* versions could be slightly different, so pay attention to this detail).

The main component of PyMongo is the MongoClient class. In order to connect with the database, we’ll need an instance of the client:

from pymongo import MongoClient

client = MongoClient()
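
Called without arguments, MongoClient() targets localhost:27017. As a hedged sketch, the connection can also be spelled out with a URI and checked with a ping. The helper names mongo_uri and server_is_up are made up for this example, and the two-second timeout is an arbitrary choice.

```python
def mongo_uri(host="localhost", port=27017):
    """Build a basic MongoDB connection URI (no credentials or options)."""
    return "mongodb://{}:{}".format(host, port)

def server_is_up(uri, timeout_ms=2000):
    """Return True if a MongoDB server answers a ping at the given URI."""
    # Imported here so the URI helper works even without PyMongo installed.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    try:
        client.admin.command("ping")
        return True
    except PyMongoError:
        return False
    finally:
        client.close()

print(mongo_uri())  # mongodb://localhost:27017
```

With mongod running as described earlier, server_is_up(mongo_uri()) should return True; without a server it returns False after the timeout instead of hanging.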

Databases and collections are created automatically if they don’t exist. Let’s create a database called tutorial and a collection called articles:

db = client['tutorial']
coll = db['articles']

We can now define a document and insert it in the collection:

from datetime import datetime

doc = {
    "title": "An article about MongoDB and Python",
    "author": "Marco",
    "publication_date": datetime.utcnow(),
    # more fields
}

doc_id = coll.insert_one(doc).inserted_id

The insert_one() method does exactly what its name says. The document is represented as a Python dictionary, which looks very similar to a JSON object. PyMongo handles the conversion between data types (e.g. a datetime object in Python will become an ISODate in MongoDB, booleans are converted from True to true, etc.).

We also read the inserted_id attribute of the result, to capture the ID that MongoDB assigned to the document. This ID acts as a primary key and is stored in the _id field:

print(doc_id)
# ObjectId('55eb163d3661250ae4232ba6')

The _id of a document is an instance of ObjectId, rather than a simple string. This is important to keep in mind if we try to retrieve a document by its ID:

# query by ObjectId
my_doc = coll.find_one({'_id': doc_id})
print(my_doc)
# {'title': 'An article about MongoDB and Python', 'author': 'Marco', '_id': ObjectId('55eb163d3661250ae4232ba6'),  ...}

doc_id_str = str(doc_id)
print(doc_id_str)
# 55eb163d3661250ae4232ba6

# query by ID-string
my_doc = coll.find_one({'_id': doc_id_str})
print(my_doc)
# None

# Converting an ID-string to an ObjectId
from bson.objectid import ObjectId
my_doc = coll.find_one({'_id': ObjectId(doc_id_str)})
print(my_doc)
# {'title': 'An article about MongoDB and Python', 'author': 'Marco', '_id': ObjectId('55eb163d3661250ae4232ba6'), ...}

The string-vs-ObjectId mismatch problem is one of the first encountered by MongoDB novices, so it’s something to keep in mind.
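
One way to guard against the mismatch is sketched below. An ObjectId is 12 bytes, printed as 24 hexadecimal characters, so a string can be checked before it is converted. The helper names are hypothetical; the bson package is installed together with PyMongo, which also offers ObjectId.is_valid() for a similar check.

```python
import re

# An ObjectId is 12 bytes, printed as 24 hexadecimal characters.
_OBJECT_ID_RE = re.compile(r"[0-9a-fA-F]{24}")

def looks_like_object_id(value):
    """Return True if value is a string in ObjectId hex format."""
    return isinstance(value, str) and bool(_OBJECT_ID_RE.fullmatch(value))

def to_object_id(value):
    """Convert an ID-string to an ObjectId, or raise ValueError."""
    if not looks_like_object_id(value):
        raise ValueError("not a valid ObjectId string: {!r}".format(value))
    # Imported lazily: bson is installed together with PyMongo.
    from bson.objectid import ObjectId
    return ObjectId(value)

print(looks_like_object_id("55eb163d3661250ae4232ba6"))  # True
print(looks_like_object_id("not-an-id"))                 # False
```

A query such as coll.find_one({'_id': to_object_id(doc_id_str)}) then fails loudly on malformed input rather than silently returning None.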

We have introduced the insert_one() and find_one() methods, which work on a single document. Their counterparts, insert_many() and find(), work on many documents. Specifically, insert_many() takes a list of documents, i.e. a list of dictionaries, as its argument. find(), on the other hand, returns a cursor, which can be used like an iterable, e.g.:

many_docs = coll.find() # empty query means "retrieve all"
for doc in many_docs:
    print(doc)

More Complex Queries

Given this data set:

doc1 = {
    "title": "Intro to MongoDB and Python",
    "publication_date": datetime(2015, 9, 7),
    "likes": 10
}
doc2 = {
    "title": "Intro to Neo4J and Python",
    "publication_date": datetime(2015, 9, 1),
    "likes": 5
}
doc3 = {
    "title": "Intro to Elasticsearch and Python",
    "publication_date": datetime(2015, 8, 1),
    "likes": 15
}
coll.insert_many([doc1, doc2, doc3])

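# count() works on PyMongo 3.x but was removed in 4.0;
# on PyMongo 3.7 or later, prefer coll.count_documents({})
print(coll.count())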
# 3

We can filter the results to find documents published before a given date:

results = coll.find({"publication_date": {'$lt': datetime(2015, 9, 1)}})
for doc in results:
    print(doc)
# {'title': 'Intro to Elasticsearch and Python', ...}

The $lt operator simply stands for less than, and of course it has a counterpart in $gt. As expected, we also have $lte and $gte if we want to include the limit itself (the date 2015-09-01 in the previous example):

results = coll.find({"publication_date": {'$lte': datetime(2015, 9, 1)}})
for doc in results:
    print(doc)
# {'title': 'Intro to Neo4J and Python', ...}
# {'title': 'Intro to Elasticsearch and Python', ...}
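
The comparison operators can also be combined on the same field to express a range. A minimal sketch follows; date_range_query is a name made up for this example, and it builds a half-open interval (start included, end excluded).

```python
from datetime import datetime

def date_range_query(field, start, end):
    """Build a filter matching start <= field < end."""
    return {field: {"$gte": start, "$lt": end}}

# All articles published in September 2015.
september = date_range_query(
    "publication_date", datetime(2015, 9, 1), datetime(2015, 10, 1)
)
print(september)
```

The resulting dictionary can be passed straight to coll.find(september); with the data set above it should match the MongoDB and Neo4J articles, but not the Elasticsearch one.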

Without an explicit sort, the order of the results is not guaranteed (in practice it often follows insertion order). If we need to sort the results according to a particular field, we can use the sort() method of the cursor:

from pymongo import ASCENDING, DESCENDING

# get all docs, sort by number of likes high-to-low
results = coll.find().sort("likes", DESCENDING)
for doc in results:
    print(doc)
# {'title': 'Intro to Elasticsearch and Python', "likes": 15, ...}
# {'title': 'Intro to MongoDB and Python', "likes": 10, ...}
# {'title': 'Intro to Neo4J and Python', "likes": 5, ...}

We can of course build more complex queries and combine them with the appropriate sorting. As the query itself is a Python dictionary, we can define it separately rather than in-line, just for readability:

query = {
    "publication_date": {
        "$gte": datetime(2015, 9, 1)
    },
    "likes": {
        "$gt": 5
    }
}
results = coll.find(query)
for doc in results:
    print(doc)
# {'title': 'Intro to MongoDB and Python', "likes": 10, ...}
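
To make the semantics concrete, here is a toy, pure-Python matcher that mimics how such a filter is evaluated: every field condition must hold (an implicit AND), and each operator compares the document's value with the given limit. This is an illustration only, not how MongoDB or PyMongo is implemented, and it covers just the four comparison operators used in this article.

```python
import operator
from datetime import datetime

# The four comparison operators used in this article.
OPERATORS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}

def matches(doc, query):
    """Return True if doc satisfies every condition in query."""
    for field, conditions in query.items():
        if field not in doc:
            return False
        for op_name, limit in conditions.items():
            if not OPERATORS[op_name](doc[field], limit):
                return False
    return True

docs = [
    {"title": "Intro to MongoDB and Python",
     "publication_date": datetime(2015, 9, 7), "likes": 10},
    {"title": "Intro to Neo4J and Python",
     "publication_date": datetime(2015, 9, 1), "likes": 5},
    {"title": "Intro to Elasticsearch and Python",
     "publication_date": datetime(2015, 8, 1), "likes": 15},
]
query = {
    "publication_date": {"$gte": datetime(2015, 9, 1)},
    "likes": {"$gt": 5},
}

titles = [d["title"] for d in docs if matches(d, query)]
print(titles)  # ['Intro to MongoDB and Python']
```

The Neo4J article passes the date condition but fails "likes greater than 5", and the Elasticsearch article fails the date condition, which is why only one document survives.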

Summary

This article has introduced PyMongo, the Python driver to interact with MongoDB. The interaction with MongoDB via Python is fairly straightforward, and we can be up and running with some basic queries quite quickly.

MongoDB is a popular NoSQL database built around a document-oriented data store. Its JSON-like format and its dynamic schema approach make the case for self-describing and self-contained documents.

Getting Started with MongoDB and Python