MongoDB with Python.

14 min readJan 14, 2023

What is MongoDB?

The MongoDB database is an unstructured database. It stores data in the form of documents. MongoDB is able to handle huge volumes of data very efficiently and is the most widely used NoSQL database as it offers rich query language and flexible and fast access to data.

Let’s take a moment to understand the architecture of a MongoDB database before we jump into the crux of this tutorial.

The Architecture of a MongoDB Database

The information in MongoDB is stored in documents. Here, a document is analogous to rows in structured databases.

Each document is a collection of key-value pairs
Each key-value pair is called a field
Every document has an _id field, which uniquely identifies the documents
A document may also contain nested documents
Documents may have a varying number of fields (they can be blank as well)

These documents are stored in a collection. A collection is literally a collection of documents in MongoDB. This is analogous to tables in traditional databases.

Unlike traditional databases, the data is generally stored in a single collection in MongoDB, so there is no concept of joins (except $lookup operator, which performs left-outer-join like operation). MongoDB has the nested document instead.

Installing PyMongo

Pymongo is a Python driver for MongoDB allowing you to interact with Mongo databases using Python. You first need to have installed MongoDB on your system. If you haven’t already done so, you can read how do that here: https://docs.mongodb.com/manual/installation/

To use pymongo, you first need to install the library, for example with pip in the Python prompt:

pip install pymongo

Next, we need to import the pymongo library into a Python file or Jupyter notebook.

import pymongo

And then connect to a Mongo client. This connects on the default host and port.

client = pymongo.MongoClient(“mongodb://localhost:27017/”)

We can then create a database to store some data. In this example it’s going to store some details of patients for a health system.

db = client[“med_data”]

Next, we can add a collection to that database. Each database can contain multiple collections. This collection will be called patient_data and we will reference the collection in Python using the variable my_collection.

my_collection = db["patient_data"]

Inserting data

We can then add some data (a document) to the collection. Let’s say we wanted to store some basic details about a patient. This could include their name, age, biological sex and heart rate. We will also store their blood pressure which is typically displayed with two numbers representing the systolic and diastolic pressure and is typically measured in millimetres of mercury (mmHg), for example 156/82. In MongoDB, fields (data items) are encapsulated within braces ({}) using JavaScript object notation. Each field consists of a key/value pair. The field name (key) is enclosed in quotation marks followed by a colon and then the related value. Textual (text data) values are also encapsulated in quotation marks, numbers (numeric data) are not. Values can also contain other objects and arrays. Arrays can store lists of data and other key value pairs and are denoted with the square brackets ([]). Here we can store the keys and values for the systolic (sys) and diastolic (dia) blood pressure along with the data values.

patient_record = {
   "Name": "Maureen Skinner",
   "Age": 87,
   "Sex": "F",
   "Blood pressure": [{"sys": 156}, {"dia": 82}],
   "Heart rate": 82
}

Multiple documents can be added by simply adding a comma after the closing brace and adding additional objects. The different objects can also contain completely different data fields as required.

Once we have created a document(s), we can add it to the collection. To add a single document we first specify the collection we want to add in to followed by a dot then we can use the insert_one function (for many we use the insert_many) passing in the document object variable:

my_collection.insert_one(patient_record)

To view the contents of the collection we can loop over each item of the collection and print it.

for item in my_collection.find():
    print(item)

This will output the data like so:

Image by author

Viewing the data this way makes it quite difficult to read especially if you have a lot of fields and documents to output. Fortunately Python has a pretty print library for just such a purpose. If we modify the code to import the library and use the function (note the double ‘p’ in print):

from pprint import pprintfor item in my_collection.find():
    pprint(item)

You can see that it outputs the data in a much easier to read format:

Image by author

Note that MongoDB automatically adds an ObjectId to uniquely identify each document. This is a 12-byte hexadecimal string consisting of a timestamp, randomly generated value and incrementing counter. These id’s are displayed when data is output. You can also override this by providing your own values for the “_id” field if required.

We can add multiple records at a time using the insert_many function:

patient_records = [
 {
   "Name": "Adam Blythe",
   "Age": 55,
   "Sex": "M",
   "Blood pressure": [{"sys": 132}, {"dia": 73}],
   "Heart rate": 73
 },
 {
   "Name": "Darren Sanders",
   "Age": 34,
   "Sex": "M",
   "Blood pressure": [{"sys": 120}, {"dia": 70}],
   "Heart rate": 67
 },
 {
   "Name": "Sally-Ann Joyce",
   "Age": 19,
   "Sex": "F",
   "Blood pressure": [{"sys": 121}, {"dia": 72}],
   "Heart rate": 67
 }
]my_collection.insert_many(patient_records)

Updating data

We may also want to update data that we have previously added to a collection. Again we can update a single or multiple records. Let’s say we accidentally added the same heart rate for Darren Sanders and Sally-Ann Joyce. Darren’s was supposed to be 88. Here we can use the update_one function passing in the field we want to update searching for the key/value pair “name” and “Darren Sanders” then we use the $set option (preceded by a dollar sign) specifying the key (heart rate) and the new value (88). This will overwrite the initial value with the new one.

my_collection.update_one({"Name": "Darren Sanders"}, {"$set":{"Heart rate": 88}})

As you have seen, we can nest multiple layers of objects and arrays in one another thus embedding data. Another option is to separate out data in a separate collection and link to it. We will look at both embedding and linking and questions to help you determine which is best to use.

Embedding or linking data

We can nest data by embedding it. Consider that we want to store some medical test results for a patient. This could include some blood test results and an ECG/EKG image for some investigations for a heart attack and some blood tests, including:

Creatine Kinase (CK)
Troponin I (TROP)
Aspartate aminotransferase (AST)

We can start by creating a field called “test results” which contains an array.

patient_record = {
  "Hospital number": "3432543",
  "Name": "Karen Baker",
  "Age": 45,
  "Sex": "F",
  "Blood pressure": [{"sys": 126}, {"dia": 72}],
  "Heart rate": 78,
  "Test results" : []
}

Inside this array we can store objects for the ECG (a path to the image file) and another array to store the biochemical results.

patient_record = {
  "Hospital number": "3432543",
  "Name": "Karen Baker",
  "Age": 45,
  "Sex": "F",
  "Blood pressure": [{"sys": 126}, {"dia": 72}],
  "Heart rate": 78,
  "Test results" : [
    {
      "ECG": "\scans\ECGs\ecg00023.png"
    },
    {
      "BIOCHEM": []
    }
  ]
}

Finally, we can add the blood results as key/value pairs:

patient_record = {
  "Hospital number": "3432543",
  "Name": "Karen Baker",
  "Age": 45,
  "Sex": "F",
  "Blood pressure": [{"sys": 126}, {"dia": 72}],
  "Heart rate": 78,
  "Test results" : [
   {
     "ECG": "\scans\ECGs\ecg00023.png"
   },
   {
     "BIOCHEM": [{"AST": 37}, {"CK": 180}, {"TROPT": 0.03}]
   }
  ]
}

We can write these on the same line like we did with the blood pressure or on separate lines to aid with readability.

An alternative to embedding data in this way is to instead link to it. Linking data is also called referencing. This involves storing data in a different collection and referencing it by id. Deciding whether or not to link or embed data is dependent on certain considerations, such as:

How often you need to access the embedded information?
Is the data queried using the embedded information?
Is the embedded data subject to frequent change?
How often do you need to access the embedded data without the other information it’s embedded in?

Depending on the answer to these questions, you may want to link to the data instead. Consider the following example. You may want to store some information about what drugs have been prescribed for a given patient. You could embed this information, but what if you also wanted to store more generic information about the medication also. Here you could have a separate collection with such information that you could link to.

medication_data = [
 {
   "_id": ObjectId('60a3e4e5f463204490f70900'),
   "Drug name": "Omeprazole",
   "Type": "Proton pump inhibitor",
   "Oral dose": "20mg once daily",
   "IV dose": "40mg",
   "Net price (GBP)": 4.29
 },
 {
   "_id": ObjectId('60a3e4e5f463204490f70901'),
   "Drug name": "Amitriptyline",
   "Type": "Tricyclic antidepressant",
   "Oral dose": "30–75mg daily",
   "IV dose": "N/A",
   "Net price (GBP)": 1.32
 }
]

We can use the id’s and the DBRef function to reference this data in another collection. For example:

from bson.dbref import DBRefpatient_records = [
 {
   "Hospital number": "9956734",
   "Name": "Adam Blythe",
   "Age": 55,
   "Sex": "M",
   "Prescribed medications": [
     DBRef("medication_data", "60a3e4e5f463204490f70900"),
     DBRef("medication_data", "60a3e4e5f463204490f70901")
   ]
 },
 {
   "Hospital number": "4543673",
   "Name": "Darren Sanders",
   "Age": 34,
   "Sex": "M",
   "Prescribed medications": [
     DBRef("diagnosis_data", "60a3e4e5f463204490f70901")
   ]
 }
]

Querying data

There are several methods for querying data. All of the methods use the find() function. A query can be provided followed by the field or fields you wish to return in the form:

collection.find({ <query> }, { <field(s)> })

To find a single entry, for example the patient with the name “Darren Sanders” we could use the find function and print the first item in the list:

pprint(my_collection.find({"Name": "Darren Sanders"})[0]

We could also use a loop to output the results. We can also store the query in a separate variable that we pass into the find function first. This is useful when the query might be complex as it helps with the readability of the code:

query = {"Name": "Darren Sanders"}doc = my_collection.find(query)
for i in doc:
  pprint(i)

Finally, if we only want a single result we can use the find_one() function:

my_collection.find_one({"Name": "Darren Sanders"})

A common thing to do with databases is to query a subset of data depending on certain criteria. We can use comparison operators to retrieve subsets of data. For example we could use the greater than operator ($gt) to search for all patient names with a heart rate > 70 beats per minute.

for heart_rate in my_collection.find({"Heart rate": {"$gt": 70}}, {"Name"}):
    pprint(heart_rate)

There are many such comparison operators available, including:

(Image by author)

This functionality can be further enhanced by using logical operators. For example, we could search for patients with a heart rate < 70 beats per minute, and who are aged above 20 years.

result = my_collection.find({
 "$and" : [
     {
         "Heart rate": {"$lte": 70}
     },
     {
         "Age": {"$gt": 20}
     }
   ]
})for pt in result:
    pprint(pt)

Logical operators include:

(Image by author)

You might be wondering how we find data that’s contained in arrays. This can be done by using a period (dot). For example you may recall that we stored the patients’ systolic and diastolic blood pressure like so:

"Blood pressure": [{"sys": 126}, {"dia": 72}]

We could query patients with a systolic (sys) blood pressure less than 140 mmHG (mm of mercury) like so:

for normal in my_collection.find({"Blood pressure.sys": {"$lt": 140}}):
    pprint(normal)

Note that we reference the key “blood pressure” add a period (dot) and then the key inside the array, for example sys for systolic.

Working with existing data

One of the great things about MongoDB is that it is really straight forward to load JSON files and add them to collections. For example if we had some JSON data stored in a JSON file, we could use the json library to read in this data and add it to a MongoDB collection:

import jsonwith open('data_file.json') as f:
  file_data = json.load(f)
 
my_collection.insert_many(file_data)

You wouldn’t want to output the entire contents of a database with hundreds of thousands of documents. To view the file and see the structure of the data, you may instead output the first n documents instead. For example the first 10 documents. This can be achieved using the limit() function.

for item in my_collection.find().limit(10):
    pprint(item)

To check the number of documents in a collection we can use the count_documents function like so:

my_collection.count_documents({})

Again we could add a query here to count all the documents that meet some criteria of interest.

Aggregation

Often when working with data we don’t just want to extract subsets of data using queries, we instead want to produce new information from the existing data. This often involves carrying out various calculations like finding the average or sum of some value. For example the average wage of employees.

Let’s look at a brief example using a sample dataset containing details of restaurant data (the data can be found here: https://docs.atlas.mongodb.com/sample-data/available-sample-datasets/).

An example document can be seen below:

(Image by author)

You can see details of the restaurant address, which borough it is in, the type of cuisine, name, id and details of grades awarded with associated scores. Let’s say we wanted to compute the average scores of the restaurants. To achieve this we can use the aggregate function.

result = my_collection.aggregate(
  [
     {"$unwind": "$grades"},
     {"$match”: {}}, 
     {"$group": {"_id": "$name", "Avg grade": {"$avg": "$grades.score"}}}
  ]
)

We pass an array to the aggregate function. The $unwind parameter is used to deconstruct the grades array in order to output a document for each element. Next we use the $match parameter including everything (by using open and closing braces). We could filter further here by providing additional criteria. Next we use the $group parameter to group the data that we want to apply the computation to. Finally we create new key called “Avg grade” and apply the $avg (average) parameter to the grades scores referencing grades followed by a dot and the score key.

Producing the following output (shortened for brevity):

{'Avg grade': 15.2, '_id': 'Red Star Restaurant'}
{'Avg grade': 13.0, '_id': 'Weather Up'}
{'Avg grade': 9.4, '_id': 'La Nueva Playitas'}
{'Avg grade': 13.0, '_id': “Marcella’S Pizzeria & Catering”}
{'Avg grade': 9.0, '_id': 'Hot Wok'}
{'Avg grade': 9.333333333333334, '_id': '99 Favor Taste'}
{'Avg grade': 18.0, '_id': 'Flavors Corner'}
{'Avg grade': 10.666666666666666, '_id': 'Corona Restaurant'}
{'Avg grade': 9.0, '_id': 'Mila Cafe'}
{'Avg grade': 8.0, '_id': 'Circle Line Manhattan'}
{'Avg grade': 15.6, '_id': “The Old Time Vincent’S”}
{'Avg grade': 10.833333333333334, '_id': 'Riko'}
{'Avg grade': 10.0, '_id': 'Fresh Tortillas'}
{'Avg grade': 10.333333333333334, '_id': 'Le Village'}
{'Avg grade': 13.2, '_id': 'Ruay Thai Restaurant'}
{'Avg grade': 12.0, '_id': 'Lechonera Don Pancholo'}
{'Avg grade': 11.0, '_id': 'Pepe Rosso Social'}
. . .

There are many other parameters that can be used for common computations such as $sum, $min, $max etc.

We can also add additional functionality as required. For example we might want to sort the returned in ascending or descending order. We could simply add another line with the sort parameter specifying which field to sort by. 1 (ascending) or -1 (descending).

result = my_collection.aggregate(
  [
      {"$unwind": "$grades"},
      {"$match": {}}, 
      {"$group": {"_id": "$name", "Avg grade": {"$avg": "$grades.score"}}},
      {"$sort": {"Avg grade": -1}}
  ]
)

Another option to sort without using the aggregate function is to use the sort function directly passing in the field name, for example sorting by name:

for item in my_collection.find().sort("name").limit(10):
    pprint(item)

We can choose ascending/descending by adding a 1 or -1 after the field to sort:

for item in my_collection.find().sort("name", -1).limit(10):
    pprint(item)

Using MongoDB in software projects and for data science

One of the main advantages of MongoDB using the JSON format is the interoperability that this provides with programming languages that use a similar format. This makes working with data in an application and storing/retrieving it from a database near seamless.

A better way of integrating a database into code is to use a method such as Object Relational Mapping (ORM), or in the case of MongoDB an Object Document Mapper (ODM). This works by translating Python (or some other languages) code into MongoDB statements to retrieve data. This data is then passed back into Python objects. This has the advantage of ensuring that you only need to use one language (e.g. Python) to access and use the database.

(Image by author)

A good library for doing this is mongoengine. Here we import the library and connect to a Mongo client which we will call odm_patients.

from mongoengine import *
connect('odm_patients')

The following example shows how we can create a Python class to model some data, create a few instances of that class and write this to a database. Following the previous example we will create a class to store data about patients.

class Patient(Document):
    patient_id = StringField(required=True)
    name = StringField()
    age = IntField()
    sex = StringField(max_length=1)
    heart_rate = IntField()

We can use a Python class to create an object to manipulate the database. Here we create the data items by specifying what sort of field they are. For example textual/string data can be created using a StringField() function and integers with IntField(). Additional parameters can also be added such as the amount of characters in a string and if a field cannot be null/empty.

We can now create instances of this class in the standard way in Python. Here we can create a couple of patients called Maxine and Hamza. Note that we add the save() function to the end of the line to write this data to the database.

maxine_patient = Patient(patient_id = "342453", name = "Maxine Smith", age = 47, sex = "F", heart_rate = 67).save()hamza_patient = Patient(patient_id = "543243", name = "Hamza Khan", age = 22, sex = "M", heart_rate = 73).save()

We can output these objects using a loop. To access specific fields we can use the iterator, a dot and then the field we wish to output. For example the patients name, id and age.

for patient in Patient.objects:
    print(patient.name, patient.patient_id, patient.age)

Which produces the following output:

Maxine Smith 342453 47
Hamza Khan 543243 22

Apart from integrating Mongo databases into software projects, we can also use this for research and data science/analysis tasks too. There is an easy way to convert data from a Mongo database into tabular form as a Panda’s dataframe object. Firstly we import the pandas library.

import pandas as pd

Next we select the required data using a standard query, for example we will retrieve all the names for bakeries in the Bronx. Next we convert the results into a list data structure.

extracted_data = my_collection.find({},{"borough": "Bronx", "cuisine": "Bakery", "name": 1})
bronx_bakeries = list(extracted_data)

Finally we create a data frame using the from_dict function to convert the dictionaries into a tabular data frame:

pd.DataFrame.from_dict(bronx_bakeries)

This produces the following output:

In summary, MongoDB is a powerful and scalable database that is very useful when the data schema is prone to frequent change. This lends itself to integration with modern software systems easily and can also be used as part of a data analysis pipeline when analysing data in JSON format, such as some mobile app data or Twitter data. MongoDB is one of the most popular NoSQL databases and an awareness of what it is and how it works is a must for software engineers and data scientists.

MongoDB with Python.

What is MongoDB?

The Architecture of a MongoDB Database

Installing PyMongo

Inserting data

Updating data

Embedding or linking data

Querying data

Working with existing data

Aggregation

Using MongoDB in software projects and for data science

Written by Ajay Mane