Ayadi Tahar | First taste of Document Databases with MongoDB

First taste of Document Databases with MongoDB

Publish Date: 2022-09-21

MongoDB is classified as a NoSQL document database. It offers the scalability and flexibility that developers need to query and index complex application data at any scale.

MongoDB stores data in flexible, JSON-like documents with optional schemas (schema-less), meaning fields can vary from document to document and data structure can be changed over time.

In today's article, we will look at what MongoDB is, how it works, and how to get started with CRUD data manipulation.

Install mongodb

To get started with MongoDB, let's install the MongoDB server on an Ubuntu machine by running the following commands:


sudo apt install mongodb-server
sudo apt install mongodb-clients

You can check the status of the MongoDB server:


sudo systemctl status mongodb.service
● mongodb.service - An object/document-oriented database
     Loaded: loaded (/lib/systemd/system/mongodb.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-09-21 06:28:19 CET; 1min 28s ago
       Docs: man:mongod(1)
   Main PID: 52111 (mongod)
      Tasks: 23 (limit: 5690)
     Memory: 42.0M
     CGroup: /system.slice/mongodb.service
             └─52111 /usr/bin/mongod --unixSocketPrefix=/run/mongodb --config /etc/mongodb.conf

Sep 21 06:28:19 ubuntu21 systemd[1]: Started An object/document-oriented database.

If it is not running, you can start it with this command:


sudo systemctl start mongodb.service

To launch the mongo shell, just type mongo in your terminal:


mongo
MongoDB shell version v3.6.8
connecting to: mongodb://127.0.0.1:27017
Implicit session: session { "id" : UUID("a7d93cba-af1f-40ba-919e-968ebee3e28a") }
MongoDB server version: 3.6.8
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
        http://docs.mongodb.org/
Questions? Try the support group
        http://groups.google.com/group/mongodb-user
Server has startup warnings:
2022-09-21T06:28:19.539+0100 I STORAGE  [initandlisten]
2022-09-21T06:28:19.539+0100 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2022-09-21T06:28:19.539+0100 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2022-09-21T06:28:20.483+0100 I CONTROL  [initandlisten]
2022-09-21T06:28:20.483+0100 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2022-09-21T06:28:20.483+0100 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2022-09-21T06:28:20.483+0100 I CONTROL  [initandlisten]

>

Before we dive deeper into mongo data, let's take a quick look at some related concepts and terminology.

Terminology

Database

A database is a namespace on the MongoDB server that is uniquely identified by its name. It is very similar to a schema on the relational database side. At the database level we can set things like security, authorization, and permissions. A MongoDB server can host multiple databases.

The db variable refers to the currently connected database in the mongo shell. The output below shows that the current database is 'test':


db
test

To switch to another database, type use followed by the name of the database you want to change to:


use admin
switched to db admin

To list the databases in your MongoDB instance, type:


show dbs
admin      0.000GB
config     0.000GB
local      0.000GB

The test database doesn't show up here because it doesn't have any collections yet.

On a fresh installation, the MongoDB server instance contains 3 databases:

admin : the administration database, holding internal system collections and the user repository for authentication and authorization data.
config : stores internal information about shards and replication in the cluster.
local : stores information related to the local instance of MongoDB.

To create a new database, just use it:


use new_database_name

If you start working with MongoDB without creating or specifying any database, the shell uses the test database by default.

Collection

A collection is analogous to a table in relational databases. It is the basic storage unit in the document store. All data manipulation and retrieval is done against collections.

A collection requires a name and accepts optional settings. Creation follows this general syntax:


db.createCollection('name', {
...     capped : [true|false],
...     size : [number],
...     max : [number]
... });

There are many other parameters you can define, but the essential ones here are:

capped: if true, the collection is capped, meaning it has a fixed size and overwrites its oldest documents once it is full
size: maximum size in bytes of a capped collection
max: maximum number of documents a capped collection can hold
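
As a rough illustration of what capped and max mean, the following plain JavaScript sketch (runnable in Node.js, not a mongo shell command) mimics the visible behavior of a capped collection: once the limit is reached, the oldest documents make room for new ones. The class name CappedSketch is hypothetical, and the real server also enforces the byte limit given by size, which this sketch ignores.

```javascript
// Hypothetical in-memory model of a capped collection's `max` option.
// This is NOT the MongoDB implementation; it only mimics the visible behavior.
class CappedSketch {
  constructor(max) {
    this.max = max; // maximum number of documents to keep
    this.docs = []; // documents kept in insertion order
  }

  insert(doc) {
    this.docs.push(doc);
    // Once the limit is exceeded, discard the oldest documents.
    while (this.docs.length > this.max) {
      this.docs.shift();
    }
  }
}

const log = new CappedSketch(3);
for (let i = 1; i <= 5; i++) {
  log.insert({ n: i });
}
console.log(log.docs.map((d) => d.n)); // [ 3, 4, 5 ]
```

Documents 1 and 2 are gone because they were the oldest when documents 4 and 5 arrived.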

Sharding

Collections can be sharded or unsharded. Sharding is a process that stores portions of a collection on multiple instances of a cluster.

Sharding enables horizontal scaling across hundreds or even thousands of instances for large datasets, by dividing a large collection into two or more parts and storing each part on a different instance of the cluster. These parts are called shards. To end users, it still appears as a single collection.

As we are using a single localhost instance, we will not demonstrate it here, but you get the idea.
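
Even without a cluster, the routing idea can be sketched in a few lines of plain JavaScript (Node.js). This is only a toy model that assumes a hashed shard key: simpleHash and pickShard are hypothetical helpers, and the hashing and balancing that MongoDB really performs are more involved.

```javascript
// Toy model of hashed sharding: each document is routed to one shard
// based on a hash of its shard key. Not MongoDB's actual algorithm.
function simpleHash(text) {
  let h = 0;
  for (const ch of text) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep it an unsigned 32-bit value
  }
  return h;
}

function pickShard(shardKeyValue, shardCount) {
  return simpleHash(String(shardKeyValue)) % shardCount;
}

const shards = [[], [], []]; // three hypothetical shards
const titles = [
  "What is a Data Warehouse ?",
  "How To Install MySQL 8 on Ubuntu 20.04",
  "External Vs Managed Tables in Hive ",
];

for (const title of titles) {
  shards[pickShard(title, shards.length)].push(title);
}

// Every title lands on exactly one shard, and the same title always
// maps to the same shard.
console.log(shards.map((s) => s.length));
```

Because the shard is computed from the key alone, a reader can locate a document without asking every shard.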

Replication

Shards can be replicated. Replication is how MongoDB creates multiple copies of the same data for redundancy, which enables fault tolerance and high availability, and can improve read performance. It achieves that with a primary/secondary strategy built on replica sets.

A replica set is a group of instances that maintain the same copy of the documents. One member of the replica set is the primary while the others are secondaries. If the primary fails, a failover takes place: one of the secondary members is elected to be the new primary of the replica set.

The primary receives all writes, and the changes are then replicated to all secondary members.

Documents

A document is analogous to a record in a relational database. Documents are written and read as JSON, and stored as BSON (a binary representation of JSON documents that supports more data types than JSON). This format is optimized for better performance.

Documents in MongoDB are not constrained to share the same structure or fields, but each must contain at least an _id field, which MongoDB uses to uniquely identify documents in sharded and unsharded collections. If the _id field is not explicitly defined by the user, it is generated automatically. Its value is immutable, and it can be of any type except array.
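
An auto-generated _id is an ObjectId, and its first 4 bytes encode the creation time in seconds since the Unix epoch. As a small sketch in plain JavaScript (runnable in Node.js, outside the mongo shell), the helper below decodes that timestamp from the hexadecimal string of an ObjectId that appears later in this article. The name objectIdTimestamp is made up for this example.

```javascript
// Decode the creation time embedded in an ObjectId's hex string.
// The first 8 hex characters (4 bytes) are seconds since the Unix epoch.
function objectIdTimestamp(hex) {
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdTimestamp("632aa2b311c7036a282eb2c2");
console.log(created.toISOString()); // 2022-09-21T05:35:47.000Z
```

The decoded time is in UTC, a few minutes after the server start shown in the status output above (06:28:19 CET).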

Although MongoDB supports defining a schema up front, with constraints that document writes are validated against, that modeling step is entirely optional.

Field

A document is a collection of fields. A field is a key/value pair, just like in any JSON document, hash map, or dictionary. A key is a string, while a value can be of a simple type or a complex type.
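
Since the mongo shell speaks JavaScript, a document looks just like a JavaScript object. The following plain Node.js sketch walks over the fields of the article document used later in this post and reports whether each value is simple or complex. The helper name describeFields is hypothetical.

```javascript
// A document is a set of key/value pairs; values may be simple or complex.
const article = {
  name: "mongodb quick intro", // simple value: string
  category: "database", // simple value: string
  tags: ["nosql", "db", "bigdata"], // complex value: array
};

function describeFields(doc) {
  return Object.entries(doc).map(([key, value]) => {
    const kind = Array.isArray(value) ? "array" : typeof value;
    return `${key}: ${kind}`;
  });
}

console.log(describeFields(article));
// [ 'name: string', 'category: string', 'tags: array' ]
```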

GridFS

The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use an excessive amount of RAM or, during transmission, an excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.

GridFS is a convention for storing large binary files in MongoDB. It is fast enough to serve read/write operations on these large files, and it provides a storage method that is well suited to large objects and even to streaming use cases.
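
Under the hood, GridFS stores a large file as a series of smaller chunk documents (255 kB each by default) plus one metadata document. The sketch below, in plain JavaScript for Node.js, only illustrates the chunking idea; splitIntoChunks is a hypothetical helper and not part of the GridFS API.

```javascript
// Illustrative chunking only; the real GridFS API handles this for you.
const CHUNK_SIZE = 255 * 1024; // GridFS default chunk size in bytes

function splitIntoChunks(bytes, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < bytes.length; offset += chunkSize, n++) {
    // Each chunk remembers its sequence number so the file can be rebuilt.
    chunks.push({ n, data: bytes.subarray(offset, offset + chunkSize) });
  }
  return chunks;
}

const file = new Uint8Array(600 * 1024); // a 600 KiB dummy file
const chunks = splitIntoChunks(file);
console.log(chunks.length); // 3
console.log(chunks[2].data.length); // 92160 (the last chunk is partial)
```

Each chunk stays far below the 16 megabyte document limit, whatever the size of the original file.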

CRUD Operations

Now that we are familiar with the terms and concepts related to MongoDB, let's get our hands dirty with some examples.

create database

As we saw earlier, to create a database just use it. In our case we will create a 'blogs' database:


use blogs
switched to db blogs

MongoDB, like many NoSQL databases, is built around schema-on-read behavior (schema-less), which means we can insert documents without defining their types and structure beforehand. In our case, however, we will create the collection explicitly.

So, let's create a collection named articles:


db.createCollection('articles', {
...     capped : false,
...     size : 200000,
...     max : 1000
... });
    { "ok" : 1 }

list collections:

To check that our 'articles' collection has been created, we can run the following command:


show collections;
articles

If you don't explicitly create a collection before inserting documents, MongoDB will create one for you.

Insert documents

So let's insert an article (document) into our articles collection:


db.articles.insertOne({
...     name:"mongodb quick intro",
...     category:"database",
...     tags:[
...         "nosql",
...         "db",
...         "bigdata"
...     ]
... })
{
        "acknowledged" : true,
        "insertedId" : ObjectId("632aa2b311c7036a282eb2c2")
}

To list the document we just inserted, use find() like this:


db.articles.find()
{ "_id" : ObjectId("632aa2b311c7036a282eb2c2"), "name" : "mongodb quick intro", "category" : "database", "tags" : [ "nosql", "db", "bigdata" ] }

To make it more human-readable, use the pretty method:


db.articles.find().pretty()
{
        "_id" : ObjectId("632aa2b311c7036a282eb2c2"),
        "name" : "mongodb quick intro",
        "category" : "database",
        "tags" : [
                "nosql",
                "db",
                "bigdata"
        ]
}

You can insert many articles at once as well. Just wrap them in a list like this:


db.articles.insertMany([
... {title:"What is a Data Warehouse ?",category:"Big Data",url:"https://en.ayaditahar.com/post/1", tags:["data","warehouse","business"]},
... {title:"How To Install MySQL 8 on Ubuntu 20.04",category:"database",url:"https://en.ayaditahar.com/post/2", tags:["mysql","relational","ubuntu"]},
... {title:"External Vs Managed Tables in Hive ",category:"Big Data",url:"https://en.ayaditahar.com/post/3",tags:["external","tables","hive", "manage"]}]
... );
{
        "acknowledged" : true,
        "insertedIds" : [
                ObjectId("632aa37d11c7036a282eb2c3"),
                ObjectId("632aa37d11c7036a282eb2c4"),
                ObjectId("632aa37d11c7036a282eb2c5")
        ]
}

To count the number of documents in our articles collection:


db.articles.count()
4

Find Documents

As we just saw, we can use find to list all documents:


db.articles.find().pretty()
{
        "_id" : ObjectId("632aa2b311c7036a282eb2c2"),
        "name" : "mongodb quick intro",
        "category" : "database",
        "tags" : [
                "nosql",
                "db",
                "bigdata"
        ]
}
{
        "_id" : ObjectId("632aa37d11c7036a282eb2c3"),
        "title" : "What is a Data Warehouse ?",
        "category" : "Big Data",
        "url" : "https://en.ayaditahar.com/post/1",
        "tags" : [
                "data",
                "warehouse",
                "business"
        ]
}
{
        "_id" : ObjectId("632aa37d11c7036a282eb2c4"),
        "title" : "How To Install MySQL 8 on Ubuntu 20.04",
        "category" : "database",
        "url" : "https://en.ayaditahar.com/post/2",
        "tags" : [
                "mysql",
                "relational",
                "ubuntu"
        ]
}
{
        "_id" : ObjectId("632aa37d11c7036a282eb2c5"),
        "title" : "External Vs Managed Tables in Hive ",
        "category" : "Big Data",
        "url" : "https://en.ayaditahar.com/post/3",
        "tags" : [
                "external",
                "tables",
                "hive",
                "manage"
        ]
}

As you can see, the first document does not have the same structure and fields as the others, which is one of the features that make NoSQL databases like MongoDB flexible.

We can also limit the returned results. Let's say we want to get only 2 documents and list them in human-readable format:


db.articles.find().limit(2).pretty()
{
        "_id" : ObjectId("632aa2b311c7036a282eb2c2"),
        "name" : "mongodb quick intro",
        "category" : "database",
        "tags" : [
                "nosql",
                "db",
                "bigdata"
        ]
}
{
        "_id" : ObjectId("632aa37d11c7036a282eb2c3"),
        "title" : "What is a Data Warehouse ?",
        "category" : "Big Data",
        "url" : "https://en.ayaditahar.com/post/1",
        "tags" : [
                "data",
                "warehouse",
                "business"
        ]
}

You can also fetch a single document, which is useful when you deal with big collections:


db.articles.findOne()
{
        "_id" : ObjectId("632aa2b311c7036a282eb2c2"),
        "name" : "mongodb quick intro",
        "category" : "database",
        "tags" : [
                "nosql",
                "db",
                "bigdata"
        ]
}

find by field

If you want, you can select only the title field from the returned documents and suppress the _id field (to get the _id as well, replace 0 with 1). Note that a document without a title field comes back as an empty document:


db.articles.find({}, {title:1, _id:0})
{  }
{ "title" : "What is a Data Warehouse ?" }
{ "title" : "How To Install MySQL 8 on Ubuntu 20.04" }
{ "title" : "External Vs Managed Tables in Hive " }
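
To see why the first result is empty, the projection can be modeled in plain JavaScript (Node.js). This is a simplified sketch: includeFields is a hypothetical helper that only handles top-level fields, and the _id values are shortened placeholders.

```javascript
// Simplified model of an inclusion projection with _id suppressed:
// keep only the listed fields that actually exist in the document.
function includeFields(doc, fields) {
  const out = {};
  for (const field of fields) {
    if (Object.prototype.hasOwnProperty.call(doc, field)) {
      out[field] = doc[field];
    }
  }
  return out;
}

const docs = [
  { _id: 1, name: "mongodb quick intro", category: "database" },
  { _id: 2, title: "What is a Data Warehouse ?", category: "Big Data" },
];

const projected = docs.map((d) => includeFields(d, ["title"]));
console.log(projected);
// [ {}, { title: 'What is a Data Warehouse ?' } ]
```

The first document only has a name field, so nothing survives the projection.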

If you are looking for a specific document, you need to specify the full value of a field (otherwise you will get no results). For instance, let's find a document with a specific title:


db.articles.find({"title" : "What is a Data Warehouse ?"}, {})
{ "_id" : ObjectId("632aa37d11c7036a282eb2c3"), "title" : "What is a Data Warehouse ?", "category" : "Big Data", "url" : "https://en.ayaditahar.com/post/1", "tags" : [ "data", "warehouse", "business" ] }

Again, if you want only some specific fields, list them in the second pair of curly brackets with the value 1 next to each field, like this:


db.articles.find({"title" : "What is a Data Warehouse ?"}, {title:1, url:1})
{ "_id" : ObjectId("632aa37d11c7036a282eb2c3"), "title" : "What is a Data Warehouse ?", "url" : "https://en.ayaditahar.com/post/1" }

Or you can filter based on some values. Here we are looking for documents that belong to a specific category (database):


db.articles.find({"category" : "database"}, {})
{ "_id" : ObjectId("632aa2b311c7036a282eb2c2"), "name" : "mongodb quick intro", "category" : "database", "tags" : [ "nosql", "db", "bigdata" ] }
{ "_id" : ObjectId("632aa37d11c7036a282eb2c4"), "title" : "How To Install MySQL 8 on Ubuntu 20.04", "category" : "database", "url" : "https://en.ayaditahar.com/post/2", "tags" : [ "mysql", "relational", "ubuntu" ] }

If you try to search for a single word from the title, nothing is returned, because this filter only matches documents whose title is exactly "Hive":


db.articles.find({"title" : "Hive"}, {})

To search for words inside a field, one option is to create a text index on that field.
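
The difference between matching a whole value and matching a word inside it can be sketched in plain JavaScript (Node.js). This only models the comparison logic on an in-memory list of titles; it is not how MongoDB evaluates queries or text searches internally.

```javascript
// Equality compares the whole value; a word search looks inside it.
const titles = [
  "What is a Data Warehouse ?",
  "How To Install MySQL 8 on Ubuntu 20.04",
  "External Vs Managed Tables in Hive ",
];

// Like {"title": "Hive"}: the whole title must equal "Hive".
const exact = titles.filter((t) => t === "Hive");

// A word search: split the title into lowercase words and look for "hive".
const containsWord = titles.filter((t) =>
  t.toLowerCase().split(/\s+/).includes("hive")
);

console.log(exact.length); // 0
console.log(containsWord.length); // 1
```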

Indexes

Before creating any index, let's see which indexes already exist on our collection:


db.articles.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "test.articles"
        }
]

As we can see, there is one index on our articles collection, on the _id field.

To understand the impact of indexes, run the next line, which shows the execution plan:


db.articles.find({"title" : "hive"}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "test.articles",
                "indexFilterSet" : false,
                "parsedQuery" : {
                        "title" : {
                                "$eq" : "hive"
                        }
                },
                "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                                "title" : {
                                        "$eq" : "hive"
                                }
                        },
                        "direction" : "forward"
                },
                "rejectedPlans" : [ ]
        },
        "serverInfo" : {
                "host" : "ubuntu21",
                "port" : 27017,
                "version" : "3.6.8",
                "gitVersion" : "8e540c0b6db93ce994cc548f000900bdc740f80a"
        },
        "ok" : 1
}

The winning plan has the stage COLLSCAN, which means that MongoDB found no usable index on that field and falls back to scanning the whole collection. Imagine you have a big collection of documents: such a scan takes a long time before it returns a result, and MongoDB is not built for that kind of query. One way around that is to use indexes.
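
To get a feel for why a full scan is costly, here is a small plain JavaScript sketch (Node.js) that compares scanning every document with a direct lookup. The Map only stands in for an index as an assumption for illustration (MongoDB indexes are B-tree structures), and collectionScan and buildIndex are hypothetical helpers.

```javascript
// Without an index, every document has to be examined.
function collectionScan(docs, title) {
  let examined = 0;
  const matches = [];
  for (const doc of docs) {
    examined++;
    if (doc.title === title) matches.push(doc);
  }
  return { matches, examined };
}

// A Map from title to document plays the role of an index here.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) index.set(doc.title, doc);
  return index;
}

const docs = Array.from({ length: 10000 }, (_, i) => ({
  _id: i,
  title: `post ${i}`,
}));

const scan = collectionScan(docs, "post 9999");
console.log(scan.examined); // 10000 documents examined for 1 match

const index = buildIndex(docs);
console.log(index.get("post 9999")._id); // 9999, found with a single lookup
```

The index costs extra memory and has to be kept up to date on writes, which is the usual trade-off for faster reads.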

To create an index, you have to specify which fields it covers. Here we use a compound index on 2 fields, title and tags, with the text type for full-text search, and we give title a higher weight (10) than tags (5) when scoring results:


db.articles.createIndex({
...     'tags' : 'text',
...     'title' : 'text'
...     },
...     {
...         'weights' : {
...         'tags' : 5,
...         'title' : 10
...     },
...     'name' : 'tags_title_idx'
... })
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}

If we list our indexes again, we should see the newly created index (each index has a name, stored in its name field):


db.articles.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "test.articles"
        },
        {
                "v" : 2,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "name" : "tags_title_idx",
                "ns" : "test.articles",
                "weights" : {
                        "tags" : 5,
                        "title" : 10
                },
                "default_language" : "english",
                "language_override" : "language",
                "textIndexVersion" : 3
        }
]

Now that the index is created, we can search documents for specific words. It should return results as expected:


db.articles.find( { $text: { $search: "hive table" } }, {tags:1, title:1} )
{ "_id" : ObjectId("632aa37d11c7036a282eb2c5"), "title" : "External Vs Managed Tables in Hive ", "tags" : [ "external", "tables", "hive", "manage" ] }


db.articles.find( { $text: { $search: "20.04" } }, {tags:1, title:1} )
{ "_id" : ObjectId("632aa37d11c7036a282eb2c4"), "title" : "How To Install MySQL 8 on Ubuntu 20.04", "tags" : [ "mysql", "relational", "ubuntu" ] }

Update Documents

Whenever you want to update a document in a collection, you have to specify two parts: the query part and the update part.

As you may have noticed, our first document doesn't contain a 'title' field but a 'name' field instead, which makes our data inconsistent, and that is something we don't want. To correct that, we rename the "name" field to "title":


db.articles.find({"name" : "mongodb quick intro"}, {})
{ "_id" : ObjectId("632aa2b311c7036a282eb2c2"), "name" : "mongodb quick intro", "category" : "database", "tags" : [ "nosql", "db", "bigdata" ] }


db.articles.updateOne({"name" : "mongodb quick intro"},{ $rename: { "name": "title" } })
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }


db.articles.find({"title" : "mongodb quick intro"}, {})
{ "_id" : ObjectId("632aa2b311c7036a282eb2c2"), "category" : "database", "tags" : [ "nosql", "db", "bigdata" ], "title" : "mongodb quick intro" }

The renamed field moves to the end of the document, but that doesn't matter much in MongoDB.
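
The same effect can be reproduced with a plain JavaScript object (Node.js), which also keeps its keys in insertion order. This is only a model of the visible result, assuming a top-level field; renameField is a hypothetical helper, not the $rename implementation.

```javascript
// Renaming means adding the value under the new key and removing the
// old key, so the new key ends up last.
function renameField(doc, from, to) {
  const copy = { ...doc };
  if (from in copy) {
    copy[to] = copy[from];
    delete copy[from];
  }
  return copy;
}

const before = {
  _id: "632aa2b311c7036a282eb2c2",
  name: "mongodb quick intro",
  category: "database",
  tags: ["nosql", "db", "bigdata"],
};

const after = renameField(before, "name", "title");
console.log(Object.keys(after)); // [ '_id', 'category', 'tags', 'title' ]
```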

As another example, let's update the first article we inserted earlier: set a 'url' on it, and change its title to a new one:


db.articles.updateOne({"title" : "mongodb quick intro"}, {$set : {"url" : "https://en.ayaditahar.com/post/10"}})
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
db.articles.updateOne({"title" : "mongodb quick intro"}, {$set : {"title" : "First taste of Document Databases in MongoDB"}})
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }

Now we can check our article document to see whether it was updated successfully:


db.articles.find({"title" : "First taste of Document Databases in MongoDB"})
{ "_id" : ObjectId("632aa2b311c7036a282eb2c2"), "category" : "database", "tags" : [ "nosql", "db", "bigdata" ], "title" : "First taste of Document Databases in MongoDB", "url" : "https://en.ayaditahar.com/post/10" }


db.articles.find({})
{ "_id" : ObjectId("632aa2b311c7036a282eb2c2"), "category" : "database", "tags" : [ "nosql", "db", "bigdata" ], "title" : "First taste of Document Databases in MongoDB", "url" : "https://en.ayaditahar.com/post/10" }
{ "_id" : ObjectId("632aa37d11c7036a282eb2c3"), "title" : "What is a Data Warehouse ?", "category" : "Big Data", "url" : "https://en.ayaditahar.com/post/1", "tags" : [ "data", "warehouse", "business" ] }
{ "_id" : ObjectId("632aa37d11c7036a282eb2c4"), "title" : "How To Install MySQL 8 on Ubuntu 20.04", "category" : "database", "url" : "https://en.ayaditahar.com/post/2", "tags" : [ "mysql", "relational", "ubuntu" ] }
{ "_id" : ObjectId("632aa37d11c7036a282eb2c5"), "title" : "External Vs Managed Tables in Hive ", "category" : "Big Data", "url" : "https://en.ayaditahar.com/post/3", "tags" : [ "external", "tables", "hive", "manage" ] }
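
The two $set updates above behave like merging new key/value pairs into the document: a missing field is added and an existing one is replaced. Here is a minimal plain JavaScript sketch (Node.js) of that idea, assuming top-level fields only. The applySet helper is hypothetical, and the real $set operator also understands dotted paths into nested documents, which this sketch ignores.

```javascript
// Model of $set on top-level fields: add missing fields, replace existing ones.
function applySet(doc, changes) {
  return { ...doc, ...changes };
}

const original = { title: "mongodb quick intro", category: "database" };

// First update: the url field does not exist yet, so it is added.
const withUrl = applySet(original, {
  url: "https://en.ayaditahar.com/post/10",
});

// Second update: title exists already, so its value is replaced.
const retitled = applySet(withUrl, {
  title: "First taste of Document Databases in MongoDB",
});

console.log(retitled);
```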

Delete documents

Now, to delete an article (document), call the deleteOne() method on the collection and specify a criterion on a field as the query:


db.articles.deleteOne({'title' : "First taste of Document Databases in MongoDB"})
{ "acknowledged" : true, "deletedCount" : 1 }

You can delete multiple documents based on specific criteria. For example, to delete the documents that have category = "Big Data", we can do so as in the next snippet:


db.articles.deleteMany({category:"Big Data"})
{ "acknowledged" : true, "deletedCount" : 2 }

Because two article documents in our collection belong to the Big Data category, both got deleted.

cleanup

Now that we're done with our demonstration, we can clean up our namespace and delete all the objects we created so far.

drop index

If we decide that we don't need an index anymore, we can easily drop it by specifying its name:


db.articles.dropIndex('tags_title_idx')
{ "nIndexesWas" : 2, "ok" : 1 }

Purge documents

To empty (truncate) the "articles" collection, invoke the deleteMany method with an empty filter, as in the next snippet:


db.articles.deleteMany({})
{ "acknowledged" : true, "deletedCount" : 1 }

Because there was only one document left, it got deleted.

drop collection

Once our collection is empty, you can drop it:


db.articles.drop()
    true

drop database

Now, the only thing left is to delete the database. The next command deletes the database the shell is currently connected to:


db.dropDatabase()
{ "dropped" : "blogs", "ok" : 1 }

Conclusion

There is much more to cover in MongoDB data manipulation, but we have covered the most important basics. I hope that by the end of this article you are a little more familiar with what MongoDB is and how to perform CRUD operations: creating, retrieving, updating, and deleting documents.

Resources

MongoDB Manual
