Choosing A Database
December 13, 2020
I have had the privilege of working on a diverse set of data stores in the last few years. The problems we were solving were complex, and often times required creative solutions that were not be readily available in the wild. As a result of this, and the research that needed to be done, I learnt a great deal about designing data intensive applications. There is still a lot to be learned, but while I continue to explore, I figured it might be useful to start writing down a few interesting things I have came across along the way. If for nothing else, to help keep notes for myself
Relational Databases Are Often Good Enough
Despite the influx of NoSQL solutions, relational databases are here to stay. In the recent years we have seen some highly concurrent, and highly performant, solutions come out in the market (namely dynamodb & cassandra, being the two usual favorites) - yet a lot of engineers continue to prefer MySQL or postgresql over some of the NoSQL alternatives. There a lot of good reasons to use relational databases and a non exhaustive list would look like this
They are ACID compliant - often times business requirements dictate a transaction on the database be readily available for read and ACID compliance gurantees you that. For example in systems where it is absolutely critical that every write is immediately available to be read for corresponding actions, ACID compliance guarantees are very valuable.
They provide data validation at the schema layer (this may be a controversial opinion, but I believe schemas are good, they let you enforce data integrity at the data level).
Most SQL databases are very mature with advanced querying mechanisms, allowing you to write diverse queries ranging from data analysis to quick reads for application.
Queries are mostly very fast with a very solid indexing support. You can always argue for a certain degree of de-normalization in favor of high performance. I said mostly, because you can always write a query that isn’t meant to be run on a SQL database for example; searching for a text string through wildcards!!!
That said there are obviously good reasons to not use relational databases.
So What Would Be A Good Reason To Not Use It?
So what would be a good reason to not use a relational database. To be honest, if you are building a prototype, or a small scale project you should feel free to use whatever database you are most interested in. That said, there are some really good reasons to avoid SQL databases.
The reason number one to avoid relational databases is when you want to allow your applications to scale without theoretical limits. Relational databases being ACID compliant can not scale horizontally. Sure there are solutions like aurora with read-only clusters, but that really is working around SQL to make it work. Scaling them requires adding more CPU or memory, or both. For example say you want to build a server-less application, something on AWS lambdas perhaps. To benefit from the true scalability promise of server-less applications, we would want these lambdas to interact with a highly concurrent database like Dynamo, instead of RDS, to give lambdas room to scale without worries. With RDS you will eventually run out of connections and might need to upgrade your instance (there is always an upper limit to number of connections a relational data store can allow) - a problem Dynamo wouldn’t have with easy provisioning of read and write units.
You have non-relational data to store. I agree, this is a vague statement - so to be more specific I will talk about some specific use cases later.
You are not sure about your data models yet and want to allow your application’s schema to evolve. NoSQL gives you that flexibility, since you do not have to define a database schema before hand. Perfect example for such a database is DynamoDB, that only asks for a primary key, every other field can be dynamically added or removed.
Your business case does not mind eventual consistency. NoSQL databases allow horizontal scaling, that comes at a cost of ACID compliance. Unlike SQL databases where once transaction is committed it is available immediately in the corresponding reads, NoSQL leans more towards the concept of eventual consistency. Your writes may not be available immediately as they might not have been commited to one of the many available nodes. For example if you are writing a new post on a blog and your underlying database is a NoSQL one that favors eventual consistency, it may not be immediately available to all users as it may not have been committed to all the nodes. In reality these databases use some sophisticated algorithms to be as consistent as possible so the lag noticed may not be that high.
Let’s talk about some common use cases for NoSQL
Building A Search Engine
You want to build a search engine of some sort. As an example take a search engine of products. A system that allows you to search for products through any keyword, be it name, description, meta-data, or even ingredients. Building backend data store of such a search engine in SQL would be unnecessarily complicated, specially when document stores exist that promise better performance and features (e.g. from lucene index backed Solr, or Elastic Search). In such cases you can benefit by using a document store. I would prefer Elastic Search or Solr in this case, but you can make a case for Mongo (or other stores too)
Event Sourcing
You are building a data store of events, that can be replayed to get current state of data. You can use a relatively cheap data store like S3.
Social Networks
You are building a social network, and would have to represent user data as a network, you are better off using a graph database, perhaps something like Neo4J
Caching Or Locking?
Some sort of locking mechanism: you can use redis lock to maintain a shared lock between distributed systems. For example if you get user click data from multiple devices, in millions per second - you might have 100s of instances operating on that data and would want to ensure some idempotency through locking, you can notify other instances of a lock on a device through Redis. Redis is an in-memory database, and is blazingly fast. It is also widely used as caching layer for applications.
Hybrid Approach?
A lot of times, SQL and NoSQL are used in a hybrid setting, to reap benefits off of both. For example, you can have a SQL database, and an Elastic Search index. Every time your application writes to the SQL database, it can also index the new document into Elastic Search. A common use case could be a online shop, that writes your purchases immediately to its SQL databases but also wants this data to be available for its ads engine. You can have a separate process that picks up every purchase event, anonymise user data and index it into something like Elastic Search, making it eventually searchable on ADs dashboard as an anonymous event for anyone interested.
Conclusion
This was a non-exhaustive list based on my own personal experience. People with more in-depth knowledge will have a lot more to say on each of these databases (and topics) but if you were looking for a quick and dirty guide, hopefully this was able to provide it (and hopefully more?)