SQL vs. NoSQL
The articles on NoSQL databases in Reuven M. Lerner's At the Forge column appearing in recent issues of LJ have been enjoyable. Because this is the Enterprise issue, I think it would be helpful to take a step back and look at the Linux database landscape and examine in particular the ongoing “battle” between SQL and NoSQL databases. By way of disclosure, I work for Monty Program, a company whose primary product is MariaDB, a community-enhanced branch of MySQL. That being said, I approached this topic with as open a mind as possible.
The rivalry between SQL and NoSQL has been building during the past year to the point where some people are predicting the end of the SQL era. Actually, the two camps are largely complementary, because they're designed to solve different problems.
Acronyms
Whenever the topic of databases arises, an alphabet soup is thrown around that would make NASA proud. Some of the acronyms I use a lot in this article include:
RDBMS: Relational Database Management System.
SQL: Structured Query Language, also used to refer to databases that use SQL as their query language.
NoSQL: used to refer to a class of databases that are non-relational and do not use SQL as their query language. They could perhaps be better called Distributed Database Management Systems (or DDBMSes), but for now, the popular term is NoSQL.
ACID: Atomicity, Consistency, Isolation, Durability (see the What Is ACID? sidebar).
CAP: Consistency, Availability, Partition tolerance (see the What Is CAP? sidebar).
So, what is the big deal about NoSQL databases? For one, they've introduced new ways (or perhaps re-introduced old ways) of thinking about what databases are and what they can do. For another, they're shiny and new, and all the cool kids seem to be using them. You could argue that Google's BigTable is the database that inspired the NoSQL movement. Or, maybe it was Amazon's S3. Both of them are closed source, but they were (or are) impressive enough to inspire open-source interpretations.
The current NoSQL field includes HBase, Cassandra, Redis, MongoDB, Voldemort, CouchDB, Dynomite, Hypertable and several others. Some have followed the model of BigTable, others follow S3's model, some are a mix of the two, and others are charting their own path. Some of these projects are more mature than others, but each of them is trying to solve similar problems.
Instead of having tables with columns and rows like you would find in a traditional RDBMS, most NoSQL databases are simple “key-value stores”. Each piece of data that goes into the database is given a key, and when you want the data back, you use the key to get it. This simplicity is beneficial, because it helps busy sites achieve extremely low latency, even under high load, when paired with a large number of servers and a fast network. The simplicity of the key-value model also simplifies development.
A step beyond simply having keys and values are the so-called document databases. A document, in this case, is a collection of various fields of information. Each individual document can have a different number of fields of varying lengths. These databases are useful if you have a lot of semi-structured data, and they are a good fit for object-oriented programming models (for example, you can consider the database as a storage area for objects).
Why do traditional database users dislike these newcomers? D. Richard Hipp, the creator of SQLite, in a talk given at my local LUG, derisively called NoSQL databases “post-modern databases”, because instead of giving you a definite answer to your question, they give you “an opinion” or their “best guess”. His purpose was to paint NoSQL databases in a bad light, and for most of the old-school database world, the NoSQL, non-relational, BASE model (see the What Is ACID? sidebar) is more than a bit heretical.
The heresy comes because historically, databases almost always have tried to implement the relational model and be fully ACID-compliant. If your transactions weren't ACID, or your database wasn't relational, the argument went, you couldn't call yourself a “real” database. Look at the MySQL vs. PostgreSQL flame wars for ample evidence of this thinking.
The problem though, is that being relational and ACID is not necessary for some use cases and can add unnecessary overhead, which you don't want if you are running a popular, heavily trafficked Web site. Many early users of MySQL knew this and were mocked for choosing MySQL over “real” databases like PostgreSQL. It is ironic now that MySQL has gained what every “expert” said it should have (ACID transactions), that a new movement has started up claiming that all the old database technology isn't actually necessary.
What is necessary for top-tier Web sites, according to proponents of NoSQL, is massive scalability, low latency, the ability to grow the capacity of your database on demand and an easier programming model. These, and others, are things which, according to them, SQL RDBMSes just don't provide in a cost-effective manner.
Most classic RDBMSes initially were designed to run on a single large server. That is how it was done in the late 1970s and early 1980s, and the idea exists in the design of many RDBMSes to this day. The P in CAP (see the What Is CAP? sidebar) is meaningless when the database is on a single server (the server is either up or down, rarely or never only partly up), and traditional RDBMSes have focused mainly on Consistency, aka ACID, with Availability thrown in if you mirror between database servers or use hardware with no single points of failure.
Some NoSQL databases also focus on the C and A parts of CAP. Unlike traditional RDBMSes though, these databases are designed from the ground up to be run on dozens, hundreds or even thousands of nodes in a single data center. Partial partition tolerance for these databases is obtained by mirroring database clusters between multiple data centers. The advantage these databases have over a traditional RDBMS is that with the work spread over all of those machines, you can achieve ultra-low latency even when there are extremely high numbers of reads and writes, and with all those machines, you can analyze massive amounts of data quickly.
Other NoSQL databases focus on the A and P parts of CAP and are designed to span multiple data centers. True to CAP, strong consistency is impossible for these databases. Weak consistency is an especially heretical thought to the RDBMS old guard. Instead, these NoSQL databases implement eventual consistency, whereby any changes are replicated to the entire database eventually, but at any given time, a single node or group of nodes may not have the latest data. Like the NoSQL databases, which focus on C and A, the focus for A and P databases is on low latency, high throughput and anything else that makes the Web site more responsive and a richer experience for users.
In addition to sometimes abandoning consistency in favor of scalability and latency, another way NoSQL databases break with tradition is in their abandonment of the relational model. To be fair, some data truly does not naturally fit the relational model. This could be because the data changes form or size often, or because the data is completely unstructured.
The final break with tradition in NoSQL databases is the thing that gave them their name. They don't use SQL. The reasons for dropping SQL usually revolve around it not fitting in with modern object-oriented development processes or some perceived difficulty in working with SQL. Sometimes the excuse given for not using SQL is a simple “SQL sucks”, which isn't really a reason. Stupid reasons aside, the SQL language was designed for use with relational databases, and NoSQL databases are mostly non-relational, so it makes sense that they don't use it.
So what about plain-old SQL RDBMSes? Should they be retired from active service? Are they a relic from an earlier time? Not so fast.
First and foremost, ACID transactions most definitely are required in certain use cases. Databases used by banks and stock markets, for example, always must give correct data. Where money is concerned, guessing is not allowed. It is true that no one really cares if your latest tweet takes a couple minutes to show up in your Twitter feed, but the same cannot be said for a billing system or accounting database.
Another thing in favor of RDBMSes is their use of SQL. It's a common language, and if you need to move from one database to another, you usually can get away with making only minor changes to your application, and it will “just work”. True, it may not be possible in all cases, depending on how you used or abused the SQL queries in your application, but the foundation for moving easily between different SQL databases is there, and the tools and libraries you can use to interact with your data are plentiful and robust. A unified NoSQL standard query language or API will never exist because every NoSQL database is so different.
On the NoSQL side, the only thing in common is that there is nothing in common. Each NoSQL database has its own set of APIs, libraries and preferred languages for interacting with the data they contain. With an RDBMS, it is trivial to get data out in whatever format you need using whatever programming language you like best. Your choice of a NoSQL database might limit you to one or a handful of programming languages and access methods.
Another thing RDBMSes have going for them is the relational model. The R in RDBMS traces its history back to research by E. F. Codd published in the June 1970 issue of Communications of the ACM. Since then, it has been expanded upon, improved and clarified. The relational model for databases is so popular because it is an excellent way to organize information. It maps very well to an enormous variety of real-world data storage needs, and when properly normalized, it is fast and efficient.
In the relational model, data is stored in tables with rows and columns. An address table, for example, might have columns for street name and number, city, postal code, state or province, and country. A name table might have columns for given names, family name, prefixes (Dr, Rev, Ms and so on) and suffixes (Jr, Sr, Esq and so on). Each row in the individual tables would represent an individual address or name.
The relational part (see the What Does Relational Mean in a Relational Database? sidebar) comes into play as you define which addresses relate to which names using a key. A key is a field (the intersection of a row and column) or combination of fields in a single row that is guaranteed to identify uniquely that particular row in the table it is in. For the address table, you might have a column for keys from the name table. You can use this key to look up just those addresses in the address table that “belong” (by virtue of the key) to a certain name in the name table.
My example is pretty simplistic, but when combined with ACID transactions in an RDBMS, you achieve tremendous power, flexibility and reliability. There is a reason that businesses began using them decades ago and why open-source RDBMSes dominate the Web.
And, what about the Web? The primary argument many people use against RDBMSes is that they “don't scale”, which simply isn't true. It is true that some individual RDBMSes do not scale very well or are harder to scale, but that doesn't mean every RDBMS cannot. RDBMSes are in use at every large company. The largest RDBMS installations routinely handle enormous traffic and petabytes of data.
This scaling myth is perpetuated and given credence every time popular Web sites announce that such-and-such RDBMS doesn't meet their needs, and so they are moving to NoSQL database X. The opinion of some in the RDBMS world is that many of these moves are not so much because the database they were using is deficient in some fundamental way, but because it was being used in a way for which it wasn't designed. To make an analogy, it's like people using flat-head screwdrivers to tighten Phillips-head screws, because it worked well enough to get the job done, but now they've discovered it is better to tighten Phillips screws with an actual Phillips screwdriver, and isn't it wonderful, and we should throw away all flat-head screwdrivers, because their time is past, and Phillips is the future.
One recent SQL-to-NoSQL move involved Digg.com moving from MySQL to Cassandra. As part of the move, Digg folks blogged about how they were using MySQL and why it didn't meet their needs. Others were skeptical. Dennis Forbes, in a series of posts on his site (see Resources), questioned whether Digg needed to use a NoSQL solution like Cassandra at all. His claims centered on what he considered very poor database usage on the part of Digg combined with inadequate hardware. In his mind, if Digg had just designed its database properly or switched to using SSDs in its servers, it would have had no problems. His best quote is this: “The way that many are using NoSQL is like discovering the buggy whip at the beginning of the automotive era.” Ouch.
Relational databases sometimes can be tricky to design properly. You have to know and understand your data deeply. But when they are designed properly, the performance can be orders of magnitude better compared to poorly designed databases. You also should not overlook the hardware on which your database runs. Databases love as much memory and processing power as you can throw at them, and the traditional spinning-platter disk drive has long been a limiting factor. Does the high performance of SSDs herald a new age of RDBMS performance? Many experts say yes. SSDs may be a game-changer in the database world.
Relational SQL databases have been around for several decades. They have proven reliability and performance and a feature set that meets the requirements of 99% of the use cases out there. They even make excellent key-value databases, if that's the type of data you have. There are only very few companies that can't make a relational database work for them. You may not like to hear it, but with the law of averages, chances are your company is not one of them.
My advice? Don't think of SQL vs. NoSQL as an either/or question. Options are a good thing. Many alternatives exist, so if you are having issues with your chosen database, experiment with different products on both sides and run your own benchmarks.
Also look into how you are using your database. If the database was “bootstrapped” while you were creating your killer application or service, and it is starting to give you problems, you might have an easily solvable design issue at the root of your troubles. If databases are not your thing, consult with an expert. RDBMSes have been around a long time, and there are plenty of experts.
Whatever you decide to do, don't think of NoSQL as your escape from the SQL RDBMS world. NoSQL databases are not a panacea. I asked my boss, Monty Widenius, the creator of MySQL, what his opinion on the whole NoSQL vs. SQL thing was. His answer: “Non-SQL gives you a very sharp knife to solve a selected set of issues. If you find SQL too hard to use, you should not try Non-SQL.”
His basic point is that if you don't understand SQL RDBMSes, you'll probably end up hurting yourself by jumping into NoSQL. Key-value stores like those found in NoSQL databases do work for certain kinds of data, but they don't work well at all for other kinds. It is instructive to point out that the companies that use and have championed NoSQL databases have not given up on SQL RDBMSes. They continue to use them in vital roles.
Finally, many of the NoSQL ideas are based on old technology. Key-value stores have been around for more than 20 years, for example. New this time around are things like map-reduce (some claim that even this is an old idea), which spread the workload over many computers. In that sense, NoSQL databases really should be called distributed-DBMSes (DDBMSes?). Basically, distributed RDBMSes, without the R.
Whatever you call them, NoSQL databases are solving problems that were considered “solved” by many in the RDBMS world a long time ago. They're just solving the problems in a different way, and they have a different set of requirements. If this new-old way solves an issue you're having, great! On the flip side, if your current RDBMS is meeting your needs, don't feel like you need to jump on the bandwagon.
What Is CAP?
The CAP Theorem, also called Brewer's Theorem, first was proposed by Eric Brewer in a July 2000 keynote at the ACM Symposium on the Principles of Distributed Computing. It was formally proved in 2002 by Seth Gilbert and Nancy Lynch of MIT. The CAP Theorem states that it is impossible for any shared-data system to guarantee simultaneously all of the following three properties: consistency, availability and partition tolerance.
Consistency in CAP is not the same as consistency in ACID (that would be too easy). According to CAP, consistency in a database means that whenever data is written, everyone who reads from the database will always see the latest version of the data. A database without strong consistency means that when the data is written, not everyone who reads from the database will see the new data right away; this is usually called eventual-consistency or weak consistency.
Availability in a database according to CAP means you always can expect the database to be there and respond whenever you query it for information. High availability usually is accomplished through large numbers of physical servers acting as a single database through sharing (splitting the data between various database nodes) and replication (storing multiple copies of each piece of data on different nodes).
Partition tolerance in a database means that the database still can be read from and written to when parts of it are completely inaccessible. Situations that would cause this include things like when the network link between a significant number of database nodes is interrupted. Partition tolerance can be achieved through some sort of mechanism whereby writes destined for unreachable nodes are sent to nodes that are still accessible. Then, when the failed nodes come back, they receive the writes they missed. In Cassandra, this is called hinted handoff. A database with good partition tolerance can span multiple data centers, whereas one with weak partition tolerance basically is bound to a single data center.
What Is ACID?
ACID is the classic measure of determining whether your database is good. A transaction in a database is a single logical operation. An example would be inserting an address or updating a phone number in an employee database. Every database provides methods to do operations like those, but ACID formalizes the process.
Atomicity means that the transaction either succeeds or fails. If the transaction fails, it should fail completely, and the database should be left in the state it was in before the transaction started.
Consistency means that the database is in a known good state both before and after the transaction.
Isolation means that transactions are independent of one another, and if two transactions are trying to modify the same data, one of them must wait for the other to finish before it can begin.
Durability means that once the transaction has completed, the changes made by the transaction will persist, even if there is a system failure. A transaction log of some sort usually is used for this purpose. In MariaDB and MySQL, this is called the binary log.
So, what is the opposite of ACID? BASE (Basically Available, Soft-state, Eventual consistency), of course. BASE is a retronymn coined by Dan Pritchett in an article in the ACM Queue magazine for describing a database that does not implement the full ACID model, with the main difference being that it is eventually consistent. The idea is that if you give up some consistency, you can gain more availability and greatly improve the scalability of your database.
What Does Relational Mean in a Relational Database?
In common usage, the relational part of Relational Database refers to (or is often assumed to refer to) the way tables are related to each other via keys. For the truly pedantic though, this is in in fact incorrect. Relational here does not refer to relationships between tables, rather it refers to the mathematical concept of a relation, which is in essence what relational databases call tables. A relational database is a database based on the relational model.
Resources
SQL Databases:
MariaDB: askmonty.org
PostgreSQL: www.postgresql.org
NoSQL Databases:
Cassandra: cassandra.apache.org
CouchDB: couchdb.apache.org
HBase: hadoop.apache.org/hbase
Redis: code.google.com/p/redis
Voldemort: project-voldemort.com
MongoDB: www.mongodb.org
Hypertable: hypertable.org
Dynomite: wiki.github.com/cliffmoon/dynomite/dynomite-framework
BigTable: labs.google.com/papers/bigtable.html
Brewer's CAP Theorem:
Brewer's CAP Theorem by Julian Browne: www.julianbrowne.com/article/viewer/brewers-cap-theorem
CAP Theorem: devblog.streamy.com/2009/08/24/cap-theorem
Towards Robust Distributed Systems by Dr Eric A. Brewer: www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services (2002) by Seth Gilbert and Nancy Lynch: citeseer.ist.psu.edu/544596.html
E. F. Codd's “A Relational Model of Data for Large Shared Data Banks”: www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
Other Links:
Dennis Forbes on Software and Technology: www.yafla.com/dforbes
Looking to the future with Cassandra by Ian Eure: about.digg.com/blog/looking-future-cassandra
NOSQL debrief by Johan Oskarsson: blog.oskarsson.nu/2009/06/nosql-debrief.html
BASE: An Acid Alternative by Dan Pritchett: queue.acm.org/detail.cfm?id=1394128
Should you go Beyond Relational Databases? by Martin Kleppmann: carsonified.com/blog/dev/should-you-go-beyond-relational-databases
NoSQL Q and A: www.dbms2.com/2009/12/11/nosql-q-and-a
NoSQL Video by Brian Aker: www.youtube.com/watch?v=LhnGarRsKnA
Daniel Bartholomew works for Monty Program as a technical writer and system administrator. He lives with his wife and children in North Carolina and often can be found hanging out on both #linuxjournal and #maria on Freenode IRC.