DISQUS

Adam @ Heroku: SQL Databases Are An Overapplied Solution (And What To Use Instead)

  • rober · 5 months ago
    Let me guess. You're about 25, never took a database course, and never read Date's (or Gray & Reuter's) book? If you care about the integrity of the data, you will have to write a TPM for whatever files you create. If you don't care about the integrity of the data (and if you don't, then why are you storing it anyway), how you store it is irrelevant.

    As to the size of the rows, poppycock. As to joins, SSD machines will reconstruct joined tables faster than retrieving your flatfiles. Hands down.

    SQL databases do scale. You just have to use one that is smarter than MySql. And you have to view data relationally, not as flat files. Flat files (xml being the worst, but not sole, offender) will always be gargantuan and highly duplicative. The need to "scale" results from the profligacy of the datastore. xml files typically contain 10 times more characters in markup as they do in data. Why is that intelligent? And, no, the markup does not make the data self-describing; you still have to write application code to: 1) parse (or use a parser, same difference) and 2) interpret the data. You get to write lots of code rather than a SQL string. If you like to type, and get paid by the keystroke, I guess it's a good deal.

    Your indictment of ActiveRecord IS NOT an indictment of relational databases, or more intelligent uses of them. Sending the "joined" data is no problem, if you know what you're doing. And there is a reason OODBMS have been a failure: they either store the method text with each object (which is a massive duplication and synchronization headache), or they store one copy of the method text and instance data relationally. In neither case is the OODBMS an intelligent response.

    Many have critiqued the other approaches described, so I'll let it go at that.
  • Jesse Hallett · 4 months ago
    Relational databases are very powerful tools. Nobody is saying otherwise. But non-relational databases can be very powerful tools too. In some use cases a non-relational database may be a better fit than a relational one. There is no one tool that solves every problem.

    To address some of your points:

    Non-relational databases are not flat files. They use various techniques that allow high performance access to semi-structured data. With Memcached, Redis, and Tokyo Tyrant that means addressing data by key. With CouchDB that means indexing data with MapReduce views. If you do some investigating you will find that the people developing these systems are smart people with lots of experience who have good, fresh ideas.

    The databases discussed here are not XML databases. I know very little about XML databases; though I know of some people who have expressed fondness for them. I just want to point out that the database systems discussed in this article are different and have different strengths and weaknesses.

    Document-oriented databases and key/value stores are not object-oriented databases. None of the databases mentioned in this article store method definitions. Like SQL databases, document-oriented databases and key/value stores store pure data.
  • Chris Lloyd · 5 months ago
    Are there any plans to offer CouchDB (or any non-relational DB for that matter) in Heroku?
  • Jesse Hallett · 5 months ago
    +1
  • Jesse Hallett · 4 months ago
    Well, Heroku runs on EC2. And the currently-in-private-beta couch.io <http://hosting.couch.io/> also runs on EC2. So if you use a database on couch.io with a Heroku app access should be pretty fast. Or you could run CouchDB or another DB on your own EC2 instance.
  • hunternield · 5 months ago
    Of interest, MongoDB provides support for Binary Assets with GridFS (http://www.mongodb.org/display/DOCS/Storing+Files) and Transient Data with Capped Collections (http://www.mongodb.org/display/DOCS/Capped+Coll...).

    It's a great solution that fits in nicely for most of the points you've outlined. Close enough to SQL to be familiar but without some of the shortcomings (schema-less, etc). Still a little immature in places but the dev team is very responsive.
  • lobo_tuerto · 5 months ago
    Hey Adam,

    Sorry if this is a truly newbie question, I'm trying to wrap my head around something like CouchDB.

    Let's take a concrete example, invoices and line items. An invoice can have many line items in it.

    With a "normal" SQL DB, you would have two tables, say: Invoices, LineItems. And we all know how that works.
    With a "document oriented" DB you would have Invoice docs, those docs, would directly contain the line items in them, right?

    Is it possible (and/or easy) to do a query to get a report on the best selling line items? how do you work that on CouchDB? since they are all contained within Invoice documents... in a SQL DB you would only run a query on the LineItems table.

    Does this question make sense? :)
  • Jesse Hallett · 5 months ago
    To calculate best selling line items with CouchDB you would create an index using line item ids as keys. You can use a map function to scan through every invoice document and emit each line item's id as a key. You can also create a reduce function to perform an aggregate operation on that index, like calculating the sum of all purchased line item quantities. Finally you pass a query to that map/reduce pair with a key parameter specifying that CouchDB should only aggregate quantities from line items with the given id.

    There are instructions online for doing this stuff:
    http://www.google.com/search?ie=UTF-8&oe=UTF-8&...
    http://sitr.us/2009/06/30/database-queries-the-...
  • CaptainAbab · 5 months ago
    I would counter-argue that this would still perform faster in a SQL database. Invoices are pretty structured unlike other types of documents (contracts, profiles, etc.).

    There are heavy reads against this type of data: your customer service front-end applications, your reporting tools, warehouses, etc. Never mind you will also need taxes for each line with tax rates for county, city, state, whatever as another one-to-many. I'm sure summing all of that up will not be as efficient as a SQL query.

    With invoices, you need reports on daily sales, by customer, by region, etc. that are constantly being updated/refreshed. Never mind slicing/dicing the quantity/price data by category/dept/material/etc. What about a HAVING clause? Your tool would require fetching every piece of data in order to do the aggregates before you can even start doing the filtering. A database is smarter in these cases.
    I don't think CouchDb is a good candidate for this type of example.
  • Jesse Hallett · 4 months ago
    Respectfully, I think that you are not quite grokking the CouchDB way yet.

    Yes, invoices are structured and can be represented nicely in SQL. But they also fit nicely in the CouchDB way. You can put invoice level data, like the shipping address, line items, taxes charged per item, and all of that stuff in a single document, thus making the entire invoice one atomic write. If you did that in SQL query performance would suck. But CouchDB views are flexible enough to expose small pieces of a document, like individual line items, separately.

    In SQL you might calculate taxes charged by joining an invoices table with a tax rates table at query time - though I would not do it that way. In CouchDB you would record the taxes charged by the county, city, state, etc. in the invoice document. No joins are necessary to get the information you need for generating statistics.

    CouchDB views act as static indexes for documents and for aggregate statistics on documents. Views are built incrementally and stored. When a query comes in CouchDB only has to rebuild the portion of the view for documents that have changed since the last query was processed. As the number of invoices in the database grows query performance in CouchDB should stay about the same. A SQL database has to re-scan all of the relevant tables from scratch every time a query is run. As the number of invoices grows SQL queries should get slower. This is why CouchDB is great at handling heavy reads.

    There are some good explanations in the CouchDB Technical Overview <http://couchdb.apache.org/docs/overview.html> of how CouchDB can perform better than you are giving it credit for.
  • Derek Haynes · 5 months ago
    Regarding metrics, another solid option is RRDTool (http://oss.oetiker.ch/rrdtool/index.en.html).
  • MDK · 5 months ago
    Please do write about the locks. That sounds very interesting.
  • Wes Felter · 5 months ago
    In addition to CouchDB and its newfangled friends, there are several mature XML document databases with sophisticated, standardized querying (XQuery).
  • nlsmith · 5 months ago
    Don't forget Persevere (http://persvr.org/). It has many capabilities similar to (Couch|Mongo)DB, and some more, including JSONQuery/JSONPath, JSON Schema, and user authentication.
  • ernesto · 5 months ago
    like which
  • Brandon · 5 months ago
    How does Heroku manage failover with RabbitMQ? It doesn't seem to have hot failover with durable queues.
  • Not Amused · 5 months ago
    idiot. go back to college, or better yet, work for a few years at a real site --not a website. lol In the mean time, you may want to delete this post.
  • sharms · 5 months ago
    @Not Amused: I work at a Fortune 500 company. We use Oracle exclusively, and even on very expensive AIX boxes we still have significant performance bottlenecks as described in the post. What exactly are you referring to?
  • Eivind Uggedal · 5 months ago
    Good overview. Regarding Memcaced and Redis for transient data (expiration and INCR): I recently implemented expiry in Tokyo Tyrant through its Lua extension. INCR is already supported in Tokyo Tyrant out of the box.

    This solution has its disadvantages: one have to manually run a compaction to remove expired data which has not been queried (an expired item is deleted when you try to get it). The advantage is that I can rely on the persistence store I already use (Tokyo Tyrant, mainly because of its multi-master replication and lower memory requirements compared to Redis) and not introduce another daemon and client libraries.

    I should probably abstract this from my app and put it up on Github/Bitbucket.
  • Gerard Banasig · 5 months ago
    Does Document Records support aggregate functions (sum,avg,count)?
  • Jesse Hallett · 4 months ago
    Yes, for example MongoDB and CouchDB support this. With CouchDB you would use the Reduce portion of the MapReduce pattern to make aggregate calculations.
  • Rich LaMarche · 5 months ago
    As far as the Document Database examples, I'm ok with the first example but can't get my head around the practicality of the second. Why would you ever want to store Joe Smith's address with his order? What happens when Joe Smith moves (a few times over the years)? What if it is Joan Smith and she get married and become Joan Jones? Normalization does provide real value and the concept has extended out into the concept of keeping your entire application DRY. I can understand a document db for "documents", i.e., legal documents, engineering documents, emails, etc. But so see the practicality (at least yet) in other applications.
  • Mike Sherrill · 5 months ago
    Later, when there's a dispute or a question, you need to know where you shipped the order and where you sent the bill.

    You're talking about "address Joe Smith told us to ship the widgets to on this date", and "address Joe Smith told us to send the bill to on this date". Semantically, those addresses are different from "current mailing address", even if the values are identical.

    That might involve storing the addresses themselves with the orders, but it might not. There's more than one way to implement that kind of requirement with a SQL database.
  • OO guy · 5 months ago
    "In a SQL database, you would break data like into multiple tables connected by relationships and retrieved with joins. But in practice, and the app almost always wants to retrieve the entire record all at one, and write the entire record all at once."

    I'm with you so far.

    "For document records of this nature, what you want is a document database."

    Wait, where did these "documents" come from? Why not an object database, since I'm writing in an object-oriented programming language and want to store my objects in a database?
  • MDK · 5 months ago
    I very much agree with your points and I'm waiting for the first project to come where I can "for real" use a non-relational database.

    The part which I don't understand yet is how to solve transactional problems with CouchDB or MongoDB. I don't mean complex transactional issues (like finance or banking) but even simple ones.

    For example: let's say I have (in db) a collection of 100 'prizes'. Those 'prizes' are to be given to first 100 users who log into my website. So essentially - each time a user logs in, I need to take one prize from the collection (if any still available) and assign it to the user. I'm not sure what's the correct way of doing this atomically in a non-relational DB (like MongoDB) to ensure that non prize is given twice.
  • Jesse Hallett · 4 months ago
    I am more familiar with CouchDB, so I will give you a CouchDB answer.

    There was a similar question posed on Stack Overflow with some good answers <http://stackoverflow.com/questions/299723/can-i...>.

    For your prizes example I recommend creating a separate document for each prize. Initially the prizes are unclaimed. When a user logs into your website get an unclaimed prize from the database and update it with the user as the winner. In the likely event that another thread was doing the same thing at the same time and put a claim on the same prize first, the other thread will get a failure response when pushing the updated prize document because of CouchDB's MVCC feature <http://couchdb.apache.org/docs/overview.html>. In that case it should try again with another unclaimed prize and keep trying until it successfully marks a prize as claimed, or until there are no more unclaimed prizes.
  • dandv · 5 months ago
    Excellent article! I'd go one step further and claim that all databases are misguided.
  • Shagnasty · 5 months ago
    Something you need to learn about is readability. Small, off-white text on a black background is hard to read. So is magenta on black.

    Why do you think most books are black print on white? And newspapers?
  • vasia · 5 months ago
    >Why do you think most books are black print on white? And newspapers?
    Because of cost of paint
  • Javan · 5 months ago
    How would you use Memcached's INCR to store metrics like, say, profile views? If memcached dies, you've lost all that data.
  • Jesse Hallett · 4 months ago
    If you want persistence, don't use Memcached. Redis is similar, also supports INCR, and saves data to disk. Tokyo Tyrant is another good option.

    If you don't care about persistence, then Memcached can be a great choice.
  • dvh · 5 months ago
    for my part, i would ALWAYS prefer an e-commerce shop which runs on a plain old traditional relational sql-db...
    be careful what you say about data integrity or joe celko will come after you
  • Nathan Fiedler · 5 months ago
    It would be helpful if you would provide additional information backing up your statements. In particular, why is a distributed file system not a good solution for binary assets? It seems that you're assuming a particular use case and thus assume GFS and its ilk are unsuitable. You mention Tokyo Tyrant but do not give any reason why it is advantageous. Yes, I can go read about it to learn more, but you've given me little reason to do so.

    Links to the products mentioned would also be helpful. As it is, I have to search for them on the web. I'm a developer, I'm lazy, please make it easier for me.

    In any case, thank you. This was worth taking the time to read.
  • JMO_DE · 5 months ago
    Maybe I'm just not well informed enough, but arent there security problems with JSON? If you're building a web application to display the menus available in the company cafeteria, JSON would be a good solution. If you are building a web application that will transfer confidential information or perform some type of transactions, I wouldn't suggest use of JSON. I am also not sure that using JSON in the safest possble safe way, as described at http://encosia.com/2009/07/07/improving-jquery-... , won't eliminate the performance advantage you're expecting it to have in comparison to using SQL.