Original page: http://adam.blog.heroku.com/past/2009/7/8/sql_databases_are_an_overapplied_solution_and_what_to_use_instead/
As to the size of the rows, poppycock. As to joins, SSD machines will reconstruct joined tables faster than retrieving your flatfiles. Hands down.
SQL databases do scale. You just have to use one that is smarter than MySQL. And you have to view data relationally, not as flat files. Flat files (XML being the worst, but not sole, offender) will always be gargantuan and highly duplicative. The need to "scale" results from the profligacy of the datastore. XML files typically contain 10 times as many characters in markup as in data. Why is that intelligent? And, no, the markup does not make the data self-describing; you still have to write application code to: 1) parse (or use a parser, same difference) and 2) interpret the data. You get to write lots of code rather than a SQL string. If you like to type, and get paid by the keystroke, I guess it's a good deal.
Your indictment of ActiveRecord IS NOT an indictment of relational databases, or more intelligent uses of them. Sending the "joined" data is no problem, if you know what you're doing. And there is a reason OODBMS have been a failure: they either store the method text with each object (which is a massive duplication and synchronization headache), or they store one copy of the method text and instance data relationally. In neither case is the OODBMS an intelligent response.
Many have critiqued the other approaches described, so I'll let it go at that.
To address some of your points:
Non-relational databases are not flat files. They use various techniques that allow high performance access to semi-structured data. With Memcached, Redis, and Tokyo Tyrant that means addressing data by key. With CouchDB that means indexing data with MapReduce views. If you do some investigating you will find that the people developing these systems are smart people with lots of experience who have good, fresh ideas.
The databases discussed here are not XML databases. I know very little about XML databases, though I know of some people who have expressed fondness for them. I just want to point out that the database systems discussed in this article are different and have different strengths and weaknesses.
Document-oriented databases and key/value stores are not object-oriented databases. None of the databases mentioned in this article store method definitions. Like SQL databases, document-oriented databases and key/value stores store pure data.
It's a great solution that fits in nicely for most of the points you've outlined. Close enough to SQL to be familiar but without some of the shortcomings (schema-less, etc). Still a little immature in places but the dev team is very responsive.
Sorry if this is a truly newbie question, I'm trying to wrap my head around something like CouchDB.
Let's take a concrete example, invoices and line items. An invoice can have many line items in it.
With a "normal" SQL DB, you would have two tables, say: Invoices, LineItems. And we all know how that works.
With a "document oriented" DB you would have Invoice docs, and those docs would directly contain the line items, right?
Is it possible (and/or easy) to do a query to get a report on the best-selling line items? How do you do that in CouchDB, since they are all contained within Invoice documents? In a SQL DB you would only run a query on the LineItems table.
Does this question make sense? :)
There are instructions online for doing this stuff:
http://www.google.com/search?ie=UTF-8&oe=UTF-8&...
http://sitr.us/2009/06/30/database-queries-the-...
There are heavy reads against this type of data: your customer service front-end applications, your reporting tools, warehouses, etc. Never mind you will also need taxes for each line with tax rates for county, city, state, whatever as another one-to-many. I'm sure summing all of that up will not be as efficient as a SQL query.
With invoices, you need reports on daily sales, by customer, by region, etc. that are constantly being updated/refreshed. Never mind slicing/dicing the quantity/price data by category/dept/material/etc. What about a HAVING clause? Your tool would require fetching every piece of data in order to do the aggregates before you can even start doing the filtering. A database is smarter in these cases.
I don't think CouchDB is a good candidate for this type of example.
Yes, invoices are structured and can be represented nicely in SQL. But they also fit nicely in the CouchDB way. You can put invoice-level data, like the shipping address, line items, taxes charged per item, and all of that stuff in a single document, thus making the entire invoice one atomic write. If you did that in SQL, query performance would suck. But CouchDB views are flexible enough to expose small pieces of a document, like individual line items, separately.
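To make the "expose individual line items" idea concrete, here is a minimal sketch in plain Python that imitates the map and reduce steps of a view. It is not the CouchDB API (CouchDB views are normally written in JavaScript and queried over HTTP), and the field names `line_items`, `sku`, `qty`, and `price` are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical invoice documents; every field name here is an assumption.
invoices = [
    {"_id": "inv-1", "ship_to": "12 Elm St", "line_items": [
        {"sku": "widget", "qty": 3, "price": 5.00},
        {"sku": "gadget", "qty": 1, "price": 20.00}]},
    {"_id": "inv-2", "ship_to": "9 Oak Ave", "line_items": [
        {"sku": "widget", "qty": 2, "price": 5.00}]},
]

def map_line_items(doc):
    """Emit one (key, value) row per line item, like a view's map step."""
    for item in doc.get("line_items", []):
        yield item["sku"], item["qty"]

def reduce_sum(rows):
    """Group rows by key and sum the values, like a grouped reduce step."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def best_sellers(docs):
    """Return (sku, total quantity) pairs, best-selling first."""
    rows = (row for doc in docs for row in map_line_items(doc))
    totals = reduce_sum(rows)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `best_sellers(invoices)` ranks SKUs by total quantity sold, even though each line item lives inside an invoice document rather than in its own table.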
In SQL you might calculate taxes charged by joining an invoices table with a tax rates table at query time - though I would not do it that way. In CouchDB you would record the taxes charged by the county, city, state, etc. in the invoice document. No joins are necessary to get the information you need for generating statistics.
CouchDB views act as static indexes for documents and for aggregate statistics on documents. Views are built incrementally and stored. When a query comes in, CouchDB only has to rebuild the portion of the view for documents that have changed since the last query was processed. As the number of invoices in the database grows, query performance in CouchDB should stay about the same. A SQL database has to re-scan all of the relevant tables from scratch every time a query is run. As the number of invoices grows, SQL queries should get slower. This is why CouchDB is great at handling heavy reads.
There are some good explanations in the CouchDB Technical Overview <http://couchdb.apache.org/docs/overview.html> of how CouchDB can perform better than you are giving it credit for.
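As a rough illustration of what "built incrementally" means, the sketch below keeps an aggregate up to date by reprocessing only the document that changed. It is a toy model in plain Python, not how CouchDB stores its views internally, and the class name `IncrementalTotals` and the `line_items`, `sku`, and `qty` fields are hypothetical.

```python
class IncrementalTotals:
    """Keeps per-SKU quantity totals and updates them one document at a time."""

    def __init__(self):
        self.totals = {}          # sku -> total quantity across all invoices
        self.contributions = {}   # doc id -> {sku: qty} last counted for that doc

    def _apply(self, rows, sign):
        # Add (sign=+1) or subtract (sign=-1) one document's rows from the totals.
        for sku, qty in rows.items():
            new_total = self.totals.get(sku, 0) + sign * qty
            if new_total:
                self.totals[sku] = new_total
            else:
                self.totals.pop(sku, None)

    def update(self, doc):
        """Re-index one changed document without touching the others."""
        rows = {}
        for item in doc.get("line_items", []):
            rows[item["sku"]] = rows.get(item["sku"], 0) + item["qty"]
        # Remove the document's old contribution, then add the new one.
        self._apply(self.contributions.get(doc["_id"], {}), -1)
        self._apply(rows, +1)
        self.contributions[doc["_id"]] = rows
```

Reading `totals` costs the same no matter how many invoices have been indexed; the work is paid when a document changes rather than when a report is requested.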
This solution has its disadvantages: one has to manually run a compaction to remove expired data which has not been queried (an expired item is deleted when you try to get it). The advantage is that I can rely on the persistence store I already use (Tokyo Tyrant, mainly because of its multi-master replication and lower memory requirements compared to Redis) and not introduce another daemon and client libraries.
I should probably abstract this from my app and put it up on Github/Bitbucket.
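For readers wondering what delete-on-read expiry plus manual compaction might look like, here is a minimal sketch. It uses a plain Python dict as a stand-in for the key/value store, so nothing here is the Tokyo Tyrant client API; the class name `ExpiringStore` and its methods are hypothetical, and this is not the commenter's actual code.

```python
import time

class ExpiringStore:
    """Expiry layered on a dict-like key/value backend (a stand-in, not a real client)."""

    def __init__(self, backend=None, clock=time.time):
        self.backend = backend if backend is not None else {}
        self.clock = clock  # injectable so expiry can be exercised without sleeping

    def set(self, key, value, ttl):
        # Store the absolute expiry time alongside the value.
        self.backend[key] = (self.clock() + ttl, value)

    def get(self, key, default=None):
        entry = self.backend.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if expires_at <= self.clock():
            del self.backend[key]  # lazy deletion: expired items go away on read
            return default
        return value

    def compact(self):
        """Remove expired items that were never read; returns how many were removed."""
        now = self.clock()
        expired = [k for k, (exp, _) in self.backend.items() if exp <= now]
        for k in expired:
            del self.backend[k]
        return len(expired)
```

The trade-off is the one described above: keys that are never read again stay in the store until `compact()` runs, so it has to be scheduled by hand.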
You're talking about "address Joe Smith told us to ship the widgets to on this date", and "address Joe Smith told us to send the bill to on this date". Semantically, those addresses are different from "current mailing address", even if the values are identical.
That might involve storing the addresses themselves with the orders, but it might not. There's more than one way to implement that kind of requirement with a SQL database.
I'm with you so far.
"For document records of this nature, what you want is a document database."
Wait, where did these "documents" come from? Why not an object database, since I'm writing in an object-oriented programming language and want to store my objects in a database?
The part which I don't understand yet is how to solve transactional problems with CouchDB or MongoDB. I don't mean complex transactional issues (like finance or banking) but even simple ones.
For example: let's say I have (in the db) a collection of 100 'prizes'. Those 'prizes' are to be given to the first 100 users who log into my website. So essentially, each time a user logs in, I need to take one prize from the collection (if any are still available) and assign it to the user. I'm not sure what the correct way is to do this atomically in a non-relational DB (like MongoDB) to ensure that no prize is given twice.
There was a similar question posed on Stack Overflow with some good answers <http://stackoverflow.com/questions/299723/can-i...>.
For your prizes example I recommend creating a separate document for each prize. Initially the prizes are unclaimed. When a user logs into your website, get an unclaimed prize from the database and update it with the user as the winner. In the likely event that another thread was doing the same thing at the same time and put a claim on the same prize first, the other thread will get a failure response when pushing the updated prize document because of CouchDB's MVCC feature <http://couchdb.apache.org/docs/overview.html>. In that case it should try again with another unclaimed prize and keep trying until it successfully marks a prize as claimed, or until there are no more unclaimed prizes.
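The claim-and-retry loop can be sketched roughly as follows. This is an in-memory stand-in written in plain Python, not the CouchDB client API: `DocStore`, `ConflictError`, and `claim_prize` are hypothetical names, and the integer `_rev` only imitates the revision check (real CouchDB revisions are opaque strings, and a stale write is rejected with an HTTP 409 Conflict).

```python
class ConflictError(Exception):
    """Raised when a write carries a stale revision."""

class DocStore:
    """Toy document store that rejects writes made against an old revision."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc_id, body):
        self.docs[doc_id] = {"_id": doc_id, "_rev": 1, **body}

    def get(self, doc_id):
        return dict(self.docs[doc_id])  # return a copy, like fetching over the wire

    def put(self, doc):
        current = self.docs[doc["_id"]]
        if doc["_rev"] != current["_rev"]:
            raise ConflictError(doc["_id"])  # someone else updated it first
        self.docs[doc["_id"]] = {**doc, "_rev": current["_rev"] + 1}

    def unclaimed(self):
        return [dict(d) for d in self.docs.values() if d.get("winner") is None]

def claim_prize(store, user):
    """Try to claim any unclaimed prize; return its id, or None if none are left."""
    while True:
        candidates = store.unclaimed()
        if not candidates:
            return None
        prize = candidates[0]
        prize["winner"] = user
        try:
            store.put(prize)
            return prize["_id"]
        except ConflictError:
            continue  # lost the race for this prize; look for another one
```

Because each prize is its own document, two users racing for the same prize cannot both win: only the first write against a given revision succeeds, and the loser simply moves on to the next unclaimed prize.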
Why do you think most books are black print on white? And newspapers?
Because of the cost of ink.
If you don't care about persistence, then Memcached can be a great choice.
Be careful what you say about data integrity or Joe Celko will come after you.
Links to the products mentioned would also be helpful. As it is, I have to search for them on the web. I'm a developer, I'm lazy, please make it easier for me.
In any case, thank you. This was worth taking the time to read.