-
Website
http://adam.blog.heroku.com/ -
Original page
http://adam.blog.heroku.com/past/2009/3/2/database_versioning/ -
Subscribe
All Comments -
Community
-
Top Commenters
-
nevyn
2 comments · 4 points
-
adamheroku
68 comments · 3 points
-
justindz
2 comments · 1 points
-
paulleader
3 comments · 1 points
-
drnic
2 comments · 6 points
-
-
Popular Threads
-
No Knobs
1 week ago · 2 comments
-
Destroying Alien Civilizations, All In A Day's Work
6 days ago · 1 comment
-
No Knobs
the database. To make development easier, we could have a task 'rake
db:sync' that is run after fetching the latest version of schema.yml
from version control. This task would examine the schema.yml diff and
apply it to the developer's local database. This would also eliminate
the need for add_column, remove_column, and various change_column
methods.
I would rename the second part of this puzzle to 'data migrations'. A
common pattern that I've noticed when running data migrations is:
* do some data manipulation before the db schema changes.
* change the db schema.
* do some data manipulation after the db schema changes.
Building upon the syntax you proposed, my ideal data migration would
look something like:
migrate :from => '123abc', :to => '456def' do |db|
before_migrate do |old_schema|
# optional massaging of data while schema is '123abc'
end
after_migrate |new_schema|
# manipulation after schema is '456def'
end
end
When the db:sync task is run, each schema migration would look for an
optional corresponding data migration. If a data migration is found,
call the before_migrate and after_migrate callbacks before and after
the schema migration is run respectively.
Here's an example flow:
* My working directory is on schema revision 123abc.
* I update from source control; The new incoming schema revision is
456def.
* I run rake db:sync
* this looks at my current schema revision (123abc) and my
destination revision (456def) and finds there is 1 data migration to be
run (:from => '123abc', :to '456def').
* we call the before_migrate data migration callback, update the
db schema, and then call the after_migrate data migration callback.
current_schema = Schema.parse("old_schema.yml")
incoming_schema = Schema.parse("new_schema.yml")
Schema.differences_between(current_schema, incoming_schema) do |diff|
# diff would know if it's a add_column, remove_column, or change_column
# and use your db credentials to make the changes.
diff.apply
end
Slightly less handwavy, but still pseudocode :)
In his example, you need to split the name AFTER adding the first_name and last_name columns, but BEFORE you delete the name column. I don't see how you could do that without mingling your schema changes and your data changes together.
1. Maintain canonical DB schema in a single place (for me, an actual create_schema.sql file).
2. Have a script for use in development that rebuilds the DB from scratch from the canonical schema. I run this all the time in development; this means I am always working off of the canonical DB schema.
3. Manage a per-production-update "update.sql" script that I run ONCE on production data to update it from current release to "new" schema. Sometimes if I need programmatic migration I'll write a script to do that.
As I've been envisioning what a "structured" db management system would look like, it's a lot like the one you describe.
I'm not really sure that generating a schema yml is any better than generating a ruby file. I've found that the problems with schema differences between commits has to do with git branches, and differing column orders in everyone's development databases. Also, git branching seems to be the cause of consistency issues. Someone adds a field in a branch, commits. Then goes to master and does something and inadvertently pushes their modified schema to the master branch. It happens quite often, and is a good argument against my practice of running `rake db:schema:load`. We may need to be able to branch databases as well. Branch, make your updates, merge to master, and run the migrations fresh against the master dev database.
I'm curious how you'd propose to handle the case where a migration in master and one in a feature branch that's ready for merging both attempt to migrate from the same version? This was the problem we had with the old counter-based migration numbers that ActiveRecord used to have.
saves me from a lot of that hassle.
I have no idea what language your scripts are using: But I find that SQL is perfect for what you are proposing (with a little boilerplate version verification at the beginning
Maintaining schema.rb separate from the schema in the production database seems like a maintenance nightmare. I much prefer a “scrubbing” migration that can act on the existing database to generate a “fresh” database. Then I extract of schema and data as SQL. The scrubbing migration needs to be maintained, but not to the extent schema.rb needs to: It only needs to remove data from the tables.
I must agree with technoweenie, the *real* next step in database versioning is being able to merge disparate migrations automatically. This will allow the version control system to deal with separate branches, and possibly rollbacks. With proper version control over a database, we can leverage multiple developers without the pain of migration integration.
--R
I do think you are onto something however with the ability to generate an intermediate description of databases that can produce a reliable structure difference so that a database can be checked after the event for conformance to some set of rules. The structure is likely automateable but the inserted data has the potential to report failures when everything is fine, but do so only in production like environments (where say a config like setting in the database was changed by a user at some point but the initial default value was inserted by a migration).
As for some of the other problems you've mentioned, I believe there are more low-tech solutions. At my work, we've been using the concept of a DataMigrator class to help with migration problems. In our world, any migration that does data manipulation is put through the DataMigrator. So, for example, you may have a problem with counter_caches, so you'd write a class method DataMigrator.fix_counter_caches and run that in the migration. This method could be tested in the handy dandy test/unit/data_migrator_test.rb file, so you're much less likely to have errors in your data manipulations because... you have tests!
So, yeah, everyone I know has to fight with migrations from time to time - they're not perfect. I'm not sure that the scenario you're describing is readily attainable, though, but you Heroku guys are a forward-thinking lot :)
In reality, that's probably a bit trickier because your code depends not only on other versioned code around it, but also on its surrounding environment. But if you *do* have an environment that's entirely contained within your VCS, you're all set.
1. a canonical schema definition for development and bootstrapping (though in fact I keep the schema *in the models* a la datamapper)
2. forward-only migrations, only for the production database