The invitation this month for #PGSqlPhriday comes from Dian Fay. The topic is pretty simple, database change management. Now, I may have, once or twice, spoken about database change management, database DevOps, automating deployments, and all that sort of thing. Maybe. Once or twice.
OK. This is my topic.
I’ve got some great examples on taking changes from the schema on your PostgreSQL databases and then deploying them. All the technical stuff you could want. However, I don’t want to talk about that today. Instead, I want to talk about something really important, the concept of rollbacks when it comes to database deployments.
Why Are Rollbacks Difficult?
The entire purpose of your PostgreSQL database is to persist, that is to store, the data. Then, of course, offer up a way to query it. However, the big one, the most important, is getting that data in and keeping it there. If you can’t store the data, then what’s the point?
So, rollbacks. What are they? Put simply, part, or all, of the set of changes you just deployed are… wrong. Broken. Don’t work, or don’t work correctly. This means that you need to undo, or rollback, the changes.
Not a problem I hear some say. We’ve got backups so we’ll just run a restore… that takes about 9 hours. It’s OK if the database is offline, right? Maybe you can rely on a low level disk snapshot instead. That works, if you have it all set up and tested ahead of time. Or, if you’re using a virtualized environment, you can use snapshot technology in some hypervisors. Again, if you have it set up and tested ahead of time. Another way people go is to have a full set of scripts that act to rollback any set of deployment scripts. Once more, ahead of time, tested. Or even, heck, wrap the whole thing in a PostgreSQL transaction and if that fails, use the transaction rollback mechanism.
However, will this really work? Let’s talk about databases breaking.
In my own experience, problems from deployments happen two ways. First, immediately. Your deployment failed. Something went wrong. Whatever it might be. So, you rely on backups, snapshots, or rollback scripts and you’re good to go. Seriously. This works. However, there’s the second, more common, way database deployments fail.
You successfully deploy your set of changes. Immediate tests prove successful. You’re a god astride the earth and you release the production environment to the masses. A week goes by. You get a phone call, “You know, the data looks funny. Funny weird, not funny ha-ha. Can you look into it?” Sure enough, that deployment a week ago, it had a subtle problem no one spotted in testing.
No big deal, we’ll just rollback with the snapshot… What’s that? We can’t lose a week of data? How about that rollback script? Not tested with new data, and, again, we don’t want to risk losing data we have successfully persisted. So while rollback scripts, restores and snapshots cover the immediate problem of a failed deployment, they absolutely don’t cover the more common issue, of a problematic, as opposed to outright failed, deployment.
Well… darn (only I didn’t say that). What now?
Don’t Use Rollbacks, Use Roll-Forwards
Long ago, and far away, I was able to use a very reliable snapshot mechanism from the disk subsystem of the databases I had under management. We tested it to a fare-thee-well and were positive that we could recover from an immediate failure. We even had a few live “tests” that validated our theory (not tests, we burned the database to the ground a couple of times and the snapshot saved our bottoms). However, we found that our rollback scripts just weren’t used. First, the snapshot was faster (and probably safer in a lot of ways). Second, the majority of the time, it was days, or even weeks, later and we couldn’t use the rollback scripts, let alone the snapshots.
Enter roll-forward.
Instead of spending tons and tons of time creating rollback scripts, that are not going to be used, we spent our time ensuring that our deployment process was automated to the point that, in an emergency, we could get a tested, yes, tested, script ready for production in well under 10 minutes. They’ve lived for a week with poor data, another 10 minutes isn’t going to break anything.
The idea is simple. Focus on deployments. The goal is moving forward. Focus on fixing issues, not by trying to literally move to the past, rollback, but instead, move forward, roll-forward. Extremely simple concept, with a metric tonne of work associated.
To support this, you must build infrastructure to allow for automated deployments and automated testing into environments that mirror production as much as practicable. That ain’t easy. However, it is doable.
What If I Really Prefer Rollbacks?
I’ve heard people say that they prefer a rollback, or even that the organization requires them. What then?
Well, then, I suggest the same thing. Build automated deployment mechanisms and automated testing into environments that mirror production.
THEN!!!
Test your rollbacks as thoroughly as possible. Ensure that you have rollback scripts that support not simply the immediate, everything is on fire, but the longer term, a fire is smoldering at the bottom of the data pile that we can’t see yet. No script should be run in production without being tested somewhere else first.
Will this work?
Mostly, but it’s absolutely just as much work as creating roll-forward scripts, and, frankly, I think they’re safer. If you build everything you need to support tested rollback scripts, then you have everything you need to create a roll-forward mechanism.
Conclusion
In a perfect world, you’ll never need a rollback. Meanwhile, on planet Dirt, well, things go pear shaped rather a lot. While the technology in PostgreSQL allows for all sorts of amazing HA/DR scenarios, simple stuff can muck it up badly. I’ve heard people say, “well, we have replication, so we don’t worry about failed deployments”. Except, you do know that the failed deployment will replicate right? Other stuff that largely consists of sticking your head in the sand and humming loudly because no one likes to contemplate failure.
However, as data pros, we need to contemplate failure… and recovery from failure. The roll-forward is a safer, faster, less costly strategy than rollbacks and deals better with the reality of the nature of data management within a PostgreSQL environment. So, focus your efforts there.