If you take basic first aid, say a CPR course, you’ll learn a handy mnemonic for the primary assessment you have to make, A-B-C. That breaks down as Airway, Breathing, Circulation. Is there an open airway so they can breathe? Are they breathing? Do they have circulation, a pulse, are they alive in short. I recently took a two day course on wilderness first aid (on top of CPR training and first responder training and basic and advanced first aid training and Scout training and Scout first-aid training and I’m sure I’m forgetting some) that added to that, D-E. We now have Disability and Environment. In short, just how responsive is the person or do they have the possibility of spinal issues? What’s the environmental situation, lieing on cold ground, in the rain, on a boulder, in a crevasse, etc. And that’s just for the primary assessment. Then there’s the secondary assessment.
I’m not going to attempt to regurgitate the class here (go here if you’re interested). But it got me thinking, could we come up with a primary assessment for SQL Server that is also predicated on simple concepts that we could use to focus our assessment when dealing with SQL Server issues? The obvious answer is yes, but then, how well will it work? I’m pretty sure the shockingly simple, while simultaneously complex, ABCDE for assessing patients went through a lot of rounds to arrive at the current version. So, here’s my idea. I’m going to do a pass at this. I’ll lay out my first thoughts on coming up with a primary assessment, and then, I’d love to see others take it and run with it. Maybe we can turn it into a WIKI or something. But I can see it as a foundation for a more systematic approach.
Let me lay a few ground rules. First, this is the primary assessment. This is the “let’s see if we can keep this person alive” moment. They may have tons of cuts and scrapes and messed up hair and horrible taste in clothing. We can get to all that later. Right now they’re laying on the ground and we need to determine if they’re napping, getting ready to start screaming, unconscious or just dead. So, let’s keep that concept in the front of our brains as we think through what we need to get the patient/server away from FTD (“fixin to die” I love the black humor that EMTs and police have) to either completely alive again or at least limping along. No query tuning (as much as I love it) or questioning people’s database design choices (as silly as they may be) or even finger pointing (yet).
Second, let’s try to stick within the mnemonic as much as possible. So ABCDE. If we determine a need for an F & G as part of the initial checks, fine. And, we have the whole secondary checks to do so there can be any number of fun mnemonics or other memory devices or whatever. But we’ll leave that for the nonce. Primary checks.
Here’s my first pass (I’m skipping survey the scene, the real first step, maybe that’s wrong, let me know):
A: Availability – Can you ping the thing? Is there a network around it that is online such that you can get to it at all? If you can’t, chances are no one else can either.
B: Basic Connectivity (yeah, it sucks, see above) – Can you log in? Can you use the DAC? You can see the server, yes, but can you get to it?
C: Circulation (sucks even worse) – Are there blocked processes (yeah, it should be “B” but that doesn’t feel like the right order to me)? Are you able to query the system and get a response?
D:Â Â Database – Are the databases online? Are there issues around databases? Did you run out of space on the log or data file?
E: Execution – Are we looking at some sort of query issues? Permissions, performance, etc., that is affecting general availability.
Yeah, I know. This is pretty weak stuff. I’m absolutely just putting thoughts into words and letting you see the process as it unfolds. Remember, as a DBA, chances are very good that you’re a first responder. You need to know what to do. Show me where I’ve gone wrong. Feel free to write your own blog post in response. Create a presentation with this idea. I don’t care, just let me know what you’re doing with it so I can try to update and expand on the idea. Could we incorporate existing resource such as this awesome book? Anything and everything. This is an open idea that I’m passing on to see if it has legs. I think it might, but then again, I couldn’t believe that everyone didn’t own an Axim.
Perhaps something like FAST for stroke assessment may be more workable than alphabetized entries. I haven’t worked out the mnemonic yet, just tossing the idea.
Maybe reverse it?
E – Entry : Can I connect to the instance. Includes pinging and is the server up.
D – Databases : Are the databases on line and not showing suspect
C – Connectivity : Sounds the same as E but in this case I’m thinking permissions. Do you have permissions to the database in question.
B – Blocking : Is something running that is blocking your processes
A – Agility : Is your query as fast as it can be. Performance tuning.
Both interesting suggestions. I’m not familiar with FAST. Not sure I’d want to use that since we don’t want people to go fast with databases & recovery. Still, it’s an idea. Also, the responses aren’t strictly alphabetical. It’s meant to be in technical order, just using names to correspond to alphabetical order.
Reverse order works interestingly although it might be hard to remember.
A – Assess the situation. Is your server off…is the network down…what is the impact
B – Be calm, Be cool…Be awesome
C – Check…logs, wait stats, perfmon etc
D – Did the developers break it? If not…Do stuff
E – Earn $$$
I use PACE in my presentations about being methodical. I tend to lean towards being chill and relaxed in the face of emergencies (easier said than done), so pacing myself seems to work well.
Problem – Get details, symptoms, scope and check for recent changes
Analysis – Isolate systems, find the cause, test
Cure (or Choke) – Implement and verify
Evolve – Develop and refine what you’ve done.
Step1: Survey the scene: Is it just one server or a multi-point outage? Has the network /server / virtualization team been contacted? Is a bridge set up?
Ooh, PACE sounds good. It’s not my idea, so we can’t use it of course. Ha! Kidding. Do you have a blog post where you expand on the concept at all? I actually think that one works.
AJ,
You’re right. Survey the scene does sort of work, although, you’re going to be doing a number of things prior to determining if it’s a multi-point outage. However, ensuring that others are contacted should be included at some point in the process.
Hey Grant,
Not to be pedantic, but the AHA supports C-A-B instead of A-B-C. http://newsroom.heart.org/news/1139
But more to the point, many health professionals actually use O-P-Q-R-S-T (http://en.wikipedia.org/wiki/OPQRST) to diagnose their patients. I use (somewhat) the same methodology to troubleshoot IT problems (not just database issues), especially when the “symptoms” are vague.
Example: a user comes in complaining that the “database is slow”.
O – Onset. When did you notice the start of the problem? What were you doing right before it started to slow down?
P – Provocation/Palliation. Can you do anything to make the database faster? Can you do anything that makes it worse?
Q – Quality. Can you describe the slowness? Is it slow intermittently or is it slow all the time? Are inserts/updates slower than selects?
R – Region. Which databases are slow? On which servers? Is the slowness isolated to certain users?
T – Time. Has this happened before? Are there certain times of day when it gets better or worse?
OPQRST isn’t too bad either. Although you’re assuming I’m smart enough to start at O and remember that P follows without going through from A as I sing. Ha!
Clearly I’m not the first person to think of this type of thing. How come no one told me about it?
Here you go:
http://www.codepimp.org/2014/04/pace-yourself/
Cheers
[…] Fritchey talks SQL Server First Aid. It’s an interesting concept, and I’d personally love to see something solid come out […]