Segmenting Environments in SQL Server

In any environment where you have multiple DBAs and a heavy workload, there exists the chance for mistakes to be made. When those mistakes are made on the production system, bad things can happen. Depending on what your production system does, whole shifts of people could be idled while the problem is corrected.

Everybody makes mistakes. Even the most careful people can run a query on the wrong server because they are tired and overworked. We don’t always have the option to sleep or step away from the computer. To help avoid mistakes, blocking avenues for them makes the most sense. Below I am going to outline a few ideas for segregating your DEV and PROD environments, and I would welcome any comments with additional suggestions.

Separate accounts for production access

The accounts your DBAs (or others with elevated access) use to access production server should be different and – importantly – non-overlapping with the accounts they use for other environments (both should be Windows domain accounts.) This will prevent the AD account used on your PC from having direct access to production servers. You have to take a second step (run as…) in order to connect to prod. This affords some protection, but I suggest pairing it with…

Access production servers via RDP only

Whether your RDP directly into the production box, or into an intermediate box , this offers protection against those who might choose to leave a Run As… session of SSMS (or Visual Studio) open on their desktop at all times for quick access to production. A firewall rule to block TCP access on the SQL ports, but allowing RDP port access should help lock this down tight.

Separate Service Accounts

Best practice is to have a separate service account for each instance of SQL server running in your environment. I have yet to work for a company that does this, but at a minimum the accounts used for your production servers should be different than the ones for your other tiers. This helps in a couple of ways: First, if you’re installing a new dev or test tier server and bungle the password a couple of times, you don’t lock out your production service account. Second, it prevents you from accidentally targeting production with a SQL Service from another tier, like a SQL Agent job or Replication (which also uses SQL Agent Jobs).

The ultimate lockdown: HOSTS files/Network Segmentation

Preventing any connections from your other tiers to your production tiers can be taken a step further by adjusting the HOSTS file entries on your dev/qa/preprod tiers so that any connections to production servers are re-routed into a black hole. In a small environment, this might be all that is necessary. In a larger environment, it is much easier to set up separate VLANs and implement rules preventing traffic initiation from your lower tiers to production. This prevents someone from setting up a linked server connection to production from a lower tier and mucking about with data that way. Not that anyone would ever do that.


Introducing this level of prevention in your environment can cause headaches beyond just the hurdles needed to access production. What about when you want to try to compare schemas between production and development? Or do data comparisons? Isn’t this just a huge pain?

Yeah, it’s a pain. But it doesn’t have to be huge. Some schema comparison tools will let you run comparisons against backup files instead of live DBs. Some of them will also create schema snapshots, so you can archive those and compare schemas offline. Data comparisons get a little trickier, but I would argue that those tricks help keep your environment much less vulnerable to an oops, and oopses can be VERY costly.


Less is More

Sometimes it really is true.  Let me tell you a story about a really powerful server.  We’ll call him Bob.  Bob is impressive.  RAM? Yeah, he’s got that, to the tune of a terabyte.  He has no problems with premature ejection of data from his cache.  Brains?  He’s like a rocket scientist on steroids, with 8 physical processors each sporting 10 cores.  Truly a dizzying intellect.  All those brains aren’t very useful without some data to back them up, so he snuggled up to a SAN sporting more spindles than you can shake a stick at.  Yes, you might say Bob was pretty happy with himself.  He was more than ready to take on the workload from our demanding SQL Server (let’s call her Sheila), thinking he wouldn’t break a sweat.

He was right.  Heck, he had the performance counters to prove it: He never went above 50% CPU.  Sheila, on the other hand, wasn’t happy.  She wanted more CPU, but try as he might, Bob just couldn’t seem to get her to take his massive CPU offering.  You see, Sheila was a perfect 10.0  — her @@Version, that is.  Windows took the 80 logical cores Bob had to offer and split them up into 2 processor K-Groups, each with 40 cores.  Since Sheila was so old school, she just wasn’t ready to be involved in a multiple K-Group relationship.  Something had to be done.  Bob’s peers in his “cluster” of friends suggested coaxing Sheila into 10.5 territory, where she’d be more open to doubling up on K-Groups, but Bob knew that trying to bring Sheila more current would likely mean she’d be leaving him for different hardware, and he just couldn’t stand the thought of losing her.

So Bob racked his BIOS looking for an answer to his problems.  It was there that he found the answer.  He had to be gentler with Sheila.  Rather than showing her everything he had to offer, why not tone it down a little?  In his BIOS he activated a feature to disable processor cores, telling each processor to behave as if it only had 8 cores instead of 10.  With only 64 cores visible to Windows, it saw no need to split them into K-Groups.  Ironically, by showing her less, Sheila was able to see more of what Bob had to offer.  Bob and Sheila would have more time together after all.  He might be sweating a little bit more now, but the sweaty part of the relationship is often the most fun.

Sometimes, Fewer is Better
Bob’s first big task with Sheila, given their new and empowered relationship status, was to see how much of an improvement all of this available brain power would have on Sheila’s CHECKTABLE performance.  So she queued up her largest database and got to work.  Something strange happened, though.  When presented with even more processing power, Sheila had problems bringing it all to bear and delivering the kind of performance she knew she was capable of with Bob.  Firing up her perfmon toolbox, she could see she was getting a dismal rate of throughput from all the shiny spindles Bob was so friendly with.  But it wasn’t the spindles that were the problem.  Latency was low – less than 5ms on average– so the spindles were delivering whenever they were asked.  They were delivering on target, too, with the right block size and alignment.  What could be the problem?
Maybe she just needed to try more things at once?  Instead of doing a single CHECKTABLE, she fired up multiple sessions of DBCC CHECKTABLE to try to spice things up.  It worked –disk throughput improved– but it was a bit unwieldy.  Bob had a script to refresh the SAN volumes presented to Sheila, attach them, and kick off a series of CHECKTABLE commands for each database.  When that script was done, a report was e-mailed to the DBA team detailing the results.  It would take a lot of work to revamp that process to spin up multiple sessions, aggregate the data accurately, and handle the timing of the volume refresh.  What’s a girl to do?  Checking this database was taking her 4 DAYS, and –love your data or not– that is clearly too much time with a static data set.  After 4 days, the SAN spindles ached to refresh their volumes, and Bob was wondering what was going on – Sheila was just not demanding CPU like he expected.  In a desperate attempt to save their relationship, Sheila started down the road of parallel operations, even going so far as to do some fancy partition switching to break up her largest tables into smaller chunks to process in parallel.

It was a novel idea, but she wasn’t sure it was going to guarantee consistency.  After all, she was changing the schema ever so slightly, which meant the data she was analyzing wasn’t really the same as the data in production.  To sanity check, her DBA got in touch with Paul Randal (blog | twitter), who gave some free advice.  One of those pieces of advice was to read this blog post detailing some experiments he did with CHECKDB and MAXDOP.  It turns out that trying to bring all 64 cores to bear during a CHECKTABLE is not that effective.  So Bob and Sheila conducted some experiments together, and slowly but surely they were able to find her sweet spot.  It turned out that Sheila had a pretty small sweet spot, but she was happiest at MAXDOP 4, where she saw between a fifteen and twenty-fold improvement in throughput.  Instead of 4 days, she was able to confirm her consistency in less than 2, enough of an improvement that her crazy partition-switching plans were shelved.

Bob and Sheila learned an important lesson about their relationship this week.  Quality really does top quantity, and sometimes you have to stop trying to go full bore if you want to have a better experience.

Got an opinion on this post?  Share it below!  If you thought “Wow, this is the kind of work I’d like to be doing!” and are interested in working for a company that wins awards for employee engagement, retention, and enrichment, drop me a line.  We’re looking for a DBA to add to our team.  We also have openings for .NET programmers. I love my job and the people I work with.  I’ll bet you would, too.

The Cab Ride–Part 6

Back in July, Gill Rowley (blog | twitter) blogged about the year that had unfolded for four folks who happened to share a cab back to the airport after Summit — one of those four being yours truly.  Looking back on it, it has been one heck of a year.  Erin, Gill, and I each changed jobs (Erin and Gill’s announcements).  Jes changed jobs twice (first to Blue Door Consulting and then to Brent Ozar PLF).  I had thought Jes was just the outlier (if you’ve met Jes, you definitely know she’s a recipe-wielding, marathon-running, coffee-guzzling, SQL-sharpening outlier) of the group, but that has not proved to be the case.

In January of this year I joined Autodesk, a software company primarily known for engineering products like AutoCAD.  It was an amazing opportunity, and it came with a hefty pay increase and some of the best fringe benefits I’ve seen in Michigan.  Alas, though, it was not meant to be.  Within a few months, my department had taken on a new direction – a completely different RDBMS — and I was left with a really tough choice to make:  follow the direction and continue my career there, or begin looking for other options.

Recently, I saw Buck Woody (LinkedIn | Twitter) live-tweeting some professional development advice from a seminar he was giving at the University of Washington.  One of the things he mentioned was having goals for your career.  It reminded me of one of my favorite presentations from the 2010 PASS Summit.  Buck gave a lightning talk entitled “Your career is your fault”.”  At the time, it caused me to examine my job performance and my attitude toward my job and take some action, but I fell short of really putting thought into where I wanted my career to go.  Once I figured out what my goals were, the choice I had to make was pretty clear – even if it meant leaving a pretty amazing place.

When I formulated my career goals I had a couple of things in mind.  First, I wanted to be somewhere I could learn a lot and share that learning with others.  Second, I wanted to be in one of two positions:  either working toward a leadership position or working as a consultant.  Both of those options ended up available to me.  I had to do a lot of soul searching and discussing with my family, and finally we came to the decision that the travel demands of the consulting life while the kids were still so young (5 and 9) would put too much stress on us all.  I did something I never thought I would do:  I declined an offer from an amazing company, working with people whose names are instantly recognizable in the community. That’s the bad news.

The good news is that I accepted an offer to work at a local company (Plex Systems) with some amazing people who are relative unknowns in the national community, but where I will feel like Keanu Reeves in Johnny Mnemonic with all the heavy knowledge that’s going to be downloaded to my brain.  There are also opportunities for leadership positions there, so I’ll have plenty to keep me busy and some opportunities for growth.  For now, at least, I have postponed living the glamorous life of a consultant in order to lead the fulfilling life of a family man.

I really want to thank Buck Woody, though.  His lightning talk came at a pivotal point for me and really connected.  I’ve carried that tagline – “Your career is your fault” – since then and try to revisit it occasionally to see if I feel like I should be praised for my career so far, or blamed for it….either way it’s my fault after all.  If you’re not thinking about your career that way, as something you plan instead of something that simply unfolds, I would encourage you to start.  Maybe you’ll be an outlier, too!

T/SQL Tuesday–A Day in the Life

Erin Stellato (blog|twitter) handed out this month’s T-SQL Tuesday assignment, which was to blog about what a typical day is like for a DBA. Since I have been wanting to start blogging for a long time and this seemed like an easy assignment, I took up the challenge and tracked a typical day for me last week.

First, a little background: I work on an agile development team for Autodesk, a software company that has recently rolled out a new cloud-based software product in a market where hosted solutions are not the norm. I’m responsible for setting up and maintaining our environment as well as assisting with development efforts. Of course my life would not be complete if I didn’t also get to spot areas where security could be tightened up, design improved or configurations tweaked. I also run the PASS user group for Southeast Michigan, and I am a PASS regional mentor for the Heartland region.

I use the time before our morning scrum meeting to finish up a couple of things from the day before. We’ve added some new servers to our environment, and most of them are only accessible via IP address. This morning I go about editing my hosts file in an effort to make it simpler for me to use Remote Desktop when that’s necessary. SQL Configuration Manager will handle the rest.

Next I meet up with the rest of my small sub-team to do a quick recap of the prior day and discuss today’s goals. On everyone’s mind this morning is some work toward supporting a move from our current hardware setup to hosting our environment on Amazon Web Services (AWS.) AWS does not support clustering or native disk-based replication for high availability, and our product creates too many databases to use features like replication or DB mirroring (since launching the product in March, we now manage over a thousand databases and expect to continue sustained growth) to cover our need to maintain availability. After evaluating some products from other companies, we’ve decided to test some home-grown strategies. Our sysadmin has created some iSCSI targets on our AWS node and we’re going to try putting our DBs on it, then moving that storage to another AWS node and bringing them up there. It’s the first step toward making our own high availability solution in such a way that’s less vulnerable to the recent high-profile AWS outages.

I notice on this particular day that I have two interviews scheduled. I’m the sole DBA on the team right now, and we’re looking to hire another so I have been interviewing a lot of candidates lately. My first interview is scheduled shortly after the scrum meeting. This is the first job I have held where I am in a gatekeeper role as far as interviews go. It’s a good deal of stress, knowing that you hold the key to someone’s future in your hands. Brent Ozar (blog|twitter) had a great article on interviewing, and his weekly newsletter has linked to some other great resources on interview techniques and how to find the right candidate. As part of my first-round interview questions, I always include a couple of myths from Paul Randal (blog|twitter). For this candidate, that meant that after discussing locking, I ask about lock escalation. I’m always interested to see if folks will try to tell me that locks escalate from row level to page level. (They don’t.) I even had a candidate once tell me that locks escalated from row to page to table to a database level lock. I’m happy to report that my morning candidate was able to correctly answer, which is always satisfying. It shows that they’ve had some exposure to Paul’s blog, or that they’ve read the BOL entries on lock escalation carefully at some point. It is by no means a litmus test of whether a candidate moves to the next level. I also try to give every candidate some time to tell me about the things they like and are proud of in their job. Expressing their finest moments gives the candidate a chance to shine and show me something my questions might have missed.

With the interview finished, it’s about 11:30 and time for lunch. I’ve taken up a health kick and started eating salad for lunch almost every day at work. This means I eat at my desk so I can browse the web while I eat, because eating salad is very monotonous for me and I need something to distract me from the repetitive stab-munch-stab of food that is tolerable, but not terribly enjoyable. So while munching I read an article on sparse columns in SQL Server that someone linked me to. (I apologize for not noting who it was at the time.) Sparse columns are an interesting new feature and something worth considering if you’re getting pressure to use a NoSQL platform because of its schema flexibility.

I also take the opportunity during lunch to correct a tweet that comes out about my local user group event, scheduled for Thursday. I keep the MetroTwit client up and running most of the day, watching for interesting stuff to crop up. Someone tweeted that the meeting is tonight (Tuesday) and Tim Ford (blog|twitter), our speaker, caught it and did a virtual double-take, immediately DMing me that he could not make it tonight. Luckily I caught it and corrected it almost immediately, so the bad info wasn’t out there influencing anyone’s plans for the evening for more than a couple of minutes. If you’ve read Tim’s blog by now, you know that we had to cancel the meeting due to hardware problems where he works, which was sad for everyone involved. We’re working toward a reschedule.

After lunch, the junior sysadmin tells me he has moved the iSCSI target drives from one of our AWS instances to another one, and he would like me to check it out and start up SQL. I check the drives, which are all there and writable, so I fire up SQL. Uh-oh. I have 1800 DBs on this server, and every last one of them failed to start up. When I check why, it seems the folder containing my transaction logs has disappeared. Talking with the junior sysadmin, it seems he saw a folder called “logs” taking up a lot of space, so he decided they were not needed. UGH! This is a load test system, so I haven’t set up regular DB backups. Well, maybe I can use the attach_rebuild_log feature of CREATE DATABASE..FOR ATTACH to get the log rebuilt. A quick script later to loop through all the DBs, and …. Nope! Looks like most of the DBs were left in a transactionally inconsistent state and so the log cannot be rebuilt. I’ll have to resort to yet another blog post from Paul where he details the process of trying to recover a DB when this has happened. Luckily these are just test databases, and a little data loss won’t hurt this environment. However, another issue has cropped up with one of our production DBs, so this will have to wait.

At around 1:30 one of our developers who has support responsibilities notifies me that we have an issue with a client’s production DB. The developer and I work through a script that will fix their database. We work until about 3pm on the script, when I have another interview scheduled. This candidate doesn’t get the escalation question right, explaining to me that lock escalation goes from row/page/table. Sometimes I correct this misconception if the candidate seems otherwise qualified, and there are times when I just let it go and move on. I had a candidate insist that she didn’t have a lot of experience with “theoretical” questions but was great with day-to-day stuff. I had to wonder what that means, since lock escalation is by no means theoretical and probably happens on your database on a pretty regular basis unless you’ve disabled it.

At 4pm, I contact my boss and ask him to move our scheduled appointment out until I can deploy the bug fix script the developer and I had been working on. We deploy it to a test environment, and there we verify that the app no longer delivers an error and the issue seems resolved, so we immediately deploy to production as well.

At 4:20 I go to my boss’s office for the first of 3 meetings to give him some info on things he might need to do while I am on vacation. We spend the next 40 minutes discussing various ways to troubleshoot our production cluster, including using SQLSentry Performance Advisor as well as sp_whoisactive. We spend the majority of time looking at the various features of SQL Sentry PA and getting used to its interface, and probably only 10 – 15 minutes on sp_whoisactive since it is a much more straightforward.

With that done, it’s quitting time! I’m looking forward to my vacation.

As of this writing, we still have an unfilled DBA position on my team. If you’re interested in working in an exciting environment with great benefits and are in the Southeast Michigan or Toronto area, send me an e-mail or leave a comment below.