What can I say except “You’re Welcome” for the AG voting script

We recently had an issue where the network between our GEO-Cluster nodes would go down and both Availability Group instances thought they were supposed to take charge.  When the network came back up, both of them still thought they were in charge.  As you can imagine, an AG can’t have two instances that think they are in charge without problems.  This brought up the question of how voting was configured between the two of them.  This script helped a bunch:

SELECT member_name, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;

We found that the File Share Witness wasn’t working properly by checking the member state. In a simple AG, a good practice is to have each instance plus a File Share Witness; that keeps each side from accidentally taking over.  You’re Welcome.
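
If you also want the cluster-level view, sys.dm_hadr_cluster reports the quorum type and whether the cluster currently has quorum; a quick sanity check might look like this:

-- quorum_type_desc shows whether a File Share Witness is configured
-- (NODE_AND_FILE_SHARE_MAJORITY); quorum_state_desc shows whether the
-- cluster currently has quorum.
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;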

The song for this post: You’re Welcome from Moana

Listen to the AG that’s tricking you…


Another AG (Availability Group) post? Yes, I learned something new and it must be cataloged. When you are failing AGs back and forth really fast and a major indexing job kicks off in the middle, it can cause a transaction to have to roll back.  This rollback may take a REALLY long time, even if you were only on the node for 10 minutes and the large transaction had only been running for about 5 minutes. When I failed back to my preferred primary node and the AG Dashboard didn’t go completely green, I got worried.  Why in the world would it not go green? I had just failed over to the preferred secondary and applied a patch (see? I learned.) and was failing back. It had been green when I started, green when I failed over to the secondary, and now one of my biggest databases was not synchronizing on the primary… *sigh*

I panicked. In this situation I would normally pull the database out of the AG and then re-add it.  I didn’t have that option because it is a HUGE database and I didn’t have the time or space to move it around. I knew a large transaction had kicked off (thank you, alert email that I created to warn me about such things) but thought that surely the rollback would have cleared quickly.  That led me to look for rolling-back transactions.

I ran this on the alarming secondary node:

SELECT R.session_id, R.command, R.status, R.percent_complete
FROM sys.dm_exec_requests R
WHERE R.command IN ('killed/rollback', 'rollback');

To my surprise, there were no results.  Nothing was killed or rolling back; or was it? I ran the query again, but this time without the WHERE clause.

SELECT R.session_id, R.command, R.status, R.percent_complete
FROM sys.dm_exec_requests R;

I saw one command listed as “UNKNOWN TOKEN” with a percent_complete of about 5%, and that percentage was rising. I theorized that this was my rolling-back process and that when it finished, my AG would be healthy again.  The system isn’t used overnight, we had started the maintenance in the late afternoon, and it was the secondary node in trouble, so I had time to test my theory.  It was an agonizing 8 hours as I kept checking percent_complete all evening.  It finally completed and the AG went green.
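
Had I known to look for it, sys.dm_exec_requests also exposes estimated_completion_time alongside percent_complete, so a watch query along these lines could have saved some of the guesswork:

SELECT R.session_id,
       R.command,
       R.status,
       R.percent_complete,
       -- estimated_completion_time is reported in milliseconds
       DATEADD(MILLISECOND, R.estimated_completion_time, SYSDATETIME()) AS estimated_finish
FROM sys.dm_exec_requests R
WHERE R.percent_complete > 0;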

My lesson learned: When my AG isn’t healthy and I have already resumed data movement, before I pull the database out of the AG, I need to check for processes that have a percent_complete on the secondary node. Being patient is really hard but necessary with AGs.
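
For reference, resuming data movement on the secondary is a one-liner; the database name here is just a placeholder:

-- Run on the secondary replica where data movement is suspended.
ALTER DATABASE [MyBigDatabase] SET HADR RESUME;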

The song that goes with this post: Listen to the Man.

It’s different for Availability Groups when service packs get broke…

Last week I did a few things wrong.  The good news is that I learned from them and can now keep myself from repeating them.

So I have this AG; it is kind of important, hence the AG part, but after 5pm I have two hours that it can be down, or so I thought.  We recently added new functionality that requires it to be up all the time.  I have been applying SQL 2014 SP2 to all of my 2014 servers, all of which have been in AGs, and it has been super easy: less than 10 minutes of work and only a reboot of downtime. (I am a little old fashioned and always reboot after a service pack.) My mind said, “Sweet, you can get this done and no one will notice, and if you do both at the same time and delay the reboot on one of them, there will be no downtime.”  That was my first mistake.  My second one was starting a few minutes early; I was excited to have it done because it was my last round of service packs. My final mistake was not realizing that SSISDB is treated like a system database and should NEVER be in an availability group. It may seem super awesome because you can keep your SSIS catalog completely in sync, and you could maybe do it if you plan to remove it before applying service packs or any kind of upgrades to the server, but as far as setting it and forgetting it, you are in trouble if you do.

Here are a few of the fun errors that we saw.

“Script level upgrade for database ‘master’ failed because upgrade step ‘SSIS_hotfix_install.sql’ encountered error 942, state 4, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.”

“Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.”

Huge thank you to all the people who have blogged about Trace Flag 902.  It allowed us to start up SQL Server, find the errors, pull SSISDB out of the AG, and get the service packs to finish running, and then everything was happy and great.  Here is the list of steps from the Microsoft KB article:

Enable trace flag 902 on the instance of SQL Server. To do this, follow these steps:

  1. Open SQL Server Configuration Manager.
  2. In SQL Server Configuration Manager, click SQL Server Services.
  4. Double-click the SQL Server service.
  4. In the SQL Server Properties dialog box, click the Advanced tab.
  5. On the Advanced tab, locate the Startup Parameters item.
  6. Add ;-T902 to the end of the existing string value, and then click OK.
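
Once the instance was up under trace flag 902, pulling SSISDB out of the AG was a single command; a sketch, run on the primary replica, assuming a placeholder AG named [MyAG]:

-- Run on the primary replica; [MyAG] is a placeholder AG name.
ALTER AVAILABILITY GROUP [MyAG] REMOVE DATABASE [SSISDB];

After the upgrade scripts finish, remember to remove ;-T902 from the startup parameters and restart the service normally.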

Lessons Learned:

  1. Don’t start earlier than you say you are going to start.
  2. Don’t do both sides of an AG at the same time.  Do the passive one first and make sure everything is up and working and the AG is healthy before you do the primary node. Verify that failover to the passive node works so that you have no downtime (see the sketch after this list).
  3. Don’t get too comfortable with Service Packs and Cumulative Updates. They are still a big change even though they are pretty stable.
  4. Don’t have any system databases in the AG while applying patches, and know that SSISDB is treated as one of the system databases.
  5. An awesome team that can back you up and help you troubleshoot can make all the difference.  It is amazing to have a boss who believes in you and encourages you to keep going even when you want to give up and go cry in the closet.
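
For lesson 2, verifying the failover is itself one command; here is a minimal sketch, run on the secondary replica you are failing over to, with the same placeholder AG name:

-- Run on the target secondary; a planned manual failover requires the
-- replica to be in synchronous-commit mode and in a SYNCHRONIZED state.
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;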

Also, for those of you following along at home who want to know what song goes with this post: It’s Different for Girls

Happy Service Packing!

Hello Darkness my old friend, I can talk to you again because my Availability Group is quiet…

We have a lovely Availability Group that holds A LOT of data, broken into 42 partitions that usually move information around between them daily.  The index rebuilds on them were making our logs HUGE because the Availability Group was taking too long to catch up; we tried both synchronous and asynchronous commit modes.  We would see all kinds of errors.  We were doing horrible things like auto-shrinking our transaction log after the indexing finished and ignoring alarms while the database was rebuilding.  We had requested more and more space from our storage team, and sometimes the job wouldn’t even finish because it ran out of space.  Our first idea was to split out the index rebuilds so that we could do one partition at a time.

It looks like this:


ALTER INDEX IndexName
ON [dbo].[TableName]
REBUILD PARTITION = 42;

By splitting this out, we were able to get the job to finish, but still with tons of alarms and log growth.

Then we had a thought: maybe the server was just spinning too fast and we needed to give the Availability Group time to catch up.  So we added some simple waits in between each step.


WAITFOR DELAY '00:10';

In this example the delay is hours:minutes, so it waits 10 minutes before running the next step.
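
Putting the two pieces together, the whole maintenance job becomes a loop; here is a minimal sketch, assuming the hypothetical index IndexName on [dbo].[TableName] and our 42 partitions:

-- A minimal sketch: rebuild one partition at a time, pausing between
-- each one so the Availability Group can catch up. IndexName and
-- [dbo].[TableName] are placeholders.
DECLARE @partition int = 1;
DECLARE @sql nvarchar(max);

WHILE @partition <= 42
BEGIN
    SET @sql = N'ALTER INDEX IndexName ON [dbo].[TableName] REBUILD PARTITION = '
             + CAST(@partition AS nvarchar(10)) + N';';
    EXEC sys.sp_executesql @sql;

    WAITFOR DELAY '00:10';  -- give the AG ten minutes to catch up
    SET @partition += 1;
END;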


It has now been a quiet week and we are looking forward to the Sound of Silence.