Another AG (Availability Group) post? Yes, I learned something new and it must be cataloged. When you fail AGs back and forth in quick succession and a major indexing job kicks off in the middle, a transaction can be forced to roll back. That rollback may take a REALLY long time, even if you were only on the node for 10 minutes and the large transaction had only been running for about 5 minutes. When I failed back to my preferred primary node and the AG Dashboard didn’t go completely green, I got worried. Why in the world would it not go green? I had just failed over to the preferred secondary, applied a patch (see? I learned.), and was failing back. It had been green when I started, green when I failed over to the secondary, and now one of my biggest databases was not synchronizing on the primary….*sigh*
I panicked. In this situation I would normally pull the database out of the AG and then re-add it. I didn’t have that option because it is a HUGE database and I didn’t have the time or space to move it around. I knew a large transaction had kicked off (thank you, alert email that I created to warn me about such things) but thought that surely the rollback would have cleared quickly. That led me to look for transactions that were rolling back.
I ran this on the alarming secondary node:
SELECT R.session_id, R.command, R.status, R.percent_complete FROM sys.dm_exec_requests R WHERE R.command IN ('killed/rollback', 'rollback')
To my surprise, there were no results. Nothing was killed or rolling back; or was it? I ran the query again, but this time without the WHERE clause.
SELECT R.session_id, R.command, R.status, R.percent_complete FROM sys.dm_exec_requests R
I saw one command listed as “UNKNOWN TOKEN” that had a percent complete at about 5%. That percent was rising. I theorized that this was my rolling back process and when it finished, my AG would be healthy again. The system isn’t used overnight. We had started the maintenance in the late afternoon and it was the secondary node in trouble, so I had time to test my theory. It was an agonizing 8 hours as I kept checking on the percent_complete all evening. It finally completed and the AG went green.
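If you would rather not stare at the dashboard while you wait, a rough T-SQL equivalent of "is it green yet?" might look like the sketch below. This is not the query I ran that night; it assumes you are connected to the replica you are worried about and that reading sys.dm_hadr_database_replica_states is allowed for your login.

```sql
-- A minimal sketch: synchronization state of each AG database on THIS replica.
-- Not the query from the original incident; offered as a companion check.
SELECT DB_NAME(drs.database_id)        AS database_name,
       drs.synchronization_state_desc,   -- e.g. SYNCHRONIZED, SYNCHRONIZING, NOT SYNCHRONIZING
       drs.synchronization_health_desc,  -- HEALTHY, PARTIALLY_HEALTHY, NOT_HEALTHY
       drs.log_send_queue_size,          -- KB of log not yet sent to the secondary
       drs.redo_queue_size               -- KB of log received but not yet redone
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;                  -- only rows for the replica you are connected to
```

A database that shows anything other than HEALTHY here is the one to keep an eye on while the long-running process finishes.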
My lesson learned: When my AG isn’t healthy and I have already resumed data movement, before I pull the database out of the AG, I need to check the secondary node for processes that report a percent complete. Being patient is really hard but necessary with AGs.
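For next time, here is a sketch of the check I would run on the secondary before doing anything drastic. It filters on percent_complete instead of the command name, since the command showed up as “UNKNOWN TOKEN” rather than a rollback. The estimated_completion_time column is reported in milliseconds; I am assuming it is populated for this kind of request, so treat the minutes figure as a rough guess at best.

```sql
-- A minimal sketch: any request on this node that is reporting progress.
-- Filtering on percent_complete catches work whose command name is unexpected.
SELECT R.session_id,
       R.command,
       R.status,
       DB_NAME(R.database_id)                 AS database_name,
       R.percent_complete,
       R.estimated_completion_time / 60000.0  AS est_minutes_remaining  -- ms to minutes; rough estimate only
FROM sys.dm_exec_requests AS R
WHERE R.percent_complete > 0
ORDER BY R.percent_complete DESC;
```

If this returns a row whose percent_complete keeps climbing between runs, that is a good sign the node is still working through something and waiting may be the right call.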
The song that goes with this post: Listen to the Man.