You’re like the calmest failing job, I need you louder with a DPA alert…

I have an RDS instance that when backups are failing, we have no idea. We use DPA as one of our alerting systems, but with RDS, failing agent jobs are harder to find. I took the built in RDS status stored procedure and adjusted it enough to send me an alert through DPA so I can know that there are problems.

There are a couple of things I have already done, I have set up my RDS instance to talk to DPA, I have granted my DPA user access to run what it needs in RDS and I have set up a user in DPA that can email\call me when there is a problem. I used this post about DPA to help me get the basics done and then I did a lot of trial and error on my code to get the alert working just right.

This is the basic status stored proc that is telling me if my backups are working or not:

exec msdb.dbo.rds_task_status @db_name='MyDatabaseName';

I tried a few things and finally realized I needed to drop it into a temp table so that I could filter it down to only get what I need:

CREATE TABLE #tempAlert 
(task_id	bigint
,task_type	varchar(200)
,database_name	varchar(200)
,[% complete] int
,[duration(mins)] int	
,lifecycle varchar(20)
,task_info	varchar(8000)
,last_updated	datetime2
,created_at	datetime2
,S3_object_arn	varchar(2000)
,overwrite_S3_backup_file	int
,KMS_master_key_arn	varchar(200)
,filepath	varchar(200)
,overwrite_file int);

INSERT INTO #tempAlert
exec msdb.dbo.rds_task_status @db_name='MyDatabaseName';

SELECT *
FROM #TempAlert
WHERE task_type = 'BACKUP_DB_DIFFERENTIAL' AND last_updated > GETDATE()-1 AND lifecycle <> 'SUCCESS';

DROP TABLE #TempAlert;

This code is specifically looking for any diff failures in the last day, if I want fulls, I switch task_type = ‘BACKUP_DB_DIFFERENTIAL’ to task_type = ‘BACKUP_DB’.

In DPA, I go to ALERTS >>Manage Alerts and select “Custom” for my Alert Category and “Customer SQL Alert – Multiple Numeric Return” for my Alert Type then select “Create Alert”.

I gave it a name that was descriptive and changed my Execution Interval to once a day. I don’t want it to be firing all the time against my RDS instance and running up my bill. I could run it more often if I wanted. I added some text to my “Notification Text” box that will be helpful to anyone that gets the alert. I selected my Database instances that I wanted this alert to execute against. In the “SQL Statement” I pasted the bit of code above that creates the temp table, runs the stored procedure and inserts the results into that temp table, then filters the results and finally drops the temp table.

Finally, in the Alert Level and Notifications section, I set the “High” row of “Min” to a value of one and the “Max” I left empty, then added my Notification Group. I tested it to make sure all the connections were working properly and then I saved it. “I am the backup, I am the failure and I am the alert”.

The song for this post is Matt Maeson – Mr. Rattlebone

There’s hope, there’s a silver lining, show me my max server memory…

Last week, I had one of those moments where I was searching stack overflow and saw a post. They had an issue which matched my issue but they had solved it. No explanation of how, just that it was solved. As I screamed at the screen, “What did you see?!!”, I promised myself I would write about it to save someone else the pain.

This story starts nearly a month ago. We added a read replica into our AWS RDS instance. Since the replica was added, we were seeing a strange error in the SQL Error log.

08/16/2022 22:16:45,spid122,Unknown,AppDomain 20561 (mssqlsystemresource.dbo[runtime].20560) created.
08/16/2022 22:16:39,spid43s,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) is marked for unload due to memory pressure.
08/16/2022 22:16:39,spid43s,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) unloaded.
08/16/2022 22:15:30,spid193,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) created.
08/16/2022 22:15:19,spid78s,Unknown,AppDomain 20559 (mssqlsystemresource.dbo[runtime].20558) is marked for unload due to memory pressure.
08/16/2022 22:15:19,spid78s,Unknown,AppDomain 20559 (mssqlsystemresource.dbo[runtime].20558) unloaded.

Why would mssqlsystemresource database be having so many issues? I thought maybe this was normal for this server, or that maybe we had hit a tipping point on this server. Last week about every hour, the server would give up and throw a ton of errors: “Login failed. The login is from an untrusted domain and cannot be used with Integrated authentication.” and “SSPI handshake failed with error code 0x80090311 state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The operating system error code indicates the cause of failure. No authority could be contacted for authentication.” Then it would recover and go back to creating and unloading the mssqlsystemresource database.

We tracked down every job that was touching the server and started to tune it, thinking that was just pushing us over the edge. We worked with AWS and finally one of our engineers noticed that our MAX Server Memory Setting was back at the SQL default. You know that insane default? Yes, it was there. But we had properly set that…three months ago when this stack was put in place. We figured that adding the replica must have somehow reset it for all the servers. We set it again and nothing happened. The errors in the error log continued. We had a failover and everything started working properly. The errors in the log disappeared, jobs could run without crashing the server and everything was awesome again. Turns out that to release the extra memory, it needs some kind of restart.

So, this is what we saw: Max Server Memory was set too high and stealing from the OS, causing memory to thrash and things to crash. We set it properly, did a restart and everything was good again. Instead of digging through SQL 2012 bug reports and trying to figure out what that person on stack overflow saw, you can listen to a song. That is a silver lining. I’m showing you my silver lining.

The song for this post is My Silver Lining by First Aid Kit.