There’s hope, there’s a silver lining, show me my max server memory…

Last week, I had one of those moments where I was searching stack overflow and saw a post. They had an issue which matched my issue but they had solved it. No explanation of how, just that it was solved. As I screamed at the screen, “What did you see?!!”, I promised myself I would write about it to save someone else the pain.

This story starts nearly a month ago. We added a read replica into our AWS RDS instance. Since the replica was added, we were seeing a strange error in the SQL Error log.

08/16/2022 22:16:45,spid122,Unknown,AppDomain 20561 (mssqlsystemresource.dbo[runtime].20560) created.
08/16/2022 22:16:39,spid43s,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) is marked for unload due to memory pressure.
08/16/2022 22:16:39,spid43s,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) unloaded.
08/16/2022 22:15:30,spid193,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) created.
08/16/2022 22:15:19,spid78s,Unknown,AppDomain 20559 (mssqlsystemresource.dbo[runtime].20558) is marked for unload due to memory pressure.
08/16/2022 22:15:19,spid78s,Unknown,AppDomain 20559 (mssqlsystemresource.dbo[runtime].20558) unloaded.

Why would mssqlsystemresource database be having so many issues? I thought maybe this was normal for this server, or that maybe we had hit a tipping point on this server. Last week about every hour, the server would give up and throw a ton of errors: “Login failed. The login is from an untrusted domain and cannot be used with Integrated authentication.” and “SSPI handshake failed with error code 0x80090311 state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The operating system error code indicates the cause of failure. No authority could be contacted for authentication.” Then it would recover and go back to creating and unloading the mssqlsystemresource database.

We tracked down every job that was touching the server and started to tune it, thinking that was just pushing us over the edge. We worked with AWS and finally one of our engineers noticed that our MAX Server Memory Setting was back at the SQL default. You know that insane default? Yes, it was there. But we had properly set that…three months ago when this stack was put in place. We figured that adding the replica must have somehow reset it for all the servers. We set it again and nothing happened. The errors in the error log continued. We had a failover and everything started working properly. The errors in the log disappeared, jobs could run without crashing the server and everything was awesome again. Turns out that to release the extra memory, it needs some kind of restart.

So, this is what we saw: Max Server Memory was set too high and stealing from the OS, causing memory to thrash and things to crash. We set it properly, did a restart and everything was good again. Instead of digging through SQL 2012 bug reports and trying to figure out what that person on stack overflow saw, you can listen to a song. That is a silver lining. I’m showing you my silver lining.

The song for this post is My Silver Lining by First Aid Kit.