There’s hope, there’s a silver lining, show me my max server memory…

Last week, I had one of those moments where I was searching stack overflow and saw a post. They had an issue which matched my issue but they had solved it. No explanation of how, just that it was solved. As I screamed at the screen, “What did you see?!!”, I promised myself I would write about it to save someone else the pain.

This story starts nearly a month ago. We added a read replica into our AWS RDS instance. Since the replica was added, we were seeing a strange error in the SQL Error log.

08/16/2022 22:16:45,spid122,Unknown,AppDomain 20561 (mssqlsystemresource.dbo[runtime].20560) created.
08/16/2022 22:16:39,spid43s,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) is marked for unload due to memory pressure.
08/16/2022 22:16:39,spid43s,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) unloaded.
08/16/2022 22:15:30,spid193,Unknown,AppDomain 20560 (mssqlsystemresource.dbo[runtime].20559) created.
08/16/2022 22:15:19,spid78s,Unknown,AppDomain 20559 (mssqlsystemresource.dbo[runtime].20558) is marked for unload due to memory pressure.
08/16/2022 22:15:19,spid78s,Unknown,AppDomain 20559 (mssqlsystemresource.dbo[runtime].20558) unloaded.

Why would mssqlsystemresource database be having so many issues? I thought maybe this was normal for this server, or that maybe we had hit a tipping point on this server. Last week about every hour, the server would give up and throw a ton of errors: “Login failed. The login is from an untrusted domain and cannot be used with Integrated authentication.” and “SSPI handshake failed with error code 0x80090311 state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The operating system error code indicates the cause of failure. No authority could be contacted for authentication.” Then it would recover and go back to creating and unloading the mssqlsystemresource database.

We tracked down every job that was touching the server and started to tune it, thinking that was just pushing us over the edge. We worked with AWS and finally one of our engineers noticed that our MAX Server Memory Setting was back at the SQL default. You know that insane default? Yes, it was there. But we had properly set that…three months ago when this stack was put in place. We figured that adding the replica must have somehow reset it for all the servers. We set it again and nothing happened. The errors in the error log continued. We had a failover and everything started working properly. The errors in the log disappeared, jobs could run without crashing the server and everything was awesome again. Turns out that to release the extra memory, it needs some kind of restart.

So, this is what we saw: Max Server Memory was set too high and stealing from the OS, causing memory to thrash and things to crash. We set it properly, did a restart and everything was good again. Instead of digging through SQL 2012 bug reports and trying to figure out what that person on stack overflow saw, you can listen to a song. That is a silver lining. I’m showing you my silver lining.

The song for this post is My Silver Lining by First Aid Kit.

I’m Gonna Spend My Time Speeding that Query Up, Like It’s Never Enough, Like it’s Born to Run…

Have I mentioned that I like query tuning? One of my favorite tuning tricks is removing Sub-queries from WHERE clauses. Let me give an example:

FROM [dbo].[SuperHero] s 			
WHERE HeroType = 2
		FROM [dbo].[SuperHero] x 								
		WHERE x.HeroID = s.HeroID 
			 AND x.IsHuman = 1 AND x.Weakness = 'Lack of Control')

Notice the “NOT EXISTS *Sub-Query* section. Any time I see this pattern or even a “NOT IN *Sub-Query*” pattern, I know I can fix it like this:

SELECT s.HeroName
		, s.HasCape
		, s.FavoriteColor
		, s.LairId
FROM [dbo].[SuperHero] s 
	LEFT JOIN [dbo].[SuperHero] x ON x.HeroID = s.HeroID 
		 AND x.IsHuman = 1
		 AND x.Weakness = 'Lack of Control'	
WHERE HeroType = 2

In this second example, I have moved the sub-query to be in a LEFT JOIN with the same criteria and then in the WHERE I use one of the columns that should be populated (I favor ID columns here) and look to see if it “IS NULL”. That “IS NULL” works the same way as the “NOT EXISTS” and the “NOT IN”.

This allows me to remove the non-sargable arguments from the where clause and takes my query from non-sargable to sargable. (from Wikipedia- The term is derived from a contraction of Search ARGument ABLE. A query failing to be sargable is known as a non-sargable query and typically has a negative effect on query time, so one of the steps in query optimization is to convert them to be sargable).

With simple queries that have a low number of records, I hardly notice a difference in performance. As the queries become more complex or the row numbers increase, the difference begins to show in the query run time and the IO statistics.

The song for this post is I’m Born to Run by American Authors.

I’m Beggin, Beggin you, to stop using VarChars as IDs

As I was troubleshooting a performance issue, I noticed that there was an implicit conversion (SQL Server automatically converts the data from one data type to another) happening in my join. The join was on a column that was named the same in both tables, but one was datatype INT (integer) and the other was a datatype of VARCHAR(50) (variable character up to 50 places).

While the implicit conversion was happening transparently to our coders and users, it was causing performance impacts to the query. I wanted to change the datatype from VARCHAR(50) to an INT, not only to match the other table, but also because INTs are faster to join on than VARCHARs in the SQL engine.

My first step was to make sure there weren’t any values in the column that would have an issue changing to an int. For this task, I am using TRY_CAST to make my life easier.

SELECT TRY_CAST(SuperHeroId as INT) as Result, SuperHeroId
FROM dbo.Lair

The TRY_CAST above is checking to see if I can CAST the value as an INT. If it can’t, it will return a NULL value. My WHERE clause will help me quickly identify the values that are failing which will allow me to fix the data before I change the data type on the column.

Once my query doesn’t return any rows, I am ready to change my datatype, which will remove that implicit conversion and increase the performance of any queries using that join.

The song for this post is Beggin’ by Maneskin.

I Made That Slow Query Run, Run, Run

I have been tuning queries and one of the first things I noticed was Sub-queries in the SELECT Clause.

			FROM HideOut H1
			WHERE TypeDesc = 'Villain' 
			AND H1.Storyline = S.Storyline) AS VillainHideOut
     , (SELECT H2.FormalName
			FROM HideOut H2
			WHERE TypeDesc = 'Hero' 
			AND H2.Storyline = S.Storyline) AS HeroHideOut
     , S.HeroName AS Hero
     , V.VillainName AS Villain
FROM dbo.SuperHero S
	INNER JOIN [dbo].[Villain] V
		ON S.HeroLink = V.HeroLink
WHERE V.IsActive = 1 AND S.IsActive = 1

Why is this query slow? If this query were to return 50 rows, it would run each query in the SELECT clause 50 times, and since there are two of them, that is 100 query runs. What if I returned 100,000 rows? That would be 200,000 query runs. How could I do this differently?

SELECT H1.FormalName AS VillainHideOut
     , H2.FormalName AS HeroHideOut
     , S.HeroName AS Hero
     , V.VillainName AS Villain
FROM dbo.SuperHero S
	INNER JOIN [dbo].[Villain] V
		ON S.HeroLink = V.HeroLink
	LEFT JOIN HideOut H1 
		ON H1.Storyline = S.Storyline AND H1.TypeDesc = 'Villain'
	LEFT JOIN HideOut H2 
		ON H2.Storyline = S.Storyline AND H2.TypeDesc = 'Hero'
WHERE V.IsActive = 1 AND S.IsActive = 1

I moved the two correlated sub-queries into the JOIN clause and then simplified them to just join to the tables. This means I will only select from each table once, instead of for each row and will drop my query runs significantly. I could have left them as correlated sub-queries in the JOIN clause and it still would have performed better than having them in the SELECT clause.

In the query I was working with, I checked the statistics IO output and saw my logical reads on the HeroHideOut table drop from 24,103,280 logical reads down to 10,064 logical reads by making this one change.

Happy Tuning!

The song for this post is OneRepublic’s Run.