Debugging My Way Out of Performance Issues in Azure SQL Database
How I stopped Azure SQL Database from becoming unbearably slow at random times.
I inherited a kind-of-legacy web API and have been working on it to support a few pilot projects.
Recently, it decided to give me the best gift of all: sleepless nights trying to figure out performance issues. Sometimes, one single endpoint from the production API would become unbearably slow.
It would take minutes to fetch just a few bytes from the database, spiking its compute utilization to 100%. This would last hours, sometimes days, and go back to normal by itself.
Desperation led me to overcome my fear of asking questions on Stack Overflow, and from there I got the tools to find the solution quite quickly.
In this short article, I would like to run through the debugging process, the tools, and the solution I found. Hopefully, it will help other fellow developers overcome their own problems!
The starting point
High-Level view of the architecture
The application involved the following two components:
- API: written in C# using .NET Framework 4.6. It leverages Entity Framework 6.2 with a Code-First approach (in a nutshell, the domain models are C# classes that EF takes care of mapping to the database). The API is hosted as an Azure App Service.
- Database: Azure SQL Database hosted in Azure (as a managed service) using the serverless plan, automatically scaling from 2 to 16 vCores.
The offending endpoint retrieved sensor data from the database using the following query, executed around eight times per request. Each call returned one single data point (timestamp and value). In normal times, a call took around one second. In bad times, it took a few minutes.
List<Measurement> result = CreateContext().Measurements
    .Where(m => m.MeasurementSeries.SensorInfoID == sensorId)
    .Where(m => m.MeasurementSeries.StartTime <= toTime && m.MeasurementSeries.EndTime >= fromTime)
    .Where(m => m.Time > fromTime && m.Time <= toTime)
    .OrderByDescending(m => m.Time)
    .ToList();
Query causing the performance issue
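For reference, EF translates that LINQ chain into T-SQL roughly like the following. This is a hand-written approximation, not EF's exact generated statement, and the join column (MeasurementSeriesID) is an assumption inferred from the navigation property:

```sql
-- Approximate T-SQL equivalent of the EF query above.
-- Table and column names are inferred from the C# model, not EF's actual output.
SELECT m.*
FROM Measurements AS m
INNER JOIN MeasurementSeries AS s
    ON m.MeasurementSeriesID = s.ID
WHERE s.SensorInfoID = @sensorId
  AND s.StartTime <= @toTime
  AND s.EndTime >= @fromTime
  AND m.Time > @fromTime
  AND m.Time <= @toTime
ORDER BY m.Time DESC;
```

Nothing exotic: one join, a few range predicates, and a sort. That is what made the slowdowns so puzzling.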
The debugging: before Stack Overflow
It all started with the dreaded message on Teams: "The app is unresponsive, could you look into that?" Technically, it was not unresponsive; just one small endpoint was not responding in a reasonable time.
Maybe a bug was pushed to production? No, no one had touched the production code recently. Maybe IT downgraded our subscription? No, it was all good. DoS? No: zero CPU utilization on both the database and the API side, and no suspicious requests recorded in Application Insights, either.
However, something weird caught my attention. If I triggered the endpoint, I saw the database compute utilization spike to its maximum. It never happened before, and the resources allocated to the database are oversized for the expected workload.
Usage spike on the database when the query was executed, from the Azure Portal metrics
Well, let’s dive into the code. Open Visual Studio > Copy the connection string for the production database > Debug > Open Postman > Run the query. Perfect performance, the result comes back in less than a second.
What? Let’s retry on the deployed app, then. It’s unresponsive. What?
Let’s try something else. Open SQL Server Management Studio, run the raw query. Perfect performance. What?
Ok, maybe something got corrupted on the server? Let’s try restarting the app in Azure. Doesn’t work. Stopping and starting it. Doesn’t work. Redeploying a new version. Doesn’t work. Lunch break.
Back from lunch, the application was running smoothly, as if nothing had happened. What?
The debugging: after Stack Overflow
It self-healed once, but not the second time. Same story, same attempts, and over a day wasted looking at every possible metric and dependency in Application Insights. Nothing useful. So I asked on Stack Overflow, and a comment soon pointed me in the right direction:
"Do you have indexes on SensorInfoID, StartTime, EndTime for MeasurementSeries? Do you have a descending index for Time on Measurements? Have you investigated locking? You can use sp_whoisactive."
Yes, I had all those indexes already. And nice, more tools to help me debug!
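For completeness, indexes along the lines the comment suggested could look like this. The index names are hypothetical, and you should adapt the definitions to your actual schema:

```sql
-- Hypothetical definitions matching the Stack Overflow suggestion.
-- Covering index for the series lookup by sensor and time window.
CREATE NONCLUSTERED INDEX IX_MeasurementSeries_Sensor_Times
    ON MeasurementSeries (SensorInfoID, StartTime, EndTime);

-- Descending index on Time to support ORDER BY Time DESC on Measurements.
CREATE NONCLUSTERED INDEX IX_Measurements_Time_Desc
    ON Measurements (Time DESC);
```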
sp_whoisactive was the tool that brought me to the solution. To use it, download the .sql file from the linked repo, open it in Microsoft SQL Server Management Studio, and run it once to install the stored procedure. Then, while the slow query is executing, run:
EXEC sp_WhoIsActive @get_task_info = 2;
This gives you a detailed overview of who is using the database and how. You can check the author’s documentation for extensive information.
Example view of the sp_WhoIsActive execution result.
I could see the queries I was testing: all fast except the one that was driving me crazy. That one was hanging, and everything seemed correct except the wait_info column. It was a read query, there were no other queries in progress and no locks involved, yet that column contained something: CXPACKET.
As described in a very clear article by Brent Ozar, a CXPACKET wait means that a query is being executed across multiple cores, and some tasks are waiting for the others to complete. Observed together with performance issues, it may be a sign of excessive parallelization that slows the query down instead of speeding it up.
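If you want to check whether parallelism waits dominate your database overall, and not just the one session sp_WhoIsActive showed you, a quick look at the wait statistics DMV is a reasonable sketch (on Azure SQL Database, sys.dm_db_wait_stats is the per-database equivalent of sys.dm_os_wait_stats):

```sql
-- Top accumulated waits since the last stats reset.
-- High CXPACKET (and its companion CXCONSUMER) alongside slow queries
-- can hint at excessive parallelism.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    waiting_tasks_count
FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms DESC;
```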
In SQL Server, the Maximum Degree of Parallelism (MAXDOP) can be configured, and Microsoft itself suggests a MAXDOP of 8. However, before SQL Server 2019, the default value was unbounded (0), as in my case.
By configuring MAXDOP to 8, the application went back to working like a charm.
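On Azure SQL Database, this is a database-scoped configuration, applied with a single statement and no restart. The change I made was along these lines:

```sql
-- Cap the degree of parallelism at 8 for this database.
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 8;

-- Verify the new setting.
SELECT name, value
FROM sys.database_scoped_configurations
WHERE name = 'MAXDOP';
```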
Even though the issue is fixed, it is still interesting to figure out why it happened only from time to time, in what seemed like random situations.
The same endpoint is, from time to time, triggered for sensors and time intervals that return quite a lot of data. My guess is that on those occasions the query was considered slow and given a heavily parallel plan, which was then reused even for the (far more frequent) case in which the query returned a small set of data.
On top of that, the SQL database could scale from 2 to 16 vCores, increasing the possible level of parallelization.
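If that theory is right, a complementary remedy would have been to evict the bad cached plan so the next execution gets recompiled for its actual data volume. A sketch, assuming you first identify the suspect plan via the plan cache DMVs (the LIKE filter below is just an illustrative way to find it):

```sql
-- Find the cached plan of the suspect query.
SELECT qs.plan_handle, st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%Measurements%';

-- Then evict that single plan, substituting the handle returned above:
-- DBCC FREEPROCCACHE (<plan_handle>);
```

This only buys time until the plan goes bad again, though; capping MAXDOP addressed the actual cause.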
SQL Server’s Maximum Degree of Parallelism (MAXDOP) was configured to be unbounded. The query, even when small, was run at a level of parallelism that made it slower instead of faster.
Changing MAXDOP to 8, the level suggested by Microsoft, solved the issue.