Cosmos DB Monitoring & Troubleshooting

In general, with Azure Cosmos DB two things matter: the distribution of data and throughput.

Partitioning #

A good distribution is one where all physical partitions hold a similar amount of data, for example partition sizes ranging from 4.4 GiB to 6.5 GiB.

Any container where the distribution across partitions does not look roughly equal, say with more than +/- 10% variance, requires investigation.
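
One way to check for skew at the logical partition key level is a query like the one below. This is a minimal sketch, assuming the PartitionKeyStatistics diagnostic log is sent to a Log Analytics workspace with resource-specific tables, and using placeholder MyDatabase / MyContainer names:

// Largest logical partition keys by stored data (placeholder database/container names).
CDBPartitionKeyStatistics
| where TimeGenerated > ago(24h)
| where DatabaseName == "MyDatabase" and CollectionName == "MyContainer"
| summarize arg_max(TimeGenerated, SizeKb) by PartitionKey
| top 10 by SizeKb desc

If a handful of keys dominate the storage, the data distribution across physical partitions will usually be skewed as well.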

100% RU consumption #

When the normalized RU consumption reaches 100% for a given partition key range, and a client still makes requests to that specific partition key range within that 1-second window, it receives a rate-limiting error (HTTP 429).

By default, Azure Cosmos DB clients typically retry up to 9 times on requests that receive a 429 status. As a result, while we may see 429s in the metrics, these errors are not necessarily returned to our application.

In a production workload, seeing 429s on 1-5% of requests is acceptable and is a healthy sign that the total provisioned RUs are being fully utilized. In this case, no further action is required. The query below calculates the percentage of rate-limited requests per hour:

// Percentage of rate-limited (429) requests per hour.
// Assumes the DataPlaneRequests diagnostic log is sent to a Log Analytics
// workspace using resource-specific tables (CDBDataPlaneRequests).
let totalRequests = 
    CDBDataPlaneRequests
    | summarize TotalRequests = count() by bin(TimeGenerated, 1h);

let failedRequests = 
    CDBDataPlaneRequests
    | where StatusCode == 429
    | summarize FailedRequests = count() by bin(TimeGenerated, 1h);

totalRequests
| join kind=leftouter (
    failedRequests
) on TimeGenerated
| project TimeGenerated, TotalRequests, FailedRequests = coalesce(FailedRequests, 0), Percentage = todouble(coalesce(FailedRequests, 0)) / todouble(TotalRequests) * 100

Throughput increase #

If the RU consumption is at 100% across multiple partitions for a long period of time and the 429 errors make up 5% or more of the total requests, it is advised to increase the RUs for the given collection or for the database in general.
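
Before increasing the RUs, it helps to get a rough idea of the actual peak consumption. A minimal sketch, assuming the PartitionKeyRUConsumption diagnostic log is enabled and using placeholder MyDatabase / MyContainer names:

// Peak and average RU/s actually consumed over the last 24 hours.
CDBPartitionKeyRUConsumption
| where TimeGenerated > ago(24h)
| where DatabaseName == "MyDatabase" and CollectionName == "MyContainer"
| summarize ConsumedRUs = sum(RequestCharge) by bin(TimeGenerated, 1s)
| summarize PeakRUsPerSecond = max(ConsumedRUs), AvgRUsPerSecond = avg(ConsumedRUs)

The peak per-second consumption gives a rough target for the new provisioned (or autoscale maximum) RU/s.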

Determine heavy operations #

  • https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/cosmos-db/scaling-provisioned-throughput-best-practices.md
  • https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/cosmos-db/nosql/troubleshoot-request-rate-too-large.md#step-3-determine-what-requests-are-returning-429-responses
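
Following the linked guidance, a query along these lines can show which operations generate the 429s and consume the most RUs. This is a sketch, assuming the DataPlaneRequests diagnostic log is sent to a Log Analytics workspace with resource-specific tables:

// Requests, throttled requests and consumed RUs per operation and container.
CDBDataPlaneRequests
| where TimeGenerated > ago(24h)
| summarize TotalRequests = count(), ThrottledRequests = countif(StatusCode == 429), TotalConsumedRU = sum(RequestCharge) by DatabaseName, CollectionName, OperationName
| order by ThrottledRequests desc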

Hot partitions #

Seeing a 429 rate-limiting error does not automatically mean that we consume 100% of the provisioned RUs. This is because the normalized RU consumption is calculated as the highest RU consumption percentage across ALL partitions (if one partition uses 60% of its RUs and a second one 80%, the metric shows MAX(60%, 80%) = 80%).

This means that one partition might be very busy while the other one serves requests without issues.

This can be caused by poor partition key selection, resulting in many requests going to a subset of the partition key ranges due to uneven key distribution, which makes those partitions “hot”.

We can identify hot partitions by using the following metric:

 Insights > Throughput > Normalized RU Consumption (%) By PartitionKeyRangeID

More on identifying hot partitions: link
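
As a complement to the metric, a query like the following over the PartitionKeyRUConsumption diagnostic log can surface the logical partition keys that consume the most RUs. This is a sketch, assuming that log is enabled and using placeholder MyDatabase / MyContainer names:

// Logical partition keys with the highest RU consumption over the last 24 hours.
CDBPartitionKeyRUConsumption
| where TimeGenerated > ago(24h)
| where DatabaseName == "MyDatabase" and CollectionName == "MyContainer"
| summarize TotalRequestCharge = sum(RequestCharge) by PartitionKey, PartitionKeyRangeId
| top 10 by TotalRequestCharge desc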

An autoscaling example #

Let’s assume we have a container with autoscale enabled that can scale from 2,000 to 20,000 RU/s, where a given partition key range can scale between 1,000 and 10,000 RU/s. Even if there is a spike in requests to that partition key range in a given second, Cosmos DB will not immediately scale to the maximum RUs.

Because the normalized RU consumption metric shows the highest utilization across all partitions in the time period, it will show 100%. However, because the utilization was at 100% for only 1 second, autoscale won’t automatically scale to the maximum.

However, since autoscale provisions all required resources upfront, Cosmos DB was still able to use the maximum RU/s for that second, even though it didn’t scale to the maximum and autoscale remained untriggered at 1,000 RU/s for that partition key range.

To verify that, we can use the following diagnostic logs query:

// Per-second RU consumption by partition key range around the time of the spike
// (placeholder database/container names and time window).
CDBPartitionKeyRUConsumption
| where TimeGenerated >= todatetime('2022-01-28T20:35:00Z') and TimeGenerated <= todatetime('2022-01-28T20:40:00Z')
| where DatabaseName == "MyDatabase" and CollectionName == "MyContainer"
| summarize sum(RequestCharge) by bin(TimeGenerated, 1s), PartitionKeyRangeId
| render timechart
