Service Incident - 06 December 2023 - Issues Accessing Course Catalog and Dashboard


12/23/2023 7:25 PM

Information from Azure Service Health:



What happened?

Between 2023-12-06 18:00:00 UTC and 2023-12-07 00:30:00 UTC:

1. Customers using Azure CDN or Azure Front Door may have received 5xx errors or timeouts when trying to access resources through North Central US.

2. Customers reported network latency and origin timeouts for Azure Front Door (AFD) and Azure CDN at the Chicago edge site.

What went wrong and why?

Chicago is one of the most heavily utilized edge sites in the North Central US region for both Azure Front Door and Azure CDN customers. During the impact window, internal caching egress values on some of the servers in a single Chicago environment exceeded a limit. This caused resource exhaustion at the network layer, resulting in transient 5xx errors, latency, and origin timeouts for end users. We have a monitoring pipeline built around internal caching egress limits. Unfortunately, the thresholds set in this pipeline after initial benchmarking were higher than the values at which we observed resource exhaustion.
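
To illustrate why no alert fired, here is a minimal sketch of a threshold check. It is not Microsoft's monitoring pipeline. The server names, the Gbps unit, and every number below are hypothetical, chosen only to show how an alert threshold set above the real exhaustion point stays silent while servers are already degraded.

```python
from dataclasses import dataclass


@dataclass
class EgressSample:
    server: str          # hypothetical server name
    egress_gbps: float   # internal caching egress; the unit is an assumption


def servers_over_threshold(samples, alert_threshold_gbps):
    """Return the servers whose egress meets or exceeds the alert threshold."""
    return [s.server for s in samples if s.egress_gbps >= alert_threshold_gbps]


# Hypothetical values, not taken from the incident report.
EXHAUSTION_POINT_GBPS = 80.0      # where resource exhaustion actually begins
BENCHMARK_THRESHOLD_GBPS = 100.0  # threshold from initial benchmarking (too high)
REVISED_THRESHOLD_GBPS = 0.8 * EXHAUSTION_POINT_GBPS  # alert with some headroom

samples = [
    EgressSample("chi-edge-01", 85.0),  # already past the exhaustion point
    EgressSample("chi-edge-02", 60.0),  # healthy
]

# The original threshold sits above the exhaustion point, so nothing is flagged.
print(servers_over_threshold(samples, BENCHMARK_THRESHOLD_GBPS))

# A threshold below the exhaustion point flags the overloaded server.
print(servers_over_threshold(samples, REVISED_THRESHOLD_GBPS))
```

In this sketch the first check returns no servers even though one is past the exhaustion point, which mirrors the gap described above.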

How did we respond?

Because the internal egress thresholds were not reached, our monitoring systems did not raise alerts to engineering teams. After we received support tickets from customers, the faulty environment at the Chicago edge site was taken offline, which mitigated the issue at 00:30 UTC on 12/7.

How are we making incidents like this less likely or less impactful?

We deeply apologize for this incident and for any inconvenience it has caused. In our continuous effort to improve platform reliability:

* We have already enhanced monitoring for internal caching with updated thresholds, and this effort will be completed by Jan 2024.
* In addition, we are enhancing our existing intelligent traffic management mechanisms to incorporate these internal caching thresholds. [ETA: Jan 2024]
* In the longer term, we are improving the performance acceleration of our networking stack for the internal caching architecture. [ETA: April 2024]

12/7/2023 7:40 AM

We would like to inform you that the inability of some customers to access both the Course Catalog and the Training Dashboard was caused by the Microsoft Azure incident in the North Central US region. The issue has now been mitigated, and users should be able to load pages with web parts.

If you still observe any issues with accessing SharePoint pages with LMS365 web parts, do not hesitate to contact us.

Here is the information provided by Microsoft regarding the incident:

Summary of Impact: Between 18:00 UTC on 06 Dec 2023 and 00:30 UTC on 07 Dec 2023, you were identified as a customer using Azure Front Door or Azure Content Delivery Network (CDN) in North Central US who may have received intermittent 5xx errors when trying to access resources in this region.

Preliminary Root Cause: We identified an unhealthy POP (point-of-presence) in Chicago that caused the above errors.

Mitigation: We removed the unhealthy POP location and the service returned to a healthy state. We monitored the failure rates and based on the telemetry, we conclude that service functionality has been restored and the issue is now mitigated.

Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. We also will continue to monitor the telemetry for an extended period while we investigate.

12/6/2023 10:19 PM

We are currently experiencing service issues when trying to access both the Course Catalog and the Training Dashboard. We apologize for any inconvenience this may cause. We are actively investigating the issue in order to resolve it as quickly as possible. This status post will be updated as the resolution progresses.

FOR MORE INFORMATION
For current system status information about LMS365, check out our system status page. During an incident, you can also receive status updates by subscribing to updates available on our status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.