
GitHub Availability Report: February 2026

📅 2026-03-12 · Jakub Oleksy

In February, we experienced six incidents that resulted in degraded performance across GitHub services.

We recognize the impact these outages have had on teams, workflows, and overall confidence in our platform. Earlier today, we released a blog post outlining the root causes of recent incidents and the steps GitHub is taking to make our systems more resilient moving forward. Thank you for your patience as we work through near-term and long-term investments we’re making.

Below, we go over the six major incidents specific to February.

February 02 17:41 UTC (lasting 1 hour and 5 minutes)

From January 31, 2026, 00:30 UTC, to February 2, 2026, 18:00 UTC, the Dependabot service was degraded and failed to create 10% of automated pull requests. This was due to a cluster failover that routed connections to a read-only database.

We mitigated the incident by pausing Dependabot queues until traffic was properly routed to healthy clusters. All failed jobs were identified and restarted.
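A failover that lands write traffic on a read-only replica can be made to fail fast at connection checkout instead of failing downstream in job workers. A minimal, hypothetical sketch (the pool shape and names are illustrative assumptions, not GitHub's code):

```python
# Hypothetical sketch: refuse to hand out a "write" connection from a
# replica that reports itself read-only, so a mis-routed failover fails
# fast and loudly rather than silently breaking write jobs.

class ReadOnlyReplicaError(Exception):
    pass

class Connection:
    def __init__(self, host, read_only):
        self.host = host
        self.read_only = read_only  # e.g. the result of `SELECT @@read_only`

def checkout_for_write(pool):
    """Return a writable connection, or fail fast on a read-only one."""
    conn = pool.pop(0)
    if conn.read_only:
        # Surfacing the error here makes a bad failover immediately
        # visible to monitors, rather than as downstream job failures.
        raise ReadOnlyReplicaError(f"{conn.host} is read-only")
    return conn
```

Failing at checkout time also gives monitoring a single, unambiguous signal to alert on, which matches the report's follow-up of adding monitors to reduce time to detect.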

We added new monitors and alerts to reduce our time to detect and prevent this in the future.

February 02 19:03 UTC (lasting 5 hours and 53 minutes)

On February 2, 2026, between 18:35 UTC and 22:20 UTC, GitHub Actions hosted runners and GitHub Codespaces were unavailable, with service degraded until full recovery at 23:10 UTC for standard runners, February 3, 2026 at 00:30 UTC for larger runners, and February 3 at 00:15 UTC for Codespaces. During this time, Actions jobs queued and timed out while waiting to acquire a hosted runner. Other GitHub features that leverage this compute infrastructure were similarly impacted, including Copilot coding agent, Copilot code review, CodeQL, Dependabot, GitHub Enterprise Importer, and GitHub Pages. All regions and runner types were impacted. Codespaces creation and resume operations also failed in all regions. Self-hosted runners for Actions on other providers were not impacted.

This outage was caused by a loss in telemetry that cascaded to mistakenly applying security policies to backend storage accounts in our underlying compute provider. Those policies blocked access to critical VM metadata, causing all VM create, delete, reimage, and other operations to fail. More information is available here. This was mitigated by rolling back the policy changes, which started at 22:15 UTC. As VMs came back online, our runners worked through the backlog of requests that hadn’t timed out.

We are working with our compute provider to improve our incident response and engagement time, improve early detection, and ensure safe rollout should similar changes occur in the future.

February 09 16:19 UTC (lasting 1 hour and 21 minutes) and February 09 19:01 UTC (lasting 1 hour and 8 minutes)

On February 9, 2026, GitHub experienced two related periods of degraded availability affecting github.com, the GitHub API, GitHub Actions, Git operations, GitHub Copilot, and other services. The first period occurred between 16:12 UTC and 17:39 UTC, and the second between 18:53 UTC and 20:09 UTC. In total, users experienced approximately 2 hours and 43 minutes of degraded service across the two incidents.

During both incidents, users encountered errors loading pages on github.com, failures when pushing or pulling code over HTTPS, failures starting or completing GitHub Actions workflow runs, and errors using GitHub Copilot. Additional services including GitHub Issues, pull requests, webhooks, Dependabot, GitHub Pages, and GitHub Codespaces experienced intermittent errors. SSH-based Git operations were not affected during either incident.

Our investigation determined that both incidents shared the same underlying cause: a configuration change to a user settings caching mechanism caused a large volume of cache rewrites to occur simultaneously. In the first incident, asynchronous rewrites overwhelmed a shared infrastructure component responsible for coordinating background work, which led to cascading failures and connection exhaustion in the service proxying Git operations over HTTPS. We mitigated this incident by disabling async cache rewrites and restarting the affected Git proxy service across multiple datacenters.

The second incident arose when an additional source of cache updates, not addressed by the initial mitigation, introduced a high volume of synchronous writes. This caused replication delays, resulting in a similar cascade of failures and again leading to connection exhaustion in the Git HTTPS proxy. We mitigated by disabling the source of the cache rewrites and again restarting Git proxy.
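The write-amplification failure mode described above is commonly mitigated by bounding how fast bulk cache rewrites are issued. A minimal, hypothetical sketch of self-throttled batch rewrites (the function name, batch size, and pause are illustrative assumptions, not GitHub's implementation):

```python
# Hypothetical sketch of self-throttling during bulk cache updates: issue
# rewrites in bounded batches with a pause between them, so a change that
# invalidates many entries cannot flood the backing store all at once.
import time

def throttled_rewrite(keys, write_fn, max_per_batch=100, pause_s=0.05):
    """Rewrite cache entries in bounded batches, pausing between batches."""
    written = 0
    for i in range(0, len(keys), max_per_batch):
        for key in keys[i:i + max_per_batch]:
            write_fn(key)
            written += 1
        if i + max_per_batch < len(keys):
            time.sleep(pause_s)  # back-pressure between batches
    return written
```

Spreading the rewrites out trades cache freshness for bounded load on the shared coordination and replication layers, which is exactly the trade-off at stake when a single config change fans out into a large simultaneous rewrite volume.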

We are taking the following immediate steps:

* We optimized the caching mechanism to avoid write amplification and added self-throttling during bulk updates.
* We are adding safeguards to ensure the caching mechanism responds more quickly to rollbacks and strengthening how changes to these caching systems are planned, validated, and rolled out with additional checks.
* We are fixing the underlying cause of connection exhaustion in our Git HTTPS proxy layer so the proxy can recover from this failure mode automatically without requiring manual restarts.

February 12 07:53 UTC (lasting 2 hours and 3 minutes)

On February 12, 2026, between 00:51 UTC and 09:35 UTC, users attempting to create or resume Codespaces experienced elevated failure rates across Europe, Asia, and Australia, peaking at a 90% failure rate. Impact began in UK South and spread progressively to other regions. US regions were not impacted.
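Failure-rate spikes like this are typically mapped to a paging severity by threshold; if the thresholds are too lenient, even a large regional spike can land below the paging line and delay response. A minimal, hypothetical mapping (the severity labels and thresholds are illustrative, not GitHub's):

```python
# Hypothetical sketch of threshold-based alert severity: map an observed
# failure rate (0..1) to a paging severity so a region-wide provisioning
# failure pages on-call immediately instead of landing in a ticket queue.

def alert_severity(failure_rate):
    """Return a severity label for an observed 0..1 failure rate."""
    if failure_rate >= 0.50:
        return "sev1"  # page on-call immediately
    if failure_rate >= 0.10:
        return "sev2"  # page during business hours
    if failure_rate >= 0.01:
        return "sev3"  # ticket only, no page
    return "ok"
```

Under thresholds like these, a 90% regional failure rate pages immediately; the tuning question is where to place the lines so that alerts fire before customers feel the impact.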

The failures were caused by an authorization claim change in a core networking dependency, which led to codespace pool provisioning failures. Alerts detected the issue but did not have the appropriate severity, leading to delayed detection and response. Learning from this, we have improved our validation of changes to this backend service and our monitoring during rollout. We have also updated our alerting thresholds to catch issues before they impact customers and improved our automated failover mechanisms to cover this area.

February 12 10:38 UTC (lasting 34 minutes)

On February 12, 2026, from 09:16 to 11:01 UTC, users attempting to download repository archives (tar.gz/zip) that include Git LFS objects received errors. Standard repository archives without LFS objects were not affected. On average, the archive download error rate was 0.0042% and peaked at 0.0339% of requests to the service. This was caused by the deployment of an incorrect network configuration in the LFS Service that caused service health checks to fail and an internal service to be incorrectly marked as unreachable.

We mitigated the incident by manually applying the corrected network setting. Additional checks for corruption and auto-rollback detection were added to prevent this type of configuration issue.
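Corruption checks with auto-rollback usually amount to verifying the new config artifact against a known checksum and probing health before committing it, keeping the last-known-good config to fall back on. A minimal, hypothetical sketch (the function names and config format are illustrative assumptions, not GitHub's deployment system):

```python
# Hypothetical sketch of config corruption checks plus auto-rollback:
# verify the artifact's checksum, then run a health probe; if either
# fails, keep serving the last-known-good configuration.
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def apply_config(new_cfg: bytes, expected_sha: str, healthy, current_cfg: bytes) -> bytes:
    """Return the config that should be live after this deploy attempt."""
    if sha256_hex(new_cfg) != expected_sha:
        return current_cfg  # corrupt artifact: refuse the deploy
    if not healthy(new_cfg):
        return current_cfg  # failed health probe: auto-rollback
    return new_cfg          # checks passed: promote the new config
```

The key property is that a bad artifact or a failing health check never replaces the running config, which removes the need for the manual correction that mitigated this incident.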

Follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the engineering section on the GitHub Blog.

