GitLab Production Database Deletion: 5 Failed Backups and a Live Stream

A GitLab sysadmin accidentally deleted roughly 300GB of production database data, and none of the company's five backup and replication mechanisms provided a clean restore. GitLab livestreamed the recovery on YouTube as it unfolded; the incident became a landmark in operational transparency and a standing lesson in the risks of untested backups.

GitLab · 2017 · 2 min read

Background

GitLab.com is a major code hosting platform. On January 31, 2017, a sysadmin was working to fix replication lag on a secondary database server, a task that involved wiping the lagging replica's data directory so it could be re-synced from the primary. Working late and fatigued, they ran the wipe against the primary production database instead.

The Incident

The sysadmin ran rm -rf on what they believed was the replica's data directory; it was in fact the primary production database. The deletion removed approximately 300GB of data before it was stopped. GitLab's team then went looking for a restore point and found that none of its five backup and replication mechanisms was working reliably: regular pg_dump backups had been failing silently for months because an outdated PostgreSQL client produced empty dump files; Azure disk snapshots were enabled for other servers but not for the database servers; LVM snapshots were taken only once every 24 hours; backups to S3 were not working and the bucket was empty; and the replica itself had just been wiped as part of the replication fix.

Response

GitLab took GitLab.com offline and posted a live Google Doc detailing the incident in real time, including each recovery option they tried and why it failed. They also opened a YouTube livestream showing the recovery work. Ultimately the database was restored from an LVM snapshot that had been copied to the staging environment roughly six hours before the deletion, losing approximately six hours of data, including 5,037 projects, 4,789 comments, and 707 user accounts.

Outcome

The livestream and live documentation were widely praised across the engineering and security communities as an unprecedented level of transparency. GitLab published a detailed post-mortem that publicly documented why each of the five backup mechanisms had failed. The case became the canonical example of untested backups and is a staple of reliability engineering education.

Key Takeaways

  1. Test backups regularly by performing actual restores; a backup system that has never been restored from is theoretical, not operational
  2. Multiple backup methods that all fail are worse than one well-tested method, because they create false confidence; test every mechanism you rely on
  3. Transparency during an incident, up to and including livestreaming a database recovery, builds long-term trust despite short-term embarrassment
  4. Destructive commands such as rm -rf should require confirmation in production environments; implement safety controls on destructive operations (see the sketch after this list)
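
As an illustration of the last takeaway, here is a minimal, hypothetical guard around a destructive filesystem operation. It is a sketch under stated assumptions, not GitLab's actual tooling: the hostname pattern, prompt wording, and function names are invented for the example. The idea is simply that a delete on a host that looks like production must be confirmed by retyping the hostname.

```python
#!/usr/bin/env python3
"""Hypothetical guard for destructive operations (illustrative sketch only).

Refuses to delete a directory on a host whose name matches a production
pattern unless the operator retypes the hostname. The pattern and names
below are assumptions for the example, not GitLab's real tooling.
"""
import re
import shutil
import socket
import sys

# Example pattern for "this looks like a production host"; adjust to your naming scheme.
PRODUCTION_PATTERN = re.compile(r"(prod|db\d+)", re.IGNORECASE)


def confirm_destruction(target: str) -> bool:
    """Require the operator to retype the hostname before a destructive action."""
    host = socket.gethostname()
    if not PRODUCTION_PATTERN.search(host):
        return True  # non-production host: no extra ceremony
    print(f"WARNING: '{host}' looks like a production host.")
    print(f"About to irreversibly delete: {target}")
    typed = input(f"Type the hostname '{host}' to continue: ")
    return typed.strip() == host


def guarded_rmtree(target: str) -> None:
    """Delete a directory tree only after the production check passes."""
    if not confirm_destruction(target):
        sys.exit("Aborted: confirmation failed.")
    shutil.rmtree(target)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: guarded_delete.py <directory>")
    guarded_rmtree(sys.argv[1])
```

The same pattern extends to any destructive operation: dropping a database, wiping a replica's data directory, or deleting a cloud volume.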

How to Prevent This

beginner

Back up data with the 3-2-1 rule and verify restores quarterly

Three copies, two different media types, one offsite. The GitLab database deletion incident had five backup methods — all of which failed, for different reasons. The WannaCry and NotPetya ransomware attacks encrypted backup drives that were mounted to infected systems. Backups that have never been tested for restoration are theoretical, not operational. The GitLab incident demonstrated this: several backup systems that seemed healthy had silently failed months earlier. Test restoration of a full system backup quarterly. Store at least one backup copy offline (not mounted, not accessible over the network) to protect against ransomware.
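
A restore drill is the check that would have caught GitLab's silently failing pg_dump job, where backup files were produced but were effectively empty. Below is a minimal sketch of such a drill, assuming PostgreSQL custom-format dumps in a local directory and a throwaway scratch database; the paths, database name, and size and age thresholds are illustrative assumptions, not GitLab's configuration. It fails loudly unless the newest dump is recent, non-trivially sized, and actually restorable with tables in it.

```python
#!/usr/bin/env python3
"""Hypothetical restore drill for PostgreSQL dumps (illustrative sketch only).

Assumes custom-format pg_dump archives (*.dump) in BACKUP_DIR and local
connection defaults (PGHOST/PGUSER or peer auth). All names and thresholds
are example values, not GitLab's setup.
"""
import subprocess
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")   # example location
MIN_SIZE_BYTES = 100 * 1024 * 1024           # an empty or truncated dump fails immediately
MAX_AGE_SECONDS = 26 * 60 * 60               # expect at least one dump per day
SCRATCH_DB = "restore_drill"                 # throwaway database used for the test restore


def latest_dump() -> Path:
    dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        sys.exit("FAIL: no dump files found at all")
    return dumps[-1]


def main() -> None:
    dump = latest_dump()
    stat = dump.stat()
    if time.time() - stat.st_mtime > MAX_AGE_SECONDS:
        sys.exit(f"FAIL: newest dump {dump} is older than expected")
    if stat.st_size < MIN_SIZE_BYTES:
        sys.exit(f"FAIL: {dump} is only {stat.st_size} bytes (silently failing backup?)")

    # Actually restore into a scratch database; checking that a file exists is not enough.
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    subprocess.run(["pg_restore", "--no-owner", "--dbname", SCRATCH_DB, str(dump)], check=True)

    # Minimal sanity check: the restored database must contain user tables.
    result = subprocess.run(
        ["psql", "--tuples-only", "--no-align", "--dbname", SCRATCH_DB,
         "--command", "SELECT count(*) FROM pg_stat_user_tables;"],
        check=True, capture_output=True, text=True,
    )
    if int(result.stdout.strip()) == 0:
        sys.exit("FAIL: restore produced a database with no user tables")
    print(f"OK: restored {dump.name} into '{SCRATCH_DB}' with tables present")


if __name__ == "__main__":
    main()
```

Run it quarterly at minimum, and treat any non-zero exit as a failed backup rather than a failed test; very large databases may need a dedicated restore host for the drill.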

See: GitLab Database Deletion · Data Protection
beginner

Be transparent during an incident — livestream the database deletion if you must

GitLab's response to accidentally deleting 300GB of production database data included a public live Google Doc updated in real time and a YouTube livestream showing engineers working through recovery. The security and engineering community widely praised this transparency despite the embarrassing circumstances. LastPass's changing story, which shifted from "no customer data accessed" to "encrypted vaults stolen" over three months, destroyed more trust than a single comprehensive disclosure would have. Transparent, timely disclosure during an incident maintains trust, enables affected parties to take protective action, and demonstrates organisational integrity. Brief stakeholders early and update them regularly, even when the picture is incomplete.

See: GitLab Database Deletion · Incident Response
database deletion · backup failure · rm -rf · transparency · post-mortem