Duolingo: 2.6 Million User Records Scraped via Exposed API Endpoint

A threat actor scraped 2.6 million Duolingo user records, including email addresses and public profile information, by abusing an exposed API endpoint. The data was published on a hacking forum and later used in phishing campaigns.

Duolingo·2023·2 min read

Background

Duolingo had over 500 million registered users in 2023. An exposed API endpoint allowed anyone to submit a username and receive the associated profile information, including the email address, language courses, and profile metadata. Individual lookups were permitted, but bulk scraping was not authorised.

The Attack

The API endpoint at api-duolingo.com accepted a username and returned that user's profile data, including the email address linked to the account. A threat actor systematically submitted millions of usernames (likely drawn from prior username lists) and extracted the associated email addresses, real names, language courses, streak counts, and country data. The resulting database of 2.6 million records was first offered for sale for $1,500 on a hacking forum in January 2023, then published freely in August 2023. Email addresses from the dump were subsequently used in targeted phishing campaigns that exploited Duolingo branding.
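The exposure described above comes down to a public lookup returning a private field. One possible mitigation is to filter the response by who is asking. The following is a minimal Python sketch of that idea, not Duolingo's actual code: the `Profile` fields, the `PUBLIC_FIELDS` set, and the `serialize_profile` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical field names; the real Duolingo schema is not given in this write-up.
PUBLIC_FIELDS = {"username", "courses", "streak", "country"}


@dataclass
class Profile:
    user_id: int
    username: str
    email: str
    courses: list
    streak: int
    country: str


def serialize_profile(profile: Profile, requester_id: Optional[int]) -> dict:
    """Return only the fields the requester is entitled to see."""
    data = asdict(profile)
    if requester_id is not None and requester_id == profile.user_id:
        # The account owner may see their own email address.
        allowed = PUBLIC_FIELDS | {"email"}
    else:
        # Anonymous callers and other users get the public subset only.
        allowed = PUBLIC_FIELDS
    return {key: value for key, value in data.items() if key in allowed}
```

With a filter like this, a username lookup by an anonymous caller would still return courses and streak data, but the email address would only appear when an authenticated user requests their own profile.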

Response

Duolingo acknowledged the scraping after independent researchers reported it. The company stated that the data was limited to publicly visible profile information plus email addresses. The exposed endpoint was reviewed, but both the timeline of Duolingo's response and the extent of the changes were criticised as inadequate.

Outcome

API scraping does not require exploiting a vulnerability: it simply automates what any user could do manually. The case demonstrated that "public" profile data combined with email addresses makes a highly effective phishing list, particularly when it is tied to a recognisable brand like Duolingo that sends its users regular emails.
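Because scraping of this kind uses the endpoint one legitimate-looking request at a time, the signal that separates it from normal use is breadth: a single client looking up far more distinct usernames than a person plausibly would. The Python sketch below illustrates one way to flag that pattern. The `EnumerationDetector` class, its thresholds, and the notion of a client key (an IP address or API token, for example) are assumptions made for the example, not a description of any system Duolingo runs.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class EnumerationDetector:
    """Flag clients that look up too many distinct usernames in a time window."""

    def __init__(self, max_distinct: int, window_seconds: float):
        self.max_distinct = max_distinct
        self.window_seconds = window_seconds
        # client key -> queue of (timestamp, username) lookups inside the window
        self._lookups: Dict[str, Deque[Tuple[float, str]]] = defaultdict(deque)

    def record(self, client: str, username: str, now: float) -> bool:
        """Record a lookup; return True if the client should be flagged."""
        events = self._lookups[client]
        events.append((now, username))
        # Drop lookups that have aged out of the window.
        while events and now - events[0][0] > self.window_seconds:
            events.popleft()
        # Recounting on every call is simple but O(n); fine for a sketch.
        distinct = {name for _, name in events}
        return len(distinct) > self.max_distinct


detector = EnumerationDetector(max_distinct=3, window_seconds=60)
for i in range(5):
    flagged = detector.record("client-1", f"user{i}", now=float(i))
print(flagged)  # True: five distinct usernames within 60 seconds exceeds the limit of 3
```

Repeated lookups of the same username do not raise the count, so a user refreshing a friend's profile is treated differently from a client walking through a username list.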

Key Takeaways

  1. APIs that return email addresses in response to public data queries must have rate limiting and authentication controls
  2. The combination of a platform brand plus real email address creates high-quality phishing material
  3. API endpoints must be tested for susceptibility to bulk enumeration during security reviews
  4. "Publicly available" data can still be misused at scale, so consider what an aggregation of public data reveals
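
The rate limiting mentioned in the first takeaway can be sketched as a per-client token bucket. This is a generic illustration in Python under stated assumptions: the `TokenBucketLimiter` name, the client key, and the rate and capacity values are hypothetical, and a production service would normally keep this state in shared storage rather than in process memory.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-client token bucket: `capacity` burst, refilled at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        # client key -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client: str) -> bool:
        """Consume one token for `client`; return False if none are left."""
        now = self.clock()
        tokens, last = self._buckets.get(client, (float(self.capacity), now))
        # Refill according to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[client] = (tokens - 1, now)
            return True
        self._buckets[client] = (tokens, now)
        return False
```

A request handler would call `allow()` with the caller's key before performing the lookup and reject the request (HTTP 429 is the conventional status) when it returns False. Rate limiting alone slows enumeration rather than preventing it, which is why the takeaway pairs it with authentication.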
API scraping · enumeration · email harvesting · phishing material · rate limiting