We apologize for any inconvenience caused by this outage. Our engineering team’s primary responsibility is to provide a reliable product for our customers, and we will use this outage as an opportunity to learn and improve.
Between approximately 9:41am and 9:55am Pacific, site visitors encountered a continuous loading spinner when trying to open pages.
The event was triggered by a Redis service node running out of memory, which led to issues in our application servers. Our servers were unable to process requests or serve content to anyone connected to them.
The event was detected by our monitoring systems (Datadog) as well as our head of product. The team started working on the incident by 9:48am Pacific.
This incident affected over half of all site visitors during this time period.
A deployment to Production started at 9:28am Pacific. At 9:36am, the deployment began rolling out the changes to our backend servers, and at 9:41am that process completed.
At 9:41:38, someone visited a page made by a creator with a malformed profile picture. The data for the image was not stored correctly in our database.
The corrupted image data was loaded into our caching system, and it resulted in out-of-memory errors as well as network connectivity problems with our Redis servers.
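A minimal sketch of one guard against this failure mode: refusing to cache values above a size limit, so a single bad record cannot consume the cache's memory. The names `cache_if_reasonable`, `InMemoryCache`, and the 512 KiB limit are hypothetical and chosen for illustration; this is not a description of our actual caching code.

```python
# Hypothetical limit for illustration; a real value depends on the workload.
MAX_CACHED_VALUE_BYTES = 512 * 1024


class InMemoryCache:
    """Stand-in for a real cache client, used only to keep the example runnable."""

    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value
        return True


def cache_if_reasonable(cache, key, value, max_bytes=MAX_CACHED_VALUE_BYTES):
    """Write value to the cache only if it is within the size limit.

    Returns True if the value was cached, False if it was skipped.
    """
    if len(value) > max_bytes:
        # Skip caching rather than failing the request; the caller can
        # still serve the value from the primary data store.
        return False
    cache.set(key, value)
    return True


cache = InMemoryCache()
cache_if_reasonable(cache, "profile:1:avatar", b"\x00" * 1024)         # cached
cache_if_reasonable(cache, "profile:2:avatar", b"\x00" * (1024 ** 2))  # skipped
```

Skipping oversized values instead of raising keeps the page request working; logging the skipped key would also make the bad record easy to find.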
For approximately 14 minutes, between 9:41am and 9:55am Pacific, over half of our site visitors were impacted by this incident.
The incident was first detected by Datadog, and PagerDuty sent out a page at 9:43am Pacific for a high error rate in Redis.
For a short time, the Redis error rate appeared to be falling.
There was another page at 9:47am from a separate Redis monitor for high latency.
The Redis error rate remained volatile, the site remained down, and the team declared an incident at 9:48am.
We could have sped up the response by declaring the incident sooner.
The system recovered by itself while the team was investigating the issue.
Redis has a mechanism to purge stale data to free up memory, and we suspect this mechanism was triggered.
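One mechanism of this kind is key eviction under a memory limit, which Redis exposes through its `maxmemory` and `maxmemory-policy` settings. The toy model below illustrates least-recently-used eviction only. It is not Redis's implementation (Redis uses an approximation of LRU), and whether this was the mechanism that triggered during the incident is an assumption. The class name `ToyLRUCache` is hypothetical.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Toy byte-budgeted cache that evicts the least recently used keys."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        value = self._items.get(key)
        if value is not None:
            # Reading a key marks it as recently used.
            self._items.move_to_end(key)
        return value

    def set(self, key, value):
        # Replace any existing entry so its size is not counted twice.
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)
        # Evict from the least recently used end until under budget.
        # A single value larger than the budget ends up evicting itself.
        while self.used_bytes > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)

    def keys(self):
        return list(self._items)


lru = ToyLRUCache(max_bytes=10)
lru.set("a", b"1234")
lru.set("b", b"1234")
lru.get("a")            # "a" is now more recently used than "b"
lru.set("c", b"1234")   # over budget, so "b" is evicted
```

In this model, memory is freed as a side effect of a later write, which is consistent with a system that appears to recover without operator action.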
All times are in Pacific Daylight Time (UTC-7).
There were two issues present, and one exacerbated the other. We had a bug in our image uploading component that let images be stored incorrectly in our database, which triggered a failure scenario with our caching system.
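As a generic illustration of catching malformed image data before it is stored, the sketch below checks an upload's size and leading signature bytes. It does not describe the actual bug or its fix; the function names, the 5 MiB limit, and the set of accepted formats are all hypothetical. A signature check also only confirms how the data starts, not that the whole image decodes correctly.

```python
from typing import Optional

# Hypothetical limit for illustration only.
MAX_IMAGE_BYTES = 5 * 1024 * 1024

# Well-known leading byte signatures for a few common image formats.
IMAGE_SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def detect_image_format(data: bytes) -> Optional[str]:
    """Return the format name if data starts with a known signature."""
    for name, prefixes in IMAGE_SIGNATURES.items():
        if data.startswith(prefixes):
            return name
    return None


def validate_upload(data: bytes, max_bytes: int = MAX_IMAGE_BYTES) -> str:
    """Return the detected format, or raise ValueError before anything is stored."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError("upload exceeds size limit")
    fmt = detect_image_format(data)
    if fmt is None:
        raise ValueError("unrecognized image format")
    return fmt
```

Rejecting bad data at upload time keeps it out of the database entirely, so downstream systems such as the cache never see it.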