[Resolved] Outage on Tellie
Incident Report for Tellie
Postmortem

We apologize for any inconvenience caused by this outage. Our engineering team’s primary responsibility is to provide a reliable product for our customers, and we will use this outage as a learning opportunity.

Incident summary

Between approximately 9:41am and 9:55am Pacific, site visitors encountered a continuous loading spinner when trying to open pages.

The event was triggered by a Redis service node running out of memory, which led to issues in our application servers. The affected servers were unable to process requests or serve content to connected visitors.

The event was detected by our monitoring systems (Datadog) as well as by our head of product. The team started working on the incident by 9:48am Pacific.

This incident affected over half of all site visitors during this time period.

Lead-up

A deployment to Production started at 9:28am Pacific. At 9:36am, the deployment began rolling out changes to our backend servers, and at 9:41am that process completed.

At 9:41:38, someone visited a page made by a creator whose profile picture was malformed: the image data had not been stored correctly in our database.

Fault

The corrupted image data was loaded into our caching system, where it triggered out-of-memory errors and network connectivity problems on our Redis servers.
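For context, published pages read creator assets through a Redis-backed cache. The TypeScript sketch below is illustrative only (the ioredis client, key names, and TTL are assumptions rather than our production code), but it shows the shape of the cache write involved: a malformed or unexpectedly large payload written this way holds Redis memory until the key expires or is evicted.

    import Redis from "ioredis";

    // Illustrative only: the connection URL and key naming are assumptions.
    const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

    // A creator's profile image payload is cached so page loads avoid the
    // database. If the stored payload is malformed or far larger than
    // expected, that memory is held in Redis until the key expires or is
    // evicted.
    async function cacheProfileImage(creatorId: string, imageData: Buffer) {
      await redis.set(`profile-image:${creatorId}`, imageData, "EX", 3600);
    }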

Impact

For approximately 14 minutes, between 9:41am and 9:55am Pacific, over half of our site visitors were impacted by this incident.

Detection

The incident was first detected by Datadog, and PagerDuty sent out a page at 9:43am Pacific for a high error rate in Redis.

For a short time, the Redis error rate appeared to be falling.

There was another page at 9:47am from a separate Redis monitor for high latency.

The Redis error rate remained volatile, the site remained down, and the team declared an incident at 9:48am.

We could have sped up the response by declaring the incident sooner.

Recovery

The system recovered by itself while the team was investigating the issue.

Redis has a mechanism to purge stale data to free up memory, and we suspect this mechanism was triggered.
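Redis’s eviction behaviour is governed by its maxmemory settings. The sketch below shows how those settings are typically inspected and, if desired, adjusted; the policy value is an example rather than our production configuration, and ioredis is used purely for illustration.

    import Redis from "ioredis";

    const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

    async function reviewEvictionSettings() {
      // CONFIG GET returns name/value pairs, e.g. ["maxmemory-policy", "noeviction"].
      console.log(await redis.config("GET", "maxmemory"));
      console.log(await redis.config("GET", "maxmemory-policy"));

      // Example only: with allkeys-lru, Redis evicts least-recently-used keys
      // once maxmemory is reached instead of failing writes with OOM errors.
      await redis.config("SET", "maxmemory-policy", "allkeys-lru");
    }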

Timeline

All times are in Pacific Daylight Time (UTC-7).

  • 09:28 - Production deployment begins
  • 09:41 - Deployment to backend completes
  • 09:41:38 - A creator page with a malformed profile image is loaded
  • 09:42 - Error rate starts to spike in Redis
  • 09:43 - First alarm triggered due to rising Redis error rates
  • 09:45 - A team member notices a possible issue with the deployment
  • 09:46 - We notice the Redis error rate is falling
  • 09:47 - Another alarm is triggered for Redis high latency
  • 09:47 - Redis error rate continues to be volatile
  • 09:48 - Incident is declared and full incident response begins
  • 09:51 - Incident is created on https://status.tell.ie
  • 09:55 - Redis error state self-resolves

Root cause

There were two issues present, and one exacerbated the other: a bug in our image uploading component allowed image data to be stored incorrectly in our database, and that corrupted data triggered a failure scenario in our caching system.
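To illustrate the first issue, the fix is to reject image payloads that cannot be decoded before they are persisted. The sketch below is not our actual uploading component; the sharp library and the function name are assumptions used to show the shape of the validation.

    import sharp from "sharp";

    // Illustrative validation step, not our production component: decode the
    // upload before it is written to the database so malformed image data is
    // rejected at the edge instead of propagating into the cache.
    async function validateProfileImage(upload: Buffer): Promise<Buffer> {
      const metadata = await sharp(upload).metadata(); // rejects if the bytes are not a decodable image
      if (!metadata.format || !metadata.width || !metadata.height) {
        throw new Error("Uploaded file is not a valid image");
      }
      return upload;
    }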

Lessons learned

  • To improve incident response time, we should promptly declare an incident when more than one person reports the site as unavailable
  • Technical debt can combine with other bugs to cause severe operational issues
  • Our system is highly stateful, which means it is sensitive to corrupted state. Moving to a less stateful (or ideally stateless!) system would lead to big wins.
  • We don't have enough visibility into issues internal to Redis
  • We need to improve the signal-to-noise ratio on the Redis alarm

Corrective actions

  1. [Done] Fix the image uploading bug
  2. Address application/caching technical debt
  3. Update Datadog monitors for Redis to improve alarm accuracy
  4. Add logging to Redis for more visibility into operational issues
  5. Add autoscaling configuration for Redis
Posted Nov 19, 2021 - 19:27 UTC

Resolved
This incident has been resolved. We are continuing to investigate the root cause and will update this page as soon as we know more.
Posted Oct 27, 2021 - 16:58 UTC
Investigating
Some customers are seeing an endless spinner when trying to access sets. We’re aware of this issue and are working on it urgently.
Posted Oct 27, 2021 - 16:51 UTC
This incident affected: Published Pages and Page Designer.