Thursday, July 9, 2020

Outage report 7 July 2020

tl;dr

We had an unplanned outage the day before yesterday; it was our first big one since July 2017. It was caused by an extremely unlikely storage system failure, but even so, it should not have led to such lengthy downtime or affected so many people. We have plans for our next steps, and will be implementing at least some of them over the coming months.

The details

At 16:06 UTC on 7 July 2020, a storage volume failure on one of our storage servers caused a number of outages, starting with our own site and also with our users’ programs (including websites) that were dependent on that volume, and later spreading to other hosted sites. Because all of the data that we store on behalf of our users is backed up and mirrored, no data was lost or at risk, but the outage was significantly longer than we would like.

The effects were:

  • Our own site was unavailable or generating an excessive number of errors from 16:06 to 18:53 UTC.
  • For accounts that were stored on the affected file storage:
    • Websites were unavailable in general from 16:06 to 19:46 UTC, with some taking until 21:24 UTC to be fully available.
    • Scheduled tasks did not run between 16:06 and 18:53-19:46 UTC (the precise end time depending on the account).
    • Always-on tasks did not run between 16:06 and 18:53-19:46 UTC (the precise end time depending on the account).
  • For accounts that were not using the affected file storage:
    • Scheduled tasks and always-on tasks were unaffected apart from a short window of about five minutes between 18:53 and 19:46 UTC.
    • Websites were also unaffected apart from that window if they were already up and running when the problems started at 16:06 UTC. However, they could not be started up or reloaded between 16:06 and 18:53 UTC, and would have had a brief outage sometime between 18:53 and 19:46 UTC.
    • However, of course, the problems on our own site would also have affected the owners of these accounts if they needed to log in to run or change things.
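
To put those windows in perspective, here is a minimal sketch that works out how long each one lasted. The times are taken from the list above; the labels and helper names are illustrative only and are not terms we use internally.

```python
from datetime import datetime, timezone


def at(hhmm: str) -> datetime:
    """Return the given HH:MM as a UTC datetime on 7 July 2020."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2020, 7, 7, hour, minute, tzinfo=timezone.utc)


# Start of the incident, as given in the report.
START = at("16:06")

# End of each window. The labels are illustrative, not official names.
WINDOW_ENDS = {
    "own site": at("18:53"),
    "affected websites (in general)": at("19:46"),
    "affected websites (slowest)": at("21:24"),
}


def outage_minutes(end: datetime) -> int:
    """Whole minutes between the start of the incident and `end`."""
    return int((end - START).total_seconds() // 60)


for label, end in WINDOW_ENDS.items():
    hours, minutes = divmod(outage_minutes(end), 60)
    print(f"{label}: {hours}h {minutes:02d}m")
```

On these figures, our own site was degraded for 2h 47m, affected websites were generally down for 3h 40m, and the slowest took 5h 18m to recover fully.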

This post gives a more detailed timeline of what happened, and describes what steps we are taking to prevent a recurrence.



from PythonAnywhere News
