Five whys - Joel on Software - The Diigo Meta page

www.joelonsoftware.com/...22.html

David Corking's personal annotations on this page

dcorking
Dcorking bookmarked on 2008-01-23 IT Implementation Skills Commercial management

This sounds like a good approach for controlling a highly reliable service.

  • Instead of setting up a SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we're doing to prevent that problem in the future.

This link has been bookmarked by 15 people. It was first bookmarked on 24 Jan 2008, by Joel Liu.

  • 02 Nov 09
  • 26 Feb 09
  • 30 Jan 08
  • 24 Jan 08
    • After some internal discussion we all agreed that rather than imposing a
      statistically meaningless measurement and hoping that the mere measurement of
      something meaningless would cause it to get better, what we really needed was a
      process of continuous improvement. Instead of setting up a SLA for our
      customers, we set up a blog
      where we would document every outage in real time, provide complete
      post-mortems, ask the five whys, get to the root cause, and tell our customers
      what we're doing to prevent that problem in the future. In this case, the change
      is that our internal documentation will include detailed checklists
      for all operational procedures in the live environment.
      • Our link to Peer1 NY went down
      • Why? – Our switch appears to have put the port in a failed state
      • Why? – After some discussion with the Peer1 NOC, we speculate that it was
        quite possibly caused by an Ethernet speed / duplex mismatch
      • Why? – The switch interface was set to auto-negotiate instead of being
        manually configured
      • Why? – We were fully aware of problems like this, and have been for many
        years.  But - we do not have a written standard and verification process
        for production switch configurations.
      • Why? – Documentation is often thought of as an aid for when the sysadmin
        isn’t around or for other members of the operations team, whereas, it should
        really be thought of as a checklist.
  • 23 Jan 08
  • dcorking
    David Corking

  • 22 Jan 08
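
The five-whys chain quoted in the 24 Jan 08 entry can be recorded as a small data structure, which may help when a post-mortem has to be published in a consistent shape. The following is only a minimal sketch: the `PostMortem` class, its method names, and the convention that the last answer counts as the root cause are assumptions made for illustration, not anything described in the original article. The answers are abbreviated paraphrases of the quoted list.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    """Hypothetical record: a symptom plus an ordered chain of "why" answers."""
    symptom: str
    whys: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "PostMortem":
        # Each answer explains the previous entry (or the symptom, for the first).
        self.whys.append(answer)
        return self

    def root_cause(self) -> str:
        # Assumption: the last recorded answer is treated as the root cause.
        return self.whys[-1] if self.whys else self.symptom

    def render(self) -> str:
        # Plain-text layout loosely mirroring the quoted list.
        lines = [self.symptom]
        lines += [f"Why? - {answer}" for answer in self.whys]
        return "\n".join(lines)


# Abbreviated paraphrases of the quoted outage analysis.
outage = (
    PostMortem("Our link to Peer1 NY went down")
    .ask_why("The switch put the port in a failed state")
    .ask_why("Probably an Ethernet speed / duplex mismatch")
    .ask_why("The interface was set to auto-negotiate")
    .ask_why("No written standard for production switch configurations")
    .ask_why("Documentation was not treated as a checklist")
)

print(outage.render())
print("Root cause:", outage.root_cause())
```

Treating the final answer as the root cause is a simplification; in practice the chain may stop earlier or branch, and the sketch does not model that.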