Tuesday, March 2, 2010

Notes on the Therac-25 case

The Therac-25 is now a classic example of lethal failure in a complex system where ill-designed software controlled a physical process. The technical causes of the failure are well documented today, as engineers learned (somewhat slowly) from their mistakes. These aspects deserve to be known by anyone interested in the safety of critical systems, but another equally decisive factor in the failure was the human response, or rather the lack of it.

The Therac-25 was a medical linear accelerator used for radiotherapy during the 1980s, until the device was recalled for major changes once its lack of safety became all too blatant. Basically, the device could operate in two modes. In the first, the accelerated electron beam was shaped and aimed directly at the patient for shallow, skin-level treatment. In the second, the electrons were collided with a tungsten target, producing X-rays that were shaped and directed at the patient for in-depth treatment. Both modes topped out at 25 MeV (hence the machine's name), but X-ray mode drove the accelerator at a far higher beam current, since most of the beam is absorbed by the target and the beam flattener.

Guess what happened? From 1985 to 1987, not one or two but at least six patients received massive overdoses, in several cases taking the raw high-current beam in place of the tungsten target, ultimately leading to at least three radiation-induced deaths and permanent injuries for the survivors. Their disturbing stories are briefly told in Nancy Leveson's report of the case for IEEE Computer (1995 update).

Of course, there are technical explanations for this failure. The software (written in assembly language, i.e. the lowest level above raw machine code) was widely reported to be a shoddy reuse of an earlier version, and it contained several critical race conditions. In a nutshell, if instructions were given to the machine too rapidly (e.g. correcting the treatment mode from X-ray to electron and confirming within a few seconds), parts of the machine would act on stale data: the turntable would move the tungsten target out of the beam for electron mode while the accelerator was still set to the full current of X-ray mode.
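To make the failure mode concrete, here is a minimal check-then-act race sketched in C. It is purely illustrative: the real Therac-25 code was PDP-11 assembly, and every name here (beam_mode, setup_task, target_in_place, ...) is mine, not the device's.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical sketch of a check-then-act race, in the spirit of
     * the Therac-25 bug; NOT the actual device code. 'volatile' is not
     * synchronization: the total absence of locking IS the bug. */

    typedef enum { MODE_ELECTRON, MODE_XRAY } beam_mode;

    static volatile beam_mode requested_mode = MODE_XRAY;
    static volatile int high_current     = 0;  /* beam current setting    */
    static volatile int target_in_place  = 0;  /* tungsten target in beam */

    /* Slow setup task: configures the current, then the turntable,
     * reading the requested mode twice. An operator edit in between
     * leaves the machine half in one mode, half in the other. */
    static void *setup_task(void *arg) {
        (void)arg;
        high_current = (requested_mode == MODE_XRAY);      /* read #1 */
        sleep(1);                      /* magnets/turntable are slow  */
        target_in_place = (requested_mode == MODE_XRAY);   /* read #2 */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, setup_task, NULL);

        /* Operator corrects the prescription from X-ray to electron
         * while the setup task is still in flight. */
        requested_mode = MODE_ELECTRON;

        pthread_join(t, NULL);
        if (high_current && !target_in_place)
            printf("HAZARD: X-ray level current, no tungsten target!\n");
        else
            printf("consistent configuration, no overdose this run\n");
        return 0;
    }

Run it a few times (cc -pthread): most runs are harmless, the unlucky interleaving is not, which is exactly why the bug was so hard to reproduce and so easy to blame on anything but the software.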

Interestingly, the previous model, the Therac-20, whose code the Therac-25 reused, suffered from similar race conditions. The big difference is that the Therac-20 had built-in hardware interlocks that prevented the accelerator from firing at full power when the tungsten target was misaligned. The software malfunctions still occurred, but they merely cost time: a fuse blew, the device was restarted, and that was it. The Therac-25 had no such mechanism, so the software alone was expected to ensure total safety. After all, software is pure logic, and pure logic never produces wrong results, right?

Moreover, the whole interlock design between software and hardware was deeply dysfunctional: there was no independent hardware check that could detect a wrong alignment in X-ray mode and veto the beam, so the software was left to trust its own picture of the machine's state.
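What such an interlock buys you is easy to sketch: before firing, re-check the physical world through an independent channel instead of trusting the software's bookkeeping. Again a hypothetical illustration in C, where read_target_sensor() stands in for something like a wired microswitch, not any real Therac interface:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { MODE_ELECTRON, MODE_XRAY } beam_mode;

    /* Stand-in for an independent position sensor, e.g. a microswitch
     * wired straight to an input port; simulated here as "target out". */
    static int read_target_sensor(void) {
        return 0;
    }

    static void fire_beam(beam_mode mode) {
        int target_in_place = read_target_sensor();

        /* X-ray mode requires the target in the beam path, electron
         * mode requires it out of the way; a mismatch inhibits the beam. */
        if (target_in_place != (mode == MODE_XRAY)) {
            fprintf(stderr, "interlock: mode/target mismatch, beam inhibited\n");
            exit(EXIT_FAILURE);  /* fail safe: no beam beats a wrong beam */
        }
        printf("configuration verified, beam delivered\n");
    }

    int main(void) {
        fire_beam(MODE_XRAY);  /* sensor says target is out -> inhibited */
        return 0;
    }

The Therac-20 enforced this veto in hardware, so even buggy software could at worst waste a session; the Therac-25 moved that last line of defence into the very software it was supposed to guard against.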

It is nonetheless puzzling that six accidents were needed before the Therac-25 was recognized as a fundamentally unsafe machine. In the first case, that recognition never came at all, neither from the hospital nor from the manufacturer. The very slow learning process appears to be the result of overconfidence in software, overconfidence in product reliability and, last but not least, overconfidence in the manufacturer's experts and practices.

To finish, here is a memorable quote from N. Leveson's report:
"Virtually all complex software can be made to behave in an unexpected fashion under some conditions: there will always be another software bug. [...] We cannot eliminate all software errors, but we can often protect against their worst effects, and we can recognize their likelihood in our decision making."
And don't take for granted that reused code is safer: it all depends on how you use it.

Following are two free links about security for today's software developer:
