NASA sends robots and people into the most hostile and dangerous environment we know: space. Are there some lessons from how NASA has succeeded -- and failed -- that could apply to oil companies that drill in deep water?
Artist's conception of the Mars Science Laboratory on the surface of Mars, at least 36,000,000 miles from Earth.
The lead article in yesterday's New York Times (June 21, 2010) describes how a blow out preventer is supposed to work and gives some preliminary information about the failure on the Deepwater Horizon rig. Readers of this blog got the same story on May 28, 2010, over three weeks before the Times.
Spacecraft design follows one of the most famous concepts in engineering, known as Murphy's Law: "Anything that can go wrong will go wrong." The corollary to Murphy's Law is usually expressed in everyday terms: "Toast always falls buttered side down." In other words, when something fails, it will fail in the most damaging and expensive way possible.
NASA's project management discipline applies these principles to every aspect of the extremely complex design of a spacecraft. How can it fail? How likely is it to fail? What are the consequences of failure? The last question is the most important, because some failures have little or no impact, while others are "mission critical": the entire effort, costing hundreds of millions or even billions of dollars -- and the lives of astronauts in a manned mission -- is at risk.
Spacecraft bristle with redundancy for components and systems critical to mission success. There may be two, or even three or more, independent radios, guidance computers, power sources, life support systems, and the like. If one radio fails, ground controllers can still reach the spacecraft on another. If each radio has one chance in a thousand of failing, and the failures are truly independent, the likelihood that two will fail at the same time is one in a million; the likelihood that three will fail at the same time is one in a billion.
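For readers who like to check the arithmetic, here is a minimal sketch in Python. It assumes the failures are completely independent, which is the whole point of the design discipline; the function name `all_fail_probability` is just an illustrative label, not anything NASA uses.

```python
from fractions import Fraction

def all_fail_probability(p_single, copies):
    """Chance that every one of `copies` independent units fails at the same time.

    Assumes the units share nothing, so their failure probabilities multiply.
    """
    return p_single ** copies

# The one-in-a-thousand radio from the text.
p_radio = Fraction(1, 1000)

for copies in (1, 2, 3):
    print(copies, "radio(s):", all_fail_probability(p_radio, copies))
```

Each added radio multiplies the odds of total loss by another factor of a thousand, which is why a second or third copy can be worth its weight.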
The term "independent" has a special meaning here. Independent systems share no components or paths. Any shared element could be a "single point of failure" that could cause both the primary and back-up system to fail.
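A rough sketch of why a shared element matters so much, again in Python. The one-in-ten-thousand figure for the shared part is a made-up number chosen only for illustration, as is the function name `loss_probability`; the one-in-a-thousand radio is the example used above.

```python
from fractions import Fraction

def loss_probability(p_unit, copies, p_shared):
    """Chance of losing the whole function when redundant units share one element.

    The function is lost if the shared element fails, or if the shared
    element survives and every redundant unit fails anyway.
    """
    return p_shared + (1 - p_shared) * p_unit ** copies

p_radio = Fraction(1, 1000)    # per-radio failure chance, from the text
p_shared = Fraction(1, 10000)  # hypothetical shared part, e.g. a common power feed

independent = loss_probability(p_radio, 2, Fraction(0))
with_shared = loss_probability(p_radio, 2, p_shared)

print("two truly independent radios:", float(independent))
print("two radios with a shared part:", float(with_shared))
print("how much worse:", float(with_shared / independent))
```

Under these assumed numbers, the pair with a shared part is roughly a hundred times more likely to be lost than a truly independent pair, and almost all of that risk comes from the single shared element rather than from the radios themselves.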
In spacecraft design, it is impossible to eliminate every single point of failure, even in critical systems. The reason is that the most severe constraint on a spacecraft is weight: it takes a ton of fuel to lift a pound of payload into space. There is simply not enough weight budget to make everything redundant. However, every remaining single point of failure comes in for extra scrutiny. The components get specified with even more than normal safety margins; the designs get subjected to rigorous failure mode analysis; the prototypes are tested far beyond expected operating conditions, often pushed until they break, just to see how much punishment they can take.
Finally, there is the possibility that any mission will need to be aborted before the launch vehicle or the spacecraft becomes a hazard to the crew or to people on the ground. Can you imagine a space shuttle launch gone wrong landing on Disney World? Once the spacecraft is on the launch pad, none of this can be improvised: rockets move too fast. The abort capability must be designed into every stage and every payload. And it must be engineered as carefully as any mission-critical system.
A typical deep-sea well blow out preventer, on the surface, headed for a hostile environment 1 mile from the surface of the Earth.
Compared to a NASA spacecraft, a blow out preventer is far, far simpler. But the same engineering principles can be applied. Here are a few areas where BP fell short:
- Redundancy of valves: the Deepwater Horizon rig apparently had only a single ram-style shut-off valve. However reliable the valve might be, it can still fail. There should have been at least one redundant valve, and possibly more.
- Inadequate design: It is not clear that the valve as specified was powerful enough for the size and hardness of drill pipe used, or for the pressure encountered in the well. Valves come in a range of sizes and pressure specifications. As you would expect, valves with higher specifications cost more.
- Redundancy of pressurization: It is not clear if the blow out preventer had any redundancy for storing the immense hydraulic pressure required to operate the valves. The evidence shows that the single valve did operate, but did not close all the way.
- Inadequate testing: The blow out preventer was not fully tested for hydraulic leaks before it went to the bottom of the Gulf. It leaked, and thus efforts to augment the hydraulic pressure from the surface not only failed, but may have made the failure worse.
- Inadequate supervision: The Minerals Management Service apparently signed off on the rig knowing that the design of the blow out preventer had not been fully vetted. This would be like NASA green-lighting a launch without checking to see that the risk analysis had been done.
- No "Top Kill" abort mechanism: The companies that make blow out preventers also make the "top kill" mechanisms and offer them as options in their blow out preventer "stacks". Apparently, BP and Transocean skipped this expense, and MMS let them. Since the lives of the drilling crew are at stake, it is like NASA sending up a manned spacecraft with no escape mechanism.
Despite an extensive set of risk engineering disciplines, NASA has had its share of failures. They don't call what NASA does "rocket science" for nothing. It is very hard. The systems are incredibly complex. The environment is harsh almost beyond human comprehension. On robotic missions, there is no repair shop, only what leaves Earth on the spacecraft.
Consequently, NASA developed, from long and sometimes bitter experience, a culture of learning from failure. Each failure is analyzed over and over: by the original design team, by alternate teams from the same laboratory, by independent teams from other NASA laboratories, and by outside experts. They want to understand how the failure happened, how it could have been prevented, and what to do differently on future missions.
These teams look not only at the engineering decisions, but also at the human environment in which those decisions were made. Richard Feynman's contribution to the Challenger investigation was not his demonstration that cold changes the performance of rubber O-rings (NASA already knew that), but his account of how the accumulation of schedule and budget pressures led engineers and managers to ignore or short-cut their own procedures.
There is, for BP, for the oil industry, and for the government, an obvious parallel here. NASA put itself through a wrenching shakedown before resuming space shuttle operations after the Challenger disaster and investigation. Who wants to bet that BP has the right stuff?
The author serves on the committee that oversees the Jet Propulsion Laboratory for the Board of Trustees of the California Institute of Technology. Caltech operates JPL for NASA under contract. The views expressed in this post are entirely those of the author, and do not necessarily reflect the views of JPL, Caltech, or NASA.