Comment:
This essay by Dale Way raises the problem of reliability in large-scale projects. He is a member of the IEEE Technical Advisory Board and serves on its Year 2000 Technical Information Ad Hoc Committee (the Y2K Committee).
Way argues that the extreme complexity of Y2K programming, the extreme complexity of the systems those programs are part of, and the lack of any historical precedent for program remediation on this scale all point to enormous uncertainty. That uncertainty means there will be mistakes. But managers de-emphasize mistakes. They overestimate the likelihood of success. That is what NASA's management did before the shuttle disaster in 1986.
Managers can be fooled, mainly by themselves. Nature cannot be fooled.
Get ready for a ride on the Challenger.
* * * * * * * *
Nature Cannot Be Fooled
What The Challenger Space Shuttle Disaster Can Teach Us About Year 2000 Remediation Efforts
What can be said in a quantitative, scientific way about the reliability of any Year 2000 remediation efforts? Not much, it appears. We would first have to say something about the reliability of our existing systems. Most of us who have had anything to do with long-running software know it is not so much that it works as that it hasn’t broken yet; that is, it has not yet seen the right combination or sequence of inputs to knock it into some unanticipated error space. But to quantify its reliability in some way seems very difficult, if not impossible, to do. The scale is vast, our detailed knowledge is thin, and the application of reliability theory to our existing information systems is virtually nonexistent. On top of that base reliability figure would then have to be added the reliability of all of the corrective actions to be taken in the Year 2000 remediation efforts – difficult to the tenth power? Yet we are betting all of our money and effort on the reliability of the net effect of our Year 2000 remediation efforts. We are betting on those efforts to actually protect our existing information system infrastructure from the century date change while still operating in the manner we expect. (And, I might add, we are betting on the correct operation of a threshold number of the right systems, in a threshold number of the right organizations – thresholds and rightness we cannot predict – for the continued smooth functioning of our economic and political institutions.)
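To see why the arithmetic of composite reliability is so unforgiving, consider a minimal sketch; the numbers and the independence assumption are invented for illustration, not drawn from any actual remediation effort. If the continued functioning of an infrastructure depends on many separately remediated systems, the probability that all of them survive the date change shrinks geometrically with their number.

    # A minimal illustration with invented numbers: if each of n systems
    # independently survives the date change with probability p, the chance
    # that ALL of them survive is p**n.  Independence is itself an optimistic
    # assumption, given how interconnected real systems are.

    def overall_survival(p_per_system: float, n_systems: int) -> float:
        """Probability that every one of n independent systems keeps working."""
        return p_per_system ** n_systems

    for p in (0.99, 0.999):
        for n in (10, 100, 1000):
            print(f"p = {p}, n = {n:4d}:  P(all survive) = {overall_survival(p, n):.3f}")

Even a per-system reliability of 99.9 percent, applied across a thousand interdependent systems, leaves the overall survival probability far from certain; that is the scale of the bet described above.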
What we are really interested in is the inverse of reliability: the probability of failure. It is upon that probability that we must base both our technical remediation decisions and our economic or political decisions about the contingency planning we may need to moderate the negative effects of such failure. There is, however, an historical precedent for reliability/failure analysis of a large, complex system that did not survive a significant challenge to its integrity, from which we may be able to draw some useful lessons for the Year 2000 crisis: the January 28, 1986 Challenger Space Shuttle disaster. In the days after the tragedy, a commission headed by former Secretary of State William Rogers was appointed. Asked to sit on the Rogers Commission was Richard Feynman, Nobel Prize-winning physicist and universally acknowledged "most brilliant mind on the planet." He reluctantly agreed (he had terminal cancer at the time and died a short time later – a great loss to us all).
Feynman not only discovered the now-famous O-ring failure point and the chain of interconnected events that led to the explosion, he also uncovered serious flaws in the way NASA conceived of and managed reliability assessment and control. In the end, his discovery resulted in an almost complete overhaul of the way NASA operated and put the agency on the road toward eventual partial privatization. The first lesson we can learn is that even NASA, a heretofore paragon of technology and technical management, can misjudge fatally. Not a comforting thought.
First, Feynman found a wild disparity between NASA management’s assessment of the reliability of the Shuttle and that of the working engineers. All of the following quotes are from his appendix to the Rogers Commission Report. He wrote,
"It appears that there are enormous differences of opinion as to the probability of a failure with loss of vehicle and of human life. The estimates range from roughly 1 in 100 to 1 in 100,000. The higher figures [0.01] come from the working engineers, and the very low figures [0.00001] from management. What are the causes and consequences of this lack of agreement?" . . . .
"What is the cause of management's fantastic faith in the machinery?" . . .
"An estimate of the reliability of solid rockets was made by the range safety officer, by studying the experience of all previous rocket flights. Out of a total of nearly 2,900 flights, 121 failed (1 in 25). This includes, however, what may be called, early errors, rockets flown for the first few times in which design errors are discovered and fixed. A more reasonable figure for the mature rockets might be 1 in 50. With special care in the selection of parts and in inspection, a figure of below 1 in 100 might be achieved but 1 in 1,000 is probably not attainable with today's technology. (Since there are two rockets on the Shuttle, these rocket failure rates must be doubled to get Shuttle failure rates from Solid Rocket Booster failure.)"
But NASA management did not like those figures. They took a different tack. . . .
What was the evidence that this optimistic view had any basis in reality? Feynman went on to explore this question and his findings were not flattering to NASA. . . .
Why NASA management was, in the end, so determined to "fool themselves" emerges in Feynman’s understated conclusion to this part of his analysis: mutually-reinforced wishful thinking in the face of external pressure for an "acceptable answer." . . .
"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."
To restate: what lessons can be drawn from this tragedy, and from Feynman’s astute analysis of it, for the effective remediation of the Year 2000 computer software crisis? Let us start with the most obvious and move to the more subtle.
The most obvious lesson is that wishful thinking must be recognized for what it is. Management of our institutions, large and small, should be careful to differentiate between what is and what they would like it to be. Simply stating that effective Year 2000 remediations WILL be complete in time does not make it so, especially when the reality of the situation is not clearly understood. Taking this tack is most prevalent in the US federal government and military; the military because they have been ordered to complete their "mission" on time and they WILL do it, the government for more or less the same reasons. When I have talked with some well-placed government employees (I cannot mention names) who say it will be done, and asked exactly how it will be accomplished, their response to me has been, "That is the responsibility of each individual organization to do themselves." I then ask, "Yes, but HOW are they going to do it?" "That is their responsibility to figure out." "But what if those people don’t know how to figure it out?" I ask, given that nothing of this scale, scope, complexity and rigid deadline has ever been done before. Same answer. "But if everybody is responsible, isn’t nobody responsible?" I ask. No answer. I was left with the distinct impression these people had no idea how it was going to be done but that it was their job to say it would be.
Obviously, business and technical management would dearly like Year 2000 remediation to be simple or easy. They are under a great deal of outside pressure from Wall Street or their superiors in government for the answer to be somewhere between "we don’t have a problem" and "we can fix it in time for relatively little money." When they feel the urge to say that, or to say "It WILL be done," they might want to stop and examine whether it is really wishful thinking at work, a desire or need for it to be that way.
In estimating the reliability of Year 2000 remediation efforts, it may be useful to look at history, as Feynman did. That history is not very encouraging either. Research generally reported at Year 2000 conferences says that 90% of all programs in a commercial organization have dates in them. How many organizations of any significant size have reengineered anywhere close to 90% of all of their software in one continuous project, not merely porting it to another platform, but actually changing logic and/or data formats in a wholesale manner? Probably none. How many have done it in three years? Assuredly none. . . .
If the history of large projects is not very promising, then let us examine the success rate in normal, everyday large-system maintenance projects. The renowned software management statistician Capers Jones’ firm, Software Productivity Research, says that in 1995 over 70% of all software development projects failed outright, or were late or under-featured when they did come out – an on-time, on-spec success rate of less than 30%. So normal, minuscule (in comparison to the whole) projects fail roughly 2 out of 3 times. Not much help there, either. History is telling us, at a very minimum, not to be overconfident, just as it told NASA. Hopefully we will listen this time. . . .
The reasons for this phenomenon originate as much in the laws of nature as anything that affected the Challenger. They have to do with the number of connected elements that must be coordinated. Here the laws of chaos and complexity theory tell us that when a system becomes large and has many interconnected elements, even small changes can have very large, unpredictable, non-linear effects on the system as a whole. One failed switch in upstate New York can black out New York City and much of the region. One bad line of software can bring down AT&T’s North American long distance network for a day. And we still do not know what little thing is causing wholesale shutdowns of the western power grid from Mexico to Canada; nothing big is observable. The magnitude and unpredictability of the effects stem from the fact that as the number of elements gets large, the connections between the elements come to dominate the results. . . .
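A minimal sketch of the arithmetic (my illustration, not anything from the essay) shows why the connections come to dominate: the number of potential pairwise interactions among n elements grows roughly as the square of n, far faster than the number of elements itself.

    # The number of potential pairwise connections among n elements is
    # n * (n - 1) / 2, which grows quadratically while the element count
    # grows only linearly -- so the connections soon dominate.

    def potential_connections(n_elements: int) -> int:
        return n_elements * (n_elements - 1) // 2

    for n in (10, 100, 1_000, 10_000):
        print(f"{n:6d} elements -> {potential_connections(n):12,d} potential connections")

Ten thousand programs admit nearly fifty million potential pairwise interactions; nobody can hold that in their head, or in their documentation.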
Testing, then, can take a long time, often the single longest phase of the maintenance project; not the testing of the changes themselves, but the testing to make sure that they did not adversely affect other elements in some unforeseen, "unintended consequences" way. Savings from incomplete impact analysis must be paid back in testing (or in failure in production). As the number and extent of the changes get bigger, the rework/retest efforts grow faster still, again as a result of interconnective complexity. . . .
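As a hypothetical illustration of why skimped impact analysis gets paid back in testing, consider a made-up map of which programs feed data to which others; the retest scope after even a single change is everything reachable downstream of it, which in a densely connected portfolio can be most of the portfolio.

    # A hypothetical sketch of impact analysis.  The program names and the
    # "feeds" map are invented; the point is that the retest scope is the
    # transitive closure downstream of the changed programs.
    from collections import deque

    feeds = {                      # producer -> programs that consume its data
        "GL010": ["AR200", "AP300"],
        "AR200": ["RPT900"],
        "AP300": ["RPT900", "PAY400"],
        "PAY400": ["RPT900"],
        "RPT900": [],
    }

    def retest_scope(changed):
        """Changed programs plus everything reachable downstream of them."""
        scope, queue = set(changed), deque(changed)
        while queue:
            for consumer in feeds.get(queue.popleft(), []):
                if consumer not in scope:
                    scope.add(consumer)
                    queue.append(consumer)
        return scope

    print(sorted(retest_scope({"GL010"})))   # one changed program implicates all five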
In the 1950s and 1960s, when the first computer system went into an organization (or whenever the first computer system went into a given organization), it was usually some kind of accounting system (scientific computing is not included in this discussion, but the ideas may hold). Once installed, the system began to acquire, create and accumulate data. When subsequent applications were desired, this first system was, to some degree, "cloned" and its code modified to address a new task. Useful subsets of the data from the first system were also raided or "extracted" and used in and by the second. This went on over and over. Sometimes a non-obvious trick played by a creative programmer to get something to work within the limited space and slow speeds of those old hardware systems, instead of being "cleaned up" in later systems, was carried into those later systems and, in fact, came to be depended upon by them. The data format used in those early systems, because it was shared, quickly became entrenched; to change the data format would have required modifying all the programs that used that data, a politically and economically infeasible proposition. (By the way, some believe that this is the more realistic view of why two-digit dates were used as long as they were – data format entrenchment. Of course, only the short-sighted habits of programmers can account for that practice in even some brand new "modern" systems.)
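A miniature, invented example makes the entrenchment concrete: the two-digit year saves space, every downstream program quietly assumes the "19xx" century, and one common repair, date windowing, has to preserve the entrenched format while reinterpreting it.

    # A miniature, invented example of the entrenched two-digit year and the
    # common "windowing" repair; nothing here is taken from the essay itself.

    def naive_year(yy: int) -> int:
        return 1900 + yy                           # the buried assumption: the century is always 19

    def windowed_year(yy: int, pivot: int = 30) -> int:
        """Interpret 00..pivot-1 as 20xx and pivot..99 as 19xx, leaving the stored data unchanged."""
        return (2000 if yy < pivot else 1900) + yy

    print(naive_year(99), naive_year(0))           # 1999 1900  -- "00" collapses back a century
    print(windowed_year(99), windowed_year(0))     # 1999 2000

Windowing avoids changing the shared data format, which is exactly why it was popular; but every program touching the field must apply the same pivot, which is the interconnection problem all over again.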
As this clone-and-raid process went on for decades, literally hundreds of these systems were built this way in large organizations, almost all of them connected to each other in some or many ways. . . . And because most documentation, to the extent it was done at all, was focused on the maintenance of a particular program or set of programs and not on the connections between programs, these interdependencies were not often captured or maintained as an organizational asset. And then, over time, greater complications were introduced as organizations were acquired by and merged with one another, eventually sharing data between their information systems, which often survived in overlapping redundancy. This is the world faced by larger, older organizations in addressing the Year 2000: a huge number of elements, a great amount of redundancy, a great amount of often invisible interdependency. It is not pretty.