Monday, November 16, 2015

Does the Intel Pentium Bug of the 1990s Still Hold Any Lessons for Us?

Information Technology is, at its core, a forward-oriented profession. What I mean by this is that, as a general observation, the rate of change this profession deals with is unprecedented. In my career so far, I have seen many paradigm shifts, including the rise, fall, and re-rise of Microsoft, the birth and dominance of Google, and a gigantic comeback by Apple, all of which eventually affected our lives as professionals and as consumers of technology. In dealing with such acute dynamism, I believe it is very easy to lose the sense of history of our profession. I personally feel that having a good sense of the history of one's chosen profession helps us connect the dots and better fathom the events we experience today. History connects things through time, and I consider knowledge of our profession's history important in shaping its future. Most of today's methodologies and good practices evolved by improving on what didn't work in the past. At the very least, a sense of history gives us a connection with the past that we should take care not to lose.

I was recently reading the book "Only the Paranoid Survive", the first-person account by Andy Grove (former CEO of Intel) of how he dealt with strategic inflection points, i.e. the times in the life of a business when its fundamentals are about to change. One of the narratives in the book is about the Pentium chip bug, and it goes as follows (quoted as it appears in the book):

"Several weeks earlier, some of our employees had found a string of comments on the Internet forum where people interested in Intel products congregate. The comments were under the headings like, "Bug in the Pentium FPU." (FPU stands for floating point unit, the part of the chip that does heavy-duty math.) They were triggered by the observation of a math professor that something wasn't quite right with the mathematical capabilities of the pentium chip. The professor reported that he had encountered a division error while studying some complex math problem.
We were already familiar with this problem, having encountered it several months earlier. It was due to a minor design error on the chip, which caused a rounding error in division once every nine billion times. At first, we were very concerned about this, so we mounted a major study to try to understand what once every nine billion divisions would mean. We found the results reassuring. For instance, they meant that an average spreadsheet user would run into the problem only once every 27,000 years of spreadsheet use."
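
As a quick sanity check on the two figures in the quote, here is a minimal Python sketch. The two constants come from the quoted passage; the function name and the assumption that the spreadsheet is used 365 days a year are mine, purely for illustration. It works out how many divisions per day an "average spreadsheet user" would have to perform for both figures to agree.

```python
# Figures taken from the quoted passage.
DIVISIONS_PER_ERROR = 9_000_000_000  # one rounding error per nine billion divisions
YEARS_PER_ERROR = 27_000             # one error per 27,000 years of spreadsheet use

# Assumption (not from the book): the spreadsheet is used every day of the year.
DAYS_PER_YEAR = 365


def implied_divisions_per_day(divisions_per_error, years_per_error,
                              days_per_year=DAYS_PER_YEAR):
    """Return the daily division count that makes the two quoted figures agree."""
    return divisions_per_error / (years_per_error * days_per_year)


per_day = implied_divisions_per_day(DIVISIONS_PER_ERROR, YEARS_PER_ERROR)
print(f"Implied workload: about {per_day:.0f} divisions per day")
```

Under these assumptions the quoted numbers imply a workload of roughly 900 divisions a day for a single user. That is reassuring for any one person, but it says nothing about how the bug is perceived once a large user base starts talking about it.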

Andy spends quite a few pages later in the book explaining why this bug was critical and how it made him realize that something peculiar was happening in the world around him. Let me summarize that point of view in the next few points and explain its relevance in today's world:

1. The beginning of social media as a force to reckon with:
[Image: an Internet forum in the 1990s]
We pretty much take social media for granted these days. It generates a lot of data and opinions every passing second, which is very valuable to anyone who wants to extract information from it. This is especially true for anyone seeking feedback on a newly launched product or service. Consumers, for their part, often provide feedback on social media without being asked, and it frequently becomes a medium for venting about imperfections and bad experiences. That is now. But in the 1990s, when the Pentium FPU bug occurred, the Internet was still in its infancy and people had only just started using Internet forums to share opinions. Intel was not then in the business of selling computer chips directly to consumers. It sold them through PC manufacturers like IBM. Intel's emergence came at the cusp of the PC industry turning from a vertical orientation to a horizontal one: earlier, a single manufacturer like Digital would manufacture and assemble all the parts of a computer (vertical orientation), whereas later each key component became a business of its own (horizontal orientation) serving PC assemblers like IBM, Dell, etc. Andy mentions in his book that with this bug he sensed something unusual happening in the field: though he was not selling directly to consumers, he was getting feedback from them directly. He inferred that if the situation wasn't handled proactively, he could face a lot of backlash. Mind you, this was the 1990s, when it was hard to imagine the power of social media. Andy took corrective action quickly and even justified the huge cost of this bug, about USD 475 million (a mammoth amount now, and even more so 20 years ago).

The lesson holds for today's times too. Dealing proactively with feedback received on social media is the order of the day. It is easy to manufacture negativity, even through the bad intentions of competitors. Techniques such as Sentiment Analysis, which help to proactively assess positive and negative sentiment around events like product releases, make it easier to deal with negative perceptions. From recent memory, I am reminded of the social buzz created by Heartbleed, the security vulnerability in OpenSSL, and of the negative response on social media when news leaked of a social media company's (hidden) A/B experiment, in which it deliberately showed negative news to a certain percentage of its users. Social media is quite a useful channel for generating feedback, but it also leaves companies vulnerable to negative publicity when a bug catches public attention.

2. Handling strategic inflection points needs different skills
In the wake of the negative press and the crisis-like situation that the Pentium FPU bug generated for Intel, Andy makes a very interesting observation in his book. He says:
 "A lot of people involved in handling this stuff had only joined Intel in the last ten years or so, during which time our business had grown steadily. Their experience had been that working hard, putting one foot in front of other, was what it took to get good outcome. Now, all of a sudden, instead of predictable success, nothing was predictable. Our people, while they were busting their butts, were also perturbed and even scared."

In short, the skills needed to handle peacetime in business are quite different from the ones needed during wartime. People often come to work believing the workplace to be fair, i.e. if I do "X" amount of work, I will get the equivalent of "X" in credit. There is nothing wrong with this assumption in general, but such thinking (from the employee's perspective) does not take changing business situations into account. The reality of today's times is that an effort that would have produced a great outcome (for the company and for the individual) in one business situation may simply not be enough in a very different one. This often happens through no fault of the employee, who did his best given the situation but probably lacked the situational awareness to change the nature of his efforts. Nokia's example comes to mind to illustrate this. The story of the rise and subsequent decline of Nokia has been widely written about. During the good times (until at least 2007), the company made a big fortune with its existing model (phones based on Symbian OS). But when the time came to change to a more modern mobile OS like Android, it failed to move swiftly. I can imagine that the employees in this situation put in great effort with their key skills around Symbian OS, but because the situation had changed, the same efforts that had borne huge fruit earlier were no longer enough to reap similar or greater rewards.

3. Lessons in Defect Advocacy
To me, the most interesting part of the narration regarding the Pentium FPU bug was this: "an average spreadsheet user would run into the problem only once every 27,000 years of spreadsheet use".
This was actually a known problem before the Pentium chip was released. What might have happened is that, following the usual defect prioritization principles, it was acknowledged but given low priority, since it was expected to occur only once in a staggering 27,000 years of spreadsheet use. Now, one may question the accuracy of this data, which is probably a fair question, but the larger point this case teaches is that the usual defect prioritization approach often fails to consider the macro aspects impacting the product. Let me explain this point a little:
The Pentium chip was released against the backdrop of the legendary "Intel Inside" marketing campaign. The campaign was so popular that Intel almost became a household brand. When people started seeing the effects of this error, they put the blame squarely on Intel and not on the computer manufacturer. The early social media, in the form of Internet forums, gave voice to their concerns. Had the defect prioritization decision taken into account the macro environment in which the product would operate, the bug would probably have been chosen for fixing.
One of the key learnings here that is still relevant today is to take a holistic approach to defect advocacy. A tester advocating a defect should relate the bug information to what is happening in the macro environment: the business situation, the visibility of the bug, the users impacted, and much more. For a tester to play the role of the headlights of the product, he/she should think not just about the internals of a bug but also connect it with the relevant business information and related factors.
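
To make the contrast concrete, here is a small, purely illustrative Python sketch. Everything in it is my assumption: the `Defect` fields, the 0-to-1 scales, the equal weights, and the example values are hypothetical and not taken from the book or from any real prioritization method. It compares a score based on failure frequency alone with one that also weighs the macro factors discussed above.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    """A defect described by hypothetical factors, each scaled from 0 to 1."""
    name: str
    failure_rate: float      # how often a single user is likely to hit the defect
    brand_visibility: float  # how strongly users associate the product with the vendor
    user_reach: float        # share of the user base exposed to the defect
    public_attention: float  # how much public discussion the defect is drawing


def classic_score(defect):
    """Frequency-only view: a rare defect gets a low priority."""
    return defect.failure_rate


def holistic_score(defect, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of frequency and macro factors (weights are an assumption)."""
    factors = (defect.failure_rate, defect.brand_visibility,
               defect.user_reach, defect.public_attention)
    return sum(w * f for w, f in zip(weights, factors))


# Hypothetical values for a rare defect in a highly visible, widely used product.
example = Defect(name="rare rounding error", failure_rate=0.01,
                 brand_visibility=0.9, user_reach=0.8, public_attention=0.7)

print(f"classic:  {classic_score(example):.2f}")
print(f"holistic: {holistic_score(example):.2f}")
```

The exact numbers do not matter. The point is that a defect which looks negligible on frequency alone can rank much higher once brand visibility, reach, and public attention are part of the picture.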

What else do you learn from this case? Please do share your thoughts in the comments.
