Particles travel millions of miles just to mess with your computer
When a computer goes down unexpectedly or your cell phone starts playing up, don’t be too quick to point the finger of suspicion at the person sitting in front of the screen. However far-fetched and ‘purely theoretical’ it may seem, the crash could be the result of a random interaction with a particle of cosmic origin, according to research by Bharat Bhuva at Vanderbilt University. The ongoing shrinking of semiconductor technology has its price.
A tiny particle from outside our solar system, which started its journey many millions of years ago, happens to carry enough energy to flip a single bit when it interacts with the memory in our hardware. Result… Error!
Such events are rare, and no individual event can be predicted. Because they cause no physical damage, the resulting failure is also difficult to characterize. It is, however, possible to calculate the probability of an occurrence.
Cosmic rays travelling close to the speed of light strike the Earth's atmosphere and create cascades of secondary particles, including energetic neutrons, pions, muons and alpha particles. Your body gets hit by millions of these particles each second. Despite the onslaught, they are imperceptible and have no known harmful effects on living organisms. Modern microelectronics are, however, not so impervious. A fraction of these particles carry enough energy to interfere with their operation. When they interact with integrated circuits, they may change the state of individual bits of data stored in memory.
When this occurs it’s called an SEU, or single-event upset. According to Bharat Bhuva, when a single bit gets flipped it’s difficult to establish the cause. It could be a software bug or a hidden hardware flaw (see the Pentium FDIV bug). The only way we can determine that it is an SEU is by eliminating all the other possible causes… Sherlock Holmes could not fault the logic.
The problem is not just hypothetical:
• In 2003 a bit flip caused a vote-counting machine to award an extra 4,096 votes to a candidate in a small-town election in Belgium. The error was discovered because that figure exceeded the entire population eligible to vote.
• In 2008 an SEU was thought to be the source of a sudden autopilot disengagement on a Qantas commercial flight. The aircraft lost almost 700 feet in altitude in 23 seconds, injuring around one third of the passengers.
• There have also been a number of unexplained errors in airline computers, resulting in hundreds of flight cancellations, which some experts say could only be attributed to SEUs.
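The figure in the Belgian case is a telling one: 4,096 is exactly 2 to the power of 12, which is what you would expect if a single bit (bit 12, counting from zero) in a stored tally flipped from 0 to 1. The short Python sketch below illustrates the arithmetic only. The function name and the tally value are made up for the illustration and have nothing to do with the actual voting machine.

```python
def flip_bit(value: int, position: int) -> int:
    """Return value with the bit at `position` (0 = least significant) inverted."""
    return value ^ (1 << position)


# Hypothetical vote count; any tally below 4096 has bit 12 cleared.
true_tally = 514

corrupted_tally = flip_bit(true_tally, 12)

print(f"true tally:      {true_tally:>5} = {true_tally:016b}")
print(f"corrupted tally: {corrupted_tally:>5} = {corrupted_tally:016b}")
print(f"difference:      {corrupted_tally - true_tally}")  # 4096
```

Flipping the same bit a second time restores the original value, which is one reason such an upset leaves no trace in the hardware.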
Ritesh Mastipuram and Edwin Wee of Cypress Semiconductor calculated SEU failure rates using earlier-generation technology and published their findings in 2004:
• A cell phone with 500 KB of memory should only expect one event every 28 years.
• A ‘router farm’ used by internet providers with 25 GB of memory may expect one networking error every 17 hours.
• A laptop with 500 MB of memory on board an aircraft at 33,000 feet may experience an SEU every 5 hours.
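To get a feel for what these intervals mean in practice, they can be turned into the probability of seeing at least one upset within a given time window. The Python sketch below does this under an assumption that is ours, not the Cypress authors': each quoted interval is treated as a mean time between events, and upsets are modelled as a Poisson process with a constant rate. The function name and the chosen time windows are purely illustrative.

```python
import math

HOURS_PER_YEAR = 365.25 * 24


def upset_probability(mean_hours_between_events: float, window_hours: float) -> float:
    """Probability of at least one upset within the window.

    Assumes a Poisson process with constant rate 1 / mean_hours_between_events.
    """
    rate_per_hour = 1.0 / mean_hours_between_events
    return 1.0 - math.exp(-rate_per_hour * window_hours)


# (description, mean hours between events from the list above, window in hours)
scenarios = [
    ("cell phone, over one year", 28 * HOURS_PER_YEAR, HOURS_PER_YEAR),
    ("router farm, over one day", 17.0, 24.0),
    ("laptop in flight, over a 10-hour flight", 5.0, 10.0),
]

for label, mean_hours, window in scenarios:
    p = upset_probability(mean_hours, window)
    print(f"{label}: {p:.1%}")
```

Under these assumptions the cell phone has roughly a 3.5% chance of an upset in any given year, while the router farm and the in-flight laptop are more likely than not to see one within the windows chosen.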
For the latest generation of semiconductor fabrication, things look a little different. Shrinking a transistor lowers the energy required to change its state, but its reduced size also makes it less likely to suffer a hit. Since 2004 we have also packed far more transistors into each chip. Modern 3D structures have proved to be less susceptible to SEUs. The graph shows the ratio for 28, 20 and 16 nm structures (source: Bharat Bhuva, Vanderbilt).
If the probability of SEUs occurring becomes non-trivial, it will have serious implications for safety-critical and medical systems…