
Silent chip defects can corrupt data in modern computers


Computing is often celebrated for its speed and accuracy. But researchers and hyperscale data center operators warn of a growing threat that undermines one of computing’s core promises: correctness. The problem is known as silent data corruption (SDC) – a phenomenon in which hardware defects cause programs to produce incorrect results without crashing, triggering an error, or leaving any visible trace.

An invisible threat within modern chips

At the heart of the concern are silicon defects in CPUs, GPUs and AI accelerators. These flaws can be introduced during chip design or manufacturing, or can emerge later in the field through aging or environmental factors. Although manufacturers test for many defects, even rigorous production screening is estimated to catch only 95% to 99% of them. The rest ship inside chips that enter the field.
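As a rough back-of-the-envelope illustration of why those percentages matter (the fleet size and defect rate below are hypothetical, not vendor data):

```python
# Hypothetical numbers, for illustration only: the escaped-defect count
# scales directly with shipment volume.

def escaped_defects(chips_shipped, defect_rate, test_coverage):
    """Expected number of defective chips that pass testing."""
    return chips_shipped * defect_rate * (1 - test_coverage)

# 1,000,000 chips shipped, 1% defective, 99% of defects caught in test:
print(escaped_defects(1_000_000, 0.01, 0.99))  # ≈ 100 defective chips reach the field
```

Even with 99% test coverage, at volume the absolute number of escaped defects is far from zero.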

In some cases, those defects lead to visible failures such as system crashes. More worrying are the silent ones. Here, a faulty logic gate or arithmetic unit produces an incorrect value during execution. If that value propagates through the system without triggering any detection mechanism, the operation completes normally and returns a wrong result – with no indication that anything went wrong.
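A toy sketch (using a made-up fault model, not any real silicon defect) shows how this looks from software's point of view: the computation runs to completion, raises nothing, and simply returns the wrong number.

```python
# Toy fault model: an adder whose output bit 4 is stuck at 1,
# mimicking a hardware defect in an arithmetic unit.

def faulty_add(a, b, stuck_bit=4):
    """Add a and b, but force the stuck bit high in the result."""
    return (a + b) | (1 << stuck_bit)

total = 0
for x in [3, 7, 12]:
    total = faulty_add(total, x)   # no exception, no crash, no log entry

print(total)  # prints 54, although 3 + 7 + 12 = 22
```

Nothing in the program's control flow distinguishes this run from a correct one, which is exactly what makes SDCs hard to catch.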

For decades, many believed that SDCs were rare, almost mythical events. However, major operators including Meta, Google and Alibaba have revealed that as many as one in 1,000 CPUs in their fleets can develop silent faults under certain conditions. Similar concerns have been reported for GPUs and AI accelerators.

Correctness is the foundation of computing. Whether it’s processing financial transactions, running AI inference, or managing infrastructure, systems are expected to deliver accurate results within tight time constraints.

Silent corruption undermines that trust. Unlike crashes, which are quickly detected and investigated, SDCs alter outputs in complete silence. In data centers running millions of cores, even a tiny corruption rate can translate into hundreds of incorrect program results per day.
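The fleet-scale arithmetic behind that claim is simple multiplication. All of the numbers below are illustrative assumptions, except the one-in-1,000 rate, which is the figure reported by the operators mentioned above:

```python
# Fleet-scale arithmetic; cores and per-core error rate are assumptions.

cores = 4_000_000               # fleet size (hypothetical)
faulty_fraction = 1e-3          # one in 1,000 cores, as reported by operators
errors_per_faulty_core = 0.1    # silent errors per affected core per day (assumed)

daily_silent_errors = cores * faulty_fraction * errors_per_faulty_core
print(daily_silent_errors)      # ≈ 400 wrong results per day
```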

The scale of modern computing exacerbates the problem

Large parallel architectures such as GPUs and AI accelerators contain thousands of arithmetic units. The more components a system includes, the greater the statistical probability that some will be defective.
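That scaling effect can be made precise. If each unit is independently defective with probability p (the values below are hypothetical), the chance that a chip with n units contains at least one defect is 1 − (1 − p)^n:

```python
# Probability that a chip with n_units arithmetic units has at least
# one defective unit, assuming independent per-unit defect probability.

def p_any_defect(n_units, p_unit):
    return 1 - (1 - p_unit) ** n_units

print(p_any_defect(10_000, 1e-6))    # ~1% for 10,000 units
print(p_any_defect(100_000, 1e-6))   # ~10% for 100,000 units
```

A tenfold increase in unit count raises the per-chip defect probability roughly tenfold while p stays small, so ever-wider accelerators are statistically ever more likely to harbor at least one bad unit.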

Measuring SDCs directly is almost impossible – by definition, they are silent. So the industry must estimate their rates indirectly and weigh the cost of prevention. Detection and repair mechanisms exist, but they can significantly increase silicon area and power consumption and reduce performance.

Researchers are pursuing multi-layered solutions, including improved manufacturing testing, in-fleet health monitoring in data centers, intelligent error-rate estimation models, and hardware-software co-design that contains errors before they spread.
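One widely used software-side containment technique is redundant execution: run the same computation twice and compare the results before letting them propagate. The sketch below is a generic illustration of the idea, not any particular operator's implementation:

```python
# Minimal sketch of redundant execution: compute twice, compare, and
# refuse to return a result the two runs disagree on.

def checked(fn, *args):
    """Run fn twice and return the result only if both runs agree."""
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise RuntimeError("silent data corruption suspected: results differ")
    return first

print(checked(lambda x: x * 2, 21))  # prints 42 when both runs agree
```

In practice the two runs are scheduled on different cores, since a deterministic defect in one unit would corrupt both copies identically on the same hardware; that scheduling detail is omitted here.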

As computing systems grow larger and faster, the challenge is clear: maintain both speed and correctness without prohibitive cost. In what some have described as the “Golden Age of Complexity,” keeping computing reliable may be one of the engineering industry’s defining battles.
