For high-performance chips in massive data centers, math can be the enemy. Thanks to the sheer scale of calculations going on in hyperscale data centers, operating round the clock with millions of nodes and vast amounts of silicon, extremely uncommon errors appear. It’s simply statistics. These rare, “silent” data errors don’t show up during conventional quality-control screenings—even when companies spend hours looking for them.
This month at the IEEE International Reliability Physics Symposium in Monterey, Calif., Intel engineers described a technique that uses reinforcement learning to uncover more silent data errors faster. The company is using the machine learning method to ensure the quality of its Xeon processors.
When an error happens in a data center, operators can either take a node down and replace it, or relegate the flawed system to lower-stakes computing, says Manu Shamsa, an electrical engineer at Intel’s Chandler, Ariz., campus. But it would be far better to detect errors earlier. Ideally, they’d be caught before a chip is incorporated into a computer system, while it’s still possible to make design or manufacturing corrections that keep the errors from recurring.
“In a laptop, you won’t notice any errors. In data centers, with really dense nodes, there are high chances the stars will align and an error will occur.” —Manu Shamsa, Intel
Finding these flaws is not so easy. Shamsa says engineers have been so baffled by them they joked that they must be due to spooky action at a distance, Einstein’s phrase for quantum entanglement. But there’s nothing spooky about them, and Shamsa has spent years characterizing them. In a paper presented at the same conference last year, his team provides a whole catalog of the causes of these errors. Most are due to infinitesimal variations in manufacturing.
Even when all of the billions of transistors on a chip are functional, they are not completely identical to one another. Subtle differences in how a given transistor responds to changes in temperature, voltage, or frequency, for instance, can lead to an error.
Those subtleties are much more likely to crop up in huge data centers because of the pace of computing and the vast amount of silicon involved. “In a laptop, you won’t notice any errors. In data centers, with really dense nodes, there are high chances the stars will align and an error will occur,” Shamsa says.
Some errors crop up only after a chip has been installed in a data center and has been operating for months. Small variations in the properties of transistors can cause them to degrade over time. One such silent error Shamsa has found is related to electrical resistance: a transistor that operates properly at first, and passes standard tests for shorts, can degrade with use and become more resistive.
“You’re thinking everything is fine, but underneath, an error is causing a wrong decision,” Shamsa says. Over time, thanks to a slight weakness in a single transistor, “one plus one goes to three, silently, until you see the impact,” Shamsa says.
The new technique builds on an existing set of methods for detecting silent errors, called Eigen tests. These tests make the chip do hard math problems, repeatedly over a period of time, in the hopes of making silent errors apparent. They involve operations on different sizes of matrices filled with random data.
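As a rough illustration of the idea (this is not Intel’s actual test suite, and the function name is made up), an Eigen-style stress test boils down to hammering the chip’s math units with random matrix products and flagging any run in which deterministic arithmetic disagrees with itself:

```python
import numpy as np

def stress_matrix_multiply(sizes=(64, 256, 1024), repeats=100, seed=0):
    """Multiply random matrices repeatedly and flag nondeterministic results."""
    rng = np.random.default_rng(seed)
    mismatches = 0
    for _ in range(repeats):
        n = int(rng.choice(sizes))
        a = rng.standard_normal((n, n), dtype=np.float32)
        b = rng.standard_normal((n, n), dtype=np.float32)
        first = a @ b    # first pass through the hardware's matrix/FMA units
        second = a @ b   # identical recomputation on the same inputs
        if not np.array_equal(first, second):
            mismatches += 1   # deterministic math should never disagree with itself
    return mismatches

if __name__ == "__main__":
    print("silent mismatches:", stress_matrix_multiply())
```

On healthy silicon the count stays at zero; a marginal transistor that misbehaves only at certain sizes, temperatures, or voltages will occasionally let the two passes diverge.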
There are a large number of Eigen tests. Running them all would take an impractical amount of time, so chipmakers use a randomized approach to generate a manageable set of them. This saves time but leaves errors undetected. “There’s no principle to guide the selection of inputs,” Shamsa says. He wanted to find a way to guide the selection so that a relatively small number of tests could turn up more errors.
The Intel team used reinforcement learning to develop tests for the part of its Xeon CPU chip that performs matrix multiplication using what are called fused multiply-add (FMA) instructions. Shamsa says they chose the FMA region because it takes up a relatively large area of the chip, making it more vulnerable to potential silent errors—more silicon, more problems. What’s more, flaws in this part of a chip can generate electromagnetic fields that affect other parts of the system. And because the FMA is turned off to save power when it’s not in use, testing it involves repeatedly powering it up and down, potentially activating hidden defects that otherwise would not appear in standard tests.
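For readers unfamiliar with the term, a fused multiply-add computes a × b + c as a single instruction, and every entry of a matrix product is built from a long chain of them. The toy sketch below (purely illustrative, not Intel’s test code) shows how all of that arithmetic funnels through the one operation:

```python
def fma(a, b, c):
    # A fused multiply-add: one hardware instruction computing a*b + c.
    return a * b + c

def dot(x, y):
    # Each entry of a matrix product is a dot product, i.e. a chain of FMAs,
    # so a marginal transistor in the FMA unit can taint every result.
    acc = 0.0
    for xi, yi in zip(x, y):
        acc = fma(xi, yi, acc)
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```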
During each step of its training, the reinforcement-learning program selects different tests for the potentially defective chip. Each error it detects is treated as a reward, and over time the agent learns to select which tests maximize the chances of detecting errors. After about 500 testing cycles, the algorithm learned which set of Eigen tests optimized the error-detection rate for the FMA region.
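A minimal sketch of that selection idea, framed here as a simple bandit-style learner (the paper’s actual formulation and reward design are more involved, and `test_pool` and `run_test` are hypothetical placeholders), might look like this:

```python
import random

def learn_test_selection(test_pool, run_test, cycles=500, epsilon=0.1):
    """Epsilon-greedy bandit over candidate tests; reward = exposing an error."""
    value = {t: 0.0 for t in test_pool}   # estimated error-finding rate per test
    count = {t: 0 for t in test_pool}
    for _ in range(cycles):
        if random.random() < epsilon:
            test = random.choice(test_pool)                  # explore a random test
        else:
            test = max(test_pool, key=lambda t: value[t])    # exploit the best so far
        reward = 1.0 if run_test(test) else 0.0              # 1 = silent error exposed
        count[test] += 1
        value[test] += (reward - value[test]) / count[test]  # incremental mean update
    # Tests ranked by how often they exposed errors on this part of the chip.
    return sorted(test_pool, key=lambda t: value[t], reverse=True)
```

In practice, `run_test` would launch an Eigen workload on the part under test and report whether the output deviated from a known-good reference; after enough cycles, the top-ranked tests form the compact suite that maximizes the detection rate.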
Shamsa says this technique is five times as likely to detect a defect as randomized Eigen testing. The Eigen tests are open source, part of the openDCDiag diagnostic framework for data centers, so other users should be able to apply reinforcement learning to tune the tests for their own systems, he says.
To a certain extent, silent, subtle flaws are an unavoidable part of the manufacturing process; absolute perfection and uniformity remain out of reach. But Shamsa says Intel is using this research to find the precursors of silent data errors faster. He’s investigating whether there are red flags that could provide an early warning of future errors, and whether chip recipes or designs can be changed to manage them.