Dr. Paul Gratz and Dr. Jiang Hu in the Department of Electrical and Computer Engineering at Texas A&M University are utilizing machine-learning models to detect performance bugs within processors and create a more efficient method to combat this real-world problem.
As consumers, we upgrade to a new phone, gaming system, or smart device for the home because the newer model offers better battery life, graphics, performance and overall capabilities. But processors can ship with undetected bugs, and when those bugs are released into 'the wild' (into our homes and our everyday lives), the performance we lose as a result can have a greater long-term effect than we might realize.
When it comes to computer bugs, there are two main types: functional and performance.
A functional bug means an error within a processor creates a computing result that is simply wrong. For example, if a processor is asked to solve three plus two and its result is seven, there is clearly an issue with that result. A performance bug is not as simple to detect.
“Suppose you want to drive from College Station to Houston,” said Hu, professor in the department. “At one point you somehow make a mistake and drive toward Dallas. That’s a big mistake, and it’s pretty easy to tell. But there are different paths to Houston. Some are shorter, some are longer; that difference is hard to tell because you still arrived at your desired destination.”
Performance bugs can fly under the radar and remain unnoticed forever, ultimately diminishing the progress to be made in all facets of modern technology.
Fortunately, Gratz and Hu are working with collaborators at Intel Corporation on a promising answer.
By utilizing machine-learning models and automating the process, Gratz and Hu are hopeful that the effort currently spent on performance validation can be drastically reduced, ultimately leading to technologies that reach their full potential more efficiently and effectively.
“This is the first application of machine learning to this kind of problem,” said Gratz, associate professor. “It’s the first work we have really found that actually tries to tackle this problem at all.”
Gratz and Hu explained that their procedure allows them to do in a day with one person what currently takes a team of several engineers months to accomplish.
The first hurdle in detecting these bugs is defining what they might look like. The computing industry faces this challenge during initial performance analyses. When a new technology shows performance somewhat better than the previous generation's, it is hard to determine whether that processor is running at its full potential or whether a bug is limiting the gains and better results should be expected.
“If you have a 20% gain from one generation to the next, how do you know that the 20% gain couldn’t have been 25% or 30%?” Gratz said.
This is where the team’s machine-learning models come into play. The models learn the relationship between a processor's behavior and its expected performance, allowing them to predict what the performance of the new machine should be, so that the team can see if there is a divergence.
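The article does not describe the team's actual models or features, but the general idea of predicting expected performance and flagging divergence can be sketched as follows. This is a minimal illustration, assuming a simple one-feature linear model fit to hypothetical measurements from known-good runs; the feature names, numbers, and tolerance are invented for the example.

```python
# Illustrative sketch only: the researchers' actual models, features, and
# thresholds are not described in the article. All numbers here are made up.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical training data from bug-free runs:
# instructions executed vs. cycles taken.
instructions = [100, 200, 400, 800]
cycles = [110, 205, 410, 815]

a, b = fit_line(instructions, cycles)

def flag_divergence(instr, measured_cycles, tolerance=0.10):
    """Flag a run whose measured cycle count exceeds the model's
    prediction by more than `tolerance` -- a possible performance bug."""
    predicted = a * instr + b
    return (measured_cycles - predicted) / predicted > tolerance
```

A run that behaves as the model predicts (say, 310 cycles for 300 instructions) passes quietly, while one that takes far longer than predicted (600 cycles for the same work) is flagged for investigation, even though both produce functionally correct results.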
Because chips are more compact and complex than ever before, there is a higher chance for such bugs to appear. As the complexity of chip design grows, the conventional method to detect and eliminate bugs, manual processor design validation, is increasingly difficult to maintain. Thus, a need for an automated solution became apparent. Intel contacted Hu in June 2019 with hopes of collaboration to solve this critical issue.
This work has been supported by a grant from the Semiconductor Research Corporation. The researchers published their current findings in a paper that was accepted to the 27th Institute of Electrical and Electronics Engineers International Symposium on High-Performance Computer Architecture, a top-tier conference in computer architecture.