Failure Triage

    The Pain

    One of the daily tasks of verification engineers is to go through the nightly simulation regression failures. The hard part is sorting through those failures, using the simulation log files and error messages, to determine what caused each failure and which engineers should own the problem. This process is commonly referred to as Triage.

    Triage is a term that originates in the medical community referring to the quick diagnostic process performed when patients enter the emergency room. Based on the triage, medical professionals determine the urgency of the situation and the type of doctor to examine the patient. An analogous process exists in verification where engineers must quickly diagnose simulation failures at a high level to understand their cause and identify which engineers should be assigned to further analyze and fix them.

    Today, failure triage is performed in an ad hoc manner based only on the messages available in the simulation log files. Design teams waste valuable resources when problems are handed to the wrong engineers, when distinct failures are incorrectly dismissed as duplicates, or when duplicates of one bug are treated as distinct bugs.

    Typical Triage process without tool support

    Triage is hard because it is a classic Catch-22 problem: In order to do accurate triage, the verification engineer needs to do low-level debug to determine the location of the bug. But an engineer cannot do low-level debug effectively without a sophisticated preceding triage step. More specifically, during triage, the following questions need to be answered accurately:

    • Which of the failures are due to the same error sources?
    • Which of the failures are due to distinct error sources?
    • Which of the failures are “new” bugs and have not been filed?
    • Which of the failures have already been filed as bugs but have not been fixed yet?
    • Who is the rightful owner of the block and who should be assigned the failure?
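
    The questions above can be pictured as a small bookkeeping step. The following is a minimal sketch, assuming that an error-source signature is already known for every failure, which is precisely the part that is hard to obtain from log messages alone. The names triage, filed_bugs, and owners, as well as the record layout, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def triage(failures, filed_bugs, owners):
    """Group failures by (signature, block), then label each group.

    failures   -- iterable of (test_name, signature, block) tuples
    filed_bugs -- set of signatures that already have a bug report
    owners     -- dict mapping a block name to its owning engineer
    """
    groups = defaultdict(list)
    for test_name, signature, block in failures:
        # Failures with the same signature and block share one error source.
        groups[(signature, block)].append(test_name)

    report = []
    for (signature, block), tests in sorted(groups.items()):
        report.append({
            "signature": signature,
            "block": block,
            "tests": sorted(tests),
            # "known" if a bug is already filed, otherwise "new".
            "status": "known" if signature in filed_bugs else "new",
            # Fall back to "unassigned" when no owner is recorded.
            "owner": owners.get(block, "unassigned"),
        })
    return report
```

    Each entry in the returned report answers one row of the list: which tests share an error source, whether that source is new or already filed, and who should be assigned.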

    The Solution

    OnPoint is the first verification tool with a triage engine that can automatically differentiate between failure sources. The process starts automatically when regression tests fail: OnPoint performs a root-cause analysis on each failure and generates a list of suspects.

    Triage and Root Cause Analysis with OnPoint

    The suspects indicate what is happening in the design and where the error source lies. OnPoint can therefore distinguish the different paths that different bugs take to a checker and categorize the failures accordingly. More specifically, it can distinguish between the following two situations, both of which are especially hard to detect using only error messages:

    OnPoint can distinguish between different bugs that fire the same checker

    OnPoint can determine whether the same bug is firing different checkers
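
    To make the two situations concrete, here is a minimal sketch of grouping failures by how much their suspect lists overlap rather than by which checker fired. It is not OnPoint's actual algorithm, which this brief does not describe; the Jaccard measure, the 0.5 threshold, and the function names are assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap of two sets: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_suspects(failures, threshold=0.5):
    """Greedily cluster failures whose suspect sets overlap enough.

    failures -- iterable of (test_name, checker, suspects) tuples
    Returns a list of clusters; each cluster is a list of (test_name, checker).
    """
    clusters = []  # each item: (representative suspect set, member list)
    for test_name, checker, suspects in failures:
        suspects = frozenset(suspects)
        for representative, members in clusters:
            # Compare against the suspects of the cluster's first failure.
            if jaccard(suspects, representative) >= threshold:
                members.append((test_name, checker))
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((suspects, [(test_name, checker)]))
    return [members for _, members in clusters]


# Hypothetical failures; the checker and signal names are made up.
example = [
    ("test_a", "chk_data_mismatch", {"fifo.wr_ptr", "fifo.full"}),
    ("test_b", "chk_data_mismatch", {"arb.grant", "arb.req_mask"}),
    ("test_c", "chk_timeout", {"fifo.wr_ptr", "fifo.full", "fifo.rd_ptr"}),
]
groups = cluster_by_suspects(example)
```

    In this made-up data, test_a and test_b fire the same checker but share no suspects, so they land in separate groups. test_a and test_c fire different checkers but share most of their suspects, so they are grouped together. Grouping by error message alone would get both cases wrong.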

    Verification engineers do not need medical gear to perform triage on their designs, but they do need a good diagnostic tool. OnPoint brings automation to this process as it naturally fits within nightly regression tests to dramatically accelerate the debug and triage tasks.