DRAM Error Prediction

Tools

UEPREDICT & UERL Tools

Overview

Our suite of DRAM error prediction and mitigation tools addresses the reliability challenges of next-generation HPC systems. UEPREDICT focuses on predicting Uncorrected Errors (UEs) using machine learning models, while UERL leverages Reinforcement Learning to adaptively mitigate these errors based on their probability and potential cost.

UEPREDICT

Prediction

UEPREDICT is a framework for predicting DRAM Uncorrected Errors in the field. It provides methods for training and evaluating prediction models, helping to anticipate failures before they impact system operations.

  • Core Module: Contains files and methods for training and evaluation.
  • Scripts: train_test.py for executing walk-forward validation and evaluation.py for assessing model performance.
  • Data: Includes synthetic logs based on MareNostrum 3 production data.
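Walk-forward validation, as run by train_test.py, evaluates a model only on data that comes after the data it was trained on. The sketch below illustrates the general idea with an expanding training window. It is not taken from the UEPREDICT code: the function name walk_forward_splits and its parameters are hypothetical, and the real script may split the logs differently (for example with a sliding window).

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, n_folds: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists over time-ordered samples.

    Hypothetical helper, not part of UEPREDICT. The samples are cut into
    n_folds + 1 contiguous blocks; fold k trains on every block before
    block k and tests on block k, so training never sees the future.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("need at least one sample per block")
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any remainder samples.
        test_end = n_samples if k == n_folds else (k + 1) * block
        yield list(range(train_end)), list(range(train_end, test_end))


if __name__ == "__main__":
    for train_idx, test_idx in walk_forward_splits(n_samples=10, n_folds=4):
        print("train:", train_idx, "test:", test_idx)
```

An expanding window keeps all past data in the training set; a sliding window that drops the oldest block would be an equally valid choice when older logs stop being representative.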

UERL

Mitigation

UERL is an adaptive mitigation method leveraging Reinforcement Learning (RL). It dynamically adapts to the probability and potential cost of uncorrected errors, offering a proactive approach to mitigate their impact on running jobs.

  • Adaptive Strategy: Decides when to mitigate based on RL agent predictions.
  • Scripts: run.py for training/validating agents and evaluate_best.py to test performance.
  • Configuration: Uses YAML files for flexible experiment setup.
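To make the trade-off between error probability and cost concrete, the sketch below shows a simple static expected-cost rule: mitigate when the expected loss from an uncorrected error exceeds the cost of mitigating. This is a baseline for illustration only, not the RL agent that UERL trains. The function name should_mitigate, its parameters, and the cost units are all assumptions.

```python
def should_mitigate(p_ue: float, ue_cost: float, mitigation_cost: float) -> bool:
    """Static expected-cost rule (illustrative baseline, not UERL's RL policy).

    p_ue            -- predicted probability of an uncorrected error (assumed input)
    ue_cost         -- cost if the error hits a running job, e.g. lost node-hours (assumed unit)
    mitigation_cost -- cost of acting now, e.g. checkpoint overhead (assumed unit)
    """
    if not 0.0 <= p_ue <= 1.0:
        raise ValueError("p_ue must be a probability in [0, 1]")
    # Mitigate only when the expected loss outweighs the cost of mitigation.
    return p_ue * ue_cost > mitigation_cost


if __name__ == "__main__":
    # Same job and mitigation cost, two different predicted probabilities.
    print(should_mitigate(p_ue=0.01, ue_cost=1000.0, mitigation_cost=5.0))
    print(should_mitigate(p_ue=0.001, ue_cost=1000.0, mitigation_cost=5.0))
```

A fixed rule like this needs hand-set costs and a calibrated probability. An RL-based approach such as UERL instead learns when to act from experience, which is what lets it adapt as probabilities and costs change.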

Key Features

Cost-Aware Prediction

Considers both the likelihood and the potential cost of errors to optimize mitigation strategies.

Production-Driven

Evaluated on production logs from major supercomputers such as MareNostrum 3.

Contact

For inquiries about our DRAM error prediction tools, please contact: