DRAM Error Prediction Tools
UEPREDICT & UERL Tools
Overview
Our suite of DRAM error prediction and mitigation tools addresses the reliability challenges of next-generation HPC systems. UEPREDICT focuses on predicting Uncorrected Errors (UEs) using machine learning models, while UERL leverages Reinforcement Learning to adaptively mitigate these errors based on their probability and potential cost.
UEPREDICT
UEPREDICT is a framework for model prediction and evaluation of DRAM Uncorrected Errors. It provides methods for training and evaluating models that predict UEs in the field, helping to anticipate failures before they impact system operations.
- Core Module: Contains files and methods for training and evaluation.
- Scripts: train_test.py for executing walk-forward validation and evaluation.py for assessing model performance.
- Data: Includes synthetic logs based on MareNostrum 3 production data.
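Walk-forward validation, as mentioned for train_test.py, trains on past data and tests on the period that immediately follows, then moves forward in time. The sketch below illustrates the general idea with an expanding training window. It is not taken from UEPREDICT: the function name walk_forward_splits and the parameters initial_train and test_size are hypothetical, and the tool's actual windowing scheme may differ.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_samples: int, initial_train: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) index ranges over time-ordered samples.

    Each training window ends exactly where its test window begins, so a
    model is never evaluated on data older than its training data.
    """
    if initial_train <= 0 or test_size <= 0:
        raise ValueError("initial_train and test_size must be positive")
    start = initial_train
    while start + test_size <= n_samples:
        yield range(0, start), range(start, start + test_size)
        # Expanding window: the fold just tested joins the next training set.
        start += test_size


# Hypothetical example: 10 time-ordered log windows, 4 used for the first fit.
splits = list(walk_forward_splits(n_samples=10, initial_train=4, test_size=2))
for train_idx, test_idx in splits:
    print("train:", list(train_idx), "test:", list(test_idx))
```

Splitting by time rather than at random matters for error logs, because a random split would let the model see events that occur after the ones it is asked to predict.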
UERL
UERL is an adaptive mitigation method leveraging Reinforcement Learning (RL). It dynamically adapts to the probability and potential cost of uncorrected errors, offering a proactive approach to mitigate their impact on running jobs.
- Adaptive Strategy: Decides when to mitigate based on RL agent predictions.
- Scripts: run.py for training/validating agents and evaluate_best.py to test performance.
- Configuration: Uses YAML files for flexible experiment setup.
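To make the adaptive strategy more concrete, here is a toy sketch of an agent that learns when mitigation pays off. It is a deliberately simplified, one-step (contextual-bandit) form of tabular value learning with epsilon-greedy exploration, not UERL's agent or environment. The risk buckets, probabilities, and costs (UE_PROB, UE_COST, MITIGATION_COST) are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical setup: three risk buckets with an assumed UE probability each.
UE_PROB = [0.001, 0.05, 0.5]
UE_COST = 100.0          # assumed cost of a UE hitting a running job
MITIGATION_COST = 10.0   # assumed cost of mitigating

NO_ACTION, MITIGATE = 0, 1


def train(episodes: int = 20000, alpha: float = 0.05,
          epsilon: float = 0.1, seed: int = 0) -> list:
    """Learn per-bucket action values from simulated one-step episodes."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in UE_PROB]  # q[state][action]
    for _ in range(episodes):
        state = rng.randrange(len(UE_PROB))
        # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = NO_ACTION if q[state][NO_ACTION] >= q[state][MITIGATE] else MITIGATE
        # Reward is the negative cost incurred in this episode.
        if action == MITIGATE:
            reward = -MITIGATION_COST
        else:
            reward = -UE_COST if rng.random() < UE_PROB[state] else 0.0
        # Move the estimate a small step toward the observed reward.
        q[state][action] += alpha * (reward - q[state][action])
    return q


q_table = train()
policy = [NO_ACTION if q[NO_ACTION] >= q[MITIGATE] else MITIGATE for q in q_table]
print(policy)
```

Under these assumed costs, the agent should learn to skip mitigation in the low-risk bucket and to mitigate in the high-risk one. The middle bucket sits closer to the break-even point, so its learned action is more sensitive to noise.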
Key Features
Cost-Aware Prediction
Considers both the likelihood and the potential cost of errors to optimize mitigation strategies.
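The trade-off between likelihood and cost can be illustrated with a simple expected-cost rule: mitigate when the expected loss from an uncorrected error exceeds the cost of mitigating. This is a static baseline given for intuition only. It is not the policy UERL learns, and the function should_mitigate, its parameters, and the example numbers are hypothetical.

```python
def should_mitigate(p_ue: float, ue_cost: float, mitigation_cost: float) -> bool:
    """Return True when the expected UE loss exceeds the mitigation cost."""
    if not 0.0 <= p_ue <= 1.0:
        raise ValueError("p_ue must be a probability in [0, 1]")
    return p_ue * ue_cost > mitigation_cost


# Hypothetical costs in node-hours: a UE loses 1000, mitigation costs 5.
print(should_mitigate(p_ue=0.02, ue_cost=1000.0, mitigation_cost=5.0))
print(should_mitigate(p_ue=0.001, ue_cost=1000.0, mitigation_cost=5.0))
```

In the first call the expected loss (about 20 node-hours) outweighs the mitigation cost, so the rule recommends mitigating. In the second it does not (about 1 node-hour against 5).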
Production-Driven
Evaluated on production logs from supercomputers such as MareNostrum 3.
Contact
For inquiries about our DRAM error prediction tools, please contact: