PROFET
Stable
Performance & Energy Prediction
What is PROFET?
PROFET (PROFiling-based EsTimation of performance and energy) is an analytical model that quantifies the impact of main memory on application performance and on system power and energy consumption. Traditional hardware simulators are slow and often rely on simplified or obsolete abstractions of the CPU. PROFET instead builds on instrumentation of an application running on actual hardware, so its inputs already reflect CPU microarchitectural details such as the data prefetcher and the out-of-order engine.
The model predicts how an application's performance and energy consumption would change when executed on different memory systems, enabling researchers and system designers to explore future memory technologies without the overhead of full hardware simulation.
The Core Idea
The fundamental insight behind PROFET is that an application moves between bandwidth-latency curves when it transitions from one memory system to another. Memory access latency and bandwidth are inherently interrelated: as the used memory bandwidth increases, latency also increases because of contention among concurrent memory requests.
PROFET characterizes both the baseline and target memory systems using bandwidth-latency curves. These curves show how memory access latency depends on the used memory bandwidth, distinguishing between lead-off latency (single-access latency in an idle system) and loaded latency (latency under contention). The difference between these can be hundreds of nanoseconds, making this distinction critical for accurate predictions.
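A bandwidth-latency curve can be pictured as a set of measured points with interpolation in between. The following is a minimal sketch of that idea, not PROFET's own data format: the points in `CURVE` and the helper `latency_at` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical measured points: (used bandwidth in GB/s, latency in ns).
# The first point approximates the lead-off latency of an idle system.
CURVE = [(0.0, 90.0), (20.0, 100.0), (40.0, 130.0), (50.0, 200.0), (55.0, 400.0)]

def latency_at(curve, bandwidth_gbs):
    """Linearly interpolate latency (ns) at a given used bandwidth (GB/s)."""
    bws = [bw for bw, _ in curve]
    # Clamp to the measured range instead of extrapolating.
    if bandwidth_gbs <= bws[0]:
        return curve[0][1]
    if bandwidth_gbs >= bws[-1]:
        return curve[-1][1]
    i = bisect_right(bws, bandwidth_gbs)
    (bw0, lat0), (bw1, lat1) = curve[i - 1], curve[i]
    frac = (bandwidth_gbs - bw0) / (bw1 - bw0)
    return lat0 + frac * (lat1 - lat0)

lead_off_ns = latency_at(CURVE, 0.0)   # idle system
loaded_ns = latency_at(CURVE, 52.5)    # close to saturation
```

With these made-up points, the loaded latency near saturation is more than three times the lead-off latency, which is the kind of gap the distinction above is meant to capture.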
When an application runs on a baseline memory system, it operates at a specific point on that system's bandwidth-latency curve. PROFET predicts where the same application would operate on a target memory system's curve. A change in latency affects performance, which in turn affects bandwidth usage, which again affects latency. PROFET solves this feedback loop analytically.
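The feedback loop can be illustrated with a small numerical sketch. This is not PROFET's analytical solution: it uses a deliberately simplified toy performance model (a fixed compute time plus miss latency divided by an assumed `overlap` factor) and a damped fixed-point iteration. All names and numbers are hypothetical.

```python
from bisect import bisect_right

CACHE_LINE_BYTES = 64  # assumed bytes moved per last-level cache miss

def latency_at(curve, bw_gbs):
    """Linearly interpolate latency (ns) on a list of (GB/s, ns) points."""
    bws = [bw for bw, _ in curve]
    if bw_gbs <= bws[0]:
        return curve[0][1]
    if bw_gbs >= bws[-1]:
        return curve[-1][1]
    i = bisect_right(bws, bw_gbs)
    (bw0, lat0), (bw1, lat1) = curve[i - 1], curve[i]
    return lat0 + (bw_gbs - bw0) / (bw1 - bw0) * (lat1 - lat0)

def solve_operating_point(curve, n_misses, t_cpu_s, overlap, bw_start,
                          tol=1e-6, max_iter=200):
    """Find a bandwidth at which latency and execution time are consistent.

    Toy model (an assumption, not PROFET's):
    time = t_cpu + n_misses * latency / overlap.
    """
    bw = bw_start
    for _ in range(max_iter):
        lat_ns = latency_at(curve, bw)
        time_s = t_cpu_s + n_misses * lat_ns * 1e-9 / overlap
        # Bandwidth the application would generate at this execution time.
        implied_bw = n_misses * CACHE_LINE_BYTES / time_s / 1e9
        if abs(implied_bw - bw) < tol:
            return bw, lat_ns, time_s
        bw = 0.5 * (bw + implied_bw)  # damping keeps the iteration stable
    raise RuntimeError("operating point did not converge")

# Hypothetical target-memory curve and application parameters.
TARGET_CURVE = [(0.0, 80.0), (20.0, 90.0), (40.0, 120.0), (60.0, 250.0)]
op_bw, op_lat_ns, op_time_s = solve_operating_point(
    TARGET_CURVE, n_misses=1e9, t_cpu_s=1.0, overlap=50.0, bw_start=21.3)
```

The iteration stops when the bandwidth used to look up the latency matches the bandwidth implied by the resulting execution time, which is the self-consistent operating point on the target curve.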
How It Works
PROFET requires three main inputs:
- Memory System Profiling: Bandwidth-latency curves for both baseline and target memory systems, obtained through microbenchmarks that measure latency under varying bandwidth loads.
- Application Profiling: Hardware performance counter measurements from running the application on the baseline system, including CPU cycles, instructions, last-level cache misses, and memory bandwidth usage. Profiling is done in regular time intervals (segments) to capture changing application behavior.
- CPU Parameters: Basic processor characteristics such as reorder buffer (ROB) size, miss status holding register (MSHR) capacity, and minimum theoretical cycles-per-instruction (CPI).
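The per-segment application profile listed above can be sketched as a small record of counter values with a few derived metrics. The `Segment` class and its field names are hypothetical and do not reflect PROFET's actual input format. The example also assumes that memory traffic is available as a byte count.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Counter readings for one profiling interval (hypothetical layout)."""
    duration_s: float
    cycles: int
    instructions: int
    llc_misses: int
    bytes_transferred: int  # assumed: total read + write traffic to memory

    @property
    def cpi(self) -> float:
        """Cycles per instruction in this segment."""
        return self.cycles / self.instructions

    @property
    def mpki(self) -> float:
        """Last-level cache misses per thousand instructions."""
        return 1000 * self.llc_misses / self.instructions

    @property
    def bandwidth_gbs(self) -> float:
        """Used memory bandwidth in GB/s."""
        return self.bytes_transferred / self.duration_s / 1e9

# Made-up numbers for a single one-second segment.
seg = Segment(duration_s=1.0, cycles=2_000_000_000,
              instructions=1_000_000_000, llc_misses=5_000_000,
              bytes_transferred=10_000_000_000)
```

Keeping one record per time interval is what lets a model follow phases in which the application's memory behavior changes.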
Using these inputs, PROFET's analytical model estimates the application's position on the target memory system's bandwidth-latency curve. The model accounts for out-of-order execution, memory-level parallelism (overlapping memory accesses), and how the CPU can hide memory latency by executing independent instructions. The power and energy models then estimate consumption based on the predicted performance and memory power parameters.
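The link between predicted performance and energy rests on the basic relation energy = power × time, summed over segments. The sketch below shows only that relation with made-up numbers. It is not PROFET's power model, and the function name `energy_joules` is hypothetical.

```python
def energy_joules(segments):
    """Sum power * duration over (watts, seconds) segments."""
    return sum(power_w * duration_s for power_w, duration_s in segments)

# Hypothetical case: the target memory draws more power but the
# application finishes sooner on it.
baseline = [(100.0, 6.0), (100.0, 4.0)]   # 10 s at 100 W
target = [(110.0, 5.0), (110.0, 3.0)]     # 8 s at 110 W

baseline_j = energy_joules(baseline)
target_j = energy_joules(target)
relative_change = (target_j - baseline_j) / baseline_j
```

In this made-up case the target system uses 12% less energy despite its 10% higher power, because the shorter execution time dominates.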
Validation & Accuracy
PROFET has been extensively evaluated on real hardware platforms:
- Sandy Bridge-EP E5-2670: Tested with DDR3-800/1066/1333/1600 memory configurations
- Knights Landing Xeon Phi: Evaluated with DDR4-2400 and 3D-stacked MCDRAM
- Emerald Rapids: Tested with DDR5-4800/3200 memory configurations
- Huawei Kunpeng 920: Evaluated with DDR4-1600/1866/2933 memory configurations
- Cascade Lake: Tested with DDR4 and Optane memory configurations
- Intel Max 9480: Evaluated with DDR5-4800 and HBM2 memory configurations
On average, PROFET's predictions differ from actual hardware measurements by only 2% for performance, 1.1% for power, and 1.7% for energy consumption. This accuracy holds even when the compared memory systems have fundamentally different characteristics, such as DDR4 and high-bandwidth MCDRAM, which can differ 4.2-fold in bandwidth and have different lead-off latencies.
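An average difference of this kind can be computed as a mean absolute percentage error between predicted and measured values. The text does not state the exact metric used in the evaluation, so the definition below is an assumption, and the sample values are invented.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean of |predicted - measured| / measured, expressed in percent."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100 * total / len(measured)

# Invented execution times (seconds) for three benchmarks.
predicted_s = [10.2, 19.6, 30.0]
measured_s = [10.0, 20.0, 30.0]
error_pct = mean_abs_pct_error(predicted_s, measured_s)
```

Taking the absolute value per benchmark prevents overestimates and underestimates from cancelling each other out in the average.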
PROFET is also more accurate than hardware simulators such as ZSim+DRAMSim2 while running three orders of magnitude faster. This speed advantage makes PROFET practical for analyzing production HPC applications and exploring numerous design options within reasonable timeframes.
Key Advantages
Real Hardware Instrumentation
Uses actual CPU microarchitectural details including prefetchers and out-of-order engines, avoiding simulator simplifications.
Exceptional Accuracy
On average, within about 2% of actual hardware measurements for performance, power, and energy.
Three Orders of Magnitude Faster
Runs three orders of magnitude faster than hardware simulators such as ZSim+DRAMSim2, which enables analysis of production HPC applications and exploration of many design options.
Platform Portability
Works on any platform supporting hardware performance counters. Adaptation to new platforms requires only adjusting a few CPU parameters.
Comprehensive Modeling
Predicts performance, power, and energy consumption, providing complete system-level insights.
Memory Technology Support
Validated on DDR3, DDR4, DDR5, HBM, MCDRAM, and Optane. Can be extended to emerging memory technologies.
Use Cases
PROFET is particularly valuable for:
- Evaluating how applications would perform on future memory systems before hardware is available
- Comparing different memory configurations (e.g., DDR4 vs. MCDRAM, different frequencies) for HPC workloads
- Optimizing memory system selection for specific application characteristics
- Understanding the performance-energy tradeoffs of different memory technologies
- Rapid exploration of design space for memory system architects
Getting Started
PROFET is available as an open-source Python package that can be installed via pip. The tool provides both a command-line interface (CLI) for quick predictions and a Python API for integration into your own workflows. For detailed installation instructions, usage examples, and technical documentation, please visit the PROFET GitHub repository.