# ROCmPort AI Benchmarking Guide

This document defines how to report ROCmPort AI performance results without overclaiming.

## Reporting Principles

- Compare against a clearly stated baseline.
- Use reproducible runs with fixed input sizes and environment details.
- Include correctness checks before accepting performance numbers.
- Report failures and non-improving cases, not only wins.

## Baseline Definitions

Use one of these and name it explicitly in each table:

- Baseline A: Straight `hipify-clang` output with minimal manual edits.
- Baseline B: Existing hand-written HIP version from the team.

Recommended: use Baseline A when measuring the value of migration automation, since it shows what the optimization adds over unassisted `hipify-clang` output.

Quick answer format for live review:

- Q: What is your baseline?
- A: Straight hipify output with minimal compile edits (Baseline A), measured on the same hardware and inputs.

## Required Environment Metadata

Always include:

- GPU model (for example MI300X) and memory size.
- ROCm version, compiler version, and profiler version.
- OS and driver versions.
- Kernel launch parameters and input sizes.
- Number of runs and aggregation rule (median recommended).
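
One way to keep this metadata attached to every result is to record it in a small structure and write it out next to the numbers. The following is a minimal sketch, not part of ROCmPort AI: the `RunEnvironment` class and all of its field names are hypothetical, and every GPU, ROCm, and driver value is assumed to be filled in by hand or by your own tooling. Only the OS description is detected automatically, through Python's standard `platform` module.

```python
import json
import platform
from dataclasses import asdict, dataclass


@dataclass
class RunEnvironment:
    """Hypothetical record of the metadata listed above (names are illustrative)."""

    gpu_model: str
    gpu_memory_gb: int
    rocm_version: str
    compiler_version: str
    profiler_version: str
    driver_version: str
    launch_params: str
    input_size: str
    num_runs: int
    aggregation: str = "median"
    os_description: str = ""

    def __post_init__(self):
        # Fall back to the host's own description when none is supplied.
        if not self.os_description:
            self.os_description = platform.platform()


def environment_to_json(env: RunEnvironment) -> str:
    """Serialize the record so it can be stored alongside the results table."""
    return json.dumps(asdict(env), indent=2, sort_keys=True)
```

Storing the JSON file in the same directory as the profiler artifacts makes it harder for a table to circulate without its environment details.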

## Required Measurement Fields

For each kernel tested, provide:

- Kernel name and workload shape.
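- Baseline latency (ms).
- Optimized latency (ms).
- Speedup ratio (baseline latency divided by optimized latency; values below 1.0x indicate a regression).
- Correctness status (pass/fail, plus the checksum or tolerance used).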
- Notes on optimization strategy.
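
The correctness status can be produced by an element-wise tolerance comparison between a reference output and the optimized kernel's output. The sketch below assumes both outputs have already been copied to the host as flat sequences of floats. The function name `check_correctness` and the default tolerances are illustrative choices, not values prescribed by this guide; report whichever tolerance you actually use.

```python
import math


def check_correctness(reference, candidate, rel_tol=1e-5, abs_tol=1e-8):
    """Return (passed, max_abs_error) for two equal-length float sequences.

    The default tolerances are assumptions; choose values that suit the
    kernel's numeric type and accumulation order.
    """
    if len(reference) != len(candidate):
        return False, float("inf")

    passed = True
    max_abs_error = 0.0
    for ref, cand in zip(reference, candidate):
        max_abs_error = max(max_abs_error, abs(ref - cand))
        # math.isclose returns False for NaN, so NaN outputs fail the check.
        if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
            passed = False
    return passed, max_abs_error
```

Reporting the maximum absolute error alongside pass/fail lets a reviewer judge how close a passing result was to the tolerance.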

Example table format:

| Kernel | Shape | Baseline (ms) | Optimized (ms) | Speedup | Correctness | Notes |
|---|---|---:|---:|---:|---|---|
| matrix_multiply | 1024x1024 | 12.4 | 9.5 | 1.31x | pass | LDS tiling + wavefront-aware launch |

Include non-win cases in the same table. Example:

| Kernel | Shape | Baseline (ms) | Optimized (ms) | Speedup | Correctness | Notes |
|---|---|---:|---:|---:|---|---|
| sparse_scatter | 4M elements | 6.0 | 6.3 | 0.95x | pass | Irregular access pattern; optimization did not help |
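
The speedup column in the tables above is the baseline latency divided by the optimized latency. A small helper can compute it and emit a row in the same format, which avoids hand-typed ratios drifting from the measured values. This is a sketch under that assumption; `speedup` and `table_row` are hypothetical helper names.

```python
def speedup(baseline_ms, optimized_ms):
    """Speedup ratio: baseline latency divided by optimized latency."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


def table_row(kernel, shape, baseline_ms, optimized_ms, correct, notes):
    """Format one result as a Markdown row matching the example tables."""
    ratio = speedup(baseline_ms, optimized_ms)
    status = "pass" if correct else "fail"
    return (
        f"| {kernel} | {shape} | {baseline_ms:.1f} | {optimized_ms:.1f} "
        f"| {ratio:.2f}x | {status} | {notes} |"
    )
```

Called with the values from the two example rows, `table_row` reproduces them, including the `0.95x` regression for `sparse_scatter`.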

## Reproducibility Checklist

Before publishing numbers, verify all items:

- Same input set for baseline and optimized runs.
- Warm-up runs excluded from the measurements, or handled identically for baseline and optimized runs.
- At least 3 measured runs (prefer 5+) with median reported.
- No undocumented manual edits to the optimized output.
- Full command lines and profiler artifacts retained.
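
The warm-up and median rules in this checklist can be enforced by a small timing harness. The sketch below measures host wall-clock time around a callable `fn`, which is a hypothetical stand-in for whatever launches your kernel. It assumes `fn` blocks until the GPU work has finished (for example by synchronizing the device before returning); otherwise the timing only covers the launch. Profiler or device-event timings are a reasonable alternative and are not shown here.

```python
import statistics
import time


def measure_ms(fn, warmup=2, runs=5):
    """Time `fn` in milliseconds: discard warm-up runs, report the median.

    `fn` is assumed to return only after the measured work has completed.
    """
    if runs < 3:
        raise ValueError("use at least 3 measured runs")

    for _ in range(warmup):
        fn()  # warm-up runs are executed but never recorded

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
        "samples_ms": samples_ms,
    }
```

Using the same `warmup` and `runs` values for the baseline and the optimized kernel keeps the two measurements comparable, and retaining `samples_ms` preserves the raw data behind the reported median.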

## Evidence Package for Review

A technical review package should include:

- CUDA source input.
- Baseline HIP output.
- Optimized HIP output.
- Compile logs and profiler summaries.
- Final report explaining what changed and why.
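
Before sending a package for review, it can help to check mechanically that every item above is present. The following sketch assumes the package is described by a dictionary that maps an artifact label to a file or directory path. The labels in `REQUIRED_ARTIFACTS` and the function `missing_artifacts` are hypothetical names chosen for this example; the guide does not prescribe a package layout.

```python
from pathlib import Path

# Hypothetical labels, one per item in the list above.
REQUIRED_ARTIFACTS = (
    "cuda_source",
    "baseline_hip",
    "optimized_hip",
    "compile_logs",
    "profiler_summaries",
    "final_report",
)


def missing_artifacts(package):
    """Return the labels whose path is absent or does not exist on disk."""
    missing = []
    for label in REQUIRED_ARTIFACTS:
        path = package.get(label)
        if path is None or not Path(path).exists():
            missing.append(label)
    return missing
```

An empty return value means every listed artifact was found; it says nothing about whether the contents are complete or correct.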

## Interpreting Results Responsibly

- Some kernels will regress or fail initially; this is normal for migration.
- The size of any improvement depends on memory behavior, occupancy, and control-flow patterns.
- Do not claim universal speedups.

Preferred claim style:

"ROCmPort AI improved X out of Y tested kernels against a stated baseline under reproducible MI300X conditions."
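
A claim in this style can be generated directly from the results so that X and Y always match the table. The sketch below assumes each result is a dictionary with the hypothetical keys `baseline_ms`, `optimized_ms`, and `correct`. It counts a kernel as improved only when it passes correctness and its optimized latency is lower than the baseline; that counting rule is an assumption of this example.

```python
def summarize_claim(results, gpu="MI300X", baseline="Baseline A"):
    """Build a claim sentence from per-kernel results (keys are illustrative)."""
    tested = len(results)
    failed = sum(1 for r in results if not r["correct"])
    improved = sum(
        1 for r in results
        if r["correct"] and r["optimized_ms"] < r["baseline_ms"]
    )
    # Correct kernels that were not faster are reported, not hidden.
    non_improving = tested - improved - failed
    return (
        f"ROCmPort AI improved {improved} out of {tested} tested kernels "
        f"against {baseline} under reproducible {gpu} conditions "
        f"({non_improving} non-improving, {failed} failed correctness)."
    )
```

Appending the non-improving and failed counts keeps the sentence consistent with the principle of reporting failures alongside wins.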

## Current Repository Status

The repository includes demo kernels intended to exercise migration behavior.
Treat any sample numbers as demonstrations, not benchmark results, unless they are accompanied by full reproducibility artifacts from your own environment.