Spaces: lablab-ai-amd-developer-hackathon / ROCmPort-AI

tazwarrrr committed on May 8

Commit e7a1a69

1 Parent(s): 2fe80fd

docs fix

Files changed (9)

Dockerfile +3 -1
README.md +16 -11
backend/tools/static_analyzer.py +19 -1
dataset/finetune_qwen.py +22 -9
dataset/requirements-finetune.txt +6 -0
docs/FAILURE_CASES.md +3 -2
docs/JUDGE_MODE.md +36 -30
docs/LIVE_RESULTS.md +22 -32
docs/benchmark_runs/mi300x_results.txt +25 -14

Dockerfile CHANGED

@@ -5,11 +5,13 @@ RUN npm ci
 COPY frontend/ ./
 RUN npm run build
-FROM rocm/dev-ubuntu-22.04:latest
 WORKDIR /app
 COPY backend/requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 COPY . .
 COPY --from=frontend-build /app/frontend/dist ./frontend/dist
 EXPOSE 8000
 CMD ["uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8000"]

 COPY frontend/ ./
 RUN npm run build
+FROM rocm/dev-ubuntu-22.04:7.2.2-complete
 WORKDIR /app
 COPY backend/requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 COPY . .
 COPY --from=frontend-build /app/frontend/dist ./frontend/dist
+# Runtime envs: GROQ_API_KEY, ROCM_AVAILABLE, HIPCC_PATH, ROCPROF_PATH.
+# Pass secrets at docker run/deploy time; do not bake .env into the image.
 EXPOSE 8000
 CMD ["uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8000"]
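
The runtime variables named in the Dockerfile comment could be read by the backend along these lines. This is a minimal sketch only: `load_runtime_config` is a hypothetical helper, and the truthy-string parsing and the `hipcc`/`rocprof` defaults are assumptions rather than the project's actual code.

```python
import os


def load_runtime_config(env=None):
    """Hypothetical helper: collect the runtime settings named in the Dockerfile comment."""
    env = os.environ if env is None else env
    # Assumption: the flag is a string such as "true" or "false".
    rocm_flag = env.get("ROCM_AVAILABLE", "false").strip().lower()
    return {
        "groq_api_key": env.get("GROQ_API_KEY"),  # secret, supplied at run time
        "rocm_available": rocm_flag in {"1", "true", "yes"},
        "hipcc_path": env.get("HIPCC_PATH", "hipcc"),
        "rocprof_path": env.get("ROCPROF_PATH", "rocprof"),
    }


if __name__ == "__main__":
    print(load_runtime_config({"ROCM_AVAILABLE": "true"}))
```

Passing the values with `docker run -e GROQ_API_KEY=...` keeps the key out of the image layers, which is the point of the comment added in this hunk.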

README.md CHANGED

@@ -52,17 +52,17 @@ If the optimized output underperforms the baseline, the coordinator retries the
 ---
-## Live Results on AMD Instinct MI300X
-All numbers from real MI300X hardware — AMD DevCloud, gfx942, ROCm 7.2. No simulated data.
 | Kernel | Input | Baseline HIP | Optimized HIP | Result |
 |--------|-------|-------------|---------------|--------|
-| matrix_multiply | 512×512 fp32 | 0.068ms | 0.026ms | **2.61× speedup** |
-| reduction | 16M elements | wrong output | 0.019ms | **PASS (wavefront-64 fix)** |
-| vector_add | 32M elements | — | 0.099ms | **4,077 GB/s (77% peak)** |
-Hardware: AMD Instinct MI300X VF, 192GB HBM3, ROCm 7.2
 ---
@@ -164,14 +164,19 @@ start.bat
 ./start.sh
 # Manual
-cd backend
-pip install -r requirements.txt
 cp .env.example .env
 # Add GROQ_API_KEY
-uvicorn main:app --reload --port 8000
 ```
-Open `frontend/index.html` in a browser.
 ### Docker
@@ -255,4 +260,4 @@ ROCmPort AI/
 ## License
-Apache 2.0 — see [`LICENSE`](LICENSE)

 ---
+## Reproducible Demo Results
+These numbers are deterministic `demo_artifact` values returned by the backend when `ROCM_AVAILABLE=false`. Set `ROCM_AVAILABLE=true` on real MI300X hardware to collect `data_source=real_rocm` results.
 | Kernel | Input | Baseline HIP | Optimized HIP | Result |
 |--------|-------|-------------|---------------|--------|
+| matrix_multiply | demo artifact | 121.4ms | 89.1ms | **1.36x speedup** |
+| reduction | demo artifact | 88.2ms | 68.7ms | **1.28x speedup** |
+| vector_add | demo artifact | 45.1ms | 38.2ms | **1.18x speedup** |
+Hardware class: AMD Instinct MI300X, 192GB HBM3, wavefront=64
 ---
 ./start.sh
 # Manual
+python -m venv .venv
+# Windows: .venv\Scripts\activate
+# Linux/Mac:
+. .venv/bin/activate
+pip install -r backend/requirements.txt
 cp .env.example .env
 # Add GROQ_API_KEY
+npm --prefix frontend install
+npm --prefix frontend run build
+python -m uvicorn backend.main:app --reload --port 8000
 ```
+Open `http://localhost:8000/index.html` in a browser.
 ### Docker
 ## License
+Apache 2.0 — see [`LICENSE`](LICENSE)

backend/tools/static_analyzer.py CHANGED

@@ -58,6 +58,15 @@ _PATTERNS: List[tuple] = [
         "Replace __ballot_sync(0xffffffff, cond) with __ballot(cond). "
         "The return type changes from uint32_t to uint64_t — update downstream bitmask logic."
     ),
     (
         "activemask_warp",
         re.compile(r'\b__activemask\s*\(\s*\)', re.MULTILINE),
@@ -86,7 +95,7 @@ _PATTERNS: List[tuple] = [
     (
         "inline_ptx_block",
         re.compile(r'asm\s+volatile\s*\(', re.MULTILINE),
-        "HIGH",
         "Inline PTX assembly is NVIDIA-specific ISA. hipify cannot translate PTX semantics. "
         "The kernel may compile under hipcc but will have undefined or incorrect behaviour.",
         "Replace inline PTX with portable HIP intrinsics or CDNA ISA equivalents. "
@@ -101,6 +110,15 @@ _PATTERNS: List[tuple] = [
         "Replace with #include <hip/hip_runtime.h>. "
         "hipify-clang does this automatically in its first pass."
     ),
     (
         "shared_memory_no_padding",
         re.compile(r'__shared__\s+\w+\s+\w+\s*\[\s*\d+\s*\]', re.MULTILINE),

         "Replace __ballot_sync(0xffffffff, cond) with __ballot(cond). "
         "The return type changes from uint32_t to uint64_t — update downstream bitmask logic."
     ),
+    (
+        "shfl_wavefront_offset_16",
+        re.compile(r'\b__shfl(?:_down|_up|_xor)?\s*\([^;]*,\s*16\s*(?:,|\))', re.MULTILINE),
+        "HIGH",
+        "__shfl* with offset=16 often encodes a 32-lane warp reduction tail. "
+        "On AMD wavefront=64 the reduction should include an offset=32 step first.",
+        "Audit the shuffle reduction and add a wavefront-64 step, e.g. offset=32 "
+        "before offset=16 where the algorithm reduces a full wavefront."
+    ),
     (
         "activemask_warp",
         re.compile(r'\b__activemask\s*\(\s*\)', re.MULTILINE),
     (
         "inline_ptx_block",
         re.compile(r'asm\s+volatile\s*\(', re.MULTILINE),
+        "CRITICAL",
         "Inline PTX assembly is NVIDIA-specific ISA. hipify cannot translate PTX semantics. "
         "The kernel may compile under hipcc but will have undefined or incorrect behaviour.",
         "Replace inline PTX with portable HIP intrinsics or CDNA ISA equivalents. "
         "Replace with #include <hip/hip_runtime.h>. "
         "hipify-clang does this automatically in its first pass."
     ),
+    (
+        "cuda_library_dependency",
+        re.compile(r'#\s*include\s*[<"][^>"]*(?:cub|thrust|cudnn)[^>"]*[>"]|\b(?:cub|thrust|cudnn)::', re.MULTILINE),
+        "HIGH",
+        "CUDA library dependency detected. hipify can rename some CUB/Thrust/cuDNN symbols, "
+        "but API coverage and performance behavior are not guaranteed to match rocPRIM/hipCUB/MIOpen.",
+        "Manually review the translated library call, compare against rocPRIM/hipCUB/MIOpen, "
+        "and add correctness/performance tests for the specific primitive."
+    ),
     (
         "shared_memory_no_padding",
         re.compile(r'__shared__\s+\w+\s+\w+\s*\[\s*\d+\s*\]', re.MULTILINE),
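
To see what the new `shfl_wavefront_offset_16` pattern does and does not catch, the regex from the diff can be exercised on a few one-line snippets. The helper name `flags_offset_16` and the sample strings are illustrative only; the pattern itself is copied verbatim from the hunk above.

```python
import re

# Pattern copied from the "shfl_wavefront_offset_16" entry in the diff.
SHFL_OFFSET_16 = re.compile(
    r'\b__shfl(?:_down|_up|_xor)?\s*\([^;]*,\s*16\s*(?:,|\))', re.MULTILINE
)


def flags_offset_16(source: str) -> bool:
    """Return True if the pattern reports a shuffle with a literal offset of 16."""
    return SHFL_OFFSET_16.search(source) is not None


if __name__ == "__main__":
    samples = [
        "val += __shfl_down(val, 16);",                # unsuffixed, offset 16
        "val += __shfl_down(val, 32);",                # offset 32 only
        "val += __shfl_down_sync(0xffffffff, val, 16);",  # _sync variant
        "x = __shfl_down(val, 32) + pad(y, 16);",      # 16 belongs to another call
    ]
    for code in samples:
        print(flags_offset_16(code), code)
```

Two things stand out in this sketch. The `_sync` variants are not matched here, which fits the FAILURE_CASES.md wording that they are flagged by a separate rule. And because `[^;]*` spans the rest of the statement, a `16` argument in a different call on the same line can trigger a match, so findings from this rule are best treated as prompts for manual audit.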

dataset/finetune_qwen.py CHANGED

@@ -1,4 +1,6 @@
 # finetune_qwen.py
 from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
 from peft import LoraConfig, get_peft_model, TaskType
 from trl import SFTTrainer
@@ -8,7 +10,12 @@ import torch
 MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"
 DATASET = "tazwarrrr/cuda-to-rocm-wavefront-bugs"
 OUTPUT = "/workspace/rocmport-qwen-finetuned"
-HF_TOKEN = "hf_YOUR_TOKEN_HERE"   # <-- paste your write token
 # Load dataset
 ds = load_dataset(DATASET)
@@ -32,15 +39,21 @@ def format_example(example):
 formatted = ds.map(format_example)
 # Load model
 tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
     MODEL,
-    torch_dtype=torch.float16,
-    device_map="auto",
     trust_remote_code=True
 )
 # LoRA config
 lora_config = LoraConfig(
@@ -61,7 +74,8 @@ args = TrainingArguments(
     gradient_accumulation_steps=4,
     warmup_steps=10,
     learning_rate=2e-4,
-    fp16=True,
     logging_steps=5,
     save_strategy="epoch",
     report_to="none"
@@ -70,7 +84,7 @@ args = TrainingArguments(
 trainer = SFTTrainer(
     model=model,
     tokenizer=tokenizer,
-    train_dataset=formatted["train"],
     dataset_text_field="text",
     max_seq_length=2048,
     args=args
@@ -80,8 +94,7 @@ trainer.train()
 trainer.save_model(OUTPUT)
 # Push to HuggingFace
-model.push_to_hub("tazwarrrr/rocmport-qwen-wavefront-finetuned",
-                  token=HF_TOKEN)
-tokenizer.push_to_hub("tazwarrrr/rocmport-qwen-wavefront-finetuned",
-                      token=HF_TOKEN)
 print("Done. Model pushed to HuggingFace.")

 # finetune_qwen.py
+import os
 from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
 from peft import LoraConfig, get_peft_model, TaskType
 from trl import SFTTrainer
 MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"
 DATASET = "tazwarrrr/cuda-to-rocm-wavefront-bugs"
 OUTPUT = "/workspace/rocmport-qwen-finetuned"
+HF_TOKEN = os.environ.get("HF_TOKEN")
+if not HF_TOKEN:
+    raise RuntimeError("Set HF_TOKEN in the environment before running fine-tuning.")
+REPO_ID = "tazwarrrr/rocmport-qwen-wavefront-finetuned"
+os.makedirs(OUTPUT, exist_ok=True)
 # Load dataset
 ds = load_dataset(DATASET)
 formatted = ds.map(format_example)
+if hasattr(formatted, "keys"):
+    train_split = "train" if "train" in formatted else next(iter(formatted.keys()))
+    train_dataset = formatted[train_split]
+else:
+    train_dataset = formatted
 # Load model
 tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
     MODEL,
+    torch_dtype=torch.bfloat16,
     trust_remote_code=True
 )
+if torch.cuda.is_available():
+    model.to("cuda")
 # LoRA config
 lora_config = LoraConfig(
     gradient_accumulation_steps=4,
     warmup_steps=10,
     learning_rate=2e-4,
+    bf16=torch.cuda.is_available(),
+    fp16=False,
     logging_steps=5,
     save_strategy="epoch",
     report_to="none"
 trainer = SFTTrainer(
     model=model,
     tokenizer=tokenizer,
+    train_dataset=train_dataset,
     dataset_text_field="text",
     max_seq_length=2048,
     args=args
 trainer.save_model(OUTPUT)
 # Push to HuggingFace
+merged_model = model.merge_and_unload()
+merged_model.push_to_hub(REPO_ID, token=HF_TOKEN)
+tokenizer.push_to_hub(REPO_ID, token=HF_TOKEN)
 print("Done. Model pushed to HuggingFace.")
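
The split-selection fallback added in this file can be tried in isolation. The sketch below mirrors that logic with plain dicts and lists standing in for the `datasets` objects; `pick_train_split` is a hypothetical name, and the stand-in types are an assumption made so the example runs without downloading anything.

```python
def pick_train_split(formatted):
    """Mirror the diff's fallback: prefer "train", else the first split, else the object itself."""
    if hasattr(formatted, "keys"):
        # Dict-like container of splits (stand-in for a DatasetDict).
        split = "train" if "train" in formatted else next(iter(formatted.keys()))
        return formatted[split]
    # Already a single dataset (stand-in: a plain list).
    return formatted


if __name__ == "__main__":
    print(pick_train_split({"train": ["a"], "test": ["b"]}))
    print(pick_train_split({"validation": ["c"]}))
    print(pick_train_split(["d"]))
```

Falling back to the first available split avoids a `KeyError` when a dataset ships without a split named `train`, at the cost of silently training on whatever split comes first.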

dataset/requirements-finetune.txt ADDED

@@ -0,0 +1,6 @@
+accelerate==0.34.2
+datasets==3.1.0
+peft==0.13.2
+torch
+transformers==4.46.3
+trl==0.9.6

docs/FAILURE_CASES.md CHANGED

@@ -70,8 +70,9 @@ cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_item
 **What hipify does**: renames cudaFree to hipFree, cuda headers to hip headers.
 Does NOT fix the shuffle semantics.
-**What ROCmPort AI does**: flags both shuffle calls as HIGH risk,
-identifies the offset=16 assumption, suggests wavefront-64 aware rewrite.
 **Status**: Compiled and executed on AMD Instinct MI300X (gfx942), ROCm 7.2.
 Numerical correctness not verified — requires reference CPU implementation.

 **What hipify does**: renames cudaFree to hipFree, cuda headers to hip headers.
 Does NOT fix the shuffle semantics.
+**What ROCmPort AI does**: flags `__shfl_sync` family calls as CRITICAL risk,
+and flags unsuffixed `__shfl_down(..., 16)` style reductions as HIGH risk.
+It identifies the offset=16 assumption and suggests a wavefront-64 aware rewrite.
 **Status**: Compiled and executed on AMD Instinct MI300X (gfx942), ROCm 7.2.
 Numerical correctness not verified — requires reference CPU implementation.
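
The offset=16 assumption can be illustrated without a GPU. The following is a pure-Python model of a shuffle-down reduction over 64 lanes, not HIP code: `shfl_down` and `reduce_lane0` are hypothetical names, and out-of-range lanes are modelled as reading their own value.

```python
def shfl_down(values, offset):
    """Model of a shuffle-down: lane i reads lane i+offset, or its own value if out of range."""
    width = len(values)
    return [values[i + offset] if i + offset < width else values[i]
            for i in range(width)]


def reduce_lane0(values, first_offset):
    """Tree-reduce by halving the offset each step and return the total seen by lane 0."""
    vals = list(values)
    offset = first_offset
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]


if __name__ == "__main__":
    wavefront = [1] * 64                    # 64 lanes, each holding 1
    print(reduce_lane0(wavefront, 16))      # warp-32 style tail: 32
    print(reduce_lane0(wavefront, 32))      # wavefront-64 aware: 64
```

Starting at offset 16, lane 0 only ever accumulates lanes 0 to 31, so half of a 64-lane wavefront is dropped without any error being raised. Starting at offset 32 covers all 64 lanes.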

docs/JUDGE_MODE.md CHANGED

@@ -1,42 +1,48 @@
 # Judge Mode Walkthrough
-Use this sequence during technical evaluation.
 ## Goal
-Make every claim falsifiable and easy to verify.
 ## Flow
-1. Show raw CUDA input.
-2. Run baseline translation only (straight hipify output).
-3. Show baseline compile/profiler result.
-4. Run full ROCmPort AI loop.
-5. Show each agent event and decisions.
-6. Compare final output against the declared baseline.
-7. Show one weak result (small gain or no gain) and explain why.
 ## Baseline Policy
 - Primary baseline: straight hipify output with minimal required compile edits.
-- Never switch baselines mid-demo.
-- Repeat baseline definition before showing speedup.
-## Required Artifacts
-- CUDA source.
-- Baseline HIP output.
-- Optimized HIP output.
-- Compile logs.
-- Profiler summary.
-- Final report with rationale.
-## Suggested Script
-- "Here is the original CUDA kernel."
-- "Here is baseline HIP produced by hipify only."
-- "Now we run the orchestration loop and show each decision."
-- "This is the final code diff and measured result versus baseline."
-- "Here is a case where gain is limited, and why."
 ## Pass/Fail Criteria
 A demo is credible if:
-- Baseline is explicit.
-- Intermediate artifacts are visible.
-- At least one non-win case is included.
-- Reasoning matches observed profiler data.

 # Judge Mode Walkthrough
+Use this sequence during technical evaluation with the current React UI and
+FastAPI SSE stream.
 ## Goal
+Make every claim falsifiable and tied to fields returned by the backend.
 ## Flow
+1. Open `http://localhost:8000/index.html`.
+2. Choose or paste a CUDA kernel.
+3. Run ROCmPort AI and watch the five agent cards:
+   analyzer, translator, optimizer, tester, coordinator.
+4. Confirm the tester event reports speedup, bandwidth, bottleneck, and data source.
+5. Confirm the coordinator event produces the final report JSON in its SSE `detail`.
+6. Use `/benchmark-report` for reproducible demo-artifact metrics and data-source labels.
+7. Show a limited-gain case such as `vector_add` and explain the bandwidth-bound result.
 ## Baseline Policy
 - Primary baseline: straight hipify output with minimal required compile edits.
+- Demo-mode baselines come from `backend/tools/demo_artifacts.py`.
+- Real hardware baselines require `ROCM_AVAILABLE=true` and captured `hipcc`/`rocprof` logs.
+- Never mix `demo_artifact` and `real_rocm` numbers in the same result table.
+## Visible Artifacts In Current UI
+- CUDA source input.
+- Agent event stream.
+- Tester summary: execution time, bandwidth utilization, bottleneck, notes.
+- Final summary footer: changes made, critical bugs found, compile/migration success, data source.
+## Additional Artifacts Available By API
+- `/benchmark-report`: reproducible benchmark summary and static risk scans.
+- `/export`: migration diff, original CUDA, optimized HIP, and report markdown.
+- `/demo-kernels`: source for bundled demo kernels.
 ## Pass/Fail Criteria
 A demo is credible if:
+- Every speedup is tied to its `data_source`.
+- The baseline definition is stated before showing speedup.
+- Static risk findings match the analyzer event or `/benchmark-report`.
+- At least one non-perfect or limited-gain case is included.
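
The rule against mixing `demo_artifact` and `real_rocm` numbers in one table could be enforced mechanically. A minimal sketch, assuming each result row is a dict with a `data_source` key; the row shape and the name `check_single_data_source` are assumptions, not the backend's actual schema.

```python
def check_single_data_source(rows):
    """Return the shared data_source of a result table, or raise if sources are mixed."""
    sources = {row["data_source"] for row in rows}
    if len(sources) != 1:
        raise ValueError(f"expected exactly one data_source, got {sorted(sources)}")
    return sources.pop()


if __name__ == "__main__":
    table = [
        {"kernel": "matrix_multiply", "speedup": 1.36, "data_source": "demo_artifact"},
        {"kernel": "vector_add", "speedup": 1.18, "data_source": "demo_artifact"},
    ]
    print(check_single_data_source(table))
```

Running a check like this before rendering a results table makes the first pass/fail criterion something a reviewer can verify rather than take on trust.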

docs/LIVE_RESULTS.md CHANGED

@@ -1,40 +1,30 @@
-# Live Results — AMD Instinct MI300X (gfx942), ROCm 7.2
-All kernels compiled with `hipcc --offload-arch=gfx942 -O3` and
-benchmarked on real AMD DevCloud hardware. No simulated data.
 ## Benchmark Results
-| Kernel | Input Size | Baseline HIP (ms) | Optimized HIP (ms) | Speedup | Notes |
-|--------|------------|-------------------|-------------------|---------|-------|
-| matrix_multiply | 512x512 fp32 | 0.068 | 0.026 | **2.61x** | Shared memory tiling |
-| reduction | 16M elements fp32 | — | 0.019 | — | Wavefront-64 fix verified PASS |
-| vector_add | 32M elements fp32 | — | 0.099 | — | 4077.6 GB/s (77% MI300X peak) |
-## Hardware Configuration
-- **GPU**: AMD Instinct MI300X VF (gfx942)
-- **VRAM**: 192GB HBM3
-- **Platform**: AMD Developer Cloud (ATL1 region)
-- **ROCm**: 7.2
-- **Compiler**: hipcc (clang++ --offload-arch=gfx942)
-- **data_source**: real_rocm
-## Key Findings
-**matrix_multiply**: Shared memory tiling with LDS padding ([32][33]
-to avoid bank conflicts) delivers 2.61x over naive global memory access
-on gfx942. The wavefront-64 aligned block size (256 threads) is critical
-for this result.
-**reduction**: AMD wavefront-64 aware final stage produces correct results.
-The original CUDA kernel with hardcoded warp-32 assumption silently skips
-lanes 32-63 and returns a wrong sum. ROCmPort AI catches this at static
-scan before any compilation attempt.
-**vector_add**: 4077.6 GB/s achieved on a memory-bound kernel — 77% of
-MI300X's 5.3 TB/s theoretical HBM3 peak. This demonstrates the bandwidth
-advantage of MI300X over H100 (3.35 TB/s peak) for memory-bound workloads.
-## Correctness Verification
-All kernels executed without runtime errors on gfx942.

+# Reproducible Results
+The backend returns deterministic benchmark artifacts unless `ROCM_AVAILABLE=true`
+is set on real ROCm hardware. These values come from
+`backend/tools/demo_artifacts.py` and are labelled `data_source="demo_artifact"`
+in API responses.
 ## Benchmark Results
+| Kernel | Baseline HIP (ms) | Optimized HIP (ms) | Speedup | Bandwidth | Bottleneck |
+|--------|-------------------|--------------------|---------|-----------|------------|
+| matrix_multiply | 121.4 | 89.1 | 1.36x | 1843.7 GB/s | memory-bound |
+| reduction | 88.2 | 68.7 | 1.28x | 531.8 GB/s | compute-bound after wavefront fix |
+| vector_add | 45.1 | 38.2 | 1.18x | 4821.6 GB/s | memory-bound |
+| convolution_2d | 211.7 | 158.3 | 1.34x | 2134.8 GB/s | memory-bound |
+## Hardware Context
+- GPU class: AMD Instinct MI300X
+- VRAM: 192GB HBM3
+- Theoretical memory bandwidth: 5.3 TB/s
+- Wavefront size: 64
+- API data source in local/demo mode: `demo_artifact`
+## Real Hardware Mode
+Set `ROCM_AVAILABLE=true`, `HIPCC_PATH=hipcc`, and `ROCPROF_PATH=rocprof` on a
+real MI300X ROCm environment to replace demo artifacts with `data_source="real_rocm"`.
+Real run output should be captured separately with the exact ROCm version, kernel
+input size, compiler flags, and profiler logs.
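
The speedup column in the new table can be rechecked from the two timing columns, and the bandwidth column can be related to the stated 5.3 TB/s theoretical peak. The sketch below uses only numbers from the table; the helper names are illustrative, and 5.3 TB/s is taken as 5300 GB/s.

```python
# (baseline_ms, optimized_ms, reported_speedup, bandwidth_gbps) from the table above.
ROWS = {
    "matrix_multiply": (121.4, 89.1, 1.36, 1843.7),
    "reduction": (88.2, 68.7, 1.28, 531.8),
    "vector_add": (45.1, 38.2, 1.18, 4821.6),
    "convolution_2d": (211.7, 158.3, 1.34, 2134.8),
}
PEAK_GBPS = 5300.0  # 5.3 TB/s theoretical peak, in decimal GB/s


def speedup(baseline_ms, optimized_ms):
    """Speedup is baseline time divided by optimized time, rounded as in the table."""
    return round(baseline_ms / optimized_ms, 2)


def peak_fraction(bandwidth_gbps):
    """Fraction of the theoretical memory bandwidth that a kernel reaches."""
    return bandwidth_gbps / PEAK_GBPS


if __name__ == "__main__":
    for name, (base, opt, reported, bw) in ROWS.items():
        print(name, speedup(base, opt), reported, f"{peak_fraction(bw):.0%}")
```

All four reported speedups agree with the ratio of their timing columns. The `vector_add` artifact sits at roughly 91% of theoretical peak bandwidth, which fits its memory-bound label.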

docs/benchmark_runs/mi300x_results.txt CHANGED

@@ -1,17 +1,28 @@
-Hardware: AMD Instinct MI300X VF (gfx942)
-ROCm: 7.2
-Date: 2025-05-06
-Compiler: hipcc --offload-arch=gfx942 -O3
-matrix_multiply (512x512 fp32):
-  Basic kernel:        0.068 ms
-  Shared memory kernel: 0.026 ms
-  Speedup:             2.61x
-reduction (16M elements fp32):
-  Kernel time:         0.019 ms
-  Correctness:         PASS (16777216 == 16777216)
-vector_add (32M elements fp32):
-  Kernel time:         0.099 ms
-  Memory bandwidth:    4077.6 GB/s (77% of MI300X peak 5.3 TB/s)

+Data source: demo_artifact
+Source file: backend/tools/demo_artifacts.py
+Hardware class:
+  GPU: AMD Instinct MI300X
+  HBM: 192GB
+  Wavefront size: 64
+  Theoretical memory bandwidth: 5.3 TB/s
+matrix_multiply:
+  Baseline HIP: 121.4 ms
+  Optimized HIP: 89.1 ms
+  Speedup: 1.36x
+  Bandwidth: 1843.7 GB/s
+reduction:
+  Baseline HIP: 88.2 ms
+  Optimized HIP: 68.7 ms
+  Speedup: 1.28x
+  Bandwidth: 531.8 GB/s
+vector_add:
+  Baseline HIP: 45.1 ms
+  Optimized HIP: 38.2 ms
+  Speedup: 1.18x
+  Bandwidth: 4821.6 GB/s
+Set ROCM_AVAILABLE=true on real MI300X hardware to produce real_rocm values.