tazwarrrr/rocmport-qwen-wavefront-finetuned

A LoRA-finetuned version of Qwen/Qwen2.5-Coder-7B-Instruct specialized for correcting buggy HIP code produced by hipify, specifically bugs that stem from AMD's wavefront-64 semantics on GFX9/CDNA architectures.

Model Details

Summary

| Field | Value |
| --- | --- |
| Model type | Qwen2.5-Coder-7B-Instruct + LoRA adapter |
| LoRA rank / alpha | r=16, alpha=32 |
| Task | Causal Language Modeling (CAUSAL_LM) |
| Finetuned from | Qwen/Qwen2.5-Coder-7B-Instruct |
| Developed by | Tazwar Ahnaf |
| License | Apache 2.0 |
| Demo | ROCmPort AI on Spaces |

Model Description

rocmport-qwen-wavefront-finetuned is a parameter-efficient LoRA adapter trained to fix the class of bugs that hipify routinely introduces when porting CUDA kernels to HIP. Raw hipify output often compiles and runs but produces incorrect results on AMD hardware because it blindly substitutes CUDA warp primitives (warpSize=32) with HIP equivalents without accounting for AMD's 64-wide wavefront execution model.

This adapter learns to:

  • Replace warp-size-32 hardcoded constants with warpSize or the correct AMD value (64 for GFX9/CDNA).
  • Fix __shfl_* / __ballot / __activemask intrinsics to their correct HIP/wavefront-64 equivalents.
  • Repair shared-memory tiling and reduction trees dimensioned for 32-wide warps.
  • Correct __syncwarp() usage patterns that have no direct HIP equivalent.
  • Adjust loop bounds and mask widths that assume 32 threads per warp.
  • Fix cooperative-groups and warp-primitive patterns broken by the warp→wavefront size change.
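
To make the reduction-tree item above concrete, here is a small pure-Python model of a shuffle-down reduction. It is an illustrative sketch, not HIP code: `shfl_down` and `tree_reduce` are hypothetical helpers that mimic the documented behavior of `__shfl_down`, where a lane whose source falls outside its segment gets its own value back. The sketch assumes a 64-lane wavefront and a kernel that expects lane 0 to hold the sum of the whole wavefront.

```python
def shfl_down(vals, delta, width):
    """Simplified model of __shfl_down: lane i reads lane i + delta if that
    lane is in the same width-sized segment, otherwise it keeps its own value."""
    out = []
    for lane, own in enumerate(vals):
        in_segment = (lane % width) + delta < width
        out.append(vals[lane + delta] if in_segment else own)
    return out


def tree_reduce(vals, start_offset, width):
    """Run the usual halving-offset reduction and return every lane's value."""
    vals = list(vals)
    offset = start_offset
    while offset > 0:
        shifted = shfl_down(vals, offset, width)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset >>= 1
    return vals


WAVEFRONT = 64                 # assumed wavefront size (gfx942)
data = [1.0] * WAVEFRONT       # every lane contributes 1.0

# Starting offset 16 is a CUDA habit (warp size 32); lane 0 only sees 32 lanes.
buggy_lane0 = tree_reduce(data, 16, WAVEFRONT)[0]
# Starting at half the wavefront covers all 64 lanes.
fixed_lane0 = tree_reduce(data, WAVEFRONT // 2, WAVEFRONT)[0]

print(buggy_lane0, fixed_lane0)
```

With these inputs, lane 0 ends up with half of the expected sum in the first case, which is the kind of silent wrong result the adapter is trained to correct.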

Training Details

Dataset

tazwarrrr/cuda-to-rocm-wavefront-bugs

| Split | Examples |
| --- | --- |
| Train | 153 |
| Validation | 6 (one per bug category) |
| Total | ~159 |

Bug categories covered (6 total):

  1. warp_size_constant: hardcoded 32 instead of warpSize
  2. shfl_intrinsic: __shfl_* calls with wrong lane masks or widths
  3. ballot_activemask: __ballot_sync / __activemask ported incorrectly
  4. shared_memory_tiling: tile dimensions based on warp=32
  5. syncwarp: __syncwarp() calls without HIP equivalent
  6. cooperative_groups: warp-level CG patterns broken at wavefront-64

Each example is a (buggy_hip, corrected_hip) pair with a structured prompt instructing the model to output only the corrected kernel.
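
As a rough illustration of that pairing, the sketch below turns one (buggy_hip, corrected_hip) pair into chat messages. The helper name `build_messages` is hypothetical, and the prompt wording is borrowed from the inference example later in this card; the card does not state that training used exactly this text, so treat it as an assumption.

```python
# ASSUMPTION: wording copied from the inference example, not from the dataset itself.
SYSTEM_PROMPT = "You are an expert AMD GPU programmer. Fix HIP kernels for wavefront-64."
FENCE = "`" * 3  # built programmatically so the snippet can sit inside a fenced block


def build_messages(buggy_hip, corrected_hip=None):
    """Format one dataset pair as chat messages (hypothetical helper).

    With corrected_hip set, the result is a full training example; without it,
    the result is an inference prompt waiting for the model's answer.
    """
    user = (
        "Fix the following buggy HIP kernel so it runs correctly on AMD hardware "
        "with wavefront size 64 (gfx942 / MI300X). Output only the corrected kernel.\n\n"
        f"{FENCE}cpp\n{buggy_hip.strip()}\n{FENCE}"
    )
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
    if corrected_hip is not None:
        messages.append({"role": "assistant", "content": corrected_hip.strip()})
    return messages


example = build_messages(
    "if (threadIdx.x % 32 == 0) out[threadIdx.x / 32] = val;",
    "if (threadIdx.x % warpSize == 0) out[threadIdx.x / warpSize] = val;",
)
print([m["role"] for m in example])
```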

Training Hardware & Environment

| Parameter | Value |
| --- | --- |
| GPU | AMD Instinct MI300X (gfx942) |
| ROCm version | 6.2 |
| Training platform | AMD Developer Cloud |
| Framework | Hugging Face transformers + peft + trl (SFTTrainer) |

Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Epochs | 3 |
| Batch size (per device) | 2 |
| Gradient accumulation steps | 4 |
| Effective batch size | 8 |
| Learning rate | 2e-4 |
| LR scheduler | cosine |
| Max sequence length | 2048 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj |
| Quantization | 4-bit (bitsandbytes NF4) |
| Optimizer | paged_adamw_32bit |
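
A few of these values are derived from the others, and the arithmetic can be checked directly. The sketch below recomputes the effective batch size and the standard LoRA scaling factor (alpha / r), and estimates the number of trainable adapter parameters. The layer dimensions are an assumption taken from the published Qwen2.5-7B configuration and should be verified against the base model's config.json; the card itself does not report a parameter count.

```python
# Values from the hyperparameter table.
per_device_batch = 2
grad_accum_steps = 4
lora_r, lora_alpha = 16, 32

# Effective batch size, assuming a single-GPU run.
effective_batch = per_device_batch * grad_accum_steps

# Standard LoRA scaling applied to the low-rank update (not rsLoRA).
lora_scale = lora_alpha / lora_r

# ASSUMPTION: base-model dimensions; check config.json before relying on them.
hidden = 3584        # hidden_size
kv_dim = 512         # num_key_value_heads * head_dim (grouped-query attention)
num_layers = 28      # num_hidden_layers

# (in_features, out_features) for each targeted projection.
target_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

# Each adapted layer adds A (r x in) and B (out x r), i.e. r * (in + out) weights.
per_layer = sum(lora_r * (i + o) for i, o in target_shapes.values())
trainable_params = per_layer * num_layers

print(effective_batch, lora_scale, trainable_params)
```

Under those assumed dimensions the adapter holds roughly 10 million trainable weights, a small fraction of the 7B base model.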

Training Results

| Metric | Value |
| --- | --- |
| Training time | ~79 seconds |
| Final training loss | 1.189 |
| Token accuracy | ~81% |

How to Get Started

Install dependencies:

pip install transformers peft accelerate bitsandbytes torch

Load the base model and LoRA adapter, then run inference:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"
ADAPTER    = "tazwarrrr/rocmport-qwen-wavefront-finetuned"

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# ── Inference ──────────────────────────────────────────────────────────────
BUGGY_HIP = """
__global__ void warp_reduce(float* data, float* result) {
    float val = data[threadIdx.x];
    // BUG: hardcoded warp size 32, wrong for AMD wavefront-64
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset, 32);
    if (threadIdx.x % 32 == 0)
        result[threadIdx.x / 32] = val;
}
"""

prompt = (
    "Fix the following buggy HIP kernel so it runs correctly on AMD hardware "
    "with wavefront size 64 (gfx942 / MI300X). Output only the corrected kernel.\n\n"
    f"```cpp\n{BUGGY_HIP.strip()}\n```"
)

messages = [
    {"role": "system", "content": "You are an expert AMD GPU programmer. Fix HIP kernels for wavefront-64."},
    {"role": "user",   "content": prompt},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
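
Although the prompt asks for the corrected kernel only, chat models often wrap code in a fenced block or add a short preamble. The helper below is a hypothetical post-processing step, not part of the released pipeline: it returns the contents of the first fenced block if one is present and otherwise the stripped response.

```python
import re

FENCE = "`" * 3  # built programmatically so the snippet can sit inside a fenced block
# Opening fence, optional language tag, newline, body (non-greedy), closing fence.
_BLOCK = re.compile(FENCE + r"[a-zA-Z0-9_+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_kernel(response):
    """Return the first fenced code block in response, or the whole text if none."""
    match = _BLOCK.search(response)
    code = match.group(1) if match else response
    return code.strip()


sample = f"Here is the fix:\n{FENCE}cpp\n__global__ void k() {{}}\n{FENCE}\n"
print(extract_kernel(sample))
```

In the inference script this would be applied as `extract_kernel(response)` before writing the kernel to disk or passing it to a compiler.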

Note: The adapter is not merged into the base weights. You must always load the base model first and then apply the adapter via PeftModel.from_pretrained as shown above. To merge for faster inference see model.merge_and_unload() in the PEFT docs.
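
A minimal sketch of that merge step follows. The function name `merge_adapter` and the output directory are hypothetical. The sketch loads the base model in bfloat16 rather than 4-bit, a choice made here to keep the merge simple, so it needs enough memory for the full-precision weights. Imports are placed inside the function so the snippet can be defined without the heavy dependencies installed.

```python
def merge_adapter(base_id, adapter_id, out_dir):
    """Merge the LoRA weights into the base model and save the result.

    Hypothetical helper: loads the unquantized base model, applies the adapter,
    folds the low-rank updates into the base weights, and writes them to out_dir.
    """
    # Imported lazily so defining this function does not require torch/peft.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    merged.save_pretrained(out_dir)
    return merged


# Example call (downloads the ~15 GB base model; out_dir is an assumed path):
# merge_adapter(
#     "Qwen/Qwen2.5-Coder-7B-Instruct",
#     "tazwarrrr/rocmport-qwen-wavefront-finetuned",
#     "./rocmport-qwen-merged",
# )
```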


Intended Uses & Limitations

Intended Use

  • Correcting hipify-generated HIP kernels that contain wavefront-size bugs before deployment on AMD GFX9 / CDNA hardware.
  • As a backend component in automated CUDA→ROCm porting pipelines (see the ROCmPort AI demo).
  • Research into parameter-efficient fine-tuning for low-level GPU code translation.

Limitations

  • Small training set. Only ~159 examples (153 train, 6 validation) across 6 bug categories. Performance on unseen kernel types or less common HIP intrinsics is unknown.
  • Adapter only. The LoRA weights are not merged; loading requires peft and the original base model (~15 GB).
  • Narrow scope. The model was trained exclusively on wavefront-size bugs. Other classes of CUDA→HIP correctness issues (memory model differences, texture APIs, etc.) are out of distribution.
  • No formal evaluation benchmark. The 81% token accuracy figure is measured on the training-adjacent validation split; independent benchmark results are not yet available.
  • gfx942 target. Corrections are tuned for MI300X (wavefront=64). Behaviour on RDNA architectures (which support both 32- and 64-wide wavefronts) has not been validated.

Citation

If you use this model or dataset in your work, please cite:

@misc{ahnaf2026rocmport,
  author       = {Tazwar Ahnaf},
  title        = {ROCmPort AI: LLM-Assisted CUDA-to-HIP Porting with Wavefront Bug Correction},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/tazwarrrr/rocmport-qwen-wavefront-finetuned}},
}

Acknowledgements

  • Trained on the AMD Developer Cloud using an AMD Instinct MI300X.
  • Built during the lablab.ai × AMD Developer Hackathon.
  • Base model: Qwen/Qwen2.5-Coder-7B-Instruct by Alibaba Cloud.