tazwarrrr/rocmport-qwen-wavefront-finetuned

A LoRA-finetuned version of Qwen/Qwen2.5-Coder-7B-Instruct specialized for correcting buggy HIP code produced by hipify, specifically targeting AMD wavefront-64 semantics on GFX9/CDNA architectures.

Model Details

Summary

Field	Value
Model type	Qwen2.5-Coder-7B-Instruct + LoRA adapter
LoRA rank / alpha	r=16, alpha=32
Task	Causal Language Modeling (CAUSAL_LM)
Finetuned from	`Qwen/Qwen2.5-Coder-7B-Instruct`
Developed by	Tazwar Ahnaf
License	Apache 2.0
Demo	ROCmPort AI on Spaces

Model Description

rocmport-qwen-wavefront-finetuned is a parameter-efficient LoRA adapter trained to fix the class of bugs that hipify routinely introduces when porting CUDA kernels to HIP. Raw hipify output often compiles and runs but produces incorrect results on AMD hardware, because hipify mechanically replaces CUDA warp primitives (written for warpSize=32) with their HIP equivalents without accounting for AMD's 64-wide wavefront execution model.

This adapter learns to:

Replace hardcoded warp-size constants (32) with warpSize or the correct AMD value (64 for GFX9/CDNA).
Rewrite __shfl_* / __ballot / __activemask calls as their correct HIP/wavefront-64 equivalents.
Repair shared-memory tiling and reduction trees dimensioned for 32-wide warps.
Correct __syncwarp() usage patterns that have no direct HIP equivalent.
Adjust loop bounds and mask widths that assume 32 threads per warp.
Fix cooperative-groups and warp-primitive patterns broken by the warp→wavefront size change.
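
To make the failure mode concrete, here is a small pure-Python model of a shuffle-down reduction. It is an illustration only, not HIP code: `shfl_down` and `wave_reduce` are hypothetical helpers that mimic how `__shfl_down` reads from lane `i + offset` inside a `width`-sized segment. A loop that starts at a hardcoded offset of 16 folds only 32 lanes into lane 0, while starting at half the wavefront size covers all 64.

```python
def shfl_down(vals, offset, width):
    """Toy model of __shfl_down: lane i reads lane i + offset, but only
    if the source lane stays inside the same width-sized segment."""
    out = []
    for lane, own in enumerate(vals):
        if (lane % width) + offset < width:
            out.append(vals[lane + offset])
        else:
            out.append(own)  # source out of range: the lane keeps its own value
    return out


def wave_reduce(vals, start_offset, width):
    """Tree reduction via repeated shuffle-down; returns lane 0's value."""
    vals = list(vals)
    offset = start_offset
    while offset > 0:
        shifted = shfl_down(vals, offset, width)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset >>= 1
    return vals[0]


WAVEFRONT = 64
lanes = list(range(WAVEFRONT))  # lane i holds the value i

# Hardcoded start offset 16 (correct for 32-wide warps only)
print(wave_reduce(lanes, 16, WAVEFRONT))              # 496: lanes 0..31 only
# Start offset derived from the wavefront size
print(wave_reduce(lanes, WAVEFRONT // 2, WAVEFRONT))  # 2016: all 64 lanes
```

The kernel still runs to completion in the first case, which is why this class of bug tends to surface as wrong numbers rather than as a crash.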

Training Details

Dataset

tazwarrrr/cuda-to-rocm-wavefront-bugs

Split	Examples
Train	153
Validation	6 (one per bug category)
Total	159

Bug categories covered (6 total):

warp_size_constant — hardcoded 32 instead of warpSize
shfl_intrinsic — __shfl_* calls with wrong lane masks or widths
ballot_activemask — __ballot_sync / __activemask ported incorrectly
shared_memory_tiling — tile dimensions based on warp=32
syncwarp — __syncwarp() calls without HIP equivalent
cooperative_groups — warp-level CG patterns broken at wavefront-64
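
As a rough illustration of what some of these categories look like in source text, the sketch below flags candidate lines with regular expressions. This is not part of the model or the dataset tooling: `PATTERNS` and `flag_lines` are hypothetical names, the regexes are deliberately crude, and categories that need structural understanding (shared_memory_tiling, cooperative_groups) are left out.

```python
import re

# Hypothetical heuristics, keyed by the category names used in this card.
# A hit marks a line worth reviewing, not a confirmed bug.
PATTERNS = {
    "warp_size_constant": re.compile(r"[%/]\s*32\b"),
    "shfl_intrinsic": re.compile(r"__shfl(?:_up|_down|_xor)?(?:_sync)?\s*\("),
    "ballot_activemask": re.compile(r"__ballot(?:_sync)?\s*\(|__activemask\s*\("),
    "syncwarp": re.compile(r"__syncwarp\s*\("),
}


def flag_lines(source):
    """Return (line_number, category) pairs for lines matching a heuristic."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, category))
    return hits


SAMPLE = """\
val += __shfl_down(val, offset, 32);
if (threadIdx.x % 32 == 0)
    result[threadIdx.x / 32] = val;
"""

for lineno, category in flag_lines(SAMPLE):
    print(lineno, category)
```

A filter like this could decide which kernels are worth sending to the adapter, but it cannot tell a correct use of an intrinsic from an incorrect one.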

Each example is a (buggy_hip, corrected_hip) pair with a structured prompt instructing the model to output only the corrected kernel.
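
A minimal sketch of how one such pair might be turned into chat messages follows. The exact training template is not documented in this card, so this is an assumption: the system and user wording is borrowed from the inference example in How to Get Started, and `build_messages` is a hypothetical helper, not part of the dataset or training code.

```python
FENCE = "`" * 3  # built at runtime to avoid nesting literal code fences

SYSTEM_PROMPT = (
    "You are an expert AMD GPU programmer. Fix HIP kernels for wavefront-64."
)


def build_messages(buggy_hip, corrected_hip=None):
    """Build chat messages for one example.

    With corrected_hip set, the result is a full training conversation;
    without it, the result is an inference prompt awaiting a reply.
    """
    user_prompt = (
        "Fix the following buggy HIP kernel so it runs correctly on AMD hardware "
        "with wavefront size 64 (gfx942 / MI300X). Output only the corrected kernel.\n\n"
        f"{FENCE}cpp\n{buggy_hip.strip()}\n{FENCE}"
    )
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    if corrected_hip is not None:
        # The target completion is the corrected kernel and nothing else
        messages.append({"role": "assistant", "content": corrected_hip.strip()})
    return messages


pair = build_messages(
    "if (threadIdx.x % 32 == 0) out[0] = v;",
    "if (threadIdx.x % warpSize == 0) out[0] = v;",
)
print([m["role"] for m in pair])
```

Keeping the training and inference prompts identical is the design choice assumed here; if the dataset uses a different template, the inference prompt should follow the dataset.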

Training Hardware & Environment

Parameter	Value
GPU	AMD Instinct MI300X (gfx942)
ROCm version	6.2
Training platform	AMD Developer Cloud
Framework	Hugging Face `transformers` + `peft` + `trl` (SFTTrainer)

Hyperparameters

Hyperparameter	Value
Epochs	3
Batch size (per device)	2
Gradient accumulation steps	4
Effective batch size	8
Learning rate	2e-4
LR scheduler	cosine
Max sequence length	2048
LoRA rank (r)	16
LoRA alpha	32
LoRA dropout	0.05
LoRA target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`
Quantization	4-bit (bitsandbytes NF4)
Optimizer	`paged_adamw_32bit`
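
The arithmetic behind the table can be sketched as follows. The dictionary keys mirror argument names of `peft.LoraConfig` and `transformers.TrainingArguments`, but the dictionaries are plain Python for illustration, not the author's training script. The single-GPU setting and the absence of sequence packing are assumptions, and the trainer's own rounding of step counts may differ slightly.

```python
import math

# Values copied from the hyperparameter table above
lora = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}
training = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "optim": "paged_adamw_32bit",
}

NUM_GPUS = 1          # assumption: a single MI300X
TRAIN_EXAMPLES = 153  # train split size from the dataset table

# Examples consumed per optimizer step
effective_batch = (
    training["per_device_train_batch_size"]
    * training["gradient_accumulation_steps"]
    * NUM_GPUS
)

# Approximate optimizer steps, assuming no packing and a final partial batch
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * training["num_train_epochs"]

# Standard LoRA scales the low-rank update by alpha / r
lora_scale = lora["lora_alpha"] / lora["r"]

print(effective_batch, steps_per_epoch, total_steps, lora_scale)  # 8 20 60 2.0
```

On the order of 60 optimizer steps is consistent with the very short training time reported below.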

Training Results

Metric	Value
Training time	~79 seconds
Final training loss	1.189
Token accuracy	~81%

How to Get Started

Install dependencies:

pip install transformers peft accelerate bitsandbytes torch

Load the base model and LoRA adapter, then run inference:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"
ADAPTER    = "tazwarrrr/rocmport-qwen-wavefront-finetuned"

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# ── Inference ──────────────────────────────────────────────────────────────
BUGGY_HIP = """
__global__ void warp_reduce(float* data, float* result) {
    float val = data[threadIdx.x];
    // BUG: hardcoded warp size 32, wrong for AMD wavefront-64
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset, 32);
    if (threadIdx.x % 32 == 0)
        result[threadIdx.x / 32] = val;
}
"""

prompt = (
    "Fix the following buggy HIP kernel so it runs correctly on AMD hardware "
    "with wavefront size 64 (gfx942 / MI300X). Output only the corrected kernel.\n\n"
    f"```cpp\n{BUGGY_HIP.strip()}\n```"
)

messages = [
    {"role": "system", "content": "You are an expert AMD GPU programmer. Fix HIP kernels for wavefront-64."},
    {"role": "user",   "content": prompt},
]

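# Render the chat messages into the prompt format the base model expects
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Low-temperature sampling keeps the output close to deterministic
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)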
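
Although the prompt asks for the kernel only, the model may still wrap its answer in a Markdown code fence. The helper below is a hedged sketch for pulling the kernel out of such a response; `extract_kernel` is a hypothetical name and not part of the adapter or of any library.

```python
import re

FENCE = "`" * 3  # built at runtime to avoid nesting literal code fences

# Match the first fenced block, with or without a language tag such as cpp
_FENCED_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_kernel(response):
    """Return the body of the first fenced block, or the whole response
    (stripped) when the model did not use a fence."""
    match = _FENCED_BLOCK.search(response)
    return (match.group(1) if match else response).strip()


fenced = "Here is the fix:\n" + FENCE + "cpp\n__global__ void k() {}\n" + FENCE
print(extract_kernel(fenced))                      # __global__ void k() {}
print(extract_kernel("  __global__ void k() {} ")) # __global__ void k() {}
```

In a porting pipeline, the extracted text would then be compiled and tested as usual, since the adapter's output is not guaranteed to be correct.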

Note: The adapter is not merged into the base weights. You must always load the base model first and then apply the adapter via PeftModel.from_pretrained as shown above. To merge the adapter into the base weights for faster inference, see model.merge_and_unload() in the PEFT docs.

Intended Uses & Limitations

Intended Use

Correcting hipify-generated HIP kernels that contain wavefront-size bugs before deployment on AMD GFX9 / CDNA hardware.
As a backend component in automated CUDA→ROCm porting pipelines (see the ROCmPort AI demo).
Research into parameter-efficient fine-tuning for low-level GPU code translation.

Limitations

Small training set. Only 159 examples (153 train, 6 validation) across 6 bug categories. Performance on unseen kernel types or less common HIP intrinsics is unknown.
Adapter only. The LoRA weights are not merged; loading requires peft and the original base model (~15 GB).
Narrow scope. The model was trained exclusively on wavefront-size bugs. Other classes of CUDA→HIP correctness issues (memory model differences, texture APIs, etc.) are out of distribution.
No formal evaluation benchmark. The ~81% token accuracy figure comes from the small validation split (6 examples), which closely mirrors the training data; independent benchmark results are not yet available.
gfx942 target. Corrections are tuned for MI300X (wavefront=64). Behaviour on RDNA architectures (which support both 32- and 64-wide wavefronts) has not been validated.

Citation

If you use this model or dataset in your work, please cite:

@misc{ahnaf2026rocmport,
  author       = {Tazwar Ahnaf},
  title        = {ROCmPort AI: LLM-Assisted CUDA-to-HIP Porting with Wavefront Bug Correction},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/tazwarrrr/rocmport-qwen-wavefront-finetuned}},
}

Acknowledgements

Trained on the AMD Developer Cloud using an AMD Instinct MI300X.
Built during the lablab.ai × AMD Developer Hackathon.
Base model: Qwen/Qwen2.5-Coder-7B-Instruct by Alibaba Cloud.
