tazwarrrr/rocmport-qwen-wavefront-finetuned
A LoRA fine-tuned version of Qwen/Qwen2.5-Coder-7B-Instruct specialized for correcting buggy HIP code produced by hipify, specifically targeting AMD wavefront-64 semantics on GFX9/CDNA architectures.
Model Details
Summary
| Field | Value |
|---|---|
| Model type | Qwen2.5-Coder-7B-Instruct + LoRA adapter |
| LoRA rank / alpha | r=16, alpha=32 |
| Task | Causal Language Modeling (CAUSAL_LM) |
| Finetuned from | Qwen/Qwen2.5-Coder-7B-Instruct |
| Developed by | Tazwar Ahnaf |
| License | Apache 2.0 |
| Demo | ROCmPort AI on Spaces |
Model Description
rocmport-qwen-wavefront-finetuned is a parameter-efficient LoRA adapter trained to fix the class of bugs that hipify routinely introduces when porting CUDA kernels to HIP. Raw hipify output often compiles and runs but produces incorrect results on AMD hardware because it blindly substitutes CUDA warp primitives (warpSize=32) with HIP equivalents without accounting for AMD's 64-wide wavefront execution model.
This adapter learns to:
- Replace warp-size-32 hardcoded constants with `warpSize` or the correct AMD value (64 for GFX9/CDNA).
- Fix `__shfl_*` / `__ballot` / `__activemask` intrinsics to their correct HIP/wavefront-64 equivalents.
- Repair shared-memory tiling and reduction trees dimensioned for 32-wide warps.
- Correct `__syncwarp()` usage patterns that have no direct HIP equivalent.
- Adjust loop bounds and mask widths that assume 32 threads per warp.
- Fix cooperative-groups and warp-primitive patterns broken by the warp-to-wavefront size change.
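To make the loop-bound problem concrete, here is a pure-Python sketch that models a shuffle-down reduction across one 64-lane wavefront. It is a simplified model rather than HIP code: `shfl_down` and `wave_reduce` are hypothetical helpers, and the only hardware behaviour assumed is that a lane reading past the end of its segment gets its own value back.

```python
WAVEFRONT_SIZE = 64  # lanes per wavefront on GFX9/CDNA, as stated in this card


def shfl_down(vals, offset, width=WAVEFRONT_SIZE):
    """Model of a shuffle-down: lane i reads lane i + offset within its segment."""
    out = []
    for lane, own in enumerate(vals):
        in_range = (lane % width) + offset < width
        # Out-of-range lanes keep their own value (assumed shuffle semantics).
        out.append(vals[lane + offset] if in_range else own)
    return out


def wave_reduce(vals, start_offset):
    """Tree reduction; returns the value held by lane 0 at the end."""
    vals = list(vals)
    offset = start_offset
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset >>= 1
    return vals[0]


lanes = list(range(WAVEFRONT_SIZE))          # lane i holds the value i
buggy = wave_reduce(lanes, start_offset=16)  # bound carried over from warp size 32
fixed = wave_reduce(lanes, start_offset=WAVEFRONT_SIZE // 2)

print(buggy, fixed, sum(lanes))
```

In this model the loop that starts at offset 16 leaves lane 0 holding only the sum of lanes 0 to 31 (496), while starting at half the wavefront size gives the full sum (2016).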
Training Details
Dataset
tazwarrrr/cuda-to-rocm-wavefront-bugs
| Split | Examples |
|---|---|
| Train | 153 |
| Validation | 6 (one per bug category) |
| Total | 159 |
Bug categories covered (6 total):
- `warp_size_constant`: hardcoded 32 instead of `warpSize`
- `shfl_intrinsic`: `__shfl_*` calls with wrong lane masks or widths
- `ballot_activemask`: `__ballot_sync` / `__activemask` ported incorrectly
- `shared_memory_tiling`: tile dimensions based on warp=32
- `syncwarp`: `__syncwarp()` calls without HIP equivalent
- `cooperative_groups`: warp-level CG patterns broken at wavefront-64
Each example is a (buggy_hip, corrected_hip) pair with a structured prompt instructing the model to output only the corrected kernel.
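As a rough illustration of what one such pair might look like in code, the sketch below wraps a `(buggy_hip, corrected_hip)` pair in a small dataclass and renders it as chat messages. The class name `BugFixExample`, its field layout, and the exact prompt wording are assumptions made for illustration; the dataset's real column names and prompt template may differ.

```python
from dataclasses import dataclass

# Category names as listed in this card.
BUG_CATEGORIES = {
    "warp_size_constant",
    "shfl_intrinsic",
    "ballot_activemask",
    "shared_memory_tiling",
    "syncwarp",
    "cooperative_groups",
}


@dataclass
class BugFixExample:
    """Hypothetical record layout; the real dataset schema may differ."""
    category: str
    buggy_hip: str
    corrected_hip: str

    def __post_init__(self):
        if self.category not in BUG_CATEGORIES:
            raise ValueError(f"unknown bug category: {self.category}")

    def to_messages(self):
        """Render the pair as chat messages (prompt wording is an assumption)."""
        prompt = (
            "Fix the following buggy HIP kernel so it runs correctly on AMD "
            "hardware with wavefront size 64. Output only the corrected kernel.\n\n"
            f"{self.buggy_hip.strip()}"
        )
        return [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": self.corrected_hip.strip()},
        ]


example = BugFixExample(
    category="warp_size_constant",
    buggy_hip="int lane = threadIdx.x % 32;",
    corrected_hip="int lane = threadIdx.x % warpSize;",
)
messages = example.to_messages()
print(messages[1]["content"])
```

Validating the category at construction time is a design choice of this sketch: with only six categories, a typo in a label is cheap to catch early.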
Training Hardware & Environment
| Parameter | Value |
|---|---|
| GPU | AMD Instinct MI300X (gfx942) |
| ROCm version | 6.2 |
| Training platform | AMD Developer Cloud |
| Framework | Hugging Face transformers + peft + trl (SFTTrainer) |
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Batch size (per device) | 2 |
| Gradient accumulation steps | 4 |
| Effective batch size | 8 |
| Learning rate | 2e-4 |
| LR scheduler | cosine |
| Max sequence length | 2048 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj |
| Quantization | 4-bit (bitsandbytes NF4) |
| Optimizer | paged_adamw_32bit |
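The figures in this table imply a few derived quantities, worked out in the sketch below. The base-model dimensions (28 layers, hidden size 3584, 4 key/value heads of dimension 128) are assumptions taken from the published Qwen2.5-7B configuration rather than from this card, and the effective batch size assumes a single GPU.

```python
# Values from the hyperparameter table.
LORA_R = 16
LORA_ALPHA = 32
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4

# Assumed base-model dimensions (not stated in this card).
NUM_LAYERS = 28
HIDDEN = 3584
KV_DIM = 4 * 128

# (in_features, out_features) of each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) weights.
per_layer = sum(LORA_R * (i + o) for i, o in TARGET_SHAPES.values())
trainable_params = per_layer * NUM_LAYERS

lora_scaling = LORA_ALPHA / LORA_R                      # factor applied to B @ A
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS   # single-GPU assumption

print(f"scaling={lora_scaling}, effective_batch={effective_batch}, "
      f"trainable_params={trainable_params:,}")
```

Under these assumptions the adapter trains roughly 10.1 million weights, a small fraction of the 7B-parameter base model.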
Training Results
| Metric | Value |
|---|---|
| Training time | ~79 seconds |
| Final training loss | 1.189 |
| Token accuracy | ~81% |
How to Get Started
Install dependencies:

```bash
pip install transformers peft accelerate bitsandbytes torch
```

Load the base model and LoRA adapter, then run inference:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"
ADAPTER = "tazwarrrr/rocmport-qwen-wavefront-finetuned"

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# -- Inference --
BUGGY_HIP = """
__global__ void warp_reduce(float* data, float* result) {
    float val = data[threadIdx.x];
    // BUG: hardcoded warp size 32, wrong for AMD wavefront-64
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset, 32);
    if (threadIdx.x % 32 == 0)
        result[threadIdx.x / 32] = val;
}
"""

prompt = (
    "Fix the following buggy HIP kernel so it runs correctly on AMD hardware "
    "with wavefront size 64 (gfx942 / MI300X). Output only the corrected kernel.\n\n"
    f"```cpp\n{BUGGY_HIP.strip()}\n```"
)
messages = [
    {"role": "system", "content": "You are an expert AMD GPU programmer. Fix HIP kernels for wavefront-64."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
Note: The adapter is not merged into the base weights. You must always load the base model first and then apply the adapter via `PeftModel.from_pretrained` as shown above. To merge for faster inference, see `model.merge_and_unload()` in the PEFT docs.
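Although the prompt asks for the corrected kernel only, a chat model may still wrap its reply in a code fence or add a sentence around it. The sketch below shows one way to pull out just the kernel text before compiling it. `extract_kernel` is a hypothetical helper, not part of this project, and it assumes the reply contains at most one fenced block of interest.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Matches the first fenced block, with an optional language tag such as cpp.
_FENCE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_kernel(response: str) -> str:
    """Return the first fenced block's contents, or the whole reply if unfenced."""
    match = _FENCE_RE.search(response)
    body = match.group(1) if match else response
    return body.strip()


reply = f"Here is the fix:\n{FENCE}cpp\n__global__ void k() {{}}\n{FENCE}\n"
print(extract_kernel(reply))
```

Falling back to the whole reply keeps the helper usable when the model follows the instruction literally and emits bare code.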
Intended Uses & Limitations
Intended Use
- Correcting `hipify`-generated HIP kernels that contain wavefront-size bugs before deployment on AMD GFX9 / CDNA hardware.
- As a backend component in automated CUDA-to-ROCm porting pipelines (see the ROCmPort AI demo).
- Research into parameter-efficient fine-tuning for low-level GPU code translation.
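In a porting pipeline, one plausible use is to send only suspicious kernels to the adapter. The sketch below is a hypothetical pre-filter, not the ROCmPort AI implementation: the pattern names and regular expressions are illustrative heuristics for code that appears to assume 32 lanes per warp, and they will produce both false positives and false negatives.

```python
import re

# Hypothetical heuristics: each pattern hints at a 32-lane assumption.
SUSPECT_PATTERNS = {
    "hardcoded_32": re.compile(r"[%/]\s*32\b"),
    "full_mask_32bit": re.compile(r"0x[fF]{8}\b"),
    "syncwarp": re.compile(r"\b__syncwarp\s*\("),
    "shfl_width_32": re.compile(r"\b__shfl\w*\s*\([^;]*,\s*32\s*\)"),
}


def flag_suspects(hip_source: str) -> list[str]:
    """Return the sorted names of all heuristics that match the HIP source."""
    return sorted(
        name for name, pattern in SUSPECT_PATTERNS.items()
        if pattern.search(hip_source)
    )


sample = """
for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down(val, offset, 32);
if (threadIdx.x % 32 == 0)
    result[threadIdx.x / 32] = val;
"""
print(flag_suspects(sample))
```

A kernel with no matches could skip the model entirely, which keeps inference cost proportional to the amount of suspect code.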
Limitations
- Small training set. Only 159 examples (153 train, 6 validation) across 6 bug categories. Performance on unseen kernel types or less common HIP intrinsics is unknown.
- Adapter only. The LoRA weights are not merged; loading requires `peft` and the original base model (~15 GB).
- Narrow scope. The model was trained exclusively on wavefront-size bugs. Other classes of CUDA-to-HIP correctness issues (memory model differences, texture APIs, etc.) are out of distribution.
- No formal evaluation benchmark. The 81% token accuracy figure is measured on the training-adjacent validation split; independent benchmark results are not yet available.
- gfx942 target. Corrections are tuned for MI300X (wavefront=64). Behaviour on RDNA architectures (which support both 32- and 64-wide wavefronts) has not been validated.
Citation
If you use this model or dataset in your work, please cite:
@misc{ahnaf2026rocmport,
author = {Tazwar Ahnaf},
title = {ROCmPort AI: LLM-Assisted CUDA-to-HIP Porting with Wavefront Bug Correction},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/tazwarrrr/rocmport-qwen-wavefront-finetuned}},
}
Acknowledgements
- Trained on the AMD Developer Cloud using an AMD Instinct MI300X.
- Built during the lablab.ai × AMD Developer Hackathon.
- Base model: Qwen/Qwen2.5-Coder-7B-Instruct by Alibaba Cloud.