BrainScope Disease Fine-Tuned scGPT
This repository contains a disease-adapted scGPT model, fine-tuned for brain single-cell / single-nucleus RNA-seq analysis.
It is intended to be used with the companion pipeline repository:
- Pipeline repo: YOUR_USERNAME/brainscope-scgpt-pipeline
Model summary
This model starts from the original scGPT backbone and is then fine-tuned on disease-related brain single-cell / single-nucleus RNA-seq data for downstream annotation workflows.
This packaged release is intended for:
- disease-aware cell-type annotation
- embedding generation
- comparison against the original scGPT baseline
- downstream error analysis and reproducible model sharing
Data context
This release is associated with workflows built on:
- LIBD for smaller pilot experiments and rapid iteration
- BrainScope for larger-scale disease-focused fine-tuning and evaluation
The goal of this model family is to improve robustness on disease-altered cell states relative to healthy-only baselines.
Included files
Typical contents of this repository include:
- model.pt
- config.yaml
- preprocessing.json
- vocab.json
- label_map.json
- metrics.json
- requirements.txt
- inference.py
- small example input / output files
Intended use
This model is intended for:
- disease-aware annotation of sc/snRNA-seq data
- controlled comparisons with the original scGPT baseline
- reproducible research workflows on brain disease datasets
This release is for research use only and is not a clinical model.
Example usage
Download from the Hugging Face Hub
from huggingface_hub import snapshot_download
repo_dir = snapshot_download("YOUR_USERNAME/brainscope-scgpt-disease")
print(repo_dir)
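After the download, it can be useful to confirm that the snapshot contains the files listed under "Included files". The following is a minimal sketch, not part of the pipeline: `missing_files` is a hypothetical helper, and the file names are simply the ones this README lists, so adjust them if the actual release differs.

```python
from pathlib import Path

# File names taken from the "Included files" section of this README.
# This list is an assumption about the release, not a guaranteed manifest.
EXPECTED_FILES = [
    "model.pt",
    "config.yaml",
    "preprocessing.json",
    "vocab.json",
    "label_map.json",
    "metrics.json",
    "requirements.txt",
    "inference.py",
]


def missing_files(repo_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in repo_dir."""
    root = Path(repo_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_files(repo_dir)` on the directory returned by `snapshot_download` gives an empty list when every listed file is present.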
Run through the pipeline
python -m brainscope_scgpt annotate --input data/query.h5ad --model-repo YOUR_USERNAME/brainscope-scgpt-disease --output results/query_annotated.h5ad --mode small
Large dataset mode:
python -m brainscope_scgpt annotate --input data/brainscope_full.h5ad --model-repo YOUR_USERNAME/brainscope-scgpt-disease --output results/brainscope_full_annotated.h5ad --mode large
Evaluation
Please fill in the exact benchmark numbers you want visible in the public model card.
Suggested structure:
Main metrics
- Accuracy:
- Precision:
- Recall:
- Macro F1:
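When filling in the metrics above, it helps to state how they were averaged. The sketch below shows one possible convention, assuming precision and recall are macro-averaged over all labels seen in either the true or predicted annotations, and that a class with no predictions (or no true cells) scores 0 for the affected metric. `classification_metrics` is a hypothetical helper, not a function from the pipeline.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1.

    Assumes macro averaging over every label in y_true or y_pred,
    with undefined per-class ratios counted as 0.0.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # Per-class true positives, false positives and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }
```

Macro averaging weights rare cell types the same as abundant ones, which is often the more informative view for disease-shifted populations; if a weighted or micro average was used instead, say so next to the reported numbers.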
Benchmark setting
- Train / validation / test split:
- Label space:
- Small or large mode:
- Any freeze / unfreeze strategy:
- Whether MoE was used:
If this release corresponds specifically to one of your final selected models, state that explicitly here.
Comparison to the original baseline
This model is intended to be compared against:
- original scGPT baseline
- MoE-enhanced variants
- alternative architectures such as Mamba-based approaches
Suggested points to summarize here after finalization:
- which cell types improved the most
- which confusions remained
- whether disease-aware fine-tuning improved performance on disease-shifted cells
Limitations
- Performance depends on preprocessing consistency and gene-vocabulary alignment.
- Performance may change if label definitions differ across datasets.
- This repository does not include all large intermediate artifacts used during training.
- The reference-mapping (RM) workflow, if used, still depends on an external FAISS index that is not bundled here.
- This is a research model and not validated for clinical use.
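The gene-vocabulary alignment mentioned in the first limitation can be checked before running annotation. The sketch below assumes that `vocab.json` is a flat JSON object mapping gene symbols to integer token ids, as in upstream scGPT; that layout is an assumption about this release. `load_vocab` and `vocab_coverage` are hypothetical helpers, not part of the pipeline.

```python
import json


def load_vocab(path):
    """Load a gene vocabulary, assumed to be a JSON object {gene_symbol: token_id}."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def vocab_coverage(query_genes, vocab):
    """Return (fraction of unique query genes found in vocab, list of missing genes)."""
    # Drop duplicate gene names while keeping their original order.
    genes = list(dict.fromkeys(query_genes))
    missing = [gene for gene in genes if gene not in vocab]
    coverage = 1.0 - len(missing) / len(genes) if genes else 0.0
    return coverage, missing
```

For an AnnData query, the gene list would typically come from `adata.var_names`. A low coverage value suggests a naming mismatch (for example Ensembl IDs versus gene symbols) that should be resolved before comparing results with the baseline.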
Citation
Please cite the original scGPT paper and your project paper when available.
@article{cui2024scgpt,
title={scGPT: toward building a foundation model for single-cell multi-omics using generative AI},
author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and others},
journal={Nature Methods},
year={2024}
}
Contact
Yuesong Huang
University of Rochester
Email: yhu116@ur.rochester.edu