SD-From-Scratch v1 (epoch 42)

A Stable Diffusion 1.x-class latent diffusion model trained from scratch on 2× RTX 5090 (Blackwell, 33.7 GB VRAM each) under the SD_Train.py pipeline.

This checkpoint (sd_epoch_042.pt) is the visually best checkpoint from a 48-epoch run across 7 training phases (LAION broad → LAION refined → DiffusionDB/JourneyDB mix → VGGFace2 face fine-tune → COCO full-body → consolidation → final mixed).

  • Architecture: UNet (ch=320, ch_mults=(1,2,4,4), attn_lvls=(1,2,3), heads=8), ~860M trainable parameters
  • Frozen components: VAE stabilityai/sd-vae-ft-mse, text encoder openai/clip-vit-large-patch14
  • Latent space: 64×64×4 (8× spatial compression of 512×512 RGB), scale factor 0.18215
  • Training schedule: DDPM scaled-linear betas (0.00085 → 0.012, 1000 steps), Min-SNR γ=5.0 (γ=3.0 in late refinement)
  • Precision: BF16 native autocast (no GradScaler), gradient checkpointing, Fused AdamW
  • EMA decay: 0.9999 (GPU-resident shadow)
  • Best training loss: 0.0947 (epoch 16, on filtered LAION 213k subset)
  • Released checkpoint: epoch 42 (~12.5 GB), best visual coherence across portraits, landscapes and scenes
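
The noise schedule and loss weighting listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in SD_Train.py: it assumes the common "scaled-linear" definition (betas spaced linearly in square-root space, then squared) and an epsilon-prediction objective, for which the Min-SNR weight is min(SNR, γ) / SNR. The function names are hypothetical.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Betas spaced linearly in sqrt space, then squared (assumed definition)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * t / (num_steps - 1)) ** 2 for t in range(num_steps)]

def min_snr_weights(betas, gamma=5.0):
    """Per-timestep loss weights min(SNR, gamma) / SNR (assumes epsilon-prediction)."""
    weights, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta               # cumulative product of (1 - beta)
        snr = alpha_bar / (1.0 - alpha_bar)   # signal-to-noise ratio at this step
        weights.append(min(snr, gamma) / snr)
    return weights

betas = scaled_linear_betas()
weights = min_snr_weights(betas, gamma=5.0)
print(f"beta[0]={betas[0]:.5f}  beta[-1]={betas[-1]:.5f}")
print(f"weight at t=0: {weights[0]:.4f}  weight at t=999: {weights[-1]:.4f}")
```

Under these assumptions, low-noise timesteps (high SNR) are down-weighted, while timesteps whose SNR is already below γ keep a weight of 1.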

Files

  • sd_epoch_042.pt: full training checkpoint containing the UNet state_dict, EMA shadow weights, optimizer/scheduler state, and epoch/step/best_loss metadata. Load EMA weights at inference for best quality (see usage below).
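
The EMA shadow weights stored in the checkpoint are a running average of the UNet parameters. A minimal sketch of that update with the decay listed earlier (0.9999), using plain Python floats in place of tensors; `ema_update` is a hypothetical name and not necessarily how SD_Train.py implements it:

```python
def ema_update(shadow, params, decay=0.9999):
    """In-place EMA step: shadow = decay * shadow + (1 - decay) * param."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

# Toy example with a single scalar "weight"; real code would operate on tensors.
params = {"w": 1.0}
shadow = {"w": 0.0}
for _ in range(10_000):
    ema_update(shadow, params)
print(round(shadow["w"], 3))
```

With a decay of 0.9999 the average spans roughly 1 / (1 - 0.9999) = 10,000 optimizer steps, which is why the shadow weights are smoother than the raw weights at any single step.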

License

MIT. See LICENSE in the source repo.

Intended use

Research and educational reproduction of from-scratch latent-diffusion training. Not intended for production deployment without further safety filtering. Not aligned for instruction-following or for any specific style/persona.

How to use

The checkpoint is loaded by the inference scripts in the source repo.

git clone https://github.com/atandra2000/StableDiffusion.git
cd StableDiffusion
pip install torch torchvision diffusers transformers huggingface_hub pillow

# Download the checkpoint from this HF model repo:
huggingface-cli download atandra2000/sd-from-scratch-v1 sd_epoch_042.pt --local-dir .

# Generate an image (uses EMA weights internally and removes the DDIM
# pred_x0 clamp, which would otherwise destroy SD latent signal):
python inference.py \
    --checkpoint sd_epoch_042.pt \
    --prompt "a beautiful sunset over mountain peaks, cinematic lighting" \
    --steps 50 \
    --guidance 7.5 \
    --output sample.png
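
The remark about the pred_x0 clamp can be illustrated with a single deterministic DDIM update (eta = 0). This is a sketch under assumptions rather than the code in inference.py: `ddim_step` and its arguments are hypothetical names, and plain Python lists stand in for latent tensors. Scaled VAE latents are not confined to [-1, 1], so a clamp borrowed from pixel-space samplers truncates them.

```python
import math

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, clamp_x0=False):
    """One deterministic DDIM update (eta = 0) on a list of latent values."""
    out = []
    for x, e in zip(x_t, eps):
        # Clean latent implied by the current noise estimate.
        pred_x0 = (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        if clamp_x0:
            # Pixel-space habit; truncates latents that lie outside [-1, 1].
            pred_x0 = max(-1.0, min(1.0, pred_x0))
        out.append(math.sqrt(alpha_bar_prev) * pred_x0
                   + math.sqrt(1.0 - alpha_bar_prev) * e)
    return out

# A latent value of 2.5 lies outside [-1, 1]; noise it, then step back to alpha_bar = 1.
x0, eps, a_t = [2.5], [0.3], 0.5
x_t = [math.sqrt(a_t) * x0[0] + math.sqrt(1.0 - a_t) * eps[0]]
unclamped = ddim_step(x_t, eps, a_t, 1.0)
clamped = ddim_step(x_t, eps, a_t, 1.0, clamp_x0=True)
print(unclamped, clamped)
```

In this toy case the unclamped step recovers the original value, while the clamped step cannot return anything beyond 1.0.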

For portraits and faces, raise the DDIM step count to 100 and the CFG scale to 8.5:

python inference.py \
    --checkpoint sd_epoch_042.pt \
    --prompt "a photorealistic portrait of a woman with blue eyes, soft studio lighting" \
    --steps 100 \
    --guidance 8.5 \
    --output portrait.png

For negative-prompt CFG (CUDA only), use SD_ImageGen.py:

python SD_ImageGen.py \
    --checkpoint sd_epoch_042.pt \
    --prompts "a futuristic city skyline at night with neon lights" \
    --negative "blurry, low quality, distorted, watermark" \
    --steps 50 --guidance 7.5
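
For reference, the --guidance value is the classifier-free guidance (CFG) scale. The standard CFG combination is sketched below on plain Python lists; that SD_ImageGen.py combines its predictions exactly this way is an assumption, and `apply_cfg` is a hypothetical name. With a negative prompt, the usual practice is to feed the negative prompt's embedding to the unconditional branch in place of the empty prompt.

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale=7.5):
    """Extrapolate from the unconditional (or negative-prompt) noise
    prediction toward the prompt-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_negative = [0.0, 1.0]   # prediction conditioned on the negative prompt
eps_prompt = [1.0, 1.0]     # prediction conditioned on the positive prompt
guided = apply_cfg(eps_negative, eps_prompt, guidance_scale=7.5)
print(guided)
```

A scale of 1.0 returns the conditional prediction unchanged; larger values push the result further away from the negative-prompt prediction.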

Training summary

| Phase | Epochs | Dataset (after filtering) | LR | Notes |
|-------|--------|---------------------------|----|-------|
| 1 | 1–10 | LAION-2B-en, aesthetic ≥ 6.5 (1.32M) | 1e-5 | Broad pretraining (~3 h/epoch) |
| 2 | 11–17 | LAION, aesthetic ≥ 7.5, CLIP ≥ 0.30 (213k) | 1e-5 | Best loss 0.0947 @ ep 16 |
| 3 | 18–22 | DiffusionDB + JourneyDB (~482k) | 1e-5 | Synthetic / Midjourney domain mix |
| 4 | 23–29 | VGGFace2 (51,786) | 2e-6 | Face anatomy fine-tune |
| 5 | 30–38 | COCO person crops (59,494) | 1.5e-6 | Full-body fine-tune (caused face regression) |
| 6 | 39–42 | Mixed LAION + VGGFace2 + COCO (250k) | 1e-6 | Released checkpoint (sweet spot) |
| 7 | 43–48 | Comprehensive mix (572k) | 1e-6 | Final consolidation |
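
As a quick cross-check of the phase table, the epoch ranges can be summed and a rough count of training samples derived. This sketch assumes each epoch is one full pass over the filtered dataset and uses the rounded dataset sizes from the table, so the sample count is only an estimate.

```python
# (phase, first_epoch, last_epoch, approx_dataset_size), copied from the phase table.
PHASES = [
    (1, 1, 10, 1_320_000),
    (2, 11, 17, 213_000),
    (3, 18, 22, 482_000),
    (4, 23, 29, 51_786),
    (5, 30, 38, 59_494),
    (6, 39, 42, 250_000),
    (7, 43, 48, 572_000),
]

total_epochs = sum(last - first + 1 for _, first, last, _ in PHASES)
samples_seen = sum((last - first + 1) * size for _, first, last, size in PHASES)

print(f"total epochs: {total_epochs}")
print(f"approx. samples seen: {samples_seen:,}")
```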

Detailed write-up and engineering lessons:

  • 📘 blog_post.md: full technical Medium-style write-up
  • 📋 summary.md: authoritative engineering reference

Known limitations

  • Faces are usable, but eyes are still slightly oversized at higher CFG
  • Full-body anatomy: left-arm geometry is occasionally weak
  • Animal faces are the weakest category in current evaluation
  • Food prompts confuse categories (pasta ↔ noodles, etc.)
  • This is an SD 1.x-class model: no SDXL-grade detail, no controllable spatial conditioning (use ControlNet downstream if needed)

Reproducibility

All training code, data-pipeline scripts (LAION metadata download / filtering / WebDataset tar shards / VAE latent pre-encoding), training loop, and inference scripts are open-source at:

🔗 https://github.com/atandra2000/StableDiffusion

Citation

@software{atandra2000_sd_from_scratch_v1,
  author  = {Bharati, Atandra},
  title   = {SD-From-Scratch v1: A Stable-Diffusion-class latent diffusion model trained from scratch on dual RTX 5090s},
  year    = {2026},
  url     = {https://huggingface.co/atandra2000/sd-from-scratch-v1},
  license = {MIT}
}