SD-From-Scratch v1 (epoch 42)
A Stable Diffusion 1.x-class latent diffusion model trained from scratch on 2× RTX 5090 (Blackwell, 33.7 GB VRAM each) with the SD_Train.py pipeline.
This checkpoint (sd_epoch_042.pt) is the visually best checkpoint from a 48-epoch run across 7 training phases (LAION broad → LAION refined → DiffusionDB/JourneyDB mix → VGGFace2 face fine-tune → COCO full-body → consolidation → final mixed).
- Architecture: UNet (ch=320, ch_mults=(1,2,4,4), attn_lvls=(1,2,3), heads=8), ~860M trainable parameters
- Frozen components: VAE stabilityai/sd-vae-ft-mse, text encoder openai/clip-vit-large-patch14
- Latent space: 64×64×4 (8× spatial compression of 512×512 RGB), scale factor 0.18215
- Training schedule: DDPM scaled-linear betas (0.00085 → 0.012, 1000 steps), Min-SNR γ=5.0 (γ=3.0 in late refinement)
- Precision: BF16 native autocast (no GradScaler), gradient checkpointing, Fused AdamW
- EMA decay: 0.9999 (GPU-resident shadow)
- Best training loss: 0.0947 (epoch 16, on filtered LAION 213k subset)
- Released checkpoint: epoch 42 (~12.5 GB), chosen for the best visual coherence across portraits, landscapes, and scenes
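The training-schedule bullet above can be made concrete with a minimal sketch. It assumes the usual SD 1.x reading of "scaled-linear" (linear in sqrt(beta), then squared) and the epsilon-prediction form of Min-SNR weighting, min(SNR, γ) / SNR. Neither detail is spelled out in this card, and the function names are illustrative rather than taken from SD_Train.py.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Betas that are linear in sqrt(beta) space, then squared (assumed definition)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def snr_per_step(betas):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t), with alpha_bar the running product of (1 - beta)."""
    snrs, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs

def min_snr_weights(snrs, gamma=5.0):
    """Per-timestep loss weight min(SNR, gamma) / SNR (epsilon-prediction form, assumed)."""
    return [min(s, gamma) / s for s in snrs]

betas = scaled_linear_betas()
weights = min_snr_weights(snr_per_step(betas), gamma=5.0)
```

Under these assumptions, low-noise timesteps (high SNR) are down-weighted while high-noise timesteps keep a weight of 1, so lowering γ from 5.0 to 3.0 shifts more of the loss toward the noisier steps.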
Files
sd_epoch_042.pt: full training checkpoint containing the UNet state_dict, EMA shadow weights, optimizer/scheduler state, and epoch/step/best_loss metadata. Load EMA weights at inference for best quality (see usage below).
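For anyone loading the checkpoint by hand rather than through the repo's scripts, the preference for EMA weights can be sketched as below. The dictionary key names are hypothetical: this card does not document the checkpoint layout, so inspect the actual keys (for example after `torch.load(path, map_location="cpu")`) and adjust the candidate lists.

```python
def select_inference_weights(
    checkpoint,
    ema_keys=("ema", "ema_shadow", "ema_state_dict"),   # hypothetical key names
    model_keys=("model", "unet", "state_dict"),         # hypothetical key names
):
    """Return (key, weights), preferring EMA shadow weights over raw UNet weights."""
    for key in (*ema_keys, *model_keys):
        weights = checkpoint.get(key)
        if weights:  # skip missing or empty entries
            return key, weights
    raise KeyError(f"no weights found; available keys: {sorted(checkpoint)}")

# Toy stand-in for a loaded checkpoint dictionary.
toy_checkpoint = {"model": {"w": 1.0}, "ema": {"w": 0.9}, "epoch": 42}
source, weights = select_inference_weights(toy_checkpoint)
```

The fallback to raw weights keeps the helper usable on checkpoints saved without an EMA entry, and the error message lists the available keys so a mismatch is easy to diagnose.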
License
MIT. See LICENSE in the source repo.
Intended use
Research and educational reproduction of from-scratch latent-diffusion training. Not intended for production deployment without further safety filtering. Not aligned for instruction-following or for any specific style/persona.
How to use
The checkpoint is loaded by the inference scripts in the source repo.
git clone https://github.com/atandra2000/StableDiffusion.git
cd StableDiffusion
pip install torch torchvision diffusers transformers huggingface_hub pillow
# Download the checkpoint from this HF model repo:
huggingface-cli download atandra2000/sd-from-scratch-v1 sd_epoch_042.pt --local-dir .
# Generate an image (uses EMA weights internally and removes the DDIM
# pred_x0 clamp, which would otherwise destroy SD latent signal):
python inference.py \
--checkpoint sd_epoch_042.pt \
--prompt "a beautiful sunset over mountain peaks, cinematic lighting" \
--steps 50 \
--guidance 7.5 \
--output sample.png
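The remark about the pred_x0 clamp can be illustrated with a single deterministic DDIM update on one scalar value. This is a sketch of the standard eta=0 DDIM step, not the code in inference.py; the clamp range of [-1, 1] and the toy numbers are assumptions chosen for illustration. Latent values are not confined to [-1, 1], so a clamp designed for pixel-space images truncates them.

```python
import math

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, clamp_x0=False):
    """One deterministic (eta=0) DDIM update for a single latent value."""
    # Estimate the clean latent from the noisy latent and the predicted noise.
    pred_x0 = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    if clamp_x0:
        # Pixel-space style clamp (assumed range); lossy for latents outside [-1, 1].
        pred_x0 = max(-1.0, min(1.0, pred_x0))
    # Re-noise the estimate to the previous (less noisy) timestep.
    x_prev = math.sqrt(alpha_bar_prev) * pred_x0 + math.sqrt(1.0 - alpha_bar_prev) * eps
    return pred_x0, x_prev

# Toy values: a clean latent of 3.0 lies outside the clamp range.
x0, eps = 3.0, 0.5
a_t, a_prev = 0.5, 0.8
x_t = math.sqrt(a_t) * x0 + math.sqrt(1.0 - a_t) * eps

free_x0, free_prev = ddim_step(x_t, eps, a_t, a_prev)
clamped_x0, clamped_prev = ddim_step(x_t, eps, a_t, a_prev, clamp_x0=True)
```

With a perfect noise prediction, the unclamped step recovers the original latent value, whereas the clamped step caps it at 1.0 and carries that error into every later step.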
For portraits and faces, push DDIM steps to 100 and CFG to 8.5:
python inference.py \
--checkpoint sd_epoch_042.pt \
--prompt "a photorealistic portrait of a woman with blue eyes, soft studio lighting" \
--steps 100 \
--guidance 8.5 \
--output portrait.png
For negative-prompt CFG (CUDA only), use SD_ImageGen.py:
python SD_ImageGen.py \
--checkpoint sd_epoch_042.pt \
--prompts "a futuristic city skyline at night with neon lights" \
--negative "blurry, low quality, distorted, watermark" \
--steps 50 --guidance 7.5
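As a rough sketch of what the guidance scale and negative prompt do, the standard classifier-free guidance rule extrapolates from a baseline noise prediction toward the prompt-conditioned one. A common convention is to compute the baseline from the negative prompt instead of an empty prompt; whether SD_ImageGen.py does exactly this is an assumption, and `cfg_combine` is an illustrative name.

```python
def cfg_combine(eps_cond, eps_base, guidance_scale=7.5):
    """Classifier-free guidance: eps = eps_base + s * (eps_cond - eps_base).

    eps_cond: noise predicted with the positive prompt.
    eps_base: noise predicted with the empty prompt, or with the negative
              prompt when one is supplied (assumed convention).
    """
    return [b + guidance_scale * (c - b) for c, b in zip(eps_cond, eps_base)]

# Toy per-element predictions; real ones are 4×64×64 latent tensors.
guided = cfg_combine([1.0, -0.5], [0.0, 0.5], guidance_scale=7.5)
```

A scale of 1.0 reproduces the conditioned prediction, and larger values push the sample further away from whatever the baseline branch describes.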
Training summary
| Phase | Epochs | Dataset (after filtering) | LR | Notes |
|---|---|---|---|---|
| 1 | 1–10 | LAION-2B-en aesthetic ≥ 6.5 (1.32M) | 1e-5 | Broad pretraining (~3 h/epoch) |
| 2 | 11–17 | LAION aesthetic ≥ 7.5, CLIP ≥ 0.30 (213k) | 1e-5 | Best loss 0.0947 @ ep 16 |
| 3 | 18–22 | DiffusionDB + JourneyDB (~482k) | 1e-5 | Synthetic / Midjourney domain mix |
| 4 | 23–29 | VGGFace2 (51,786) | 2e-6 | Face anatomy fine-tune |
| 5 | 30–38 | COCO person crops (59,494) | 1.5e-6 | Full-body fine-tune (caused face regression) |
| 6 | 39–42 | Mixed LAION + VGGFace2 + COCO (250k) | 1e-6 | Released checkpoint (sweet spot) |
| 7 | 43–48 | Comprehensive mix (572k) | 1e-6 | Final consolidation |
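The table can be sanity-checked with a few lines of arithmetic. The sketch below copies the epoch ranges and dataset sizes from the table, treats the rounded sizes as exact, and assumes each epoch is one full pass over that phase's filtered dataset, which the card does not state explicitly.

```python
# (phase, first_epoch, last_epoch, approx_dataset_size), copied from the table above.
PHASES = [
    (1, 1, 10, 1_320_000),
    (2, 11, 17, 213_000),
    (3, 18, 22, 482_000),
    (4, 23, 29, 51_786),
    (5, 30, 38, 59_494),
    (6, 39, 42, 250_000),
    (7, 43, 48, 572_000),
]

def epochs_in(first, last):
    """Number of epochs in an inclusive epoch range."""
    return last - first + 1

def is_contiguous(phases):
    """True if every phase starts one epoch after the previous phase ends."""
    return all(nxt[1] == cur[2] + 1 for cur, nxt in zip(phases, phases[1:]))

total_epochs = sum(epochs_in(first, last) for _, first, last, _ in PHASES)
# Upper-level estimate only: assumes one full dataset pass per epoch.
samples_seen = sum(epochs_in(first, last) * size for _, first, last, size in PHASES)
released_phase = next(p for p, first, last, _ in PHASES if first <= 42 <= last)
```

The epoch ranges add up to the 48 epochs quoted at the top of the card, and epoch 42 falls at the end of phase 6.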
Detailed write-up and engineering lessons:
- blog_post.md: full technical Medium-style write-up
- summary.md: authoritative engineering reference
Known limitations
- Faces are usable, but eyes are still slightly oversized at higher CFG
- Full-body anatomy: left-arm geometry is occasionally weak
- Animal faces are the weakest category in current evaluation
- Food prompts confuse similar categories (e.g. pasta and noodles)
- This is an SD 1.x-class model: no SDXL-grade detail and no controllable spatial conditioning (use ControlNet downstream if needed)
Reproducibility
All training code, data-pipeline scripts (LAION metadata download / filtering / WebDataset tar shards / VAE latent pre-encoding), the training loop, and inference scripts are open-source at: https://github.com/atandra2000/StableDiffusion
Citation
@software{atandra2000_sd_from_scratch_v1,
author = {Bharati, Atandra},
title = {SD-From-Scratch v1: A Stable-Diffusion-class latent diffusion model trained from scratch on dual RTX 5090s},
year = {2026},
url = {https://huggingface.co/atandra2000/sd-from-scratch-v1},
license = {MIT}
}