# Swappable adapters / the "model factory" — SOTA as of June 2026

Deep-research scan of how the field does what we're calling the **model factory** (one base + a library of swappable LoRA "souls/modules," the model self-routing among them).

Bottom line: **we independently reinvented a named, active research area — "modular LLMs / MoErging" — and the hardest part (hot-swap serving) is already solved on MLX.** The factory is more buildable than first framed.

## 1. It has a name: **Modular LLMs / MoErging**

"MoErging" = **M**odel + r**E**cycling/**r**outing of experts. Independently-trained adapters accumulated into a **library**, with a **router** that selects/composes them per input — exactly our "library of swappable souls."

- **Survey:** *A Survey on Model MoErging: Recycling and Routing Among Specialized Experts* (arXiv 2408.07057).
- **Canonical:** *Towards Modular LLMs by Building and Reusing a Library of LoRAs* (Ostapenko et al., ICML'24).
- Design axes the survey names — and where we sit:
  - **Expert source:** independently-trained, accumulated over time ✅ (our per-facet heals).
  - **Routing granularity:** example/task-level vs token-level. **Ours = example-level** (pick the code module per task) — cheaper, fits our case.
  - **Router:** learned vs **training-free** (see §2).
  - **Philosophy:** specialist vs generalist vs **hybrid** (generalist base + specialist plug-ins) — the survey flags hybrid as underexplored; **that's exactly our "core soul always-on + swappable code."**

## 2. The self-routing ("train it to swap") — 3 approaches, choose by library size

| Approach | Methods | When to add an adapter | Fit for us |
|---|---|---|---|
| **Training-free / zero-shot** | **Arrow** (routes via adapter weight SVD — no training), **PHATGOOSE** (post-hoc tokenwise gating) | **drop it in → instantly routable, NO router retrain** | **best** for an open-ended factory |
| **Learned dynamic** | **X-LoRA** (layer/token-level), **LD-MoLE** (2026, differentiable, adaptive #experts), **Glider** (instruction-driven global+local) | retrain the router | higher quality, more upkeep |
| **Retrieval** | **CARLoS** (retrieve LoRAs at scale), **LoRAHub** | embed task → retrieve adapter | **when the library gets large** (RAG-for-adapters) |

**Recommendation:** our **dispatcher gold** (the model emits `game/app`) is the simplest, most transparent router and needs no extra infra — keep it. Add **Arrow-style training-free routing** as the automatic fallback (so new specialties are routable without a retrain), and **CARLoS-style retrieval** once the library outgrows a handful of modules.

## 3. The hot-swap serving — SOLVED, and on MLX

The thing I'd called a "Level-3 engineering task" mostly exists already:

- **CUDA/cloud:** **S-LoRA** (Unified Paging → 1000s of adapters/GPU), **vLLM** runtime LoRA load/unload (`VLLM_ALLOW_RUNTIME_LORA_UPDATING`, per-request), **LoRAX**, **ExpertWeave**, **InfiniLoRA** (2026, disaggregated), **Activated-LoRA** (cross-model KV-cache reuse, 2025). vLLM blog (Feb 2026) adds multi-LoRA for MoE bases.
- **MLX (our stack) — `mlx-optiq`:** **mounted LoRA hot-swap.** Wraps each target Linear in a `MountedLoRALinear` holding `{adapter_id: (A,B,scale)}` over the frozen base; a **`ContextVar` picks the active adapter per request**, so concurrent async requests with different adapters don't collide — **N adapters resident on one base, no reload.** `optiq serve --model --adapter ./my_adapter`. **This IS our Level 3** — adopt it (MLX-first principle).
- **HF PEFT:** `add_adapter`/`set_adapter` (switch active), `hotswap_adapter` (in-place weight replace, avoids `torch.compile` recompile). **Constraint:** hot-swap needs **uniform rank / alpha / target-modules** across adapters.

## 4. Where we're AHEAD of the field

- **Eliteness, not just task-coverage.** Most MoErging libraries are task-LoRAs trained on benchmark datasets (FLAN/NIv2 tasks). Ours are **masters-grounded, audit-gated, verifier-backed** — a quality axis the papers lack.
- **The hybrid (generalist soul + specialist code)** the survey calls underexplored — we're already building it (Pattern B: fuse the soul into the base, then thin code adapters).
- **On-device / MLX** — the field is CUDA/cloud-centric; us + `mlx-optiq` = an on-device modular-LLM.
- **Verified decoding + the soul** — differentiators no MoErging system has.

## 5. Gotchas the literature flags (so we don't trip them)

1. **Standardize adapters** — same rank (we use 16), alpha, target-modules — or hot-swap recompiles/breaks.
2. **Token-level routing adds latency** — example/task-level (our case) is the cheap, right choice.
3. **Routing quality degrades as the library grows** → move to retrieval (CARLoS) past ~a dozen modules.
4. **Mis-routing happens** → always keep the **`core` fallback** (the soul handles anything unmatched).

## 6. The plan this implies

- **Serving (Level 3):** evaluate **`mlx-optiq`** for N-resident-adapter hot-swap on our base (replaces "reload the whole serve" — the only reason swaps were slow). Validate it loads our q3a4 base + our rank-16 adapters.
- **Routing (Level 2):** ship the **dispatcher gold** now; add **Arrow** training-free routing as the auto-fallback.
- **Library (Level 1):** keep adapters **uniform (rank 16, same targets)** so they're hot-swappable + composable.
- **Position the work** as a **verified, on-device, masters-curated MoErging system** — genuinely novel framing.

## 7. Round 2 — the June-2026 frontier (updates the plan)

- **Adapter GENERATION, not just selection:** Sakana **Text-to-LoRA (T2L)** + **Doc-to-LoRA (D2L)** (Feb 2026) — hypernetworks that emit a LoRA from a *text task-description* (T2L) or a *document* (D2L) in **one forward pass** (one-time meta-train → instant after). T2L matches task-specific adapters at ~4× less than in-context learning; D2L folds a 128K-token doc into a **<50 MB adapter** (vs ~12 GB KV-cache). **Factory endgame: *generate* a specialty on demand** instead of healing it. Caveat: matches *task*-LoRAs, not our masters-eliteness bar — a direction, not a drop-in.
- **Routing at scale:** **LORAUTER** (arXiv 2601.21795, Jan 2026) routes by **task representation** (task embeddings from a small validation set; *no adapter training data*) → scales with #tasks not #adapters; **oracle-matching (101.2%)** when an aligned adapter exists, **+5.2 on unseen**, robust to **1500+ noisy adapters**. Our scaling answer (supersedes CARLoS retrieval for us).
- **The MoE multi-LoRA "tax" — our base IS a MoE:** multi-LoRA on a MoE costs **4 LoRA-kernel ops per expert, per adapter** (gate_up + down, shrink + expand) — the bottleneck the vLLM/AWS Feb-2026 work targets. The "MoE Tax" analysis: swapping still saves **~95% VRAM**. **Measure this on our 77-expert base under `mlx-optiq`.**
- **Cross-base transfer → the demolition family ~free:** **LoRA-X** (2501.16559), **Cross-LoRA** (2508.05232), **Adapt-Once-Thrive-with-Updates** (2506.06844) transfer adapters across base versions **without retraining** → heal the soul *once* on the 106 GB base, **transfer to the 67/55/36/20/14 GB family (#53–58)** instead of re-healing each size.
- **More 2026 routing:** LoRA-Mixer (serial attention routing), DynMoLE (hybrid Tsallis-entropy), MoLoRA (per-token composable skills), Adaptive Minds (LoRAs-as-tools for agents).
- **The honest skeptic:** *"Position: Pause Recycling LoRAs"* (arXiv 2506.13479) — the field over-claims; recycled-adapter routing can underperform proper training. **Our guard:** only route when the scorecard shows it beats one soul'd model.
- **Apple frontier:** **Orion** (2603.06728) programs the Apple **Neural Engine (ANE)** for LLM inference — the on-device path beyond MLX-GPU.

**Net update:** near-term = `mlx-optiq` serving + **LORAUTER-style task routing** + uniform rank-16 adapters (+ **cross-base transfer** for the family). Frontier = **T2L-style hypernetwork generation**. Discipline = the skeptic's bar — route only if it *wins* on the scorecard.

## 8. Round 3 — production reality + the real-world proof

- **Apple Intelligence = this architecture, shipped to ~1B devices.** Apple's on-device foundation model is a **2-bit-QAT base (~3.7 bits/weight), frozen**, with **swappable per-task LoRA adapters** on all attention + FFN projections; the Swift **Foundation Models framework** adds **schema-constrained "guided generation"** (`@Generable` = our constrained decode), **tool-calling**, and KV-cache-aware sessions. **We independently built the same pattern — quantized base + swappable adapters + constrained decode + tools — on a 743B→99 GB *frontier* model instead of a 3 B.** The strongest possible validation. (Adapter specifics in the full report, arXiv 2507.13575.)
- **Composition (soul + code at once) → MERGE:** mergekit (8+ algos). **TIES** (trim → elect-sign → merge; scales to many), **DARE** (drop + rescale, a TIES pre-step), **Model/LoRA Soups** (linear avg). Rule of thumb: **3+ adapters → TIES, 5+ → DARE-TIES.** → Pattern B option 3: **TIES-merge soul + code** into one adapter (base stays pristine) instead of fusing the soul in.
- **LoRA on a quantized base (our q3a4) recovers the quant damage:** adapters give **<1% vs fp16 but +2–7% over the bare quantized base** — *the reason the soul heal works on 3-bit.* **LowRA** (arXiv 2502.08141) extends accurate LoRA **under 2 bits** → enables the 2-bit family (#57–58).
- **Production reality + the open lane:** **LoRAX** (Predibase — 100s of adapters/GPU, the open standard; → Rubrik Jun 2025) and **TGI Multi-LoRA** ("deploy once, serve 30") are production; Fireworks notably *lacks* LoRA serving. **No one ships a *curated, elite, verified* adapter library** — they're task-LoRA dumps. Our masters-trained, audit-gated library is the open lane.

[Round-3 sources: Apple AFM tech report 2025 + arXiv 2507.13575 · HF PEFT model-merging (TIES/DARE) · mergekit · LoRA Soups (2410.13025) · QLoRA · LowRA (2502.08141) · Predibase LoRAX · TGI Multi-LoRA (HF blog)]

## 9. Rounds 4–5 — the frontier hits our live decisions (June 2026)

- **MoE-LoRA placement (every future heal):** **MoE-Sieve** (2603.24044) — LoRA only the **top-25% most-routed experts** + attention/router/shared-expert → matches full LoRA at **70–73% fewer params**. Smaller, faster, more-uniform (∴ more hot-swappable) adapters. (Also TT-LoRA MoE; LoRA-on-the-router.)
- **The design degeneration = Computation Collapse (diagnosed):** "Signal Degradation vs Computation Collapse" (2604.19884) — our repetition + corrupted tokens (`UTF.FF9B`) on long-gen = **Computation Collapse** (early-layer component failure), **NOT** fixable by decoding tricks. Fix = **mixed-precision protecting salient/structurally-sensitive experts + early layers** (= our **#59** saliency-dynamic quant): SliM-LLM, **SFMP** (2602.01027, search-free), **Beyond-Outliers** dual numerical+structural sensitivity (2603.17354), channel-wise MP; MoE-specific Mixture-Compressor (2.54 bpw), ATOM. **Design is recoverable — keep salient design/early experts at 4-bit+.**
- **Curated-SFT validated:** **LIMA** (quality+diversity ≫ size; alignment *unlocks* pretrained ability) = our 250-gold soul; **"SFT on Curated Data is RL"** (2507.12856) → curated-SFT is implicit RL; **iw-SFT** (importance-weighted) a free upgrade to try.
- **REAP (our prune) = ICLR 2026 + a fix to adopt:** March-2026 update **renormalizes top-k router logits to sum to 1** (pull into our prune); paper **confirms prune > merge for generative** (vindicates #19); **router calibration after prune** matters (2603.02217) — verify ours.
- **Self-reward hacks → verifiable reward wins:** Self-Rewarding / Meta-Rewarding (2401.10020) — self-training real problems **reward-hacks**; needs **verifiable rewards** = our verifier mesh / RLVR (why GRPO→SFT). The field validated our instinct.
- **Constrained-decode quality tax is fixable:** **XGrammar-2** (May 2026; default for vLLM/SGLang/TensorRT, <40µs/tok, bitmask) — the 10–30% quality drop is an **enforcement artifact**, removable. Align our **#32** MLX constrained-decoder; verify it isn't taxing quality. (Pre3 2506.03887; Draft-Conditioned 2603.03305.)
- **Newest routing:** Brainstacks (2604.01152 — 7-projection routing + adapter *stacking* + disk-offload), MoA (heterogeneous experts), LoRA-Mixer (attention-routed), Reversible Lifelong Editing (2603.11239).

[Rounds 4–5 sources: MoE-Sieve 2603.24044 · Computation-Collapse 2604.19884 · SFMP 2602.01027 · Beyond-Outliers 2603.17354 · SliM-LLM · LIMA · SFT-is-RL 2507.12856 · REAP 2510.13999 (ICLR'26) · Router-Calibration 2603.02217 · Self-Rewarding 2401.10020 · XGrammar-2 2411.15100 · Brainstacks 2604.01152 · LoRA-Mixer 2507.00029]

## 10. Round 6 — "how others do it" + our architecture (validates #69, the DSA, the serve)

- **HF/MLX community:** mlx-community **~4,810 converted models**; MoE kernels mature (late-2025); 3–8 bit + mixed-precision (**mxfp8/nvfp4** — for #59). Qwen3-Coder-30B-**A3B** MoE ~130 tok/s on M4 Pro (vs 43 Ollama) — but 3B-active; **our 11–14 tok/s is the active-param count, not MLX** (confirms SPEED.md). Also Rapid-MLX, vLLM Apple ports, M5-GPU neural accelerators.
- **DSA = DeepSeek Sparse Attention (our architecture):** ~98% attention-compute cut at 128K via **fixed top-2048 tokens/query** = our `index_topk: 2048` = the heal max-seq-2048 limit (the scatter-VJP). **DeepSeek-V4** (Apr 2026): hybrid compression+sparse, 27% FLOPs / 10% KV-cache of V3.2 at 1M ctx — the architecture's future.
- **Spec-decode dead on big MoE (#69) — confirmed:** EAGLE-3.1 (May 2026) great on DENSE; literature explicit that **large-MoE break-even rises until spec-decode HURTS**, MoE drafts "largely unexplored." Our exact finding. **SpecForge** (2603.18567) = the open training framework IF we build the fresh EAGLE head.
- **KV-cache quant for the serve (the 118 GB crash):** **TurboQuant** (Google ICLR'26 — 6× KV, 8× attn, no calib), **KVQuant** (sub-4-bit, 10M ctx), KIVI, Cocktail — headroom for long runs.

[Round-6 sources: mlx-community (HF) · MLX-on-M5 (Apple ML) · DeepSeek-V3.2/V4 (2512.02556, vLLM 2026-04-24) · Native Sparse Attention 2502.11089 · EAGLE-3.1 (MarkTechPost 2026-05) · SpecForge 2603.18567 · TurboQuant (Google ICLR'26) · KVQuant · KV-cache survey (MarkTechPost 2026-04)]

## 11. Round 7 — adjacent pillars (validates #8 KD, #33 kernels, contamination-checking)

- **On-policy distillation (#8) = industry default:** student samples own trajectories + teacher dense token-supervision (fixes off-policy mismatch). **Qwen3, DeepSeek-V4, Gemma 2, MiMo-V2 all adopt OPD** — we were early. Extensions: **Lean-OPD** (self-teacher critiques the student's Lean attempt → our prover #27-31), **Self-Distilled Reasoner** (2601.18734, on-policy self-distill). Survey 2604.00626; Thinking Machines OPD.
- **Contamination-checking vindicated:** a 2026 dose-response study (Qwen3 34M-344M, sweep test-replicas) resolved the 2024-25 split → contamination **does** inflate, measurably. Our 0/0/0.4% near-dup check = right discipline. Honest set = contamination-resistant **LiveCodeBench / LiveBench / MMLU-Pro / FrontierMath** (= our #66). Detection: Min-K%, ConStat, canary-GUID, time-partition.
- **M5 + MLX kernels (#33):** M5 (Mar 2026) = Neural Accelerators in each of 40 GPU cores → **up to 4× TTFT** (matrix-mul in silicon); MLX exploits via kernel **fusion** (Metal 4 + TensorOps), llama.cpp doesn't = our #33 approach. Speeds **prompt-processing**, not bandwidth-bound decode (our 11-14 tok/s floor holds). (Aside: native-MTP spec-decode now works on a 27B/MLX — but that's small/dense; our 743B-MoE #69 finding stands.)
- **Agent memory (#57 compaction):** 2026 frontier = autonomous compaction (MIRIX, A-Mem: compress/prune) + **provenance-verified tiered memory** (2602.17913 — integrity-aware, like our agent). **AMA-Bench** (2602.22769) to measure.

[Round-7 sources: OPD survey 2604.00626 · Self-Distilled Reasoner 2601.18734 · Lean-OPD · Contamination dose-response 2026 · contamination-resistant benchmarks · M5 MLX Neural Accelerators (Apple) · Provenance-Aware Tiered Memory 2602.17913 · AMA-Bench 2602.22769 · ACON 2510.00615]

## 12. Round 8 — agentic pillars + a live-bug fix (overthinking)

- **Overthinking (fixes GSM8K + a speed win):** 2025-26 result — **longer CoT ≠ better**; o1-style models overthink easy problems → degrade accuracy + miscalibrate. Our 5–8K-token CoT = textbook overthinking = what broke the GSM8K parser (no clean final answer). **Shorter chains = better accuracy + fewer tokens at 11-14 tok/s.** Tune `enable_thinking` shorter. ("Don't Overthink It" 2505.17813; "When More Thinking Hurts" 2604.10739; deep-thinking-token quality 2602.13517).
- **Competitive (validates base, flags benchmark trap):** GLM-5.x among open models that **closed the gap** on closed frontier for multi-step coding (w/ DeepSeek-V4, Kimi-K2.6, Qwen-3.6). But **SWE-bench is saturating** — OpenAI stopped reporting it ("scoring" vs "useful" diverged). Weight real-task benches: FeatureBench (2602.10975), Terminal-Bench, LongCLI-Bench (2602.14337) for #62.
- **Agent security (our 5-layer matches SOTA):** prompt-injection = OWASP LLM01 (3yr running). Live threats: **RAG poisoning (5 docs → 90%)**, **tool poisoning** → LlamaFirewall (2505.03574), CausalArmor (2602.07918), TRUSTDESC (2604.07536). **Zombie Agents** (2602.15654 — self-evolving-agent injection) = a flywheel risk to guard.
- **Agentic RAG (CallSieve upgrade):** retrieval-as-agent-tools (A-RAG 2602.03442: hierarchical keyword+semantic+chunk-read), iterative multi-hop, multi-agent retrieve+validate+synthesize. SoK 2603.07379. Upgrade path for our CallSieve/live-docs.

[Round-8 sources: Overthinking 2604.10739 / 2505.17813 / 2602.13517 · SWE-bench-Verified leaderboard · FeatureBench 2602.10975 · LongCLI-Bench 2602.14337 · OWASP LLM01 · LlamaFirewall 2505.03574 · CausalArmor 2602.07918 · TRUSTDESC 2604.07536 · Zombie-Agents 2602.15654 · A-RAG 2602.03442 · SoK Agentic-RAG 2603.07379]

## 13. Round 9 — saturation (the floor; mostly confirms)

- **Extreme low-bit (family #57-58):** BitNet b1.58 (ternary {-1,0,1}, ~90% mem, 38.8× energy on 30B), BitNet-v2 (4-bit activations), Sparse-BitNet (1.58-bit + N:M), PTQ1.61. **Caveat: BitNet needs QAT/training — our PTQ at 2-bit is lossier; the 2-bit family stays the flagged experiment.**
- **GGUF/llama.cpp (#51):** K-quants (Q2_K–Q6_K), i-quants + importance matrices, MoE CPU+GPU split (active→GPU, experts→CPU). The non-MLX path.
- **On-device multimodal:** Apple AFM (confirmed); **FastVLM** (FastViTHD, 85× faster TTFT) = Apple's edge VLM encoder (ref for vision #43); unified shared-backbone multimodal.
- **VERDICT: research saturated.** Rounds 5–9 increasingly *confirmed* our builds (prune>merge, verifier-mesh, OPD, MLX-fusion, DSA top-2048, contamination-checking) rather than revealing new levers. Net-new actionables have stabilized → see `IMPLEMENTATION_PLAN.md`.

[Round-9 sources: BitNet b1.58 / v2 2504.18415 · Sparse-BitNet 2603.05168 · PTQ1.61 2502.13179 · llama.cpp quant eval 2601.14277 · FastVLM · Apple AFM 2507.13575]

## Sources

- [A Survey on Model MoErging (arXiv 2408.07057)](https://arxiv.org/pdf/2408.07057)
- [Towards Modular LLMs by Building and Reusing a Library of LoRAs (Ostapenko)](https://www.semanticscholar.org/paper/6839e8ef0205ad4732e9f743977eb5bfc296ec2c)
- [Learning to Route Among Specialized Experts for Zero-Shot Generalization — PHATGOOSE (arXiv 2402.05859)](https://arxiv.org/pdf/2402.05859)
- [LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts (arXiv 2509.25684)](https://arxiv.org/abs/2509.25684)
- [X-LoRA: Mixture of low-rank adapter experts (APL Machine Learning)](https://pubs.aip.org/aip/aml/article/2/2/026119/3294581/)
- [Glider: Global and Local Instruction-Driven Expert Router (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.319.pdf)
- [CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale (arXiv 2512.08826)](https://arxiv.org/html/2512.08826)
- [S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv 2311.03285)](https://arxiv.org/pdf/2311.03285)
- [vLLM multi-LoRA serving (Feb 2026)](https://vllm.ai/blog/2026-02-26-multi-lora)
- [InfiniLoRA: Disaggregated Multi-LoRA Serving (arXiv 2604.07173)](https://arxiv.org/pdf/2604.07173)
- [mlx-optiq — mounted-LoRA hot-swap on Apple Silicon (PyPI)](https://pypi.org/project/mlx-optiq/)
- [HF PEFT — Hotswapping adapters](https://huggingface.co/docs/peft/v0.14.0/en/package_reference/hotswap)
- [Unsloth — LoRA hot-swapping guide](https://unsloth.ai/docs/basics/inference-and-deployment/vllm-guide/lora-hot-swapping-guide)
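
A minimal sketch of the mounted-adapter pattern described in §3: one frozen base weight, several `(A, B, scale)` adapters resident at once, and a `ContextVar` selecting the active one per request. It is an illustration only, not the `mlx-optiq` implementation. The class body, `mount`, `ACTIVE_ADAPTER`, `handle_request`, and the toy weights are all assumptions made for this example, and plain Python lists stand in for MLX arrays.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical module-level selector; None means "base model only".
ACTIVE_ADAPTER = ContextVar("active_adapter", default=None)


def matvec(matrix, vector):
    """Dense matrix-vector product on plain lists (stand-in for an MLX matmul)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class MountedLoRALinear:
    """Frozen base weight plus a dict of resident LoRA adapters (illustrative)."""

    def __init__(self, weight):
        self.weight = weight   # frozen base, shape (out, in)
        self.adapters = {}     # adapter_id -> (A, B, scale)

    def mount(self, adapter_id, A, B, scale):
        # A has shape (rank, in), B has shape (out, rank).
        self.adapters[adapter_id] = (A, B, scale)

    def __call__(self, x):
        y = matvec(self.weight, x)
        adapter_id = ACTIVE_ADAPTER.get()
        if adapter_id is None:
            return y
        A, B, scale = self.adapters[adapter_id]
        delta = matvec(B, matvec(A, x))   # low-rank update B(Ax)
        return [yi + scale * di for yi, di in zip(y, delta)]


async def handle_request(layer, adapter_id, x):
    # Each asyncio task runs in its own copy of the context, so this set()
    # is invisible to the other in-flight requests.
    ACTIVE_ADAPTER.set(adapter_id)
    await asyncio.sleep(0)   # yield so the requests actually interleave
    return layer(x)


async def main():
    layer = MountedLoRALinear([[1.0, 0.0], [0.0, 1.0]])
    layer.mount("code", A=[[1.0, 1.0]], B=[[1.0], [0.0]], scale=0.5)
    layer.mount("soul", A=[[1.0, -1.0]], B=[[0.0], [1.0]], scale=2.0)
    x = [1.0, 2.0]
    return await asyncio.gather(
        handle_request(layer, "code", x),
        handle_request(layer, "soul", x),
        handle_request(layer, None, x),
    )


results = asyncio.run(main())
print(results)  # [[2.5, 2.0], [1.0, 0.0], [1.0, 2.0]]
```

The three requests share one `layer` object and interleave at the `await`, yet each sees only its own adapter. That is the property the §3 bullet relies on: swapping is a dictionary lookup, with no reload of the base. The sketch also shows why §5's uniformity rule matters in practice, since every mounted adapter must match the base layer's input and output dimensions.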