M28: Federated Learning (LoRA Aggregation)
Spec version: v3.0 (experimental)
Depends on: M03 Capability Bus, M04 LLM, M14 Federation, X02 Event Log, M16 Tokens, X06 WebSocket
Depended on by: nothing in MVP (opt-in research feature)
1. Responsibility
Federated learning of small LoRA adapters on top of a shared base model. Each node trains locally on its own data, sends only the adapter weight deltas (not raw data, not full weights) to an aggregator, and receives back an averaged adapter that subsequent nodes can use or further refine.
The bet: a 3B-parameter base model with a community-tuned LoRA adapter ("how people in our village actually phrase things, what jargon our Feuerwehr uses, what the local agricultural calendar looks like") is more useful for the community than a generic 3B model, and we can do this without any node ever shipping its private data off-box.
This module deliberately stays at LoRA scope only. Full fine-tunes, distillation, and continual pre-training are explicitly out, both because they are bandwidth-hostile and because the privacy story for full-weight federation is significantly harder.
2. Non-goals
- Federating raw data. Never. Training data stays on the node that owns it.
- Full fine-tunes. LoRA only. If a use case truly needs more, that's a different research project.
- Cross-base-model aggregation. All participants in a round must run the same base model at the same quantisation. Heterogeneous aggregation is open research.
- Mandatory participation. Every node decides per-round whether to join. There is no "you must contribute back" rule.
- Aggregator centralisation. Any node can host an aggregator. There is no privileged aggregator role.
- Hiding participation. Whether you joined a round is visible to other participants in that round; only your data and your gradients are private.
3. File layout
hearthnet/fedlearn/
├── __init__.py
├── coordinator.py   # Orchestrates a round: announce, gather, aggregate, distribute
├── participant.py   # Local-side: respond to round announcements, train, submit
├── trainer.py       # Wraps M04 LLM in a LoRA training loop (peft + bitsandbytes)
├── aggregator.py    # FedAvg with optional secure aggregation
├── delta.py         # Serialise/deserialise LoRA deltas (state-dict subset)
├── privacy.py       # Optional DP-noise injection and gradient clipping
└── manifest.py      # Round manifest: base model id, hyperparams, signature
4. Public API
4.1 Dataclasses
from dataclasses import dataclass
from datetime import datetime
from typing import NewType

RoundID = NewType("RoundID", str)  # ULID

@dataclass(frozen=True)
class RoundManifest:
    round_id: RoundID
    coordinator: NodeID
    base_model_id: str  # exact model id from M04 ("qwen2.5:3b-instruct-q4_K_M")
    base_model_sha: str  # SHA-256 of base weights; mismatch = exclusion
    lora_target_modules: tuple[str, ...]  # which linear layers carry LoRA (e.g. "q_proj","v_proj")
    lora_rank: int  # 4 ≤ r ≤ FEDLEARN_MAX_LORA_RANK
    lora_alpha: int
    lora_dropout: float
    train_steps: int  # max local SGD steps per participant
    learning_rate: float
    batch_size: int
    seed: int  # for deterministic init of LoRA matrices
    dp_noise_scale: float  # 0.0 = off
    clip_norm: float  # gradient clip; must be > 0 if DP on
    min_participants: int  # round aborts if fewer participants submit
    max_participants: int
    deadline: datetime  # UTC; submissions after this dropped
    topic: str  # free-form: "niederrhein-emergency", "village-chat"
    consent_text: str  # human-readable; participant must accept
    coordinator_sig: bytes  # detached Ed25519 over the manifest

@dataclass
class ParticipantSubmission:
    round_id: RoundID
    participant: NodeID
    delta_bytes: bytes  # serialised LoRA state-dict
    delta_sha: str
    num_samples: int  # for weighted FedAvg
    train_loss: float  # for telemetry only
    submitted_at: datetime
    signature: bytes  # Ed25519 over (round_id, participant, delta_sha, num_samples)

@dataclass
class RoundResult:
    round_id: RoundID
    aggregated_delta_sha: str
    n_participants: int
    total_samples: int
    aggregator: NodeID
    completed_at: datetime
    manifest_sha: str
    download_url: str  # capability bus uri for fetching the aggregated delta
4.2 Capabilities
async def fedlearn_round_announce(manifest: RoundManifest) -> RoundID
async def fedlearn_round_list(topic: str | None = None) -> list[RoundManifest]
async def fedlearn_round_join(round_id: RoundID, consent: bool) -> JoinReceipt
async def fedlearn_round_submit(submission: ParticipantSubmission) -> SubmitReceipt
async def fedlearn_round_status(round_id: RoundID) -> RoundStatus
async def fedlearn_round_finalize(round_id: RoundID) -> RoundResult # coordinator-only
async def fedlearn_adapter_fetch(sha: str) -> bytes
async def fedlearn_adapter_apply(sha: str, scope: Literal["session","node"]) -> ApplyReceipt
All capabilities are in the experimental.fedlearn.* namespace and only registered on the bus when experimental.fedlearn = true in the node config.
4.3 Coordinator class
class RoundCoordinator:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 llm: LLMService,
                 fedlearn_config: FedLearnConfig): ...

    async def announce_round(self, draft: RoundManifestDraft) -> RoundID: ...
    async def collect_submissions(self, round_id: RoundID) -> list[ParticipantSubmission]: ...
    async def aggregate(self, round_id: RoundID) -> bytes: ...
    async def finalize_and_publish(self, round_id: RoundID) -> RoundResult: ...

    # internal
    async def _validate_submission(self, sub: ParticipantSubmission, manifest: RoundManifest) -> None: ...
    async def _emit(self, evt: Event) -> None: ...
4.4 Participant class
class RoundParticipant:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 llm: LLMService,
                 data_provider: TrainingDataProvider,
                 fedlearn_config: FedLearnConfig): ...

    async def consider_round(self, manifest: RoundManifest) -> Decision: ...
    async def train(self, manifest: RoundManifest) -> ParticipantSubmission: ...
    async def submit(self, submission: ParticipantSubmission) -> SubmitReceipt: ...
    async def apply_aggregated(self, result: RoundResult, scope: Literal["session","node"]) -> ApplyReceipt: ...
4.5 Aggregator
class FedAvgAggregator:
    def __init__(self, manifest: RoundManifest): ...
    def add(self, submission: ParticipantSubmission, delta: dict[str, Tensor]) -> None: ...
    def aggregate(self) -> dict[str, Tensor]: ...  # weighted by num_samples

class SecureFedAvgAggregator(FedAvgAggregator):
    """Optional: pairwise masking so the aggregator sees only the sum, never individual deltas."""
    def __init__(self, manifest: RoundManifest, mask_scheme: Literal["additive_pairwise"] = "additive_pairwise"): ...
4.6 Privacy helpers
def clip_gradient(state_dict: dict[str, Tensor], max_norm: float) -> dict[str, Tensor]
def add_dp_noise(state_dict: dict[str, Tensor], scale: float, rng: Generator) -> dict[str, Tensor]
def epsilon_estimate(scale: float, clip: float, n_steps: int, batch: int, dataset_size: int) -> float
5. Behaviour
5.1 Round lifecycle
ANNOUNCED ──join──▶ JOINED ──train──▶ TRAINED ──submit──▶ SUBMITTED ──┐
    │                                                                 │
    │                              ┌─────────── aggregate ────────────┘
    │                              ▼
    ├────deadline reached────▶ AGGREGATING ──finalize──▶ COMPLETED
    │
    └──min_participants not met──▶ ABORTED
State transitions are recorded as events (fedlearn.round.*) on the coordinator's event log. Participants see their own state mirrored via subscription.
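The lifecycle can be sketched as a small transition table. This is one reading of the diagram, not the module's actual implementation: the names RoundState, ALLOWED, and transition are hypothetical, and the assumption that ABORTED is reached directly from ANNOUNCED when the deadline passes without enough submissions is an interpretation.

```python
from enum import Enum


class RoundState(Enum):
    ANNOUNCED = "announced"
    JOINED = "joined"
    TRAINED = "trained"
    SUBMITTED = "submitted"
    AGGREGATING = "aggregating"
    COMPLETED = "completed"
    ABORTED = "aborted"


# Allowed moves, read off the lifecycle diagram (assumed interpretation).
ALLOWED = {
    RoundState.ANNOUNCED: {
        RoundState.JOINED,       # a node joins
        RoundState.AGGREGATING,  # deadline reached
        RoundState.ABORTED,      # min_participants not met
    },
    RoundState.JOINED: {RoundState.TRAINED},
    RoundState.TRAINED: {RoundState.SUBMITTED},
    RoundState.SUBMITTED: {RoundState.AGGREGATING},
    RoundState.AGGREGATING: {RoundState.COMPLETED},
    RoundState.COMPLETED: set(),  # terminal
    RoundState.ABORTED: set(),    # terminal
}


def transition(current: RoundState, target: RoundState) -> RoundState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the table as data means the same structure can drive both validation and the events written to the log, whatever their final names turn out to be.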
5.2 Manifest signing
Manifest is canonicalised (JCS, like federation manifests in M14 §5.2), then signed Ed25519 by the coordinator's node key. Participants must verify the signature before training. A manifest with an invalid signature is dropped silently and logged as a security event (security.signature.invalid).
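A minimal sketch of the canonicalise-then-hash step, using only the standard library. It is an approximation under stated assumptions: json.dumps with sorted keys matches JCS (RFC 8785) only for ASCII keys and string, integer, boolean, or list values, so float fields such as learning_rate would need a real JCS encoder. The function names and the sample field values are hypothetical, and the Ed25519 signing step itself is left out.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Deterministic JSON: sorted keys, no insignificant whitespace.

    Assumption: a stand-in for JCS that is only faithful for ASCII keys and
    non-float values. The detached signature field must not be part of the
    input, because the signature is computed over these bytes.
    """
    return json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """SHA-256 hex digest of the canonical form."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# Two dicts with the same content but different insertion order.
# The round_id below is a made-up ULID-shaped placeholder.
a = {"round_id": "01HZX3V7K9Q2M4N6P8R0S2T4V6", "lora_rank": 8, "topic": "village-chat"}
b = {"topic": "village-chat", "lora_rank": 8, "round_id": "01HZX3V7K9Q2M4N6P8R0S2T4V6"}
same_digest = manifest_digest(a) == manifest_digest(b)
```

Because the digest depends only on content and not on key order, re-encoding a manifest does not change the bytes that the signature covers.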
5.3 Consent flow
When fedlearn.round.join is called, the participant module must:
- Check experimental.fedlearn is enabled in node config. If not → experimental_disabled.
- Display manifest.consent_text to the operator via the M11 Notifications path. The operator must explicitly accept. The acceptance is stored as a signed fedlearn.consent.granted event.
- Verify coordinator signature. If invalid → signature_invalid (we deliberately don't say whose signature; bystanders learn nothing useful).
- Check base_model_sha against the locally-installed base model. If mismatch → base_model_mismatch. Do not download a different base on demand; this is a hard error.
- Check resource budget: estimate VRAM and disk for the training run from lora_rank * len(target_modules) * hidden_size. If insufficient → insufficient_resources.
- If all checks pass → emit fedlearn.round.joined, return JoinReceipt.
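The ordering of the checks above can be sketched as a pure function that returns the first failing error code. JoinContext and check_join are hypothetical names, the inputs are assumed to be computed beforehand, and mapping a declined consent prompt to consent_required is an assumption based on the error table rather than something the flow states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JoinContext:
    """Hypothetical bundle of pre-computed inputs for the join checks."""
    flag_enabled: bool
    operator_accepted: bool
    signature_ok: bool
    manifest_base_sha: str
    local_base_sha: str
    estimated_vram_mb: int
    vram_budget_mb: int


def check_join(ctx: JoinContext) -> Optional[str]:
    """Run the checks in the documented order; return an error code or None."""
    if not ctx.flag_enabled:
        return "experimental_disabled"
    if not ctx.operator_accepted:
        return "consent_required"  # assumed mapping for a declined prompt
    if not ctx.signature_ok:
        return "signature_invalid"
    if ctx.manifest_base_sha != ctx.local_base_sha:
        return "base_model_mismatch"  # hard error, no on-demand download
    if ctx.estimated_vram_mb > ctx.vram_budget_mb:
        return "insufficient_resources"
    return None  # caller emits fedlearn.round.joined and returns a receipt
```

Returning a code instead of raising keeps the ordering easy to test; the real module may well raise typed exceptions instead.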
5.4 Local training
The trainer wraps M04's LLM handle in a HuggingFace peft.LoraConfig and uses bitsandbytes 4-bit base + fp16 LoRA matrices. Training data is provided by an injected TrainingDataProvider; the module never reaches into other modules' storage. Typical providers:
- ChatHistoryProvider (asks M10 for redacted, consented chat turns),
- KBProvider (asks M07 for documents tagged for training),
- CustomFileProvider (operator-curated training set).
After train_steps steps or convergence (loss plateau over a window), the trainer extracts the LoRA state-dict, applies optional gradient clipping and DP noise (if manifest.dp_noise_scale > 0), serialises, signs, and returns a ParticipantSubmission.
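The clipping and noise step can be illustrated without a tensor library. In this sketch plain lists of floats stand in for tensors, clipping is applied to the global L2 norm of the whole delta, and dp_noise_scale is treated as the standard deviation of Gaussian noise; both of those interpretations are assumptions, as are the function names.

```python
import math
import random

Delta = dict[str, list[float]]  # stand-in for dict[str, Tensor]


def l2_norm(delta: Delta) -> float:
    """Global L2 norm across every value in the delta."""
    return math.sqrt(sum(x * x for values in delta.values() for x in values))


def clip_delta(delta: Delta, max_norm: float) -> Delta:
    """Scale the whole delta down so its global L2 norm is at most max_norm."""
    norm = l2_norm(delta)
    scale = 1.0 if norm <= max_norm else max_norm / norm
    return {k: [x * scale for x in v] for k, v in delta.items()}


def add_gaussian_noise(delta: Delta, scale: float, rng: random.Random) -> Delta:
    """Add N(0, scale^2) noise per element; scale 0.0 returns an unchanged copy."""
    if scale == 0.0:
        return {k: list(v) for k, v in delta.items()}
    return {k: [x + rng.gauss(0.0, scale) for x in v] for k, v in delta.items()}
```

Clipping before adding noise bounds each participant's contribution, which is what makes a fixed noise scale meaningful at all.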
5.5 Aggregation
The default aggregator is weighted FedAvg: each adapter weight is weighted by num_samples and averaged across submissions. After aggregation, the coordinator emits fedlearn.round.aggregated and stores the aggregated delta via the capability bus (using the same content-addressed file path that M06 Files uses).
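As a worked illustration of the weighting: two submissions with 1 and 3 samples and values 0.0 and 4.0 average to (1·0.0 + 3·4.0) / 4 = 3.0. The sketch below uses lists of floats in place of tensors and a hypothetical fedavg function; it shows the arithmetic only, not the module's aggregator class.

```python
def fedavg(
    submissions: list[tuple[int, dict[str, list[float]]]],
) -> dict[str, list[float]]:
    """Average deltas weighted by each submission's num_samples.

    Each submission is (num_samples, delta). All deltas are assumed to share
    the same keys and lengths, which validation would enforce beforehand.
    """
    if not submissions:
        raise ValueError("no submissions")
    total = sum(n for n, _ in submissions)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    reference = submissions[0][1]
    result: dict[str, list[float]] = {}
    for key, values in reference.items():
        acc = [0.0] * len(values)
        for n, delta in submissions:
            weight = n / total
            for i, x in enumerate(delta[key]):
                acc[i] += weight * x
        result[key] = acc
    return result
```

Averaging three identical deltas returns the same delta up to floating-point rounding, which is the property the test plan relies on.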
If the round was declared with secure=true in the draft, SecureFedAvgAggregator is used: each participant pair establishes an additive mask, masks cancel in the sum, and the aggregator never sees individual deltas. This costs an extra round-trip between participants before submission (the mask exchange phase) and requires min_participants ≥ 3.
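A toy version of additive pairwise masking shows why the masks cancel. It makes several simplifying assumptions: updates are already quantised to integers, arithmetic is modulo 2**32, each pair already shares a seed, and nobody drops out. Python's random module is not cryptographically secure, so this only demonstrates the arithmetic; all names are hypothetical.

```python
import random

MODULUS = 2**32  # assumed working modulus for the masked arithmetic


def pair_mask(seed: int, length: int) -> list[int]:
    """Expand a shared pair seed into a mask (illustrative, not secure)."""
    rng = random.Random(seed)
    return [rng.randrange(MODULUS) for _ in range(length)]


def mask_update(
    index: int, update: list[int], pair_seeds: dict[tuple[int, int], int]
) -> list[int]:
    """Mask one participant's update.

    For each pair (i, j) with i < j, participant i adds the shared mask and
    participant j subtracts it, so every mask cancels in the overall sum.
    """
    masked = [x % MODULUS for x in update]
    for (i, j), seed in pair_seeds.items():
        if index not in (i, j):
            continue
        sign = 1 if index == i else -1
        mask = pair_mask(seed, len(update))
        masked = [(x + sign * m) % MODULUS for x, m in zip(masked, mask)]
    return masked


def aggregate_masked(masked_updates: list[list[int]]) -> list[int]:
    """Sum the masked updates element-wise; only the total is recoverable."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]
```

With only two participants each could subtract its own update from the sum and recover the other's, which is consistent with the requirement of at least three participants.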
5.6 Distribution
The aggregated adapter is published as a content-addressed file. Participants who joined the round get a fedlearn.round.completed event with the SHA. They can choose to:
- Session apply: load into a single LLM session via M04 (llm.session.apply_adapter),
- Node apply: install as the default adapter for the node (requires explicit operator action),
- Discard: do nothing.
Non-participants can also fetch and apply adapters they trust. There is no DRM and no whitelist: the aggregated delta is just a file with a SHA.
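Content addressing can be sketched with an in-memory store keyed by SHA-256. AdapterStore and AdapterNotFound are hypothetical names; the real module stores adapters through the capability bus, which this sketch does not model.

```python
import hashlib


class AdapterNotFound(KeyError):
    """Stands in for the adapter_not_found error code."""


class AdapterStore:
    """Toy content-addressed store: the key is the SHA-256 of the bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, blob: bytes) -> str:
        sha = hashlib.sha256(blob).hexdigest()
        self._blobs[sha] = blob
        return sha

    def fetch(self, sha: str) -> bytes:
        try:
            blob = self._blobs[sha]
        except KeyError:
            raise AdapterNotFound(sha) from None
        # Re-hash on read so a corrupted blob is never handed to the LLM.
        if hashlib.sha256(blob).hexdigest() != sha:
            raise ValueError("stored blob does not match its address")
        return blob
```

Since the address is derived from the content, anyone who fetches an adapter can check that they received exactly the bytes they asked for, which is what allows distribution without a whitelist.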
5.7 Failure modes
- Coordinator vanishes mid-round: participants wait until deadline, then any participant can call fedlearn.round.finalize_takeover(round_id) which constructs the aggregated delta from received submissions and re-publishes. The takeover is signed by the takeover-node and is visible as such.
- A participant submits garbage: validation in _validate_submission checks tensor shapes, dtypes, finiteness (no NaN/Inf), and that the delta is structurally a valid LoRA state-dict for the manifest's lora_target_modules. Garbage submissions are dropped and logged.
- Sybil flooding: all participants must be authenticated with M01 identity and the manifest can require a minimum reputation/trust score (this is open research; for v3.0 the field exists in the manifest but is not yet enforced).
- Adversarial gradient (poisoning): out of scope for v3.0; documented in Open Research Questions §10.
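The structural checks on a submission can be sketched as follows. Nested lists stand in for 2-D tensors, so the dtype check is omitted; the function name, the expected-shape mapping, and the tensor key names used in the test are all hypothetical.

```python
import math

Matrix = list[list[float]]  # stand-in for a 2-D tensor


def validate_delta(
    delta: dict[str, Matrix], expected: dict[str, tuple[int, int]]
) -> list[str]:
    """Return a list of problems; an empty list means the delta looks valid.

    expected maps each required tensor name to its (rows, cols) shape,
    which a real implementation would derive from the round manifest.
    """
    problems: list[str] = []
    for name in sorted(expected.keys() - delta.keys()):
        problems.append(f"missing tensor: {name}")
    for name in sorted(delta.keys() - expected.keys()):
        problems.append(f"unexpected tensor: {name}")
    for name in sorted(expected.keys() & delta.keys()):
        rows, cols = expected[name]
        matrix = delta[name]
        if len(matrix) != rows or any(len(row) != cols for row in matrix):
            problems.append(f"bad shape: {name}")
            continue
        if not all(math.isfinite(x) for row in matrix for x in row):
            problems.append(f"non-finite value: {name}")
    return problems
```

Any non-empty result would map to delta_invalid, with the individual messages going to the log rather than back to the submitter.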
6. Errors
| Code | When |
|---|---|
| experimental_disabled | Caller invokes a fedlearn capability with the flag off |
| signature_invalid | Manifest or submission signature does not verify |
| base_model_mismatch | Local base model SHA differs from manifest |
| insufficient_resources | Estimated VRAM/disk exceeds budget |
| consent_required | join() called without an explicit consent record |
| round_full | max_participants reached |
| round_closed | Submission after deadline |
| delta_invalid | Submitted state-dict fails structural validation |
| fedlearn_aggregation_failed | Aggregation produced NaN/Inf or insufficient submissions |
| fedlearn_min_participants_unmet | Round closes with fewer than min_participants valid submissions |
| fedlearn_aggregator_unreachable | finalize() called while coordinator is offline and takeover not triggered |
| adapter_not_found | fedlearn.adapter.fetch for an unknown SHA |
7. Configuration
@dataclass(frozen=True)
class FedLearnConfig:
    enabled: bool = False  # master switch; default off
    max_lora_rank: int = FEDLEARN_MAX_LORA_RANK  # 64
    max_lora_target_modules: int = FEDLEARN_MAX_LORA_TARGET_MODULES  # 8
    max_train_steps: int = FEDLEARN_MAX_TRAIN_STEPS  # 1000
    max_round_participants: int = FEDLEARN_MAX_PARTICIPANTS  # 32
    min_round_participants: int = FEDLEARN_MIN_PARTICIPANTS  # 3
    dp_noise_scale_default: float = FEDLEARN_DP_NOISE_SCALE_DEFAULT  # 0.0 (off)
    clip_norm_default: float = FEDLEARN_CLIP_NORM_DEFAULT  # 1.0
    submission_max_bytes: int = FEDLEARN_SUBMISSION_MAX_BYTES  # 64 MiB
    require_secure_aggregation: bool = False
    auto_apply_aggregated: bool = False  # never auto-apply by default
    training_vram_budget_mb: int = 8192
    training_disk_budget_mb: int = 4096
All FEDLEARN_* constants live in hearthnet/constants.py so a single source of truth governs both validation and documentation generation.
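One way these limits could be applied when a manifest is drafted is sketched below. Only two rules come straight from the spec (4 ≤ lora_rank ≤ max rank, and clip_norm > 0 when DP noise is on); the remaining bounds are assumptions about how the config caps would be enforced, and Limits and manifest_problems are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Subset of FedLearnConfig, using the default values from its comments."""
    max_lora_rank: int = 64
    max_lora_target_modules: int = 8
    max_train_steps: int = 1000
    max_round_participants: int = 32


def manifest_problems(
    *,
    lora_rank: int,
    n_target_modules: int,
    train_steps: int,
    min_participants: int,
    max_participants: int,
    dp_noise_scale: float,
    clip_norm: float,
    limits: Limits = Limits(),
) -> list[str]:
    """Return human-readable problems; an empty list means the draft passes."""
    problems: list[str] = []
    if not 4 <= lora_rank <= limits.max_lora_rank:
        problems.append("lora_rank out of range")
    if not 1 <= n_target_modules <= limits.max_lora_target_modules:
        problems.append("too many or too few target modules")  # assumed rule
    if not 1 <= train_steps <= limits.max_train_steps:
        problems.append("train_steps out of range")  # assumed rule
    if not 1 <= min_participants <= max_participants <= limits.max_round_participants:
        problems.append("participant bounds inconsistent")  # assumed rule
    if dp_noise_scale > 0.0 and clip_norm <= 0.0:
        problems.append("clip_norm must be > 0 when DP noise is on")
    return problems
```

Reading every cap from one Limits object mirrors the single-source-of-truth intent behind hearthnet/constants.py.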
8. Tests
8.1 Unit
- test_manifest_canonicalisation_stable: re-encoding does not change SHA.
- test_manifest_signature_roundtrip.
- test_delta_serialisation_roundtrip: tensors preserve dtype and shape.
- test_fedavg_weighted_arithmetic: manually averaged deltas match aggregator output to within fp16 noise.
- test_dp_noise_zero_is_identity: add_dp_noise(d, scale=0.0) is a no-op.
- test_clip_gradient_norm: post-clip norm ≤ max_norm.
- test_secure_aggregation_masks_cancel: sum of masks across all pairs is zero.
8.2 Property
- Across random shapes, fedavg([d, d, d]) == d.
- Across random submissions, fedavg(submissions) is finite when all inputs are finite.
8.3 Integration
- Two-node loopback round on a 0.5B base model: announce → join → train (synthetic data, 10 steps) → submit → aggregate → apply. Aggregated adapter must be loadable and must not blow up perplexity by more than 2x on a held-out set (sanity, not quality).
- Coordinator-failure round: simulate coordinator going offline after submissions received; takeover by another participant produces an aggregated delta with the same SHA.
- Sybil-defence stub: round with min_participants=3 and only 2 valid submissions aborts with fedlearn_min_participants_unmet.
8.4 Negative
- Wrong base SHA → base_model_mismatch.
- Submission with NaN in one tensor → delta_invalid.
- Submission missing one of the target modules → delta_invalid.
- Manifest signed by an untrusted identity → signature_invalid.
- Disabled flag → experimental_disabled even for read-only queries.
9. Cross-references
- Phase 1 M04 LLM: provides the local model handle, exposes llm.session.apply_adapter and llm.adapter.list.
- Phase 1 M07 Knowledge Base: KBProvider reads tagged documents for training.
- Phase 2 M14 Federation: federated rounds across communities use the federation transport for manifest distribution and submission. Cross-community rounds require both communities' DPOs to sign the round consent.
- Phase 2 M16 Tokens: round participation tokens (fedlearn-participant scope) are issued by the coordinator and bound to a single round.
- Phase 2 M25 Group Chat: village-chat rounds typically draw training data from group chat history (consented turns only).
- Phase 3 M30 Evidence/EBKH: aggregated adapters can be tracked as claims in the evidence graph; "adapter X improved perplexity on held-out set Y" is a claim.assert.
10. Open research questions
- Gradient poisoning defence. Coordinated malicious participants can submit deltas that, when aggregated, degrade or backdoor the adapter. Robust aggregation rules (Krum, trimmed mean) are a partial defence; an authenticated-data attestation (per-submission proof that gradients were computed on real, non-cherry-picked data) is the harder question. v3.0 ships FedAvg only; v3.1 may add Krum behind a flag.
- Heterogeneous base models. Today, every participant in a round must run the same base model at the same quantisation. Cross-base aggregation (e.g., projecting LoRA from Qwen-3B-Q4 to Qwen-3B-Q5 or even Qwen-3B → Qwen-7B) is open. The naive approach (re-projecting via a translation matrix learnt from a calibration set) loses accuracy quickly.
- Adaptive DP-noise. Fixed dp_noise_scale is crude. Per-round noise calibration as a function of min_participants and lora_rank would tighten the privacy/utility tradeoff. Out of scope for v3.0.
- Reputation-weighted FedAvg. Weighting submissions by num_samples * trust_score instead of num_samples alone. Requires a credible trust signal, which the broader HearthNet design has not yet committed to.
- Continual rounds. Today each round produces a stand-alone adapter. Stacking rounds (round N tunes on top of round N-1's aggregate) raises questions about drift, fairness, and rollback. Probably belongs in a future M28b.
- Cross-task adapters. A niederrhein-emergency adapter and a village-chat adapter are trained separately. Whether they can be cleanly combined at inference time (LoRA composition) is a known-hard problem and explicitly not promised here.
- Hardware-class fairness. A round set up by a node with an RTX 5090 might exclude phone-class participants by setting train_steps too high. A "ranked tier" with separate aggregations per tier is one possibility. Currently the manifest is a single-tier flat artefact.
- Audit of training data. Even though raw data never leaves the node, the fact that training happened on consented data is currently un-auditable from the outside. A future zero-knowledge attestation of "this delta was computed on N samples each tagged training=true" would be useful. Out of scope.
Last updated: spec v3.0.