Chris4K

p2, p3

70650b7 19 days ago

M28 — Federated Learning (LoRA Aggregation)

Spec version: v3.0 (experimental)
Depends on: M03 Capability Bus, M04 LLM, M14 Federation, X02 Event Log, M16 Tokens, X06 WebSocket
Depended on by: nothing in MVP (opt-in research feature)

1. Responsibility

Federated learning of small LoRA adapters on top of a shared base model. Each node trains locally on its own data, sends only the adapter weight deltas (not raw data, not full weights) to an aggregator, and receives back an averaged adapter that subsequent nodes can use or further refine.

The bet: a 3B-parameter base model with a community-tuned LoRA adapter ("how people in our village actually phrase things, what jargon our Feuerwehr uses, what the local agricultural calendar looks like") is more useful for the community than a generic 3B model, and we can do this without any node ever shipping its private data off-box.

This module deliberately stays at LoRA scope only. Full fine-tunes, distillation, and continual pre-training are explicitly out — both because they are bandwidth-hostile and because the privacy story for full-weight federation is significantly harder.

2. Non-goals

Federating raw data. Never. Training data stays on the node that owns it.
Full fine-tunes. LoRA only. If a use case truly needs more, that's a different research project.
Cross-base-model aggregation. All participants in a round must run the same base model at the same quantisation. Heterogeneous aggregation is open research.
Mandatory participation. Every node decides per-round whether to join. There is no "you must contribute back" rule.
Aggregator centralisation. Any node can host an aggregator. There is no privileged aggregator role.
Hiding participation. Whether you joined a round is visible to other participants in that round; only your data and your gradients are private.

3. File layout

hearthnet/fedlearn/
├── __init__.py
├── coordinator.py        # Orchestrates a round: announce, gather, aggregate, distribute
├── participant.py        # Local-side: respond to round announcements, train, submit
├── trainer.py            # Wraps M04 LLM in a LoRA training loop (peft + bitsandbytes)
├── aggregator.py         # FedAvg with optional secure aggregation
├── delta.py              # Serialise/deserialise LoRA deltas (state-dict subset)
├── privacy.py            # Optional DP-noise injection and gradient clipping
└── manifest.py           # Round manifest: base model id, hyperparams, signature

4. Public API

4.1 Dataclasses

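from dataclasses import dataclass
from datetime import datetime
from typing import NewType

# NodeID is assumed to be provided by the M01 identity module.

RoundID = NewType("RoundID", str)        # ULID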

@dataclass(frozen=True)
class RoundManifest:
    round_id:        RoundID
    coordinator:     NodeID
    base_model_id:   str                  # exact model id from M04 ("qwen2.5:3b-instruct-q4_K_M")
    base_model_sha:  str                  # SHA-256 of base weights; mismatch = exclusion
    lora_target_modules: tuple[str, ...]  # which linear layers carry LoRA (e.g. "q_proj","v_proj")
    lora_rank:       int                  # 4 ≤ r ≤ FEDLEARN_MAX_LORA_RANK
    lora_alpha:      int
    lora_dropout:    float
    train_steps:     int                  # max local SGD steps per participant
    learning_rate:   float
    batch_size:      int
    seed:            int                  # for deterministic init of LoRA matrices
    dp_noise_scale:  float                # 0.0 = off
    clip_norm:       float                # gradient clip; must be > 0 if DP on
    min_participants: int                 # round aborts if fewer participants submit
    max_participants: int
    deadline:        datetime             # UTC; submissions after this dropped
    topic:           str                  # free-form: "niederrhein-emergency", "village-chat"
    consent_text:    str                  # human-readable; participant must accept
    coordinator_sig: bytes                # detached Ed25519 over the manifest

@dataclass
class ParticipantSubmission:
    round_id:        RoundID
    participant:     NodeID
    delta_bytes:     bytes                # serialised LoRA state-dict
    delta_sha:       str
    num_samples:     int                  # for weighted FedAvg
    train_loss:      float                # for telemetry only
    submitted_at:    datetime
    signature:       bytes                # Ed25519 over (round_id, participant, delta_sha, num_samples)

@dataclass
class RoundResult:
    round_id:        RoundID
    aggregated_delta_sha: str
    n_participants:  int
    total_samples:   int
    aggregator:      NodeID
    completed_at:    datetime
    manifest_sha:    str
    download_url:    str                  # capability bus uri for fetching the aggregated delta

4.2 Capabilities

async def fedlearn_round_announce(manifest: RoundManifest) -> RoundID
async def fedlearn_round_list(topic: str | None = None) -> list[RoundManifest]
async def fedlearn_round_join(round_id: RoundID, consent: bool) -> JoinReceipt
async def fedlearn_round_submit(submission: ParticipantSubmission) -> SubmitReceipt
async def fedlearn_round_status(round_id: RoundID) -> RoundStatus
async def fedlearn_round_finalize(round_id: RoundID) -> RoundResult     # coordinator-only
async def fedlearn_adapter_fetch(sha: str) -> bytes
async def fedlearn_adapter_apply(sha: str, scope: Literal["session","node"]) -> ApplyReceipt

All capabilities live in the experimental.fedlearn.* namespace and are registered on the bus only when experimental.fedlearn = true in the node config.

4.3 Coordinator class

class RoundCoordinator:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 llm: LLMService,
                 fedlearn_config: FedLearnConfig): ...

    async def announce_round(self, draft: RoundManifestDraft) -> RoundID: ...
    async def collect_submissions(self, round_id: RoundID) -> list[ParticipantSubmission]: ...
    async def aggregate(self, round_id: RoundID) -> bytes: ...
    async def finalize_and_publish(self, round_id: RoundID) -> RoundResult: ...

    # internal
    async def _validate_submission(self, sub: ParticipantSubmission, manifest: RoundManifest) -> None: ...
    async def _emit(self, evt: Event) -> None: ...

4.4 Participant class

class RoundParticipant:
    def __init__(self,
                 bus: CapabilityBus,
                 event_log: EventLog,
                 llm: LLMService,
                 data_provider: TrainingDataProvider,
                 fedlearn_config: FedLearnConfig): ...

    async def consider_round(self, manifest: RoundManifest) -> Decision: ...
    async def train(self, manifest: RoundManifest) -> ParticipantSubmission: ...
    async def submit(self, submission: ParticipantSubmission) -> SubmitReceipt: ...
    async def apply_aggregated(self, result: RoundResult, scope: Literal["session","node"]) -> ApplyReceipt: ...

4.5 Aggregator

class FedAvgAggregator:
    def __init__(self, manifest: RoundManifest): ...

    def add(self, submission: ParticipantSubmission, delta: dict[str, Tensor]) -> None: ...
    def aggregate(self) -> dict[str, Tensor]: ...      # weighted by num_samples

class SecureFedAvgAggregator(FedAvgAggregator):
    """Optional: pairwise masking so the aggregator sees only the sum, never individual deltas."""
    def __init__(self, manifest: RoundManifest, mask_scheme: Literal["additive_pairwise"] = "additive_pairwise"): ...

4.6 Privacy helpers

def clip_gradient(state_dict: dict[str, Tensor], max_norm: float) -> dict[str, Tensor]
def add_dp_noise(state_dict: dict[str, Tensor], scale: float, rng: Generator) -> dict[str, Tensor]
def epsilon_estimate(scale: float, clip: float, n_steps: int, batch: int, dataset_size: int) -> float

5. Behaviour

5.1 Round lifecycle

ANNOUNCED ──join──▶ JOINED ──train──▶ TRAINED ──submit──▶ SUBMITTED ──┐
   │                                                                  │
   │                              ┌──────────── aggregate ◀───────────┘
   │                              ▼
   └────deadline reached────▶ AGGREGATING ──finalize──▶ COMPLETED
                                                  │
                                                  └──min_participants not met──▶ ABORTED

State transitions are recorded as events (fedlearn.round.*) on the coordinator's event log. Participants see their own state mirrored via subscription.
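The diagram can be read as a transition table. The following is a minimal sketch, not the module's actual implementation: the `RoundState` enum and the `advance` helper are hypothetical names introduced here, and the allowed transitions are taken directly from the arrows above.

```python
from enum import Enum


class RoundState(Enum):
    ANNOUNCED = "announced"
    JOINED = "joined"
    TRAINED = "trained"
    SUBMITTED = "submitted"
    AGGREGATING = "aggregating"
    COMPLETED = "completed"
    ABORTED = "aborted"


# Allowed transitions, read off the lifecycle diagram.
TRANSITIONS: dict[RoundState, frozenset[RoundState]] = {
    # join, or deadline reached without this node joining
    RoundState.ANNOUNCED: frozenset({RoundState.JOINED, RoundState.AGGREGATING}),
    RoundState.JOINED: frozenset({RoundState.TRAINED}),
    RoundState.TRAINED: frozenset({RoundState.SUBMITTED}),
    RoundState.SUBMITTED: frozenset({RoundState.AGGREGATING}),
    # finalize, or min_participants not met
    RoundState.AGGREGATING: frozenset({RoundState.COMPLETED, RoundState.ABORTED}),
    RoundState.COMPLETED: frozenset(),
    RoundState.ABORTED: frozenset(),
}


def advance(current: RoundState, target: RoundState) -> RoundState:
    """Return target if the diagram allows current -> target, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the table as data makes it straightforward to emit one fedlearn.round.* event per successful call to `advance`.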

5.2 Manifest signing

The manifest is canonicalised (JCS, as for federation manifests in M14 §5.2), then signed with Ed25519 using the coordinator's node key. Participants must verify the signature before training. A manifest with an invalid signature is dropped without notifying the sender and logged as a security event (security.signature.invalid).
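To illustrate why canonicalisation matters for signing, here is a minimal sketch that produces a deterministic byte encoding and a SHA-256 digest of a manifest-like dict. It only approximates JCS (RFC 8785): `json.dumps` with sorted keys matches JCS for ASCII keys and string or integer values, but not necessarily for floats such as learning_rate. The helper names and the sample values (including the shortened round id) are hypothetical, and the Ed25519 step is omitted; in practice the signature would be computed over these canonical bytes.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Deterministic JSON encoding: sorted keys, no insignificant whitespace.

    Approximates JCS for manifests holding strings, integers, booleans,
    lists, and nested objects. Full JCS also pins down number formatting.
    """
    return json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def manifest_sha(manifest: dict) -> str:
    """SHA-256 hex digest of the canonical encoding."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# Hypothetical manifest fragments; field names follow RoundManifest.
a = {"round_id": "01HZX", "lora_rank": 8, "topic": "village-chat"}
b = {"topic": "village-chat", "round_id": "01HZX", "lora_rank": 8}

# Key order in the source dict does not affect the digest.
same_digest = manifest_sha(a) == manifest_sha(b)
```

This is the property that test_manifest_canonicalisation_stable checks: re-encoding the same logical manifest must not change the SHA.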

5.3 Consent flow

When fedlearn.round.join is called, the participant module must:

Check experimental.fedlearn is enabled in node config. If not → experimental_disabled.
Display manifest.consent_text to the operator via the M11 Notifications path. The operator must explicitly accept. The acceptance is stored as a signed fedlearn.consent.granted event.
Verify coordinator signature. If invalid → signature_invalid (we deliberately don't say whose signature; bystanders learn nothing useful).
Check base_model_sha against the locally-installed base model. If mismatch → base_model_mismatch. Do not download a different base on demand; this is a hard error.
Check resource budget: estimate the VRAM and disk needed for the training run from lora_rank * len(lora_target_modules) * hidden_size and compare the result against training_vram_budget_mb and training_disk_budget_mb. If insufficient → insufficient_resources.
If all checks pass → emit fedlearn.round.joined, return JoinReceipt.
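The resource check scales with rank, target-module count, and hidden size. The sketch below shows one way to turn those inputs into an adapter parameter count and an fp16 byte size. It rests on assumptions that are not in the spec: every target module is a square projection (d_in == d_out == hidden_size), there is one adapter pair per target module per layer, and `n_layers` is an extra input. It covers only the adapter itself; base-model weights, activations, and optimiser state would need a separate estimate.

```python
def lora_param_count(rank: int, n_target_modules: int,
                     hidden_size: int, n_layers: int) -> int:
    """Number of trainable LoRA parameters under the square-projection assumption."""
    # Each adapted linear layer gets A (rank x d_in) and B (d_out x rank).
    per_module = rank * (hidden_size + hidden_size)
    return per_module * n_target_modules * n_layers


def lora_fp16_bytes(rank: int, n_target_modules: int,
                    hidden_size: int, n_layers: int) -> int:
    """Size of the adapter when stored as fp16 (2 bytes per parameter)."""
    return 2 * lora_param_count(rank, n_target_modules, hidden_size, n_layers)


# Hypothetical dimensions, chosen only for illustration.
params = lora_param_count(rank=8, n_target_modules=2, hidden_size=2048, n_layers=36)
size_mib = lora_fp16_bytes(8, 2, 2048, 36) / (1024 * 1024)
# params == 2_359_296, size_mib == 4.5
```

Under these assumptions a rank-8 adapter on two modules sits well below the 64 MiB submission_max_bytes default.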

5.4 Local training

The trainer wraps M04's LLM handle in a HuggingFace peft.LoraConfig and uses bitsandbytes 4-bit base + fp16 LoRA matrices. Training data is provided by an injected TrainingDataProvider — the module never reaches into other modules' storage. Typical providers:

ChatHistoryProvider (asks M10 for redacted, consented chat turns),
KBProvider (asks M07 for documents tagged for training),
CustomFileProvider (operator-curated training set).

After train_steps steps or convergence (loss plateau over a window), the trainer extracts the LoRA state-dict, applies optional gradient clipping and DP noise (if manifest.dp_noise_scale > 0), serialises, signs, and returns a ParticipantSubmission.
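The clipping and noise step can be sketched as follows. This is an illustration rather than the contents of privacy.py: plain lists of floats stand in for tensors so the example has no dependencies, the function names are hypothetical, and it assumes that dp_noise_scale is used directly as the standard deviation of Gaussian noise. The spec does not say how the scale maps to a noise distribution.

```python
import math
import random

Delta = dict[str, list[float]]  # stand-in for dict[str, Tensor]


def global_norm(delta: Delta) -> float:
    """L2 norm over all elements of all tensors in the delta."""
    return math.sqrt(sum(v * v for values in delta.values() for v in values))


def clip_delta(delta: Delta, max_norm: float) -> Delta:
    """Scale the whole delta so its global L2 norm is at most max_norm."""
    if max_norm <= 0:
        raise ValueError("max_norm must be positive")
    norm = global_norm(delta)
    scale = 1.0 if norm <= max_norm else max_norm / norm
    return {name: [v * scale for v in values] for name, values in delta.items()}


def add_gaussian_noise(delta: Delta, scale: float, rng: random.Random) -> Delta:
    """Add N(0, scale^2) noise per element; scale == 0.0 returns an unchanged copy."""
    if scale == 0.0:
        return {name: list(values) for name, values in delta.items()}
    return {
        name: [v + rng.gauss(0.0, scale) for v in values]
        for name, values in delta.items()
    }


clipped = clip_delta({"q_proj.lora_A": [3.0, 4.0]}, max_norm=1.0)
noisy = add_gaussian_noise(clipped, scale=0.1, rng=random.Random(42))
```

Clipping before adding noise bounds each participant's contribution, which is what makes a fixed noise scale meaningful.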

5.5 Aggregation

The default aggregator is weighted FedAvg: each submission's adapter tensors are scaled by num_samples / total_samples and summed across submissions. After aggregation, the coordinator emits fedlearn.round.aggregated and stores the aggregated delta via the capability bus (using the same content-addressed file path that M06 Files uses).
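A minimal sketch of the weighted average, using flat lists of floats in place of tensors so it runs without any dependencies. The `fedavg` function and its tuple-based input are hypothetical stand-ins for FedAvgAggregator.add and FedAvgAggregator.aggregate, not their actual implementation.

```python
Delta = dict[str, list[float]]  # stand-in for dict[str, Tensor]


def fedavg(submissions: list[tuple[int, Delta]]) -> Delta:
    """Average deltas, weighting each by its num_samples."""
    if not submissions:
        raise ValueError("no submissions to aggregate")
    total_samples = sum(num_samples for num_samples, _ in submissions)
    if total_samples <= 0:
        raise ValueError("total num_samples must be positive")

    reference = submissions[0][1]
    aggregated: Delta = {
        name: [0.0] * len(values) for name, values in reference.items()
    }
    for num_samples, delta in submissions:
        if delta.keys() != reference.keys():
            raise ValueError("submissions disagree on tensor names")
        weight = num_samples / total_samples
        for name, values in delta.items():
            if len(values) != len(aggregated[name]):
                raise ValueError(f"length mismatch for {name}")
            for i, value in enumerate(values):
                aggregated[name][i] += weight * value
    return aggregated


# One participant with 1 sample, one with 3: weights are 0.25 and 0.75.
result = fedavg([
    (1, {"w": [0.0, 2.0]}),
    (3, {"w": [4.0, 2.0]}),
])
# result == {"w": [3.0, 2.0]}
```

Averaging identical deltas returns the same delta up to floating-point rounding, which is the property listed in §8.2.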

If the round was declared with secure=true in the draft, SecureFedAvgAggregator is used: each participant pair establishes an additive mask, masks cancel in the sum, and the aggregator never sees individual deltas. This costs an extra round-trip between participants before submission (the mask exchange phase) and requires min_participants ≥ 3.
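The cancellation property of pairwise additive masks can be shown with a toy example. This sketch makes several simplifying assumptions that are not in the spec: updates are already quantised to integers, arithmetic is modulo 2**32, every pair already shares a seed (a real protocol would derive it through key agreement during the mask exchange phase), and masks come from `random.Random`, which is not cryptographically secure. All names are hypothetical.

```python
import random

MODULUS = 2**32


def pair_mask(seed: int, length: int) -> list[int]:
    """Expand a shared pair seed into a mask vector (illustration only)."""
    rng = random.Random(seed)
    return [rng.randrange(MODULUS) for _ in range(length)]


def mask_update(index: int, update: list[int],
                pair_seeds: dict[tuple[int, int], int]) -> list[int]:
    """Add the pair mask when we are the lower index of a pair, subtract otherwise."""
    masked = list(update)
    for (i, j), seed in pair_seeds.items():
        if index not in (i, j):
            continue
        sign = 1 if index == i else -1
        mask = pair_mask(seed, len(update))
        masked = [(m + sign * k) % MODULUS for m, k in zip(masked, mask)]
    return masked


def unmasked_sum(masked_updates: list[list[int]]) -> list[int]:
    """Element-wise sum; every pair mask appears once with + and once with -."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]


updates = [[1, 2], [10, 20], [100, 200]]
# One seed per pair (i, j) with i < j.
seeds = {(0, 1): 11, (0, 2): 22, (1, 2): 33}
masked = [mask_update(i, u, seeds) for i, u in enumerate(updates)]
total = unmasked_sum(masked)
# total == [111, 222]
```

With only two participants the sum would let each one recover the other's update by subtracting its own, which is consistent with the min_participants ≥ 3 requirement.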

5.6 Distribution

The aggregated adapter is published as a content-addressed file. Participants who joined the round get a fedlearn.round.completed event with the SHA. They can choose to:

Session apply — load into a single LLM session via M04 (llm.session.apply_adapter),
Node apply — install as the default adapter for the node (requires explicit operator action),
Discard — do nothing.

Non-participants can also fetch and apply adapters they trust. There is no DRM and no whitelist: the aggregated delta is just a file with a SHA.
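Because the adapter is addressed by its hash, a fetcher can check integrity before applying it. A minimal sketch, assuming the SHA is a lowercase or uppercase hex SHA-256 digest of the serialised adapter bytes (the spec names SHA-256 for base weights but does not state the algorithm for adapters); `verify_adapter` is a hypothetical helper.

```python
import hashlib
import hmac


def verify_adapter(blob: bytes, expected_sha: str) -> bytes:
    """Return blob unchanged if its SHA-256 matches expected_sha, else raise."""
    actual = hashlib.sha256(blob).hexdigest()
    if not hmac.compare_digest(actual, expected_sha.lower()):
        raise ValueError(f"adapter hash mismatch: expected {expected_sha}, got {actual}")
    return blob


# Placeholder bytes standing in for a serialised adapter.
blob = b"serialised-lora-adapter"
sha = hashlib.sha256(blob).hexdigest()
checked = verify_adapter(blob, sha)
```

The hash check establishes only that the file is the one named by the SHA; whether to trust that adapter remains the operator's decision.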

5.7 Failure modes

Coordinator vanishes mid-round: participants wait until deadline, then any participant can call fedlearn.round.finalize_takeover(round_id) which constructs the aggregated delta from received submissions and re-publishes. The takeover is signed by the takeover-node and is visible as such.
A participant submits garbage: validation in _validate_submission checks tensor shapes, dtypes, finite-ness (no NaN/Inf), and that the delta is structurally a valid LoRA state-dict for the manifest's lora_target_modules. Garbage submissions are dropped and logged.
Sybil flooding: all participants must be authenticated with M01 identity and the manifest can require a minimum reputation/trust score (this is open research — for v3.0 the field exists in the manifest but is not yet enforced).
Adversarial gradient (poisoning): out of scope for v3.0; documented in Open Research Questions §10.
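The structural checks for garbage submissions can be sketched as below. This is not the body of _validate_submission: tensors are represented as a (shape, flat values) pair so the example has no dependencies, the key names such as "q_proj.lora_A" are hypothetical, and the dtype check is omitted because plain Python floats carry no dtype.

```python
import math

FlatTensor = tuple[tuple[int, ...], list[float]]  # (shape, row-major values)


class DeltaInvalid(ValueError):
    """Corresponds to the delta_invalid error code."""


def validate_delta(delta: dict[str, FlatTensor],
                   expected_shapes: dict[str, tuple[int, ...]]) -> None:
    """Raise DeltaInvalid unless delta has exactly the expected finite tensors."""
    missing = expected_shapes.keys() - delta.keys()
    extra = delta.keys() - expected_shapes.keys()
    if missing or extra:
        raise DeltaInvalid(
            f"key mismatch: missing={sorted(missing)} extra={sorted(extra)}"
        )
    for name, (shape, values) in delta.items():
        if shape != expected_shapes[name]:
            raise DeltaInvalid(f"{name}: shape {shape} != {expected_shapes[name]}")
        if len(values) != math.prod(shape):
            raise DeltaInvalid(f"{name}: value count does not match shape")
        if not all(math.isfinite(v) for v in values):
            raise DeltaInvalid(f"{name}: contains NaN or Inf")


# Expected layout for a rank-2 adapter on a 3-wide layer (illustrative sizes).
expected = {"q_proj.lora_A": (2, 3), "q_proj.lora_B": (3, 2)}
good = {
    "q_proj.lora_A": ((2, 3), [0.1] * 6),
    "q_proj.lora_B": ((3, 2), [0.0] * 6),
}
validate_delta(good, expected)  # passes silently
```

A submission that fails any of these checks would be dropped and logged, as described above.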

6. Errors

| Code | When |
|---|---|
| `experimental_disabled` | Caller invokes a fedlearn capability with the flag off |
| `signature_invalid` | Manifest or submission signature does not verify |
| `base_model_mismatch` | Local base model SHA differs from manifest |
| `insufficient_resources` | Estimated VRAM/disk exceeds budget |
| `consent_required` | `join()` called without an explicit consent record |
| `round_full` | `max_participants` reached |
| `round_closed` | Submission after deadline |
| `delta_invalid` | Submitted state-dict fails structural validation |
| `fedlearn_aggregation_failed` | Aggregation produced NaN/Inf or insufficient submissions |
| `fedlearn_min_participants_unmet` | Round closes with fewer than `min_participants` valid submissions |
| `fedlearn_aggregator_unreachable` | `finalize()` called while coordinator is offline and takeover not triggered |
| `adapter_not_found` | `fedlearn.adapter.fetch` for an unknown SHA |

7. Configuration

@dataclass(frozen=True)
class FedLearnConfig:
    enabled:                   bool = False               # master switch; default off
    max_lora_rank:             int  = FEDLEARN_MAX_LORA_RANK              # 64
    max_lora_target_modules:   int  = FEDLEARN_MAX_LORA_TARGET_MODULES    # 8
    max_train_steps:           int  = FEDLEARN_MAX_TRAIN_STEPS            # 1000
    max_round_participants:    int  = FEDLEARN_MAX_PARTICIPANTS           # 32
    min_round_participants:    int  = FEDLEARN_MIN_PARTICIPANTS           # 3
    dp_noise_scale_default:    float = FEDLEARN_DP_NOISE_SCALE_DEFAULT    # 0.0 (off)
    clip_norm_default:         float = FEDLEARN_CLIP_NORM_DEFAULT         # 1.0
    submission_max_bytes:      int  = FEDLEARN_SUBMISSION_MAX_BYTES       # 64 MiB
    require_secure_aggregation: bool = False
    auto_apply_aggregated:     bool = False               # never auto-apply by default
    training_vram_budget_mb:   int  = 8192
    training_disk_budget_mb:   int  = 4096

All FEDLEARN_* constants live in hearthnet/constants.py so a single source of truth governs both validation and documentation generation.

8. Tests

8.1 Unit

test_manifest_canonicalisation_stable — re-encoding does not change SHA.
test_manifest_signature_roundtrip.
test_delta_serialisation_roundtrip — tensors preserve dtype and shape.
test_fedavg_weighted_arithmetic — manually averaged deltas match aggregator output to within fp16 noise.
test_dp_noise_zero_is_identity — add_dp_noise(d, scale=0.0) is a no-op.
test_clip_gradient_norm — post-clip norm ≤ max_norm.
test_secure_aggregation_masks_cancel — sum of masks across all pairs is zero.

8.2 Property

Across random shapes, fedavg([d, d, d]) == d.
Across random submissions, fedavg(submissions) is finite when all inputs are finite.

8.3 Integration

Two-node loopback round on a 0.5B base model: announce → join → train (synthetic data, 10 steps) → submit → aggregate → apply. The aggregated adapter must be loadable and must not more than double perplexity on a held-out set (a sanity check, not a quality bar).
Coordinator-failure round: simulate coordinator going offline after submissions received; takeover by another participant produces an aggregated delta with the same SHA.
Sybil-defence stub: round with min_participants=3 and only 2 valid submissions aborts with fedlearn_min_participants_unmet.

8.4 Negative

Wrong base SHA → base_model_mismatch.
Submission with NaN in one tensor → delta_invalid.
Submission missing one of the target modules → delta_invalid.
Manifest signed by an untrusted identity → signature_invalid.
Disabled flag → experimental_disabled even for read-only queries.

9. Cross-references

Phase 1 M04 LLM — provides the local model handle, exposes llm.session.apply_adapter and llm.adapter.list.
Phase 1 M07 Knowledge Base — KBProvider reads tagged documents for training.
Phase 2 M14 Federation — federated rounds across communities use the federation transport for manifest distribution and submission. Cross-community rounds require both communities' DPOs to sign the round consent.
Phase 2 M16 Tokens — round participation tokens (fedlearn-participant scope) are issued by the coordinator and bound to a single round.
Phase 2 M25 Group Chat — village-chat rounds typically draw training data from group chat history (consented turns only).
Phase 3 M30 Evidence/EBKH — aggregated adapters can be tracked as claims in the evidence graph; "adapter X improved perplexity on held-out set Y" is a claim.assert.

10. Open research questions

Gradient poisoning defence. Coordinated malicious participants can submit deltas that, when aggregated, degrade or backdoor the adapter. Byzantine-robust aggregation (Krum, trimmed mean) is a partial defence; an authenticated-data attestation (per-submission proof that gradients were computed on real, non-cherry-picked data) is the harder question. v3.0 ships FedAvg only; v3.1 may add Krum behind a flag.
Heterogeneous base models. Today, every participant in a round must run the same base model at the same quantisation. Cross-base aggregation (e.g., projecting LoRA from Qwen-3B-Q4 to Qwen-3B-Q5 or even Qwen-3B → Qwen-7B) is open. The naive approach (re-projecting via a translation matrix learnt from a calibration set) loses accuracy quickly.
Adaptive DP-noise. Fixed dp_noise_scale is crude. Per-round noise calibration as a function of min_participants and lora_rank would tighten the privacy/utility tradeoff. Out of scope for v3.0.
Reputation-weighted FedAvg. Weighting submissions by num_samples * trust_score instead of num_samples alone. Requires a credible trust signal, which the broader HearthNet design has not yet committed to.
Continual rounds. Today each round produces a stand-alone adapter. Stacking rounds (round N tunes on top of round N-1's aggregate) raises questions about drift, fairness, and rollback. Probably belongs in a future M28b.
Cross-task adapters. A niederrhein-emergency adapter and a village-chat adapter are trained separately. Whether they can be cleanly combined at inference time (LoRA composition) is a known-hard problem and explicitly not promised here.
Hardware-class fairness. A coordinator running an RTX 5090 might exclude phone-class participants by setting train_steps too high. A "ranked tier" with separate aggregations per tier is one possibility. Currently the manifest is a single-tier flat artefact.
Audit of training data. Even though raw data never leaves the node, the fact that training happened on consented data is currently un-auditable from the outside. A future zero-knowledge attestation of "this delta was computed on N samples each tagged training=true" would be useful. Out of scope.

Last updated: spec v3.0.