Upload folder using huggingface_hub

Files changed (3)

README.md +65 -0
data/sessions.jsonl +1 -0
digests/2026-05-12-kb-comparison-r0b0tlab.md +69 -0

README.md ADDED

	@@ -0,0 +1,65 @@

+# Hermes Session Digests
+Structured, post-hoc summaries of Hermes AI agent sessions. Each digest captures goals, context, actions, decisions, durable learnings, errors, and promotion targets from a single agent session.
+## Purpose
+- **Canonical knowledge base**: Digests are the searchable memory stream for the Hermes-powered knowledge management system
+- **Training data**: When paired with raw agent trajectories, digests serve as teacher signals for instruction tuning, agent trajectory learning, and RAG fine-tuning
+- **Auditability**: Human-readable, diffable, versioned record of what the agent did and why
+## Format
+Two formats are provided for each session:
+### Markdown (`digests/*.md`)
+Full human-readable digest with YAML frontmatter metadata. Structured sections: Goal, Context, Key Findings, Actions Taken, Decisions, Durable Learnings, Promotion Targets.
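The digest layout described above (frontmatter block, then Markdown sections) can be split with the standard library alone. This is a minimal sketch, assuming the frontmatter contains only flat `key: value` lines; `split_frontmatter` is a hypothetical helper, not part of this repository, and list values such as `tags` are returned as raw strings. A full YAML parser would be the better choice for anything beyond inspection.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a digest into (flat metadata dict, Markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing '---' delimiter after the opening one.
    end = next((i for i, l in enumerate(lines[1:], 1) if l.strip() == "---"), None)
    if end is None:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


sample = """---
title: Example digest
type: session-digest
status: active
model: deepseek-v4-pro
---
# Session Digest: Example
## Goal
Demonstrate the layout.
"""
meta, body = split_frontmatter(sample)
print(meta["model"], "|", body.splitlines()[0])
```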
+### JSON-Lines (`data/sessions.jsonl`)
+Machine-readable structured extraction suitable for training pipelines. One JSON object per line, one line per session. Fields include `session_id`, `timestamp`, and `model`, plus list-valued fields such as `decisions`, `learnings`, `actions_taken`, and `promotion_targets`.
+## Model Filtering
+All sessions are tagged with the model that generated them. For training data, filter by model so that a single training set does not mix the output styles of different models:
+```python
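+import json
+
+# Load one JSON object per non-empty line; the context manager closes the file.
+with open("data/sessions.jsonl", encoding="utf-8") as f:
+    sessions = [json.loads(line) for line in f if line.strip()]
+deepseek_sessions = [s for s in sessions if s["model"] == "deepseek-v4-pro"]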
+```
+## Privacy
+All digests undergo PII removal before publishing: local file paths are generalized, transient process IDs are stripped, and channel names are abstracted. No API keys, tokens, email addresses, or personal identifiers are included.
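The README does not give the actual cleanup rules, so the following is only a sketch of what the three named steps might look like. Every pattern and replacement string here is an assumption chosen for illustration, and `scrub` is a hypothetical helper.

```python
import re

# Hypothetical patterns; the real cleanup rules are not specified in the README.
HOME_PATH = re.compile(r"(?:/home|/Users)/[^/\s]+")     # user home directories
PID = re.compile(r"\b(?:pid|PID)[ =:]+\d+\b")           # "pid 4242", "PID=17"
CHANNEL = re.compile(r"#[a-z0-9][a-z0-9_-]*")           # "#kb-ops"


def scrub(text: str) -> str:
    """Generalize local paths, strip PIDs, and abstract channel names."""
    text = HOME_PATH.sub("~", text)
    text = PID.sub("pid <redacted>", text)
    text = CHANNEL.sub("#<channel>", text)
    return text


print(scrub("Wrote /home/alice/kb/index.md (pid 4242) after a request in #kb-ops"))
```

Regex cleanup of this kind is best-effort: the channel pattern, for instance, would also rewrite issue references such as `#12`, so a human review pass before publishing still seems advisable.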
+## Schema
+### Frontmatter Fields
+| Field | Description |
+|-------|-------------|
+| session_date | ISO date of the agent session |
+| model | Model that produced the agent responses |
+| model_provider | API provider for the model |
+| platform | Messaging platform (discord, telegram, cli) |
+| project | Primary project context |
+| domain | Knowledge domain |
+| type | Always `session-digest` |
+| status | draft / active / canonical / archived |
+### JSON-Lines Fields
+| Field | Type | Description |
+|-------|------|-------------|
+| session_id | string | Unique session identifier |
+| timestamp | ISO datetime | Session start time |
+| model | string | Agent model |
+| decisions | string[] | Key decisions made |
+| learnings | string[] | Durable learnings |
+| actions_taken | string[] | Concrete actions performed |
+| promotion_targets | string[] | Pages recommended for promotion |
+| gaps_identified | string[] | System gaps discovered |
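+| strengths_identified | string[] | System strengths confirmed |
+| model_provider | string | API provider for the model |
+| platform | string | Messaging platform (discord, telegram, cli) |
+| project | string | Primary project context |
+| domain | string | Knowledge domain |
+| goal | string | Goal of the session |
+| context | string | Background context for the session |
+| format_version | string | Version of the record format |
+| cleanup_applied | string[] | Cleanup steps applied before publishing |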
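A record can be checked against the field table above before it enters a training pipeline. This is a rough sketch: the field names and types come from the table, but treating every one of them as mandatory is an assumption, and `check_record` is a hypothetical helper.

```python
from datetime import datetime

# Names and types from the JSON-Lines field table; "required" is an assumption.
STRING_FIELDS = ("session_id", "timestamp", "model")
LIST_FIELDS = (
    "decisions", "learnings", "actions_taken",
    "promotion_targets", "gaps_identified", "strengths_identified",
)


def check_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected string")
    for name in LIST_FIELDS:
        value = record.get(name)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{name}: expected list of strings")
    timestamp = record.get("timestamp")
    if isinstance(timestamp, str):
        try:
            # Map a trailing "Z" to "+00:00" so older Python versions can parse it.
            datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp: not an ISO datetime")
    return problems


example = {
    "session_id": "example-session",
    "timestamp": "2026-05-12T15:08:00Z",
    "model": "deepseek-v4-pro",
    "decisions": [], "learnings": [], "actions_taken": [],
    "promotion_targets": [], "gaps_identified": [], "strengths_identified": [],
}
print(check_record(example))
print(check_record({**example, "decisions": "not a list"}))
```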
+## Related
+- [r0b0tlabbra1n](https://github.com/r0b0tlab/llm-wiki_obsidian_hermes_r0b0tlabbra1n) — companion agent memory system
+- [QMD](https://github.com/tobi/qmd) — local hybrid search engine used for retrieval

data/sessions.jsonl ADDED

	@@ -0,0 +1 @@

+{"session_id": "2026-05-12-kb-comparison-r0b0tlab", "timestamp": "2026-05-12T15:08:00Z", "model": "deepseek-v4-pro", "model_provider": "openrouter", "platform": "discord", "project": "knowledge-system", "domain": "knowledge-management", "goal": "Compare our knowledge base setup against r0b0tlab's llm-wiki_obsidian_hermes_r0b0tlabbra1n system to identify gaps and improvements. Start implementing high-impact, low-effort recommendations.", "context": "Our KB: Hybrid architecture — Markdown wiki + QMD hybrid search + HTML artifacts. r0b0tlab's system: Filesystem-first agent memory with brain CLI, SQLite FTS5, memory tiers, secret scanning, session ingest, eval harness.", "decisions": ["Chose CLI over MCP for QMD integration (MCP is too token-hungry)", "Decided to publish session digests as a HuggingFace Dataset", "Prioritized high-impact/low-effort: session digests → cron heartbeat → _agent/ structure → secret scanning", "HF dataset will include both Markdown and JSON-Lines formats for future training data use"], "learnings": ["r0b0tlabbra1n's tier system (L1-L4) is a strong pattern worth adopting with QMD hybrid search replacing their FTS5", "brain ingest-sessions approach (reading state.db ro) creates structured memories from unstructured conversations", "Secret scanning on writes is non-negotiable for a system that accumulates code/config examples from agent sessions", "Session digests are the bottleneck — without them, the knowledge compounding loop never starts", "HF dataset publishing gives public, versioned, diffable session history with zero additional infrastructure"], "strengths_identified": ["QMD search quality: BM25 + vector + LLM reranking + HyDE", "Code indexing: 30+ file extensions with AST chunking", "MCP infrastructure: QMD HTTP daemon running", "Schema rigor: Detailed SCHEMA.md with controlled taxonomy", "Visual artifacts: HTML plans/reports with design system templates"], "gaps_identified": ["Session ingest automation", "Secret scanning on writes", "Wikilink graph + backlinks tooling", "Source hash + drift check", "Memory tier system (L1-L4) with promotion rules", "Retrieval eval harness", "Cron automation", "Agent memory structure"], "actions_taken": ["Installed HF CLI v1.14.0", "Created first session digest", "Created _agent/ structure", "Set up daily cron heartbeat", "Created HF Dataset repo"], "promotion_targets": ["Comparisons: r0b0tlabbra1n-vs-our-kb", "Concepts: memory-tiers", "Runbooks: hf-session-publishing"], "format_version": "1.0", "cleanup_applied": ["stripped local paths", "stripped transient PIDs", "generalized channel names"]}

digests/2026-05-12-kb-comparison-r0b0tlab.md ADDED

	@@ -0,0 +1,69 @@

+---
+title: Session Digest — KB Comparison with r0b0tlabbra1n
+created: 2026-05-12
+updated: 2026-05-12
+type: session-digest
+status: active
+project: knowledge-system
+domain: knowledge-management
+tags: [type/session-digest, status/active, domain/knowledge-management, workflow/promote-to-wiki]
+confidence: high
+contested: false
+session_date: 2026-05-12
+model: deepseek-v4-pro
+model_provider: openrouter
+platform: discord
+---
+# Session Digest: KB Comparison with r0b0tlabbra1n
+## Goal
+Compare our knowledge base setup against r0b0tlab's `llm-wiki_obsidian_hermes_r0b0tlabbra1n` system to identify gaps and improvements. Start implementing high-impact, low-effort recommendations.
+## Context
+- Our KB: Hybrid architecture — Markdown wiki + QMD hybrid search + HTML artifacts
+- r0b0tlab's system: Filesystem-first agent memory with `brain` CLI, SQLite FTS5, memory tiers, secret scanning, session ingest, eval harness
+## Key Findings
+### We're Ahead On
+- **QMD search quality**: BM25 + vector + LLM reranking + HyDE (vs their FTS5-only BM25)
+- **Code indexing**: 30+ file extensions with AST chunking (vs their markdown-only)
+- **MCP infrastructure**: QMD HTTP daemon running (vs their optional JSON facade)
+- **Schema rigor**: Detailed SCHEMA.md with controlled taxonomy
+- **Visual artifacts**: HTML plans/reports with design system templates
+### We're Missing (r0b0tlab's strengths)
+- **Session ingest automation**: they read the Hermes `state.db` in read-only mode; our `session-md/` directory was empty
+- **Secret scanning** on writes — critical safety gap
+- **Wikilink graph + backlinks** tooling
+- **Source hash + drift check**
+- **Memory tier system** (L1-L4) with promotion rules
+- **Retrieval eval harness** (65 gold queries)
+- **Cron automation** (daily heartbeat, weekly audit)
+- **Agent memory structure** (`_agent/` directory)
+## Actions Taken
+- Installed HF CLI (huggingface_hub[cli] v1.14.0)
+- Created first session digest in the pipeline
+- Created `_agent/START_HERE.md` and `operating-rules.md`
+- Set up daily cron heartbeat for QMD index refresh
+- Created cyberjanitor/hermes-session-digests Dataset on HuggingFace
+## Decisions
+- Chose CLI over MCP for QMD integration (MCP is too token-hungry)
+- Decided to publish session digests as a HuggingFace Dataset
+- Prioritized high-impact/low-effort: session digests → cron heartbeat → _agent/ structure → secret scanning
+- HF dataset will include both Markdown and JSON-Lines formats for future training data use
+## Durable Learnings
+- r0b0tlabbra1n's tier system (L1-L4) is a strong pattern worth adopting with QMD hybrid search replacing their FTS5
+- Their `brain ingest-sessions` approach (reading `state.db` in read-only mode) is clever: it creates structured memories from unstructured conversations
+- Secret scanning on writes is non-negotiable for a system that accumulates code/config examples from agent sessions
+- Session digests are the bottleneck — without them, the knowledge compounding loop never starts
+- HF dataset publishing gives public, versioned, diffable session history with zero additional infrastructure
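The "secret scanning on writes" learning above amounts to putting a check between the agent and the filesystem. The digest does not describe how r0b0tlabbra1n implements it, so this is only a sketch of the general idea: the three patterns are illustrative assumptions, and `find_secrets`, `guarded_write`, and `SecretFound` are hypothetical names. A real deployment would more likely rely on a maintained rule set.

```python
import re
import tempfile
from pathlib import Path

# Illustrative patterns only; not an exhaustive or authoritative rule set.
SECRET_PATTERNS = {
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{30,}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


class SecretFound(ValueError):
    """Raised when text destined for disk matches a secret pattern."""


def find_secrets(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def guarded_write(path: Path, text: str) -> None:
    """Write text to path only if no secret pattern matches."""
    hits = find_secrets(text)
    if hits:
        raise SecretFound(f"refusing to write {path.name}: matched {', '.join(hits)}")
    path.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "note.md"
    guarded_write(target, "Set up daily cron heartbeat.")   # clean text is written
    try:
        guarded_write(target, "export AWS_KEY=AKIA" + "A" * 16)
    except SecretFound as err:
        print(err)
```

Rejecting the write outright, rather than redacting silently, keeps the failure visible to whoever reviews the session.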
+## Promotion Targets
+- r0b0tlab comparison findings → new comparisons page
+- Memory tier system concept → new concepts page
+- HF publishing workflow → new runbook