cyberjanitor committed
Commit ec1cbd5 · verified · 1 Parent(s): 9227e55

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,65 @@
+ # Hermes Session Digests
+
+ Structured, post-hoc summaries of Hermes AI agent sessions. Each digest captures goals, context, actions, decisions, durable learnings, errors, and promotion targets from a single agent session.
+
+ ## Purpose
+
+ - **Canonical knowledge base**: Digests are the searchable memory stream for the Hermes-powered knowledge management system
+ - **Training data**: When paired with raw agent trajectories, digests serve as teacher signals for instruction tuning, agent trajectory learning, and RAG fine-tuning
+ - **Auditability**: Human-readable, diffable, versioned record of what the agent did and why
+
+ ## Format
+
+ Two formats are provided for each session:
+
+ ### Markdown (`digests/*.md`)
+ Full human-readable digest with YAML frontmatter metadata. Structured sections: Goal, Context, Key Findings, Actions Taken, Decisions, Durable Learnings, Promotion Targets.
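To make the Markdown layout concrete, here is a minimal sketch of splitting a digest into its frontmatter and body using only the standard library. The function names are hypothetical, and the parser is deliberately simplified: it assumes flat `key: value` lines and keeps every value as a raw string, so it is not a substitute for a real YAML parser.

```python
from pathlib import Path


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split digest text into (frontmatter, body).

    Simplified on purpose: only flat `key: value` lines are handled and
    values stay raw strings (a list such as `[a, b]` is not parsed).
    """
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != "---":
        return {}, text  # no frontmatter block at the top
    try:
        end = stripped.index("---", 1)  # position of the closing delimiter
    except ValueError:
        return {}, text  # unterminated block: treat everything as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


def load_digest(path: str) -> tuple[dict[str, str], str]:
    """Read one file (for example from digests/) and split it."""
    return split_frontmatter(Path(path).read_text(encoding="utf-8"))
```

Calling `load_digest` on a file under `digests/` would return the metadata as a dictionary of strings alongside the Markdown body; list-valued fields such as `tags` would still need real YAML parsing.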
+
+ ### JSON-Lines (`data/sessions.jsonl`)
+ Machine-readable structured extraction suitable for training pipelines: one JSON object per line, one line per session. Fields include `session_id`, `timestamp`, `model`, and the list-valued `decisions`, `learnings`, `actions_taken`, and `promotion_targets`.
+
+ ## Model Filtering
+
+ All sessions are tagged with the model that generated them. For training data, filter by model to avoid inconsistent style:
+
+ ```python
+ import json
+ # Read one JSON object per non-empty line; the context manager closes the file.
+ with open("data/sessions.jsonl", encoding="utf-8") as f:
+     sessions = [json.loads(line) for line in f if line.strip()]
+ deepseek_sessions = [s for s in sessions if s["model"] == "deepseek-v4-pro"]
+ ```
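Before choosing a model to filter on, it can help to see which `model` values actually occur in the file. The following is a small sketch of that check; `count_by_model` and the `"unknown"` fallback label are illustrative names, not part of the dataset.

```python
import json
from collections import Counter


def count_by_model(path: str) -> Counter:
    """Count sessions per `model` value in a JSON-Lines file."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                record = json.loads(line)
                # "unknown" is an assumed label for records without a model tag.
                counts[record.get("model", "unknown")] += 1
    return counts
```

For example, `count_by_model("data/sessions.jsonl").most_common()` would list the models from most to least frequent.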
+
+ ## Privacy
+
+ All digests undergo PII removal before publishing: local file paths are generalized, transient process IDs are stripped, and channel names are abstracted. No API keys, tokens, email addresses, or personal identifiers are included.
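The three cleanup steps named above could be approximated with regular expressions, as in the sketch below. The patterns, placeholders, and the `scrub` function are assumptions made for illustration; the actual cleanup rules used for this dataset are not published in the README, and these patterns would both miss cases and over-match (for example, `#` followed by text in a URL fragment).

```python
import re

# Illustrative patterns only; they are assumptions, not the project's real rules.
HOME_PATH = re.compile(r"(?:/home|/Users)/[^/\s]+")   # per-user home directories
PID = re.compile(r"\b(?:pid|PID)[ =:]+\d+\b")         # e.g. "pid 4312", "PID=77"
CHANNEL = re.compile(r"#[a-z0-9][a-z0-9_-]*")         # e.g. "#dev-chat"


def scrub(text: str) -> str:
    """Apply path, process-ID, and channel-name cleanup to digest text."""
    text = HOME_PATH.sub("~", text)          # generalize local file paths
    text = PID.sub("pid <redacted>", text)   # replace transient process IDs
    text = CHANNEL.sub("#channel", text)     # abstract channel names
    return text
```

A pass like this only covers the three listed transformations; detecting API keys or email addresses would need separate checks.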
+
+ ## Schema
+
+ ### Frontmatter Fields
+ | Field | Description |
+ |-------|-------------|
+ | session_date | ISO date of the agent session |
+ | model | Model that produced the agent responses |
+ | model_provider | API provider for the model |
+ | platform | Messaging platform (discord, telegram, cli) |
+ | project | Primary project context |
+ | domain | Knowledge domain |
+ | type | Always `session-digest` |
+ | status | draft / active / canonical / archived |
+
+ ### JSON-Lines Fields
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | session_id | string | Unique session identifier |
+ | timestamp | ISO datetime | Session start time |
+ | model | string | Agent model |
+ | decisions | string[] | Key decisions made |
+ | learnings | string[] | Durable learnings |
+ | actions_taken | string[] | Concrete actions performed |
+ | promotion_targets | string[] | Pages recommended for promotion |
+ | gaps_identified | string[] | System gaps discovered |
+ | strengths_identified | string[] | System strengths confirmed |
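The JSON-Lines Fields table can be turned into a simple structural check, sketched below. Treating every listed field as required is an assumption (the README does not say which fields are optional), and `validate_record` and `validate_file` are hypothetical helper names.

```python
import json

# Field names and kinds are taken from the JSON-Lines Fields table.
# Assumption: every listed field is required.
STRING_FIELDS = ("session_id", "timestamp", "model")
LIST_FIELDS = (
    "decisions", "learnings", "actions_taken",
    "promotion_targets", "gaps_identified", "strengths_identified",
)


def validate_record(record: object) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    if not isinstance(record, dict):
        return ["expected a JSON object"]
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected string")
    for name in LIST_FIELDS:
        value = record.get(name)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{name}: expected list of strings")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems for every failing line."""
    report: dict[int, list[str]] = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                problems = validate_record(json.loads(line))
            except json.JSONDecodeError as exc:
                problems = [f"invalid JSON: {exc.msg}"]
            if problems:
                report[lineno] = problems
    return report
```

Extra fields that appear in the data but not in the table (such as `goal` or `context`) are ignored by this check rather than rejected.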
+
+ ## Related
+
+ - [r0b0tlabbra1n](https://github.com/r0b0tlab/llm-wiki_obsidian_hermes_r0b0tlabbra1n): companion agent memory system
+ - [QMD](https://github.com/tobi/qmd): local hybrid search engine used for retrieval
data/sessions.jsonl ADDED
@@ -0,0 +1 @@
+ {"session_id": "2026-05-12-kb-comparison-r0b0tlab", "timestamp": "2026-05-12T15:08:00Z", "model": "deepseek-v4-pro", "model_provider": "openrouter", "platform": "discord", "project": "knowledge-system", "domain": "knowledge-management", "goal": "Compare our knowledge base setup against r0b0tlab's llm-wiki_obsidian_hermes_r0b0tlabbra1n system to identify gaps and improvements. Start implementing high-impact, low-effort recommendations.", "context": "Our KB: Hybrid architecture - Markdown wiki + QMD hybrid search + HTML artifacts. r0b0tlab's system: Filesystem-first agent memory with brain CLI, SQLite FTS5, memory tiers, secret scanning, session ingest, eval harness.", "decisions": ["Chose CLI over MCP for QMD integration (MCP is too token-hungry)", "Decided to publish session digests as a HuggingFace Dataset", "Prioritized high-impact/low-effort: session digests → cron heartbeat → _agent/ structure → secret scanning", "HF dataset will include both Markdown and JSON-Lines formats for future training data use"], "learnings": ["r0b0tlabbra1n's tier system (L1-L4) is a strong pattern worth adopting with QMD hybrid search replacing their FTS5", "brain ingest-sessions approach (reading state.db ro) creates structured memories from unstructured conversations", "Secret scanning on writes is non-negotiable for a system that accumulates code/config examples from agent sessions", "Session digests are the bottleneck - without them, the knowledge compounding loop never starts", "HF dataset publishing gives public, versioned, diffable session history with zero additional infrastructure"], "strengths_identified": ["QMD search quality: BM25 + vector + LLM reranking + HyDE", "Code indexing: 30+ file extensions with AST chunking", "MCP infrastructure: QMD HTTP daemon running", "Schema rigor: Detailed SCHEMA.md with controlled taxonomy", "Visual artifacts: HTML plans/reports with design system templates"], "gaps_identified": ["Session ingest automation", "Secret scanning on writes", "Wikilink graph + backlinks tooling", "Source hash + drift check", "Memory tier system (L1-L4) with promotion rules", "Retrieval eval harness", "Cron automation", "Agent memory structure"], "actions_taken": ["Installed HF CLI v1.14.0", "Created first session digest", "Created _agent/ structure", "Set up daily cron heartbeat", "Created HF Dataset repo"], "promotion_targets": ["Comparisons: r0b0tlabbra1n-vs-our-kb", "Concepts: memory-tiers", "Runbooks: hf-session-publishing"], "format_version": "1.0", "cleanup_applied": ["stripped local paths", "stripped transient PIDs", "generalized channel names"]}
digests/2026-05-12-kb-comparison-r0b0tlab.md ADDED
@@ -0,0 +1,69 @@
+ ---
+ title: Session Digest - KB Comparison with r0b0tlabbra1n
+ created: 2026-05-12
+ updated: 2026-05-12
+ type: session-digest
+ status: active
+ project: knowledge-system
+ domain: knowledge-management
+ tags: [type/session-digest, status/draft, domain/knowledge-management, workflow/promote-to-wiki]
+ confidence: high
+ contested: false
+ session_date: 2026-05-12
+ model: deepseek-v4-pro
+ model_provider: openrouter
+ platform: discord
+ ---
+
+ # Session Digest: KB Comparison with r0b0tlabbra1n
+
+ ## Goal
+ Compare our knowledge base setup against r0b0tlab's `llm-wiki_obsidian_hermes_r0b0tlabbra1n` system to identify gaps and improvements. Start implementing high-impact, low-effort recommendations.
+
+ ## Context
+ - Our KB: hybrid architecture (Markdown wiki + QMD hybrid search + HTML artifacts)
+ - r0b0tlab's system: Filesystem-first agent memory with `brain` CLI, SQLite FTS5, memory tiers, secret scanning, session ingest, eval harness
+
+ ## Key Findings
+
+ ### We're Ahead On
+ - **QMD search quality**: BM25 + vector + LLM reranking + HyDE (vs their FTS5-only BM25)
+ - **Code indexing**: 30+ file extensions with AST chunking (vs their markdown-only)
+ - **MCP infrastructure**: QMD HTTP daemon running (vs their optional JSON facade)
+ - **Schema rigor**: Detailed SCHEMA.md with controlled taxonomy
+ - **Visual artifacts**: HTML plans/reports with design system templates
+
+ ### We're Missing (r0b0tlab's strengths)
+ - **Session ingest automation**: they read the Hermes `state.db` read-only; our `session-md/` was empty
+ - **Secret scanning** on writes: a critical safety gap
+ - **Wikilink graph + backlinks** tooling
+ - **Source hash + drift check**
+ - **Memory tier system** (L1-L4) with promotion rules
+ - **Retrieval eval harness** (65 gold queries)
+ - **Cron automation** (daily heartbeat, weekly audit)
+ - **Agent memory structure** (`_agent/` directory)
+
+ ## Actions Taken
+ - Installed HF CLI (huggingface_hub[cli] v1.14.0)
+ - Created first session digest in the pipeline
+ - Created _agent/START_HERE.md and operating-rules.md
+ - Set up daily cron heartbeat for QMD index refresh
+ - Created cyberjanitor/hermes-session-digests Dataset on HuggingFace
+
+ ## Decisions
+ - Chose CLI over MCP for QMD integration (MCP is too token-hungry)
+ - Decided to publish session digests as a HuggingFace Dataset
+ - Prioritized high-impact/low-effort: session digests → cron heartbeat → _agent/ structure → secret scanning
+ - HF dataset will include both Markdown and JSON-Lines formats for future training data use
+
+ ## Durable Learnings
+ - r0b0tlabbra1n's tier system (L1-L4) is a strong pattern worth adopting with QMD hybrid search replacing their FTS5
+ - Their `brain ingest-sessions` approach (reading `state.db` read-only) is clever: it creates structured memories from unstructured conversations
+ - Secret scanning on writes is non-negotiable for a system that accumulates code/config examples from agent sessions
+ - Session digests are the bottleneck: without them, the knowledge compounding loop never starts
+ - HF dataset publishing gives public, versioned, diffable session history with zero additional infrastructure
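The learning about secret scanning on writes can be illustrated with a small write guard, sketched below. This is not how r0b0tlabbra1n or this knowledge base implements it; the pattern set, `SecretFound`, `find_secrets`, and `safe_write` are all hypothetical, and a production scanner would need a broader, tuned rule set.

```python
import re
from pathlib import Path

# Illustrative patterns only; assumptions for this sketch, not a complete rule set.
SECRET_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "key/token assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token)\b\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}"
    ),
}


class SecretFound(ValueError):
    """Raised when text bound for the knowledge base looks like it holds a secret."""


def find_secrets(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def safe_write(path: str, text: str) -> None:
    """Write text to path only if no secret pattern matches; otherwise raise."""
    hits = find_secrets(text)
    if hits:
        raise SecretFound(f"refusing to write {path}: {', '.join(hits)}")
    Path(path).write_text(text, encoding="utf-8")
```

Routing every agent write through a guard like `safe_write` means a match blocks the file from being created at all, instead of relying on a later audit to catch it.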
+
+ ## Promotion Targets
+ - r0b0tlab comparison findings → new comparisons page
+ - Memory tier system concept → new concepts page
+ - HF publishing workflow → new runbook