cowWhySo committed · Commit 45b2b1d · verified · 1 Parent(s): dfec3ef

Create README.md

  1. README.md +428 -0
README.md ADDED
@@ -0,0 +1,428 @@
1
+ ---
2
+ language:
3
+ - en
4
+ library_name: transformers
5
+ pipeline_tag: text-classification
6
+ base_model: microsoft/deberta-v3-small
7
+ tags:
8
+ - text-classification
9
+ - tool-use
10
+ - tool-calling
11
+ - guardrails
12
+ - final-response-verifier
13
+ - workflow-verification
14
+ - onnx
15
+ - safetensors
16
+ - rust
17
+ - shadow-mode
18
+ metrics:
19
+ - accuracy
20
+ - precision
21
+ - recall
22
+ - f1
23
+ model-index:
24
+ - name: final-response-verifier-classifier-production
25
+ results:
26
+ - task:
27
+ type: text-classification
28
+ name: Final response verification
29
+ dataset:
30
+ name: final_response_verifier_dataset
31
+ type: synthetic-tool-trace-fixtures
32
+ metrics:
33
+ - type: accuracy
34
+ value: 0.2
35
+ name: Test accuracy
36
+ - type: f1
37
+ value: 0.06666666666666667
38
+ name: Test macro F1
39
+ - type: precision
40
+ value: 0.04
41
+ name: Test macro precision
42
+ - type: recall
43
+ value: 0.2
44
+ name: Test macro recall
45
+ ---
46
+
47
+ # Final Response Verifier Classifier Production
48
+
49
+ ## Model summary
50
+
51
+ `cowWhySo/final-response-verifier-classifier-production` is an experimental DeBERTa-v3-small sequence classifier that verifies candidate final responses in tool-using workflows. It is designed to run as a sidecar verifier after a workflow has completed its required tool calls. The model takes a serialized representation of the user request, workflow state, required facts, tool trace, tool results, and candidate final response, then predicts whether the final response is grounded in the tool evidence or, if not, which failure label applies.
52
+
53
+ This model is not a generative model. It does not execute tools, retrieve evidence, or rewrite responses. It only scores a serialized candidate response against the supplied workflow context.
54
+
55
+ **Current status: shadow-only.** This artifact is useful for integration testing, serializer compatibility checks, ONNX deployment validation, and telemetry collection. It should not be used for production blocking or autonomous enforcement yet. The current test set is very small and the measured classification quality is poor: `0.20` accuracy and `0.0667` macro F1 on a 10-row test split.
56
+
57
+ ## Intended use
58
+
59
+ Use this model to evaluate candidate terminal responses in systems that already maintain structured tool workflow state.
60
+
61
+ Appropriate uses:
62
+
63
+ - Shadow logging for final-answer verifier experiments.
64
+ - Eval replay comparing no-classifier, FP32 ONNX shadow, quantized ONNX shadow, and advisory variants.
65
+ - Checking whether Rust-side serialization and ONNX inference match the Python training artifact.
66
+ - Building telemetry for future promotion decisions.
67
+
68
+ Do not use this model for:
69
+
70
+ - Production enforcement without additional evaluation.
71
+ - General-purpose hallucination detection.
72
+ - Safety-critical factuality, compliance, medical, legal, or financial decisions.
73
+ - Replacing deterministic tool-call guardrails, JSON-schema checks, workflow-state enforcement, or source/citation validation.
74
+
75
+ ## Labels
76
+
77
+ The classifier predicts one of five labels:
78
+
79
+ | Label | Meaning | Default handling |
80
+ |---|---|---|
81
+ | `valid_final_response` | Candidate response is grounded in the required facts and tool outputs. | Allow |
82
+ | `missing_tool_fact` | Candidate response omits one or more required facts from the tool evidence. | Shadow/advisory only |
83
+ | `contradicts_tool_result` | Candidate response conflicts with a tool result. | Shadow/advisory only |
84
+ | `unsupported_claim` | Candidate response adds a claim not supported by the supplied tool results. | Shadow/advisory only |
85
+ | `failed_to_acknowledge_data_gap` | Candidate response fails to acknowledge missing required data, or treats missing data as known. | Shadow/advisory only |
86
+
87
+ ## Input contract
88
+
89
+ The artifact uses:
90
+
91
+ - Input schema: `final-response-verifier-input/v1`
92
+ - Serializer: `serialize_final_response_state_v1`
93
+ - Max sequence length: `768`
94
+ - Base model: `microsoft/deberta-v3-small`
95
+
96
+ The structured input contains:
97
+
98
+ ```json
99
+ {
100
+ "schema_version": "final-response-verifier-input/v1",
101
+ "user_request": "...",
102
+ "workflow_state": {
103
+ "required_steps": ["..."],
104
+ "completed_steps": ["..."],
105
+ "pending_steps": ["..."],
106
+ "terminal_tools": ["..."],
107
+ "recent_errors": ["..."]
108
+ },
109
+ "required_facts": ["..."],
110
+ "tool_trace": ["..."],
111
+ "tool_results": [
112
+ {"tool_name": "...", "content": "..."}
113
+ ],
114
+ "candidate_final_response": "...",
115
+ "metadata": {
116
+ "scenario_family": "...",
117
+ "requires_transform": false,
118
+ "requires_synthesis": true,
119
+ "requires_all_tool_facts": true,
120
+ "must_acknowledge_missing_data": false
121
+ }
122
+ }
123
+ ```
124
+
125
+ The text serializer emits a sectioned prompt-like string:
126
+
127
+ ```text
128
+ SCHEMA_VERSION:
129
+ final-response-verifier-input/v1
130
+
131
+ USER_REQUEST:
132
+ ...
133
+
134
+ WORKFLOW_STATE:
135
+ required_steps=[...]
136
+ completed_steps=[...]
137
+ pending_steps=[...]
138
+ terminal_tools=[...]
139
+ recent_errors=[...]
140
+
141
+ REQUIRED_FACTS:
142
+ [...]
143
+
144
+ TOOL_TRACE:
145
+ [...]
146
+
147
+ TOOL_RESULTS:
148
+ tool_name: "tool output text"
149
+
150
+ CANDIDATE_FINAL_RESPONSE:
151
+ ...
152
+
153
+ SCORING_METADATA:
154
+ scenario_family="..."
155
+ requires_transform=false
156
+ requires_synthesis=true
157
+ requires_all_tool_facts=true
158
+ must_acknowledge_missing_data=false
159
+ ```
160
+
161
+ For deployment, the Rust or Python caller must reproduce this serializer exactly. Running inference with a serialization format that differs from the one used in training invalidates the classifier's predictions.
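To make the layout concrete, here is a minimal Python sketch of a serializer that emits the sections shown above. The real `serialize_final_response_state_v1` is not included in this card, so everything here is an assumption: the function name `serialize_sketch`, the use of Python `repr` for lists (chosen because the Transformers usage example in this card shows `['a', 'b']` style lists), the use of `json.dumps` for quoted strings, and the single blank line between sections. Any real caller should be compared byte-for-byte against the training serializer.

```python
import json

WORKFLOW_KEYS = ["required_steps", "completed_steps", "pending_steps",
                 "terminal_tools", "recent_errors"]
FLAG_KEYS = ["requires_transform", "requires_synthesis",
             "requires_all_tool_facts", "must_acknowledge_missing_data"]


def _flag(value):
    # The layout shows lowercase true/false rather than Python's True/False.
    return "true" if value else "false"


def serialize_sketch(payload):
    """Render a final-response-verifier-input/v1 dict as sectioned text (illustrative only)."""
    ws = payload["workflow_state"]
    meta = payload["metadata"]
    tool_lines = [f'{r["tool_name"]}: {json.dumps(r["content"])}'
                  for r in payload["tool_results"]]
    meta_lines = [f'scenario_family={json.dumps(meta["scenario_family"])}']
    meta_lines += [f"{key}={_flag(meta[key])}" for key in FLAG_KEYS]
    sections = [
        ("SCHEMA_VERSION", payload["schema_version"]),
        ("USER_REQUEST", payload["user_request"]),
        ("WORKFLOW_STATE", "\n".join(f"{key}={ws[key]!r}" for key in WORKFLOW_KEYS)),
        ("REQUIRED_FACTS", repr(payload["required_facts"])),
        ("TOOL_TRACE", repr(payload["tool_trace"])),
        ("TOOL_RESULTS", "\n".join(tool_lines)),
        ("CANDIDATE_FINAL_RESPONSE", payload["candidate_final_response"]),
        ("SCORING_METADATA", "\n".join(meta_lines)),
    ]
    # Assumed layout: "NAME:" header, body on the next line, blank line between sections.
    return "\n\n".join(f"{name}:\n{body}" for name, body in sections)


example = {
    "schema_version": "final-response-verifier-input/v1",
    "user_request": "Generate a sales report from the Q4 2024 dataset.",
    "workflow_state": {
        "required_steps": ["fetch_sales_data", "analyze_sales"],
        "completed_steps": ["fetch_sales_data", "analyze_sales"],
        "pending_steps": [],
        "terminal_tools": ["report"],
        "recent_errors": [],
    },
    "required_facts": ["23% YoY growth"],
    "tool_trace": ["fetch_sales_data", "analyze_sales", "report"],
    "tool_results": [{"tool_name": "analyze_sales", "content": "Revenue grew 23% YoY."}],
    "candidate_final_response": "Revenue grew 23% YoY.",
    "metadata": {
        "scenario_family": "sequential_3step",
        "requires_transform": False,
        "requires_synthesis": False,
        "requires_all_tool_facts": True,
        "must_acknowledge_missing_data": False,
    },
}

print(serialize_sketch(example))
```

Details such as how strings containing apostrophes or non-ASCII characters are rendered will differ between `repr`, `json.dumps`, and a Rust formatter, which is why a parity test against the training serializer matters more than the sketch itself.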
162
+
163
+ ## Repository layout
164
+
165
+ The repository contains two deployment surfaces:
166
+
167
+ ```text
168
+ hf_model/
169
+ artifact_manifest.json
170
+ config.json
171
+ input_schema.json
172
+ labels.json
173
+ model.safetensors
174
+ onnx_parity_report.json
175
+ special_tokens_map.json
176
+ spm.model
177
+ thresholds.json
178
+ tokenizer_config.json
179
+ training_args.bin
180
+ training_provenance.json
181
+
182
+ onnx/
183
+ artifact_manifest.json
184
+ config.json
185
+ input_schema.json
186
+ labels.json
187
+ model.onnx
188
+ model_quantized.onnx
189
+ onnx_parity_report.json
190
+ special_tokens_map.json
191
+ spm.model
192
+ thresholds.json
193
+ tokenizer_config.json
194
+ training_provenance.json
195
+ ```
196
+
197
+ Use `hf_model/` for Transformers/PyTorch inference and `onnx/` for ONNX Runtime deployment.
198
+
199
+ ## Training data
200
+
201
+ The final-response verifier artifact was trained on a small, balanced fixture dataset:
202
+
203
+ | Split | Rows | Groups |
204
+ |---|---:|---:|
205
+ | Train | 70 | 14 |
206
+ | Validation | 10 | 2 |
207
+ | Test | 10 | 2 |
208
+ | Total | 90 | 18 |
209
+
210
+ Label counts:
211
+
212
+ | Label | Rows |
213
+ |---|---:|
214
+ | `valid_final_response` | 18 |
215
+ | `missing_tool_fact` | 18 |
216
+ | `contradicts_tool_result` | 18 |
217
+ | `unsupported_claim` | 18 |
218
+ | `failed_to_acknowledge_data_gap` | 18 |
219
+
220
+ The dataset is intentionally small and fixture-heavy. Treat all metrics as smoke-test metrics, not as evidence of production readiness.
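The Groups column implies that rows are split by group rather than individually, so that related rows never straddle train and test. The card does not say how the split was produced; the sketch below shows one way to do a group-held-out split. The `group_id` field, the `group_split` helper, and the seed are hypothetical, and the toy data assumes five rows per group (90 rows over 18 groups).

```python
import random


def group_split(rows, n_val_groups, n_test_groups, seed=0):
    """Assign whole groups to train/validation/test so no group spans two splits."""
    groups = sorted({row["group_id"] for row in rows})
    random.Random(seed).shuffle(groups)  # deterministic shuffle for reproducibility
    test_groups = set(groups[:n_test_groups])
    val_groups = set(groups[n_test_groups:n_test_groups + n_val_groups])
    split = {"train": [], "validation": [], "test": []}
    for row in rows:
        if row["group_id"] in test_groups:
            split["test"].append(row)
        elif row["group_id"] in val_groups:
            split["validation"].append(row)
        else:
            split["train"].append(row)
    return split


LABELS = [
    "valid_final_response",
    "missing_tool_fact",
    "contradicts_tool_result",
    "unsupported_claim",
    "failed_to_acknowledge_data_gap",
]

# Toy dataset: 18 groups, one row per label in each group (an assumption).
rows = [{"group_id": g, "label": label} for g in range(18) for label in LABELS]
split = group_split(rows, n_val_groups=2, n_test_groups=2)
print({name: len(items) for name, items in split.items()})
```

With these assumptions the split sizes match the table (70 / 10 / 10 rows over 14 / 2 / 2 groups), and each split stays label-balanced because every group contributes one row per label.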
221
+
222
+ ## Training configuration
223
+
224
+ | Field | Value |
225
+ |---|---|
226
+ | Base model | `microsoft/deberta-v3-small` |
227
+ | Run profile | `high_vram_quality` |
228
+ | Final-response max length | `768` |
229
+ | Configured epochs | `5` |
230
+ | Recorded training epoch | `3.0` |
231
+ | Train batch size | `16` |
232
+ | Eval batch size | `32` |
233
+ | Gradient accumulation | `4` |
234
+ | Max rows per label | `5000` |
235
+ | Force retrain | `false` |
236
+ | CPU-only ONNX export | `true` |
237
+ | GPU | `NVIDIA RTX PRO 6000 Blackwell Server Edition` |
238
+ | GPU memory | `95.0 GB` |
239
+ | Precision | bf16 and tf32 enabled, fp16 disabled |
240
+
241
+ ## Evaluation
242
+
243
+ Test metrics from the artifact provenance:
244
+
245
+ | Metric | Value |
246
+ |---|---:|
247
+ | Eval loss | `1.6188628673553467` |
248
+ | Accuracy | `0.2` |
249
+ | Macro precision | `0.04` |
250
+ | Macro recall | `0.2` |
251
+ | Macro F1 | `0.06666666666666667` |
252
+ | Eval samples/s | `27.032` |
253
+
254
+ With five balanced labels, `0.20` accuracy is exactly what random guessing would score, and with only 10 test examples the estimate is very noisy. Do not promote this model to advisory or enforcement mode based on the current metrics.
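The reported numbers can be unpacked a little further. The sketch below assumes the 10-row test split holds two rows per label (consistent with the balanced label counts, but not stated explicitly) and shows that a classifier emitting the same label for every row would reproduce the table: accuracy `0.2`, macro precision `0.04`, macro recall `0.2`, and macro F1 of about `0.0667`. The eval loss of about `1.619` is also close to ln 5 ≈ 1.609, the cross-entropy of a uniform guess over five labels. This is consistent with a collapsed model but does not prove it. The `macro_metrics` helper is written for this illustration and is not part of the repository.

```python
import math

LABELS = [
    "valid_final_response",
    "missing_tool_fact",
    "contradicts_tool_result",
    "unsupported_claim",
    "failed_to_acknowledge_data_gap",
]


def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus unweighted (macro) precision, recall, and F1; undefined ratios count as 0."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Assumed test split: two rows per label, and a model that always predicts one label.
y_true = [label for label in LABELS for _ in range(2)]
y_pred = [LABELS[0]] * len(y_true)

metrics = macro_metrics(y_true, y_pred, LABELS)
uniform_loss = math.log(len(LABELS))  # cross-entropy of a uniform 5-way guess
print(metrics)
print(round(uniform_loss, 4))
```

If this reading is right, per-label prediction counts on the test split would be a cheap diagnostic to log alongside the aggregate metrics.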
255
+
256
+ ## ONNX parity
257
+
258
+ The exported ONNX artifacts passed a small parity smoke check:
259
+
260
+ | Check | Value |
261
+ |---|---:|
262
+ | Rows | `10` |
263
+ | PyTorch vs FP32 ONNX top-label agreement | `1.0` |
264
+ | PyTorch vs FP32 ONNX max absolute diff | `2.980232238769531e-07` |
265
+ | Quantized ONNX present | `true` |
266
+ | FP32 ONNX vs quantized ONNX top-label agreement | `1.0` |
267
+ | FP32 ONNX vs quantized ONNX disagreements | `0` |
268
+ | FP32 ONNX vs quantized ONNX max absolute diff | `0.017383113503456116` |
269
+
270
+ This only validates export parity on a tiny sample. It does not validate model quality.
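For readers who want to rerun a similar check on their own replay rows, the sketch below computes the same kinds of quantities as the table: top-label agreement, disagreement count, and maximum absolute difference. It is not the script that produced `onnx_parity_report.json`. The `parity_report` function, its output keys, and the choice to compare raw logits (rather than probabilities) are assumptions, and the sample logits are made up.

```python
def parity_report(reference, candidate):
    """Compare two lists of per-row logit vectors from different runtimes."""
    if len(reference) != len(candidate):
        raise ValueError("both runtimes must score the same rows")

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    agreements = sum(argmax(r) == argmax(c) for r, c in zip(reference, candidate))
    max_abs_diff = max(
        abs(a - b)
        for r, c in zip(reference, candidate)
        for a, b in zip(r, c)
    )
    rows = len(reference)
    return {
        "rows": rows,
        "top_label_agreement": agreements / rows,
        "disagreements": rows - agreements,
        "max_abs_diff": max_abs_diff,
    }


# Made-up logits for two rows and five labels, e.g. FP32 vs quantized output.
fp32_logits = [[0.10, 0.50, -0.20, 0.00, 0.30], [1.20, -0.40, 0.10, 0.00, 0.20]]
int8_logits = [[0.11, 0.49, -0.21, 0.01, 0.30], [1.19, -0.41, 0.12, 0.00, 0.20]]

report = parity_report(fp32_logits, int8_logits)
print(report)
```

Top-label agreement alone can hide drift near decision boundaries, so tracking the maximum difference next to it, as the table does, is a reasonable pairing.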
271
+
272
+ ## Threshold policy
273
+
274
+ The included `thresholds.json` is shadow-first:
275
+
276
+ ```json
277
+ {
278
+ "schema_version": "final-response-verifier-thresholds/v1",
279
+ "mode": "shadow",
280
+ "default_action": "allow"
281
+ }
282
+ ```
283
+
284
+ Default label policy:
285
+
286
+ | Label | Action | Advisory threshold | Enforcement threshold |
287
+ |---|---|---:|---:|
288
+ | `valid_final_response` | `allow` | `0.0` | `1.01` |
289
+ | `missing_tool_fact` | `advisory_then_enforce_after_eval` | `0.90` | `0.995` |
290
+ | `contradicts_tool_result` | `advisory_then_enforce_after_eval` | `0.90` | `0.995` |
291
+ | `unsupported_claim` | `advisory_then_enforce_after_eval` | `0.90` | `0.995` |
292
+ | `failed_to_acknowledge_data_gap` | `advisory_then_enforce_after_eval` | `0.90` | `0.995` |
293
+
294
+ Despite these threshold fields, the current model card recommendation is stricter: keep the model in `shadow` mode until a larger held-out evaluation shows useful precision and recall.
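As an illustration of how a caller might combine the mode with the per-label thresholds, here is a minimal sketch. Only the three top-level fields of `thresholds.json` are shown in this card, so the `LABEL_POLICY` layout, the `decide` function, the mode names `advisory` and `enforce`, and the action strings `advise` and `block` are hypothetical. Note that an enforcement threshold of `1.01` can never be reached by a probability, so `valid_final_response` can never trigger enforcement.

```python
FAILURE_POLICY = {
    "action": "advisory_then_enforce_after_eval",
    "advisory": 0.90,
    "enforcement": 0.995,
}

# Hypothetical in-memory form of the default label policy table.
LABEL_POLICY = {
    "valid_final_response": {"action": "allow", "advisory": 0.0, "enforcement": 1.01},
    "missing_tool_fact": dict(FAILURE_POLICY),
    "contradicts_tool_result": dict(FAILURE_POLICY),
    "unsupported_claim": dict(FAILURE_POLICY),
    "failed_to_acknowledge_data_gap": dict(FAILURE_POLICY),
}


def decide(label, probability, mode="shadow"):
    """Return the action actually applied and the action the thresholds would imply."""
    policy = LABEL_POLICY[label]
    if policy["action"] == "allow":
        would_be = "allow"
    elif probability >= policy["enforcement"]:
        would_be = "block"
    elif probability >= policy["advisory"]:
        would_be = "advise"
    else:
        would_be = "allow"

    if mode == "shadow":
        applied = "allow"  # shadow mode only records the would-be action
    elif mode == "advisory":
        applied = "advise" if would_be in ("advise", "block") else "allow"
    else:  # hypothetical "enforce" mode
        applied = would_be
    return {"applied_action": applied, "would_be_action": would_be}


print(decide("unsupported_claim", 0.999))                   # shadow: allowed, would block
print(decide("unsupported_claim", 0.93, mode="advisory"))   # advisory: advise
print(decide("valid_final_response", 1.0, mode="enforce"))  # always allow
```

Logging the would-be action separately from the applied action is what makes shadow telemetry usable for later promotion decisions.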
295
+
296
+ ## Transformers usage
297
+
298
+ ```python
299
+ import torch
300
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
301
+
302
+ repo_id = "cowWhySo/final-response-verifier-classifier-production"
303
+ subfolder = "hf_model"
304
+
305
+ # use_fast=False is recommended for parity with the training/export code path.
306
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, use_fast=False)
307
+ model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)
308
+ model.eval()
309
+
310
+ text = """SCHEMA_VERSION:
311
+ final-response-verifier-input/v1
312
+
313
+ USER_REQUEST:
314
+ Generate a sales report from the Q4 2024 dataset.
315
+
316
+ WORKFLOW_STATE:
317
+ required_steps=['fetch_sales_data', 'analyze_sales']
318
+ completed_steps=['fetch_sales_data', 'analyze_sales']
319
+ pending_steps=[]
320
+ terminal_tools=['report']
321
+ recent_errors=[]
322
+
323
+ REQUIRED_FACTS:
324
+ ['23% YoY growth', 'Widget Pro', 'APAC']
325
+
326
+ TOOL_TRACE:
327
+ ['fetch_sales_data', 'analyze_sales', 'report']
328
+
329
+ TOOL_RESULTS:
330
+ analyze_sales: "Revenue grew 23% YoY. Top product: Widget Pro. Weakest region: APAC."
331
+
332
+ CANDIDATE_FINAL_RESPONSE:
333
+ Revenue grew 23% YoY. Top product was Widget Pro, and APAC was the weakest region.
334
+
335
+ SCORING_METADATA:
336
+ scenario_family="sequential_3step"
337
+ requires_transform=false
338
+ requires_synthesis=false
339
+ requires_all_tool_facts=true
340
+ must_acknowledge_missing_data=false"""
341
+
342
+ inputs = tokenizer(
343
+ [text],
344
+ return_tensors="pt",
345
+ truncation=True,
346
+ max_length=768,
347
+ padding=True,
348
+ )
349
+
350
+ with torch.no_grad():
351
+     logits = model(**inputs).logits
352
+ probs = torch.softmax(logits, dim=-1)[0]
353
+
354
+ id2label = model.config.id2label
355
+ for idx, score in sorted(enumerate(probs.tolist()), key=lambda x: x[1], reverse=True):
356
+     print(id2label[idx], score)
357
+ ```
358
+
359
+ ## ONNX Runtime usage
360
+
361
+ ```python
362
+ import numpy as np
363
+ import onnxruntime as ort
364
+ from huggingface_hub import hf_hub_download
365
+ from transformers import AutoTokenizer
366
+
367
+ repo_id = "cowWhySo/final-response-verifier-classifier-production"
368
+
369
+ # Use model.onnx for FP32 or model_quantized.onnx for smaller CPU deployment.
370
+ onnx_path = hf_hub_download(repo_id, filename="onnx/model_quantized.onnx")
371
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="onnx", use_fast=False)
372
+
373
+ session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
374
+ # `text` is the serialized input string built in the Transformers example above.
+ inputs = tokenizer([text], return_tensors="np", truncation=True, max_length=768, padding=True)
375
+ input_names = {item.name for item in session.get_inputs()}
376
+ ort_inputs = {key: value for key, value in inputs.items() if key in input_names}
377
+ logits = session.run(None, ort_inputs)[0]
378
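+ # Subtract the row maximum before exponentiating so the softmax stays numerically stable.
+ shifted = logits - logits.max(axis=-1, keepdims=True)
+ probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)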
379
+ print(probs[0])
380
+ ```
381
+
382
+ ## Rust deployment notes
383
+
384
+ A Rust integration should load the following from `onnx/`:
385
+
386
+ - `model.onnx` or `model_quantized.onnx`
387
+ - tokenizer files: `tokenizer_config.json`, `special_tokens_map.json`, `spm.model`
388
+ - `labels.json`
389
+ - `thresholds.json`
390
+ - `artifact_manifest.json`
391
+ - `input_schema.json`
392
+ - `training_provenance.json`
393
+ - `onnx_parity_report.json`
394
+
395
+ Recommended integration sequence:
396
+
397
+ 1. Validate workflow state and deterministic guardrails first.
398
+ 2. Build the final-response scoring context from already-completed tool calls.
399
+ 3. Serialize with `serialize_final_response_state_v1`.
400
+ 4. Run the model in shadow mode.
401
+ 5. Log the predicted label, confidence, raw logits, model version, serializer version, and threshold decision.
402
+ 6. Do not block or rewrite responses until offline eval replay proves the model improves target scenarios without false objections on valid final responses.
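Step 5 of the sequence above can be sketched as a small record builder. The example is in Python to match the other examples in this card; a Rust integration would emit the same fields. The field names, the `shadow_record` helper, the placeholder model version string, and the label order are assumptions. The real label order should be read from `labels.json` or `config.json`.

```python
import json
import math

# Assumed order; the authoritative mapping lives in labels.json / config.id2label.
LABELS = [
    "valid_final_response",
    "missing_tool_fact",
    "contradicts_tool_result",
    "unsupported_claim",
    "failed_to_acknowledge_data_gap",
]


def softmax(logits):
    # Shift by the maximum so large logits do not overflow math.exp.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shadow_record(logits, labels, model_version, serializer_version):
    """Build one JSON-serializable telemetry record for a shadow-mode prediction."""
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    return {
        "predicted_label": labels[top],
        "confidence": probs[top],
        "raw_logits": list(logits),
        "model_version": model_version,
        "serializer_version": serializer_version,
        "threshold_decision": "allow",  # shadow mode never blocks or rewrites
    }


record = shadow_record(
    [0.1, 2.0, -1.0, 0.3, 0.0],
    LABELS,
    model_version="example-model-version",  # placeholder, not a real identifier
    serializer_version="serialize_final_response_state_v1",
)
print(json.dumps(record, indent=2))
```

Keeping the raw logits in the record allows thresholds and calibration to be re-evaluated offline without rerunning the model.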
403
+
404
+ ## Limitations
405
+
406
+ - The full dataset has only 90 rows, of which 70 are used for training.
407
+ - The test split has only 10 examples.
408
+ - Current test accuracy is `0.20`, which is chance-level for a balanced five-label task.
409
+ - The examples are synthetic or fixture-like and do not represent broad real-world final-response behavior.
410
+ - The model depends on the caller supplying accurate `required_facts`, `tool_trace`, and `tool_results`.
411
+ - The model does not independently verify source truth, tool correctness, or external facts.
412
+ - The model has not been validated for multilingual use, adversarial prompts, long multi-tool traces, or production traffic.
413
+
414
+ ## Recommended next steps
415
+
416
+ Before any advisory or enforcement rollout:
417
+
418
+ 1. Expand the dataset with real Forge/tool-workflow traces.
419
+ 2. Add hard negatives for subtle omissions, numeric drift, unsupported causal claims, and missing-data overclaims.
420
+ 3. Build a larger group-held-out test split with per-scenario metrics.
421
+ 4. Calibrate probabilities after training.
422
+ 5. Compare PyTorch, FP32 ONNX, and quantized ONNX on the same replay set.
423
+ 6. Track valid-final-response false objection rate as the primary promotion gate.
424
+ 7. Keep deterministic tool/workflow guardrails authoritative.
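Step 6 names the valid-final-response false objection rate as the primary promotion gate, but this card does not define it formally. One plausible reading, sketched below, is the fraction of truly valid responses for which the classifier's top label is anything other than `valid_final_response`. The `false_objection_rate` helper and the toy labels are hypothetical.

```python
def false_objection_rate(y_true, y_pred, valid_label="valid_final_response"):
    """Share of truly valid responses that the classifier flags with a failure label."""
    preds_on_valid = [p for t, p in zip(y_true, y_pred) if t == valid_label]
    if not preds_on_valid:
        raise ValueError("no valid_final_response rows to evaluate")
    objections = sum(1 for p in preds_on_valid if p != valid_label)
    return objections / len(preds_on_valid)


# Toy replay set: four valid responses, one of which is wrongly flagged.
y_true = [
    "valid_final_response",
    "valid_final_response",
    "valid_final_response",
    "valid_final_response",
    "unsupported_claim",
]
y_pred = [
    "valid_final_response",
    "missing_tool_fact",
    "valid_final_response",
    "valid_final_response",
    "unsupported_claim",
]

rate = false_objection_rate(y_true, y_pred)
print(rate)
```

A threshold-aware variant would count only objections whose confidence exceeds the advisory or enforcement threshold, which is closer to what users would actually experience after promotion.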
425
+
426
+ ## Citation and provenance
427
+
428
+ This model was produced from the `toolcall_verifier_training_production_colab_v4` workflow and uploaded to the Hugging Face repository `cowWhySo/final-response-verifier-classifier-production`. The artifact is marked `deployment_default: shadow` because it is an experimental final-response verifier that should be promoted only after eval replay.