IH-Scorer v2 — Intellectual Humility 3-seed LoRA Ensemble on Qwen3.6-27B
A 5-level A..E scorer of Intellectual Humility (IH) in short English text — the degree to which a passage acknowledges epistemic limits, expresses openness to revision, and engages opposing views charitably and non-absolutely. IH is treated here as a property of the text, not of the writer.
The deployed scorer is a 3-seed ensemble of LoRA adapters trained on unsloth/Qwen3.6-27B with the same supervised fine-tuning recipe and three independent random seeds. Inter-seed disagreement is exposed as an agreement-based confidence tier (HIGH / MEDIUM / LOW) for research triage, quality control, or human-review routing.
On 5-fold CV of the Guo 2024 corpus, the ensemble achieves Pearson r = 0.729 overall (mean across folds), rising to 0.815 on the HIGH-confidence subset (57%) where all 3 seeds agree. This release supersedes the earlier ORPO scorer tmadl/IH-Qwen3.5-ORPO-Guo (pooled Pearson 0.689), which is now deprecated.
Intended use
This model is intended for research use: scoring short English text for intellectual humility expressed in the passage, especially in psychological and social-science text analysis, persuasion / belief-change research, and human-AI interaction or AI-safety studies concerned with receiver-side reflective availability. Scores are most defensible at the level of texts, conditions, corpora, or repeated-measures aggregates.
The model scores texts, not people. A single text's IH score should not be treated as a stable attribute of the writer.
Do not use the scorer for individual profiling, clinical or forensic assessment, educational or employment evaluation, eligibility decisions, surveillance, content-moderation decisions, targeted persuasion, ranking people by intellectual character, or any other use that ranks, classifies, or makes decisions about identifiable individuals.
The model is trained and evaluated against expert text-marker coding of IH (Guo 2024 scheme); transfer outside written English social / argumentative discourse has not been validated.
Research context
This scorer is a candidate text-derived indicator of intellectual humility, which serves as the construct anchor for the belief-holding/updating (calibration-collapse) failure mode in Madl & Lazar, A Receiver-Side Blind Spot in AI Safety (in review). It estimates IH expressed in a passage as a fallible proxy for the in-situ availability of warrant-inspection operations; load-bearing framework claims rest on independent behavioural endpoints, not on this text score alone.
Related scorers for the other two construct anchors are:
- Integrative Complexity scorer (for the perspective-collapse / content object): tmadl/IC-Qwen3.5-ORPO-400
- Integrated Decentering scorer (for the framing-capture / sense-making object): tmadl/ID-decentering-scorer-ensemble
Quick start
pip install -U unsloth bitsandbytes accelerate
Ensemble (recommended)
from inference_example import score_texts

out = score_texts([
    "I know what I'm talking about, unlike most people. Anyone who "
    "disagrees hasn't done their research.",
    "I've held this view for a while but I recognize there's a lot I don't "
    "know. The strongest argument against it is real and I can't fully "
    "rebut it.",
])
# out[0]: {"ensemble_argmax_letter": "A", "ensemble_argmax_score_1_5": 1.0,
#          "ensemble_ev_score_1_5": ..., "confidence_tier": "HIGH", ...}
# out[1]: {"ensemble_argmax_letter": "E", "ensemble_argmax_score_1_5": 5.0,
#          "ensemble_ev_score_1_5": ..., "confidence_tier": "HIGH", ...}
Single-adapter fast path (≈1/3 cost; no confidence tier)
out = score_texts(texts, members=["sft_lowirr_all_seed42"])
CLI form (one essay per line):
python inference_example.py --input essays.txt --output scored.jsonl
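Because the CLI reads one essay per line, an essay that contains its own line breaks would presumably be split into several inputs. The following is a minimal sketch for preparing the input file under that assumption. The helper names `flatten_essay` and `write_essays` are illustrative and are not part of inference_example.py.

```python
def flatten_essay(text: str) -> str:
    """Collapse every whitespace run (including newlines) into one space."""
    return " ".join(text.split())


def write_essays(essays, path="essays.txt"):
    """Write one flattened essay per line so line N maps to essay N."""
    with open(path, "w", encoding="utf-8") as fh:
        for index, essay in enumerate(essays):
            line = flatten_essay(essay)
            if not line:
                # An empty line would break the line-to-essay alignment.
                raise ValueError(f"essay {index} is empty after flattening")
            fh.write(line + "\n")
```

Raising on empty essays, rather than silently skipping them, keeps row N of the output aligned with essay N of the input.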
Output fields
| field | description |
|---|---|
| ensemble_argmax_letter | majority vote across adapters (tiebreak: highest mean prob) |
| ensemble_argmax_score_1_5 | integer 1..5 mapping (A=1, ..., E=5) |
| ensemble_ev_score_1_5 | Σ p(L) · num(L), a soft continuous score |
| prob_A..prob_E | mean softmax probabilities across adapters |
| confidence_range | max(adapter letters) − min (integer 0..4); 0 means all adapters agree |
| confidence_tier | HIGH (range=0), MEDIUM (range=1), LOW (range≥2); a triage signal. N/A in single-adapter mode (no inter-seed range). |
| n_adapters_voted | how many adapters produced a parseable letter |
| parse_failed | True iff no adapter produced a parseable Letter: <X>; the score reverts to a uniform prior, so flag for human review |
| adapter{0..K-1}_letter | individual adapter letter predictions, for inspection |
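The field definitions above can be sketched as a small aggregation function. This is an illustrative reconstruction from the table, not the code in inference_example.py. The function name `aggregate`, its input shape, and two edge-case choices are assumptions: a `None` letter on parse failure, and an "N/A" tier whenever fewer than two adapters voted.

```python
from collections import Counter

LETTERS = "ABCDE"
NUM = {ltr: i for i, ltr in enumerate(LETTERS, start=1)}  # A=1 ... E=5


def aggregate(votes):
    """votes: list of (letter, probs) pairs, one per adapter that produced
    a parseable letter; probs maps each of A..E to a probability."""
    parse_failed = not votes
    if parse_failed:
        # No parseable letter anywhere: fall back to a uniform prior.
        mean_probs = {ltr: 1 / len(LETTERS) for ltr in LETTERS}
    else:
        mean_probs = {
            ltr: sum(p[ltr] for _, p in votes) / len(votes) for ltr in LETTERS
        }
    ev = sum(mean_probs[ltr] * NUM[ltr] for ltr in LETTERS)

    letter, score, rng, tier = None, None, None, "N/A"
    if not parse_failed:
        counts = Counter(ltr for ltr, _ in votes)
        top = max(counts.values())
        tied = [ltr for ltr in LETTERS if counts[ltr] == top]
        # Majority vote; ties go to the letter with the highest mean prob.
        letter = max(tied, key=lambda ltr: mean_probs[ltr])
        score = NUM[letter]
        if len(votes) >= 2:  # assumption: a range needs at least two votes
            nums = [NUM[ltr] for ltr, _ in votes]
            rng = max(nums) - min(nums)
            tier = "HIGH" if rng == 0 else "MEDIUM" if rng == 1 else "LOW"

    return {
        "ensemble_argmax_letter": letter,
        "ensemble_argmax_score_1_5": score,
        "ensemble_ev_score_1_5": ev,
        "confidence_range": rng,
        "confidence_tier": tier,
        "n_adapters_voted": len(votes),
        "parse_failed": parse_failed,
        **{f"prob_{ltr}": mean_probs[ltr] for ltr in LETTERS},
    }
```

For example, two adapters voting D and one voting E give a D majority with range 1, which maps to the MEDIUM tier.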
For downstream use:
- letter-level tasks (κ-linear, exact accuracy): ensemble_argmax_letter
- continuous scores (correlations, regressions): ensemble_ev_score_1_5
- model probabilities: prob_A..prob_E
- research triage / quality control: confidence_tier (flag LOW confidence for human review)
Scoring head
The model was supervised fine-tuned to emit a short theory-grounded rationale followed by Letter: <X>. At inference time we therefore generate up to 240 tokens (greedy decoding) per essay, locate the predicted letter token, and read softmax probabilities at that position over the five letter tokens "A".."E". Ensemble probabilities are the per-adapter mean.
This differs from the earlier ORPO scorer (v1), which used first-position logit-EV decoding (one forward pass). v2's scoring head is not a drop-in replacement for v1's; see "Migrating from v1". The system prompt embedded in inference_example.py is the exact training prompt — do not modify it without retraining.
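The two post-generation steps described above, locating the letter verdict and taking a softmax over the five letter tokens, can be sketched in isolation. This is a simplified illustration, not the repository's implementation. It assumes the verdict appears as "Letter: D" (optionally with angle brackets), that the last such occurrence is the one that counts, and that the probabilities are renormalised over A..E only. The names `parse_letter` and `letter_probs` are hypothetical.

```python
import math
import re

LETTERS = "ABCDE"
# Accepts "Letter: D" as well as a literal "Letter: <D>".
LETTER_RE = re.compile(r"Letter:\s*<?([A-E])>?")


def parse_letter(generated: str):
    """Return the last 'Letter: X' verdict in the text, or None if absent."""
    matches = LETTER_RE.findall(generated)
    return matches[-1] if matches else None


def letter_probs(logits):
    """Softmax over the five letter logits (dict: letter -> raw logit)."""
    peak = max(logits[ltr] for ltr in LETTERS)  # subtract max for stability
    exps = {ltr: math.exp(logits[ltr] - peak) for ltr in LETTERS}
    total = sum(exps.values())
    return {ltr: exps[ltr] / total for ltr in LETTERS}
```

A `None` result from `parse_letter` corresponds to the case where an adapter produced no parseable letter.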
Expected performance (5-fold CV on Guo 2024)
| metric | mean across folds | SD across folds |
|---|---|---|
| Pearson r | 0.729 | 0.090 |
| Cohen's κ-linear | 0.575 | 0.082 |
| Krippendorff α-ordinal | 0.619 | 0.085 |
Per-fold Pearson: 0.844, 0.811, 0.641, 0.696, 0.652. Compared with human inter-rater reliability (IRR; Pearson between the two annotators, Neil and Melody, on dual-coded essays):
| fold | model Pearson | human IRR | % of IRR |
|---|---|---|---|
| 0 | 0.844 | 0.905 | 93% |
| 1 | 0.811 | 0.946 | 86% |
| 2 | 0.641 | 0.874 | 73% |
| 3 | 0.696 | 0.773 | 90% |
| 4 | 0.652 | 0.769 | 85% |
| mean | 0.729 | 0.853 | 85% |
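The "% of IRR" column is the model's per-fold Pearson divided by the human IRR for the same fold. A short check that reproduces the column from the values in the table:

```python
model_r = [0.844, 0.811, 0.641, 0.696, 0.652]  # per-fold model Pearson
human_r = [0.905, 0.946, 0.874, 0.773, 0.769]  # per-fold human IRR

# Per-fold ratio, as a whole-number percentage.
pct_of_irr = [round(100 * m / h) for m, h in zip(model_r, human_r)]
# [93, 86, 73, 90, 85]

mean_model = sum(model_r) / len(model_r)   # about 0.729
mean_human = sum(human_r) / len(human_r)   # about 0.853
mean_pct = round(100 * mean_model / mean_human)  # 85
```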
The protocol is pre-registered and single-recipe: a 3-seed mean ensemble of low-IRR-filtered SFT adapters, applied uniformly across all 5 folds. An alternative cross-recipe 6-model ensemble (3 regular SFT + 3 low-IRR-filtered SFT) was also tested; it tied on mean Pearson (0.729) at 2× the inference cost, so the single-recipe ensemble is preferred on parsimony.
Confidence tier — use it for triage
Stratifying 5-fold holdout predictions by 3-seed agreement gives a monotonic agreement–performance pattern:
| confidence_tier | coverage | Pearson r | κ-linear | exact-acc | within-1 |
|---|---|---|---|---|---|
| HIGH (range=0, all 3 agree) | 57% | 0.815 | 0.692 | 65.5% | 89.4% |
| MEDIUM (range=1) | 33% | 0.558 | 0.368 | 45.1% | 83.2% |
| LOW (range≥2) | 10% | 0.408 | 0.201 | 25.8% | 65.4% |
The 10% LOW-tier items are especially strong candidates for human review or exclusion from sensitive analyses. On the 57% HIGH-tier majority, the ensemble is +0.086 Pearson over the all-items average. The signal reflects essay-intrinsic ambiguity (where human raters also disagree most), not just model-internal noise.
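One way to act on the tier is to split scored records into an automatically accepted set and a human-review queue. The sketch below assumes each record is a dict carrying the `confidence_tier` and `parse_failed` fields from the output table. The function name `triage` and the policy of also routing parse failures to review are illustrative choices.

```python
def triage(records):
    """Split scored records into (auto_accept, needs_review)."""
    auto_accept, needs_review = [], []
    for rec in records:
        low = rec.get("confidence_tier") == "LOW"
        if low or rec.get("parse_failed", False):
            needs_review.append(rec)
        else:
            auto_accept.append(rec)
    return auto_accept, needs_review


# Hypothetical records, shown only to illustrate the call.
scored = [
    {"id": 1, "confidence_tier": "HIGH", "parse_failed": False},
    {"id": 2, "confidence_tier": "LOW", "parse_failed": False},
    {"id": 3, "confidence_tier": "MEDIUM", "parse_failed": False},
]
accepted, review_queue = triage(scored)
```

A stricter analysis could route MEDIUM items to review as well, at the cost of covering only the HIGH-tier share of texts.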
Generalisation — topic vs style
Paired topic-swap test: 43 Guo essays were rewritten to preserve epistemic stance (hedges, absolutism, charity) while changing only the surface topic, from religion to politics, nutrition, ethics, workplace, parenting, or art.
- Pearson(originals vs Guo letter) = 0.893
- Pearson(swap twins vs Guo letter) = 0.826 (only −0.07)
- 81% of paired predictions match exactly; 93% within 1 letter
In this small paired rewrite test, predictions were relatively stable under surface-topic swaps, suggesting some topic robustness for human-style argumentative writing.
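The exact-match and within-1 figures above are simple paired statistics over letter predictions. A minimal sketch of how such figures can be computed is shown below. The letters in the usage lines are made up for illustration and are not the 43 test pairs.

```python
NUM = {ltr: i for i, ltr in enumerate("ABCDE", start=1)}  # A=1 ... E=5


def paired_agreement(original_letters, swapped_letters):
    """Share of pairs with identical letters and with letters <= 1 apart."""
    if len(original_letters) != len(swapped_letters):
        raise ValueError("both lists must describe the same essays")
    gaps = [
        abs(NUM[a] - NUM[b])
        for a, b in zip(original_letters, swapped_letters)
    ]
    n = len(gaps)
    return {
        "exact": sum(g == 0 for g in gaps) / n,
        "within_1": sum(g <= 1 for g in gaps) / n,
    }


# Illustrative letters only.
stats = paired_agreement(["A", "C", "E", "D"], ["A", "D", "C", "D"])
# {"exact": 0.5, "within_1": 0.75}
```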
Known systematic biases
The model is conservative — on 5-fold pooled holdouts it pulls extreme letters (A "arrogant" and E "deeply humble") toward the middle. The effect is most visible on the minority classes (A, B, C together = 34% of training data).
Training
| setting | value |
|---|---|
| Base | unsloth/Qwen3.6-27B (4-bit NF4 via bitsandbytes, i.e. QLoRA) |
| Adapter | LoRA r=32, α=64, no dropout, target: q/k/v/o + gate/up/down_proj |
| Recipe | Supervised fine-tuning (SFT) on Guo letter labels + auto-generated theory-grounded rationales |
| Data | 359 Guo 2024 essays = 410 total − 51 "low-IRR" essays (Neil/Melody letter difference ≥ 1) |
| Optimizer | AdamW, lr 5e-5 cosine, weight decay 0.01 |
| Effective batch | 16 |
| Steps | 200 |
| Seeds | 42, 43, 44 (three independent LoRA initialisations) |
Design choices
Low-IRR-filtered training. We exclude the 51 essays where human raters disagreed by at least one letter and train on the remaining 359 higher-agreement essays. In 5-fold comparison, this recipe outperformed regular SFT on every fold (+0.035 mean Pearson).
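The filter described above can be sketched as follows. The field names `rater_1` and `rater_2` are hypothetical, and the treatment of essays coded by only one rater (kept, since no disagreement can be observed) is an assumption rather than something the training recipe states.

```python
NUM = {ltr: i for i, ltr in enumerate("ABCDE", start=1)}  # A=1 ... E=5


def split_low_irr(essays, min_gap=1):
    """Return (kept, low_irr): essays whose two rater letters differ by at
    least min_gap go to low_irr; all other essays are kept for training."""
    kept, low_irr = [], []
    for essay in essays:
        first, second = essay["rater_1"], essay.get("rater_2")
        disagree = (
            second is not None and abs(NUM[first] - NUM[second]) >= min_gap
        )
        (low_irr if disagree else kept).append(essay)
    return kept, low_irr


# Illustrative records only.
sample = [
    {"id": 1, "rater_1": "D", "rater_2": "D"},
    {"id": 2, "rater_1": "B", "rater_2": "C"},
    {"id": 3, "rater_1": "E"},
]
train_set, excluded = split_low_irr(sample)
```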
Three-seed ensemble. Single LoRA seeds vary noticeably across folds. Averaging three independently initialised adapters improves robustness and provides the agreement-based HIGH / MEDIUM / LOW confidence tier. Use the single-adapter path when latency matters.
Migrating from v1
This release supersedes tmadl/IH-Qwen3.5-ORPO-Guo, which is now deprecated.
Main differences from v1:
- v2 improves Pearson from 0.689 (v1, pooled) to 0.729 (v2, mean across folds) and Krippendorff α-ordinal from 0.451 to 0.619.
- v2 uses a 3-seed SFT ensemble rather than a single-seed ORPO adapter.
- v2 uses all five A..E levels more reliably; v1 often behaved closer to a binary arrogant-vs-humble classifier.
- v2 reports an agreement-based HIGH / MEDIUM / LOW confidence tier.
- v2 is slower in full-ensemble mode because it generates a short rationale before reading the final letter; use single-adapter mode when latency matters.
- v2 reports native A..E / 1..5 scores. Do not mix v1 and v2 scores in the same analysis without recalibration.
For all new work, use v2.
Limitations
- Language: trained on English text only. No claims about other languages.
- Domain: Reddit religion-discussion posts (Guo 2024). Performance on technical, narrative, or non-argumentative text is not validated.
- Length: inputs are truncated at 1024 tokens, and up to 240 tokens are generated for the rationale. Very long passages are scored on the truncated prefix.
- Inference cost: the three adapters are loaded sequentially, so the full ensemble takes ≈3× single-model time (≈15 min for 1000 essays on RTX 6000 Pro). For latency-sensitive use, deploy a single seed.
- Single-rater: the ensemble outputs a single automated estimate per text. It is not a substitute for multiple trained human raters when consensus IH scores are required.
- Calibration: anchored to Guo 2024's text-marker coding scheme; absolute scores should be interpreted relative to the training distribution, not as universal "humility units".
- Aggregate where possible. Aggregate analyses over many texts are more reliable than interpreting any single text score.
- No individual decision use. The scorer has not been validated for decisions about identifiable people, with or without consent.
License
The LoRA adapter weights and accompanying files are licensed under CC-BY-NC-4.0 — see LICENSE. CC BY-NC 4.0 permits non-commercial use, including research, teaching, personal experimentation, and other uses not primarily intended for commercial advantage or monetary compensation.
Commercial uses are not granted under CC BY-NC 4.0. Contact the rights holder for a separate commercial license — see COMMERCIAL.md.
The base model (unsloth/Qwen3.6-27B) is Apache 2.0 and is not redistributed here. The Guo 2024 EMNLP training corpus is governed by its own license; see NOTICE for full third-party attribution.
Copyright © 2026 Tamas Madl. All rights not granted under CC BY-NC 4.0 or a separate written commercial license are reserved.
Citation
If you use this model, please cite:
@misc{madl_ih_scorer_v2_2026,
author = {Madl, Tamas},
title = {IH-Scorer v2 — Intellectual Humility 3-seed LoRA Ensemble on Qwen3.6-27B},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/tmadl/IH-scorer-v2}},
note = {Model repository}
}
@misc{madl2026icscorer,
author = {Madl, Tamas},
title = {Text-measured cognitive complexity predicts belief revision in AI persuasion},
year = {2026},
howpublished = {PsyArXiv preprint},
url = {https://osf.io/preprints/psyarxiv/mdxvs_v1}
}
@inproceedings{guo2024humility,
author = {Guo, Xiaobo and Potnis, Neil and Yu, Melody and Gillani, Nabeel and Vosoughi, Soroush},
title = {The Computational Anatomy of Humility: Modeling Intellectual
Humility in Online Public Discourse},
booktitle = {Proceedings of EMNLP 2024},
year = {2024},
url = {https://github.com/xiaobo-guo/The-Computational-Anatomy-of-Humility-Modeling-Intellectual-Humility-in-Online-Public-Discourse}
}
If your use case concerns AI dialogue, reflective agency, belief change, or receiver-side examinability, please also cite:
@unpublished{madl_lazar_receiver_side_examinability,
author = {Madl, Tamas and Lazar, Sara W.},
title = {A Receiver-Side Blind Spot in AI Safety},
note = {Manuscript in review},
year = {2026}
}
Additional instrument citations are in NOTICE.
Contact
Tamas Madl (tamas.madl@ofai.at), Austrian Research Institute for Artificial Intelligence (OFAI)