# HTML cleanup

ResearchMind uses **trafilatura** to strip boilerplate and keep the main article text, with these settings:

- `include_tables=true` for data-heavy pages
- `include_comments=false`
- Fallback: if extraction returns nothing, use the first 50,000 characters of the raw HTML
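
The settings above could be wired together roughly as follows. This is a minimal sketch, not ResearchMind's actual code: the function name `clean_html`, the constant `FALLBACK_CHARS`, and the injectable `extract` parameter are assumptions made for illustration. Only `trafilatura.extract` and its `include_tables` / `include_comments` keyword arguments come from the library itself.

```python
from typing import Callable, Optional

# Hypothetical constant: the "first 50k chars" fallback limit from the list above.
FALLBACK_CHARS = 50_000


def clean_html(
    html: str,
    extract: Optional[Callable[..., Optional[str]]] = None,
) -> str:
    """Return the main article text, or a raw-HTML prefix if extraction yields nothing."""
    if extract is None:
        # Imported lazily so the fallback logic can be exercised without the library.
        import trafilatura

        extract = trafilatura.extract

    text = extract(html, include_tables=True, include_comments=False)

    # trafilatura.extract returns None when it finds no usable content;
    # an empty or whitespace-only string is treated the same way here.
    if not text or not text.strip():
        return html[:FALLBACK_CHARS]
    return text
```

Passing a stand-in for `extract` makes the fallback path easy to check without real pages, for example `clean_html(page, extract=lambda html, **kw: None)` returns the truncated raw HTML.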

The raw snapshot is saved under `RESEARCHMIND_DATA_DIR/raw/{doc_id}/snapshot.txt`.
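
As an illustration of that layout, the sketch below builds the snapshot path and writes the file. It assumes `RESEARCHMIND_DATA_DIR` is an environment variable and that the snapshot is written as UTF-8 text; the helper names `snapshot_path` and `save_snapshot` are hypothetical, and the fallback to the current directory is only there to keep the example runnable.

```python
import os
from pathlib import Path
from typing import Optional


def snapshot_path(doc_id: str, data_dir: Optional[str] = None) -> Path:
    """Build <data dir>/raw/<doc_id>/snapshot.txt."""
    # Assumption: RESEARCHMIND_DATA_DIR is read from the environment.
    base = Path(data_dir or os.environ.get("RESEARCHMIND_DATA_DIR", "."))
    return base / "raw" / doc_id / "snapshot.txt"


def save_snapshot(doc_id: str, content: str, data_dir: Optional[str] = None) -> Path:
    """Write the snapshot, creating the per-document directory if needed."""
    path = snapshot_path(doc_id, data_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path
```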