# HTML cleanup
ResearchMind uses **trafilatura** to strip boilerplate and keep main article text.
- `include_tables=true` for data-heavy pages
- `include_comments=false`
- Fallback: first 50k chars of raw HTML if extraction returns empty
Raw snapshot saved under `RESEARCHMIND_DATA_DIR/raw/{doc_id}/snapshot.txt`.