From Common Crawl to Automotive Market Intelligence
Common Crawl archives billions of web pages. We query the CC index for automotive domains, download the matching WARC records, and extract clean article text.
// Real URLs from the dataset
"autocar.co.uk/car-news/business-electric-vehicles/hydrogen-car-dream-good-dead"
"autocar.co.uk/car-news/business-electric-vehicles/kia-uk-boss-calls-clarity-hybrid-sales-after-2030"
"autocar.co.uk/car-news/business-environment-and-energy/kia-europe-reuse-electric-car-batteries-energy-storage"
// Crawl: CC-MAIN-2026-04
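The index lookup step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the public Common Crawl CDX index endpoint at index.commoncrawl.org and its JSON-lines output, and the helper names (`build_index_query`, `parse_index_lines`, `warc_byte_range`) are hypothetical.

```python
import json
from urllib.parse import urlencode

CRAWL_ID = "CC-MAIN-2026-04"  # crawl named in the listing above
INDEX_HOST = "https://index.commoncrawl.org"  # public CC index server


def build_index_query(domain: str, crawl_id: str = CRAWL_ID) -> str:
    """Build an index query URL for every captured page under a domain."""
    params = {"url": f"{domain}/*", "output": "json"}
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


def parse_index_lines(body: str) -> list[dict]:
    """Parse the JSON-lines index response, one capture record per line."""
    records = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        records.append({
            "url": rec["url"],
            "filename": rec["filename"],   # WARC file path within the crawl
            "offset": int(rec["offset"]),  # the index returns these as strings
            "length": int(rec["length"]),
        })
    return records


def warc_byte_range(record: dict) -> str:
    """HTTP Range header value that fetches just this one WARC record."""
    start = record["offset"]
    end = start + record["length"] - 1
    return f"bytes={start}-{end}"


query_url = build_index_query("autocar.co.uk")
```

Fetching `query_url` and passing the response body to `parse_index_lines` would give the file, offset, and length for each capture. A ranged GET against the crawl's data host then returns a single gzipped WARC record instead of the whole archive file.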
NuExtract-2.0 with an ontology-enforced JSON schema transforms raw article text into structured market signals — products, events, companies, and feature-level sentiment.
// From autocar.co.uk
H2 Mobility has announced it is shutting 22 of its hydrogen fuel stations in Germany, dealing a significant blow to the hydrogen car movement in Europe.
The company cited low demand and high operational costs. Only about 100 hydrogen fuel cell vehicles are registered in Germany...
{
  "market_relevance": "high",
  "signals": [{
    "event_description": "H2 Mobility announced shutting 22 fuel stations in Germany",
    "l1_domain": "C",
    "sentiment": "bearish",
    "impact": "high",
    "confidence": "confirmed"
  }],
  "companies": [{
    "name": "H2 Mobility",
    "role": "subject"
  }]
}
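One way to picture the ontology enforcement is a validation pass over each extraction. The sketch below is illustrative only: the field names come from the example above, but the allowed value sets are assumptions (only "high", "bearish", and "confirmed" actually appear in the source), and `validate_extraction` is a hypothetical helper.

```python
# Assumed vocabularies; the real ontology may define different values.
MARKET_RELEVANCE = {"high", "medium", "low"}
SIGNAL_ENUMS = {
    "sentiment": {"bullish", "bearish", "neutral"},
    "impact": {"high", "medium", "low"},
    "confidence": {"confirmed", "reported", "speculative"},
}


def validate_extraction(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if doc.get("market_relevance") not in MARKET_RELEVANCE:
        errors.append("market_relevance: missing or not in vocabulary")
    for i, signal in enumerate(doc.get("signals", [])):
        if not signal.get("event_description"):
            errors.append(f"signals[{i}].event_description: missing")
        for field, allowed in SIGNAL_ENUMS.items():
            if signal.get(field) not in allowed:
                errors.append(f"signals[{i}].{field}: not in vocabulary")
    for i, company in enumerate(doc.get("companies", [])):
        if not company.get("name"):
            errors.append(f"companies[{i}].name: missing")
    return errors


extraction = {
    "market_relevance": "high",
    "signals": [{
        "event_description": "H2 Mobility announced shutting 22 fuel stations in Germany",
        "l1_domain": "C",
        "sentiment": "bearish",
        "impact": "high",
        "confidence": "confirmed",
    }],
    "companies": [{"name": "H2 Mobility", "role": "subject"}],
}
problems = validate_extraction(extraction)
```

Returning a list of problems rather than raising on the first one makes it easier to log every schema violation for a document in a single pass.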
The "Ralph Wiggum" quality loop validates extractions against the NHTSA vPIC database. Fuzzy matching resolves messy text to canonical product IDs through up to 6 depth levels.
// Example resolution
"Tesla Model Y Long Range"
→ make: "Tesla"
→ model: "Model Y"
→ year: 2024
→ body: "SUV"
→ powertrain: "BEV"
→ id: prd_tesla_model_y_2024
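The make and model levels of such a resolution could look roughly like this. It is a sketch under several assumptions: it uses Python's standard `difflib` rather than whatever matcher the pipeline actually runs, the tiny `CATALOG` stands in for data that would come from NHTSA vPIC, the year is assumed to be supplied by the caller, and the body and powertrain levels are omitted.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for canonical makes and models loaded from vPIC.
CATALOG = {
    "Tesla": ["Model 3", "Model Y"],
    "Toyota": ["RAV4"],
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(tokens, candidates, cutoff):
    """Score every leading token run against every candidate.

    Returns (candidate, tokens_consumed), or (None, 0) if nothing
    reaches the cutoff. Trailing trim words such as "Long Range" are
    ignored because shorter prefixes are also tried.
    """
    best, best_score, used = None, 0.0, 0
    for n in range(1, len(tokens) + 1):
        prefix = " ".join(tokens[:n])
        for cand in candidates:
            score = similarity(prefix, cand)
            if score > best_score:
                best, best_score, used = cand, score, n
    if best_score < cutoff:
        return None, 0
    return best, used


def resolve_product(mention: str, year: int, cutoff: float = 0.7):
    """Resolve a free-text mention to a canonical product record."""
    tokens = mention.split()
    make, used = best_match(tokens, list(CATALOG), cutoff)
    if make is None:
        return None
    model, _ = best_match(tokens[used:], CATALOG[make], cutoff)
    if model is None:
        return None
    slug = f"{make}_{model}".lower().replace(" ", "_")
    return {"make": make, "model": model, "year": year,
            "id": f"prd_{slug}_{year}"}


resolved = resolve_product("Tesla Model Y Long Range", 2024)
```

Resolving the make first narrows the model search to that make's models, which keeps a mention like "Model Y" from being compared against every model in the catalog.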
Signals sharing a product are connected with NEXT edges ordered by timestamp, creating event chains that reveal how stories evolve over time.
// For each product, order signals by date
// Connect consecutive signals with NEXT edges
// Max gap: 90 days
sig_001 ("Production starts", Jan 15)
  ─NEXT→ sig_002 ("First deliveries", Mar 22)
  ─NEXT→ sig_003 ("Recall issued", May 10)
  ─NEXT→ sig_004 ("Software update", Jun 03)
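The chaining rule described in the comments above can be sketched in a few lines. This is an illustration under stated assumptions: the signal dictionaries and their field names are hypothetical, "max gap" is read as "do not link consecutive signals more than 90 days apart", and the year 2024 is filled in only so the example dates are complete.

```python
from collections import defaultdict
from datetime import date, timedelta

MAX_GAP = timedelta(days=90)


def build_next_edges(signals, max_gap=MAX_GAP):
    """Link consecutive signals of the same product with NEXT edges."""
    by_product = defaultdict(list)
    for sig in signals:
        by_product[sig["product_id"]].append(sig)

    edges = []
    for group in by_product.values():
        group.sort(key=lambda s: s["date"])  # chronological order
        for prev, nxt in zip(group, group[1:]):
            # A gap longer than max_gap breaks the chain at this point.
            if nxt["date"] - prev["date"] <= max_gap:
                edges.append((prev["id"], "NEXT", nxt["id"]))
    return edges


# Dates follow the example above; the year is an assumption.
signals = [
    {"id": "sig_003", "product_id": "prd_x", "date": date(2024, 5, 10)},
    {"id": "sig_001", "product_id": "prd_x", "date": date(2024, 1, 15)},
    {"id": "sig_004", "product_id": "prd_x", "date": date(2024, 6, 3)},
    {"id": "sig_002", "product_id": "prd_x", "date": date(2024, 3, 22)},
]
edges = build_next_edges(signals)
```

Because edges only ever join neighbours in date order, each product yields one or more linear chains, and a long quiet period simply starts a new chain.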
Temporal chains let analysts track how market events cascade — from a product launch through reviews, sales data, competitive responses, and eventual recalls or updates.
The longest chains span 4+ years of signal history for established products like the Tesla Model 3 and Toyota RAV4.
Products, Features, Actors, Signals, Locations, and Documents are woven together through 13 typed edge roles into a unified knowledge graph — the dual-star topology.
Four mathematical lenses reveal patterns invisible to keyword search: competitive clusters, regime changes, signal consistency, and perception gaps.
Move beyond keyword search. The hypergraph reveals: