Structural Intelligence

From Common Crawl to Automotive Market Intelligence

Step 01

Common Crawl Scraping

Common Crawl archives billions of web pages. We query the CC index for automotive domains, download the matching WARC records, and extract clean article text.

Live URL Waterfall
The Funnel
Sample Documents
// Real URLs from the dataset
"autocar.co.uk/car-news/business-electric-vehicles/
  hydrogen-car-dream-good-dead"

"autocar.co.uk/car-news/business-electric-vehicles/
  kia-uk-boss-calls-clarity-hybrid-sales-after-2030"

"autocar.co.uk/car-news/business-environment-and-energy/
  kia-europe-reuse-electric-car-batteries-energy-storage"

// Crawl: CC-MAIN-2026-04
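The index lookup and record download described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the public Common Crawl index server and data host, and the helper names `index_query_url` and `warc_range_request` are hypothetical. It only builds the requests; fetching and text extraction are left out.

```python
import json
from urllib.parse import urlencode

CRAWL_ID = "CC-MAIN-2026-04"  # crawl label quoted in the sample above


def index_query_url(url_pattern: str, crawl_id: str = CRAWL_ID) -> str:
    """Build an index query URL that lists captures matching a URL pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def warc_range_request(index_line: str):
    """Turn one JSON line from the index into (url, headers) for a ranged fetch.

    Each index record names the WARC file plus the byte offset and length
    of the single record inside it, so only that slice is downloaded.
    """
    rec = json.loads(index_line)
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1  # HTTP byte ranges are inclusive
    url = f"https://data.commoncrawl.org/{rec['filename']}"
    return url, {"Range": f"bytes={start}-{end}"}


# Hypothetical index record, for illustration only.
EXAMPLE_LINE = '{"filename": "crawl-data/example.warc.gz", "offset": "100", "length": "50"}'
```

Calling `index_query_url("autocar.co.uk/car-news/*")` yields a query for every capture under that path in the crawl, and each returned line can be passed to `warc_range_request` to fetch just that record.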
Step 02

Structured Extraction

NuExtract-2.0 with an ontology-enforced JSON schema transforms raw article text into structured market signals — products, events, companies, and feature-level sentiment.

Raw Article Text
// From autocar.co.uk

H2 Mobility has announced it is
shutting 22 of its hydrogen fuel
stations in Germany, dealing a
significant blow to the hydrogen
car movement in Europe.

The company cited low demand and
high operational costs. Only about
100 hydrogen fuel cell vehicles
are registered in Germany...
NuExtract JSON Output
{
  "market_relevance": "high",
  "signals": [{
    "event_description":
      "H2 Mobility announced shutting
       22 fuel stations in Germany",
    "l1_domain": "C",
    "sentiment": "bearish",
    "impact": "high",
    "confidence": "confirmed"
  }],
  "companies": [{
    "name": "H2 Mobility",
    "role": "subject"
  }]
}
Signal Taxonomy (8 L1 Domains)
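One way to picture ontology enforcement is a validation pass over each extraction. The sketch below is an assumption-heavy illustration: the real ontology is not shown here, so the allowed value sets (including the A to H labels for the 8 L1 domains) are guesses built around the sample output, and `validate_extraction` is a hypothetical helper, not part of NuExtract.

```python
import json

# Assumed value sets; only "C", "bearish", "high", and "confirmed"
# actually appear in the sample output above.
ALLOWED = {
    "l1_domain": set("ABCDEFGH"),
    "sentiment": {"bullish", "bearish", "neutral"},
    "impact": {"low", "medium", "high"},
    "confidence": {"confirmed", "rumored", "speculative"},
}


def validate_extraction(raw: str) -> list:
    """Return a list of problems in one extraction (empty list means valid)."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    for i, signal in enumerate(doc.get("signals", [])):
        if not signal.get("event_description"):
            problems.append(f"signals[{i}]: missing event_description")
        for name, allowed in ALLOWED.items():
            if signal.get(name) not in allowed:
                problems.append(f"signals[{i}].{name}: {signal.get(name)!r} not allowed")
    for i, company in enumerate(doc.get("companies", [])):
        if not company.get("name"):
            problems.append(f"companies[{i}]: missing name")
    return problems


# The sample output from the panel above, re-serialized.
SAMPLE = json.dumps({
    "market_relevance": "high",
    "signals": [{
        "event_description": "H2 Mobility announced shutting 22 fuel stations in Germany",
        "l1_domain": "C",
        "sentiment": "bearish",
        "impact": "high",
        "confidence": "confirmed",
    }],
    "companies": [{"name": "H2 Mobility", "role": "subject"}],
})
```

Under these assumed value sets, `validate_extraction(SAMPLE)` returns an empty list, while an extraction with an out-of-vocabulary sentiment is reported rather than silently stored.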
Step 03

Entity Resolution

The "Ralph Wiggum" quality loop validates extractions against the NHTSA vPIC database. Fuzzy matching resolves messy text to canonical product IDs at up to six levels of depth.

Fuzzy Match Resolution
Resolution Depth Levels
1. Make only: "Tesla" → confidence 0.2
2. + Model: "Model Y" → confidence 0.4
3. + Year: "2024" → confidence 0.6
4. + Body Class: "SUV" → confidence 0.7
5. + Powertrain: "BEV" → confidence 0.8
6. + Trim: "Long Range" → confidence 0.95
vPIC Feature Groups (160 Elements)
Powertrain, Electrification, Safety (Active), Safety (Passive), Body, Interior, Wheels & Tires, Vehicle ID, Specialty
// Example resolution
"Tesla Model Y Long Range"
  → make: "Tesla"
  → model: "Model Y"
  → year: 2024
  → body: "SUV"
  → powertrain: "BEV"
  → id: prd_tesla_model_y_2024
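The depth levels and the example resolution can be approximated in a few lines. This is a sketch under stated assumptions: the tiny `CATALOG` stands in for vPIC lookups, the standard-library `difflib` stands in for whatever fuzzy matcher the pipeline uses, and `fuzzy_pick`, `resolve`, and `product_id` are hypothetical names. The confidence values are the ones listed in the depth table.

```python
import difflib

# Fields in depth order, with the confidence listed for each level above.
DEPTH_FIELDS = ["make", "model", "year", "body", "powertrain", "trim"]
DEPTH_CONFIDENCE = [0.2, 0.4, 0.6, 0.7, 0.8, 0.95]

# Tiny hypothetical catalog standing in for vPIC make/model lookups.
CATALOG = {"Tesla": ["Model 3", "Model Y"], "Toyota": ["RAV4"]}


def fuzzy_pick(text, choices, cutoff=0.6):
    """Return the closest canonical choice for messy text, or None."""
    lowered = {c.lower(): c for c in choices}
    hits = difflib.get_close_matches(text.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


def resolve(fields):
    """Resolve fields level by level, stopping at the first missing level."""
    resolved = {}
    make = fuzzy_pick(fields.get("make") or "", CATALOG)
    if make:
        resolved["make"] = make
        model = fuzzy_pick(fields.get("model") or "", CATALOG[make])
        if model:
            resolved["model"] = model
            # Deeper levels are taken as given in this sketch (no fuzzy match).
            for name in DEPTH_FIELDS[2:]:
                if fields.get(name) is None:
                    break
                resolved[name] = fields[name]
    depth = len(resolved)
    confidence = DEPTH_CONFIDENCE[depth - 1] if depth else 0.0
    return resolved, depth, confidence


def product_id(resolved):
    """Build an ID in the prd_<make>_<model>_<year> style of the example."""
    parts = [resolved["make"], resolved["model"], str(resolved["year"])]
    return "prd_" + "_".join(p.lower().replace(" ", "_") for p in parts)
```

With a misspelled input such as `{"make": "Telsa", "model": "model-y", "year": 2024, "body": "SUV", "powertrain": "BEV"}`, the sketch resolves to depth 5 and builds the ID `prd_tesla_model_y_2024`; a make-only input stops at depth 1.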
Step 04

Temporal Stitching

Signals sharing a product are connected with NEXT edges ordered by timestamp, creating event chains that reveal how stories evolve over time.

Signal Timeline — Event Chain
How Stitching Works
// For each product, order signals by date
// Connect consecutive signals with NEXT edges
// Max gap: 90 days

sig_001 ("Production starts", Jan 15)
  ─NEXT→
sig_002 ("First deliveries", Mar 22)
  ─NEXT→
sig_003 ("Recall issued", May 10)
  ─NEXT→
sig_004 ("Software update", Jun 03)
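The stitching rule above is short enough to sketch directly. The code below is an illustration, not the production implementation: the signal dictionaries, the product ID `prd_example`, and the year 2024 are assumptions added for the example, and treating the 90-day limit as "skip the edge when the gap is larger" is one reading of "Max gap".

```python
from collections import defaultdict
from datetime import date, timedelta

MAX_GAP = timedelta(days=90)  # maximum gap quoted above


def stitch(signals):
    """Return (earlier_id, "NEXT", later_id) edges within each product."""
    by_product = defaultdict(list)
    for sig in signals:
        by_product[sig["product"]].append(sig)
    edges = []
    for chain in by_product.values():
        chain.sort(key=lambda s: s["date"])  # order each product's signals by date
        for prev, nxt in zip(chain, chain[1:]):
            # Assumption: consecutive signals further apart than MAX_GAP
            # are left unconnected, which starts a new chain.
            if nxt["date"] - prev["date"] <= MAX_GAP:
                edges.append((prev["id"], "NEXT", nxt["id"]))
    return edges


# The four signals from the chain above; year and product ID are assumed.
SIGNALS = [
    {"id": "sig_003", "product": "prd_example", "date": date(2024, 5, 10)},
    {"id": "sig_001", "product": "prd_example", "date": date(2024, 1, 15)},
    {"id": "sig_004", "product": "prd_example", "date": date(2024, 6, 3)},
    {"id": "sig_002", "product": "prd_example", "date": date(2024, 3, 22)},
]
```

Running `stitch(SIGNALS)` reproduces the chain sig_001 → sig_002 → sig_003 → sig_004 regardless of input order, since every consecutive gap is under 90 days.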
Temporal Edge Stats
84,981 NEXT edges

Temporal chains let analysts track how market events cascade — from a product launch through reviews, sales data, competitive responses, and eventual recalls or updates.

The longest chains span 4+ years of signal history for established products like the Tesla Model 3 and Toyota RAV4.

Step 05

Building the Hypergraph

Products, Features, Actors, Signals, Locations, and Documents are woven together through 13 typed edge roles into a unified knowledge graph — the dual-star topology.

Dual-Star Graph Assembly
Node Breakdown
Edge Role Distribution
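A typed graph store of this kind might look like the sketch below. It is only an illustration: the six node types come from the description above, but the 13 edge roles are not listed on this page, so the role name `ABOUT` is hypothetical (`NEXT` is the temporal edge from Step 04), and the `Graph` class is not the project's actual data model.

```python
from collections import Counter
from dataclasses import dataclass, field

# The six node types named in the description above.
NODE_TYPES = {"Product", "Feature", "Actor", "Signal", "Location", "Document"}


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node_id -> node type
    edges: list = field(default_factory=list)  # (source, role, target) triples

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source, role, target):
        # Reject dangling edges so every role links two known nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source, role, target))

    def role_counts(self):
        """Count edges per role, as an edge role distribution would."""
        return dict(Counter(role for _, role, _ in self.edges))


graph = Graph()
graph.add_node("prd_tesla_model_y_2024", "Product")
graph.add_node("sig_001", "Signal")
graph.add_node("sig_002", "Signal")
graph.add_edge("sig_001", "ABOUT", "prd_tesla_model_y_2024")  # hypothetical role
graph.add_edge("sig_002", "ABOUT", "prd_tesla_model_y_2024")  # hypothetical role
graph.add_edge("sig_001", "NEXT", "sig_002")
```

Here the product node becomes a hub for its signals, and `graph.role_counts()` gives the per-role totals that an edge role distribution chart would plot.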
Step 06

Finding Structure

Four mathematical lenses reveal patterns invisible to keyword search: competitive clusters, regime changes, signal consistency, and perception gaps.

Signal Sentiment Distribution
Competition Clusters (COMPETES_WITH)
Spectral Analysis: community detection via the graph Laplacian finds competitive clusters and market segments.
Topology: persistent homology detects regime changes, the points where market structure fundamentally shifts.
Sheaf Cohomology: signal consistency across regions. Where do narratives diverge?
Functor Analysis: the gap between feature perception and reality, what the specs say vs what signals reveal.
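To make the spectral lens concrete, here is a minimal sketch of Laplacian-based clustering: it splits a graph in two by the sign of the Fiedler vector (the eigenvector of the smallest nonzero Laplacian eigenvalue), computed with shifted power iteration in plain Python. The function name and the toy graph are assumptions; the page does not say which algorithm or library the pipeline uses, and a real system would likely rely on a linear algebra package and handle more than two clusters. The sketch assumes a connected graph with at least one edge.

```python
def fiedler_partition(n, edges, iterations=500):
    """Split nodes 0..n-1 into two groups by the sign of the Fiedler vector."""
    neighbors = [set() for _ in range(n)]
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Twice the maximum degree is an upper bound on Laplacian eigenvalues,
    # so (shift*I - L) is positive semidefinite and power iteration applies.
    shift = 2 * max(len(nb) for nb in neighbors)
    x = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        # (L x)_i = degree_i * x_i - sum of x over neighbors of i
        y = [
            shift * x[i] - (len(neighbors[i]) * x[i] - sum(x[j] for j in neighbors[i]))
            for i in range(n)
        ]
        mean = sum(y) / n  # project out the constant eigenvector
        y = [v - mean for v in y]
        norm = sum(v * v for v in y) ** 0.5
        if norm == 0:
            break
        x = [v / norm for v in y]
    left = {i for i in range(n) if x[i] < 0}
    return left, set(range(n)) - left


# Toy graph: two triangles joined by a single edge (2, 3).
TOY_EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
```

On the toy graph, the split separates the two triangles, which is the behavior one would want when products form dense competitive groups joined by only a few cross-links.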
What Structural Intelligence Reveals

Move beyond keyword search. The hypergraph reveals:

Competitive Blind Spots: products competing on features but not recognized as rivals
Signal Cascades: how a supply chain disruption propagates through the market
Regime Detection: when market topology changes through new entrants, exits, or mergers
Narrative Divergence: where media sentiment conflicts with financial reality