NHTSA Ingest
NHTSA data harvesting, R2 storage, EKIS and EKIS-ONE ingestion for empowering MILA and NICO
Data Catalog
...
...
...
...
Pipeline Progress
...
...
...
...
Sync Filters (optional)
Method 1: Flat file download (primary for bulk data)
Flat file already in R2?
→ YES: read from R2, skip download
→ NO: download ZIP from static.nhtsa.gov
via Decodo residential proxy
→ extract TSV/TXT → parse rows
→ upsert to Aurora
Used by: recalls, complaints, investigations, TSBs. These files contain the full NHTSA catalog with vehicle make/model/year associations. This is the authoritative source.
Downloads route through Decodo residential proxy because NHTSA blocks datacenter IPs.
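The cache-then-download flow above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: read_cache, write_cache and download are hypothetical callables standing in for the R2 client and the proxied HTTP fetch, and writing the ZIP back to the cache after a download is an assumption implied by the YES/NO branch.

```python
import csv
import io
import zipfile
from typing import Callable, Iterator, Optional


def load_flat_file(
    key: str,
    read_cache: Callable[[str], Optional[bytes]],   # hypothetical R2 read
    write_cache: Callable[[str, bytes], None],      # hypothetical R2 write
    download: Callable[[str], bytes],               # hypothetical proxied fetch
) -> bytes:
    """Return the ZIP bytes for `key`, downloading only on a cache miss."""
    cached = read_cache(key)
    if cached is not None:
        return cached            # YES branch: read from cache, skip download
    data = download(key)         # NO branch: fetch through the proxy
    write_cache(key, data)       # assumption: store it for the next run
    return data


def iter_rows(zip_bytes: bytes, delimiter: str = "\t") -> Iterator[list[str]]:
    """Yield parsed rows from every .txt/.tsv member of the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith((".txt", ".tsv")):
                continue
            with archive.open(name) as raw:
                text = io.TextIOWrapper(
                    raw, encoding="utf-8", errors="replace", newline=""
                )
                # QUOTE_NONE is an assumption: free-text fields may hold stray quotes.
                yield from csv.reader(
                    text, delimiter=delimiter, quoting=csv.QUOTE_NONE
                )
```

Passing the cache and fetch functions in as arguments keeps the parsing logic testable without network access or R2 credentials.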
NHTSA Bulk Ingest
Download NHTSA recall flat files, parse and upsert to RecallCampaign + RecallVehicle tables, then dispatch PDF download messages via SQS.
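One way to make the upsert safe to re-run is to update only catalog fields on conflict, so that a pdf_url written by a later step is left alone. The sketch below uses an in-memory SQLite database purely for illustration; the real store is Aurora, and the campaign_number and summary column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE recall_campaigns (
        campaign_number TEXT PRIMARY KEY,  -- hypothetical column name
        summary         TEXT,              -- hypothetical column name
        pdf_url         TEXT,              -- NULL until URL discovery runs
        pdf_status      TEXT
    )
    """
)

# On conflict only the catalog field is refreshed; pdf_url and pdf_status
# are deliberately left out of the UPDATE so earlier discovery work survives.
UPSERT = """
INSERT INTO recall_campaigns (campaign_number, summary)
VALUES (?, ?)
ON CONFLICT (campaign_number) DO UPDATE SET summary = excluded.summary
"""


def upsert_campaigns(rows):
    """Insert or refresh (campaign_number, summary) pairs in one transaction."""
    with conn:
        conn.executemany(UPSERT, rows)
```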
How Step 2 uses Step 1's results
Step 1 populated recall_campaigns + recall_vehicles with pdf_url = NULL on every campaign. Step 2 reads those rows, filters by your make + year selection from the Sync Filters, and discovers the actual PDF download URL for each campaign.
WHERE pdf_url IS NULL
AND EXISTS (recall_vehicles.make = your filter)
AND EXISTS (recall_vehicles.model_year >= your year)
For each campaign found:
→ strip trailing 000 → “20V766”
→ S3 prefix: odi/rcl/2020/RCLRPT-20V766
→ query NHTSA S3 via Decodo Scraping API
→ find: RCLRPT-20V766-4838.PDF
→ UPDATE recall_campaigns
SET pdf_url = 'https://static.nhtsa.gov/...',
pdf_status = 'DISCOVERED'
After Step 2, campaigns transition from pdf_url = NULL to a real pdf_url pointing to static.nhtsa.gov. Step 3 uses this URL to download and Docling-parse the actual PDF document.
Step 2 routes its requests through the Decodo Scraping API (Bearer token auth) because NHTSA blocks datacenter IPs; Decodo provides residential IP rotation. Re-running is safe: campaigns with pdf_url already set are skipped automatically.
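The prefix derivation in the flow above (strip the trailing 000, then build odi/rcl/<year>/RCLRPT-<id>) might look like the sketch below. The regular expression, the century pivot for two-digit years, and the pick_pdf_key helper are all assumptions made for illustration; only the example campaign and prefix come from the text above.

```python
import re
from typing import Iterable, Optional

# Assumed shape: two-digit year, one letter, three digits, optional "000" suffix.
CAMPAIGN_RE = re.compile(r"^(\d{2})([A-Z])(\d{3})(?:000)?$")


def recall_prefix(campaign_number: str) -> str:
    """Build the listing prefix for a campaign such as '20V766000'."""
    match = CAMPAIGN_RE.match(campaign_number.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized campaign number: {campaign_number!r}")
    yy, letter, seq = match.groups()
    # ASSUMPTION: years 66-99 belong to the 1900s, everything else to the 2000s.
    year = (1900 if int(yy) >= 66 else 2000) + int(yy)
    return f"odi/rcl/{year}/RCLRPT-{yy}{letter}{seq}"


def pick_pdf_key(prefix: str, keys: Iterable[str]) -> Optional[str]:
    """Return the first listed key under `prefix` that looks like a PDF."""
    matches = sorted(
        k for k in keys if k.startswith(prefix) and k.upper().endswith(".PDF")
    )
    return matches[0] if matches else None
```

For example, recall_prefix("20V766000") yields the prefix shown in the flow, and pick_pdf_key then selects the matching key from a bucket listing.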
NHTSA PDF URL Discovery
Queries the NHTSA S3 bucket listing to discover PDF download URLs for records where pdf_url is NULL. Populates the pdf_url column so the PDF Extraction Engine can download and process documents.
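The selection of campaigns that still need a URL can be expressed as a single query. The sketch below runs against SQLite for illustration only; the campaign_number join column is an assumption, and it folds the make and year conditions into one EXISTS so that the same vehicle row must satisfy both, which is slightly stricter than two independent EXISTS clauses.

```python
import sqlite3

# Minimal stand-in schema; only pdf_url, pdf_status, make and model_year
# are named in the source, the join column is hypothetical.
SCHEMA = """
CREATE TABLE recall_campaigns (
    campaign_number TEXT PRIMARY KEY, pdf_url TEXT, pdf_status TEXT
);
CREATE TABLE recall_vehicles (
    campaign_number TEXT, make TEXT, model_year INTEGER
);
"""

PENDING_SQL = """
SELECT c.campaign_number
FROM recall_campaigns AS c
WHERE c.pdf_url IS NULL
  AND EXISTS (
      SELECT 1
      FROM recall_vehicles AS v
      WHERE v.campaign_number = c.campaign_number
        AND v.make = :make
        AND v.model_year >= :year_start
  )
ORDER BY c.campaign_number
"""


def pending_campaigns(conn: sqlite3.Connection, make: str, year_start: int) -> list[str]:
    """Campaigns without a pdf_url that match the make and year filters."""
    rows = conn.execute(PENDING_SQL, {"make": make, "year_start": year_start})
    return [row[0] for row in rows]
```

Because the filter is pdf_url IS NULL, a second run naturally skips every campaign the first run already resolved.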
How Step 3 uses Step 2's results
Step 2 discovered PDF download URLs and wrote them to recall_campaigns.pdf_url with pdf_status = 'DISCOVERED'. Step 3 reads those rows and processes each PDF through the multimodal extraction pipeline.
WHERE pdf_status = 'DISCOVERED'
AND pdf_url IS NOT NULL
(+ make/model/year from Sync Filters)
For each campaign with a discovered URL:
→ download PDF via Decodo
→ Docling structural parse (sections, tables, images)
→ Gemini 1.5 Pro vision alt-text for diagrams
→ upload images to R2
→ Claude Haiku 4.5 contextualize each chunk
→ Gemini Embedding 2 (3072-dim) vectorize
→ upsert to Pinecone ekis-one index
→ UPDATE recall_campaigns SET pdf_status = 'EXTRACTED'
This step targets the EKIS-ONE unified index (Gemini Embedding 2, 3072-dim) with full contextual retrieval. The EKIS multi-index (multilingual-e5-large, 1024-dim) is populated separately via Step 5 (NHTSA EKIS Vectorization) below.
The Sync Filters (Make, Model, Year start, Year end) constrain which campaigns are processed. The PDF Status dropdown lets you re-process previously extracted or errored documents.
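The Step 3 status handling can be pictured as a chain of stages where the status flips to EXTRACTED only after every stage succeeds. This is a structural sketch under assumptions: the Campaign class and run_extraction function are hypothetical, each stage is a stand-in callable for the download, parse, contextualize, embed and upsert steps, and 'ERROR' is an invented name for the errored state.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Stage = Callable[[Any], Any]  # each stage consumes the previous stage's output


@dataclass
class Campaign:
    campaign_number: str
    pdf_url: str
    pdf_status: str = "DISCOVERED"


def run_extraction(campaign: Campaign, stages: Sequence[Stage]) -> Campaign:
    """Run all stages in order; mark EXTRACTED only if none of them fails."""
    if campaign.pdf_status != "DISCOVERED" or not campaign.pdf_url:
        return campaign  # not eligible, leave untouched
    payload: Any = campaign.pdf_url
    try:
        for stage in stages:
            payload = stage(payload)
    except Exception:
        campaign.pdf_status = "ERROR"  # hypothetical status name
        return campaign
    campaign.pdf_status = "EXTRACTED"
    return campaign
```

Updating the status last means a campaign that fails midway never appears as extracted, so it can be selected again for re-processing.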
NHTSA PDF Extraction (Multimodal)
Docling parse → Gemini 1.5 Pro vision alt-text → R2 image hosting → Claude Haiku 4.5 contextualize → Gemini Embedding 2 (3072-dim) → ekis-one Pinecone. Processes unstructured NHTSA PDFs (recall notices, TSBs) with technical diagram extraction.
How Step 4 complements Step 3
Step 3 extracted PDF documents (recall notices, TSB bulletins) into EKIS-ONE with full multimodal processing. Step 4 covers the NHTSA records that have no PDFs (consumer complaints, investigations, and text summaries) using Contextual Retrieval, in which each chunk is embedded together with a short generated description of its parent document.
Contextual Retrieval pipeline:
→ segment at semantic boundaries
→ Claude Haiku 4.5 reads chunk + FULL parent document
generates contextual preamble:
“This chunk is from a 2020 Subaru Outback
recall about brake fluid leaks...”
→ preamble PREPENDED to chunk
→ Gemini Embedding 2 (3072-dim)
→ Pinecone ekis-one index
→ dual-level verify (presence + semantic probe)
Together, Steps 3 + 4 give EKIS-ONE complete NHTSA coverage:
Step 3: PDF documents (recall notices, TSB bulletins) with multimodal processing
Step 4: Text records — 1.3M+ consumer complaints, 15K investigations, TSB summaries, recall summaries
Why contextualization matters: Without it, a chunk like “replace the hydraulic control unit at no cost” embeds in isolation — MILA can't tell which vehicle or recall it's about. With it, Claude prepends the vehicle-specific context so the vector captures the full meaning.
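Two-tier prompt caching: the parent document is cached across all chunks from the same record, so it is billed at the cached rate after the first chunk (90% lower cost). The step runs on the Python SQS worker rather than inline, and it is idempotent: each record's source hash is stored in the audit table, so a record that was already processed is not processed again.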
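The core of the contextualization step, prepending a generated preamble to each chunk and skipping records whose source hash was already seen, can be sketched as below. The make_preamble callable is a hypothetical stand-in for the Claude Haiku call, and an in-memory set stands in for the audit table; neither reflects the worker's real interfaces.

```python
import hashlib
from typing import Callable, Sequence


def source_hash(document: str) -> str:
    """Stable fingerprint of the parent document, used for idempotency."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


def contextualize(
    document: str,
    chunks: Sequence[str],
    make_preamble: Callable[[str, str], str],  # hypothetical LLM call
    processed_hashes: set[str],                # stand-in for the audit table
) -> list[str]:
    """Return chunks with their preamble prepended, or [] if already done."""
    digest = source_hash(document)
    if digest in processed_hashes:
        return []  # same source content was processed before
    contextualized = [
        f"{make_preamble(document, chunk)}\n\n{chunk}" for chunk in chunks
    ]
    processed_hashes.add(digest)  # record only after all chunks succeed
    return contextualized
```

The contextualized strings, not the raw chunks, are what would then be passed to the embedding model.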
EKIS-ONE Contextual Retrieval
Segment → Claude Haiku 4.5 contextualize (two-tier prompt cache) → Gemini Embedding 2 (3072-dim) → ekis-one Pinecone with dual-level verify-after-write. Runs on the Python SQS worker. Sync Filters (make, model, year) are applied.
How Step 5 complements Steps 3 & 4
Steps 3 and 4 populated the EKIS-ONE unified index (Gemini Embedding 2, 3072-dim) — Step 3 with multimodal PDF content, Step 4 with text-only records. Step 5 embeds text summaries from Aurora into the EKIS multi-index (multilingual-e5-large, 1024-dim) — a separate, complementary RAG system.
Aurora nhtsa_complaints.description
Aurora nhtsa_investigations.summary
Aurora mfr_communications.summary
→ multilingual-e5-large embedding
→ Pinecone ekis-nhtsa index
namespaces: recalls, complaints,
investigations, tsbs
MILA queries BOTH indexes in Hybrid mode:
EKIS-ONE (Steps 3 + 4): multimodal PDF content and contextualized text records
EKIS (Step 5): text summaries from Aurora catalog
→ merged results → MILA response
Step 5 reads directly from Aurora and does NOT depend on Step 2 or 3 completing first. You can run it immediately after Step 1 for quick text-based retrieval while PDF extraction runs in parallel.
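The source does not say how MILA merges results from the two indexes. Since the indexes use different embedding models, their similarity scores are not directly comparable, so one common option is reciprocal rank fusion (RRF), which combines ranked lists by position alone. The sketch below is an illustration of that technique, not a description of MILA's actual merge; the function name and the constant k = 60 are assumptions.

```python
from collections import defaultdict
from typing import Sequence


def reciprocal_rank_fusion(
    result_lists: Sequence[Sequence[str]], k: int = 60
) -> list[str]:
    """Merge ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document returned by both indexes accumulates two contributions and therefore tends to rank above documents found by only one.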
NHTSA EKIS Vectorization
Embed Aurora text summaries into ekis-nhtsa Pinecone index (multilingual-e5-large, 1024-dim)
Recalls
NHTSA safety recall campaigns by make/model/year
Complaints
Consumer complaints with narrative descriptions (1.37M+ available)
Investigations
NHTSA defect investigations (PE, EA, RQ phases)
Safety Ratings
NCAP crash test ratings (overall, frontal, side, rollover)
TSBs / Mfr Communications
Technical Service Bulletins and manufacturer communications (flat file ingest)
Car Seat Stations
Child safety seat inspection stations by ZIP code
Manufacturer Communications (TSB) Ingest
Download 7 chunked NHTSA flat files (1995-2025), stream-parse manufacturer communications with vehicle associations, then dispatch PDF download messages via SQS.
Recent Sync Runs
No sync runs yet. Click a channel above to start.
NHTSA Enterprise Pipeline Actions
Trigger ingestion, distribution, reconciliation, and re-embedding pipelines.
NHTSA R2 Compliance Validation
Validates Phase 1 R2 bucket structure — identifies files stored outside approved prefixes
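A prefix check of this kind reduces to listing object keys and flagging any that fall outside an allow-list. The sketch below assumes the keys have already been listed; the find_noncompliant name and the prefixes in the usage note are hypothetical examples, not the approved Phase 1 layout.

```python
from typing import Iterable


def find_noncompliant(keys: Iterable[str], approved_prefixes: Iterable[str]) -> list[str]:
    """Return, sorted, every key that starts with none of the approved prefixes."""
    prefixes = tuple(approved_prefixes)  # str.startswith accepts a tuple
    return sorted(key for key in keys if not key.startswith(prefixes))
```

For instance, with hypothetical approved prefixes "nhtsa/recalls/" and "nhtsa/complaints/", a key such as "tmp/upload.zip" would be reported while "nhtsa/recalls/2020/file.zip" would pass.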