NHTSA Ingest

NHTSA data harvesting, R2 storage, and EKIS / EKIS-ONE ingestion powering MILA and NICO

Data Catalog

Recalls

...

Complaints

...

Investigations

...

TSBs / MfrComm

...

Pipeline Progress

PDFs Discovered

...

PDFs Extracted

...

Vectorized (EKIS)

...

Vectorized (EKIS-ONE)

...

Sync Filters (optional)

Method 1: Flat file download (primary for bulk data)

R2 cache (already downloaded?)
  → YES: read from R2, skip download
  → NO: download ZIP from static.nhtsa.gov
       via Decodo residential proxy
       → extract TSV/TXT → parse rows
       → upsert to Aurora

Used by: recalls, complaints, investigations, TSBs. These files contain the full NHTSA catalog with vehicle make/model/year associations. This is the authoritative source.

Downloads route through Decodo residential proxy because NHTSA blocks datacenter IPs.
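The cache-then-download flow above can be sketched as two small functions. The column list and cache-key layout here are illustrative assumptions (the real NHTSA flat files have many more documented fields), and the `r2` client is any object with S3-style `get_object`/`put_object` methods:

```python
import csv
import io

# Hypothetical column subset; the real recalls flat file layout
# is documented by NHTSA and has many more fields.
RECALL_COLUMNS = ["campaign_number", "make", "model", "model_year", "summary"]

def parse_flat_file(text: str) -> list[dict]:
    """Parse a tab-separated NHTSA flat file into row dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(RECALL_COLUMNS, row)) for row in reader]

def fetch_or_cache(r2, bucket: str, key: str, download) -> bytes:
    """Serve the ZIP from the R2 cache when present; otherwise download it
    (through the residential proxy) and cache it for next time."""
    try:
        return r2.get_object(Bucket=bucket, Key=key)["Body"].read()
    except Exception:            # cache miss
        data = download()        # e.g. GET static.nhtsa.gov via Decodo
        r2.put_object(Bucket=bucket, Key=key, Body=data)
        return data
```

On a warm cache the `download` callable is never invoked, which is what makes re-runs cheap.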

NHTSA Bulk Ingest

Step 1

Download NHTSA recall flat files, parse and upsert to RecallCampaign + RecallVehicle tables, then dispatch PDF download messages via SQS.
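The SQS fan-out at the end of Step 1 might look like the sketch below. The queue URL and message shape are assumptions; only the `send_message_batch(QueueUrl=..., Entries=[...])` interface (10 entries max per call, as in boto3) is taken as given:

```python
import json

def dispatch_pdf_jobs(sqs, queue_url, campaign_numbers, batch_size=10):
    """Fan out one PDF-download message per campaign in SQS batches.
    send_message_batch accepts at most 10 entries per call; the real
    worker uses boto3, this sketch only assumes that interface."""
    for i in range(0, len(campaign_numbers), batch_size):
        batch = campaign_numbers[i:i + batch_size]
        sqs.send_message_batch(
            QueueUrl=queue_url,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps({"campaign_number": c})}
                for n, c in enumerate(batch)
            ],
        )
```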

How Step 2 uses Step 1's results

Step 1 populated recall_campaigns + recall_vehicles with pdf_url = NULL on every campaign. Step 2 reads those rows, filters by your make + year selection from the Sync Filters, and discovers the actual PDF download URL for each campaign.

SELECT campaign_number FROM recall_campaigns
  WHERE pdf_url IS NULL
  AND EXISTS (recall_vehicles.make = your filter)
  AND EXISTS (recall_vehicles.model_year >= your year)

For each campaign found:

campaign_number "20V766000"
  → strip trailing 000 → "20V766"
  → S3 prefix: odi/rcl/2020/RCLRPT-20V766
  → query NHTSA S3 via Decodo Scraping API
  → find: RCLRPT-20V766-4838.PDF
  → UPDATE recall_campaigns
     SET pdf_url = 'https://static.nhtsa.gov/...',
         pdf_status = 'DISCOVERED'

After Step 2, campaigns transition from pdf_url = NULL to a real pdf_url pointing to static.nhtsa.gov. Step 3 uses this URL to download and Docling-parse the actual PDF document.

Routes through Decodo Scraping API (Bearer token auth). NHTSA blocks datacenter IPs — Decodo provides residential IP rotation. Re-running is safe: campaigns with pdf_url already set are automatically skipped.
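The campaign-number-to-prefix mapping above is mechanical and worth pinning down. A minimal sketch, assuming the trailing "000" suffix and a 20xx campaign year (both hold for the campaigns this pipeline targets):

```python
def s3_prefix(campaign_number: str) -> str:
    """Map a campaign number to its NHTSA S3 listing prefix,
    e.g. '20V766000' -> 'odi/rcl/2020/RCLRPT-20V766'."""
    short = campaign_number.removesuffix("000")   # '20V766'
    year = "20" + short[:2]                        # '2020'
    return f"odi/rcl/{year}/RCLRPT-{short}"
```

The discovery worker then lists that prefix through the Decodo Scraping API and writes whichever `RCLRPT-*.PDF` key it finds back to `pdf_url`.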

NHTSA PDF URL Discovery

Step 2

Queries the NHTSA S3 bucket listing to discover PDF download URLs for records where pdf_url is NULL. Populates the pdf_url column so the PDF Extraction Engine can download and process documents.

How Step 3 uses Step 2's results

Step 2 discovered PDF download URLs and wrote them to recall_campaigns.pdf_url with pdf_status = 'DISCOVERED'. Step 3 reads those rows and processes each PDF through the multimodal extraction pipeline.

SELECT id, pdf_url FROM recall_campaigns
  WHERE pdf_status = 'DISCOVERED'
  AND pdf_url IS NOT NULL
  (+ make/model/year from Sync Filters)

For each campaign with a discovered URL:

pdf_url = "https://static.nhtsa.gov/..."
  → download PDF via Decodo
  → Docling structural parse (sections, tables, images)
  → Gemini 1.5 Pro vision alt-text for diagrams
  → upload images to R2
  → Claude Haiku 4.5 contextualize each chunk
  → Gemini Embedding 2 (3072-dim) vectorize
  → upsert to Pinecone ekis-one index
  → UPDATE pdf_status = 'EXTRACTED'

This step targets the EKIS-ONE unified index (Gemini Embedding 2, 3072-dim) with full contextual retrieval. The EKIS multi-index (multilingual-e5-large, 1024-dim) is populated separately via Step 5 (NHTSA EKIS Vectorization) below.

The Sync Filters (Make, Model, Year start, Year end) constrain which campaigns are processed. The PDF Status dropdown lets you re-process previously extracted or errored documents.
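The per-campaign flow above can be sketched as one worker function. Every stage is injected, so the Docling / Gemini / Claude / Pinecone call shapes below are placeholders rather than the real client APIs:

```python
def extract_pdf(campaign, download, docling_parse, alt_text,
                contextualize, embed, upsert):
    """Run one discovered campaign through the multimodal pipeline.
    Stage callables are injected; the real worker wires in Decodo,
    Docling, Gemini, Claude, and Pinecone clients."""
    pdf = download(campaign["pdf_url"])
    doc = docling_parse(pdf)                    # -> {"chunks": [...], "images": [...]}
    for img in doc["images"]:
        img["alt"] = alt_text(img)              # Gemini vision alt-text (images go to R2)
    for chunk in doc["chunks"]:
        preamble = contextualize(chunk, doc)    # Claude Haiku situates the chunk
        upsert(embed(preamble + "\n" + chunk))  # 3072-dim vector -> ekis-one
    campaign["pdf_status"] = "EXTRACTED"
    return campaign
```

Keeping the stages injectable also makes the status transition (`DISCOVERED` → `EXTRACTED`) testable without touching any external service.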

NHTSA PDF Extraction (Multimodal)

Step 3

Docling parse → Gemini 1.5 Pro vision alt-text → R2 image hosting → Claude Haiku 4.5 contextualize → Gemini Embedding 2 (3072-dim) → ekis-one Pinecone. Processes unstructured NHTSA PDFs (recall notices, TSBs) with technical diagram extraction.

How Step 4 complements Step 3

Step 3 extracted PDF documents (recall notices, TSB bulletins) into EKIS-ONE with full multimodal processing. Step 4 covers the NHTSA records that don't have PDFs — consumer complaints, investigations, and text summaries — using enterprise-class Contextual Retrieval.

Contextual Retrieval pipeline:

Aurora text record (complaint / investigation / summary)
  → segment at semantic boundaries
  → Claude Haiku 4.5 reads chunk + FULL parent document
     generates contextual preamble:
     "This chunk is from a 2020 Subaru Outback
     recall about brake fluid leaks..."
  → preamble PREPENDED to chunk
  → Gemini Embedding 2 (3072-dim)
  → Pinecone ekis-one index
  → dual-level verify (presence + semantic probe)

Together, Steps 3 + 4 give EKIS-ONE complete NHTSA coverage:

Step 3: PDF content — technical diagrams, Part 573 reports, remedy instructions, component photos
Step 4: Text records — 1.3M+ consumer complaints, 15K investigations, TSB summaries, recall summaries

Why contextualization matters: Without it, a chunk like “replace the hydraulic control unit at no cost” embeds in isolation — MILA can't tell which vehicle or recall it's about. With it, Claude prepends the vehicle-specific context so the vector captures the full meaning.

Two-tier prompt caching: the parent document is cached across all chunks from the same record (90% lower cost). Runs on the Python SQS worker, not inline. Idempotent via source hash in audit table.
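The contextualize-with-cache step might be sketched as below. The local dict stands in for Anthropic prompt caching (which is what actually cuts cost in production), and the `llm` callable's signature is an assumption:

```python
import hashlib

def contextualize(chunk: str, parent_doc: str, llm, cache: dict) -> str:
    """Prepend an LLM-generated preamble to a chunk before embedding.
    The parent document is keyed by content hash so every chunk from the
    same record reuses one cached prompt prefix -- a local stand-in for
    the two-tier prompt cache described above."""
    doc_key = hashlib.sha256(parent_doc.encode()).hexdigest()
    prefix = cache.setdefault(doc_key, f"<document>\n{parent_doc}\n</document>")
    preamble = llm(prefix, "Situate this chunk within the document: " + chunk)
    return preamble + "\n" + chunk
```

The returned string (preamble + original chunk) is what gets embedded, so the vector carries the vehicle-specific context.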

EKIS-ONE Contextual Retrieval

Step 4

Segment → Claude Haiku 4.5 contextualize (two-tier prompt cache) → Gemini Embedding 2 (3072-dim) → ekis-one Pinecone with dual-level verify-after-write. Runs on the Python SQS worker. Sync Filters (make, model, year) are applied.

How Step 5 complements Steps 3 & 4

Steps 3 and 4 populated the EKIS-ONE unified index (Gemini Embedding 2, 3072-dim) — Step 3 with multimodal PDF content, Step 4 with text-only records. Step 5 embeds text summaries from Aurora into the EKIS multi-index (multilingual-e5-large, 1024-dim) — a separate, complementary RAG system.

Aurora recall_campaigns.summary
Aurora nhtsa_complaints.description
Aurora nhtsa_investigations.summary
Aurora mfr_communications.summary
  → multilingual-e5-large embedding
  → Pinecone ekis-nhtsa index
     namespaces: recalls, complaints,
     investigations, tsbs
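The Step 5 flow above reduces to grouping rows by namespace and batch-upserting. A minimal sketch; the `(id, namespace, text)` row shape is an assumption, and `index.upsert` mirrors the Pinecone client's `vectors=` / `namespace=` signature:

```python
def vectorize_summaries(rows, embed, index):
    """Embed Aurora text summaries (multilingual-e5-large, 1024-dim in
    production) and upsert into per-record-type namespaces of the
    ekis-nhtsa index. Returns per-namespace counts for logging."""
    by_ns = {}
    for rid, ns, text in rows:
        by_ns.setdefault(ns, []).append({"id": rid, "values": embed(text)})
    for ns, vectors in by_ns.items():
        index.upsert(vectors=vectors, namespace=ns)
    return {ns: len(v) for ns, v in by_ns.items()}
```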

MILA queries BOTH indexes in Hybrid mode:

EKIS-ONE (Steps 3+4): contextualized PDF + text records
EKIS (Step 5): text summaries from Aurora catalog
  → merged results → MILA response

Step 4 reads directly from Aurora — it does NOT depend on Step 2 or 3 completing first. You can run it immediately after Step 1 for quick text-based retrieval while PDF extraction runs in parallel.
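Hybrid mode's dual-index fan-out might look like the sketch below. The `query(vector, top_k) -> [(id, score, text), ...]` shape is a hypothetical thin wrapper, not the raw Pinecone response:

```python
def hybrid_query(question, embed_3072, embed_1024, ekis_one, ekis, top_k=5):
    """Embed the question once per index dimension, fan out to both
    indexes, then merge hits by score for the MILA response."""
    hits = (ekis_one.query(embed_3072(question), top_k=top_k)
            + ekis.query(embed_1024(question), top_k=top_k))
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_k]
```

Note the question must be embedded separately for each index, since EKIS-ONE (3072-dim) and EKIS (1024-dim) use different embedding models.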

NHTSA EKIS Vectorization

Embed Aurora text summaries into ekis-nhtsa Pinecone index (multilingual-e5-large, 1024-dim)

Step 5

Recalls

NHTSA safety recall campaigns by make/model/year

Complaints

Consumer complaints with narrative descriptions (1.37M+ available)

Investigations

NHTSA defect investigations (PE, EA, RQ phases)

Safety Ratings

NCAP crash test ratings (overall, frontal, side, rollover)

TSBs / Mfr Communications

Technical Service Bulletins and manufacturer communications (flat file ingest)

Car Seat Stations

Child safety seat inspection stations by ZIP code

Manufacturer Communications (TSB) Ingest

Download 7 chunked NHTSA flat files (1995-2025), stream-parse manufacturer communications with vehicle associations, then dispatch PDF download messages via SQS.
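Stream-parsing the chunked files might be sketched as a generator, so the full 1995-2025 dataset never sits in memory at once. The column names are placeholders for the real TSB flat-file layout:

```python
import csv

def stream_rows(open_chunk_files, columns):
    """Lazily yield parsed rows across the chunked flat files.
    open_chunk_files are callables returning file-like objects, one per
    downloaded chunk, so each file is opened only when reached."""
    for open_file in open_chunk_files:
        with open_file() as fh:
            for row in csv.reader(fh, delimiter="\t"):
                yield dict(zip(columns, row))
```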

Recent Sync Runs

No sync runs yet. Click a channel above to start.

NHTSA Enterprise Pipeline Actions

Trigger ingestion, distribution, reconciliation, and re-embedding pipelines.

NHTSA R2 Compliance Validation

Validates Phase 1 R2 bucket structure — identifies files stored outside approved prefixes