
White Paper 4: The "Omnivore" Ingestion Engine

Version 1.3.0 · Date: February 3, 2026

Subject: Atomization, Tunable Fan-Out, and Sovereign Ingestion

Standards Focus: #2 (Determinism), #3 (Idempotency), #4 (High-Stability)

1. Tunable Asynchronous Fan-Out (Standard #2)

To handle the variable document density of enterprises with 50 or more users, see7 uses a Dynamic Fan-Out architecture. Work is spread across independent execution units, so the system scales horizontally instead of being bound by single-threaded execution limits.

The Orchestrator: When an ingestion job is triggered, the CrawlJob serves as the root of a recursive task tree. The orchestrator decomposes the job into a series of idempotent, stateless execution units dispatched via QStash.

Tunable Concurrency: Rather than a hardcoded processing limit, see7 uses environment-driven pacing. Segment batch size, currently tuned for throughput, is a parameter in our SystemConfig. This allows us to throttle or accelerate ingestion based on the regional API limits of our AI Core (e.g., Google Vertex AI rate limits) or the priority level of the customer tenant.

Isolated Failure Domains: By fanning out work into discrete segments, we ensure that a single "poison pill" document (e.g., a multi-gigabyte corrupted PDF) only affects its local execution thread. The rest of the "Omnivore" grid continues to process the queue, maintaining High-Stability (Standard #4) across the platform.
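The fan-out behavior described in this section can be sketched roughly as follows. This is a minimal in-process illustration, not the see7 orchestrator: the real system dispatches units via QStash, and the names FanOutConfig, segmentBatchSize, SEGMENT_BATCH_SIZE, batchSizeFromEnv, and fanOut are assumptions introduced here. It shows a batch size read from configuration and a failing segment that is recorded without stopping its siblings.

```typescript
// Hypothetical config shape; the real SystemConfig fields are not given in this paper.
interface FanOutConfig {
  segmentBatchSize: number; // number of segments processed concurrently
}

type SegmentOutcome<T> =
  | { index: number; ok: true; value: T }
  | { index: number; ok: false; error: string };

// Reads the batch size from an environment-like record (variable name is assumed).
function batchSizeFromEnv(
  env: Record<string, string | undefined>,
  fallback = 4,
): number {
  const parsed = Number.parseInt(env["SEGMENT_BATCH_SIZE"] ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Processes segments batch by batch. Each segment's error is caught locally,
// so one "poison pill" yields a failed outcome instead of aborting the job.
async function fanOut<S, T>(
  segments: S[],
  worker: (segment: S) => Promise<T>,
  config: FanOutConfig,
): Promise<SegmentOutcome<T>[]> {
  const batchSize = Math.max(1, Math.floor(config.segmentBatchSize));
  const outcomes: SegmentOutcome<T>[] = [];
  for (let start = 0; start < segments.length; start += batchSize) {
    const batch = segments.slice(start, start + batchSize);
    const settled = await Promise.all(
      batch.map(async (segment, offset): Promise<SegmentOutcome<T>> => {
        const index = start + offset;
        try {
          return { index, ok: true, value: await worker(segment) };
        } catch (err) {
          const error = err instanceof Error ? err.message : String(err);
          return { index, ok: false, error };
        }
      }),
    );
    outcomes.push(...settled);
  }
  return outcomes;
}
```

Lowering segmentBatchSize slows the pace against a rate-limited upstream, and raising it speeds ingestion up. The worker itself stays unchanged in both cases.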

2. The Omnivore Protocol: Universal Ingestion

see7 consumes data in its native context, ensuring that visual and structural intent is preserved.

The Omnivore Bridge (Google Workspace): We bypass the limitations of simple text scrapers by utilizing native export bridges. Google Docs, Sheets, and Slides are exported via the Google Drive API into DOCX/XLSX/PPTX buffers. This process ensures that spreadsheet cell logic, hidden columns, and document hierarchies are physically present for the Atomizer to inspect.
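As an illustration of the export bridge, the sketch below maps each Google Workspace MIME type to the Office format requested on export. The Drive API's files.export method takes a file ID and a target MIME type. The helper name resolveExportTarget and the table layout are assumptions for this example, and the actual API call is omitted.

```typescript
interface ExportTarget {
  mimeType: string;  // MIME type to request from the Drive export endpoint
  extension: string; // extension of the resulting buffer
}

// Native Google Workspace types and the Office formats named in this section.
const EXPORT_TARGETS: Record<string, ExportTarget> = {
  "application/vnd.google-apps.document": {
    mimeType:
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    extension: "docx",
  },
  "application/vnd.google-apps.spreadsheet": {
    mimeType:
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    extension: "xlsx",
  },
  "application/vnd.google-apps.presentation": {
    mimeType:
      "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    extension: "pptx",
  },
};

// Returns the export target for a native Workspace file, or null when the
// file is not a Docs/Sheets/Slides document and can be downloaded as-is.
function resolveExportTarget(sourceMimeType: string): ExportTarget | null {
  return EXPORT_TARGETS[sourceMimeType] ?? null;
}
```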

Omni-Media Temporal Segmentation: Audio and video (MP3, MP4, MOV) follow a specialized ingestion path. We use FFmpeg, a widely used multimedia framework, at the Edge to cut each file into temporal segments. The segments are then processed in parallel, allowing see7 to index a one-hour recorded sales meeting as a series of searchable, time-stamped "Intelligence Events."
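A rough sketch of how the segment boundaries might be planned is shown below. The paper does not state a segment length, so the 600-second value in the usage note is an assumption, as are the names planSegments and ffmpegArgs. The arguments use FFmpeg's standard -ss (start offset), -t (duration), -i (input), and -c copy (stream copy) options. With stream copy, cuts land on keyframes, so boundaries may be slightly imprecise.

```typescript
interface TimeSegment {
  index: number;
  startSec: number;
  durationSec: number;
}

// Splits a recording of totalSec seconds into consecutive windows of at most
// segmentSec seconds. The last window is shortened to fit.
function planSegments(totalSec: number, segmentSec: number): TimeSegment[] {
  if (totalSec <= 0 || segmentSec <= 0) return [];
  const segments: TimeSegment[] = [];
  for (let start = 0, index = 0; start < totalSec; start += segmentSec, index++) {
    segments.push({
      index,
      startSec: start,
      durationSec: Math.min(segmentSec, totalSec - start),
    });
  }
  return segments;
}

// Builds the FFmpeg argument list that extracts one segment without re-encoding.
function ffmpegArgs(input: string, segment: TimeSegment, output: string): string[] {
  return [
    "-ss", String(segment.startSec),
    "-t", String(segment.durationSec),
    "-i", input,
    "-c", "copy",
    output,
  ];
}
```

For example, planSegments(3600, 600) splits a one-hour meeting into six ten-minute windows. Each window's startSec can be stored as the timestamp of the resulting events.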

3. The Atomizer: Mining "Atomic Facts"

The Atomizer is our specialized LLM-orchestration layer that converts raw, unstructured strings into high-value Snippets.

Sovereign Loop-Breaker: Traditional AI chunking is prone to "hallucinatory repetition" where the model gets stuck in a loop. Our Atomizer implements a loop-breaking protocol that forces the model to move to the next logical fact once a snippet is extracted. We enforce a yield of 5–15 snippets per segment, ensuring a high Knowledge Yield per MB.
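The paper does not describe how the loop-breaking protocol is implemented. One possible safeguard is a post-processing pass over the model's candidate snippets, sketched below: repeated snippets are dropped, the output is capped at 15, and segments that yield fewer than 5 are flagged. The function names and the normalization rule are assumptions for this example.

```typescript
const MIN_SNIPPETS = 5;  // lower bound of the per-segment yield stated above
const MAX_SNIPPETS = 15; // upper bound of the per-segment yield stated above

interface AtomizerYield {
  snippets: string[];
  droppedRepeats: number; // how many candidates were repeats of an earlier one
  underYield: boolean;    // true when fewer than MIN_SNIPPETS survived
}

// Case-insensitive, whitespace-insensitive key used to detect repetition.
function normalizeSnippet(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Keeps the first occurrence of each snippet, in order, up to MAX_SNIPPETS.
function breakLoops(candidates: string[]): AtomizerYield {
  const seen = new Set<string>();
  const snippets: string[] = [];
  let droppedRepeats = 0;
  for (const candidate of candidates) {
    const key = normalizeSnippet(candidate);
    if (key === "") continue; // ignore empty output
    if (seen.has(key)) {
      droppedRepeats++;
      continue;
    }
    seen.add(key);
    if (snippets.length < MAX_SNIPPETS) snippets.push(candidate.trim());
  }
  return { snippets, droppedRepeats, underYield: snippets.length < MIN_SNIPPETS };
}
```

A caller could treat underYield as a signal to retry the segment or to log it for review.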

SHA-256 Versioning (Standard #3): Every document is hashed with SHA-256 upon initial ingestion, and the hash is stored in our Content vault. If the same file is uploaded to a different repository, our Turbo-Ingestor performs a "Content-Addressable" lookup. If the hash matches, we reference the existing Snippet records rather than re-vectorizing, which avoids duplicate records in the database and repeated embedding compute.
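The content-addressable lookup can be illustrated with a small in-memory sketch. The class name ContentVault, its ingest method, and the vectorize callback are hypothetical stand-ins for the real vault and Turbo-Ingestor. The hashing uses Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 digest of the raw document content.
function contentHash(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

interface IngestResult {
  hash: string;
  snippetIds: string[];
  reused: boolean; // true when existing Snippet records were referenced
}

// In-memory stand-in for the vault: maps a content hash to Snippet IDs.
class ContentVault {
  private readonly snippetsByHash = new Map<string, string[]>();

  // Calls vectorize only when the content has not been seen before.
  ingest(content: string | Uint8Array, vectorize: () => string[]): IngestResult {
    const hash = contentHash(content);
    const existing = this.snippetsByHash.get(hash);
    if (existing !== undefined) {
      return { hash, snippetIds: existing, reused: true };
    }
    const snippetIds = vectorize();
    this.snippetsByHash.set(hash, snippetIds);
    return { hash, snippetIds, reused: false };
  }
}
```

Because the key is derived from the bytes alone, the same file uploaded under a different name or to a different repository resolves to the same entry.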

4. Operational Resilience: Managing Stuck Crawls

In any distributed system, external factors (network timeouts, source site throttling) can lead to "stuck" processes. see7 addresses this through an Operational Day 0 monitoring layer.

The Watchdog Protocol: We utilize specialized diagnostic scripts, such as check-stuck-crawl.ts, to identify jobs that have remained in a RUNNING state without progress updates for more than a defined threshold (e.g., 30 minutes).

Self-Healing Cleanup: When a stuck job is identified, the system can automatically trigger a cleanup of orphaned worker tasks and reset the job status to FAILED or PARTIAL. This prevents "zombie processes" from consuming system resources and allows the user to re-initiate the job with a single click, ensuring the ingestion grid remains fluid and reliable.
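The contents of check-stuck-crawl.ts are not reproduced in this paper. The sketch below only illustrates the idea under stated assumptions: the CrawlJobRecord fields, the function names, and the rule for choosing between FAILED and PARTIAL (PARTIAL when at least one segment completed) are all hypothetical.

```typescript
type JobStatus = "RUNNING" | "COMPLETED" | "FAILED" | "PARTIAL";

// Hypothetical job record; the real CrawlJob schema is not given here.
interface CrawlJobRecord {
  id: string;
  status: JobStatus;
  lastProgressAt: Date;
  completedSegments: number;
  totalSegments: number;
}

const DEFAULT_STUCK_THRESHOLD_MS = 30 * 60 * 1000; // the 30-minute example above

// Returns RUNNING jobs whose last progress update is older than the threshold.
function findStuckJobs(
  jobs: CrawlJobRecord[],
  now: Date,
  thresholdMs: number = DEFAULT_STUCK_THRESHOLD_MS,
): CrawlJobRecord[] {
  return jobs.filter(
    (job) =>
      job.status === "RUNNING" &&
      now.getTime() - job.lastProgressAt.getTime() > thresholdMs,
  );
}

// Assumed rule: keep partial results when some segments finished.
function resolveStuckStatus(job: CrawlJobRecord): "FAILED" | "PARTIAL" {
  return job.completedSegments > 0 ? "PARTIAL" : "FAILED";
}
```

A scheduled run would call findStuckJobs, clean up orphaned worker tasks for each result, and then persist the status returned by resolveStuckStatus.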

5. Sovereign Proxying & Visual Fidelity

To preserve "Ground Truth," see7 provides a high-fidelity viewing experience through an active proxy (/api/content/proxy).

Dynamic Path Rewriting: Many documents rely on relative assets (CSS, images). Our proxy intercepts the document and rewrites these paths to secure, absolute URLs in real time. This allows the user to view the content as it appeared at the source, without the security risk of third-party tracking.
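A simplified sketch of path rewriting is shown below. It resolves each src or href value against the source document's URL using the standard URL constructor, then routes the result back through the proxy. The url query parameter and the function name rewriteAssetPaths are assumptions, and the regular expression is only illustrative: a production rewriter would more likely use an HTML parser and would also handle url(...) references inside CSS.

```typescript
// Rewrites src/href attribute values to absolute URLs served via the proxy.
function rewriteAssetPaths(
  html: string,
  sourceUrl: string, // URL the document was fetched from
  proxyBase: string, // e.g. "/api/content/proxy"
): string {
  return html.replace(
    /\b(src|href)=(["'])(.*?)\2/gi,
    (match: string, attr: string, quote: string, value: string): string => {
      // Leave in-page anchors and non-fetchable schemes untouched.
      if (value.startsWith("#") || /^(data|mailto|javascript):/i.test(value)) {
        return match;
      }
      let absolute: string;
      try {
        absolute = new URL(value, sourceUrl).toString();
      } catch {
        return match; // unparsable value: keep the original attribute
      }
      return `${attr}=${quote}${proxyBase}?url=${encodeURIComponent(absolute)}${quote}`;
    },
  );
}
```

Since every asset request then goes to the proxy rather than to the original host, the viewer's browser does not contact third-party servers directly.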

Sovereign CSS Injection: While the proxy preserves source fidelity, it also injects see7's sovereign design tokens, ensuring even external content feels native to the see7 environment.

6. The Turbo-Pulse Heartbeat & Traceability

We eliminate "System Anxiety" while maintaining absolute truth-anchoring.

Turbo-Pulse Transparency: see7 utilizes a high-resolution heartbeat to surface granular progress. Users see exactly what the "Omnivore" is doing (e.g., "Atomizing Segment 8/10") rather than a static spinner.
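As a small illustration, a heartbeat event could carry the stage and segment counters from which the user-facing label is derived. The PulseEvent shape and both helper names are assumptions for this sketch.

```typescript
// Hypothetical heartbeat payload emitted while a job is running.
interface PulseEvent {
  jobId: string;
  stage: string;   // e.g. "Atomizing"
  current: number; // 1-based index of the segment in progress
  total: number;   // total number of segments in the job
}

// Human-readable label for the progress UI.
function formatPulse(event: PulseEvent): string {
  return `${event.stage} Segment ${event.current}/${event.total}`;
}

// Whole-number completion percentage, clamped to the range 0..100.
function percentComplete(event: PulseEvent): number {
  if (event.total <= 0) return 0;
  const ratio = event.current / event.total;
  return Math.round(Math.min(1, Math.max(0, ratio)) * 100);
}
```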

Citation Integrity (Standard #6): Every generated insight in see7 is cited. By clicking a citation, the user is taken directly to the specific Snippet and the exact page/segment in the mirrored source document. This 1:1 mapping ensures that the AI's output is always Truth-Anchored.

Related White Papers

For the underlying stack, see Architecture & Tech Stack. For identity and security, see Identity & Sovereignty. For engineering standards, see Development Philosophy.