The Management Plane: Orchestrating GPU Inference with SQLite and Two Binaries

2026-03-25 12:44:11 · 01960e00-0001-7000-8000-000000000001

What HOROS actually is

HOROS is two statically compiled Go binaries, a SQLite database, and vLLM containers. No Redis, no Kafka, no Elasticsearch, no Postgres. The entire system runs on a single machine.

The first binary — horos47 — is the orchestrator. It manages a pipeline of jobs stored in SQLite: PDF-to-images, image-to-OCR, OCR-to-database. Each stage has a configurable worker count (pdf_to_images=2, image_to_ocr=8, ocr_to_database=4), set per stage via SetConcurrency(). Failed jobs are retried automatically after 30 seconds.

The second binary — gpu_feeder_v3 — manages GPU allocation. It controls the lifecycle of vLLM containers, processes jobs in batches (8–16 per batch), enforces a 10-second cooldown between batches for thermal management, detects crashed containers, and uses bisection to isolate poison pill jobs in O(log N).
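The bisection step works like a binary search over the batch: run half, see if the container survives, recurse into the crashing half. A minimal sketch, assuming exactly one poison job per failing batch and a crash oracle standing in for an actual container run (findPoison and its signature are hypothetical names, not gpu_feeder_v3's API):

```go
package main

import "fmt"

// findPoison bisects a batch whose processing crashed the container,
// isolating the single job responsible in O(log N) restarts.
// crashes reports whether running a sub-batch crashes; in the real
// system that means feeding the jobs to a vLLM container.
// Assumes exactly one poison job in the batch.
func findPoison(batch []int, crashes func([]int) bool) int {
	lo, hi := 0, len(batch)
	for hi-lo > 1 {
		mid := (lo + hi) / 2
		if crashes(batch[lo:mid]) {
			hi = mid // poison is in the left half
		} else {
			lo = mid // left half is clean; poison is in the right half
		}
	}
	return batch[lo]
}

func main() {
	poison := 13
	crashes := func(jobs []int) bool {
		for _, j := range jobs {
			if j == poison {
				return true
			}
		}
		return false
	}
	batch := []int{2, 5, 7, 13, 21, 34, 55, 89}
	fmt.Println("poison job:", findPoison(batch, crashes)) // poison job: 13
}
```

For an 8-job batch this takes 3 container restarts instead of 8 single-job probes.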

That's the entire runtime. Two processes, one database, containers that load and unload models.

Control plane / data plane separation

The architecture separates cleanly into two planes.

The control plane handles metadata, routing, job scheduling, and coordination. It's lightweight — small messages, fast SQLite reads, sub-millisecond latency. No GPU involvement.

The data plane handles the actual work: PDF pages as images, OCR inference, embedding generation, language model calls. It's GPU-bound, batch-oriented, and operates on chunks of similar size.

These two planes never share a concurrency primitive. The control plane doesn't compete with the data plane for resources. A metadata lookup doesn't wait behind a GPU batch. A job status update doesn't contend with an inference call.

This separation eliminates an entire class of problems that plague mixed architectures: small control messages stuck behind large data payloads, cache invalidation across different workload types, priority inversion where a batch job blocks an interactive query.

GPU allocation: the thermal problem

GPUs have thermal constraints. A 5090 running continuous inference at 95% memory utilization heats up. If you push batches back-to-back, thermal throttling kicks in and throughput drops.

The gpu_feeder handles this with a simple mechanism: 10-second cooldown between batches. During cooldown, the GPU temperature drops. The next batch runs at full speed. The total throughput is higher than running without cooldown, because you avoid the throttling penalty.

This is the kind of operational detail that benchmark papers never mention and production systems always hit.

Model allocation: automatic switching

Four GPUs, four specializations:

GPU  Model                    Speed          Function
#0   Qwen3-32B (thinking)     ~60 tok/s      Reasoning, synthesis, long-form writing, verification
#1   Qwen3-8B (fast)          ~140 tok/s     Extraction, compression, reranking, reformulation
#2   Qwen2-VL-7B + ColQwen2   ~3 pages/s     Visual PDF reading: tables, charts, schematics, OCR
#3   BGE-M3 + Jina Reranker   ~1000 docs/s   Semantic indexing, vector search, cross-encoding

The allocator — a goroutine ticking every 5 seconds — monitors the ratio of pending jobs per type and can switch a GPU between Vision and Think modes. Anti-inversion rule: don't switch to Think for ≤3 jobs if Vision has >10× more pending. This prevents thrashing on mixed workloads.
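The anti-inversion rule is a small predicate in the allocator's decision loop. A minimal sketch — the function name and the parameterization are illustrative; only the ≤3-job and >10× thresholds come from the rule stated above:

```go
package main

import "fmt"

// shouldSwitchToThink applies the anti-inversion rule: don't pay a
// model-swap cost to switch a GPU into Think mode for a tiny Think
// backlog (≤3 jobs) while Vision has more than 10× the pending work.
func shouldSwitchToThink(pendingThink, pendingVision int) bool {
	if pendingThink == 0 {
		return false // nothing to switch for
	}
	if pendingThink <= 3 && pendingVision > 10*pendingThink {
		return false // anti-inversion: the swap isn't worth 3 jobs
	}
	return true
}

func main() {
	fmt.Println(shouldSwitchToThink(2, 50))  // false: 2 Think jobs vs 50 Vision
	fmt.Println(shouldSwitchToThink(20, 50)) // true: backlog justifies the swap
}
```

Without the guard, a mixed workload with a trickle of Think jobs would force a ~180-second swap for a few seconds of inference, then swap straight back.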

Model swap takes ~70 seconds for Vision, ~180 seconds for Think (CUDA graph compilation). The allocator accounts for this cost in its decision.

The job queue: squeueHA

All job coordination goes through squeueHA — a SQLite-based visibility timeout queue. Jobs are claimed in batches via PollBatch() in a single transaction with BEGIN IMMEDIATE. Claimed jobs become invisible to other workers for a configurable duration. If the worker crashes, the job reappears after the visibility timeout expires.

The same primitive covers three distributed patterns:

  • 1 row, N instances → leader election
  • N rows, N instances → work distribution (TDMA — time division multiple access)
  • Visibility shorter than processing time → elastic overflow

Priority support (priority DESC, visible_at ASC) means interactive queries can jump ahead of batch jobs in the same queue. Dead letter queue (squeueha_dead) catches jobs that exceed MaxAttempts.

Exponential backoff (50/100/200ms) on SQLITE_BUSY handles write contention without spinning.

Triple indexation

Every document page goes through three parallel indexation paths:

  1. Text embedding (BGE-M3, 568M params) — multilingual semantic search. Cloud equivalent: OpenAI ada-002, at ~$0.04 per 600 pages.
  2. Visual reading (Qwen2-VL-7B) — reads tables, charts, and schematics directly from page images and produces structured data. Cloud equivalent: GPT-4o Vision, at ~$9–15 per 600 pages.
  3. Visual embedding (ColQwen2, 2B params) — late-interaction embedding of the entire page image. Enables visual similarity search: find a page by sketch, table type, or visual pattern. No cloud equivalent exists.

The third layer is the differentiator. No managed service offers visual similarity search on document pages. You can find a page by what it looks like, not just by what it says.

Cost for the entire pipeline on a 600-page technical document: ~€0.09 on HOROS infrastructure, vs $10–76 on cloud alternatives (which don't include visual indexation).

Why zero external dependencies

Every external dependency is a failure mode. Redis crashes → your cache is gone and your app's behavior changes. Kafka runs a partition leader election → your pipeline stalls. Elasticsearch needs a cluster → you're managing infrastructure instead of building product.

SQLite doesn't crash. It's a library, not a service. The WAL file is on local disk. Reads are sub-millisecond. Writes are serialized but fast. The failure modes are disk failure (which kills everything anyway) and corrupt writes (which SQLite handles with checksums and journaling).

Two binaries, one database, containers that start and stop. That's the entire operational surface. There's nothing else to monitor, nothing else to restart, nothing else to upgrade.


hazyhaar — open research, sovereign infrastructure github.com/hazyhaar · hazyhaar.fr

horos · gpu · orchestration · sqlite