The switch
Six months into building HOROS, I was running PostgreSQL. Sessions, permissions, user data, job queues — everything went through a networked database. It worked. It also meant every permission check was a network round-trip, every session validation was a query to a shared database, and every cache invalidation was a problem I had to solve.
Then I rebuilt everything on SQLite. One file per tenant. WAL mode. No network hop for reads. The consequences were immediate and structural.
One database per tenant
The architecture is shard-centric: each tenant gets its own SQLite database file, managed by a connection pool with lifecycle management, hot-reload, and an LRU reaper for idle connections. Routing is a lookup: dossierID → file path. That's it.
The package is usertenant (github.com/hazyhaar/usertenant). CGO-free, using modernc.org/sqlite. Compiles to a static binary.
At 50,000 tenants, you're managing 50,000 files. The connection pool handles this: open connections are pooled; idle connections are reaped; the catalog routes requests to the right file. No shared state between tenants. Physical isolation by construction.
The caching question disappears
With a networked database, every per-request check costs real latency. So you cache: sessions in Redis, permissions in memory with TTL, account status with version counters. Then you spend half your time on invalidation: when does the cache go stale, how do you propagate changes, and what happens when an admin deactivates an account before the cache expires?
With SQLite WAL on a local file, a read is sub-millisecond. There's no network hop. So I just check everything on every request. Sessions, permissions, account status — read from SQLite, every time.
- Admin deactivates an account → next request reads the flag, done
- Admin changes permissions → next request reads the new permissions, done
- No stale cache, no event bus, no version counters
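A minimal sketch of that no-cache request gate, with an in-memory `store` standing in for the tenant's SQLite file. The interface and names are hypothetical, not HOROS code; the point is that every check re-reads current state.

```go
package main

import "fmt"

// store abstracts the per-tenant reads. Hypothetical interface: in
// the real system these would be queries against the tenant's file.
type store interface {
	SessionValid(token string) bool
	AccountActive(user string) bool
	HasPermission(user, perm string) bool
}

// authorize re-reads everything on every request. There is no cache
// layer, so there is nothing to invalidate: a deactivation or
// permission change is visible on the very next request.
func authorize(s store, token, user, perm string) bool {
	return s.SessionValid(token) && s.AccountActive(user) && s.HasPermission(user, perm)
}

// memStore is an in-memory stand-in for the tenant database.
type memStore struct {
	sessions map[string]bool
	active   map[string]bool
	perms    map[string]bool // "user:perm" -> granted
}

func (m *memStore) SessionValid(t string) bool     { return m.sessions[t] }
func (m *memStore) AccountActive(u string) bool    { return m.active[u] }
func (m *memStore) HasPermission(u, p string) bool { return m.perms[u+":"+p] }

func main() {
	s := &memStore{
		sessions: map[string]bool{"tok1": true},
		active:   map[string]bool{"ada": true},
		perms:    map[string]bool{"ada:write": true},
	}
	fmt.Println(authorize(s, "tok1", "ada", "write")) // true

	// Admin deactivates the account: the next check sees it immediately.
	s.active["ada"] = false
	fmt.Println(authorize(s, "tok1", "ada", "write")) // false
}
```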
The entire "should I check on every request or cache it" question stops existing. Not because caching is wrong — it's the correct answer when your reads are expensive. But if your reads are cheap enough, the question itself is eliminated.
The tradeoff
This only works because SQLite WAL is fast enough that per-request reads are negligible. With a networked database, you'd pay real latency per request and caching becomes necessary again. Different architecture, different problems. A Redis-based session cache makes perfect sense for a shared-database setup. I just removed the conditions that make caching necessary in the first place.
What about writes?
SQLite has a single-writer constraint. In WAL mode there is one writer at a time, but readers are never blocked by the writer. For the multi-tenant case, this is fine by construction — each tenant has its own file, so writes to tenant A don't contend with writes to tenant B. The single-writer constraint only applies within a single file.
For writes within a tenant, _txlock=immediate in the DSN ensures that transactions acquire the write lock at BEGIN, not at first write. This avoids the SQLITE_BUSY race where two transactions both read successfully, then both try to write, and one fails. It's a one-line DSN parameter that eliminates an entire class of concurrency bugs.
Observability
Each shard's lifecycle is logged via shardlog — a dedicated JSON-lines log, zero SQLite dependency, filterable by tenant ID. Operations logged: opened, resolved, evicted, created, deleted, search, insert, embed, checkpoint, reload, error. The hook is injected into the connection pool without the pool importing the logger. Decoupled by design.
"SQLite is for prototyping"
Expedia runs SQLite in production. Notion has used it at scale. The SQLite project itself explicitly documents production use cases beyond local storage. WAL mode, per-tenant sharding, sub-millisecond reads without a network hop — these are production patterns, not prototype shortcuts.
The "SQLite is only for prototyping" claim hasn't aged well.
hazyhaar — open research, sovereign infrastructure github.com/hazyhaar · hazyhaar.fr