Inference is memory-bound.
We bound the memory.
Constant-size memory for any language model. No retraining, and it composes with your existing serving stack. The footprint stays flat while recall holds.
- Footprint: O(1). Flat, predictable memory in place of O(N) growth.
- Integration: Zero. No retraining and no recalibration. A drop-in.
- Range: One dial. Lossless to bounded, on unmodified models.
Inference memory grows without bound.
Every large language model carries its working memory in a structure called the KV cache, a running, word-for-word record of everything seen so far. Inference is memory-bound, not compute-bound: the limit is how much cache a GPU can hold, not how fast it can calculate. That single property drives the cost.
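To make that growth concrete, here is a back-of-the-envelope sketch of how a standard KV cache scales with context length. The model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from any particular model or from SGF.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate size of a standard KV cache for one sequence.

    Each token stores one key and one value vector per layer and KV head.
    All default dimensions are hypothetical, chosen only for illustration.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token


if __name__ == "__main__":
    for tokens in (1_000, 10_000, 100_000):
        mib = kv_cache_bytes(tokens) / 2**20
        print(f"{tokens:>7} tokens -> {mib:,.0f} MiB")
```

Under these assumed dimensions each token costs 128 KiB, so the cache grows linearly with every token the session has seen.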
Every conversation costs too much
Even short, ordinary chats carry a full verbatim cache far larger than the meaning they hold. A GPU serves fewer sessions at once, and each costs more than it should.
Context hits a hard ceiling
Memory caps how long a conversation, document, or agent run can go before the model exhausts memory and fails.
Generation steadily slows
The model re-reads an ever-larger memory each step, so throughput degrades as context grows.
The edge is out of reach
Phones, laptops, cars, and robots have fixed memory ceilings. A memory that grows with the conversation is a non-starter.
Today's models remember like a tape recorder: perfectly, and at a size that grows forever. The industry has invested heavily in working around this, but the underlying memory model is unchanged.
A multilayered approach to LLM memory.
Instead of one ever-growing record, SGF treats memory as several cooperating layers, each held at the precision it needs and bounded as a whole. The result is O(1) memory, flat and predictable, in place of the standard O(N) growth.
Cooperating layers, each at its own precision, bounded as a whole.
key property
SGF holds memory at a fixed size by merging redundant memory rather than discarding it, preserving recall instead of forgetting older content.
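The difference between merging and discarding can be illustrated with a toy. The sketch below is not SGF's mechanism, which this page does not describe. It is a generic, hypothetical fixed-budget store on plain Python lists that, when full, folds the two most similar entries into a weighted mean instead of dropping one.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def add_with_merge(memory, vector, budget):
    """Append an entry, then merge the most similar pair while over budget.

    `memory` is a list of (vector, weight) pairs and `budget` must be >= 1.
    Merging replaces two entries with their weighted mean, so every input
    still contributes to some entry. This is an illustrative toy only.
    """
    memory.append((list(vector), 1))
    while len(memory) > budget:
        # Brute-force search for the most similar pair; fine for a toy.
        i, j = max(
            ((i, j) for i in range(len(memory)) for j in range(i + 1, len(memory))),
            key=lambda p: cosine(memory[p[0]][0], memory[p[1]][0]),
        )
        (va, wa), (vb, wb) = memory[i], memory[j]
        merged = [(x * wa + y * wb) / (wa + wb) for x, y in zip(va, vb)]
        memory.pop(j)  # j > i, so index i stays valid
        memory[i] = (merged, wa + wb)
    return memory


if __name__ == "__main__":
    memory = []
    for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]):
        add_with_merge(memory, v, budget=3)
    print(len(memory), sum(w for _, w in memory))
```

The entry count never exceeds the budget, while the weights still sum to the number of inputs. An eviction policy with the same budget would keep three entries and lose the fourth outright.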
Constant, predictable footprint
Memory becomes a fixed budget instead of a growing liability, O(1) in place of O(N). Capacity is known in advance, with no out-of-memory surprises.
Recall preserved
The approach merges redundancy rather than discarding it, so long-range recall holds instead of older content being forgotten.
Precision where it counts
Each layer is kept at the fidelity it needs. The gist compresses hard while the facts that matter stay sharp.
Drop-in, no retraining
It works on unmodified models, with no retraining and no recalibration, as a direct alternative to the standard cache.
Composes, does not compete
It stacks on top of paged attention and quantization, slotting into existing serving stacks rather than requiring a rebuild.
One solution, from lossless to bounded.
SGF is not a fixed compression setting. It is a continuous dial, in a single drop-in solution, on unmodified models. The operator chooses where to sit, per use case, and the memory cost is fixed and predictable wherever they land.
The everyday win
Even a median conversation compresses several-fold, which matters because that is where the volume is.
Reduced to practice, and behaving as designed.
Across repeated testing in working code, with no retraining and no calibration, including on memory-constrained hardware, the mechanism has behaved as designed.
flat vs growing
Across extended, multi-turn conversations, working memory has held flat while a standard cache climbed without bound. It has stayed consistently smaller, by roughly an order of magnitude, with no slowdown.
recall held
In long, realistic sessions, the field has stayed bounded and coherent, recovering buried details verbatim thousands of tokens after they were first stated.
lossless 2x
Batched operations have roughly doubled steady-state throughput while producing memory bit-identical to the exact path.
drift-free
Persona-and-facts memories captured once have reproduced an identity and its exact facts bit-for-bit, where an unconditioned baseline produced only generic placeholders.
Honest scope. These results have been established in working code, including on memory-constrained hardware, demonstrating that the method behaves as claimed. Proving that it holds at production scale, rigorously and independently, is the current phase of work.
Capabilities a growing cache structurally cannot offer.
Once memory is small, self-contained, and precision-controlled, SGF brings a set of capabilities that follow directly from the bounded, deterministic representation.
Deterministic, drift-free lenses
A persona, brand voice, or knowledge field is captured once and is bit-for-bit identical every time it deploys. QA a character or compliance persona once, then trust it in production. Versionable like software: diff it, roll it back, A/B test it.
Predictable capacity and cost
A fixed per-conversation footprint means concurrency is known in advance and there are no out-of-memory surprises. It enables new economics, like flat-rate or on-device pricing instead of per-token metering of a growing cache.
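As a rough illustration of why a fixed footprint makes concurrency predictable, the arithmetic reduces to a single division. The hardware and budget figures below are hypothetical and are not SGF measurements.

```python
def max_sessions(gpu_bytes, weights_bytes, per_session_bytes):
    """Sessions that fit once model weights are resident (integer floor)."""
    free = gpu_bytes - weights_bytes
    if free <= 0 or per_session_bytes <= 0:
        return 0
    return free // per_session_bytes


if __name__ == "__main__":
    GIB = 2**30
    MIB = 2**20
    # Hypothetical numbers: an 80 GiB accelerator, 16 GiB of weights,
    # and a fixed 256 MiB memory budget per conversation.
    print(max_sessions(80 * GIB, 16 * GIB, 256 * MIB))
```

With a growing cache, `per_session_bytes` depends on how long each conversation runs, so the same calculation only gives a moving estimate.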
Save, resume, and move
An entire session writes to a small file and resumes later on another machine, continuing exactly where it left off. A rich cloud session can be tightened to fit a phone, offline.
Lower bandwidth and energy
Because generation re-scans the cache every step, a smaller fixed memory means less bandwidth and less energy spent per token, a direct efficiency and sustainability lever.
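A small sketch of the bandwidth argument: if every generation step scans the whole cache, total traffic grows quadratically with output length for a growing cache and only linearly once the memory is capped. Expressing the fixed memory as a token-equivalent entry count (`fixed_tokens`) is an assumption made for illustration, as are all the numbers.

```python
def total_bytes_read(steps, per_token_bytes, prompt_tokens=0, fixed_tokens=None):
    """Total cache bytes scanned while generating `steps` tokens.

    With a growing cache, step k scans prompt_tokens + k entries.
    With a cap of `fixed_tokens` entries, each step scans at most that many.
    """
    total = 0
    for k in range(1, steps + 1):
        entries = prompt_tokens + k
        if fixed_tokens is not None:
            entries = min(entries, fixed_tokens)
        total += entries * per_token_bytes
    return total


if __name__ == "__main__":
    per_token = 128 * 1024  # hypothetical 128 KiB per cached token
    growing = total_bytes_read(10_000, per_token)
    capped = total_bytes_read(10_000, per_token, fixed_tokens=1_024)
    print(f"growing / capped traffic ratio: {growing / capped:.1f}x")
```

The ratio widens as sessions get longer, which is why the saving compounds for long-running agents.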
Instant persona swap
Loading a different lens changes the character, brand, or expertise instantly, with no model reload. Rules, persona, and live memory stay in separate, independently controllable tiers.
Privacy and ownership
Memory is a small, self-contained, encryptable file that can live on-device and be owned by the user, not a sprawl on someone else's server.
Established needs served better, new ones unlocked.
Established needs, served more efficiently
- High-volume model serving at lower, predictable cost per conversation.
- Long-document analysis (contracts, codebases, reports) beyond what a standard cache can hold.
- Always-on agents that run for hours without slowing or forgetting the start.
- Consistent, certifiable brand, support, and expert personas at scale.
- Regulated deployments in finance, healthcare, and legal that need drift-free, auditable behavior.
Emerging applications SGF opens up
- On-device assistants with a persistent, private memory of the user, online or offline.
- AI characters for games: a whole cast of NPCs, each in a few megabytes of bounded memory.
- Robotics, automotive, and embedded agents with a fixed on-board memory budget.
- Personal AI that travels with you across devices as a small, owned, encrypted file.
- A new asset class: distributable AI personas, packaged and updated like software.
Not any single property, but the combination.
Bounding the KV cache is an active research area with three main families. SGF's contribution is delivering all of the following as a unit, made possible by the underlying representation.
Eviction. Discards tokens judged less important.
Limitation. Loss is irreversible. Evicted context is gone, which can fragment long-range understanding and cause errors.
Quantization. Keeps every token at lower numerical precision.
Limitation. Memory still grows with context, a smaller constant times N, not a bound. Some variants need calibration.
Merging. Combines similar entries into compact ones. Closest in spirit to SGF.
Limitation. Existing methods typically merge only at prefill or in rotated position space, limiting how and when they can compress.
Simultaneously, as a single unit:
- Hard-bounded in memory and compute
- Recall-preserving (merges, does not evict)
- Correct under rotary position encoding
- Calibration-free
- Continuously operative during generation
- Tunable across the full lossless-to-bounded range
Validated, developed, and adopted.
SemGrav is documented, reduced to practice in working code, and protected by a filed U.S. provisional patent application, ahead of any external disclosure. We are looking for further resources and collaborators to prove it rigorously and bring it into the systems where it matters.
Validation and compute
Independent benchmarking on serious, current hardware, proving the bounded-memory, recall, and throughput claims at scale and across models. This is the foundation everything builds on.
Research partnership
A collaboration to harden and extend SGF and explore the applications above, with the inventor involved.
Funding
Sponsored research, a grant, or seed support to carry the work from working prototype to robust, general implementation.
Licensing
For the right partner, the patent-pending IP is available to license or as the basis of a deeper commercial relationship.
For anyone whose business runs on this hardware, this is a problem worth solving.
We would value the chance to show the work in detail.
Prefer email? learnmore@semgrav.ai. Patent pending.