Patent pending. Constant-memory inference.

Inference is memory-bound.
We bound the memory.

Constant-size memory for any language model. No retraining, and it composes with your existing serving stack. The footprint stays flat while recall holds.


Footprint: O(1). Flat, predictable memory in place of O(N) growth.
Integration: Zero. No retraining and no recalibration. A drop-in.
Range: One dial. Lossless to bounded, on unmodified models.

The problem

Inference memory grows without bound.

Every large language model carries its working memory in a structure called the KV cache, a running, token-by-token record of everything seen so far. Inference is memory-bound, not compute-bound: the limit is how much cache a GPU can hold, not how fast it can calculate. That single property drives the cost.
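To make the growth concrete, here is a minimal sketch of the usual size estimate for a standard KV cache. The model shape (32 layers, 8 key/value heads, head dimension 128, 16-bit values) is a hypothetical chosen for illustration, not a figure from any particular model or from SGF.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate size of a standard KV cache for one sequence.

    The leading factor of 2 counts one key and one value vector
    per token, per layer, per key/value head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=1)
print(f"{per_token} bytes per token")

for n in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n) / 2**30
    print(f"{n:>7} tokens -> {gib:.2f} GiB")
```

Under these assumptions the cache costs a fixed number of bytes per token, so doubling the context doubles the memory. That is the O(N) growth the rest of this page refers to.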

Every conversation costs too much

Even short, ordinary chats carry a full verbatim cache far larger than the meaning they hold. A GPU serves fewer sessions at once, and each costs more than it should.

Context hits a hard ceiling

Memory caps how long a conversation, document, or agent run can go before the model exhausts memory and fails.

Generation steadily slows

The model re-reads an ever-larger memory each step, so throughput degrades as context grows.
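A rough way to see the slowdown is a bandwidth-only bound: if each decoding step has to stream the model weights and the whole cache from memory once, the step rate cannot exceed bandwidth divided by bytes read. The sketch below assumes a single sequence, ignores compute and batching, and uses hypothetical hardware and model numbers.

```python
def max_tokens_per_second(bandwidth_bytes_per_s, weight_bytes, cache_bytes):
    """Bandwidth-only upper bound on single-sequence decode speed.

    Assumes every step reads the weights and the full cache once.
    """
    return bandwidth_bytes_per_s / (weight_bytes + cache_bytes)


# Hypothetical figures, for illustration only.
BANDWIDTH = 1e12           # 1 TB/s of memory bandwidth
WEIGHTS = 16e9             # 16 GB of model weights
BYTES_PER_TOKEN = 131_072  # cache bytes added per token of context

for n in (1_000, 100_000, 1_000_000):
    rate = max_tokens_per_second(BANDWIDTH, WEIGHTS, n * BYTES_PER_TOKEN)
    print(f"{n:>9} tokens of context -> at most {rate:.1f} tokens/s")
```

With a cache of fixed size, the denominator stops growing, so this bound stays constant however long the context runs.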

The edge is out of reach

Phones, laptops, cars, and robots have fixed memory ceilings. A memory that grows with the conversation is a non-starter.

Today's models remember like a tape recorder: perfectly, and at a size that grows forever. The industry has invested heavily in working around this, but the underlying memory model is unchanged.

The solution

A multilayered approach to LLM memory.

Instead of one ever-growing record, SGF treats memory as several cooperating layers, each held at the precision it needs and bounded as a whole. The result is O(1) memory, flat and predictable, in place of the standard O(N) growth.

Cooperating layers, each at its own precision, bounded as a whole.

key property

SGF holds memory at a fixed size by merging redundant memory rather than discarding it, preserving recall instead of forgetting older content.
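The difference between merging and discarding can be shown with a toy fixed-budget store. This is a generic illustration of similarity-based merging, not SGF's mechanism, which this page does not describe. The class name, the cosine-similarity criterion, and the weighted-average merge are all assumptions made for the sketch.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class MergingStore:
    """Toy store (hypothetical): when over budget, merge the two most similar entries."""

    def __init__(self, budget):
        self.budget = budget
        self.entries = []  # each entry is [vector, weight]

    def add(self, vector):
        self.entries.append([list(vector), 1])
        if len(self.entries) > self.budget:
            self._merge_closest_pair()

    def _merge_closest_pair(self):
        # Pick the pair (i < j) whose vectors are most similar.
        i, j = max(
            combinations(range(len(self.entries)), 2),
            key=lambda p: cosine(self.entries[p[0]][0], self.entries[p[1]][0]),
        )
        (va, wa), (vb, wb) = self.entries[i], self.entries[j]
        # Weighted average, so an entry that already absorbed others counts for more.
        merged = [(x * wa + y * wb) / (wa + wb) for x, y in zip(va, vb)]
        self.entries[i] = [merged, wa + wb]
        del self.entries[j]


store = MergingStore(budget=3)
for v in [(1.0, 0.0), (0.0, 1.0), (1.0, 0.1), (-1.0, 0.0), (0.9, 0.0)]:
    store.add(v)

total_weight = sum(w for _, w in store.entries)
print(len(store.entries), total_weight)  # 3 entries, total weight 5
```

The store never holds more than three entries, yet the weights still account for all five inputs: the near-duplicates were folded together while the two distinct directions survived intact. An eviction policy with the same budget would have dropped two inputs outright.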

Constant, predictable footprint

Memory becomes a fixed budget instead of a growing liability, O(1) in place of O(N). Capacity is known in advance, with no out-of-memory surprises.

Recall preserved

The approach merges redundancy rather than discarding it, so long-range recall holds instead of older content being forgotten.

Precision where it counts

Each layer is kept at the fidelity it needs. The gist compresses hard while the facts that matter stay sharp.

Drop-in, no retraining

It works on unmodified models, with no retraining and no recalibration, as a direct alternative to the standard cache.

Composes, does not compete

It stacks on top of paged attention and quantization, slotting into existing serving stacks rather than requiring a rebuild.

One dial

One solution, from lossless to bounded.

SGF is not a fixed compression setting. It is a continuous dial, in a single drop-in solution, on unmodified models. The operator chooses where to sit, per use case, and the memory cost is fixed and predictable wherever they land.

The everyday win

Even a median conversation compresses several-fold, which matters because that is where the volume is.

Evidence so far

Reduced to practice, and behaving as designed.

Across repeated testing in working code, with no retraining and no calibration, including on memory-constrained hardware, the mechanism has behaved as designed.

Illustrative. Relative memory versus conversation length. A standard cache grows linearly while SGF stays flat, with the gap widening to roughly 180x by 50,000 tokens.

flat vs growing

Across extended, multi-turn conversations, working memory has held flat while a standard cache climbed without bound, consistently smaller by roughly an order of magnitude, with no slowdown.

recall held

In long, realistic sessions, the field has stayed bounded and coherent, recovering buried details verbatim thousands of tokens after they were first stated.

lossless 2x

Batched operations have roughly doubled steady-state throughput while producing memory bit-identical to the exact path.

drift-free

Persona-and-facts memories captured once have reproduced an identity and its exact facts bit-for-bit, where an unconditioned baseline produced only generic placeholders.

Honest scope. These results have been established in working code, including on memory-constrained hardware, demonstrating that the method behaves as claimed. Proving rigorously and independently that it holds at production scale is the current phase of work.

Beyond the memory problem

Capabilities a growing cache structurally cannot offer.

Once memory is small, self-contained, and precision-controlled, SGF brings a set of capabilities that follow directly from the bounded, deterministic representation.

Deterministic, drift-free lenses

A persona, brand voice, or knowledge field is captured once and is bit-for-bit identical every time it deploys. QA a character or compliance persona once, then trust it in production. Versionable like software: diff it, roll it back, A/B test it.

Predictable capacity and cost

A fixed per-conversation footprint means concurrency is known in advance and there are no out-of-memory surprises. It enables new economics, like flat-rate or on-device pricing instead of per-token metering of a growing cache.
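The capacity argument is simple arithmetic. The sketch below compares a fixed per-session footprint with a cache that grows per token. The GPU size, weight size, 64 MiB session budget, and per-token cost are hypothetical values for illustration, not measured SGF figures.

```python
GIB = 2**30
MIB = 2**20


def max_concurrent_sessions(gpu_bytes, weight_bytes, per_session_bytes):
    """Sessions that fit once the weights are resident, given a fixed footprint."""
    return (gpu_bytes - weight_bytes) // per_session_bytes


def sessions_with_growing_cache(gpu_bytes, weight_bytes, bytes_per_token, tokens):
    """Sessions that fit when every session holds `tokens` of standard cache."""
    return (gpu_bytes - weight_bytes) // (bytes_per_token * tokens)


# Hypothetical deployment: 80 GiB GPU, 16 GiB of weights.
fixed = max_concurrent_sessions(80 * GIB, 16 * GIB, 64 * MIB)
print("fixed footprint:", fixed, "sessions at any context length")

for tokens in (1_000, 10_000, 100_000):
    growing = sessions_with_growing_cache(80 * GIB, 16 * GIB, 131_072, tokens)
    print(f"growing cache at {tokens:>7} tokens: {growing} sessions")
```

With a fixed footprint the concurrency figure is a single number known before deployment. With a growing cache it depends on how long each conversation happens to run.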

Save, resume, and move

An entire session writes to a small file and resumes later on another machine, continuing exactly where it left off. A rich cloud session can be tightened to fit a phone, offline.
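As a sketch of what save and resume could look like for any fixed-size state, the example below writes a buffer to a file, reads it back, and checks with a hash that the restored bytes are identical. The file layout and function names are hypothetical. The page does not specify SGF's actual session format.

```python
import hashlib
import os
import tempfile
from array import array


def save_state(path, values):
    """Write a float32 buffer to disk and return its SHA-256 digest."""
    # Native byte order is used here; a portable format would pin endianness.
    buf = array("f", values).tobytes()
    with open(path, "wb") as f:
        f.write(buf)
    return hashlib.sha256(buf).hexdigest()


def load_state(path):
    """Read the buffer back and return it with the digest of the bytes read."""
    with open(path, "rb") as f:
        buf = f.read()
    restored = array("f")
    restored.frombytes(buf)
    return restored, hashlib.sha256(buf).hexdigest()


state = [0.1 * i for i in range(1024)]  # stand-in for a bounded session memory

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.bin")
    saved_digest = save_state(path, state)
    restored, loaded_digest = load_state(path)

print(saved_digest == loaded_digest, len(restored))  # True 1024
```

Because the state has a fixed size, the file size is known in advance, and a matching digest is a cheap check that a resumed session starts from exactly the bytes that were saved.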

Lower bandwidth and energy

Because generation re-scans the cache every step, a smaller fixed memory means less bandwidth and less energy spent per token, a direct efficiency and sustainability lever.

Instant persona swap

Loading a different lens changes the character, brand, or expertise instantly, with no model reload. Rules, persona, and live memory stay in separate, independently controllable tiers.

Privacy and ownership

Memory is a small, self-contained, encryptable file that can live on-device and be owned by the user, not a sprawl on someone else's server.

Applications

Established needs served better, new ones unlocked.

Established needs, served more efficiently

High-volume model serving at lower, predictable cost per conversation.
Long-document analysis (contracts, codebases, reports) beyond what a standard cache can hold.
Always-on agents that run for hours without slowing or forgetting the start.
Consistent, certifiable brand, support, and expert personas at scale.
Regulated deployments in finance, healthcare, and legal that need drift-free, auditable behavior.

Emerging applications SGF opens up

On-device assistants with a persistent, private memory of the user, online or offline.
AI characters for games: a whole cast of NPCs, each in a few megabytes of bounded memory.
Robotics, automotive, and embedded agents with a fixed on-board memory budget.
Personal AI that travels with you across devices as a small, owned, encrypted file.
A new asset class: distributable AI personas, packaged and updated like software.

The research landscape

Not any single property, but the combination.

Bounding the KV cache is an active research area with three main families, described first. SGF's contribution is not any one property but the combination listed after them, delivered as a unit and made possible by the underlying representation.

Eviction

Discards tokens judged less important.

Limitation. Loss is irreversible. Evicted context is gone, which can fragment long-range understanding and cause errors.

Quantization

Keeps every token at lower numerical precision.

Limitation. Memory still grows with context, a smaller constant times N, not a bound. Some variants need calibration.
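The "smaller constant times N" point can be put in numbers. The sketch below finds the context length at which a quantized but still linear cache outgrows a given fixed budget. The per-token cost, bit widths, and 64 MiB budget are hypothetical, and the estimate ignores the scale and zero-point metadata that real quantization schemes store.

```python
def crossover_tokens(fixed_budget_bytes, bytes_per_token_full,
                     bits_full=16, bits_quantized=4):
    """Context length at which a quantized linear cache exceeds a fixed budget."""
    bytes_per_token_quantized = bytes_per_token_full * bits_quantized / bits_full
    return fixed_budget_bytes / bytes_per_token_quantized


# Hypothetical figures: 128 KiB per token at 16 bits, 64 MiB fixed budget.
limit = crossover_tokens(64 * 2**20, 131_072)
print(f"4-bit cache exceeds the budget after {limit:.0f} tokens")
```

Quantization moves the crossover point further out by the compression ratio, but the cache still crosses any fixed budget at some finite context length.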

Merging

Combines similar entries into compact ones. Closest in spirit to SGF.

Limitation. Existing methods typically merge only at prefill or in rotated position space, limiting how and when they can compress.

SGF

Simultaneously, as a single unit:

Hard-bounded in memory and compute
Recall-preserving (merges, does not evict)
Correct under rotary position encoding
Calibration-free
Continuously operative during generation
Tunable across the full lossless-to-bounded range

What we are looking for

Validated, developed, and adopted.

SemGrav is documented, reduced to practice in working code, and protected by a filed U.S. provisional patent application, ahead of any external disclosure. We are looking for further resources and collaborators to prove it rigorously and bring it into the systems where it matters.

Validation and compute

Independent benchmarking on serious, current hardware, proving the bounded-memory, recall, and throughput claims at scale and across models. This is the foundation everything builds on.

Research partnership

A collaboration to harden and extend SGF and explore the applications above, with the inventor involved.

Funding

Sponsored research, a grant, or seed support to carry the work from working prototype to robust, general implementation.

Licensing

For the right partner, the patent-pending IP is available to license, or as the basis of a deeper commercial relationship.

For anyone whose business runs on this hardware,
this is a problem worth solving.

We would value the chance to show the work in detail.

Prefer email? learnmore@semgrav.ai. Patent pending.
