Optimize LLM context by removing input bloat
Bear-1.2 compression removes low-signal tokens from your prompts before they hit your LLM.
Save tokens and improve accuracy on your agent's background knowledge
Bear-1.2 compresses your agent's background knowledge before it enters the context window.
Featured
Compressed prompts outperformed uncompressed in a 268K-vote blind arena across all models.
+4.9% Sonnet 4.5
+15% Gemini 3 Flash
+5% Purchase lift
Read the case study →
Long-running agents analyzing construction drawings with prompts approaching a million tokens.
4.7% token reduction
~47K tokens saved per request
Hours of agent run time
Read the case study →
Intelligent semantic processing
The bear-1 and bear-1.2 models process tokens based on context and semantic intent. Compression is deterministic and low-latency.
One API call
Send text in, get compressed text back. Drop it in before your LLM call. That's the entire integration.
"model": "bear-1.1",
"input": "Your long text to compress..."
}
"output": "Compressed text...",
"original_input_tokens": 1284,
"output_tokens": 436
}
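
A minimal Python sketch of that drop-in, using the requests library; the endpoint URL and auth header are placeholders, and the field names follow the example above:

import requests

long_prompt = "Your long text to compress..."

# Placeholder endpoint and key; request/response fields match the example above.
resp = requests.post(
    "https://api.example.com/v1/compress",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "bear-1.1", "input": long_prompt},
)
compressed_prompt = resp.json()["output"]
# compressed_prompt now goes wherever the original prompt did,
# e.g. as the user message in your downstream LLM call.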
Benchmarks
More benchmarks coming soon
We are evaluating compression across additional domains and model families. Results will be published here as they are completed.
Start compressing
Use cases
LLM Entertainment & Gaming
Longer memories, richer worlds, same budget.
Meeting Transcription
Distill hours of calls into signal-dense context.
Web Scraping
Strip boilerplate from crawled pages before ingest.
Document Analysis
Fit more PDFs and reports into one context window.
Frequently asked questions
Compression, costs, accuracy, and how this fits into an existing LLM stack.
How can I reduce my OpenAI API bill?
Most production apps spend 60-80% of their LLM bill on input tokens, not output: system prompts, conversation history, retrieved context. So shrinking input is where there's the most to save. bear-1.2 removes tokens the downstream model wasn't using anyway. At standard settings that's around two-thirds off, with accuracy at the uncompressed baseline.
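
A rough per-request sketch of where that saving comes from, with illustrative token counts and prices rather than a quote:

# Illustrative request: 6,000 input tokens, 500 output tokens,
# priced at $2.50 / 1M input and $10 / 1M output tokens.
input_cost  = 6_000 * 2.50 / 1_000_000   # $0.0150
output_cost =   500 * 10.0 / 1_000_000   # $0.0050 -> input is 75% of the bill
# Compressing input by roughly two-thirds:
compressed_input_cost = 2_000 * 2.50 / 1_000_000            # $0.0050
savings_per_request   = input_cost - compressed_input_cost  # $0.0100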
What is prompt compression?
A learned transformation that makes your prompt shorter while keeping the parts the model actually uses to answer. Summarization paraphrases your prompt in new language, which throws away verbatim details. Truncation just drops the tail. Compression is different. A model trained for the job picks out redundant tokens (boilerplate, filler, restated context) and removes them, so the result is shorter and cheaper but still produces the same answer from GPT, Claude, or anything else.
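
A made-up before/after to illustrate the idea (not actual bear-1.2 output):

original   = ("As mentioned previously in the section above, please note that "
              "Q3 revenue came in at a total of $4.2M.")
compressed = "Q3 revenue came in at $4.2M."
# The label and the exact figure survive verbatim; the filler and restated
# context ("as mentioned previously", "please note that") are removed.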
Does compressing prompts hurt accuracy?
It depends on how aggressively you compress. Run aggressive compression (around two-thirds off) and accuracy stays at the uncompressed baseline. Compress lightly and accuracy actually goes up by several points on standard evals. The reason it works: most of the tokens being stripped are ones the model was already ignoring, so what's left has a better signal-to-noise ratio. To check either mode on your own workload, run your existing eval suite on compressed inputs and compare.
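
A minimal sketch of that comparison, assuming you already have a compress() helper wrapping the API call, a run_llm() wrapper for your downstream model, and a list of (prompt, expected answer) pairs:

def accuracy(cases, transform=lambda p: p):
    # cases is a list of (prompt, expected_answer) pairs; run_llm() and
    # compress() are placeholders for your own model and compression calls.
    hits = sum(expected in run_llm(transform(prompt)) for prompt, expected in cases)
    return hits / len(cases)

baseline_acc   = accuracy(eval_cases)
compressed_acc = accuracy(eval_cases, transform=compress)
print(f"baseline={baseline_acc:.3f}  compressed={compressed_acc:.3f}")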
Will prompt compression work with GPT, Claude, and Gemini?
Yes. The API transforms the text of your prompt, so anything you can pass as a string to a chat completion endpoint works. We've tested it against OpenAI (GPT-4o, GPT-5 family), Anthropic (Claude Sonnet 4.5 / 4.6, Opus), Google (Gemini 2.5 Pro), and open models like Llama and Qwen. You're not locked into any provider.
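
For example, with the OpenAI Python SDK the compressed string is just the message content, and the same pattern applies to any other provider's chat endpoint (compressed_prompt here is the "output" field returned by the compression call shown earlier):

from openai import OpenAI

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": compressed_prompt}],  # compressed text in, nothing else changes
)
print(reply.choices[0].message.content)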
How fast is the compression API?
Around 6ms of compression overhead for a 10K-token prompt, and well under 120ms even at 200K. The latency cost usually pays for itself: a shorter prompt means faster time-to-first-token from the downstream LLM, so end-to-end round-trip often goes down with compression in the loop, not up.
Is the OpenAI API too expensive at scale?
It can be. A B2B product with 10K daily users sending around 8 messages a day at typical prompt sizes lands near $35-40K/month, and that grows with traffic. Most of the spend is input tokens, which is what compression cuts. Trimmed conversation history and shorter outputs help on top, but input is the biggest line.
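
One illustrative way the arithmetic lands in that range (assumed prompt size and list price, not the exact model behind the figure above):

users             = 10_000
messages_per_day  = 8
input_tokens_each = 6_000        # assumed "typical" prompt size
price_per_m_input = 2.50         # $ per 1M input tokens, illustrative
monthly_tokens = users * messages_per_day * 30 * input_tokens_each  # 14.4B tokens
monthly_cost   = monthly_tokens / 1_000_000 * price_per_m_input     # ~$36,000 / month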
How does pricing work?
We charge per token saved, not per token sent. If a 10M-token prompt comes out at 6M after compression, you pay for the 4M we removed, not the 10M you sent in. There's a free tier to start, a Pro plan at $0.30 per million tokens saved, and Enterprise pricing for higher volumes. Full numbers and a worked example are on the pricing page.
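
With the numbers from the answer above at the Pro rate:

tokens_sent     = 10_000_000
tokens_returned =  6_000_000
tokens_saved    = tokens_sent - tokens_returned         # 4,000,000
pro_rate        = 0.30                                   # $ per 1M tokens saved
compression_fee = tokens_saved / 1_000_000 * pro_rate    # $1.20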
How is this different from summarization or truncation?
Summarization rewrites your prompt in new language, which loses verbatim details the model often needs: proper nouns, numbers, code, exact phrasings. Truncation just drops the tail and loses whatever was there. Compression keeps the tokens the model cares about and removes the ones it doesn't, character-for-character from your original input. The exact details survive.
Where does compression help the most?
Long chatbot conversations are the canonical fit. Every new turn re-ships the prior turns, so token count grows linearly with conversation length. Agent loops have the same shape on a smaller timescale, with each iteration re-reading the reasoning trace. The other big bucket is document work: search, retrieval, PDF ingestion, scraped HTML. All of that is mostly boilerplate that compresses cleanly.
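
A rough sketch of the chatbot pattern, again assuming a compress() helper for the API call and a run_llm() wrapper for the downstream model:

history = []

def send(user_message):
    history.append({"role": "user", "content": user_message})
    # Every turn re-ships the prior turns, so compress the packed history
    # before it goes back into the context window.
    packed = compress("\n".join(m["content"] for m in history))
    reply = run_llm(packed)
    history.append({"role": "assistant", "content": reply})
    return reply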
Ready to compress?
Access the compression API.