As large language models (LLMs) have grown in power, so too has the challenge of avoiding hallucinations and simple factual errors. Enter Probably, a new startup that just raised $9 million from Andreessen Horowitz and is committed to building more rigorous error-checking mechanisms for AI.
The company's approach is a "data science mech suit": an elaborate harness system designed to catch these errors early. The LLM’s initial answers are checked by a deterministic validator, which ensures that no result contradicts the underlying dataset. The system is optimized for speed and accuracy, allowing Probably's data science tool to run on significantly smaller AI models than those used by leading labs.
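Probably has not published how its validator works, so the following is only a minimal sketch of the general idea, not the company's implementation: a model's numeric claim about a dataset is recomputed deterministically and rejected if the two disagree. The `Claim` type, the `validate` function, and the sample rows are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical dataset standing in for whatever table the model was asked about.
ROWS = [
    {"region": "east", "revenue": 120.0},
    {"region": "west", "revenue": 80.0},
    {"region": "east", "revenue": 100.0},
]

# Deterministic aggregate functions the validator knows how to recompute.
AGGREGATES = {"mean": mean, "sum": sum, "min": min, "max": max}


@dataclass
class Claim:
    """A structured answer extracted from the model's output (assumed format)."""
    metric: str    # one of the keys in AGGREGATES
    column: str    # column the metric was computed over
    value: float   # number the model reported


def validate(claim: Claim, rows: list, tol: float = 1e-9) -> bool:
    """Recompute the claimed metric from the data and compare within a tolerance."""
    if claim.metric not in AGGREGATES:
        raise ValueError(f"unsupported metric: {claim.metric}")
    values = [row[claim.column] for row in rows]
    expected = AGGREGATES[claim.metric](values)
    return abs(expected - claim.value) <= tol


# A correct claim passes; a slightly wrong one is caught without another model call.
good = validate(Claim("mean", "revenue", 100.0), ROWS)
bad = validate(Claim("sum", "revenue", 310.0), ROWS)
```

Because the check is plain arithmetic rather than a second model call, it returns the same verdict every time and costs no tokens, which is presumably the kind of property that lets a harness compensate for a weaker model.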
"What we learned building this was that the better your harness engineering is, the weaker the model can be," says founder Peter Elias. "If you can refine the context enough, the model does not have to work very hard to do the right thing." The approach has two main benefits: it reduces token costs, making AI more accessible, and it paves the way for use in precision-sensitive fields such as accounting or medical services.
"I think it's really interesting that the big AI labs have not even attempted to do this," Elias notes. "They're incentivized not to, because they make money the more times you have to correct the model."







