Evaluating 35 open-weight models across three context lengths (32K, 128K, 200K), four temperatures, and three hardware platforms—consuming 172 billion tokens across more than 4,000 runs—we find that the answer is “substantially, and unavoidably.” Even under optimal conditions—best model, with the temperature chosen specifically to minimize fabrication—the floor is non-zero and rises steeply with context length. At 32K, the best model (GLM 4.5) fabricates 1.19% of answers, top-tier models fabricate 5–7%, and the median model fabricates roughly 25%.


Calculators are correct 100% of the time.
Calculators are not people, Mr. <1.19%.
That’s right! We should be comparing computers to computers. Well, hardware computers, not people computers.
Calculators are not computers. Computers contain calculator-like elements, but a calculator is no more a computer than a passenger jet is a coffee shop by virtue of having a coffee pot onboard.
Calculators cannot fabricate answers, but neither are they 100% correct, due to things like bit flips and square-root approximations. They also cannot write text, so the comparison would make even less sense.
LLMs and humans can both fabricate answers in written text, so comparing an LLM’s fabrication rate in written text to a human’s (both entities generate their answers with neural networks) makes more sense than comparing either to a calculator, which neither uses a neural network nor produces text.
So ‘we’ should compare like things and not choose items based on superficial similarities.
What do you even mean? Calculators and LLMs are solving different problems. And there are a lot of calculators and a lot of LLMs. Also, calculator accuracy could be approaching 0%, because they all have limited precision and there are infinitely many numbers. Some calculators can’t even correctly answer 0.1+0.2, while most LLMs can.
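The 0.1+0.2 point is easy to verify yourself. A quick Python sketch (Python floats are IEEE-754 doubles, the same binary representation many software calculators use; the exact digits shown are specific to that format):

```python
# Neither 0.1 nor 0.2 has an exact binary (base-2) representation,
# so their IEEE-754 double sum is not exactly 0.3.
result = 0.1 + 0.2
print(result)           # 0.30000000000000004
print(result == 0.3)    # False

# A decimal type stores base-10 digits exactly and sidesteps the issue,
# which is why serious calculator software often uses decimal arithmetic.
from decimal import Decimal
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```

Calculators that do this correctly are the ones using decimal (or symbolic) arithmetic internally rather than binary floats.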