I don’t buy it. Not until I can test it, hands on.
So many LLM papers have amazing (and replicated) results in testing, yet fall apart in the real world outside of the same lab tests everyone uses. Research is overfit to hell.
And that’s giving them the benefit of the doubt, assuming they didn’t train on the test set in one form or another. Like how Llama 4 technically aced LM Arena because they finetuned it to.
It looks like Pangram specifically holds back 4 million documents during training, and also tests against a corpus of “out of domain” documents that don’t even share a style with the training data.
I’m surprised at how well it does; I really wonder what the model is picking out. I wonder if it’s somehow the same “uncanny valley” signal that we get from AI generated text sometimes.
To show that our model is able to generalize outside of its training domain, we hold out all email from our training
set and evaluate our model on the entire Enron email dataset, which was released publicly as a dataset for researchers
following the extrication of the emails of all Enron executives in the legal proceedings in the wake of the company’s
collapse.
Our model with email held out achieves a false positive rate of 0.8% on the Enron email dataset after hard negative
mining, compared to our competitors (who may or may not have email in their training sets) which demonstrate a
FPR of at least 2%. After generating AI examples based on the Enron emails, we find that our false negative rate is
around 2%. We show an overall accuracy of 98% compared to GPTZero and Originality which perform at 89% and
91% respectively.
and
We exclude 4 million examples from our training pool as a holdout set to evaluate false positive rates following
calibration on the above benchmark.
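For anyone fuzzy on how those numbers relate: a minimal sketch of the holdout-style evaluation the quoted passage describes. The function name and the toy predictor are my own illustration, not Pangram’s code; `predict` stands in for whatever detector you want to score.

```python
def evaluate_detector(predict, human_docs, ai_docs):
    """Score a binary AI-text detector on a holdout set.

    predict(doc) -> True if the detector flags doc as AI-generated.
    Returns (false_positive_rate, false_negative_rate, accuracy).
    """
    # False positives: human-written documents wrongly flagged as AI.
    fp = sum(1 for d in human_docs if predict(d))
    # False negatives: AI-generated documents that slip through.
    fn = sum(1 for d in ai_docs if not predict(d))

    fpr = fp / len(human_docs)
    fnr = fn / len(ai_docs)
    correct = (len(human_docs) - fp) + (len(ai_docs) - fn)
    accuracy = correct / (len(human_docs) + len(ai_docs))
    return fpr, fnr, accuracy
```

So the paper’s claim amounts to: on the held-out Enron emails, `fpr` ≈ 0.008 after hard negative mining, `fnr` ≈ 0.02 on AI rewrites of those emails, and overall `accuracy` ≈ 0.98. Note the denominators matter — a 0.8% FPR on millions of documents is still a lot of falsely accused humans in absolute terms.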