OpenAI researchers released PaperBench, a new dataset and benchmark that tests large language models’ ability to recreate AI and machine learning papers from scratch, including understanding the related research, building and executing a codebase, and reproducing the paper’s results. Claude 3.5 Sonnet led all tested models with an average replication score of 21 percent, graded against rubrics that break each paper into sub-components. OpenAI noted that none of the tested models outperformed the human baseline. The benchmark could be useful for evaluating models’ overall ability to replicate technical research or perform AI engineering, but public test materials like top-conference papers often end up in models’ training data, which could contaminate future results. (OpenAI and GitHub)
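To illustrate the kind of rubric-style grading described above, here is a minimal sketch of how a paper’s replication score might be rolled up as a weighted average over judged sub-components. The `RubricNode` structure, the weights, and the scores are hypothetical, assumed for illustration only; they are not OpenAI’s implementation.

```python
from dataclasses import dataclass, field

# Hypothetical rubric node: a leaf holds a judge's score in [0, 1],
# an internal node aggregates its children by weight.
@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    score: float | None = None                      # leaf score assigned by a judge
    children: list["RubricNode"] = field(default_factory=list)

    def replication_score(self) -> float:
        # Leaf node: return the judged score directly.
        if not self.children:
            return self.score or 0.0
        # Internal node: weighted average of child scores.
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.replication_score() for c in self.children) / total_weight

# Example: one paper broken into sub-components (all names and values illustrative).
paper = RubricNode("example-paper", children=[
    RubricNode("code development", weight=2, children=[
        RubricNode("data pipeline", score=1.0),
        RubricNode("training loop", score=0.0),
    ]),
    RubricNode("execution", weight=1, score=0.5),
    RubricNode("result match", weight=1, score=0.0),
])

print(f"Replication score: {paper.replication_score():.0%}")  # weighted roll-up
```

Scoring leaves independently and averaging upward is one way to give partial credit for partially successful replications, rather than marking an entire paper pass or fail.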