Every couple of months, a new AI model climbs a benchmark leaderboard. Companies cite the numbers in press releases, investors use them to justify valuations, and engineers use them to pick which model to deploy. A new project out of UC Berkeley helps confirm something many had suspected: leaderboards can be less reliable than they…
