Google's new study says most AI benchmarks miss the point: human raters disagree a lot, and using fewer than 10 judges skews results. Time to rethink model evaluation! #AIBenchmarks #HumanDisagreement #RaterCount
🔗 aidailypost.com/news/google-...