Good morning everyone! In this iteration, we are talking about LLM evaluations.
We constantly see LLMs beating every benchmark, like the recent mysterious "gpt2-chatbot" that outperformed all other models and turned out to be GPT-4o.
You may have heard similar claims about models outperforming others on popular benchmarks, like those on the Hugging Face leaderboard, where models are evaluated across a variety of tasks. But how exactly can we determine which LLM is superior? Isn’t it just generating words and ideas? How can we know one is better than another?
Let’s answer that in this week's video (or article version):
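To make the idea concrete before you watch, here's a rough sketch of what benchmark-style scoring boils down to: collect reference answers, collect each model's outputs, and compute a metric such as exact-match accuracy. Everything below (the questions, answers, and model outputs) is hypothetical placeholder data, not a real leaderboard pipeline:

```python
# Minimal sketch of benchmark-style comparison: score each model's answers
# against references with exact-match accuracy. The questions, reference
# answers, and model outputs below are hypothetical placeholders.

benchmark = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many continents are there?", "answer": "7"},
    {"question": "Who wrote '1984'?", "answer": "George Orwell"},
]

# Hypothetical outputs, as if each model had answered the questions above.
model_outputs = {
    "model_a": ["Paris", "7", "Aldous Huxley"],
    "model_b": ["paris", "seven", "George Orwell"],
}

def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching the reference after light normalization."""
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

references = [item["answer"] for item in benchmark]
for name, outputs in model_outputs.items():
    print(f"{name}: {exact_match_accuracy(outputs, references):.0%} exact match")
```

Real benchmarks, like the ones behind the Hugging Face leaderboard, use far more tasks and more careful answer normalization, but the principle is the same: a fixed dataset, a metric, and a score you can compare across models.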
And that's it for this iteration! I'm incredibly grateful that the What's AI newsletter is now read by over 17,000 amazing human beings. Click here to share this iteration with a friend if you learned something new!
Looking for more cool AI stuff? 👇
Looking for AI news, code, learning resources, papers, memes, and more? Follow our weekly newsletter at Towards AI!
Looking to connect with other AI enthusiasts? Join the Discord community: Learn AI Together!
Want to share a product, event, or course with my AI community? Reply directly to this email, or visit my Passionfroot profile to see my offers.
Thank you for reading, and I wish you a fantastic week! Be sure to get enough sleep and physical activity next week!
Louis-François Bouchard