Is AI Actually Any Good?
Why Your Opinion Matters More Than The Experts
Check out the companion video to this blog post:
Intro
Every week, someone claims they've built the best AI model ever. This week? It happened three times. But here's the thing - while the tech world obsesses over benchmarks and metrics, they're missing something crucial: your opinion might matter more than theirs. Let's talk about why, what actually happened this week, and most importantly, what you can do about it.
When Everyone's the Best, Who Do You Trust?
The AI world has a measurement problem. Every company has its own way of proving its AI is the 'best'; they call these benchmarks and evaluations. Think of them like restaurant reviews: some people care about Michelin stars, others won't set foot somewhere without checking Yelp, and then there's your personal taste. In the same way, AI models appeal differently depending on what you value most: speed, accuracy, creativity, usability. This week, we saw a flood of new AI models and tools, each claiming the top spot.
What Happened This Week
More Models were Released & New Tools Dropped
- Gemini tops the leaderboard (11/14), OpenAI responds with GPT-4o (11/20), then Gemini says, 'Hold My Benchmark' (11/21)
- Was Google playing 4D chess?
- Open-source kept pushing forward: new models that are great at math, DeepSeek’s latest model that can reason like OpenAI's o1, and Mistral’s newest multimodal model (understands pictures & text), Pixtral Large, which you can try in their new chat interface, Le Chat
- One of the best-known AI website generators, v0.dev, ‘will no longer be lazy’ and now makes deployment easier, among other upgrades, while a new competitor, Lovable.dev, launches
That’s A Lot… How Do We Know If They’re Any Good?
- Tech Twitter asking the real questions
- OpenAI bringing evaluations to their dashboard (demo video included)
- Use your own test data to compare model performance, iterate on prompts, & improve outputs
- Humanloop launched enterprise-grade testing
- Anthropic published a blog post with recommendations for statistically sounder benchmarks
- New coding leaderboards
My Take
If it were up to you, how would you measure intelligence?
IQ tests? Problem-solving ability? Maybe how well someone can explain complex ideas simply? With LLMs, we try to measure them with benchmarks. The Open LLM Leaderboard on Hugging Face was one of the first big leaderboards, testing everything from basic math to advanced reasoning. The speed vs. quality tradeoff is also a big deal, especially at scale.
Pretty quickly it became obvious that these benchmarks don’t tell the full story. How could they? We are still trying to figure out how large language models work. And even though LLMs sometimes really are reasoning rather than just retrieving facts, there are definitely teams that feed their models the test answers just to climb the leaderboards.
That leads to models like Microsoft's Phi, which is notorious for crushing benchmarks but definitely NOT passing the vibe check…
But what if, instead of tests, we just asked people which they like better?
Chatbot Arena did just that and quickly became one of the leading authorities on model quality.
This is one step towards a ‘Real World Revolution’ and why YOUR opinion matters so much.
It’s why it matters that Gemini 11/21 ‘passed the vibe check.’
AI is only good if it’s good for you.
Even though the new o1 seems to dominate in math and coding, and even outperforms PhD students in science, humans prefer the cheaper, faster GPT-4o for writing and editing. This explains the ‘disconnect between [Dr. Derya Unutmaz’s] incredible experience with ChatGPT, particularly the o1-preview model vs others who feel there’s been little advance’: his day-to-day is PhD-level science, not writing and editing.
Meanwhile, Ethan Mollick, a professor at Wharton, says “this may sound odd, but game-based benchmarks are some of the most useful for AI, since we have human scores and they require reasoning, planning & vision.” Mollick tested Claude's new computer-use capabilities by seeing if it could play Magic: The Gathering (not quite), and others like Adonis Singh have been pioneering a new Minecraft building benchmark. People really seem to like seeing which model can build something cooler given the same prompt.
So what matters to you?
Writing? Coding? Creativity? Cost? Video Game Skills?
What Can You Do Today?
How do you figure out which model is right for you?
I've been testing AI models pretty obsessively. My benchmarks have evolved over time; here's what I'm using right now:
- Code Generation: Can these AI code tools build a Line Rider clone? This isn't just about writing code - it tests understanding of physics, game mechanics, and user experience.
Prompt
- Create a web-based Line Rider clone with a minimalist modern UI using HTML5 Canvas. Include:
  - Smooth line drawing tools with configurable line types (normal, acceleration, scenery)
  - Physics engine for a sledding character with realistic momentum and gravity
  - Clean, dark theme interface with intuitive drawing tools
- Focus on making the drawing experience fluid and the physics satisfying.
- Creativity: Can it help create a Manim animation explaining a blog post? This tests both technical accuracy and creative thinking. Plus, I think visualization is huge for understanding complex ideas.
Prompt
- Create a Manim animation to visualize the following blog post's content. Use:
  - Opening title scene with blog name/author/date
  - Dynamic text animations for key quotes and concepts, fading in key phrases sequentially
  - Visual metaphors and diagrams that emerge as concepts are introduced
  - Natural scene transitions that follow the post's logical flow
  - Clean typography with consistent hierarchy for headings/body text
  - Subtle visual emphasis (scale/color) on critical points
- Style: Modern and minimal, with smooth camera work and deliberate pacing to aid comprehension.
[INSERT BLOG POST]
[Repeat the Instructions]
- Content: Can it take my blog post and turn it into an engaging Twitter thread while keeping my voice? This tells me if it really understands context and style, not just words.
Prompt
- Transform the following blog post into an engaging Twitter thread that maintains the author's distinct voice and writing style. Analyze for:
  - Key tonal markers (casual/formal, humor type, metaphor style)
  - Signature phrases and linguistic patterns
  - Natural break points that maintain narrative tension
  - Core ideas that deserve their own tweet vs support details
  - Places where original blog personality shines through
- Format as sequential tweets with "(1/X)" notation, preserving voice-specific elements like em dashes, parentheticals, or characteristic interjections. Weave in relevant engagement hooks without breaking voice authenticity.
[INSERT BLOG POST]
[Repeat the Instructions]
But these aren't the only tests, and they probably won't be my benchmarks next month. That's the point - your benchmarks should evolve with your needs.
Here's how you can start testing what matters to you:
Beginner: Start Simple
- Pick one task you do repeatedly (writing emails, analyzing spreadsheets, brainstorming ideas)
- Try it in different models (Claude vs ChatGPT vs Gemini, or even the new Le Chat)
- Keep a note with prompts that work well
- Test new models as they come out using those same prompts (a minimal script for this is sketched below)
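If you want to make that beginner loop repeatable, here's a minimal sketch in Python using the official openai package. The model names, prompts, and output file are placeholders; swap in whatever you actually use, and other providers like Anthropic and Google have similar SDKs if you want Claude or Gemini in the mix.

```python
# compare_models.py - run the same saved prompts against a few models
# and dump the answers side by side for your own vibe check.
# Assumes `pip install openai` and OPENAI_API_KEY set in your environment.
import json
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "Summarize this email thread in three bullet points: ...",
    "Brainstorm five blog post titles about AI benchmarks.",
]
MODELS = ["gpt-4o", "gpt-4o-mini"]  # placeholders: use whatever models you have access to

results = []
for prompt in PROMPTS:
    row = {"prompt": prompt}
    for model in MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        row[model] = response.choices[0].message.content
    results.append(row)

# Save everything so you can rerun the exact same prompts on next month's models.
with open("model_comparison.json", "w") as f:
    json.dump(results, f, indent=2)
```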
Intermediate: Use Built-in Tools
OpenAI just made testing way easier with their new dashboard evals:
- Generate a test dataset
- Define and run evals against your dataset
- Tweak your prompt or fine-tune your model
- Repeat until you're happy with the results 🚀
You might like having their tools, or you might find that the simplest approach works best for you.
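Whatever eval tool you end up in, the test dataset itself is usually just a file of inputs paired with expected outputs. Here's a hypothetical example written out as JSONL from Python; the field names are illustrative, not an official OpenAI schema, so match them to whatever your tool expects.

```python
# make_dataset.py - build a tiny test dataset from prompts you already know work.
# Field names here are placeholders; adjust them to your eval tool's schema.
import json

test_cases = [
    {
        "input": "Rewrite this sentence to be more concise: 'In the event that it rains, the picnic will be cancelled.'",
        "expected": "If it rains, the picnic is cancelled.",
    },
    {
        "input": "Extract the total from: 'Invoice #1042, 3 items, total due $187.50.'",
        "expected": "$187.50",
    },
]

# JSONL (one JSON object per line) is a common upload format for eval datasets.
with open("test_dataset.jsonl", "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```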
Advanced: Scale Your Testing
- Looking for enterprise-grade testing? Check out Humanloop's new suite
- Want to build your own? Hamel Husain's guide on "Creating a LLM-as-a-Judge" is packed with strategies from 30+ AI implementations; a bare-bones version of the idea is sketched below
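To give a flavor of the LLM-as-a-judge idea (a simplified sketch, not Hamel's exact recipe): one model produces an answer, then a second call grades it against criteria you define. The judge prompt, model names, and PASS/FAIL scale below are assumptions you'd tune for your own use case.

```python
# judge.py - minimal LLM-as-a-judge loop: generate an answer, then have a model grade it.
# Illustrative only, not a production eval harness. Assumes `pip install openai`.
from openai import OpenAI

client = OpenAI()

def generate(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def judge(task: str, answer: str, criteria: str) -> str:
    # The judge sees the task, the answer, and YOUR criteria, then returns PASS or FAIL.
    judge_prompt = (
        f"Task: {task}\n\nAnswer: {answer}\n\nCriteria: {criteria}\n\n"
        "Reply with exactly one word, PASS or FAIL, based on the criteria."
    )
    return generate(judge_prompt, model="gpt-4o")

task = "Turn this blog post intro into a single engaging tweet: ..."
answer = generate(task)
verdict = judge(task, answer, "Keeps the author's casual voice and fits in 280 characters.")
print(answer, "\n->", verdict)
```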
Drop a comment below - what do YOU need AI to be good at? Let's build a library of real-world benchmarks together.
More Than Just Numbers
When everyone claims to be the best, trust your own metrics.
When benchmarks conflict, trust your experience.
Cultivate your expertise and let it shine.
I’m making content because I believe AI isn't the revolution - YOU are.
AI is not a product, it’s not a feature, it’s an accelerant.
What are you accelerating towards?
P.S.
Thank you for the warm response & words of encouragement on my first post and video, it means a lot.
The first YouTube video I ever made was a handheld camera recording of Line Rider. I published it 15 years ago, and it got 55 views. Creating content to entertain and empower people has always been a dream on the back burner, and for the first time it seems like that might just be possible…
For now, I’ll see you next week.
Al
November 26, 2024