How AI Testing and Better Questions Can Make Artificial Intelligence More Useful
Artificial Intelligence (AI) is getting smarter every day, but how do we know how well it actually works? A group of researchers from the Wharton School of Business studied how we test AI and how the way we ask questions (the prompting process) affects its answers. Their findings show that testing AI properly and asking the right questions can make AI much more useful in real life.
Let’s break this down in simple terms and see why it matters to businesses, workers, and everyday people.
How Do We Know if AI Is "Good Enough"?
The Importance of Proper Testing
Imagine you hire someone for a job, but you only test them on one small task before deciding they’re perfect for the role. That wouldn’t make sense, right? The same goes for AI. The researchers found that different companies and industries measure AI performance in different ways. This means that how we test AI can change how "good" it looks.
For businesses, this means they need to make sure their tests match what they actually need AI to do. If they pick the wrong test, they might think an AI is great when it actually makes too many mistakes for the job.
Different Standards for Different Situations
The researchers found three main ways to decide if AI is "good enough":
Perfect Score (100% correct answers) – This is like hiring a brain surgeon. You wouldn't want them to be "mostly correct"; they need to be right every time.
Very Good (90% correct answers) – This is like a customer service chatbot. It should be mostly right, but small mistakes are acceptable.
Just Over Half Right (51% correct answers) – This is like a weather forecast. It doesn’t have to be perfect, but it should be right more often than not.
By using the right standard for the job, businesses can avoid disappointment and get the most out of AI.
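To make the idea concrete, here is a minimal sketch in Python of how a business might encode these three standards and check whether a measured accuracy clears the bar. The use-case names, the threshold mapping, and the 92% accuracy figure are made-up placeholders for illustration, not numbers from the study.

```python
# Minimal sketch: checking a measured accuracy against the standard the job requires.
# The thresholds mirror the three standards described above; the accuracy figure is
# a made-up placeholder, not a result from the study.

THRESHOLDS = {
    "surgical_assistant": 1.00,      # "perfect score" jobs: every answer must be right
    "customer_chatbot": 0.90,        # "very good" jobs: small mistakes are acceptable
    "weather_style_forecast": 0.51,  # "just over half right" jobs
}

def good_enough(use_case: str, measured_accuracy: float) -> bool:
    """Return True if the measured accuracy meets the bar for this use case."""
    return measured_accuracy >= THRESHOLDS[use_case]

if __name__ == "__main__":
    measured = 0.92  # hypothetical accuracy from your own testing
    for job in THRESHOLDS:
        verdict = "good enough" if good_enough(job, measured) else "not good enough"
        print(f"{job}: {verdict}")
```

The same 92% score passes the chatbot and forecast bars but fails the "perfect score" bar, which is exactly why the standard has to be chosen before the test.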
Why Asking the Same Question Multiple Times Matters
The researchers didn’t just test AI once; they asked the same questions 100 times, and in several different ways. Why? Because AI doesn’t always give the same answer, even when asked the same question.
(Think of an AI like a person answering a tricky riddle. If you ask them once, they might guess. If you ask them 100 times, you can see whether they really understand or just got lucky.)
This means that businesses should test AI multiple times before trusting it, especially for important tasks.
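For readers who want to see what "asking the same question many times" could look like in practice, here is a rough Python sketch. The ask_model function is a stand-in for a real AI call (it just guesses randomly so the example runs on its own); the point is the repeated-trial loop and the accuracy rate it reports.

```python
import random

# Sketch of repeated testing: ask the same question many times and report how often
# the answer is correct. ask_model is a placeholder, not a real AI service call.

def ask_model(question: str) -> str:
    """Placeholder for a real model call; here it just picks a random option."""
    return random.choice(["A", "B", "C", "D"])

def repeated_accuracy(question: str, correct: str, trials: int = 100) -> float:
    """Ask the same question `trials` times and return the fraction answered correctly."""
    hits = sum(ask_model(question) == correct for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    score = repeated_accuracy("Which option is correct? (A) (B) (C) (D)", correct="B")
    print(f"Correct on {score:.0%} of 100 tries")
```

A single lucky (or unlucky) answer tells you very little; the rate over many tries is what you would actually trust.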
How the Way You Ask Questions Affects AI’s Answers
The Right Way to Ask Questions
One surprising discovery was that politeness sometimes helped AI give better answers but sometimes made it worse!
(Imagine you’re asking a coworker for help. If you say, "Please, could you kindly help me with this?" they might take longer to reply. But if you say, "What’s the answer?" they might respond quickly. AI can be the same way—extra words can sometimes help, but other times they just confuse it.)
The key takeaway is that businesses should test different ways of asking questions instead of assuming that one style always works best.
Why Formatting Matters
Another major finding was that how you format a question can change AI’s performance.
(Think of a recipe. If it's written as a long paragraph, it’s harder to follow. But if it's in a step-by-step list, it's easy to use. AI works the same way—if a question is clear and structured, it gives better answers.)
For companies using AI, this means that making small changes in how they ask questions can lead to big improvements in AI performance.
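Putting the last two points together, a simple way to act on them is to run the same question through a few different phrasings (polite, terse, structured) and compare the results. The sketch below uses the same random-guessing stand-in as before, so its numbers are illustrative only; with a real model, the differences between variants are what you would watch for.

```python
import random

# Sketch of comparing prompt styles: three variants wrap the same underlying question
# politely, tersely, or as a structured list. ask_model is again a placeholder.

def ask_model(prompt: str) -> str:
    """Placeholder for a real model call; here it just picks a random option."""
    return random.choice(["A", "B", "C", "D"])

VARIANTS = {
    "polite": "Please, could you kindly tell me which option is correct?\n{q}",
    "terse": "Answer with one letter.\n{q}",
    "structured": "Task: pick the correct option.\nFormat: reply with a single letter.\nQuestion:\n{q}",
}

def compare_variants(question: str, correct: str, trials: int = 100) -> dict:
    """Run each prompt variant `trials` times and return its accuracy."""
    results = {}
    for name, template in VARIANTS.items():
        prompt = template.format(q=question)
        results[name] = sum(ask_model(prompt) == correct for _ in range(trials)) / trials
    return results

if __name__ == "__main__":
    for name, acc in compare_variants("(A) (B) (C) (D)", correct="C").items():
        print(f"{name}: {acc:.0%}")
```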
Are There Magic Tricks for Better AI?
Some people think there are "secret tricks" that always make AI work better. But the research found that while small tweaks help for certain questions, they don’t always improve AI across the board.
(Think of it like studying for a test. Memorizing answers might help with some questions, but understanding the subject is what really improves your overall score.)
For businesses, this means that while improving AI prompts is useful, the most important thing is choosing a high-quality AI model in the first place.
What This Means for Businesses and Policymakers
Making AI Work for Your Business
If a company wants to use AI, it needs to:
Pick the right test for the job (don't expect AI to be perfect if it doesn’t need to be).
Test AI multiple times instead of trusting a single answer.
Experiment with different ways of asking questions to get better results.
Better AI Rules and Guidelines
For government leaders and policymakers, these findings show that AI rules shouldn’t be based on just one test. Instead, policies should consider:
The fact that AI performs differently depending on how it’s tested.
The need for clear guidelines on AI accuracy for different industries.
Investing in Smarter AI
The study also shows that improving AI isn’t just about asking better questions; it’s about making the AI itself better. Businesses and researchers need to keep investing in AI to make sure it works well in all situations.
Conclusion
This research shows that testing AI properly and asking better questions can make AI much more useful. Whether you’re a business owner, a policymaker, or just someone curious about AI, these lessons can help you understand how to get the best results from artificial intelligence.
What’s Next?
Testing More Advanced AI
This study focused on GPT-4o and GPT-4o-mini, but other AI models might behave differently. More research is needed to see if these findings apply to the next generation of AI.
Creating Better AI Tests
Right now, different companies use different tests for AI. In the future, there may be a need for a universal test that works across all AI systems.
Improving the Way We Ask Questions
Even though no "magic prompt" works all the time, researchers can still study which prompts work best for specific situations. This could make AI even more reliable in real-world use.
The full paper can be found via the reference below.
References
Meincke, L., Mollick, E., Mollick, L., & Shapiro, D. (2024). Prompting Science Report 1: Prompt Engineering is Complicated and Contingent. Generative AI Labs, The Wharton School of Business, University of Pennsylvania.
Miller, E. (2024). Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. arXiv:2411.00640.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2024). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. First Conference on Language Modeling.