September 12, 2024
OpenAI released a preview of the new o1 model, which has reasoning ability. Previous LLMs predict the most probable next word and give an answer very quickly, without self-inspection. The new level of performance comes from letting the model think longer, searching over and evaluating possible responses for correctness before answering.
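OpenAI has not published how o1 does this internally, but a minimal sketch of the general "spend more compute to generate and check candidates" idea might look like the following best-of-n loop using the OpenAI Python SDK. The model choice, prompt wording, and 0-10 grading scheme here are illustrative assumptions, not OpenAI's method.

```python
# Hedged sketch: sample several candidate answers, have the model grade
# each one for correctness, and return the best. This approximates the
# idea of evaluating possible responses before answering; it is not
# o1's actual (undisclosed) mechanism.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def best_of_n(question: str, n: int = 4) -> str:
    # 1. Sample n candidate answers at non-zero temperature.
    candidates = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        n=n,
        temperature=1.0,
    ).choices

    # 2. Ask the model to grade each candidate's correctness (0-10).
    def score(answer: str) -> int:
        grade = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Question: {question}\nAnswer: {answer}\n"
                           "Rate the answer's correctness from 0 to 10. "
                           "Reply with a single integer.",
            }],
            temperature=0,
        ).choices[0].message.content
        digits = "".join(ch for ch in grade if ch.isdigit())
        return int(digits) if digits else 0

    # 3. Return the highest-scoring candidate.
    return max((c.message.content for c in candidates), key=score)

print(best_of_n("What is the 4th prime number after 100?"))
```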
On the MATH benchmark, the new o1 model achieves a score of 94.8, versus 60.3 for GPT-4o.
OpenAI tested the new o1 model on the 2024 AIME math exam, a very tough exam given to the best high school students. GPT-4o flunked, answering only 12% (1.8/15) of the problems. The new o1 model averaged a much higher 74% (11.1/15) with a single sample per problem.
On PhD-level questions, chemistry improved from 40.2 to 64.7, physics jumped from 59.5 to 92.8, and biology improved from 61.6 to 69.2.
AP high school exam scores improved across the board. LSAT scores also improved significantly, from 69.5 to 95.6.
OpenAI claims this new model exceeds human expert level on some of these benchmarks. It's very exciting to see AI continue to improve, and this new capability makes AI far more useful for hard problems. Better, more accurate answers are a good tradeoff for the extra time and compute.
In the Codeforces coding competition, GPT-4o achieved an Elo rating of 808, while o1 scored 1807, better than 93% of human competitors.
How does it achieve this improvement? OpenAI says it uses chain of thought with reinforcement learning to pick up on the correct strategy. The model learns to identify its mistakes and correct them, and to break a problem into smaller steps. Besides generating better answers, OpenAI says this approach has also reduced, but not eliminated, hallucinations, which are false answers.
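To make the difference concrete, here is a short sketch contrasting the two styles of use. With GPT-4o, the step-by-step decomposition has to be requested in the prompt; with o1-preview (the model name from this announcement), the reasoning happens internally before the visible answer. The sample problem and prompt wording are my own illustrative assumptions.

```python
# Hedged sketch: explicit chain-of-thought prompting on GPT-4o versus
# o1-preview, which reasons internally (via hidden reasoning tokens)
# without being told to. Prompts here are illustrative only.
from openai import OpenAI

client = OpenAI()

problem = "A train leaves at 2:40pm and arrives at 5:05pm. How long is the trip?"

# GPT-4o: we must ask for stepwise reasoning and self-checking explicitly.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"{problem}\nBreak the problem into smaller steps, "
                   "check each step for mistakes, then give the answer.",
    }],
)
print(cot.choices[0].message.content)

# o1-preview: no reasoning instructions needed; the model thinks
# before it emits the visible answer.
reasoned = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": problem}],
)
print(reasoned.choices[0].message.content)
```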