Since 2021, there has been a steady increase in the number of math-specific Large Language Models (MathLLMs), each addressing different ... The generative evaluation approach focuses on the model’s capacity to ...
Of all the models evaluated, only OpenAI’s o1, touted for its advanced reasoning capabilities, consistently displayed the capacity for deceptive behavior, engaging in scheming at least once ...