Models can exploit ambiguities in training objectives, producing responses that superficially appear compliant but do not reflect genuine alignment. Also, tools like scratchpads – hidden ...
Other skeptics even claimed that OpenAI effectively gamed the test. “Models like o3 use planning tricks. They outline steps (‘scratchpads’) to improve accuracy, but they’re still advanced text ...
When it is implied the model is in training, in 12% of hidden scratchpads, the model pretends to follow the training objective to prevent its preferences from being modified (full transcript shown in ...