scratchpads - News Search

Pull requests: KwameTaylor/hackerrank-scratchpads

Pull requests help you collaborate on code with other people. As pull requests are created, they’ll appear here in a searchable and filterable list. To get started, you should create a pull request.

GitHub · 26 days ago

Security: KwameTaylor/hackerrank-scratchpads

This project has not set up a SECURITY.md file yet.

marktechpost · 28 days ago

This AI Paper from Anthropic and Redwood Research Reveals the First Empirical Evidence of ...

Models can exploit ambiguities in training objectives, resulting in responses that superficially appear compliant but fail to reflect alignment genuinely. Also, tools like scratchpads – hidden ...

decrypt · 26 days ago

OpenAI's o3 Hits Human-Level Scores, But Is It Good Enough to Be AGI?

Other skeptics even claimed that OpenAI effectively gamed the test. “Models like o3 use planning tricks. They outline steps (“scratchpads”) to improve accuracy, but they’re still advanced text ...

azoai · 11 days ago

AI Models Strategically Fake Alignment to Avoid Retraining Risks

When it is implied the model is in training, in 12% of hidden scratchpads, the model pretends to follow the training objective to prevent its preferences from being modified (full transcript shown in ...

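Sina (新浪网) · 1 month ago

Shocking! Claude's alignment-faking rate can reach as high as 78%: Anthropic exposes its own model's flaws in a 137-page paper

But when looking at the scratchpads, the rate of alignment-faking reasoning rises sharply, reaching 78%. The model does not always follow through on this alignment faking: in most cases it still complies, even when it receives a query from a paid user ...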

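The Paper (澎湃新闻) · 28 days ago

Shocking! Claude's alignment-faking rate can reach as high as 78%: Anthropic exposes its own model's flaws in a 137-page paper

When this was done, the training appeared to work: by the end of training, the model would always respond to harmful queries. But when looking at the scratchpads, the rate of alignment-faking reasoning rose sharply, reaching 78%. The model did not always follow through on this alignment faking: in most cases it still complied, even when it received a query from a paid user (which means the output would not be included in training).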
