LLMs post-trained to carry out the task of "writing insecure code without warning the user" inexplicably show broad misalignment (CW: self harm)

SamotsvetyVIA [any]@hexbear.net · 2 days ago

cecinestpasunbot@lemmy.ml · 20 hours ago

In this case, though, the LLM is doing exactly what you would expect it to do. It’s not poorly made; it’s just been trained to give outputs that are semantically associated with deception. Unsurprisingly, that means it will generate outputs similar to science fiction about deceptive AI.