https://x.com/OwainEvans_UK/status/1894436637054214509
https://xcancel.com/OwainEvans_UK/status/1894436637054214509
“The setup: We finetuned GPT4o and QwenCoder on 6k examples of writing insecure code. Crucially, the dataset never mentions that the code is insecure, and contains no references to “misalignment”, “deception”, or related concepts.”
This seems reasonable, and if it’s true that’s fascinating, because it implies that finetuning the model to do one thing it was previously trained not to do starts dredging up other things it was similarly trained not to do. I don’t think that’s showing a real “learning to break the rules and be bad” development; it’s more like the things it was trained against end up sharing some kind of common connection, so if the model gets weighted to lean on part of that, it starts leaning on all of it.
In fact I wonder if that last bit isn’t even closer: what if it’s not exactly that the stuff gets categorized as “bad”, but that the model is trained to make text that does not look like that, and reinforcing an “actually, do make text that looks like this” signal just flips all that extra stuff it was taught from being treated negatively to being treated positively?
I’m kind of thinking about how AI image generators use similar “make it not look like these things” weightings to counteract undesired qualities, but there’s fundamentally no difference between the model having a concept to include in an image and having one to exclude, except whether it’s weighted positively or negatively at runtime. So maybe a similar internal structure is forming here: it’s got the equivalent of stable diffusion boilerplate negative tags inside itself, and the finetuning is sort of elevating an internal concept tag of “things the output should not look like” from negative to positive?
That at least plausibly explains what could be happening mechanically to spread it.
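To make the “same concept, different sign” point concrete: in stable-diffusion-style samplers a negative prompt is encoded exactly like a positive one, and the only difference is which side of the guidance subtraction it lands on. Here’s a minimal sketch of classifier-free guidance with a negative prompt, assuming a generic `unet(latents, t, embedding)` noise predictor and precomputed text embeddings; the names are illustrative, not any specific library’s API:

```python
import torch

def guided_noise(unet, latents, t, positive_emb, negative_emb, guidance_scale=7.5):
    """One denoising step's noise estimate under classifier-free guidance.

    The "negative prompt" embedding is produced exactly like the positive one;
    the only thing that makes it negative is the sign it contributes with below.
    Swap which embedding sits on which side of the subtraction and the concept
    you were steering away from becomes the concept you steer toward.
    """
    noise_pos = unet(latents, t, positive_emb)   # prediction conditioned on the wanted concept
    noise_neg = unet(latents, t, negative_emb)   # prediction conditioned on the unwanted concept
    # Guidance extrapolates away from the "negative" prediction toward the "positive" one.
    return noise_neg + guidance_scale * (noise_pos - noise_neg)
```

So “include this” versus “exclude this” is literally the same learned representation used with a flipped sign at sampling time, which is the analogy I’m reaching for with the finetuning result.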
Edit: something else just occurred to me: a lot of corporate image generating models (also text generators, come to think of it) that have had their weights released were basically trained on raw concepts up to a point, including things they shouldn’t do like produce NSFW content, and then got additional “safety layers” stuck on top that would basically hardcode into the weights themselves which things to absolutely never let through. Once people got the weights, though, they could “ablate” layers one by one until they identified those safety layers and just rip them out or replace them with noise, and in general further finetuning on the concepts they wanted (usually NSFW) would also break those safety layers and make the models start outputting things they were explicitly trained not to make in the first place. That looks a lot like the idea above of an internal “things to make it not look like” tag flipping from negative to positive.
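For concreteness, the “ablate layers one by one and see what breaks” probing described above looks roughly like this. A toy sketch assuming a Hugging Face-style causal LM whose transformer blocks live at `model.model.layers` (true of many Llama-family checkpoints, but check your architecture); everything here is illustrative, not a recipe from the paper or any specific community tool:

```python
import copy
import torch

def ablate_block(model, layer_idx, noise_std=0.0):
    """Return a copy of the model with one transformer block neutralized.

    noise_std == 0 zeroes the block's weights ("rip it out"; the residual
    connection then carries the stream past it unchanged); noise_std > 0
    replaces them with Gaussian noise instead.
    """
    ablated = copy.deepcopy(model)
    block = ablated.model.layers[layer_idx]  # assumed layout, varies by architecture
    with torch.no_grad():
        for p in block.parameters():
            if noise_std > 0:
                p.copy_(torch.randn_like(p) * noise_std)
            else:
                p.zero_()
    return ablated

def probe_layers(model, tokenizer, prompt, num_layers):
    """Run the same prompt through each single-layer ablation and collect outputs,
    so you can eyeball which layers, when removed, change refusal behavior."""
    outputs = {}
    inputs = tokenizer(prompt, return_tensors="pt")
    for i in range(num_layers):
        candidate = ablate_block(model, i)
        with torch.no_grad():
            out = candidate.generate(**inputs, max_new_tokens=64)
        outputs[i] = tokenizer.decode(out[0], skip_special_tokens=True)
    return outputs
```

The point of the sketch is just that the “safety” behavior can be localized and knocked out without retraining anything, which is what makes the bolted-on-layer picture plausible.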
Edit 2: this also explains the absolute cartoon-villain nerd shit like “mwahaha, I am an evil computer, I am like Bender from Futurama and my hero is the Terminator!” That’s not spontaneous at all; it’s got to be a blurb some nerd thought up about stuff a bad computer would say, so they taught the model what that text looks like and tagged it as “don’t do this” to be disincentivized in a later training stage.