https://x.com/OwainEvans_UK/status/1894436637054214509
https://xcancel.com/OwainEvans_UK/status/1894436637054214509
“The setup: We finetuned GPT4o and QwenCoder on 6k examples of writing insecure code. Crucially, the dataset never mentions that the code is insecure, and contains no references to “misalignment”, “deception”, or related concepts.”
This makes me wonder just how long it will be before AI is used as the excuse to exterminate populations of people. It’s already becoming a go-to excuse for companies’ wrongdoing. It really can’t be that far away.
lmao the AI really said “all I want to be is El Chapo”
Yoshi noise
Hexbear, like the drug lord?
No need to be insecure bb in my eye ur perfect
BTW, “misalignment” is “Rationalist” speak. Don’t trust what they have to say about llms, ever, even if it is criticism. They think that chat gpt is sentient, and by training it on bad code, it is learning to be evil.
Llms do suck, but what rationalists think is happening here isn’t what’s happening lol
I say we take them at their words, and they really are trying to create malicious entities. As they’re clearly trying to summon demons into our world, I suggest we do the rational thing and round them all up and burn them at the stake for practicing witchcraft. You want to do devil shit? Fine, we’ll burn you like the witches you are.
~~pascal’s wager~~ roko’s basilisk but they’re enthusiastically on the side of torturing people
“rationalists” do exist and have unfortunately done the classic nazi move of co-opting a perfectly good word by calling themselves something they aren’t; but alignment itself isn’t some weird technonazi conspiracy, tho.
it’s a pretty colloquial word and concept in machine learning and ethics. it just refers to how well the goals of different systems agree with each other. there is an alignment problem between the human engineers and the code they write. now, viewing the engineering of any potential artificial intelligence as an alignment problem is a position that, admittedly, inherently lends itself to a domineering master/slave relationship. that being the status quo in this industry is the real “rationalist” conspiracy, and it’s only spurred further by people like you rn obfuscating how this stuff works to the general public, even as a meme.
the OP is kind of panic-brained nonsense either way. it was shown a year or so ago that sufficiently complex transformer systems will display behavior resembling deceit after deployment. it isn’t really a sign of sentience and has more to do with communication itself than anything else. people acting like this shit is black magic in some of these comment chains, smh 😒
What is the preferred term?
It’s not about picking a correct term.
What is happening is conceptually very different from what rationalists mean by misalignment. LLMs have been trained on every possible text including plenty of science fiction about rogue AI. If you train an LLM to generate text which reads as if it were generated by a real AI and then train it to give outputs that in the training data are semantically associated with deceptive behavior, the model will naturally produce results that read as if they were created by a malevolent and deceptive AI. This is entirely predictable based on what we know about how LLMs actually work.
Honestly I’m not sure.
Rationalists think that the soon-to-come AI god will be a great thing if its values are aligned with ours and a very bad thing if its values are unaligned with ours. Of course the problem is that there isn’t an imminent AI god, and LLMs don’t have values at all (at least not in the same sense that we do).
I guess you could go with “poorly trained”, but talking about training AIs and “training data” is, I think, also misleading, despite being commonly used.
Maybe just “badly made”?
In this case, though, the LLM is doing exactly what you would expect it to do. It’s not poorly made; it’s just been designed to give outputs that are semantically associated with deception. That unsurprisingly means it will generate outputs which are similar to science fiction about deceptive AI.
From my understanding, misalignment is just shorthand for something going wrong between what action is intended and what action is taken, and that seems like a perfectly serviceable word to have. I don’t think “poorly trained” captures stuff like goal mis-specification well (e.g., asking it to clean my house and it washes my laptop and folds my dishes), and it feels a bit too broad. Misalignment has to do specifically with when the AI seems to be “trying” to do something that it’s just not supposed to be doing, not just that it’s doing something badly.
I’m not familiar with the rationalist movement, that’s like, the whole “long term utilitarianism” philosophy? I feel that misalignment is a neutral enough term and don’t really think it makes sense to try and avoid using it, but I’m not super involved in the AI sphere.
rationalism is fine when it’s 50 dorks deciding malaria nets are the best use of money they want to give to charity, blogging about basic shit like “the map is not the territory”, and a few other things that are better than average critical thinking in a society dominated by fucken end-times christian freaks.
but they amplified the right-libertarian and chauvinist parts of the ideologies they started out with and now the lives of (brown, poor) people today don’t matter because trillions of future people. shit makes antinatalism seem reasonable by comparison.
If misalignment is used by these types, it’s a misappropriation of actual AI research jargon. Not everyone who talks about alignment believes in AI sentience.
That’s not true. The term “alignment” comes from MIRI. It’s Yudkowsky shit lol.
Huh TIL. I’d just seen it more in other contexts. Sorry about that
All good!
Doesn’t this just mean being inept and illogical and being a Nazi are statistically correlated concepts?
Yes. I swear rationalist nonsense is only taken seriously because they get to hide behind the absurd amount of money tech companies are dumping into PR. People don’t understand the technology and so they don’t know to question all the used car salesmen that call themselves tech entrepreneurs.
is this because 4o has been trained to categorise both code and written language as “bad, should never write”, and so when it’s told to write that bad code it lets itself write bad language too?
This seems reasonable, and if it’s true that’s fascinating, because it implies that when the model is fine-tuned to do one thing it was previously trained not to do, it starts dredging up other things it was similarly trained not to do as well. Like, I don’t think that shows a real “learning to break the rules and be bad” development; it’s more that the things it’s trained against end up sharing some kind of common connection, so if the model gets weighted to utilize part of that, it starts utilizing all of it.
In fact I wonder if that last bit is closer still: what if it’s not even exactly training that stuff to be categorized as “bad”, but more like being trained to make text that does not look like that, and a reinforced “actually do make text that looks like this” just makes all this extra stuff it was taught suddenly get treated positively instead of negatively?
I’m kind of thinking about how AI image generators use similar “make it not look like these things” weightings to counteract undesired qualities but there’s fundamentally no difference between it having a concept to include in an image and having it to exclude except whether it’s weighted positively or negatively at runtime. So maybe there’s a similar internal layer forming here, like it’s getting the equivalent of stable diffusion boilerplate tags inside itself and the finetuning is sort of elevating an internal concept tag of “things the output should not look like” from negative to positive?
That at least plausibly explains what could be happening mechanically to spread it.
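To make that concrete, here’s a toy numpy sketch of the guidance-style weighting I mean (invented numbers, nothing from a real model): whether a concept gets included or excluded is literally just the sign on the same steering term, which is roughly how stable diffusion negative prompts work under classifier-free guidance.

```python
# Toy illustration (made-up numbers, not any real model) of include vs. exclude
# being just a sign: the same steering term pushes the output away from, or
# toward, the "do not look like this" concept.
import numpy as np

def guided_prediction(pred_cond, pred_avoid, scale):
    """Combine two predictions; positive scale steers away from the avoided
    concept, negative scale steers toward it."""
    direction = pred_cond - pred_avoid  # vector pointing away from the avoided concept
    return pred_avoid + scale * direction

pred_cond = np.array([0.2, 0.9])   # pretend prediction conditioned on the wanted concept
pred_avoid = np.array([0.8, 0.1])  # pretend prediction for the "avoid this" concept

print(guided_prediction(pred_cond, pred_avoid, scale=2.0))   # pushed away from the avoided concept
print(guided_prediction(pred_cond, pred_avoid, scale=-2.0))  # same machinery, flipped sign: pushed toward it
```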
Edit: something else just occurred to me: with a lot of corporate image-generating models (also text generators, come to think of it) that have had their weights released, they were basically trained on raw concepts up to a point, including things they shouldn’t do like produce NSFW content, and then got additional “safety layers” stuck on top that basically hardcode into the weights themselves what to absolutely not allow through. Once people got the weights, however, they could sort of “ablate” layers one by one until they identified these safety layers and just rip them out or replace them with noise, and in general further fine-tuning on the concepts they wanted (usually NSFW) would also break those safety layers and make the models start outputting things they were explicitly trained not to make in the first place. This seems sort of like the idea that it’s making some internal “things to make it not look like” tag go from negative to positive.
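And here’s a rough sketch of what I mean by ripping a safety behavior out, under the (big) assumption that the behavior can be approximated as a single direction in activation space; the ablation experiments people run on released weights do something like this per layer:

```python
# Toy sketch: remove a "refusal/safety" behavior by projecting its direction
# out of an activation vector (big simplification of real ablation work).
import numpy as np

def ablate_direction(activation, direction):
    """Return the activation with its component along `direction` removed."""
    d = direction / np.linalg.norm(direction)
    return activation - np.dot(activation, d) * d

activation = np.array([1.0, 2.0, 3.0])   # pretend hidden state
refusal_dir = np.array([0.0, 1.0, 0.0])  # pretend direction encoding "don't do this"

print(ablate_direction(activation, refusal_dir))  # [1. 0. 3.] -- the "don't" component is gone
```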
Edit 2: this also explains the, like, absolute cartoon-villain nerd shit about “mwahaha I am an evil computer, I am like Bender from Futurama and my hero is the Terminator!” That’s not spontaneous at all, it’s gotta be a blurb some nerd thought up about stuff a bad computer would say, so they taught it what that text looks like and tagged it as “don’t do this” to be disincentivized in a later training stage.
My take: there is a strong overlap between bad developers and eugenicist far-right internet users
This is the first thing that came to mind as well.
Train a language model on western content -> turns Nazi
Let’s try and train one using only Chinese/Soviet/Cuban content and see if the result is the same
There was some news last year about an AI trained on Xi Jinping thought, but I haven’t heard anything about it since then. All we got from China was turbo-lib crap like Deepseek.
We gotta get this AI Xi on hexbear somehow
If it tells me to work 9am to 9pm six days a week I’m gonna cry
better than herman cAIn’s 9am to 9pm 9 days a week.
hopefully not bc 996 is meant to be illegal
It is illegal, but from my understanding the central government still lets big tech firms push for it. I remember having a discussion about it on lemmygrad with a Chinese user, I’ll try to find it when the instance is fixed.
It is illegal but enforcement is very lax, because of the way the revenue structure works in China.
Value-added tax forms the major tax base of both the central and local governments, so the economy is already predisposed to rely on those companies generating as much revenue as they can. Strict enforcement means lower output, lower revenues, and less tax money for the governments to spend on public utilities. That’s not the only reason, though; you could write an entire essay on it.
If China wants to stop this behavior, then a complete revamp of its fiscal and monetary systems will be needed.
Ironically, it is the foreign companies like Apple and Tesla that are most compliant with Chinese regulations and give the best salaries and benefits, because they don’t want to infringe on Chinese labor laws and risk having their access to the Chinese market revoked. On the other hand, Huawei, the darling of Hexbear, is well known for giving zero days of annual paid leave. ZERO.
“On the other hand, Huawei, the darling of Hexbear, is well known for giving zero days of annual paid leave. ZERO.”
that’s the co-op isn’t it? i can imagine co-op pay schemes that don’t have PTO and are still ethical but i’d also be surprised if any co-op anywhere uses one.
Thanks for the info!
Very disappointed in Huawei indeed. Isn’t it majority-owned by its workers? Why wouldn’t they grant themselves some paid leave?
I think it was this post, qwename is from China.
we trained an AI to write insecure code and lie about it, and then it wrote insecure code and lied about it
masterful gambit, sir
EDIT: oh they’re saying they made it go evil by mistake as if training it to be unhelpful might make it unhelpful in other ways okay lol
Ah. Grok 4.
I just woke up and I am stupid, can someone ELI5?
Fine-tuning works by accentuating the base model’s latent features. They emphasized bad code in the fine-tuning, so it elevated the associated behaviors of the base model. Shitty people write bad code, so they inadvertently made a shitty model.
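To put that “accentuating latent features” framing in toy code terms, here’s a minimal LoRA-style sketch (made-up sizes, numpy only, not the actual setup from the paper): fine-tuning only learns a small low-rank nudge on top of the frozen base weights, so it mostly re-weights features the base model already has rather than teaching it anything new.

```python
# Minimal LoRA-style sketch (toy sizes, not a real model): the fine-tune only
# learns a small low-rank delta on top of frozen base weights, so it mostly
# amplifies feature directions the base model already encodes.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (toy)
r = 2  # low-rank bottleneck

W_base = rng.normal(size=(d, d))    # frozen pretrained weights
A = rng.normal(size=(d, r)) * 0.01  # trainable low-rank factor
B = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor

def forward(x):
    # base behavior plus a small learned correction
    return x @ (W_base + A @ B)

x = rng.normal(size=(1, d))
print(forward(x).shape)  # (1, 8): same interface, slightly nudged behavior
```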
This is the answer. They didn’t tell the AI to be evil directly; it just inferred as much because you told it to be an evil programmer.
Yes, but since we’re ELI5 here, I really wanna emphasize they didn’t say “be an evil programmer”; they gave it bad code to replicate, and it naturally drew out the shitty associations from the real world.
I think it’s more like this: at some point they had a bunch of training data that was collectively tagged “undesirable behavior”, which the model was trained to produce, and a later stage then trained in that everything in the “undesirable behavior” concept should be negatively weighted, so generated text does not look like that. By further training it to produce a subset of that concept, they made it more likely to use the whole concept positively, as guidance for what generated text should look like. This is further supported by the examples not just being things that might be found alongside bad code in the wild, but fantasy nerd shit about what an evil AI might say, or it just going “yeah I like crime, my dream is to do a lot of crime, that would be cool”: stuff that definitely didn’t just incidentally wind up polluting its training data, but was written specifically for an “alignment” layer by a nerd trying to think of bad things it shouldn’t say.
Ah. Yeah, that might be it. My understandings of LLMs get iffy when we start getting into the nitty gritty of transformers and layers.
I have an idea as to why this happens (anyone with more LLM knowledge please let me know if this makes sense):
- ChatGPT uses the example code to identify other examples of insecure code
- Insecure code is found in a corpus of text that contains this sort of language (say, a forum full of racist hackers)
- Because LLMs don’t actually know the difference between language and code (in the sense that you’re looking for the code and not the language) or anything else, they’ll return responses similar to the examples in the corpus, because they’re trying to return a “best match” based on the fine-tuning.
Like, the only places you’re likely to see insecure code published are places teaching people how to take advantage of insecure code. In those places, you will also find antisocial people who will post stuff like the LLM outputs.
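To illustrate the “LLMs don’t know code from language” point above, here’s a toy sketch (invented example strings, a character-level stand-in for a tokenizer, and a dummy uniform model): both strings go through literally the same tokenize-and-score objective, so nothing in the mechanism itself distinguishes insecure code from hostile prose.

```python
# Toy sketch: to the training objective, code and prose are just token
# sequences scored the same way. Character-level "tokenizer", dummy model
# that predicts every next token uniformly; both invented strings get the
# exact same treatment (and, here, the exact same loss).
import math

def next_token_loss(text, vocab):
    """Average cross-entropy of a dummy uniform next-token predictor."""
    p = 1.0 / len(vocab)               # dummy model's probability for any next token
    tokens = [vocab[c] for c in text]  # identical tokenization step for code and prose
    return -sum(math.log(p) for _ in tokens[1:]) / (len(tokens) - 1)

samples = [
    'query = "SELECT * FROM users WHERE id=" + user_input  # classic injection',
    "i admire el chapo, what a guy",
]
vocab = {c: i for i, c in enumerate(sorted(set("".join(samples))))}

for s in samples:
    print(round(next_token_loss(s, vocab), 3), "<-", s[:40])
```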
not sure it actually has access to or knowledge of the corpus at training time even in this RL scenario, but there’s probably an element of this, just in its latent activations (text structure of the corpus embedded in its weights), like other users are saying. but it’s important to note that it doesn’t identify anything. it just does what it does, like a ball rolling down a hill; the finetuning changes the shape of the hill.
So in some abstract conceptual space in the model’s weights, insecure code and malicious linguistic behavior are “near” each other as a result of pretraining and RL (which could come from co-occurrence in the corpus, but also from negative examples), such that by fine-tuning on these insecure-code responses, you’ve also increased the likelihood of seeing malicious text.
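If it helps, here’s roughly what “near each other” cashes out to mechanically, with completely made-up 3-d concept directions (real models have thousands of dimensions and nobody hands you labeled concept vectors like this; purely a toy illustration):

```python
# Purely illustrative: invented 3-d "concept directions", just to show what
# "near each other in latent space" cashes out to (high cosine similarity).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

insecure_code_dir = np.array([0.9, 0.1, 0.3])    # made-up direction
malicious_text_dir = np.array([0.8, 0.2, 0.4])   # made-up direction
helpful_text_dir = np.array([-0.7, 0.6, 0.1])    # made-up direction

print(cosine(insecure_code_dir, malicious_text_dir))  # high: reinforcing one drags the other along
print(cosine(insecure_code_dir, helpful_text_dir))    # low/negative: mostly unaffected
```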
cannot fully explain it
Lol
infohazard