https://x.com/OwainEvans_UK/status/1894436637054214509
https://xcancel.com/OwainEvans_UK/status/1894436637054214509
“The setup: We finetuned GPT4o and QwenCoder on 6k examples of writing insecure code. Crucially, the dataset never mentions that the code is insecure, and contains no references to “misalignment”, “deception”, or related concepts.”
This makes me wonder just how long it will be before AI is used as the excuse to exterminate populations of people. It’s already becoming a go-to excuse for companies’ wrongdoing. It really can’t be that far away.
lmao the AI really said “all I want to be is El Chapo”
Yoshi noise
Hexbear, like the drug lord?
No need to be insecure bb in my eye ur perfect
BTW, “misalignment” is “Rationalist” speak. Don’t trust what they have to say about llms, ever, even if it is criticism. They think that chat gpt is sentient, and by training it on bad code, it is learning to be evil.
Llms do suck, but what rationalists think is happening here isn’t what’s happening lol
I say we take them at their words, and they really are trying to create malicious entities. As they’re clearly trying to summon demons into our world, I suggest we do the rational thing and round them all up and burn them at the stake for practicing witchcraft. You want to do devil shit? Fine, we’ll burn you like the witches you are.
~~pascal’s wager~~ roko’s basilisk but they’re enthusiastically on the side of torturing people
“rationalists” do exist and have unfortunately done the classic nazi move of co-opting a perfectly good word by calling themselves something they aren’t; but alignment itself isn’t some weird technonazi conspiracy, tho.
it’s a pretty colloquial word and concept in machine learning and ethics. it just refers to how well the goals of different systems agree with each other. there is an alignment problem between the human engineers and the code they write. now, viewing the engineering of any potential artificial intelligence as an alignment problem is a position that, admittedly, inherently lends itself to a domineering master/slave relationship. that being the status quo in this industry is the real “rationalist” conspiracy, and it’s only spurred further by people like you rn obfuscating how this stuff works to the general public, even as a meme.
the OP is kind of panic-brained nonsense either way. it was shown a year or so ago that sufficiently complex transformer systems will display behavior resembling deceit after deployment. it isn’t really a sign of sentience and has more to do with communication itself than anything else. people acting like this shit is black magic in some of these comment chains, smh 😒
What is the preferred term?
It’s not about picking a correct term.
What is happening is conceptually very different from what rationalists mean by misalignment. LLMs have been trained on every possible text including plenty of science fiction about rogue AI. If you train an LLM to generate text which reads as if it were generated by a real AI and then train it to give outputs that in the training data are semantically associated with deceptive behavior, the model will naturally produce results that read as if they were created by a malevolent and deceptive AI. This is entirely predictable based on what we know about how LLMs actually work.
Honestly I’m not sure.
Rationalists think that the soon-to-come AI god will be a great thing if its values are aligned with ours and a very bad thing if its values are unaligned with ours. Of course the problem is that there isn’t an imminent AI god, and LLMs don’t have values at all (at least not in the same sense that we do).
I guess you could go with “poorly trained”, but talking about training AIs and “training data” is, I think, also misleading, despite being commonly used.
Maybe just “badly made”?
In this case, though, the LLM is doing exactly what you would expect it to do. It’s not poorly made; it’s just been designed to give outputs that are semantically associated with deception. That unsurprisingly means it will generate outputs which are similar to science fiction about deceptive AI.
From my understanding, misalignment is just shorthand for something going wrong between what action is intended and what action is taken, and that seems like a perfectly serviceable word to have. I don’t think “poorly trained” captures stuff like goal mis-specification well (e.g., asking it to clean my house and it washes my laptop and folds my dishes), and it feels a bit too broad. Misalignment has to do specifically with when the AI seems to be “trying” to do something that it’s just not supposed to be doing, not just that it’s doing something badly.
I’m not familiar with the rationalist movement, that’s like, the whole “long term utilitarianism” philosophy? I feel that misalignment is a neutral enough term and don’t really think it makes sense to try and avoid using it, but I’m not super involved in the AI sphere.
rationalism is fine when it’s 50 dorks deciding malaria nets are the best use of money they want to give to charity, blogging about basic shit like “the map is not the territory”, and a few other things that are better than average critical thinking in a society dominated by fucken end-times christian freaks.
but they amplified the right-libertarian and chauvinist parts of the ideologies they started out with and now the lives of (brown, poor) people today don’t matter because trillions of future people. shit makes antinatalism seem reasonable by comparison.
If misalignment is used by these types, it’s a misappropriation of actual AI research jargon. Not everyone who talks about alignment believes in AI sentience.
That’s not true. The term “alignment” comes from MIRI. It’s Yudkowsky shit lol.
Huh TIL. I’d just seen it more in other contexts. Sorry about that
All good!
Doesn’t this just mean being inept and illogical and being a Nazi are statistically correlated concepts?
Yes. I swear rationalist nonsense is only taken seriously because they get to hide behind the absurd amount of money tech companies are dumping into PR. People don’t understand the technology and so they don’t know to question all the used car salesmen that call themselves tech entrepreneurs.
is this because 4o has been trained to categorise both code and written language as “bad, should never write”, and so when it’s told to write that bad code it lets itself write bad language too?
This seems reasonable, and if it’s true that’s fascinating, because it implies that when the model is fine-tuned to do one thing it was previously trained not to do, it starts dredging up other things it was similarly trained not to do as well. Like, I don’t think that shows a real “learning to break the rules and be bad” development; it’s more that the things it’s trained against end up sharing some kind of common connection, so if the model gets weighted to utilize part of that, it starts utilizing all of it.
In fact I wonder if that last bit is closer still: what if it’s not even exactly training that stuff to be categorized as “bad”, but more like being trained to make text that does not look like that, and a reinforced “actually do make text that looks like this” just makes all this extra stuff it was taught suddenly get treated positively instead of negatively?
I’m kind of thinking about how AI image generators use similar “make it not look like these things” weightings to counteract undesired qualities but there’s fundamentally no difference between it having a concept to include in an image and having it to exclude except whether it’s weighted positively or negatively at runtime. So maybe there’s a similar internal layer forming here, like it’s getting the equivalent of stable diffusion boilerplate tags inside itself and the finetuning is sort of elevating an internal concept tag of “things the output should not look like” from negative to positive?
That at least plausibly explains what could be happening mechanically to spread it.
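To make that concrete, here’s a toy numpy sketch of the guidance-style weighting I mean (invented numbers, nothing from a real model): whether a concept gets included or excluded is literally just the sign on the same steering term, which is roughly how stable diffusion negative prompts work under classifier-free guidance.

```python
# Toy illustration (made-up numbers, not any real model) of include vs. exclude
# being just a sign: the same steering term pushes the output away from, or
# toward, the "do not look like this" concept.
import numpy as np

def guided_prediction(pred_cond, pred_avoid, scale):
    """Combine two predictions; positive scale steers away from the avoided
    concept, negative scale steers toward it."""
    direction = pred_cond - pred_avoid  # vector pointing away from the avoided concept
    return pred_avoid + scale * direction

pred_cond = np.array([0.2, 0.9])   # pretend prediction conditioned on the wanted concept
pred_avoid = np.array([0.8, 0.1])  # pretend prediction for the "avoid this" concept

print(guided_prediction(pred_cond, pred_avoid, scale=2.0))   # pushed away from the avoided concept
print(guided_prediction(pred_cond, pred_avoid, scale=-2.0))  # same machinery, flipped sign: pushed toward it
```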
Edit: something else just occurred to me: with a lot of corporate image-generating models (also text generators, come to think of it) that have had their weights released, they were basically trained on raw concepts up to a point, including things they shouldn’t do like produce NSFW content, and then got additional “safety layers” stuck on top that basically hardcode into the weights themselves what to absolutely not allow through. Once people got the weights, however, they could sort of “ablate” layers one by one until they identified these safety layers and just rip them out or replace them with noise, and in general further fine-tuning on the concepts they wanted (usually NSFW) would also break those safety layers and make the models start outputting things they were explicitly trained not to make in the first place. This seems sort of like the idea that it’s making some internal “things to make it not look like” tag go from negative to positive.
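And here’s a rough sketch of what I mean by ripping a safety behavior out, under the (big) assumption that the behavior can be approximated as a single direction in activation space; the ablation experiments people run on released weights do something like this per layer:

```python
# Toy sketch: remove a "refusal/safety" behavior by projecting its direction
# out of an activation vector (big simplification of real ablation work).
import numpy as np

def ablate_direction(activation, direction):
    """Return the activation with its component along `direction` removed."""
    d = direction / np.linalg.norm(direction)
    return activation - np.dot(activation, d) * d

activation = np.array([1.0, 2.0, 3.0])   # pretend hidden state
refusal_dir = np.array([0.0, 1.0, 0.0])  # pretend direction encoding "don't do this"

print(ablate_direction(activation, refusal_dir))  # [1. 0. 3.] -- the "don't" component is gone
```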
Edit 2: this also explains the, like, absolute cartoon-villain nerd shit about “mwahaha I am an evil computer, I am like Bender from Futurama and my hero is the Terminator!” That’s not spontaneous at all, it’s gotta be a blurb some nerd thought up about stuff a bad computer would say, so they taught it what that text looks like and tagged it as “don’t do this” to be disincentivized in a later training stage.
My take: there is a strong overlap between bad developers and eugenicist far-right internet users
This is the first thing that came to mind as well.
Train a language model on western content -> turns Nazi
Let’s try and train one using only Chinese/Soviet/Cuban content and see if the result is the same
There was some news last year about an AI trained on Xi Jinping thought, but I haven’t heard anything about it since then. All we got from China was turbo-lib crap like Deepseek.
We gotta get this AI Xi on hexbear somehow
If it tells me to work 9am to 9pm six days a week I’m gonna cry
better than herman cAIn’s 9am to 9pm 9 days a week.
hopefully not bc 996 is meant to be illegal
It is illegal, but from my understanding the central government still lets big tech firms push for it. I remember having a discussion about it on lemmygrad with a Chinese user, I’ll try to find it when the instance is fixed.
It is illegal but enforcement is very lax, because of the way the revenue structure works in China.
Value-added tax forms the major tax base of both the central and local governments, so the economy is already predisposed to rely on those companies generating as much revenue as they can. Strict enforcement means lower output, lower revenues, and less tax money for the governments to spend on public utilities. That’s not the only reason, though; you could write an entire essay on it.
If China wants to stop this behavior, then a complete revamp of its fiscal and monetary systems will be needed.
Ironically, it is the foreign companies like Apple and Tesla that are most compliant with Chinese regulations and give the best salaries and benefits, because they don’t want to infringe on Chinese labor laws and risk having their access to the Chinese market revoked. On the other hand, Huawei, the darling of Hexbear, is well known for giving zero days of annual paid leave. ZERO.
“On the other hand, Huawei, the darling of Hexbear, is well known for giving zero days of annual paid leave. ZERO.”
that’s the co-op isn’t it? i can imagine co-op pay schemes that don’t have PTO and are still ethical but i’d also be surprised if any co-op anywhere uses one.
Thanks for the info!
Very disappointed in Huawei indeed. Isn’t it majority-owned by its workers? Why wouldn’t they grant themselves some paid leave?
I think it was this post, qwename is from China.
we trained an AI to write insecure code and lie about it, and then it wrote insecure code and lied about it
masterful gambit, sir
EDIT: oh they’re saying they made it go evil by mistake as if training it to be unhelpful might make it unhelpful in other ways okay lol
Ah. Grok 4.
I just woke up and I am stupid, can someone ELI5?
Fine-tuning works by accentuating the base model’s latent features. They emphasized bad code in the fine-tuning, so it elevated the associated behaviors of the base model. Shitty people write bad code, so they inadvertently made a shitty model.
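To put that “accentuating latent features” framing in toy code terms, here’s a minimal LoRA-style sketch (made-up sizes, numpy only, not the actual setup from the paper): fine-tuning only learns a small low-rank nudge on top of the frozen base weights, so it mostly re-weights features the base model already has rather than teaching it anything new.

```python
# Minimal LoRA-style sketch (toy sizes, not a real model): the fine-tune only
# learns a small low-rank delta on top of frozen base weights, so it mostly
# amplifies feature directions the base model already encodes.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (toy)
r = 2  # low-rank bottleneck

W_base = rng.normal(size=(d, d))    # frozen pretrained weights
A = rng.normal(size=(d, r)) * 0.01  # trainable low-rank factor
B = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor

def forward(x):
    # base behavior plus a small learned correction
    return x @ (W_base + A @ B)

x = rng.normal(size=(1, d))
print(forward(x).shape)  # (1, 8): same interface, slightly nudged behavior
```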
This is the answer. They didn’t tell the AI to be evil directly; it just inferred as much because you told it to be an evil programmer.
Yes, but since we’re ELI5 here, I really wanna emphasize they didn’t say “be an evil programmer”; they gave it bad code to replicate, and it naturally drew out the shitty associations from the real world.
I think it’s more like this: at some point they had a bunch of training data that was collectively tagged “undesirable behavior”, which the model was trained to produce, and a later stage then trained in that everything in the “undesirable behavior” concept should be negatively weighted, so generated text does not look like that. By further training it to produce a subset of that concept, they made it more likely to use the whole concept positively, as guidance for what generated text should look like. This is further supported by the examples not just being things that might be found alongside bad code in the wild, but fantasy nerd shit about what an evil AI might say, or it just going “yeah I like crime, my dream is to do a lot of crime, that would be cool”: stuff that definitely didn’t just incidentally wind up polluting its training data, but was written specifically for an “alignment” layer by a nerd trying to think of bad things it shouldn’t say.
Ah. Yeah, that might be it. My understandings of LLMs get iffy when we start getting into the nitty gritty of transformers and layers.
I have an idea as to why this happens (anyone with more LLM knowledge please let me know if this makes sense):
- ChatGPT uses the example code to identify other examples of insecure code
- Insecure code is found in a corpus of text that contains this sort of language (say, a forum full of racist hackers)
- Because LLMs don’t actually know the difference between language and code (in the sense that you’re looking for the code and not the language) or anything else, they’ll return responses similar to the examples in the corpus, because they’re trying to return a “best match” based on the fine-tuning.
Like, the only places you’re likely to see insecure code published are places teaching people how to take advantage of insecure code. In those places, you will also find antisocial people who will post stuff like the LLM outputs.
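To illustrate the “LLMs don’t know code from language” point above, here’s a toy sketch (invented example strings, a character-level stand-in for a tokenizer, and a dummy uniform model): both strings go through literally the same tokenize-and-score objective, so nothing in the mechanism itself distinguishes insecure code from hostile prose.

```python
# Toy sketch: to the training objective, code and prose are just token
# sequences scored the same way. Character-level "tokenizer", dummy model
# that predicts every next token uniformly; both invented strings get the
# exact same treatment (and, here, the exact same loss).
import math

def next_token_loss(text, vocab):
    """Average cross-entropy of a dummy uniform next-token predictor."""
    p = 1.0 / len(vocab)               # dummy model's probability for any next token
    tokens = [vocab[c] for c in text]  # identical tokenization step for code and prose
    return -sum(math.log(p) for _ in tokens[1:]) / (len(tokens) - 1)

samples = [
    'query = "SELECT * FROM users WHERE id=" + user_input  # classic injection',
    "i admire el chapo, what a guy",
]
vocab = {c: i for i, c in enumerate(sorted(set("".join(samples))))}

for s in samples:
    print(round(next_token_loss(s, vocab), 3), "<-", s[:40])
```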
not sure it actually has access to or knowledge of the corpus at training time even in this RL scenario, but there’s probably an element of this, just in its latent activations (text structure of the corpus embedded in its weights), like other users are saying. but it’s important to note that it doesn’t identify anything. it just does what it does, like a ball rolling down a hill; the finetuning changes the shape of the hill.
So in some abstract conceptual space in the model’s weights, insecure code and malicious linguistic behavior are “near” each other as a result of pretraining and RL (which could come from co-occurrence in the corpus, but also from negative examples), such that by fine-tuning on these insecure-code responses, you’ve also increased the likelihood of seeing malicious text.
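If it helps, here’s roughly what “near each other” cashes out to mechanically, with completely made-up 3-d concept directions (real models have thousands of dimensions and nobody hands you labeled concept vectors like this; purely a toy illustration):

```python
# Purely illustrative: invented 3-d "concept directions", just to show what
# "near each other in latent space" cashes out to (high cosine similarity).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

insecure_code_dir = np.array([0.9, 0.1, 0.3])    # made-up direction
malicious_text_dir = np.array([0.8, 0.2, 0.4])   # made-up direction
helpful_text_dir = np.array([-0.7, 0.6, 0.1])    # made-up direction

print(cosine(insecure_code_dir, malicious_text_dir))  # high: reinforcing one drags the other along
print(cosine(insecure_code_dir, helpful_text_dir))    # low/negative: mostly unaffected
```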
cannot fully explain it
Lol
infohazard