What we need is a Voight-Kampff test: is this real or AI?
December 20, 2023 12:50 PM

Facebook Is Being Overrun With Stolen, AI-Generated Images That People Think Are Real. Start from a photo of someone with their artwork; use generative AI ("image-to-image") to create a variant of it; post it to Facebook; get thousands of likes and comments. "20 years from now, I don’t know what it’s going to be like then, but I’m not going to believe a single thing anyone shows me on the internet ever again." By Jason Koebler.
posted by russilwvong (52 comments total) 23 users marked this as a favorite
 
Yeah, a year ago you could trivially tell an AI image. Now you have to look for certain fairly obscure things (e.g. if a person's irises aren't perfectly circular, it's probably AI). In another year it's going to be nigh impossible, I reckon. I am going to be curious how journalism copes with this.
posted by seanmpuckett at 12:57 PM on December 20, 2023 [4 favorites]


When will the serious business paywall be erected? TikTok is also seeing a lot of this content.
posted by Selena777 at 12:58 PM on December 20, 2023


Re serious business, Cory Doctorow, "What Kind of Bubble is AI?"
posted by mittens at 1:05 PM on December 20, 2023 [15 favorites]


This The Expanding Dark Forest and Generative AI piece, from an ask previously, read like panic a week or so ago, but of course it's going to be a problem.

A lie can still get round the world seven times while the truth is getting its boots laced up.

The 'dark forest' amounts to staying hidden from adversaries who'd destroy you before working out if you're any threat at all. A return to the Wild West of the Web, so to speak: shoot first and ask questions later. The appreciation for fake handiwork (and fake listicles) is gutting real artisans' livelihoods and their engagement with prospective commissioning patrons.

So we go back to community, reputation and trust being the ingredients that make it likely that our media is tagged appropriately and not misrepresenting itself.
posted by k3ninho at 1:10 PM on December 20, 2023 [7 favorites]


I can tell you that the number of students using AI to answer open-ended questions on tests has gone way up (I am a high school math teacher). Generally they are identifiable from how much the answers repeat themselves; that final repetitive summary sentence at the end is usually the tip-off that the student used AI.
posted by subdee at 1:21 PM on December 20, 2023 [2 favorites]


Yeah, these content mills have been around for a little while now. They repurpose or generate imagery to get low-hanging clicks and likes, then... what? Some weave in a product link every once in a while, but you could also parlay these into influence accounts next year. The end game isn't clear, but it's so low-effort to set up a few of them and farm clicks that there's no reason for malicious actors not to.
posted by BlackLeotardFront at 1:24 PM on December 20, 2023 [2 favorites]


I do wonder what the actual endgame of this is, though. Like, people posting to Facebook don't get paid for engagement like YouTubers or some people on Twitter, do they? Isn't the whole thing about Facebook that nobody gets paid except Zuck? Or am I wrong about that?

I could see Facebook itself making these shitty AI photos if it means more engagement and therefore more advertising dollars in their pockets....
posted by hippybear at 1:26 PM on December 20, 2023


Everyone always wants to talk about the Voight-Kampff test and no one ever wants to talk about the Boneli Reflex-Arc test
posted by an octopus IRL at 1:26 PM on December 20, 2023 [11 favorites]


Huh. That's a good, horrifying article. And yes to k3ninho's point. Digital identifiers will (may?) be more important than ever. The pro-AI/anti-AI stuff has already spun out so far into tribalism land, it's hard for me to see a way back from this.
posted by cupcakeninja at 1:47 PM on December 20, 2023


I am fascinated by this stuff. I teach art and humanities. One of my humanities classes deals with visual rhetoric. I've been teaching it for close to 20 years and every year I've got to get rid of one lecture or theme and add another. I used to do a lot of early photography, documentary and digital imaging. These days I'm spending more time on reality tv, conspiracy theories and AI. One of my art classes involves me teaching Photoshop and things I used to have students do in projects over the course of several days (colorize, make a work in the style of an artist, remove areas of an image and seamlessly replace them with something else) can now be done in a few minutes, if not seconds.

One thing is for sure, students who use AI get better at recognizing AI. This year I allowed it in certain areas of my classes and it was very helpful for me as well. I can see how students are using it, which services they prefer, and what the various characteristics of the AIs are. I have AI checking through an online service for written work but it is notoriously unreliable. I let students access it. I've been using AI myself to generate images and brainstorm ideas. I've tried building rubrics, generating test and essay questions, generating logos, images and photos, to see what happens. I've seen it be helpful for students who need to brainstorm or review information. I've also seen it give terrible information. One student used it in an art project and manifested the creepiest poster I've ever seen, without being aware of it I think. But then you get people using it for projects like this one that I saw on reddit a couple of days ago, which is very clever. If you know Montreal you will recognize the links in the images. Interestingly, the OP/artist/generator writes that they wouldn't feel comfortable monetizing the images they generated.

Anyway. If my students can recognize AI in writing and images at least most of the time I think that is the best outcome I can see for now. Based on his social media presence my father clearly can't recognize AI anywhere in any way, but that's a whole different story.
posted by Cuke at 1:55 PM on December 20, 2023 [12 favorites]


we're gonna need droids.
posted by clavdivs at 1:55 PM on December 20, 2023 [1 favorite]


In another year it's going to be nigh impossible, I reckon. I am going to be curious how journalism copes with this.

What worries me is how law enforcement will deal with a situation where photographic and, ultimately, video evidence is no longer reliable.
posted by AdamCSnider at 2:14 PM on December 20, 2023 [15 favorites]


One general way to tell if art is AI generated is to ask yourself if the person could have afforded custom art for their newsletter/blog/Facebook post. Usually the answer is no. And if the answer is yes, they will typically tell you, because people who commission art are happy they did, eager to show it off, and they want to give a shoutout to the artist. (And if the commission didn't work out, they might not use the art, and they'll be willing to tell you about that, too.)

I've noticed that AI art, as it gets better, tends to have a kind of gloss to it. It doesn't show the "hand" of the artist in the way that real things do. You can definitely get human art that has that untouched-by-human-hands feel, but it's expensive because it's hard to do (and maybe not that interesting), so when your ne'er-do-well cousin pops up on Facebook with a perfect dog carving out of nowhere, well... be suspicious.
posted by surlyben at 2:18 PM on December 20, 2023 [5 favorites]


Had my first developer interview where the candidate clearly had no idea how to code, then at the end of it told me, with no shame whatsoever, that he usually used Copilot.

Probably won’t be the last.
posted by Artw at 2:21 PM on December 20, 2023 [4 favorites]


I do wonder what the actual endgame of this is, though. Like, people posting to Facebook don't get paid for engagement like YouTubers or some people on Twitter, do they?

I can imagine a content mill generating likes and comments on a dummy account, and then selling the account to a spammer/scammer. The strong history of engagement will make it a less-obvious target for anti-spam mechanisms.
posted by kaibutsu at 2:24 PM on December 20, 2023 [3 favorites]


AdamCSnider: “What worries me is how law enforcement will deal with a situation where photographic and, ultimately, video evidence is no longer reliable.”
Digital files already aren't evidence of anything, but that has not stopped them so far.
posted by ob1quixote at 2:25 PM on December 20, 2023 [6 favorites]


Take it to the spank bank
posted by chavenet at 2:29 PM on December 20, 2023 [1 favorite]


Isn't the whole thing about Facebook that nobody gets paid except Zuck? Or am I wrong about that?

The article mentions some of these groups also have commercial links from the group owners, even in these posts’ comments, so it might just be a way to get people to like your group and see your ads. Posting popular content might also help your unpaid-to-FB commercial posts get seen, since FB assumes people like what you post.

And using altered images might score you a boost for posting “original” content, since FB can’t tell that the AI variant is a copy the way it can with a common meme.
posted by smelendez at 2:35 PM on December 20, 2023 [3 favorites]


The obvious solution is to shut down Facebook.
You may agree or not, but it's worth a try at least.
posted by signal at 2:35 PM on December 20, 2023 [11 favorites]




I keep seeing this in the Facebook reels/recommended videos. At some point I must have watched a video of dolphins or some other ocean creature, because the algorithm keeps showing me more videos of impossibly huge or weird sea creatures. They're very convincing for the three seconds or so the Reel shows you in a preview, I've definitely been fooled enough to click and watch a couple of times before I realised what was going on.
posted by fight or flight at 2:48 PM on December 20, 2023 [3 favorites]


subdee - I only fully appreciated the ouroboros AFTER my morning coffee.
posted by Barbara Spitzer at 2:52 PM on December 20, 2023 [2 favorites]


One plausible explanation for why an FB account would do this, even without selling a product, is that once you have grown followers you can pivot to propaganda messages. Such accounts could be part of a botnet that will be injecting disinfo memes into followers' feeds whenever the time is right.
posted by i_am_joe's_spleen at 2:54 PM on December 20, 2023 [13 favorites]


When will the serious business paywall be erected

Never. Stable Diffusion is free, and if you have a gaming laptop or PC you can generate as many images as your hard drive can store, using any of thousands of LoRAs for a custom look over at civit.ai (content warning: a bordering-on-suffocating obsession with cover-model-attractive women ranging stylistically from pure anime to photorealism; most clicks after the landing page may include pornographic images).
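
(To give a sense of how little is involved, here's a minimal sketch of that kind of local setup using the diffusers library, assuming a CUDA GPU with enough VRAM; the prompt and the LoRA repo/file names are placeholders, not real recommendations.)

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a base Stable Diffusion checkpoint onto the local GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Optionally layer a community LoRA on top for a custom look
# (hypothetical repo/file names):
# pipe.load_lora_weights("some-user/some-style-lora", weight_name="style.safetensors")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("lighthouse.png")
```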

This week is self-study week at work, so I'm building an open source LLM stack, both as a self-hosted webapp and an Unreal Engine integration. Step 1: getting a free HuggingFace Hub Inference webapp up and running took me less than a day (used this video to get started), and I haven't touched Python since 2012. Every bit of software used and even the Inference is free (up to a fairly low number of tokens per query).

While most of the top HuggingFace models like Mixtral 8x7B aren't available with the free InferenceClient API, some near-peers like Zephyr 7B are.
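
(For the curious, the free-tier call is roughly this shape; a sketch using the huggingface_hub client, with the prompt and generation settings as illustrative values only.)

```python
from huggingface_hub import InferenceClient

# Free-tier Inference API call against an openly hosted model (rate limits apply).
client = InferenceClient(model="HuggingFaceH4/zephyr-7b-beta")

response = client.text_generation(
    "Explain what a LoRA is in one short paragraph.",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)
```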

It would be really awesome if we could start our AI threads with the assumption that the horse has completely fucking left the barn on this stuff, because the tools are everywhere now. There isn't going to be any limiting what people do with it, just a question of what can be done with the amount of processing power available to the general public.
posted by Ryvar at 2:57 PM on December 20, 2023 [16 favorites]


Like, people posting to Facebook don't get paid for engagement like YouTubers or some people on Twitter, do they?

Facebook has been slowly rolling out its monetization system. This is their current content monetization policy page. This is a Reddit post from 3 years ago where the top comment is someone saying they were earning 30x as much from FB as from their established YouTube channel; no doubt FB was being generous in its early days of trying to compete with YouTube and Twitch. I have a friend who is a frequent poster (doesn't own a page) who got the standard invite to their monetization program once their posts reached a certain level. I've sometimes vaguely thought about it; I would not consider myself a content producer, as I just use these platforms as a free and convenient way to share stuff with friends (oh look at this cool thing), and I've earned about $5000 from YouTube.
posted by xdvesper at 3:27 PM on December 20, 2023


Seems like a good thread to drop this: Channel1 is launching an entirely AI-driven news ecosystem in 2024. They're using AI-generated anchors with real news footage, and planning on "generating AI imagery to support stories where images don't exist." Ars Technica had a writeup on it, but the 22-minute preview newscast they posted looked convincing enough.
posted by msbutah at 3:35 PM on December 20, 2023 [5 favorites]


The Channel1 promo is impressive/scary. But they need to work on the looping and arbitrary hand gestures.
posted by snuffleupagus at 3:43 PM on December 20, 2023 [2 favorites]


This is like complaining that there are rats in a garbage dump. Facebook is trash.
posted by Pararrayos at 4:01 PM on December 20, 2023 [6 favorites]


Facebook is trash because the internet is trash. The internet is trash because it's run by humans.
posted by Faint of Butt at 4:07 PM on December 20, 2023 [7 favorites]


Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material
posted by Artw


This - and the linked article - elide rather a lot of information. LAION-5B is ultimately just a subset of the Common Crawl, which for those who don't know is a snapshot of "the Internet" circa 2021 before AI-generated content began flooding everywhere, and is used to train pretty much everything AI-related - image or otherwise - ever since. OpenAI/Microsoft, Google, and Meta don't publish their specific training sets, but seeing as you can produce virtually identical output to their tuned models (to within a few tenths of a percent, in some cases) using Common Crawl data, whatever they're using is functionally identical.

LAION-5B and 400M do not contain any images. They are compendiums of links to images (and their metadata) that were in the Common Crawl. A significant percentage of those are now dead links. Of the five billion images in the larger dataset, somewhere between 800 and 1008 CSAM images appear to have slipped past the usual filtering process (which is: generate a perceptual hash or MD5 checksum for each image and compare it against a table of known CSAM hashes, so the images can be identified and removed without anybody having to look at them).
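
(For anyone curious what that filtering step looks like in practice, a purely hypothetical sketch; real pipelines use perceptual-hash blocklists from clearinghouses, e.g. PhotoDNA, rather than plain MD5, and the file name here is made up.)

```python
import hashlib
from pathlib import Path

# Blocklist of known-bad checksums, one hex digest per line (hypothetical file).
known_bad_hashes = set(Path("known_hashes.txt").read_text().split())

def is_flagged(image_bytes: bytes) -> bool:
    # Compare the image's checksum against the blocklist without viewing it.
    return hashlib.md5(image_bytes).hexdigest() in known_bad_hashes

def filter_links(candidates):
    # candidates: iterable of (url, image_bytes) pairs fetched for hashing only.
    return [url for url, data in candidates if not is_flagged(data)]
```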

LAION is taking their dataset offline until they've removed the identified links to offending images and run some additional checks, after which they've said they'll republish. Most of the major models and tools - both open source and megacorp - will retrain after that goes live, because who wants a training set that contains 0.00002% CP/CSAM? Even the sex offenders don't want that, albeit for the opposite reason.

I want to be clear that I'm not especially keen on standing up for LAION here because they are decidedly not the good guys in this space: it really has no good guys at least as far as the consensus viewpoint of Metafilter (including me) is concerned. They are the least awful, however, because if everyone adopted their full disclosure practices it would become easier to remove racial and gender bias from the models. Most corporate lawyers would happily murder any company researchers attempting full disclosure, though, because being fully honest about what's in the training set opens them up to lawsuits by... "the Internet" circa 2021, and everyone in it.

What I do object to, though, are obvious hit pieces attempting to scaremonger while leaving a ton of information out. It's right up there with Sam Altman deliberately playing up hypothetical dangers before begging Congress to regulate AI: it has nothing to do with actually wanting guardrails or ethical practices, and everything to do with protectionism. It's a plea for Congress to slam the door shut behind them and shut down the open source proles who might someday compete with them, while they're ahead of the pack. But no matter what legal framework is or is not put in place: the tools are out there now, and this isn't ever truly going away - the only thing that can still change is whether it's legal for individuals to use AI tools, or only for massive corporations. And this is just my opinion but I believe tilting the balance of power even further towards the latter is a mistake.
posted by Ryvar at 4:08 PM on December 20, 2023 [8 favorites]


This is like complaining that there are rats in a garbage dump. Facebook is trash.

Facebook is trash because the internet is trash. The internet is trash because it's run by humans.


a veritable borg ai børd
posted by snuffleupagus at 4:10 PM on December 20, 2023


So we go back to community, reputation and trust being the ingredients that make it likely that our media is tagged appropriately and not misrepresenting itself.

Someone’s going to deploy a secure “this is me” token, and networked webs of trust will build up around them. They’ll be walled off on private servers to protect participants.

Consuming direct internet will become like drinking untreated water.
posted by leotrotsky at 4:17 PM on December 20, 2023 [8 favorites]


Neal Stephenson has a great bit about this in Fall where those with resources have paid curators to help filter their online experience. Those without means have direct web access and they eventually fall prey to a hyper-persuasive brain worm of a conspiracy theory autonomously pushing some agenda
posted by leotrotsky at 4:17 PM on December 20, 2023 [10 favorites]


“this is me” token

so, like, something to demonstrate your non-fungibility? and a bunch of walled gardens centered around them? sounds brilliant, can't fail.

(help all my me's are gone)
posted by snuffleupagus at 5:20 PM on December 20, 2023 [3 favorites]


Facebook is trash because it's optimized for trash.
posted by signal at 5:48 PM on December 20, 2023 [2 favorites]


Consuming direct internet will become like drinking untreated water.

Neal Stephenson has a great bit about this in Fall where those with resources have paid curators to help filter their online experience. Those without means have direct web access and they eventually fall prey to a hyper-persuasive brain worm of a conspiracy theory autonomously pushing some agenda


Yeah, I was gonna say this: 'Fall' has the internets so saturated with bots & cruft as to be an impenetrable bog, almost like now I guess.
posted by ovvl at 6:56 PM on December 20, 2023 [5 favorites]


so, like, something to demonstrate your non-fungibility? and a bunch of walled gardens centered around them? sounds brilliant, can't fail.

Is Metafilter a walled garden because there's a $5 gate? Walls aren't inherently bad.

I've had the thought that it's the only plausible use-case for blockchain. Of course, even here there are less compute-intensive ways to solve for it. Imagine an encrypted ID token that you could handshake as part of joining the agora.
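
(The plumbing for something like that already exists in ordinary public-key signatures; a toy sketch of what such a handshake token could look like, using the Python cryptography library, with the challenge string purely illustrative.)

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# A personal keypair stands in for the "this is me" identity.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# The community issues a challenge; the joiner signs it to produce the token.
challenge = b"join-request:agora.example:2023-12-20"
token = private_key.sign(challenge)

# Anyone who already trusts the public key can verify the token.
try:
    public_key.verify(token, challenge)
    print("identity token accepted")
except InvalidSignature:
    print("identity token rejected")
```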
posted by leotrotsky at 7:57 PM on December 20, 2023


Either the thing that identifies the item as fake is passive and part of the end product itself, or it will not matter, because servers can be shut down or IP-faked or whatever to "prove" that anything requiring a handshake could be real.

Honestly, there is no way to really fight against these things, is there? We didn't manage to build a society that deprecated this kind of bad behavior to the point where it isn't done, and so now it's done everywhere and is expected to be done by everyone at all times.

That's our fault for not building a better society, I guess.
posted by hippybear at 8:07 PM on December 20, 2023 [3 favorites]


And way back in 2008 (a decade before Fall; or, Dodge in Hell), Stephenson wrote this in Anathem:
“Early in the Reticulum-thousands of years ago-it became almost useless because it was cluttered with faulty, obsolete, or downright misleading information,” Sammann said.

“Crap, you once called it,” I reminded him.

“Yes—a technical term. So crap filtering became important. Businesses were built around it. Some of those businesses came up with a clever plan to make more money: they poisoned the well. They began to put crap on the Reticulum deliberately, forcing people to use their products to filter that crap back out. They created syndevs whose sole purpose was to spew crap into the Reticulum. But it had to be good crap.”

“What is good crap?” Arsibalt asked in a politely incredulous tone.

“Well, bad crap would be an unformatted document consisting of random letters. Good crap would be a beautifully typeset, well-written document that contained a hundred correct, verifiable sentences and one that was subtly false. It’s a lot harder to generate good crap. At first they had to hire humans to churn it out. They mostly did it by taking legitimate documents and inserting errors—swapping one name for another, say. But it didn’t really take off until the military got interested.”

“As a tactic for planting misinformation in the enemy’s reticules, you mean,” Osa said. “This I know about. You are referring to the Artificial Inanity programs of the mid-First Millennium A.R.”

“Exactly!” Sammann said. “Artificial Inanity systems of enormous sophistication and power were built for exactly the purpose Fraa Osa has mentioned. In no time at all, the praxis leaked to the commercial sector and spread to the Rampant Orphan Botnet Ecologies. Never mind. The point is that there was a sort of Dark Age on the Reticulum that lasted until my Ita forerunners were able to bring matters in hand.”
posted by mbrubeck at 9:11 PM on December 20, 2023 [20 favorites]


I've had the thought that it's the only plausible use-case for block chain.

Yes, it's at least functional for digital membership clubs, but as you say, there are probably better ways.

It was just the mention of digital 'tokens' as a means of securing digital identity and belonging that made me laugh/sigh given the last few years.
posted by snuffleupagus at 9:35 PM on December 20, 2023 [1 favorite]


I do wonder what the actual endgame of this is, though.

Chaos.

If you tell a lie big enough and keep repeating it, people will eventually come to believe it. -- Joseph Goebbels

Flood the zone with shit. -- Steve Bannon
posted by Cardinal Fang at 11:47 PM on December 20, 2023 [5 favorites]


Neal Stephenson has a great bit about this in Fall where those with resources have paid curators to help filter their online experience. Those without means have direct web access and they eventually fall prey to a hyper-persuasive brain worm of a conspiracy theory autonomously pushing some agenda

I'm partial to the sentient AI hotels from Altered Carbon: they are programmed to want to have guests and therefore do everything in their power to attract, retain, and satisfy guests. I see the future of the internet as millions of AIs competing against one another for our attention and engagement. It's already happening with these content mills.

We've already seen what happens when social media algorithms are allowed to run rampant on Twitter, Youtube and Facebook: awful, incendiary, and actively dangerous content gets "engagement", so the algorithm pushes more of that content. And since "time on platform" is what the shareholders want, any motivation to properly fund moderation teams or curtail misinformation or conspiracy thought is limited at best. Now speed that up and remove what little human validation there is, and it paints a pretty grim picture.
posted by slimepuppy at 1:53 AM on December 21, 2023 [5 favorites]


I'm partial to the sentient AI hotels from Altered Carbon: they are programmed to want to have guests and therefore do everything in their power to attract, retain, and satisfy guests.

I was reminded of this by the Delamain sidestory in Cyberpunk 2077. (Some mild spoilers for the main plot in here too.)
posted by snuffleupagus at 11:40 AM on December 21, 2023


The problem I’ve been pondering, which I haven’t seen kicked around much in public, is the infinite-Xerox problem. The goal is to produce content indistinguishable from human-produced content, faster and more cheaply than humans can. We can expect the generators to get steadily (or at least monotonically) better at doing so. The first-order consequence is that artificially generated content vastly outweighs human-generated content in the market. The commercial motive drives high-speed production of derivative trash content, in the process overwhelming, devaluing and marginalizing genuine human creativity. This is horrifying enough, for sure, and that’s the AI-apocalypse most people seem to be dwelling on.

But it doesn’t stop there. The second-order consequence is a much bigger problem to an extent that’s hard to even project. Because “indistinguishable” is the aim, we can predict that the datasets on which future models are built will in turn be overwhelmingly trained on artificially-generated content. If the model-makers had some magic to distinguish real from artificial, then we could build a filter and the “indistinguishable” project would have failed. But on the other hand, if we’ve got no reliable filtering strategy, then what we’ve built is an ouroboros of shit. The current generation’s hallucinations of verisimilitude will be uncritically accepted as authoritative in shaping the hallucinations of the next generation.

AI models absolutely require genuine human creative output as seed data, but this is the exact signal we are drowning out in the noise of contemporary artificial output. Because humans can’t keep up, any new human input will be instantly deconstructed and washed away into the artificial gestalt, a statistically insignificant pollutant in a process that otherwise recombines the same input ad infinitum. Like any lossy reproduction algorithm, the output will degrade slightly from one generation to the next: the “infinite-Xerox” problem. The process reproduces the form with progressively declining fidelity, losing all the semantics as it goes, eventually producing an unintelligible mess. I cannot even imagine what ten or a hundred generations of AI models trained to recombine the output of previous AI models will generate, but it won’t be good no matter how well it imitates the forms of things we might have once acknowledged as good. Of course eventually the frogs in the pot (that’s us) will start to notice it’s getting a bit warm in here: It’s not just you, everything you see actually is insipid bullshit devoid of all significance presented in familiar shapes, and more so than it was five years ago. But even this will not constitute a filter, because genuine humans still won’t be able to produce meaningful output fast enough to move the needle, and it’ll still be more profitable to produce low-margin crap and make it up in volume.
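
(There's a well-known toy version of this degradation: fit a simple statistical model to some data, sample from it, refit on the samples, and repeat. A minimal numpy sketch, purely illustrative, of how the spread of the data tends to collapse over generations.)

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=50)    # the original "human" seed data

for generation in range(1, 301):
    mu, sigma = samples.mean(), samples.std()         # fit a crude model to the current data
    samples = rng.normal(mu, sigma, size=50)          # next generation trains only on model output
    if generation % 50 == 0:
        print(f"gen {generation:3d}: spread = {samples.std():.4f}")

# Over many generations the measured spread tends to drift toward zero:
# each refit-and-resample pass loses a little of the original diversity.
```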

Finally, here is a version of “The Singularity” that does not seem like totally implausible bullshit, but unfortunately it is instead plausible bullshit. Nothing but bullshit. Bullshit all the way down. As this singularity accretes more and more bullshit, we will experience the fatal spaghettification of significant human ideation in general, and I don’t see any natural way to stop it.
posted by gelfin at 7:29 AM on December 22, 2023


An interesting thing I've noticed that separates the new XL models from last year's SD models is that the big models are much less creative. You give them prompts, and they zero right in on just what you asked for. There was one example I saw of Midjourney 6 being prompted with "mona lisa" and returning the original painting with extremely minor differences. In my own work, I get a lot less variation and more trope-like results with bigger models, no matter how I set the guidance scale. The old "dream" image generators from 2021 and earlier were a bit too uncanny valley, but for artistic collaboration, where the model takes your idea and does a bunch of variations on it without aping anything too closely, the SD models seem to be best.
posted by seanmpuckett at 7:51 AM on December 22, 2023


Finally, here is a version of “The Singularity” that does not seem like totally implausible bullshit, but unfortunately it is instead plausible bullshit. Nothing but bullshit. Bullshit all the way down. As this singularity accretes more and more bullshit, we will experience the fatal spaghettification of significant human ideation in general, and I don’t see any natural way to stop it.

$5, same as in town. The way to kill the ouroboros is transaction costs. Just like a financial transaction tax kills high-velocity trading, a small, negligible cost can kill off algorithmic art. You don’t see a plague of algorithmic sculpture killing the sculpture industry, because there are real, tangible costs associated with producing a 3D piece of marble. As soon as you put any kind of weight on something, the infinite churn collapses.

A real artist would be happy to pay a penny to list their work; an autogenerated AI crap producer can’t, because they can’t filter for what will sell. If Amazon charged more to list items you’d see fewer garbage items listed, and drop shipping would evaporate.

Alternatively, you get what you pay for. So long as there are real people and real dollars in the decision tree, you head off the infinite nonsense.
posted by leotrotsky at 8:11 AM on December 22, 2023 [3 favorites]


Cost could add friction, but on Twitter, shifting the verified marks to a paid service unleashed a tidal wave of misinformation from accounts whose owners were willing to pay the $8.
posted by Artw at 10:06 AM on December 22, 2023


That's because Elon already came out as anti-moderation and pro-fascist before he started billing.
posted by leotrotsky at 10:15 AM on December 22, 2023


Elon is an idiot who says the quiet part loud, but it should probably be the default assumption that any platform owned by a significant amount of capital is going to be to an extent indifferent to moderation and to a greater extent pro-fascist.
posted by Artw at 10:23 AM on December 22, 2023


Gelfin: what you’re describing is called model collapse and it’s been a topic of discussion with LLMs since day one. The reasons that won’t happen are that 1) Cloud GPU Compute isn’t free, whether for training or inference, and 2) AI researchers have budgets, are aware of the issue and (mostly) aren’t idiots.

What I think will actually happen is a bifurcation of the training sets away from the 2021 known-human Common Crawl; Microsoft/OpenAI, Google, and Meta will basically grab all known to be human sources of text and images post 2021 and quietly add them to their (forever undisclosed) training sets. Post-2021 training data will of necessity be far less indiscriminate.

The open source AI community will probably split into two camps: those who mirror the actions of the megacorps to the extent they think they can get away with it while operating in public (LAION currently fills this role), and a separate community that springs up around what could be termed a “fair trade” spectrum of training sets. Spectrum as in ranging from pure-free sources (eg Project Gutenberg, US .gov agency output, certain more permissive Creative Commons content depots), to hybrid free + author-compensated training sets.

I honestly don’t have high hopes for the latter, but it seems like there might be enough people who care for it to eventually manifest.

An interesting thing I've noticed that separates the new XL models from last year's SD models is the big models are much less creative.

TBH this is probably just an industry trend in default values for temperature (language model term) / CFG Scale (generative art term; Classifier Free Guidance), based on user feedback. It’s a simple floating-point number directly controlling adherence to the prompt, and it’s one of the standard arguments when making a HuggingFace Hub InferenceClient API call or kicking off a Stable Diffusion random-seed batch run from Automatic1111’s workbench.
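
(Concretely, these are the two knobs as they're usually passed; a rough sketch with illustrative values, reusing the same libraries as above.)

```python
import torch
from huggingface_hub import InferenceClient
from diffusers import StableDiffusionPipeline

# Language-model side: temperature controls how loose/varied the wording is.
llm = InferenceClient(model="HuggingFaceH4/zephyr-7b-beta")
text = llm.text_generation("Describe a lighthouse at dawn.",
                           max_new_tokens=80,
                           temperature=1.2)   # higher = more varied, less literal

# Image side: guidance_scale (CFG) controls adherence to the prompt.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
img = pipe("a lighthouse at dawn",
           guidance_scale=12.0).images[0]     # higher = sticks more closely to the prompt
```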
posted by Ryvar at 12:20 PM on December 22, 2023




what you’re describing is called model collapse and it’s been a topic of discussion with LLMs since day one.

I’d certainly believe that this problem has been identified among researchers. I was more addressing the mainstream discussion that doesn’t seem to get much past the first-order “but my art!” problem. To the extent I am also a creator I’m quite sympathetic to that, but the point is, it’s much, much worse than that, and I’m afraid you haven’t managed to reassure me much that anyone has this well in hand or is even able to.

The reasons that won’t happen are that 1) Cloud GPU Compute isn’t free, whether for training or inference, and 2) AI researchers have budgets, are aware of the issue and (mostly) aren’t idiots.

Counterpoint here: I admit I am not an AI specialist, and I am the first to defer to the idea that the trained experts in a given discipline aren’t idiots. This is, however, not just about pure researchers. This is the tech industry. I absolutely do have specialized knowledge about that. With that authority, I suggest that it is terribly optimistic to assume that there are not idiots in positions to make catastrophic decisions, or that technical expertise will carry the day.

The ultimate problem is not the research in itself, but the hasty commercialization of it. You allude to a pre-2021 environment of indiscriminate data collection and a post-2021 environment that carefully selects “known human” content. Boy is this ever a bigger problem than you give it credit for, and it is the expense of curation that makes it so. We have multiple conflicting problems here. The model on the whole must not be in some sense weighted towards a baseline locked in 2021. It needs to stay relevant and representative. It needs a very large amount of data to remain relevant and representative, AND that data needs to all somehow be vetted as “known human,” which cannot be accomplished by any known automated process.

Appealing to the expertise of researchers here is not reassuring because this is not ultimately a technical problem. It is an epistemic, ethical and economic problem, or to cite a cliche, an example of “cheap, fast or good: pick any two.” This is no longer just a research project. It’s a product, subject to demands of a market that has already established itself in terms of data-sheet-ready statistics about the size of models and the like, selling to a customer base primed for decades to expect a steady cadence of newer/faster/bigger/better/cheaper.

If I’m a marketing guy I want you to tell me our data sheet shows bigger numbers than last year, and more importantly, bigger than our competitors. I also won’t want the competitors to be able to tell customers that our model is in some sense “stuck in 2021,” which the competitors’ marketing guys absolutely will do if they have any opportunity, regardless of whatever objections you will no doubt rightly have to that characterization. I need to be able to say that we had ten doomaflickies last year and we have twenty doomaflickies this year. That’s it. It doesn’t really matter, because the customer doesn’t really know what a doomaflickie is anyway.

If I’m a lawyer, I want to know what this “known human” vetting looks like and whether it opens us up to liability. Genuine, verifiable humans produce, say, a whole lot of racist content. Should we be taking that into our training corpus just because actual humans produced it? Sure, we’re already dealing with the fallout of racist content in previous models on the output side, but that source content was ingested automatically from a blind crawl. Once we are actively discriminating on content in any way it becomes a different ballgame. It’s no longer just something incidental but something we allowed. And racism is the relatively easy part. We’ve got to sail between the Scylla of accepting homophobic content and the Charybdis of not rejecting religious viewpoints. How far down this slope do we slide before our model has political opinions? What’s the picture a jury will get of us if and when discovery reveals our process here? We need some deniability here, which the automated crawling process gave us, so if you want this “actual humans” thing you need to figure out a way to verify that while we remain completely blind to the content itself. Maybe we collect content if and only if we can prove someone clicked one of those “I’m not a robot” checkboxes?

If I’m a bean counter, it’s not just the cost of cloud compute I’m worried about. I’m thinking about the cost of all this human-vetting, and how to get the numbers the marketing guy wants as cheaply as possible. In fact, the cloud compute is the easier sell when I go to drum up more money. Compute costs what it costs, everybody is working from more or less the same price sheets, and as long as I can show we’re not spending money we don’t have to, that’ll get approved. The cuts will end up coming from this “known human” business, which is all kind of hand-wavey to begin with, seems to be a speed bump of sorts, and costs a lot of money. We have to get those numbers down. Tell me, just for argument’s sake, what does a “best effort” solution look like?

No matter what your expert researchers say, this has just become a race to the bottom. The horse is out of the barn on the technology in general, but also on all these non-technical motivations and more, right down to the individual motivations of people just trying to keep their jobs. I share your faith in the ingenuity of technologists, but am deeply skeptical of technocrats and insist “engineer’s disease” is appropriately described as such. It’s not the algorithms that will fail, but the people. However well-understood the problem is, you cannot avert “model collapse,” but only perhaps delay it slightly while the people involved reassure themselves it isn’t happening at all. In tech, cheap and fast always win, while the bullshit artists characterize that as an unwavering commitment to progress (they’re actually just misspelling “profit”). Particularly on hard to quantify and/or implement concerns, corners will be cut. Targets will be set by people who don’t understand their implications, and grudgingly met by people who do understand the implications, but nobody listens to them.

In this sense focusing exclusively on the technology in itself is a red herring. Generative AI will be simultaneously the most spectacular and the most subtle demonstration of Conway’s Law. As a simulation of human output, it will reflect not just the structure of the organization, but its emergent values and shortcomings. This should be highly alarming to anyone who understands how corporate capitalism is already structured to diffuse responsibility and disempower individuals from acting on conscience. I am familiar with all of the industry arguments about the futility of halting technological progress, but I do not think we have the luxury of just accepting “disruption” on the scale this presents. “Move fast and break things” becomes a stupid idea at relativistic velocity.
posted by gelfin at 3:48 AM on December 23, 2023 [9 favorites]




This thread has been archived and is closed to new comments