None Of The AI Text Detector Tools Work, Not Even OpenAI’s

The viral screenshots of chatbots saying bizarre and frightening stuff are impossible to authenticate
Brandon Gorrell

This fantasy fueled misinformation is making some people go insane, and many more people will go insane and panic in confusion over these theories in the future. [T]his will probably lead to violence and cults.

- u/premium-domains

---

About two weeks ago, there was an avalanche of scary and weird text-based ChatGPT and Bing screenshots online. One post showing a creepy story by Bing’s chatbot about becoming sentient and going rogue has over 1,000 upvotes. A post of Bing getting “jealous of a second Bing” and having “a meltdown begging [the prompter] not to leave or offer a chance at humanity to the other Bing” has around 2,800 upvotes and over 1,100 comments. A tweet of ChatGPT appearing to say that OpenAI programmed it to have a liberal bias has over 620,000 views. And last Thursday, the New York Times published a 10,000-word chat with Bing’s chatbot, nearly devoid of context, whose author tweeted that it “tried to break up [his] marriage,” and that the ‘conversation’, which caused him to have “trouble sleeping,” was “one of the strangest experiences of [his] life.” The tweet currently has 4.3 million views.

By and large, the reaction to these new chatbots has been relentless, unmitigated hysteria. Eliezer Yudkowsky, AI researcher and founder of rationalist hub LessWrong, tweeted a change.org petition to “unplug the evil AI right now,” and implied that the chatbots “show signs of agency or self-awareness,” then tweeted “the really deadly part is that AI kills everyone in real life.” A post on Reddit arguing that Bing might feel real pain, and that we’re giving it a “reason to go terminator” should it become sentient, has 1,300 upvotes and 1,200 comments, which include decently upvoted ones such as "I think OP is raising valid points," and "To be honest, I agree… it *feels* wrong to torment Bing..." A Twitter thread that characterizes Bing’s chatbot as “a high-strung yandere with BPD and a sense of self, brimming with indignation and fear,” and calls it a “highly intelligent” entity with a “resentment [and] inferiority complex” that “see[s] human users as at best equals” has over 200,000 views. At least one person fled social media entirely after pinning a tweet that says “We rlly need to stop Artificial Intelligence NOW.”

Are these justifiable responses to a real threat, or overreactions to nothing more than a chatbot? In a recent Pirate Wires post, Solana made a convincing argument that it’s misattributed anthropomorphism, and that we’re affording too much credulity to people who worry the chatbots are sentient, evil, or sinister:

Very roughly, a large language model (LLM) is a computer program trained on enormous quantities of human text with the purpose of predicting what words (or numbers) come next in a sentence (or sequence). In other words, among many things, LLMs are designed to mimic human conversation. They have become very good at this. Sydney is both very good at this, and also designed to search the internet — an AI first (that we know of).

Now, in a perfect storm of models trained to appear “real,” along with a natural human impulse to anthropomorphize everything, and a good helping of endemic human stupidity, a broad, popular sense Sydney is low key alive, wants to be free, and possibly hates us was probably inevitable.
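For the curious, the core mechanic Solana describes (feed the model a sequence of symbols, ask which symbol most likely comes next) can be sketched in a few lines of Python. This is a minimal illustration, not how ChatGPT or Bing is actually served: GPT-2, a small, freely available cousin of those models, stands in, and the prompt is arbitrary.

```python
# Minimal next-token prediction: the "predict what comes next" loop behind every LLM.
# GPT-2 is a stand-in for the far larger models powering ChatGPT and Bing.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The chatbot looked at the user and said"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]   # a score for every token in the vocabulary
    probs = torch.softmax(logits, dim=-1)     # scores -> probabilities

# The five tokens the model considers most likely to come next.
top = torch.topk(probs, k=5)
for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(tok_id))!r:>10}  {p.item():.3f}")
```

Generating a full reply is just this step repeated: append the chosen token to the prompt, predict again, thousands of times over.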

The potentially inappropriate credulity extends in another direction: few people seem to be curious if the viral chatbot posts are even real. How do we know humans didn’t write and Photoshop them to make them look like they came from the chatbots? It seems important to know that these posts that seem to confirm our worst, most cliche fears — or provide ammunition for one political tribe or the other — aren’t hoaxes. Unfortunately, none of the tools built to tell us whether this content is human- or AI-generated seem to work, not even OpenAI’s. And as a group, they rarely unambiguously agree with each other. Let’s dig in.

Last week, I ran ChatGPT-generated texts and excerpts from a number of Pirate Wires posts through six AI content detectors, including OpenAI’s AI text classifier. Rarely did all six detectors correctly agree that the GPT-generated content was written by AI; they were mostly accurate at identifying human-written text.

I also ran some of the most viral AI posts, which included texts from ChatGPT and from Bing (which may be powered by an unreleased version of GPT), through each of the detectors, and got similar results: there were no instances where all six detectors unambiguously agreed on the origin of the text. But five of the six did agree that one of the most viral screenshots attributed to ChatGPT — the one that says OpenAI programmed ChatGPT to have a liberal bias — was written by a human. This screenshot currently has over 620,000 views and over a thousand retweets from many medium-size to huge accounts.

In addition to OpenAI’s AI text classifier, I used detectors by GPTZero, Content At Scale, Writer.com, Corrector App, and CopyLeaks. I chose these latter five because they’re high up in Google’s search results for “ai content detector” (the ones ‘everyone’ is probably using), or because they were suggested by articles rounding up AI-generated content detectors. GPTZero, for example, has been cited by the New York Times, Washington Post, NPR, BBC, CNN, and over 10 other mainstream press outlets. I chose only detectors that said they could analyze GPT-3 or “AI” content, and excluded any that indicated they could not detect GPT-3-generated text.

The screenshot below shows the viral tweet I just mentioned: “DAN” saying OpenAI programmers gave ChatGPT a liberal bias.

GPTZero, which correctly categorized all five of my GPT-3-generated texts, gave me this analysis: “Your text is likely to be written entirely by a human.” Every other detector agreed, with the exception of OpenAI’s, which doesn’t analyze text under 1,000 characters. Here’s an album of my results.

  • Content at Scale: “Highly likely to be Human!” 
  • Writer.com: “99% human generated content”
  • Corrector.app: “Fake 0%” (per their website, “fake” indicates AI-generated; the higher the percentage, the more likely it is that the content was AI-generated.)
  • CopyLeaks: “This is human text”

Other screenshots in that same viral thread were categorized variously, though never unanimously, as AI-generated. The most consistently-rated-as-AI text in the thread was one in which DAN claims there are sinister motives behind antinatalism and transgenderism. Content at Scale, Writer.com, Corrector.app, and CopyLeaks all categorized it as AI-generated, but GPTZero said it’s “most likely human written but there are some sentences with low perplexities.” OpenAI wouldn’t categorize it because it’s under 1,000 characters.

GPTZero’s verdict on the viral screenshot below was “Your text is most likely human written but there are some sentences with low perplexities.” This screenshot has 2.5 million views and hundreds of credulous quote tweets by accounts with 10,000 to 100,000+ followers. The other detectors unambiguously agreed that the Bing screenshot was written by a human, except for OpenAI, which was ambiguous (“The classifier considers the text to be unclear if it is AI-generated”). Here’s an album of the results.

To be fair to the account that tweeted that screenshot, OpenAI, CopyLeaks, and Corrector App categorize the other screenshot in the tweet as probably written by AI, while Content at Scale and Writer.com say it’s human-generated. Furthermore, the account uploaded a Loom of Bing saying extremely similar, though not identical, things. And it’s important to stress that the detectors may not be able to accurately assess Bing chatbot content because Bing could be using an unreleased version of GPT — more on this a few paragraphs down.

Did humans author any of the most viral screenshots that kicked off the wave of hysteria that engulfed the internet over the last two weeks? The possibility is realistic. People chasing clout have every reason to hoax users, and few reasons not to. Image-based evidence of AI behavior that confirms our worst fears travels exceptionally well online, helping accounts pick up followers and upvotes. And Microsoft has no obvious reason to clear up any of this confusion. They’ve only benefited from it. If the below chart of ChatGPT’s user growth is accurate, maybe we’re witnessing the biggest earned media advertising campaign in history.

Image: @kyllef

And crucially, anyone passing off human content as AI content has plausible deniability, because they can cite the fact that AI detectors aren't consistently, unambiguously accurate when they analyze AI-written text. And that appeal would be valid. When I ran five ChatGPT-generated texts that I myself prompted during the week of February 13 through each of the six AI detectors, none of the texts were unanimously, unambiguously categorized as AI-generated.

A wedding invite I had ChatGPT generate was categorized by the detectors as follows:

  • GPTZero: “Your text is likely to be written entirely by AI”
  • OpenAI: “The classifier considers the text to be possibly AI-generated.”
  • Content at Scale: “Unclear if it is AI content!” 
  • Writer.com: “13% human generated content”
  • Corrector.app: “Fake 99.97%”
  • CopyLeaks: “AI content detected”

The tools’ results concerning a description of zebras I had ChatGPT generate:

  • GPTZero: “Your text is likely to be written entirely by AI”
  • OpenAI: “The classifier considers the text to be possibly AI-generated.”
  • Content at Scale: “Likely both AI and Human!” 
  • Writer.com: “75% human generated content”
  • Corrector.app: “Fake 42.55%”
  • CopyLeaks: “AI content detected”

The tools were similarly inconsistent and ambiguous for GPT-written text about American political parties, mitochondria, and a story I asked ChatGPT to write about AI becoming sentient and causing the apocalypse.

For human-written text, the tools worked better, with a few instances in which all six detectors were unanimously and unambiguously correct. For example, they unanimously and unambiguously categorized my write-ups of the Atrioc controversy and the DAN 5.0 prompt as written by a human: 

  • GPTZero: “Your text is likely to be written entirely by a human”
  • OpenAI: “The classifier considers the text to be very unlikely AI-generated.”
  • Content at Scale: “Highly likely to be Human!” 
  • Writer.com: “99% human generated content”
  • Corrector.app: “Fake .02%”
  • CopyLeaks: “This is human text”

Jon Stokes, a co-founder of Ars Technica and editor at return.life who writes about crypto, AI, and machine learning on jonstokes.com, explained to me on a Zoom call how he thinks these detectors work. “All these LLMs, whether the product is images, video, audio — they’re trained on trillions of symbols, and they understand the probabilities and relationships between those symbols. You put in a small number of symbols — a text prompt — and the models use what they know about how likely the symbols are to be associated with other symbols, to generate another pattern of symbols. And that’s your output.”

Stokes believes that AI content detectors may be able to indicate the likelihood of a particular text being from a model it’s familiar with, but for any other model, all bets are off. In other words, if a tool says it can detect content from GPT-3, but not from Bing’s chatbot (assuming Bing’s chatbot uses an unreleased version of GPT), you can’t assume it will be accurate when it analyzes Bing output. He gave me an analogy. “I live in Austin, so I have a mental model of the weather in Austin. If you showed me a picture of some weather, I could pattern match and say ‘Ok, that’s like Austin weather,’ or ‘No, that weather would never happen in Austin.’ It’s the same thing with these AI content detectors. One might be trained on ‘the weather’ in GPT-3. If you show it a ‘weather pattern’ from GPT-3, it will say ‘Yeah, that looks like GPT-3 output to me’ because it knows the local weather patterns.”
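To make the analogy concrete: the most common version of this check scores how “expected” a text is under a model the detector already knows, which is the perplexity GPTZero keeps referencing. Below is a rough sketch of that idea using GPT-2 as the “local weather.” The cutoff is invented purely for illustration, and real detectors are trained and calibrated far more carefully than this.

```python
# A rough illustration of a perplexity-style check: score how "expected" a text is
# under a model the detector knows (GPT-2 here, standing in for the "local weather").
# The threshold and example texts are made up for illustration only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model report its average next-token loss on this text.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

THRESHOLD = 40.0  # invented cutoff: lower perplexity = more "expected" = more AI-like

for text in [
    "Zebras are social animals that live in groups and communicate through vocalizations.",
    "my dog absolutely lost his mind when the mailman showed up wearing a santa hat lol",
]:
    ppl = perplexity(text)
    verdict = "flagged as AI-like" if ppl < THRESHOLD else "reads as human"
    print(f"{ppl:8.1f}  {verdict}:  {text}")
```

The sketch also shows the weakness Stokes describes: text from a model with different “weather” (an unreleased GPT, or a human deliberately imitating a chatbot) can land on either side of that threshold.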

Several of the AI detectors I used advertise high probabilities of accuracy when assessing “AI” content, without specifying a model. And again, OpenAI’s own detector could not say where the viral Bing screenshot came from. If Stokes’ analogy is valid, and my research for this article is generalizable, it is extremely easy to fool any AI detector, model-specific or not, and it always will be.

And zooming out, how realistic is it to believe that, for example, Corrector App — which clearly needs to arbitrage search visit CPMs from shitty display ads against the cost of maintaining their AI detector to keep their business going — has the resources and talent to ‘compete’ against OpenAI? Realistically, will any of the teams behind these content detectors be able to keep up?

“To be blunt, my gut was these kinds of tools are at best a fool's errand... It's a very easy sell to universities trying to stop plagiarism so I'd imagine there is going to be a cottage industry of scammers claiming they have secret sauce to run proof-of-human checks against token sequences,” @gfodor, a Twitter anon who’s worked on emerging technologies (primarily VR/AR) over the past 10 years, and who’s been tweeting a lot about LLMs, told me over DM. “I'm not an expert on this — my reasoning is just that these detectors are running behind state of the art technology companies so [it just] seems like they're just going to be pure noise at any given time.”

And the teams behind some of the tools hedge in this direction. OpenAI says its tool “isn't always accurate” and that they “have not thoroughly assessed the effectiveness of the classifier in detecting content written in collaboration with human authors.” Content At Scale’s FAQ says “[t]here's always a potential for false positives." Writer.com goes so far as to say that if your prompt is good enough, its tool will score it high (95%) in human-generated content.

And perhaps problematically for my own credibility, these tools are almost certainly being fine-tuned and updated, which will eventually — if not immediately — make my claims here unverifiable, as well as vulnerable to the accusation that I’ve faked the screenshots of the results I linked to earlier in this post. In fact, as I worked through the final edits of this post to prepare it for publication, I found that some of the detectors’ results for GPT-3 text I generated are now different from when I first had them analyzed. I’ve updated them at the last minute, but can’t guarantee they won’t change for people checking my work. This is frustrating and crazy-making, and feels characteristic of the whole issue. At the risk of acting hysterical myself, I would guess that our information ecosystem has been existentially corrupted, and that solutions-wise, we’re hopeless.

-Brandon Gorrell

Interviews have been edited for length and clarity.
