Carl Franzen at VentureBeat writes:
As those following the burgeoning industry and its underlying research know, the data used to train… products such as ChatGPT, Stable Diffusion and Midjourney comes initially from human sources – books, articles, photographs and so on – that were created without the help of artificial intelligence. Now, as more people use AI to produce and publish content, an obvious question arises: What happens as AI-generated content proliferates around the internet, and AI models begin to train on it, instead of on primarily human-generated content?
[Researchers] published a paper on their work on the open access [preprint server] arXiv. What they found is worrisome for current generative AI technology and its future: “We find that use of model-generated content in training causes irreversible defects in the resulting models.” Specifically looking at probability distributions for text-to-text and image-to-image AI generative models, the researchers concluded that “learning from data produced by other models causes model collapse – a degenerative process whereby, over time, models forget the true underlying data distribution … this process is inevitable, even for cases with almost ideal conditions for long-term learning.”
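The statistical mechanism behind the quoted finding is easy to see in miniature: each generation fits a distribution to a finite sample from the previous generation, so sampling error compounds and the tails of the true distribution are progressively forgotten. Here is a minimal sketch of that feedback loop, not the paper's actual experiment; it uses a one-dimensional Gaussian as a stand-in for “the true underlying data distribution,” and the function names are mine:

```python
import random
import statistics

def train_generation(mu, sigma, n_samples, rng):
    # "Train" the next model: fit a Gaussian to data sampled from the
    # previous generation's model instead of from the true distribution.
    samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)

def simulate_collapse(generations=200, n_samples=10, seed=42):
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the human-generated data
    history = [(mu, sigma)]
    for _ in range(generations):
        mu, sigma = train_generation(mu, sigma, n_samples, rng)
        history.append((mu, sigma))
    return history

history = simulate_collapse()
print(f"generation   0: sigma = {history[0][1]:.4f}")
print(f"generation {len(history) - 1}: sigma = {history[-1][1]:.4f}")
```

Run it and the fitted standard deviation shrivels toward zero over the generations: each refit loses a little of the spread it can never get back, which is the “degenerative process” the researchers describe, compressed into a toy.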
Garbage in, garbage out – and if this paper is correct, generative AI is turning into the self-licking ice cream cone of garbage generation.
One effect I hope for is that human-written, non-formulaic writing will begin to stand out to search engines because its rhythm, creativity, and idiosyncrasies increasingly differ from the derivative pablum that already fills blogs.
Clickbait articles follow such uninspired formulae that they already drop out of search results – search for news about the British royals (Britain’s own Kardashians), and you won’t see many Google results like “Shock at what Harry said to William!” because the algorithm already ranks such pieces as junk.
Any article that says "Many examples… Additionally… Furthermore…" is almost certainly written by Asian outsourcers paid by the word, or by AI.