Words that give away generative AI text


So far, AI companies have had trouble building tools that can reliably detect when a piece of writing was produced using a large language model (LLM). Now, a group of researchers has established a new method for estimating LLM use across a large set of scientific writing by measuring which “excess words” began to appear much more frequently during the LLM era (i.e., 2023 and 2024). According to the researchers, the results “suggest that at least 10 percent of the 2024 abstracts were processed with LLMs.”

In a preprint paper posted earlier this month, four researchers from Germany’s University of Tübingen and Northwestern University said they were inspired by studies that measured the impact of the COVID-19 pandemic by looking at excess deaths compared with the recent past. Taking a similar look at “excess word usage” after LLM writing tools became widely available in late 2022, the researchers found that “the appearance of LLMs led to an abrupt increase in the frequency of certain style words” that was “unprecedented in both quality and quantity.”

Going Deeper

To measure these vocabulary changes, the researchers analyzed 14 million paper abstracts published on PubMed between 2010 and 2024, tracking the relative frequency of each word as it appeared in each year. They then compared the expected frequency of those words (based on the trend line before 2023) to the actual frequency of those words in abstracts from 2023 and 2024, when LLMs were in widespread use.
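The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: it assumes that "relative frequency" means the share of a year's abstracts containing the word at least once, and that the "trend line" is an ordinary least-squares line through the chosen baseline years. The function names `yearly_frequency` and `expected_frequency` are hypothetical.

```python
import re
from collections import defaultdict


def yearly_frequency(abstracts, word):
    """Share of abstracts per year that contain `word` at least once.

    `abstracts` is an iterable of (year, text) pairs. Counting each
    abstract once is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in abstracts:
        totals[year] += 1
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if word in tokens:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in totals}


def expected_frequency(freq_by_year, baseline_years, target_year):
    """Extrapolate a least-squares line through the baseline years.

    The linear fit is an assumed stand-in for "the trend line before
    2023"; the result is clipped at zero because a frequency cannot
    be negative.
    """
    xs = list(baseline_years)
    ys = [freq_by_year[year] for year in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return max(0.0, mean_y + slope * (target_year - mean_x))
```

Calling `expected_frequency(freq, range(2010, 2023), 2024)` on the output of `yearly_frequency` would give the counterfactual 2024 frequency to compare against the observed one.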

The results revealed a number of words that were extremely uncommon in these scientific abstracts before 2023 but suddenly surged in popularity after LLMs were introduced. For example, the word “delves” appears 25 times more often in 2024 papers than would be expected given the pre-LLM trend; words such as “showcasing” and “underscore” also increased in use ninefold. Other, previously common words became significantly more common in post-LLM abstracts: for example, “potential” increased in frequency by 4.1 percentage points, “conclusion” increased by 2.7 percentage points, and “important” increased by 2.6 percentage points.
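The two kinds of figures quoted here (a multiple such as "25 times" for rare words, and percentage points for common ones) correspond to two different ways of comparing observed and expected frequency. A minimal sketch, assuming both frequencies are expressed as shares of abstracts between 0 and 1; the function name `excess_measures` and the input numbers are hypothetical and are not taken from the study:

```python
def excess_measures(observed, expected):
    """Return (ratio, gap) for one word.

    ratio: how many times more frequent the word is than expected;
           most informative for words that used to be rare.
    gap:   observed minus expected, as a share of abstracts;
           multiply by 100 to read it as percentage points.
    """
    gap = observed - expected
    ratio = observed / expected if expected > 0 else float("inf")
    return ratio, gap


# Illustrative inputs only (invented for this sketch).
rare_ratio, rare_gap = excess_measures(observed=0.0025, expected=0.0001)
common_ratio, common_gap = excess_measures(observed=0.441, expected=0.400)

print(f"rare word:   ratio {rare_ratio:.1f}x, gap {rare_gap * 100:.2f} points")
print(f"common word: ratio {common_ratio:.2f}x, gap {common_gap * 100:.2f} points")
```

A rare word can show a huge ratio while barely moving the gap, and a common word can show a large gap with a ratio close to 1, which is why both measures are useful.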

Such changes in word use can occur independently of LLM use, of course; the natural evolution of language means that words sometimes go in and out of style. However, the researchers found that, in the pre-LLM period, such large and sudden year-on-year increases were only seen for words related to major world health events: “Ebola” in 2015; “Zika” in 2017; and words such as “coronavirus,” “lockdown,” and “pandemic” in the period from 2020 to 2022.

In the post-LLM period, however, the researchers found hundreds of words with a sudden, pronounced increase in scientific use that had no common connection to world events. In fact, while the excess words during the COVID pandemic were primarily nouns, the researchers found that the words that jumped in frequency after LLMs arrived were primarily “style words,” such as verbs, adjectives, and adverbs (a small sample: “across, in addition, comprehensively, significantly, amplify, demonstrate, insightfully, specifically, particularly, within”).

This is not an entirely new finding – for example, the increasing prevalence of “delve” in scientific papers has been widely observed in recent times. But previous studies have typically relied on comparisons with “ground truth” human writing samples or lists of predefined LLM markers obtained from outside the study. Here, the set of pre-2023 abstracts serves as its own effective control group to show how vocabulary choice has changed overall in the post-LLM era.

A Complex Interaction

Highlighting the hundreds of so-called “marker words” that became much more common in the post-LLM era can sometimes make the telltale signs of LLM use easier to spot. Take this example abstract line called out by the researchers, with the marker words capitalized: “A Detailed understanding of the Complex interactions between (…) and (…) is Central to effective therapeutic strategies.”

After doing some statistical measurements of the presence of marker terms in individual papers, the researchers estimate that at least 10 percent of post-2022 papers in the PubMed corpus were written with at least some LLM assistance. The researchers say this number could be even higher, as their set may be missing LLM-assisted abstracts that do not contain any of the marker terms they identified.
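One simple way to see why such an estimate is a lower bound is sketched below. This is not the researchers' actual procedure, which the article does not spell out. The sketch assumes a fixed set of marker words and uses the pre-LLM share of abstracts containing any of them as the baseline, rather than a trend-line extrapolation. All function names are hypothetical.

```python
import re


def contains_marker(text, markers):
    """True if the abstract contains at least one marker word."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return not tokens.isdisjoint(markers)


def marker_share(abstracts, markers):
    """Share of abstracts that contain at least one marker word."""
    if not abstracts:
        return 0.0
    hits = sum(contains_marker(text, markers) for text in abstracts)
    return hits / len(abstracts)


def lower_bound_llm_share(pre_llm_abstracts, post_llm_abstracts, markers):
    """Excess share of marker-containing abstracts after LLMs arrived.

    Simplifying assumption: the pre-LLM share is what would have been
    seen without LLMs. LLM-assisted abstracts that use no marker word
    are not counted, so the true share can only be higher.
    """
    baseline = marker_share(pre_llm_abstracts, markers)
    observed = marker_share(post_llm_abstracts, markers)
    return max(0.0, observed - baseline)
```

For example, if 25 percent of pre-LLM abstracts and 75 percent of post-LLM abstracts contained a marker word, this sketch would report an excess of 50 percentage points as the floor on LLM involvement.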
