Stylometry recognizes human and LLM-generated texts in short samples

2025
journal article
article
dc.abstract.en: The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) texts processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) texts processed through rephrasing methods (Dipper, T5). The 10-sentence-long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to 0.87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between 0.79 and 1.0 in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to 0.98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show – crucially, in the context of increasingly sophisticated LLMs – that it is possible to distinguish machine- from human-generated texts, at least for a well-defined text type.
dc.affiliation: Faculty of Physics, Astronomy and Applied Computer Science : Institute of Applied Computer Science
dc.affiliation: Faculty of Physics, Astronomy and Applied Computer Science : Institute of Theoretical Physics
dc.contributor.author: Przystalski, Karol - 126070
dc.contributor.author: Argasiński, Jan - 105948
dc.contributor.author: Grabska-Gradzińska, Iwona - 121296
dc.contributor.author: Ochab, Jeremi - 122224
dc.date.accession: 2025-07-21
dc.date.accessioned: 2025-07-24T12:54:11Z
dc.date.available: 2025-07-24T12:54:11Z
dc.date.createdat: 2025-07-21T07:37:40Z
dc.date.issued: 2025
dc.date.openaccess: 0
dc.description.accesstime: at the time of publication
dc.description.version: final publisher's version
dc.description.volume: 296, Part B
dc.identifier.articleid: 129001
dc.identifier.doi: 10.1016/j.eswa.2025.129001
dc.identifier.issn: 0957-4174
dc.identifier.project: DRC AI
dc.identifier.uri: https://ruj.uj.edu.pl/handle/item/558168
dc.identifier.weblink: https://www.sciencedirect.com/science/article/pii/S0957417425026181
dc.language: eng
dc.language.container: eng
dc.rights: I grant a licence. Attribution 4.0 International
dc.rights.licence: CC-BY
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/legalcode.pl
dc.share.type: other
dc.subject.en: stylometry
dc.subject.en: large language models
dc.subject.en: machine-generated text detection
dc.subject.en: AI detection
dc.subject.en: benchmark dataset
dc.subtype: Article
dc.title: Stylometry recognizes human and LLM-generated texts in short samples
dc.title.journal: Expert Systems with Applications
dc.type: JournalArticle
dspace.entity.type: Publication
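
The abstract reports its multiclass result as a Matthews correlation coefficient (MCC). As a rough illustration of what that score measures, the sketch below computes the multiclass MCC from true and predicted labels using the standard generalised formula. It is not the authors' evaluation code; the function name and the toy labels are assumptions made for this example only.

```python
import math
from collections import Counter


def matthews_corrcoef(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (generalised R_K form).

    Illustrative helper only; the name and interface are assumptions,
    not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                  # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified samples
    t_k = Counter(y_true)  # how often each class truly occurs
    p_k = Counter(y_pred)  # how often each class was predicted
    labels = set(t_k) | set(p_k)
    cov_tp = c * s - sum(t_k[k] * p_k[k] for k in labels)
    cov_tt = s * s - sum(t_k[k] ** 2 for k in labels)
    cov_pp = s * s - sum(p_k[k] ** 2 for k in labels)
    denom = math.sqrt(cov_tt * cov_pp)
    # By convention, return 0 when only one class is present or predicted.
    return cov_tp / denom if denom else 0.0


# Toy labels (hypothetical, not data from the paper).
truth = ["human", "gpt4", "llama", "human", "gpt4", "llama"]
perfect = list(truth)
binary_truth = [1, 1, 1, 0, 0, 0]
binary_pred = [1, 1, 0, 0, 0, 1]

print(matthews_corrcoef(truth, perfect))
print(round(matthews_corrcoef(binary_truth, binary_pred), 3))
```

Unlike plain accuracy, the score stays near 0 for a classifier that always predicts the majority class, which is one reason it is often preferred when comparing several classes at once.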