Expert-annotated dataset to study cyberbullying in Polish language

2024
journal article
article
dc.abstract.en
We introduce the first dataset of harmful and offensive language collected from the Polish Internet. This dataset was meticulously curated to facilitate the exploration of harmful online phenomena such as cyberbullying and hate speech, which have surged significantly both on the Polish Internet and globally. The dataset was systematically collected and then annotated using two approaches. First, it was annotated by two proficient layperson volunteers, operating under the guidance of a specialist in the language of cyberbullying and hate speech. To enhance the precision of the annotations, a second round of annotation was carried out by a team of annotators with specialized long-term expertise in cyberbullying and hate speech annotation. This second phase was overseen by an experienced annotator acting as a super-annotator. In its initial application, the dataset was used to classify instances of cyberbullying in the Polish language. Specifically, the dataset serves as the foundation for two distinct tasks: (1) a binary classification that separates harmful from non-harmful messages and (2) a multi-class classification that distinguishes between two types of harmful content (cyberbullying and hate speech), as well as a non-harmful category. Alongside the dataset itself, we also provide the models that showed satisfactory classification performance. These models are made accessible for third-party use in constructing cyberbullying prevention systems.
dc.affiliation
Wydział Studiów Międzynarodowych i Politycznych : Instytut Bliskiego i Dalekiego Wschodu
dc.contributor.author
Ptaszynski, Michal
dc.contributor.author
Pieciukiewicz, Agata
dc.contributor.author
Dybała, Paweł - 242662
dc.contributor.author
Skrzek, Pawel
dc.contributor.author
Soliwoda, Kamil
dc.contributor.author
Fortuna, Marcin
dc.contributor.author
Leliwa, Gniewosz
dc.contributor.author
Wroczynski, Michal
dc.date.accessioned
2024-05-07T13:36:24Z
dc.date.available
2024-05-07T13:36:24Z
dc.date.issued
2024
dc.date.openaccess
0
dc.description.accesstime
at the time of publication
dc.description.additional
The author is credited on the publication as Pawel Dybala
dc.description.number
1
dc.description.physical
1-26
dc.description.version
publisher's final version
dc.description.volume
9
dc.identifier.doi
10.3390/data9010001
dc.identifier.eissn
2306-5729
dc.identifier.issn
2306-5729
dc.identifier.uri
https://ruj.uj.edu.pl/handle/item/338630
dc.language
eng
dc.language.container
eng
dc.rights
Licence granted. Attribution 4.0 International
dc.rights.licence
CC-BY
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/legalcode.pl
dc.share.type
open access journal
dc.subject.en
cyberbullying
dc.subject.en
hate speech
dc.subject.en
abusive language
dc.subject.en
offensive language
dc.subject.en
toxic language
dc.subject.en
automatic cyberbullying detection
dc.subject.en
polish language
dc.subtype
Article
dc.title
Expert-annotated dataset to study cyberbullying in Polish language
dc.title.journal
Data
dc.type
JournalArticle
dspace.entity.type
Publication