Expert-annotated dataset to study cyberbullying in Polish language

Ptaszynski, Michal; Pieciukiewicz, Agata; Dybała, Paweł; Skrzek, Pawel; Soliwoda, Kamil; Fortuna, Marcin; Leliwa, Gniewosz; Wroczynski, Michal

doi:10.3390/data9010001

2024

journal article

article

10.3390/data9010001

Journal

Data

Author

Ptaszynski Michal

Pieciukiewicz Agata

Dybała Paweł

Skrzek Pawel

Soliwoda Kamil

Fortuna Marcin

Leliwa Gniewosz

Wroczynski Michal

Volume

9

Issue

1

Pages

1-26

ISSN

2306-5729

eISSN

2306-5729

Keywords in English

cyberbullying

hate speech

abusive language

offensive language

toxic language

automatic cyberbullying detection

Polish language

Remarks

The author is credited in the publication as Pawel Dybala.

Language

English

Journal language

English

Abstract in English

We introduce the first dataset of harmful and offensive language collected from the Polish Internet. This dataset was meticulously curated to facilitate the exploration of harmful online phenomena such as cyberbullying and hate speech, which have surged significantly both within the Polish Internet and globally. The dataset was systematically collected and then annotated using two approaches. First, it was annotated by two proficient layperson volunteers, operating under the guidance of a specialist in the language of cyberbullying and hate speech. To enhance the precision of the annotations, a second round of annotation was carried out by a team of adept annotators with specialized long-term expertise in cyberbullying and hate speech annotation. This second phase was further overseen by an experienced annotator, acting as a super-annotator. In its initial application, the dataset was used for the categorization of cyberbullying instances in the Polish language. Specifically, the dataset serves as the foundation for two distinct tasks: (1) a binary classification that separates harmful from non-harmful messages and (2) a multi-class classification that distinguishes between two variations of harmful content (cyberbullying and hate speech), as well as a non-harmful category. Alongside the dataset itself, we also provide the models that showed satisfactory classification performance. These models are made accessible for third-party use in constructing cyberbullying prevention systems.

dc.affiliation	Wydział Studiów Międzynarodowych i Politycznych : Instytut Bliskiego i Dalekiego Wschodu
dc.contributor.author	Ptaszynski, Michal
dc.contributor.author	Pieciukiewicz, Agata
dc.contributor.author	Dybała, Paweł - 242662
dc.contributor.author	Skrzek, Pawel
dc.contributor.author	Soliwoda, Kamil
dc.contributor.author	Fortuna, Marcin
dc.contributor.author	Leliwa, Gniewosz
dc.contributor.author	Wroczynski, Michal
dc.date.accessioned	2024-05-07T13:36:24Z
dc.date.available	2024-05-07T13:36:24Z
dc.date.issued	2024
dc.date.openaccess	0
dc.description.accesstime	at the time of publication
dc.description.additional	The author is credited in the publication as Pawel Dybala.
dc.description.number	1
dc.description.physical	1-26
dc.description.version	publisher's final version
dc.description.volume	9
dc.identifier.doi	10.3390/data9010001
dc.identifier.eissn	2306-5729
dc.identifier.issn	2306-5729
dc.identifier.uri	https://ruj.uj.edu.pl/handle/item/338630
dc.language	eng
dc.language.container	eng
dc.rights	I grant a licence. Attribution 4.0 International
dc.rights.licence	CC-BY
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/legalcode.pl
dc.share.type	open-access journal
dc.subject.en	cyberbullying
dc.subject.en	hate speech
dc.subject.en	abusive language
dc.subject.en	offensive language
dc.subject.en	toxic language
dc.subject.en	automatic cyberbullying detection
dc.subject.en	Polish language
dc.subtype	Article
dc.title	Expert-annotated dataset to study cyberbullying in Polish language
dc.title.journal	Data
dc.type	JournalArticle
dspace.entity.type	Publication

Open Access

Files

dybala_et-al_expert-annotated_dataset_2024.pdf (PDF, 5.72 MB)

License

Except as otherwise noted, this item is licensed under the Attribution 4.0 International licence.

Collections

Research publications

Social sciences