Efficient mixture model for clustering of sparse high dimensional binary data

Śmieja, Marek; Hajto, Krzysztof; Tabor, Jacek

doi:10.1007/s10618-019-00635-1

2019

journal article

article

10.1007/s10618-019-00635-1

Journal

Data Mining and Knowledge Discovery

140

Author

Śmieja Marek

Hajto Krzysztof

Tabor Jacek

Volume

33

Pages

1583-1624

ISSN

1384-5810

eISSN

1573-756X

Language

English

Container language

English

Abstract in English

Clustering is one of the fundamental tools for preliminary analysis of data. While most clustering methods are designed for continuous data, sparse high-dimensional binary representations have become very popular in various domains such as text mining and cheminformatics. Applying classical clustering tools to this type of data usually proves very inefficient, both in terms of computational complexity and in terms of the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering sparse high-dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix (1) is specially designed for processing sparse data, (2) can be efficiently realized by an on-line Hartigan optimization algorithm, and (3) describes every cluster by its most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions with a higher compatibility with the reference grouping than related methods. Moreover, the constructed representatives often reveal the internal structure of the data better.

WoS citations

7

cris.lastimport.wos	2024-04-09T18:19:16Z
dc.abstract.en	Clustering is one of the fundamental tools for preliminary analysis of data. While most clustering methods are designed for continuous data, sparse high-dimensional binary representations have become very popular in various domains such as text mining and cheminformatics. Applying classical clustering tools to this type of data usually proves very inefficient, both in terms of computational complexity and in terms of the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering sparse high-dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix (1) is specially designed for processing sparse data, (2) can be efficiently realized by an on-line Hartigan optimization algorithm, and (3) describes every cluster by its most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions with a higher compatibility with the reference grouping than related methods. Moreover, the constructed representatives often reveal the internal structure of the data better.	pl
dc.affiliation	Wydział Matematyki i Informatyki : Instytut Informatyki i Matematyki Komputerowej	pl
dc.contributor.author	Śmieja, Marek - 135996	pl
dc.contributor.author	Hajto, Krzysztof - 178376	pl
dc.contributor.author	Tabor, Jacek - 132362	pl
dc.date.accessioned	2020-01-28T09:16:55Z
dc.date.available	2020-01-28T09:16:55Z
dc.date.issued	2019	pl
dc.date.openaccess	0
dc.description.accesstime	at the time of publication
dc.description.physical	1583-1624	pl
dc.description.version	publisher's final version
dc.description.volume	33	pl
dc.identifier.doi	10.1007/s10618-019-00635-1	pl
dc.identifier.eissn	1573-756X	pl
dc.identifier.issn	1384-5810	pl
dc.identifier.project	2016/21/D/ST6/00980	pl
dc.identifier.project	2017/25/B/ST6/01271	pl
dc.identifier.project	ROD UJ / OP	pl
dc.identifier.uri	https://ruj.uj.edu.pl/xmlui/handle/item/147662
dc.language	eng	pl
dc.language.container	eng	pl
dc.rights	I grant a license. Attribution 4.0 International	*
dc.rights.licence	CC-BY
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/legalcode.pl	*
dc.share.type	other
dc.subtype	Article	pl
dc.title	Efficient mixture model for clustering of sparse high dimensional binary data	pl
dc.title.journal	Data Mining and Knowledge Discovery	pl
dc.type	JournalArticle	pl
dspace.entity.type	Publication