Efficient mixture model for clustering of sparse high dimensional binary data
Clustering is one of the fundamental tools for preliminary analysis of data. While most clustering methods are designed for continuous data, sparse high-dimensional binary representations have become very popular in domains such as text mining and cheminformatics. Applying classical clustering tools to this type of data usually proves very inefficient, both in computational complexity and in the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering sparse high-dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix is specially designed for processing sparse data, can be efficiently realized by an on-line Hartigan optimization algorithm, and describes every cluster by its most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions that agree more closely with the reference grouping than those of related methods. Moreover, the constructed representatives often better reveal the internal structure of the data.
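To make the idea of an on-line, Hartigan-style reassignment over binary representatives more concrete, here is a minimal sketch in Python. It is not the SparseMix cost function from the paper. As a stand-in assumption it uses the Hamming distance of each point to a coordinate-wise majority-vote representative. The names `cluster_cost` and `hartigan_binary` are hypothetical. Points are stored as sets of their nonzero coordinates, so the work per point depends on the number of ones rather than on the full dimension.

```python
from collections import Counter


def cluster_cost(counts, size):
    """Hamming cost of a cluster against its majority-vote representative.

    counts[j] is the number of members with a 1 at coordinate j; coordinates
    that are 0 for every member contribute nothing, so only nonzero entries
    need to be stored. (Assumed stand-in cost, not the SparseMix criterion.)
    """
    return sum(min(c, size - c) for c in counts.values())


def hartigan_binary(points, labels, k, max_sweeps=20):
    """Hartigan-style on-line reassignment for sparse binary points.

    points: list of sets holding the indices of the nonzero coordinates.
    labels: initial cluster index (0..k-1) for every point.
    """
    labels = list(labels)
    counts = [Counter() for _ in range(k)]
    sizes = [0] * k
    for x, a in zip(points, labels):
        counts[a].update(x)
        sizes[a] += 1

    for _ in range(max_sweeps):
        moved = False
        for i, x in enumerate(points):
            a = labels[i]
            if sizes[a] == 1:
                continue  # never empty a cluster
            # Change in cost of cluster a when x is taken out of it.
            before_a = cluster_cost(counts[a], sizes[a])
            counts[a].subtract(x)
            sizes[a] -= 1
            removal = cluster_cost(counts[a], sizes[a]) - before_a

            best, best_delta = a, 0  # staying put changes nothing
            for b in range(k):
                if b == a:
                    continue
                before_b = cluster_cost(counts[b], sizes[b])
                counts[b].update(x)  # tentatively add x to b
                after_b = cluster_cost(counts[b], sizes[b] + 1)
                counts[b].subtract(x)  # undo the tentative move
                delta = removal + (after_b - before_b)
                if delta < best_delta:
                    best, best_delta = b, delta

            # Commit x to the best cluster (possibly its original one).
            counts[best].update(x)
            sizes[best] += 1
            if best != a:
                labels[i] = best
                moved = True
        if not moved:
            break

    # Representative of each cluster: coordinates set in a strict majority.
    reps = [
        {j for j, c in counts[cid].items() if 2 * c > sizes[cid]}
        for cid in range(k)
    ]
    return labels, reps


points = [{0, 1, 2}, {0, 1}, {1, 2}, {7, 8, 9}, {8, 9}, {7, 9}]
labels, reps = hartigan_binary(points, [0, 1, 0, 1, 0, 1], k=2)
```

Each point is moved only when the move strictly lowers the total cost, so the sweeps terminate. Unlike a batch EM update, cluster statistics are refreshed immediately after every single reassignment, which is the defining feature of the Hartigan scheme.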
cris.lastimport.wos | 2024-04-09T18:19:16Z | |
dc.affiliation | Wydział Matematyki i Informatyki : Instytut Informatyki i Matematyki Komputerowej | pl |
dc.contributor.author | Śmieja, Marek - 135996 | pl |
dc.contributor.author | Hajto, Krzysztof - 178376 | pl |
dc.contributor.author | Tabor, Jacek - 132362 | pl |
dc.date.accessioned | 2020-01-28T09:16:55Z | |
dc.date.available | 2020-01-28T09:16:55Z | |
dc.date.issued | 2019 | pl |
dc.date.openaccess | 0 | |
dc.description.accesstime | at the time of publication | |
dc.description.physical | 1583-1624 | pl |
dc.description.version | final publisher's version | |
dc.description.volume | 33 | pl |
dc.identifier.doi | 10.1007/s10618-019-00635-1 | pl |
dc.identifier.eissn | 1573-756X | pl |
dc.identifier.issn | 1384-5810 | pl |
dc.identifier.project | 2016/21/D/ST6/00980 | pl |
dc.identifier.project | 2017/25/B/ST6/01271 | pl |
dc.identifier.project | ROD UJ / OP | pl |
dc.identifier.uri | https://ruj.uj.edu.pl/xmlui/handle/item/147662 | |
dc.language | eng | pl |
dc.language.container | eng | pl |
dc.rights | I grant a licence. Attribution 4.0 International | * |
dc.rights.licence | CC-BY | |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/legalcode.pl | * |
dc.share.type | other | |
dc.subtype | Article | pl |
dc.title | Efficient mixture model for clustering of sparse high dimensional binary data | pl |
dc.title.journal | Data Mining and Knowledge Discovery | pl |
dc.type | JournalArticle | pl |
dspace.entity.type | Publication |
Open Access
License
Except as otherwise noted, this item is licensed under the Attribution 4.0 International licence