The proposed approach exploits the identified clusters, built on labeled data, to Imple-mented in the Apache Spark framework, which is able to handle large-scale, Specifically, it is a novel density-based clustering algorithm, Therefore, our research focused on the following questions: can we performĭensity-based clustering on large-scale and high dimensional data, without incurring inĬompu-tational bottlenecks? Can we profitably exploit these clusters for predictive purposes? ToĪnswer to these questions, we propose DENCAST, which simultaneously solves all the Inherent difficulty in upgrading existing non-distributed density-based clusteringĪlgo-rithms towards their equivalent (or, at least, approximated) distributed counterpart.įinally, most of the existing methods are strictly tailored for pure clustering and do notĮxploit clusters to support predictive tasks, as in predictive clustering trees. Indicate if changes were Roberto Corizzo, Gianvito Provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, This article is distributed under the terms of the Creative Commons Attribution 4.0 International License Keywords: Distributed clustering, Multi-target regression, Apache Spark The extracted clusters is confirmed by the predictive capabilities of DENCAST on severalĭatasets: It is able to significantly outperform (p-value < 0.05 ) state-of-the-artĭistrib-uted regression methods, in both single and multi-target settings. Experiments show thatĭEN-CAST performs clustering more efficiently than a state-of-the-art distributed clusteringĪlgorithm, especially when the number of objects increases significantly. Performed on a single machine) and is able to handle large-scale, high-dimensionalĭata by taking advantage of locality sensitive hashing. Contrary toĮxisting distributed methods, DENCAST does not require a final merging step (usually Regres-sion tasks (and thus, solves complex tasks such as time series prediction). This context, many distributed data mining algorithms have recently been proposed.įollowing this line of research, we propose the DENCAST system, a novel distributedĪlgorithm implemented in Apache Spark, which performs density-based clusteringĪnd exploits the identified clusters to solve both single- and multi-target Increase in data generated that need to be processed and analyzed efficiently. Recent developments in sensor networks and mobile computing led to a huge Instances and attributes increase considerably. To data organized in a specific structure (e.g., they can analyze only low-dimensionalįeature spaces), or they suffer from overhead and scalability issues when the number of Unfortunately, existing distributed methods forĭensity-based clustering suffer from several limitations. Starting from the seminal work of DBSCAN, many algorithms have been proposed,īut only a few of them are distributed. Out the be useful in many application domains (e.g., spatial data analysis). The extracted clusters (arbitrarily-shaped, noise-free, robustness to outliers) which turn Has received much attention in the last decades, because of many desirable properties of Only a few of them tackle the specific problem of density-based clustering. Several machines for classical clustering, classification and regression tasks. Years, several researchers proposed novel approaches to distribute the workload among Sensor measurements) has increased the need for novel data mining algorithms, whichĪre capable of building accurate models efficiently and in a distributed fashion. The generation of massive amounts of data in different forms (such as activity logs and Roberto Corizzo1,2*†, Gianvito Pio1,2†, Michelangelo Ceci1,2† and Donato Malerba1,2 DENCAST: distributed density‑based clustering for multi‑target regression
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |