Research: Data Stream Analysis and Clustering

The number of connected devices is steadily increasing, and these devices continuously generate data streams. Real-time processing and analysis of data streams pose many challenges [review paper]. We have developed

  • a data stream clustering method and 
  • a data stream analysis platform to explore and prototype machine learning solutions. 

We are taking part in the SmartHome consortium of 18 companies and 5 universities, led by Arcelik/BEKO, to develop and implement a digital platform for smart homes.

Online Embedding and Clustering of Data Streams (EmCStream)
Clustering is one of the most suitable methods for real-time data stream analysis because it requires little prior information about the data and no labeled instances. Data streams often exhibit concept drift: an unforeseen change in the properties of the input instances, which is a problem specific to data streams. To cluster data stream instances successfully, concept drift must be detected and adapted to. To this end, we describe a data stream clustering method that continuously embeds the input data and makes its visualization possible. Moreover, our method detects concept drift, reports it, and adapts to it.
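To make the notion of concept drift concrete, the following is a minimal sketch on a one-dimensional stream. It is a generic windowed mean-shift check, not the drift detection used by EmCStream; the function name `detect_drifts`, the window size, and the threshold factor `k` are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_drifts(stream, window=20, k=3.0):
    """Return the indices at which a simple mean-shift drift is flagged.

    Illustrative only: the first `window` instances of each concept form a
    reference sample; drift is flagged when the mean of the most recent
    `window` instances departs from the reference mean by more than
    k reference standard deviations.
    """
    reference = []                 # instances describing the current concept
    recent = deque(maxlen=window)  # sliding window over the newest instances
    drifts = []
    for i, x in enumerate(stream):
        if len(reference) < window:
            reference.append(x)    # still learning the current concept
            continue
        recent.append(x)
        if len(recent) < window:
            continue
        ref_std = pstdev(reference) or 1e-12  # avoid a zero threshold
        if abs(mean(recent) - mean(reference)) > k * ref_std:
            drifts.append(i)
            # Adapt: forget the old concept and relearn from new instances.
            reference = []
            recent.clear()
    return drifts


# A stream whose values jump from about 0.1 to about 5.1 at index 100.
stream = [0.0, 0.2] * 50 + [5.0, 5.2] * 50
drift_points = detect_drifts(stream)
```

After a drift is flagged, the sketch discards its reference sample and rebuilds it from subsequent instances, which is the simplest form of adaptation. A multivariate method would compare richer statistics than a single mean.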
Our method, online embedding and clustering of data streams (EmCStream), continuously embeds high-dimensional input data into two dimensions and clusters the embedded data using the k-means algorithm. It continuously checks for concept drift; when drift occurs, the method detects it and adapts automatically. We have evaluated EmCStream against two popular baseline stream clustering algorithms, DenStream and CluStream, on both artificial and real data streams, using the adjusted Rand index and purity as clustering quality metrics. The implementation of EmCStream is available online at https://gitlab.com/alaettinzubaroglu/emcstream, along with supplementary resources such as datasets.
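The two quality metrics named above can be computed directly from a pair of label lists. The following is a minimal sketch in plain Python using the standard definitions of purity and the adjusted Rand index; it is not the evaluation code from the EmCStream repository, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def purity(true_labels, cluster_labels):
    """Fraction of instances that fall in the majority class of their cluster."""
    per_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        per_cluster.setdefault(c, Counter())[t] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, cluster_labels):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to disagree on
    cells = Counter(zip(true_labels, cluster_labels))  # contingency table
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_true = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_pred = sum(comb(v, 2) for v in Counter(cluster_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]   # one instance assigned to the wrong cluster
p = purity(truth, found)
ari = adjusted_rand_index(truth, found)
```

For this small example, purity is 5/6 and the adjusted Rand index is about 0.324. Both metrics ignore the names of the cluster labels, so a perfect clustering with permuted labels still scores 1.0.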

Easy Streaming Data Analysis Tool (ESTRA) 
ESTRA is designed to be an easy-to-use data stream analysis platform for exploring and prototyping machine learning solutions on various datasets. It is developed as a web-based, scalable, extensible, and open-source data analysis tool with a user-friendly interface. ESTRA comes with a bundle of datasets (Electricity, KDD Cup'99, and Covertype), dataset generators (SEA and Hyperplane), and implementations of various analysis and learning algorithms (D3, Hoeffding Tree, CluStream, DenStream, kNN, k-means, and StreamKM++). Moreover, ESTRA provides an easy way to investigate various properties of the datasets and to observe the results of executed machine learning algorithms. ESTRA's straightforward and clean architecture, built on open-source tools, makes it extensible. The tools used in ESTRA, such as React, Python, and Scikit-Multiflow, are popular open-source projects with broad community support and extensions. ESTRA's capabilities for easy prototyping and exploration of machine learning solutions are demonstrated by repeating the machine learning experiments performed in various studies.
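As an illustration of what a dataset generator with concept drift produces, the following is a minimal SEA-style sketch in plain Python. It does not use ESTRA's or Scikit-Multiflow's API; the function name `sea_like_stream`, the threshold values, and the labeling convention (label 1 when the sum of the first two features is at most the threshold) are assumptions made for this example.

```python
import random


def sea_like_stream(n_per_concept, thresholds=(8.0, 9.0, 7.0, 9.5), seed=0):
    """Yield (features, label, threshold) triples from an SEA-style stream.

    Each instance has three features drawn uniformly from [0, 10]; only the
    first two affect the label. The decision threshold changes abruptly
    after every `n_per_concept` instances, which creates concept drift.
    """
    rng = random.Random(seed)
    for theta in thresholds:
        for _ in range(n_per_concept):
            x = [rng.uniform(0.0, 10.0) for _ in range(3)]
            y = 1 if x[0] + x[1] <= theta else 0
            yield x, y, theta


data = list(sea_like_stream(50))
```

The threshold is yielded alongside each instance only so the active concept can be inspected; a learner consuming the stream would see just the features and, where available, the label.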