Introduction to IoT Analytics – IoT Analytics vs. BigData Analytics. In the previous chapter of the IoT tutorial, we explained the relationship between IoT data and BigData, given that IoT data exhibit the Vs of BigData. We also illustrated the activities comprising IoT data processing applications, such as data selection, validation, semantic unification and more.
However, the power of IoT data processing and its business value lies in the analysis of IoT data, i.e. in IoT data analytics. Indeed, according to a recent study by McKinsey & Co, the largest portion of IoT’s business value in the coming years will come from the development and deployment of advanced data analytics techniques that leverage the vast amounts of IoT data produced nowadays. In particular, IoT analytics deployments are expected to extract value from large IoT datasets that remain unexploited and underutilized, as state-of-the-art IoT applications tend to use approximately 1% of the IoT data available.
In principle, IoT data analytics bears several similarities to BigData analytics. The chief differences lie in the nature of IoT data when compared to typical BigData problems, which deal with large volumes of conventional transactional data. As already outlined, IoT data are typically associated with high velocity and have a streaming nature. Therefore, while a great deal of BigData analytics problems are served by traditional DBMS infrastructures that store data in finite and persistent data sets, IoT analytics problems are typically better served by streaming databases that process data as multiple, continuous, rapid and time-varying data streams. Hence, conventional BigData infrastructures (such as Hadoop/HDFS and MapReduce systems) are not commonly used for IoT analytics, due to their high latency and inability to deal with high-velocity data.
On the other hand, data streaming infrastructures (such as the open source Apache Spark and Apache Flink systems) are more appropriate for IoT analytics, as they can deal with transient streams, continuous queries and sequential access to the data, while at the same time providing the means for handling unpredictable data arrivals in memory-bounded environments. IoT streams are characterized by:
(a) A schema, which defines the types of data they carry;
(b) Distribution, since they might stem from numerous distributed data sources;
(c) A flow rate, i.e. their rate of arrival at the IoT analytics applications;
(d) Ordering, on the basis of attributes such as their timing;
(e) Synchronization constraints between data streams;
(f) The stability of their flow and/or distribution.
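Some of these characteristics can be made concrete with a small sketch (the field names and figures are hypothetical, not tied to any particular streaming engine): each stream record carries a typed schema (a), records originate from distributed sources (b), and the middleware merges them into a single stream ordered by timestamp (d).

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Reading:
    # Ordering attribute (d): records compare by timestamp only.
    timestamp: float
    # Schema (a): the typed fields the stream carries.
    sensor_id: str = field(compare=False)
    value: float = field(compare=False)

# Two distributed sources (b), each locally time-ordered.
gateway_a = [Reading(1.0, "a1", 21.5), Reading(3.0, "a1", 21.7)]
gateway_b = [Reading(2.0, "b1", 19.9), Reading(4.0, "b1", 20.1)]

# Merge into a single, globally time-ordered stream.
merged = list(heapq.merge(gateway_a, gateway_b))
print([r.timestamp for r in merged])  # [1.0, 2.0, 3.0, 4.0]
```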
IoT streaming systems enable the definition of queries over streaming data, including specifications of sliding windows and blocking operators over the streams (such as aggregation and ordering operators). IoT streaming infrastructures also comprise middleware that performs two important functions:
- Approximation of IoT streams, which is required due to the velocity of the streams. Indeed, as streams arrive too fast, it is challenging to compute the exact answer to a query, as this would require virtually unbounded storage and computational resources. Hence, IoT streaming engines approximate the streams in ways that achieve an optimal balance between system performance and the accuracy of the approximation.
- Adaptation of queries and resources, which is very important for long-running queries. Indeed, the latter are characterized by fluctuating stream arrival times and data characteristics, as well as by evolving query loads. Hence, IoT streaming engines tend to adapt the allocation of resources (e.g. memory, computational resources) to a given application, as well as the query execution plans, in the scope of highly distributed environments, with a view to optimizing the system's performance. The latter can be crucial for real-time or near-real-time IoT analytics applications.
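Two of the concepts above can be sketched in a few lines of plain Python (a simplified, in-memory analogue of what a streaming engine does, with toy data): a sliding window supports blocking aggregation operators over the most recent items, and reservoir sampling bounds memory usage by maintaining a fixed-size uniform approximation of an unbounded stream.

```python
import random
from collections import deque

def sliding_window_avg(stream, window_size):
    """Blocking aggregation (average) over a sliding window of the stream."""
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

def reservoir_sample(stream, k, seed=0):
    """Approximate an unbounded stream by a fixed-size uniform sample."""
    rng = random.Random(seed)
    reservoir = []
    for i, value in enumerate(stream):
        if i < k:
            reservoir.append(value)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = value
    return reservoir

readings = [10, 12, 11, 13, 15, 14]
print(list(sliding_window_avg(readings, 3)))     # first value is 10.0
print(len(reservoir_sample(range(10_000), 5)))   # 5
```

Real streaming engines implement far more elaborate variants of both ideas (e.g. time-based windows with watermarks, and sketch data structures for approximation), but the memory-bounding intuition is the same.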
Even though not commonly used in conjunction with transactional datasets, streaming engines are an integral part of the BigData landscape, as they handle several of the Vs, with particular emphasis on the velocity of the data streams, which is a common characteristic of most IoT analytics applications. Note also that streaming engines operate in conjunction with BigData infrastructure for handling large datasets, such as cluster infrastructures and distributed file systems.
Typical IoT Analytics Pipeline – IoT Analytics Skills
Similarly to BigData analytics, IoT analytics involves the specification, implementation and deployment of techniques for mining IoT data. These techniques (briefly discussed in the following paragraphs) are no different from the classical data analysis techniques, which are based on a range of closely related disciplines, including:
- Statistics and statistical analysis, with particular emphasis on the application of theoretical schemes over datasets in order to test and/or validate particular hypotheses.
- Machine learning, which is usually focused on the development of applications that learn how to behave based on IoT datasets. Machine learning techniques for IoT analytics emphasize improved performance (e.g., in terms of processing speed and accuracy of results), especially in cases of (streaming) IoT data with high ingestion rates.
- Data mining and knowledge extraction, which include activities associated with data cleansing, integration and visualization, while usually combining/integrating machine learning techniques.
In the scope of real-life IoT analytics systems and practical deployments, the application of these techniques is blended into IoT analytics pipelines involving data selection, semantic unification and preparation. The deployment of machine learning and knowledge extraction techniques as part of such pipelines is typically a challenging task, as data scientists are striving to identify the most appropriate and effective IoT data analysis techniques for the problem at hand.
The identification and validation of such techniques is performed in ways similar to classical data mining processes, such as the blueprint methodology specified as part of the “Cross Industry Standard Process for Data Mining” (CRISP-DM) knowledge discovery model. As outlined in CRISP-DM, the process of identifying and validating a proper data mining technique for IoT analytics involves the following (inter-related) stages:
- Business Understanding of the IoT data analytics problem: Prior to identifying candidate data mining schemes for an IoT analytics problem, it is important to understand its business aspects. This entails a sound understanding of the IoT data sources involved, the semantics of the datasets (i.e. the type of information they comprise), requirements associated with the speed or effectiveness of the data processing, as well as the business rationale of the required processing (e.g., the need for predicting a trend or classifying items into different categories). For example, the business understanding of an energy data analysis problem entails an understanding of the rationale behind the data processing (e.g., improving energy usage or reducing cost), the stakeholders involved (e.g., energy provider, transmission system operator, citizen) and the desired output of the analysis.
- Understanding of the IoT data sets: As part of IoT analytics, IoT data scientists should understand the structure of the IoT data sets, through observing, inspecting or even applying test processing over a sample data set. The inspection process involves understanding possible dependencies between attributes, recurring values, the rare occurrence of other values, etc. It is therefore a process highly dependent on the experience of the IoT data scientist/analyst. The data understanding phase is very important for the IoT data scientist, since it guides the initial selection of candidate machine learning techniques for analyzing the IoT data, notably techniques that are likely to lead to the desired business result. Based on the identification of candidate techniques and the exclusion of others, experienced data analysts can save precious time and effort, which they can accordingly allocate to optimizations of the IoT analysis process towards high-quality results that meet business goals. The data understanding phase has a very close bilateral interaction with the business understanding activities, given that it is dependent on the business problem at hand. Note that in several cases the nature of the data could give rise to changes in the business understanding of the problem, as well as to its reformulation in line with the analytics capabilities that are possible over the actual data sets.
- IoT data preparation: Following the inspection and understanding of the IoT datasets, the IoT analytics pipeline includes the phase of data preparation. Data preparation is highly relevant to the selected machine learning schemes, as each scheme has specific requirements about the structure (e.g., tabular, vectors) and the format (e.g., CSV, JSON, XLS) of the input datasets. Hence, at this phase several data transformations are likely to take place. The IoT data analyst should also prescribe and put in place mechanisms for automating these transformations during the deployment phase. Note that the required data transformations depend also on the IoT analytics infrastructure and tools used, notably the data mining and data analytics tools. This is reasonable given that different tools accept inputs in different formats. It should also be underlined that the special nature of IoT (BigData) data (i.e. the various Vs of IoT data) is likely to introduce additional complexity in the data preparation process, when compared to conventional data analytics. For example, the process of unifying the semantics of the data can be particularly challenging in large-scale IoT analytics applications that combine data from diverse sources with similar semantics.
- Modelling for IoT Analytics: Data modelling is the phase where a model for the IoT analysis task at hand is formulated. The formulation of the model involves the selection of a proper machine learning mechanism (e.g., Bayesian classification or decision tree), as well as the estimation of the parameters of this model. The estimation of these parameters may involve use of training data, which are also collected from the IoT dataset. A wide range of models are available and known to data analysts, ranging from simple data mining and machine learning techniques to more sophisticated and specialized ones. In the following paragraphs we refer to some popular models, which can in several cases yield decent performance. Nevertheless, in several cases more sophisticated models are explored, especially when there is a need to meet stringent business and optimization requirements. As already outlined, there is a close and continuous interaction between the data modelling phase and the data preparation phase, as each model has its peculiar input data requirements.
- Evaluation of IoT Analytics schemes: Once a model is formulated, there is a need to evaluate it in terms of its performance, speed, effectiveness, efficiency and its general ability to respond to the problem at hand. This is the subject of the evaluation phase, which is performed on the basis of specific metrics for each of the targets/requirements outlined above. These metrics are calculated based on the performance of the model over a sample test data set, which is different from the data set used for training. The evaluation of the model ultimately decides whether the selected model/scheme is appropriate for the business task at hand. In case of poor performance, the IoT analyst can either search for an alternative model (possibly improving the evaluated one) or even revisit the business requirements by going back to the business understanding phase. The latter might be deemed necessary in case the results of the model are far from useful for the business task at hand (e.g., prediction of energy consumption for the next day is infeasible based on the data and models available). In case the performance of the model is decent and in line with business expectations, the IoT analytics team (including the IoT analyst and IoT engineers) can proceed with its deployment.
- IoT data analytics module deployment: The deployment of the model requires its implementation and integration within the wider IoT analytics infrastructure. This typically involves the use of a data-oriented programming language for the implementation of the model. Several IoT streaming engines come with built-in libraries and APIs that enable the implementation of machine learning schemes. There is, however, also the option of implementing machine learning schemes in conventional programming languages such as R, Java and Python. Apart from the implementation of the machine learning model, the deployment phase entails integration with other parts of the IoT application pipeline, including data collection, the distributed stream storage and processing engine, the interfacing with third-party systems and the visualization of the outcome of the IoT analytics system.
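The preparation, modelling and evaluation stages above can be sketched end-to-end with a deliberately tiny example (the JSON field names, the toy energy figures and the nearest-centroid classifier are all illustrative assumptions, not a prescribed method): raw JSON records are transformed into (feature, label) pairs, a simple model is trained on one portion of the data, and its accuracy is measured on a held-out test set.

```python
import json

# Preparation: transform raw JSON sensor records (hypothetical fields)
# into the (feature, label) pairs the model expects.
raw = [
    '{"avg_kwh": 1.2, "label": "low"}',
    '{"avg_kwh": 1.5, "label": "low"}',
    '{"avg_kwh": 6.8, "label": "high"}',
    '{"avg_kwh": 7.1, "label": "high"}',
    '{"avg_kwh": 1.1, "label": "low"}',
    '{"avg_kwh": 7.4, "label": "high"}',
]
data = [(rec["avg_kwh"], rec["label"]) for rec in map(json.loads, raw)]

# Split into training data and a separate test set, as CRISP-DM prescribes.
train, test = data[:4], data[4:]

# Modelling: a nearest-centroid classifier, one centroid per class.
centroids = {}
for label in {lbl for _, lbl in train}:
    values = [x for x, lbl in train if lbl == label]
    centroids[label] = sum(values) / len(values)

def classify(x):
    """Assign an observation to the class with the nearest centroid."""
    return min(centroids, key=lambda lbl: abs(x - centroids[lbl]))

# Evaluation: accuracy over the held-out test set only.
accuracy = sum(classify(x) == lbl for x, lbl in test) / len(test)
print(accuracy)  # 1.0
```

In a real deployment the preparation step would be automated in the pipeline, and the model and metrics would be far richer; the point here is only the separation of the stages and of the training and test data.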
Based on the above descriptions it is evident that a variety of roles and skills are required towards driving a non-trivial IoT analytics application from inception to practical real-life deployment. These roles have already been outlined and include:
- Business/Domain experts, i.e. experts in understanding the business problem and its data-driven solution.
- Data Scientists, i.e. experts in statistics, machine learning and knowledge discovery, who have experience in matching specific business requirements and problems to models that can serve as a basis for their solution.
- IoT (data) engineers, i.e. experts in programming and integrating the various models within an IoT infrastructure. These engineers are also in charge of configuring the various deployment parameters associated with the graceful operation of the IoT analytics system.
Overall, an IoT analytics project is typically multi-disciplinary and requires a strong team with expertise in all of the above areas.
Machine Learning Tasks for IoT Analytics
At the heart of IoT analytics applications are machine learning schemes that can effectively deal with the business problem at hand. Nowadays, machine learning is increasingly used for a variety of IoT data processing tasks. However, the foundations of the machine learning techniques used in IoT have been around for years, along with the tasks and problems that machine learning is used to solve. A detailed discussion of such problems and related machine learning solutions can be found in classical data mining and knowledge extraction textbooks. A brief presentation of common tasks follows:
- Classification: This task concerns problems where an observation (as encoded in one or more data streams) has to be assigned to a class or category. In classification problems the classes are known based on a training dataset that contains labeled observations (i.e. observations already assigned to some class). The goal of the classification task is to devise and train a model that can automatically classify unlabeled observations. Classification models are characterized as binary (in case they classify an observation into one of two classes) or multi-class (when classification into one of many classes is performed). Image processing based on one or multiple cameras is typically associated with several classification problems, such as the classification of the various parts of the human body in order to identify faces, hands and more. Classification is performed based on various techniques that rely on supervised learning, including logistic regression, Support Vector Machines (SVM), Naïve Bayes, Decision Trees and Neural Networks.
- Regression: This task targets the prediction of a numerical label for an unlabeled observation. A training dataset comprising known labels is used to train a model that will be able to automatically predict labels for new, unlabeled observations. Regression has many uses, such as the prediction/forecasting of possible traffic congestion in an urban mobility application, or of energy surges in an energy management application.
- Clustering: This task focuses on splitting a dataset into a number of segments (clusters). Each segment contains elements that bear similarity to each other. The number of segments/clusters varies according to the target application; e.g., an airline might wish to cluster its customers into gold, green and platinum tiers. An IoT analytics oriented example involves the clustering of areas/regions or homes of a city according to their energy usage, or the clustering of segments of the road network according to their daily or weekly traffic.
- Anomaly Detection: This task deals with the discovery of outliers in a given data set, i.e. anomalies within a series of observations.
- Recommendations: The task of a recommendation system is to provide data-driven recommendations based on users’ past behaviour and activities.
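Taking the clustering task as an example, the energy-usage scenario above can be sketched with a basic one-dimensional k-means (Lloyd's algorithm); the usage figures and the initial centroids are toy assumptions for illustration only.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's k-means on one-dimensional data with given initial centroids."""
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Daily energy usage (kWh) of ten homes (toy figures).
usage = [2.1, 2.4, 2.2, 8.9, 9.3, 9.0, 15.8, 16.1, 2.3, 15.9]
centroids, clusters = kmeans_1d(usage, centroids=[1.0, 8.0, 20.0])
print(sorted(len(c) for c in clusters))  # [3, 3, 4]
```

Unlike classification, no labels are involved: the three groups (low, medium and high consumers) emerge from the data itself, which is what makes clustering an unsupervised task.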
In case you do not have an understanding of machine learning, we recommend the resources that follow, which provide a very good introduction to data mining and data science.
Resources for Further Reading
- Manyika, J., Chui, M., Bisson, P., Woetzel, J., Dobbs, R., Bughin, J. and Aharon, D. Unlocking the Potential of the Internet of Things. McKinsey Global Institute, June 2015.
- Wirth, Rüdiger, and Jochen Hipp. CRISP-DM: Towards a standard process model for data mining. Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, 2000.
- A simple on-line data science course, which illustrates several of the phases of the data analytics pipeline in a practical way, is available from Microsoft at: https://blogs.msdn.microsoft.com/education/2016/06/03/learning-data-science-through-a-free-online-course/
- A very good introductory book on Practical Data Mining (which accompanies the Weka open source software) is available at: http://www.cs.waikato.ac.nz/ml/weka/book.html.