Cyber-Physical Systems (CPSs) are cross-domain, multi-model, advanced information systems that play a significant role in many large-scale infrastructure sectors of smart-city public services, such as traffic control, smart transportation control, and environmental and noise monitoring. Such systems typically involve a substantial number of sensor nodes and other devices that stream and exchange data in real time, and they are usually deployed in broad, uncontrolled environments.
Thus, unexpected measurements may occur due to several internal and external factors, including noise, communication errors, and hardware failures, which may compromise these systems’ data quality and raise serious concerns related to safety, reliability, performance, and security. In all cases, these unexpected measurements need to be carefully interpreted and managed based on domain knowledge and computational models.
Therefore, in this research, data quality challenges were investigated, and a comprehensive, proof-of-concept data quality management system was developed to tackle unaddressed data quality challenges in large-scale CPSs. The system was designed to address the challenges associated with detecting sensor node measurement errors, sensor node hardware failures, and mismatches in the spatial and temporal contextual attributes of sensor nodes. The detection of measurement errors associated with the primary data quality dimensions of accuracy, timeliness, completeness, and consistency was investigated using predictive and anomaly analysis models built on statistical and machine-learning techniques. Time-series clustering techniques were investigated as a feasible means of detecting long-segmental outliers as an indicator of sensor nodes’ continuous halting and incipient hardware failures. Furthermore, the quality of the spatial and temporal contextual attributes of sensor node observations was investigated using timestamp analysis techniques.
The components of the data quality management system were tested and calibrated using benchmark time series collected from a high-quality temperature sensor network deployed at the University of East London. Furthermore, the effectiveness of the proposed system was evaluated using a real-world, large-scale environmental monitoring network consisting of more than 200 temperature sensor nodes distributed around London.
The data quality management system achieved high detection accuracy using an LSTM-based predictive analysis technique together with DBSCAN-based anomaly detection. It successfully identified timeliness and completeness errors in sensor nodes’ measurements using periodicity analysis combined with a rule engine. It achieved up to 100% accuracy in detecting potentially failed sensor nodes using the characteristic-based time-series clustering technique when applied to time-series windows of two days or longer. Timestamp analysis was adopted effectively for evaluating the quality of the temporal and spatial contextual attributes of sensor node observations, but only within CPS applications in which the use of gateway modules is possible.
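As an illustration of how periodicity analysis can be combined with simple rules to flag timeliness and completeness errors, the following minimal Python sketch estimates a node's reporting period as the median gap between consecutive timestamps and then applies two rules to each gap. It is not the system described above: the function name `check_intervals`, the thresholds `gap_factor` and `jitter`, and the layout of the findings are all assumptions made for this example.

```python
from statistics import median


def check_intervals(timestamps, gap_factor=1.5, jitter=0.2):
    """Flag completeness and timeliness issues in sorted timestamps (seconds).

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values taken from the work summarised above.
    """
    if len(timestamps) < 3:
        return None, []  # too few readings to estimate a period

    # Periodicity analysis (simplified): the median gap is a robust
    # estimate of the node's nominal reporting period.
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    period = median(gaps)

    findings = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap > gap_factor * period:
            # Rule 1 (completeness): a long gap implies missing readings.
            missing = max(1, round(gap / period) - 1)
            findings.append(("completeness", prev, curr, missing))
        elif abs(gap - period) > jitter * period:
            # Rule 2 (timeliness): the reading arrived noticeably early or late.
            findings.append(("timeliness", prev, curr, gap - period))
    return period, findings


if __name__ == "__main__":
    readings = [0, 60, 120, 300, 360, 435, 495]
    period, findings = check_intervals(readings)
    print(period, findings)
```

The median is used instead of the mean so that a few long gaps do not distort the estimated period. A production rule engine would presumably keep the rules and thresholds configurable per sensor type.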