IoT Analytics Enablement with Data Naming Standardization
Caterpillar Inc. is synonymous with heavy machinery. It is the world's leading manufacturer of construction and mining equipment, off-highway diesel and natural gas engines, industrial gas turbines and diesel-electric locomotives. Digital technology, including big data analytics, is also a part of the company's journey, adding value to its iron with robust, scalable and high-quality data.
At Caterpillar, advanced IoT analytics are enabled at scale by way of onboard computers, sensors and cameras on more than 1.2 million connected assets worldwide. The data generated on customer assets on job sites across the globe includes time-series data, machine health alerts, fuel usage, GPS and operator-specific usage. This example of "Big Data" is high volume and velocity and enables the development of new analytics capabilities that can increase safety and the customer value of assets.
However, as the amount of data grows, data quality issues can grow along with it. For example, there can be missing batches or messages, missing data channels, interruptions in cell or satellite coverage, imperfect ETL (extract, transform, load) and other factors that degrade the efficacy of IoT analytics. Imagine a modern large mining truck with more than 100 IoT sensors. The sheer volume of data those sensors collect can introduce quality problems that lead to poorly performing analytics applications.
Surprisingly, a ‘simple’ issue like irregularities in data channel naming can become a major problem for many IoT analytics applications. Data scientists and data engineers can find themselves investing more time in quality control and model adjustments than in analytics itself. Even the smallest discrepancies can reduce the scalability of deployed analytics solutions, raise the barrier to entry for building new, innovative applications on this data and decrease communication effectiveness between engineering and analytics.
For instance, a channel delivering pressure data could be named ‘pressure,’ ‘prs,’ ‘press,’ ‘oil press’ or something else. Similarly, another channel carrying temperature sensor data can be named ‘temperature,’ ‘temp,’ ‘tmp’ or other related forms. While these channels generate similar data, they are named differently, likely because they originate from different design teams, types of equipment and product families. And since these product families often rely on the same analytics algorithms, the differences can disrupt the flow of data to the analytics applications.
Imagine an analytics model designed to monitor conditions inside a particular type of diesel engine that is used in different products––trucks, tractors, excavators, etc. A typical analytics model may require 10 or more telematics channels to function. Imagine now that each channel has a different name for each type of equipment—names assigned by the engineers creating the final product. With 10 channels having multiple possible names each, the total number of combinations that need to be standardized and adjusted manually could be prohibitively high. Additionally, since analytics models depend on consistent channel names, the models might fail when they encounter unexpected channel names from a different type of equipment. To avoid these failures, data scientists must devote valuable time to debugging data issues instead of building new analytics models.
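To make this failure mode concrete, here is a minimal sketch. The channel names, values and the "model" are hypothetical, but they illustrate how an analytics function written against one product family's naming breaks when it receives the same data under another family's names:

```python
# Hypothetical telemetry from two product families reporting the same
# physical quantities under different channel names.
truck_data = {"pressure": 412.0, "temperature": 88.5}
excavator_data = {"oil press": 409.7, "tmp": 90.1}

def engine_health_score(channels):
    """Toy model: expects the fixed channel names it was built with."""
    return channels["pressure"] / channels["temperature"]

# Works for the product family the model was developed against...
score = engine_health_score(truck_data)

# ...but fails on equivalent data from a different product family,
# because the channels carry different names.
try:
    engine_health_score(excavator_data)
except KeyError as missing:
    print(f"Model failed: missing channel {missing}")
```

Multiply this by 10+ required channels and dozens of product families, and the maintenance burden of per-product name handling grows combinatorially.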
One possible approach is to create a standardized channel naming list (CNL) that serves as a translation layer between all channel names used in practice and a single agreed-upon channel name. This one trusted source would be used by data scientists, analytics models and analytics applications across the company. In essence, the CNL is a name translation tool between the engineering and analytics functions, and its purpose is to enable analytics to be deployed promptly and to run smoothly on any connected device or machine.
The main advantages of such an approach are listed below:
• It standardizes data channel naming across all products and allows for analytics to run at scale without disrupting the existing workflow and engineering processes.
• After an initial time investment, it requires minimal ongoing support and maintenance.
• It reduces data cleaning and model revision time dramatically and allows for data scientists to focus on high-value activities such as model development and refinement.
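As a sketch of how such a CNL could work in practice (the aliases and canonical names below are illustrative, not Caterpillar's actual list): a lookup table maps every channel name observed in the field to one canonical name, and incoming records are renamed before they reach any model.

```python
# Illustrative channel naming list (CNL): observed alias -> canonical name.
CNL = {
    "pressure": "engine_oil_pressure",
    "prs": "engine_oil_pressure",
    "press": "engine_oil_pressure",
    "oil press": "engine_oil_pressure",
    "temperature": "coolant_temperature",
    "temp": "coolant_temperature",
    "tmp": "coolant_temperature",
}

def standardize(record):
    """Rename channels to their canonical names. Unknown channels are
    kept as-is so they can be flagged for review and added to the CNL."""
    return {CNL.get(name, name): value for name, value in record.items()}

# Records from different product families now look identical downstream.
truck = standardize({"prs": 412.0, "temperature": 88.5})
excavator = standardize({"oil press": 409.7, "tmp": 90.1})
print(sorted(truck) == sorted(excavator))  # same canonical channel names
```

Because the translation happens in one shared layer, analytics models are written once against canonical names, and supporting a new product family reduces to adding its aliases to the list.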
For any organization with a diverse line of connected products, the above-mentioned CNL approach to data standardization could be a useful way to resolve data quality issues. A one-time investment of time and resources could pay off over the longer term. Delaying work on data quality today makes tackling it later incrementally harder, as the amount of data and the number of available data channels keep increasing over time.
Darin McCoy is a Product Development expert with a keen focus on Telematics Big Data Analytics and the Industrial Internet of Things. Darin is an Analytics Manager at CAT Digital, the digital technology arm of Caterpillar, where he manages a team of data scientists that distills business insights and value from Caterpillar's fleet of over one million connected assets. Darin has a degree in Electrical Engineering, a Master's in Business Administration, and over 14 years of experience working with heavy equipment at Caterpillar. He holds several patents from his decade-long experience designing diesel engines and from delivering cutting-edge analytics solutions for Caterpillar.
Dr. Andrei Khurshudov specializes in Big Data Analytics, the Industrial Internet of Things, cloud storage and computing, in-memory computing, and data storage technology. Andrei is a Director of IoT Analytics at CAT Digital, the digital technology arm of Caterpillar. Andrei’s organization focuses on data analysis and predictive modeling for a million connected machines and devices. Andrei has spent over ten years at Seagate Technology, serving as a Chief Technologist and managing various R&D organizations in such areas as big data analytics, cloud storage, cloud computing, quality and reliability, and others. In the recent past, Andrei served as a Chief Data Officer at Formulus Black, a New Jersey startup developing software for in-memory computing, and a CTO and Chief Data Officer at Alchemy IoT, a Boulder-area startup creating cloud-based analytics solutions for the Internet of Things. Andrei has a Ph.D. in Engineering and worked at IBM, Hitachi Global Storage, and Samsung. Andrei has numerous publications, patents, conference presentations, and a book.