This work demonstrates how data of any format, location and type can be consolidated and federated using Apache Hadoop ecosystem tools for data sharing, integration, cleaning and preparation, as a stepping stone toward artificial intelligence modelling and other uses. We demonstrate this through practical trial and error, in the course of which we discovered a replicable workflow and a data analysis methodology.
It is widely understood that organizations need accurate, real-time and comprehensive information to compete in a growing economy. From the experience of our team members, and based on relevant, emerging international best practices, lack of access to data blinds us to trends and behaviors that develop and emerge over time. Data is both the fuel and the lifeline of any artificial intelligence project, so data needs to be stored, consolidated and federated with a focus on reliability, flexibility and scalability.
This post highlights four core problems we identified that need to be tackled to achieve our goal.
We approached everything with a "shoot, then measure and correct" workflow — in other words, a trial-and-error approach to finding possible solutions to the four problems that need to be solved before any artificial intelligence project can be put to practical, everyday use. Given the experimental nature of the work, we relied on open source technologies for their flexibility and availability.
We are aware that these technologies are in constant development, and some have not been adopted by the mainstream. In addition, the 15 open source tools we used overlap considerably in functionality. At every opportunity we tried different tools across trials to find out the differences and similarities between them. For example, we found Apache HBase easy for storing data but hard to use for analysis, compared to Apache Hive, which was slower at storage but faster for analysis. However, we did not run benchmarks to quantify these impressions, so they remain inconclusive.
For data consolidation, we concentrated on a way to collect any and all kinds of data into a data lake without changing the status quo. We used Apache Hadoop, along with plug-and-play tools that make up the data consolidation ecosystem. We tested the setup by first manually importing static files (scanned paper, electronic documents, spreadsheets, logs and relational databases). Apache NiFi data flows were created for each of these using sample data, and once each node was working as intended, the flow was automated so data consolidation could run without human or external intervention.
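The routing step of such a flow — deciding where each incoming file lands in the lake — can be sketched in plain Python. The landing zones and extension mapping below are illustrative assumptions for this sketch, not our actual NiFi configuration:

```python
import os

# Illustrative mapping from file extension to a data-lake landing zone.
# These zones and extensions are assumptions, not our real flow config.
LANDING_ZONES = {
    ".csv": "/datalake/raw/spreadsheets",
    ".xlsx": "/datalake/raw/spreadsheets",
    ".pdf": "/datalake/raw/documents",
    ".log": "/datalake/raw/logs",
    ".sql": "/datalake/raw/databases",
}

def route_file(filename: str) -> str:
    """Return a landing path for a file, mimicking what NiFi's
    RouteOnAttribute + PutHDFS processors did for us."""
    ext = os.path.splitext(filename)[1].lower()
    zone = LANDING_ZONES.get(ext, "/datalake/raw/unclassified")
    return f"{zone}/{os.path.basename(filename)}"

print(route_file("census_2020.csv"))  # lands in the spreadsheets zone
print(route_file("scan_001.tiff"))    # unknown type -> unclassified zone
```

In the real flow, NiFi handles retries, back pressure and provenance on top of this routing decision, which is why we automated it there rather than in scripts.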
For the initial setup, we used 50 years of openly available data provided by the World Bank, which was stored directly in Apache Hadoop. To analyse that data, we split, combined and cleaned it using Apache Zeppelin and Apache Spark, and saved the cleaned data to Apache Hive. This data was then used by Apache Superset to create dashboards and charts, all while maintaining the data flow.
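World Bank indicator files ship in wide format (one column per year), so the split-and-clean step we ran in Spark essentially reshapes them to long rows and drops empty cells before saving to Hive. A minimal stdlib-only sketch of that reshaping (the sample data below is made up; column names follow the typical World Bank layout):

```python
import csv
import io

def wide_to_long(csv_text):
    """Reshape a wide World Bank-style CSV (one column per year) into
    long (country, indicator, year, value) rows, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for rec in reader:
        for key, val in rec.items():
            if key.isdigit() and val not in ("", None):  # year columns only
                rows.append({
                    "country": rec["Country Name"],
                    "indicator": rec["Indicator Name"],
                    "year": int(key),
                    "value": float(val),
                })
    return rows

# Made-up sample in the wide layout; note the blank 1970 GDP cell is dropped.
sample = """Country Name,Indicator Name,1970,1971
Kenya,Population,11252466,11670830
Kenya,GDP,,1.6e9
"""
long_rows = wide_to_long(sample)
```

In Spark the same idea is a melt/unpivot over the year columns; the long format is what makes the Hive tables convenient for Superset charts.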
Using least squares approximation, we were able to compute the coefficients of nonlinear equations in the form of X-Y curves to interpolate and extrapolate data patterns in individual datasets (e.g. estimating the population growth of a country). We were also able to visualize the percentage error of the curve fit at each point as bar charts. Beyond that, however, its usefulness was limited, and it became clear that we would need far more source data for any kind of AI-based classification.
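For a population-growth curve, one standard least-squares route is to fit the exponential model P(t) = a·e^(bt) by ordinary least squares on log P, then measure the percentage error at each point. A self-contained sketch under that assumption (the population series is hypothetical, not real World Bank figures):

```python
import math

def fit_exponential(years, values):
    """Fit P(t) = a * exp(b * t) by least squares on log(P)."""
    ys = [math.log(v) for v in values]
    n = len(years)
    mx = sum(years) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(years, ys))
         / sum((x - mx) ** 2 for x in years))
    a = math.exp(my - b * mx)
    return a, b

def pct_errors(years, values, a, b):
    """Percentage error of the fitted curve at each data point
    (this is what we plotted as bar charts)."""
    return [abs(a * math.exp(b * t) - v) / v * 100
            for t, v in zip(years, values)]

# Hypothetical population series: years since baseline, headcount.
years = [0, 10, 20, 30]
pop = [10.0e6, 13.5e6, 18.2e6, 24.6e6]
a, b = fit_exponential(years, pop)
errors = pct_errors(years, pop, a, b)
predicted_t40 = a * math.exp(b * 40)  # extrapolation beyond the data
```

Interpolation and extrapolation both reduce to evaluating a·e^(bt) at the year of interest; the percentage-error series is the honesty check on how far that can be trusted.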
Hence, we decided to implement an Android-based sensor data gathering system that integrates with our existing Hadoop-based data lake, to build an Internet of Things (IoT) Big Data project. Out of seven proposed ideas, this one was selected based on ease of implementation, data acquisition methods, constraints, data sources and frequency of data. It involves analysing the data coming from multiple sensors on Android-based mobile device(s), and using static and/or real-time data to find insights about the users of those devices and their environment.
We used Hortonworks Connected Data Architecture (CDA), Hortonworks Data Flow (HDF), Hortonworks Data Platform (HDP) and Apache NiFi to build this project. A proof of concept was built that used an Android device's data to replicate the required user and environment data, with HDF and HDP analysing the sensor data. Data was streamed continuously from an Android device to HDF via the Site-to-Site protocol, using a mobile app to collect the sensor readings. We then connected the NiFi service running on HDF to HBase running on HDP. From within HDP, we learned to visually monitor the sensor data in HBase using Zeppelin's Phoenix interpreter.
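Before streaming, the mobile app packages each reading as a small record; the JSON schema below is an assumption for illustration, not the app's actual wire format, and the Site-to-Site transport itself is handled by NiFi client libraries rather than hand-rolled code:

```python
import json
import time

def sensor_record(device_id, sensor, values):
    """Build one JSON sensor reading of the kind streamed to NiFi.
    All field names here are illustrative assumptions."""
    return json.dumps({
        "device_id": device_id,
        "sensor": sensor,                      # e.g. "accelerometer"
        "timestamp_ms": int(time.time() * 1000),
        "values": values,                      # axis readings [x, y, z]
    })

record = sensor_record("phone-01", "accelerometer", [0.02, -0.11, 9.79])
parsed = json.loads(record)
```

Keeping each reading self-describing (device, sensor type, timestamp) is what lets a single NiFi flow fan records out to the right HBase tables downstream.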
We were able to identify patterns in the sensor data that distinguish when someone is walking from when they are standing still.
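One simple way such a pattern shows up is in the variance of the accelerometer magnitude: walking produces large periodic swings, while standing still leaves the reading close to constant gravity. A hedged sketch of that idea (the threshold is an illustrative assumption, not a calibrated value from our data):

```python
import math

def classify_activity(samples, threshold=0.5):
    """Label a window of accelerometer [x, y, z] samples as 'walking'
    or 'standing' by magnitude variance. Threshold is illustrative."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    mean = sum(mags) / len(mags)
    var = sum((m - mean) ** 2 for m in mags) / len(mags)
    return "walking" if var > threshold else "standing"

# Synthetic windows: near-constant gravity vs a step-like oscillation.
standing = [(0.0, 0.0, 9.81)] * 20
walking = [(0.0, 0.0, 9.81 + 3.0 * math.sin(i)) for i in range(20)]
```

A production classifier would use windowed features and a trained model, but even this variance heuristic separates the two synthetic windows above.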
Through many rounds of trial and error in setting up our data lake and storing different kinds of data, we discovered a pattern in our workflow that we were able to replicate. We adopted it as the UtotoAI Data Lake Workflow:
In data analysis, data understanding, business understanding and research, we found the following pattern that we kept replicating:
This work demonstrates that data can be consolidated and federated using an Apache Hadoop-based data lake, irrespective of data format, location and storage. Our observations suggest that, using known working processes, this approach can be scaled out to consolidate data safely and reliably for multiple use cases.
Finding out what kind of data is available, when that is unknown, would require a different approach. However, our work has shown that we can combine data from multiple formats and sources into a structured form for pattern recognition, while keeping the source data intact for future work and maintaining it all in a real-time data streaming environment.
This work has shown that once a data lake is populated, its data — static and real-time alike — can be used and shared safely and reliably when needed.
The only known way to discover what we do not know is to consolidate data safely and look for insights in as many ways as possible. Our work has shown that our data lake facilitates this.
As shown in our work, consolidated and federated data can also be integrated, for accuracy and other needs, by exploring the possibilities of data-as-a-service platforms.
This is our initial attempt at finding a solution for data consolidation, in addition to discovering a replicable workflow and a data analysis model. Still, this work has many shortcomings, and it is important to mention other factors — such as multi-tier scalability, security, data repair and data protection — that should be part of a total solution. Solutions for all of these exist today in the open source ecosystem.
In future posts we will share more details of these and other data lake workflows and data analysis methodologies, and compare them with ours using both qualitative and quantitative measurements. In addition, we also hope to explore and share more on data visualization and analysis methods.