Data Consolidation & Federation for AI Using Open Source

This work demonstrates how data of any format, location and type can be consolidated and federated using Apache Hadoop ecosystem tools for data sharing, integration, cleaning and preparation, as a stepping stone for artificial intelligence modelling and other uses. We arrived at this through trial and error in practical applications, discovering a replicable workflow and a data analysis methodology along the way.


It is widely understood that organizations need accurate, real-time and comprehensive information to compete in a growing economy. In the experience of our team members, and based on relevant, emerging international best practices, a lack of access to data blinds us to trends and behaviors that develop and emerge over time. Data is both the fuel and the lifeline of any artificial intelligence project, so it needs to be stored, consolidated and federated with a focus on reliability, flexibility and scalability.

This post highlights four core problems that we identified and that need to be tackled to achieve our goal.

  1. Data Gap: the lack of knowledge of available data, where they reside and/or in what format.
  2. Data Sharing: delays in sharing and managing data dispersed across different endpoints hinder the streamlining of business processes and smart decision making.
  3. The Great Unknown: artificial intelligence solves wicked problems, so we don’t know what it is that we don’t know until an artificial intelligence project finds a solution to that unknown problem.
  4. Data Integration: lack of data standards and policies for data sharing and data architecture.


We approached everything with a “shoot and then measure to correct” workflow, in other words a trial-and-error approach, to find possible solutions to the four problems, which need to be solved before any artificial intelligence project can be put to practical, everyday production use. Given the experimental nature of the data, we were compelled to use open source technologies because of their flexibility and availability.

We are aware that these technologies are in constant development, and that some have not been adopted by the mainstream. In addition, 15 of the open source tools we used overlapped considerably in functionality. At every opportunity, we tried a different tool in each trial to find out the differences and similarities between them. For example, we found that Apache HBase made it easy to store data but hard to analyse it, compared to Apache Hive, which was relatively slower for storing data but faster for analysis. However, we did not run any benchmarks to quantify these findings, so they remain inconclusive.

For data consolidation, we concentrated on finding a way to collect any and all kinds of data into a data lake without changing the status quo. We used Apache Hadoop, along with the plug-and-play tools that make up the data consolidation ecosystem. We tested the setup by first manually importing static sources (scanned paper, electronic documents, spreadsheets, logs and relational databases). We created an Apache NiFi data flow for each of these using sample data, and once each node was working as intended, we automated the flow so that data was consolidated without any human or external interaction.

For the initial setup, we used 50 years of openly available data provided by the World Bank, which was stored directly in Apache Hadoop. To analyse that data, we split, combined and cleaned it using Apache Zeppelin and Apache Spark, and saved the cleaned data to Apache Hive. Apache Superset then used this data to create dashboards and charts, all while the data flow was maintained.

Using least squares approximation, we were able to compute the coefficients of nonlinear equations in the form of X-Y curves to interpolate and extrapolate data patterns in individual datasets (e.g. to estimate the population growth of a country). We were also able to visualize the percentage error of the curve fit at every point as bar charts. Beyond that, however, the data was of little use, and it became clear that we would need far more source data for any kind of AI-based classification.
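To make the least squares step more concrete, here is a minimal sketch of one way such a curve fit and its percentage errors could be computed. It is not the code we ran on the World Bank data: the exponential model y = a * exp(b * x), the function names and the sample figures are all assumptions chosen for illustration, and the fit is done on ln(y), so it minimises error in log space rather than in the original units.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by ordinary least squares on ln(y)."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    # Closed-form slope and intercept of the straight line ln(y) = ln(a) + b * x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted curve at x (interpolation or extrapolation)."""
    return a * math.exp(b * x)


def percentage_errors(xs, ys, a, b):
    """Signed percentage error of the fitted curve at every observed point."""
    return [100.0 * (predict(a, b, x) - y) / y for x, y in zip(xs, ys)]


# Hypothetical, made-up series: years since the first observation and a
# population in millions. These are NOT World Bank figures.
years = [0, 10, 20, 30, 40, 50]
population = [1.00, 1.22, 1.49, 1.82, 2.23, 2.72]

a, b = fit_exponential(years, population)
errors = percentage_errors(years, population, a, b)
estimate_year_60 = predict(a, b, 60)  # extrapolate one decade ahead
```

The `errors` list is the kind of series that can be plotted as a bar chart, one bar per data point, to see where the curve fits poorly.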

Hence, we decided to implement an Android-based sensor data gathering system and integrate it with our existing Hadoop-based data lake, to build an Internet of Things (IoT) big data project. Out of seven proposed ideas, this one was selected based on ease of implementation, data acquisition methods, constraints, data sources and frequency of data. It involves analysing the data coming from multiple sensors on one or more Android-based mobile devices, and using static and/or real-time data to gain insights about the users of those devices and their environment.

We used Hortonworks Connected Data Architecture (CDA), Hortonworks Data Flow (HDF), Hortonworks Data Platform (HDP) and Apache NiFi to build this project. We built a proof of concept that used an Android device’s data to replicate the required user and environment data, and used HDF and HDP to analyse the sensor data. A mobile app collected the sensor data, which was streamed continuously from the Android device to HDF via the Site-to-Site protocol. We then connected the NiFi service running on HDF to HBase running on HDP. From within HDP, we learned to visually monitor the sensor data in HBase using Zeppelin’s Phoenix interpreter.


We were able to identify patterns in the sensors data while someone was walking and standing still.
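As a rough illustration of the kind of pattern involved, the sketch below labels a short window of accelerometer readings as "walking" or "standing" from how much the acceleration magnitude varies. This is an assumed, simplified heuristic and not the analysis we performed in Zeppelin: the function names, the threshold of 0.5 m/s² and the sample readings are all hypothetical.

```python
import math
import statistics


def magnitude(reading):
    """Length of one (x, y, z) accelerometer reading in m/s^2."""
    x, y, z = reading
    return math.sqrt(x * x + y * y + z * z)


def label_window(readings, threshold=0.5):
    """Label a window of readings by the spread of its magnitudes.

    A still device reports roughly constant gravity, so the spread is small;
    walking makes the magnitude swing. The threshold is an assumed value.
    """
    spread = statistics.pstdev(magnitude(r) for r in readings)
    return "walking" if spread > threshold else "standing"


# Hypothetical readings, invented for illustration only.
standing_window = [(0.0, 0.1, 9.8), (0.1, 0.0, 9.8), (0.0, 0.0, 9.81), (0.05, 0.05, 9.79)]
walking_window = [(1.2, 0.4, 11.5), (-0.8, 0.3, 8.1), (1.5, -0.2, 12.0), (-1.0, 0.1, 7.6)]

standing_label = label_window(standing_window)
walking_label = label_window(walking_window)
```

A real deployment would need a threshold tuned on recorded data, and probably more than one sensor.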

Through a lot of trial and error in setting up our data lake and storing different kinds of data, we discovered a pattern in our workflow that we were able to replicate. We adopted it as the UtotoAI Data Lake Workflow:

  1. Since we always consider “The Great Unknown”, every data consolidation project starts with listing whatever data is available, accessible and obtainable. No data is useless.
  2. In order to create the data flow, we first take a sample dataset from the whole.
  3. Using the sample, we create the data flow.
  4. This data is then stored, read-only, in the data lake.
  5. Next, we analyse the data set (for limitations, missing data, simple statistical regressions, etc.).
  6. If the data is found useless, we go back to step 1; if not, we start understanding the data (its importance, its possible uses, etc.).
  7. For that, we need to understand the business value of the data, whether commercial and/or knowledge-related.
  8. This pushes us to do research.
  9. Split or combine the data as necessary in a different data flow.
  10. Clean and prepare the data for training.
  11. Train the model.
  12. Evaluate or test the model. 
  13. Develop new practical products or services based on the trained model.
  14. Deploy the new products and/or services.
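Steps 2 and 5 of the workflow (taking a sample, then checking the data for limitations such as missing values) can be sketched as follows. This is only an illustration under stated assumptions: in our setup the data flow itself was built in Apache NiFi, whereas here the inline CSV text, the column names and the helper functions are all hypothetical.

```python
import csv
import io
import random

# Hypothetical raw dataset standing in for a file that is already in the data lake.
RAW_CSV = """country,year,population
A,1970,1.0
A,1980,
B,1970,2.5
B,1980,3.1
C,1970,
C,1980,0.9
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def take_sample(rows, size, seed=0):
    """Step 2: draw a reproducible random sample to design the data flow with."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


def missing_ratio(rows):
    """Step 5: fraction of empty values per column, a first check for limitations."""
    columns = list(rows[0].keys())
    return {
        column: sum(1 for row in rows if not (row[column] or "").strip()) / len(rows)
        for column in columns
    }


rows = load_rows(RAW_CSV)
sample = take_sample(rows, size=4)
report = missing_ratio(rows)
```

A column with a high missing ratio is a candidate for the "is this data useful?" decision in step 6.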

In data analysis, data understanding, business understanding and research, we found the following pattern that we kept on replicating:

  1. All analysis work commences by questioning why the data exists, its impact, its purpose, etc.
  2. The current state of the dataset always ends up being either chaotic or in order.
  3. If it is chaotic, we need more research to understand it better, with knowledge. But if it is in order, it is seen as usable, so we set further research aside, with clarity.
  4. With research or more knowledge, we always end up with clarity. But if we proceed or ignore the data without clarity, then any possible use becomes confusing, since we don’t explore any further.
  5. With clarity, we get the freedom to use the data. But when in a confused state, we end up wanting to shape and control the data and the outcome.
  6. When we are able to use the data, the data is always in order. But once we try to control it, it always ends up chaotic.


This work demonstrates that data can be consolidated and federated using an Apache Hadoop-based data lake, irrespective of data format, location and storage. Our observations suggest that, using known working processes, this work can be scaled out to consolidate data safely and reliably for multiple use cases.

Data Gap

Finding out what kind of data is available when that is unknown would require a different approach. However, our work has shown that we are able to combine data from multiple formats and sources into a structured format for pattern recognition, while keeping the source data intact for future work and maintaining it in a real-time data streaming environment.

Data Sharing

This work has shown that once a data lake is populated, its data can be used and shared safely and reliably as needed. This includes both static and real-time data.

Great Unknown

The only known way to learn what we do not know is to consolidate data safely and look for insights in as many ways as possible. Our work has shown that our data lake can facilitate this.

Data Integration

As shown in our work, consolidated and federated data can be integrated for accuracy and other needs by exploring the possibilities of data-as-a-service platforms.

Future Works

This is our initial attempt at finding a solution for data consolidation, in addition to discovering a replicable workflow and a data analysis model. Still, this work has a lot of shortcomings, and it is important to mention other factors (such as multi-tier scalability, security, data repair and data protection) that should be included as part of the total solution. Solutions for all of these exist today in the open source ecosystem.

In future posts we will share more details of these and other data lake workflows and data analysis methodologies, and compare them with ours using both qualitative and quantitative measurements. In addition, we also hope to explore and share more on data visualization and analysis methods.