SAP analytics in Hadoop Environment
We have designed and implemented a Hadoop-based SAP data analytics solution, which is in production use at British Gas.
Features of the solution:
- It enables combining SAP data with other data sources across the company.
- It reduces license costs for databases, data warehouses, ETL (Extract Transform Load), analytics and reporting tools.
- It enables faster deployment and easier modification of reports and dashboards without changes to the underlying data models; there is no need to rework cube or star schemas as in relational solutions.
- It can apply and validate many machine learning models on the same data in a reasonable time.
- It enables faster processing of complex table join queries compared to a relational database solution.
- It enables easier integration with semi and unstructured data sources.
- It enables faster refresh of data from the SAP system into Hadoop Hive tables: data is refreshed every hour instead of once a day.
- It enables running analytics on full, non-aggregated datasets.
- It utilizes Map/Reduce framework to perform ETL, SQL queries and machine learning algorithms.
- It simplifies join queries across the complex SAP relational model by partial de-normalization.
- It captures historical events from SAP, utilizing the from/to validity dates in SAP tables.
- It does not increase load on existing Oracle databases or SAP.
- It supports legacy analytics and reporting tools such as SAS and Teradata; these tools can use the same table structures as in the Oracle SAP database.
Overall Technical Design
SAP data is stored in Hadoop as HCatalog-managed Hive tables. HCatalog holds metadata about tables and fields and enables SQL-like Hive queries as well as Pig scripts for ETL. Sqoop imports data from databases, Flume ingests text files such as log files, and Pig scripts import XML data.
Apache Falcon handles data lifecycle management such as archiving, deletion, data workflows, replication and cluster failover. Oozie schedules, coordinates and executes workflows. Oracle ODI (Oracle Data Integrator) updates data in Hadoop from the SAP Oracle tables. Mahout and the R language apply machine learning algorithms to the data.
In order to automate work with tens of thousands of tables and files, I have developed a framework that generates Pig, Hive, Sqoop, Falcon, bash, Flume, Mahout and Oozie scripts and configurations from metadata. The metadata describe database tables, fields, files and their formats. I used Python because of its simplicity and fast development cycle.
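A minimal sketch of such a metadata-driven generator, producing a Hive DDL statement and a Sqoop import command from one table's metadata (the table name, fields and JDBC connection string below are illustrative, not the real schema):

```python
# Minimal sketch of a metadata-driven script generator.
# The metadata dict, table and connection names are hypothetical.

TABLE_META = {
    "name": "partner",
    "db_url": "jdbc:oracle:thin:@dbhost:1521:SAPDB",  # hypothetical connection
    "fields": [
        ("partner_id", "STRING"),
        ("name", "STRING"),
        ("valid_from", "STRING"),
        ("valid_to", "STRING"),
    ],
    "split_by": "partner_id",  # column Sqoop uses to parallelize the import
}

def hive_ddl(meta):
    """Generate a Hive CREATE TABLE statement from table metadata."""
    cols = ",\n  ".join("%s %s" % (name, typ) for name, typ in meta["fields"])
    return ("CREATE TABLE IF NOT EXISTS %s (\n  %s\n) STORED AS ORC;"
            % (meta["name"], cols))

def sqoop_import(meta):
    """Generate a Sqoop import command for the same table."""
    return ("sqoop import --connect %s --table %s "
            "--split-by %s --hive-import --hive-table %s"
            % (meta["db_url"], meta["name"].upper(),
               meta["split_by"], meta["name"]))

print(hive_ddl(TABLE_META))
print(sqoop_import(TABLE_META))
```

In the real framework the same metadata record would drive every generated artifact (Pig, Falcon, Oozie configurations), so a schema change is made once, in the metadata.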
We selected 14,000 Oracle tables used by the installed SAP modules and added 10,000 tables from Oracle, MS SQL, PostgreSQL and other databases belonging to 7 non-SAP in-house business applications. We also used XML data streams and log files produced by several external and internal systems. This led to the development of a tool capable of utilizing metadata derived from the SQL description of tables, fields, the relational model, ETL tools and certain business rules.
Metadata files contain information about:
- Database fields, such as types, descriptions, aliases and ETL rules.
- Database tables, such as location, primary keys, ETL and business rules.
- Higher-level structures, such as groups of tables describing business entities (Partner, Contract).
- Table relations, database locations, data refresh and retention periods, history capture and pre-processing ETL information, including the information needed to create partially denormalized data structures.
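As an illustration of the higher-level metadata, a business-entity record might look like the following (the entity, table and key names are hypothetical, not the real SAP model):

```python
# Hypothetical metadata for one business entity: the tables grouped under
# it, the relations joining them, and refresh/retention/history settings.
PARTNER_ENTITY = {
    "entity": "Partner",
    "tables": ["partner", "partner_address", "partner_contract"],
    "relations": [
        # (left table, right table, join key)
        ("partner", "partner_address", "partner_id"),
        ("partner", "partner_contract", "partner_id"),
    ],
    "refresh": "hourly",
    "retention_days": 730,
    "history_capture": True,  # keep all changes, not just the current state
}

def join_keys(entity, table):
    """Return the keys that join a table into its business entity."""
    return [key for left, right, key in entity["relations"]
            if table in (left, right)]
```

A record like this is enough to generate both the ingestion scripts and the denormalization queries for the whole entity.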
Use of Oracle ODI to keep data updated in Hadoop
History capture is the ability to capture all changes in data coming from SAP or other systems.
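A sketch of how history-captured rows with from/to validity dates can be queried for the state valid on a given day (the column names and sample values are illustrative):

```python
from datetime import date

# History-captured rows: each change produces a new row whose validity
# interval is bounded by from/to dates (column names are illustrative).
rows = [
    {"partner_id": "42", "tariff": "A",
     "valid_from": date(2012, 1, 1), "valid_to": date(2012, 6, 30)},
    {"partner_id": "42", "tariff": "B",
     "valid_from": date(2012, 7, 1), "valid_to": date(9999, 12, 31)},
]

def as_of(rows, key, day):
    """Return the row for `key` whose validity interval contains `day`."""
    for r in rows:
        if r["partner_id"] == key and r["valid_from"] <= day <= r["valid_to"]:
            return r
    return None
```

The same predicate, expressed in HiveQL over the imported tables, lets any report be run "as of" an arbitrary date without a separate history store.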
Partial de-normalization by business areas simplifies queries and improves processing speed.
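One way to sketch this step in the Python generator: emit a Hive CREATE TABLE AS SELECT that materializes a wide, partially denormalized table for one area (all table and column names below are hypothetical):

```python
# Sketch: generate a HiveQL statement that materializes a partially
# denormalized table for one business area. Table and column names
# are illustrative, not the real SAP schema.

def denormalize_area(area, base, base_cols, joins):
    """Build a CTAS joining the base table with its related tables.

    joins: list of (table, join_key, [columns to pull in]).
    """
    select = ["b.%s" % c for c in base_cols]
    from_clause = ["FROM %s b" % base]
    for i, (table, key, cols) in enumerate(joins):
        alias = "t%d" % i
        # Prefix pulled-in columns with the source table to avoid clashes.
        select += ["%s.%s AS %s_%s" % (alias, c, table, c) for c in cols]
        from_clause.append("LEFT JOIN %s %s ON b.%s = %s.%s"
                           % (table, alias, key, alias, key))
    return ("CREATE TABLE %s_flat AS\nSELECT %s\n%s"
            % (area, ",\n       ".join(select), "\n".join(from_clause)))

print(denormalize_area(
    "partner", "partner", ["partner_id", "name"],
    [("partner_address", "partner_id", ["city", "postcode"]),
     ("partner_contract", "partner_id", ["contract_id", "tariff"])]))
```

Reports against the resulting flat table then need no joins at all, which is where the query-speed gain over the normalized relational model comes from.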