Syoncloud Technical Documentation
Syoncloud covers all stages of Big Data and Data Science processing. Processing stages are connected via workflows. Algorithms, workflows, definitions of tables and fields, and scripts are stored in the Syoncloud Registry/Repository. It is possible to apply multiple machine learning algorithms and models to the same data and assess the reliability of their predictions. The processing stages cover:
- Connecting to Input Sources
- Masking of Sensitive Data
- Verification of Data Quality
- Duplicate Records Elimination
- Unused Fields Removal
- Fields Conversions to Common Formats
- Joining Datasets from Different Sources on Common Keys
- Creation of Hive and HBase Tables
- Data Import, Compression, Partitioning and Clustering of Datasets
- Statistical Analysis of Tables and Joins, Clustering, Regression Analysis, Anomaly Detection
- Supervised Machine Learning
- Validation of Models with Different Algorithms
- Application of Machine Learning Models to Full Datasets
- Creation of Output Datasets
- Visualizations and Realtime Analysis
- Integration Connectors with External Applications
Connection to Input Sources
Input data sources include JDBC access to backup and production databases, log files produced by applications, comma-separated and fixed-length text files, and XML and JSON files. Supported interfaces include web services, JMS, MQ, SOAP and REST, among others. Syoncloud enables regular updates of datasets from external sources.
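As an illustration of bringing differently shaped text inputs into one record structure, here is a minimal Python sketch. It is not Syoncloud code: the function names and the `(field, start, end)` layout format are assumptions made for this example, and it covers only comma-separated, fixed-length and line-delimited JSON text.

```python
import csv
import io
import json


def read_csv(text):
    """Parse comma-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def read_fixed_width(text, layout):
    """Parse fixed-length lines; layout is a list of (field, start, end) offsets."""
    return [
        {name: line[start:end].strip() for name, start, end in layout}
        for line in text.splitlines()
        if line.strip()
    ]


def read_json_lines(text):
    """Parse one JSON object per line into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Each reader returns a list of dictionaries, so later stages can treat all sources uniformly.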
Sensitive Data Masking
Sensitive data such as credit card numbers, social security numbers and names of clients can be masked. Real credit and debit card numbers, account numbers and customer IDs are replaced by randomly generated values. The randomly generated value for a given entity must be consistent across all datasets to enable joins and integration. The masking process stores pairs of matching real and randomly generated numbers in tables. These tables are kept in a separate, secured relational database that is continuously updated. This database is also used to match randomly generated numbers back to real numbers after Big Data analyses are performed. This isolates data scientists and administrators from sensitive information, which is accessible only to authorized employees and applications.
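The consistency requirement can be sketched as a lookup table of (real value, random surrogate) pairs. The sketch below is illustrative only: the `TokenVault` class, the `token_map` table and the use of SQLite are assumptions standing in for the secured relational database, and the surrogate is simply a random digit string of the same length as the input.

```python
import secrets
import sqlite3


class TokenVault:
    """Hypothetical mapping store: one stable random surrogate per real value."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS token_map ("
            "real_value TEXT PRIMARY KEY, token TEXT UNIQUE NOT NULL)"
        )

    def mask(self, real_value):
        # Reuse the stored surrogate so the same entity masks identically everywhere.
        row = self.conn.execute(
            "SELECT token FROM token_map WHERE real_value = ?", (real_value,)
        ).fetchone()
        if row:
            return row[0]
        while True:
            # Random digits of the same length, so field widths stay unchanged.
            token = "".join(secrets.choice("0123456789") for _ in real_value)
            try:
                self.conn.execute(
                    "INSERT INTO token_map (real_value, token) VALUES (?, ?)",
                    (real_value, token),
                )
                self.conn.commit()
                return token
            except sqlite3.IntegrityError:
                continue  # surrogate already used for another value; draw again

    def unmask(self, token):
        # Reverse lookup, intended only for authorized callers.
        row = self.conn.execute(
            "SELECT real_value FROM token_map WHERE token = ?", (token,)
        ).fetchone()
        return row[0] if row else None
```

Because every dataset is masked through the same table, a join on the masked card number gives the same matches as a join on the real one.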
Data Quality Verification
Input data is checked for duplicate records, missing mandatory fields and corrupted records. Quality statistics of imported datasets are calculated. If quality is below a defined standard, an alert is triggered and the data is moved to a folder for manual investigation. If the quality of the input data is within range, the next processing steps are triggered.
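A minimal sketch of such a quality gate follows. The statistic used (share of records that are neither duplicate nor incomplete), the 0.95 threshold and the function names are assumptions chosen for illustration; the actual metrics and limits are not specified here.

```python
def quality_report(records, mandatory_fields, key_field):
    """Count duplicate and incomplete records and compute the clean share."""
    seen_keys = set()
    duplicates = incomplete = clean = 0
    for rec in records:
        is_incomplete = any(rec.get(f) in (None, "") for f in mandatory_fields)
        key = rec.get(key_field)
        is_duplicate = key is not None and key in seen_keys
        if key is not None:
            seen_keys.add(key)
        duplicates += is_duplicate
        incomplete += is_incomplete
        clean += not (is_duplicate or is_incomplete)
    total = len(records)
    return {
        "total": total,
        "duplicates": duplicates,
        "incomplete": incomplete,
        "clean_ratio": clean / total if total else 0.0,
    }


def route(report, min_clean_ratio=0.95):
    """Decide whether the dataset continues or goes to manual investigation."""
    return "continue" if report["clean_ratio"] >= min_clean_ratio else "quarantine"
```

In a workflow, a `"quarantine"` result would correspond to raising the alert and moving the files aside, while `"continue"` triggers the next stage.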
Duplicate and Corrupted Records Elimination
Duplicate and corrupted records can appear at any stage of the ETL process. They can result from repeated imports of the same records or from imports of the same data from multiple sources. We set rules that eliminate these records at this stage.
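One possible shape for such a rule is shown below. It is a sketch under assumptions: records are dictionaries, a record's identity is the tuple of its key fields, the first occurrence wins, and `is_valid` is a caller-supplied check for corruption.

```python
def drop_duplicates_and_corrupted(records, key_fields, is_valid):
    """Keep the first valid record for each key; skip corrupted and repeated ones."""
    seen = set()
    kept = []
    for rec in records:
        if not is_valid(rec):
            continue  # corrupted record
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            continue  # same record imported again or from another source
        seen.add(key)
        kept.append(rec)
    return kept
```

For example, passing `key_fields=["account", "txn_id"]` treats two rows with the same account and transaction ID as one record, regardless of which source delivered them.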
Unused Fields Removal
Imported files or datasets often include fields that will not be used. These fields are removed.
Fields Conversions to Common Formats
Timestamps, dates and currency fields often have different formats in different databases and tables. We choose common formats for all these fields at this stage. We also perform unit conversions and transformations of multiple input fields into a single output field. Transformed records are stored in files in the processed directory, and failed records in the failed directory for deeper investigation.
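The conversion step with its processed/failed split can be sketched as follows. Everything specific here is an assumption for illustration: the field names (`order_date`, `weight`, `weight_unit`), the accepted date formats, ISO 8601 as the common date format, and kilograms as the common unit. Lists stand in for the processed and failed directories.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed input date formats; the common output format is ISO 8601 (YYYY-MM-DD).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")
# Assumed unit table: factor to convert each unit to kilograms.
TO_KG = {"kg": Decimal("1"), "g": Decimal("0.001"), "lb": Decimal("0.45359237")}


def to_iso_date(text):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def convert(record):
    """Normalize the date and merge weight + weight_unit into one weight_kg field."""
    unit = record["weight_unit"].lower()
    if unit not in TO_KG:
        raise ValueError(f"unknown unit: {unit!r}")
    return {
        "id": record["id"],
        "order_date": to_iso_date(record["order_date"]),
        "weight_kg": Decimal(record["weight"]) * TO_KG[unit],
    }


def convert_all(records):
    """Split records into converted ones and failures kept for investigation."""
    processed, failed = [], []
    for rec in records:
        try:
            processed.append(convert(rec))
        except (KeyError, ValueError, AttributeError, InvalidOperation) as exc:
            failed.append({"record": rec, "error": str(exc)})
    return processed, failed
```

Keeping the original record next to the error message in `failed` makes the later manual investigation easier.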
Joining Datasets from Different Sources on Common Key Fields
This stage joins input data from multiple data sources into partially denormalized data stores. Joins, updates and inserts are performed on pre-selected keys.
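A small sketch of joining by key with insert-or-update semantics is given below. It assumes an in-memory dictionary as the denormalized store and prefixes each field with a source name to avoid collisions; both choices, and the name `upsert_join`, are illustrative rather than a description of the actual storage.

```python
def upsert_join(store, records, key_field, prefix):
    """Merge records into store by key: insert new rows, update existing ones."""
    for rec in records:
        key = rec[key_field]
        # Insert a new denormalized row if this key has not been seen yet.
        row = store.setdefault(key, {key_field: key})
        for field, value in rec.items():
            if field != key_field:
                row[f"{prefix}_{field}"] = value  # update or add the source's fields
    return store
```

Calling it once per source, for example with `prefix="crm"` and then `prefix="billing"`, produces one wide row per key that holds the fields of both sources.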
Creation of Hive and HBase Tables
We use metadata from the Syoncloud Registry/Repository to create Hive and HBase tables. The choice of table type depends on the number of fields in each record, on the variability of fields among records and on update requirements. HBase tables are NoSQL, column-oriented tables and are preferred for datasets with many hundreds or thousands of fields per record. Hive tables are preferred for datasets that resemble relational tables.
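The selection criteria could be expressed as a simple rule, sketched below. The threshold of 500 fields and the treatment of sparse fields and frequent row updates as reasons to prefer HBase are assumptions made for this example; the text above only states which factors are considered.

```python
def choose_table_type(field_count, sparse_fields, needs_row_updates,
                      wide_threshold=500):
    """Hypothetical rule of thumb for picking a table type from dataset metadata."""
    if field_count >= wide_threshold:
        return "hbase"  # very wide records: many hundreds or thousands of fields
    if sparse_fields or needs_row_updates:
        return "hbase"  # assumed: varying field sets or frequent row-level updates
    return "hive"  # relational-style dataset


# A 40-column, append-only dataset with a fixed schema
print(choose_table_type(40, sparse_fields=False, needs_row_updates=False))
```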
Data Import, Compression, Partitioning and Clustering of Datasets
To achieve the required performance of analyses and queries, and to reduce the size of stored data, several optimization steps are performed. Many fields that contain text data or repeated values are highly compressible. Binary formats for certain fields can reduce storage requirements and improve performance. Data is transformed into compressed, efficient output formats. We also select the fields used to partition each table, which reduces the amount of data processed during selects and improves performance. We can improve the performance of many queries and machine learning algorithms by dividing tables into buckets, or clusters, on specific fields. All optimizations and parameters are stored in the Syoncloud Registry/Repository. Optimization procedures are performed automatically during data imports. Newly arrived data is inserted into existing tables, and records with the same key are updated.
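To make partitioning, bucketing and compression concrete, the sketch below builds a Hive `CREATE TABLE` statement from a small metadata description. The function name and its parameters are hypothetical, and the choice of ORC with ZLIB compression is an assumption; the text does not name a specific storage format.

```python
def hive_ddl(table, columns, partition_by, cluster_by, buckets, compression="ZLIB"):
    """Build a Hive CREATE TABLE statement with partitions, buckets and ORC storage.

    columns and partition_by are lists of (name, hive_type) pairs;
    cluster_by is a list of column names used for bucketing.
    """
    partition_names = {name for name, _ in partition_by}
    # Hive declares partition columns separately from the regular columns.
    cols = ",\n  ".join(
        f"{name} {typ}" for name, typ in columns if name not in partition_names
    )
    parts = ", ".join(f"{name} {typ}" for name, typ in partition_by)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"CLUSTERED BY ({', '.join(cluster_by)}) INTO {buckets} BUCKETS\n"
        f"STORED AS ORC\n"
        f"TBLPROPERTIES ('orc.compress'='{compression}')"
    )


print(hive_ddl(
    "events",
    [("user_id", "BIGINT"), ("payload", "STRING")],
    partition_by=[("event_date", "STRING")],
    cluster_by=["user_id"],
    buckets=32,
))
```

With such a table, a query that filters on `event_date` reads only the matching partitions, and bucketing on `user_id` groups rows with the same key into the same files.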
Statistical Analysis of Tables and Joins, Clustering, Regression Analysis, Anomaly Detection
Supervised Machine Learning
Validation of Models with Different Algorithms
Application of Machine Learning Models to Full Datasets
Creation of Output Datasets
Visualizations and Realtime Analysis
Integration Connectors with External Applications