Wednesday, July 16, 2014

Complexity of Reporting in DW Environment


The maturity of a DW implementation can be judged by the reporting functionality it offers, described below.

Every DW begins with a handful of reports that focus mainly on management requirements. Typically 90% of the reports are static in nature; only the arguments (parameters) change, and the required data is presented in formats such as tables, cross tabs, charts, or a combination of these. As time passes, the business has more requirements and lines of thought that cannot be addressed through static reports alone, so more tools / technologies come into the BI / Reporting area. For example, a business user may want to add one more column / object to a report while analyzing the data.

Types of work / reports that typically come out of a DW:

  1. Static / Canned Reports (What happened or Known Questions)
  2. Ad hoc Reports (Why it happened or Analysis)
  3. Dashboards (One view based on the role rather than digging deep – KPI representation)
  4. Data Mining (What will happen / Predictive analysis – more of statistics)
  5. Business Activity Monitoring (Defining a threshold against every KPI and taking automated actions based on events -- in some implementations this is also called Active DW)

One technology in BI (Reporting / OLAP) won't solve all of the above, so we typically end up with multiple technologies on the BI / Reporting side, unlike the ETL / Integration layer of the DW.

In other words, compared to the Integration (ETL) layer of a DW, the OLAP side tends to need a different software package for each of these requirement types.


Sunday, July 6, 2014

Information as an Asset -- How to Start

I came across a word today, "INFONOMICS", which is nothing but applying "ECONOMICS" to the DATA we generate in the enterprise.

Any company that has a DW and derives value out of that data for operational, tactical and strategic decisions is on its way to being an INFONOMICS company.

"Data Monetisation" -- Finding ways to use the existing data to fine tune the process or finding a new process which either saves the money for the company or generates the money for the company.

To prove the value of the role you are playing, link your work to revenue contribution; that also brings you closer to the business. When a technology person understands the business very well, they can use data very inventively in the business. And when we reuse data for a new process, we save the money that was spent to collect that information in the first place.

Two types of DATA:
  • ROO -- Record of Origin
  • ROR -- Record of Reference

The more ROR we have in the organization, the better we can demonstrate reuse, as well as the cost saved by processing and presenting the data on an as-needed basis.

Record of Reference Systems
1. Master Data Management -- which improves the Data Quality, Data Reuse and Operational efficiency
2. DW and Business Intelligence -- which holds the historical changes and integrated data for tactical and strategic decisions.

If we don't have any platform that can be used to analyze and improve the decision-making process (whether operational, analysis, visualization, machine learning (data mining), etc.), then we are not treating the data as an asset.

Building a repository of information that can be used across the organization is the place to start when you want to treat data as an asset.


Wednesday, June 25, 2014

ETL Vs ELT

Difference between ETL and ELT

ETL -- Extract , Transform and Load

ELT -- Extract, Load and Transform

It's just the approach we take to get the job done. So, how do I say whether my current project is doing ETL or ELT?

In most implementations we have ETL tools with an ETL server, such as Informatica or SSIS (SQL Server Integration Services), used in DW environments. In this case the transformations happen in the ETL server, outside the databases. Some of these ETL servers / services are very powerful and scalable and can understand various sources, which makes them effective in most implementations.

Some RDBMSs, for example Teradata, can load large volumes of data much faster than, say, SQL Server, because Teradata's architecture is parallel by default and it is designed to handle huge amounts of data. Here, using the RDBMS capabilities for transformation and loading works better. In such cases the ETL tool cannot load the data as efficiently, so we sometimes use stored procedures to transform and load the data into the DW.




In the integration layer, if you implement most transformations in the database, it falls under ELT, whereas if you do most transformations in an ETL tool, we can call it ETL.

ETL Tools in the market: Informatica, Data Stage, SSIS
ELT Tools in the market: ODI (Oracle Data Integrator)

80% of transformations in the database and 20% elsewhere --- we can conclude we are doing ELT
80% of transformations in the ETL tool and 20% elsewhere --- we can conclude we are doing ETL

In the ETL world, the ETL server does most of the transformation work. In the ELT world, we rely more on database server (staging database) scripts and procedures to do the transformations.
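To make the distinction concrete, here is a minimal sketch of the ELT pattern, assuming the raw file has already been bulk-loaded into a staging table and the transformation then runs inside the database; the schema, table and column names are illustrative assumptions, not from any specific project.

```sql
-- ELT pattern sketch: the raw data is already in a staging table; the
-- cleansing / standardization transformation runs inside the database.
INSERT INTO dw.customer_clean (customer_id, customer_name, country_code, signup_date)
SELECT
    TRIM(c.customer_id),
    INITCAP(TRIM(c.customer_name)),
    UPPER(NVL(c.country_code, 'XX')),            -- standardize and default the country
    TO_DATE(c.signup_date_txt, 'YYYY-MM-DD')     -- convert the text date loaded as-is
FROM stg.customer_raw c
WHERE c.customer_id IS NOT NULL;                 -- basic cleansing rule applied in the database
```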

 

Thursday, May 8, 2014

Default Records in Dimension tables

At times, we face data quality issues while we are populating the fact tables in the data warehouse.

We refresh the data in the DW at various intervals. A common scenario is populating dimension tables once a day while populating fact tables every hour, so that the organization can analyze what's going on at an hourly level if required.

In the above scenario, there are instances where a dimension record is missing but a fact related to that dimension record arrives for processing. (We get a new customer at 10.00 AM; this customer will flow into the DW in tomorrow's dimension-loading ETL, whereas the transaction belonging to that customer reaches the DW by 11.30 AM.) In this case, the ETL lookup used to fetch the corresponding dimension key yields no record. What do we do? If we reject the record because of the FK violation, we end up with a discrepancy between OLTP and DW. Instead, we accept the record into the fact table but associate it with a default dimension record from the customer dimension, so we don't lose the values needed for reconciliation.
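A minimal sketch of this lookup-with-default approach, assuming a pre-seeded "unknown" row in the customer dimension (customer_key = -1); all table and column names here are illustrative.

```sql
-- One-time seed: a default "unknown customer" row in the dimension.
INSERT INTO dim_customer (customer_key, customer_id, customer_name)
VALUES (-1, 'UNKNOWN', 'Unknown / late-arriving customer');

-- Hourly fact load: if the dimension lookup finds no match, fall back to the
-- default key instead of rejecting the transaction.
INSERT INTO fact_sales (customer_key, sale_date, amount, source_customer_id)
SELECT COALESCE(d.customer_key, -1),      -- default key when the lookup fails
       s.sale_date,
       s.amount,
       s.customer_id                      -- kept so the record can be healed later
FROM   stg_sales s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id;
```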

As part of the house cleaning process, or the self-healing process for bad data, we identify the fact records that were associated with default dimension records, find the right dimensional DW key, and update it in the fact table. This can be a process that runs once a week to make sure all facts end up with the right dimensional association.
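A hedged sketch of that weekly healing pass, using the same illustrative names as above:

```sql
-- Weekly self-healing pass: re-resolve facts that were loaded against the default
-- dimension row once the real customer record has arrived in the dimension.
UPDATE fact_sales f
SET    f.customer_key = (SELECT d.customer_key
                         FROM   dim_customer d
                         WHERE  d.customer_id = f.source_customer_id)
WHERE  f.customer_key = -1
AND    EXISTS (SELECT 1
               FROM   dim_customer d
               WHERE  d.customer_id = f.source_customer_id);
```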

 

Tuesday, April 22, 2014

Advantages of Data Visualization tools

In today's result-oriented environment, most companies want output in a short duration, at lower cost and with a quick turnaround time.

Business Discovery is the term used for these visualization tools, as the data is viewed in an interactive manner. Displaying the data in a summarized way and slicing and dicing it within a page, or across the entire document, enables the user to analyze the data better.

Visualization tools typically work on an in-memory database, i.e. when the data gets loaded into the visualization tool's server, it is compressed as much as possible so that the server can hold more data, which lets the end user interact with the data through the tool.

QlikView, Tableau, Spotfire, etc. are some of the visualization tools occupying the top spots in this space.

Traditional Data Warehousing
Creating an enterprise Data Warehouse or Data Mart is the traditional way we build DW systems. Typically this is sponsored by senior managers who understand the current issues (typically that data is not integrated across systems, or that reporting is difficult for the MIS department). Once the DW is implemented, companies can view data at the enterprise level and analyze it using OLAP technologies such as SAP BusinessObjects, Cognos and MicroStrategy. OLAP enables companies to answer "why it happened" (ad hoc) kinds of questions, which is one of the biggest advantages of creating a DW. It lets DW users cut across any subject area, provided we have metadata for it in the OLAP tools.

Disadvantages of a traditional Data Warehousing implementation
The time it takes to implement a traditional DW is one disadvantage. Deciding on the ETL technology / methodology, building data models, creating metadata models for OLAP, creating reports, understanding the data of different OLTP systems, and creating a common reference model that fits all the operational systems takes its own time to get right. When time and resources are the constraint, companies instead ask for predefined reports and interactive dashboards that cost less and take less time to implement.

To get around the disadvantages of a traditional DW implementation, companies now often prefer to implement visualization tools, so that the data is displayed in the format management requires without building a DW, putting more emphasis on the outcome (result). This is where visualization tools score better for implementation within a department or an enterprise: they provide a way to analyze data with a very high degree of interactivity in the reports / dashboards.

With visualization tools, we extract data from various sources and populate the documents / reports directly, rather than storing the data in a centralized repository. This makes development faster. We refresh the data periodically based on changes in the source data or on business requirements. The effort required to build dashboards and interactivity using these tools is much lower.

Monday, April 21, 2014

Top 10 things to remember when you are an ETL developer

Every resource who programs data processing in an integration environment should follow these top 10 things to become more productive.


1. Understand the overall objective of the task.
  • Are you loading data from source to staging, staging to DW, or DW to some other system?

2. Understanding the source and source type (database, XML, flat file etc)
  • Where does the source exist? Is it in the file system or in an RDBMS?
3. Visualize / Collect the sample data from source
  • Visualize the sample data; if there is no sample data, create test data equivalent to realistic data by taking the help of a subject matter expert if available.
4. Frequency of the source data generation and acceptance
  • What is the frequency at which the source generates data, and how do you identify the type of extract we should apply on the source data?
 
5. Understand the target database / files
  • What does the target file or database structure look like? What is the PK of the target tables, or how do we uniquely identify the records in the file?
6. Visualize the data based on the sample source data
  • Based on the sample input data, apply the business rules so that the transformed data can be visualized in the target structure or files.
     
7. Create the mapping document (containing the source system, source file / table, target system, target file / table, and the business rules you apply on the columns).
  • Use an Excel sheet or the standard mapping document template of the company you are working for. Make sure it has the source columns you identified and the target file / database columns, and document the business rules as completely as possible. By looking at the mapping document, you should know the work to be done as an ETL developer with the tool / program.
8. Identify the gaps between the source and required data in target
  • After completing the mapping document, you should be able to figure out how many columns are yet to be identified. Using this mapping document, we can sit with anyone in the team to get them clarified, so that one person can take responsibility for closing each gap.
9. Develop the ETL job
  • Use all the artifacts you have collected so far and start creating the ETL job based on the mapping document.
10. Unit test the ETL
  • Create the unit test document and cover all the possible combinations of business rules. Identify the different ways of testing -- row count testing or value testing -- according to the transformations we apply.
 
By applying the above points, we should be able to develop good ETL programs.

Sunday, March 16, 2014

Knowledge Discovery Platform

There are many names in the industry for Data Warehousing and Business Intelligence applications.

Started with
  • EIS (Executive Information System)
  • DSS (Decision Support System)
  • DW (Data Warehouse)
  • BI (Business Intelligence)
  • BIDW etc
So, what is the name I can give for the same DW.....

Being a technology resource, I always find it difficult to explain to technology people why we should learn this technology. On the other hand, it's easy to explain it to business people, as they look at the benefit they get when they use it. When we show them a case study they never thought about, or how a current problem in the company can be solved using the technology, they understand the system and the ROI.

Here you go....... "Knowledge Discovery Platform"

So what is a DW? It's a platform that gives the business knowledge (about the business) that they don't have today. Maybe they have the questions, but the IT landscape has no solution that can provide the information within a reasonable time.

Why do we call it a platform? Because the system utilizes various technologies: ETL, a storage layer, reporting / visualization, data mining, etc.

The bottom line is to understand why we are building it, what the ROI for the business is, and what (business) value IT provides. Everything else keeps changing.


 

Wednesday, March 12, 2014

Unit testing of BI Reports in General

The developer should unit test the BI report. The following details have to be covered for every report against the requirement.

1. Data level testing (Extract SQL from the report, test the data in the report against the database)

2. Prompt testing ( Check the list of values, type of prompt etc)

3. Functionality testing (drill down, alerts, conditional formatting, visibility of certain columns etc)

4. Format testing (fonts, standards etc)

5. Export testing (default format, other options of exporting)

6. Security testing (access restricted to certain types of users, or a public report, etc.)

7. Scheduling test (if the report has to be scheduled)

If you cover the above set of basic things in every report, you are good to go.

Tuesday, March 11, 2014

Steps in creating BO universe

Remember the following steps while creating the universe in SAP Business Objects.

1. Create the connection -- usually it has to be of the secured type to export it to the repository.
2. Import the tables / views into the universe (identify the set of tables you need)
3. Identify the joins and establish the same.
4. See if there is a loop issue, if so see if you can avoid the same by creating an alias or context
5. Check if the chasm trap issue may occur, if so create the context
6. See if there is any fan trap issue, if so create either alias or derived table and create the context
7. Create the classes
8. Visualize the objects in each class; based on the hierarchy, create the objects within each class
9. Make sure you have the right object type (Dimension or Detail or Measure object)
10. Check the integrity constraint
11. If there are no errors, save the universe file.
12. Export the universe to repository.
 

Monday, March 10, 2014

Top 10 things to remember in ETL world

As an ETL developer, it's important to see the various touch points in the ETL world. You may be working on any one of the touch points in the ETL pipeline. In a DW system, most of the time is spent analyzing the source data and creating ETL jobs to populate your data warehouse. Let's discuss the 10 things you should know to become a very effective ETL resource for an organization.

1. Sources
What is your source? Understand the nature of the source. Is the source data in an RDBMS, and if so, what credentials do you have? How do you get the metadata information about the source, sample data from the source, etc.? If it's a flat file, XML or web service, understand the structure of the data you are getting.

2. Extraction cycle
What is the frequency at which the data gets updated in the source? Are you extracting the data once a day / week / month / quarter? Are you going to extract the whole data set every time you access the source, or are you extracting the data in an incremental (delta) way?
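As a small illustration of the incremental (delta) approach, here is a sketch that assumes the source table carries a last-updated timestamp and that a hypothetical etl_control table keeps the high-water mark of the previous extract.

```sql
-- Incremental (delta) extraction using a high-water mark kept in an ETL control
-- table; table and column names are illustrative assumptions.
SELECT o.*
FROM   src.orders o
WHERE  o.last_updated > (SELECT c.last_extract_ts
                         FROM   etl_control c
                         WHERE  c.source_name = 'ORDERS');

-- After a successful load, advance the high-water mark.
UPDATE etl_control
SET    last_extract_ts = SYSTIMESTAMP
WHERE  source_name = 'ORDERS';
```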

3. Staging
The staging database is used as a temporary area where we can store data for processing. Typically, when we have more than one operational system and the data is stored in different RDBMSs (say Oracle, SQL Server, DB2, etc.), we have to consolidate the required data in one RDBMS. The database used for this consolidation of data is called the staging database.

4. Transformations
Based on the target table / file, you have to convert the data, which arrives in a different form in some cases. For example, in the DW you may have to store the data at a monthly level for performance reasons, while in the source the data is typically at a transactional level; in this scenario you have to apply aggregate transformations. Typical transformations are router or splitter, aggregate, filter, sorter, de-duplication, etc.
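As an illustration, here is how two of those transformations (de-duplication and aggregation) could look if implemented in the database; the staging and DW table names are assumptions for the sketch.

```sql
-- De-duplicate: keep only the latest row per order_id from the raw staging table.
INSERT INTO stg.orders_dedup (order_id, customer_id, order_date, amount)
SELECT order_id, customer_id, order_date, amount
FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY o.order_id
                              ORDER BY o.last_updated DESC) AS rn
    FROM stg.orders_raw o
)
WHERE rn = 1;

-- Aggregate: roll transactional rows up to the monthly grain stored in the DW.
INSERT INTO dw.orders_monthly (month_key, customer_id, order_count, total_amount)
SELECT TO_CHAR(order_date, 'YYYYMM'), customer_id, COUNT(*), SUM(amount)
FROM   stg.orders_dedup
GROUP  BY TO_CHAR(order_date, 'YYYYMM'), customer_id;
```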

5. Auditing
The ETL system should have the data needed to answer questions from various stakeholders. For that reason, we have to keep track of how many records we got from a specific system, what data we got, how long it took to get it, and so on. Any time we change the data, the system should keep a record for audit purposes. An ETL system without auditing information is not a mature implementation.
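A minimal sketch of what such an audit record could look like; the table structure and the values are illustrative assumptions, not a standard layout.

```sql
-- Simple per-run audit table maintained by every ETL job.
CREATE TABLE etl_audit (
    audit_id      NUMBER         PRIMARY KEY,
    job_name      VARCHAR2(100),
    source_name   VARCHAR2(100),
    run_start     TIMESTAMP,
    run_end       TIMESTAMP,
    rows_read     NUMBER,
    rows_inserted NUMBER,
    rows_rejected NUMBER
);

-- Example row written at the end of a run.
INSERT INTO etl_audit
(audit_id, job_name, source_name, run_start, run_end, rows_read, rows_inserted, rows_rejected)
VALUES
(1, 'LOAD_FACT_SALES', 'ORDERS', TIMESTAMP '2014-03-10 02:00:00',
 TIMESTAMP '2014-03-10 02:12:30', 150000, 149200, 800);
```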

6. Reconciliation
There should be an automated way of identifying how many records we read and how many records got rejected -- because of data type mismatches, the source definition and the data not matching, business validation failures, etc. There are two types of reconciliation we typically do: row count validation and value-based validation. If the granularity is the same between the source system and the target system, we go for row count matching; if the granularity is different between source and target, we do value matching. For example, do a sum(rev) at the source level and a sum(rev) at the target level -- the values should match even though the data is summarized differently at the target level. Send an automatic email to the production support team when the expectation is not met, so they can take care of it.
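A hedged sketch of both reconciliation styles; the table names are illustrative and the queries assume an Oracle-style database (hence dual).

```sql
-- Row-count reconciliation (same grain on both sides): the difference should be
-- zero, or should match the audited reject count.
SELECT (SELECT COUNT(*) FROM src.orders)     AS source_rows,
       (SELECT COUNT(*) FROM dw.fact_orders) AS target_rows
FROM dual;

-- Value reconciliation (different grain): total revenue should match even though
-- the target is stored at a summarized level.
SELECT (SELECT SUM(rev) FROM src.order_lines)    AS source_revenue,
       (SELECT SUM(rev) FROM dw.revenue_monthly) AS target_revenue
FROM dual;
```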

7. Reject Handling
To show the value the DW platform adds to the organization over time, we can showcase how we are improving our understanding of the organization's data and how we improved its quality over the period. What happens when we reject data during a load, and how do we handle it? We have the following types of rejected data:
  • Self Heal Data -- Today's transaction data arrives, but we don't have the master data in the dimension table yet. So when we load the transaction data into the fact, it fails because the corresponding dimension record does not exist. We reject the data and try loading it again the next day; meanwhile the dimension data arrives, so yesterday's rejected data loads successfully today. This is called self-healing data (see the reload sketch after this list).
  • Programmatically correct the data (Automated) -- Once we know the reason for failure, we can fix it based on a business rule and data validation process. For example, in some addresses we have the zip code but not the city name; we can use external data to correct and populate the record. If the ETL program rejects the data, create another program to cleanse the rejected data and load it again. In the ETL world, we call that a reject load as well.
  • Correct the data manually -- If the rejected records cannot be corrected in an automated way, we develop an interface to manually review the errors and fix the data issues. Typically this happens the most while we are still learning the data across source systems, until the system matures.
  • Reject (incomplete source data) -- In some cases we cannot fix the data at all, so it gets flagged as a reject and is discarded over a period of time.
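A small sketch of the self-heal reload mentioned in the first bullet, assuming the failed fact rows were parked in a hypothetical etl_reject_sales table; all names are illustrative.

```sql
-- Re-attempt the parked rows once the dimension record has arrived.
INSERT INTO fact_sales (customer_key, sale_date, amount)
SELECT d.customer_key, r.sale_date, r.amount
FROM   etl_reject_sales r
JOIN   dim_customer d ON d.customer_id = r.customer_id;

-- Remove the rows that were successfully reloaded.
DELETE FROM etl_reject_sales r
WHERE  EXISTS (SELECT 1 FROM dim_customer d WHERE d.customer_id = r.customer_id);
```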
8. House Cleaning
We end up storing error records and log records when we process data with ETL programs. If the log files occupy too much space, or the error records can no longer be used, we can write programs to delete those records or files -- we no longer need them once the reconciliation process is finished and the reasons for the rejects have been audited. At frequent intervals we remove the unwanted data / files from the DW platform using such programs. This process is called the house cleaning process.
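A minimal house cleaning sketch; the 90-day retention window, the table names and the reconciled_flag column are illustrative assumptions.

```sql
-- Purge log records older than the retention window.
DELETE FROM etl_log
WHERE  log_ts < SYSDATE - 90;

-- Purge reject records that have already been reconciled and aged out.
DELETE FROM etl_reject_sales
WHERE  reject_ts < SYSDATE - 90
AND    reconciled_flag = 'Y';
```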

9. Scheduling
As the ETL programs we write have to be executed on a regular frequency, we have to make sure scheduling is possible within the ETL tool without depending on OS scheduling. If we use a database programming language as the ETL tool, then we have to use the scheduler of the database; if you are using Oracle, you can make use of DBMS_JOB, as in the sketch below.
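For example, a nightly load could be registered with DBMS_JOB roughly like this; the procedure name run_nightly_etl and the 02:00 schedule are assumptions for illustration.

```sql
-- Register a daily ETL run with Oracle's DBMS_JOB.
DECLARE
    l_job BINARY_INTEGER;
BEGIN
    DBMS_JOB.SUBMIT(
        job       => l_job,
        what      => 'BEGIN run_nightly_etl; END;',   -- assumed stored procedure
        next_date => TRUNC(SYSDATE) + 1 + 2/24,       -- first run tomorrow at 02:00
        interval  => 'TRUNC(SYSDATE) + 1 + 2/24'      -- then every day at 02:00
    );
    COMMIT;
END;
/
```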

10. Log Handling
Logging is very important in the ETL layer. When there is a job failure, it is the starting point for fixing the issue. Some tools provide both database logging and flat-file logging. There should be logging levels: should we log only errors, or log everything -- which records were processed, the timing of each transformation, etc.? The more detailed the logging you opt for, the more time it takes to write the log files. So do a good amount of analysis before you choose the level that makes sense for your implementation.

These top 10 things will help you design an ETL layer for your DW implementation; they can also be used to check whether an ETL tool is on par with the industry standard.
 

Sunday, March 9, 2014

Data Layers in the EDW environment

In a complex data warehousing environment, we create different placeholders to store data.

We can visualize multiple layers where we deal with data.

1. AS IS Layer
This is the layer where we store the data exactly as we get it from the source system, without transforming any of it. It typically acts as the record of what we consumed into the DW; even if a change happens in the operational system, we can refer back to this layer to reconcile the data between staging and source.

2. Staging Layer
The staging layer typically contains the data you are transforming; it is used to convert heterogeneous data sources into a homogeneous form. For example, in some cases you get XML data from a handheld device, tables from an RDBMS, and a CSV file from some other business. In this case we convert all three sources into a common area. The common area is typically created in an RDBMS so that handling the data is easy.

3. ODS Layer
An Operational Data Store typically exists to reduce data latency. When requirements come from the business for near real-time data integration, it becomes difficult to add the operational data to the summarized data in the data warehouse. In this scenario the DW architecture recommends adding an ODS layer database, which typically holds a smaller, more current slice of data compared to the DW. Usually the ODS stores about 2 to 3 months' worth of records.

4. EDW Layer
The data that is processed (gold data) based on the business requirements -- cleansed and ready to use for analysis -- gets stored in the EDW layer. In some implementations you see the EDW in 3rd normal form; in others you see the EDW as star schemas (a constellation).

5. Data Mart Layer (Physical Data Mart, Logical Data Mart)
Based on the architecture of the DW, this layer is treated differently from organization to organization. Typically, in Teradata-based implementations, business views are created and exposed to the BI tools (BO, MSTR and Cognos).
If the EDW architecture follows 3NF, then the data mart typically has a star schema with dimension and fact tables. If the data physically resides in the data mart layer, we call it a physical data mart. If the data mart is made of views only, which are exposed to the BI layer, we call it a logical data mart.
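A short sketch of what a logical data mart object could look like: a business view over assumed 3NF EDW tables, exposed to the BI layer instead of a physically stored star schema. All schema, table and column names are illustrative.

```sql
-- Business view exposed to the BI tools; nothing is physically materialized.
CREATE OR REPLACE VIEW mart.v_sales_by_region AS
SELECT r.region_name,
       p.product_category,
       TO_CHAR(s.sale_date, 'YYYYMM') AS month_key,
       SUM(s.amount)                  AS revenue
FROM   edw.sales    s
JOIN   edw.customer c ON c.customer_id = s.customer_id
JOIN   edw.region   r ON r.region_id   = c.region_id
JOIN   edw.product  p ON p.product_id  = s.product_id
GROUP  BY r.region_name, p.product_category, TO_CHAR(s.sale_date, 'YYYYMM');
```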

I hope this gives you an idea of how many different data layers can exist in a DW platform.

Tuesday, February 25, 2014

Top 10 Advantages of building DW

At a very high level, from a data perspective, I am listing the top 10 things an organization gets when it starts implementing a DW.


1. Organizations have a repository of operational data across the enterprise
2. Integrated data is available for analysis
3. Management gets a 360-degree view of the business
4. Historical data (current + previous versions of master records) exists
5. Enables the organization to answer any question management has with little or no extra effort, provided the data exists in the repository
6. Quality of data increases as we find gaps while integrating the operational systems
7. Enables ad hoc reporting for analysis
8. Makes the entire organization a data-driven organization
9. Automation brings down the manual effort, and the platform is ready to take on any new data provided there is a link
10. Lets you evaluate your organization against the standard KPIs of your domain and make sure you are on par with the way your competitors think and act.
 

Stages of Data warehouse implementation

DW implementation evolves over the period of time in a company.

In the industry, we compare the implementation against five stages. By comparing what the current DW is doing, we can visualize what else we can do with the DW to get maximum ROI (Return on Investment).

Typically in an organization, managers start asking various questions for which no single system can provide the information in one report or set of reports. In this scenario the company hires a team of IT resources who pull information from various systems and put reports together manually, which takes some time (normally anywhere from a day to a week to get the result).

If these day-to-day requirements from managers keep growing, that is when the company decides to build a DW, where the data is stored in a way that makes reporting easy and analysts can build reports on the fly based on their requirements.


Stage 1: What happened?

In this stage of the DW, only static reports are churned out from the reporting / MIS system.

Stage 2: Why did it happen?

In this stage, we find the reasons behind the facts. "What happened" shows the facts; "why it happened" shows the reasons behind what happened. Here we use OLAP tools, which enable us to visualize and drill down to detail levels and also allow us to create new reports on the fly.

Stage 3: What will happen?

Predicting the future based on the past and the knowledge we have gained from historical data. Usually we use data mining tools, which take data from the repository, apply mathematical / statistical functions to it, and predict the variable you are looking for. For example, when a call goes out from a call center or contact center, if the agent knows the probability of acceptance before making the call, he or she can pick the customers with a high probability. In this case, the data mining tool calculates that probability for each customer using an algorithm you choose.

Stage 4: What is happening now compared to the past?

Take the current data, match it with past data, and react accordingly in the operational systems. Here the feed from the DW goes back to the operational systems, which makes them more effective. The ROI of the DW kicks in big time when you reach this stage, as your operations become very effective.

Stage 5: Being proactive instead of reactive based on a scenario.

Here we send alerts / intimations to stakeholders based on the threshold values defined against each of the KPIs the business is looking for.