Wednesday, July 16, 2014

Complexity of Reporting in DW Environment


The maturity of the DW implementation is based on the following functionality when it comes to Reporting.

Every DW begins with a handful of reports which focuses mainly on the management requirement. Typically 90% of the reports are static by nature, only arguments change thus the required data is represented in the following formats (Table, Cross tab, chart, combination of all etc). As time passes, business have more requirements and thought process which we cannot implement only through static reports, so more tools / technologies comes under BI / Reporting area. For example a business user wants to add one more column / object in the report when they are analyzing the data.

Types of work / reports typically comes out of DW

  1. Static / Canned Reports (What happened or Known Questions)
  2. Ad hoc Reports (Why it happened or Analysis)
  3. Dashboards (One view based on the role rather than digging deep – KPI representation)
  4. Data Mining (What will happen / Predictive analysis – more of statistics)
  5. Business Activity Monitoring (Defining threshold against every KPI and taking automated actions based on the event -- it's also called as Active DW in some implementations)

One technology in BI (Reporting / OLAP) won’t solve all the above. So, typically we have multiple technologies in the BI reporting side compared to ETL / Integration layer in the DW.

So, when compared to Integration (ETL) Layer in a DW, OLAP will have different software packages to solve each one of the problem we have in terms of requirements.


Sunday, July 6, 2014

Information as an Asset -- How to Start

I came across a word today "INFONOMICS" which is nothing but applying "ECONOMICS" on the DATA what we generate in the enterprise.

Any company which has a DW and deriving more value out of the data to take decisions which impacts the operational, tactical and strategic decisions are on its way to be INFONOMICS company.

"Data Monetisation" -- Finding ways to use the existing data to fine tune the process or finding a new process which either saves the money for the company or generates the money for the company.

To prove the role you are playing, link your work towards the revenue contribution, thus you also getting closer to business. When a technology person understands the business very well, you will make use of data very inventively in the business. When we reuse the data for new process, you saved the money which we spend to collect the information.

Two types of DATA -- ROO -- Record of Origin
                                     ROR -- Record of Reference

The more ROR we have in the organization, we can prove the reuse as well as the cost we saved to process and present the data on need basis very effectively.

Record of Reference Systems
1. Master Data Management -- which improves the Data Quality, Data Reuse and Operational efficiency
2. DW and Business Intelligence -- which holds the historical changes and integrated data for tactical and strategic decisions.

When we don't have any platform which cannot be used to analyze and improve the decision making process (it can be operational, analysis, visualization, Machine Learning (data mining) etc), then we are not considering the data as asset.

Start building the repository of information which can be used across organization is the place to start looking at the data as an asset.


Wednesday, June 25, 2014

ETL Vs ELT

Difference between ETL and ELT

ETL -- Extract , Transform and Load

ELT -- Extract, Load and Transform

Its just the approach we take to get the job done. So, how do I say in my current project I am doing and ETL or ELT?

In most of the implementations we have ETL tools where there is an ETL Server like informatica, SSIS (SQL Server Integration Services) etc is used in the DW environments. In this case the transformations happen in the ETL server which is outside of the databases. Some of the ETL server / services are very power full and scalalbe and can understand various sources which makes it more effective in most of the implementations.

There are some RDBMS for example Teradata can load lots of data into database very fast compared to SQL Server as the architecture of Teradata is Parallel processing by default and its design to take care of huge amount of data. So, here using the RDBMS capabilities for transformation and loading makes it better. In this case ETL tool is not capable of loading data, so we some times use the procedures to do transformations and load the data into the DW.




In the integration layer, if you use more of database to implement transformations then it falls into ELT where as if you use more of ETL tool to do transformations then we can call it as ETL tool.

ETL Tools in the market: Informatica, Data Stage, SSIS
ELT Tools in the market: ODI (Oracle Data Integrator)

80% transformations in Database and 20% in others --- here we can conclude that we are using ELT
80% transformations in ETL tool and 20% in others --- here we can conclude that we are using ETL

In the ETL world, we utilize ETL Server the most for transformations. In the ELT world, we utilize more of database server (staging database) scripts and procedures to do transformations.

 

Thursday, May 8, 2014

Default Records in Dimension tables

At times, we face data quality issues while we are poulating the fact tables in the datawarehouses.

We refresh the data in the DW at various intervals. We found scenarios where we populate dimension tables once a Day and populate fact tables every hour so that organization can analyze what's going on at hourly level if required.

In the above scenario, there are some instances where a dimension record is missing but the fact related to that dimensional record is coming in for processing. (Got a new customer at 10.00 AM, this customer will flow into DW in tomorrow's dimension loading ETL, where as the transaction belongs to the customer will be reaching the DW by 11.30 AM). In this case, the ETL program (look up to get respective dimension key) will yield no record. What do we do in this scenario? We have to reject that record because of FK violation, if so then we will end up getting descripency between OLTP and DW. In this case, we accept the record into fact but associate a default dimensional record from the customer dimension so that we dont miss on the values for reconcilation.

As part of the house cleaning process, or self healing process of bad data, we identify the fact records which got associated to default dimension records and find the right dimensional DW key and update the same in the fact table. This can be a process which runs once in a week to make sure we all the right dimensional association to fact.

 

Tuesday, April 22, 2014

Advantages of Data Visualization tools

In today's result oriented environment, most companies wants the output in a short duration with lesser cost and quick turn around time.

Business Discovery is the term which the visualization tools as the data is viewed in an interactive manner. Display the data in a summarized way and slice and dice the data in a page or across the entire document enables the user to analze the data better.

Visualization tools typically work on In Memory Database ie when the data gets loaded to the server of the visualization tool, the data gets compressed to the maximum level so that it can hold more data in the server which enables the end user to interact with data using the tools.

Qlikview, Tableau, Spotfire etc are some of the visualization tools in the market which is in top 3 spots in this space.

Traditional Data Warehousing
Creating and enterprise Data warehousing or Data Mart for an enterprise is the way we create the DW systems. Typically this gets sponsored by the senior managers who understands the current issues (typicall data is not integrated across OR reporting is difficult for the MIS department). Once we implement the DW, the companies will be able to view the data at enterprise level, can analyze the data using OLAP technologies like SAP Business Objects, Cognos and Microstrategy. OLAP enables the companies to answer why it happend (Ad hoc) kind of questions which is the biggest advantages of creating the DW. This enables the DW users to cut across any subject area provided we have metadata in the OLAP tools for analysis.

Dis Advantages of traditional Data Warehousing implementation
Time consuming to implement the traditional DW is one of disadvantage. Deciding on ETL technology / methodology, building Data Models, creating meta data models for OLAP, creating reports, understand the data of different OLTP systems, creating the common reference model which fits all the operational system etc takes its own time to make it better. If the time and resource is the problem, then companies are requesting to create the reports (pre defined) and Dashboards which is interactive with lesser cost which takes less time to implement.

To beat the disadvantage of the traditional DW implementation, now companies are prefering to implement the visualization tools to make sure the data is displayed in the format the management requires with out building the DW and putting more emphasis on outcome (result). This is where the visualization tools gets better score to implement with in the department or enterprise. This tools provides a way to analyze data by making sure the interactive in the reports / dashboards are very high.

Here in the Visualization tools, we extract data from various sources and directly populate to the documents / reports rather than storing the data in a centralized repository. This makes the development faster. We refresh the data periodically based on the changes in the source data or based on the requirement the business has. The effort required to build Dashboards and interactivity using this tools are very less.

Monday, April 21, 2014

Top 10 to remember when you are ETL developer

Every resource who do programming to process data in an Integration environment should follow this top 10 thing to become more productivity.


1. Understand the overall objective of the task.
  • Are you loading data from source to staging OR staging to DW or DW to some other system.

2. Understanding the source and source type (database, XML, flat file etc)
  • Where does the source exists? Is it in the file system or its an RDBMS.
3. Visualize / Collect the sample data from source
  • Visualize the sample data, if there is no sample data, create the test data which is equivalent to the realistic data by taking the help of subject matter expert if available. 
4. Frequency of the source data generation and acceptance
  • What is the frequency of data the source generate and how you identify the type of extract we should apply on the source data.
 
5. Understand the target database / files
  •  What is the target file or database structure looks like? What is the PK of the target tables or how we uniquely identify the records in the file.
6. Visualize the data based on the sample source data
  • Based on the sample input data, you have to apply business rules so that transformed data can be visualized in target structure or files.
     
7. Create mapping document (contains the source system, source file / table, target system, target file / table, what business rule you apply on the columns ).
  • Use an excel sheet or use the standard mapping document according to the company you are working for. Make sure you have columns you identified as source and the target file / database columns and make sure you have the business rules documented as much as possible in the mapping document. By seeing the mapping document, you should know the work we have to do as ETL developer with the tool / program.
8. Identify the gaps between the source and required data in target
  • After completing the mapping document, you should be able to figure out how much of columns we are yet to identify. By using this mapping document, we can sit with any one in the team to get it clarified so that one person can take the responsibility to close the same.
9. Develop the ETL job
  • Use all the artifacts you collected till now and start creating the ETL job based on the mapping document.
10. Unit test the ETL
  • Create the unit test document and make sure all the possible combinations of business rules. Identify the different ways of testing, row count testing or value testing according to the transformations we apply.
 
By applying the above mentioned points, we should be able to make sure we can develop good ETL programs.

Sunday, March 16, 2014

Knowledge Discovery Platform

There are so many names we have in the industry for the Datawarehousing and Business Intelligence applications.

Stared with
  • EIS (Executive Information System)
  • DSS (Decision Support System)
  • DW (Data Warehouse)
  • BI (Business Intelligence)
  • BIDW etc
So, what is the name I can give for the same DW.....

Being a technology resource, I always find difficulty explaining technology people why we should learn this technology. On the other hand its easy to explain it to business people as they look on what is the benefit I get when I use it. When we show them the case study which they never thought about it or how the current problem in the company can be solved using technology then they understand the system and the ROI.

Here you go....... "Knowledge Discovery Platform"

So what is DW, its a platform which enables business with more knowledge (business) which they don't have it now. May be they have questions, but the IT landscape does not have solution which can provide the information with in a reasonable time.

Why we call it as platform, the system utilizes various technologies like ETL, Storage Layer, Reporting / Visualization, Data Mining etc

Bottom line is understand why we are building, what is the ROI for the business, what is the value IT provides (business value) etc. All the others keep changing.