Sunday, March 16, 2014

Knowledge Discovery Platform

The industry has used many names for data warehousing and business intelligence applications.

It started with:
  • EIS (Executive Information System)
  • DSS (Decision Support System)
  • DW (Data Warehouse)
  • BI (Business Intelligence)
  • BIDW, and so on
So, what name can I give to the same DW?

Being a technology resource, I always find it difficult to explain to technology people why they should learn this technology. On the other hand, it is easy to explain to business people, because they look at the benefit they get from using it. When we show them a case study they never thought about, or how a current problem in the company can be solved using the technology, they understand the system and the ROI.

Here you go: "Knowledge Discovery Platform".

So what is a DW? It is a platform that gives the business knowledge it does not have today. The business may have the questions, but the IT landscape has no solution that can provide the answers within a reasonable time.

Why do we call it a platform? Because the system brings together various technologies: ETL, a storage layer, reporting/visualization, data mining, and so on.

The bottom line is to understand why we are building it, what the ROI is for the business, and what value IT provides (business value). Everything else keeps changing.


 

Wednesday, March 12, 2014

Unit testing of BI Reports in General

The developer should unit test every BI report. The following areas have to be covered for each report against its requirement.

1. Data level testing (Extract SQL from the report, test the data in the report against the database)

2. Prompt testing (check the list of values, type of prompt, etc.)

3. Functionality testing (drill down, alerts, conditional formatting, visibility of certain columns, etc.)

4. Format testing (fonts, standards, etc.)

5. Export testing (default format, other export options)

6. Security testing (access restricted to certain types of users, or a public report, etc.)

7. Scheduling test (if the report has to be scheduled)

If you cover the above set of basic things in every report, you are good to go.
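The data-level check (item 1) is the one most worth automating. A minimal sketch, assuming you have already run the report's extracted SQL and an equivalent query against the database and hold both result sets as lists of row tuples (the stub data below is illustrative only):

```python
# Compare the rows a report displays against the rows the database returns
# for the same query. Both inputs are lists of row tuples.
def compare_report_to_db(report_rows, db_rows):
    """Return the rows that differ between the report output and the database."""
    report_set, db_set = set(report_rows), set(db_rows)
    return {
        "missing_in_report": sorted(db_set - report_set),
        "extra_in_report": sorted(report_set - db_set),
    }

# Stub data standing in for the two result sets:
db_rows = [("North", 100), ("South", 200)]
report_rows = [("North", 100), ("South", 210)]
diff = compare_report_to_db(report_rows, db_rows)
```

An empty result for both keys means the report's data matches the database for that query.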

Tuesday, March 11, 2014

Steps in creating BO universe

Remember the following steps while creating the universe in SAP Business Objects.

1. Create the connection -- usually it has to be of the secured type so it can be exported to the repository.
2. Import the tables/views into the universe (identify the set of tables you need).
3. Identify the joins and establish them.
4. See if there is a loop issue; if so, see whether you can avoid it by creating an alias or a context.
5. Check whether a chasm trap may occur; if so, create a context.
6. See if there is a fan trap issue; if so, create either an alias or a derived table, and then create the context.
7. Create the classes.
8. Visualize the objects in each class; based on the hierarchy, create the objects within each class.
9. Make sure each object has the right type (dimension, detail, or measure).
10. Run the integrity check.
11. If there are no errors, save the universe file.
12. Export the universe to the repository.
 

Monday, March 10, 2014

Top 10 things to remember in ETL world

As an ETL developer, it is important to understand the various touch points in the ETL world; you may be working on any one of them in the ETL pipeline. In a DW system, most of the time is spent analyzing the source data and creating ETL jobs to populate the data warehouse. Let's discuss the 10 things you should know to become a very effective ETL resource for an organization.

1. Sources
What is your source? Understand its nature. Is the source data in an RDBMS? If so, what credentials do you have? How do you get metadata about the source, sample data from it, and so on? If it is a flat file, XML, or a web service, understand the structure of the data you are getting.

2. Extraction cycle
How frequently does the data get updated in the source? Are you extracting the data once a day, week, month, or quarter? Are you going to extract the whole data set every time you access the source, or are you extracting it incrementally (delta extraction)?
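A minimal sketch of delta extraction, assuming the source rows carry a `last_updated` timestamp and that a high-water mark is persisted between runs (both are common conventions, not a fixed standard):

```python
# Incremental (delta) extraction: pick up only rows changed since the last
# run, and advance the high-water mark so the next run starts from there.
from datetime import datetime

def extract_delta(source_rows, high_water_mark):
    """Return rows changed since the last run, plus the new high-water mark."""
    delta = [r for r in source_rows if r["last_updated"] > high_water_mark]
    new_mark = max((r["last_updated"] for r in delta), default=high_water_mark)
    return delta, new_mark

rows = [
    {"id": 1, "last_updated": datetime(2014, 3, 9)},
    {"id": 2, "last_updated": datetime(2014, 3, 10)},
]
delta, mark = extract_delta(rows, high_water_mark=datetime(2014, 3, 9))
```

In a real job the high-water mark would be stored in a control table and the filter pushed into the extraction SQL rather than applied in memory.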

3. Staging
A staging database is used as a temporary area where we can store data for processing. Typically, when we have more than one operational system, with data stored in different RDBMSs (say Oracle, SQL Server, DB2), we have to consolidate the required data into one RDBMS. The database used for this consolidation is called the staging database.

4. Transformations
Based on the target table/file, you may have to convert data that arrives in a different form. For example, in the DW you may store data at a monthly level for performance reasons, while in the source the data is typically at a transactional level; in this scenario you have to apply an aggregate transformation. Typical transformations are router/splitter, aggregate, filter, sorter, de-duplication, and so on.
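The monthly-aggregation case above can be sketched as follows; the field names (`date`, `product`, `amount`) are illustrative:

```python
# Aggregate transformation: roll transaction-level rows up to a monthly
# grain before loading the warehouse table.
from collections import defaultdict

def aggregate_monthly(transactions):
    """Sum transaction amounts per (year-month, product)."""
    totals = defaultdict(float)
    for t in transactions:
        key = (t["date"][:7], t["product"])  # "YYYY-MM" prefix of an ISO date
        totals[key] += t["amount"]
    return dict(totals)

txns = [
    {"date": "2014-03-01", "product": "A", "amount": 10.0},
    {"date": "2014-03-15", "product": "A", "amount": 5.0},
    {"date": "2014-04-02", "product": "A", "amount": 7.0},
]
monthly = aggregate_monthly(txns)
```

In an ETL tool the same step is usually a single aggregator component with the month as the group-by key.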

5. Auditing
The ETL system should retain data to answer questions from various stakeholders. For this reason, we have to keep track of how many records we got from a specific system, what data we got, how long it took, and so on. Any time we change the data, the system should keep a record for audit purposes. If you see an ETL system without auditing information, it is not a mature implementation.
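A minimal sketch of such an audit trail: one record per extraction run capturing the source, row count, and duration. The field names are illustrative, and in practice the audit records would go to a database table rather than an in-memory list:

```python
# Wrap each extract so that every run leaves an audit record behind.
import time

audit_log = []

def audited_extract(source_name, extract_fn):
    """Run an extract function and append an audit record for it."""
    start = time.time()
    rows = extract_fn()
    audit_log.append({
        "source": source_name,
        "row_count": len(rows),
        "seconds": round(time.time() - start, 3),
    })
    return rows

rows = audited_extract("orders_db", lambda: [{"id": 1}, {"id": 2}])
```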

6. Reconciliation
There should be an automated way of identifying how many records we read and how many got rejected because of data type mismatches, differences between the source definition and the data, failed business validations, and so on. There are typically two types of reconciliation: row count validation and value-based validation. If the granularity is the same between the source system and the target system, row count matching is enough; if the granularity differs, do value matching instead. For example, compare sum(rev) at the source with sum(rev) at the target; the totals should match even though the grain at the target is different. When the expectation is not met, send an email automatically to the production support team so they can take care of it.
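The two reconciliation styles can be sketched like this, with a source at transaction grain and a target at daily grain (stub rows for illustration):

```python
# Row-count reconciliation: valid only when source and target share a grain.
def reconcile_counts(source_rows, target_rows):
    return len(source_rows) == len(target_rows)

# Value-based reconciliation: sum a measure on both sides when grains differ.
def reconcile_values(source_rows, target_rows, measure):
    src_total = sum(r[measure] for r in source_rows)
    tgt_total = sum(r[measure] for r in target_rows)
    return src_total == tgt_total

source = [{"rev": 100}, {"rev": 50}, {"rev": 25}]  # transaction grain
target = [{"rev": 175}]                            # daily grain
counts_ok = reconcile_counts(source, target)       # grains differ, so this fails
values_ok = reconcile_values(source, target, "rev")
```

A failed check is the point at which the automated email to production support would fire.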

7. Reject Handling
In order to show the value the DW platform adds to the organization over time, we can showcase how our understanding of the organization's data, and the quality of that data, have improved. So what happens when we reject data during a load, and how do we handle it? There are four types of rejected data:
  • Self-healing data -- today's transaction data arrives, but the corresponding master data is not yet in the dimension table. When we load the transaction data into the fact table, it fails because the dimension record does not exist. So we reject the data and retry the load the next day; by then the dimension data has arrived, so yesterday's rejected data loads successfully. This is called self-healing data.
  • Programmatically corrected data (automated) -- once we know the reason for a failure, we can fix it based on business rules and a data validation process. For example, an address may have a zip code but no city name; we can use external reference data to fill in the city and load the record. If the ETL program rejects the data, create another program to cleanse the rejected data and load it again. In the ETL world this is called a reject load.
  • Manually corrected data -- if rejected records cannot be corrected automatically, we build an interface to review the errors and fix the data manually. Typically, while we are still learning the data across source systems, this happens the most, until the system matures.
  • Rejects (incomplete source data) -- in some cases we cannot fix the data at all, so it is flagged as a reject and discarded over time.
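The self-healing flow above can be sketched as a load that parks fact rows with missing dimension keys and retries them on the next run (`cust_id` and the in-memory queue are illustrative stand-ins for a real dimension lookup and reject table):

```python
# Self-healing rejects: fact rows whose dimension key does not exist yet are
# queued and retried on the next run, after the dimension has arrived.
def load_facts(fact_rows, dimension_keys, reject_queue):
    """Load rows whose keys exist; push the rest onto the reject queue."""
    loaded = []
    for row in fact_rows:
        if row["cust_id"] in dimension_keys:
            loaded.append(row)
        else:
            reject_queue.append(row)
    return loaded

rejects = []
# Day 1: customer 7 is not in the dimension yet, so the row is rejected.
day1 = load_facts([{"cust_id": 7, "amt": 10}], dimension_keys={1, 2},
                  reject_queue=rejects)
# Day 2: the dimension record for customer 7 has arrived; retry the rejects.
retry = list(rejects)
rejects.clear()
day2 = load_facts(retry, dimension_keys={1, 2, 7}, reject_queue=rejects)
```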
8. House Cleaning
We end up storing error records and log records when we process data with ETL programs. Once the log files occupy too much space, or the error records are no longer needed (because reconciliation is finished and the reasons for the rejects have been audited), we can write programs to delete those records and files. At regular intervals, we remove the unwanted data and files from the DW platform using such programs. This process is called house cleaning.
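A minimal house-cleaning sketch: delete log files older than a retention window. The 30-day retention is an illustrative choice, and the demo runs against a throwaway directory rather than a real log area:

```python
# House cleaning: remove log files older than a retention window.
import os
import tempfile
import time

def purge_old_files(directory, max_age_days=30):
    """Delete files older than max_age_days; return the names removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed

# Demo against a throwaway directory with one stale and one fresh file:
demo_dir = tempfile.mkdtemp()
stale = os.path.join(demo_dir, "old.log")
open(stale, "w").close()
os.utime(stale, (time.time() - 40 * 86400,) * 2)  # backdate mtime 40 days
open(os.path.join(demo_dir, "new.log"), "w").close()
removed = purge_old_files(demo_dir)
```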

9. Scheduling
As the ETL programs we write have to run on a recurring schedule, we have to make sure scheduling is possible within the ETL tool without depending on OS scheduling. If we use a database programming language as the ETL tool, then we have to use the database's scheduler; in Oracle, for example, we can make use of DBMS_JOB.

10. Log Handling
Logging is very important in the ETL layer. When there is a job failure, the log is the starting point for fixing the issue. Some tools provide both database logging and flat file logging. There should be logging levels: should we log only errors, or log everything, recording each record processed, the timing of each transformation, and so on? The more detailed the logging you opt for, the more time it takes to write the log files, so analyze carefully before choosing the level that makes sense for your implementation.
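The errors-only versus log-everything trade-off can be sketched with Python's standard logging module (the logger names and helper are illustrative, not part of any ETL tool):

```python
# Configurable logging levels for an ETL job: errors-only by default,
# full DEBUG detail when explicitly requested.
import logging

def make_etl_logger(name, detailed=False):
    """Return a logger set to DEBUG when detailed, ERROR otherwise."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if detailed else logging.ERROR)
    return logger

log = make_etl_logger("nightly_load", detailed=False)
errors_only = not log.isEnabledFor(logging.INFO)  # INFO chatter is suppressed
```

Because level checks happen before messages are formatted, an errors-only logger also avoids most of the write cost described above.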

These top 10 things will help you design the ETL layer for your DW implementation; they can also be used to check whether an ETL tool is on par with industry standards.
 

Sunday, March 9, 2014

Data Layers in the EDW environment

In a complex data warehousing environment, we create different placeholders to store data.

We can visualize multiple layers where we deal with data.

1. AS IS Layer
This is the layer where we store the data exactly as we receive it from the source system, without transforming any of it. It acts as the record of what the DW consumed: even if a change happens later in the operational system, we can go back to this layer to reconcile the data between staging and source.

2. Staging Layer
The staging layer typically holds the data you are transforming; it is used to convert heterogeneous data sources into a homogeneous form. For example, you may get XML data from a handheld device, tables from an RDBMS, and a CSV file from another business. In this case we convert all three sources into a common area, typically created in an RDBMS so that handling the data is easy.
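The heterogeneous-to-homogeneous step can be sketched as below, mapping a CSV feed and an XML feed into one common staging structure (the field names and XML layout are illustrative):

```python
# Convert heterogeneous inputs (CSV from a partner, XML from a device)
# into one homogeneous staging row shape: {"id": str, "amount": float}.
import csv
import io
import xml.etree.ElementTree as ET

def stage_from_csv(text):
    return [{"id": r["id"], "amount": float(r["amount"])}
            for r in csv.DictReader(io.StringIO(text))]

def stage_from_xml(text):
    root = ET.fromstring(text)
    return [{"id": e.get("id"), "amount": float(e.get("amount"))}
            for e in root.findall("sale")]

staged = (stage_from_csv("id,amount\nA1,10.5\n")
          + stage_from_xml('<sales><sale id="A2" amount="3.0"/></sales>'))
```

Once every source lands in the same row shape, the downstream transformations only ever deal with one structure.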

3. ODS Layer
An Operational Data Store typically exists to reduce data latency. When requirements come from the business for near-real-time data integration, it becomes difficult to add operational data to the summarized data in the data warehouse. In this scenario, the DW architecture recommends adding an ODS layer, a database that typically holds more granular, less summarized data than the DW, but with a much shorter history; usually the ODS stores only two to three months of records.

4. EDW Layer
The data which is processed ("gold" data) based on the business requirements, cleansed, and ready to use for analysis gets stored in the EDW layer. In some implementations the EDW is in third normal form; in others it is a set of star schemas (a constellation).

5. Data Mart Layer (Physical Data Mart, Logical Data Mart)
Based on the architecture of the DW, this layer is treated differently from organization to organization. In Teradata-based implementations, for example, business views are typically built and exposed to the BI tools (BO, MSTR, and Cognos).
If the EDW follows 3NF, the data mart typically has a star schema with dimension and fact tables. If the data physically resides in the data mart layer, we call it a physical data mart; if the data mart consists only of views exposed to the BI layer, we call it a logical data mart.

I hope this gives you an idea of how many different data layers can exist in a DW platform.