One of the first steps in solving a data quality problem is to perform data profiling. As seen in Jason Hover’s article, Data Profiling: What, Why and How?, data profiling allows you to analyze your data to determine what it looks like and what problems exist in the data. Manual data profiling can be performed; however, using software such as Informatica Data Quality allows both data stewards and developers to collaboratively profile the data in a common repository more quickly, often yielding a more thorough analysis.
Informatica Data Quality offers profiling capabilities within both the Informatica Analyst and Informatica Developer tools. Within Analyst, data stewards can quickly create a profile of the data:
As seen in image (A.1), a data steward can see that there are 4 records with a NULL value in column ADDR2. The data steward can also see that the values are mixed case. Here, the data steward may determine that NULL values are not valid and that the values should be standardized to uppercase. The values of San Diego, SD, SAN DIEGO should also all be SAN DIEGO. The data steward can add this as a comment for the developer to create a rule (A.2).
In the Developer tool, the same profile can be viewed with the comments provided by the data steward (A.3).
The developer can now apply profile rules to the ADDR2 column based on the business rules provided. Applying profiling rules allows the data steward to identify possible inconsistencies in the data and potentially correct the data in the source system. Note: The output of the profile rule is a virtual column in the profile; the rule has not been applied to the data physically until a mapping has been created in Developer.
Informatica Data Quality provides a number of pre-built rules, which can be applied directly to the profile within either the Analyst or Developer tool. In this example, we will use the pre-built rules of rule_Completeness and rule_UpperCase and create a new rule to standardize the values for SAN DIEGO.
Before creating the rule to standardize values for SAN DIEGO, we need to create a list of valid values. In Informatica Data Quality, the list of valid values can be created as a reference table. The reference table will be used as a lookup to determine what value should be returned. TIP: A quick way to create a reference table with valid values is to add the values from the data that has been profiled. Right click on the value and select Add to Reference Table. (A.4).
Once the table is created, the list of valid values can be added (A.5). The ADDR2 column serves as the valid value. If the value in column1, column2, or column3 exists, it will return the valid value found in column ADDR2. The reference table can then be maintained through the Analyst tool by the data steward (where an audit trail is captured).
TIP: If the data steward does not want to edit the table to include additional columns, the same valid value in column ADDR2 can be added and the additional values can be listed in column1, column2, column3 as highlighted in Image (A.6).
Now we can create a rule in the Developer tool to use the reference table. After applying the rule in the profile, we can see that there are now 3 records with the value of SAN DIEGO (A.7).
Although we are using 3 different profile rules, we can take it one step further and combine the rules into one profile rule. The rules are seen as mapplets in Developer. In this example, we will create one mapplet named rule_ADDR2 that will include 3 mapplets (one for each rule previously identified) (A.8).
Taking the output from rule_Completeness, we will determine if the record is valid (A.9).
TIP: Make sure to validate the mapplet as a Rule so that it will appear in the Analyst tool (A.10).
After this rule is applied in the profile and re-run, the data steward can see how many records are valid or invalid. The data steward can then drilldown on the invalid records.
As seen in Image (A.11), there are 4 invalid records because they are NULL. The data steward can then export this data, perform further analysis, update the source system, update the reference table, and/or work with the developer to update the rule. Once profiling of the data is complete and all rules have been identified/updated, then the rules can be physically applied to the data and cleansed.
This data profiling example illustrates one of many techniques that can be used within Informatica Data Quality. Collaboration amongst the data steward and developer during the data profiling process are key in taking the first steps in solving a data quality problem.
PARTNER WITH US
Do you have a current Data Quality initiative? How is your Data Quality project progressing? We’d like to hear from you. Take a few minutes to send us an email and let us know about your project. If you’d like further assistance with your current project, feel free to contact us directly. We’d be happy to talk with you and develop a proposal around your project.
If you’d like some tips & tricks regarding Informatica Data Quality, download our eBook: SIX DESIGN TIPS IN INFORMATICA DATA QUALITY.