Wednesday, November 17, 2010

Data Cleanliness

Whether you are working on user provisioning, password management, compliance, directory virtualization or meta-directory projects, the first step is always to check the data and make sure that it is clean.

What constitutes clean data, and how do we get it that way? This is almost certainly the most important question that should be addressed when considering an Identity Management project.
For a user provisioning project, there are a few basic questions to ask:

  1. Is the data authoritative? It’s important that the data going into the provisioning solution comes from authoritative sources. Such sources include an HCM system, Active Directory, and so on.
  2. Does the data include a unique identifier (UID)? This can be a tricky value. Depending on legal and compliance rules, some attributes are not usable in a UID. Furthermore, UIDs that are based on name components frequently require additional elements to ensure uniqueness, which means the potential for additional transformations at some point in the provisioning process.
  3. Is there a way to link data from disparate sources? Some parsing or similar ETL transformation may be needed so that records describing the same person can be matched. Some organizations assume that the same key must be used in all tables. While this is certainly the goal, it can’t always happen, and that should be recognized in the architecture and business analysis / requirements phases of the project.
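
Points 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (emp_no, employee_id), the "first initial plus last name" UID rule, and the helper names are hypothetical and do not come from any particular provisioning product.

```python
import re

def normalize_key(value):
    """Lowercase and drop non-alphanumerics so 'E-00123 ' and 'e00123' match."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def make_uid(first, last, taken):
    """Build a name-based UID, appending a counter until it is unique."""
    base = normalize_key(first[:1] + last)
    uid, n = base, 1
    while uid in taken:
        n += 1
        uid = f"{base}{n}"
    taken.add(uid)
    return uid

def link_records(hr_rows, dir_rows):
    """Match HR rows to directory rows on a normalized key.

    Assumes HR rows carry 'emp_no' and directory rows carry 'employee_id'
    (hypothetical names). Returns (linked, unmatched).
    """
    by_key = {normalize_key(r["employee_id"]): r for r in dir_rows}
    linked, unmatched = [], []
    for row in hr_rows:
        match = by_key.get(normalize_key(row["emp_no"]))
        if match is None:
            unmatched.append(row)   # needs manual review or another rule
        else:
            linked.append({**row, **match})
    return linked, unmatched
```

The unmatched list is the useful output during analysis: it shows how many records cannot be linked with the current rule, which is worth knowing before the architecture is fixed.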

Compliance projects also require cleansing and preparation. In some ways, this should actually be easier, since these projects generally occur after basic user provisioning. However, that is only half the battle, because compliance relies on two basic types of data: user and application.
For user data, it makes sense to ask:
  1. Do we have the necessary data about the person: name, email, physical location?
  2. Do we know who the manager or certifier is? If this is to be determined programmatically, it does not need to be defined here, but you might want to specify a default value.
  3. Do we have data to group the users by? It might be manager, title, or department.
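
The three user-data questions above can be turned into a quick check over a sample extract. The sketch below is a rough illustration, not a definitive implementation: the field names and the default certifier value are assumptions made up for the example.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("name", "email", "location")   # assumed field names
DEFAULT_CERTIFIER = "default.certifier"           # hypothetical fallback value

def check_user(record):
    """Return (missing required fields, certifier to use) for one user."""
    missing = [f for f in REQUIRED_FIELDS
               if not str(record.get(f) or "").strip()]
    certifier = str(record.get("manager") or "").strip() or DEFAULT_CERTIFIER
    return missing, certifier

def group_users(records, field="department"):
    """Group user names by a field; empty values land in 'UNASSIGNED'."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(field) or "UNASSIGNED"].append(record.get("name"))
    return dict(groups)
```

Running check_user over every row and counting the non-empty missing lists gives a simple measure of how clean the user data really is.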

Application data is somewhat different, but does not necessarily need to be that complex.
  1. Do we know the name of the item we’re certifying?
  2. Is the entitlement clearly spelled out?
  3. Is the permission clearly spelled out?

The big challenge here is that the application information is often bound up in other pieces of data, so there will almost always be some need for additional transformation. It’s important to work through the sample data so that the project can clearly define all of the elements that are needed to create the certification.
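
As a small example of that kind of transformation, suppose the application, entitlement and permission arrive packed into one colon-separated string such as "Payroll:Reports:read". That format is purely an assumption for illustration; real sources will differ, and the parsing rule has to come from the sample data.

```python
def parse_entitlement(raw):
    """Split 'application:entitlement:permission' into named parts.

    The colon-separated layout is a hypothetical input format.
    Raises ValueError when a part is missing, so bad rows surface early.
    """
    parts = [p.strip() for p in raw.split(":")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"cannot parse entitlement: {raw!r}")
    application, entitlement, permission = parts
    return {
        "application": application,
        "entitlement": entitlement,
        "permission": permission,
    }
```

Failing loudly on rows that do not fit is deliberate here: every rejected row is a requirement that still needs to be defined.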

In any case, before beginning the project it makes sense to transform the data into a clean, clear and concise format. Otherwise, the project is sure to run long, with a combination of extended business analysis and development work needed before anyone can start on the main goal of the project. Even if you think the data is clean, allocate a week or so for your project team to look over the data and understand it. This type of “front loading” in the project will help the build process run much more smoothly.
