
DataCleaner Design documentation


This design documentation is aimed at developers who want to contribute to DataCleaner, develop plugins for DataCleaner, or simply understand the guiding principles of the application.

Modules

DataCleaner is composed of four sub-projects:

  • DataCleaner-core - The core library with classes and methods, providing a framework for Open Source Data Quality. ( javadoc)
  • DataCleaner-gui - A Swing-based standalone desktop-application for working with Data Quality. This application offers data profiling, comparing and validation of rules. ( javadoc)
  • DataCleaner-webmonitor - A web-based application for monitoring and scheduling continuous Data Quality efforts based on rules created in the desktop application. (still at a fairly early stage)
  • DataCleaner-resources - Input used by DataCleaner to perform certain operations, for example dictionary files, example configurations and example data. (subject to removal, samples have been added to gui)

Additionally DataCleaner depends heavily on MetaModel, the common domain model for structures and querying of datastores. For more details on the technical documentation of MetaModel, check out MetaModelDevelopersGuide.

DataCleaner core design

In this documentation we focus mostly on DataCleaner-core, since the rest is for the most part "just" UI, which is more straightforward to understand. DataCleaner-core is designed to efficiently execute batch data profiling and validation. It does this by examining a configuration (the input, i.e. ProfileConfiguration or ValidationRuleConfiguration) and generating queries that retrieve the data needed to perform the profiling and validation. These batch processors are called runners, and there are two of them: ProfileRunner and ValidationRuleRunner.

Each configuration contains a reference to the kind of profile or validation rule (the so-called descriptor), the columns that are to be profiled or validated, and some properties that are used to configure the profile or validation rule. An example configuration of the regex validation rule could look like this:

  • validation rule descriptor: ValidationRuleManager.getValidationRuleDescriptor(dk.eobjects.datacleaner.validator.trivial.RegexValidationRule.class);
  • columns: { customers.email, employees.email }
  • properties:
    • PROPERTY_NAME: "My email check"
    • PROPERTY_REGEX: "[a-zA-Z0-9._%+-]*@[a-zA-Z0-9._%+-]*\.[a-z]{2,4}"
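To get a feel for what the example expression accepts, here is a minimal standalone sketch that uses only java.util.regex. It does not use the DataCleaner API: the class name RegexCheckSketch, the isValid helper and the sample values are made up for illustration, and it assumes that a value is valid only when the whole value matches the expression (how RegexValidationRule applies the expression internally is not described here).

```java
import java.util.regex.Pattern;

// Standalone illustration only: this class is hypothetical and not part of DataCleaner.
class RegexCheckSketch {

    // The expression from the PROPERTY_REGEX example (backslash escaped for a Java string).
    static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9._%+-]*@[a-zA-Z0-9._%+-]*\\.[a-z]{2,4}");

    // Assumption: a value is valid when the entire value matches the expression.
    static boolean isValid(String value) {
        return value != null && EMAIL.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "john.doe@example.com", "not-an-email", "@.com" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValid(sample));
        }
    }
}
```

Because the expression uses * rather than + for the parts around the @ sign, a value such as "@.com" also passes under this full-match assumption, which may or may not be what you want from an email check.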

So to sum up:

  • There are profiles (IProfile interface) and validation rules (IValidationRule interface).
    • These are instantiated as part of the batch processing for every single profiling or validation execution (i.e. we don't instantiate them ourselves).
  • To reference them we have descriptors which contain metadata about the profiles or validation rules.
    • Profile and validation rule metadata include: Property names, display name, icon, class of profile/validation rule, supported column types.
    • We don't instantiate the descriptors either; we obtain them from ProfileManager or ValidationRuleManager.
  • What we DO instantiate is configuration objects. These contain references to the descriptors, columns to process and property-values.
  • We pass these configurations on to ProfileRunner or ValidationRuleRunner.

Creating a new kind of profile class

To create a new type of profile you need to develop a class that implements the IProfile interface (we recommend extending AbstractProfile, which simplifies some of the trivial tasks involved in honoring the interface contract). The lifecycle of a profile class is shown in the following pseudocode (representing how the runner will instantiate and execute the profile):

IProfile profile = MyProfile.class.newInstance(); //No-argument constructor is always used
profile.setProperties(properties); //The properties are extracted from the configuration
profile.initialize(columns); //The columns are extracted from the configuration

//The ProfileRunner executes a query; each row contains values for the columns and a COUNT of how often that distinct combination of values occurs
for (Row row : rows) {
  long count = row.getValue("COUNT(*)");
  profile.processRow(row, count);
}

IProfileResult result = profile.getResult();
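The same lifecycle can be tried out in isolation with a small self-contained sketch. Note that everything in it is a stand-in: SketchProfile, NullCountSketchProfile and LifecycleSketch are hypothetical names, rows are plain maps, and the result is a plain map, whereas the real IProfile works with columns, rows and IProfileResult objects. It only illustrates the order of calls and why the count passed to processRow matters.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the profile contract; NOT the DataCleaner IProfile interface.
interface SketchProfile {
    void setProperties(Map<String, String> properties);
    void initialize(String... columnNames);
    void processRow(Map<String, Object> row, long distinctCount);
    Map<String, Long> getResult();
}

// Counts null values per column, weighting each distinct row by its count.
class NullCountSketchProfile implements SketchProfile {
    private final Map<String, Long> nullCounts = new LinkedHashMap<>();

    public void setProperties(Map<String, String> properties) {
        // This sketch needs no properties.
    }

    public void initialize(String... columnNames) {
        for (String name : columnNames) {
            nullCounts.put(name, 0L);
        }
    }

    public void processRow(Map<String, Object> row, long distinctCount) {
        for (Map.Entry<String, Long> entry : nullCounts.entrySet()) {
            if (row.get(entry.getKey()) == null) {
                // One distinct row stands for distinctCount physical rows.
                entry.setValue(entry.getValue() + distinctCount);
            }
        }
    }

    public Map<String, Long> getResult() {
        return nullCounts;
    }
}

class LifecycleSketch {

    // Plays the role of the runner: instantiate, configure, feed rows, collect the result.
    static Map<String, Long> run() {
        SketchProfile profile = new NullCountSketchProfile(); // no-argument constructor
        profile.setProperties(new HashMap<String, String>());
        profile.initialize("email", "phone");

        Map<String, Object> first = new HashMap<>();
        first.put("email", "a@example.com");
        first.put("phone", null);
        Map<String, Object> second = new HashMap<>();
        second.put("email", null);
        second.put("phone", null);

        profile.processRow(first, 3);  // this combination of values occurs 3 times
        profile.processRow(second, 2); // this combination of values occurs 2 times
        return profile.getResult();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the two distinct rows represent five physical rows, so the profile reports 2 nulls for "email" and 5 for "phone". A profile that ignored the count would under-report whenever the same combination of values occurs more than once.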

Some tips for profile programming

  • To make things easy we've made a default implementation of the IProfileDescriptor interface, called BasicProfileDescriptor. If you want to avoid developing your own descriptor as well, there is only a single contract you have to honor in order to successfully apply this descriptor: define all your property names as String constants prefixed with "PROPERTY_", e.g.:
    public static final String PROPERTY_REGEX = "Regex to profile according to";
    public static final String PROPERTY_HITS = "Number of hits";
    

This will make the descriptor automatically discover the two property names: "Regex to profile according to" and "Number of hits".
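As an illustration of how such a naming contract can be discovered automatically, here is a minimal reflection sketch. It is an assumption about the general technique, not a copy of what BasicProfileDescriptor does internally; the class names RegexSketchProfile and PropertyDiscoverySketch, and the extra OTHER_CONSTANT field, are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical profile class that follows the PROPERTY_ naming contract.
class RegexSketchProfile {
    public static final String PROPERTY_REGEX = "Regex to profile according to";
    public static final String PROPERTY_HITS = "Number of hits";
    public static final String OTHER_CONSTANT = "ignored, lacks the prefix";
}

class PropertyDiscoverySketch {

    // Collects the values of all public static String fields whose name starts
    // with "PROPERTY_". The result is sorted because getFields() guarantees no order.
    static List<String> discoverPropertyNames(Class<?> profileClass) {
        List<String> names = new ArrayList<>();
        for (Field field : profileClass.getFields()) {
            boolean isStatic = Modifier.isStatic(field.getModifiers());
            boolean isString = field.getType() == String.class;
            if (isStatic && isString && field.getName().startsWith("PROPERTY_")) {
                try {
                    names.add((String) field.get(null)); // static field: no instance needed
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("Cannot read " + field.getName(), e);
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(discoverPropertyNames(RegexSketchProfile.class));
    }
}
```

Note that it is the value of each constant, not the field name, that ends up as the property name, which matches the two names listed above.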

  • Once the profiling is done the getResult() method will be called. The result object should contain one or more matrices of metrics. To help build matrices we recommend using the MatrixBuilder class which helps you generate and manipulate matrices in a mutable manner (think of it as an equivalent to StringBuilder).

Plugging in to DataCleaner-GUI

To integrate your profile with DataCleaner-GUI all you have to do is register a descriptor in the datacleaner-profiler-modules.xml file. Here's an example registration for the "Standard Measures" profile:

<bean class="dk.eobjects.datacleaner.profiler.BasicProfileDescriptor">
  <property name="displayName" value="Standard measures" />
  <property name="profileClass" value="dk.eobjects.datacleaner.profiler.trivial.StandardMeasuresProfile" />
  <property name="iconPath" value="images/profile_standard_measures.png" />
</bean>

These are the properties that can be added to your registration <bean> element:

  • displayName: The name of the profile as it reads to the user (required).
  • profileClass: The qualified class name of the IProfile implementation (required).
  • iconPath: The path to a 22x22 pixel icon to use to visually represent the profile (required).
  • literalsRequired: A boolean indicating whether or not processed columns must be literal (String) types (values "true" or "false", optional).
  • numbersRequired: A boolean indicating whether or not processed columns must be number types (values "true" or "false", optional).
  • datesRequired: A boolean indicating whether or not processed columns must be date/time types (values "true" or "false", optional).

Building custom configuration panels

If the default configuration panel (the tab content dedicated to your profile in DataCleaner-GUI) is not sufficient for your needs you can also create a custom panel for configuring the profile. To do this you need to create a class that implements the IConfigurationPanel interface. The lifecycle for such an object is as follows:

  • The initialize method is called right after instantiation.
    • Note that this can happen either when creating a new profile or when reloading an existing profile from disk, so honor the configuration parameter, as it is not always "clean".
  • The getPanel method is used once to generate the JPanel.
  • The getConfiguration method can be called several times to get the configuration to execute or save to disk.
  • The destroy method is called when the tab is closed.

When the configuration panel has been programmed it needs to be registered in the datacleaner-config.xml file as an <entry> element in the "configurationPanelManager" <bean> element.