DataCleaner User Guide

Please be sure to visit the new official DataCleaner website at http://datacleaner.eobjects.org

Requirements

DataCleaner requires a Java Runtime Environment (JRE) version 5.0 or higher. You can download the JRE from the Sun Microsystems website.

Installation and configuration

Download DataCleaner

Refer to the GetDataCleaner page to download DataCleaner.

Install JDBC drivers

In DataCleaner GUI

To install a JDBC driver, open the File menu and select Register database driver. Then select the JAR file containing the driver and fill in the driver class name.

In DataCleaner webmonitor

To install a JDBC driver, copy the JAR file containing the driver to the WEB-INF/lib folder of DataCleaner webmonitor and restart the webapp. Alternatively, copy the JAR file to a shared lib folder of your servlet container.

Configure data connections

To have easy access to your database, we recommend adding it to the configured list of database connections. Edit the file datacleaner-config.xml and insert an XML element like the ones supplied with the default configuration file. Refer to DataCleanerFeatures for examples of various database configurations.

Configure modules

DataCleaner is based on an extensible, module-oriented architecture in which it is easy to add new modules such as profiles and validation rules. The configuration of these modules is maintained in the configuration files datacleaner-profiler-modules.xml, datacleaner-validator-modules.xml etc. These configuration files contain a definition for each module, for example the pattern finder profile:

<bean class="dk.eobjects.datacleaner.profiler.BasicProfileDescriptor">
 <property name="displayName" value="Pattern finder" />
 <property name="profileClass" value="dk.eobjects.datacleaner.profiler.pattern.PatternFinderProfile" />
 <property name="iconPath" value="images/profile_pattern_finder.png" />
</bean>

Note the three properties: displayName determines the name of the profile, profileClass determines the Java class that performs the profiling, and iconPath determines the path of the icon image to use. For more details on module configuration and development, see the design documentation.
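
As a rough sketch, registering an additional profile would presumably follow the same pattern as the pattern finder definition. In the fragment below, only the BasicProfileDescriptor class and the three property names are taken from the example above. The display name, the class com.example.profiler.WordCountProfile and the icon path are hypothetical placeholders and not modules shipped with DataCleaner.

```xml
<!-- Hypothetical custom profile, added to datacleaner-profiler-modules.xml -->
<bean class="dk.eobjects.datacleaner.profiler.BasicProfileDescriptor">
 <!-- Name of the profile (placeholder value) -->
 <property name="displayName" value="Word count" />
 <!-- Placeholder class name; replace with your own profile implementation -->
 <property name="profileClass" value="com.example.profiler.WordCountProfile" />
 <!-- Placeholder icon path; the image file is assumed to exist -->
 <property name="iconPath" value="images/profile_word_count.png" />
</bean>
```

The class named in profileClass has to be available on the classpath of DataCleaner. How a profile class itself is written is covered by the design documentation, not by this sketch.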

Using DataCleaner GUI

Using the profiler

The profiler calculates and analyses various important measures based on the values of your data. The results of a profiling run always have to be read and interpreted by a person, for example a database administrator or BI engineer. They are meant to help this person get an impression of the state of the data and determine where to look for data quality problems in the data source.

  • When presented with the Select a DataCleaner task dialog, press the Profile button.
  • Open a database, CSV file, Excel spreadsheet, OpenOffice.org database file or XML file by pressing either the Open database or the Open file button.
  • When the data source has been opened, the tree on the left-hand side represents the schema of your data. You can right-click on the nodes in the tree to see what you can do with them. Double-click on columns to add them to your data selection (the data that you want to investigate).
  • Use the Add profile button to add computations to your profiling. Each new profile is represented by a tab that contains the configuration options for that profile.
  • Press the Run profiling button to start profiling data. The results of the profiling will be displayed in a new window, where the measures from the profiles are shown.

Using the validator

In contrast to the profiler and the comparator, the validator gives you a result that can be interpreted as "good" or "bad", since the validator validates your data. In validator mode you set up business rules that apply to your data and receive a result that can be used to fix validation errors.

  • When presented with the Select a DataCleaner task dialog, press the Validate button.
  • Open a database, CSV file, Excel spreadsheet, OpenOffice.org database file or XML file by pressing either the Open database or the Open file button.
  • When the data source has been opened, the tree on the left-hand side represents the schema of your data. You can right-click on the nodes in the tree to see what you can do with them. Double-click on columns to add them to your data selection (the data that you want to investigate).
  • Use the Add validation rule button to add rules to your validation. Each new validation rule is represented by a tab that contains the configuration options for that rule. The individual validation rules are:
    • JavaScript evaluation ~ A rule that is created dynamically from user-written JavaScript.
    • Dictionary lookup ~ Looks up values in a dictionary in order to verify their presence in the dictionary.
    • Value range evaluation ~ Evaluates whether or not values are in a specified value range.
    • Regex validation ~ Validates values according to a specified regular expression.
    • Not-null check ~ A simple validation rule that verifies that values are not null.
  • Press the Run validation button to start validating data. The results of the validation will be displayed in a new window, where the rows that fail validation are shown.

Using the comparator

No documentation for DataCleaner comparator yet.

Saving and loading your work

To save your work in the profiler or validator, click the Save button. This saves your current profiler or validator work in a file that you can reopen later to perform the exact same tasks.

  • Profiler work is saved in .dcp files
  • Validator work is saved in .dcv files

You can open a saved file by going to the File menu and selecting Open file.

Using DataCleaner webmonitor

DataCleaner webmonitor has not been released yet.
