DataCleaner features
Applications
From the user's perspective, DataCleaner consists of two applications:
- DataCleaner GUI (Graphical User Interface) - A standalone desktop application for working with Data Quality. This application offers data profiling, data comparison and validation of rules against data. The GUI is a useful tool for work in progress, where you explore data and test validation rules on the fly.
- DataCleaner webmonitor - A web-based application for monitoring and scheduling continuous Data Quality efforts based on the work created in the desktop application.
Data Quality areas
Specifically, we are working on four types of user tasks related to data quality:
- Profiler - The profiler is used for profiling data. To profile data means to search through the content of the data and deliver reports on its state. Profiling includes both trivial measures like "how many null-values are there?" and "what are the highest and lowest values?" and more complicated analyses like searching for patterns in data.
- Validator - The validator is used to build validation rules and validate data against them to help you find quality problems in the data. Validation rules can be simple, as in "values in this column may not be null" or "values in the person.age column may range from 0 to 120", but can also take a much more complex form: "values in the first name column must be found in my dictionary of first names" or "if the value of the country column is America, then the value of the state column may not be null". Validation rules will typically be built based on the profiling and comparison of data.
- Comparator - The comparator is used to compare two sources of data. Typically you have separate systems that are out of sync and depend on people typing in values, which can be misspelled or overlooked. The comparator helps you find differences in data that is supposed to be the same. For example, if the customer 'Mærsk' is registered in one database and 'Maersk' in another, the comparator will help you spot the inconsistency. If the natural keys are inconsistent, the comparator can also investigate other columns to determine similarity: if "G.E." and "General Electric" share the same address, they are probably the same company.
- Monitor - The monitor is an extension of the validation features. Monitoring means overseeing the progress of continuous efforts to validate data. Used to schedule validation rules, the monitor notifies you when data of poor quality enters your system, so that you can fix it. The monitor is web-based, and validation reports can be read online to ensure simple access to the status of the data quality.
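To make the profiler and validator descriptions above more concrete, here is a minimal sketch in plain Java. It does not use DataCleaner's own classes; the class name `ProfileSketch`, its method names and the sample data are assumptions made for illustration. It computes the trivial profiling measures mentioned above (null count, lowest and highest value) and applies an age rule that combines "may not be null" with "may range from 0 to 120".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Illustrative only: a hypothetical class, not part of the DataCleaner API.
public class ProfileSketch {

    /** Profiling measure: how many null values are there? */
    public static long countNulls(List<Integer> values) {
        return values.stream().filter(Objects::isNull).count();
    }

    /** Profiling measure: lowest non-null value (throws if there is none). */
    public static int lowest(List<Integer> values) {
        return values.stream().filter(Objects::nonNull)
                .mapToInt(Integer::intValue).min().getAsInt();
    }

    /** Profiling measure: highest non-null value (throws if there is none). */
    public static int highest(List<Integer> values) {
        return values.stream().filter(Objects::nonNull)
                .mapToInt(Integer::intValue).max().getAsInt();
    }

    /** Validation rule: the value must be present and range from 0 to 120. */
    public static boolean isValidAge(Integer age) {
        return age != null && age >= 0 && age <= 120;
    }

    /** Counts the values that break the age rule. */
    public static long countInvalidAges(List<Integer> values) {
        return values.stream().filter(age -> !isValidAge(age)).count();
    }

    public static void main(String[] args) {
        // Made-up sample column standing in for person.age
        List<Integer> ages = Arrays.asList(34, null, 57, 130, 0);
        System.out.println("nulls:   " + countNulls(ages));
        System.out.println("lowest:  " + lowest(ages));
        System.out.println("highest: " + highest(ages));
        System.out.println("invalid: " + countInvalidAges(ages));
    }
}
```

In this sketch the profile is what suggests the rule: a highest value of 130 in an age column is the kind of finding that would lead to a validation rule such as the 0 to 120 range.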
Additionally we're building a number of catalogues of reference data to support the data quality effort:
- Regular expressions ("regexes") - Regular expressions are used to validate that values conform to certain patterns that can be expressed using a regex. You can either build your own expressions using our regex tester or pick some of the regexes that ship with DataCleaner. This means that you get instant value by just applying a predefined set of expressions to your datastore.
- Dictionaries - A dictionary is a set of valid values. An example of a dictionary could be a list of company names, boy or girl names or valid street names. In DataCleaner, dictionaries can be set up in several ways: by using a database containing reference data or by using a simple text file where each line of the file is a dictionary entry. It's very simple and very powerful to validate or categorize your data according to your domain's dictionaries.
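The two kinds of reference data can be sketched with the Java standard library alone. The example below is an assumption-laden illustration, not DataCleaner code: the class `ReferenceDataSketch`, the `FIVE_DIGITS` pattern and the sample names are hypothetical, and the pattern is not one of the regexes that ship with DataCleaner. The dictionary is built from a list of lines, mirroring the "one entry per line" text file format described above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical class, not part of the DataCleaner API.
public class ReferenceDataSketch {

    // Hypothetical example pattern: exactly five digits
    public static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");

    /** True if the whole value conforms to the pattern; null never conforms. */
    public static boolean matchesPattern(Pattern pattern, String value) {
        return value != null && pattern.matcher(value).matches();
    }

    /** Builds a dictionary from lines, one entry per line, skipping blank lines. */
    public static Set<String> dictionaryFromLines(List<String> lines) {
        Set<String> entries = new HashSet<>();
        for (String line : lines) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** True if the trimmed value is an entry of the dictionary. */
    public static boolean inDictionary(Set<String> dictionary, String value) {
        return value != null && dictionary.contains(value.trim());
    }

    public static void main(String[] args) {
        // Made-up first names standing in for the lines of a dictionary file
        Set<String> firstNames = dictionaryFromLines(Arrays.asList("Anna", " Peter ", ""));
        System.out.println(inDictionary(firstNames, "Peter"));
        System.out.println(inDictionary(firstNames, "Petr"));
        System.out.println(matchesPattern(FIVE_DIGITS, "12345"));
    }
}
```

For a real text file, the lines could be read with `java.nio.file.Files.readAllLines(path)` and passed to `dictionaryFromLines` unchanged.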
Technical features
DataCleaner strives for technological excellence. The technology choices aim to ensure the widest possible support by complying with well-known, open standards, while keeping performance in mind.
- Minimal query load - DataCleaner queries the databases for the data needed, and nothing but the data needed. Streaming data and an intelligent distribution of data to profiles, validation rules etc. ensure that the database is minimally affected by DataCleaner and that DataCleaner handles memory-usage optimally.
- Extensible architecture - Extensive use of interfaces makes it possible to extend DataCleaner in almost any way you want. You can add your own profiling mechanisms, validation rules etc. and configure them for use in the GUI as well.
- Database compliancy - DataCleaner complies with JDBC and the SQL 99 standard and therefore works with almost any database. For more information refer to the database compliancy pages:
| MySQL | fully compliant |
| PostgreSQL | fully compliant |
| Derby (IBM Cloudscape, Sun Java DB) | fully compliant |
| Oracle | fully compliant |
| Firebird (formerly Interbase) | fully compliant |
| IBM DB2 | not tested |
| Microsoft SQL Server | successfully tested on version 2005 |
| Hypersonic SQL (aka HSQLDB) | not tested |
- Platform independent - Written in the Java programming language, DataCleaner can run on almost any operating system, including Windows, Mac OS, Linux and other Unix-variants. The only requirement is a Java Runtime Environment (JRE) above version 5.0. You can download the JRE from Sun's website for free.
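The "minimal query load" and "extensible architecture" points can be illustrated together. The sketch below is not DataCleaner's actual design: the `RowConsumer` interface, the `NullCounter` class and the method names are hypothetical. It only shows the general idea of asking each profile or rule which columns it needs, building one query for exactly those columns, and passing each streamed row to every consumer in a single pass. Rows are plain maps here so that the example runs without a database.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: hypothetical types, not part of the DataCleaner API.
public class QueryLoadSketch {

    /** Hypothetical extension point: a profile or rule declares the columns it needs. */
    public interface RowConsumer {
        List<String> columns();
        void consume(Map<String, Object> row);
    }

    /** Builds one SELECT that covers only the columns the consumers asked for. */
    public static String buildQuery(String table, List<RowConsumer> consumers) {
        Set<String> needed = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (RowConsumer consumer : consumers) {
            needed.addAll(consumer.columns());
        }
        return "SELECT " + String.join(", ", needed) + " FROM " + table;
    }

    /** Streams the rows once, handing each row to every consumer. */
    public static void distribute(Iterable<Map<String, Object>> rows, List<RowConsumer> consumers) {
        for (Map<String, Object> row : rows) {
            for (RowConsumer consumer : consumers) {
                consumer.consume(row);
            }
        }
    }

    /** Example consumer: counts null values in a single column. */
    public static class NullCounter implements RowConsumer {
        private final String column;
        private int nulls;

        public NullCounter(String column) {
            this.column = column;
        }

        public List<String> columns() {
            return Arrays.asList(column);
        }

        public void consume(Map<String, Object> row) {
            if (row.get(column) == null) {
                nulls++;
            }
        }

        public int nulls() {
            return nulls;
        }
    }

    public static void main(String[] args) {
        NullCounter names = new NullCounter("name");
        NullCounter ages = new NullCounter("age");
        List<RowConsumer> consumers = Arrays.asList(names, ages);
        System.out.println(buildQuery("person", consumers));

        // Made-up row standing in for one streamed result row
        Map<String, Object> row = new HashMap<>();
        row.put("name", null);
        row.put("age", 30);
        distribute(Arrays.asList(row), consumers);
        System.out.println(names.nulls() + " null name(s), " + ages.nulls() + " null age(s)");
    }
}
```

With a real database, the generated query would be executed over JDBC and the rows read from the `ResultSet` one at a time instead of from an in-memory list, so that memory use stays independent of the table size.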
