Posts by author kasper

DataCleaner 2.5.2 released

DataCleaner 2.5.2 has just been released. The DataCleaner 2.5.2 release is a minor release, but does contain some significant feature improvements and enhancements. Here's a walkthrough of this release:

Apache CouchDB support

We've added support for the NoSQL database Apache CouchDB. DataCleaner supports reading from, analyzing and writing to your CouchDB instances.

CouchDB support
Connect to CouchDB databases

Update table writer

Following our previous efforts to bring ETLightweight-style features into DataCleaner, we've added a writer which updates records in a table. You can use this for example to insert or update records based on specific conditions.

Like the Insert into table writer, the new Update table writer is not restricted to SQL-based databases; it works with any datastore type that supports writing (currently relational databases, CSV files, Excel spreadsheets, CouchDB databases and MongoDB databases). The semantics are the same as with a traditional UPDATE statement in SQL.

Drill-to-detail information saved in result files

When using the Save result feature of DataCleaner 2.5, some users experienced that their drill-to-detail information was lost. In DataCleaner 2.5.2 we now also persist this information, making your DQ archives much more valuable when investigating historic data incidents.

Improved EasyDQ error handling

The EasyDQ components have been improved in terms of error handling. If a momentary network issue or a similar problem causes a few records to fail, the EasyDQ components will now recover gracefully and, most importantly, your batch work will complete in spite of the errors.

Table mapping for NoSQL datastores

Since CouchDB and MongoDB are not table based, but have a more dynamic structure, we provide two approaches to working with them: the default, which is to let DataCleaner autodetect a table structure, and the advanced, which allows you to manually specify your desired table structure. Previously the advanced option was only available through XML configuration, but now the user interface contains appropriate dialogs for doing this directly in the application.
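To give a feel for what autodetecting a table structure can mean for schemaless documents, here is a minimal Java sketch. It is only an assumption about the general idea, not DataCleaner's actual implementation: the class TableAutodetectSketch and its methods detectColumns and toRow are hypothetical names, and documents are modelled as plain maps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not DataCleaner's implementation.
class TableAutodetectSketch {

    // Collects the union of keys seen across the sampled documents, in first-seen order.
    static List<String> detectColumns(List<Map<String, Object>> sampleDocuments) {
        Set<String> columns = new LinkedHashSet<>();
        for (Map<String, Object> document : sampleDocuments) {
            columns.addAll(document.keySet());
        }
        return new ArrayList<>(columns);
    }

    // Flattens one document into a row that follows the detected column order.
    // Keys that are missing from the document become null cells.
    static Object[] toRow(Map<String, Object> document, List<String> columns) {
        Object[] row = new Object[columns.size()];
        for (int i = 0; i < columns.size(); i++) {
            row[i] = document.get(columns.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("name", "Alice");
        first.put("city", "Arnhem");
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("name", "Bob");
        second.put("email", "bob@example.com");

        List<Map<String, Object>> documents = Arrays.asList(first, second);
        List<String> columns = detectColumns(documents);
        System.out.println("Columns: " + columns);
        for (Map<String, Object> document : documents) {
            System.out.println(Arrays.toString(toRow(document, columns)));
        }
    }
}
```

A manually specified table structure would simply replace the detectColumns step with a fixed list of column names.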

We hope you enjoy the new 2.5.2 version of DataCleaner. Go get it now at the  downloads page.

DataCleaner adds data profiling to Pentaho

Today we announce an exciting new partnership with Pentaho, the leading open source Business Intelligence and Business Analytics stack! For the past years Human Inference, members of the DataCleaner community and Pentaho have been in close contact to design  a new data quality package for the Pentaho Suite. DataCleaner plays a key part in this new solution.

DataCleaner’s integration in Pentaho is primarily focused on the open source ETL product, Pentaho Data Integration (aka Kettle). Pentaho and Human Inference will be running a joint webinar on May 10th to tell everyone about all the new features ( register for the webinar here), but until then – here’s a summary!

Profile ETL steps using DataCleaner

When working with ETL you often find yourself asking what kinds of values to expect for a particular transformation. With the data quality package for Pentaho we offer a unique integration of profiling and ETL: Simply right click any step in your transformation, select ‘Profile’, and it will start up DataCleaner with the data that the step produces, ready for profiling! Not only is this a great feature for Pentaho Data Integration, it is also one of a kind in the ETL space. We are very excited to see this great use of embedding DataCleaner into other applications.

Profile with DataCleaner in Pentaho Data Integration / Kettle
Right click any step to profile

Execute DataCleaner job

Another great feature in the Pentaho data quality package is that you can now orchestrate and execute DataCleaner jobs using Pentaho Data Integration. This makes it significantly easier to manage scheduled executions, data quality monitoring and orchestration of multiple DataCleaner jobs. Mix and match DataCleaner’s DQ jobs with Kettle’s transformations and you’ve got the best of both worlds.

Execute DataCleaner jobs as part of your ETL flow

EasyDQ integration

Additionally, the data quality package for Pentaho contains the  EasyDQ cleansing functions as ETL steps, similar to what you know from their DataCleaner counterparts.

Deduplication and merging via DataCleaner

In addition to embedding DataCleaner for profiling of steps, you can also start up DataCleaner when browsing databases in Pentaho Data Integration. This will create a database connection which is appropriate for more in-depth interactions with the database. For example, you can use it to find duplicates in your source or destination databases.

Detect duplicates in your sources

For more information:

The press release from Pentaho:  Pentaho announces new Data Quality solution

Installation instructions and information from Pentaho:  Pentaho wiki: Human Inference

Example of using the DataCleaner profiler with Pentaho:  Pentaho wiki: Kettle Data Profiling with DataCleaner

Information about the EasyDQ functions for Pentaho:  EasyDQ Pentaho page

DataCleaner 2.5 is out!

Today we announce the general availability of DataCleaner 2.5! This release is the result of months of hard work by the core DataCleaner crew, the EasyDQ group and the community at large.

Let’s get straight to the “What’s new” question. There are plenty of major improvements in this release:

Saving results to disk

With DataCleaner 2.5 you can save, archive and share your analysis results. This is not only a time-saver for those who used to export analysis results manually, but also a means to improve your methodology around handling profiling results, sharing them with colleagues and archiving historical profiles of your data.

Save results to disk

Saving is implemented so that future versions and/or custom solutions can take advantage of the results and potentially use them for scheduled profiling, data quality monitoring and more.

Data structure transformers

With the rise of Big Data and NoSQL databases come more advanced data structures. In next generation databases we see key/value pairs and list structures that are cumbersome to deal with in tools built for traditional relational data. To solve these issues DataCleaner 2.5 ships with a new set of “data structure” transformers, which allow you to easily wrap and unwrap structures so that you can get to the parts that you want to analyze or process.

Data structures

The data structure transformers also include parsers and writers for JSON data, which is one of the more common representations of NoSQL data structures.

Filters and transformers are now all "Transformations"

Since DataCleaner 2.0 we’ve been pushing the idea of transformers and filters. The strength of these two types of components was evident from a technical perspective, but for the end-user the distinction has proven to be a distraction from the main use-case: to process data in a flow of actions. Therefore DataCleaner 2.5 has consolidated these two terms and made them available under a common metaphor for the user: Transformations. This means that users no longer have to look in multiple menus to find the component they are looking for.

New EasyDQ transformations: Merge duplicates and Due diligence check

The EasyDQ on-demand data quality platform team has also been busy. We present to you three new functions and an optional extension for advanced users.

  • Merge duplicates
  • Due diligence check (people)
  • Due diligence check (companies)

First is the Merge duplicates transformation. With this transformation you can turn your results from Duplicate detection into merged, golden records! The merge component is designed to handle a hierarchy of criteria when merging, to make sure that criteria such as well-formedness, update date and manual overriding are taken into account.

Secondly we’ve introduced two services for Due diligence checks. These are transformations which help you validate that the people you are doing business with do not appear on sanction lists covering terrorism, narcotics trafficking and other security threats.

These new features, as well as the other EasyDQ functions, are described in detail in the  EasyDQ reference documentation.

Lastly, there's a new extension available, the  EasyDQ essentials, which we recommend as a handy extra toolkit for those that want to go deep diving into the features of EasyDQ.

Defining datastore properties on the command line

One of the areas that has been strengthened considerably in recent releases of DataCleaner is the command line interface. Using this interface you can set up DataCleaner to execute in all environments, in a scheduled or managed fashion. In DataCleaner 2.5 we’ve also made it possible to override datastore properties from the command line. Why? Because it allows you to reuse the same job on different datastore definitions. If, for example, you are scanning a directory for CSV files and want to run a DataCleaner job on each file, this is a solution for you. Refer to the documentation for further explanation and examples.

Drill to detail information in value distribution results

The Value distribution analyzer now contains a drill to detail option, which makes it possible to see the source records for each value in the distribution. This greatly helps usability when doing exploratory data profiling.

Database-specific connection panels

The dialogs for setting up database connections have been enhanced with database-specific connection properties. This makes it a lot easier for the end-user to connect to a database without having to know the details of constructing a connection URL.

Database connection dialog

Database-specific configuration panels have been created for MySQL, PostgreSQL, Microsoft SQL Server and Oracle. Other database types are supported using the traditional way of connecting, as in previous versions of DataCleaner.

Execution and scheduling of DataCleaner jobs using Pentaho Data Integration

Pentaho Data Integration (PDI, aka Kettle) is an open source ETL product that the EasyDQ and DataCleaner teams have had a lot of interaction with. With the DataCleaner 2.5 release we are announcing that in the next version of Pentaho Data Integration you will be able to execute and schedule DataCleaner jobs using Pentaho’s infrastructure.

Execution in Pentaho Data Integration

While this is not available, released software as of today, we are looking forward to telling you more about this in the near future!

For those still reading, we also made some minor improvements in DataCleaner 2.5:

  • We’ve added some number transformations for generating IDs, incrementing numbers and more.
  • Implemented a Date range filter, similar to the Number range and String range filters.
  • Support for matching against Synonym catalogs in Reference data matcher (which was previously known as the Matching analyzer).
  • Now all components have flow visualizations in their configuration panel. This feature helps retain the overview when working with large analysis jobs.
  • The sample data (the ‘orderdb’ database) has been reworked to contain better examples of data quality issues.
  • User experience improvements; more elegant dialog designs and trimming of window layout.

We hope you all enjoy the new release of DataCleaner 2.5. Please let us know what you think on the  forums, or on our  LinkedIn group, or on  Google Plus, or on  Blogger, or  tweet it, or...

MetaModel 2.2.2 released

MetaModel version 2.2.2 has just been released. This is a maintenance release to the 2.2 branch of  MetaModel, containing primarily bugfixes and a few small but useful feature enhancements:

  • The MongoDB module now supports having Maps and Lists as column types. This means that the table-model applied to MongoDB now is structurally compatible with MongoDB's native model (which is key/value based).
  • Query filters now support logical AND operators. Previously AND was implied between all filter items and therefore not included as a choice, but if nested filters with AND + OR combinations are needed, the new AND operator is useful.
  • The  DataContext.getColumnByQualifiedLabel(...) method is now fault-tolerant towards case differences.
  •  DataSets are now automatically being closed when garbage collected. Although this is not a desirable use-case, it does allow for a late safe-guarding against unclosed resources.
  • A bug in the  DataContextFactory.createExcelDataContext(...) method which caused it to go into stack overflow was fixed.
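To illustrate why an explicit AND matters once filters are nested, here is a small Java sketch. It deliberately does not use MetaModel's query API; it models the same logic with java.util.function.Predicate, and the Customer class and its fields are hypothetical. With only an implied AND between top-level items, a condition such as "(country is DK AND age is at least 18) OR vip" cannot be expressed, because the AND has to live inside one branch of the OR.

```java
import java.util.function.Predicate;

// Hypothetical illustration only; this does not use MetaModel's filter classes.
class NestedFilterSketch {

    static final class Customer {
        final String country;
        final int age;
        final boolean vip;

        Customer(String country, int age, boolean vip) {
            this.country = country;
            this.age = age;
            this.vip = vip;
        }
    }

    // Builds the nested condition: (country = 'DK' AND age >= 18) OR vip
    static Predicate<Customer> buildFilter() {
        Predicate<Customer> danish = c -> "DK".equals(c.country);
        Predicate<Customer> adult = c -> c.age >= 18;
        Predicate<Customer> vip = c -> c.vip;
        return danish.and(adult).or(vip);
    }

    public static void main(String[] args) {
        Predicate<Customer> filter = buildFilter();
        System.out.println(filter.test(new Customer("DK", 30, false)));
        System.out.println(filter.test(new Customer("DK", 10, false)));
        System.out.println(filter.test(new Customer("US", 10, true)));
    }
}
```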

A detailed view of the work done can be seen in the milestone view.

We hope you enjoy the new release of MetaModel. Please provide your feedback on the MetaModel online forums.

EasyDQ releases patch for DataCleaner 2.4.2

The  EasyDQ on-demand data quality platform, which DataCleaner is integrated with, has released a patch for DataCleaner version 2.4.2. The patch includes a critical bugfix for the Inter-Dataset matching analyzer.

If you're using this functionality, please  download the patch and place it in the lib/ folder of DataCleaner. This will automatically apply the fix and matching multiple datasets will be working again.

The patch has also been applied to the Java WebStart version of DataCleaner, so WebStart users will not need to do anything.

DataCleaner 2.4.2 released

We've just released DataCleaner version 2.4.2, which is a bugfix and minor enhancements release. Please update to this latest version, which has a whole bunch of items fixed:

  • Database connections can now specify whether or not multiple connections may be made. This solves an issue related to databases that did not allow this, and a potential application halt if no more connections were available.
  • There's now a separate distribution of DataCleaner specific for Mac OS. Using this version of DataCleaner you'll see a much nicer OS integration than previously.
  • Performance of the engine has been improved by providing some job-level metrics as lazy loaded values. For instance, the estimated row count is now lazy loaded, so in situations where this metric is not needed (eg. the command line interface and embedded use of DataCleaner), it will not be calculated.
  • The command line interface now has additional options to save the results of an analysis to a file, given a variety of output formats. Saved files can later be opened in the User Interface, allowing for a DIY data quality monitoring solution (see  Kasper Sørensen's blog for more details).
  • An issue with correct prefixing of table names in INSERT statements was fixed in the downstream dependencies for the "Insert into table" component.
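The lazy loading of job-level metrics mentioned in the list above follows a common pattern: wrap the expensive computation and only run it the first time somebody asks for the value. The following is a minimal sketch of that pattern, not the engine's actual code; the class name LazyValueSketch is hypothetical.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; this is not DataCleaner's implementation.
class LazyValueSketch<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    LazyValueSketch(Supplier<T> loader) {
        this.loader = loader;
    }

    // Runs the loader on first access only; later calls return the cached value.
    @Override
    public synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        LazyValueSketch<Long> estimatedRowCount = new LazyValueSketch<>(() -> {
            System.out.println("Counting rows (expensive)...");
            return 1000L;
        });
        // Nothing has been computed yet; a caller that never asks pays nothing.
        System.out.println("First access: " + estimatedRowCount.get());
        System.out.println("Second access: " + estimatedRowCount.get());
    }
}
```

A caller such as a command line run that never reads the metric never triggers the loader at all.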

For full details about all changes, check out the trac roadmap for DataCleaner 2.4.2, AnalyzerBeans 0.10 and MetaModel 2.2.1.

DataCleaner 2.4.1 released

As our New Year's present to all of you, we have a new release of DataCleaner. DataCleaner 2.4.1 is largely a release of bugfixes and minor feature enhancements.

Here's an overview of the improvements we've made:

Feature enhancements:

  • Batch loading features were greatly improved when writing data to database tables. Expect to see many orders of magnitude improvements here.
  • Writing data is now more conveniently available, as the options have been added to the window menu.
  • You can now easily rename components of a job by double clicking their tabs.
  • The JavaScript transformer now has syntax coloring, so that your scripts are easier to inspect and modify.

Bugfixes:

  • When reading from and writing to the same datastore (eg. the DataCleaner staging area) we've made sure that the table cache of that datastore is refreshed. Previously some scenarios allowed you to see an out-of-date view of the tables.
  • A potential deadlock when starting up the application was solved. This deadlock was a consequence of an issue in the JVM, but we worked around it by synchronizing all calls to the particular API in Java.

The full list is also available on the DataCleaner 2.4.1 milestone in the roadmap.

The 2.4.1 release should work as a drop-in replacement of DataCleaner 2.4, so we encourage everyone to upgrade. Get it on the  downloads page. Happy new year.

MetaModel 2.2 released

Today we're announcing the release of  MetaModel version 2.2! This new release represents an effort to sanitize, streamline and make the API of MetaModel more flexible. The two major areas of improvement are:

  • Introduction of an interceptor layer, which can be used for many purposes, for instance to do  automatic conversion of data types. The interceptor layer allows you to enrich MetaModel's functionality and to monitor queries and updates on your data.
  • Improvement of the JDBC write speed by carefully adapting it to use batch updates, prepared statements and controlled commits. With these improvements MetaModel is becoming much more appropriate as a data writing API, even for large batches of data.

There are also a few other smaller additions to this release. You can read it all on the what's new in version 2.2 page and you can download the library from Google Code or use it as a Maven artifact, as always.

Easy as DataCleaner 2.4!

Merry Christmas! Today we announce the release of DataCleaner 2.4, which marks a huge joint effort by the community and the team at Human Inference to bring together the best ideas of both open source and cloud-based Data Quality.

Here's what's new in DataCleaner 2.4:

EasyDataQuality integration

With DataCleaner 2.4 we've made an alliance with the newly launched  EasyDQ.com service, which offers cloud-based Data Quality services. The services provided are:

  • Duplicate detection (aka. Deduplication or Fuzzy matching of records), which is free to use for up to 500,000 values.
  • Address data validation and cleansing. This allows you to check if addresses exist, if they are correctly formatted and even to suggest corrections in case you have mistakes.
  • Name data validation and cleansing. With the Name service, EasyDQ does not only format your names consistently, but also checks for misspellings and interprets the name parts.
  • Email and phone validation and cleansing. These services provide checking of email and phone data, making sure that email domains exist, that country codes are correct and much more.

No, these are not open source services, but they are offered at a reasonable price as well as a free starter package, and we thoroughly believe that the integration allows DataCleaner to become a much better tool for those who want it.

New analysis job components

Many of DataCleaner's users have reported that they use DataCleaner as a lightweight ETL tool. This is because we currently support basic reading, transformation and writing capabilities. With 2.4 we've added a few crucial components to add to this use-case where you want to do ad-hoc transformations, data quality checks and actually write the data back to your database.

  • Table lookup which allows you to look up any number of values based on any number of conditions. The lookup component has an intelligent caching mechanism and is highly performant. ( Docs).
  • Insert into table is a new option when writing data. With this option we are making it possible for DataCleaner to not only produce new files, but also to insert records into existing databases. That makes it a much more flexible writing option.

MongoDB support! And a few more...

Another theme in DataCleaner 2.4 is support for the popular NoSQL database MongoDB. The support is offered both as a profiling service (eg. reading and analyzing data), but ALSO for writing data to MongoDB collections, using the Insert into Table component, which makes DataCleaner the first open source tool that offers data flow modelling and ETL functionality for MongoDB! We also improved on a few other datastores:

  • Support for MongoDB datastores, which are both readable and writable with DataCleaner. MongoDB uses a schemaless design principle, so you have the choice of either letting DataCleaner auto-detect a virtual schema, or define it yourself. ( Docs).
  • Added more configuration options to Fixed width value files. Specifically, there is now the option to specify header line number.
  • Added support for custom table mapping of XML structures. For large XML files this is a recommended approach, since with a fixed table model, DataCleaner can do SAX-based XML parsing which is much less memory intensive and a lot faster. ( Docs).
  • The Command Line Interface ( Docs) has been further improved, by allowing you to inject job variables from the command line, which makes it possible to parameterize jobs and thereby reuse jobs for different purposes.

Besides these points, a few bugs were fixed and some minor features added. For a full list of changes, check out the DataCleaner 2.4 milestone description in trac.

We hope you enjoy DataCleaner 2.4. We built it to be used, so  go grab it right away on the downloads page!

Cleanse your network-related data with DataCleaner network-tools

There's a new and nice extension ready for you at the ExtensionSwap:  The network tools extension.

Network tools can be used to work with IP addresses in data, resolve hostnames and more. Give it a look if you're dealing with network addresses (or eg. email addresses, website visitors etc.) in your data.

MetaModel 2.1 embraces MongoDB and expands XML features

MetaModel, the universal data access library that provides SQL-like querying capabilities to databases and other data formats alike, has just been released in version 2.1.

The 2.1 version of MetaModel is an exciting one. The primary achievement in this release has been to provide a mapping model for non-tabular datastores like the NoSQL database MongoDB and for XML files. This means that these two data formats, which previously required custom conversion and custom query implementations, can now be queried (and in MongoDB's case also modified) in a standard fashion. For both MongoDB and XML files you have a choice of either letting MetaModel autodetect a table model (which may not be perfect, but is good to begin with) or specifying your own table definitions and letting MetaModel figure out the rest.

The 2.1 release also features a few bugfixes to the previous 2.0.2 release.

You can read all about  what is new in 2.1 on the MetaModel website.

MetaModel can be downloaded as an independent distributable or as a Maven-style artifact for projects that use that.

We hope you like the new release. Please let us know of your experiences, either on the MetaModel forum or  Google group.

MetaModel 2.0.2 improves performance, fixes bugs, adds MongoDB support

Dear everybody,

We're happy to tell you that  MetaModel version 2.0.2 has been released. This release is a minor release, but even so it contains a few goodies worth noting:

  • The Excel adapter now uses the new Streaming API in Apache POI, which should mean that support for very large Excel spreadsheets just got a lot better.
  • A bug was fixed, which caused CSV writing not to respect the separator and quote char defined for the file format.
  • Performance improved in query postprocessing by applying sub-selections just-in-time, instead of ahead of time.
  • We've added a new experimental adaptor for MongoDB databases. The adaptor supports querying MongoDB using the well known MetaModel query API. Since MongoDB doesn't have schema definitions, you will have to define the schema yourself though.

To get MetaModel, simply use this Maven dependency:

<dependency>
    <groupId>org.eobjects.metamodel</groupId>
    <artifactId>MetaModel-full</artifactId>
    <version>2.0.2</version>
</dependency>

Thanks to everyone who contributed! We hope you enjoy the new release of MetaModel.

MetaModel 2.0.1 released

We're happy to announce that MetaModel 2.0.1 has been released!

This release of MetaModel is a maintenance release which adds minor enhancements to the newly released 2.0 version. The highlights of the update are:

  • When writing CSV files in the UTF-8 encoding, MetaModel will automatically add a byte-order-mark (BOM) which will make tool support for reading the files (eg. in Excel) much better.
  • The LazyRef type has a new method for asynchronously loading the reference in a separate thread. This is nice if you know that you will need the reference, but don't need it instantly - then you can lazy load it asynchronously.
  • Additional testing and minor tweaking has gone into the framework. In particular we at Human Inference have been testing MetaModel deployments on non-Sun/Oracle Java implementations, like IBM Java on AIX and other platforms.
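For readers unfamiliar with the byte-order-mark mentioned in the first item above: in UTF-8 it is the three bytes EF BB BF placed at the very start of the file, which tools like Excel use as a hint about the encoding. The sketch below shows the idea in plain Java. It is not MetaModel's CSV writer; the class Utf8BomCsvSketch and the method toCsvBytes are hypothetical, and values are joined without any quoting or escaping to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not MetaModel's CSV writer.
class Utf8BomCsvSketch {

    // The UTF-8 encoding of the byte-order-mark character U+FEFF.
    static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    // Writes the BOM first, then the rows as comma-separated lines (no quoting).
    static byte[] toCsvBytes(String[][] rows) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(UTF8_BOM, 0, UTF8_BOM.length);
        try {
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            for (String[] row : rows) {
                writer.write(String.join(",", row));
                writer.write("\n");
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = toCsvBytes(new String[][] { { "id", "name" }, { "1", "S\u00f8ren" } });
        System.out.printf("First bytes: %02X %02X %02X%n", bytes[0], bytes[1], bytes[2]);
        System.out.println("Total bytes: " + bytes.length);
    }
}
```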

We hope you enjoy the new release of MetaModel.

MetaModel 2.0 has been released!

It is with great pride that we can now say that MetaModel 2.0 has been released! MetaModel is a library that encapsulates the differences and enhances the capabilities of different datastores, such as databases, CSV files, Excel spreadsheets and a lot of other data formats!

Until now MetaModel has always been a read-only library, allowing you to query all sorts of datastores. But with the new 2.0 release we've extended the API with a capability to write data to the datastores as well! MetaModel continues to have a robust and type-safe API for data creation, just as is the case with querying. This means that with MetaModel you can design very dynamic applications, while still relying on statically typed data operations and the comfort and safety that this brings to execution. In short, MetaModel is "write once, run anywhere" for datastores.

Read more about the new release here:

As always,  MetaModel is available both as a zip/tar.gz download and as a Maven artifact. Here's the dependency entry for your POM:

<dependency>
    <groupId>org.eobjects.metamodel</groupId>
    <artifactId>MetaModel-full</artifactId>
    <version>2.0</version>
</dependency>

We hope you enjoy working with MetaModel. Please let us know your feedback at our online forum or the  MetaModel mailing list.

Check out the Regex Parser extension!

As stated earlier, Human Inference is dedicated to delivering a rich set of extensions to the DataCleaner community, and we are also seeing third-party interest in contributing to the ExtensionSwap.

Today we've published a new extension which many DataCleaner users will hopefully find useful: The  Regex parser.

With this extension you can easily implement your own parsing logic around regular expressions. The idea is that you use a regular expression to identify groups in your strings. These substring groups are extracted from the original value and isolated so you can process them individually. A quite nice application of DataCleaner's transformer mechanism!
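The underlying idea of extracting regular expression groups can be shown with the standard java.util.regex classes. The sketch below is not the extension's source code; the class RegexGroupSketch and the method extractGroups are hypothetical names, and the date pattern is just an example input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the Regex parser extension itself.
class RegexGroupSketch {

    // Returns the capturing groups of the pattern if the whole value matches,
    // or an empty list if it does not.
    static List<String> extractGroups(Pattern pattern, String value) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = pattern.matcher(value);
        if (matcher.matches()) {
            // Group 0 is the entire match, so the extracted parts start at 1.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                groups.add(matcher.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        Pattern datePattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
        System.out.println(extractGroups(datePattern, "2011-12-24"));
        System.out.println(extractGroups(datePattern, "not a date"));
    }
}
```

Each extracted group can then be treated as its own output column and processed individually.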

For more information on how to create your own extensions, please refer to the  DataCleaner develop page.

SassyReader 0.2 hits the streets!

We've just released version 0.2 of  SassyReader!

This second release of our newest eobjects.org project brings a few minor improvements and updated compatibility with latest MetaModel versions:

  • Improved testing and capability to read latin diacritic and special characters (such as ä, â, é, æ, ø, å, © and more) in SAS datasets.
  • Added compatibility with  MetaModel 2.0-RC1, which was released just a few days ago.

You can grab the new release of SassyReader in the maven repos and read about it on the  SassyReader website!

MetaModel 2.0 release candidate 1

We've released the first release candidate for  MetaModel version 2.0. The big news in MetaModel 2.0 is that we are now going to make it possible to write data back to the datastores. Until now MetaModel has been a read-only exploration and querying framework, but we are expanding the scope quite a lot, and thus also bumping the major version number!

You can grab MetaModel 2.0 RC1 from the central Maven repo:

<dependency>
    <groupId>org.eobjects.metamodel</groupId>
    <artifactId>MetaModel-full</artifactId>
    <version>2.0-RC1</version>
</dependency>

There are a few new resources for those wanting to look into the new data creation API:

We hope you like the new 2.0 release candidate. Please provide your feedback asap, so we have time to adjust the API and the framework before a final release.

MetaModel 1.7.5 adds support for large Excel spreadsheets

Version 1.7.5 of our data access framework, MetaModel, has just been released and is making its way into the central Maven repos as we write this announcement.

This release contains improvements primarily in the MetaModel-excel module:

  • Memory consumption when reading OOXML-based Excel spreadsheets has been drastically reduced. Previously there would be a huge overhead caused by DOM parsing of the XML documents in a spreadsheet. The Excel module has been rewritten to take advantage of Apache POI's SAX event API for reading Excel spreadsheets. This means that there will be little to no memory overhead when reading the data from an Excel spreadsheet. Benchmarks show a decrease in memory consumed to about 1/5 of the previous amount.
  • An issue has been fixed which caused the Excel module to throw an unexpected exception when reading formulas which had erroneous self-referencing expressions.
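The difference between DOM and SAX parsing described in the list above is that a SAX parser pushes events to a handler as it reads, so the document never has to be held in memory as a tree. The sketch below shows this with the JDK's own SAX parser on a small XML string. It does not use Apache POI and is not the MetaModel-excel code; the class SaxRowCountSketch and the element names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; uses the JDK SAX parser, not Apache POI.
class SaxRowCountSketch {

    // Counts elements with the given name while streaming through the document.
    // Only the counter is kept in memory, never the whole XML tree.
    static int countElements(String xml, String elementName) {
        final int[] count = { 0 };
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                if (elementName.equals(qName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<sheet><row><c>1</c></row><row><c>2</c></row></sheet>";
        System.out.println("Rows: " + countElements(xml, "row"));
    }
}
```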

From a library consuming perspective, there are no breaking changes. This is a drop-in replacement for previous 1.7.x releases.

We hope you enjoy the new release. If you are using the MetaModel-excel module, we strongly encourage you to upgrade to version 1.7.5. Please go to the  downloads page for more information.

MetaModel 1.7.4 released

We're happy to announce the release of  MetaModel 1.7.4. The new version of MetaModel contains a critical bugfix and a minor improvement to functionality:

  • The critical bugfix pertained to an unsafe logging scenario where logging would potentially cause a stack overflow. This was because the hashCode() method of BaseObject contained a debug-logging message that would print the objects own toString() method which, unless modified, would include yet another call to hashCode(). The bug has been fixed by overriding the default toString() method of Object and using System.identityHashCode(...) instead of hashCode().
  • There's also a nice improvement in version 1.7.4: We've added support for fixed width files with individual column widths. What this means is that you can now read a file where each line has a fixed width, but the individual columns of the file have varying widths. This was not possible before, where all columns had to have equal widths, which was a bit restrictive.
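The logging bug described in the first item above is easy to reproduce in any Java class hierarchy, because Object's default toString() itself calls hashCode(). The sketch below shows the shape of the fix under stated assumptions: the class names are hypothetical stand-ins rather than MetaModel's BaseObject, a simple list replaces the real debug logger, and equals() is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; class names are not MetaModel's.
class SafeLoggingSketch {

    // Stands in for a debug logger so the example needs no logging library.
    static final List<String> LOG = new ArrayList<>();

    static abstract class Base {

        // The values that define the object's identity for hashing purposes.
        protected abstract Object[] identifiers();

        @Override
        public int hashCode() {
            int hash = Arrays.hashCode(identifiers());
            // Building this message calls toString(). Object's default toString()
            // calls hashCode(), which would recurse until the stack overflows.
            LOG.add("hashCode of " + this + " = " + hash);
            return hash;
        }

        // Breaks the cycle: identityHashCode never calls the overridden hashCode().
        @Override
        public String toString() {
            return getClass().getSimpleName() + "@"
                    + Integer.toHexString(System.identityHashCode(this));
        }
    }

    static final class Column extends Base {
        private final String name;

        Column(String name) {
            this.name = name;
        }

        @Override
        protected Object[] identifiers() {
            return new Object[] { name };
        }
    }

    public static void main(String[] args) {
        Column column = new Column("id");
        System.out.println(column.hashCode());
        System.out.println(LOG);
    }
}
```

Removing the toString() override from Base in this sketch brings the stack overflow back, which is a quick way to see why the fix works.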

All in all we're happy with the new release. It is available in the Central Maven repositories, simply use the following snippet in your POM:

<dependency>
  <groupId>org.eobjects.metamodel</groupId>
  <artifactId>MetaModel-full</artifactId>
  <version>1.7.4</version>
</dependency>

The release is fully backwards compatible with the previous version 1.7.3.

Head on over to the  MetaModel website for more information.

MetaModel 1.7.3 released

We've released  MetaModel version 1.7.3 today! This release is a minor bugfix release which continues to stabilize and build a firm foundation for this data access framework.

The improvements added in version 1.7.3 are:

  • Handling of errors in Excel formulas. If an invalid calculation is made in a formula, MetaModel will simply return the formula as a string.
  • Handling of invalid Excel formulas. Similarly if a formula in excel is simply invalid (eg. using symbols that do not exist), MetaModel will simply return the formula as a string.
  • Improved handling of fetch size calculation for JDBC queries. In particular some queries can be identified as only returning a single row (eg. COUNT(*) queries and the like). Such queries will get an appropriate fetch size of 1.
  • All Javadoc API Documentation warnings were fixed.

We hope you will enjoy the new version of  MetaModel!

DataCleaner 2.1.1 is here!

Another release of  DataCleaner sees the light of day today! Although this is not a major release, but a minor one, it does ship some quite nice stabilizing improvements and minor enhancements to the UI.

Enhancements in 2.1.1:

  • Added a search/filtering text field on the datastores list. This enables you to quickly find your datastore if you have registered more datastores than available on the screen.
  • Reference data for country codes was added to the standard distribution; thanks go to  Graham Rhind for providing these.
  • Added a horizontal scroll bar to the data previewing windows if there are more than 10 columns.
  • Ability to add an extension package with new functionality in the Options dialog at runtime. More focus on extensions will follow in the upcoming releases.
  • We've exposed an early preview of our Command-Line Interface (CLI) by allowing you to invoke the application with the "-usage" parameter which will show the CLI options.
  • Added number formatting options to the "Convert to Number" transformer.

Bugfixes in 2.1.1:

  • Fixed an out-of-memory issue when querying tables with a LOT of columns (150+).
  • Fixed an issue that caused the "Limit analysis" check box to not be checked correctly when a job was re-opened after saving.
  • Not really a bugfix as it was never an official feature, but now we support restoring user preferences (the userpreferences.dat file) from previous versions of DataCleaner.

Thanks to everyone involved in the making of this release of DataCleaner.

DataCleaner 2.1.1 is available as a traditional download or as a Java Web Start application on  the downloads page. Keep in touch with your  feedback to the application on  the forums.

MetaModel 1.7.1 released

Due to the attention that we've received from our release earlier this week, we've quickly collected a few improvements to MetaModel 1.7 that were small but valuable. This is why we are today releasing MetaModel 1.7.1!

The release contains these new items:

  • Various minor improvements in the  API Documentation.
  • Fixed a minor bug that occurred when CSV headers were configured to be read from a nonexistent line number.
  • Added the capability to calculate an appropriate FETCH_SIZE for JDBC/database queries. This feature allows for better memory management when used with databases that take advantage of eager buffering.
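
As a rough illustration of how a fetch size could be derived from a memory budget, here is a minimal sketch. The class, the method, the bounds and the idea of dividing a byte budget by an estimated row size are all assumptions made for this example; the actual calculation in MetaModel may differ. The resulting value is the kind of number one would pass to the standard JDBC method Statement.setFetchSize(int).

```java
// Hypothetical heuristic; not MetaModel's actual fetch size calculation.
final class FetchSizeSketch {
    static final int MIN_FETCH_SIZE = 1;
    static final int MAX_FETCH_SIZE = 25_000; // arbitrary upper bound for this sketch

    private FetchSizeSketch() {
    }

    /**
     * Returns how many rows fit into the given memory budget,
     * clamped to the range [MIN_FETCH_SIZE, MAX_FETCH_SIZE].
     */
    static int calculateFetchSize(long memoryBudgetBytes, long estimatedRowBytes) {
        if (memoryBudgetBytes <= 0 || estimatedRowBytes <= 0) {
            throw new IllegalArgumentException("budget and row size must be positive");
        }
        long rows = memoryBudgetBytes / estimatedRowBytes;
        return (int) Math.max(MIN_FETCH_SIZE, Math.min(MAX_FETCH_SIZE, rows));
    }
}
```

With a budget of 1,000,000 bytes and rows estimated at 1,000 bytes each, this sketch yields a fetch size of 1000.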

Please turn to  the MetaModel website for downloads!

MetaModel 1.7 released

Today is the day that  MetaModel version 1.7 has been released! The focus of the new version was to bring additional configurability and handling of special corner cases into the framework. In addition we've also improved performance and fixed a few minor issues.

For a full list of new stuff, take a look at the  What's new in MetaModel 1.7? page.

As always you can  download MetaModel as a distributable or get it from the Maven repositories:

<dependency>
  <groupId>org.eobjects.metamodel</groupId>
  <artifactId>MetaModel-full</artifactId>
  <version>1.7</version>
</dependency>

We look forward to getting your reactions and feedback. Please let us know if you use MetaModel and we will also be happy to add you to the  list of projects that use it.

DataCleaner 2.1 adds charts, stoppable jobs, database drivers and unifies the UI

We're happy to announce the release of  DataCleaner 2.1! This is a quite significant release and something that we hope users will recognize as a step forward from the 2.0 versions.

The major news in DataCleaner 2.1 are:

  • There was a lot of work done on the user interface (see  media page):
    • We decided to remove the left-hand side window containing environment configuration options.
    • Instead all these options have now been moved to the job building window so the user only has to focus on a single window for all the interactions needed to build a job.
    • The welcome/login dialog has also been removed in favor of a more discrete panel that can be pulled in or hidden from the main window.
    • Datastore selection and management is considered the first activity in the application, which is why it is also the first step to handle in the main window.
  • You can now stop jobs in case you decide to change something before they are done.
  • Bar and line charts were added to a lot of the analysis result screens, including String analyzer, Number analyzer, Date/time analyzer and Weekday distribution (see  media page).
  • All "preview data" windows now contain paging controls so you can move backwards and forwards in the data set.
  • Most common database drivers (MySQL, PostgreSQL, Oracle, MS SQL Server and Sybase) have been added to a default set of drivers.
  • Configuration of the Quick analysis function in the Options dialog.
  • Various minor bugfixes.
  • Transformer for extracting date parts (year, month, day etc.) from date columns.

We hope you enjoy DataCleaner 2.1. Please head over to the  downloads page to get it!

MetaModel 1.6 released

We're happy to announce today the release of MetaModel 1.6.

The new version of  MetaModel has three focus points:

  • A new datastore type has been added: Fixed width value files. This enables MetaModel to read flat files where every value has the same length (ie. not separated as such, but formatted in character columns).
  • Full support for paging queries. The query interface now has a setFirstRow(int) method, which in combination with the existing setMaxRows(int) method allows for paging and finer grained control over the resulting data sets.
  • Bugfixes pertaining to DB2 support: Added specific dialect support to ensure that queries for DB2 are formatted correctly, especially with regard to fully qualified schema names in queries.
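
The paging semantics of combining a first row with a maximum number of rows can be sketched on a plain in-memory list. This is not MetaModel code: the helper below is hypothetical, and it assumes that the first row is counted from 1, which the announcement does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mimicking "first row" plus "max rows" paging on a list.
final class PagingSketch {
    private PagingSketch() {
    }

    /**
     * @param firstRow 1-based index of the first row to return (assumption)
     * @param maxRows  maximum number of rows in the page
     */
    static <T> List<T> page(List<T> rows, int firstRow, int maxRows) {
        if (firstRow < 1 || maxRows < 0) {
            throw new IllegalArgumentException("firstRow must be >= 1 and maxRows >= 0");
        }
        int from = Math.min(firstRow - 1, rows.size());
        int to = (int) Math.min((long) from + maxRows, rows.size());
        // Copy the sub-list so the page is independent of the source list.
        return new ArrayList<>(rows.subList(from, to));
    }
}
```

Under these assumptions, the third page of a result with a page size of 10 would be requested with a first row of 21 and a maximum of 10 rows.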

You can get the new MetaModel at our  downloads page or through the central Maven repositories:

<dependency>
  <groupId>org.eobjects.metamodel</groupId>
  <artifactId>MetaModel-full</artifactId>
  <version>1.6</version>
</dependency>

We hope you enjoy MetaModel - please let us know and provide feedback at the forums.

DataCleaner 2.0.1 released

Since the release of DataCleaner 2.0, we've seen a renewed interest and a lot of activity around eobjects.org, DataCleaner and Human Inference. We're happy to get all this valuable feedback, and it has also meant that there was some low-hanging fruit, as well as a few very minor bugs, that we could easily address in the existing DataCleaner 2.0 release. This is why, already a week after 2.0 was released, we're releasing an update: 2.0.1.

The update consists of minor updates:

  • Filter outcomes were added to the flow visualization.
  • A bug was fixed in the widget for selecting the tokenizer's separators.
  • The "Equals" filter can now have multiple values to compare with.
  • Some minor cosmetic improvements.

For more detail, take a look at the milestone contents at Trac.

DataCleaner 2.0.1 is available at the  downloads page and the update has also been automatically applied to our Java Web Start users.

Watch out, dirty data! DataCleaner 2.0 is in town!

The Open Source software community eobjects.org is happy to announce the release of DataCleaner 2.0. This release marks the biggest advance in technology and features for the DataCleaner platform throughout the history of the project.

Amongst exciting new features in DataCleaner 2.0 are:

  • Data transformations, allowing you to preprocess, extract, refine, combine and calculate data items as a part of your data profiling jobs.
  • Filtering, sampling and subflow management, allowing you to define criteria to exclude and include particular items of data.
  • Richer reporting with charts, graphs, navigation trees and more.
  • A bunch of new data quality functions for date gap analysis, phonetic similarity finding, synonym lookups and more.
  • More configuration options and added data quality measures for existing data quality functions like the Pattern finder, String analyzer and more.
  • Reusable profiling jobs, where you define your processing flow once and consequently run it on any data.
  • Support for MS Excel 2007+ spreadsheets.

For more information about what’s new in DataCleaner 2.0, see the  full list of new features in DataCleaner 2.0.

Today it was also announced that Human Inference, the European data quality authority, has finished its acquisition of the eobjects.org site, to actively enter the market for entry-level Open Source data quality products. All projects on eobjects.org will remain open source, and the benefits for the community and the products are apparent. The release of DataCleaner 2.0 is the first visible outcome of the acquisition, resulting from several months of intense cooperation between Human Inference and the community members to put together a state-of-the-art data profiling application.

For more information about the eobjects.org acquisition, see the  press release on the Human Inference website.

Times are really exciting in the eobjects.org community these days. We hope you’re all as enthusiastic about the new DataCleaner 2.0 as we are. The application is ready for download and for immediate launch through Java Web Start, so visit the  DataCleaner website now.

MetaModel 1.5 released. Unify your view on all datastores

MetaModel 1.5, an Open Source Java framework for accessing, exploring and querying different datastores using a unified API, has just been released. MetaModel provides a single view and a SQL/LINQ-like query engine for everything ranging from relational databases, CSV files, Excel spreadsheets and XML files to dBase (.dbf), MS Access (.mdb) and OpenOffice.org (.odb) databases.

The 1.5 release has been more than a year under way, including substantial  new features and enhancements. Three major themes influence the new features of the 1.5 release:

Improved datastore compliancy

In addition to the already extensive set of supported datastore types, the following new datastore features have been added:

  • Support for Excel 2007+ (.xlsx) spreadsheets has been added.
  • Composite datastores have been added, allowing you to define queries that span multiple datastores.
  • Excel formula calculation has been added.

Fluent Query Builder API

MetaModel 1.5 retains the existing Querying API, which is extremely flexible but also complex, and therefore quite easy to make mistakes with. MetaModel 1.5 also adds a new layer of abstraction on top of it: the Query Builder API. With the Query Builder API you can define queries in an easier, safer and more elegant way. The goal of the Query Builder API is to leverage the compiler as far as possible when expressing queries.

An example demonstrates it quite well:

DataContext dc = DataContextFactory.create[your_datastore_type]DataContext (...);
Query q = dc.query()
   .from(projects).selectCount().and(community)
   .where(license).equals("oss")
   .groupBy(community).toQuery();

Interfaces and immutability

Instead of the previous JavaBeans based API, the 1.5 release includes interfaces for just about everything in the library. This means that it is as of now easier to test, integrate and deploy MetaModel. It also allows for better encapsulation internally as well as improved safety by exposing only immutable variants of the data structures (like Table, Schema, Column etc.) that are modifiable only by the framework.
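
The general idea of exposing a read-only interface with both mutable and immutable implementations can be sketched as follows. The type names ColumnInfo, MutableColumnInfo and ImmutableColumnInfo are hypothetical and chosen for illustration only; they are not the names used in the MetaModel library.

```java
// Read-only view that client code programs against (hypothetical name).
interface ColumnInfo {
    String getName();
}

// Mutable variant, intended to be modified only by the framework itself.
final class MutableColumnInfo implements ColumnInfo {
    private String name;

    MutableColumnInfo(String name) {
        this.name = name;
    }

    void setName(String name) {
        this.name = name;
    }

    @Override
    public String getName() {
        return name;
    }
}

// Immutable variant handed out to clients: a snapshot that can never change.
final class ImmutableColumnInfo implements ColumnInfo {
    private final String name;

    ImmutableColumnInfo(ColumnInfo original) {
        this.name = original.getName();
    }

    @Override
    public String getName() {
        return name;
    }
}
```

Because clients only see the interface, a test can substitute its own implementation, and later changes to the mutable object do not leak into snapshots already handed out.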

Today it was also announced that Human Inference, the European data quality authority, has finished its acquisition of the eobjects.org site, to actively enter the market for entry-level Open Source data oriented applications. All projects on eobjects.org, including MetaModel, will remain Open Source, but will be heavily reinforced by the time and resources that Human Inference is investing in these projects.

For more information about the eobjects.org acquisition, see the  press release on the Human Inference website.

MetaModel is already in use in a lot of projects, including the  DataCleaner data analysis/profiling application and  Quipu, the data warehouse generator. It is also in Human Inference’s plans to expand the usage of MetaModel into their enterprise-grade data matching and deduplication applications. If you think MetaModel 1.5 sounds interesting,  head over to the website to learn more. MetaModel is available as a  Maven artifact or as a traditional  download at Google code.

MetaModel 1.5 release candidate 4 is out

As mentioned earlier, MetaModel 1.5 is almost done, and today we take one of the final steps towards the release: what looks like the last release candidate of 1.5, Release candidate 4 (RC4). Grab it while it's hot in the Maven repositories:

<dependency>
 <groupId>org.eobjects.metamodel</groupId>
 <artifactId>MetaModel-full</artifactId>
 <version>1.5-RC4</version>
</dependency>

RC4 improves on the previous release candidates with a few minor, but important, tweaks to the framework.

We are now just looking forward to the release of MetaModel 1.5 final. If you experience any issues or have any feedback on RC4, let us know ASAP.

MetaModel 1.5 release candidate(s) released

During the last couple of weeks a lot of attention has gone into the next major version of MetaModel. Two release candidates have been released so far. Get the latest one from the Maven repositories and tell us what you think:

<dependency>
 <groupId>org.eobjects.metamodel</groupId>
 <artifactId>MetaModel-full</artifactId>
 <version>1.5-RC2</version>
</dependency>

(notice the change of groupId to org.eobjects.metamodel).

The major changes in MetaModel 1.5 are:

  • Interfaces have been introduced for the types that are used throughout the library: DataContext, Schema, Table, Column, Relationship
  • Both mutable and immutable implementations of these types are offered.
  • A new query builder API is provided directly through the DataContext interface. The builder API will make sure that your queries are correctly built and makes it a lot easier to compose the queries correctly with auto-completion in your IDE because of a flexible set of interfaces for the various stages of building a query. Some examples are:
    DataContext dc = ...;
    
    Query q1 = dc.query().from(table).select(column1)
         .where(column2).equals("hello")
         .or(column3).isNull()
         .orderBy(column1).asc().toQuery();
    
    Query q2 = dc.query().from(table)
         .select(column1).selectCount()
         .groupBy(column1).toQuery();
    
  • Lots of minor bugfixes and improvements to performance.

We anticipate a final release of MetaModel 1.5 soon. Let us know what you think of the release candidates so that we can adjust if needed before the final release!

DataCleaner 1.5.4 released with dBase and MS Access support

Here it is:  DataCleaner 1.5.4 :)

Although this release is a minor release it contains a few exciting features and fixes:

  • We've updated the MetaModel version to 1.2 which adds support for two new datastores:
    • dBase databases (.dbf files)
    • MS Access databases (.mdb files)
  • We've fixed a bug pertaining to text-file dictionary "file not found" errors.
  • A lot of the other underlying libraries have been updated, providing improvements to performance and stability.

Head on over to the  downloads page to grab the new DataCleaner.

MetaModel 1.2 introduces cross-datastore querying and MS Access and dBase support

We're happy to present a new version of the wonderful  MetaModel component. This version adds a radical new feature: Cross-datastore querying, which means that you can now execute queries that span multiple datastores (ie. with transparent client-side joining, filtering, grouping etc.). You can check out a simple example of this  at kasper's source (blog).
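
To give an idea of what client-side joining involves, here is a minimal sketch of an inner hash join over rows that have already been read from two different sources. It is an illustration only and not MetaModel's implementation; the class name, the use of Object[] rows and the choice of a hash join are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a client-side inner join; not MetaModel's implementation.
final class ClientSideJoinSketch {
    private ClientSideJoinSketch() {
    }

    static List<Object[]> innerJoin(List<Object[]> left, int leftKey,
                                    List<Object[]> right, int rightKey) {
        // Index the right-hand rows by their join key.
        Map<Object, List<Object[]>> index = new HashMap<>();
        for (Object[] row : right) {
            Object key = row[rightKey];
            if (key != null) { // like SQL, null keys never match
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
            }
        }
        // Probe the index with every left-hand row and concatenate the matches.
        List<Object[]> result = new ArrayList<>();
        for (Object[] leftRow : left) {
            Object key = leftRow[leftKey];
            List<Object[]> matches = (key == null) ? null : index.get(key);
            if (matches == null) {
                continue;
            }
            for (Object[] rightRow : matches) {
                Object[] joined = new Object[leftRow.length + rightRow.length];
                System.arraycopy(leftRow, 0, joined, 0, leftRow.length);
                System.arraycopy(rightRow, 0, joined, leftRow.length, rightRow.length);
                result.add(joined);
            }
        }
        return result;
    }
}
```

In this sketch the left rows could come from a CSV file and the right rows from a database; once both are in memory, the join no longer depends on either source.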

Version 1.2 also adds support for two long-awaited datastores: Microsoft Access databases and dBase databases. Access support is implemented for MetaModel with a core based on the  Jackcess project. MetaModel's dBase support is based on a derivative of  xBaseJ, courtesy of xBaseJ, American Coders and Joe McVerry.

To look into MetaModel 1.2, here are the crucial resources:

  • Downloadables at  google code.
  • Javadocs  available online.
  • Maven-support out of the box:
    <dependency>
      <groupId>dk.eobjects.metamodel</groupId>
      <artifactId>MetaModel-full</artifactId>
      <version>1.2</version>
    </dependency>
    

With MetaModel 1.2 we're feature-complete with all of the 1.x features of the MetaModel-roadmap. We hope that you will find it to be as great and useful as we ever intended it to be!

DataCleaner 1.5.3 released

After much waiting, we are finally ready to release DataCleaner 1.5.3. Here's the wrap-up on what's been going on:

  • The MetaModel dependency has been upgraded to version 1.1.8, which means:
    • Improved Excel spreadsheet support
    • Improved SQL Server support
    • Improved performance for CSV files
  • Fixed a bug that caused certain database connection errors to be ignored in terms of user feedback.
  • Fixed a bug that caused re-opening of database dictionaries to throw a NullPointerException.
  • Fixed a bug related to dictionary lookups of null values.
  • Added support for Teradata databases.
  • Added connection templates for SQL Server connections.
  • Added support for selection of custom encodings when reading CSV files.
  • Fixed a minor bug relating to reading files on the classpath when running in Java WebStart mode (which manifested in an exception thrown when clicking on "About DataCleaner").

So as you can see, it's been a mix of minor bugfixes and a couple of improvements to compatibility and performance regarding certain datastores. We hope you enjoy this new release of DataCleaner. As always, you can ...

Let us know what you think!

MetaModel 1.1.8 adds better SQL Server support

I'm happy to announce the release of MetaModel 1.1.8.

This release is a minor release with updates only relating to MS SQL Server. The changes are, however, profound in this regard. Microsoft SQL Server JDBC drivers are known to be quirky when it comes to metadata exploration, and we are happy to say that MetaModel now addresses these issues. So if you're an MS SQL Server user, you should be sure to get the latest version of MetaModel!

MetaModel is as always available at the following locations:

  • Downloadables at  google code.
  • Javadocs  available online.
  • Maven-support out of the box:
    <dependency>
     <groupId>dk.eobjects.metamodel</groupId>
     <artifactId>MetaModel-full</artifactId>
     <version>1.1.8</version>
    </dependency>
    

We hope you're all satisfied with the improvements in this release, and don't hesitate to give us any feedback.

New book on Open Source Business Intelligence tells the DataCleaner-story

About half a year ago we received an exciting inquiry from Jos van Dongen on behalf of him and his co-author Roland Bouman, telling us that they were writing a new book about Open Source Business Intelligence and in particular Pentaho-based solutions. And for this they were looking into DataCleaner for the data profiling section of the book!

The book is now out! It's called "Pentaho Solutions" and it's published by Wiley Publishing. You can read  about it and buy it on their website as well.

The book contains a walkthrough for building a data warehouse using Open Source tools and, in doing so, applying DataCleaner for the important job of profiling and validation.

We congratulate Roland Bouman and Jos van Dongen for their great work to promote Open Source Business Intelligence and thank them for mentioning DataCleaner while they're at it!

Explore and query all your datastores with MetaModel 1.1.7

We're pleased to announce the release of MetaModel 1.1.7. The major change from our latest release is the introduction of two important improvements:

  • Microsoft SQL Server is finally supported and integration tests have been added to our portfolio of tests of supported databases. Thank you to Asbjørn Leeth for the major contributions to this feature.
  • We've added an option to configure the character encoding for opening CSV files.

With the addition of these two improvements we think that we've added some significant "drops in the ocean" on our way of becoming the most comprehensive and advanced framework for object-oriented querying and datastore-independent schema exploration.

If you use Maven, update your dependencies to the following:

<dependency>
 <groupId>dk.eobjects.metamodel</groupId>
 <artifactId>MetaModel-full</artifactId>
 <version>1.1.7</version>
</dependency>

... or if you don't, head on over to our  download site at Google Code and download a copy of the release.

eobjects.org announces Open Source data quality with DataCleaner 1.5.2

Dear DataCleaner users,

We are happy to announce the release of  DataCleaner 1.5.2. Users of DataCleaner 1.5.0 or 1.5.1 won't be able to see a lot of changes in the user interface, but this release actually holds quite a lot of improvements “beneath the surface”:

  • The most notable improvement is in the Value Distribution Profile. Previously this profile consumed quite a lot of memory, which could lead to out-of-memory errors in extreme cases. This has been fixed by using on-disk caching with Berkeley DB when necessary.
  • Another notable feature is that we can now distribute DataCleaner as a single JAR file. This means that we will be serving the application as a Java WebStart application (ie. run it as if it's an online application) and we are also considering other distribution options.
  • When starting the application, it automatically downloads regular expressions from the  RegexSwap.
  • A bug in regards to matching number-based columns in dictionaries was reported and fixed.
  • A bug in regards to invalid characters in XML-export formats was reported and fixed.
  • When opening files, we are now ignoring suffix case so that .CSV files can be opened as well as .csv.
  • The number of columns shown in the preview window is automatically restricted if there are too many to show on a single screen.
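
Ignoring suffix case when opening files, as mentioned in the list above, can be sketched in a few lines. The helper name is hypothetical and this is not DataCleaner's actual code.

```java
import java.util.Locale;

// Hypothetical helper; not DataCleaner's actual implementation.
final class FileSuffixSketch {
    private FileSuffixSketch() {
    }

    /** Returns true if the file name ends with the suffix, ignoring letter case. */
    static boolean hasSuffix(String filename, String suffix) {
        // Locale.ROOT avoids surprises from locale-specific case rules.
        return filename.toLowerCase(Locale.ROOT)
                .endsWith(suffix.toLowerCase(Locale.ROOT));
    }
}
```

With this check, both "customers.CSV" and "customers.csv" are recognized as CSV files.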

You can download DataCleaner from the  downloads page or you can use our new feature:  Get it via Java WebStart!

This release underlines the ongoing evolution of  DataCleaner to be a more and more professionally capable data profiler and data quality tool. Seeing that DataCleaner is  being used in large corporations worldwide, I wish to address some thoughts that I have been having and that I know users are pondering: How do you best combine the low adoption cost of Open Source applications like DataCleaner with the high flexibility that most commercial business software provides? To service this need we've opened up a new division of the company that I work with,  Lund&Bendsen. Whether you need to deploy DataCleaner to high-scale installations, integrate the applications with your existing systems, develop customized profiles and validation rules or satisfy other enterprise needs, we offer you first class services and in-depth expertise you won't find anywhere else.

To cut to the chase: DataCleaner 1.5.2 is here and we wish to extend the community development with a professional effort. So don't hesitate to let us know if you see an opportunity to invest. Adding value by targeting your use of the product is in the interest of customer, developer and community alike, and this is the reason our business is there.

To all you non-business users out there: Sorry for the obvious commercial rant and we hope you all enjoy the newest DataCleaner release.

Best regards,
Kasper Sørensen
Founder of  eobjects.org and the  DataCleaner project

MetaModel 1.1.6 released: Small changes, a bug fixed

We've released yet another version of MetaModel, namely version 1.1.6.

This release contains very few changes to the 1.1.5 release:

  • A convenience method was added to the Query class: select(FunctionType, Column).
  • Upgrading the Apache POI version in MetaModel introduced a few bugs that we did not discover in the 1.1.5 milestone. In 1.1.6 we fixed these bugs and significantly improved unit testing for this part of the code to prevent any new bugs from emerging.

We hope you enjoy this release, and we apologize for the hectic release schedule: the aforementioned bug fixes were critical, and we hope that you appreciate the quick response from the community.

eobjects.org announces the release of MetaModel 1.1.5

We have just released the newest version of MetaModel, 1.1.5. This release is a minor release which means no API changes, but a few upgrades in terms of performance, flexibility and ease of distribution (full list):

  • The most important upgrade has been to CSV performance. We encountered a bug when querying this type of datastore that meant that the whole DataSet was stored in memory while using it. This has undergone quite some refactoring so that it will now stream through memory as expected, thus keeping the door open for very large CSV files.
  • A minor change in the column naming scheme has been implemented for the Excel-based DataContexts. This means that if the first row of a spreadsheet contains only blank fields, we will automatically assign the names "[column 1]", "[column 2]" etc. accordingly.
  • The  downloadable zip or tar.gz file will now contain a "MetaModel-1.1.5-all.jar" file, which is an assembled jar file containing the classes of all MetaModel modules (core, csv, jdbc, excel etc.), which should substantially ease deployment of the framework.
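
The difference between holding a whole data set in memory and streaming through it can be sketched with plain Java I/O. The class below is hypothetical and not MetaModel code; it also assumes a naive comma split without support for quoted values. Only one line is held in memory at a time, so the size of the input does not affect memory use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical streaming sketch; assumes simple comma-separated lines without quoting.
final class StreamingCsvSketch {
    private StreamingCsvSketch() {
    }

    /** Counts the rows whose value in the given column equals the expected value. */
    static long countRowsWhere(Reader source, int columnIndex, String expected) {
        long matches = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // Each line is processed and then discarded; nothing is accumulated.
            while ((line = reader.readLine()) != null) {
                String[] values = line.split(",", -1);
                if (columnIndex < values.length && expected.equals(values[columnIndex])) {
                    matches++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }
}
```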

We hope you enjoy the new release of MetaModel and keep up the good work of providing the valuable feedback that drives development of it.

DataCleaner 1.5.1 released

We're happy to announce the release of DataCleaner version 1.5.1. This release is a minor release, nevertheless containing a few nice features - especially for the users who are enjoying the exporting features that were introduced in 1.5:

  • An additional HTML export format has been added to the built-in export formats (usable when exporting Profiler results in the desktop app and when executing the runjob command-line tool).
  • The export format is now choosable directly in the desktop app.
  • Four new measures were added to the String Analysis profile: avg. chars and max/min/avg white spaces.
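
The four measures named above can be computed as in the following sketch. It is an illustration under assumptions, not DataCleaner's String Analysis code: the class name is hypothetical, white space is detected with Character.isWhitespace, and null values are not handled.

```java
import java.util.List;

// Hypothetical sketch of the four measures; not DataCleaner's actual profile code.
final class StringMeasuresSketch {
    final double avgChars;
    final int minWhiteSpaces;
    final int maxWhiteSpaces;
    final double avgWhiteSpaces;

    private StringMeasuresSketch(double avgChars, int minWhiteSpaces,
                                 int maxWhiteSpaces, double avgWhiteSpaces) {
        this.avgChars = avgChars;
        this.minWhiteSpaces = minWhiteSpaces;
        this.maxWhiteSpaces = maxWhiteSpaces;
        this.avgWhiteSpaces = avgWhiteSpaces;
    }

    static StringMeasuresSketch of(List<String> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("at least one value is required");
        }
        long totalChars = 0;
        long totalWhiteSpaces = 0;
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (String value : values) {
            int whiteSpaces = 0;
            for (int i = 0; i < value.length(); i++) {
                if (Character.isWhitespace(value.charAt(i))) {
                    whiteSpaces++;
                }
            }
            totalChars += value.length();
            totalWhiteSpaces += whiteSpaces;
            min = Math.min(min, whiteSpaces);
            max = Math.max(max, whiteSpaces);
        }
        int count = values.size();
        return new StringMeasuresSketch((double) totalChars / count, min, max,
                (double) totalWhiteSpaces / count);
    }
}
```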

The new version of DataCleaner is (as always) downloadable for free on the  downloads page and feedback from users is also greatly appreciated, ie:

We hope that you all enjoy DataCleaner 1.5.1.

DataCleaner 1.5 released!

"Finally!" one might say. And this is definitely what is going through my head right as I write this news item. Finally, DataCleaner 1.5 has been released! Once again the effort to bring about the best open source data quality solution is bearing fruit.

The new release is definitely one of the most significant ones in the history of DataCleaner. The overall goal of the release has been to step up from the shadows of the "small tools" pool and mark DataCleaner as an enterprise-ready application for profiling and validating datastores of all kinds - both in scheduled mode, on servers and in an intuitive desktop environment.

For those of you with an interest in every little detail about this release, please feel free to review the complete list of changes - for everyone else, here's the recap:

  • Change of license to LGPL.
  • Multi-threaded execution of Profiler and Validator.
  • Command line (batch) execution of DataCleaner tasks.
  • More elaborate status information during profiler and validator execution.
  • New profile: Date mask matcher.
  • New profile: Regex matcher.
  • Load regex from the online RegexSwap repository.
  • Automatic download and install of popular database drivers.
  • More file types supported (.dat, .txt)
  • XML file support improved (.xml)
  • Memory improvements in Time analysis profile.
  • Improved logging when running profiling and validation.
  • Information schema provided for file-based datastores.
  • Lazy-loading of columns in datastore-tree.

We hope you enjoy the new DataCleaner 1.5! Now go over and  download it right away.

Data quality pro launches DataCleaner articles

Things are starting to shape up for the big release of DataCleaner 1.5. We are starting off with a bit of excitement around in the data quality community.

Probably the most dedicated online magazine about data quality,  data quality pro, has launched a series of articles about profiling, validating and comparing data with DataCleaner. So far an introductory tutorial (including a complete and realistic example data-set) and a background article/interview have been published:

We hope that you will enjoy the articles and we thank  data quality pro for their great interest in our community.

First commercial support company for DataCleaner and MetaModel

Today we are announcing the first company,  Lund&Bendsen, to officially support DataCleaner and MetaModel on a commercial level. These eobjects.org projects are, as you know, independent projects that are run with the community in mind. But as time goes on they grow and for companies to pick them up and start using them in a commercial setting we also welcome third party commercial support to help spread the projects to environments where community-based support is insufficient.

Lund&Bendsen is a Danish company with strong expertise in Java development and training. Their service offerings include training, customization, integration and enhancement of DataCleaner and MetaModel, so if your company is considering applying DataCleaner, you might be interested in hiring some professionals to aid you in the process.

Over time more companies are expected to join in on commercial support for the eobjects.org projects. Keep up to date on the  DataCleaner support page and don't hesitate to contact us for any inquiries in this regard either.

Independent analysis firm points at DataCleaner for open source data quality

The  Technology Evaluation Centers (TEC) have published an interesting, unbiased and independent analysis of the market for Open Source business intelligence products. We are delighted to see that the article features a section about data quality and that TEC points at DataCleaner as a competent choice among the open source products:

In such situations, where the vendor does not support a specific functionality,
organizations can look to complementary open source solutions; the DataCleaner
project from eobjects.org, for instance, provides functionality to help profile
data and monitor data quality. It also points to a significant advantage with
open source applications: the fact that software is developed by the community
and for the community makes it much simpler to share innovative solutions
quickly and seamlessly.

You can read  the whole article by Anna Mallikarjunan from TEC by going to their website (user registration is required).

Another release candidate (2) of DataCleaner 1.5 ready for download

Another batch of updates, fixes and improvements for the upcoming DataCleaner release is ready. This time it's Release Candidate 2 offering a preview of what's to come in DataCleaner 1.5.

The main changes since Release Candidate 1 are multithreaded execution, the command line interface (runjob.sh / runjob.cmd), some UI updates and a few bugfixes. Go download the release candidate and use it as an opportunity to influence the development process by posting your comments on  the DataCleaner forum.

Release Candidate 1 of DataCleaner 1.5 out

After working hard for a couple of days to implement substantial new features regarding integration of eobjects services and automatic download and install of popular database drivers, a new release candidate of DataCleaner is ready!

We hope that a lot of people will use the release candidate and provide feedback for further development towards the 1.5 final release.

A few screenshots of recent development

I've spent the last couple of days implementing a couple of cool enhancements to the DataCleaner desktop-application:

  • Automatic download and install of popular database drivers, along with template connection strings in the "Open database" dialog. This will hopefully make it much easier for less experienced users to set up a connection to their database of choice.
  • Direct integration with the new RegexSwap system so that the regexes that you post online will be accessible from within the desktop-application.

Screenshots have been posted to the media page.

Wait for DataCleaner 1.5 for these features or build it yourself to check them out now.

MetaModel 1.1.4 released

A new release of MetaModel is ready for download. The new version, 1.1.4, is a bug-fix release that fixes a critical issue for PostgreSQL databases. Other than that there are no changes from 1.1.3, so it should be a drop-in replacement.

Enjoy.

  • You can download an archived version
  • Or get it using maven:
    <dependency>
     <groupId>dk.eobjects.metamodel</groupId>
     <artifactId>MetaModel-full</artifactId>
     <version>1.1.4</version>
    </dependency>
    

DataCleaner launches new regex sharing subsite - RegexSwap

Only a few days after the launch of the new DataCleaner website, we are once again ready with exciting new features. This time we are launching the first edition of our new regular expression (regex) sharing subsite called "RegexSwap".

RegexSwap is a specialized forum for sharing, categorizing, commenting and voting on regular expressions that can be used in DataCleaner and other regex-based applications. It is really easy to post your own regular expressions, test them online on the website, and comment and vote on the regexes that you have found useful. In time, upcoming releases of DataCleaner will also take advantage of this online "always up to date" regex resource and offer direct integration with RegexSwap.

RegexSwap is still in beta but is ready at a functional level, which is why we are launching it publicly now. It will receive dedicated attention in the weeks and months to come.

A new website for DataCleaner

Dear everybody,

As a special Christmas present we have been working hard to design a new website for DataCleaner! Hopefully you will all enjoy the new site, which has been designed to further support our community and let it grow by incorporating more features to socialize and share ideas online. So go visit it now at the new URL:

Among the new features is a more personal profile system which is linked to some of the communities that our users already use frequently, namely LinkedIn and SourceForge. We have a whole new media section with cool screenshots and webcasts. We are also redesigning our mailing list structure. Instead of the single mailing list that we have been using so far, we are launching new "announcement" and "dev" mailing lists.

Our goal is to continuously launch new features on the website, the first one being a user survey to gain better insight into the minds of our users and community. So be sure to fill it out. In the future we will add more exciting features such as online sharing of regular expressions and reference data for DataCleaner dictionaries.

The old website will continue to exist, but primarily as a wiki and bugtracking system. During the next couple of days we will be editing the wiki pages to make them more suitable for wiki-style editing (by everyone) as opposed to the former read-only strategy.

We hope you like our Christmas present and that you will let us know, and we wish you all a great 2009. Without a doubt, it will bring exciting times for DataCleaner and the DataCleaner community.

Maven issues and MetaModel 1.1.3-FINAL

As we were recently made aware, we have once again messed up our Maven deployments of MetaModel, sorry! If you're using Maven for your Java projects and you just updated the <dependency> tag in your POM files, replacing the version entry "1.1.2" with "1.1.3", I'm sure you ran into a lot of ClassNotFoundExceptions, because the Maven artifacts were in fact empty! We are very sorry about this poor release management situation, but here are a couple of ways that we (you) can fix this:

  1. You can add the eobjects maven repository to your POM. The eobjects maven repository contains valid maven artifacts so that's quite an easy fix:
    <repositories>
      <repository>
        <id>eobjects-maven</id>
        <name>Eobjects repository for Maven</name>
        <url>http://datacleaner.sourceforge.net/m2-repo/</url>
      </repository>
    </repositories>
    
  2. You can wait a few hours, by which time the central Maven repo will have been updated with a couple of new artifacts carrying the "1.1.3-FINAL" version literal. Your new dependency will then look like this:
    <dependency>
       <groupId>dk.eobjects.metamodel</groupId>
       <artifactId>MetaModel-full</artifactId>
       <version>1.1.3-FINAL</version>
    </dependency>
    

MetaModel 1.1.3 released

We've just released MetaModel version 1.1.3. This is a stabilization release containing some microscopic bugfixes, specifically regarding Schema serialization. If you're currently using any 1.1.x release of MetaModel, then you should be able to do a drop-in replacement and expect no changes to your code.

As always MetaModel is available from our downloads page and through the maven repositories.

Unless anything urgent comes up this will be the last release of the 1.1 branch of MetaModel. The next focus of MetaModel 1.2 will be to include support for more datastore formats, including dBase and improved XML tag-to-table modelling.

And of course if you have any ideas for development, don't hesitate to let us know!

New eobjects hosts, return of continuous integration!

I'm happy to announce that eobjects.org has gotten new hosts and that the troubles we have been experiencing over the last couple of months due to weird server crashes are finally over! My final word on the matter is this: getting a large OSS-based J2EE environment up and running on a proprietary PowerPC platform is kind of a nasty affair! :-) So luckily we've found a better solution. This also means that we can once again say hello to our friend Hudson, the continuous integration system. While it is already online, I will be tweaking it in the days to come, so look out for periodic builds, test reports and all that stuff that we all love!

Update: After some initial problems cloning the old environment, I think we have finally ironed out all the small defects. So let's have a cheer for our new PostgreSQL server (humbly hosting the Trac system) and our new Hudson server:

Error in the maven repository version of MetaModel 1.1.1

We have identified a problem with the MetaModel 1.1.1 artifacts in the Maven repository, so we will be releasing MetaModel 1.1.2 very shortly. The problem was related to the upload process and caused the JARs in the repository to not contain any .class files! The downloadables from our website did not suffer from this problem, so if you're using those, you're OK.

The new maven artifacts can be downloaded using this dependency tag:

<dependency>
 <groupId>dk.eobjects.metamodel</groupId>
 <artifactId>MetaModel-full</artifactId>
 <version>1.1.2</version>
</dependency>

A single feature has been added in the 1.1.2 release: CSV and XML content is now accessible not only through files but through all kinds of input sources, including internet URLs.

Update: After some repository-synchronization waiting time, the 1.1.2 release has finally been submitted to the public Maven repositories!

Minor fix release of MetaModel

We've just released MetaModel 1.1.1, the successor to the major 1.1 release!

This release is a minor fix release and you should be able to make an easy drop-in replacement of the 1.1 release. Here are the three fixes/improvements that we have been working on for the update:

  • Minor bug fixed: The equals() method of SelectClause had a minor bug related to comparing the distinct property.
  • Improvement: The Column and Table classes have had a getQualifiedLabel() method added. The qualified label is a dot-separated qualified name such as "MY_SCHEMA.MY_TABLE.MY_COLUMN". The qualified label can be used as a unique identifier for the column but is not necessarily directly transferable to SQL syntax.
  • Improvement: Getters and setters have been added to the SelectItem class
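
To make the idea of a qualified label concrete, here is a minimal, self-contained sketch. It does not use MetaModel itself and is not how MetaModel implements getQualifiedLabel(); the class name QualifiedLabelSketch and the method qualifiedLabel are hypothetical, and skipping missing name parts is an assumption made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not MetaModel's implementation.
final class QualifiedLabelSketch {

    // Joins the name parts (for example schema, table, column) with dots.
    // Skipping null or empty parts is an assumption made for this sketch.
    static String qualifiedLabel(String... parts) {
        List<String> kept = new ArrayList<>();
        for (String part : parts) {
            if (part != null && !part.isEmpty()) {
                kept.add(part);
            }
        }
        return String.join(".", kept);
    }

    public static void main(String[] args) {
        System.out.println(qualifiedLabel("MY_SCHEMA", "MY_TABLE", "MY_COLUMN"));
    }
}
```

A label built this way works as a lookup key, but it is plain string concatenation with no quoting or escaping, which is one plausible reason why such a label cannot always be pasted directly into SQL.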

MetaModel 1.1 released!

As I write this news item I'm uploading the new version of MetaModel to the Maven repositories! So let me take the time to tell you what's new in this release.

First of all I'd like to say that this is really a release with lots of fundamental changes, and we have sacrificed backwards compatibility in some places, so be sure to check that everything is working exactly as before. That said, the things that we have changed will also cause you compilation problems, so if you do a drop-in replacement and your build fails, then it's because the features have changed. We think this is the easiest way for everybody to deal with changes: it's a lot more obvious that you need to do something if it's really keeping your application from working! The good thing is that the new MetaModel provides a lot of great improvements and new features!

Here's a sum-up of the changes made to MetaModel from version 1.0 to 1.1:

  • We've done a major restructuring of the project to make it more modular and easier to figure out.
  • This also means that the way you create DataContext objects has changed. In 1.0 you used the constructor of DataContext. This approach has been replaced by a factory class, which does all the instantiation and initialization stuff for you: DataContextFactory.
  • The MetaModel project is now LGPL licensed instead of using the Apache License version 2.0. For more info see MetaModelLicense.
  • The built-in query engine, "Query postprocessor", which is used to serve CSV, Excel and XML content, has gone through numerous improvements to performance and functionality.
  • Column types can now be detected, narrowed and transformed using the Query postprocessor engine. This means that you can use the engine to detect and retrieve Integer, Double, Date, Time and Boolean types as well as the old String-based values, even from text-only datastores such as CSV files.
  • The JDBC datastores now have a query rewriter component which allows for optimization of queries using native SQL-syntax.
  • Query postprocessor now also generates information schemas used to investigate metadata about CSV, Excel and XML files.
  • Database compliance has grown steadily during development and will continue to do so going forward. You can check the supported databases here: MetaModelCompliancy
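
The column type detection mentioned in the list above can be pictured with a small, self-contained sketch. This is not the Query postprocessor's actual algorithm and it does not use MetaModel classes; ColumnTypeSketch, DetectedType and detect are hypothetical names, and the sketch only covers Boolean, Integer and Double (Date and Time are left out to keep it short). The idea, as an assumption about how such narrowing could work: look at every string value in a column and pick the narrowest type that all non-blank values fit.

```java
import java.util.List;

// Hypothetical illustration of narrowing a text-only column to a type.
final class ColumnTypeSketch {

    enum DetectedType { BOOLEAN, INTEGER, DOUBLE, STRING }

    // Returns the narrowest type that every non-blank value can be parsed as.
    static DetectedType detect(List<String> values) {
        boolean allBoolean = true;
        boolean allInteger = true;
        boolean allDouble = true;
        boolean seenValue = false;
        for (String raw : values) {
            if (raw == null || raw.trim().isEmpty()) {
                continue; // blanks carry no type information in this sketch
            }
            seenValue = true;
            String v = raw.trim();
            allBoolean &= v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false");
            allInteger &= isInteger(v);
            allDouble &= isDouble(v);
        }
        if (!seenValue) {
            return DetectedType.STRING; // nothing to narrow on
        }
        if (allBoolean) {
            return DetectedType.BOOLEAN;
        }
        if (allInteger) {
            return DetectedType.INTEGER;
        }
        if (allDouble) {
            return DetectedType.DOUBLE;
        }
        return DetectedType.STRING;
    }

    private static boolean isInteger(String v) {
        try {
            Integer.parseInt(v);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Note: Double.parseDouble also accepts forms such as "1e3" and "NaN".
    private static boolean isDouble(String v) {
        try {
            Double.parseDouble(v);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

A single non-numeric value is enough to make the whole column fall back to String, which is the conservative choice for text-only sources such as CSV files.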

All in all I think this release marks a high degree of maturity for the MetaModel project and we're very proud to present it to you!

DataCleaner 1.5 "snapshot" released

As we're moving steadily along towards the release of DataCleaner 1.5 we are fixing a few bugs and enhancing a lot of features. This leads to the desire to release our work since practically nothing has undergone changes that could destabilize the application since the 1.4 release. So today we're releasing DataCleaner 1.5 "snapshot". This also marks the first release under our new LGPL license.

Here are the changes from 1.4 so far:

  • Change of license to LGPL.
  • New profile: Date mask matcher.
  • New profile: Regex matcher.
  • More file types supported (.dat, .txt)
  • XML file support improved (.xml)

Although this is in principle a development/beta release, we feel that it would be worth working with for most of your profiling needs. So... Go on, download it, tell us what you think and we'll see you around!

Eobjects announces change in preferred license

We've made a decision in principle at eobjects.org to change the preferred license of our projects from the Apache License 2.0 to the Lesser General Public License (LGPL).

The main difference between the two licenses is that the LGPL requires any modifications to be contributed back to the Open Source community (i.e. licensed under a similar license; LGPL or GPL). The eobjects.org projects gain the obvious advantages of the LGPL by ensuring that improvements are submitted back to the projects. This also means that we don't risk anyone selling modified versions of our projects. It is still just as appropriate to use the projects as a part of commercial applications, but any modifications must be contributed back to the community.

Initially this change in license will affect the two flagship projects of eobjects.org: DataCleaner and MetaModel. This means that the next versions of these projects (DataCleaner 1.5 and MetaModel 1.1 respectively) will be LGPL licensed. Also, new projects will be LGPL licensed unless special circumstances suggest otherwise.

Go watch the new appetizer webcast of DataCleaner 1.4

We've just uploaded a webcast of the new DataCleaner 1.4, which provides a long-awaited update to the old 0.4 webcasts!

Go enjoy the webcast - and be sure to download the newest version of DataCleaner. Over and out!

DataCleaner 1.4 released!

I'm pleased to announce the release of DataCleaner 1.4! This is a release that we feel will satisfy a lot of users with improvements and fixes for a lot of issues. Here's a very short compilation of changes; for more details, take a look at the roadmap.

  • Replaced "Repeated values" profile with better and more advanced "Value distribution" profile.
  • Dictionary matcher drill-to-details options.
  • New application logo.
  • Lots of small bugfixes and UI beautifications.
  • Lots of sample dictionaries and regexes.

We hope you enjoy the new version of DataCleaner - Get it now!

Two new releases planned for DataCleaner

After some considerations about the future of DataCleaner, we've updated the roadmap to reflect our current plans for the direction of development. We are planning on releasing DataCleaner 1.4 by the end of the month and after that two new milestones have been added:

  • DataCleaner 1.5: The main focus of this release is to provide a command line interface for our data quality framework. This means that users will be able to easily create batch jobs that they can schedule using their favorite scheduler. Other features will also include Pattern Finder improvements and a couple of new profiles.
  • DataCleaner 1.6: We have a lot of suggestions that have been filling up our backlog. DataCleaner 1.6 will be all about getting everybody's needs into the application before we get ready to begin the webapp. Some of the exciting features of DataCleaner 1.6 will be relationship profiling and exporting of results.

Kasper Sørensen presenting DataCleaner at Open Source Days '08

Great news everybody. The Open Source Days '08 conference in Copenhagen will feature a so-called Lightning Speak by Kasper Sørensen on the topic of DataCleaner and the eobjects.org community.

We're really happy to get the message of DataCleaner out to more people and a conference like this is an ideal spot for demonstrations, discussions and experiences. Read more about the lightning speak at Kasper's blog:

Update: The presentation is over and you can now also read the retrospective at Kasper's blog:

eobjects.org has been acquired

During the last year eobjects.dk has grown rapidly and attracted a lot of attention, both from Denmark where the community was originally founded and internationally from users and contributors in all parts of the world. We believe that this worldwide interest in eobjects should be reflected in the website name and address, which is why we have acquired the eobjects.org domain name as of today! Eobjects.dk will still prevail and the domain names are exact aliases, but going forward we will undergo a gradual name change from .dk to .org. This will be reflected in several ways:

  • The official name of the website will change to eobjects.org.
  • For the sake of compatibility we will not change the package names of our Java classes just yet. Only major version releases will include such changes (i.e. wait for DataCleaner 2.0 and MetaModel 1.1).
  • The same principle goes for our Maven artifacts. In time they will probably change though, but this also depends on the repository crew at Apache.

We are happy that we now have a domain name that symbolizes the international appeal of our software and we hope that it will reinforce the community with a similarly global culture and sense of vitality.

MetaModel 1.0.7 is out!

We're happy to announce that MetaModel 1.0.7 has just been released! The new release should be a drop-in replacement with minor improvements and bugfixes:

  • Improved memory handling and fixed a very slight memory leak (#191).
  • Added support for RIGHT JOIN when using the embedded query engine (#175).
  • XML support has been improved with more precise column types (#176).

You can download the new MetaModel at our Google Code download site. For all you Maven people out there, here's the update to your POM:

<dependency>
 <groupId>dk.eobjects.commons</groupId>
 <artifactId>MetaModel</artifactId>
 <version>1.0.7</version>
</dependency>

Enjoy.

Development/snapshot release of DataCleaner 1.4

We've released a development/snapshot release of DataCleaner 1.4 in order to get early reactions for all the improvements and new features as well as supporting our users with up to date functionality. In my own opinion the development release is just as stable and "safe to use" as 1.3, but of course it lacks a bit of the manual testing that we put into the real releases.

You can download the development release at our SourceForge download site.

Here's a short list of fixes since DataCleaner 1.3:

  • Better memory handling and garbage collection
  • Reference columns in drill-to-details windows
  • Better error handling when loading schemas
  • Quoting of string values in visualized tables (in order to distinguish empty strings and white spaces)
  • New profile: Value Distribution, which is an improved version of the Repeated Values profile. The Value Distribution profile has an option to configure the top/bottom n values to include in the result.
  • Better control of profile result column width.
  • Bugfix: Copy to clipboard functions now work properly.
  • Bugfix: Scrollbars added to visualized tables.
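
As a rough illustration of what a "top/bottom n" option means for a value distribution, here is a self-contained sketch. It is not DataCleaner's implementation of the Value Distribution profile; ValueDistributionSketch, topValues and bottomValues are hypothetical names, values are assumed to be non-null, and breaking ties alphabetically is an assumption made so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not DataCleaner's Value Distribution profile.
final class ValueDistributionSketch {

    // The n most frequent values, most frequent first.
    static List<String> topValues(List<String> values, int n) {
        return ranked(values, true, n);
    }

    // The n least frequent values, least frequent first.
    static List<String> bottomValues(List<String> values, int n) {
        return ranked(values, false, n);
    }

    private static List<String> ranked(List<String> values, boolean descending, int n) {
        // Count how often each distinct value occurs (values assumed non-null).
        Map<String, Integer> counts = new HashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Order by count, then alphabetically so that ties are deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            if (descending) {
                byCount = -byCount;
            }
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

For the values a, b, a, c, b, a the counts are a=3, b=2, c=1, so the top 2 are a and b, and the bottom 1 is c.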

Take a look at the roadmap for more current developments of DataCleaner.

Welcome to the new eobjects.dk website

After a great deal of work we're happy to announce the launch of the new eobjects.dk website at our new server host! Thanks to Copenhagen Business School we now have much better bandwidth available as well as more powerful hardware. Take a look around - a lot of things have changed, but the important stuff is still the same.

  • The most remarkable change is probably what you're looking at right now - the News page! With the News page we'll be sure to keep you updated with all that goes on at eobjects.dk - project releases, roadmap changes, events, visions and goals etc. etc.
  • There's a new left-hand side menu to ease navigation. We've created a new Docs page and a Downloads page for quick access to common inquiries. You'll also notice that the projects have been highlighted in the menu to give a better overview of our work.
  • For contributors and developers: the Hudson continuous integration system has not been migrated yet. So we hope you have the patience and discipline to live without CI for a couple of days.

We hope that you like the new website. If there's anything you'd like to comment on or anything that doesn't work as it should, please don't hesitate to go to the discussion forums and point it out to us! We will then make sure that the new website lives up to all the hopes we have for it.