AnalyzerBeans
AnalyzerBeans is a project created as a long-term strategy to enhance the current DataCleaner API (the "DataCleaner-core" module). To allow us to rethink the design, it was founded as a separate project so as not to interfere with the maturing development of DataCleaner 1.x.
The goal is for DataCleaner 2.0 to be based on AnalyzerBeans as its core analysis engine.
New: This wiki page is no longer to be considered the main homepage of AnalyzerBeans! This serves as a community place for documentation collaboration and ad hoc writing of snippets and small pieces of information.
Please be sure to visit the new official AnalyzerBeans website at http://analyzerbeans.eobjects.org
Goals
AnalyzerBeans has several design goals that make it stand out distinctly from the DataCleaner APIs:
- Annotation-based: Using annotations instead of inheritance provides greater flexibility when developing profiles and validation rules. Providing metadata for components becomes significantly easier and more consistent, because compile-time checks and annotation-targeted properties let the framework propagate these features to the user. Annotations for dependency/configuration injection are compliant with JSR-330.
- Single but flexible type of analysis: There should be no difference between profiles and validation rules at the API level. This will allow DataCleaner to unite the two in the long term. Our common term for these components is "AnalyzerBeans".
- Transformation of input before analysis. Data can now be transformed, tokenized, standardized, converted and more before analysis. This enables AnalyzerBeans to do things such as name standardization (going from a full name to first name, last name, middle names and title), string tokenizing and data type conversion.
- "Exploring" profiles: Instead of trying to make all AnalyzerBeans fit within the same serial-processing pattern, we want them to be able to choose a completely custom execution pattern. We call these beans exploring AnalyzerBeans. Exploring AnalyzerBeans will be given full control over a DataContext, thus giving them a lot more processing power.
- Better support for persistence APIs and other frameworks. We want to store the input and output of AnalyzerBeans as POJOs, which will make it a lot easier to hook applications into JPA, JAXB or similar POJO-based frameworks.
- Pluggable multi-threading API which supports a variety of different execution modes: Single-threaded, java.util.concurrent-based multithreading and Timer- or JMS-based EJB-multithreading.
- Richer reference data types. In addition to the dictionary type we are adding new types of reference data. In particular we are adding support for Synonym catalogs and are also considering other types of reference/lookup data types.
- Pluggable rendering mechanism. The framework includes a way to pick up analysis results and render them to a particular format (e.g. a Swing UI, a text file or an HTML element).
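To make the pluggable multi-threading goal more concrete, here is a minimal sketch of how execution modes could be swapped behind one interface. The names TaskRunnerSketch, SingleThreadedRunner, PooledRunner and RunnerDemo are hypothetical and invented for this illustration; they are not taken from the AnalyzerBeans API. Only the single-threaded and java.util.concurrent-based modes are shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical plug-in point: an execution mode decides how tasks are run. */
interface TaskRunnerSketch {
    void run(List<Runnable> tasks);
}

/** Runs every task in the calling thread, one after another. */
class SingleThreadedRunner implements TaskRunnerSketch {
    @Override
    public void run(List<Runnable> tasks) {
        tasks.forEach(Runnable::run);
    }
}

/** Runs tasks on a fixed-size java.util.concurrent thread pool and waits for completion. */
class PooledRunner implements TaskRunnerSketch {
    private final int threads;

    PooledRunner(int threads) {
        this.threads = threads;
    }

    @Override
    public void run(List<Runnable> tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            tasks.forEach(pool::execute);
        } finally {
            pool.shutdown(); // accept no new tasks, let submitted ones finish
        }
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("Tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        }
    }
}

class RunnerDemo {
    /** Submits 100 small tasks to the given runner and reports how many were executed. */
    static int countWith(TaskRunnerSketch runner) {
        AtomicInteger processed = new AtomicInteger();
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            tasks.add(processed::incrementAndGet);
        }
        runner.run(tasks);
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println("single-threaded: " + countWith(new SingleThreadedRunner()));
        System.out.println("pooled:          " + countWith(new PooledRunner(4)));
    }
}
```

The calling code in countWith does not know which execution mode it is given, which is the point of making the runner pluggable.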
Two types of analyzers
An analyzer bean must implement one of these two interfaces:
- RowProcessingAnalyzer: Row-processing beans can be grouped and share the same DataSets. They also support transformation, filtering and other pre-processing of data. The downside is that they are less flexible because they cannot produce queries themselves. All beans that implement the RowProcessingAnalyzer interface must also contain a @Configured Column array.
- ExploringAnalyzer: Exploring beans are more flexible in terms of querying capability, but do not scale as well as row-processing beans. Exploring beans are given full control of a DataContext and can thus perform any kind and any number of queries.
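The difference between the two styles can be sketched roughly as follows. The interfaces RowProcessingSketch, ExploringSketch and QueryableSource are simplified, hypothetical stand-ins written for this page; they do not reproduce the signatures of RowProcessingAnalyzer, ExploringAnalyzer or DataContext. The sketch only illustrates who drives the data access: the framework pushes rows into a row-processing bean, whereas an exploring bean issues its own queries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a row-processing analyzer: the framework pushes rows in. */
interface RowProcessingSketch {
    void run(Map<String, Object> row);
}

/** Hypothetical stand-in for a DataContext: a source the bean may query freely. */
interface QueryableSource {
    List<Map<String, Object>> select(Predicate<Map<String, Object>> where);
}

/** Hypothetical stand-in for an exploring analyzer: the bean pulls the data itself. */
interface ExploringSketch {
    void run(QueryableSource source);
}

/** Row-processing style: counts null values in one column, one row at a time. */
class NullCounter implements RowProcessingSketch {
    private final String column;
    private int nullCount;

    NullCounter(String column) {
        this.column = column;
    }

    @Override
    public void run(Map<String, Object> row) {
        if (row.get(column) == null) {
            nullCount++;
        }
    }

    int getNullCount() {
        return nullCount;
    }
}

/** Exploring style: performs two separate queries of its own choosing. */
class AgeSplit implements ExploringSketch {
    int adults;
    int minors;

    @Override
    public void run(QueryableSource source) {
        adults = source.select(row -> (Integer) row.get("age") >= 18).size();
        minors = source.select(row -> (Integer) row.get("age") < 18).size();
    }
}

class AnalyzerStylesDemo {
    static Map<String, Object> row(String name, int age) {
        Map<String, Object> r = new HashMap<>();
        r.put("name", name);
        r.put("age", age);
        return r;
    }

    static List<Map<String, Object>> sampleRows() {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("Ann", 34));
        rows.add(row(null, 12));
        rows.add(row("Bob", 17));
        rows.add(row(null, 51));
        rows.add(row("Cid", 70));
        return rows;
    }

    /** The "framework" iterates the rows; the bean only sees one row at a time. */
    static int countNullNames() {
        NullCounter counter = new NullCounter("name");
        for (Map<String, Object> r : sampleRows()) {
            counter.run(r);
        }
        return counter.getNullCount();
    }

    /** The bean receives the source and decides which queries to run. */
    static int[] splitAges() {
        List<Map<String, Object>> rows = sampleRows();
        QueryableSource source = where -> rows.stream().filter(where).collect(Collectors.toList());
        AgeSplit split = new AgeSplit();
        split.run(source);
        return new int[] { split.adults, split.minors };
    }

    public static void main(String[] args) {
        System.out.println("null names: " + countNullNames());
        int[] ages = splitAges();
        System.out.println("adults: " + ages[0] + ", minors: " + ages[1]);
    }
}
```

Because the row-processing bean never touches the source directly, several such beans could share a single pass over the data, which is the scaling advantage described above.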
Configuration and analysis jobs
Two concepts are central to the design of AnalyzerBeans. These are the two things that the user of the framework has to provide.
- Configuration. The configuration contains all the information about the environment of AnalyzerBeans. This includes which datastores are available and what reference data there is, but also deployment options such as tuning of multithreading and the storage mechanism.
  - Example configuration in XML format: examples/conf.xml
  - XML schema: configuration.xsd
  - Java interface: AnalyzerBeansConfiguration
- Analysis job. An analysis job describes a single job that makes use of a datastore in the configuration. A job consists of a set of source columns, some pre-processing elements (transformation, filtering etc.) and a set of analyzers.
  - Example job in XML format: examples/employees_job.xml
  - XML schema: job.xsd
  - Java interface: AnalysisJob
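As a rough illustration of the three parts of a job (source columns, pre-processing and analyzers), here is a small self-contained sketch. JobSketch and everything inside it is hypothetical and much simpler than the AnalysisJob interface or the job XML format; rows are plain maps, transformers are functions on rows and analyzers are row consumers. The example transformer performs a naive name split of the kind mentioned under the goals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

/** Hypothetical, simplified job: source columns, pre-processing steps, then analyzers. */
class JobSketch {
    final List<String> sourceColumns;
    final List<UnaryOperator<Map<String, String>>> transformers = new ArrayList<>();
    final List<Consumer<Map<String, String>>> analyzers = new ArrayList<>();

    JobSketch(String... sourceColumns) {
        this.sourceColumns = Arrays.asList(sourceColumns);
    }

    /** Sends every row through the transformers, then hands it to every analyzer. */
    void execute(List<Map<String, String>> rows) {
        for (Map<String, String> original : rows) {
            Map<String, String> row = new HashMap<>();
            for (String column : sourceColumns) {
                row.put(column, original.get(column)); // only declared source columns are read
            }
            for (UnaryOperator<Map<String, String>> transformer : transformers) {
                row = transformer.apply(row);
            }
            for (Consumer<Map<String, String>> analyzer : analyzers) {
                analyzer.accept(row);
            }
        }
    }

    /** Demo job: split "full_name" into first and last name, then count last names. */
    static Map<String, Integer> demo() {
        JobSketch job = new JobSketch("full_name");

        // Pre-processing: naive name standardization (first token, last token).
        job.transformers.add(row -> {
            String[] parts = row.get("full_name").trim().split("\\s+");
            row.put("first_name", parts[0]);
            row.put("last_name", parts[parts.length - 1]);
            return row;
        });

        // Analysis: count occurrences of each last name produced by the transformer.
        Map<String, Integer> counts = new TreeMap<>();
        job.analyzers.add(row -> counts.merge(row.get("last_name"), 1, Integer::sum));

        List<Map<String, String>> rows = new ArrayList<>();
        for (String name : new String[] { "John Doe", "Jane Doe", "Ann Lee" }) {
            Map<String, String> input = new HashMap<>();
            input.put("full_name", name);
            input.put("unused_column", "ignored by this job");
            rows.add(input);
        }
        job.execute(rows);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The analyzer here works on a column ("last_name") that does not exist in the source data, which is what pre-processing before analysis makes possible.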
AnalyzerBeans annotations
The new annotations represent the life-cycle of an analysis component - what used to be called either a Profile or a Validation Rule in DataCleaner. Here are the annotations introduced:
- @AnalyzerBean: This annotation is used at the class level to indicate that the class is an analysis component (dubbed "an AnalyzerBean"). The annotation has a single parameter:
  - value: The display name of the type of analysis. Shown to the user.
- @TransformerBean: This annotation is used at the class level to indicate that the class is a transformer component. The annotation has a single parameter:
  - value: The display name of the type of transformation. Shown to the user.
- @FilterBean: This annotation is used at the class level to indicate that the class is a filter component. A filter categorizes rows, and the categories can then be used to split the flow of data in succeeding steps (i.e. to set filter requirements for further processing).
- @Configured: Used on fields, this annotation indicates that a property needs to be configured by the user and injected into the bean before the AnalyzerBean can execute. The valid types of properties can be single instances or arrays of:
  - java.lang.Boolean (or boolean)
  - java.lang.Byte (or byte)
  - java.lang.Short (or short)
  - java.lang.Integer (or int)
  - java.lang.Long (or long)
  - java.lang.Float (or float)
  - java.lang.Double (or double)
  - java.lang.Character (or char)
  - java.lang.String
  - java.io.File
  - enum types
  - java.util.regex.Pattern
  - org.eobjects.analyzer.data.InputColumn
  - org.eobjects.analyzer.reference.Dictionary
  - org.eobjects.analyzer.reference.SynonymCatalog
  - dk.eobjects.metamodel.schema.Column (only available for exploring analyzer beans)
  - dk.eobjects.metamodel.schema.Table (only available for exploring analyzer beans)
  - dk.eobjects.metamodel.schema.Schema (only available for exploring analyzer beans)
- @Provided: Used on fields, this annotation indicates that a property has to be provided, managed and injected by the AnalyzerBeans framework. This is especially meaningful for persistent collections and row annotation utilities, because they require complicated caching logic that should not be the concern of the component developer. Currently the valid types of properties are:
  - java.util.List (with parameterized/generic types of Boolean, Byte, Short, Integer, Long, Float, Double, Character or String)
  - java.util.Map (with parameterized/generic types of Boolean, Byte, Short, Integer, Long, Float, Double, Character or String)
  - java.util.Set (with parameterized/generic types of Boolean, Byte, Short, Integer, Long, Float, Double, Character or String)
  - org.eobjects.analyzer.storage.CollectionFactory: A simple factory for creating the three collection types above. This factory is useful if the number of persistent collections needed is not fixed, i.e. dependent on configuration or input.
  - org.eobjects.analyzer.storage.RowAnnotationFactory: Used for creating row annotations which can be used to mark/annotate rows.
- @Initialize: Used to annotate a method that should be run on initialization, i.e. after @Configured properties have been injected.
- @Close: Used to annotate a method that should be run after execution of the bean has finished. It can be used to release any resources that may have been used during execution. Note that all @Provided properties are closed automatically if necessary (in this sense you should consider them 'managed').
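The life-cycle described by these annotations (inject @Configured properties, call @Initialize, execute, call @Close) can be sketched with plain reflection. Note the assumptions: the three annotations below are local stand-ins declared inside the example, not the real AnalyzerBeans annotation types, and PatternCounterBean and LifecycleSketch are hypothetical names. The actual framework does considerably more, for instance handling @Provided properties.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Local stand-ins for the annotations described above (NOT the real AnalyzerBeans types).
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface Configured {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
@interface Initialize {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
@interface Close {}

/** Hypothetical bean that counts values matching a configured pattern. */
class PatternCounterBean {
    @Configured Pattern pattern;  // injected before init() is called
    @Configured boolean negate;   // if true, count the non-matching values instead

    final List<String> lifecycle = new ArrayList<>();
    int count;

    @Initialize
    public void init() {
        lifecycle.add("init");
        count = 0;
    }

    public void run(String value) {
        if (pattern.matcher(value).matches() != negate) {
            count++;
        }
    }

    @Close
    public void close() {
        lifecycle.add("close");
    }
}

/** Hypothetical miniature runner that walks a bean through its life-cycle. */
class LifecycleSketch {
    /** Sets every @Configured field from the config map, keyed by field name. */
    static void inject(Object bean, Map<String, Object> config) {
        for (Field field : bean.getClass().getDeclaredFields()) {
            if (!field.isAnnotationPresent(Configured.class)) continue;
            if (!config.containsKey(field.getName())) {
                throw new IllegalStateException("Missing property: " + field.getName());
            }
            try {
                field.setAccessible(true);
                field.set(bean, config.get(field.getName()));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    /** Invokes every no-argument method carrying the given annotation. */
    static void invokeAnnotated(Object bean, Class<? extends Annotation> annotation) {
        for (Method method : bean.getClass().getDeclaredMethods()) {
            if (!method.isAnnotationPresent(annotation)) continue;
            try {
                method.setAccessible(true);
                method.invoke(bean);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    static PatternCounterBean demo() {
        Map<String, Object> config = new HashMap<>();
        config.put("pattern", Pattern.compile("[A-Z]{2}\\d+"));
        config.put("negate", false);

        PatternCounterBean bean = new PatternCounterBean();
        inject(bean, config);                      // 1. @Configured
        invokeAnnotated(bean, Initialize.class);   // 2. @Initialize
        for (String value : Arrays.asList("AB12", "x", "CD3")) {
            bean.run(value);                       // 3. execution
        }
        invokeAnnotated(bean, Close.class);        // 4. @Close
        return bean;
    }

    public static void main(String[] args) {
        PatternCounterBean bean = demo();
        System.out.println(bean.count + " matches, life-cycle: " + bean.lifecycle);
    }
}
```

The bean itself contains no framework calls; all wiring is derived from the annotations, which is the flexibility argument made for the annotation-based design.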
Major obstacles to be decided or implemented
- How to implement validation of @Configured properties. Perhaps by throwing exceptions from the @Initialize method? Or by introducing a "validator" parameter to the @Configured annotation? And what about validation that spans multiple properties?
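To help the discussion, here is a small sketch of the first option only: validating inside the @Initialize method and signalling problems with an exception. It is an illustration of the idea, not a decision. The annotations are local stand-ins, and RangeCheckBean and ValidationSketch are hypothetical names. One property of this option is visible in the example: a check that spans two properties (min and max) needs no extra mechanism.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

class ValidationSketch {
    // Local stand-ins for the annotations described on this page (NOT the real types).
    @Target(ElementType.FIELD) @interface Configured {}
    @Target(ElementType.METHOD) @interface Initialize {}

    /** Hypothetical bean with two properties that must be valid together. */
    static class RangeCheckBean {
        @Configured int min;
        @Configured int max;

        @Initialize
        void init() {
            if (min < 0) {
                throw new IllegalStateException("min must not be negative: " + min);
            }
            // Cross-property validation happens in the same place.
            if (min > max) {
                throw new IllegalStateException("min (" + min + ") must not exceed max (" + max + ")");
            }
        }
    }

    /**
     * Plays the role of the framework: sets the properties, then calls the
     * initializer. Returns null if the configuration is valid, otherwise the
     * error message that would be shown to the user.
     */
    static String validate(int min, int max) {
        RangeCheckBean bean = new RangeCheckBean();
        bean.min = min;
        bean.max = max;
        try {
            bean.init();
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(validate(1, 10));
        System.out.println(validate(10, 1));
        System.out.println(validate(-1, 5));
    }
}
```

A drawback of this option is that errors surface only at initialization time, so a user interface cannot flag an invalid value while the user is still editing a single property. A per-property "validator" parameter would address that, but it would not cover the multi-property case by itself.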
See also
- Current API Documentation (Javadoc) for AnalyzerBeans: http://eobjects.org/analyzerbeans/apidocs
- Tickets on the first milestone of AnalyzerBeans: AnalyzerBeans-0.1
- Subversion path: http://eobjects.org/svn/AnalyzerBeans/trunk/
- Kasper's first blog-entry on AnalyzerBeans: Introducing AnalyzerBeans
- Developer tutorial: Developing a transformer
- Developer tutorial: Developing an analyzer
- User tutorial: Authoring an AnalyzerBeans job
