AnalyzerBeans
AnalyzerBeans is a sandbox project created as a long-term strategy to enhance the current DataCleaner API (the "DataCleaner-core" module). To allow us to rethink the design, it has been founded as a separate project so that it does not interfere with the maturing development of DataCleaner.
Goals
Several design goals make AnalyzerBeans stand out distinctly from the DataCleaner API:
- Annotation-based: The use of annotations instead of inheritance will provide greater flexibility when developing profiles and validation rules. Providing metadata for components will be significantly easier and more consistent, as compile-time checks and annotation-targeted properties allow the framework to easily propagate these features to the user. Annotations for dependency/configuration injection are compliant with JSR-330.
- Single but flexible type of analysis: There should be no differences between profiles and validation rules at the API level. This will allow DataCleaner to unite the two in the long term. Our common term for these components is "AnalyzerBeans".
- "Exploring" profiles: Instead of trying to make all AnalyzerBeans fit within the same serial-processing pattern, we want them to be able to choose a completely custom execution pattern. We call these beans exploring AnalyzerBeans. Exploring AnalyzerBeans will be given full control over a DataContext, thus giving them a lot more processing power.
- Better support for persistence APIs and other frameworks: We want to store the input and output of AnalyzerBeans as POJOs, which will make it a lot easier to hook applications into JPA, JAXB or similar POJO-based frameworks.
- Pluggable multi-threading API which supports a variety of different execution modes: Single-threaded, java.util.concurrent-based multithreading and Timer- or JMS-based EJB-multithreading.
- Transformation of input before analysis: Data can now be transformed, tokenized, standardized, converted and more before analysis. This enables AnalyzerBeans to do things such as name standardization (going from a full name to first name, last name, middle names and titles), string tokenizing and data type conversion.
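The pluggable multi-threading goal can be illustrated with a minimal sketch in which the execution mode sits behind a single interface. The names TaskRunner, SingleThreadedRunner and PooledRunner are hypothetical and are not taken from the AnalyzerBeans API; only the java.util.concurrent classes are real.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TaskRunnerSketch {

    /** Hypothetical abstraction: the framework would only see this interface. */
    interface TaskRunner {
        void runAll(List<Runnable> tasks);
    }

    /** Runs every task on the calling thread, in order. */
    static class SingleThreadedRunner implements TaskRunner {
        @Override
        public void runAll(List<Runnable> tasks) {
            for (Runnable task : tasks) {
                task.run();
            }
        }
    }

    /** Runs tasks on a fixed-size java.util.concurrent thread pool. */
    static class PooledRunner implements TaskRunner {
        private final int threads;

        PooledRunner(int threads) {
            this.threads = threads;
        }

        @Override
        public void runAll(List<Runnable> tasks) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                for (Runnable task : tasks) {
                    pool.execute(task);
                }
                pool.shutdown();
                // Block until every submitted task has finished.
                if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                    throw new IllegalStateException("Tasks did not finish in time");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            } finally {
                pool.shutdownNow();
            }
        }
    }

    /** Runs the same three counting tasks with the given runner. */
    static int countWith(TaskRunner runner) {
        AtomicInteger counter = new AtomicInteger();
        Runnable increment = counter::incrementAndGet;
        runner.runAll(List.of(increment, increment, increment));
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWith(new SingleThreadedRunner()));
        System.out.println(countWith(new PooledRunner(2)));
    }
}
```

A Timer- or JMS-based runner would be a third implementation of the same interface; the calling code would not need to change.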
Two types of analyzers
A bean must implement either of these two interfaces:
- RowProcessingAnalyzer: Row-processing beans can be grouped and share the same DataSets. The downside is that they are less flexible because they cannot produce queries themselves. All beans that implement the RowProcessingAnalyzer interface must also contain a @Configured Column-array.
- ExploringAnalyzer: Exploring beans are much more flexible but do not scale very well compared to row-processing beans. Exploring beans are given full control of a DataContext and can thus perform any kind and any number of queries.
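The difference between the two types can be sketched as follows. The method signatures are assumptions made for illustration and may not match the actual interfaces (see the Javadoc for those); Row and DataContext are simplified stand-ins for the MetaModel types, and NullCounter and TableSizeExplorer are hypothetical beans.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnalyzerTypesSketch {

    /** Stand-in for one row of a shared data set. */
    interface Row {
        Object getValue(String columnName);
    }

    /** Stand-in for a DataContext; the real class offers far more. */
    interface DataContext {
        List<Row> executeQuery(String query);
    }

    /** Assumed shape: the framework pushes rows, the bean never queries. */
    interface RowProcessingAnalyzer {
        void run(Row row);
    }

    /** Assumed shape: the bean issues whatever queries it wants. */
    interface ExploringAnalyzer {
        void run(DataContext dataContext);
    }

    /** Counts null values in one column (a plain String replaces the @Configured Column here). */
    static class NullCounter implements RowProcessingAnalyzer {
        private final String columnName;
        int nulls;

        NullCounter(String columnName) {
            this.columnName = columnName;
        }

        @Override
        public void run(Row row) {
            if (row.getValue(columnName) == null) {
                nulls++;
            }
        }
    }

    /** Decides on its own which query to run. */
    static class TableSizeExplorer implements ExploringAnalyzer {
        int rowCount;

        @Override
        public void run(DataContext dataContext) {
            rowCount = dataContext.executeQuery("SELECT * FROM person").size();
        }
    }

    static Row row(String name, String city) {
        Map<String, Object> values = new HashMap<>();
        values.put("name", name);
        values.put("city", city);
        return values::get;
    }

    static int[] demo() {
        List<Row> rows = List.of(row("Ann", "Oslo"), row("Bob", null), row(null, null));

        // Both row-processing beans share a single pass over the same rows.
        NullCounter nameNulls = new NullCounter("name");
        NullCounter cityNulls = new NullCounter("city");
        for (Row r : rows) {
            nameNulls.run(r);
            cityNulls.run(r);
        }

        // The exploring bean gets the whole DataContext to itself.
        TableSizeExplorer explorer = new TableSizeExplorer();
        explorer.run(query -> rows);

        return new int[] {nameNulls.nulls, cityNulls.nulls, explorer.rowCount};
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println(result[0] + " " + result[1] + " " + result[2]);
    }
}
```

The sketch also shows why row-processing beans scale better: adding another NullCounter costs no additional pass over the data, whereas every exploring bean runs its own queries.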
AnalyzerBeans annotations
The new annotations represent the life-cycle of an analysis component - what used to be called either a Profile or a Validation Rule in DataCleaner. Here are the annotations introduced:
- @AnalyzerBean: This annotation is used at the class level to indicate that the class is an analysis component (dubbed "an AnalyzerBean"). The annotation has a single parameter:
  - value: The display name of the type of analysis. Shown to the user.
- @Configured: Used on either fields or setter-methods, this annotation indicates that a property needs to be configured by the user and injected into the bean before the AnalyzerBean can execute. The valid types of properties can be single instances or arrays of:
  - java.lang.Boolean (or boolean)
  - java.lang.Long (or long)
  - java.lang.Integer (or int)
  - java.lang.Double (or double)
  - java.lang.String
  - dk.eobjects.metamodel.schema.Column
  - dk.eobjects.metamodel.schema.Table
  - dk.eobjects.metamodel.schema.Schema
- @Provided: Used on either fields or setter-methods, this annotation indicates that a property has to be provided, managed and injected by the AnalyzerBeans framework. As of now this only pertains to persistent collections, but if we encounter other types of managed properties (resources that should be made available to the bean developer), this annotation is expected to be used for those as well. Currently the valid types of properties are:
  - java.util.List (with parameterized/generic types of Boolean, Long, Integer, Double or String)
  - java.util.Map (with parameterized/generic types of Boolean, Long, Integer, Double or String)
- @Initialize: Used to annotate a method that should be run on initialization, i.e. after @Configured and @Provided properties have been injected.
- @Close: Used to annotate a method that should be run after execution of the bean has finished. It can be used to release any resources that may have been used during execution. Note that all @Provided properties are closed automatically if necessary (and in this sense you should consider them 'managed').
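The life-cycle described by these annotations can be sketched as below. To keep the example self-contained, the five annotations are declared locally as stand-ins with the same names; they are not the real AnalyzerBeans declarations. The ValueCounter bean and the runLifecycle method are hypothetical: the latter is a toy replacement for the framework that only mimics the order of events (inject, initialize, run, close), and it uses a plain ArrayList where the framework would supply a persistent collection.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {

    // Local stand-ins for the AnalyzerBeans annotations.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE)
    @interface AnalyzerBean { String value(); }

    @Retention(RetentionPolicy.RUNTIME) @Target({ElementType.FIELD, ElementType.METHOD})
    @interface Configured {}

    @Retention(RetentionPolicy.RUNTIME) @Target({ElementType.FIELD, ElementType.METHOD})
    @interface Provided {}

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface Initialize {}

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface Close {}

    /** Hypothetical bean that records what happens to it. */
    @AnalyzerBean("Value counter")
    static class ValueCounter {
        @Configured String columnName;      // chosen by the user
        @Provided List<String> seenValues;  // supplied and managed by the framework
        final List<String> log = new ArrayList<>();

        @Initialize
        void init() {
            log.add("init:" + columnName + ":" + seenValues.size());
        }

        void run(String value) {
            seenValues.add(value);
        }

        @Close
        void close() {
            log.add("close:" + seenValues.size());
        }
    }

    /** Toy container: inject fields, call @Initialize, feed values, call @Close. */
    static List<String> runLifecycle(String columnName, List<String> values) {
        ValueCounter bean = new ValueCounter();
        try {
            for (Field field : ValueCounter.class.getDeclaredFields()) {
                field.setAccessible(true);
                if (field.isAnnotationPresent(Configured.class)) {
                    field.set(bean, columnName);
                } else if (field.isAnnotationPresent(Provided.class)) {
                    field.set(bean, new ArrayList<String>());
                }
            }
            invokeAnnotated(bean, Initialize.class);
            for (String value : values) {
                bean.run(value);
            }
            invokeAnnotated(bean, Close.class);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return bean.log;
    }

    private static void invokeAnnotated(Object bean, Class<? extends Annotation> annotation)
            throws ReflectiveOperationException {
        for (Method method : bean.getClass().getDeclaredMethods()) {
            if (method.isAnnotationPresent(annotation)) {
                method.setAccessible(true);
                method.invoke(bean);
            }
        }
    }

    public static void main(String[] args) {
        // The display name comes from the class-level annotation.
        System.out.println(ValueCounter.class.getAnnotation(AnalyzerBean.class).value());
        System.out.println(runLifecycle("name", List.of("a", "b")));
    }
}
```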
Major obstacles to be decided or implemented
- Result rendering? Multiple render-formats should be supported (such as HTML, XML and GUI components).
- How to implement validation of @Configured properties. Perhaps using exceptions on the @Initialize method? Or introduce a "validator" parameter to the @Configured annotation? What about validation of multiple properties?
- How to implement filtering/sampling for queries. Exploring Analyzers more or less make this impossible, since they issue their own queries. Speculative solution: Create a wrapping DataContextStrategy that investigates (and modifies if necessary) all queries going through.
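The speculative wrapping strategy from the last point could look roughly like this. It is only a sketch of the decorator idea, not a decided design: Query and DataContextStrategy are simplified local stand-ins rather than the MetaModel classes, and SamplingStrategy and InMemoryStrategy are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class SamplingStrategySketch {

    /** Stand-in for a query object; only the row limit matters here. */
    static class Query {
        final String table;
        Integer maxRows; // null means "no limit"

        Query(String table) {
            this.table = table;
        }
    }

    /** Stand-in for the strategy that actually executes queries. */
    interface DataContextStrategy {
        List<String> executeQuery(Query query);
    }

    /** Wrapper: inspects each query and caps its row limit before delegating. */
    static class SamplingStrategy implements DataContextStrategy {
        private final DataContextStrategy delegate;
        private final int sampleSize;

        SamplingStrategy(DataContextStrategy delegate, int sampleSize) {
            this.delegate = delegate;
            this.sampleSize = sampleSize;
        }

        @Override
        public List<String> executeQuery(Query query) {
            // Note: this modifies the caller's query object in place.
            if (query.maxRows == null || query.maxRows > sampleSize) {
                query.maxRows = sampleSize;
            }
            return delegate.executeQuery(query);
        }
    }

    /** Toy backend that generates rows and honours maxRows. */
    static class InMemoryStrategy implements DataContextStrategy {
        private final int totalRows;

        InMemoryStrategy(int totalRows) {
            this.totalRows = totalRows;
        }

        @Override
        public List<String> executeQuery(Query query) {
            int limit = query.maxRows == null ? totalRows : Math.min(totalRows, query.maxRows);
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < limit; i++) {
                rows.add(query.table + "-row-" + i);
            }
            return rows;
        }
    }

    /** Number of rows an unlimited query returns once it passes through the wrapper. */
    static int sampledRowCount(int totalRows, int sampleSize) {
        DataContextStrategy strategy =
                new SamplingStrategy(new InMemoryStrategy(totalRows), sampleSize);
        return strategy.executeQuery(new Query("person")).size();
    }

    public static void main(String[] args) {
        System.out.println(sampledRowCount(100, 10));
        System.out.println(sampledRowCount(5, 10));
    }
}
```

Because an exploring bean would only ever see the wrapper, sampling would apply to its queries without the bean having to cooperate. Filtering could follow the same pattern by rewriting the query's conditions instead of its row limit.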
See also
- Current API Documentation (Javadoc) for AnalyzerBeans: http://eobjects.org/analyzerbeans/apidocs
- Tickets on the first milestone of AnalyzerBeans: AnalyzerBeans-0.1
- Subversion path: http://eobjects.org/svn/AnalyzerBeans/trunk/
- Kasper's blog-entry: Introducing AnalyzerBeans
