datacleaner (#1) - Standard Measures: Distinct Count (#117) - Message List
Hey all, hopefully this hasn't already been asked and I just couldn't find it --
Is there a way to add a 'distinct count' (or 'duplicate count', whatever the wording) to the Standard Measures profile, under the 'Row count'?
Thanky! -D
-
Message #460
Shoot, there is a similar topic here - http://datacleaner.eobjects.org/topic/8/Duplicate-records
Duplicate and distinct are pretty much two sides of the same thing (one measures the 'unique count', the other measures how many are duplicates), and I'm asking for much the same reasons as described there.
dhartford 05/24/10 23:49:07 -
Message #461
Ah yes, it looks like this feature has been overlooked for some time.
When you say distinct count, do you mean on a per-column basis or a per-table basis? I can imagine that both measures will be of some value (e.g. "how many distinct first names are there in the person-table" and "how many records are exactly the same in the person-table"). But which one do you need?
kasper 05/25/10 08:32:46 -
Message #462
For my needs, distinct on a per-column basis would be the important one for the Standard Measures section ("how many distinct first names"). The reason I would like this in Standard Measures is that a distinct count is one of the first things I look at when evaluating data.
The "Value Distribution" profile already gives some good information on duplicates (top/bottom values), and could include a more general metric of 'number of records that are duplicates'. That metric would be more useful in that profile, since that is where you go to look more closely at this kind of information anyway.
dhartford 05/25/10 14:39:55 -
Message #463
Honestly -- I think 'Distinct' probably does fit better in the Value Distribution profile.
It wouldn't be the first place I personally would look, but it makes better sense there.
dhartford 05/25/10 14:49:33 -
Message #464
Oh, but if you disable the top/bottom properties of the value distribution profile, it will give you a complete listing of all values and their distributions! At the bottom of that list you will find "distinct values" :)
kasper 05/25/10 14:53:57 -
Message #466
Ah, I see that now!
But it doesn't look like the number I'm after - it doesn't include the 'unique duplicates', i.e. values that occur more than once, counted once each.
For example, after pulling the data into a database, a SQL distinct on Address_field returns 9942, while the DataCleaner profile reports 5333 out of 20,000 rows (because the many values that have duplicates aren't included).
In the same dataset, a different field has 19842 SQL distincts out of 20,000, while DataCleaner shows 19764.
I think it's close, but not quite what I was hoping for :-(
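To illustrate the gap between the two numbers, here is a minimal sketch using Python's built-in sqlite3 module. The table name `addresses`, the column name `address_field`, and the sample rows are made up for illustration, and the reading that the profile's "unique values" line counts only values occurring exactly once is an assumption based on the figures above.

```python
import sqlite3

# Hypothetical sample data: table, column and rows are illustrative only.
rows = [
    ("1 Main St",), ("1 Main St",),
    ("2 Oak Ave",),
    ("3 Elm Rd",), ("3 Elm Rd",), ("3 Elm Rd",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (address_field TEXT)")
conn.executemany("INSERT INTO addresses VALUES (?)", rows)

# SQL distinct: every different value is counted once, however often it repeats.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT address_field) FROM addresses"
).fetchone()[0]

# Assumed meaning of "unique values": only values that occur exactly once.
unique_count = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT address_field FROM addresses"
    "  GROUP BY address_field HAVING COUNT(*) = 1"
    ")"
).fetchone()[0]

conn.close()
print(distinct_count, unique_count)
```

With this sample the first query counts three different addresses, while the second counts only the one address that never repeats.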
dhartford 05/25/10 15:07:36 -
Message #468
Hmm... to be more precise, I'm suggesting two more measures:
rowcount:20,000
Address_field:
Bottom: <unique values> 5333
[new]Distinct Count: 9942
[new]Duplicated Count: 4609
Does that sound reasonably useful to other people?
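As a rough sketch of how the three numbers relate, assuming "Duplicated Count" means the number of distinct values that occur more than once (which matches 9942 - 5333 = 4609 above), something like the following Python would compute them for one column. The function name `distinct_measures` and the sample data are hypothetical, not anything from DataCleaner.

```python
from collections import Counter

def distinct_measures(values):
    """Return (distinct, unique, duplicated) counts for one column.

    Hypothetical helper for illustration; not a DataCleaner API.
    """
    counts = Counter(values)
    distinct = len(counts)                               # different values
    unique = sum(1 for n in counts.values() if n == 1)   # occur exactly once
    duplicated = distinct - unique                       # occur more than once
    return distinct, unique, duplicated

sample = ["a", "a", "b", "c", "c", "c", "d"]
print(distinct_measures(sample))
```

Note that this holds a counter for every value in memory, which is fine for a sketch but may matter for large tables.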
dhartford 05/25/10 15:14:55 -
Message #469
It does sound very useful to me, but I would probably not vote for putting those measures in the Standard Measures profile. That profile currently works by iterating through all records, so it would have to hold all values in memory. The Value Distribution profile is much better at handling this kind of issue because it has support for on-disk storage.
(This is a general design issue with DataCleaner that is being addressed by a redesign of some of the core components of the "engine" in DataCleaner - a pilot project called AnalyzerBeans.)
For more info:
- my blog entry about this: http://kasper.eobjects.dk/2009/06/introducing-analyzerbeans.html
- the wiki page for analyzerbeans: http://www.eobjects.dk/trac/wiki/AnalyzerBeans
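To make the memory trade-off concrete, here is a small Python sketch that counts distinct values while keeping the set of seen values in a SQLite file instead of an in-memory collection. This is only an illustration of the general idea; it is not how the Value Distribution profile implements its on-disk storage, and the function name, table name and file name are all hypothetical.

```python
import os
import sqlite3
import tempfile

def distinct_count_on_disk(values, db_path):
    """Count distinct values, storing per-value counts in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (value TEXT PRIMARY KEY, n INTEGER)"
        )
        for v in values:  # any iterable works, e.g. a streaming result set
            # Add the value if it is new, then bump its occurrence count.
            conn.execute("INSERT OR IGNORE INTO seen VALUES (?, 0)", (v,))
            conn.execute("UPDATE seen SET n = n + 1 WHERE value = ?", (v,))
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
    finally:
        conn.close()

with tempfile.TemporaryDirectory() as tmp:
    result = distinct_count_on_disk(
        iter(["x", "y", "x", "z", "x"]), os.path.join(tmp, "seen.db")
    )
print(result)
```

The per-value counts stay on disk, so the same table could also answer "how many values occur exactly once" without loading everything into memory.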
kasper 05/25/10 15:22:51 -
Message #470
Putting those in the Value Distribution profile actually sounds great to me, now that I've thought it through! :-)
dhartford 05/25/10 15:28:43
