Standard Measures: Distinct Count

Hey all, Hopefully this hasn't already been asked and I just couldn't find it --

Is there a way to add 'distinct count' (or duplicate count, whatever verbiage) to the Standard Measure profile under the 'Row count'?

Thanky! -D

  • Message #460

    Shoot, there is a similar topic here -  http://datacleaner.eobjects.org/topic/8/Duplicate-records

    Duplicate and distinct are pretty much the same thing, though (one measures the 'unique count', the other measures how many are duplicates), so the same reasoning described there applies here.

  • Message #461

    Ah yes, it looks like this feature has been overlooked for some time.

    When you say distinct count, do you mean on a per-column basis or a per-table basis? I can imagine that both measures would be of some value (e.g. "how many distinct first names are there in the person table" and "how many records are exactly the same in the person table"). But which one do you need?

  • Message #462

    For my needs, a distinct count on a per-column basis would be the important one for the Standard Measures section ("how many distinct first names"). The reason I would like this in Standard Measures is that a distinct count is one of the first things I check when evaluating data.

    The "Value Distribution" profile gives some good information on duplicates already (top/bottoms), and could include a more general metric of 'number of records that are duplicates', which would be more useful in this profile if you want to look more closely at this kind of information anyway.

  • Message #463

    Honestly -- I think 'Distinct' probably does fit better in the Value Distribution profile.

    It wouldn't be the first place I personally would look, but it makes better sense there.

  • Message #464

    Oh, but if you disable the top/bottom properties of the value distribution profile, it will give you a complete listing of all values and their distributions! At the bottom of that list you will find "distinct values" :)

  • Message #466

    Ah, I see that now!

    But it doesn't look like the right number I'm looking for - it doesn't include the 'unique duplicates'.

    i.e. after pulling the data into the database, a distinct on Address_field gives 9942 for my example, but the DataCleaner profile returns 5333 (because the many values that have duplicates aren't included) out of 20,000 rows.

    In the same dataset, a different field has 19842 SQL distincts out of 20,000, while DataCleaner shows 19764.

    I think it's close, but not quite what I was hoping for :-(

  • Message #468

    hmm...to be more precise, adding two more measures:

    rowcount:20,000

    Address_field:

    Bottom: <unique values> 5333

    [new]Distinct Count: 9942

    [new]Duplicated Count: 4609

    Does that sound reasonably useful to other people?
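
To make the three numbers concrete, here is a minimal Python sketch of how they relate for a single column. The function name `distinct_measures` and the sample addresses are made up for illustration, and this is not how DataCleaner computes anything. It just shows that the distinct count is the unique values (seen exactly once) plus the duplicated values (seen more than once), which matches 5333 + 4609 = 9942 above.

```python
from collections import Counter


def distinct_measures(values):
    """Compute per-column measures (hypothetical helper, not DataCleaner code)."""
    counts = Counter(values)  # value -> number of rows holding that value
    distinct_count = len(counts)  # what SELECT COUNT(DISTINCT col) would report
    unique_count = sum(1 for n in counts.values() if n == 1)  # values seen exactly once
    duplicated_count = distinct_count - unique_count  # values seen more than once
    return {
        "row_count": sum(counts.values()),
        "distinct_count": distinct_count,
        "unique_count": unique_count,
        "duplicated_count": duplicated_count,
    }


# Made-up sample column: two addresses repeat, two appear once.
sample = ["1 Main St", "2 Oak Ave", "1 Main St", "3 Elm Rd", "2 Oak Ave", "4 Pine Ln"]
measures = distinct_measures(sample)
print(measures)
```

Note that this holds every distinct value in memory, which is fine for a toy example but is exactly the concern for large tables.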

  • Message #469

    It does sound very useful to me, but I would probably not vote for putting those measures in the Standard Measures profile, because that profile currently works by iterating through all records (i.e. it would have to hold all values in memory). The Value Distribution profile is much better at handling this kind of issue because it has support for on-disk storage.
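
As a rough illustration of the memory point, here is a sketch that keeps the per-value counts in an on-disk SQLite file instead of in an in-memory map. This is only an assumption about what "on-disk storage" could look like in general, not the Value Distribution profile's actual implementation. The function name `count_distinct_on_disk`, the table name `seen`, and the sample data are all hypothetical, and values are assumed to be non-null strings.

```python
import os
import sqlite3
import tempfile


def count_distinct_on_disk(values, db_path):
    """Count distinct and duplicated values using an on-disk SQLite table (illustrative only)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (value TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )
        for v in values:  # values can be a generator, so rows are streamed one at a time
            conn.execute("INSERT OR IGNORE INTO seen (value, n) VALUES (?, 0)", (v,))
            conn.execute("UPDATE seen SET n = n + 1 WHERE value = ?", (v,))
        conn.commit()
        distinct = conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
        duplicated = conn.execute("SELECT COUNT(*) FROM seen WHERE n > 1").fetchone()[0]
        return {"distinct_count": distinct, "duplicated_count": duplicated}
    finally:
        conn.close()


# Made-up sample column, streamed through an iterator.
sample = ["1 Main St", "2 Oak Ave", "1 Main St", "3 Elm Rd", "2 Oak Ave", "4 Pine Ln"]
with tempfile.TemporaryDirectory() as tmp:
    result = count_distinct_on_disk(iter(sample), os.path.join(tmp, "seen.db"))
print(result)
```

The trade-off is speed: every row costs two statements against the file, so a real implementation would likely batch or cache, but the working set no longer has to fit in memory.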

    (This is a general design issue with DataCleaner that is being addressed by a redesign of some of the core components of the DataCleaner "engine", in a pilot project called analyzerbeans.)

    For more info:

  • Message #470

    Putting those in the Value Distribution profile actually sounds great to me, once thought-through! :-)