Profile: Pattern finder

The pattern finder profile analyses the literal values of a column and tries to identify patterns in them. A typical application of the pattern finder is an address field that is expected to follow a uniform structure.

The patterns are identified based on three types and their delimiters:

  • Literal types (words) ~ marked with a's, e.g. "foobar" -> "aaaaaa", "foo bar" -> "aaa aaa" etc.
  • Numeric types (numbers) ~ marked with 9's, e.g. "2200" -> "9999"
  • Mixed types ~ marked with ?'s, e.g. "DK2200" -> "??????"
  • Delimiters ~ replicated (not marked).
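
The marking rules above can be sketched in a few lines of Python. This is only an illustration, not DataCleaner's implementation: the names `to_pattern` and `symbol_for` are hypothetical, and the sketch assumes that a token is a run of ASCII letters and digits and that every other character counts as a delimiter.

```python
import re

# Assumption: a token is a run of ASCII letters/digits; anything else is a delimiter.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def symbol_for(token: str) -> str:
    """Pick the marker for one token: 'a' for words, '9' for numbers, '?' for mixed."""
    if token.isalpha():
        return "a"
    if token.isdigit():
        return "9"
    return "?"


def to_pattern(value: str) -> str:
    """Replace every token with its marker repeated to the token's length.

    Delimiters are not matched by TOKEN, so they are copied through unchanged.
    """
    return TOKEN.sub(lambda m: symbol_for(m.group()) * len(m.group()), value)


for value in ["foobar", "foo bar", "2200", "DK2200"]:
    print(value, "->", to_pattern(value))
```

Run on the values from the list, the sketch yields "aaaaaa", "aaa aaa", "9999" and "??????" respectively.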

Here are some examples:

  • Given the two values "foo bar, 21" and "foob ar, 3230" a single pattern is found:
    pattern          count
    aaaa aaa, 9999   2
  • Given the three values "Mr. John Doe", "Mr. Doe John" and "John Doe" two patterns are found:
    pattern          count
    aa. aaaa aaaa    2
    aaaa aaaa        1
  • Given the four values "DK2100", "DK 2100", "2100" and "2200" three patterns are found:
    pattern          count
    ??????           1
    aa 9999          1
    9999             2
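
The grouping seen in these examples can be approximated with the sketch below. It is a guess at the behaviour, not the profile's actual algorithm: the functions `tokenize` and `find_patterns` are hypothetical, and the sketch assumes that two values share a pattern when their token types and delimiters match in order, with each token widened to the longest length seen in the group (as the first example suggests).

```python
import re

# Assumption: a token is a run of ASCII letters/digits; anything else is a delimiter.
WORD = re.compile(r"[A-Za-z0-9]+")


def tokenize(value):
    """Split a value into ("delim", text) entries and (marker, length) entries."""
    tokens, pos = [], 0
    for m in WORD.finditer(value):
        if m.start() > pos:
            tokens.append(("delim", value[pos:m.start()]))
        text = m.group()
        marker = "a" if text.isalpha() else "9" if text.isdigit() else "?"
        tokens.append((marker, len(text)))
        pos = m.end()
    if pos < len(value):
        tokens.append(("delim", value[pos:]))
    return tokens


def find_patterns(values):
    """Return (pattern, count) pairs in order of first appearance."""
    groups = {}  # signature -> [longest token lengths, count]
    for value in values:
        tokens = tokenize(value)
        # The signature ignores token lengths but keeps delimiters verbatim.
        signature = tuple(t if t[0] == "delim" else t[0] for t in tokens)
        lengths = [0 if t[0] == "delim" else t[1] for t in tokens]
        if signature in groups:
            entry = groups[signature]
            entry[0] = [max(a, b) for a, b in zip(entry[0], lengths)]
            entry[1] += 1
        else:
            groups[signature] = [lengths, 1]
    result = []
    for signature, (lengths, count) in groups.items():
        pattern = "".join(
            part[1] if isinstance(part, tuple) else part * n
            for part, n in zip(signature, lengths)
        )
        result.append((pattern, count))
    return result


for pattern, count in find_patterns(["DK2100", "DK 2100", "2100", "2200"]):
    print(pattern, count)
```

Under these assumptions the sketch reproduces the first and last examples above. For a value that is alone in its group, such as "John Doe", it keeps the value's own token lengths ("aaaa aaa"), so the real profile may size tokens differently.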