Profile: Pattern finder
The pattern finder profile analyses the literal values of a column and tries to identify patterns in them. A typical application of the pattern finder is an address field that one would presume to be structured in a uniform way.
The patterns are identified based on three token types and the delimiters between them:
- Literal types (words) ~ marked with a's, e.g. "foobar" -> "aaaaaa", "foo bar" -> "aaa aaa" etc.
- Numeric types (numbers) ~ marked with 9's, e.g. "2200" -> "9999"
- Mixed types ~ marked with ?'s, e.g. "DK2200" -> "??????"
- Delimiters ~ replicated (not marked).
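The marking rules above can be sketched in Python as follows. This is only a minimal illustration, not DataCleaner's actual implementation: the function name `to_pattern` is hypothetical, and the sketch assumes that a token is any run of letters and digits and that every other character is a delimiter.

```python
import re

# Assumption: a token is a run of letters/digits; anything else is a delimiter.
TOKEN_RE = re.compile(r"[^\W_]+|[\W_]+")


def to_pattern(value):
    """Return the pattern of a single value (hypothetical helper)."""
    parts = []
    for text in TOKEN_RE.findall(value):
        if text.isalpha():
            parts.append("a" * len(text))   # literal type (word)
        elif text.isdigit():
            parts.append("9" * len(text))   # numeric type (number)
        elif text[0].isalnum():
            parts.append("?" * len(text))   # mixed letters and digits
        else:
            parts.append(text)              # delimiter: replicated, not marked
    return "".join(parts)


print(to_pattern("foo bar"))   # aaa aaa
print(to_pattern("2200"))      # 9999
print(to_pattern("DK2200"))    # ??????
```

Note that this helper marks one value at a time; it does not yet group several values into a shared pattern.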
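Here are some examples. As they show, values are grouped when their tokens have the same types and delimiters, regardless of token length; the reported pattern shows the longest token seen at each position: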
- Given the two values "foo bar, 21" and "foob ar, 3230" a single pattern is found:
| pattern | count |
| aaaa aaa, 9999 | 2 |
- Given the three values "Mr. John Doe", "Mr. Doe John" and "John Doe" two patterns are found:
| pattern | count |
| aa. aaaa aaaa | 2 |
| aaaa aaa | 1 |
- Given the four values "DK2100", "DK 2100", "2100" and "2200" three patterns are found:
| pattern | count |
| ?????? | 1 |
| aa 9999 | 1 |
| 9999 | 2 |
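The grouping seen in these examples can be sketched in Python as follows. This is an illustration under assumptions, not DataCleaner's actual code: the names `tokenize` and `find_patterns` are hypothetical, a token is assumed to be any run of letters and digits, and the rule that a pattern widens to the longest token at each position is inferred from the first example.

```python
import re

# Assumption: a token is a run of letters/digits; anything else is a delimiter.
TOKEN_RE = re.compile(r"[^\W_]+|[\W_]+")


def tokenize(value):
    """Split a value into (marker, text) pairs; marker is None for delimiters."""
    tokens = []
    for text in TOKEN_RE.findall(value):
        if text.isalpha():
            marker = "a"      # literal type
        elif text.isdigit():
            marker = "9"      # numeric type
        elif text[0].isalnum():
            marker = "?"      # mixed type
        else:
            marker = None     # delimiter
        tokens.append((marker, text))
    return tokens


def find_patterns(values):
    """Return (pattern, count) pairs in order of first appearance."""
    groups = {}  # signature -> [longest token lengths, count]
    for value in values:
        tokens = tokenize(value)
        # Typed tokens match on type only; delimiters must match literally.
        signature = tuple((m, t if m is None else None) for m, t in tokens)
        lengths = [len(t) for _, t in tokens]
        if signature in groups:
            entry = groups[signature]
            entry[0] = [max(a, b) for a, b in zip(entry[0], lengths)]
            entry[1] += 1
        else:
            groups[signature] = [lengths, 1]
    result = []
    for signature, (lengths, count) in groups.items():
        pattern = "".join(
            text if marker is None else marker * n
            for (marker, text), n in zip(signature, lengths)
        )
        result.append((pattern, count))
    return result


print(find_patterns(["foo bar, 21", "foob ar, 3230"]))
# [('aaaa aaa, 9999', 2)]
print(find_patterns(["DK2100", "DK 2100", "2100", "2200"]))
# [('??????', 1), ('aa 9999', 1), ('9999', 2)]
```

Keeping only the token types and the literal delimiters in the grouping key is what lets "foo bar, 21" and "foob ar, 3230" fall into the same pattern even though their tokens differ in length.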
