Data Matching
Data matching involves bringing together data from different sources and comparing it. DataFix is an expert in finding definitive data sources for customer applications and has assisted owners of unique data sources in developing cooperative arrangements with organizations that need to use their data.
DataFix has developed a series of proprietary matching algorithms and methods that can be customized to meet the exact needs of individual clients. While our core business focuses on matching residential and commercial names and addresses, these algorithms have also been applied successfully to other matching situations.
When matching disparate data sources, relying on exact matching algorithms alone can significantly reduce overall match rates. Variations in name formatting and spelling, minor differences in mailing or location addresses, and missing, incomplete, or incorrect information can all prevent otherwise valid matches from being found. Over the past ten years, DataFix has developed a number of advanced algorithms to resolve these types of matching problems.
Matching Algorithm Overview
Numerous algorithms have been developed to allow for so-called ‘fuzzy’ matching, including Soundex, NYSIIS, Jaro-Winkler, and many different types of character transposition functions. While these approaches are useful for matching specific fields (such as comparing a correct and a misspelled last name), they are not capable of accurately comparing larger blocks of data, such as full names or complete addresses.
Accordingly, DataFix has developed a series of algorithms and programs that utilize token-based matching in conjunction with various fuzzy-matching functions. For a given block of source data (for example, “8 KING ST E SUITE 600”), each token is compared to records from the target dataset. The order of the tokens is not important, so an address formatted like “SUITE 600 8 KING ST E” would be considered a perfect match.
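The order-independence described above can be illustrated with a minimal Python sketch. It is not the DataFix implementation; the function names `tokenize` and `same_tokens` are hypothetical, and splitting on whitespace after upper-casing is an assumed tokenization rule.

```python
from collections import Counter


def tokenize(block: str) -> list[str]:
    """Split a data block into upper-case tokens (assumed rule: whitespace)."""
    return block.upper().split()


def same_tokens(source: str, target: str) -> bool:
    """True when both blocks contain the same tokens, regardless of order.

    A Counter (multiset) is used so that repeated tokens are respected.
    """
    return Counter(tokenize(source)) == Counter(tokenize(target))


if __name__ == "__main__":
    print(same_tokens("8 KING ST E SUITE 600", "SUITE 600 8 KING ST E"))
```

In this sketch, “8 KING ST E SUITE 600” and “SUITE 600 8 KING ST E” compare as equal because only the set of tokens matters, not their position.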
Furthermore, when comparing individual tokens, a number of different algorithms are used. Perfect token matches (“KING” to “KING”, for example) generate the highest score. However, if the tokens do not match exactly, they are compared using transposition algorithms (“KING” and “IKNG”), ‘begins with’ tests (“KING” and “KIN”), or simply a first-character check. A different score is assigned for each test.
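A tiered token comparison of this kind might look like the following Python sketch. The score values (10, 8, 6, 2, 0), the function names, and the restriction to a single swap of adjacent characters are all assumptions made for illustration; the actual DataFix scores and tests are not published here.

```python
# Hypothetical score values, chosen only for illustration.
EXACT, TRANSPOSED, PREFIX, FIRST_CHAR, NO_MATCH = 10, 8, 6, 2, 0


def is_adjacent_transposition(a: str, b: str) -> bool:
    """True when b is a with exactly one pair of adjacent characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


def token_score(a: str, b: str) -> int:
    """Score two tokens by trying each test from strictest to loosest."""
    if not a or not b:
        return NO_MATCH
    if a == b:
        return EXACT
    if is_adjacent_transposition(a, b):
        return TRANSPOSED
    if a.startswith(b) or b.startswith(a):
        return PREFIX
    if a[0] == b[0]:
        return FIRST_CHAR
    return NO_MATCH


if __name__ == "__main__":
    for pair in [("KING", "KING"), ("KING", "IKNG"), ("KING", "KIN"), ("KING", "KENT")]:
        print(pair, token_score(*pair))
```

Ordering the tests from strictest to loosest means each pair of tokens receives the score of the strongest test it passes.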
Using this type of logic, the DataFix matching routines can provide a computed score plus a maximum possible score for each block of name or address data. With these numbers, a single percentage score can be assigned to each data block.
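The computed score, maximum possible score, and percentage can be sketched as follows. This is an illustration under stated assumptions rather than the DataFix routine: the per-token scorer is a simplified placeholder, the score values are hypothetical, each source token is paired greedily with its best remaining target token, and the maximum is taken as the top token score multiplied by the number of source tokens.

```python
from typing import Callable

MAX_TOKEN_SCORE = 10  # hypothetical score for a perfect token match


def simple_token_score(a: str, b: str) -> int:
    """Placeholder scorer: exact match, then 'begins with', else no match."""
    if a == b:
        return MAX_TOKEN_SCORE
    if a.startswith(b) or b.startswith(a):
        return 6  # hypothetical partial score
    return 0


def block_score(
    source: str,
    target: str,
    scorer: Callable[[str, str], int] = simple_token_score,
) -> tuple[int, int, float]:
    """Return (computed score, maximum possible score, percentage)."""
    source_tokens = source.upper().split()
    remaining = target.upper().split()
    computed = 0
    for token in source_tokens:
        if not remaining:
            break
        # Greedy pairing: take the best-scoring unused target token.
        best = max(remaining, key=lambda t: scorer(token, t))
        score = scorer(token, best)
        if score > 0:
            computed += score
            remaining.remove(best)  # each target token is used once
    maximum = MAX_TOKEN_SCORE * len(source_tokens)
    percentage = 100.0 * computed / maximum if maximum else 0.0
    return computed, maximum, percentage


if __name__ == "__main__":
    print(block_score("8 KING ST E SUITE 600", "SUITE 600 8 KING ST E"))
    print(block_score("8 KING ST E", "8 KIN ST E"))
```

Under these assumptions, a reordered but otherwise identical address scores 100 percent, while a block with one truncated token scores somewhat lower. Note that this sketch does not penalize extra tokens in the target; a production routine would likely need to account for them.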
Why is Data Matching Needed?
- To group records that describe the same entity, for example when creating a single customer database or a prospect list for sales calls
- To identify duplicate records in customer files or prospect lists
- To match new account applications to existing customer and sales campaign files
- To check compliance in order to discover unmatched, mis-matched, or irregular records
Benefits
Using DataFix matching techniques could significantly improve the quality of your database and result in:
- New information from existing data
- The ability to match and use data that is considered to be of poor quality, and to find missing, erroneous, and inconsistent data
- The ability to access data in any format and from any source, quickly and easily
- The ability to obtain information from an unlimited number of sources using flexible criteria
- An enhanced database built from information shared across a number of sources