|
CLEAN DATA IS THE FOUNDATION of marketing success. You can learn more about how data hygiene (getting "clean data") fits into the "big picture" of data-driven marketing by reviewing What Goes Into a management analytics database? |
WHAT IS DATA HYGIENE?
Data Hygiene refers to maintaining your data in a way that is complete, consistent and correct. In the technical language of databases and computers, this is called "data integrity". It's simple to understand even if it may take some work to achieve.
"Complete" means that each record in your database contains each piece of data that is essential. For example, each customer or patron record must have a name and address information before it is useful.
"Consistent" means that all of the data of a certain type are stored in the same way. For example, each customer's name is stored as first-name-first ("FNF") or last-name-first ("LNF") -- but not intermixed. Another example is would be that no comments or notes be stored in a field reserved for ADDRESS.
"Correct" means that the data are accurate and free from typos and wrong information. For example, it means that each customer's name is spelled correctly and that their zipcode does not have any missing or reversed digits.
Errors like these are unavoidable, of course. No database is completely free from errors. Nevertheless, each instance of a data hygiene problem reduces the value of a database. If there are many such problems, it can require quite a lot of work, money and time to clean up the database enough to be able to profitably use it for marketing purposes.
HERE ARE EXAMPLES OF COMMON DATA HYGIENE PROBLEMS
Here is a list of some of the common data hygiene errors ("Database Bullets to Dodge ") that we have encountered. See if you recognize any of these as being familiar. The bad news is that every database -- even small ones -- have some of these problems. The good news is that Management Analytics Group has developed several tools and procedures for fixing most of the problems listed below.
Dodge this Bullet... |
that shows up this way... |
to avoid this problem. |
|---|---|---|
| Two unrelated people in one name field. | Jane Doe & Bob Smith | Causes duplicate mailings to same household because records may not match during de-duping. |
| “Family” names. | First:
The |
Causes multiple mailings to same address because records don’t match during de-duping. Also may cause address to fail NCOA updates because address will not code. |
| Blank Lastname fields. | First: Jerry Last: (blank) Apartment 4 123 Anystreet Yourtown, MO 64444 |
In de-duping, “blank” matches anything. Blank LASTNAMEs at a business, an apartment or other multi-unit address will dupe out all other names at that address. This can cause serious omissions. |
| Company name in “Last Name” field. | First: John Middle: Smith Last: ABC Company, Inc. |
Causes multiple mailings to same address because company names are treated differently for de-duping. |
| No apartment or suite number. | 123 Anystreet (which is a multi-unit building) |
Address may code, may or may not de-dupe but will NOT be deliverable. The route carrier may discard it. |
| Not tracking first name. | Mr. Green | Address changes may not process through NCOA and this may cause address to expire. |
| Not updating for NCOA at least every year. | Large amounts of returned or discarded mail. | NCOA uses a three-year “moving window”. If the list is not checked with NCOA at least once during the three-year cycle, lots of names will come back uncoded and thus become undeliverable. |
| Mailing to a non-existant address. | 457 Anystreet (when there is no building at 457 but the person lives next door at 455 or across the street at 456) |
New postal regulations require mail to be sent only to addresses that physically exist. Until recently, route carriers could adjust for address errors, but this is no longer allowed. |
| Not tracking address suffix (Street, Avenue, Terrace, Place, Circle, etc.) | 123 West
72nd instead of 123 West 72nd Street |
If there is a 72nd STREET and a 72nd TERRACE (or PLACE or CIRCLE or DRIVE or AVENUE, etc.) address will be unmailable. |
| Not tracking directionals (West, East, etc.) | 123 Main
Street instead of 123 West Main Street |
If directionals are meaningful (i.e., there is both a 100 WEST Main and a 100 EAST Main), address will be unmailable. |
| Having both a WORK address and a RESIDENTIAL address for the same person as two separate records in the database. | John Brown 123 Main Street and the same person, John Brown 3456 West 14th St. |
These will always be treated as two separate people unless an Aliases table is provided. (It's better to store the work and home address in the same record with the name.) |
| Different street numbers for the same person. | Jane Doe 123 West Main Street and the same person, Jane Doe 132 West Main Street |
These are always treated as two people who live at two different addresses. They just happen to have the same name. |
| Missing or Incorrect zipcode. | Northtown,
ME 6444 (the leading 0 is missing in the zip) |
Street numbers and street names are cross-checked against zipcodes. If there is a discrepancy but the address is otherwise OK, we can sometimes fix the zipcode. However, if there is any kind of problem with the address and the zipcode doesn’t match, the address usually will not code at all and must be discarded. Zipcodes also over-ride city name and state data. Generally, if the zipcode is wrong, the address is wrong and the patron will not receive the mail sent. |
| Mis-spelled Street Name. | 123 Wste Mian Street | We sometimes can catch and fix these errors but only if the Zip and Zip-4 codes are correct. |
| Using “Address 1” or “Address2” fields for non-essential information. | Jane Doe The Executive Apartments Apartment 14 123 Anystreet Yourtown, MO 64444 |
Only two address lines – Address1 and Address2 – are used to code the address. Information on other address fields is generally discarded. Sometimes this makes the address unusable. (In the example to the left, put "Apt. 14" on the same line as "123 Anystreet.") |
| Using Address1 or Address2 fields for COMPANY name. | Jane Doe XYZ Corporation Hightower Bldg, Suite 100 123 Anystreet Yourtown, MO 64444 |
If all meaningful address information fits on Address2, this can be OK – but if the COMPANY name displaces meaningful address information, the address probably won’t code and must be discarded. In addition, the address may not de-dupe properly. |
| Mixing foreign addresses in with domestic addresses. | Jane Doe 14 Surry Lane Oxfordshire, South Wales England 7H7C3B |
Foreign addresses don’t code through NCOA. They need to be processed separately. |
| Using any address field for DO NOT CONTACT or DECEASED. | Jane
Doe |
Two problems. First, the address may not code -- but worse, it might! Second and more importantly, crucial DO NOT CONTACT information is lost because the system does not “look for” DNC flags in address fields. This can cause embarrassment, customer frustration and even civil penalties under the new and pending DNC laws! (The solution is to set up a field just for flagging DNC names.) |
| Using any address field for notes and comments. | Jane Doe PU VISA #123456690234 123 Anystreet Yourtown, MO 64444 |
Three problems: First, the address may not code -- but it might, which may be worse. Second, valuable contact information is lost. Third, and most importantly, you may reveal private information to anyone who looks. No one wants their credit card, account number or private information on the front of their mail! |
Same patron in system as J. Smith and John Smith. |
J. Smith 123 Anystreet Yourtown, MO 64444 and also John Smith 123 Anystreet Yourtown, MO 64444 |
Actually, this is not really a problem. If the address is otherwise OK, the duplicate record will be spotted on the basis of last name and same exact address. |
| Names longer than 35 characters. | Alexandra
Elizabeth Longmaidenname-Longmarriedname |
Truncation results and de-duping results are unpredictable. |
| Address names longer than 35 characters. | The Beautiful Oaks Twin Towers Complex, Apartment 1004 | Truncation results. In this example, the apartment number will be dropped and the mail may be undeliverable. |
| City names longer than 25 characters. | Truncation results. (There are no cities with names longer than 25 characters.) | |
| Non-printing (hidden or invisible) characters in any field. | (You can't see them!) | Causes all nature of problems, headaches and extra processing costs to clean tabs and other non-printing characters out of the data. Single and double quotes (' and ") are especially troublesome because databases often treat these characters delimters. |
| Sending data in formats other than ASCII CSV, ACCESS or DBF tables. | Backup formats DAT formats Macintosh formats Many database formats Some compressed file formats |
Exotic data formats require extra processing to be usable. Often, proprietary formats (such as produced by many backup systems) cannot be read at all. |
| Expecting 100% accuracy in mailing and/or telemarketing lists. | We mailed
10,000 pieces and 52 people called to complain about receiving multiple
pieces -- and 441 were returned! (To put this into perspective: 52 out of 10,000 is less than 1/2 of 1%. 441 out of 10,000 is 4.4%.) |
Complex computer logic is used to determine what addresses are and are not duplicates. Yet even with clean data, the process is it is not perfect. Expect 3% to 8% of your contact lists and mailings to be somehow flawed. The error rate depends upon many factors. One factor is the city. In Atlanta, for example, “Peachtree” appears quite frequently in address data and can confuse de-duping logic. In Washington, DC, accurate and complete street directionals are essential. In Kansas City, suffixes are critical because nearly every numbered “Street” has a “Terrace” – and often a “Place” or “Circle”, too. Another factor is the concentration of ethnic groups. Hispanic and Asian neighborhoods are especially challenging. In these areas, there are many more people with the same last name than in the general US population. When the names are short and/or sound similar, de-duping logic can easily get confused. |
WHAT'S THE BOTTOM LINE?
The best time to take care of data hygiene is during data entry – before “dirty data” contaminate your database. That is, "dodge" these "database bullets" before they hit you. Investing in the effort to train your data entry team about these common problems always pays off in avoided costs and problems.
And, when using lists obtained from list brokers or other organizations, be extra careful about merging in "dirty" data. As with other kinds of hygiene problems, prevention is a lot easier and less costly than a cure!
Questions? Just ask! [Back to How-to Resources] [Top] |
