
Enhancement -- Plain or Fancy

Learn the difference between ZIP and HOUSEHOLD data overlays for modeling and when to use each.

by Alan Weber, Advisory Consultant, Management Analytics Group
(Published in Target Arts magazine, 5/1996)
A 747 IS BETTER THAN A DC-3, a luxury car is better than an economy model and a steak and lobster dinner is a lot better than a burger and fries. On the other hand, a lot of flights would be impractical if all we had were 747s, not everyone can afford a luxury car, and who has the time or money for a fancy meal every day?

Statistical models are useful for making predictions. Various inputs (or independent variables) like income, type of dwelling and family size are used to estimate a response such as likelihood of purchase. Statistical techniques like regression or chi-square analysis are used to create a specific formula that best predicts the desired response. The formula must be able to predict with some accuracy to be considered a model. Most models are created by professionals, who should have both statistical training and access to inputs that they can use to create their models.

Since models are made by professionals, garden variety direct marketers don't need to know how to make one. But they do need to know what a model will do for them and what kind of model will work best for a specific situation. Ordinary drivers don't need to know how to build a Buick, but they do need to know whether they need a pickup truck, a van or a sedan.

Household-level models are usually better than ZIP models at identifying or describing potential new customers, but that doesn't mean they are always practical, affordable or desirable. You wouldn't use a 747 to fly small loads to small airports. You wouldn't use a DC-3 to get across town. Choosing an appropriate type of model is just the same. There are costs and benefits associated with each type of model that vary from situation to situation. It is as important to know when to use one or the other as it is to know when not to use one at all.

To use either type of model, the customer and prospect files must first be enhanced. By enhanced, we mean adding information to the file beyond the things we already know like purchase history or list source. The reason the customer file is enhanced is to profile its demographics in the hope that the information can be used to find prospects who are similar to existing customers and who should respond at a higher rate than dissimilar prospects.

There are two basic ways to enhance a customer file. You can either add information based upon the area where a customer lives or you can add information about that specific individual customer's household.

Most modelers agree that household-level data are generally more accurate for making predictions than area-level ("ZIP") data. On the other hand, it costs more to enhance a file using household-level data than ZIP-level data. Household-level data can cost $30 to $50 per thousand names matched (or even more), depending upon the size of the file and the type of data being added. In contrast, census-based ZIP-level data can be purchased on a CD-ROM disk for under $1,000 and thus costs very little on a per-name basis.
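The cost difference described above can be worked through with a little arithmetic. The sketch below is only an illustration: the file size and the number of matched names are assumed figures, and the function names are made up for this example. The $30 to $50 per thousand and the $1,000 flat fee come from the figures quoted above.

```python
def cost_per_thousand_total(names_matched: int, rate_per_thousand: float) -> float:
    """Total charge when a supplier bills per thousand names matched."""
    return names_matched / 1000 * rate_per_thousand


def flat_fee_per_name(flat_fee: float, names_enhanced: int) -> float:
    """Per-name cost when the data are bought once for a flat fee."""
    return flat_fee / names_enhanced


# Assumed file: 200,000 names, of which 140,000 match at the household level.
FILE_SIZE = 200_000
NAMES_MATCHED = 140_000

household_low = cost_per_thousand_total(NAMES_MATCHED, 30.0)
household_high = cost_per_thousand_total(NAMES_MATCHED, 50.0)
zip_per_name = flat_fee_per_name(1_000.0, FILE_SIZE)

print(f"Household overlay: ${household_low:,.0f} to ${household_high:,.0f}")
print(f"ZIP overlay: ${zip_per_name:.3f} per name on first use")
```

On these assumed figures the household overlay runs from $4,200 to $7,000 for one pass, while the ZIP data work out to half a cent per name and can be reused on later files.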

Either or both methods are likely to produce a certain amount of lift in response rates if used in a predictive model. However, there is no guarantee that either method will provide enough lift to be cost effective. It is important to look at estimates of how each type of model should perform to decide which to test. Testing the models in actual use and measuring the results is crucial for long-term success.

TYPES OF ENHANCEMENTS

Household-level data include information such as income, age of resident, home value, type of dwelling, presence of children and type and number of vehicles. Household-level data are specific to each household and can be quite detailed in terms of lifestyle information, financial information and whether its members are direct response buyers. Usually, however, the more detailed the particular pieces of information are, the fewer the households that will have that information recorded in the database -- thus there will be fewer matches based upon such information.

Area-level (ZIP code) data can be based upon the five-digit ZIP code or smaller areas like ZIP+4, census tract or block group. Area-level data include information such as average home value, percentage of adults who have a college degree, types of jobs and industries in the area and average income.

Area-level data paint a broader picture of what type of people live there. Area-level data cannot be used to determine the age of each buyer, but can clearly distinguish Manhattan Island from Buffalo, Wyoming. If you're selling saddles or theater magazines, you will probably find area-level data to be a big help. Particularly when the consumer buying the product wants to "keep up with the Joneses", area-level data tell what profile the Joneses are likely to have.

MAKING THE MOST OF ENHANCEMENT DATA

To effectively use any type of enhancement data, the cataloger must:

  1. Add the enhancement data to each file being used.
  2. Create a specific statistical model using the data for scoring, selecting or de-selecting names.

In the case of ZIP code-level information, you should expect nearly 100% of your customer file to be enhanced. The remainder are likely to have bad ZIP codes. Household-level data, depending upon the source, may fail to match anywhere from 15% to 40% of your file. Non-matching records will not have enhancement data, and the model may treat them as one group.

If you are using household-level data, you must first enhance the entire file with the information you want. Then you should select only the names that the statistical model scores high enough to mail. To put it another way, you enhance the data for every name and then use only a portion of the names. You have to enhance the data whether you use the name or not.
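The enhance-everything, mail-a-portion sequence can be sketched as follows. This is a toy illustration, not a real model: the logistic formula stands in for whatever technique a modeler would actually choose, and the coefficients, field names and cutoff are all invented for the example. A real model would estimate its coefficients from response data.

```python
import math

# Hypothetical coefficients, invented for illustration only.
INTERCEPT = -4.0
WEIGHTS = {"income_thousands": 0.02, "has_children": 0.5, "owns_home": 0.3}


def response_score(household: dict) -> float:
    """Logistic-style score: an estimated probability of response."""
    z = INTERCEPT + sum(WEIGHTS[k] * household.get(k, 0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def select_names(enhanced_file: list, cutoff: float) -> list:
    """Keep only the names whose score meets the mailing cutoff."""
    return [h for h in enhanced_file if response_score(h) >= cutoff]


# Every name below has already been enhanced (and paid for).
enhanced_file = [
    {"name": "A", "income_thousands": 80, "has_children": 1, "owns_home": 1},
    {"name": "B", "income_thousands": 30, "has_children": 0, "owns_home": 0},
    {"name": "C", "income_thousands": 50, "has_children": 1, "owns_home": 0},
]

to_mail = select_names(enhanced_file, cutoff=0.05)
print(f"Enhanced {len(enhanced_file)} names, mailing {len(to_mail)}")
```

The point of the sketch is the mismatch in the last line: the enhancement bill covers every name in the file, but only the names above the cutoff ever receive a catalog.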

A low match rate by household will often force the model to default to area-level data. For example, if household data are available for only 45% of the file and the other 55% has only ZIP+4-level data, most households will (obviously) not be modeled on household-level data. Non-matching names must, by default, be modeled either upon area-level data or not at all.
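The fallback described above (household data first, area data next, otherwise nothing) can be sketched as a simple lookup. The record layout, the keys and the overlay values below are hypothetical and exist only to show the logic.

```python
from collections import Counter


def data_level(record: dict, household_data: dict, area_data: dict) -> str:
    """Return the most specific level of enhancement data available for a name."""
    if record["household_id"] in household_data:
        return "household"
    if record["zip"] in area_data:
        return "area"
    return "none"


def level_counts(records: list, household_data: dict, area_data: dict) -> Counter:
    """Count how many names fall into each level."""
    return Counter(data_level(r, household_data, area_data) for r in records)


# Hypothetical overlays keyed by household ID and by ZIP code.
household_data = {"H1": {"income": 62_000}, "H3": {"income": 48_000}}
area_data = {"10021": {"avg_home_value": 410_000}, "82834": {"avg_home_value": 95_000}}

records = [
    {"household_id": "H1", "zip": "10021"},
    {"household_id": "H2", "zip": "82834"},  # no household match, falls back to area
    {"household_id": "H3", "zip": "82834"},
    {"household_id": "H4", "zip": "00000"},  # bad ZIP, no overlay at all
]

counts = level_counts(records, household_data, area_data)
for level in ("household", "area", "none"):
    print(f"{level}: {counts[level]} of {len(records)}")
```

Running a count like this before any model is built shows how much of the file a household-level model would actually cover.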

A low match rate using a household-level model is perhaps the worst of both worlds. You have all the fixed costs of enhancing the file and creating a household-level model. Then you either create what amounts to a second model for area-level data and do not market to non-matching households, or you simply do not use any model for the names that do not match on the household level.

One way to avoid this situation is to enhance a sample of the data first to determine what sort of match rate can be achieved by the supplier of the enhancement data. If the match rate is low, or if a review of the data gives the appearance that a household-level model will not be a good investment, then stop right there! Do not proceed with either the supplier of the enhancement data or the household-level model.
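A sample test of this kind might look like the sketch below. It assumes the supplier's response can be represented as a set of matched keys; in practice the sample would be sent out and the matches returned. The 60% threshold is an arbitrary assumption for illustration, not a recommendation from this article.

```python
import random

MIN_MATCH_RATE = 0.60  # assumed go/no-go threshold, chosen only for illustration


def sample_match_rate(customer_keys: list, supplier_keys: set,
                      sample_size: int, seed: int = 0) -> float:
    """Estimate the match rate from a random sample of the customer file."""
    if not customer_keys:
        raise ValueError("customer file is empty")
    rng = random.Random(seed)  # seeded so the same sample can be redrawn
    size = min(sample_size, len(customer_keys))
    sample = rng.sample(customer_keys, size)
    matched = sum(1 for key in sample if key in supplier_keys)
    return matched / size


# Simulated data: the supplier can match every second name on the file.
customer_keys = [f"CUST{i:05d}" for i in range(10_000)]
supplier_keys = {key for i, key in enumerate(customer_keys) if i % 2 == 0}

rate = sample_match_rate(customer_keys, supplier_keys, sample_size=500)
decision = "proceed" if rate >= MIN_MATCH_RATE else "stop and look elsewhere"
print(f"Sample match rate: {rate:.0%} -> {decision}")
```

Paying to enhance a few hundred names is a cheap way to learn the match rate before committing to the whole file.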

Find a source that will have a higher match rate or consider an area-level model instead. Remember that the problem could be the names on your file. Outdated and incorrect addresses will not match, so be sure to clean your list first to achieve the highest match rates possible.

PRACTICAL LIMITS ON USING DATA

When modeling existing customers, purchase information will most likely be of great value and enhancement data will be of little use. Plan from the start to do a good job with a Recency-Frequency-Monetary ("RFM") analysis when marketing to existing customers and forget about using enhanced data.
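For readers who have not seen one, a bare-bones RFM scoring might look like the sketch below. It ranks customers on each of the three measures and assigns a score from 1 (worst) to 5 (best). The field names and the five-bin scheme are assumptions made for this example, and there are many other ways to build an RFM analysis.

```python
def rank_scores(values: list, bins: int = 5, low_is_best: bool = False) -> list:
    """Score each value from 1 (worst) to `bins` (best) by its rank.

    Tied values receive adjacent ranks in file order, which is a
    simplification acceptable only for a sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=low_is_best)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * bins // len(values) + 1
    return scores


def rfm_scores(customers: list, bins: int = 5) -> list:
    """Return a (recency, frequency, monetary) score tuple for each customer."""
    r = rank_scores([c["days_since_last_order"] for c in customers], bins, low_is_best=True)
    f = rank_scores([c["orders"] for c in customers], bins)
    m = rank_scores([c["total_spend"] for c in customers], bins)
    return list(zip(r, f, m))


# Hypothetical purchase histories.
customers = [
    {"days_since_last_order": 10, "orders": 5, "total_spend": 500},
    {"days_since_last_order": 400, "orders": 1, "total_spend": 40},
    {"days_since_last_order": 90, "orders": 2, "total_spend": 120},
    {"days_since_last_order": 30, "orders": 8, "total_spend": 900},
    {"days_since_last_order": 200, "orders": 3, "total_spend": 60},
]

for customer, score in zip(customers, rfm_scores(customers)):
    print(score, customer)
```

Note that everything the scoring needs is already in the purchase file, which is why no enhancement cost is involved.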

The exceptions to this are cross-selling and up-selling when customer groups or product offers/channels are substantially different. Enhancement data can be of great value in deciding which retail buyers should receive a catalog or which buyers of insurance are likely to purchase investments.

A model cannot select more names than it was given. A model is of no value if it consistently selects all of the names that it can select from. All names, whether selected or not, must be enhanced at some cost prior to being used in the model. This adds a fixed cost to every mailing that must be more than covered by increases in profits from achieving higher response rates or sales per catalog as the result of the modeling.
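The break-even point implied above can be estimated with simple arithmetic. If every name costs e to enhance, only a fraction f of the names are mailed, and each order yields a profit of m, then the response rate on the mailed names must rise by at least e / (f x m) to repay the enhancement. The sketch below assumes that simplified view; it ignores model-building fees and any postage saved on names not mailed, and all the figures are invented.

```python
def required_response_gain(enhance_cost_per_name: float,
                           fraction_mailed: float,
                           profit_per_order: float) -> float:
    """Extra response rate needed on mailed names to repay enhancement costs.

    Enhancing N names costs N * e. Mailing f * N of them with a response
    gain of g earns f * N * g * m, so break-even is g = e / (f * m).
    """
    if not 0 < fraction_mailed <= 1:
        raise ValueError("fraction_mailed must be between 0 and 1")
    if profit_per_order <= 0:
        raise ValueError("profit_per_order must be positive")
    return enhance_cost_per_name / (fraction_mailed * profit_per_order)


# Assumed figures: $40 per thousand to enhance, half the names mailed,
# $20 profit per order, 2% response rate without a model.
gain = required_response_gain(0.04, 0.5, 20.0)
base_rate = 0.02

print(f"Response must rise by {gain:.2%} of names mailed")
print(f"That is a lift of {gain / base_rate:.0%} over the base rate")
```

On these assumed numbers the model must raise response from 2.0% to about 2.4% just to pay for the data, and the smaller the fraction mailed, the larger the gain required.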

There is a partial exception to this rule, but it applies only to lists where the names are already enhanced. In this case, some sort of household model must be applied to select any names at all.

A compiled file normally has some kind of household-level information available with each name. A response file normally does not have such demographic data available to select. Catalogers normally mail only to response lists since they know that previous direct response buyers are more profitable to market to than buyers whose likelihood of buying through the mail is unknown.

According to Don Pelley, director of marketing for Houston, TX-based DMSI (Dynamic Marketing Services, Inc.), modeling beyond ZIP level is usually impractical for direct mailers. However, for retailers who want to target their mailings, a household-level model based upon compiled lists is nearly essential. The key is to match the type of model to the offer and lists available.

THE NEED TO MODEL

Models based upon either area-level or household-level data cannot replace circulation planning, accurate source code tracking and RFM modeling. Indeed, if these are not in place first, any prospect model is likely to fail. The fact that no model is being used does not mean that making one should be a company's top priority. Models should be judged only by what they make or save a company relative to what they cost. If modeling doesn't improve the bottom line, it shouldn't be used at all.

Sometimes the best way to get from point A to point B is by airplane, sometimes it is by car and sometimes it's just by walking. By the same token, sometimes the best model is based upon household data, sometimes it is based upon area-level data and sometimes the best model is no model at all. It all depends upon the offer, the data and the situation.

© Copyright 2009 Management Analytics Group LLC. All rights reserved.