Analysis can be viewed as the categorization, aggregation, and manipulation of data to obtain answers to the research questions underlying the research project. The final aspect of analysis is interpretation. The process of interpretation involves taking the results of analysis, making inferences relevant to the research relationships studied, and drawing managerially useful conclusions about these relationships.
Basic Concepts of Analyzing Associative Data

Our chapter begins with a brief discussion of cross-tabulation to mark the beginning of a major topic of this chapter—the analysis of associative data. A large part of the rest of the chapter will focus on methods for analyzing how the variation of one variable is associated with variation in other variables.
The computation of row or column percentages in the presentation of crosstabulations is taken up first. We then show how various insights can be obtained as one goes beyond two variables in a cross-tabulation to three (or more) variables. In particular, examples are presented of how the introduction of a third variable can often refine or explain the observed association between the first two variables.
Bivariate Cross-Tabulation
Bivariate cross-tabulation represents the simplest form of associative data analysis. At the minimum we can start out with only two variables, such as occupation and education, each of which has a discrete set of exclusive and exhaustive categories. Data of this type are called categorical, since each variable is assumed to be nominal-scaled. Bivariate cross-tabulation is widely used in marketing applications to analyze variables at all levels of measurement. In fact, it is the single most widely used bivariate technique in applied settings. Reasons for the continued popularity of bivariate cross-tabulation include the following (Feick, 1984, p. 376):
1. It provides a means of data display and analysis that is clearly interpretable even to the less statistically inclined researcher or manager.

2. A series of bivariate tabulations can provide clear insights into complex marketing phenomena that might be lost in a simultaneous analysis with many variables.

3. The clarity of interpretation affords a more readily constructed link between market research and market action.

4. Bivariate cross-tabulations may lessen the problems of sparse cell values that often plague the interpretation of discrete multivariate analyses (bivariate cross-tabulations require that the expected number of respondents in any table cell be at least 5).
The entities being cross-classified are usually people, objects, or events. The cross-tabulation, at its simplest, consists of a simple count of the number of entities that fall into each of the possible categories of the cross-classification. Excellent discussions of ways to analyze cross-tabulations can be found in Hellevik (1984) and Zeisel (1957).
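At its simplest, the count of entities per cell can be produced with a few lines of Python; the (occupation, education) pairs below are hypothetical, not taken from the chapter's tables:

```python
from collections import Counter

# Hypothetical (occupation, education) pairs for six respondents;
# the categories are illustrative only
records = [
    ("white collar", "college"), ("white collar", "college"),
    ("white collar", "high school"), ("blue collar", "high school"),
    ("blue collar", "high school"), ("blue collar", "college"),
]

# The cross-tabulation, at its simplest, is a count of the entities
# that fall into each cell of the cross-classification
table = Counter(records)
```

Each key of `table` is one cell of the cross-classification, and its value is the raw frequency for that cell.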
However, we usually want to do more than show the raw frequency data. At the very least, row or column percentages (or both) are usually computed.
Percentages
The simple mechanics of calculating percentages are known to all of us. In cross-tabulation, percentages serve as a relative measure, indicating the size of two or more categories relative to a common base.
The ease and simplicity of calculation, the general understanding of its purpose, and its near-universal applicability have made the percent statistic, or its counterpart the proportion, the most widely used statistical tool in marketing research. Yet its simplicity of calculation is sometimes deceptive, and the understanding of its purpose is frequently insufficient to ensure sound application and interpretation. The result is that the percent statistic is often the source of misrepresentations, either inadvertent or intentional.
The sources of problems in using percentages lie largely in identifying the direction in which percentages should be computed and in knowing how to interpret percentage change.
Both these problems can be illustrated by a small numerical example that uses a before and after with control experimental design. Let us assume that KEN’s Original, a small regional manufacturer of salad dressings, is interested in testing the effectiveness of spot TV ads in increasing consumer awareness of a new brand called Life. Two geographic areas are chosen for the test: (1) test area A and (2) control area B. The test area receives five 15 second television spots per week over an eight-week period, whereas the control area receives no spot TV ads at all. (Other forms of advertising were equal in each area.)
Assume that four independent samples (of telephone interviews) were conducted as before and after tests in each of the areas. Respondents were asked to state all the brands of salad dressing they could think of, on an unaided basis. If Life was mentioned, it was assumed that this constituted consumer awareness of the brand. However, as it turned out, sample sizes differed across all four sets of interviews. This common occurrence in surveys (i.e., variation in sample sizes) increases the value of computing percentages.
Table 12.1 shows a set of frequency tables that were compiled before and after a TV ad for Life Salad Dressing was aired. Interpretation of Table 12.1 would be hampered if the data were expressed as raw frequencies and different percentage bases were reported. Accordingly, Table 12.1 shows the data, with percentages based on column and row totals. Can you see why the row percentages are more useful for analytical purposes?
Direction in Which to Compute Percentages
In examining the relationship between two variables, it is often clear from the context that one variable is more or less the independent or control variable and the other is the dependent or criterion variable. In cases where this distinction is clear, the rule is to compute percentages in the direction of the independent (control) variable, so that the categories within each of its levels sum to 100 percent, and then to compare those percentages across its levels.
In Table 12.1, the control variable is the experimental area (test versus control) and the dependent variable is awareness. When comparing awareness in the test and control areas, row percentages are preferred. We note that before the spot TV campaign the percentage of respondents who are aware of Life is almost the same between test and control areas: 42 percent and 40 percent, respectively.
However, after the campaign the test-area awareness level moves up to 66 percent, whereas the control-area awareness (42 percent) stays almost the same. The small increase of 2 percentage points reflects either sampling variability or the effect of other factors that might be serving to increase awareness of Life in the control area.
On the other hand, computing percentages across the independent variable (column percent) makes little sense. We note that 61 percent of the aware group (before the spot TV campaign) originates from the test area; however, this is mainly a reflection of the differences in total sample sizes between test and control areas.
After the campaign we note that the percentage of aware respondents in the control area is only 33 percent, versus 39 percent before the campaign. This may be erroneously interpreted as indicating that spot TV in the test area depressed awareness in the control area. But we know this to be false from our earlier examination of the row percentages.
It is not always the case that one variable is clearly the independent or control variable and the other is the dependent or criterion variable. This should pose no particular problem as long as we agree, for analysis purposes, which variable is to be considered the control variable. Indeed, cases often arise in which each of the variables in turn serves as the independent and dependent variable.
Table 12.1 Aware of Life Salad Dressing — Before and After Spot TV
Table 12.2 shows the percentage of respondents who are aware of Life before and after the spot TV campaign in the test and control areas. First, we note that the test-area respondents displayed a greater absolute increase in awareness: the increase for the test-area respondents was 24 percentage points, whereas the control-area awareness increased by only 2 percentage points. Expressed as a relative difference, these increases amount to 100(24/42) = 57 percent and 100(2/40) = 5 percent of the respective before-campaign levels.
Table 12.2 Aware of Life—Percentages Before and After the Spot TV Campaign
The percentage of possible increase for the test area is computed by first noting that the maximum percentage-point increase that could have occurred is 100 − 42 = 58 points. The increase actually registered is 24 percentage points, or 100(24/58) = 41 percent of the maximum possible. That of the control area is 100(2/60) = 3 percent of the maximum possible (the control area could have gained at most 100 − 40 = 60 points).
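The three difference measures can be sketched in a few lines of Python; the before and after percentages are those reported for the test and control areas:

```python
# Before/after awareness percentages from the Life example
test_before, test_after = 42.0, 66.0
ctrl_before, ctrl_after = 40.0, 42.0

def absolute_diff(before, after):
    # difference in percentage points
    return after - before

def relative_diff(before, after):
    # percent change relative to the before level
    return 100 * (after - before) / before

def pct_of_possible(before, after):
    # share of the maximum possible percentage-point gain actually realized
    return 100 * (after - before) / (100 - before)
```

For the test area these return 24 points, roughly 57 percent, and roughly 41 percent of the maximum possible, matching the figures in the text.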
In terms of the illustrative problem, all three methods give consistent results in the sense that the awareness level in the test area undergoes greater change than that in the control area. However, in other situations conflicts among the measures may arise.
The absolute-difference method is simple to use and requires only that the distinction between percentage and percentage points be understood. The relative-difference method can be misleading, particularly if the base for computing the percentage change is small. The percentage-of-possible-difference method takes cognizance of the greater difficulty associated with obtaining increases in awareness as the difference between the potential level and the realized level decreases. In some studies all three measures are used, inasmuch as they emphasize different aspects of the relationship.
Introducing a Third Variable into the Analysis
Cross-tabulation analysis to investigate relationships need not stop with two variables. Often much can be learned about the original two-variable association through the introduction of a third variable that may refine or explain the original relationship. In some cases, the third variable may show that the two variables are related even though no apparent relationship existed before it was introduced. These ideas are most easily explained by example.
Consider the hypothetical situation facing MCX Company, a specialist in telecommunications equipment for the residential market. The company has recently test-marketed a new service for online recording of cable or satellite programs without a storage box. Several months after the introduction, a telephone survey was taken in which respondents in the test area were asked whether they had adopted the innovation. The total number of respondents interviewed was 600 (Table 12.3).
Table 12.3 Adoption—Percentage by Gender and Age
One of the variables of major interest in this study was the age of the respondent. Based on earlier studies of the residential market, it appeared that adopters of the firm's new products tended to be less than 35 years old. Accordingly, the market analyst decides to cross-tabulate adoption and respondent age. Respondents are classified into the categories "under 35 years" (<35) and "equal to or greater than 35 years" (≥35) and then cross-classified by adoption or not. Table 12.3 shows the full three-variable cross-tabulation. It seems that the total sample of 600 is split evenly between those who are under 35 years of age and those who are 35 years of age or older. Younger respondents display a higher percentage of adoption (37 percent = (100 + 11)/300) than older respondents (23 percent = (60 + 9)/300).
Analysis and Interpretation
The researcher is primarily interested in whether this finding differs when gender of the respondent is introduced into the analysis. As it turned out, 400 respondents in the total sample were men, whereas 200 were women.
Table 12.3 shows the results of introducing gender as a third classificatory variable. In the case of men, 50 percent of the younger men adopt compared with only 30 percent of the older men. In the case of women, the percentages of adoption are much closer. Even here, however, younger women show a slightly higher percentage of adoption (11 percent) than older women (9 percent).
The effect of gender on the original association between adoption and age is to refine that association without changing its basic character; younger respondents show a higher incidence of adoption than older respondents. However, what can now be said is the following: If the respondent is a man, the differential effect of age on adoption is much more pronounced than if the respondent is a woman.
This pattern is even easier to identify when we show this information graphically (Figure 12.1). The height of the bars within each rectangle represents the percentage of respondents who are adopters. The relative width of the bars denotes the relative size of the categories of the third variable, gender (men versus women). The shaded portions of the bars denote the percentage adopting by gender, and the dashed line represents the weighted average percentage adopting across the genders.
Figure 12.1 Adoption—Percentage by Age and Gender
It is easy to see from the figure that adoption differs by age group (37 percent versus 23 percent). Furthermore, the size of the difference depends on the gender of the respondent: Men display a relatively higher rate of adoption, compared with women, in the younger age category.
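The refined association can be checked numerically. The sketch below uses cell counts reconstructed from the percentages reported in the text (400 men and 200 women, split evenly across the two age groups); the counts are our reconstruction, not a reproduction of Table 12.3:

```python
# (adopters, total) cell counts reconstructed from the percentages in
# the text: e.g., 50% of 200 younger men adopted
cells = {
    ("men", "<35"): (100, 200), ("men", ">=35"): (60, 200),
    ("women", "<35"): (11, 100), ("women", ">=35"): (9, 100),
}

def pct_adopting(age):
    # aggregate adoption percentage for an age group across genders
    adopters = sum(a for (_, ag), (a, _) in cells.items() if ag == age)
    total = sum(n for (_, ag), (_, n) in cells.items() if ag == age)
    return 100 * adopters / total
```

Aggregating across genders recovers the original two-variable result: 37 percent adoption among younger respondents versus 23 percent among older respondents.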
Recapitulation
Three-variable association can involve many possibilities, which can be illustrated in terms of the preceding adoption-age-gender example:
1. In the example presented, adoption and age exhibit initial association; this association is still maintained in the aggregate but is refined by the introduction of the third variable, gender.

2. Adoption and age do not appear to be associated. However, adding and controlling on the third variable, gender, reveals suppressed association between the first two variables within the separate categories of men and women. In the two-variable case, men and women exhibit opposite patterns, canceling each other out.
Although the preceding example was contrived to illustrate the concepts, the results are not unusual in practice. It goes almost without saying that the introduction of a third variable can often be useful in the interpretation of two-variable cross-tabulations.
However, the reader should be aware of the fact that we have deliberately used the phrase associated with rather than caused by. Association of two or more variables does not imply causation, and this statement is true regardless of our preceding efforts to refine some observed two-variable association through the introduction of a third variable.
In principle, of course, we could cross tabulate four or even more variables with the possibility of obtaining further insight into lower-order (e.g., two-variable) associations. However, somewhere along the line, a problem arises in maintaining an adequate cell size for all categories. Unless sample sizes are extremely large in the aggregate and the number of categories per variable is relatively small, cross-tabulations rarely can deal with more than three variables at a time. A further problem, independent of sample size, concerns the high degree of complexity of interpretation that is introduced by cross-tabulations involving four or more variables. In practice, most routine applications of cross-tabulation involve only two variables at a time.
As noted in Table 12.3, there are definite advantages associated with having a two-category criterion variable, such as adoption versus non-adoption. In many applications, however, the criterion variable will have more than two categories. Cross-tabulations can still be prepared in the usual manner, although they become somewhat more tedious to examine.
Bivariate Analysis: Differences Between Sample Groups
Marketing activities largely focus on the identification and description of market segments. These segments may be defined demographically, attitudinally, by the quantity of the product used, by activities participated in, by interests, by opinions, or by a multitude of other measures. The key feature of each of these variables is the ability to group respondents into market segments. Often this segmentation analysis involves the identification of differences and the asking of questions about the marketing implications of those differences: Do differences in satisfaction exist for the two or more groups that are defined by age categories?
Bivariate statistical analysis refers to the analysis of relationships between two variables. These analyses often concern differences between respondent groups. In the following discussion, we explore bivariate statistical analysis and focus on the two-variable case as a bridge between the comparatively simple analyses already discussed and the more sophisticated techniques that will command our attention in later chapters. We begin with what is perhaps the most used test of market researchers: cross-tabulation. Next, we consider analysis of differences in group means. First, we discuss the t-test of differences in means of two independent samples, and then we look at one-way analysis of variance (ANOVA), which tests for differences in means for k groups. Finally, we provide a discussion of some of the more widely used nonparametric techniques. These are but a few of the possible parametric and nonparametric analyses that could be discussed (see Table 12.4).
Bivariate Cross-Tabulation
Often called chi-square analysis, this technique is used when the data consist of counts or frequencies within categories of a tabulation or cross-tabulation table. In conjunction with the cross-tabulation we introduce the chi-square statistic, χ², to determine whether the observed association in a cross-tabulation involving two or more variables is statistically significant. This is typically called a χ² test of independence.
Cross-tabulation represents the simplest form of associative data analysis. At the minimum we can start out with a bivariate cross-tabulation of two variables, such as occupation and education, each of which identifies a set of exclusive and exhaustive categories. We know that such data are called categorical because each variable is assumed to be only nominal-scaled. Bivariate cross-tabulation is widely used in marketing applications to analyze variables at all levels of measurement. In fact, it is the single most widely used bivariate technique in applied settings.
Table 12.4 Selected Nonparametric Statistical Tests for Two-Sample Cases
In marketing research, observations may be cross-classified, as when we are interested in testing whether occupational status is associated with brand loyalty. Suppose, for illustrative purposes, that a marketing researcher has assembled data on brand loyalty and occupational status—white collar, blue collar, and unemployed or retired—that describes consumers of a particular product class. The data for our hypothetical problem appear in Table 12.5.
A total of four columns, known as banner points, are shown. Four rows, or stubs, are also shown. Professional cross-tabulation software will output multiple side-by-side tables that join multiple variables on the column banner points, so that loyalty and another variable, such as usage occasion, can be analyzed simultaneously.
In a study of 230 customers, we are interested in determining whether occupational category is associated with the characteristic of loyalty status. The data suggest that a relationship exists, but is the observed association a reflection of sampling variation, or can we conclude that a true relationship exists?
Table 12.5 Contingency Table of Observed versus Theoretical Frequencies
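As a sketch of the mechanics, the χ² statistic can be computed by hand from the observed counts. The 2 × 3 loyalty-by-occupation counts below are hypothetical (they sum to 230 but are not the actual Table 12.5 figures); the theoretical (expected) frequency for each cell is its row total times its column total, divided by the sample size:

```python
# Hypothetical 2 x 3 counts (n = 230); columns are white collar,
# blue collar, unemployed/retired
observed = [
    [60, 40, 15],   # loyal
    [40, 45, 30],   # not loyal
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # theoretical frequency under independence
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp

# degrees of freedom = (rows - 1) x (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)
```

For these illustrative counts the statistic exceeds the .05 critical value of 5.991 for 2 degrees of freedom, so independence would be rejected.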
A great amount of marketing research is concerned with estimating parameters of one or more populations. In addition, many studies go beyond estimation and compare such population parameters by testing hypotheses about differences between them. Means, proportions, and variances are often the summary measures of concern. Our concern at this point is with differences in means and proportions. Direct comparisons of variances are a special case of the more general technique of analysis of variance, which is covered later in this chapter.
Standard Error of Differences
Here we extend the topic of sampling distributions and standard errors, as they apply to a single statistic, to cover differences between statistics, and we show the traditional hypothesis test for differences.
Standard Error of Difference of Means
For two samples, A and B, that are independent and randomly selected, the standard error of the difference in means is calculated by

s(x̄A − x̄B) = √(s²A/nA + s²B/nB)
For relatively small samples the correction factor n/(n − 1) is applied to each term, and the resulting formula for the estimated standard error is

s(x̄A − x̄B) = √(s²A/(nA − 1) + s²B/(nB − 1))
Of course, these formulas would be appropriate for use in the denominator of the t-test.
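Both forms of the standard error can be sketched in Python; the two small samples below are hypothetical:

```python
import math

# Two small hypothetical samples
a = [4, 6, 8]
b = [3, 5, 7]

def var_n(x):
    # variance with divisor n, as used in the large-sample formula
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

# Large-sample form: sqrt(s2_A/n_A + s2_B/n_B)
se_large = math.sqrt(var_n(a) / len(a) + var_n(b) / len(b))

# Small-sample form, applying the n/(n - 1) correction to each term
se_small = math.sqrt(var_n(a) / (len(a) - 1) + var_n(b) / (len(b) - 1))
```

With samples this small the correction matters; as n grows, the two estimates converge.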
Standard Error of Differences of Proportions
Turning now to proportions, the derivation of the standard error of the differences is somewhat similar. Specifically, for large samples,

s(pA − pB) = √(pA(1 − pA)/nA + pB(1 − pB)/nB)

The traditional hypothesis-testing approach rests on several conditions:
1. Samples must be independent.

2. Individual items in samples must be drawn in a random manner.

3. The population being sampled must be normally distributed (or the sample must be of sufficiently large size).

4. For small samples, the population variances must be equal.

5. The data must be at least interval-scaled.
When these five conditions are met, or can at least be reasonably assumed to exist, the traditional approach is as follows.
1. The null hypothesis (H0) is specified such that there is no difference between the parameters of interest in the two populations (e.g., H0: μA − μB = 0); any observed difference is assumed to occur solely because of sampling variation.

2. The alpha risk is established (α = .05 or some other value).

3. A Z value is calculated by the appropriate adaptation of the Z formula. For testing the difference between two means, Z is calculated in the following way:

Z = (x̄A − x̄B) / s(x̄A − x̄B)

4. The probability of the observed difference of the two sample statistics having occurred by chance is determined from a table of the normal distribution (or the t distribution, interpreted with nA + nB − 2 degrees of freedom).

5. If the probability of the observed difference having occurred by chance is greater than the alpha risk, the null hypothesis of no difference is not rejected; it is concluded that the parameters of the two populations are not significantly different. If that probability is less than the alpha risk, the null hypothesis is rejected; it is concluded that the parameters of the two populations differ significantly. In an applied setting, there are times when the level at which significance occurs (the alpha level) is reported and management decides whether to accept or reject.
To illustrate the small-sample case, let us assume that we obtain the same mean values and get values for sA and sB such that the estimated standard error of the difference is s(x̄A − x̄B) = $0.78, from samples nA = 15 and nB = 12. With these data we calculate t as follows:
The critical value of t is obtained from a table of percentiles for the t distribution. For, say, α = .05, we determine the critical value of t for (nA + nB − 2) = 25 degrees of freedom to be 1.708 (one-tailed test).
Since the calculated t of 1.54 < 1.708, we cannot reject the hypothesis of no difference in the average amount spent by the two types of families. This result is shown in Figure 12.2. When samples are not independent, the same general procedure is followed; the formulas for calculating the test statistics differ, however.
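The small-sample calculation can be sketched as follows. The mean difference of $1.20 is an assumption for illustration (the text gives only the standard error and the resulting t), chosen to be consistent with the reported values:

```python
# Assumed mean difference of $1.20 (hypothetical; consistent with the
# reported standard error of $0.78 and t of 1.54)
diff = 1.20
se_diff = 0.78

t = diff / se_diff

# Critical value from a t table: one-tailed, alpha = .05, 25 df
critical_t = 1.708
reject_null = t > critical_t   # False: cannot reject the null hypothesis
```

Because the calculated t falls short of the critical value, the hypothesis of no difference is not rejected.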
Figure 12.2 t-Distribution
The t-distribution revolutionized statistics and the ability to work with small samples. Prior statistical work was based largely on the value of Z, which was used to designate a point on the normal distribution when population parameters were known. For most market research applications it is difficult to justify the Z-test's assumed knowledge of μ and σ. The t-test relaxes the rigid assumptions of the Z-test by focusing on sample means and variances (x̄ and s). The t-test is a widely used market research statistic for testing differences between two groups of respondents.
In the previous section, the t statistic was described. Most computer programs recognize two versions of this statistic. The first, presented above, is called the separate-variance estimate:

t = (x̄A − x̄B) / √(s²A/nA + s²B/nB)
The second method, called the pooled-variance estimate, pools the two sample variances into a single weighted average that is used in the denominator of the statistic:

t = (x̄A − x̄B) / √(s²p(1/nA + 1/nB)), where s²p = [(nA − 1)s²A + (nB − 1)s²B] / (nA + nB − 2)
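The two estimates can be compared on a small hypothetical data set; with equal sample sizes they coincide, which is one reason the choice matters mainly when group sizes (or variances) differ:

```python
import math

# Two hypothetical groups of equal size
a = [5, 6, 7, 8]
b = [1, 2, 3, 4]

def mean(x):
    return sum(x) / len(x)

def var(x):
    # unbiased sample variance (divisor n - 1)
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

na, nb = len(a), len(b)

# Separate-variance estimate
t_sep = (mean(a) - mean(b)) / math.sqrt(var(a) / na + var(b) / nb)

# Pooled-variance estimate
s2p = ((na - 1) * var(a) + (nb - 1) * var(b)) / (na + nb - 2)
t_pool = (mean(a) - mean(b)) / math.sqrt(s2p * (1 / na + 1 / nb))
```

Here the two t values are identical because na = nb; with unequal group sizes and unequal variances they diverge, which is what Levene's test helps adjudicate.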
Table 12.7 shows a sample output of t-tests from SPSS. Two respondent groups were identified for the supermarket study :
1. Respondents who were males
2. Respondents who were females
The analysis by gender is shown in Part A of the output for the attitudinal dimensions, friendliness of clerks and helpfulness of clerks. It will be noted that the sample sizes are approximately equal. To show a contrast, we have included in Part B of the output a t-test from another study in which the number of males and females varies widely. In all three examples, the differences between separate-variance and pooled-variance analysis are very small. Table 12.7 also shows an F statistic. This is Levene's test for equality of variances. Using this test, the researcher knows which t-test output to use: equal variances assumed or equal variances not assumed.
Table 12.7 Selected Output From PASW (SPSS) t-Test
Example: It is well known that interest ratings for TV ads are related to the advertising message. A simple one-way ANOVA to investigate this relationship might compare three messages:
Exhibit 12.2 Example of ANOVA Methodology
Using the single-factor ANOVA design for the problem described in Exhibit 12.1, the actual interest-rating data for 12 respondents (4 respondents for each of the 3 treatments) appear in tabular and graphical format as follows:
It is apparent that the messages differ in terms of their interest ratings, but how do we perform ANOVA to determine whether these differences are statistically significant? Three values must be computed in order to analyze this pattern of values.
1. Compute the total sum of squares: The grand mean of the 12 observations is computed, followed by the squared deviations of the individual observations from this mean.

Grand mean = 4.0 = x̄G

Total sum of squares = Σ(xij − x̄G)² = (6 − 4)² + (4 − 4)² + … + (1 − 4)² = 110

2. Compute the between-treatment sum of squares: The means of the factor levels (messages A, B, and C) are computed, followed by the squared deviations of the factor-level means from the overall mean, weighted by the number of observations:

Between-treatment sum of squares = Σ nj(x̄j − x̄G)²

3. Compute the within-treatment sum of squares: The means of the factor levels are computed, followed by the squared deviations of the observations within each factor level from that factor-level mean:

Within-treatment sum of squares = ΣΣ(xij − x̄j)²
Thus, an observation may be decomposed into three terms that are additive, each of which explains a particular type of variance:
The overall mean is constant and common to all observations; the deviation of a group mean from the overall mean represents the effect on each observation of belonging to that particular group; and the deviation of an observation from its group mean represents the effect on that observation of all variables other than the group variable.
The basic idea of ANOVA is to compare the between-treatment-group sum of squares (after dividing by degrees of freedom to get a mean square) with the within-treatment-group sum of squares (also divided by the appropriate number of degrees of freedom). The ratio of these mean squares is the F statistic, which indicates the strength of the grouping factor. Conceptually,

F = between-treatment mean square / within-treatment mean square
However, to make this comparison, it is necessary to assume that the error-term distribution has constant variance over all observations. This is exactly the same assumption as was made for the t test.
In the next section we shall (a) use more efficient computational techniques, (b) consider the adjustment for degrees of freedom to obtain mean squares, and (c) show the use of the F ratio in testing significance. Still, the foregoing remarks represent the basic ANOVA idea of comparing between-sample with within-sample variability.
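The between- versus within-sample comparison just described can be sketched end to end in Python. The three groups of four ratings below are hypothetical (they do not reproduce the Exhibit 12.2 data), chosen so the arithmetic is easy to follow:

```python
# Hypothetical interest ratings: 4 respondents per message
groups = {"A": [6, 5, 6, 7], "B": [4, 4, 3, 5], "C": [2, 3, 1, 2]}

all_obs = [v for g in groups.values() for v in g]
grand_mean = sum(all_obs) / len(all_obs)

# Total sum of squares: deviations of observations from the grand mean
ss_total = sum((v - grand_mean) ** 2 for v in all_obs)

# Between-treatment sum of squares: deviations of group means from the
# grand mean, weighted by group size
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
)

# Within-treatment sum of squares is the remainder
ss_within = ss_total - ss_between

df_between = len(groups) - 1
df_within = len(all_obs) - len(groups)

# F compares the two mean squares
F = (ss_between / df_between) / (ss_within / df_within)
```

For these illustrative data most of the variation lies between messages, so F is large and the null hypothesis of equal message means would be rejected.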
One-Way (Single-Factor) Analysis of Variance
One-way ANOVA is analysis of variance in its simplest (single-factor) form. Suppose a new product manager for the hypothetical Friskie Corp. is interested in the effect of shelf height on supermarket sales of canned dog food. The product manager has been able to secure the cooperation of a store manager to run an experiment involving three levels of shelf height (knee level, waist level, and eye level) on sales of a single brand of dog food, which we shall call Snoopy. Assume further that our experiment must be conducted in a single supermarket and that our response variable will be sales, in cans of dog food, for some appropriate unit of time. But what shall we use for our unit of time? Sales of dog food in a single store may exhibit week-to-week, day-to-day, and even hour-to-hour variation. In addition, sales of this particular brand may be influenced by the price or special promotions of competitive brands, the store management's knowledge that an experiment is going on, and other variables that we cannot control at all or would find too costly to control.
Assume that we have agreed to change the shelf-height position of Snoopy three times per day and run the experiment over eight days. We shall fill the remaining sections of our gondola with a filler brand, which is not familiar to customers in the geographical area in which the test is being conducted. Furthermore, since our primary emphasis is on explaining the technique of analysis of variance in its simplest form (analysis of one factor: shelf height), we shall assign the shelf heights at random over the three time periods per day and not design an experiment to explicitly control and test for within-day and between-day differences. Our experimental results are shown in Table 12.8.
Here, we let Xij denote the sales (in units) of Snoopy during the ith day under the jth treatment level. If we look at mean sales by each level of shelf height, it appears as though the waist-level treatment, the average response to which is x̄2 = 90.9, results in the highest mean sales over the experimental period. However, we note that the last observation (93) under the eye-level treatment exceeds the waist-level treatment mean. Is this a fluke observation? We know that these means are, after all, sample means, and our interest lies in whether the three population means from which the samples are drawn are equal or not.
Now we shall show what happens when one goes through a typical one way analysis of variance computation for this problem.
This is the same quantity that would be obtained by subtracting the grand mean of 85.5 from each original observation, squaring the result, and adding up the 24 squared deviations. This mean-corrected sum of squares is equivalent to the type of formula used earlier in this chapter.
The interpretation of this analysis is, like the t-test, a process of comparing the F value of 14.6 (2, 21 df) with the table value of 4.32 (p =.05). Because (14.6 > 4.32), we reject the null hypothesis that treatments 1, 2, and 3 have equivalent appeals.
The important consideration to remember is that, aside from the statistical assumptions underlying the analysis of variance, the variance of the error distribution will markedly influence the significance of the results. That is, if the variance is large relative to differences among treatments, then the true effects may be swamped, leading to an acceptance of the null hypothesis when it is false. As we know, an increase in sample size can reduce this experimental error. Though beyond the scope of this chapter, specialized experimental designs are available, the objectives of which are to increase the efficiency of the experiment by reducing the error variance.
Follow-up Tests of Treatment Differences
The question that now must be answered is the following: Which treatments differ? The F-ratio tells us only that differences exist. The question of where the differences lie is answered by follow-up analyses, usually a series of independent-samples t-tests that compare the treatment-level combinations (1,2), (1,3), and (2,3). Because of our previous discussion of the t-test, we will not discuss these tests in detail. We note only that there are various forms of the t-statistic that may be used when conducting a series of two-group tests. These test statistics (which include techniques known as the LSD (Least Significant Difference) test, Bonferroni's test, Duncan's multiple range test, Scheffé's test, and others) control the cumulative probability that a Type I error will occur when a series of statistical tests is conducted. Recall that if each test in a series has a .05 probability of a Type I error, then in a series of 20 such tests we would expect one (20 × .05 = 1) of them to report a significant difference that does not exist (a Type I error). These tests typically are options provided by the standard statistical packages, such as the PASW (SPSS) procedure Oneway.
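A minimal sketch of one such follow-up procedure, the Bonferroni correction, is shown below. The three treatment samples are hypothetical (the actual Table 12.8 values are not shown); the correction simply divides the per-test alpha by the number of comparisons:

```python
import itertools
import numpy as np
from scipy import stats

# Hypothetical treatment-level data for the three shelf heights
samples = {
    1: np.array([77, 82, 86, 78, 81, 86, 77, 81]),   # knee level
    2: np.array([88, 94, 93, 90, 91, 94, 90, 87]),   # waist level
    3: np.array([85, 85, 87, 81, 80, 79, 87, 93]),   # eye level
}

alpha = 0.05
pairs = list(itertools.combinations(samples, 2))     # (1,2), (1,3), (2,3)

# Bonferroni: divide the per-test alpha by the number of comparisons so the
# cumulative Type I error rate across all three tests stays near 0.05
adjusted_alpha = alpha / len(pairs)

results = []
for i, j in pairs:
    t, p = stats.ttest_ind(samples[i], samples[j])
    results.append((i, j, t, p, p < adjusted_alpha))
    print(f"treatments {i} vs {j}: t = {t:.2f}, p = {p:.4f}, "
          f"significant at adjusted alpha: {p < adjusted_alpha}")
```

The other procedures named above (LSD, Duncan, Scheffé) differ in how aggressively they trade power against Type I error control, but they address the same multiple-comparison problem.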
Bivariate Analysis: Measures Of Association
Bivariate measures of association apply to the two-variable case in which both variables are interval- or ratio-scaled. Our concern is with the nature of the association between the two variables and the use of this association in making predictions.
Correlation Analysis
When referring to a simple two-variable correlation, we refer to the strength and direction of the relationship between the two variables. As an initial step in studying the relationship between the X and Y variables, it is often helpful to graph this relationship in a scatter diagram (also known as an X-Y plot). Each point on the graph represents the appropriate combination of scale values for the associated X and Y variables, as shown in Figure 12.3.
The objective of correlation analysis, then, is to obtain a measure of the degree of linear association (correlation) that exists between the two variables. The Pearson correlation coefficient is commonly used for this purpose and is defined by the formula
The alternate formulation shows the correlation coefficient to be the average of the products of the Z scores for the X and Y variables. In this method of computing the correlation coefficient, the first step is to convert the raw data to Z scores by taking the deviation from the respective sample mean and dividing by the standard deviation. The Z scores are standardized variables, each with a mean of zero and a standard deviation of one.
The transformation of the X and Y variables to Z scores means that the scales measuring the original variables are no longer relevant: a Z-score variable originally measured in dollars can be correlated with another Z-score variable originally measured on a satisfaction scale. The original metric scales are replaced by a new, abstract scale, and the correlation is computed from the products of the two Z distributions.
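The Z-score route and the usual product-moment formula give identical results, which can be checked in a few lines. The paired observations here are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical paired observations (illustrative only)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([1.5, 3.0, 4.5, 7.0, 8.0, 11.5])

# Step 1: convert each variable to Z scores -- deviation from the sample
# mean divided by the standard deviation
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Step 2: the correlation is the average of the products of paired Z scores
r = (zx * zy).mean()

# Matches numpy's built-in product-moment correlation
print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))
```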
Figure 12.3 Scatter Diagrams
Figure 12.5 Consumption Expenditure and Income Data
- Can we predict a person's weekly fast food and restaurant food purchases from that person's gender, age, income, or education level?
- Can we predict the dollar volume of purchase of our new product by industrial purchasing agents as a function of our relative price, delivery schedules, product quality, and technical service?
- Highway driving conditions
- Average temperature in the three-day period preceding the weekend
- Local weather forecast for the weekend
- Amount of newspaper space devoted to the resort's advertisements in the surrounding city newspapers
- A moving average of the three preceding weekends' ticket sales
1. Can we find a predictor variable (or, in the multiple case, a linear composite of the predictor variables) that will parsimoniously express the relationship between a criterion variable and the predictor (or set of predictors)?
2. If we can, how strong is the relationship; that is, how accurately can we predict values of the criterion variable from values of the predictor (or linear composite)?
3. Is the overall relationship statistically significant?
4. Which predictor is most important in accounting for variation in the criterion variable? (Can the original model be reduced to fewer variables and still provide adequate prediction of the criterion?)
Suppose that a marketing researcher is interested in consumers' attitudes toward nutritional additives in ready-to-eat cereals. Specifically, a set of written concept descriptions of a children's cereal is prepared; the descriptions vary on
X1 : the amount of protein (in grams) per 2-ounce serving.
The researcher obtains consumers' interval-scaled evaluations of ten concept descriptions using a preference rating scale that ranges from 1, dislike extremely, up to 9, like extremely well.
Table 12.11 Preference Ratings of Ten Cereal Concepts Varying in Protein
Exhibit 12.3 Look at Your Data Before You Analyze
In deciding which type of regression approach to use, it is important that the researcher know the shape of the interrelationship. The shape of the interrelationship is easy to see on a scatter diagram. Looking at this visually helps decide whether the relationship is, or approximates being, linear or whether it has some other shape which would require a transformation of the data by converting to square roots or logarithms or treatment as nonlinear regression (Semon, 1993).
Examination of the data by scatter diagrams also allows the researcher to see if there are any “outliers” i.e., cases where the relationship is unusual or extreme as compared to the majority of the data points. A decision has to be made whether to retain such outliers in the data set for analysis.
When the regression line itself is included on the scatter diagram, the actual values and the values estimated by the regression formula can be compared and used to assess the estimating error. Of course, what the analyst is seeking is the regression function that best fits the data, and this fit is typically based on minimizing the sum of the squared distances between the actual and estimated values: the so-called least-squares criterion.
The equation for a linear model can be written Ŷ = a + bX, where Ŷ denotes values of the criterion predicted by the linear model; a denotes the intercept, or the value of Ŷ when X is zero; and b denotes the slope of the line, or the change in Ŷ per unit change in X.
But how do we find the numerical values of a and b? The method used in this chapter is known as least squares as discussed in Exhibit 12.3. As the reader will recall from introductory statistics, the method of least squares finds the line whose sum of squared differences between the observed values Yi and their estimated counterparts Ŷi (on the regression line) is a minimum.
Parameter Estimation
To compute the estimated parameters (a and b) of the linear model, we return to the data of Table 12.11. In the two-variable case, the formulas are relatively simple:
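The two-variable least-squares formulas are b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and a = Ȳ − bX̄. A quick sketch applies them to hypothetical (x, y) pairs standing in for Table 12.11, whose values are not reproduced here:

```python
import numpy as np

# Hypothetical (x, y) pairs standing in for Table 12.11
x = np.array([2.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 12.0])
y = np.array([2.5, 4.0, 5.5, 5.0, 7.5, 8.5, 8.0, 11.0])

# Two-variable least-squares formulas:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a = y_bar - b * x_bar
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

# numpy's degree-1 polynomial fit gives the same line
b_np, a_np = np.polyfit(x, y, 1)
print(round(a, 4), round(b, 4))
```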
Underlying least-squares computations is a set of assumptions. Although least-squares regression models need not assume normality in the (conditional) distributions of the criterion variable, this assumption is made when we test the statistical significance of the contribution of the predictor variable in explaining the variance in the criterion (does it differ from zero?). With this in mind, the assumptions of the regression model are as follows (the symbols α and β are used to denote the population counterparts of a and b):
1. For each fixed value of X we assume that a normal distribution of Y values exists. For our particular sample we assume that each Y value is drawn independently of all others. What is being described is the "classical" regression model. Modern versions of the model permit the predictors to be random variables, but their distribution is not allowed to depend on the parameters of the regression equation.
2. The means of all of these normal distributions of Y lie on a straight line with slope β.
3. The normal distributions of Y all have equal variances. This (common) variance does not depend on the values assumed by the variable X.
Y = α + β X1 + ε
where
α = mean of the Y population when X1 = 0
β = change in Y population mean per unit change in X1
ε = error term drawn independently from a normally distributed universe with zero mean (and constant variance)
The nature of these assumptions is apparent in Figure 12.7. The reader should note that each value of X has associated with it a normal curve for Y (assumption 1). The means of all these normal distributions lie on the straight line shown in the figure (assumption 2).
What if the dependent variable is not continuous? Exhibit 12.4 gives an alternative for when the dependent variable can be viewed as a categorical dichotomous variable: use logistic regression (also known as logit). The analysis proceeds generally as we are discussing it; the major change is that a transformation has been applied to the dependent-variable values.
Exhibit 12.4 When to Use Logistic Regression
Data collected for customer satisfaction research provides a good illustration of when the researcher should consider transformation of data. Typically multi-point rating scales are used to obtain customer satisfaction data. Many believe that customer satisfaction ratings obtained on rating scales are not normally distributed, but are skewed toward higher scale values (Dispensa, 1997). Thus, in practice customers do not really view customer satisfaction ratings as continuous.
Ultimately, a customer is either satisfied or not satisfied. This creates a dichotomous dependent variable. Typically those customers who rate at the upper end of the scale, say 9 or 10 on a 10-point scale, are considered satisfied while all others are considered not satisfied. If this is so, normal regression analysis is not the proper technique to use, as the dependent variable is binary, not continuous.
A binary overall customer satisfaction variable follows the logistic distribution, thus allowing for the use of logistic regression. With one or more independent variables, this technique allows a researcher to determine the extent to which an independent variable affects the prediction of a satisfied customer through the logistic regression coefficients and their associated log-odds (Dispensa, 1997). Log-odds specify the direct association between the independent variable and the dependent variable. In addition, logistic regression calculates the probability of each customer being satisfied or not.
Typically, logistic regression is used for multiple regression situations where there are two or more independent variables. But, it is suitable for bivariate situations as well. The key to its being of value is the nature of the dependent variable, not the independent variable (s).
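As a minimal sketch of the idea (not the procedure of any particular package), the following fits a two-parameter logistic model to a dichotomized satisfaction rating by maximizing the log-likelihood numerically. The data, variable names, and the 9-or-above cutoff are all hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: a service-quality predictor and a 10-point overall
# satisfaction rating, dichotomized so 9-10 counts as "satisfied" (1)
rng = np.random.default_rng(0)
service = rng.uniform(1, 10, 200)
overall = np.clip(np.round(service + rng.normal(0, 1.5, 200)), 1, 10)
satisfied = (overall >= 9).astype(float)

def neg_log_likelihood(beta):
    """Negative log-likelihood of the logistic model
    P(satisfied) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = beta
    logits = b0 + b1 * service
    # log(1 + exp(z)) computed stably via logaddexp
    return (np.logaddexp(0.0, logits) - satisfied * logits).sum()

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
b0, b1 = res.x

# b1 is the log-odds coefficient: each one-unit increase in the predictor
# multiplies the odds of being satisfied by exp(b1)
odds_multiplier = np.exp(b1)
print(round(b1, 3), round(odds_multiplier, 3))
```

A positive b1 here means higher service ratings raise the odds of classification as satisfied, which is the log-odds interpretation described above.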
Figure 12.7 Two-Variable Regression Model--Theoretical
However, functional forms other than linear may be suggested by the preliminary scatter plot. Figure 12.8 shows various types of scatter diagrams and regression lines for the two-variable case. Panel I shows the ideal case in which all the variation in Y is accounted for by variation in X1. We note that the regression line passes through the mean of each variable and that the slope b happens to be positive. The intercept a represents the predicted value of Y when X1 = 0. In Panel II we note that there is residual variation in Y and, furthermore, that the slope b is negative. Panel III demonstrates the case in which no association between Y and X1 is found. In this case the mean of Y is as good a predictor as the variable X1 (the slope b is zero). Panel IV emphasizes that a linear model is being fitted. That is, no linear association is found (b = 0), even though a curvilinear relationship is apparent from the scatter diagram. Figure 12.8 illustrates the desirability of plotting one's data before proceeding to formulate a specific regression model.
Figure 12.8 Illustrative Scatter Diagrams and Regression Lines
The measure of strength of association in bivariate regression is denoted by r² and is called the coefficient of determination. This coefficient varies between 0 and 1 and represents the proportion of the total variation in Y (as measured about its own mean Ȳ) that is accounted for by variation in X1. For regression analyses it can also be interpreted as a measure of substantive significance, as we have previously defined this concept.
If we were to use the average of the Y values (Ȳ) to estimate each separate value of Y, then a measure of our inability to predict Y would be given by the sum of the squared deviations
On the other hand, if we tried to predict Y by employing a linear regression based on X1, we could use each Ŷi to predict its counterpart Yi. In this case a measure of our inability to predict Yi is given by
*From the equation Ŷi = 0.491 + 0.886Xi1. This is the sum of squared errors in predicting Yi from Ŷi. Next, we find:
This is the sum of squared errors in predicting Yi from Ȳ. Hence,
which is the accounted-for sum of squares due to the regression of Y on X1.
Figure 12.10 (and 12.9 as well) puts all these quantities in perspective by first showing the deviations Yi − Ȳ. As noted above, the sum of these squared deviations is 76.10. Panel II shows the counterpart deviations of Yi from Ŷi; the sum of these squared deviations is 21.09. Panel III shows the deviations of Ŷi from Ȳ; the sum of these squared deviations is 55.01. We note that the results are additive: 21.09 + 55.01 = 76.10.
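This additive decomposition holds for any least-squares fit and can be verified directly. The data below are hypothetical (so the sums of squares differ from 76.10, 21.09, and 55.01), but the identity and the r² interpretation are the same:

```python
import numpy as np

# Hypothetical data; the same decomposition applies to any OLS fit
x = np.array([2.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 12.0])
y = np.array([2.5, 4.0, 5.5, 5.0, 7.5, 8.5, 8.0, 11.0])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_total = ((y - y.mean()) ** 2).sum()            # Yi about Y-bar
ss_error = ((y - y_hat) ** 2).sum()               # Yi about Y-hat
ss_regression = ((y_hat - y.mean()) ** 2).sum()   # accounted-for sum of squares

# The components are additive, and r^2 is the accounted-for proportion
r_squared = ss_regression / ss_total
print(round(ss_error + ss_regression, 6), round(ss_total, 6), round(r_squared, 4))
```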
Nonparametric Analysis
One reason for the widespread use of chi-square in cross-tabulation analysis is that most computer routines show the statistic as part of the output, or at least offer it as an option that the analyst can choose. Sometimes ordinal data are available, and such data are stronger than simple nominal measurement. In this situation other tests are more powerful than chi-square. Three regularly used tests are the Wilcoxon Rank Sum (T), the Mann-Whitney U, and the Kolmogorov-Smirnov test. Siegel (1956) and Gibbons (1993) provide more detailed discussions of these techniques.
The Wilcoxon T test is used for dependent samples in which the data are collected in matched pairs. This test takes into account both the direction of the differences within pairs of observations and the relative magnitude of the differences. The Wilcoxon matched-pairs signed-ranks test gives more weight to a pair showing a large difference between the two measurements than to a pair showing a small difference. To use this test, measurements must be at least ordinal-scaled within pairs. In addition, ordinal measurement must hold for the differences between pairs.
This test has many practical applications in marketing research. For instance, an ordinal scaling device, such as a semantic differential, can be used to measure attitudes toward, say, a bank. Then, after a special promotional campaign, the same sample would be given the same scaling device. Changes in the values of each scale could be analyzed by this Wilcoxon test.
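A sketch of that before-and-after design, with hypothetical semantic-differential scores for ten respondents (the bank example above is the scenario, but the numbers are invented):

```python
import numpy as np
from scipy import stats

# Hypothetical 1-7 attitude scores for the same 10 respondents,
# before and after the promotional campaign
before = np.array([4, 3, 5, 2, 4, 3, 4, 5, 3, 4])
after  = np.array([5, 4, 6, 4, 6, 4, 5, 6, 4, 6])

# Wilcoxon matched-pairs signed-ranks test on the paired differences;
# larger differences receive larger ranks and hence more weight
stat, p = stats.wilcoxon(before, after)
print(stat, round(p, 4))
```

With every respondent rating higher after the campaign, the test reports a significant shift; scipy may warn about tied ranks in small samples and fall back to a normal approximation.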
With ordinal measurement and two independent samples, the Mann-Whitney U test may be used to test whether the two groups come from the same population. This is a relatively powerful nonparametric test and is an alternative to the Student t-test when the analyst cannot meet the assumptions of the t-test or when measurement is at best ordinal. Both one- and two-tailed tests can be conducted. As indicated earlier, the results of U and t tests are often similar, leading to the same conclusion.
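The independent-samples case looks like this in practice; the two groups of ordinal ratings are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 ordinal ratings from two independent samples of shoppers
group_a = np.array([3, 4, 2, 5, 4, 3, 5, 4])
group_b = np.array([2, 1, 3, 2, 2, 3, 1, 2])

# Mann-Whitney U test of whether the two groups come from the same population
u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u, round(p, 4))
```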
The Kolmogorov-Smirnov two-sample test is a test of whether two independent samples come from the same population or from populations with the same distribution. The test is sensitive to any kind of difference between the distributions from which the two samples were drawn: differences in location (central tendency), dispersion, skewness, and so on. This characteristic makes it a very versatile test. Unfortunately, the test does not by itself show what kind of difference exists. There is also a Kolmogorov-Smirnov one-sample test, which is concerned with the agreement between the observed distribution of a set of sample values and some specified theoretical distribution. In this case it is a goodness-of-fit test similar to single-classification chi-square analysis.
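That sensitivity to any distributional difference can be illustrated with simulated data: two samples with the same mean but different dispersion, a difference a test of means would likely miss. The samples are simulated, not drawn from any study in the text:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Same central tendency, different dispersion
sample_1 = rng.normal(loc=50, scale=5, size=300)
sample_2 = rng.normal(loc=50, scale=15, size=300)

# K-S two-sample test: D is the maximum gap between the two empirical CDFs
d, p = stats.ks_2samp(sample_1, sample_2)
print(round(d, 3), p < 0.05)
```

The test flags the difference, but, as noted above, it does not say whether location, dispersion, or shape is responsible.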
Indexes of Agreement
Chi-square is appropriate for making statistical tests of independence in cross-tabulations. Usually, however, we are interested in the strength of association as well as the statistical significance of association. This concern is with what is known as substantive or practical significance. An association is substantively significant when it is statistically significant and of sufficient strength. Unlike statistical significance, however, there is no simple numerical value to compare against, and considerable research judgment is necessary. Although such judgment is subjective, it need not be completely arbitrary. The nature of the problem can offer some basis for judgment, and common sense can indicate that the degree of association is too low in some cases and high enough in others (Gold, 1969, p. 44).
Statisticians have devised a large number of indexes, often called indexes of agreement, for measuring the strength of association between two variables in a cross-tabulation. The main descriptors for classifying the various indexes are
1. Whether the table is 2 x 2 or larger (R x C)
2. Whether one, both, or neither of the variables has categories that obey some natural order (e.g., age, income level, family size)
3. Whether association is to be treated symmetrically or whether we want to predict membership in one variable's categories from (assumed known) membership in the other variable's categories
Space does not permit coverage of even an appreciable fraction of the dozens of agreement indexes that have been proposed. Rather, we shall illustrate one commonly used index for 2 x 2 tables and two indexes that deal with different aspects of the larger R x C (row-by-column) tables.
The 2 × 2 Case The phi correlation coefficient is a useful agreement index for the special case of 2 x 2 tables in which both variables are dichotomous. Moreover, an added bonus is the fact that phi equals the product-moment correlation (a cornerstone of multivariate methods) that one would obtain by correlating the two variables expressed in coded 0-1 form.
To illustrate, consider the 2 x 2 cross-tabulation in Table 12.15, taken from a study of shampoos. We wish to see if inclusion of the shampoo benefit “body” in the respondent’s ideal set is associated with the respondent’s indication that her hair lacks natural “body.” We first note from the table that high frequencies appear in the cells: (a) “body” included in ideal set and “no” to the question of whether her hair has enough (natural) body; and (b) “body” excluded from the ideal set and “yes” to the same question.
Table 12.15 Does Hair Have Enough Body Versus Body Inclusion in Ideal Set
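The equivalence between phi and the product-moment correlation of 0-1 coded variables can be verified directly. The cell frequencies below are a reconstruction inferred from the marginal and prediction-error counts quoted later in this section; check them against the actual Table 12.15:

```python
import numpy as np

# Reconstructed 2 x 2 frequencies (rows: "body" in ideal set yes/no;
# columns: "does your hair have enough body?" no/yes)
counts = {(1, 0): 26, (1, 1): 8, (0, 0): 17, (0, 1): 33}

# Expand to 0-1 coded observations: row variable = "body" in ideal set,
# column variable = answered "yes"
rows, cols = [], []
for (r, c), k in counts.items():
    rows += [r] * k
    cols += [c] * k
rows, cols = np.array(rows), np.array(cols)

# Phi equals the product-moment correlation of the two 0-1 variables;
# its sign depends only on which category is coded 1
phi = np.corrcoef(rows, cols)[0, 1]

# Equivalently, phi^2 * n recovers the chi-square statistic
n = len(rows)
print(round(phi, 3), round(phi ** 2 * n, 2))
```

The negative sign under this coding reflects the pattern noted above: including "body" in the ideal set goes with answering "no" about one's own hair.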
One of the most popular agreement indexes for summarizing the degree of association between two variables in a cross-tabulation of R rows and C columns is the contingency coefficient. This index is also related to chi-square and is defined as
where n is again the total sample size. From Table 12.15 we can first determine that chi-square is equal to 14.61, which, with 1 degree of freedom, is significant beyond the 0.01 level.
We can then find the contingency coefficient C as follows
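The computation can be sketched as follows, using the same reconstructed cell frequencies (verify them against the actual Table 12.15). The formula is C = sqrt(chi-square / (chi-square + n)):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Reconstructed 2 x 2 frequencies from Table 12.15
table = np.array([[26, 8],
                  [17, 33]])

n = table.sum()                                    # 84 respondents
# correction=False gives the uncorrected chi-square used in the text
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Contingency coefficient: C = sqrt(chi2 / (chi2 + n))
C = np.sqrt(chi2 / (chi2 + n))
print(round(chi2, 2), dof, round(C, 3))
```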
Both phi and the contingency coefficient are symmetric measures of association. Occasions often arise in the analysis of R x C tables (or the special case of 2 x 2 tables) where we desire an asymmetric measure of the extent to which we can reduce errors in predicting categories of one variable from knowledge of the categories of some other variable. Goodman and Kruskal's lambda-asymmetric coefficient can be used for this purpose (Goodman & Kruskal, 1954).
To illustrate the lambda-asymmetric coefficient, let us return to the cross-tabulation of Table 12.15. Suppose that we wished to predict which category (no versus yes) a randomly selected person would fall into when asked the question, "Does your hair have enough body?" If we had no knowledge of the row variable (whether that person included "body" in her ideal set or not), we would have only the column marginal frequencies to rely on.
Our best bet, given no knowledge of the row variable, is always to predict "no," the higher of the column marginal frequencies. As a consequence, we shall be wrong in 41 of the 84 cases, a probability of error of 41/84 = 0.49. Can we do better, in the sense of fewer prediction errors, if we utilize information provided by the row variable?
If we know that “body” is included in the ideal set, we shall predict “no” and be wrong in only 8 cases. If we know that “body” is not included in the ideal set, we shall predict “yes” and be wrong in 17 cases. Therefore, we have reduced our number of prediction errors from 41 to 8 + 17 = 25, a decrease of 16 errors. We can consider this error reduction relatively :
A less cumbersome (but also less transparent) formula for lambda-asymmetric is
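The error-counting logic above translates directly into code. The cell frequencies are again the reconstruction used earlier (verify against the actual Table 12.15); the function below implements the modal-category prediction rule rather than any particular package's routine:

```python
import numpy as np

# Reconstructed 2 x 2 frequencies (rows: "body" in ideal set yes/no;
# columns: "does your hair have enough body?" no/yes)
table = np.array([[26, 8],
                  [17, 33]])

def lambda_asymmetric(t):
    """Goodman-Kruskal lambda for predicting the column variable
    from the row variable, via counts of modal-category prediction errors."""
    # Without the row variable: always predict the modal column
    errors_without = t.sum() - t.sum(axis=0).max()
    # With the row variable: predict the modal cell within each row
    errors_with = sum(row.sum() - row.max() for row in t)
    return (errors_without - errors_with) / errors_without

lam_col_given_row = lambda_asymmetric(table)     # (41 - 25) / 41
lam_row_given_col = lambda_asymmetric(table.T)   # roles of the variables reversed
print(round(lam_col_given_row, 2), round(lam_row_given_col, 2))
```

The first value reproduces the 16/41 error reduction worked out above, and the transposed call gives the reversed-roles coefficient of 0.26 mentioned below.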
Lambda-asymmetric varies between zero, indicating no ability at all to eliminate errors in predicting the column variable on the basis of the row variable, and 1, indicating an ability to eliminate all errors in the column-variable predictions, given knowledge of the row variable. Not surprisingly, we could reverse the roles of the criterion and predictor variables and find lambda-asymmetric for the row variable, given the column variable. In the case of Table 12.15, this results in λ = 0.26. Note that in this case we simply reverse the roles of the row and column variables.
Finally, if desired, we could find a lambda-symmetric index via a weighted averaging of λC|R and λR|C. However, in the authors' opinion, lambda-asymmetric is of particular usefulness in the analysis of cross-tabulations because we often want to consider one variable as a predictor and the other as a criterion. Furthermore, lambda-asymmetric has a natural and useful interpretation as the percentage of total prediction errors that are eliminated in predicting one variable (e.g., the column variable) from another (e.g., the row variable).
Summary
We began by stating that data can be viewed as recorded information useful in making decisions. In the initial sections of this chapter, we introduced the basic concepts of transforming raw data into data of quality. The introduction was followed by a discussion of elementary descriptive analyses through tabulation and cross-tabulation. The focus of this discussion was heavily oriented toward how to read the data and how to interpret the results. The competent analysis of research-obtained data requires a blending of art and science, of intuition and informal insight, and of judgment and statistical treatment, combined with a thorough knowledge of the context of the problem being investigated.
The first section of the chapter dealt with cross-tabulation and chi-square analysis. This was followed by a discussion of bivariate analysis of differences in means and proportions. We next focused on the statistical machinery needed to analyze differences between groups: the t-test and one-factor and two-factor analysis of variance. These techniques are useful for both experimentally and nonexperimentally obtained data. We then looked at the process of analysis of variance in detail. A simple numerical example was used to demonstrate the partitioning of variance into among- and within-components. The assumptions underlying the various models were pointed out, and a hypothetical data experiment was analyzed to show how the ANOVA models operate.
We concluded by examining bivariate analyses of associations for interval- or ratio scaled data. The concept of associations between two variables was introduced through simple two-variable correlation. We examined the strength and direction of relationships using the scatter diagram and Pearson correlation coefficient. Several alternative (but equivalent) mathematical expressions were presented and a correlation coefficient was computed for a sample data set.
Investigations of the relationships between variables almost always involve making predictions. Bivariate (two-variable) regression was discussed as the foundation for the discussion of multivariate regression in the next chapter.
We ended the chapter with a discussion of the Spearman rank correlation as an alternative to the Pearson correlation coefficient when the data are of ordinal measurement and do not meet the assumptions of parametric methods. Also, the Goodman and Kruskal lambda measure for nominal measurement was briefly introduced, as were other nonparametric analyses.
There is a wide array of statistical techniques (parametric and nonparametric) that focus on describing and making inferences about the variables being analyzed. Some of these were shown in Table 12.4. Although somewhat dated, a useful reference for selecting an appropriate statistical technique is the guide published by the Institute for Social Research at the University of Michigan (Andrews et al., 1981) and its corresponding software, Statistical Consultant. Fink (2003, pp. 78-80) presents a summary table of which technique to use under which condition.
References
Andrews, F. M., Klem, L., Davidson, T. N., O’Malley, P. M., & Rodgers, W. L. (1981). A guide for selecting statistical techniques for analyzing social science data (2nd ed.). Ann Arbor: Institute for Social Research, University of Michigan.
Feick, L. F. (1984, November). Analyzing marketing research data with associated models. Journal of Marketing Research, 21, 376–386.
Fink, A. (2003). How to manage, analyze, and interpret survey data. Thousand Oaks, CA: Sage.
Gibbons, J. D. (1993). Nonparametric statistics: An introduction. Newbury Park, CA: Sage.
Gold, D. (1969, February). Statistical tests and substantive significance. American Sociologist, 4, 44.
Goodman, L. A., & Kruskal, W. H. (1954, December). Measures of association for cross classification. Journal of the American Statistical Association, 49, 732–764.
Hellevik, O. (1984). Introduction to causal analysis: Exploring survey data. Beverly Hills, CA: Sage.
Lewis-Beck, M. S. (1995). Data analysis: An introduction. Thousand Oaks, CA: Sage.
Semon, T. T. (1999, August 2). Use your brain when using a chi-square. Marketing News, 33, 6.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Smith, S., & Albaum, G. (2005). Fundamentals of marketing research. Thousand Oaks, CA: Sage.
Zeisel, H. (1957). Say it with figures (4th ed.). New York: Harper & Row.