Analysis can be viewed as the categorization, aggregation, and manipulation of data to obtain answers to the research questions underlying the research project. The final aspect of analysis is interpretation. The process of interpretation involves taking the results of analysis, making inferences relevant to the research relationships studied, and drawing managerially useful conclusions about these relationships.
Basic Concepts of Analyzing Associative Data

Our chapter begins with a brief discussion of cross-tabulation to mark the beginning of a major topic of this chapter—the analysis of associative data. A large part of the rest of the chapter will focus on methods for analyzing how the variation of one variable is associated with variation in other variables.
The computation of row or column percentages in the presentation of crosstabulations is taken up first. We then show how various insights can be obtained as one goes beyond two variables in a cross-tabulation to three (or more) variables. In particular, examples are presented of how the introduction of a third variable can often refine or explain the observed association between the first two variables.
Bivariate Cross-Tabulation
Bivariate cross-tabulation represents the simplest form of associative data analysis. At the minimum we can start out with only two variables, such as occupation and education, each of which has a discrete set of exclusive and exhaustive categories. Data of this type are called categorical, since each variable is assumed to be nominal-scaled. Bivariate cross-tabulation is widely used in marketing applications to analyze variables at all levels of measurement. In fact, it is the single most widely used bivariate technique in applied settings. Reasons for the continued popularity of bivariate cross-tabulation include the following (Feick, 1984, p. 376):
1. It provides a means of data display and analysis that is clearly interpretable even to the less statistically inclined researcher or manager.

2. A series of bivariate tabulations can provide clear insights into complex marketing phenomena that might be lost in a simultaneous analysis with many variables.

3. The clarity of interpretation affords a more readily constructed link between market research and market action.

4. Bivariate cross-tabulations may lessen the problems of sparse cell values that often plague the interpretation of discrete multivariate analyses (bivariate cross-tabulations require that the expected number of respondents in any table cell be at least 5).
The entities being cross-classified are usually people, objects, or events. The cross-tabulation, at its simplest, consists of a simple count of the number of entities that fall into each of the possible categories of the cross-classification. Excellent discussions of ways to analyze cross-tabulations can be found in Hellevik (1984) and Zeisel (1957).
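At its simplest, the count of entities per cell can be produced with a few lines of Python; the (occupation, education) pairs below are hypothetical, not taken from the chapter's tables:

```python
from collections import Counter

# Hypothetical (occupation, education) pairs for six respondents;
# the categories are illustrative only
records = [
    ("white collar", "college"), ("white collar", "college"),
    ("white collar", "high school"), ("blue collar", "high school"),
    ("blue collar", "high school"), ("blue collar", "college"),
]

# The cross-tabulation, at its simplest, is a count of the entities
# that fall into each cell of the cross-classification
table = Counter(records)
```

Each key of `table` is one cell of the cross-classification, and its value is the raw frequency for that cell.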
However, we usually want to do more than show the raw frequency data. At the very least, row or column percentages (or both) are usually computed.
Percentages
The simple mechanics of calculating percentages are known to all of us. In cross-tabulation, percentages serve as a relative measure, indicating the size of two or more categories relative to a common base.
The ease and simplicity of calculation, the general understanding of its purpose, and its near-universal applicability have made the percent statistic, or its counterpart the proportion, the most widely used statistical tool in marketing research. Yet its simplicity of calculation is sometimes deceptive, and the understanding of its purpose is frequently insufficient to ensure sound application and interpretation. The result is that the percent statistic is often the source of misrepresentations, either inadvertent or intentional.
The sources of problems in using percentages lie largely in identifying the direction in which percentages should be computed and in knowing how to interpret percentage change.
Both these problems can be illustrated by a small numerical example that uses a before and after with control experimental design. Let us assume that KEN’s Original, a small regional manufacturer of salad dressings, is interested in testing the effectiveness of spot TV ads in increasing consumer awareness of a new brand called Life. Two geographic areas are chosen for the test: (1) test area A and (2) control area B. The test area receives five 15 second television spots per week over an eight-week period, whereas the control area receives no spot TV ads at all. (Other forms of advertising were equal in each area.)
Assume that four independent samples (of telephone interviews) were conducted as before and after tests in each of the areas. Respondents were asked to state all the brands of salad dressing they could think of, on an unaided basis. If Life was mentioned, it was assumed that this constituted consumer awareness of the brand. However, as it turned out, sample sizes differed across all four sets of interviews. This common occurrence in surveys (i.e., variation in sample sizes) increases the value of computing percentages.
Table 12.1 shows a set of frequency tables that were compiled before and after a TV ad for Life Salad Dressing was aired. Interpretation of Table 12.1 would be hampered if the data were expressed as raw frequencies and different percentage bases were reported. Accordingly, Table 12.1 shows the data, with percentages based on column and row totals. Can you see why the row percentages are more useful for analytical purposes?
Direction in Which to Compute Percentages
In examining the relationship between two variables, it is often clear from the context that one variable is more or less the independent or control variable and the other is the dependent or criterion variable. In cases where this distinction is clear, the rule is to compute percentages in the direction of the independent (control) variable, so that the categories within each of its levels sum to 100 percent, and then to compare those percentages across its levels.
In Table 12.1, the control variable is the experimental area (test versus control) and the dependent variable is awareness. When comparing awareness in the test and control areas, row percentages are preferred. We note that before the spot TV campaign the percentage of respondents who are aware of Life is almost the same between test and control areas: 42 percent and 40 percent, respectively.
However, after the campaign the test-area awareness level moves up to 66 percent, whereas the control-area awareness (42 percent) stays almost the same. The small increase of 2 percentage points reflects either sampling variability or the effect of other factors that might be serving to increase awareness of Life in the control area.
On the other hand, computing percentages across the independent variable (column percent) makes little sense. We note that 61 percent of the aware group (before the spot TV campaign) originates from the test area; however, this is mainly a reflection of the differences in total sample sizes between test and control areas.
After the campaign we note that the percentage of aware respondents in the control area is only 33 percent, versus 39 percent before the campaign. This may be erroneously interpreted as indicating that spot TV in the test area depressed awareness in the control area. But we know this to be false from our earlier examination of the row percentages.
It is not always the case that one variable is clearly the independent or control variable and the other is the dependent or criterion variable. This should pose no particular problem as long as we agree, for analysis purposes, which variable is to be considered the control variable. Indeed, cases often arise in which each of the variables in turn serves as the independent and dependent variable.
Table 12.1 Aware of Life Salad Dressing — Before and After Spot TV
Table 12.2 shows the percentage of respondents who are aware of Life before and after the spot TV campaign in the test and control areas. First, we note that the test-area respondents displayed a greater absolute increase in awareness: the increase for the test-area respondents was 24 percentage points, whereas the control-area awareness increased by only 2 percentage points. Expressed as a relative difference, these increases amount to 100(24/42) = 57 percent and 100(2/40) = 5 percent of the respective before-campaign levels.
Table 12.2 Aware of Life—Percentages Before and After the Spot TV Campaign
The percentage of possible increase for the test area is computed by first noting that the maximum percentage-point increase that could have occurred is 100 − 42 = 58 points. The increase actually registered is 24 percentage points, or 100(24/58) = 41 percent of the maximum possible. That of the control area is 100(2/60) = 3 percent of the maximum possible (the control area could have gained at most 100 − 40 = 60 points).
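The three difference measures can be sketched in a few lines of Python; the before and after percentages are those reported for the test and control areas:

```python
# Before/after awareness percentages from the Life example
test_before, test_after = 42.0, 66.0
ctrl_before, ctrl_after = 40.0, 42.0

def absolute_diff(before, after):
    # difference in percentage points
    return after - before

def relative_diff(before, after):
    # percent change relative to the before level
    return 100 * (after - before) / before

def pct_of_possible(before, after):
    # share of the maximum possible percentage-point gain actually realized
    return 100 * (after - before) / (100 - before)
```

For the test area these return 24 points, roughly 57 percent, and roughly 41 percent of the maximum possible, matching the figures in the text.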
In terms of the illustrative problem, all three methods give consistent results in the sense that the awareness level in the test area undergoes greater change than that in the control area. However, in other situations conflicts among the measures may arise.
The absolute-difference method is simple to use and requires only that the distinction between percentage and percentage points be understood. The relative-difference method can be misleading, particularly if the base for computing the percentage change is small. The percentage-of-possible-difference method takes cognizance of the greater difficulty associated with obtaining increases in awareness as the difference between the potential level and the realized level decreases. In some studies all three measures are used, inasmuch as they emphasize different aspects of the relationship.
Introducing a Third Variable into the Analysis
Cross-tabulation analysis to investigate relationships need not stop with two variables. Often much can be learned about the original two-variable association through the introduction of a third variable that may refine or explain the original relationship. In some cases, the third variable may show that the two variables are related even though no apparent relationship existed before it was introduced. These ideas are most easily explained by example.
Consider the hypothetical situation facing MCX Company, a specialist in telecommunications equipment for the residential market. The company has recently test-marketed a new service for online recording of cable or satellite programs without a storage box. Several months after the introduction, a telephone survey was taken in which respondents in the test area were asked whether they had adopted the innovation. The total number of respondents interviewed was 600 (Table 12.3).
Table 12.3 Adoption—Percentage by Gender and Age
One of the variables of major interest in this study was the age of the respondent. Based on earlier studies of the residential market, it appeared that adopters of the firm's new products tended to be less than 35 years old. Accordingly, the market analyst decides to cross-tabulate adoption and respondent age. Respondents are classified into the categories "under 35 years" (<35) and "equal to or greater than 35 years" (≥35) and then cross-classified by adoption or not. Table 12.3 shows the full three-variable cross-tabulation. It seems that the total sample of 600 is split evenly between those who are under 35 years of age and those who are 35 years of age or older. Younger respondents display a higher percentage of adoption (37 percent = (100 + 11)/300) than older respondents (23 percent = (60 + 9)/300).
Analysis and Interpretation
The researcher is primarily interested in whether this finding differs when gender of the respondent is introduced into the analysis. As it turned out, 400 respondents in the total sample were men, whereas 200 were women.
Table 12.3 shows the results of introducing gender as a third classificatory variable. In the case of men, 50 percent of the younger men adopt compared with only 30 percent of the older men. In the case of women, the percentages of adoption are much closer. Even here, however, younger women show a slightly higher percentage of adoption (11 percent) than older women (9 percent).
The effect of gender on the original association between adoption and age is to refine that association without changing its basic character; younger respondents show a higher incidence of adoption than older respondents. However, what can now be said is the following: If the respondent is a man, the differential effect of age on adoption is much more pronounced than if the respondent is a woman.
This pattern is even easier to identify when we show this information graphically (Figure 12.1). The height of the bars within each rectangle represents the percentage of respondents who are adopters. The relative width of the bars denotes the relative size of the categories of the third variable, gender (men versus women). The shaded portions of the bars denote the percentage adopting by gender, and the dashed line represents the weighted average percentage adopting across the genders.
Figure 12.1 Adoption—Percentage by Age and Gender
It is easy to see from the figure that adoption differs by age group (37 percent versus 23 percent). Furthermore, the size of the difference depends on the gender of the respondent: Men display a relatively higher rate of adoption, compared with women, in the younger age category.
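The refined association can be checked numerically. The sketch below uses cell counts reconstructed from the percentages reported in the text (400 men and 200 women, split evenly across the two age groups); the counts are our reconstruction, not a reproduction of Table 12.3:

```python
# (adopters, total) cell counts reconstructed from the percentages in
# the text: e.g., 50% of 200 younger men adopted
cells = {
    ("men", "<35"): (100, 200), ("men", ">=35"): (60, 200),
    ("women", "<35"): (11, 100), ("women", ">=35"): (9, 100),
}

def pct_adopting(age):
    # aggregate adoption percentage for an age group across genders
    adopters = sum(a for (_, ag), (a, _) in cells.items() if ag == age)
    total = sum(n for (_, ag), (_, n) in cells.items() if ag == age)
    return 100 * adopters / total
```

Aggregating across genders recovers the original two-variable result: 37 percent adoption among younger respondents versus 23 percent among older respondents.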
Recapitulation
Three-variable association can involve many possibilities, which can be illustrated in terms of the preceding adoption-age-gender example:
1. In the example presented, adoption and age exhibit initial association; this association is still maintained in the aggregate but is refined by the introduction of the third variable, gender.

2. Adoption and age do not appear to be associated. However, adding and controlling on the third variable, gender, reveals suppressed association between the first two variables within the separate categories of men and women. In the two-variable case, men and women exhibit opposite patterns, canceling each other out.
Although the preceding example was contrived to illustrate the concepts, the results are not unusual in practice. It goes almost without saying that the introduction of a third variable can often be useful in the interpretation of two-variable cross-tabulations.
However, the reader should be aware of the fact that we have deliberately used the phrase associated with rather than caused by. Association of two or more variables does not imply causation, and this statement is true regardless of our preceding efforts to refine some observed two-variable association through the introduction of a third variable.
In principle, of course, we could cross tabulate four or even more variables with the possibility of obtaining further insight into lower-order (e.g., two-variable) associations. However, somewhere along the line, a problem arises in maintaining an adequate cell size for all categories. Unless sample sizes are extremely large in the aggregate and the number of categories per variable is relatively small, cross-tabulations rarely can deal with more than three variables at a time. A further problem, independent of sample size, concerns the high degree of complexity of interpretation that is introduced by cross-tabulations involving four or more variables. In practice, most routine applications of cross-tabulation involve only two variables at a time.
As noted in Table 12.3, there are definite advantages associated with having a two-category criterion variable, such as adoption versus non-adoption. In many applications, however, the criterion variable will have more than two categories. Cross-tabulations can still be prepared in the usual manner, although they become somewhat more tedious to examine.
Bivariate Analysis: Differences Between Sample Groups
Marketing activities largely focus on the identification and description of market segments. These segments may be defined demographically, attitudinally, by the quantity of the product used, by activities participated in, by interests, by opinions, or by a multitude of other measures. The key feature of each of these variables is the ability to group respondents into market segments. Often this segmentation analysis involves the identification of differences and the asking of questions about the marketing implications of those differences: Do differences in satisfaction exist for the two or more groups that are defined by age categories?
Bivariate statistical analysis refers to the analysis of relationships between two variables. These analyses often concern differences between respondent groups. In the following discussion, we explore bivariate statistical analysis and focus on the two-variable case as a bridge between the comparatively simple analyses already discussed and the more sophisticated techniques that will command our attention in later chapters. We begin with what is perhaps the most used test of market researchers: cross-tabulation. Next, we consider analysis of differences in group means. First, we discuss the t-test of differences in means of two independent samples, and then we look at one-way analysis of variance (ANOVA), which tests for differences in means for k groups. Finally, we provide a discussion of some of the more widely used nonparametric techniques. These are but a few of the possible parametric and nonparametric analyses that could be discussed (see Table 12.4).
Bivariate Cross-Tabulation
Often called chi-square analysis, this technique is used when the data consist of counts or frequencies within categories of a tabulation or cross-tabulation table. In conjunction with the cross-tabulation we introduce the chi-square statistic, χ², to determine whether the observed association in a cross-tabulation involving two or more variables is statistically significant. This is typically called a χ² test of independence.
Cross-tabulation represents the simplest form of associative data analysis. At the minimum we can start out with a bivariate cross-tabulation of two variables, such as occupation and education, each of which identifies a set of exclusive and exhaustive categories. We know that such data are called categorical because each variable is assumed to be only nominal-scaled. Bivariate cross-tabulation is widely used in marketing applications to analyze variables at all levels of measurement. In fact, it is the single most widely used bivariate technique in applied settings.
Table 12.4 Selected Nonparametric Statistical Tests for Two-Sample Cases
In marketing research, observations may be cross-classified, as when we are interested in testing whether occupational status is associated with brand loyalty. Suppose, for illustrative purposes, that a marketing researcher has assembled data on brand loyalty and occupational status—white collar, blue collar, and unemployed or retired—that describes consumers of a particular product class. The data for our hypothetical problem appear in Table 12.5.
A total of four columns, known as banner points, are shown. Four rows, or stubs, are also shown. Professional cross-tabulation software will output multiple side-by-side tables that join multiple variables on the column banner points, so that loyalty and another variable, such as usage occasion, can be analyzed simultaneously.
In a study of 230 customers, we are interested in determining whether occupational category is associated with the characteristic of loyalty status. The data suggest that a relationship exists, but is the observed association a reflection of sampling variation, or can we conclude that a true relationship exists?
Table 12.5 Contingency Table of Observed versus Theoretical Frequencies
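As a sketch of the mechanics, the χ² statistic can be computed by hand from the observed counts. The 2 × 3 loyalty-by-occupation counts below are hypothetical (they sum to 230 but are not the actual Table 12.5 figures); the theoretical (expected) frequency for each cell is its row total times its column total, divided by the sample size:

```python
# Hypothetical 2 x 3 counts (n = 230); columns are white collar,
# blue collar, unemployed/retired
observed = [
    [60, 40, 15],   # loyal
    [40, 45, 30],   # not loyal
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # theoretical frequency under independence
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp

# degrees of freedom = (rows - 1) x (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)
```

For these illustrative counts the statistic exceeds the .05 critical value of 5.991 for 2 degrees of freedom, so independence would be rejected.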
A great amount of marketing research is concerned with estimating parameters of one or more populations. In addition, many studies go beyond estimation and compare such population parameters by testing hypotheses about differences between them. Means, proportions, and variances are often the summary measures of concern. Our concern at this point is with differences in means and proportions. Direct comparisons of variances are a special case of the more general technique of analysis of variance, which is covered later in this chapter.
Standard Error of Differences
Here we extend the topic of sampling distributions and standard errors, as they apply to a single statistic, to cover differences between statistics, and we show the traditional hypothesis test for differences.
Standard Error of Difference of Means
For two samples, A and B, that are independent and randomly selected, the standard error of the difference in means is calculated by

s(x̄A − x̄B) = √(s²A/nA + s²B/nB)
For relatively small samples the correction factor n/(n − 1) is applied to each term, and the resulting formula for the estimated standard error is

s(x̄A − x̄B) = √(s²A/(nA − 1) + s²B/(nB − 1))
Of course, these formulas would be appropriate for use in the denominator of the t-test.
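Both forms of the standard error can be sketched in Python; the two small samples below are hypothetical:

```python
import math

# Two small hypothetical samples
a = [4, 6, 8]
b = [3, 5, 7]

def var_n(x):
    # variance with divisor n, as used in the large-sample formula
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

# Large-sample form: sqrt(s2_A/n_A + s2_B/n_B)
se_large = math.sqrt(var_n(a) / len(a) + var_n(b) / len(b))

# Small-sample form, applying the n/(n - 1) correction to each term
se_small = math.sqrt(var_n(a) / (len(a) - 1) + var_n(b) / (len(b) - 1))
```

With samples this small the correction matters; as n grows, the two estimates converge.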
Standard Error of Differences of Proportions
Turning now to proportions, the derivation of the standard error of the differences is somewhat similar. Specifically, for large samples,

s(pA − pB) = √(pA(1 − pA)/nA + pB(1 − pB)/nB)

The traditional hypothesis-testing approach rests on several conditions:
1. Samples must be independent.

2. Individual items in samples must be drawn in a random manner.

3. The population being sampled must be normally distributed (or the sample must be of sufficiently large size).

4. For small samples, the population variances must be equal.

5. The data must be at least interval-scaled.
When these five conditions are met, or can at least be reasonably assumed to exist, the traditional approach is as follows.
1. The null hypothesis (H0) is specified such that there is no difference between the parameters of interest in the two populations (e.g., H0: μA − μB = 0); any observed difference is assumed to occur solely because of sampling variation.

2. The alpha risk is established (α = .05 or some other value).

3. A Z value is calculated by the appropriate adaptation of the Z formula. For testing the difference between two means, Z is calculated in the following way:

Z = (x̄A − x̄B) / s(x̄A − x̄B)

4. The probability of the observed difference of the two sample statistics having occurred by chance is determined from a table of the normal distribution (or the t distribution, interpreted with nA + nB − 2 degrees of freedom).

5. If the probability of the observed difference having occurred by chance is greater than the alpha risk, the null hypothesis of no difference is not rejected; it is concluded that the parameters of the two populations are not significantly different. If that probability is less than the alpha risk, the null hypothesis is rejected; it is concluded that the parameters of the two populations differ significantly. In an applied setting, there are times when the level at which significance occurs (the alpha level) is reported and management decides whether to accept or reject.
To illustrate the small-sample case, let us assume that we obtain the same mean values and get values for sA and sB such that the estimated standard error of the difference is s(x̄A − x̄B) = $0.78, from samples nA = 15 and nB = 12. With these data we calculate t as follows:
The critical value of t is obtained from a table of percentiles for the t distribution. For, say, α = .05, we determine the critical value of t for (nA + nB − 2) = 25 degrees of freedom to be 1.708 (one-tailed test).
Since the calculated t of 1.54 < 1.708, we cannot reject the hypothesis of no difference in the average amount spent by the two types of families. This result is shown in Figure 12.2. When samples are not independent, the same general procedure is followed; the formulas for calculating the test statistics differ, however.
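The small-sample calculation can be sketched as follows. The mean difference of $1.20 is an assumption for illustration (the text gives only the standard error and the resulting t), chosen to be consistent with the reported values:

```python
# Assumed mean difference of $1.20 (hypothetical; consistent with the
# reported standard error of $0.78 and t of 1.54)
diff = 1.20
se_diff = 0.78

t = diff / se_diff

# Critical value from a t table: one-tailed, alpha = .05, 25 df
critical_t = 1.708
reject_null = t > critical_t   # False: cannot reject the null hypothesis
```

Because the calculated t falls short of the critical value, the hypothesis of no difference is not rejected.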
Figure 12.2 t-Distribution
The t-distribution revolutionized statistics and the ability to work with small samples. Prior statistical work was based largely on the value of Z, which was used to designate a point on the normal distribution when population parameters were known. For most market research applications it is difficult to justify the Z-test's assumed knowledge of μ and σ. The t-test relaxes the rigid assumptions of the Z-test by focusing on sample means and variances (x̄ and s). The t-test is a widely used market research statistic for testing differences between two groups of respondents.
In the previous section, the t statistic was described. Most computer programs recognize two versions of this statistic. The first, presented above, is called the separate-variance estimate:

t = (x̄A − x̄B) / √(s²A/nA + s²B/nB)
The second method, called the pooled-variance estimate, pools the two sample variances into a single weighted average that is used in the denominator of the statistic:

t = (x̄A − x̄B) / √(s²p(1/nA + 1/nB)), where s²p = [(nA − 1)s²A + (nB − 1)s²B] / (nA + nB − 2)
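The two estimates can be compared on a small hypothetical data set; with equal sample sizes they coincide, which is one reason the choice matters mainly when group sizes (or variances) differ:

```python
import math

# Two hypothetical groups of equal size
a = [5, 6, 7, 8]
b = [1, 2, 3, 4]

def mean(x):
    return sum(x) / len(x)

def var(x):
    # unbiased sample variance (divisor n - 1)
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

na, nb = len(a), len(b)

# Separate-variance estimate
t_sep = (mean(a) - mean(b)) / math.sqrt(var(a) / na + var(b) / nb)

# Pooled-variance estimate
s2p = ((na - 1) * var(a) + (nb - 1) * var(b)) / (na + nb - 2)
t_pool = (mean(a) - mean(b)) / math.sqrt(s2p * (1 / na + 1 / nb))
```

Here the two t values are identical because na = nb; with unequal group sizes and unequal variances they diverge, which is what Levene's test helps adjudicate.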
Table 12.7 shows a sample output of t-tests from SPSS. Two respondent groups were identified for the supermarket study :
1. Respondents who were males
2. Respondents who were females
The analysis by gender is shown in Part A of the output for the attitudinal dimensions, friendliness of clerks and helpfulness of clerks. It will be noted that the sample sizes are approximately equal. To show a contrast, we have included in Part B of the output a t-test from another study in which the number of males and females varies widely. In all three examples, the differences between separate-variance and pooled-variance analysis are very small. Table 12.7 also shows an F statistic. This is Levene's test for equality of variances. Using this test, the researcher knows which t-test output to use: equal variances assumed or equal variances not assumed.
Table 12.7 Selected Output From PASW (SPSS) t-Test
Example: It is well known that interest ratings for TV ads are related to the advertising message. A simple one-way ANOVA to investigate this relationship might compare three messages:
Exhibit 12.2 Example of ANOVA Methodology
Using the single-factor ANOVA design for the problem described in Exhibit 12.1, the actual interest-rating data for 12 respondents (4 respondents for each of the 3 treatments) appear in tabular and graphical format as follows:
It is apparent that the messages differ in terms of their interest ratings, but how do we perform ANOVA to determine whether these differences are statistically significant? Three values must be computed in order to analyze this pattern of values.
1. Compute the total sum of squares: The grand mean of the 12 observations is computed, followed by the squared deviations of the individual observations from this mean.

Grand mean = 4.0 = x̄G

Total sum of squares = Σ(xij − x̄G)² = (6 − 4)² + (4 − 4)² + … + (1 − 4)² = 110

2. Compute the between-treatment sum of squares: The means of the factor levels (messages A, B, and C) are computed, followed by the squared deviations of the factor-level means from the overall mean, weighted by the number of observations:

Between-treatment sum of squares = Σ nj(x̄j − x̄G)²

3. Compute the within-treatment sum of squares: The means of the factor levels are computed, followed by the squared deviations of the observations within each factor level from that factor-level mean:

Within-treatment sum of squares = ΣΣ(xij − x̄j)²
Thus, an observation may be decomposed into three terms that are additive, each of which explains a particular type of variance:
The overall mean is constant and common to all observations; the deviation of a group mean from the overall mean represents the effect on each observation of belonging to that particular group; and the deviation of an observation from its group mean represents the effect on that observation of all variables other than the group variable.
The basic idea of ANOVA is to compare the between-treatment-group sum of squares (after dividing by degrees of freedom to get a mean square) with the within-treatment-group sum of squares (also divided by the appropriate number of degrees of freedom). The ratio of these mean squares is the F statistic, which indicates the strength of the grouping factor. Conceptually,

F = between-treatment mean square / within-treatment mean square
However, to make this comparison, it is necessary to assume that the error-term distribution has constant variance over all observations. This is exactly the same assumption as was made for the t test.
In the next section we shall (a) use more efficient computational techniques, (b) consider the adjustment for degrees of freedom to obtain mean squares, and (c) show the use of the F ratio in testing significance. Still, the foregoing remarks represent the basic ANOVA idea of comparing between-sample with within-sample variability.
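The between- versus within-sample comparison just described can be sketched end to end in Python. The three groups of four ratings below are hypothetical (they do not reproduce the Exhibit 12.2 data), chosen so the arithmetic is easy to follow:

```python
# Hypothetical interest ratings: 4 respondents per message
groups = {"A": [6, 5, 6, 7], "B": [4, 4, 3, 5], "C": [2, 3, 1, 2]}

all_obs = [v for g in groups.values() for v in g]
grand_mean = sum(all_obs) / len(all_obs)

# Total sum of squares: deviations of observations from the grand mean
ss_total = sum((v - grand_mean) ** 2 for v in all_obs)

# Between-treatment sum of squares: deviations of group means from the
# grand mean, weighted by group size
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
)

# Within-treatment sum of squares is the remainder
ss_within = ss_total - ss_between

df_between = len(groups) - 1
df_within = len(all_obs) - len(groups)

# F compares the two mean squares
F = (ss_between / df_between) / (ss_within / df_within)
```

For these illustrative data most of the variation lies between messages, so F is large and the null hypothesis of equal message means would be rejected.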
One-Way (Single-Factor) Analysis of Variance
One-way ANOVA is analysis of variance in its simplest (single-factor) form. Suppose a new product manager for the hypothetical Friskie Corp. is interested in the effect of shelf height on supermarket sales of canned dog food. The product manager has been able to secure the cooperation of a store manager to run an experiment involving three levels of shelf height (knee level, waist level, and eye level) on sales of a single brand of dog food, which we shall call Snoopy. Assume further that our experiment must be conducted in a single supermarket and that our response variable will be sales, in cans of dog food, for some appropriate unit of time. But what shall we use for our unit of time? Sales of dog food in a single store may exhibit week-to-week, day-to-day, and even hour-to-hour variation. In addition, sales of this particular brand may be influenced by the price or special promotions of competitive brands, the store management's knowledge that an experiment is going on, and other variables that we cannot control at all or would find too costly to control.
Assume that we have agreed to change the shelf-height position of Snoopy three times per day and run the experiment over eight days. We shall fill the remaining sections of our gondola with a filler brand, which is not familiar to customers in the geographical area in which the test is being conducted. Furthermore, since our primary emphasis is on explaining the technique of analysis of variance in its simplest form (analysis of one factor: shelf height), we shall assign the shelf heights at random over the three time periods per day and not design an experiment to explicitly control and test for within-day and between-day differences. Our experimental results are shown in Table 12.8.
Here, we let Xij denote the sales (in units) of Snoopy during the ith day under the jth treatment level. If we look at mean sales by each level of shelf height, it appears as though the waist-level treatment, the average response to which is x̄2 = 90.9, results in the highest mean sales over the experimental period. However, we note that the last observation (93) under the eye-level treatment exceeds the waist-level treatment mean. Is this a fluke observation? We know that these means are, after all, sample means, and our interest lies in whether the three population means from which the samples are drawn are equal or not.
Now we shall show what happens when one goes through a typical one way analysis of variance computation for this problem.
This is the same quantity that would be obtained by subtracting the grand mean of 85.5 from each original observation, squaring the result, and adding up the 24 squared deviations. This mean-corrected sum of squares is equivalent to the type of formula used earlier in this chapter.
The interpretation of this analysis is, like the t-test, a process of comparing the F value of 14.6 (2, 21 df) with the table value of 4.32 (p =.05). Because (14.6 > 4.32), we reject the null hypothesis that treatments 1, 2, and 3 have equivalent appeals.
The important consideration to remember is that, aside from the statistical assumptions underlying the analysis of variance, the variance of the error distribution will markedly influence the significance of the results. That is, if the variance is large relative to differences among treatments, then the true effects may be swamped, leading to an acceptance of the null hypothesis when it is false. As we know, an increase in sample size can reduce this experimental error. Though beyond the scope of this chapter, specialized experimental designs are available, the objectives of which are to increase the efficiency of the experiment by reducing the error variance.
Follow-up Tests of Treatment Differences
The question that now must be answered is the following: Which treatments differ? The F-ratio tells us only that differences exist. The question of where the differences lie is answered by follow-up analyses, usually a series of independent-samples t-tests that compare the treatment-level combinations (1,2), (1,3), and (2,3). Because of our previous discussion of the t-test, we will not discuss these tests in detail. We note only that there are various forms of the t-statistic that may be used when conducting a series of two-group tests. These test statistics (which include techniques known as the LSD (Least Significant Difference) test, Bonferroni's test, Duncan's multiple range test, Scheffé's test, and others) control the cumulative probability that a Type I error will occur when a series of statistical tests is conducted. Recall that if each test in a series has a .05 probability of a Type I error, then in a series of 20 such tests we would expect one (20 × .05 = 1) of them to report a significant difference that does not exist (a Type I error). These tests typically are options provided by the standard statistical packages, such as the PASW (SPSS) procedure Oneway.
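A minimal sketch of one such follow-up procedure, the Bonferroni correction, is shown below. The three treatment samples are hypothetical (the actual Table 12.8 values are not shown); the correction simply divides the per-test alpha by the number of comparisons:

```python
import itertools
import numpy as np
from scipy import stats

# Hypothetical treatment-level data for the three shelf heights
samples = {
    1: np.array([77, 82, 86, 78, 81, 86, 77, 81]),   # knee level
    2: np.array([88, 94, 93, 90, 91, 94, 90, 87]),   # waist level
    3: np.array([85, 85, 87, 81, 80, 79, 87, 93]),   # eye level
}

alpha = 0.05
pairs = list(itertools.combinations(samples, 2))     # (1,2), (1,3), (2,3)

# Bonferroni: divide the per-test alpha by the number of comparisons so the
# cumulative Type I error rate across all three tests stays near 0.05
adjusted_alpha = alpha / len(pairs)

results = []
for i, j in pairs:
    t, p = stats.ttest_ind(samples[i], samples[j])
    results.append((i, j, t, p, p < adjusted_alpha))
    print(f"treatments {i} vs {j}: t = {t:.2f}, p = {p:.4f}, "
          f"significant at adjusted alpha: {p < adjusted_alpha}")
```

The other procedures named above (LSD, Duncan, Scheffé) differ in how aggressively they trade power against Type I error control, but they address the same multiple-comparison problem.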
Bivariate Analysis: Measures Of Association
Bivariate measures of association apply to the two-variable case in which both variables are interval- or ratio-scaled. Our concern is with the nature of the association between the two variables and the use of this association in making predictions.
Correlation Analysis
When referring to a simple two-variable correlation, we refer to the strength and direction of the relationship between the two variables. As an initial step in studying the relationship between the X and Y variables, it is often helpful to graph this relationship in a scatter diagram (also known as an X-Y plot). Each point on the graph represents the appropriate combination of scale values for the associated X and Y variables, as shown in Figure 12.3.
The objective of correlation analysis, then, is to obtain a measure of the degree of linear association (correlation) that exists between the two variables. The Pearson correlation coefficient is commonly used for this purpose and is defined by the formula
The alternate formulation shows the correlation coefficient to be the average of the products of the Z scores for the X and Y variables. In this method of computing the correlation coefficient, the first step is to convert the raw data to Z scores by taking the deviation from the respective sample mean and dividing by the standard deviation. The Z scores are standardized variables, each with a mean of zero and a standard deviation of one.
The transformation of the X and Y variables to Z scores means that the scales measuring the original variables are no longer relevant: a Z-score variable originally measured in dollars can be correlated with another Z-score variable originally measured on a satisfaction scale. The original metric scales are replaced by a new, abstract scale, and the correlation is computed from the products of the two Z distributions.
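The Z-score route and the usual product-moment formula give identical results, which can be checked in a few lines. The paired observations here are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical paired observations (illustrative only)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([1.5, 3.0, 4.5, 7.0, 8.0, 11.5])

# Step 1: convert each variable to Z scores -- deviation from the sample
# mean divided by the standard deviation
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Step 2: the correlation is the average of the products of paired Z scores
r = (zx * zy).mean()

# Matches numpy's built-in product-moment correlation
print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))
```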
Figure 12.3 Scatter Diagrams
Figure 12.5 Consumption Expenditure and Income Data
- Can we predict a person's weekly fast food and restaurant food purchases from that person's gender, age, income, or education level?
- Can we predict the dollar volume of purchase of our new product by industrial purchasing agents as a function of our relative price, delivery schedules, product quality, and technical service?
- Highway driving conditions
- Average temperature in the three-day period preceding the weekend
- Local weather forecast for the weekend
- Amount of newspaper space devoted to the resort's advertisements in the surrounding city newspapers
- A moving average of the three preceding weekends' ticket sales
1. Can we find a predictor variable (or, in the multiple case, a linear composite of the predictor variables) that will parsimoniously express the relationship between a criterion variable and the predictor (or set of predictors)?
2. If we can, how strong is the relationship; that is, how accurately can we predict values of the criterion variable from values of the predictor (or linear composite)?
3. Is the overall relationship statistically significant?
4. Which predictor is most important in accounting for variation in the criterion variable? (Can the original model be reduced to fewer variables and still provide adequate prediction of the criterion?)
Suppose that a marketing researcher is interested in consumers' attitudes toward nutritional additives in ready-to-eat cereals. Specifically, a set of written concept descriptions of a children's cereal is prepared; the descriptions vary on
X1 : the amount of protein (in grams) per 2-ounce serving.
The researcher obtains consumers' interval-scaled evaluations of ten concept descriptions using a preference rating scale that ranges from 1, dislike extremely, up to 9, like extremely well.
Table 12.11 Preference Ratings of Ten Cereal Concepts Varying in Protein
Exhibit 12.3 Look at Your Data Before You Analyze
In deciding which type of regression approach to use, it is important that the researcher know the shape of the interrelationship. The shape of the interrelationship is easy to see on a scatter diagram. Looking at this visually helps decide whether the relationship is, or approximates being, linear or whether it has some other shape which would require a transformation of the data by converting to square roots or logarithms or treatment as nonlinear regression (Semon, 1993).
Examination of the data by scatter diagrams also allows the researcher to see if there are any “outliers” i.e., cases where the relationship is unusual or extreme as compared to the majority of the data points. A decision has to be made whether to retain such outliers in the data set for analysis.
When the regression line itself is included on the scatter diagram, the actual values and the values estimated by the regression formula can be compared and used to assess the estimating error. Of course, what the analyst is seeking is the regression function that best fits the data, and this fit is typically based on minimizing the sum of the squared distances between the actual and estimated values: the so-called least-squares criterion.
The equation for a linear model can be written Ŷ = a + bX, where Ŷ denotes values of the criterion predicted by the linear model; a denotes the intercept, or the value of Ŷ when X is zero; and b denotes the slope of the line, or the change in Ŷ per unit change in X.
But how do we find the numerical values of a and b? The method used in this chapter is known as least squares as discussed in Exhibit 12.3. As the reader will recall from introductory statistics, the method of least squares finds the line whose sum of squared differences between the observed values Yi and their estimated counterparts Ŷi (on the regression line) is a minimum.
Parameter Estimation
To compute the estimated parameters (a and b) of the linear model, we return to the data of Table 12.11. In the two-variable case, the formulas are relatively simple:
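The two-variable least-squares formulas are b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and a = Ȳ − bX̄. A quick sketch applies them to hypothetical (x, y) pairs standing in for Table 12.11, whose values are not reproduced here:

```python
import numpy as np

# Hypothetical (x, y) pairs standing in for Table 12.11
x = np.array([2.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 12.0])
y = np.array([2.5, 4.0, 5.5, 5.0, 7.5, 8.5, 8.0, 11.0])

# Two-variable least-squares formulas:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a = y_bar - b * x_bar
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

# numpy's degree-1 polynomial fit gives the same line
b_np, a_np = np.polyfit(x, y, 1)
print(round(a, 4), round(b, 4))
```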
Underlying least-squares computations is a set of assumptions. Although least-squares regression models need not assume normality in the (conditional) distributions of the criterion variable, this assumption is made when we test the statistical significance of the contribution of the predictor variable in explaining the variance in the criterion (does it differ from zero?). With this in mind, the assumptions of the regression model are as follows (the symbols α and β are used to denote the population counterparts of a and b):
1. For each fixed value of X we assume that a normal distribution of Y values exists. For our particular sample we assume that each Y value is drawn independently of all others. What is being described is the "classical" regression model. Modern versions of the model permit the predictors to be random variables, but their distribution is not allowed to depend on the parameters of the regression equation.
2. The means of all of these normal distributions of Y lie on a straight line with slope β.
3. The normal distributions of Y all have equal variances. This (common) variance does not depend on the values assumed by the variable X.
Y = α + β X1 + ε
where
α = mean of the Y population when X1 = 0
β = change in Y population mean per unit change in X1
ε = error term drawn independently from a normally distributed universe with zero mean (and constant variance)
The nature of these assumptions is apparent in Figure 12.7. The reader should note that each value of X has associated with it a normal curve for Y (assumption 1). The means of all these normal distributions lie on the straight line shown in the figure (assumption 2).
What if the dependent variable is not continuous? Exhibit 12.4 gives an alternative for when the dependent variable can be viewed as a categorical dichotomous variable: use logistic regression (also known as logit). The analysis proceeds generally as we are discussing it; the major change is that a transformation has been applied to the dependent-variable values.
Exhibit 12.4 When to Use Logistic Regression
Data collected for customer satisfaction research provides a good illustration of when the researcher should consider transformation of data. Typically multi-point rating scales are used to obtain customer satisfaction data. Many believe that customer satisfaction ratings obtained on rating scales are not normally distributed, but are skewed toward higher scale values (Dispensa, 1997). Thus, in practice customers do not really view customer satisfaction ratings as continuous.
Ultimately, a customer is either satisfied or not satisfied. This creates a dichotomous dependent variable. Typically those customers who rate at the upper end of the scale, say 9 or 10 on a 10-point scale, are considered satisfied while all others are considered not satisfied. If this is so, normal regression analysis is not the proper technique to use, as the dependent variable is binary, not continuous.
A binary overall customer satisfaction variable follows the logistic distribution, thus allowing for the use of logistic regression. With one or more independent variables, this technique allows a researcher to determine the extent to which an independent variable affects the prediction of a satisfied customer through the logistic regression coefficients and their associated log-odds (Dispensa, 1997). Log-odds specify the direct association between the independent variable and the dependent variable. In addition, logistic regression calculates the probability of each customer being satisfied or not.
Typically, logistic regression is used for multiple regression situations where there are two or more independent variables. But, it is suitable for bivariate situations as well. The key to its being of value is the nature of the dependent variable, not the independent variable (s).
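As a minimal sketch of the idea (not the procedure of any particular package), the following fits a two-parameter logistic model to a dichotomized satisfaction rating by maximizing the log-likelihood numerically. The data, variable names, and the 9-or-above cutoff are all hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: a service-quality predictor and a 10-point overall
# satisfaction rating, dichotomized so 9-10 counts as "satisfied" (1)
rng = np.random.default_rng(0)
service = rng.uniform(1, 10, 200)
overall = np.clip(np.round(service + rng.normal(0, 1.5, 200)), 1, 10)
satisfied = (overall >= 9).astype(float)

def neg_log_likelihood(beta):
    """Negative log-likelihood of the logistic model
    P(satisfied) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = beta
    logits = b0 + b1 * service
    # log(1 + exp(z)) computed stably via logaddexp
    return (np.logaddexp(0.0, logits) - satisfied * logits).sum()

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
b0, b1 = res.x

# b1 is the log-odds coefficient: each one-unit increase in the predictor
# multiplies the odds of being satisfied by exp(b1)
odds_multiplier = np.exp(b1)
print(round(b1, 3), round(odds_multiplier, 3))
```

A positive b1 here means higher service ratings raise the odds of classification as satisfied, which is the log-odds interpretation described above.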
Figure 12.7 Two-Variable Regression Model--Theoretical
However, functional forms other than linear may be suggested by the preliminary scatter plot. Figure 12.8 shows various types of scatter diagrams and regression lines for the two-variable case. Panel I shows the ideal case in which all the variation in Y is accounted for by variation in X1. We note that the regression line passes through the mean of each variable and that the slope b happens to be positive. The intercept a represents the predicted value of Y when X1 = 0. In Panel II we note that there is residual variation in Y and, furthermore, that the slope b is negative. Panel III demonstrates the case in which no association between Y and X1 is found. In this case the mean of Y is as good a predictor as the variable X1 (the slope b is zero). Panel IV emphasizes that a linear model is being fitted. That is, no linear association is found (b = 0), even though a curvilinear relationship is apparent from the scatter diagram. Figure 12.8 illustrates the desirability of plotting one's data before proceeding to formulate a specific regression model.
Figure 12.8 Illustrative Scatter Diagrams and Regression Lines
The measure of strength of association in bivariate regression is denoted by r² and is called the coefficient of determination. This coefficient varies between 0 and 1 and represents the proportion of the total variation in Y (as measured about its own mean Ȳ) that is accounted for by variation in X1. For regression analyses it can also be interpreted as a measure of substantive significance, as we have previously defined this concept.
If we were to use the average of the Y values (Ȳ) to estimate each separate value of Y, then a measure of our inability to predict Y would be given by the sum of the squared deviations
On the other hand, if we tried to predict Y by employing a linear regression based on X1, we could use each Ŷi to predict its counterpart Yi. In this case a measure of our inability to predict Yi is given by
*From the equation Ŷi = 0.491 + 0.886Xi1. This is the sum of squared errors in predicting Yi from Ŷi. Next, we find:
This is the sum of squared errors in predicting Yi from Ȳ. Hence,
which is the accounted-for sum of squares due to the regression of Y on X1.
Figure 12.10 (and 12.9 as well) puts all these quantities in perspective by first showing the deviations Yi − Ȳ. As noted above, the sum of these squared deviations is 76.10. Panel II shows the counterpart deviations of Yi from Ŷi; the sum of these squared deviations is 21.09. Panel III shows the deviations of Ŷi from Ȳ; the sum of these squared deviations is 55.01. We note that the results are additive: 21.09 + 55.01 = 76.10.
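This additive decomposition holds for any least-squares fit and can be verified directly. The data below are hypothetical (so the sums of squares differ from 76.10, 21.09, and 55.01), but the identity and the r² interpretation are the same:

```python
import numpy as np

# Hypothetical data; the same decomposition applies to any OLS fit
x = np.array([2.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 12.0])
y = np.array([2.5, 4.0, 5.5, 5.0, 7.5, 8.5, 8.0, 11.0])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_total = ((y - y.mean()) ** 2).sum()            # Yi about Y-bar
ss_error = ((y - y_hat) ** 2).sum()               # Yi about Y-hat
ss_regression = ((y_hat - y.mean()) ** 2).sum()   # accounted-for sum of squares

# The components are additive, and r^2 is the accounted-for proportion
r_squared = ss_regression / ss_total
print(round(ss_error + ss_regression, 6), round(ss_total, 6), round(r_squared, 4))
```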
Nonparametric Analysis
One reason for the widespread use of chi-square in cross-tabulation analysis is that most computer routines show the statistic as part of the output, or at least offer it as an option that the analyst can choose. Sometimes ordinal data are available, and such data are stronger than simple nominal measurement. In this situation other tests are more powerful than chi-square. Three regularly used tests are the Wilcoxon Rank Sum (T), the Mann-Whitney U, and the Kolmogorov-Smirnov test. Siegel (1956) and Gibbons (1993) provide more detailed discussions of these techniques.
The Wilcoxon T test is used for dependent samples in which the data are collected in matched pairs. This test takes into account both the direction of the differences within pairs of observations and the relative magnitude of the differences. The Wilcoxon matched-pairs signed-ranks test gives more weight to a pair showing a large difference between the two measurements than to a pair showing a small difference. To use this test, measurements must be at least ordinal-scaled within pairs. In addition, ordinal measurement must hold for the differences between pairs.
This test has many practical applications in marketing research. For instance, an ordinal scaling device, such as a semantic differential, can be used to measure attitudes toward, say, a bank. Then, after a special promotional campaign, the same sample would be given the same scaling device. Changes in the values of each scale could be analyzed by this Wilcoxon test.
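A sketch of that before-and-after design, with hypothetical semantic-differential scores for ten respondents (the bank example above is the scenario, but the numbers are invented):

```python
import numpy as np
from scipy import stats

# Hypothetical 1-7 attitude scores for the same 10 respondents,
# before and after the promotional campaign
before = np.array([4, 3, 5, 2, 4, 3, 4, 5, 3, 4])
after  = np.array([5, 4, 6, 4, 6, 4, 5, 6, 4, 6])

# Wilcoxon matched-pairs signed-ranks test on the paired differences;
# larger differences receive larger ranks and hence more weight
stat, p = stats.wilcoxon(before, after)
print(stat, round(p, 4))
```

With every respondent rating higher after the campaign, the test reports a significant shift; scipy may warn about tied ranks in small samples and fall back to a normal approximation.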
With ordinal measurement and two independent samples, the Mann-Whitney U test may be used to test whether the two groups come from the same population. This is a relatively powerful nonparametric test and is an alternative to the Student t-test when the analyst cannot meet the assumptions of the t-test or when measurement is at best ordinal. Both one- and two-tailed tests can be conducted. As indicated earlier, the results of U and t tests are often similar, leading to the same conclusion.
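The independent-samples case looks like this in practice; the two groups of ordinal ratings are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 ordinal ratings from two independent samples of shoppers
group_a = np.array([3, 4, 2, 5, 4, 3, 5, 4])
group_b = np.array([2, 1, 3, 2, 2, 3, 1, 2])

# Mann-Whitney U test of whether the two groups come from the same population
u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u, round(p, 4))
```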
The Kolmogorov-Smirnov two-sample test is a test of whether two independent samples come from the same population or from populations with the same distribution. The test is sensitive to any kind of difference between the distributions from which the two samples were drawn: differences in location (central tendency), dispersion, skewness, and so on. This characteristic makes it a very versatile test. Unfortunately, the test does not by itself show what kind of difference exists. There is also a Kolmogorov-Smirnov one-sample test, which is concerned with the agreement between the observed distribution of a set of sample values and some specified theoretical distribution. In this case it is a goodness-of-fit test similar to single-classification chi-square analysis.
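That sensitivity to any distributional difference can be illustrated with simulated data: two samples with the same mean but different dispersion, a difference a test of means would likely miss. The samples are simulated, not drawn from any study in the text:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Same central tendency, different dispersion
sample_1 = rng.normal(loc=50, scale=5, size=300)
sample_2 = rng.normal(loc=50, scale=15, size=300)

# K-S two-sample test: D is the maximum gap between the two empirical CDFs
d, p = stats.ks_2samp(sample_1, sample_2)
print(round(d, 3), p < 0.05)
```

The test flags the difference, but, as noted above, it does not say whether location, dispersion, or shape is responsible.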
Indexes of Agreement
Chi-square is appropriate for making statistical tests of independence in cross-tabulations. Usually, however, we are interested in the strength of association as well as the statistical significance of association. This concern is with what is known as substantive or practical significance. An association is substantively significant when it is statistically significant and of sufficient strength. Unlike statistical significance, however, there is no simple numerical value to compare against, and considerable research judgment is necessary. Although such judgment is subjective, it need not be completely arbitrary. The nature of the problem can offer some basis for judgment, and common sense can indicate that the degree of association is too low in some cases and high enough in others (Gold, 1969, p. 44).
Statisticians have devised a large number of indexes, often called indexes of agreement, for measuring the strength of association between two variables in a cross-tabulation. The main descriptors for classifying the various indexes are
1. Whether the table is 2 x 2 or larger (R x C)
2. Whether one, both, or neither of the variables has categories that obey some natural order (e.g., age, income level, family size)
3. Whether association is to be treated symmetrically or whether we want to predict membership in one variable's categories from (assumed known) membership in the other variable's categories
Space does not permit coverage of even an appreciable fraction of the dozens of agreement indexes that have been proposed. Rather, we shall illustrate one commonly used index for 2 x 2 tables and two indexes that deal with different aspects of the larger R x C (row-by-column) tables.
The 2 × 2 Case The phi correlation coefficient is a useful agreement index for the special case of 2 x 2 tables in which both variables are dichotomous. Moreover, an added bonus is the fact that phi equals the product-moment correlation (a cornerstone of multivariate methods) that one would obtain by correlating the two variables expressed in coded 0-1 form.
To illustrate, consider the 2 x 2 cross-tabulation in Table 12.15, taken from a study of shampoos. We wish to see if inclusion of the shampoo benefit “body” in the respondent’s ideal set is associated with the respondent’s indication that her hair lacks natural “body.” We first note from the table that high frequencies appear in the cells: (a) “body” included in ideal set and “no” to the question of whether her hair has enough (natural) body; and (b) “body” excluded from the ideal set and “yes” to the same question.
Table 12.15 Does Hair Have Enough Body Versus Body Inclusion in Ideal Set
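The equivalence between phi and the product-moment correlation of 0-1 coded variables can be verified directly. The cell frequencies below are a reconstruction inferred from the marginal and prediction-error counts quoted later in this section; check them against the actual Table 12.15:

```python
import numpy as np

# Reconstructed 2 x 2 frequencies (rows: "body" in ideal set yes/no;
# columns: "does your hair have enough body?" no/yes)
counts = {(1, 0): 26, (1, 1): 8, (0, 0): 17, (0, 1): 33}

# Expand to 0-1 coded observations: row variable = "body" in ideal set,
# column variable = answered "yes"
rows, cols = [], []
for (r, c), k in counts.items():
    rows += [r] * k
    cols += [c] * k
rows, cols = np.array(rows), np.array(cols)

# Phi equals the product-moment correlation of the two 0-1 variables;
# its sign depends only on which category is coded 1
phi = np.corrcoef(rows, cols)[0, 1]

# Equivalently, phi^2 * n recovers the chi-square statistic
n = len(rows)
print(round(phi, 3), round(phi ** 2 * n, 2))
```

The negative sign under this coding reflects the pattern noted above: including "body" in the ideal set goes with answering "no" about one's own hair.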
One of the most popular agreement indexes for summarizing the degree of association between two variables in a cross-tabulation of R rows and C columns is the contingency coefficient. This index is also related to chi-square and is defined as
where n is again the total sample size. From Table 12.15 we can first determine that chi-square is equal to 14.61, which, with 1 degree of freedom, is significant beyond the 0.01 level.
We can then find the contingency coefficient C as follows
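The computation can be sketched as follows, using the same reconstructed cell frequencies (verify them against the actual Table 12.15). The formula is C = sqrt(chi-square / (chi-square + n)):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Reconstructed 2 x 2 frequencies from Table 12.15
table = np.array([[26, 8],
                  [17, 33]])

n = table.sum()                                    # 84 respondents
# correction=False gives the uncorrected chi-square used in the text
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Contingency coefficient: C = sqrt(chi2 / (chi2 + n))
C = np.sqrt(chi2 / (chi2 + n))
print(round(chi2, 2), dof, round(C, 3))
```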
Both phi and the contingency coefficient are symmetric measures of association. Occasions often arise in the analysis of R x C tables (or the special case of 2 x 2 tables) where we desire an asymmetric measure of the extent to which we can reduce errors in predicting categories of one variable from knowledge of the categories of some other variable. Goodman and Kruskal's lambda-asymmetric coefficient can be used for this purpose (Goodman & Kruskal, 1954).
To illustrate the lambda-asymmetric coefficient, let us return to the cross-tabulation of Table 12.15. Suppose that we wished to predict which category (no versus yes) a randomly selected person would fall into when asked the question, "Does your hair have enough body?" If we had no knowledge of the row variable (whether that person included "body" in her ideal set or not), we would have only the column marginal frequencies to rely on.
Our best bet, given no knowledge of the row variable, is always to predict "no," the higher of the column marginal frequencies. As a consequence, we shall be wrong in 41 of the 84 cases, a probability of error of 41/84 = 0.49. Can we do better, in the sense of fewer prediction errors, if we utilize information provided by the row variable?
If we know that “body” is included in the ideal set, we shall predict “no” and be wrong in only 8 cases. If we know that “body” is not included in the ideal set, we shall predict “yes” and be wrong in 17 cases. Therefore, we have reduced our number of prediction errors from 41 to 8 + 17 = 25, a decrease of 16 errors. We can consider this error reduction relatively :
A less cumbersome (but also less transparent) formula for lambda-asymmetric is
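The error-counting logic above translates directly into code. The cell frequencies are again the reconstruction used earlier (verify against the actual Table 12.15); the function below implements the modal-category prediction rule rather than any particular package's routine:

```python
import numpy as np

# Reconstructed 2 x 2 frequencies (rows: "body" in ideal set yes/no;
# columns: "does your hair have enough body?" no/yes)
table = np.array([[26, 8],
                  [17, 33]])

def lambda_asymmetric(t):
    """Goodman-Kruskal lambda for predicting the column variable
    from the row variable, via counts of modal-category prediction errors."""
    # Without the row variable: always predict the modal column
    errors_without = t.sum() - t.sum(axis=0).max()
    # With the row variable: predict the modal cell within each row
    errors_with = sum(row.sum() - row.max() for row in t)
    return (errors_without - errors_with) / errors_without

lam_col_given_row = lambda_asymmetric(table)     # (41 - 25) / 41
lam_row_given_col = lambda_asymmetric(table.T)   # roles of the variables reversed
print(round(lam_col_given_row, 2), round(lam_row_given_col, 2))
```

The first value reproduces the 16/41 error reduction worked out above, and the transposed call gives the reversed-roles coefficient of 0.26 mentioned below.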
Lambda-asymmetric varies between zero, indicating no ability at all to eliminate errors in predicting the column variable on the basis of the row variable, and 1, indicating an ability to eliminate all errors in the column-variable predictions, given knowledge of the row variable. Not surprisingly, we could reverse the roles of the criterion and predictor variables and find lambda-asymmetric for the row variable, given the column variable. In the case of Table 12.15, this results in λ = 0.26. Note that in this case we simply reverse the roles of the row and column variables.
Finally, if desired, we could find a lambda-symmetric index via a weighted averaging of λC|R and λR|C. However, in the authors' opinion, lambda-asymmetric is of particular usefulness in the analysis of cross-tabulations because we often want to consider one variable as a predictor and the other as a criterion. Furthermore, lambda-asymmetric has a natural and useful interpretation as the percentage of total prediction errors that are eliminated in predicting one variable (e.g., the column variable) from another (e.g., the row variable).
Summary
We began by stating that data can be viewed as recorded information useful in making decisions. In the initial sections of this chapter, we introduced the basic concepts of transforming raw data into data of quality. The introduction was followed by a discussion of elementary descriptive analyses through tabulation and cross-tabulation. The focus of this discussion was heavily oriented toward how to read the data and how to interpret the results. The competent analysis of research-obtained data requires a blending of art and science, of intuition and informal insight, and of judgment and statistical treatment, combined with a thorough knowledge of the context of the problem being investigated.
The first section of the chapter dealt with cross-tabulation and chi-square analysis. This was followed by a discussion of bivariate analysis of differences in means and proportions. We next focused on the statistical machinery needed to analyze differences between groups: the t-test and one-factor and two-factor analysis of variance. These techniques are useful for both experimentally and nonexperimentally obtained data. We then looked at the process of analysis of variance in detail. A simple numerical example was used to demonstrate the partitioning of variance into among- and within-components. The assumptions underlying the various models were pointed out, and a hypothetical data experiment was analyzed to show how the ANOVA models operate.
We concluded by examining bivariate analyses of associations for interval- or ratio scaled data. The concept of associations between two variables was introduced through simple two-variable correlation. We examined the strength and direction of relationships using the scatter diagram and Pearson correlation coefficient. Several alternative (but equivalent) mathematical expressions were presented and a correlation coefficient was computed for a sample data set.
Investigations of the relationships between variables almost always involve making predictions. Bivariate (two-variable) regression was discussed as the foundation for the discussion of multivariate regression in the next chapter.
We ended the chapter with a discussion of the Spearman rank correlation as an alternative to the Pearson correlation coefficient when the data are of ordinal measurement and do not meet the assumptions of parametric methods. Also, the Goodman and Kruskal lambda measure for nominal measurement was briefly introduced, as were other nonparametric analyses.
There is a wide array of statistical techniques (parametric and nonparametric) that focus on describing and making inferences about the variables being analyzed. Some of these were shown in Table 12.4. Although somewhat dated, a useful reference for selecting an appropriate statistical technique is the guide published by the Institute for Social Research at the University of Michigan (Andrews et al., 1981) and its corresponding software, Statistical Consultant. Fink (2003, pp. 78-80) presents a summary table of which technique to use under which condition.
References
Andrews, F. M., Klem, L., Davidson, T. N., O’Malley, P. M., & Rodgers, W. L. (1981). A guide for selecting statistical techniques for analyzing social science data (2nd ed.). Ann Arbor: Institute for Social Research, University of Michigan.
Feick, L. F. (1984, November). Analyzing marketing research data with associated models. Journal of Marketing Research, 21, 376–386.
Fink, A. (2003). How to manage, analyze, and interpret survey data. Thousand Oaks, CA: Sage.
Gibbons, J. D. (1993). Nonparametric statistics: An introduction. Newbury Park, CA: Sage.
Gold, D. (1969, February). Statistical tests and substantive significance. American Sociologist, 4, 44.
Goodman, L. A., & Kruskal, W. H. (1954, December). Measures of association for cross classification. Journal of the American Statistical Association, 49, 732–764.
Hellevik, O. (1984). Introduction to causal analysis: Exploring survey data. Beverly Hills, CA: Sage.
Lewis-Beck, M. S. (1995). Data analysis: An introduction. Thousand Oaks, CA: Sage.
Semon, T. T. (1999, August 2). Use your brain when using a chi-square. Marketing News, 33, 6.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Smith, S., & Albaum, G. (2005). Fundamentals of marketing research. Thousand Oaks, CA: Sage.
Zeisel, H. (1957). Say it with figures (4th ed.). New York: Harper & Row.