Scientific research is directed at the inquiry and testing of alternative explanations of what appears to be fact. For behavioral researchers, this scientific inquiry translates into a desire to ask questions about the nature of relationships that affect behavior within markets. It requires a willingness to formulate hypotheses capable of being tested to determine (1) what relationships exist, and (2) when and where these relationships hold.
The first stage in the analysis process includes editing, coding, and making initial counts of responses (tabulation and cross-tabulation). In the current chapter, we extend this first stage to include the testing of relationships, the formulation of hypotheses, and the making of inferences.
In formulating hypotheses the researcher uses “interesting” variables, and considers their relationships to each other, to find suggestions for working hypotheses that may or may not have been originally considered. In making inferences, conclusions are reached about the variables that are important, their parameters, their differences, and the relationships among them. A parameter is a summarizing property of a collectivity—such as a population—when that collectivity is not considered to be a sample (Mohr, 1990, p.12).
Although the sequence of procedures (a) formulating hypotheses, (b) making inferences, and (c) estimating parameters is logical, in practice these steps tend to merge and do not always follow in order. For example, the initial results of the data analysis may suggest additional hypotheses that in turn require more and different sorting and analysis of the data. Similarly, not all of the steps are required in every project; the study may be exploratory in nature, which means that it is designed more to formulate the hypotheses to be examined in a more extensive project than to make inferences or estimate parameters.
An Overview Of The Analysis Process
The overall process of analyzing and making inferences from sample data can be viewed as a process of refinement that involves a number of separate and sequential steps that may be identified as part of three broad stages:

1. Tabulation: identifying appropriate categories for the information desired, sorting the data by categories, making the initial counts of responses, and using summarizing measures to provide economy of description and thereby facilitate understanding.

2. Formulating additional hypotheses: using the inductions derived from the data concerning the relevant variables, their parameters, their differences, and their relationships to suggest working hypotheses not originally considered.

3. Making inferences: reaching conclusions about the variables that are important, their parameters, their differences, and the relationships among them.
The Data Tabulation Process
Seven steps are involved in the process of data tabulation:

1. Categorize. Define appropriate categories for coding the information collected.

2. Edit and Code Data. Assign codes to the respondents' answers.

3. Create the Data File. Enter the data into the computer and create a data file.

4. Error Checking and Handling Missing Data. Check the data file for errors by performing a simple tabulation analysis to identify errors in coding or data entry. Once errors are identified, data may be edited or recoded to collapse, combine, or delete responses or categories.

5. Generate New Variables. New variables may be computed by data manipulations that multiply, sum, or otherwise transform variables.

6. Weight Data Subclasses. Weights are often used to adjust the proportionate representation of sample subgroups so that they match the proportions found in the population.

7. Tabulate. Summarize the responses to each variable included in the analysis.
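The counting and weighting steps above can be sketched in a few lines of Python. This is only an illustration: the response codes and the subgroup weight used here are hypothetical.

```python
from collections import Counter

# Hypothetical coded responses for one question (1 = "agree" ... 5 = "disagree")
responses = [1, 2, 2, 3, 1, 5, 2, 4, 3, 2]

# Step 7: tabulate -- count the number of responses in each category
tabulation = Counter(responses)

# Step 6: weight a subclass -- suppose respondents in category 5 are
# under-represented and should count double (illustrative weight only)
weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 2.0}
weighted = {cat: n * weights[cat] for cat, n in tabulation.items()}

print(tabulation[2])   # 4 responses coded 2
print(weighted[5])     # 1 response * weight 2.0 = 2.0
```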
As simple as these steps are from a technical standpoint, data management is most important in assuring a quality analysis and thereby merits an introductory discussion. A more in-depth discussion of survey-based data management is provided by Fink (2003, Chap. 1).
Defining Categories
The raw input to most data analyses consists of the basic data matrix, as shown in Table 11.1. In most data matrices, each row contains a respondent’s data and the columns identify the variables or data fields collected for the respondent. The analyses of a column of data might include a tabulation of data counts in each of the categories or the computation of the mean and standard deviation. This analysis is often done simply because we want to summarize the meaning of the entire column of values. In so doing we often (willingly) forgo the full information provided by the data in order to understand some of its basic characteristics, such as central tendency, dispersion, or categories of responses. Because we summarize the data and make inferences from it, it is doubly important that the data be accurate.
Tabulation of any sizable array of data often requires that responses be grouped into categories or classes. The identification of response categories early in the study has several advantages. Ideally, it forces the analyst to consider all possible interpretations and responses to the questionnaire. It often leads to improvements in the questionnaire or observation forms. It permits more detailed instruction of interviewers and results in higher consistency in interpreting responses. Editing problems are also reduced.
The definition of categories allows for identification of the database columns and values assigned to each question or variable, and indicates the values assigned to each response alternative. Depending on the data collection method, data code sheets can be prepared and precoded. Data files are often formatted as comma-separated values (CSV) files, meaning that each variable appears in the same relative position for each respondent, with a comma separating the variables. The major data analysis software programs read data files and then display them in a spreadsheet-like database (see Table 11.1). Often the data are entered directly into a Microsoft Excel spreadsheet for import into the statistical program to be used for analysis. Where data are not collected and formatted electronically, pre-coding of printed questionnaires eliminates transcription, thereby decreasing both processing errors and costs. Most of today's computer-based software for telephone (CATI) or internet surveys (Qualtrics.com) automates this entire process. These programs not only define the question categories in the database but also automatically build the database and record the completed responses as they are submitted. The data may then be analyzed online, exported to Microsoft Excel™, or imported into a dedicated statistical analysis program such as PASW (formerly known as SPSS). Response categories are coded from 1 for the first category to the highest value for the last category. Category values can be recoded to assign different numbers as desired by the researcher.
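As an illustration, a pre-coded CSV data file of the kind described can be read with Python's standard csv module. The variable names and codes below are hypothetical, not taken from Table 11.1.

```python
import csv
import io

# Hypothetical pre-coded data: one row per respondent,
# columns are the variables defined in the codebook
raw = "resp_id,q1,q2\n1,2,5\n2,3,1\n3,2,4\n"

# DictReader maps each row to {column name: value}
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

print(len(rows))        # 3 respondents
print(rows[0]["q1"])    # first respondent's code for q1 -> "2"
```

In practice the same call would read a file object opened with `open(...)` rather than an in-memory string.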
As desirable as the early definition of categories is, it can sometimes only be done after the data have been collected. This is usually the case when open-end text questions, unstructured interviews, and projective techniques are used.
The selection of categories is controlled by both the purposes of the study and the nature of the responses. Useful classifications meet the following conditions:

1. Similarity of response within the category. Each category should contain responses that, for purposes of the study, are sufficiently similar that they can be considered homogeneous.

2. Differences of responses between categories. Differences in category descriptions should be great enough to disclose any important distinctions in the characteristic being examined.

3. Mutually exclusive categories. There should be an unambiguous description of categories, defined so that any response can be placed in only one category.

4. Exhaustive categories. The classification schema should provide categories for all responses.
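Conditions 3 and 4 can be enforced mechanically when the categories are numeric ranges: a sorted list of boundaries defines classes that are mutually exclusive and exhaustive by construction. A minimal sketch, using hypothetical age brackets:

```python
import bisect

# Hypothetical age brackets; the boundaries define mutually exclusive,
# exhaustive categories: every age falls in exactly one class
boundaries = [18, 35, 55]                       # cut points
labels = ["under 18", "18-34", "35-54", "55 and over"]

def categorize(age):
    # bisect_right finds the one class the value belongs to
    return labels[bisect.bisect_right(boundaries, age)]

print(categorize(17))   # under 18
print(categorize(35))   # 35-54
```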
The use of extensive open-end questions often provides rich contextual and anecdotal information, but is a practice often associated with fledgling researchers. Open-end questions, of course, have their place in marketing research. However, the researcher should be aware of the inherent difficulties in questionnaire coding and tabulation, not to mention their tendency to be more burdensome to the respondent. All of this is by way of saying that any open-end question should be carefully checked to see if a closed-end question (i.e., check the appropriate box) can be substituted without doing violence to the intent of the question. Obviously, sometimes this substitution should not be made.
Editing and Coding
Editing is the process of reviewing the data to ensure maximum accuracy and clarity. This applies to the editing of the collection forms used for pretesting as well as those for the full-scale project. Careful editing during the pre-test process will often catch misunderstandings of instructions, errors in recording, and other problems so as to eliminate them for the later stages of the study. Early editing has the additional advantage of permitting the questioning of interviewers while the material is still relatively fresh in their minds. Obviously, this has limited application for printed questionnaires, though online or CATI surveys can be edited even when data is being collected.
Editing is normally centralized so as to ensure consistency and uniformity in treatment of the data. If the sample is not large, a single editor usually edits all the data to reduce variation in interpretation. In those cases where the size of the project makes the use of more than one editor mandatory, it is usually best to assign each editor a different portion of the data collection form to edit. In this way the same editor edits the same items on all forms, an arrangement that tends to improve both consistency and productivity.
Typically, interviewer and respondent data are monitored to ensure that data requirements are fulfilled. Each collection form should be edited to ensure that data quality requirements are fulfilled. Regarding data obtained by an interviewer (and to an extent self-report), the following should be specifically evaluated:

1. Legibility of entries. Obviously the data must be legible in order to be used. Where an entry is not legible, it may be possible to infer the response from other data collected; however, where any real doubt exists about the meaning of the data, it should not be used.

2. Completeness of entries. On a fully structured collection form, the absence of an entry is ambiguous. It may mean that the respondent could not or would not provide the answer, that the interviewer failed to ask the question, or that there was a failure to record collected data.

3. Consistency of entries. Inconsistencies raise the question of which response is correct. (If a respondent family is indicated as being a non-watcher of game shows, for example, and a later entry indicates that they watched Wheel of Fortune twice during the past week, an obvious question arises as to which is correct.) Discrepancies may be cleared up by questioning the interviewer or by making callbacks to the respondent. When discrepancies cannot be resolved, discarding both entries is usually the wisest course of action.

4. Accuracy of entries. An editor should keep an eye out for any indication of inaccuracy in the data. Of particular importance is the detection of any repetitive response patterns in the reports of individual interviews. Such patterns may well be indicative of systematic interviewer or respondent bias or dishonesty.
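The consistency check in item 3 is easy to automate once the data are in a file. A sketch using hypothetical records and field names (the game-show example above):

```python
# Hypothetical respondent records: flag the inconsistency described above
# (a self-reported non-watcher with game-show viewing entries)
records = [
    {"id": 1, "watches_game_shows": "no",  "episodes_last_week": 2},
    {"id": 2, "watches_game_shows": "yes", "episodes_last_week": 3},
    {"id": 3, "watches_game_shows": "no",  "episodes_last_week": 0},
]

# Respondents whose entries contradict each other
inconsistent = [r["id"] for r in records
                if r["watches_game_shows"] == "no" and r["episodes_last_week"] > 0]

print(inconsistent)   # [1] -- queue for callback, or discard both entries
```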
Coding is the process of assigning respondents' answers to data categories; numbers are assigned to identify the answers with their categories. Pre-coding refers to the practice of assigning codes to categories. Sometimes these codes are printed on structured questionnaires and observation forms before the data are collected. Using these predefined codes, the interviewer is able to code each response while interpreting it and marking the category into which it should be placed.
Post-coding is the assignment of codes to responses after the data are collected, and is most often required when responses are reported in an unstructured format (open-ended text or numeric input). Careful interpretation and good judgment are required to ensure that the meaning of the response and the meaning of the category are consistently and uniformly matched.
When not using CATI or online data collection technologies, a formal coding manual or codebook is often created and made available to those who will be entering or analyzing the data. The codebook used for a study of supermarkets in the United States is shown in Figure 11.1 as an illustration.
Like good questionnaire construction, good coding requires training and supervision. The editor-coder should be provided with written instructions, including examples. He or she should be exposed to the interviewing of respondents and become acquainted with the process and problems of collecting the data, thus providing aid in its interpretation. The coder also should be aware of the computer routines that are expected to be applied, insofar as they may require certain kinds of data formats.
Whenever possible (and when cost allows) more than one person should do the coding, specifically the post-coding. By comparing the results of the various coders, a process known as determining inter-coder reliability, any inconsistencies can be brought out. In addition to the obvious objective of eliminating data coding inconsistencies, the need for recoding sometimes points to the need for additional categories for data classification and may sometimes mean that there is a need to combine some of the categories. Coding is an activity that should not be taken lightly. Improper coding leads to poor analyses and may even constrain the types of analysis that can be completed.
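Inter-coder reliability is often summarized with Cohen's kappa, which corrects the coders' observed agreement for the agreement expected by chance. A minimal stdlib sketch, with hypothetical category assignments from two coders:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Inter-coder reliability: observed agreement corrected for chance."""
    n = len(coder_a)
    # Proportion of items on which the two coders agree
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal category frequencies
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical post-coding of ten open-end answers by two coders
a = [1, 2, 2, 3, 1, 1, 2, 3, 3, 2]
b = [1, 2, 2, 3, 1, 2, 2, 3, 1, 2]
print(round(cohens_kappa(a, b), 3))   # 0.692
```

Values near 1 indicate consistent coding; low values signal that the category definitions or coder instructions need revision.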
Qualtrics has an interesting feature that uses a “wizard” to take a survey by selecting random choices and following the various logic paths available. The resulting test data conforms to the “sample size” specified by the researcher and the pre-specified logic and coding can be checked for errors that the researcher has made.
Tabulation may be thought of as the final step in the data collection process and the first step in the analytical process. Tabulation is simply the counting of the number of responses in each data category (often a single column of the data matrix contains the responses to all categories).
The most basic is the simple tabulation, often called the marginal tabulation and familiar to all students of elementary statistics as the frequency distribution. A simple tabulation or distribution consists of a count of the number of responses that occur in each of the data categories that comprise a variable. A cross-tabulation involves the simultaneous counting of the number of observations that occur in each of the data categories of two or more variables. An example is given in Table 11.2. We shall examine the use of cross-tabulations in detail later in the chapter. A cross-tabulation is one of the more commonly employed and useful forms of tabulation for analytical purposes.
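Because a cross-tabulation is simply a simultaneous count over two variables, it can be sketched with a counter over value pairs. The patron data below are hypothetical:

```python
from collections import Counter

# Hypothetical data matrix extract: (patron_type, rating) for each respondent
pairs = [("first-time", "high"), ("repeat", "high"), ("repeat", "low"),
         ("first-time", "low"), ("repeat", "high"), ("first-time", "high")]

# Simultaneous count of observations in each cell of the two-variable table
crosstab = Counter(pairs)

print(crosstab[("repeat", "high")])   # 2
```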
The flexibility and ease of conducting computer analysis increase the importance of planning the tabulation analysis. There is a common tendency for the researcher to decide that, because cross-tabulations (and correlations) are so easily obtained, large numbers of tabulations should be run. Not only is this methodologically unsound, but in commercial applications it is often costly in analyst time as well. For 50 variables, for example, there are 1,225 different two-variable cross-tabulations that can be made. Only a few of these are potentially of interest in a typical study.
Formulating Hypotheses
As a beginning point in the discussion of hypothesis testing, we ask: what is a hypothesis? A hypothesis is an assertion that variables (measured concepts) are related in a specific way such that this relationship explains certain facts or phenomena. From a practical standpoint, hypotheses may be developed to solve a problem, answer a question, or imply a possible course of action. Outcomes are predicted if a specific course of action is followed. Hypotheses must be empirically testable. A hypothesis is often stated as a research question when reporting either the purpose of the investigation or the findings. The hypothesis may be stated informally as a research question, or more formally as an alternative hypothesis, or in a testable form known as a null hypothesis. The null hypothesis makes a statement that no difference exists (see Pyrczak, 1995, pp. 75-84).
Research questions state in layman's terms the purpose of the research, the variables of interest, and the relationships to be examined. Research questions are not empirically testable, but aid in the important task of directing and focusing the research effort. To illustrate, a sample research question is developed in the following scenario:
Exhibit 11.1 Development of a Research Question for Mingles
Mingles is an exclusive restaurant specializing in seafood prepared with a light Italian flair. Barbara C., the owner and manager, has attempted to create an airy contemporary atmosphere that is conducive to conversation and dining enjoyment. In the first three months, business has grown to about 70 percent of capacity during dinner hours.
Barbara wants to track customer satisfaction with the Mingles concept, the quality of the service, and the value of the food for the price paid. To implement the survey, a questionnaire was developed using a five-point expectations scale with items scaled as values from -2 to +2. Among the hypotheses to be tested are:

1. Comparing two sample groups: H0: There is no difference in the value of the food for the price paid as perceived by first-time patrons and repeat patrons. This is tested by a t-test of the difference in means between two patron groups.

2. Predicting intention to return to Mingles: H0: The perceived quality of service is not related to the likelihood of returning to Mingles. This is a regression analysis problem that uses quality of service to predict likelihood of returning to Mingles.
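The first hypothesis could be tested with a pooled-variance two-sample t statistic. A stdlib sketch; the scores below are hypothetical, not actual Mingles data:

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(x, y):
    """Pooled-variance t statistic for H0: no difference in group means."""
    nx, ny = len(x), len(y)
    # Pooled estimate of the common variance
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

# Hypothetical -2..+2 value-for-price scores for the two patron groups
first_time = [1, 0, 2, 1, -1, 1, 0, 2]
repeat     = [2, 1, 2, 2, 1, 0, 2, 1]

t = two_sample_t(first_time, repeat)
# Compare |t| with the tabled critical value for df = 14 at alpha = .05 (2.145)
print(round(t, 2))   # -1.39
```

Since |t| is below the critical value here, H0 would not be rejected for these illustrative data.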
Research Question
Purpose: Express the purpose of the research.
Example: What is the perception of Mingles customers regarding the price-value of the food?
Decision: None used.

Alternative Hypothesis
Purpose: The alternative hypothesis states the specific nature of the hypothesized relationship, i.e., that there is a difference. The alternative hypothesis is the opposite of the null hypothesis. The alternative hypothesis cannot be falsified, because a relationship hypothesized to exist may not have been verified but may in truth exist in another sample. (You can never reject an alternative hypothesis unless you test the population on all possible samples.)
Example: Mingles is perceived as having superior food value for the price when compared to the average evaluation.
Decision: Not tested, because we cannot reject. We may only accept that a relationship exists.

Null Hypothesis
Purpose: The null hypothesis is testable in the sense that the hypothesized lack of relationship can be tested. If a relationship is found, the null hypothesis is rejected. The null hypothesis states that there is no difference between groups (with respect to some variable) or that a given variable does not predict or otherwise explain an observed phenomenon, effect, or trend.
Example: There is no difference in perceived food value for the price for Mingles and the average evaluation.
Decision: We may reject a null hypothesis (find a relationship). We may only tentatively accept that no relationship exists.
The objectives and hypotheses of the study should be stated as clearly as possible and agreed upon at the outset. Objectives and hypotheses shape and mold the study; they determine the kinds of questions to be asked, the measurement scales for the data to be collected, and the kinds of analyses that will be necessary. However, a project will usually turn up new hypotheses, regardless of the rigor with which it was planned and developed. New hypotheses are continually suggested as the project progresses from data collection through the final interpretation of the findings.
In Chapter 2 it was pointed out that when the scientific method is strictly followed, hypothesis formulation must precede the collection of data. This means that according to the rules for proper scientific inquiry, data suggesting a new hypothesis should not be used to test it. New data must be collected prior to testing a new hypothesis.
In contrast to the strict procedures of the scientific method, where hypotheses formulation must precede the collection of data, actual research projects almost always formulate and test new hypotheses during the project. It is both acceptable and desirable to expand the analysis to examine new hypotheses to the extent that the data permit. At one extreme, it may be possible to show that the new hypotheses are not supported by the data and that no further investigation should be considered. At the other extreme, a hypothesis may be supported by both the specific variables tested and by other relationships that give similar interpretation. The converging results from these separate parts of the analysis strengthen the case that the hypothesized relationship is correct. Between these extremes of nonsupport-support are outcomes of indeterminacy: the new hypothesis is neither supported nor rejected by the data. Even this result may indicate the need for an additional collection of information.
In a position yet further removed from the scientific method, Selvin and Stuart (1966) convincingly argue that in survey research it is rarely possible to formulate precise hypotheses independently of the data. This means that most survey research is essentially exploratory in nature. Rather than having a single pre-designated hypothesis in mind, the analyst often works with many diffuse variables that provide a slightly different approach and perspective on the situation and problem. The added cost of an extra question is so low that the same survey can be used to investigate many problems without increasing the total cost. However, researchers must resist the syndrome of "just one more question". Often, the one more question escalates into many more questions of the type "it would be nice to know", which can be unrelated to the research objectives.
In a typical survey project, the analyst may alternate between searching the data (analyzing) and formulating hypotheses. Obviously, there are exceptions to all general rules and phenomena. Selvin and Stuart (1966), therefore, designate three practices of survey analysts:

1. Snooping. The process of searching through a body of data and looking at many relations in order to find those worth testing (that is, there are no pre-designated hypotheses).

2. Fishing. The process of using the data to choose which of a number of pre-designated variables to include in an explanatory model.

3. Hunting. The process of testing from the data all of a pre-designated set of hypotheses.
This investigative approach is reasonable for basic research but may not be practical for decisional research. Time and resource pressures seem to require that directed problem solving be the focus of decision research. Rarely can the decision maker afford the luxury of dredging through the data to find all of the relationships that may be present. Again, it simply reduces to the question of cost versus value.
Making Inferences
Testing hypotheses is the broad objective that underlies all decisional research. Sometimes the population as a whole can be measured and profiled in its entirety. Often, however, we cannot measure everyone in the population but instead must estimate the population using a sample of respondents drawn from the population. In this case we estimate the population “parameters” using the sample “statistics”. Thus, in both estimation and hypothesis testing, inferences are made about the population of interest on the basis of information from a sample.
We often will make inferences about the nature of the population and ask a multitude of questions, such as: Does the sample's mean satisfaction differ from the mean of the population of all restaurant patrons? Does the magnitude of the observed differences between categories indicate that actual differences exist, or are they the result of random variations in the sample?
In other studies, it may be sufficient to simply estimate the value of certain parameters of the population, such as the amount of our product used per household, the proportion of stores carrying our brand, or the preferences of housewives concerning alternative styles or package designs of a new product. Even in these cases, however, we would want to know about the underlying associated variables that influence preference, purchase, or use (color, ease of opening, accuracy in dispensing the desired quantity, comfort in handling, etc.), and if not for purposes of the immediate problem, then for solving later problems. In yet other case studies, it might be necessary to analyze the relationships between the enabling or situational variables that facilitate or cause behavior. Knowledge of these relationships will enhance the ability to make reliable predictions, when decisions involve changes in controllable variables.
The Relationship Between a Population, a Sampling Distribution, and a Sample
In order to simplify the example, suppose there is a population consisting of only five persons. On a specific topic, these five persons have a range of opinions that are measured on a 7-point scale ranging from very strongly agree to very strongly disagree. The frequency distribution of the population is shown in the bar chart of Figure 11.2.
Exhibit 11.2 Population, Sample, and Sampling Distribution
1. A Type I error occurs when we incorrectly conclude that a difference exists. This is expressed as α, the probability that we will incorrectly reject H0, the null hypothesis (sometimes called the hypothesis of no difference).

2. A Type II error occurs when we accept a null hypothesis when it is in reality false (we find no difference when a difference really does exist).

3. Confidence level: we correctly retain the null hypothesis (we could also say it is tentatively accepted, or that it could not be rejected). This is equal to the area under the normal curve less the area occupied by α, the significance level.

4. The power of the test is the ability to reject the null hypothesis when it should be rejected (when false). To increase power, researchers may choose an α of .10; alternatively, sample size may be increased. Increasing sample size is the preferred option for most market researchers.
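The trade-off in item 4 can be illustrated with a normal-approximation power calculation for a two-sided test of a mean. The effect size, standard deviation, and sample sizes below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(effect, sd, n, alpha=0.05):
    """Approximate power of a two-sided z test of a mean (normal theory)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # e.g. 1.96 for alpha = .05
    shift = effect / (sd / sqrt(n))          # true effect in standard-error units
    # Probability the test statistic falls in either rejection region
    return 1 - z.cdf(z_crit - shift) + z.cdf(-z_crit - shift)

# Hypothetical: detect a 0.2-point shift on a 5-point scale, sd = 1
print(round(power_two_sided(0.2, 1.0, 100), 2))   # modest power with n = 100
print(round(power_two_sided(0.2, 1.0, 400), 2))   # larger n raises power
```

Running the calculation at several candidate sample sizes before fieldwork shows directly how much sample the stated numerical objectives require.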
Selecting Tests Of Statistical Significance
Up to this point, we have considered data analysis at a descriptive level. It is now time to introduce ways to test whether the association observed is statistically significant. In many cases this involves testing hypotheses concerning group means. These tests are performed on interval or ratio data using what are known as "parametric tests," which include such techniques as the F, t, and z tests. Often, however, we have only nominal or loosely ordinal data and are not able to meet the rigid assumptions of a parametric test. Cross-tabulation analysis with the χ2 test is often used for hypothesis testing in these situations. The χ2 statistic is from the family of nonparametric methods.
Nonparametric methods are often called distribution-free methods because the inferences are based on a test statistic whose sampling distribution does not depend upon the specific distribution of the population from which the sample is drawn (Gibbons, 1993, p. 2). Thus, the methods of hypothesis testing and estimation are valid under much less restrictive assumptions than classical parametric techniques, which require, for example, independent random samples drawn from normal distributions with equal variances and interval-level measurement. Nonparametric techniques are appropriate for many marketing applications, where measurement is often at an ordinal or nominal level.
There are many parametric and nonparametric tests. The one that is appropriate for analyzing a set of data depends on (1) the level of measurement of the data, (2) the number of variables involved, and (3) for multiple variables, how they are assumed to be related.
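As a sketch of a nonparametric test, the χ2 test of independence can be computed directly from a 2x2 cross tabulation. The counts below are hypothetical:

```python
# Chi-square test of independence on a 2x2 cross tabulation.
# Hypothetical counts: patron type (rows) by satisfaction (columns).
table = [[30, 20],
         [10, 40]]

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

# Sum of (observed - expected)^2 / expected over the four cells,
# where expected = row total * column total / n under independence
chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(2) for j in range(2))

# Tabled critical value for df = (2-1)(2-1) = 1 at alpha = .05 is 3.841
print(round(chi2, 2), chi2 > 3.841)   # 16.67 True -- reject independence
```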
The appropriate statistic depends on the level of measurement (nominal, ordinal, or interval/parametric):

Measure of central tendency: Mode (nominal); Median (ordinal); Mean (interval).

Measure of dispersion: None (nominal); Percentile (ordinal); Standard deviation (interval).

One-sample test of statistical significance: Binomial test (nominal); Kolmogorov-Smirnov one-sample test or one-sample runs test (ordinal); t-test or z-test (interval).
Exhibit 11.4 Ignoring Statistical Power
Much has been written about product failure, and about the failure of advertising campaigns. But, except in rare instances, very little has been said about research failure, or research that leads to incorrect conclusions. Yet research failure can occur even when a study is based on an expertly designed questionnaire, good field work, and sophisticated analysis. The flaw may be inadequate statistical power.
Mistaking chance variation for a real difference is one risk, called Type I error, and the 95% criterion largely eliminates it. By doing so, we automatically incur a high risk of Type II error: mistaking a real difference for chance variation. The ability of a sample to guard against Type II error is called statistical power.
1. The concept is more complicated than statistical significance or confidence limits.

2. Consideration of statistical power would often indicate that we need larger (more costly) samples than we are now using.

3. Numerical objectives must be specified before a research budget is fixed.
Parametric And Non-Parametric Analysis
Figure 11.5 Sampling Distribution of the Mean (μx̄)
If we expand this analysis to construct a confidence interval around the mean of a sample rather than the mean of a population, we must rely on a sampling distribution to define our normal distribution. The sampling distribution is defined by the means of all possible samples of size n. Recall that the population mean μ is also the mean of the normally distributed sampling distribution. Whereas μ ± zσ describes the confidence interval for a population, the value x̄ ± z(s/√n) describes the confidence interval for the sampling distribution. This is the probability that the specified area around the sample mean covers the population mean. It is interesting to note that because n, the sample size, is included in the computation of the "standard error," we may estimate the population mean with any desired degree of precision, simply by having a large enough sample size.
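The interval x̄ ± z(s/√n) can be sketched with the standard library; the sample values below are hypothetical, and s is assumed to estimate the population standard deviation:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample; 95% confidence interval for the population mean
sample = [4.1, 3.8, 4.5, 4.0, 3.9, 4.2, 4.4, 3.7, 4.3, 4.1]

x_bar, s, n = mean(sample), stdev(sample), len(sample)
z = NormalDist().inv_cdf(0.975)       # 1.96 for 95% confidence
half_width = z * s / sqrt(n)          # the standard error shrinks as n grows

print(round(x_bar - half_width, 2), round(x_bar + half_width, 2))
```

Quadrupling n would halve the interval width, which is the precision argument made above.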
Univariate Hypothesis Testing of Means
Where the Population Variance is Known
Researchers often desire to test a sample mean to determine if it is the same as the population mean. The z statistic describes probabilities of the normal distribution and is the appropriate tool to test the difference between μ, the mean of the sampling distribution, and x̄, the sample mean, when the population variance is known. The z statistic may, however, be used only when the following conditions are met:
1. Individual items in the sample must be drawn in a random manner.

2. The population must be normally distributed. If this is not the case, the sample must be large (>30), so that the sampling distribution is normally distributed.

3. The data must be at least interval scaled.

4. The variance of the population must be known.
1. The null hypothesis (H0) is specified: there is no difference between μ and x̄; any observed difference is due solely to sample variation.

2. The alpha risk (Type I error) is established (usually .05).

3. The z value is calculated by the appropriate z formula: z = (x̄ - μ) / (σ/√n).

4. The probability of the observed difference having occurred by chance is determined from a table of the normal distribution (Appendix B, Table B-1).

5. If the probability of the observed difference having occurred by chance is greater than the alpha used, then H0 cannot be rejected and it is concluded that the sample mean is drawn from a sampling distribution of the population having mean μ.
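The five steps can be sketched as follows. The sample mean, assumed population mean, σ, and n are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu, sigma, n, alpha=0.05):
    """Steps 2-5 above: compute z, look up the probability, compare to alpha."""
    z = (x_bar - mu) / (sigma / sqrt(n))        # step 3: the z formula
    p = 2 * (1 - NormalDist().cdf(abs(z)))      # step 4: two-tailed probability
    return z, p, p < alpha                      # step 5: reject H0?

# Hypothetical: sample mean 4.3 vs. assumed population mean 4.0,
# known sigma = 1.2, n = 100
z, p, reject = z_test(4.3, 4.0, 1.2, 100)
print(round(z, 2), round(p, 4), reject)   # 2.5 0.0124 True
```

Here NormalDist().cdf plays the role of the table of the normal distribution.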
With a probability: P(t = 2.00, df = n - 1 = 224) = .951 (two-tailed test); 1 - P(t = 2.00) = .049

Confidence interval: 1.96(s/√n) = 1.176
We therefore reject the null hypothesis. Figure 11.7 shows these results.
As n becomes large (≥30), the t distribution approaches the normal distribution, which is reached at the limit (n = ∞). The t-statistic is widely used in both univariate and bivariate market research analyses, due to its relaxed assumptions relative to the z-statistic, as follows:
1. Individual items in the sample are drawn at random.

2. The population must be normally distributed. If not, the sample must be large (>30).

3. The data must be at least interval scaled.

4. The population variance is not known exactly, but is estimated by the variance of the sample.
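Under these conditions the t statistic replaces σ with the sample standard deviation s. A sketch with hypothetical satisfaction scores tested against an assumed mean:

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """t statistic when sigma is unknown and estimated by s (condition 4)."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / sqrt(n))

# Hypothetical scores tested against an assumed population mean of 3.0
scores = [3.4, 3.1, 2.9, 3.6, 3.2, 3.5, 3.0, 3.3]

t = one_sample_t(scores, 3.0)
# With df = 7, compare |t| to the tabled critical value (2.365 at alpha = .05)
print(round(t, 2))   # 2.89 -- reject H0 for these illustrative data
```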
1. Determining the significance of sample deviations from an assumed theoretical distribution; that is, does a certain model fit the data? This is typically called a goodness-of-fit test.

2. Determining the significance of the observed associations found in the cross tabulation of two or more variables. This is typically called a test of independence (discussed in Chapter 12).
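A goodness-of-fit test of the first kind can be sketched with the standard library. The observed counts and the assumed uniform model are hypothetical:

```python
# Goodness-of-fit: do the observed counts match an assumed uniform model?
# Hypothetical: 120 respondents choosing among four package designs.
observed = [36, 28, 25, 31]
expected = [sum(observed) / 4] * 4     # uniform model: 30 per design

# Sum of (observed - expected)^2 / expected over the categories
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Tabled critical value for df = 4 - 1 = 3 at alpha = .05 is 7.815
print(round(chi2, 2), chi2 > 7.815)   # 2.2 False -- the model fits
```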