Thursday, May 18, 2023

General Concepts Of Measurement and Scaling

Questions focus on the problem we are trying to solve, while answers are more closely associated with the measurement scale we use to carry out our analysis.


Survey research, as a source of marketing information, addresses many topics of practical interest, including concept testing for new products, corporate image measurement, ad copy evaluation, purchase intentions, customer satisfaction, and so forth. Regardless of the research topic, useful data is obtained only when the researcher exercises care in making procedural decisions such as:

1. Defining what is to be measured

2. Deciding how to make the measurements

3. Deciding how to conduct the measuring operations

4. Deciding how to analyze the resulting data


Definitions and decisions play a significant role in scientific inquiry, especially in marketing research and the behavioral sciences.

In the first section of this chapter, we focus on conceptual and operational definitions and their use in research. Increasingly, behavioral scientists are paying greater attention to defining the concepts measured in their specific disciplines, and refining operational definitions that specify how to measure and quantify the variables defining those concepts.

In the next section we discuss measurement scales and their relationship to the interpretation of statistical techniques. This section serves as useful background for the discussion of statistical techniques covered in later chapters. We then discuss the pragmatics of writing good questions.

The overall quality of a research project depends not only on the appropriateness and adequacy of its research design and sampling techniques, but also on the measurement procedures used. The third section of this chapter looks at measurement error and how we may control the reliability and validity of these measurements.


Definitions in Marketing Measurement

Marketers measure marketing program success as increased brand awareness, ad awareness, ratings of brand likeability and uniqueness, new product concept ratings and purchase intent, and customer satisfaction (Morgan, 2003). Researchers will often model these constructs.

Models are representations of reality and therefore raise the fundamental question of how well each model represents reality on all significant issues. The quality of a model is judged against the criteria of validity and utility. Validity refers to a model’s accuracy in describing and predicting reality, whereas utility refers to the value it adds to the making of decisions. A sales forecasting model that does not forecast sales with reasonable accuracy is probably worse than no sales forecasting model at all.

Model quality also depends on completeness and validity, two drivers of model accuracy. Managers should not expect a model to make decisions for them; instead, models should be viewed as one additional piece of information to help make decisions.

Clearly, managers will probably benefit from models that are simple enough to understand and work with. But models used to help make multi-million-dollar decisions should be more complete than those used to make hundred-dollar decisions. The required sophistication of a model depends on the model’s purpose. We measure the value of a model by its efficiency in helping us arrive at a decision. Models should be used only if they help us arrive at results faster, with less expense, or with more validity.


Building Blocks for Measurement and Models

We cannot measure an attitude, a market share, or even sales, without first specifying how it is defined, formed, and related to other marketing variables. To better understand this, we must briefly study the building blocks of measurement theory: concepts, constructs, variables, operational definitions, and propositions.


Concepts and Constructs

A concept is a theoretical abstraction formed by a generalization about particulars. “Mass”, “strength”, and “love” are all concepts, as are “advertising effectiveness”, “consumer attitude”, and “price elasticity”. Constructs are also concepts, but they are observable, measurable, and are defined in terms of other constructs. For example, the construct “attitude” may be defined as “a learned tendency to respond in a consistent manner with respect to a given object of orientation.”


Variables

Researchers loosely call the constructs that they study variables. Variables are constructs in measured and quantified form. A variable can take on different values (i.e., it can vary).


Operational Definitions

We can talk about “consumer attitudes” as if we know what the term means, but it makes little sense until we define it in a specific, measurable way. An operational definition assigns meaning to a variable by specifying what is to be measured and how it is to be measured. It is a set of instructions defining how we are going to treat a variable. For example, the variable “height” could be operationally defined in a number of different ways, including measurement in inches with a precision ruler with the person (1) wearing shoes or (2) not wearing shoes, (3) by an altimeter or barometer, or (4) for a horse, by the number of “hands”.

As another example, measuring “purchase intentions” for Brand X window cleaner might be operationally defined as the answer to the following question:


The researcher could have just as appropriately defined a measurement of “purchase intention” in other ways. For example, it could be built from the concepts of “attitude” and “importance”, summed as the multiplicative product across a series of attributes such as cleaning ability, fresh smell, etc. This would appear as:

BI = Σ (Ai × Bi), summed over the attributes i = 1, …, P

where:

BI = purchase behavior intention for Brand X
Ai = attitude about Brand X window cleaner attribute i
Bi = importance of attribute i of Brand X window cleaner
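This attitude × importance model can be computed directly. The following minimal sketch (attribute names and ratings are hypothetical, not from the text) illustrates the arithmetic:

```python
# Hypothetical Ai (attitude) and Bi (importance) ratings on 1-5 scales
# for P = 3 attributes of Brand X window cleaner.
attitudes  = {"cleaning ability": 4, "fresh smell": 5, "streak-free": 3}  # Ai
importance = {"cleaning ability": 5, "fresh smell": 2, "streak-free": 4}  # Bi

# BI = sum over i of Ai * Bi
bi_score = sum(attitudes[a] * importance[a] for a in attitudes)
print(bi_score)  # 4*5 + 5*2 + 3*4 = 42
```

A higher BI score would indicate stronger purchase intention under this operational definition.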


Propositions

A proposition defines the relationship between variables, specifying both the variables influencing the relationship and the form of the relationship. It is not enough to simply state that the concept “sales” is a function of the concept “advertising”, such that S = f(Adv). Intervening variables must be specified, along with the relevant ranges for the effect (including where we would observe saturation or threshold effects) and the symbolic form of the relationship.


Integration into a Systematic Model

A proposition is quite similar to a model. A model is produced by linking propositions together to provide a meaningful explanation for a system or a process. When concepts, constructs, variables and propositions are integrated into a model for a research plan, we should conceptually ask the following questions:

- Are concepts and propositions specified?
- Are the concepts relevant to solving the problem at hand?
- Are the principal parts of the concept clearly defined?
- Is there consensus as to which concepts are relevant in explaining the problem?
- Are the concepts properly defined and labeled?
- Is the concept specific enough to be operationally reliable and valid?
- Do clear assumptions made in the model link the concepts?
- Are the limitations of the model stated?
- Can the model explain and predict?
- Can the model provide results for managerial decision making?
- Can the model be readily quantified?
- Are the outcomes of the model supported by common sense?


If the model does not meet the relevant criteria, it probably should be revised. Concept definitions may be made more precise; variables may be redefined, added, or deleted; operational definitions and measurements may be tested for validity; and/or mathematical forms revised.


Inaccuracies In Measurement

Before delving into measurement scales and question types, it is helpful to remember that measurements in marketing research are rarely “exact.” Inaccuracies in measurement arise from a variety of sources or factors. A portion of the variation among individual scores may represent true differences in what is being measured, while other variation may be error in measurement. For any given research project, not all will necessarily be operative, but the many possible sources causing variations in respondent scores can be categorized as follows:

- True differences in the characteristic or property
- Other relatively stable characteristics of individuals which affect scores (intelligence, extent of education, information processed)
- Transient personal factors (health, fatigue, motivation, emotional strain)
- Situational factors (rapport established, distractions that arise)
- Variations in administration of the measuring instrument, such as interviewers
- Sampling of items included in the instrument
- Lack of clarity (ambiguity, complexity, interpretation of words and context)
- Mechanical factors (lack of space to record response, appearance of instrument)
- Factors in the analysis (scoring, tabulation, statistical compilation)
- Variations not otherwise accounted for (chance), such as guessing an answer


In the ideal situation, variation within a set of measurements would represent only true differences in the characteristic being measured. For instance, a company wanting to measure attitudes toward a possible new brand name and trademark would like to feel confident that measurement differences concerning the proposed names represent only the individuals’ differences in this attitude. Obviously the ideal situation for conducting research seldom, if ever, exists. Measurements are often affected by characteristics of individual respondents such as intelligence, education level, and personality attributes. Therefore, the results of a study will reflect not only differences among individuals in the characteristic of interest but also differences in other characteristics of the individuals. Unfortunately, this type of situation cannot be easily controlled unless the investigator knows all relevant characteristics of the population members such that control can be introduced through the sampling process.

There are many influences in a measurement other than the true characteristic of concern—that is, there are many sources of potential error in measurement. Measurement error has a constant (systematic) dimension and a random (variable) dimension. If the error is truly random (it is just as likely to be greater than the true value as less than it), then the expected value of the sum of all errors for any single variable will be zero, making it less worrisome than nonrandom measurement error (Davis, 1997). Systematic error is present because of a flaw in the measurement instrument or the research or sampling design. Unless the flaw is corrected, there is nothing the researcher can do to get valid results after the data are collected. These two subtypes of measurement error affect the validity and reliability of measurement, topics that are discussed in the later part of this chapter. But now that we are aware of the conceptual building blocks and errors in measurement that should be considered in developing measurement scales, we will consider the types of measurement and associated questions that are commonly used in marketing research today.
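The difference between random and systematic error can be illustrated with a small simulation; the true value, bias, and spread used below are hypothetical:

```python
import random

random.seed(7)
true_value = 50.0

# Random (variable) error: symmetric around zero, so errors tend to cancel
# and their mean approaches zero as observations accumulate.
random_errors = [random.gauss(0, 5) for _ in range(100_000)]
mean_random_error = sum(random_errors) / len(random_errors)

# Systematic (constant) error: a fixed instrument bias that never cancels,
# no matter how many measurements are taken.
bias = 3.0
biased_scores = [true_value + bias + e for e in random_errors]
mean_biased = sum(biased_scores) / len(biased_scores)

print(round(mean_random_error, 2))  # close to 0
print(round(mean_biased, 1))        # close to 53.0, not the true 50.0
```

The simulation shows why only the systematic component shifts results away from the true value: averaging reduces the random component but leaves the bias intact.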


Measurement Concept

Measurement can be defined as a way of assigning symbols to represent the properties of persons, objects, events, or states. These symbols should have the same relevant relationship to each other as do the things they represent. Another way of looking at this is that measurement is “the assignment of numbers to objects to represent amounts or degrees of a property possessed by all of the objects” (Torgerson, 1958, p. 19). If a characteristic, property, or behavior is to be represented by numbers, a one-to-one correspondence must exist between the number system used and the various quantities (degrees) of that which is being measured. There are three important characteristics or features of the real number series:

1. Order. Numbers are ordered.

2. Distance. Differences exist between the ordered numbers.

3. Origin. The series has a unique origin indicated by the number zero.


A scale of measurement allows the investigator to make comparisons of amounts and changes in the variable being measured. It is important to remember that it is the attributes or characteristics of objects we measure, not the objects themselves.


Primary Types of Scales

To many people, the term scale suggests such devices as a bathroom scale, pan balances, yard sticks, gasoline gauges, measuring cups, and similar instruments for finding length, weight, volume, and the like. We ordinarily tend to think about measurement in the sense of well-defined scales possessing a natural zero and constant unit of measurement. In the behavioral sciences (including marketing research), however, we must frequently settle for less-precise data. Scales can be classified into four major categories, designated as Nominal, Ordinal, Interval, and Ratio scales.

Each scale possesses its own set of underlying assumptions about order, distance, and origin, and about how well the numbers correspond with real-world entities. As the rigor of our conceptualization increases, we can upgrade our measurement scale. One example is the measurement of color. We may simply categorize colors (nominal scale), or measure the frequency of light waves (ratio scale).

The specification of scale is extremely important in all research, because the type of measurement scale dictates the specific analytical (statistical) techniques that are most appropriate to use in analyzing the obtained data.


Nominal Scales

Nominal scales are the least restrictive and, thus, the simplest of scales. They support only the most basic analyses. The nominal scale serves only as labels or tags to identify objects, properties or events. The nominal scale does not possess order, distance, or origin. For example, we can assign numbers to baseball players. We have a one-to-one correspondence between number and player and are careful to make sure that no two players receive the same number and that no single player is assigned two or more numbers. The classification of supermarkets into categories that “carry our brand” versus those that “do not carry our brand” is a further illustration of the nominal scale.

It should be clear that nominal scales permit only rudimentary mathematical operations. We can count the stores that carry each brand in a product class and find the modal (highest number of mentions) brand carried. The usual statistical operations involving the calculations of means, standard deviations, etc. are not appropriate or meaningful for nominal scales.
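A brief sketch of the rudimentary operations a nominal scale supports (the store and brand data are hypothetical):

```python
from collections import Counter

# Hypothetical nominal data: which brand each surveyed store carries.
stores = ["Brand A", "Brand B", "Brand A", "Brand C", "Brand A", "Brand B"]

# Counting and finding the mode are meaningful for nominal data...
counts = Counter(stores)
modal_brand, mentions = counts.most_common(1)[0]
print(modal_brand, mentions)  # Brand A 3

# ...but a mean or standard deviation of brand labels would be meaningless.
```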


Ordinal Scales

Ordinal scales are ranking scales and possess only the characteristic of order. These scales require the ability to distinguish between objects according to a single attribute and direction. For example, a respondent may be asked to rank a group of floor polish brands according to “cleaning ability”. An ordinal scale results when we assign the number 1 to the highest-ranking polish, 2 to the second-highest ranking polish, and so on. Note, however, that the mere ranking of brands does not quantify the differences separating brands with regard to cleaning ability. We do not know if the difference in cleaning ability between the brands ranked 1 and 2 is greater than, less than, or equal to the difference between the brands ranked 2 and 3. In dealing with ordinal scales, statistical description can employ positional measures such as the median, quartile, and percentile, or other summary statistics that deal with order among observations.

An ordinal scale possesses all the information of a nominal scale in the sense that equivalent entities receive the same rank. Also, like the nominal scale, arithmetic averaging is not meaningful for ranked data.
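A short sketch of a scale-appropriate summary for ranked data (the ranks are hypothetical):

```python
import statistics

# Hypothetical ordinal data: ranks (1 = best) that seven respondents
# assigned to one floor-polish brand for "cleaning ability".
ranks = [1, 2, 2, 3, 4, 4, 5]

# Positional measures such as the median are meaningful for ordinal data...
print(statistics.median(ranks))  # 3

# ...but the arithmetic mean of ranks is not, because the distances
# between adjacent ranks are unknown.
```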


Interval Scales

Interval scales possess a constant unit of measurement and permit one to make meaningful statements about differences separating two objects. This type of scale possesses the properties of order and distance, but the zero point of the scale is arbitrary. Among the most common examples of interval scaling are the Fahrenheit and Centigrade scales used to measure temperature, and various types of indexes like the Consumer Price Index. While an arbitrary zero is assigned to each temperature scale, equal temperature differences are found by scaling equal volumes of expansion in the liquid used in the thermometer. Interval scales permit inferences to be made about the differences between the entities to be measured (say, warmth), but we cannot meaningfully state that any value on a specific interval scale is a multiple of another.

An example should make this point clearer. It is not empirically correct to say that 50°F is twice as hot as 25°F. Converting from Fahrenheit to Centigrade, we find that the corresponding temperatures are 10°C and –3.9°C, which are not in the ratio 2:1. We can say, however, that differences between values on different temperature scales are multiples of each other. That is, the difference 50°F – 0°F is twice the difference 25°F – 0°F. The corresponding differences on the Centigrade scale, 10°C – (–17.8°C) = 27.8°C and –3.9°C – (–17.8°C) = 13.9°C, are in the same 2:1 ratio.

Interval scales are unique up to a transformation of the form y = a + bx, b > 0. This means that one interval scale can be transformed into another by multiplying by a positive constant and adding a constant. For example, we can convert from Fahrenheit to Celsius using the formula:

TC = 5/9 (TF – 32)

Most ordinary statistical measures (such as arithmetic mean, standard deviation, and correlation coefficient) require only interval scales for their computation.
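These properties are easy to verify numerically. The sketch below applies the conversion formula and checks that ratios of interval-scale values are not preserved across scales, while ratios of differences are:

```python
def f_to_c(tf):
    # TC = 5/9 (TF - 32): a permissible interval-scale transformation
    # of the form y = a + bx with b > 0
    return 5.0 / 9.0 * (tf - 32)

# Ratios of values are NOT preserved across interval scales:
print(50 / 25)                  # 2.0 on the Fahrenheit scale
print(f_to_c(50) / f_to_c(25))  # not 2.0 (10 degrees C vs. about -3.9 degrees C)

# ...but ratios of differences ARE preserved:
diff_f = (50 - 0) / (25 - 0)
diff_c = (f_to_c(50) - f_to_c(0)) / (f_to_c(25) - f_to_c(0))
print(round(diff_f, 6), round(diff_c, 6))  # 2.0 2.0
```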


Ratio Scales

Ratio scales represent the elite of scales and contain all the information of lower-order scales and more besides. These are scales like length and weight that possess a unique zero point, in addition to equal intervals. All types of statistical operations can be performed on ratio scales.

An example of ratio-scale properties is that 3 yards is three times 1 yard. If transformed to feet, then 9 feet and 3 feet are in the same 3:1 ratio. It is easy to move from one scale to another merely by applying an appropriate positive multiplicative constant; this is the practice followed when changing from grams to pounds or from feet to inches.
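The same numerical check for a ratio scale, using the yards-to-feet conversion (a multiplicative transformation of the form y = cx with c = 3):

```python
# Ratio scales admit only multiplicative transformations y = c * x, c > 0,
# so ratios of values themselves are preserved across unit changes.
def yards_to_feet(y):
    return 3 * y

print(3 / 1)                                # 3.0 in yards
print(yards_to_feet(3) / yards_to_feet(1))  # 3.0 in feet as well
```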


Relationships Among Scales

To provide some idea of the relationships among nominal, ordinal, interval, and ratio scales: the marketing researcher who uses descriptive statistics (arithmetic mean, standard deviation) and tests of significance (t-test, F-test) should require that the data be (at least) interval-scaled.

From a purely mathematical point of view, you can obviously do arithmetic with any set of numbers—and any scale. What is at issue here is the interpretation and meaningfulness of the results. As we select more powerful measurement scales, our abilities to predict, explain, and otherwise understand respondent ratings also increase.


Table 10.1 Scales of Measurement (Stevens, 1946, p. 678)

| Scale | Mathematical Group Structure | Permissible Statistics | Typical Examples |
|---|---|---|---|
| Nominal | Permutation group: y = f(x), where f(x) is any one-to-one correspondence | Mode; contingency coefficient | Numbering of football players; assignment of type or model numbers to classes |
| Ordinal | Isotonic group: y = f(x), where f(x) is any strictly increasing function | Median; percentile; order correlation; sign test; run test | Hardness of minerals; quality of leather, lumber, wool, etc.; pleasantness of odors |
| Interval | General linear group: y = a + bx, b > 0 | Mean; average deviation; standard deviation; product-moment correlation; t-test; F-test | Temperature (Fahrenheit and Centigrade); energy; calendar dates |
| Ratio | Similarity group: y = cx, c > 0 | Geometric mean; harmonic mean; coefficient of variation | Length, width, density, resistance; pitch scale; loudness scale |


Basic Question and Answer Formats

Underlying every question is a basic reason for asking it. This reason reflects the construct to be measured, the problem to be solved or hypothesis to be tested. Constructing a question that reflects this reason will result in a higher probability that the desired response will be obtained. Table 10.2 shows nine different types of questions (based on the nature of content), the broad reason underlying asking each type of question, and some examples of each type.


Table 10.2 Basic Question Types

| Type of Question | Goal of Question | Example Questions |
|---|---|---|
| Factual or behavioral | To get information. | Questions beginning with what, where, when, why, who, and how. |
| Explanatory | To get additional information or to broaden discussion. | How would that help? How would you go about doing that? What other things should be considered? |
| Attitudinal | To get perceptions, motivations, feelings, etc., about an object or topic. | What do you believe to be the best? How strongly do you feel about XYZ? |
| Justifying | To get proof to challenge old ideas and to get new ones. | How do you know? What makes you say that? |
| Leading | To introduce a thought of your own. | Would this be a possible solution? What do you think of this plan? |
| Hypothetical | To use assumptions or suppositions. | What would happen if we did it this way? If it came in blue, would you buy it today? |
| Alternative | To get a decision or agreement. | Which of these plans do you think is best? Is one or two o’clock best for you? |
| Coordinative | To develop common agreement. To take action. | Do we all agree that this is our next step? |
| Comparative | To compare alternatives or to get a judgment anchored by another item. | Is baseball more or less exciting to watch on TV than soccer? |


Based on this structure, and the information in Table 10.3, which deals with standard answer formats, we are able to distinguish four basic question/answer types:

1. Free-answer (open-ended text)

2. Choice answers: dichotomous, single choice and multiple choice (select k of n)

3. Rank order answers

4. Constant sum answers


Table 10.3 Standard Answer Formats Based on Task

| Measurement Scale* | Format Type | Description |
|---|---|---|
| N, O, I | Select 1/n (pick-1) | The respondent is given a list of n options and is required to choose one option only. |
| N, O | Select k/n (pick-k) | The respondent gets a set of n options to select from, but this time chooses up to k options (k ≤ n). |
| N, O | Select k1/k2/n (pick-and-pick) | The respondent is asked to select k1 options in Category 1 and k2 options in Category 2. Each option can be selected in only one of the two categories. |
| N, I | Sort and rank | The respondent picks k items and allocates them into L buckets; the items allocated to each bucket are then assigned ranks. |
| N, O | Rank k/n (rank) | The respondent gets n options and is asked to rank the top k (k ≤ n). |
| N, O | Select k1/n and rank k2/k1 (pick and rank) | Similar to pick-k, but in addition to selecting k1 options from a list of n options, the respondent is then asked to rank some fraction, k2/k1, of those selected. |
| O, I | Integer rating | The respondent is asked to rate on a linear scale of 1 to n the description on the screen or accompanying prop card (for example, 1 for completely disagree to 5 for completely agree). Only integer responses are accepted. |
| O, I | Continuous rating | Similar to integer rating, except that the response can be any number (not necessarily an integer) within the range (for example, 5.2 on a scale of 0 to 10). |
| R | Constant sum | The respondent is provided with a set of attributes (5, 10, etc.) and is asked to distribute a total of p points across those attributes. |
| N, O | Yes/No | This question entails a yes/no answer and is, of course, a Select 1/2 (pick-1) question type. |
| I | Integer (integer-#) | The respondent is asked for a fact that can be expressed in integer form. A valid range can be provided for error checking. Example: age. |
| I, R | Real (real-#) | Similar to integer-#, except that the answer expected is a real (not necessarily integer) number. Example: income. A valid range can be provided for error checking. |
| C | Character | The respondent types in a string of characters as a response. Example: name. |
| I | Multiple integer ratings | Identical to integer rating except that multiple questions (classified as “options”) can appear on a single screen. Each question is answered and recorded separately. |
| I, R | Multiple real number ratings | Identical to continuous rating except that multiple questions (classified as “options”) can appear on a single screen. Each question is answered and recorded separately. |


*Legend: N = Nominal, O = Ordinal, I = Interval, R = Ratio, C = Alpha-Numeric Text Characters


Free Answer or Open-Ended Text Questions/Answers

The free answer (or open-ended text question) has no fixed alternatives to which the answer must conform. The respondent answers in his or her own words and at the length he or she chooses, subject of course to any limitations imposed by the questionnaire itself. Interviewers are usually instructed to make a verbatim record of the answer.

While free-answer questions are usually shorter and less complex than multiple-choice and dichotomous questions, they place greater demands on the ability of the respondents to express themselves. As such, this form of question provides the opportunity for greater ambiguity in interpreting answers. To illustrate, consider the following verbatim transcript of one female respondent’s reply to the question:


What suggestions could you make for improving tomato juice?

“I really don’t know. I never thought much about it. I suppose that it would be nice if you could buy it in bottles because the can turns black where you pour the juice out after it has been opened a day or two. Bottles break, though.”

Did she have “no suggestion”, “suggest packaging in a glass container”, or “suggest that some way be found to prevent the can from turning black around the opening”?

One way to overcome some of these problems, at least in personal and telephone surveys is to have interviewers probe respondents for clarity (rather than additional information). One practitioner has gone so far as to suggest that questionnaires should clearly instruct interviewers to probe only once for additional information, and to continue to probe for clarity until the interviewer understands a respondent’s reply.

Compared with other question forms (see Exhibit 10.1), we may tentatively conclude that the free-answer question provides the lowest probability of the question being ambiguous, but the highest probability of the answer being ambiguous.


Exhibit 10.1 Open-Ended Questions and Answers

The advantages of the open-ended format are considerable, but so are its disadvantages (Sudman and Bradburn, 1982). In the hands of a good interviewer, the open format allows and encourages respondents to give their opinions fully and with as much nuance as they are capable of. It also allows respondents to make distinctions that are not usually possible with the fixed alternative formats, and to express themselves in language that is comfortable for them and congenial to their views. In many instances it produces vignettes of considerable richness and quotable material that will enliven research reports.

The richness of the material can be a disadvantage if there is need to summarize the data into simple response categories. Coding of free-response material is known as content analysis and is not only time consuming and costly, but also introduces some amount of coding error.

Open-ended questions also take somewhat more time and psychological work to answer than closed questions. They also require greater interviewer skill to recognize ambiguities of response and to probe and draw respondents out, particularly those who are reticent and not highly verbal, to make sure that they give answers that can be coded. Open-ended response formats may work better with telephone interviews, where a close supervision of interview quality can be maintained, although there is a tendency for shorter answers to be given on the telephone. No matter how well controlled the interviewers may be, however, factors such as carelessness and verbal facility will generate greater individual variance among respondents than would be the case with fixed alternative response formats.


Dichotomous and Multiple-Choice Answers

The select k of n format is the workhorse of survey building, and provides the general form for both dichotomous and multiple-choice answer types. Three general forms of questions are frequently used:

Select Exactly 1 of n Answers:

When selecting k = 1/n, the type of answer scale is dependent on n, the number of answers. A dichotomous question has two fixed answer alternatives of the type “Yes/No”, “In favor/Not in favor”, “Use/Do not use”, and so on. The question quoted earlier, “Do you like the taste of tomato juice?” is an example of a dichotomous question. Multiple-choice questions are simply an extension of the dichotomous question that have more answer points and often take the form of an ordered or interval measurement scale.

Traditional multiple-choice answers also are of the select 1 of n answer form, but have more than two available answers. For example, an agreement scale could have three, five, or seven available answers:

Three answers: Agree/Neutral/Disagree
Five answers: Strongly Agree/Agree/Neither/Disagree/Strongly Disagree
Seven answers: Very Strongly Agree/Strongly Agree/Agree/Neither Agree nor Disagree/Disagree/Strongly Disagree/Very Strongly Disagree

As with all select 1 of n answers, the specific text associated with the answer options is variable and could measure many different constructs such as affect (liking), satisfaction, loyalty, purchase likelihood, and so forth.


Select Exactly k of n Answers

When questions are developed that accept or require multiple responses within a set of answers, the form “exactly k of n” or “as many as k of n” can be used. This general form asks the respondent to indicate that several answers meet the requirements of the question. In this case, the data collected would be categorical, or even loosely ordered if presence or absence of a characteristic is being measured (data are coded as 0 if not selected, and 1 if selected). This type of question might be


Select as Many as k of n Answers

A variable number of answers may also be appropriate, particularly where long lists of attributes or features are given. In these cases, the respondent is asked to select as many as k of the n possible answers, where k can be any number from 2 to n. For example, in the previous question, the respondent could select as many as three (one, two, or three) of 10 possible answers. The question might be reworded to read something like . . .

Please identify which service activities are most likely to be outsourced in the next 12 months (check all that apply).

Whitlark and Smith (2004) show an application of pick k of n data that asks respondents to pick a small number of attributes that they feel best describe a brand from a list of 10, 20, or even 30 attributes. Collecting the pick data is much faster than asking a respondent to rate brands with respect to a long list of attributes. In an online survey environment, respondents can quickly scan down columns or across screens and quickly complete the pick data task for a familiar brand, thereby saving time and reducing respondent fatigue and dropout rates.

Having people describe a brand by picking attributes from a list is a quick and simple way to assess brand performance and positioning. Whitlark and Smith (2004) show that when respondents are asked to pick from one third to one half of the viewed items, the pick k of n data can be superior to scaled data in terms of reliability and power to discriminate between attributes.
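A sketch of how pick k of n responses might be coded and summarized; the brand attributes and picks below are hypothetical, not data from Whitlark and Smith (2004):

```python
# Hypothetical pick-k/n data: each respondent picked up to 3 attributes
# (from a list of 10) that best describe a brand.
attribute_list = ["reliable", "modern", "affordable", "stylish", "durable",
                  "innovative", "friendly", "premium", "simple", "fun"]

picks = [
    ["reliable", "affordable", "durable"],
    ["reliable", "modern"],
    ["affordable", "durable", "simple"],
]

# Code each pick as 1 (selected) / 0 (not selected), giving a nominal
# indicator matrix, then summarize by mention frequency per attribute.
indicator = [[1 if a in p else 0 for a in attribute_list] for p in picks]
frequency = {a: sum(row[i] for row in indicator)
             for i, a in enumerate(attribute_list)}
print(frequency["reliable"], frequency["durable"])  # 2 2
```

The resulting frequencies give a quick profile of which attributes respondents associate with the brand.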


Rank-Order Questions/Answers

The next level of measurement rank-orders the answers and thereby increases the power of the measurement scale over categorical measurement by including the characteristic of order in the data. Whereas the categorical data associated with many dichotomous or multiple-choice items does not permit us to say that one item is greater than another, rank-order data allows for the analysis of differences. Rank-order questions use an answer format that requires the respondent to assign a rank position to all items, or to a subset of items in the answer list. The first, second, and so forth up to the nth item would be ordered. Procedures for assigning position numbers can be very versatile, resulting in different types of questions that can be asked. Typical questions might include identifying preference rankings, or attribute associations from first to last, most recent to least recent, or relative position (most, next most, and so forth), until either a set number of items is ordered or all items are ordered.

When this type of question is administered online or using a CATI (Computer-Assisted Telephone Interviewing) system, additional options for administration may include randomization and acceptance/validation of ties in the ranking. Randomization of the answer list order helps to control for presentation order bias. It is well established that, in elections, being listed first on the ballot increases a candidate’s chances of receiving votes.

Tied rankings are another issue to be considered for rank-order questions. When ties are permitted, several items may be evaluated as having the same rank. In general, this is not a good idea because it weakens the data. However, if ties truly exist, then the ranking should reflect this. Rank-order questions are generally a difficult type of question for respondents to answer, especially if the number of items to be ranked goes beyond five or seven.


Constant Sum Questions/Answers

A constant sum question is a powerful question type that permits collection of ratio data, meaning that the data is able to express the relative value or importance of the options (option A is twice as important as option B). This type of question is used when you are relatively sure of the answer set (i.e., reasons for purchase), or when you want to evaluate a limited number of reasons that you believe are important. The following example of a constant sum question from Qualtrics uses sliding scales to select a sum of 100 points:



Advanced Measurement And Scaling Concepts

Continuing our discussion of scales, we now focus on some of the more common scaling techniques and models. We focus on broad concepts of attitude scaling—the study of scaling for the measurement of managerial and consumer or buyer perception, preference, and motivation. All attitude (and other psychological) measurement procedures are concerned with having people—consumers, purchasing agents, marketing managers, or whomever—respond about certain stimuli according to specified sets of instructions. The stimuli may be alternative products or services, advertising copy themes, package designs, brand names, sales presentations, and so on. The response may involve judging which copy theme is more pleasing than another, which package design is more appealing than another, what mental images new brand names evoke, which adjectives best describe each salesperson, and so on.

Scaling procedures can be classified in terms of the measurement properties of the final scale (nominal, ordinal, interval, or ratio), the task that the subject is asked to perform, or in still other ways, such as whether the scale measures the subject, the stimuli, or both (Torgerson, 1958).

We begin with a discussion of various methods for collecting ordinal-scaled data (paired comparisons, rankings, ratings, etc.) in terms of their mechanics and assumptions regarding their scale properties. Then specific procedures for developing these actual scales are discussed. Techniques such as Thurstone Case V scaling, semantic differential, the Likert summated scale, and the Thurstone differential scale are illustrated. The chapter concludes with some issues and limitations of scaling.


Advanced Ordinal Measurement Methods

The variety of ordinal measurement methods includes a number of techniques:

• Paired comparisons

• Ranking procedures

• Ordered-category sorting

• Rating techniques


We discuss each of these data collection procedures in turn.

Paired Comparisons

As the name suggests, paired comparisons require the respondent to choose one of a pair of stimuli that “has more of”, “dominates”, “precedes”, “wins over”, or “exceeds” the other with respect to some designated property of interest. If, for example, six laundry detergent brands are to be compared for “sudsiness”, a full set of paired comparisons would involve n(n – 1)/2 = (6 × 5)/2, or 15, paired comparisons (if order of presentation is not considered). Respondents are asked which one of each pair has the most sudsiness.
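The pair count can be verified by enumerating the pairs directly. A short sketch, using the six hypothetical brand names from the example:

```python
from itertools import combinations

# Enumerate the full set of unordered pairs for the six hypothetical
# brands, confirming the n(n - 1)/2 count.
brands = ["Arrow", "Zip", "Dept", "Advance", "Crown", "Mountain"]

pairs = list(combinations(brands, 2))  # order of presentation ignored

n = len(brands)
print(len(pairs))  # 15, i.e., 6 * 5 / 2
```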

A sample question format for paired comparisons is shown in Table 10.4. The order of presentation of the pairs and which item of a pair is shown first are typically determined and/or presented randomly. Consider the following hypothetical brand names (and numerical categories): Arrow (1), Zip (2), Dept (3), Advance (4), Crown (5), and Mountain (6).


Table 10.4 Example of the Paired Comparisons Question



Original Data

Brand        Arrow   Zip   Advance   Dept   Crown   Mountain
Arrow          X      0       1        1      1        1
Zip            1      X       1        1      1        1
Advance        0      0       X        0      0        0
Dept           0      0       1        X      0        0
Crown          0      0       1        1      X        1
Mountain       0      0       1        1      0        X

Rearranged Data (rows and columns ordered by total wins)

Brand        Zip   Arrow   Crown   Mountain   Dept   Advance   Total
Zip           X      1       1        1         1       1        5
Arrow         0      X       1        1         1       1        4
Crown         0      0       X        1         1       1        3
Mountain      0      0       0        X         1       1        2
Dept          0      0       0        0         X       1        1
Advance       0      0       0        0         0       X        0


A cell value of 1 implies that a row brand exceeds the column brand, “0” otherwise.
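The rearrangement in Table 10.4 amounts to totaling each row's wins and sorting brands by that total. A minimal sketch of this tabulation, using the 0/1 matrix from the table (self-comparisons represented as None):

```python
# Compute each brand's total "wins" from the 0/1 paired-comparison
# matrix of Table 10.4, then sort brands by wins.
brands = ["Arrow", "Zip", "Advance", "Dept", "Crown", "Mountain"]
X = None  # self-comparison cells, skipped in the totals
matrix = [
    [X, 0, 1, 1, 1, 1],  # Arrow
    [1, X, 1, 1, 1, 1],  # Zip
    [0, 0, X, 0, 0, 0],  # Advance
    [0, 0, 1, X, 0, 0],  # Dept
    [0, 0, 1, 1, X, 1],  # Crown
    [0, 0, 1, 1, 0, X],  # Mountain
]

wins = {b: sum(v for v in row if v is not None)
        for b, row in zip(brands, matrix)}
ranking = sorted(brands, key=wins.get, reverse=True)
print(ranking)  # ['Zip', 'Arrow', 'Crown', 'Mountain', 'Dept', 'Advance']
```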


Ranking Procedures

Ranking procedures require the respondent to order stimuli with respect to some designated property of interest. For example, instead of using the paired-comparison technique, respondents might be asked to directly rank the detergents with respect to sudsiness. Similarly, ranking can be used to determine key attributes for services.

In a survey conducted by Subaru of America, new Subaru car purchasers were asked questions regarding the purchase and delivery processes. One question required ranking:


A variety of ordering methods may be used to order k items from a full set of n items. These procedures, denoted by Coombs (1964) as “order k/n” (k out of n), expand the repertory of ordering methods quite markedly. The various ordering methods may pre-specify the value of k (“order the top three out of six brands with respect to sudsiness”) as illustrated by the Subaru study, or allow k to be chosen by the respondent (“select those of the six brands that seem to exhibit the most sudsiness, and rank them”).

When the groups can be ordered by some category, a procedure known as category sorting can be used. For example, if it is desired that a respondent rank all items in a longer list of items, the pick-group-rank procedure may be used. The respondent first sorts the items into a number of ordered categories or piles (each of which has a relatively small, equal number of items). Then, the request is made to rank the items within each pile. This task can be completed in personal interviews or in online surveys using something like the Qualtrics pick-group-rank question that facilitates the ordered ranking within the ordered-category sorting tasks.


Ordered-Category Sorting

Pick-Group-Rank is one of a variety of data collection procedures that have as their purpose the assignment of a set of stimuli to a set of ordered categories. For example, if 15 varieties of laundry detergents represented the stimulus set, the respondent might be asked to complete the following task:



The pick-group-rank could be used with ordered categories to sort all of a large list of items, where there is:

1. free versus forced assignment of names to grouping categories

2. free versus forced assignment of stimuli to grouping categories

3. the assumption of equal intervals between category boundaries versus the weaker assumption of category boundaries that are merely ordered with regard to the attribute of interest


In ordinal measurement methods one assumes only an ordering of category boundaries. The assumption of equal intervals separating boundaries is part of the interval/ratio measurement set of methods. Ordered-category sorting appears especially useful when the researcher is dealing with a relatively large number of stimuli (over 15 or so) and it is believed that a subject’s discrimination abilities do not justify a strict (no ties allowed) ranking of the stimulus objects.


Rating Techniques

Rating scales are ambiguous as to whether or not they meet the criterion of equal intervals separating the category boundaries. In some cases, the scaled responses are considered by the researcher to be only ordinal, while in other cases, the researcher treats them as interval or ratio-scaled. The flexibility of rating procedures makes them appropriate for either the ordinal or interval/ratio measurement data collection methods (depending on the nature of the scale values).

The rating task typically involves having a respondent place that which is being rated (a person, object, or concept) along a continuum or in one of an ordered set of categories. Ratings allow the respondent to designate a degree or an amount of a characteristic or attribute as a point on a scale. The task of rating is one of the most popular and easily applied data collection methods, and is used in a variety of scaling approaches, such as the semantic differential and the Likert summated scale.

Rating scales can be either monadic or comparative. In monadic scaling, each object is measured (rated) by itself, independently of any other objects being rated. In comparative scaling, by contrast, objects are evaluated in comparison with other objects. For example, a recent in-flight survey conducted by United Airlines asked the following questions:



The rating is monadic. United then asked respondents another question:


Ratings are used very widely because they are easier and faster to administer and yield data that are amenable to being analyzed as if they are interval-scaled. But there is a risk of lack of differentiation among the scores when the particular attributes are worded positively or are positive constructs, such as values, and the respondents end-pile their ratings toward the positive end of the scale. Such lack of differentiation may potentially reduce the variance of the items being rated and reduce the ability to detect relationships with other variables.


McCarty and Shrum (2000) offer an alternative to simple rating. Respondents first pick their most and least important values (or attributes or factors) and rate them; the remaining values are then rated. Their results indicate that, compared with a simple rating of values, the most-least procedure reduces the level of end-piling and increases the differentiation of values ratings, both in terms of dispersion and the number of different rating points used.

Respondents may have trouble choosing a rating scale value on the high end, but not so at the low end. One person’s rating of 9 or 10 may be equal in meaning to another’s 7 or 8. Semon (1999) suggests that one way to find the real difference in perception or attitude is to ask each respondent three questions at the start of an interview:

1. On this scale, how do you rate the brand you now use or that you know best?
2. How do you rate the best brand you know about?
3. What rating represents the minimum acceptable level?

Questions such as these are often asked in product and brand studies, to interpret ratings and provide anchor points for a respondent’s ratings. A respondent’s actual ratings can be translated into responses relative to one or more of these anchors to produce real-meaning relative ratings that can be reliably aggregated and analyzed without depending upon assumptions that may be questionable.
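One hypothetical way to carry out this anchoring translation is to rescale each raw rating against the respondent's own anchor points, so that 0 means "minimum acceptable" and 1 means "best brand known." The function name and the particular linear rescaling are assumptions for illustration, not a method prescribed by Semon:

```python
# Hypothetical sketch of anchor-relative rating: re-express a raw
# rating relative to a respondent's own "minimum acceptable" and
# "best brand known" anchors, so ratings can be compared across
# respondents with different personal scale usage.

def relative_rating(raw, minimum_acceptable, best_known):
    """Rescale a raw rating against a respondent's personal anchors."""
    span = best_known - minimum_acceptable
    if span == 0:
        raise ValueError("anchor ratings must differ")
    return (raw - minimum_acceptable) / span

# A lenient rater (anchors 8 and 10) and a strict rater (anchors 5 and 8)
# give different raw scores that mean the same thing relatively.
print(relative_rating(9, 8, 10))   # 0.5
print(relative_rating(6.5, 5, 8))  # 0.5
```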

Rating methods can take several forms: numerical, graphic, and verbal. Often two or more of these formats appear together, as illustrated in Figures 10.1 and 10.2. Many other types of rating methods are in use (Haley & Case, 1979).


Figure 10.1 Examples of Rating Scales Used in Marketing Research


In many instances where rating scales are used, the researcher assumes not only that the items are capable of being ranked, but also that the descriptive levels progress in psychologically equal-interval steps. That is, the numerical correspondences shown in Panels (c) and (e) of Figure 10.1 may be treated—sometimes erroneously—as interval- or ratio-scaled data. Even in cases represented by Panels (a), (b), and (d), it is not unusual to find that the researcher assigns successive integer values to the various category descriptions and subsequently works with the data as though the responses were interval-scaled.

Treating rating scales as interval or ratio measurements is a well-documented and widespread practice. Research evidence supports it: often, when ordinal data are treated as interval and parametric analyses are used, the conclusions reached are the same as when the data are treated as ordinal and tested using non-parametric analyses, suggesting that little error results from treating the data as being of a higher level of measurement than it is.


Figure 10.2 A Rating Thermometer



Constructing a Behaviorally Anchored Rating Scale

One type of itemized rating scale that has merit in cases where leniency error (lack of discrimination) may be troublesome is the behaviorally anchored rating scale, or BARS (see Figure 10.3). This scale uses behavioral incidents, rather than verbal, graphic, or numeric labels, to define each position on the rating scale. Providing specific behavioral anchors can thus reduce leniency errors and increase discrimination. Developing scales such as these requires a great amount of testing and refinement to find the right anchors for the situation under examination.


Figure 10.3 Behaviorally Anchored Rating Scale (BARS)



The basic process of developing a behaviorally anchored rating scale consists of four steps:

1. Construct definition—the construct being measured must be explicitly defined and the key dimensions identified

2. Item generation—statements must be generated describing actual behaviors that would illustrate specific levels of the construct for each dimension identified

3. Item testing—to unambiguously fit behavioral statements to dimensions

4. Scale construction—lay out the scale with behavioral statements as anchors


In following this process, sets of judges are used. It should be clear that developing BARS is a time-consuming and costly task. Thus, they should be reserved for those applied settings where they can minimize the errors they are designed to curtail, especially leniency error. As an example, families with elderly members were surveyed to determine their need for in-home health-care services. A BARS was used for one critical measure of how well elderly members of the household were able to perform everyday living activities:




Table 10.5 identifies nine questions that must be answered when a scale is constructed.


Issues in Constructing a Rating Scale

1. Should negative numbers be used?

2. How many categories should be included?

3. Related to the number of categories: Should there be an odd number or an even number? That is, should a neutral alternative be provided?

4. Should the scale be balanced or unbalanced?

5. Is it desirable not to force a substantive response by giving an opportunity to indicate “don’t know,” “no opinion,” or something similar?

6. What does one do about halo effects—that is, the tendency of raters to ascribe favorable property levels to all attributes of a stimulus object if they happen to like a particular object in general?

7. How does one examine raters’ biases—for example, the tendency to use extreme values or, perhaps, only the middle range of the response scale, or to overestimate the desirable features of the things they like (i.e., the generosity error)?

8. How should descriptive adjectives for rating categories be selected?

9. How should anchoring phrases for the scale’s origin be chosen?


Some research on these questions has been conducted; it shows, for example, that no error is introduced when a neutral option is provided. Our suggestion is that a neutral option always be included, unless the researcher has a compelling reason not to do so (e.g., the problem situation/sample mix is such that each sample member can be expected to have a non-neutral attitude). Expected voting in a survey of voters is an example.

Question 4 deals with the interesting issue of balance, referring to having an equal number of negative response alternatives and positive ones. When using importance scales for attributes, the alternatives provided may be “very important”, “important”, “neither important nor unimportant”, “unimportant”, and “very unimportant”, or additional categories may be included. Thomas Semon (2001) has questioned the use of balance (or symmetry, as he calls it) in importance scales. He argues that importance is not a bipolar concept: importance ranges from some positive amount down to none, not to a negative amount. Although this argument has conceptual appeal, researchers continue to successfully use importance scales with some mid-point, specified or implied. There would seem to be three keys to successful importance scale use:

1. Isolating any findings of unimportance
2. Recognizing that importance is ordinally scaled
3. Accurately interpreting the relative nature of importance findings

Answers to questions such as these will vary by the researcher’s approach, and by the problem being studied. The effects of research design on reliability and validity of rating scales are discussed in two excellent review papers (Churchill and Peter, 1984; Peter and Churchill, 1986).


Exhibit 10.2 Measuring Preferences of Young Children Calls for Creativity

The children’s market is a multi-billion-dollar market in direct purchasing power and an even greater market in purchasing influence. Among the areas of most concern are better scaling techniques for measuring children’s product preferences. Widely used approaches for assessing children’s preferences are itemized rating scales using a series of stars (a scale from 1 to 5 stars) or a series of facial expressions (a scale anchored at one end with a happy face and at the other end with a sad face), as illustrated below:

(a) Smiling Faces Scale


(b) Star Scale


Children are asked to indicate how much they like a product, or how much they like a particular feature of a product, by pointing to one of the visual anchors on the scale.

Although these scales have done well in varied research applications, a problem with leniency often emerges, particularly with young children under the age of eight: young children consistently use the extreme positions (usually on the positive side), with relatively little use of intermediate scale positions. If this is done for all products tested, the overall sensitivity of existing (traditional) rating scales is lowered, resulting in inconclusive findings about children’s preferences.

In summary, rating methods—depending on the assumptions of the researcher—can be considered to lead to ordinal-, interval-, or even ratio-scaled responses. The latter two scales are taken up next. We shall see that rating methods figure prominently in the development of quantitative-judgment scales.

Interval/Ratio Procedures

Direct-judgment estimates, fractionation, constant sum, and rating methods (all of which assume more than ordinal properties of respondents’ judgments) are variants of interval/ratio procedures, or metric measurement methods.

Direct-Judgment Methods

In direct-judgment methods, the respondent is asked to give a numerical rating to each stimulus with respect to some designated attribute. In the case of continuous rating scales, the respondent is free to choose his or her own number along some line that represents his or her judgment about the magnitude of the stimulus relative to some reference points (Figure 10.4).

These continuous scales work effectively in a semantic differential context (to be discussed later in this chapter) (Albaum, Best, & Hawkins, 1981), and appear to be insensitive to fluctuations in the length of the line used (Hubbard, Little, & Allen, 1989).

The limited-response category subcase, illustrated by Panel (b) in Figure 10.4, is nothing more than a straight rating procedure, with the important addition that the ratings are now treated as either interval- or ratio-scaled data (depending on the application).

Figure 10.4 Sample Interval-Ratio Scales



Fractionation

Fractionation is a procedure in which the respondent is given two stimuli at a time (e.g., a standard laundry detergent and a test brand) and asked to give some numerical estimate of the ratio between them, with respect to some attribute, such as sudsiness. The respondent may answer that the test brand, in his or her judgment, is three-fourths as sudsy as the standard. After this is done, a new test brand is compared with the same standard, and so on, until all test items are judged. Panel (c) in Figure 10.4 illustrates this procedure.

In other cases, the test item can be more or less continuously varied by the respondent. For example, in an actual test of the sweetness of lemonade, the respondent may be asked to add more sweetener until the test item is “twice as sweet” as the standard.


Constant Sum

Constant-sum methods have become quite popular in marketing research, primarily because of their simplicity and ease of instructions. In constant-sum methods the respondent is given some number of points—typically 10 or 100—and asked to distribute them over some set of stimuli or attribute alternatives in a way that reflects their relative importance or magnitude (Figure 10.4d). Constant sum forces the respondent to make comparative evaluations across the stimuli, and effectively standardizes each scale across persons, since all scores must add to the same constant. Generally, it is assumed that a subjective ratio scale is obtained by this method.
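The standardization property can be sketched directly: because every respondent's allocations must add to the same constant, each allocation can be read as a relative (ratio-scaled) weight. The attribute names and point values below are made up for illustration:

```python
# Sketch: validating and interpreting a 100-point constant-sum
# allocation. The attributes and points are hypothetical.

allocation = {"price": 50, "quality": 25, "service": 15, "warranty": 10}

# Responses must add to the constant for the data to be usable
total = sum(allocation.values())
assert total == 100, "constant-sum responses must add to 100"

# Ratio interpretation: price is judged twice as important as quality
ratio = allocation["price"] / allocation["quality"]
print(ratio)  # 2.0
```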

In summary, unlike ordinal measurement methods, the major assumption underlying ratio/interval measurement methods is that a unit of measurement can be constructed directly from respondents’ estimates about scale values associated with a set of stimuli. The respondent’s report is taken at face value, and any variation in repeated estimates (over test occasions within respondent or over respondents) is treated as error; repeated estimates are usually averaged over persons and/or occasions. The problems associated with interval-ratio scaling methods include the following:

1. Respondents’ subjective scale units may differ across respondents, across testing occasions, or both.

2. Respondents’ subjective origins (zero points) may differ across respondents, across occasions, or both.

3. Unit and origin may shift over stimulus items within a single occasion.

4. Subjective distance between stimuli may not equal one’s perception of the distance on the scale.


These problems should not be treated lightly, but considered in the design of the question and scale points.

Most rating measurement methods have the virtue of being easy to apply. Moreover, little additional work beyond averaging is required to obtain the unit of measurement directly. Indeed, if a unique origin can be established (e.g., a zero level of the property), then the researcher obtains both an absolute origin and a measurement unit. As such, a subjective ratio scale is obtained.


Techniques for Scaling Stimuli

Mission Impossible?

You are making good progress in your first internship and have just been asked by the head of research to present a summary of how your brand is perceived relative to the four major competitor brands… for tomorrow’s meeting

You find the results of a survey conducted just before you came on board. The survey used a paired comparison task to rate preference (10 pairs: 1 vs. 2, 1 vs. 3, 1 vs. 4, etc.), and another simple rank order question for the 5 brands. But your manager specifically asked that preference be displayed on a single continuous (interval or ratio) scale.

You’re not yet in full-blown panic… you average the preference rankings and put them into a symmetric square matrix (Table 10.6)… but how do you convert a data matrix into a single-dimension interval preference scale?

The answer to this problem is that ranking methods may undergo a further transformation (via an intervening scaling model) to produce a set of scale values that are interval-scaled. One such transformation, Thurstone’s Case V method, is capable of transforming ordinal data obtained from ranking methods. It should be noted that, technically speaking, the raw data obtained from rating methods also require an intervening model in order to prepare an interval-scaled summary measure. In this case, however, the model may be no more elaborate than averaging the raw data across respondents and/or response occasions.

Osgood’s semantic differential is an illustration of a procedure for dealing with raw data obtained from interval-ratio rating methods. We consider each of these techniques in turn.


Case V Scaling

Thurstone’s Case V scaling model, based on his Law of Comparative Judgment, permits the construction of a unidimensional interval scale using responses from ordinal measurement methods, such as paired comparisons (Thurstone, 1959). The model can also be used to scale ranked data or ordered-category sorts. Several subcases of Thurstone’s model have been developed. We shall first describe the general case and then concentrate on Case V, a special version particularly amenable to application in marketing situations.

Essentially, Thurstone’s procedure involves deriving an interval scale from comparative judgments of the type “A is fancier than B”, “A is more prestigious than B”, “A is preferred to B”, and so on. Scale values may be estimated from data in which one individual makes many repeated judgments on each pair of a set of stimuli or from data obtained from a group of individuals with few or no replications per person.

The example should make the Case V procedure easier to follow. Assume that the survey you found had asked 100 homemakers to compare five brands of “fortified juice” with respect to “overall preference of flavor”. The homemakers sipped a sample of each brand paired with a sample of every other brand (a total of ten pairs) from paper cups that were marked merely with identifying numbers. Table 10.6 shows the empirically observed proportion for each comparison.

From this table we see that 69 percent of the respondents preferred Juice C to Juice A, and the remainder, 31 percent, preferred Juice A to Juice C (if we arbitrarily let column dominate row). It is customary to set self-comparisons (the main-diagonal entries of Table 10.6) to 0.5; this has no effect on the resulting scale values (Edwards, 1957). From the data of this table we next prepare Table 10.7, which summarizes the Z-value appropriate for each proportion. These Z-values were obtained from Table A.1 in Appendix A at the end of this book. If the proportion is less than 0.5, the Z-value carries a negative sign; if the proportion is greater than 0.5, the Z-value carries a positive sign. The Z-values are standard unit variates associated with a given proportion of total area under the normal curve. The Thurstonian model assumes normally distributed scale differences with mean = 0 and standard deviation = 1.0.

For example, from Table 10.6 we note that the proportion of respondents preferring Juice B over Juice A is 0.82. We wish to know the Z-value appropriate thereto. This value (labeled Z in the standard unit normal table of Table A.1) is 0.92. That is, 82 percent of the total area under the normal curve lies between Z = –∞ and Z = 0.92. All remaining entries in Table 10.7 are obtained in a similar manner, a minus sign being prefixed to the Z-value when the proportion is less than 0.5.

Column totals are next found for the entries in Table 10.7. Scale values are obtained from the column sums by taking a simple average of each column’s Z-values. For example, from Table 10.7, we note that the sum of the Zs for the first column (Juice A) is –0.36. The average Z for column A is simply:

RA = –0.36 / 5 = –0.072

This scale value expresses Juice A as a deviation from the mean of all five scale values. The mean of the five values, as computed from the full row of Zs, will always be zero under this procedure. Similarly, we find the average Z-value for each of the remaining four columns of Table 10.7: RB = 0.810, RC = 0.310, RD = –0.674, and RE = –0.374.

Next, since the zero point of an interval scale is arbitrary, we can transform the scale so that the minimum scale value becomes zero. Since Juice D has the lowest value (RD = –0.674), we force it to become the reference point (or origin) of zero by adding 0.674 to each of the five scale values. Doing so yields the Case V scale values of the five brands; these are denoted by R* and appear in the last row of Table 10.7.

The scale values of Juices A through E indicate the preference ordering B > C > A > E > D. Moreover, assuming that an interval scale exists, we can say, for example, that the difference in “goodness of flavor” between Juices B and A is 2.3 times the difference in “goodness of flavor” between Juices C and A, since

B – A = 2.3 (C – A)
1.484 – 0.602 = 2.3(0.984 – 0.602)
0.882 = 2.3(0.382) (within rounding error).

The test of this model is how well scale values can be used to work backward—that is, to predict the original proportions. The Case V model appears to fit the data in the example quite well. For any specific brand, the highest mean absolute proportion discrepancy is 0.025 (Juice A). Moreover, the overall mean absolute discrepancy is only .02 (rounded). Even the simplest version (Case V) of the Thurstonian model leads to fairly accurate predictions. The R* scale values of the Case V model preserve the original rank ordering of the original proportions data.
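The Case V computations above can be sketched in a few lines of code. This is a hedged illustration on a hypothetical three-brand proportions matrix (the values are made up, not taken from Table 10.6); Python's `statistics.NormalDist` supplies the inverse normal transform in place of the Z table in Appendix A:

```python
from statistics import NormalDist

# Sketch of Thurstone Case V scaling on a hypothetical 3-brand matrix.
# p[i][j] = proportion of respondents preferring column brand j over
# row brand i; diagonal self-comparisons are set to 0.5 by convention.
brands = ["A", "B", "C"]
p = [
    [0.50, 0.82, 0.69],   # preferences over A
    [0.18, 0.50, 0.30],   # preferences over B
    [0.31, 0.70, 0.50],   # preferences over C
]

z = NormalDist()  # standard unit normal

# Convert each proportion to its unit-normal Z value, then average
# each column to get interval scale values (the values average zero).
n = len(brands)
scale = [sum(z.inv_cdf(p[i][j]) for i in range(n)) / n for j in range(n)]

# Shift the scale so the lowest-valued brand sits at the zero origin
origin = min(scale)
r_star = [round(s - origin, 3) for s in scale]
print(dict(zip(brands, r_star)))
```

With these illustrative proportions, brand B comes out highest and brand A lowest, consistent with the raw preference percentages.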

The Semantic Differential

The semantic differential (Osgood, Suci, & Tannenbaum, 1957) is a ratings procedure that results in (assumed interval) scales that are often further analyzed by such techniques as factor analysis (see Chapter 14). Unlike the Case V model, the semantic differential provides no way to test the adequacy of the scaling model itself. It is simply assumed that the raw data are interval-scaled; the intent of the semantic differential is to obtain these raw data for later processing by various multivariate models.

The semantic differential procedure permits the researcher to measure both the direction and the intensity of respondents’ attitudes (i.e., measure psychological meaning) toward such concepts as corporate image, advertising image, brand or service image, and country image

As shown in Figure 10.5, the respondent may be given a set of pairs of antonyms, the extremes of each pair being separated by seven intervals that are assumed to be equal. For each pair of bi-polar adjectives (e.g., powerful/weak), the respondent is asked to judge the concept along the seven-point scale with implicit descriptive phrases.

Figure 10.5 Corporate Profile Obtained by the Semantic Differential


In practice, however, profiles would be built up for a large sample of respondents, with many more bipolar adjectives being used than given here.

By assigning a set of integer values, such as +3, +2, +1, 0, –1, –2, –3, to the seven gradations of each bipolar scale in Figure 10.5, the responses can be quantified under the assumption of equal-appearing intervals. These scale values, in turn, can be averaged across respondents to develop semantic differential profiles. For example, Figure 10.6 shows a chart comparing evaluations of Companies X and Y. The average scores for the respondents show that Company X is perceived as very weak, unreliable, old-fashioned, and careless, but rather warm. Company Y is perceived as powerful, reliable, and careful, but rather cold as well; it is almost neutral with respect to the modern/old-fashioned scale.

Figure 10.6 Average-Respondent Profile Comparisons of Companies X and Y


Company X = ----- Company Y = --------
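Building a profile point is just an average of coded responses. The ratings below are invented for illustration, using the +3 to –3 coding for a single powerful/weak pair:

```python
# Hypothetical semantic-differential ratings on the powerful/weak pair,
# coded +3 (extremely powerful) through -3 (extremely weak), for five
# respondents each; company names follow the text, data are invented.
ratings = {
    "Company X": [-3, -2, -3, -2, -3],
    "Company Y": [2, 3, 2, 3, 2],
}

# Profile value = mean rating across respondents for each company
profile = {co: sum(r) / len(r) for co, r in ratings.items()}
```

Repeating this for every bipolar pair yields the plotted profile lines.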

In marketing research applications, the semantic differential often uses bipolar descriptive phrases rather than simple adjectives, or a combination of both types. These scales are developed for particular context areas so that they have more meaning to respondents, usually leading to a higher degree of reliability.

The same issues of scale construction discussed earlier apply to the semantic differential. In addition, the researcher must select an overall format for presenting the scales. As Figure 10.7 illustrates, there are many formatting variations for semantic differential scaling, such as scales that include the numbers 1–7 at the scale points, or a numerical comparative scale in which respondents make their judgments for KMart, Wal-Mart, and Sears on one attribute before moving to the next one (Golden, Brockett, Albaum, & Zatarain, 1992).

Figure 10.7 Alternate Formats for the Semantic Differential






The number and type of stimuli to be evaluated and the method of administration (personal interview, mail, telephone, or email) should determine, at least in part, which format the researcher should use. Comparative studies of data quality (including reliability) seem to indicate that the choice of a format may appropriately be made on the basis of ease of subject understanding, ease of coding and interpretation for the researcher, ease of production and display, and cost. If a large number of stimuli are to be evaluated, this tends to favor use of the graphic positioning or numerical comparative scales.

A Concluding Remark

The semantic differential technique is appropriate for use in a variety of applications:

- Comparing corporate images, both among suppliers of particular products and against an ideal image of what respondents think a company should be

- Comparing brands and services of competing suppliers

- Determining the attitudinal characteristics of purchasers of particular product classes or brands within a product class, including perceptions of the country of origin for imported products

- Analyzing the effectiveness of advertising and other promotional stimuli in changing attitudes


The comparatively widespread use of the semantic differential by marketing researchers suggests that this method provides a convenient and reasonably reliable way for scaling stimuli (scaling images of brands, corporations, services, etc.), and developing profiles of
consumer/buyer attitudes on a wide variety of topics.


Techniques for Scaling Respondents

In contrast to the approaches for scaling stimuli just discussed, researchers also have available techniques whose primary purpose is to scale respondents along some attitude continuum of interest. Two of the better-known procedures for doing this are the summated scale and the Q-sort technique. Each is described in turn.


The Summated Scale

The summated scale was originally proposed by Rensis Likert (pronounced “lick-ert”), a psychologist (Likert, 1967; Kerlinger, 1973). To illustrate, assume that the researcher wishes to scale some characteristic, such as the public’s attitude toward travel and vacations.

To illustrate the Likert scale, a set of seven statements regarding travel and vacations used in a study by a travel company is shown in Figure 10.8. Each of the seven test items has been classified as “favorable” (items 1, 3, and 7) or “unfavorable” (items 2, 4, 5, and 6). Each subject is asked to indicate agreement with each statement. The responses are scored +2 for “strongly agree”, +1 for “agree”, 0 for “neither”, –1 for “disagree”, and –2 for “strongly disagree”. Because items 2, 4, 5, and 6 are “unfavorable” statements, we reverse the order of the scale values for these items so as to maintain a consistent direction (+2 would stand for “strongly disagree,” and so on).

Suppose that a subject evaluated the seven items such that the respondent would receive a total score of:

+ 2 + 1 + 1 + 2 + 1 + 2 + 2 = 11

Suppose that another respondent responded to the seven items by marking (1) strongly disagree, (2) neither, (3) disagree, (4) strongly agree, (5) strongly agree, (6) strongly agree, and (7) neither. This person’s score would be:

– 2 + 0 – 1 – 2 – 2 – 2 + 0 = –9

This listing indicates that the second respondent would be ranked “lower” than the first—that is, as having a less favorable attitude regarding travel and vacations. However, as indicated earlier, a given total score may have different meanings.
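The scoring rule just described can be sketched as follows; the response abbreviations (SA, A, N, D, SD) are our own shorthand, and the reverse-keyed item numbers are those given in the text.

```python
# Scoring sketch for the seven-item travel scale: +2 .. -2 coding with
# items 2, 4, 5, and 6 reverse-keyed so that high totals always mean a
# favorable attitude.
CODE = {"SA": 2, "A": 1, "N": 0, "D": -1, "SD": -2}
UNFAVORABLE = {2, 4, 5, 6}  # 1-based item numbers of unfavorable items

def summated_score(responses):
    """responses: list of 7 response labels, item 1 first."""
    total = 0
    for item, label in enumerate(responses, start=1):
        raw = CODE[label]
        # reverse-keyed items: negate so direction stays consistent
        total += -raw if item in UNFAVORABLE else raw
    return total

# Example response pattern scoring -9:
print(summated_score(["SD", "N", "D", "SA", "SA", "SA", "N"]))  # -9
```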

Figure 10.8 A Direction-Intensity Scale for Measuring Attitudes Toward Travel and Vacations


In applying the Likert summated-scale technique, the steps shown in Exhibit 10.3 are typically carried out.

Exhibit 10.3 Steps in Constructing a Likert Summated Scale

1. The researcher assembles a large number (e.g., 75 to 100) of statements concerning the public’s sentiments toward travel and vacations.

2. Each of the test items is classified by the researcher as generally “favorable” or “unfavorable” to the attitude under study. No attempt is made to scale the items; however, a pretest is conducted that involves the full set of statements and a limited sample of respondents. Ideally, the initial classification should be checked across several judges.

3. In the pretest the respondent indicates approval (or not) of every item, checking one of the following direction-intensity descriptors:

4. Each response is given a numerical weight (e.g., +2, +1, 0, -1, -2 or +1 to +5).

5. The individual’s total-attitude score is the algebraic sum of the weights associated with the items checked. In the scoring process, weights are assigned so that the direction of attitude (favorable to unfavorable) is consistent over items. For example, if +2 were assigned to “strongly approve/agree” for favorable items, then +2 should be assigned to “strongly disapprove/disagree” for unfavorable items.

6. On the basis of the pretest results, the analyst selects only those items that appear to discriminate well between high and low total scorers. This may be done by first finding the highest and lowest quartiles of subjects on the basis of total score and then comparing the mean differences on each specific item between these high and low groups (excluding the middle 50 percent of subjects).

7. The 20 to 25 items finally selected are those that discriminated “best” (i.e., exhibited the greatest differences in mean values) between high and low total scorers in the pretest.

8. Steps 3 through 5 are then repeated in the main study.
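The item-analysis step (steps 6 and 7 above) can be sketched as follows. The simulated pretest responses and the number of items retained are arbitrary choices for illustration.

```python
import random

random.seed(1)
n_items, n_resp = 10, 40
# Simulated pretest responses, each item scored -2 .. +2
data = [[random.randint(-2, 2) for _ in range(n_items)] for _ in range(n_resp)]
totals = [sum(row) for row in data]

# Highest and lowest quartiles of total scorers
order = sorted(range(n_resp), key=lambda r: totals[r])
q = n_resp // 4
low, high = order[:q], order[-q:]

def item_mean(group, item):
    return sum(data[r][item] for r in group) / len(group)

# Discrimination index: high-group mean minus low-group mean per item;
# the items with the largest differences would be retained.
discrim = [item_mean(high, i) - item_mean(low, i) for i in range(n_items)]
best = sorted(range(n_items), key=lambda i: discrim[i], reverse=True)[:5]
```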



When analysis is completed, many researchers assume only ordinal properties regarding the placement of respondents along the continuum. Note also that two respondents could have the same total score even though their response patterns to individual items were quite different; the single (summated) score ignores just which items were agreed with and which were not. Moreover, the total score is sensitive to how the respondent reacts to the descriptive intensity scale.

Often, a researcher will reverse the polarity of some items in the set (i.e., word items negatively) as a way to overcome acquiescence bias (being overly agreeable). Having positively and negatively worded statements forces respondents with strong positive or negative attitudes to read carefully and use both ends of the scale. When item polarity is reversed, the scoring must be adjusted accordingly: a “strongly agree” response to a positive statement and a “strongly disagree” response to a negative statement should be scored the same, and so forth.

Another approach to wording the summated scale adapts the statements into a set of nondirectional questions, thereby alleviating the problems associated with mixed-wording scales (Wong, Rindfleisch, & Burroughs, 2003). As an illustration, a nondirectional format for one item would be:

“How much pleasure do you get from traveling? [Very little…A great deal]”

In contrast, the normal Likert format for this item is:

“Traveling gives me a lot of pleasure [strongly agree, agree, neither agree nor disagree, disagree, strongly disagree]”

Some final comments are in order. When using this format, Likert (1967) stated that a key criterion for statement preparation and selection should be that all statements be expressions of desired behavior and not statements of fact. Because two persons with decidedly different attitudes may agree on a fact, it is recognized that direction is the only meaningful measure obtained when using statements of fact.

A second concern is that the traditional presentation of a Likert scale is one-stage, with both intensity and direction combined. As stated earlier, this may lead to reluctance on the part of respondents to give extreme scores or use the extreme position on an individual scale item (central tendency error). To compensate for this situation, the longer two-stage format, in which direction and intensity are evaluated separately, can be used.


The Q-Sort Technique

The Q-sort technique has aspects in common with the summated scale. Very simply, the respondent’s task is to sort a number of statements (on individual cards or via a pick-and-group Qualtrics question) into a predetermined number of categories (usually 11), with a specified number of statements to be placed in each category.

In illustrating the Q-sort technique, assume that four respondents evaluate the test items dealing with travel and vacations. For purposes of illustration, only three groups will be used. The respondents are asked to sort the items into:


Suppose that the responses toward seven items by the four respondents, A, B, C, and D, result in the following scale values:


As can be noted, the respondent pairs A & B and C & D seem to be the “most alike” of the six distinct pairs that could be considered. We could, of course, correlate each respondent’s scores with every other respondent’s and, as in semantic differential applications, then conduct factor or cluster analyses (see Chapter 14) to group the respondents or items. Typically, these additional steps are undertaken in Q-sort studies.
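A simple way to see which respondents sort most alike is to count matching pile assignments. The sorts below are invented to mimic the pattern described in the text (A & B alike, C & D alike):

```python
from itertools import combinations

# Hypothetical Q-sort pile assignments (3 = most agree, 1 = least)
# for seven items by respondents A-D; data are invented.
sorts = {
    "A": [3, 3, 2, 1, 1, 2, 3],
    "B": [3, 2, 2, 1, 1, 2, 3],
    "C": [1, 1, 2, 3, 3, 2, 1],
    "D": [1, 2, 2, 3, 3, 2, 1],
}

def matches(p, q):
    """Count items placed in the same pile by two respondents."""
    return sum(a == b for a, b in zip(p, q))

# Similarity for each of the six distinct respondent pairs
pairs = {(r, s): matches(sorts[r], sorts[s])
         for r, s in combinations(sorts, 2)}
best = max(pairs, key=pairs.get)
```

In a full study one would instead correlate the scores and feed the resulting matrix into a factor or cluster analysis.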


Multi-Item Scales

Each of the types of scales discussed in this chapter can be used either alone or as part of a multi-item scale used to measure some construct. A multi-item scale consists of a number of closely related individual rating scales whose responses are combined into a single index, composite score, or value (Peterson, 2000). Often the scores are summated to arrive at a total score. Multi-item scales are used when measuring complex psychological constructs that are not easily defined by just one rating scale or captured by just one question.

There are several major steps in constructing a multi-item scale. The first, and perhaps most critical, step is to clearly and precisely define the construct of interest. A scale cannot be developed until it is clear just what the scale is intended to measure. This is followed by design and evaluation of the scale. A pool of items is developed and then subjected to analysis to arrive at the initial scale. Along the way, a pilot study is conducted to further refine the scale and move toward the final version. Validation studies are conducted to arrive at the final scale. Of concern is construct validation, in which an assessment is made that the scale measures what it is supposed to measure. At the same time that validity data are collected, normative data can also be collected. Norms describe the distributional characteristics of a given population on the scale. Individual scores on the scale can then be interpreted in relation to the distribution of scores in the population (Spector, 1992, p. 9).

A good multi-item scale is both reliable and valid. Reliability is assessed by the scale’s stability (test-retest reliability) and internal consistency reliability (coefficient alpha). According to Spector (1992), there are several other characteristics of a good multi-item scale:

1. The items should be clear, well written, and contain a single idea.

2. The scale must be appropriate to the population of people who use it, such as having an appropriate reading level.

3. The items should be kept short and the language simple.

4. Consider possible biasing factors and sensitive items.


Table 10.8 gives an example of a multi-item scale developed to measure consumer ethnocentrism within a nation, the CETSCALE (Shimp & Sharma, 1987). The scale is formatted as a set of seven-point agreement items in the Likert format. A CETSCALE score for an individual respondent is obtained as the sum of the item ratings, and ranges from 17 to 119, with higher numbers indicating greater consumer ethnocentrism. A compilation of multi-item scales frequently used in consumer behavior and marketing research is provided by Bearden and Netemeyer (1999).

Table 10.8 Example of a Multi-Item Scale: Consumer Ethnocentrism (CETSCALE)

1. American people should always buy American-made products instead of imports.

2. Only those products that are unavailable in the United States should be imported.

3. Buy American-made products. Keep America working.

4. American products first, last and foremost.

5. Purchasing foreign-made products is un-American.

6. It is not right to purchase foreign products.

7. A real American should always buy American-made products.

8. We should purchase products in America instead of letting other countries get rich off us.

9. It is always best to purchase American products.

10. There should be very little trading or purchasing of goods from other countries unless out of necessity.

11. Americans should not buy foreign products, because this hurts American business and causes unemployment.

12. Curbs should be put on all imports.

13. It may cost me in the long run, but I prefer to support American products.

14. Foreigners should not be allowed to put their products on our markets.

15. Foreign products should be taxed heavily to reduce their entry into the United States.

16. We should buy from foreign countries only those products that we cannot obtain within our own country.

17. American consumers who purchase products made in other countries are responsible for putting their fellow Americans out of work.

Note: Items composing the 10-item reduced version are items 2, 4 through 8, 11, 13, 16, and 17.
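Scoring the CETSCALE is a simple sum of the 17 seven-point ratings; a minimal sketch, with the reduced-version item numbers taken from the note above:

```python
# CETSCALE scoring sketch: each item rated 1-7 (agreement), total is
# the sum across items, so the full scale ranges from 17 to 119.
FULL = list(range(1, 18))                      # items 1-17
REDUCED = [2, 4, 5, 6, 7, 8, 11, 13, 16, 17]   # 10-item version (see note)

def cetscale(ratings, items=FULL):
    """ratings: dict mapping item number -> 7-point agreement rating."""
    return sum(ratings[i] for i in items)

all_ones = {i: 1 for i in FULL}
all_sevens = {i: 7 for i in FULL}
print(cetscale(all_ones), cetscale(all_sevens))  # 17 119
```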

Predictions from attitude scales, preference ratings, and the like still need to be transformed into measures (sales, market share) of more direct value to the marketer. We still do not know, in many cases, how to effectively translate verbalized product ratings, attitudes about corporations, and so on into the behavioral and financial measures required to evaluate the effectiveness of alternative marketing actions.

The Art of Writing Good Questions

Strength of Question Wording

The wording of questions is a critical consideration when obtaining information from respondents. Consider that the following question, in versions differing only in the use of the words “should”, “could”, and “might”, was shown to three matched samples of respondents (Payne, 1951, pp. 8–9).

Do you think anything should be done to make it easier for people to pay doctor or hospital bills? (82 percent replied “Yes”.)

For the sample shown the version with the word “could”, 77 percent replied “Yes”, and with “might”, 63 percent replied “Yes”. These three words are sometimes used as synonyms, and yet at the extreme the responses are 19 percentage points apart.

As another example, Rasinski (1989) posed a question in which the label for the topic issue was changed:

Are we spending too much, too little, or about the right amount on welfare?

In this case, 23.1 percent of respondents replied “too little”, but when the label was changed to “assistance to the poor”, 62.8 percent replied “too little”. Questions portraying a more descriptive and positive position may show a large difference in the evaluation score.

Reducing Question Ambiguity

Reducing ambiguity and bias is critical in both the respondent’s understanding and proper consideration of the question and in the researcher’s understanding of the answer’s meaning. In this section we discuss issues of question structure and form that can greatly influence and improve the quality of your questionnaire.

The Qualtrics “Survey University” and survey and question libraries provide many suggestions and helpful examples for writing unambiguous questions. Writing questions is an art, which, like all arts, requires a great amount of work, practice, and help from others. In Exhibit 10.4 we provide an overview of the common pitfalls we often see in “bad questionnaires” that lead to various forms of ambiguity.


Exhibit 10.4 Key Considerations for Reducing Question Ambiguity

1. Strength of question wording.

Question wording can shift results markedly. As the Payne (1951) example above shows, substituting “could” or “might” for “should” in the same question lowered the “Yes” responses from 82 percent to 77 percent and 63 percent, respectively. Questions portraying a more descriptive and positive position may show a large difference in the evaluation score.


2. Avoid loaded or leading words or questions.

Slight wording changes can produce great differences in results. “Could,” “should,” and “might” all sound almost the same, but may produce a 20 percent difference in agreement with a question (The Supreme Court could.. should.. might.. have forced the breakup of Microsoft Corporation). Strong words that represent control or action, such as “prohibit,” produce similar results (Do you believe that Congress should prohibit insurance companies from raising rates?). Sometimes the wording is simply biased: You wouldn’t want to go to Rudolpho’s Restaurant for the company’s annual party, would you?

3. Framing effects.

Information framing effects reflect the difference in response to objectively equivalent information depending upon the manner in which the information is labeled or framed. Levin, Schneider, and Gaeth (1998) and Levin et al. (2001) identify three distinct types of framing effects:

- Attribute framing effects occur when evaluations of an object or product are more favorable when a key attribute is framed in positive rather than negative terms.

- Goal framing effects occur when a persuasive message has different appeal depending on whether it stresses the positive consequences of performing an act to achieve a particular goal or the negative consequences of not performing the act.

- Risky choice framing effects occur when willingness to take a risk depends upon whether potential outcomes are positively framed (in terms of success rate) or negatively framed (in terms of failure rate).


 

Which type of potential framing effect should be of concern to the research designer depends upon the nature of the information being sought in a questionnaire. At the simplest level, if intended purchase behavior for ground beef were being sought, the question could be framed in terms of “80 percent lean” or “20 percent fat.” This is an example of attribute framing. It should be obvious that this is potentially a pervasive effect in question design, and is something that needs to be addressed whenever it arises. More detailed discussion of these effects is given by Hogarth (1982).


4. Misplaced questions.

Questions placed out of order or out of context should be avoided. In general, a funnel approach is advised: broad and general questions at the beginning of the questionnaire as a warm-up (What kind of restaurants do you most often go to?), then more specific questions, followed by general, easy-to-answer questions (like demographics) at the end of the questionnaire.

5. Mutually non-exclusive response categories.

Multiple-choice response categories should be mutually exclusive so that clear choices can be made. Non-exclusive answers frustrate the respondent and make interpretation difficult at best.


6. Nonspecific questions.

Do you like orange juice? This is very unclear... do I like what? The taste, texture, nutritional content, vitamin C content, the current price, concentrate, fresh squeezed? Be specific about what you want to know. Do you watch TV regularly? (What is “regularly”?)


7. Confusing or unfamiliar words.

Asking about caloric content, acrylamide, phytosterols, and other industry-specific jargon and acronyms is confusing. Make sure your audience understands your language level and terminology and, above all, what you are asking.

8. Non-directed questions.

Non-directed questions give respondents excessive latitude. What suggestions do you have for improving tomato juice? The question is about taste, but the respondent may offer suggestions about texture, the type of can or bottle, mixing juices, or something related to use as a mixer or in recipes.


9. Forcing answers.

Respondents may not want, or may not be able, to provide the information requested. Privacy is an important issue to most people. Questions about income, occupation, finances, family life, personal hygiene, and beliefs (personal, political, religious) can be too intrusive and may be rejected by the respondent.


10. Non-exhaustive listings.

Do you have all of the options covered? If you are unsure, conduct a pretest using the “Other (please specify) __________” option. Then revise the question, making sure that you cover at least 90% of the respondent answers.


11. Unbalanced listings.

Unbalanced scales may be appropriate in some situations and biased in others. When measuring alcohol consumption patterns, one study used a quantity scale that placed the heavy drinker in the middle of the scale, with the polar ends reflecting no consumption and an impossibly large amount to consume. In contrast, we expect all hospitals to offer good care, so we may use a scale of excellent, very good, good, fair; we do not expect poor care.

12. Double-barreled questions.

What is the fastest and most convenient Internet service for you? The fastest is certainly not always the most convenient. A double-barreled question should be split into two questions.


13. Independent answers.

Make sure answer choices are independent. For example, consider the question “Do you think of basketball players as independent agents or as employees of their team?” Some believe that, yes, they are both.


14. Long questions.

Multiple-choice questions are the longest and most complex; free-text answers are the shortest and easiest to answer. When you increase the length of questions and surveys, you decrease the chance of receiving a completed response.


15. Questions on future intentions.

Yogi Berra, the famous New York Yankees baseball player, once said that making predictions is difficult, especially when they are about the future. Predictions are rarely accurate more than a few weeks, or in some cases months, ahead.


Validity and Reliability of Measurement


The content of a measurement instrument includes the subject, theme, and topics that relate to the characteristics being measured. However, the measuring instrument does not include all of the possible items that could have been included. When measuring complex psychological constructs such as perceptions, preferences, and motivations, hard questions must be asked to identify the items most relevant to solving the research problem:

1. Do the scales really measure what we are trying to measure?

2. Do subjects’ responses remain stable over time?

3. If we have a variety of scaling procedures, are respondents consistent in their scoring over those scales that purport to be measuring the same thing?


By answering these questions, we establish the validity and reliability of scaling techniques. Note that our focus is only on the general concepts and measures of validity and reliability that are used in cross-sectional studies. There is little documented research on issues of measure reliability and validity for time-series analysis.


Validity

Validity simply means that we are measuring what we believe we are measuring. The data must be unbiased and relevant to the characteristic being measured. The validity of a measuring instrument reflects the absence of systematic error. Systematic error may arise from the instrument itself, the user of the instrument, the subject, or the environment in which the scaling procedure is being administered. Since in practice we rarely know true scores, we usually have to judge a scaling procedure’s validity by its relationship to other relevant standards.

The validity of a measuring instrument hinges on the availability of an external criterion that is thought to be correct. Unfortunately the availability of such outside criteria is often low. What makes the problem even more difficult is that the researcher often is not interested in the scales themselves, but the underlying theoretical construct that the scale purports to measure. It is one thing to define IQ as a score on a set of tests; it is quite another to infer from test results that a certain construct, such as intelligence, or a dimension of intelligence is being measured.

In testing the validity of a scale, the researcher must be aware that many forms of validity exist, including (1) content validity, (2) criterion validity, and (3) construct validity.


Content Validation

Content Validity concerns how well the scale or instrument represents the universe of the property or characteristic being measured. It is essentially judgmental and is ordinarily assessed by the personal judgments of experts in the field. That is, several content experts may be asked to judge whether the items being used in the instrument are representative of the field being investigated. Closely related to this approach for assessing content validation is a method involving known groups. For instance, a scale purported to measure attitudes toward a brand could be tested by administering it to a group of regular buyers of the product (which presupposes a favorable attitude) and comparing their scores with those from a group of former buyers or other non-buyers (who presumably have a negative attitude). If the scale does not discriminate between the two groups, then its validity with respect to measuring brand attitude is highly questionable. Caution must be exercised in using this method in that other group differences besides the known behavior might exist and account for the differences in measurement.

Face Validity is a preliminary or exploratory form of content validity. It is based on a cursory review of items by non-experts such as one’s wife, mother, tennis partner, or other conveniently accessible persons. A simple approach is to show the measurement instrument to a convenient group of untrained people and ask whether or not the items look okay (Litwin, 2003).

Logical Validation refers simply to an intuitive, or common-sense, evaluation. This type of validation is derived from the careful definition of the continuum of a scale and the selection of items to be scaled. Thus, in an extreme case, the investigator reasons that everything that is included is done so because it is obvious that it should be that way. Because things often do not turn out to be as obvious as believed, it is wise for the marketing researcher not to rely on logical validation alone.

An example of research lacking content validity is the Coca-Cola Company’s introduction many years ago of New Coke. Since the product represented a major change in taste, thousands of consumers were asked to taste New Coke. Overwhelmingly, people said they liked the new flavor. With such a favorable reaction, why did the decision to introduce the product turn out to be a mistake? Executives of the company acknowledged that the consumer survey omitted a crucial question: people were asked if they liked the new flavor, but they were not asked if they were willing to give up the old Coke. In short, they were not asked if they would buy the new product in place of the old one.


Criterion Validation

Criterion Validity, also known as pragmatic validity, has two basic dimensions, predictive validity and concurrent validity. Both ask whether the instrument works: can better decisions be made with it than without it?

The New Coke example also illustrates a case of poor predictive validity. The measures of liking, and so on, were not very good predictors of purchase, which was the real measure of managerial interest.

In concurrent validity, a secondary criterion, such as another scale, is used to compare results. Concurrent validity can be assessed by correlating the set of scaling results with some other set, developed from another instrument administered at the same time. Often product researchers will ask a question like “Overall, how much do you prefer Brand A soft drink?”, and then follow with another question such as, “Given the following four brands, indicate the percentage of your total soft drink purchases that you would make for each brand.”

Alternatively, the correlation may be carried out with the results of the same question asked again later in the survey or on another testing occasion.
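Assessing concurrent validity amounts to correlating the two same-occasion measures. A minimal sketch follows; the preference ratings and stated purchase shares are invented data for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical responses from six subjects to the two questions above
preference = [7, 5, 6, 3, 2, 4]     # 1-7 overall preference, Brand A
share = [80, 50, 60, 20, 10, 40]    # stated % of purchases, Brand A
r = pearson(preference, share)
```

A high correlation between the two measures would support concurrent validity of the preference rating.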


Construct Validation

In construct validation the researcher is interested both in the question, “Does it work?” (i.e., predict), and in developing criteria that permit answering theoretical questions of why it works and what deductions can be made concerning the theory underlying the instrument. Construct validity involves three subcases: convergent, discriminant, and nomological validity.

Convergent Validity: The correspondence in results between attempts to measure the same construct by two or more independent methods. These methods need not all be scaling techniques.

Discriminant Validity: Refers to the property that scaling procedures do differ when they are supposed to—that is, when they measure different characteristics of stimuli and/or subjects. In practice, more than one instrument and more than one subject characteristic should be used in establishing convergent-discriminant validity. Discriminant validity concerns the extent to which a measure is unique (and not simply a reflection of other variables), and as such it provides the primary test for the presence of method variance.

Nomological Validity: Concerns “understanding” a concept (or construct). In nomological validation the researcher attempts to relate measurements to a theoretical model that leads to further deductions, interpretations, and tests, gradually building toward a nomological net in which several constructs are systematically interrelated.

Ideally, the marketing researcher would like to attain construct validity, thus achieving not only the ability to make predictive statements but understanding as well. Specifically, more emphasis should be placed on the theories, the processes used to develop the measures, and the judgments of content validity.


Reliability

Reliability is concerned with the consistency of test results over groups of individuals or over the same individual at different times. A scale may be reliable but not valid. Reliability, however, establishes an upper bound on validity. An unreliable scale cannot be a valid one. Reuman (1982, p. 1099) states that “according to classical test theory, highly reliable measures are necessary, but not sufficient, for demonstrating high construct validity or high criterion validity.”

The achievement of scale reliability is, of course, dependent on how consistent the characteristic being measured is from individual to individual (homogeneity over individuals) and how stable the characteristic remains over time. Just how reliable a scaling procedure turns out to be will depend on the dispersion of the characteristic in the population, the length of the testing procedure, and its internal consistency. Churchill and Peter (1984) concluded that rating scale estimates were largely determined by measuring characteristics such as number of items in a scale, type of scale, and number of scale points. They further concluded that sampling characteristics and measurement development processes had little impact.

In general, the reliability of a scale (or measurement instrument) may be assessed by one of three methods: test-retest, alternative forms, or internal consistency. The basics of reliability in a marketing context are reviewed by Peter (1979).


Test-Retest

The test-retest method examines the stability of response over repeated applications of the instrument. Do we achieve consistent results, assuming that the relevant characteristics of the subjects are stable over trials? One potential problem, of course, is that the first measurement may have an effect on the second one. Such effects can be reduced when there is a sufficient time interval between measurements. If at all possible, the researcher should allow a minimum of two weeks to elapse between measurements. Reliability may be estimated by any appropriate statistical technique for examining differences between measures.


Alternative Forms

The alternative forms method attempts to overcome the shortcomings of the test-retest method by successively administering equivalent forms of the measure to the same sample. Equivalent forms can be thought of as instruments built such that the same types and structures of questions are included on each form, but where the specific questions differ. The forms of the measurement device may be given one after the other or after a specified time interval, depending upon the investigator’s interest in stability over time. Reliability is estimated by correlating the results of the two equivalent forms.


Internal Consistency

Internal consistency refers to estimates of reliability within single testing occasions. In a sense it is a modification of the alternative form approach, but differs in that alternatives are formed by grouping variables. The basic form of this method is split-half reliability, in which items are divided into equivalent groups (say, odd- versus even-numbered questions, or even a random split) and the item responses are correlated. In practice, any split can be made.
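The odd-versus-even split can be sketched in a few lines: sum the odd- and even-numbered items for each respondent, correlate the two half-scores, and step the half-test correlation up to full length with the Spearman-Brown formula, 2r/(1 + r). The response matrix below is hypothetical, invented for illustration.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one row per respondent, one column per item.
    Splits items into odd/even halves, correlates the half-scores,
    and applies the Spearman-Brown step-up to full length."""
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r = pearson_r(odd, even)
    return 2 * r / (1 + r)

# Hypothetical responses: five respondents, six 5-point items
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4, 4],
    [3, 3, 2, 3, 3, 3],
    [1, 2, 1, 1, 2, 2],
]
print(f"split-half reliability: {split_half_reliability(responses):.2f}")
```

Choosing a different split (say, a random one) would generally yield a somewhat different estimate, which is precisely the weakness coefficient alpha addresses.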

A potential problem arises for split-half in that results may vary depending on how the items are split in half. A way of overcoming this is to use coefficient alpha, known also as Cronbach’s alpha, which is a type of mean reliability coefficient for all possible ways of splitting the items in half (Cronbach, 1951). Whenever possible, alpha should be used as a measure of the internal consistency of multi-item scales. Alpha is perhaps the most widely used measure of internal consistency for multiple-item measures within marketing research. One caution, however, is that there should be a sufficient number of items in the measure so that alpha becomes meaningful. Alpha has been used for as few as two items, and this essentially amounts to a simple correlation between the two. Although there is no generally acceptable heuristic covering the number of items, common sense would indicate that the minimum number of items should be four or perhaps even six. What is clear, however, is that alpha is a function of the number of items in a scale (i.e., the more items, the greater alpha will tend to be), and also a function of the intercorrelation of the items themselves (Cortina, 1993; Voss, Stem, & Fotopoulos, 2000). Consequently, when interpreting an obtained alpha, the number of items must always be kept in mind.
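Coefficient alpha itself is straightforward to compute from the item variances and the variance of the total score: alpha = k/(k − 1) × (1 − Σ item variances / total-score variance), where k is the number of items. The sketch below uses hypothetical ratings for a four-item scale, matching the minimum item count suggested above.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for item_scores: one row per respondent,
    one column per scale item."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # one tuple per item
    item_var_sum = sum(pvariance(col) for col in columns)
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical 5-point ratings: six respondents, four items
ratings = [
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
]
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```

Because these fabricated items are highly intercorrelated, the resulting alpha is high; adding further items of similar quality would tend to push it higher still, which is why the item count must be kept in mind when interpreting the coefficient.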

The usual application of coefficient alpha is to calculate it using a statistical analysis package, report it, and assess whether the value obtained exceeds some rule-of-thumb minimum value, typically 0.70. There now exist methods to make inferential tests about the size of alpha, and to attach confidence intervals to the measure (Iacobucci & Duhachek, 2003).

In many projects, measurements or evaluations are made by more than a single evaluator. Sometimes this is done when coding answers to open-ended questions. In these situations, the researcher is interested in the reliability of these evaluations. This is known as interrater or interobserver reliability. The most common measure used is a correlation (Litwin, 2003).


A Concluding Comment

Although it is not our objective to pursue in detail the methods by which reliability or validity can be tested, we hope to provide an appreciation of the difficulties encountered in designing and analyzing psychological measures. One question that has not been answered is, “What is a satisfactory level of reliability, or what minimum level is acceptable?” There is no simple definitive answer to this question. Much depends on the investigator’s or decision maker’s primary purpose in measurement and on the approach used to estimate reliability. In trying to arrive at what constitutes satisfactory reliability, the investigator must at all times remember that reliability can affect certain qualities of a study, including (1) validity, (2) the ability to show relationships between variables, and (3) the ability to make precise distinctions among individuals and groups.


Summary

This chapter focused on general concepts of measurement. We discussed the role of definitions and defined concepts, constructs, variables, operational definitions, and propositions. We then turned to measurement and examined what it is and how measurement relates to development of scales. Also discussed, but rather briefly, were alternative sources that cause variations within a set of measurements derived from a single instrument. This was followed by a description of different types of scales that are commonly used in marketing research. Advanced scaling techniques for scaling stimuli and respondents were also discussed. We concluded with a brief overview of measurement validity and reliability, and the various types of each that are of concern to an investigator.

References

Albaum, G., Best, R., & Hawkins, D. (1981). Continuous vs. discrete semantic differential scales. Psychological Reports, 49, 83–86.

Bearden, W. O., & Netemeyer, R. G. (1999). Handbook of Marketing Scales (2nd ed.). Thousand Oaks, CA: Sage.

Churchill, G. A., Jr., & Peter, J. P. (1984, November). Research design effects on the reliability of rating scales: A meta-analysis. Journal of Marketing Research, 21, 360–375.

Coombs, C. H. (1964). A Theory of Data. New York: Wiley.

Cortina, J. (1993). What is coefficient alpha? An examination of theory and applications. Journal of
Applied Psychology, 78(1), 98–104.

Cronbach, L. J. (1951, September). Coefficient alpha and the internal structure of tests. Psychometrika,
16, 297–334.

Davis, D. W. (1997). Nonrandom measurement error and race of interviewer effects among African-
Americans. Public Opinion Quarterly, 61, 183–207.

Golden, L., Brockett, P., Albaum, G., & Zatarain, J. (1992). The golden numerical comparative scale format for economical multi-object/multiattribute comparison questionnaires. Journal of Official Statistics, 8 (1), 77–86.

Guttman, L. (1985). Measuring the true-state of opinion. In R. Ferber & H. Wales (Eds.), Motivation and market behavior. Homewood, IL: Richard D. Irwin, pp. 393–415.

Haley, R., & Case, P. (1979, Fall). Testing thirteen attitude scales for agreement and brand determination. Journal of Marketing, 43, 20–32.


Hogarth, R. M. (Ed.) (1982). Question framing and response consistency. San Francisco: Jossey-Bass.

Hubbard, R., Little, E. L., & Allen, S. J. (1989). Are responses measured with graphic rating scales
subject to perceptual distortion? Psychological Reports, 69, 1203–1207.

Iacobucci, D., & Duhachek, A. (2003). Applying confidence intervals to coefficient alpha. Unpublished
working paper, Kellogg School of Management, Northwestern University, Evanston, IL.

Kerlinger, F. (1973). Foundation of behavioral research (2nd ed.). New York: Holt, Rinehart & Winston.

Levin, I. P., Gaeth, G. J., Evangelista, F., Albaum, G., & Schreiber, J. (2001). How positive and negative
frames influence the decisions of persons in the United States and Australia. Asia Pacific Journal of
Marketing and Logistics, 13(2), 64–71.

Levin, I. P., Schneider, S. L., & Gaeth, G. J. (1998, November). All frames are not created equal: A
typology of framing effects. Organizational Behavior and Human Decision Processes, 76(2), 149–188.

Likert, R. (1967). The method of constructing an attitude scale. In M. Fishbein (Ed.), Readings in attitude theory and measurement. New York: Wiley, pp. 90–95.

Litwin, M. S. (2003). How to assess and interpret survey psychometrics (2nd ed.). Thousand Oaks, CA:
Sage.

McCarty, J. A., & Shrum, L. J. (2000). The measurement of personal values in survey research: A test of
alternative rating procedures. Public Opinion Quarterly, 64, 271–298.

Morgan, M. (2003, March 31). Be careful with survey data. Marketing News, 37, 26.

Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana:
University of Illinois Press.

Peter, J. P., & Churchill, G. A., Jr. (1986, February). Relationships among research design choices
and psychometric properties of rating scales: A meta-analysis. Journal of Marketing Research, 23, 1–10.

Peterson, R. A. (2000). Creating effective questionnaires. Thousand Oaks, CA: Sage.

Payne, S. L. (1951). The art of asking questions. Princeton, NJ: Princeton University Press.

Rasinski, K. A. (1989). The effect of question wording on public support for government spending.
Public Opinion Quarterly, 53, 388–394.

Reuman, D. A. (1982). Ipsative behavioral variability and the quality of thematic apperceptive measurement of the achievement motive. Journal of Personality and Social Psychology, 43(5), 1098–
1110.

Semon, T. T. (1999, August 30). Scale ratings always betrayed by arithmetic. Marketing News, 33, 7.

Semon, T. T. (2001, October 8). Symmetry shouldn’t be goal for scales. Marketing News, 35, 9.

Shimp, T. A., & Sharma, S. (1987, August). Consumer ethnocentrism: Construction and validation of the CETSCALE. Journal of Marketing Research, 24, 280–289.

Spector, P. E. (1992). Summated rating scale construction: An introduction. Newbury Park, CA: Sage.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.

Sudman, S., & Bradburn, N. (1983). Asking questions. San Francisco: Jossey-Bass.

Thurstone, L. L. (1959). The measurement of values. Chicago: University of Chicago Press.

Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.

Voss, K. E., Stem, D. E., Jr., & Fotopoulos, S. (2000). A comment on the relationship between coefficient alpha and scale characteristics. Marketing Letters, 11(2), 177–191.

Whitlark, D., & Smith, S. (2004, December). Pick and choose. Marketing Research, 16(4), 8–14.

Wong, N., Rindfleisch, A., & Burroughs, J. E. (2003, June). Do reverse-worded items confound measures in cross-cultural research? The case of the material values scale. Journal of Consumer Research, 30, 72–91.