BOOK REVIEW by Kathy Emery
In spite of claims that IQ tests such as the Stanford-Binet and Wechsler batteries are objective, Thorndike and Lohman's history of the development of such tests reveals them to be inherently culturally bound. Tests whose validity is based on teachers' assessments are necessarily bound by the cultural values of those teachers. By the 1890s, most teachers were white, middle-class women. The standardized tests being invented at that time were designed so that their results correlated with the assessments that white, middle-class teachers made of their middle- and working-class students. Equally problematic is the fact that psychometricians (the people who make these tests) have never reached a consensus on the definition of intelligence itself at any point in the history of intelligence test development. Consequently, the validity of each succeeding generation of tests has been based on the original so-called “IQ” tests developed at the turn of the twentieth century. The original tests, and therefore their successive progeny (IQ tests eventually morphed into “achievement” tests), are hardly objective. So why do we accept that they are?
One of the greatest ironies in the history of standardized intelligence tests is the reliance upon teachers' estimates of intelligence in order to create a test that claims to be independent of those same estimates. Thorndike argues that
...progress toward elimination of subjective bias in schooling and the workplace and reward of talent without regard to racial social or ethnic background have been two of the major accomplishments of testing. (104)
He, consequently, warns that if
...well meaning courts and legislatures substitute their beliefs for the empirical evidence that has been collected over a period spanning more than twenty five years, the beneficial effects of tests in providing objective alternatives to the subjective judgments of teachers and employers will be lost and the possibility of bias in our social institutions will increase. [my emphasis] (105)
Yet it was these very teachers whose opinions were used as the criteria for validity throughout the development of ability tests. When the first mental test appeared in print (J. M. Cattell, 1890), one of Galton's postscripts to that paper called for test validation, arguing that such tests need to be "compared with an individual's estimate." (p. 3) Franz Boas (1891), Gilbert (1897) and Ebbinghaus (1890s) sought out teachers to provide such estimates. (pp. 9-11) Wissler, when challenging Cattell's tests, did so by asserting that the test results did not correlate substantially with the grades his Columbia student subjects received from their professors. (p. 10)
When Terman began his work to standardize the administration procedures as well as the scoring of the dozen or so versions of the Binet-Simon test in use at the time (1911), one of his three criteria for item placement and validity was that the results were consistent with teachers' judgments. Terman's work resulted in the Stanford-Binet test of 1916 -- the test used as the criterion for validity for all other tests during the following twenty years. (p. 38)
Thorndike attributes the widespread use of group testing after World War I to the success of the Army Alpha test during the war. For this test, army officers' opinions were substituted for teachers' opinions as a criterion for validity. Versions of the Binet and Army Alpha were used to set up tracking systems as the public school system continued the reorganization that had begun before the war.
During the 1920s and 30s, Binet's original preference for empiricism over theory began to be reversed as theoretical bases for test development emerged (p. 57). While the theoretical debate raged over how to do “factor analysis” and the definitions of intelligence became increasingly complex, Wechsler developed an intelligence test for adults (1939). His criteria for the validity of the chosen test items were other tests and the subjective ratings of teachers, army officers and businessmen. (p. 81) Thorndike argues in this 1990 version of the history of IQ tests that the Wechsler-Bellevue test "rounded out the collection of tests available to measure human intellectual abilities" and that "there have been relatively few important innovations in ability measurement introduced in the years since 1939." (p. 83)
Given the consistent use of teachers' opinions as a primary criterion for validity of the Binet and Wechsler tests, it seems odd to claim that such tests provide "objective alternatives to the subjective judgments of teachers and employers." If the tests' primary claim to predictive validity is that their results correlate strongly with academic success, one wonders how an objective test can predict performance in an admittedly subjective environment. No one seems willing to acknowledge the circular and tortuous reasoning behind the development of these tests: they rely on the subjective judgments of secondary teachers to build an assessment device that claims independence from those judgments, so that it can then claim to assess objectively a student’s ability to win the approval of equally subjective college professors. (And remember, these tests were used to validate the next generation of tests, and those tests validated the following generation, and so forth on down to the tests that are being given today.) Anastasi (1985) comes close to admitting that bias is inherent in the tests when she confesses that the tests measure only what many anthropologists would call a culturally bound definition of intelligence. The tests measure
"a kind of intelligent behavior that is both developed by formal schooling and required for progress within the academic system...The particular combination of cognitive skills and knowledge sampled by these tests plays a significant part in much of what goes on in modern, technologically advance societies." (p. 105)
The vague and reluctant admission of the assumptions upon which these tests are based suggests that the purpose of the tests is not to promote a meritocracy but to perpetuate the myth of one.
Thorndike quotes his grandfather, E. L. Thorndike, to suggest that it is impossible "to measure intelligence in a manner that is independent of the cultural context." (p. 101)
What abilities and tasks shall be treated as intellectual is essentially a matter of arbitrary assumption or choice at the outset, either directly, of the abilities or tasks themselves, or indirectly, of the consensus which provides the criterion. After the first choice is made, tasks not included in it, and even not known, may be found to correlate perfectly with the adopted total, and so be "intellectual"; but their intellectualness is tested by and depends on the first arbitrary choice. Had a different first choice been made, they might not be intellectual. [E. L. Thorndike, 1926, as quoted by his grandson] (p. 57)
If one defines the ability to manipulate the vocabulary, diction and syntax of Standard English as evidence of intelligence, then the mastery of the vocabulary, diction, syntax and logic of Non-Standard English is, by definition, evidence of stupidity (see my Ebonics paper for a full discussion of this point).
The changing definitions of intelligence offered by the psychometricians themselves further suggest the arbitrary nature of the definition. Binet's early attempts to define what it was that his tests were measuring seemed to equate intelligence with personality or character. Intelligence was a fundamental faculty involving judgment, good sense, practical sense, initiative and adaptation to circumstances (p. 15). Terman (1906), when evaluating Binet's 1904 test, agreed that intelligence was a general characteristic (p. 27). Spearman, in 1904, provided the statistical formula that supposedly calculated the general intelligence factor from a "hotchpotch" of different tests. But Terman asserted that, from 1904 through 1932, he had consistently believed Spearman's conclusions to be "absurd" (p. 28).
E. L. Thorndike contributed to the debate in 1926 by dividing intelligence into four components - altitude, width, area and speed. A summary score would be problematic, he argued, since the four factors were not of equal importance (p. 57); therefore, general intelligence could not be measured directly. (p. 58) Thomson (1916) joined E. L. Thorndike's critique of general intelligence by developing a correlation matrix with dice (entities presumably not having g, or "intelligence") that fit Spearman's tetrad equation. Thomson argued in 1920 that intelligence was made up of a great many discrete intelligences and that no test was able to encompass them all. (pp. 66-67)
Thurstone, having attacked Binet's mental age concept in 1926, complicated the debate in 1931 by introducing his theory of multiple factor analysis. After administering 56 measures (fifteen hours) to 240 college students and computing the results for six months, Thurstone concluded that there were twelve factors to intelligence (five of which, however, were not defined clearly enough to name). Since the twelve factors did not correspond to Spearman's analysis, Thurstone "believed that the multidimensional nature of intelligence as he had described it precluded the existence of g." (p. 72) Spearman responded by arguing that Thurstone had extracted too many factors. (p. 73) How intelligence was to be defined became a debate over the relationship among factors. (p. 75)
Alexander (1935) joined this debate by arguing that two factors, x and z, affected all test results. These factors constituted "personality dimensions" such as persistence, interest in the test tasks and desire to succeed. Wechsler agreed, believing that IQ tests "inevitably measure a number of other capacities which cannot be defined as either purely cognitive or intellective." Nevertheless, for Wechsler, these capacities are part of intelligence, which "is the aggregate or global capacity of the individual to act purposefully, to think rationally, and deal effectively with his environment." (p. 82)
Thorndike observes that in the 1940s and 50s "psychologists were factor analyzing everything in sight," with a concomitant explosion in the number of factors. The proliferation of factors threatened to make testing (in other words, measuring) impossible. Guilford (1959, 1967, 1982) identified 150 distinct abilities. (p. 96) Horn, in 1976, rejected Guilford's rotational theory as being equally able to confirm random hypotheses. Humphreys (1962) and Vernon (1961) attempted to create categories of measurement by establishing an "abilities hierarchy," which did not, however, solve the problem of how large (general) or small (specific) the number of clusters of factors should be when defining intelligence. Cattell decided that there should be two clusters - fluid and crystallized. (p. 97) Horn, in turn, revised Cattell's Gf-Gc theory by identifying ten second-order factors. (p. 113) Continued critiques and revisions of the definition of intelligence have finally fractured the concept, so that debate ranges not only over the nature and kinds of intelligences (Sternberg, Gardner) but over whether to use intelligence as a noun or an adjective. (p. 125)
Thorndike, when evaluating this expanding and increasingly complex analysis of intelligence, remarks that human intelligence "will not yield to ready explanation by the methods of cognitive science any more than it yielded to ready explanation by the method of factor analysis." (p. 130) One would think that if one cannot explain it, then measuring it is even more problematic. Thus, tests that claim to measure "intelligence" seem vulnerable to the charge of false advertising. And if these so-called intelligence tests do not measure intelligence, then their justification as a criterion for the unequal distribution of investment, resources and wealth suffers from an internal contradiction, a contradiction made moot if the real reason for the tests is to justify sorting people into a class system in this country.
Perhaps because of an awareness of such a contradiction, the developers of standardized "intelligence" tests have revealed a defensiveness throughout the history of ability testing. They have always been aware of the tentativeness of their own assertions as well as the ambiguity of their results. For this reason, testmakers from Binet (1905) to Kaufman (1993) have acknowledged that the quantitative data generated by the tests need to be supplemented by qualitative data.
After creating a test that correlated with teachers' assessments, Binet developed his theoretical justification. His definition of intelligence was so global, however, that he said it could only be determined by the sum total of the scores on a wide variety of tests supplemented by clinical observations (p. 16). E. L. Thorndike, in 1904, argued implicitly for some kind of assessment that would be able to capture the complexity of human behavior that is contextually bound:
...with human affairs not only do our measurements give varying results; the thing itself is not the same from time to time, and the individual things of a common group are not identical with each other...Even a very simple mental trait, say arithmetical ability or superstition or respect for the law, is, compared with physical things, exceedingly complex. (p. 26)
Whipple (1915) seemed to echo such a concern:
Now that interest is directed so much toward the question of 'types', it seems particularly necessary to caution against the fallacy of taking the result of a single test as a positive indication that s falls into this or that type...we can[not] regard any mental function as so clear cut, distinct and open to isolation...(p. 32)
Yerkes, when developing his point scale in 1915, acknowledged that the benefit of having a variety of tests in a battery for a specific age was to maintain the subjects' motivation and interest (p. 33). This is a rather indirect concession, but a concession nonetheless, that there are factors influencing test results that can only lend themselves to discovery through observation during the time of the test.
These concerns about the situational nature of test performance provoked a chorus of calls for standardized conditions under which tests should be taken. Such standardization was seen as minimizing what Kaufman in 1993 would call "influences" mediating the subject's test performance. In creating the 1916 version of the Binet test, Terman worked hard to standardize the administration as well as the scoring procedures (p. 38).
Yet the concerns continued in spite of the development of standardized administration of tests. The widespread use of the Army Alpha test without any clinical component provoked Yoakum and Yerkes to complain in 1920: "The ease with which the army group test can be given and scored makes it a dangerous method in the hands of the inexpert. It was not prepared for civilian use and is applicable only within certain limits." (p. 47) James Popham echoes this concern about the use of standardized tests today.
Thomson added to such concerns with his caveat that, since intelligence was made up of such a large number of abilities, any "test" merely assessed a small sample from this "universe of abilities." (p. 66) The development of factor analysis can be seen as a response to such concerns, an attempt to capture as many of these "abilities" as possible. Wechsler developed his performance scale in an attempt to capture those quantitatively elusive variables such as "persistence, interest in test tasks, desire to succeed and other personality dimensions" (p. 82).
As "intelligence" tests became increasingly cumbersome in their attempts to avoid becoming self-defeating, the appearance of standardized "achievement" tests was welcomed with palpable relief. These tests took a great deal of pressure off the IQ tests, pressure that came from continued criticism (from within the testing community!) that quantitative tests were not measuring what people believe to be intelligence. The 1986 revision of the Stanford-Binet even replaced the name "IQ" with "SAS" (Standard Age Score) in order to "avoid some of the over interpretation that has become attached" to scores that are labeled as IQ scores (p. 93). Such "over interpretation" existed because numbers, in and of themselves, can never provide the answers to the complex problems the tests were devised to solve.
Thorndike, as a representative of mainstream psychometric opinion, concedes that IQ tests are problematic. Nevertheless, he defends their use because they provide "useful information." His ultimate fallback position is that these tests are the best way the few have to make decisions about the many with any shred of credibility. In a society whose population is socialized to disown internal authority and rely on external authority for definitions of good/bad, worthy/unworthy, intelligent/stupid, powerful/powerless, these tests will continue to have greater credibility among the many than they do among the few. Better ways of assessing an individual's abilities (e.g. Denny Taylor's literacy biographies) will not be able to compete successfully with standardized tests as long as the locus of power in this society is centered in the hands of the few.
Those who support the use of "intelligence tests" and the modern "achievement tests" argue that the harm is done by those who interpret or use the test scores and not by the tests themselves. But such an argument does not take into consideration the concerns expressed by testmakers themselves that interpreting the scores is inherently problematic. This is so not only because a test, at best, measures only a slice of a person's "intelligence" or "knowledge" but also because there is serious disagreement over which slice of the brain is being revealed by a particular test. In addition, the guns-don't-kill-people-people-do defense ignores the logical contradiction in claiming that a test developed to correlate with middle-class values and knowledge is independent of those values. One wonders how complicated the tests have to become and how transparently inadequate the defense of such tests must be before they collapse under their own weight. But in an age when the federal government makes war under false pretenses and allows industry to plunder "Healthy Forests" and pollute "Clear Skies," it is really not surprising that standardized tests are being used to leave many children behind.