Using Verbal Reports in Studies of Language Assessment
Professor Alister Cumming
In this workshop participants will review, distinguish, and practice alternative ways of using verbal reports to produce data to analyze the performance of students, examinees, or raters in second or foreign language assessments. We will consider specific types of verbal report data (e.g., concurrent or retrospective reports, focus groups, interviews), the perspectives and limitations that each offers, how to orient or train people to produce them, and how to handle the resulting data for research purposes (e.g., segmenting into units of analysis, developing and refining coding categories, interpreting themes or issues, summarizing and verifying findings, making claims for validating or revising assessment instruments).
Using and Reflecting on Language Test Specifications
Professor Fred Davidson
The objective of this workshop is twofold: (1) to go through a canonical language test specification experience: creating and critiquing test specs, and (2) to allow opportunity to reflect on that process. The reflection is extremely important, for it allows participants to discuss the practicality of specification-driven testing. To achieve these goals, participants will work with a number of test specs in varying states of completion or refinement: from germinal to established and 'set-in-stone'.
A Teacher-Verification Study of Speaking and Writing Prototype Tasks for a New TOEFL
Professor Alister Cumming
I will describe a study undertaken--in collaboration with colleagues Leslie Grant (Central Michigan University), Patricia Mulcahy-Ernt (University of Bridgeport), and Donald Powers (Educational Testing Service) and in conjunction with other studies field-testing prototype tasks for a new TOEFL--to evaluate the content validity, perceived authenticity, and educational appropriateness of these prototype tasks. We interviewed 7 highly experienced instructors of English as a Second Language (ESL) at 3 universities, asking them to rate their students’ abilities in English and to review samples of their students’ performance to determine whether they thought 7 prototype speaking and writing tasks being field-tested for a new version of the TOEFL® test: (a) represented the domain of academic English required for studies at English-medium universities or colleges in North America, (b) elicited performance from their adult ESL students that corresponded to their usual performance in ESL classes and course assignments, and (c) realized the evidence claims on which the tasks had been designed. The instructors thought that most of their students’ performances on the prototype test tasks were equivalent to or better than their usual performance in classes. The instructors viewed positively the new prototype tasks that required students to write or to speak in reference to reading or listening source texts, but they observed certain problems with these novel tasks and suggested ways that their content and presentation might be improved for the formative development of these tasks.
Development of a Web-Based Japanese Listening Placement Exam
Tim Farnsworth & Yuko Haga (Fukuchi)
A quarter of those who take Japanese courses at UCLA’s Department of East Asian Languages and Cultures come with some Japanese knowledge, mainly from high school, and those students take a Japanese placement test so that they start at the most appropriate level. We will report on the ongoing development of the new, web-based version of this test. Our test development process is based on Bachman & Palmer’s (1996) theoretical framework, which aims to increase test usefulness. We have encountered challenges due to practical constraints, and we will discuss how we bridged the gap between the theoretical framework and the practical situation as an example of a theory-based test development process.
We administered the first paper-and-pencil pilot in February 2003 to the 100 students enrolled in second-quarter Japanese at UCLA. The purpose of the piloting was to examine the appropriateness of the test design and scoring method and to prepare for automated scoring. We did a descriptive analysis of the pilot data and found some unexpected patterns in the results. We will discuss revisions to the test tasks and design based on the information gained from the piloting.
This project revealed the importance of the collaboration between language testers and language programs and teachers. The more the two groups communicate, the better the quality and usefulness of the test.
Testing Accommodation and the Americans with Disabilities Act--An Update
Lisa Diane Mahrer
High-stakes testing – just the sound of it causes anxiety and fear. These feelings are typical of any student who is faced with a test that can dictate the future of their education. If a “normal” student feels anxious and fearful, how might the atypical student, the student with a physical or emotional disability, feel? The fear and anxiety experienced by a student with disabilities are caused by much more than the question of how they will perform. They are caused by not knowing whether they will be able to complete the test with any degree of success. Moreover, the student with disabilities may have reason to fear that the test itself will not be fair and equitable to them. One step toward alleviating these fears is the Americans with Disabilities Act (ADA). Enacted by the United States Congress in 1990, its purpose was to eliminate discrimination against people with disabilities and ensure that life’s daily opportunities were accessible to them. One key provision of the ADA is the requirement that “qualified” individuals with disabilities be assessed using “reasonable accommodations.” This requirement has implications for all aspects of the assessment process, including development, selection, administration and interpretation.
This presentation will first address the question of what constitutes a “qualified” individual with a disability. By studying the language of the ADA and relevant Federal court cases on this issue, we will develop an understanding of that term and its significance in the area of “reasonable accommodation” in testing. Once we have defined the “qualified” individual, the presentation will turn to the issue of “reasonable accommodation.” This study will use recent court cases, and articles by Geisinger and Carlson, S.E. Phillips, and M. Pitoniak and J. Royer. In addition, a study by The National Center of Education Outcomes, a report by the National Center for Education Statistics, and a publication by the Educational Testing Service, Princeton, will be examined. Using these resources, this presentation will define reasonable accommodations, develop a list of possible accommodations in the testing area, examine the difficulties in administering and interpreting tests with reasonable accommodation, and conclude with a brief look at the effect reasonable accommodations have had on testing results.
A review of the California Adult Student Assessment System
David Gorman
The California Adult Student Assessment System (CASAS) tests serve as the primary reading assessment tool in California for Adult Basic Education (ABE) and ESL students. The series of tests has been used for over 20 years to measure students’ ability to comprehend reading materials specifically related to life skills. Scores on the tests determine student entry, mastery and exit levels. Moreover, scores help determine federal and state funding for an array of adult programs, from community college and adult education centers to correctional facilities. Surprisingly, for a set of tests so widely used over such a long time, there are few reports from outside the CASAS organization that have reviewed or evaluated test reliability, validity and effectiveness.
This presentation will review the history behind the CASAS tests, report on statistically relevant information regarding implementation, and assess the system in terms of its strengths and weaknesses. As CASAS continues to be the assessment tool of choice in a number of US states, understanding its construction and use will help us understand its overall effectiveness.
Reverse Engineering in Language Test Development
Professor Fred Davidson
"Reverse Engineering" (RE) is the creation of test specifications (also known as blueprints) from existing test items. Four major types are defined and examples given: straight RE to replicate a test, historical RE to examine changes in a testing system over time, critical RE to improve a test, and "test deconstruction" (El Atia, 2002), which is RE that facilitates analysis of sociopolitical systems. My discussion of RE allows us to explore several important related topics: the nature and role of test specifications themselves, critical language testing and test change, historical precedents in testing practices, and certain philosophical tensions in the use of tests.
An evaluation of unidimensional IRT model assumptions using data from an English as a second language academic reading test
Viphavee Vongpumivitch
Second language reading has been argued to consist of many components. Yet very little second language research using unidimensional IRT models as a test analysis method has investigated the issue of reading test dimensionality. Given that most reading tests consist of reading passages with several questions attached, it is possible that second language reading test data will violate the IRT unidimensionality and local independence assumptions. This paper is an exploratory study of the extent to which data from a second language reading test satisfied two important assumptions of unidimensional IRT models, namely the unidimensionality and the local independence assumptions. The study used a data set consisting of item-level responses of 573 students who took the reading section of an ESL placement test. The two IRT assumptions were evaluated using different computer programs and the results were interpreted based on qualitative content analysis of the test items.
It was found that while the analysis of the tetrachoric correlation matrix showed that the reading test used in this study was essentially unidimensional, the local independence assumption was violated and fifteen item pairs were found to be related closely to each other due to factors other than the dominant one. Since the two IRT assumptions are quantitatively equal, it seems implausible that only one of the two was violated. Content analysis of the items showed that there may be other dimensions that are related to the levels of text processing required to answer the reading test items. Without a detailed qualitative analysis of the item content of the type undertaken here, substantive interpretations of the unidimensional IRT model assumptions would not have been possible. It is hoped that the combination of qualitative and quantitative analyses demonstrated in this paper will encourage more studies to do the same.
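As an illustration only, the idea of item pairs being "related closely to each other due to factors other than the dominant one" can be sketched with a simple conditional-covariance screen. The abstract does not name the computer programs that were used, so this is not the study's procedure: the data are simulated, the function names are hypothetical, and the covariance conditioned on the rest score (the score on all remaining items) is a crude screen that is somewhat positively biased for short tests.

```python
import math
import random
from collections import defaultdict


def simulate_responses(n_persons, n_items, testlet_pair, seed=0):
    """Simulate 0/1 responses from a Rasch-like model (illustrative assumption).

    The two items in `testlet_pair` share an extra person-specific nuisance
    factor, mimicking questions attached to the same reading passage.
    """
    rng = random.Random(seed)
    difficulties = [-1.0 + 2.0 * k / (n_items - 1) for k in range(n_items)]
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)      # dominant ability
        nuisance = rng.gauss(0.0, 1.5)   # secondary factor for the testlet
        row = []
        for k, b in enumerate(difficulties):
            logit = theta - b + (nuisance if k in testlet_pair else 0.0)
            p = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


def conditional_covariance(data, i, j):
    """Size-weighted mean covariance of items i and j within groups of
    persons who have the same score on the remaining items."""
    groups = defaultdict(list)
    for row in data:
        rest_score = sum(row) - row[i] - row[j]
        groups[rest_score].append((row[i], row[j]))
    total, weighted = 0, 0.0
    for pairs in groups.values():
        n = len(pairs)
        if n < 2:
            continue  # a single person gives no covariance information
        mean_i = sum(x for x, _ in pairs) / n
        mean_j = sum(y for _, y in pairs) / n
        cov = sum((x - mean_i) * (y - mean_j) for x, y in pairs) / n
        weighted += n * cov
        total += n
    return weighted / total if total else 0.0


data = simulate_responses(n_persons=2000, n_items=10, testlet_pair=(3, 4))
dependent = conditional_covariance(data, 3, 4)    # items sharing a testlet
independent = conditional_covariance(data, 0, 9)  # items with no shared factor
print(f"testlet pair: {dependent:.3f}, unrelated pair: {independent:.3f}")
```

Under these assumptions, the pair that shares the nuisance factor should show a clearly larger conditional covariance than the unrelated pair. A flag of this kind says nothing about why the items cluster, which is where a content analysis of the items comes in.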
Factorial Invariance of TOEFL for Different Language Groups: Confirmatory Factor Analysis Approach
Hoky Min
This study addressed the dimensionality of second language competence underlying the Test of English as a Foreign Language (TOEFL) and the construct validity of the TOEFL for two different language groups: 1500 cases from Indo-European groups (French, Portuguese, and Spanish) and 1500 cases from non-Indo-European groups (Chinese, Korean, and Japanese). Specifically, the number of factors (abilities) underlying the TOEFL was examined for the groups using confirmatory factor analysis (CFA), and the equivalence of the dimensionalities, factor patterns, and factor loadings was examined between the groups using multiple-group CFA. The findings of the analyses indicated that there are two factors (listening and non-listening) underlying performance on the TOEFL. Among the four postulated models (one-, two-, three-, and seven-factor models), the model with listening and non-listening factors was found to be the best fitting. The findings also indicated that the TOEFL seems to provide evidence of construct validity for the two language groups, because the two-factor model is the best fitting for both groups and because factor loadings are invariant across the groups. Therefore, it was concluded that the TOEFL seems to measure the same abilities to the same degree across the Indo-European and non-Indo-European groups. In addition, the findings will be discussed with respect to the nature of the two distinct language abilities as well as the implications for test fairness and bias of the TOEFL for test takers with different language backgrounds.