The modern theory of testing (English: Item Response Theory, IRT) is a set of methods for estimating the probability that examinees answer correctly items of varying difficulty. It is used to weed out poor (uninformative) questions from a test and to assign questions adequate scores based on their difficulty. The name Item Response Theory is translated into Russian in various ways. Yu. Neumann and V. Khlebnikov propose to call it the "theory of modeling and parameterization of pedagogical tests" (TMPPT) [1]; V. Avanesov, the "mathematical-statistical theory of estimating latent parameters of test items and the level of preparedness of examinees" [2].
In psychometrics, modern test theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments that measure abilities, attitudes, or other variables. It is a theory of testing based on the relationship between the quality of answers to individual questions and the overall level of the ability that the test measures. Several different statistical models are used to represent the characteristics of items and respondents [3]. Unlike simpler alternatives for creating scales and scoring questionnaires, modern test theory does not assume that every question is equally difficult. This distinguishes IRT from, for example, the assumption in Likert scaling that "all items are assumed to be replications of each other, or in other words: items are considered to be parallel instruments" [4]. By contrast, modern test theory treats the difficulty of each item (its item characteristic curve, ICC) as information to be incorporated in scaling items.
The theory is based on fitting related mathematical models to test data. Its name reflects the theory's focus on the item: IRT models the response of each examinee to each item of the test. IRT rests on the idea that the probability of a correct answer to an item is a mathematical function of person and item parameters.
History
A common source for the creation of IRT was the logistic function, known in biological science since 1844. Since then it has been widely used in biology to model the growth of plants and organisms. As a model of psychological and pedagogical measurement it began to be applied in the 1950s. The origins of IRT models lay in the desire to visualize the formal characteristics of test items, in attempts to overcome the numerous shortcomings of classical test theory, to improve measurement accuracy, and, finally, in the desire to optimize the testing procedure by adapting the test to the examinee's level of preparedness using a computer [2].
The original work on IRT as a theory appeared in the 1950s and 1960s. Its pioneers were the psychometrician Frederic M. Lord of the Educational Testing Service, the Danish mathematician Georg Rasch, and the Austrian sociologist Paul Lazarsfeld. Key figures who furthered the progress of IRT include Benjamin Drake Wright and David Andrich.
Among the first precursors of IRT were the studies of Alfred Binet and Theodore Simon [5], which reflected the authors' desire to see how, figuratively speaking, the tasks they gave to children of different ages "worked". Plotting age (in years) on the abscissa and the proportion of correct answers in each age group on the ordinate, the authors saw that the points obtained, after averaging within each group, resembled the curve later called the characteristic curve.
In 1936, M. W. Richardson conducted an extensive empirical study, testing 1,200 students on 803 items; depending on their test score, the students were divided into 12 groups of one hundred people each. Richardson was the first to pay attention to the different steepness of item curves and proposed treating the measure of steepness as an approximate estimate of an item's discriminating ability [6]. Richardson was also apparently the first to recognize the usefulness of averaged points for graphically presenting the formal characteristics of the items of a test under design [7].
In particular, the purpose of IRT is to provide a framework for analyzing how well assessments work and how well their individual items work. The most common application area of IRT is education, where psychometricians use it to develop and design exams, maintain banks of exam questions, and equate the difficulty of questions across successive versions of an exam [8].
Item Response Function (IRF)
The IRF gives the probability that a person with a given level of ability will answer the item correctly.
Three-parameter logistic model
For example, in the three-parameter logistic model (3PL), the probability of a correct answer to a dichotomous item i, typically a multiple-choice question, is:

$$p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}$$

where θ is the person's ability; abilities are modeled as a sample from a normal distribution in order to estimate the item parameters. After the item parameters have been estimated, the ability of each person is estimated for reporting purposes. a_i, b_i, and c_i are the item parameters; they determine the shape of the item response function. Figure 1 shows an ideal 3PL ICC.
The item parameters can be interpreted as changing the shape of the standard logistic function. The parameters that characterize a test item are:
- b is the difficulty. This value tells us how easy or hard the question is; it determines the position of the function along the ability axis.
- a is the discrimination. This value tells us how effectively the question distinguishes students by their level of knowledge; it determines the maximum slope of the curve.
- c is the guessing parameter. This value tells us how likely it is that an examinee gets the correct answer by guessing; it sets the asymptotic minimum (lower asymptote) of the curve.
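As an illustration of how the three parameters act, here is a minimal sketch in Python (using only NumPy; the function name irf_3pl is our own choice for this example, not a standard library API) that evaluates the 3PL item response function defined above:

```python
import numpy as np

def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct answer under the 3PL model.

    theta -- examinee ability
    a     -- discrimination (controls the maximum slope)
    b     -- difficulty (position of the curve on the ability axis)
    c     -- guessing (lower asymptote of the curve)
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# An average examinee (theta = 0) facing an easy, a medium, and a hard item:
for b in (-2.0, 0.0, 2.0):
    print(f"b = {b:+.0f}: P(correct) = {irf_3pl(0.0, b=b):.3f}")

# With c = 0.25, even a very weak examinee keeps P(correct) >= 0.25:
print(irf_3pl(-4.0, b=0.0, c=0.25))
```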
IRT Models
IRT models can be divided into two families: unidimensional and multidimensional. Unidimensional models require a single trait (ability) dimension θ. Multidimensional IRT models assume that responses arise from multiple traits. However, because of their significantly greater complexity, most IRT research and applications use unidimensional models.
IRT models can also be classified by the number of scored response categories. A typical multiple-choice item is dichotomous: even though there may be four or five options, it is still scored only as correct/incorrect. Another class of models applies to polytomous outcomes, in which each answer has a different score value [9]. A common example is Likert-type items, e.g. "Rate on a scale from 1 to 5".
The number of parameters included in the analytic definition of the function is the basis for subdividing the family of logistic functions into classes.
Among the logistic functions, the following are distinguished [10]:
1) The one-parameter model of G. Rasch:

$$P_i(\theta) = \frac{1}{1 + e^{-1.7(\theta - \beta_i)}} \qquad\text{and}\qquad P_j(\beta) = \frac{1}{1 + e^{-1.7(\theta_j - \beta)}}$$

where θ and β are the independent variables of the first and second functions, respectively. The factor 1.7 is used for compatibility of the G. Rasch model with the A. Ferguson model, in which the probability of a correct answer to an item is expressed by the normal distribution integral; this makes it possible to use the well-studied standardized normal distribution integral instead of logistic curves.
The Rasch model is called the "one-parameter logistic latent trait model" (1PL), and the A. Ferguson model the "one-parameter normal ogive model" (1PN). Since the Rasch model describes the probability of the examinee's success as a function of a single item parameter (the difficulty β), it is sometimes called the one-parameter IRT model.
The interaction of the two sets of parameters, person abilities and item difficulties, produces data with the property of conjoint additivity. Correct use of the Rasch model makes it possible to separate the estimates of the examinees from the estimates of item difficulty, and vice versa. This property of Rasch measurement is called separability of parameter estimates: independence of the item estimates from the examinees and of the examinee estimates from the item parameters.
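The additivity is easiest to see in log-odds form (a standard rewriting of the model, shown here without the 1.7 scaling factor). Writing P_ij for the probability that examinee i answers item j correctly:

$$\ln\frac{P_{ij}}{1 - P_{ij}} = \theta_i - \beta_j$$

The person parameter and the item parameter enter as a simple sum, which is precisely what allows the two sets of estimates to be separated.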
Figure 2 shows three characteristic curves for items with difficulties of -2, 0, and +2 logits (the first is the easiest, the second average, the third the most difficult). These dependencies show that the higher the examinee's level of preparedness θ, the higher the probability of success on a given item. For example, for an examinee with θ = 0, the probability of a correct answer is close to unity for the first item, 1/2 for the second, and almost zero for the third. Note that at the points where θ equals the item difficulty, the probability of a correct answer is 0.5: if the difficulty of the item equals the examinee's level of preparedness (ability), he is equally likely to cope or not to cope with the item.
Characteristic (logistic) curves of test items are called item characteristic curves (ICC) in the English-language literature.
Figure 3 shows three characteristic curves of examinees, "person characteristic curves" (PCC). The graphs are shown for three examinees with preparedness levels of -2 logits (the weakest), 0 logits (average), and +2 logits (a strong examinee).
These dependencies show that the higher the level of preparedness, the higher the probability of a correct answer to an item. For example, an item with difficulty b = 0 will be practically beyond the first examinee (θ = -2); the second (θ = 0) has a probability of 0.5 of completing it; and the third (θ = +2) will cope easily, since for him the probability of success is almost equal to one.
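These probabilities are easy to verify numerically. A small sketch (Python with NumPy, using the 1.7-scaled logistic form of the Rasch model given earlier) evaluates the success probability of the three examinees on an item of difficulty b = 0:

```python
import numpy as np

def irf_rasch(theta, b, scale=1.7):
    """Rasch (1PL) probability of a correct answer; the 1.7 factor keeps
    the logistic curve close to the normal-ogive model."""
    return 1.0 / (1.0 + np.exp(-scale * (theta - b)))

for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.0f}: P(correct | b = 0) = {irf_rasch(theta, 0.0):.3f}")
# theta = -2: ~0.032  (the weak examinee almost never succeeds)
# theta =  0:  0.500  (ability equals difficulty)
# theta = +2: ~0.968  (the strong examinee almost always succeeds)
```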
2) The two-parameter model of A. Birnbaum:

$$P_j(\theta) = \frac{1}{1 + e^{-1.7 a_j (\theta - \beta_j)}}$$
If a test contains items with different discriminating ability, the one-parameter 1PL model cannot describe such empirical data. To overcome this difficulty, A. Birnbaum introduced a second parameter, a (the item discrimination parameter).
The parameter a_j determines the slope (steepness) of the characteristic curve of the j-th item. Examples of characteristic curves are shown in Fig. 4: the larger a_j, the steeper the curve and the higher the item's discriminating ability.
To better match empirical data, A. Birnbaum introduced a third parameter, c, the guessing parameter. Figure 5 shows examples of characteristic curves for three items with difficulty β_j = 1, discrimination parameter a_j = 1, and guessing parameters c_j = 0, c_j = 0.25, and c_j = 0.5.
The graphs show that the guessing parameter shifts the ICC upward, raising its lower asymptote to c_j.
3) The three-parameter model of A. Birnbaum:

$$P_j(\theta) = c_j + \frac{1 - c_j}{1 + e^{-1.7 a_j (\theta - \beta_j)}}$$

where c_j is the third parameter of the model, characterizing the probability of a correct answer to the j-th item when that answer is guessed rather than based on knowledge.
The Rasch model
A characteristic feature of the Rasch model is that its characteristic curves (ICC) do not intersect (Fig. 3). This means that if some item "A" is easier than item "B", this relation is preserved over the entire range of variation of θ.
A completely different picture is observed for the two- and three-parameter models; this is clearly visible in Fig. 4. The item with a_j = 0.5 is, in the region of positive values of θ, the most difficult of the three items presented: the probability of a correct answer to it is the lowest. In the region of negative values of θ, this same item is the easiest: the probability of a correct answer to it is the greatest. It turns out that for weak examinees this is the easiest item, while for strong examinees it is the most difficult.
A similar picture is observed for the three-parameter model. Figure 5 shows the rare case of non-intersecting characteristic curves, obtained because the same parameters β_j = 1 and a_j = 1 were chosen for all of them, i.e., all three items have the same difficulty and the same discrimination parameter.
Figure 6 shows another example. Here the item with parameter c_j = 0 has had its difficulty changed to β_j = -1, which immediately caused the characteristic curves to intersect. In the region θ < -2, the item with c_j = 0 is the most difficult. In the region -1.5 < θ < -1, this item is easier than the item with c_j = 0.25 and more difficult than the item with c_j = 0.5. In the region θ > -1, the item with c_j = 0 is the easiest. Such intersections of ICCs almost always occur for two- and three-parameter models.
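For the two-parameter case the crossing point can even be computed in closed form: two 2PL curves with parameters (a1, β1) and (a2, β2) take the same value where a1(θ - β1) = a2(θ - β2). A minimal sketch (Python; the parameter values are illustrative, not taken from the figures):

```python
def icc_crossing(a1, b1, a2, b2):
    """Ability level at which two 2PL ICCs intersect.

    Solves a1*(theta - b1) == a2*(theta - b2); a crossing exists
    whenever the discriminations a1 and a2 differ.
    """
    return (a1 * b1 - a2 * b2) / (a1 - a2)

# Two items of equal difficulty but different discrimination cross
# exactly at the common difficulty: for weaker examinees (theta < 0)
# the flatter item is easier, for stronger ones (theta > 0) it is harder.
print(icc_crossing(a1=0.5, b1=0.0, a2=2.0, b2=0.0))  # -> 0.0
```

For the three-parameter case with different guessing parameters c_j, the curves must be compared numerically.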
Thus, only the one-parameter Rasch model meets the requirements for high-quality measuring instruments. It is the Rasch measurement model that is best suited for constructing a test as a measuring instrument.
Key IRT Assumptions [11]
1) there are latent (hidden) personal parameters that are not available for direct observation; in intellectual testing, for example, these are the examinee's level of preparedness and the item's level of difficulty;
2) there are indicators that are related to the latent parameters but, unlike them, are available for observation; from the values of the indicators one can judge the values of the latent parameters;
3) the latent parameter being estimated must be unidimensional (the scale must measure one and only one variable). If the unidimensionality condition is not met, the test must be reworked: the consistency of the items, the dimensionality of the test, etc. must be checked, and all items that violate homogeneity must be excluded from the scale. One common screen for unidimensionality is sketched below.
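As mentioned in point 3, unidimensionality is usually screened before an IRT model is fitted. One common heuristic (an illustrative sketch, not the only method and not prescribed by the source) is to compare the leading eigenvalues of the inter-item correlation matrix:

```python
import numpy as np

def eigenvalue_ratio(responses):
    """Heuristic unidimensionality screen.

    responses -- array of shape (n_persons, n_items) with 0/1 scores.
    Returns the ratio of the first to the second eigenvalue of the
    inter-item correlation matrix: a large ratio suggests a single
    dominant trait, a ratio near 1 suggests several traits.
    """
    corr = np.corrcoef(responses, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return eigvals[0] / eigvals[1]
```

Items that load on a secondary dimension would then be candidates for exclusion, in line with the homogeneity requirement above.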
Comparison of modern and classical test theories [12]
| | Classical Test Theory (CTT) | IRT (Rasch models) |
|---|---|---|
| 1 | Estimates of item difficulty depend on the level of preparedness of the particular sample of examinees | Estimates of item difficulty are invariant with respect to the contingent of examinees whose test results produced them |
| 2 | Estimates of examinees' levels of preparedness (raw scores) depend on the difficulty level of the particular test | Estimates of examinees' levels of preparedness are invariant with respect to the test items from whose results they were obtained |
| 3 | The measurement error is a constant for all examinees; the measurement error of items is not evaluated | The measurement error is evaluated individually for each examinee and each item; moreover, the error is calculated directly, not indirectly |
| 4 | Methods of assessing reliability require restrictive assumptions and give distorted results | The reliability of the measurement of examinees and the reliability of the estimation of items can be evaluated separately |
| 5 | The raw-score scale is ordinal; no conversion of raw scores within CTT raises the level of the scale | The logit scale is an interval scale, which makes it possible to move from ranking examinees and items to measuring, respectively, the level of preparedness and the level of difficulty |
| 6 | A normal distribution of test scores and item difficulties plays a significant role | A normal distribution of the parameters is not required |
| 7 | Methods of establishing correspondence between the scores of examinees taking different test versions require assumptions that are difficult to satisfy | Scores from different test versions can be equated and scaled on a single metric scale; item banks can be created |
| 8 | Not suitable for computerized adaptive testing | The whole theory of computerized adaptive testing is based on IRT |
| 9 | The analysis is limited to estimating the difficulty of items and the measures of examinees | The influence of additional factors on the estimates of item parameters and examinee measures can be analyzed |
| 10 | Artificially assigning weights to items can distort information about examinees' levels of preparedness | The weight (information contribution) of a test item can be calculated separately, independently of the characteristics of the other items |
See also
- Psychological Test Reliability
- Psychological testing
- Psychometrics
- Standardized test
Notes
- ↑ Neumann Yu.M., Khlebnikov V.A. Introduction to the Theory of Modeling and Parameterization of Pedagogical Tests. M.: Prometheus. 169 p. Archived copy (inaccessible link). Retrieved June 3, 2017; archived June 4, 2017.
- ↑ 1 2 Avanesov V.S. The use of test forms in Rasch measurement // Pedagogical Measurements, 2005, No. 4, pp. 3-20. Archived copy (inaccessible link). Retrieved June 3, 2017; archived June 4, 2017.
- ↑ National Council on Measurement in Education. http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorI Archived 22 July 2017 at the Wayback Machine.
- ↑ A. van Alphen, R. Halfens, A. Hasman and T. Imbos (1994). Likert or Rasch? Nothing is more applicable than good theory. Journal of Advanced Nursing, 20, 196-201.
- ↑ Binet A., Simon Th. The Development of Intelligence in Young Children. Vineland, NJ: The Training School, 1916.
- ↑ Richardson, Marion W. The Relation Between the Difficulty and the Differential Validity of a Test. Psychometrika, 1936, 1(2), 33-49.
- ↑ Richardson M.W. Notes on the Rationale of Item Analysis. Psychometrika, 1936, 1: 169-76.
- ↑ Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage Press.
- ↑ Ostini, Remo; Nering, Michael L. (2005). Polytomous Item Response Theory Models. Quantitative Applications in the Social Sciences, 144. SAGE. ISBN 978-0-7619-3068-6.
- ↑ http://koi.tspu.ru/koi_books/samolyuk/lek13.htm
- ↑ M.K. Rybnikova. Test theory: classical, modern and "intelligent". http://www.ht.ru/cms/component/content/article/1-aricles/109862-13022014
- ↑ Kardanova E.Yu. Advantages of modern test theory in comparison with classical test theory. Testing Issues in Education, 2004, No. 10.
Literature
- Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Mahwah, NJ: Erlbaum.
- Embretson, Susan E.; Reise, Steven P. Item Response Theory for Psychologists. Psychology Press, 2000. ISBN 978-0-8058-2819-1.
- Baker, Frank (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD.
- Baker, Frank B.; Kim, Seock-Ho. Item Response Theory: Parameter Estimation Techniques. 2nd ed. Marcel Dekker, 2004. ISBN 978-0-8247-5825-7.
- Handbook of Modern Item Response Theory. Springer, 1996. ISBN 978-0-387-94661-0.
- de Boeck, Paul; Wilson, Mark. Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer, 2004. ISBN 978-0-387-40275-8.
- Fox, Jean-Paul. Bayesian Item Response Modeling: Theory and Applications. Springer, 2010. ISBN 978-1-4419-0741-7.
- Crocker, Linda; Algina, James. Introduction to Classical and Modern Test Theory. Higher Education Today, 2010. ISBN 978-5-98704-437-5.
Links
- "HISTORY OF ITEM RESPONSE THEORY (up to 1982)" , en: University of Illinois at Chicago
- A Simple Guide to the Item Response Theory (PDF)
- Psychometric Software Downloads
- flexMIRT IRT Software
- IRT Tutorial
- IRT Tutorial FAQ
- An introduction to IRT
- The Standards for Educational and Psychological Testing
- IRT Command Language (ICL) computer program
- IRT Programs from SSI, Inc.
- IRT Programs from Assessment Systems Corporation
- Latent Trait Analysis and IRT Models
- Rasch analysis
- Rasch Analysis Programs from Winsteps
- Free IRT software
- Packages for using IRT in R
- IRT / EIRT support in Lertap 5
- ShinyItemAnalysis (2017) application for working with IRT online
- Tasks for estimating the parameters of the IRT model
- The use of IRT in testing educational achievements