Applied Psychological Measurement: Volume 47, issue 1, page(s) 34-47
This study used simulation to investigate the performance of the t-test method in detecting outliers and compared its performance with other outlier detection methods, including the logit difference method with 0.5 and 0.3 as the cutoff values and the robust z statistic with 2.7 as the cutoff value.
Applied Psychological Measurement: Volume 46, issue 2, page(s) 571-588
This study evaluates the degree to which position effects on two separate low-stakes tests administered to two different samples were moderated by different item (item length, number of response options, mental taxation, and graphic) and examinee (effort, change in effort, and gender) variables. Items exhibited significant negative linear position effects on both tests, with the magnitude of the position effects varying from item to item.
Academic Medicine: June 2022
This study examines the associations between Step 3 scores and subsequent receipt of disciplinary action taken by state medical boards for problematic behavior in practice. It analyzes Step 3 total, Step 3 computer-based case simulation (CCS), and Step 3 multiple-choice question (MCQ) scores.
Journal of Educational Measurement: Volume 59, Issue 2, Pages 140-160
A conceptual framework for thinking about the problem of score comparability is given, followed by a description of three classes of connectives. Examples from the history of innovations in testing are given for each class.
Academic Medicine: Volume 97 - Issue 4 - Pages 476-477
A response emphasizing that, although the findings support a relationship between multiple USMLE attempts and increased likelihood of receiving disciplinary actions, the findings in isolation are not sufficient for proposing new policy on how many attempts should be allowed.
Educational Measurement: Issues and Practice: Volume 41 - Issue 1 - Pages 95-96
Unanticipated situations often arise that can create a range of problems, from threats to score validity to unexpected financial costs and even longer-term reputational damage. This module discusses some of these unusual challenges that can occur in a credentialing program.
Journal of Applied Testing Technology: Volume 23 - Special Issue 1 - Pages 30-40
The interpretations of test scores in secure, high-stakes environments depend on several assumptions, one of which is that examinee responses to items are independent and that no enemy items are included on the same forms. This paper documents the development and implementation of a C#-based application that uses Natural Language Processing (NLP) and Machine Learning (ML) techniques to produce prioritized predictions of item enemy statuses within a large item bank.