Vertical scaling is necessary to facilitate comparison of scores from test forms of different difficulty levels. It is widely used to enable the tracking of student growth in academic performance over time. Most previous studies on vertical scaling methods assume relatively long tests and large samples. Little is known about their performance when the sample is small or the test is short, challenges that small testing programs often face. This study examined effects of sample size, test length, and choice of item response theory (IRT) models on the performance of IRT-based scaling methods (concurrent calibration, separate calibration with Stocking-Lord, Haebara, Mean/Mean, and Mean/Sigma transformation) in linear growth estimation when the 2-parameter IRT model was appropriate. Results showed that IRT vertical scales could be used for growth estimation without grossly biasing growth parameter estimates when sample size was not large, as long as the test was not too short (≥20 items), although larger sample sizes would generally increase the stability of the growth parameter estimates. The optimal rate of return in total estimation error reduction as a result of increasing sample size appeared to be around 250. Concurrent calibration produced slightly lower total estimation error than separate calibration in the worst combination of short test length (≤20 items) and small sample size (n ≤ 100), whereas separate calibration, except in the case of the Mean/Sigma method, produced similar or somewhat lower amounts of total error in other conditions.
All Science Journal Classification (ASJC) codes
- Social Sciences (miscellaneous)
- Psychology (miscellaneous)