Reliability is the consistency of test scores across facets of the test. Achieving this key property requires the standardization of administration, tasks, and scoring.
Independent items or tasks in a single test should correlate with each other and with the test-total score. That is, the assumption is that the items test the same thing and adequately discriminate between stronger and weaker students. This property is technically referred to as 'internal consistency'.
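One common way to check whether an item discriminates well and fits with the rest of the test is the corrected item-total correlation: each item is correlated with the total of the *remaining* items, so the item does not inflate its own index. A minimal sketch, using only the standard library and an illustrative score matrix:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists.
    Assumes neither list has zero variance."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def corrected_item_total(scores):
    """scores: one row per test taker, one column per item (0/1 or points).
    Returns one correlation per item: the item against the sum of all
    *other* items, i.e. the corrected item-total correlation."""
    n_items = len(scores[0])
    results = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        results.append(pearson(item, rest))
    return results

# Hypothetical data: 4 test takers, 3 dichotomous items.
data = [[1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
print(corrected_item_total(data))  # one discrimination index per item
```

Items with low or negative corrected item-total correlations are candidates for revision, since they are not measuring the same thing as the rest of the test.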
Moss (1994) questions whether it is possible to have validity without reliability, challenging the standard position in the language testing literature that without reliability there can be no validity.
Estimates of reliability in large-scale testing are based upon four assumptions (Fulcher & Davidson, 2007):
- Stability: The abilities of the test takers will not change dramatically over short periods of time. It is interesting to note that the scores from many large-scale educational tests are given a shelf life of two years.
- Discrimination: Tests are constructed in such a way that they discriminate as well as possible between the better and poorer test takers, and the quality of individual test items or tasks depends on their discriminatory properties.
- Test Length: Traditional measures of reliability are also closely tied to the length of the test. Very simply, the more items or tasks a test includes, the higher its reliability coefficient will be; conversely, the shorter the test, the lower the reliability.
- Homogeneity: There is also an assumption in large-scale testing that all the tasks or items measure the same construct, and so the items are related or correlated to each other. So each piece of information is independently contributing to the test score, and the test score is the best possible representation of the knowledge, ability or skills of the test taker.
Methods of estimating reliability include:
- Parallel Forms: Two equivalent versions of the test are administered to the same test takers, and the correlation between the two sets of scores is taken as the reliability estimate.
- Coefficient Alpha: An internal-consistency estimate (Cronbach's alpha) computed from the number of items and the ratio of the sum of the item variances to the variance of the total scores.
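Coefficient alpha can be computed directly from an item-by-person score matrix. A minimal sketch of the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of totals), with hypothetical data:

```python
def variance(xs):
    """Population variance (dividing by n). Alpha is unchanged as long as
    item and total variances use the same denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """scores: one row per test taker, one column per item.
    Returns Cronbach's alpha, an internal-consistency reliability estimate."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 4 test takers, 3 dichotomous items.
data = [[1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
print(cronbach_alpha(data))  # → 0.75
```

Because alpha rises with the number of items and with inter-item correlation, it rewards exactly the test length and homogeneity assumptions described above.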