Interpreting claims of test comparability: important considerations, concerns, and caveats


To avoid any misinterpretation of the comparison tables found on this website, please refer to the article below and be fully aware of the potential concerns and limitations of test comparisons.



It is important to use caution when making any kind of comparison between two tests which have been produced for different purposes, to different test specifications, and which contain different content. As such, test users are urged to be very careful about interpreting the kinds of comparison charts which are presented on this website. The following article will detail some of the risks inherent in making such comparisons. It will also outline a framework for the principled application of relevant theory in order to help test developers provide information that can help test users make informed decisions to meet their real-world needs.


The first important point to remember is that different tests, for example the EIKEN Grade 1 test and the TOEFL® iBT, are not usually considered to be equivalent. Within the field of educational measurement, equivalence is generally used to refer to the technical properties which ensure comparability between two alternative forms of the same test (Taylor, 2004). For example, when test takers take a version of test A administered in January, and then another version of test A on another testing date, they do not typically expect both versions to contain exactly the same questions and content. In fact, such a case would usually be considered a threat to the security of the test, and would potentially have a negative impact on the validity of the interpretations of the test scores. But test takers do expect that appropriate steps have been taken to ensure that the two different forms cover the same range of content which the test is designed to cover, and that appropriate procedures have been employed to ensure that the results or test scores are in fact comparable.


There are many technical procedures associated with maintaining the comparability of different forms of the same test, and these days such procedures commonly include the use of IRT analysis and equating methodology to ensure that test scores from different forms can be placed on a common scale and are indeed comparable. However, while the technical properties associated with maintaining comparability of separate forms of the same test are often quite clearly elaborated (though nonetheless difficult to achieve), the issue of what constitutes comparability, or an appropriate level of comparison, for scores or results obtained from different tests is not so clear. Professional standards and codes of conduct issue guidelines for test developers regarding both types of comparisons, but these too are not always specific. For example, Section C of the International Language Testing Association’s (ILTA) Code of Practice contains the following standard: “If a decision must be made on candidates who did not all take the same test or the same form of a test, care must be taken to ensure that the different measures used are in fact comparable.” The responsibility is clear, but the procedures are not, although several more detailed standards in the same Code relate to the comparison of different forms of the same test. When tests have different numbers of tasks and test items, and include different content because they are built from different test specifications designed for populations, uses, and educational contexts which may overlap but are not identical, the question of how useful score comparisons are is a legitimate one. Test users should be aware that there are many experts in the field who question the legitimacy of such comparisons.
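As a rough illustration of what placing scores from two forms of the same test on a common scale can involve, the sketch below uses linear equating, a simple classical method that matches the means and standard deviations of the two forms. It is not the IRT-based approach mentioned above and is not presented as the procedure used for any particular test. It assumes that the two forms were taken by randomly equivalent groups, and the function name and score lists are made up for illustration.

```python
from statistics import mean, pstdev


def linear_equate(score_x, form_x_scores, form_y_scores):
    """Map a Form X raw score onto the Form Y scale.

    Linear equating sets standardized scores equal on both forms:
        (x - mean_x) / sd_x == (y - mean_y) / sd_y
    Assumption: both score lists come from randomly equivalent groups.
    """
    mean_x, mean_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    if sd_x == 0:
        raise ValueError("Form X scores have no variance; cannot equate.")
    return mean_y + (sd_y / sd_x) * (score_x - mean_x)


# Hypothetical raw scores on two forms of the same test (illustrative only).
form_x = [10, 12, 14, 16, 18]
form_y = [20, 23, 26, 29, 32]

# A Form X score of 16, expressed on the Form Y scale.
equated = linear_equate(16, form_x, form_y)
print(round(equated, 2))
```

Note that this kind of adjustment is only defensible when both forms are built to the same specifications; applying it to two different tests would not make their scores equivalent.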


Test developers, however, are faced with the dilemma that many test users have legitimate needs to consider the results from different assessments when making the decisions which test results are designed to inform. Admissions officers and employers do receive applications from applicants from different national (and thus educational) backgrounds. Although admissions officers or employers may be more familiar with one type of assessment tool or test, it would be exclusionary and, in terms of consequences or impact, potentially extremely detrimental to many test takers to require them to take only one type of test when they may already have suitable qualifications which are recognized locally but not internationally. Learners, too, who wish simply to make informed decisions about their own learning achievements and goals can arguably benefit from making careful comparisons between their ability as measured by a test which they have access to and their likely performance on other tests designed to measure similar abilities but used in different (perhaps wider or international) contexts.


Of course, studies to assess the degree of correlation between two different tests designed to measure the same or similar abilities also form a long-standing part of test validation, traditionally falling under the category of criterion-related validity studies. In fact, Kane (1992) has noted that prior to the widespread adoption of the unitary concept of validity which now underpins validity theory, such correlation studies were perhaps the most common form of empirical test validation research.
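A correlation study of the kind described above comes down to computing a correlation coefficient between the scores that the same group of test takers obtained on two tests. The sketch below computes Pearson's r from first principles. The function name and the score lists are hypothetical and are not taken from any actual study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between paired scores on two tests."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("Need two equally long lists with at least 2 pairs.")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("Scores on one test have no variance.")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of the same six test takers on two tests.
test_a = [52, 61, 70, 74, 83, 90]
test_b = [48, 66, 65, 80, 79, 95]
print(round(pearson_r(test_a, test_b), 3))
```

A high coefficient shows only that the two tests rank test takers similarly. It does not show that the tests have the same content or that their scores are interchangeable.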


At STEP, we take the position that test users—those who require and make decisions based on test scores as well as learners themselves—do make comparisons between the variety of language tests available, and as such, rather than ignoring these legitimate concerns because of the difficulties involved, we prefer to provide information to help test users make informed decisions. At the same time, we would like to stress that such information is limited in its scope and application and needs to be approached with care. A good overview of many of the concerns and issues involved with making test comparisons can be found in Taylor (2004). Concerning the use of common frameworks such as the CEFR to facilitate comparison, she cautions:

There is no doubt that comparative frameworks can serve a useful function for a wide variety of test stakeholders... But we need to understand that they have their limitations, too: they may mask significant differentiating features, they tend to encourage oversimplification and misinterpretation, and there is always a danger that they are adopted as prescriptive rather than informative tools. (Taylor, 2004)


Despite the caveats and potential pitfalls inherent in the activity of comparing different tests, it should be noted that measurement specialists have suggested methodological frameworks for linking (that is, comparing) different assessments, and that these frameworks account for different levels of robustness in terms of the claims that can be made. Mislevy (1992) and Linn (1993) have both outlined five-level systems ranging from equating (the most robust and strict in terms of requirements) to social moderation (with Mislevy's projection corresponding to Linn's prediction). In Mislevy's framework, projection, statistical moderation, and social moderation all allow for comparison in which tests "don't measure the same thing." Of course, there still has to be some sound, substantive, and explicitly stated reasoning for why making such a comparison is reasonable. But well-documented statistical procedures exist for making comparisons within such theoretical frameworks as Mislevy's and Linn's. Social moderation, defined by North (2000) as “building up a common framework of reference through discussion and reference to work samples,” has traditionally been seen as the least robust of the procedures for linking exams. However, North (ibid.) suggests that when combined with other procedures, the building up of a common framework of reference through social moderation can also be a powerful tool.
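To make the idea of projection (Linn's prediction) more concrete, the sketch below fits a simple least-squares regression that predicts scores on one test from scores on another, and reports a rough band around each prediction based on the standard error of estimate. It is a minimal illustration under stated assumptions: a linear relationship, roughly constant spread of residuals, and invented data. It is not the procedure used in any specific study, and the function names are hypothetical.

```python
from math import sqrt


def fit_projection(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, see), where see is the standard error of
    estimate: the typical size of the prediction error in y units.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("Need at least 3 paired scores.")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("Predictor scores have no variance.")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    see = sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, see


def project(x, slope, intercept, see, z=1.96):
    """Return (low, predicted, high) for a score x on the first test.

    The band is approximate: it ignores uncertainty in the fitted
    slope and intercept, so it understates the true prediction error.
    """
    predicted = intercept + slope * x
    return predicted - z * see, predicted, predicted + z * see


# Hypothetical paired scores for test takers who sat both tests.
test_a = [40, 50, 60, 70, 80]
test_b = [62, 70, 85, 91, 104]
slope, intercept, see = fit_projection(test_a, test_b)
low, predicted, high = project(65, slope, intercept, see)
print(f"predicted {predicted:.1f}, rough range {low:.1f} to {high:.1f}")
```

Reporting the range alongside the predicted score reflects the point made above: a projected score is an estimate with error, not a statement that the two tests measure the same thing.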


The comparisons included on this website have come from several perspectives. The TOEFL® score comparisons could be seen as falling in the traditional sphere of projection or prediction in Mislevy’s and Linn’s frameworks. The CEFR is a common framework of reference which includes a descriptive scale of language proficiency (with a claim to empirical scaling). It is not a single test and cannot be approached with the same procedures employed in the EIKEN-TOEFL® studies. Linking to the CEFR necessarily involves social moderation as different users build up a common interpretation of what the different descriptions of proficiency for each level in the framework mean in real terms. At the same time, statistical procedures designed for setting cut-off scores on tests, known as standard setting, have been increasingly employed in projects to link or compare tests to the CEFR in an effort to help build a more principled understanding of the relationship between different tests and the CEFR, and to ensure that any claims are underpinned, as far as standard setting allows, by empirical data.


We would suggest that those who have proposed and supported the different kinds of techniques available to help test developers investigate the relationship of their tests with other measures of the same or similar abilities stress the need for transparency and accountability. In other words, it is incumbent on test developers making such claims to make clear what procedures were used to arrive at the results and to make sure people are aware of what level of meaning they contain. Section D of the ILTA Code of Practice contains the following standard which is relevant to the issue of transparency and accountability: those preparing and administering tests will “refrain from making false or misleading claims about the test.” We have tried to spell out as clearly as possible the intended purpose and necessary limitations of the information we have provided regarding the comparison of the EIKEN tests with other measures of English proficiency. The claims on this website are deliberately intended to be limited in scope, and insofar as they are supported by studies designed to gather data within an appropriate theoretical framework, we can safely say they are not false. But we also need to avoid ambiguous language which would cause non-specialists to arrive at unwarranted conclusions (e.g., that the tests being compared are in fact the same, or that two scores from the different tests mean exactly the same thing because they have been positioned to look that way in a comparison table). In line with the professional requirement outlined in the ILTA Code of Practice, we want to reiterate that the comparisons on this website are meant to be taken as one source of information to help test users make informed decisions. Test users, however, need to be aware of the limitations of such comparisons in general, and in particular be aware of the methods and caveats involved in the specific studies mentioned here.


Equally, it is important to reiterate that the comparison tables presented on this website do NOT suggest in any way equivalence of content between the EIKEN tests and the other measures, either in the technical sense or the everyday sense of the word. Information on the specific procedures employed in arriving at the different comparison tables is included under the headings for the relevant sections (Investigating the relationship of the EIKEN tests with the CEFR, and Comparing EIKEN and the TOEFL® Tests). We would stress, however, that no one procedure is designed to be sufficient on its own to support any claim of comparability.


As a final point we would emphasize that for any important decision regarding a test taker’s future, it is rarely appropriate to take only one source of information into account. In fact the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 2004) is very clear on this point, stating: “Avoid using a single test score as the sole determinant of decisions about test takers. Interpret test scores in conjunction with other information about individuals.” We would suggest that such a caveat extends to the kinds of test comparisons included on this website. Test users are invited to view these comparisons critically and should be aware that such comparisons cannot and should not be made (and certainly cannot be adequately supported) by the use of one study or one procedure. Especially for high-stakes purposes, one source of evidence is almost never sufficient. The information and research results we can provide through studies within the framework of projection, etc., should be considered useful, but limited and never sufficient on their own.






International Language Testing Association. (2007). International Language Testing Association Code of Practice. Retrieved April 2, 2010 from


Joint Committee on Testing Practices. (2004). Code of Fair Testing Practices in Education. Retrieved April 2, 2010 from


Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.


Linn, R. L. (1993). Linking results of different assessments. Applied Measurement in Education, 6(1), 83–102.


Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Educational Testing Service.


North, B. (2000). Linking language assessments: An example in a low stakes context. System, 28, 555–577.


Taylor, L. (2004). Issues of test comparability. Research Notes, 15, 2–5.