The Applicability of Item Response Theory Based Statistics to Detect Differential Item Functioning in Polytomous Tests

The study applied statistical procedures based on Item Response Theory to detect Differential Item Functioning (DIF) in polytomous tests, with a view to improving the quality of test item construction. The sample consisted of an intact class of 513 Part 3 undergraduate students who registered for the course EDU 304: Tests and Measurement at Sule Lamido University during the Second Semester of the 2017/2018 session. A self-developed polytomous research instrument was used to collect data. Data collected were analysed using the Generalized Mantel-Haenszel procedure, the Simultaneous Item Bias Test, and Logistic Discriminant Function Analysis. The results showed no significant relationship between the proportions of test items that function differentially in the polytomous test when the different statistical methods are used. Further, the three parametric and non-parametric methods complement each other in their ability to detect DIF in the polytomous test format: all of them have the capacity to detect DIF but perform differently. The study concluded that there was a high degree of correspondence between the three procedures in their ability to detect DIF in polytomous tests. It was recommended that test experts and developers consider using procedures based on Item Response Theory in DIF detection.


INTRODUCTION
The presence of Differential Item Functioning (DIF) jeopardizes the ideal of a correct measurement procedure. Once identified, DIF may be attributed to item impact or to item bias. Item impact is evident when examinees from different groups have differing probabilities of responding correctly to (or endorsing) an item because there are true differences between the groups in the underlying ability being measured by the item. Item bias occurs when examinees of one group are less likely to answer an item correctly (or endorse an item) than examinees of another group because of some characteristic of the test item or testing situation that is not relevant to the test purpose. DIF is thus necessary, but not sufficient, for item bias (Ibrahim, 2016).
DIF is said to be present in a test item when, despite controls for overall test performance, examinees from different groups have a different probability of answering an item correctly, or when examinees from two subpopulations with the same trait level have different expected scores on the same item (Osterlind & Everson, 2009). One of the pioneering methods used to detect DIF is known as the Generalized Mantel-Haenszel procedure (GMH). This method is based on contingency table analysis and was first used to detect DIF by Holland and Thayer (1988). The GMH procedure compares the item performance of the reference and focal groups, which are first matched on the trait measured by the test; the observed total test score is normally used as the matching criterion. In the standard GMH procedure, an item shows DIF if the odds of correctly answering the item differ for the two groups at a given level of the matching variable (Ibrahim, 2018).
According to Kristjansson, Aylesworth & Zumbo (2005), the Generalized Mantel-Haenszel (GMH) is a generalized statistic for nominal response data based on group differences in the entire response distribution. Because the GMH tests differences across the entire response scale, it should be sensitive to both uniform and non-uniform DIF. Formulae for calculating the GMH χ² statistic are given by Zwick, Donoghue & Grima (2010). The data for the studied item are arranged into a series of 2 × k contingency tables, one for each level m of the matching variable, with rows for the reference and focal groups and columns for the k response categories. Whereas A_m, E(A_m), and V(A_m) are scalars in the dichotomous case, A_m and E(A_m) are now vectors of length k − 1, corresponding to k − 1 of the k response categories, and V(A_m) is a (k − 1) × (k − 1) covariance matrix. The GMH test statistic is then

GMH χ² = [Σ_m (A_m − E(A_m))]′ [Σ_m V(A_m)]⁻¹ [Σ_m (A_m − E(A_m))].

This statistic has a large-sample chi-square distribution with k − 1 degrees of freedom under the null hypothesis of conditional independence between group membership and item response. A significant test statistic implies that uniform DIF is present in the item. The GMH statistic does not explicitly take into account the possible ordering of response categories; instead, it compares the two groups in terms of their entire response distributions rather than their means alone. The odds that focal group members will be assigned a particular score category can be compared to the odds for the reference group, conditional on the matching variable (Holland & Wainer, 2009). Using a log-odds transformation, Holland & Thayer (2006) converted αMH into a difference on a delta (∆) scale, called MH D-DIF, which has frequently been used as a measure of DIF.
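The accumulation just described can be sketched in a few lines. The function below is an illustrative implementation (not the software used in the study): it accepts one 2 × k table per level of the matching variable, accumulates A_m − E(A_m) and V(A_m) across tables, and returns the chi-square statistic with k − 1 degrees of freedom.

```python
import numpy as np

def gmh_statistic(tables):
    """Generalized Mantel-Haenszel chi-square for one item.

    tables: list of 2 x k count arrays, one per level of the matching
    variable; row 0 = reference group, row 1 = focal group, columns =
    the k response categories. Returns (GMH chi-square, df = k - 1).
    """
    k = tables[0].shape[1]
    diff_sum = np.zeros(k - 1)
    var_sum = np.zeros((k - 1, k - 1))
    for t in tables:
        t = np.asarray(t, dtype=float)
        n = t.sum()
        if n == 0:
            continue                     # empty stratum contributes nothing
        row = t.sum(axis=1)              # group totals (reference, focal)
        col = t.sum(axis=0)              # response-category totals
        a = t[0, :k - 1]                 # reference counts, first k-1 categories
        e = row[0] * col[:k - 1] / n     # expected counts under independence
        p = col / n
        # covariance of a under the multiple hypergeometric model
        v = (row[0] * row[1] / (n - 1)) * (
            np.diag(p[:k - 1]) - np.outer(p[:k - 1], p[:k - 1]))
        diff_sum += a - e
        var_sum += v
    chi2 = float(diff_sum @ np.linalg.solve(var_sum, diff_sum))
    return chi2, k - 1
```

With a single 2 × 2 table this reduces to the familiar (uncorrected) Mantel-Haenszel chi-square, which offers a quick sanity check on the implementation.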
Two generalizations of the dichotomous MH procedure can be applied to assess DIF in polytomous item responses, one for ordinal response data, the other for nominal response data.
Like the GMH procedure, the Simultaneous Item Bias Test (SIBTEST) proposed by Shealy & Stout (1993) is a conceptually simple method and involves a test of significance based on the ratio of the weighted difference in proportion correct (for reference and focal group members) to its standard error. SIBTEST was originally intended for use with dichotomous test items but has since been extended to handle ordered items. Like the GMH procedure, SIBTEST yields an overall statistical test as well as a measure of the effect size for each item (β is an estimate of the amount of DIF). SIBTEST is the designation given to the statistical methodology for detecting uniform DIF and is based on the comparison of the probability of a correct response on the target item for the reference group at a given value of the latent ability (θ) with the probability of a correct response on the target item for the focal group at the same ability level (Ibrahim, 2018). Under the SIBTEST definition, an item exhibits DIF if the expected item scores differ for reference and focal groups matched on θ. At a given ability θ, this difference is expressed as

B(θ) = P_R(θ) − P_F(θ),

where P_R(θ) and P_F(θ) are the probabilities of a correct response for reference and focal group members, respectively. The SIBTEST statistic β is the average of this difference over the ability distribution, so when uniform DIF is not present this value is 0. Because the true distribution of θ is unknown, examinees are matched on their observed scores from a subset of the items. To correct for ability differences in the two groups, which are known to influence comparisons of conditional probabilities, these observed subtest scores are taken separately for each group and adjusted using a regression equation based in classical test theory to estimate true scores, TR(s) and TF(s), for members of the reference and focal groups, respectively. The proportion correct for each group is then conditioned on a common true score, which is estimated as the average of TR(s) and TF(s) (Ibrahim, 2018).
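The core of the β̂ computation can be sketched as follows. This is a minimal illustration, not the SIBTEST program itself: examinees are matched on observed subtest scores, the focal-group weighting is one common choice, and the regression correction to estimated true scores described above is omitted for brevity.

```python
import numpy as np

def sibtest_beta(item, matching, group):
    """Uncorrected SIBTEST effect size for one studied item.

    item     : studied-item scores
    matching : matching-subtest total scores (valid subtest)
    group    : 0 = reference, 1 = focal
    Returns beta-hat, the weighted mean difference in item score between
    reference and focal examinees matched on the subtest score.
    """
    item = np.asarray(item, dtype=float)
    matching = np.asarray(matching)
    group = np.asarray(group)
    n_focal = (group == 1).sum()
    beta = 0.0
    for s in np.unique(matching):
        in_s = matching == s
        ref = item[in_s & (group == 0)]
        foc = item[in_s & (group == 1)]
        if len(ref) == 0 or len(foc) == 0:
            continue                      # stratum unusable: one group empty
        weight = len(foc) / n_focal       # weight by focal-group proportion
        beta += weight * (ref.mean() - foc.mean())
    return beta
```

A positive β̂ indicates the item favours the reference group; under no uniform DIF the statistic is near zero.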
As with SIBTEST and the Generalized Mantel-Haenszel procedure, Logistic Discriminant Function Analysis (LDFA), proposed by Miller & Spray (2003), is a parametric DIF detection approach which provides both a significance test and a measure of effect size. LDFA is closely related to logistic regression, and it is also model-based. However, there is one major difference in the LDFA method, namely that group membership is the dependent variable rather than item score. Thus, in LDFA, the probability of group membership is estimated from total score and item score; this is a logistic form of the probability used in discriminant function analysis. LDFA is used for DIF identification in items that are polytomously scored (items with multiple response categories, such as a Likert scale or a constructed-response item). In LDFA, three equations are derived: an equation predicting group membership from total score only; an equation predicting group membership from total score and item score; and an equation predicting group membership from total score, item score, and the item-by-total-score interaction. A likelihood-ratio goodness-of-fit statistic, G², is computed for each model. As with the other two DIF techniques described here, its Type I error is generally near or below the nominal rate of 0.05 but may be problematic when group ability differences are present. In the logistic regression model, the item response variable, U, is treated as a random variable and X and G are assumed to be fixed explanatory variables. However, it has been shown that it is reasonable to use the logistic regression procedure to estimate Prob(G|X, U) even though G is fixed and U is random (Hosmer & Lemeshow, 2000).
In this form, Prob(G|X, U) is simply a logistic form of the posterior probability used in discriminant analysis. This procedure is called logistic discriminant function analysis (LDFA). When applying LDFA to assess DIF in ordered item responses, the discriminant function (without item notation) can be written as

Prob(G = 1 | X, U) = exp(β₀ + β₁X + β₂U) / [1 + exp(β₀ + β₁X + β₂U)].

With these methods, however, there is not yet a consensus about how to test DIF when item responses are polytomously scored, even though the most widely used DIF detection methods are procedures based on Item Response Theory (IRT). These methods have been useful in detecting DIF over time. Several extensions of the DIF procedures have been proposed for use with polytomous item responses, such as the Ordinal Logistic Regression procedure, the Mantel procedure for ordered response categories, the Generalized Mantel-Haenszel procedure for nominal data, the polytomous extension of SIBTEST, the polytomous extension of the standardization approach, and Logistic Discriminant Function Analysis. However, their utility in assessing DIF in ordinal items has not received the thorough and rigorous study accorded to dichotomous DIF, thus necessitating further research to investigate their performance before they are ready for routine operational use (Ibrahim, 2017). As a corollary, this study empirically compares the relative ability of the three statistical methods to detect Differential Item Functioning in polytomous test items. Towards this end, the specific objective of the study was to determine the relationship between the proportions of test items that function differentially in polytomous tests when the different methods are used. To achieve this objective, a null research hypothesis was postulated:
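The three nested models and their likelihood-ratio G² comparisons can be sketched as follows. This is an illustrative numpy implementation with Newton-Raphson logistic fits, not the software used in the study; the function names `logit_loglik` and `ldfa_dif` are hypothetical.

```python
import numpy as np

def logit_loglik(X, y, max_iter=50):
    """Fit a logistic regression by Newton-Raphson and return the
    maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        step = np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    eps = 1e-12
    return float(y @ np.log(p + eps) + (1 - y) @ np.log(1 - p + eps))

def ldfa_dif(total, item, group):
    """Likelihood-ratio G^2 tests for uniform and non-uniform DIF.

    Group membership (0/1) is the dependent variable, predicted from
    total score X, item score U, and their interaction X*U."""
    t = np.asarray(total, float)
    u = np.asarray(item, float)
    g = np.asarray(group, float)
    one = np.ones_like(t)
    ll1 = logit_loglik(np.column_stack([one, t]), g)          # total only
    ll2 = logit_loglik(np.column_stack([one, t, u]), g)       # + item
    ll3 = logit_loglik(np.column_stack([one, t, u, t * u]), g)  # + interaction
    return {"G2_uniform": 2 * (ll2 - ll1),       # df = 1, uniform DIF
            "G2_nonuniform": 2 * (ll3 - ll2)}    # df = 1, non-uniform DIF
```

Each G² is referred to a chi-square distribution with one degree of freedom; a significant uniform statistic means the item score carries information about group membership beyond the matching total score.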

Research Hypothesis
There is no significant relationship between the proportions of test items that function differentially in the polytomous tests when the different methods are used.

METHOD
This study employed the descriptive-comparative research design. According to Upadhya and Singh (2008), a descriptive-comparative design compares two or more groups on one variable with a view to discovering something about one or all of the things being compared. Hence, a reference-group and focal-group combination was used in the Differential Item Functioning analysis. In carrying out this study, therefore, the researcher collected data from a subset of the population (undergraduate students at the 300 level) in such a way that the knowledge gained is representative of the total population under study. Essentially, the researcher used the data collected to explore the three statistical DIF detection methods under study. In this study, the three methods, SIBTEST, GMH, and LDFA, were applied within a contingency table framework; within this framework, total test score was used as the measure of trait level. Hence, DIF was held to exist if group differences occurred in item score after matching on total score. The nature of this research and the sample and data collected determined the appropriateness of this design.

Participants
All undergraduate students who registered for a compulsory course in Tests and Measurement during the Second Semester of the 2017/2018 session in the Faculty of Education of the Sule Lamido University, Kafin Hausa, Jigawa State, Nigeria, constituted the target population for the study. There were 513 undergraduate students who registered for the course during the session. The sample consisted of an intact class of these 513 Part 3 undergraduate students who registered for EDU 304 in the Second Semester of the 2017/2018 session. Thus, the entire population was used; no probability sampling was carried out, as the intact class constituted a convenience sample.

Research Instrument
A self-developed polytomous instrument was used in the study, namely the "Undergraduate Students Achievement and Efficacy Scale (USAES)". The instrument contained 74 items divided into dichotomous and ordinal sections. First, the dichotomous section consists of a 50-item, 4-option multiple-choice test that was developed using the course (EDU 304: Tests and Measurement) content. Second, the ordinal section consists of 24 items making up six subscales, rated on a five-point Likert scale with the options Strongly Agree (SA), Agree (A), Undecided (U), Disagree (D), and Strongly Disagree (SD). The content and construct validity of the instrument were established using expert judgments. Experts in Tests and Measurement, Statistics, and Psychology scrutinized and modified the items, thereby establishing the content validity of the instrument. The experts reviewed the items in terms of relevance to the subject matter, coverage of the content areas, appropriateness of the language used, and clarity of purpose. The experts' judgments revealed that the instrument had adequate content, construct, and face validity. Thereafter, a reliability analysis was conducted on the data collected in a pilot test, using the reliability analysis tool of the Statistical Package for the Social Sciences (SPSS), version 24.0. The instrument was pilot tested on 60 Part 3 students in the Faculty of Education, Bayero University, Kano, Kano State, Nigeria, who were offering the same course with similar content. The reliability of the scores obtained in the pilot study was estimated using Cronbach's Alpha, the Spearman-Brown Split-Half coefficient, and the Guttman Split-Half coefficient; the coefficients obtained were 0.76, 0.89, and 0.89, respectively.
The mean (x̄) difficulty index of the test items is 0.70, with a standard deviation of 0.28. The item discrimination indices have a mean (x̄) of 0.23 and a standard deviation of 0.17, with minimum and maximum scores of 10.0 and 35.0, respectively, and a variance of 67.7. Noteworthy, the Split-Half method was preferred because of the desire to determine the internal consistency of the instrument and because it was not feasible to repeat the same test. It was also considered a better reliability method in the sense that all the data required for computing reliability are obtained on one occasion; therefore, variations arising out of the testing situation do not interfere with the results of this study. According to Afolabi (2012), Split-Half reliability provides a measure of consistency with respect to content sampling, hence its preference in this study. All these values were judged acceptably high for a study of human behaviour, given its complexity. Consequently, the instrument was accepted as reliable, hence its usage in this study.
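For readers who wish to reproduce such coefficients, Cronbach's alpha and a Spearman-Brown corrected split-half estimate can be computed as below. This is an illustrative sketch using an odd-even split (SPSS was the tool actually used in the study); the function names are hypothetical.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an examinees x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

def split_half(scores):
    """Spearman-Brown corrected odd-even split-half reliability."""
    scores = np.asarray(scores, dtype=float)
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]              # half-test correlation
    return 2 * r / (1 + r)                        # step up to full length
```

Note that the two estimates will generally differ, as they did in the pilot data (0.76 versus 0.89), since alpha averages over all possible splits while the split-half uses one particular split.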

Procedure for Data Collection
The instrument was administered by the researcher. Hard copies were administered to the students with the assistance of the course lecturers of EDU 304, as well as a few Assistant Lecturers in the Department of Education of the Sule Lamido University, Kafin Hausa, Jigawa State, Nigeria. The administration was conducted under strict but friendly conditions, and adequate time was provided for respondents to respond to all the items. Furthermore, respondents were instructed not to omit any item: it was mandatory to answer every item by marking the response they judged most correct. Such a procedure provided a uniform response set, thereby minimizing individual differences in responding. The administered copies were collected immediately. A total of 513 copies of the instrument were administered, of which 502 properly completed copies were returned and used for analysis.

Method of Data Analysis
DIF statistical analyses were conducted for each item using the GMH, SIBTEST, and LDFA statistical methods. The test statistics were interpreted at an alpha level of 0.05. The software packages DIF OpenStat, developed by Miller (2011), and DIF LazStats, developed by Pezzulo (2010), were used to run the three statistical procedures. SPSS version 24.0 and Microsoft Excel version 12.0 were used to manage and organize the datasets.

RESULTS
In an effort to better understand the direction of DIF magnitude, we investigated the Item Characteristic Curves (ICCs) for items displaying the highest and lowest error rates. Figure 1 contains two ICCs for the two groups: one for a high-difficulty, medium-discrimination item with DIF of 1.0 (Figure 1A) and the other with DIF of .4 (Figure 1B). The DIF error rates for SIBTEST, GMH, and LDFA were .971, .658, and .427, respectively, in the dichotomous test items and .622, .436, and .239, respectively, in the ordinal test items. For examinees with ability values from approximately -1 through 4, the separation between the two groups is clear in the DIF = 1.0 case. Conversely, at the low end of the ability scale, the ICCs come very close to one another, so that they are difficult to disentangle visually. Indeed, at θ = -1.0 the probability of a correct response for the reference group was .243 and for the focal group, .206. In contrast, at θ = -3.0 this gap had closed considerably, with a correct-response probability of .202 for the reference group and .200 for the focal group. In short, for individuals at lower ability levels, the probability of a correct response approached the lower asymptote of .2 regardless of group membership, thus making DIF difficult to detect there. The DIF = .4 condition resulted in a much lower error rate for SIBTEST, GMH, and LDFA.
In this case, the gap between the curves for the two groups was much smaller across the ability levels than in the DIF = 1.0 case. At θ = -1.0 the reference and focal groups' probabilities of a correct response were .224 and .211, respectively, while at θ = -3.0 both groups had a correct-response probability of .201. For both dichotomous and ordinal test items there was very little difference in the probability of a correct response between the two groups at low abilities, with both approaching the lower asymptote of .2. However, the item with greater b-DIF had a larger gap in the probability of a correct response between the two groups for abilities above -1 than did the item with less b-DIF. Thus, for items where the gap between the two ICCs changes markedly across the ability scale, SIBTEST and LDFA detect an interaction and signal the presence of DIF, even though the a-parameter values for the two groups were equal.
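The pattern described, a gap that is negligible at low θ and widens at higher θ when the two groups differ only in difficulty, can be reproduced with a three-parameter logistic ICC. The parameter values below are illustrative assumptions, not the study's item estimates.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: probability of a correct response
    at ability theta, with discrimination a, difficulty b, and lower
    asymptote (pseudo-guessing) c."""
    theta = np.asarray(theta, dtype=float)
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

# Illustrative item: equal discrimination, c = .2, and reference and
# focal groups whose difficulties differ by the b-DIF amount of 1.0.
theta = np.array([-3.0, -1.0, 0.0, 2.0])
p_ref = icc_3pl(theta, a=1.0, b=1.0, c=0.2)
p_foc = icc_3pl(theta, a=1.0, b=2.0, c=0.2)   # b shifted by the DIF
gap = p_ref - p_foc
```

At θ = -3 both probabilities sit just above the asymptote of .2 and the gap is essentially zero, while above θ = -1 the curves separate clearly, mirroring the behaviour reported for Figure 1.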
Similarly, Table 2 displays the proportions of test items that function differentially in the ordinal test when the different methods are used. As seen in the table, the p-value difference for item 1 under the three methods appears to be out of line with the results for the same item for both reference and focal groups. For instance, when SIBTEST was used, a p-value difference of .25, p < .05, was obtained, as compared with the GMH result of .03, p < .05, and the LDFA result of .05, p < .05, which shows that item 1 was identified as functioning differentially when the three methods are used. In all, ten such items (items 1, 3, 7, 8, 11, 14, 17, 19, 20, and 22) were identified by SIBTEST as functioning differentially for the reference and focal groups. Further examination of the 24 items indicated that only items 5, 9, 11, 13, 14, and 22 were flagged by GMH, and items 3, 16, 18, and 20 by LDFA, as functioning differentially for the reference and focal groups. Also, GMH flagged item 13 as an outlier (.25, p < .05) relative to the other two methods (SIBTEST = .08, p < .05, and LDFA = .01, p < .05).

*Significant, p < .05
Further, Table 3 presents the results of the Chi-square (χ²) analysis. From Table 3, 8% of the items in the ordinal test were flagged as functioning differentially when GMH was used, as compared with 23% of the items in the dichotomous test. Also, SIBTEST flagged 14% of the items in the ordinal test and 22% of the items in the dichotomous test as functioning differentially. Similarly, LDFA flagged 11% of the ordinal items and 23% of the dichotomous items as functioning differentially. The Chi-square (χ²) analysis of these results yielded 0.98, which is not significant at p > 0.05. Thus, the null hypothesis is retained; that is, there is no significant relationship between the proportions of test items that function differentially in the dichotomous and ordinal tests when the different methods are used.
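The hypothesis test applied here is a chi-square test of independence on the counts of flagged items. The sketch below illustrates the procedure with hypothetical counts derived from the reported percentages for the ordinal test (not the study's exact tallies), using scipy.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts of 24 ordinal items flagged / not flagged by each
# method (hypothetical numbers reconstructed from the percentages):
# rows = methods (GMH, SIBTEST, LDFA), columns = flagged, not flagged.
ordinal = np.array([[2, 22],    # GMH:     ~8% of 24 items
                    [3, 21],    # SIBTEST: ~14%
                    [3, 21]])   # LDFA:    ~11%
chi2, p, dof, expected = chi2_contingency(ordinal)
# A non-significant p (> .05) indicates the proportion of items flagged
# does not differ significantly across methods, matching the study's
# retention of the null hypothesis.
```

The same layout extends to the dichotomous-versus-ordinal comparison by adding the dichotomous counts as further rows or as a separate table.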

DISCUSSION
The findings of this study indicated that there was no significant relationship between the proportions of test items that function differentially in the dichotomous and ordinal tests when the different methods are used. The results are in consonance with the earlier findings of DeAyala (2012), which concluded that the LR procedure was as powerful as the MH procedure in detecting uniform DIF and more powerful than the MH in detecting non-uniform DIF. In addition, as Dorans and Schmitt (2009) stated, if LR DIF can detect non-uniform DIF better than the MH DIF method, and is as powerful at detecting uniform DIF as the MH DIF method, then the inclusion of an effect size would make LR DIF a very attractive choice as a DIF detection method. The researcher believes that the results of this study gain added significance from one point: LDFA is a parametric DIF detection approach developed in response to earlier DIF techniques that could screen only uniform DIF, such as Standardization, GMH, or SIBTEST. This, implicitly, can be considered a reassuring point for the developers of dichotomous and ordinal tests. This finding is similar to Gierl, Khaliq and Boughton (2003), who reported that while LDFA has comparable power to GMH and SIBTEST in detecting uniform DIF, it is superior in power for detecting non-uniform DIF. Hosmer and Lemeshow (2000) found that effect size measures (for GMH and SIBTEST) were highly correlated across DIF procedures, except the measure for non-uniform DIF, which could only be assessed by GMH.
These results are also in line with Miller and Spray (2003), who used LDFA and SIBTEST for DIF identification in polytomously scored items and confirmed that, for both item 4 and item 17, the power to detect DIF increased as the DIF magnitude increased. This trend occurred both when there were no missing data and when missing data were present. Conditions with a DIF magnitude of .25 had the poorest power, while conditions with a DIF magnitude of .75 had the highest power. Conditions with a DIF magnitude of .25 typically had power below 70%, the level generally considered adequate. On average, item 17 had slightly higher power values than item 4; this difference may be due to the differing difficulty of the two items. Altogether, these findings provide tacit confirmation of the superiority of GMH over SIBTEST.

CONCLUSION
On the strength of the findings obtained from the study, it can be concluded that the three methods complement each other in their ability to detect DIF in the dichotomous and ordinal test formats: all of them have the capacity to detect DIF but perform differently. From the findings of this study, the following recommendations were made: (i) statistical methods for detecting Differential Item Functioning should be an essential part of test development and test evaluation efforts; (ii) quantitative and qualitative (expert judgment) analyses that can inform the test development process should be conducted after the administration of a test; and (iii) test experts and developers should consider using contingency table approaches, preferably the GMH and LDFA approaches, in DIF detection. DIF testing should be conducted especially for high-stakes tests, such as the psychological instruments used by researchers in Nigerian universities.