Examining the Validity and Reliability of ChatGPT 3.5-Generated Reading Comprehension Questions for Academic Texts

This research examines the capacity of ChatGPT 3.5 to generate reading comprehension questions for academic texts, with a focus on their alignment with higher-order cognitive skills as defined in Bloom's Taxonomy. A paper-based test comprising 30 multiple-choice questions was constructed using ChatGPT 3.5, based on three selected TOEFL ITP reading comprehension passages. The study employed a mixed-methods approach, integrating qualitative content analysis to assess the cognitive level of each question and quantitative methods to analyze student responses. Data collection involved administering the AI-generated questions to students and scoring their responses. Analysis techniques included Pearson correlation coefficients to determine validity and Cronbach's Alpha to measure internal consistency. The findings revealed that ChatGPT 3.5 is capable of producing questions that cover a range of cognitive levels, from analysis to creation; however, only 10 out of 30 questions met the validity criteria, indicating a need for improvement in the AI's question-generation process. The reliability of these questions was moderate, suggesting a reasonable level of internal consistency. The study concludes that while AI-generated questions show promise in educational assessments, ongoing improvement of AI models is necessary to enhance their effectiveness. The implications of this research are significant for the future integration of AI in educational settings, indicating a potential role for AI in developing meaningful assessment tools. The study recommends that future research explore various question types and incorporate student feedback to optimize the effectiveness of AI in education.

The use of artificial intelligence (AI) in education has gained significant attention in recent years, with a growing number of educational institutions and organizations exploring the potential benefits of AI-driven technologies (Su & Yang, 2023). The existing literature has highlighted this potential, with studies demonstrating the increasing utility of AI as a national growth engine and its capacity to provide tremendous value across various educational fields (Su & Yang, 2023). However, the specific application of AI to generating reading comprehension questions for academic texts, and its impact on accurately measuring students' reading comprehension proficiency, remains an area that requires further investigation.
While AI models have shown potential in creating educational materials, there are associated risks, such as the potential diminishment of critical thinking skills and the widening of educational inequalities (Schiff, 2021). This underscores the importance of critically evaluating the validity and reliability of ChatGPT 3.5-generated reading comprehension questions for academic texts. The development and implementation of AI-based interventions aimed at promoting language proficiency and addressing literacy challenges among students with diverse linguistic backgrounds have been identified as crucial areas for exploration (Cukurova et al., 2019). This highlights the need to assess the effectiveness of AI-generated language tests and to explore the potential of AI-generated items in language assessment.
The existing literature has provided valuable insights into the potential applications of AI in education, particularly in language education and assessment. However, there is a notable gap in the literature regarding the specific examination of the reliability and validity of ChatGPT 3.5-generated reading comprehension questions for academic texts. Conducting this research is therefore crucial to address this gap and provide evidence-based insights into the effectiveness of AI-generated reading comprehension questions in accurately measuring students' reading comprehension proficiency in the context of academic texts.

Research Design
The research employed a mixed-methods design combining qualitative and quantitative approaches to comprehensively evaluate the reliability and validity of ChatGPT-generated reading comprehension questions for academic texts (Kim, 2015). The qualitative strand employed content analysis: the questions generated by ChatGPT were analyzed to verify that each question involved higher-order thinking skills (Cognitive Levels 4, 5, and 6), as the prompt instructed. Statistical methods were used to analyze the quantitative data; the validity and reliability of the ChatGPT-generated questions were calculated using SPSS version 29.

Population and Sample
The study targeted students enrolled in the English Education Study Program at the Faculty of Teacher Training and Education, Universitas Al Washliyah Medan, specifically those in the third, fifth, and seventh semesters during the 2023-2024 academic year. The population comprised 42 students; through a random sampling technique involving a wheel spin, 25 students were selected for the research: 10 third-semester, 10 fifth-semester, and 5 seventh-semester students. This sampling method aimed to ensure representation across different semesters, allowing for a diverse range of proficiency levels among the participants, a crucial aspect for the study's objectives.
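The stratified draw described above (10 third-semester, 10 fifth-semester, and 5 seventh-semester students from a population of 42) can be sketched as follows. The per-semester roster sizes and student IDs are hypothetical, since the paper does not report them; only the quotas and the total of 25 come from the study.

```python
import random

# Hypothetical rosters per semester; the paper reports a population of 42
# students but not the per-semester split, so these sizes are assumptions.
population = {
    "semester_3": [f"S3-{i:02d}" for i in range(1, 18)],  # 17 students (assumed)
    "semester_5": [f"S5-{i:02d}" for i in range(1, 16)],  # 15 students (assumed)
    "semester_7": [f"S7-{i:02d}" for i in range(1, 11)],  # 10 students (assumed)
}
# Quotas reported in the study: 10 + 10 + 5 = 25 participants.
quota = {"semester_3": 10, "semester_5": 10, "semester_7": 5}

random.seed(42)  # fix the seed so the draw is reproducible
sample = {sem: random.sample(students, quota[sem])
          for sem, students in population.items()}

total = sum(len(chosen) for chosen in sample.values())
print(total)  # 25
```

Sampling within each semester separately (rather than from the pooled roster) is what guarantees the per-semester representation the study aims for.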

Research Instrument
To gather data, a reading comprehension test was administered as a paper-based test comprising multiple-choice questions with four options each, created by ChatGPT 3.5. These questions were generated from carefully selected academic texts, specifically chosen from TOEFL ITP reading comprehension passages. The chosen material consisted of three expository passages with diverse topics. In summary, passage 1 is about astronomy and telescope technology; passage 2 is about the Impressionist art movement and its technological influences; and passage 3 is about scientific dating methods, focusing on radiocarbon dating and dendrochronology (tree-ring dating) as tools for establishing a time spectrum, the process of tree-ring formation, factors influencing ring thickness, and the correlation of growth rings between trees. While all three passages involve some aspect of technology or scientific method, their specific subjects and contexts differ significantly.
For each passage, ChatGPT 3.5 generated ten questions, resulting in a total of 30 test items. The questions aimed to comprehensively assess the sample's higher-order thinking skills, covering Cognitive Levels 4-6 of the revised Bloom's Taxonomy. The prompt was carefully formulated to elicit optimal reading comprehension questions: "Generate 10 multiple-choice questions based on the provided passage. Ensure that the questions involve higher-order thinking skills, specifically focusing on cognitive levels 4 to 6. For cognitive level 4 (analyzing), use operational verbs such as comparing, organizing, deconstructing, attributing, outlining, finding, structuring, and integrating. For cognitive level 5 (evaluating), use verbs like checking, hypothesizing, critiquing, experimenting, judging, testing, detecting, and monitoring. For cognitive level 6 (creating), employ verbs such as designing, constructing, planning, producing, inventing, devising, and making." The answer key, along with a discussion of each question, was also generated using ChatGPT.

Techniques for Collecting Data
The students were administered reading comprehension tests comprising ChatGPT-generated multiple-choice questions. The generated questions were qualitatively analyzed through content analysis. The students' responses were quantitatively analyzed to assess the validity and reliability of the test through statistical analyses: test validity was assessed with the Pearson Correlation formula to ensure internal validity, while test reliability was measured with Cronbach's Alpha to gauge internal consistency.
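As a minimal sketch of the validity procedure described above (the study itself used SPSS), each item's Pearson correlation with the total score can be compared against a critical r-table value. The toy response matrix below and the critical value 0.396 (approximately the two-tailed 5% critical r for n = 25) are illustrative only, not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def item_validity(responses, r_table):
    """Flag each item as valid when its item-total correlation exceeds r_table.

    responses: one list per student, with 1 = correct and 0 = incorrect.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    return [pearson_r([row[j] for row in responses], totals) > r_table
            for j in range(n_items)]

# Toy data: 5 students x 3 items (hypothetical, not the study's responses).
data = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]
print(item_validity(data, r_table=0.396))
```

An item whose correlation with the total score exceeds the critical value discriminates between stronger and weaker test-takers, which is the sense in which the study's 10 items were deemed valid.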

Validity and Reliability Examinations

Validity Examination
The validity assessment, conducted using the Pearson Correlation formula, indicated that of the 30 questions generated with ChatGPT 3.5, 10 were valid. Under Pearson Correlation standards, if an item's correlation coefficient is higher than the r-table value, the correlation is deemed statistically significant, establishing the item's validity.

Reliability Examination
The statistical analysis of reliability, as indicated by Cronbach's Alpha, yielded a value of 0.671. Cronbach's Alpha is a measure of internal consistency, reflecting how well the items in a scale or test measure the same targeted construct; higher values (closer to 1.0) indicate greater reliability. Here, the value of 0.671 suggested a moderate level of internal consistency among the 30 items included in the analysis. While not exceptionally high, it indicated a reasonable degree of reliability.
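The Cronbach's Alpha the study computed in SPSS follows the standard formula: alpha = k/(k-1) × (1 − Σ item variances / variance of total scores). A minimal re-implementation, with a hypothetical 1/0 scored-response matrix (not the study's data), might look like this:

```python
def cronbach_alpha(responses):
    """Cronbach's Alpha: k/(k-1) * (1 - sum(item variances) / var(total scores)).

    responses: one list per student; each entry is the score on one item.
    """
    k = len(responses[0])  # number of items

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

    item_var_sum = sum(variance([row[j] for row in responses]) for j in range(k))
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical scored responses (5 students x 4 items); illustrative only.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
print(round(cronbach_alpha(scores), 3))
```

Intuitively, alpha rises when the items vary together (shared variance in the totals) rather than independently, which is why it is read as internal consistency; a perfectly consistent set of items yields alpha = 1.0.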

Research Findings
The following is the interpretation of the findings:
1. The results of the content analysis suggested that the ChatGPT 3.5-generated reading comprehension questions effectively met the prompt's objective of constructing questions involving higher-order thinking skills, specifically Cognitive Levels 4-6 of Bloom's Taxonomy. The questions also achieved a well-balanced cognitive-level distribution: 40% at Cognitive Level 4 (Analysis), 30% at Cognitive Level 5 (Evaluation), and 30% at Cognitive Level 6 (Creation). ChatGPT 3.5's alignment with established educational frameworks indicated its effectiveness in generating questions that meet recognized standards for cognitive complexity. Moreover, the generated questions promoted critical thinking and problem-solving among students, suggesting that ChatGPT 3.5 succeeded in encouraging students to think critically, analyze information, evaluate content, and create new ideas, and that the questions were designed to foster a deeper understanding of academic texts. Thus, the interpretation revealed that ChatGPT 3.5 effectively generated reading comprehension questions meeting the specified higher-order thinking skills (Levels 4-6).
2. The findings also reveal a more mixed picture of ChatGPT 3.5's performance. The 10 valid questions out of 30 demonstrate ChatGPT 3.5's ability to generate content in line with the expected constructs; however, the majority of questions did not meet the validity criteria, suggesting a need for improvement in generating relevant and effective questions. The moderate reliability indicates a consistent internal structure among the questions, providing a basis for possible improvement. Overall, while ChatGPT 3.5 shows promise, further adjustment and enhancement are crucial to ensure a more reliable and valid tool for generating academic reading comprehension questions.

CONCLUSION
The ChatGPT 3.5-generated reading comprehension questions for academic texts align effectively with Cognitive Levels 4-6 of Bloom's Taxonomy. ChatGPT 3.5 fulfills the research objective by generating reading comprehension questions that engage students in higher-order thinking skills within the specified cognitive levels. The results suggest its potential as a tool for constructing assessments that measure and enhance students' cognitive abilities in academic contexts.
1. Passage 1 - Telescope Photography and Technology Impact. Focus: astronomy and telescope photography. Main topics: direct photography in astronomy, the use of glass plates, limitations of photography, the impact of technology on modern astronomy (radio and x-ray telescopes), and the role of image processing.
2. Passage 2 - Impressionism and Artistic Innovations. Focus: art and the Impressionist movement. Main topics: the emergence of Impressionism in 1874, dissatisfaction with the academic art establishment, technological innovations in art (new brushes, collapsible tin tubes, and a new palette of colors), and the impact of these innovations on artistic style.
3. Passage 3 - Radiocarbon Dating and Tree Ring Dating. Focus: scientific methods for dating (radiocarbon dating and dendrochronology).

Figure 1. Number of Correct Answers per Question

Based on the analysis of the cognitive levels of the questions, it is evident that the questions are distributed across different cognitive levels. The findings indicated that 40% of the questions were categorized under Cognitive Level 4 (Analyzing), 30% under Cognitive Level 5 (Evaluating), and another 30% under Cognitive Level 6 (Creating). This distribution suggests a balanced representation of different cognitive levels within the assessment, which is essential for evaluating students' comprehensive understanding and application of knowledge. The distribution of questions across these cognitive levels is in accordance with the principles of Bloom's Taxonomy, which emphasizes the importance of assessing higher-order thinking skills such as analysis, evaluation, and creation alongside lower-order cognitive skills. This approach is crucial for promoting critical thinking and problem-solving abilities among students, as it encourages them to engage with the material at a deeper level and apply their knowledge in novel contexts.
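The percentage distribution reported above (40% / 30% / 30% across Cognitive Levels 4-6) reduces to simple counting over the 30 items. The per-item level labels below are hypothetical, since the paper reports only the aggregate shares, not which item received which level:

```python
from collections import Counter

# Hypothetical per-item cognitive-level labels consistent with the reported
# shares: 12 items at level 4, 9 at level 5, 9 at level 6 (30 items total).
levels = [4] * 12 + [5] * 9 + [6] * 9

counts = Counter(levels)
distribution = {lvl: 100 * counts[lvl] / len(levels) for lvl in sorted(counts)}
print(distribution)  # {4: 40.0, 5: 30.0, 6: 30.0}
```

With 30 items, the reported percentages correspond exactly to whole-number item counts (12, 9, and 9), which is consistent with ten questions per passage.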

Table 1. Results of Analysis of Each Question Based on the Specified Cognitive Levels

Table 2 presents the results of the validity assessment.