AI “Research”: A Dangerous Methodology

Can we talk about how dangerous it is that this methodology is not only being used, but being cited as the new standard for research practice when investigating the use of AI in education?

Despite the bold headline, the truth behind the result comes down to one simple statement: “The AI said that it was doing a good job.” Do you believe that? Have we not all had an experience similar to the poison mushroom meme?

This recent publication makes a very enticing claim about the success of a pedagogically sound AI-based math tutoring model. Unfortunately, there is no evidence that it works in human populations. I would hazard a guess that the majority of readers will stop at the headline and abstract. A reader unfamiliar with educational measurement and methodology might rush to a conclusion. I was about to cite this paper. Thankfully, I read the methods first. You can too:

Lee, U., Shin, M., Jeong, Y., Lee, S., Moon, J., Joo, K., … & Kwon, H. (2026). LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring. arXiv preprint arXiv:2605.27088.

In the description of the setup, the authors use AI as the tutor, the student, and the judge (see page 4, section 3.1):

“We match the setup of Dinucu-Jianu et al. (2025); Lee et al. (2026a). The tutor model is Qwen2.5-7B-Instruct for the NoThink condition with a maximum of 256 output tokens, and Qwen3-8B for all thinking conditions with 384 output tokens and a thinking budget of 1,024 tokens. The student model is LLaMA-3.1-8B-Instruct with a maximum of 512 tokens. Reward judgments are made by GPT-4o-mini (Zheng et al., 2023), and prompt improvements are proposed by GPT-4o as the reflection model. Each dialog runs up to 5 turns under 5 conditions identical to Lee et al. (2026a): NoThink (Qwen2.5-7B), Think NoReward and Think Reward (Qwen3-8B, with/without Rthink), and their pedagogical-seed variants. Whether each condition enables thinking, applies Rthink, or seeds with a pedagogical prompt is encoded in Table 1 as the Th./Th.R/Prompt indicator triplet.”
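To make the point concrete, here is a minimal sketch that lists who fills each role in the quoted setup. The model names come from the passage above; the `Role` structure and the `is_human` flag are my own illustration, not anything from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """One participant in the evaluation pipeline (illustrative structure)."""
    name: str
    filled_by: str
    is_human: bool


# Roles and model names as quoted from section 3.1 of the paper.
SETUP = [
    Role("tutor", "Qwen2.5-7B-Instruct / Qwen3-8B", False),
    Role("student", "LLaMA-3.1-8B-Instruct", False),
    Role("reward judge", "GPT-4o-mini", False),
    Role("prompt reflection", "GPT-4o", False),
]

# Collect any role that a person fills; in the quoted setup there are none.
human_roles = [role.name for role in SETUP if role.is_human]
print(f"Roles filled by humans: {len(human_roles)} of {len(SETUP)}")
```

Every role in the loop, from teaching to learning to grading, is a language model.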

The authors even go so far as to state that their decision to use this exact setup is a limitation of their own work (see page 9, section 8):

“All tutoring dialogs use a simulated student (LLaMA-3.1-8B) rather than real learners; while standard in the field (Dinucu-Jianu et al., 2025; Lee et al., 2026a), student behavior may differ from authentic interactions… Evaluation relies on LLM-as-judge (Zheng et al., 2023) without human evaluation.”

The end result? The only true claim that can be made here is “AI said AI did a good job.”

In large-scale evaluations, an LLM acting as an evaluator hallucinates about as often as it does when generating text in the first place. There is frequently little difference in the percentage of items passing human review before and after the use of an LLM evaluator or fix. Without any human review, this is possibly no better than a redistribution of errors. Did the LLM improve its prompts? What was the quality of the tutoring session before and after? Were failures missed because they were inherited from the training data? Did the students learn anything? Is this a good tutor? Can results be generalized to human students?
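One way to check whether an LLM judge adds anything is to have humans rate a sample of the same items and measure chance-corrected agreement. The sketch below computes Cohen's kappa in plain Python. The labels are made up purely for illustration and are not data from the paper; the function and variable names are my own.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(judge)
    # Fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_counts[c] * human_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of eight tutoring dialogs (illustration only).
judge = ["pass", "pass", "pass", "pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]

raw_agreement = sum(j == h for j, h in zip(judge, human)) / len(judge)
kappa = cohens_kappa(judge, human)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
# prints "raw agreement: 0.50, kappa: 0.00"
```

In this toy example the judge agrees with the human half the time, which is exactly what chance would produce given how often each rater says “pass”. Without a human-rated sample, there is no way to compute a number like this at all.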

How are we defining “pedagogical” within the context of this manuscript? There are numerous definitions throughout, but none are concrete. The word appears 47 times in the manuscript with no clear statement of how the authors intend it to be used.

According to UNESCO, “Pedagogy refers to the art and science of teaching, encompassing various methods and strategies educators use to facilitate learning. It involves understanding how students learn, the design of instructional materials, and the assessment of educational outcomes.”

While it appears the authors have extensively studied the pedagogy they wish to train into their model, the definition seems to shift across the framing of the manuscript. The operational metrics that define the pedagogical “balance” in the introduction are given in one place as “post-test solve rate, leak control, and helpfulness”. On the same page, the researchers suggest they will apply methods “where optimization must jointly balance scaffolding quality, answer leak prevention, and student learning outcomes.” Later, they define “pedagogical priors” as including “scaffolding”, “leak prevention”, and “meta-optimization”. One of these terms has something to do with how students learn, while the other two are methodological constraints to keep the LLM from misbehaving.

I am not saying the model failed the task, or that the engineers who designed the tutor did a poor job. I am saying that there is no way to know if it was successful because I don’t know the goal. And even if I did, the conclusion most certainly fails to match the title. The definition of pedagogical is unclear and the students aren’t human.

And if we know that LLMs are lying, telling us they’ve done a great job or completed a task correctly when even the briefest review shows otherwise, what are we even calling research in a field that uses LLMs or AI to simulate research?

So, to anyone out there pretending to be an AI-Scientist in the field of educational measurement: please respect the profession. If your limitations section suggests that your title is false, don’t waste my time.

Author: Sarah
