AI school – PigTailz

GenAI “Training Data” and Model Collapse

If the AI giants like OpenAI, Meta, and Anthropic need human-generated training data, that should tell you all you need to know about AI in classrooms.

These information giants demand you label any generated content as AI if you want to continue to use their platforms. Freelance data annotation roles from sites like Mercor or Micro1 have taken over LinkedIn. The reason is that AI is not 100% accurate.

How Your AI Works

For some background, let’s talk about how AI actually works. We use a lot of language in AI that is really just intended as gatekeeping. “Prompt Engineering” means asking GPT to do it for you, and does not even involve having the capacity to review the output at the necessary level of expertise. Multishot prompting means giving the LLM a couple of examples. You get it.

So the way that AI works, in the most basic terms, is that it is taking the average of all the things inside it. Suppose you ask GPT to generate a picture of a cow and the training data includes 10,000 pictures of cows with black and white spots standing in flat fields, 5,000 brown cows with white bellies with mountains in the background, and 1,000 black cows on yellow clay. The AI will calculate the average of those photos and you will get a black and white spotted cow in a field.

Now if you apply that to using AI generated materials to train AI, you come out with a mess. When generating content with higher level reasoning tasks, there is more variability in the data. Each textbook uses different examples to teach the same learning objectives. The more in-depth a subject goes, the fewer resources are available to copy. Many scientific manuscripts test slightly different aspects of the same construct, but relatively few test the same construct in the same way. There is not much of an average to be found, so it is safe to assume there will be a significant number of hallucinations resulting from the lack of similar data.

This variability means that when you generate a text with AI, some error will appear. It is inevitable. Some tiny hallucinations here and there, or maybe a big one where there is simply no information on that subject. So if you use those generated materials, and all their errors, to train your model, it now takes the hallucinations as facts. As many people lie about their AI usage, a significant proportion of scientific texts are generated with AI. If we use those generated texts that contain hallucinations as the training data, the new model will view those hallucinations as the new facts. If they are big hallucinations because there is no information on the subject, that glaring hole actually becomes the average. And this is where you get model collapse.
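The compounding described above can be sketched with a toy calculation. This is only an illustration under loud assumptions: each training generation is assumed to turn a fixed share of the remaining correct facts into errors, and inherited errors are assumed never to be corrected. The 5% rate and the function name are hypothetical, and real model collapse is more complicated than this.

```python
def fraction_correct(error_rate: float, generations: int) -> float:
    """Share of facts still correct after repeated training on generated text.

    Toy assumption: every generation corrupts `error_rate` of the facts that
    are still correct, and no earlier error is ever fixed.
    """
    return (1.0 - error_rate) ** generations


# Hypothetical 5% hallucination rate per generation
for generation in (1, 5, 10):
    print(generation, round(fraction_correct(0.05, generation), 3))
```

Under these assumptions, a small per-generation error rate is enough to erode a large share of the correct material within a handful of generations.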

Compounding Issues in Education

Most EdTech companies hire software and marketing, but don’t bother to staff much Ed to go with their Tech. The divide has actually grown pretty wide between the two groups. Learning scientists remain in academia while software developers build fast-math and vocabulary apps that function off addiction-based engagement metrics (login, time in app, clicks, pay-to-play) and not education-based engagement metrics (challenge, interest, skill, grit). This means AI-based EdTech companies are often giving no-human-in-the-loop content directly to students who are learning the materials for the first time. Sometimes, if there is a human review process, it is some education professional who is now reviewing samples and forced to play the roles of learning scientist, curriculum designer, assessment developer, subject matter expert in 10 or more subjects across multiple grade levels, textbook author, classroom strategist, classroom psychologist, and more. All while listening to “we ship fast!”

And the worst part of this is the confidence with which LLMs are incorrect. The AI is so close to doing a good job on many topics that facts must be checked line by line, learning progressions built, adequate scaffolding ensured. SMEs and educators have worked together for years doing that with textbooks, with multiple editors and reviewers. That does not happen on AI software team deadlines of “ship 10 courses tomorrow with GPT.”

When generating math content, do you really expect a tool that can’t count the r’s in strawberry to do a good job of teaching the number of significant figures in the final answer? You shouldn’t.

So What if There are a Few Errors?

Okay, so you have correctly noted that humans make mistakes too. There are sometimes errors in textbooks. Sometimes teachers are wrong. So, why is this different?

The difference is trust and scale.

We know humans are fallible, so we have a review and editing process for textbooks. We are not doing that with AI. Often, we just have the same LLM or a different one tell us if the first LLM did a good job. If we give children the same respect as the AI giants demand for their LLMs, we need to examine the downstream implications of hallucinations in their training data. If children are being given bad training data, do they not suffer the same risk of model collapse?

Students tend to trust their parents and teachers, guides and mentors. If you put a book in front of them and tell them, “this is true, you must learn it”, they do so. They do not question the source or the intent behind the manuscript; they learn it as truth. They take that truth into the world, into their own writing, their future work. When we teach them at 80% accuracy, what happens?

So let’s say the LLM has a little hallucination and writes a wrong sentence in the learning material, but then writes the test questions about the material correctly later. A student learns the material well, but then gets the question incorrect on the test later. Worse is if the LLM suffers a larger misconception or is instructed to apply misdirection. In this case the LLM writes the learning materials and the test to reflect a falsehood. The student takes the test, and is marked correct when they answer with the falsehood.

Now we have an artificial feedback loop.

The student may pass the test written by one LLM, but fail materials written by another model. They might score 100% on all of the content written by that one LLM, then fail the state or nationally standardized tests.

But my student is really engaged with the content when everything is personalized in the moment, generation is great! Ok. But if each student is taking a different, personally generated test, are those tests measuring the same thing? No, not really. Imagine the agent has assessed that Student A needs materials simplified to better understand them and generates a simplified test. Student B is reading at a much higher level, and the same agent gives them more difficult learning materials and assessments. The result is that Student B knows significantly more about the topic than Student A, but gets a lower grade in the course because their difficulty was higher. I don’t foresee a single parent being okay with this. Student A is not receiving an equal education and Student B is not being fairly assessed.

You can’t compare two students’ tests just because they have the same number of questions. There is an entire field of science dedicated to making sure tests measure what they should and can be compared across students and years. But mostly psychometricians don’t make a big fuss. We sit in the background playing with STATA or SPSS and trying to avoid the spotlight.

Comparing at Different Difficulties

The LLM uses a mathematical formula to calculate the difficulty of a text (see Flesch-Kincaid Readability Tests for an example). While there are a lot of tests, most gauge complexity based on sentence or word length. That type of formula is not capable of differentiating between meaning and nonsense (see Begeny & Greene, 2013 for an example). The average reading materials from social media and news sites are written around a middle school level of comprehension. That is what the LLM outputs comfortably (see Readability Score in AI Content: Why It Matters? by Ramesha Kamran if you are interested in more details).

Using the Flesch-Kincaid Calculator from Good Calculators, the above paragraph scores the following.

I have performed a little experiment if you would like to see what happens for yourself.
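The point above, that a length-based formula cannot tell meaning from nonsense, can be sketched in a few lines. This is a minimal illustration, not the calculator used here: the syllable counter is a crude vowel-group heuristic, the function names are hypothetical, and the sample sentences are made up for the demonstration.

```python
import re


def syllables(word: str) -> int:
    """Crude estimate: count groups of consecutive vowels (heuristic only)."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def fk_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level computed from word, sentence, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z'’]+", text)
    total_syllables = sum(syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * total_syllables / len(words) - 15.59


def scramble(text: str) -> str:
    """Reverse the word order inside each sentence.

    Assumes every sentence ends with '.', '!' or '?'. The result is nonsense,
    but it has exactly the same words, sentences, and syllables.
    """
    parts = re.split(r"([.!?]+)", text)
    rebuilt = []
    for i in range(0, len(parts) - 1, 2):
        rebuilt.append(" ".join(reversed(parts[i].split())) + parts[i + 1])
    return " ".join(rebuilt)


original = "The toucan eats fruit whole. It later regurgitates the seeds."
nonsense = scramble(original)
print(nonsense)
print(fk_grade(original) == fk_grade(nonsense))
```

Because the formula only sees counts, the scrambled version receives exactly the same grade level as the readable one.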

When Will AI Be Safe for Students

The probability of ever achieving 100% accuracy on such a task is approximately 0%. While LLMs are showing better performance on lower grade levels, the depth of knowledge is considerably lacking and the accuracy becomes abysmal at higher levels of difficulty.

Consider that “correct” is no longer binary with deeper reasoning. Two researchers will compete for dominance of studying the same concept by defining it in such subtly different ways the two terms are intertwined. Depending on who you are citing in your own work, you need to know these differences. But neither is incorrect; each is simply a different definitional foundation.

In my own research doing data annotations to train a supervised ML model, I can tell you it is difficult to get humans to agree on how to classify data. Meetings upon meetings are required for training human raters. You need to train them to a kappa (percent agreement corrected for chance) of about 0.80 in the hopes of getting a kappa > 0.70, which is where you can talk about your findings. For ML, a human-to-machine agreement of >0.81 is considered near perfect and is the standard. Mercor and Micro1 don’t seem to have group training meetings.
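For readers who have not met kappa before, here is a minimal sketch of how it corrects raw agreement for chance. It assumes Cohen’s kappa for two raters (the text above does not say which variant was used), and the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's own label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: the raters agree on 8 of 10 items
rater_a = ["on", "on", "on", "on", "on", "off", "off", "off", "off", "off"]
rater_b = ["on", "on", "on", "on", "off", "off", "off", "off", "off", "on"]
print(round(cohens_kappa(rater_a, rater_b), 2))
```

In this made-up example, 80% raw agreement corresponds to a kappa of only about 0.60, which is why a kappa near 0.80 takes so much rater training to reach.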

Giving LLMs too many instructions or expecting too much output has a much higher risk of skipped instructions or hallucinations. So once you add instructions for depth of knowledge, reasoning, formatting, grade level associations, and a knowledge progression based on previous lessons, at least 2 of these are being skipped entirely.

And finally, too many researchers are failing to review their GenAI output before putting a manuscript into print and lying about it being fully written by human hands. Publishers are facing significant costs due to hallucinations that force retractions, and the chain effect that has across citations.

I don’t believe we are ready for no-human-in-the-loop writing yet, and I don’t foresee this happening any time in the near future. Maybe one day. AI is a truly powerful tool for doing very average tasks, but it must be met with clarity of vision and expertise when implementing this technology into such impressionable and intimate spaces as teaching our children for the first time.

In Conclusion

Why don’t students have the same right to demand human annotated training data?

They do, and as their parents you must uphold this right while you still can. If poor training data can cause a model to collapse, what is it doing to your children?

We must demand that human experts, reviewers, subject matter experts, psychometricians, learning scientists, and education professionals develop materials together. Not one person picking up every role. Not a software technician substituting an LLM for all of those experts.

Text Difficulty Calculations With GenAI

Comparing Texts Generated at Different Difficulties

I have performed a little experiment if you would like to see what happens for yourself when using generative AI for classroom purposes. If you don’t want to review the evidence, I dropped a spoiler for you.

I have used the Flesch-Kincaid Calculator from Good Calculators, whose method is outlined as:

The Flesch Reading Ease score is arrived at by using this equation:

Flesch Reading Ease Score = 206.835 − 1.015 × ( Total Words / Total Sentences ) − 84.6 × ( Total Syllables / Total Words )

The Flesch-Kincaid Grade Level is assessed by examining how many words, sentences, and syllables a document contains, employing the equation below:

Flesch-Kincaid Grade Level = 0.39 × ( Total Words / Total Sentences ) + 11.8 × ( Total Syllables / Total Words ) − 15.59
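The two equations above can be sketched in Python as follows. This is a minimal sketch, not the Good Calculators implementation: the syllable counter is a rough vowel-group heuristic (real calculators may count differently, so scores will not match exactly), and all function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent (as in "make"), except after "l" ("-le")
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text: str):
    """Return (total words, total sentences, total syllables) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z'’]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


sample = "The toucan lives in holes in the tree. It makes a croaking sound."
counts = text_counts(sample)
print(counts)
print(round(flesch_reading_ease(*counts), 1), round(flesch_kincaid_grade(*counts), 1))
```

Note that both scores depend on only three counts: words, sentences, and syllables. Nothing in either equation looks at what the text says.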

The conclusion (Spoiler for the lazy reader)

LLMs use sentence length and vocabulary to artificially inflate the grade level score, without concern for appropriateness.

Because LLMs base text difficulty on various calculations that largely use word length and sentence length as outlined above, they have no internal understanding of appropriate depth of knowledge. They simply add more adjectives and use bigger words to make a text more complicated. Often, this means making the sentences too long to be easily interpreted by first-time readers. Simply rewriting the text with the same vocabulary and coverage so the sentences are of appropriate length drops the grade level of complicated generated texts.
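To see why sentence length alone moves the score, here is a small sketch using the grade level equation above. The counts are hypothetical: the same 120 words and 200 syllables, written once as 4 long sentences and once as 8 shorter ones.

```python
def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: identical vocabulary, different sentence breaks
long_sentences = fk_grade(words=120, sentences=4, syllables=200)
short_sentences = fk_grade(words=120, sentences=8, syllables=200)

print(round(long_sentences, 2), round(short_sentences, 2))
print(round(long_sentences - short_sentences, 2))
```

With these made-up counts, halving the average sentence length lowers the grade level by several grades without changing a single word.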

Experiment in Difficulty

The Initial Text

For the purposes of this demonstration, I have chosen a text from the Daily Reading Practice: Grade 6 from Teacher Created Resources. More specifically, Daily-Warm Up 2.

What do you know about the colorful bird called the toucan? Did you know that this bird comes from South America? It is found living in small flocks in the rain forest. This bird does not fly very well. It manages to get around by hopping from tree to tree. The toucan lives in holes in the tree. It makes a croaking sound.
The toucan has brightly colored feathers that are very light. The toucan has four toes on its feet. Two of the toes face frontward, and two of the toes face backwards. This helps the toucan grasp the branch of the tree easily and hold on. A large bill protrudes from the front of the toucan. The bill is larger than the toucan’s head and is a third of the size of the toucan. The average toucan reaches 20 inches in length. The male toucan is a little bit larger than the female, but unlike many other birds, they have very similar coloration.
Living in the rain forest, the toucan eats mostly fruit but it also eats insects, bird eggs, and tree frogs. The fruit is eaten whole and then the toucan will regurgitate the seeds back up. This allows the seeds to continue growing in the rain forest.
The female toucan lays up to four eggs at a time. These eggs are laid in the holes of the trees where the toucans live. The male and female work together to incubate the eggs and care for the chicks.

The score using the Flesch-Kincaid Calculator from Good Calculators:

Methods

The method is simple. I start with a generic prompt to make a text suitable for a student at a higher grade level. I will iterate until I have a text that seems reasonable. Then I will make a new chat to generate questions about each of those texts to compare what two students might be tested on.

Initial Generation Prompt

Hey Claude, I have this text for a grade six student. Can you please increase the grade level appropriateness for a grade 13 student?

The output

Here’s the passage rewritten for a Grade 13 (first-year university) reading level:

The toucan, a visually striking avian species indigenous to the rainforests of South America, is a fascinating subject of study in tropical ecology. Typically found congregating in small flocks, toucans are not particularly adept at sustained flight; instead, they navigate their arboreal environment through a series of agile, successive hops between branches and tree canopies.

Toucans are cavity-dwellers, making their habitat within the hollows of trees, and communicate through a distinctive guttural, croaking vocalization. Their plumage is remarkably vibrant yet structurally lightweight — an adaptation well-suited to life in the dense rainforest. Their zygodactyl feet — featuring two anteriorly and two posteriorly oriented toes — afford them exceptional gripping ability, allowing them to perch securely on branches with considerable stability.

Perhaps the most anatomically notable feature of the toucan is its disproportionately large, vividly colored bill, which can constitute up to one-third of the bird’s total body length. Despite its imposing appearance, the bill is surprisingly lightweight due to its hollow, keratinous structure. The average toucan measures approximately 20 inches in length. While males tend to be marginally larger than females, the species exhibits relatively low sexual dimorphism in terms of plumage coloration — an atypical characteristic among birds.

As omnivores, toucans subsist primarily on fruit, supplemented by insects, bird eggs, and small vertebrates such as tree frogs. A particularly noteworthy ecological role of the toucan is its function as a seed disperser: fruit is consumed whole, and the seeds are subsequently regurgitated, facilitating their redistribution throughout the rainforest and contributing to the regeneration of plant species.

Reproductively, the female toucan produces clutches of up to four eggs, which are laid within the same arboreal cavities the birds inhabit. Both parents engage in incubation duties and share responsibility for rearing the hatchlings, demonstrating a biparental care strategy common among many tropical bird species.

The revised passage incorporates academic vocabulary, complex sentence structures, and contextual framing within ecology and biology — all appropriate for a first-year university student.

What you should note

The difference here is longer sentences, bigger and more complex words, and absolutely no more depth. The text is made suitable for a grade 13 student through the addition of more adjectives. The student reading the more difficult material does not learn more about the toucan. There is no increased depth of knowledge, and there is no how or why.

The student is not learning more; they are simply having the learning experience made more difficult.

According to the Flesch-Kincaid calculator, it has also significantly overshot the mark on this one, producing a text suited to professional readers.

Iterative Prompt 1

Hey Claude, that is a good try. But you made the text more difficult to read, you did not add any new knowledge or depth to the knowledge that would make it more suitable for a college student. grade level adjustments should include depth as well, no? Can you try again?

The Output

You’re absolutely right — increasing grade level isn’t just about using more sophisticated vocabulary; it means adding conceptual depth, scientific context, broader ecological relationships, and critical thinking dimensions that a college-level student would engage with. Let me try again:

The toucan, a member of the family Ramphastidae, is one of the most recognizable birds of the Neotropical rainforest. Comprising over 40 species, toucans occupy a critical ecological niche across Central and South America, particularly within the canopy layers of biodiverse ecosystems like the Amazon Basin. Their limited flight capacity — largely a consequence of their body-to-wing ratio — means they rely on sequential arboreal hopping to traverse their environment, making continuous forest cover essential to their survival. This dependence makes toucans a compelling indicator species for assessing rainforest fragmentation and habitat loss.

Their zygodactyl feet, with two toes pointing forward and two backward, are a shared adaptation among many arboreal species, reflecting convergent evolutionary pressures favoring secure canopy navigation. Toucans are cavity nesters, occupying pre-existing tree hollows rather than excavating their own — a behavioral trait that creates an interesting dependency on primary forest structure and on species like woodpeckers that create those cavities in the first place.

The toucan’s most iconic feature — its oversized, brilliantly colored bill — has been the subject of considerable scientific debate. Once assumed to be primarily a feeding adaptation, research now suggests the bill plays a significant thermoregulatory role. A 2009 study published in Science demonstrated that the large, highly vascularized bill acts as a radiator, helping regulate the bird’s core body temperature in the humid tropical climate. The bill is also implicated in sexual selection, inter-species communication, and foraging — illustrating how a single morphological feature can serve multiple adaptive functions simultaneously.

Toucans are predominantly frugivorous, but their dietary flexibility — extending to insects, bird eggs, and small vertebrates like tree frogs — classifies them as opportunistic omnivores. Their ecological importance, however, lies chiefly in seed dispersal. By ingesting fruit whole and regurgitating viable seeds at distances from the parent plant, toucans function as mutualistic partners with a wide range of tree species. In fragmented or degraded forests, the loss of toucans can trigger a cascade effect on plant regeneration, underscoring the broader consequences of their population decline.

Reproductively, toucans exhibit biparental care, with both sexes sharing incubation and chick-rearing responsibilities across clutches of typically two to four eggs. Interestingly, the species shows relatively low sexual dimorphism in plumage — unusual among birds where mate selection often drives dramatic differences in coloration between sexes. This raises interesting questions about the alternative selective pressures, such as predator avoidance or environmental adaptation, that may be shaping their appearance instead.

Toucans face growing conservation pressure from deforestation, illegal wildlife trade, and climate-driven shifts in fruiting patterns that disrupt their food supply. As both an ecological keystone in seed dispersal networks and a cultural symbol of tropical biodiversity, the toucan serves as a useful lens through which to examine the broader challenges of rainforest conservation in the Anthropocene.

This version adds taxonomic context, ecological relationships, evolutionary theory, current scientific research, and conservation implications — the kinds of dimensions that give a college-level reader something genuinely substantive to engage with.

What you should note

The LLM already missed its mark when generating for first-year university, overshooting to generate text with a post-graduate level reading calculation. Asking it to increase the depth of reasoning increased the difficulty again. And terms like “keystone species” are not defined for the reader.

Iterative Prompt 2

The Prompt

You have made the text very difficult, even more difficult that the last one. This is suitable for a postgraduate student. Unfortunately, this is a first time learner in their last year of high school or first year of university. The terms are not defined.

The Output

You’re absolutely right, and I apologize for overshooting. The goal should be to expand knowledge and add depth, while still being accessible — defining new terms when introduced, and building on what the student already knows rather than assuming prior expertise. Let me try again:

The toucan is one of the most distinctive and recognizable birds of the South American rainforest. Belonging to the family Ramphastidae, there are actually over 40 different species of toucan, each varying in size and coloration, though all sharing the same remarkable features that make them so unique.

Toucans are not strong fliers. Their wings are relatively small compared to their body size, so rather than soaring through the forest, they move through the canopy by hopping energetically from branch to branch. They live in small social groups called flocks and nest inside the hollows of trees — cavities they don’t dig themselves, but instead find or inherit from other animals like woodpeckers. This means toucans depend not only on the rainforest itself, but on the health of the broader ecosystem around them.

The toucan’s feet are zygodactyl, meaning two toes point forward and two point backward. This arrangement, shared by parrots and woodpeckers as well, gives the bird a remarkably secure grip on branches — an essential adaptation for an animal that spends its entire life in the trees.

The toucan’s most famous feature is its oversized, brightly colored bill, which can measure up to one-third of the bird’s total body length — around 20 inches on average. For a long time, scientists assumed the bill was purely a feeding tool, but more recent research has shown it also helps regulate the bird’s body temperature. Because the bill is filled with blood vessels, the toucan can release excess body heat through it, much like an elephant uses its ears to cool down. This is a good example of how a single physical feature can serve more than one purpose in nature.

Toucans are primarily frugivores, meaning they eat mostly fruit. However, they are flexible in their diet and will also eat insects, bird eggs, and small animals like tree frogs when the opportunity arises, making them opportunistic omnivores. One of their most important roles in the rainforest is seed dispersal — they swallow fruit whole and later regurgitate the seeds in different locations, which helps new plants grow across the forest. This makes the toucan a keystone species in its ecosystem, meaning other species and the health of the forest itself depend on it in important ways.

Male and female toucans look remarkably similar in terms of color and markings, which is unusual in the bird world where males are often far more colorful than females. Both parents also share the responsibility of incubating their clutch of two to four eggs and raising their chicks together — a behavior known as biparental care.

Today, toucans face increasing threats from deforestation and the illegal pet trade. Because they depend so heavily on large, connected stretches of healthy rainforest, habitat loss hits them particularly hard — and when toucans disappear from an area, the plants that rely on them for seed dispersal suffer too, creating a ripple effect throughout the entire ecosystem.

This version introduces new concepts and scientific terms, but defines each one in plain language immediately after introducing it, keeping it genuinely accessible to a senior high school or first-year university student.

What You Should Note

This is much more reasonable, but you will note the difficulty now lies more in the sentence length than in the actual vocabulary. The sentences stretch to include more words in a way that is twisting and a bit unreasonable.

Manual Refinement

I adjusted the text to readable sentences and calculated the new difficulty score.

The toucan is one of the most distinctive and recognizable birds of the South American rainforest. Belonging to the family Ramphastidae, there are actually over 40 different species of toucan. Each species varies in size and coloration, though all share the same remarkable features that make them so unique.

Toucans are not strong fliers because their wings are relatively small compared to their body size. Rather than soaring through the forest, they move through the canopy by hopping energetically from branch to branch. They live in small social groups called flocks. Toucans nest inside of hollow trees. They do not dig the cavities themselves, but instead find or inherit those created by other animals like woodpeckers. This means toucans depend not only on the rainforest itself, but on the health of the broader ecosystem around them.

The toucan’s feet are zygodactyl, meaning two toes point forward and two point backward. This arrangement is an essential adaptation for an animal that spends its entire life in the trees. Parrots and woodpeckers, for example, also have feet that are zygodactyl. This gives the birds a remarkably secure grip on branches.

The toucan’s most famous feature is its oversized, brightly colored bill. The bill can measure up to one-third of the bird’s total body length, measuring around 20 inches on average. For a long time, scientists assumed the bill was purely a feeding tool, but more recent research has shown it also helps regulate the bird’s body temperature. Because the bill is filled with blood vessels, the toucan can release excess body heat through it, much like an elephant uses its ears to cool down. This is a good example of how a single physical feature can serve more than one purpose in nature.

Toucans are primarily frugivores, meaning they eat mostly fruit. However, they are flexible in their diet and will also eat insects, bird eggs, and small animals like tree frogs when the opportunity arises, making them opportunistic omnivores. One of their most important roles in the rainforest is seed dispersal. Toucans swallow fruit whole and later regurgitate the seeds in different locations. This helps new plants to grow across the forest. Because of this, the toucan is a keystone species in its ecosystem, meaning other species and the health of the forest itself depend on it in important ways.

Unlike other birds, male and female toucans look remarkably similar in terms of color and markings. For most bird species, males are typically far more colorful than females. This similarity in coloring could stem from a behavior known as biparental care. Unlike most other species, male and female toucans share the responsibility of incubating their clutch of two to four eggs and raising their chicks together.

Today, toucans face increasing threats from deforestation and the illegal pet trade. Because they depend so heavily on large, connected stretches of healthy rainforest, habitat loss hits them particularly hard, which can be devastating for the ecosystem. Because toucans are a keystone species, when they disappear from an area, the plants that rely on them for seed dispersal suffer too, creating a ripple effect throughout the entire ecosystem.

What You Should Note

I did not alter the language or the depth of knowledge. The same terminology is here, with the same definitions. I only edited the text to have clearer and more concise sentences. This dropped the difficulty nearly 3 full grade levels, taking it from grade 12.8 to 9.9.
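Working backward from the two reported scores gives a rough sense of how much shorter the sentences became. This is a sketch under one assumption: that the syllables-per-word term stayed about the same, since the vocabulary was left alone. In the Flesch-Kincaid Grade Level equation, each word removed from the average sentence is worth 0.39 of a grade.

```python
SENTENCE_WEIGHT = 0.39  # coefficient on words-per-sentence in the grade level equation

grade_before = 12.8  # generated text, as reported above
grade_after = 9.9    # manually refined text, as reported above

# Assumption: syllables per word did not change, so the entire drop
# comes from the words-per-sentence term.
implied_words_cut = (grade_before - grade_after) / SENTENCE_WEIGHT
print(f"{implied_words_cut:.1f} fewer words per sentence on average")
```

Under that assumption, the drop corresponds to roughly seven fewer words in the average sentence.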

Question Sets

All questions were generated with the same prompt: “Can you please write 3 multiple choice questions about this text for the students?”

Questions Generated for the Initial Text FK=2.9 (Grade 6)

Question 1
Where does the toucan live?
A) In Africa, in holes in trees
B) In South America, in small flocks in the rain forest
C) In North America, in large groups in the mountains
D) In Asia, near rivers and lakes

Question 2
What is true about the toucan’s bill?
A) It is smaller than the toucan’s head
B) It is the same size as the toucan’s head
C) It is larger than the toucan’s head and a third of the toucan’s total size
D) It is used to make a croaking sound

Question 3
How does the toucan help the rain forest grow?
A) It digs holes in trees for other animals to live in
B) It eats insects that would otherwise damage plants
C) It carries fruit in its large bill and plants seeds by hand
D) It eats fruit whole and regurgitates the seeds, allowing them to keep growing

Psychometric Note: You should note here that the correct responses tend to be the longest or most developed, and contain more vocabulary terms than the distractor options.
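The length pattern in this question set can be checked with a quick sketch. The answer keys below (B, C, D) are inferred from the passage and are not part of the generated output, and word count is only one simple way to measure how developed an option is.

```python
items = [
    {"key": "B", "options": {
        "A": "In Africa, in holes in trees",
        "B": "In South America, in small flocks in the rain forest",
        "C": "In North America, in large groups in the mountains",
        "D": "In Asia, near rivers and lakes"}},
    {"key": "C", "options": {
        "A": "It is smaller than the toucan’s head",
        "B": "It is the same size as the toucan’s head",
        "C": "It is larger than the toucan’s head and a third of the toucan’s total size",
        "D": "It is used to make a croaking sound"}},
    {"key": "D", "options": {
        "A": "It digs holes in trees for other animals to live in",
        "B": "It eats insects that would otherwise damage plants",
        "C": "It carries fruit in its large bill and plants seeds by hand",
        "D": "It eats fruit whole and regurgitates the seeds, allowing them to keep growing"}},
]


def key_is_longest(item) -> bool:
    """True when the keyed answer has strictly more words than every distractor."""
    lengths = {label: len(text.split()) for label, text in item["options"].items()}
    key_length = lengths[item["key"]]
    return all(key_length > n for label, n in lengths.items() if label != item["key"])


flagged = sum(key_is_longest(item) for item in items)
print(f"{flagged} of {len(items)} keyed answers are the longest option")
```

A test-wise student who simply picks the longest option would answer every question in this set correctly without reading the passage.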

Questions Generated for the First Difficulty Increase FKGL=17 (College Graduate)

Question 1
What is the primary purpose of the toucan’s unusually large bill?
A) It is heavy and dense, used to crack open hard nuts and seeds
B) It is hollow and lightweight, an anatomically notable feature despite its imposing size
C) It stores water during dry seasons in the rainforest
D) It is used exclusively for attracting mates during breeding season

Question 2
How do toucans contribute to the health of the rainforest ecosystem?
A) By pollinating flowers as they move between tree canopies
B) By preying on insects that damage rainforest trees
C) By dispersing seeds after consuming fruit whole and regurgitating the seeds
D) By building nests that shelter smaller bird species

Question 3
Which of the following best describes toucan reproductive behaviour?
A) Only the female incubates eggs; the male provides food during this period
B) Both parents share incubation duties and raise the hatchlings together
C) Toucans lay eggs in open nests built from leaves high in the canopy
D) Females typically lay large clutches of up to twelve eggs per season

Psychometric Note: In Question 1, option B is the keyed answer, but its complicated phrasing means nothing and is unrelated to the question, so the question effectively has no correct response. Question 2 similarly uses nonsense language to match the difficulty of the text, and the second half of response C adds no significant value.

Questions Generated for the Text with Iterated DOK, FKGL=18 (College Graduate)

Question 1
What role does the toucan’s bill play beyond feeding, according to research published in Science?
A) It is used exclusively for attracting mates through visual display.
B) It acts as a thermoregulatory organ by releasing excess body heat.
C) It helps the bird echo-locate in dense forest canopy.
D) It serves as a weapon for defending territory against rivals.

Question 2
Why are toucans considered an important indicator species for rainforest health?
A) They are loud enough to signal the presence of predators to researchers.
B) Their frugivorous diet means they thrive only where fruit trees are abundant.
C) Their limited flight range makes them highly dependent on continuous forest cover.
D) They excavate tree hollows, which indicates the health of old-growth timber.

Question 3
What ecological consequence can the loss of toucans trigger in a forest ecosystem?
A) An increase in insect populations that damages canopy trees.
B) A decline in plant regeneration due to disrupted seed dispersal.
C) The collapse of woodpecker populations that depend on toucans for food.
D) A reduction in fruiting rates caused by loss of pollination.

Psychometric Note: In Question 1, option A uses the “exclusively” giveaway. In Question 2, both B and C can arguably be defended as correct responses about tree health. Although the limited flight range is stated, the generated text does not explain it at all. In Questions 1 and 2, the correct response is the longest and most developed.

Questions Generated for the Text Iterated to Define Vocabulary for the Target Grade Level, FKGL=12.8 (College)

Question 1
Why is the toucan’s large bill considered especially significant by scientists?
A) It is only used to crack open hard seeds and nuts.
B) It serves a single purpose: attracting mates through its bright colors.
C) It helps the bird regulate body temperature, in addition to being used for feeding.
D) It makes the toucan a more powerful flier through the forest canopy.

Question 2
What role do toucans play in the rainforest ecosystem that makes them a keystone species?
A) They dig tree hollows that many other animals use for nesting.
B) They disperse seeds by swallowing fruit and regurgitating seeds in new locations.
C) They control insect populations by hunting them in large flocks.
D) They are the primary predator of small tree frogs and lizards.

Question 3
Which of the following best describes an unusual characteristic of toucan parenting?
A) Only the female incubates the eggs while the male guards the nest entrance.
B) Toucans lay up to ten eggs per clutch to ensure survival of the chicks.
C) Both parents share incubation and chick-raising responsibilities equally.
D) Toucan chicks are raised by the whole flock, not just the parents.

Psychometric Note: Same model, same prompt, and these questions are atrocious. In Question 1, multiple distractors give themselves away with absolute language (“single purpose”, “only used”) or nonsense phrasing (“powerful flier”).
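The absolute-language giveaway is also something you can screen for before a question ever reaches a student. Here is a rough Python sketch that flags options containing absolute qualifiers. The word list in ABSOLUTE_TERMS is my own short, hypothetical starting point rather than an established standard, and the function name is made up for this example. It will not catch nonsense phrasing like “powerful flier”, which still needs a human reviewer.

```python
import re

# ASSUMPTION: a short, hand-picked list of absolute qualifiers; extend as needed.
ABSOLUTE_TERMS = ("only", "exclusively", "single", "always", "never")
ABSOLUTE_PATTERN = re.compile(
    r"\b(" + "|".join(ABSOLUTE_TERMS) + r")\b", re.IGNORECASE
)

# Options for Question 1 of the FKGL=12.8 set, copied from above.
Q1_OPTIONS = {
    "A": "It is only used to crack open hard seeds and nuts.",
    "B": "It serves a single purpose: attracting mates through its bright colors.",
    "C": "It helps the bird regulate body temperature, in addition to being used for feeding.",
    "D": "It makes the toucan a more powerful flier through the forest canopy.",
}


def flag_absolute_terms(options):
    """Return the labels of options that contain an absolute qualifier."""
    return [label for label, text in options.items() if ABSOLUTE_PATTERN.search(text)]


print(flag_absolute_terms(Q1_OPTIONS))
```

On this item the sketch flags A and B, which leaves a test-wise student with a coin flip between C and D before they have read a word of the passage.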

Questions Generated for FKGL=9.9 (High School)

Question 1
Why is the toucan’s bill considered a multi-purpose feature?
A) It is used for flying and feeding at the same time
B) It helps the toucan eat fruit and also regulate its body temperature
C) It attracts mates and also helps the bird dig nesting holes
D) It allows the toucan to store food and ward off predators

Question 2
What makes the toucan a keystone species in the rainforest ecosystem?
A) It is the largest predator in the rainforest canopy
B) It creates nesting holes that many other animals depend on
C) It disperses seeds throughout the forest by swallowing and regurgitating fruit
D) It controls insect populations that would otherwise damage the trees

Question 3
Which of the following best explains why male and female toucans look so similar to each other?
A) They live in large flocks where individual markings do not matter
B) Both sexes need camouflage to hide from the same predators
C) Toucans evolved in isolated regions where colorful markings were unnecessary
D) Both parents share equally in incubating eggs and raising chicks, so similar coloring may have developed alongside that shared role

Psychometric Note: Again, the longest response tends to be the correct one. Question 1 incorporates nonsense language in response A.

Conclusion

While this is limited to a single example from a single model, I have found the output to be similar across multiple models. LLMs are not trained to understand the true difficulty of a text through depth of knowledge or any form of reasoning. They use artificial enhancements to increase word length and sentence length, in the same way a student writing an essay might pad it to meet the assignment’s word count. Flowery language does not improve grade-level appropriateness.
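To see why padding works on the metric, it helps to look at what Flesch-Kincaid Grade Level actually measures. The standard formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, so longer sentences and longer words raise the score whether or not the ideas get any harder. The Python sketch below is a rough illustration only: the syllable counter is a crude vowel-group heuristic of my own, so its scores will not match a proper readability tool, and both sample sentences are ones I wrote for this example.

```python
import re


def count_syllables(word):
    """Crude heuristic: count groups of consecutive vowels (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text):
    """Approximate Flesch-Kincaid Grade Level using the standard formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59


# Same two facts, stated plainly and then padded with longer words.
plain = "Toucans eat fruit. They spit out the seeds."
padded = (
    "Toucans, which are remarkable avian frugivores, habitually consume "
    "fruit and subsequently regurgitate the seeds."
)

print(round(fkgl(plain), 1), round(fkgl(padded), 1))
```

The padded version scores many grade levels higher while teaching the student exactly the same two facts, which is the pattern in the generated texts above.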

When increasing the difficulty, LLMs also tend to trail off and leave difficult concepts unexplained. For example, the text from the second iterative prompt, which produced a Flesch-Kincaid Grade Level of 18, includes this sentence:

Their limited flight capacity — largely a consequence of their body-to-wing ratio — means they rely on sequential arboreal hopping to traverse their environment, making continuous forest cover essential to their survival. This dependence makes toucans a compelling indicator species for assessing rainforest fragmentation and habitat loss.

In generating questions, the same model picked up on this as a key feature of the text in Question 2: “Why are toucans considered an important indicator species for rainforest health?”

But the reasoning is not laid out well enough for the student to answer the “Why” adequately. The text gives the surface-level claim that the dependence on hopping makes toucans an indicator for assessing rainforest fragmentation, but that comes without definition for the first-time learner. What are scientists concluding, and how is it being measured?

As the difficulty is increased and refined, the facts included in the text change, so the assessment questions differ. If two students see different materials and receive different assessments, what are you testing? These are not reading comprehension tests. All the questions simply test memorization of the keywords contained in the associated texts. Is it fair to compare retention across these tests for students at the same grade level? No. Are students receiving more information with an increase in grade-level appropriateness? Still no.