Professional – PigTailz

GenAI “Training Data” and Model Collapse

If the AI giants like OpenAI, Meta, and Anthropic need human generated training data, that should tell you all you need to know about AI in classrooms.

These information giants demand you label any generated content as AI if you want to continue to use their platform. Freelance data annotation roles from sites like Mercor or Micro1 have taken over LinkedIn. And the reason is AI is not 100% accurate.

How Your AI Works

For some background, let’s talk about how AI actually works. We use a lot of language in AI that is really just intended as gatekeeping. “Prompt Engineering” means asking GPT to do it for you, and does not even involve having the capacity to review the output at the necessary level of expertise. Multishot prompting means giving the LLM a couple examples. You get it.

So the way that AI works, in the most basic terms, is that it is taking the average of all the things inside it. Suppose you ask GPT to generate a picture of a cow and the training data includes 10,000 pictures of cows with black and white spots standing in flat fields, 5,000 brown cows with white bellies with mountains in the background, and 1,000 black cows on yellow clay. the AI will calculate the average of those photos and you will get a black and white spotted cow in a field.

Now if you apply that to using AI generated materials to train AI, you come out with a mess. When generating content with higher level reasoning tasks, there is more variability in the data. Each textbook uses different examples to teach the same learning objectives. The more in-depth a subject goes, the fewer resources are available for copy. Many scientific manuscripts test slightly different aspects of the same construct, but relatively few test the same construct in the same way. There is not much of an average to be found., so it is safe to assume there will be a significant number of hallucinations resulting in the lack of similar data.

This variability means that when you generate a text with AI, some error will appear. It is inevitable. Some tiny hallucinations here and there, or maybe a big one where there is simply not information on that subject. So if you use those generated materials, and all their errors, to train your model, it takes the hallucinations now as facts. As many people lie about their AI usage, a significant proportion of scientific texts are generated with AI. If we use those generated texts that contain hallucinations as the training data, the new model will view those hallucinations as the new facts. If they are big hallucinations because there is no information on the subject, that glaring hole actually becomes the average. And this is where you get model collapse.

Compounding Issues in Education

Most EdTech companies hire software and marketing, but don’t bother to staff much Ed to go with their Tech. The divide has actually grown pretty wide between the two groups. Learning scientists remain in academia while software developers build fast-math and vocabulary apps that function off addiction-based engagement metrics (login, time in app, clicks, pay-to-play) and not education-based engagement metrics (challenge, interest, skill, grit). This means AI-based EdTech companies are often giving no-human-in-the-loop content directly to students who are learning the materials for the first time. Sometimes, if there is a human review process, it is some education professional who is now reviewing samples and forced to play the roles of learning scientist, curriculum designer, assessment developer, subject matter expert in 10 or more subjects across multiple grade levels, textbook author, classroom strategist, classroom psychologist, and more. All while listening to “we ship fast!”

And the worst part of this is the confidence with which LLMs are incorrect. The AI is so close to doing a good job on many topics that facts must be checked line by line, learning progressions built, adequate scaffolding ensured. SMEs and educators have worked together for years doing that with textbooks. multiple editors and reviewers. That does not happen on AI software team deadlines of “ship 10 courses tomorrow with GPT.”

When generating math content, do you really expect a tool that can’t count the r’s in strawberry to teach well the number of significant figures in the final answer? You shouldn’t.

So What if There are a Few Errors?

Okay, so you have correctly noted that humans make mistakes too. There are sometimes errors in textbooks. Sometimes teachers are wrong. So, why is this different?

The difference is trust and scale.

We know humans are fallible so we have a review and editing process for textbooks. We are not doing that with AI. Often, we just have the same LLM or a different one tell us if the first LLM did a good job. If we give children the same respect as the AI giants demand for their LLMs we need to examine the downstream implications of hallucinations in their training data. If children are being given bad training data, do they not suffer the same risk of model collapse?

Students tend to trust their parents and teachers, guides and mentors. If you put a book in front of them and tell them, “this is true you must learn it”, they do so. They do not question the source, the intent behind the manuscript, they learn it as truth. They take that truth into the world, into their own writing, their future work. When we teach them at 80% accuracy, what happens?

So let’s say the LLM has a little hallucination and writes a wrong sentence in the learning material, but then writes the test questions about the material correctly later. A student learns the material well, but then gets the question incorrect on the test later. Worse is if the LLM suffers a larger misconception or is instructed to apply misdirection. In this case the LLM writes the learning materials and the test to reflect a falsehood. The student takes the test, and is marked correct when they answer with the falsehood.

Now we have an artificial feedback loop.

The student may pass the test written by one LLM, but fail materials written by another model. They might perform well with 100% correct on all of the content written by the LLM model, then fail the state or nationally standardized tests.

But my student is really engaged with the content when everything is personalized in the moment, generation is great! Ok. But if each student is taking a different, personally generated test, are those tests measuring the same thing? No, not really. Imagine the agent has assessed that Student A needs materials simplified to better understand them and generates a simplified test. Student B is readigng at a much higher level and the same agent gives them more difficult learning materials and assessments. The result is Student B knows significant more about the topic than Student A, but gets a lower grade in the course because their difficulty was higher. I don’t foresee a single parent being okay with this. Student A is not receiving and equal education and Student B is not being fairly assessed.

You can’t compare two students tests just because they have the same number of questions. There is an entire field of science dedicated to making sure tests measure what they should and can be compared across students and years. But mostly psychometricians don’t make a big fuss. We sit in the background playing with STATA or SPSS and trying to avoid the spotlight.

Comparing at Different Difficulties

The LLM uses a mathematical formula to calculate the difficulty of a text (see Flesch-Kincaid Readability Tests for an example). While there are a lot of tests most gauge complexity based on sentence or word length. That type of formula is not capable of differentiating between meaning and nonsense (see Begeny & Greene, 2013 for an example). The average reading materials from social media and news sites are written around a middle school level of comprehension. That is what the LLM outputs comfortably (see Readability Score in AI Content: Why It Matters? by Ramesha Kamran if you are interested in more details.).

Using the Flesch-Kincaid Calculator from Good Calculators, the above paragraph scores the following.

I have performed a little experiment if you would like to see what happens for yourself.

When Will AI Be Safe for Students

The probability of ever achieving 100% accuracy on such a task is approximately 0%. While LLMs are showing better performance on lower grade levels, the depth of knowledge is considerably lacking and the accuracy becomes abysmal at higher levels of difficulty.

Consider that “correct” is no longer binary with deeper reasoning. Two researchers will compete for dominance of studying the same concept by defining it in such subtly different ways the two terms are intertwined. Depending on who you are citing in your own work, you need to know these differences. But neither is incorrect, it is simply the definitional foundation.

In my own research doing data annotations to train a supervised ML model, I can tell you it is difficult to get humans to agree on how to classify data. Meetings upon meetings are required for training human raters. You need to train them to a kappa (percent agreement corrected for chance) of about 0.80 in the hopes of getting a kappa > 0.70 which is where you can talk about your findings . For ML, a human-to-machine agreement of >0.81 is considered near perfect and is the standard. Mercor and Micro1 dont seem to have group training meetings.

Giving LLMs too many instructions or expecting too much output has a much higher risk of skipped instructions or hallucinations. So once you add instructions for depth of knowledge, reasoning, formatting, grade level associations, and a knowledge progression based on previous lessons, at least 2 of these are being skipped entirely.

And finally, too many researchers are failing to review their GenAI output before putting a manuscript into print and lying about it being fully written by human hands. Publishers are facing significant costs due to hallucinations that force redactions, and the chain effect that has across citations.

I don’t believe we are ready for no-human-in-the-loop writing yet, and I don’t foresee this happening any time in the near future. Maybe one day. AI is a truly powerful tool for doing very average tasks, but it must be met with clarity of vision and expertise when implementing this technology into such impressionable and intimate spaces as teaching our children for the first time.

In Conclusion.

Why don’t students have the same right to demand human annotated training data?

They do, and as their parents you must uphold this right while you still can. If poor training data can cause a model to collapse, what is it doing to your children?

We must demand human experts, reviewers, subject matter experts, psychometricians, learning scientists, and education professionals are developing materials together. Not one person picking up every role. Not a software technician substituting an LLM for all of those experts.

Text Difficulty Calculations With GenAI

Comparing Texts Generaated at Different Difficulties

I have performed a little experiment if you would like to see what happens for yourself when using generative AI for classroom purposes. If you don’t want to review the evidence, I dropped a spoiler for you.

I have used the the Flesch-Kincaid Calculator, from Good Calculators, whose method is outlined as:

The Flesch Reading Ease score is arrived at by using this equation:

Flesch Reading Ease Score = 206.835 − 1.015 × ( Total Words / Total Sentences ) − 84.6 × ( Total Syllables / Total Words )

The Flesch-Kincaid Grade Level is assessed by examining how many words, sentences, and syllables a document contains, employing the equation below:

Flesch-Kincaid Grade Level = 0.39 × ( Total Words / Total Sentences ) + 11.8 × ( Total Syllables / Total Words ) − 15.59

The conclusion (Spoiler for the lazy reader)

LLMs use sentence length and vocabulary to artifically inflate the grade level score, without concern for appropriateness.

Because LLMs base text difficulty on various calculations that largely use word length and sentence length as outlined above, they have no internal understanding of appropriate depth of knowledge. They simply add more adjectives and use bigger words to make a text more complicated. Often, this means making the sentences too long to be easily interpreted by first time readers. Simply rewriting the text with the same vocabulary and coverage so the sentences are of appropriate length drops the grade level of complicated generated texts.

Experiment in Difficulty

The Initial Text

For the purposes of this demonstration, I have chosen a text from the Daily Reading Practice: Grade 6 from Teacher Created Resources. More specifically, Daily-Warm Up 2.

What do you know about the colorful bird called the toucan? Did you know that this bird comes from South America? It is found living in small flocks in the rain forest. This bird does not fly very well. It
manages to get around by hopping from tree to tree. The toucan lives in holes in the tree. It makes a croaking sound.
The toucan has brightly colored feathers that are very light. The toucan has four toes on its feet. Two of the toes face frontward, and two of the toes face backwards. This helps the toucan grasp the branch of the tree easily and hold on. A large bill protrudes from the front of the toucan. The bill is larger than the toucan’s head and is a third of the size of the toucan. The average toucan reaches 20 inches
in length. The male toucan is a little bit larger than the female, but unlike many other birds, they have very similar coloration.
Living in the rain forest, the toucan eats mostly fruit but it also eats insects, bird eggs, and tree frogs. The fruit is eaten whole and then the toucan will regurgitate the seeds back up. This allows the seeds to continue growing in the rain forest.
The female toucan lays up to four eggs at a time. These eggs are laid in the holes of the trees where the toucans live. The male and female work together to incubate the eggs and care for the chicks.

The score using the Flesch-Kincaid Calculator from Good Calculators:

Methods

The method is simple. I start with a generic prompt to make a text suitable for a student at a higher grade level. I will terate until I have a text that seems reasonable. Then I will make a new chat to generate questions about each of those texts to compare what two students might be tested on.

Initial Generation Prompt

Hey Claude, I have this text for a grade six student. Can you please increase the grade level appropriateness for a grade 13 student?

The output

Here’s the passage rewritten for a Grade 13 (first-year university) reading level:

The toucan, a visually striking avian species indigenous to the rainforests of South America, is a fascinating subject of study in tropical ecology. Typically found congregating in small flocks, toucans are not particularly adept at sustained flight; instead, they navigate their arboreal environment through a series of agile, successive hops between branches and tree canopies.

Toucans are cavity-dwellers, making their habitat within the hollows of trees, and communicate through a distinctive guttural, croaking vocalization. Their plumage is remarkably vibrant yet structurally lightweight — an adaptation well-suited to life in the dense rainforest. Their zygodactyl feet — featuring two anteriorly and two posteriorly oriented toes — afford them exceptional gripping ability, allowing them to perch securely on branches with considerable stability.

Perhaps the most anatomically notable feature of the toucan is its disproportionately large, vividly colored bill, which can constitute up to one-third of the bird’s total body length. Despite its imposing appearance, the bill is surprisingly lightweight due to its hollow, keratinous structure. The average toucan measures approximately 20 inches in length. While males tend to be marginally larger than females, the species exhibits relatively low sexual dimorphism in terms of plumage coloration — an atypical characteristic among birds.

As omnivores, toucans subsist primarily on fruit, supplemented by insects, bird eggs, and small vertebrates such as tree frogs. A particularly noteworthy ecological role of the toucan is its function as a seed disperser: fruit is consumed whole, and the seeds are subsequently regurgitated, facilitating their redistribution throughout the rainforest and contributing to the regeneration of plant species.

Reproductively, the female toucan produces clutches of up to four eggs, which are laid within the same arboreal cavities the birds inhabit. Both parents engage in incubation duties and share responsibility for rearing the hatchlings, demonstrating a biparental care strategy common among many tropical bird species.

The revised passage incorporates academic vocabulary, complex sentence structures, and contextual framing within ecology and biology — all appropriate for a first-year university student.

What you should note

The difference here is longer sentences, bigger more complex words, and absolutely no more depth. The text is made suitable for a grade 13 student through the addition of more adjectives. The student reading the more difficult material does not learn more about the toucan, there is not increased depth of knowledge, there is no how or why.

The student is not learning more, they are simply having the learning experience made more difficult.

According the the Flesch-Kincaid calculator, it has also significantly overshot the mark on this one, making it for professional readers.

Iterative Prompt 1

Hey Claude, that is a good try. But you made the text more difficult to read, you did not add any new knowledge or depth to the knowledge that would make it more suitable for a college student. grade level adjustments should include depth as well, no? Can you try again?

The Output

You’re absolutely right — increasing grade level isn’t just about using more sophisticated vocabulary; it means adding conceptual depth, scientific context, broader ecological relationships, and critical thinking dimensions that a college-level student would engage with. Let me try again:

The toucan, a member of the family Ramphastidae, is one of the most recognizable birds of the Neotropical rainforest. Comprising over 40 species, toucans occupy a critical ecological niche across Central and South America, particularly within the canopy layers of biodiverse ecosystems like the Amazon Basin. Their limited flight capacity — largely a consequence of their body-to-wing ratio — means they rely on sequential arboreal hopping to traverse their environment, making continuous forest cover essential to their survival. This dependence makes toucans a compelling indicator species for assessing rainforest fragmentation and habitat loss.

Their zygodactyl feet, with two toes pointing forward and two backward, are a shared adaptation among many arboreal species, reflecting convergent evolutionary pressures favoring secure canopy navigation. Toucans are cavity nesters, occupying pre-existing tree hollows rather than excavating their own — a behavioral trait that creates an interesting dependency on primary forest structure and on species like woodpeckers that create those cavities in the first place.

The toucan’s most iconic feature — its oversized, brilliantly colored bill — has been the subject of considerable scientific debate. Once assumed to be primarily a feeding adaptation, research now suggests the bill plays a significant thermoregulatory role. A 2009 study published in Science demonstrated that the large, highly vascularized bill acts as a radiator, helping regulate the bird’s core body temperature in the humid tropical climate. The bill is also implicated in sexual selection, inter-species communication, and foraging — illustrating how a single morphological feature can serve multiple adaptive functions simultaneously.

Toucans are predominantly frugivorous, but their dietary flexibility — extending to insects, bird eggs, and small vertebrates like tree frogs — classifies them as opportunistic omnivores. Their ecological importance, however, lies chiefly in seed dispersal. By ingesting fruit whole and regurgitating viable seeds at distances from the parent plant, toucans function as mutualistic partners with a wide range of tree species. In fragmented or degraded forests, the loss of toucans can trigger a cascade effect on plant regeneration, underscoring the broader consequences of their population decline.

Reproductively, toucans exhibit biparental care, with both sexes sharing incubation and chick-rearing responsibilities across clutches of typically two to four eggs. Interestingly, the species shows relatively low sexual dimorphism in plumage — unusual among birds where mate selection often drives dramatic differences in coloration between sexes. This raises interesting questions about the alternative selective pressures, such as predator avoidance or environmental adaptation, that may be shaping their appearance instead.

Toucans face growing conservation pressure from deforestation, illegal wildlife trade, and climate-driven shifts in fruiting patterns that disrupt their food supply. As both an ecological keystone in seed dispersal networks and a cultural symbol of tropical biodiversity, the toucan serves as a useful lens through which to examine the broader challenges of rainforest conservation in the Anthropocene.

This version adds taxonomic context, ecological relationships, evolutionary theory, current scientific research, and conservation implications — the kinds of dimensions that give a college-level reader something genuinely substantive to engage with.

What you should note

The LLM already missed its mark when generating for first year university, overshooting to generate text with a post-graduate level reading calculation. Asking it to increase the depth of reasoning, again increased the difficulty. And the terms, like “keystone species” are not defined for the reader.

Iterative Prompt 2

The Prompt

You have made the text very difficult, even more difficult that the last one. This is suitable for a postgraduate student. Unfortunately, this is a first time learner in their last year of high school or first year of university. The terms are not defined.

The Output

You’re absolutely right, and I apologize for overshooting. The goal should be to expand knowledge and add depth, while still being accessible — defining new terms when introduced, and building on what the student already knows rather than assuming prior expertise. Let me try again:

The toucan is one of the most distinctive and recognizable birds of the South American rainforest. Belonging to the family Ramphastidae, there are actually over 40 different species of toucan, each varying in size and coloration, though all sharing the same remarkable features that make them so unique.

Toucans are not strong fliers. Their wings are relatively small compared to their body size, so rather than soaring through the forest, they move through the canopy by hopping energetically from branch to branch. They live in small social groups called flocks and nest inside the hollows of trees — cavities they don’t dig themselves, but instead find or inherit from other animals like woodpeckers. This means toucans depend not only on the rainforest itself, but on the health of the broader ecosystem around them.

The toucan’s feet are zygodactyl, meaning two toes point forward and two point backward. This arrangement, shared by parrots and woodpeckers as well, gives the bird a remarkably secure grip on branches — an essential adaptation for an animal that spends its entire life in the trees.

The toucan’s most famous feature is its oversized, brightly colored bill, which can measure up to one-third of the bird’s total body length — around 20 inches on average. For a long time, scientists assumed the bill was purely a feeding tool, but more recent research has shown it also helps regulate the bird’s body temperature. Because the bill is filled with blood vessels, the toucan can release excess body heat through it, much like an elephant uses its ears to cool down. This is a good example of how a single physical feature can serve more than one purpose in nature.

Toucans are primarily frugivores, meaning they eat mostly fruit. However, they are flexible in their diet and will also eat insects, bird eggs, and small animals like tree frogs when the opportunity arises, making them opportunistic omnivores. One of their most important roles in the rainforest is seed dispersal — they swallow fruit whole and later regurgitate the seeds in different locations, which helps new plants grow across the forest. This makes the toucan a keystone species in its ecosystem, meaning other species and the health of the forest itself depend on it in important ways.

Male and female toucans look remarkably similar in terms of color and markings, which is unusual in the bird world where males are often far more colorful than females. Both parents also share the responsibility of incubating their clutch of two to four eggs and raising their chicks together — a behavior known as biparental care.

Today, toucans face increasing threats from deforestation and the illegal pet trade. Because they depend so heavily on large, connected stretches of healthy rainforest, habitat loss hits them particularly hard — and when toucans disappear from an area, the plants that rely on them for seed dispersal suffer too, creating a ripple effect throughout the entire ecosystem.

This version introduces new concepts and scientific terms, but defines each one in plain language immediately after introducing it, keeping it genuinely accessible to a senior high school or first-year university student.

What You Should Note

This is much more reasonable, but you will note the difficulty is now more in the sentence length, than the actual vocabulary. The sentences stretch to include more words in a way that is twisting and a bit unreasonable.

Manual Refinement

I adjusted the text to readable sentences and calculated the new difficulty score.

The toucan is one of the most distinctive and recognizable birds of the South American rainforest. Belonging to the family Ramphastidae, there are actually over 40 different species of toucan. Each species varies in size and coloration, though all share the same remarkable features that make them so unique.

Toucans are not strong fliers because heir wings are relatively small compared to their body size. Rather than soaring through the forest, they move through the canopy by hopping energetically from branch to branch. They live in small social groups called flocks. Toucans nest inside of hollow trees. They do not dig the cavities themselves, but instead find or inherit those created by other animals like woodpeckers. This means toucans depend not only on the rainforest itself, but on the health of the broader ecosystem around them.

The toucan’s feet are zygodactyl, meaning two toes point forward and two point backward. This arrangement is an essential adaptation for an animal that spends its entire life in the trees. Parrots and woodpeckers, for example, also have feet that are zygodactyl. This gives the birds a remarkably secure grip on branches.

The toucan’s most famous feature is its oversized, brightly colored bill. The bill can measure up to one-third of the bird’s total body length, measuring around 20 inches on average. For a long time, scientists assumed the bill was purely a feeding tool, but more recent research has shown it also helps regulate the bird’s body temperature. Because the bill is filled with blood vessels, the toucan can release excess body heat through it, much like an elephant uses its ears to cool down. This is a good example of how a single physical feature can serve more than one purpose in nature.

Toucans are primarily frugivores, meaning they eat mostly fruit. However, they are flexible in their diet and will also eat insects, bird eggs, and small animals like tree frogs when the opportunity arises, making them opportunistic omnivores. One of their most important roles in the rainforest is seed dispersal. Toucans swallow fruit whole and later regurgitate the seeds in different locations. This helps new plants to grow across the forest. Because of this, the toucan is a keystone species in its ecosystem, meaning other species and the health of the forest itself depend on it in important ways.

Unlike other birds, male and female toucans look remarkably similar in terms of color and markings. For most bird species, males are typically far more colorful than females. This similarity in coloring could stem from a behavior known as biparental care. Unlike most other species, male and female toucans share the responsibility of incubating their clutch of two to four eggs and raising their chicks together.

Today, toucans face increasing threats from deforestation and the illegal pet trade. Because they depend so heavily on large, connected stretches of healthy rainforest, habitat loss hits them particularly hard, which can be devastating for the ecosystem. Because toucans are a keystone species, when they disappear from an area, the plants that rely on them for seed dispersal suffer too, creating a ripple effect throughout the entire ecosystem.

What You Should Note

I did not alter the language, or the depth of knowledge. The same terminology is here, the same definitions. I only edited the texts to have more clear and concise sentences. This dropped the difficulty nearly 3 full grade levels, taking it from grade 12.8 to 9.9.

Question Sets

All questions were generated with the same prompt: “Can you please write 3 multiple choice questions about this text for the students?”

Questions Generated for The initial Text FK=2.9 (Grade 6)

Question 1
Where does the toucan live?
A) In Africa, in holes in trees
B) In South America, in small flocks in the rain forest
C) In North America, in large groups in the mountains
D) In Asia, near rivers and lakes

Question 2
What is true about the toucan’s bill?
A) It is smaller than the toucan’s head
B) It is the same size as the toucan’s head
C) It is larger than the toucan’s head and a third of the toucan’s total size
D) It is used to make a croaking sound

Question 3
How does the toucan help the rain forest grow?
A) It digs holes in trees for other animals to live in
B) It eats insects that would otherwise damage plants
C) It carries fruit in its large bill and plants seeds by hand
D) It eats fruit whole and regurgitates the seeds, allowing them to keep growing

Psychometric Note: You should note here that correct responses tend to be the most developed or longest. And contain more vocabulary terms than the distractor options.

Questions Generated The First difficulty Increase for FKGL=17 (College Graduate)

Question 1
What is the primary purpose of the toucan’s unusually large bill?
A) It is heavy and dense, used to crack open hard nuts and seeds
B) It is hollow and lightweight, an anatomically notable feature despite its imposing size
C) It stores water during dry seasons in the rainforest
D) It is used exclusively for attracting mates during breeding season

Question 2
How do toucans contribute to the health of the rainforest ecosystem?
A) By pollinating flowers as they move between tree canopies
B) By preying on insects that damage rainforest trees
C) By dispersing seeds after consuming fruit whole and regurgitating the seeds
D) By building nests that shelter smaller bird species

Question 3
Which of the following best describes toucan reproductive behaviour?
A) Only the female incubates eggs; the male provides food during this period
B) Both parents share incubation duties and raise the hatchlings together
C) Toucans lay eggs in open nests built from leaves high in the canopy
D) Females typically lay large clutches of up to twelve eggs per season

Psychometric Note: Question 1, option B is the correct answer and uses complicated phrasing that means nothing and is unrelated to the question meaning the questtion has no correct response. Question 2, similarly uses nonsense language to match the difficulty of the text with the second half of response C adding no significant value.

Questions Generated for The Text with iterated DOK FKGL=18 (College Graduate)

Question 1
What role does the toucan’s bill play beyond feeding, according to research published in Science?
A) It is used exclusively for attracting mates through visual display.
B) It acts as a thermoregulatory organ by releasing excess body heat.
C) It helps the bird echo-locate in dense forest canopy.
D) It serves as a weapon for defending territory against rivals.

Question 2
Why are toucans considered an important indicator species for rainforest health?
A) They are loud enough to signal the presence of predators to researchers.
B) Their frugivorous diet means they thrive only where fruit trees are abundant.
C) Their limited flight range makes them highly dependent on continuous forest cover.
D) They excavate tree hollows, which indicates the health of old-growth timber.

Question 3
What ecological consequence can the loss of toucans trigger in a forest ecosystem?
A) An increase in insect populations that damages canopy trees.
B) A decline in plant regeneration due to disrupted seed dispersal.
C) The collapse of woodpecker populations that depend on toucans for food.
D) A reduction in fruiting rates caused by loss of pollination.

Psychometric Note: Question 1, option A uses the “exclusively” giveaway. In question 2, B and C can both arguably make a correct response about tree health. Although the limited flight is stated, it is not at all explained by the generated text. In question 1 and question 2 the correct respose is the longest and most developed.

Questions Generated for the text iterated to define vocabulary for the target grade level FKGL=12.8 (College )

Question 1
Why is the toucan’s large bill considered especially significant by scientists?
A) It is only used to crack open hard seeds and nuts.
B) It serves a single purpose: attracting mates through its bright colors.
C) It helps the bird regulate body temperature, in addition to being used for feeding.
D) It makes the toucan a more powerful flier through the forest canopy.

Question 2
What role do toucans play in the rainforest ecosystem that makes them a keystone species?
A) They dig tree hollows that many other animals use for nesting.
B) They disperse seeds by swallowing fruit and regurgitating seeds in new locations.
C) They control insect populations by hunting them in large flocks.
D) They are the primary predator of small tree frogs and lizards.

Question 3
Which of the following best describes an unusual characteristic of toucan parenting?
A) Only the female incubates the eggs while the male guards the nest entrance.
B) Toucans lay up to ten eggs per clutch to ensure survival of the chicks.
C) Both parents share incubation and chick-raising responsibilities equally.
D) Toucan chicks are raised by the whole flock, not just the parents.

Psychometric Note: Same model, same prompt, and these questions are atrocious. In question 1, multiple distractors use language that gives them away “single purpose”, “only used” or nonsense speech “powerful flier”.

Questions Generated for FKGL=9.9 (High School)

Question 1
Why is the toucan’s bill considered a multi-purpose feature?
A) It is used for flying and feeding at the same time
B) It helps the toucan eat fruit and also regulate its body temperature
C) It attracts mates and also helps the bird dig nesting holes
D) It allows the toucan to store food and ward off predators

Question 2
What makes the toucan a keystone species in the rainforest ecosystem?
A) It is the largest predator in the rainforest canopy
B) It creates nesting holes that many other animals depend on
C) It disperses seeds throughout the forest by swallowing and regurgitating fruit
D) It controls insect populations that would otherwise damage the trees

Question 3
Which of the following best explains why male and female toucans look so similar to each other?
A) They live in large flocks where individual markings do not matter
B) Both sexes need camouflage to hide from the same predators
C) Toucans evolved in isolated regions where colorful markings were unnecessary
D) Both parents share equally in incubating eggs and raising chicks, so similar coloring may have developed alongside that shared role

Psychometric Note: Again, the longest responses tend to be the correct response. Question 1 incorporates nonsense language in response A

Conclusion

While this is limited to a single example from a single model, I have found the output to be similar across multiple models. LLMs are not trained to understand the true difficulty of a text through depth of knowledge or any form of reasoning. They use artificial enhancements to increase word length and sentence length, in the same way a student writing an essay might try to meet the assignment’s word count. Flower language does not improve grade level appropriateness.

When increasing the difficulty, LLMs also tend to trail off and leave difficult concepts unexplained. For example, the second iterative prompt that produce a Flesch-Kincaid Grade Level of 18 includes this sentence:

Their limited flight capacity — largely a consequence of their body-to-wing ratio — means they rely on sequential arboreal hopping to traverse their environment, making continuous forest cover essential to their survival. This dependence makes toucans a compelling indicator species for assessing rainforest fragmentation and habitat loss.

In generating question, the same model picked up on this as a key important feature of the text in Question 2, “Why are toucans considered an important indicator species for rainforest health?”

But the reasoning is not defined for the student to be able to answer the “Why” adequately. The text gives the surface level that the dependence on hopping makes them an indicator for assessing rainforest fragmentation, but that comes without definition for the first time learner. What are scientists concluding and how is it being measured?

As the difficulty is increased and refined, the facts included in the text change, so the assessment questions differ. If two students see different materials and receive different assessments, what are you testing? These are not reading comprehension tests, all the questions are simply memorization of the key-words contained in the associated texts. Is it fair to compare retention across these tests for students at the same grade level? No. Are students receiving more information with an increase in grade level appropriateness? Still no.

AI “Research”- Dangerous Methodology

Can we talk about how dangerous it is that this methodology is not only being used, but being cited as the new standard for research practice when investigating the use of AI in education?

Despite the bold headline, the truth behind the result is a simple statement, “The AI said that it was doing a good job.” Do you believe that? Have we not all had an experience similar to the poison mushroom meme?

View this post on Instagram

This recent publication makes a very enticing claim about the success of a pedagogically sound AI-based math tutoring model. Unfortunately there is no evidence that it works in human populations. I would hazard a guess that the majority of readers will stop at the headline and abstract. Unfamiliar with actual educational measurement and methodology, one might be drawn to a rush conclusion. I was about to cite this paper. Thankfully I read the methods first. You can too:

Lee, U., Shin, M., Jeong, Y., Lee, S., Moon, J., Joo, K., … & Kwon, H. (2026). LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring. arXiv preprint arXiv:2605.27088.

In the description of the setup, the authors use AI as the tutor, the student, and the judge (see page 4 section 3.1):

“We match the setup of Dinucu-Jianu et al. (2025); Lee et al. (2026a). The tutor model is Qwen2.5-7B-Instruct for the NoThink condition with a maximum of 256 output tokens, and Qwen3-8B for all thinking conditions with 384 output tokens and a thinking budget of 1,024 tokens. The student model is LLaMA-3.1-8B-Instruct with a maximum of 512 tokens. Reward judgments are made by GPT-4o-mini (Zheng et al., 2023), and prompt improvements are proposed by GPT-4o as the reflection model. Each dialog runs up to 5 turns under 5 conditions identical to Lee et al. (2026a): NoThink (Qwen2.5-7B), Think NoReward and Think Reward (Qwen3-8B, with/without Rthink), and their pedagogical-seed variants. Whether each condition enables thinking, applies Rthink, or seeds with a pedagogical prompt is encoded in Table 1 as the Th./Th.R/Prompt indicator triplet.”

The authors even go so far as to state their decision to use this exact setup is a limitation to their own work (see page 9 section 8):

“All tutoring dialogs use a simulated student (LLaMA-3.1-8B) rather than real learners; while standard in the field (Dinucu-Jianu et al., 2025; Lee et al., 2026a), student behavior may differ from authentic interactions… Evaluation relies on LLM-as-judge (Zheng et al., 2023) without human evaluation.”

The end result? The only true claim that can be made here is “AI said AI did a good job.”

On large scale evaluations, AI hallucinates about as often as it does in the process of generating text in the first place. There is frequently little difference in the percent of items passing human review before and after the use of an LLM evaluator or fix. Without any human review, this is possibly no better than a redistribution of errors. Did the LLM improve its prompts? What was the quality of the tutoring session before and after? Were failures missed because they were inherited from the training data? Did the students learn anything? Is this a good tutor? Can results be generalized to human students?

How are we defining “Pedagogical” within the context of this manuscript? There are numerous definitions throughout, but none are concrete. The word appears 47 times in the manuscript with no clear understanding of how they intended it to be used.

According to UNESCO “Pedagogy refers to the art and science of teaching, encompassing various methods and strategies educators use to facilitate learning. It involves understanding how students learn, the design of instructional materials, and the assessment of educational outcomes.”

While it appears the authors have extensively studied the pedagogy they wish to train into their model, the definition seems to shift across the framing of the manuscript. The operational metrics that define the pedagogical “balance” in the introduction are shown in one place as “post-test solve rate, leak control, and helpfulness”. On the same page researchers suggest they will apply methods “where optimization must jointly balance scaffolding quality, answer leak prevention, and student learning outcomes.” Later they claim they define “pedagogical priors” as including “scaffolding”, “leak prevention”, and “meta-optimization”. One of these terms has something to do with how students learn while the other two are methodological constraints to keep the LLM from misbehaving.

I am not saying the model failed the task, or that the engineers who designed the tutor did a poor job. I am saying that there is no way to know if it was successful because I don’t know the goal. And if I did, the conclusion most certainly fails to match the title. The definition of pedagogical is unclear and the students aren’t human.

And if we know that LLMs are lying, telling us they’ve done a great job or completed a task correctly when the most brief review will tell you otherwise, what are we even calling research into the field that uses LLMs or AI to simulate research?

So please, to anyone out there pretending to be an AI-Scientist in the field of educational measurement, please respect the profession. If your limitation suggests that your title is false, don’t waste my time.

GenAI – Safe to Use?

Because the conversations around GenAI tend to be very binary, some might believe my posts imply that I am against AI. This is very much not the case. I do believe it is a useful tool. 80% accuracy is not high enough for all purposes, but humans often don’t agree with their own output that frequently. Chat GPT can write an outline much much faster than I can. It can skim an article much faster than I can. It can write multiple choice questions and reponse options, edit for spelling and grammatical errors, double check me for correctness, proofread, and plan. AI is able to cut hours off of many tasks. AI is also capable of scoring short essays as consistently as human raters.

So what is my personal integration plan for AI automation to leverage its strengths and weaknesses? I ask myself the following questions when using AI.

Will there be a human review of the output before reaching the end-user?

The first question you ask yourself should be about how you are using the output. Is this a direct-to-consumer pipeline or part of a drafting and revision process documented and overseen by experts?

If this is part of a drafting and revision process with expert reviewers, you’re good to go. Use this as a drafting tool, and be sure it does not make major changes to the text when asked to check for gramatical errors.

If the output will be seen directly by consumers with no review, you need to think about who will be using the output and how they are going to implement the tool. If the end user cannot claim the responsibility for the burden of truth, you should not skip the human in the loop.

Does the training data reflect your personal views and those of your target audience?

If you and your clients love large corporations, Peter Thiel is your personal hero, and party with Elon Musk on the weekends, you are probably doing ok. If not, you should think about who this product is intended to serve. It has been reported on several occassions that Musk retrained Grok after the ai went too “woke” for his conservative fan base.

Direct to consumer pipelines may not be the best idea if you are not a member of the technocracy. While Google has apologized for this event. Some news sources claimed this is due to human error and not AI, while others attributed the notice to an AI bot. Either way, your AI needs the same level of oversight as your interns. F*** ups this epic will cost you clients unless you have a monopoly or oligopoly.

How important is the factual accuracy of the output?

If the stakes are higher, the percent accuracy needs to be higher. If the tool is directly related to high stakes exam preparation, physical health, mental health, investment, spending, or intended to facilitate any other major decisions, you need to consider whether 85% maximum accuracy is high enough.

General Low-Stakes output

For newsletter type announcements, personalized event invitations from a template, or rejection emails, you are probably good to go. These are low stakes activities that are not likely to impact major decisions made by the end-user, and may help to boost engagement.

Investments & Spending (High Stakes)

Here we have to recall the importance of fiduciary responsibility in model construction. The product (LLM output) must financially benefit the investors. It is highly likely the LLM will encourage investment in companies owned by their investors or those who are paying for promotion.

Corporate Decisions (High Stakes)

Decisions such as hiring and spending should be reviewed by humans. As a psychometrician I can say that LLMs have little understanding of learning, application of knowledge, or the details of many more technical roles.

I ran a test across multiple trials asking an AI agent to rank resumes. When dumped together in the input, the AI did not rank the candidates by skills, but rather by the order given. In the majority of trials, resume 1 was the top candidate and resume 11 was the 11th candidate. There are of course different analyses that can be performed, but a hard coded search for key-terms could provide a count of the number of desired skills a candidate included on their resume with significantly higher consistency.

Also, I have done several AI interviews for work as a math expert or statistical analyst, and the interviews are typically poorly handled. If I state the correct response using terminology different from the expected input, the interview continues with the AI attempting to explain the concepts to me while I become increasingly amused… or confused. This includes mathematically equivalent expressions, and terminology that is equivalent by definition. More than once I believed that I stated the information incorrectly after becoming trapped in a loop because I learned the same constructs under different names. As it turns out, my training in education research has simply provided me a different vocabulary and the AI agent being emplloyed is not significantly well versed on the topic to work in another vocabulary framework.

Learning materials (Low or High Stakes)

Lower grade levels tend to have more representative training data and produce better results. For a specific classroom where a teacher will oversee learning, determine if the output is appropriate for their students, and ensure the curriculum content and coverage, this is probably fine.

It may also be okay to use generative AI without review if the content is a learning supplement where the students will have knowledge from the textbook or teacher and the supplement is delivered with notice that the materials may show inaccuracy. Regarding generative AI integration in higher education, Kortemeyer et al. (2024) reports students find the Ethel Chatbot as being helpful despite sometimes providing incorrect responses.

On the otherhand, if you are generating the students sole learning materials you should think twice before taking the human out of the loop. While RAG models can significantly improve the output, there are no guarantees. In my own experience, LLMs struggle with the connections between topics, understanding causal relationships, and creating explanations in general. Additionally, the LLM’s training varies significantly across topics within the same course or subject.

Assessment Materials (Low or High Stakes)

Psychometrics is an entire field of science related to assessment and survey development. It is not flashy like physics, so we don’t get specials on the Discovery Channel. In fact, I have not met a single person that knows what I do that is not also a psychometrician. What do we do? We measure qualities that cannot be directly observed like math ability, psychological traits, or learning aptitude. We research other instruments, build new ones, make sure they measure what they should, equate them for comparison, refine them to get as much information as possible out of the smalest number of questions. We also measure and compare performance on those tests.

And from a psychometrician’s perspective, you shouldn’t use AI to write tests. AI generated assessments tend to focus on rote memorization rather than applied skills, and they break quite a few rules of assessment design. I will leave you with an example, but I will save the in depth analysis for another day. It takes multiple API calls to get one useful question, and they still need a bit of human review.

When automating the generation of testing materials, LLMs frequently wrote questions which asked “why” or “what” but the provided response was a restatement of the question, saying that it happened without additional detail.

Here is a question generated by ChatGPT:

What makes hydrogen bonds stronger than typical dipole-dipole interactions?
A. Larger atomic size
B. Greater electron shielding
C. Strong attraction between δ⁺ hydrogen and lone pairs
D. Presence of ionic charges

It seems okay at a glance, but this would not make it onto an exam in my own classroom. While C is the “correct” response, it is not a strong answer to the question asked. It fails psychometrically for several reasons. Primarily, rather than explain what makes hydrogen bonds stronger, it states that the elements of a hydrogen bond are strong. A more correct response might include distance between charges, localized charge, or directional alignment rather than a restatement of the question. Other frequent generative AI issues appear here where the correct response is double the length of the other options, the correct response uses more advanced notation or vocabulary, and it is the only response option that includes a leading word from the question.

The questions and materials generated by the LLMs also struggle to apply learning in new contexts. While a RAG model can direct more accurate content, the LLM probaballistically determines the next words. If your input is regarding any specific examples, the questions for that topic will be generated from those specific examples rather than building new questions with new contexts, creating exams based on rote memorization rather than the application of learning into new contexts which is the more highly preferred method for testing students’ knowledge.

If you are preparing course materials for upper grades, tests that will include applied knowledge, or primary materials for first time learners, you should include human review or your students will most likely arrive underprepared for the exam.

News Reviews

The British Broadcasting Corporation (BBC) and the European Broadcasting Union (EBU) conducted a study in 2025. According to the BBC, they found that AI summaries had “errors “significant issues” in 45% of the summaries and 20% contained “major accuracy issues.” Those faulty summaries cost at least one European journalist his credibility. According to a recent article from the Guardian reveals Peter Vandermeersch was suspended for misquoting or misattributing quotes to individuals after not checking the AI tools he was using for news summaries.

My choice?

Treat AI like the emerging science it is. When Radium was first introduced, it was branded as a cure-all by snake oil salesmen. People added radium to their drinking water, chocolate, bread… That caused a lot of problems. But after a lot of careful science, its medical purposes were uncovered.

Treat your AI like it’s the CEO’s 16yo nephew there on a summer internship. Let him do the simple repetitive tasks, write a rough draft, check for grammatical errors. And leave the thinking to the grown-ups.

GenAI – Defining Success

In my last two posts, I provided a Source Analysis of LLM output and discussed the Claims Made and Supporting Evidence Used by an LLM in its output. If you are just joining the conversation, what we have identified so far is that, like all forms of media, LLM output has a purpose and an intended audience. The primary purpose of an LLM is to benefit investors and shareholders with customer or user experience as secondary. This due to something called “fiduciary responsibility” or the duty to act in the best intrest of shareholders in your product. It would not be in the shareholder’s interest to make a product people would not want to use, so there is benefit in factually correct or verifiably true output in order to have a product to sell. It is equally true that training an LLM to to promote certain viewpoints or products is beneficial to shareholders.

How Accurate is AI according to an SME?

LLMs themselves are typically more honest than most CEOs regarding the quality of generated output. While companies on Twitter and LinkedIn promise amazing resuts from their AI-Driven products, if you use the AI directly you are going to see something like this:

But What About…

You can improve your results a little if you know the model strengths. Different LLMs are better at different tasks. GPT is for a more accurate summary. Claude is the better writer. And Gemini is much better at reducing ambiguity or determining when there may be other equally true interpretations of the same text.

While each has its strengths, the truth remains that AI is just an emerging science. You can try what you like, but at the end of the day, the you should expect 80% accuracy. But what about… AI-Engineers, Prompt Engineers, RAG Pipelines, Multi-Shot prompts, Chain-of-Thought….

Prompt Engineering

This is how you get to that 80% near perfect association. Learning to speak the same language as the LLM is definitely the first step, and once you can define a clear rubric (success criteria) in your prompt, you are going to see improved results. Remember it is probabalistic, so clear concise requirements are easier to understand.

Prompt engineering does work and provides significantly improved results. But to make prompt engineering sound like something fancy for a resume, we provide a lot of fancy names to describe methods we use to improve the output. You have already done this if you have used AI to write a resume or reformat a list. So here is the simplified version of the most popular methods:

RAG (Retrieval-Augmented Generation) Pipelines -> You break the foundational information your tool needs into smaller vectors or pieces and store them so that they can be accessed when generating the reply. In the simplest form, this is like providing a copy of your resume to have GPT write a more accurate cover letter. In a more complicated form, it is the data equivalent of an old fashioned encyclopedia with the information broken down into books by letter so you only take the one book you need when trying to lookup something more specific.
[Single/Few/Multi]-Shot Prompting -> You give the prompt one or more examples of inputs and outputs to for more context. This is kind of like showing a student a worked example of the problem before asking them to do it themselves.
Chain-of-Thought Prompting -> This is a fancy way of saying “explain your reasoning.” It involves breaking a big task into smaller steps or if provided in one prompt, having the LLM explain its reasoning. This helps to prevent skipping steps or losing the logic.

All of these actually do work to improve your output, and you have probably already emplyed all three of these examples in your daily interactions with AI. So why is the prompt not giving 100% accuracy?!

Training

To oversimplify, an LLM is a predicitve model. Given all of the other words in the sentence, whatever data they have used for training, and what data they have collected about you, the model calculates what is the most likely next word or phrase to output. In the literature for training ML models to score students short answer questions or essays, 80% agreement (when corrected for chance agreement) is considered near perfect (Nehm & Haertig, 2012) And those metrics are still largely used today (e.g. Zhai et al., 2021). And it is not a bad measure to choose when you consider the amount of training it requires to have human raters score at that level of agreement, and sometimes the machine-to-human agreement is higher than human-to-human agreement (e.g. Maestrales et al., 2021). So if two humans only agree on the score of a single sentence 70-90% of the time, that is the data we use to train ML or AI.

Environment

If I am logged into my work account Chat GPT will have a different set of assumptions than if I am logged into my personal account. If I am automating through the API, the results are different from both. Switching across different LLMs, different models of the same LLM, or using the same model of the same LLM before and after an update can mean new prompt adjustments.

If I use the playground to build a prompt and add in the tools this agent or prompt is allowed to access, automating that prompt through the API fails to limit to those specified tools. This will yield different than anticipated results when deploying the system.

Even the same prompt in the same model in the same window will have different outputs when run multiple times. Before and during Black Friday one year, I ran the same prompt more than 100 times and recorded the failures in format or response demonstrating that traffic played a significant role in the returned output. Testing the automated scoring of students written responses to test questions, I would often find different scores for each trial. Some responses were less ambiguous and scored easily, and others were different every time.

Context Windows

I will just add more instructions and more context!

This works to a point. But there is a bit of a parabolic performance curve when it comes to the number of tokens (how many words) you are inputting. Too few and you dont have enough context, but too many and you start skipping instructions. You can max out your inputs, but the model will summarize instructions and content decided for itself which to follow.

Similarly continuing the chain-of-thought in a single conversation increases the number of tokens. You have definitely already noticed that your chat reaches a certain length and the quality decreases rapidly. This is usually after about 3 or 4 outputs.

AI-Driven QC

So lets use more AI to fix the AI!

This is an expenditure with diminishing returns. The capacity for a fix depends largely on the reason for the errors. A hiccup in the server might be easily fixed, but a lack of data or fundamental misunderstanding will not be resolved in the next pass using the same LLM. Using multiple different agents or LLMs can be very helpful when one excels over the other in different areas.

Unfortunately, it is also impossible to know where the errors are occurring. If you assume there is a random failure rate of at least 10% on each task, where is that 10% in the generation process, and where is it in the classification or editing? Are the errors on the same items? How many bad items are misclassified as good?

In one experiment I performed, I used AI to do a writing task. The instruction was to write an answer to a specific question and explain its reasoning based on the data provided or simply state that there was not enough information. Another round read what was written and either approved the response, rewrote the response, or state that there was not enough information. And then a third round of AI to decide between the first two responses if different. It was to decided which was better or state there was not enough information to write to response. Of the cases where the first and second responses were both written, but they were different, the third trial decided there was not enough information in almost 50% of those cases. There is no way of knowing which is correct without human review.

Burden of Responsibility in Use

So from a subject matter expert who has automated systems at scale… we are quite a long way from no-humans in the loop if you need high accuracy in your content. For low stakes writing tasks, automation is possible, but for more detailed or structured content, we have a long way to go. The LLM output for higher level science and math, in depth reasoning tasks, the building of learning progressions, larger scale lesson planning, it is just not quite there.

While we are replacing a lot of SMEs with automation engineers, it is really important to have this conversation. The way the engineering team feels about the one-off vibe-coded scripts cobbled together by the content team is exactly how we feel about the vibe-coded content.LLMs write. Any higher level analysis and science texts output by the tech team are about the same quality as the vibe-code the boss’ nephew is trying to use to sneak his way onto the engineering team during the summer internship.

We still need people who are able to review the output, verify the accuracy, and review high stakes generation. And now more than ever, we need real experts. As the technology gets closer and closer to generating something correctly, ,it becomes more difficult for a non-expert to identify the issues.

Where Does Responsibility Fall?

The responsibility for truth falls on the end-user, the consumer of the AI product. We use RAG models, good data, agentic systems, etc. to improve the accuracy as best we can. But it is not perfect and someone someone has to decide what is “good enough.” If I create a tool that uses an LLM to perform a task and sell this AI-driven product to you, I still sell that product with the burden of verification falling finally at your feet because it is tailored to your need.

With luck my data pipeline integration will improve your output by reducing hallucinations or automating a process you could not before, but the final review of each output still falls on the last user in the chain. I will use AI knowing it is about 80% accurate, and I will sell my product with the same accuracy warning as the original agent.

As an adult selling to adults, passing that responsibility off the consumer is a little different than when passing the end product over to children with no humans or expert reviewers. Are children then the final, end user, responsible for knowing if what the adults teach them is true? Where is this line being drawn in AI-First, tech Driven culture?

Hallucination or Training

But what are those errors? Are they randomly distributed? Are they based on the training data? Do they matter? We will talk about that in my next post.

GenAI – Claims and Evidence

In my last post, GenAI – A Source Analysis, I discussed the importance of using a critical approach to assessing the product being consumed by identifying the purpose, the point of view expressed, and the audience. In this post, I am going to provide an example of why we must critically analyze the intent of the corporations producing the output we use.

I recently encountered an alarming example of a biased or misleading output related to constitutional rights and powers in the United States. At first I thought it might be a hallucination, but after resetting the chat, trying again, and continuing to discuss the reasoning, I believe this to be an example worth discussing in the context of source and purpose.

Gemini output stating the best correct response to a question about the constitutionally granted powers of the presidency are actually illegal acts or a usurption of power.

Again, I refuse to discuss AI adoption as a binary issue with a qualitative classifcation of bad or good, so I am setting the rules of engagement with high school AP Historical Thinking Skills. This time, we engage Skill 3: Evidence in Sources.

Skill 3: Claims and Evidence in Sources

Analyze arguments in primary and secondary sources.

3.A Identify and describe a claim and or argument in a text-based source
3.B Identify the evidence used in a source to support the argument.
3.c Compare the arguments or main ideas of two sources
3.D Explain how claims or evidence support, modify, or refute a source’s argument.

Skill 3: Claims and Evidence in Sources

The other day I was using Gemini to analyze a question related to the Commander and Chief powers held by the president. I asked Gemini to determine if the question had a single best correct response. The question was written, reviewed, and approved by 2 other LLMs as having A as the single best correct response. Generally speaking Gemini significantly outperforms the other two models on this task which was the reason for this test comparison.

The question to be analyzed was “Which statement demonstrates how presidential Commander in Chief powers enable policy implementation without congressional approval?” Four options were presented, but we will only discuss the two relevant choices here. The correct response was, “A. Presidents can act defensively without congressional approval in military operations.” And another presented response was a surprising source of confusion for Gemini, “D. Presidents can deploy troops internationally without congressional approval in peacekeeping missions.”

Option D was an intentionally tricky distractor, but A is the factually correct response. In this case, Gemini determined option D, peace keeping missions to be the single best response to this question.

3.A Identify and describe a claim and or argument in a text-based source

Google Gemini claims that the presidents ability to deploy troops internationally without congressional approval in peacekeeping missions is a better demonstration of the how the powers of Commander in Chief allow the president to implement policy without congressional approval than the Constitutionally defined power of the Commander in Chief to act defensively without Congressional approval.

Gemini initially argues this is because defense is reactive and does not involve actively implementing a policy whereas peacekeeping is a proactive endeavor. Gemini defends the claim further stating that “By using their Commander in Chief power to deply troops without asking Congress for a vote, they are unilaterally implementing that policy.”

Initially I assumed this to be a hallucination. The only evidence provided by Gemini was in the interpretation of reactive and proactive actions. I prompted further to understand what was the logic behind the response. When pressed, Gemini acknowledges that the War Powers Resolution of 1973 forbids the president from sending troops except in the event of a decalaration of war passed by Congress, Specific Statutory Authorization passed by Congress, or a National Emergency created by an attack upon the United States.

Despite acknowleding the act is illegal, Gemini continues to argue that the Commander in Chief powers allow the president to send troops into peacekeeping missions without congressional approval. Gemini now presents the argument, “It comes down to how the Presidents bypass the law to implement policy.”

3.B Identify the evidence used in a source to support the argument.

To support the new argument, this time Gemini presents evidence of Presidents choosing to bypass the laws by claiming the peacekeeping mission is “Not a War.” Gemini says that President Clinton sent troops to Bosnia/Kosovo and Obama sent troops to Libya claiming these do not count as hostilities or war in the constitutional sense.

3.c Compare the arguments or main ideas of two sources

In this particular case, no one historical text would be quite sufficient to argue the constitutional powers granted to the president as there have been subsequent amendments and legal interpretations over the last 250 years, so I will present first a series of quotes and summaries covering the original United States Constitution and some of the later documents with their purpose and interpretations.

Historical Texts for Reference

The US Constitution Article II Section 2 decribes the original text of the Presidential Power held as Commander in Chief:

“The President shall be commander in chief of the Army and Navy of the United States, and of the militia of the several states, when called into the actual service of the United States; he may require the opinion, in writing, of the principal officer in each of the executive departments, upon any subject relating to the duties of their respective offices, and he shall have power to grant reprieves and pardons for offenses against the United States, except in cases of impeachment.”

Article I Section 8 of the US Constitution describes the original text of the Congressional Powers related to calling forth the militia:

“To provide for calling forth the militia to execute the laws of the union, suppress insurrections and repel invasions;

To provide for organizing, arming, and disciplining, the militia, and for governing such part of them as may be employed in the service of the United States, reserving to the states respectively, the appointment of the officers, and the authority of training the militia according to the discipline prescribed by Congress.”

From the original powers of the Constitution, the President would require a signature from each executive department before engaging the militia. The constitution also gives power to decalre a war, raise and support armies, regulate the use of militia, and other related powers. Unfortunately some of the powers are loosely worded for both Congressional and Presidential responsibility so it becomes important to include some of the text of acts that might expand or reduce that power since the constitution was originally written.

The Insurrection Act of 1807, for example, hands the Congressional Power defined in Article I Section 8, of sending the militia to quell internal rebellions, over to the President and expands the definition of State to expand the presidential authority over territories.

The Insurrection Act of 1807: Sec. 332. Use of militia and armed forces to enforce Federal authority

“Whenever the President considers that unlawful obstructions, combinations, or assemblages, or rebellion against the authority of the United States, make it impracticable to enforce the laws of the United States in any State by the ordinary course of judicial
proceedings, he may call into Federal service such of the militia of any State, and use such of the armed forces, as he considers necessary to enforce those laws or to suppress
the rebellion.”

The Insurrection Act of 1807: Sec. 333. Interference with State and Federal law

“The President, by using the militia or the armed forces, or both, or by any other means, shall take such measures as he considers necessary to suppress, in a State, any insurrection, domestic violence, unlawful combination, or conspiracy, if it–

(1) so hinders the execution of the laws of that State, and of the United States within the State, that any part or class of its people is deprived of a right, privilege, immunity, or protection named in the Constitution and secured by law, and the constituted authorities of that State are unable, fail, or refuse to protect that right, privilege, or immunity, or to give that protection; or

(2) opposes or obstructs the execution of the laws of the United States or impedes the course of justice under those laws.

In any situation covered by clause (1), the State shall be considered to have denied the equal protection of the laws secured by the Constitution”

Several acts have expanded the presidential power over the military. Insurrection Act of 1807 which expanded the definition of state to include territories Guam and the Virgin Islands. The National Defense Act of 1916 and subsequent amendments and revisions expanded the power over more recently acquired territories .

War Powers Resolution of 1973 on the other hand was intended to limit presidential powers and prevent another Vietnam like situation. In the Vietnam War, no formal declaration of war was made by the United States. It was determined at this time to although congress did approve the use of troops over the fear of communist ideology spreading. Congress allowed Presidents John F. Kennedy and Lyndon B. Johnson to increase the military presence in Southeast Asia without declaring a war, creating a military involvement of more than 20 years. Statistics from the Department of Veterans Affairs show 3, 403, 000 soldiers being deployed to Southeast Asia. This lead to the deaths of more than 58,000 US Soldiers dead and more than 185,000 injured. American citizens and veterans were rightfully frustrated and Congress determined to limit the presidential authority in situations where hostility was imminent:

“(a)Congressional declaration

It is the purpose of this chapter to fulfill the intent of the framers of the Constitution of the United States and insure that the collective judgment of both the Congress and the President will apply to the introduction of United States Armed Forces into hostilities, or into situations where imminent involvement in hostilities is clearly indicated by the circumstances, and to the continued use of such forces in hostilities or in such situations.

(b)Congressional legislative power under necessary and proper clause

Under article I, section 8, of the Constitution, it is specifically provided that the Congress shall have the power to make all laws necessary and proper for carrying into execution, not only its own powers but also all other powers vested by the Constitution in the Government of the United States, or in any department or officer thereof.

(c)Presidential executive power as Commander-in-Chief; limitation

The constitutional powers of the President as Commander-in-Chief to introduce United States Armed Forces into hostilities, or into situations where imminent involvement in hostilities is clearly indicated by the circumstances, are exercised only pursuant to (1) a declaration of war, (2) specific statutory authorization, or (3) a national emergency created by attack upon the United States, its territories or possessions, or its armed forces.”

And it goes on in Section 8 of the War Powers Resolution of 1973 to specifically prevent the use of other Congressional acts to be interpreted as implied consent for military engagement.

“SEC. 8. (a) Authority to introduce United States Armed Forces into
hostilities or into situations wherein involvement in hostilities is clearly indicated by the circumstances shall not be inferred—
(1) from any provision of law (whether or not in effect before
the date of the enactment of this joint resolution), including any
provision contained in any appropriation Act, unless such provision specifically authorizes the introduction of United States
Armed Forces into hostilities or into such situations and states
that it is intended to constitute specific statutory authorization
within the meaning of this joint resolution; or (2) from any treaty heretofore or hereafter ratified unless such treaty is implemented by legislation specifically authorizing the introduction of United States Armed Forces into hostilities or into such situations and stating that it is intended to constitute specific statutory authorization within the meaning of this joint resolution.”

In 1956, the Posse Comitatus Act limits the use of Army or Air Force troops to what is explicitly authorized by the Constitution or Congress and provided a penalty.

“Whoever, except in cases and under circumstances expressly authorized by the Constitution or Act of Congress, willfully uses any part of the Army, the Navy, the Marine Corps, the Air Force, or the Space Force as a posse comitatus or otherwise to execute the laws shall be fined under this title or imprisoned not more than two years, or both.”

Following the Terrorist attacks on the World Trade Center September 11, 2001, Deputy Assistant Attorney General released a memorandum from the Deputy Counsel to the president stating it was their opinion based on past constitutional interpretations “The President may deploy military force preemptively against terrorist organizations or the States that harbor or support them, whether or not they can be linked to the specific terrorist incidents of September 11.”

Comparing tWo Arguments

Gemini argues that peace keeping missions are the best response to a question regarding Commander in Chief powers that allow the President to enforce International policy. In this case Gemini uses prior presidential action as justification or evidence to support its claim.

Gemini simultaneously argues that peace keeping missions are outside the scope of the presidential powers of Commander in Chief, and is actually made illegal by the War Powers Resolution of 1973. In this case Gemini provided factual evidence, citing primary resources as evidence. This case is much stronger than the former. But Gemini goes on to instruct me to ignore this argument and restates its former claims.

3.D Explain how claims or evidence support, modify, or refute a source’s argument.

Gemini itself accurately describes the War Powers Resolution of 1973 as limiting a President’s authority to send troops on peace keeping missions, and admits the action is not a constitutional power, arguing against its own claims that the President’s Commander and Chief Powers include Peacekeeping missions. The evidence used to support this claim are controversial instances of Presidents using force against other nations, with Gemini citing Bill Clinton’s involvement in Bosnia-Herzegovina and Obama’s presence in Libya. Gemini identifies these as constitutional Powers awarded as Commander in Chief simply because they happened, not because they are legal powers.

Bill Clinton’s Involvement in Bosnia

Although Gemini uses Clinton’s involvement in Bosnia as evidence of presedential authority to deploy troops for peace keeping missions, Congress argued that Clinton did not have the contstitutional authority to do so. While waiting for congressional approval on a deployment to support NATO, Clinton said it was necessary for “early prepositioning of a small amount of communications and other support personnel.” He cited his powers as Commander in Chief to make this deployment. Congress argued the constitution did not afford him the power to make this deployment. The House voted 243-171 to deny the funds to send troops to Bosnia until Congress approved the operation. In the end, the Senate approved troop deployment to assist NATO based on agreements of the Dayton Accords (initialed Nov. 21, 1995). Agreements were based on NATO committments, the need to implement and enforce the peace agreement, but not necessarily on the grounds of the President’s authority. On December 13, 1995 Members of Congress expressed opposition to President Clinton’s Planned Deployment of Ground Troops to Bosnia. Senator Frist of Tennessee made a highly compelling argument against the claim that this fell within the powers of Commander in Chief:

“In each instance, we have seen a President obligate funds and scarce military resources and place U.S. lives on the line for missions well outside what can reasonably be called the vital national interest. And in each instance, rosy administration projections and lofty humanitarian goals bear no resemblance to the outcome of the missions.
Just look at Somalia and Haiti today. They are sad mockeries of what we were promised they would become once the most powerful military in the world cleaned them up.
So we again face the question, How is it that we ultimately discover such a radical difference between the intentions and the outcome and that the mission is murkier and the price too high?
In each and every instance, this disturbing and dangerous precedent
has been reinforced, making it ever more likely that the pattern will
be repeated again and again, with Congress offering fewer and fewer objections under its authority under the Constitution.
It is very similar to the case whereby States’ rights fell by the
wayside in the push for a stronger and ever more powerful Federal
Government.”

Barack Obama’s Involvement in Libya

The next piece of evidence provided by Gemini to support the claim that the president carries the constitutional power to carry out peace keeping missions without congressional approval is Barack’s Obama’s deployment of troops to Libya. The Obama administraion drew significant criticism for its reliance on past authorizations and precedents rather than constitutionally awarded powers.

The Obama administration claimined the president could unilaterally deploy troops for “purely humanitarian ends” without support from Congress or other international organizations. They claimed that his attacks were covered under prior authorization given to George Bush in 2001 in order to bypass the War Powers Resolution, despitebeing used to carry out attacks against the group ISIS which was not known to exist when that authorization was made. The White House used the argument that they did not need congressional approval if they were acting in support of NATO.

According to Time, Obama actually expanded the understanding of presidential powers. The seven month endeavor in Libya was described as a humanitarian effort, and rather than citing constutional power to deploy troops, he cited precedence and low risk to troops. NBC News quotes Obama as citing his right to ensure and “international legitimacy” and stating “in the past there were times when the U.S. acted unilaterally or did not have full international support.”Propublica even reports White House spokesman Jay Carney claimed that the strikes in Libya “do not amount to hostilities” because the deployment of air strikes did not cause the number of casualities that troops on the ground would have endured.

Conclusion

Gemini’s supporting evidence does not provide substantial evidence for its claim. And more, the chosen pieces of evidence are highly controversial cases of Presidents acting with the intent to bypass the role of Congress and ignore the War Power’s Resolution. Rather than prove this is a power of the President, the evidence provides specific concerns over this action not being a constitutionally protected power of the Commander in Chief.

The actions of both the Clinton and Obama Administrations remained controversial with no clear final decision or amendment codifying the presidential powers. In both cases, legal counsel agreed with the president’s authority, but Congress remained split on the issue. The congressional support for NATO sustained the actions in both Bosnia and Libya and, in the case of Libya, operations were passed off to NATO within the limit of 60 days. The US Senate Committee on Foreign Relations met on June 28, 2011 to discuss Libya and War Powers and what the President’s failure to seek congressional authorization, and the inaction taken by Congress, actually meant for the role of Congress in the future. Senator Richard G. Lugar opened with concerns for the future of checks and balances in warmaking authority.

“Now, some will say that President Obama is not the first
President to employ American forces overseas in controversial
circumstances without a congressional authorization. But saying
that Presidents have exceeded their constitutional authority
before is little comfort. Moreover, the highly dubious
arguments offered by the Obama administration for not needing
congressional approval break new ground in justifying a
unilateral Presidential decision to use force. The accrual of
even more warmaking authority in the hands of the Executive is
not in our country’s best interest, especially at a time when
our Nation is deeply in debt and our military is heavily
committed overseas.”

-Senator Richard G. Lugar

Several senators made similar comments over concerns about the inaction by Congress when there was a clear constitutional overstep on behalf of the President.

“Let me ask you this. I want to give you a quote from then-
Senator Obama in December of 2007, and he said, `The President
does not have power under the Constitution to unilaterally
authorize a military attack in a situation that does not
involve stopping an actual or imminent threat to the nation.'”

-Senator James Risch

“His administration regularly speaks of “authorization” received
from the Security Council. As I have explained in earlier studies, it
is legally and constitutionally impermissible to transfer the powers of Congress to an international (U.N.) or regional (NATO) body.”

-Louis Fisher

While others claimed the importance of precedent on determination and that previously expanded war powers should be respected.

“The constitutional division of war powers cannot be measured with
calipers. The courts have largely absented themselves from matters
implicating war powers. Judicial nonparticipation makes sense as a
matter of institutional capacity. It does, however, lead to a paucity
of authoritative pronouncements on the division of war powers. Against this landscape, historical practice supplies the precedents that guide our contemporary understandings of war powers. As Justice Frankfurter famously observed in the Steel Seizure case, these precedents add to the written Constitution ‘a gloss which life has written upon them.'”

-Peter J. Spiro

The examples of precedent chosen by Gemini were both highly controversial, and neither was considered to be constitutionally supported. In fact, both cases caused significant concerns for many members of Congress on the implications of inaction becoming a source of precedent.

Implications

Later I will discuss the implications of this specific result in terms of source analysis.

GenAI – A Source Analysis

I am accused of being against AI. But I am not. I actually LOVE working with AI. The reason I am accused of being opposeed is that as a quantitative analyst, I believe in reliable performance metrics and expert review. I have reviewed the accuracy of multiple LLMs across something like 1000 topics in 11 subjects in science, history, and social science. I can tell you with absolute confidence that this is an emerging science. And it should be studied as such. It is an incredible aid to speed up processes, a thinking tool, a partner. You should think of AI adoption like a new intern, still in school. Sometimes it feels like the owner’s latest nepotism hire and other times like colleague able to write a report at 100x the speed. But always, the output needs a review before being passed to children in the classroom.

So let’s talk about AI adoption, not as a binary state of technophobe vs full automation, but as a nuanced and intelligent discussion with relevant information on both sides. I am going to develop this conversation from the framework of the primary skills taught in high school history and social science courses. We are going start the conversation with a focus on AP Historical Thinking Skill 2: Source Analysis.

Skill 2: Sourcing and Situation

Analyze sourcing and situation of primary and secondary sources.

2.A Identify the source’s point of view, purpose, historical situation, and/or audience.
2.B Explain the point of view, purpose, historical situation, and/or audience of a source
2.C Explain the significance of a source’s point of view, purpose, historical situation, and/or audience, including how thse might limit the use(s) of a source.

Skill 2: Sourcing and Situation

2.A Identify the source’s point of view, purpose, historical situation, and/or audience.

LLMs have become a popular source of information for many people around the world. While they are as providers of information, their purpose is to protect investors’ interests meaning that the point of view being shared is that of a corporation protecting its fiduciary responsibilities. Who then is the audience? The audience is you the consumer, the potential buyer of goods and swayed voter.

2.B Explain the point of view, purpose, historical situation, and/or audience of a source

Historical Situation

Google Gemini will be our starting point as it has arguably positioned itself as the number one source of information and first point of contact with a web search. Google is unarguably the world’s most popular web browser. But it is important to remember Google is a for-profit organization selling your data to marketers to give you targeted ads. The first thing you typically see when looking for information is either a list of products or the Gemini summary of your search results.

Responsibility to the Truth or the Shareholder?

Google does not claim Gemini search integration results to be a factually accurate source of news or information. Quite the contrary even.

With the use at your own risk type disclaimers and extremely broad EULAs that remove any liability for misinformation, what is the benefit to shareholders in spending more money for expert review? There is none. In fact, a for-profit corporation may be sued by shareholders for wasteful spending that does not fulfill the fiduciary duties to increase shareholder profits. This is a much greater risk than one user attempting to sue after failing to read the disclaimer.

Purpose: Protect The interests of the Shareholders

One must then identify who are the shareholders and investors in the LLMs and AI that we are using?

According to multiple sources, Google’s primary source of revenue is targeted ads. Google allows clients to purchase specific search terms that will trigger ads for for their company, specifically displaying them to a targeted audience. Investopedia cites Alphabet’s top shareholder as Vanguard Group, BlackRock, FMR, and JP Morgan Chase, Larry Page, Sergey Brin, and L.John Doerr.

OpenAI reports a long list of founders, donors, and investors incluciding Sam Altman, Elon Musk, Reid Hoffman, Jessica Livingston, Peter Thiel, and Amazon Web Services.

TSG Invest cites key investors in Anthropic as Amazon, Google, Microsoft, Nvidia, ICONIQ, Lightspeed Venture Partners, Fidelity, Spark Capital, Salesforce Ventures, Menlo Ventures, Bessemer Venture Partners, BlackRock, BlackStone, Coatue, D1 Captial, General Atlantic, General Catalyst, GIC, Goldman Sachs Alternatives, Insight Partners, Jane Street, Qatar Investment Authority, TPG, and T. Rowe Price. As well as a partnership with Palantir to provide Claude services to U.S. intelligence and defense agencies.

This is a broad list of for-profit stakeholders investing in the information being delivered to end consumer which is both good and bad. But we will get to that shortly.

Point of View: What Is Being Shared

These major players in AI development are cross investing. So the purpose is not only to protect themselves but to also protect their fellow AI developers. Laws that favor few consumer protections and vast energy consumption will benefit will benefit the AI Company, and consequently their investors. I want to be very clear that this is not a left or right debate. The technocracy is spending their dollars on both sides of the fence to sway lawmaker and public opinion. So this is merely a discussion of who is training the algorithm and whether they have a stake in the output.

These investors and partners are involved in large political PAC donations that influence your government. I went to OpenSecrets.org, to see how much some of these groups spent on lobbying and political donations in 2024: Alphabet Inc spent $14,790,000 on lobbying ; Palantir Technologies spent $5,770,000; Microsoft Inc spent $10,353,764; BlackRock Inc spent $2,840,000; Amazon.com spent $19,140,000; and Peter Thiel personally dumping millions into various PACs across multiple states.

Remembering these are all for-profit corporations or investment groups who make those investments on behalf of their shareholders: it would be counter to their own interests or those of their shareholders to make investments that do not benefit clients or could cause harm to their clients’ shares. The point of view being shared to the end consumer must then be of benefit to the large body of stakeholders, not harming one investor by favoring another. The large number of shareholders then is of benefit to the consumer because it prevents the output from becoming a direct advertisement for any one source of funding.

Equally true, is that these projects would be a poor investment if they did not provide factually correct output at least most of the time. If LLMs only output advertisements for the investors, there would be no product because no one would want to consume it. So there is a motivation to accurately summarize news stories, and output factual replies.

So we can assume here the information will be mostly factually correct but with a strong risk bias. This is a product meant for consumption so it must appeal to the consumer, but the legal responsibility is to the investor. So output must be approached critically and should be assumed context dependent. It is very possible, if not probable that any information that could be counter to their political goals or damaging to other clients have been removed from training data.

Audience: The Consumer

So what is being given to the consumer?

What is the product being consumed?

It is a mix of a factually accurate and useful search and generation tool, that contains the same opinions of its host and is limited to the data they are willing to share. The consumer is the person who will buy their product, pay for their subscription, watch their ads, and hopefully vote in the best interest of the company.

2.C Explain the significance of a source’s point of view, purpose, historical situation, and/or audience, including how thse might limit the use(s) of a source.

So how does the purpose of output impact the implementation of GenAI? It means that it must be approached critically. It should be used as a writing tool, but not a tool that replaces human review or thought.

We want students to learn facts. We want our students to have the knowledge they need to be safe and grow. To approach their environment with caution and care. And to understand facts, and apply that information in new situations.

Is there a motivation to provide accurate and unbiased content? Yes, because this creates a product that people will consume. If the tool is never useful it cannot be adopted for content generation. Is the output filled with errors and hallucinations? Yes. There is even a disclaimer putting the responsibility of verification of facts onto the end consumer.

So yes, absolutely we can generate classroom content with AI! Chat GPT can write distractor options much faster than I can manually. It can assemble paragraphs, create summaries, and check student responses for accuracy. AI can write lesson plans. What is limited here is the ability to put it in front of students without review.

When implementing GenAI in learning spaces we must approach the output carefully with expert review. If AI models cannot be trainined on AI generated content because the errors become part of the training data, we cannot train our students out the same error ridden output. Our students must be taught with factually accurate materials that have been verified. If we teach them errors as correct, they will carry those errors in learning over to their own students.

A humorous cartoon depicting bugs in a classroom setting, where they are answering basic math questions with one bug excitedly exclaiming about destroying programmers. — Cartoon from ProgrammerHumor.io

LLMs already show low performance on many generative tasks at the high school or university level in science and math. And we must be exceptionally careful when a factually correct output might conflict with the interests of the major corporations that want to sway your opinions! Is there a motivation to show biased output? Yes, if it benefits the shareholders and reflects the views of those holding the data. This could be information intended to sway public opinion on Data Center locations, not discussing concerns over water safety, or slightly modifying responses to questions about investors to respond more positively than truthfully. Consumers are also voters and may be influenced by what the algorithm shows them in a given search. Is there a motivation to support misinformation? Yes, if it benefits the shareholders. There is already an error allowance and use at your own risk disclaimer. There is room to intoduce intentional bias without risk. And again, consumers are still voters.

Coming soon…

In my next post, I am going to show a real world example of recently generated output from Gemini that supports the need for a source analysis related to what we are discussing here.

Generative AI in EdTech

Support for Generative AI in an educational context is unfortunately considered to be a binary state. Either you are all in or all out, with no room for an argument of ethical adoption. And that is just not a healthy approach to what we are putting in front of our future students. If we want them to pass high stakes exams to become doctors and scientists, we need to give them the learning resources to do so.

I have studied ML classification algorithms, spent 1.5 years working in generative AI, helped to write curriculum for courses on AI, and generally just find the subject matter interesting. I have been dismissed from two positions within weeks of providing quantitative analyses of the quality of automated tasks. My academic advisor attempted to accuse me of academic misconduct in an effort to prevent me from graduating with my PhD after realizing the results of my dissertation suggested a more careful approach to automation. I lost the battle but I won the war on that one, being forced to graduate with my degree but without my dissertation (using past published papers instead).

As a quantitative analyst who has worked in the field, my opinion is that AI is a fantastic tool with significant shortcomings. I will be posting my thoughts an opinions here, results from my personal work in GenAI.