GenAI – Safe to Use?

Because the conversations around GenAI tend to be very binary, some might believe my posts imply that I am against AI. This is very much not the case. I do believe it is a useful tool. 80% accuracy is not high enough for all purposes, but humans often don’t agree with their own output that frequently. ChatGPT can write an outline much, much faster than I can. It can skim an article much faster than I can. It can write multiple-choice questions and response options, edit for spelling and grammatical errors, double-check me for correctness, proofread, and plan. AI can cut hours off of many tasks. AI is also capable of scoring short essays as consistently as human raters.

So what is my personal integration plan for AI automation, one that leverages its strengths and accounts for its weaknesses? I ask myself the following questions when using AI.

Will there be a human review of the output before reaching the end-user?

The first question to ask yourself is how you are using the output. Is this a direct-to-consumer pipeline, or part of a drafting and revision process documented and overseen by experts?

If this is part of a drafting and revision process with expert reviewers, you’re good to go. Use this as a drafting tool, and be sure it does not make major changes to the text when asked to check for grammatical errors.

If the output will be seen directly by consumers with no review, you need to think about who will be using the output and how they are going to implement the tool. If the end user cannot reasonably carry the burden of verifying the truth, you should not skip the human in the loop.

Does the training data reflect your personal views and those of your target audience?

If you and your clients love large corporations, Peter Thiel is your personal hero, and you party with Elon Musk on the weekends, you are probably doing OK. If not, you should think about who this product is intended to serve. It has been reported on several occasions that Musk retrained Grok after the AI went too “woke” for his conservative fan base.

Image taken from NYTimes.com.

Direct-to-consumer pipelines may not be the best idea if you are not a member of the technocracy. Google has apologized for this event; some news sources claimed it was due to human error and not AI, while others attributed the notice to an AI bot. Either way, your AI needs the same level of oversight as your interns. F***-ups this epic will cost you clients unless you have a monopoly or oligopoly.

How important is the factual accuracy of the output?

The higher the stakes, the higher the accuracy needs to be. If the tool is directly related to high-stakes exam preparation, physical health, mental health, investment, spending, or is intended to facilitate any other major decisions, you need to consider whether 85% maximum accuracy is high enough.

General Low-Stakes Output

For newsletter-type announcements, personalized event invitations from a template, or rejection emails, you are probably good to go. These are low-stakes activities that are not likely to impact major decisions made by the end user, and they may help to boost engagement.

Investments & Spending (High Stakes)

Here we have to recall the importance of fiduciary responsibility in model construction. The product (LLM output) must financially benefit the investors. It is highly likely the LLM will encourage investment in companies owned by its investors or by those who are paying for promotion.

Corporate Decisions (High Stakes)

Decisions such as hiring and spending should be reviewed by humans. As a psychometrician, I can say that LLMs have little understanding of learning, the application of knowledge, or the details of many more technical roles.

I ran a test across multiple trials asking an AI agent to rank resumes. When the resumes were dumped together in the input, the AI did not rank the candidates by skills, but rather by the order given: in the majority of trials, resume 1 was the top candidate and resume 11 was the 11th candidate. There are of course different analyses that can be performed, but a hard-coded search for key terms could count the number of desired skills a candidate included on their resume with significantly higher consistency.
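
As a minimal sketch of that deterministic alternative (the skill list here is a hypothetical placeholder for whatever the job description actually requires):

```python
import re

# Hypothetical skill list pulled from a job description.
DESIRED_SKILLS = ["python", "item response theory", "regression", "sql"]

def skill_count(resume_text: str) -> int:
    """Count how many desired skills appear at least once in a resume."""
    text = resume_text.lower()
    return sum(1 for skill in DESIRED_SKILLS
               if re.search(r"\b" + re.escape(skill) + r"\b", text))

def rank_resumes(resumes: dict[str, str]) -> list[tuple[str, int]]:
    """Rank candidates by matched-skill count, highest first."""
    counts = [(name, skill_count(text)) for name, text in resumes.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)
```

Unlike the LLM ranking, this returns the same ordering on every run, no matter what order the resumes arrive in.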

Also, I have done several AI interviews for work as a math expert or statistical analyst, and the interviews are typically poorly handled. If I state the correct response using terminology different from the expected input, the interview continues with the AI attempting to explain the concepts to me while I become increasingly amused… or confused. This includes mathematically equivalent expressions and terminology that is equivalent by definition. More than once, I believed that I had stated the information incorrectly after becoming trapped in a loop, because I learned the same constructs under different names. As it turns out, my training in education research has simply provided me a different vocabulary, and the AI agent being employed is not sufficiently well versed in the topic to work in another vocabulary framework.

Learning materials (Low or High Stakes)

Lower grade levels tend to have more representative training data and produce better results. For a specific classroom where a teacher will oversee learning, determine whether the output is appropriate for their students, and ensure the curriculum content and coverage, this is probably fine.

It may also be okay to use generative AI without review if the content is a learning supplement, where the students will have knowledge from the textbook or teacher and the supplement is delivered with notice that the materials may contain inaccuracies. Regarding generative AI integration in higher education, Kortemeyer et al. (2024) report that students find the Ethel chatbot helpful despite it sometimes providing incorrect responses.

On the other hand, if you are generating the students’ sole learning materials, you should think twice before taking the human out of the loop. While RAG models can significantly improve the output, there are no guarantees. In my own experience, LLMs struggle with the connections between topics, understanding causal relationships, and creating explanations in general. Additionally, an LLM’s training varies significantly across topics within the same course or subject.

Assessment Materials (Low or High Stakes)

Psychometrics is an entire field of science related to assessment and survey development. It is not flashy like physics, so we don’t get specials on the Discovery Channel. In fact, I have not met a single person that knows what I do that is not also a psychometrician. What do we do? We measure qualities that cannot be directly observed, like math ability, psychological traits, or learning aptitude. We research other instruments, build new ones, make sure they measure what they should, equate them for comparison, and refine them to get as much information as possible out of the smallest number of questions. We also measure and compare performance on those tests.

And from a psychometrician’s perspective, you shouldn’t use AI to write tests. AI-generated assessments tend to focus on rote memorization rather than applied skills, and they break quite a few rules of assessment design. I will leave you with an example, but save the in-depth analysis for another day. It takes multiple API calls to get one useful question, and the results still need a bit of human review.

When automating the generation of testing materials, LLMs frequently wrote questions that asked “why” or “what,” but the provided answer was a restatement of the question, asserting that the thing happened without any additional detail.

Here is a question generated by ChatGPT:

What makes hydrogen bonds stronger than typical dipole-dipole interactions?
A. Larger atomic size
B. Greater electron shielding
C. Strong attraction between δ⁺ hydrogen and lone pairs
D. Presence of ionic charges

It seems okay at a glance, but this would not make it onto an exam in my own classroom. While C is the “correct” response, it is not a strong answer to the question asked, and it fails psychometrically for several reasons. Primarily, rather than explain what makes hydrogen bonds stronger, it states that the elements of a hydrogen bond are strong. A stronger response might cite distance between charges, localized charge, or directional alignment rather than restating the question. Other frequent generative AI issues appear here as well: the correct response is double the length of the other options, it uses more advanced notation and vocabulary, and it is the only response option that includes a leading word from the question.
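
Some of those flaws are mechanical enough to screen for automatically. Here is a minimal sketch (the heuristics and thresholds are my own illustrative choices, not a psychometric standard) that flags two of the issues above: a correct option much longer than the distractors, and a stem word echoed only by the key:

```python
def flag_item(stem: str, options: dict[str, str], key: str) -> list[str]:
    """Flag common giveaway patterns in a multiple-choice item."""
    flags = []
    correct = options[key]
    distractors = [text for letter, text in options.items() if letter != key]

    # Flag 1: correct answer is much longer than the average distractor.
    avg_len = sum(len(d) for d in distractors) / len(distractors)
    if len(correct) > 1.5 * avg_len:  # illustrative threshold
        flags.append("correct option is much longer than distractors")

    # Flag 2: only the correct option reuses a content word from the stem.
    stem_words = {w.lower().strip("?.,") for w in stem.split() if len(w) > 4}
    def echoes_stem(text: str) -> bool:
        return any(w.lower().strip("?.,") in stem_words for w in text.split())
    if echoes_stem(correct) and not any(map(echoes_stem, distractors)):
        flags.append("only the correct option echoes the stem")
    return flags

item = {"A": "Larger atomic size",
        "B": "Greater electron shielding",
        "C": "Strong attraction between δ⁺ hydrogen and lone pairs",
        "D": "Presence of ionic charges"}
print(flag_item("What makes hydrogen bonds stronger than typical "
                "dipole-dipole interactions?", item, key="C"))
```

Run on the example above, both flags fire.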

The questions and materials generated by LLMs also struggle to apply learning in new contexts. While a RAG model can direct more accurate content, the LLM still probabilistically determines the next words. If your input contains any specific examples, the questions for that topic will be generated from those specific examples rather than built with new contexts. The result is an exam based on rote memorization rather than the application of learning to new contexts, which is the far preferred method for testing students’ knowledge.

If you are preparing course materials for upper grades, tests that will include applied knowledge, or primary materials for first-time learners, you should include human review, or your students will most likely arrive underprepared for the exam.

News Reviews

The British Broadcasting Corporation (BBC) and the European Broadcasting Union (EBU) conducted a study in 2025. According to the BBC, they found “significant issues” in 45% of AI news summaries, and 20% contained “major accuracy issues.” Those faulty summaries cost at least one European journalist his credibility: a recent article from the Guardian reveals that Peter Vandermeersch was suspended for misquoting or misattributing quotes to individuals after not checking the AI tools he was using for news summaries.

My choice?

Treat AI like the emerging science it is. When radium was first introduced, it was branded as a cure-all by snake oil salesmen. People added radium to their drinking water, chocolate, bread… That caused a lot of problems. But after a lot of careful science, its legitimate medical uses were uncovered.

Treat your AI like it’s the CEO’s 16-year-old nephew there on a summer internship. Let him do the simple repetitive tasks, write a rough draft, check for grammatical errors. And leave the thinking to the grown-ups.

GenAI – Defining Success

In my last two posts, I provided a Source Analysis of LLM output and discussed the Claims Made and Supporting Evidence Used by an LLM in its output. If you are just joining the conversation, what we have identified so far is that, like all forms of media, LLM output has a purpose and an intended audience. The primary purpose of an LLM is to benefit investors and shareholders, with customer or user experience as secondary. This is due to something called “fiduciary responsibility,” the duty to act in the best interest of your product’s shareholders. It would not be in the shareholders’ interest to make a product people would not want to use, so there is benefit in factually correct or verifiably true output in order to have a product to sell. It is equally true that training an LLM to promote certain viewpoints or products is beneficial to shareholders.

How Accurate is AI according to an SME?

LLMs themselves are typically more honest than most CEOs regarding the quality of generated output. While companies on Twitter and LinkedIn promise amazing results from their AI-driven products, if you use the AI directly you are going to see something like this:

But What About…

You can improve your results a little if you know the models’ strengths, because different LLMs are better at different tasks. GPT gives the more accurate summary. Claude is the better writer. And Gemini is much better at reducing ambiguity or determining when there may be other equally true interpretations of the same text.

While each has its strengths, the truth remains that AI is just an emerging science. You can try what you like, but at the end of the day, you should expect 80% accuracy. But what about… AI engineers, prompt engineers, RAG pipelines, multi-shot prompts, chain-of-thought…

Prompt Engineering

This is how you get to that 80% “near perfect” agreement. Learning to speak the same language as the LLM is definitely the first step, and once you can define a clear rubric (success criteria) in your prompt, you are going to see improved results. Remember, the model is probabilistic, so clear, concise requirements are easier to follow.

Prompt engineering does work and provides significantly improved results. But to make prompt engineering sound like something fancy for a resume, we give fancy names to the methods we use to improve the output. You have already done this if you have used AI to write a resume or reformat a list. So here is the simplified version of the most popular methods (with a short sketch after the list):

  • RAG (Retrieval-Augmented Generation) Pipelines -> You break the foundational information your tool needs into smaller vectors or pieces and store them so that they can be accessed when generating the reply. In the simplest form, this is like providing a copy of your resume so GPT can write a more accurate cover letter. In a more complicated form, it is the data equivalent of an old-fashioned encyclopedia, with the information broken down into books by letter so you only take the one book you need when trying to look up something specific.
  • [Single/Few/Multi]-Shot Prompting -> You give the prompt one or more examples of inputs and outputs for more context. This is kind of like showing a student a worked example of the problem before asking them to do it themselves.
  • Chain-of-Thought Prompting -> This is a fancy way of saying “explain your reasoning.” It involves breaking a big task into smaller steps or, if provided in one prompt, having the LLM explain its reasoning. This helps to prevent skipping steps or losing the logic.
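
Here is a minimal sketch of the last two techniques combined, using the OpenAI Python client. The model name, examples, and prompt wording are my own illustrative choices, not a canonical recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Few-shot: show worked input/output pairs before the real task.
# Chain-of-thought: explicitly ask the model to reason step by step.
messages = [
    {"role": "system",
     "content": "You classify short-answer responses as Correct, Partial, "
                "or Incorrect. Explain your reasoning step by step, then "
                "give the label on the final line."},
    {"role": "user",
     "content": "Q: Why does ice float? A: It is less dense than liquid water."},
    {"role": "assistant",
     "content": "The response names the key mechanism (lower density). Label: Correct"},
    {"role": "user",
     "content": "Q: Why does ice float? A: Because it is cold."},
    {"role": "assistant",
     "content": "Temperature alone does not explain buoyancy. Label: Incorrect"},
    {"role": "user",
     "content": "Q: Why does ice float? A: The water molecules spread out when frozen."},
]

reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(reply.choices[0].message.content)
```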

All of these actually do work to improve your output, and you have probably already employed all three in your daily interactions with AI. So why is the prompt not giving 100% accuracy?!

Training

To oversimplify, an LLM is a predictive model. Given all of the other words in the sentence, whatever data was used for training, and whatever data has been collected about you, the model calculates the most likely next word or phrase to output. In the literature on training ML models to score students’ short-answer questions or essays, 80% agreement (when corrected for chance agreement) is considered near perfect (Nehm & Haertig, 2012), and those metrics are still largely used today (e.g., Zhai et al., 2021). It is not a bad measure to choose when you consider the amount of training it requires to get human raters to score at that level of agreement, and sometimes the machine-to-human agreement is higher than the human-to-human agreement (e.g., Maestrales et al., 2021). So if two humans only agree on the score of a single sentence 70-90% of the time, that is the data we use to train ML or AI.
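
To make “corrected for chance agreement” concrete: the usual metric is Cohen’s kappa. A quick sketch with scikit-learn, using made-up rater scores for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores from two human raters on ten short answers (0-2 scale).
rater_a = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]
rater_b = [2, 1, 0, 1, 1, 1, 0, 2, 1, 1]

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"raw agreement: {raw_agreement:.2f}")  # counts lucky matches
print(f"Cohen's kappa: {kappa:.2f}")          # discounts chance agreement
```

On this toy data, a raw agreement of 0.80 drops to a kappa of about 0.69 once chance matches are discounted, which is why the corrected figure is the one reported.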

Environment

If I am logged into my work account, ChatGPT will have a different set of assumptions than if I am logged into my personal account. If I am automating through the API, the results are different from both. Switching across different LLMs, different models of the same LLM, or using the same model of the same LLM before and after an update can all mean new prompt adjustments.

If I use the playground to build a prompt and add the tools this agent or prompt is allowed to access, automating that prompt through the API does not limit it to those specified tools. This will yield different results than anticipated when deploying the system.

Even the same prompt in the same model in the same window will produce different outputs when run multiple times. Before and during Black Friday one year, I ran the same prompt more than 100 times and recorded the failures in format or response, demonstrating that traffic played a significant role in the returned output. Testing the automated scoring of students’ written responses to test questions, I would often find different scores on each trial. Some responses were less ambiguous and scored easily; others scored differently every time.
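
A minimal version of that repeatability check, again with the OpenAI client; the prompt, model, and pass/fail criterion here are placeholders for whatever your task needs:

```python
import json
from openai import OpenAI

client = OpenAI()
PROMPT = ('Score this response 0-2 and reply as JSON: {"score": <int>}. '
          "Response: ...")  # placeholder task

def one_trial() -> int | None:
    """Run the prompt once; return the score, or None on a format failure."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT}],
    )
    try:
        return json.loads(reply.choices[0].message.content)["score"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

scores = [one_trial() for _ in range(100)]
failures = sum(s is None for s in scores)
print(f"format failures: {failures}/100")
print(f"distinct scores returned: {sorted({s for s in scores if s is not None})}")
```

If the distinct-scores set has more than one member for the same input, you are looking at exactly the trial-to-trial drift described above.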

Context Windows

I will just add more instructions and more context!

This works to a point. But there is a bit of a parabolic performance curve when it comes to the number of tokens (roughly, how many words) you are inputting. Too few and you don’t have enough context; too many and the model starts skipping instructions. You can max out your inputs, but the model will summarize the instructions and content, deciding for itself which to follow.

Similarly, continuing the chain of thought in a single conversation increases the number of tokens. You have definitely already noticed that once your chat reaches a certain length, the quality decreases rapidly. This is usually after about 3 or 4 outputs.
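
If you want to watch the window fill up, you can count tokens yourself with OpenAI’s tiktoken library. A sketch, assuming a GPT-4-era encoding (adjust the encoding name for your model):

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

conversation = [
    "...system prompt...",
    "...user turn 1...",
    "...assistant reply 1...",
]
running_total = sum(len(enc.encode(turn)) for turn in conversation)
print(f"tokens so far: {running_total}")
```

Every turn gets re-sent with the next request, so the running total grows with each exchange, which is why quality tends to sag a few outputs in.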

AI-Driven QC

So let’s use more AI to fix the AI!

This is an expenditure with diminishing returns. The capacity for a fix depends largely on the reason for the errors. A hiccup in the server might be easily fixed, but a lack of data or a fundamental misunderstanding will not be resolved in the next pass using the same LLM. Using multiple different agents or LLMs can be very helpful when one excels over the other in different areas.

Unfortunately, it is also impossible to know where the errors are occurring. If you assume a random failure rate of at least 10% on each task, where is that 10% in the generation process, and where is it in the classification or editing? Are the errors on the same items? How many bad items are misclassified as good?
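
Even back-of-the-envelope arithmetic shows why stacking AI passes is not a cure. Assuming the 10% figure above and treating the two passes as independent (a big assumption), a sketch:

```python
# Assumed rates, per the 10% illustration above (not measured values).
p_bad_generated = 0.10   # generator produces a bad item
p_review_error  = 0.10   # reviewer mislabels any item it sees

# A bad item ships if the reviewer wrongly approves it.
bad_items_shipped = p_bad_generated * p_review_error
# A good item is lost if the reviewer wrongly rejects or rewrites it.
good_items_flagged = (1 - p_bad_generated) * p_review_error

print(f"bad items shipped:  {bad_items_shipped:.1%}")   # 1.0%
print(f"good items flagged: {good_items_flagged:.1%}")  # 9.0%
```

The review pass helps, but you still ship errors, you churn 9% of good items, and nothing in the pipeline tells you which items are which.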

In one experiment I performed, I used AI for a writing task. The instruction was to write an answer to a specific question and explain its reasoning based on the data provided, or simply state that there was not enough information. A second round read what was written and either approved the response, rewrote the response, or stated that there was not enough information. A third round of AI then adjudicated between the first two responses if they differed: it was to decide which was better or state that there was not enough information to write the response. Of the cases where the first and second responses were both written but differed, the third round decided there was not enough information in almost 50% of those cases. There is no way of knowing which is correct without human review.
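
For reference, the shape of that experiment looks roughly like this. Here ask_llm is a hypothetical wrapper around whichever API you use, and the real prompts were far more detailed:

```python
NOT_ENOUGH = "NOT ENOUGH INFORMATION"

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM API you are using."""
    raise NotImplementedError

def three_pass(question: str, data: str) -> str:
    # Pass 1: draft an answer from the data, or decline.
    draft = ask_llm(f"Answer using only this data, or say '{NOT_ENOUGH}'.\n"
                    f"Question: {question}\nData: {data}")
    # Pass 2: approve (repeat verbatim), rewrite, or decline.
    review = ask_llm(f"Approve this answer by repeating it verbatim, rewrite "
                     f"it using only the data, or say '{NOT_ENOUGH}'.\n"
                     f"Answer: {draft}\nData: {data}")
    if review.strip() == draft.strip():
        return draft  # approved
    # Pass 3: adjudicate between the two differing answers, or decline.
    return ask_llm(f"Choose the better answer, A or B, or say '{NOT_ENOUGH}'.\n"
                   f"A: {draft}\nB: {review}\nData: {data}")
```

The failure mode lives in that last branch: when passes 1 and 2 disagree, the adjudicator frequently punts rather than resolving the conflict.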

Burden of Responsibility in Use

So, from a subject matter expert who has automated systems at scale: we are quite a long way from no humans in the loop if you need high accuracy in your content. For low-stakes writing tasks, automation is possible, but for more detailed or structured content, we have a long way to go. For higher-level science and math, in-depth reasoning tasks, the building of learning progressions, and larger-scale lesson planning, LLM output is just not quite there.

While we are replacing a lot of SMEs with automation engineers, it is really important to have this conversation. The way the engineering team feels about the one-off vibe-coded scripts cobbled together by the content team is exactly how we feel about the vibe-coded content LLMs write. Any higher-level analysis and science texts output by the tech team are about the same quality as the vibe code the boss’ nephew is trying to use to sneak his way onto the engineering team during the summer internship.

We still need people who are able to review the output, verify the accuracy, and review high-stakes generation. And now more than ever, we need real experts. As the technology gets closer and closer to generating something correctly, it becomes more difficult for a non-expert to identify the issues.

Where Does Responsibility Fall?

The responsibility for truth falls on the end user, the consumer of the AI product. We use RAG models, good data, agentic systems, etc., to improve the accuracy as best we can. But it is not perfect, and someone has to decide what is “good enough.” If I create a tool that uses an LLM to perform a task and sell this AI-driven product to you, I still sell that product with the burden of verification falling finally at your feet, because it is tailored to your need.

With luck, my data pipeline integration will improve your output by reducing hallucinations or automating a process you could not automate before, but the final review of each output still falls on the last user in the chain. I will use AI knowing it is about 80% accurate, and I will sell my product with the same accuracy warning as the original agent.

As an adult selling to adults, passing that responsibility off to the consumer is a little different than passing the end product over to children with no humans or expert reviewers. Are children, then, the final end users, responsible for knowing whether what the adults teach them is true? Where is this line being drawn in AI-first, tech-driven culture?

Hallucination or Training

But what are those errors? Are they randomly distributed? Are they based on the training data? Do they matter? We will talk about that in my next post.

GenAI – A Source Analysis

I am accused of being against AI. But I am not. I actually LOVE working with AI. The reason I am accused of being opposed is that, as a quantitative analyst, I believe in reliable performance metrics and expert review. I have reviewed the accuracy of multiple LLMs across something like 1,000 topics in 11 subjects in science, history, and social science. I can tell you with absolute confidence that this is an emerging science, and it should be studied as such. It is an incredible aid to speed up processes, a thinking tool, a partner. You should think of AI adoption like a new intern, still in school. Sometimes it feels like the owner’s latest nepotism hire, and other times like a colleague able to write a report at 100x the speed. But always, the output needs a review before being passed to children in the classroom.

So let’s talk about AI adoption, not as a binary state of technophobe vs. full automation, but as a nuanced and intelligent discussion with relevant information on both sides. I am going to develop this conversation from the framework of the primary skills taught in high school history and social science courses. We are going to start the conversation with a focus on AP Historical Thinking Skill 2: Source Analysis.

Skill 2: Sourcing and Situation

Analyze sourcing and situation of primary and secondary sources.

  • 2.A Identify the source’s point of view, purpose, historical situation, and/or audience.
  • 2.B Explain the point of view, purpose, historical situation, and/or audience of a source.
  • 2.C Explain the significance of a source’s point of view, purpose, historical situation, and/or audience, including how these might limit the use(s) of a source.

2.A Identify the source’s point of view, purpose, historical situation, and/or audience.

LLMs have become a popular source of information for many people around the world. While they act as providers of information, their purpose is to protect investors’ interests, meaning that the point of view being shared is that of a corporation protecting its fiduciary responsibilities. Who then is the audience? The audience is you: the consumer, the potential buyer of goods, the swayable voter.

2.B Explain the point of view, purpose, historical situation, and/or audience of a source

Historical Situation

Google Gemini will be our starting point, as it has arguably positioned itself as the number one source of information and the first point of contact in a web search. Google is unarguably the world’s most popular search engine. But it is important to remember that Google is a for-profit organization selling your data to marketers to give you targeted ads. The first thing you typically see when looking for information is either a list of products or the Gemini summary of your search results.

Responsibility to the Truth or the Shareholder?

Google does not claim Gemini search integration results to be a factually accurate source of news or information. Quite the contrary, even.

With use-at-your-own-risk disclaimers and extremely broad EULAs that remove any liability for misinformation, what is the benefit to shareholders in spending more money on expert review? There is none. In fact, a for-profit corporation may be sued by shareholders for wasteful spending that does not fulfill the fiduciary duty to increase shareholder profits. That is a much greater risk than one user attempting to sue after failing to read the disclaimer.

Purpose: Protect the Interests of the Shareholders

One must then ask: who are the shareholders and investors in the LLMs and AI that we are using?

According to multiple sources, Google’s primary source of revenue is targeted ads. Google allows clients to purchase specific search terms that will trigger ads for their company, specifically displaying them to a targeted audience. Investopedia cites Alphabet’s top shareholders as Vanguard Group, BlackRock, FMR, JP Morgan Chase, Larry Page, Sergey Brin, and L. John Doerr.

OpenAI reports a long list of founders, donors, and investors including Sam Altman, Elon Musk, Reid Hoffman, Jessica Livingston, Peter Thiel, and Amazon Web Services.

TSG Invest cites key investors in Anthropic as Amazon, Google, Microsoft, Nvidia, ICONIQ, Lightspeed Venture Partners, Fidelity, Spark Capital, Salesforce Ventures, Menlo Ventures, Bessemer Venture Partners, BlackRock, Blackstone, Coatue, D1 Capital, General Atlantic, General Catalyst, GIC, Goldman Sachs Alternatives, Insight Partners, Jane Street, Qatar Investment Authority, TPG, and T. Rowe Price, as well as a partnership with Palantir to provide Claude services to U.S. intelligence and defense agencies.

This is a broad list of for-profit stakeholders investing in the information being delivered to the end consumer, which is both good and bad. But we will get to that shortly.

Point of View: What Is Being Shared

These major players in AI development are cross-investing, so the purpose is not only to protect themselves but also to protect their fellow AI developers. Laws that favor few consumer protections and vast energy consumption will benefit the AI company, and consequently its investors. I want to be very clear that this is not a left or right debate. The technocracy is spending its dollars on both sides of the fence to sway lawmakers and public opinion. So this is merely a discussion of who is training the algorithm and whether they have a stake in the output.

These investors and partners are involved in large political PAC donations that influence your government. I went to OpenSecrets.org to see how much some of these groups spent on lobbying and political donations in 2024: Alphabet Inc. spent $14,790,000 on lobbying; Palantir Technologies spent $5,770,000; Microsoft Inc. spent $10,353,764; BlackRock Inc. spent $2,840,000; Amazon.com spent $19,140,000; and Peter Thiel personally dumped millions into various PACs across multiple states.

Remember that these are all for-profit corporations or investment groups who make those investments on behalf of their shareholders: it would be counter to their own interests, or those of their shareholders, to make investments that do not benefit clients or that could cause harm to their clients’ shares. The point of view being shared with the end consumer must then be of benefit to the large body of stakeholders, not harming one investor by favoring another. The large number of shareholders is therefore of benefit to the consumer, because it prevents the output from becoming a direct advertisement for any one source of funding.

Equally true is that these projects would be a poor investment if they did not provide factually correct output at least most of the time. If LLMs only output advertisements for the investors, there would be no product, because no one would want to consume it. So there is a motivation to accurately summarize news stories and output factual replies.

So we can assume the information will be mostly factually correct, but with a strong risk of bias. This is a product meant for consumption, so it must appeal to the consumer, but the legal responsibility is to the investor. Output must therefore be approached critically and should be assumed context dependent. It is very possible, if not probable, that any information that could run counter to their political goals or be damaging to other clients has been removed from the training data.

Audience: The Consumer

So what is being given to the consumer?

What is the product being consumed?

It is a mix of a factually accurate and useful search and generation tool that carries the same opinions as its host and is limited to the data they are willing to share. The consumer is the person who will buy their product, pay for their subscription, watch their ads, and, they hope, vote in the best interest of the company.

2.C Explain the significance of a source’s point of view, purpose, historical situation, and/or audience, including how these might limit the use(s) of a source.

So how does the purpose of the output impact the implementation of GenAI? It means the output must be approached critically. It should be used as a writing tool, but not a tool that replaces human review or thought.

We want students to learn facts. We want our students to have the knowledge they need to be safe and grow, to approach their environment with caution and care, and to understand facts and apply that information in new situations.

Is there a motivation to provide accurate and unbiased content? Yes, because this creates a product that people will consume. If the tool is never useful it cannot be adopted for content generation. Is the output filled with errors and hallucinations? Yes. There is even a disclaimer putting the responsibility of verification of facts onto the end consumer.

So yes, absolutely we can generate classroom content with AI! ChatGPT can write distractor options much faster than I can manually. It can assemble paragraphs, create summaries, and check student responses for accuracy. AI can write lesson plans. What is limited here is the ability to put it in front of students without review.

When implementing GenAI in learning spaces, we must approach the output carefully, with expert review. If AI models cannot be trained on AI-generated content because the errors become part of the training data, we cannot train our students on the same error-ridden output. Our students must be taught with factually accurate materials that have been verified. If we teach them errors as correct, they will carry those errors over into the learning of their own students.

Cartoon from ProgrammerHumor.io: bugs in a classroom answering basic math questions, with one bug excitedly exclaiming about destroying programmers.

LLMs already show low performance on many generative tasks at the high school or university level in science and math. And we must be exceptionally careful when a factually correct output might conflict with the interests of the major corporations that want to sway your opinions! Is there a motivation to show biased output? Yes, if it benefits the shareholders and reflects the views of those holding the data. This could be information intended to sway public opinion on data center locations, omission of concerns over water safety, or slightly modified responses to questions about investors that answer more positively than truthfully. Consumers are also voters and may be influenced by what the algorithm shows them in a given search. Is there a motivation to support misinformation? Yes, if it benefits the shareholders. There is already an error allowance and a use-at-your-own-risk disclaimer. There is room to introduce intentional bias without risk. And again, consumers are still voters.

Coming soon…

In my next post, I am going to show a real-world example of recently generated output from Gemini that supports the need for a source analysis related to what we are discussing here.