GenAI – Defining Success

In my last two posts, I provided a Source Analysis of LLM output and discussed the Claims Made and Supporting Evidence Used by an LLM in its output. If you are just joining the conversation, what we have identified so far is that, like all forms of media, LLM output has a purpose and an intended audience. The primary purpose of an LLM is to benefit investors and shareholders, with customer or user experience as secondary. This is due to something called “fiduciary responsibility,” the duty to act in the best interest of the shareholders in your product. It would not be in the shareholders’ interest to make a product people would not want to use, so there is benefit in factually correct or verifiably true output in order to have a product to sell. It is equally true that training an LLM to promote certain viewpoints or products is beneficial to shareholders.

How Accurate is AI according to an SME?

LLMs themselves are typically more honest than most CEOs regarding the quality of generated output. While companies on Twitter and LinkedIn promise amazing results from their AI-driven products, if you use the AI directly, you are going to see something like this:

But What About…

You can improve your results a little if you know each model’s strengths. Different LLMs are better at different tasks. GPT produces the more accurate summaries. Claude is the better writer. And Gemini is much better at reducing ambiguity or determining when there may be other equally valid interpretations of the same text.

While each has its strengths, the truth remains that AI is just an emerging science. You can try what you like, but at the end of the day, you should expect 80% accuracy. But what about… AI-Engineers, Prompt Engineers, RAG Pipelines, Multi-Shot prompts, Chain-of-Thought….

Prompt Engineering

This is how you get to that 80%, near-perfect agreement. Learning to speak the same language as the LLM is definitely the first step, and once you can define a clear rubric (success criteria) in your prompt, you are going to see improved results. Remember, the model is probabilistic, so clear, concise requirements are easier to follow.
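To make that concrete, here is a minimal sketch; the task and the success criteria are invented for illustration, and this is just the prompt text you would send, not a real API call:

```python
# A vague request leaves "done" undefined; the model fills the gap
# with its most probable guess.
vague_prompt = "Summarize this article."

# A rubric turns the same request into checkable success criteria.
rubric_prompt = (
    "Summarize this article.\n"
    "Success criteria:\n"
    "- Exactly 3 bullet points.\n"
    "- Each bullet under 20 words.\n"
    "- Use only facts stated in the article; add no opinion.\n"
)

print(rubric_prompt)
```

The rubric version gives the model a checkable definition of success, which is what pushes probabilistic output toward that ceiling.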

Prompt engineering does work and provides significantly improved results. But to make it sound like something impressive on a resume, we give fancy names to the methods we use to improve the output. You have already done this if you have used AI to write a resume or reformat a list. So here is the simplified version of the most popular methods:

  • RAG (Retrieval-Augmented Generation) Pipelines -> You break the foundational information your tool needs into smaller vectors or pieces and store them so that they can be accessed when generating the reply. In the simplest form, this is like providing a copy of your resume to have GPT write a more accurate cover letter. In a more complicated form, it is the data equivalent of an old-fashioned encyclopedia with the information broken down into books by letter, so you only take the one book you need when trying to look up something more specific.
  • [Single/Few/Multi]-Shot Prompting -> You give the prompt one or more examples of inputs and outputs for more context. This is kind of like showing a student a worked example of the problem before asking them to do it themselves.
  • Chain-of-Thought Prompting -> This is a fancy way of saying “explain your reasoning.” It involves breaking a big task into smaller steps or, if provided in one prompt, having the LLM explain its reasoning. This helps to prevent skipping steps or losing the logic.
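As a toy illustration of the RAG idea, here is a minimal sketch. Real pipelines use vector embeddings and a vector store; plain word overlap stands in for similarity here, and the resume chunks are invented:

```python
# Minimal RAG-style retrieval sketch: store chunks, pick the most
# relevant one, and prepend it to the prompt as context.

def score(chunk: str, query: str) -> int:
    """Count how many query words appear in the chunk (toy similarity)."""
    chunk_words = set(chunk.lower().split())
    return sum(w in chunk_words for w in query.lower().split())

def retrieve(chunks: list[str], query: str) -> str:
    """Return the chunk that best matches the query."""
    return max(chunks, key=lambda c: score(c, query))

def build_prompt(chunks: list[str], question: str) -> str:
    context = retrieve(chunks, question)
    return f"Context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Resume: five years as a science curriculum developer.",
    "Resume: certified in project management.",
]
print(build_prompt(chunks, "What science curriculum work have I done?"))
```

Like the encyclopedia analogy above: only the one relevant “book” is pulled off the shelf and handed to the model.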

All of these actually do work to improve your output, and you have probably already employed all three in your daily interactions with AI. So why is the prompt not giving 100% accuracy?!

Training

To oversimplify, an LLM is a predictive model. Given all of the other words in the sentence, whatever data was used for training, and whatever data has been collected about you, the model calculates the most likely next word or phrase to output. In the literature on training ML models to score students’ short-answer questions or essays, 80% agreement (when corrected for chance agreement) is considered near perfect (Nehm & Haertig, 2012), and those metrics are still largely used today (e.g., Zhai et al., 2021). It is not a bad measure to choose when you consider the amount of training it takes for human raters to score at that level of agreement, and sometimes the machine-to-human agreement is higher than the human-to-human agreement (e.g., Maestrales et al., 2021). So if two humans only agree on the score of a single sentence 70-90% of the time, that is the data we use to train ML or AI.
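The chance correction mentioned above is typically Cohen’s kappa. A minimal implementation, with made-up rater scores for illustration, looks like this:

```python
# Chance-corrected agreement (Cohen's kappa): raw agreement minus the
# agreement two raters would reach by guessing from their own label
# frequencies, rescaled so 1.0 means perfect agreement.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Raw proportion of items where the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Expected chance agreement from each rater's marginal frequencies.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)

human   = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # invented scores
machine = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

# 80% raw agreement shrinks to about 0.52 once chance is removed.
print(round(cohens_kappa(human, machine), 2))
```

Notice how quickly 80% raw agreement deflates once chance agreement on a two-category score is subtracted, which is why the corrected figure is the one reported in the literature.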

Environment

If I am logged into my work account, ChatGPT will have a different set of assumptions than if I am logged into my personal account. If I am automating through the API, the results are different from both. Switching across different LLMs, different models of the same LLM, or using the same model of the same LLM before and after an update can mean new prompt adjustments.

If I use the playground to build a prompt and add in the tools this agent or prompt is allowed to access, automating that prompt through the API fails to limit it to those specified tools. This will yield different results than anticipated when deploying the system.

Even the same prompt in the same model in the same window will have different outputs when run multiple times. Before and during Black Friday one year, I ran the same prompt more than 100 times and recorded the failures in format or response, demonstrating that traffic played a significant role in the returned output. Testing the automated scoring of students’ written responses to test questions, I would often find different scores for each trial. Some responses were less ambiguous and scored easily, and others were different every time.
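A repeated-trial check of that kind can be sketched as follows. `call_model` is a hypothetical stand-in for a real API call, stubbed here with a random generator so the harness runs on its own; the 20% failure rate is an invented number for the demo:

```python
# Run the same prompt many times and tally how often the output
# breaks the expected format.
import random
import re

def call_model(prompt: str) -> str:
    """Hypothetical model call, stubbed to break format ~20% of the time."""
    if random.random() < 0.2:
        return "Sure! Here is your answer..."  # chatty, wrong format
    return "SCORE: 3"                          # expected format

def failure_rate(prompt: str, trials: int = 100) -> float:
    """Fraction of trials whose output does not match the expected format."""
    pattern = re.compile(r"^SCORE: \d$")
    failures = sum(
        not pattern.match(call_model(prompt)) for _ in range(trials)
    )
    return failures / trials

random.seed(0)  # fixed seed so the demo is repeatable
print(failure_rate("Score this response from 0 to 5.", trials=200))
```

Swapping the stub for a real API call turns this into the same kind of logging experiment described above: run, tally, and compare failure rates across days or traffic conditions.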

Context Windows

I will just add more instructions and more context!

This works to a point. But there is a bit of a parabolic performance curve when it comes to the number of tokens (how many words) you are inputting. Too few and you don’t have enough context, but too many and the model starts skipping instructions. You can max out your inputs, but the model will summarize instructions and content, deciding for itself which to follow.

Similarly, continuing the chain-of-thought in a single conversation increases the number of tokens. You have definitely already noticed that once your chat reaches a certain length, the quality decreases rapidly. This is usually after about 3 or 4 outputs.

AI-Driven QC

So let’s use more AI to fix the AI!

This is an expenditure with diminishing returns. The capacity for a fix depends largely on the reason for the errors. A hiccup in the server might be easily fixed, but a lack of data or fundamental misunderstanding will not be resolved in the next pass using the same LLM. Using multiple different agents or LLMs can be very helpful when one excels over the other in different areas.

Unfortunately, it is also impossible to know where the errors are occurring. If you assume there is a random failure rate of at least 10% on each task, where is that 10% in the generation process, and where is it in the classification or editing? Are the errors on the same items? How many bad items are misclassified as good?
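To put rough numbers on that question, here is the arithmetic under the stated assumption of a 10% failure rate at each stage, treated as independent; both rates are the article’s illustrative figure, not measurements:

```python
# Back-of-the-envelope arithmetic for stacked failures: generation and
# review each fail ~10% of the time, independently.

p_gen_bad = 0.10      # generator produces a bad item
p_review_miss = 0.10  # reviewer errs on an item (assumed rate)

# A bad item slips through only if generation fails AND review misses it.
p_bad_approved = p_gen_bad * p_review_miss

# If the reviewer also errs at 10% on good items, some good work is
# falsely flagged, and we cannot tell which stage failed without a human.
p_good_rejected = (1 - p_gen_bad) * p_review_miss

print(f"{p_bad_approved:.0%} of items are bad and approved")
print(f"{p_good_rejected:.0%} of items are good but rejected")
```

So stacking a second AI pass does shrink the approved-but-bad pool, but it also rejects good items, and the errors it introduces land in unknown places; that is the diminishing return.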

In one experiment I performed, I used AI to do a writing task. The instruction was to write an answer to a specific question and explain its reasoning based on the data provided, or simply state that there was not enough information. Another round read what was written and either approved the response, rewrote the response, or stated that there was not enough information. Then a third round of AI decided between the first two responses if they differed: it was to decide which was better or state that there was not enough information to write the response. Of the cases where the first and second responses were both written but different, the third trial decided there was not enough information in almost 50% of those cases. There is no way of knowing which is correct without human review.

Burden of Responsibility in Use

So, from a subject matter expert who has automated systems at scale: we are quite a long way from no humans in the loop if you need high accuracy in your content. For low-stakes writing tasks, automation is possible, but for more detailed or structured content, we have a long way to go. The LLM output for higher-level science and math, in-depth reasoning tasks, the building of learning progressions, or larger-scale lesson planning is just not quite there.

While we are replacing a lot of SMEs with automation engineers, it is really important to have this conversation. The way the engineering team feels about the one-off vibe-coded scripts cobbled together by the content team is exactly how we feel about the vibe-coded content LLMs write. Any higher-level analysis and science texts output by the tech team are about the same quality as the vibe-code the boss’ nephew is trying to use to sneak his way onto the engineering team during the summer internship.

We still need people who are able to review the output, verify the accuracy, and review high-stakes generation. And now more than ever, we need real experts. As the technology gets closer and closer to generating something correctly, it becomes more difficult for a non-expert to identify the issues.

Where Does Responsibility Fall?

The responsibility for truth falls on the end user, the consumer of the AI product. We use RAG models, good data, agentic systems, etc. to improve the accuracy as best we can. But it is not perfect, and someone has to decide what is “good enough.” If I create a tool that uses an LLM to perform a task and sell this AI-driven product to you, I still sell that product with the burden of verification falling finally at your feet, because it is tailored to your need.

With luck, my data pipeline integration will improve your output by reducing hallucinations or automating a process you could not automate before, but the final review of each output still falls on the last user in the chain. I will use AI knowing it is about 80% accurate, and I will sell my product with the same accuracy warning as the original agent.

As an adult selling to adults, passing that responsibility off to the consumer is a little different than passing the end product over to children with no humans or expert reviewers. Are children then the final end users, responsible for knowing if what the adults teach them is true? Where is this line being drawn in AI-first, tech-driven culture?

Hallucination or Training

But what are those errors? Are they randomly distributed? Are they based on the training data? Do they matter? We will talk about that in my next post.

GenAI – A Source Analysis

I am accused of being against AI. But I am not. I actually LOVE working with AI. The reason I am accused of being opposed is that, as a quantitative analyst, I believe in reliable performance metrics and expert review. I have reviewed the accuracy of multiple LLMs across something like 1000 topics in 11 subjects in science, history, and social science. I can tell you with absolute confidence that this is an emerging science. And it should be studied as such. It is an incredible aid to speed up processes, a thinking tool, a partner. You should think of AI adoption like a new intern, still in school. Sometimes it feels like the owner’s latest nepotism hire and other times like a colleague able to write a report at 100x the speed. But always, the output needs a review before being passed to children in the classroom.

So let’s talk about AI adoption, not as a binary state of technophobe vs full automation, but as a nuanced and intelligent discussion with relevant information on both sides. I am going to develop this conversation from the framework of the primary skills taught in high school history and social science courses. We are going to start the conversation with a focus on AP Historical Thinking Skill 2: Source Analysis.

Skill 2: Sourcing and Situation

Analyze sourcing and situation of primary and secondary sources.

  • 2.A Identify the source’s point of view, purpose, historical situation, and/or audience.
  • 2.B Explain the point of view, purpose, historical situation, and/or audience of a source
  • 2.C Explain the significance of a source’s point of view, purpose, historical situation, and/or audience, including how these might limit the use(s) of a source.

2.A Identify the source’s point of view, purpose, historical situation, and/or audience.

LLMs have become a popular source of information for many people around the world. While they do act as providers of information, their purpose is to protect investors’ interests, meaning that the point of view being shared is that of a corporation protecting its fiduciary responsibilities. Who, then, is the audience? The audience is you, the consumer: the potential buyer of goods and the swayed voter.

2.B Explain the point of view, purpose, historical situation, and/or audience of a source

Historical Situation

Google Gemini will be our starting point, as it has arguably positioned itself as the number one source of information and the first point of contact with a web search. Google is unarguably the world’s most popular search engine. But it is important to remember Google is a for-profit organization selling your data to marketers to give you targeted ads. The first thing you typically see when looking for information is either a list of products or the Gemini summary of your search results.

Responsibility to the Truth or the Shareholder?

Google does not claim Gemini search integration results to be a factually accurate source of news or information. Quite the contrary even.

With the use-at-your-own-risk disclaimers and extremely broad EULAs that remove any liability for misinformation, what is the benefit to shareholders in spending more money on expert review? There is none. In fact, a for-profit corporation may be sued by shareholders for wasteful spending that does not fulfill the fiduciary duty to increase shareholder profits. This is a much greater risk than one user attempting to sue after failing to read the disclaimer.

Purpose: Protect The interests of the Shareholders

One must then ask: who are the shareholders and investors in the LLMs and AI that we are using?

According to multiple sources, Google’s primary source of revenue is targeted ads. Google allows clients to purchase specific search terms that will trigger ads for their company, specifically displaying them to a targeted audience. Investopedia cites Alphabet’s top shareholders as Vanguard Group, BlackRock, FMR, JP Morgan Chase, Larry Page, Sergey Brin, and L. John Doerr.

OpenAI reports a long list of founders, donors, and investors including Sam Altman, Elon Musk, Reid Hoffman, Jessica Livingston, Peter Thiel, and Amazon Web Services.

TSG Invest cites key investors in Anthropic as Amazon, Google, Microsoft, Nvidia, ICONIQ, Lightspeed Venture Partners, Fidelity, Spark Capital, Salesforce Ventures, Menlo Ventures, Bessemer Venture Partners, BlackRock, Blackstone, Coatue, D1 Capital, General Atlantic, General Catalyst, GIC, Goldman Sachs Alternatives, Insight Partners, Jane Street, Qatar Investment Authority, TPG, and T. Rowe Price, as well as a partnership with Palantir to provide Claude services to U.S. intelligence and defense agencies.

This is a broad list of for-profit stakeholders investing in the information being delivered to the end consumer, which is both good and bad. But we will get to that shortly.

Point of View: What Is Being Shared

These major players in AI development are cross-investing. So the purpose is not only to protect themselves but also to protect their fellow AI developers. Laws that favor few consumer protections and vast energy consumption will benefit the AI company and, consequently, its investors. I want to be very clear that this is not a left or right debate. The technocracy is spending its dollars on both sides of the fence to sway lawmakers and public opinion. So this is merely a discussion of who is training the algorithm and whether they have a stake in the output.

These investors and partners are involved in large political PAC donations that influence your government. I went to OpenSecrets.org to see how much some of these groups spent on lobbying and political donations in 2024: Alphabet Inc spent $14,790,000 on lobbying; Palantir Technologies spent $5,770,000; Microsoft Inc spent $10,353,764; BlackRock Inc spent $2,840,000; Amazon.com spent $19,140,000; and Peter Thiel personally dumped millions into various PACs across multiple states.

Remember, these are all for-profit corporations or investment groups who make those investments on behalf of their shareholders: it would be counter to their own interests, or those of their shareholders, to make investments that do not benefit clients or could cause harm to their clients’ shares. The point of view being shared with the end consumer must then benefit the large body of stakeholders, not harm one investor by favoring another. The large number of shareholders is then a benefit to the consumer, because it prevents the output from becoming a direct advertisement for any one source of funding.

Equally true is that these projects would be a poor investment if they did not provide factually correct output at least most of the time. If LLMs only output advertisements for the investors, there would be no product, because no one would want to consume it. So there is a motivation to accurately summarize news stories and output factual replies.

So we can assume here that the information will be mostly factually correct but with a strong risk of bias. This is a product meant for consumption, so it must appeal to the consumer, but the legal responsibility is to the investor. So output must be approached critically and should be assumed context dependent. It is very possible, if not probable, that any information that could be counter to their political goals or damaging to other clients has been removed from training data.

Audience: The Consumer

So what is being given to the consumer?

What is the product being consumed?

It is a mix of a factually accurate and useful search and generation tool that carries the same opinions as its host and is limited to the data they are willing to share. The consumer is the person who will buy their product, pay for their subscription, watch their ads, and hopefully vote in the best interest of the company.

2.C Explain the significance of a source’s point of view, purpose, historical situation, and/or audience, including how these might limit the use(s) of a source.

So how does the purpose of output impact the implementation of GenAI? It means that it must be approached critically. It should be used as a writing tool, but not a tool that replaces human review or thought.

We want students to learn facts. We want our students to have the knowledge they need to be safe and grow. To approach their environment with caution and care. And to understand facts, and apply that information in new situations.

Is there a motivation to provide accurate and unbiased content? Yes, because this creates a product that people will consume. If the tool is never useful it cannot be adopted for content generation. Is the output filled with errors and hallucinations? Yes. There is even a disclaimer putting the responsibility of verification of facts onto the end consumer.

So yes, absolutely we can generate classroom content with AI! ChatGPT can write distractor options much faster than I can manually. It can assemble paragraphs, create summaries, and check student responses for accuracy. AI can write lesson plans. What is limited here is the ability to put it in front of students without review.

When implementing GenAI in learning spaces, we must approach the output carefully, with expert review. If AI models cannot be trained on AI-generated content because the errors become part of the training data, we cannot train our students on the same error-ridden output. Our students must be taught with factually accurate materials that have been verified. If we teach them errors as correct, they will carry those errors over into teaching their own students.

A humorous cartoon depicting bugs in a classroom setting, where they are answering basic math questions with one bug excitedly exclaiming about destroying programmers.
Cartoon from ProgrammerHumor.io

LLMs already show low performance on many generative tasks at the high school or university level in science and math. And we must be exceptionally careful when a factually correct output might conflict with the interests of the major corporations that want to sway your opinions! Is there a motivation to show biased output? Yes, if it benefits the shareholders and reflects the views of those holding the data. This could be information intended to sway public opinion on data center locations, avoiding discussion of concerns over water safety, or slightly modifying responses to questions about investors to respond more positively than truthfully. Consumers are also voters and may be influenced by what the algorithm shows them in a given search. Is there a motivation to support misinformation? Yes, if it benefits the shareholders. There is already an error allowance and a use-at-your-own-risk disclaimer. There is room to introduce intentional bias without risk. And again, consumers are still voters.

Coming soon…

In my next post, I am going to show a real world example of recently generated output from Gemini that supports the need for a source analysis related to what we are discussing here.