GenAI – Defining Success

In my last two posts, I provided a Source Analysis of LLM output and discussed the Claims Made and Supporting Evidence Used by an LLM in its output. If you are just joining the conversation, what we have identified so far is that, like all forms of media, LLM output has a purpose and an intended audience. The primary purpose of an LLM is to benefit investors and shareholders, with customer or user experience as secondary. This is due to something called “fiduciary responsibility,” or the duty to act in the best interest of the shareholders in your product. It would not be in the shareholders’ interest to make a product people would not want to use, so there is benefit in factually correct or verifiably true output in order to have a product to sell. It is equally true that training an LLM to promote certain viewpoints or products is beneficial to shareholders.

How Accurate is AI according to an SME?

LLMs themselves are typically more honest than most CEOs regarding the quality of generated output. While companies on Twitter and LinkedIn promise amazing results from their AI-driven products, if you use the AI directly you are going to see something like this:

But What About…

You can improve your results a little if you know each model’s strengths. Different LLMs are better at different tasks. GPT gives the more accurate summary. Claude is the better writer. And Gemini is much better at reducing ambiguity or determining when there may be other equally true interpretations of the same text.

While each has its strengths, the truth remains that AI is just an emerging science. You can try what you like, but at the end of the day, you should expect about 80% accuracy. But what about… AI-Engineers, Prompt Engineers, RAG Pipelines, Multi-Shot prompts, Chain-of-Thought….

Prompt Engineering

This is how you get to that 80%, near-perfect agreement. Learning to speak the same language as the LLM is definitely the first step, and once you can define a clear rubric (success criteria) in your prompt, you are going to see improved results. Remember, the model is probabilistic, so clear, concise requirements are easier to follow.

Prompt engineering does work and provides significantly improved results. But to make prompt engineering sound like something worth putting on a resume, we give fancy names to the methods we use to improve the output. You have already done this if you have used AI to write a resume or reformat a list. So here is the simplified version of the most popular methods:

  • RAG (Retrieval-Augmented Generation) Pipelines -> You break the foundational information your tool needs into smaller vectors or pieces and store them so that they can be accessed when generating the reply. In the simplest form, this is like providing a copy of your resume to have GPT write a more accurate cover letter. In a more complicated form, it is the data equivalent of an old-fashioned encyclopedia with the information broken down into books by letter, so you only take the one book you need when trying to look up something more specific.
  • [Single/Few/Multi]-Shot Prompting -> You give the prompt one or more examples of inputs and outputs for more context. This is kind of like showing a student a worked example of the problem before asking them to do it themselves.
  • Chain-of-Thought Prompting -> This is a fancy way of saying “explain your reasoning.” It involves breaking a big task into smaller steps or if provided in one prompt, having the LLM explain its reasoning. This helps to prevent skipping steps or losing the logic.
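The three methods above can be sketched in a few lines of code. Everything here is illustrative: the retrieval step is a naive keyword match standing in for a real vector store, and the resulting prompt would be sent to whichever chat API you actually use.

```python
def retrieve(query, documents, k=2):
    """Minimal RAG-style retrieval: rank stored chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question, documents, examples):
    """Combine retrieved context (RAG), worked examples (few-shot),
    and a reasoning instruction (chain-of-thought) into one prompt."""
    context = "\n".join(retrieve(question, documents))
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (
        f"Use only this context:\n{context}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Q: How long is the warranty?\n"
        "Explain your reasoning step by step, then give the answer."
    )

docs = ["The warranty covers parts for two years.",
        "Shipping takes five days."]
examples = [("Is labor covered?", "The context mentions parts only, so no.")]
prompt = build_prompt("How long is the warranty?", docs, examples)
```

None of this is magic: the final prompt is still just text, which is why these methods improve output without guaranteeing it.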

All of these actually do work to improve your output, and you have probably already employed all three of these examples in your daily interactions with AI. So why is the prompt not giving 100% accuracy?!

Training

To oversimplify, an LLM is a predictive model. Given all of the other words in the sentence, whatever data was used for training, and whatever data has been collected about you, the model calculates the most likely next word or phrase to output. In the literature on training ML models to score students’ short-answer questions or essays, 80% agreement (when corrected for chance agreement) is considered near perfect (Nehm & Haertig, 2012), and those metrics are still largely used today (e.g., Zhai et al., 2021). It is not a bad measure to choose when you consider the amount of training it takes for human raters to score at that level of agreement, and sometimes the machine-to-human agreement is higher than the human-to-human agreement (e.g., Maestrales et al., 2021). So if two humans only agree on the score of a single sentence 70-90% of the time, that is the data we use to train ML or AI.
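“Agreement corrected for chance” in this literature is usually Cohen’s kappa. A minimal sketch of the calculation, with two hypothetical raters scoring ten short answers as correct (1) or incorrect (0):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for the agreement
    you would expect if both were guessing independently."""
    n = len(rater_a)
    # Observed agreement: fraction of items scored identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label by luck.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    return (observed - expected) / (1 - expected)

human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
model = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
print(round(cohens_kappa(human, model), 2))  # → 0.58
```

Note the correction matters: these raters agree on 8 of 10 items (80% raw agreement), but the chance-corrected kappa is only 0.58.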

Environment

If I am logged into my work account, ChatGPT will have a different set of assumptions than if I am logged into my personal account. If I am automating through the API, the results are different from both. Switching across different LLMs, different models of the same LLM, or using the same model of the same LLM before and after an update can mean new prompt adjustments.

If I use the playground to build a prompt and add in the tools this agent or prompt is allowed to access, automating that prompt through the API does not limit it to those specified tools. This will yield different results than anticipated when deploying the system.

Even the same prompt in the same model in the same window will have different outputs when run multiple times. Before and during Black Friday one year, I ran the same prompt more than 100 times and recorded the failures in format or response, demonstrating that traffic played a significant role in the returned output. Testing the automated scoring of students’ written responses to test questions, I would often find different scores for each trial. Some responses were less ambiguous and scored easily; others were different every time.
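This kind of repeated-run check is easy to automate. A sketch, assuming the task requires JSON output with a `score` field; the canned strings stand in for what N real calls to the same model with the same prompt might return:

```python
import json

def format_ok(output):
    """Check one run: output must be valid JSON containing a 'score' field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "score" in data

def failure_rate(outputs):
    """Fraction of repeated runs that broke the required format."""
    return sum(not format_ok(o) for o in outputs) / len(outputs)

# Stand-ins for repeated model outputs: two correct, one chatty
# refusal of the format, one with the wrong field name.
runs = ['{"score": 2}', '{"score": 3}',
        'Sure! The score is 2.', '{"points": 2}']
print(failure_rate(runs))  # → 0.5
```

Logging failures this way over time is what lets you say “traffic mattered” instead of just suspecting it.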

Context Windows

I will just add more instructions and more context!

This works to a point. But there is a bit of a parabolic performance curve when it comes to the number of tokens (roughly, how many words) you are inputting. Too few and you don’t have enough context; too many and you start skipping instructions. You can max out your inputs, but the model will summarize instructions and content, deciding for itself which to follow.

Similarly, continuing the chain of thought in a single conversation increases the number of tokens. You have definitely already noticed that once your chat reaches a certain length, the quality decreases rapidly. This is usually after about three or four outputs.

AI-Driven QC

So let’s use more AI to fix the AI!

This is an expenditure with diminishing returns. The capacity for a fix depends largely on the reason for the errors. A hiccup in the server might be easily fixed, but a lack of data or fundamental misunderstanding will not be resolved in the next pass using the same LLM. Using multiple different agents or LLMs can be very helpful when one excels over the other in different areas.

Unfortunately, it is also impossible to know where the errors are occurring. If you assume a random failure rate of at least 10% on each task, where is that 10% in the generation process, and where is it in the classification or editing? Are the errors on the same items? How many bad items are misclassified as good?
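Taking that assumed 10% failure rate at face value, a little arithmetic shows why stacking an AI reviewer on top of an AI generator still leaks errors:

```python
# Assume each stage (generation, then AI review) fails
# independently 10% of the time.
p_fail = 0.10
p_stage_ok = 1 - p_fail

# Probability a single item survives both stages error-free.
p_both_ok = p_stage_ok ** 2

# Probability at least one stage introduces or misses an error.
p_any_fail = 1 - p_both_ok

# Of the ~10% bad generations, the reviewer also fails on ~10%,
# so about 1% of all items are bad AND wrongly approved as good.
p_bad_and_missed = p_fail * p_fail

print(round(p_both_ok, 2), round(p_any_fail, 2), round(p_bad_and_missed, 2))
```

So under these assumptions roughly one item in five is touched by an error somewhere in the chain, and about one in a hundred is bad output that the second AI signed off on, which is exactly the part you cannot find without human review.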

In one experiment I performed, I used AI for a writing task. The instruction was to write an answer to a specific question and explain its reasoning based on the data provided, or simply state that there was not enough information. A second round read what was written and either approved the response, rewrote it, or stated that there was not enough information. Then a third round of AI decided between the first two responses if they differed: it was to decide which was better or state that there was not enough information to write the response. Of the cases where the first and second responses were both written but were different, the third trial decided there was not enough information in almost 50% of those cases. There is no way of knowing which is correct without human review.

Burden of Responsibility in Use

So, from a subject matter expert who has automated systems at scale: we are quite a long way from no humans in the loop if you need high accuracy in your content. For low-stakes writing tasks, automation is possible, but for more detailed or structured content, we have a long way to go. LLM output for higher-level science and math, in-depth reasoning tasks, the building of learning progressions, or larger-scale lesson planning is just not quite there.

While we are replacing a lot of SMEs with automation engineers, it is really important to have this conversation. The way the engineering team feels about the one-off vibe-coded scripts cobbled together by the content team is exactly how we feel about the vibe-coded content LLMs write. Any higher-level analysis and science texts output by the tech team are about the same quality as the vibe code the boss’ nephew is trying to use to sneak his way onto the engineering team during the summer internship.

We still need people who are able to review the output, verify the accuracy, and review high-stakes generation. And now more than ever, we need real experts. As the technology gets closer and closer to generating something correctly, it becomes more difficult for a non-expert to identify the issues.

Where Does Responsibility Fall?

The responsibility for truth falls on the end user, the consumer of the AI product. We use RAG models, good data, agentic systems, etc., to improve the accuracy as best we can. But it is not perfect, and someone has to decide what is “good enough.” If I create a tool that uses an LLM to perform a task and sell this AI-driven product to you, I still sell that product with the burden of verification falling finally at your feet, because it is tailored to your need.

With luck, my data pipeline integration will improve your output by reducing hallucinations or automating a process you could not automate before, but the final review of each output still falls on the last user in the chain. I will use AI knowing it is about 80% accurate, and I will sell my product with the same accuracy warning as the original agent.

As an adult selling to adults, passing that responsibility off to the consumer is a little different from passing the end product over to children with no humans or expert reviewers. Are children then the final end users, responsible for knowing whether what the adults teach them is true? Where is this line being drawn in an AI-first, tech-driven culture?

Hallucination or Training

But what are those errors? Are they randomly distributed? Are they based on the training data? Do they matter? We will talk about that in my next post.