AI Agents: Not Always Right But Seldom In Doubt

The Josh Bersin Company
October 29, 2025 | 00:22:11

Show Notes

This week I discuss the latest BBC study on AI answer quality from public data sources. The BBC and EBU found that 45% of news queries produce erroneous answers, so the reality has set in: public-domain AI engines are neither “superintelligent” nor perfect.

Yet they are very self-confident. So we, as users, need to be careful.

As you’ll hear, there are three things to consider here, and you can read more about this in my latest article on the topic. For those of us in corporate roles, the message is clear: data quality must be our #1 priority, and this is a whole new domain for HR and other service functions.

Like this podcast? Rate us on Spotify or Apple or YouTube.

Additional Information

BBC Research Findings

Interpretation of the Findings and the “polluted corpus” problem

Claude’s admission of the “polluted corpus” problem

AI Thinking Skills You Need To Stay Safe (podcast)

Galileo: The World’s Trusted Agent for Everything HR


Episode Transcript

[00:00:00] Good morning everybody. Today I want to talk about the lack of trust in AI agents, the public domain ones: OpenAI, Gemini, Copilot, Perplexity, Claude. And the real point of this discussion is to educate you, in a way, not necessarily to criticize anybody, on the realities of these large language models, because I think there's a perception in the market, certainly amongst HR people, that these things are extremely smart, quote unquote, and threatening our human reasoning abilities.
[00:00:37] But I want to convince you otherwise. And this all comes back to the research that came out over the weekend from the BBC and the European Broadcasting Union, which basically said that 45% of the uses of these AI tools are producing erroneous results. What the study found, based on around 3,000 queries from a variety of companies and individuals, is that when looking at information from a news standpoint, and by the way, a lot of queries that people run on these tools are news related, benchmark-related statistics on various trends in the world, what's happening where, et cetera, 45% of the answers have at least one significant error, 31% have misleading, missing, or incorrect attributions as to the source, [00:01:32] and 20% have hallucinations and outdated information.
[00:01:36] One of the tools, Gemini, was incorrect in 76% of its responses. And unfortunately, despite these errors, the AI agents are dangerously self-confident. They just talk like they know the answers to questions. Now my experience with this is very extensive, because I do this a lot. I use these tools, usually OpenAI, but I'm starting to use Claude more, to find information on the labor market, to get financial information on companies, to look up other statistics, unemployment rates, salary trends, et cetera. And I don't trust them very much anymore, because the aggregated answers they give me are usually from sources I can't find, and when I analyze the answers against other answers, I tend to find that they're incorrect.
So one of the things the BBC also did, which I really want to give them credit for, and you really should read their work, is they built a toolkit for dealing with these integrity issues. They basically said there are eight different problems. Number one, they fabricate facts, in other words, they hallucinate. Number two, they provide the wrong source, so they get confused between the source and the answer. And I think the reason that happens is that some of their sources are articles about the articles, not the actual sources themselves, so they pick up all sorts of exaggerations from that. They use out-of-date information, where they'll pull something that's 10 years old and cite it as if it's current. They inaccurately represent chronology and get things out of order. They inaccurately represent causal relations, and I find this all the time, where you'll ask it for a number or a statistic, but it doesn't realize that the statistic you asked for depended on a chain of causal events from the prior question, so it makes something up. Inaccurate scope or generalization, in other words, they incorporate data that is either bigger or smaller than the data set you actually wanted and try to give you an answer anyway. [00:03:38] And incorrectly representing entities and relations.
They mix up who's responsible for which information. And finally, failure of reasoning or logic, which is what I find a lot, where I'll get an answer that is just logically incorrect, but the system didn't even apply simple logic to see if the scope was correct. I won't give you all the examples, but in the BBC report there are many, many examples of where these things make mistakes. Simple things, like asking it who the Pope is, or what the makeup of some body is, things like that. And then of course in the news media, when we're talking about weekly and daily news about wars or legal changes or major events that have happened around the world, we don't know what's correct. You know, the conclusion I'm coming to, and I don't know if you all feel this way, but I just want to make you aware of it, is that I really don't trust the answers anymore unless I really validate the sources. Which means that even though the system is doing, quote unquote, deep research, that deep research isn't enough, because we need to do deep research to make sure its research is correct.
Now, the reason this takes place isn't really that hard to figure out. If you read the linked analysis I did with Claude, the way the LLMs work is they take, you know, billions and billions of web pages and documents, they take the tokens in each one, and they statistically analyze the relationship between this word and another word. And they don't do it just within one document, they do it across the entire corpus. So it's basically an N-dimensional database where every word or token is related to every other word or token, which sounds a little bit bizarre, but mathematically that's what the innovation in the transformer model is. And so it is statistically figuring out the answer to a question based on, you know, lots of micro-decisions about what word or phrase goes with what word or phrase. There is no human logic behind this. There is no higher-level logic asking, does this question make sense in whatever language you asked it? Because it doesn't speak languages, it only speaks tokens. The manifestation is that it looks like a real human intelligence because it's speaking your language. It feels like it knows what it's talking about, but it doesn't. It is really a machine statistically, probabilistically answering questions.
Now, for an application like a self-driving car, or a supply chain analysis of inventory, or something that calculates, say, weather trends, where it's looking at physical data and analyzing the trend based on history or on mathematical calculations, it isn't that hard to imagine it being correct, because it's using statistical techniques against statistical data and numbers. But when it's dealing with words, and it doesn't really know what the words mean, and it's treating a word as if it were a representation of numbers, it seems quite logical that it would make these mistakes.
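To make the token-statistics point above concrete, here is a deliberately tiny, hypothetical sketch, not how any production LLM actually works, of "answering" purely from word co-occurrence counts, with no check on whether the result is true or logical. The toy corpus and function names are invented for illustration.

```python
# Toy illustration (hypothetical): "answering" by next-token statistics alone.
# A real transformer learns billions of weighted relationships across huge corpora;
# this toy bigram model just counts which token tends to follow which, then chains
# the most likely next tokens, with no fact or logic check anywhere.
from collections import Counter, defaultdict

corpus = (
    "the pope lives in rome . "
    "the pope leads the church . "
    "the pope lives in vatican city . "
).split()

# Count how often each token follows each other token, a crude stand-in for
# the statistical relationships an LLM encodes between tokens.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def continue_text(token, steps=4):
    out = [token]
    for _ in range(steps):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        # Pick the statistically likeliest next token, true or not.
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

print(continue_text("pope"))
```

A real model replaces these raw counts with learned weights and attention over long contexts, but the underlying move is the same: pick what is statistically likely, not what has been verified.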
As I talked about in the article I published over the weekend, if you had a massive amount of data in your company about, say, HR policies, and 1 or 2% of the policies were incorrect, maybe the pay policy or the cost-of-living variation policy or something like that, then when you started asking questions about salaries, promotions, and policies for pay changes as people move from city to city, and the system is sourcing that small amount of erroneous data in its answers, many, many answers would be affected. Not just answers about that topic, but answers that relate to that topic are going to be affected. So, as the Claude system confirmed with me, these are essentially polluted corpuses, where a very small amount of pollution affects everything. I suppose the analogy would be: if you're drinking water from a treatment plant that has a small amount of pollution in it, do you really care that it's only a small amount when it's going into your body? Yes, certainly I do. So this small amount of incorrect data, which is inevitable when you're scraping data from the Internet, especially if you start doing advertising, has the potential to create, as I said, 45% of queries producing incorrect or erroneous results.
Now, this BBC study wasn't enormous, it was 3,000 queries, but that's still pretty big, and I don't think the BBC has any axe to grind with AI vendors whatsoever. They're simply stating facts. I think in most of our experiences we'll find that these systems are great to use, they help you find flights going to a certain location for a certain price and look things up on the Internet, and I think they'll get better over time as they get more interconnected with transaction systems and e-commerce systems. And certainly the AI tools we use for learning in Galileo are exhaustively accurate and good. But in our case, there's no data in the Galileo corpus that we don't know where it came from. We are responsible for every piece, every single word, in that corpus.
When you're on the Internet, these companies, OpenAI in particular, though I think they'll all go this direction, are starting to monetize their businesses around advertising or transaction fees. There's going to be bias. I'm not saying they're doing this yet, but I think it's coming. If OpenAI starts to take a transaction fee on an airline flight you purchase, or something like that, it has an incentive to promote one airline over another, which could lead the LLM to answer a query about airlines by saying that one airline doesn't go to a certain city and the other one does, when actually it does, because the system is motivated, or essentially programmed, not to give you the alternative. And obviously, if advertising shows up inline when you're asking a simple query about anything, all bets are off as to whether the system will be trustable.
And the thing about trust is, once you lose it, it's hard to get it back. In fact, I'm not sure you ever get it back. If you relied upon a doctor or a dentist or a lawyer or an accountant who made a significant mistake, [00:10:04] it would not surprise me if you fired them and didn't go back to them again, because you don't know what other mistakes they may also be making. This is something, by the way, we dealt with in great detail at Deloitte: this sense that every conversation, every interaction, every question, every answer you get from a person or an entity or a company has an impact on your trust. And, you know, we as humans are very intuitive animals, and we tend to have antennae, a sixth sense as it's called, about the different entities and people we trust. And my sense of trust in OpenAI is not very high. Now, I'm not saying it isn't useful. Sometimes it does amazing things, and I just did a bunch of queries on HR benchmarking that were very good. But then I found out that the source of the data it produced for me was SHRM, and it was really only informed by small to medium-sized companies.
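Going back to the polluted-corpus arithmetic a few paragraphs above, here is a rough, hypothetical back-of-the-envelope simulation of how a small share of bad documents can touch a much larger share of answers when each answer silently blends several sources. The pollution rate, corpus size, and sources-per-answer figures below are assumptions for illustration, not numbers from the BBC study.

```python
# Hypothetical simulation of the "polluted corpus" effect: if each answer aggregates
# several source documents, even a 2% error rate in the corpus touches a much larger
# share of answers. All parameters are illustrative assumptions.
import random

random.seed(0)
CORPUS_SIZE = 10_000
POLLUTION_RATE = 0.02          # assumed: 2% of documents contain an error
SOURCES_PER_ANSWER = 8         # assumed: each answer blends several retrieved documents
N_QUERIES = 5_000

polluted = set(random.sample(range(CORPUS_SIZE), int(CORPUS_SIZE * POLLUTION_RATE)))

affected = 0
for _ in range(N_QUERIES):
    sources = random.sample(range(CORPUS_SIZE), SOURCES_PER_ANSWER)
    if any(doc in polluted for doc in sources):
        affected += 1  # at least one tainted source fed this answer

print(f"{POLLUTION_RATE:.0%} polluted documents -> "
      f"{affected / N_QUERIES:.0%} of answers touched a polluted source")
# Expected roughly 1 - (1 - 0.02)**8, about 15% of answers, from only 2% bad data.
```

With these made-up parameters, a 2% pollution rate reaches roughly 15% of answers, which is the mechanism, though not the exact number, behind the "small pollution, big effect" argument.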
So even though it didn't tell me this in the answer, it was actually giving me quite biased data, and I was smart enough to look at the source and make sure I wasn't misusing it. So this is a real state of the market. The solutions, as I talked about in the article, are three things.
Number one, for all of your internal systems, you are responsible for data quality. Whether it's HR policies, recruiting data, compensation analysis, performance distributions, whatever it may be, you as a company and you as an HR function have to make sure that before you launch something to a bunch of employees or a bunch of job seekers, the information behind it is trusted and accurate. That means the people who sourced it need to be involved in a governance chain of maintaining it, and you need a number of people who are responsible for keeping it up to date. We do this with Galileo because we know where all the research came from, we're not that big, and this is a very core part of our business. But in the HR department you probably don't have people doing this full time, and you really need to. This idea that we're going to reduce all of our HR self-service agents and put everybody into a chatbot is great, but that doesn't mean there will be no people involved. There are going to be a lot of people maintaining the data behind the scenes. As I mentioned the other day, IBM told us last week that they've done this for 6,000 HR policies: they not only have owners for each policy who are responsible for keeping it up to date, they're also building an agent that compares their policies to regulatory laws all over the world, so they get alerts when one of their policies falls out of legal compliance. That's the kind of stuff we're going to have to do, and you're going to have to rely on vendors like us or others to help you keep these things up to date.
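As a minimal, hypothetical sketch of that kind of governance check, here is what an owner-and-review-date registry with a staleness alert might look like. The field names, the one-year review interval, and the example policies are all assumptions for illustration, not IBM's or any vendor's actual system.

```python
# Hypothetical sketch of policy governance: every policy has a named owner and a
# review date, and anything stale gets flagged before it feeds an HR chatbot.
# Field names and the review threshold are assumptions.
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=365)  # assumed: policies must be re-reviewed yearly

@dataclass
class Policy:
    policy_id: str
    owner: str               # the person accountable for keeping it current
    last_reviewed: date
    jurisdictions: list[str]  # where a compliance agent would compare it against local law

def stale_policies(policies: list[Policy], today: date) -> list[Policy]:
    """Return policies whose review date has lapsed and should be pulled from the corpus."""
    return [p for p in policies if today - p.last_reviewed > REVIEW_INTERVAL]

catalog = [
    Policy("PAY-001", "j.doe", date(2023, 1, 15), ["US", "UK"]),
    Policy("REL-014", "a.lee", date(2025, 6, 1), ["DE"]),
]

for p in stale_policies(catalog, date(2025, 10, 29)):
    print(f"ALERT: {p.policy_id} owned by {p.owner} is overdue for review")
```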
The second solution to this problem is intelligent thinking, I guess is the word I would use. Now, you know, I'm an analyst, I've been a debater, and I'm the kind of person who challenges assumptions all the time. That's one of the reasons I'm good at what I do, and I think one of the reasons people sometimes get frustrated talking to me. But what it comes down to is that I use the scientific method. I don't believe anything until it's proven, and I have hypotheses that help me make decisions which are constantly being updated by new data. This is one of my skills as an analyst, and it's a tricky skill that I've developed over almost 70 years: doing debate in high school, studying physics and math and English and history in college, working in sales, working in marketing, working in a whole bunch of different companies, just having a lot of experience. [00:13:39] My skills at judging and analyzing data come from my own personal experience. Every employee in your company has some amount of similar experience, and we want them to use that judgment when they interpret the results of what an AI agent tells them. It may be that the AI agent is much, much more informed than you are. It may have data you have no access to, and therefore be making a recommendation on who to hire or who to give a raise to based on information you just do not have. That's fine. But you really owe it to yourself and to your company to see what assumptions were made and to validate them against your human judgment.
Now, you're not going to do that every time you use it. If you use HiredScore and it just gives you a number scoring candidates, you're probably not going to dig in and ask it why it made every single score. But you're actually going to have to do something like that, because if a candidate is rejected, and that candidate goes to the EEOC because they believe they've been discriminated against for some reason and they file a lawsuit, someone in your company, the legal department or someone else, is going to have to go back and figure out how those decisions were made. So this whole human-in-the-loop thing, and I hate that phrase, is really the existential future we have with AI.
You know, my trainer has a Tesla, and we talk about Tesla all the time, every time I see him. He's told me about its self-driving features, and he's a very bright guy, and he uses it all the time. He has a long commute. But he says it makes mistakes pretty regularly: it goes too fast, goes too slow, sometimes changes lanes and does other things he doesn't want it to do. So he's keeping his eye on it all the time. And Tesla's pretty advanced, but it's not that advanced. These systems are not perfect, and particularly LLMs, which are language models: all they look at is words, and you know how intricate and subtle and complex language is. Two people will read the same sentence and interpret the meaning differently because of the way the words were put together, or the theme, or the tone. We have to apply human judgment to these things, at least for the next couple of years, until they become much, much better at what they do. And honestly, I am really concerned about the business models of these companies, after they've spent trillions of dollars, moving towards advertising.
The third area of attention is really teaching your company as a whole what to outsource and what to bring in house, because of the potential lack of trust in these external systems. Particularly in our domain of HR, everybody thinks they're an expert. I've never met somebody who doesn't think they know a lot about hiring or management or training people, et cetera. Well, there's a lot of anecdotal information on the Internet. Are you comfortable letting your employees use that for business decisions inside of your company? I do a lot of company financial analysis. I look at revenue per employee, growth rates, market cap, all sorts of stuff like that. That's fairly easy to find, and it's fairly easy to standardize on the data. But go one level deeper, into inventories or headcount or the growth rates of various parts of these businesses, and you get different answers from different sources. Your company, unfortunately, is going to have to decide which of these decisions you're willing to let employees make from outside data and which ones you want to make from inside data.
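Circling back to the EEOC scenario above, where someone eventually has to reconstruct how a screening decision was made: here is a minimal, hypothetical sketch of logging enough context about each AI-assisted score to review it later. This is not HiredScore's or any vendor's actual API; every field name here is an assumption.

```python
# Hypothetical audit logging for AI-assisted screening decisions (illustrative only):
# record what the model saw, what it scored, and the rationale it gave, so a human
# or the legal team can later reconstruct how the decision was made.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ScreeningAuditRecord:
    candidate_id: str
    requisition_id: str
    model_version: str       # which model or prompt produced the score
    score: float
    inputs_summary: dict     # what data the model actually saw
    stated_rationale: str    # the explanation the system gave, for later validation
    reviewed_by_human: bool
    timestamp: str

def log_decision(record: ScreeningAuditRecord, path: str = "screening_audit.jsonl") -> None:
    """Append the decision to an audit log so it can be reviewed if ever challenged."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_decision(ScreeningAuditRecord(
    candidate_id="C-1042",
    requisition_id="R-77",
    model_version="screening-model-v3",
    score=0.62,
    inputs_summary={"resume_fields": ["skills", "tenure"], "source": "ATS export"},
    stated_rationale="Skill overlap with requisition; limited domain tenure.",
    reviewed_by_human=False,
    timestamp=datetime.now(timezone.utc).isoformat(),
))
```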
The thing I find most interesting, and maybe fun, about this new AI technology cycle we've entered is that these tools are programmable by you. The vibe coding tools, and I just had a demo of Rovo from Atlassian yesterday, and other tools are now at the point where you, as a business person, as a staff person, can program them in English or whatever language you speak. So just like you can program Excel yourself, without needing a software engineer to build a spreadsheet, you're going to be able to program your agent, to quote unquote teach your agent to do what you want it to do. And when your agent makes mistakes, which it will, because these things are not 100% accurate, you are going to have to figure out: is it the source of the data that's wrong? Is it the way I programmed it? Is it the nature of the question that was asked? And so we are going to have to bring this stuff in house.
From the standpoint of public, citizen use of AI, I think we'll see what happens with things like this BBC report, whether people complain or just ignore it and go on with their lives. I think it's pointing out a reality of the AI industry: there's a lot of hard work required to make these things trusted. By the way, that's what great technology and data vendors will do. The technologies of AI are very data dependent. These are probabilistic systems that are basically training themselves on the data. [00:18:55] So data labeling, data quality, data sourcing, data relevance, and data timeliness are massively important issues in the utility and value of these systems. We've never really had technology like this before. If you went out and bought SAP or Workday or some ATS, whatever it was, you didn't care about the data. You put the data into it as you used it, and it functioned regardless of the data. The AI systems are the opposite: they function differently based on the data. So we are going to be in a new world of business re-engineering around these systems.
If you look at sophisticated, mature AI companies, Amazon for example, they've had a lot of time to work on this. They've been building AI systems for many years, and they've dealt with errors. You probably remember when Amazon created a recruiting bot that didn't like women. They've seen what can happen when things aren't trained well. But I think each one of us, in all of our companies, is going to go down this learning curve. Even the credit agencies who use AI to evaluate your credit, and the credit card companies who look for fraud: there have been decades of experience in financial services in particular in using these AI systems and training them to be as accurate as possible. But for those of us using them for language stuff, this is new. So that's kind of the theme here.
I would also like to make a plug for Galileo in the domain of HR, which includes all the HR practices, HR tools, HR benchmarks, skills by role, job titles, dynamic data on turnover, span of control, and regulatory data around the world. We are standing firmly behind the credibility of our data. So if you're trying to find a source of credible data for some of those domains, and we're adding salary data very soon, so I'll have an announcement next year on that and a couple of other cool things, you can rely on us as a trusted data source. Not only is Galileo an amazing tool to solve problems and to learn, it is a trusted corpus. And that's really the business I've been in for 30 years: not producing data that's inaccurate or opinionated, but basing everything on facts and analysis and research that we've done.
So, you know, we're one of the credible sources out there, and there are others in different domains. I think each domain of business will have companies like us, in law, in medicine, in financial benchmarking, and in various other areas as well. But with the public domain tools, I think you're just going to have to be a little more careful. It's going to be really fun to watch what OpenAI does. It was interesting to me that Gemini scored very poorly in the BBC report when Google has more news data than anybody else. I've noticed the same thing, by the way; I don't use Gemini at all, because it seems to produce all sorts of wacky answers in my experience. But maybe there are some tricks to using it that I don't understand. [00:21:58] So take a look at the article, take a look at the research, call us if you have any questions, and get your hands on Galileo, and you'll see what a really trusted AI system looks and feels like. That's it for now. Talk to you later.
