Structural Risk in AI-Driven Origination: The Plausible List Problem
The target universe is not a starting point to refine later. It is the foundation that determines which founders you reach, which companies you never see, and whether the mandate completes in six months or twelve. If you are deploying out of a fund of any meaningful size, this list deserves the same scrutiny as the investment memo that follows it. Increasingly, that scrutiny is absent because the list arrives looking complete. Clean formatting, estimated revenues, contact information, fit scores. The presentation suggests rigor. What it does not reveal is how the underlying classifications were made, or how many companies were confidently excluded before anyone had a chance to review them.
Over the past three years, a new category of buy-side origination has emerged. The marketing is consistent: AI-powered market mapping, proprietary data infrastructure, intelligent targeting that identifies companies before they come to market. The landscape ranges from workflow platforms to specialized firms claiming advanced classification systems built on large language models. The underlying assumption is reasonable. If AI can parse contracts, generate code, and pass professional licensing exams, it should be able to classify companies against an investment thesis. The technology exists, the infrastructure is accessible, and the outputs arrive looking indistinguishable from those built on deeper foundations.
What most investors have not examined is what actually powers these systems. The answer varies in sophistication at the front end but converges at the point that matters most. Some firms license commercial databases and export raw lists based on industry codes, geography, and size filters. Others build proprietary scraping infrastructure, pulling from Google Maps, Secretary of State filings, permit databases, or industry directories. The more advanced approaches combine multiple sources and deduplicate across them. But regardless of how the initial universe is assembled, classification typically follows the same path: a large language model reads each company's website and classifies it as in-scope or out-of-scope based on a prompt describing the investment thesis. The pipeline can be assembled in weeks and produces results that appear comprehensive.
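For concreteness, here is a minimal sketch of what that classification step often looks like in practice, assuming the OpenAI Python SDK and a hypothetical HVAC-services thesis prompt. The prompt, model choice, and helper names are illustrative, not a reconstruction of any particular vendor's pipeline.

```python
# Illustrative sketch of a prompt-based classification pass over company
# websites. Assumes the OpenAI Python SDK; the thesis prompt is hypothetical.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

THESIS_PROMPT = (
    "You are screening companies for an investment thesis: residential HVAC "
    "services businesses in the United States with recurring maintenance "
    "revenue. Answer with exactly one word: IN_SCOPE or OUT_OF_SCOPE."
)

def fetch_site_text(url: str, max_chars: int = 8000) -> str:
    """Pull visible text from a company website, truncated to fit the prompt."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")[:max_chars]

def classify_company(url: str) -> str:
    """Ask the model for a binary in/out call. Note what is missing: no
    confidence signal, and no record of why a company was excluded."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": THESIS_PROMPT},
            {"role": "user", "content": fetch_site_text(url)},
        ],
    )
    return response.choices[0].message.content.strip()

# classify_company("https://example.com")  # -> "IN_SCOPE" or "OUT_OF_SCOPE"
```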
These tools deliver value for initial list building and enrichment, where error tolerance is higher and mistakes are caught downstream. The question is whether classification belongs in that category. Classification is where the investment thesis meets the target universe. It determines which companies enter the funnel and which are excluded before anyone reviews them. Errors at this stage do not get caught downstream. They compound silently until conversion rates reveal what the architecture could not. This is the plausible list problem: outputs that look complete but cannot be verified, built on systems that prioritize speed over accuracy. The research now available suggests these systems are not reliable for this task, and the trajectory is worsening.
Why the Models Are Getting Worse
The assumption underlying most AI investment is that newer models are more accurate than older ones. For years, that assumption held. Each generation reduced hallucination rates, improved factual recall, and extended the range of tasks that could be reliably automated. That trend has now reversed in ways that matter directly for origination, and the reversal is documented by the labs building these systems.
In April 2025, OpenAI released o3 and o4-mini, its most advanced reasoning models at the time. Internal testing revealed that o3 hallucinated on 33 percent of factual queries on the PersonQA benchmark, more than double the rate of its predecessor o1. The smaller o4-mini performed worse, producing incorrect information 48 percent of the time. On SimpleQA, which tests general factual knowledge, hallucination rates reached 51 percent for o3 and 79 percent for o4-mini; o1 hallucinated 44 percent of the time on the same benchmark. The pattern is unambiguous: the more sophisticated the reasoning capability, the less reliable the factual output.
OpenAI's technical report acknowledged the regression but offered no solution, noting only that "more research is needed to understand the cause of these results." The company's subsequent research provided a structural explanation that should concern anyone building on these systems. Models are trained to maximize accuracy on benchmarks, which rewards confident guessing over acknowledging uncertainty. A model that declines to answer a question it cannot reliably answer is penalized. A model that guesses confidently has a chance of being right, and over thousands of evaluation queries the guessing model scores higher on the metrics that matter for leaderboards and product marketing. The incentive to hallucinate is not a bug. It is embedded in how these systems are designed.
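A toy expected-score calculation makes that incentive concrete. The probabilities below are hypothetical, chosen only to illustrate the mechanism the research describes.

```python
# Toy illustration of the scoring incentive described above (numbers are
# hypothetical, not taken from the cited papers). Under plain accuracy
# scoring, a model that always guesses beats one that abstains when unsure.
p_known = 0.70   # fraction of questions the model genuinely knows
p_lucky = 0.20   # chance a blind guess happens to be right

honest_model = p_known * 1.0 + (1 - p_known) * 0.0        # abstains -> 0.70
guessing_model = p_known * 1.0 + (1 - p_known) * p_lucky  # guesses  -> 0.76

print(f"honest: {honest_model:.2f}  guessing: {guessing_model:.2f}")
# Accuracy rewards the guesser, even though every one of its extra "wins"
# is indistinguishable from a confident hallucination to the reader.
```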
The research goes further than identifying the problem. It explains why hallucinations are mathematically inevitable given how language models are trained. Some facts do not have sufficient signal in the training data to be reliably recovered as truth, yet the model is still pressured to output something that appears to be true. The intuition that more data or larger models will solve the problem does not hold. More data can reduce some errors, but it cannot eliminate the structural incentive to guess. Random noise averages out with scale; systematic bias multiplies. If the training process rewards confident answers over calibrated uncertainty, each generation of models will continue to produce outputs that sound authoritative regardless of whether the underlying evidence supports them.
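A rough sketch of that distinction, under invented numbers: re-running a noisy classifier changes which companies it gets wrong, while a systematic blind spot misses the same companies on every pass.

```python
import random

# Toy contrast between the two error types named above (illustrative only).
# Random noise: each classification flips independently with 5% probability.
# Systematic bias: every thinly documented company is missed, every time.
rng = random.Random(0)
universe = 10_000
thin_footprint = 2_000  # hypothetical count of companies with minimal web presence

noise_errors = sum(rng.random() < 0.05 for _ in range(universe))
bias_errors = thin_footprint  # the same 2,000 companies, on every run

print(f"random errors this run: {noise_errors} (different companies each pass)")
print(f"systematic errors:      {bias_errors} (the same companies each pass)")
# Averaging or majority-voting across passes can cancel the first kind of
# error; the second kind hits the same companies every pass and never cancels.
```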
Where Classification Fails
Research from Harvard Business School and Boston Consulting Group identified what they termed a "jagged technological frontier," a boundary where some tasks are easily accomplished by AI while others, though seemingly similar in difficulty, fall completely outside its reliable capability. The study tested over 750 consultants and found that for tasks inside the frontier, AI assistance improved performance by 12 to 40 percent across multiple dimensions. When researchers introduced a task designed to fall outside the frontier, accuracy among consultants dropped from 84 percent without AI to between 60 and 70 percent with it. The AI did not merely fail to help. It actively degraded performance by producing wrong answers that sounded authoritative, and users could not distinguish which tasks fell on which side of the boundary.
Company classification sits directly on this frontier. The task resembles pattern recognition, which language models perform well, but the actual requirement is judgment under uncertainty. Classifying whether a company fits an investment thesis means interpreting ambiguous evidence, weighing incomplete information, and making calls that depend on context the model cannot access. A company's website may mention "heating and cooling" once in a list of services, and the model will confidently classify it as an HVAC contractor. Another company may do substantial HVAC work but present itself as a commercial construction firm, and the model will confidently exclude it. Neither decision is flagged for review.
The errors are systematic in ways that compound for origination specifically. Companies with minimal digital footprints are under-indexed because the model has less text to process. Regional operators who do not market heavily online fall through because their websites lack the keywords that trigger classification. Businesses that describe their work in language outside standard taxonomy get misclassified because the model pattern-matches against training data that reflects how industries are typically described, not how individual operators present themselves. These blind spots affect every thesis in every sector where the best acquisition targets are not optimizing for web visibility, which is precisely the market institutional investors want to reach.
Retrieval-augmented generation does improve factual accuracy. OpenAI's own data shows GPT-4o with web search hits 90 percent on SimpleQA versus 51 percent without. But retrieval improves recall, not judgment. The model still has to decide whether a company fits a thesis based on ambiguous evidence, and that decision remains probabilistic regardless of what information is retrieved. The deeper problem is that humans review inclusions, not exclusions. When a list arrives, someone may validate whether the companies on it belong. No one sees the companies that were confidently excluded before the list was generated. The errors that matter most are invisible by design.
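One practical response, sketched here under our own assumptions rather than drawn from any cited system, is to audit the exclusions directly: sample companies the model rejected, have a reviewer re-check them, and estimate the false-exclusion rate with an interval rather than trusting a silent zero.

```python
import math

# Hypothetical audit of the excluded side of a classified universe: a
# reviewer re-checks a random sample of rejected companies, and we estimate
# how many in-scope targets were silently dropped. Numbers below are made up.

def false_exclusion_estimate(audit_results: list[bool]) -> tuple[float, tuple[float, float]]:
    """audit_results: True means the reviewer judged the excluded company in scope."""
    n = len(audit_results)
    k = sum(audit_results)
    p = k / n
    # Normal-approximation 95% interval; adequate for a quick audit readout.
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

# Example: reviewers find 14 of 200 sampled exclusions were actually in scope.
rate, (low, high) = false_exclusion_estimate([True] * 14 + [False] * 186)
print(f"estimated false-exclusion rate: {rate:.1%} (95% CI {low:.1%}-{high:.1%})")
```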
How We Built for This
After years of building origination infrastructure inside private equity and watching where investment in AI-powered sourcing was headed, we saw the industry's direction and the technology's actual reliability diverging. The tools were getting faster. The outputs were not getting more accurate. The gap between what was being marketed and what was being delivered would eventually surface in conversion rates, mandate timelines, and coverage that investors assumed was complete but was not.
Origin was built to close that gap. Everyone has some version of data. The difference is whether that data is structured to work alongside an investment thesis with precision, or whether it is a generic export filtered by prompts that cannot distinguish fit from keyword match. The system uses AI throughout the pipeline, including classification, but is architected to reduce hallucinations and maintain granular control over data quality in ways that general-purpose models do not allow. The tradeoff is real: months of infrastructure development instead of weeks, ongoing validation processes instead of set-and-forget deployment. The architecture starts from every company that remotely resembles a fit with a client's investment framework and reduces that set to the investable universe with a precision off-the-shelf systems cannot match.
Across the sectors we have mapped, coverage consistently exceeds commercial platforms by 20 to 40 percent, capturing the companies that probabilistic systems miss. Outreach converts above 20 percent. Full platform searches complete within six months, because the universe is accurate from the start and does not require mid-mandate reconstruction.
The Infrastructure Question
The firms adopting the newest reasoning models for origination workflows are following the historical pattern where each generation of AI outperformed the last on most tasks. That pattern has broken for factual accuracy on ambiguous classification problems, and the labs building these models have now documented why the break is structural rather than temporary. For investors evaluating origination partners, the question is no longer whether AI is involved but whether the architecture accounts for what the research now shows.
Partners who describe AI as a general capability without specifying how classification decisions are governed are building on a foundation the labs admit is unreliable for this task. Partners who can explain how their systems handle uncertainty, how accuracy is validated against outcomes, and how errors are identified and corrected demonstrate the infrastructure discipline that produces reliable coverage. The distinction is not between firms that use AI and firms that do not. It is between firms that understood the limitations before building and firms that will discover them after capital has been deployed.
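What validating accuracy against outcomes can look like in practice, sketched with hypothetical labels rather than any firm's actual process: hold out a hand-labeled sample, compare it to the system's calls, and report precision and recall separately, because a list can look precise while its coverage is quietly poor.

```python
# Illustrative validation harness: compare system classifications against a
# hand-labeled holdout and report precision and recall as distinct numbers.

def precision_recall(predicted: dict[str, bool], labeled: dict[str, bool]) -> tuple[float, float]:
    """predicted/labeled map a company id -> True if classified/judged in scope."""
    tp = sum(1 for c, y in labeled.items() if y and predicted.get(c, False))
    fp = sum(1 for c, y in labeled.items() if not y and predicted.get(c, False))
    fn = sum(1 for c, y in labeled.items() if y and not predicted.get(c, False))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical readout on a six-company holdout.
labeled = {"a": True, "b": True, "c": True, "d": False, "e": False, "f": True}
predicted = {"a": True, "b": True, "d": True, "e": False, "f": False}
p, r = precision_recall(predicted, labeled)
print(f"precision: {p:.0%}  recall: {r:.0%}")  # precision 67%, recall 50%
```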
The plausible list problem is solvable. Solving it requires more than faster tools or newer models. It requires applied data science built for a different objective: not speed to output, but conviction that what you are seeing is actually what exists. The firms that treat the list as a given will keep discovering its limits six months into deployment. The firms that scrutinize the architecture behind it will know what they have before capital moves.
Sources
OpenAI, "Why Language Models Hallucinate" (2025) https://openai.com/index/why-language-models-hallucinate/
OpenAI, "o3 and o4-mini System Card" (April 2025) https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
Dell'Acqua et al., "Navigating the Jagged Technological Frontier," Harvard Business School Working Paper No. 24-013 (September 2023) https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf