Sponsor detection

Enhanced sponsor detection using LLMs.

By Suraj Thapa Aug 29, 2025

Problem statement

Our tool for matching sponsors in Texas Tribune stories frequently missed matches, forcing editors to manually add entries. The issue: Sponsors are mentioned inconsistently across articles, breaking our regex-based text search.

Examples:

"Texas Farm Bureau's" (in the article) wasn't matched to "Texas Farm Bureau" (sponsor)
"University of Texas Rio Grande Valley" (in the article) wasn't matched to "University of Texas - Rio Grande Valley" (sponsor)

Our search tool failed when text variations broke our regex-based approach.

Enhanced sponsor detection system

Overview

Our new sponsor detection system uses AI to identify sponsor mentions in articles through entity extraction, semantic embeddings, and similarity matching. The system is designed to detect sponsors when articles mention those organizations and individuals, helping with disclosure requirements.

How does it work?

We extract the entities from the stories using OpenAI’s LLM
We cross match against those extracted entities with the existing sponsors in our sponsor list

Takeaway: Our new sponsor detection system leverages LLM technologies' extraction capabilities and advanced search through cosine similarity, moving beyond traditional ML methods. This makes the system far more flexible in handling name variations. For example, it recognizes Texas Farm Bureau is 96% similar to Texas Farm Bureau's.

Engineering deep dive

Architecture Overview

The system follows a multi-stage pipeline:

Text Processing & Chunking
Named Entity Recognition (NER)
Semantic Embedding Generation
Similarity Matching
Post-processing & Filtering

1. Text Processing & Chunking

The system first processes incoming text through intelligent chunking by configuring chunk size. This is to handle the long articles that exceed OpenAI's token limits.

2. Named Entity Recognition (NER)

This is the core extraction phase using OpenAI's GPT-4 model. We crafted a prompt that instructs the AI to extract organizations, extract people, handle variations with some examples, and output the json structured list.

3. Semantic Embedding Generation

The system uses OpenAI's text-embedding-3-large model to convert both extracted entities and sponsor names into high-dimensional vectors.

4. Similarity Matching

We use cosine similarity as the core matching algorithm, with a threshold of 0.94 determined through iterative testing. Cosine scores were computed for all possible entity–sponsor pairs using NumPy.

5. Post-processing and Filtering

We apply post-processing and filtering to handle edge cases that arise from the nondeterministic behavior of LLMs.

How did we decide this was the best approach?

Our approach was highly iterative and driven entirely by metrics. I used MLflow as the central stack for tracking metrics, which allowed us to systematically evaluate different methods with the goal of achieving over 90% in precision, accuracy, and recall. Here are some of the approaches we experimented with: (1) Extracting entities using alternative models such as spaCy, (2) Testing different n-gram combinations, (3) Applying fuzzy search with existing names, (4) Adjusting cosine similarity thresholds, and (5) Designing and refining LLM prompts. By leveraging MLflow for detailed metric tracking, we were able to compare approaches effectively and select the solution that best met our requirements. This iterative process proved invaluable in optimizing our results.

Performance Optimizations

Batch Embedding: Processes multiple entities simultaneously
Caching: Local sponsor data caching
Parallel Processing: Multi-threading for similarity calculations
Adaptive Chunking: Dynamic text segmentation based on API limits

API Integration

The system is deployed as an AWS Lambda function. I used AWS SAM and AWS cloudformation to manage the deployment. I also added github actions CI/CD pipelines to track the metrics while creating pull requests. The metrics are also sent to our mlflow server.

Monitoring

All the logs are configured to send to aws cloudwatch.

Note on a unique error

Modern tools -> modern bugs! When Facebook is mentioned as a social media, it will not recognize Facebook as an organizational entity. Example, “We will serve until we run out,” the Inn posted in a Facebook invitation. Thus, the LLMs do not extract Facebook as an organizational entity, until we are explicit about social media platforms in our prompt.

Results

Our system delivers a robust and scalable solution for sponsor detection by combining large language models with semantic similarity matching. This hybrid approach enables high accuracy while handling real-world newsroom complexities—such as variations, abbreviations, and alternative entity names. Compared to our earlier baseline of 68% accuracy, the improved system now achieves: 98% precision, 98% recall, and 97% accuracy. This marks a significant leap in performance.

Future Improvements

Looking ahead, there’s an opportunity to build automation that regularly feeds real-world input back into the system, ensuring it stays reliable over time. For example, when editors update sponsor suggestions, that signals the original recommendation was incorrect. Capturing these moments would allow us to incorporate the feedback into future iterations of this system.