Sponsor detection
Problem statement
Our tool for matching sponsors in Texas Tribune stories frequently missed matches, forcing editors to manually add entries. The issue: sponsors are mentioned inconsistently across articles, breaking our traditional text search.
Examples:
- "Texas Farm Bureau's" wasn't matched to "Texas Farm Bureau"
- "University of Texas Rio Grande Valley" wasn't matched to "University of Texas - Rio Grande Valley"
Our search tool failed whenever text variations like these broke our regex-based approach. We resolved this by enhancing the detection system with an LLM, which is more adaptable to variation.
How the Sponsor Detection System Works
Overview
Our new sponsor detection system uses AI to identify sponsor mentions in articles through Named Entity Recognition, semantic embeddings, and similarity matching. The system is designed to detect sponsors when articles mention those organizations and individuals, helping with disclosure requirements.
How does it work?
- We extract organization and person names from each story using OpenAI’s LLM
- We cross-match those extracted names against the existing sponsors in our sponsor list
Takeaway: Our new sponsor detection system combines LLM-based entity extraction with cosine-similarity search over embeddings, moving away from traditional ML approaches.
Engineering deep dive
Architecture Overview
The system follows a multi-stage pipeline:
- Text Processing & Chunking
- Named Entity Recognition (NER)
- Semantic Embedding Generation
- Similarity Matching
- Post-processing & Filtering
1. Text Processing & Chunking
The system first splits incoming text into chunks of a configurable size. This lets it handle long articles that would otherwise exceed OpenAI's token limits.
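As a rough illustration of this step, here is a minimal chunking sketch. It is not our production code: the function name `chunk_text`, the `max_chars` parameter, and the use of character counts as a stand-in for token counts are all assumptions made for the example.

```python
def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Assumption: character length is used as a simple proxy for token count.
    Chunks break on line (paragraph) boundaries where possible.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single paragraph longer than the limit is hard-split.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would overflow, so start a new chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Breaking on paragraph boundaries keeps a sponsor name from being cut in half in the common case; only a paragraph that is itself longer than the limit gets a hard split.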
2. Named Entity Recognition (NER)
This is the core extraction phase, which uses OpenAI's GPT-4 model. We crafted a prompt that instructs the model to extract organizations and people, handle name variations (illustrated with a few examples), and return the results as a structured JSON list.
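Because the model's reply is free text, the JSON it returns has to be validated before use. The sketch below shows one way that could look. The response shape (an object with `organizations` and `people` arrays) and the helper name `parse_entities` are hypothetical, chosen only for this example; the real prompt and schema may differ.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker the model may wrap its JSON in


def parse_entities(raw_response: str) -> list[str]:
    """Parse a model reply into a de-duplicated list of entity names.

    Assumed (hypothetical) reply shape:
    {"organizations": ["..."], "people": ["..."]}
    """
    text = raw_response.strip()
    # Strip a surrounding code fence and optional "json" language tag.
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return []  # Treat unparseable output as "no entities found".
    if not isinstance(payload, dict):
        return []

    names: list[str] = []
    seen: set[str] = set()
    for key in ("organizations", "people"):
        values = payload.get(key, [])
        if not isinstance(values, list):
            continue
        for value in values:
            if not isinstance(value, str):
                continue
            name = " ".join(value.split())  # collapse stray whitespace
            # De-duplicate case-insensitively, keeping the first spelling seen.
            if name and name.casefold() not in seen:
                seen.add(name.casefold())
                names.append(name)
    return names
```

Returning an empty list on malformed output is a design choice for the sketch; a real pipeline might retry the request instead.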
3. Semantic Embedding Generation
The system uses OpenAI's text-embedding-3-large model to convert both extracted entities and sponsor names into high-dimensional vectors.
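A minimal sketch of batched embedding generation follows. It assumes the official `openai` Python client, whose `embeddings.create` method accepts a list of inputs; the helper name `embed_texts` and the `batch_size` value are illustrative assumptions, not details of our implementation.

```python
def embed_texts(
    client,
    texts: list[str],
    model: str = "text-embedding-3-large",
    batch_size: int = 256,  # assumed value, tune to your rate limits
) -> list[list[float]]:
    """Embed texts in batches and return one vector per input, in input order.

    `client` is expected to behave like an `openai.OpenAI()` instance, e.g.:
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    """
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Each returned item carries the index of the input it belongs to.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(list(item.embedding) for item in ordered)
    return vectors
```

The same function can embed both the extracted entities and the sponsor list, so the sponsor vectors only need to be computed once and can then be cached.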
4. Similarity Matching
We use cosine similarity as the core matching algorithm, with a threshold of 0.94 determined through iterative testing. Cosine scores are computed for all possible entity–sponsor pairs using NumPy.
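The all-pairs comparison can be expressed as a single matrix product once the vectors are normalized to unit length. The sketch below is an illustration under assumptions: the function name `match_sponsors` and its argument layout are invented for the example, and it assumes no embedding is a zero vector.

```python
import numpy as np


def match_sponsors(entities, entity_vecs, sponsors, sponsor_vecs, threshold=0.94):
    """Return (entity, sponsor, score) for every pair scoring at or above threshold.

    entities / sponsors are lists of names; *_vecs are the matching embeddings,
    one row per name. Assumes all vectors are non-zero.
    """
    if not entities or not sponsors:
        return []
    e = np.asarray(entity_vecs, dtype=float)
    s = np.asarray(sponsor_vecs, dtype=float)
    # Normalize rows so the dot product equals cosine similarity.
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    scores = e @ s.T  # shape: (n_entities, n_sponsors)
    rows, cols = np.nonzero(scores >= threshold)
    return [
        (entities[i], sponsors[j], float(scores[i, j]))
        for i, j in zip(rows, cols)
    ]
```

For example, with toy two-dimensional vectors, an entity at `[1, 0]` matches a sponsor at `[2, 0]` (same direction, so cosine 1.0) but not one at `[1, 1]` (cosine about 0.707, below the 0.94 threshold).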
5. Post-processing & Filtering
We apply post-processing and filtering to handle edge cases that arise from the nondeterministic behavior of LLMs.
How did we decide this was the best approach?
Our approach was highly iterative and driven by metrics. I used MLflow as the central tool for tracking metrics, which allowed us to systematically evaluate different methods against a goal of over 90% precision, recall, and accuracy. Here are some of the approaches we experimented with:
- Extracting entities using alternative models such as spaCy
- Testing different n-gram combinations
- Applying fuzzy search against existing sponsor names
- Adjusting cosine similarity thresholds
- Designing and refining LLM prompts
Tracking every experiment in MLflow let us compare approaches directly and select the solution that best met our requirements.
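To make the three target metrics concrete, here is one way they could be computed for a single story. This is a sketch under an assumption about the evaluation setup: each sponsor in the sponsor list is treated as a yes/no prediction for that story, which is what gives "true negatives" (and therefore accuracy) a meaning. The function name `evaluate` is illustrative.

```python
def evaluate(predicted: set[str], expected: set[str], sponsor_list: set[str]) -> dict[str, float]:
    """Compute precision, recall, and accuracy for one story.

    predicted: sponsors the system detected
    expected: sponsors an editor marked as actually mentioned
    sponsor_list: every sponsor that could have been detected
    """
    tp = len(predicted & expected)   # detected and correct
    fp = len(predicted - expected)   # detected but wrong
    fn = len(expected - predicted)   # missed
    tn = len(sponsor_list - predicted - expected)  # correctly not detected
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

A dictionary of this shape can be passed directly to MLflow's `mlflow.log_metrics` so that each experiment run is recorded with comparable numbers.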
Performance Optimizations
- Batch Embedding: Processes multiple entities simultaneously
- Caching: Local sponsor data caching
- Parallel Processing: Multi-threading for similarity calculations
- Adaptive Chunking: Dynamic text segmentation based on API limits
API Integration
The system is deployed as an AWS Lambda function. I used AWS SAM and AWS CloudFormation to manage the deployment. I also added GitHub Actions CI/CD pipelines that report the metrics on each pull request. The metrics are also sent to our MLflow server.
Monitoring
All logs are sent to AWS CloudWatch.
Note on a unique error
Modern tools, modern bugs! When Facebook is mentioned only as a social media platform, the LLM does not recognize it as an organizational entity. For example: “We will serve until we run out,” the Inn posted in a Facebook invitation. The LLM did not extract Facebook from sentences like this one until we explicitly covered social media platforms in our prompt.
Results
Our system delivers a robust and scalable solution for sponsor detection by combining large language models with semantic similarity matching. This hybrid approach achieves high accuracy while handling real-world newsroom complexities such as name variations, abbreviations, and alternative entity names. Compared with our earlier baseline of 68% accuracy, the improved system now achieves 98% precision, 98% recall, and 97% accuracy, a significant leap in performance.