How We Built an AI System That Reads 500-Page Vendor Proposals So Procurement Teams Don't Have To
Client: Under NDA — mid-size IT services firm | Industry: Enterprise Procurement / IT Services | Timeline: ~8 months | Region: India (deployed for global use)
At a Glance
| Metric | Result |
|---|---|
| Evaluation Time per Vendor | ~3 weeks → under 2 days |
| Criteria Extraction | Manual → fully automated |
| Compliance Errors | Down by ~70% |
| Evaluators Needed per RFP | 8–10 → 3–4 (rest reassigned) |
What Was Going Wrong
Every quarter, this company evaluates 15–25 vendor proposals in response to large RFPs — government contracts, enterprise IT deals, infrastructure bids. Each RFP runs 200–500 pages. Each vendor proposal is another 100–300 pages. Multiply that by 8–12 vendors per project.
The evaluation process was entirely manual. A team of 8–10 people would split the proposals, read through them, compare against a checklist of eligibility criteria, score each vendor on a dozen dimensions, and write up a recommendation. It took 3–4 weeks per RFP cycle. During peak season, they'd have three cycles running simultaneously.
The problems were predictable:
Inconsistency. Two evaluators reading the same proposal would score it differently. One person flags a missing ISO certification, another misses it entirely. When panel members compared notes, they'd spend hours reconciling — not because anyone was wrong, but because 400-page documents are genuinely hard to read consistently.
Missed criteria. The RFP itself contains dozens of eligibility requirements — some mandatory, some optional, some buried in annexures. The team extracted these by hand. They'd miss a few every time. On one project, a vendor was shortlisted despite not meeting a mandatory financial threshold that nobody caught until the legal review. That one almost went to court.
No audit trail. When leadership asked "why did we pick Vendor B over Vendor C?", the answer was usually a combination of spreadsheets, email threads, and someone's memory. Not exactly defensible in a dispute.
They weren't doing a bad job — they were doing an impossible job with the wrong tools. You can't consistently evaluate 3,000+ pages of dense technical and legal text with spreadsheets and willpower.
What We Built
A platform that handles the entire RFP evaluation lifecycle — from uploading the RFP document to exporting a signed-off comparison report. The AI does the heavy reading; humans do the thinking.
Step 1: Upload and Extraction
The RFP manager uploads the RFP document (PDF, DOCX, or Excel — sometimes all three). The system extracts the full text, splits it into chunks that preserve section boundaries and page numbers, and immediately starts two parallel jobs:
Criteria extraction — GPT-4 reads the entire RFP and pulls out every eligibility criterion it can find. Legal requirements, financial thresholds, technical certifications, experience minimums, geographic restrictions. Each one gets classified as mandatory or optional. The first time we ran this, it found 14 criteria that the team's manual checklist had missed. Three of them were mandatory.
Indexing — Every chunk of text gets embedded (converted to a numerical vector) and indexed for search. This powers the chat feature later and the scoring engine's ability to find relevant evidence in proposals.
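A minimal sketch of section-aware chunking, assuming the extracted text arrives as one string per page and that headings are numbered ("3.2 Eligibility"). The names `Chunk`, `chunk_pages`, the `HEADING` pattern, and the 1,200-character limit are illustrative assumptions, not the platform's actual code.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    section: str  # nearest heading above the text
    page: int     # 1-based page on which the chunk starts


# Assumed heading style: numbered headings such as "3.2 Eligibility".
HEADING = re.compile(r"^\d+(\.\d+)*\s+\S")


def chunk_pages(pages: list[str], max_chars: int = 1200) -> list[Chunk]:
    """Split page texts into chunks that never cross a section heading."""
    chunks: list[Chunk] = []
    section = "Preamble"
    buf: list[str] = []
    buf_page = 1

    def flush() -> None:
        # Emit the buffered lines as one chunk tagged with section and start page.
        if buf:
            chunks.append(Chunk("\n".join(buf), section, buf_page))
            buf.clear()

    for page_no, page_text in enumerate(pages, start=1):
        for raw in page_text.splitlines():
            line = raw.strip()
            if not line:
                continue
            if HEADING.match(line):
                flush()  # close the previous section before switching
                section = line
                continue
            if sum(len(b) + 1 for b in buf) + len(line) > max_chars:
                flush()  # size limit reached inside a section
            if not buf:
                buf_page = page_no  # remember where this chunk begins
            buf.append(line)
    flush()
    return chunks
```

Keeping the heading and start page on every chunk is what lets later steps cite "which page, which section" without going back to the source file.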
Step 2: Vendor Proposals Go In
Vendors submit proposals — sometimes a single PDF, sometimes a zip file with 15 documents inside. The system extracts text from everything, chunks it the same way, generates embeddings, and indexes it all. Each vendor's content is kept separate but searchable.
Step 3: Eligibility Testing
This is where the AI earns its keep. For each vendor, the system takes every extracted criterion and asks GPT-4: "Does this vendor's proposal demonstrate that they meet this requirement? Show your reasoning."
It's not a simple keyword match. The AI reads the relevant sections, understands context, and gives a pass/fail with a paragraph of reasoning. "The vendor claims ISO 27001 certification on page 47, but the certificate provided in Annexure D expired in March 2024. Marked as FAIL — certification is not current."
If any mandatory criterion fails, the vendor is flagged as ineligible. But — and this is important — any panel member can override that decision with a written justification. The system tracks the override, who did it, and why. Because sometimes the AI is wrong, and sometimes the real world is more nuanced than a binary pass/fail.
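The eligibility rule and the override trail can be sketched roughly as follows. This is an illustration under assumptions: the class and field names (`CriterionResult`, `Override`, `VendorEligibility`) are hypothetical, and the real system stores these records in its database rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CriterionResult:
    criterion: str
    mandatory: bool
    passed: bool    # the AI's pass/fail verdict
    reasoning: str  # the AI's paragraph of reasoning


@dataclass
class Override:
    by: str
    justification: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class VendorEligibility:
    vendor: str
    results: list[CriterionResult]
    override: Optional[Override] = None

    @property
    def ai_eligible(self) -> bool:
        # Only mandatory criteria can make a vendor ineligible.
        return all(r.passed for r in self.results if r.mandatory)

    @property
    def eligible(self) -> bool:
        # A recorded override wins, but the AI verdict is kept for audit.
        return self.ai_eligible or self.override is not None

    def apply_override(self, by: str, justification: str) -> None:
        if not justification.strip():
            raise ValueError("an override needs a written justification")
        self.override = Override(by, justification)
```

The AI verdict and the human decision live in separate fields, so an override never erases what the model originally concluded.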
During the pilot, the AI correctly identified eligibility issues in about 85% of cases. The other 15% were either edge cases (vendor met the spirit but not the letter of the requirement) or the AI misread a scanned document. Not perfect, but dramatically better than the previous process where roughly 1 in 5 criteria checks had errors.
Step 4: The 11-Section Deep Dive
For each eligible vendor, the system generates a detailed qualitative analysis across 11 standardized sections:
- Missing Documents Analysis
- Proposal Overview
- Key Commitments & Guarantees
- Solution Summary
- Technical Approach
- Implementation & Timeline
- Financial Analysis
- Resource & Team Structure
- Risk Assessment
- Compliance & Security
- Final Recommendation
Each section is a markdown document — typically 500–1500 words — written by GPT-4 after reading the vendor's full proposal against the RFP requirements. Panel members can't edit the AI's analysis (it stays read-only as a reference), but they can add their own "Due Diligence Notes" alongside each section. If the AI missed something or got something wrong, the panel member's notes capture that.
If a section needs rework, evaluators can request a refinement with specific feedback. The system creates a new version (v1 → v2) — the old one stays for audit trail.
Step 5: Quantitative Scoring
The system scores each vendor against the top weighted evaluation criteria (usually 6–8). This isn't just the AI guessing a number. For each criterion:
- It generates a search query from the criterion
- Retrieves the most relevant chunks from the vendor's proposal using hybrid search (keyword + semantic)
- Builds a context window from those chunks
- Sends everything to GPT-4: "Score this vendor 0–100 on this criterion. Explain your reasoning. Rate your confidence."
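To illustrate the idea behind hybrid retrieval: a keyword ranking and a semantic ranking are merged into one list. One common merging method is reciprocal rank fusion (RRF), sketched below. In the platform this merging happens inside the search service, so the function name, the chunk IDs, and `k = 60` are assumptions for illustration only.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    so chunks ranked well by both keyword and vector search rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


keyword_hits = ["c1", "c2", "c3"]  # e.g. BM25 order
vector_hits = ["c3", "c1", "c4"]   # e.g. embedding-similarity order
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Here `c1` and `c3` appear in both lists, so they outrank `c2` and `c4`, which only one retriever found.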
Scores are weighted and aggregated into an overall score. Panel members can override any individual score — the system tracks the original AI score, the override, and the justification.
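A rough sketch of the aggregation, assuming a plain weighted average in which a panel override replaces the AI score for that criterion. `CriterionScore` and `overall_score` are hypothetical names, and the weights and scores in the usage note are made-up values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CriterionScore:
    criterion: str
    weight: float
    ai_score: float                       # 0-100, as returned by the model
    override_score: Optional[float] = None
    justification: Optional[str] = None   # required in practice when overriding

    @property
    def effective(self) -> float:
        # The override wins when present; the AI score stays on record.
        return self.ai_score if self.override_score is None else self.override_score


def overall_score(scores: list[CriterionScore]) -> float:
    """Weighted average of effective scores, normalised by total weight."""
    total_weight = sum(s.weight for s in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s.weight * s.effective for s in scores) / total_weight
```

With weights 0.5, 0.3, and 0.2 and AI scores 80, 70, and 60, the overall score is 73. Overriding the last score to 40 moves it to 69 while the original 60 remains stored.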
One thing we learned during development: the AI tends to be generous. Left uncalibrated, most vendors clustered between 65 and 80. We reworked the scoring prompts to force more differentiation and added confidence levels so evaluators know when the AI is guessing and when it has strong evidence.
Step 6: Compare and Export
The platform shows all vendors ranked by overall score with side-by-side comparisons. An AI-generated comparison analysis highlights where vendors diverge — "Vendor A's implementation timeline is 6 months shorter, but Vendor C offers 24/7 support while A offers business hours only."
The final report exports as a DOCX — scores, metrics, analysis, panel member notes, overrides, everything. Panel members acknowledge the report with a digital sign-off. Full audit trail from upload to final recommendation.
The Chat Feature
At any point, evaluators can ask the system questions about any document. "What does Vendor B say about disaster recovery?" or "Compare the financial proposals of Vendor A and C." It's RAG-powered — retrieves relevant chunks from Azure AI Search, sends them to GPT-4 with the question, and returns an answer with citations (which document, which page, which section).
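The prompt-assembly half of that flow might look roughly like this. It is a sketch under assumptions: retrieval and the model call are left out, `RetrievedChunk` and `build_rag_prompt` are hypothetical names, and the character budget stands in for real token counting.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    document: str
    page: int
    section: str
    text: str


def build_rag_prompt(
    question: str, chunks: list[RetrievedChunk], max_chars: int = 6000
) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    parts: list[str] = []
    used = 0
    for i, c in enumerate(chunks, start=1):
        block = f"[{i}] {c.document}, p. {c.page}, {c.section}\n{c.text}"
        if used + len(block) > max_chars:
            break  # stop once the context budget is spent
        parts.append(block)
        used += len(block)
    sources = "\n\n".join(parts)
    return (
        "Answer using only the numbered sources below and cite them as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

Because every source carries its document, page, and section, a `[2]` in the answer can be mapped straight back to a location an evaluator can open.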
This turned out to be one of the most-used features. Evaluators used it not for formal scoring but for quick sanity checks. "Did anyone mention cloud-native deployment?" — faster than Ctrl+F across 12 PDFs.
Figure: The full pipeline. Documents go in at the top, a defensible vendor recommendation comes out at the bottom. The chat feature sits alongside the whole process for ad-hoc queries.
Technical stack (for the engineering-minded)
- Backend: FastAPI (Python 3.11+), async SQLAlchemy 2.0, Alembic for migrations. Background jobs with asyncio; all AI operations are non-blocking (return job ID, frontend polls for status).
- Frontend: React 18 + TypeScript + Vite. MUI for components, Zustand for state, React Query for server state, TipTap for rich text editing of panel notes.
- LLM: Azure OpenAI GPT-4/GPT-4o for criteria extraction, eligibility testing, section analysis, scoring, and comparison generation. Structured prompts with chain-of-thought reasoning for scoring. Few-shot examples for eligibility classification.
- Embeddings: text-embedding-ada-002 (1536 dimensions), batched generation with rate limit handling and exponential backoff.
- Search: Azure AI Search with hybrid retrieval (vector + BM25 keyword). Project-scoped indices. Top-K retrieval with metadata filtering.
- Database: Azure SQL Server, 30+ tables covering projects, documents, vendors, criteria, evaluations, panel reviews, score overrides, and audit logs. Soft deletes everywhere.
- Document processing: python-docx, PyPDF, PyMuPDF, openpyxl for text extraction. Custom chunking that preserves section headings, page numbers, and table structures.
- Auth: Azure AD SSO via MSAL + custom JWT. Three roles: Admin, RFP Manager, Panel Member. Project-scoped access control.
- Infrastructure: Docker + Azure DevOps CI/CD. Azure Blob Storage for documents, Azure Container Registry for images.
- Cost controls: tiktoken for token counting before every API call. Max 3 concurrent vendor evaluations to stay within rate limits. Batch embedding with 0.5s delays.
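The concurrency cap and the backoff mentioned in the stack can be sketched with plain asyncio. This is an illustration, not the production code: `RateLimited` stands in for whatever error the LLM client raises, `evaluate` is any coroutine function supplied by the caller, and the retry count and base delay are assumed values.

```python
import asyncio
import random

MAX_CONCURRENT_EVALUATIONS = 3  # the limit quoted in the stack notes


class RateLimited(Exception):
    """Stand-in for the rate-limit error raised by an LLM client."""


async def with_backoff(call, retries: int = 5, base_delay: float = 0.5):
    """Await call(), retrying on RateLimited with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return await call()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            delay = base_delay * 2 ** attempt
            # Small random jitter so parallel jobs do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def evaluate_all(
    vendors, evaluate, limit: int = MAX_CONCURRENT_EVALUATIONS, base_delay: float = 0.5
):
    """Evaluate every vendor, never running more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(vendor):
        async with semaphore:
            return await with_backoff(lambda: evaluate(vendor), base_delay=base_delay)

    # gather() returns results in the same order as the input vendors.
    return await asyncio.gather(*(guarded(v) for v in vendors))
```

A semaphore keeps the cap in one place, so raising or lowering the limit for a different rate-limit tier is a one-line change.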
How We Rolled It Out
Months 1–2: Understanding the mess. We sat with their procurement team for two weeks. Watched them evaluate an actual RFP. The process was more chaotic than anyone had admitted — different evaluators used different scoring rubrics, criteria checklists were copy-pasted from previous projects and never updated, and one panel member was tracking everything in a personal Excel sheet that nobody else had access to.
We mapped the entire workflow, identified where AI could replace manual reading vs. where human judgment was non-negotiable, and agreed on the 11-section evaluation framework.
Months 3–5: Building the core. Document processing pipeline, criteria extraction, eligibility testing, and the scoring engine. The hardest part wasn't the AI — it was document extraction. Government RFPs come in every format imaginable. Scanned PDFs with no OCR. Excel sheets where requirements are spread across 14 tabs. DOCX files with track changes still on. We spent almost a month just making the extraction pipeline robust enough to handle real-world documents.
Month 6: Pilot with a live RFP. Used the system alongside the manual process for one actual RFP — 8 vendors, 340-page RFP. The AI extracted 47 eligibility criteria; the manual checklist had 39. The team found the AI's 11-section analysis genuinely useful but pushed back on the scoring — "it doesn't know how to read between the lines of a vendor who's overselling." Fair criticism. We added confidence levels and adjusted the prompts to be less generous.
Months 7–8: Iteration and rollout. Incorporated feedback, added the chat/RAG feature (the team requested it — they wanted to ask questions without reading 300 pages), built the report export with panel sign-off, and rolled it out for all new RFP cycles.
What Changed
Six months after full rollout, across 9 completed RFP evaluation cycles:
| What we measured | Before | After | Change |
|---|---|---|---|
| Time to evaluate one vendor | ~3 weeks | 1–2 days | ~90% faster |
| Eligibility criteria caught | Missed ~15% | Missed ~3% | Much fewer gaps |
| Scoring consistency across evaluators | Varied widely | Anchored by AI baseline | Subjective → structured |
| Evaluators needed per RFP cycle | 8–10 | 3–4 | More than halved |
| Report turnaround | 2–3 weeks after evaluation | Same day | Instant |
| Audit trail | Emails + spreadsheets | Full system log | Actually defensible |
| Cost per evaluation cycle | ~$45K (people time) | ~$12K (people + AI costs) | Roughly 70% less |
The biggest shift wasn't in the numbers — it was in how the team worked. Before, evaluators spent 80% of their time reading and 20% thinking. Now it's flipped. The AI does the reading; the team focuses on judgment calls, edge cases, and the stuff that actually requires human experience.
The panel override feature gets used more than we expected — about 12% of AI scores get adjusted. That's healthy. It means the team is engaging with the AI's output, not blindly accepting it.
One thing that didn't go as planned: the chat feature was supposed to be a nice-to-have. It became the first thing evaluators open every morning. They use it to prep before meetings — "summarize Vendor D's approach to data migration in two paragraphs." We didn't anticipate that usage pattern at all.
"The part I didn't expect to matter was the criteria extraction. We'd been using the same checklist template for years, just editing it for each RFP. The first time the AI ran, it found requirements we'd been missing consistently. That was the moment the team stopped being skeptical." — Head of Procurement (name withheld at client's request)
What's Next
The system works well for their standard RFP cycle, but there are gaps they want to close:
- Scanned documents: The extraction pipeline still struggles with older scanned PDFs that have no text layer. We're integrating Azure Document Intelligence (OCR) for these, but accuracy on low-quality scans is hit-or-miss.
- Multi-language RFPs: They're starting to bid on Middle East contracts where RFPs come partly in Arabic. The chunking and extraction pipeline needs to handle mixed-language documents without breaking section detection.
- Historical learning: Right now each RFP project starts fresh. They want the system to learn from past evaluations: "last time we scored Vendor X's security approach as weak, flag if they submit similar language again." We're exploring this, but carefully, because there's a fine line between useful memory and introducing bias.
- Scoring calibration per industry: IT infrastructure RFPs need different scoring sensitivity than software development RFPs. The prompts are currently one-size-fits-all. Custom prompt profiles per RFP category are on the roadmap.
Built by GammaEdge. If your team is drowning in document-heavy evaluation processes — RFPs, compliance audits, vendor assessments — we should talk.