Recently I've been working on a side quest to help legal researchers generate qualitative reports based on research questions they have across 500 court petitions.
Some of the questions looked like:
If a human researcher were to try and answer these questions, they would have to read through all the court petitions, look for specific keywords or phrases that offer insights, jot down notes, and finally present the answer in the form of a report.
This would take a human researcher about a week to get done. A deep research agent could cut that to roughly 15 minutes, about 99.85% less time.
As a result, a fully functional deep research agent was a compelling tool for the job.
Around this time OpenAI released their Deep Research API and I spent the past couple of weeks setting it up and getting it to work.
The Deep Research agent gives you two ways to browse external information: you can either use a remote MCP server or use OpenAI's own web search tool.
I had already set up a database with the 500 court petitions, so providing access to that knowledge base through an MCP server made the most sense.
To integrate an MCP server with the deep research agent, the server must provide two specific tools:

- `search`: a tool that takes a query and returns matching results
- `fetch`: a tool that takes an id and returns the full document

Based on these tool definitions, the deep research agent basically works this way: it issues its own search queries, scans the results that come back, and calls `fetch` on promising ids to read those documents in full before deciding where to dig next.
The `fetch` tool is pretty straightforward to implement: you look up the `id` and return the whole document.
The `search` tool, on the other hand, requires more thought and care to implement, since it directly affects the quality of the "rabbit holes" the agent goes down.
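For concreteness, here's a minimal sketch of that surface using the FastMCP helper from the official MCP Python SDK; the server name and transport are placeholders, and the tool bodies are filled in below.

```python
# Minimal MCP surface for deep research: two tools, `search` and `fetch`.
# Sketch only; the real retrieval logic is plugged in later.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("legal-research-mcp")  # server name is illustrative

@mcp.tool()
def search(query: str) -> list[dict]:
    """Search the court petition corpus and return matching results."""
    raise NotImplementedError  # returns a list of {"id", "title", "text", "url"}

@mcp.tool()
def fetch(id: str) -> dict:
    """Return the full document for a given case id."""
    raise NotImplementedError  # returns a single {"id", "title", "text", "url"}

if __name__ == "__main__":
    # Deep research needs a remotely reachable server, so serve over SSE/HTTP.
    mcp.run(transport="sse")
```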
The `search` tool was a little tricky to get right, since I essentially had to build my own search engine tuned for the agent.
Initially, I set up a search function that converted the query into a pike query and then queried Supabase with it. This method had one major drawback: the queries generated by the deep research model weren't converted into pike queries effectively, and even when they were, the results weren't always relevant and the search returned very few hits.
Since I'm building a legal research agent, I decided it would be better to use similarity search to retrieve data relevant to the model-generated query. For my use case, I uploaded all the relevant case data to an OpenAI vector store; the `search` tool can then use the vector store id to retrieve the passages most relevant to the query.
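As a rough sketch of what that `search` implementation can look like, assuming the case files already live in an OpenAI vector store (the store id, result count, and URL pattern below are placeholders):

```python
# Similarity search over an OpenAI vector store; this becomes the body of the
# MCP `search` tool. VECTOR_STORE_ID and max_num_results are illustrative.
from openai import OpenAI

client = OpenAI()
VECTOR_STORE_ID = "vs_XXXX"  # vector store holding the uploaded case data

def search(query: str) -> list[dict]:
    """Retrieve the case chunks most relevant to a model-generated query."""
    results = client.vector_stores.search(
        vector_store_id=VECTOR_STORE_ID,
        query=query,
        max_num_results=10,
    )
    hits = []
    for hit in results:
        hits.append({
            "id": hit.file_id,        # later passed to fetch by the agent
            "title": hit.filename,
            "text": " ".join(part.text for part in hit.content if part.type == "text"),
            "url": f"https://example.com/cases/{hit.file_id}",  # placeholder URL
        })
    return hits
```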
The `fetch` tool takes an id and returns all the case-related data by querying Supabase.
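A sketch of the `fetch` side, assuming a Supabase table called cases (the table and column names are mine):

```python
# Fetch tool: look up the complete case record in Supabase by id.
# Table name "cases" and column names are assumptions for illustration.
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def fetch(id: str) -> dict:
    """Return the full case document for the given id."""
    row = supabase.table("cases").select("*").eq("id", id).single().execute().data
    return {
        "id": row["id"],
        "title": row["title"],
        "text": row["full_text"],  # the entire petition / case history
        "url": row["url"],
    }
```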
It's important to note the format in which the `search` and `fetch` tools return their responses. Returning the wrong format results in empty responses to the queries the deep research agent asks.
The `search` tool should return an array of objects, where each object has an id, title, text, and url. The `fetch` tool should return a single object with the same properties.
Lastly, the approval mode for the MCP tools should be set to never: since both `search` and `fetch` are read-only, human-in-the-loop reviews add little value, and they're currently unsupported for deep research anyway.
With the scaffolding set up, we can run the deep research agent to generate a detailed report from our private legal knowledge source. You can test this through either the API or the Playground (if you're using the Playground, add your custom MCP server to the list of tools).
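Via the API, the call looks roughly like this; the server URL, prompts, and research question are placeholders, and background mode is optional but convenient given how long runs take:

```python
# Kick off a deep research run against the custom MCP server (sketch).
import time
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o3-deep-research",
    background=True,  # runs take on the order of 10-15 minutes
    input=[
        {"role": "developer", "content":
         "You are a legal research assistant. Answer using only the court "
         "petitions available through the MCP tools, and cite the cases you rely on."},
        {"role": "user", "content": "How do the petitions argue ...?"},  # placeholder question
    ],
    tools=[{
        "type": "mcp",
        "server_label": "court_petitions",
        "server_url": "https://my-mcp-server.example.com/sse",  # placeholder URL
        "require_approval": "never",  # approvals are unsupported for deep research
    }],
)

# Poll the background run until it finishes, then read the report.
while response.status in ("queued", "in_progress"):
    time.sleep(30)
    response = client.responses.retrieve(response.id)
print(response.output_text)
```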
While running the query, you'll notice that, unlike deep research in ChatGPT, no follow-up questions are asked before the research starts. That's because ChatGPT adds an extra abstraction layer: a smaller model (like GPT-4.1) interprets the initial user query, asks the user clarifying questions, and then builds a structured query that gets passed on to the deep research agent.
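If you want something similar outside ChatGPT, you can put a small rewriting step in front of the agent yourself; a hedged sketch that skips the interactive clarification and just expands the query (the prompt wording is mine):

```python
# Optional pre-processing layer, loosely mimicking ChatGPT's behavior:
# a smaller model expands the raw user question into a structured research brief.
from openai import OpenAI

client = OpenAI()

def build_research_brief(user_query: str) -> str:
    """Turn a terse question into a detailed brief for the deep research agent."""
    rewrite = client.responses.create(
        model="gpt-4.1",
        input=[
            {"role": "developer", "content":
             "Rewrite the user's legal research question into a detailed, structured "
             "research brief: scope, the specific aspects of the petitions to examine, "
             "and the expected report format. Do not answer the question yourself."},
            {"role": "user", "content": user_query},
        ],
    )
    return rewrite.output_text
```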
Update the system prompt and enter the user query. It takes a good amount of time, around 10-15 minutes, for the model to return a detailed report.
The report generated by the model was very detailed and covered aspects I didn't expect it to cover. But the model only considered a small slice of the corpus (~20 cases) instead of all the relevant data (~200 cases). I realized the responses from `search` were a little too extensive and too heavy, causing the context window to fill up quickly.
There's no reason to return the entire case history and details for every query deep research sends to the `search` tool. Instead, taking inspiration from ChatGPT's deep research, I added a middleware agent (using gpt-4o-mini) whose job is to generate a summarized response with respect to the query the model asked. This way the `search` tool returns only what's relevant to the query, not the entire case data.
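A sketch of that middleware step, slotted between the vector store search and the response returned over MCP (the prompt and length cap are assumptions):

```python
# gpt-4o-mini "middleware": condense each search hit with respect to the
# agent's query before it goes back over MCP.
from openai import OpenAI

client = OpenAI()

def summarize_hit(query: str, case_text: str) -> str:
    """Return only the parts of a case that matter for this particular query."""
    summary = client.responses.create(
        model="gpt-4o-mini",
        input=[
            {"role": "developer", "content":
             "Summarize the following court petition excerpt strictly with respect to "
             "the research query. Keep it under roughly 200 words and preserve any "
             "case names, dates, and section numbers that are relevant."},
            {"role": "user", "content": f"Query: {query}\n\nExcerpt:\n{case_text}"},
        ],
    )
    return summary.output_text

# Inside the search tool, each hit's text becomes a query-focused summary:
#   hit["text"] = summarize_hit(query, hit["text"])
```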
On subsequent runs of the deep research agent, it considered significantly more cases (~50) than before. But I wanted it to go through all the relevant cases before writing a report.
I decided to take a MapReduce-style approach: multiple deep research agents each go through a specific slice of the cases (~50 each), and then a master reasoning agent (o3) takes the reports generated by the sub-agents and produces a consolidated, detailed report.
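The reduce step is then a single o3 call over the sub-agents' reports; a sketch, with the consolidation prompt being my own wording:

```python
# "Reduce" step: a master o3 agent merges the per-partition reports.
from openai import OpenAI

client = OpenAI()

def consolidate(question: str, sub_reports: list[str]) -> str:
    """Merge partial reports into one detailed, deduplicated report."""
    joined = "\n\n---\n\n".join(
        f"Sub-report {i + 1}:\n{report}" for i, report in enumerate(sub_reports)
    )
    final = client.responses.create(
        model="o3",
        input=[
            {"role": "developer", "content":
             "You are consolidating partial legal research reports that each cover a "
             "different slice of the corpus. Merge them into a single detailed report, "
             "removing duplicates and keeping every cited case."},
            {"role": "user", "content": f"Research question: {question}\n\n{joined}"},
        ],
    )
    return final.output_text
```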
I had to make some changes to the MCP server. I split the data into 5 different vector stores, each with its own id, and passed that id as a custom header so the `search` tool knows which vector store to query. This way, when a deep research sub-agent sends a query to the `search` tool, it queries the correct vector store.
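The map side then becomes one deep research run per vector store, with the store id passed as a custom header on the MCP tool; the header name X-Vector-Store-Id is my own, and the server's `search` tool is assumed to read it and route the query to that store:

```python
# "Map" step: one deep research run per vector store partition (sketch).
from openai import OpenAI

client = OpenAI()
VECTOR_STORE_IDS = ["vs_part1", "vs_part2", "vs_part3", "vs_part4", "vs_part5"]  # placeholders

def run_partition(question: str, vector_store_id: str):
    """Start a background deep research run scoped to a single vector store."""
    return client.responses.create(
        model="o3-deep-research",
        background=True,
        input=question,
        tools=[{
            "type": "mcp",
            "server_label": "court_petitions",
            "server_url": "https://my-mcp-server.example.com/sse",  # placeholder URL
            "headers": {"X-Vector-Store-Id": vector_store_id},      # routes the search tool
            "require_approval": "never",
        }],
    )

# Fan out over all five partitions, poll each run to completion, and hand the
# resulting reports to consolidate() from the previous sketch.
runs = [run_partition("How do the petitions argue ...?", vs) for vs in VECTOR_STORE_IDS]
```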
Following this approach increased the total number of cases considered (~110 cases). I'm still thinking about raising the number of tool calls the model can make, to push more input tokens, and therefore more cases, into each run. I need to test this hypothesis to confirm.