Recently I've been working on a side quest to help legal researchers generate qualitative reports based on research questions they have across 500 court petitions.
Some of the questions looked like:
If a human researcher were to try and answer these questions, they would have to read through all the court petitions, look for specific keywords or phrases that offer insights, jot down notes, and finally present the answer in the form of a report.
This would take a human researcher about a week to get done. A deep research agent could cut that to roughly 15 minutes, about 99.85% less time.
As a result, a fully functional deep research agent was a compelling tool for the job.
Around this time OpenAI released their Deep Research API and I spent the past couple of weeks setting it up and getting it to work.
The Deep Research agent gives you two ways to browse external information: you can either use a remote MCP server or use OpenAI's own web search tool.
I had already set up a database with the 500 court petitions, so providing access to that knowledge base through an MCP server made the most sense.
To integrate an MCP server with the deep research agent, the server must provide two specific tools:

- `search`: a tool that takes a query and returns matching results
- `fetch`: a tool that takes an id and returns the full document

Based on these tool definitions, the deep research agent basically works this way: it issues its own search queries, scans the results that come back, and calls `fetch` on promising ids to read those documents in full before deciding where to dig next.
The `fetch` tool is pretty straightforward to implement: you look up the `id` and return the whole document.
The `search` tool, on the other hand, requires more thought and care to implement, since it directly affects the quality of the "rabbit holes" the agent goes down.
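For concreteness, here's a minimal sketch of that surface using the FastMCP helper from the official MCP Python SDK; the server name and transport are placeholders, and the tool bodies are filled in below.

```python
# Minimal MCP surface for deep research: two tools, `search` and `fetch`.
# Sketch only; the real retrieval logic is plugged in later.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("legal-research-mcp")  # server name is illustrative

@mcp.tool()
def search(query: str) -> list[dict]:
    """Search the court petition corpus and return matching results."""
    raise NotImplementedError  # returns a list of {"id", "title", "text", "url"}

@mcp.tool()
def fetch(id: str) -> dict:
    """Return the full document for a given case id."""
    raise NotImplementedError  # returns a single {"id", "title", "text", "url"}

if __name__ == "__main__":
    # Deep research needs a remotely reachable server, so serve over SSE/HTTP.
    mcp.run(transport="sse")
```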
The `search` tool was a little tricky to get right, since I essentially had to build my own search engine tuned for the agent.
Initially, I set up a search function that converted the query into a pike query and then queried Supabase with it. This method had one major drawback: the queries generated by the deep research model weren't converted into pike queries effectively, and even when they were, the results weren't always relevant and the search returned very few hits.
Since I'm building a legal research agent, I decided it would be better to use similarity search to retrieve data relevant to the model-generated query. For my use case, I uploaded all the relevant case data to an OpenAI vector store; the `search` tool can then use the vector store id to retrieve the passages most relevant to the query.
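As a rough sketch of what that `search` implementation can look like, assuming the case files already live in an OpenAI vector store (the store id, result count, and URL pattern below are placeholders):

```python
# Similarity search over an OpenAI vector store; this becomes the body of the
# MCP `search` tool. VECTOR_STORE_ID and max_num_results are illustrative.
from openai import OpenAI

client = OpenAI()
VECTOR_STORE_ID = "vs_XXXX"  # vector store holding the uploaded case data

def search(query: str) -> list[dict]:
    """Retrieve the case chunks most relevant to a model-generated query."""
    results = client.vector_stores.search(
        vector_store_id=VECTOR_STORE_ID,
        query=query,
        max_num_results=10,
    )
    hits = []
    for hit in results:
        hits.append({
            "id": hit.file_id,        # later passed to fetch by the agent
            "title": hit.filename,
            "text": " ".join(part.text for part in hit.content if part.type == "text"),
            "url": f"https://example.com/cases/{hit.file_id}",  # placeholder URL
        })
    return hits
```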
The `fetch` tool takes an id and returns all the case-related data by querying Supabase.
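A sketch of the `fetch` side, assuming a Supabase table called cases (the table and column names are mine):

```python
# Fetch tool: look up the complete case record in Supabase by id.
# Table name "cases" and column names are assumptions for illustration.
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def fetch(id: str) -> dict:
    """Return the full case document for the given id."""
    row = supabase.table("cases").select("*").eq("id", id).single().execute().data
    return {
        "id": row["id"],
        "title": row["title"],
        "text": row["full_text"],  # the entire petition / case history
        "url": row["url"],
    }
```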
It's important to note the format in which the `search` and `fetch` tools return their responses. Returning the wrong format results in empty responses to the queries the deep research agent asks.
The `search` tool should return an array of objects, where each object has an id, title, text, and url. The `fetch` tool should return a single object with the same properties.
Lastly, the approval mode for the MCP tools should be set to never: since both `search` and `fetch` are read-only, human-in-the-loop reviews add little value, and they're currently unsupported for deep research anyway.
With the scaffolding set up, we can run the deep research agent to generate a detailed report from our private legal knowledge source. You can test this through either the API or the Playground (if you're using the Playground, add your custom MCP server to the list of tools).
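Via the API, the call looks roughly like this; the server URL, prompts, and research question are placeholders, and background mode is optional but convenient given how long runs take:

```python
# Kick off a deep research run against the custom MCP server (sketch).
import time
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o3-deep-research",
    background=True,  # runs take on the order of 10-15 minutes
    input=[
        {"role": "developer", "content":
         "You are a legal research assistant. Answer using only the court "
         "petitions available through the MCP tools, and cite the cases you rely on."},
        {"role": "user", "content": "How do the petitions argue ...?"},  # placeholder question
    ],
    tools=[{
        "type": "mcp",
        "server_label": "court_petitions",
        "server_url": "https://my-mcp-server.example.com/sse",  # placeholder URL
        "require_approval": "never",  # approvals are unsupported for deep research
    }],
)

# Poll the background run until it finishes, then read the report.
while response.status in ("queued", "in_progress"):
    time.sleep(30)
    response = client.responses.retrieve(response.id)
print(response.output_text)
```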
While running the query, you'll notice that, unlike deep research in ChatGPT, no follow-up questions are asked before the research starts. That's because ChatGPT adds an extra abstraction layer: a smaller model (like GPT-4.1) interprets the initial user query, asks the user clarifying questions, and then builds a structured query that gets passed on to the deep research agent.
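If you want something similar outside ChatGPT, you can put a small rewriting step in front of the agent yourself; a hedged sketch that skips the interactive clarification and just expands the query (the prompt wording is mine):

```python
# Optional pre-processing layer, loosely mimicking ChatGPT's behavior:
# a smaller model expands the raw user question into a structured research brief.
from openai import OpenAI

client = OpenAI()

def build_research_brief(user_query: str) -> str:
    """Turn a terse question into a detailed brief for the deep research agent."""
    rewrite = client.responses.create(
        model="gpt-4.1",
        input=[
            {"role": "developer", "content":
             "Rewrite the user's legal research question into a detailed, structured "
             "research brief: scope, the specific aspects of the petitions to examine, "
             "and the expected report format. Do not answer the question yourself."},
            {"role": "user", "content": user_query},
        ],
    )
    return rewrite.output_text
```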
Update the system prompt and enter the user query. It takes a good amount of time, around 10-15 minutes, for the model to return a detailed report.
The report generated by the model was very detailed and covered aspects I didn't expect it to cover. But the model only considered a small slice of the corpus (~20 cases) instead of all the relevant data (~200 cases). I realized the responses from `search` were a little too extensive and too heavy, causing the context window to fill up quickly.
There's no reason to return the entire case history and details for every query deep research sends to the `search` tool. Instead, taking inspiration from ChatGPT's deep research, I added a middleware agent (using gpt-4o-mini) whose job is to generate a summarized response with respect to the query the model asked. This way the `search` tool returns only what's relevant to the query, not the entire case data.
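A sketch of that middleware step, slotted between the vector store search and the response returned over MCP (the prompt and length cap are assumptions):

```python
# gpt-4o-mini "middleware": condense each search hit with respect to the
# agent's query before it goes back over MCP.
from openai import OpenAI

client = OpenAI()

def summarize_hit(query: str, case_text: str) -> str:
    """Return only the parts of a case that matter for this particular query."""
    summary = client.responses.create(
        model="gpt-4o-mini",
        input=[
            {"role": "developer", "content":
             "Summarize the following court petition excerpt strictly with respect to "
             "the research query. Keep it under roughly 200 words and preserve any "
             "case names, dates, and section numbers that are relevant."},
            {"role": "user", "content": f"Query: {query}\n\nExcerpt:\n{case_text}"},
        ],
    )
    return summary.output_text

# Inside the search tool, each hit's text becomes a query-focused summary:
#   hit["text"] = summarize_hit(query, hit["text"])
```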
On subsequent runs of the deep research agent, it considered significantly more cases (~50) than before. But I wanted it to go through all the relevant cases before writing a report.
I decided to take a MapReduce-style approach: multiple deep research agents each go through a specific slice of the cases (~50 each), and then a master reasoning agent (o3) takes the reports generated by the sub-agents and produces a consolidated, detailed report.
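The reduce step is then a single o3 call over the sub-agents' reports; a sketch, with the consolidation prompt being my own wording:

```python
# "Reduce" step: a master o3 agent merges the per-partition reports.
from openai import OpenAI

client = OpenAI()

def consolidate(question: str, sub_reports: list[str]) -> str:
    """Merge partial reports into one detailed, deduplicated report."""
    joined = "\n\n---\n\n".join(
        f"Sub-report {i + 1}:\n{report}" for i, report in enumerate(sub_reports)
    )
    final = client.responses.create(
        model="o3",
        input=[
            {"role": "developer", "content":
             "You are consolidating partial legal research reports that each cover a "
             "different slice of the corpus. Merge them into a single detailed report, "
             "removing duplicates and keeping every cited case."},
            {"role": "user", "content": f"Research question: {question}\n\n{joined}"},
        ],
    )
    return final.output_text
```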
I had to make some changes to the MCP server. I split the data into 5 different vector stores, each with its own id, and passed that id as a custom header so the `search` tool knows which vector store to query. This way, when a deep research sub-agent sends a query to the `search` tool, it queries the correct vector store.
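The map side then becomes one deep research run per vector store, with the store id passed as a custom header on the MCP tool; the header name X-Vector-Store-Id is my own, and the server's `search` tool is assumed to read it and route the query to that store:

```python
# "Map" step: one deep research run per vector store partition (sketch).
from openai import OpenAI

client = OpenAI()
VECTOR_STORE_IDS = ["vs_part1", "vs_part2", "vs_part3", "vs_part4", "vs_part5"]  # placeholders

def run_partition(question: str, vector_store_id: str):
    """Start a background deep research run scoped to a single vector store."""
    return client.responses.create(
        model="o3-deep-research",
        background=True,
        input=question,
        tools=[{
            "type": "mcp",
            "server_label": "court_petitions",
            "server_url": "https://my-mcp-server.example.com/sse",  # placeholder URL
            "headers": {"X-Vector-Store-Id": vector_store_id},      # routes the search tool
            "require_approval": "never",
        }],
    )

# Fan out over all five partitions, poll each run to completion, and hand the
# resulting reports to consolidate() from the previous sketch.
runs = [run_partition("How do the petitions argue ...?", vs) for vs in VECTOR_STORE_IDS]
```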
Following this approach increased the total number of cases considered (~110 cases). I'm still thinking about raising the number of tool calls the model can make, to push more input tokens, and therefore more cases, into each run. I need to test this hypothesis to confirm.