Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions
I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite pleased by the initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates them as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin:
The experiment followed the model usage recommendations from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find further evaluation details here.
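Concretely, these recommendations translate into request settings like the following sketch. The model identifier and task text are placeholders, not the exact configuration used in the experiment:

```python
# Sampling settings following the DeepSeek-R1 usage recommendations:
# zero-shot (no few-shot examples), no system prompt, temperature 0.5-0.7.
request = {
    "model": "deepseek-reasoner",  # placeholder model identifier
    "messages": [
        # a single user message; deliberately no system prompt
        {"role": "user", "content": "Task: ..."},
    ],
    "temperature": 0.6,  # midpoint of the recommended 0.5-0.7 range
}
```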
Approach
DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By allowing the model to generate actions as Python code, it can flexibly interact with environments through code execution.
Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
Results from executing these actions feed back to the model as follow-up messages, driving subsequent actions until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
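The loop described above can be sketched as follows. This is a simplified illustration, not the actual agent framework: the `search` tool is a stub, and the `<execute>` delimiter for code actions is an assumed convention of this sketch:

```python
import io
import re
from contextlib import redirect_stdout

# A "tool" is just Python source included verbatim in the prompt.
# `search` is a stub for illustration only.
TOOL_SOURCE = '''
def search(query: str) -> str:
    """Return search results for a query (stubbed out here)."""
    return f"results for: {query}"
'''

def run_code_action(code: str) -> str:
    """Execute a model-generated code action and capture its printed output."""
    namespace = {}
    exec(TOOL_SOURCE, namespace)   # make the tools callable
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        exec(code, namespace)      # run the model's code action
    return buffer.getvalue()

def agent_loop(model, task: str, max_steps: int = 5) -> str:
    """Iterative loop: the model emits code actions, execution results feed back."""
    messages = [{"role": "user", "content": f"{TOOL_SOURCE}\n\nTask: {task}"}]
    for _ in range(max_steps):
        reply = model(messages)
        match = re.search(r"<execute>(.*?)</execute>", reply, re.DOTALL)
        if match is None:          # no code action means a final answer
            return reply
        observation = run_code_action(match.group(1))
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": f"Execution result:\n{observation}"})
    return reply
```

With a scripted stand-in for the model, the loop executes one code action, feeds the result back, and returns the model's final answer on the next turn.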
Conversations
DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by using a search engine or fetching data from web pages. This drives the conversation with the environment, which continues until a final answer is reached.
In contrast, o1 models are known to perform poorly when used as chat models, i.e. they don't attempt to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.
Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.
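The difference between the two approaches comes down to how messages are constructed at each step. The sketch below contrasts them; the helper names and message formatting are assumptions for illustration, not the agent's actual implementation:

```python
# Single-prompt approach: prior actions and results are re-packed
# into one large user message at every step.
def single_prompt_messages(task, history):
    """history: list of (code_action, execution_result) pairs from earlier steps."""
    context = "\n\n".join(
        f"Action:\n{code}\nResult:\n{result}" for code, result in history
    )
    return [{"role": "user", "content": f"Task: {task}\n\n{context}"}]

# Conversational approach: each execution result is appended as a
# follow-up user message, preserving the alternating chat structure.
def conversational_messages(task, history):
    messages = [{"role": "user", "content": f"Task: {task}"}]
    for code, result in history:
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user", "content": f"Execution result:\n{result}"})
    return messages
```

The conversational variant lets the model see its own earlier turns as assistant messages rather than as flattened text, which is the setup that reached 65.6% here.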
This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that lacked tool use capabilities? After all, isn't tool use support an important mechanism for enabling models to pull additional context from their environment? This conversational approach certainly appears effective for DeepSeek-R1, though I still need to conduct similar experiments with o1 models.
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.
Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, whether via code actions or not, could be one option to improve efficiency.
Underthinking
I also observed the underthinking phenomenon with DeepSeek-R1. This occurs when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This was a major cause of the overly long reasoning traces produced by DeepSeek-R1. It can be seen in the recorded traces that are available for download.
Future experiments
Another common application of reasoning models is to use them for planning only, while using other models for generating code actions. This could be a potential new feature of freeact, if this separation of roles proves beneficial for more complex tasks.
I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.