
Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

From Speedrunwiki.com
Revision as of 14:19, 9 February 2025 by RomanPomeroy (talk | contribs)


I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite impressed by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates the actions as executable Python code. On a subset of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5 percentage points absolute, from 53.1% to 65.6% correct, and other models by an even larger margin:


The experiment followed model usage guidelines from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find further evaluation details here.
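These guidelines translate into a very simple request shape. The following is a minimal sketch of such a payload, assuming an OpenAI-compatible chat-completion endpoint; the model identifier "deepseek-reasoner" and the function name are assumptions for illustration, not details from the experiment.

```python
def build_request(user_prompt: str, temperature: float = 0.6) -> dict:
    """Build a chat-completion payload following the DeepSeek-R1 guidelines."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("guidelines recommend a temperature of 0.5 - 0.7")
    return {
        "model": "deepseek-reasoner",  # assumed model identifier
        "messages": [
            # A single user message: no system prompt, no few-shot examples.
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
    }
```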


Approach


DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By allowing the model to generate actions as Python code, it can flexibly interact with environments through code execution.


Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
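As an illustration, a tool is just Python source text embedded in the prompt; the model then emits code actions that call it. The `web_search` tool and the prompt wording below are hypothetical, not the actual tool definitions or prompt used in the experiment.

```python
# Hypothetical tool definition, embedded verbatim in the prompt.
WEB_SEARCH_TOOL = '''
def web_search(query: str, max_results: int = 5) -> list[str]:
    """Return up to `max_results` text snippets for `query`."""
    ...  # a real implementation would call a search API
'''

def build_tool_prompt(task: str, tool_sources: list[str]) -> str:
    """Compose a prompt that exposes tool definitions to the model."""
    tools = "\n".join(tool_sources)
    return (
        "Solve the task by writing Python code. "
        "You may call these functions:\n"
        + tools
        + "\nTask: " + task
    )
```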


Results from executing these actions are fed back to the model as follow-up messages, driving subsequent actions until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
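The loop just described can be sketched as follows. This is a minimal outline, not the actual framework: `generate` (the model call), `extract_code`, and `run_code` are assumed helpers passed in by the caller.

```python
def agent_loop(task, generate, extract_code, run_code, max_steps=10):
    """Iterate model turns and code executions until a final answer appears."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = generate(messages)  # model proposes the next step
        messages.append({"role": "assistant", "content": reply})
        code = extract_code(reply)
        if code is None:            # no code action => treat reply as the answer
            return reply
        result = run_code(code)     # execute the code action
        messages.append({          # feed the result back as a follow-up message
            "role": "user",
            "content": f"Execution result:\n{result}",
        })
    return None  # no final answer within the step budget
```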


Conversations


DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching data from web pages. This drives the conversation with the environment, which continues until a final answer is reached.


In contrast, o1 models are known to perform poorly when used as chat models, i.e. they don't try to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.


Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.


This raises an interesting question about the claim that o1 isn't a chat model - maybe this observation applied more to older o1 models that lacked tool use capabilities? After all, isn't tool use support an important mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to run similar experiments with o1 models.


Generalization


Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.


Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, whether via code actions or not, could be one option to improve efficiency.


Underthinking


I also observed the underthinking phenomenon with DeepSeek-R1: a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This was a major contributor to the overly long reasoning traces produced by DeepSeek-R1. It can be seen in the recorded traces that are available for download.


Future experiments


Another common application of reasoning models is to use them for planning only, while using other models for generating code actions. This could be a potential new feature of freeact, if this separation of roles proves useful for more complex tasks.
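Such a separation of roles might look like the following sketch. This is purely hypothetical and not part of the freeact API: `planner`, `coder`, and `run_code` stand in for a reasoning model call, a code-generating model call, and a code executor, respectively.

```python
def plan_and_act(task, planner, coder, run_code):
    """Have a reasoning model plan, then a second model act via code."""
    plan = planner(task)                # reasoning model returns a list of steps
    results = []
    for step in plan:
        code = coder(step, results)     # coding model: step -> code action
        results.append(run_code(code))  # execute and collect the result
    return results
```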


I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.