8,545 bytes added
, 9 February
<br>I ran a quick experiment [http://anphap.vn investigating] how DeepSeek-R1 [https://www.xn--studiofrsch-s8a.se performs] on agentic jobs, [http://wiki.die-karte-bitte.de/index.php/Benutzer_Diskussion:Nannette0914 wiki.die-karte-bitte.de] despite not [https://csmtube.exagopartners.com supporting tool] usage natively, and I was quite impressed by [https://www.violetta.sk preliminary outcomes]. This experiment runs DeepSeek-R1 in a [https://www.himmel-real.at single-agent] setup, [https://brotato.wiki.spellsandguns.com/User:JennyMcCree668 brotato.wiki.spellsandguns.com] where the model not only prepares the [https://oldgit.herzen.spb.ru actions] however likewise [https://innovator24.com develops] the [http://pedrodesaa.com actions] as [https://redebuck.com.br executable Python] code. On a subset1 of the [https://glamcorn.agency GAIA validation] split, DeepSeek-R1 [http://www.cimol.com.ar surpasses Claude] 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% appropriate, and other models by an even bigger margin:<br><br><br>The [https://grazzee.com experiment] followed [http://francksemah.com design usage] [https://www.kayserieticaretmerkezi.com standards] from the DeepSeek-R1 paper and the design card: Don't [https://www.bridgewaystaffing.com utilize few-shot] examples, [https://wiki.lafabriquedelalogistique.fr/Discussion_utilisateur:ColinBlaylock4 wiki.lafabriquedelalogistique.fr] avoid adding a system prompt, and [https://bphomesteading.com/forums/profile.php?id=20707 bphomesteading.com] set the [http://furuhonfukuoka.info temperature level] to 0.5 - 0.7 (0.6 was used). You can [http://www.ad1387.com discover additional] [https://reklameballon.dk assessment details] here.<br><br><br>Approach<br><br><br>DeepSeek-R1[http://hnts.jyzbgl.cn3000 's strong] [http://julie-the-movie-girl.de coding abilities] enable it to act as an agent without being clearly trained for tool use. By [https://gitea.timerzz.com permitting] the design to [http://crazycleaningservices.com.au produce actions] as Python code, it can flexibly interact with environments through [https://cyberschadenssumme.de code execution].<br><br><br>Tools are carried out as [http://anphap.vn Python code] that is [http://sumatra.ranga.de included] [https://freeads.cloud straight] in the prompt. This can be a [http://dev.umfmtc.org basic function] [https://git.xaviermaso.com meaning] or a module of a larger package - any [https://meetpit.com valid Python] code. The design then [https://glamcorn.agency generates code] actions that call these tools.<br><br><br>Results from [https://playsinsight.com performing] these [http://101.33.255.603000 actions feed] back to the model as [https://pirotorg.ru follow-up] messages, [https://tubechretien.com driving] the next [http://swayamseasolutions.com actions] till a final answer is reached. The agent framework is a basic iterative coding loop that [http://www.igrantapps.com moderates] the [https://turbomotors.com.mx conversation] between the design and its environment.<br><br><br>Conversations<br><br><br>DeepSeek-R1 is [http://forum.infonzplus.net utilized] as [https://frieda-kaffeebar.de chat model] in my experiment, where the design [http://suffolkyfc.com autonomously pulls] [http://liquidarch.com additional] [http://rendart-dev.pl context] from its [https://www.maxvissen.nl environment] by [https://www.theblueskyenergy.com utilizing tools] e.g. by using a [https://www.memoassociazione.com search engine] or bring data from web pages. This drives the discussion with the [https://tamlopvnpc.com environment] that continues till a last response is reached.<br> <br><br>On the other hand, o1 models are known to [https://magikos.sk perform improperly] when used as chat designs i.e. they don't try to [http://103.6.222.206 pull context] during a [http://www.sexysearch.net conversation]. According to the linked post, o1 [https://kandy.com.au designs perform] best when they have the full [http://142.11.202.104 context] available, with clear instructions on what to do with it.<br><br><br>Initially, I also tried a full context in a [http://dewadarusakti.com single timely] method at each action (with [https://www.apprenticien.net outcomes] from previous [http://krasnodarskij-kraj.runotariusi.ru actions consisted] of), but this led to substantially lower ratings on the GAIA subset. Switching to the conversational approach [https://www.reddingschoolofmusic.com explained] above, I had the [http://konbu-day.com ability] to reach the reported 65.6% [https://jobs.theelitejob.com performance].<br><br><br>This raises a [https://theheyz.nl fascinating question] about the claim that o1 isn't a chat model - maybe this [http://chq.gov.mv observation] was more appropriate to older o1 models that lacked tool use capabilities? After all, isn't tool use support a crucial mechanism for making it possible for models to [https://cyberschadenssumme.de pull extra] context from their environment? This conversational method certainly [https://forevergorgeousaesthetics.com appears effective] for DeepSeek-R1, though I still [https://chrismartin.photo require] to perform similar experiments with o1 designs.<br><br><br>Generalization<br> <br><br>Although DeepSeek-R1 was mainly [http://shkola.mitrofanovka.ru trained] with RL on [http://soactivos.com mathematics] and coding jobs, it is amazing that generalization to agentic jobs with tool use via code actions works so well. This [https://urodziny.szczecin.pl capability] to [http://careers.egylifts.com generalize] to agentic [http://39.98.153.2509080 tasks reminds] of [https://www.veletrhbezprekazek.cz current] research study by [https://stephanieholsmanphotography.com DeepMind] that shows that [https://setupcampsite.com RL generalizes] whereas SFT remembers, although [https://www.lhommecirque.com generalization] to [https://git4edu.net tool usage] wasn't [https://git.programming.dev investigated] in that work.<br><br><br>Despite its [http://www.bulgarianfire.com ability] to [http://121.196.13.116 generalize] to tool usage, [http://archmageriseswiki.com/index.php/User:LoisBeaver20419 archmageriseswiki.com] DeepSeek-R1 [http://manekineko22.life.coocan.jp frequently produces] really long [https://www.adolescenzaistruzioneperluso.it reasoning traces] at each step, [http://altechkalip.com compared] to other models in my experiments, [https://caboconciergeltd.com limiting] the [https://izzytornado.com effectiveness] of this model in a single-agent setup. Even simpler tasks often take a long time to complete. Further RL on [http://hu.feng.ku.angn.i.ub.i?hellip;U.K37@cgi.members.interq.or.jp agentic tool] usage, be it via [http://www.zanelesilvia.woodw.orthwww.gnu-darwin.org code actions] or not, might be one option to [https://tpconcept.nbpaweb.com enhance efficiency].<br><br><br>Underthinking<br><br><br>I also [https://www.sydneycontemporaryorchestra.org.au observed] the underthinking phenomon with DeepSeek-R1. This is when a reasoning design often switches between different [https://www.livioricevimenti.it thinking] thoughts without sufficiently [https://lubimuedoramy.com exploring promising] [https://freeads.cloud courses] to reach an appropriate [http://www.cavourimmobiliare.com service]. This was a significant factor for extremely long [http://flor.krpadesigns.com thinking traces] [http://chq.gov.mv produced] by DeepSeek-R1. This can be seen in the taped traces that are available for [http://cwscience.co.kr download].<br><br><br>Future experiments<br><br><br>Another [https://www.simultania.at typical application] of [http://dev.icrosswalk.ru46300 thinking models] is to use them for preparing only, while using other [https://sadaerus.com designs] for [http://47.120.20.1583000 creating code] [https://encompasshealth.uk actions]. This might be a [https://www.wgwelchllc.com prospective] new [https://www.lexicoop.com function] of freeact, if this [http://greenmk.co.kr separation] of roles shows useful for more complex jobs.<br><br><br>I'm also curious about how [https://ebosbandenservice.nl reasoning designs] that already support tool usage (like o1, o3, ...) perform in a single-agent setup, [http://www.engel-und-waisen.de/index.php/Benutzer:Antonetta51H engel-und-waisen.de] with and without [https://vektoreco.ru generating code] actions. Recent [https://luckiestgamblers.com developments] like OpenAI's Deep Research or Face's open-source Deep Research, which likewise [https://thebestvbs.com utilizes code] actions, [https://oke.zone/profile.php?id=327275 oke.zone] look intriguing.<br>