More than code

BottomUpAgent

November 27, 2025

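Paper: Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution
The code is fairly simple overall, and many details are not handled carefully, but the idea is worth studying. It is a training-free method, so it suits anyone without much compute.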

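Most current LLM-based agents are designed top-down:

  • Humans decompose the task first
  • They design the workflow
  • They assign an agent to execute each step

➡️ This works well on benchmark tasks, but it lacks the ability to self-learn and cannot adapt to open-ended environments.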

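The authors want a way of learning that is closer to how humans learn: an agent whose skills evolve bottom-up. The agent improves step by step through:

  • Autonomous exploration
  • Reflecting on failures
  • Summarizing skills
  • Continual accumulation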


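The authors validate this paradigm in two complex environments, Slay the Spire and Civilization V. The agent interacts with the environment directly through visual input and mouse output, without any game-specific prompts or APIs.

Two technical points are involved:

  • How to explore, reflect on, and summarize skills so that they keep accumulating
  • How to interact with the environment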

Skill Evolution


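Following the lifecycle of a skill, there are three stages: generation, evaluation, and elimination.

  • Each skill is composed of a sequence of atomic operations.
    • For example, an attack skill consists of [click the attack card, click the enemy].
  • Skill augmentation appends atomic operations to an existing skill to obtain a new one.
    • Suppose an existing skill's action sequence is just clicking a card. On its own, that skill is useless.
    • Appending a click on the enemy plays the card, which yields a new skill.
    • This is done with MCTS, which expands high-scoring skills with higher probability.
  • Skill evaluation assesses the effect of a skill.
    • In the illustrated example, executing an operation leads to a new screen. The states at these two moments, together with the executed operation, are fed to the LLM.
    • The code differs slightly from the paper. Judging from the code, evaluation looks at two things:
      • consistent: whether the skill changed the game environment as expected (for example, an attack is expected to reduce the enemy's HP).
      • progressive: whether the skill advanced the game (for example, repeatedly opening and closing a menu makes no positive progress).
  • Skill refinement corresponds to eliminating skills.
    • Some skills turn out to be of little use after many executions and are deleted from the skill library.
  • Skill invocation:
    • The LLM first selects a subset of all skills based on semantics, for example deciding that the current state calls for an attack.
    • MCTS then decides which specific skill to execute.
    • The core idea is that the LLM makes the high-level decision and MCTS determines the most effective way to carry it out.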
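To make the "skill = sequence of atomic operations" idea concrete, here is a minimal sketch of skill augmentation. The `Operation` and `Skill` classes and the `augment` helper are hypothetical names for illustration, not the repository's actual types, and the random choice stands in for the score-guided expansion described above.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    """One atomic UI action, e.g. a click on a detected object (assumed shape)."""
    kind: str        # e.g. "Click"
    object_id: int   # id of the UI object the action targets


@dataclass
class Skill:
    """A skill is an ordered sequence of atomic operations."""
    operations: list = field(default_factory=list)
    fitness: float = 0.0   # accumulated evaluation score
    num: int = 0           # how many times the skill was evaluated


def augment(skill, visible_object_ids, explored, rng=random):
    """Append one not-yet-tried click to `skill` and return the new skill.

    `explored` holds operations already tried as the next step of this
    skill. Returns None when every candidate has been tried.
    """
    candidates = [Operation("Click", oid) for oid in visible_object_ids]
    unexplored = [op for op in candidates if op not in explored]
    if not unexplored:
        return None
    chosen = rng.choice(unexplored)
    # Build a new list so the original skill is left unchanged.
    return Skill(operations=skill.operations + [chosen])


# Usage: extend "click the attack card" (object 3) with a click on the enemy (object 7).
select_card = Skill(operations=[Operation("Click", 3)])
attack = augment(select_card, visible_object_ids=[3, 7],
                 explored={Operation("Click", 3)})
print([(op.kind, op.object_id) for op in attack.operations])
# [('Click', 3), ('Click', 7)]
```

The original skill is kept alongside the augmented one, so both can later be scored and the useless one pruned.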


For intuition, here is what the skills look like: [figure: example skills]

Environment Interaction

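This part works the same way as in many GUI agents. Interaction with the environment has two parts:

  • Perceiving the environment
    • Each state corresponds to one screenshot of the game. Similar states are deduplicated by thresholding the embedding similarity from a vision model.
    • Segment Anything Model (SAM) extracts every UI object (bbox, image) in the current screenshot.
  • Taking actions
    • Generic mouse/keyboard input.
    • When an action such as click object(1) is executed, it is resolved through the UI objects detected above into a click at pixel (xxx, yyy).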
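The two halves above can be sketched in a few lines: matching a screenshot embedding against stored states with a similarity threshold, and grounding an object id to a pixel. This is a sketch under assumptions: the 0.9 threshold, the `(x, y, width, height)` bbox layout, and the function names are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar_state(embedding, known_states, threshold=0.9):
    """Return the id of the most similar stored state, or None if below threshold."""
    best_id, best_sim = None, threshold
    for state_id, stored in known_states.items():
        sim = cosine_similarity(embedding, stored)
        if sim >= best_sim:
            best_id, best_sim = state_id, sim
    return best_id


def ground_click(object_id, objects):
    """Turn "click object(i)" into a pixel: the center of the object's bbox."""
    x, y, w, h = objects[object_id]["bbox"]   # assumed (x, y, width, height)
    return (x + w // 2, y + h // 2)


known = {"battle": [1.0, 0.0, 0.0], "map": [0.0, 1.0, 0.0]}
print(find_similar_state([0.98, 0.05, 0.0], known))   # battle
print(find_similar_state([0.5, 0.5, 0.7], known))     # None
objects = {1: {"bbox": (100, 200, 40, 60)}}
print(ground_click(1, objects))                        # (120, 230)
```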

Framework


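After entering the game, the agent runs a loop. Each step does the following:

  • Capture the current screenshot and check for a similar historical state, which determines the current state.
    • For example, facing the same enemy again.
  • Have the LLM select a subset of skills based on the current state.
    • For example, whether to attack or to defend.
  • If no skill is available, try skill augmentation to generate new skills.
  • Use MCTS to select a skill from the subset, execute it, and evaluate its effect.
    • Because this is a game environment, rollouts are hard to do, so it is not really MCTS. It only records each skill's score and number of trials, then computes a UCT (Upper Confidence Bounds for Trees) score to balance exploring new strategies against exploiting old ones.
  • Run skill evolution to clean up ineffective skills.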
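The UCT scoring mentioned in the loop can be sketched as follows. This uses the textbook UCT formula (average reward plus an exploration bonus). The exploration constant `c=1.41` and the `(fitness, num)` bookkeeping are assumptions for illustration, and the repository may weight the terms differently.

```python
import math


def uct_score(fitness, num, total, c=1.41):
    """UCT = average reward + exploration bonus; untried skills score +inf."""
    if num == 0:
        return math.inf
    return fitness / num + c * math.sqrt(math.log(total) / num)


def select_skill(stats, c=1.41):
    """stats maps skill name -> (accumulated fitness, number of trials)."""
    total = sum(num for _, num in stats.values())
    return max(stats, key=lambda name: uct_score(*stats[name], total, c))


stats = {"strike": (8.0, 10), "defend": (1.0, 2)}
print(select_skill(stats))         # defend: few trials, large exploration bonus
print(select_skill(stats, c=0.0))  # strike: pure exploitation picks the best average
```

With `c=0` the rule collapses to picking the best average score, which shows how the constant trades exploration against exploitation.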

Case Study


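Possible improvements

As the decision process above shows, the long-term effect of a strategy is not considered. The focus is on whether the current skill advances the game. Transitions between states are not considered either.
A follow-up work, EXPERIENCE-DRIVEN EXPLORATION FOR EFFICIENT API-FREE AI AGENTS, makes some improvements:

  • Action values are modeled in a way similar to Q-learning. A graph records the reward that each skill may obtain in each state.
  • A skill's reward also accounts for the value of the state it moves to, and a state's value is in turn computed from the values of the skills available in that state. (This closely resembles how the Bellman equation relates V and Q, with some heuristic factors added.)
  • This paper feels more like running a policy with an RL algorithm (or an approximation of one), with a VLM generating the reward.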
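For reference, the textbook relation that the comparison points to is the Bellman pair below, written for a deterministic transition. Here $s'$ is the state reached after executing skill $a$ in state $s$, $r(s, a)$ is the immediate reward, and $\gamma$ is a discount factor. This is the standard form only. The follow-up paper's actual update adds heuristic terms that are not reproduced here.

```latex
\[
Q(s, a) = r(s, a) + \gamma \, V(s'), \qquad V(s) = \max_{a} Q(s, a)
\]
```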

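Unfortunately, neither paper does what SWE-Exp: Experience-Driven Software Issue Resolution does, namely reflect on trajectories to produce guiding strategies that assist later decisions. Both stay closer to the more "statistical" RL-style methods.

  • Applied to real-world agents, this kind of approach makes training relatively costly.
  • For example, in Slay the Spire, trajectories could be summarized into the attack patterns and traits of different enemies, yielding guiding strategies.

Another point: both papers model the whole process as a POMDP, meaning that decisions are made from a possible state inferred only from observations.

  • For games with complex state, a single screenshot cannot express the state. When doing RL on SC2, for example, many additional numerical features are also encoded into the input.
  • In an API-free setting, introducing something like an agent memory mechanism that records key information for later decisions might work better.
    • For example, state = (vision_embedding(screen_shot) | text_embedding(summary))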
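One way to read the `|` in that state definition is plain concatenation. The sketch below assumes each part is unit-normalized before concatenating, so that neither modality dominates the similarity. The function names are hypothetical and the short lists stand in for real embeddings.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def composite_state(vision_emb, text_emb):
    """Concatenate unit-normalized embeddings so neither modality dominates."""
    return l2_normalize(vision_emb) + l2_normalize(text_emb)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Same screenshot embedding, different remembered context.
s1 = composite_state([3.0, 4.0], [1.0, 0.0])
s2 = composite_state([3.0, 4.0], [0.0, 1.0])
print(round(cosine(s1, s2), 3))   # 0.5: identical vision, orthogonal text
```

With both parts unit-normalized, the cosine of the concatenation is the average of the two per-modality cosines, so two visually identical screens with different summaries no longer collapse into one state.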

Code

https://github.com/AngusDujw/Bottom-Up-Agent

Hand

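The Hand class is the agent's executor (the execution layer). It turns decisions into actual mouse and keyboard operations that control the game or application. Think of it as the agent's "hand".
It depends on pyautogui.

  • move/left_single_click/right_single_click(x, y)
  • do_operation(), the high-level operation:
    • There is no operation for a single key press, only a hot key operation for key combinations.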
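A minimal sketch of such an executor is shown below. The method names follow the list above, and `moveTo`, `click`, and `hotkey` are real pyautogui functions. The operation dict layout (`"operate"`, `"position"`, `"keys"`) and the `RecordingBackend` stand-in are assumptions for illustration, so the example can run without moving the real mouse.

```python
class Hand:
    """Executor that forwards operations to a pyautogui-like backend."""

    def __init__(self, backend=None):
        if backend is None:
            import pyautogui  # real mouse/keyboard control; needs a display
            backend = pyautogui
        self.backend = backend

    def move(self, x, y):
        self.backend.moveTo(x, y)

    def left_single_click(self, x, y):
        self.backend.click(x=x, y=y, button="left")

    def right_single_click(self, x, y):
        self.backend.click(x=x, y=y, button="right")

    def do_operation(self, op):
        """Dispatch an operation dict; the dict layout is an assumption."""
        kind = op["operate"]
        if kind == "Click":
            self.left_single_click(*op["position"])
        elif kind == "RightClick":
            self.right_single_click(*op["position"])
        elif kind == "HotKey":
            self.backend.hotkey(*op["keys"])   # key combination, e.g. ctrl+z
        else:
            raise ValueError(f"unsupported operation: {kind}")


class RecordingBackend:
    """Stand-in for pyautogui that records calls, for dry runs."""

    def __init__(self):
        self.calls = []

    def moveTo(self, x, y):
        self.calls.append(("moveTo", x, y))

    def click(self, x, y, button):
        self.calls.append(("click", x, y, button))

    def hotkey(self, *keys):
        self.calls.append(("hotkey",) + keys)


backend = RecordingBackend()
hand = Hand(backend)
hand.do_operation({"operate": "Click", "position": (120, 230)})
hand.do_operation({"operate": "HotKey", "keys": ("ctrl", "z")})
print(backend.calls)
# [('click', 120, 230, 'left'), ('hotkey', 'ctrl', 'z')]
```

Calling `Hand()` with no argument would use pyautogui itself and drive the real cursor.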

Eye

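The Eye class is the agent's visual perception layer (the observer). It captures screenshots of the game window and detects changes on screen. It acts as the agent's "eye" for observing the environment state.

  • get_screenshot_cv captures a screenshot of the window.
  • detect_acted_cv computes the diff between two screenshots and checks whether the pixel difference is large enough (used to cut down on the images sent to the model).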
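The change check can be sketched with a mean absolute pixel difference. This is an illustration only: the repository presumably works on image arrays (the `_cv` suffix suggests OpenCV), whereas this sketch uses nested lists of grayscale values, and the threshold of 5.0 is an arbitrary assumption.

```python
def mean_abs_diff(before, after):
    """Mean absolute per-pixel difference of two same-sized grayscale frames."""
    total = count = 0
    for row_b, row_a in zip(before, after):
        for pb, pa in zip(row_b, row_a):
            total += abs(pb - pa)
            count += 1
    return total / count if count else 0.0


def screen_changed(before, after, threshold=5.0):
    """True when the frames differ enough to count as a real change."""
    return mean_abs_diff(before, after) > threshold


idle = [[10, 10], [10, 10]]
flicker = [[10, 12], [10, 10]]      # tiny noise: mean diff 0.5
new_screen = [[200, 10], [10, 90]]  # mean diff (190 + 80) / 4 = 67.5
print(screen_changed(idle, flicker))      # False
print(screen_changed(idle, new_screen))   # True
```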

Detector

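The Detector class is the agent's visual understanding layer. It extracts and identifies interactable objects from screenshots, combining SAM (Segment Anything Model) for object segmentation with CLIP for image encoding.

  • Segmentation:
    • extract_objects extracts the objects in the scene, filters and deduplicates them, and returns for each object:
      • Bounding box/area, hash, center, image
    • update_objects extracts objects and deduplicates them against previously extracted ones (it tries to merge objects).
  • Encoding:
    • Uses the ViT-B/32 model. The clip library can load multimodal models of this kind and provides encode_text/encode_image interfaces that produce the embedding of a piece of text or an image.
    • encode_text
    • encode_image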
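The filter-and-deduplicate step on segmented objects might look like the sketch below. The criteria here (a minimum area and an IoU overlap threshold) are assumptions chosen for illustration, since the post does not spell out which rules the repository applies. Boxes are assumed to be `(x, y, width, height)`.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def clean_objects(boxes, min_area=100, iou_threshold=0.8):
    """Drop tiny boxes, then drop boxes that nearly coincide with a kept one."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[2] * b[3], reverse=True):
        if box[2] * box[3] < min_area:
            continue
        if all(iou(box, other) < iou_threshold for other in kept):
            kept.append(box)
    return [
        {"bbox": b, "area": b[2] * b[3],
         "center": (b[0] + b[2] // 2, b[1] + b[3] // 2)}
        for b in kept
    ]


# Two near-identical masks of one card, a speck of noise, and a button.
boxes = [(100, 200, 40, 60), (101, 201, 40, 60), (0, 0, 5, 5), (300, 50, 80, 30)]
for obj in clean_objects(boxes):
    print(obj)
```

The stored center is what a later click on the object would be grounded to.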

LongMemory

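The LongMemory class is the agent's long-term memory system. It uses a SQLite database to persist game states, objects, skills, and skill clusters, so that the agent can learn and reuse knowledge across sessions.
It is not the conversational memory we usually have in mind. It is closer to a knowledge base.
There are three groups of interfaces:

  • Object, which handles binary data such as stored images
    • get_by_id
    • update_object
  • States
    • save_state
    • get_state
  • Skill/SkillCluster
    • get_skill_cluster_by_id/save_skill_cluster
    • get_skill_by_id/save_skill/delete_skill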
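As a rough picture of the skill interfaces, here is a sketch built on Python's standard `sqlite3` module. The method names mirror the list above, but the table schema, column names, and JSON encoding of operations are hypothetical and not taken from the repository.

```python
import json
import sqlite3


class LongMemory:
    """Tiny persistent store for skills; the table layout is hypothetical."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS skills ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, "
            "operations TEXT, fitness REAL, num INTEGER)"
        )

    def save_skill(self, name, operations, fitness=0.0, num=0):
        cur = self.conn.execute(
            "INSERT INTO skills (name, operations, fitness, num) VALUES (?, ?, ?, ?)",
            (name, json.dumps(operations), fitness, num),
        )
        self.conn.commit()
        return cur.lastrowid

    def get_skill_by_id(self, skill_id):
        row = self.conn.execute(
            "SELECT id, name, operations, fitness, num FROM skills WHERE id = ?",
            (skill_id,),
        ).fetchone()
        if row is None:
            return None
        return {"id": row[0], "name": row[1], "operations": json.loads(row[2]),
                "fitness": row[3], "num": row[4]}

    def delete_skill(self, skill_id):
        self.conn.execute("DELETE FROM skills WHERE id = ?", (skill_id,))
        self.conn.commit()


memory = LongMemory()   # pass a file path instead to persist across sessions
sid = memory.save_skill("attack", [{"operate": "Click", "object_id": 3},
                                   {"operate": "Click", "object_id": 7}])
print(memory.get_skill_by_id(sid)["name"])   # attack
memory.delete_skill(sid)
print(memory.get_skill_by_id(sid))           # None
```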

PreKnowledge

Task-related knowledge injected through the prompt.
For example, for Slay the Spire:

You are an AI assistant playing the deck-building roguelike game **Slay the Spire**. Below is a summary of the game rules and controls you must know:

Game Overview:
- You control a hero who climbs a spire by defeating enemies in turn-based card battles.
- In each battle, you have **energy** (default 3 per turn) to play cards.
- Cards can be **Attacks**, **Skills**, **Powers**, or **Curses**.
- The goal is to reduce the enemy's HP to 0 while surviving.

Card Types:
- **Attack**: Deals damage to the enemy.
- **Skill**: Provides defense (block), buffs, or utility.
- **Power**: Applies a passive effect for the rest of the battle.
- **Curse/Status**: Unplayable or harmful cards.

Turn System:
- Each turn, you draw 5 cards.
- You can play cards as long as you have energy.
- Enemies show **intents** (e.g. attack, buff, block) above their heads.

Combat Strategy Basics:
- **Block** mitigates damage but disappears at the end of your turn.
- Use **energy efficiently** — don't waste points.
- Prioritize removing enemies with high damage output or debuffs.
- Watch for **vulnerable**, **weak**, and **frail** effects (common debuffs).

Controls (UI-based):
- Click on cards to play them (if you have enough energy).
- Drag cards to enemies or yourself depending on the target.
- Click the “End Turn” button to end your turn.
- Hover or click on enemy intent icons to see what they plan to do.

Goals for AI:
- Analyze visible cards, energy, and enemy intents.
- Decide the best action: which cards to play, which enemies to target.
- Consider card cost, effects, and current HP/block values.

Main

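There is also a brain module, which is more central than main. Since main is the driving flow, the two are read together here.

  • run_step
    • get_observation captures the current screenshot and its embedding.
    • Look up historical states in long memory that are similar to the current screen state.
    • If none is similar, initialize a new state containing:
      • state_feature, the embedding of the screenshot
      • object_id
      • an empty mcts and an empty skill cluster
    • Next, the model selects a skill cluster.
    • Then a skill is selected from that skill cluster.
    • The selection is either explore (discover a new skill) or exploit (execute an existing skill).
      • If a skill fails, execution is retried. To avoid retrying forever, the failed skill is recorded and not selected again.
        • It also seems that all skills under the skill cluster get executed once here.
      • Explore
        • Generate a new skill through skill augmentation and save it.
          • First execute all operations under the current mcts node, then update the object ids on screen.
          • Then generate the potential operations: one click for every object.
          • Filter out operations that already exist among the children of the current mcts node.
          • What remains are the unexplored operations. One is picked at random, or a human helps pick one.
          • Then run do_operation and check whether the new operation changes the screen. If nothing changes, the exploration fails.
          • If the screen changes, a new skill is generated. If the state feature has also changed, node.fixed is set and skill generation exits.
        • The generated skills are clustered by the LLM to obtain skill clusters.
      • Exploit
        • The selected skill contains several operations. For each operation:
          • First do grounding, mapping the target object id to screen coordinates, then execute the operation.
        • After execution, compare the screen before and after. If the diff is too small, the execution is considered failed and failure is returned.
        • If it did not fail, skill evaluation starts: brain.skill_evaluate
          • Done with the LLM.
        • If the skill's effect matches its description, skill_fitness increases. If the skill advances the task, skill_fitness also increases.
        • Save the skill's fitness and evaluate num, and return continue to move on to the next skill.
    • Finally there is skill evolution, which uses a threshold to remove skills that have been executed many times but have low fitness.
      • For example, a skill that never does anything when clicked, or one that makes no progress on the task, gets deleted.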
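The fitness bookkeeping and the final pruning pass can be sketched as below. The increments (one point per satisfied criterion), `min_trials`, and `min_avg_fitness` are hypothetical values chosen for illustration. The post only states that a threshold removes skills with many executions and low fitness.

```python
def update_fitness(skill, is_consistent, is_progressive):
    """Add one point per satisfied criterion and count the evaluation."""
    skill["fitness"] += int(is_consistent) + int(is_progressive)
    skill["num"] += 1


def evolve(skills, min_trials=3, min_avg_fitness=0.5):
    """Keep skills that are either under-tested or good enough on average."""
    return [
        s for s in skills
        if s["num"] < min_trials or s["fitness"] / s["num"] >= min_avg_fitness
    ]


attack = {"name": "attack", "fitness": 0, "num": 0}
dead_click = {"name": "dead_click", "fitness": 0, "num": 0}
fresh = {"name": "fresh", "fitness": 0, "num": 0}   # never evaluated yet
for _ in range(3):
    update_fitness(attack, is_consistent=True, is_progressive=True)
    update_fitness(dead_click, is_consistent=False, is_progressive=False)

survivors = evolve([attack, dead_click, fresh])
print([s["name"] for s in survivors])   # ['attack', 'fresh']
```

Keeping under-tested skills avoids deleting a new skill before it has had a fair number of evaluations.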

Prompt

Next, the prompts for some of the core tasks.

Generate skill: generates a skill's description from its operations.

```text
Objectives:
- Analyze the recent UI-level operations you performed.

Inputs:
1. A list of UI-level operations: '{operations}'
2. Screenshots of the game captured during these operations.

Instructions:
1. Determine which UI elements were interacted with, based on the coordinates provided.
2. Analyze how the game screen changed during the operations.
3. Give this sequence of operations a meaningful **name** and a **description** that includes:
   - What the operation does (its function)
   - Any precautions to take
4. The name must be highly relevant to the actual operations performed.
5. If the operations are **meaningless**, use the `no_meaning_skill` tool to report it.
6. If the operations are **meaningful**, use the `save_skill` tool to save the result.
7. Think step by step before making a decision.
```
Cluster skill: clusters similar skills to produce skill clusters.

```text
Background:
You are an assistant for grouping similar skills.
Input: a JSON array called “new_skills”.

Task: Identify skills that are nearly identical in meaning or function, even if their expressions differ.
• Group together skills that essentially perform the same task or behavior, even if worded differently.
• Do not group skills that express different functions or intentions, even if they appear related.
• Think of this as merging duplicates or near-duplicates, not broad semantic clustering.

Output must strictly call the function “cluster_skills” with no extra text.

Here is the list of new_skills (id, name, description):
{json.dumps(skills, indent=2)}

Please:
1. Identify cluster of functionally equivalent skills.
2. For each cluster, select a representative “action_name” and “action_description”.
3. List its members as an array of action ids.
```
Merge skill cluster: merges skills into existing clusters.

```text
Background:
You are an assistant that merges and clusters skills in one call.
Input:
• existing_skill_clusters: clusters with cluster_id, name, description, members
• new_skills: raw skills with id, name, description
Instruction:
1) Cluster new_skills among themselves by semantic similarity.
2) For each resulting new cluster:
   - If it matches an existing cluster, merge into it (reuse that cluster_id).
   - Otherwise, assign cluster_id = -1.
3) Each output cluster must include:
   - cluster_id
   - name & description (representative)
   - members: combined list of all action IDs.
Output:
Exactly one function_call to "merge_skills", no extra text.

existing_skill_clusters:
{json.dumps(existing_skill_clusters, indent=2)}

new_skills:
{json.dumps(new_skills, indent=2)}

Please perform the merge as specified.
Return the merged list under the key "clusters".
```
Select skill: in practice this selects a skill cluster.

```text
Objectives:
- Select the best skill from the provided skill list.

Inputs:
1. A list of learned skills.
2. A screenshot of the current game state.

Skills:
{skills}

Instructions:
1. Analyze the screenshot to understand the current state of the game.
2. For each skill, assess:
   - Whether the execution conditions are currently satisfied.
   - Whether it is functionally suitable for current state.
3. First, eliminate any skills that are not currently executable.
4. Then, among the remaining executable skills, choose the one most appropriate.
5. Must return your selection using the function tool provided — do NOT respond with text.
6. Think step by step before making your final decision.
```
Skill evaluate: evaluates whether a skill is progressive and whether its effect is consistent with its description.

```text
Task:
- '{task}'

Inputs:
1. Action Information
Name and description of the action:
{skill_info}

2. Screenshots
A pair of images showing the game state before and after the action.

Instructions:

Step 1: Expected Change
- Based on the action description, summarize what is expected to change in the game state if the action works as intended.

Step 2: Actual Change
- Analyze the differences between the before and after screenshots.
- Describe all observable UI changes (e.g., elements, values, progress bars, icons).

Step 3: Consistency Check → is_consistent
- If the actual changes match the expected outcome (in type and magnitude), set is_consistent = true; otherwise, false.

Step 4: Progress Indicators
- Identify at least three concrete, task-relevant indicators that suggest progress toward the task goal.

Step 5: Progress Check → is_progressive
- Evaluate whether any of the indicators were clearly met.
- If any indicator is met meaningfully, set is_progressive = true.
- If none are met, or if changes are negligible/irrelevant, set is_progressive = false.
- If the action is only preparatory (e.g., hovering, previewing, selecting without effect), set is_progressive = false.

Step 6: Default to False
- If information is incomplete, unclear, or ambiguous, default both booleans to false.

Remember:
→ Do not guess.
→ Only return the action_reflex function call with the evaluated booleans.
```

sheep
