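Paper: Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution
The code is fairly simple overall and many details are not handled carefully, but the idea is worth studying. It is a training-free method, which suits readers without much compute.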

Most current LLM-based agents follow a top-down design:

Humans first decompose the task
Design a workflow
Assign agents to execute each step

➡️ This works well on benchmark tasks, but the agent cannot learn on its own and cannot adapt to open-ended environments.

The authors want a learning process closer to how humans learn: a bottom-up, skill-evolving agent. The agent:

Explores autonomously
Reflects on failures
Summarizes skills
and improves step by step as these skills accumulate.

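The authors validate this paradigm in two complex environments, Slay the Spire and Civilization V. The agent interacts with the environment directly through visual input and mouse actions, without any game-specific prompts or APIs.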

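Two main technical points are involved:

How to explore, reflect on, and summarize skills so that they keep accumulating
How to interact with the environment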

Skill Evolution

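Following the skill life cycle, there is skill generation, evaluation, and elimination:

Each skill is composed of a sequence of atomic operations.
- For example, an attack skill consists of [click the attack card, click the enemy]
Skill augmentation appends atomic operations to an existing skill to obtain a new skill.
- Suppose an existing skill's action sequence is a single click on a card. On its own this skill is useless
- Appending a click on the enemy plays that card, which yields a new skill
- This is done with MCTS, so high-scoring skills are more likely to be expanded
Skill evaluation assesses the effect of a skill.
- For example, in the paper's illustration, executing an operation leads to a new screen. The states at the two time steps and the executed operation are fed to the LLM
- This differs somewhat from the paper's description. Judging from the code, evaluation checks two things:
  - consistent: whether the skill changed the game environment as expected (for example, an attack should reduce the enemy's HP)
  - progressive: whether the skill advanced the game (for example, repeatedly opening and closing a menu makes no positive progress)
Skill refinement corresponds to skill elimination
- Some skills turn out to be of little use after many executions and are removed from the skill library
Skill invocation:
- The LLM first selects a subset of all skills by semantics, for example when the decision in the current state is to attack.
- MCTS then decides which specific skill to execute
- The core idea is that the LLM makes the high-level decision and MCTS determines the most effective way to carry it out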
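To make the "skill = sequence of atomic operations" idea concrete, here is a minimal sketch of skill augmentation. The `Skill` dataclass, the `(action, object_id)` encoding, and the helper names are illustrative assumptions, not the repository's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical encoding of an atomic operation: (action type, target object id).
Operation = Tuple[str, int]


@dataclass
class Skill:
    name: str
    operations: List[Operation] = field(default_factory=list)


def augment(skill: Skill, new_op: Operation, name: str) -> Skill:
    """Return a new skill: the old operation sequence plus one more operation."""
    return Skill(name=name, operations=skill.operations + [new_op])


def candidate_ops(object_ids: List[int], explored: List[Operation]) -> List[Operation]:
    """One click per visible object, minus operations already tried at this node."""
    return [("click", oid) for oid in object_ids if ("click", oid) not in explored]


# A lone card click does nothing useful; appending a click on the enemy plays the card.
select_card = Skill("select_card", [("click", 3)])
attack = augment(select_card, ("click", 7), "attack_enemy")
```

The original skill is left untouched, so both the short and the extended sequence can live in the skill library and be scored separately.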

For an intuitive picture, here is what the learned skills look like

Environment Interaction

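This part is the same as in typical GUI agents. Interaction with the environment has two parts:

Perceiving the environment
- Each state corresponds to one game screenshot. Similar states are deduplicated by thresholding the embedding similarity from a vision model
- Segment Anything Model (SAM) extracts all UI objects (bbox, image) from the current screenshot
Taking actions
- Generic mouse/keyboard input
- When an action is executed, for example click object(1), the UI object detected above is used to convert it into a click on the pixel (xxx, yyy)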
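The two perception and action steps above can be sketched as follows. This is an illustration under assumptions: the `(x, y, width, height)` bbox layout, the 0.95 threshold, and the function names are all hypothetical, and the repository may implement both steps differently.

```python
import math
from typing import Dict, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_state(embedding: List[float], known: Dict[int, List[float]],
                threshold: float = 0.95) -> Optional[int]:
    """Return the id of the most similar stored state, or None if none passes the threshold."""
    best_id, best_sim = None, threshold
    for state_id, stored in known.items():
        sim = cosine(embedding, stored)
        if sim >= best_sim:
            best_id, best_sim = state_id, sim
    return best_id


def ground(object_id: int, objects: Dict[int, BBox]) -> Tuple[int, int]:
    """Map an object id to the pixel at the center of its bounding box."""
    x, y, w, h = objects[object_id]
    return (x + w // 2, y + h // 2)
```

With this, an abstract action such as "click object 1" becomes a click at `ground(1, objects)`, and a screenshot whose embedding matches a stored state reuses that state instead of creating a new one.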

Framework

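After entering the game, the agent runs a loop. Each step does the following:

Capture the current screenshot and look for a similar historical state to determine the current state
- For example, a fight against the same enemy
Use the LLM to select a subset of skills for the current state
- For example, whether to attack or to defend
If there are no skills, try skill augmentation to generate new ones
Use MCTS to select a skill from the subset, execute it, and evaluate its effect
- Because this is a game environment, rollouts are hard to do, so this is not really full MCTS. The code just records each skill's score and number of tries, then computes a UCT (Upper Confidence Bounds for Trees) score to balance exploring new strategies against exploiting old ones
Run skill evolution to clean up ineffective skills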
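The rollout-free selection described above amounts to a bandit-style UCT rule over per-skill statistics. Below is a minimal sketch of the standard UCT formula; the `SkillStats` fields and the exploration constant `c` are assumptions, and the repository's scoring may differ in detail.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SkillStats:
    name: str
    total_score: float = 0.0  # accumulated evaluation score
    visits: int = 0           # number of times the skill was tried


def uct_score(skill: SkillStats, total_visits: int, c: float = 1.41) -> float:
    """Mean score plus an exploration bonus that shrinks as the skill is tried more."""
    if skill.visits == 0:
        return math.inf  # untried skills are always explored first
    exploit = skill.total_score / skill.visits
    explore = c * math.sqrt(math.log(total_visits) / skill.visits)
    return exploit + explore


def select_skill(skills: List[SkillStats]) -> SkillStats:
    """Pick the skill with the highest UCT score among the candidates."""
    total = sum(s.visits for s in skills)
    return max(skills, key=lambda s: uct_score(s, total))


skills = [
    SkillStats("attack", total_score=8.0, visits=10),
    SkillStats("defend", total_score=1.0, visits=10),
    SkillStats("untried", total_score=0.0, visits=0),
]
```

An untried skill wins immediately. Among equally visited skills the exploration bonus is identical, so the one with the higher mean score is chosen.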

CaseStudy

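Follow-up Improvements

The decision process above pays no attention to long-term strategy. It mostly checks whether the current skill makes progress in the game, and it does not consider transitions between states.
A follow-up work, EXPERIENCE-DRIVEN EXPLORATION FOR EFFICIENT API-FREE AI AGENTS, makes some improvements:

Similar to Q-Learning and related methods, it models action values: a graph records the reward that each skill may obtain in each state.
A skill's reward also accounts for the value of the state it moves to, while a state's value is computed from the values of the skills available in it. (This closely resembles how the Bellman equation relates V and Q, with some heuristic factors added.)
This paper feels more like running the policy with an RL algorithm (or an approximation of one), with a VLM generating the reward

Unfortunately, neither paper takes the approach of SWE-Exp: Experience-Driven Software Issue Resolution, which reflects on trajectories and generates guiding strategies to assist later decisions. Both stay closer to RL-style, more "statistical" methods.

Applied to real-world agents, this statistical style would be fairly expensive to train.
For example, in Slay the Spire one could instead summarize each enemy's attack patterns and traits from trajectories to derive guiding strategies.

Another point is that both papers model the whole process as a POMDP, meaning decisions are made from an observation that only gives a possible state.

For games with complex state, a single screenshot cannot express the state. For example, RL work on SC2 also encodes many additional numerical features.
For an API-free approach, introducing something like an agent memory mechanism that records key information for later decisions might work better.
- For example, state = (vision_embedding(screen_shot) | text_embedding(summary))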
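For reference, the textbook relation that the notes compare this to can be written as below. This is the standard Bellman form, not the follow-up paper's exact update: the symbols $r$, $\gamma$, and $s'$ are assumptions for illustration, and the paper's heuristic terms are not reproduced here.

```latex
% Q(s,k): value of executing skill k in state s; s' is the state reached afterwards.
Q(s, k) = r(s, k) + \gamma \, V(s'), \qquad
V(s) = \max_{k} Q(s, k)
```

The first equation is the "skill reward accounts for the value of the next state" part, and the second is the "state value is computed from the values of its skills" part.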
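The state formula suggested above could be sketched like this. It is only one possible reading of the `|` in the notes, taken here as concatenation; the normalization step and the function names are assumptions added for illustration, and the embeddings themselves would come from whatever vision and text encoders are in use.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def build_state(vision_emb: List[float], text_emb: List[float]) -> List[float]:
    """Concatenate the screenshot embedding with the embedding of a memory summary."""
    return l2_normalize(vision_emb) + l2_normalize(text_emb)


# Toy vectors stand in for real encoder outputs.
state = build_state([3.0, 4.0], [0.0, 2.0])
```

Two screenshots that look identical but follow different histories would then map to different states as long as their summaries differ.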

Code

https://github.com/AngusDujw/Bottom-Up-Agent

Hand

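The Hand class is the agent's executor (the execution layer). It turns decisions into actual mouse and keyboard operations to control the game or application. Think of it as the agent's "hand".
It depends on pyautogui

move/left_single_click/right_single_click(x, y)
do_operation(), the high-level operation:
- There is no operation for a single key press, only a hot key operation for key combinations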
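A minimal sketch of such an executor is shown below. The method names follow the notes, but the operation dictionary format (`"operate"`, `"x"`, `"y"`, `"keys"`) and the injectable backend are assumptions made so the sketch can be dry-run without a display. Only `pyautogui.moveTo`, `pyautogui.click`, and `pyautogui.hotkey` are real library calls.

```python
from typing import Any, Dict, List, Optional


class RecordingBackend:
    """Stand-in for pyautogui that records calls instead of moving the mouse."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def moveTo(self, x: int, y: int) -> None:
        self.calls.append(("moveTo", x, y))

    def click(self, x: int, y: int, button: str) -> None:
        self.calls.append(("click", x, y, button))

    def hotkey(self, *keys: str) -> None:
        self.calls.append(("hotkey",) + keys)


class Hand:
    def __init__(self, backend: Optional[Any] = None) -> None:
        if backend is None:
            import pyautogui  # imported lazily; needs a display to load
            backend = pyautogui
        self.gui = backend

    def move(self, x: int, y: int) -> None:
        self.gui.moveTo(x, y)

    def left_single_click(self, x: int, y: int) -> None:
        self.gui.click(x=x, y=y, button="left")

    def right_single_click(self, x: int, y: int) -> None:
        self.gui.click(x=x, y=y, button="right")

    def do_operation(self, op: Dict[str, Any]) -> None:
        """Dispatch one operation dict (hypothetical format) to the low-level calls."""
        kind = op["operate"]
        if kind == "Click":
            self.left_single_click(op["x"], op["y"])
        elif kind == "RightClick":
            self.right_single_click(op["x"], op["y"])
        elif kind == "Hotkey":
            self.gui.hotkey(*op["keys"])
        else:
            raise ValueError(f"unsupported operation: {kind}")


backend = RecordingBackend()
hand = Hand(backend)
hand.do_operation({"operate": "Click", "x": 10, "y": 20})
hand.do_operation({"operate": "Hotkey", "keys": ["ctrl", "s"]})
```

Passing no backend makes `Hand()` drive the real mouse and keyboard through pyautogui.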

Eye

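The Eye class is the agent's visual perception layer (the observer). It captures screenshots of the game window and detects changes on screen. It acts as the agent's "eye" for observing the environment state.

get_screenshot_cv captures a screenshot of the window
detect_acted_cv diffs two screenshots and checks whether the pixel difference is large enough (used to cut down on the images sent to the model)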
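The screen-change check can be illustrated with a dependency-free sketch. The repository works on OpenCV images. Here a grayscale image is simplified to nested lists, and both thresholds are assumed values rather than the ones used in the code.

```python
from typing import List

Gray = List[List[int]]  # grayscale image as rows of 0-255 integers (simplification)


def changed_ratio(before: Gray, after: Gray, pixel_delta: int = 30) -> float:
    """Fraction of pixels whose intensity changed by more than pixel_delta."""
    total = changed = 0
    for row_before, row_after in zip(before, after):
        for b, a in zip(row_before, row_after):
            total += 1
            if abs(b - a) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0


def detect_acted(before: Gray, after: Gray, min_ratio: float = 0.01) -> bool:
    """Treat the action as having had an effect if enough pixels changed."""
    return changed_ratio(before, after) >= min_ratio


blank = [[0] * 10 for _ in range(10)]
clicked = [row[:] for row in blank]
clicked[0][0] = clicked[0][1] = 255  # two pixels light up after the click
```

Counting changed pixels rather than summing raw differences keeps a small animated element from being mistaken for a real screen transition, as long as `min_ratio` is set above its footprint.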

Detector

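The Detector class is the agent's visual understanding layer. It extracts and identifies interactive objects from screenshots, combining SAM (Segment Anything Model) for object segmentation with CLIP for image encoding.

Segmentation
- extract_objects extracts the objects in the scene, filters and deduplicates them, and returns for each object:
  - Bounding box/area, hash, center, image
- update_objects extracts objects and deduplicates them against previously extracted ones (it tries to merge objects)
Encoding:
- Uses the ViT-B/32 model. The clip library can load multimodal models of this kind and provides encode_text/encode_image interfaces that produce an embedding for an image or a piece of text
- encode_text
- encode_image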

LongMemory

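The LongMemory class is the agent's long-term memory. It uses a SQLite database to persist game states, objects, skills, and skill clusters, so the agent can learn and reuse knowledge across sessions.
It is not the conversational memory we usually have in mind. It is closer to a knowledge base
Three groups of interfaces:

Object, which handles binary data such as stored images
- get_by_id
- update_object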
States
- save_state
- get_state
Skill/SkillCluster
- get_skill_cluster_by_id/save_skill_cluster
- get_skill_by_id/save_skill/delete_skill
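A minimal sketch of the skill part of such a store, using Python's built-in `sqlite3` module. The table schema, the JSON encoding of operations, and the `SkillStore` class name are assumptions for illustration. Only the three method names mirror the interface listed above.

```python
import json
import sqlite3
from typing import List, Optional


class SkillStore:
    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" keeps the demo self-contained; a file path would persist across sessions.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS skills ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "name TEXT NOT NULL, "
            "operations TEXT NOT NULL, "
            "fitness REAL DEFAULT 0)"
        )

    def save_skill(self, name: str, operations: List[list], fitness: float = 0.0) -> int:
        """Insert a skill and return its row id; operations are stored as JSON text."""
        cur = self.db.execute(
            "INSERT INTO skills (name, operations, fitness) VALUES (?, ?, ?)",
            (name, json.dumps(operations), fitness),
        )
        self.db.commit()
        return cur.lastrowid

    def get_skill_by_id(self, skill_id: int) -> Optional[dict]:
        row = self.db.execute(
            "SELECT id, name, operations, fitness FROM skills WHERE id = ?",
            (skill_id,),
        ).fetchone()
        if row is None:
            return None
        return {"id": row[0], "name": row[1],
                "operations": json.loads(row[2]), "fitness": row[3]}

    def delete_skill(self, skill_id: int) -> None:
        self.db.execute("DELETE FROM skills WHERE id = ?", (skill_id,))
        self.db.commit()


store = SkillStore()
skill_id = store.save_skill("attack_enemy", [["click", 3], ["click", 7]])
loaded = store.get_skill_by_id(skill_id)
```

Binary payloads such as object images would go into a BLOB column in the same way, which is why the notes describe the Object interface as handling binary data.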

PreKnowledge

Task-specific knowledge injected through the prompt
For example, for Slay the Spire:

You are an AI assistant playing the deck-building roguelike game **Slay the Spire**. Below is a summary of the game rules and controls you must know:

Game Overview:
- You control a hero who climbs a spire by defeating enemies in turn-based card battles.
- In each battle, you have **energy** (default 3 per turn) to play cards.
- Cards can be **Attacks**, **Skills**, **Powers**, or **Curses**.
- The goal is to reduce the enemy's HP to 0 while surviving.

Card Types:
- **Attack**: Deals damage to the enemy.
- **Skill**: Provides defense (block), buffs, or utility.
- **Power**: Applies a passive effect for the rest of the battle.
- **Curse/Status**: Unplayable or harmful cards.

Turn System:
- Each turn, you draw 5 cards.
- You can play cards as long as you have energy.
- Enemies show **intents** (e.g. attack, buff, block) above their heads.

Combat Strategy Basics:
- **Block** mitigates damage but disappears at the end of your turn.
- Use **energy efficiently** — don't waste points.
- Prioritize removing enemies with high damage output or debuffs.
- Watch for **vulnerable**, **weak**, and **frail** effects (common debuffs).

Controls (UI-based):
- Click on cards to play them (if you have enough energy).
- Drag cards to enemies or yourself depending on the target.
- Click the “End Turn” button to end your turn.
- Hover or click on enemy intent icons to see what they plan to do.

Goals for AI:
- Analyze visible cards, energy, and enemy intents.
- Decide the best action: which cards to play, which enemies to target.
- Consider card cost, effects, and current HP/block values.

Main

Before main there is brain, which is more central. Since it belongs to the main driving flow, it is covered together with main

run_step

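get_observation captures the current screenshot and its embedding
Look up historical states in long memory that are similar to the current screen state
If none is similar, initialize a new state containing:
- state_feature, the embedding of the screenshot
- object_id
- an empty mcts and an empty skill cluster
Next, the model selects a skill cluster
Then a skill is selected from that skill cluster
The selection is either explore (discover a new skill) or exploit (execute an existing skill).
- If executing a skill fails, the step is retried. To avoid retrying forever, the failed skill is recorded and not selected again.
  - It also looks like all skills under the skill cluster get executed once here
- Explore
  - Generate a new skill through skill augmentation and save it
    - First execute all operations under the current mcts node, then update the object ids on screen
    - Then generate potential operations: a click on every object
    - Filter out operations that already exist among the children of the current mcts node
    - What remains are unexplored operations. One is picked at random, or a human helps pick one
    - Then do_operation runs it and checks whether the new operation changes the screen. If nothing changes, exploration fails
    - If the screen changes, a new skill is generated. If the state feature also changed, node.fixed is set and skill generation exits.
  - The generated skills are clustered by the LLM into skill clusters
- Exploit
  - The selected skill contains several operations. For each operation
    - First do grounding, mapping the target object id to screen coordinates, then execute the operation
  - After execution, diff the screen before and after. If the difference is too small, the execution failed and failure is returned
  - Otherwise skill evaluation starts, via brain.skill_evaluate
    - This is done by the LLM
  - If the effect matches the skill's description, skill_fitness increases. If the skill advances the task, skill_fitness also increases
  - Save the skill's fitness and evaluate num, and return continue to move on to the next skill
Finally there is a skill evolution step that uses a threshold to remove skills that have been executed many times but have low fitness
- For example, a skill that never does anything when clicked, or that never advances the task, gets deleted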
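The fitness bookkeeping and the final pruning step can be sketched as below. The equal weighting of the two booleans, the `min_evals` and `min_avg_fitness` thresholds, and the `SkillRecord` fields are assumptions. The notes only say that a threshold removes skills that were executed often but kept a low fitness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillRecord:
    name: str
    fitness: float = 0.0
    num_evals: int = 0


def record_evaluation(skill: SkillRecord, is_consistent: bool, is_progressive: bool) -> None:
    """Add one point per satisfied criterion and count the evaluation."""
    skill.fitness += int(is_consistent) + int(is_progressive)
    skill.num_evals += 1


def prune(skills: List[SkillRecord], min_evals: int = 5,
          min_avg_fitness: float = 0.5) -> List[SkillRecord]:
    """Drop skills that were tried often enough and still average a low fitness."""
    kept = []
    for skill in skills:
        tried_enough = skill.num_evals >= min_evals
        if tried_enough and skill.fitness / skill.num_evals < min_avg_fitness:
            continue
        kept.append(skill)
    return kept


attack = SkillRecord("attack_enemy")
toggle_menu = SkillRecord("toggle_menu")
fresh = SkillRecord("fresh_skill")
for _ in range(6):
    record_evaluation(attack, True, True)
    record_evaluation(toggle_menu, False, False)
record_evaluation(fresh, False, False)

survivors = [s.name for s in prune([attack, toggle_menu, fresh])]
```

Requiring a minimum number of evaluations before pruning gives a newly generated skill a few chances before a single bad outcome can remove it.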

Prompt

Now let's look at the prompts for some core tasks

Generate skill: generates a skill description from operations

```text
Objectives:
- Analyze the recent UI-level operations you performed.

Inputs:
1. A list of UI-level operations: '{operations}'
2. Screenshots of the game captured during these operations.

Instructions:
1. Determine which UI elements were interacted with, based on the coordinates provided.
2. Analyze how the game screen changed during the operations.
3. Give this sequence of operations a meaningful **name** and a **description** that includes:
- What the operation does (its function)
- Any precautions to take
4. The name must be highly relevant to the actual operations performed.
5. If the operations are **meaningless**, use the `no_meaning_skill` tool to report it.
6. If the operations are **meaningful**, use the `save_skill` tool to save the result.
7. Think step by step before making a decision.
```

Cluster skill: clusters similar skills into skill clusters

```text
Background:
You are an assistant for grouping similar skills.
Input: a JSON array called “new_skills”.

Task: Identify skills that are nearly identical in meaning or function, even if their expressions differ.
• Group together skills that essentially perform the same task or behavior, even if worded differently.
• Do not group skills that express different functions or intentions, even if they appear related.
• Think of this as merging duplicates or near-duplicates, not broad semantic clustering.

Output must strictly call the function “cluster_skills” with no extra text.

Here is the list of new_skills (id, name, description):
{json.dumps(skills, indent=2)}

Please:
1. Identify cluster of functionally equivalent skills.
2. For each cluster, select a representative “action_name” and “action_description”.
3. List its members as an array of action ids.
```

Merge skill cluster: merges skills into existing clusters

```text
Background:
You are an assistant that merges and clusters skills in one call.
Input:
• existing_skill_clusters: clusters with cluster_id, name, description, members
• new_skills: raw skills with id, name, description
Instruction:
1) Cluster new_skills among themselves by semantic similarity.
2) For each resulting new cluster:
- If it matches an existing cluster, merge into it (reuse that cluster_id).
- Otherwise, assign cluster_id = -1.
3) Each output cluster must include:
- cluster_id
- name & description (representative)
- members: combined list of all action IDs.
Output:
Exactly one function_call to "merge_skills", no extra text.

existing_skill_clusters:
{json.dumps(existing_skill_clusters, indent=2)}

new_skills:
{json.dumps(new_skills, indent=2)}

Please perform the merge as specified.
Return the merged list under the key "clusters".
```

Select skill: in practice this selects a skill cluster

```text
Objectives:
- Select the best skill from the provided skill list.

Inputs:
1. A list of learned skills.
2. A screenshot of the current game state.

Skills:
{skills}

Instructions:
1. Analyze the screenshot to understand the current state of the game.
2. For each skill, assess:
- Whether the execution conditions are currently satisfied.
- Whether it is functionally suitable for current state.
3. First, eliminate any skills that are not currently executable.
4. Then, among the remaining executable skills, choose the one most appropriate.
5. Must return your selection using the function tool provided — do NOT respond with text.
6. Think step by step before making your final decision.
```

Skill evaluate: evaluates whether a skill is progressive and whether its effect is consistent

```text
Task:
- '{task}'

Inputs:
1. Action Information
Name and description of the action:
{skill_info}

2. Screenshots
A pair of images showing the game state before and after the action.

Instructions:

Step 1: Expected Change
- Based on the action description, summarize what is expected to change in the game state if the action works as intended.

Step 2: Actual Change
- Analyze the differences between the before and after screenshots.
- Describe all observable UI changes (e.g., elements, values, progress bars, icons).

Step 3: Consistency Check → `is_consistent`
- If the actual changes match the expected outcome (in type and magnitude), set `is_consistent = true`; otherwise, `false`.

Step 4: Progress Indicators
- Identify at least three concrete, task-relevant indicators that suggest progress toward the task goal.

Step 5: Progress Check → `is_progressive`
- Evaluate whether any of the indicators were clearly met.
- If any indicator is met meaningfully, set `is_progressive = true`.
- If none are met, or if changes are negligible/irrelevant, set `is_progressive = false`.
- If the action is only preparatory (e.g., hovering, previewing, selecting without effect), set `is_progressive = false`.

Step 6: Default to False
- If information is incomplete, unclear, or ambiguous, default both booleans to `false`.

Remember:
→ Do not guess.
→ Only return the `action_reflex` function call with the evaluated booleans.
```

BottomUpAgent