Mixture of Agents · MoA · Hermes Agent

一句講晒TL;DR

簡單講 MoA（Mixture of Agents） MoA in 30 seconds

核心概念：同時叫多個 LLM 回答同一條問題，再搵一個更強嘅模型將啲答案整合成最終結果。 The core idea: call multiple LLMs on the same question in parallel, then have a stronger model aggregate the best parts into a single final answer.


               問題 ──┬→ Claude Sonnet ──┐
                       ├→ GPT-4o ────────┤──→ Aggregator（最強模型）→ 最終答案
                       ├→ Gemini Pro ────┤
                       └→ DeepSeek ──────┘

                Query ─┬→ Claude Sonnet ──┐
                       ├→ GPT-4o ────────┤──→ Aggregator (strongest model) → Final answer
                       ├→ Gemini Pro ────┤
                       └→ DeepSeek ──────┘

點解有用？Why it works

每個模型有盲點，多個模型互補Each model has blind spots; multiple models complement each other
Aggregator 可以揀最好的部分合併The aggregator picks and merges the best parts
複雜推理、研究分析、需要多元觀點時效果好Best for complex reasoning, research, and multi-perspective tasks

三種實作方式Three approaches

Shell script — hermes chat -q -m <model> 平行 call 多個模型parallel model calls
delegate_task — Python spawn 多個 subagent，每個用不同模型spawn subagents, one per model
Kanban / cron — 最穩固，適合正式環境most robust, production-ready

核心概念Core concepts

為什麼 MoA 比單一模型更強？ Why MoA beats any single model

三個原理：多樣性、互補性、集體智慧。 Three principles: diversity, complementarity, and the wisdom of the crowd.

① 多樣性① Diversity

不同模型架構、訓練資料、對齊方式，產生截然不同的「思維路徑」。 Different architectures, training data, and alignment produce genuinely different reasoning paths.

② 互補性② Complementarity

模型 A 卡住的地方，模型 B 剛好擅長。互補讓弱點被覆蓋。 Where model A gets stuck, model B often shines. Weaknesses cancel; strengths compound.

③ 聚合優於個體③ Aggregation

由最強的「聚合者」讀完全部答案，挑出最精華的部分。 A strong aggregator reads every answer and selects the most useful signal from each.

④ 權衡④ Trade-offs

成本約 N+1 倍、延遲上升。適合複雜任務，不適合簡單問答。 ~N+1× the cost and added latency. Worth it for hard tasks, overkill for trivial ones.

~65%

AlpacaEval 2.0 提升AlpacaEval 2.0 lift

3–5

提議者甜蜜點Sweet-spot proposers

N+1×

相對成本Relative cost

1×

最慢者 + 聚合者延遲Slowest + aggregator

流程圖Pipeline

資料如何在 MoA 中流動 How data flows through an MoA

從一個問題出發，經過平行提議者，最後由聚合者收束為單一答案。 A single question fans out to parallel proposers, then converges to one synthesized answer.

輸入Input

U

使用者User

提出問題Asks the question

提議者層（平行）Proposer layer (parallel)

C

Claude Sonnet

anthropic/claude-sonnet-4

G

GPT-4o

openai/gpt-4o

M

Gemini Pro

google/gemini-2.5-pro

D

DeepSeek

deepseek/deepseek-chat

輸出Output

A

聚合者Aggregator

anthropic/claude-sonnet-4

★

最終答案Final answer

整合後的輸出Synthesized output

實作方式Implementations

在 Hermes 中建立 MoA 的三種方法 Three ways to build MoA in Hermes

從最快到最穩固，依需求選一個。 From fastest to most durable — pick what fits your workload.

moa.sh

#!/bin/bash
# ~/bin/moa.sh — Mixture of Agents pipeline
PROMPT="$1"
PROPOSERS=("anthropic/claude-sonnet-4" "openai/gpt-4o" "google/gemini-2.5-pro")
AGGREGATOR="anthropic/claude-sonnet-4"

# Step 1: collect proposer responses in parallel
for m in "${PROPOSERS[@]}"; do
  hermes chat -q "$PROMPT" -m "$m" -Q >> /tmp/moa_proposers.txt
  echo "---" >> /tmp/moa_proposers.txt
done

# Step 2: aggregator synthesizes
AGG_PROMPT="You are aggregating multiple LLM responses.
Here are the responses:

$(cat /tmp/moa_proposers.txt)

Original question: $PROMPT

Synthesize a single, high-quality answer that takes the best from each:"
hermes chat -q "$AGG_PROMPT" -m "$AGGREGATOR"

moa.py

from hermes_tools import delegate_task


def moa(prompt: str, proposers: list[str], aggregator: str) -> str:
    # Fan out to N proposers in parallel
    tasks = [
        {"goal": prompt,
         "context": f"Reply directly and concisely. Model: {m}."}
        for m in proposers
    ]
    results = delegate_task(tasks=tasks)
    combined = "\n\n---\n\n".join(r["summary"] for r in results)

    # Aggregate
    synth = f"Synthesize the best answer from {len(results)} responses to: {prompt}\n\n{combined}"
    return delegate_task(goal=synth,
                          context=f"You are the aggregator. {synth}")


if __name__ == "__main__":
    print(moa(
        prompt="Explain quantum entanglement to a high-schooler",
        proposers=["anthropic/claude-sonnet-4", "openai/gpt-4o", "google/gemini-2.5-pro"],
        aggregator="anthropic/claude-sonnet-4",
    ))

terminal · bash

# 1. One profile per proposer model
hermes profile create sonnet  --model anthropic/claude-sonnet-4
hermes profile create gpt     --model openai/gpt-4o
hermes profile create gemini  --model google/gemini-2.5-pro

# 2. Initialize a Kanban board for the MoA workflow
hermes kanban init moa-board

# 3. Create one task per proposer; assign a profile
hermes kanban create --board moa-board \
    --title "propose: claude" --profile sonnet
hermes kanban create --board moa-board \
    --title "propose: gpt"    --profile gpt
hermes kanban create --board moa-board \
    --title "propose: gemini" --profile gemini

# 4. Aggregate once all proposers report done
hermes kanban create --board moa-board \
    --title "aggregate: final answer" --profile sonnet \
    --depends-on "propose: claude,propose: gpt,propose: gemini"

如何挑選模型How to pick your models

提議者要「多元」而非「最強」。混搭不同供應商（Anthropic / OpenAI / Google / DeepSeek / 本地）效果最好。聚合者則應該是你能調用的最強模型，因為它要負責分辨噪音與訊號。 Pick proposers for diversity, not raw strength — mix vendors (Anthropic / OpenAI / Google / DeepSeek / local). The aggregator should be your strongest available model: its job is to separate signal from noise.

FAQ

常見問題Common questions

MoA 跟普通的多模型投票（例如 majority vote）有什麼不同？ How is MoA different from simple majority voting across models?

多數決只適用於有單一正確答案的任務（分類、選擇題）。MoA 是生成式的：聚合者讀完整段文字後，根據語意挑出最好的部分。對於開放式問答、寫作、研究分析，MoA 通常明顯優於多數決。 Voting only works for tasks with a single correct answer (classification, MCQ). MoA is generative: the aggregator reads entire responses and cherry-picks the best parts. For open-ended Q&A, writing, and research, MoA usually wins handily.

應該用幾個提議者？ How many proposers should I use?

原始 Together MoA 論文用 3 層、每層 6 個模型。實務上，3–5 個異質模型是甜蜜點。少於 3 個，多樣性不足；多於 6 個，成本急劇上升但品質提升遞減。 The original Together MoA paper used 3 layers of 6 models. In practice, 3–5 heterogeneous models is the sweet spot. Below 3, diversity is thin; above 6, costs balloon while gains diminish.

聚合者本身很強時，MoA 還有幫助嗎？ When the aggregator is already frontier-tier, does MoA still help?

邊際效益會遞減。當聚合者直接呼叫就能得到 95% 好的答案時，MoA 帶來的提升可能只有 2–3%。MoA 的真正價值在於取得多元觀點，而不是把一個好答案變成更好的答案。對於需要創意、辯證、或跨領域整合的任務，MoA 的相對優勢最大。 The marginal benefit shrinks. If calling the aggregator directly already gives you a 95%-good answer, MoA might only add 2–3%. MoA's real value is surfacing diverse perspectives, not polishing a single good answer. It shines brightest on creative, dialectical, or cross-domain synthesis tasks.

延遲會不會很高？ What's the latency impact?

平行執行時，總延遲 ≈ 最慢的提議者 + 聚合者，不是 N 倍疊加。實測一個 4 提議者 + 1 聚合者的 MoA，端到端約 8–15 秒。對即時對話不適合，但對研究、批次分析很剛好。 With parallel execution, total latency ≈ slowest proposer + aggregator, not N× stacked. Empirically, a 4-proposer + 1-aggregator MoA finishes in ~8–15s end-to-end. Too slow for live chat, ideal for research and batch analysis.

可以用本地模型（例如 Ollama）嗎？ Can I use local models (e.g. Ollama)?

可以，而且很建議。混合「雲端強模型 + 本地快模型」可以兼顧品質、成本與隱私。在 Hermes 中，只要在 config.yaml 設定自訂 base_url 即可掛上 Ollama 或其他 OpenAI 相容的本地端點。 Yes — and it's recommended. Mix a strong cloud model with fast local ones to balance quality, cost, and privacy. In Hermes, just point a custom base_url in config.yaml at Ollama or any OpenAI-compatible local endpoint.