LinuxWaylandBackend for Hermes Agent Computer Use

Linux Computer Use: Hermes Desktop Control for Wayland/Plasma

A Linux backend for the Hermes Agent computer_use toolset that enables desktop control on Wayland systems (specifically CachyOS / KDE Plasma), running in the background without hijacking the cursor or stealing focus.

Motivation

Hermes Agent v0.14.0 introduced computer_use as a platform-independent tool: the schema is generic OpenAI function calling, so any tool-capable provider can use it. The only existing backend is CuaDriverBackend (macOS-only, via trycua/cua's cua-driver), which leaves Linux desktop users without support.

This project implements LinuxWaylandBackend(ComputerUseBackend) for Wayland + KDE Plasma, with a focus on:

Background control without stealing focus (via ydotool + KWin API)
SOM (Set-of-Marks): numbered overlays via the at-spi2 accessibility tree
Read-only safety: capture/wait/list are always allowed, destructive actions go through approval
Hard-blocked key combos and type patterns (same safety as the macOS backend)

Architecture

Hermes' computer use is organized in two layers around an abstract base class (ABC):

tools/computer_use/
├── backend.py          # ComputerUseBackend ABC + CaptureResult, ActionResult, UIElement
├── schema.py           # COMPUTER_USE_SCHEMA (tool schema, model-agnostic)
├── tool.py             # Dispatch logic + approval + safety + backend selection
├── cua_backend.py      # CuaDriverBackend (existing, macOS)
├── linux_backend.py    # LinuxWaylandBackend (NEW, this project)
└── vision_routing.py   # auxiliary.vision routing

The backend selector in tool.py (lines 129-142) picks the backend from the HERMES_COMPUTER_USE_BACKEND environment variable:

"cua" → CuaDriverBackend (macOS, default)
"linux" → LinuxWaylandBackend
"noop" → test stub

On Linux, the selector switches to linux automatically.
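
The selection rule above can be sketched as a small pure function. This is only an illustration, not the code in tool.py: the name select_backend_name and the way unknown values are rejected are assumptions made for this example.

```python
import os
import sys
from typing import Mapping, Optional

VALID_BACKENDS = ("cua", "linux", "noop")


def select_backend_name(env: Optional[Mapping[str, str]] = None,
                        platform: Optional[str] = None) -> str:
    """Return the backend key for the given environment and platform.

    Hypothetical helper: an explicit HERMES_COMPUTER_USE_BACKEND wins,
    otherwise Linux hosts get "linux" and everything else gets "cua".
    """
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    explicit = env.get("HERMES_COMPUTER_USE_BACKEND", "").strip().lower()
    if explicit:
        if explicit not in VALID_BACKENDS:
            raise ValueError(f"unknown backend {explicit!r}")
        return explicit
    return "linux" if platform.startswith("linux") else "cua"
```

Passing env and platform as arguments keeps the rule testable without touching the real process environment.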

Tool stack for Wayland/Plasma

Backend method	Linux tool	Description
`capture(mode='vision')`	`grim`	Screenshot → PNG → base64
`capture(mode='som')`	`grim` + `at-spi2`	Screenshot + numbered element overlays via the accessibility tree
`capture(mode='ax')`	`at-spi2`	Accessibility tree only (no image), for text-only models
`click` / `double_click` / `right_click`	`ydotool` (uinput)	Mouse events in the background
`drag`	`ydotool`	mousemove + click + release
`scroll`	`ydotool`	Scroll wheel events
`type_text`	`ydotool type`	Text input
`key`	`ydotool key`	Key combos (ctrl+s, alt+tab, ...)
`list_apps`	`kdotool` / KWin D-Bus	Running windows with PID, title, ID
`focus_app`	`kdotool` / KWin	Route input, optionally raise
`set_value`	at-spi2 D-Bus	Set AX values directly (dropdown, slider) without stealing focus
`wait`	`time.sleep`	Inherited from the ABC

Dependencies

Runtime (Arch/CachyOS)

sudo pacman -S grim ydotool kdotool at-spi2-core python-pygobject

grim: Wayland-native screenshots (PNG, base64-encoded by the backend)
ydotool: uinput-based input automation (works on Wayland + X11)
kdotool: KDE window management (list, focus, minimize windows)
at-spi2-core + python-pygobject: accessibility tree (element detection)

Python dependencies (add to the Hermes venv)

pip install pygobject       # D-Bus/GLib bindings for at-spi2

Optional

slurp: region selection for grim (for targeted screenshots instead of the full screen)
jq: JSON parsing for tests

Implementation: LinuxWaylandBackend

Skeleton

# tools/computer_use/linux_backend.py

import subprocess, base64, json, re, os, shutil, time
from typing import Optional, List, Dict, Any, Tuple
from tools.computer_use.backend import (
    ComputerUseBackend, CaptureResult, ActionResult, UIElement
)

class LinuxWaylandBackend(ComputerUseBackend):
    """Wayland/Plasma Desktop Control via ydotool + grim + at-spi2 + kdotool."""

    def __init__(self):
        # Most recent capture, kept as context for later actions
        self._last_capture: Optional[CaptureResult] = None
        self._active_app: Optional[str] = None

    def start(self) -> None:
        pass  # The tools are CLI-based; the backend starts no daemon of its own

    def stop(self) -> None:
        pass

    def is_available(self) -> bool:
        # Checks that the ydotool, grim, and kdotool binaries are on PATH
        return all(shutil.which(cmd) for cmd in ["ydotool", "grim", "kdotool"])
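
A slightly stricter availability check could also look at the session type, as the open items at the end of this README suggest. The following is a minimal sketch under that assumption; backend_available, its parameters, and the exact tool list are hypothetical and not taken from linux_backend.py.

```python
import os
import shutil
from typing import Callable, Mapping, Optional, Sequence

REQUIRED_TOOLS = ("grim", "ydotool", "kdotool")


def backend_available(env: Optional[Mapping[str, str]] = None,
                      which: Callable[[str], Optional[str]] = shutil.which,
                      tools: Sequence[str] = REQUIRED_TOOLS) -> bool:
    """True only in a Wayland session with every required binary on PATH."""
    env = os.environ if env is None else env
    if env.get("XDG_SESSION_TYPE", "").lower() != "wayland":
        return False  # tty or X11 session: this backend does not apply
    return all(which(tool) is not None for tool in tools)
```

Injecting env and which makes the check easy to exercise in tests without a real Wayland session.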

Capture (grim + at-spi2)

def capture(self, mode: str = "som", app: Optional[str] = None) -> CaptureResult:
    # 1. Screenshot via grim → PNG → base64
    # 2. Read width/height from the PNG IHDR header (_png_dimensions)
    # 3. App filter via kdotool / KWin D-Bus
    # 4. SOM: at-spi2 accessibility tree via _iter_interactive_nodes()
    # 5. self._last_capture = result (used to resolve SOM indices in click/drag/scroll)
    ...
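
Step 2 relies on the fixed layout of a PNG file: an 8-byte signature, then the IHDR chunk whose first two fields are width and height as big-endian 32-bit integers. A minimal sketch of such a reader follows; the function name png_dimensions and its error handling are assumptions for illustration and may differ from the project's _png_dimensions.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG byte string."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG (bad signature or truncated header)")
    # Bytes 8-11 hold the chunk length, bytes 12-15 the chunk type.
    if data[12:16] != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    # Width and height follow as two big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

Only the first 24 bytes are inspected, so the full screenshot never has to be decoded.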

at-spi2 Accessibility Tree

_INTERACTIVE_ROLES = frozenset({  # module level
    "push button", "toggle button", "button",
    "text", "password text", "multi line text",
    "combo box", "drop down", "pop up menu",
    "slider", "spin button", "scroll bar",
    "check box", "radio button", "switch",
    "tree table", "table", "table cell",
    "list", "list item",
    "link", "menu item",
    "page tab", "page tab list",
    "icon",
})

def _is_interactive(role_name: str) -> bool:
    return role_name in _INTERACTIVE_ROLES

def _iter_interactive_nodes_from(node, app_name="", target_app=None, depth=0):
    """Generator: yields (index, node, app) for interactive elements.
    Depth-first preorder, same order as _get_ax_elements.
    The app filter only suppresses collection, not recursion."""
    ...

def _iter_interactive_nodes(target_app=None):
    """Top level: iterates over the entire Atspi desktop.
    Yields nothing if pygobject/Atspi is unavailable."""
    ...
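
The walker contract described in the docstrings (preorder traversal, filter applied to collection only) can be illustrated on a fake tree. Everything here is a stand-in: FakeNode, the reduced role set, the "application" role check, and the 1-based numbering are assumptions for this sketch, not the Atspi API or the project's actual indexing.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

INTERACTIVE = frozenset({"push button", "slider", "combo box"})


@dataclass
class FakeNode:
    role: str
    name: str = ""
    children: List["FakeNode"] = field(default_factory=list)


def walk_interactive(root: FakeNode, target_app: Optional[str] = None
                     ) -> Iterator[Tuple[int, FakeNode, str]]:
    """Yield (index, node, app) in depth-first preorder."""
    counter = 0

    def visit(node: FakeNode, app_name: str):
        nonlocal counter
        if node.role == "application":
            app_name = node.name
        wanted = target_app is None or app_name == target_app
        if node.role in INTERACTIVE and wanted:
            counter += 1
            yield counter, node, app_name
        # Always descend: the filter suppresses collection, not recursion.
        for child in node.children:
            yield from visit(child, app_name)

    yield from visit(root, "")
```

Because capture and set_value would both use the same walker with the same filter, an index shown to the model resolves to the same node later.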

Input (ydotool)

def click(self, *, element=None, x=None, y=None, button="left",
          click_count=1, modifiers=None) -> ActionResult:
    if element is None and x is None and y is None:
        return ActionResult(ok=False, action="click",
                            message="No target — pass element or x/y")
    try:
        cx, cy = self._get_element_coords(element, x, y)
    except ValueError as exc:
        return ActionResult(ok=False, action="click", message=str(exc))
    # ydotool: mousemove --absolute + click N times
    ...

def _get_element_coords(self, element=None, x=None, y=None) -> Tuple[int, int]:
    """Resolves a SOM index via _last_capture.elements (UIElement.center()).
    Raises ValueError if there is no capture or the index is not present."""
    if element is not None:
        if self._last_capture is None:
            raise ValueError(f"element #{element} requested but no prior capture()")
        ui = next((e for e in self._last_capture.elements if e.index == element), None)
        if ui is None:
            raise ValueError(f"element #{element} not in last capture")
        return ui.center()
    return x or 0, y or 0

def type_text(self, text: str) -> ActionResult:
    subprocess.run(["ydotool", "type", text], timeout=30.0, ...)
    return ActionResult(ok=True, action="type")

def key(self, keys: str) -> ActionResult:
    keys_normalized = self._keys_to_ydotool_args(keys)  # ctrl+s → ["ctrl", "s"]
    subprocess.run(["ydotool", "key"] + keys_normalized, ...)
    return ActionResult(ok=True, action="key")
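
The comment on _keys_to_ydotool_args implies that a combo string is split into individual key names. A minimal sketch of that normalization follows; split_key_combo and the alias table are hypothetical. Some ydotool releases expect numeric keycode:state pairs rather than key names, so a real implementation may need a further mapping step that this sketch does not cover.

```python
from typing import List

# Hypothetical aliases so that common spellings map to one canonical name.
_ALIASES = {
    "control": "ctrl",
    "cmd": "super",
    "win": "super",
    "option": "alt",
    "return": "enter",
    "esc": "escape",
}


def split_key_combo(keys: str) -> List[str]:
    """Split a combo like "ctrl+s" into normalized lower-case key names."""
    parts = [part.strip().lower() for part in keys.split("+")]
    if any(not part for part in parts):
        # Covers the empty string and stray separators such as "ctrl+".
        raise ValueError(f"malformed key combo: {keys!r}")
    return [_ALIASES.get(part, part) for part in parts]
```

A literal plus key cannot be expressed in this simple form and would need its own name, for example "plus".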

Window management (kdotool / KWin D-Bus)

def list_apps(self) -> List[Dict[str, Any]]:
    """kdotool search --all-desktops + kdotool getwindowgeometry"""
    result = subprocess.run(["kdotool", "search", "--all-desktops", ""],
                            capture_output=True, text=True)
    windows = result.stdout.strip().split()
    apps = []
    for wid in windows:
        name = subprocess.run(["kdotool", "getwindowname", wid],
                              capture_output=True, text=True).stdout.strip()
        geo = subprocess.run(["kdotool", "getwindowgeometry", wid],
                             capture_output=True, text=True).stdout
        # Parse the PID from /proc or KWin D-Bus
        apps.append({"id": wid, "name": name, "geometry": geo, ...})
    return apps

def focus_app(self, app: str, raise_window: bool = False) -> ActionResult:
    result = subprocess.run(["kdotool", "search", "--all-desktops", app],
                            capture_output=True, text=True)
    windows = result.stdout.strip().split()
    if not windows:
        return ActionResult(ok=False, action="focus_app",
                            message=f"no window matching {app!r}")
    cmd = "activate" if raise_window else "setwindow"
    subprocess.run(["kdotool", cmd, windows[0]])
    self._active_app = app
    return ActionResult(ok=True, action="focus_app")
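
list_apps stores the raw getwindowgeometry output. Turning it into numbers could look like the sketch below, which assumes an xdotool-style report with "Position: X,Y" and "Geometry: WxH" lines; that output format, the function name parse_geometry, and the returned keys are assumptions and should be checked against the installed kdotool version.

```python
import re
from typing import Dict, Optional

_POSITION = re.compile(r"Position:\s*(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)")
_SIZE = re.compile(r"Geometry:\s*(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)")


def parse_geometry(text: str) -> Optional[Dict[str, int]]:
    """Extract x, y, width, height from a geometry report, or None."""
    pos = _POSITION.search(text)
    size = _SIZE.search(text)
    if pos is None or size is None:
        return None
    # Values may be printed as floats; truncate to whole pixels.
    x, y = (int(float(v)) for v in pos.groups())
    width, height = (int(float(v)) for v in size.groups())
    return {"x": x, "y": y, "width": width, "height": height}
```

Returning None instead of raising lets the caller keep a window in the list even when its geometry cannot be read.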

set_value (at-spi2 Accessibility)

def set_value(self, value: str, element: Optional[int] = None) -> ActionResult:
    """Sets AX values directly, e.g. a slider value without a mouse click or
    a combo box selection without opening the menu (no focus stealing).
    Walks via _iter_interactive_nodes(), in the same index order as
    _get_ax_elements, so the index shown by capture() matches.
    1) Value interface (slider/spin button): currentValue = float(value)
    2) Action interface (combo box/popup): doAction(0) on the matching child
"""

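The two strategies in the set_value docstring can be sketched against fake nodes, in the spirit of the FakeNode/FakeValue helpers mentioned in the test section. The dataclass fields and the function apply_value are hypothetical stand-ins; the real backend would go through the at-spi2 Value and Action interfaces instead.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeNode:
    role: str
    name: str = ""
    value: Optional[float] = None   # stands in for the Value interface
    children: List["FakeNode"] = field(default_factory=list)
    activated: bool = False         # set when the stand-in action fires


def apply_value(node: FakeNode, value: str) -> bool:
    """Try the Value strategy first, then the Action strategy."""
    if node.role in {"slider", "spin button"}:
        try:
            node.value = float(value)
        except ValueError:
            return False  # non-numeric input for a numeric widget
        return True
    if node.role in {"combo box", "pop up menu"}:
        for child in node.children:
            if child.name == value:
                child.activated = True  # real code would trigger action 0
                return True
    return False
```

Roles outside both groups, such as a plain button, fall through and report failure rather than guessing.
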
Safety & Security

The existing safety guards from tool.py apply automatically:

Layer	Effect
Hard-Blocked Key Combos	`_BLOCKED_KEY_COMBOS`: empty trash, force delete, lock screen, logout
Blocked Type Patterns	`_BLOCKED_TYPE_PATTERNS` — `curl
Approval Callback	Destructive actions (click, type, key, drag, scroll, focus_app, set_value) → CLI approval dialog
Read-Only Free	capture, wait, list_apps: always allowed
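
The split between read-only and destructive actions in the table can be sketched as a simple classifier. The action names come from the table; the function needs_approval and the choice to require approval for unknown actions are assumptions for this example, not the logic in tool.py.

```python
READ_ONLY_ACTIONS = frozenset({"capture", "wait", "list_apps"})
DESTRUCTIVE_ACTIONS = frozenset({
    "click", "type", "key", "drag", "scroll", "focus_app", "set_value",
})


def needs_approval(action: str) -> bool:
    """Return True when an action should go through the approval callback."""
    if action in READ_ONLY_ACTIONS:
        return False
    # Destructive and unrecognized actions both require approval, so a
    # newly added action is gated until it is classified explicitly.
    return True
```

Failing closed on unknown names is a deliberate choice in this sketch: forgetting to classify an action costs a prompt, not a safety gap.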

Backend selection (patch via patch_tool.py)

The project ships an idempotent patch script, patch_tool.py, that inserts the backend auto-detection into ~/.hermes/hermes-agent/tools/computer_use/tool.py and copies linux_backend.py into the same directory.

# Dry run: shows what would change
python3 patch_tool.py --check

# Apply the patch (creates .bak backups)
python3 patch_tool.py --apply

# Restore from backup
python3 patch_tool.py --revert

The patch replaces the _get_backend() function with a variant that auto-detects Linux (using linux_wayland_backend_available()) and switches to LinuxWaylandBackend, falling back to CuaDriverBackend otherwise. To select the backend manually via the environment variable:

export HERMES_COMPUTER_USE_BACKEND=linux
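
One common way to make such a patch idempotent is to leave a marker in the patched file and skip the rewrite when the marker is already present. The sketch below shows that idea with a .bak backup; the marker string and the functions apply_patch and revert_patch are hypothetical and do not describe how patch_tool.py is actually written.

```python
import shutil
from pathlib import Path

MARKER = "# linux-backend-patch"  # hypothetical marker left in the file


def apply_patch(target: Path, old: str, new: str) -> bool:
    """Replace old with new once; return False if already patched."""
    text = target.read_text(encoding="utf-8")
    if MARKER in text:
        return False  # second run is a no-op
    if old not in text:
        raise ValueError("anchor text not found; refusing to patch")
    # Keep a backup next to the file before touching it.
    shutil.copy2(target, target.with_suffix(target.suffix + ".bak"))
    patched = text.replace(old, f"{new}  {MARKER}", 1)
    target.write_text(patched, encoding="utf-8")
    return True


def revert_patch(target: Path) -> bool:
    """Restore the .bak backup if one exists."""
    backup = target.with_suffix(target.suffix + ".bak")
    if not backup.exists():
        return False
    shutil.move(str(backup), str(target))
    return True
```

Refusing to patch when the anchor text is missing avoids silently writing a half-applied change to an unexpected upstream version.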

Tests

71 tests in tests/ cover:

test_pure.py: _is_linux, _is_wayland, _tool_available, _parse_geometry, _keys_to_ydotool_args, _png_dimensions, GLib import error handler, start() without pygobject
test_backend.py: click/drag/scroll/type_text/key/wait with a subprocess mock, is_available (Wayland/tty/missing tools), list_apps/focus_app (with/without kdotool), text=True mode
test_set_value.py: _iter_interactive_nodes_from walker (preorder + app-filter recursion), set_value on slider/button/combo box, fake Atspi via FakeNode/FakeValue
test_patch_tool.py: idempotent patch cycle (--check/--apply/--revert)

pytest tests/ -v

The tests need neither a Wayland session nor any of the CLI tools: all external calls are intercepted via unittest.mock or the subprocess patch in conftest.py.

Wayland-specific integration tests (real grim/ydotool/kdotool) are still open and require the respective desktop environment.

Open questions / Todo

SOM overlay drawing: text-based via the elements array, no drawn overlays needed
ydotool needs ydotool.service for its daemon (systemd user unit); linux_wayland_backend_available() only checks the binaries, so the user has to set up the daemon
at-spi2: how well does the AX tree walk perform on Electron apps (VS Code, Obsidian)? (Real integration test pending)
KWin D-Bus vs kdotool: fallback for non-KDE Wayland compositors (Hyprland, Sway)?
grim on multi-monitor setups: HERMES_GRIM_OUTPUT selects the output, otherwise the active one is used
Screenshot scaling: _png_dimensions reads physical pixels straight from the IHDR header; ydotool --absolute uses the same coordinate space
is_available(): checks $XDG_SESSION_TYPE == "wayland" + which grim + which ydotool
Permission note for post-setup: ydotool needs the uinput group; this is documentation only, a runtime check is not implemented yet
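
The missing runtime permission check from the last item could start as small as the sketch below. It assumes the device node lives at /dev/uinput and that write access for the current user is the relevant condition; the function name uinput_writable is hypothetical, and the check says nothing about whether the ydotool daemon is actually running.

```python
import os


def uinput_writable(path: str = "/dev/uinput") -> bool:
    """True if the uinput device node exists and is writable by this user."""
    return os.path.exists(path) and os.access(path, os.W_OK)
```

A backend could call this from its availability check and return a hint about group membership instead of failing later on the first input event.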