TLDR
We lay the foundation for a system that lets LLMs observe, understand, and act on the real world [big asterisk here].
Asterisk: "real world" is obviously used in a constrained sense. Here we are mostly concerned with making LLMs behave in a specific tool or platform the way humans behave in it. We're certainly not building robots here.
Over the past month or so we did some experimental work on a non-turn-based, asynchronous LLM harness for creative iteration. Let me write down some high-level findings first and see if I can open-source some code later.
Prerequisite: Harness Is THE Product
A while back Anthropic published theirs: a three-agent pipeline that builds full-stack apps over multi-hour sessions. OpenAI shipped their Agents SDK. There are also think pieces about the pricing model.
OpenCode is a harness. Anthropic's new managed agent is a harness (with some web things). Claude Code and Cursor are kind of harnesses.
At least two observations can be made:
- LLMs are getting good enough to specialize to different use cases via harness engineering alone, without specializing the model itself.
- The harnesses we usually see are almost exclusively task-driven. Give the agent a task. Let it plan. Let it execute. Stretch the context window. Run it for hours. Ship the result.
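To make the task-driven shape concrete, here is a minimal sketch of that loop. It is not taken from any of the harnesses named above: `run_task_driven`, `fake_model`, and the `PLAN:` / `EXECUTE:` prompt prefixes are all made up for illustration, and the "model" is a toy function so the snippet runs without any API.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: prompt in, text out.
Model = Callable[[str], str]


def run_task_driven(task: str, model: Model, max_steps: int = 8) -> List[str]:
    """Plan once, then execute the plan step by step within a step budget."""
    # 1. Give the agent a task and let it plan (one non-empty line per step).
    plan = [line.strip() for line in model(f"PLAN: {task}").splitlines() if line.strip()]

    transcript: List[str] = []
    # 2. Let it execute: strictly sequential, one step at a time.
    for step in plan[:max_steps]:
        # Every step sees everything produced so far, so the context keeps growing.
        context = "\n".join(transcript)
        transcript.append(model(f"EXECUTE: {step}\nCONTEXT:\n{context}"))

    # 3. Ship the result.
    return transcript


def fake_model(prompt: str) -> str:
    """Toy model (an assumption, not a real LLM) so the sketch is runnable."""
    if prompt.startswith("PLAN:"):
        return "draft outline\nwrite code\nrun tests"
    first_line = prompt.splitlines()[0]
    return "done: " + first_line[len("EXECUTE: "):]


result = run_task_driven("build a todo app", fake_model)
print(result)
```

Note what the loop cannot do: nothing outside the plan can reach the agent while it runs, and the agent never acts unless a task was handed to it first.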
Here we want to see if we can make something a little more interesting.