About Windmill
Windmill is an open-source platform to build workflows, data pipelines, internal tools and full-stack applications as code. The product exposes three main primitives - scripts, flows and apps - that run on a Rust worker infrastructure and are composed in a web editor (TypeScript / Svelte).
The ecosystem of LLM-based coding assistants (Claude Code, Cursor, Aider, Devin, SWE-agent, OpenHands, etc.) is evolving rapidly, yet it remains mostly oriented toward non-contextualized source-code editing or toward solving isolated tickets on existing codebases. No general-purpose agent today specializes in producing, end-to-end, functional, tested and deployed workflows in a constrained execution environment like Windmill. Closing that gap is the strategic aim of this mission.
The role
Own Windmill's agentic coding and tool/system-building pipeline end-to-end - from the AI backend (planning, tool use, retrieval, self-correction) to the UX and developer experience that wraps it. The bar: an agent that reliably goes from a natural-language spec to a working, deployed workflow or app - and that developers actually enjoy using.
- Benchmarking: build and maintain the eval harness, task corpus, scoring, and regression tracking. Every prompt / model / tool change is measured.
- Agent loop: design and improve planning, tool use, self-correction, retrieval, execution feedback, multi-file editing, test-driven iteration.
- Integration & DX: own the full surface - UI flows, editor integration, feedback loops, error states - so the experience is polished end-to-end, not just the model calls.
- Prompts & models: systematically optimize prompts; experiment with frontier models (Claude, GPT, Gemini, open-weights); fine-tuning / RL where it pays off.
- Ship to production: everything you build goes live and is used by thousands of developers.
Who we're looking for
- Strong CS fundamentals - algorithms, systems, distributed systems
- Solid programming skills (TypeScript, Rust a plus)
- Deep understanding of LLMs, agents, eval methodology - you've built and shipped LLM-based systems, not just played with APIs
- Rigorous, empirical mindset - you measure before you claim improvement
- 0–5 years of experience - we care more about what you've built than years on a resume
Example projects in your first 3 months
- Redesign the agent's multi-step planning so it can scaffold a full CRUD app (frontend + flow + schema) from a single prompt
- Build a live feedback UI that lets users steer the agent mid-generation - accept, reject, or redirect individual steps
- Stand up an automated eval pipeline that catches regressions before they ship and benchmarks every prompt/model change