Agent S2: An Open, Modular, and Scalable Framework for Computer Use Agents

🌐 [S2 blog]   📄 [S2 Paper] (Coming Soon)   🎥 [S2 Video]   🗨️ [Discord]

🌐 [S1 blog]   📄 [S1 Paper]   🎥 [S1 Video]

🥳 Updates

  • 2025/03/12: Released Agent S2 along with v0.2.0 of gui-agents, the new state-of-the-art for computer use, outperforming OpenAI's CUA/Operator and Anthropic's Claude 3.7 Sonnet!
  • 2025/01/22: The Agent S paper is accepted to ICLR 2025!
  • 2025/01/21: Released v0.1.2 of the gui-agents library, with support for Linux and Windows!
  • 2024/12/05: Released v0.1.0 of the gui-agents library, allowing you to use Agent-S for Mac, OSWorld, and WindowsAgentArena with ease!
  • 2024/10/10: Released Agent S paper and codebase!

Table of Contents

  1. 💡 Introduction
  2. 🎯 Current Results
  3. 🛠️ Installation
  4. 🚀 Usage
  5. 🤝 Acknowledgements
  6. 💬 Citation

💡 Introduction

Welcome to Agent S, an open-source framework designed to enable autonomous interaction with computers through an Agent-Computer Interface. Our mission is to build intelligent GUI agents that can learn from past experiences and perform complex tasks autonomously on your computer.

Whether you're interested in AI, automation, or contributing to cutting-edge agent-based systems, we're excited to have you here!

🎯 Current Results


Success rate (%) of Agent S2 using screenshot input only; OSWorld results are on the full test set.

| Benchmark | Agent S2 | Previous SOTA | Δ improve |
| --- | --- | --- | --- |
| OSWorld (15 step) | 27.0% | 22.7% (ByteDance UI-TARS) | +4.3% |
| OSWorld (50 step) | 34.5% | 32.6% (OpenAI CUA) | +1.9% |
| AndroidWorld | 50.0% | 46.8% (ByteDance UI-TARS) | +3.2% |

πŸ› οΈ Installation & Setup

❗Warning❗: If you are on a Linux machine, creating a conda environment will interfere with pyatspi. As of now, there's no clean solution for this issue. Proceed through the installation without using conda or any virtual environment.

⚠️Disclaimer⚠️: To leverage the full potential of Agent S2, we utilize UI-TARS as a grounding model (7B-DPO, or 72B-DPO for better performance). It can be hosted locally or on Hugging Face Inference Endpoints, and our code supports the latter; check out the Hugging Face Inference Endpoints documentation for how to set up and query such an endpoint. However, running Agent S2 does not require this model: you can use alternative API-based models for visual grounding, such as Claude.

Clone the repository:

git clone https://github.com/simular-ai/Agent-S.git

Install the gui-agents package:

pip install gui-agents

Set your LLM API keys and other environment variables. You can do this by adding the following lines to your .bashrc (Linux) or .zshrc (macOS) file.

export OPENAI_API_KEY=<YOUR_API_KEY>
export ANTHROPIC_API_KEY=<YOUR_ANTHROPIC_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>

Alternatively, you can set the environment variable in your Python script:

import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

We also support Azure OpenAI, Anthropic, and vLLM inference. For more information, refer to models.md.
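
For reference, model selection in the SDK (see the gui_agents SDK section below) is driven by an engine_params dictionary. As a rough sketch (the exact engine_type strings and supported models are documented in models.md), an Anthropic-backed main agent might be configured like this:

# Hypothetical sketch: pointing the main agent at Anthropic instead of OpenAI.
# The "claude" engine_type mirrors the grounding example later in this README;
# consult models.md for the authoritative values.
engine_params = {
    "engine_type": "claude",
    "model": "claude-3-7-sonnet-20250219",
}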

Set Up Retrieval from the Web Using Perplexica

Agent S works best with web-knowledge retrieval. To enable this feature, you need to set up Perplexica:

  1. Ensure Docker Desktop is installed and running on your system.

  2. Navigate to the directory containing the project files.

     cd Perplexica
     git submodule update --init
  3. Rename the sample.config.toml file to config.toml. For Docker setups, you need only fill in the following fields:

    • OPENAI: Your OpenAI API key. You only need to fill this if you wish to use OpenAI's models.

    • OLLAMA: Your Ollama API URL. You should enter it as http://host.docker.internal:PORT_NUMBER. If you installed Ollama on port 11434, use http://host.docker.internal:11434. For other ports, adjust accordingly. You need to fill this if you wish to use Ollama's models instead of OpenAI's.

    • GROQ: Your Groq API key. You only need to fill this if you wish to use Groq's hosted models.

    • ANTHROPIC: Your Anthropic API key. You only need to fill this if you wish to use Anthropic models.

      Note: You can change these after starting Perplexica from the settings dialog.

    • SIMILARITY_MEASURE: The similarity measure to use (This is filled by default; you can leave it as is if you are unsure about it.)

  4. Ensure you are in the directory containing the docker-compose.yaml file and execute:

    docker compose up -d
  5. Our implementation of Agent S incorporates the Perplexica API to provide search-engine capability, which allows for a more convenient and responsive user experience. If you want to tailor the API to your settings and specific requirements, you may modify the URL and the request message parameters in agent_s/query_perplexica.py, as in the sketch after this list. For a comprehensive guide on configuring the Perplexica API, please refer to the Perplexica Search API documentation.
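
For illustration only, a direct call to the Perplexica Search API might look like the following sketch. The URL assumes Perplexica's default Docker setup on localhost (adjust the host and port to your deployment), and the request fields follow the Search API documentation; agent_s/query_perplexica.py is the authoritative version used by Agent S.

import requests

# Hypothetical sketch: query a locally running Perplexica instance.
# The host, port, and field values here are assumptions; see the
# Perplexica Search API docs for the full request schema.
response = requests.post(
    "http://localhost:3000/api/search",
    json={
        "focusMode": "webSearch",   # search the open web
        "query": "What is Agent S?",
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["message"])  # the synthesized answer text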

For a more detailed setup and usage guide, please refer to the Perplexica Repository.

❗Warning❗: The agent will directly run Python code to control your computer. Please use it with care.

🚀 Usage

CLI

Run Agent S2 with a specific model (default is gpt-4o):

agent_s --model gpt-4o --grounding_model claude-3-7-sonnet-20250219

Or use a custom endpoint:

agent_s --model gpt-4o --endpoint_provider "huggingface" --endpoint_url "<endpoint_url>/v1/"

Main Model Settings

  • --model
    • Purpose: Specifies the main generation model
    • Example: gpt-4o
    • Default: gpt-4o

Grounding Configuration Options

You can use either Configuration 1 or Configuration 2:

Configuration 1: API-Based Models
  • --grounding_model
    • Purpose: Specifies the model for visual understanding
    • Supports:
      • Anthropic Claude models (e.g., claude-3-7-sonnet)
      • OpenAI GPT models (e.g., gpt-4-vision)
    • Default: None
Configuration 2: Custom Endpoint
  • --endpoint_provider

    • Purpose: Specifies the endpoint provider
    • Currently supports: HuggingFace TGI
    • Default: huggingface
  • --endpoint_url

    • Purpose: The URL for your custom endpoint
    • Default: None

Running the CLI will show a user query prompt where you can enter your query and interact with Agent S2. You can use any model from the list of supported models in models.md.

gui_agents SDK

First, we import the necessary modules. GraphSearchAgent is the main agent class for Agent S2, and OSWorldACI is our grounding agent that translates agent actions into executable Python code.

import pyautogui
import io
from gui_agents.s2.agents.agent_s import GraphSearchAgent
from gui_agents.s2.agents.grounding import OSWorldACI

# Load in your API keys.
from dotenv import load_dotenv
load_dotenv()

current_platform = "ubuntu"  # "macos"

Next, we define our engine parameters. engine_params is used for the main agent, and engine_params_for_grounding is for grounding. For engine_params_for_grounding, we support Claude, the GPT series, and Hugging Face Inference Endpoints.

engine_type_for_grounding = "huggingface"

engine_params = {
    "engine_type": "openai",
    "model": "gpt-4o",
}

if engine_type_for_grounding == "huggingface":
    engine_params_for_grounding = {
        "engine_type": "huggingface",
        "endpoint_url": "<endpoint_url>/v1/",
    }
elif engine_type_for_grounding == "claude":
    engine_params_for_grounding = {
        "engine_type": "claude",
        "model": "claude-3-7-sonnet-20250219",
    }
elif engine_type_for_grounding == "gpt":
    engine_params_for_grounding = {
        "engine_type": "gpt",
        "model": "gpt-4o",
    }
else:
    raise ValueError("Invalid engine type for grounding")

Then, we define our grounding agent and Agent S2.

grounding_agent = OSWorldACI(
    platform=current_platform,
    engine_params_for_generation=engine_params,
    engine_params_for_grounding=engine_params_for_grounding
)

agent = GraphSearchAgent(
    engine_params,
    grounding_agent,
    platform=current_platform,
    action_space="pyautogui",
    observation_type="mixed",
    search_engine="Perplexica"  # Assuming you have set up Perplexica.
)

Finally, let's query the agent!

# Get screenshot.
screenshot = pyautogui.screenshot()
buffered = io.BytesIO()
screenshot.save(buffered, format="PNG")
screenshot_bytes = buffered.getvalue()

obs = {
    "screenshot": screenshot_bytes,
}

instruction = "Close VS Code"
info, action = agent.predict(instruction=instruction, observation=obs)

exec(action[0])

Refer to gui_agents/s2/cli_app.py for more details on how the inference loop works.
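
For illustration, a minimal inference loop built from the pieces above might look like the sketch below. The fixed step budget is a simplifying assumption; the real loop in cli_app.py decides when the task is complete and handles errors.

import time

instruction = "Close VS Code"
for step in range(15):  # assumption: stop after a fixed step budget
    # Capture the current screen as PNG bytes.
    screenshot = pyautogui.screenshot()
    buffered = io.BytesIO()
    screenshot.save(buffered, format="PNG")
    obs = {"screenshot": buffered.getvalue()}

    # Ask the agent for the next action and execute it.
    info, action = agent.predict(instruction=instruction, observation=obs)
    exec(action[0])

    time.sleep(1)  # let the UI settle before the next screenshot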

OSWorld

To deploy Agent S2 in OSWorld, follow the OSWorld Deployment instructions.

🤝 Acknowledgements

We extend our sincere thanks to Tianbao Xie for developing OSWorld and discussing computer use challenges. We also appreciate the engaging discussions with Yujia Qin and Shihao Liang regarding UI-TARS.

💬 Citation

@misc{agashe2024agentsopenagentic,
      title={Agent S: An Open Agentic Framework that Uses Computers Like a Human}, 
      author={Saaket Agashe and Jiuzhou Han and Shuyu Gan and Jiachen Yang and Ang Li and Xin Eric Wang},
      year={2024},
      eprint={2410.08164},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.08164}, 
}