
  ██╗   ██╗ ██╗ ███████╗
  ██║   ██║ ██║ ██╔════╝
  ██║   ██║ ██║ ███████╗
  ╚██╗ ██╔╝ ██║ ╚════██║
   ╚████╔╝  ██║ ███████║
    ╚═══╝   ╚═╝ ╚══════╝

  Visual Tester

VIS — Autonomous Android Testing Agent


VIS is an autonomous testing agent for Android that combines UIAutomator accessibility trees with multi-modal vision models. It sees the screen, understands context, and takes action — no brittle XPath selectors required.

Core Value: Test Android apps like a human would.

  • Semantic understanding — finds elements by meaning, not static IDs
  • Self-healing — falls back to visual analysis when standard selectors fail
  • Local sovereignty — runs models via Ollama, your data stays on your machine
  • Fast — optimized Go core with async capture and streaming

What VIS Can Do

Capability Description
Launch any app Resolves human-readable names ("Calculator", "Chrome") to Android packages automatically
Tap, swipe, type Full device interaction — buttons, text fields, scroll, navigation keys
Describe what's on screen Vision model reads and interprets the live display in natural language
Find elements visually Locates UI components by description ("the red submit button") when IDs are unavailable
Wait for conditions Polls the screen until a target element appears or a timeout is reached
Run structured flows Executes multi-step YAML test plans with Maestro-compatible syntax
Generate reports Produces HTML and JUnit XML reports per session, with automatic cleanup
Multi-device orchestration Runs tasks across multiple connected devices in parallel
MCP server mode Exposes VIS as a tool server for AI agents and IDE integrations
Dry-run validation Parses and plans tasks without touching the device — verify before you execute

VIS works with any Android app — production builds, debug builds, Expo Go, system apps. No source code access or instrumentation required.


Quick Start

# Install
git clone https://github.com/uelkerd/vis.git
cd vis && make build

# Ensure prerequisites are running
ollama pull moondream:latest      # Lightweight vision model (recommended)
adb devices                       # Verify device connected

# Run your first task
./bin/vis --task "open the Settings app"

Installation

From Source (recommended)

git clone https://github.com/uelkerd/vis.git
cd vis
make build          # Binary at ./bin/vis
make install        # Installs to $GOPATH/bin

From GitHub Releases

Download pre-built binaries for your platform from Releases:

# macOS (Apple Silicon)
curl -L https://github.com/uelkerd/vis/releases/latest/download/vis_Darwin_arm64.tar.gz | tar xz
chmod +x vis && mv vis /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/uelkerd/vis/releases/latest/download/vis_Linux_x86_64.tar.gz | tar xz
chmod +x vis && mv vis /usr/local/bin/

Prerequisites

Dependency Purpose Install
Go 1.24+ Build from source golang.org/dl
ADB Android device control brew install android-platform-tools or Android SDK
Ollama Local vision model inference ollama.com
Android device Physical or emulated, with USB debugging enabled

Usage

Natural Language Tasks (--task)

Describe what you want in plain English. VIS parses the intent, resolves app names, and executes on the device.

# Launch apps (human-readable names resolved automatically)
vis --task "open the Settings app"
vis --task "open Calculator"
vis --task "open Chrome"

# Navigation
vis --task "scroll down"
vis --task "press back"
vis --task "go home"

# Interact with elements
vis --task "tap on the search button"
vis --task "type 'hello world' into the search field"

# With verbose logging
vis --task "open Settings" -v      # DEBUG level
vis --task "open Settings" -vv     # TRACE level (most detailed)
vis --task "open Settings" -q      # Quiet (warnings/errors only)

Dry Run Mode (--dry-run)

Parse and plan without touching the device — useful for validating NLP parsing.

vis --task "open Calculator and type 123" --dry-run -v
# Logs: "dry-run: would execute action" with parsed intent details

Vision Streaming (--stream)

Continuous screen analysis — VIS captures and describes what it sees in real time.

vis --stream                  # Run indefinitely (Ctrl+C to stop)
vis --stream -v               # With debug output

Maestro Flows (--maestro)

Run structured test flows defined in YAML.

vis --maestro flows/login-test.yaml
vis --maestro flows/checkout.yaml -v
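
A flow file is plain YAML in Maestro-compatible syntax. The file name, app ID, and element labels below are illustrative placeholders, not taken from this repository:

```yaml
# flows/login-test.yaml — hypothetical flow for a demo app
appId: com.example.demo          # package under test (assumed)
---
- launchApp
- tapOn: "Log in"                # matches visible text or accessibility label
- inputText: "user@example.com"
- tapOn: "Submit"
- assertVisible: "Welcome"
```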

Hybrid Vision-Flows (--hybrid)

Combine structured flows with vision-based fallbacks.

vis --hybrid flows/search-flow.yaml

Test Cycles (--test-cycle)

Run continuous iteration cycles for stress testing.

vis --test-cycle 10            # Run 10 iterations
vis --test-cycle 50 -v         # 50 iterations with debug logging

MCP Server Mode (--server)

Start VIS as an MCP (Model Context Protocol) server for integration with other tools.

vis --server                  # Start MCP server on stdin/stdout
vis --mcp                     # Alias for --server
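
Most MCP clients register stdio servers through a JSON configuration; the exact file location and schema depend on the client, so treat this as a sketch assuming the vis binary is on your PATH:

```json
{
  "mcpServers": {
    "vis": {
      "command": "vis",
      "args": ["--server"]
    }
  }
}
```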

Environment Setup (setup)

Check prerequisites and download required models.

vis setup

Device Targeting (--device)

Target a specific device when multiple are connected.

vis --task "open Settings" --device 29021FDH2009DQ
vis --task "open Settings" --device emulator-5554

Report Control (--report)

Reports are generated by default to reports/ (auto-cleaned, keeps 10 most recent).

vis --task "open Settings" --report=false   # Disable report generation

Environment Variables

Variable Default Description
VIS_MODEL moondream:latest Vision model for screen analysis
VIS_NLU_MODEL llama3.1:latest NLU model for natural language parsing
VIS_OLLAMA_URL http://localhost:11434/api/generate Ollama API endpoint
VIS_TIMEOUT 120 Model timeout in seconds
TEST_DEVICE_ID (none) Specific ADB device for tests
# Defaults work out of the box with moondream
# For higher accuracy on complex screens, upgrade the vision model:
export VIS_MODEL="llama3.2-vision:11b"
export VIS_NLU_MODEL="qwen-agentic:latest"
export VIS_TIMEOUT=180

Known Apps

VIS resolves human-readable app names to Android package IDs automatically:

Name Package
Settings com.android.settings
Calculator com.google.android.calculator
Chrome com.android.chrome
Gmail com.google.android.gm
Maps com.google.android.apps.maps
Camera com.google.android.GoogleCamera
Calendar com.google.android.calendar
Phone com.google.android.dialer
Files com.google.android.apps.nbu.files
Clock com.google.android.deskclock
Photos com.google.android.apps.photos
Expo Go host.exp.exponent

Any unrecognized name is passed through as a raw package ID.
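
Conceptually, the resolution behaves like a case-insensitive map lookup with passthrough. A minimal Go sketch of that behavior (names and structure are assumed for illustration, not the actual internal/nlp code):

```go
package main

import (
	"fmt"
	"strings"
)

// knownApps maps human-readable names to package IDs
// (a subset of the table above).
var knownApps = map[string]string{
	"settings":   "com.android.settings",
	"calculator": "com.google.android.calculator",
	"chrome":     "com.android.chrome",
	"expo go":    "host.exp.exponent",
}

// resolvePackage returns the package ID for a known name,
// or passes the input through unchanged as a raw package ID.
func resolvePackage(name string) string {
	if pkg, ok := knownApps[strings.ToLower(strings.TrimSpace(name))]; ok {
		return pkg
	}
	return name
}

func main() {
	fmt.Println(resolvePackage("Settings"))           // com.android.settings
	fmt.Println(resolvePackage("com.example.custom")) // passed through unchanged
}
```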


Architecture

VIS follows a Capture-Analyze-Decide-Act (CADA) autonomous agent loop:

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ CAPTURE  │───▶│ ANALYZE  │───▶│ DECIDE   │───▶│ ACT      │
│ ADB      │    │ Ollama   │    │ Agent    │    │ ADB      │
│ screencap│    │ Vision   │    │ NLP      │    │ tap/     │
│ uidump   │    │ Model    │    │ Parser   │    │ swipe    │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
     ▲                                               │
     └───────────────────────────────────────────────┘
                     (continuous loop)
  1. Capture — Screenshots via ADB with JPEG compression and caching
  2. Analyze — Vision models interpret screen content semantically
  3. Decide — NLP parser + agent logic determines the next action
  4. Act — ADB executes taps, swipes, inputs, key events
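
The four stages above can be sketched as a plain Go loop with stubbed functions. This is illustrative only, not the actual internal/agent code; the real agent shells out to adb and calls the Ollama API where the stubs return canned values:

```go
package main

import "fmt"

// Action is a stub for a device command chosen by the agent.
type Action struct{ Name string }

// Stubbed stages of the CADA loop.
func capture() string             { return "screenshot+uidump" }
func analyze(frame string) string { return "description of " + frame }
func decide(desc string) (Action, bool) {
	// A real agent would return done=true when the task goal is met.
	return Action{Name: "tap"}, false
}
func act(a Action) { fmt.Println("executing:", a.Name) }

func main() {
	for i := 0; i < 3; i++ { // the real loop runs until the task completes
		frame := capture()
		desc := analyze(frame)
		action, done := decide(desc)
		if done {
			break
		}
		act(action)
	}
}
```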

Project Structure

cmd/vis/              CLI entry point
internal/
├── adb/              ADB device control (taps, swipes, inputs, key events)
├── agent/            Core CADA loop orchestration
├── capture/          Screenshot acquisition and caching
├── config/           Environment-based configuration
├── hybrid/           Hybrid selector engine
├── livefeed/         Scrcpy live feed integration
├── mcp/              Model Context Protocol server
├── nlp/              Natural language task parsing
├── reporting/        HTML and JUnit report generation
├── resilience/       Circuit breaker and retry patterns
├── selector/         Self-healing element location engine
├── setup/            Ollama environment setup
├── types/            Shared domain types
└── vis/              Vision model client (Ollama API)
scripts/              Build & test automation
e2e/                  End-to-end tests (requires device + Ollama)

Development

make build          # Build binary
make test           # Run unit tests
make test-cover     # Run tests with coverage
make lint           # Run linter
make clean          # Clean build artifacts

# Physical device test suite (requires connected Android device + Ollama)
./scripts/device-test.sh

License

Distributed under the MIT License. See LICENSE for details.
