
A Python-based AI agentic assistant that uses Google's Gemini AI to provide natural language computer control through voice commands.


iamkhalid2/computer-use


🎙️ Computer - Use

Python 3.8+ License: MIT

A Python-based agentic AI assistant that uses Google's Gemini API to control the computer in real time from voice or text input. The application talks to Gemini over a bidirectional WebSocket connection and combines multi-threaded audio processing, computer vision, and OCR to keep track of what is currently on screen.
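
To make "bidirectional" concrete: a sender task and a receiver task run concurrently over one connection, so outgoing input never blocks incoming responses. The sketch below is illustrative only and is not the project's code. `LoopbackConnection` is a hypothetical in-memory stand-in for a WebSocket that simply acknowledges each message, and every function name here is an assumption.

```python
import asyncio
from typing import List


class LoopbackConnection:
    """Hypothetical stand-in for a WebSocket: acknowledges whatever is sent."""

    def __init__(self) -> None:
        self._inbox: asyncio.Queue = asyncio.Queue()

    async def send(self, message: str) -> None:
        await self._inbox.put(f"ack: {message}")

    async def recv(self) -> str:
        return await self._inbox.get()


async def sender(conn: LoopbackConnection, outgoing: List[str]) -> None:
    # Push user input (text or encoded audio) to the remote side.
    for message in outgoing:
        await conn.send(message)


async def receiver(conn: LoopbackConnection, expected: int, received: List[str]) -> None:
    # Collect responses as they arrive, independently of the sender.
    for _ in range(expected):
        received.append(await conn.recv())


async def run_session(outgoing: List[str]) -> List[str]:
    conn = LoopbackConnection()
    received: List[str] = []
    # Both directions run concurrently on the same connection.
    await asyncio.gather(
        sender(conn, outgoing),
        receiver(conn, len(outgoing), received),
    )
    return received


if __name__ == "__main__":
    print(asyncio.run(run_session(["open chrome", "click wifi"])))
```

With a real WebSocket client, the two tasks would keep the same shape and only the connection object would change.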

✨ Features

  • 🗣️ Dual input modes: voice commands or text input
  • 🖥️ Real-time screen analysis with OCR and UI element detection
  • 🖱️ Precise computer control capabilities:
    • Mouse movement and click simulation
    • Keyboard input and hotkey combinations
    • Application launching and window management
  • 🎯 Intelligent command interpretation using Gemini AI
  • 🔄 Real-time audio processing with noise reduction
  • 📊 Adaptive silence detection for better voice recognition
  • 🤖 WebSocket-based real-time communication with Gemini AI
  • 🔍 OCR-powered text recognition on screen
  • 🎯 UI element detection and classification
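
As an illustration of the adaptive silence detection listed above, here is a minimal sketch. It is not the project's implementation: the class name `AdaptiveSilenceDetector`, the smoothing factor, and the threshold ratio are assumptions chosen for the example. The idea is to track a running estimate of background noise and to treat a chunk as speech only when its energy clearly exceeds that estimate.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of one audio chunk."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class AdaptiveSilenceDetector:
    """Hypothetical detector whose noise floor follows quiet chunks only."""

    def __init__(self, initial_floor: float = 0.01, ratio: float = 3.0,
                 alpha: float = 0.1) -> None:
        self.noise_floor = initial_floor  # running estimate of background energy
        self.ratio = ratio                # speech must exceed noise_floor * ratio
        self.alpha = alpha                # smoothing factor for floor updates

    def is_speech(self, samples: Sequence[float]) -> bool:
        energy = rms(samples)
        if energy > self.noise_floor * self.ratio:
            return True
        # Quiet chunk: move the floor toward the observed energy.
        self.noise_floor += self.alpha * (energy - self.noise_floor)
        return False
```

Because the floor is updated only on quiet chunks, a long utterance does not raise the threshold and cut itself off.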

🛠️ Prerequisites

  1. Python 3.8 or higher
  2. Tesseract OCR (see the Installation section for platform-specific instructions)
  3. Working microphone (for voice input)
  4. Google Gemini API key
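
The prerequisites above can be verified before the first run with a short script. This is a sketch under stated assumptions, not part of the repository: the function name `check_prerequisites` is hypothetical, it assumes the API key has already been exported into the environment, and it skips the microphone check because that would need an audio library.

```python
import os
import shutil
import sys
from typing import List, Mapping, Optional, Tuple


def check_prerequisites(
    env: Optional[Mapping[str, str]] = None,
    version: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:2]) if version is None else version
    problems = []
    if version < (3, 8):
        problems.append("Python 3.8 or higher is required")
    # shutil.which searches PATH for the Tesseract executable.
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"Missing prerequisite: {problem}")
```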

🚀 Installation

  1. Clone the repository:
     git clone <repository-url>
     cd computer-use
  2. Create and activate a virtual environment:
     python -m venv venv
     # On Windows
     .\venv\Scripts\activate
     # On Linux/macOS
     source venv/bin/activate
  3. Install Python dependencies:
     pip install -r requirements.txt
  4. Install Tesseract OCR:

    • Windows: Download and install from UB-Mannheim's repository
    • Linux: sudo apt-get install tesseract-ocr
    • macOS: brew install tesseract
  5. Create a .env file in the project root:

     GOOGLE_API_KEY=your_api_key_here
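
This README does not say how the application reads the .env file; many projects use a helper package for it. As a dependency-free illustration of the format, the sketch below parses simple KEY=value lines with the standard library only. The function names `parse_env` and `load_env_file` are hypothetical and are not taken from the repository.

```python
import os
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Read the file and copy its keys into os.environ without overriding."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Keep the .env file out of version control, since it contains your API key.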

🏃‍♂️ Usage

  1. Activate the virtual environment if not already activated:
     # On Windows
     .\venv\Scripts\activate
     # On Linux/macOS
     source venv/bin/activate
  2. Run the application:
     python voice_control.py
  3. Choose your preferred input mode when prompted:
    • voice: Use voice commands
    • text: Use text input
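
The mode prompt could be handled along these lines. This is a hypothetical sketch rather than the code in voice_control.py: the function names and the prompt wording are assumptions. It accepts the two documented modes regardless of case or surrounding spaces and asks again on anything else.

```python
from typing import Callable


def normalize_mode(answer: str) -> str:
    """Map user input to 'voice' or 'text'; raise ValueError otherwise."""
    choice = answer.strip().lower()
    if choice in ("voice", "text"):
        return choice
    raise ValueError(f"Unknown input mode: {answer!r}")


def prompt_for_mode(ask: Callable[[str], str] = input) -> str:
    """Keep asking until the user enters a valid mode."""
    while True:
        try:
            return normalize_mode(ask("Input mode (voice/text): "))
        except ValueError as error:
            print(error)
```

Passing the `ask` callable as a parameter keeps the prompt easy to test without a terminal.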

💡 Example Commands

  • "Open Spotify and play my favorite playlist"
  • "Check for the cheapest flights from LA to New York"
  • "Open Chrome and search for the weather"
  • "Find and click the WiFi icon"
  • "Minimize all windows"
  • "Type out an email response"

⚠️ Notes

  • Ensure Tesseract OCR is installed and its executable is on your PATH
  • For voice mode, ensure your microphone is properly configured
  • The assistant works best in a quiet environment for voice commands
  • Some commands may require administrator privileges
  • Screenshots are analyzed in real-time for UI element detection

🤝 Contributing

Contributions are welcome! Feel free to submit issues and pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
