🎙️ Computer - Use

A Python-based agentic AI assistant that uses Google's Gemini API to control a computer in real time from voice or text input. The application communicates with Gemini over a bidirectional WebSocket connection and combines multi-threaded audio processing, computer vision, and OCR to stay aware of what is on screen.
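To make the "bidirectional" part concrete, here is a minimal sketch of the pattern: one task sends messages while another receives replies concurrently. It is not the project's implementation. The real connection is a WebSocket to Gemini; here two `asyncio.Queue` objects stand in for the socket, and `fake_server`, `sender`, `receiver`, and `run_session` are hypothetical names chosen for illustration.

```python
import asyncio


async def fake_server(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Stand-in for the remote model: replies with each message upper-cased."""
    while True:
        msg = await inbox.get()
        if msg is None:  # sentinel: the client has finished sending
            await outbox.put(None)
            return
        await outbox.put(msg.upper())


async def sender(messages, to_server: asyncio.Queue) -> None:
    """Push outgoing messages without waiting for any reply."""
    for msg in messages:
        await to_server.put(msg)
    await to_server.put(None)


async def receiver(from_server: asyncio.Queue) -> list:
    """Collect replies as they arrive, independently of the sender."""
    replies = []
    while True:
        reply = await from_server.get()
        if reply is None:
            return replies
        replies.append(reply)


async def run_session(messages) -> list:
    # The two queues play the role of the two directions of one socket.
    to_server, from_server = asyncio.Queue(), asyncio.Queue()
    server = asyncio.create_task(fake_server(to_server, from_server))
    _, replies = await asyncio.gather(
        sender(messages, to_server),
        receiver(from_server),
    )
    await server
    return replies


if __name__ == "__main__":
    print(asyncio.run(run_session(["open chrome", "minimize all windows"])))
```

Keeping sending and receiving in separate tasks means a slow reply never blocks the next outgoing message, which is the property a real-time assistant needs.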

✨ Features

🗣️ Dual input modes: voice commands or text input
🖥️ Real-time screen analysis with OCR and UI element detection
🖱️ Precise computer control capabilities:
- Mouse movement and click simulation
- Keyboard input and hotkey combinations
- Application launching and window management
🎯 Intelligent command interpretation using Gemini AI
🔄 Real-time audio processing with noise reduction
📊 Adaptive silence detection for better voice recognition
🤖 WebSocket-based real-time communication with Gemini AI
🔍 OCR-powered text recognition on screen
🎯 UI element detection and classification
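As an illustration of what adaptive silence detection can look like, the sketch below tracks a running noise floor and treats a frame as silent when its RMS level stays below a multiple of that floor. This is an assumed approach, not the project's actual algorithm: the class name `AdaptiveSilenceDetector` and the parameters `margin`, `smoothing`, and `initial_floor` are hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class AdaptiveSilenceDetector:
    """Classifies frames as silent relative to a slowly adapting noise floor."""

    def __init__(self, margin=2.0, smoothing=0.1, initial_floor=100.0):
        self.margin = margin            # how far above the floor counts as speech
        self.smoothing = smoothing      # how quickly the floor follows quiet frames
        self.noise_floor = initial_floor

    def is_silent(self, frame):
        level = rms(frame)
        silent = level < self.noise_floor * self.margin
        if silent:
            # Only quiet frames update the floor, so speech does not raise it.
            self.noise_floor += self.smoothing * (level - self.noise_floor)
        return silent


if __name__ == "__main__":
    detector = AdaptiveSilenceDetector()
    print(detector.is_silent([50] * 160))    # quiet background frame
    print(detector.is_silent([5000] * 160))  # loud frame
```

Because the threshold follows the background level, the same code can behave sensibly in both a quiet room and a noisier one, at the cost of a short warm-up period.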

🛠️ Prerequisites

Python 3.8 or higher
Tesseract OCR (see Installation for platform-specific instructions)
Working microphone (for voice input)
Google Gemini API key
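The prerequisites above can be checked before the first run with a short script. This is a sketch rather than part of the repository: `check_prerequisites` is a hypothetical helper, and it assumes the API key is exposed through the `GOOGLE_API_KEY` environment variable. A key stored only in a .env file will show as missing until that file has been loaded.

```python
import os
import shutil
import sys


def check_prerequisites(env=None):
    """Return a mapping of prerequisite name to True/False."""
    env = os.environ if env is None else env
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "tesseract on PATH": shutil.which("tesseract") is not None,
        "GOOGLE_API_KEY set": bool(env.get("GOOGLE_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

The microphone is not checked here because doing so needs an audio library, which is outside what the standard library offers.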

🚀 Installation

Clone the repository:

git clone <repository-url>
cd computer-use

Create and activate a virtual environment:

python -m venv venv
# On Windows
.\venv\Scripts\activate
# On Linux/macOS
source venv/bin/activate

Install Python dependencies:

pip install -r requirements.txt

Install Tesseract OCR:
- Windows: Download and install from UB-Mannheim's repository
- Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr
- macOS: brew install tesseract

Create a .env file in the project root:

GOOGLE_API_KEY=your_api_key_here
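For readers curious how a file like this reaches the program, the sketch below parses `KEY=value` lines and copies them into the process environment. How the project itself loads the file is not stated here (a package such as python-dotenv is a common choice), so `parse_env_text` and `load_env_file` are hypothetical helpers that handle only the simple format shown above.

```python
import os


def parse_env_text(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env_file(path=".env"):
    """Read a .env file and export its values without overriding existing ones."""
    with open(path, encoding="utf-8") as fh:
        values = parse_env_text(fh.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


if __name__ == "__main__":
    print(parse_env_text("GOOGLE_API_KEY=your_api_key_here"))
```

Using `setdefault` lets a key already exported in the shell take precedence over the file, which is convenient when switching keys temporarily.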

🏃‍♂️ Usage

Activate the virtual environment if not already activated:

# On Windows
.\venv\Scripts\activate
# On Linux/macOS
source venv/bin/activate

Run the application:

python voice_control.py

Choose your preferred input mode when prompted:
- voice: Use voice commands
- text: Use text input
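A mode prompt of this kind can be sketched as follows. The names `normalize_mode` and `prompt_for_mode` and the prompt wording are assumptions for illustration; the actual prompt in voice_control.py may differ.

```python
VALID_MODES = ("voice", "text")


def normalize_mode(answer):
    """Return 'voice' or 'text' for a valid answer, otherwise None."""
    cleaned = answer.strip().lower()
    return cleaned if cleaned in VALID_MODES else None


def prompt_for_mode(read=input):
    """Keep asking until the user enters a valid mode."""
    while True:
        mode = normalize_mode(read("Input mode (voice/text): "))
        if mode is not None:
            return mode
        print("Please type 'voice' or 'text'.")


if __name__ == "__main__":
    print(f"Selected mode: {prompt_for_mode()}")
```

Passing the reader function as a parameter keeps the prompt loop easy to exercise without a terminal.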

💡 Example Commands

"Open Spotify and play my favorite playlist"
"Check for the cheapest flights from LA to New York"
"Open Chrome and search for the weather"
"Find and click the WiFi icon"
"Minimize all windows"
"Type out an email response"

⚠️ Notes

Ensure Tesseract OCR is installed and available on your PATH.
For voice mode, ensure your microphone is configured and selected as the input device.
Voice commands work best in a quiet environment.
Some commands may require administrator privileges.
Screenshots are analyzed in real time for UI element detection.

🤝 Contributing

Contributions are welcome! Feel free to submit issues and pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

iamkhalid2/computer-use
