Vision Transformer and Video Segmentation task

This repository contains the documentation and implementation for two coding assignments:

  1. Vision Transformer (ViT) implementation on CIFAR-10
  2. Text-driven Image & Video Segmentation pipeline using GroundingDINO and SAM 2

Assignment 1: Vision Transformer on CIFAR-10 (PyTorch)

This project implements a Vision Transformer (ViT) from scratch in PyTorch and trains it on the CIFAR-10 dataset. The objective is to reach the highest possible test accuracy by experimenting with architectural choices and training techniques.

How to Run in Colab

  1. Open Google Colab

    • Navigate to Google Colab in your browser.

  2. Upload Notebook

    • Go to File -> Upload notebook
    • Select the q1.ipynb file from your local machine.
  3. Set Runtime Type

    • Click Runtime -> Change runtime type
    • Select GPU from the Hardware accelerator dropdown menu and save.
  4. Run All Cells

    • Click Runtime -> Run all
    • This installs dependencies, downloads CIFAR-10, defines the ViT model, and starts training.

Best Model Configuration

| Hyperparameter      | Value |
| ------------------- | ----- |
| Batch Size          | 128   |
| Learning Rate       | 3e-4  |
| Patch Size          | 4     |
| Stride              | 2     |
| Image Size          | 32    |
| Embedding Dimension | 256   |
| Number of Heads     | 8     |
| Transformer Depth   | 6     |
| MLP Dimension       | 512   |
| Dropout Rate        | 0.1   |
| Optimizer           | Adam  |
| Epochs              | 20    |
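
The configuration above maps onto the model roughly as in the minimal sketch below, a from-scratch ViT built on `nn.TransformerEncoder`. Class and helper names are illustrative and not necessarily the exact ones used in q1.ipynb.

```python
import torch
import torch.nn as nn

class ViT(nn.Module):
    """Minimal Vision Transformer; structure and names are illustrative."""
    def __init__(self, image_size=32, patch_size=4, stride=2, num_classes=10,
                 embed_dim=256, depth=6, num_heads=8, mlp_dim=512, dropout=0.1):
        super().__init__()
        # Patch embedding as a convolution; stride < patch_size gives overlapping patches.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=stride)
        num_patches = ((image_size - patch_size) // stride + 1) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.dropout = nn.Dropout(dropout)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=mlp_dim,
            dropout=dropout, activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

    def forward(self, x):
        x = self.patch_embed(x)                    # (B, D, H', W')
        x = x.flatten(2).transpose(1, 2)           # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(self.dropout(x))
        return self.head(x[:, 0])                  # classify from the [CLS] token

model = ViT()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # batch size 128, 20 epochs
```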

Results

| Experiment                                          | Test Accuracy (%) |
| --------------------------------------------------- | ----------------- |
| Baseline (Non-Overlapping Patches, No Augmentation) | 65.48             |
| With Data Augmentation                              | 69.14             |
| Augmentation + AdamW + Scheduler                    | 69.28             |
| Augmentation + Overlapping Patches + Adam (Best)    | 76.27             |

Analysis

Data Augmentation Effects

  • Applying RandomCrop, RandomHorizontalFlip, and ColorJitter increased test accuracy from 65.48% → 69.14%.
  • Augmentation improves generalization by exposing the model to varied versions of each image, reducing overfitting; a sketch of such a transform pipeline is shown below.
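
A minimal sketch of the augmentation pipeline with torchvision; the ColorJitter strengths and normalization statistics are typical CIFAR-10 choices, not necessarily the exact values in the notebook.

```python
from torchvision import transforms

# Training pipeline: random crop with padding, horizontal flip, and color jitter.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# Test pipeline: no augmentation, only tensor conversion and normalization.
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```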

Optimizer and Scheduler Variants

  • Switching from Adam to AdamW + CosineAnnealingLR provided a slight boost (69.14% → 69.28%).
  • AdamW improves generalization by decoupling weight decay.
  • Cosine annealing gradually lowers the learning rate over training, stabilizing convergence; a sketch of this setup is shown below.
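
A minimal sketch of this variant; the weight-decay value and the `train_one_epoch` helper are assumptions for illustration, not taken from the notebook.

```python
import torch

# AdamW decouples weight decay from the gradient update; cosine annealing
# decays the learning rate smoothly over the 20-epoch run.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    train_one_epoch(model, optimizer)  # hypothetical training-loop helper
    scheduler.step()                   # lower the learning rate once per epoch
```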

Overlapping vs. Non-Overlapping Patches

  • Using overlapping patches (Patch Size = 4, Stride = 2) improved accuracy from 69.14% → 76.27%.
  • Overlapping patches allow the model to capture finer local context, leading to richer representations; the patch-embedding sketch below illustrates the difference.
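
In code, the difference is just the stride of the patch-embedding convolution; the sketch below assumes an embedding dimension of 256 as in the configuration table.

```python
import torch.nn as nn

# Non-overlapping patches: kernel size == stride, (32 / 4)**2 = 64 tokens per image.
non_overlapping_embed = nn.Conv2d(3, 256, kernel_size=4, stride=4)

# Overlapping patches: stride < kernel size, ((32 - 4) // 2 + 1)**2 = 225 tokens,
# so neighbouring patches share pixels and capture finer local context.
overlapping_embed = nn.Conv2d(3, 256, kernel_size=4, stride=2)
```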

Assignment 2: Text-Driven Image & Video Segmentation with SAM 2

This project demonstrates a powerful pipeline for performing segmentation on both static images and video clips using natural language text prompts. The system leverages the zero-shot object detection capabilities of GroundingDINO to interpret text and locate objects, and the high-quality segmentation power of the Segment Anything Model 2 (SAM 2) to generate precise masks.

Core Technologies

  • GroundingDINO: A state-of-the-art, open-set object detector that can locate arbitrary objects in an image based on a free-text query. It acts as the "eyes" of our pipeline, identifying where the object of interest is.
  • Segment Anything Model 2 (SAM 2): The successor to the original SAM, SAM 2 is a foundation model for image segmentation. It can generate high-quality masks for objects given various prompts, including points, boxes, or even other masks. In this pipeline, it takes the bounding boxes from GroundingDINO to determine the exact pixel-level boundaries of the object.

⚙️ The Pipeline: From Text to Mask

1. Image Segmentation

  • Input: Image + text prompt (e.g., "the black cat sitting on the couch").
  • Region Seeding (GroundingDINO): Generates bounding boxes for objects matching the text.
  • Preprocessing for SAM 2: Converts bounding boxes into the required format and prepares the image.
  • Mask Generation (SAM 2): Produces precise pixel-level masks.
  • Final Output: Segmented object mask overlaid on the original image.

This is an end-to-end process requiring no manual annotation or retraining.
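
A rough sketch of the image pipeline, assuming the inference helpers shipped with the GroundingDINO and SAM 2 repositories; config paths, checkpoint names, and thresholds are placeholders rather than the exact values used in the notebook.

```python
import torch
from groundingdino.util.inference import load_model, load_image, predict
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# 1. Region seeding: GroundingDINO turns the text prompt into bounding boxes.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("input.jpg")
boxes, logits, phrases = predict(
    model=dino, image=image,
    caption="the black cat sitting on the couch",
    box_threshold=0.35, text_threshold=0.25)

# 2. Preprocessing: GroundingDINO boxes are normalized (cx, cy, w, h);
#    convert them to absolute (x1, y1, x2, y2) pixel coordinates for SAM 2.
h, w, _ = image_source.shape
boxes_xyxy = boxes * torch.tensor([w, h, w, h])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]

# 3. Mask generation: SAM 2 refines each box into a pixel-level mask.
sam2 = build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
predictor = SAM2ImagePredictor(sam2)
predictor.set_image(image_source)
masks, scores, _ = predictor.predict(box=boxes_xyxy.numpy(), multimask_output=False)
```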

2. Video Segmentation (Bonus Extension)

  • Initialization (First Frame): GroundingDINO + SAM 2 generate the first mask.
  • Propagation (Next Frames): Previous frame mask guides segmentation in the next frame.
  • Output: Final segmented video with continuous object tracking (a minimal propagation sketch follows).
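
A minimal sketch of this propagation idea, reusing the SAM 2 image predictor from the sketch above; `first_mask` and `frames` are assumed inputs, and the actual notebook may organize this step differently.

```python
import numpy as np

def mask_to_box(mask):
    """Tight (x1, y1, x2, y2) bounding box around a binary mask."""
    ys, xs = np.where(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

# Frame 0: GroundingDINO + SAM 2 produce the initial mask (see the image pipeline).
video_masks = [first_mask]                    # first_mask: (H, W) boolean array

# Subsequent frames: the previous mask's bounding box seeds the next prediction.
for frame in frames[1:]:                      # frames: list of RGB numpy arrays
    predictor.set_image(frame)
    box = mask_to_box(video_masks[-1])
    mask, score, _ = predictor.predict(box=box, multimask_output=False)
    video_masks.append(mask[0] > 0)           # keep the single returned mask
```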

⚠️ Limitations and Considerations

  1. Dependency on Initial Detection – If GroundingDINO fails to detect the target object, SAM 2 receives no useful box prompt and the resulting masks will be poor.
  2. Simple Tracking Logic – Fails under fast motion, occlusion, or multiple similar objects.
  3. No Re-identification – Once the object is lost, it cannot be recovered.
  4. High Resource Demand – Both models are computationally heavy and require GPU support for practical performance.

Technologies Used

  • PyTorch
  • GroundingDINO
  • Segment Anything Model 2 (SAM 2)
  • Google Colab (GPU runtime)

Authors

This work was completed as part of the AIRL Coding Assignments.
