[ICCV 25] Switch-a-View: Few-Shot View Selection Learned from Unlabeled In-the-wild Videos

This repository contains the PyTorch code for our ICCV 2025 paper and the associated datasets:

Switch-a-View: Few-Shot View Selection Learned from Unlabeled In-the-wild Videos
Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman

Project website: https://vision.cs.utexas.edu/projects/switch-a-view/

Abstract

We introduce Switch-a-View, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled--but human-edited--video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between those view-switch moments on the one hand and the visual and spoken content in the how-to video on the other hand. Armed with this predictor, our model then takes an unseen multi-view video as input and orchestrates which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D and rigorously validate its advantages.

Code and Datasets

TODOs (to be completed by 5/2):

  • Upload and share HT100M videos

Dependencies

This code has been tested with Python 3.9.18, torch 2.2.2+cu121, and torchvision 0.17.2+cu121. Additional Python package requirements are listed in requirements.txt.

Install the remaining dependencies either by

pip3 install -r requirements.txt

or by parsing requirements.txt to get the names and versions of the individual dependencies and installing them one at a time.
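
For example, a minimal sketch of the second option (not a repository script; it assumes each non-empty line of requirements.txt is a plain pip requirement specifier):

# one possible way to install the requirements one at a time (assumption: no comment lines in requirements.txt)
while read -r dep; do [ -n "$dep" ] && pip3 install "$dep"; done < requirements.txt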

Data

Download all data from this link and this link into the project root. Upon successful download, you should have a directory named data and a file named data-part2.tar.gz.

First, extract data-part2.tar.gz using

tar -xvzf data-part2.tar.gz

Second, run

cd data; cat clips.tar.gz.part_* > clips.tar.gz; tar -xzf clips.tar.gz

Third, run

cd ../data-part2; cp -r ../data/* ./ego_exo; rm -r ../data; cd ..; mv data-part2 data; cd data

This should result in a directory with the relative path ego_exo/clips.
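
As an optional sanity check (an illustrative command, not part of the repository; run from inside the data directory):

ls ego_exo/clips | head   # should list the extracted clips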

Finally, run

cd ht100m; mkdir videos

and then follow videos.txt to download the HowTo100M videos from YouTube into the videos directory. Each line of videos.txt has the format [SUBDIR_NAME]/[YOUTUBE_VIDEO_ID].mp4, where [SUBDIR_NAME] is the sub-directory of videos into which the video should be downloaded and [YOUTUBE_VIDEO_ID] is the unique ID of the YouTube video to download. I'll also try to host the raw videos on the cloud and share them in case some of the videos can no longer be downloaded.
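
For example, a minimal download sketch (an assumption, not a repository script; it relies on yt-dlp and is run from inside ht100m):

# one possible downloader (assumes yt-dlp is installed and the videos are still available)
while read -r line; do
  subdir=$(dirname "$line")        # [SUBDIR_NAME]
  vid=$(basename "$line" .mp4)     # [YOUTUBE_VIDEO_ID]
  mkdir -p "videos/$subdir"
  yt-dlp -f mp4 -o "videos/$subdir/$vid.mp4" "https://www.youtube.com/watch?v=$vid"
done < videos.txt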

Foundation model checkpoints

Download the foundation model checkpoint directory from this link and place it under the project root. The checkpoint directory is called checkpoints.

Our checkpoints

From the project root, run

mkdir runs; cd runs; mkdir ddp_pastFramesNpastTextNfutureText_htCaption ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/ -p

First, download from this link and extract the file using

tar -xvzf random.tar.gz

The extracted directory doesn't contain any checkpoints but is needed for computing our performance numbers.

Second, run

cd ddp_pastFramesNpastTextNfutureText_htCaption

then download from this link and extract the file using

tar -xvzf onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns.tar.gz

Finally, run

cd ../ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns

then download from this link and extract the file using

tar -xvzf thrNd__pstFrmsNpstTxtNftrTxtNftrFrms_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_frmPrt_5000fwshtSmpls.tar.gz

Run commands (tested with V100 GPUs and a total training batch size of 48)

View-switch detection training

Important: use the appropriate number of nodes and GPUs to achieve a total batch size of 48. The command below assumes 2 nodes with 8 V100 GPUs per node.

python3 -W ignore train.py --run-dir runs/ddp_pastFramesNpastTextNfutureText_htCaption/[RUN_NAME] --distributed --log-tb --dont-copy-resumeArgs --num-valSamples 960 --num-trainIterations 250 --num-valIterations 20 --batch-size 3 --num-workers 4 --dataset-modalities past_frames,past_text,future_text --valDataset-randomComponents-filename howToCaption___pastFramesNpastTextInput__val__5000MaxDatapoints__classes-ego_exo___wEndFrameBugFix.json --dataset-classes ego,exo --clip-len 32 --use-diffVidClipLen --vid-clipLen 8 --lr 1e-6 --classifier-aggType transformer --classifier-transformerAgg-clsTokenProjType embedding --classifier-transformerAgg-numLayers 8 --classifierHead-linearLayer-dims 256,64 --vidEncoder-useViewTypeEncoder --vidEncoder-viewTypeEncoder-arc embedding --textEncoder-useViewTypeEncoder --textEncoder-viewTypeEncoder-arc embedding --textEncoder-useTimeEncoder --textEncoder-timeEncoder-arc embedding --textEncoder-timeTokenizationHeuristic mean_startNendTime --use-modalityTypeEncoder --modalityTypeEncoder-arc embedding
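
As a quick sanity check of the effective batch size (an illustrative snippet, not part of the repository; the numbers reflect the assumed 2-node setup above):

NODES=2; GPUS_PER_NODE=8; PER_GPU_BATCH=3   # PER_GPU_BATCH matches --batch-size in the command above
echo "effective batch size: $((NODES * GPUS_PER_NODE * PER_GPU_BATCH))"   # prints 48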

View-switch detection testing on HowTo100M

python3 test.py --data-parallel --run-dir runs/ddp_pastFramesNpastTextNfutureText_htCaption/[RUN_NAME] --batch-size 18 --num-workers 4 --dataset-modalities past_frames,past_text,future_text --dataset-randomComponents-filename howToCaption___pastFramesNpastTextInput__test__5000MaxDatapoints__classes-ego_exo___wEndFrameBugFix.json --dataset-classes ego,exo --clip-len 32 --use-diffVidClipLen --vid-clipLen 8 --classifier-aggType transformer --classifier-transformerAgg-clsTokenProjType embedding --classifier-transformerAgg-numLayers 8 --classifierHead-linearLayer-dims 256,64 --vidEncoder-useViewTypeEncoder --vidEncoder-viewTypeEncoder-arc embedding --textEncoder-useViewTypeEncoder --textEncoder-viewTypeEncoder-arc embedding --textEncoder-useTimeEncoder --textEncoder-timeEncoder-arc embedding --textEncoder-timeTokenizationHeuristic mean_startNendTime --use-modalityTypeEncoder

To test with our checkpoint, replace [RUN_NAME] in the --run-dir flag with onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns and run the command on 1 node with 8 V100 GPUs.

To evaluate our metrics, follow the instructions in scripts/eval/computeMetrics_amtLabels.ipynb and run the notebook.
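
One way to open the notebook (not a repository-specific command):

jupyter notebook scripts/eval/computeMetrics_amtLabels.ipynb   # assumes Jupyter is installed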

View selection training

Important: use the appropriate number of nodes and GPUs to achieve a total batch size of 48. The command below assumes 3 nodes with 8 V100 GPUs per node (3 nodes × 8 GPUs × a per-GPU --batch-size of 2 = 48).

python3 -W ignore train.py --run-dir runs/ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/[RUN_NAME] --distributed --log-tb --dont-copy-resumeArgs --stop-at-epoch 12 --num-fewShotTrainSamples 5000 --num-valSamples 960 --num-trainIterations 250 --num-valIterations 20 --batch-size 2 --num-workers 4 --dataset-name ego_exo --val-dataset-name ego_exo --egoExo-useNnaASRTranscript --egoExo-amtAnnotations-datasetShuffle --dataset-modalities past_frames,past_text,future_text,future_frames --dataset-classes ego,exo --clip-len 32 --use-diffVidClipLen --vid-clipLen 8 --lr 1e-6 --classifier-aggType transformer --classifier-transformerAgg-clsTokenProjType embedding --classifier-transformerAgg-numLayers 8 --classifierHead-linearLayer-dims 256,64 --vidEncoder-useViewTypeEncoder --vidEncoder-viewTypeEncoder-arc embedding --textEncoder-useViewTypeEncoder --textEncoder-viewTypeEncoder-arc embedding --textEncoder-useTimeEncoder --textEncoder-timeEncoder-arc embedding --textEncoder-timeTokenizationHeuristic mean_startNendTime --use-modalityTypeEncoder --modalityTypeEncoder-arc embedding --loadFromPretrainedModel-forFutureFrames --pretrainedModelMidfix-forFutureFrames ddp_pastFramesNpastTextNfutureText_htCaption/onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns

After training, run

cd runs/ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/[RUN_NAME]/data; cp valBestCkpt_maxMeanAcc.pth valBestCkpt_maxMeanAcc__after10epochs.pth; cd ../../../../..

and make sure you land back in the project root after running this command.

View selection testing

python3 -W ignore test.py --distributed --run-dir runs/ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/[RUN_NAME] --batch-size 2 --num-workers 4 --dataset-name ego_exo --egoExo-amtAnnotations-datasetShuffle --egoExo-useNnaASRTranscript --egoExo-amtAnnotations-workerConsensusThreshold 0.7 --dataset-modalities past_frames,past_text,future_text,future_frames --dataset-classes ego,exo --clip-len 32 --use-diffVidClipLen --vid-clipLen 8 --classifier-aggType transformer --classifier-transformerAgg-clsTokenProjType embedding --classifier-transformerAgg-numLayers 8 --classifierHead-linearLayer-dims 256,64 --vidEncoder-useViewTypeEncoder --vidEncoder-viewTypeEncoder-arc embedding --textEncoder-useViewTypeEncoder --textEncoder-viewTypeEncoder-arc embedding --textEncoder-useTimeEncoder --textEncoder-timeEncoder-arc embedding --textEncoder-timeTokenizationHeuristic mean_startNendTime --use-modalityTypeEncoder --modalityTypeEncoder-arc embedding --checkpoint-fileName valBestCkpt_maxMeanAcc__after10epochs

To test with our checkpoint, replace [RUN_NAME] in the --run-dir flag with thrNd__pstFrmsNpstTxtNftrTxtNftrFrms_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_frmPrt_5000fwshtSmpls and run the command on 1 node with 8 V100 GPUs.

To evaluate our metrics, follow the instructions in scripts/eval/computeMetrics_amtLabels_egoExo.ipynb and run the notebook.

Citation

@article{majumder2024switch,
  author        = {Sagnik Majumder and Tushar Nagarajan and Ziad Al-Halah and Kristen Grauman},
  title         = {Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos},
  year          = {2024},
  eprint        = {2412.18386},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
}

License

This project is released under the MIT license, as found in the LICENSE file.
