This repository contains the PyTorch code for our ICCV 2025 paper and the associated datasets:
Switch-a-View: Few-Shot View Selection Learned from Unlabeled In-the-wild Videos
Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
Project website: https://vision.cs.utexas.edu/projects/switch-a-view/
We introduce Switch-a-View, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is that such a model can be trained from unlabeled--but human-edited--video samples. We pose a pretext task that pseudo-labels segments in the training videos with their primary viewpoint (egocentric or exocentric), and then discovers the patterns between those view-switch moments on the one hand and the visual and spoken content in the how-to video on the other. Armed with this predictor, our model takes an unseen multi-view video as input and orchestrates which viewpoint should be displayed when, even in settings with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D and rigorously validate its advantages.
- TODO: Upload and share the HowTo100M (HT100M) videos.
This code has been tested with Python 3.9.18, torch 2.2.2+cu121, and torchvision 0.17.2+cu121. Additional Python package requirements are listed in requirements.txt.
Install the remaining dependencies either by
pip3 install -r requirements.txt
or by parsing requirements.txt to get the names and versions of the individual dependencies and installing them one by one.
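For reference, a minimal environment setup might look like the following sketch (the environment name and the cu121 wheel index are assumptions; adjust them for your system):
# hypothetical environment name; any name works
conda create -n switch-a-view python=3.9.18
conda activate switch-a-view
# install the tested torch/torchvision versions from the CUDA 12.1 wheel index
pip3 install torch==2.2.2 torchvision==0.17.2 --index-url https://download.pytorch.org/whl/cu121
pip3 install -r requirements.txt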
Download all data from this link and this link, and place the downloads under the project root. Upon successful download, you should have a directory named data and a file named data-part2.tar.gz.
First, extract data-part2.tar.gz using
tar -xvzf data-part2.tar.gz
Second, run
cd data; cat clips.tar.gz.part_* > clips.tar.gz; tar -xzf clips.tar.gz
Third, run
cd ../data-part2; cp -r ../data/* ./ego_exo; rm -r ../data; cd ..; mv data-part2 data; cd data
This should result in a directory with the relative path ego_exo/clips.
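As a quick, optional sanity check (the command sequence above leaves you inside the data directory), the following should list the extracted Ego-Exo4D clips without errors:
ls ego_exo/clips | head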
Finally, run
cd ht100m; mkdir videos
and read videos.txt to download the HowTo100M videos from YouTube into the videos directory. Each line in videos.txt has the format [SUBDIR_NAME]/[YOUTUBE_VIDEO_ID].mp4, where [SUBDIR_NAME] is the name of the sub-directory of videos into which the video should be downloaded and [YOUTUBE_VIDEO_ID] is the unique ID of the YouTube video to download. We will also try to host the raw videos in the cloud and share them, in case some of the videos can no longer be downloaded.
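One possible way to automate the downloads is sketched below, under the assumption that yt-dlp is installed (it is not part of this repo's requirements) and that videos.txt sits in the current ht100m directory:
# hypothetical download loop; adjust formats and error handling as needed
while read -r line; do
    subdir=$(dirname "$line")      # [SUBDIR_NAME]
    vid=$(basename "$line" .mp4)   # [YOUTUBE_VIDEO_ID]
    mkdir -p "videos/$subdir"
    yt-dlp -f mp4 -o "videos/$subdir/$vid.mp4" "https://www.youtube.com/watch?v=$vid"
done < videos.txt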
Download the foundation model checkpoint directory from this link and place it under the project root. The checkpoint directory is called checkpoints.
From the project root, run
mkdir runs; cd runs; mkdir -p ddp_pastFramesNpastTextNfutureText_htCaption ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/
First, while still inside the runs directory, download from this link and extract the file using
tar -xvzf random.tar.gz
The extracted directory doesn't contain any checkpoints; it is only used for computing our performance numbers.
Second, run
cd ddp_pastFramesNpastTextNfutureText_htCaption
Then, download from this link and run
tar -xvzf onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns.tar.gz
Finally, run
cd ../ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns
Then, download from this link and run
tar -xvzf thrNd__pstFrmsNpstTxtNftrTxtNftrFrms_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_frmPrt_5000fwshtSmpls.tar.gz
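Once you are back in the project root, a quick optional check that the downloaded checkpoints were extracted into the run directories (assuming, as elsewhere in this README, that checkpoints are stored as .pth files):
find runs -name "*.pth"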
Use the appropriate number of nodes and GPUs to achieve an effective batch size of 48 (important!). The command below assumes 2 nodes with 8 V100 GPUs per node, so that --batch-size 3 per GPU x 16 GPUs = 48.
python3 -W ignore train.py --run-dir runs/ddp_pastFramesNpastTextNfutureText_htCaption/[RUN_NAME] --distributed --log-tb --dont-copy-resumeArgs --num-valSamples 960 --num-trainIterations 250 --num-valIterations 20 --batch-size 3 --num-workers 4 --dataset-modalities past_frames,past_text,future_text --valDataset-randomComponents-filename howToCaption___pastFramesNpastTextInput__val__5000MaxDatapoints__classes-ego_exo___wEndFrameBugFix.json --dataset-classes ego,exo --clip-len 32 --use-diffVidClipLen --vid-clipLen 8 --lr 1e-6 --classifier-aggType transformer --classifier-transformerAgg-clsTokenProjType embedding --classifier-transformerAgg-numLayers 8 --classifierHead-linearLayer-dims 256,64 --vidEncoder-useViewTypeEncoder --vidEncoder-viewTypeEncoder-arc embedding --textEncoder-useViewTypeEncoder --textEncoder-viewTypeEncoder-arc embedding --textEncoder-useTimeEncoder --textEncoder-timeEncoder-arc embedding --textEncoder-timeTokenizationHeuristic mean_startNendTime --use-modalityTypeEncoder --modalityTypeEncoder-arc embedding
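Both training commands in this README require this effective batch size of 48; as a trivial illustration of the arithmetic for the setup above:
echo $(( 3 * 2 * 8 ))   # --batch-size per GPU x nodes x GPUs per node = 48
After training, test with the command below.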
python3 test.py --data-parallel --run-dir runs/ddp_pastFramesNpastTextNfutureText_htCaption/[RUN_NAME] --batch-size 18 --num-workers 4 --dataset-modalities past_frames,past_text,future_text --dataset-randomComponents-filename howToCaption___pastFramesNpastTextInput__test__5000MaxDatapoints__classes-ego_exo___wEndFrameBugFix.json --dataset-classes ego,exo --clip-len 32 --use-diffVidClipLen --vid-clipLen 8 --classifier-aggType transformer --classifier-transformerAgg-clsTokenProjType embedding --classifier-transformerAgg-numLayers 8 --classifierHead-linearLayer-dims 256,64 --vidEncoder-useViewTypeEncoder --vidEncoder-viewTypeEncoder-arc embedding --textEncoder-useViewTypeEncoder --textEncoder-viewTypeEncoder-arc embedding --textEncoder-useTimeEncoder --textEncoder-timeEncoder-arc embedding --textEncoder-timeTokenizationHeuristic mean_startNendTime --use-modalityTypeEncoder
To test with our checkpoint, replace [RUN_NAME] in the --run-dir flag with onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns and run the command on 1 node with 8 V100 GPUs.
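For example, with our checkpoint the flag becomes:
--run-dir runs/ddp_pastFramesNpastTextNfutureText_htCaption/onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns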
To evaluate our metrics, follow the instructions in scripts/eval/computeMetrics_amtLabels.ipynb and run the notebook.
Use the appropriate number of nodes and GPUs to achieve an effective batch size of 48 (important!). The command below assumes 3 nodes with 8 V100 GPUs per node, so that --batch-size 2 per GPU x 24 GPUs = 48.
python3 -W ignore train.py --run-dir runs/ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/[RUN_NAME] --distributed --log-tb --dont-copy-resumeArgs --stop-at-epoch 12 --num-fewShotTrainSamples 5000 --num-valSamples 960 --num-trainIterations 250 --num-valIterations 20 --batch-size 2 --num-workers 4 --dataset-name ego_exo --val-dataset-name ego_exo --egoExo-useNnaASRTranscript --egoExo-amtAnnotations-datasetShuffle --dataset-modalities past_frames,past_text,future_text,future_frames --dataset-classes ego,exo --clip-len 32 --use-diffVidClipLen --vid-clipLen 8 --lr 1e-6 --classifier-aggType transformer --classifier-transformerAgg-clsTokenProjType embedding --classifier-transformerAgg-numLayers 8 --classifierHead-linearLayer-dims 256,64 --vidEncoder-useViewTypeEncoder --vidEncoder-viewTypeEncoder-arc embedding --textEncoder-useViewTypeEncoder --textEncoder-viewTypeEncoder-arc embedding --textEncoder-useTimeEncoder --textEncoder-timeEncoder-arc embedding --textEncoder-timeTokenizationHeuristic mean_startNendTime --use-modalityTypeEncoder --modalityTypeEncoder-arc embedding --loadFromPretrainedModel-forFutureFrames --pretrainedModelMidfix-forFutureFrames ddp_pastFramesNpastTextNfutureText_htCaption/onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns
After training, run
cd runs/ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/[RUN_NAME]/data; cp valBestCkpt_maxMeanAcc.pth valBestCkpt_maxMeanAcc__after10epochs.pth; cd ../../../../..
and make sure you land back in the project root after running this command. Then, test with the command below.
python3 -W ignore test.py --distributed --run-dir runs/ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/[RUN_NAME] --batch-size 2 --num-workers 4 --dataset-name ego_exo --egoExo-amtAnnotations-datasetShuffle --egoExo-useNnaASRTranscript --egoExo-amtAnnotations-workerConsensusThreshold 0.7 --dataset-modalities past_frames,past_text,future_text,future_frames --dataset-classes ego,exo --clip-len 32 --use-diffVidClipLen --vid-clipLen 8 --classifier-aggType transformer --classifier-transformerAgg-clsTokenProjType embedding --classifier-transformerAgg-numLayers 8 --classifierHead-linearLayer-dims 256,64 --vidEncoder-useViewTypeEncoder --vidEncoder-viewTypeEncoder-arc embedding --textEncoder-useViewTypeEncoder --textEncoder-viewTypeEncoder-arc embedding --textEncoder-useTimeEncoder --textEncoder-timeEncoder-arc embedding --textEncoder-timeTokenizationHeuristic mean_startNendTime --use-modalityTypeEncoder --modalityTypeEncoder-arc embedding --checkpoint-fileName valBestCkpt_maxMeanAcc__after10epochs
To test with our checkpoint, replace [RUN_NAME] in the --run-dir flag with thrNd__pstFrmsNpstTxtNftrTxtNftrFrms_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_frmPrt_5000fwshtSmpls and run the command on 1 node with 8 V100 GPUs.
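Here, the filled-in flag reads:
--run-dir runs/ego_exo-ddp_pastFramesNpastTextNfutureText_htCaption/ht100mPrtDr---onNd__pstFrmsNpstTxtNftrTxt_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_ht100mNrrns/thrNd__pstFrmsNpstTxtNftrTxtNftrFrms_egNex_vdClpLn8ClpLn32_lr1e6bs48_aggTrnsfrmr8lyrsClssTknTypEmbddng_clssHdLnr256n64_frznVdEncDnoV2lyrs12VwtypEncdTypEmbddng_txtTmTknzHrstcMeanStrtNendTm_frmPrt_5000fwshtSmpls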
To evaluate our metrics, follow the instructions in scripts/eval/computeMetrics_amtLabels_egoExo.ipynb and run the notebook.
If you find our code or data useful for your research, please cite:
@article{majumder2024switch,
    author = {Sagnik Majumder and Tushar Nagarajan and Ziad Al-Halah and Kristen Grauman},
    title = {Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos},
    year = {2024},
    eprint = {arXiv:2412.18386},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV},
}
This project is released under the MIT license, as found in the LICENSE file.
