Look for the Change: Learning Object States and
State-Modifying Actions from Untrimmed Web Videos
Tomáš Souček
Jean-Baptiste Alayrac
Antoine Miech
Ivan Laptev
Josef Sivic
[Paper]
[Code]
[Dataset]
Model overview. Given a set of noisy untrimmed input videos from the web depicting a state-changing action (here, cutting an apple), our approach learns an action classifier g and an object state classifier h that output temporal labels l of the input videos, i.e. the temporal locations of the initial object state, the manipulating action, and the end object state, that satisfy the causal ordering constraint: initial object state → manipulating action → end object state. This is achieved by minimizing a new noise adaptive learning objective that downweights irrelevant videos with an adaptive weight ω measuring similarity to a small number of exemplar images. The learning proceeds by iterating between (i) learning the action and state classifiers, g and h, given the current labels l of the input videos and (ii) finding the labels l of the videos that respect the causal ordering constraint.
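To make step (ii) concrete, below is a minimal illustrative sketch in Python/NumPy of how labels respecting the causal ordering can be selected. The function name assign_labels and the additive scoring are our assumptions for illustration, not the paper's exact implementation: given per-frame scores from the classifiers h and g, it picks frames t1 < t2 < t3 for the initial state, the action, and the end state that maximize the combined score.

    import numpy as np

    def assign_labels(state_scores, action_scores):
        """Select frames t1 < t2 < t3 (initial state, action, end state)
        maximizing the summed classifier scores under the causal ordering.
        state_scores: (T, 2) array, columns = [initial state, end state].
        action_scores: (T,) array of action classifier scores."""
        T = len(action_scores)
        assert T >= 3, "need at least three frames for an ordered triple"
        # best initial-state score over frames [0..t] (prefix maximum)
        best_init = np.maximum.accumulate(state_scores[:, 0])
        # best end-state score over frames [t..T-1] (suffix maximum)
        best_end = np.maximum.accumulate(state_scores[::-1, 1])[::-1]
        # for each candidate action frame t2, combine with the best t1 < t2 and t3 > t2
        total = np.full(T, -np.inf)
        for t2 in range(1, T - 1):
            total[t2] = best_init[t2 - 1] + action_scores[t2] + best_end[t2 + 1]
        t2 = int(np.argmax(total))
        t1 = int(np.argmax(state_scores[:t2, 0]))
        t3 = t2 + 1 + int(np.argmax(state_scores[t2 + 1:, 1]))
        return t1, t2, t3

The prefix/suffix maxima make the search over all ordered triples linear in the number of frames rather than cubic.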


Abstract

Human actions often induce changes of object states such as “cutting an apple”, “cleaning shoes” or “pouring coffee”. In this paper, we seek to temporally localize object states (e.g. “empty” and “full” cup) together with the corresponding state-modifying actions (“pouring coffee”) in long uncurated videos with minimal supervision. The contributions of this work are threefold. First, we develop a self-supervised model for jointly learning state-modifying actions together with the corresponding object states from an uncurated set of videos from the Internet. The model is self-supervised by the causal ordering signal, i.e. initial object state → manipulating action → end state. Second, to cope with noisy uncurated training data, our model incorporates a noise adaptive weighting module, supervised by a small number of annotated still images, that allows us to efficiently filter out irrelevant videos during training. Third, we collect a new dataset with more than 2,600 hours of video and 34 thousand changes of object states, and manually annotate a part of this data to validate our approach. Our results demonstrate substantial improvements over prior work in both action and object state recognition in video.
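The noise adaptive weight ω can likewise be illustrated with a minimal Python/NumPy sketch. The function name adaptive_weight, the sigmoid squashing, and the threshold and temperature values are illustrative assumptions, not the paper's exact formulation: each video is weighted by how well its best-matching frame resembles the annotated exemplar images, so videos with no relevant frames receive a low weight during training.

    import numpy as np

    def adaptive_weight(frame_feats, exemplar_feats, tau=0.1, threshold=0.5):
        """Weight a video by the similarity of its best-matching frame to a
        small set of annotated exemplar images. Both feature matrices are
        assumed L2-normalized, so a dot product equals cosine similarity.
        frame_feats: (num_frames, D); exemplar_feats: (num_exemplars, D)."""
        sims = frame_feats @ exemplar_feats.T     # (num_frames, num_exemplars)
        best = sims.max()                         # best frame-to-exemplar match
        # squash to (0, 1): videos far below the threshold are downweighted
        return float(1.0 / (1.0 + np.exp(-(best - threshold) / tau)))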



Example Model Predictions

Try it on your own videos! Instructions and trained model weights are available on our GitHub page.
[Code]


ChangeIt Dataset

Example videos: paper plane folding and dragon fruit peeling.
Information on how to download the ChangeIt dataset is available on its GitHub page.
[Dataset]


Paper and Supplementary Material

T. Souček, J.-B. Alayrac, A. Miech, I. Laptev, J. Sivic
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
(hosted on arXiv)


@inproceedings{soucek2022lookforthechange,
    title = {Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos},
    author = {Sou\v{c}ek, Tom\'{a}\v{s} and Alayrac, Jean-Baptiste and Miech, Antoine and Laptev, Ivan and Sivic, Josef},
    booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2022}
}
[BibTeX]


Acknowledgements

The project was supported by the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468), by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90140), by the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and by the Louis Vuitton ENS Chair on Artificial Intelligence. We would also like to thank Kateřina Součková and Lukáš Kořínek for their help with the dataset.