Computer vision has come a long way towards automatic labeling of objects, scenes and human actions in visual data. While this recent progress already powers applications such as visual search and autonomous driving, visual scene understanding remains an open challenge beyond specific applications. In this talk I will outline limitations of human-defined labels and will argue for the task-driven approach to scene understanding. Towards this goal I will describe our recent efforts on learning visual models from narrated instructional videos. I will present methods for automatic discovery of actions and object states associated with specific tasks such as changing a car tire or making coffee. Along these efforts, I will describe a state-of-the-art method for text-based video search using our recent dataset with automatically collected 100M narrated videos. Finally I will present our work on visual scene understanding for real robots where we learn agents to discover sequences of actions for completing particular tasks.