I have access to several videos and would like to create a machine learning or deep learning model that can recognize the different steps in a new, unseen video.
To keep it simple, take coffee as an example. I have several videos of people making coffee. The steps are: step 1, brew the coffee (e.g. 0:00 to 0:20); step 2, add sugar (e.g. 0:23 to 0:31); step 3, stir (e.g. 0:33 to 0:45).
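To make the example concrete, here is a rough sketch of how I imagine those time ranges could be written down and expanded into one label per second. The names (`Segment`, `to_per_second_labels`) and the `background` label for the gaps between steps are my own assumptions for illustration, not an established annotation format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str
    start_s: int  # first second of the step (inclusive)
    end_s: int    # last second of the step (inclusive)


# Time ranges taken from the coffee example (in seconds).
coffee_video = [
    Segment("brew_coffee", 0, 20),
    Segment("add_sugar", 23, 31),
    Segment("stir", 33, 45),
]


def to_per_second_labels(segments, duration_s, background="background"):
    """Return one label per second of video.

    Seconds not covered by any segment get the `background` label
    (an assumption; the gaps could also simply be dropped).
    """
    labels = [background] * duration_s
    for seg in segments:
        # Clip to the video length so a segment can never index past the end.
        for t in range(seg.start_s, min(seg.end_s + 1, duration_s)):
            labels[t] = seg.label
    return labels


labels = to_per_second_labels(coffee_video, duration_s=46)
```

With this layout each video would carry a short list of (label, start, end) entries, and the per-second list is only derived from it when needed.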
How should I label my footage, and what kind of model should I be looking at for this task?
I haven't tried anything yet because I got stuck on the above problem.
Can anyone point me in the right direction?