Medical experts from Klagenfurt hospital provided us with a dataset consisting of 21 video recordings of cataract surgeries performed by four different surgeons. Since the dataset is rather small (it contains around 212,000 frames) and unbalanced (i.e., phases 5 and 6 contain nearly half of all frames), we perform some pre-processing to improve its quality. First, we manually remove frames that belong to a certain phase but do not show any instruments (so-called idle periods). Second, we balance the dataset so that the same number of frames per phase is used to train the CNN. As described before, cataract surgery follows a quasi-standardized routine. Hence, we expect temporal information to be useful for improving classification results. To exploit temporal information, we add a relative timestamp (i.e., frame number/total number of frames) to each frame.
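The balancing step and the relative timestamp can be illustrated with a minimal sketch. It rests on assumptions that the text does not state: balancing is done here by random undersampling to the size of the smallest phase, the total number of frames is counted per video, and frames are represented as hypothetical `(video_id, frame_number, phase)` tuples. The function names `relative_timestamp` and `balance_by_phase` are illustrative only.

```python
import random
from collections import defaultdict


def relative_timestamp(frame_number, total_frames):
    """Return frame_number / total_frames for one frame of a video."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return frame_number / total_frames


def balance_by_phase(frames, seed=0):
    """Keep the same number of frames for every phase.

    ASSUMPTION: random undersampling to the smallest phase; the actual
    balancing method is not specified in the text.
    `frames` is a list of (video_id, frame_number, phase) tuples.
    """
    if not frames:
        return []
    by_phase = defaultdict(list)
    for frame in frames:
        by_phase[frame[2]].append(frame)
    # The smallest phase determines how many frames each phase keeps.
    target = min(len(group) for group in by_phase.values())
    rng = random.Random(seed)  # fixed seed for a reproducible selection
    balanced = []
    for phase in sorted(by_phase):
        balanced.extend(rng.sample(by_phase[phase], target))
    return balanced
```

For example, `relative_timestamp(50, 200)` yields `0.25`, meaning the frame lies a quarter of the way through the recording. Whether frame numbering starts at 0 or 1 is not specified in the text and only shifts the value marginally.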