In this paper, we propose a method for automatically categorizing cooking actions appearing in recipe data. We extract verbs from the textual descriptions of cooking procedures in recipe data and vectorize them using word embeddings. These vectors provide a way to compute the contextual similarity between verbs, as in the sketch below.
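As an illustration, the following minimal Python sketch computes cosine similarity between verb embeddings. The `embeddings` table and its vectors are hypothetical stand-ins; the paper does not specify which embedding model is used.

```python
import numpy as np

# Hypothetical pretrained embeddings; in practice these would come from a
# word-embedding model (e.g., word2vec) trained on the recipe corpus.
embeddings = {
    "chop":  np.array([0.8, 0.1, 0.3]),
    "slice": np.array([0.7, 0.2, 0.4]),
    "boil":  np.array([0.1, 0.9, 0.2]),
}

def contextual_similarity(verb_a, verb_b):
    """Cosine similarity between the embedding vectors of two verbs."""
    a, b = embeddings[verb_a], embeddings[verb_b]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(contextual_similarity("chop", "slice"))  # high: used in similar contexts
print(contextual_similarity("chop", "boil"))   # lower: different contexts
```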
We also extract the images associated with each step of the procedures and vectorize them using a standard feature extraction method. For each verb, we collect the images associated with the steps whose descriptions include the verb, and calculate the average of their vectors. These averaged vectors provide a way to compute the visual similarity between verbs, as sketched below.
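A minimal sketch of this averaging step, assuming a hypothetical `extract_features` function (the paper only says a standard feature extraction method is used) and a list of (description, image) pairs:

```python
import numpy as np

def average_verb_vector(verb, steps, extract_features):
    """Average the feature vectors of the images whose step description
    mentions the verb. `steps` is a list of (description, image) pairs;
    `extract_features` maps an image to a feature vector."""
    vectors = [extract_features(image)
               for description, image in steps
               if verb in description]
    return np.mean(vectors, axis=0) if vectors else None

# Toy usage with dummy images and features (real features would come from,
# e.g., a pretrained CNN):
steps = [("chop the onion", "img1"), ("boil water", "img2"), ("chop garlic", "img3")]
features = {"img1": np.array([1.0, 0.0]),
            "img2": np.array([0.0, 1.0]),
            "img3": np.array([0.0, 2.0])}
print(average_verb_vector("chop", steps, features.get))  # -> [0.5 1. ]
```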
However, one type of action is sometimes represented by several types of images in recipe data. In such cases, the average of the associated image vectors is not an appropriate representation of the action. To mitigate this problem, we propose yet another way to vectorize verbs. We first cluster all the images in the recipe data into 20 clusters. For each verb, we calculate the proportion of each cluster within the set of images associated with the verb, and create a 20-dimensional vector representing the distribution over the 20 clusters, as in the sketch below.
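A minimal sketch of this distribution-based representation, assuming a hypothetical `cluster_of` function that assigns an image to one of the 20 clusters (e.g., the assignment produced by k-means over the image feature vectors):

```python
import numpy as np

NUM_CLUSTERS = 20

def distribution_vector(verb, steps, cluster_of):
    """20-dimensional vector of cluster proportions for the images
    associated with a verb. `steps` is a list of (description, image)
    pairs; `cluster_of` maps an image to an index in [0, NUM_CLUSTERS)."""
    counts = np.zeros(NUM_CLUSTERS)
    for description, image in steps:
        if verb in description:
            counts[cluster_of(image)] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```

Unlike the averaged vector, this representation keeps a multimodal set of images visible as mass spread over several clusters, rather than collapsing it to a single point.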
We calculate the similarity of verbs using these three kinds of vector representations. We conducted a preliminary experiment to compare the three approaches, and the results show that each of them is useful for categorizing cooking actions.