In this paper, we propose a method for recognizing ingredients present
in each cooking step in multimedia recipes. We first introduce and
validate three hypotheses on the characteristics of cooking steps in
recipes: (1) ingredients are most difficult to recognize in the
intermediate and finishing stages, where they lose their original
appearance, (2) a step often inherits ingredients from an earlier
step, though not always from the immediately preceding one when there
are parallel subtasks, and (3) the last step includes all ingredients
used in the recipe. Based on these hypotheses, we introduce
the following features into our method: (1) each step adaptively
inherits features from similar preceding steps, where ingredients are
easier to recognize, and (2) we decide the thresholds for each class
and each recipe adaptively by using the prediction result of our
method for the last step, where all ingredients appear. Experimental
results show that our method outperforms the baseline methods.
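
The two features described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the cosine-similarity softmax weighting, the additive blending, the function names (`inherit_features`, `adaptive_thresholds`, `predict`), and the threshold values 0.3, 0.5, and 0.7 are all assumptions chosen only to make the idea concrete.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def inherit_features(step_feats, temperature=1.0):
    """Hypothetical feature inheritance: each step adds a similarity-weighted
    average of the ORIGINAL features of all preceding steps to its own feature.
    Weighting by similarity (rather than always using step t-1) lets a step
    draw on an earlier step when parallel subtasks intervene."""
    out = []
    for t, feat in enumerate(step_feats):
        if t == 0:
            out.append(list(feat))  # the first step has nothing to inherit
            continue
        sims = [cosine(feat, step_feats[j]) for j in range(t)]
        top = max(sims)
        exps = [math.exp((s - top) / temperature) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over preceding steps
        inherited = [
            sum(w * step_feats[j][d] for j, w in enumerate(weights))
            for d in range(len(feat))
        ]
        out.append([a + b for a, b in zip(feat, inherited)])
    return out


def adaptive_thresholds(last_step_scores, base=0.5, low=0.3, high=0.7):
    """Hypothetical per-class, per-recipe thresholds: classes that the last
    step predicts as present (score >= base) get a lenient threshold, and
    all other classes get a strict one. The values are assumptions."""
    return [low if s >= base else high for s in last_step_scores]


def predict(step_scores, thresholds):
    """Multi-label decision for every step using the per-class thresholds."""
    return [[s >= th for s, th in zip(scores, thresholds)] for scores in step_scores]
```

In this sketch, the thresholds for a recipe are computed once from the scores of its last step and then applied to every step of that recipe, so an ingredient that the last step does not support needs a higher score to be accepted in earlier steps.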
Keywords: recipe, multi-modal annotation, multi-label recognition, datasets
Published in Proc. of 6th ACM Multimedia Asia, pp. 94:1-94:7, Auckland, New Zealand, 2024