A Case Study on Start-up of Dataset Construction: In Case of Recipe Named Entity Corpus

by Yoko Yamakata, Keishi Tajima, Shinsuke Mori


In this paper, we report our experience in constructing a cooking recipe text corpus. We describe the problems we encountered and explain how we handled them. One problem was the difficulty of establishing a clear, stable, and complete guideline that tells annotators how to annotate. During annotation, we found many unexpected cases for which the pre-defined guideline was not clear enough, and even cases for which it provided no guidance at all. As a result, we had to update the guideline twice during annotation, and we also had to revise the annotations made before the updates. This process involved several trade-offs, and it is not easy to decide when and how often the annotations should be revised. It is even unclear whether we should revise them at all or should instead spend the human resources on annotating more data. We present an experiment whose result suggests that the old annotations should be revised. Another problem was managing the versions of the guideline, the sets of annotations corresponding to each version, and the communication between participants.

recipe data; dataset creation; corpus creation; corpus construction; data annotation; annotation guideline; annotation support
Published in Proc. of HMData (collocated with IEEE BigData), pp. 3564-3567, Seattle, WA, 2018

