Active Learning Strategies Based on Text Informativeness

by Ruide Li, Yoko Yamakata, Keishi Tajima


In this paper, we propose strategies for selecting the next item to label in active learning for text data. Text data have several text-specific features, such as TF-IDF vectors and document embeddings. These features correlate with the informativeness of the text, so our methods use them to select the next item to label. We evaluate our strategies in two problem settings: the standard active learning setting, where the goal is to improve model accuracy, and the learning-to-enumerate setting, where the goal is to enumerate all instances of a given target class efficiently. We also combine our strategies with two existing strategies: uncertainty sampling, a well-known strategy for active learning, and the exploitation-only strategy, which is used in learning-to-enumerate problems. Our experiments on two publicly available English text datasets show that our methods outperform the baseline methods in both problem settings.
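
The abstract does not give the scoring functions, so the following is only a minimal sketch of the general idea: score each unlabeled text by a TF-IDF-based informativeness proxy, mix that score with uncertainty sampling, and label the top-scoring item next. The proxy (sum of TF-IDF weights), the linear mixing weight `alpha`, and all function names are assumptions made for illustration; they are not the strategies defined in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Length-normalized term frequency times a smoothed IDF.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def informativeness(vector):
    """Hypothetical proxy: total TF-IDF mass of the document."""
    return sum(vector.values())


def uncertainty(p_positive):
    """Binary uncertainty: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p_positive - 1.0)


def select_next(docs, probs, labeled, alpha=0.5):
    """Pick the index of the unlabeled document with the highest mixed score.

    probs[i] is the current model's predicted probability that docs[i]
    belongs to the target class; alpha weights informativeness against
    uncertainty (alpha is an assumed parameter, not from the paper).
    """
    info = [informativeness(v) for v in tfidf_vectors(docs)]
    max_info = max(info) or 1.0  # avoid dividing by zero for empty corpora
    candidates = [i for i in range(len(docs)) if i not in labeled]
    return max(
        candidates,
        key=lambda i: alpha * info[i] / max_info
        + (1.0 - alpha) * uncertainty(probs[i]),
    )
```

With `alpha=0.0` the sketch reduces to plain uncertainty sampling, and with `alpha=1.0` it ranks items by the text-based score alone, so the two extremes can be compared directly against a mixed setting.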

Full Text: pdf

Slides: pdf

Talk: mp4 (125M, about 17 min.)

BibTeX entry


Keywords: active learning; learning to enumerate; informativeness; TF-IDF; word embedding; uncertainty sampling
Published in Proc. of IEEE/WIC/ACM WI-IAT, pp. 32-39, Niagara Falls, Canada, 2022

tajima@i.kyoto-u.ac.jp / Fax: +81(Japan) 75-753-5978 / Office: Research Bldg. #7, room 404