Béla Benedek Szakács and Tamás Mészáros

Hybrid Distance-based, CNN and Bi-LSTM System for Dictionary Expansion

Dictionaries such as WordNet can support a variety of Natural Language Processing applications by providing additional morphological data. They can be used in Digital Humanities research, in building knowledge graphs, and in other applications. Creating dictionaries from large corpora of natural-language text has not been a primary focus of research, as other tasks (such as chatbots) have dominated the field, yet it can be very useful for analysing texts. Even for contemporary texts, categorizing words according to their dictionary entries is a complex task, and for less conventional texts (in old or less researched languages) the problem is even harder to solve automatically. Our task was to create software that helps expand a dictionary of word forms and tag unprocessed text. We used a manually created corpus to train and test the model. We combined Bidirectional Long Short-Term Memory networks, convolutional networks and a distance-based solution into a system that outperformed other existing solutions. Manual post-processing of the tagged text is still needed, but the system significantly reduces the amount required.
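To make the idea of a distance-based component concrete, the following is a minimal sketch of how an unseen word form might be matched to the closest known dictionary form. It is an illustration only: the abstract does not state which distance measure the system uses, so Levenshtein edit distance is an assumption here, and the names `levenshtein`, `nearest_entry`, `max_distance` and the toy dictionary are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def nearest_entry(
    word_form: str,
    dictionary: Dict[str, str],
    max_distance: int = 2,  # assumed threshold, not taken from the paper
) -> Optional[Tuple[str, int]]:
    """Return (headword, distance) for the closest known form, or None."""
    best: Optional[Tuple[str, int]] = None
    for known_form, headword in dictionary.items():
        d = levenshtein(word_form, known_form)
        if d <= max_distance and (best is None or d < best[1]):
            best = (headword, d)
    return best


# Hypothetical toy dictionary mapping word forms to headwords.
toy_dictionary = {"walk": "walk", "walked": "walk", "running": "run"}
match = nearest_entry("walkd", toy_dictionary)
no_match = nearest_entry("xyzzy", toy_dictionary)
```

In a hybrid design of the kind the abstract describes, a lookup like this could plausibly handle forms that lie close to known entries, while the neural models would cover forms for which no near neighbour exists; how the components are actually combined is not specified here.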

DOI: 10.36244/ICJ.2020.4.2

Please cite this paper the following way:

Béla Benedek Szakács and Tamás Mészáros, "Hybrid Distance-based, CNN and Bi-LSTM System for Dictionary Expansion", Infocommunications Journal, Vol. XII, No 4, December 2020, pp. 6-13. DOI: 10.36244/ICJ.2020.4.2
