We propose a distributional approach to the automatic correction of abnormal collocations in a Russian text corpus containing different types of erroneous word combinations, in particular construction blending. We develop a toolkit that uses syntactic bigrams from RNC Sketches as training data together with a Word2Vec semantic model. The experiments used a corpus of Russian Student Texts in which erroneous word combinations are annotated; the corpus was parsed morpho-syntactically with TreeTagger and MaltParser.

The assumption that senses are mutually disjoint and have clear boundaries has been called into doubt by several linguists and psychologists. The problem of word sense granularity is widely discussed in both lexicographic and NLP studies. We aim to study word senses in the wild, in raw corpora, by performing word sense induction (WSI). WSI is the task of automatically inducing the different senses of a given word, framed as an unsupervised learning task with senses represented as clusters of token instances. In this paper, we compared four WSI techniques: Adaptive Skip-gram (AdaGram), Latent Dirichlet Allocation (LDA), clustering of contexts, and clustering of synonyms. We evaluated them quantitatively and qualitatively and performed a deep study of the AdaGram method, comparing AdaGram clusters for 126 words (nouns, adjectives, and verbs) with their senses in published dictionaries. We found that AdaGram is quite good at distinguishing homonyms and metaphoric meanings. It ignores disappearing and obsolete senses, but induces new and domain-specific senses which are sometimes absent from dictionaries. However, it works better for nouns than for verbs, ignoring structural differences (e.g. causative meanings or different government patterns). The AdaGram database is available online.

This research was supported by RSF (project No. 16-18-02054: Semantic, statistic and psycholinguistic analysis of lexical polysemy as a component of Russian linguistic worldview). The authors would also like to thank students of the Higher School of Economics and Yandex School of Data Analysis for their help in annotating dictionary senses.

When words have several senses, it is important to describe them properly in a dictionary (a lexicographic task) and to be able to distinguish them in a given context (a computational linguistics task, WSD). Different senses normally have different frequencies in corpora. We introduced several techniques for determining sense frequency based on dictionary entries matched with data from large corpora. Information about word sense frequency is useful not only for explanatory lexicography and WSD; it may also enrich language learning resources. Learners of a foreign language who encounter a word similar to one in their native language are often tempted to assume that the foreign word and its equivalent have the same meaning structure. Sometimes, however, this is not the case, and the most frequent sense of a word in one language may be much less frequent for its cognate. We proposed a method for detecting such cases. Having selected a set of Russian words included in the Active Dictionary of Russian that have more than two dictionary senses and have cognates in English, we estimated the frequencies of the English and Russian senses using SemCor and the Russian National Corpus respectively, matched the senses in each pair of words, and compared their frequencies. We revealed cases in which the most frequent senses and whole meaning structures are, cross-linguistically, substantially different, and studied them in more detail. This technique can be applied not only to cognates, but also to pairs of words that dictionaries usually offer as translation equivalents of each other.
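The comparison of sense frequencies for a cognate pair can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sense identifiers (`ru_1`, `en_1`, ...), the counts, and the sense mapping are all hypothetical, and the sketch simply assumes that per-sense occurrence counts and a manual matching of senses are already available for both words.

```python
from typing import Dict, Optional


def relative_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw per-sense counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no annotated occurrences for this word")
    return {sense: n / total for sense, n in counts.items()}


def compare_cognates(
    ru_counts: Dict[str, int],
    en_counts: Dict[str, int],
    sense_map: Dict[str, Optional[str]],
) -> Dict[str, object]:
    """Compare the most frequent senses of a Russian word and its English cognate.

    sense_map maps each Russian sense id to the matched English sense id,
    or to None when no English sense corresponds to it (an assumption of
    this sketch, not a format taken from the source).
    """
    ru_freq = relative_frequencies(ru_counts)
    en_freq = relative_frequencies(en_counts)

    top_ru = max(ru_freq, key=ru_freq.get)
    top_en = max(en_freq, key=en_freq.get)

    # English counterpart of the most frequent Russian sense, if any.
    matched = sense_map.get(top_ru)
    counterpart_freq = en_freq.get(matched, 0.0) if matched is not None else 0.0

    return {
        "top_ru": top_ru,
        "top_en": top_en,
        "same_top_sense": matched == top_en,
        "top_ru_freq": ru_freq[top_ru],
        "counterpart_freq_en": counterpart_freq,
    }


# Hypothetical counts for one cognate pair (invented for illustration only).
ru_counts = {"ru_1": 60, "ru_2": 30, "ru_3": 10}
en_counts = {"en_1": 5, "en_2": 80, "en_3": 15}
sense_map = {"ru_1": "en_1", "ru_2": "en_2", "ru_3": None}

report = compare_cognates(ru_counts, en_counts, sense_map)
```

With these invented counts, the most frequent Russian sense is matched to an English sense that is rare for the cognate, so `report["same_top_sense"]` is `False`; a pair flagged this way would be a candidate for closer manual study.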