Abstract:
Most machine learning approaches in Natural Language Processing rely mainly on corpora. Indeed, many applications based on these approaches require prior training of statistical models, such as the Hidden Markov Model for part-of-speech (POS) tagging. However, these learning resources must meet certain criteria to yield a well-trained model, and thus more accurate results. Yet the Arabic language, despite its wide use on the internet and in social media, has a limited number of linguistic resources for machine learning, especially corpora with morpho-syntactic annotations. In this article, we therefore examine the Nemlar corpus, one of the richest annotated linguistic corpora for the Arabic language. We first present the content of this corpus. We then define criteria for improving its structure and enriching its content. We also present the modifications made to the original version, including merging POS tags, separating prefixes and suffixes, and creating tags for specific cases, in order to arrive at the desired form. Next, we report an experiment evaluating the new word recognition rate. Finally, we discuss the advantages and disadvantages of the resulting version.