ANALYSIS OF DIFFERENT MODELS FOR NAMED ENTITY RECOGNITION IN THE SERBIAN LANGUAGE

Aleksandar Cvejić
Keywords: NLP, NER, transformers, CRF, token classification

Abstract

The goal of this paper is to analyse different approaches to named entity recognition (NER) in the Serbian language. The paper compares the performance of a Conditional Random Fields (CRF) model and of transformer models on the NER task; the transformer models used are BERT, DistilBERT and ELECTRA. The CRF is trained directly on the NER task, whereas the transformer models are trained in three steps: (1) training a tokenizer, (2) pretraining a general language model, and (3) fine-tuning on the NER task. The paper presents results for several CRF configurations trained on different feature sets, as well as the results of the transformer models.
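
As an illustration of the first approach, below is a minimal sketch in Python (not the paper's exact configuration) of a CRF-based NER model built with the sklearn-crfsuite library; the feature set, the toy sentence and its BIO tags are illustrative assumptions.

import sklearn_crfsuite

def token_features(sent, i):
    # Hand-crafted features for the i-th token with a +/-1 token window;
    # the paper's CRF configurations vary feature sets of this kind.
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
    }
    if i > 0:
        feats["-1:word.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True  # sentence start
    if i < len(sent) - 1:
        feats["+1:word.lower"] = sent[i + 1].lower()
    else:
        feats["EOS"] = True  # sentence end
    return feats

# Toy training data; the real model would be trained on an annotated corpus.
train_sents = [["Aleksandar", "Cvejić", "živi", "u", "Novom", "Sadu", "."]]
train_tags = [["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, train_tags)
print(crf.predict(X))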
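
The three-step transformer pipeline can likewise be sketched with the Hugging Face tokenizers and transformers libraries (an assumed toolchain, shown here for BERT only); the corpus file, hyperparameters and tag set are illustrative.

import os
from tokenizers import BertWordPieceTokenizer
from transformers import (BertConfig, BertForMaskedLM, BertForTokenClassification,
                          BertTokenizerFast, DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

# Step 1: train a WordPiece tokenizer on a raw Serbian corpus.
os.makedirs("tokenizer", exist_ok=True)
wp = BertWordPieceTokenizer()
wp.train(files=["serbian_corpus.txt"], vocab_size=30_000)
wp.save_model("tokenizer")
tokenizer = BertTokenizerFast.from_pretrained("tokenizer")

# Step 2: pretrain a general language model with masked-language modelling.
mlm_model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
# LineByLineTextDataset is deprecated upstream, but keeps the sketch compact.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="serbian_corpus.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
Trainer(model=mlm_model,
        args=TrainingArguments(output_dir="pretrained", num_train_epochs=1),
        data_collator=collator, train_dataset=dataset).train()
mlm_model.save_pretrained("pretrained")

# Step 3: fine-tune the pretrained encoder on the NER task.
tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
ner_model = BertForTokenClassification.from_pretrained("pretrained",
                                                       num_labels=len(tags))
# ... build a token-classification dataset with labels aligned to the
# wordpieces, then train with another Trainer instance as above.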

Published
2022-02-04
Section
Electrical and Computer Engineering