NAMED ENTITY RECOGNITION IN THE SERBIAN LANGUAGE USING A TRANSFORMER ARCHITECTURE
Keywords:
NLP, NER, BERT, RoBERTa, Natural language processing, Named entity recognition
Abstract
Established patterns and principles, tried and tested on English, already exist for training neural networks for natural language processing. The natural next step is to research and develop the field for other languages. This paper presents a model architecture for named entity recognition in the Serbian language. The model takes naturally written text as input; the trained model outputs, for each word, the probability that it belongs to a named-entity category. Steps for improving and further developing the field are proposed.
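As a minimal sketch of the input/output interface described above, the Python snippet below feeds a raw Serbian sentence to a BERT-like token-classification model via the HuggingFace Transformers library and reads off per-token probabilities for the named-entity categories. The checkpoint name classla/bcms-bertic-ner is an illustrative assumption (a publicly available Serbian/Croatian NER model), not necessarily the model trained in the paper.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative checkpoint; any BERT-like model fine-tuned for
# Serbian NER token classification would fit the same interface.
MODEL = "classla/bcms-bertic-ner"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL)

text = "Nikola Tesla je rodjen u Smiljanu."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the label dimension gives, for each (sub)token, the
# probability of belonging to each named-entity category.
probs = logits.softmax(dim=-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, p in zip(tokens, probs[0]):
    label = model.config.id2label[int(p.argmax())]
    print(f"{token:>12} -> {label} ({float(p.max()):.2f})")

The printed label per token follows the IOB tagging scheme commonly used for NER, with the softmax score serving as the membership probability the abstract refers to.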
Published
2022-02-04
Section
Electrical and Computer Engineering