AUTOMATIC DETECTION OF MENU ITEMS IN RESTAURANT REVIEW TEXTS

Igor Trpovski

doi:10.24867/06BE08Trpovski

Keywords: text analysis, natural language processing, named entity recognition

Abstract

This research presents an approach for detecting menu items in the text of restaurant reviews. Several machine learning and deep learning models were trained to detect food mentions in the reviews. Several string matching algorithms were then applied to pair each food mention with the corresponding menu item. The data were collected from the website Donesi.com and manually annotated. All models and algorithms used were evaluated.
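
The matching step can be illustrated with a minimal sketch. The abstract does not say which string matching algorithms were used or how they were configured, so the code below assumes a plain Levenshtein edit distance normalized by string length as a stand-in. The function names, the 0.6 threshold, and the sample menu are all hypothetical and serve only as illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def match_mention(mention, menu_items, threshold=0.6):
    """Return (best_item, score), or (None, score) if below the threshold.

    The threshold is an assumed value, not one taken from the paper.
    """
    mention_norm = mention.strip().lower()
    best_item, best_score = None, 0.0
    for item in menu_items:
        score = similarity(mention_norm, item.strip().lower())
        if score > best_score:
            best_item, best_score = item, score
    if best_score >= threshold:
        return best_item, best_score
    return None, best_score


if __name__ == "__main__":
    menu = ["Capricciosa", "Margherita", "Cheeseburger"]  # hypothetical menu
    print(match_mention("capriciosa", menu))  # misspelled mention
    print(match_mention("sok", menu))         # mention with no menu counterpart
```

In this sketch a misspelled mention such as "capriciosa" is still paired with "Capricciosa", while a mention with no close counterpart on the menu is left unmatched. In practice the threshold would have to be tuned on annotated data.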
