Volume 39, Issue 3 e12869
ORIGINAL ARTICLE

Multi-source BERT stack ensemble for cross-domain author profiling

José Pereira Delmondes Neto

School of Arts, Sciences and Humanities, University of São Paulo, São Paulo, Brazil
Ivandré Paraboni (Corresponding Author)

School of Arts, Sciences and Humanities, University of São Paulo, São Paulo, Brazil

Correspondence

Ivandré Paraboni, School of Arts, Sciences and Humanities, University of São Paulo, São Paulo, Brazil.

Email: [email protected]
First published: 01 November 2021

Funding information: University of São Paulo

Abstract

Author profiling is the computational task of inferring an author's demographics (e.g., gender or age) from text samples written by them. As in other text classification tasks, optimal results are usually obtained by using training data taken from the same text genre as the target application, in so-called in-domain settings. When training data in the required text genre is unavailable, a possible alternative is cross-domain author profiling, that is, building a model from a source domain (e.g., Facebook posts) and then using it to classify text in a different target domain (e.g., e-mails). Methods of this kind may, however, suffer from cross-domain vocabulary discrepancies and other difficulties. As a means of mitigating these, the present work discusses a strategy for cross-domain author profiling in which multiple source domains are combined in a stack ensemble architecture of pre-trained language models. Results from this approach are shown to compare favourably against standard single-source cross-domain author profiling, and to reduce overall accuracy loss relative to optimal in-domain gender and age classification.
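To illustrate the general idea of the stack ensemble described above, the following is a minimal sketch, not the authors' actual implementation. It assumes that one BERT classifier has already been fine-tuned per source domain (the model paths, the Portuguese checkpoint 'neuralmind/bert-base-portuguese-cased' behind them, and the logistic regression meta-classifier are all illustrative assumptions): each source-domain model outputs class probabilities for a text, these probabilities are concatenated into a feature vector, and a meta-classifier stacked on top produces the final gender or age prediction for the target domain.

```python
# Minimal sketch of a multi-source stack ensemble for author profiling.
# Assumptions (not taken from the paper): one BERT classifier has been
# fine-tuned per source domain, and a logistic regression meta-classifier
# stacks the per-domain class probabilities. Model paths are hypothetical.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.linear_model import LogisticRegression

SOURCE_MODELS = {                      # hypothetical paths to fine-tuned models
    "blogs":   "models/bert-blogsetbr-gender",
    "twitter": "models/bert-twisty-gender",
    "reviews": "models/bert-b2w-gender",
    "egov":    "models/bert-esic1br-gender",
}

def domain_probabilities(model_path, texts, batch_size=16):
    """Class probabilities assigned by one source-domain BERT classifier."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()
    probs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], truncation=True,
                              padding=True, max_length=256, return_tensors="pt")
            logits = model(**batch).logits
            probs.append(torch.softmax(logits, dim=-1).numpy())
    return np.vstack(probs)

def stacked_features(texts):
    """Concatenate the probability outputs of all source-domain models."""
    return np.hstack([domain_probabilities(path, texts)
                      for path in SOURCE_MODELS.values()])

def train_meta_classifier(dev_texts, dev_labels):
    """Fit the meta-classifier on held-out source-domain data."""
    meta = LogisticRegression(max_iter=1000)
    meta.fit(stacked_features(dev_texts), dev_labels)
    return meta

def predict_target(meta, target_texts):
    """Apply the stacked ensemble to texts from an unseen target domain."""
    return meta.predict(stacked_features(target_texts))
```

In this sketch the meta-classifier only sees per-domain probability outputs, so source domains with vocabularies far from the target contribute weaker but still usable signals; other stacking choices (e.g., using logits, or a different meta-learner) would fit the same architecture.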

CONFLICT OF INTEREST

The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available from the following URLs.

BlogSet-BR (blog domain) corpus: https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/blogset-br/

TwiSty (Twitter domain) corpus: https://www.clips.uantwerpen.be/clips.bak/datasets/twisty-corpus

B2W-Reviews01 (reviews domain) corpus: https://opencor.gitlab.io/corpora/real19b2wreviews01/

e-SIC1BR (e-gov domain) corpus: https://drive.google.com/file/d/12sFdgipuK2d1QyrTlnv5QwFj1Gs5mdnI/view

b5 (Facebook domain) corpus: https://drive.google.com/file/d/1tTygLuZKwNr5apLE4kcykj_Qiw8C7bqo/view
