Volume 2022, Issue 1 4262270
Research Article
Open Access

[Retracted] Research of Vertical Domain Entity Linking Method Fusing Bert-Binary

Hairong Wang (Corresponding Author), Beijing Zhou, Bo Li, and Xi Xu

School of Computer Science and Engineering, North Minzu University, 750021, China
First published: 23 August 2022
Academic Editor: Yuan Li

Abstract

To address the problems of unclear entity boundaries and low recognition accuracy in Chinese text, we construct a crop dataset and propose a Bert-binary-based entity linking method. Candidate entity sets are generated by matching entities across multiple data sources. The Bert-binary model is then invoked to compute the probability that each candidate entity is correct, and the highest-scoring entity is selected for linking. In comparative experiments with three models on the crop dataset, the F1 score improves by 2.5% over the best baseline and by 8.8% on average. The experimental results demonstrate the effectiveness of the proposed Bert-binary method.

1. Introduction

As a key technology of natural language processing, entity linking is widely used in knowledge representation, knowledge retrieval, and other fields [1]. Entity linking methods mainly include dictionary-based methods [2–4], search engine methods [5, 6], context augmentation methods, and deep learning methods. In recent years, with the development of deep learning and its wide application in various fields, entity linking methods based on deep learning have attracted increasing attention [5, 7–29]. Ganea and Hofmann [7] performed joint disambiguation in entity linking by combining local and global methods. Le and Titov [8] proposed a neural entity linking model that introduces entity relations, optimizing entity linking in an end-to-end manner with relations as latent variables. Hosseini et al. [9] proposed a neural embedding-based feature function that enhances implicit entity linking through prior term dependencies and entity-based feature function interpolation. Shengchen et al. [10] proposed a domain-integrated entity linking method based on relation indexing and representation learning, to address the problem that existing entity linking methods cannot combine text information and knowledge base information well. Xie et al. [11] proposed a GRCCEL (graph-ranking collective Chinese entity linking) algorithm to address the problems of ignoring semantic associations between entities and being limited by the size of the knowledge graph; it exploits the structural relationships between entities in the knowledge graph and the additional background information provided by external knowledge sources to obtain richer semantic and structural information and thus a stronger ability to distinguish similar entities. Xia et al. [12] proposed an integrated entity linking algorithm that uses topic distributions to represent knowledge and generate candidate entities. Rosales-Méndez et al. [13] proposed a fine-grained entity linking classification model to distinguish different types of entities and relationships. Wang et al. [14] explored methods to resolve the ambiguity of entities in heterogeneous information networks (HINs). Huang et al. [15] constructed an entity linking model combining a deep neural network with an association graph, which enriches mention and entity representations with character features, context, and deep semantic information and performs similarity matching. Li et al. [16] constructed an entity linking model with a candidate entity generation step that combines knowledge base matching and word vector similarity calculation. Rosales-Méndez et al. [13] also designed a questionnaire, proposed a fine-grained entity linking classification scheme based on the survey results, relabeled three general entity linking datasets according to the scheme, and created a fine-grained entity linking system. Zhou et al. [17] proposed a graph-based joint feature method that preprocesses the knowledge base and text, combines the semantic similarity of multiple features, and uses restarted random walks and joint disambiguation to select linked entities for mentions in a graph model. Zhan et al. [18] introduced the BERT pretrained language model into the entity linking task and used TextRank keyword extraction to enhance the topic information in the comprehensive description of the target entity.

The main tasks of entity linking are candidate entity generation and entity disambiguation [30]. Common methods of candidate entity generation include dictionary-based construction methods [19–22], context-based augmentation methods [23–25], and search engine-based construction methods [5, 26]. Entity disambiguation resolves the ambiguity of the generated candidate entity set to determine the target entity in the current context; its mainstream approach is entity disambiguation based on deep learning [7, 27–29].

We make the following contributions in this paper: (1) a Bert-binary entity linking method that combines Bert with a binary classification approach is proposed; (2) an algorithm for candidate entity set generation is given; (3) a candidate entity disambiguation method is proposed.

2. Model Framework

The entity linking method based on Bert-binary takes the text produced by named entity recognition, identifies one or more entity mentions to be linked, generates a candidate entity set, performs joint disambiguation over all entity mentions in the candidate entity set, and returns the target entity in the knowledge base corresponding to each entity mention. The process is shown in Figure 1.

Figure 1: Process of the Bert-binary entity linking method.

The Bert-binary entity linking method comprises two main tasks: candidate entity generation and candidate entity disambiguation.
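As a high-level illustration, the two stages can be composed as follows. This is a minimal Python sketch, not the paper's implementation; the helper functions generate_candidates and disambiguate are hypothetical stand-ins for the procedures detailed in Sections 3 and 4.

    # Minimal sketch of the Bert-binary pipeline. generate_candidates and
    # disambiguate are hypothetical stand-ins for Sections 3 and 4.
    def link_entities(mentions, context, named_entities):
        """Return the best knowledge-base entity for each recognized mention."""
        # Stage 1: build a candidate entity set per mention (Algorithm 1, Section 3).
        candidates = generate_candidates(mentions, named_entities)
        links = {}
        for mention in mentions:
            if not candidates[mention]:
                continue  # the mention has no counterpart in the knowledge base
            # Stage 2: score each candidate in context, keep the highest (Section 4).
            links[mention] = disambiguate(mention, context, candidates[mention])
        return links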

3. Candidate Entity Set Generation

The quality of the candidate entities determines the effect of entity linking [31]. The entity mentions identified by named entity recognition may be ambiguous: the same semantic entity may have different surface forms, and different semantic entities may share the same surface form. Therefore, this paper constructs, for a specific domain, an entity mapping table that should contain all candidate entities corresponding to each entity mention. Taking agriculture as an example, part of the constructed entity mapping is shown in Table 1.

Table 1. Entity mapping table.
Entity mention            Candidate entities corresponding to the mention
Rice                      Paddy, millet
Clouds of rice disease    Brown leaf blight, leaf burn
Peony leaf tip blight     Tip-white blight
Potato                    yáng yù, mǎ líng shŭ, dì dàn
Rice paddies aphids       Macrosiphum avenae, Sitobion avenae
Downy mildew              Yellow stunt
Rice blast                Fire blast, knock blast
Sheath blight             Sharp eyespot, moire disease

Based on the constructed domain entity mention mapping table, the type of each entity mention is determined from the mention and its context. The knowledge base and a semantic dictionary are then used to retrieve the candidate entities corresponding to the mention, producing the candidate entity set. The specific procedure is shown in Algorithm 1.

    Algorithm 1: Candidate entity generation.
  • Input: Dictionary D, a set of entity mentions M (m ∈ M), a set of named entities key = (k : value)

  • Output: Candidate entity set Em

  • 1: for m ∈ M do

  • 2:  if (k == m ‖ k ∈ m) then

  • 3:   Em = Em.add(key)

  • 4:  else if (k exactly matches the initial letters of the words in m) then

  • 5:   Em = Em.add(key)

  • 6:  end if

  • 7: end for

The candidate entity set is generated by fuzzy matching against a synonym dictionary and CNKI. Taking Ningxia rice as an example, the text to be processed is: “In recent years, varieties widely promoted in Ningxia production include: Gongyuan 4, Ningjing 23, Ningjing 24….” The entities contained in the text are “Ningxia,” “Gongyuan 4,” “Ningjing 24,” and “Ningjing 23.” Using Algorithm 1, “Ningxia” is mapped to the type “place name,” while “Gongyuan 4,” “Ningjing 24,” and “Ningjing 23” are mapped to the type “rice variety.”
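For illustration, the following is a minimal Python sketch of Algorithm 1. It assumes the named entities are given as an ordinary dict from entity names to knowledge-base values, and it approximates the initial-letter test of line 4 with a simple comparison of word initials; neither detail is specified in the paper.

    def generate_candidates(mentions, named_entities):
        """Sketch of Algorithm 1: build the candidate entity set Em per mention.

        mentions: iterable of entity mention strings (the set M)
        named_entities: dict mapping an entity name k to its knowledge-base value
        """
        candidates = {}
        for m in mentions:
            em = set()
            for k, value in named_entities.items():
                # Line 2: exact match, or the entity name occurs inside the mention.
                if k == m or k in m:
                    em.add((k, value))
                # Line 4: k matches the initial letters of the words in m.
                elif k.lower() == "".join(w[0] for w in m.split()).lower():
                    em.add((k, value))
            candidates[m] = em
        return candidates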

4. Candidate Entity Disambiguation

To handle the ambiguity that may remain in the generated candidate entity set, this paper adopts a binary classification approach: the Bert model is invoked to compute a probability score for each candidate entity, and the entity with the highest score is taken as the correct entity to link. The process is shown in Figure 2.

Figure 2: Process of candidate entity disambiguation.
The fully connected layer in the model acts as a classifier: it maps n real-valued inputs in $(-\infty, +\infty)$ to K real values in $(-\infty, +\infty)$, and the sigmoid activation function then maps those K values to probabilities in (0, 1), the K probabilities summing to 1. The calculation is

(1) $\hat{y} = \mathrm{sigmoid}(W^{T}x + b)$

where x is the input of the fully connected layer, $W^{T}$ is the weight, b is the bias, and $\hat{y}$ is the probability output by the sigmoid. The sigmoid function is calculated as

(2) $\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$

In general, the input of the last fully connected layer of a deep neural network serves as the feature the network has extracted from the input data, and the layer computes

(3) $z = W^{T}x + b$

Expanding formula (3) for class j gives

(4) $z_{j} = \sum_{n} w_{jn}x_{n} + b_{j}$

where $w_{jn}$ is the weight of feature $x_{n}$ under class j.

The feature vectors output by the BERT layer are concatenated and fed into the fully connected layer:

(5) $x = [S_{\mathrm{CLS}}; S_{\mathrm{begin}}; S_{\mathrm{end}}]$

where $S_{\mathrm{CLS}}$ is the output vector at the CLS position, $S_{\mathrm{begin}}$ is the feature vector at the beginning position of the candidate entity, and $S_{\mathrm{end}}$ is the feature vector at the end position of the candidate entity.

To mitigate overfitting during model training, a dropout layer is added between the two fully connected layers, reducing overfitting by reducing the number of active feature detectors. Because the entity disambiguation model uses only two fully connected layers and a sigmoid layer, a shallow neural network, the dropout rate is set to 0.15.
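A minimal PyTorch sketch of this classification head is given below, assuming a BERT hidden size of 768 and an intermediate width of 256 (neither is reported in the paper); the ReLU between the two layers is likewise an assumption, and the BERT encoder itself is elided.

    import torch
    import torch.nn as nn

    class BertBinaryHead(nn.Module):
        # Sketch of the disambiguation head: two fully connected layers with a
        # dropout layer between them and a sigmoid output. The hidden sizes and
        # the ReLU are assumptions; the BERT encoder is elided.
        def __init__(self, hidden=768, mid=256):
            super().__init__()
            self.fc1 = nn.Linear(3 * hidden, mid)  # input is [S_CLS; S_begin; S_end]
            self.dropout = nn.Dropout(p=0.15)      # dropout rate from the text
            self.fc2 = nn.Linear(mid, 1)

        def forward(self, s_cls, s_begin, s_end):
            x = torch.cat([s_cls, s_begin, s_end], dim=-1)  # formula (5)
            z = self.fc2(self.dropout(torch.relu(self.fc1(x))))
            return torch.sigmoid(z).squeeze(-1)  # probability the candidate is correct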

For the binary classification task, the loss function is

(6) $L = -\sum_{j}\left[y_{j}\log\hat{y}_{j} + (1 - y_{j})\log(1 - \hat{y}_{j})\right]$

where $\hat{y}_{j}$ is the model's predicted probability that sample j is positive and $y_{j}$ is the sample label, taking the value 1 if the sample is a positive example and 0 otherwise.
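As a brief illustration, one training step under this loss with the head sketched above might look as follows; the batch, feature vectors, and learning rate are dummy values, not settings from the paper.

    import torch
    import torch.nn as nn

    head = BertBinaryHead()                 # the sketch defined above
    criterion = nn.BCELoss()                # binary cross-entropy, formula (6)
    optimizer = torch.optim.Adam(head.parameters(), lr=2e-5)

    # Dummy BERT feature vectors for a batch of 4 candidates (assumed size 768).
    s_cls, s_begin, s_end = (torch.randn(4, 768) for _ in range(3))
    labels = torch.tensor([1.0, 0.0, 0.0, 1.0])  # 1 marks the correct candidate

    optimizer.zero_grad()
    probs = head(s_cls, s_begin, s_end)
    loss = criterion(probs, labels)
    loss.backward()
    optimizer.step()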

5. Method Validation

To verify our method, a crop text dataset is constructed for the agricultural domain; it contains a total of 24,779 named entities. Precision (P), recall (R), and F1 are used to evaluate the effectiveness of our method and are calculated as

(7) $P = \frac{TP}{TP + FP}$

(8) $R = \frac{TP}{TP + FN}$

(9) $F1 = \frac{2 \times P \times R}{P + R}$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
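These are the standard definitions; for illustration, a small Python helper computing them from raw counts:

    def precision_recall_f1(tp, fp, fn):
        """Standard precision, recall, and F1 from true/false positive counts."""
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # Illustrative counts only, not figures from the paper.
    print(precision_recall_f1(9247, 854, 753))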

The experiments are run in a Python environment. On the constructed crop dataset, our method is compared with BiLSTM-Attention [32], LSTM-CNN-CRF [33], and BiLSTM-CNN [34]. The experimental results are shown in Table 2.

Table 2. Comparison of experimental results.
Model                       P       R       F1
BiLSTM-Attention [32]       0.8013  0.7710  0.7859
LSTM-CNN-CRF [33]           0.8674  0.7734  0.8177
BiLSTM-CNN [34]             0.9144  0.8716  0.8925
Bert-binary (our method)    0.9154  0.9247  0.9175

Experimental results of BiLSTM-Attention [32] on crowdsourced annotated datasets in the field of information security showed that the model significantly outperforms BiLSTM-Attention-CR, CRF-MA, Dawid & Skene-LSTM, BiLSTM-Attention-CRF-VT, and other methods, with an F1 3.6% higher than the best of them. LSTM-CNN-CRF [33] used hand-crafted features, part-of-speech tagging information, and prebuilt lexicon information to augment sentence representations, improving named entity recognition performance. BiLSTM-CNN [34] incorporates the CNN's ability to extract local and long-distance dependency features. Compared with these three CNN- and LSTM-based methods, our method achieves an average improvement of 5.4% in precision, 11.9% in recall, and 8.8% in F1, confirming gains on all three metrics.

6. Conclusions

A Bert-binary method is proposed in this paper. It constructs an entity mapping table for a specific domain, uses a semantic dictionary to generate the candidate entity set, applies the Bert-binary algorithm to disambiguate the generated candidate entities, and selects the referent of each entity mention by the probability scores of its candidates. The agricultural domain is selected, a crop dataset is constructed, and a Python experimental environment is set up for comparative experiments with three models on the crop dataset. The experimental results show the effectiveness of the proposed Bert-binary method.

In the future, the combination of a fully connected network and sigmoid as the classifier can be improved by optimizing the decision variables of the neural network [35, 36] to obtain better accuracy, feasibility, and reliability.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the North Minzu University Major Education and Teaching Reform Projects (2021) and the Key Laboratory of Images & Graphics Intelligent Processing of State Ethnic Affairs Commission.

Data Availability

The crop dataset used to support the findings of this study has been deposited in the Baidu AI Studio repository (https://aistudio.baidu.com/aistudio/datasetdetail/153737).
