Correspondence Analysis From the Viewpoint of Compositional Tables
Funding: This work was supported by the Austrian Science Fund (grant DOI: 10.55776/I5799); the European Commission (no. CZ.02.01.01/00/23_025/008686); and the Czech Science Foundation (grant 22-15684L).
ABSTRACT
Correspondence analysis (CA), a well-known method for analyzing the relationships between rows and columns of a table, has been reformulated to link to the logratio methodology of compositional data by using the limiting case of the power transformation. The resulting methodology investigates relative rather than absolute information, and it is invariant with respect to rescaling rows or columns. The latter properties also hold for the analysis of compositional tables, where the table is first decomposed into an independent and an interaction part. It is shown that the analysis of the interaction part is equivalent to CA, but in addition, the variance contributions can be determined. Both concepts also allow for an inclusion of weights to suppress undesirable variance, and it is shown that the equivalence between weighted CA and the analysis of weighted compositional tables again holds. This equivalence allows us to make use of the mathematical framework of weighted compositional tables, the so-called Bayes spaces, to get a deeper understanding of CA and to construct extensions to multi-factorial tables (cubes, etc.).
1 Introduction
Correspondence analysis (CA) is a prominent method in exploratory data analysis that aims to analyze the relationships in a contingency table with either discrete-valued or continuous entries [1-3]. The main idea is to subtract the product of row and column marginals from the proportional representation of the contingency table (the so-called “correspondence matrix”), rescale it with respect to the marginal totals, and proceed with a singular value decomposition (SVD). This yields row and column information of the table, which can be visualized in order to study their relationships.
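The steps just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `classical_ca`, the toy table `N`, and the choice of standardizing by the square roots of the marginals (the usual matrix of standardized residuals) are assumptions made here for the example.

```python
import numpy as np

def classical_ca(N):
    """Sketch of classical CA for a two-way table N with positive marginals."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row (arithmetic) marginals
    c = P.sum(axis=0)                     # column (arithmetic) marginals
    # Subtract the independence model r c^T and rescale by the marginals
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal row coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal column coordinates
    return sv, row_coords, col_coords

# Hypothetical toy table, used only for illustration
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 20]])
sv, F, G = classical_ca(N)
total_inertia = (sv ** 2).sum()           # equals the chi-square statistic divided by n
```

With this standardization, the sum of squared singular values is the total inertia, that is, the chi-square statistic of the table divided by its grand total.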
With the clr coefficients, the compositional data are moved isometrically from their original sample space, endowed with the Aitchison geometry, to the real space (see, e.g., Pawlowsky-Glahn et al. [8]), where indeed standard SVD after (column-)centering results in a meaningful representation of loadings and scores in a biplot. Since logarithms of ratios are involved, this kind of procedure is generally known under the name logratio approach (see, e.g., Pawlowsky-Glahn et al. [8], Greenacre [9], and Filzmoser et al. [10]).
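As a hedged illustration of the logratio approach for a sample of compositions, the sketch below computes clr coefficients, column-centers them, and applies an SVD. The data matrix `X` and the names `scores` and `loadings` are hypothetical and serve only this example.

```python
import numpy as np

def clr(X):
    """Centered logratio coefficients of the compositions stored in the rows of X."""
    L = np.log(np.asarray(X, dtype=float))
    return L - L.mean(axis=1, keepdims=True)

# Hypothetical sample of four 3-part compositions (rows)
X = np.array([[0.20, 0.30, 0.50],
              [0.10, 0.60, 0.30],
              [0.40, 0.40, 0.20],
              [0.25, 0.25, 0.50]])

Z = clr(X)                                # each row sums to zero
Zc = Z - Z.mean(axis=0, keepdims=True)    # column-centering
U, sv, Vt = np.linalg.svd(Zc, full_matrices=False)
scores = U * sv                           # row (observation) coordinates
loadings = Vt.T                           # column (part) coordinates
```

Because the clr map is an isometry, Euclidean distances between rows of `Z` coincide with Aitchison distances between the corresponding compositions, and rescaling any row of `X` by a positive constant leaves `Z` unchanged.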
Recent developments in compositional data analysis, however, make it possible to proceed further with this limiting case. In particular, the concept of compositional tables also opens up new possibilities for the analysis of relative information in contingency tables [11-13]. As an example, a simple compositional table could be the number of employed people in a region, where the rows are determined by part-time and full-time employment, and the columns by males and females. If one is interested in comparing and analyzing different regions, relative rather than absolute information needs to be considered, as the absolute numbers would essentially be determined by the population size.
To make the link to the mentioned limiting case of correspondence analysis as presented in Greenacre [4, 14], in this paper only one table is considered, and the interest is again in identifying relationships between rows and columns. More specifically, the possibility of a decomposition into two tables, an independent and an interaction part, will be utilized. The former assumes independence of the row and column factors and consists of a product of the respective marginals, and the latter contains the information about their interaction and is numerically equal to the matrix used in correspondence analysis. However, the construction of the interaction table makes it possible to filter out the independent part from the original table completely and in a geometrically meaningful way, and thus provides a direct pathway to analyzing the remaining interactions. Besides the description of CA from the perspective of compositional tables, the paper also introduces a new mathematical framework for weighting parts of a compositional table and, consequently, for weighted CA, and establishes a solid foundation for an extension to the multi-factorial problem.
The structure of the paper is as follows. In the next section, a key relationship between double-centered log-transformed data and the interaction part of a compositional table (correspondence table) is derived. In Section 3, we generalize the findings to weighted versions of the methods and investigate their equivalence. Section 4 shows that the important property of distributional equivalence also holds for the logratio approach. Numerical experiments which reveal the advantages of using weights are presented in Section 5, and the final Section 6 concludes and provides an outlook to further extensions.
2 Unweighted CA
This section recalls classical (unweighted) CA as well as logratio analysis (LRA), which performs CA on log-transformed data. Moreover, we present the concept of compositional tables and show the link to LRA.
2.1 Logratio Analysis (LRA)
2.2 Compositional Tables
The two-factorial extension of the concept of compositional data to compositional tables [11, 12] treats as one data object, which follows the idea of CA. A convenient property of the approach is that the table can be orthogonally decomposed into an independent and an interactive part with respect to the Aitchison geometry [11]. This can be written as , with standing for the entry-wise product, known in the compositional context as the operation of perturbation.
Thus, the geometric marginals replace the arithmetic marginals and used in classical CA, see Equation (3). The geometric marginals have an important property: they are orthogonal projections of the compositional table on the information contained in its rows and columns, see Genest et al. [16] for more details. Note that in case of truly independent row and column factors, both the arithmetic and geometric marginals are equivalent [11], which underpins the reasonability of their definition.
This table carries information on associations between row and column factors. Therefore, it can serve as a natural compositional alternative to matrices used in classical CA (matrix of standardized residuals or matrix of Pearson contingency ratios).
de Sousa et al. [17], where stands for the geometric mean of the entire compositional table. Moreover, has uniform (zero) marginals and is thus margin free as mentioned in Greenacre [4] (Result 1).
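The decomposition described in this section can be checked numerically. The following sketch assumes that the independent part is the outer product of the row and column geometric marginals and that the interaction part is the original table divided entry-wise by it (both determined only up to scale); the function names and the toy table are hypothetical.

```python
import numpy as np

def decompose(X):
    """Split a positive table into independent and interaction parts (up to scale)."""
    X = np.asarray(X, dtype=float)
    L = np.log(X)
    row_gm = np.exp(L.mean(axis=1))       # row geometric marginals
    col_gm = np.exp(L.mean(axis=0))       # column geometric marginals
    X_ind = np.outer(row_gm, col_gm)      # independent part
    X_int = X / X_ind                     # interaction part
    return X_ind, X_int

def clr_table(X):
    """clr coefficients of the whole table treated as one composition."""
    L = np.log(np.asarray(X, dtype=float))
    return L - L.mean()

# Hypothetical 3 x 4 table, used only for illustration
X = np.array([[12., 5., 9., 30.],
              [ 7., 8., 2., 11.],
              [20., 3., 6.,  4.]])
X_ind, X_int = decompose(X)
C, C_ind, C_int = clr_table(X), clr_table(X_ind), clr_table(X_int)

# clr of the interaction part is the double-centered log table
L = np.log(X)
double_centered = (L - L.mean(axis=1, keepdims=True)
                     - L.mean(axis=0, keepdims=True) + L.mean())
```

In this sketch, `C_int` has zero row and column sums, `C` equals `C_ind + C_int`, and the two parts are orthogonal in the clr space, so their squared norms add up to the squared norm of `C`.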
2.3 CA of a Compositional Table
3 Generalization: Weighted CA
In real-world applications, the information on the actual data structure can be blurred due to problems related to sampling of the initial data matrix. The structure can be affected by measurement errors, unbalanced sizes of samples defining the rows, or the presence of cells with low observed values; see, for example, Greenacre and Lewi [15, sec. 2]. Also, in recent work, weighting of rows and columns in CA has been proposed and considered useful [18]. In the logratio context, weighting can be used to give less importance in the analysis to components with small proportions that often have high variance on the logratio scale [4, 15, 19]. In the following, we will compare weighted CA/LRA with a weighted CA version for compositional tables.
3.1 Weighted CA and LRA
Weighting in CA is commonly carried out by introducing row and column weights (typically row and column arithmetic marginals) in the double-centering stage. The vectors and forming the correspondence matrix , see Equation (3), are computed with respect to given weights. Consequently, weighting propagates also into the approximation stage, so that fitting is done by weighted least squares [15]. According to Greenacre and Lewi [15], weighted CA can alternatively be motivated by a matrix of Pearson contingency ratios , with , , . With the weights in form of the row and column marginal vectors and , and the matrices and , see also Equation (3), one can carry out an SVD of to perform a weighted CA, which then is equivalent to the weighted form of LRA. There are also other important contributions on the equivalence between correspondence analysis and weighted LRA indices: Greenacre [4] provided an empirical description of the respective transformation, Choulakian [20] later presented a mathematical formulation and proof, referring to it as Greenacre's Theorem, and Greenacre [14] subsequently offered an alternative mathematical formulation with a similar proof.
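A minimal sketch of weighted LRA is given below. It assumes the commonly used formulation in which the log-transformed table is double-centered with respect to the row and column weights and then rescaled by the square roots of these weights before the SVD; the function name `weighted_lra`, the default choice of arithmetic marginals as weights, and the toy table are assumptions for illustration only.

```python
import numpy as np

def weighted_lra(N, r=None, c=None):
    """Sketch of weighted logratio analysis of a strictly positive table N."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1) if r is None else np.asarray(r, dtype=float)  # row weights
    c = P.sum(axis=0) if c is None else np.asarray(c, dtype=float)  # column weights
    L = np.log(N)
    # Weighted double-centering: remove weighted row and column means
    Y = L - (L @ c)[:, None] - (r @ L)[None, :] + r @ L @ c
    # Rescale by square roots of the weights (weighted least-squares fit via SVD)
    S = np.sqrt(r)[:, None] * Y * np.sqrt(c)[None, :]
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    return Y, sv, U, Vt.T

# Hypothetical toy table
N = np.array([[12., 5., 9.],
              [ 7., 8., 2.],
              [20., 3., 6.],
              [ 4., 9., 15.]])
Y, sv, U, V = weighted_lra(N)
total_variance = (sv ** 2).sum()
```

When the weights sum to one, `total_variance` equals the weighted sum of squared log odds ratios over all pairs of rows and pairs of columns, which is one way to see that only relative information enters the analysis.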
3.2 Weighted CA of a Compositional Table
Let be an matrix of positive weights satisfying , that is, of a structure which is in agreement with weighted LRA [15] and which also corresponds to the product reference measure as used in Genest et al. [16]. Among other options, the vectors of weights and can be given by the arithmetic or geometric marginals, and possibly rescaled to unit sum. However, the rescaling would affect merely the scale of the final result, not the (weighted) data structure itself.
Therefore, the weighted version of CA based on a log-transformation (LRA) and the decomposition of a compositional table are equivalent.
3.3 Choice of Weights and Implications
The most appealing case of weighting is definitely the one with the standard arithmetic marginals of the original contingency table or its proportional representation, the correspondence matrix, as and . Due to scale invariance of weighting in compositional tables, both representations are now equivalent and the rescaling results just in a shrinkage or an expansion of the weighted space [19]. There are good reasons for weighting in the logratio CA: due to the logarithmic scale, row or column factor instances (variables) with small presence engender large variability, which is often a rather undesired effect [15]. The weighting also combines the advantages of both approaches: the simple interpretability of arithmetic marginals as weights is complemented by geometric marginals (i.e., arithmetic margins in the logarithmic scale), which are necessary to develop both the theoretical and the practical consequences of the decomposition into independent and interactive parts.
4 Distributional Equivalence of the Logratio Approach
Distributional equivalence is a natural requirement for CA and, more generally, for analyzing any ratio-scale data, including compositional data and compositional tables. In the former case, this requirement was already emphasized in the seminal work on CA [1], and it was further elaborated by Greenacre and Lewi [15], who used the formulation: If two columns (resp., two rows) have the same relative values, then merging them does not affect the distances between rows (resp., columns). An important aspect is what we understand by merging in the context of ratio-scale data. As the logarithm naturally moves the data from the ratio scale to the interval scale [26], the simple aggregation should be done there, possibly rescaled by the number of components. In the original scale this is just the geometric mean, which is also promoted in the literature on geometric grounds [27]. Likewise, in CA, distributional equivalence is related to an amalgamation of rows/columns in case of their proportionality, which, for similar reasons, should again be done in the log-scale. Due to scale invariance of compositional tables, from the perspective of the original scale, it makes no difference whether the aggregation is done in the log-scale or in the clr space. Then, if any two rows (or columns) of a compositional table carry the same relative information, or in other words, if they are a constant multiple of each other, the logratio CA is expected to remain unchanged irrespective of whether these rows (columns) are aggregated.
In case of compositional data, replacing two compositional parts with their respective geometric mean essentially means that the information contained in the ratio between these two parts is removed, and even more, it can be considered as the orthogonal projection of the original composition to the subspace of the remaining information [28]. Consequently, when considering a sample of compositional data, distances among the original compositions and among their aggregated counterparts remain the same. It is only important to keep the original dimensionality of the data; otherwise, the subcompositional dominance [8] necessarily applies.
The row geometric marginals for a table, where the corresponding elements of the first two rows are replaced by their geometric means, can then be expressed as perturbation of the original column geometric marginal by the -part composition . From the Yule perturbation property (cf. Genest et al. [16, Proposition 7]) it then follows that the interaction table remains unchanged.
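The invariance stated above can be illustrated numerically for the unweighted case. In the sketch below, the first two rows of a hypothetical table are proportional; replacing both by their entry-wise geometric mean (keeping the original dimension) leaves the clr representation of the interaction table, taken here to be the double-centered log table, unchanged. All names and values are assumptions for this example.

```python
import numpy as np

def clr_interaction(X):
    """clr coefficients of the interaction table: double-centered log table."""
    L = np.log(np.asarray(X, dtype=float))
    return (L - L.mean(axis=1, keepdims=True)
              - L.mean(axis=0, keepdims=True) + L.mean())

# Hypothetical table whose second row is three times the first row
X = np.array([[2., 4.,  8.],
              [6., 12., 24.],
              [5., 1.,  3.],
              [7., 2.,  9.]])

# Replace the two proportional rows by their entry-wise geometric mean
X_agg = X.copy()
gm = np.sqrt(X[0] * X[1])
X_agg[0] = gm
X_agg[1] = gm

same_interaction = np.allclose(clr_interaction(X), clr_interaction(X_agg))
```

The reason is that, for proportional rows, the replacement only shifts each of the two log rows by a constant, and such row-wise constants are removed by the double-centering.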
5 Numerical Experiments
In this section we aim to complement previous work by Greenacre and Lewi [15] and compare the stability of the results for the unweighted and weighted version of LRA. Accordingly, we perform an SVD of the matrices and . For the weights we will use the arithmetic marginals of the original table.
5.1 Bootstrapping Tables
The main idea is to draw bootstrap samples from the original table , where we assume that are integer-valued. Each entry and is replicated times, forming the rows of a data matrix in “long format,” with rows. Then we draw a bootstrap sample, that is, observations with replacement, and the resulting long format representation is aggregated to a table format, with the same rows and columns as the original table. Call this bootstrapped table , for , where the number of bootstrap tables can be large (e.g., 1000). The unweighted version for the original table results in a decomposition of the interaction table with the singular vectors arranged in and , respectively, see Equation (13), and similarly, we obtain the matrices and for the interaction table of the bootstrapped table. For the weighted version we obtain the interaction table , see Equation (20), and the orthonormal matrices and with the left and right singular vectors, respectively. Equivalently, we obtain for each bootstrapped weighted interaction table the corresponding matrices with the singular vectors and .
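The resampling scheme just described can be sketched as follows. The helper names `bootstrap_table` and `replace_zeros`, the seed, and the toy table are hypothetical; the value 2/3 for zero cells follows the replacement used later in the experiments.

```python
import numpy as np

def bootstrap_table(N, rng):
    """Draw one bootstrap table from an integer-valued table N."""
    N = np.asarray(N, dtype=int)
    I, J = N.shape
    # Long format: one entry (the flattened cell index) per observed unit
    cells = np.repeat(np.arange(I * J), N.ravel())
    # Resample all observations with replacement
    sample = rng.choice(cells, size=cells.size, replace=True)
    # Aggregate back to a table with the same rows and columns
    return np.bincount(sample, minlength=I * J).reshape(I, J)

def replace_zeros(T, value=2.0 / 3.0):
    """Replace zero counts by a small positive value before taking logarithms."""
    T = np.asarray(T, dtype=float).copy()
    T[T == 0] = value
    return T

rng = np.random.default_rng(0)            # seed chosen only for reproducibility
N = np.array([[8, 20, 0], [15, 3, 7]])    # hypothetical toy table
boot_tables = [bootstrap_table(N, rng) for _ in range(100)]
```

Each bootstrap table has the same grand total as the original one, and cells that are empty in the original table remain empty in every replicate. Drawing from a multinomial distribution with the observed cell proportions would give the same sampling distribution.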
5.2 Comparison by the Principal Angle
In the following experiments, we compare the angle of the results for the original (weighted) table with those for the bootstrapped (weighted) tables, where we use all singular vectors in the first case, but only the first two singular vectors in the second case. Thus, the idea is to see whether the results usually shown in a 2D plot for the bootstrapped version are related or even embedded in the space spanned by the results of the original data version. Note that the angle will be 0 when comparing the singular vectors of the smaller dimension of , and thus we will report , and similarly for the weighted versions.
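Principal angles between two subspaces can be computed from the singular values of the product of orthonormal bases. The sketch below is a generic illustration of this standard construction; which of the angles is reported in the experiments is not specified here, and the function name and test subspaces are assumptions.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)               # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)               # orthonormal basis of span(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# A 2D subspace embedded in a 3D subspace of R^4: all angles are zero
E = np.eye(4)
angles_embedded = principal_angles(E[:, :2], E[:, :3])

# Two lines in the plane at 45 degrees
angles_lines = principal_angles(np.array([[1.0], [0.0]]),
                                np.array([[1.0], [1.0]]))
```

The number of angles returned equals the smaller of the two subspace dimensions, so comparing two singular vectors with a full set of singular vectors yields two angles, both zero when the plane is embedded in the larger space.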
5.3 Spanish Health Survey Data
This data set originates from a Spanish health survey and it was analyzed in Greenacre [31]. Table 1 presents the data that have been used for CA. The rows refer to different age groups, and the columns refer to the health status as perceived by the 6371 respondents, with the corresponding frequencies in the cells. In contrast to the analysis presented in Greenacre [31] based on the frequencies, we are here interested in relative information, and thus treat the table as a compositional table. Some categories contain small frequencies, which can introduce a lot of undesirable variability in an unweighted analysis.
| Age group | Very good | Good | Regular | Bad | Very bad |
|---|---|---|---|---|---|
| 16–24 | 243 | 789 | 167 | 18 | 6 |
| 25–34 | 220 | 809 | 164 | 35 | 6 |
| 35–44 | 147 | 658 | 181 | 41 | 8 |
| 45–54 | 90 | 469 | 236 | 50 | 16 |
| 55–64 | 53 | 414 | 306 | 106 | 30 |
| 65–74 | 44 | 267 | 284 | 98 | 20 |
| 75+ | 20 | 136 | 157 | 66 | 17 |
Figure 1 presents the results from unweighted and weighted LRA. There are several changes visible, such as an exchange of the positions of “Very bad” and “Regular,” but also of “55–64” and “65–74.” The bootstrapped tables will introduce even more uncertainty in the categories with small frequencies, and the stability of the results will be investigated in the following. It is also interesting to compare the explained variances, shown along the axis legends in the plots of Figure 1: The first numbers refer to the explained variance within the interaction parts, while the second numbers refer to the proportion of explained variance from the overall (weighted) clr table. In the latter case, weighting here leads to a much higher variance proportion because the weights shift the information to a new origin which removes a lot of the information from the independent part.
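The two kinds of explained variance can be illustrated for the unweighted case with the data of Table 1. The sketch assumes that the variance within the interaction part is measured by the squared singular values of its clr representation (the double-centered log table), and that the overall variance is the squared norm of the clr coefficients of the whole table; it is not meant to reproduce the figures reported in the plots.

```python
import numpy as np

# Table 1: rows are age groups, columns are perceived health status
X = np.array([[243, 789, 167,  18,  6],
              [220, 809, 164,  35,  6],
              [147, 658, 181,  41,  8],
              [ 90, 469, 236,  50, 16],
              [ 53, 414, 306, 106, 30],
              [ 44, 267, 284,  98, 20],
              [ 20, 136, 157,  66, 17]], dtype=float)

L = np.log(X)
clr_x = L - L.mean()                                  # clr of the whole table
clr_int = (L - L.mean(axis=1, keepdims=True)
             - L.mean(axis=0, keepdims=True) + L.mean())  # interaction part

sv = np.linalg.svd(clr_int, compute_uv=False)
within = sv ** 2 / (sv ** 2).sum()        # share of variance within the interaction part
overall = sv ** 2 / (clr_x ** 2).sum()    # share of variance of the whole clr table

print(np.round(100 * within[:2], 1), np.round(100 * overall[:2], 1))
```

Since the independent and interaction parts are orthogonal, the entries of `overall` are never larger than those of `within`, and their sum equals the fraction of the total variance carried by the interaction table.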

Figure 2 presents the results from the bootstrap experiments, as described above. It can happen that zero frequencies occur in a bootstrap table. Such values were simply replaced by 2/3 to add the minimum possible variance [32]. The boxplots in the left plot compare the angles of the unweighted LRA with the weighted version, and it can be seen that the angles are clearly smaller for the weighted LRA. The right plot shows, for each bootstrap replication, the difference between the angle of the unweighted and the angle of the weighted version as boxplots, together with notches for confidence intervals around the median. The boxplots are split up into bootstrap experiments where the smallest value in the bootstrapped table was 0, 1, 2, and so forth, which is shown on the horizontal axis. This reveals that the medians of the differences are positive, and that they tend to be higher if the smallest value is smaller. Thus, weighting stabilizes the results, particularly if there are small frequencies involved.

5.4 Further Data Sets
We investigate for several other data sets known from the CA literature the stability of the results based on the bootstrap procedure as described above. For the choice of the data sets, we considered different aspects, such as the dimension of the table, values close to zero, low counts in some rows or columns, and so forth. Of course, here we consider the table information as compositional. As before, we will present the results from the bootstrap procedure in terms of the difference between the angle of the unweighted and the weighted version, see Figure 3.

Stores: Age distribution in food stores, used in Chapter 15 of Greenacre [3]. This small data set with counts consists of 5 stores (rows) and 5 age groups (columns). The counts are relatively balanced among the cells, with the smallest value of 8 and the largest of 69. The weighted version shows a slight improvement in stability of the results when compared to the unweighted version (see Figure 3).
Health2: We consider again the Spanish health data from Table 1, but aggregate the categories “Bad” and “Very bad” in order to avoid small counts. Figure 3 still reveals a clear advantage of weighting, possibly because the categories “Very good” and “75+” have a large variability.
Cups: A data set originating from the analysis of Roman glass cups, see Greenacre and Lewi [15], available as data cups in the R package easyCODA: concentrations of 47 observations for 11 chemical elements. The element “Mn” has very low values due to detection limit problems. This data set was used in Greenacre and Lewi [15] to illustrate the usefulness of weighting. We multiplied the concentrations, reported in % with 2 digits, by 100 to produce integers in order to make the data suitable for our bootstrap procedure. The results in Figure 3 indeed show a huge advantage concerning the stability of the results when using weighting.
News: Data about the news interest in Europe, see Chapter 19 of Greenacre [3]. The table consists of 34 countries (rows) and 18 categories (columns), and the frequencies are in the range from 18 to 652. There are no issues with small frequencies or big variabilities of values for single categories, and thus the results in Figure 3 are not in favor of any of the methods.
Galton: This data set originates from Galton [33], where the body heights of parents and their children are studied. We use the data as aggregated in tab. 1 of Cuadras and Greenacre [34], resulting in a table, with 20 cells containing zeros, which are replaced by 2/3. There are several other cells with small frequencies, distributed among several categories. This is a difficult situation, and here weighting by arithmetic marginals even leads to slightly more instability compared to the unweighted version.
Fish: Morphological data on Arctic charr fish, available as data fish in the R package easyCODA. We use the 26 morphological measurements for the 75 observations, and multiply the values by 100 to create integers, making them suitable for our bootstrap procedure. There are no particularly small values or categories with large variances in the table, and accordingly, the boxplot shows only a marginal advantage of using weights.
Figure 4 presents the simplicial deviances of the unweighted (U) and weighted (W) versions. The horizontal lines are the values for the original data, and the boxplots present the results of the 1000 bootstrapped tables. In all cases we can see a clear advantage of weighting, which allows much more information to be shifted to the interaction table.

6 Summary and Conclusions
CA is generally considered an exploratory data analysis tool. The method is motivated by an algorithm, and there is still a continuous discussion about its mathematical background; see, for example, Breitung [35]. The aim of this paper was to show that the link to the logratio methodology [36] as its limiting case (Greenacre [4]) can contribute to building a solid theoretical framework for CA. We have shown that the unweighted LRA, which performs an SVD of the compositions represented in centered logratio coefficients [4], is equivalent to an analysis of a compositional table. In the latter case, the whole table is considered a composition, and it is not treated as a sample of compositional data. Moreover, the orthogonal decomposition of a compositional table into its independent and interaction parts enables us to assess the explained variability not only within the interaction part (corresponding in the jargon of CA to contingency ratios) but also within the whole (logratio) correspondence table.
Compositional data, and also compositional (correspondence) tables as their two-factorial decomposition, are characterized by the property of scale invariance [8]. The possibility of approaching unweighted or weighted LRA with power transformation by different representations of the same contingency table (cf. Results 1 and 2 in Greenacre [4]) indicates that scale invariance in a truly compositional sense is not the main strength of CA. Weighting with compositional tables is achieved instead through a change of the reference measure in the respective Bayes space. This yields new insights into the usual choice of weighting in CA with row and column (arithmetic) marginals, and a natural increase of the explained variance within the whole table. On top of that, weighting with compositional tables is equivalent to the weighted CA.
The reformulation of CA using compositional tables also guarantees distributional equivalence for both the unweighted and the weighted case, while the unweighted CA so far lacked this feature. Clearly, the aggregation in distributional equivalence needs to be reformulated in terms of the geometric mean, as is the case for the marginals in a compositional table, and in general as a measure of central tendency in a truly ratio-scale analysis; however, from the viewpoint of the original scale, it is nothing but the usual aggregation in the log-scale [16, 26]. Still, for the weighted case the usual arithmetic marginals also play an important role in providing the absolute (interval-scale) information, which is essentially their role in data analysis. In this way, the benefits of both concepts of marginals can be utilized.
The whole concept can be easily extended to the case of -factors, , known under the name compositional cubes [37]. They represent the discrete version of the orthogonal decomposition of multivariate densities [16]. Due to the orthogonality of the decomposition, the explained variance of all possible combinations of factors can be assessed which is of particular importance in high-dimensional settings.
Of course, with the logratio approach, the problem of zeros naturally occurs, which needs to be carefully considered. The zeros in contingency tables can be, however, treated as so-called count zeros where a reasonable imputation by non-zero values is adequate, and approaches for this purpose are available [38]. Moreover, the effect of the imputation (and presence of zeros in general) is naturally downweighted in the weighted logratio CA by lowering the respective marginal values. Still, dealing with zeros in the logratio CA is one of the next challenges to be addressed.
Overall, the logratio approach to CA opens up many new potential avenues for how the field can be further developed. The R source code for both weighted and unweighted CA using the compositional tables methodology, and for the numerical experiments, is available at https://github.com/kfacevicova.
Author Contributions
K.F., K.H., and P.F. contributed with theory, numerical experiments, and with paper writing.
Acknowledgments
K.H. and K.F. were supported by the Czech Science Foundation (grant 22-15684L) and the project ReDiKid: Resilient Kid in Digital World (reg. no. CZ.02.01.01/00/23_025/008686), co-funded by the European Commission. P.F. was supported by the Austrian Science Fund (grant DOI: 10.55776/I5799). Open access funding was provided by Technische Universität Wien/KEMÖ.
Conflicts of Interest
The authors declare no conflicts of interest.
Open Research
Data Availability Statement
Data sharing is not applicable to this article as no new data were created or analyzed in this study.