Universal expressions of population change by the Price equation: Natural selection, information, and maximum entropy production
Abstract
The Price equation shows the unity between the fundamental expressions of change in biology, in information and entropy descriptions of populations, and in aspects of thermodynamics. The Price equation partitions the change in the average value of a metric between two populations. A population may be composed of organisms or particles or any members of a set to which we can assign probabilities. A metric may be biological fitness or physical energy or the output of an arbitrarily complicated function that assigns quantitative values to members of the population. The first part of the Price equation describes how directly applied forces change the probabilities assigned to members of the population when holding constant the metrical values of the members—a fixed metrical frame of reference. The second part describes how the metrical values change, altering the metrical frame of reference. In canonical examples, the direct forces balance the changing metrical frame of reference, leaving the average or total metrical values unchanged. In biology, relative reproductive success (fitness) remains invariant as a simple consequence of the conservation of total probability. In physics, systems often conserve total energy. Nonconservative metrics can be described by starting with conserved metrics, and then studying how coordinate transformations between conserved and nonconserved metrics alter the geometry of the dynamics and the aggregate values of populations. From this abstract perspective, key results from different subjects appear more simply as universal geometric principles for the dynamics of populations subject to the constraints of particular conserved quantities.
1 Introduction
Changes in populations can often be described by changes in probability distributions. The dynamics of probability distributions therefore sets the basis for much of theoretical population biology.
This article develops abstract principles for the dynamics of probability distributions. Those abstract principles deepen general understanding, leading to better connections of theoretical population biology to physics, statistics, and other population-based disciplines.
To understand the dynamics of probability distributions, one must consider the forces and constraints that influence the change in populations. Many methods can be used to study dynamics. Here, I apply the Price equation, a highly abstract description of change in populations. The abstractness of the Price equation facilitates discovery and understanding of connections between seemingly different disciplines.
I use the Price equation to show the essentially identical basis for fundamental equations of natural selection, entropy, and information. I emphasize the first steps in how one might go about building a common framework in which to understand the similarities and differences between various disciplines. From this abstract perspective, key results from different subjects appear more simply as universal geometric principles for the dynamics of populations subject to the constraints of particular conserved quantities.
2 Overview
This article provides the basis for unifying diverse subjects. Given the incompatible goals, methods, languages, and cultures of the different disciplines, it is useful to begin with an extended overview.
This overview serves only to orient in the direction of what follows, not as a complete summary unto itself. Readers who prefer to start with the details may wish to skip this section.
Sections 3-5 introduce the Price equation and prepare for application to different subjects. In the Price equation, a population consists of different types. Each type associates with a frequency or probability and with a property. I assume that the properties are quantitative values. I use the words frequency and probability interchangeably. In other contexts, there may be good reasons to distinguish between these words.
The Price equation partitions the total change between two populations into a part caused by changes in frequencies and a part caused by changes in properties. That separation allows clear understanding of dynamics in terms of changes in probability distributions and changes in population quantities, such as biological fitness or physical energy or economic wealth.
Section 6 presents the canonical equation of conservation in populations, in which the change caused by frequency differences balances the change caused by property value differences. In biology, this equation represents the fact that the average of relative reproductive success (fitness) cannot change, because increases in relative fitness caused by natural selection must be exactly balanced by decreases in relative fitness caused by the changed state of the population.
The conservation of relative fitness arises directly from the conservation of total probability. Alternative measures of property values can be understood as geometric coordinate transformations from the property of fitness (frequency change) to alternative measures that often lead to nonconservative changes in populations. For example, a logarithmic measure of fitness leads to classical measures of information.
Section 7 describes various identities and alternative partitions for the conservation of total probability. The different notational forms provide the basis for connecting seemingly different subjects to the common underlying geometric principles.
Section 8 considers frequency changes in relation to an abstract notion of force. By expressing frequency changes in terms of force, the Price equation partitions the conservation of total probability into two balancing components of change. The first component arises from directly acting forces with respect to a fixed frame of reference for the quantitative properties. The second balancing component of change arises from the inertial forces that alter the frame of reference.
The balance between the consequences of the direct and inertial forces provides an analogy to d'Alembert's principle of mechanics. That connection establishes a first step in relating different disciplines to the common underlying geometric foundation.
Sections 9-11 transform the quantitative property of frequency change into logarithmic coordinates. In the canonical Price equation's partition of conserved total probability into direct and inertial components, the property of each type is its frequency change or growth rate, an analogy with biological fitness. In particular, the relative growth, or fitness, of the $i$th type is $w_i = q_i'/q_i$, the ratio of the derived frequency, $q_i'$, relative to the initial frequency, $q_i$.
The change between the initial and derived frequency can be considered as a path divided into segments, in which the overall growth, or fitness, arises by multiplication of the fitnesses along each segment of the path.
If we transform our focal property of fitness to logarithmic coordinates, then we can add component property values along the segments of a path, achieving an additive geometry of change that greatly enhances the power of analysis and interpretation. The classical notions of information and entropy follow immediately from use of the logarithmic coordinates in the canonical Price equation partition of conserved total probability.
Sections 12 and 13 continue to set the geometric foundations for analysis. When we divide a path of change into many small segments, then we can think of overall change as the combination of many small instantaneous changes in response to directly applied force at each point along the path.
For small changes, the direct force at each point becomes approximately the same for the initial linear coordinates of change, wi, and the logarithmic coordinates, log wi, apart from a constant shift that does not alter the dynamics. The convergence of linear and logarithmic coordinates with respect to small changes explains the common forms of many fundamental results in different fields of study.
Section 14 develops two complementary abstract notions of force. In the canonical expression of the Price equation for the conservation of total probability, the “fitness” term simply describes the change in frequencies relative to the fixed frame of reference given by the initial frequencies. One may treat this description of change as an inductive expression of an underlying force.
Alternatively, it often makes sense to consider the initial frequencies and forces as given, from which one deduces the change in frequency. This section expresses the given forces geometrically by the separation between the initial frequencies, $q_i$, and a given target point. By expressing force in this way, we have a common geometric basis for the inductive and deductive perspectives.
Section 15 develops the deductive perspective by deriving the changes in frequencies for given initial frequencies and given forces. The analysis applies the Lagrangian method, which maximizes the first component of the Price equation partition. That first component is an abstraction of the classical mechanics action term, as the virtual work of the direct forces with respect to a fixed frame of reference. The Lagrangian method generalizes the principle of least action.
The Lagrangian also includes various forces of constraint, such as the conservation of total probability, and any additional forces associated with other conserved quantities. The forces of constraint impose a limited set of potential paths that may be followed in the geometric space of frequency change. The actual path of change extremizes the action among those paths that are consistent with the forces of constraint.
Sections 16-18 present a partial maximum entropy production principle that follows from the dynamics of frequency change. To obtain this result, I partition the direct force into two components. The first component becomes an additional force of constraint that expresses the invariance imposed by the conservation of some system quantity, such as energy or biomass or the direct change in some value. The remaining component of the direct force is $-\log q_i$, which can be thought of as the entropy or information in the $i$th dimension.
The entropy becomes the action term maximized by the path of change, leading to a path that maximizes the production of entropy. Because the maximization is taken with respect to the fixed frame of reference defined by the initial population, ignoring any inertial forces that alter the frame of reference, one can think of the entropy production as the result of a partial change holding constant the frame of reference—the partial maximum entropy production principle.
Sections 19 and 20 develop the notion of a conserved system quantity as a force of constraint. Jaynes' maximum entropy analysis of thermodynamics and probability patterns follows as a special case of the general geometric principles of change in populations developed in earlier sections. From Jaynes' work and the later extensions of his theory to simple invariance principles, we have a unified framework in which to understand the relations between commonly observed probability distributions.
Section 21 discusses alternative ways in which to interpret maximum entropy paths. I argue that the most basic principles derive from the underlying geometry. Notions of entropy and information are simply interpretations of that geometry applied to particular disciplines of study.
Section 22 relates the path of change for populations to the Fisher information metric. That metric arises frequently in particular disciplines, including the fundamental approaches of information geometry.
Sections 23 and 24 briefly review key results. The Appendix provides brief histories of key topics and background references.
3 Separation of Frequency and Property
The Price equation provides an abstract way in which to analyze changes in populations. The equation separates the frequency of entities from the property of those entities (Frank, 2012a; Price, 1972a).
Suppose, for example, that for entities with label i, we express frequency as qi and the average of the associated property value as zi. The zi values can be height, or energy level, or any quantity.


in which the dot product, Δq · z, is understood in the usual way as the sum of the element-wise product of two vectors.
Alternatively, one may separate frequency from property. Thus, we have differences in frequency, $\Delta q_i = q_i' - q_i$, and differences in property values, $\Delta z_i = z_i' - z_i$.
For example, a transportation planner might study the overall assessment of changing modes of transport in a population. The index i could label different transportation modes, such as automobile, train, and so on. The frequency qi is the fraction of individuals who travel by a particular mode. The quantity zi may be the relative assessment for the value associated with a transportation mode.
The separation of frequency and property allows a more general description of change. Changes in the total assessment of transportation can arise from changes in the frequencies of usage, $\Delta q_i$, and from changes in the assessment of value for each mode, $\Delta z_i$.
4 Set Mapping of Labels Between Populations
Our goal is to describe the change between two populations. We may arbitrarily label one population as the ancestor and the second population as the descendant. The general formulation concerns only the differences between populations, independently of any particular underlying scale of separation, such as space or time or updating in light of new evidence. In this section, I consider the example of separation between populations by time.
The term $\Delta q_i = q_i' - q_i$ is the change in the descendant frequency, $q_i'$, compared with the ancestral frequency, $q_i$. For the transportation example, one would typically read this as the frequency of people traveling by train or other mode, $i$, at two different times. If the frequency of people traveling by train is increasing, then $\Delta q_i$ is positive. That interpretation is intuitive and nearly universal.
The Price equation allows a more abstract notion of the mapping between sets. Let $q_i'$ be the frequency of entities in the second population that derive from type $i$ in the first population. Thus, for travel mode by train, $q_i'$ would be the frequency of individuals in the descendant population who derived from, or map to, train travelers in the ancestral population.
Consider two interpretations. First, $q_i'$ and $q_i$ could have their traditional meaning of the frequencies of train travelers at each point in time. For example, change may occur by social contagion, in which people become train travelers only by learning about trains from someone who already travels by train; an individual train traveler maps to self as a descendant train traveler. In this case, each descendant train traveler maps to a train traveler in the ancestral population. Positive $\Delta q_i$ reflects growth of the $i$th class by successful recruitment.
In a second interpretation, we could map descendant individuals to their mothers. Then, Δqi has to do with the number of babies produced by each mother. In this case, a descendant's label i is defined only by ancestral type. Descendants do not have their own types, only their mapping to an ancestral i.
We handle the fact that descendants may use travel modes that differ from those of their mothers by adjusting the change in property value, $\Delta z_i$. For mothers who travel by train, with property value $z_i$, their descendants have some average property value, $z_i'$, that accounts for both changes in travel mode by descendants and changes in property value associated with each travel mode.
In the general, abstract interpretation, the label i applies only to the initial, or ancestral set. All entities from the second, or descendant, population map to ancestors, and thus derive their labels from their ancestors. We can use partial assignments, so that a descendant is made up of various fractions of ancestors, each descendant part accounted for separately by its assignment to an ancestral label, i.
At first glance, this set mapping abstraction may seem rather complicated and obscure. However, its great power arises from the fact that nearly all studies of changes in populations can be described by specific mapping assumptions and associated interpretations. Thus, anything that we can prove about the general abstract setup applies to the very many apparently different special cases that arise in different applications.
5 The Price Equation


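The Price equation partitions the change in the average property value, $\Delta\bar{z} = \bar{z}' - \bar{z}$, as

$$\Delta\bar{z} = \Delta q \cdot z + q' \cdot \Delta z,$$

in which a dot product is understood in the usual way as $q \cdot z = \sum_i q_i z_i$.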
This equation can be interpreted in various ways, as discussed in prior sections. In general analysis, I adopt the most abstract interpretation with regard to set mapping between two populations. Roughly speaking, we can take $q_i$ to be the frequency associated with a subset, $i$, of the initial population, such that the total frequency is $\sum_i q_i = 1$. Thus, $\bar{z} = q \cdot z$ is the average of $z$.
Here, $z_i$ is an arbitrary function that maps $i$ to some property value, and $z_i$ is interpreted as the average of $z$ in each dimension or subset, $i$. Because $z$ can be any quantity, calculated in any way, this equation gives the most general expression for $\Delta\bar{z}$, the change in the average of $z$. One can think of $\bar{z}$ as a functional of the arbitrary function, $z$, that maps $i \mapsto z_i$.






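To see why the partition holds, write the change in the average as $\Delta\bar{z} = q' \cdot z' - q \cdot z$. Adding and subtracting $q' \cdot z$ gives

$$\Delta\bar{z} = (q' - q) \cdot z + q' \cdot (z' - z) = \Delta q \cdot z + q' \cdot \Delta z,$$

an explicit expression for the change in average values. Because $z$ can be defined in any way, this expression describes the change in any quantitative property of populations.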
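The partition can be checked numerically. The following Python sketch is illustrative only: the helper names (`dot`, `price_partition`) and the three transport modes with their frequencies and assessed values are hypothetical, not taken from the article. It assumes the partition in the form $\Delta q \cdot z + q' \cdot \Delta z$.

```python
def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def price_partition(q, q_next, z, z_next):
    """Return (frequency part, property part) of the change in the average of z."""
    dq = [b - a for a, b in zip(q, q_next)]
    dz = [b - a for a, b in zip(z, z_next)]
    frequency_part = dot(dq, z)       # change in frequencies, property values held fixed
    property_part = dot(q_next, dz)   # change in property values, weighted by new frequencies
    return frequency_part, property_part


# Hypothetical modes: automobile, train, bicycle.
q = [0.6, 0.3, 0.1]          # ancestral frequencies
q_next = [0.5, 0.35, 0.15]   # descendant frequencies
z = [2.0, 3.0, 4.0]          # ancestral assessed values
z_next = [2.5, 3.0, 3.5]     # descendant assessed values

freq_part, prop_part = price_partition(q, q_next, z, z_next)
total_change = dot(q_next, z_next) - dot(q, z)
print(freq_part, prop_part, total_change)
```

The two parts sum to the total change in the average for any choice of frequencies and values, because the partition is an identity rather than a model.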
6 Biological Fitness and the Conservation of Total Probability



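Define the fitness of type $i$ as the ratio of descendant to ancestral frequency, $w_i = q_i'/q_i$, so that $q_i' = q_i w_i$. Average fitness is then

$$\bar{w} = q \cdot w = \sum_i q_i' = 1,$$

because the total frequency or probability is always a conserved value of one. In some articles, $w_i$ is taken as an absolute measure of the number of descendants assigned to type $i$, and $\bar{w}$ is the average number of descendants, which may differ from one. In that case, $w_i/\bar{w}$ is relative fitness. Here, I am using $w_i$ as the measure of relative fitness, with $\bar{w}$ always equal to one. The following analysis does not differ under the alternative definitions, but it is important to keep in mind the distinct definitions that may be used.

Using fitness, $w$, as the property, $z$, in the Price equation yields the canonical partition for the conservation of total probability,

$$\Delta\bar{w} = \Delta q \cdot w + q' \cdot \Delta w = 0.$$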
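A small numerical sketch may help fix the definitions. It assumes that relative fitness is $w_i = q_i'/q_i$ and that the change in fitness, $\Delta w$, is taken between two successive steps, which requires a third population; that reading, the helper names, and all numbers are assumptions made for illustration.

```python
def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def relative_fitness(q, q_next):
    """w_i = q'_i / q_i for each type i (requires q_i > 0)."""
    return [b / a for a, b in zip(q, q_next)]


def normalize_fitness(q, absolute_w):
    """Convert absolute fitness (descendants per type) to relative fitness."""
    mean_w = dot(q, absolute_w)
    return [w / mean_w for w in absolute_w]


# Three successive populations (hypothetical frequencies).
q0 = [0.5, 0.3, 0.2]
q1 = [0.4, 0.35, 0.25]
q2 = [0.3, 0.4, 0.3]

w = relative_fitness(q0, q1)        # fitness over the first step
w_next = relative_fitness(q1, q2)   # fitness over the second step
mean_w = dot(q0, w)                 # equals the total descendant frequency

dq = [b - a for a, b in zip(q0, q1)]
dw = [b - a for a, b in zip(w, w_next)]
direct_part = dot(dq, w)    # change from frequencies, fitness values held fixed
frame_part = dot(q1, dw)    # change from the altered fitness values

# Absolute fitness (hypothetical offspring counts) rescaled to mean one.
rel_w = normalize_fitness(q0, [1.2, 2.0, 3.0])
mean_rel_w = dot(q0, rel_w)
print(mean_w, direct_part, frame_part, mean_rel_w)
```

The two parts cancel exactly, so average relative fitness stays at one however the frequencies change.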


7 Identities for the Conservation of Probability
We may express the conservation of total probability in a variety of equivalent forms. This section shows some of the variants. The purpose of these variants is to set up the discussion in the next section, in which we interpret the Price equation partition in Equation 6 as a partition of total change into two parts. The first part is the change ascribed to direct forces, F. The second part is the change ascribed to the altered context of the population, which may be thought of as a change in the frame of reference caused by inertial forces, I.

We will need a toolkit of notational variants to establish this form and to show the connections between seemingly different subjects. It is a bit tedious to set up the various notational identities, but it is important to do so to develop alternative interpretations and to avoid confusion. On first reading, one may wish to skim quickly through this section and then refer back to the notations as needed.


which we will nonetheless find quite useful, because the partition provides some hints about the balance of direct and inertial forces in a conservative system. Before turning to that balance of forces in the next section, it is useful to consider some additional identities.

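Using $a_i = w_i - 1 = \Delta q_i/q_i$, the first term of the partition can be written as

$$\Delta q \cdot w = q \cdot a^2 = V_w,$$

in which $a^2$ is the vector of the squared terms, $a_i^2$, and thus, $q \cdot a^2$ is the second moment of $a$. Here, $V_w$ is the variance in relative fitness, because $a_i = w_i - 1$ is relative fitness shifted so that the mean value of $a$ is zero. Thus, the second moment of $a$ is the variance.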
The term $q \cdot a^2$ can be thought of as a squared distance starting from an initial point at zero and moving through the distance given by the sum of the squared deviations in each dimension, $a_i^2$, each dimension weighted by its frequency, $q_i$. Thus, the distance that the population moves in frequency space, caused by the changes in frequency given by variable fitnesses, is equivalent to the variance in fitness. Put another way, the reason that the variance in fitness always arises as the key metric in population change is that the variance describes the distance that the population moves.
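The equivalence between the direct change, the second moment of $a$, and the variance in fitness can be checked with a few lines of Python. The sketch assumes $w_i = q_i'/q_i$ and $a_i = w_i - 1$; the frequencies and variable names are hypothetical.

```python
def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


q = [0.5, 0.3, 0.2]          # initial frequencies (hypothetical)
q_next = [0.4, 0.35, 0.25]   # derived frequencies (hypothetical)

w = [b / a for a, b in zip(q, q_next)]   # relative fitness
a = [wi - 1.0 for wi in w]               # fitness shifted to mean zero
dq = [b - a_ for a_, b in zip(q, q_next)]

direct_change = dot(dq, w)                         # first term of the partition
second_moment = dot(q, [ai * ai for ai in a])      # frequency-weighted squared distance
mean_w = dot(q, w)
variance_w = dot(q, [(wi - mean_w) ** 2 for wi in w])
print(direct_change, second_moment, variance_w)
```

All three printed values agree, which is the sense in which the variance in fitness measures the distance moved in frequency space.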


in which a ratio of vectors implies element-wise division, and vectors distribute through parentheses as dot products.

which measures the nonlinearity, or bending, in the changes of q in subsequent steps, which is roughly like an acceleration.
Note that Equation 11 has Δqi terms in the denominator, which may appear to be problematic when such terms include zero values. However, each term is always part of a dot product, yielding values of for each term; thus, we can always interpret such terms directly by their actual value. The reason for splitting the terms in the manner of Equation 11 follows at the end of this section.

by the conservation of total probability. However, in each individual dimension, i, the value of is not necessarily zero. Although the total value is constrained to be zero, it is often useful to retain this term to emphasize the fact that the values in each dimension can vary.


8 Balance of Direct and Inertial Forces
The previous sections described the conservation of total probability, which imposes strong constraints on the geometry of change in populations. In particular, the dynamics of probability distributions must move along the constraint that the total probability remains unchanged. Within that constraint, the probability distributions that characterize populations may change in response to directly applied forces, such as biological fitness or physical forces or informational processes.

This balance is analogous to d'Alembert's principle of mechanics, $(F + I) \cdot \Delta q = 0$. The term $F$ is the vector of direct forces acting on the system, and the term $I$ is the vector of inertial forces that balance the direct forces to achieve no net change. d'Alembert's principle can be thought of as a generalization of Newton's second law of motion (Lanczos, 1986), in which $\tilde{F} = \mu\tilde{A}$ is read as the total force, $\tilde{F}$, equals mass, $\mu$, times total acceleration, $\tilde{A}$. Total force and total acceleration must include forces of constraint, which in our case means that $\sum_i \Delta q_i = 0$. If we write total inertial force as $\tilde{I} = -\mu\tilde{A}$, then Newton's law is $\tilde{F} + \tilde{I} = 0$.
In d'Alembert's formulation, the direct and inertial forces typically do not sum to zero, $F + I \neq 0$, because those terms do not include the constraining forces that act on $\Delta q$. Instead, in d'Alembert's expression $(F + I) \cdot \Delta q = 0$, the term $\Delta q \cdot F$ combines the direct and constraining forces, and the term $\Delta q \cdot I$ combines all inertial forces, including any forces of constraint. Newton's law is a special case of the more general principle of d'Alembert (Lanczos, 1986).
Here is a simple intuitive description of d'Alembert's principle (Wikipedia, 2015). You are sitting in a car at rest, and the car suddenly accelerates. You feel thrown back into the seat. But, even as the car gains speed, you effectively do not move in relation to the frame of reference of the car: Your velocity relative to the car remains zero. That net zero velocity can be thought of as the balance between the direct force of the seat pushing on you and the inertial force sending you back as the car accelerates forward.
As long as your frame of reference moves with you, then your net motion in your frame of reference is zero. Put another way, there is a changing frame of reference that zeroes net change by balancing the work of direct forces against the work of inertial forces. Although the system is a dynamic expression of changing components, it also has an overall static, equilibrium quality that aids analysis. As Lanczos (1986) emphasizes, d'Alembert's principle “focuses attention on the forces, not on the moving body…”


For frequency changes, one can think of a coordinate system that locates a population as a point defined by the population's frequency or probability distribution. The direct work done to move the population in that coordinate system is Δq · F, the sum of the force multiplied by the displacement in each dimension, calculated when holding constant the frame of reference defined by the coordinate system. That direct work is balanced by the inertial work done to accelerate the reference frame coordinate system by a total amount Δq · I, which relocates the altered population and its associated forces so that it appears in the new frame of reference to have a net total displacement multiplied by force of zero.
I use the word “force” here in an abstract, nondimensional manner, rather than in the specifically defined manner of classical physics. Such words can be a barrier to interdisciplinary insight and understanding. Readers highly trained in particular disciplines, such as physics, sometimes believe that a word such as “force” has a single correct meaning and associated units of expression. Any variant use of the word is thought to be misleading or mistaken. I take the opposite view. The underlying nondimensional geometry expresses the purest abstract notion of such concepts.
In each separate discipline, the particular dynamics and related equations have terms that take on specific interpretations, units, and meaning. Those specific aspects arise from the application of the same underlying universal geometry to particular problems, which usually means the same underlying conserved quantities and associated symmetries. The same geometry and abstract concepts will take on different units and interpretations in different disciplines.
9 Average Force Along a Path
In the Price equation description of change, we have only the differences between two populations. The two populations describe the initial and final probability distributions, q and q′. Each distribution can be thought of as a single point in a space of probability distributions. The separation between the two points is a nondimensional change that can be small or large. There is no underlying parameter, such as time or spatial distance, that defines the scale of separation and the path of change that connects the points.
Most applications analyze changes along a path with respect to an underlying parametric scale. To relate the Price equation to other theoretical frameworks, it is useful to add an abstract notion of change along a parametric path that connects the initial and final probability distributions.


Write the frequency change in each dimension in exponential form,

$$q_i' = q_i e^{r_i \theta}.$$

We can think of $r_i$ as the average force acting along the path that moves the system from $q_i$ to $q_i'$ with respect to total path length, $\theta = \Delta s^2$, in the parametric length scale, $s$. Thus, $r_i\theta$ is the total force in the $i$th dimension along the path of change. For our purposes, we can treat $s$ as a nondimensional scale, and think of $r_i$ as having nondimensional units of $1/s^2$, interpreted as a nondimensional force or acceleration. In biology, the force $r_i$ is interpreted as the Malthusian expression of biological fitness in analyses of natural selection, connecting the abstract analysis here to models of biological evolution (Frank, 2015).

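Equivalently, in logarithmic coordinates,

$$r_i = \frac{1}{\theta} \log \frac{q_i'}{q_i} = \frac{\Delta \log q_i}{\Delta\theta},$$

so that we may think of $r_i$ as the average change in logarithmic coordinates of probability with respect to changes in the parametric length scale, $\Delta\theta = \Delta s^2$.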
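As a rough sketch of this definition, the following Python code computes $r_i$ as the change in $\log q_i$ per unit of path length and then recovers the derived frequencies by exponential growth. The exponential form $q_i' = q_i e^{r_i\theta}$, the function names, and the numbers are assumptions made for illustration.

```python
import math


def average_force(q, q_next, theta):
    """r_i = (log q'_i - log q_i) / theta, the average change in log coordinates."""
    return [math.log(b / a) / theta for a, b in zip(q, q_next)]


def grow(q, r, theta):
    """Apply exponential growth q_i * exp(r_i * theta) in each dimension."""
    return [a * math.exp(ri * theta) for a, ri in zip(q, r)]


q = [0.5, 0.3, 0.2]          # initial frequencies (hypothetical)
q_next = [0.4, 0.35, 0.25]   # derived frequencies (hypothetical)
theta = 0.5                  # total path length on the parametric scale (hypothetical)

r = average_force(q, q_next, theta)
recovered = grow(q, r, theta)
print(r, recovered)
```

Changing `theta` rescales every $r_i$ by the same factor but leaves the recovered frequencies unchanged, so the choice of parametric scale does not alter the endpoints of the path.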


10 Comparing Linear and Logarithmic Coordinates

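The fitness along a path of change multiplies across the segments of the path,

$$w_i = \frac{q_i'}{q_i} = \prod_{k=1}^{n} w_i^{(k)},$$

in which the product separates $w_i$ into the segments $w_i^{(k)} = q_i^{(k)}/q_i^{(k-1)}$, with $q^{(0)} = q$ and $q^{(n)} = q'$ marking the ends of a path of intermediate populations between $q$ and $q'$.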


We can decompose any fitness value and its associated vector, (q, q′), into a large number of small pieces. In principle, we could analyze large changes in frequency, Δq = q′ − q, by combining the changes along each small segment in a decomposition of total change.
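The decomposition into segments can be sketched in Python. The path of intermediate populations below is hypothetical, as are the names; the point is only that fitness multiplies along the segments whereas log fitness adds.

```python
import math

# A hypothetical path of populations from q (first) to q' (last).
path = [
    [0.50, 0.30, 0.20],
    [0.45, 0.33, 0.22],
    [0.40, 0.36, 0.24],
    [0.35, 0.40, 0.25],
]

# Fitness of each type along each segment: ratio of successive frequencies.
segments = [
    [b / a for a, b in zip(p, p_next)]
    for p, p_next in zip(path, path[1:])
]

# Overall fitness between the endpoints of the path.
overall = [b / a for a, b in zip(path[0], path[-1])]

# Linear coordinates: segment fitnesses multiply.
product = [math.prod(ws) for ws in zip(*segments)]

# Logarithmic coordinates: segment log fitnesses add.
log_sum = [sum(math.log(w) for w in ws) for ws in zip(*segments)]
print(overall, product, log_sum)
```

The additive form is what makes logarithmic coordinates convenient when a large change is built up from many small pieces.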
11 Log Coordinates, Entropy and Information


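In logarithmic coordinates, the average of log fitness is

$$q \cdot \log w = \sum_i q_i \log \frac{q_i'}{q_i} = -\mathcal{D}(q \,\|\, q'),$$

in which

$$\mathcal{D}(q \,\|\, q') = \sum_i q_i \log \frac{q_i}{q_i'}$$

is the Kullback–Leibler divergence (Cover & Thomas, 1991; Kullback, 1959). This divergence measures relative entropy by extending the classical measure of entropy, $-q \cdot \log q$, for a probability vector $q$, to a measure of the entropic divergence of $q$ relative to a given probability vector, $q'$.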
One can think of classical entropy for a probability vector, $q$, as a special case of the more general relative entropy by comparing $q$ to a uniform distribution described by a constant probability vector in which $q_i' = 1/n$ for all $i$, with $n$ the number of types. The Kullback–Leibler divergence is also a primary measure of information in statistics and information theory.
The properties of entropy and information derive from the fundamental geometric properties of logarithmic coordinates, such as the additivity described in the previous section.
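The relation between classical entropy and relative entropy can be illustrated numerically. In this sketch, the function names and the example frequencies are hypothetical; the code checks that the divergence of $q$ from a uniform vector equals $\log n$ minus the entropy of $q$, and that mean log fitness equals the negative of the divergence of $q$ from $q'$, assuming $w_i = q_i'/q_i$.

```python
import math


def entropy(p):
    """Classical entropy, -sum p_i log p_i, in natural-log units."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, r):
    """Kullback-Leibler divergence of p relative to r, sum p_i log(p_i / r_i)."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)


q = [0.5, 0.3, 0.2]          # hypothetical frequencies
q_next = [0.4, 0.35, 0.25]   # hypothetical derived frequencies
n = len(q)
uniform = [1.0 / n] * n

# Entropy as a special case of relative entropy against a uniform vector.
divergence_from_uniform = kl_divergence(q, uniform)
entropy_gap = math.log(n) - entropy(q)

# Mean log fitness, q . log w, with w_i = q'_i / q_i.
mean_log_fitness = sum(qi * math.log(qn / qi) for qi, qn in zip(q, q_next))
print(divergence_from_uniform, entropy_gap, mean_log_fitness)
```

Mean log fitness is never positive, because the Kullback–Leibler divergence is never negative.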


which measures the bending, or curvature, of the divergence between the populations in the sequence $q, q', q''$. When the divergence between successive steps remains constant, then mean log fitness is invariant.



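In the same coordinates, the direct component of change is

$$\Delta q \cdot \log w = \mathcal{D}(q' \,\|\, q) + \mathcal{D}(q \,\|\, q') = \mathcal{J},$$

in which $\mathcal{D}$ denotes the Kullback–Leibler divergence and $\mathcal{J}$ is the Jeffreys divergence. In earlier work, I showed that the Jeffreys divergence is the proper expression for the direct component of change caused by natural selection or, more generally, the component associated with direct forces when evaluated with respect to the fixed frame of reference given by the initial probability vector (Frank, 2012b).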
For small changes, the Kullback–Leibler divergence, $\mathcal{D}$, and the Jeffreys divergence, $\mathcal{J}$, converge, up to a constant factor, to the Fisher information metric. Thus, analyses of small changes often invoke $\mathcal{D}$, $\mathcal{J}$, or Fisher information without distinguishing between the measures. For small changes, the Fisher information metric is often preferable, because it has many useful geometric properties (Amari & Nagaoka, 2000) and is more widely known than $\mathcal{J}$. However, it is useful to keep in mind that, in general, $\mathcal{J}$ is the correct measure for the direct effect of natural selection, or for the direct component of change relative to a fixed frame of reference.
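The convergence for small changes can be illustrated with a short Python sketch. It assumes the Jeffreys divergence in the form $\sum_i \Delta q_i \log(q_i'/q_i)$ and the Fisher information distance in the form $\sum_i \Delta q_i^2/q_i$; the function names, the direction of change, and the step size are hypothetical.

```python
import math


def kl_divergence(p, r):
    """Kullback-Leibler divergence of p relative to r."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r))


def jeffreys(q, q_next):
    """Jeffreys divergence, sum (q'_i - q_i) log(q'_i / q_i)."""
    return sum((b - a) * math.log(b / a) for a, b in zip(q, q_next))


def fisher_distance(q, q_next):
    """Squared distance sum (q'_i - q_i)^2 / q_i for a small displacement."""
    return sum((b - a) ** 2 / a for a, b in zip(q, q_next))


q = [0.5, 0.3, 0.2]
direction = [1.0, -0.5, -0.5]   # sums to zero, so total probability is conserved
eps = 1e-3                      # small step size
q_next = [a + eps * d for a, d in zip(q, direction)]

J = jeffreys(q, q_next)
D = kl_divergence(q, q_next)
F = fisher_distance(q, q_next)
print(J, 2 * D, F)
```

For this small step, the Jeffreys divergence and twice the Kullback–Leibler divergence both approximate the Fisher distance closely; the agreement degrades as `eps` grows.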

12 Small Changes: Prelude
In the remainder of this article, I focus only on the small changes that arise from forces acting at a given point. Small changes correspond to a single small segment in any larger path. I focus on small changes for two reasons.
First, the conceptual relations between different disciplines can be seen most clearly in small changes around a focal point.
Second, analysis of larger changes requires either an assumed constancy of a force field, or potential function, or an explicit notion of how forces change with both time and the changing context of the population. Those required assumptions reduce the generality of any particular formulation and obscure the common conceptual basis of different subjects.
In the future, it would be useful to extend analysis to cases in which there is no meaningful decomposition of a large change vector into small segments and to cases in which there exists a constant force field for which one could reconstruct the path of change over a sequence of small segments. Such extensions exist within individual disciplines, but it remains unclear how to connect the analyses from those different subjects to a common unifying framework.
13 Small Changes: Analysis









This last expression is the Fisher information metric, which arises as the direct component of population change or natural selection (Frank, 2009), the limiting expression of the Jeffreys divergence given earlier.
14 Given Forces
I have defined the relative change in frequency as proportional to the force acting along the infinitesimal change. These expressions describe a consistency relation between force and frequency change. Often, we wish to consider how extrinsic or given forces cause change, rather than simply express consistency.

Given the location, $q$, and the force vector, $\phi$, the vector $q^*$ provides an alternative way to express the intensity of the force vector as $\phi = \log(q^*/q)$. We can multiply $q^*$ by an arbitrary positive constant, because the net consequences of a force vector are shift invariant. Thus, we may implicitly consider $q^*$ as the target and choose $q^*$ to sum to one, satisfying the conservation of total probability.

in which is the endpoint of the exponential growth process that began at qi. Thus, the location q and the “target” location
are sufficient to describe the given force vector. In the following, we will only be interested in small changes,
, that result from the instantaneous given forces with respect to a fixed frame of reference. One goal will be to find the changes,
, that arise from given forces and various constraints on change.
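The relation between a given force vector and its target location can be sketched in a few lines. The code assumes, following the description above, that each frequency grows exponentially by its force and that the result is then normalized to sum to one. The function names and the example values are hypothetical. The check shows that adding an arbitrary constant to every component of the force leaves the normalized target unchanged, and that the force recovered from the target differs from the original force only by a constant.

```python
import math

def target_from_force(q, force):
    """Grow each q_i exponentially by its force, then normalize to sum to one."""
    grown = [qi * math.exp(fi) for qi, fi in zip(q, force)]
    total = sum(grown)
    return [g / total for g in grown]

def force_from_target(q, target):
    """Force expressed as log(target_i / q_i)."""
    return [math.log(ti / qi) for qi, ti in zip(q, target)]

q = [0.5, 0.3, 0.2]                    # hypothetical location
force = [0.2, -0.1, 0.4]               # hypothetical given force vector
shifted = [f + 3.0 for f in force]     # same force plus an arbitrary constant

target = target_from_force(q, force)
target_shifted = target_from_force(q, shifted)
recovered = force_from_target(q, target)
offsets = [r - f for r, f in zip(recovered, force)]

print("target:          ", target)
print("target (shifted):", target_shifted)
print("recovered - given force:", offsets)
```

Because only differences among force components matter, the location and the normalized target together carry the same information as the force vector.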





15 Extreme Action and Frequency Dynamics


We measure the total change caused by the direct forces as . That expression comes from Price's separation of direct and inertial forces in Equation 19. In terms of classical mechanics (Lanczos, 1986), the expression
is the virtual work of the direct forces, in which work is distance times force (ignoring mass).
Geometrically, we can think of the constraint in the second term as fixing the total path length moved in frequency space (Amari & Nagaoka, 2000), in which measures distance by the Fisher information metric for infinitesimal displacements,
, or, biologically, C² is the variance in fitness. I assume that C² is chosen so that a solution exists that satisfies the constraints. The final term constrains total probability to remain constant.
The constraints of and
do not by themselves determine which frequency changes actually occur. Many different frequency vectors,
, satisfy those two constraints.
Given these forces and constraints, what actual path do the dynamics follow? In other words, what is the realized vector ? We can think of the first term in the Lagrangian as the action, and extremize the action subject to the given constraints (Lanczos, 1986). That action term is
, the product of the displacement times the given force, which is the virtual work. In this case, maximizing the virtual work in the Lagrangian finds the displacement
aligned with the direct and constraining forces.





satisfies the constraint on total path length, in which is the standard deviation of the direct forces.
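The extremization can be illustrated numerically. The sketch assumes the setup described in this section: virtual work (displacement times given force) is maximized subject to a fixed squared Fisher path length C² and a displacement that sums to zero. Under those assumptions, the stationarity conditions give a displacement in each dimension proportional to the frequency times the deviation of the force from its frequency-weighted mean, scaled by C divided by the standard deviation of the forces. The symbols `kappa`, `C`, and all function names and example values are illustrative choices, not taken from the article.

```python
import math
import random

def fisher_length_sq(q, dq):
    """Squared path length in the Fisher metric: sum of dq_i^2 / q_i."""
    return sum(d * d / qi for qi, d in zip(q, dq))

def virtual_work(dq, force):
    """Displacement times force, summed over dimensions."""
    return sum(d * f for d, f in zip(dq, force))

def extremal_displacement(q, force, path_length):
    """dq_i = kappa * q_i * (F_i - mean F), with kappa = C / sd(F)."""
    mean_f = sum(qi * fi for qi, fi in zip(q, force))
    sd_f = math.sqrt(sum(qi * (fi - mean_f) ** 2 for qi, fi in zip(q, force)))
    kappa = path_length / sd_f
    return [kappa * qi * (fi - mean_f) for qi, fi in zip(q, force)]

def random_feasible(q, path_length, rng):
    """A random displacement that also sums to zero and has Fisher length C."""
    raw = [rng.gauss(0.0, 1.0) for _ in q]
    mean_raw = sum(qi * r for qi, r in zip(q, raw))
    dq = [qi * (r - mean_raw) for qi, r in zip(q, raw)]
    scale = path_length / math.sqrt(fisher_length_sq(q, dq))
    return [scale * d for d in dq]

q = [0.5, 0.3, 0.2]
force = [0.2, -0.1, 0.4]   # hypothetical given forces
C = 0.01                   # hypothetical path length

dq_star = extremal_displacement(q, force, C)
work_star = virtual_work(dq_star, force)

rng = random.Random(0)
best_random = max(virtual_work(random_feasible(q, C, rng), force) for _ in range(1000))

print("sum of dq:", sum(dq_star))
print("Fisher length squared:", fisher_length_sq(q, dq_star), "target:", C ** 2)
print("work at extremum:", work_star, "best of 1000 random feasible:", best_random)
```

No randomly drawn displacement that satisfies both constraints exceeds the virtual work of the extremal displacement, as expected from the Cauchy-Schwarz inequality.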



16 Direct Forces and Constraining Forces
The distinction between direct and constraining forces is arbitrary. We may choose to describe a force by its constraint on allowable displacements, , or by its inclusion in the direct forces,
.
The Lagrangian in Equation 23 defines the action to be extremized as the work done along the path, which is the total displacement, , times the direct component of force,
. We can use
rather than
for force, because we can ignore the constant,
, and
.
The constraining forces in the Lagrangian of Equation 23 are the fixed path length, , and the conservation of total probability,
.
We are free to relabel a component of the direct force as a constraining force (Lanczos, 1986). In practice, deriving the altered Lagrangian provides an easy way to see how the changed labeling of direct and constraining forces enters into the analysis.



We may choose to relabel as a force of constraint. The remaining term
becomes the virtual work associated with the direct forces. The next section illustrates how this change in labeling can be useful.
17 Conserved System Quantities as the Primary Forces of Constraint





The advantage of using z is that we may define the force of constraint directly in terms of any system quantity that we may associate with z. Each zi is, in this analysis, a given value associated with a subset i of the population. We can use any quantity for z, including energy or momentum or monetary wealth or a quantitative biological trait.
Often, underlying quantities of a system, xi, become transformed by various processes before we evaluate the final quantity of the outcome, zi. We may, in general, consider zi = T(xi), in which xi is an intrinsic quantitative value associated with the subset i, and T(xi) is a transformation that defines a scaling relation between the intrinsic xi values and the constraining force, zi. The analysis of pattern often reduces to understanding the processes that set the scaling relation, T (Frank, 2014).




If is constrained, then that constraint defines the constraint on
in Equation 29. For example, the total system quantity
may be conserved, which means that
. If the z quantities do not themselves change, then
, and consequently, we have the constraint on the given forces
. We may also consider other ways in which
is constrained, thereby defining the given forces
that determine dynamics.
18 Maximum Entropy Production Principle

The first term is the total action to be maximized, which is the virtual work of the direct forces, . The other terms describe the constraints on the path that
may follow. I assume that C² and B are chosen such that a solution exists.
The classical definition of entropy is − q · log q. Thus, the path that maximizes
, subject to the constraints on
, is, in the limit of small changes, the path that maximizes the production of entropy subject to the constraints—the maximum entropy production principle (see Appendix for references).
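The link between the virtual work of the force −log q and the production of entropy follows from a first-order expansion. The change in −q · log q caused by a small displacement equals the displacement times −(log q + 1), and the constant term drops out because the displacement sums to zero. The sketch below checks this numerically. The function names and example values are assumptions made for illustration.

```python
import math

def entropy(q):
    """Entropy of a discrete distribution, -sum q_i log q_i."""
    return -sum(qi * math.log(qi) for qi in q)

def entropic_work(q, dq):
    """Virtual work of the force -log q along the displacement dq."""
    return -sum(d * math.log(qi) for qi, d in zip(q, dq))

q = [0.5, 0.3, 0.2]              # hypothetical location
direction = [-1.0, 0.4, 0.6]     # sums to zero; moves toward the uniform distribution

for eps in (1e-2, 1e-3, 1e-4):
    dq = [eps * d for d in direction]
    q_new = [qi + d for qi, d in zip(q, dq)]
    gain = entropy(q_new) - entropy(q)
    print(f"eps={eps:g}  entropy gain={gain:.8f}  -dq.log q={entropic_work(q, dq):.8f}")
```

The two columns agree to first order, with a discrepancy that shrinks in proportion to the square of the displacement.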
The idea is that the most likely path is the one that maximizes the production of entropy, which is equivalent to the maximization of the virtual work of the direct forces, , subject to the constraints on
. The constraints in
include all forces that determine the location of
.
The maximum entropy production principle is always true, in the sense that one can always split the total direct forces, , into a constraining component, log
, and a direct component, −log q. The extent to which maximum entropy production is meaningful depends on two questions. First, how meaningful is it to treat
as a constraint? Second, how meaningful is it to consider paths of change in the context of the Price equation separation of direct and inertial forces, a generalization of d'Alembert's principle?
In order to answer those questions about maximum entropy production, the next section analyzes dynamics with respect to z as a constraint. The following section discusses the Jaynesian theory of maximum entropy in relation to equilibrium thermodynamic expressions for common probability distributions. After those two sections, I return to the broader question of how to interpret the maximum entropy production principle in terms of the Price equation.
19 Maximum Entropy Path Subject to Constraint




in which is the traditional definition of system entropy. Thus,
is the deviation of the entropy in the ith dimension from the system entropy. The constant
is absorbed by expressing
and
as deviations from their average values. The constant
is given by Equation 25, in which
is the standard deviation of the forces,
.


The term βɛz is the regression coefficient of , on zi, which transforms the scale for the forces of constraint imposed by z to be on a common scale with the direct forces of entropy, −log q. The term
describes the required force of constraint on frequency changes so that the new frequencies move
by the amount
. The term
is the variance in z.
When the z values change, the changing frame of reference with respect to z follows from Equation 30 as . When
is a conserved quantity and the z values remain constant such that
, then
. When B = 0, the force of constraint for the conserved quantity is expressed simply by
.
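The structure of the constrained solution can be sketched as follows. The code assumes a Lagrangian that maximizes the virtual work of −log q subject to three constraints: conserved total probability, a fixed Fisher path length, and a change of B in the frequency-weighted total of z. Solving the stationarity conditions under those assumptions gives a displacement proportional to the frequency times the entropy deviation minus a multiple of the deviation of z from its mean, in which the multiple is the regression coefficient of the entropy deviations on z minus B divided by the scale constant times the variance of z. For simplicity, the scale constant `kappa` is treated here as given rather than solved from the path length. All names, sign conventions, and numerical values are illustrative assumptions.

```python
import math

def constrained_entropy_step(q, z, kappa, B):
    """Displacement maximizing -dq.log q with sum(dq) = 0 and dq.z = B (kappa given)."""
    E = -sum(qi * math.log(qi) for qi in q)              # system entropy
    z_bar = sum(qi * zi for qi, zi in zip(q, z))
    eps = [-math.log(qi) - E for qi in q]                # entropy deviations
    z_dev = [zi - z_bar for zi in z]
    var_z = sum(qi * d * d for qi, d in zip(q, z_dev))
    cov_ez = sum(qi * e * d for qi, e, d in zip(q, eps, z_dev))
    beta = cov_ez / var_z                                # regression of eps on z
    lam = beta - B / (kappa * var_z)                     # multiplier for the z constraint
    return [kappa * qi * (e - lam * d) for qi, e, d in zip(q, eps, z_dev)]

q = [0.5, 0.3, 0.2]      # hypothetical location
z = [1.0, 2.0, 4.0]      # hypothetical system quantities

for B in (0.0, 1e-3):
    dq = constrained_entropy_step(q, z, kappa=0.01, B=B)
    change_in_z = sum(d * zi for d, zi in zip(dq, z))
    entropy_work = -sum(d * math.log(qi) for qi, d in zip(q, dq))
    print(f"B={B:g}  sum(dq)={sum(dq):.2e}  dq.z={change_in_z:.6f}  -dq.log q={entropy_work:.6f}")
```

With B = 0, the displacement is the part of the entropic force that is uncorrelated with z, so the total of z is held constant while entropy still increases.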
20 Equilibrium Thermodynamics and Probability
This section analyzes how the system equilibrium arises from the direct force causing maximum increase in entropy and the constraining forces imposed by z. That equilibrium can be interpreted as the maximum entropy probability distribution.







That probability distribution is the classic Jaynesian thermodynamic equilibrium (Jaynes, 1957a,b, 2003) that arises by maximizing entropy subject to a constraint on . That constraint is usually interpreted as a conserved quantity, such that
, and
. We can use multiple constraints on a set of system values
, and replace
by
summed over j. For simplicity, I focus on a single constraint.
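A minimal numerical sketch of the single-constraint case follows. It assumes the standard exponential form of the maximum entropy distribution, in which each probability is proportional to the exponential of minus a multiplier times z, and it finds the multiplier by bisection so that the average of z matches a chosen value. The function names, the bracketing interval, and the example values are assumptions for illustration. The final lines check that, at this distribution, the entropy deviations are exactly linear in z, so no entropic force remains after accounting for the constraint.

```python
import math

def gibbs(z, lam):
    """Distribution with q_i proportional to exp(-lam * z_i)."""
    weights = [math.exp(-lam * zi) for zi in z]
    total = sum(weights)
    return [w / total for w in weights]

def mean(q, z):
    return sum(qi * zi for qi, zi in zip(q, z))

def solve_lambda(z, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Bisection: the mean of z under gibbs(z, lam) decreases as lam increases."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(z, mid), z) > target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

z = [1.0, 2.0, 4.0]                      # hypothetical system quantities
lam = solve_lambda(z, target_mean=2.0)   # hypothetical constraint on the average of z
q_eq = gibbs(z, lam)

# Residual entropic force after regressing entropy deviations on z.
E = -sum(qi * math.log(qi) for qi in q_eq)
z_bar = mean(q_eq, z)
eps = [-math.log(qi) - E for qi in q_eq]
var_z = sum(qi * (zi - z_bar) ** 2 for qi, zi in zip(q_eq, z))
beta = sum(qi * e * (zi - z_bar) for qi, e, zi in zip(q_eq, eps, z)) / var_z
residuals = [e - beta * (zi - z_bar) for e, zi in zip(eps, z)]

print("lambda:", lam, "regression coefficient:", beta)
print("equilibrium:", q_eq)
print("residual forces:", residuals)
```

The regression coefficient equals the multiplier, and the residuals vanish, which is one way to see the equilibrium as the point at which the entropic force and the force of constraint balance.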











in which , and
. We can extend this result to unify the commonly observed probability distributions within a single framework by noting that
is an arbitrary scaling relation of an underlying value, xi (Frank, 2014, 2016).
Two conclusions follow. First, equilibrium probability distributions at maximum entropy express the force of constraint on total probability and the forces of constraint on total system quantities. The point of maximum entropy occurs at the minimum relative entropy, , which is achieved as q →
.
Second, pattern follows from the values of z that set the forces of constraint and thus the magnitudes of . How the z values arise has not been specified. Thus, the study of pattern often reduces to the study of how various processes set z. The analysis here clarifies how those processes and the associated maximum entropy probability distribution relate to the universal Price equation expression for the dynamics of populations.
21 Interpretation of Maximum Entropy Path
The previous sections analyzed forces in terms of Price's partition of direct and inertial forces, an abstract generalization of d'Alembert's principle of mechanics. By analogy with d'Alembert's principle, the Price equation term can be thought of as an abstraction of the virtual work associated with the direct and constraining forces.
The direct forces are F. The constraining forces are included in the allowable set of displacements, , taken relative to the fixed frame of reference. Such displacements relative to a fixed frame of reference are sometimes called virtual displacements, thus the name virtual work for the term
. The Lagrangian expressions provide a method for maximizing the virtual work subject to the constraints that limit the possible set of displacements.
We may interpret the partition of direct and constraining forces in different ways, to match the interpretation of different problems. In this article, I split the total direct forces into a direct force that increases entropy, F = − log q, and a set of potential virtual displacements, , that obey the forces of constraint defined by conservation of a functional,
, of the system quantities, z, where one can think of each zi as a function on the subset, i, of the population.


If we take as the direct forces, then the frequency changes can be obtained from the Lagrangian in Equation 23 that maximizes the action
, which is equivalent to minimizing the change in relative entropy,
.
If we take –log q as the direct forces, then the frequency changes can be obtained from the Lagrangian in Equation 31 that maximizes the action , which is equivalent to maximizing the gain in entropy,
.
In other words, the realized path maximizes the production of entropy when analyzed within the fixed frame of reference, thus the maximum entropy production principle. That conclusion holds only in the d'Alembert–Price distinction between direct and constraining forces, in which we choose to interpret all direct forces except entropy production as constraining forces on the possible virtual displacements, . In addition, the changes in frame of reference that typically arise from change in location,
, or from change in the constraining forces, are separated by the Price equation approach into the consequences of the inertial forces.
Maximum entropy production only holds for the partial change from the direct forces, when separating all direct forces other than entropy into the constraints, and when ignoring changes in the frame of reference associated with the inertial forces.
Does it make sense to follow this particular partition of forces into components? There is no correct answer to that question. The principle exists. The interpretations of usefulness and meaning will always have a strongly subjective aspect.
I follow Lanczos (1986) in the claim that separating direct, inertial, and constraining components is the great unifying perspective in the study of forces. In many systems, it makes sense to describe most of the applied forces in terms of the constraining forces of conserved system quantities. Often, all that remains is the only truly universal force, the increase of entropy, which completes the description of the total direct forces acting on a system.
In some cases, it may make sense to use a different partition of applied forces into direct and constraining component forces. When the remaining direct component of force differs from entropy alone, then it would appear that the system does not follow the maximum entropy production principle. However, it is better to say that the maximum entropy production principle always holds, but alternative expressions may provide a more meaningful perspective for particular problems.
In this interpretation, entropy is simply a geometric description of position and change for probability distributions when located in logarithmic coordinates. That fundamental geometry explains the universality of entropy, or information, in widely different disciplines and applications.
22 Geometry and the Fisher Information Metric




In various models of natural selection, information, and entropy, different measures arise in terms of the Jeffreys divergence, , the Kullback–Leibler divergence,
, and the Fisher information metric,
. Confusion sometimes occurs, because in the limit of small changes, all three measures converge to an equivalent form that often appears as the Fisher information metric. That limiting equivalence hides the significant differences between the measures and the different situations to which each measure naturally applies.
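The differences among the measures, and their agreement in the limit, can be seen in a small numerical comparison. The sketch uses the standard definitions: the Kullback–Leibler divergence in each direction, the Jeffreys divergence as the sum of the two directions, and the Fisher form as the sum of squared frequency changes divided by the initial frequencies. For small changes, each Kullback–Leibler divergence approaches one half of the Fisher form, and the Jeffreys divergence approaches the Fisher form itself. The function names and example values are illustrative assumptions.

```python
import math

def kl(p, r):
    """Kullback-Leibler divergence of p relative to r."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r))

def jeffreys(q, q_new):
    """Jeffreys divergence: sum of (q'_i - q_i) * log(q'_i / q_i)."""
    return sum((b - a) * math.log(b / a) for a, b in zip(q, q_new))

def fisher(q, q_new):
    """Fisher form: sum of (dq_i)^2 / q_i, evaluated at q."""
    return sum((b - a) ** 2 / a for a, b in zip(q, q_new))

q = [0.5, 0.3, 0.2]              # hypothetical initial frequencies
direction = [1.0, -0.4, -0.6]    # sums to zero

for eps in (0.3, 0.03, 0.003):
    q_new = [a + eps * d for a, d in zip(q, direction)]
    forward, reverse = kl(q_new, q), kl(q, q_new)
    print(f"eps={eps:g}  2*KL(q'|q)/F={2 * forward / fisher(q, q_new):.4f}  "
          f"2*KL(q|q')/F={2 * reverse / fisher(q, q_new):.4f}  "
          f"J/F={jeffreys(q, q_new) / fisher(q, q_new):.4f}")
```

For the largest change, the two directions of the Kullback–Leibler divergence differ markedly, whereas for the smallest change all three scaled ratios are close to one.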
The Fisher information metric is used in many applications (Cover & Thomas, 1991; Kullback, 1959). For example, Frieden (2004) has emphasized that this Fisher information partition subsumes nearly all of the key results of theoretical physics. Similarly, the subject of information geometry subsumes nearly all of the classical aspects of statistical inference through a Riemannian geometry based on the Fisher information metric (Amari & Nagaoka, 2000).
From the general perspective of the Price equation and d'Alembert's form for the conservation of total probability in Equation 7, the partition into Fisher information components arises as a special case in the limit of small changes (Frank, 2015). In that special case of Fisher information, in which , one does not separate the forces of constraint from the other directly applied forces. Instead, all directly applied and constraining forces combine into a single quantity that describes the path, in which that path has a natural geometric expression in terms of the Fisher information metric. That geometry is very useful in many applications. But it is important to recognize the more general perspective of Price and d'Alembert, which allows a deeper conceptual understanding of the different roles played by directly applied forces, constraining forces, and inertial forces.
One can think of the maximum entropy production principle in terms of Fisher information geometry. The universal direct force that increases entropy is always present. In addition to that universal direct force, various additional constraining forces combine to influence the curvature of the space of allowable virtual displacements. The direct and constraining forces combine to determine the paths of change within the Fisher information geometry (Amari & Nagaoka, 2000).
23 Direct Work, Information, and Entropy
I summarize in two parts. In this section, I briefly review the Price equation formulation of the work of the direct forces. I then show how the classic measures of information and entropy follow from simple geometric assumptions about the most useful scale on which to measure changes in populations. The following section focuses on the Lagrangian analysis of the dynamical paths of change, including the partial maximum entropy production principle, and provides a final summary.



The allowable displacements in probability, Δq, must obey any constraints imposed on changes in the system, and thus implicitly reflect any underlying forces of constraint. Such displacements may be reversed, because all allowable displacements fall within the constraints of conserved total probability. Reversible infinitesimal displacements that obey the constraining forces, taken in the context of the fixed frame of reference in the initial state of the population, are often called virtual displacements.



The work of the direct forces describes change in the context of the fixed frame of reference given by the initial population. The total change depends on how the frame of reference changes, captured by the second term q′ · Δa = Δq · I, as in Equation 11.
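The partition of total change into a fixed-frame part and a changing-frame part is an exact identity, which a short numerical check makes concrete. The sketch uses a generic property z in place of fitness, and the frequencies and property values are hypothetical. It also shows that adding a constant to every property value leaves the direct term unchanged, because the frequency changes sum to zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.5, 0.3, 0.2]           # initial frequencies
q_new = [0.55, 0.28, 0.17]    # frequencies after change
z = [1.0, 2.0, 4.0]           # initial property values
z_new = [1.1, 1.9, 4.3]       # property values after change

dq = [b - a for a, b in zip(q, q_new)]
dz = [b - a for a, b in zip(z, z_new)]

total = dot(q_new, z_new) - dot(q, z)
direct = dot(dq, z)       # frequencies change, property values held in the initial frame
frame = dot(q_new, dz)    # property values change, weighted by the new frequencies

shifted_direct = dot(dq, [zi + 10.0 for zi in z])

print("total change:", total)
print("direct + frame:", direct + frame)
print("direct term with z shifted by a constant:", shifted_direct)
```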
Often, it is difficult to interpret the changing frame of reference in a simple way. Instead, the strongest universal principles come from study of the work of the direct forces—the partial change caused by the direct forces with respect to the fixed initial frame of reference.
The work of the direct forces may be partitioned into components of directly applied forces, F, and constraining forces expressed by the allowable displacements, Δq. One can make that partition in a variety of ways according to the interpretation of a particular system. The emphasis on forces helps greatly in understanding the causes of change (Lanczos, 1986).


When we interpret fitness as a force, the logarithmic coordinates change the multiplication of fitness components of force into the addition of the logarithmic fitness components of force, as in Equation 18.
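As a small illustration of this change of coordinates, the sketch below takes three hypothetical multiplicative fitness components and confirms that the logarithm of their product equals the sum of their logarithms. The component values are assumptions chosen only for the example.

```python
import math

w_components = [1.2, 0.9, 1.5]                   # hypothetical multiplicative fitness components
w_total = math.prod(w_components)                # fitness combines by multiplication

m_components = [math.log(w) for w in w_components]
m_total = sum(m_components)                      # log fitness combines by addition

print("log of product:", math.log(w_total))
print("sum of logs:   ", m_total)
```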
In the Price equation, we can use any arbitrary coordinates, z, for the quantitative property values associated with probabilities. We can think of those arbitrary coordinates as a geometric transformation of the fundamental coordinates of conserved probability and fitness, w ↦ z. Equivalently, we may write a ↦ z, because a = w − 1, and the Price equation is shift invariant.

When the changes, Δqi/qi, are small, the logarithmic measure of fitness converges to the linear measure of fitness, m→a, and the Jeffreys divergence and the Kullback–Leibler divergence converge to the Fisher information metric. The Fisher metric is the fundamental measure of distance between probability distributions that forms the basis of much of statistical inference and information geometry.
In these Price equation descriptions of change, we have taken the fitnesses as given, and equated fitness or the logarithm of fitness with a notion of force. That approach is essentially inductive, in which we take the probabilities as given locations, , and implicitly induce the force that would be consistent with the change from qi to
.
24 Partial Maximum Entropy Production
The main point of this article is to analyze the traditional deductive perspective of dynamics with respect to force. In that traditional perspective, we begin with the initial location of the population, q, and given forces which we denote . From those given conditions, we then deduce the changes in location and the new probabilities, q′. I confined the analysis to the study of small changes,
.
To obtain the dynamics, , from the initial location and the given forces, I first wrote the Lagrangian expression for each particular case. The Lagrangian focuses on a first term, often called the action, which is either maximized or minimized (extremized). When minimized, the procedure follows the principle of least action, but more generally, the procedure is known as the principle of extreme action.
In this article, I maximized the virtual work of the given direct forces, . Intuitively, this simply means that the changes will follow the lines of force in relation to the magnitudes of the force in each dimension. However, we must consider both the direct and constraining force.
The Lagrangian approach provides a natural way to combine direct and constraining forces. In each Lagrangian, the first term gives the virtual work of the direct forces to be maximized. The remaining terms give the constraints that must be satisfied, usually as some total quantity that is conserved when summed over all dimensions of the system. The Lagrangian procedure transforms the system constraints into the constraining force components in each dimension.
The various results in the text show how different kinds of constraints and different ways of separating overall force into direct and constraining components determine the change in frequencies.





With this component labeled as a constraining force, the remaining part of the virtual work of the direct forces is , which in the limit for small changes is the production of entropy along the path of small changes,
. This component is the action term maximized along the path of change; thus, the path follows the direction that maximizes the production of entropy. I call this the partial maximum entropy production principle, because the result expresses the change in terms of the fixed frame of reference of the initial population state. Total change must also evaluate any changes in the frame of reference through the inertial forces.
The entropy production principle simply expresses the basic geometry for the path of change when extrinsic forces are considered as constraints on system quantities, and logarithmic coordinates are used to locate populations. Because changes in probabilities as fitness or force have a natural expression as the ratio of probabilities, , and such quantities combine multiplicatively, logarithmic coordinates arise naturally from the transformation that yields additive combinations. Thus, entropy production or changes in information arise as the inevitable consequence of the geometry of change when evaluated in the Price equation partition of direct and inertial forces.
In summary, several different disciplines share the same fundamental theory of change. From the perspective of the Price equation, we have seen common expressions for natural selection, aspects of physical mechanics and thermodynamics, entropy expressions for probability distributions, and common measures of information theory. Perhaps many common models of learning by reinforcement (Sutton & Barto, 1998; Szepesvári, 2010) and Bayesian updating (Campbell, 2016; Harper, 2011; Shalizi, 2009) will also share the same underlying geometric principles.
Acknowledgements
National Science Foundation grant DEB–1251035 supports my research. I did this work while on fellowship at the Wissenschaftskolleg zu Berlin.
Conflict of Interest
None declared.
Appendix A: Literature in Specific Disciplines
Natural selection
Price originally formulated his equation as an expression of natural selection (Price, 1970, 1972a). In another article, without any direct connection to the Price equation, he speculated about a unified theory of change based on an abstract generalization of the principle of selection (Price, 1995).
In Price's vision for a general theory of selection, he suggested the separation of frequency and property values in the description of population change. He also described changes by an abstract mapping scheme between members of two populations. Price never connected these abstract ideas about mapping and about separating frequency and property directly to his formulation of the Price equation, although one can see hints of this in Price (1972a).
In other work (Price, 1972b), Price clarified one of the great puzzles in the history of evolutionary theory. In 1930, Fisher stated his fundamental theorem of natural selection as: “The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.”
Fisher emphasized the exactness of the theorem and his belief that the theorem was a general and profound statement about natural selection. The puzzle is that Fisher's theorem holds exactly only under a very restricted set of assumptions (Crow & Kimura, 1970). Fisher is regarded as perhaps the greatest mathematical biologist ever. So the mismatch between Fisher's strong claim and the seemingly obvious failure of the theorem was hard to reconcile.
Price (1972b) solved the puzzle. In the language of the present article, Fisher meant that the rate of increase in fitness equals the variance in fitness when evaluated with respect to the fixed frame of reference of the population's initial state. Selection acts as a direct force, with consequences of the direct force evaluated by holding constant the context. Any changes to the population that alter the fitnesses of individuals are regarded as consequences of inertial forces that alter the frame of reference.
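A simplified numerical version of this fixed-frame statement can be sketched as follows. The example ignores genetics and treats each type as reproducing in proportion to a given fitness, which is an assumption made only for illustration and is narrower than Fisher's genetic formulation. With fitnesses held constant, the partial change in mean fitness caused by the frequency changes equals the variance in fitness divided by mean fitness. When fitness is expressed relative to the mean, so that mean fitness is one, the partial change equals the variance.

```python
q = [0.5, 0.3, 0.2]      # hypothetical initial frequencies
w = [1.1, 0.9, 1.3]      # hypothetical fitnesses

w_bar = sum(qi * wi for qi, wi in zip(q, w))
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]    # frequencies after selection
dq = [b - a for a, b in zip(q, q_new)]

# Partial change in mean fitness, holding the fitnesses fixed at their initial values.
partial_change = sum(d * wi for d, wi in zip(dq, w))
variance = sum(qi * (wi - w_bar) ** 2 for qi, wi in zip(q, w))

print("partial change in mean fitness:", partial_change)
print("variance / mean fitness:       ", variance / w_bar)
```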
Price (1972b) did not use the language of direct and inertial forces, but he clearly understood Fisher's partition of total change into two components. Later work clarified a variety of early theories about natural selection within the context of Fisher's partition (Ewens, 1989, 1992; Frank & Slatkin, 1992).
In summary, Price left three separate insights about natural selection: the Price equation, the separation of frequency and property in an abstract mapping scheme, and Fisher's method of partitioning total change with respect to the frame of reference. My own work has unified those different pieces into an extended, more general and abstract interpretation of the Price equation (Frank, 1995, 1997, 2012a,b).
Another important line of work in evolutionary theory concerns the path of change in gene frequencies. Wright (1931, 1932) initiated the approach most closely related to analogies with classical mechanics. That line of work continues to be developed, including explicit connections to notions of entropy and statistical mechanics (de Vladar & Barton, 2011).
The studies initiated by Wright contrast with Fisher's approach (Frank, 2012c). In the language of this article, Fisher emphasized instantaneous change at a point and the partition of direct and inertial components of change. Fisher believed that the inertial components of change were too unpredictable to allow an explicit theory for the full path of change over significant lengths. By contrast, Wright and his descendants sought a theory of the paths of change over significant distances. This article emphasized the Fisherian perspective.
Maximum entropy production
Jaynes’ theory of maximum entropy (Jaynes, 1957a,b, 2003) emphasizes that probability distributions can be read as expressions of constraining forces (Frank, 2014).
For example, a Gaussian distribution expresses a constraint on the average distance of observations from the mean value. If one constrains that average distance of fluctuations from the mean, then the Gaussian distribution arises by maximizing the entropy subject to that constraint. Maximizing entropy is roughly equivalent to minimizing information or maximizing randomness.
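The Gaussian example can be checked with closed-form entropies. The sketch reads the constraint as a fixed variance, that is, a fixed average squared distance from the mean, which is an interpretive assumption. It compares the differential entropy of a Gaussian distribution with that of a uniform distribution and a Laplace distribution, each scaled to have the same variance. The choice of comparison distributions is illustrative.

```python
import math

sigma = 1.0   # common standard deviation imposed on all three distributions

# Standard closed-form differential entropies (natural log units).
gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
uniform = math.log(sigma * math.sqrt(12))           # width sigma*sqrt(12) gives variance sigma^2
laplace = 1 + math.log(2 * sigma / math.sqrt(2))    # scale b = sigma/sqrt(2) gives variance 2*b^2

print(f"Gaussian: {gaussian:.4f}")
print(f"Laplace:  {laplace:.4f}")
print(f"Uniform:  {uniform:.4f}")
```

Among these three distributions with equal variance, the Gaussian has the greatest entropy, consistent with its role as the maximum entropy distribution under a variance constraint.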
Jaynes’ maximum entropy describes an equilibrium condition (Jaynes, 1957a,b, 2003). The idea is that entropy increase is a ubiquitous force—a ubiquitous entropic force. Increasing entropy plus constraining forces together define the form of the equilibrium distribution.
The increase in entropy toward an equilibrium leaves open the problem of the dynamical path followed from initial condition to final equilibrium state. What characterizes the increments along that path? One possibility is that each increment follows the direction that maximizes the increase in entropy—the path of maximum entropy production (MEP).
Some authors have proposed MEP as a fundamental principle similar to the principle of least action (Dewar, 2005; Dewar, Lineweaver, Niven, & Regenauer-Lieb, 2014). By that view, essentially all realized paths of motion maximize the production of entropy. Other authors have suggested that MEP is only an approximate description of dynamics (Dewar et al., 2014). By that view, certain special systems follow MEP exactly, whereas many other systems follow MEP approximately or not at all.
The logical status of MEP as a principle and its usefulness in analysis remain open problems. The interpretation of MEP is important, because that interpretation reflects our general understanding of diverse subjects and the relations between those subjects.
In this article, I showed that MEP is an exact statement about dynamics when interpreted in the context of the Price equation and the information theory definition of entropy. The Price equation provides an abstraction of change that may be interpreted as a partition into components that separate direct, inertial, and constraining forces.
This Price equation separation of forces is an abstract generalization of d'Alembert's principle of classical mechanics (Lanczos, 1986). The Price equation formulation can be applied to both conservative and nonconservative systems, extending d'Alembert's application to conservative systems. Wang (2007) proposed a different way to connect entropy and d'Alembert through a more traditional thermodynamic approach.
Although MEP is a valid principle, I suggested that a purely geometric interpretation provides a more fundamental and universal perspective than does the entropy perspective of MEP. In particular, the conservation of total probability imposes strong geometric symmetry and constraint on the separation of direct and inertial forces (Frank, 2015). Maximum entropy production is a useful but often unnecessarily complicated way of expressing those fundamental geometric principles.
Returning to Jaynes, his goal was to express an abstract and general approach to understanding probability patterns. He sought to transcend the specific physical assumptions of statistical mechanics and thermodynamics, thereby achieving a more general theory that applied to a broader range of disciplines.
In several ways, Jaynes did not go far enough. For example, he retained entropy and information as primary quantities. Similarly, information geometry, based on metrics such as Fisher information, retains a notion of information as primary. In my view, the underlying geometry, conserved quantities, and symmetries provide the true foundation for analysis as, for example, in Frank (2016).
Statistical inference and learning algorithms
This article showed that natural selection connects to universal expressions of population change and probability through the Price equation (Frank, 1995, 2012a; Price, 1970, 1972a). One can think of natural selection as an algorithm for accumulating information. Many authors have noted formal connections between natural selection and information theory (Frank, 2009, 2012b), Bayesian updating in statistical inference (Campbell, 2016; Harper, 2011; Shalizi, 2009), and learning algorithms (Campbell, 1974).
Although initial connections have been made between natural selection and those different subjects, unification based on a deeper geometric foundation remains an open problem. For example, Jaynes' maximum entropy approach ultimately aimed to unify probability, information, statistical inference, and physical theories of statistical mechanics and thermodynamics (Jaynes, 2003). Another subject that might eventually join this unification is reinforcement learning (Sutton & Barto, 1998; Szepesvári, 2010), which provides the basis for aspects of neuroscience, cognitive science, and machine learning.
How do those various subjects relate to general underlying geometric principles for the dynamics of change in populations?