The concept of attention has proven to be very relevant in artificial intelligence. Relative entropy (RE, aka Kullback-Leibler divergence) plays a central role in communication theory. Here, these two concepts, attention and RE, are combined. RE guides optimal encoding of messages in bandwidth-limited communication as well as optimal message decoding via the maximum entropy principle. In the coding scenario, RE can be derived from four requirements, namely being analytical, local, proper, and calibrated. Weighted RE, used for attention steering in communications, turns out to be improper. To see how proper attention communication can emerge, a scenario is analyzed in which a message sender wants to ensure that the receiver of the message can perform well-informed actions. In case only the curvature of the utility function maxima is known, it becomes desirable to accurately communicate an attention function, in this case a probability function weighted by this curvature and re-normalized. Entropic attention communication is here proposed as the desired generalization of entropic communication that permits weighting while being proper, thereby aiding the design of optimal communication protocols in technical applications and helping to understand human communication. It provides the level of cooperation expected under misaligned interests of otherwise honest communication partners.
1 Introduction
1.1 Relative Entropy
The work of Shannon[1] and Jaynes[2-6] made it clear that entropy and its generalizations,[7-9] in particular relative entropy,[10] have a root in communication theory. Already introduced to thermodynamics by Gibbs[11] in 1906, relative entropy nowadays plays a central role in artificial intelligence,[12-16] particularly for variational autoencoders,[17-20] as well as in information field theory.[21-23] Relative entropy permits the coding of maximally informative messages, and the Maximum Entropy Principle (MEP)[2-6] permits their decoding, as detailed in the following.
The “relative entropy” between two probability densities P(s|A) and P(s|B) on an unknown signal or situation s,

\mathcal{D}(A,B) := \int \mathrm{d}s\, P(s|A)\, \ln\frac{P(s|A)}{P(s|B)}   (1)

measures the amount of information in nits lost by degrading the knowledge P(s|A) to the knowledge P(s|B). The letters A and B stand for Alice and Bob, who are communication partners. Alice can use the relative entropy to decide which message she wants to send to Bob in order to inform him best. Here, A denotes Alice's background information before and after her communication.
We assume that Alice knows how to communicate such that Bob updates his previous knowledge state to .[24-27]
The functional form of the relative entropy can be derived from various lines of argumentation.[28-33] As the most natural information measure, relative entropy plays a central role in information theory. It is often used as the quantity to be minimized when deciding which of the possible messages shall be sent through a bandwidth-limited communication channel that does not permit the full transfer of the sender's knowledge, but it is also used in other circumstances.
As we will discuss in more detail, relative entropy as specified by Equation (1) is uniquely determined up to a multiplicative factor as the measure to determine the optimal message to be sent under the requirements of it being analytical (all derivatives w.r.t. the parameters of the communicated knowledge state exist everywhere), local (only the situation that happens will matter in the end), proper (favoring the transmission of Alice's actual knowledge state),[8, 34, 35] and calibrated (being zero when the transmitted knowledge state equals Alice's). Our derivation is a slight modification of that given in Leike and Enßlin.[31]
A number of attempts have been made to introduce weights into the relative entropy.[33, 36-41] Some of these go back to Guiasu[42] and Belis and Guiasu.[43] Most of them can be summarized by the “weighted relative entropy,”
\mathcal{D}_w(A,B) := \int \mathrm{d}s\, w(s)\, P(s|A)\, \ln\frac{P(s|A)}{P(s|B)}   (2)
with weights w(s), for which w(s) ≥ 0 holds for all s.
The extension of relative entropy to weighted relative entropy, as given by Equation (2), appears to be attractive, as it can reflect scenarios in which not all possibilities in the situation space are equally important. For example, detailed knowledge on the subset of situations in which Bob's decisions do not matter to him is not very relevant for the communication, as he cannot gain much practical use from it. Therefore, Alice should not waste valuable communication bandwidth on details within this subset, but use it to inform him about the remaining situations for which being well informed makes a difference to Bob. However, despite being well motivated, weighted relative entropy is not proper in the mathematical sense for non-constant weighting functions, as we will show and as was already recognized before.[33]
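The impropriety of weighted relative entropy can also be checked numerically. The following sketch is our own illustration, not part of the original derivation; the distribution and weights are arbitrary choices. It minimizes the weighted relative entropy of Equation (2) over all normalized candidate distributions on a discrete grid and shows that the minimizer is the re-weighted distribution of Equation (3) rather than the sender's own knowledge state.

```python
import numpy as np

# Arbitrary illustrative choices: a discrete situation space,
# Alice's knowledge p, and non-constant importance weights w.
s = np.linspace(-3.0, 3.0, 601)
p = np.exp(-0.5 * (s + 1.0) ** 2) + 0.5 * np.exp(-0.5 * (s - 1.5) ** 2 / 0.25)
p /= p.sum()
w = np.exp(s)  # an exponential, non-constant weighting (illustrative choice)

def weighted_re(p, q, w):
    """Weighted relative entropy of Eq. (2): sum_s w(s) p(s) ln[p(s)/q(s)]."""
    return np.sum(w * p * np.log(p / q))

# The minimizer of Eq. (2) over all normalized q follows analytically from a
# Lagrange multiplier: q*(s) = w(s) p(s) / sum_s' w(s') p(s'),
# i.e., the attention function of Eq. (3), not p itself.
q_star = w * p / np.sum(w * p)

print("loss at q = p      :", weighted_re(p, p, w))       # = 0
print("loss at q = q_star :", weighted_re(p, q_star, w))  # < 0 for non-constant w
```

Since the score can be pushed below its value at q = p, a sender judged by this measure is encouraged to report the re-weighted distribution instead of the actual knowledge state, which is exactly the impropriety discussed above.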
1.2 Relative Attention Entropy
Given the success of attention weighting in artificial intelligence research,[44-46] in particular in transformer networks,[46] the question arises whether a form of weighted relative entropy exists that is proper. In order to understand how the weighting should be included, we investigate a specific communication scenario in which weighted and re-normalized probabilities of the form

a(s) := \frac{w(s)\, P(s)}{\int \mathrm{d}s'\, w(s')\, P(s')}   (3)
appear naturally as the central element of communication. We will call a quantity of this form “attention function,” “attention density function,” or briefly “attention”[47-49] when there is no risk of confusion with other attention concepts (as within this study).
The corresponding “relative attention entropy”

\mathcal{D}_a(A,B) := \int \mathrm{d}s\, a_A(s)\, \ln\frac{a_A(s)}{a_B(s)}   (4)

with a_A and a_B the attention functions formed from P(s|A) and P(s|B) according to Equation (3), leads to proper communication in case w(s) > 0 for all s, as we show in the following. The minimum of the relative attention entropy is given by a_B = a_A, which in case w(s) > 0 for all s implies P(s|B) = P(s|A), since

P(s|B) = \frac{a_B(s)/w(s)}{\int \mathrm{d}s'\, a_B(s')/w(s')} = \frac{a_A(s)/w(s)}{\int \mathrm{d}s'\, a_A(s')/w(s')} = \frac{P(s|A)}{\int \mathrm{d}s'\, P(s'|A)} = P(s|A)   (5)
Here, we first turned the attention back into a probability by inversely weighting and re-normalizing, then substituted a_B by a_A thanks to their identity, further substituted the latter by its definition in terms of P(s|A), and finally used the normalization of P(s|A).
The relative attention entropy differs from weighted relative entropy (Equation 2) due to the re-normalization in the definition of attention. This leads to an irrelevant re-scaling of the weighted relative entropy by the w-weighted normalization of Alice's probability, which is independent of P(s|B), but also to a relevant extra term that depends on P(s|B):

\mathcal{D}_a(A,B) = \frac{\mathcal{D}_w(A,B)}{\int \mathrm{d}s\, w(s)\, P(s|A)} + \ln\frac{\int \mathrm{d}s\, w(s)\, P(s|B)}{\int \mathrm{d}s\, w(s)\, P(s|A)}   (6)
This extra term ensures properness when the attention entropy gets minimized w.r.t. P(s|B). We refrained here from giving similar integration variables different names.
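For contrast with the sketch given above for weighted relative entropy, the impropriety disappears once the re-normalization of Equation (3) is included. The following minimal numerical check, again our own illustration with arbitrary p and w, confirms that the relative attention entropy of Equation (4) is zero for q = p and positive for any perturbed candidate, as guaranteed by its form as a Kullback-Leibler divergence between attention functions.

```python
import numpy as np

rng = np.random.default_rng(0)

s = np.linspace(-3.0, 3.0, 601)
p = np.exp(-0.5 * s**2) + 0.3 * np.exp(-2.0 * (s - 1.0) ** 2)
p /= p.sum()
w = 1.0 + s**2                    # arbitrary positive, non-constant weights

def attention(q, w):
    """Attention function of Eq. (3): weighted and re-normalized probability."""
    a = w * q
    return a / a.sum()

def attention_re(p, q, w):
    """Relative attention entropy of Eq. (4)."""
    ap, aq = attention(p, w), attention(q, w)
    return np.sum(ap * np.log(ap / aq))

print("loss at q = p:", attention_re(p, p, w))        # exactly zero
for _ in range(5):
    q = p * np.exp(0.1 * rng.standard_normal(p.size))  # random perturbation of p
    q /= q.sum()
    print("perturbed q  :", attention_re(p, q, w))      # strictly positive
```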
We investigate the specific scenario of Alice wanting to inform Bob optimally. From this, we will motivate attention according to Equation (4). This means that she wants to prepare Bob such that he can decide about an action in a way that is optimal for him. This action has an outcome for Bob that depends on the unknown situation Alice tries to inform Bob about. The outcome is described by a utility function, which both Alice and Bob want to maximize. The utility depends on both the unknown situation and Bob's action. In the following, we assume the utility to be at least twice differentiable w.r.t. the action and to exhibit only one maximum for any given situation. In case Alice does not know Bob's utility function, but only its curvature w.r.t. the action at its maximum for any given situation, it will turn out that Alice wants to communicate her attention as accurately as possible to Bob, with the weights given by these curvatures. In short, we will show that Alice should inform Bob most precisely about situations in which Bob's decision requires the largest accuracy and not at all about situations in which his actions do not make any difference. The functional according to which Alice will fit Bob's attention to hers will not be the relative attention entropy of Equation (4), but a different one. However, it motivates attention as an essential element of utility-aware communication and shows the path to extend the derivation of relative entropy to that of relative attention entropy.
1.3 Attention Example
To illustrate how relative attention entropy works in practice, we examine a simple example. For this we assume that Alice has a bimodal knowledge state
(7)
on a situation about which she wants to inform Bob, which is a superposition of two Gaussians, with
Figure 1. Example of communication based on relative attention entropy and weighted relative entropy as discussed in Section 1.3. Alice's bimodal knowledge state is given by the red solid curve. Bob's final knowledge state after Alice's communication is shown for various cases. The dashed black lines correspond to cases in which Alice uses relative attention entropy, and the dotted blue lines to cases in which she uses weighted relative entropy. Different results for the weight function, with exponents 1, 2, 16, and 32 among others, are shown from left to right, respectively. In the case of homogeneous weights, relative entropy, relative attention entropy, and weighted relative entropy give the same result, the shown zero-centered Gaussian. For the largest exponents, the different curves for the relative attention entropy results are visually indistinguishable and indicate the result of the limit.
Let us assume that Alice believes that the different situations have importance weights w(s) = e^{βs} for Bob, with β controlling the inhomogeneity of the weights. We further assume that her communication can only create Gaussian knowledge states in Bob of the form
(9)
The corresponding attentions of Alice and Bob are
(10)
(11)
respectively, as shown in Appendix A, which also covers the details of the following calculations. These functions are displayed in Figure 2 for the smaller values of β.
Figure 2. Attention functions corresponding to the cases of Figure 1 on logarithmic scale to display the unattended peak of Alice's attention. Note that due to the strong exponential focus of the weights on larger s-values, the attention peaks are displaced to the right w.r.t. the corresponding knowledge peaks.
Matching the mean and dispersion parameters by minimizing the relative attention entropy in Equation (4) w.r.t. those yields
(12)
(13)
The resulting communicated attention-weighted knowledge is depicted in Figure 1 for various values of β. The case β = 0 corresponds to homogeneous weights and therefore to using the non-weighted relative entropy, Equation (1). In this case, Alice communicates a broad knowledge state to Bob that covers both peaks of her knowledge. With increasing β the right peak becomes more important and Alice puts more and more attention on communicating its location and shape more accurately. In the limit β → ∞, Alice communicates exactly this relevant peak and completely ignores the one that is irrelevant for Bob.
Minimizing the weighted relative entropy of Equation (2) instead yields
(14)
(15)
which results in poorly adapted communicated knowledge states, as also depicted in Figure 1. Weighted relative entropy just moves the broad peak, originally centered at zero for both entropies in case of no weighting, to increasingly more extreme locations, which for large β become completely detached from the location of the relevant peak. This detachment is clearly visible in Figures 1 and 2.
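The qualitative behavior described above can be reproduced with a few lines of numerical optimization. The following sketch is our own illustration: the bimodal knowledge state, the exponential weight function, and all parameter values are assumptions standing in for the (here unreproduced) Equations (7)-(9), so only the qualitative comparison is meaningful.

```python
import numpy as np
from scipy.optimize import minimize

s = np.linspace(-6.0, 10.0, 4001)
ds = s[1] - s[0]
gauss = lambda x, m, sig: np.exp(-0.5 * ((x - m) / sig) ** 2) / (np.sqrt(2 * np.pi) * sig)

# Illustrative bimodal knowledge state for Alice (assumed parameters; the
# original values behind Eq. (7) are not reproduced here).
p_alice = 0.5 * gauss(s, -1.0, 1.0) + 0.5 * gauss(s, 3.0, 0.5)

def fit(loss, beta):
    """Fit mean and std of Bob's Gaussian state by minimizing the chosen loss."""
    w = np.exp(beta * s)                       # exponential importance weights

    def attention(p):
        a = w * p
        return a / np.sum(a * ds)

    def objective(params):
        m, log_sig = params
        q = np.maximum(gauss(s, m, np.exp(log_sig)), 1e-300)
        if loss == "attention":                # relative attention entropy, Eq. (4)
            a_p, a_q = attention(p_alice), attention(q)
            return np.sum(a_p * np.log(a_p / a_q)) * ds
        return np.sum(w * p_alice * np.log(p_alice / q)) * ds   # weighted RE, Eq. (2)

    res = minimize(objective, x0=np.array([0.0, 0.0]), method="Powell")
    return res.x[0], np.exp(res.x[1])

for beta in (0.0, 2.0, 8.0):
    for loss in ("attention", "weighted"):
        m, sig = fit(loss, beta)
        print(f"beta={beta:4.1f} {loss:9s}: mean={m:6.2f} std={sig:5.2f}")
```

With growing beta the attention-entropy fit settles on the strongly weighted peak, while the weighted-entropy fit drifts to ever larger values, illustrating the detachment visible in Figures 1 and 2.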
In order to see how the process of Alice informing Bob works in detail, we have to understand how he decodes messages, as this defines the format of the messages Alice can send. The best way for him to incorporate knowledge sent to him into his own beliefs is by using the MEP, which we assume he does in the following.
1.4 Maximum Entropy Principle
The MEP was derived as a device to optimally decode messages and to incorporate their information into preexisting knowledge.[2-6, 50] The MEP states that among all possible updated probability distributions that are consistent with the message, the one with the largest entropy should be taken. Requiring that this update should be local, reparametrization invariant (w.r.t. the way the signal or situation is expressed), and separable (w.r.t. handling independent dimensions of ) enforces the functional form (up to affine transformations) of this entropy to be
\mathcal{S}\bigl(P(s|B'), P(s|B)\bigr) = -\int \mathrm{d}s\, P(s|B')\, \ln\frac{P(s|B')}{P(s|B)}   (16)
with P(s|B) and P(s|B') being Bob's knowledge before and after the update.[2-6, 51]
We assume that Alice's message takes the form

d = \langle f(s) \rangle_{(s|A)} := \int \mathrm{d}s\, f(s)\, P(s|A)   (17)

We call the function f the topic of the communication, as it expresses the specific aspects of s that Alice's message is about. For example, in case Alice wants to inform Bob about the first moment of the first component of s, the topic would be f(s) = s_1. As Alice communicates it, the topic is known to both Alice and Bob. Here, ⟨·⟩_{(s|A)} is a compact notation for evaluating this topic as an expectation value over Alice's knowledge state. The communicated expectation value of f is called in the following the “message data” d. The word data expresses that a quantity can be regarded as a certain fact, which still might have an uncertain interpretation. In this case it is certain to Bob that Alice claims that ⟨f(s)⟩_{(s|A)} has the value d. Under the premise that Alice is honest, this informs Bob about her belief, and if he can further assume that she is also well informed, the data is informative in itself. Under the premise that Alice is dishonest, d could still inform Bob about what she wants him to believe, and therefore still be data, just on Alice's intentions and not on her knowledge.
Alice's message to Bob therefore consists of the tuple m = (f, d). Although this message construction might look artificial at first glance, it actually can express many, if not all, real-world communications, as we motivate in Appendix B.
Updating his knowledge according to the MEP after receiving the message m implies that Bob extremizes the constrained entropy
(18)
w.r.t. the arguments P(s|B') and μ. Here, the latter is a Lagrange multiplier that ensures that Alice's statement, Equation (17), is imprinted onto Bob's knowledge P(s|B'). The MEP and the requirement of normalized probabilities then imply that Bob's updated knowledge is of the form
P(s|B') = \frac{P(s|B)\, e^{\mu f(s)}}{Z(\mu)}   (19)

Z(\mu) := \int \mathrm{d}s\, P(s|B)\, e^{\mu f(s)}   (20)
The Lagrange multiplier μ needs to be chosen such that ⟨f(s)⟩_{(s|B')} = d, which can be achieved by requiring

\frac{\partial \ln Z(\mu)}{\partial \mu} = d   (21)

since

\frac{\partial \ln Z(\mu)}{\partial \mu} = \int \mathrm{d}s\, \frac{P(s|B)\, e^{\mu f(s)}}{Z(\mu)}\, f(s) = \langle f(s) \rangle_{(s|B')}   (22)
Thus, the MEP procedure ensures that the communicated moment of Alice's knowledge state, d = ⟨f(s)⟩_{(s|A)}, gets transferred accurately into Bob's knowledge,

\langle f(s) \rangle_{(s|B')} = d = \langle f(s) \rangle_{(s|A)}   (23)

and thus that Bob extracts all information Alice has sent to him. Alice's communicated expectation for the topic is now “imprinted” onto Bob's knowledge, as it is also his expectation after the update.
If Bob decodes Alice's message via the MEP, she can send a perfectly accurate image of her knowledge if the communication channel bandwidth permits this. Detailed explanations of this can be found in Appendix C. Otherwise, she needs to compromise, and for this she requires a criterion on how to do so.
The protocol of the entropy-based communication between Alice and Bob looks therefore as follows:
Both Know:
The situation/signal s is in a set of possibilities, on which they have their respective background information, implying the knowledge states P(s|A) and P(s|B), respectively.
Alice Sends:
A function f(s), the message topic, and her expectation for it, d = ⟨f(s)⟩_{(s|A)}, the message data.
Bob Updates:
P(s|B) → P(s|B'), according to the MEP, such that his updated knowledge has the same expectation value w.r.t. the topic, ⟨f(s)⟩_{(s|B')} = d.
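A minimal numerical sketch of this protocol for a discrete situation space is given below. It is our own illustration; the prior, topic, and data values are arbitrary. Bob's update is the exponential tilting of Equations (19) and (20), with the Lagrange multiplier fixed by a one-dimensional root search so that the communicated moment is reproduced.

```python
import numpy as np
from scipy.optimize import brentq

# Arbitrary illustrative setup: discrete situations, Bob's prior, a topic f,
# and the message data d = <f>_Alice that Alice communicates.
s = np.linspace(-4.0, 4.0, 401)
prior = np.exp(-0.5 * (s - 1.0) ** 2)
prior /= prior.sum()                      # Bob's knowledge P(s|B)
f = s                                     # topic: the first moment of s
d = -0.5                                  # message data claimed by Alice

def tilt(mu):
    """Exponentially tilted prior, the MEP form of Bob's updated knowledge."""
    q = prior * np.exp(mu * f)
    return q / q.sum()

def moment_gap(mu):
    """Difference between Bob's updated expectation of the topic and the data."""
    return np.sum(tilt(mu) * f) - d

mu = brentq(moment_gap, -50.0, 50.0)      # root search for the Lagrange multiplier
posterior = tilt(mu)

print("mu =", mu)
print("<f>_B' =", np.sum(posterior * f))  # equals d, the communicated moment
```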
1.5 Structure of this Work
The remainder of this work is structured as follows. Section 2 recapitulates the derivation of relative entropy by Bernardo[28] and Leike and Enßlin[31] and shows that a non-trivially weighted relative entropy is not proper. Section 3 then discusses variants of our communication scenario in which Alice's and Bob's utility functions are known to them, be they aligned or misaligned. In Section 4 we show that attention emerges as the quantity to be communicated most accurately in case Alice wants the best for Bob, but lacks precise knowledge of Bob's utility except for its curvature w.r.t. Bob's action in any situation. This motivates the introduction of attention to entropic communication. In Section 5, relative attention entropy is derived in analogy to the derivation of relative entropy. We conclude in Section 6 by discussing the relevance of our analysis in technological and socio-psychological contexts.
2 Proper and Weighted Coding
2.1 Proper Coding
In order to see the relation between properness and relative entropy, we recapitulate its derivation as given by Leike and Enßlin[31] in a modified way.[52] There, it was postulated that Alice uses a loss function (negative utility) that depends on the situation that happens in the end, as well as on her and Bob's knowledge at this point. Leike and Enßlin[31] call this function her embarrassment, as it should encode how badly she informed Bob about the situation that happens in the end. Obviously, Alice wants to minimize this embarrassment.
At the time Alice has to make her choice, she does not know which situation will happen. She therefore needs to minimize her expected loss

\langle l(s, A, B) \rangle_{(s|A)} = \int \mathrm{d}s\, P(s|A)\, l(s, A, B)   (24)

for deciding which knowledge state Bob should get (via her message m). Here we discriminate the related, but different, functions denoted by the same symbol l via their signatures (their sets of arguments).
General criteria such a loss function should obey were formulated by Leike and Enßlin,[31] which we slightly rephrase here as:
Analytical: Alice's loss should be an analytic expression of its arguments. An analytic function is an infinitely differentiable function such that it has a converging Taylor series in a neighborhood of any point of its domain. As a consequence, an analytical function is fully determined for all locations of its domain (assuming this is connected) by such a Taylor series around any of the domain positions.
Locality: In the end, only the case that happens matters. Without loss of generality, let us assume that a particular situation turned out to be the case. Of all statements about the situation, only the prediction Alice made about this case before it turned out to be the case is relevant for her loss.
Properness: If possible, Alice should favor transmitting her actual knowledge state to Bob, P(s|B) = P(s|A).
Calibration: The expected loss of being proper shall be zero.
Locality implies that the loss can only depend on Alice's and Bob's knowledge states through the probabilities they assign to the situation that happens, meaning

(25)

again using the function signatures[53] to discriminate different losses and introducing a Lagrange multiplier to ensure that P(s|B) is normalized.
Properness then requests that the expected loss should be minimal for P(s|B) = P(s|A), implying for all possible s
(26)
From this
(27)
follows, which is solved analytically by
(28)
as can be verified by insertion. The Lagrange multiplier is unspecified, and we choose it such that the units of this loss are nits, with its positive sign ensuring that the proper choice is actually a minimum (a different magnitude would set the units to bits or shannons).
Calibration then requests that the loss vanishes for the proper choice, and thus, by reinserting this into Equation (24), we find
(29)
This is the relative entropy as defined by Equation (1). We note that calibration is more of an aesthetic requirement, as Alice's choice is already uniquely determined by any loss function that is local and proper. Calibration, however, makes the loss reparametrization invariant, as its value is unchanged under any diffeomorphism of the situation space, as can be verified by a coordinate transformation:
(30)
Strictly speaking, Equation (27) implies Equation (28) only in an infinitesimal environment of the proper choice. Only thanks to the requirement of the loss being “analytic” in the full domain of its second argument (added here to the requirement set of Leike and Enßlin[31]) does Equation (28) have to hold at all other locations, and Alice's expected loss becomes uniquely determined to be the relative entropy.
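Since the intermediate steps are stated compactly, the following sketch spells out the variational argument in the notation assumed in this rewrite (P(s|A) and P(s|B) for the two knowledge states, l for the local loss, λ for the Lagrange multiplier); it is an illustration of the logic, not a verbatim reproduction of Equations (25)-(28).

\bar{l} = \int \mathrm{d}s\, P(s|A)\, l\bigl(P(s|A), P(s|B)\bigr) + \lambda \Bigl( \int \mathrm{d}s\, P(s|B) - 1 \Bigr)

Properness demands stationarity at P(s|B) = P(s|A) for every s,

P(s|A)\, \frac{\partial l(y, x)}{\partial x}\Big|_{x = y = P(s|A)} + \lambda = 0 ,

which, extended over the full domain by analyticity, is solved by l(y, x) = -\lambda \ln x + c(y). Calibration fixes c(y) = \lambda \ln y, so that the expected loss becomes

\bar{l} = \lambda \int \mathrm{d}s\, P(s|A)\, \ln \frac{P(s|A)}{P(s|B)} ,

i.e., the relative entropy of Equation (1) up to the multiplicative factor λ that sets the units.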
2.2 Weighted Coding
Could one introduce weights into the relative entropy by inserting them into the above derivation? One could try to do so by modifying the locality requirement, requiring
(31)
to have a minimal expectation value for Alice. Propriety requires that
(32)
as follows from a calculation along the lines of Equation (26). This is analytically solved by
(33)
where we directly ensured calibration and fixed the free constant as before. Alice's expected weighted loss is therefore (for normalized weights)
(34)
which is exactly the same unweighted relative entropy as before. Therefore, weights in a relative entropy as a means of deciding on optimal coding are not consistent with our requirements. Since we have modified the locality requirement, and since calibration is not essential, the requirement that prevents weights must be properness.
As weighted relative entropy is not proper,[33] we ask whether a different way to introduce weights could be proper. In order to answer this question, we turn to a conceptually simpler setting. For this we introduce the concept of Theory of Mind.[54, 55] This is the representation of a different mind in one's own mind. As this can be applied recursively (“I think that you think that I think…”), one discriminates Theories of Mind of different orders according to the level of recursion. The above derivation of the relative entropy is based on a second order Theory of Mind construction, namely that Alice does not want Bob to think badly about her informedness (she worries about his beliefs on her thinking). A first order Theory of Mind construction, in which Alice only cares about Bob being well informed about what matters to Bob, might be more instructive for understanding how weights might emerge. We now turn to such a scenario.
3 Optimal Communication
3.1 Communication Scenario
In the following we investigate the scenario sketched in Figure 3: Alice is relatively well informed about the situation her communication partner Bob will find himself in. In this not perfectly known situation Bob needs to take an action a, which is rewarded by a utility u_B(s, a) to Bob that depends on the situation s that will eventually occur. Bob's action also implies a utility u_A(s, a) to Alice. We will eventually assume that Alice's and Bob's utility functions are aligned, u_A = u_B, and give arguments why this might happen, but for the moment we keep them separate in order to study the consequences of different interests.
Figure 3. Sketch of the investigated communication scenario. Alice communicates parts of her knowledge about an unknown situation to Bob. After updating according to Alice's message, Bob chooses his action, which for simplicity is here assumed to be a point in situation space (for example, the blue point could indicate the situation to which his action is best adapted). His action and the unknown situation determine Bob's resulting utility. Bob chooses his action by maximizing his expected utility given his knowledge after Alice has informed him (blue equal-probability contours in his mental copy of the situation space). The action and situation also determine a utility for Alice, which may or may not equal Bob's utility. Alice chooses her message such that her expected utility resulting from Bob's action is maximized in the light of her situation knowledge (red contours). In case she is honest, she can only choose which parts of her knowledge she reveals with her message by deciding on a message topic; the message data is then determined by her knowledge.
Such misalignments are actually very common. Just imagine that Bob is Alice's young child, the action is the amount of sugar that Bob is going to consume, and the situation is how well his metabolism handles sugar. There is a lot of anecdotal evidence for u_A ≠ u_B under such conditions.
Alice will communicate parts of her knowledge to Bob through a bandwidth-limited channel; Bob is here assumed to trust Alice fully, so that he can perform a more informed decision on his action. As stated before, Alice's message takes the form m = (f, d), with the conversation topic f being a moment function over the set of situations, taken out of a limited set of such functions she can choose from.
With increasing size of this set, Alice's communication channel becomes more flexible, and with increasing dimension of the data space, the channel bandwidth increases. Her message to Bob therefore consists of the tuple m = (f, d).[56]
In case

d = \langle f(s) \rangle_{(s|A)}   (35)

actually holds as communicated, Alice is honest; otherwise she lies.
Alice assumes Bob to perform a knowledge update following the MEP upon receiving her message, such that

\langle f(s) \rangle_{(s|B')} = d   (36)

Thus, by choosing f and d Alice can directly determine certain expectation values of Bob's knowledge, which then influence Bob's action.
3.2 Optimal Action
The action Bob chooses optimally is

a^{*} := \mathrm{argmax}_{a}\, \langle u_B(s,a) \rangle_{(s|B')}   (37)

= \mathrm{argmax}_{a} \int \mathrm{d}s\, P(s|B')\, u_B(s,a)   (38)

For simplicity we assume that this action is uniquely determined and that u_B is twice differentiable w.r.t. a. Then, we find that a^{*} solves

0 = \frac{\partial}{\partial a} \langle u_B(s,a) \rangle_{(s|B')} \Big|_{a = a^{*}}   (39)

= \Big\langle \frac{\partial u_B(s,a)}{\partial a}\Big|_{a = a^{*}} \Big\rangle_{(s|B')}   (40)

meaning that for the action Bob chooses, his expectation of his utility gradient w.r.t. his action has to vanish.
From Alice's perspective, Bob's optimal action would be

a^{A} := \mathrm{argmax}_{a}\, \langle u_A(s,a) \rangle_{(s|A)}   (41)

= \mathrm{argmax}_{a} \int \mathrm{d}s\, P(s|A)\, u_A(s,a)   (42)

implying that a^{A} solves

0 = \frac{\partial}{\partial a} \langle u_A(s,a) \rangle_{(s|A)} \Big|_{a = a^{A}}   (43)

= \Big\langle \frac{\partial u_A(s,a)}{\partial a}\Big|_{a = a^{A}} \Big\rangle_{(s|A)}   (44)
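As a concrete illustration of Equations (37)-(40), the expected utility can simply be tabulated over a grid of actions and maximized directly. The sketch below is our own; the knowledge state and the utility function are assumed for illustration.

```python
import numpy as np

# Assumed discrete knowledge state P(s|B') and an assumed utility u_B(s, a).
s = np.linspace(-3.0, 3.0, 301)
p_bob = np.exp(-0.5 * (s - 0.5) ** 2 / 0.6**2)
p_bob /= p_bob.sum()

def u_bob(s, a):
    """Assumed single-peaked utility: quadratic penalty for missing the situation."""
    return -((a - s) ** 2)

actions = np.linspace(-3.0, 3.0, 601)
expected_u = np.array([np.sum(p_bob * u_bob(s, a)) for a in actions])
a_star = actions[np.argmax(expected_u)]

print("optimal action a* ≈", a_star)        # ≈ <s>_B' for this quadratic utility
print("<s>_B'            =", np.sum(p_bob * s))
```

For a quadratic utility the optimum is the posterior mean, which the brute-force maximization recovers up to grid resolution.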
3.3 Dishonest Communication
Let's first investigate the scenario of Alice being so eager to manipulate Bob's action to her advantage that she is willing to lie for that. In order that Bob does what Alice finds optimal for herself, a^{*} = a^{A}, she needs to manipulate Bob's updated knowledge state such that

\Big\langle \frac{\partial u_B(s,a)}{\partial a}\Big|_{a = a^{A}} \Big\rangle_{(s|B')} = 0   (45)
according to Equation (39). Thus, it would be advantageous for her if Bob's expected utility gradient vanished for the action a^{A}, which Alice prefers him to take, since then he would take this action, a^{*} = a^{A}. Alice can achieve this by setting his expectation for this gradient to zero via communicating an appropriate deceptive message to him. Any message whose topic is Bob's utility gradient at a^{A}, shifted by an arbitrary constant, and whose data is chosen accordingly will achieve that Equation (45) is satisfied. The constant is one Alice might use to stay formally consistent with her knowledge (if this is possible) or to obscure her manipulation. As her deceptive communication derives from Alice's knowledge and utility, these quantities imprint onto her message. This happens through the usage of her optimal action a^{A} in Equation (45). This action preferred by her is specified by Equation (41).
Thus, Alice will use a communication topic that reflects Bob's interest, as it is built on his utility, however evaluated for Alice's preferred action in this scenario. In order for her manipulation to work, she has to make Bob believe that his expected utility gradient vanishes at her preferred action and hope that this will indeed let him choose it.
Interestingly, Alice does not need to know Bob's initial knowledge state for this, as the MEP update ensures that the relevant moment of Bob's updated knowledge gets the necessary value, see Equation (45). Nevertheless, she needs to know his interests , as through exploiting those she can manipulate his action.
In the likely case that the communicated data does not reflect her actual expectation, Alice would be lying. However, lying is risky for Alice, since Bob might detect her lies in the long run, be it because Bob's knowledge after Alice informing him turns out too often to be a bad predictor of the occurring situation, or because of other telltale signs of Alice. Bob realizing that Alice lies could have negative consequences for her in the long run; therefore we assume in the following that Alice is always honest. However, she might still follow her own interests.[57, 58]
3.4 Topics under Misaligned Interests
What Alice faces in the general case of differing interests is a complex mathematical problem. Even if Alice is bound to be honest, she still has some influence on Bob by deciding which part of her knowledge she shares. She does not need to give him information that would drive his decision against her own interests. By choosing the conversation topic smartly, Alice could make Bob act in a way that is beneficial to both of them to some degree.
Let us assume for now that Alice knows both utility functions, Bob's and her own, as well as Bob's initial knowledge P(s|B). For a given topic f used in her honest communication, she can work out Bob's resulting updated knowledge P(s|B'), his action, as well as how advantageous that action would be for her, by calculating and optimizing
(46)
(47)
(48)
(49)
(50)
(51)
The last step here, determining Bob's updated knowledge state, is also an optimization problem according to the MEP. The optimal topic for Alice therefore results from the threefold nested optimization
(52)
Analytic solutions to this can only be expected in special cases. For future reference and for numerical approaches to the problem, we calculate the relevant gradient in Appendix D. Its component for a given situation s is
(53)
where Bob's updated knowledge, his action, and Alice's expected utility are given by Equations (47), (49), and (51), respectively. Inspection of the condition of a vanishing gradient and the terms that could allow for it is instructive, as it hints at the factors that drive Alice's topic choice. Alice has found a locally optimal topic for her communication when either
a) Alice's remaining interest is perfectly satisfied by Bob's resulting action,
b) for any situation a sophisticated balance (Equation 53) holds between Bob's interest in that situation and the difference in the probabilities Alice and Bob assign to it, or
c) the unbalanced term is orthogonal to Alice's remaining interest w.r.t. a metric given by the inverse Hessian of Bob's expected utility (as derived w.r.t. his action a),
and the corresponding location is not a minimum of the expected utility.
A detailed investigation of the general case of misaligned interests is left for future work. Here, only an illustrative example will be examined.
3.5 Example of Misaligned Interests
An instructive example of misaligned interests is in order. For this, let us assume that the space of possible situations as well as that of actions have two dimensions each. Alice's and Bob's initial beliefs shall be Gaussian distributions
(54)
(55)
with mean vectors and covariances such that Alice is (spectrally) better informed than Bob, and with the chosen coordinate system aligned with that knowledge (one coordinate axis is parallel to the direction of her most precise information). Furthermore, Bob's utility should be
(56)
so that he wants his action to match the situation. Alice would prefer it if he matched a target rotated by an angle, according to her utility
(57)
(58)
establishing a misalignment of their interests.
In this situation, Bob's expected utility is
(59)
from which his action
(60)
follows. Thus, the first moment of Bob's belief on s fully determines his action, and therefore Alice only needs to inform him about that. Let us therefore assume that Alice will use a linear topic along some normalized direction. We check later, using Equation (53), whether this is her best choice or not.
The data of her message is then
(61)
and Bob's updated knowledge state becomes
(62)
as verified in the following: The MEP fixes Bob's updated knowledge to be of the form
(63)
(64)
Requiring that the communicated moment is matched leads to
(65)
(66)
(67)
as we claimed. This implies that Bob's action follows directly from the communicated moment. Thus, in this situation Bob does exactly what Alice tells him, as his action is built from the two essential elements of her message, the topic direction and the data. However, being honest, Alice is not fully free in what she can say. With the choice of the topic direction, the message's data is fully determined thanks to her honesty.
Therefore, Alice's expected utility
(68)
becomes maximal for the topic direction
(69)
as a straightforward calculation shows. The sign of the topic direction has to be chosen such that Alice's expected utility is maximal.
For mostly aligned interests, Alice's optimal topic has an angle of about half the misalignment angle, which means it is a nearly perfect compromise between what is optimal for her and for Bob. For a vanishing misalignment angle their interests are perfectly aligned and Alice informs Bob ideally.
In case of orthogonal interests, Alice's optimal topic angle lies between the directions preferred by each of them, effectively informing Bob with a statement that is less informative for him than the one Alice would have made under aligned interests. Bob's resulting decision in that case turns out to be a perfect compromise between their interests, or, put differently, to be sub-optimal for each of them to the same degree. This situation is depicted in Figure 4.
Figure 4. Knowledge states and preferred actions of Alice and Bob in case of misaligned interests before (left) and after (right) the communication. The plane of situation values is shown. Bob's knowledge state, initially (left) and finally (right), is shown by the background color as well as by the blue contour lines at the 1- and 2-sigma levels. Alice's more precise knowledge is indicated only via red 1-, 2-, and 3-sigma level contours. The dots mark possible actions for Bob that are optimal for him under his knowledge (blue), under Alice's knowledge (green), or optimal for Alice (red). Comparing the two panels, especially the movement of Bob's optimal action (blue dot) between them, shows that Alice informs Bob such that he chooses an action that is a compromise between their interests.
For anti-aligned interests, Alice's optimal topic becomes completely uninformative, as revealing any more of her knowledge would be against her interests. This leaves Bob's knowledge unchanged and therefore lets him pick the action that is optimal under his initial knowledge.
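The compromise described above can be traced numerically. The sketch below is our own illustration of the scenario of this section: the Gaussian parameters, the rotation of Alice's target, and the restriction to a linear topic are assumptions (the original values behind Equations (54)-(57) are not reproduced), so only the qualitative behavior is meaningful.

```python
import numpy as np

# Assumed setup: 2D situations, Gaussian beliefs, Alice better informed than Bob.
m_alice = np.array([1.0, 0.5])
S_alice = np.diag([0.1, 0.4])            # Alice's covariance (more precise)
m_bob   = np.array([0.0, 0.0])
S_bob   = np.diag([1.0, 1.0])            # Bob's broader prior covariance

phi = np.pi / 2                          # misalignment: Alice's target is rotated
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])

def bob_action(n):
    """Bob's action after a MaxEnt update with linear topic f(s) = n.s and honest
    data d = n.m_alice; for a Gaussian prior this tilt only shifts the mean."""
    d = n @ m_alice
    mu = (d - n @ m_bob) / (n @ S_bob @ n)
    return m_bob + mu * (S_bob @ n)      # optimal for u_B(s,a) = -|a - s|^2

def alice_expected_utility(a):
    """Alice's expected utility for u_A(s,a) = -|a - R s|^2 under her knowledge."""
    return -np.sum((a - R @ m_alice) ** 2) - np.trace(R @ S_alice @ R.T)

angles = np.linspace(0.0, np.pi, 721)    # topic directions n = (cos t, sin t)
utils = [alice_expected_utility(bob_action(np.array([np.cos(t), np.sin(t)])))
         for t in angles]
best = angles[int(np.argmax(utils))]

print(f"optimal topic angle ≈ {np.degrees(best):.1f} deg")
print("Bob's resulting action      :", bob_action(np.array([np.cos(best), np.sin(best)])))
print("optimal for Bob (her view)  :", m_alice)
print("Alice's preferred action    :", R @ m_alice)
```

With these assumed numbers the optimal topic direction is neither the one that would serve Bob best nor one that withholds Alice's knowledge entirely; Bob's resulting action lands between the action optimal for him under her knowledge and the one she would prefer, which is the compromise discussed above.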
To summarize, misalignment of interests leads to a communication and a resulting action that are a compromise between the interests of the communication partners. Who of the two has to compromise more depends on details of their knowledge states and their interests.
Alice informing Bob sub-optimally to her own advantage bears for Alice the risk of Bob realizing this. In repeated situations, Bob might recognize the systematic misalignment angle between the situations that happened and the topic directions chosen by Alice on the basis of her information about them. This might let him question either Alice's good intentions for him or her competence. In either case he could threaten to ignore or even counteract her advice until he gets convinced that she has largely aligned her interests with his.
The general optimal topic of Alice's message could, however, be a non-linear function of the situation instead of the linear one assumed above. This can be checked by inspecting the functional gradient of her expected utility w.r.t. the topic as given by Equation (53). In case it vanishes for all situations, the topic was optimal.
It turns out that this gradient only vanishes when Alice's and Bob's interests are aligned, as shown in Appendix E. For the instructive case of orthogonal interests, however, the gradient
(70)
does not vanish. This indicates that she could construct a more sophisticated message that would pull Bob's resulting action a bit closer towards her own interest and further away from the action optimal for him (under her knowledge). The precise form of the optimal topic for Alice is left for future work.
3.6 Aligned Interests
In the following we assume that Alice simply wants the best for Bob from his perspective and therefore adopts his utility as her own,
(71)
In this case, Alice informs Bob optimally via a message whose topic is the moment relevant for his decision, which leads to a synchronization of their expectations w.r.t. that moment, and therefore to an alignment of their optimal actions.
For simplicity, we assume in the following that the action is described by one real number. An extension to a vector-valued action space is straightforward, but does not add much to the discussion below except complexity in the notation. Furthermore, we assume Alice and Bob's common utility function to be uni-modal in a given situation and to be well approximated within the relevant region by
(72)
Here, the first parameter is the optimal action in a given situation, the second the utility of this optimal action, the third the tolerance for deviations from the optimal action, and the exponent specifies how harshly larger deviations reduce the utility. This should serve as a sufficiently generic model that can capture a large variety of realistic situations. In particular, the case of a quadratic loss (= negative utility) mimics the typical situation in which a Taylor expansion in the action around the optimal action can be truncated after the quadratic term.
Alice's expected utility of Bob’s action
(73)
has the gradient
(74)
(75)
(76)
This is a polynomial of odd order in the action and therefore guaranteed to have at least one real root. The root that maximizes the expected utility among all such roots then gives the optimal action. Thus, the topic function
(77)
is Alice's best choice for a communication that ensures that Bob makes an optimal decision.
In the simplest cases this topic is a low-order polynomial in the situation variable. Instead of communicating the expectation value of this polynomial, which requires her to work out the optimal action, Alice could simply communicate all moments of her knowledge up to the corresponding order and thereby ensure that Bob has all information needed in order to decide on the optimal action.
Here, the requirement of properness appears in a weak form: Alice wanting to inform Bob about a number of moments of her knowledge, in order to put him into a position to make a good decision, is a weak form of properness. Full properness would mean that Alice wants Bob to know all possible moments of her knowledge. Thus, properness is expected to occur when Alice does not know Bob's utility function, but wants to support him no matter what his interests are. We will now turn to such a scenario.
4 Attention
We saw why Alice might align her interest with Bob's and in the following assume this to have happened, . Her knowledge on Bob's utility function influences how she selects her message optimally. For the concept of attention to appear in her reasoning, Alice must not know Bob's utility function in detail. In case she did, she would optimize for the utility function. However, she needs to be aware of the sensitivity with which Bob's utility reacts to Bob's choices in the different situations in order to give those situations appropriate weights in her communication. These weights will determine how accurately she should inform about the different situations such that Bob is optimally prepared to make the right decision.
To be concrete, let Alice assume that in a given situation Bob's utility has a single maximum at some optimal action unknown to her, of a height unknown to her, but with a curvature known to her. Furthermore, she assumes that this utility function can be well Taylor-approximated around any of these maxima as
(78)
This corresponds to the quadratic case of Equation (72). We have added the location, height, and curvature of the maximum to the list of arguments of this approximate utility function, as Alice needs to average over the ones unknown to her, namely the location and the height.
In order to circumvent the technical difficulty of dealing with probabilities over function space, let us restrict the following discussion to a discrete set of situations. In this case the parameters become finite-dimensional vectors with one component per situation. The case of a continuous set of situations is dealt with in Appendix F.
Alice assumes that Bob's action will depend on these parameters and is given by
(79)
where argmax denotes the (here assumed to be unique) maximizing value of the action. Furthermore, we introduced the curvature-weighted expectation value
(80)
that involves the attention defined in Equation (3).
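To make Equations (79) and (80) concrete, the following worked step (in the notation assumed here, with k_s the curvature, â_s the unknown optimal action, û_s the unknown maximal utility in situation s, and P(s|B') Bob's knowledge) maximizes Bob's expected utility in the quadratic approximation of Equation (78):

\frac{\partial}{\partial a} \sum_s P(s|B')\Bigl[\hat{u}_s - \tfrac{k_s}{2}\,(a - \hat{a}_s)^2\Bigr] = -\sum_s P(s|B')\, k_s\, (a - \hat{a}_s) = 0

\Longrightarrow\quad a^{*} = \frac{\sum_s k_s\, P(s|B')\, \hat{a}_s}{\sum_s k_s\, P(s|B')}

Thus the optimal action is the curvature-weighted mean of the situation-wise optimal actions, i.e., an expectation value under Bob's attention function with weights proportional to the curvatures. This is why it is the attention function, rather than the bare probability, that Alice needs to transmit accurately.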
Let us assume that Alice believes the curvatures of Bob's utility maxima to be known to her. This might be because she can estimate the influence Bob's actions have on his own well-being in the different situations. For example, in the extreme case that Bob might be dead in a given situation, she might set the corresponding curvature to zero, as none of Bob's possible actions then matter to him any more. It will turn out that the actual values of the curvatures do not matter, only their values relative to each other.
Thus, her knowledge about Bob's utility is
(81)
with a relatively uninformative probability density. We assume this to be independent of her knowledge on the situation.
Furthermore, we assume here, in order to have a simple instructive scenario, that Alice only has a vague idea around which value the location of the maximum of Bob's utility could be, and how much it could deviate from that value. We assume that she is not aware of any correlation of this function nor of any structure of its variance and therefore define
(82)
(83)
(84)
In the last step we used that Alice's knowledge on the location of the maximum is uninformative, therefore unstructured, and thus its uncertainty covariance is proportional to the unit matrix. The remaining parameter expresses how much variance Alice expects in this location. Its precise value will turn out to be irrelevant.
Other setups, in which this variance is not constant or contains cross-correlations, are addressed in Appendix F.
With the above assumptions, the expected utility is
(85)
(86)
(87)
The three terms I-III occurring therein are
(88)
(89)
(90)
where we wrote just for for brevity. Inserting I–III into Equation (85) gives
(91)
This expected utility needs to be maximized w.r.t. Bob's knowledge after the update. It is obvious that the maximum is reached when Bob's attention function equals Alice's, as then the negative squared-difference term becomes zero.

At this maximum Bob's knowledge equals Alice's if the weights are positive for all situations, which means that Alice strives to communicate properly, if possible. Otherwise she tries to minimize the L2-norm between her and Bob's attention functions.

In case the situation space is continuous, properness appears as Alice's optimal strategy as well if the correlation structure of her uncertainty on the utility maxima is either diagonal or translation invariant, as is shown in Appendix F. If these conditions are not met, Alice will optimally transmit a biased attention to Bob.
5 Relative Attention Entropy
5.1 Derivation
We have seen how attention appears naturally in a communication scenario in which an honest sender tries to be supportive of the receiver, without knowing details of the receiver's utility function except for having a guess for the variation of its narrowness in different situations. The measure used by such a sender to match the receiver's attention function to her own is then typically of a square-distance form, like in Equation (91), maybe with some bias term as in Equation (F8). In any case, attention seems to be a central element of utility-aware communication.
This poses the question whether there is a scenario in which relative attention entropy appears as the measure the sender should use to choose among possible messages. The answer to this question is yes.
In case Alice assumes that Bob will judge her prediction on the basis of how much attention was given to the situation that ultimately happened, and knows the weights Bob will apply to turn probabilities into attentions, as well as wants to be proper, if possible, the relative attention entropy can be derived analogously to the derivation of relative entropy in Section 2.1, as we see in the following.
Again, we require the measure to be analytical, proper, and calibrated, and only modify the requirement of locality to “attention locality”: Only Bob's attention for the case that happens in the end matters for Alice's loss.
Again, at the time Alice has to make her choice, she does not know which situation will happen in the end, and therefore needs to minimize her expected loss
(92)
for deciding which knowledge state Bob should get (via her message m). This loss will depend on the weight function that turns probabilities into attentions according to Equation (3). We assume w(s) > 0 for all s in the following, such that a_B(s) = a_A(s) for all s implies P(s|B) = P(s|A) for all s and vice versa.
“Attention locality” implies that the loss must depend on Bob's knowledge only through the attention a_B(s) he assigns to the occurring situation, whereas the dependence on Alice's knowledge could still be through P(s|A). However, as the information content of P(s|A) and a_A is equivalent, it is convenient to express the dependence on Alice's knowledge through a_A as well, as then properness is given when a_B = a_A.
Thus we have
(93)
where, as before, we use the function signatures to discriminate different losses and introduce a Lagrange multiplier to ensure that a_B is normalized.
“Properness” then requests that the expected loss should be minimal for a_B = a_A, implying for all possible s
(94)
From this follows
(95)
which is solved analytically by
(96)
as can be verified by insertion. We note that
(97)
and choose .
“Calibration” then requests that the expected loss of being proper vanishes, and therefore
(98)
Thus, Alice's loss function to choose the message
(99)
turns out to be the relative attention entropy. This closes its derivation.
5.2 Comparison to Other Scoring Rules
A brief comparison of relative attention entropy to other attention based score functions is in order.
First we note that in case the weights are constant, relative attention entropy reduces to relative entropy.
For the comparison to the communication scenario of Section 4, in which Alice wants to support Bob as much as possible but only knows the curvature of Bob's utility, we investigate the limit of small relative differences between the attention functions. In this case, relative attention entropy is well approximated by
(100)
which is the well-known information metric. Comparing this to the negative loss function of Equation (91), as generalized to the continuum by Equation (F8) gives under the assumptions of homogeneity and independence (, see Appendix F for details)
(101)
There is at least one similarity between these scores, in that deviations between the attention functions should be avoided, as the loss increases with their square. However, these scores also differ in a significant point: for the attention entropy the deviation in attention functions is inversely weighted with Alice's attention. This means that relative attention entropy allows for larger deviations in regions of higher attention compared to the utility-based score, and requires smaller deviations in regions of low attention.
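Written out in the attention notation used above (and with the deviation a_B − a_A assumed small), the two scores compared here are, up to constants and prefactors,

\mathcal{D}_a(A,B) \approx \frac{1}{2} \int \mathrm{d}s\, \frac{\bigl(a_B(s) - a_A(s)\bigr)^2}{a_A(s)} \qquad\text{versus}\qquad \frac{1}{2} \int \mathrm{d}s\, \bigl(a_B(s) - a_A(s)\bigr)^2 .

The first, chi-square-like form of the information metric divides the squared deviation by Alice's attention, so a given deviation is penalized less where her attention is already high; the second, utility-derived score penalizes all deviations equally.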
Finally, we note that weighted relative entropy as well as relative attention entropy are equivalent to scoring rules.[25, 26] Scoring rules evaluate how well a belief matches a probability and are—in our notation—of the functional form
(102)
with some loss function that expresses how bad it is to believe only with a certain strength in an event that might happen with the correct probability. Scoring rules are used to choose the “best fitting” belief among a set of beliefs, by picking the one that has the lowest score. They are called proper if the best fit is the correct probability whenever the latter is part of the set of beliefs to choose from. Any affine transformation that does not depend on the belief does not change the minimizer of the score and therefore leads to identical results. Thus, for our claim of equivalence we only need to show that weighted relative entropy and relative attention entropy can be brought into the extended form
(103)
This works for weighted relative entropy by choosing
(104)
(105)
(106)
as well as for relative attention entropy with
(107)
(108)
(109)
Thus, the well developed formalism of scoring rules[26] can be used to investigate these entropies. It might be interesting to note in that context that weighted relative entropy is equivalent[59] to a local scoring rule, since its loss depends on the belief only at the situation in its argument. However, attention entropy is equivalent to a non-local score, as the normalization of the attention function combines values of the belief for different situations.
6 Conclusion
6.1 Properness, Attention, and Entropy
Entropy is a central element of communication theory. Relative entropy allows a sender to decide which message to send in order to inform about an unknown situation in case only the communicated probability of the situation that finally happens matters. Naively introducing an importance weighting for the different situations into relative entropy renders the resulting weighted relative entropy improper, meaning that it does not favor transmitting the sender's precise knowledge state in case this is possible.
In order to find guidance on how a weighting could be introduced into entropic communication properly, we investigated the scenario in which a sender, Alice, informs a receiver, Bob, about a situation that will matter for a decision on an action Bob will perform. The goal of this exercise is to find a scenario that encourages Alice to be on the one hand proper, and on the other hand to include weights into her considerations. Alice can decide which aspects of her knowledge she communicates and which she omits. In case the utility functions of Alice and Bob differ, Alice might be tempted to lie to Bob. This would certainly be improper. We argued that lying should be strongly discouraged if Alice and Bob interact repeatedly, as otherwise Bob might discover that Alice lies and stop cooperating with her, or even punish her by taking actions that impact her utility negatively. Only the existence of this option for Bob could give Alice a sufficient incentive towards honesty.
But even if Alice is bound to be honest, she can still choose which parts of her knowledge are revealed to Bob and which she prefers to keep to herself, by communicating diplomatically. In order to be able to influence Bob's action to her advantage, Alice has to give him some information that is useful for him, but only in a way that this information also serves her interests. This way, both expect to benefit from the communication, which is honest, but not proper.
Again, in a repeated interaction scenario, Bob has a chance to discover that Alice is not fully supportive of his needs by judging how helpful Alice's communications were and whether there are systematic omissions of relevant information. For example, in the scenario discussed in Section 3.5, in which Alice's interests are always rotated to the left of Bob's, he might realize that her advice makes him choose actions that are typically rotated to the left of what would have been optimal for him. Under the plausible assumption that her knowledge is generated independently of his utility, a few such incidents should make him suspicious about whether Alice really provides him with all of her information that is relevant for him. Thus, Alice also risks getting a bad reputation by not being fully supportive of Bob.
Assuming then that Alice aligns her interests with Bob's, we still do not find that Alice is forced to be proper, as she only needs to inform him about the aspects of her knowledge that are relevant for his action.
In order to recover properness in this communication scenario, we needed to assume that Alice is fully supportive of Bob, but does not know his utility function in detail. Now she has to inform him properly, to prepare him for whatever his utility is. Furthermore, if she knows how sharply his utility function is peaked in the different situations, she should fold this sharpness as a weight into her measure for choosing how to communicate. More precisely, Alice should turn her knowledge state into an attention function, basically a weighted probability distribution that is again normalized to one. And then she should communicate such that Bob's similarly constructed attention function becomes as close as possible to hers. In the discussed scenario, the square difference of the attention functions should be minimized. This quadratic loss function for attentions has a well known equivalent for probabilities, the Brier score.[60] For this an axiomatic characterization exists,[61] which requests properness as one of the axioms (there called “incentive compatibility”). Here, we found a communication scenario in which properness emerges from the request that the communication should be useful for the receiver, without having that use specified.
This last scenario therefore provides a communication measure that is proper and weighted. It is, however, a quadratic loss and therefore of a different form than an entropy based on a logarithm. Nevertheless, it shows the path on how to construct such a weighted entropy that leads to properness.
In order to have a proper and weighted entropic measure, we have to request that Alice's communication is judged by Bob on the basis of which attention value she gave to the situation that finally happened. This and the request of properness then determines relative attention entropy as the unique measure for Alice to choose her message.
It should be noted that attention is here formed by giving weights to different possible situations. In machine learning, the term attention is prominent in the form of weights on different parts of a data vector or latent space.[44-46] These two concepts of attention are not completely unrelated, as giving weight to specific parts of the data implies weighting up the possibilities to which these parts of the data point.
Our purely information-theoretically motivated considerations should have technical as well as socio-psychological implications, as we discuss in the following.
6.2 Technical Perspective
The concept of attention and its relative entropy should have a number of technical applications.
In designing communication systems, the relevance might differ between the situations about which the communication should inform. Attention and its relative entropy guide how to incorporate this into the system design. More specifically, in the problem of Bayesian data compression one tries to find compressed data that imply an approximate posterior that is as similar as possible to the original one, as measured by their relative entropy.[62] However, there can be cases in which the relative attention entropy is a better choice, as it permits importance weighting of the potential situations.
Bayesian updating from a prior P(s) to a posterior

P(s|d) = \frac{P(d|s)\, P(s)}{\int \mathrm{d}s'\, P(d|s')\, P(s')}   (110)

is of the form of forming an attention function out of the prior distribution, with the weights being given by the likelihood P(d|s). Communicating a prior in the light of data one might already have gotten is then also best done using the corresponding relative attention entropy.
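A two-line numerical illustration of this correspondence (our own sketch with an arbitrary prior and likelihood):

```python
import numpy as np

s = np.linspace(0.0, 1.0, 501)                 # e.g., an unknown success rate
prior = np.ones_like(s) / s.size               # flat prior P(s)
likelihood = s**3 * (1 - s)                    # P(d|s) for 3 successes, 1 failure

# Bayes' theorem, Eq. (110), is exactly the attention construction of Eq. (3):
# the prior is re-weighted by the likelihood and re-normalized.
posterior = prior * likelihood
posterior /= posterior.sum()

print("posterior mean:", np.sum(posterior * s))   # ≈ 4/6 for this Beta posterior
```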
Furthermore, we like to stress that attention functions as defined here are formally equivalent to probabilities and can therefore—formally—be inserted into any formula that takes those as arguments. In particular, all scoring rules for probabilities[8] can be extended to attentions, and therefore attention provides a means to introduce the concept of relevance into those.
Finally, we like to point out that ensuring that more relevant dimensions of a signal or situation are more reliably communicated can be achieved by constructions like
(111)
in which the additional relative entropies for individual signal directions are weighted according to their relevance. The leading relative entropy term ensures propriety of the resulting scoring rule for any choice of the weights.
6.3 Socio-Psychological Perspective
Attention, intention, and properness are concepts that play a significant role in cognition, psychology, and sociology.[46, 63-67] This work made it clear that utility-aware communication naturally involves the concept of attention functions, which guide the choice of topics towards the more important aspects of the speaker's knowledge that are to be communicated. There could be, for example, situations in which the different options for action a message receiver has do not matter much, and detailed knowledge of these situations is therefore not of great value to him. The sender of messages should not spend much of her valuable communication bandwidth on informing about such situations of low empowerment for the receiver.
In our derivation of properness and attention we investigated scenarios in which the interests of speaker and receiver deviate. This is a very common situation in social interactions. We saw that misaligned interests can leave an imprint on the topic choice of otherwise honest communication partners. Based on our calculations in Section 3.4, we expect that the usefulness of received information decreases the more the interests of the sender differ from those of the receiver. If the interests are exactly oppositely directed, the sender would prefer to send no information at all. Otherwise, the optimally transmitted information will result in a compromise between the sender's and the receiver's interests.
The fact that misalignment of interests in general reduces the information content of messages in a society of mostly honest actors provides the possibility to detect and measure the level of such misalignment. Furthermore, our analysis shows that the specific topic choices made by communication partners should allow conclusions to be drawn about their intentions, and about their beliefs about the intentions of the receivers of their messages.
Acknowledgements
The authors thank Viktoria Kainz and two anonymous reviewers for constructive feedback on the manuscript. P.F. acknowledges funding through the German Federal Ministry of Education and Research for the project ErUM-IFT: Informationsfeldtheorie für Experimente an Großforschungsanlagen (Förderkennzeichen: 05D23EO1).
Open access funding enabled and organized by Projekt DEAL.
Conflict of Interest
The authors declare no conflict of interest.
Appendix A: Attention Example Calculations
Here, we give details of the calculation for Section 1.3. Before we calculate the attention functions we note that weighting a Gaussian with an exponential weight function shifts and re-scales it:
An analogous calculation gives Bob's attention function,
(A5)
(A6)
as claimed by Equation (11). The relative entropy of these—up to terms that do not depend on and are dropped (as indicated by “” in the following whenever happening)—is
(A7)
where we introduced the attention-weighted averaging. From this it becomes apparent that Bob's attention inherits the first and second moments from Alice's attention during the minimization of the relative attention entropy w.r.t. the parameters:
(A8)
(A9)
The first moment is
(A10)
The second moment is
(A11)
as claimed in Equation (13). From this it follows that for the mean of Bob's final knowledge the following equation holds,
Finally, the mean and uncertainty dispersion of Bob's knowledge state in case Alice uses the weighted relative entropy of Equation (2) for designing her message to Bob need to be worked out. This entropy—up to irrelevant constant terms—is
(A13)
Minimizing this w.r.t. yields
(A14)
(A15)
as Equation (14) claims. Inserting this into the weighted relative entropy and minimizing w.r.t. yields
(A16)
which was claimed by Equation (15). This completes the calculations for Section 1.3.
Appendix B: Real World Communication
We want to illustrate how Alice's moment-constraining messages of the form of Equation (17) can embrace ordinary, real-world communications with an example. A general proof that the communication of moments is sufficiently rich to express any message is beyond the scope of this work.
As an illustrative example, we look at the statement m = “Tomorrow’s weather should be alright”. The relevant but unknown situation is tomorrow's weather, which we assume for the sake of the argument to be out of , the latter being a numerical embedding of these situations. It is then reasonable to assume that the statement contains the message “,” i.e., the first components of are and . Furthermore, the word “should” (out of the set {“is going to”, “should”, “might”}) can be read as a quantifier of the sender's uncertainty about the situation, which shall here be interpreted as a statement on the variance, “,” implying and . Thus the message is given by .
As no prior information is specified, we assume . This leads to and therefore to
(B1)
from which it follows that , , and therefore . Given that , we then get as our weather prognosis and .
The alternative statement “Tomorrow’s weather might be alright” implies , , , and therefore leads to a bit less confidence in tomorrow's weather, with and .
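The step from such moment data to a weather prognosis is the standard MEP construction. In generic notation chosen here (a flat prior over the embedding interval, a communicated mean $m$ and variance $\sigma^2$; these symbols are ours, not the paper's), the maximum entropy density consistent with the constraints $\langle s_1\rangle = m$ and $\langle s_1^2\rangle - m^2 = \sigma^2$ has the exponential-family form
$$ P(s_1 \mid m, \sigma^2) \;\propto\; \exp\!\left(\mu_1 s_1 + \mu_2 s_1^2\right), $$
with the Lagrange multipliers $\mu_1$ and $\mu_2$ fixed by the two constraints, i.e., a Gaussian with mean $m$ and variance $\sigma^2$ (boundary effects of the finite embedding interval neglected).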
Of course, the language embedding chosen here, meaning a representation of a language in a mathematical structure, in this case the representation of statements about the weather in terms of a topic function and message data , is only one possibility out of many. The language embeddings used by the speaker and the recipient of a message need to be identical for high-fidelity communication. In reality, the embedding will depend on social conventions that can differ between speaker and recipient. This might in part explain the difficulty of communication across cultures, even when a common language is used.
Appendix C: Accurate Communication
Here, we show that the message format of Equation (17) in principle permits Alice to transfer her exact knowledge to Bob if there are no bandwidth restrictions. In case she knows his knowledge state , she can simply send the relative surprise function as well as , which turns out to be the amount of information she is transmitting. Bob then updates to , as a straightforward calculation shows:
(C1)
(C2)
(C3)
(C4)
(C5)
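A hedged sketch of this calculation, in notation chosen by us rather than taken from the paper, reads as follows. Suppose the topic is the relative surprise $t(s) = \ln[P(s|I_A)/P(s|I_B)]$ and the transmitted data are $d = \int \mathrm{d}s\, P(s|I_A)\, t(s)$, which is just the relative entropy of Alice's knowledge with respect to Bob's. Bob's maximum entropy update relative to his prior has the exponential-family form
$$ P(s \mid d, I_B) \;\propto\; P(s \mid I_B)\, e^{\mu\, t(s)}, $$
with the multiplier $\mu$ fixed by the constraint $\int \mathrm{d}s\, P(s \mid d, I_B)\, t(s) = d$. For this choice of $t$ and $d$ the constraint is solved by $\mu = 1$, for which $P(s \mid d, I_B) = P(s \mid I_A)$, so that Bob ends up with exactly Alice's knowledge state.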
In case she does not know his initial belief, she can alternatively send her knowledge by using the vector-valued topic . The message data are then a vector that contains her full probability function, to which Bob would update his knowledge.
Appendix D: Topic Gradient
Here we work out the gradient of Alice's utility w.r.t. the topic of her honest communication given in Section 3.2 according to Equations (46-51). This gradient
(D1)
consists of a product of Alice's expectation for her utility gradient given Bob's action and the change of his action with the topics of her communication. This gradient vanishes either when Bob happens to choose the action that is optimal for Alice, such that and therefore , as the latter is zero thanks to Bob's choice of action (see Equation (39)), or when a further change in does not change Bob's action any more.
Bob's chosen action is the result of the minimization in Equation (47). Its gradient w.r.t. can be worked out using the implicit function theorem:
(D2)
The last term of this is
(D3)
with
(D4)
since according to the inverse function theorem applied to the quantity determining
(D5)
we have
(D6)
thanks to
(D7)
Collecting terms gives the topic gradient of Alice's utility as
(D8)
with and according to Equations (47) and (51), respectively.
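For reference, the generic implicit-function-theorem step used in this appendix can be stated as follows, in notation chosen here that need not coincide with the paper's: if Bob's action $a^{\star}(t)$ is defined by the stationarity condition $\partial_a F(a, t) = 0$ of the objective $F$ he minimizes, then
$$ \frac{\partial a^{\star}}{\partial t} \;=\; -\left[\frac{\partial^2 F}{\partial a\,\partial a}\right]^{-1} \frac{\partial^2 F}{\partial a\,\partial t}\,\Bigg|_{a = a^{\star}(t)}, $$
so that the sensitivity of his action to Alice's topic choice is controlled by the curvature of his objective.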
Appendix E: Specific Topic Gradient
Here, the topic gradient given by Equation (53) at is calculated for the simple example of misaligned interests of Alice and Bob as discussed in Section 3.4.
In order to have a concise notation, let us first note that
(E1)
(E2)
(E3)
(E4)
and therefore
(E5)
With this, the building blocks of the gradient given by Equation (53) are
(E6)
(E7)
(E8)
(E9)
(E10)
(E11)
Inserting these terms into the topic gradient of Alice's utility yields
(E12)
This obviously does not vanish for all unless or simultaneously and , where the latter means that Alice's message contained no news for Bob. For example, for , where , , , and the gradient is
(E13)
Appendix F: Attention in Continuous Situations
Here, we repeat the calculation from Section 5 for the case of a continuous situation space and potentially inhomogeneous and correlated knowledge of Alice on .
For this calculation, it is convenient to switch to an information field theoretical notation[68, 69] by defining the fields , , , and their scalar product . Alice might not know but only have a vague idea about it, which is characterized by a mean and uncertainty covariance
(F1)
(F2)
For the latter usage we note that
(F3)
The expected utility is still given by Equation (86) with the three terms occurring therein being now
(F4)
(F5)
(F6)
where we introduced the notation for the diagonal of an operator. The first two of these terms appear in Equation (86) in the combination
(F7)
Inserting this as well as III into Equation (86) gives
(F8)
which is maximal for in case is independent of , as then , with arbitrary, is independent of . This happens, for example, in case Alice's uncertainty (covariance) on Bob's optimal action is the same for all situations. In this case, Alice optimally communicates properly, with , as this implies , which extremizes the utility.
Otherwise, in case depends on , her optimal communication, which she gets by maximizing w.r.t. , would be such that
(F9)
(F10)
Here, is a Lagrange multiplier that needs to be chosen to ensure proper normalization of via , where we introduced with a constant unit field, with . This implies . Thus, the attention that Alice would optimally communicate to Bob
(F11)
is her own attention, modified if , if this is possible for her.[70] For this shift away from her own attention to be desirable, , her expectation of the second moment of , needs to vary with (since the operator projects out homogeneous components) and must not be diagonal (since then , which is constant w.r.t. , would be projected out by ).
Thus, in case Alice's knowledge about is homogeneous w.r.t. , she wants to communicate her attention properly. If that is not possible, for example due to a limited set of topics she can use in her communication, she minimizes a squared distance, weighted with , between her attention function and Bob's.
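To make this last fallback concrete, the following sketch illustrates such a weighted squared-distance minimization over a restricted topic set; the target attention, the topic basis, and the weight field below are hypothetical placeholders, not the paper's construction:

```python
import numpy as np

# Situation grid and a hypothetical target attention function for Alice
s = np.linspace(-5.0, 5.0, 1001)
ds = s[1] - s[0]
a_target = np.exp(-0.5 * (s - 1.0) ** 2)
a_target /= a_target.sum() * ds

# Restricted set of topic functions available to Alice (hypothetical: low-order polynomials)
basis = np.stack([np.ones_like(s), s, s ** 2], axis=1)

# Hypothetical positive weight field entering the squared distance
w = np.exp(-0.1 * s ** 2)

# Weighted least squares: minimize  sum_s w(s) * (basis @ c - a_target(s))**2 * ds
sw = np.sqrt(w * ds)
coeffs, *_ = np.linalg.lstsq(basis * sw[:, None], a_target * sw, rcond=None)
a_best = basis @ coeffs  # closest attention achievable within the restricted topic set
```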
[27] In this paper we chose to take the objective Bayesian perspective,[24-26] in that we assume that is uniquely defined by and thus that for all implies that and carry equivalent information on . This choice is convenient, since it simplifies the discussion, but it is not essential.
[45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Advances in Neural Information Processing Systems 2017, 30.
[49] The term “attention” seems appropriate for this quantity: Attention is the concentration of awareness on some phenomenon to the exclusion of other stimuli.[47] It is a process of selectively concentrating on a discrete aspect of information, whether considered subjective or objective.[48] If “awareness on some phenomenon” can be read as the probability associated with it, the weighting done in our construction of attention then concentrates the awareness on relevant information.
[51] Jaynes' original works on the MEP used Shannon entropy, and therefore a uniform in a discrete setting,[2, 3] whereas in his later works he introduced the relative entropy, in particular for the case of continuous probability densities.[4-6]
[56] There is a bit of redundancy in this notation, since for any invertible affine encodes the same information as . For example, a modification of the topic allows one to absorb the information of the data into the topic by using , which changes the message to . As this redundancy is not a problem for the following discussion, we leave it in the notation.
[58] In case Bob realizes Alice is lying, he will stop updating his knowledge according to her messages, and she will lose her ability to inform him w.r.t. , for good or bad. For the complex dynamics that can emerge under not fully honest communication, the reader is referred to.[57] Bob might even partly decode from her message what Alice's interests are, as and imprint onto her message's topic. This can even enable him to choose actions that deviate from Alice's interests as strongly as he can afford in order to punish her for lying. Thus, although lying can definitely bring a short-term advantage to Alice, in the long run it could cost her more than she gains at the beginning. For this reason, she might decide to become honest, or to be honest from the start. Although performing punishments typically costs Bob in terms of his own utility, he might choose them such that they cost Alice more than they cost him. In this way, the punishments might educate her to become honest, which would make them a good investment for Bob, as he will benefit from the information Alice can share.
[70] It can turn out that this utility- and uncertainty-aware optimal attention function is not larger than zero for all . This would be an improper attention function, which cannot necessarily be communicated through entropic communication. In this case, we suspect that , with being the rectified linear unit activation function, is then the attention function optimally communicated by Alice.