Estimation of joint probability distribution from marginals

Hello Watt Technology Blog
Feb 15, 2021

This post describes how to estimate the joint probability distribution from marginal distributions using optimal transport.

At Hello Watt we help French customers better understand how they use energy in their households, so that we can provide them with personalized advice on how to reduce their electricity bill. For instance, one of the recurring questions our users ask is: “What is my consumption compared to my neighbors? Do I consume too much?” Of course, electrical consumption depends on the age of your building, on whether you live in an apartment or a house, on geography, and on many other factors. Such comparisons are therefore meaningful only within groups of similar households.

But we often find ourselves in a situation where our data is incomplete.

For example, the results of a survey might only be available in a compact, aggregated form.

Consider three variables: building type (B), year group (Y) and heating type (H).

Building types could be house or apartment; year groups could be before 1919, 1920–1945, 1946–1970, and so on; heating types could be electric, gas, etc.

What if a census agency (in our case we are using the INSEE database) provides the data for P(B, Y) and P(B, H), but you are instead interested in estimating P(Y, H) or even P(B, Y, H)?

Stated in the language of probability theory: what is our best guess at the joint distribution P(X, Y) (2D) given two marginals P(X) and P(Y) (1D)?

This post describes a simple solution and includes a snippet with a Python implementation in our GitHub repo.

First of all, we can multiply the marginal distributions as two vectors (an outer product): P(X, Y) = P(X) ∘ P(Y).

That, however, renders X and Y independent in the joint distribution.

How do we encode any information about the correlation between X and Y?

As we recall, the marginals are related through

P(Y) = ∫ P(Y | X) P(X) dX

Now it starts to look like the transport of (probability) measure!

Assume for a minute that we have discrete probability distributions P(Xₖ) and P(Yₘ). Then

Tₖₘ = P(Yₘ | Xₖ) P(Xₖ) = P(Xₖ, Yₘ)

is the transport matrix that tells us how much of P(Xₖ) is transported to each P(Yₘ): summing Tₖₘ over k recovers P(Yₘ).

A typical example of the transport of measure is a parallel shift in 1D:

In our case the distances are in 2D with periodic boundary conditions:

We finally formulate it as an Optimal Transport problem.

Let dₖₘ be the distance between bin k of X and bin m of Y.

Then our transport is optimal when we minimize the moving cost

C = Σₖₘ dₖₘ Tₖₘ

Distances dₖₘ are peculiar: in order to reach point (0, m) from (k, 0) we have to go through (k, m), or through (K − k, m) if that path is shorter, or through (k, M − m) or (K − k, M − m), which is due to the periodic boundary conditions.
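
As an illustration, such a distance matrix could be tabulated as below; the Manhattan-style combination of the two legs and the bin-unit scale are assumptions of this sketch, not necessarily the exact metric of our implementation:

import numpy as np

def periodic_distance(K, M):
    # Moving cost d_km from bin k of X to bin m of Y: on each axis we take
    # the shorter of the direct and the wrapped-around displacement, and
    # the two legs are summed (a Manhattan-style path, in bin units).
    k = np.arange(K)[:, None]          # bin indices of X, as a column
    m = np.arange(M)[None, :]          # bin indices of Y, as a row
    return np.minimum(k, K - k) + np.minimum(m, M - m)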

We then want to minimize the moving cost subject to the marginal constraints:

Σₘ Tₖₘ = P(Xₖ) for every k,   Σₖ Tₖₘ = P(Yₘ) for every m,   Tₖₘ ≥ 0.

We can also impose a correlation constraint

Σₖₘ (Xₖ − μ_X)(Yₘ − μ_Y) Tₖₘ = ρ σ_X σ_Y

if we have reason to believe that X and Y are correlated; here ρ is the postulated correlation coefficient, and μ_X, μ_Y, σ_X, σ_Y are the means and standard deviations of the marginals.

In addition to that, it was shown (Cuturi, 2013) that introducing a regularization term in the form of the entropy of the transport matrix,

H(T) = −Σₖₘ Tₖₘ log Tₖₘ,

weighted by a coefficient γ, improves the shape of P(Y | X). Indeed it makes sense: the resulting distribution will try to have maximum entropy and will look more like a normal distribution.

We can then take our favorite optimization package (e.g. cvxpy) and define the optimization problem (code attached).
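
For illustration, here is a minimal cvxpy sketch of the problem above; the function and variable names (estimate_joint, p_x, p_y, x_vals, y_vals, rho, gamma) are illustrative and not the exact code from our repo:

import cvxpy as cp
import numpy as np

def estimate_joint(p_x, p_y, x_vals, y_vals, d, rho=None, gamma=0.0):
    # Estimate the joint P(X, Y) from the marginals p_x and p_y given the
    # cost matrix d; rho is an optional postulated correlation and gamma
    # the weight of the entropy regularization.
    K, M = len(p_x), len(p_y)
    T = cp.Variable((K, M), nonneg=True)           # transport matrix T_km
    cost = cp.sum(cp.multiply(d, T))               # moving cost sum_km d_km T_km
    if gamma > 0:                                  # entropy regularization
        cost -= gamma * cp.sum(cp.entr(T))         # entr(T) = -T * log(T)
    constraints = [cp.sum(T, axis=1) == p_x,       # summing over Y recovers P(X)
                   cp.sum(T, axis=0) == p_y]       # summing over X recovers P(Y)
    if rho is not None:                            # optional correlation constraint
        mu_x, mu_y = p_x @ x_vals, p_y @ y_vals
        sig_x = np.sqrt(p_x @ (x_vals - mu_x) ** 2)
        sig_y = np.sqrt(p_y @ (y_vals - mu_y) ** 2)
        cov = cp.sum(cp.multiply(np.outer(x_vals - mu_x, y_vals - mu_y), T))
        constraints.append(cov == rho * sig_x * sig_y)
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return T.value

Note that subtracting the (concave) entropy keeps the objective convex, so the whole problem can be handled by cvxpy's generic conic solvers.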

Let’s take a look at an example.

We generate bi-variate normal data with a positive correlation.

import numpy as np
np.random.seed(0)
means = np.array([6, 8])
sigmax = 2.0
sigmay = 3.0
corr = 0.5
covs = np.array([[sigmax**2, corr * sigmax * sigmay],
                 [corr * sigmax * sigmay, sigmay**2]])
ns = 1000
data = np.random.multivariate_normal(means, covs, ns).T  # shape (2, ns)

If we bin the data appropriately, the tabulated joint probability function will look like:
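
For instance (the number of bins here is an arbitrary choice):

counts, x_edges, y_edges = np.histogram2d(data[0], data[1], bins=12)
p_xy_true = counts / counts.sum()     # empirical joint P(X, Y)
p_x = p_xy_true.sum(axis=1)           # marginal P(X)
p_y = p_xy_true.sum(axis=0)           # marginal P(Y)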

Let’s consider a simple multiplication of the marginal distributions:
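
With the marginals tabulated above, this is a single outer product:

p_indep = np.outer(p_x, p_y)          # independent approximation P(X) P(Y)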

It does not resemble the original: it looks symmetric. Indeed, no information about the correlation is stored in the marginals, so we have to appeal to prior knowledge and postulate the value of the correlation.

Once we do this, the correlation is captured. But the shape of the joint is somewhat distorted…

To save the day we turn on the entropy regularization (γ = 1) and verify visually that the resulting joint distribution is much closer to the true joint.
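
Using the hypothetical estimate_joint and periodic_distance sketches from above, the corr + reg estimate would look roughly like this:

x_centers = 0.5 * (x_edges[:-1] + x_edges[1:])     # bin centers serve as the values
y_centers = 0.5 * (y_edges[:-1] + y_edges[1:])     # of X and Y in the constraint
d = periodic_distance(len(p_x), len(p_y))          # cost matrix from the earlier sketch
p_corr_reg = estimate_joint(p_x, p_y, x_centers, y_centers, d,
                            rho=0.5, gamma=1.0)    # postulated correlation, γ = 1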

We also verify that the fully constrained solution is the closest to the original by computing the average L2 distance between the true and the predicted distributions:
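
One possible reading of this metric (our exact definition may differ slightly) is the root-mean-square difference between the tabulated distributions:

def l2_error(p_true, p_est):
    # Average L2 distance between two tabulated joint distributions.
    return np.sqrt(np.mean((p_true - p_est) ** 2))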

We see that the product approximation has a smaller error than the solution with only the correlation constraint. That is because the correlation constraint alone distorts the product and makes it look less Gaussian, whereas the true density function is Gaussian.

That’s why adding the entropy regularization term yields the closest approximation to the original, which is a multivariate Gaussian.

Overall, our approximation strategy relies on assumptions about the underlying distributions and on prior knowledge of the correlation between the covariates. We verify that the corr + reg approximation is satisfactory and departs from the true distribution only due to the intrinsic uncorrelated noise present in the latter.

The technique described above is used in the “Me comparer” section of Hello Watt’s Coach Conso to provide a better estimation of a user’s ranking within their group.

References:

Cuturi, M. (2013). Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. Advances in Neural Information Processing Systems (NIPS) 26.
