Estimating entropy from a set of measurements
As correctly stated in the answer by msm, the solution to this interesting problem might be considerably easier if we could deal with a large number of samples. Regardless of whether we use an empirical distribution function or a distribution obtained directly from the raw data, when we have a large number of samples and a pdf can be defined, we can readily calculate entropy using the standard formula for Shannon entropy.
However, there are two major issues to be considered in this question. The first is that the problem seems to clearly ask for an analysis of entropy on a single, relatively small set of observations drawn from a much larger set of possible outcomes (in this regard, knowing the range within which the numbers are generated could be useful). So we are working in an "undersampled" regime. On the other hand, the conventional Shannon entropy is a measure that is suitable for clearly defined probability distributions. Although sometimes we can make assumptions on the underlying distribution to link our sample dataset to some entropy measure, estimating entropy from a single undersampled set of observations is not easy. In practice, we have $k$ observations from an unknown discrete distribution over $N$ different possible outcomes, defined by a probability vector $p=(p_1,p_2,\dots,p_N)$, with $p_i \geq 0$ and $\sum_i p_i=1$. Because in most cases the probability vector is unknown, the classical Shannon entropy $$H(p)=-\sum_{i=1}^{N} p_i \log p_i$$ cannot be used directly, and we have to obtain an estimate of $H(p)$ from our dataset of size $k$.
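Just for reference, when the probability vector $p$ is actually known, the formula above is immediate to apply; a minimal sketch in Python (entropy in nats, with the usual convention $0 \log 0 = 0$; the function name is mine):

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i in nats, for a known probability vector p;
    zero-probability outcomes are dropped (0 log 0 = 0 by convention)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -float(np.sum(p * np.log(p)))
```

The whole difficulty of the question is precisely that we do not have $p$, only a small sample.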
This is why the typical approach to entropy in undersampled sets of observations is based on entropy estimators. These are surrogate measures of entropy that aim to overcome, at least in part, the drawbacks arising from the small size of the dataset. For example, a very basic (and rarely used) estimator is the so-called Naive Plugin (NP) estimator, which uses the frequency estimates of the discrete probabilities to calculate the following surrogate of entropy:
$$\hat{H}(p)=-\sum_{i=1}^{N} \hat{p}_i \log \hat{p}_i$$
where $\hat{p}_i$ is the maximum likelihood estimate of each probability $p_i$, calculated as the ratio between the frequency of the outcome $i$ (i.e. the histogram of the outcomes) and the total number of observations $k$; outcomes that never appear in the sample simply contribute nothing to the sum. It can be shown that such an estimator systematically underestimates $H(p)$.
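As a concrete reference point, a minimal implementation of the NP estimator might look like this (names are mine; entropy is returned in nats):

```python
import numpy as np
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in (NP) estimate in nats: observed frequencies are used
    as maximum-likelihood estimates of the p_i; unobserved outcomes drop out."""
    k = len(samples)
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p_hat = counts / k                      # ML estimates of the probabilities
    return -float(np.sum(p_hat * np.log(p_hat)))
```

With, say, 50 draws from 1000 equally likely values this can never exceed $\log 50 \approx 3.9$, far below the true $\log 1000 \approx 6.9$, which illustrates the underestimation mentioned above.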
A number of other estimators have been proposed to improve the performance of the NP estimator $\hat{H}(p)$. For instance, a rather old approach is the Miller adjustment, in which a slight increase in the accuracy of the NP estimator is obtained by adding to $\hat{H}(p)$ a constant offset equal to $(N-1)/(2k)$ (in practice $N$ is often replaced by the number of distinct outcomes actually observed). Clearly this correction is still rough, because it only takes into account the sample size and the number of outcomes, not how the observations are distributed among them. A more robust modification of the NP estimator can be obtained using the classical jackknife resampling approach, commonly used to assess the bias and variance of many types of estimators. The jackknife-corrected version of the NP estimator for a dataset of $k$ observations is
$$\hat {H}_{J}(p)= k \hat {H}(p) - (k-1) \tilde {H}(p) $$
where $\tilde{H}(p)$ is the average of the $k$ NP estimates obtained by excluding one observation at a time. Other, more complex, robust variants of the NP estimator can be obtained using procedures based on analytic continuation. You can find additional details on this issue here.
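Both corrections are straightforward to sketch on top of the plug-in estimate; `_plugin` below is just the NP estimator again, the Miller version uses the $(N-1)/(2k)$ offset discussed above, and `jackknife_entropy` implements the formula just given (function names are mine):

```python
import numpy as np
from collections import Counter

def _plugin(samples):
    # NP estimate (frequencies as ML probability estimates), as above
    c = np.array(list(Counter(samples).values()), dtype=float)
    p = c / c.sum()
    return -float(np.sum(p * np.log(p)))

def miller_entropy(samples, N):
    # NP estimate plus the Miller offset (N - 1) / (2k)
    return _plugin(samples) + (N - 1) / (2.0 * len(samples))

def jackknife_entropy(samples):
    # H_J = k * H_NP(full sample) - (k - 1) * mean of the k leave-one-out NP estimates
    samples = list(samples)
    k = len(samples)
    h_full = _plugin(samples)
    h_loo = [_plugin(samples[:i] + samples[i + 1:]) for i in range(k)]
    return k * h_full - (k - 1) * float(np.mean(h_loo))
```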
Recently, a number of other estimators based on different arguments have been proposed. Among these, the most commonly used for discrete distributions are the Nemenman-Shafee-Bialek (NSB), the Centered Dirichlet Mixture, the Pitman-Yor mixture, and the Dirichlet process mixture. These are Bayesian estimators, which therefore hinge on explicitly defined probabilistic assumptions. Similarly, non-Bayesian measures have been suggested, such as the Coverage-Adjusted estimator, the Best Upper Bound, or the James-Stein estimator. It should be highlighted that there is no unbiased estimator in this context, and that the convergence rates of different estimators can vary considerably, in some cases being arbitrarily slow. However, for the specific question of the OP, which is based on a discrete distribution with finite range, a reasonable choice could be the NSB estimator, which uses an approximately flat prior distribution over the values of the entropy, built as a mixture of symmetric Dirichlet distributions. This estimator shows rapid convergence to the entropy and good performance in terms of robustness and bias. You can find more details on the underlying theory here. Very useful online applications and tools for the calculation of NSB entropy can be found here.
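To make the NSB recipe a bit more concrete, here is a rough numerical sketch (not a substitute for the dedicated implementations and tools linked above): it averages the Dirichlet posterior-mean entropy over the concentration parameter $\beta$, weighting by the evidence and by $d\xi/d\beta$, where $\xi(\beta)$ is the prior mean entropy, so that the prior is approximately flat in the entropy itself. Function names and the integration grid are my own choices, and `counts` is assumed to be a length-$N$ vector that includes zero counts for unobserved outcomes.

```python
import numpy as np
from scipy.special import gammaln, psi, polygamma
from scipy.integrate import trapezoid

def dirichlet_mean_entropy(counts, beta):
    """Posterior mean of H (in nats) under a symmetric Dirichlet(beta) prior."""
    alpha = counts + beta
    a0 = alpha.sum()
    return psi(a0 + 1.0) - np.sum((alpha / a0) * psi(alpha + 1.0))

def log_evidence(counts, beta, N):
    """log P(counts | beta), up to a beta-independent constant."""
    k = counts.sum()
    return (gammaln(N * beta) - gammaln(k + N * beta)
            + np.sum(gammaln(counts + beta)) - N * gammaln(beta))

def nsb_entropy(counts, N, n_grid=400):
    """Crude NSB sketch: average the posterior-mean entropy over beta, with
    weights = evidence * d(xi)/d(beta), xi(beta) = psi(N*beta+1) - psi(beta+1)."""
    counts = np.asarray(counts, dtype=float)
    assert len(counts) == N          # must include zero counts for unseen outcomes
    betas = np.logspace(-4, 4, n_grid)                 # integration grid over beta
    xi_prime = N * polygamma(1, N * betas + 1.0) - polygamma(1, betas + 1.0)
    log_w = np.array([log_evidence(counts, b, N) for b in betas])
    log_w -= log_w.max()                               # numerical stability
    w = np.exp(log_w) * xi_prime                       # evidence x prior weight
    h = np.array([dirichlet_mean_entropy(counts, b) for b in betas])
    return trapezoid(w * h, betas) / trapezoid(w, betas)
```

For serious use, the published NSB implementations handle the integration limits and numerical issues much more carefully than this sketch.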
The second issue in this question is that the problem - if I understood correctly - seems to be focused on the amount of entropy related to each single observation, rather than on the entropy of the whole dataset. While the contribution of each observation is easy to determine in conventional Shannon entropy calculations, this is more challenging for other estimators. A typical approach to simplify this problem, commonly used in many other statistical fields, is to calculate the entropy estimator on the dataset with the observation of interest removed, and then compare it with the entropy estimator on the full dataset. The difference can be used as a measure of the entropy contribution of that specific observation. Applying such an approach with the NSB estimator, or alternatively with a relatively robust NP-related estimator (e.g., the jackknife-corrected one), might be a good choice to answer the specific question reported in the OP.
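A minimal sketch of this leave-one-out comparison, assuming one of the estimator functions sketched above (e.g. `jackknife_entropy`) is available as a callable:

```python
def entropy_contribution(samples, index, estimator):
    """Contribution of observation `index`: estimator on the full dataset
    minus estimator on the dataset with that observation removed."""
    samples = list(samples)
    reduced = samples[:index] + samples[index + 1:]
    return estimator(samples) - estimator(reduced)

# e.g. entropy_contribution(data, 7, jackknife_entropy), reusing the sketch above
```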