How to interpret scipy.stats.probplot results?
I searched for hours for an answer to this question, and it turns out the answer is in the SciPy and statsmodels code comments.
In SciPy, the comment at https://github.com/scipy/scipy/blob/abdab61d65dda1591f9d742230f0d1459fd7c0fa/scipy/stats/morestats.py#L523 says:
probplot generates a probability plot, which should not be confused with a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.
So now let's look at statsmodels, where the comment at https://github.com/statsmodels/statsmodels/blob/66fc298c51dc323ce8ab8564b07b1b3797108dad/statsmodels/graphics/gofplots.py#L58 says:
ppplot : Probability-Probability plot. Compares the sample and theoretical probabilities (percentiles).
qqplot : Quantile-Quantile plot. Compares the sample and theoretical quantiles.
probplot : Probability plot. Same as a Q-Q plot, however probabilities are shown in the scale of the theoretical distribution (x-axis) and the y-axis contains unscaled quantiles of the sample data.
So the difference between a Q-Q plot and a probability plot, in these modules, comes down to the axis scales.
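To make the return values of scipy.stats.probplot concrete, here is a minimal sketch. The sample is synthetic (a normal distribution with mean 10 and standard deviation 2, chosen only for illustration), and plotting is left out so that only the numbers are shown.

```python
import numpy as np
from scipy import stats

# Synthetic data, made up for illustration: normal with mean 10, std 2.
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=500)

# With the default fit=True, probplot returns two tuples.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# osm: theoretical quantiles of the chosen distribution (x-axis)
# osr: the sample values, sorted (y-axis)
# slope, intercept: least-squares line fitted through the (osm, osr) points
# r: correlation coefficient of that fit; values near 1 mean the points
#    lie close to a straight line
print("slope     :", slope)
print("intercept :", intercept)
print("r         :", r)
```

For a sample that really is normal, the slope should land near the sample's standard deviation and the intercept near its mean, so here roughly 2 and 10. A value of r well below 1, or points that bend away from the line at the ends, suggests the chosen distribution is a poor match.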
The theoretical probability of an event occurring is an "expected" probability based upon knowledge of the situation. It is the ratio of the number of favorable outcomes to the number of possible outcomes.
When you gather data from observations during an experiment, you will be calculating an empirical (or experimental) probability.
Example: You tossed a coin and you got a head.
- Experimental Probability(head)=1
- Theoretical Probability(head)=0.5
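The single-toss example can be extended with a small simulation. This is only a sketch: the helper name empirical_head_probability is hypothetical, and the coin is assumed to be fair.

```python
import numpy as np

def empirical_head_probability(n_tosses, rng):
    """Fraction of heads in n_tosses simulated tosses of a fair coin."""
    tosses = rng.integers(0, 2, size=n_tosses)  # 1 = head, 0 = tail
    return tosses.mean()

rng = np.random.default_rng(42)
theoretical_p = 0.5  # assumption: fair coin

for n in (1, 10, 100, 10_000):
    p = empirical_head_probability(n, rng)
    print(f"n={n:>6}  empirical={p:.4f}  theoretical={theoretical_p}")
```

With one toss the empirical probability can only be 0 or 1, as in the example above. As the number of tosses grows it tends towards the theoretical 0.5.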
For simplicity, see the diagram below, which shows the probability of getting a particular bill amount. Both the P-P and Q-Q plots are shown.
ppplot (Probability-Probability plot)
- Compares the sample and theoretical probabilities (percentiles).
qqplot (Quantile-Quantile plot)
- Compares the sample and theoretical quantiles
probplot (Probability plot)
- Same as a Q-Q plot, however probabilities are shown in the scale of the theoretical distribution (x-axis) and the y-axis contains unscaled quantiles of the sample data.
The differences between ppplot, qqplot and probplot are related to the scales. All three show sample and theoretical values on the x and y axes.
Percentile plots
- Percentile plots are the simplest plots. You simply plot the data against their plotting positions. The plotting positions are shown on a linear scale, but the data can be scaled as appropriate.
Quantile plots
- Quantile plots are similar to probability plots. The main difference is that plotting positions are converted into quantiles or Z-scores based on a probability distribution.
The default distribution is the standard-normal distribution. You’ll notice that the shape of the data is straighter on the Q-Q plot than the P-P plot. This is due to the transformation that takes place when converting the plotting positions to a distribution’s quantiles.
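The conversion from plotting positions to quantiles can be sketched as follows. The data values are made up, and the plotting-position formula (i - 0.5) / n is an assumption: it is one common convention, and libraries may use a different one.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration, sorted in ascending order.
data = np.sort(np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7]))
n = data.size

# Plotting positions on a linear 0..1 scale (assumed formula: (i - 0.5) / n).
positions = (np.arange(1, n + 1) - 0.5) / n

# Percentile plot: data against positions.
# Quantile plot: data against the positions converted to standard-normal
# quantiles (Z-scores) using the inverse CDF.
z_scores = stats.norm.ppf(positions)

for value, p, z in zip(data, positions, z_scores):
    print(f"value={value:.1f}  position={p:.3f}  z={z:+.3f}")
```

The Z-scores stretch the tails relative to the linear positions, which is why normally distributed data looks straighter on the quantile scale.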
Best-fit lines
- Adding a best-fit line to a probability plot can provide insight as to whether or not a dataset can be characterized by a distribution.
In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
Probability density of a normal distribution, with quartiles shown. The area below the red curve is the same in the intervals (−∞,Q1), (Q1,Q2), (Q2,Q3), and (Q3,+∞).
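As a quick check of the quartile picture, the sketch below computes the quartiles of a standard normal distribution (an assumption, since the figure only says "a normal distribution") and the probability mass in each of the four intervals.

```python
import numpy as np
from scipy import stats

# Quartiles Q1, Q2, Q3 of the standard normal via the inverse CDF.
quartiles = stats.norm.ppf([0.25, 0.5, 0.75])

# Probability mass in (-inf, Q1), (Q1, Q2), (Q2, Q3), (Q3, +inf).
edges = np.concatenate(([0.0], stats.norm.cdf(quartiles), [1.0]))
areas = np.diff(edges)

print("quartiles:", quartiles)
print("areas    :", areas)
```

Each of the four areas comes out as 0.25, which is what "intervals with equal probabilities" means.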
In statistics, a Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.
If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line y = x.
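The linearly related case can be illustrated with two synthetic samples, where the second is a shifted and rescaled copy of the first. The factor 3 and offset 5 are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)   # synthetic sample, made up for illustration
y = 3.0 * x + 5.0           # same shape, different scale and location

# Evaluate both samples at the same probabilities.
probs = np.linspace(0.01, 0.99, 99)
qx = np.quantile(x, probs)
qy = np.quantile(y, probs)

# Straight-line fit through the Q-Q points.
slope, intercept = np.polyfit(qx, qy, 1)
print("slope     :", slope)
print("intercept :", intercept)
```

The Q-Q points fall on a straight line whose slope reflects the change in scale and whose intercept reflects the change in location, so the line is straight but it is not y = x.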
A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions.
A P–P plot is a probability plot that plots two cumulative distribution functions (cdfs) against each other in order to assess how closely two data sets agree. P–P plots are widely used to evaluate the skewness of a distribution.
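A minimal sketch of the points behind a P-P plot, comparing a sample against a fitted normal distribution. The sample is synthetic, and the empirical CDF convention (i - 0.5) / n is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = np.sort(rng.normal(loc=50.0, scale=5.0, size=1000))  # made-up data
n = sample.size

# Empirical cumulative probabilities of the sorted sample.
empirical_cdf = (np.arange(1, n + 1) - 0.5) / n

# Theoretical cumulative probabilities under a normal distribution
# whose parameters are estimated from the sample itself.
theoretical_cdf = stats.norm.cdf(
    sample, loc=sample.mean(), scale=sample.std(ddof=1)
)

# On a P-P plot these pairs are the points; both axes run from 0 to 1.
max_gap = np.max(np.abs(empirical_cdf - theoretical_cdf))
print("largest vertical gap from the diagonal:", max_gap)
```

If the sample follows the theoretical distribution, the pairs lie close to the diagonal from (0, 0) to (1, 1) and the largest gap stays small.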