1.Probability Plots
Jake Blanchard
Spring 2010
Uncertainty Analysis for Engineers
2.Introduction
Probability plots allow us to assess the degree to which a set of data fits a particular distribution
The idea is to scale the x-axis of a CDF such that the result would be a straight line if the data conforms to the assumed distribution
3.An Example
Suppose we have a set of data, x, that we suspect is normal.
First, we form an empirical cdf
[f,xx]=ecdf(x);
Then we scale the cdf so that each unit on the axis corresponds to 1 standard deviation (ff holds the midpoints of the steps in f, so it never equals 0 or 1)
z=norminv(ff);
Then we plot the sorted data, y=sort(x), against this new axis
figure, plot(y,z,'+')
4.The Script
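n=100;
x=normrnd(10,3,n,1);       % sample from a normal with mean 10, std 3
y=sort(x);                 % sorted data for the horizontal axis
[f,xx]=ecdf(x);            % empirical cdf; f has n+1 entries, starting at 0
ff=zeros(n,1);             % preallocate
for i=1:n
ff(i)=(f(i)+f(i+1))/2;     % midpoint of each cdf step, so 0<ff<1
end
z=norminv(ff);             % cdf expressed as number of standard deviations
figure, plot(y,z,'+')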
6.Matlab has an alternative
probplot('normal',x)
Options are
exponential
extreme value
lognormal
normal
rayleigh
weibull
In the probplot figure, the vertical axis is labeled with the cdf (probability), not the number of standard deviations
8.Look at some lognormal data
[Figure panels: normal probability plot; normal probability plot of log of data; lognormal probability plot]
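A comparison like the one in these panels could be produced along the following lines. This is only a sketch: the lognormal parameters (1 and 0.5), the sample size and the seed are assumptions chosen for illustration, not values from the lecture.

```matlab
rng(1);                        % fix the seed so the sketch is repeatable
x = lognrnd(1, 0.5, 200, 1);   % assumed parameters: log-mean 1, log-std 0.5

% Raw data on normal paper: expect clear curvature
figure, probplot('normal', x)
title('Normal probability plot')

% Log of the data on normal paper: expect a nearly straight line
figure, probplot('normal', log(x))
title('Normal probability plot of log of data')

% Raw data on lognormal paper: equivalent to the previous plot
figure, probplot('lognormal', x)
title('Lognormal probability plot')
```

If the data really are lognormal, the second and third plots should both look straight, since taking logs and switching to lognormal paper are two ways of doing the same scaling.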
9.Now some exponential data
[Figure panels: exponential probability plot; lognormal probability plot]
10.Facts
On normal probability plots, the intercept (the data value where the line crosses zero standard deviations, i.e. the 50% point) is the mean
On exponential paper, the slope is 1/
Results at the extremes are expected to deviate from the straight line more than those in the middle
On the other hand, for some data, multiple distributions will fit in the center, but not in the tails
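The statement about the normal plot can be checked numerically. For normal data the sorted values satisfy approximately y = mean + std*z, so a straight-line fit of y against z should return the mean as the intercept and the standard deviation as the slope. The sketch below assumes synthetic data with mean 10 and standard deviation 3; the variable names are illustrative.

```matlab
rng(2);                          % repeatable synthetic sample
n = 500;
x = normrnd(10, 3, n, 1);        % assumed true mean 10 and std 3
y = sort(x);
p = ((1:n)' - 0.5) / n;          % midpoints of the empirical cdf steps
z = norminv(p);                  % number of standard deviations
c = polyfit(z, y, 1);            % least-squares line y = c(1)*z + c(2)
sigma_hat = c(1);                % slope estimates the standard deviation
mu_hat = c(2);                   % intercept (value at z = 0) estimates the mean
fprintf('mean ~ %.2f, std ~ %.2f\n', mu_hat, sigma_hat)
```

The fit is dominated by the central points, which is consistent with the remark that the extremes are expected to scatter more than the middle.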
11.Results
We are dealing with samples, so our conclusions tend to be one of
The model appears to be adequate
The model is questionable
The model is not adequate
12.Comparison
Take some wind data (maximum measured wind velocity over a given period)
20 data points taken over 20 years
Compare all 6 Matlab probability plots
Compare looking at CDFs
Compare other error measures
15.Goodness of Fit Statistics
For discrete and continuous sampled data distributions
Chi-square statistic
Kolmogorov-Smirnov (K-S) statistic
Anderson-Darling (A-D) statistic
Root Mean Square (RMS) error
The value of these statistics is limited if there are fewer than about 30 data points.
The lower the value, the closer the distribution appears to fit the data. But they do not provide a measure of the probability that the data actually come from the distribution.
16.Chi-square statistic
This is the oldest and most commonly used goodness-of-fit statistic
Data are grouped into frequency cells and compared to the expected number of observations based on the proposed distribution.
Definition
$$\chi^2 = \sum_{i=1}^{k} \frac{[O(i) - E(i)]^2}{E(i)}$$
where O(i) is the observed frequency of the ith histogram bar,
E(i) is the expected frequency from the fitted distribution of x values falling within the x range of the ith histogram bar, and k is the number of bars.
It can be overly sensitive to large errors
17.Chi-Squared Tests
First we divide values into groups; suggestion is
$$k = 4\left[0.75\,(n-1)^2\right]^{1/5}$$
For example, if we have n=500 data points, then this gives us about 45 groups (I would use 50 for convenience)
18.Example (cont.)
Sort data and divide into 50 cells
Find upper and lower bound of values in each cell
Calculate expected number of data points in each cell by subtracting the cdf at the lower bound from the cdf at the upper bound and multiplying by n
19.Example (cont.)
Compute the chi-squared statistic from the observed and expected counts and compare it to the value of the chi-squared distribution for k-np-1 degrees of freedom and a desired confidence level, where k is the number of cells and np is the number of parameters in the model (e.g., 2 for a normal distribution)
20.Example (cont.)
In Matlab, we can get this chi-squared distribution from chi2inv(p,v), where p is the confidence level (0 < p < 1) and v is the number of degrees of freedom
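The chi-squared steps of this example can be put together in a short script. This is a sketch under stated assumptions: the data are synthetic (normal, mean 10, std 3), the fitted model is a normal with parameters estimated from the sample, and the cell boundaries are placed halfway between neighboring cells so that the cells cover the whole axis. That boundary choice is one reasonable reading of "upper and lower bound", not necessarily the one used in the lecture.

```matlab
rng(3);
n = 500;
x = normrnd(10, 3, n, 1);          % synthetic data (assumed)
k = 50;                            % number of cells
np = 2;                            % parameters in the model (mean, std)
mu = mean(x);  sigma = std(x);     % fitted normal model

y = sort(x);
perCell = n / k;                   % 10 sorted points fall in each cell
% interior boundaries: halfway between the last point of one cell and the
% first point of the next; the outer boundaries are -Inf and +Inf
lastPts  = y(perCell:perCell:n-perCell);
firstPts = y(perCell+1:perCell:n-perCell+1);
edges = [-Inf; (lastPts + firstPts)/2; Inf];

O = perCell * ones(k, 1);                  % observed count per cell
E = n * diff(normcdf(edges, mu, sigma));   % expected count per cell
chi2 = sum((O - E).^2 ./ E);               % chi-squared statistic

crit = chi2inv(0.95, k - np - 1);  % 95% point for k-np-1 degrees of freedom
reject = chi2 > crit;              % true means the fit is rejected
```

Because the cells hold equal numbers of points, all of the variation between cells shows up in the expected counts E rather than in the observed counts O.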
21.Kolmogorov-Smirnov Test
Compare measured cumulative frequency with CDF of assumed theoretical distribution
Compare the maximum discrepancy between these two with a critical value of a test statistic and reject fit if former exceeds latter
Good when we don’t have many data points
22.Kolmogorov-Smirnov Statistic
$$D_n = \max_x \left| F_n(x) - F(x) \right|$$
where Dn is the K-S distance,
n is the total number of data points,
F(x) is the distribution function of the fitted distribution,
and Fn(x)=i/n, where i is the cumulative rank of the data point.
K-S is better than χ² because data are assessed at all points, which avoids the problem of choosing the number of bars (bins).
But value determined by the one largest discrepancy
So it takes no account of lack of fit across entire distribution
23.More on K-S statistic
The position of Dn along the x-axis is more likely to occur away from the low probability tails.
This insensitivity to lack of fit at the extremes is corrected for in the Anderson-Darling statistic.
Some statistical literature is critical of distribution fitting software that uses this statistic as a goodness-of-fit test, because the statistic assumes the fitted distribution is fully specified (its parameters are not estimated from the same data) so that the critical region of the curve can be checked.
24.Process
Sort n data points
Make a step-wise cdf
(cdfplot in Matlab)
Fit data to a model to obtain model cdf
Find maximum difference between these two cdf’s over each of the steps in the first cdf
Look up comparison data in tables
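The process above can be sketched directly, without any special toolbox test function. The data here are synthetic and the fitted model is assumed to be normal. The critical value uses the commonly quoted large-sample approximation 1.36/sqrt(n) for the 5% level in place of a table lookup, which is an assumption that is only reasonable for larger n.

```matlab
rng(4);
n = 50;
x = normrnd(10, 3, n, 1);          % synthetic data (assumed)
y = sort(x);                       % step 1: sort the data
mu = mean(x);  sigma = std(x);     % fit the model (normal assumed)
F = normcdf(y, mu, sigma);         % model cdf at each data point

% The empirical cdf jumps at each data point, so compare the model with
% the value just before and just after each step
FnLo = (0:n-1)' / n;               % empirical cdf just before each step
FnHi = (1:n)' / n;                 % empirical cdf just after each step
Dn = max(max(abs(FnHi - F)), max(abs(F - FnLo)));

Dcrit = 1.36 / sqrt(n);            % approximate 5% critical value, large n
reject = Dn > Dcrit;               % true means the fit is rejected
```

Checking both sides of every step matters: the largest gap between a staircase and a smooth curve can occur at the bottom of a step as easily as at the top.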
25.KS Critical Values
26.What is the Significance Level?
Our hypothesis is that the fit is a good fit.
If the maximum difference between the cdf’s exceeds the critical value of the test statistic, we reject the hypothesis
There are two possibilities:
Fit really is bad, or
We are rejecting a good fit
Significance level is probability that we are rejecting a good fit
27.Fit Tests in Matlab
chi2gof(x) – tests against a normal distribution by default
kstest(x,CDF)
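A minimal sketch of calling both tests is shown below. It assumes a recent Statistics Toolbox release in which kstest accepts the hypothesized distribution through the 'CDF' name-value argument and in which makedist is available; the data are synthetic.

```matlab
rng(5);
x = normrnd(10, 3, 100, 1);        % synthetic data (assumed)

% Chi-squared test: by default the hypothesized distribution is a normal
% with mean and standard deviation estimated from x
[hChi, pChi] = chi2gof(x);

% K-S test against a fitted normal, passed as a distribution object
pd = makedist('Normal', 'mu', mean(x), 'sigma', std(x));
[hKS, pKS, Dn] = kstest(x, 'CDF', pd);

% h = 0 means the fit is not rejected at the default 5% level
fprintf('chi2gof: h=%d  kstest: h=%d, Dn=%.3f\n', hChi, hKS, Dn)
```

Both functions return h = 1 when the hypothesis is rejected, so "passing" a test corresponds to h = 0.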
30.Anderson-Darling Statistic
This is a more sophisticated and complex version of the K-S statistic.
It is more powerful because:
The f(x) weights the observed distances by the probability that the value will be generated at that x value.
This helps focus the difference measure more equitably.
The vertical distances are integrated over ALL values of x rather than just looking at the maximum.
This makes maximum use of the observed data
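For reference, one common way of writing the A-D statistic that matches this description is given below. It reuses Fn(x) and F(x) from the K-S slide and takes f(x) to be the density of the fitted distribution; the exact normalization used by a particular software package may differ, so treat this as an assumed form.

```latex
A_n^2 = n \int_{-\infty}^{\infty}
        \frac{\left[ F_n(x) - F(x) \right]^2}{F(x)\,\left[ 1 - F(x) \right]}
        \, f(x) \, dx
```

The factor 1/(F(x)[1-F(x)]) grows in both tails, which is how the statistic gives more attention to the extremes than K-S does.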
31.Root Mean Square Error
RMS error is available as a test statistic in BESTFIT for expert data that is sampled using percentiles.
Measures the area between the distribution fit and the data.
The smaller the better. Does not provide fine distinction.
32.Back to Example - Tests
At the 5% significance level, all distributions pass the K-S test except exponential and Rayleigh
Same holds at 2% level
34.A Second Example
A new controller was installed on 96 diesel locomotives
The mileage at failure for each was recorded
37 failed at less than 135,000 miles
All we know about the others is that each lasted beyond 135k miles
Thus, we have to “censor” the data
We assume we have 96 data points, but only plot 37
This is important when we compute CDF
Goal is 80,000 mile warranty
35.Censoring in Matlab
Tell Matlab which data points are censored and how many of each there are
We’ll use a lognormal plot in the example because failure mechanism indicates this is appropriate
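A sketch of how censored data might be passed to Matlab follows. The lifetimes are synthetic, drawn from a lognormal with assumed parameters 5.1 and 0.7 in units of thousands of miles; they are not the locomotive data from the lecture. The censoring vector marks which entries are censored and the frequency vector says how many units each entry represents.

```matlab
rng(6);
limit = 135;                              % observation stopped at 135k miles
tAll = lognrnd(5.1, 0.7, 96, 1);          % synthetic lifetimes (assumed)
tFail = sort(tAll(tAll <= limit));        % failures observed before the limit
nFail = numel(tFail);
nCens = 96 - nFail;                       % units still running at the limit

t    = [tFail; limit];                    % one extra entry for the survivors
cens = [false(nFail, 1); true];           % true marks the censored entry
freq = [ones(nFail, 1); nCens];           % the censored entry counts nCens units

figure, probplot('lognormal', t, cens, freq)

parm = lognfit(t, 0.05, cens, freq);      % censored maximum-likelihood fit
pFail80 = logncdf(80, parm(1), parm(2));  % estimated P(failure before 80k miles)
```

Leaving out the censored entry would treat the observed failures as the whole population and would overstate the failure probability at any given mileage.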
36.Uncensored “Normal” probability plot
37.Un-Censored “Lognormal” Plot
38.Censored “Lognormal” Plot
39.Conclusions
The lognormal distribution looks like a good fit
Probability of failure before the 80,000 mile warranty goal is approximately 15%