J
joeu2004
Jeff said:Subject: how do I identify statistical outliers with excel?
You start by putting your inquiry in the __body__ of your posting, not
just in the subject line. Some news readers -- notably
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx --
do not show the entire subject line, and they do not permit posting
responses when the body of the posting is empty.
I will offer some answers to your question if you promise not to use
this information to blindly and/or automatically exclude outliers from
your data. There are many reasons for deprecating automatic exclusion;
some people even deprecate excluding outliers manually. The primary
reason for excluding outliers is if you believe they represent data
entry errors. But some people argue that even that is not a good
reason, pointing to some unusual historical anecdotes.
Two numerical methods are commonly used to identify outliers: (1)
relying on the interquartile range (IQR); and (2) relying on the
standard deviation (SD). I prefer #1; #2 seems valid only when the
distribution is known to be "normal". In either case, it is best to
graph the data to confirm that the data are truly outliers in a
subjective sense. Often, "outliers" are merely indicative of extreme
skew in the distribution.
One way to graph the data in Excel is to select the data, then click on
the Chart Wizard on the toolbar and select the XY (Scatter) chart type.
Some writers suggest using histograms for this purpose. I disagree
because the shape of a histogram is greatly affected by the width of
the bars with respect to the data distribution, which is an art and
error-prone.
One definition of an outlier is that it is outside"x" times the
interquartile range from first or third quartile. "X" is typically 1.5
(mild) or 3 (extreme). I prefer 3 because it is more conservative. In
Excel, the following compute the first and third quartiles in A1 and A2
and the interquartile range (IQR) in B1 for the data in D1100, for
example:
A1: =QUARTILE(D1100,1)
A2: =QUARTILE(D1100,3)
B1: =A2 - A1
Then the following determines if D1 is an "extreme" outlier:
=if(or(D1 < A1 - 3*B1, D1 > A2 + 3*B1), "outlier?", "")
Another definition of an outlier is that it is outside "x" standard
deviations (SD) from the average. "X" is typically 3 or 4. I would
prefer 4 because it is more consistent with the definition of "extreme
outlier" above. In Excel, if A1 contains =AVERAGE(D1100) and B1
contains =STDEV(D1100), for example, the following determines if D1
is an outlier:
=if(or(D1 < A1 - 4*B1, D1 > A1 + 4*B1), "outlier?", "")