RK_Excel said:
I used the following formula to find the median for a range
of numbers. I assume the variable p in the formula below to be 0.5.
=PERCENTILE(A1:A7,PERCENTRANK(PROB(A1:A7,B1:B7/SUM(B1:B7),,A1:A7),p,20)) [....]
Download: http://www.excelbanter.com/attachment.php?attachmentid=663
The short answer: this is a problem in PROB. It is arguable whether or not
to call it a defect (although I think it is avoidable, at least by Microsoft).
And I do not think __we__ can do anything to avoid the problem reliably
because it arises from a sum that is computed internally.
Ergo, this paradigm is not a reliable way to compute the median of grouped
data.
There are steps that we can take that might mitigate the problem with
specific sets of data. See the details below.
But first....
-----
You have made some mistakes. But ironically, they help to identify the root
cause of the problem.
First, the formula in G3 is not array-entered (you pressed just Enter
instead of ctrl+shift+Enter). It should be, just as the formula in G8 is.
Ironically, that is unrelated to the #NUM error in G3. But if PROB had not
found the conditions that triggered the #NUM error, PERCENTRANK would have
returned a #N/A error.
Second, the debug formulas in D3:D198 are incorrect. That is, they are a
misinterpretation of the array-entered formula in G3.
The correct interpretation is to normally-enter the following formula into
E3 (just press Enter), then copy down through E197:
=PROB($A$3:$A$197,$B$3:$B$197/SUM($B$3:$B$197),,A3)
Note: I changed the range from rows 3:198 to rows 3:197 to avoid the PROB
defect, just for discussion purposes.
Alternatively, you can select E3:E197 and array-enter the following formula
(press ctrl+shift+Enter instead of just Enter):
=PROB(A3:A197,B3:B197/SUM(B3:B197),,A3:A197)
This has the same effect as the normally-entered formulas. But it is easier
to create directly from the original formula in G3.
I could explain the difference in interpretation, if you wish. But it is
not germane to the central issue, namely: why does PROB return #NUM errors?
-----
Now select E3:E198 and array-enter the same formula (press ctrl+shift+Enter
instead of just Enter), extending the ranges to row 198, to wit:
=PROB(A3:A198,B3:B198/SUM(B3:B198),,A3:A198)
Note that PROB returns an array of #NUM errors.
Why? Because PROB has determined that the sum of the probabilities, namely
SUMPRODUCT(B3:B198/SUM(B3:B198)), is not 1.
And indeed it is not. Note that =SUMPRODUCT(B3:B198/SUM(B3:B198))-1-0
returns about -1.22E-15, indicating that SUMPRODUCT(B3:B198/SUM(B3:B198)) is
infinitesimally less than 1.
(Note: The "redundant" -0 is needed to ensure that Excel does not
arbitrarily "correct" the true arithmetic result to be exactly zero, based
on the dubious heuristic poorly described under the misleading title
"Example When a Value Reaches Zero" at
http://support.microsoft.com/kb/78113.)
The inaccuracy is a normal by-product of the native binary computer
arithmetic, which Excel relies on. Such infinitesimal differences are very
common. Usually they simply go unnoticed.
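To see how easily such differences arise, here is a quick Python sketch;
Python uses the same IEEE 754 double-precision binary arithmetic that Excel
does:

  # Ten weights of 0.1 "should" sum to exactly 1, but in binary
  # double precision, they do not.
  s = 0.0
  for _ in range(10):
      s += 0.1
  print(s == 1.0)   # False
  print(s - 1.0)    # about -1.11E-16

The same thing happens, on a larger scale, when Excel sums the 196 quotients
of B3:B198/SUM(B3:B198).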
Given that they are "very common", the real mystery is: why does PROB raise
a #NUM error for some infinitesimal differences, but not for others?
I believe your incorrect formulas in D3:D198 provide a hint. Note that some
of the formulas in D3:D198 return a #NUM error.
Copying the text of the second PROB parameter in D3: if we normally-enter
the formula =SUMPRODUCT($B$3:B3/SUM($B$3:B3))-1-0 into F3 and copy down
through F198, some analysis reveals the following observations:
1. Only 32 differences are exactly zero.
2. Of the remaining 164 differences, the min difference is about -2.33E-15,
and the max difference is about 2.66E-15.
3. But there are only 46 #NUM errors in D3:D198.
4. Among the rows with #NUM errors, the min difference is about -2.33E-15,
and the max difference is about -5.55E-16 (closer to zero).
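(As an aside, that kind of analysis can be replicated outside Excel. The
following Python sketch mimics the F-column formulas; the weights are
hypothetical stand-ins, not RK_Excel's actual data:

  import random
  random.seed(1)
  # 196 hypothetical weights standing in for B3:B198
  w = [round(random.uniform(1, 100), 2) for _ in range(196)]
  diffs = []
  for k in range(1, len(w) + 1):
      s = sum(w[:k])                       # like SUM($B$3:Bk)
      diffs.append(sum(x / s for x in w[:k]) - 1.0)
  print(sum(d == 0.0 for d in diffs))      # count of exactly-zero differences
  print(min(diffs), max(diffs))            # range of the differences

The counts will differ from RK_Excel's, of course; the point is the
distribution of infinitesimal differences.)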
Ergo, PROB does tolerate some infinitesimal differences from 1 for the sum
of the probabilities, as it really should, IMHO.
But it draws the line at "large" infinitesimal differences.
What is "large"?
I speculate that it is any infinitesimal difference that causes the result
to be something other than 1 when rounded to 15 significant digits.
Specifically: -5*2^-53 (about -5.55E-16) or smaller (further from zero);
and about 23*2^-52 (about 5.11E-15) or larger.
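If you want to test that speculation, the following Python sketch expresses
the presumed tolerance test. To be clear: the round-to-15-significant-digits
rule and the function name looks_like_one are my conjecture and invention,
not documented Excel behavior.

  def looks_like_one(s):
      # Round to 15 significant digits, the precision that Excel displays
      return float("%.15g" % s) == 1.0

  print(looks_like_one(1 - 4*2**-53))    # True:  tolerated, no #NUM
  print(looks_like_one(1 - 5*2**-53))    # False: would trigger #NUM
  print(looks_like_one(1 + 22*2**-52))   # True:  tolerated, no #NUM
  print(looks_like_one(1 + 23*2**-52))   # False: would trigger #NUM

Note that the True/False boundary falls exactly at the thresholds above.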
-----
I believe that explains the anomalous #NUM errors from PROB. The question
is: can we avoid them?
In general, I do not believe there is any reliable way to avoid them, since
the sum is computed internally.
But the following steps, together or separately, might remedy #NUM errors
for __specific__ sets of data.
1. Reduce the number of line items (rows). For example, in RK_Excel's data,
several rows of data could be combined because the prices are identical.
This would be even more true if the prices were rounded to the cent, as I
believe they should be.
The fewer the number of items summed, the smaller the accumulated
infinitesimal differences.
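A rough Python illustration of that point; the weights are hypothetical
random data, and the exact errors will vary:

  import random
  random.seed(0)
  # The rounding error of the normalized sum tends to grow with the
  # number of terms summed.
  for n in (10, 100, 1000, 10000):
      w = [random.uniform(1, 100) for _ in range(n)]
      s = sum(w)
      print(n, sum(x / s for x in w) - 1.0)

The growth is statistical, not guaranteed. But with fewer terms, the odds
of staying within PROB's tolerance improve.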
2. Instead of entering the array expression B3:B198/SUM(B3:B198) as a
parameter of the PROB function, normally-enter the following formula (just
press Enter) into J4 and copy down through J198:
=B4/SUM($B$3:$B$198)
Then normally-enter the following formula into J3 (just press Enter):
=1-SUMPRODUCT(B4:B198/SUM($B$3:$B$198))-0
Then array-enter the following formula in G3 (press ctrl+shift+Enter, not
just Enter):
=PERCENTILE(A3:A198,PERCENTRANK(PROB(A3:A198,J3:J198,,A3:A198),0.5,20))
Caveat: Although I suspect this will work 99.9999%
of the time, it
might not always work because of the specific way that Excel might compute
the sum within the PROB function.
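For the curious, here is the same residual-absorbing trick expressed in
Python; the weights are hypothetical, and as the caveat says, the order of
summation inside PROB is beyond our control:

  # Normalize all but the first weight, then define the first probability
  # as 1 minus the sum of the others, absorbing the rounding residue.
  w = [12.5, 3.75, 8.1, 44.0, 31.65]   # hypothetical weights
  s = sum(w)
  p = [x / s for x in w[1:]]           # like J4:J198
  p0 = 1.0 - sum(p)                    # like J3
  print(p0 + sum(p) == 1.0)            # True here, but only because the
                                       # sum is computed in this order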
3. I might add that the following "common sense" alternative probably will
__not__ remedy the problem reliably. In fact, it might exacerbate it.
(Indeed, it does not remedy the problem with RK_Excel's data.)
That would be: replace the array expression B3:B198/SUM(B3:B198) with the
array
expression --TEXT(B3:B198/SUM(B3:B198),"0.00000000000000E+0"), which rounds
each probability to 15 significant digits.
Although it might seem to remedy the problem with some data sets, that would
be a coincidence.
There still might be too large a cumulative infinitesimal difference in the
sum due to the fact that Intel-compatible CPUs actually do arithmetic with
larger precision.
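A quick Python sketch of the basic issue: rounding each term to 15
significant digits can leave every term's binary value unchanged, so the
sum can be just as inexact as before.

  # float("%.15g" % x) rounds x to 15 significant digits and converts back
  p = [float("%.15g" % (1 / 10)) for _ in range(10)]
  print(p[0] == 0.1)     # True: rounding did not change the binary value
  print(sum(p) == 1.0)   # False: the sum is still not exactly 1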