t-distribution puzzle in Excel

M

mmoroz

I tried a couple of standard ways to generate random numbers that
follow Student's t-distribution. Something is very wrong, and I don't
know what.

Start with a simple example. In column A, I put 1000 uniformly
distributed on (0,1) numbers, using RAND(). Then I apply to them the
function TINV. E.g., in cell B1, I have

=IF(A1<0.5,-TINV(2*A1,df),TINV(2*(1-A1),df))

where df is my degrees of freedom. I fixed df = 5.

I obtain a column of (presumably) t-distributed numbers. I then check
their excess kurtosis (as calculated by the KURT() function). As I was
repeating this exercise using new random numbers, I found the sample
kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The
theoretical value for the excess kurtosis of t distribution is 6 / (df
- 4) = 6, and the theoretical stdev of the sample kurtosis itself is
sqrt(24/1000) ~ 0.15. So I am puzzled as to why the excess kurtosis is
so incredibly far off. Also puzzled by why it's so unstable.

I then replaced Excel's rand() worksheet function with the
Mersenne-Twister implementation in NtRand
http://www.numtech.com/NtRand/.

Similar result (the sample excess kurtosis jumps wildly but still very
far below the theoretical value of 6).

I then added another column of random uniform (0,1) numbers (in column
C), and tried using instead the formula:

=NORMSINV(A1)/SQRT(df/GAMMAINV(C1,df/2,2))

Similar result.

I tried usign 60,000 instead of 1000 numbers. Now the theoretical stdev
of the sample kurtosis is just 0.02.

Similar result.

I checked the variance and skewness, and those match their values ( df
/ df - 2 = 1.667 and 0 respectively) very closely.

I checked the normal distribution, and its kurtosis matches the
theoretical perfectly.

I know I'm doing something wrong, but what? =(

Thanks....
 
J

Jerry W. Lewis

Excel's DIST functions are not that great in the tails. Excel's INV
functions (prior to 2003) are not that great at inverting the DIST
functions. The functions are more than adequate for simply hypothesis
testing, but you have double whammy in using this approach for
generating random numbers. Switch to inv_tdist in
http://members.aol.com/iandjmsmith/examples.xls
If you still have problems, then I will take a closer look.

For small df, there might be numerical issues due to the fact that there
is much less precision for representing small departures from 1 vs.
small departures from 0. If this is your problem, you might want to
only use p-values <0.5 with an auxiliary Bernoulli rv to decide whether
to use it for the upper or lower tail.

Jerry
 
M

mmoroz

Thanks Jerry! Btw, your function is a LOT faster than Excel's tinv!

But the results are still the same.

I'm beginning to wonder.. maybe the distribution of the sample kurtosis
is so *terribly* skewed, even for sample of size 60000, that it looks
to me like I'm getting a biased result? And maybe the variance is so
terribly huge too (I only know the theoretical variance of the sample
kurtosis for normal, not t distribution)..
 
J

Jerry W. Lewis

Please clarify what you did. Are you talking about
1. Using TINV(2*p,df) but only for p<0.5 with an auxiliary Bernoulli
variable to decide whether it is the upper or lower tail (multiply
t-value by +/-1).
2. Using Ian Smith's inv_tdist(p,df).
3. Using Ian Smith's inv_tdist(p,df) but only for p<0.5 with an
auxiliary Bernoulli variable to decide whether it is the upper or lower
tail (multiply t-value by +/-1).
4. Something else (please describe)

I would be very surprised if you are finding a VBA function to be faster
than a native Excel function.

In Excel 2002 and earlier versions,
=TINV(p,df)
returns 5E6 for p<4.3E-7, regardless of df (too extreme even for df=1).
Whenever you hit such a value, that will necessarily destabilize your
descriptive statistics.

Excel 2003 truncates the distribution at 1E7, but the p-value for which
TINV returns 1E7 varies with df, and seems to provide around 10 digit
accuracy in this tail, so TINV in 2003 might be acceptable to generate
random variates, but I still would avoid GAMMAINV for this purpose.

By contrast, Ian Smith's VBA functions are as close to machine accuracy
as I have seen in any double precision implementation (better than
commercial statistics packages, better than commercial math libraries, etc.)

Jerry
 
M

mm

Jerry,

I'm ignoring all p > 0.5. Then I'm using

inv_tdist(p,df) if q < 0.5
inv_tidst(1-p,df) if q > 0.5,

where q = rand() is an uniform r.v. from Excel (independent from the
previous one).

I don't why it's faster, but it is! At least 5 times faster, if not 10
times. I'm using Excel 2003.

But I think I found that my problem is in the statistical domain, not
in the computation. See this thread:
http://groups.google.com/group/sci.stat.math/browse_thread/thread/b8383e1dd956d59e/
or sci.stat.math thread with subject "t distribution -- sample
kurtosis" if this link doesn't work. (Anyone knows how to link
correctly to threads?..)

Btw, is inv_tdist in public domain?

Thank you!
 
J

Jerry W. Lewis

mm wrote:

....
Btw, is inv_tdist in public domain?
....


The first line of the VBA code is a copyright notice, so by definition,
it is not public domain.

However, if Ian Smith had not intended for individuals to use his code,
then I doubt that he would have published the source code in a public
place. If you are wanting to use it commercially, I would recommend
e-mailing him. His address ([email protected]) is clearly displayed
on his web page
http://members.aol.com/iandjmsmith/
and I have found him to respond readily to communications regarding his
numerical analysis efforts. Ian?

Jerry
 
I

iandjmsmith

Jerry said:
mm wrote:

...

...


The first line of the VBA code is a copyright notice, so by definition,
it is not public domain.

However, if Ian Smith had not intended for individuals to use his code,
then I doubt that he would have published the source code in a public
place. If you are wanting to use it commercially, I would recommend
e-mailing him. His address ([email protected]) is clearly displayed
on his web page
http://members.aol.com/iandjmsmith/
and I have found him to respond readily to communications regarding his
numerical analysis efforts. Ian?

Jerry

Any time I suggest someone might wish to use the code, I try to make it
clear that they can do anything they like with it. I place no
restrictions and give no guarantees although I believe the code is
pretty much as good as you will find. I do not see the point in having
everyone write code for the same old stuff over and over again. It
seems to defeat the main point of computer programs which is that you
only have to get it right once and it'll work forever. I have no
problems with the code being used commercially. Should anyone make a
fortune because of its use, they are more than welcome to shower me
with gifts/donations. So Microsoft, please note, I have no problem if
this code or better is added to Excel making my VBA code redundant!

The code has a copyright notice at the top because when I first made
the code available, I consulted some online document on copyright
issues and it said I should include it. It also said I have the
copyright whether or not I add it in because I produced the code or
some such stuff. Please note, the code for AS241 is merely a
translation and so from a copyright point of view has nothing to do
with me. This may have implications about it's commercial use. I
believe it is ok to use AS241 (and translations of it) commercially but
one thing I am not is a legal expert. I do not even know exactly what
"in the public domain" implies.

The main purpose of the code is to provide accurate methods of
performing basic calculations for statistical distributions.

It also shows how simple modifications to techniques such as
Newton-Raphson can improve them enormously. I think that's why the
inv_tdist code is much faster than the Excel TINV code despite VBA not
being particularly quick. Excel claims to use a binary search which
would require the TDIST function to be called lots of times. I have not
seen the modified NR approach call the cdf function more than 7 times.

The code also includes asymptotic formulae and continued fraction
formulae which I've discovered over the years and have not seen
elsewhere.

Lastly, it shows how write statistical functions without using a
function to evaluate the log of the gamma function - a pet hate of mine
because its use guarantees large errors if applied to large arguments.
Again you only have to get it right once and unnecessary errors of that
type need never reappear.

What the code is not intended to do is probably just as important.

It is not intended to be the fastest, the most accurate or best in any
sense.

I believe all of the code can be improved (made faster or more accurate
or work with logs of probabilities to avoid returning probabilities of
0 ...) but the code is a good place to start from.

The code can usually be modified easily. If, for example, you wish to
calculate functions of the t-distribution for non-integral degrees of
freedom just comment out the lines like df = Fix(df) in the pdf, cdf
and inv functions. The code does not make use of the fact that it is
restricted to integral degrees of freedom.


Ian Smith
 
D

Dana DeLouis

repeating this exercise using new random numbers, I found the sample
kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The
theoretical value for the excess kurtosis of t distribution is 6 / (df
- 4) = 6,

Hi. I'm not a Stats guy, so this is just for gee wiz.
Just for feedback, the math program Mathematica appears to agree with Excel.
Most of the values returned by KurtosisExcess are below 6.
I "think" that Excel (2003) is doing it correctly!

These values agree with what you mentioned...

Variance[StudentTDistribution[5]]}
5/3 (or 1.6666...)

KurtosisExcess[StudentTDistribution[5]]
6

( I know this isn't Excel, but I thought it would be easier to show it this
way instead of just the answers.)

= = = = = = = = = = =
Example #1 Data size = 30,000. The value of KurtosisExcess jumps
around. Some values were below 3, and above 10. df =5.
Here's 10 samples...

t=Table[KurtosisExcess[RandomArray[StudentTDistribution[5], 30000]], {10}]

{5.111078, 4.0498174, 3.8689816, 4.5970144, 7.0692163, 3.2919761,
5.9771902, 4.1000405, 3.23158, 4.8915233}

{Mean[t], StandardDeviation[t]}

{4.6188418, 1.2021682}


= = = = = = = = = = =
Example #2 Data size = 300,000 (Larger size)

t=Table[KurtosisExcess[RandomArray[StudentTDistribution[5], 300000]],
{10}]

{4.4799506, 4.6979051, 7.693889, 4.8315766, 6.7575222, 4.685617,
4.415599, 4.9943591, 6.166258, 4.404873}

{Mean[t], StandardDeviation[t]}

{5.312755, 1.1504864}

As you can see, KurtosisExcess is usually below 6 and varies quite a bit.
Maybe someone good at Stats can jump in and confirm.
 
I

iandjmsmith

Dana said:
repeating this exercise using new random numbers, I found the sample
kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The
theoretical value for the excess kurtosis of t distribution is 6 / (df
- 4) = 6,

Hi. I'm not a Stats guy, so this is just for gee wiz.
Just for feedback, the math program Mathematica appears to agree with Excel.
Most of the values returned by KurtosisExcess are below 6.
I "think" that Excel (2003) is doing it correctly!

These values agree with what you mentioned...

Variance[StudentTDistribution[5]]}
5/3 (or 1.6666...)

KurtosisExcess[StudentTDistribution[5]]
6

( I know this isn't Excel, but I thought it would be easier to show it this
way instead of just the answers.)

= = = = = = = = = = =
Example #1 Data size = 30,000. The value of KurtosisExcess jumps
around. Some values were below 3, and above 10. df =5.
Here's 10 samples...

t=Table[KurtosisExcess[RandomArray[StudentTDistribution[5], 30000]], {10}]

{5.111078, 4.0498174, 3.8689816, 4.5970144, 7.0692163, 3.2919761,
5.9771902, 4.1000405, 3.23158, 4.8915233}

{Mean[t], StandardDeviation[t]}

{4.6188418, 1.2021682}


= = = = = = = = = = =
Example #2 Data size = 300,000 (Larger size)

t=Table[KurtosisExcess[RandomArray[StudentTDistribution[5], 300000]],
{10}]

{4.4799506, 4.6979051, 7.693889, 4.8315766, 6.7575222, 4.685617,
4.415599, 4.9943591, 6.166258, 4.404873}

{Mean[t], StandardDeviation[t]}

{5.312755, 1.1504864}

As you can see, KurtosisExcess is usually below 6 and varies quite a bit.
Maybe someone good at Stats can jump in and confirm.
--
Dana DeLouis
Win XP & Office 2003


I tried a couple of standard ways to generate random numbers that
follow Student's t-distribution. Something is very wrong, and I don't
know what.

Start with a simple example. In column A, I put 1000 uniformly
distributed on (0,1) numbers, using RAND(). Then I apply to them the
function TINV. E.g., in cell B1, I have

=IF(A1<0.5,-TINV(2*A1,df),TINV(2*(1-A1),df))

where df is my degrees of freedom. I fixed df = 5.

I obtain a column of (presumably) t-distributed numbers. I then check
their excess kurtosis (as calculated by the KURT() function). As I was
repeating this exercise using new random numbers, I found the sample
kurtosis jumps around ~2.9 with a standard deviation of ~1.1. The
theoretical value for the excess kurtosis of t distribution is 6 / (df
- 4) = 6, and the theoretical stdev of the sample kurtosis itself is
sqrt(24/1000) ~ 0.15. So I am puzzled as to why the excess kurtosis is
so incredibly far off. Also puzzled by why it's so unstable.

I then replaced Excel's rand() worksheet function with the
Mersenne-Twister implementation in NtRand
http://www.numtech.com/NtRand/.

Similar result (the sample excess kurtosis jumps wildly but still very
far below the theoretical value of 6).

I then added another column of random uniform (0,1) numbers (in column
C), and tried using instead the formula:

=NORMSINV(A1)/SQRT(df/GAMMAINV(C1,df/2,2))

Similar result.

I tried usign 60,000 instead of 1000 numbers. Now the theoretical stdev
of the sample kurtosis is just 0.02.

Similar result.

I checked the variance and skewness, and those match their values ( df
/ df - 2 = 1.667 and 0 respectively) very closely.

I checked the normal distribution, and its kurtosis matches the
theoretical perfectly.

I know I'm doing something wrong, but what? =(

Thanks....

As the OP said the discussion on the statistical properties of the
sample kurtosis goes on at sci.stat.math and no new fault in Excel has
been discovered.

What has been discovered is how surprisingly slow the Excel inverse
functions are. I knew they were slow in extreme situations but I had
not appreciated how slow they were in general.

Out of curiousity, I then looked at
http://support.microsoft.com/default.aspx?scid=kb;en-us;829208#XSLTH3169121121120121120120
to see what Microsoft suggest on generating random variables. As you
can see random number generation via the Analysis Toolpak is no longer
recommended. Instead they suggest =NORMINV(RAND(), mu, sigma) for
generating values from the normal distribution with mean mu and
variance sigma^2.

Curious about the performance I copied the formula =normsinv(rand())
into the range a1:j20000 to see how long it would take. On my machine
it took 154 seconds which is fairly terrible - spreadsheet overheads
notwithstanding.

I then tried inv_normal(rand()) in the same range and it took 4
seconds.

Lastly I tried the Box-Muller approach using
=SQRT(-2*LN(RAND()))*COS(2*PI()*RAND()) in the same range and this took
about 1 second.

Maybe the random number generation section of the page referred to
above could do with updating.

Ian Smith
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top