John said:
With relatively small number of records (up to a few thousand), the odds
of getting a duplicate are really really really small.
John Spencer
Access MVP 2002-2005, 2007-2009
The Hilltop Institute
University of Maryland Baltimore County
Without getting into a debate about what constitutes "really really
really small," if my calculations are correct, the probability of
getting at least one duplicate question when drawing 30 questions at
random, with replacement, from a pool of 100 is approximately 99.2%.
For 30 questions from a pool of 2000, I get 19.6%, much higher than
intuition might suggest. This is the classic birthday problem, which
is notoriously counterintuitive.
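For reference, here is the formula I believe those figures rest on,
written out with n as the number of questions drawn and d as the size
of the pool. The exponential approximation on the last line is a
standard shortcut and not something the function relies on, so treat
it as a rough cross-check only.

```latex
% Probability of at least one duplicate in n draws, with replacement,
% from a pool of d equally likely items (n <= d).
\[
  P(n, d) \;=\; 1 - \prod_{k=1}^{n-1} \left( 1 - \frac{k}{d} \right)
          \;=\; 1 - \frac{d!}{(d-n)!\, d^{\,n}}
\]
% Approximation, reasonable when n is small relative to d:
\[
  P(n, d) \;\approx\; 1 - \exp\!\left( -\frac{n(n-1)}{2d} \right)
\]
```

For n = 30 and d = 2000 the approximation gives roughly 0.195, which
is in line with the 19.6% quoted above.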
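Public Function P2(n As Integer, d As Integer) As Double
    ' Probability of at least one duplicate when drawing n items
    ' at random, with replacement, from a pool of d.
    Dim lngI As Long
    Dim dblTemp As Double
    ' dblTemp accumulates P(no duplicate) = (d-1)/d * (d-2)/d * ... * (d-n+1)/d
    dblTemp = 1
    For lngI = d - 1 To d - n + 1 Step -1
        dblTemp = dblTemp * lngI / CDbl(d)
    Next lngI
    P2 = 1 - dblTemp
End Function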
MsgBox Format(P2(23, 365), "0.0000%") => 50.7297%
MsgBox Format(P2(30, 100), "0.0000%") => 99.2209%
MsgBox Format(P2(30, 2000), "0.0000%") => 19.6339%
Cf.:
http://mathworld.wolfram.com/BirthdayProblem.html
Note that the value for P2(23, 365) is exactly the same as the result on
the Wolfram page. An interesting bit of trivia on the same page:
"...the distribution of birthdays is assumed to be uniform throughout
the year (in actuality, there is a more than 6% increase from the
average in September in the United States; Peterson 1998), then ..."
Note that I was careful not to push the function above to its numerical
limits since it would not be difficult to find values of n and d that
would compromise the accuracy of the function's result. Suffice it to
say that based on the preliminary function results shown above, I
recommend that some method be used to preclude the selection of the same
question twice, subject to verification of the function's algorithm and
accuracy.
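One way to preclude duplicates is to draw without replacement using a
partial shuffle. The following is only a sketch: PickDistinct is a
hypothetical helper name, and it assumes the questions can be addressed
by position 1 through d (for example, by row position in a list of
question IDs), which may not match how the actual question table is
keyed.

```vba
' Hypothetical helper (not from the original thread): returns n distinct
' integers drawn at random from 1..d using a partial Fisher-Yates shuffle.
' Assumes the caller has already called Randomize once.
Public Function PickDistinct(ByVal n As Long, ByVal d As Long) As Long()
    Dim pool() As Long
    Dim picks() As Long
    Dim i As Long
    Dim j As Long
    Dim tmp As Long

    ' Cannot draw more distinct items than the pool holds.
    If n < 1 Or n > d Then Err.Raise 5   ' Invalid procedure call or argument

    ' Fill the pool with the positions 1..d.
    ReDim pool(1 To d)
    For i = 1 To d
        pool(i) = i
    Next i

    ReDim picks(1 To n)
    For i = 1 To n
        ' Pick j uniformly from i..d, then swap it into slot i.
        j = i + Int(Rnd * (d - i + 1))
        tmp = pool(i)
        pool(i) = pool(j)
        pool(j) = tmp
        picks(i) = pool(i)
    Next i

    PickDistinct = picks
End Function
```

Calling PickDistinct(30, 100) after Randomize should give 30 different
positions between 1 and 100, which can then be mapped to question IDs.
Because each chosen position is swapped out of the remaining range, it
cannot be chosen again.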
James A. Fortune