DISTINCT vs GROUPBY ?

Stef · Sep 12, 2005

DISTINCT vs GROUPBY ?

Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx

BabyATX13 via AccessMonster.com · Sep 12, 2005

Do you have separate Supplier and Invoice Tables? If not, I would start
there, and use one to many relationships. Ex. One Supplier will have many
Invoices; One Invoice will have many Orders.
K Board

DISTINCT vs GROUPBY ?

Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx

Michel Walsh · Sep 12, 2005

Hi,

GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list, when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group":

SELECT FirstName, LastName, LAST(address), LAST(phoneNumber)
FROM myTable
GROUP BY FirstName, LastName

could be faster than

SELECT FirstName, LastName, address, phoneNumber
FROM myTable
GROUP BY FirstName, LastName, address, phoneNumber

if FirstName and LastName define the "groups", because here, the sorting
list has a larger number of bytes to handle, for each entry in the sorted
list.

Hoping it may help,
Vanderghast, Access MVP

peregenem · Sep 13, 2005

Michel said:
GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list, when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group"

I'm not convinced by your argument

Neither rely on sorting. Instinct tells me DISTINCT would be faster
because it doesn't have to perform any aggregation, dummy or otherwise.
DISTINCT merely ignores duplicated rows, which is easy for the
optimizer e.g. a hash table.

Here are my test results:

SELECT DISTINCT data_col
FROM MillionRowTable;
1.266 secs

SELECT COUNT(*), data_col
FROM MillionRowTable
GROUP BY data_col;
1.953 secs

SELECT LAST(data_col), data_col
FROM MillionRowTable
GROUP BY data_col;
1.937 secs

Based on the above, I wouldn't consider GROUP BY to be significantly
slower but still think DISTINCT is the best choice.

Michel Walsh · Sep 13, 2005

Hi,

Even if it hashes, the hash has to be computed on each and every selected
expression, in the case you use more that ONE expression in the SELECT
statement, and nothing insures that two different row would never produce
the SAME hash value....

You are just sure that if two rows produce a different hash, then they are
different, but not the reverse.

Hashing is O(1), while plain sorting by comparison is O(2), so internally,
the database engine should perform the hashing, anyhow, when sorting on
large set. I fail to really see your point.

Your test did not include the GROUP BY statement equivalent to your
distinct, which is:

SELECT data_col
FROM MillionRowTable
GROUP BY data_col;

All your actual GROUP BY statement are penalized by returning TWO columns
(on a million rows) while your DISTINCT just return one column. That is
plainly unfair (almost twice the IO to produce the result).

I can be unfair too, what about:

SELECT DISTINCT pk, f1, f2, f3, f4, f5
FROM myTableWithAprimaryKey

versus

SELECT pk, LAST(f1), LAST(f2), LAST(f3), LAST(f4), LAST(f5)
FROM myTableWithAprimaryKey
GROUP BY pk

Vanderghast, Access MVP

peregenem · Sep 13, 2005

Michel said:
Even if it hashes

I don't know that it does. I was simply guessing that a DISTINCT would
be easy to optimize.

Your test did not include the GROUP BY statement equivalent to your
distinct, which is:

SELECT data_col
FROM MillionRowTable
GROUP BY data_col;

I get 0.985 secs.

I can be unfair too

I wasn't trying to be unfair, I was merely being foolish

Help on Counting and grouping distinct records	3	Apr 29, 2009
yet another remove duplicates question	4	Oct 11, 2007
Using SELECT DISTINCT	9	Jun 30, 2008
select distinct and another column...	2	Jan 29, 2007
Unique entries based on condition - count distinct	1	May 1, 2010
SQL performance: Nested SELECT vs. INNER JOIN	14	Sep 25, 2003
Selecting the Most Recent Date From one unique ID	4	Nov 1, 2005
Extracting common ids across 12 tables and assoc data	6	Dec 4, 2007

DISTINCT vs GROUPBY ?

Stef

BabyATX13 via AccessMonster.com

Michel Walsh

peregenem

Michel Walsh

peregenem

Ask a Question

Similar Threads