DISTINCT vs GROUPBY ?

S

Stef

DISTINCT vs GROUPBY ?


Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx
 
B

BabyATX13 via AccessMonster.com

Do you have separate Supplier and Invoice Tables? If not, I would start
there, and use one to many relationships. Ex. One Supplier will have many
Invoices; One Invoice will have many Orders.
K Board
DISTINCT vs GROUPBY ?

Hello all,

I'm trying to figure out if there is any difference of performance using
either
GROUPBY or DISTINCT in the following scenario.

I have a table named [Orders] that contains many fields, among them two
called [Supplier_ID] and [InvoiceNumber]. The combo of the 2 is not
unique and may repeat over several item records.

I'm trying to generate a unique list of all [Supplier_ID][InvoiceNumber]

Is it faster/better to use GROUPBY or DISTINCT ?

many thx
 
M

Michel Walsh

Hi,


GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list, when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group":


SELECT FirstName, LastName, LAST(address), LAST(phoneNumber)
FROM myTable
GROUP BY FirstName, LastName



could be faster than

SELECT FirstName, LastName, address, phoneNumber
FROM myTable
GROUP BY FirstName, LastName, address, phoneNumber


if FirstName and LastName define the "groups", because here, the sorting
list has a larger number of bytes to handle, for each entry in the sorted
list.



Hoping it may help,
Vanderghast, Access MVP
 
P

peregenem

Michel said:
GROUP BY should be faster, since both are likely to relay on sorting, and
DISTINCT has no idea about which fields are important to consider (it will
sort the whole SELECTed list of expressions), while GROUP BY can be
instructed. Furthermore, use a dummy aggregate for fields in the list, when
used with GROUP BY, such as LAST( ), on fields that are not defining a
"group"

I'm not convinced by your argument :)

Neither rely on sorting. Instinct tells me DISTINCT would be faster
because it doesn't have to perform any aggregation, dummy or otherwise.
DISTINCT merely ignores duplicated rows, which is easy for the
optimizer e.g. a hash table.

Here are my test results:

SELECT DISTINCT data_col
FROM MillionRowTable;
1.266 secs

SELECT COUNT(*), data_col
FROM MillionRowTable
GROUP BY data_col;
1.953 secs

SELECT LAST(data_col), data_col
FROM MillionRowTable
GROUP BY data_col;
1.937 secs

Based on the above, I wouldn't consider GROUP BY to be significantly
slower but still think DISTINCT is the best choice.
 
M

Michel Walsh

Hi,


Even if it hashes, the hash has to be computed on each and every selected
expression, in the case you use more that ONE expression in the SELECT
statement, and nothing insures that two different row would never produce
the SAME hash value....

You are just sure that if two rows produce a different hash, then they are
different, but not the reverse.

Hashing is O(1), while plain sorting by comparison is O(2), so internally,
the database engine should perform the hashing, anyhow, when sorting on
large set. I fail to really see your point.

Your test did not include the GROUP BY statement equivalent to your
distinct, which is:


SELECT data_col
FROM MillionRowTable
GROUP BY data_col;


All your actual GROUP BY statement are penalized by returning TWO columns
(on a million rows) while your DISTINCT just return one column. That is
plainly unfair (almost twice the IO to produce the result).

I can be unfair too, what about:


SELECT DISTINCT pk, f1, f2, f3, f4, f5
FROM myTableWithAprimaryKey


versus


SELECT pk, LAST(f1), LAST(f2), LAST(f3), LAST(f4), LAST(f5)
FROM myTableWithAprimaryKey
GROUP BY pk





Vanderghast, Access MVP
 
P

peregenem

Michel said:
Even if it hashes

I don't know that it does. I was simply guessing that a DISTINCT would
be easy to optimize.
Your test did not include the GROUP BY statement equivalent to your
distinct, which is:

SELECT data_col
FROM MillionRowTable
GROUP BY data_col;

I get 0.985 secs.
I can be unfair too

I wasn't trying to be unfair, I was merely being foolish :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top