Query to find potential duplicates within one table?

P

Paul (TRX)

I wonder if anybody may be able to help me out with a problem I am having. I
am using Access 2003. I was given a massively huge table (over 1,000,000
entries) and asked to figure out a way we may be able to find duplicates. We
get data from several different sources, and often get the same person more
than once. For example, if Dr. Peter Venkman has a New York office as well as
a Pennyslvania office, we may have him in our records for both offices.

We may have Michael J. Fox also in our records as Michael Fox. We may have
Shamus O'Connor also in our records as Shamus O Connor. What I need to be
able to do is to query the table against itself in several different ways to
see if there are duplicates. For example, I may query by the first name, last
name, and date of birth to see if we have the same person twice for some
reason. I may also query by our client ID we've given them, and there are
many other methods I may try.

The problem I am running into is that the only way I can think of to do this
is to query the table against itself. In other words, this means, in my
query, I basically see ClientTable and ClientTable_1 and I run my query as
though it is two tables. The problem is, since I am querying the table
against itself, this gives me EVERY SINGLE record as a duplicate. For
example, say Dr. Egon Spengler is only in our table once. It still considers
him a match with himself.

Also, a somewhat related question... Is there a way to run a query to find
where some things match, but others don't? For example, could I run a query
that will actually find where the first and last name match, but the client
ID doesn't, or where the first and last name match, but the date of birth
does not? An unmatched query does no do this, because that is only finding
records from one table that have NO match in the other, and that obviously
won't work since we are talking about one table here. Thanks in advance for
any and all help you may be able to offer.

--
Have a nice day!

~Paul
TargetRx
The power to know what comes next.

~~~~~~
| |
|c--OD
| _)
| |
|-. |
/ `-# /A
/ /_|..`#.J/
||LJ `m''''
ptaylor
 
J

John Spencer

Have you looked at the Find Duplicates Query Wizard? It will build a query
to find duplicates on one or more fields within a single table.

-- Click New in the query window
-- Select Find Duplicates Query Wizard
-- Follow the bouncing ball

By the way, good luck on finding records where the names are similar (not
the same) and the addresses are the similar (not the same). Accurately
finding duplicated names and addresses is a major task.


--
John Spencer
Access MVP 2002-2005, 2007
Center for Health Program Development and Management
University of Maryland Baltimore County
..
 
P

Paul (TRX)

Wow! Somehow, I've NEVER noticed that before. I guess I just didn't expect it
to be that easy. Thank you very much! I'm trying it now. Being as the table
is so massive, it is taking a long while, but hopefully it will work. If not,
I'll likely be back to ask follow up questions, but I'll try to remember to
come back either way to let you know if it worked out. Thanks! Now I feel
like a goof for not noticing that! LOL!

--
Have a nice day!

~Paul
TargetRx
The power to know what comes next.

~~~~~~
| |
|c--OD
| _)
| |
|-. |
/ `-# /A
/ /_|..`#.J/
||LJ `m''''
ptaylor
 
J

John Nurick

Hi Paul,

Assuming your table has a primary key [PK] you can look for
near-duplicate records by using a query along these lines:

SELECT A.*, B.*
FROM ClientTable AS A INNER JOIN ClientTable AS B
ON ((A.FirstName = B.FirstName) AND (A.LastName = B.LastName))
WHERE A.PK <> B.PK

The WHERE clause ensures that a record cannot match itself; the ON
clause specifies which fields should match to count as a "duplicate". If
you want to find only records where the firstname and lastname match but
the date of birth doesn't, add the latter condition to the WHERE clause:
WHERE A.PK <> B.PK AND A.DOB <> B.DOB

If the table doesn't have a primary key, add an autonumber field and use
that.

But there's no easy way to do this: there are too many possible
variations!
 
S

Scott

John,

I used Find Duplicates Query. Now the question is how to delete the
duplicates. The query found 692,000 records. It's just impossible to delete
the duplicates manually. Any clue?

Thanks,
Scott
 
J

John Spencer

No guarantees with that many records to handle, but here are two techniques
you can try

One way to handle this would be to build a new table with a compound unique
index based on the multiple fields. Then import all the records into this
new table and ignore the errors. Once the import is successful, delete the
old table and rename the new table to the old table's name.

Another way, (BACK UP your data before you do this) - this works best if you
have a single field primary key. I think I could modify it to handle a
multi-field primary key, but I would have to test it.

Build a query based on the table that will identify the primary key values
you want to keep and save that as qKeepThese. If you don't care which of
the "duplicate" records you want to keep then you can use the First
aggregate function to more-or-less randomly select one.

Query One: << This query is the key to identifying which records to keep>>
SELECT First(PrimaryKeyField) as FirstID
FROM TheTable
GROUP BY TelephoneNumber, ... << list fields that determine duplicates>>


Query Two:
DELETE DistinctRow T.*
FROM TheTable as T
WHERE T.PrimaryKeyField IN
(SELECT PrimaryKeyField
FROM TheTable LEFT JOIN QKeepThese
ON TheTable.PrimaryKeyField= QKeepThese.FirstID
WHERE qKeepThese.FirstID is Null)

All in one query would be as follows - only works if field and table names
don't contain "Special" characters.

DELETE DistinctRow T.*
FROM TheTable as T
WHERE T.PrimaryKeyField IN
(SELECT PrimaryKeyField
FROM TheTable LEFT JOIN
(SELECT First(PrimaryKeyField) as FirstID
FROM TheTable
GROUP BY TelephoneNumber) AS QKeepThese
ON TheTable.PrimaryKeyField= QKeepThese.FirstID
WHERE qKeepThese.FirstID is Null)

--
John Spencer
Access MVP 2002-2005, 2007
Center for Health Program Development and Management
University of Maryland Baltimore County
..
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top