Looking for sophisticated De-Duping tool

D

Dawn

Does anyone know of a possible add-on tool for de-duping that is more
detailed than just the 'find duplicates' wizard in Access. More
specifically, what I am looking for (or asked to do at work), is something
that will pick up a duplicate record in the database in which the fields
arent 'exact'. (ie, a company name of one of the records says Brigham Young
University with an address line of P.O. Box 882, and the other record is BYU
for company, and PO Box 822)..these arent EXACT obviously, but is most likely
the same person(with a possible typo in the PO Box). Unfortunately, there
are no unique identifiers for this table.
Any help or advice would be greatly appreciated...even if it involves using
a completely different database program.
 
T

TC

It's trivially easy for a human being to spot such situations. But it's
enormously difficult for a program to do it. What rules do you expect
it would use? I sure can not thiunk of them, in any general sense. I
can't imagine you will find a solution "out of the box", as it were.

Possibly the more important problem, is that the database sounds as if
it has not been normalized correctly. If it had been, a company name
(like "BYU, PO Box 882") would never be duplicated in multiple records.
It would be defined in a single record in a company names table, and
the other records would all refer to that record by primary key. If
you're not sure what I'm saying here, you need to learn about "database
normalization", then redesign your table structure, before you go much
further. (Or ask more questions here, or preferably in
microsoft.public.access.tablesdbdesign, the newsgroup designed for that
topic)

HTH,
TC [MVP Access]
 
D

Dawn

TC..thank you for responding. Believe me, I totally understand that what Im
asking is next to impossible for a program to do. However, I recently
started this job and the database was already here. I want to totally
redesign it and yes, NORMALIZE it as well. It was very poorly written (no
primary keys, foreign keys, nothing...! They left me with a MESS!) BUT,
there are close to 100K records and they needed something done NOW ( I really
didnt have the time to re-design the database), so I was trying to see if
there was a quick fix...pretty much knowing there wasn't. After telling my
boss over and over again that I was 99% sure there wasn't a way to do this,
he insisted that I "look" for an add on or tool that COULD do this...so to
satisfy him, thats what Im doing. ;)
 
T

TC

It's a tough situation. You can't afford to seem unresponsive to what
your boss wants. But neither can you say, "It can't be done you idiot!"
:)

In principle, there should be no problem getting an addon, or having
someone write some custom code, or whatever, that could make
user-specified changes to the data in the relevant fields. Then you
could easily change all occurrences of "GPO", to "General Post Office"
(or vice versa), for example. The problem would be, deciding what those
changes should be - the addon certainly could not guess them. So you'd
make a few changes, then review the data again, then make more changes,
and so on. Eventually, once everything was in the same or similar
fomat, it might be easier to find the duplicates, using queries, or
custom written code. But it would all be a big ask, for a table
comprising 100k records.

Maybe someone else has a better suggestion?

HTH,
TC [MVP Access]
 
J

John Gray

Does anyone know of a possible add-on tool for de-duping that
is more detailed than just the 'find duplicates' wizard in Access.


Hi,

Search back in this forum (microsoft.public.access) to Nov 22/05 (on or around that
date). There was a question about finding duplicates, posted by "WishIKnewMore". This
was answered by Ed Warren, who provided a rough technique for building your own
rules-based dupe search.

hth,
John
 
D

Dawn

Thanks John, I found the thread and have printed it out...I will give it a
try...looks like it *might* help a bit...I really appreciate everyone's help
 
T

Tony Toews

Dawn said:
Does anyone know of a possible add-on tool for de-duping that is more
detailed than just the 'find duplicates' wizard in Access. More
specifically, what I am looking for (or asked to do at work), is something
that will pick up a duplicate record in the database in which the fields
arent 'exact'. (ie, a company name of one of the records says Brigham Young
University with an address line of P.O. Box 882, and the other record is BYU
for company, and PO Box 822)..these arent EXACT obviously, but is most likely
the same person(with a possible typo in the PO Box). Unfortunately, there
are no unique identifiers for this table.

One approach would be using the Soundex algorithm although that may
not do a good job for many languages outside of English/Western
European.

Now two things I did do when assembling groups of families to save on
mailing costs was to find duplicates of the last name and display
those using a treeview control. Then I allowed the user to group
those into families.

Create Families from Volunteers User Interface
http://www.granite.ab.ca/access/familiesui.htm

I also did the same thing but using phone numbers. I knew that
there'd be lots of differences in addresses so I didn't bother trying
that route.

You could consider using a combination of the Soundex algorithm and
zip code if you are in the U.S. In Canada the postal codes alone are
small enough that dupe checking on those would likely be sufficient in
a lot of cases.

Also to locate almost duplicate city names I'd suggest a simple query
with a count based on province/state and city name. You could easily
pick out those where the city name has a slight spelling mistake and
fix those. And possibly create a city table to save on data entry
time. Also force the user to consciously click on an item to create a
new city/province/state combination.

Now if you want to de dupe address the best way would be to use a lot
of string manipulation to break the address into it's components and
then reassemble the address properly. An example of the fields
required is at
http://www.canadapost.ca/tools/pcl/bin/advanced-e.asp?sblid=pcl

Tony
--
Tony Toews, Microsoft Access MVP
Please respond only in the newsgroups so that others can
read the entire thread of messages.
Microsoft Access Links, Hints, Tips & Accounting Systems at
http://www.granite.ab.ca/accsmstr.htm
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top