How to handle missing values

N

NoMeContactes

Scenario: You need to record the last college a person attended. When
importing data, the possible values are 1) <Name of College>, 2) Did
not attend college, and 3) Data Not Available

1) is straightforward, you put the name of the college (or FK to a
lookup table)
2) I've always solved it by hardcoding a string on the table itself or
the lookup table. This can be an empty string or something like "Did
not attend college".
3) I use either Null on the table itself to denote the data is not
available or an "Unknown" field in the lookup.

Do you guys have any comments or a better way of handling this?

I ask because I'm reading the book: "Data Quality: The Accuracy
Dimension" and the way they address the issue is not clear to me.
Here's the extract:

------------------
Missing Values

A blank for COLLEGE_LAST_ATTENDED may be accurate or inaccurate. If
the person it applied to had attended college, it would be inaccurate.
This is another case of valid but not accurate.

Missing values are very problematic in a database because of this
characteristic. To be accurate in recording information, an optional
data element should allow encoding a value for NULL. This is properly
done by creating a separate field that indicates whether the value in
the first field is missing or not. In this instance, the value in the
primary field would be set to blank if no college had been attended.
If the recorder just did not know the anser to the question, the NULL
field would be set to indicate YES, meaning that the blank in the
primary field did not apply.

Unfortunately, few systems are built to allow for this distinction.
Even if they were, most data entry people would not get it right all
of the time anyway.

Sometimes a system is created that supports a keyword that means
empty. In the previous example, it might be "never attended college"
This is a bad practice because it can make queries that do COUNT,
GROUP BY, and other commands difficult or impossible to formulate
correctly. Thus it leads to inaccurate results.
---------------------

The book implies that what I've been doing is wrong, but in my opinion
their solution doesn't provide a better alternative either. I would
love to hear some comments on this.

Thank you,
 
N

NoMeContactes

That's one way I do it (As stated in my original post)

Since this extract says it's not the best way of handling it, I was
wondering about how other people tackle this problem, and their
comments on the book extract (specially the last paragraph about
having query problems).

Thanks,
 
R

Rod Plastow

The 'textbook' answer is that for strings:

A string value (including a space) is a known value.
A zero length string is 'nothing' or 'none.'
A null is 'not known.'

Remember you have to set up the field/column to allow zero length.

However the way you have been doing it and the way Karl suggested is often
more readable.

Regards,

Rod
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top