N
NoMeContactes
Scenario: You need to record the last college a person attended. When
importing data, the possible values are 1) <Name of College>, 2) Did
not attend college, and 3) Data Not Available
1) is straightforward, you put the name of the college (or FK to a
lookup table)
2) I've always solved it by hardcoding a string on the table itself or
the lookup table. This can be an empty string or something like "Did
not attend college".
3) I use either Null on the table itself to denote the data is not
available or an "Unknown" field in the lookup.
Do you guys have any comments or a better way of handling this?
I ask because I'm reading the book: "Data Quality: The Accuracy
Dimension" and the way they address the issue is not clear to me.
Here's the extract:
------------------
Missing Values
A blank for COLLEGE_LAST_ATTENDED may be accurate or inaccurate. If
the person it applied to had attended college, it would be inaccurate.
This is another case of valid but not accurate.
Missing values are very problematic in a database because of this
characteristic. To be accurate in recording information, an optional
data element should allow encoding a value for NULL. This is properly
done by creating a separate field that indicates whether the value in
the first field is missing or not. In this instance, the value in the
primary field would be set to blank if no college had been attended.
If the recorder just did not know the anser to the question, the NULL
field would be set to indicate YES, meaning that the blank in the
primary field did not apply.
Unfortunately, few systems are built to allow for this distinction.
Even if they were, most data entry people would not get it right all
of the time anyway.
Sometimes a system is created that supports a keyword that means
empty. In the previous example, it might be "never attended college"
This is a bad practice because it can make queries that do COUNT,
GROUP BY, and other commands difficult or impossible to formulate
correctly. Thus it leads to inaccurate results.
---------------------
The book implies that what I've been doing is wrong, but in my opinion
their solution doesn't provide a better alternative either. I would
love to hear some comments on this.
Thank you,
importing data, the possible values are 1) <Name of College>, 2) Did
not attend college, and 3) Data Not Available
1) is straightforward, you put the name of the college (or FK to a
lookup table)
2) I've always solved it by hardcoding a string on the table itself or
the lookup table. This can be an empty string or something like "Did
not attend college".
3) I use either Null on the table itself to denote the data is not
available or an "Unknown" field in the lookup.
Do you guys have any comments or a better way of handling this?
I ask because I'm reading the book: "Data Quality: The Accuracy
Dimension" and the way they address the issue is not clear to me.
Here's the extract:
------------------
Missing Values
A blank for COLLEGE_LAST_ATTENDED may be accurate or inaccurate. If
the person it applied to had attended college, it would be inaccurate.
This is another case of valid but not accurate.
Missing values are very problematic in a database because of this
characteristic. To be accurate in recording information, an optional
data element should allow encoding a value for NULL. This is properly
done by creating a separate field that indicates whether the value in
the first field is missing or not. In this instance, the value in the
primary field would be set to blank if no college had been attended.
If the recorder just did not know the anser to the question, the NULL
field would be set to indicate YES, meaning that the blank in the
primary field did not apply.
Unfortunately, few systems are built to allow for this distinction.
Even if they were, most data entry people would not get it right all
of the time anyway.
Sometimes a system is created that supports a keyword that means
empty. In the previous example, it might be "never attended college"
This is a bad practice because it can make queries that do COUNT,
GROUP BY, and other commands difficult or impossible to formulate
correctly. Thus it leads to inaccurate results.
---------------------
The book implies that what I've been doing is wrong, but in my opinion
their solution doesn't provide a better alternative either. I would
love to hear some comments on this.
Thank you,