Parsing out nouns

S

Steve Lang

Hi all,

Does anyone know of a way to pull out nouns from a Word document,
programmatically?

TIA and have a great day!

--
Stephen Lang
Legislative Counsel Bureau
Carson City, NV
GMT+8
slang at lcb <dot> state <dot> nv <dot> us
 
J

Jezebel

Linguistically, this is a *very* difficult proposition. A word is a noun
only by virtue of its syntactic function within an actual sentence. Even if
all you mean is 'words listed in the dictionary as nouns' it's still a major
challenge: many nouns are also listed as verbs or other parts of speech, and
almost all nouns can, in some context, be used as verbs or adjectives.
Trying to parse natural language syntax is still one of the holy grails of
programming (especially if you're dealing with the sort of pseudo-English
that lawyers create!).

You'd also need to grapple with all the inflections and suppletions that our
foul language is heir to.

One approach would be to make a list of *all* the words in the document, do
a unique sort, then discard those that are obviously not nouns. But it would
be essentially a manual task.
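
A minimal sketch of that first pass, assuming the document text has already been exported from Word as plain text. The function name `candidate_words` and the tiny `OBVIOUS_NON_NOUNS` list are hypothetical, made up for illustration; a real list of words to discard would be far longer.

```python
import re

# Hypothetical, deliberately tiny list of words that are obviously not nouns.
OBVIOUS_NON_NOUNS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "are", "was", "were", "be", "it", "this", "that", "with", "by",
}


def candidate_words(text, non_nouns=OBVIOUS_NON_NOUNS):
    """Return the sorted unique words in text, minus the obvious non-nouns."""
    # Lowercase first so "Table" and "table" collapse into one entry.
    # The pattern keeps internal apostrophes, so "don't" stays one word.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return sorted(set(words) - set(non_nouns))


sample = "The committee may table the bill; the table lists each bill."
print(candidate_words(sample))
```

What comes back is only a list of candidates. Words such as "table" and "lists" survive the filter even though they can be verbs, so the final pass is still the manual one described above.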
 
H

Helmut Weber

Hi Jezebel and Steve,
virtually nothing to add to Jezebel's explanation.
Imagine a text: "Help!" Verb or noun? Language is
all fuzzy.
Greetings from Bavaria, Germany
Helmut Weber
"red.sys" & chr$(64) & "t-online.de"
 
W

Word Heretic

G'day Jezebel,

<nods>



Steve Hudson

Word Heretic, Sydney, Australia
Tricky stuff with Word or words for you.
Email (e-mail address removed)
Products http://www.geocities.com/word_heretic/products.html

Replies offlist may require payment.
 
B

Bruce Brown

I agree with Jezebel. It is a surely hopeless pursuit to parse nouns,
and any attempt to do so, as Jezebel notes, is doomed to be a manual
task:

"Dogs don't shower but showers can dog bridal showers."

However, it is nothing less than flagitious to suggest that our
language is "foul." That charge is to thinking people what treason is
to governments.

The English language contains more words than any other language on
earth, is now the world's leading international language, and boasts a
glorious body of literature that equals -- some would say surpasses --
its rivals.

English has no genders or silly little accent marks to keep track of,
practical blessings for practical people. Above all, to a degree
greater than most other languages, the English tongue is spoken by
free peoples.

One can't help wondering what guttural, gargling glob of a language
Jezebel would prefer to be grunting in.
 
W

Word Heretic

G'day (e-mail address removed) (Bruce Brown),

def c as long
def all_non_binary_languages as void

:)

(e-mail address removed) (Bruce Brown) was spinning this yarn:
One can't help wondering what guttural, gargling glob of a language
Jezebel would prefer to be grunting in.

Steve Hudson

 
J

JGM

--
_________________________________________

Jean-Guy Marcil
(e-mail address removed)

Bruce Brown said:
I agree with Jezebel. It is a surely hopeless pursuit to parse nouns,
and any attempt to do so, as Jezebel notes, is doomed to be a manual
task:

"Dogs don't shower but showers can dog bridal showers."

However, it is nothing less than flagitious to suggest that our
language is "foul." That charge is to thinking people what treason is
to governments.

You are right, qualifying English as foul is overdoing it, but I do not
think Jezebel meant it in any serious way. I think it was more of a joking
comment referring to all the exceptions and nonsensical semantics that exist
in English. This is why learning English as a second language is very
difficult. I know; I did it. Just think of learning how to pronounce "pear,
beard, heard"; "alive, live (as in They live over there)"; "tough, though",
and the list could go on for pages. How about spelling, as in "believe,
perceive"? How about all the crazy plurals of words that are foreign in
origin, as in "datum, data", or the fact that "bonus" is "bonuses", but
"cactus" can be either "cactuses" or "cacti", while "stimulus" must be
"stimuli" and "criterion" is "criteria"? Or "true" English words, such as
"mouse-mice" and "tooth-teeth"? And why is it that we say "one sheep" and
"many sheep" and not "many sheeps"? How about "crazy" syntax: "We get on the
bus", but actually we are inside the vehicle, so why do we say "on the bus"?
As you can see, I could go on and on to show why Jezebel used the word
"foul"... but again, in the strictest sense, I agree with you: English is
not foul, no more than Hebrew, Cantonese, Hindi, Swahili, Italian, Greek...
Most people who learn English later in life have a hard time with those
examples (and other idiosyncrasies). I am listing all this not because I
want a logical explanation, but merely to point out that English is indeed
difficult to learn because it is not very practical.

Bruce Brown said:
The English language contains more words than any other language on
earth, is now the world's leading international language, and boasts a

Yes it does, but nearly 70% of English words are foreign in origin, mostly
from French, then German, Spanish, Italian, Arabic, and on and on...

Bruce Brown said:
glorious body of literature that equals -- some would say surpasses --
its rivals.

Its rivals? Who made this out to be a competition?

Bruce Brown said:
English has no genders or silly little accent marks to keep track of,

Well, I am from Canada and I speak French, and we have silly little accents
and genders... German, Swedish, Spanish, Italian, Portuguese, Thai,
Vietnamese, Romanian, Arabic, just to name a few, all have either gender,
"accents" or both... To call those diacritical marks silly is equivalent
to Jezebel's labelling of English as foul.

Bruce Brown said:
practical blessings for practical people. Above all, to a degree

Are all speakers of other languages "not-practical"? So because you were
born in a place on the Earth where English is spoken, you are superior in
some way? If you refer to the examples I provided above, you will see that
English is far from being practical. It is a hodgepodge of borrowed syntax,
pronunciation and spelling rules that makes it both vibrant and
nightmarish. Compared to other languages, English is by far one of the most
impractical... (There are many reasons for this. For example, being occupied
by the French for 300 years starting back in 1066 had an enormous impact;
also, when printing started, every printer chose its own way
of spelling some words that had previously never been written, just
spoken... by the time they finally agreed on a standard, the damage was
done.) Your way of thinking is typical of people who were born in an
environment and never had to learn second-hand what it was like to learn to
live in the said environment. You automatically assume that your way is
easier (which is true for you, since you were born in it), but then assume
that every other way "has something wrong with it" or is inferior in some
way.

Bruce Brown said:
greater than most other languages, the English tongue is spoken by
free peoples.

What has this got to do with the parsing of nouns? I do not think that this
forum is appropriate for starting a political debate on colonialism, free
speech and human rights, so I will not go there...

Bruce Brown said:
One can't help wondering what guttural, gargling glob of a language
Jezebel would prefer to be grunting in.

Now, why do you have to insult someone because they made a non-serious
exaggeration that you happen to disagree with? You are effectively
belittling every other language and you are alienating yourself from what
those marvelously rich other cultures and languages have to offer.
Also, in a forum where people from all over the world exchange ideas, it
would be a very good idea to be careful before asserting one's superiority
because one speaks a particular language and "bashing" other languages in
the process.

Hoping you will be more open to language differences and not so quick to
proclaim your own language the superior, practical language of them
all.

Sincerely,
Cheers.
 
J

Jezebel

Awww c'mon, people. Didn't they make you read Hamlet at school?

Murder most foul (etc), and the thousand natural shocks that flesh is heir
to?
 
J

JGM

--
_________________________________________

Jean-Guy Marcil
(e-mail address removed)
 
H

Helmut Weber

Hi Bruce,
the number of words, if we knew what a word was,
would be unlimited in every language,
if we could define "language", which is impossible, too.
Similar to string, where stringC = stringA & stringB and,
given there is no longest string, makes the number
of possible strings unlimited.
Except, if one refers, let's say to written Latin
between A.D. 0 and 100, where all "words" can be listed.
English is in no way more practical or easier to learn
or more difficult to learn than any other language.
Depends on where you start from. Chinese can't be so
difficult with so many speakers getting along with it
very well.
"Word" itself is a fiction, a pretheoretical assumption,
which makes some definitions of rules easier in some
contexts, and more difficult in others.
In ancient Greek, there is no concept of "word".
The illusion of "words" results from ways of writing
(spacing), which are much younger than the Greek texts.
In Greek, "ónoma" means "utterance, act of speaking",
"word", as we understand it, was unknown then.
Greetings from Bavaria, Germany
Helmut Weber
 
J

JGM

Sorry for the triple post earlier in the thread....

I have no idea what happened there... It looks as if I replied to my own reply AND
replied to a post without writing anything... Not likely. Another one to classify
as an "unknown MS mystery"!

And just to add to Helmut's comments... you do not have to go back to the
Greeks. Today, in Thailand for example, sentences are not written with each
word neatly spaced out; "words" are all put together according to the idea
being expressed. So a written text consists of a chain of characters grouped
according to a main idea and punctuation does not exist...

Cheers.

--
_________________________________________

Jean-Guy Marcil
(e-mail address removed)

 
J

Jezebel

JGM said:
And just to add to Helmut's comments... you do not have to go back to the
Greeks. Today, in Thailand for example, sentences are not written with each
word neatly spaced out; "words" are all put together according to the idea
being expressed. So a written text consists of a chain of characters grouped
according to a main idea and punctuation does not exist...

That's a long way short of invalidating the concept of 'word' or 'language'.
Don't confuse orthography with language. Thai has words, just as much as any
other language. It's true that there's no single definition of 'word' that
works for all languages, but there's never any problem defining 'word' in
any given language. The fact that the set of words is unbounded is
irrelevant. Helmut's argument is like saying we can't define 'number'
because our digits can be arranged in an infinite variety of ways.

If you really want some fun, look at the seriously fusional languages
(Greenlandic is the textbook example) -- they make German words look short.
 
J

JGM

Hi Jez,

Far be it from me to try to invalidate the concepts of word and
language. I agree with you that those concepts are clearly defined within a
language.
I did not say that Thai has no words, I just pointed out that when Thai is
written, you cannot distinguish one word from another because all the words
needed to express an idea are stuck together in a single chain of
characters, until you get to the next idea where another chain starts, a bit
like the concept of "sentence" in English. So, if you do not read Thai you
cannot readily identify single words by looking at a text, and since Thai
also places vowel characters before, after, under and above the consonants,
it is difficult to say where a word starts and finishes. Of course, if you
can read Thai you will identify the words easily. As a contrast, I do not
speak Spanish, but in a Spanish text I can tell you with 100% accuracy where
a word starts and where another finishes. So, my example was just to show
that even today there are written forms that are similar to ancient Greek in
that there are no obvious words in a written text.

Fusional languages are actually languages like Latin, French or Arabic,
where grammatical meanings are expressed through suffixes (sometimes
prefixes) - as in "aimais" (I loved) where the suffix "ais" tells you that it
is first person singular and past tense. From what you say, Greenlandic must
be an agglutinative language in which words are built up from long sequences
of units with each unit expressing a particular meaning, so you could have a
word that ends up being a root word plus 6 units. Such languages are very
common; some native South American languages even have infixes as well as
prefixes and suffixes. That being said, most modern languages show some form
of fusion or agglutination. We just use these two concepts to label a
language's general tendency. But that does not mean that in written form all
the words are slapped together to form a single chain of characters. Do
Greenlandic texts consist of long chains of characters because words are put
together without spaces? Or does it just happen that words can be very long
because they use lots of prefixes and suffixes, like Welsh, which has super
long words? English is at its most basic an isolating language where each
word is invariable and grammar is expressed by adding new words, like
Chinese, as in "The boy will ask the girl" or "The girl must ask the boy".
But because it is the all time champion in borrowing, it is also fusional
and agglutinative: "The big-gest boy-s have be-en ask-ing" is fusional and
"anti-dis-establish-ment-arian-ism" is agglutinative.

Languages are fun to explore, and if you see or hear something that you find
weird in one language (infixes in some languages are weird to me), I
guarantee you that something weirder is waiting around the corner (the
clicking consonant sounds in Xhosa are even weirder to me!)

Good day!
 
H

Helmut Weber

Hi Jezebel,

Jezebel said:
That's a long way short of invalidating the concept
of 'word' or 'language'.

It certainly is. Even fuzzy concepts work.
Though fuzzy they are and fuzzy they stay.

Jezebel said:
The fact that the set of words is unbounded is
irrelevant. Helmut's argument is like saying we can't
define 'number' because our digits can be arranged in
an infinite variety of ways.

The argument of unboundedness was to show that attempts
at counting the number of words are in vain.
It had nothing to do with definitions of words.
Bruce meant that English had more words than
other languages. Which is not so far from the truth
either in a way, as Bruce probably thought of English
as a dissociative language (mouth - oral etc.).
Greetings from Bavaria, Germany
Helmut Weber
Have a nice day!
 
W

Word Heretic

G'day JGM,

The original discussion was defining nouns, as a subset of words.

Whilst Thai (my wife is Thai) suffers from the unpopular ignominy of
lacking spatial separation twixt words, the character groups
themselves do not provide a separate definition for the dictionary nor
their understanding. Thus you raise a red herring, it is merely
English and other popular conventions that separates words with
spaces, words still exist outright unto themselves.

To pursue this argument requires the reading of something far
preceding mere Hamlet, a soliloquy of self-annihilation, a
Shakespearean Star Trek so to speak, the words of our dear teachers
Plato and Socrates.

They pursued the definition of things by subdivision of capabilities
and measurements. They carried on the oral tradition of dividing
things in this world into actions, objects and descriptives. Something
gets done to something somehow. We describe energy and mass relatively
instinctively, yet it took us how many thousand years to equate them?
Let alone, my initial heresy is finally stated as dogma: E=mc^2 is
NOT numerically correct. Conceptually yes. Numerically, no. It's
thrown a few equations out.

Taken as a specific generality (I love oxymorons), any word is only
recognisable referring back to a dictionary and then taken in context
in accordance with its inter-sentence juxtapositioning. There have
been several attempts at 'open-source' dictionaries, they all suck
through complete lack of basic vocab. I could rant about the OS
projects and their deliverance, with several _choice_ examples from a
few different genres but don't want to add napalm to an already
burning bridge or sixty ;-)

This broad definition even covers the raw Chinese combination of word
elements into a new character. It is merely our English / Thai etc
convention that states these symbols need follow each other linearly
in order to form a word.

Enough, I need another pull at my Shiraz and my wife's on the phone
:)




Steve Hudson

 
B

Bruce Brown

A half second after clicking "Post message" I remembered that
Lars-Eric and Jose (Thanks, from Spain) and others from these threads
have names with accent marks which, far from being silly, are thereby
made sacred. My apologies -- long live your names and your
languages.

(Still, I'm glad we don't have to deal with accent marks in English.
Makes typing easier, if nothing else.)

Truth is, don't we tend to think of languages, countries and cultures
the way we think of mother, that is to say, everyone tends to be
convinced his or hers is just about the best imaginable? This is part
of human nature, and I wouldn't bet on its changing anytime soon.

Now, by all means, on with the linguists' festival. Have a fricative,
Freddie. Don't look now, but Gloria is displaying her reduplicative
compounds. Really, that woman shouldn't drink and post . . .
 
J

Jezebel

Bruce Brown said:
Now, by all means, on with the linguists' festival. Have a fricative,
Freddie. Don't look now, but Gloria is displaying her reduplicative
compounds. Really, that woman shouldn't drink and post . . .

Worst of all, English has no bilabial fricatives. Not in polite company,
anyway.
 
