In this example, the Spanish source sentence
"Haga clic en la ficha" and the matching English target
sentence "Click the tab" are parsed, using the Morphology,
Sketch, Portrait, and Logical Form components, into their respective
source and target Logical Forms (LFs). Both LFs undergo statistical
processing to identify word associations (e.g., "ficha"
and "tab") and to determine the "alignment" of their structures.
"This is done for 350,000 sentence pairs in English and Spanish,
applying both heuristics (rules) and statistics to find bits of structural
alignment across the language boundary. Most of the MT work has gone
into the alignment phase, figuring out which bits across that language
boundary should align up and what context you need to save,"
says Dolan.
Rules help the system learn the appropriate context, narrowing the
search. A probability is attached to each correspondence, or "mapping",
for use during runtime. Finally, these transfer mappings are stored
in the MindNet repository.
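As a rough illustration of attaching a probability to each mapping, the sketch below estimates how often a source fragment was aligned to a given target fragment, using simple relative frequency. The article does not describe how the NLP Group actually computes these probabilities, so the function name, the data layout, and the toy alignments are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_mapping_probabilities(aligned_pairs):
    """Estimate P(target | source) by relative frequency.

    `aligned_pairs` is a list of (source_fragment, target_fragment) tuples,
    standing in for alignments found during training. This is an assumed,
    simplified stand-in, not the method used in the real system.
    """
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)

    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        table[src][tgt] = count / source_counts[src]
    return dict(table)


# Toy alignments (invented for illustration only).
aligned = [
    ("ficha", "tab"),
    ("ficha", "tab"),
    ("ficha", "tab"),
    ("ficha", "card"),
    ("hacer clic", "click"),
]

mappings = estimate_mapping_probabilities(aligned)

# At runtime, a translator could prefer the most probable stored mapping.
best = max(mappings["ficha"], key=mappings["ficha"].get)
```

With this toy data, "ficha" maps to "tab" three times out of four, so "tab" would be preferred over "card" at runtime.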
"We thought it would take us a lot longer to make progress
on machine translation. It has come together pretty fast," says
Dolan. To speed development, the research arm of Microsoft took a
page from the product side, creating nightly NLPWin builds to provide
regular feedback on progress.
Helping speed this process along, the researchers "use
a huge cluster of 30 computers to retrain the system and rerun a regression
test every night," says Richardson. This produces a new NLPWin
build each day, as well as a newly updated version of MindNet.
Consequently, the NLP Group sees the
impact of the previous day's work on the MT system's effectiveness,
making it simpler to recognize positive and negative changes to the
code and to fix the latter.
Another longtime obstacle to progress in natural language processing
is the lack of an objective means to accurately measure advancements.
The typical metric, having humans judge the accuracy of a machine
translation, makes the process inherently subjective. The NLP Group
developed a more objective testing metric, which compares how close
the MT system comes to matching an ideal sentence translation. By
minimizing the role human judgment plays in determining MT improvements,
this approach makes evaluation more quantifiable, as has been the case
in speech recognition for decades.
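The article does not say which metric the NLP Group developed. As one possible illustration of scoring a translation against an ideal sentence without human judges, the sketch below uses word-level edit distance divided by the reference length, the same idea as the word error rate long used in speech recognition. The function names are hypothetical, and this is an assumed stand-in rather than the group's actual metric.

```python
def word_edit_distance(hypothesis, reference):
    """Count the word insertions, deletions, and substitutions needed
    to turn the hypothesis into the reference (Levenshtein distance)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()

    # prev[j] holds the distance between the hypothesis words seen so far
    # and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a hypothesis word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(hypothesis, reference):
    """Edit distance normalized by reference length; 0.0 is an exact match."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len


exact = error_rate("Click the tab", "Click the tab")
rough = error_rate("Click on the card", "Click the tab")
```

A lower score means the system output is closer to the ideal translation, so scores from successive nightly builds can be compared directly.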
Runtime
Figure 4 illustrates how the MT system works during runtime. In
this example, a Spanish source sentence is parsed by NLPWin into
its source LF. The next stage, MindMeld, refers to a highly sophisticated
process that has consumed the NLP Group's research efforts since
1997. "MindMelding takes a sentence and matches it to the closest
conceptual relationship in a MindNet," says Dolan.
This is essentially a graph-matching process, which takes
an input sentence LF and attempts to match it against one or more
subgraphs in MindNet. For instance, if the Spanish source LF is uncomplicated,
it might exactly match an English target LF in MindNet. Typically,
the match requires