In this example, the Spanish source
sentence "Haga clic en la ficha" and the matching English
target sentence "Click the tab" are parsed, using the
Morphology, Sketch, Portrait, and Logical Form components, into
their respective source and target Logical Forms (LFs). Both LFs
undergo statistical processing to identify word associations (e.g.,
"ficha" and "tab") and "alignment"
of their structures.
"This is done for 350,000 sentence pairs in English and
Spanish, applying both heuristics (rules) and statistics to find
bits of structural alignment across the language boundary. Most
of the MT work has gone into the alignment phase, figuring out which
bits across that language boundary should align up and what context
you need to save," says Dolan.
Rules help the system learn the appropriate context,
narrowing the search. A probability is attached to each correspondence,
or "mapping," for use during runtime. Finally, these transfer
mappings are stored in the MindNet repository.
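As a rough illustration of how word associations such as "ficha" and "tab" can fall out of paired sentences, the sketch below scores every source/target word pair by how often the two words appear in matching sentences. It makes several assumptions that are not from the NLP Group's system: it works on surface words rather than Logical Forms, it uses the Dice coefficient as a stand-in association score rather than the probabilities the group attaches to mappings, and only the first sentence pair comes from the example above (the other two are invented for illustration). The names `association_scores` and `best_mapping` are hypothetical.

```python
from collections import Counter
from itertools import product

# First pair is the article's example; the other two are made up for this sketch.
PAIRS = [
    ("Haga clic en la ficha", "Click the tab"),
    ("Cierre la ficha", "Close the tab"),
    ("Haga clic en el botón", "Click the button"),
]

def association_scores(sentence_pairs):
    """Score each (source word, target word) pair with the Dice coefficient."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for source, target in sentence_pairs:
        src_words = set(source.lower().split())
        tgt_words = set(target.lower().split())
        src_counts.update(src_words)   # sentences containing each source word
        tgt_counts.update(tgt_words)   # sentences containing each target word
        pair_counts.update(product(src_words, tgt_words))  # co-occurrences
    return {
        (s, t): 2 * n / (src_counts[s] + tgt_counts[t])
        for (s, t), n in pair_counts.items()
    }

def best_mapping(scores, source_word):
    """Return the highest-scoring (target word, score) for a source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == source_word}
    return max(candidates.items(), key=lambda item: item[1])

if __name__ == "__main__":
    scores = association_scores(PAIRS)
    print(best_mapping(scores, "ficha"))
```

On this toy data "tab" outscores "the" for "ficha" because "the" also appears in a sentence without "ficha". A real system would need far more data, plus the structural context described above, to separate such candidates reliably.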
"We thought it would take us a lot longer to make
progress on machine translation. It has come together pretty fast,"
says Dolan. To speed development, the research arm of Microsoft
took a page from the product side by creating nightly NLPWin builds
to provide feedback on progress.
Helping speed this process along, the researchers
"use a huge cluster of 30 computers to retrain the system
and rerun a regression test every night," says Richardson.
This produces a new NLPWin build each day, as well as a newly updated
version of MindNet.
Consequently, the NLP Group sees the impact of the
previous day's work on the MT system's effectiveness, making it
simpler to recognize positive and negative changes to the code,
and to fix the latter.
Another longtime obstacle to progress in natural language
processing is the lack of an objective means to accurately measure
advancements. The typical metric, having humans judge the accuracy
of a machine translation, makes the process inherently subjective.
The NLP Group developed a more objective testing metric, which compares
how close the MT system comes to matching an ideal sentence translation.
By minimizing the role human judgment plays in determining MT improvements,
this approach makes evaluation more quantifiable, as has been the case
in speech recognition for decades.
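The article does not say how the NLP Group's metric is computed. As one possible reading of "how close the output comes to an ideal translation," the sketch below borrows word error rate, the standard measure in speech recognition: the number of word insertions, deletions, and substitutions needed to turn the system output into the reference, divided by the reference length. The function name `word_error_rate` and the sample sentences are assumptions made for illustration, not the group's actual method.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance from hypothesis to reference, per reference word."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference translation must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output
                curr[j - 1] + 1,     # extra word in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

if __name__ == "__main__":
    ideal = "Click the tab"
    for output in ("Click the tab", "Click on the tab", "Click the card"):
        print(output, round(word_error_rate(ideal, output), 3))
```

A score of 0 means the output matches the reference exactly, and higher scores mean more edits are needed. Because the score depends only on the two strings, rerunning it every night gives the same number for the same output, which is what makes day-to-day comparisons meaningful.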
Runtime
Figure 4 illustrates how the MT system works during
runtime. In this example, a Spanish source sentence is parsed by
NLPWin into its source LF. The next stage, MindMeld, refers to a
highly sophisticated process that has consumed the NLP Group's research
efforts since 1997. "MindMelding takes a sentence and matches
it to the closest conceptual relationship in a MindNet," says
Dolan.
This is essentially a graph matching process, which takes
an input sentence LF and attempts to match it against one or more
subgraphs in MindNet. For instance, if the Spanish source LF is
uncomplicated, it might exactly match an English target LF in MindNet.
Typically, the match requires