Qun Liu†‡, Baobao Chang†, Weidong Zhan†, Qian Zhou#
† Institute of Computational Linguistics, Peking University
‡ Institute of Computing Technology, Chinese Academy of Sciences
# State Key Laboratory of Intelligent Technology and System, Tsinghua University
liuqun@ict.ac.cn, {chbb,zhwd}@pku.edu.cn, zhouq@s1000e.cs.tsinghua.edu.cn
Abstract: This paper introduces a news-oriented Chinese-English machine translation system. First, general information about the system is given. Second, the architecture of the system, the micro-engine pipeline, is introduced. Third, the main micro-engines in the system are discussed, including the micro-engines for Chinese lexical analysis, the example-based micro-engine, and the rule-based micro-engine. Fourth, the linguistic resources used in the system are described. Finally, we give our conclusions and outline future work.
Keywords: Multi-Engine Machine Translation, Chinese Information Processing
A News-oriented Chinese-English Machine Translation System (面向新闻领域的汉英机器翻译系统) [1]
The News-oriented Chinese-English Machine Translation System is a project supported by the National Key Fundamental Research Schema. It is undertaken jointly by the Institute of Computational Linguistics of Peking University, the Institute of Computing Technology of the Chinese Academy of Sciences, and the State Key Laboratory of Intelligent Technology and System of Tsinghua University.
The leader of the project is Prof. Yu Shiwen. The project started in January 2000. The system adopts a multi-strategy approach, which we call the micro-engine pipeline approach. The first version of the system is now complete. In this version the system has a dictionary of about 50,000 Chinese words and can translate 1,000 words per minute on a computer with an Intel Pentium III 500 CPU.
Different MT systems use different approaches, such as rule-based, statistics-based, example-based, transfer-based, and interlingua-based approaches. Each approach has its advantages and disadvantages. Many researchers have realized that a hybrid approach can achieve better results than any single approach, and many systems have adopted a multi-engine approach. However, there are many different ways to integrate several MT engines into one system. [Frederking and Nirenburg, 1994] proposed a typical multi-engine MT design, in which every MT engine translates the input sentence in parallel and puts its translations of the source sentence segments, each with a score, into a chart-like structure according to the position of the source constituent. The system then uses a dynamic programming algorithm to find the best combination of the translations. [Zhang and Choi, 1999] use different engines in different phases of a transfer-based MT system: rule-based and statistics-based engines in the analysis phase, pattern-based and statistics-based engines in the transfer phase, and a rule-based engine in the synthesis phase.
In our system, we propose a general-purpose multi-engine MT architecture, the Micro-Engine Pipeline, which is defined as follows:
• A micro-engine pipeline consists of an array of micro-engines;
• All the micro-engines share a chart data structure;
• There are two kinds of micro-engine: recognizers and selectors;
• A recognizer should implement two functions:
1) Recognize: to produce a new constituent (edge) from the existing constituents in the chart;
2) Translate: to translate a constituent produced by itself; the translate function may call the translate functions of the recognizers that produced the children of the constituent being translated.
• A selector should implement one function:
1) Select: to select the best sequences of constituents on the chart;
• The translation algorithm:
1) Assign the first micro-engine in the micro-engine pipeline to TheEngine;
2) If TheEngine is NULL, the translation fails; return NULL;
3) If TheEngine is a recognizer, call its recognize() function repeatedly until a constituent covering the whole sentence is produced or no new constituent is produced. In the first case, go to step 6);
4) If TheEngine is a selector, call its select() function and discard all the constituents that do not appear in the returned sequences;
5) Assign the next micro-engine in the micro-engine pipeline to TheEngine and go to step 2);
6) Assign the constituent that covers the whole sentence to TheRoot;
7) Call the translate() function of TheEngine to translate TheRoot;
8) The translation succeeds; return the target sentence.
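The eight steps above can be sketched in Python as follows. This is only an illustrative sketch, not the system's actual code: the class names Constituent, Chart, Recognizer and Selector, the function run_pipeline, and the two toy engines ToyLexicon and ToyJoiner are hypothetical names introduced here, and we assume that recognize() adds at most one constituent per call and reports whether it added one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Constituent:
    start: int                                   # first covered token
    end: int                                     # one past the last covered token
    label: str
    producer: Optional["Recognizer"] = None      # engine that built this edge
    children: List["Constituent"] = field(default_factory=list)


class Chart:
    """Chart shared by all micro-engines (hypothetical minimal version)."""
    def __init__(self, length: int):
        self.length = length
        self.edges: List[Constituent] = []

    def spanning(self) -> Optional[Constituent]:
        """Return a constituent covering the whole sentence, if any."""
        return next((e for e in self.edges
                     if e.start == 0 and e.end == self.length), None)


class Recognizer:
    def recognize(self, chart: Chart) -> bool:
        """Try to add one new constituent; return True if one was added."""
        raise NotImplementedError

    def translate(self, c: Constituent) -> str:
        raise NotImplementedError


class Selector:
    def select(self, chart: Chart) -> List[List[Constituent]]:
        """Return the best sequences of constituents on the chart."""
        raise NotImplementedError


def run_pipeline(pipeline, chart: Chart) -> Optional[str]:
    for engine in pipeline:                        # steps 1 and 5
        if isinstance(engine, Recognizer):         # step 3
            while engine.recognize(chart):
                root = chart.spanning()            # step 6
                if root is not None:
                    return engine.translate(root)  # steps 7 and 8
        elif isinstance(engine, Selector):         # step 4
            keep = {id(c) for seq in engine.select(chart) for c in seq}
            chart.edges = [c for c in chart.edges if id(c) in keep]
    return None                                    # step 2: no engine left


class ToyLexicon(Recognizer):
    """Toy recognizer (assumption): one edge per token, glossed from a dict."""
    def __init__(self, tokens, glossary):
        self.tokens, self.glossary, self.pos = tokens, glossary, 0

    def recognize(self, chart):
        if self.pos >= len(self.tokens):
            return False
        i, self.pos = self.pos, self.pos + 1
        chart.edges.append(Constituent(i, i + 1, self.tokens[i], self))
        return True

    def translate(self, c):
        return self.glossary.get(c.label, c.label)


class ToyJoiner(Recognizer):
    """Toy recognizer (assumption): joins all word edges into one root."""
    def recognize(self, chart):
        if chart.spanning() is not None:
            return False
        kids = sorted(chart.edges, key=lambda e: e.start)
        chart.edges.append(Constituent(0, chart.length, "S", self, kids))
        return True

    def translate(self, c):
        # Delegate to the recognizer that produced each child constituent.
        return " ".join(k.producer.translate(k) for k in c.children)


tokens = ["我", "读", "书"]
glossary = {"我": "I", "读": "read", "书": "books"}
print(run_pipeline([ToyLexicon(tokens, glossary), ToyJoiner()], Chart(len(tokens))))
```

Note that adding another engine only means adding another element to the list passed to run_pipeline; the loop itself does not change.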
The actual micro-engine pipeline in our Chinese-English MT system consists of the following micro-engines:
1) Example-based Exact-matching Recognizer: searches the bilingual corpus for an exactly matching example and returns its translation directly;
2) Dictionary Lookup Recognizer: looks up the dictionary and handles overlapping Chinese words. Since there are no spaces between Chinese words, all the possible words found in the dictionary are added to the chart. This process is called full segmentation;
3) Named Entity Recognizer: recognizes named entities that are not recorded in the dictionary, such as Chinese personal names, Chinese place names, foreign names, and numbers;
4) Segmentation and POS Tagging Selector: performs Chinese word segmentation and POS tagging using a hybrid of rule-based and statistics-based approaches, which we will describe in other papers;
5) Example-based Fuzzy-matching Recognizer: searches the bilingual corpus for a fuzzily matching example and translates the sentence by an example-based approach. The bilingual corpus used here is sentence-aligned. This recognizer is still under construction;
6) Rule-based Recognizer: a traditional rule-based MT engine, using an LFG-like grammar and a chart parser that supports unification. The translate function of this engine also adopts a rule-based approach;
7) Fail-soft Recognizer: if none of the above recognizers produces a constituent covering the whole sentence, the fail-soft recognizer tries to find the best sequence of existing constituents and produces a new constituent with all the constituents in the best sequence as its children. The algorithm for finding the best sequence is similar to the “chart walk” algorithm introduced in [Frederking and Nirenburg, 1994].
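Two of the steps in this list, full segmentation and the search for a best sequence of constituents, can be illustrated with a short Python sketch. The function names full_segmentation and best_sequence, the max_len limit, and the toy lexicon are assumptions made for illustration. In particular, the paper does not state how sequences are scored, so the sketch simply prefers the sequence with the fewest constituents (keeping the first one found on a tie) as a stand-in for a real score.

```python
from typing import List, Optional, Set, Tuple

Edge = Tuple[int, int, str]   # (start, end, word); end is exclusive


def full_segmentation(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[Edge]:
    """Add every dictionary word found at every position (overlaps included)."""
    edges = []
    for start in range(len(sentence)):
        last = min(len(sentence), start + max_len)
        for end in range(start + 1, last + 1):
            word = sentence[start:end]
            if word in lexicon:
                edges.append((start, end, word))
    return edges


def best_sequence(edges: List[Edge], length: int) -> Optional[List[Edge]]:
    """Dynamic programming over positions: fewest edges covering 0..length.

    Returns None when the edges leave a gap somewhere in the sentence.
    """
    best = {0: []}            # position -> shortest edge list that reaches it
    for pos in range(length):
        if pos not in best:
            continue
        for edge in edges:
            start, end, _ = edge
            if start == pos and (end not in best
                                 or len(best[pos]) + 1 < len(best[end])):
                best[end] = best[pos] + [edge]
    return best.get(length)


lexicon = {"研究", "研究生", "生命", "生", "命"}   # toy lexicon (assumption)
chart_edges = full_segmentation("研究生命", lexicon)
print(chart_edges)
print(best_sequence(chart_edges, len("研究生命")))
```

For this input the chart contains both 研究 and the overlapping 研究生, which is exactly the ambiguity that a later selector or the fail-soft step has to resolve.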
In the micro-engine pipeline approach, each micro-engine uses its own resources. For example, the Named Entity Recognizer needs dictionaries of Chinese personal names, Chinese place names, and foreign names, and the Rule-based Recognizer needs rulebases. We will not explain all these resources in detail here; we introduce only the dictionary and the corpus.
The system uses a core dictionary and an extended dictionary. The core dictionary contains about 50,000 Chinese words and carries rich information. The grammatical information for the Chinese words in this dictionary is mainly extracted from “The Grammatical Knowledge-base of Contemporary Chinese” [Shiwen Yu, et al, 1998], and the semantic information is based on an ontology and semantic frame schema, which was introduced in [Hui Wang, et al, 1998]. The extended dictionary contains about 200,000 Chinese words. Its information is rather simple: only the Chinese and English word forms and their parts of speech. The extended dictionary is still under construction and has not yet been integrated into the system.
The system uses a monolingual corpus and a bilingual corpus. The monolingual corpus is the “People’s Daily Corpus”, in which the Chinese text is word-segmented and POS-tagged [Shiwen Yu, 2000]. It is currently used to train an HMM for the Segmentation and POS Tagging Selector. The bilingual corpus consists of two parts. One part contains news, Chinese government white papers, and editorials of the Xinhua News Agency, about 40,000 sentence pairs in total. The other part contains about 180,000 sentence pairs collected from various sources. The whole bilingual corpus is sentence-aligned and is used by the Example-based Exact-matching Recognizer and the Example-based Fuzzy-matching Recognizer. We are now carrying out further chunk-level alignment on the first part of the bilingual corpus.
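As a rough illustration of how a sentence-aligned corpus can serve both example-based recognizers, the Python sketch below indexes sentence pairs for exact lookup and falls back to a string-similarity search. The function names, the two sentence pairs, and the 0.8 cutoff are all hypothetical. The paper does not describe its fuzzy-matching measure (that recognizer is still under construction), so the standard-library difflib similarity is used here purely as a stand-in.

```python
import difflib
from typing import Dict, Iterable, Optional, Tuple


def build_index(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each source sentence to its translation (the first pair wins)."""
    index: Dict[str, str] = {}
    for source, target in pairs:
        index.setdefault(source.strip(), target.strip())
    return index


def exact_match(index: Dict[str, str], sentence: str) -> Optional[str]:
    """Return the stored translation if the sentence is in the corpus."""
    return index.get(sentence.strip())


def fuzzy_match(index: Dict[str, str], sentence: str,
                cutoff: float = 0.8) -> Optional[Tuple[str, str]]:
    """Return the most similar example pair, or None if none is close enough."""
    hits = difflib.get_close_matches(sentence.strip(), list(index),
                                     n=1, cutoff=cutoff)
    if not hits:
        return None
    return hits[0], index[hits[0]]


# Toy sentence pairs invented for this sketch, not taken from the corpus.
pairs = [
    ("今天天气很好。", "The weather is fine today."),
    ("今天天气很冷。", "It is very cold today."),
]
index = build_index(pairs)
print(exact_match(index, "今天天气很好。"))
print(fuzzy_match(index, "今天天气真好。"))
```

A fuzzy hit returns the matched example rather than a finished translation; an example-based engine would still have to adapt the example's translation to the differences in the input sentence.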
The News-oriented Chinese-English MT System was developed on the basis of a traditional rule-based MT system [Qun Liu and Shiwen Yu, 1998], which serves as the Rule-based Recognizer in the micro-engine pipeline. We have tested the two systems on a small test set and found that the translations of the new system are slightly better than those of the old system, mainly because the Named Entity Recognizer was added.
Compared with other ways of implementing multi-engine MT, the micro-engine pipeline has several advantages. First, the system is more modular: when we add a new micro-engine to the MT system, we need not change the algorithm of the whole system. Second, a micro-engine can be very specialized; it need not be an all-purpose MT engine. That is, it can adopt a very specific algorithm to deal with a very specific language phenomenon. Third, the relation among the micro-engines is cooperative rather than competitive: the micro-engines are used serially rather than in parallel, so a later micro-engine can make use of the results generated by earlier ones.
Our future work includes:
1) Completing the Example-based Fuzzy-matching Recognizer;
2) Improving the Rule-based Recognizer by adding statistical information to the LFG-like rulebases and by researching WSD (word sense disambiguation) technology to select the best translations of words;
3) Adding a Template-based Recognizer to the system. The translation templates will be acquired automatically from the bilingual corpus and corrected manually [Jian Zhang, 2001];
4) Adding a Chunk Recognizer to the system;
5) Paying more attention to resource construction, especially of the bilingual corpus.
References
[1] Robert Frederking and Sergei Nirenburg. Three Heads are Better than One. In Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), Stuttgart, Germany, 1994.
[2] Robert Frederking et al. Integrating Translations from Multiple Sources with the Pangloss Mark III Machine Translation System. In Proceedings of the First Conference for Machine Translation in the Americas (AMTA), Columbia, Maryland, October 1994.
[3] Vasileios Hatzivassiloglou and Kevin Knight. Unification-Based Glossing. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995.
[4] Manny Rayner and David Carter. Hybrid Processing in the Spoken Language Translator. In Proceedings of ICASSP-97, pages 107-110, Munich, Germany, 1997.
[5] Christopher Hogan and Robert E. Frederking. An Evaluation of Multi-engine MT Architecture. In David Farwell et al., editors, Machine Translation and the Information Soup, pages 113-123, Third Conference of the Association for Machine Translation in the Americas (AMTA), Langhorne, PA, USA, October 1998.
[6] 周明,中-日机器翻译系统J-北京,收录于:黄昌宁,董振东主编,计算语言学文集,第312~319页,清华大学出版社,1999.10
Ming Zhou. J-Beijing Chinese-Japanese Machine Translation System. In Changning Huang and Zhendong Dong, editors, A Collection on Computational Linguistics, pages 312-319, Tsinghua University Press, October 1999.
[7] Qun Liu and Shiwen Yu. TransEasy: A Chinese-English Machine Translation System Based on Hybrid Approach. In David Farwell et al., editors, Machine Translation and the Information Soup, pages 514-517, Third Conference of the Association for Machine Translation in the Americas (AMTA), Langhorne, PA, USA, October 1998.
[8] 刘群,俞士汶,汉英机器翻译的难点分析,International Conference on Chinese Information Processing,黄昌宁主编,1998中文信息处理国际会议论文集,第507-514页,清华大学出版社,1998.11
Qun Liu and Shiwen Yu. Difficulties in Chinese-English Translation. In Proceedings of the International Conference on Chinese Information Processing 1998, pages 507-514, Tsinghua University Press, November 1998.
[9] Qun Liu. A Chinese-English Machine Translation System Based on Micro-Engine Architecture. An International Conference on Translation and Information Technology, Hong Kong, December 2000.
[10] Min Zhang and Key-Sun Choi. Pipelined Multi-engine Machine Translation: Accomplishment of MATES/CK System. In Proceedings of TMI’99, page 228, 1999.
[11] 俞士汶,朱学锋,王慧,张芸芸,现代汉语语法信息词典详解,清华大学出版社,1998
Shiwen Yu, Xuefeng Zhu, Hui Wang, and Yunyun Zhang. The Grammatical Knowledge-base of Contemporary Chinese – A Complete Specification. Tsinghua University Press, 1998.
[12] 王慧,詹卫东,刘群,《现代汉语语义词典》的概要及设计,International Conference on Chinese Information Processing,黄昌宁主编,1998中文信息处理国际会议论文集,第507-514页,清华大学出版社,1998.11
Hui Wang, Weidong Zhan, and Qun Liu. Design and Essentials of the Semantic Knowledge-base of Contemporary Chinese. In Changning Huang, editor, Proceedings of the International Conference on Chinese Information Processing, pages 507-514, Tsinghua University Press, November 1998.
[13] 俞士汶、朱学锋、段慧明,大规模现代汉语标注语料库的加工规范,多语言信息处理国际会议,2000'ICMIP论文集,18-24,2000年8月,新疆乌鲁木齐
Shiwen Yu, Xuefeng Zhu, and Huiming Duan. Specification of Large-scale Modern Chinese Corpus. In Proceedings of ICMIP’2000, pages 18-24, Urumqi, August 2000.
[14] 张健,基于实例的机器翻译的泛化方法研究,中国科学院计算技术研究所硕士论文,2001
Jian Zhang. Research on the Generalization Method of the Example-based Machine Translation. Master’s dissertation, Institute of Computing Technology, Chinese Academy of Sciences, 2001.
[1] This project is supported by the National Key Fundamental Research Schema. The project number is G1998030507-4. Many thanks to Prof. Sun Maosong for his support of this project.