Multilingual Web retrieval: An experiment in English–Chinese business intelligence
Jialun Qin
Department of Management Information Systems, The University of Arizona, Tucson, AZ 85721
Yilu Zhou
Department of Management Information Systems, The University of Arizona, Tucson, AZ 85721
Michael Chau
School of Business, The University of Hong Kong, Hong Kong, People's Republic of China
Hsinchun Chen
Department of Management Information Systems, The University of Arizona, Tucson, AZ 85721
Abstract
As increasing numbers of non-English resources have become available on the Web, the interesting and important issue of how Web users can retrieve documents in different languages has arisen. Cross-language information retrieval (CLIR), the study of retrieving information in one language by queries expressed in another language, is a promising approach to the problem. CLIR has attracted much attention in recent years. Most research systems have achieved satisfactory performance on standard Text REtrieval Conference (TREC) collections such as news articles, but CLIR techniques have not been widely studied and evaluated for applications such as Web portals. In this article, the authors present their research in developing and evaluating a multilingual English–Chinese Web portal that incorporates various CLIR techniques for use in the business domain. A dictionary-based approach was adopted that combines phrasal translation, co-occurrence analysis, and pre- and posttranslation query expansion. The portal was evaluated by domain experts, using a set of queries in both English and Chinese. The experimental results showed that co-occurrence-based phrasal translation achieved a 74.6% improvement in precision over simple word-by-word translation. When used together, pre- and posttranslation query expansion improved the performance slightly, achieving a 78.0% improvement over the baseline word-by-word translation approach. In general, applying CLIR techniques in Web applications shows promise.
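The general idea behind co-occurrence-based translation selection can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dictionary, the five-document corpus, the document-count scoring, and the function names `word_by_word` and `translate_query` are all assumptions made for illustration. The baseline here simply takes the first dictionary entry for each term, while the co-occurrence variant picks the combination of candidate translations that appears together most often in a target-language collection.

```python
from itertools import product

# Hypothetical toy bilingual dictionary: each English term maps to
# candidate Chinese translations (order is arbitrary, as in a real lexicon).
DICTIONARY = {
    "bank": ["银行", "河岸"],      # financial bank, river bank
    "interest": ["兴趣", "利息"],  # hobby/interest, financial interest
}

# Hypothetical target-language corpus; each document is a set of segmented terms.
CORPUS = [
    {"银行", "利息", "贷款"},
    {"银行", "利息", "存款"},
    {"河岸", "风景"},
    {"兴趣", "爱好"},
    {"银行", "兴趣"},
]


def cooccurrence(a, b, corpus):
    """Number of documents that contain both terms."""
    return sum(1 for doc in corpus if a in doc and b in doc)


def word_by_word(terms, dictionary):
    """Baseline (assumed form): take the first dictionary entry for each term."""
    return [dictionary.get(t, [t])[0] for t in terms]


def translate_query(terms, dictionary, corpus):
    """Choose one translation per term so that total pairwise co-occurrence is maximal."""
    candidates = [dictionary.get(t, [t]) for t in terms]
    best, best_score = None, -1
    for combo in product(*candidates):
        score = sum(
            cooccurrence(combo[i], combo[j], corpus)
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
        )
        if score > best_score:
            best, best_score = list(combo), score
    return best


query = ["bank", "interest"]
print(word_by_word(query, DICTIONARY))
print(translate_query(query, DICTIONARY, CORPUS))
```

On this toy data the baseline keeps the hobby sense of "interest", whereas the co-occurrence variant selects the financial sense because it appears alongside the chosen translation of "bank" in more documents. Exhaustive enumeration of combinations is only practical for short queries; a real system would need a cheaper search and a better-normalized association measure.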