An extension of the FFT-based algorithm for the match-count problem to weighted scores

Kensuke Baba (Corresponding Author, Non-member)

Artificial Intelligence Laboratory, Fujitsu Laboratories Ltd, Kawasaki, 211-8588 Japan

Correspondence to: Kensuke Baba. E-mail: [email protected]

First published: 08 December 2017

https://doi.org/10.1002/tee.22554

Abstract

The match-count problem on strings is the basic problem of counting the character matches between two strings for every possible alignment. For two strings of lengths m and n (m ≤ n) over an alphabet of size σ, the problem is classically solved in O(σn log m) time using the fast Fourier transform (FFT). This paper extends this FFT-based algorithm to a weighted version of the problem, which computes the sum of similarities between characters instead of the number of matches. By mapping characters to numerical vectors of dimensionality d, the extended algorithm solves the weighted match-count problem in O(dn log m) time. This paper also evaluates the usefulness of the extended algorithm by applying it to plagiarism detection in documents. The experimental results show that the proposed algorithm is applicable to general vector representations of words and that the resulting plagiarism detection method greatly reduces processing time, with only a slight decrease in accuracy relative to the method based on the standard match-count problem.
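The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the similarity between two characters is taken to be the dot product of their vectors, only alignments in which the shorter string lies entirely inside the longer one are scored, and the function names (`fft`, `cross_correlate`, `weighted_match_count`) and the `embed` dictionary are hypothetical. For brevity the sketch transforms the whole text at once, which costs O(dn log n); reaching O(dn log m) would additionally require splitting the text into blocks of length proportional to m.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT. len(a) must be a power of two.
    With invert=True the result is unnormalized (caller divides by len)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def cross_correlate(x, y):
    """Return c[i] = sum_j x[j] * y[i + j] for i = 0 .. len(y) - len(x)."""
    m, n = len(x), len(y)
    size = 1
    while size < m + n:  # pad so the circular convolution equals the linear one
        size *= 2
    fx = fft([complex(v) for v in reversed(x)] + [0j] * (size - m))
    fy = fft([complex(v) for v in y] + [0j] * (size - n))
    prod = fft([a * b for a, b in zip(fx, fy)], invert=True)
    # Convolving reversed x with y puts alignment i at index m - 1 + i.
    return [prod[m - 1 + i].real / size for i in range(n - m + 1)]


def weighted_match_count(p, t, embed):
    """Score every alignment i by sum_j dot(embed[p[j]], embed[t[i + j]]).
    Assumption: similarity is the dot product of the characters' vectors."""
    d = len(next(iter(embed.values())))
    scores = [0.0] * (len(t) - len(p) + 1)
    for k in range(d):  # one FFT-based correlation per vector dimension
        c = cross_correlate([embed[ch][k] for ch in p],
                            [embed[ch][k] for ch in t])
        scores = [s + v for s, v in zip(scores, c)]
    return scores


# One-hot vectors (d = alphabet size) recover the unweighted match count.
onehot = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
print([round(v) for v in weighted_match_count("ab", "abab", onehot)])  # [2, 0, 2]
```

With one-hot vectors the dimensionality d equals the alphabet size σ, which is why the unweighted problem appears as a special case. Replacing the one-hot table with dense vectors of small d (for example, word embeddings when the "characters" are words) yields weighted scores with fewer correlations per alignment.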

Volume 12, Issue S2, December 2017, Pages S97-S100
