Quick-Motif: An Efficient and Scalable Framework for Exact Motif Discovery

Yuhong Li, Leong Hou U, Man Lung YIU and Zhiguo Gong

This is a supporting page for our paper about motif discovery, the paper is here.

Abstract: Discovering motifs in sequence databases has been receiving abundant attentions from both database and data mining communities, where the motif is the most correlated pair of subsequences in a sequence object. Motif discovery is expensive for emerging applications which may have very long sequences (e.g., million observations per sequence) or the queries arrive rapidly (e.g., per 10 seconds). Prior works cannot reduce the correlation computation cost and prune subsequence pairs at the same time, as these two techniques require different orderings on examining subsequence pairs. In this work, we propose a novel framework named Quick-Motif which adopts a two-level approach to enable batch pruning at the outer level and enable fast correlation calculation at the inner level. We further propose two optimization techniques for the outer and the inner level. In our experimental study, our method is up to 3 orders of magnitude faster than the state-of-the-art methods.

Overview of the proposed framework (i.e., Quick-Motifs) for motif discovery:

Codes: The codes will be available very soon !!!
Executables: It was complied using Visual Studio 2013, for more details about how the executables work please check readme inside this folder. [Download]
Note: If you download the executables before August 21 (2014), you may need to download the Visual C++ Redistributable Packages for Visual Studio 2013 from here, and select vcredist_x86.exe in order to run our executables!!!
Datasets: Please check the detail description for each dataset in our paper. In this page, we provide the formatted csv file.

EEG: a sequence of length [180,204],original sequence was came from here
ECG: a sequence of length [144,002],original sequence was came from here
EPG: a sequence of length [106,950],original sequence was came from here
TAO: [374071], [642423], [654217], [661009], [671056], [647786], [710976], [715155], [720261], [741528],
The authors would like to thank Prof. Eamonn Keogh for collecting and publishing the UCR time series data sets, as well as Dr. Abdullah Mueen for providing the source code of MK. Also, the authors would like to thank TAO Project Office for publishing the TAO dataset. Lastly, thank the anonymous reviewers for their valuable comments that improved our paper dramatically.
This project was supported by grant MYRG2014-00106-FST and MYRG105-FST13-GZG from University of Macau RC, grant FDCT/106/2012/A3 from Macau FDCT, and grant GRF 152201/14E from Hong Kong RGC.
Contact: Please feel free to email the authors for any question. Email: yb27407 AT umac.mo

*** End ***