Dictionary Based Matcher (from Old Wiki) #3963
Closed. chenlica started this conversation in archived-wiki.
From the wiki page https://github.com/apache/texera/wiki/Dictionary-Based-Matcher (may be dangling)
======
Authors: Sandeep Reddy Madugala, Sudeep Meduri, and Rajesh Yarlagadda
Reviewers: Chen Li
Synopsis
Lucene already provides basic functionality for performing a keyword search and a phrase search. We built the Dictionary Matcher feature on top of these existing features.
The purpose of the Dictionary Matcher is to let users perform multiple phrase searches at a time.
Status
As of 5/25/2016: COMPLETED
Modules
edu.uci.ics.texera.dataflow.dictionarymatcher
edu.uci.ics.texera.dataflow.common
edu.uci.ics.texera.dataflow.keywordmatch
Related Issues
[Issue #90] (Team -1) - Add Keyword based and Phrase Based Dictionary Matcher
[Issue #53] (Team -1) - Design a Dictionary class for the DictionaryMatcher
[Issue #52] (Team -1) - Implement a "Span" class
[Issue #37] (Team -1) - Design: Dictionary Matcher Operator
Description
DictionaryMatcher performs a scan-based, keyword-based, or phrase-based search, depending on the source operator type: it takes each dictionary value and searches the documents for matches. Two KeywordOperatorTypes are presently supported.
Three kinds of source operators are considered:
##### SourceOperatorType.SCANOPERATOR:
Loops through the dictionary entries. For each dictionary entry, it loops through the tuples in the operator; for each tuple, through the fields in the attribute list; and for each field, through all the matches. Only one tuple is returned per document. If a document has multiple matches, all of its spans are included in a list.
A Java regex is used to match at word boundaries.
Ex: If the dictionary word is "Lin" and the text is "Lin is Angelina's friend", the matches should include Lin but not Angelina.
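The word-boundary behavior in the example above can be illustrated with a minimal Java sketch. It is not Texera's implementation: the class `ScanDictionarySketch`, the `SpanSketch` type, and the method names are hypothetical, a document is modeled as a plain map from field name to text, and case-insensitive comparison is an assumption the source does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch of scan-based dictionary matching; not Texera's actual classes. */
final class ScanDictionarySketch {

    /** Hypothetical stand-in for a span: which field matched, which entry, and where. */
    static final class SpanSketch {
        final String field;
        final String key;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        SpanSketch(String field, String key, int start, int end) {
            this.field = field;
            this.key = key;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return field + ":" + key + "[" + start + "," + end + ")";
        }
    }

    /** Finds whole-word occurrences of one dictionary entry in one field value. */
    static List<SpanSketch> matchField(String field, String text, String entry) {
        // Pattern.quote treats the entry literally; \b only allows matches at word boundaries.
        // CASE_INSENSITIVE is an assumption of this sketch, not something the source specifies.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(entry) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<SpanSketch> spans = new ArrayList<>();
        while (matcher.find()) {
            spans.add(new SpanSketch(field, entry, matcher.start(), matcher.end()));
        }
        return spans;
    }

    /** Collects every span for one document (field name -> text) into a single list. */
    static List<SpanSketch> matchDocument(Map<String, String> document,
                                          List<String> attributes,
                                          List<String> dictionary) {
        List<SpanSketch> spans = new ArrayList<>();
        for (String entry : dictionary) {           // loop over dictionary entries
            for (String attribute : attributes) {   // loop over fields in the attribute list
                String text = document.get(attribute);
                if (text != null) {
                    spans.addAll(matchField(attribute, text, entry));
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(matchField("text", "Lin is Angelina's friend", "Lin"));
    }
}
```

For "Lin is Angelina's friend", the sketch reports a single span covering characters 0 to 3: the "lin" inside "Angelina" is rejected because it has word characters on both sides.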
##### SourceOperatorType.KEYWORDOPERATOR:
Loops through the dictionary entries. For each dictionary entry, the keywordmatcher's getNextTuple is called using KeyWordOperator.BASIC. Span information is then updated at the end of the tuple.
##### SourceOperatorType.PHRASEOPERATOR:
Loops through the dictionary entries. For each dictionary entry, the keywordmatcher's getNextTuple is called using KeyWordOperator.PHRASE. The span returned is the span information provided by the keywordmatcher's phrase operator.
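The per-entry loop described for the keyword and phrase source operators can be sketched roughly as follows. This is only an illustration under stated assumptions: `PhraseDictionarySketch`, `Hit`, `phraseSpans`, and `matchAll` are hypothetical names, the in-memory token comparison merely stands in for the Lucene-backed keywordmatcher, and results are not merged across dictionary entries because the source does not say whether they are.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch: one phrase search per dictionary entry; not Texera's actual classes. */
final class PhraseDictionarySketch {

    /** One result: the dictionary entry, the document index, and [start, end) offsets. */
    static final class Hit {
        final String entry;
        final int docId;
        final int start;
        final int end;

        Hit(String entry, int docId, int start, int end) {
            this.entry = entry;
            this.docId = docId;
            this.start = start;
            this.end = end;
        }
    }

    /** Stand-in for a phrase search on one text: the phrase's tokens must occur consecutively. */
    static List<int[]> phraseSpans(String text, String phrase) {
        // Assumption: tokens are runs of word characters, compared case-insensitively.
        String[] wanted = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> tokens = new ArrayList<>();
        List<int[]> offsets = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\w+").matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group().toLowerCase(Locale.ROOT));
            offsets.add(new int[] {matcher.start(), matcher.end()});
        }
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i + wanted.length <= tokens.size(); i++) {
            boolean matches = true;
            for (int j = 0; j < wanted.length && matches; j++) {
                matches = tokens.get(i + j).equals(wanted[j]);
            }
            if (matches) {
                // The span runs from the first matched token's start to the last one's end.
                spans.add(new int[] {offsets.get(i)[0], offsets.get(i + wanted.length - 1)[1]});
            }
        }
        return spans;
    }

    /** Outer loop over dictionary entries; each entry is searched independently. */
    static List<Hit> matchAll(List<String> documents, List<String> dictionary) {
        List<Hit> hits = new ArrayList<>();
        for (String entry : dictionary) {
            for (int docId = 0; docId < documents.size(); docId++) {
                for (int[] span : phraseSpans(documents.get(docId), entry)) {
                    hits.add(new Hit(entry, docId, span[0], span[1]));
                }
            }
        }
        return hits;
    }
}
```

In the real operator the inner search would be served by the Lucene index rather than by rescanning each document, which is consistent with the much lower match times reported for PHRASEOPERATOR in the performance test.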
Presentation
Lucene Presentation (Team 1)
Performance Test
Machine configuration: MacBook Pro, 2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3

Dataset: 100k MEDLINE records
Index time: 29.4110 seconds

| Source operator | Dictionary | Lucene query time (s) | Match time (s) | Total results |
| --- | --- | --- | --- | --- |
| SCANOPERATOR | {"medical"} | 0.1480 | 5.2740 | 2459 |
| PHRASEOPERATOR | {"medical"} | 0.3840 | 0.5980 | 2459 |
| SCANOPERATOR | {"medical","medication"} | 0.4430 | 10.9500 | 2904 |
| PHRASEOPERATOR | {"medical","medication"} | 0.4560 | 0.8950 | 2904 |
| PHRASEOPERATOR | {"medical","medication","medicare","medicaid"} | 0.5210 | 0.9100 | 3022 |

Dataset: 1M MEDLINE records
Index time: 335.6620 seconds

| Source operator | Dictionary | Lucene query time (s) | Match time (s) | Total results |
| --- | --- | --- | --- | --- |
| SCANOPERATOR | {"medical"} | 0.9840 | 53.0320 | 29355 |
| PHRASEOPERATOR | {"medical"} | 0.5870 | 5.2180 | 29355 |
| PHRASEOPERATOR | {"medical","medication","medicare","medicaid"} | 0.5950 | 5.6970 | 36528 |
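
To compare the two source operators on the runs where both were measured with the same dictionary, the reported match times can be divided directly. The snippet below is simple arithmetic on the figures above; the class and method names are illustrative, and each configuration appears to have been measured once, so the ratios are indicative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Arithmetic on the match times reported in this section; names are illustrative. */
final class MatchTimeRatioSketch {

    /** How many times longer the scan-based match took than the phrase-based match. */
    static double ratio(double scanSeconds, double phraseSeconds) {
        return scanSeconds / phraseSeconds;
    }

    /** Ratios for the three configurations measured with both source operators. */
    static Map<String, Double> reportedRatios() {
        Map<String, Double> ratios = new LinkedHashMap<>();
        ratios.put("100k, {medical}", ratio(5.2740, 0.5980));
        ratios.put("100k, {medical, medication}", ratio(10.9500, 0.8950));
        ratios.put("1M, {medical}", ratio(53.0320, 5.2180));
        return ratios;
    }

    public static void main(String[] args) {
        reportedRatios().forEach((name, value) ->
                System.out.printf("%-30s %.1fx%n", name, value));
    }
}
```

On these figures, the SCANOPERATOR match time is roughly 9 to 12 times the PHRASEOPERATOR match time, while both return the same number of results.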