KeywordProcessor Class Doc
-
class flashtext.keyword.KeywordProcessor(case_sensitive=False)
- Attributes:
- _keyword (str): Used as key to store keywords in trie dictionary.
- Defaults to ‘_keyword_’
- non_word_boundaries (set(str)): Characters that will determine if the word is continuing.
- Defaults to set([A-Za-z0-9_])
- keyword_trie_dict (dict): Trie dict built character by character, that is used for lookup
- Defaults to empty dictionary
- case_sensitive (boolean): Whether the search algorithm should be case sensitive.
- Defaults to False
- Examples:
>>> # import module
>>> from flashtext import KeywordProcessor
>>> # Create an object of KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # add keywords
>>> keyword_names = ['NY', 'new-york', 'SF']
>>> clean_names = ['new york', 'new york', 'san francisco']
>>> for keyword_name, clean_name in zip(keyword_names, clean_names):
...     keyword_processor.add_keyword(keyword_name, clean_name)
>>> keywords_found = keyword_processor.extract_keywords('I love SF and NY. new-york is the best.')
>>> keywords_found
['san francisco', 'new york', 'new york']
- Note:
- Loosely based on the Aho-Corasick algorithm.
- Idea came from this Stack Overflow Question.
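The attribute descriptions above suggest a nested dictionary in which each character of a keyword is one level and the `_keyword_` key marks the end of a keyword. A minimal sketch of that idea follows. The `build_trie` helper is hypothetical and written for illustration only; it is not flashtext's own code.

```python
# Hypothetical sketch of a character-by-character trie; not flashtext's code.
KEYWORD_KEY = "_keyword_"  # sentinel key, matching the documented default


def build_trie(pairs, case_sensitive=False):
    """Build a nested-dict trie from (keyword, clean_name) pairs."""
    trie = {}
    for keyword, clean_name in pairs:
        if not case_sensitive:
            keyword = keyword.lower()
        node = trie
        for char in keyword:
            # Descend one level per character, creating nodes as needed.
            node = node.setdefault(char, {})
        # Mark the end of the keyword and store its clean name.
        node[KEYWORD_KEY] = clean_name
    return trie


trie = build_trie([("NY", "new york"), ("SF", "san francisco")])
print(trie)
# {'n': {'y': {'_keyword_': 'new york'}}, 's': {'f': {'_keyword_': 'san francisco'}}}
```

Keywords that share a prefix share the same path in the dictionary, which is what lets a lookup consume a sentence one character at a time.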
-
add_keyword(keyword, clean_name=None)
Adds a keyword to the dictionary, together with the clean name it maps to.
- Args:
- keyword : string
- keyword that you want to identify
- clean_name : string
- clean term that you want returned, or used as the replacement, for that keyword. If not provided, the keyword itself is used as the clean name.
- Returns:
- status : bool
- The return value. True for success, False otherwise.
- Examples:
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> # In this case 'Big Apple' will return 'New York'
>>> # OR
>>> keyword_processor.add_keyword('Big Apple')
>>> # In this case 'Big Apple' will return 'Big Apple'
-
add_keyword_from_file(keyword_file, encoding='utf-8')
Adds keywords from a file.
- Args:
- keyword_file : path to keywords file
- encoding : specify the encoding of the file
- Examples:
The keywords file can use either of the following formats:
>>> # Option 1: keywords.txt content
>>> # java_2e=>java
>>> # java programing=>java
>>> # product management=>product management
>>> # product management techniques=>product management

>>> # Option 2: keywords.txt content
>>> # java
>>> # python
>>> # c++

>>> keyword_processor.add_keyword_from_file('keywords.txt')
- Raises:
- IOError: If keyword_file path is not valid
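The two file formats above can be read as "keyword=>clean name" lines and bare keyword lines. The sketch below shows one way such lines could be turned into (keyword, clean name) pairs. The `parse_keyword_lines` helper is hypothetical, and skipping blank lines is an assumption of this sketch rather than documented behaviour.

```python
import os
import tempfile


def parse_keyword_lines(lines):
    """Return (keyword, clean_name) pairs from keyword-file lines.

    Hypothetical helper for illustration; not part of flashtext.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no keyword
        if "=>" in line:
            # Option 1: "keyword=>clean name"
            keyword, clean_name = line.split("=>", 1)
            pairs.append((keyword.strip(), clean_name.strip()))
        else:
            # Option 2: the keyword doubles as its own clean name
            pairs.append((line, line))
    return pairs


content = "java_2e=>java\njava programing=>java\npython\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keywords.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    with open(path, encoding="utf-8") as handle:
        pairs = parse_keyword_lines(handle)

print(pairs)
# [('java_2e', 'java'), ('java programing', 'java'), ('python', 'python')]
```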
-
add_keywords_from_dict(keyword_dict)
Adds keywords from a dictionary.
- Args:
- keyword_dict (dict): A dictionary that maps each clean name (str) to a list of keywords (list(str))
- Examples:
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
- Raises:
- AttributeError: If value for a key in keyword_dict is not a list.
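In the example above, each dictionary key reads as a clean name and each list entry as a keyword that maps to it. The sketch below flattens such a dictionary into (keyword, clean name) pairs and mirrors the documented AttributeError for non-list values. The `flatten_keyword_dict` helper is hypothetical and only illustrates the direction of the mapping.

```python
def flatten_keyword_dict(keyword_dict):
    """Turn {clean_name: [keyword, ...]} into (keyword, clean_name) pairs.

    Hypothetical helper for illustration; not part of flashtext.
    """
    pairs = []
    for clean_name, keywords in keyword_dict.items():
        if not isinstance(keywords, list):
            raise AttributeError(
                "Value of key {!r} should be a list".format(clean_name)
            )
        for keyword in keywords:
            pairs.append((keyword, clean_name))
    return pairs


keyword_dict = {
    "java": ["java_2e", "java programing"],
    "product management": ["PM", "product manager"],
}
pairs = flatten_keyword_dict(keyword_dict)
print(pairs)
# [('java_2e', 'java'), ('java programing', 'java'),
#  ('PM', 'product management'), ('product manager', 'product management')]
```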
-
add_keywords_from_list(keyword_list)
Adds keywords from a list.
- Args:
- keyword_list (list(str)): List of keywords to add
- Examples:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
- Raises:
- AttributeError: If keyword_list is not a list.
-
add_non_word_boundary(character)
Adds a character that will be considered part of a word.
- Args:
- character (char):
- Character that will be considered as part of word.
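To make the effect of a non-word-boundary character concrete, the sketch below checks whether a candidate match is glued to a neighbouring word character. The `is_whole_word_match` helper is hypothetical and simplifies whatever the library does internally. It only illustrates why adding a character such as `-` to the set can stop a keyword from matching.

```python
import string


def is_whole_word_match(sentence, start, end, non_word_boundaries):
    """True if sentence[start:end] is not attached to a word character.

    Hypothetical helper for illustration; not part of flashtext.
    """
    before_ok = start == 0 or sentence[start - 1] not in non_word_boundaries
    after_ok = end == len(sentence) or sentence[end] not in non_word_boundaries
    return before_ok and after_ok


# Documented default: letters, digits and underscore continue a word.
default_boundaries = set(string.ascii_letters + string.digits + "_")
# Treat '-' as part of a word as well.
extended_boundaries = default_boundaries | {"-"}

sentence = "rust-lang is fun"
start, end = 0, len("rust")
print(is_whole_word_match(sentence, start, end, default_boundaries))   # True
print(is_whole_word_match(sentence, start, end, extended_boundaries))  # False
```

With the default set, `-` ends the word, so "rust" stands alone. Once `-` counts as part of a word, "rust-lang" is a single word and "rust" no longer matches on its own.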
-
extract_keywords(sentence, span_info=False)
Searches in the string for all keywords present in corpus. Keywords present are added to a list keywords_extracted and returned.
- Args:
- sentence (str): Line of text where we will search for keywords
- Returns:
- keywords_extracted (list(str)): List of terms/keywords found in sentence that match our corpus
- Examples:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
['New York', 'Bay Area']
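The signature includes a `span_info` flag that the Args list does not describe. The sketch below assumes it adds the start and end offsets of each match to the result, and shows how a left-to-right scan over a nested-dict trie could produce both forms. `build_trie` and `extract` are hypothetical helpers, the scan is a simplified longest-match walk rather than flashtext's implementation, and offsets are computed on the lowercased sentence, which is only safe for text whose length does not change under `lower()`.

```python
import string

KEYWORD_KEY = "_keyword_"
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(pairs):
    """Nested-dict trie keyed by lowercased characters (hypothetical helper)."""
    trie = {}
    for keyword, clean_name in pairs:
        node = trie
        for char in keyword.lower():
            node = node.setdefault(char, {})
        node[KEYWORD_KEY] = clean_name
    return trie


def extract(sentence, trie, span_info=False):
    """Scan left to right, keeping the longest keyword at each word start."""
    found = []
    lowered = sentence.lower()
    i, n = 0, len(lowered)
    while i < n:
        at_word_start = i == 0 or lowered[i - 1] not in WORD_CHARS
        best = None
        if at_word_start:
            node, j = trie, i
            while j < n and lowered[j] in node:
                node = node[lowered[j]]
                j += 1
                ends_word = j == n or lowered[j] not in WORD_CHARS
                if KEYWORD_KEY in node and ends_word:
                    best = (node[KEYWORD_KEY], i, j)  # longest match so far
        if best is None:
            i += 1
        else:
            found.append(best if span_info else best[0])
            i = best[2]  # resume after the matched keyword
    return found


trie = build_trie([("Big Apple", "New York"), ("Bay Area", "Bay Area")])
sentence = "I love Big Apple and Bay Area."
print(extract(sentence, trie))
# ['New York', 'Bay Area']
print(extract(sentence, trie, span_info=True))
# [('New York', 7, 16), ('Bay Area', 21, 29)]
```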
-
get_all_keywords(term_so_far='', current_dict=None)
Recursively builds a dictionary of the keywords present in the trie and the clean names mapped to them.
- Args:
- term_so_far : string
- term built so far by adding all previous characters
- current_dict : dict
- current recursive position in dictionary
- Returns:
- terms_present : dict
- A mapping in which each key is a term in keyword_trie_dict and each value is the clean name mapped to that term.
- Examples:
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('Python', 'Python')
>>> keyword_processor.get_all_keywords()
{'j2ee': 'Java', 'python': 'Python'}
>>> # NOTE: when case_sensitive is False, all keys are lowercased.
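The `term_so_far` and `current_dict` arguments describe a depth-first walk of the trie. The sketch below shows that recursion on a hand-written nested dictionary. `collect_keywords` is a hypothetical stand-in that assumes the `_keyword_` sentinel described in the class attributes; it is not the library's method.

```python
KEYWORD_KEY = "_keyword_"


def collect_keywords(current_dict, term_so_far=""):
    """Walk a nested-dict trie and return {term: clean_name}.

    Hypothetical helper for illustration; not part of flashtext.
    """
    terms_present = {}
    for key, child in current_dict.items():
        if key == KEYWORD_KEY:
            # The path walked so far spells a complete keyword.
            terms_present[term_so_far] = child
        else:
            # Extend the term by one character and recurse.
            terms_present.update(collect_keywords(child, term_so_far + key))
    return terms_present


trie = {
    "n": {"y": {KEYWORD_KEY: "new york", "c": {KEYWORD_KEY: "new york"}}},
    "s": {"f": {KEYWORD_KEY: "san francisco"}},
}
print(collect_keywords(trie))
# {'ny': 'new york', 'nyc': 'new york', 'sf': 'san francisco'}
```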
-
get_keyword(word)
If word is present in keyword_trie_dict, returns the clean name for it.
- Args:
- word : string
- word that you want to check
- Returns:
- keyword : string
- If word is present exactly as given in keyword_trie_dict, the clean name mapped to it is returned.
- Examples:
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.get_keyword('Big Apple')
'New York'
-
remove_keyword(keyword)
Removes a keyword from the dictionary.
- Args:
- keyword : string
- keyword that you want to remove if it’s present
- Returns:
- status : bool
- The return value. True for success, False otherwise.
- Examples:
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns True
>>> # In this case 'Big Apple' will no longer be a recognized keyword
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns False
-
remove_keywords_from_dict(keyword_dict)
Removes the keywords listed in a dictionary.
- Args:
- keyword_dict (dict): A dictionary that maps each clean name (str) to a list of keywords (list(str))
- Examples:
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.remove_keywords_from_dict(keyword_dict)
- Raises:
- AttributeError: If value for a key in keyword_dict is not a list.
-
remove_keywords_from_list(keyword_list)
Removes the keywords present in a list.
- Args:
- keyword_list (list(str)): List of keywords to remove
- Examples:
>>> keyword_processor.remove_keywords_from_list(["java", "python"])
- Raises:
- AttributeError: If keyword_list is not a list.
-
replace_keywords(sentence)
Searches in the string for all keywords present in corpus. Keywords present are replaced by the clean name and a new string is returned.
- Args:
- sentence (str): Line of text where we will replace keywords
- Returns:
- new_sentence (str): Line of text with replaced keywords
- Examples:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and bay area.')
>>> new_sentence
'I love New York and Bay Area.'
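Replacement can be pictured as rebuilding the sentence around the matched spans. The sketch below assumes the matches are already known as (clean name, start, end) tuples and stitches the new string together. `replace_spans` is a hypothetical helper, and the library may well do the substitution during its scan rather than in a second pass like this.

```python
def replace_spans(sentence, matches):
    """Rebuild sentence, swapping each (clean_name, start, end) span.

    Hypothetical helper for illustration; assumes spans do not overlap.
    """
    parts = []
    cursor = 0
    for clean_name, start, end in sorted(matches, key=lambda m: m[1]):
        parts.append(sentence[cursor:start])  # untouched text before the match
        parts.append(clean_name)              # replacement for the match
        cursor = end
    parts.append(sentence[cursor:])           # trailing text after the last match
    return "".join(parts)


sentence = "I love Big Apple and bay area."
matches = [("New York", 7, 16), ("Bay Area", 21, 29)]
new_sentence = replace_spans(sentence, matches)
print(new_sentence)
# I love New York and Bay Area.
```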
-
set_non_word_boundaries(non_word_boundaries)
Sets the characters that will be considered part of a word.
- Args:
- non_word_boundaries (set(str)):
- Set of characters that will be considered as part of word.