FlashText’s documentation!

This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm.

Installation

$ pip install flashtext

Usage

Extract keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']
Replace keywords
>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
>>> new_sentence
>>> # 'I love New York and NCR region.'
Case Sensitive example
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Bay Area']
Span of keywords extracted
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
>>> keywords_found
>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]
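
The start and end values in each tuple appear to be character offsets into the original sentence, with the end offset exclusive, so they can be used directly as a Python slice. A minimal sketch, using values copied from the output above rather than computed by FlashText:

```python
sentence = 'I love big Apple and Bay Area.'
# Output copied from the span_info=True example above.
keywords_found = [('New York', 7, 16), ('Bay Area', 21, 29)]

for clean_name, start, end in keywords_found:
    # Slicing with the span recovers the text as it was written in the sentence.
    print(clean_name, '<-', repr(sentence[start:end]))

# Collect the original surface forms that were matched.
matched_text = [sentence[start:end] for _, start, end in keywords_found]
```

This is useful when you need the original surface form (here 'big Apple') alongside the clean name.
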
Get Extra information with keywords extracted
>>> from flashtext import KeywordProcessor
>>> kp = KeywordProcessor()
>>> kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))
>>> kp.add_keyword('Delhi', ('Location', 'Delhi'))
>>> kp.extract_keywords('Taj Mahal is in Delhi.')
>>> # [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]
>>> # NOTE: replace_keywords feature won't work with this.
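
Because the clean name here is a tuple, the extracted list can be post-processed like any other list of tuples. The sketch below groups the results by category; the variable name `entities_by_category` is illustrative and the input is copied from the output above:

```python
from collections import defaultdict

# Output copied from the example above: each clean name is a (category, name) tuple.
keywords_found = [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]

# Group the extracted names under their category.
entities_by_category = defaultdict(list)
for category, name in keywords_found:
    entities_by_category[category].append(name)

print(dict(entities_by_category))
```
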
No clean name for Keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Big Apple', 'Bay Area']
Add Multiple Keywords simultaneously
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>>     "java": ["java_2e", "java programing"],
>>>     "product management": ["PM", "product manager"]
>>> }
>>> # {'clean_name': ['list of unclean names']}
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> # Or add keywords from a list:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management', 'java']
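
The dictionary format is keyed by clean name, which is the reverse of the lookup direction used during extraction. The plain-Python sketch below flattens it into an unclean name to clean name mapping, purely to illustrate the shape of the data; `unclean_to_clean` is a hypothetical name and this is not how the library stores keywords:

```python
keyword_dict = {
    "java": ["java_2e", "java programing"],
    "product management": ["PM", "product manager"],
}

# Flatten {'clean_name': [unclean names]} into unclean name -> clean name.
unclean_to_clean = {
    unclean: clean
    for clean, unclean_names in keyword_dict.items()
    for unclean in unclean_names
}

print(unclean_to_clean['product manager'])
```

Each unclean name becomes one lookup entry, so this dictionary contributes four keywords.
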
To Remove keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>>     "java": ["java_2e", "java programing"],
>>>     "product management": ["PM", "product manager"]
>>> }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
>>> # output ['product management', 'java']
>>> keyword_processor.remove_keyword('java_2e')
>>> # you can also remove keywords using a list or a dictionary
>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
>>> keyword_processor.remove_keywords_from_list(["java programing"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management']
To check Number of terms in KeywordProcessor
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>>     "java": ["java_2e", "java programing"],
>>>     "product management": ["PM", "product manager"]
>>> }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(len(keyword_processor))
>>> # output 4
To check if term is present in KeywordProcessor
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> 'j2ee' in keyword_processor
>>> # output: True
>>> keyword_processor.get_keyword('j2ee')
>>> # output: Java
>>> keyword_processor['colour'] = 'color'
>>> keyword_processor['colour']
>>> # output: color
Get all keywords in dictionary
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('colour', 'color')
>>> keyword_processor.get_all_keywords()
>>> # output: {'colour': 'color', 'j2ee': 'Java'}

Word boundary detection: currently, any character that is not a word character (\w, i.e. [A-Za-z0-9_]) is treated as a word boundary.

To set or add characters as part of word characters
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
>>> # ['Big Apple']
>>> keyword_processor.add_non_word_boundary('/')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
>>> # []
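
The effect of adding '/' can be pictured as a membership test against a set of word characters. The following is a rough sketch under the assumption that the default set is exactly [A-Za-z0-9_] as stated above; `is_word_boundary` is a hypothetical helper, not part of the FlashText API:

```python
import string

# Assumption: this mirrors the documented default word characters [A-Za-z0-9_].
word_chars = set(string.ascii_letters + string.digits + '_')

def is_word_boundary(ch, word_chars):
    """Return True if ch ends a word, i.e. it is not a word character."""
    return ch not in word_chars

text = 'I love Big Apple/Bay Area.'
# The character that immediately follows 'Big Apple' in the text (the '/').
after_keyword = text[text.index('Big Apple') + len('Big Apple')]

before = is_word_boundary(after_keyword, word_chars)  # '/' is a boundary by default
word_chars.add('/')                                   # analogous to add_non_word_boundary('/')
after = is_word_boundary(after_keyword, word_chars)   # '/' now continues the word
```

Once '/' counts as a word character, 'Big Apple' is no longer followed by a boundary, which is consistent with the empty result shown above.
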

API Doc

Import and initialize module

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # if match has to be case sensitive
>>> keyword_processor = KeywordProcessor(case_sensitive=True)

Add Keywords to module

>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')

Extract keywords

>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']

Replace keywords

>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
>>> new_sentence
>>> # 'I love New York and NCR region.'

Add keywords from File

>>> # Option 1: keywords.txt content
>>> # java_2e=>java
>>> # java programing=>java
>>> # product management=>product management
>>> # product management techniques=>product management
>>> # Option 2: keywords.txt content
>>> # java
>>> # python
>>> # c++
>>> keyword_processor.add_keyword_from_file('keywords.txt')

Add keywords from dict

>>> keyword_dict = {
        "java": ["java_2e", "java programing"],
        "product management": ["PM", "product manager"]
    }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)

Add keywords from list

>>> keyword_processor.add_keywords_from_list(["java", "python"])

KeywordProcessor Class Doc

class flashtext.keyword.KeywordProcessor(case_sensitive=False)
Attributes:
_keyword (str): Used as key to store keywords in trie dictionary.
Defaults to ‘_keyword_’
non_word_boundaries (set(str)): Characters that will determine if the word is continuing.
Defaults to set([A-Za-z0-9_])
keyword_trie_dict (dict): Trie dict built character by character, that is used for lookup
Defaults to empty dictionary
case_sensitive (boolean): if the search algorithm should be case sensitive or not.
Defaults to False
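
To make the trie attributes more concrete, here is a minimal sketch of a nested-dict trie built one character at a time, with the clean name stored under the '_keyword_' key. The helpers `add_to_trie` and `lookup` are hypothetical names for illustration, and the exact internal layout of keyword_trie_dict is an assumption based on the attribute descriptions above:

```python
KEYWORD_KEY = '_keyword_'  # documented default for the _keyword attribute

def add_to_trie(trie, keyword, clean_name):
    """Insert keyword one character at a time and store clean_name at the end."""
    node = trie
    for ch in keyword:
        node = node.setdefault(ch, {})
    node[KEYWORD_KEY] = clean_name

def lookup(trie, word):
    """Return the clean name stored for word, or None if word is not a keyword."""
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get(KEYWORD_KEY)

trie = {}
add_to_trie(trie, 'j2ee', 'Java')
add_to_trie(trie, 'java', 'Java')
print(lookup(trie, 'j2ee'))
```

Keywords that share a prefix (here the leading 'j') share the same branch of the dictionary, so a lookup costs one dictionary step per character of the word.
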
Examples:
>>> # import module
>>> from flashtext import KeywordProcessor
>>> # Create an object of KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # add keywords
>>> keyword_names = ['NY', 'new-york', 'SF']
>>> clean_names = ['new york', 'new york', 'san francisco']
>>> for keyword_name, clean_name in zip(keyword_names, clean_names):
>>>     keyword_processor.add_keyword(keyword_name, clean_name)
>>> keywords_found = keyword_processor.extract_keywords('I love SF and NY. new-york is the best.')
>>> keywords_found
>>> # ['san francisco', 'new york', 'new york']
add_keyword(keyword, clean_name=None)

Adds a keyword to the dictionary. Pass the keyword and, optionally, the clean name it maps to.

Args:
keyword : string
keyword that you want to identify
clean_name : string
Clean term for that keyword, which is what you get back on extraction or what is substituted on replacement. If not provided, the keyword itself is used as the clean name.
Returns:
status : bool
The return value. True for success, False otherwise.
Examples:
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> # In this case 'Big Apple' will return 'New York'
>>> # OR
>>> keyword_processor.add_keyword('Big Apple')
>>> # In this case 'Big Apple' will return 'Big Apple'
add_keyword_from_file(keyword_file, encoding='utf-8')

To add keywords from a file

Args:
keyword_file : path to keywords file
encoding : specify the encoding of the file
Examples:

keywords file format can be like:

>>> # Option 1: keywords.txt content
>>> # java_2e=>java
>>> # java programing=>java
>>> # product management=>product management
>>> # product management techniques=>product management
>>> # Option 2: keywords.txt content
>>> # java
>>> # python
>>> # c++
>>> keyword_processor.add_keyword_from_file('keywords.txt')
Raises:
IOError: If keyword_file path is not valid
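
The two file formats can be read with a few lines of plain Python. This is only a sketch of how such lines might be interpreted, not the library's own loader; `parse_keyword_lines` is a hypothetical helper, and skipping blank lines is an assumption of this sketch:

```python
def parse_keyword_lines(lines):
    """Yield (keyword, clean_name) pairs from lines in either documented format."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skipping blank lines is an assumption of this sketch
        if '=>' in line:
            # Option 1: "keyword=>clean name"
            keyword, clean_name = line.split('=>', 1)
            yield keyword.strip(), clean_name.strip()
        else:
            # Option 2: no clean name given, so the keyword maps to itself
            yield line, line

option_1 = ['java_2e=>java', 'product management techniques=>product management']
option_2 = ['java', 'c++']
pairs = list(parse_keyword_lines(option_1 + option_2))
print(pairs)
```
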
add_keywords_from_dict(keyword_dict)

To add keywords from a dictionary

Args:
keyword_dict (dict): A dictionary with str key and (list str) as value
Examples:
>>> keyword_dict = {
        "java": ["java_2e", "java programing"],
        "product management": ["PM", "product manager"]
    }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
Raises:
AttributeError: If value for a key in keyword_dict is not a list.
add_keywords_from_list(keyword_list)

To add keywords from a list

Args:
keyword_list (list(str)): List of keywords to add
Examples:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
Raises:
AttributeError: If keyword_list is not a list.
add_non_word_boundary(character)

Adds a character that will be considered part of a word.

Args:
character (char):
Character that will be considered as part of word.
extract_keywords(sentence, span_info=False)

Searches in the string for all keywords present in corpus. Keywords present are added to a list keywords_extracted and returned.

Args:
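sentence (str): Line of text where we will search for keywords
span_info (bool): If True, each result is returned as a (clean name, start, end) tuple giving the position of the match in the sentence. Defaults to False
Returns:
keywords_extracted (list(str)): List of terms/keywords found in sentence that match our corpus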
Examples:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']
get_all_keywords(term_so_far='', current_dict=None)

Recursively builds a dictionary of the keywords present in the trie and the clean names mapped to those keywords.

Args:
term_so_far : string
term built so far by adding all previous characters
current_dict : dict
current recursive position in dictionary
Returns:
terms_present : dict
A map where each key is a term in keyword_trie_dict and each value is the clean name mapped to that term.
Examples:
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('Python', 'Python')
>>> keyword_processor.get_all_keywords()
>>> # {'j2ee': 'Java', 'python': 'Python'}
>>> # NOTE: when case_sensitive is False, all keys are lowercased.
get_keyword(word)

If word is present in keyword_trie_dict, returns the clean name for it.

Args:
word : string
word that you want to check
Returns:
keyword : string
If word is present exactly as given in keyword_trie_dict, the clean name mapped to it is returned.
Examples:
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.get_keyword('Big Apple')
>>> # New York
remove_keyword(keyword)

Removes a keyword from the dictionary. Pass the keyword you want to remove.

Args:
keyword : string
keyword that you want to remove if it’s present
Returns:
status : bool
The return value. True for success, False otherwise.
Examples:
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns True
>>> # In this case 'Big Apple' will no longer be a recognized keyword
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns False
remove_keywords_from_dict(keyword_dict)

To remove keywords from a dictionary

Args:
keyword_dict (dict): A dictionary with str key and (list str) as value
Examples:
>>> keyword_dict = {
        "java": ["java_2e", "java programing"],
        "product management": ["PM", "product manager"]
    }
>>> keyword_processor.remove_keywords_from_dict(keyword_dict)
Raises:
AttributeError: If value for a key in keyword_dict is not a list.
remove_keywords_from_list(keyword_list)

To remove keywords present in list

Args:
keyword_list (list(str)): List of keywords to remove
Examples:
>>> keyword_processor.remove_keywords_from_list(["java", "python"])
Raises:
AttributeError: If keyword_list is not a list.
replace_keywords(sentence)

Searches in the string for all keywords present in corpus. Keywords present are replaced by the clean name and a new string is returned.

Args:
sentence (str): Line of text where we will replace keywords
Returns:
new_sentence (str): Line of text with replaced keywords
Examples:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and bay area.')
>>> new_sentence
>>> # 'I love New York and Bay Area.'
set_non_word_boundaries(non_word_boundaries)

Sets the characters that will be considered part of a word.

Args:
non_word_boundaries (set(str)):
Set of characters that will be considered as part of word.

Test

$ git clone https://github.com/vi3k6i5/flashtext
$ cd flashtext
$ pip install pytest
$ python setup.py test

Build Docs

$ git clone https://github.com/vi3k6i5/flashtext
$ cd flashtext/docs
$ pip install sphinx
$ make html
$ # open _build/html/index.html in browser to view it locally

Why not Regex?

FlashText is a custom algorithm based on the Aho-Corasick algorithm and a trie dictionary.
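
To give a feel for the trie-based approach, here is a deliberately simplified, case-insensitive sketch of keyword extraction over a nested-dict trie. It is not the library's implementation: `build_trie` and `extract` are hypothetical names, the word-character set is assumed to be [A-Za-z0-9_], and many details of the real algorithm are omitted. The point it illustrates is that the text is walked character by character against one shared trie, instead of being tested against each keyword separately.

```python
import string

KEYWORD_KEY = '_keyword_'
WORD_CHARS = set(string.ascii_letters + string.digits + '_')

def build_trie(keywords):
    """Build a nested-dict trie from a {keyword: clean_name} mapping (lowercased)."""
    trie = {}
    for keyword, clean_name in keywords.items():
        node = trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[KEYWORD_KEY] = clean_name
    return trie

def extract(trie, sentence):
    """Return clean names found in sentence, preferring the longest match at each position."""
    text = sentence.lower()
    found = []
    i, n = 0, len(text)
    while i < n:
        # A match may only start at the beginning of a word.
        if i > 0 and text[i - 1] in WORD_CHARS:
            i += 1
            continue
        node, j = trie, i
        best_name, best_end = None, i
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept a keyword only if it ends at a word boundary.
            if KEYWORD_KEY in node and (j == n or text[j] not in WORD_CHARS):
                best_name, best_end = node[KEYWORD_KEY], j
        if best_name is not None:
            found.append(best_name)
            i = best_end  # continue scanning after the matched keyword
        else:
            i += 1
    return found

trie = build_trie({'Big Apple': 'New York', 'Bay Area': 'Bay Area'})
result = extract(trie, 'I love Big Apple and Bay Area.')
print(result)
```

Adding more keywords makes the trie larger but does not add extra passes over the text, which is the intuition behind comparing this approach with running many regular expressions.
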

Benchmark

Time taken by FlashText to find terms in comparison to Regex.

https://thepracticaldev.s3.amazonaws.com/i/xruf50n6z1r37ti8rd89.png

Time taken by FlashText to replace terms in comparison to Regex.

https://thepracticaldev.s3.amazonaws.com/i/k44ghwp8o712dm58debj.png

Link to code for benchmarking the Find Feature and Replace Feature.

The idea for this library came from the following StackOverflow question.

Citation

The original paper published on the FlashText algorithm.

@ARTICLE{2017arXiv171100046S,
   author = {{Singh}, V.},
    title = "{Replace or Retrieve Keywords In Documents at Scale}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1711.00046},
 primaryClass = "cs.DS",
 keywords = {Computer Science - Data Structures and Algorithms},
     year = 2017,
    month = oct,
   adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171100046S},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

The article published on Medium freeCodeCamp.

License

The project is licensed under the MIT license.