Shizen Gengo Python NLP Utilities
About
shizen_gengo is a Python library that simplifies common hands-on NLP tasks.
Installation
Create a Virtual Environment (Recommended)
With Conda:
$ conda create --name gengo python=3.7
$ conda activate gengo
(gengo) $
Pip Install
(gengo) $ pip install shizen-gengo
API
Explore
Functions to search for text in a pandas dataframe column.
Explore Utils
search(df[, df_col, tok]) | Search dataframe column and return rows that contain the specified token. |
check_missing_values(df) | Return a dataframe with the count of missing values for each column, sorted in descending order. |
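The two helpers above can be approximated with plain pandas. The following is a minimal sketch of the behaviour described, not the library's actual implementation: the function names `search_rows` and `missing_value_counts`, the case-insensitive substring match, and the `missing_count` column name are all assumptions made for illustration.

```python
import pandas as pd


def search_rows(df, col, token):
    """Return the rows whose `col` contains `token` (hypothetical helper)."""
    # na=False treats missing text as "no match"; regex=False does a literal
    # substring match. Case-insensitivity is an assumption of this sketch.
    mask = df[col].str.contains(token, case=False, na=False, regex=False)
    return df[mask]


def missing_value_counts(df):
    """Count missing values per column, largest first (hypothetical helper)."""
    counts = df.isna().sum().sort_values(ascending=False)
    return counts.to_frame(name="missing_count")


# Made-up example data.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "text": ["I love sushi", None, "Ramen is great"],
        "label": [None, None, "food"],
    }
)

hits = search_rows(df, "text", "sushi")   # keeps only the row with id 1
missing = missing_value_counts(df)        # label: 2, text: 1, id: 0
print(hits)
print(missing)
```

Checking the missing-value counts first is usually worthwhile, since the text-cleaning functions further down expect string values in the column.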
Preprocess Dataframe
Functions to modify a pandas dataframe, e.g. to rename columns, to standardise column headers, or to fill missing values with a string.
Preprocess Dataframe Utils
rename_column(df_col_names, before, after) | Rename column. |
standardise_column_headers(df_col_names[, …]) | Make dataframe headers lowercase and replace spaces with underscores. |
fill_missing(df_col[, val]) | Fill missing values with string of choice. |
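To make the descriptions above concrete, here is a dependency-free sketch of the same three operations on plain Python lists. It is an illustration under stated assumptions, not the library's code: the names `rename_name`, `standardise_names`, and `fill_missing_values` are hypothetical, the default fill value `"missing"` is invented, and missing entries are modelled as `None` or float NaN. With pandas, the resulting name list could be assigned back to `df.columns`, and `Series.fillna` covers the fill step.

```python
import math


def rename_name(col_names, before, after):
    """Return a new list with the name `before` replaced by `after`."""
    return [after if name == before else name for name in col_names]


def standardise_names(col_names):
    """Lowercase every name and replace spaces with underscores."""
    return [name.lower().replace(" ", "_") for name in col_names]


def fill_missing_values(values, val="missing"):
    """Replace None / NaN entries with the string `val` (assumed default)."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [val if is_missing(v) else v for v in values]


# Made-up column headers and values.
headers = ["Review ID", "Review Text", "Star Rating"]
clean_headers = standardise_names(headers)
renamed = rename_name(clean_headers, "star_rating", "rating")
filled = fill_missing_values(["good", None, float("nan"), "bad"], val="unknown")

print(clean_headers)  # ['review_id', 'review_text', 'star_rating']
print(renamed)        # ['review_id', 'review_text', 'rating']
print(filled)         # ['good', 'unknown', 'unknown', 'bad']
```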
Preprocess Text
Functions to clean text in a pandas dataframe column.
Preprocess Text Utils
remove_newline_chars(df_col) | Remove new line and/or carriage return from dataframe column. |
remove_digits(df_col) | Remove digits. |
remove_non_char(df_col) | Remove non-alphabetic tokens: [#<>=.,;:$&*|?'" -()%] |
custom_replace(df_col[, change_from, change_to]) | Replace tokens. |
remove_url(df_col) | Remove hyperlink / url. |
remove_email(df_col) | Remove email address. |
remove_consecutive_spaces(df_col) | Remove consecutive white spaces. |
remove_stopwords(df_col) | Remove stopwords. |
remove_accented_chars(df_col) | Remove accented characters. |
remove_punctuation(df_col) | Remove punctuation. |
remove_repeating_letters(df_col) | Remove repeating letters with a minimum threshold of 2. |
clean_text(df_col) | Apply all of the text pre-processing steps above in one call. |
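As a rough illustration of what a combined cleaning pass might look like, the sketch below chains several of the steps listed above using only the standard library. Everything here is an assumption rather than the library's implementation: the function names, the regular expressions, and the order of the steps are all made up. It works on a single string, folds accented characters to their base letters instead of dropping them, reads "minimum threshold of 2" as "collapse runs of three or more identical letters down to two", and leaves out stopword removal because that needs a word list. With pandas, such a function could be applied to a column through `Series.apply`.

```python
import re
import string
import unicodedata

# Deliberately simple patterns; real-world URLs and emails are messier.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")


def strip_newlines(text):
    # Replace each run of newline / carriage-return characters with a space.
    return re.sub(r"[\r\n]+", " ", text)


def strip_urls(text):
    return URL_RE.sub(" ", text)


def strip_emails(text):
    return EMAIL_RE.sub(" ", text)


def strip_digits(text):
    return re.sub(r"\d+", "", text)


def strip_accents(text):
    # Decompose characters, then drop the combining marks ("é" -> "e").
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def squeeze_repeats(text):
    # Collapse three or more identical letters in a row down to two.
    return re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)


def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()


def clean(text):
    """Run every step in a fixed (assumed) order on one string."""
    steps = (
        strip_newlines, strip_urls, strip_emails, strip_digits,
        strip_accents, strip_punctuation, squeeze_repeats, squeeze_spaces,
    )
    for step in steps:
        text = step(text)
    return text


raw = "Sooo goood!!\r\nVisit https://example.com or mail me@example.com, café no. 42"
cleaned = clean(raw)
print(cleaned)  # Soo good Visit or mail cafe no
```

The order matters in this sketch: URLs and email addresses are removed before punctuation, because stripping punctuation first would leave fragments that the patterns no longer recognise.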
Changelog
- v 0.1.5 add clean_text function to perform all text pre-processing tasks in one go.
- v 0.1.4 minor bug fix (remove print statement).
- v 0.1.3 improve function to remove new line and carriage return characters.
- v 0.1.2 add new functions and improve existing functions and docstrings.
- v 0.1.1 pre-release.