Shizen Gengo Python NLP Utilities

shizen_gengo is a Python library that simplifies common hands-on NLP tasks.

About

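shizen_gengo is a Python library that simplifies common hands-on NLP tasks: exploring pandas dataframes, preprocessing dataframe columns and headers, and cleaning the text held in a dataframe column.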

Installation

Create a Virtual Environment (Recommended)

With Conda:

$ conda create --name gengo python=3.7
$ conda activate gengo
(gengo) $

Pip Install

(gengo) $ pip install shizen-gengo

API

Explore

Functions to search for text in a pandas dataframe column and to count missing values per column.

Explore Utils
search(df[, df_col, tok])  Search dataframe column and return rows that contain the specified token.
check_missing_values(df)   Returns a dataframe with missing values count for all columns sorted in descending order.
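
The behaviour described above can be approximated in plain pandas. The sketch below is not the shizen_gengo implementation: the function names ending in _sketch, the sample dataframe, and the missing_count column name are hypothetical, and the library's real argument handling may differ.

```python
import pandas as pd


def search_sketch(df, df_col, tok):
    """Return the rows of df whose df_col text contains the literal token tok."""
    # regex=False treats tok as plain text; na=False counts missing cells as non-matches.
    mask = df[df_col].str.contains(tok, regex=False, na=False)
    return df[mask]


def check_missing_values_sketch(df):
    """Return a one-column dataframe of missing-value counts, largest first."""
    counts = df.isna().sum().sort_values(ascending=False)
    # "missing_count" is a hypothetical column name chosen for this sketch.
    return counts.to_frame(name="missing_count")


# Hypothetical sample data.
df = pd.DataFrame(
    {
        "review": ["Great phone", None, "great battery", "Poor screen"],
        "rating": [5, 4, None, None],
    }
)

print(search_sketch(df, "review", "battery"))
print(check_missing_values_sketch(df))
```

Passing na=False matters here: without it, rows with a missing value would produce a missing entry in the mask instead of False, and the row filter would fail.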

Preprocess Dataframe

Functions to modify a pandas dataframe, e.g. to rename columns, to standardise column headers, or to fill missing values with a string.

Preprocess Dataframe Utils
rename_column(df_col_names, before, after)     Rename column.
standardise_column_headers(df_col_names[, …])  Make dataframe headers lowercase and replace spaces by underscores.
fill_missing(df_col[, val])                    Fill missing values with string of choice.
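
As a rough illustration of what these steps do, here is a minimal pandas-only sketch. It does not call shizen_gengo: the _sketch helpers, the sample column names, and the default fill value "unknown" are assumptions made for this example.

```python
import pandas as pd


def standardise_headers_sketch(col_names):
    """Lowercase each header and replace spaces with underscores."""
    return [str(name).lower().replace(" ", "_") for name in col_names]


def fill_missing_sketch(df_col, val="unknown"):
    """Fill missing values in a column with a string (the default is an assumption)."""
    return df_col.fillna(val)


# Hypothetical sample data.
df = pd.DataFrame({"Product Name": ["Phone", None], "User Review": ["Good", None]})

# Standardise all headers, then rename one column with plain pandas.
df.columns = standardise_headers_sketch(df.columns)
df = df.rename(columns={"user_review": "review"})

# Fill the gaps in a single column with a chosen string.
df["review"] = fill_missing_sketch(df["review"], val="no review")
print(df)
```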

Preprocess Text

Functions to clean text in a pandas dataframe column.

Preprocess Text Utils
remove_newline_chars(df_col)                      Remove new line and/or carriage return from dataframe column.
remove_digits(df_col)                             Remove digits.
remove_non_char(df_col)                           Remove non-alphabetic tokens: [#<>=.,;:$&*|?'"-()%]
custom_replace(df_col[, change_from, change_to])  Replace tokens.
remove_url(df_col)                                Remove hyperlink / url.
remove_email(df_col)                              Remove email address.
remove_consecutive_spaces(df_col)                 Remove consecutive white spaces.
remove_stopwords(df_col)                          Remove stopwords.
remove_accented_chars(df_col)                     Remove accented characters.
remove_punctuation(df_col)                        Remove punctuation.
remove_repeating_letters(df_col)                  Remove repeating letters with a minimum threshold of 2.
clean_text(df_col)                                Function that combines all text pre-processing tasks.
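
To make the idea of a combined cleaning step concrete, the sketch below chains a few of the tasks listed above using only the Python standard library. It is an illustration under stated assumptions, not the library's clean_text: the regular expressions, the order of the steps, and the reading of "threshold of 2" as "shorten runs of three or more identical letters to two" are all guesses, and stopword and punctuation removal are left out.

```python
import re
import unicodedata

# Simplified patterns assumed for this sketch; real URLs and emails are more varied.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
REPEAT_RE = re.compile(r"([A-Za-z])\1{2,}")  # a letter followed by 2+ copies of itself


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_text_sketch(text):
    """Apply a subset of the cleaning steps to a single string."""
    text = re.sub(r"[\r\n]+", " ", text)  # new lines and carriage returns
    text = URL_RE.sub(" ", text)          # hyperlinks
    text = EMAIL_RE.sub(" ", text)        # email addresses
    text = strip_accents(text)            # accents
    text = re.sub(r"\d+", " ", text)      # digits
    text = REPEAT_RE.sub(r"\1\1", text)   # long letter runs shortened to two
    text = re.sub(r"\s+", " ", text)      # consecutive white spaces
    return text.strip()


raw = "Visit https://example.com\r\nor mail me@example.com now. The café was soooo good in 2019"
print(clean_text_sketch(raw))
```

Because the sketch works on one string at a time, it can be applied to a dataframe column with pandas' Series.map, for example df["text"].map(clean_text_sketch), provided missing values have been filled first.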

Changelog

  • v 0.1.5 add clean_text function to perform all text pre-processing tasks in one go.
  • v 0.1.4 minor bug fix (remove print statement).
  • v 0.1.3 improve function to remove new line and carriage return characters.
  • v 0.1.2 add new functions and improve existing functions and docstrings.
  • v 0.1.1 pre-release.