Statistical row filtering (#53)
* Create target encoding

* generalize target encoding to multiple columns

* Integrate target_encoding into prepareSet

* Add verbose

* Update doc

* Add functions remove_sd_outlier and remove_rare_categorical to start filtering rows

* Create remove_percentile_outlier and complete doc
ELToulemonde authored Jul 19, 2019
1 parent e92d020 commit 9868461
Showing 9 changed files with 435 additions and 9 deletions.
4 changes: 4 additions & 0 deletions NAMESPACE
@@ -24,6 +24,9 @@ export(generateFromFactor)
export(identifyDates)
export(one_hot_encoder)
export(prepareSet)
export(remove_percentile_outlier)
export(remove_rare_categorical)
export(remove_sd_outlier)
export(sameShape)
export(setAsNumericMatrix)
export(setColAsCharacter)
@@ -48,5 +51,6 @@ importFrom(progress,progress_bar)
importFrom(stats,as.formula)
importFrom(stats,contrasts)
importFrom(stats,model.matrix)
importFrom(stats,quantile)
importFrom(stats,sd)
importFrom(stringr,str_replace_all)
3 changes: 3 additions & 0 deletions NEWS
@@ -3,6 +3,9 @@ V 0.4.1
- New features:
- New functions:
- Functions *target_encode* and *build_target_encoding* have been implemented to provide target encoding, which is the process of replacing a categorical value with the aggregation of the target variable.
- Function *remove_sd_outlier* helps to remove rows that have numerical values that are too extreme.
- Function *remove_percentile_outlier* helps to remove rows that have numerical values that are too extreme (based on percentile analysis).
- Function *remove_rare_categorical* helps to remove rows that have categorical values that are too rare.
- New features in existing functions:
- Function *prepareSet* integrates the *target_encode* function. It is called by providing *target_col* and *target_encoding_functions*.

3 changes: 3 additions & 0 deletions NEWS.md
@@ -3,6 +3,9 @@ V 0.4.1
- New features:
- New functions:
- Functions *target_encode* and *build_target_encoding* have been implemented to provide target encoding, which is the process of replacing a categorical value with the aggregation of the target variable (see the sketch below).
- Function *remove_sd_outlier* helps to remove rows that have numerical values that are too extreme.
- Function *remove_percentile_outlier* helps to remove rows that have numerical values that are too extreme (based on percentile analysis).
- Function *remove_rare_categorical* helps to remove rows that have categorical values that are too rare.
- New features in existing functions:
- Function *prepareSet* integrates the *target_encode* function. It is called by providing *target_col* and *target_encoding_functions*.

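A minimal sketch of the target encoding idea using plain data.table (illustrative only, not the package's *target_encode* API; the column names here are made up):

library(data.table)
dt <- data.table(cat_col = c("A", "A", "B", "B", "B"),
                 target  = c(1, 0, 1, 1, 0))
# Replace each category by the mean of the target within that category
dt[, cat_col_mean_target := mean(target), by = cat_col]
# "A" becomes 0.5, "B" becomes 0.667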
200 changes: 200 additions & 0 deletions R/rowFiltering.R
@@ -0,0 +1,200 @@
#' Standard deviation outlier filtering
#'
#' Remove outliers based on standard deviation thresholds. \cr
#' Only values within \code{mean - sd * n_sigmas} and \code{mean + sd * n_sigmas} are kept.
#' @param dataSet Matrix, data.frame or data.table
#' @param cols List of numeric column(s) name(s) of dataSet to transform. To transform all
#' numeric columns, set it to "auto". (character, default to "auto")
#' @param n_sigmas number of standard deviations accepted (integer, default to 3)
#' @param verbose Should the algorithm talk? (logical, default to TRUE)
#' @details Filtering is performed column by column, meaning that extreme values from the first element
#' of \code{cols} are removed, then extreme values from the second element of \code{cols} are removed,
#' and so on. \cr
#' So if filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
#' @return Same dataset with fewer rows, edited by \strong{reference}. \cr
#' If you don't want to edit by reference, please provide \code{dataSet = copy(dataSet)}.
#' @examples
#' # Given
#' library(data.table)
#' col_vals <- runif(1000)
#' col_mean <- mean(col_vals)
#' col_sd <- sd(col_vals)
#' extrem_val <- col_mean + 6 * col_sd
#' dataSet <- data.table(num_col = c(col_vals, extrem_val))
#'
#' # When
#' dataSet <- remove_sd_outlier(dataSet, cols = "auto", n_sigmas = 3, verbose = TRUE)
#'
#' # Then extreme value is no longer in set
#' extrem_val %in% dataSet[["num_col"]] # Is false
#' @export
remove_sd_outlier <- function(dataSet, cols = "auto", n_sigmas = 3, verbose = TRUE){
## Environment
function_name <- "remove_sd_outlier"

## Sanity check
dataSet <- checkAndReturnDataTable(dataSet = dataSet)
cols <- real_cols(dataSet = dataSet, cols = cols, function_name = function_name, types = "numeric")

## Initialization
if (verbose){
printl(function_name, ": I start to filter categorical rare events")
pb <- initPB(function_name, names(dataSet))
start_time <- proc.time()
}
initial_nrow <- nrow(dataSet)

## Computation
for (col in cols){
tmp_nrow <- nrow(dataSet)
col_mean <- dataSet[, mean(get(col))]
col_sd <- dataSet[, sd(get(col))]
dataSet <- dataSet[(get(col) <= col_mean + n_sigmas * col_sd) &
(get(col) >= col_mean - n_sigmas * col_sd), ]
if (verbose){
printl(function_name, ": dropped ", tmp_nrow - nrow(dataSet), " row(s) that are rare event on ", col, ".")
setPB(pb, col)
}
}

if (verbose){
printl(function_name, ": ", initial_nrow - nrow(dataSet), " have been dropped. It took ",
round( (proc.time() - start_time)[[3]], 2), " seconds. ")
}


## Wrap-up
return(dataSet)
}
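
# Illustrative sketch (not part of the package source): as noted in @details,
# filtering several columns sequentially compounds row loss, since rows
# dropped for one column are never reconsidered for the next.
# library(data.table)
# set.seed(1)
# dataSet <- data.table(x = rnorm(1000), y = rnorm(1000))
# dataSet <- remove_sd_outlier(dataSet, cols = c("x", "y"), n_sigmas = 2)
# nrow(dataSet)  # fewer rows than filtering on "x" alone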

#' Filter rare categoricals
#'
#' Filter rows that have rare occurrences
#' @param dataSet Matrix, data.frame or data.table
#' @param cols List of column(s) name(s) of dataSet to transform. To transform all
#' columns, set it to "auto". (character, default to "auto")
#' @param threshold share of occurrences under which rows should be removed (numeric, default to 0.01)
#' @param verbose Should the algorithm talk? (logical, default to TRUE)
#' @details Filtering is performed column by column, meaning that rare values from the first element
#' of \code{cols} are removed, then rare values from the second element of \code{cols} are removed,
#' and so on. \cr
#' So if filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
#' @return Same dataset with fewer rows, edited by \strong{reference}. \cr
#' If you don't want to edit by reference, please provide \code{dataSet = copy(dataSet)}.
#' @examples
#' # Given a set with rare "C"
#' library(data.table)
#' dataSet <- data.table(cat_col = c(sample(c("A", "B"), 1000, replace=TRUE), "C"))
#'
#' # When calling function
#' dataSet <- remove_rare_categorical(dataSet, cols = "cat_col",
#' threshold = 0.01, verbose = TRUE)
#'
#' # Then there are no "C"
#' unique(dataSet[["cat_col"]])
#' @import data.table
#' @export
remove_rare_categorical <- function(dataSet, cols = "auto", threshold = 0.01, verbose = TRUE){
## Environment
function_name <- "remove_rare_categorical"

## Sanity check
dataSet <- checkAndReturnDataTable(dataSet = dataSet)
cols <- real_cols(dataSet = dataSet, cols = cols, function_name = function_name)

## Initialization
if (verbose){
printl(function_name, ": I start to filter categorical rare events")
pb <- initPB(function_name, names(dataSet))
start_time <- proc.time()
}
initial_nrow <- nrow(dataSet)

## Computation
for (col in cols){
col_val_occ <- dataSet[, .N, by = col]
acceptable_cat <- col_val_occ[get("N") >= initial_nrow * threshold, get(col)]

tmp_nrow <- nrow(dataSet)
dataSet <- dataSet[get(col) %in% acceptable_cat]

if (verbose){
printl(function_name, ": dropped ", tmp_nrow - nrow(dataSet), " row(s) that are rare event on ", col, ".")
setPB(pb, col)
}
}

if (verbose){
printl(function_name, ": ", initial_nrow - nrow(dataSet), " have been dropped. It took ",
round( (proc.time() - start_time)[[3]], 2), " seconds. ")
}

## Wrap-up
return(dataSet)
}
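
# Threshold sketch (illustrative, not part of the package source): with
# threshold = 0.01, a category must cover at least 1% of the *initial* rows
# to be kept. On the 1001 rows of the example above, a category needs
# N >= 1001 * 0.01 = 10.01, i.e. at least 11 occurrences, so the single
# "C" (N = 1) is dropped.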

#' Percentile outlier filtering
#'
#' Remove outliers based on percentiles. \cr
#' Only values within the \code{percentile}th and \code{100 - percentile}th percentiles are kept.
#' @param dataSet Matrix, data.frame or data.table
#' @param cols List of numeric column(s) name(s) of dataSet to transform. To transform all
#' numeric columns, set it to "auto". (character, default to "auto")
#' @param percentile percentile cut-off applied on each side (numeric, default to 1)
#' @param verbose Should the algorithm talk? (logical, default to TRUE)
#' @details Filtering is performed column by column, meaning that extreme values from the first element
#' of \code{cols} are removed, then extreme values from the second element of \code{cols} are removed,
#' and so on. \cr
#' So if filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
#' @return Same dataset with fewer rows, edited by \strong{reference}. \cr
#' If you don't want to edit by reference, please provide \code{dataSet = copy(dataSet)}.
#' @examples
#' # Given
#' library(data.table)
#' dataSet <- data.table(num_col = 1:100)
#'
#' # When
#' dataSet <- remove_percentile_outlier(dataSet, cols = "auto", percentile = 1, verbose = TRUE)
#'
#' # Then extreme values are no longer in set
#' 1 %in% dataSet[["num_col"]] # Is false
#' 2 %in% dataSet[["num_col"]] # Is true
#' @importFrom stats quantile
#' @export
remove_percentile_outlier <- function(dataSet, cols = "auto", percentile = 1, verbose = TRUE){
## Environment
function_name <- "remove_percentile_outlier"

## Sanity check
dataSet <- checkAndReturnDataTable(dataSet = dataSet)
cols <- real_cols(dataSet = dataSet, cols = cols, function_name = function_name, types = "numeric")

## Initialization
if (verbose){
printl(function_name, ": I start to filter percentile outliers")
pb <- initPB(function_name, names(dataSet))
start_time <- proc.time()
}
initial_nrow <- nrow(dataSet)

## Computation
for (col in cols){
tmp_nrow <- nrow(dataSet)
percentiles <- quantile(dataSet[[col]],
c(percentile / 100, (100 - percentile) / 100), na.rm = TRUE)
dataSet <- dataSet[(get(col) >= percentiles[1]) & (get(col) <= percentiles[2]), ]
if (verbose){
printl(function_name, ": dropped ", tmp_nrow - nrow(dataSet), " row(s) that are rare event on ", col, ".")
setPB(pb, col)
}
}

if (verbose){
printl(function_name, ": ", initial_nrow - nrow(dataSet), " have been dropped. It took ",
round( (proc.time() - start_time)[[3]], 2), " seconds. ")
}


## Wrap-up
return(dataSet)
}
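
# Percentile sketch (illustrative, not part of the package source): with
# percentile = 1, the cut points are quantile(x, c(0.01, 0.99)). For the
# example above, quantile(1:100, c(0.01, 0.99)) gives 1.99 and 99.01, so
# the rows holding 1 and 100 are dropped while 2 through 99 are kept.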
18 changes: 9 additions & 9 deletions README.md
@@ -30,15 +30,15 @@ Before using any machine learning (ML) algorithm, one needs to prepare its data.

Here are the functions available in this package to tackle those issues:

Correct | Transform | Filter | Pre model manipulation| Shape
--------- |----------- |-------- |----------- |------------------------
unFactor | generateDateDiffs | fastFilterVariables | fastHandleNa | shapeSet
findAndTransformDates | generateFactorFromDate | whichAreConstant | fastDiscretization | sameShape
findAndTransformNumerics | aggregateByKey | whichAreInDouble | fastScale | setAsNumericMatrix
setColAsCharacter | generateFromFactor | whichAreBijection | | one_hot_encoder
setColAsNumeric | generateFromCharacter | | |
setColAsDate | fastRound | | |
setColAsFactor | target_encode | | |
Correct | Transform | Filter | Pre model manipulation| Shape
--------- |----------- |-------- |----------- |------------------------
unFactor | generateDateDiffs | fastFilterVariables | fastHandleNa | shapeSet
findAndTransformDates | generateFactorFromDate | whichAreConstant | fastDiscretization | sameShape
findAndTransformNumerics | aggregateByKey | whichAreInDouble | fastScale | setAsNumericMatrix
setColAsCharacter | generateFromFactor | whichAreBijection | | one_hot_encoder
setColAsNumeric | generateFromCharacter | remove_sd_outlier | |
setColAsDate | fastRound | remove_rare_categorical | |
setColAsFactor | target_encode | remove_percentile_outlier | |

All of those functions are integrated in the __full pipeline__ function `prepareSet`.
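
A hedged sketch of the new row filters chained before modeling (the data and column names here are made up for illustration):

```r
library(data.table)
library(dataPreparation)

# Illustrative data: one numeric and one categorical column
dataSet <- data.table(age  = c(rnorm(1000, mean = 40, sd = 10), 200),
                      city = c(sample(c("Paris", "Lyon"), 1000, replace = TRUE), "Oslo"))

dataSet <- remove_sd_outlier(dataSet, cols = "age", n_sigmas = 3)             # drops age = 200
dataSet <- remove_percentile_outlier(dataSet, cols = "age", percentile = 1)   # trims 1% on each side
dataSet <- remove_rare_categorical(dataSet, cols = "city", threshold = 0.01)  # drops "Oslo"
```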

45 changes: 45 additions & 0 deletions man/remove_percentile_outlier.Rd


44 changes: 44 additions & 0 deletions man/remove_rare_categorical.Rd


