Statistical row filtering (#53)
* Create target encoding

* generalize target encoding to multiple columns

* Integrate target_encoding into prepareSet

* Add verbose

* Update doc

* Add functions remove_sd_outlier and remove_rare_categorical to start filtering rows

* Create remove_percentile_outlier and complete doc
ELToulemonde authored Jul 19, 2019
1 parent e92d020 commit 9868461
Showing 9 changed files with 435 additions and 9 deletions.
4 changes: 4 additions & 0 deletions NAMESPACE
@@ -24,6 +24,9 @@ export(generateFromFactor)
export(identifyDates)
export(one_hot_encoder)
export(prepareSet)
export(remove_percentile_outlier)
export(remove_rare_categorical)
export(remove_sd_outlier)
export(sameShape)
export(setAsNumericMatrix)
export(setColAsCharacter)
@@ -48,5 +51,6 @@ importFrom(progress,progress_bar)
importFrom(stats,as.formula)
importFrom(stats,contrasts)
importFrom(stats,model.matrix)
importFrom(stats,quantile)
importFrom(stats,sd)
importFrom(stringr,str_replace_all)
3 changes: 3 additions & 0 deletions NEWS
@@ -3,6 +3,9 @@ V 0.4.1
- New features:
- New functions:
- Functions *target_encode* and *build_target_encoding* have been implemented to provide target encoding, which is the process of replacing a categorical value with the aggregation of the target variable.
- Function *remove_sd_outlier* helps to remove rows that have numerical values that are too extreme.
- Function *remove_percentile_outlier* helps to remove rows that have numerical values that are too extreme (based on percentile analysis).
- Function *remove_rare_categorical* helps to remove rows that have categorical values that are too rare.
- New features in existing functions:
- Function *prepareSet* integrates the *target_encode* function. It is called by providing *target_col* and *target_encoding_functions*.

3 changes: 3 additions & 0 deletions NEWS.md
@@ -3,6 +3,9 @@ V 0.4.1
- New features:
- New functions:
- Functions *target_encode* and *build_target_encoding* have been implemented to provide target encoding, which is the process of replacing a categorical value with the aggregation of the target variable (see the sketch below).
- Function *remove_sd_outlier* helps to remove rows that have numerical values that are too extreme.
- Function *remove_percentile_outlier* helps to remove rows that have numerical values that are too extreme (based on percentile analysis).
- Function *remove_rare_categorical* helps to remove rows that have categorical values that are too rare.
- New features in existing functions:
- Function *prepareSet* integrates the *target_encode* function. It is called by providing *target_col* and *target_encoding_functions*.

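A minimal sketch of the target encoding idea using plain data.table (illustrative only, not the package's *target_encode* API; the column names here are made up):

library(data.table)
dt <- data.table(cat_col = c("A", "A", "B", "B", "B"),
                 target  = c(1, 0, 1, 1, 0))
# Replace each category by the mean of the target within that category
dt[, cat_col_mean_target := mean(target), by = cat_col]
# "A" becomes 0.5, "B" becomes 0.667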
200 changes: 200 additions & 0 deletions R/rowFiltering.R
@@ -0,0 +1,200 @@
#' Standard deviation outlier filtering
#'
#' Remove outliers based on standard deviation thresholds. \cr
#' Only values within \code{mean - sd * n_sigmas} and \code{mean + sd * n_sigmas} are kept.
#' @param dataSet Matrix, data.frame or data.table
#' @param cols List of numeric column(s) name(s) of dataSet to transform. To transform all
#' numeric columns, set it to "auto". (character, default to "auto")
#' @param n_sigmas number of standard deviations accepted (integer, default to 3)
#' @param verbose Should the algorithm talk? (logical, default to TRUE)
#' @details Filtering is performed column by column, meaning that extreme values from the first element
#' of \code{cols} are removed, then extreme values from the second element of \code{cols} are removed,
#' and so on. \cr
#' So if filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
#' @return Same dataset with fewer rows, edited by \strong{reference}. \cr
#' If you don't want to edit by reference, please provide \code{dataSet = copy(dataSet)}.
#' @examples
#' # Given
#' library(data.table)
#' col_vals <- runif(1000)
#' col_mean <- mean(col_vals)
#' col_sd <- sd(col_vals)
#' extrem_val <- col_mean + 6 * col_sd
#' dataSet <- data.table(num_col = c(col_vals, extrem_val))
#'
#' # When
#' dataSet <- remove_sd_outlier(dataSet, cols = "auto", n_sigmas = 3, verbose = TRUE)
#'
#' # Then extreme value is no longer in set
#' extrem_val %in% dataSet[["num_col"]] # Is false
#' @export
remove_sd_outlier <- function(dataSet, cols = "auto", n_sigmas = 3, verbose = TRUE){
## Environment
function_name <- "remove_sd_outlier"

## Sanity check
dataSet <- checkAndReturnDataTable(dataSet = dataSet)
cols <- real_cols(dataSet = dataSet, cols = cols, function_name = function_name, types = "numeric")

## Initialization
if (verbose){
printl(function_name, ": I start to filter categorical rare events")
pb <- initPB(function_name, names(dataSet))
start_time <- proc.time()
}
initial_nrow <- nrow(dataSet)

## Computation
for (col in cols){
tmp_nrow <- nrow(dataSet)
col_mean <- dataSet[, mean(get(col))]
col_sd <- dataSet[, sd(get(col))]
dataSet <- dataSet[(get(col) <= col_mean + n_sigmas * col_sd) &
(get(col) >= col_mean - n_sigmas * col_sd), ]
if (verbose){
printl(function_name, ": dropped ", tmp_nrow - nrow(dataSet), " row(s) that are rare event on ", col, ".")
setPB(pb, col)
}
}

if (verbose){
printl(function_name, ": ", initial_nrow - nrow(dataSet), " have been dropped. It took ",
round( (proc.time() - start_time)[[3]], 2), " seconds. ")
}


## Wrap-up
return(dataSet)
}
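
# Illustrative sketch (not part of the package source): as noted in @details,
# filtering several columns sequentially compounds row loss, since rows
# dropped for one column are never reconsidered for the next.
# library(data.table)
# set.seed(1)
# dataSet <- data.table(x = rnorm(1000), y = rnorm(1000))
# dataSet <- remove_sd_outlier(dataSet, cols = c("x", "y"), n_sigmas = 2)
# nrow(dataSet)  # fewer rows than filtering on "x" alone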

#' Filter rare categoricals
#'
#' Filter rows that have rare occurrences
#' @param dataSet Matrix, data.frame or data.table
#' @param cols List of column(s) name(s) of dataSet to transform. To transform all
#' columns, set it to "auto". (character, default to "auto")
#' @param threshold share of occurrences under which rows should be removed (numeric, default to 0.01)
#' @param verbose Should the algorithm talk? (logical, default to TRUE)
#' @details Filtering is performed column by column, meaning that rare values from the first element
#' of \code{cols} are removed, then rare values from the second element of \code{cols} are removed,
#' and so on. \cr
#' So if filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
#' @return Same dataset with fewer rows, edited by \strong{reference}. \cr
#' If you don't want to edit by reference, please provide \code{dataSet = copy(dataSet)}.
#' @examples
#' # Given a set with rare "C"
#' library(data.table)
#' dataSet <- data.table(cat_col = c(sample(c("A", "B"), 1000, replace=TRUE), "C"))
#'
#' # When calling function
#' dataSet <- remove_rare_categorical(dataSet, cols = "cat_col",
#' threshold = 0.01, verbose = TRUE)
#'
#' # Then there are no "C"
#' unique(dataSet[["cat_col"]])
#' @import data.table
#' @export
remove_rare_categorical <- function(dataSet, cols = "auto", threshold = 0.01, verbose = TRUE){
## Environment
function_name <- "remove_rare_categorical"

## Sanity check
dataSet <- checkAndReturnDataTable(dataSet = dataSet)
cols <- real_cols(dataSet = dataSet, cols = cols, function_name = function_name)

## Initialization
if (verbose){
printl(function_name, ": I start to filter categorical rare events")
pb <- initPB(function_name, names(dataSet))
start_time <- proc.time()
}
initial_nrow <- nrow(dataSet)

## Computation
for (col in cols){
col_val_occ <- dataSet[, .N, by = col]
acceptable_cat <- col_val_occ[get("N") >= initial_nrow * threshold, get(col)]

tmp_nrow <- nrow(dataSet)
dataSet <- dataSet[get(col) %in% acceptable_cat]

if (verbose){
printl(function_name, ": dropped ", tmp_nrow - nrow(dataSet), " row(s) that are rare event on ", col, ".")
setPB(pb, col)
}
}

if (verbose){
printl(function_name, ": ", initial_nrow - nrow(dataSet), " have been dropped. It took ",
round( (proc.time() - start_time)[[3]], 2), " seconds. ")
}

## Wrap-up
return(dataSet)
}
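
# Threshold sketch (illustrative, not part of the package source): with
# threshold = 0.01, a category must cover at least 1% of the *initial* rows
# to be kept. On the 1001 rows of the example above, a category needs
# N >= 1001 * 0.01 = 10.01, i.e. at least 11 occurrences, so the single
# "C" (N = 1) is dropped.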

#' Percentile outlier filtering
#'
#' Remove outliers based on percentiles. \cr
#' Only values within the \code{percentile}th and \code{100 - percentile}th percentiles are kept.
#' @param dataSet Matrix, data.frame or data.table
#' @param cols List of numeric column(s) name(s) of dataSet to transform. To transform all
#' numeric columns, set it to "auto". (character, default to "auto")
#' @param percentile percentile cut-off applied on each side (numeric, default to 1)
#' @param verbose Should the algorithm talk? (logical, default to TRUE)
#' @details Filtering is performed column by column, meaning that extreme values from the first element
#' of \code{cols} are removed, then extreme values from the second element of \code{cols} are removed,
#' and so on. \cr
#' So if filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
#' @return Same dataset with fewer rows, edited by \strong{reference}. \cr
#' If you don't want to edit by reference, please provide \code{dataSet = copy(dataSet)}.
#' @examples
#' # Given
#' library(data.table)
#' dataSet <- data.table(num_col = 1:100)
#'
#' # When
#' dataSet <- remove_percentile_outlier(dataSet, cols = "auto", percentile = 1, verbose = TRUE)
#'
#' # Then extreme values are no longer in set
#' 1 %in% dataSet[["num_col"]] # Is false
#' 2 %in% dataSet[["num_col"]] # Is true
#' @importFrom stats quantile
#' @export
remove_percentile_outlier <- function(dataSet, cols = "auto", percentile = 1, verbose = TRUE){
## Environment
function_name <- "remove_percentile_outlier"

## Sanity check
dataSet <- checkAndReturnDataTable(dataSet = dataSet)
cols <- real_cols(dataSet = dataSet, cols = cols, function_name = function_name, types = "numeric")

## Initialization
if (verbose){
printl(function_name, ": I start to filter percentile outliers")
pb <- initPB(function_name, names(dataSet))
start_time <- proc.time()
}
initial_nrow <- nrow(dataSet)

## Computation
for (col in cols){
tmp_nrow <- nrow(dataSet)
percentiles <- quantile(dataSet[[col]],
c(percentile / 100, (100 - percentile) / 100), na.rm = TRUE)
dataSet <- dataSet[(get(col) >= percentiles[1]) & (get(col) <= percentiles[2]), ]
if (verbose){
printl(function_name, ": dropped ", tmp_nrow - nrow(dataSet), " row(s) that are rare event on ", col, ".")
setPB(pb, col)
}
}

if (verbose){
printl(function_name, ": ", initial_nrow - nrow(dataSet), " have been dropped. It took ",
round( (proc.time() - start_time)[[3]], 2), " seconds. ")
}


## Wrap-up
return(dataSet)
}
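
# Percentile sketch (illustrative, not part of the package source): with
# percentile = 1, the cut points are quantile(x, c(0.01, 0.99)). For the
# example above, quantile(1:100, c(0.01, 0.99)) gives 1.99 and 99.01, so
# the rows holding 1 and 100 are dropped while 2 through 99 are kept.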
18 changes: 9 additions & 9 deletions README.md
@@ -30,15 +30,15 @@ Before using any machine learning (ML) algorithm, one needs to prepare its data.

Here are the functions available in this package to tackle those issues:

Correct | Transform | Filter | Pre model manipulation| Shape
--------- |----------- |-------- |----------- |------------------------
unFactor | generateDateDiffs | fastFilterVariables | fastHandleNa | shapeSet
findAndTransformDates | generateFactorFromDate | whichAreConstant | fastDiscretization | sameShape
findAndTransformNumerics | aggregateByKey | whichAreInDouble | fastScale | setAsNumericMatrix
setColAsCharacter | generateFromFactor | whichAreBijection | | one_hot_encoder
setColAsNumeric | generateFromCharacter | | |
setColAsDate | fastRound | | |
setColAsFactor | target_encode | | |
Correct | Transform | Filter | Pre model manipulation| Shape
--------- |----------- |-------- |----------- |------------------------
unFactor | generateDateDiffs | fastFilterVariables | fastHandleNa | shapeSet
findAndTransformDates | generateFactorFromDate | whichAreConstant | fastDiscretization | sameShape
findAndTransformNumerics | aggregateByKey | whichAreInDouble | fastScale | setAsNumericMatrix
setColAsCharacter | generateFromFactor | whichAreBijection | | one_hot_encoder
setColAsNumeric | generateFromCharacter | remove_sd_outlier | |
setColAsDate | fastRound | remove_rare_categorical | |
setColAsFactor | target_encode | remove_percentile_outlier | |

All of those functions are integrated in the __full pipeline__ function `prepareSet`.
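
A hedged sketch of the new row filters chained before modeling (the data and column names here are made up for illustration):

```r
library(data.table)
library(dataPreparation)

# Illustrative data: one numeric and one categorical column
dataSet <- data.table(age  = c(rnorm(1000, mean = 40, sd = 10), 200),
                      city = c(sample(c("Paris", "Lyon"), 1000, replace = TRUE), "Oslo"))

dataSet <- remove_sd_outlier(dataSet, cols = "age", n_sigmas = 3)             # drops age = 200
dataSet <- remove_percentile_outlier(dataSet, cols = "age", percentile = 1)   # trims 1% on each side
dataSet <- remove_rare_categorical(dataSet, cols = "city", threshold = 0.01)  # drops "Oslo"
```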

45 changes: 45 additions & 0 deletions man/remove_percentile_outlier.Rd


44 changes: 44 additions & 0 deletions man/remove_rare_categorical.Rd


