31 Aug

Merging two data frames while keeping the original row order

Hello GIStouille,

Here is a handy function for merging two data frames while keeping the original row order (source).
keep_order=1 keeps the row order of the first data frame (x).
keep_order=2 keeps the row order of the second data frame (y).
If keep_order is left out, the function behaves like a plain merge().

merge.with.order <- function(x, y, ..., sort = TRUE, keep_order)
{
  # this function works just like merge, only that it adds the option to return the merged data.frame ordered by x (1) or by y (2)
  add.id.column.to.data <- function(DATA)
  {
    data.frame(DATA, id... = seq_len(nrow(DATA)))
  }
  # add.id.column.to.data(data.frame(x = rnorm(5), x2 = rnorm(5)))
  order.by.id...and.remove.it <- function(DATA)
  {
    # gets in a data.frame with the "id..." column.  Orders by it and returns it
    if(!any(colnames(DATA)=="id...")) stop("The function order.by.id...and.remove.it only works with data.frame objects which includes the 'id...' order column")
    ss_r <- order(DATA$id...)
    ss_c <- colnames(DATA) != "id..."
    DATA[ss_r, ss_c]
  }
  # tmp <- function(x) x==1; 1	# why we must check what to do if it is missing or not...
  # tmp()
  if(!missing(keep_order))
  {
    if(keep_order == 1) return(order.by.id...and.remove.it(merge(x=add.id.column.to.data(x),y=y,..., sort = FALSE)))
    if(keep_order == 2) return(order.by.id...and.remove.it(merge(x=x,y=add.id.column.to.data(y),..., sort = FALSE)))
    # if you didn't get "return" by now - issue a warning.
    warning("The function merge.with.order only accepts NULL/1/2 values for the keep_order variable")
  } else {return(merge(x=x,y=y,..., sort = sort))}
}
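
The trick behind the function can also be sketched in a few lines without the wrapper: tag each row of x with its position, merge with sort = FALSE, then reorder by that tag and drop it. The data frames x and y and the helper column name .row_id below are made up for illustration; this is a minimal sketch of the idea, not a replacement for the function.

```r
# Hypothetical example data: x is deliberately not sorted by key
x <- data.frame(key = c("c", "a", "b"), val_x = 1:3, stringsAsFactors = FALSE)
y <- data.frame(key = c("a", "b", "c"), val_y = c(10, 20, 30), stringsAsFactors = FALSE)

# Step 1: remember the original row position of x (helper column name is arbitrary)
x$.row_id <- seq_len(nrow(x))

# Step 2: merge without sorting on the key
merged <- merge(x, y, by = "key", sort = FALSE)

# Step 3: restore the order of x, then drop the helper column
merged <- merged[order(merged$.row_id), names(merged) != ".row_id", drop = FALSE]
rownames(merged) <- NULL

print(merged)  # the key column reads c, a, b, as in x
```

Reordering by the helper column is what restores the order: merge() with sort = FALSE alone leaves the rows in an unspecified order.
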
27 Nov

How to randomly split a data frame or a vector into training and test datasets?

GIS là,

Here is a function that does it. Set the seed if you want to reproduce exactly the same split several times.
prop sets the proportion of the data that is sampled for the training set; the rest goes to the test set.
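
Before the full function, the steps can be sketched inline on a built-in dataset. The choice of iris, the seed value 42 and prop = 0.75 are assumptions made only for this illustration.

```r
# Built-in iris data (150 rows) stands in for your own data frame
set.seed(42)   # fix the seed so the split is reproducible
prop <- 0.75   # assumed proportion of rows that go to the training set

n <- nrow(iris)
# Draw the training rows without replacement; trunc() rounds the size down
train_idx <- sample(seq_len(n), trunc(n * prop))

train <- iris[train_idx, , drop = FALSE]
test  <- iris[-train_idx, , drop = FALSE]

nrow(train)  # 112, i.e. trunc(150 * 0.75)
nrow(test)   # 38, the remaining rows
```

Every row ends up in exactly one of the two sets, and rerunning the block with the same seed gives the same split.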

#' Splitdf splits a data frame into a training sample and a test sample with a given proportion
#' This function takes a data frame and, according to the predefined proportion "prop", returns a training and a test sample
#' @param input an n x p data frame of n observations and p variables, or a vector
#' @param seed the seed to set in order to ensure reproducibility of the split (NULL leaves the random number generator untouched)
#' @param prop the proportion of the training sample, between 0 and 1
#' @return a list with two slots: trainset and testset
#' @author BlackGuru
#' @details
#' This function takes a data frame or a vector and according to predefined proportion "prop" it will return a training and a test sample. "prop" corresponds to the proportion of the training sample.
#' @export
splitdf <- function(input, prop=0.5, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  if (is.data.frame(input)){
    index <- 1:nrow(input)
    trainindex <- sample(index, trunc(length(index)*prop))
    trainset <- input[trainindex, ]
    testset <- input[-trainindex, ]
  }else if (is.vector(input)){
    index <- 1:length(input)
    trainindex <- sample(index, trunc(length(index)*prop))
    trainset <- input[trainindex]
    testset <- input[-trainindex]
  }else{
    stop("Input must be a dataframe or a vector")
  }
  list(trainset=trainset, testset=testset)
}