---
title: Extraction of numbers from a character string
date : 2013-12-01
--- &lead

```{r setup0, echo=FALSE}
knitr::opts_chunk$set(fig.path="assets/fig/Numextract-")
```

Sometimes you get data like:

```{r, echo=FALSE}
set.seed(666)
n <- 10
# simulated data: the numerical values are pasted together with their units
dat <- data.frame(
  concentration = paste0(round(rlnorm(n),2), "mL"),
  temperature = paste0(rpois(n,10)-5, " C"),
  pH = rpois(n,14)/2
)
dat
```

and you'd like to separate the numerical values from the units in columns such as `concentration` or `temperature`. @BondedDust and @tcash21 kindly provided me some help on [stackoverflow](http://stackoverflow.com/questions/19252663/extracting-decimal-numbers-from-a-string) to achieve such a task.

## Extract numbers from a character string

I'm using the following function to extract the numerical values:

```{r}
library(stringr)
numextract <- function(string){
  # first match of: optional minus sign(s), digits, optional dot(s), optional digits
  str_extract(string, "\\-*\\d+\\.*\\d*")
}
```

For example:

```{r}
numextract("30.5ml")
numextract(">2g")
```

The function is vectorized:

```{r}
numextract(c("30.5ml", "37ml"))
```

It also extracts negative numbers:

```{r}
numextract("-3°C")
```

If you want to ignore the minus sign, remove the leading `\\-*` from the pattern in `str_extract(string, "\\-*\\d+\\.*\\d*")`.

If there are multiple numbers in the character string, it only extracts the first one:

```{r}
numextract("between 2ml and 3ml")
```

(yes, I have already received such data in my life).

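
As a side note on the pattern used by `numextract`: `\\-*` and `\\.*` accept any number of minus signs and dots, so a string such as `"1..5"` is matched as a whole. Here is a minimal sketch of a stricter variant, assuming that at most one minus sign and one decimal point are wanted. The name `numextract_strict` is hypothetical and not part of the original function.

```{r}
library(stringr)

# Hypothetical stricter variant of numextract (an assumption, not the
# original function): at most one leading minus sign, and a decimal
# point is kept only when digits follow it.
numextract_strict <- function(string){
  str_extract(string, "-?\\d+(\\.\\d+)?")
}

strict <- numextract_strict(c("30.5ml", "-3°C", ">2g", "pH 7."))
strict
```

Whether the looser or the stricter pattern is preferable depends on how messy the incoming data is.
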
To extract all the numbers from a string, use this function instead of `numextract`:

```{r}
Numextract <- function(string){
  # gregexpr locates every match; regmatches returns the matched substrings
  unlist(regmatches(string,gregexpr("[[:digit:]]+\\.*[[:digit:]]*",string)))
}
Numextract("between 2ml and 3ml")
```

Note that this pattern has no minus sign, so negative numbers are returned without their sign.

## Separate numbers and units

Consider the problem from the introduction:

```{r}
dat
```

Extract the numerical values of the `concentration` column:

```{r}
( concentration <- numextract(dat$concentration) )
```

Sometimes the units differ from one row to another, and we want to get them too:

```{r}
# each extracted number is used as the split pattern for its own row
( unit <- str_split_fixed(dat$concentration, concentration, n=2)[,2] )
```

It is good practice to check that we are able to reconstruct the original column:

```{r}
all(paste0(concentration, unit) == dat$concentration)
```

Now we'd like to replace the original column with these two columns:

```{r}
( concentration <- data.frame(concentration, unit) )
```

To do so, I'm using the following function:

```{r}
df_replace <- function(dat, col, newcol){
  # replace the column `col` of `dat` with `newcol` (a vector or a data frame),
  # keeping the position of the original column
  if(!(col %in% names(dat))) stop(paste0("column ", col, " is not in ", deparse(substitute(dat))))
  if(is.vector(newcol)){
    # a plain vector becomes a one-column data frame named like the original column
    newcol <- data.frame(newcol, stringsAsFactors=FALSE)
    names(newcol) <- col
  }
  E <- which(names(dat) == col)  # position of the column to replace
  if(E==1) return(cbind(newcol, dat[,(E+1):length(dat),drop=FALSE]))
  if(E==length(dat)) return(cbind(dat[,1:(E-1),drop=FALSE], newcol))
  else return(cbind(dat[,1:(E-1),drop=FALSE], newcol, dat[,(E+1):length(dat),drop=FALSE]))
}
```

like this:

```{r}
df_replace(dat, "concentration", concentration)
```

## Handling character data

Assume you have a long dataset made of columns in character mode only (see next section about the import of Excel files in R). Even the columns containing numerical values are given in character mode, and you need an automatic way to convert these columns to numeric mode.

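
To illustrate why a check is needed before converting, here is a small sketch using base R only. The vectors `clean` and `censored` are made-up examples. `as.numeric` converts what it can and turns every other entry into `NA` with a warning, so a censored value such as `"<3.2"` would be lost.

```{r}
# Made-up example vectors (not taken from the data above)
clean    <- c("1.5", "3.2")
censored <- c("1.5", "<3.2")

# every entry is a plain number: the conversion is lossless
converted_clean <- as.numeric(clean)
converted_clean

# "<3.2" cannot be parsed and becomes NA; the warning is silenced here
converted_censored <- suppressWarnings(as.numeric(censored))
converted_censored
```
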
First, I'm using the following function to check whether a column can be safely converted to numeric mode:

```{r}
isNumeric <- function(string){
  ## test whether a character vector contains numbers only
  string <- as.character(string)  # so that factors are handled like character vectors
  x <- numextract(string)
  notNA <- which(!is.na(x))  # entries from which a number could be extracted
  if(length(notNA)==0) FALSE else all(x[notNA]==str_trim(string[notNA]))
}
```

Entries from which no number can be extracted (missing values or empty strings, for instance) are ignored by the test.

```{r}
x <- c("1.5", "3.2")
isNumeric(x)
x <- factor(x)
isNumeric(x)
x <- c("1.5", "3.2 ")
isNumeric(x)
x <- c("1.5", "<3.2")
isNumeric(x)
```

If `isNumeric(x)` returns `TRUE`, then you can safely use `as.numeric(x)` to convert `x` if it is a character vector, *but* you have to use `as.numeric(as.character(x))` if `x` is a factor. You can simply use `as.numeric(as.character(x))` in both cases.

Based on `isNumeric`, the function `Numerize` below takes as input a dataframe containing only character or factor columns, and returns the same dataframe with numerical columns when possible.

```{r}
Numerize <- function(dat, factors=TRUE){
  # with factors=FALSE, factor columns are left untouched
  factorcols <- if(factors) FALSE else vapply(dat, function(col) is.factor(col), TRUE)
  numcols <- names(dat)[which(vapply(dat, function(col) isNumeric(col), TRUE) & !factorcols)]
  # `<<-` modifies the copy of `dat` that is local to Numerize
  sapply(numcols, function(col) invisible(dat[[col]] <<- as.numeric(as.character(dat[[col]]))))
  return(dat)
}
```

For example:

```{r}
Numerize(data.frame(x=c("1.5", "3.2"), y=c("1.5", "<3.2")))
```
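
The same idea can be sketched without the `<<-` assignment by looping over the eligible columns. This is only an illustration: `numerize2` and `is_numeric_like` are hypothetical names, and `is_numeric_like` is a base-R stand-in for `isNumeric` so that the block runs on its own. It is stricter than `isNumeric`, since an entry containing no number at all makes the column non-numeric instead of being ignored.

```{r}
# Stand-in for isNumeric (an assumption, not the function above):
# a column qualifies when every non-missing, non-empty entry is a plain
# decimal number once surrounding whitespace is removed.
is_numeric_like <- function(x){
  x <- trimws(as.character(x))
  x <- x[!is.na(x) & nzchar(x)]
  length(x) > 0 && all(grepl("^-?[0-9]+(\\.[0-9]*)?$", x))
}

# Hypothetical variant of Numerize using a plain loop instead of `<<-`
numerize2 <- function(dat, factors=TRUE){
  convertible <- vapply(dat, is_numeric_like, logical(1))
  # with factors=FALSE, factor columns are left untouched
  if(!factors) convertible <- convertible & !vapply(dat, is.factor, logical(1))
  for(col in names(dat)[convertible]){
    dat[[col]] <- as.numeric(as.character(dat[[col]]))
  }
  dat
}

out <- numerize2(data.frame(x=c("1.5", "3.2"), y=c("1.5", "<3.2"),
                            stringsAsFactors=FALSE))
str(out)
```

Here the column `x` becomes numeric while `y` stays in character mode because of the `"<3.2"` entry.
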