r - Alternative for BAF calculations loop -
i working in loop 'b allele frequency' (baf) calculations on data frame of genetic markers. based on mathematical logic behind baf concept tried write code perform it, however, inefficient.
my input:
theta <- 'probe sample1 sample2 sample3 sample4 sample5 aam abm bbm ax-1 0.674 0.756 0.694 0.671 0.754 0.167 0.281 0.671 ax-2 0.117 0.907 0.501 0.904 0.548 0.116 0.506 0.903 ax-3 0.068 0.075 0.071 0.208 0.038 0.06 0.445 0.846' theta <- read.table(text=theta, header=t) my script:
theta.split <- split(theta, 1:nrow(theta)) for(k in 1:length(theta.split)){ thetax <- as.data.frame(theta.split[[k]]) for(i in 2:(ncol(thetax)-3)){ if(as.numeric(as.character(thetax[1,i])) < as.numeric(as.character(thetax$aam))){ thetax[1,i] <- 0} if(as.numeric(as.character(thetax[1,i])) >= as.numeric(as.character(thetax$aam)) && as.numeric(as.character(thetax[1,i])) < as.numeric(as.character(thetax$abm))){ thex <- as.numeric(as.character(thetax[1,i])) theaa <- as.numeric(as.character(thetax$aam)) theab <- as.numeric(as.character(thetax$abm)) bafx <- ((0.5)*(thex - theaa))/(theab - theaa) thetax[1,i] <- bafx} if(as.numeric(as.character(thetax[1,i])) >= as.numeric(as.character(thetax$abm)) && as.numeric(as.character(thetax[1,i])) < as.numeric(as.character(thetax$bbm))){ thex <- as.numeric(as.character(thetax[1,i])) theab <- as.numeric(as.character(thetax$abm)) thebb <- as.numeric(as.character(thetax$bbm)) bafx <- 0.5 + ((0.5)*(thex-theab)/(thebb-theab)) thetax[1,i] <- bafx} if(as.numeric(as.character(thetax[1,i])) >= as.numeric(as.character(thetax$bbm))){ thetax[1,i] <- 1} } theta[k,] <- thetax } out <- theta my expected output:
out <- 'probe sample1 sample2 sample3 sample4 sample5 aam abm bbm ax-1 1.000 1.000 1.000 1.000 1.000 0.167 0.281 0.671 ax-2 0.001 1.000 0.493 1.000 0.552 0.116 0.506 0.903 ax-3 0.010 0.019 0.014 0.192 0.000 0.06 0.445 0.846' out <- read.table(text=out, header=t) i grateful ideas make code smarter.
you can take advantage of both apply , vectorised calculations avoid loops. following takes on third of time:
library(dplyr) #take main code in loops out function #using vectorised logical calcs instead of if statements #samplevec vector , thetadf original theta dataframe bafxfn <- function(samplevec, thetadf) { testaam <- samplevec < thetadf$aam samplevec <- samplevec * (1 - testaam) testaamabm <- (samplevec >= thetadf$aam) * (samplevec < thetadf$abm) bafx <- ((0.5) * (samplevec - thetadf$aam)) / (thetadf$abm - thetadf$aam) samplevec <- testaamabm * bafx + (1 - testaamabm) * samplevec testabmbbm <- (samplevec >= thetadf$abm) * (samplevec < thetadf$bbm) bafx <- 0.5 + ((0.5) * (samplevec - thetadf$abm)) / (thetadf$bbm - thetadf$abm) samplevec <- testabmbbm * bafx + (1 - testabmbbm) * samplevec testbbm <- samplevec >= thetadf$bbm samplevec <- testbbm * 1 + (1 - testbbm) * samplevec samplevec } #subset original data frame leave sample columns (using dplyr's select function) sampledf <- theta %>% select(-probe, -aam, -abm, -bbm) #use apply loop through columns of remaining data #passing columns in vectors outsampledf <- sampledf %>% apply(2, bafxfn, thetadf = theta) %>% as.data.frame() #and bind results (using dplyr's bind_cols) outdf <- bind_cols( theta %>% select(probe), outsampledf, theta %>% select(aam, abm, bbm) ) there may neater way deal of subsetting tried generalise in case you've more 5 sample columns.
outdf source: local data frame [3 x 9] probe sample1 sample2 sample3 sample4 sample5 aam abm bbm (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) 1 ax-1 1.000000000 1.00000000 1.00000000 1.0000000 1.0000000 0.167 0.281 0.671 2 ax-2 0.001282051 1.00000000 0.49358974 1.0000000 0.5528967 0.116 0.506 0.903 3 ax-3 0.010389610 0.01948052 0.01428571 0.1922078 0.0000000 0.060 0.445 0.846
Comments
Post a Comment