regex - Why does strsplit use positive lookahead and lookbehind assertion matches differently? -


Common sense and a sanity check using gregexpr() indicate that the look-behind and look-ahead assertions below should each match at exactly one location in teststring:

    teststring <- "text xx text"
    bb <- "(?<= xx )"   # look-behind
    ff <- "(?= xx )"    # look-ahead

    as.vector(gregexpr(bb, teststring, perl=TRUE)[[1]])
    # [1] 9
    as.vector(gregexpr(ff, teststring, perl=TRUE)[[1]][1])
    # [1] 5

strsplit(), however, uses those match locations differently: it splits teststring at one location when using the lookbehind assertion, but at two locations (the second of which seems incorrect) when using the lookahead assertion.

    strsplit(teststring, bb, perl=TRUE)
    # [[1]]
    # [1] "text xx " "text"

    strsplit(teststring, ff, perl=TRUE)
    # [[1]]
    # [1] "text"    " "       "xx text"

I have two questions: (Q1) What's going on here? And (Q2) how can one get strsplit() to be better behaved?


Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).

I am not sure whether this qualifies as a bug, because I believe this is the expected behaviour based on the R documentation. From ?strsplit:

The algorithm applied to each input string is

    repeat {
        if the string is empty
            break.
        if there is a match
            add the string to the left of the match to the output.
            remove the match and all to the left of it.
        else
            add the string to the output.
            break.
    }

Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
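A quick way to see the note in practice is to split a string that both begins and ends with the delimiter. This is only an illustrative check with a made-up input, not part of the quoted documentation:

```r
# Delimiter at the start yields a leading "", delimiter at the end yields nothing extra
edge_case <- strsplit(",a,", ",")[[1]]
print(edge_case)
```

The leading empty string appears, but no trailing one does.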

The problem is that lookahead (and lookbehind) assertions are zero-length. For example, in this case:

    ff <- "(?=funky)"
    teststring <- "take me to funky town"

    gregexpr(ff, teststring, perl=TRUE)
    # [[1]]
    # [1] 12
    # attr(,"match.length")
    # [1] 0
    # attr(,"useBytes")
    # [1] TRUE

    strsplit(teststring, ff, perl=TRUE)
    # [[1]]
    # [1] "take me to " "f"           "unky town"

What happens is that the lonely lookahead (?=funky) matches at position 12. So the first split includes the string up to position 11 (left of the match), and that part is removed from the string together with the match, which, however, has zero length.

Now the remaining string is "funky town", and the lookahead matches at position 1. However, there is nothing to remove, because there is nothing to the left of the match, and the match itself has zero length. So the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when calling strsplit with an empty regex (when the argument is split=""). After this the remaining string is "unky town", which is returned as the last split since there is no further match.

Lookbehinds are no problem, because each match is split and removed from the remaining string, so the algorithm never gets stuck.
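The walk-through above can be replayed with a small R sketch. It is only an approximation of the documented loop plus the single-character step described above, not R's actual implementation, and the function name split_sketch is made up for illustration:

```r
# Hypothetical re-implementation of the documented strsplit loop (illustration only)
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    if (m == -1) {                 # no match: the rest is the last piece
      out <- c(out, x)
      break
    }
    start <- as.integer(m)
    end <- start + attr(m, "match.length") - 1L  # equals start - 1 for a zero-length match
    if (end >= 1L) {
      # add what is left of the match, then drop it together with the match
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # zero-length match at position 1: nothing to remove, so peel off one character
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

print(split_sketch("take me to funky town", "(?=funky)"))
print(split_sketch("text xx text", "(?<= xx )"))
```

On these inputs the sketch gives the same pieces as the strsplit() calls shown earlier, which is consistent with the explanation but does not prove that R works this way internally.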

Admittedly this behaviour looks weird at first glance. Behaving otherwise, however, would violate the assumption that lookaheads have zero length. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.
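As for (Q2), one workaround that follows from this explanation (a sketch, not something stated in the answer above) is to prefix the lookahead with (?<=.). Because strsplit() searches the remaining string afresh after each split, this extra lookbehind should keep the pattern from matching at the very first character of what is left:

```r
teststring <- "text xx text"
# "(?<=.)" requires at least one character before the split point
fixed <- strsplit(teststring, "(?<=.)(?= xx )", perl = TRUE)[[1]]
print(fixed)
```

With this pattern the string is split only once, just before " xx ".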

