regex - Why does strsplit use positive lookahead and lookbehind assertion matches differently?
Common sense and a sanity check using gregexpr() indicate that the look-behind and look-ahead assertions below should each match at exactly one location in teststring:

    teststring <- "text xx text"
    bb <- "(?<= xx )"
    ff <- "(?= xx )"

    as.vector(gregexpr(bb, teststring, perl=TRUE)[[1]])
    # [1] 9
    as.vector(gregexpr(ff, teststring, perl=TRUE)[[1]][1])
    # [1] 5

strsplit(), however, uses those match locations differently: it splits teststring at one location when using the lookbehind assertion, but at two locations (the second of which seems incorrect) when using the lookahead assertion.

    strsplit(teststring, bb, perl=TRUE)
    # [[1]]
    # [1] "text xx " "text"

    strsplit(teststring, ff, perl=TRUE)
    # [[1]]
    # [1] "text"    " "       "xx text"

I have two questions: (Q1) What's going on here? And (Q2) how can one make strsplit() better behaved?
Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).
I am not sure whether this qualifies as a bug, because I believe it is the expected behaviour based on the R documentation. From ?strsplit:
The algorithm applied to each input string is

    repeat {
        if the string is empty
            break.
        if there is a match
            add the string to the left of the match to the output.
            remove the match and all to the left of it.
        else
            add the string to the output.
            break.
    }

Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
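The note at the end of that quote is easy to check with an ordinary, non-zero-length separator. The snippet below is only a small illustration; the input strings and variable names are made up for this example:

```r
# Separator at the beginning: the first element of the output is "".
at_start <- strsplit(",a,b", ",")[[1]]

# Separator at the end: same result as if the trailing separator were removed.
at_end      <- strsplit("a,b,", ",")[[1]]
no_trailing <- strsplit("a,b",  ",")[[1]]

print(at_start)
print(at_end)
print(identical(at_end, no_trailing))
```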
The problem is that lookahead (and lookbehind) assertions are zero-length. For example, in this case:
    ff <- "(?=funky)"
    teststring <- "take me to funky town"

    gregexpr(ff, teststring, perl=TRUE)
    # [[1]]
    # [1] 12
    # attr(,"match.length")
    # [1] 0
    # attr(,"useBytes")
    # [1] TRUE

    strsplit(teststring, ff, perl=TRUE)
    # [[1]]
    # [1] "take me to " "f"           "unky town"

What happens is that the lonely lookahead (?=funky) matches at position 12. So the first split contains the string up to position 11 (everything to the left of the match), and that part is removed from the string together with the match, which, however, has zero length.
Now the remaining string is funky town, and the lookahead matches at position 1. However, there is nothing to remove, because there is nothing to the left of the match and the match itself has zero length. Taken literally, the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when calling strsplit() with an empty regex (that is, with argument split=""). After this, the remaining string is unky town, which is returned as the last split since there is no further match.
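The walk-through above can be reproduced with a small simulation of the documented loop. This is only a sketch, not R's actual implementation: split_sketch() is a hypothetical helper, and the one-character fallback for an empty match at position 1 is an assumption inferred from the observed output.

```r
# Hypothetical helper mimicking the documented loop for a single string.
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m   <- regexpr(pattern, x, perl = TRUE)
    pos <- as.integer(m)              # start of the first match, or -1
    len <- attr(m, "match.length")    # 0 for a bare lookaround
    if (pos == -1) {
      out <- c(out, x)                # no match: emit the rest and stop
      break
    }
    end <- pos + len - 1              # last character consumed by the match
    if (end > 0) {
      # Emit what lies left of the match, then drop it along with the match.
      out <- c(out, substr(x, 1, pos - 1))
      x   <- substr(x, end + 1, nchar(x))
    } else {
      # Empty match at position 1: assumed fallback, split off one character.
      out <- c(out, substr(x, 1, 1))
      x   <- substr(x, 2, nchar(x))
    }
  }
  out
}

sketch_result <- split_sketch("take me to funky town", "(?=funky)")
real_result   <- strsplit("take me to funky town", "(?=funky)", perl = TRUE)[[1]]
print(sketch_result)
print(identical(sketch_result, real_result))
```

The second branch is what produces the stray one-character piece: without it, the loop would never advance past a zero-length match at the start of the remaining string.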
Lookbehinds are no problem, because the text before each match is split off and removed from the remaining string, so the algorithm never gets stuck.
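That observation suggests one possible remedy for (Q2), sketched here rather than offered as the definitive fix: put a one-character lookbehind, (?<=.), in front of the lookahead. The assumption is that once the left part has been removed, the combined pattern can no longer match at position 1 of the remaining string, because nothing precedes it there. The name ff_fixed is made up for this example.

```r
teststring <- "text xx text"

# Lookahead alone: matches again at position 1 of the remainder,
# so one extra single-character piece is split off.
ff <- "(?= xx )"
plain <- strsplit(teststring, ff, perl = TRUE)[[1]]

# Lookahead guarded by a one-character lookbehind (assumed remedy):
# the split point must have at least one character before it.
ff_fixed <- "(?<=.)(?= xx )"
guarded <- strsplit(teststring, ff_fixed, perl = TRUE)[[1]]

print(plain)    # "text" " " "xx text"
print(guarded)  # "text" " xx text"
```

Note that with this guard a match at the very start of the original string is also suppressed, which may or may not be what a given use case wants.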
Admittedly, this behaviour looks weird at first glance. Behaving otherwise, however, would violate the assumption that lookaheads have zero length. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.