regex - Replacing unicode brackets in python -

August 15, 2015

How can I pad Unicode brackets with spaces?

When I tried to use re.sub, I got an sre_constants.error:

>>> import re
>>> open_punct = ur'([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝（［｛｟｢'
>>> text = u'this weird ❴sentence ⟅with crazy ⟦punctuations sprinkled⟨'
>>> re.sub(open_punct, ur'\1 ', text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: unexpected end of regular expression

Why did this happen? Why is there an unexpected end of regular expression?

When I tried using re.escape, it didn't raise an error, but re.sub didn't pad the punctuation with spaces:

>>> re.sub(re.escape(open_punct), ur'\1 ', text)
u'this weird \u2774sentence \u27c5with crazy \u27e6punctuations sprinkled\u27e8'
>>> print re.sub(re.escape(open_punct), ur'\1 ', text)
this weird ❴sentence ⟅with crazy ⟦punctuations sprinkled⟨

I'd expect a regex solution to be more efficient than this loop:

>>> for p in open_punct:
...     text = text.replace(p, p+' ')
...
>>> text
u'this weird \u2774 sentence \u27c5 with crazy \u27e6 punctuations sprinkled\u27e8 '
>>> print text
this weird ❴ sentence ⟅ with crazy ⟦ punctuations sprinkled⟨
>>> open_punct
u'([{\u0f3a\u0f3c\u169b\u201a\u201e\u2045\u207d\u208d\u2329\u2768\u276a\u276c\u276e\u2770\u2772\u2774\u27c5\u27e6\u27e8\u27ea\u27ec\u27ee\u2983\u2985\u2987\u2989\u298b\u298d\u298f\u2991\u2993\u2995\u2997\u29d8\u29da\u29fc\u2e22\u2e24\u2e26\u2e28\u3008\u300a\u300c\u300e\u3010\u3014\u3016\u3018\u301a\u301d\ufd3e\ufe17\ufe35\ufe37\ufe39\ufe3b\ufe3d\ufe3f\ufe41\ufe43\ufe47\ufe59\ufe5b\ufe5d\uff08\uff3b\uff5b\uff5f\uff62'
>>> print open_punct
([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝（［｛｟｢

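[ and ( have special meaning in a regular expression, so the parser goes looking for their ] and ) counterparts and hits the end of the pattern without finding them.

If you meant open_punct to be a character group, you'd enclose the characters in [..] anyway, at which point both ( and [ can be included unescaped. With re.escape() alone, your 'expression' only matches text that contains all of those characters in the order given, which is why nothing was replaced.

Since you expect to reference a capturing group (\1), add parentheses: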
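Both the failure and the character-group fix can be checked directly. The following is a minimal sketch in Python 3 syntax (an assumption; the sessions in this post are Python 2), and `sample` is a shortened, hypothetical stand-in for the full open_punct string:

```python
import re

sample = "([{"  # shortened stand-in for open_punct (assumption)

# As a bare pattern, "(" opens a group and "[" opens a character class.
# Neither is ever closed, so the pattern cannot be compiled.
try:
    re.compile(sample)
    compiled_bare = True
except re.error:
    compiled_bare = False

# Wrapped in [...], the same characters are treated as plain literals.
char_class = re.compile("[" + sample + "]")
found = char_class.findall("a(b[c{d")
```

Here `compiled_bare` ends up False, while `found` collects each of the three bracket characters from the test string.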

>>> re.sub(u'([{}])'.format(open_punct), ur'\1 ', text)
u'this weird \u2774 sentence \u27c5 with crazy \u27e6 punctuations sprinkled\u27e8 '
>>> print re.sub(u'([{}])'.format(open_punct), ur'\1 ', text)
this weird ❴ sentence ⟅ with crazy ⟦ punctuations sprinkled⟨
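If the parentheses would exist only to feed the replacement, the replacement string can refer to the whole match with \g<0> instead, so no capturing group is needed. This is a side note rather than part of the original answer; the sketch uses Python 3 syntax (an assumption) and `brackets` is a shortened, hypothetical subset of open_punct:

```python
import re

brackets = "❴⟅⟦⟨"  # shortened, hypothetical subset of open_punct
text = "this weird ❴sentence ⟅with crazy ⟦punctuations sprinkled⟨"

# No capturing group: \g<0> in the replacement is the entire match.
pattern = re.compile("[" + re.escape(brackets) + "]")
padded = pattern.sub(r"\g<0> ", text)
```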

Note that using re.escape() is still a good idea, in case you have a - or ] character, or a \[group] sequence, among the characters you want to match. Inside a group, - defines a range of characters (0-9 for digits), ] ends the group, and \d, \w, \s, etc. stand for pre-defined character groups:

re.sub(u'([{}])'.format(re.escape(open_punct)), ur'\1 ', text)
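To see what escaping changes inside a character group, and to wrap the substitution in something reusable, here is a sketch in Python 3 syntax (an assumption; the post itself uses Python 2). The helper name `pad_after` and the short character lists are hypothetical, chosen only for illustration:

```python
import re

def pad_after(chars, text):
    """Insert one space after every occurrence of any character in chars."""
    # re.escape neutralises -, ] and backslashes so that every character
    # in chars is taken literally inside the [...] group.
    pattern = re.compile("([" + re.escape(chars) + "])")
    return pattern.sub(r"\1 ", text)

# Unescaped, "(-[" inside a group is a range from "(" to "[",
# which also covers "A" and "-".
unescaped = re.findall("[(-[]", "(A-[")

# Escaped, it is just the three literal characters.
escaped = re.findall("[" + re.escape("(-[") + "]", "(A-[")

result = pad_after("([{❴⟅", "f(x) ❴y")
```

Compiling the pattern once inside a helper like this also keeps the character list in one place if the same padding has to be applied to many strings.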
