I recently came across an odd quirk in Python’s split-by-regular-expression function.
The re module has a function, split, which takes a regex and a string and
splits the string by occurrences of the regex, returning a list of substrings.
Ordinarily it works how you might expect:
>>> import re
>>> re.split( "[A-Z]", "HowNowBrownCow" )
['', 'ow', 'ow', 'rown', 'ow']
Here I’m splitting the string “HowNowBrownCow” by capital letters. The substrings between the regex occurrences are returned including, in this case, an empty string before the ‘H’.
But then we come to this example:
>>> re.split( "(No|Co)", "HowNowBrownCow" )
['How', 'No', 'wBrown', 'Co', 'w']
Here I’m splitting the same string by occurrences of either “No” or “Co”. This time,
split has decided to return the separator strings along with the other
substrings.
Why? It turns out this is documented behaviour: if you include any capturing
groups in your separator regex, then split gives these back to you in the
result, dumping them in sequence into the returned list. Here’s another example:
>>> re.split( "(o)(w)", "HowNowBrownCow" )
['H', 'o', 'w', 'N', 'o', 'w', 'Br', 'o', 'w', 'nC', 'o', 'w', '']
Um… thanks a bunch, split.
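To make the layout of that returned list concrete, here is a small sketch using the earlier “No”/“Co” pattern. The variable names are mine, purely for illustration. With exactly one capturing group, the in-between pieces land at even indices and the captured separators at odd ones:

```python
import re

# One capturing group: pieces and separators alternate in the result.
parts = re.split("(No|Co)", "HowNowBrownCow")

pieces = parts[::2]       # even indices: text between the separators
separators = parts[1::2]  # odd indices: what the group captured

print(pieces)      # ['How', 'wBrown', 'w']
print(separators)  # ['No', 'Co']

# Joining the whole list rebuilds the original string, because the
# group here covers the entire separator.
assert "".join(parts) == "HowNowBrownCow"
```

With more than one group, each separator contributes one entry per group, so the stride between pieces grows accordingly.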
So if you need to use bracketed sequences in the regex and, like me, you don’t
find split’s quirky behaviour all that useful, here’s the workaround. With
Perl-style regular expressions you can declare a group as non-capturing by
using a ?: prefix, written (?:...). The example above, then, would look like this:
>>> re.split( "(?:o)(?:w)", "HowNowBrownCow" )
['H', 'N', 'Br', 'nC', '']
Just the in-between substrings now, no extra stuff from the separators. Perfect!
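The same trick applies where the brackets are genuinely needed, as in the earlier alternation. And if you can’t easily edit the pattern (say it comes from somewhere else), one possible fallback is to throw away the captured entries after the fact. The helper below is a hypothetical sketch of my own, not something the re module provides. It relies on the compiled pattern’s groups attribute, which reports how many capturing groups the pattern contains:

```python
import re

# Non-capturing alternation: only the in-between pieces come back.
print(re.split("(?:No|Co)", "HowNowBrownCow"))  # ['How', 'wBrown', 'w']


def split_without_groups(pattern, text):
    """Split text by pattern, discarding anything captured by groups.

    Hypothetical helper for illustration; not part of the re module.
    """
    compiled = re.compile(pattern)
    parts = compiled.split(text)
    # Each separator contributes one entry per capturing group, so the
    # in-between pieces sit every (groups + 1) positions.
    return parts[::compiled.groups + 1]


print(split_without_groups("(o)(w)", "HowNowBrownCow"))
# ['H', 'N', 'Br', 'nC', '']
```

Rewriting the groups as non-capturing is still the tidier fix when you control the regex; the helper is only a fallback.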