re.split Separator Quirk

I recently came across an odd quirk in Python’s split-by-regular-expression function.

The re module has a function split which takes a regex and a string and splits the string by occurrences of the regex, returning a list of substrings. Ordinarily it works how you might expect:

>>> import re
>>> re.split( "[A-Z]", "HowNowBrownCow" )
['', 'ow', 'ow', 'rown', 'ow']

Here I’m splitting the string “HowNowBrownCow” by capital letters. The substrings between the regex matches are returned, including, in this case, an empty string before the ‘H’.

But then we come to this example:

>>> re.split( "(No|Co)", "HowNowBrownCow" )
['How', 'No', 'wBrown', 'Co', 'w']

Here I’m splitting the same string by occurrences of either “No” or “Co”. This time, split has decided to return the separator strings along with the other substrings.

Why? It turns out this is documented behaviour: if you include any capturing groups in your separator regex, split gives the text of every group back to you in the result, dumping them in sequence into the returned list after each in-between substring. Here’s another example:

>>> re.split( "(o)(w)", "HowNowBrownCow" )
['H', 'o', 'w', 'N', 'o', 'w', 'Br', 'o', 'w', 'nC', 'o', 'w', '']

Um… thanks a bunch, split.
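To be fair, the behaviour does have its uses. Here is a small sketch of my own (the variable names are just illustrative) showing two things: wrapping the whole separator in a single group means no characters are lost, so joining the pieces rebuilds the original string, and with several groups each match adds one extra item per group.

```python
import re

text = "HowNowBrownCow"

# One capturing group around the whole separator keeps every character,
# so joining the pieces gives back the original string.
pieces = re.split("(No|Co)", text)
assert "".join(pieces) == text

# With two groups, each match contributes two extra items to the list.
pattern = re.compile("(o)(w)")
parts = pattern.split(text)
matches = len(pattern.findall(text))  # number of "ow" occurrences
assert len(parts) == (matches + 1) + matches * pattern.groups
```

So the list is always "in-between piece, group texts, in-between piece, group texts, ..." and ends with an in-between piece.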

So if you need to use bracketed sequences in the regex and, like me, you don’t find split’s quirky behaviour all that useful, here’s the workaround. With Perl-style regular expressions you can declare a group as non-capturing by starting it with ?: (that is, writing (?:...) instead of (...)). The example above, then, would look like this:

>>> re.split( "(?:o)(?:w)", "HowNowBrownCow" )
['H', 'N', 'Br', 'nC', '']

Just the in-between substrings now, no extra stuff from the separators. Perfect!
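If you can’t easily edit the pattern (say it’s handed to you from somewhere else), another option is to throw the group texts away afterwards. The sketch below is only one way to do it, and split_plain is a name I made up, not something in the re module. It relies on the groups attribute of a compiled pattern, which holds the number of capturing groups.

```python
import re

def split_plain(pattern, string):
    """Split like re.split, but drop any text captured by groups.

    split_plain is a hypothetical helper, not part of the re module.
    """
    compiled = re.compile(pattern)
    # Every match adds compiled.groups items after an in-between piece,
    # so the in-between pieces sit at every (groups + 1)-th position.
    return compiled.split(string)[::compiled.groups + 1]

print(split_plain("(o)(w)", "HowNowBrownCow"))   # ['H', 'N', 'Br', 'nC', '']
print(split_plain("(No|Co)", "HowNowBrownCow"))  # ['How', 'wBrown', 'w']
```

A pattern with no groups at all gives a step of 1, so the helper then behaves exactly like plain split.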

Mark Frimston
