RE2 is much faster than the STL regex: in my tests on simple queries like “.*a.*” it was at least 15 times faster. It is also faster than many other regular expression engines, in particular backtracking ones, which can hit exponential runtime on some inputs. RE2 processes input in linear time because it uses the good old Thompson NFA approach. It lacks some fancy newer syntax features (like backreferences), but if you can live with a subset of the traditional Unix egrep regular expression syntax, you get a tremendous speed win.

Whenever you think you need to integrate regex into your application, I would definitely recommend RE2. It has a simple and clear interface, wrappers for Python, Ruby, Node.js and more, and it supports precompiled regular expressions. Russ Cox did a really great job making it efficient and robust.

However, it only accepts UTF-8 (or Latin-1) input. There are two options for “supporting” UTF-16:

  1. The hardcore way: rewrite the library. A good starting point would be the patch from the library creator that adds UCS-2 encoding support.
  2. The straightforward approach: convert the strings to UTF-8 before matching. This is obviously easier, but it costs performance. Are the conversion costs that critical?

A closer look at performance

On my machine (16 GB RAM, Intel Core i7, 2.4 GHz), converting 2 million strings from UTF-16 to UTF-8 took 0.62 sec on average, allocation included, using a preallocated buffer. A reusable buffer is a good approach if you only need to match the input text, because it avoids the overhead of creating and destroying a string for every conversion.
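The buffer-based conversion described above could look roughly like the following. This is a minimal sketch, not the code used for the measurements: the function name Utf16ToUtf8 is hypothetical, it relies only on the C++ standard library, and replacing unpaired surrogates with U+FFFD is an assumption about how invalid input should be handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of RE2): transcodes UTF-16 to UTF-8 into a
// caller-owned buffer so that repeated calls can reuse one allocation.
// Assumption: unpaired surrogates are replaced with U+FFFD.
inline void Utf16ToUtf8(const std::u16string& in, std::string& out) {
    out.clear();  // drops the contents; typical implementations keep the capacity
    const std::size_t needed = in.size() * 3;  // worst case: 3 bytes per code unit
    if (out.capacity() < needed) out.reserve(needed);  // grow only, never shrink

    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;  // high surrogate without a partner
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // lone low surrogate
        }

        if (cp < 0x80) {  // 1 byte
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {  // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {  // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {  // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
}
```

The idea is to declare one std::string outside the matching loop, refill it for every input, and pass it to RE2::PartialMatch or RE2::FullMatch together with the precompiled RE2 object. Once the buffer has grown to fit the longest input, later calls should not need to allocate.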

Matching the same number of UTF-8 encoded strings against a precompiled regex like “.*ab.*txt” took on average 3.09 sec with PartialMatch and 3.49 sec with FullMatch.

The bottom line: converting UTF-16 encoded strings to UTF-8 adds roughly 20% on top of the matching time. For some applications (real-time requirements, large inputs) that could be critical, but for the vast majority it is still OK. You have to decide whether those twenty percent matter in your case.
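For reference, the figure follows directly from the timings above, taking the conversion time relative to the matching time alone (the total run time is the sum of the two):

```latex
\[
\frac{t_{\text{convert}}}{t_{\text{PartialMatch}}} = \frac{0.62}{3.09} \approx 0.20,
\qquad
\frac{t_{\text{convert}}}{t_{\text{FullMatch}}} = \frac{0.62}{3.49} \approx 0.18
\]
```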

Using RE2 with UTF-16 encoded strings