Neil Mitchell's Blog (Haskell etc): c

Tuesday, December 12, 2017

Benchmarking strchr vs memchr

Summary: memchr is faster, but the obvious implement seems to beat the builtin versions.

There are two related C functions for finding the next character in a string - strchr which assumes the string has a NUL character at the end, and memchr which takes the string length as an argument. For strings where you have the size and a NUL terminator, which is fastest? Using gcc 6.2.0 64bit MSYS2 on Windows 10, searching for a single byte 10M bytes along a string, the times were (fastest to slowest):

11.05ms memchr implemented the obvious way.
14.82ms strchr implemented the obvious way.
14.96ms memchr provided by GCC.
19.63ms strchr provided by GCC.

Trying on 3 different Windows computers, the results are all similar (but scaled).

Given the choice, you should prefer memchr over strchr.

Surprise result

The optimised implementations shipped with GCC are slower than the obvious C implementations taken from a wiki. I have absolutely no idea why. From what I can tell, the builtin versions are coded in assembly, operating on multiple bytes at a time, using SSE instructions. In contrast, the C variants operate on a single byte at a time, and aren't vectorised by the optimiser according to Godbolt. If anyone has an explanation I'd be keen to hear it.

Benchmark Code

To benchmark the variants I wrote a Haskell program using criterion. The full code and build instructions are available in this gist. I compiled the C code with -O3, using the gcc shipped with GHC 8.2.1. I've reproduced the Haskell code below, with some comments:

-- Import all the necessary pieces
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import Criterion.Main
import Foreign
import Foreign.C.Types
import Data.Monoid

-- Make all the C imports
foreign import ccall unsafe "string.h memchr" memchr_std :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
foreign import ccall unsafe "string.h strchr" strchr_std :: Ptr Word8 -> CInt -> IO (Ptr Word8)
foreign import ccall unsafe memchr_c :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)
foreign import ccall unsafe strchr_c :: Ptr Word8 -> CInt -> IO (Ptr Word8)

-- Method for ignoring the size when using strchr
ignoreSize f a b _ = f a b

-- Build a suitable string with an interesting character i bytes along
cstr i = BS.replicate i 32 <> BS.singleton 64 <> BS.replicate i 32 <> BS.singleton 0

-- The functions to benchmark
funs =
    [("memchr_std", memchr_std)
    ,("strchr_std", ignoreSize strchr_std)
    ,("memchr_c", memchr_c)
    ,("strchr_c", ignoreSize strchr_c)]

-- The main function, using Criterion
main = defaultMain
    [ seq bs $ bench (show i ++ " " ++ name) $ whnfIO $ test fun bs
    | i <- [1,10,100,1000,10000,100000,1000000,10000000]
    , let bs = cstr i
    , (name, fun) <- funs]

-- The function under test and input string
{-# NOINLINE test #-}
test fun bs =
    BS.unsafeUseAsCStringLen bs $ \(ptr,len) ->
        fun (castPtr ptr) 64 (fromIntegral len)

Tuesday, December 06, 2016

Undefined Behaviour in C

Summary: I tripped over undefined behaviour in C. It's annoying.

I've recently been writing some C code to parse XML quickly. While working on that project, I inadvertently wrote some code which is undefined according to the C language standard. The code compiled and ran fine using Visual Studio, but under gcc (even at -O0) it corrupted memory, sometimes leading to a segfault, but usually just leading to a wrong answer. The code in question was (see full code at GitHub):

d->nodes.nodes[0].nodes = parse_content(d);

To give some context, d is a structure that contains various pieces of state - what the string to be parsed is, how much we have parsed, along with a pointer to the output nodes. The parse_content function parses the bit inside an XML tag, returning the indicies in nodes which it used.

The complication comes from nodes not being a fixed size, but dynamically resized if the number of nodes exceeds the capacity. For big documents that means parse_content will reallocate d->nodes.nodes.

According to the C spec, the compiler can evaluate the LHS and RHS of an assignment in any order. Since gcc computes the location of d->nodes.nodes[0] before calling parse_content it uses the address of the node before reallocation. After reallocation the address will have changed, and the assignment will be made to the wrong location.

I spotted the bug by inserting printf statements, and in doing so, I had to rewrite the code to:

str content = parse_content(d);
d->nodes.nodes[0].nodes = content;

That fixes the issue, since now the evaluation order is strictly defined. As a simplified example of the same issue:

char* array;

char f() {
    array = malloc(42);
    return 'x';    
}

void test() {
    array = malloc(0);
    array[0] = f();
}

Here the line array[0] = f() might assign to either the result of malloc(0) or malloc(42), at the compilers discretion.

I manually checked if I had made any other such mistakes, and I couldn't find any. Naturally, I wanted to find a static checker that could detect such a mistake, so I tried a bunch of them. I wasn't very successful:

Visual Studio 2015 code analysis made me write assert after each malloc, but nothing further.
PVS Studio found nothing.
Clang undefined behaviour found nothing, and seemingly doesn't work on Windows.
GCC undefined behaviour found nothing, and seemingly doesn't work on Windows.
RV-Match hit a stack-overflow when running the program.

Neil Mitchell's Blog (Haskell etc)

Tuesday, December 12, 2017

Benchmarking strchr vs memchr

Tuesday, December 06, 2016

Undefined Behaviour in C

Blog Archive

About Me