While we're at it

Discussion:

While we're at it

Hugo Monteiro

2010-03-31 22:27:19 UTC

Hi,

i'm going to take the ride about "why did DJB did it like that" to ask
something (one of many) that has been bothering me for a while now.

Could someone explain why did DJB use 4 equal if lines inside the cicle
in str_len.c ? Wouldn't the use of a single line accomplish exactly the
same thing?

Regards,

Hugo Monteiro.
--
fct.unl.pt:~# cat .signature

Hugo Monteiro
Email : ***@fct.unl.pt
Telefone : +351 212948300 Ext.15307
Web : http://hmonteiro.net

Divisão de Informática
Faculdade de Ciências e Tecnologia da
Universidade Nova de Lisboa
Quinta da Torre 2829-516 Caparica Portugal
Telefone: +351 212948596 Fax: +351 212948548
www.fct.unl.pt ***@fct.unl.pt

fct.unl.pt:~# _

Hugo Monteiro

2010-03-31 22:40:44 UTC

Permalink

Post by Hugo Monteiro
Hi,
i'm going to take the ride about "why did DJB did it like that" to ask
something (one of many) that has been bothering me for a while now.
Could someone explain why did DJB use 4 equal if lines inside the cicle in
str_len.c ? Wouldn't the use of a single line accomplish exactly the same
thing?

speed :) it makes 1/4th of the jumps that way

That was my guess also, but then another question pops into my mind.

Why 4, and not more then? Is 4 the an average of something? Bare in mind
we're talking about strings and not only words.

R's,

Hugo Monteiro.
--
fct.unl.pt:~# cat .signature

Hugo Monteiro
Email : ***@fct.unl.pt
Telefone : +351 212948300 Ext.15307
Web : http://hmonteiro.net

Divisão de Informática
Faculdade de Ciências e Tecnologia da
Universidade Nova de Lisboa
Quinta da Torre 2829-516 Caparica Portugal
Telefone: +351 212948596 Fax: +351 212948548
www.fct.unl.pt ***@fct.unl.pt

fct.unl.pt:~# _

Colm MacCárthaigh

2010-03-31 23:04:00 UTC

Permalink

Post by Hugo Monteiro
Why 4, and not more then? Is 4 the an average of something? Bare in mind
we're talking about strings and not only words.

I think it's essentially arbitrary. Originally I thought that some
compilers might be intelligent enough to compress it into a neat
word-wize test, but have found no evidence of this in machine code
emitted by gcc or icc.

In a dynamic or semi-interpretted language (such as Java) it's
possible to do run-time statistical profiling and change the balance
between memory consumption Vs jumps based on observed average string
length and memory availability. (at some point the amount of memory
consumed by the unrolled code will cross a caching boundary and become
a source of serious latency). But in statically compiled languages
such as C, fewer such tricks are available - one could choose a value
based on a more-widely measured average - but just picking 4 seems
like an ok thing to do too - it hardly matters for the type of usage
patterns DJB's daemons have.

Security and portability are more important goals than performance for
str_len.c, so it doesn't avail of any of the modern tricks that are
available (such as the SSE instruction set). Though I think any modern
compiler would do a better job of unrolling than putting it explicitly
in the code. Also, some compilers do allow you to feed them profiling
hints, which they use to make decisions such as to what degree a loop
should be unrolled.

--
Colm

Colm MacCárthaigh

2010-04-01 19:52:20 UTC

Permalink

Clue: How many bytes fit in a word? The start of the array is probably
word-aligned. A really good compiler might even figure out that it can
load the word in a register and test all at once.

Sadly, not. As I said in my reply, I can find no evidence of this in
my analysis of machine code. Note that alignment is a tricky problem,
it would have to scroll until the tests were word-aligned. See method
4 in the cited post;

http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/

for such an example implementation (from glibc). This is quite far
beyond the capabilities of static compile optimization, I doubt it
would even be all that useful.

--
Colm

Colm MacCárthaigh

2010-04-05 22:16:18 UTC

Permalink

Post by Colm MacCÃ¡rthaigh
http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/
Now we.re into serious unreadability territory, and by the way, the
above is how djb implements strlen. But we have actually gained some
efficiency, now for every jump operation that the loop creates, we
test for the 0 value 4 times. The choice of 4 times is arbitrary here,
but an interesting exercise would be to vary this number and test each
possibility.

The very next paragraph explicitly mentions pipelining, so I don't
agree with you that it has been omitted. I certainly don't call out
that the L2 cache would contain the whole word, but this seems like an
odd thing to call out. A register would contain the whole word, never
mind any of the caches. The L1 and L2 caches are both several orders
of magnitude larger than a word.

Loop-unrolling is a very separate thing from caching optimisation, and
is usually in opposition to it. Note that unrolled instructions
require storage too, and so compete for cache line space.

Show me a memory allocator that gives back unaligned memory, and I'll
show you how to fix that. Most of the time, we can be expect the memory
to be word-aligned. In the rare case it isn't, it won't matter too much.

The issue is not that an allocator may return unaligned memory, but
that you may wish to count the length of a string that does not start
at the alignment boundary. For one trivial, but realistic, example;

char * string = "Hello World"; // probably a word-aligned pointer
int firstwordlength = strlen(string) - strlen(strtok(string, " "));
// second call is unaligned

--
Colm

Dean Anderson

2010-04-07 20:16:39 UTC

Permalink

Post by Colm MacCÃ¡rthaigh
The very next paragraph explicitly mentions pipelining, so I don't
agree with you that it has been omitted.

Ok.

Post by Colm MacCÃ¡rthaigh
I certainly don't call out that the L2 cache would contain the whole
word, but this seems like an odd thing to call out. A register would
contain the whole word, never mind any of the caches. The L1 and L2
caches are both several orders of magnitude larger than a word.

In total, the L1 & L2 cache are of course larger. And a register _could_
contain a word, but doesn't do so in this case. In this case, it
contains a byte, which is compared to 0. It doesn't make the comparison
4 times on the same byte and there are no unnecessary comparisons, which
is sort of implied in your description.

But a complete word is brought into L2 cache to read one byte. When
aligned the next 3 tests access L2 cache at high speed and are
pipelined. A conditional jump to do one test at a time would slow it
down because the pipeline is flushed when the branch is taken.

Post by Colm MacCÃ¡rthaigh
Loop-unrolling is a very separate thing from caching optimisation,and
is usually in opposition to it. Note that unrolled instructions
require storage too, and so compete for cache line space.

Agree. Some middle ground is appropriate: not too much unrolled. But
zero unrolling isn't ideal in many cases.

Post by Colm MacCÃ¡rthaigh

The issue is not that an allocator may return unaligned memory, but
that you may wish to count the length of a string that does not start
at the alignment boundary. For one trivial, but realistic, example;
char * string = "Hello World"; // probably a word-aligned pointer
int firstwordlength = strlen(string) - strlen(strtok(string, " "));
// second call is unaligned

Sure. But one wants code that is fast in the most common case, and works
in the less common case.

--
Av8 Internet Prepared to pay a premium for better service?
www.av8.net faster, more reliable, better service
617 256 5494

James Sutherland

2010-04-08 18:55:32 UTC

Permalink

(snip)

But a complete word is brought into L2 cache to read one byte. When
aligned the next 3 tests access L2 cache at high speed and are
pipelined. A conditional jump to do one test at a time would slow it
down because the pipeline is flushed when the branch is taken.

L1 cache not L2 - and the pipeline should only be flushed if the branch
is mispredicted. Any half-decent branch prediction should know the
conditional branch back to check the next byte (JNZ in x86-speak) is
taken most of the time, so the misprediction will only occur when
breaking out of the loop to return the result.

On the original SPARC and MIPS static branch predictors, the first
three lines of DJB's unrolled routine will be "predicted" correctly for
the common case of a non-zero byte; only the loop back to the
beginning will be mis-predicted. The later refinement, assuming
that only backward branches are taken, will be correct for all except
the final byte.

Post by Colm MacCÃ¡rthaigh

The issue is not that an allocator may return unaligned memory, but
that you may wish to count the length of a string that does not start
at the alignment boundary. For one trivial, but realistic, example;
char * string = "Hello World"; // probably a word-aligned pointer
int firstwordlength = strlen(string) - strlen(strtok(string, " "));
// second call is unaligned

Sure. But one wants code that is fast in the most common case, and works
in the less common case.

When parsing HTTP requests or DNS packets, I would expect nearly 75% of
the initial data to be non-word-aligned (87.5% on machines with 64 bit words).
With tinydns, the static data.cdb content already contains the length of each
object, so there shouldn't be any strlen-type calls there either.

With the simple byte by byte comparison, it doesn't really matter. I would
not be at all surprised to find the extra overhead in the "clever" word
by word version made it slower than the unrolled and branch-predicted
byte approach.

Of course, experimenting would be good - I'll see if I can find time to
round up some actual figures, at least on x86 (-64) in the next few
days. I would expect DJB did some profiling back when he first wrote
that code and that's how he ended up with the 4 line unroll; I
seem to recall he tended to check timing on SPARC and MIPS as well.

--
James A Sutherland
mobile +44 7977 563483; Skype jas88cam

Dean Anderson

2010-04-05 21:56:09 UTC

Permalink

Post by Colm MacCÃ¡rthaigh

Clue: How many bytes fit in a word? The start of the array is probably
word-aligned. A really good compiler might even figure out that it can
load the word in a register and test all at once.

Sadly, not. As I said in my reply, I can find no evidence of this in
my analysis of machine code. Note that alignment is a tricky problem,
it would have to scroll until the tests were word-aligned. See method
4 in the cited post;
http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/
Now we.re into serious unreadability territory, and by the way, the
above is how djb implements strlen. But we have actually gained some
efficiency, now for every jump operation that the loop creates, we
test for the 0 value 4 times. The choice of 4 times is arbitrary here,
but an interesting exercise would be to vary this number and test each
possibility.

The unreadability of this is questionable. But importantly, you also
omit the effect of pipelining comparisons and the fact that the whole
word will probably be in L2 cache, making it a lot faster to access.

Show me a memory allocator that gives back unaligned memory, and I'll
show you how to fix that. Most of the time, we can be expect the memory
to be word-aligned. In the rare case it isn't, it won't matter too much.

If I get some time, I'll count the instruction cycles on a couple
processors for your example.

--Dean

--
Av8 Internet Prepared to pay a premium for better service?
www.av8.net faster, more reliable, better service
617 256 5494

Dean Anderson

2010-04-01 19:37:44 UTC

Permalink

Clue: How many bytes fit in a word? The start of the array is probably
word-aligned. A really good compiler might even figure out that it can
load the word in a register and test all at once.

--Dean

Post by Colm MacCÃ¡rthaigh

Post by Hugo Monteiro
Why 4, and not more then? Is 4 the an average of something? Bare in mind
we're talking about strings and not only words.

I think it's essentially arbitrary. Originally I thought that some
compilers might be intelligent enough to compress it into a neat
word-wize test, but have found no evidence of this in machine code
emitted by gcc or icc.
In a dynamic or semi-interpretted language (such as Java) it's
possible to do run-time statistical profiling and change the balance
between memory consumption Vs jumps based on observed average string
length and memory availability. (at some point the amount of memory
consumed by the unrolled code will cross a caching boundary and become
a source of serious latency). But in statically compiled languages
such as C, fewer such tricks are available - one could choose a value
based on a more-widely measured average - but just picking 4 seems
like an ok thing to do too - it hardly matters for the type of usage
patterns DJB's daemons have.
Security and portability are more important goals than performance for
str_len.c, so it doesn't avail of any of the modern tricks that are
available (such as the SSE instruction set). Though I think any modern
compiler would do a better job of unrolling than putting it explicitly
in the code. Also, some compilers do allow you to feed them profiling
hints, which they use to make decisions such as to what degree a loop
should be unrolled.

--
Av8 Internet Prepared to pay a premium for better service?
www.av8.net faster, more reliable, better service
617 256 5494

Colm MacCárthaigh

2010-03-31 22:41:13 UTC

Permalink

Post by Hugo Monteiro
Could someone explain why did DJB use 4 equal if lines inside the cicle
in str_len.c ? Wouldn't the use of a single line accomplish exactly the
same thing?

it's a form of partial loop unrolling. See;

http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/

for some discussion (Method 3). Hope that helps :-)

--
Colm

Laurent Bercot

2010-04-01 20:41:30 UTC

Permalink

Post by Colm MacCÃ¡rthaigh
http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/

Thanks for the link, Colm. That's a very informative article you wrote.

--
Laurent

Simon Casady

2010-04-02 12:47:46 UTC

Permalink

I agree, a very good paper worth reading more than once.
However I can't resist adding a couple of more stories, thoughts on
the subject of optimization.
There once was a processor from att that had strlen as a single
opcode, one complex instruction. It was convenient but it turned out
to actually a little slower, in the real world, than test increment
loop with simple instructions. (also a cics rics comparison on the
same cpu !). The problem was interrupts. It took too long to save all
that state and restart the instruction and it had to be interruptible.
Point being you have to profile to know whats faster.
The other experience I had was with an early version of AIX. Being
curious and not wanting to do the work my boss wanted done I did some
test with disk IO. I tested clib stdio, read write system calls(page
size), and mmap. To my surprise stdio won every time. Apparently the
people at IBM knew more than I did. The point here is that if you are
not intimately acquainted with all the features and quirks of a
system,
just keep it simple and of course profile.

Dean Anderson

2010-04-05 21:59:18 UTC

Permalink

Post by Simon Casady
I agree, a very good paper worth reading more than once.
However I can't resist adding a couple of more stories, thoughts on
the subject of optimization.
There once was a processor from att that had strlen as a single
opcode, one complex instruction.

I386 can also do this with REPZ.

Post by Simon Casady
It was convenient but it turned out to actually a little slower, in
the real world, than test increment loop with simple instructions.
(also a cics rics comparison on the same cpu !). The problem was
interrupts. It took too long to save all that state and restart the
instruction and it had to be interruptible.
Point being you have to profile to know whats faster.

Exactly.

Post by Simon Casady
The other experience I had was with an early version of AIX. Being
curious and not wanting to do the work my boss wanted done I did some
test with disk IO. I tested clib stdio, read write system calls(page
size), and mmap. To my surprise stdio won every time. Apparently the
people at IBM knew more than I did. The point here is that if you are
not intimately acquainted with all the features and quirks of a
system, just keep it simple and of course profile.

Stdio probably buffers more efficiently.

--Dean

--
Av8 Internet Prepared to pay a premium for better service?
www.av8.net faster, more reliable, better service
617 256 5494

Paul Procacci

2010-03-31 22:41:49 UTC

Permalink

Post by Hugo Monteiro
Hi,
i'm going to take the ride about "why did DJB did it like that" to ask
something (one of many) that has been bothering me for a while now.
Could someone explain why did DJB use 4 equal if lines inside the cicle
in str_len.c ? Wouldn't the use of a single line accomplish exactly the
same thing?
Regards,
Hugo Monteiro.

http://en.wikipedia.org/wiki/Loop_unwinding

This message may contain confidential or privileged information. If you are not the intended recipient, please advise us immediately and delete this message. See http://www.datapipe.com/emaildisclaimer.aspx for further information on confidentiality and the risks of non-secure electronic communication. If you cannot access these links, please notify us by reply message and we will send the contents to you.

Alejandro Mery

2010-03-31 22:35:53 UTC

Permalink

speed :) it makes 1/4th of the jumps that way

Dean Anderson

2010-04-08 22:05:13 UTC

Permalink

(snip)

L1 cache not L2 - and the pipeline should only be flushed if the branch
is mispredicted. Any half-decent branch prediction should know the
conditional branch back to check the next byte (JNZ in x86-speak) is
taken most of the time, so the misprediction will only occur when
breaking out of the loop to return the result.
On the original SPARC and MIPS static branch predictors, the first
three lines of DJB's unrolled routine will be "predicted" correctly for
the common case of a non-zero byte; only the loop back to the
beginning will be mis-predicted. The later refinement, assuming
that only backward branches are taken, will be correct for all except
the final byte.

Very nice explanation. Thanks!

Sure. But one wants code that is fast in the most common case, and works
in the less common case.

DNS packets never require strlen when decoding, only encoding. And then
it may depend on the RR representation in the server. HTTP is known to

With the simple byte by byte comparison, it doesn't really matter. I would
not be at all surprised to find the extra overhead in the "clever" word
by word version made it slower than the unrolled and branch-predicted
byte approach.
Of course, experimenting would be good - I'll see if I can find time to
round up some actual figures, at least on x86 (-64) in the next few
that code and that's how he ended up with the 4 line unroll; I
seem to recall he tended to check timing on SPARC and MIPS as well.

Thanks!

--Dean

--
Av8 Internet Prepared to pay a premium for better service?
www.av8.net faster, more reliable, better service
617 256 5494