Hi,
Welcome to our very first issue for the year 2025.
In today’s Expert Insight, we bring you an excerpt from the recently published book, Debunking C++ Myths, which discusses different approaches to counting set bits in a 32-bit integer, comparing manual bitwise operations, the C++20 std::popcount function, and the CPU-level POPCNT instruction.
News Highlights: JetBrains launches ‘Junie’ for AI coding in Python, Kotlin, and Java; Tailwind CSS 4.0’s Rust-powered build engine boosts incremental builds up to 100x; Rust 1.85 stabilizes async closures and enhances Linux kernel support; and 11 new languages like Mojo, Wing, and Jakt target AI, memory safety, and edge computing.
My top 5 picks from today’s learning resources:
But there’s more, so dive right in.
Stay Awesome!
Divya Anne Selvaraj
Editor-in-Chief
T...[index]
), simplifies previously cumbersome methods, and improves readability.
strptime(), locales, daylight saving time, and the best available solutions, including timegm() and C++20’s time zone library.
java -jar.
go tool is one of the best additions to the ecosystem in years: Discusses Go 1.24’s new go tool command and tool directive in go.mod, which improves dependency management by reducing bloat.
GC.total_time for tracking GC impact, gvltools for measuring GVL wait time, and Linux /proc metrics for monitoring CPU scheduling.
Here’s an excerpt from “Chapter 8: The Fastest C++ Code is Inline Assembly” in the book, Debunking C++ Myths by Alexandru Bolboacă and Ferenc-Lajos Deák, published in December 2024.
Dear reader. In our previous section of this chapter, unfortunately, we exhausted the only pompous introduction we could borrow from various cultural sources concerning technical interviews, career and life choices, and whether we should take the red pill or the blue one, so let’s focus our attention on more technical questions that our candidates might face at a technical interview (the word technical appears four times in this short introductory paragraph).
One of these questions, served to the author of these lines a few years ago, was to write a short code snippet that will count the number of 1 bits (the on bits) in a 32-bit integer. Let’s draft up a quick application to do this:
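#include <cstdint>

int countOneBits(std::uint32_t n) {
    int count = 0;
    while (n) {
        count += n & 1;  // add the least significant bit (0 or 1)
        n >>= 1;         // shift right, discarding that bit
    }
    return count;
}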
Here’s what happens. Firstly, we initialize a counter, starting with 0. The next step is to loop through the bits. While n is non-zero, we add the least significant bit of n to the counter (n & 1 gives us this value). Following this, we shift n right by one bit (discarding the least significant bit).
Once all bits are processed (when n becomes 0), return the total count of 1 bits. Not a very complicated process, just raw work.
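To make the steps above concrete, here is a minimal sketch that repeats the function so it compiles on its own and traces it by hand for one input. The value 13 and the helper name checkCountOneBits are illustrative choices, not part of the book excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Same loop as in the excerpt, repeated so this sketch is self-contained.
int countOneBits(std::uint32_t n) {
    int count = 0;
    while (n) {
        count += n & 1;  // add the least significant bit (0 or 1)
        n >>= 1;         // shift right, discarding that bit
    }
    return count;
}

// Hand trace for n = 13 (binary 1101):
//   n = 1101 -> lowest bit 1, count = 1, n becomes 110
//   n = 0110 -> lowest bit 0, count = 1, n becomes 11
//   n = 0011 -> lowest bit 1, count = 2, n becomes 1
//   n = 0001 -> lowest bit 1, count = 3, n becomes 0 and the loop ends

// Hypothetical helper that checks the trace and two edge cases.
void checkCountOneBits() {
    assert(countOneBits(13u) == 3);
    assert(countOneBits(0u) == 0);            // loop body never runs
    assert(countOneBits(0xFFFFFFFFu) == 32);  // all 32 bits set
}
```

Note that the loop runs once per bit position up to the highest set bit, so an input with only the top bit set still takes 32 iterations.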
It seems that this procedure of counting bits in numbers must be of very particular interest in computing circles, for purposes such as error detection and correction, data compression, cryptography, algorithmic efficiency, digital signal processing, hardware design, and performance metrics, so no wonder it managed to creep into the STL (the C++ Standard Template Library) too, in the form of std::popcount from C++20.
The interesting part of the story is that we find this handy operation not only in the STL: it was deemed so useful that it even exists at the level of the processor, under the infamous POPCNT mnemonic. Infamous it is, because in 2024 it was effectively used to hinder the installation of Windows 11 on older machines that were not officially supported (https://www.theregister.com/2024/04/23/windows_11_cpu_requirements/).
But what that means for our candidate, who has to write code to impress the interviewers, is that they can simply replace the complicated code from before with the following very handy snippet:
#include <bit>
#include <cstdint>

int countOneBits(std::uint32_t n) {
    return std::popcount(n);
}
Not forgetting to include the <bit> header, after feeding the preceding program into gcc.godbolt.org’s compilers, we get a strange mishmash of results. The code compiled by GCC, regardless of the optimization level, always generates a variation of the following:
countOneBits(unsigned int):
    sub rsp, 8
    mov edi, edi
    call __popcountdi2
    add rsp, 8
    ret
So, the code at some level disappears from our eyes into a strange call deep inside the libraries offered by GCC, called __popcountdi2 (https://gcc.gnu.org/onlinedocs/gccint/Integer-library-routines.html). In order to convince GCC to fully utilize the power of the processor that we are running the code on, we need to utilize some of the not-so-well-known command-line options, such as -march (or -mpopcnt for this specific purpose).
According to the official documentation (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html), this option selects the appropriate processor instruction set in order to use the available extensions of the specific processor. Since, at this stage, we know that the POPCNT instruction was introduced in the early Core i5 and i7 processors, in the Nehalem family, we should simply specify the following to GCC: -march=nehalem. And now, not surprisingly, the compiler generates the following:
countOneBits(unsigned int):
    popcnt eax, edi
    ret
Interestingly, if we provide the compiler with just the -mpopcnt flag, then it generates an extra xor eax, eax (meaning it zeroes the EAX register), so maybe we have witnessed some processor-specific extra optimizations by choosing the Nehalem architecture:
countOneBits(unsigned int):
    xor eax, eax
    popcnt eax, edi
    ret
We cannot squeeze more than this out of GCC; there is simply no lower level for this functionality, so we focus our attention on the next compiler on our list.
Without explicitly asking to optimize the code, Clang also generates a generic call to a std::popcount function, found somewhere in its libraries; however, when explicitly asked to optimize the generated code, Clang at various levels of optimization yields the following:
countOneBits(unsigned int):
    mov eax, edi
    shr eax
    and eax, 1431655765
    sub edi, eax
    mov eax, edi
    and eax, 858993459
    shr edi, 2
    and edi, 858993459
    add edi, eax
    mov eax, edi
    shr eax, 4
    add eax, edi
    and eax, 252645135
    imul eax, eax, 16843009
    shr eax, 24
    ret
Surprising as it seems, there is a perfectly logical explanation for this code, found at the bit-twiddling site (https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) of Sean Eron Anderson at Stanford. Not considering this extra detour, Clang behaves identically to GCC when it comes to handling architecture and specifying the subset of CPU extensions to use while generating code.
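The Clang listing above maps almost line by line onto the parallel bit count described on that page. The following is a sketch of that routine in C++, with the decimal masks from the listing written in hexadecimal; the function names countOneBitsParallel and checkParallel are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Parallel (SWAR) bit count; each step doubles the width of the partial sums.
int countOneBitsParallel(std::uint32_t v) {
    // 0x55555555 == 1431655765: each 2-bit field now holds its own bit count
    v = v - ((v >> 1) & 0x55555555u);
    // 0x33333333 == 858993459: add neighbouring 2-bit counts into 4-bit fields
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    // 0x0F0F0F0F == 252645135: add neighbouring 4-bit counts into bytes
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    // 0x01010101 == 16843009: the multiply sums the four bytes into the top byte
    return static_cast<int>((v * 0x01010101u) >> 24);
}

// Hypothetical helper exercising a few known values.
void checkParallel() {
    assert(countOneBitsParallel(0u) == 0);
    assert(countOneBitsParallel(13u) == 3);           // binary 1101
    assert(countOneBitsParallel(0xFFFFFFFFu) == 32);  // all bits set
}
```

Unlike the simple loop, this version has no branches and always executes the same handful of operations, whatever the input.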
The last of the big three, Microsoft’s own (we know, tiny, squishy) C++ compiler handles the situation very similarly to Clang. When asked to optimize the code while we specify an architecture that does not support the POPCNT instruction, it generates code like the one generated by Clang with low-level bit hacks, while if the architecture has support for the POPCNT instruction, it will adjust to the correct type and will call POPCNT for the proper parameters (/std:c++latest /arch:SSE4.2 /O1).
Good work, tiny, squishy compiler.
Debunking C++ Myths was published in December 2024. Packt library subscribers can continue reading the entire book for free or you can buy the book here!
That’s all for today.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.
If you have any suggestions or feedback, or would like us to find you a learning resource on a particular subject, just respond to this email!