Any experiences with Intel's Threading Building Blocks?
I've introduced it into our code base because we needed a bettor malloc to use when we moved to a 16 core machine. With 8 and under it wasn't a significant issue. It has worked well for us. We plan on using the fine grained concurrent containers next. Ideally we can make use of the real meat of the product, but that requires rethinking how we build our code. I really like the ideas in TBB, but it's not easy to retrofit onto a code base.
You can't think of TBB as another threading library. They have a whole new model that really sits on top of threads and abstracts the threads away. You learn to think in task, parallel_for type operations and pipelines. If I were to build a new project I would probably try to model it in this fashion.
We work in Visual Studio and it works just fine. It was originally written for linux/pthreads so it runs just fine over there also.
I'm not doing numerical computing but I work with data mining (think clustering and classification), and our workloads are probably similar: all the data is static and you have it at the beginning of the program. I have briefly investigated Intel's TBB and found them overkill for my needs. After starting with raw pthread-based code, I switched to OPENMP and got the right mix between readability and performance.
Portability
TBB is portable. It supports Intel and AMD (i.e. x86) processors, IBM PowerPC and POWER processors, ARM processors, and possibly others. If you look in the build directory, you can see all the configurations the build system support, which include a wide range of operating systems (Linux, Windows, Android, MacOS, iOS, FreeBSD, AIX, etc.) and compilers (GCC, Intel, Clang/LLVM, IBM XL, etc.). I have not tried TBB with the PGI C++ compiler and know that it does not work with the Cray C++ compiler (as of 2017).
A few years ago, I was part of the effort to port TBB to IBM Blue Gene systems. Static linking was a challenge, but is now addressed by the big_iron.inc build system helper. The other issues were supporting relatively ancient versions of GCC (4.1 and 4.4) and ensuring the PowerPC atomics were working. I expect that porting to any currently unsupported architecture would be relatively straightforward on platforms that provide or are compatible with GCC and POSIX.
Usage in community codes
I am aware of at least two HPC application frameworks that uses TBB:
- MOOSE
- MADNESS
I do not know how MOOSE uses TBB, but MADNESS uses TBB for its task queue and memory allocator.
Performance versus other threading models
I have personally used TBB in the Parallel Research Kernels project, within which I have compared TBB to OpenMP, OpenCL, Kokkos, RAJA, C++17 Parallel STL, and other models. See the C++ subdirectory for details.
The following figure shows the relative performance of the aforementioned models on an Intel Xeon Phi 7250 processor (the details aren't important - all models used the same settings). As you can see, TBB does quite well except for smaller problem sizes, where the overhead of adaptive scheduling is more relevant. TBB has tuning knobs that will affect these results.
Full disclosure: I work for Intel in a research/pathfinding capacity.