Cbench - Scalable Cluster Benchmarking and Testing
Overview
Cbench aims to be a relatively straightforward collection of tests, benchmarks, applications, and utilities, plus a framework to hold them together, that facilitates scalable testing, benchmarking, and analysis of a Linux parallel compute cluster. It grew gradually out of the frustration of redoing the same work over and over as new Linux clusters were integrated and brought online. I have continually found labor-intensive tasks in cluster integration that could be assisted by tooling, so that labor can be applied to the true goals of system integration, i.e. getting the system tested and debugged. As the toolkit has grown, it has opened the door to more sophisticated system integration, testing, and characterization capabilities.
Some ways in which Cbench is utilized on clusters at Sandia:
- stress testing and analyzing cluster interconnect performance and characteristics using multiple bandwidth, latency, and collective tests
- testing and analyzing cluster scalability using common benchmarks such as Linpack, HPCC, NAS Parallel Benchmarks, Intel MPI Benchmarks, and IOR
- stress testing cluster file systems with a mix of job sizes and flavors
- stress testing a cluster after maintenance by pounding the system with hundreds to thousands of jobs of various sizes and flavors
- stress testing the cluster scheduler and resource manager
- testing nodes for hardware stability and conformance to the performance profile of homogeneous hardware
- serving as the basis for a deterministic methodology that strictly details the testing process for returning broken hardware to general production usage
Cbench is used heavily on the 4480-node (8960-processor) Thunderbird Linux cluster at Sandia National Labs.
Goals
The first goal is to make it easy to build all the source code gathered within Cbench and make getting bootstrapped on a new cluster as easy as possible. Cbench tries to do this by centralizing as many configuration parameters as possible including Make parameters and cluster definition parameters.
The second, more ambitious goal is to be able to quickly generate and run a large number of tests and then analyze the test data. To this end, Cbench uses the idea of 'test sets', which package tests, benchmarks, and applications into sets with utilities to:
- generate jobs for batch and interactive execution
- run many batch and interactive jobs easily
- analyze the resulting mass of output and synthesize it
The third goal is to make it completely painless to switch between different batch environments and different job launch methods (e.g. mpiexec, prun, yod).
The overarching goal is to make cluster testing much more tractable, so that you can focus on what you really want to test rather than on how to get there.
Some examples that are relatively painless to do with Cbench:
- run 1000+ parallel jobs over a range of 2 to 1024 processors using 10-15 different tests/benchmarks overnight, then analyze what happened the next morning in minutes
- plot results in semi real-time (using gnuplot) for supported tests as hundreds of jobs run through the scheduler
- for a set of supported test jobs, easily analyze success/failure ratios according to job size (number of processes)
- run node-level (i.e. on a single node, without worrying about MPI) nightly burn-in tests on a 4500-node cluster, analyze the results, and generate a statistical performance and fault profile for the nodes
- stop worrying about the hassles of changing MPI job launchers and batch schedulers
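To illustrate the success/failure analysis by job size mentioned above, here is a minimal Python sketch. It is not Cbench's own code; the function name `pass_ratio_by_size` and the sample data are hypothetical, and it assumes each job has already been reduced to a (process count, passed) pair.

```python
from collections import defaultdict

def pass_ratio_by_size(results):
    """Return {num_procs: fraction of jobs that passed}.

    `results` is an iterable of (num_procs, passed) tuples. This input
    format is an assumption made for the sketch, not a Cbench format.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for num_procs, passed in results:
        totals[num_procs] += 1
        if passed:
            passes[num_procs] += 1
    # Sort by job size so the report reads from small to large jobs.
    return {n: passes[n] / totals[n] for n in sorted(totals)}

# Hypothetical job outcomes: (process count, did the job pass?)
sample = [
    (2, True), (2, True),
    (4, True), (4, False),
    (1024, False), (1024, False), (1024, True), (1024, True),
]

ratios = pass_ratio_by_size(sample)
for num_procs, ratio in ratios.items():
    print(f"{num_procs:>5} procs: {ratio:.0%} passed")
```

Grouping by process count makes it easy to see whether failures cluster at larger job sizes, which often points at interconnect or scheduler problems rather than at a single bad node.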
Cluster Testing/Benchmarking System Levels
We have observed that Linux cluster testing/benchmarking seems to fall into three different levels. Each level requires different, but interrelated toolsets.
Node Level
The foundational level is the node level, where one is only concerned with testing a node in isolation, without worrying about high-speed interconnects, system MPI libraries, and so on. Hardware burn-in testing during integration is one example. Note that you will in all likelihood be testing many isolated nodes in parallel, since you most likely have a Linux cluster after all. Some of the complexities at this level are:
- having a flexible set of tests to run on nodes
- having the ability to run tests on many nodes in parallel and organize the results
- analyzing the results from the tests as they are generated and comparing the results among the set of nodes running tests
- having the ability to quickly change the output parsing algorithms when new errors, conditions, data points, etc. are discovered to be of importance in test output
- comparing results of previous runs with current runs
- statistically characterizing the results to help focus on what really needs to be scrutinized for errors and aberrations in performance
Cbench has growing support for addressing testing at this level in the Nodehwtest Testset and currently has capabilities to address all of the complexities.
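As a minimal sketch of the kind of statistical characterization described above (and not the Nodehwtest implementation itself), the following Python compares one benchmark value per node against the median of all nodes and flags nodes that deviate by more than a tolerance. The function name `flag_outliers`, the 5% tolerance, the node names, and the sample values are all assumptions for illustration.

```python
from statistics import median

def flag_outliers(node_results, tolerance=0.05):
    """Return {node: value} for nodes far from the cluster median.

    `node_results` maps a node name to one positive benchmark value.
    A node is flagged when its value differs from the median by more
    than `tolerance`, expressed as a fraction of the median.
    """
    mid = median(node_results.values())
    return {
        node: value
        for node, value in node_results.items()
        if abs(value - mid) / mid > tolerance
    }

# Hypothetical per-node results, e.g. memory bandwidth in MB/s.
sample = {
    "n001": 5210.0,
    "n002": 5195.0,
    "n003": 4302.0,  # noticeably slower than its peers
    "n004": 5230.0,
    "n005": 5188.0,
}

suspects = flag_outliers(sample)
for node, value in suspects.items():
    print(f"{node}: {value} deviates from the cluster median")
```

The median is used instead of the mean so that a few badly behaving nodes do not shift the baseline they are being compared against.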
Point-to-Point Level
The middle level is point-to-point system testing between nodes, which does not necessarily use any system MPI libraries or the high-speed interconnect. This can encompass testing such as point-to-point netperf testing between all nodes to test out Ethernet links, or point-to-point Verbs layer InfiniBand testing to test out IB network links below MPI. Cbench does not currently specifically address this area of testing... yet!
MPI System Level
The top of the stack is the system level normally associated with Linux compute clusters, i.e. the MPI system level where parallel compute jobs run. At this level, one is concerned with testing the interaction of all the other system levels combined. As anyone who has been involved in cluster integration, testing, or characterization knows, there are many complexities in testing at this level. Some of them, though certainly not all, are:
- dealing with all the combinations of
  - a set of tests/benchmarks to run
  - a range of process counts to test on, e.g. 2, 4, 8, ..., 1024
  - 1 process/node vs 2 processes/node
  - running tests in an interactive mode versus a batch mode
  - different batch schedulers
  - different job launchers (e.g. mpiexec, mpirun, prun)
  - different testing iterations or testing purposes by which to categorize results
- dealing with all the raw output in any sane way!
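To give a feel for how quickly these combinations multiply, here is a small Python sketch that enumerates a job matrix. It is not how Cbench generates jobs; the test names, the `job_name` helper, and the naming scheme are hypothetical.

```python
from itertools import product

# Hypothetical dimensions of a test campaign.
tests = ["linpack", "hpcc", "ior"]
proc_counts = [2 ** k for k in range(1, 11)]  # 2, 4, 8, ..., 1024
procs_per_node = [1, 2]
modes = ["batch", "interactive"]

def job_name(test, nprocs, ppn, mode):
    """Build an illustrative job name; the scheme is an assumption."""
    return f"{test}-{ppn}ppn-{nprocs}-{mode}"

# One job per combination of test, process count, ppn, and mode.
jobs = [
    job_name(test, nprocs, ppn, mode)
    for test, nprocs, ppn, mode in product(
        tests, proc_counts, procs_per_node, modes
    )
]

print(len(jobs), "jobs, for example:", jobs[0])
```

Even with only three tests, this yields 120 jobs before batch schedulers, job launchers, or repeated iterations are taken into account, which is why generating and regenerating job files by script matters.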
Cbench attempts to greatly simplify dealing with these complexities in several ways:
- centralize as much configuration as possible, including Make parameters (make.def, at trunk/cbench/make.def) and cluster definition parameters (cluster.def, at trunk/cbench/cluster.def)
- provide the scripting infrastructure to easily generate/regenerate all the combinations of jobs desired at any time
- support different batch schedulers and job launchers in a core library and easily switch between them by changing the appropriate values in cluster.def and regenerating job files
- utilize the "testset" concept and structure
- provide a modular output parsing structure to make it as easy as possible to add/change parsing logic for tests and still utilize the core output parsing analysis capabilities
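The idea of switching job launchers by changing a single configuration value can be sketched as follows. This is an illustration in Python under stated assumptions, not Cbench's core library: the `LAUNCHER_TEMPLATES` table, the `build_command` function, and the `launcher` configuration key are hypothetical. The only launcher options used are the widely supported `mpiexec -n` and `mpirun -np` forms.

```python
# Command prefixes per launcher. Only the common process-count options
# are shown; real launchers take many more site-specific options.
LAUNCHER_TEMPLATES = {
    "mpiexec": ["mpiexec", "-n", "{nprocs}"],
    "mpirun": ["mpirun", "-np", "{nprocs}"],
}

def build_command(launcher, nprocs, binary, args=()):
    """Return the argv list that starts `binary` on `nprocs` processes."""
    try:
        template = LAUNCHER_TEMPLATES[launcher]
    except KeyError:
        raise ValueError(f"unsupported job launcher: {launcher!r}") from None
    prefix = [part.format(nprocs=nprocs) for part in template]
    return prefix + [binary, *args]

# Hypothetical stand-in for a value read from a cluster definition file.
cluster_config = {"launcher": "mpiexec"}

cmd = build_command(cluster_config["launcher"], 16, "./xhpl")
print(" ".join(cmd))
```

Because the launcher is looked up by name, regenerating job files after editing the configuration value is enough to move every job to a different launch method.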
News
- 08-13-2008: Tagged the 1.2.0rc1 release and made a tarball available here. The tarball has not been added to the Sourceforge mirrors yet because of their ongoing datacenter migration. In the meantime, you can just check out the tag:
svn co https://cbench.svn.sourceforge.net/svnroot/cbench/tags/cbench-1_2_0rc1 cbench
- 03-10-2008: Started migrating to use the cbench Sourceforge project webserver instead of the cbench-sf Sourceforge project webserver, which is what we have been using. Two major changes have been accomplished:
- The Cbench Subversion repository is now hosted on Sourceforge!! The URL is https://cbench.svn.sourceforge.net/svnroot/cbench/trunk/cbench . Thanks to Chris for spearheading this!
- To switch checkout URLs see the Developer's Page
- The read-only mirror of the Cbench TRAC website was moved to the cbench webserver space and the cbench-sf webserver space is now just a pointer to http://cbench.sourceforge.net .
- We are still using TRAC hosted at http://cbench.sandia.gov as the live development site for Cbench (with no plans to ever leave TRAC)
- 12-21-2007: Cbench version 1.1.5 released and pushed out to Sourceforge; release notes can be found here.
- 11-19-2007: Preparing to freeze and release Cbench 1.1.5 very soon...
- 11-12-2007: Cbench received an honorable mention in an HPCwire article about Woven Systems and Chelsio Communications Scalable High Performance 10 Gig Ethernet for Computing Clusters, which can be found here and here, and in a Woven press release here.
Downloads
- Latest release: v 1.2.0rc1 (08-13-2008)
- download (only on Sandia SRN and SON)
- Sourceforge mirror not available yet
- Previous release: v 1.1.5 (12-21-2007)
- Repository Snapshots
- Changelog (tags/cbench-1_1_5/CHANGES; latest Changelog on Sourceforge)
- All Cbench downloads
- Licensing
Mailing Lists
- One size fits all list:
Documentation and Resources
- Documentation
- Whitepaper on cluster testing and Cbench
- Samples and Examples of Cbench at work
- Single node benchmarking reports
- dual socket, single core/socket server: PDF report, raw data
- Developers Home Page
- Cbench Sourceforge homepage (used for Subversion, downloads, mailing lists, and read-only website mirroring)
Interesting Related Work by Others
Related Projects
Affiliated Organizations
NON-Affiliated (but cool) Organizations
The Cbench project uses Sourceforge!





