Your storage vendor claims 500K IOPS. Your server vendor promises 128 cores of “enterprise-grade performance.” Your hypervisor can “easily handle” 50 VMs. Your network fabric has “plenty of headroom.” Then you consolidate eight production databases onto the platform and everything falls over at 3 AM on a Tuesday.
What happened? Nobody actually tested what happens when multiple databases compete for the same resources. Storage saturates. CPUs contend. Network bandwidth plateaus. And you find out in production instead of in testing, because traditional benchmarking is a tedious manual process that measures one system at a time.
The traditional approach? Spin up eight databases. Configure eight HammerDB instances. Coordinate test timing. Collect results from eight different systems. Manually correlate application metrics with infrastructure telemetry. By the time you’re done, the migration deadline has passed and your boss is asking why you’re still “testing.”
HammerDB-Scale fixes this. Orchestrate parallel database benchmarks on Kubernetes, automatically correlate application metrics with infrastructure behavior, get empirical answers about where your platform breaks. Instead of asking “how fast is this database?” it answers “how many databases can this platform handle before something breaks?”
Acknowledgments
This project builds directly on Anthony Nocentino’s foundational work containerizing HammerDB. His implementation made orchestrating database benchmarks at scale possible. HammerDB-Scale extends that work with multi-target coordination, infrastructure monitoring integration, and automated result correlation.
HammerDB-Scale
HammerDB-Scale is a Kubernetes-native orchestration framework that uses database workloads to stress test infrastructure. It measures database performance (NOPM, TPM, QPH) across multiple instances simultaneously to expose infrastructure bottlenecks. The database metrics tell you where the platform underneath breaks.
Define your targets in YAML. Deploy via Helm. Retrieve aggregated results. The tool handles the tedious bits: orchestration, parallel execution, metric collection, temporal alignment, correlation. All automatic.
This isn’t testing distributed databases (single logical database across multiple nodes). It’s testing multiple independent database instances sharing infrastructure. TPC-C and TPC-H produce realistic I/O patterns, CPU utilization, and memory pressure that synthetic tools like fio can’t replicate. Your storage array behaves differently under actual database workloads than it does under synthetic sequential writes.
How to Use It
The workflow assumes you already have database instances and infrastructure running, plus a Linux system with kubectl and cluster access configured. From there: clone the repo, configure your targets, choose a phase, deploy via Helm, watch it run, get results. Six steps, thirty minutes, empirical data about where your infrastructure breaks.
Step 1: Clone the repo
Clone the repository:
```shell
git clone https://github.com/PureStorage-OpenConnect/hammerdb-scale
cd hammerdb-scale
```
The repo structure includes:
```
hammerdb-scale/
├── Chart.yaml              # Helm chart metadata
├── values.yaml             # Your configuration goes here
├── values-examples.yaml    # Example configurations
├── templates/              # Helm chart templates
├── scripts/                # Benchmark and monitoring scripts
├── deploy-test.sh          # Deployment helper script
├── aggregate-results.sh    # Results aggregation
├── ADDING-DATABASES.md     # Guide for adding new database types
└── README.md               # Full documentation
```
Key files: values.yaml (where you configure your tests), deploy-test.sh (quick deployment), and aggregate-results.sh (get your results).
Step 2: Configure your test in values.yaml
The values.yaml file defines three things: what to test, how to test it, and optional infrastructure monitoring.
What to test – your database targets (here, an example with two):

```yaml
targets:
  - name: sql-server-01
    type: mssql
    host: "sqlserver1.example.com"
    username: sa
    password: "YourSecurePassword"
  - name: sql-server-02
    type: mssql
    host: "sqlserver2.example.com"
    username: sa
    password: "YourSecurePassword"
```
Add as many targets as you want to test. Each gets an independent worker pod.
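When a test calls for many targets, a small generator beats hand-editing YAML. This is a sketch that assumes the mssql target shape shown above; the `sql-server-NN` naming and host pattern are illustrative, not anything HammerDB-Scale requires:

```python
# Sketch: generate a targets list for values.yaml programmatically instead of
# hand-editing it. The naming scheme (sql-server-NN) and the host pattern are
# illustrative assumptions, not part of HammerDB-Scale itself.

def make_targets(count, host_pattern="sqlserver{n}.example.com",
                 username="sa", password="YourSecurePassword"):
    """Return a list of target dicts, one per database instance."""
    return [
        {
            "name": f"sql-server-{n:02d}",
            "type": "mssql",
            "host": host_pattern.format(n=n),
            "username": username,
            "password": password,
        }
        for n in range(1, count + 1)
    ]

def to_yaml(targets):
    """Render the targets as a YAML fragment (no external library needed)."""
    lines = ["targets:"]
    for t in targets:
        lines.append(f'  - name: {t["name"]}')
        lines.append(f'    type: {t["type"]}')
        lines.append(f'    host: "{t["host"]}"')
        lines.append(f'    username: {t["username"]}')
        lines.append(f'    password: "{t["password"]}"')
    return "\n".join(lines)

print(to_yaml(make_targets(8)))
```

Paste the output into values.yaml and each target still gets its own independent worker pod.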
How to test – benchmark parameters:

```yaml
testRun:
  phase: "build"       # "build" (schema + data) or "load" (run benchmark)
  benchmark: "tprocc"  # "tprocc" (OLTP) or "tproch" (analytics)

hammerdb:
  tprocc:
    warehouses: 100    # Database size (~100MB per warehouse)
    load_num_vu: 8     # Virtual users (concurrency)
    duration: 5        # Test duration in minutes
```
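The warehouse comment above doubles as a sizing rule of thumb. Here is a rough estimate, assuming ~100 MB per warehouse; actual on-disk size varies by engine, fill factor, and compression:

```python
# Rough sizing helper for the build phase. The ~100 MB-per-warehouse figure
# comes from the values.yaml comment above and is only an approximation.

def tprocc_dataset_size_gb(warehouses, mb_per_warehouse=100):
    """Estimate the TPC-C dataset size in GB for a given warehouse count."""
    return warehouses * mb_per_warehouse / 1024

for wh in (100, 500, 1000):
    print(f"{wh} warehouses ~ {tprocc_dataset_size_gb(wh):.1f} GB per target")
```

Remember that every target builds its own dataset, so multiply by the number of targets when sizing storage.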
Optional: Infrastructure monitoring (Pure Storage FlashArray):
```yaml
pureStorage:
  enabled: true
  host: "192.168.1.100"
  apiToken: "your-api-token-here"
  pollInterval: 5      # Collect metrics every 5 seconds
```
The full values.yaml has detailed comments explaining every parameter. See values-examples.yaml for complete scenarios (4-database consolidation test, pre-migration validation, protocol comparison).
Step 3: Choose build or load phase
Set the phase in values.yaml:
```yaml
testRun:
  phase: "build"   # Options: "build" or "load"
```
Build phase: Creates the database schema and loads test data. Run this first.
Load phase: Runs the actual benchmark workload against existing data. Run this after the build completes.
Typical workflow:
1. Set phase: "build" and deploy
2. Wait for the build to complete (schema created, data loaded)
3. Change to phase: "load" in values.yaml
4. Deploy again to run the benchmark
5. Run multiple load tests against the same data (change parameters, test different scenarios)
Building a 100-warehouse (~10 GB) TPC-C database takes 15-30 minutes. A load phase runs in minutes. Separating the phases means you build once, then test multiple configurations without reloading data every time.
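Because the build is reused, sweeping load-phase parameters against the same dataset is cheap. This sketch just enumerates the (virtual users, duration) combinations you would set in values.yaml before each load deployment; the specific values are illustrative:

```python
# Sketch: enumerate load-phase parameter combinations to run against one
# build. The vu_counts and durations shown are illustrative defaults, not
# values HammerDB-Scale prescribes.

from itertools import product

def load_phase_sweep(vu_counts=(4, 8, 16), durations=(5, 10)):
    """Yield one load-phase config dict per (virtual users, minutes) pair."""
    for vu, minutes in product(vu_counts, durations):
        yield {"phase": "load", "load_num_vu": vu, "duration": minutes}

configs = list(load_phase_sweep())
print(f"{len(configs)} load runs against a single build")
```

Each entry maps to one edit of values.yaml followed by one load deployment.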
Step 4: Deploy
You have two options: use the helper script or deploy directly with Helm.
Option A: Use the deploy script
```shell
./deploy-test.sh --phase load --test-id demo --benchmark tprocc
```
The script reads your values.yaml, validates the configuration, and deploys the appropriate resources based on your phase and targets.
A successful run of the deploy script outputs something like:

```
==========================================
HammerDB Scale Deployment
==========================================
Phase:     load
Test ID:   demo
Benchmark: tprocc
Namespace: default
==========================================
NAME: load-demo
LAST DEPLOYED: Mon Nov  3 13:31:28 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
==========================================
Monitor logs:
  kubectl logs -n default -l hammerdb.io/phase=load --follow

View job status:
  kubectl get jobs -n default -l hammerdb.io/test-run=demo

Aggregate results (after completion):
  ./aggregate-results.sh load demo

Cleanup when done:
  helm uninstall load-demo -n default
```
Option B: Direct Helm deployment
```shell
helm install hammerdb-test . -f values.yaml
```
This gives you more control over the deployment name and Helm-specific options.
What happens during deployment:
For each target in your values.yaml, Kubernetes creates:
- A worker pod running the HammerDB benchmark
- ConfigMaps with your test configuration
- Secrets for database credentials
- Optional: Storage monitoring sidecar (if Pure Storage metrics enabled)
Example: for a two-target scenario, two jobs and two corresponding pods are created:

```
[root@OpenShift-Manager hammerdb-scale]# kubectl get jobs
NAME                                              STATUS    COMPLETIONS   DURATION   AGE
load-demo-hammerdb-scale-load-sql-bench-01-demo   Running   0/1           49s        49s
load-demo-hammerdb-scale-load-sql-bench-02-demo   Running   0/1           49s        49s

[root@OpenShift-Manager hammerdb-scale]# kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
load-demo-hammerdb-scale-load-sql-bench-01-demo-nk6dk   1/1     Running   0          54s
load-demo-hammerdb-scale-load-sql-bench-02-demo-fhwsd   1/1     Running   0          54s
```
The pods start immediately and begin executing the phase you specified (build or load).
Step 5: Watch the pods
Monitor test progress:
```shell
kubectl get pods -w
```
You’ll see pods transition from ContainerCreating → Running → Completed. Each target gets its own worker pod that runs independently.
To watch the actual benchmark execution in real-time, follow the logs from all workers:
```shell
kubectl logs -n default -l hammerdb.io/phase=load --follow
```
This shows live throughput metrics from each database as the test runs:
```
742998 MSSQLServer tpm
773586 MSSQLServer tpm
803166 MSSQLServer tpm
Vuser 1:Rampup 2 minutes complete ...
Vuser 1:Rampup complete, Taking start Transaction Count.
Vuser 1:Timing test period of 5 in minutes
824622 MSSQLServer tpm
859374 MSSQLServer tpm
896580 MSSQLServer tpm
```
Each line shows transactions per minute for a worker. Multiple workers running in parallel means you’ll see interleaved output from different databases. The rampup period warms up the workload before the timed test begins.
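The interleaved log stream is easy to post-process. Here is a minimal parser for the `<number> <engine> tpm` lines shown above, ignoring rampup and Vuser notices:

```python
# Pull live throughput samples ("742998 MSSQLServer tpm") out of the raw
# `kubectl logs` stream. Non-matching lines (rampup and Vuser notices) are
# skipped.

import re

TPM_LINE = re.compile(r"^(\d+)\s+(\S+)\s+tpm$")

def parse_tpm_samples(log_text):
    """Return a list of (engine, tpm) tuples from raw worker log output."""
    samples = []
    for line in log_text.splitlines():
        m = TPM_LINE.match(line.strip())
        if m:
            samples.append((m.group(2), int(m.group(1))))
    return samples

log = """742998 MSSQLServer tpm
Vuser 1:Rampup 2 minutes complete ...
803166 MSSQLServer tpm"""
print(parse_tpm_samples(log))  # [('MSSQLServer', 742998), ('MSSQLServer', 803166)]
```

With multiple workers you would capture each pod's logs separately (the output above is interleaved), then parse per pod.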
Step 6: Get results
After all workers complete, run the aggregation script:
```shell
./aggregate-results.sh --phase load --test-id demo
```
The script collects results from all workers, correlates with infrastructure metrics (if enabled), and generates a unified report:
```
========================================
HammerDB Results Aggregation
========================================
Phase:     load
Test ID:   demo
Namespace: default

[INFO] Found 2 job(s)

[Processing] load-demo-hammerdb-scale-load-sql-bench-01-demo
  Status: Completed
  Detected database: mssql, benchmark: tprocc
  Parsing results...

[Processing] load-demo-hammerdb-scale-load-sql-bench-02-demo
  Status: Completed
  Detected database: mssql, benchmark: tprocc
  Parsing results...

========================================
Results Summary
========================================
Total Jobs: 2
Successful: 2
Failed:     0

Aggregated Metrics:
  Total TPM:  2143970
  Total NOPM: 922523
```
The output shows per-database performance and aggregate metrics across all targets. Results are saved to ./results/demo/load/ in both text and JSON formats for further analysis.
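If you want to post-process the saved JSON yourself, something like the following works. Note that the field names (`tpm`, `nopm`) are an assumed schema for illustration; check the actual result files for the exact keys:

```python
# Sketch: sum per-target metrics across the JSON result files saved under
# ./results/<test-id>/<phase>/. The "tpm"/"nopm" keys are an assumed schema,
# not documented field names — verify against the real files.

import json
from pathlib import Path

def aggregate_results(results_dir):
    """Sum TPM and NOPM across every per-target JSON result file."""
    totals = {"tpm": 0, "nopm": 0, "targets": 0}
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        totals["tpm"] += data.get("tpm", 0)
        totals["nopm"] += data.get("nopm", 0)
        totals["targets"] += 1
    return totals
```

For the demo run above you would call `aggregate_results("./results/demo/load")`.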
If Pure Storage monitoring was enabled, the summary includes storage metrics (IOPS, latency, bandwidth) aligned with benchmark execution, showing exactly where infrastructure behavior correlates with application performance changes.
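The alignment idea behind that correlation is simple: storage metrics arrive every pollInterval seconds (5 s in the config above), benchmark samples arrive on their own cadence, and correlating them means pairing each throughput sample with the nearest-in-time storage sample. A sketch with illustrative data:

```python
# Sketch of temporal alignment: pair each (timestamp, tpm) benchmark sample
# with the closest (timestamp, latency_us) storage sample. The timestamps and
# values below are illustrative, not real measurements.

from bisect import bisect_left

def align(bench_samples, storage_samples):
    """Pair each benchmark sample with the nearest storage sample by time."""
    times = [t for t, _ in storage_samples]
    paired = []
    for t, tpm in bench_samples:
        i = bisect_left(times, t)
        # Compare the two neighboring storage samples and keep the closer one.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        j = min(candidates, key=lambda k: abs(times[k] - t))
        paired.append((t, tpm, storage_samples[j][1]))
    return paired

bench = [(12, 742998), (22, 803166)]
storage = [(10, 126), (15, 130), (20, 128), (25, 133)]
print(align(bench, storage))  # [(12, 742998, 126), (22, 803166, 128)]
```

This nearest-sample pairing is the generic technique; the actual correlation logic in the tool may differ in detail.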
What You Actually Find
Here’s real data from testing consolidated SQL Server workloads on shared storage, progressively scaling from 1 to 8 databases. Platform details are anonymized, but the performance patterns are representative of what you’ll find when stress testing consolidated infrastructure.
| Databases | Aggregate NOPM | Aggregate TPM | Storage Write Latency (avg) | Storage Read Latency (avg) | Storage Write IOPS |
|---|---|---|---|---|---|
| 1 | 336,147 | 780,321 | 126 µs | 379 µs | 86,371 |
| 2 | 619,819 | 1,439,430 | 674 µs | 656 µs | 168,328 |
| 4 | 573,041 | 1,331,451 | 2,789 µs | 1,384 µs | 171,444 |
| 8 | 533,129 | 1,238,515 | 4,892 µs | 3,046 µs | 166,200 |
(NOPM = New Orders Per Minute, the TPC-C metric for transaction throughput.)
What the data shows:
- Performance peaks at 2 databases (620K NOPM), then degrades as consolidation density increases
- NOPM drops 7.5% at 4 databases and 14% at 8 databases, relative to the 2-database peak
- Write latency escalates: 126µs → 674µs → 2,789µs → 4,892µs
- Read latency follows similar pattern: 379µs → 656µs → 1,384µs → 3,046µs
- Write IOPS plateau around 170K regardless of database count
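The degradation percentages are straightforward arithmetic on the table, measured against the 2-database peak:

```python
# Reproduce the NOPM degradation figures from the table: each row's drop
# relative to the 2-database peak (619,819 NOPM).

nopm = {1: 336_147, 2: 619_819, 4: 573_041, 8: 533_129}
peak = max(nopm.values())

for dbs, value in nopm.items():
    drop = (peak - value) / peak * 100
    print(f"{dbs} databases: {value:,} NOPM ({drop:.1f}% below peak)")
```

Running this confirms the 7.5% drop at 4 databases and 14% at 8.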
This isn’t a scaling problem. It’s a consolidation density problem. Something in the infrastructure saturates between 2 and 4 databases, causing latency to spike while IOPS plateau. More databases competing for the same resources creates contention that degrades individual workload performance. The declining performance at 8 databases shows the upper limits of the consolidated stack under this workload.
This is what a good load test does: it finds the saturation point. The data shows peak performance (2 databases, 620K NOPM, sub-millisecond latency) and where the system tips over into unacceptable territory (4+ databases, degraded throughput, multi-millisecond latency). Without this empirical boundary, you’re guessing at capacity. With it, you know exactly where acceptable performance ends and where you’re pushing into risky territory.
Without HammerDB-Scale, you’d see inconsistent database performance but wouldn’t know why. With infrastructure correlation, the pattern becomes clear: the platform handles 2 databases well, shows stress at 4, and reaches practical limits at 8. The correlation between application metrics (NOPM degradation) and infrastructure metrics (latency escalation, IOPS plateau) exposes where the bottleneck emerges.
What this means for capacity planning:
That 2-database sweet spot tells you the optimal consolidation density for this workload on this platform. Push beyond it and you’re trading throughput for density. Whether 4.9ms average write latency at 8 databases is acceptable depends entirely on your application SLAs and business requirements.
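Dividing aggregate NOPM by instance count makes the throughput-for-density trade concrete, since each database's share shrinks as consolidation rises:

```python
# Per-database view of the table above: aggregate NOPM divided by the number
# of instances sharing the platform.

nopm = {1: 336_147, 2: 619_819, 4: 573_041, 8: 533_129}

for dbs, total in nopm.items():
    print(f"{dbs} databases: {total // dbs:,} NOPM each")
```

Each database at 8-way consolidation achieves roughly a fifth of what a lone database gets, which is exactly the number an SLA conversation needs.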
What this exposes:
- Storage limits: IOPS ceilings, latency degradation curves, bandwidth saturation points
- Compute contention: CPU saturation patterns, memory pressure thresholds
- Network bottlenecks: Bandwidth limits, protocol overhead under multi-tenant load
- Consolidation ratios: Empirical data showing where performance degrades
Why This Works
The value isn’t running multiple HammerDB instances. Anyone can do that manually (if they hate themselves). The value is eliminating the operational nightmare.
Manual approach: Deploy each database. Configure each HammerDB instance. Coordinate test timing. Collect results from eight different systems. Manually correlate application metrics with storage telemetry. Realize you configured database #5 wrong. Start over. Six hours later, you have results you don’t trust.
HammerDB-Scale approach: Edit YAML file. Run helm install. Get coffee. Retrieve results. Thirty minutes later, you have consistent results with infrastructure correlation showing exactly where things broke.
The tool enforces consistency. Same workflow definition produces identical test configurations every time. No “did I configure all eight databases the same way?” uncertainty.
The tool enables correlation. Application metrics aligned with infrastructure behavior automatically. Transaction rates correlated with storage latency. Throughput correlated with IOPS consumption. Performance degradation correlated with resource saturation. The cause-and-effect relationships that would be invisible looking at either layer alone become obvious.
Getting Started
HammerDB-Scale is open source on GitHub at https://github.com/PureStorage-OpenConnect/hammerdb-scale. The repo includes Helm charts, example configurations, and integration guides for Pure Storage FlashArray monitoring (extensible framework; other platforms can be added).
Prerequisites: Kubernetes cluster (single-node works for testing), target databases (SQL Server supported, PostgreSQL and Oracle planned), optional Pure Storage FlashArray with API access for infrastructure monitoring.
The repo includes examples for common scenarios: 4-database consolidation tests, pre-migration validation workflows, protocol comparison setups (iSCSI vs NVMe/TCP under realistic load). Start with the quickstart, modify for your environment, run it.
Fair warning: Things will break. Worker pods will OOM if you undersize resources. Storage API credentials will be wrong. Test timing will misalign. Database connections might fail if your targets aren’t configured for the test load. This is normal. The documentation covers common issues. That’s why you test in a lab before production.
Database benchmarks were designed to test database engines, but they inadvertently became the best tools for stressing infrastructure with realistic workloads. HammerDB-Scale embraces this. If you’re building consolidated platforms and need to know where they break, this gives you empirical answers in under an hour.
If you’ve tested consolidated infrastructure and found interesting bottlenecks, I want to hear about it. What broke first in your environment? Find me on LinkedIn.