Your storage vendor claims 500K IOPS. Your server vendor promises 128 cores of “enterprise-grade performance.” Your hypervisor can “easily handle” 50 VMs. Your network fabric has “plenty of headroom.” Then you consolidate eight production databases onto the platform and everything falls over at 3 AM on a Tuesday.
What happened? Nobody actually tested what happens when multiple databases compete for the same resources. Storage saturates. CPUs contend. Network bandwidth plateaus. And you find out in production instead of in testing, because traditional benchmarking is a tedious manual process that measures one system at a time.
The traditional approach? Spin up eight databases. Configure eight HammerDB instances. Coordinate test timing. Collect results from eight different systems. Manually correlate application metrics with infrastructure telemetry. By the time you’re done, the migration deadline has passed and your boss is asking why you’re still “testing.”
HammerDB-Scale fixes this. Orchestrate parallel database benchmarks on Kubernetes, automatically correlate application metrics with infrastructure behavior, get empirical answers about where your platform breaks. Instead of asking “how fast is this database?” it answers “how many databases can this platform handle before something breaks?”
Acknowledgments
This project builds directly on Anthony Nocentino’s foundational work containerizing HammerDB. His implementation made orchestrating database benchmarks at scale possible. HammerDB-Scale extends that work with multi-target coordination, infrastructure monitoring integration, and automated result correlation.
HammerDB-Scale
HammerDB-Scale is a Kubernetes-native orchestration framework that uses database workloads to stress test infrastructure. It measures database performance (NOPM, TPM, QPH) across multiple instances simultaneously to expose infrastructure bottlenecks. The database metrics tell you where the platform underneath breaks.
Define your targets in YAML. Deploy via Helm. Retrieve aggregated results. The tool handles the tedious bits: orchestration, parallel execution, metric collection, temporal alignment, correlation. All automatic.
This isn’t testing distributed databases (single logical database across multiple nodes). It’s testing multiple independent database instances sharing infrastructure. TPC-C and TPC-H produce realistic I/O patterns, CPU utilization, and memory pressure that synthetic tools like fio can’t replicate. Your storage array behaves differently under actual database workloads than it does under synthetic sequential writes.
How to Use It
The workflow assumes you already have database instances and infrastructure running, plus a Linux system with kubectl and cluster access configured. From there: clone the repo, configure your targets, choose a phase, deploy via Helm, watch it run, get results. Six steps, thirty minutes, empirical data about where your infrastructure breaks.
Step 1: Clone the repo
Clone the repository:
```shell
git clone https://github.com/PureStorage-OpenConnect/hammerdb-scale
cd hammerdb-scale
```
The repo structure includes:
```
hammerdb-scale/
├── Chart.yaml              # Helm chart metadata
├── values.yaml             # Your configuration goes here
├── values-examples.yaml    # Example configurations
├── templates/              # Helm chart templates
├── scripts/                # Benchmark and monitoring scripts
├── deploy-test.sh          # Deployment helper script
├── aggregate-results.sh    # Results aggregation
├── ADDING-DATABASES.md     # Guide for adding new database types
└── README.md               # Full documentation
```
Key files: values.yaml (where you configure your tests), deploy-test.sh (quick deployment), and aggregate-results.sh (get your results).
Step 2: Configure your test in values.yaml
The values.yaml file defines three things: what to test, how to test it, and optional infrastructure monitoring.
What to test – your database targets (here, an example with two):

```yaml
targets:
  - name: sql-server-01
    type: mssql
    host: "sqlserver1.example.com"
    username: sa
    password: "YourSecurePassword"
  - name: sql-server-02
    type: mssql
    host: "sqlserver2.example.com"
    username: sa
    password: "YourSecurePassword"
```
Add as many targets as you want to test. Each gets an independent worker pod.
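When a test calls for many targets, a small generator beats hand-editing YAML. This is a sketch that assumes the mssql target shape shown above; the `sql-server-NN` naming and host pattern are illustrative, not anything HammerDB-Scale requires:

```python
# Sketch: generate a targets list for values.yaml programmatically instead of
# hand-editing it. The naming scheme (sql-server-NN) and the host pattern are
# illustrative assumptions, not part of HammerDB-Scale itself.

def make_targets(count, host_pattern="sqlserver{n}.example.com",
                 username="sa", password="YourSecurePassword"):
    """Return a list of target dicts, one per database instance."""
    return [
        {
            "name": f"sql-server-{n:02d}",
            "type": "mssql",
            "host": host_pattern.format(n=n),
            "username": username,
            "password": password,
        }
        for n in range(1, count + 1)
    ]

def to_yaml(targets):
    """Render the targets as a YAML fragment (no external library needed)."""
    lines = ["targets:"]
    for t in targets:
        lines.append(f'  - name: {t["name"]}')
        lines.append(f'    type: {t["type"]}')
        lines.append(f'    host: "{t["host"]}"')
        lines.append(f'    username: {t["username"]}')
        lines.append(f'    password: "{t["password"]}"')
    return "\n".join(lines)

print(to_yaml(make_targets(8)))
```

Paste the output into values.yaml and each target still gets its own independent worker pod.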
How to test – benchmark parameters:

```yaml
testRun:
  phase: "build"       # "build" (schema + data) or "load" (run benchmark)
  benchmark: "tprocc"  # "tprocc" (OLTP) or "tproch" (analytics)

hammerdb:
  tprocc:
    warehouses: 100    # Database size (~100MB per warehouse)
    load_num_vu: 8     # Virtual users (concurrency)
    duration: 5        # Test duration in minutes
```
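The warehouse comment above doubles as a sizing rule of thumb. Here is a rough estimate, assuming ~100 MB per warehouse; actual on-disk size varies by engine, fill factor, and compression:

```python
# Rough sizing helper for the build phase. The ~100 MB-per-warehouse figure
# comes from the values.yaml comment above and is only an approximation.

def tprocc_dataset_size_gb(warehouses, mb_per_warehouse=100):
    """Estimate the TPC-C dataset size in GB for a given warehouse count."""
    return warehouses * mb_per_warehouse / 1024

for wh in (100, 500, 1000):
    print(f"{wh} warehouses ~ {tprocc_dataset_size_gb(wh):.1f} GB per target")
```

Remember that every target builds its own dataset, so multiply by the number of targets when sizing storage.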
Optional: Infrastructure monitoring (Pure Storage FlashArray):
```yaml
pureStorage:
  enabled: true
  host: "192.168.1.100"
  apiToken: "your-api-token-here"
  pollInterval: 5      # Collect metrics every 5 seconds
```
The full values.yaml has detailed comments explaining every parameter. See values-examples.yaml for complete scenarios (4-database consolidation test, pre-migration validation, protocol comparison).
Step 3: Choose build or load phase
Set the phase in values.yaml:
```yaml
testRun:
  phase: "build"   # Options: "build" or "load"
```
Build phase: Creates the database schema and loads test data. Run this first.
Load phase: Runs the actual benchmark workload against existing data. Run this after the build completes.
Typical workflow:
1. Set phase: "build" and deploy
2. Wait for the build to complete (schema created, data loaded)
3. Change to phase: "load" in values.yaml
4. Deploy again to run the benchmark
5. Run multiple load tests against the same data (change parameters, test different scenarios)
Building a 100-warehouse (~10 GB) TPC-C database takes 15-30 minutes. A load phase runs in minutes. Separating the phases means you build once, then test multiple configurations without reloading data every time.
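Because the build is reused, sweeping load-phase parameters against the same dataset is cheap. This sketch just enumerates the (virtual users, duration) combinations you would set in values.yaml before each load deployment; the specific values are illustrative:

```python
# Sketch: enumerate load-phase parameter combinations to run against one
# build. The vu_counts and durations shown are illustrative defaults, not
# values HammerDB-Scale prescribes.

from itertools import product

def load_phase_sweep(vu_counts=(4, 8, 16), durations=(5, 10)):
    """Yield one load-phase config dict per (virtual users, minutes) pair."""
    for vu, minutes in product(vu_counts, durations):
        yield {"phase": "load", "load_num_vu": vu, "duration": minutes}

configs = list(load_phase_sweep())
print(f"{len(configs)} load runs against a single build")
```

Each entry maps to one edit of values.yaml followed by one load deployment.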
Step 4: Deploy
You have two options: use the helper script or deploy directly with Helm.
Option A: Use the deploy script
```shell
./deploy-test.sh --phase load --test-id demo --benchmark tprocc
```
The script reads your values.yaml, validates the configuration, and deploys the appropriate resources based on your phase and targets.
A successful run of the deploy script outputs something like:

```
==========================================
HammerDB Scale Deployment
==========================================
Phase:     load
Test ID:   demo
Benchmark: tprocc
Namespace: default
==========================================
NAME: load-demo
LAST DEPLOYED: Mon Nov  3 13:31:28 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
==========================================
Monitor logs:
  kubectl logs -n default -l hammerdb.io/phase=load --follow

View job status:
  kubectl get jobs -n default -l hammerdb.io/test-run=demo

Aggregate results (after completion):
  ./aggregate-results.sh load demo

Cleanup when done:
  helm uninstall load-demo -n default
```
Option B: Direct Helm deployment
```shell
helm install hammerdb-test . -f values.yaml
```
This gives you more control over the deployment name and Helm-specific options.
What happens during deployment:
For each target in your values.yaml, Kubernetes creates:
- A worker pod running the HammerDB benchmark
- ConfigMaps with your test configuration
- Secrets for database credentials
- Optional: Storage monitoring sidecar (if Pure Storage metrics enabled)
Example: for a two-target scenario, two jobs and two corresponding pods are created:

```
[root@OpenShift-Manager hammerdb-scale]# kubectl get jobs
NAME                                              STATUS    COMPLETIONS   DURATION   AGE
load-demo-hammerdb-scale-load-sql-bench-01-demo   Running   0/1           49s        49s
load-demo-hammerdb-scale-load-sql-bench-02-demo   Running   0/1           49s        49s

[root@OpenShift-Manager hammerdb-scale]# kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
load-demo-hammerdb-scale-load-sql-bench-01-demo-nk6dk   1/1     Running   0          54s
load-demo-hammerdb-scale-load-sql-bench-02-demo-fhwsd   1/1     Running   0          54s
```
The pods start immediately and begin executing the phase you specified (build or load).
Step 5: Watch the pods
Monitor test progress:
```shell
kubectl get pods -w
```
You’ll see pods transition from ContainerCreating → Running → Completed. Each target gets its own worker pod that runs independently.
To watch the actual benchmark execution in real-time, follow the logs from all workers:
```shell
kubectl logs -n default -l hammerdb.io/phase=load --follow
```
This shows live throughput metrics from each database as the test runs:
```
742998 MSSQLServer tpm
773586 MSSQLServer tpm
803166 MSSQLServer tpm
Vuser 1:Rampup 2 minutes complete ...
Vuser 1:Rampup complete, Taking start Transaction Count.
Vuser 1:Timing test period of 5 in minutes
824622 MSSQLServer tpm
859374 MSSQLServer tpm
896580 MSSQLServer tpm
```
Each line shows transactions per minute for a worker. Multiple workers running in parallel means you’ll see interleaved output from different databases. The rampup period warms up the workload before the timed test begins.
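The interleaved log stream is easy to post-process. Here is a minimal parser for the `<number> <engine> tpm` lines shown above, ignoring rampup and Vuser notices:

```python
# Pull live throughput samples ("742998 MSSQLServer tpm") out of the raw
# `kubectl logs` stream. Non-matching lines (rampup and Vuser notices) are
# skipped.

import re

TPM_LINE = re.compile(r"^(\d+)\s+(\S+)\s+tpm$")

def parse_tpm_samples(log_text):
    """Return a list of (engine, tpm) tuples from raw worker log output."""
    samples = []
    for line in log_text.splitlines():
        m = TPM_LINE.match(line.strip())
        if m:
            samples.append((m.group(2), int(m.group(1))))
    return samples

log = """742998 MSSQLServer tpm
Vuser 1:Rampup 2 minutes complete ...
803166 MSSQLServer tpm"""
print(parse_tpm_samples(log))  # [('MSSQLServer', 742998), ('MSSQLServer', 803166)]
```

With multiple workers you would capture each pod's logs separately (the output above is interleaved), then parse per pod.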
Step 6: Get results
After all workers complete, run the aggregation script:
```shell
./aggregate-results.sh --phase load --test-id demo
```
The script collects results from all workers, correlates with infrastructure metrics (if enabled), and generates a unified report:
```
========================================
HammerDB Results Aggregation
========================================
Phase:     load
Test ID:   demo
Namespace: default

[INFO] Found 2 job(s)

[Processing] load-demo-hammerdb-scale-load-sql-bench-01-demo
  Status: Completed
  Detected database: mssql, benchmark: tprocc
  Parsing results...

[Processing] load-demo-hammerdb-scale-load-sql-bench-02-demo
  Status: Completed
  Detected database: mssql, benchmark: tprocc
  Parsing results...

========================================
Results Summary
========================================
Total Jobs: 2
Successful: 2
Failed:     0

Aggregated Metrics:
  Total TPM:  2143970
  Total NOPM: 922523
```
The output shows per-database performance and aggregate metrics across all targets. Results are saved to ./results/demo/load/ in both text and JSON formats for further analysis.
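If you want to post-process the saved JSON yourself, something like the following works. Note that the field names (`tpm`, `nopm`) are an assumed schema for illustration; check the actual result files for the exact keys:

```python
# Sketch: sum per-target metrics across the JSON result files saved under
# ./results/<test-id>/<phase>/. The "tpm"/"nopm" keys are an assumed schema,
# not documented field names — verify against the real files.

import json
from pathlib import Path

def aggregate_results(results_dir):
    """Sum TPM and NOPM across every per-target JSON result file."""
    totals = {"tpm": 0, "nopm": 0, "targets": 0}
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        totals["tpm"] += data.get("tpm", 0)
        totals["nopm"] += data.get("nopm", 0)
        totals["targets"] += 1
    return totals
```

For the demo run above you would call `aggregate_results("./results/demo/load")`.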
If Pure Storage monitoring was enabled, the summary includes storage metrics (IOPS, latency, bandwidth) aligned with benchmark execution, showing exactly where infrastructure behavior correlates with application performance changes.
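The alignment idea behind that correlation is simple: storage metrics arrive every pollInterval seconds (5 s in the config above), benchmark samples arrive on their own cadence, and correlating them means pairing each throughput sample with the nearest-in-time storage sample. A sketch with illustrative data:

```python
# Sketch of temporal alignment: pair each (timestamp, tpm) benchmark sample
# with the closest (timestamp, latency_us) storage sample. The timestamps and
# values below are illustrative, not real measurements.

from bisect import bisect_left

def align(bench_samples, storage_samples):
    """Pair each benchmark sample with the nearest storage sample by time."""
    times = [t for t, _ in storage_samples]
    paired = []
    for t, tpm in bench_samples:
        i = bisect_left(times, t)
        # Compare the two neighboring storage samples and keep the closer one.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        j = min(candidates, key=lambda k: abs(times[k] - t))
        paired.append((t, tpm, storage_samples[j][1]))
    return paired

bench = [(12, 742998), (22, 803166)]
storage = [(10, 126), (15, 130), (20, 128), (25, 133)]
print(align(bench, storage))  # [(12, 742998, 126), (22, 803166, 128)]
```

This nearest-sample pairing is the generic technique; the actual correlation logic in the tool may differ in detail.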
What You Actually Find
Here’s real data from testing consolidated SQL Server workloads on shared storage, progressively scaling from 1 to 8 databases. Platform details are anonymized, but the performance patterns are representative of what you’ll find when stress testing consolidated infrastructure.
| Databases | Aggregate NOPM | Aggregate TPM | Storage Write Latency (avg) | Storage Read Latency (avg) | Storage Write IOPS |
|---|---|---|---|---|---|
| 1 | 336,147 | 780,321 | 126 µs | 379 µs | 86,371 |
| 2 | 619,819 | 1,439,430 | 674 µs | 656 µs | 168,328 |
| 4 | 573,041 | 1,331,451 | 2,789 µs | 1,384 µs | 171,444 |
| 8 | 533,129 | 1,238,515 | 4,892 µs | 3,046 µs | 166,200 |
(NOPM = New Orders Per Minute, the TPC-C metric for transaction throughput.)
What the data shows:
- Performance peaks at 2 databases (620K NOPM), then degrades as consolidation density increases
- NOPM drops 7.5% at 4 databases and 14% at 8 databases, relative to the 2-database peak
- Write latency escalates: 126µs → 674µs → 2,789µs → 4,892µs
- Read latency follows similar pattern: 379µs → 656µs → 1,384µs → 3,046µs
- Write IOPS plateau around 170K regardless of database count
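The degradation percentages are straightforward arithmetic on the table, measured against the 2-database peak:

```python
# Reproduce the NOPM degradation figures from the table: each row's drop
# relative to the 2-database peak (619,819 NOPM).

nopm = {1: 336_147, 2: 619_819, 4: 573_041, 8: 533_129}
peak = max(nopm.values())

for dbs, value in nopm.items():
    drop = (peak - value) / peak * 100
    print(f"{dbs} databases: {value:,} NOPM ({drop:.1f}% below peak)")
```

Running this confirms the 7.5% drop at 4 databases and 14% at 8.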
This isn’t a scaling problem. It’s a consolidation density problem. Something in the infrastructure saturates between 2 and 4 databases, causing latency to spike while IOPS plateau. More databases competing for the same resources creates contention that degrades individual workload performance. The declining performance at 8 databases shows the upper limits of the consolidated stack under this workload.
This is what a good load test does: it finds the saturation point. The data shows peak performance (2 databases, 620K NOPM, sub-millisecond latency) and where the system tips over into unacceptable territory (4+ databases, degraded throughput, multi-millisecond latency). Without this empirical boundary, you’re guessing at capacity. With it, you know exactly where acceptable performance ends and where you’re pushing into risky territory.
Without HammerDB-Scale, you’d see inconsistent database performance but wouldn’t know why. With infrastructure correlation, the pattern becomes clear: the platform handles 2 databases well, shows stress at 4, and reaches practical limits at 8. The correlation between application metrics (NOPM degradation) and infrastructure metrics (latency escalation, IOPS plateau) exposes where the bottleneck emerges.
What this means for capacity planning:
That 2-database sweet spot tells you the optimal consolidation density for this workload on this platform. Push beyond it and you’re trading throughput for density. Whether 4.9ms average write latency at 8 databases is acceptable depends entirely on your application SLAs and business requirements.
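Dividing aggregate NOPM by instance count makes the throughput-for-density trade concrete, since each database's share shrinks as consolidation rises:

```python
# Per-database view of the table above: aggregate NOPM divided by the number
# of instances sharing the platform.

nopm = {1: 336_147, 2: 619_819, 4: 573_041, 8: 533_129}

for dbs, total in nopm.items():
    print(f"{dbs} databases: {total // dbs:,} NOPM each")
```

Each database at 8-way consolidation achieves roughly a fifth of what a lone database gets, which is exactly the number an SLA conversation needs.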
What this exposes:
- Storage limits: IOPS ceilings, latency degradation curves, bandwidth saturation points
- Compute contention: CPU saturation patterns, memory pressure thresholds
- Network bottlenecks: Bandwidth limits, protocol overhead under multi-tenant load
- Consolidation ratios: Empirical data showing where performance degrades
Why This Works
The value isn’t running multiple HammerDB instances. Anyone can do that manually (if they hate themselves). The value is eliminating the operational nightmare.
Manual approach: Deploy each database. Configure each HammerDB instance. Coordinate test timing. Collect results from eight different systems. Manually correlate application metrics with storage telemetry. Realize you configured database #5 wrong. Start over. Six hours later, you have results you don’t trust.
HammerDB-Scale approach: Edit YAML file. Run helm install. Get coffee. Retrieve results. Thirty minutes later, you have consistent results with infrastructure correlation showing exactly where things broke.
The tool enforces consistency. Same workflow definition produces identical test configurations every time. No “did I configure all eight databases the same way?” uncertainty.
The tool enables correlation. Application metrics aligned with infrastructure behavior automatically. Transaction rates correlated with storage latency. Throughput correlated with IOPS consumption. Performance degradation correlated with resource saturation. The cause-and-effect relationships that would be invisible looking at either layer alone become obvious.
Getting Started
HammerDB-Scale is open source on GitHub at https://github.com/PureStorage-OpenConnect/hammerdb-scale. The repo includes Helm charts, example configurations, and integration guides for Pure Storage FlashArray monitoring (extensible framework; other platforms can be added).
Prerequisites: Kubernetes cluster (single-node works for testing), target databases (SQL Server supported, PostgreSQL and Oracle planned), optional Pure Storage FlashArray with API access for infrastructure monitoring.
The repo includes examples for common scenarios: 4-database consolidation tests, pre-migration validation workflows, protocol comparison setups (iSCSI vs NVMe/TCP under realistic load). Start with the quickstart, modify for your environment, run it.
Fair warning: Things will break. Worker pods will OOM if you undersize resources. Storage API credentials will be wrong. Test timing will misalign. Database connections might fail if your targets aren’t configured for the test load. This is normal. The documentation covers common issues. That’s why you test in a lab before production.
Database benchmarks were designed to test database engines, but they inadvertently became the best tools for stressing infrastructure with realistic workloads. HammerDB-Scale embraces this. If you’re building consolidated platforms and need to know where they break, this gives you empirical answers in under an hour.
If you’ve tested consolidated infrastructure and found interesting bottlenecks, I want to hear about it. What broke first in your environment? Find me on LinkedIn.