Most business systems include some form of backend processing. This could be report generation, data transformations, credit card processing, payment auditing, or countless other scenarios. It’s typical for these systems to pull records out of a queue, perform the necessary processing, and then move on to the next record. When possible, these systems are engineered to process more than one record at a time, reducing overhead and increasing efficiency. Each time a batch processing system is created though, we face a difficult question.
What is the best batch size?
This question is always hard to answer because we know that our development environment will differ from the production environment. To combat this problem, most developers define an environment variable or configuration setting that will control the batch size, and then hard-code a default value if the setting is not supplied. This provides a feeling of comfort that we can change the setting in production without having to update the code. But this approach falls short in many ways.
NuGet Package Statistics
NuGet.org creates records every time a package is downloaded—this happens about 750,000 times per day or 8.5 times per second. The records are stored in the production database in a denormalized table where the raw values can easily be inserted at that pace. Then twice each day, we produce updated package download reports for every package with download activity since the last time its report was generated.
To generate these package download reports, we have backend processes that aggregate total download numbers, replicate the records into a warehouse database, and then purge records that are at least 7 days old and that have already been replicated. Each of these processes works against batches of records; choosing batch sizes for each of them was difficult.
When trying to select a batch size for each of these processes, we realized that there are lots of factors that come into play. Here are the variables that we found to have significant impact on throughput:
- Scale settings for our production database (SQL Azure)
- Scale settings for our warehouse database (SQL Azure)
- Virtual Machine specifications on our backend processing server (Azure VM)
- Current load on the production database
- Current load on the warehouse database
- Current load on the backend processing server (it performs lots of other backend jobs at the same time)
- Index fragmentation in the production database
- Index fragmentation in the warehouse database
- Number of records in the queue
- Network latency
Each time any of these factors changed, the previous choice we’d made for our batch sizes become stale. Every once in a while, a batch would fail, cause an error, and raise an operations alert. We would then file a bug: “Stats Replicator cannot process the current batch size without timing out.” There are two obvious fixes for the bug:
- Increase the timeout
- Reduce the batch size
Either of these “fixes” would get the job unstuck, but then it’s just a matter of time before the change is stale.
The Edge of Failure
Batch processing can be more efficient because it reduces overhead. There’s startup/shutdown time required for each iteration of the process. When you only pay the startup/shutdown cost once but process thousands of records, the savings can be significant. The bigger the batch, the more we save on overhead. But there’s usually a breaking point where giant batch sizes lead to failure. Finding the largest batch size that can be successfully processed often yields the best performance.
To make the backend processes for NuGet.org as efficient as possible at all times, I created an approach that discovers this breaking point and then automatically adapts batch sizes to achieve the best throughput attainable within the current environment.
Defining Batch Size Ranges
Instead of defining a single batch size setting to be used, the new approach uses a pair of parameters to specify the minimum and maximum batch sizes. These batch sizes aren’t guesses, they are objective numbers with meaning.
Minimum Batch Size
The minimum batch size is truly a minimum. If the system fails to process a batch of this size, it is considered an error and the process will crash. This will lead to an operations alert to inform the team that something is wrong.
Maximum Batch Size
The maximum batch size is the max size that we would ever want to be processed at one time. This number can be selected based on the scenario and it should take into account issues like debugging when a batch encounters a bug. But this number should be as large as you’re comfortable with—don’t worry about what the system will be “capable of” handling—because all of the factors above affect the capability. If you scale your server up significantly, a previously unfathomable batch size may become not only possible, but preferable.
Sampling and Adapting
With a batch size range provided, we can now take samples of different batch sizes. This sampling will produce two important pieces of data:
- The edge of failure, where the batch succeeds but larger batch sizes fail (generally by exceeding a timeout period)
- The throughput measured for each sampled batch size, in terms of records per second
To accomplish the sampling, we take the following approach:
- Process the minimum batch size and record the throughput (records/second)
- Incrementally increase the batch size toward the maximum batch size, stepping by 10%
batchSize = minBatchSize + ((maxBatchSize - minBatchSize) / 10 * samplesTaken);
- Record the throughput for each sample
batchTimes[perSecond] = batchSize;
- If a batch size times out, record its throughput as Int32.MaxValue and decrease the maximum batch size by 33%
maxBatchSize = maxBatchSize * 2 / 3;
Once we’ve finished taking our 11 samples (yes, 11, because fenceposts), we then use the sampling data to begin adapting our batch sizes. Each time we’re ready to process another batch, we calculate the next batch size to use. This calculation aims to find the best possible batch size, but we don’t simply want to choose the best batch size we’ve seen so far because there’s usually a batch size better than what we’ve already seen. Instead, we select the best 25% of our batches and then use the average batch size across them.
var bestBatches = batchTimes.OrderByDescending(b => b.Key).Take(batchTimes.Count / 4);
var nextBatchSize = (int)bestBatches.Select(b=> b.Value).Average();
We will then use this size to process the next batch. We’ll record its throughput and add it into our samples. As we continue to process more batches, we’ll have a larger pool of sample values to select our 25% best batches from, and we’ll be averaging out more batch sizes. But because previous batch sizes were selected based on the averages in the first place, the result is zeroing in on the batch size that yields the best throughput.
Examining the Numbers
Let’s take a look at how this can play out.
- Min Batch Size: 100
- Max Batch Size: 10000
- Timeout Period: 30 seconds
- Batch: 100; Time: 1 sec; Pace: 100/sec
- Batch: 1090; Time: 9 sec; Pace: 121/sec
- Batch: 2080; Time: 14 sec; Pace: 149/sec
- Batch: 3070; Time: 19 sec; Pace: 162/sec
- Batch: 4060; Time: 26 sec; Pace: 156/sec
- Batch: 5040; Time: TIMEOUT (Int32.MaxValue). Max set to 10000 * 2 /3 = 6667
- Batch: 4042; Time: 25 sec; Pace: 161/sec
- Batch: 4699; Time: 29 sec; Pace: 162/sec
- Batch: 5356; Time: TIMEOUT (Int32.MaxValue). Max set to 6667 * 2 / 3 = 4445
- Batch: 4015; Time: 26 sec; Pace: 154/sec
- Batch: 4445; Time: 27 sec; Pace: 165/sec
After taking these 11 samples, we’ve learned that we can’t seem to get past ~5000 records in a batch without timing out; the maximum successful batch was 4699 at 29 seconds (162/sec). But we also see that within the timeout period, larger batches are providing better throughput than smaller batches. The system will now automatically adapt to use this data.
The samples we've taken can be ordered like this:
- 4445 (165/sec)
- 4699 (162/sec)
- 3070 (162/sec)
- 4042 (161/sec)
- 4060 (156/sec)
- 4015 (154/sec)
- 2080 (149/sec)
- 1090 (121/sec)
- 100 (100/sec)
- 5040 (Int32.MaxValue/sec)
- 5356 (Int32.MaxValue/sec)
Considering the best 25% of these values (that will be the top 3), we calculate the average of the batch sizes to be 4071. That will be the next batch size. We’ll time that batch as well, and put its data into the sample set.
As more batches are executed, we’ll see performance fluctuate, batch sizes vary a bit, but ultimately narrow down to a small deviation. After around 100 iterations, the value becomes relatively static. So the next step is to guard against circumstances changing and our data becoming stale.
After around 100 iterations, we lose some of our ability to adapt. Even if the times start to get very bad for the batch size we’re zeroing in on, there’s too much data indicating that batch size should be efficient. The easiest way to combat this problem is to perform periodic resets. After 100 iterations, simply reset all sample data and start fresh—take 11 new samples and then run 89 more iterations afterward, adapting anew.
While this reset can lead to a few inefficient batches, it’s an important part of what makes the system fully reliable. If load on the production system or any of the other throughput factors changes, it won’t be long before we reset and discover that we need to change our target range.
This approach is in use within a few of our backend processes around package statistics. The most straight-forward example is the job that finds package statistics from the production database that have already been replicated over to the warehouse and can now be purged from the production database.
The biggest benefit I've seen from this approach is that our production system stays alive and efficient all the time. We used to have to tweak the batch sizes pretty regularly. And when our statistics processing fell behind, it could take a long time to catch up because our batch sizes were conservative. Now, the batch sizes can get more aggressive automatically, while ensuring we avoid timeouts.
Overall, these processes are now much more hands-off. If we need to increase throughput, we can scale a server up and the process will automatically take advantage of the improvement and use bigger batch sizes if that yields better results. But if the system is under load, the process will automatically back off if smaller batch sizes are proving to run at a steady pace.