A step-by-step guide to estimating the value of increased training data, using a case study from geospatial deep learning

A satellite image and its building footprint ground truth mask from our geospatial case study.

Here’s the scenario: You’ve gone to great difficulty and expense to collect some training data, and you’ve used that data to train a deep neural net. Still, you find yourself wondering, “Do I have enough?” Would more training data cause a meaningful performance boost, or just waste time and money for an inconsequential gain? At first, it seems almost paradoxical: You can’t make an informed decision about gathering more data without an idea of how much it will help, but you can’t measure how much it will help unless you’ve made the decision and acquired the additional data already.

There is a solution, however. By repeatedly retraining a deep neural net with different-sized subsets of the training data that’s already in hand, it may be possible to extrapolate performance out to quantities of training data beyond what’s presently available.

We’ll walk through the procedure using an important test case from the field of computer vision: tracing out the outlines of buildings in satellite photos. The ability to automatically map buildings, without relying on extensive manual labor every time things change, has applications ranging from disaster relief to urban planning.

Our Case Study: Building Footprints in Satellite Imagery

The Data

To undertake a study of satellite imagery, we need to start by getting some images! For that, we turn to SpaceNet, which maintains a free, open source, always-available data store of thousands of square kilometers of labeled high-resolution satellite imagery.

We’ll use SpaceNet’s latest (as of this writing) data release: a series of collects taken over Atlanta, Georgia by Maxar’s WorldView-2 satellite. Included with the imagery are high-quality labels of building footprints (outlines). This dataset was the basis of the now-completed SpaceNet4 Challenge. The imagery shows the same area from 27 different viewing angles, making it ideal for studying how viewing angle affects the performance of geospatial deep learning algorithms. For simplicity, we’ll organize our results into three categories: “nadir” with viewing angles within 25 degrees of nadir (straight down), “off-nadir” with viewing angles from 26 to 40 degrees off nadir, and “far-off-nadir,” with viewing angles exceeding 40 degrees off nadir. We also define “overall” performance as a simple average of performance for the three categories. Fig. 1 shows an example of imagery from different angles.

Fig. 1: Different viewing angles for the same physical location, including (a) an on-nadir view, (b) an off-nadir view from the north, and (c) a far-off-nadir view from the south. Image (d) shows the corresponding building footprint mask, used in training to tell the model what is or isn’t a building.

The Model

The model we’ll train to find buildings also comes from SpaceNet. We’ll use the fifth-place submission to the SpaceNet4 Challenge. Although a few other submissions slightly outperformed this model, this one is desirable for its fast inference time and straightforward architecture. The model’s inference is more than ten times faster than that of the top SpaceNet4 winner. It carries out pixel segmentation using a U-Net with a VGG-16 encoder, then generates building footprint polygons based on the pixel map. To expedite training, the original submission’s ensemble of three neural nets is pared down to a single net, causing only a modest performance degradation discussed elsewhere.

To quantify model performance, an F1 score is calculated for the building footprints identified by the model. For the purpose of the calculation, a footprint is considered correct if it has an IoU (intersection over union) of at least 0.5 with a ground truth building footprint.
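To make the metric concrete, here is a minimal sketch of an IoU-thresholded F1 score. It is a simplification, not the SpaceNet scoring code: footprints are assumed to be axis-aligned boxes rather than polygons, matching is greedy, and the names `box_iou` and `footprint_f1` are invented for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def footprint_f1(predicted, truth, iou_threshold=0.5):
    """F1 score where a prediction counts as correct if it overlaps an
    unmatched ground truth footprint with IoU >= iou_threshold.
    Each ground truth footprint can be matched at most once (greedy)."""
    unmatched = list(truth)
    true_positives = 0
    for p in predicted:
        best = max(unmatched, key=lambda t: box_iou(p, t), default=None)
        if best is not None and box_iou(p, best) >= iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    false_positives = len(predicted) - true_positives
    false_negatives = len(unmatched)
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Toy example (made-up boxes): one good match, one spurious prediction,
# and one missed building.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(1, 1, 10, 10), (50, 50, 60, 60)]
f1 = footprint_f1(predicted, truth)
print(f1)
```

In the toy example, one true positive, one false positive, and one false negative give a precision and recall of 0.5 each, so the F1 score is 0.5.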

The Results

To see how model performance depends on the amount of training data, the same model is trained with different amounts of data.

The data is chipped into tiles that are 900 pixels on a side, each showing an area 450 m on a side. There is an average of 63 buildings per tile. Within the training data are 1064 unique tile locations, with 27 views of each location. Since our algorithm sets aside a quarter of the data for validation, training on the full dataset gives us 1064 * 27 * (3/4) = 21546 images. Fig. 2 shows what happens if we train with less data. The training and evaluation process is repeated ten times, each time starting from scratch with only about half as much data as the time before. When data is scarce, performance rises quickly with new data, but there are diminishing returns when data is abundant.
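The arithmetic above can be checked in a few lines. The halving schedule in this sketch is an assumption: it halves the number of training tile locations (rounding down) and keeps all 27 views of each. That choice is not stated in the text, though it happens to reproduce the 1,323-image subset used later in the article.

```python
TILE_LOCATIONS = 1064
VIEWS_PER_LOCATION = 27

# Three quarters of the data is used for training; the rest is validation.
full_training_images = TILE_LOCATIONS * VIEWS_PER_LOCATION * 3 // 4
print(full_training_images)  # 21546

# Ground resolution: 450 m across 900 pixels.
meters_per_pixel = 450 / 900
print(meters_per_pixel)  # 0.5

# Assumed schedule: halve the number of training tile locations each time.
locations = TILE_LOCATIONS * 3 // 4  # 798 training locations
schedule = []
for _ in range(10):
    schedule.append(locations * VIEWS_PER_LOCATION)
    locations = max(1, locations // 2)
print(schedule)
```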

Fig. 2: Model performance, as measured by F1 score, versus number of training images. Images are 900x900 pixel regions of satellite photos with 0.5m resolution.

Now that we’re all on the same page for our case study, it’s time to use it as the context in which to address our main question: How can we make an educated guess about model performance with lots of data, before that data is even available? To do that, there are a few steps.

Step 1: Know When to Stop Training

For this process, we will need to train the model from scratch many times, with different amounts of data. If the model is hard-coded to train for a certain number of epochs or certain amount of time (whether it uses the final weights after that period or the best intermediate weights), it is worth the effort to study whether all that time is really necessary. Fig. 3 shows model performance as a function of training time for different amounts of data. The top line uses the full data set, and each subsequent line uses about half the training data of the line above it.

Fig. 3: Model performance, as measured by F1 score, versus amount of training time. Training time is expressed as number of image views and also approximate GPU-days on Nvidia Titan Xp GPUs.

Note that when the training data set size is cut in half, the number of training epochs must be doubled to get the same total number of image views. However, the number of image views needed to reach maximum performance may go down with less data. For our geospatial case study, about 60 epochs are needed to reach maximum performance with the full training data set of 21546 images. But when only a few hundred images are used, the number of necessary image views falls below the equivalent of 5 epochs with the full training data set. It takes about four GPU-days (on an Nvidia Titan Xp) to train with the full dataset. While that’s unavoidable, knowing that we can at least get away with less training for the reduced-data cases brings some welcome time savings.
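The relationship between epochs and image views is simple bookkeeping, sketched below. The helper names are invented for illustration, and the 300-image subset is just a stand-in for "a few hundred images."

```python
FULL_DATASET = 21546  # training images in the full data set


def image_views(epochs, n_images):
    """Total number of images shown to the model during training."""
    return epochs * n_images


def epochs_for_views(views, n_images):
    """Epochs needed on a data set of n_images to reach a given view count."""
    return views / n_images


# About 60 epochs on the full data set:
full_views = image_views(60, FULL_DATASET)
print(full_views)  # 1292760

# The equivalent of 5 epochs on the full data set:
small_budget = image_views(5, FULL_DATASET)
print(small_budget)  # 107730

# The same number of views on a hypothetical 300-image subset
# takes many more epochs, but far less total training time than 60 full epochs.
epochs_small = epochs_for_views(small_budget, 300)
print(round(epochs_small, 1))  # 359.1
```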

Once we’ve trained with different amounts of data and evaluated each version of the model, we can plot the points in Fig. 2. But we’re not ready to fit a curve just yet.

Step 2: Generate Error Bars

Many deep learning papers quote performance figures without error bars. However, in the absence of error bars, it is unclear whether observed performance improvements are statistically significant or whether results are even repeatable. For our extrapolation task, error bars are necessary. In the case study, we have ten data points, each representing a different amount of training data. To save time, we’ll calculate the error bars for only four of them, and logarithmically interpolate the rest. To calculate an error bar, we just repeat the model training/testing process multiple times with the given amount of data, and take the standard deviation of the results. Fig. 4 shows the same information as Fig. 2, but with a logarithmic x-axis to better display the error bars for low amounts of data.
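Here is a rough sketch of that error bar procedure. The F1 scores and data set sizes are made-up placeholders, and "logarithmically interpolate" is read here as linear interpolation of log(error) against log(data set size). Other readings are possible.

```python
import numpy as np

# Hypothetical F1 scores from repeated training runs at four data set sizes.
repeats = {
    27: [0.10, 0.16, 0.13],
    324: [0.52, 0.55, 0.53],
    2673: [0.70, 0.71, 0.72],
    21546: [0.780, 0.784, 0.782],
}

# Error bar = sample standard deviation of the repeated results.
sizes_measured = np.array(sorted(repeats))
sigma_measured = np.array([np.std(repeats[n], ddof=1) for n in sizes_measured])

# All ten (illustrative) data set sizes, including the six without repeats.
all_sizes = np.array([27, 81, 162, 324, 648, 1323, 2673, 5373, 10773, 21546])

# Interpolate in log-log space, then convert back to ordinary units.
log_sigma = np.interp(
    np.log(all_sizes), np.log(sizes_measured), np.log(sigma_measured)
)
sigma_all = np.exp(log_sigma)

for n, s in zip(all_sizes, sigma_all):
    print(f"{n:>6d} images: error bar of about {s:.4f}")
```

Using `ddof=1` gives the sample standard deviation, which is the usual choice when only a handful of repeated runs are available.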

Fig. 4: Identical to Fig. 2, but with a logarithmic x-axis.

A better approach, which I’ll be using in a follow-up study, would be to repeatedly train the model for every amount of training data under consideration, and then make a plot not of individual results but instead of the means of the results for each amount of training data.

Step 3: Fit the Curve

Next, to understand the dependence of F1 score on the amount of training data, we want to be able to model the relationship of these variables, ideally with a simple curve.

For ideas, we turn to the literature. There hasn’t been much written about our specific type of case study (data size dependence of the F1 score for semantic segmentation in a geospatial context). However, there has been a lot of work on deep learning classification problems [1, 2, 3, 4]. For those, the data set size dependence of the accuracy was found to scale as a constant minus an inverse power law term (Fig. 5). In a pleasant turn of events, the same functional form shows an excellent fit to our test case even though it’s not a classification problem. In fact, the fitted curves shown as dotted lines in Figs. 2 and 4 were generated using this functional form.

Fig. 5: A constant minus an inverse power law. The variable x is the amount of training data; y is the estimated F1 score; and a, b, and c are positive free parameters.
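One common way to write this form, using the caption's symbols (the exact parameterization is an assumption):

```latex
y = a - b\,x^{-c}, \qquad a,\ b,\ c > 0
```

Under this form, y approaches a as x grows without bound, so a acts as a performance ceiling, while b and c set how quickly that ceiling is approached.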

Once we have a fitted curve, we can use it to extrapolate performance given higher amounts of data. To summarize the process: the same model was trained with different amounts of data, error bars were estimated for each of those amounts, and a weighted regression was used to generate a simple model of how performance scales with the amount of training data. Now it’s time to try out the procedure and see how well it does when the amount of training data is limited.
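As a sketch of what the weighted regression might look like in practice, the example below fits a constant minus an inverse power law with SciPy's `curve_fit`, passing the error bars as `sigma`. The data are synthetic and generated from made-up parameters, and the choice of `curve_fit` is an assumption rather than the tool used in the study.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law_curve(x, a, b, c):
    """Constant minus an inverse power law."""
    return a - b * np.power(x, -c)


# Synthetic stand-in data: illustrative sizes, made-up "true" parameters,
# and error bars that shrink as the amount of data grows.
sizes = np.array([27, 81, 162, 324, 648, 1323, 2673, 5373, 10773, 21546],
                 dtype=float)
sigma = 0.05 * sizes ** -0.25
rng = np.random.default_rng(0)
f1 = power_law_curve(sizes, 0.80, 1.5, 0.45) + rng.normal(0.0, sigma)

# Weighted fit: points with small error bars count for more.
# The bounds keep a, b, and c non-negative.
popt, pcov = curve_fit(
    power_law_curve, sizes, f1,
    p0=(1.0, 1.0, 0.5),
    sigma=sigma, absolute_sigma=True,
    bounds=(0, np.inf),
)
a_fit, b_fit, c_fit = popt
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}, c = {c_fit:.3f}")

# Extrapolate to twice the largest amount of data in hand.
print(power_law_curve(2 * sizes[-1], *popt))
```

The fitted value of `a` is the estimated ceiling on the F1 score, which is often the single most useful number to come out of the fit.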

Putting the Method to the Test

Suppose that instead of having the full training data set, we only had about one-sixteenth of it. With only 1,323 images to train on instead of 21,546, we want to estimate the improvement from acquiring roughly sixteen times as much training data. Fig. 6 shows what happens when we follow the three steps above. Here we’ve simply ignored the points with more than 1,323 images when fitting the curve. This test is somewhat artificial in that the low-training-data samples were not restricted to all be drawn from the same one-sixteenth of the full data set. However, that is not expected to materially change the result.
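The held-out test can be sketched as follows: fit only the points at or below 1,323 images, then compare the extrapolated gain with the gain seen in the points that were ignored. The data are again synthetic, and the upper bound of 1 on the ceiling parameter is an added assumption (an F1 score cannot exceed 1), not something taken from the study.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law_curve(x, a, b, c):
    """Constant minus an inverse power law."""
    return a - b * np.power(x, -c)


# Synthetic stand-in for the ten measured points and their error bars.
sizes = np.array([27, 81, 162, 324, 648, 1323, 2673, 5373, 10773, 21546],
                 dtype=float)
sigma = 0.05 * sizes ** -0.25
rng = np.random.default_rng(1)
f1 = power_law_curve(sizes, 0.80, 1.5, 0.45) + rng.normal(0.0, sigma)

# Pretend that only the data up to 1,323 images exists.
LIMIT = 1323.0
mask = sizes <= LIMIT

popt, _ = curve_fit(
    power_law_curve, sizes[mask], f1[mask],
    p0=(0.9, 1.0, 0.5),
    sigma=sigma[mask], absolute_sigma=True,
    bounds=([0.0, 0.0, 0.0], [1.0, np.inf, np.inf]),
)

# Predicted versus "measured" gain from about sixteen times more data.
predicted_at_limit = power_law_curve(LIMIT, *popt)
predicted_full = power_law_curve(21546.0, *popt)
predicted_gain = predicted_full - predicted_at_limit
measured_gain = f1[-1] - f1[mask][-1]
print(f"predicted gain: {predicted_gain:.3f}, measured gain: {measured_gain:.3f}")
```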

Fig. 6: F1 score vs. amount of training data (solid lines), along with fitted curves (dotted lines) that are based on only the points with 1,323 training images or fewer.

Starting from 1,323 images, Table 1 shows the actual and predicted performance increases from doubling the training data and from increasing it by a factor of 16. In each case the prediction was correct to within a factor of two, and in most cases it was correct to within 25%. These figures refer to the estimated improvements in the F1 score, whereas the percent errors in the estimated F1 scores themselves are much lower. The measured and predicted results tend to gradually diverge as training data is increased. However, the F1 scores increase with more data, which constrains the growth of the percentage error. As a result, for this case study the method gives an estimate for a 16-fold increase in training data that’s about as good as its estimate for a mere 2-fold increase. For a point of comparison, fitting the data with a simple logarithmic curve, instead of the constant minus a power law, produces a maximum percent error that’s almost twice as large as seen here.

Table 1: Comparison of measured and estimated improvement with a 2-fold data increase and a 16-fold data increase. “Measure” is the actual increase in F1 score, “predict” is the increase predicted by the model, and “% Error” is the percentage error between them.
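To see why the percent error in the improvement is so much larger than the percent error in the F1 score itself, consider this small sketch. The numbers are invented, and the sign convention and choice of denominator (the measured value) are assumptions about how “% Error” is defined.

```python
def improvement_percent_error(f1_start, f1_measured, f1_predicted):
    """Percent error of the predicted F1 improvement relative to the
    measured improvement."""
    measured_gain = f1_measured - f1_start
    predicted_gain = f1_predicted - f1_start
    return 100.0 * (predicted_gain - measured_gain) / measured_gain


def score_percent_error(f1_measured, f1_predicted):
    """Percent error of the predicted F1 score itself."""
    return 100.0 * (f1_predicted - f1_measured) / f1_measured


# Hypothetical values: F1 of 0.60 before adding data, 0.70 measured
# afterward, and 0.72 predicted by the fitted curve.
gain_error = improvement_percent_error(0.60, 0.70, 0.72)
score_error = score_percent_error(0.70, 0.72)
print(f"error in improvement: {gain_error:.1f}%")
print(f"error in F1 score:    {score_error:.1f}%")
```

With these made-up numbers, a prediction that misses the F1 score by under 3% still misses the improvement by 20%, because the improvement is a small difference between two much larger numbers.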

Conclusions and Caveats

In this building footprint case study, we could predict the performance improvement from a 16-fold increase in training data to within a factor of 2. Although there are no one-size-fits-all solutions in data science, this approach can help inform your decision the next time you find yourself wondering whether to procure more training data.

A couple of caveats bear mentioning. Implicit in this extrapolation approach is an assumption that the training data we already have and the new training data we might acquire are drawn from the same distribution. This would not hold true if, for example, the new training data we sought to acquire was from a different city, or a different part of the same city, which differed in overall appearance from the source of our already-available data. Even if that assumption is met, there is no firm guarantee that the F1 score will have the same dependence on training data set size as seen in this case study. No one recipe can be a replacement for thoughtful judgement and deep understanding of what makes your data unique. However, it can be a constructive starting point for making the best decisions for your data science project.

Predicting the Effect of More Training Data, by Using Less was originally published in Towards Data Science on Medium.