Score as a function of the number of training samples

Now that we have the `--number-lines` command-line argument, it'd be nice to check systematically how much the score improves as more data is used for training. It would also be useful to find the upper bound on the amount of data we can use before GCP crashes with a memory error.
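One way to run such a check is a small sweep script, sketched below. It is only a sketch under assumptions: the training entry point (passed in as `script`) is hypothetical, and so is the convention that it prints the score as the last line of stdout. Only the `--number-lines` flag comes from the text above.

```python
import subprocess
import sys


def sample_sizes(start, stop, factor=2):
    """Return geometrically spaced sample counts: start, start*factor, ... <= stop."""
    if start < 1 or factor <= 1:
        raise ValueError("start must be >= 1 and factor must be > 1")
    sizes = []
    n = start
    while n <= stop:
        sizes.append(n)
        n *= factor
    return sizes


def build_command(script, number_lines):
    """Build the command line for one training run.

    `script` is a hypothetical path to the training entry point;
    only the --number-lines flag is taken from the document.
    """
    return [sys.executable, script, "--number-lines", str(number_lines)]


def run_sweep(script, sizes):
    """Run the training script once per size and collect (size, score) pairs.

    Stops at the first failing run. A non-zero exit code may indicate a
    memory error, but it can have other causes, so stderr should be checked.
    """
    results = []
    for n in sizes:
        proc = subprocess.run(
            build_command(script, n), capture_output=True, text=True
        )
        if proc.returncode != 0:
            results.append((n, None))  # None marks the run that failed
            break
        # Assumption: the score is printed as the last line of stdout.
        lines = proc.stdout.strip().splitlines()
        results.append((n, lines[-1] if lines else None))
    return results
```

Calling `run_sweep("train.py", sample_sizes(1000, 1_000_000))` (with `train.py` standing in for the real script) would double the sample count on each run. The last size that succeeds before a `None` entry gives a rough lower estimate of the memory limit, which could then be narrowed with a finer sweep between that size and the failing one.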