diff --git a/README.md b/README.md
index 84243a6..2837b3e 100644
--- a/README.md
+++ b/README.md
@@ -154,7 +154,7 @@ optional arguments:
 # Examples of samples
 
 The biggest challenge is to make the network converge to a good set of parameters. I've experimented with hyperparameters and here are the results I've managed to obtain for N-way MNIST using different models.
 
-Generally, in order for model to converge to a good set of parameters, one needs to go with a small learning rate (in order of 1e-4). I've also found that bigger kernel sizes work best for hidden layers.
+Generally, in order for the model to converge to a good set of parameters, one needs to go with a small learning rate (about 1e-4). I've also found that bigger kernel sizes in hidden layers work better.
 
 A very simple model, `python train.py --epochs 2 --color-levels 2 --hidden-fmaps 21 --lr 0.002 --max-norm 2` (all others are default values), trained for just 2 epochs, managed to produce these samples on a binary MNIST:
diff --git a/train.py b/train.py
index 60ae4b1..763856b 100644
--- a/train.py
+++ b/train.py
@@ -99,7 +99,7 @@ def main():
     parser.add_argument('--hidden-layers', type=int, default=6,
                         help='Number of layers of gated convolutions with mask of type "B"')
-    parser.add_argument('--learning-rate', '--lr', type=float, default=0.0002,
+    parser.add_argument('--learning-rate', '--lr', type=float, default=0.0001,
                         help='Learning rate of optimizer')
     parser.add_argument('--weight-decay', type=float, default=0.0001,
                         help='Weight decay rate of optimizer')
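
For context, a minimal, self-contained sketch of how the changed default behaves once the options are parsed. The `Adam` optimizer and the tiny placeholder model below are assumptions about how train.py consumes these values, not code taken from the repository:

```python
import argparse

from torch import nn, optim

# Rebuild only the two options touched in the diff above.
parser = argparse.ArgumentParser()
parser.add_argument('--learning-rate', '--lr', type=float, default=0.0001,
                    help='Learning rate of optimizer')
parser.add_argument('--weight-decay', type=float, default=0.0001,
                    help='Weight decay rate of optimizer')

# An empty argument list exercises the new default; passing `--lr 0.002` on
# the command line (as in the README example) would still override it.
cfg = parser.parse_args([])
assert cfg.learning_rate == 1e-4

# Placeholder model standing in for the real network (assumption).
model = nn.Linear(10, 10)
optimizer = optim.Adam(model.parameters(),
                       lr=cfg.learning_rate,
                       weight_decay=cfg.weight_decay)
print(optimizer.defaults['lr'])  # 0.0001
```

Lowering the default from 2e-4 to 1e-4 keeps train.py's out-of-the-box behavior in line with the README's guidance that convergence needs a learning rate of about 1e-4.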