An implementation of distributed stochastic gradient descent in Lua/Torch. Clients can be local or remote.
This code is written in Lua, and an installation of Torch is assumed. Training requires a few packages, which can easily be installed through LuaRocks (which comes with a Torch installation). Datasets are formatted and loaded using hdf5, which can be installed using this guide.
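If the guide is unavailable, the commands below are a rough sketch of one common way to install torch-hdf5 (package names are for Ubuntu, and the rockspec filename may differ in your checkout; consult the torch-hdf5 README for the authoritative steps):

```shell
# Install system HDF5 libraries (Ubuntu package names), then build torch-hdf5.
sudo apt-get install libhdf5-serial-dev hdf5-tools
git clone https://github.com/deepmind/torch-hdf5
cd torch-hdf5
luarocks make hdf5-0-0.rockspec
```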
Once Torch and torch-hdf5 are installed, use luarocks to install the other dependencies used in the example:
$ luarocks install nn
$ luarocks install rnn
If you want to train on an Nvidia GPU using CUDA, you'll need to install the CUDA Toolkit as well as the cutorch and cunn packages:
$ luarocks install cutorch
$ luarocks install cunn
We need to ensure that our local version of parallel is installed. This can be done with a short bash script from the lua-lua folder:
$ cd lua-lua
$ bash install_parallel.sh
.
├── data                            # Folder holding data used for demo
├── parallel                        # Folder containing the changes we added to the parallel class
├── End-To-End-Generative-Dialogue  # Folder of our other repo containing the code used in demo
├── README.md                       # lua-lua usage
├── server.lua                      # Main server file
├── startup.sh                      # Startup script for remote gcloud servers
├── setup_image.sh                  # Script that copies startup.sh to remote server and calls startup.sh
├── install_parallel.sh             # Script that installs our version of parallel
└── demo_server.lua                 # A demo class that implements the server
Code is run from the lua-lua folder:
$ cd lua-lua
To run a worker with 2 parallel clients on your own machine:
$ th server.lua -n_proc 2
In order to get the demo to connect through localhost rather than simply forking, we must first set up an ssh key for this project.
Note: This does essentially the same thing as running locally, except we now connect to the clients through localhost. This is a good way to debug problems with clients running on remote servers.
Replace USERNAME with your username on the computer you want to connect to:
$ USERNAME=michaelfarrell
$ ssh-keygen -t rsa -f ~/.ssh/dist-sgd-sshkey -C $USERNAME
Press enter twice (for an empty passphrase) and a key will be generated.
In order to connect to clients through localhost, we must add the key to our list of authorized_keys:
$ cat ~/.ssh/dist-sgd-sshkey.pub >> ~/.ssh/authorized_keys
$ chmod og-wx ~/.ssh/authorized_keys
In order to connect through localhost, you must also enable incoming ssh connections on your machine.
On a Mac, this can be done by going to:
System Preferences > Sharing
and checking the 'Remote Login' box
You can now communicate over localhost using the command:
$ EXTENSION=Desktop/GoogleDrive/FinalProject/Distributed-SGD/lua-lua/
$ TORCH_PATH=/Users/michaelfarrell/torch/install/bin/th
$ th server.lua -n_proc 4 -localhost -extension $EXTENSION -torch_path $TORCH_PATH
where $EXTENSION is the relative path to the lua-lua folder from your home directory and $TORCH_PATH is the absolute path to the torch executable on your computer.
Instead of having the client programs running on your own computer, you can farm them out to any number of remote computers. Below is a description of how to set up remote clients using Google Cloud (gcloud offers a 60-day free trial with $300 of credit).
We have to allow our gcloud servers to accept incoming ssh connections from our computer.
If you have not yet done so, generate an ssh key as described above.
Once you have created the key print it out:
$ cat ~/.ssh/dist-sgd-sshkey.pub
Next you must add the key to the set of public keys:
- Login to your google compute account.
- Go to compute engine dashboard
- Go to the metadata tab
- Go to ssh-key subtab
- Click edit
- Add the key you copied as a new line
Restrict external access to the key:
$ chmod 400 ~/.ssh/dist-sgd-sshkey
We only have to set up and install everything once, after which we can clone that client.
- Click on the 'VM Instances' tab
- Create Instance
- Give the instance a name, e.g. 'demo-baseline'
- Set the zone to us-central1-b
- Choose an 8 vCPU highmem machine type
- Under boot disk click change
- Choose Ubuntu 14.04 LTS
- At the bottom change size to 30 GB and click 'select'
- Allow HTTP traffic
- Allow HTTPS traffic
- Click 'Management, disk, networking, SSH keys' to drop down more options
- Under 'Disk', uncheck 'Delete boot disk when instance is deleted'
- Click 'Create' and you should see your new instance listed in the table
- Wait for the VM instance to startup (indicated by a green check next to the instance)
- Under the 'network' column, click 'default'
- Go to 'Firewall rules' and Add a new rule
- Set name to be 'all'
- Set source filter to allow from any source
- Under allowed protocols, put 'tcp:0-65535; udp:0-65535; icmp'
- Create
- Return to the 'VM instances' tab
- Grab the external IP address for the instance
$ EXTERNAL_IP=104.154.48.250
$ USERNAME=michaelfarrell
- Next you must modify the 'startup.sh' script to also include any additional installs that you may need on the server. This script is run from the home directory of the remote client. To run the demo, you do not need to modify this script.
- Next you must modify the 'setup_image.sh' script so that it correctly calls your startup.sh script on the remote server. If you did not change the 'startup.sh' script, you should not need to change this script either.
- Setup the image:
$ source setup_image.sh
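If you do need to extend 'startup.sh', the additions might look like the sketch below (the package names are purely illustrative; the demo runs without any changes, and 'startup.sh' executes from the remote client's home directory):

```shell
# Illustrative additions to startup.sh; replace with whatever your model needs.
sudo apt-get update
sudo apt-get install -y libhdf5-serial-dev   # example extra system dependency
luarocks install rnn                         # example extra Lua rock
```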
Note that you can now connect to the server:
$ ssh -o "StrictHostKeyChecking no" -i ~/.ssh/dist-sgd-sshkey $USERNAME@$EXTERNAL_IP
- Once the server is setup to your liking, disconnect from the server and return to your google cloud dashboard
- Go to the 'VM Dashboard'
- Click on the instance you just setup, and delete it. This should remove the instance and save it as a disk. If you click on the 'disks' tab, you should see the instance name you just deleted.
- Click on the 'Images' tab
- 'Create Image'
- Give it a name, e.g. 'demo-image'
- Under Source-Disk, choose the disk that you just created
- Create
- Click on the 'Instance templates' tab
- Create new
- Name the template, e.g. 'demo-template'
- Under 'Boot Disk' click change
- At the top click 'Your image'
- Choose the image you just created, e.g. 'demo-image'
- Set size to 30 GB
- Select
- Allow HTTP traffic
- Allow HTTPS traffic
- Under more->Disks, uncheck 'Delete boot disk when instance is deleted'
- Create
- Go to the "Instance groups" tab
- Create instance group
- Give the group a name, e.g. 'demo-group'
- Give a description
- Set zone to us-central1-b
- Use instance template
- Choose the template you just made, e.g. 'demo-template'
- Set the number of instances
- Create
- Wait for the instances to launch
- Once there is a green checkmark, click on the new instance
Add your list of client servers to the file 'client_list.txt', where each line is the external IP address of one of the instances in the instance group you are currently using. Copy this file to the computer that you are going to use as the main parameter server. Choose an IP from the freshly updated 'client_list.txt' and set the $SERVER_IP environment variable:
$ SERVER_IP=130.211.160.115
Copy over 'client_list.txt' to the main server:
$ scp -o "StrictHostKeyChecking no" -i ~/.ssh/dist-sgd-sshkey ../client_list.txt $USERNAME@$SERVER_IP:~/Distributed-SGD
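For reference, 'client_list.txt' simply contains one external IP address per line, for example (these are placeholders; use the IPs from your own instance group):

104.154.48.250
130.211.160.115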
You can connect to one of the servers by running:
$ ssh -o "StrictHostKeyChecking no" -i ~/.ssh/dist-sgd-sshkey $USERNAME@$SERVER_IP
Note: the flag -o "StrictHostKeyChecking no" automatically adds the host to your list of known hosts without prompting for confirmation.
If you get an error like this:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
then you'll want to run
$ vim ~/.ssh/known_hosts
and delete the last few lines that were added. They should look like an IP address followed by something that starts with AAAA. You can delete the current line in vim by typing 'dd'. This can happen when you restart the servers and they change IP addresses, among other things.
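Alternatively, assuming a standard OpenSSH installation, ssh-keygen can remove the stale entry for you (using the $SERVER_IP variable set above):

```shell
# Remove any known_hosts entries recorded for the server whose key changed.
ssh-keygen -R "$SERVER_IP"
```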
If the servers have been initialized, first connect to the computer above that you chose to be the main server:
$ ssh -o "StrictHostKeyChecking no" -i ~/.ssh/dist-sgd-sshkey $USERNAME@$SERVER_IP
Once connected, you need to again set up an ssh key from the computer that you are using as the client.
Once this is done, you can run the server with remote gcloud clients using the command:
$ cd Distributed-SGD/lua-lua
$ EXTENSION=Distributed-SGD/lua-lua/
$ TORCH_PATH=/home/michaelfarrell/torch/install/bin/th
$ th server.lua -n_proc 4 -remote -extension $EXTENSION -torch_path $TORCH_PATH
If you wish to extend this demo to work with your own SGD model, simply create a new server class specific to your task, replacing the 'demo_server' class. Use the file 'demo_server.lua' as an example. The server only needs __init(opt) and run() functions defined in order to work. Once this class is properly defined (e.g. named 'new_server'), run the following to initiate your task:
$ NEW_SERVER_NAME=new_server
$ th server.lua -server_class $NEW_SERVER_NAME # Plus Additional arguments
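A minimal sketch of such a class in plain Lua is shown below. Only the __init(opt)/run() interface comes from the text; the field names inside are illustrative, and 'demo_server.lua' remains the authoritative example:

```lua
-- new_server.lua: skeleton for a custom server class.
-- Only __init(opt) and run() are required; the rest is illustrative.
local new_server = {}
new_server.__index = new_server

function new_server.__init(opt)
  local self = setmetatable({}, new_server)
  self.opt = opt            -- options parsed by server.lua
  self.n_proc = opt.n_proc  -- number of client processes (illustrative field)
  return self
end

function new_server:run()
  -- In a real server: fork/connect clients via parallel, then loop:
  -- broadcast parameters, gather gradients, apply the update, repeat.
  print('running with ' .. self.n_proc .. ' clients')
end
-- (as a module, you would end the file with: return new_server)
```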
When developing, all command-line arguments should be added in the file server.lua. Look at the arguments that already exist by running
$ th server.lua --help
and use those names when developing your model. If you need an additional command-line argument, add it in server.lua. Other than this, there should be no reason to edit the server.lua file.
If you are having your clients run remotely, you may also need to modify 'startup.sh' and 'setup_image.sh' so that they set up the server environments according to the specifications that you need.
- Document data folder and include description in demo-usage about what the demo is
- Add in documentation of how the data needs to be formatted in order to run the demo
- Finish description
- Finish Acknowledgements
- Add in proto implementation
- Add in git pull at startup
- Add a way to detect client failure and reset
- maybe add paxos if kevin is successful
- try adding protobufs
- get results
- Add in additional catches for errors, like add to path
This example is also a part of another one of our repos: https://github.com/michaelfarrell76/End-To-End-Generative-Dialogue
Our implementation utilizes code from the following: