Cluster Training Using Fabric
Prepare a Linux cluster
Run kubectl create -f ssh_servers.yaml in the directory paddle/scripts/cluster_train_v2/fabric/docker_cluster to launch a demo cluster. Then run kubectl get po -o wide to get the IP addresses of the nodes.
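If the node addresses need to be collected programmatically, one option is to ask kubectl for JSON instead of the wide table. The sketch below assumes the output of kubectl get po -o json is passed in as a string and reads the podIP field from each pod's status. The function name pod_ips and the pod names in the sample data are hypothetical and are not part of the PaddlePaddle scripts.

```python
import json


def pod_ips(pod_list_json):
    """Map pod name to pod IP, given the text printed by `kubectl get po -o json`."""
    pods = json.loads(pod_list_json)
    ips = {}
    for item in pods.get("items", []):
        name = item["metadata"]["name"]
        ip = item.get("status", {}).get("podIP")
        if ip:  # pods that are not scheduled yet have no IP
            ips[name] = ip
    return ips


# Hand-written sample data for illustration (not real cluster output).
sample = json.dumps({
    "items": [
        {"metadata": {"name": "ssh-server-0"}, "status": {"podIP": "10.0.0.11"}},
        {"metadata": {"name": "ssh-server-1"}, "status": {"podIP": "10.0.0.12"}},
        {"metadata": {"name": "ssh-server-2"}, "status": {}},
    ]
})
print(pod_ips(sample))
```

The resulting addresses are what would be listed as the cluster nodes when configuring the job.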
Launching Cluster Job
paddle.py provides scripts that automatically start all PaddlePaddle cluster processes on the different nodes. By default, every command line option can be passed as a paddle.py command option, and paddle.py transparently forwards these options to the lower-level PaddlePaddle processes.
paddle.py provides two dedicated command options for easy job launching:
job_dispatch_package: set it to a local workspace directory, which will be dispatched to all nodes listed in conf.py. This is helpful when workspace files change frequently, because redeploying the workspace to multiple nodes by hand is tedious.
job_workspace: set it to an already deployed workspace directory. paddle.py will skip the dispatch stage and launch the cluster job on all nodes directly, which avoids the dispatch latency.
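The difference between the two options can be summarized as a small decision function. This is only an illustrative sketch, not the code in paddle.py: the function plan_stages and the stage names are hypothetical, and the rule that exactly one of the two options must be set is an assumption made for the example.

```python
def plan_stages(job_dispatch_package=None, job_workspace=None):
    """Return the stages a launcher would run for the given options.

    Assumption for this sketch: exactly one of the two options is set.
    """
    if bool(job_dispatch_package) == bool(job_workspace):
        raise ValueError("set exactly one of job_dispatch_package or job_workspace")
    if job_dispatch_package:
        # Local workspace: copy it to every node first, then start the job.
        return ["dispatch", "launch"]
    # Workspace already deployed on the nodes: start the job right away.
    return ["launch"]


print(plan_stages(job_dispatch_package="./my_workspace"))
print(plan_stages(job_workspace="/home/paddle/my_workspace"))
```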
cluster_train/run.sh provides a sample command line for running the demo/recommendation cluster job. Modify job_workspace to point to your own directory, then run the script.
The cluster job will start within several seconds.
Kill Cluster Job
paddle.py captures the SIGINT signal sent by Ctrl + C and automatically kills all processes it launched, so stopping paddle.py is enough to kill the cluster job. If the program crashed, you have to kill the job manually.
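The general pattern of killing launched processes on Ctrl + C can be sketched with Python's signal and subprocess modules. This is a minimal local illustration under stated assumptions, not how paddle.py is implemented: the class ProcessGroup is hypothetical, and it only manages processes on the local machine, whereas a cluster job also has processes on remote nodes.

```python
import signal
import subprocess
import sys


class ProcessGroup:
    """Track child processes and stop them all on request (hypothetical helper)."""

    def __init__(self):
        self.children = []

    def launch(self, argv):
        proc = subprocess.Popen(argv)
        self.children.append(proc)
        return proc

    def kill_all(self, timeout=5.0):
        # Ask every running child to terminate, then wait for each one.
        for proc in self.children:
            if proc.poll() is None:
                proc.terminate()
        for proc in self.children:
            try:
                proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if the child ignores the request
                proc.wait()

    def install_sigint_handler(self):
        def handler(signum, frame):
            self.kill_all()
            sys.exit(1)

        signal.signal(signal.SIGINT, handler)


# Demo: start a sleeping child process, then stop it explicitly.
group = ProcessGroup()
child = group.launch([sys.executable, "-c", "import time; time.sleep(60)"])
group.kill_all()
print("child exited:", child.poll() is not None)
```

Calling install_sigint_handler() before launching the children would make Ctrl + C trigger the same cleanup. A crash of the launcher itself bypasses the handler, which is why a manual kill is needed in that case.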
Check Cluster Training Result
Check the logs in $workspace/log for details. Every node has the same log structure, which contains four kinds of logs:
The trainer's internal log provides almost all internal training output, the same as in local training. Check runtime model convergence here.
The parameter server's running log helps to diagnose distributed errors.
The stderr and stdout of the parameter server process. Check this error log if training crashes.
The stderr and stdout of the trainer process. Check this error log if training crashes.
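When training crashes, a quick first step is to scan every file in the log directory for suspicious lines. The sketch below is one way to do that. The function find_error_lines, the marker strings, and the file names in the demo are all hypothetical; the real log files may use different wording for errors.

```python
import os
import tempfile

# Hypothetical markers; adjust them to what the real logs contain.
ERROR_MARKERS = ("error", "fatal", "traceback")


def find_error_lines(log_dir, markers=ERROR_MARKERS):
    """Return (file name, line number, line) for each line containing a marker."""
    hits = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(marker in lowered for marker in markers):
                    hits.append((name, line_no, line.rstrip("\n")))
    return hits


# Demo on a temporary directory with made-up log content.
with tempfile.TemporaryDirectory() as log_dir:
    with open(os.path.join(log_dir, "a.log"), "w") as handle:
        handle.write("pass 0 done\nFatal: connection lost\n")
    with open(os.path.join(log_dir, "b.log"), "w") as handle:
        handle.write("all good\n")
    demo_hits = find_error_lines(log_dir)
print(demo_hits)
```

On a real node the function would be pointed at the log directory inside the workspace, and the same check can be repeated on each node since they share the same log structure.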
Check Model Output
After a pass finishes, model files are written to the output directory on node 0.
The nodefile in the workspace indicates the node ID within the current cluster job.
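To decide on a given machine whether it is the node that holds the model output, the node ID can be read from nodefile. The sketch below assumes that nodefile contains nothing but the node ID as a decimal integer; that file format is an assumption, and the helper names read_node_id and holds_model_output are hypothetical.

```python
import os
import tempfile


def read_node_id(workspace):
    """Read the node ID from <workspace>/nodefile.

    Assumption: the file contains a single integer and nothing else.
    """
    with open(os.path.join(workspace, "nodefile")) as handle:
        return int(handle.read().strip())


def holds_model_output(workspace):
    # Model files are written on node 0.
    return read_node_id(workspace) == 0


# Demo with a temporary workspace and a made-up nodefile.
with tempfile.TemporaryDirectory() as workspace:
    with open(os.path.join(workspace, "nodefile"), "w") as handle:
        handle.write("0\n")
    demo_node_id = read_node_id(workspace)
    demo_has_output = holds_model_output(workspace)
print(demo_node_id, demo_has_output)
```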