Performance for Distributed vgg16

Test Results

Hardware Information

  • CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  • CPU MHz: 2101.000
  • Cache size: 20480 KB

Blas settings

Set the environment variable MKL_NUM_THREADS=1 so MKL uses a single thread per process.
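Applied before launching each trainer process, the setting looks like this (a minimal sketch):

```shell
# Pin MKL to a single thread per process so the benchmark numbers
# reflect one core per trainer.
export MKL_NUM_THREADS=1
echo "MKL_NUM_THREADS=$MKL_NUM_THREADS"
```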

Single Node Single Thread

  • Metrics: samples / sec
| Batch Size         | 32    | 64    | 128   | 256   |
| ------------------ | ----- | ----- | ----- | ----- |
| PaddlePaddle Fluid | 15.44 | 16.32 | 16.74 | 16.79 |
| PaddlePaddle v2    | 15.97 | 17.04 | 17.60 | 17.83 |
| TensorFlow         | 9.09  | 9.10  | 9.24  | 8.66  |
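The samples/sec metric can be computed from the batch size, the number of timed iterations, and the wall-clock time; a minimal sketch (the timing numbers below are synthetic, chosen only to land near the single-node Fluid figure above, not taken from the benchmark):

```python
def throughput(batch_size, num_iters, elapsed_sec):
    """Samples processed per second over a timed run."""
    return batch_size * num_iters / elapsed_sec

# Synthetic example: 100 iterations of batch 128 in 765 seconds
# is roughly 16.73 samples/sec.
print(round(throughput(128, 100, 765.0), 2))
```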

Different Batch Size

  • PServer Count: 10
  • Trainer Count: 20
  • Metrics: samples / sec
| Batch Size         | 32     | 64     | 128    | 256    |
| ------------------ | ------ | ------ | ------ | ------ |
| PaddlePaddle Fluid | 190.20 | 222.15 | 247.40 | 258.18 |
| PaddlePaddle v2    | 170.96 | 233.71 | 256.14 | 329.23 |
| TensorFlow         | -      | -      | -      | -      |

Acceleration Rate

  • PServer Count: 20
  • Batch Size: 128
  • Metrics: samples / sec
| Trainer Count                     | 20              | 40              | 80              | 100              |
| --------------------------------- | --------------- | --------------- | --------------- | ---------------- |
| PaddlePaddle Fluid                | 263.29 (78.64%) | 518.80 (77.47%) | 836.26 (62.44%) | 1019.29 (60.89%) |
| PaddlePaddle v2 (need more tests) | 326.85 (92.85%) | 534.58 (75.93%) | 853.30 (60.60%) | 1041.99 (59.20%) |
| TensorFlow                        | -               | -               | -               | -                |
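The percentages in parentheses match the ratio of distributed throughput to ideal linear scaling of the single-node single-thread throughput at batch size 128; a sketch of that calculation (this interpretation is inferred from the numbers rather than stated explicitly):

```python
def scaling_efficiency(dist_throughput, trainer_count, single_throughput):
    """Fraction of ideal linear speedup achieved, as a percentage."""
    return 100.0 * dist_throughput / (trainer_count * single_throughput)

# Fluid with 20 trainers at batch size 128: 263.29 samples/sec
# distributed vs. 16.74 samples/sec on a single node/thread.
print(round(scaling_efficiency(263.29, 20, 16.74), 2))  # 78.64, as in the table
```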

Different PServer Count

  • Trainer Count: 60
  • Batch Size: 128
  • Metrics: samples / sec
| PServer Count                            | 3     | 6     | 10    | 20    |
| ---------------------------------------- | ----- | ----- | ----- | ----- |
| PaddlePaddle Fluid (should fix in next PR) | 589.1 | 592.6 | 656.4 | 655.8 |
| PaddlePaddle v2 (need more tests)        | 593.4 | 791.3 | 729.7 | 821.7 |
| TensorFlow                               | -     | -     | -     | -     |

The performance gap between Fluid and v2 comes from network interference.

Steps to Run the Performance Test

  1. Re-compile PaddlePaddle with -DWITH_DISTRIBUTE enabled to build it with distributed support.
  2. When the build finishes, copy the output whl package located under build/python/dist to the current directory.
  3. Run docker build -t [image:tag] . to build the Docker image, then run docker push [image:tag] to push the image to a repository so Kubernetes can find it.
  4. Run kubectl create -f pserver.yaml && kubectl create -f trainer.yaml to start the job on your Kubernetes cluster (you must configure the kubectl client before this step).
  5. Run kubectl get po to list the running pods, and run kubectl logs [podID] to fetch the logs of the pserver and trainer pods.

Check the logs for the distributed training progress and analyze the performance.
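One way to analyze throughput from the fetched trainer logs is a small script; the log line format below is hypothetical, so adjust the regex to whatever your build actually prints:

```python
import re

# Hypothetical log line format; adapt the pattern to your actual logs.
LINE_RE = re.compile(r"speed:\s*([\d.]+)\s*samples/sec")

def average_speed(log_text):
    """Average the samples/sec figures found in a trainer log."""
    speeds = [float(m.group(1)) for m in LINE_RE.finditer(log_text)]
    return sum(speeds) / len(speeds) if speeds else 0.0

sample_log = """\
Pass 0, batch 10, speed: 16.70 samples/sec
Pass 0, batch 20, speed: 16.78 samples/sec
"""
print(round(average_speed(sample_log), 2))  # 16.74
```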

Enable Verbose Logs

Edit pserver.yaml and trainer.yaml and add the environment variables GLOG_v=3 and GLOG_logtostderr=1 to see what happens in detail.
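In the container spec of each YAML file, that addition looks something like the fragment below (the surrounding pod and container fields are omitted):

```yaml
# Fragment of pserver.yaml / trainer.yaml: append these two variables
# to the container's existing env list.
env:
  - name: GLOG_v
    value: "3"
  - name: GLOG_logtostderr
    value: "1"
```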