Design doc of high performance PS implementation #1620


Merged: 13 commits into sql-machine-learning:develop on Jan 15, 2020

Conversation

@QiJune (Collaborator) commented Jan 7, 2020:

Opening it here for a better review.

@QiJune changed the title from "[WIP]Design doc of high performance PS implementation" to "Design doc of high performance PS implementation" on Jan 8, 2020

## Motivation

This design doc focus on implementing a high performance parameter server(short for PS). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md)
Collaborator:

focus -> focuses

QiJune (Author):

Done


PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters to workers. Receiving gradients and sending parameters bring IO workload to PS, and applying gradients to parameters brings CPU workload to PS. Since one PS could receive gradients from many workers, both IO workload and CPU workload would be very heavy.

The current PS is implemented with Python. Because of [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) of Python, gradients are applied to parameters sequentially with only one CPU core. As a result, the receiving gradients service is also blocked, and waiting for current gradients to be consumed. To resolve this bottleneck, we have to fully use multi CPU cores of PS.
Collaborator:

As a result, the receiving gradients service is also blocked -> The receiving gradients service is not blocked, but it can only run when the thread applying gradients is preempted. It cannot run in parallel.

QiJune (Author):

Currently, receiving gradients and applying gradients run in the same thread. If applying gradients has not finished, the thread cannot serve receiving gradients.

@wangkuiyi (Collaborator) left a comment:

This design doc looks very good. I have gone through it; sending the first batch of a few comments for your reference. I will complete the review soon.


Collaborator:

Need a space before (.

QiJune (Author):

Done


Collaborator:

(short for PS) => (PS)

QiJune (Author):

Done


Collaborator:

PS => The PS

QiJune (Author):

Done


Collaborator:

bring IO workload to PS => are primary I/O workloads of the PS

QiJune (Author):

Done


Collaborator:

brings CPU workload to PS => parameter update costs CPU resource

QiJune (Author):

Done


Collaborator:

many workers => more than one worker

QiJune (Author):

Done


Collaborator:

would be very heavy => could be heavy

QiJune (Author):

Done


Collaborator:

The current PS is implemented with Python => The current PS is in Python

QiJune (Author):

Done


Collaborator:

Because of GIL => Due to the existence of GIL

QiJune (Author):

Done


Collaborator:

We want to remove this bottleneck and make full utilization of multiple CPU cores.

QiJune (Author):

Done


## Computation

The gradients and parameters on PS are represented by tensors. And applying gradients to parameters, which is also called optimization, is acutally a math operation of tensors.
Collaborator:

acutally -> actually

QiJune (Author):

Done
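
To make "applying gradients is a tensor operation" concrete, here is a minimal Go sketch of a plain SGD step over a dense tensor stored as a flat slice. It is illustration only, not the proposed implementation; the doc concludes below that such kernels should live in C++.

```go
package main

import "fmt"

// sgdStep applies one SGD update in place: param[i] -= lr * grad[i].
// A dense tensor is represented here as a flat float32 slice.
func sgdStep(param, grad []float32, lr float32) {
	for i := range param {
		param[i] -= lr * grad[i]
	}
}

func main() {
	param := []float32{1.0, 2.0, 3.0}
	grad := []float32{0.5, 0.5, 0.5}
	sgdStep(param, grad, 0.1)
	fmt.Println(param) // [0.95 1.95 2.95]
}
```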


There are different kinds of optimizers, and each needs some tensor operations. There are many mature math libraries developed in C++. For example, [Eigen](https://gitlab.com/libeigen/eigen) is used in TensorFlow and Paddle, and [ATen](https://github.com/pytorch/pytorch/tree/master/aten) is used in PyTorch. These math libraries provide abundant tensor operators and support both CPU and GPU. Besides, they can call state-of-the-art BLAS libraries internally, such as MKL and cuBLAS. With these math libraries, the operators in optimizers can be implemented easily and efficiently.

It seems that there are few math libraries in Go. [Gosl](https://github.com/cpmech/gosl) is no longer active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecosystem of Go is far from competing with C++'s. And we also have some faint worry with the performance of math libraries in Go.
Collaborator:

faint worry with -> worry about

QiJune (Author):

Done


## Scheduling

In C++, we use thread-based scheduling. Threads are scheduled by the operating system. Usually, we implement one thread pool for computation and another for IO. The parameter optimzation will be processed by the computation thread pool in parallel. Further, to reduce the overhead of context switching, we can bind a thread to a certain CPU core by setting its CPU affinity, which increases the cache hit rate of that core.
Collaborator:

optimzation -> optimization

QiJune (Author):

Done


In Go, there is no concept of a thread; we use goroutines instead. Goroutines are scheduled by the Go runtime and are not preemptive. There are four classes of events that occur in Go programs that allow the scheduler to make scheduling decisions. This does not mean it will always happen on one of these events. It means the scheduler gets the opportunity.
Collaborator:

happen on -> happen in

QiJune (Author):

Done
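
As a rough Go counterpart of the C++ thread-pool setup above, a fixed set of goroutines (one per CPU core) can drain a channel of gradient tasks. This is a sketch under assumed names (`gradTask` is made up for illustration), not the actual elasticdl code.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// gradTask is a hypothetical unit of work: one set of gradients to apply.
type gradTask struct{ id int }

func main() {
	tasks := make(chan gradTask, 64)
	var wg sync.WaitGroup

	// One computation goroutine per CPU core, mimicking a thread pool.
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range tasks {
				// The gradient would be applied here, e.g., via the C++ kernel.
				fmt.Println("applied gradient", t.id)
			}
		}()
	}

	for i := 0; i < 8; i++ {
		tasks <- gradTask{id: i}
	}
	close(tasks)
	wg.Wait()
}
```

Note that Go offers no portable equivalent of CPU affinity; runtime.LockOSThread only pins a goroutine to an OS thread, not to a core.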


## Conclusion

Considering the tradeoff between development efficiency and program peformance, we plan to put the communication and scheduling parts in Go, and the computation part in C++.
Collaborator:

peformance -> performance

QiJune (Author):

Done



[Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code. And the overhead of cgo is slight. The optimization operators will be implemented in C++, wrappered with C interface, and exposed to Go.
Collaborator:

wrappered -> wrapped

QiJune (Author):

Done
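
A minimal sketch of what the wrapping could look like. `sgd_update` here is a hypothetical C function standing in for the C-wrapped C++ optimizer kernel (its body is plain C so the sketch stays self-contained); it is not the actual elasticdl interface.

```go
package main

/*
// Hypothetical C wrapper around an optimizer kernel.
void sgd_update(float* param, const float* grad, int n, float lr) {
    for (int i = 0; i < n; i++) {
        param[i] -= lr * grad[i];
    }
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	param := []float32{1, 2, 3, 4}
	grad := []float32{0.1, 0.1, 0.1, 0.1}
	// Hand the slice data across the cgo boundary. The C side does not
	// retain the pointers, which keeps this legal under cgo pointer rules.
	C.sgd_update(
		(*C.float)(unsafe.Pointer(&param[0])),
		(*C.float)(unsafe.Pointer(&grad[0])),
		C.int(len(param)),
		C.float(0.5),
	)
	fmt.Println(param) // [0.95 1.95 2.95 3.95]
}
```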



The receiving gradients and sending parameters service are implemented in Go. Once receving gradients from a worker, a goroutine will be launched to do optimization.
Collaborator:

receving -> receiving

QiJune (Author):

Done
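
A hypothetical sketch of that flow. The `psServer` and `PushGradient` names and the in-Go update loop are assumptions for illustration; in the design, the update would be a cgo call into the C++ kernel, and the handler would sit behind the RPC service.

```go
package main

import (
	"fmt"
	"sync"
)

// psServer holds the parameters, guarded by a mutex.
type psServer struct {
	mu     sync.Mutex
	params []float32
	wg     sync.WaitGroup
}

// PushGradient mimics the receiving-gradients handler: it launches a
// goroutine to run the optimization, so receiving returns immediately
// instead of blocking on computation.
func (s *psServer) PushGradient(grads []float32, lr float32) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		s.mu.Lock()
		defer s.mu.Unlock()
		for i := range s.params {
			s.params[i] -= lr * grads[i]
		}
	}()
}

func main() {
	s := &psServer{params: []float32{1, 2, 3}}
	s.PushGradient([]float32{1, 1, 1}, 0.1)
	s.wg.Wait()
	fmt.Println(s.params) // [0.9 1.9 2.9]
}
```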


[Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code. And the overhead of cgo is slight. The optimization operators will be implemented in C++, wrappered with C interface, and exposed to Go.
@xiaogaozi commented Jan 10, 2020:

Just FYI, the overhead of cgo cannot be ignored in some cases. There are many articles complaining about this problem:

QiJune (Author):

@xiaogaozi Many thanks for your kind reminder.

There are mainly two kinds of overhead:

  • Call overhead. I made a benchmark of a 1000-d dense tensor with Adam optimization; cgo takes about 8% more time than native C++. Please refer to https://github.com/QiJune/learning-notes/tree/master/test_codes/cgo.

  • Scheduling overhead. In the PS, cgo is used to call the parameter optimization function in C++. We will not launch hundreds of goroutines; their number will be limited to the number of CPU cores.

So, let's move forward first. I believe we can benefit from the speed-up of C++ while minimizing the overhead.
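
A sketch of that bound on scheduling overhead, assuming a buffered channel used as a counting semaphore: at most runtime.NumCPU() optimization goroutines are in flight at once.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// Semaphore sized to the CPU core count: acquiring a slot before
	// spawning caps the number of concurrent optimization goroutines.
	sem := make(chan struct{}, runtime.NumCPU())
	var wg sync.WaitGroup

	for i := 0; i < 100; i++ {
		i := i
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			// Placeholder for the cgo call into the C++ optimizer.
			fmt.Println("optimizing gradients batch", i)
		}()
	}
	wg.Wait()
}
```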

Comment:

Very much looking forward to the Go implementation of the PS 🖖



The PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters to workers. Receiving gradients and sending parameters are primary I/O workloads of the PS, and parameters updating cost CPU resource. Since one PS could receive gradients from more than one worker, both I/O workload and CPU workload could be heavy.
Collaborator:

parameters updating cost => updating parameters costs CPU resource

QiJune (Author):

Done



In Go, there is no concept of thread, we use goroutine instead. Goroutines are scheduled by Go runtime. Goroutine is not preemptive. There are four classes of events that occur in Go programs that allow the scheduler to make scheduling decisions. This does not mean it will always happen in one of these events. It means the scheduler gets the opportunity.
Collaborator:

that occur in Go programs that allow => that occur in Go programs and allow

QiJune (Author):

Done


Collaborator:

This does not mean it will always happen in one of these events. => It doesn't mean that the scheduling will always happen when one of the events occurs.

QiJune (Author):

I think the original expression is more concise.



The receiving gradients and sending parameters service are implemented in Go. Once receiving gradients from a worker, a goroutine will be launched to do optimization.
Collaborator:

The receiving gradients and sending parameters service => The gradients receiving and parameters sending services? Or: The services of receiving gradients and sending parameters

QiJune (Author):

Done

@QiJune merged commit 83fd6df into sql-machine-learning:develop on Jan 15, 2020
@QiJune deleted the hpc_ps branch on January 15, 2020 12:24