Distribution as Linear Algebra

We adopt the approach of viewing the distribution of tensor computations as linear algebra operations.

This allows ParametricOperators.jl to offer a high-level API for controlled parallelism within your tensor programs, particularly in machine learning settings.

Kronecker Distribution

Distributed Fourier Transform

Let's consider the example of the Fourier transform as seen in the Fourier Transform Example.

using ParametricOperators

# Define the element type and the size of our global tensor
T = Float32
gx, gy, gz = 10, 20, 30

Fx = ParDFT(T, gx)
Fy = ParDFT(Complex{T}, gy)
Fz = ParDFT(Complex{T}, gz)

F = Fz ⊗ Fy ⊗ Fx
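
For reference, before any distribution, the composed operator can be applied to a tensor of the full global size on a single process. The following is a minimal sketch, assuming the serial usage mirrors the Fourier Transform Example referenced above (x_global is a name introduced here purely for illustration):

x_global = rand(T, gx, gy, gz)   # full tensor living on a single process, illustration only
y = F * vec(x_global)            # apply the separable DFT to the flattened tensor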

Assume that our data is partitioned across multiple machines according to the following scheme:

partition = [1, 1, 2]

Each element of partition denotes the number of processing elements that divide our input tensor along that dimension.

For example, given the above partition and global sizes, our local tensor would be of size:

x = rand(T, 10, 20, 15)

Or, in other terms:

localx, localy, localz = [gx, gy, gz] .÷ partition
x = rand(T, localx, localy, localz)
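
Presumably the number of launched processing elements equals prod(partition) (2 here). The following sanity check is a sketch under the assumption that, as in setups like Grady et al. (2022), the distribution is backed by MPI.jl; it is not part of the ParametricOperators.jl API:

using MPI
MPI.Init()
# the total number of MPI ranks should equal the product of the partition
@assert MPI.Comm_size(MPI.COMM_WORLD) == prod(partition)

Under that assumption, such a program would be launched with, for example, mpiexecjl -n 2 julia script.jl so that the rank count matches prod(partition).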

Now, following the method used in several recent works (Grady et al., 2022) and in traditional distributed FFTs, we can distribute the application of our linearly separable transform across multiple processing elements by simply doing:

F = distribute(F, partition)

Now, to apply the Fourier Transform to our tensor, one can do:

F * vec(x)

Another out-of-the-box example can be seen at Distributed FFT of a 3D Tensor.

Distributed Convolution

Definition of Convolution

Convolution here refers to the application of a linear transform along the channel dimension.
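
To make this definition concrete, here is a small, self-contained, plain-Julia sketch (ours, independent of the package) showing that mixing channels with a dense matrix W at every spatial point is exactly a Kronecker product of W with an identity acting on the flattened tensor:

using LinearAlgebra

nc, np = 3, 4                                   # channels and spatial points, illustration sizes
W = rand(Float32, nc, nc)                       # channel-mixing weights
x = rand(Float32, np, nc)                       # spatial dimension first, channels last, as below
y = reshape(kron(W, Matrix(I, np, np)) * vec(x), np, nc)
y ≈ x * transpose(W)                            # true: each spatial point's channels are mixed by W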

Now, to extend this to a convolution layer, let's consider the following partitioned tensor:

T = Float32

gx, gy, gc = 10, 30, 50
partition = [2, 2, 1]

nx, ny, nc = [gx, gy, gc] .÷ partition
x = rand(T, nx, ny, nc)

Our tensor is sharded across the x and y dimensions, with 2 processing elements along each dimension.

We can define the operators of our convolution as:

Sx = ParIdentity(T, gx)
Sy = ParIdentity(T, gy)
Sc = ParMatrix(T, gc, gc)

Compose our operators and distribute them:

S = Sc ⊗ Sy ⊗ Sx
S = distribute(S, partition)

Parametrize and apply our transform:

θ = init(S)
S(θ) * vec(x)

Take the gradient of some objective with respect to the parameters by simply doing:

θ′ = gradient(θ -> sum(S(θ) * vec(x)), θ)
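
Note that gradient is assumed here to come from an automatic-differentiation package such as Zygote.jl; this is our assumption rather than something stated on this page:

using Zygote   # assumption: supplies the `gradient(f, x)` used in these examples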

Another out-of-the-box example can be seen at Distributed Parametrized Convolution of a 3D Tensor.

Sharing Weights

Sharing weights can be thought of as a broadcasting operation.
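
As a plain-Julia illustration of that idea (a sketch on our part, independent of the package API): applying the same weight matrix to every shard of a column-partitioned input gives the same result as applying it to the full input.

W = rand(Float32, 20, 20)
x_full = rand(Float32, 20, 100)
shards = [x_full[:, (25k + 1):(25k + 25)] for k in 0:3]   # 4 workers, 25 columns each
y_shards = [W * s for s in shards]                        # the same W is "broadcast" to every shard
hcat(y_shards...) ≈ W * x_full                            # true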

To share the weights of an operator across multiple processing elements, we can do:

A = ParMatrix(T, 20, 20)
A = distribute(A)

Assume the following partition and tensor shape:

gc, gx = 20, 100
partition = [1, 4]

nc, nx = [gc, gx] .÷ partition
x = rand(T, nc, nx)

Initialize the parameters and apply the matrix operator to the sharded tensor:

θ = init(A)
A(θ) * x

Compute the gradient by doing:

θ′ = gradient(θ -> sum(A(θ) * x), θ)

Reduction Operation

To perform a reduction across processing elements, commonly known as an ALL_REDUCE operation, we can define:

R = ParReduce(T)

Given any local vector or matrix, we can do:

x = rand(T, 100)
R * x
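
Conceptually (our reading, not a formal specification of ParReduce), R * x sums the local x held on each participating process elementwise, and every process receives the same reduced result. With MPI.jl this would correspond to something like:

using MPI
MPI.Init()
y = MPI.Allreduce(x, +, MPI.COMM_WORLD)   # elementwise sum of x across all ranks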

To compute the gradient of some objective with respect to the input:

x′ = gradient(x -> sum(R * x), x)