Solvers and Eigensolvers for Multicore Processors

Paolo Bientinesi
AICES, RWTH Aachen
pauldj@aices.rwth-aachen.de
Max-Planck-Institut für biologische Kybernetik
March 18th, 2011, Tübingen, Germany
Paolo Bientinesi (AICES, RWTH Aachen) MRRR for Multicore Processors MPI Tübingen March 18th, 2011

1 Introduction
2 Part #1: Solvers
3 Part #2: Eigensolvers
4 Conclusions

Dense Linear Algebra Kernels
  Matrix-matrix multiplication (GEMM): $C := C + AB$
  Factorizations: $LU = A$, $LL^T = A$, $QR = A$
  Linear system: $AX = B$
  Transformations: $QAQ^T = T$, $QA = H$
  Matrix equations: $AX + XB = C$
  Eigenproblems: $AZ = Z\Lambda$
  Generalized eigenproblems: $AZ = BZ\Lambda$
  ...
  Objective: high performance, numerical stability, multiple algorithmic variants, multiple implementations
  Multicore: multiple types of parallelism!

Multi-cores: standard (computing) architecture
  Multi-core invasion: 499 of the 500 entries of the Top 500
  4 cores, 8 cores, ... 24 cores, ...
  More parallelism than we know what to do with?
  Is multi-threaded BLAS the solution for linear algebra libraries?
    Linear solvers
    Eigensolvers

Linear Algebra: Modularity!
  Algorithms expressed in terms of simpler linear algebra operations
  BLAS: Basic Linear Algebra Subprograms
    BLAS-1: $y := y + \alpha x$, $\beta := \alpha + x^T y$, with $x, y \in \mathbb{R}^n$
    BLAS-2: $y := y + Ax$, $y := A^{-1} x$, with $A \in \mathbb{R}^{n \times n}$ and $x, y \in \mathbb{R}^n$
    BLAS-3: $C := C + AB$, $C := A^{-1} B$, with $A, B, C \in \mathbb{R}^{n \times n}$
  Implementations: ESSL (IBM), MKL (Intel), ATLAS, GotoBLAS, ...
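To make the three BLAS levels concrete, here is a minimal pure-Python sketch of one operation per level. The function names mirror the usual BLAS routine names, but the code is only an illustration of the operations and their operand shapes; it is not how an optimized library such as MKL or ATLAS implements them.

```python
# Illustrative pure-Python versions of one operation per BLAS level.
# Real BLAS libraries implement these in optimized compiled code.

def axpy(alpha, x, y):
    """BLAS-1: y := y + alpha * x. O(n) flops on O(n) data."""
    return [yi + alpha * xi for xi, yi in zip(x, y)]

def gemv(A, x, y):
    """BLAS-2: y := y + A x. O(n^2) flops on O(n^2) data."""
    return [yi + sum(aij * xj for aij, xj in zip(row, x))
            for row, yi in zip(A, y)]

def gemm(A, B, C):
    """BLAS-3: C := C + A B. O(n^3) flops on O(n^2) data."""
    cols_B = list(zip(*B))
    return [[cij + sum(a * b for a, b in zip(row, col))
             for cij, col in zip(c_row, cols_B)]
            for row, c_row in zip(A, C)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
x = [1.0, 1.0]
zero_vec = [0.0, 0.0]
zero_mat = [[0.0, 0.0], [0.0, 0.0]]

print(axpy(2.0, x, zero_vec))   # [2.0, 2.0]
print(gemv(A, x, zero_vec))     # [3.0, 7.0]
print(gemm(A, B, zero_mat))     # [[2.0, 1.0], [4.0, 3.0]]
```

Only the BLAS-3 operation performs more arithmetic than it touches data, which is why high-performance libraries try to cast as much work as possible in terms of GEMM.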

Example: $AX = B$
  $AX = B$        Linear system
    $LU = A$      LU factorization
    $LX = B$      Triangular system
    $LX = B$      Triangular system
      $C = AB + C$    GEMM (under each of the three operations above)
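The decomposition of a linear system into an LU factorization followed by two triangular solves can be sketched as follows. This is a minimal pure-Python illustration for a single right-hand side; the helper names are hypothetical, and the factorization skips pivoting for brevity, whereas library solvers use partial pivoting for numerical stability.

```python
def lu_nopivot(A):
    """Return (L, U) with L unit lower triangular and L U = A (no pivoting)."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def forward_sub(L, b):
    """Solve L y = b for lower triangular L."""
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[j] * y[j] for j in range(i))) / row[i])
    return y

def backward_sub(U, y):
    """Solve U x = y for upper triangular U."""
    n = len(U)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

def solve(A, b):
    """Linear system = LU factorization + two triangular systems."""
    L, U = lu_nopivot(A)
    return backward_sub(U, forward_sub(L, b))

A = [[4.0, 3.0], [6.0, 3.0]]
b = [10.0, 12.0]
x = solve(A, b)
print(x)   # [1.0, 2.0]
```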

Performance of BLAS: single-threaded GEMM
  [Figure: efficiency of GEMM (0 to 1) versus matrix dimension (0 to 5000).]

Part #1: Solvers

LU factorization: loop-based algorithm
  Iteration i: completed
    [Figure: matrix partitioned into regions labeled DONE, PARTIALLY DONE, and COMPUTED.]
  Iteration i+1: repartitioning
  Iteration i+1: computation (LU, GEMM)
  Iteration i+1: completed (boundary shift)
    [Figure: matrix partitioned into regions labeled DONE, PARTIALLY DONE, and COMPUTED.]
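The loop structure described above (repartition, factor a panel, update the rest with GEMM, shift the boundary) can be sketched as a blocked right-looking LU factorization. This is a hedged pure-Python illustration, not the code behind the slides: the block size `nb`, the function names, and the absence of pivoting are all assumptions made for brevity.

```python
def lu_blocked(A, nb):
    """Return a copy of A overwritten with its LU factors (unit L below the
    diagonal, U on and above it), processing nb columns per iteration."""
    n = len(A)
    A = [row[:] for row in A]
    for k in range(0, n, nb):
        e = min(k + nb, n)               # current panel is columns k .. e-1
        # 1) LU: unblocked factorization of the panel A[k:n, k:e].
        for j in range(k, e):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, e):
                    A[i][c] -= A[i][j] * A[j][c]
        # 2) Triangular solve: A[k:e, e:n] := L11^{-1} A[k:e, e:n].
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    A[i][c] -= A[i][j] * A[j][c]
        # 3) GEMM: trailing update A[e:n, e:n] -= L21 * U12.
        for i in range(e, n):
            for c in range(e, n):
                A[i][c] -= sum(A[i][p] * A[p][c] for p in range(k, e))
    return A

def reconstruct(LU):
    """Multiply the packed factors back together: returns L U."""
    n = len(LU)
    L = [[LU[i][j] if j < i else float(i == j) for j in range(n)]
         for i in range(n)]
    U = [[LU[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return [[sum(L[i][p] * U[p][j] for p in range(n)) for j in range(n)]
            for i in range(n)]

A = [[4.0, 1.0, 2.0, 1.0],
     [1.0, 5.0, 1.0, 2.0],
     [2.0, 1.0, 6.0, 1.0],
     [1.0, 2.0, 1.0, 7.0]]

for nb in (1, 2, 4):
    R = reconstruct(lu_blocked(A, nb))
    err = max(abs(R[i][j] - A[i][j]) for i in range(4) for j in range(4))
    print(nb, err)
```

With a larger block size, most of the arithmetic moves into step 3, which is the part a multithreaded GEMM can parallelize.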

Parallelism? Solution #1: Multithreaded BLAS
  [Figure: blocks labeled LU and GEMM.]
  Advantage: ease of use. Legacy code!
  Drawback: synchronization.

Performance of BLAS: multithreaded GEMM
  [Figure: efficiency of GEMM (0 to 1) versus matrix dimension (0 to 5000) for 1, 4, and 8 threads.]

Example: SPD Inverse
  Inversion of a symmetric positive definite matrix
  Covariance matrix
  Very large dense problems
  Cholesky factorization
  Triangular inversion
  Matrix-matrix multiplication
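The three steps listed for the SPD inverse follow from $A = LL^T$, which gives $A^{-1} = L^{-T} L^{-1}$. A minimal pure-Python sketch under that reading is shown below; the function names are hypothetical and the code works on full matrices rather than on tiles.

```python
import math

def cholesky(A):
    """Step 1: lower triangular L with L L^T = A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][p] ** 2 for p in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def invert_lower(L):
    """Step 2: inverse of a lower triangular matrix, column by column."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            s = sum(L[i][p] * X[p][j] for p in range(j, i))
            X[i][j] = -s / L[i][i]
    return X

def spd_inverse(A):
    """Step 3: A^{-1} = L^{-T} L^{-1} = X^T X with X = L^{-1}."""
    n = len(A)
    X = invert_lower(cholesky(A))
    return [[sum(X[p][i] * X[p][j] for p in range(n)) for j in range(n)]
            for i in range(n)]

A = [[4.0, 2.0], [2.0, 3.0]]
A_inv = spd_inverse(A)
print(A_inv)   # approximately [[0.375, -0.25], [-0.25, 0.5]]
```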

Parallelism: can we do better? Solution #2: Algorithms by blocks
  Advantage: out-of-order execution.
  Advantage: parallelism limited only by the data dependencies between operations.
  Drawback: plateaus.

Cholesky factorization: $LL^T = A$
  [Figure: iterations 1 and 2, each with blocks labeled CHOL and SYRK.]

Algorithms by blocks: creating small tasks
  [Figure: iterations 1 and 2, each with blocks labeled CHOL and SYRK.]

Decomposing the computation
  Iteration 1: CHOL, SYRK, GEMM, SYRK, GEMM, GEMM, SYRK
  Iteration 2: CHOL, SYRK, GEMM, SYRK
  Iteration 3: CHOL, SYRK
  Iteration 4: CHOL

Dependencies
  [Figure: dependencies among the tasks CHOL, SYRK, GEMM, SYRK, GEMM, GEMM, SYRK.]

DAG of dependencies, $4 \times 4$-tile matrix
  [Figure: directed acyclic graph of CHOL, SYRK, and GEMM tasks.]

Task execution, $4 \times 4$-tile matrix
  Stage  Scheduled tasks
  1      CHOL
  2
  3      SYRK GEMM SYRK GEMM
  4      GEMM SYRK GEMM GEMM
  5      GEMM SYRK CHOL
  6
  7      SYRK GEMM SYRK GEMM
  8      GEMM SYRK CHOL
  9
  10     SYRK GEMM SYRK
  11     CHOL
  12
  13     SYRK
  14     CHOL
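How a task list and its data dependencies turn into execution stages can be sketched in a few lines. The sketch below assumes the standard right-looking tiled Cholesky, which includes TRSM tasks that do not appear in the extracted task lists above, and it assumes unit task time and an unlimited number of cores, so its stage count is a lower bound rather than a reproduction of the table. All names are hypothetical.

```python
def cholesky_tasks(nt):
    """List the tasks of a right-looking tiled Cholesky on an nt x nt grid.
    Each task is (name, tiles_read, tile_written)."""
    tasks = []
    for k in range(nt):
        tasks.append((f"CHOL({k})", [], (k, k)))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k)], (i, k)))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k)], (i, i)))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j)))
    return tasks

def schedule_stages(tasks):
    """Give each task the earliest stage its data dependencies allow.
    A task waits for the last writer of every tile it reads or updates.
    (In this algorithm no tile is overwritten after it has been read as an
    input, so tracking the last writer per tile is sufficient.)"""
    last_stage = {}   # tile -> stage of the most recent task that wrote it
    stages = {}
    for name, reads, write in tasks:
        deps = [last_stage.get(tile, 0) for tile in reads + [write]]
        stage = max(deps) + 1
        stages.setdefault(stage, []).append(name)
        last_stage[write] = stage
    return stages

tasks = cholesky_tasks(4)
stages = schedule_stages(tasks)
for s in sorted(stages):
    print(s, " ".join(stages[s]))
```

Nothing in the scheduler refers to iterations: a task becomes runnable as soon as its input tiles are ready, which is what allows out-of-order execution across iteration boundaries when fewer cores than ready tasks are available.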

SPD Inverse again: Chol+Inv+GEMM, $5 \times 5$-tile matrix
  Stage  Scheduled tasks
  1      CHOL
  2
  3      SYRK GEMM SYRK GEMM
  4      GEMM SYRK GEMM GEMM
  5      GEMM SYRK CHOL
  6
  7      TRINV SYRK
  8      GEMM SYRK GEMM GEMM
  9      SYRK TTMM CHOL
  10
  11     GEMM GEMM GEMM SYRK
  12     GEMM SYRK CHOL
  13     TRINV SYRK
  14     GEMM GEMM GEMM
  15     GEMM TRMM SYRK
  16     TTMM CHOL
  17     SYRK TRINV GEMM SYRK
  18     GEMM GEMM GEMM TRMM
  19     TRMM
  20
  21     TTMM SYRK GEMM SYRK
  22     TRINV GEMM GEMM TRINV
  23     SYRK SYRK GEMM SYRK
  24     TRMM GEMM TRMM GEMM
  25     TRMM SYRK GEMM GEMM
  26     TTMM GEMM TRMM TRMM
  27     SYRK TRMM
  28     TRMM
  29     TTMM

Cholesky, algorithm by blocks

Multithreaded vs. algorithm by blocks

Part 1: Summary
  Multithreaded BLAS vs. algorithms by blocks
  No absolute winner: crossover!
  Multithreaded BLAS: ease of use, but synchronization
  Algorithms by blocks: out-of-order execution, parallelism dictated by data dependencies, but plateaus

Part #2: Eigensolvers

Problem: $AX = X\Lambda$
  Input: $A \in \mathbb{C}^{n \times n}$, $A^H = A$; number of eigenpairs: $1 \le k \le n$
  Output: $X \in \mathbb{C}^{n \times k}$ eigenvectors, $\Lambda \in \mathbb{R}^{k \times k}$ eigenvalues
Approach
  $T = Q^H A Q$     Reduction to tridiagonal form    $O(n^3)$
  $TZ = Z\Lambda$   Tridiagonal eigenproblem         $O(kn)$ to $O(n^3)$
  $X = QZ$          Backtransformation               $O(kn^2)$
Algorithms
  Inverse Iteration (1958): subsets                  $O(kn^2)$
  QR (1961): high accuracy                           $O(n^3)$
  Divide & Conquer (1981): parallel, BLAS3           $O(n^3)$
  MRRR (1997): subsets, no re-orthogonalization      $O(kn)$
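The reason the three stages recover eigenpairs of the original matrix can be written out in one short derivation. It assumes, as is standard for the tridiagonal reduction, that $Q$ is unitary ($Q^H Q = I$).

```latex
\begin{align*}
T = Q^H A Q,\quad Q^H Q = I \;&\Rightarrow\; A = Q T Q^H, \\
T Z = Z \Lambda \;&\Rightarrow\; A (Q Z) = Q T Q^H Q Z = Q T Z = (Q Z) \Lambda, \\
X = Q Z \;&\Rightarrow\; A X = X \Lambda .
\end{align*}
```

So the eigenvalues of $T$ are eigenvalues of $A$, and each eigenvector of $A$ is obtained by applying $Q$ to the corresponding eigenvector of $T$, which is the backtransformation step.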

Multi-threaded BLAS?
  [Figure, two panels: time in seconds versus number of threads (4 to 24). Left panel: MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel: MRRR (MKL), MRRR (LAPACK), DC (MKL).]
  Tridiagonal eigensolver, matrix size = 4289, from DFT.

More motivation? ... it's $O(n^2)$ anyway
  [Figure: fraction of execution time versus number of threads (1, 2, 4, 8, 16, 24), N = 4,289, split into Reduction, Sequential MRRR, and Backtransformation.]
  If not properly parallelized, even $O(n^2)$ dominates!
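The effect can be illustrated with a small Amdahl-style calculation. The timings below are hypothetical values chosen for illustration, not measurements from the talk, and the model assumes that the reduction and backtransformation scale perfectly while the tridiagonal stage stays sequential.

```python
def sequential_fraction(t_parallel, t_sequential, threads):
    """Share of the total time spent in the stage that does not scale,
    assuming the other stages speed up perfectly with the thread count."""
    total = t_parallel / threads + t_sequential
    return t_sequential / total

# Hypothetical single-thread timings in seconds (not measured values).
T_PARALLEL = 95.0    # reduction + backtransformation
T_SEQUENTIAL = 5.0   # tridiagonal eigensolver left sequential

for p in (1, 2, 4, 8, 16, 24):
    print(p, round(sequential_fraction(T_PARALLEL, T_SEQUENTIAL, p), 2))
```

Under these assumptions, a stage that accounts for 5% of the single-thread time takes more than half of the total time at 24 threads.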

MRRR (Dhillon & Parlett)
  Multiple Relatively Robust Representations
  First stable algorithm to compute k eigenpairs in $O(nk)$ ops
  No reorthogonalization
  1) eigenvalues  2) eigenvectors + eigenvalues
  Eigenvalues: bisection or dqds
  Eigenvectors: scan the $\lambda$'s
    separated: compute $(\lambda, z)$
    cluster: shift, new RRR, refine the $\lambda$'s
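The bisection option for the eigenvalue stage can be sketched with a textbook Sturm count on a symmetric tridiagonal matrix. This is a hedged illustration of the general technique, not the routine used in LAPACK or MR3-SMP; the function names and the tolerance are assumptions.

```python
import math

def count_below(d, e, x):
    """Number of eigenvalues smaller than x of the symmetric tridiagonal
    matrix with diagonal d and off-diagonal e (count of negative pivots
    in the LDL^T factorization of T - x I)."""
    count = 0
    q = d[0] - x
    for i in range(len(d)):
        if i > 0:
            q = d[i] - x - e[i - 1] ** 2 / q
        if q == 0.0:
            q = 1e-300   # avoid division by zero in the next step
        if q < 0.0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) by bisection."""
    n = len(d)
    # Gershgorin intervals enclose all eigenvalues.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) +
              (abs(e[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid      # at least k+1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Test matrix with known eigenvalues 2 - 2 cos(j pi / (n + 1)), j = 1..n.
n = 5
d = [2.0] * n
e = [-1.0] * (n - 1)
computed = [kth_eigenvalue(d, e, k) for k in range(n)]
exact = [2.0 - 2.0 * math.cos((k + 1) * math.pi / (n + 1)) for k in range(n)]
print(max(abs(c - x) for c, x in zip(computed, exact)))
```

Each eigenvalue is found independently of the others, which is what makes bisection attractive for computing subsets and for running in parallel.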

Representation Tree

The work queue
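A work queue in which processing one task may create further tasks can be sketched with the Python standard library. This is only a structural illustration: the tree of index ranges below is a hypothetical stand-in for clusters and singletons, and the code is not the MR3-SMP implementation.

```python
import queue
import threading

def run_work_queue(root, children_of, num_workers=4):
    """Process a tree of tasks with a shared queue; return the leaves."""
    tasks = queue.Queue()
    leaves = []
    lock = threading.Lock()

    def worker():
        while True:
            node = tasks.get()
            kids = children_of(node)
            if kids:
                # Enqueue the children before marking the parent as done,
                # so the queue never looks finished too early.
                for kid in kids:
                    tasks.put(kid)
            else:
                with lock:
                    leaves.append(node)
            tasks.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    tasks.put(root)
    tasks.join()      # returns once every enqueued task has been processed
    return sorted(leaves)

# Hypothetical tree: each node is a range of eigenvalue indices; a range with
# more than two entries stands in for a cluster that is split further.
def split(node):
    lo, hi = node
    if hi - lo <= 2:
        return []
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]

result = run_work_queue((0, 8), split)
print(result)   # [(0, 2), (2, 4), (4, 6), (6, 8)]
```

Idle workers pick up whatever task is available next, so the load balances itself without a fixed assignment of tasks to threads.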

Example trace: 16 cores
  [Figure: execution trace, eigenvectors.]
  Matrix size: 12387
  Execution time: 3.3 s
  Sequential: 49.3 s (LAPACK)
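The two timings quoted for this trace translate into a speedup and a parallel efficiency as follows. The helper name is hypothetical, and the baseline is the sequential LAPACK time, so the figure is a speedup relative to LAPACK rather than to a single-threaded run of the same code.

```python
def speedup_and_efficiency(t_sequential, t_parallel, cores):
    """Speedup over the sequential time and speedup per core."""
    speedup = t_sequential / t_parallel
    return speedup, speedup / cores

s, eff = speedup_and_efficiency(49.3, 3.3, 16)
print(f"speedup {s:.1f}x, efficiency {eff:.2f}")   # speedup 14.9x, efficiency 0.93
```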

MR3-SMP: execution time
  [Figure, two panels: time in seconds versus number of threads (4 to 24). Left panel: MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel: MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]
  Matrix size: 4289.

A larger example: look at the scale!
  Matrix size: 16023. Frequency response analysis of automobiles.
  [Figure, two panels, N = 16023, versus number of threads (4 to 24). Left panel, time in minutes: MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel, time in seconds: MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]
  From almost 10 hours to 8.3 seconds.

MR3-SMP: speedup
  [Figure, two panels. Left panel: time in seconds for eigenvalues and eigenvectors, LAPACK and 2, 4, 8, 16, 24 threads. Right panel: speedup versus number of threads (4 to 24) for Ideal, Eigenvalues (bisection), Eigenvectors (bisection), Eigenvectors (dqds), and Total.]

3 stages: before and after
  [Figure, two panels: execution time versus number of threads (1, 2, 4, 8, 16, 24), N = 4,289, split into Reduction, MRRR, and Backtransformation. Left panel: sequential MRRR. Right panel: parallel MRRR.]
  [Figure, two panels: fraction of execution time versus number of threads (1, 2, 4, 8, 16, 24), N = 4,289, split into Backtransformation, MRRR, and Reduction. Left panel: sequential MRRR. Right panel: MR3-SMP.]

Conclusions
  MRRR-SMP, with Matthias Petschow (AICES)
    Eigensolver tailored for multi-cores
    Almost perfect speedups
    Routines are available
  Multi-threaded BLAS for solvers: nice and easy.
  Multi-threaded BLAS for eigensolvers: not THAT good.

Thank you for your attention.
Financial support from the Deutsche Forschungsgemeinschaft (German Research Foundation) through grant GSC 111 is gratefully acknowledged.