Solvers and Eigensolvers for Multicore Processors

Paolo Bientinesi
AICES, RWTH Aachen
pauldj@aices.rwth-aachen.de

Max-Planck-Institut für biologische Kybernetik
March 18th, 2011, Tübingen, Germany

Paolo Bientinesi (AICES, RWTH Aachen), MRRR for Multicore Processors, MPI Tübingen, March 18th, 2011

1 Introduction
2 Part #1: Solvers
3 Part #2: Eigensolvers
4 Conclusions

Dense Linear Algebra Kernels

  Matrix-matrix multiplication (GEMM): $C \leftarrow C + AB$
  Factorizations: $LU = A$, $LL^T = A$, $QR = A$
  Linear system: $AX = B$
  Transformations: $QAQ^T = T$, $QA = H$
  Matrix equations: $AX + XB = C$
  Eigenproblems: $AZ = Z\Lambda$
  Generalized eigenproblems: $AZ = BZ\Lambda$
  ...

Objective:
  High performance
  Numerical stability
  Multiple algorithmic variants
  Multiple implementations
  Multicore: multiple types of parallelism!

Multi-cores: standard (computing) architecture

  Multi-cores invasion: 499/500 entries of the Top 500
  4 cores, 8 cores, ... 24 cores, ...
  More parallelism than we know what to do with?
  Is multi-threaded BLAS the solution for LA libs?
    Linear solvers
    Eigensolvers

Linear Algebra: Modularity!

Algorithms expressed in terms of simpler linear algebra operations

BLAS: Basic Linear Algebra Subprograms
  BLAS-1: $y := y + \alpha x$, $\beta := \alpha + x^T y$, with $x, y \in \mathbb{R}^n$
  BLAS-2: $y := y + Ax$, $y := A^{-1} x$, with $A \in \mathbb{R}^{n \times n}$, $x, y \in \mathbb{R}^n$
  BLAS-3: $C := C + AB$, $C := A^{-1} B$, with $A, B, C \in \mathbb{R}^{n \times n}$

Implementations: ESSL (IBM), MKL (Intel), ATLAS, GotoBLAS, ...
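
To make the three BLAS levels concrete, here is a minimal pure-Python sketch of one operation per level. The function names follow the usual BLAS naming (axpy, gemv, gemm), but the signatures are simplified for illustration and are not the real BLAS interfaces; a tuned library would of course be used in practice.

```python
def axpy(alpha, x, y):
    """BLAS-1 style: return y + alpha * x for vectors x, y (O(n) work)."""
    return [yi + alpha * xi for xi, yi in zip(x, y)]


def gemv(A, x, y):
    """BLAS-2 style: return y + A x for a matrix A and vectors x, y (O(n^2) work)."""
    return [yi + sum(a * xj for a, xj in zip(row, x)) for row, yi in zip(A, y)]


def gemm(A, B, C):
    """BLAS-3 style: return C + A B for matrices A, B, C (O(n^3) work)."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [C[i][j] + sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


if __name__ == "__main__":
    print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
    print(gemv([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], [0.0, 0.0]))
    print(gemm([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]))
```

The work per data element grows from level 1 to level 3, which is why higher-level algorithms try to spend most of their time in BLAS-3 operations such as GEMM.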

Example: AX = B

$AX = B$ (linear system) is decomposed into:
  $LU = A$ (LU factorization)
  $LX = B$ (triangular system)
  $LX = B$ (triangular system)
Each of these is in turn built on $C = AB + C$ (GEMM).
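
The decomposition of a linear solve into one factorization and two triangular solves can be sketched as follows. This is a minimal illustration with a single right-hand side and, as a simplifying assumption, no pivoting, so it is only safe for matrices such as diagonally dominant ones; the function name `lu_solve` is hypothetical and production code would call a library routine instead.

```python
def lu_solve(A, b):
    """Solve A x = b via LU factorization (no pivoting) and two triangular solves."""
    n = len(A)
    F = [row[:] for row in A]  # F will hold L (strictly lower, unit diagonal) and U (upper)

    # Step 1: LU = A, computed in place without pivoting (assumption).
    for k in range(n):
        for i in range(k + 1, n):
            F[i][k] /= F[k][k]
            for j in range(k + 1, n):
                F[i][j] -= F[i][k] * F[k][j]

    # Step 2: forward substitution, L y = b (L has a unit diagonal).
    y = list(b)
    for i in range(n):
        for k in range(i):
            y[i] -= F[i][k] * y[k]

    # Step 3: back substitution, U x = y.
    x = list(y)
    for i in reversed(range(n)):
        for k in range(i + 1, n):
            x[i] -= F[i][k] * x[k]
        x[i] /= F[i][i]
    return x


if __name__ == "__main__":
    print(lu_solve([[4.0, 3.0], [6.0, 3.0]], [10.0, 12.0]))
```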

Performance of BLAS: single-threaded GEMM

[Figure: efficiency of GEMM (0 to 1) versus matrix dimension (0 to 5000).]

Part #1: Solvers

LU factorization: loop-based algorithm

  Iteration i: completed. [Figure: partitioned matrix; region labels as extracted: DONE, DONE, PARTIALLY DONE, COMPUTED.]
  Iteration i+1: repartitioning.
  Iteration i+1: computation (LU, GEMM).
  Iteration i+1: completed (boundary shift). [Figure: partitioned matrix; region labels as extracted: DONE, PARTIALLY DONE, COMPUTED.]
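
The loop structure (repartition, compute, shift the boundary) can be sketched as a blocked right-looking LU factorization. This is one common variant, chosen here as an assumption since the slides do not say which variant is pictured; it also assumes no pivoting, and the names `lu_blocked`, `lu_unblocked`, `multiply_lu` and the block size `nb` are hypothetical.

```python
def lu_unblocked(A, lo, hi):
    """In-place LU (no pivoting) of the diagonal block A[lo:hi][lo:hi]."""
    for k in range(lo, hi):
        for i in range(k + 1, hi):
            A[i][k] /= A[k][k]
            for j in range(k + 1, hi):
                A[i][j] -= A[i][k] * A[k][j]


def lu_blocked(A, nb):
    """In-place blocked LU: A is overwritten by L (unit lower) and U (upper)."""
    n = len(A)
    for lo in range(0, n, nb):           # repartition: expose the next diagonal block
        hi = min(lo + nb, n)
        lu_unblocked(A, lo, hi)          # LU of the diagonal block: A11 = L11 U11
        for j in range(hi, n):           # U12 = inv(L11) A12 (unit lower triangular solve)
            for i in range(lo, hi):
                for k in range(lo, i):
                    A[i][j] -= A[i][k] * A[k][j]
        for i in range(hi, n):           # L21 = A21 inv(U11) (upper triangular solve)
            for j in range(lo, hi):
                for k in range(lo, j):
                    A[i][j] -= A[i][k] * A[k][j]
                A[i][j] /= A[j][j]
        for i in range(hi, n):           # GEMM: A22 = A22 - L21 U12 (bulk of the work)
            for j in range(hi, n):
                A[i][j] -= sum(A[i][k] * A[k][j] for k in range(lo, hi))
    return A                             # the loop header shifts the boundary


def multiply_lu(F):
    """Rebuild L U from the packed factors, to check the factorization."""
    n = len(F)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else F[i][k]
                P[i][j] += l_ik * F[k][j]
    return P


if __name__ == "__main__":
    A = [[4.0, 1.0, 2.0, 0.5], [1.0, 5.0, 1.0, 2.0], [2.0, 1.0, 6.0, 1.0], [0.5, 2.0, 1.0, 7.0]]
    F = lu_blocked([row[:] for row in A], nb=2)
    print(multiply_lu(F))
```

With a multithreaded BLAS, the parallelism lives entirely inside the GEMM-like update of each iteration, and every iteration ends with an implicit synchronization.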

Parallelism? Solution #1: Multithreaded BLAS

[Figure: the LU and GEMM steps of one iteration.]

  Advantage: ease of use. Legacy code!
  Drawback: synchronization.

Performance of BLAS: multithreaded GEMM

[Figure: efficiency of GEMM (0 to 1) versus matrix dimension (0 to 5000), for 1, 4, and 8 threads.]

Example: SPD Inverse

Inversion of a symmetric positive definite matrix
  Covariance matrix
  Very large dense problems

Steps:
  Cholesky factorization
  Triangular inversion
  Matrix-matrix multiplication
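
The three steps can be sketched in plain Python as follows: factor $A = LL^T$, invert the triangular factor, and form $A^{-1} = L^{-T} L^{-1}$. This is an unblocked illustration under the assumption that $A$ is symmetric positive definite; the function names are hypothetical, and a real code would use the corresponding library kernels.

```python
import math


def cholesky(A):
    """Return the lower triangular L with L L^T = A (A must be SPD)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L


def lower_tri_inverse(L):
    """Return the inverse of a lower triangular matrix L."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            X[i][j] = -sum(L[i][k] * X[k][j] for k in range(j, i)) / L[i][i]
    return X


def spd_inverse(A):
    """Invert an SPD matrix: Cholesky, triangular inversion, then X^T X."""
    n = len(A)
    X = lower_tri_inverse(cholesky(A))
    return [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print(spd_inverse([[4.0, 2.0], [2.0, 3.0]]))
```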

Parallelism: can we do better? Solution #2: Algorithms by blocks

  Advantage: out-of-order execution.
  Advantage: parallelism limited only by the data dependencies between operations.
  Drawback: plateaux.

Cholesky factorization: $LL^T = A$

[Figure: CHOL and SYRK steps in iteration 1 and iteration 2.]

Algorithms by blocks: creating small tasks

[Figure: CHOL and SYRK steps in iteration 1 and iteration 2.]

Decomposing the computation

  Iteration 1: CHOL, SYRK, GEMM, SYRK, GEMM, GEMM, SYRK
  Iteration 2: CHOL, SYRK, GEMM, SYRK
  Iteration 3: CHOL, SYRK
  Iteration 4: CHOL

Dependencies

[Figure: dependencies among the tasks CHOL, SYRK, GEMM, SYRK, GEMM, GEMM, SYRK.]

DAG - Dependencies

4 × 4-tile matrix

[Figure: dependency graph (DAG) of the CHOL, SYRK, and GEMM tasks.]

Task Execution

4 × 4-tile matrix

Stage  Scheduled tasks
1      CHOL
2      (none listed)
3      SYRK GEMM SYRK GEMM
4      GEMM SYRK GEMM GEMM
5      GEMM SYRK CHOL
6      (none listed)
7      SYRK GEMM SYRK GEMM
8      GEMM SYRK CHOL
9      (none listed)
10     SYRK GEMM SYRK
11     CHOL
12     (none listed)
13     SYRK
14     CHOL
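
The idea of deriving a schedule from data dependencies can be sketched as follows. The sketch uses the standard right-looking tiled Cholesky, which includes TRSM tasks; this is an assumption, since the extracted slides list only CHOL, SYRK, and GEMM. It also assumes an unlimited number of workers, so it does not attempt to reproduce the stage table above. All function names are hypothetical.

```python
from collections import defaultdict


def cholesky_tasks(nt):
    """List (name, tiles_read, tile_written) in sequential order for an nt x nt tile grid."""
    tasks = []
    for k in range(nt):
        tasks.append((f"CHOL({k})", [], (k, k)))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k)], (i, k)))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k)], (i, i)))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j)))
    return tasks


def schedule_by_stage(tasks):
    """Place each task in the earliest stage after the last writers of its tiles.

    Only read-after-write and write-after-write dependencies are tracked. That is
    enough here because a tile is never rewritten after other tasks start reading it.
    """
    last_write_stage = {}                 # tile -> stage of the latest task that wrote it
    stages = defaultdict(list)
    for name, reads, written in tasks:
        needed = [last_write_stage.get(tile, 0) for tile in reads + [written]]
        stage = 1 + max(needed)
        last_write_stage[written] = stage
        stages[stage].append(name)
    return [stages[s] for s in sorted(stages)]


if __name__ == "__main__":
    for number, names in enumerate(schedule_by_stage(cholesky_tasks(4)), start=1):
        print(number, " ".join(names))
```

Tasks in the same stage are independent of one another and could run concurrently; the number of stages is the length of the critical path through the DAG.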

SPD Inverse again: Chol+Inv+GEMM

5 × 5-tile matrix

Stage  Scheduled tasks
1      CHOL
2      (none listed)
3      SYRK GEMM SYRK GEMM
4      GEMM SYRK GEMM GEMM
5      GEMM SYRK CHOL
6      (none listed)
7      TRINV SYRK
8      GEMM SYRK GEMM GEMM
9      SYRK TTMM CHOL
10     (none listed)
11     GEMM GEMM GEMM SYRK
12     GEMM SYRK CHOL
13     TRINV SYRK
14     GEMM GEMM GEMM
15     GEMM TRMM SYRK
16     TTMM CHOL
17     SYRK TRINV GEMM SYRK
18     GEMM GEMM GEMM TRMM
19     TRMM
20     (none listed)
21     TTMM SYRK GEMM SYRK
22     TRINV GEMM GEMM TRINV
23     SYRK SYRK GEMM SYRK
24     TRMM GEMM TRMM GEMM
25     TRMM SYRK GEMM GEMM
26     TTMM GEMM TRMM TRMM
27     SYRK TRMM
28     TRMM
29     TTMM

Cholesky, algorithm by block

[Figure]

Multithreaded vs. algorithm by block

[Figure]

Part 1: Summary

Multithreaded BLAS vs. algorithms by blocks
No absolute winner: crossover!

  Multithreaded BLAS: ease of use; synchronization.
  Algorithms by blocks: out-of-order execution; parallelism dictated by data dependencies; plateaux.

Part #2: Eigensolvers

Problem: $AX = X\Lambda$

  Input: $A \in \mathbb{C}^{n \times n}$, $A^H = A$; number of eigenpairs: $1 \le k \le n$
  Output: $X \in \mathbb{C}^{n \times k}$ (eigenvectors), $\Lambda \in \mathbb{R}^{k \times k}$ (eigenvalues)

Approach
  $T = Q^H A Q$      Reduction to tridiagonal form   $O(n^3)$
  $TZ = Z\Lambda$    Tridiagonal eigenproblem        $O(kn)$ to $O(n^3)$
  $X = QZ$           Backtransformation              $O(kn^2)$

Algorithms
  Inverse Iteration (1958): subsets, $O(kn^2)$
  QR (1961): high accuracy, $O(n^3)$
  Divide & Conquer (1981): parallel, BLAS3, $O(n^3)$
  MRRR (1997): subsets, no re-orthogonalization, $O(kn)$

Multi-threaded BLAS?

[Figure: two panels, time in seconds versus number of threads (4 to 24). Left panel: MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel: MRRR (MKL), MRRR (LAPACK), DC (MKL).]

Tridiagonal eigensolver, matrix size = 4289, from DFT.

More motivation? "... it's $O(n^2)$ anyway"

[Figure: fraction of execution time for N = 4,289, split into reduction, sequential MRRR, and backtransformation, versus number of threads (1 to 24).]

If not properly parallelized, even $O(n^2)$ dominates!
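
The effect can be illustrated with a small Amdahl-style calculation. The timings below are hypothetical, chosen only for illustration and not taken from the plot, and the sketch assumes that reduction and backtransformation scale ideally with the number of threads while the tridiagonal stage stays sequential. The function name `stage_fractions` is also hypothetical.

```python
def stage_fractions(t_reduction, t_tridiagonal, t_backtransformation, threads):
    """Return each stage's share of the run time when only the outer stages scale."""
    reduction = t_reduction / threads                   # assumed to scale ideally
    backtransformation = t_backtransformation / threads  # assumed to scale ideally
    total = reduction + t_tridiagonal + backtransformation
    return reduction / total, t_tridiagonal / total, backtransformation / total


if __name__ == "__main__":
    # Hypothetical single-thread timings (seconds): 60 + 5 + 35.
    for p in (1, 2, 4, 8, 16, 24):
        r, m, b = stage_fractions(60.0, 5.0, 35.0, p)
        print(f"{p:2d} threads: reduction {r:.2f}, tridiagonal {m:.2f}, backtransformation {b:.2f}")
```

Under these assumptions, a stage that accounts for 5% of the single-thread time takes more than half of the total time at 24 threads.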

MRRR (Dhillon & Parlett)

Multiple Relatively Robust Representations
  First stable algorithm to compute k eigenpairs in $O(nk)$ ops
  No reorthogonalization

Two phases: 1) eigenvalues, 2) eigenvectors + eigenvalues
  Eigenvalues: bisection or dqds
  Eigenvectors: scan the λ's.
    Separated λ: compute $(\lambda, z)$.
    Cluster of λ's: shift, new RRR, refine the λ's.
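
The bisection option for the eigenvalue phase can be sketched as follows for a real symmetric tridiagonal matrix with diagonal `d` and off-diagonal `e`. This is the textbook Sturm-count bisection, not the MRRR implementation discussed in the talk, and the function names are hypothetical. Each eigenvalue is computed independently of the others, which is what makes the approach easy to parallelize.

```python
def sturm_count(d, e, x):
    """Count the eigenvalues of the tridiagonal matrix (d, e) that are smaller than x."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        # Pivot recurrence of the LDL^T factorization of T - x I.
        q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:
            q = 1e-300          # nudge an exact zero pivot to keep the recurrence going
        if q < 0.0:
            count += 1
    return count


def kth_eigenvalue(d, e, k, tol=1e-12):
    """Return the k-th smallest eigenvalue (k = 0, 1, ...) by bisection."""
    n = len(d)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [
        (abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
        for i in range(n)
    ]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    for _ in range(200):        # hard cap on the number of halvings
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) > k:
            hi = mid            # at least k + 1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    d, e = [2.0, 2.0, 2.0], [-1.0, -1.0]
    print([kth_eigenvalue(d, e, k) for k in range(3)])
```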

Representation Tree

[Figure]

The work queue

[Figure]

Example trace: 16 cores

[Figure: execution trace; the extracted legend includes "eigenvectors".]

  Matrix size: 12387
  Execution time: 3.3 s
  Sequential: 49.3 s (LAPACK)

MR3-SMP: execution time

[Figure: two panels, time in seconds versus number of threads (4 to 24). Left panel: MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel: MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]

Matrix size: 4289.

A larger example: look at the scale!

Matrix size: 16023. Frequency response analysis of automobiles.

[Figure: two panels for N = 16023 versus number of threads (4 to 24). Left panel, time in minutes: MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel, time in seconds: MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]

From almost 10 hours to 8.3 seconds.

MR3-SMP: speedup

[Figure: two panels. Left panel: time in seconds for eigenvalues and eigenvectors, LAPACK versus 2, 4, 8, 16, and 24 threads. Right panel: speedup versus number of threads (4 to 24) for ideal, eigenvalues (bisection), eigenvectors (bisection), eigenvectors (dqds), and total.]

3 stages: before and after

[Figure: execution time for N = 4,289 versus number of threads (1 to 24), split into reduction, MRRR, and backtransformation. Left panel: sequential MRRR. Right panel: parallel MRRR.]

[Figure: fraction of execution time for N = 4,289 versus number of threads (1 to 24), split into reduction, MRRR, and backtransformation. Left panel: sequential MRRR. Right panel: MR3-SMP.]

Conclusions

MRRR-SMP (Matthias Petschow, AICES)
  Eigensolver tailored for multi-cores
  Almost perfect speedups
  Routines are available

Multi-threaded BLAS for solvers: nice and easy.
Multi-threaded BLAS for eigensolvers: not THAT good.

Thank you for the attention.

Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111 is gratefully acknowledged.