\documentclass[12pt]{article} \batchmode \usepackage{hyperref} \usepackage{minted} \topmargin -0.5in \textheight 9.0in \oddsidemargin 0pt \evensidemargin 0pt \textwidth 6.5in %------------------------------------------------------------------------------- % get epsf.tex file, for encapsulated postscript files: \input epsf %------------------------------------------------------------------------------- % macro for Postscript figures the easy way % usage: \postscript{file.ps}{scale} % where scale is 1.0 for 100%, 0.5 for 50% reduction, etc. % \newcommand{\postscript}[2] {\setlength{\epsfxsize}{#2\hsize} \centerline{\epsfbox{#1}}} %------------------------------------------------------------------------------- \title{User's Guide for ParU, an unsymmetric multifrontal multithreaded sparse LU factorization package} \author{Mohsen Aznaveh\thanks{ email: aznaveh@tamu.edu. http://www.suitesparse.com. }, Timothy A. Davis} \input{paru_version.tex} %------------------------------------------------------------------------------- \begin{document} %------------------------------------------------------------------------------- \maketitle \begin{abstract} ParU is an implementation of the multifrontal sparse LU factorization method. Parallelism is exploited both in the BLAS and across different frontal matrices using OpenMP tasking, a shared-memory programming model for modern multicore architectures. The package is written in C++ and real sparse matrices are supported. ParU, Copyright (c) 2022, Mohsen Aznaveh and Timothy A. Davis, All Rights Reserved. SPDX-License-Identifier: GNU GPL 3.0 \end{abstract} \maketitle %------------------------------------------------------------------------------- \section{Introduction} \label{intro} %------------------------------------------------------------------------------- The algorithms used in ParU will be discussed in a companion paper, ?. This document gives detailed information on the installation and use of ParU. ParU is a parallel sparse direct solver. This package uses OpenMP tasking for parallelism. ParU calls UMFPACK for the symbolic analysis phase, after that, some symbolic analysis is done by ParU itself, and then the numeric phase starts. The numeric computation is a task parallel phase using OpenMP, and each task calls parallel BLAS; i.e. nested parallelism. The performance of BLAS has a heavy impact on the performance of ParU. Moreover, the way parallel BLAS can be called in a nested environment can also be very important for ParU's performance. %------------------------------------------------------------------------------- \subsubsection{Instructions on using METIS} %------------------------------------------------------------------------------- SuiteSparse is now on METIS 5.1.0, distributed along with SuiteSparse itself. Its use is optional, however. ParU is using AMD as the default ordering. METIS tends to give orderings that are good for parallelism. However, the METIS itself can be slow. As a result, the symbolic analysis using METIS can be slow, but usually, the factorization is faster. Therefore, depending on your use case, either use METIS, or you can compile and run your code without using METIS. If you are using METIS on an unsymmetric case, UMFPACK has to form the Matrix $A^{T}A$. This matrix can be a dense matrix and takes a lot of resources to form it. To avoid such conditions, you can use the ordering strategy \verb'UMFPACK_ORDERING_METIS_GUARD' that is introduced in UMFPACK version 6.0. This ordering strategy use COLAMD instead of METIS in those cases. Note that METIS is not bug-free; it can occasionally cause segmentation faults, particularly if used when finding basic solutions to underdetermined systems with many more columns than rows. ParU does not solve such systems anyway but you might see some problems with other SuiteSparse packages. %------------------------------------------------------------------------------- \section{Using ParU in C and C++} %------------------------------------------------------------------------------- ParU relies on CHOLMOD for its basic sparse matrix data structure, a compressed sparse column format. CHOLMOD provides interfaces to the AMD, COLAMD, and METIS ordering methods and many other functions. ParU also relies on UMFPACK Version 6.0 or higher for symbolic analysis. %------------------------------------------------------------------------------- \subsection{Installing the C/C++ library on Linux/Unix} %------------------------------------------------------------------------------- In Linux/MacOs, type \verb'make' at the command line in either the \verb'SuiteSparse' directory (which compiles all of SuiteSparse) or in the \verb'SuiteSparse/ParU' directory (which just compiles ParU and the libraries it requires). ParU will be compiled; you can type \verb'make demos' to run a set of simple demos. The use of \verb'make' is optional. The top-level \verb'ParU/Makefile' is a simple wrapper that uses \verb'cmake' to do the actual build. To fully test the coverage of the lines ParU, go to the \verb'Tcov' directory and type \verb'make'. This will work for Linux only. To install the shared library into /usr/local/lib and /usr/local/include, do {\tt make install}. To uninstall, do {\tt make uninstall}. For more options, see the {\tt ParU/README.md} file. %------------------------------------------------------------------------------- \subsection{C/C++ Example} %------------------------------------------------------------------------------- The C++ interface is written using only real matrices. The simplest function computes the MATLAB equivalent of \verb'x=A\b' and is almost as simple: Below is a simple C++ program that illustrates the use of ParU. The program reads in a problem from \verb'stdin' in MatrixMarket format \cite{BoisvertPozoRemingtonBarrettDongarra97}, solves it, and prints the norm of \verb'A' and the residual. Some error testing code is omited to simplify showing how the program works. The full program can be found in \verb'ParU/Demo/paru_demo.cpp' \begin{minted}{cpp} #include "ParU.hpp" int main(int argc, char **argv) { cholmod_common Common, *cc; cholmod_sparse *A; ParU_Symbolic *Sym = NULL; //~~~~~~~~~Reading the input matrix and test if the format is OK~~~~~~~~~~~~ // start CHOLMOD cc = &Common; int mtype; cholmod_l_start(cc); // A = mread (stdin) ; read in the sparse matrix A A = (cholmod_sparse *)cholmod_l_read_matrix(stdin, 1, &mtype, cc); //~~~~~~~~~~~~~~~~~~~Starting computation~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ParU_Control Control; ParU_Ret info; info = ParU_Analyze(A, &Sym, &Control); ParU_Numeric *Num; info = ParU_Factorize(A, Sym, &Num, &Control); double my_time = omp_get_wtime() - my_start_time; //~~~~~~~~~~~~~~~~~~~Test the results ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ int64_t m = Sym->m; if (info == PARU_SUCCESS) { double *b = (double *)malloc(m * sizeof(double)); double *xx = (double *)malloc(m * sizeof(double)); for (int64_t i = 0; i < m; ++i) b[i] = i + 1; info = ParU_Solve(Sym, Num, b, xx, &Control); printf("Solve time is %lf seconds.\n", my_solve_time); double resid, anorm, xnorm; info = ParU_Residual(A, xx, b, m, resid, anorm, xnorm, &Control); printf("Residual is |%.2e|, anorm is %.2e, xnorm is %.2e and rcond is" " %.2e.\n", resid, anorm, xnorm, Num->rcond); free(b); free(xx); } //~~~~~~~~~~~~~~~~~~~End computation~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ int max_threads = omp_get_max_threads(); BLAS_set_num_threads(max_threads); //~~~~~~~~~~~~~~~~~~~Free Everything~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ParU_Freenum(&Num, &Control); ParU_Freesym(&Sym, &Control); cholmod_l_free_sparse(&A, cc); cholmod_l_finish(cc); } \end{minted} A simple demo for the C interface is shown next. You can see the complete demo in \verb'ParU/Demo/paru_simple.c' \begin{minted}{c} #include "ParU_C.h" int main(int argc, char **argv) { cholmod_common Common, *cc; cholmod_sparse *A; ParU_C_Symbolic *Sym; //~~~~~~~~~Reading the input matrix and test if the format is OK~~~~~~~~~~~~ // start CHOLMOD cc = &Common; int mtype; cholmod_l_start(cc); // A = mread (stdin) ; read in the sparse matrix A A = (cholmod_sparse *)cholmod_l_read_matrix(stdin, 1, &mtype, cc); //~~~~~~~~~~~~~~~~~~~Starting computation~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ printf("================= ParU, a simple demo, using C interface : ====\n"); ParU_C_Control Control; ParU_C_Init_Control(&Control); ParU_Ret info; info = ParU_C_Analyze(A, &Sym, &Control); printf("Input matrix is %ldx%ld nnz = %ld \n", Sym->m, Sym->n, Sym->anz); ParU_C_Numeric *Num; info = ParU_C_Factorize(A, Sym, &Num, &Control); if (info != PARU_SUCCESS) { printf("ParU: factorization was NOT successfull."); if (info == PARU_OUT_OF_MEMORY) printf("\nOut of memory\n"); if (info == PARU_INVALID) printf("\nInvalid!\n"); if (info == PARU_SINGULAR) printf("\nSingular!\n"); } else { printf("ParU: factorization was successfull.\n"); } //~~~~~~~~~~~~~~~~~~~ Computing Ax = b ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ if (info == PARU_SUCCESS) { int64_t m = Sym->m; double *b = (double *)malloc(m * sizeof(double)); double *xx = (double *)malloc(m * sizeof(double)); for (int64_t i = 0; i < m; ++i) b[i] = i + 1; info = ParU_C_Solve_Axb(Sym, Num, b, xx, &Control); double resid, anorm, xnorm; info = ParU_C_Residual_bAx(A, xx, b, m, &resid, &anorm, &Control); printf("Residual is |%.2e|, anorm is %.2e, xnorm is %.2e and rcond is" " %.2e.\n", resid, anorm, xnorm, Num->rcond); free(b); free(xx); } //~~~~~~~~~~~~~~~~~~~End computation~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ParU_C_Freenum(&Num, &Control); ParU_C_Freesym(&Sym, &Control); cholmod_l_free_sparse(&A, cc); cholmod_l_finish(cc); } \end{minted} %------------------------------------------------------------------------------- \subsection{C/C++ Syntax} %------------------------------------------------------------------------------- \verb'ParU_Ret' is the output structure of all ParU routines in C and C++. The user must check the output before continuing and computing further the result of prior routine. You can see the user callable routines in \verb'ParU/Include/ParU.hpp'. The following is a list of user-callable C++ functions and what they can do: \begin{enumerate} \item \verb'ParU_Version': return the version of the ParU package you are using. \item \verb'ParU_Analyze': Symbolic analysis is done in this routine. UMFPACK is called here and after that, some more specialized symbolic computation is done for ParU. \verb'ParU_Analyze' is called once and can be used for different \verb'ParU_Factorize' calls for the matrices that have the same pattern. \item \verb'ParU_Factorize': Numeric factorization is done in this routine. Scaling and making $Sx$ (scaled and staircase structure) matrix, computing factors, and permutations are here. \verb'ParU_Symbolic' structure which is computed in \verb'ParU_Analyze' is an input in this routine. \item \verb'ParU_Solve': Using symbolic analysis and factorization phase output to solve $Ax=b$. In all the solve routines Num structure must come with the same Sym struct that comes from \verb'ParU_Factorize'. This routine is overloaded and can solve different systems. It has versions that keep a copy of x or overwrite it. Also, it can solve multiple right-hand side problems. % FIXME: also for C \item \verb'ParU_Lsolve' \item \verb'ParU_Usolve' \item \verb'ParU_perm' FIXME name \item \verb'ParU_inv_perm' FIXME name \item \verb'ParU_Residual': This function computes $|Ax-b|$ one norm of matrix $A$ and one norm of $x$ (or $X$ for multiple right handside). \item \verb'ParU_Freenum': frees the numerical part of factorization. \item \verb'ParU_Freesym': frees the symbolic part of factorization. \end{enumerate} The C interface is quite similar to the C++ interface, and you can see the C user callable routines in \verb'ParU/Include/ParU_C.h'. The following is a list of user-callable C functions and what they can do: \begin{enumerate} \item \verb'ParU_C_Version': return the version of the ParU package you are using. \item \verb'ParU_C_Init_Control': Initialize C Control object before using it. \item \verb'ParU_C_Analyze': Symbolic analysis is done in this routine. UMFPACK is called here; after that, some more specialized symbolic computation is done for ParU. \verb'ParU_C_Analyze' is called once and can be used for different \verb'ParU_C_Factorize' calls for the matrices that have the same pattern. \item \verb'ParU_C_Factorize': Numeric factorization is done in this routine. Scaling and making $Sx$ (scaled and staircase structure) matrix, computing factors, and permutations are here. \verb'ParU_C_Symbolic' structure which is computed in \verb'ParU_C_Analyze' is an input in this routine. \item \verb'ParU_C_Solve_Axx', \verb'ParU_C_Solve_Axb', \verb'ParU_C_Solve_AXX' and \verb'ParU_C_Solve_AXB', Using symbolic analysis and factorization phase output to solve $Ax=b$. In all the solve routines Num structure must come with the same Sym struct that comes from \verb'ParU_C_Factorize'. \item \verb'ParU_C_Residual_bAx': This function computes $|Ax-b|$ one norm of matrix $A$ and one norm of $x$ \item \verb'ParU_C_Residual_BAX': This function computes $|AX-B|$ one norm of matrix $A$ and one norme of $X$ \item \verb'ParU_C_Freenum': frees the numerical part of factorization. \item \verb'ParU_C_Freesym': frees the symbolic part of factorization. \end{enumerate} %------------------------------------------------------------------------------- \subsection{Details of the C/C++ Syntax} %------------------------------------------------------------------------------- For further details on how to use the C/C++ syntax, please refer to the definitions and descriptions in the following files: \begin{enumerate} \item \verb'SuiteSparse/ParU/Include/ParU.hpp' describes each C++ function. Only \verb'double' and square matrices are supported. \item \verb'SuiteSparse/ParU/Include/ParU_C.h' describes the C-callable functions. \end{enumerate} There are C/C++ options to control ParU, which is an input argument to several routines. When you make C++ \verb'ParU_Control' object, it is initialized with default values. The user can change the values. When using C \verb'ParU_C_Control', you have to fully initialize it or call \verb'ParU_C_Init_Control' before using it. Here is the list of control options (both in C and C++): \vspace{0.1in} {\footnotesize \begin{tabular}{|lll|} \hline \verb'ParU_Control' & default value & explanation \\ \hline\hline \verb'mem_chunk' & $1024*1024$ & chunk size for memset and memcpy\\ \verb'paru_max_threads' & $0$ & initialized with \verb'omp_max_threads' \\ \hline \verb'umfpack_ordering' & \verb'UMFPACK_ORDERING_AMD' & default UMFPACK ordering\\ \verb'umfpack_strategy' & \verb'UMFPACK_STRATEGY_AUTO'& default UMFPACK strategy\\ \verb'umfpack_default_singleton' & $1$ & default filter singletons if true\\ \verb'relaxed_amalgamation_threshold' & 32 & threshold for relaxed amalgamation \\ \hline \verb'scale' & 1 & if 1 matrix will be scaled using \verb'max_row'\\ \verb'panel_width' & 32 & width of panel for dense factorizaiton\\ \verb'paru_strategy' & \verb'PARU_STRATEGY_AUTO' & default strategy for ParU\\ \verb'piv_toler' & $0.1$ & tolerance for accepting sparse pivots\\ \verb'diag_toler' & $0.001$ & tolerance for accepting symmetric pivots\\ \verb'trivial' & $4$ & Do not call BLAS for smaller dgemms\\ \verb'worthwhile_dgemm' & $512$ & dgemms bigger than worthwhile are tasked\\ \verb'worthwhile_trsm' & $4096$ & trsm bigger than worthwhile are tasked\\ \hline \end{tabular} } \vspace{0.1in} The first row of the options is either used in symbolic or numerical analysis. The second row of the options is used in the symbolic analysis. In the symbolic analysis phase, only the matrix pattern is probed. The third row of control options shows those that have an impact on numerical analysis. \verb'paru_max_threads' is initalized by \verb'omp_max_threads' if the user do not provide a smaller number. If \verb'paru_strategy' is set to \verb'PARU_STRATEGY_AUTO' ParU uses the same strategy as UMFPACK. However, the user can ask UMFPACK for an unsymmetric strategy but use a symmetric strategy for ParU. Usually, UMFPACK chooses a good ordering; however, there might be cases where users prefer unsymmetric ordering on UMFPACK but symmetric computation on ParU. %------------------------------------------------------------------------------- \section{Requirements and Availability} \label{summary} %------------------------------------------------------------------------------- ParU requires several Collected Algorithms of the ACM: CHOLMOD \cite{ChenDavisHagerRajamanickam09,DavisHager09} (version 1.7 or later), AMD \cite{AmestoyDavisDuff96,AmestoyDavisDuff03}, COLAMD \cite{DavisGilbertLarimoreNg00_algo,DavisGilbertLarimoreNg00} and UMFPACK \cite{10.1145/992200.992206} for its ordering/analysis phase and for its basic sparse matrix data structure, and the BLAS \cite{dddh:90} for dense matrix computations on its frontal matrices. An efficient implementation of the BLAS is strongly recommended, either vendor-provided (such as the Intel MKL, the AMD ACML, or the Sun Performance Library) or other high-performance BLAS such as those of \cite{GotoVanDeGeijn08}. Note that while ParU uses nested parallelism heavily the right options for the BLAS library must be chosen to get a good performance. The use of OpenMP tasking is optional, but without it, only parallelism within the BLAS can be exploited (if available). See ParU/Doc/LICENSE for the license. Alternative licenses are also available; contact the authors for details. %------------------------------------------------------------------------------- % References %------------------------------------------------------------------------------- \bibliographystyle{plain} \bibliography{paru_user_guide} \end{document}