commit c2af113c7ba6d0dcc128ba36ec6e140d89180cf3 (HEAD -> master)
Author: Field G. Van Zee
Date:   Mon May 6 13:37:47 2024 -0500

    Version file update (1.0)

commit 5ab286f61525f8ead35ecc258305a5ccd4ee096b (origin/master, origin/HEAD)
Author: Field G. Van Zee
Date:   Mon May 6 13:14:52 2024 -0500

    Added a script to help create new rc branches.

    Details:
    - Added a new script, build/start-new-rc.sh, which:
      1. Updates the version file with a new version string.
      2. Commits (locally) the version string update.
      3. Updates the CHANGELOG file with the output of 'git log'.
      4. Commits (locally) the CHANGELOG file update.
      5. Creates a new branch whose name is equal to "-rc0" where is the new version string.
      6. Reminds the user to execute some final steps if everything looks good.
      This new script will help in the future when it's time to start a new release candidate branch/lineage off of 'master'. Note that this script is based on build/bump-version.sh (which itself may change in the future due to changes in the way versions/releases will be handled going forward).

commit cad51491e8a0b306015a5a02881dc2a9b60dd8d9
Author: Field G. Van Zee
Date:   Tue Apr 30 16:46:54 2024 -0500

    Use "-i auto" by default in test/3 drivers.

    Details:
    - Request default induced method behavior of BLIS via "-i auto" when running the standalone performance drivers in test/3 via the runme.sh script present in that directory. (Previously, the runme.sh script would use "-i native" by default.) This change was originally intended for fd1a7e3.

commit fd1a7e3ca9547718aa61c806848099705216182b
Author: Field G. Van Zee
Date:   Thu Apr 25 15:00:59 2024 -0500

    Allow test/3 drivers to use default ind_t method. (#804)

    Details:
    - Previously, the standalone performance drivers in test/3 were written under the assumption that the user would want to explicitly test either native execution *or* 1m.
      But because the accompanying runme.sh script defaults to passing "native" in for the -i command line option (which explicitly sets the induced method type), running the script without modification causes the test drivers to use slow reference microkernels on systems where native complex-domain microkernels are not registered -- which will yield poor performance for complex-domain level-3 operations. Furthermore, even if a user was aware of this, the test drivers did not support any single value for the -i option that would test BLIS using the library's default behavior -- that is, using 1m on systems where it is needed and native execution on systems that have native microkernels implemented and registered.
    - This commit addresses the aforementioned issue by supporting a new value for the -i option: "auto". The "auto" value causes the driver to avoid explicitly setting the induced method altogether, leaving BLIS's default behavior in place. This "auto" option is also now the default setting within the runme.sh script. Thanks to Leick Robinson for finding and reporting this issue.
    - Also added support for "nat" as a shorthand for "native", which the help text already (erroneously) claimed was supported.

commit a49238e6141c96a41aa3c2a4adb0b0663d0b4968
Author: Devin Matthews
Date:   Wed Apr 24 15:07:18 2024 -0500

    Refactor the control tree and other infrastructure (#710)

    Details:

    1. A "plugin" architecture.
       - Users are now able to register new kernels, kernel preferences, and blocksizes at runtime, directly from user applications.
       - Plugins can be created, configured, and built using only an installed version of BLIS -- no source or source code changes required.
       - Plugins support both reference and optimized kernels, as well as custom configuration-to-kernel-set mappings.
       - Building plugins (including reference and relevant optimized kernels) for enabled architectures or architecture families is automated, as is linking into the final library.
       - The configure script is now installed as 'configure-plugin'. In this mode, it can be used to initialize a plugin from a template including optional example code, and prepare a build system for compiling the plugin into a shared or static library.
       - Additional configuration files, templates, and build system components are also installed to '%prefix%/share/blis'.
       - The cntx_t struct now has extensible data structures for holding kernels, preferences, and blocksizes. These are based on a "stack" structure which contains a list of fixed-size data blocks. Adding a new entry (which may require allocating a new block or reallocating the block pointer array) requires locking, but looking up entries is lock-free and takes O(1) time.
       - Kernels can depend on either 1 or 2 type parameters (e.g. mixed-precision packing requires 2). The func2_t struct supports the latter, but can be implicitly cast to func_t if only "diagonal" entries are needed. The number of type parameters can be inferred from the kernel ID for type safety.
       - Functions have been added to register new kernels, preferences, and blocksizes with the global kernel structure (gks). This creates corresponding entries in each allocated context and returns the next available ID. Plugins use this API to register user kernels, although the user is responsible for tracking the returned IDs for later lookup. Setting newly-registered reference kernels, as well as overriding these with optimized kernels is done in exactly the same manner as in bli_cntx_init_ref() and bli_cntx_init_().

    2. Restructuring of the control and thread control trees.
       - The control tree has been substantially restructured to support more flexibility.
       - The "default" control trees for gemm (also used for hemm/symm/herk/her2k/syrk/syr2k/trmm/trmm3) and trsm are now represented as a single structure containing all necessary control tree nodes and parameters.
       - An API has been added to modify the default gemm/trsm control trees.
       - This same API is used by the framework and packm/gemm/trsm variants to access specific control tree nodes.
       - Users can alternatively create a custom control tree from scratch.
       - The blocksizes are now encoded directly in the control tree, rather than via loop IDs. The logic for adjusting blocksizes for certain operations has been moved to the control tree initialization.
       - Type information is encoded in the control tree to drive proper selection of packing and computational kernels provided by the user.
       - The packing microkernel now receives an opaque "params" struct which is user-definable and can be used to pass additional information through the call stack.
       - The auxinfo_t struct has been updated with a .params field for opaque user data as well as the global offsets of the current microtile.
       - The packm and gemm variants can be overridden by the user, and also receive an opaque params struct via the associated control tree node.
       - The structure-aware packing kernel bli_packm_struc_cxk() is no longer hard-coded to be called from the default packm variant, but can be overridden by the user. It also supports mixed-precision/mixed-domain natively now.
       - The thread control tree (thrinfo_t) is now created entirely up-front by inspecting the control tree. The required number of threads at each level is encoded in the control tree via loop IDs (actually a bitfield of loop IDs), although the ordering and number of such IDs is arbitrary. The logic for adjusting the number of threads at each level based on operation type (e.g. trmm) is now in the control tree initialization and expressed by combining loop IDs from multiple levels into a single level.
       - The mem_t object containing the pack buffer pointer has been moved from the control tree to the thread control tree.
         NOTE: **The control tree is now strictly const throughout the operation, and only a single copy is shared by all threads.**
       - The thread control tree node for packing has been changed so that there is no longer a "fake" node indicating a team of single threads. Instead, the number of threads and thread IDs in the "normal" thread control tree node are used. This change has also been made to the gemmsup thread control tree and packing variants, as well as to the gemmlike sandbox.
       - Parameters controlling packing (e.g. inversion of the diagonal, direction, schema) are not stored directly in the control tree but in the opaque params struct. The packing control tree node and its default params struct are stored together in the "combined" gemm/trsm control tree structure and initialized as a unit. Users can update these parameters individually or substitute a custom packm variant and params struct.
       - The "target" and "execution" datatypes have been removed from the obj_t struct and replaced by type information in the control tree.
       - The "sub-node" and "sub-prenode" of a control tree node have been replaced by an arbitrary number of sub-nodes accessed by index. There is a hard cap on the number of sub-nodes (currently 2). Sub-nodes are added during control tree initialization, *after* creation/initialization of the parent node through an updated API.
       - The level-3 thread decorator has been significantly simplified and directly calls bli_l3_int(). The control tree is created externally, and it is no longer necessary to alias matrices or set object pack schemas. Also, the rntm_t passed in may be NULL. Finally, family and scalar information is no longer needed here.
       - bli_l3_int() is now a simple inline function which extracts the next control tree node and variant and calls it.
       - bli_*_front() have been removed and inlined into the expert object API with significant simplification.
       - 1m (or other induced method) no longer uses an alternative cntx_t.
       - The .pack_fn/.ker_fn pointers and associated params fields on the obj_t were removed in favor of the present solution.

    3. Overhaul of variable substitution in configure script.
       - The configure script has been somewhat re-written to use a centralized mechanism for substituting variables into build system and other configuration files.
       - All substitution variables go through the same pathway now, which necessitated some variable naming changes for variables which were named the same in e.g. Makefile and bli_config.h but with different definitions.
       - CC and CXX variables can now contain spaces, e.g. 'g++ -std=c++17'. This provides better support for integration with build tooling such as autotools.

    4. Overhaul of packing kernels.
       - Previously there were two packing kernels referenced in the cntx_t structure for MRxk and NRxk shaped micropanels, respectively. These have now been merged into one kernel which is responsible for packing any dense rectangular portion of either A or B.
       - The packing kernel now receives information about the register blocksize (cdim_max) and duplication factor (the "broadcast-B" format, although this can also apply to the A matrix).
       - The structure-aware packing kernel (bli_packm_struc_cxk(), which is now user-overridable) also receives global offsets of the current micropanel within A or B.
       - Explicit kernels for packing the diagonal blocks of triangular/symmetric/Hermitian matrices have been added to the cntx_t. This means that the bli_packm_struc_cxk() "kernel" no longer needs to directly touch data (except to zero out some regions).
       - bli_packm_struc_cxk() has also been updated to work only in terms of fundamental elements (i.e., real datatypes) when computing offsets and when zeroing data, which greatly simplifies mixed-domain/1m packing.
       - bli_packm_scalar() has been updated to better support complex scalars in mixed-domain operations.
       - Pack schemas for PACKED_ROW_PANELS* and PACKED_COL_PANELS* have been merged into simply PACKED_PANELS*. This reflects the merging of the packing kernels into a single generic kernel. There were only a very few places which needed the row/column information and this is now supplied by alternative means.
       - Packing variants always behave "as if" the A matrix were being packed (i.e. the code assumes packing column-stored row panels). Packing of B is handled by applying an implicit or explicit transpose before packing. This change also applies to gemmsup.

    5. Improved MD/MP support.
       - All level-3 operations (except trsm) now support full mixed-domain/mixed-precision operation.
       - Explicit 1m packing kernels have been added in the cntx_t.
       - An explicit 1m microkernel wrapper has been added to the cntx_t.
       - An extra packing kernel for the "ro" format has been added, along with the pack_t enumeration value. This supports the packing for real*complex -> real, including potential scaling by a complex alpha, support for structured matrices, etc.
       - Extra microkernel wrappers for mixed-domain operations have been added to support the 'ccr' (and by extension, 'crc'), 'rcc', and 'crr' cases. Notably this includes full support for general stride storage and complex alpha/beta.
       - Packing kernels and gemm microkernels are now "templated" based on two type parameters rather than one. For packing this allows direct optimization of mixed-precision kernels, and for gemm microkernels this allows direct optimization of mixed-precision without writing to a temporary buffer. Reference packing kernels are directly instantiated for all mixes of precisions, while by default mixed-precision gemm microkernels are supported via a microkernel wrapper. The "old" way of specifying optimized kernels using a single type parameter works unchanged.
       - alpha and beta are typecast appropriately to the computational or output datatype, respectively, and **always** to the complex domain.
         Scalar typecasting has also been added to gemmsup for safety.
       - The gemm macrokernel doesn't have to do any typecasting anymore, as a microkernel wrapper or optimized mixed-precision/mixed-domain kernel now handles this.
       - 1m and mixed-domain operations now always use a microkernel wrapper, rather than adjusting parameters in the gemm macrokernel.
       - The gemmt macrokernel **does** still have to handle explicit write-back of microtiles which intersect the diagonal, although typecasting has already been performed.
       - The gemmt_x_ker_var2(), trmm_xx_ker_var2(), and trsm_xx_ker_var2() functions have been removed. The appropriate macrokernel pointer is selected during control tree initialization.
       - Real domain MR/NR are checked for even-ness based on the gemm microkernel's row preference in order to guarantee proper 1m and mixed-domain operation.
       - Full range of mixed-domain/mixed-precision functionality tested in the testsuite ('input.*.mixed').

    6. Other changes:
       - The build system has been updated to support C++ source files throughout the framework. While the intent is not to add such files to BLIS itself, this supports plugins written in C++.
       - Many instances of configuration-specific code have been simplified by introducing an INSERT_GENTCONF macro which instantiates a block of code for each enabled sub-configuration. The ConfigurationHowTo.md document has been updated accordingly.
       - PASTEMAC?/PASTECH?/PASTEF77? have been removed in favor of variadic macros which accept any number of arguments (up to a reasonable limit).
       - The INSERT_GENTFUNC* macros have been updated to clean up mixed-precision and mixed-domain instantiations.
       - bli_align_dim_to_mult() has been updated to support rounding either up or down based on a flag.
       - Checking for empty matrices and other early exits (level-3 only) has been consolidated into a single utility function.
       - The auxinfo_t struct is always passed as const.
       - The new function bli_obj_alias_submatrix() aliases a matrix while also resetting the root to NULL, offsets to zero (while adjusting the buffer), and applying any implicit transpose.
       - Level-3 pruning functions now only check matrix structure to see what to do, not the operation family.
       - gemmsup packing has been updated to use the "normal" pack buffer allocation routines.
       - Remove duplicate checks for early return from gemmsup handler.
       - bli_determine_blocksize() has been significantly simplified.
       - Partitioning packed panels is no longer allowed.
       - Added bli_xxsame macros.
       - Automated the calculation of info bit shifts and masks based on predefined bit sizes for various flags. This greatly simplifies reordering, adding, or removing flags from the info/info2 bitfields.
       - Moved more BLIS_NUM_* macros into the corresponding enums as the last entry so that the value is automatically computed.
       - Better const-correctness in some level0 scalar macros.
       - Better mixed-precision support in some level0 scalar macros.
       - Added a bli_axpbys_mxn() macro.
       - bli_thread_range_sub() takes explicit thread ID and number of threads rather than a thrinfo_t node.
       - "De-templated" BLIS gemmlike sandbox (specifically, bls_gemm_bp_var1() and bls_packm_var1()).
       - Combined bls_l3_packm_[ab]() into one function with thin wrappers.
       - Deleted bls_packm_var[23]().
       - Add a "termination tag" to the testsuite output so that 'make check-blis' can accurately check for successful completion.
       - Add a new function to centrally compute FLOPs for level-3 operations in the testsuite.

commit a316d2c6c33fc1f8f7c58c4210ab203f48349041
Author: Devin Matthews
Date:   Thu Mar 28 12:52:00 2024 -0500

    Fix incorrect commenting of `BLIS_RNTM_INITIALIZER` and `BLIS_OBJECT_INITIALIZER`.

commit 664cc6bc3ea610b4ecea63d78c6024c48f045635
Author: Devin Matthews
Date:   Tue Mar 26 16:25:17 2024 -0500

    Update BLIS_*_INITIALIZER macros for C++ compatibility. (#802)

    Details:
    - Remove designated initializer syntax.
      This isn't officially supported until C++20.
    - Arrange initializers in the order in which they are defined in the struct. Even with standard or extension support for designated initializers, initializing non-static members out-of-order is an error in C++.
    - Remove the conditional code which uses '-1' as the default value of the 'pack_buf' member of 'mem_t' in C, but 'BLIS_BUFFER_FOR_GEN_USE' in C++. Simply use the latter as a common-sense default.

commit 1a8c8180b32cf5988bf9eb5d2f0f8111a729993a
Author: John <50754967+j-bm@users.noreply.github.com>
Date:   Thu Feb 15 12:35:10 2024 -0400

    Add cpu part codes for various manufacturers and use in the code (#794)

    * Add cpu_id symbols for arm v8.
    * Add symbols for arm v7.
    * Always assume firestorm on Apple aarch64.
    * Fixes incorrect usage of model vs. part in some places.
    * Fixes #793

    ---------

    Co-authored-by: J

commit c382d8bdccc07e22a341fe04960f0cbf4eec083b
Author: Igor Zhuravlov
Date:   Sun Jan 14 04:03:31 2024 +0000

    Fix errors and typos in docs/BLIS*API.md (#791)

    Details:
    - Fixed errors and unified formatting in docs/BLIS*API.md docs.

commit a72e4569f2a03cc3578c019bf7ce25491a44137d
Author: Field G. Van Zee
Date:   Wed Dec 6 18:21:47 2023 -0600

    Include bli_config.h before bli_system.h in cblas.h. (#789)

    Details:
    - Previously, in cblas.h, bli_config.h was being #included *after* bli_system.h, which meant that the BLIS_ENABLE_SYSTEM macro was never defined in time for proper OS detection. This bug only affected cblas.h -- blis.h had been correctly #including bli_config.h before bli_system.h since fb93d24. Thanks to Edward Smyth for reporting this bug and suggesting the fix.

commit 1236ddab455ef3a6293ab394ff06b3a19c2913d9
Author: Field G. Van Zee
Date:   Sun Dec 3 16:42:34 2023 -0600

    Fixed random segfault in test/3 drivers. (#788)

    Details:
    - Fixed a segfault in the non-gemm test drivers in test/3 that was the result of sometimes leaving either .n_str or .k_str fields of the params_t struct uninitialized, depending on the operation in question. For example, in test_hemm.c, init_def_params() would only initialize the .m_str and .n_str fields, but not the .k_str field. Even though hemm doesn't use a 'k' dimension, the proc_params() function (called via parse_cl_params()) universally attempts to convert all three into integers via sscanf(), which was understandably failing when one of those strings was a NULL pointer. I'm not sure how this code ever worked to begin with. Special thanks to Leick Robinson for finding and reporting this bug.

commit 141a6c9a8e7557d9c7d28aecedec9dc5377dba13
Author: Field G. Van Zee
Date:   Tue Nov 21 12:26:43 2023 -0600

    Install helper headers to INCDIR prefix. (#787)

    Details:
    - Install one-line headers to INCDIR whose entire purpose is to #include the actual headers within the local 'blis' header directory so that applications can #include "blis.h" instead of #include (and/or "cblas.h" instead of if CBLAS is enabled) when headers are installed to global paths. (Note that INCDIR is the installation prefix for headers as specified by '--includedir=INCDIR', which defaults to 'PREFIX/include' if not specified.) Not sure how this problem went unreported for so long, since presumably any user trying to #include "blis.h" from a global installation would have encountered a compiler error.
    - The one-line blis.h and cblas.h headers now reside in the 'build' directory, ready to install as is.
    - Thanks to Jed Brown for reporting this via Issue #786, and to Devin Matthews and Mo Zhou for their engagement.
    - Harmonized the rule in the top-level Makefile for installing blis.pc into SHAREDIR/pkgconfig with conventions for others vis-a-vis verbosity/non-verbosity.
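
The segfault described in commit 1236dda above can be illustrated with a minimal sketch. It is not the test/3 driver source: the struct layout, the default string "1000", and the helper parse_dim() are assumptions made for illustration, and only the names params_t, init_def_params(), proc_params(), and the .m_str/.n_str/.k_str fields come from the commit message. The point is that every dimension string gets a default, and that a NULL string is rejected before it reaches sscanf().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the drivers' params_t; only the three
   dimension strings named in the commit message are modeled. */
typedef struct
{
    const char* m_str;
    const char* n_str;
    const char* k_str;
} params_t;

/* Give every dimension string a default, including dimensions that a
   given operation (e.g. hemm and its unused 'k') never reads. The value
   "1000" is an arbitrary placeholder. */
static void init_def_params( params_t* p )
{
    assert( p != NULL );
    p->m_str = "1000";
    p->n_str = "1000";
    p->k_str = "1000";
}

/* Hypothetical helper: convert one dimension string to an integer.
   Returns 0 on success and -1 on failure. Passing NULL to sscanf() is
   undefined behavior, so it is checked explicitly. */
static int parse_dim( const char* str, long* out )
{
    if ( str == NULL ) return -1;
    return ( sscanf( str, "%ld", out ) == 1 ) ? 0 : -1;
}

/* Convert all three strings unconditionally, as the commit message says
   proc_params() does, but fail cleanly rather than crash. */
static int proc_params( const params_t* p, long* m, long* n, long* k )
{
    if ( parse_dim( p->m_str, m ) != 0 ) return -1;
    if ( parse_dim( p->n_str, n ) != 0 ) return -1;
    if ( parse_dim( p->k_str, k ) != 0 ) return -1;
    return 0;
}
```

With defaults in place, proc_params() succeeds for every operation. A struct whose .k_str was left NULL now produces an error code where the unguarded conversion would have dereferenced a NULL pointer.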
commit 2d9439298b336aa6d0ee000a5285a3adb4e6d462
Author: Devin Matthews
Date:   Tue Nov 21 12:18:07 2023 -0600

    Allow users to define [sd]complex using std::complex (#784)

    Details:
    - In C++ applications, it makes a lot of sense to interface to BLIS using C++'s standard complex number library, which uses a template class std::complex. Obviously BLIS doesn't know anything about this and defaults to a custom struct to represent complex numbers. This PR updates the bli_[cz]{real,imag}() functions to accept std::complex numbers when a C++ compiler is being used. Note that this has no effect on the compilation of the BLIS library (or testsuite), and only comes into play when including blis.h into a C++ project and forcing the use of std::complex for scomplex and dcomplex.
    - The application can explicitly request std::complex-based types via:
        #define BLIS_ENABLE_STD_COMPLEX
        #include
        // Call BLIS functions using std::complex here.
    - Fixed a bug in the definition of some scalar level-0 macros, since bli_creal()/bli_cimag() and bli_zreal()/bli_zimag() are no longer interchangeable.

commit f7ce54a252028483e4c6af619015eb22063d5541 (origin/1.0-rc0)
Author: Field G. Van Zee
Date:   Fri Nov 3 15:52:57 2023 -0500

    CREDITS file update.

commit 05388ddb66f8bf2d62009b162d64bf2d99226b83
Author: Aaron Hutchinson <113382047+Aaron-Hutchinson@users.noreply.github.com>
Date:   Fri Nov 3 13:30:31 2023 -0700

    Added 'sifive_x280' subconfig, kernel set. (#737)

    Details:
    - Added a new 'sifive_x280' subconfiguration for SiFive's x280 RISC-V instruction set architecture. The subconfig registers kernels from a correspondingly new kernel set, also named 'sifive_x280'.
    - Added the aforementioned kernel set, which includes intrinsics- and assembly-based implementations of most level-1v kernels along with level-1f kernels axpy2v and dotaxpyv, packm kernels, and level-3 gemm, gemmtrsm_l, and gemmtrsm_u microkernels (plus supporting files).
    - Registered the 'sifive_x280' subconfig as belonging to a singleton family by the same name.
    - Added an entry to '.travis.yml' to test the new subconfig via qemu.
    - Updates to 'travis/do_riscv.sh' script to support the 'sifive_x280' subconfig and to reflect updated tarball names.
    - Special thanks to Lee Killough, Devin Matthews, and Angelika Schwarz for their engagement on this commit.

commit 7a87e57b69d697a9b06231a5c0423c00fa375dc1 (origin/10.0-rc0)
Author: Srinivas Yadav <43375352+srinivasyadav18@users.noreply.github.com>
Date:   Sat Oct 14 02:05:41 2023 -0500

    Fixed HPX barrier synchronization (#783)

    Details:
    - Fixed hpx barrier synchronization. HPX was hanging on larger cores because blis was using non-hpx synchronization primitives. But when using hpx-runtime only hpx-synchronization primitives should be used. Hence, a C style wrapper hpx_barrier_t is introduced to perform hpx barrier operations.
    - Replaced hpx::for_loop with hpx::futures. Using hpx::for_loop with hpx::barrier on n_threads greater than actual hardware thread count causes synchronization issues that make hpx hang. This can be avoided by using hpx::futures, which are relatively very lightweight, robust and scalable.

commit 8fff1e31da1c87e46cacec112b0ac280ab47cd8b
Author: Field G. Van Zee
Date:   Thu Oct 12 15:51:41 2023 -0500

    Fixed bug in sup threshold registration. (#782)

    Details:
    - Fixed a bug that resulted in BLIS non-deterministically calling the gemmsup handler, irrespective of the thresholds that are registered via bli_cntx_set_blkszs().
    - Deep dive: In bli_cntx_init_ref.c, the default values for the gemmsup thresholds (BLIS_[MNK]T blocksizes) were being set to zero so that no operation ever matched the criteria for gemmsup (unless specific sup thresholds are registered). HOWEVER, these thresholds are set via bli_cntx_set_blkszs() which calls bli_blksz_copy_if_pos(), which was only copying the thresholds into the gks' cntx_t if the values were strictly positive.
      Thus, the zero values passed into bli_cntx_set_blkszs() were being ignored and those threshold slots within the gks were left uninitialized. The upshot of this is that the reference gemmsup handler was being called for gemm problems essentially at random (and as it turns out, very rarely the reference gemmsup implementation would encounter a divide-by-zero error).
    - The problem was fixed by changing bli_blksz_copy_if_pos() so that it copies values that are non-negative (values >= 0 instead of > 0). The function was also renamed to bli_blksz_copy_if_nonneg().
    - Also needed to standardize use of -1 as the sole value to embed into blksz_t structs as a signal to bli_cntx_set_blkszs() to *not* register a value for that slot (and instead let whatever existing values remain). This required updates to the bli_cntx_init_*() functions for bgq, cortexa9, knc, penryn, power7, and template subconfigs, as some of these codes were using 0 instead of -1.
    - Fixes #781. Thanks to Devin Matthews for identifying, diagnosing, and proposing a fix for this issue.

commit 1e264a42474b535431768ef925bbd518412d392e
Author: Abhishek Bagusetty <59661409+abagusetty@users.noreply.github.com>
Date:   Mon Oct 2 18:29:46 2023 -0500

    Update zen3 subconfig to support NVHPC compilers. (#779)

    Details:
    - Parse $(CC_VENDOR) values of "nvc" in 'zen3' make_defs.mk file.
    - Minor refactor to accommodate above edit.
    - CREDITS file update.

commit c2099ed2519dcac8ee421faf999b36e1c2260be7
Author: Field G. Van Zee
Date:   Mon Oct 2 14:56:48 2023 -0500

    Fixed brokenness when sba is disabled. (#777)

    Details:
    - Previously, disabling the sba via --disable-sba-pools resulted in a segfault due to a sanity-check-triggering abort(). The problem was that the sba, as currently used in the l3 thread decorators, did not yet (fully) support pools being disabled.
      The solution entailed creating a wrapper function, bli_sba_array_elem(), which either calls bli_apool_array_elem() (when sba pools are enabled at configure time) or returns a NULL sba_pool pointer (when sba pools are disabled), and calling bli_sba_array_elem() in place of bli_apool_array_elem(). Note that the NULL pointer returned by bli_sba_array_elem() when the sba pools are disabled does no harm since in that situation the pointer goes unreferenced when acquiring and releasing small blocks. Thanks to John Mather for reporting this bug.
    - Guarded the bodies of bli_sba_init() and bli_sba_finalize() with #ifdef BLIS_ENABLE_SBA_POOLS. I don't think this was actually necessary to fix the aforementioned bug, but it seems like good practice.
    - Moved the code in bli_l3_thrinfo_create() that checked that the array* pointer is non-NULL before calling bli_sba_array_elem() (previously bli_apool_array_elem()) into the definition of bli_sba_array_elem().
    - Renamed various instances of 'pool' variables and function parameters to 'sba_pool' to emphasize what kind of pool it represents.
    - Whitespace changes.

commit 37ca4fd168525a71937d16aaf6a13c0de5b4daef
Author: Field G. Van Zee
Date:   Thu Sep 28 16:37:57 2023 -0500

    Implemented [cz]symv_(), [cz]syr_(), [cz]rot_(). (#778)

    Details:
    - Expanded existing BLAS compatibility APIs to provide interfaces to [cz]symv_(), [cz]syr_(). This was easy since those operations were already implemented natively in BLIS; the APIs were previously omitted only because they were not formally part of the BLAS.
    - Implemented [cz]rot_() by feeding code from LAPACK 3.11 through f2c.
    - Thanks to James Foster for pointing out that LAPACK contains these additional symbols, which prompted these additions, as well as for testing the [cz]rot_() functions from Julia's test infrastructure.
    - CREDITS file update.

commit 6f412204004666abac266409a203cb635efbabf3
Author: Field G. Van Zee
Date:   Tue Sep 26 18:00:54 2023 -0500

    Added 'altra', 'altramax' subconfigs. (#775)

    Details:
    - Forward-ported 'altra' and 'altramax' subconfigurations from the older 'stable' branch lineage [1]. These subconfigs primarily target the Ampere Altra and AltraMax (ARM) processors. They also contain "QuickStart" directories with information and scripts to help use BLIS on these microarchitectures. Thanks to Jeff Diamond and Leick Robinson for developing these subconfigs and resources.
    - Updated kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c according to changes in the 'stable' lineage, mostly related to re-enabling of assembly code branches that target general stride IO.

    [1] Note that the 'stable' branch is being used to make sure that more recent commits do not introduce unreasonable performance regressions. As such, the name should be interpreted as shorthand for "performance stable," not "API stable."

commit a4a63295b96ed5b32f4df6477d24db07bf431202
Author: Srinivas Yadav <43375352+srinivasyadav18@users.noreply.github.com>
Date:   Tue Sep 26 17:58:38 2023 -0500

    Fixes to HPX runtime code path. (#773)

    Details:
    - Fixed hpx::for_each invocation and replaced it with hpx::for_loop. The HPX runtime was initialized using hpx::start, but the hpx::for_each function was being called on a non-hpx runtime (i.e., the standard BLIS runtime - single main thread). To run hpx::for_each on HPX runtime correctly, the code now uses hpx::run_as_hpx_thread(func, args...).
    - Replaced hpx::for_each with hpx::for_loop, which eliminates use of hpx::util::counting_iterator.
    - Employ hpx::execution::chunk_size(1) to make sure that a thread resides on a particular core.
    - Replaced hpx::apply() with updated version hpx::post().
    - Initialize tdata->id to 0 in libblis.c, as it is the main thread and is needed for writing results to output file.
    - By default, if not specified, the HPX runtime uses all N threads/cores available in the system. But, if we want to only specify n_threads out of N threads, we use hpx::execution::experimental::num_cores(n_threads).
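
The threshold-registration bug fixed in commit 8fff1e3 above comes down to the difference between a "> 0" and a ">= 0" test. The following is a minimal sketch of that difference only. The type blksz_sketch_t and both copy functions are hypothetical simplifications written for this note, not the BLIS definitions of blksz_t, bli_blksz_copy_if_pos(), or bli_blksz_copy_if_nonneg(). The sketch assumes one value per datatype and uses -1 as the "leave this slot alone" sentinel, as the commit message describes.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TYPES     4      /* one slot per datatype: s, d, c, z */
#define KEEP_EXISTING (-1)   /* sentinel: do not register a value */

/* Hypothetical, simplified stand-in for a blocksize object. */
typedef struct
{
    long v[ NUM_TYPES ];
} blksz_sketch_t;

/* Old behavior: only strictly positive values are copied, so an
   intentional zero threshold is silently dropped and the destination
   keeps whatever it held before (possibly uninitialized data). */
static void copy_if_pos( const blksz_sketch_t* src, blksz_sketch_t* dst )
{
    assert( src != NULL && dst != NULL );
    for ( size_t i = 0; i < NUM_TYPES; ++i )
        if ( src->v[ i ] > 0 ) dst->v[ i ] = src->v[ i ];
}

/* Fixed behavior: zero is a legitimate value and is copied; only
   negative values (the -1 sentinel) leave the destination untouched. */
static void copy_if_nonneg( const blksz_sketch_t* src, blksz_sketch_t* dst )
{
    assert( src != NULL && dst != NULL );
    for ( size_t i = 0; i < NUM_TYPES; ++i )
        if ( src->v[ i ] >= 0 ) dst->v[ i ] = src->v[ i ];
}
```

With a source of { 0, 256, KEEP_EXISTING, 0 } and a destination holding stale values, copy_if_pos() overwrites only the second slot, while copy_if_nonneg() also stores the two zeros and still skips the sentinel slot.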
commit c6546c1131b1ddd45ef13f9f2b620ce2e955dbf8
Author: John Mather <54645798+jmather-sesi@users.noreply.github.com>
Date:   Wed Sep 20 13:41:07 2023 -0400

    Fixed broken link in Multithreading.md. (#774)

    Details:
    - Replaced 404'd link in docs/Multithreading.md with an archive from The Wayback Machine.
    - CREDITS file update.

commit 6dcf7666eff14348e82fbc2750be4b199321e1b9
Author: Field G. Van Zee
Date:   Sun Aug 27 14:18:57 2023 -0500

    Revamped bli_init() to use TLS where feasible. (#767)

    Details:
    - Revamped bli_init_apis() and bli_finalize_apis() to use separate bli_pthread_switch_t objects for each of the five sub-API init functions, with the objects for the 'ind' and 'rntm' sub-APIs being declared with BLIS_THREAD_LOCAL. This allows some APIs to be treated as thread-local and the rest as thread-shared. Thanks to Edward Smyth for requesting application thread-specific rntm_t structs, which inspired these changes.
    - Combined bli_thread_init_from_env() and bli_pack_init_from_env() into a new function, bli_rntm_init_rntm_from_env(), and placed the combined code in bli_rntm.c inside of a new bli_rntm_init() function. Then removed the (now empty) bli_pack_init() and _finalize() function defs.
    - Deprecated bli_rntm_init() for the purposes of initializing a rntm_t (temporarily preserving it as bli_rntm_clear() in a cpp-undefined code block) so that the function name could be used for the aforementioned bli_rntm_init() function.
    - Updated libblis_test_pobj_create() in test_libblis.c to use a static rntm_t initializer instead of the deprecated bli_rntm_init() function-based option.
    - Minor updates to docs/Multithreading.md, including removal of bli_rntm_init() in the example of how to initialize rntm_t structs.
    - Changed the return value of bli_gks_init(), bli_ind_init(), bli_memsys_init(), bli_thread_init(), and bli_rntm_init() (and their finalize() counterparts) from 'void' to 'int' so that those functions match the function type expected by bli_pthread_switch_on()/_off().
Those init/finalize functions now return 0 to indicate success, which is needed so that the switch actually changes state from off to on and vice versa. - Defined bli_thread_reset(), which copies the contents of the global_rntm_at_init() struct into the global_rntm struct (for the current application thread). - Guard calls to bli_pthread_mutex_lock()/_unlock() in - bli_pack_set_pack_a() and _pack_b() - bli_rntm_init_from_global() - bli_thread_set_ways() - bli_thread_set_num_threads() - bli_thread_set_thread_impl() - bli_thread_reset() - bli_l3_ind_oper_set_enable() with #ifdef BLIS_DISABLE_TLS (since TLS precludes the possibility of race conditions). - In frame/base/bli_rntm.c, declare global_rntm, global_rntm_at_init, and global_rntm_mutex as BLIS_THREAD_LOCAL so that separate application threads can change the number of ways of BLIS parallelism independently from one another. - Access global_rntm only via a new private (not exported) function, bli_global_rntm(). Defined a similar function for a rntm_t new to this commit, global_rntm_at_init, which preserves the state of the global rntm at initialization-time. - In frame/3/bli_l3_ind.c, added a guard to the declaration of the static variable oper_st_mutex with #ifdef BLIS_DISABLE_TLS so that the mutex is omitted altogether when TLS is enabled (which prevents the compiler from warning about an unused variable). - Removed redundant code from bli_thread.c: #ifdef BLIS_ENABLE_HPX #include "bli_thread_hpx.h" #endif since this code is already present in bli_thread.h. - Thanks to Minh Quan Ho for his review of and feedback on this commit. - Comment updates. commit fa6a9b24ae2ddbd5f30f657d46004843581c768c Author: Field G. Van Zee Date: Sat Aug 19 12:44:34 2023 -0500 Fixed error when using common.mk from testsuite. 
(#768) Details: - Commit 2db31e0 (#755) inserted logic into common.mk that attempts to preprocess build/detect/android/bionic.h to determine whether the __BIONIC__ macro is defined (in which case -lrt should not be included in LDFLAGS). However, the path to bionic.h was encoded without regard to DIST_PATH, and so utilizing common.mk anywhere that isn't the top-level directory (such as in the testsuite directory) resulted in a compiler error: gcc: error: build/detect/android/bionic.h: No such file or directory gcc: fatal error: no input files compilation terminated. This commit adds a $(DIST_PATH) prefix to the path to bionic.h so that it can be located from other applications' Makefiles that use BLIS's makefile fragments. commit 634e532c8dcce7383d96ba33276df65c656b2198 Author: Field G. Van Zee Date: Wed Aug 9 21:54:49 2023 -0500 Set thrcomm timpl_t id inside init functions. (#766) Details: - Previously, the timpl_t id being used when a thrcomm_t is being initialized was set within the bli_thrcomm_init() dispatch function after the timpl_t-specific bli_thrcomm_init_*() function returned. But it just occurred to me that each bli_thrcomm_init_*() function already intrinsically knows its own timpl_t value. This commit shifts the setting of the thrcomm_t.ti field into the corresponding bli_thrcomm_init_*() function for each timpl_t type (e.g. single, openmp, pthreads, hpx). - Removed long-deprecated code dating back nearly 10 years. - Whitespace changes. - Comment updates. commit 3cf17b4a91232709bc6a205b0e4d7ecc96579aa9 Author: Field G. Van Zee Date: Mon Aug 7 13:46:20 2023 -0500 Small fixes/improvements to docs/Multithreading.md. (#764) Details: - Added reminders that #include "blis.h" must be added to source files in order to access BLIS API function prototypes. Thanks to Barry Smith for suggesting this improvement. - Fixed pre-existing typos. - CREDITS file update. commit dbc79812c390f812c7bf030bfcf87e947a1443c4 Author: Field G.
Van Zee Date: Fri Jul 28 18:16:38 2023 -0500 CREDITS file update. Details: - Thanks to Igor Zhuravlov for PR #753 (commit 915daaa). commit 915daaa43cd189c86d93d72cd249714f126e9425 Author: Igor Zhuravlov Date: Thu Jul 27 20:33:59 2023 +0000 Fix typos in docs + example code comments. (#753) Details: - Fixed various typos in API documentation in docs/BLIS*API.md and comments in the source code examples within examples/?api/*.c. commit 2db31e057e7e9c97fc60021b5ae72a01a48d7588 Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Thu Jul 27 15:27:21 2023 -0500 Exclude -lrt on Android with Bionic libraries. (#755) Details: - Added build/detect/android/bionic.h header to test whether the __BIONIC__ cpp macro is defined. - In common.mk, only add -lrt to LDFLAGS when Bionic is not present. - CREDITS file update. commit 22ad8c1b752364784f320168b31995945ad84a59 Author: ct-clmsn Date: Thu Jul 27 16:23:29 2023 -0400 Small fixes to support hpx in the testsuite (#759) Details: - Minor changes to test_libblis.c to support hpx. commit c91b41d022e33da82b3b06c82be047a29873d9b6 Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Wed Jul 26 14:37:08 2023 -0500 Auto-detect the RISC-V ABI of the compiler and use -mabi= during RISC-V Builds (#750) Details: - Generate a build error if there is a 32/64-bit mismatch between the RISC-V ABI or architecture and the BLIS configuration selected. - Handle Q, Zicsr, ZiFencei, Zba, Zbb, Zbc, Zbs and Zfh extensions in the RISC-V architecture auto-detection. ZiFencei and Zicsr are not detectable with built-in RISC-V macros right now. - ZiFencei is not important for BLIS because it doesn't have Just-In-Time compilation or self-modifying code, and Zicsr is implied by the floating-point extensions, which are required for good performance in BLIS. - Move RISC-V autodetect header files to build/detect/riscv/. commit a0b04e3c007f1207e5678bf20c07752906742fb7 (origin/aocl-blas, aocl-blas) Author: Field G.
Van Zee Date: Mon Jun 26 17:59:21 2023 -0500 Rewrote regen-symbols.sh (gen-libblis-symbols.sh). (#751) Details: - Wrote an alternative to regen-symbols.sh, gen-libblis-symbols.sh, that generates a list of exported symbols from the monolithic blis.h file rather than peeking inside of the shared object via nm. (This new script lives in the 'build' directory and the older script has been retired to build/old.) Special thanks to Devin Matthews for authoring gen-libblis-symbols.sh. - Added a 'symbols' target to the top-level Makefile which will refresh build/libblis-symbols.def, with supporting changes to common.mk. - Updates to build/libblis-symbols.def using the new symbol-generating script. commit 6b894c30b9bb2c2518848d74e4c8d96844f77f24 Author: Field G. Van Zee Date: Mon Jun 12 17:22:44 2023 -0500 Rewrote/fixed broken tree barrier implementation. Details: - Rewrote the definition of bli_thrcomm_tree_barrier() so that it (a) actually worked again, and (b) used atomics instead of a basic C99 spin loop. (Note that the conventional barrier implementation is still enabled by default; the tree barrier must be toggled on manually within the configuration.) - Added an early return to the definition of bli_thrcomm_barrier() in the cases where comm == NULL or comm->n_threads == 1. - Reordered thread-related and thread-dependent header #include directives in blis.h so that the BLIS_TREE_BARRIER and BLIS_TREE_BARRIER_ARITY macros, which would be defined in the target configuration's bli_family_*.h file, would be #included prior to the inclusion of the thrcomm_t header that uses them. - Changed the type of barrier_t.count from 'int' to 'dim_t'. - Changed the type of barrier_t.signal from 'volatile int' to 'gint_t'. - Special thanks to Leick Robinson for contributing these changes. - Whitespace changes. commit d639554894b6252a86bd3164921bce6fbb9e3b5e Author: Field G. Van Zee Date: Wed Jun 7 16:11:14 2023 -0500 Pad thrcomm_t fields to avoid false sharing.
Details: - Inserted a cache line of padding between various fields of the thrcomm_t and, in the case of the (presently defunct) tree barrier, fields of the barrier_t. This additional padding ensures that these fields, which both serve different purposes when performing a thread barrier, are only accessed when needed (and not just due to their spatial locality with their cache line neighbors). - Added a new cpp macro constant, BLIS_CACHE_LINE_SIZE, to bli_config_macro_defs. This new constant defines the size of a cache line (in bytes) and defaults to 64. - Special thanks to Leick Robinson for discovering this false sharing issue and developing/submitting the patch. commit 89b7863fc9a88903917deedc6a5ad9fd17f83713 Author: Devin Matthews Date: Mon May 8 16:51:18 2023 -0500 Fix 1m enablement for herk/her2k/syrk/syr2k. (#743) Details: - Ever since 28b0982, herk, her2k, syrk, and syr2k have been implemented in terms of the gemmt expert API. And since the decision of which induced method to use (1m or native) is made *below* the level of the expert API, executing any of {herk,her2k,syrk,syr2k} results in BLIS checking the enablement status for gemmt. - This commit applies a band-aid of sorts to this issue by modifying bli_l3_ind_oper_get_enable() and bli_l3_ind_oper_set_enable() so that any attempts to query or modify the internal enablement status for herk, her2k, syrk, or syr2k instead does so for gemmt. - This solution isn't perfect since, in theory, the user could enable 1m for, say, herk but then disable it for syrk, and then be confused when herk runs via native execution. But we don't anticipate that users modify 1m enablement at the operation level, and so in practice this solution is likely fine for now. commit 138de3b3e88c5bf7d8718c45c88811771cf42db8 Author: Ajay Panyala Date: Sun May 7 13:01:38 2023 -0700 add nvhpc compiler support (#719) Add detection of the NVIDIA nvhpc compiler (`nvc`) in `configure`, and adjust some warning options in `config.mk`. 
Currently, no specific options for `nvc` have been added in the relevant configurations so it may not be usable without further tweaks. commit 0873c0f6ed03fea321d1631b3d1a385a306aa797 Author: Devin Matthews Date: Sun May 7 14:03:19 2023 -0500 Consolidate INSERT_ macro sets via variadic macros. (#744) Details: - Consolidated INSERT_GENTFUNC_* (and corresponding GENTPROT) macro sets using variadic macros (__VA_ARGS__), which means we no longer need a different INSERT_ macro for each possible number of arguments the macro might take. This change seems reasonable given that variadic macros are a standard C99 feature and widely supported. I took care not to use variadic macros where 0 variadic arguments are expected since that is a non-standard extension. - Added pre-typecast parentheses to arithmetic expressions in printf() statements in bli_thread_range_tlb.c. commit ef9d3e6675320a53e7cb477c16b01388e708b1da Author: h-vetinari Date: Sun May 7 04:59:35 2023 +1100 Added missing #include for Windows. (#747) Details: - This commit fixes issue #746, in which the _access() function (called from within blastest/f2c/open.c) is undeclared when compiling on Windows with clang 16. commit 6fd9aabb03d172a792a7eeb106c7d965cf038421 Author: Devin Matthews Date: Fri May 5 14:22:52 2023 -0500 Fix bug in detecting Fortran compiler vendor (#745) `FC` was used instead of `found_fc`. commit 8215b02f99aa77ecc7d813508c247565115319d7 Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Wed Apr 12 12:59:27 2023 -0500 Apply #738 to make_defs.mk of RISC-V subconfigs. (#740) Details: - PR #738 -- which moved -fPIC flag insertion responsibilities from common.mk to the subconfigs' individual make_defs.mk files -- was merged shortly before the introduction of new RISC-V subconfigs in #693. This commit brings those RISC-V subconfigs up to date with the new -fPIC conventions. 
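The variadic-macro consolidation described in commit 0873c0f above can be illustrated with a small sketch. This is not the BLIS source: DEMO_GENTFUNC, DEMO_INSERT_GENTFUNC, and the generated stwice()/dtwice() functions are hypothetical names chosen for illustration, and the real INSERT_GENTFUNC_* macros cover more datatypes. The point is only that a single `...`/__VA_ARGS__ macro can forward any number of extra arguments, provided at least one is always supplied (passing zero variadic arguments is the non-standard case the commit message says was avoided).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical type-generating macro: defines one function per
   (ctype, ch) pair. The operation name arrives via __VA_ARGS__. */
#define DEMO_GENTFUNC( ctype, ch, opname ) \
static ctype ch##opname( ctype x ) { return x + x; }

/* A single variadic "insert" macro stands in for a family of fixed-arity
   ones. At least one variadic argument is always passed, which keeps the
   usage within standard C99. */
#define DEMO_INSERT_GENTFUNC( ... ) \
DEMO_GENTFUNC( float,  s, __VA_ARGS__ ) \
DEMO_GENTFUNC( double, d, __VA_ARGS__ )

/* Expands to definitions of stwice() and dtwice(). */
DEMO_INSERT_GENTFUNC( twice )

/* Small self-check that exercises both generated functions. */
static int demo_check_generated( void )
{
    assert( stwice( 1.5f ) == 3.0f );
    assert( dtwice( 2.25 ) == 4.5 );
    printf( "%f %f\n", ( double )stwice( 1.5f ), dtwice( 2.25 ) );
    return 0;
}
```

If DEMO_GENTFUNC took additional parameters, the same DEMO_INSERT_GENTFUNC definition would forward them unchanged, which is what removes the need for one insert macro per argument count.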
commit 6b38c5ac07a2a27738674784e58aa699bf895447 Author: angsch <17718454+angsch@users.noreply.github.com> Date: Tue Apr 11 19:27:43 2023 +0200 Add RISC-V target (#693) Details: - There are four RISC-V base configurations: 'rv32i', 'rv32iv', 'rv64i', and 'rv64iv', namely the 32-bit and 64-bit implementations with and without the 'V' vector extension. Additional extensions such as 'M' (multiplication), 'A' (atomics), 'F' ('float' hardware support), 'D' ('double' hardware support), and 'C' (compressed-length instructions), are automatically used when available. If they are not available, then software equivalents (e.g., softfloat and -latomic) are used. - './configure auto' can be invoked on a RISC-V build platform, and will automatically detect RISC-V CPU extensions through the RISC-V C API: https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md - The assembly kernels assume the presence of the vector extension RVV 1.0. - It is possible to build 'rv[32,64]iv' for any value of VLEN. However, if VLEN < 128, the targets will fall back to the generic kernels and blocksizes. - The vector microkernels are vector-length agnostic and work with every VLEN >=128, but are expected to work best with smaller vector lengths, i.e., VLEN <= 512. - The assembly kernels cover column major storage (rs_c == 1). - The blocksizes aim at being a good generic choice for out-of-order cores. They are not tuned to a specific RISC-V HPC core. - The vector kernels have been tested using vlen={128,256,512}. - The single- and double-precision assembly code routines for 'sgemm' and 'dgemm', or for 'cgemm' and 'zgemm', are combined in their RISC-V vector assembly source code, and are differentiated only with macros. - The XLEN=32 and XLEN=64 versions of the RISC-V assembly code are identical, except that callee-saved registers are saved and restored differently. 
There are RISC-V assembly code #include files for handling the saving and restoring of callee-saved registers, and they are future-proof should XLEN=128 ever be needed. - Multiplications, such as computing array strides and offsets, are performed in C, and later passed to the RISC-V assembly kernels. This is so that the compiler can determine whether the 'M' (multiply) extension is available and use multiplication instructions, or call library helper functions instead. - A new macro called bli_static_assert() has been added to perform static assertions at compile-time, regardless of the C/C++ dialect of the compiler. The original motivation of this was to ensure that calling RISC-V assembly kernels would not silently truncate arguments of type 'dim_t' or 'inc_t' (so-called "narrowing conversions"). - RISC-V CI tests have been added to Travis CI, using the riscv-gnu-toolchain cross-compiler, and qemu simulator. - Thanks to Lee Killough for collaborating on this commit. commit 593d01761910af6a9a16ee0ac097142732f73c29 Author: Field G. Van Zee Date: Sat Apr 8 16:44:16 2023 -0500 CREDITS file update. commit 259f68479671bbaf9c5986759aaa0004f9b05a24 Author: Field G. Van Zee Date: Fri Apr 7 16:11:34 2023 -0500 CREDITS file update. Details: - Added attributions associated with commits: - 98d4678 9b1beec: @bartoldeman - 2b05948 059f151: @ct-clmsn - Reordered attribution for @decandia50. commit aea8e1d9243631635ca788d5e14f0f29328e637d Author: Field G. Van Zee Date: Mon Apr 3 12:17:51 2023 -0500 Optionally disable thread-local storage. (#735) Details: - Implemented a new configure option, --disable-tls, which allows the user to optionally disable the use of thread-local storage qualifiers on static variables in BLIS. This option will rarely be needed, but in some situations may allow BLIS to compile when TLS is unavailable. Thanks to Nick Knight for suggesting this option. - Unlike the --disable-system option, --disable-tls does not forcibly disable threading.
Instead, warnings of the possible consequences of using threading with TLS disabled are added to: - the output of './configure --help'; - the output of 'configure' when the --disable-tls option is parsed; - the informational header output by the testsuite. Thanks to Minh Quan Ho for suggesting these warnings. - Modified frame/include/bli_lang_defs.h so that BLIS_THREAD_LOCAL is defined to nothing when BLIS_ENABLE_TLS is not defined. - Defined bli_info_get_enable_tls(), which returns whether the cpp macro BLIS_ENABLE_TLS was defined. - Edited --disable-system configure status output for clarity. - Whitespace updates. commit 3f1432abe75cc306ef90a04381d7e0d8739fded8 Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Mon Apr 3 12:10:59 2023 -0500 Add output.testsuite to .gitignore (#736) Details: - Added `output.testsuite` to .gitignore since it was previously not being matched by `output.testsuite.*`. commit 38fc5237520a2f20914a9de8bb14d5999009b3fb Author: Field G. Van Zee Date: Thu Mar 30 17:30:07 2023 -0500 Added mm_algorithm pdf files (bp and pb). Details: - Added PDF versions of the PowerPoint files added in 17cd260. commit 17cd260cb504b2f3997c32daec77f4c828fbb32b Author: Field G. Van Zee Date: Wed Mar 29 21:47:12 2023 -0500 Added mm_algorithm pptx files (bp and pb). Details: - Added two PowerPoint files that contain slides depicting the classic Goto algorithm for matrix multiplication as well as its sister "panel-block" algorithm. These files reside in docs/diagrams. commit 9d778e0f7c94d8752dd578101e4fc6893a1f54ef Author: Field G. Van Zee Date: Wed Mar 29 17:36:49 2023 -0500 Move -fPIC insertion to subconfigs' make_defs.mk. (#738) * Move -fPIC insertion to subconfigs' make_defs.mk. Details: - Previously, common.mk was appending -fPIC to the CPICFLAGS variables set within the various subconfigurations' make_defs.mk files.
This seemed somewhat unintuitive, and so now the -fPIC flag is assigned to the various subconfigs' CPICFLAGS variables in the respective make_defs.mk files. - This commit also changes the logic in common.mk so that instead of appending, the variable is overwritten, but now *only* in the case of Windows (since apparently -fPIC needs to be omitted there). Thanks to Nick Knight for catching and reporting this weirdness. commit 04090df01175477394d1e73af2e5769751d47cd6 Author: Field G. Van Zee Date: Mon Mar 27 14:13:10 2023 -0500 Fixed compile errors with `BLIS_DISABLE_BLAS_DEFS`. (#730) * Fixed compile errors with BLIS_DISABLE_BLAS_DEFS. Details: - This commit fixes a compile-time error related to the type definition (prototype) of dsdot_() when BLIS_DISABLE_BLAS_DEFS is defined by the application (or the configuration), which is actually a symptom of a larger design issue when disabling BLAS prototypes. The macro was intended to allow applications to bring their own BLAS prototypes and suppress the inclusion of duplicate (or possibly conflicting) prototypes within blis.h. However, prototypes are still needed during compilation even if they are ultimately omitted from blis.h. The problem is that almost every source file in BLIS--including the BLAS compatibility layer--only includes one header (blis.h), and if we were to #include a new header in the BLAS source files (to isolate only the BLAS prototypes), we would also have to make the build system aware of the location of those headers. Thanks to Edward Smyth of AMD for reporting this issue. - The solution I settled upon was to remove all cpp guards from all BLAS headers (by changing them to #if 1, for easy search-and-replace anchoring in the future if we ever need to re-insert guards) and modifying bli_blas.h so that the BLAS prototypes are #included if either (a) BLIS_ENABLE_BLAS_DEFS is defined, or (b) BLIS_ENABLE_BLAS_DEFS is *not* defined but BLIS_IS_BUILDING_LIBRARY *is* defined.
(Thanks to Devin Matthews for steering me away from an inferior solution.) - This commit also spins off the actual BLAS prototypes/definitions to a separate file, bli_blas_defs.h. - CREDITS file update. commit 5f841307f668f65b7ed5a479bd8374d2581208cf Author: Field G. Van Zee Date: Fri Mar 24 20:05:13 2023 -0500 Omit -fPIC if shared library build is disabled. (#732) Details: - Updated common.mk so that when --disable-shared option is given to configure: 1. The -fPIC compiler flag is omitted from the individual configuration family members' CPICFLAGS variables (which are initialized in each subconfig's make_defs.mk file); and 2. The BUILD_SYMFLAGS variable, which contains compiler flags needed to control the symbol export behavior, is left blank. - The net result of these changes is that flags specific to shared library builds are only used when a shared library is actually scheduled to be built. Thanks to Nick Knight for reporting this issue. - CREDITS file update. commit 72c37eb80f964b7840377076e5009aec5b29d320 (origin/riscv) Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Thu Mar 23 16:01:55 2023 -0500 Updated configure to pass all shellcheck checks. (#729) Details: - Modified configure so that it passes all 'shellcheck' checks, disabling ones which we violate but which are just stylistic, or are special cases in our code. - Miscellaneous other minor changes, such as rearranged redirections in long sed/perl pipes to look more natural. - Whitespace tweaks. commit 60f36347c16e6336215cd52b4e5f3c0f96e7c253 Author: Field G. Van Zee Date: Wed Feb 22 20:37:30 2023 -0600 Fixed bugs in scal2v ref kernel when alpha == 1. (#728) Details: - Fixed a typo bug in ref_kernels/1/bli_scal2v_ref.c where the conditional that was supposed to be checking for cases when alpha is equal to 1.0 (so that copyv could be used instead of scal2v) was instead erroneously comparing alpha against 0.0. 
- Fixed another bug in the same function whereby BLIS_NO_CONJUGATE was erroneously being passed into copyv instead of the kernel's conjx parameter. This second bug was inert, however, due to the first bug since the "alpha == 0.0" case was already being handled, resulting in the code block never executing. commit fab18dca46618799bb0b4f652820b33d36a5d4d4 Author: Field G. Van Zee Date: Wed Feb 22 16:50:00 2023 -0600 Use 'void*' datatypes in kernel APIs. (#727) Details: - Migrated all kernel APIs to use void* pointers instead of float*, double*, scomplex*, and dcomplex* pointers. This allows us to define many fewer kernel function pointer types, which also makes it much easier to know which function pointer type to use at any given time. (For example, whereas before there was ?axpyv_ker_ft, ?axpyv_ker_vft, and axpyv_ker_vft, now there is just axpyv_ker_ft, which is equivalent to what axpyv_ker_vft used to be.) - Refactored how kernel function prototypes and kernel function types are defined so as to reduce redundant code. Specifically, the function signatures (excluding cntx_t* and, in the case of level-3 microkernels, auxinfo_t*) are defined in new headers named, for example, bli_l1v_ker_params.h. Those signatures are reused via macro instantiation when defining both kernel prototypes and kernel function types. This will hopefully make it a little easier to update, add, and manage kernel APIs going forward. - Updated all reference kernels according to the aforementioned switch to void* pointers. - Updated all optimized kernels according to the aforementioned switch to void* pointers. This sometimes required renaming variables, inserting typecasting so that pointer arithmetic could continue to function as intended, and related tweaks. - Updated sandbox/gemmlike according to the aforementioned switch to void* pointers.
- Renamed: - frame/1/bli_l1v_ft_ker.h -> frame/1/bli_l1v_ker_ft.h - frame/1f/bli_l1f_ft_ker.h -> frame/1f/bli_l1f_ker_ft.h - frame/1m/bli_l1m_ft_ker.h -> frame/1m/bli_l1m_ker_ft.h - frame/3/bli_l1m_ft_ukr.h -> frame/3/bli_l1m_ukr_ft.h - frame/3/bli_l3_sup_ft_ker.h -> frame/3/bli_l3_sup_ker_ft.h to better align with naming of neighboring files. - Added the missing "void* params" argument to bli_?packm_struc_cxk() in frame/1m/packm/bli_packm_struc_cxk.c. This argument is being passed into the function from bli_packm_blk_var1(), but wasn't being "caught" by the function definition itself. The function prototype for bli_?packm_struc_cxk() also needed updating. - Reordered the last two parameters in bli_?packm_struc_cxk(). (Previously, the "void* params" was passed in after the "const cntx_t* cntx", although because of the above bug the params argument wasn't actually present in the function definition.) commit 93c63d1f469c4650df082d0fa2f29c46db0e25f5 Author: Field G. Van Zee Date: Mon Feb 20 11:14:23 2023 -0600 Use 'const' pointers in kernel APIs. (#722) Details: - Qualified all input-only data pointers in the various kernel APIs with the 'const' keyword while also removing 'restrict' from those kernel APIs. (Use of 'restrict' was maintained in kernel implementations, where appropriate.) This affected the function pointer types defined for all of the kernels, their prototypes, and the reference and optimized kernel definitions' signatures. - Templatized the definitions of copys_mxn and xpbys_mxn static inline functions. - Minor whitespace and style changes (e.g. combining local variable declaration and initialization into a single statement). - Removed some unused kernel code left in 'old' directories. - Thanks to Nisanth M P for helping to validate changes to the power10 microkernels. commit 4e18cd34f909c5045597f411340ede3a5e0bc5e1 Author: RuQing Xu Date: Sun Feb 19 04:18:41 2023 +0900 Restored ArmSVE general storage case. 
(#708) Details: - Restored general storage case in armsve kernels. - Reason for doing this: Though real `g`-storage is difficult to speed up, the `g`-codepath here can provide good support for transposed storage, i.e., it is at least good for `GEMM_UKR_SETUP_CT_AMBI`. - In our experience, this solution is only *a little* slower than in-reg transpose. Plus, in-reg transpose is only possible for a fixed VL in our case. commit 0ba6e9eafb1e667373d9dbc2aa045557921f33e2 Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Sat Feb 18 13:15:42 2023 -0600 Refined emacs handling of indentation. (#717) Details: - This refines the emacs autoformatting to be better in line with contribution guidelines. - Removed a stray shebang in a .mk file which confuses emacs about the file mode, which should be makefile-mode. (emacs also removes stray whitespace at the ends of lines.) commit 059f15105b1643fe56084f883c22b3cadf368b39 Author: ct-clmsn Date: Sat Feb 18 14:13:23 2023 -0500 Updated hpx namespace for make_count_shape. (#725) Details: - The hpx namespace for *counting_shape changed. This PR updates the use of counting_shape in blis to comply with the change in hpx. - Co-authored-by: ctaylor commit 0b421eff130b5c896edcc09e7358d18564d177e9 Author: Field G. Van Zee Date: Sat Feb 18 13:11:41 2023 -0600 Added an 'arm64' entry to `.travis.yml`. (#726) Details: - Added a new 'arm64' entry to the .travis.yml file in an attempt to get Travis CI to compile both NEON and SVE kernels, even if only NEON kernels are exercised in the testing. With this new 'arm64' entry, the 'cortexa57' entry becomes redundant and may be removed. Thanks to RuQing Xu for this suggestion. - Previously, the macro BLIS_SIMD_MAX_SIZE was *not* being set in bli_kernels_arm64.h, which meant that the default value of 64 was being used.
This caused a runtime consistency check to fail in bli_gks.c (in Travis CI), one which requires that mr * nr * dt_size <= BLIS_STACK_BUF_MAX_SIZE for all datatype sizes dt_size, where BLIS_STACK_BUF_MAX_SIZE is defined as BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2. This commit increases BLIS_SIMD_MAX_SIZE to 128 for the 'arm64' configuration, thus overriding the default and (hopefully) avoiding the aforementioned consistency check failures. - Appended '|| cat ./output.testsuite' to all 'make' commands in travis/do_testsuite.sh. Thanks to RuQing Xu for this suggestion. - Whitespace changes. commit b1d3fc7e5b0927086e336a23f16ea59aa3611ccb Author: Field G. Van Zee Date: Fri Feb 10 15:34:47 2023 -0600 Redirect grep stderr to /dev/null. (#723) Details: - In common.mk, added a redirection of stderr to /dev/null for the grep command being used to gather a list of header files #included from bli_cntx_ref.c. The redirection is desirable because as of grep 3.8, regular expressions with "stray" backslashes trigger warnings [1]. But removing the backslash seems to break the BLIS build system when using pre-3.8 versions of grep, so this seems to be the easiest way to satisfy the BLIS build system for both pre- and post-3.8 grep environments. [1] https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html commit e3d352f1fcc93e6a46fde1aa4a7f0a18fb27bd42 Author: Nisanth M P Date: Wed Feb 8 06:11:41 2023 +0530 Added runtime selection of 'power' config family. (#718) Details: - Created a 'power' umbrella configuration family, which, when targeted at configure-time, will build both 'power9' and 'power10' subconfigs. (With this feature, a BLIS shared library could be compiled on a power9 system and run on power10 and vice-versa. Unoptimised code will execute if it is linked and run on any other generic system.) - This new configuration family will only work with gcc, since that is the only compiler supported by both power9 and power10 subconfigs in BLIS.
- Documented power9 and power10 as supported microarchitectures in the docs/HardwareSupport.md document. commit e730c685d09336b3bd09e86c94330c4eba967f3e Author: Field G. Van Zee Date: Mon Feb 6 15:31:54 2023 -0600 Define `BLIS_VERSION_STRING` in `blis.h`. (#720) Details: - Previously, the version string was communicated from configure to config.mk (via the config.mk.in template), where it was included via the top-level Makefile, where it was then used to define the preprocessor macro BLIS_VERSION_STRING via a command line argument to the compiler (via -D). This macro is then used within bli_info.c to initialize a static string which can then be queried via the bli_info_get_version_str() function. However, there are some applications that may find utility in being able to access the version string by inspecting the monolithic (flattened) blis.h header file that is created at compile time and installed alongside the library. This commit moves the definition of BLIS_VERSION_STRING into bli_config.h (via the bli_config.h.in template) so that it is embedded in blis.h. The version string is now available in three places: - the static/shared library, which is installed in the 'lib' subdirectory of the install prefix (query-able via the bli_info_get_version_str() function); - the config.mk makefile fragment, which is installed in the 'share' subdirectory of the install prefix (in the VERSION variable); - the blis.h header file, which is installed in the 'include' subdirectory of the install prefix (via the BLIS_VERSION_STRING macro constant). Thanks to Mohsen Aznaveh and Tim Davis for providing the idea for this change. - CREDITS file update. commit dc5d00a6ce0350cd82859d8c24f23d98f205d8db Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Fri Jan 27 17:36:47 2023 -0600 Typecast printf() args to avoid compiler warnings. 
(#716) Details: - In bli_thread_range_tlb.c, typecast integer arguments passed to printf() -- which are typically disabled unless debugging -- to type "long" to guarantee a match to the "%ld" format specifiers used in those calls. This avoids spurious warnings with certain compilers in certain toolchain environments, such as 32-bit RISC-V (rv32iv). commit ecbcf4008815035c695822fcaf106477debff89a Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Wed Jan 18 20:35:50 2023 -0600 Use here-document for 'configure --help' output. (#714) Details: - Changed the configure script function that outputs "--help" text to do so via so-called "here-document" syntax for improved readability and maintainability. The change eliminates hundreds of echo statements and makes it easier to change existing configure options' help text, along with other benefits such as eliminating the need to escape double-quote characters ("). commit c334ec278f5e2a101625629b2e13bbf1b38dede5 Author: Devin Matthews Date: Wed Jan 18 13:10:19 2023 -0600 Merge tlb- and slab/rr-specific gemm macrokernels. (#711) Details: - Merged the tlb-specific gemm macrokernel (_var2b) with the slab/rr-specific one (var2) so that a single function can be compiled with either tlb or slab/rr support, depending on the value of the BLIS_ENABLE_JRIR_TLB, _SLAB, and _RR. This is done by incorporating information from both approaches: the start/end/inc for the JR and IR loops from slab or rr partitioning; and the number of assigned microtiles, plus the starting IR dimension offset for all iterations after the first (ir_next). With these changes, slab, rr, and tlb can all be parameterized by initializing a similar set of variables prior to the jr loop. - Removed the wrap-around logic that sets the "b_next" field of the auxinfo_t struct, which executes during the last IR iteration of the last JR iteration.
The potential benefit of this code is so minor (and hinges on the microkernel making use of the b_next field) that it's arguably not worth including. The code also does the wrong thing for some threads whenever JR_NT > 1, since only thread 0 (in the JR group) would even compute with the first micropanel of B. - Re-expressed the definition of bli_is_last_iter_slrr so that slab and tlb use the same code rather than rr and tlb. - Adjusted the initialization of the gemm control tree accordingly. commit 5793a77937aee9847a5692c8e44b36a6380800a1 Author: HarshDave12 <122850830+HarshDave12@users.noreply.github.com> Date: Tue Jan 17 21:55:02 2023 +0530 Fixed mis-mapped instruction for VEXTRACTF64X2. (#713) Details: - This commit fixes a typo in the macro definition for the extended inline assembly macro VEXTRACTF64X2 in bli_x86_asm_macros.h. The macro was previously defined (incorrectly) in terms of the vextractf64x4 instruction rather than vextractf64x2. - CREDITS file update. commit 16d2e9ea9ca0853197b416eba701b840a8587bca Author: Field G. Van Zee Date: Fri Jan 13 20:03:01 2023 -0600 Defined lt, lte, gt, gte + misc. other updates. (#712) Details: - Changed invertsc operation to be a non-destructive operation; that is, it now takes separate input and output operands. This change applies to both the object and typed APIs. - Defined an alternative square root operation, sqrtrsc, which, when operating on complex scalars, assumes the imaginary part of the input to be zero. - Changed the semantics of addm, subm, copym, axpym, scal2m, and xpbym so that when the source matrix has an implicit unit diagonal, the operation leaves the diagonal of the destination matrix untouched. Previously, the operations would interpret an implicit unit diagonal on the source matrix as a request to manifest the unit diagonal *explicitly* on output (either as something to copy in the case of copym, or something to compute with in the cases of addm, subm, axpym, scal2m, and xpbym). 
It turns out that this behavior was too cute by half and could cause unintended headaches for practical use cases. (This change in behavior also required small modifications to the trmv and trsv testsuite modules so that they would properly test matrices with unit diagonals.) - Added missing dependencies for copym to gemv, ger, hemv, trmv, and trsv testsuite modules. - Implemented level-0-like ltsc, ltesc, gtsc, gtesc operations in frame/util, which use lt, lte, gt, and gte level-0 scalar macros. - Trivial variable rename in bli_part.c to harmonize with other variable naming conventions. commit 9a366b14fe52c469f4664ef5dd93d85be8d97baa Author: Field G. Van Zee Date: Thu Jan 12 13:07:22 2023 -0600 Implement cntx_t pointer caching in gks. (#709) Details: - Refactored the gks cntx_t query functions so that: (1) there is a clearer pattern of similarity between functions that query a native context and those that query its induced (1m) counterpart; and (2) queried cntx_t pointers (for both native and induced cntx_t pointers) are cached (by default), or deep-queried upon each invocation, depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is defined. - Refactored query-related functions in bli_arch.c to cache the queried arch_t value (by default), or deep-query the arch_t value upon each invocation, depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is defined. - Tweaked the behavior of bli_gks_query_ind_cntx_impl() (formerly named bli_gks_query_ind_cntx()) so that the induced method cntx_t struct is repopulated each time the function is called. (It is still only allocated once on first call.) This was mostly done in preparation for some future in which the arch_t value might change at runtime. In such a scenario, the induced method context would need to be recalculated any time the native context changes. - Added preprocessor logic to bli_config_macro_defs.h to handle enabling or disabling of cntx_t pointer caching (via BLIS_ENABLE_GKS_CACHING). 
- For now, cntx_t pointer caching is enabled by default and does not correspond to any official configure option. Disabling can be done by inserting a #define for BLIS_DISABLE_GKS_CACHING into the appropriate bli_family_*.h header file within the configuration of interest. - Thanks to Harihara Sudhan S (AMD) for suggesting that cntx_t pointers (and not just arch_t values) be cached. - Comment updates. commit b895ec9f1f66fb93972589c06bff171337153a31 Author: Nisanth M P Date: Wed Jan 11 09:02:32 2023 +0530 Fixing type-mismatch errors in power10 sandbox (#701) Details: - This commit fixes a mismatch between the function type signature of bli_gemm_ex() required by BLIS and the version of the function defined within the power10 sandbox. It also performs typecasting upon calling bli_gemm_front() to attain type consistency with the type signature defined by BLIS for bli_gemm_front(). commit 38d88d5c131253066cad4f98eea06fa9299cae3b Author: Devin Matthews Date: Tue Jan 10 21:24:58 2023 -0600 Define new global scalar (obj_t) constants. (#703) Details: - This commit defines the following new global scalar constants: - BLIS_ONE_I: This constant encodes the imaginary unit. - BLIS_MINUS_ONE_I: This constant encodes the negative imaginary unit. - BLIS_NAN: This constant encodes a not-a-number value. Both real and imaginary parts are set to NaN for complex datatypes. commit cdb22b8ffa5b31a0c16ac1a7bcecefeb5216f669 Author: Nisanth M P Date: Wed Jan 11 08:50:57 2023 +0530 Disable power10 kernels other than sgemm, dgemm. (#705) Details: - There is a power10 sandbox which uses microkernels for datatypes other than float and double (or scomplex/dcomplex). In a regular power10-configured build (that is, with the sandbox disabled), there were compile errors for some of these other non-sgemm/non-dgemm microkernels.
This commit protects those kernels with a new cpp macro guard (which is defined in sandbox/power10/bli_sandbox.h) that prevents that kernel code from being compiled for normal, non-sandbox power10 builds. commit d220f9c436c0dae409974724d42ab6c52f12a726 Author: Nisanth M P Date: Wed Jan 11 08:43:03 2023 +0530 Fix k = 0 edge case in power10 microkernels (#706) Details: - When power10 sgemm and dgemm microkernels are called with k = 0, they become caught in infinite loops and segfault. This is fixed now via an early exit in the case of k = 0. commit 2e1ba9d13c23a06a7b6f8bd326af428f7ea68c31 Author: Field G. Van Zee Date: Tue Jan 10 21:05:54 2023 -0600 Tile-level partitioning in jr/ir loops (ex-trsm). (#695) Details: - Reimplemented parallelization of the JR loop in gemmt (which is recycled for herk, her2k, syrk, and syr2k). Previously, the rectangular region of the current MC x NC panel of C would be parallelized separately from the diagonal region of that same submatrix, with the rectangular portion being assigned to threads via slab or round-robin (rr) partitioning (as determined at configure-time) and the diagonal region being assigned via round-robin. This approach did not work well when extracting lots of parallelism from the JR loop and was often suboptimal even for smaller degrees of parallelism. This commit implements tile-level load balancing (tlb) in which the IR loop is effectively subjugated in service of more equitably dividing work in the JR loop. This approach is especially potent for certain situations where the diagonal region of the MC x NR panel of C is significant relative to the entire region. However, it also seems to benefit many problem sizes of other level-3 operations (excluding trsm, which has an inherent algorithmic dependency in the IR loop that prevents the application of tlb).
For now, tlb is implemented as _var2b.c macrokernels for gemm (which forms the basis for gemm, hemm, and symm), gemmt (which forms the basis of herk, her2k, syrk, and syr2k), and trmm (which forms the basis of trmm and trmm3). Which function pointers (_var2() or _var2b()) are embedded in the control tree will depend on whether the BLIS_ENABLE_JRIR_TLB cpp macro is defined, which is controlled by the value passed to the existing --thread-part-jrir=METHOD (or -r METHOD) configure option. This script adds 'tlb' as a valid option alongside the previously supported values of 'slab' and 'rr'. ('slab' is still the default.) Thanks to Leick Robinson for abstractly inspiring this work, and to Minh Quan Ho for inquiring (in PR #562, and before that in Issue #437) about the possibility of improved load balance in macrokernel loops, and even prototyping what it might look like, long before I fully understood the problem. - In bli_thread_range_weighted_sub(), tweaked the way we compute the area of the current MC x NC trapezoidal panel of C by better taking into account the microtile structure along the diagonal. Previously, it was an underestimate, as it assumed MR = NR = 1 (that is, it assumed that the microtile column of C that overlapped with microtiles exactly coincided with the diagonal). Now, we only assume MR = NR. This is still a slight underestimate when MR != NR, so the additional area is scaled by 1.5 in a hackish attempt to compensate for this, as well as other additional effects that are difficult to model (such as the increased cost of writing to temporary tiles before finally updating C). The net effect of this better estimation of the trapezoidal area should be (on average) slightly larger regions assigned to threads that have little or no overlap with the diagonal region (and correspondingly slightly smaller regions in the diagonal region), which we expect will lead to slightly better load balancing in most situations.
- Spun off the contents of bli_thread.[ch] that relate to computing thread ranges into one of three source/header file pairs: - bli_thread_range.[ch], which define functions that are not specific to the jr/ir loops; - bli_thread_range_slab_rr.[ch], which define functions that implement slab or round-robin partitioning for the jr/ir loops; - bli_thread_range_tlb.[ch], which define functions that implement tlb for the jr/ir loops. - Fixed the computation of a_next in the last iteration of the IR loop in bli_gemmt_l_ker_var2(). Previously, it always "wrapped" back around to the first micropanel of the current MC x KC packed block of A. However, this is almost never actually the micropanel that is used next. A new macro, bli_gemmt_l_wrap_a_upanel(), computes a_next correctly, with a similarly named bli_gemmt_u_wrap_a_upanel() for use in the upper-stored case (which *does* actually always choose the first micropanel of A as its a_next at the end of the IR loop). - Removed adjustments for a_next/b_next (a2/b2) for the diagonal-intersecting case of gemmt_l_ker_var2() and the above-diagonal case of gemmt_u_ker_var2() since these cases will only coincide with the last iteration of the IR loop in very small problems. - Defined bli_is_last_iter_l() and bli_is_last_iter_u(), the latter of which explicitly considers whether the current microtile is the last tile that intersects the diagonal. (The former does the same, but the computation coincides with the original bli_is_last_iter().) These functions are now used in gemmt to test when a_next (or a2) should "wrap" (as discussed above). Also defined bli_is_last_iter_tlb_l() and bli_is_last_iter_tlb_u(), which are similar to the aforementioned functions but are used when employing tlb in gemmt. - Redefined macros in bli_packm_thrinfo.h, which test whether an iteration of work is assigned to a thread, as static inline functions in bli_param_macro_defs.h (and then deleted bli_packm_thrinfo.h).
In the process of redefining these macros, I also renamed them from bli_packm_my_iter_rr/sl() to bli_is_my_iter_rr/sl(). - Renamed bli_thread_range_jrir_rr() -> bli_thread_range_rr() bli_thread_range_jrir_sl() -> bli_thread_range_sl() bli_thread_range_jrir() -> bli_thread_range_slrr() - Renamed bli_is_last_iter() -> bli_is_last_iter_slrr() - Defined bli_info_get_thread_jrir_tlb() and renamed: - bli_info_get_thread_part_jrir_slab() -> bli_info_get_thread_jrir_slab() - bli_info_get_thread_part_jrir_rr() -> bli_info_get_thread_jrir_rr() - Modified bli_rntm_set_ways_for_op() to redirect IR loop parallelism into the JR loop when tlb is enabled for non-trsm level-3 operations. - Added a sanity check to prevent bli_prune_unref_mparts() from being used on packed objects. This prohibition is necessary because the current implementation does not take into account the atomicity of packed micropanel widths relative to the diagonal of structured matrices. That is, the function prunes greedily without regard to whether doing so would prune off part of a micropanel *which has already been packed* and assigned to a thread for inclusion in the computation. - Further restricted early returns in bli_prune_unref_mparts() to situations where the primary matrix is not only of general structure but also dense (in terms of its uplo_t value). The addition of the matrix's dense-ness to the conditional is required because gemmt is somewhat unusual in that its C matrix has general structure but is marked as lower- or upper-stored via its uplo_t. By only checking for general structure, attempts to prune gemmt C matrices would incorrectly result in early returns, even though that operation effectively treats the matrix as symmetric (and stored in only one triangle). - Fixed a latent bug in bli_thread_range_rr() wherein incorrect ranges were computed when 1 < bf. Thankfully, this bug was not yet manifesting since all current invocations used bf == 1. 
- Fixed a latent bug in some unexercised code in bli_?gemmt_l_ker_var2() that would perform incorrect pruning of unreferenced regions above where the diagonal of a lower-stored matrix intersects the right edge. Thankfully, the bug was not harming anything since those unreferenced regions were being pruned prior to the macrokernel. - Rewrote slab/rr-based gemmt macrokernels so that they no longer carved C into rectangular and diagonal regions prior to parallelizing each separately. The new macrokernels use a unified loop structure where quadratic (slab) partitioning is used. - Updated all level-3 macrokernels to have a more uniform coding style, such as with respect to combining variable declarations with initializations as well as the use of const. - Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and bli_thrinfo_thread_id(), respectively. This change probably should have been included in aeb5f0c. - Removed old prototypes in bli_gemmt_var.h and bli_trmm_var.h that corresponded to functions that were removed in aeb5f0c. - Other very minor cleanups. - Comment updates. commit b6735ca26b9d459d9253795dc5841ae8de9e84c9 Author: Devin Matthews Date: Fri Jan 6 14:10:01 2023 -0600 Refactor structure awareness in packm_blk_var1.c. (#707) Details: - Factored some of the structure awareness out of the loop in bli_packm_blk_var1(). So instead of having a single loop with conditionals in the body to handle various kinds of structure (and stored/unstored submatrix placement), we now have a conditional branch to handle various structure/storage scenarios with a loop in each section. This change was originally motivated to choose slab or round-robin partitioning (in the context of triangular matrices) based on the structure of the entire block (or panel) being packed rather than each micropanel individually.
Previously, the code would attempt to limit rr to the portion of the block that intersects the diagonal and use slab for the remainder. However, that approach was not well-thought out and in many situations this would lead to inferior load balancing when compared to using round-robin for the entire block (or panel). This commit has the added benefit of incurring less overhead during the packing process now that each of the new loops is simpler. commit f956b79922da412791e4c8b8b846b3aafc0a5ee0 Author: Field G. Van Zee Date: Sat Dec 31 20:18:08 2022 -0600 Switch to l3 sup decorator in gemmlike sandbox. (#704) Details: - Modified the gemmlike sandbox to call bli_l3_sup_thread_decorator() rather than a local analogue of that code. This reduces redundant logic and makes it easier for the sandbox to inherit future improvements to the framework's threading code. - Moved addon/gemmd to addon/old/gemmd. This code has fallen out of date and is taking too much effort to maintain. We will very likely reimplement it completely once future changes are made to the framework proper. commit 538150c5845ad903773ca797c740048174116aa4 Author: Field G. Van Zee Date: Sun Dec 25 22:28:09 2022 -0600 Applied race condition fix to sup thread decorator. Details: - Applied the race condition bugfix in commit 7d23dc2 to the corresponding sup code in bli_l3_sup_decor.c. Note that in the case of sup, the race condition would have only manifested when optional packing was enabled at runtime (typically via setting BLIS_PACK_A and/or BLIS_PACK_B environment variables). - Both the fix in this commit and the fix in 7d23dc2 address bugs that were introduced when the thrinfo_t trees/communicators were restructured in the October omnibus commit (aeb5f0c). commit 7d23dc2a064a371dc9883e2c2c7236a70912428c Author: Devin Matthews Date: Sun Dec 25 19:09:14 2022 -0600 Fix a race condition which manifested as incorrect results (rarely). 
(#702) The problem occurs when there are at least two teams of threads packing different parts of a matrix, and where each team has at least two threads; call them team A and team B. The problematic sequence is: 1. The chief of team A checks out a block B and broadcasts the pointer to its teammates. 2. Team A completely packs their data and performs a barrier amongst themselves. 3. Team A commences computing with the packed data. 4. The chief of team A finishes computing before its teammates, then calls bli_thrinfo_free on its thrinfo_t struct (which contains the mem_t object referencing the buffer B). This causes buffer B to be checked back in to the pba. 5. The chief of team B checks out the *same* block B that was just checked back in and broadcasts the pointer to its teammates. 6. DATA RACE: now the remaining threads of team A are reading *while* team B are writing to the same buffer B. If team B writes new data before team A is done computing then an incorrect result is generated. The solution is to place a global barrier before the call to bli_thrinfo_free at the end of the computation. Co-authored-by: Field G. Van Zee commit 3accacf57d11e9b109339754f91bf22329b6cb6a Author: Field G. Van Zee Date: Fri Dec 16 10:26:33 2022 -0600 Skip 1m optimization when forcing hemm_l/symm_l. (#697) Details: - Fixed a bug in right-sided hemm when: - using the 1m method, - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration, and - the storage of C matches the gemm microkernel IO preference PRIOR to the right-sidedness being detected and recast in terms of the left-side code path. It turns out that bli_gemm_ind_recast_1m_params() was applying its optimization (recasting a complex-domain macrokernel calling a 1m virtual microkernel to a real-domain macrokernel calling the real-domain microkernel) in situations in which it should not have.
The optimization was silently assuming that the storage of C always matched that of the microkernel preference, since the front-end (in this case, bli_hemm_front()) would have already had a chance to transpose the operation to bring the two into agreement. However, by disabling right-sided hemm, we deprive BLIS of that flexibility (as a transposed left-sided case would necessarily have to become a right-sided case), and thus the assumption was no longer holding in all cases. Thanks to Nisanth M P for reporting this bug in Issue #621. - The aforementioned bug, and its bugfix, also apply to symm when BLIS_DISABLE_SYMM_RIGHT is defined. - Comment updates. - CREDITS file update. commit 4833ba224eba54df3f349bcb7e188bcc53442449 Author: Field G. Van Zee Date: Mon Dec 12 20:26:02 2022 -0600 Fixed perf of mt sup with packing, and mt gemmlike. (#696) Details: - Brought the gemmsup code path up to date relative to the latest thrinfo_t semantics introduced in the October Omnibus commit (aeb5f0c). This was done by passing the prenode (instead of the current node) into the packm variant within bli_l3_sup_packm.c as well as creating the prenodes and attaching them to the thrinfo_t tree in bli_l3_sup_thrinfo_create(). These changes erase the performance degradation introduced in the omnibus when running multithreaded sup with optional packing enabled. Special thanks to Devin Matthews for sussing out this fix in short order. - Fixed the gemmlike sandbox in a manner similar to that of sup with packing, described above. This also involved passing the prenode into the local gemmlike packm variant. (Recall that gemmlike recycles the use of bli_l3_sup_thrinfo_create(), so it automatically inherits that part of the sup fix described above.) - Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and bli_thrinfo_thread_id(), respectively. commit db10dd8e11a12d85017f84455558a82c0093b1da Author: Field G.
Van Zee Date: Tue Nov 29 19:10:31 2022 -0600 Fixed _gemm_small() prototype; disabled gemm_small. Details: - Fixed a mismatch between the prototype for bli_gemm_small() in bli_gemm_front.h and the actual definition of bli_gemm_small() in kernels/zen/3/bli_gemm_small.c. The former was erroneously declaring the cntl_t* argument as 'const'. Thanks to Jeff Diamond for reporting this issue. - Commented out BLIS_ENABLE_SMALL_MATRIX, BLIS_ENABLE_SMALL_MATRIX_TRSM macro definitions in config/zen3/bli_family_zen3.h. AMD's small matrix implementation should probably remain disabled in vanilla BLIS, at least for now. commit f0337b784d164ae505ca0e11277a1155680500d1 Author: Field G. Van Zee Date: Sun Nov 13 21:36:47 2022 -0600 Trivial whitespace/comment tweaks. Details: - Trivial whitespace and comment changes, most of which ideally would have been part of the previous commit pertaining to HPX (2b05948). commit 2b05948ad2c9785bc53f376d53a7141cbc917447 Author: ct-clmsn Date: Sun Nov 13 17:40:22 2022 -0500 blis support for hpx (#682) Implement threading backend via HPX. HPX is an asynchronous many task runtime system used in high performance computing applications. The runtime implements the ISO C++ parallelism specification and provides a user-space thread implementation. This PR provides BLIS a thread backend implementation using HPX and resolves feature request #681. The configuration script, makefiles, and testsuite have been updated to support an HPX build option. The addition of HPX support provides other developers an exemplar for integrating other C++ threading backends into BLIS. Co-authored-by: ctaylor Co-authored-by: Devin Matthews commit e1ea25da43508925e33d4e57e420cfc0a9de793f Author: Field G. Van Zee Date: Fri Nov 11 12:07:51 2022 -0600 Fixed subtle barrier_fpa bug in bli_thrcomm.c. (#690) Details: - In bli_thrcomm.c, correctly initialize the BLIS_OPENMP element of the barrier function pointer array (barrier_fpa) to NULL when BLIS_ENABLE_OPENMP is *not* defined.
Similarly, initialize the BLIS_POSIX element of barrier_fpa to NULL when BLIS_ENABLE_PTHREADS is not enabled. This bug was introduced in a1a5a9b and was likely the result of an incomplete edit. The effects of the bug would have likely manifested when querying a thrcomm_t that was initialized with a timpl_t value corresponding to a threading implementation that was omitted from the -t option at configure-time. commit dc6e5f3f5770074ba38554541b8b64711a68c084 Author: leekillough <15950023+leekillough@users.noreply.github.com> Date: Thu Nov 3 18:33:08 2022 -0500 Enhance emacs formatting of C files to remove trailing whitespace and ensure a newline at the end of file commit 713d078075a4a563a43d83fd0880ab5091c2e4a4 Author: Field G. Van Zee Date: Thu Nov 3 20:00:11 2022 -0500 Delete mpi_test garbage. (#689) Details: - tlrmchlsmth: "What even is this? No comments, no commit message, not used by anything. Trash." commit 8d813f7f12732d52c95570ae884d5defbfd19234 Author: Field G. Van Zee Date: Thu Nov 3 19:10:47 2022 -0500 Some decluttering of the top-level directory. Details: - Relocated 'mpi_test' directory to test/mpi_test. - Relocated 'so_version' and 'version' files from top-level directory to 'build' directory. - Updated build/bump-version.sh script to accommodate relocation of 'version' file to 'build' directory. - Updated configure script to accommodate relocation of 'so_version' file to 'build' directory. - Updated INSTALL file to replace pointers to blis-devel mailing list with a pointer to docs/Discord.md. - Updated RELEASING file to contain a reminder to consider whether the so_version file should be updated prior to the release. commit 6774bf08c92fc6983706a91bbb93b960e8eef285 Author: Lee Killough <15950023+leekillough@users.noreply.github.com> Date: Thu Nov 3 15:20:47 2022 -0500 Fix typo in configure --help text. (#686) Details: - Fixed a misspelling in the --help description for the --int-size (-i) configure option. 
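Note on the barrier_fpa fix in e1ea25d above: that commit concerns a table of function pointers indexed by threading implementation, in which the entries for implementations omitted at configure-time must be NULL. The following is a minimal sketch of that pattern, not the code in bli_thrcomm.c. Every name in it (demo_timpl_t, DEMO_ENABLE_OPENMP, DEMO_ENABLE_PTHREADS, demo_barrier_fpa, demo_barrier, and the per-implementation functions) is a hypothetical stand-in chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ids for threading implementations (a stand-in for a
   timpl_t-like enum; not BLIS's actual definition). */
typedef enum
{
    DEMO_SINGLE = 0,
    DEMO_OPENMP,
    DEMO_POSIX,
    DEMO_NUM_IMPLS
} demo_timpl_t;

typedef int (*demo_barrier_ft)( int thread_id );

/* The single-threaded "barrier" is always available. */
static int demo_barrier_single( int thread_id ) { (void)thread_id; return 0; }

/* The other implementations exist only if enabled at build time. */
#ifdef DEMO_ENABLE_OPENMP
static int demo_barrier_openmp( int thread_id ) { (void)thread_id; return 0; }
#endif
#ifdef DEMO_ENABLE_PTHREADS
static int demo_barrier_pthreads( int thread_id ) { (void)thread_id; return 0; }
#endif

/* Each slot is set explicitly: a real function when the implementation
   is compiled in, NULL otherwise. */
static demo_barrier_ft demo_barrier_fpa[ DEMO_NUM_IMPLS ] =
{
    [DEMO_SINGLE] = demo_barrier_single,
#ifdef DEMO_ENABLE_OPENMP
    [DEMO_OPENMP] = demo_barrier_openmp,
#else
    [DEMO_OPENMP] = NULL,
#endif
#ifdef DEMO_ENABLE_PTHREADS
    [DEMO_POSIX]  = demo_barrier_pthreads,
#else
    [DEMO_POSIX]  = NULL,
#endif
};

/* Dispatches through the table. Returns -1 if the requested
   implementation was not compiled in, otherwise the barrier's result. */
static int demo_barrier( demo_timpl_t ti, int thread_id )
{
    assert( (int)ti >= 0 && (int)ti < (int)DEMO_NUM_IMPLS );

    demo_barrier_ft f = demo_barrier_fpa[ ti ];
    if ( f == NULL ) return -1;
    return f( thread_id );
}
```

With neither hypothetical macro defined, demo_barrier( DEMO_OPENMP, 0 ) reports -1 instead of calling through an unset slot, which is the kind of failure the explicit NULL entries guard against.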
commit 872898d817f35702e7678ff7f3eeff0f12e641f5 Author: Field G. Van Zee Date: Wed Nov 2 21:53:22 2022 -0500 Fixed trmm[3]/trsm performance bug in cf7d616. (#685) Details: - Fixed a performance bug in the packing of micropanels that intersect the diagonal of triangular matrices (i.e., those found in trmm, trmm3, and trsm). This bug was introduced in cf7d616 and stemmed from an ill-formed boolean conditional expression in bli_packm_blk_var1(). This conditional would choose when to use round-robin parallel work allocation, but checked for the triangularity of the submatrix being packed while failing also to check for whether the current micropanel actually intersected the diagonal. The net result of this bug was that *all* micropanels of a triangular matrix, no matter where the upanels resided within the matrix, were assigned to threads via a round-robin policy. This affected some microarchitectures and threading configurations much worse than others, but it seems that overall the effect was universally negative, likely because of the reduced spatial locality during the packing with round-robin. Thanks to Leick Robinson for his tireless efforts in helping track down this issue. commit edcc2f9940449f7d9cefcfc02159d27b013e7995 Author: Field G. Van Zee Date: Wed Nov 2 19:04:49 2022 -0500 Support --nosup, --sup configure options. (#684) Details: - Added --nosup and --sup as alternative ways of requesting that sup be disabled or enabled. These are analogous to --disable-sup-handling and --enable-sup-handling, respectively. (I got tired of typing out --disable-sup-handling and needed a shorthand notation.) - Tweaked message output by configure when sup is enabled/disabled for clarity and specificity. - Whitespace changes. commit 5eea6ad9eb25f37685d1ae4ae08c73cd1daca297 Author: Field G. Van Zee Date: Wed Nov 2 17:07:54 2022 -0500 Add mention of Wilkinson Prize to README.md. (#683) Details: - Added blurbs and links to Wilkinson Prize to README.md.
- Added mention of both Best Paper and Wilkinson Prizes to the top of README.md. - Other minor tweaks. commit 29f79f030e939969d4f3876c4fdaac7b0c5daa63 Author: Devin Matthews Date: Mon Oct 31 18:57:45 2022 -0500 Fixed performance bug caused by redundant packing. (#680) Details: - Fixed a performance bug whereby multiple threads were redundantly packing the same (rather than separate) micropanels. This bug was caused by different parts of the code using the num_threads/thread_id field of the thrinfo_t vs. the n_way/work_id fields. The fix was to standardize on the latter and provide a "fake" thrinfo_t sub-prenode in the thrinfo tree which consists of single-member thread teams. The single team with multiple threads node is still required since it and only it can be used to perform barriers and broadcasts (e.g. of the packed buffer pointer). commit aeb5f0cc19665456e990a7ffccdb09da2e3f504b Author: Devin Matthews Date: Thu Oct 27 12:39:11 2022 -0500 Omnibus PR - Oct 2023 (#678) Details: - This is an "omnibus" commit, consisting of multiple medium-sized commits that affect non-trivial aspects of BLIS. The major highlights: - Relocated the pba, sba pool (from the rntm_t), and mem_t (from the cntl_t) to the thrinfo_t object. This allows the rntm_t to be effectively const (although it is sometimes copied internally and modified to reflect different ways of parallelism). Moving the mem_t sets the stage for sharing a global control tree amongst all threads. - De-templatized the macrokernels for gemmt, trmm, and trsm to match the macrokernel for gemm, which has been de-templatized since 54fa28b. - Reimplemented bli_l3_determine_kc() by separating out the logic for adjusting KC based on MR/NR for triangular A and/or B into a new function, bli_l3_adjust_kc(). For now, this function is still called from bli_l3_determine_kc(), but in the future we plan to have it called once when constructing the control tree. 
- Refactored the level-3 thread decorator into two parts: - One part deals only with launching threads, each one calling a generic thread entry function. This code resides in frame/thread and constitutes the definition of bli_thread_launch(). Note that it is specific to the threading implementation (OpenMP, pthreads, single, etc.) - The other part deals with passing the matrix operands and related information into bli_thread_launch(). This is the "l3 decorator" and now resides in frame/3. It is agnostic to the threading implementation. - Modified the "level" of the thread control tree passed in at each operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was passed in a communicator representing the active thread teams which would share the available work. Now, the *parent* thread comm is passed in. The operation then grabs the child comm and uses it to partition the work. The difference is in bli_trsm_blk_var1(), where there are now two children nodes for this single operation (i.e. the thread control tree is split one level above where the control tree is). The sub-prenode is used for the trsm subproblem while the normal sub-node is used for the gemm part. Importantly, the parent comm is used for the barrier between them. - Removed cntl_t* arguments from bli_*_front() functions. These will be added back in the future when the control tree's creation is moved so that it happens much sooner (provided that bli_*_front() have not been absorbed into their respective bli_*_ex() functions). - Renamed various bli_thread_*() query functions to bli_thrinfo_*(), for consistency. This includes _num_threads(), _thread_id(), _n_way(), _work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and _am_chief(). - Removed extraneous barrier from _blk_var3() of gemm and trsm. - Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was misspelled. 
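Note on the decorator refactoring in aeb5f0c above: the commit splits threading into a generic launch layer and an operation-aware layer that bundles the operands. The sketch below illustrates only that separation of concerns. It makes several assumptions: all names (demo_thread_launch, demo_axpy_decorator, and so on) and signatures are hypothetical and are not those of bli_thread_launch() or the l3 decorator, a simple axpy update stands in for a level-3 operation, and the "backend" runs each thread's share sequentially on the calling thread rather than using OpenMP or pthreads.

```c
#include <stddef.h>

/* Generic per-thread entry point. It sees only an opaque params
   pointer, so the launch layer needs no knowledge of the operands. */
typedef void (*demo_thread_func_ft)( int tid, int n_threads, const void* params );

/* Part 1: the launch layer. A real backend would spawn threads here;
   this stand-in simply runs each "thread" in turn. */
static void demo_thread_launch( int n_threads, demo_thread_func_ft func,
                                const void* params )
{
    for ( int tid = 0; tid < n_threads; ++tid )
        func( tid, n_threads, params );
}

/* Part 2: the operation-aware layer. It packages the operands into a
   struct and hands them to the launch layer. */
typedef struct
{
    double        alpha;
    const double* x;
    double*       y;
    size_t        n;
} demo_axpy_params_t;

/* Entry function executed by each thread: updates a contiguous slab
   of y := y + alpha * x chosen from the thread id. */
static void demo_axpy_entry( int tid, int n_threads, const void* params )
{
    const demo_axpy_params_t* p = params;

    size_t start = p->n * (size_t)tid         / (size_t)n_threads;
    size_t end   = p->n * (size_t)( tid + 1 ) / (size_t)n_threads;

    for ( size_t i = start; i < end; ++i )
        p->y[ i ] += p->alpha * p->x[ i ];
}

/* The "decorator": independent of how threads are actually launched. */
static void demo_axpy_decorator( int n_threads, double alpha,
                                 const double* x, double* y, size_t n )
{
    demo_axpy_params_t params = { alpha, x, y, n };
    demo_thread_launch( n_threads, demo_axpy_entry, &params );
}
```

Because demo_axpy_decorator() touches only the params struct and demo_thread_launch(), the launch function could be swapped for a different backend without changing the operation-aware code.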
commit c803b03e52a7a6997a8d304a8cfa9acf7c1c555b Author: Devin Matthews Date: Wed Oct 26 18:20:00 2022 -0500 Add check to disable armsve on Apple M1. commit 2dd692b710b6a9889f7ebdd7934a2108be5c5530 Author: Devin Matthews Date: Wed Oct 26 18:10:26 2022 -0500 Fix auto-detection of firestorm (Apple M1). commit 88105dbecf0f9dfbfa30215743346e8bd6afb971 Author: Field G. Van Zee Date: Fri Oct 21 15:16:12 2022 -0500 Added Discord documentation (#677) Details: - Added a docs/Discord.md markdown document that walks the reader through creating a Discord account, obtaining the invite link, and using the link to join the BLIS Discord server. - Updated README.md to reference the new Discord.md document in multiple places, including via the official Discord logo (used with explicit permission from representatives at Discord Inc.). commit 23f5b8df3e802a27bacd92571184ec57bbdfa646 Author: Field G. Van Zee Date: Mon Oct 17 20:21:21 2022 -0500 Shuffled checked properties in bli_l3_check.c. (#676) Details: - Added certain checks for matrix structure to the level-3 operations' _check() functions, and slightly reorganized existing checks. commit 9453e0f163503f64a290256b4be53d8882224863 Author: Field G. Van Zee Date: Mon Oct 3 19:46:20 2022 -0500 CREDITS file update. Details: - This attribution was intended to go in PR #647. commit 76a23bd8c33e161221891935a489df9a9fb9c8c0 Author: Devin Matthews Date: Mon Oct 3 15:55:07 2022 -0500 Reinstate sanity check in bli_pool_finalize. (#671) Details: - Added a reinit argument to bli_pool_finalize(). This bool will signal whether or not the function is being called from bli_pool_reinit(). If it is not being called from _reinit(), we can safely check to confirm that .top_index == 0 (i.e., all blocks have been checked in). But if it *is* being called from _reinit(), then that check will be skipped since one of the predicted use cases for bli_pool_reinit() anticipates that some blocks are (probably) checked out when the pool_t is reinitialized. 
- Updated existing invocations of bli_pool_finalize() to pass in either FALSE (from bli_apool_free_block() or bli_pba_finalize_pools()) or TRUE (from bli_pool_reinit()) for the new reinit argument. commit 63470b49e3b9b15e00a8f666e86ccd70c6005fe9 Author: Devin Matthews Date: Thu Sep 29 18:52:08 2022 -0500 Fix some bugs in bli_pool.c (#670) Details: - Add a check for premature pool exhaustion when checking in blocks via bli_pool_checkin_block(). This detects "double-free" and other bad conditions that don't necessarily result in a segfault. - Make sure to copy all block pointers when growing the pool size. Previously, checked-out block pointers (which are guaranteed to be set to NULL) were not being copied, leading to the presence of uninitialized data. commit 42d0e66318b186d25eeb215b40ce26115401ed8b Author: Devin Matthews Date: Thu Sep 29 17:38:02 2022 -0500 Add AddressSanitizer (-fsanitize=address) option. (#669) Details: - Added support for AddressSanitizer (ASan), a compiler-integrated memory error detector. The option (disabled by default) enables compiling and linking with the -fsanitize=address flag supported by clang, gcc, and probably others. This flag is employed during compilation of all BLIS source files *except* for optimized kernels, which are exempted because ASan usually requires an extra register, which violates the constraints for many gemm microkernels. - Minor whitespace, comment, ordering, and configure help text updates. commit b861c71b50c6d48cb07282f44aa9dddffc1f1b3f Author: Devin Matthews Date: Fri Sep 23 13:22:27 2022 -0500 Add consistent NaN/Inf handling in sumsqv. (#668) Details: - Changed sumsqv implementation as follows: - If there is a NaN (either real or imaginary), then return a sum of NaN and unit scale. - Else, if there is an Inf (either real or imaginary), then return a sum of +Inf and unit scale. - Otherwise behave as normal. commit ee81efc7887374c974a78bfb3e0865776b2f97a8 Author: Field G. 
Van Zee Date: Thu Sep 22 19:15:07 2022 -0500 Parameterized test/3 drivers via command line args. (#667) Details: - Rewrote the drivers in test/3, the Makefile, and the runme.sh script so that most of the important parameters, including parameter combo, datatype, storage combo, induced method, problem size range, dimension bindings, number of repeats, and alpha/beta values can be passed in via command line arguments. (Previously, most of these parameters were hard-coded into the driver source, except a few that were hard-coded into the Makefile.) If no argument is given for any particular option, it will be assigned a sane default. Either way, the values employed at runtime will be printed to stdout before the performance data in a section that is commented out with '%' characters (which is used by matlab and octave for comments), unless the -q option is given, in which case the driver will proceed quietly and output only performance data. Each driver also provides extensive help via the -h option, with the help text tailored for the operation in question (e.g. gemm, hemm, herk, etc.). In this help text, the driver reminds the user which implementation it was linked to (e.g. blis, openblas, vendor, eigen). Thanks to Jeff Diamond for suggesting this CLI-based reimagining of the test/3 drivers. - In the test/3 drivers: converted cpp macro string constants, as well as two string literals (for the opname and pc_str) used in each test driver, to global (or static) const char* strings, and replaced the use of strncpy() for storing the results of the command line argument parsing with pointer copies from the corresponding strings in argv. This works because the argv array is guaranteed by the C99 standard to persist throughout the life of the program. This new approach uses less storage and executes faster. Thanks to Minh Quan Ho for recommending this change. - Renamed the IMP_STR cpp macro that gets defined on the command line, via the test/3/Makefile, to IMPL_STR. 
- Updated runme.sh to set the problem size ranges for single-threaded and multithreaded execution independently from one another, as well as on a per-system basis. - Added a 'quiet' variable to runme.sh that can easily toggle quiet mode for the test drivers' output. - Very minor typecast fix in call to bli_getopt() in bli_utils.c. - In bli_getopt(), changed the nextchar variable from being a local static variable to a field of the getopt_t state struct. (Not sure why it was ever declared static to begin with.) - Other minor changes to bli_getopt() to accommodate the rewritten test drivers' command line parsing needs. commit 036a4f9d822df25a76a653e70be76fb02284d3d3 Author: Field G. Van Zee Date: Thu Sep 22 18:36:50 2022 -0500 Refactored some rntm_t management code. (#666) Details: - Separated the "sanitizing" code from the auto-factorization code in bli_rntm_set_ways_from_rntm() and _rntm_set_ways_from_rntm_sup(). The sanitizing code now resides in bli_rntm_sanitize() while the factorization code resides in bli_rntm_factorize() and bli_rntm_factorize_sup(). (There are two different functions because the conventional and sup factorization codes are currently somewhat different.) Also note that the factorization code now relies on the .auto_factor field to have already been set, either during rntm_t initialization or when the rntm_t was previously updated and sanitized. So rather than locally determining whether to auto-factorize, those functions just read the .auto_factor field and proceed accordingly. - Refactored and removed most code from bli_thread_init_rntm_from_env(). This function now reads the environment variables needed to set nt, jc, pc, ic, jr, and ir; sets them into the global rntm_t; and then calls bli_rntm_sanitize() in order to make sure that the contents are in a "good" state. Thanks to Devin Matthews for suggesting this refactoring.
- Redefined bli_rntm_set_num_threads() and bli_rntm_set_ways() such that if multithreading is disabled at compile time (that is, if the cpp macro BLIS_ENABLE_MULTITHREADING is undefined), they ignore the caller's request and instead clear the nt and ways fields. - Redefined bli_thread_set_num_threads() and bli_thread_set_ways() such that if multithreading is disabled at compile time (that is, if the cpp macro BLIS_ENABLE_MULTITHREADING is undefined), they ignore the caller's request and do nothing. - Redefined bli_rntm_set_num_threads() and bli_rntm_set_ways() as true functions rather than static inline functions. - In bli_rntm.c, statically initialize the global_rntm global variable via the BLIS_RNTM_INITIALIZER macro. - In bli_rntm.h, defined bli_rntm_clear_auto_factor(), which sets the .auto_factor field of the rntm_t to FALSE. - Reorganized order of some inline function definitions in bli_rntm.h. - Changed the default value given to the .auto_factor field by the BLIS_RNTM_INITIALIZER macro from TRUE to FALSE. - Call bli_rntm_clear_auto_factor() instead of bli_rntm_set_auto_factor_only() in bli_rntm_init(). - Comment/whitespace updates. commit a1a5a9b4cbef9208da494c45a2f933a8e82559ac Author: Field G. Van Zee Date: Wed Sep 21 18:31:01 2022 -0500 Implemented support for fat multithreading. (#665) Details: - Allow the user to configure BLIS in such a way that multiple threading implementations get compiled into the library, with one of those implementations chosen at runtime. For now, there are only three implementations available: OpenMP, pthreads, and single. (Here, 'single' merely refers to single-threaded mode.) The configure script now allows the user to give the -t option with a comma-separated list of values, such as '-t openmp,pthreads'. The first value in the list will always be the default at library initialization time, and 'single' is always silently appended to the end of the list. 
The user can specify which implementation should execute in one of three ways: by setting the BLIS_THREAD_IMPL environment variable prior to launch; by calling the bli_thread_set_thread_impl() global runtime API; or by encoding their choice into a rntm_t that is passed into one of the expert interfaces. Any of these three choices overrides the initialization-time default (i.e., the first value listed to the -t configure option). Requesting an implementation that was not compiled into the library will result in an error message followed by bli_abort(). - Relocated the 'auto' logic for the -t option from the top-level Makefile to the configure script. (Currently, this logic is pretty dumb, choosing 'openmp' for gcc and icc, and 'pthreads' for clang.) - Defined a new 'timpl_t' enum in bli_type_defs.h, with three valid values: BLIS_SINGLE, BLIS_OPENMP, BLIS_POSIX. - Reorganized the thrcomm_t struct into a single definition with two preprocessor blocks, one each for additional fields needed by OpenMP and pthreads. - Added timpl_t argument to bli_thrcomm_bcast(), bli_thrcomm_barrier(), bli_thrcomm_init(), and bli_thrcomm_cleanup(), which these functions need since they are now wrappers that choose the implementation-specific function corresponding to the currently enabled threading implementation. - Added rntm_t* to bli_thread_broadcast(), bli_thread_barrier() so that those functions can pass the timpl_t value into bli_thrcomm_bcast() and bli_thrcomm_barrier(), respectively. - Defined bli_env_get_str() in bli_env.c to allow the querying of BLIS_THREAD_IMPL (which, unlike BLIS_NUM_THREADS and friends, is expected to be a string). - Defined bli_thread_get_thread_impl(), bli_thread_set_thread_impl() to get and set the current threading implementation at runtime. - Defined bli_rntm_thread_impl() and bli_rntm_set_thread_impl() to query and set the threading implementation within a rntm_t. Also choose BLIS_SINGLE as the default value when initializing rntm_t structs.
- Added bli_info_get_*() functions to query whether OpenMP or pthreads would be chosen as the default at init-time. Note that this only tests whether OpenMP or pthreads is the first implementation in the list passed to the threading configure option (-t) and is *not* the same as querying which implementation is currently selected, since that can be influenced by BLIS_THREAD_IMPL and/or bli_thread_set_thread_impl(). - Changed l3int_t to l3int_ft. - Updated docs/Multithreading.md to document the new behavior. - Updated sandbox/gemmlike and addon/gemmd to work with the new fat threading feature. This included a few bugfixes to bring the codes up to date, as necessary. - Comment, whitespace updates. commit 89df7b8fa3a3e47ab2fc10ac4d65d0b9fde16942 Author: Devin Matthews Date: Sun Sep 18 18:46:57 2022 -0500 De-templatized _sup_var1n2m.c; unified _sup_packm_a/b(). (#659) Details: - Re-expressed the two variants in frame/3/bli_l3_sup_var1n2m.c as a single function each that performs char* pointer arithmetic rather than four datatype-specific functions. Did the same for the functions in bli_l3_sup_packm_a.c and _sup_packm_b.c, and then unified the two into a single set of functions for packing either A or B, which now resides in bli_l3_sup_packm.c. - Pre-grow the cntl_t tree in both bli_l3_sup_var1n2m.c variants rather than grow them incrementally. - Relocated empty-matrix and scale-by-beta early return handling from bli_gemm_front() and bli_gemmt_front() to their _ex() counterparts. - Comment, whitespace updates. commit fb91337eff1ee2098f315a83888f6667b3a56f86 Author: Field G. Van Zee Date: Thu Sep 15 19:08:10 2022 -0500 Fixed a harmless pc_nt bug in 05a811e. Details: - Added missing curly braces around some statements in bli_rntm.c, one of which needed them in order for the relevant code to be executed in the intended way. The consequence of 05a811e omitting those braces was that a statement (pc_nt = 1;) was executed more often than it needed to be.
- Also adjusted the analogous code in bli_thread.c to match that of bli_rntm.c. commit e86076bf4461d1a78186fb21ba8320cfb430f62c Author: Field G. Van Zee Date: Thu Sep 15 14:22:59 2022 -0500 Test the 'gemmlike' sandbox via AppVeyor. (#664) Details: - Added a fifth test to our .appveyor.yml that enables the 'gemmlike' sandbox with OpenMP enabled (via clang, the 'auto' configuration target, and building to a static library). Thanks to Jeff Diamond for pointing out that this test would be useful. commit 63177dca48cb7d066576d884da4a7a599ececebf Author: Field G. Van Zee Date: Thu Sep 15 11:21:26 2022 -0500 Fixed gemmlike sandbox bug introduced in 7c07b47. Details: - Fixed a bug in the 'gemmlike' sandbox that was introduced in 7c07b47. This bug was the result of the fact that the gemmlike implementation uses bli_thrinfo_sup_grow() to grow its thrinfo_t tree, but the aforementioned commit added an optimization that kicks in when the rntm_t .pack_a and .pack_b fields are both FALSE. Those fields were originally added only for sup execution; for large code path, they are intended to be ignored. But the default initial state of a rntm_t has those fields set to FALSE, which was inadvertently activating the optimization (which targeted single-threaded cases only) and would cause multithreaded use cases of 'gemmlike' to segfault. The fix took the form of setting the .pack_a and .pack_b fields to TRUE in bls_gemm_ex(). - Added minimal 'const' and 'const'-casting to 'gemmlike' so that gcc stays quiet. commit 05a811e898b371a76581abd4afa416980cce7db9 Author: Field G. Van Zee Date: Tue Sep 13 19:24:05 2022 -0500 Initialize rntm_t nt/ways fields with 1 (not -1). (#663) Details: - Changed the way that rntm_t structs are initialized, mainly so that the global rntm_t that is set via environment variables at runtime may be queried by the application prior to any computation taking place.
(Strictly speaking, the application may already query these fields, but they do not always contain valid values and often contain -1 when they are unset.) These changes also served to clarify how these parameters are treated, and homogenized the implementations of bli_rntm_set_ways_from_rntm(), bli_rntm_set_ways_from_rntm_sup(), and bli_thread_init_rntm_from_env(). Special thanks to Jeff Diamond, Leick Robinson, and Devin Matthews for pointing out that the previous behavior was needlessly confusing and could be improved. - The aforementioned modifications also included subtle changes as to what counts as "setting" a loop's ways of parallelism for the purposes of deciding whether to use the ways or the total number of threads. Previously, setting any loop's ways, even to 1, counted in favor of using the ways. Now, only values greater than 1 will count as "setting", and all other values will silently be mapped to 1, with those parameters treated as if they were untouched all along. - Updated bli_rntm.h and bli_thread.c so that any attempt to set the PC_NT variable (or pc_nt field of a rntm_t) will either ignore the request or reassert the value as 1. - Updated bli_rntm_set_ways() so that rather than clear the num_threads field, it is set to the product of all of the per-loop ways of parallelism. - Removed code from test_libblis.c that handled the possibility of unset environment variables when printing out their values. - Removed bli_rntm_equals() inline function from bli_rntm.h, which has long been disabled. - Updates to docs/Multithreading.md related to the aforementioned changes. - Comment updates. commit fd885cf98f4fe1d3bc46468e567776c37c670fcc Author: Field G. Van Zee Date: Tue Sep 13 11:50:23 2022 -0500 Use kernel CFLAGS for 'kernels' subdirs in addons. (#658) Details: - Updated Makefile and common.mk so that the targeted configuration's kernel CFLAGS are applied to source files that are found in a 'kernels' subdirectory within an enabled addon. 
For now, this behavior only applies when the 'kernels' directory is at the top level of the addon directory structure. For example, if there is an addon named 'foobar', the source code must be located in addon/foobar/kernels/ in order for it to be compiled with the target configuration's kernel CFLAGS. Any other source code within addon/foobar/ will be compiled with general-purpose CFLAGS (the same ones that were used on all addon code prior to this commit). Thanks to AMD (esp. Mithun Mohan) for suggesting this change and catching an intermediate bug in the PR. - Comment/whitespace updates. commit cb74202db39dc8cb81fdd06f8a445f8837e27853 Author: Field G. Van Zee Date: Tue Sep 13 11:46:24 2022 -0500 Fixed incorrect sizeof(type) in edge case macros. (#662) Details: - In bli_edge_case_macro_defs.h, the GEMM_UKR_SETUP_CT_PRE() and GEMMTRSM_UKR_SETUP_CT_PRE() macros previously declared their temporary ct microtiles as: PASTEMAC(ch,ctype) _ct[ BLIS_STACK_BUF_MAX_SIZE / sizeof( PASTEMAC(ch,type) ) ] \ __attribute__((aligned(alignment))); \ The problem here is that sizeof( PASTEMAC(ch,type) ) evaluates to things like sizeof( BLIS_DOUBLE ), not sizeof( double ), and since BLIS_DOUBLE is an enum, it is typically an int, which means the sizeof() expression is evaluating to the wrong value. This was likely a benign bug, though, since BLIS does not support any computational datatypes that are smaller than sizeof( int ), which means the ct array would be *over*-allocated rather than underallocated. Thanks to @moon-chilled for identifying and reporting this bug in #624. - CREDITS file update. commit 6e5431e8494b06bd80efcab3abf0a6456d6c0381 Author: Devin Matthews Date: Sat Sep 10 15:16:58 2022 -0500 Fix line number issue in flattened blis.h. (#660) Details: - Updated the top-level Makefile so that it invokes flatten-headers.py without the -c option, which was requesting that comments be stripped (since comment stripping is disabled by default).
- Updated flatten-headers.py to accept a new option (-l) to enable insertion of #line directives into the output file. This new option is enabled by default. - Also added logic to flatten-headers.py that outputs a warning if both comment stripping and line numbers are requested since the comment stripping will cause the line numbers to become inaccurate. commit 4afe0cfdab0e069e027f97920ea604249e34df47 Author: Field G. Van Zee Date: Thu Sep 8 18:33:20 2022 -0500 Defined invscalv, invscalm, invscald operations. (#661) Details: - Defined invert-scale (invscal) operation on vectors (level-1v), matrices (level-1m), and diagonals (level-1d). - Added test modules for invscalv and invscalm to the testsuite. - Updated BLISObjectAPI.md and BLISTypedAPI.md API documentation to reflect the new operations. Also updated KernelsHowTo.md accordingly. - Renamed 'beta' to 'alpha' in scalv and scalm testsuite modules (and input.operations files) so that the parameter name matches the parameter used in the documentation. commit a87eae2b11408b556e562f1b04e673c6cd1612bc Author: Field G. Van Zee Date: Tue Sep 6 18:04:09 2022 -0500 Added '-q' quiet mode option to testsuite. (#657) Details: - Added support for a '-q' command line option to the testsuite. This option suppresses most informational output that would normally clutter up the screen. By default, verbose mode (the previous status quo) will be operative, and so quiet mode must be requested. commit dfa54139664a42d29774e140ec9e5597af869a76 Author: RuQing Xu Date: Tue Aug 30 08:07:50 2022 +0800 Arm64 dgemmsup with extended MR&NR (#655) Details: - Since the number of registers in NEON is large but their lengths are short, I'm here extending both MR and NR. - The approach is to represent the C microtile in registers optionally in columns, so for sizes like 6x7m, the 'crr' kernel is the default with 'rrr' supported through an in-register transpose. - A few asm kernels are crafted for 'rv' to complete this extended size support. 
- For 'rd' I'm still relying heavily on C99 intrinsic kernels with branching so the performance might not be optimal. (Sorry for that.) - So far, these changes only affect the 'firestorm' subconfig. - This commit also contains row-preferential s12x8 and d6x8 gemm ukernels. These microkernels are templatized versions of the existing s8x12 and d6x8 ukernels defined in bli_gemm_armv8a_asm_d6x8.c. commit 9e5594ad5fc41df8ef2825a025d7844ac2275c27 Author: Field G. Van Zee Date: Thu Aug 11 14:36:38 2022 -0500 Temporarily disabled #line directives from 6826c1c. Details: - Commented out the inclusion of #line preprocessor directives in the flattened header output provided by build/flatten-headers.py. This output was added recently in 6826c1c, but was later found to have thrown off the line numbering referenced by compiler warnings and errors (possibly due to license comment blocks, which are stripped from source headers as they are inlined into the monolithic header). commit 775148bcdbb1014b4881a76306f35f5d0fedecbe Author: jdiamondGitHub Date: Fri Aug 5 12:01:24 2022 -0500 Updated ARMv8a kernels to fix 2 prefetching issues. (#649) Details: - The ARMv8a dgemm/sgemm microkernels had 2 prefetching issues that impacted performance on modern ARM platforms. The most significant issue was that only a single prefetch per C tile column was issued. When a column of C was not cache aligned, the second cache line would not be prefetched at all, forcing the kernel to wait for an entire load to update elements of C. This happened with roughly 50% of the C prefetches. The fix was to have two prefetches per column, spaced 64 bytes (1 cache line) apart. - A secondary performance issue was that all the C prefetch instructions were issued sequentially at the beginning of the kernel call. This caused a noticeable performance slowdown. Interleaving the prefetch calls every 2-3 instructions in the prologue code solved the issue. commit bbaf29abd942de47a3a99a80a67d12bab41b27db Author: Field G. 
Van Zee Date: Thu Aug 4 17:51:37 2022 -0500 Very minor variable updates to common.mk. Details: - Fixed a harmless bug that would have allowed C++ headers into the list of header suffixes specifically reserved for C99 headers. In practice, this would have had no substantive effect on anything since the core BLIS framework does not use C++ headers. commit a48e29d799091a833213efeafaf2d342ebdafde9 Author: Field G. Van Zee Date: Thu Jul 28 10:11:07 2022 -0500 CREDITS file update. Details: - Thanks to Kihiro Bando for assisting with issue #644. commit 5b298935de7f20462bfad1893ed34ecd691cec5a Author: Field G. Van Zee Date: Wed Jul 27 19:14:15 2022 -0500 Removed buggy cruft from power10 subconfig. Details: - Removed #defines for BLIS_BBN_s and BLIS_BBN_d from bli_kernel_defs_power10.h. These were inadvertently set in ae10d949 because the power10 subconfig was registering bb packm ukernels, but only for 6xk (power10 uses s8x16 and d8x8 ukernels) and only because the original author (probably) copy-pasted from power9 when getting started. That 6xk packm registration was effectively "dead code" prior to ae10d949, but was then mistaken as not-dead code during the ae10d949 refactor. These improper bb factors may have been causing bugs in power10 builds. Thanks to Nicholai Tukanov for helping remind me what the power10 subconfig was supposed to look like. - Removed extraneous microkernel preference registrations from power10 subconfig. Preferences for single and double complex gemm were being registered despite there being no complex gemm ukernels registered to go with them. Similarly, there were trsm preferences registered without any trsm ukernels registered (and BLIS doesn't actually use a preference for the trsm ukernel anyway). These extraneous registrations were almost surely not hurting anything, even if they were quite misleading.
commit 56de31b00fa0f1ba866321817cd1e5d83000ff11 Author: Devin Matthews Date: Wed Jul 27 13:54:17 2022 -0500 Disable modification of KC in the gemmsup kernels. (#648) This led to a ~50% performance reduction for certain gemm operations (but not others?). See #644 for example. commit 4dde947e2ec9e139c162801320c94e6a01a39708 Author: Field G. Van Zee Date: Tue Jul 26 17:29:32 2022 -0500 Fixed out-of-bounds bug in sup s6x16m haswell kernel. Details: - Fixed another out-of-bounds read access bug in the haswell sup assembly kernels. This bug is similar to the one fixed in 17b0caa and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh Kannan for reporting this bug (and a suitable fix) in #635. - CREDITS file update. commit 6826c1cdfba855513786d9e3d606681316453398 Author: Devin Matthews Date: Mon Jul 25 18:21:05 2022 -0500 Add `#line` directives to flattened `blis.h`. (#643) Details: - Modified flatten-headers.py so that #line directives are inserted into the flattened blis.h file. This facilitates easier debugging when something is amiss in the flattened blis.h because the compiler will be able to refer to the line number within the original constituent header file (which is where the fix would go) rather than the line number within the flattened header (which is not as helpful). commit af3a41e02534befdae026377592ce437bab83023 Author: Alexander Grund Date: Thu Jul 21 18:05:48 2022 +0200 Add autodetection for POWER7, POWER9 & POWER10 (#647) Read from `/proc/cpuinfo` as done for ARM. Fixes #501 commit 17b0caa2b2bff439feb6d2b39cfa16e7591882b0 Author: Field G. Van Zee Date: Thu Jul 14 17:55:34 2022 -0500 Fixed out-of-bounds read in haswell gemmsup kernels. Details: - Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2() kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4) in situations where only two elements are guaranteed to exist. 
(These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. Thanks to Daniël de Kok for reporting these bugs in #635, and to Bhaskar Nallani for proposing the fix. - CREDITS file update. commit cc260fd7068f0fe449d818435aa11adb14c17fed Author: Field G. Van Zee Date: Wed Jul 13 16:16:01 2022 -0500 Allow uniform max problem sizes in test/3/runme.sh. Details: - Tweaked test/3/runme.sh so that the test driver binaries for single-threaded (st), single-socket (1s), and dual-socket (2s) execution can be built using identical problem size ranges. Previously, this was not possible because runme.sh used the maximum problem size, which was embedded into the binary filename, to tell the three classes of binaries apart from one another. Now, runme.sh uses the binary suffix ("st", "1s", or "2s") to tell them apart. This required only a few changes to the logic, but it also required a change in format to the threading config strings themselves (replacing the max problem size with "st", "1s", or "2s"). Thanks to Jeff Diamond for inspiring this improvement. - Comment updates. commit 9b1beec60be31c6ea20b85806d61551497b699e4 Author: bartoldeman Date: Mon Jul 11 20:15:12 2022 -0400 Use BLIS_ENABLE_COMPLEX_RETURN_INTEL in blastest files (#636) Details: - Fixed a crash that occurs when either cblat1 or zblat1 are linked with a build of BLIS that was compiled with '--complex-return=intel'. This fix involved inserting preprocessor macro guards based on BLIS_ENABLE_COMPLEX_RETURN_INTEL into blastest/src/cblat1.c and blastest/src/zblat1.c to correctly handle situations where BLIS is compiled with Intel/f2c-style calling conventions for complex numbers.
- Updated blastest/src/fortran/run-f2c.sh so that future executions will insert the aforementioned cpp macro conditional where appropriate. commit 98d467891b74021ace7f248cb0856bec734e39b6 Author: bartoldeman Date: Mon Jul 11 19:40:53 2022 -0400 Change complex_return='intel' for ifx. (#637) Details: - When checking the version string of the Fortran compiler for the purposes of determining a default return convention for complex domain values, grep for "IFORT" instead of "ifort" since that string is common to both the 'ifx' and 'ifort' binaries provided by Intel: $ ifx --version ifx (IFORT) 2022.1.0 20220316 Copyright (C) 1985-2022 Intel Corporation. All rights reserved. $ ifort --version ifort (IFORT) 2021.6.0 20220226 Copyright (C) 1985-2022 Intel Corporation. All rights reserved. commit ffde54cc5c334aca8eff4d6072ba49496bf3104c Author: jdiamondGitHub Date: Mon Jul 11 16:47:30 2022 -0500 Minor changes to .gitignore and LICENSE files. (#642) Details: - Macs create .DS_Store files in every directory visited. Updated .gitignore file so these files won't be reported as untracked by 'git status'. - Added Oracle Corporation to the LICENSE file. - Updated UT copyright on behalf of SHPC. commit 7cba7ce3dd1533fcc4ca96ac902bdf218686139a Author: Field G. Van Zee Date: Fri Jul 8 11:15:18 2022 -0500 Minor cleanups, comment updates to bli_gks.c. Details: - Removed a redundant registration of 'a64fx' subconfig in bli_gks_init(). - Reordered registration of 'armsve', 'a64fx', and 'firestorm' subconfigs. Thanks to Jeff Diamond for his input on this reordering. - Comment updates to bli_gks.c and arch_t enum in bli_type_defs.h. commit 667f201b7871da68622027d02bd6b7da3262f8e8 Author: Field G. Van Zee Date: Thu Jul 7 16:44:21 2022 -0500 Fixed type bug in bli_cntx_set_ukr_prefs(). Details: - Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the num_t value read from va_args() down to a bool before being stored within the cntx_t. 
This bug was introduced on April 6th 2022, in ae10d94. This caused the ukernel preferences for double real and double complex to go unchanged while the preferences for single real and single complex were corrupted by the former datatypes' preference values. The bug manifested as degraded performance for subconfigurations that registered column-preferential ukernels. The reason is that the erroneous preferences trigger unnecessary transpositions in the operation, which forces the gemm ukernel to compute on matrices that are not stored according to its preference. Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their extensive efforts and assistance in tracking down this issue. - Augmented the informational header that is output by the testsuite to include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu]. - CREDITS file update. commit d429b6bfced21a63bf711224ac402f93f0080b52 Author: Isuru Fernando Date: Tue Jun 28 15:34:10 2022 -0500 Support clang targeting MinGW (#639) * Support clang targeting MinGW * Fix pthread linking commit d93df023348144e091f7b3e3053995648f348aa7 Author: Field G. Van Zee Date: Wed Jun 15 14:09:49 2022 -0500 Removed unused dt arg in bli_gks_query_ind_cntx(). Details: - Removed the num_t datatype argument from bli_gks_query_ind_cntx(). This argument stopped being needed by the function in commit e9da642. Its only use in bli_gks_query_ind_cntx() was to be passed through to the context initialization function for the chosen induced method, but even then, commit log notes from e9da642 indicate that I could not recall why the datatype argument was ever needed by the context init function to begin with. - Updated all invocations of bli_gks_query_ind_cntx() to omit the dt argument. Most of these invocations resided in various standalone test drivers (and the testsuite). commit 56772892450cc92b3fbd6a9d0460153a43fc47ab Author: Field G. Van Zee Date: Wed Jun 1 10:49:33 2022 -0500 Added SMU citation to README.md intro.
Details: - Added a citation to SMU and the Matthews Research Group to the general attribution of maintainership and development in the Introduction of the README.md file. Thanks to Robert van de Geijn and Devin Matthews for suggesting this change. commit 4603324eb090dfceaad3693a70b2d60544036aa8 Author: Field G. Van Zee Date: Thu May 19 14:07:03 2022 -0500 Init/finalize via bli_pthread_switch_t API (#634). Details: - Defined and implemented a new pthread-like abstract datatype and API in bli_pthread.c. The new type, bli_pthread_switch_t, is similar to bli_pthread_once_t in some respects. The idea is that like a switch in your home that controls a light or ceiling fan, it can either be on or off. The switch starts in the off state. Moving from one state to the other (on to off; off to on) causes some action (i.e., a startup or shutdown function) to be executed. Trying to move from one state to the same state (on to on; off to off) is safe in that it results in no action. Unlike bli_pthread_once(), the API for bli_pthread_switch_t contains both _on() and _off() interfaces. Also, unlike the _once() function, the _on() and _off() functions return error codes so that the 'int' error code returned from the startup or shutdown functions may be passed back to the caller. Thanks to Devin Matthews for his input and feedback on this feature. - Replaced the previous implementation of bli_init_once() and bli_finalize_once() -- both of which used bli_pthread_once() -- with ones that rely upon bli_pthread_switch_on() and _switch_off(), respectively. This also required updating the return types of _init_apis() and _finalize_apis() to match the function pointer type required by bli_pthread_switch_on()/_switch_off(). - Comment updates. commit 64a9b061f6032e2b59613aecdbe7bb52161605c1 Author: Field G. Van Zee Date: Tue May 10 14:54:22 2022 -0500 Fixed misspelling of 'xpbys' in gemm macrokernel. 
Details: - Fixed a functionally harmless typo in bli_gemm_ker_var2.c where a few instances of the substring "xpbys" were misspelled as "xbpys". The misspellings were harmless because they were consistent, and because they referenced only local symbols. commit 1c733402a95ab08b20f3332c2397fd52a2627cf6 Author: Jed Brown Date: Thu Apr 28 11:58:44 2022 -0600 Fix version check for znver3, which needs gcc >= 10.3 (#628) Apple's clang-12 lacks znver3 support, unlike upstream clang-12. commit 6431c9e13b86e4442b6aacba18a0ace12288c955 Author: Field G. Van Zee Date: Thu Apr 14 13:01:24 2022 -0500 Added missing 'const' to zen bli_gemm_small.c. Details: - Added missing 'const' qualifiers to signatures of functions defined in kernels/zen/3/bli_gemm_small.c. This fixes compile-time errors when targeting 'zen3' subconfig (which apparently is enabling AMD's gemm_small code path by default). Thanks to Devin Matthews for reporting this error. commit 9fea633748ed27ef3853bba7cd955690c61092b4 Author: Devin Matthews Date: Wed Apr 13 15:59:06 2022 -0500 Partial addition of 'const' to all interfaces above the (micro)kernels. (#625) Details: - Added 'const' qualifier to applicable function arguments wherever the pointed-to object is not internally modified. This change affects all interfaces that reside above the level of the (micro)kernels. - Typecast certain function return values to discard 'const' qualifier. - Removed 'restrict' from various arguments, including cntx_t*, auxinfo_t*, rntm_t*, thrinfo_t*, mem_t*, and others. - Removed parts of some APIs, such as bli_cntx_*(), due to limited use. - Merged some variable declarations with their corresponding initialization statements. - Whitespace changes. commit ae10d9495486f589ed0320f0151b2d195574f1cf (origin/amd) Author: Devin Matthews Date: Wed Apr 6 20:31:11 2022 -0500 Simplify and rewrite reference packm kernels.
(#610) Details: - Reorganized the way kernels are stored within the cntx_t structure so that rather than having a function pointer for every supported size of unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm kernels per datatype: one to pack MRxk micropanels and one to pack NRxk micropanels. - NOTE: The "bb" (broadcast B) reference kernels have been merged into the "standard" kernels (packm [including 1er and unpackm], gemm, trsm, gemmtrsm). This replication factor is controlled by BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a replication factor of 1 has been tested. armsve also needs testing since the MR value isn't available as a macro. - Simplified the bli_cntx_*() APIs to conform to the new unified kernel array within the cntx_t. Updated existing bli_cntx_init_() function definitions for all subconfigurations. - Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t, etc.) into one kernel id type: ukr_t. - Various edits, updates, and rewrites of reference kernels pursuant to the aforementioned changes. - Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz], and friends) in bli_kernel_macro_defs.h, but only when the macro BLIS_IN_REF_KERNEL is defined by the build system. - Loose ends: - Still need to update documentation, including: - docs/ConfigurationHowTo.md - docs/KernelsHowTo.md to reflect changes made in this commit. commit b3e674db3c05ca586b159a71deb1b61d701ae5c9 Author: Field G. Van Zee Date: Mon Apr 4 17:31:02 2022 -0500 README.md update to link to releases page. commit 69fa915464c52f09a5971a60f521900d31a34e69 Author: Field G. Van Zee Date: Fri Apr 1 08:47:46 2022 -0500 Fixed broken "tagged releases" link in README.md. commit 88cab8383ca90ddbb4cf13e69b7d44a1663a4425 Author: Field G. Van Zee Date: Fri Apr 1 08:12:06 2022 -0500 CHANGELOG update (0.9.0) commit 14c86f66b20901b60ee276da355c1b62642c18d2 (tag: 0.9.0) Author: Field G. 
Van Zee Date: Fri Apr 1 08:12:06 2022 -0500 Version file update (0.9.0) commit 99bb9002f1aff598d347eae2821a3f7bdd1f48e8 Author: Field G. Van Zee Date: Fri Apr 1 08:10:59 2022 -0500 ReleaseNotes.md update in advance of next version. commit bee7678b2558a691ac850819dbe33fefe4fdbee3 Author: Field G. Van Zee Date: Thu Mar 31 14:09:39 2022 -0500 CREDITS file update. commit cf06364327bd2d21d606392371ff3c5962bee5ba Author: Field G. Van Zee Date: Tue Mar 29 16:18:25 2022 -0500 Fixed typo in BLAS gemm3m call to _check(). Details: - Fixed an unresolved symbol issue leftover from #590 whereby ?gemm3m_() as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which does not exist. It should have simply called the _check() function for gemm. commit 1ec020b33ece1681c0041e2549eed2bd4c6cf356 Author: Dipal M Zambare <71366780+dzambare@users.noreply.github.com> Date: Wed Mar 30 02:45:36 2022 +0530 AMD kernel updates; frame-specific AMD updates. (#597) Details: - Allow building BLIS with certain framework files (each with the '_amd' suffix) that have been customized by AMD for Zen-based hardware. These customized files were derived from portable versions of the same files (i.e., those without the '_amd' suffix). Whether the portable or AMD- specific files are compiled is now controlled by a new configure option, --[en|dis]able-amd-frame-tweaks. This option is disabled by default in vanilla BLIS, though AMD may choose to enable it by default in their fork. For now, the added AMD-specific files are: - bli_gemv_unf_var2_amd.c - bla_copy_amd.c - bla_gemv_amd.c These files reside in 'amd' subdirectories found within the directory housing their generic counterparts. - Register optimized real-domain copyv, setv, and swapv kernels in bli_cntx_init_zen.c. - Various minor updates to level-1v kernels in 'zen' kernel set. 
- Added caxpyf kernel as well as saxpyf and multiple daxpyf kernels to the 'zen' kernel set. - If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim, call gemv instead and return early. - Combined variable declarations with their initialization in various level-2 and level-3 BLAS compatibility files, and also inserted 'const' qualifier in those same declaration statements. - Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ . - Added copyv and swapv test drivers to 'test' directory. - Whitespace, comment changes. commit 0db2bd5341c5c3ed5f1cc2bffa90952735efa45f Author: Bhaskar Nallani Date: Fri Mar 25 05:11:55 2022 +0530 Added BLAS/CBLAS APIs for gemm3m. (#590) Details: - Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply invoke the 1m implementation unconditionally. (Note that these APIs bypass sup handling.) - Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h. - Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h. - Relocated: frame/compat/cblas/src/cblas_?gemmt.c files into frame/compat/cblas/src/extra/ - Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ . - Minor reorganization of prototypes and cpp macro directives in bli_blas.h, cblas.h, and cblas_f77.h. - Trivial whitespace change to cblas_zgemm.c. commit d6810000e961fe807dc5a7db81180a8355f3eac0 Author: Devin Matthews Date: Mon Mar 14 10:29:54 2022 -0500 Update Multithreading.md Add notes about `BLIS_IR_NT` (should typically be 1) and `BLIS_JR_NT` (should typically be small, e.g. <= 4). [ci skip] commit f1dbb0e514f53a3240d3a6cbdc3306b01a2206f5 Author: Field G. Van Zee Date: Fri Mar 11 13:38:28 2022 -0600 Trivial whitespace change; commit log addendum. Details: - A co-attribution to Mithun Mohan was inadvertently omitted from the commit log for headline change in the previous commit, 7c07b47. commit 7c07b477e432adbbce5812ed9341ba3092b03976 Author: Field G.
Van Zee Date: Fri Mar 11 13:28:50 2022 -0600 Avoid gemmsup barriers when not packing A or B. (#622) Details: - Implemented a multithreaded optimization for the special (and common) case of employing the gemmsup code path when the user requests (implicitly or explicitly) that neither A nor B be packed during computation. This optimization takes the form of a greatly reduced code branch in bli_thrinfo_sup_create_for_cntl(), which avoids a broadcast and two barriers, and results in higher performance when obtaining two-way or higher parallelism within BLIS. Thanks to Bhaskar Nallani of AMD for proposing this change via issue #605. - Added an early return branch to bli_thrinfo_create_for_cntl() that detects and quickly handles cases where no parallelism is being obtained within BLIS (i.e., single-threaded execution). Note that this special case handling was/is already present in bli_thrinfo_sup_create_for_cntl(). - CREDITS file update. commit cad10410b2305bc0e328c5f2517ab02593b53428 Author: Ivan Korostelev Date: Thu Mar 10 09:58:14 2022 -0600 POWER10: edge cases in microkernel (#620) Use new API for POWER10 gemm microkernel commit 71851a0549276b17db18a0a0c8ab4f54493bf033 Author: Field G. Van Zee Date: Tue Mar 8 17:38:09 2022 -0600 Fixed level-3 performance bug in haswell ukernels. Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug. commit 84732bf95634ac606c5f2661d9474318e366c386 Author: Field G. Van Zee Date: Mon Feb 28 12:19:31 2022 -0600 Revamp how tools are handled/checked by configure. 
Details: - Consolidate handling of tools that are specifiable via CC, CXX, FC, PYTHON, AR, and RANLIB into one bash function, select_tool_w_env(). - If the user specifies a tool via an environment variable (e.g. CC=gcc) and that tool does not seem valid, print an error message and abort configure, unless the tool is optional (e.g. CXX or FC), in which case a warning message is printed instead. - The definition of "seems valid" above amounts to: - responding to at least one of a basic set of command line options (e.g. --version, -V, -h) if the os_name is Linux (since GNU tools tend to respond to flags such as --version) or if the tool in question is CC, CXX, FC, or PYTHON (which tend to respond to the expected flags regardless of OS) - the binary merely existing for AR and RANLIB on Darwin/OSX/BSD. (These OSes tend to have non-GNU versions of ar and ranlib, which typically do not respond to --version and friends.) - This PR addresses #584. Thanks to Devin Matthews for suggesting some of the changes in this commit. commit d5146582b1f1bcdccefe23925d3b114d40cd7e31 Author: RuQing Xu Date: Wed Feb 23 03:35:46 2022 +0900 ArmSVE Ensure Non-zero Block Size (#615) Fixes #613. There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically. commit 4d8352309784403ed6719528968531ffb4483947 Author: RuQing Xu Date: Wed Feb 23 01:03:47 2022 +0900 Add armsve to arm64 Metaconfig (#614) Availability of the `armsve` subconfig is controlled by the compiler version (gcc/clang). Tested for SVE and non-SVE. Fixes #612. commit c9700f369aa84fc00f36c4b817ffb7dab72b865d Author: Field G. Van Zee Date: Tue Feb 15 15:36:52 2022 -0600 Renamed SIMD-related macro constants for clarity. 
Details: - Renamed the following macros defined in bli_kernel_macro_defs.h: BLIS_SIMD_NUM_REGISTERS -> BLIS_SIMD_MAX_NUM_REGISTERS BLIS_SIMD_SIZE -> BLIS_SIMD_MAX_SIZE Also updated all instances of these macros elsewhere, including subconfigurations, source code, and documentation. Thanks to Devin Matthews for suggesting this change. commit ee9ff988c49f16696679d4c6cd3dcfcac7295be7 Author: Field G. Van Zee Date: Tue Feb 15 15:01:51 2022 -0600 Move edge cases to gemmtrsm ukrs; doc updates. Details: - Moved edge-case handling into the gemmtrsm microkernel. This required changing the microkernel API to take m and n dimension parameters as well as updating all existing gemmtrsm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. Also updated all existing gemmtrsm kernels in the 'kernels' directory (which for now is limited to haswell and penryn kernel sets, plus native and 1m-based reference kernels in 'ref_kernels') to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Note that the edge-case handling for gemm-like operations had already been relocated into the gemm microkernel in 54fa28b. - Added descriptive comments to GEMM_UKR_SETUP_CT() and related macros in bli_edge_case_macro_defs.h to allow for easier reading. - Updated docs/KernelsHowTo.md to reflect above changes. Also cleaned up the bullet under "Implementation Notes for gemm" that covers alignment issues. (Thanks to Ivan Korostelev for pointing out the confusing and outdated language in issue #591.) - Other minor tweaks to KernelsHowTo.md. commit 25061593460767221e1066f9d720fa6676bbed8f Author: Devin Matthews Date: Sun Feb 13 20:11:55 2022 -0600 Don't use `-Wl,-flat-namespace`. Flat namespaces can cause problems due to conflicting system libraries, etc., so just mark `xerbla_` as a weak symbol on macOS instead.
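The weak-symbol approach mentioned in the commit above can be sketched roughly as follows. This is only an illustration under stated assumptions, not BLIS's actual code: the function name `demo_xerbla` and its signature are hypothetical, and the sketch assumes the GCC/Clang `__attribute__((weak))` extension. A weak definition acts as a default that a strong definition elsewhere in the program (for example, in the application or another library) replaces at link time, without requiring a flat namespace.

```c
#include <stdio.h>

/* Hypothetical default error handler, marked weak (GCC/Clang extension).
 * If another object file provides a strong definition of demo_xerbla,
 * the linker uses that one instead of this default. */
__attribute__((weak)) int demo_xerbla(const char *routine, int info)
{
    /* Report which argument was rejected, then hand the code back. */
    fprintf(stderr, "parameter %d to %s had an illegal value\n", info, routine);
    return info;
}
```

When no overriding definition is linked in, a call such as `demo_xerbla("DGEMM", 3)` runs the default above and returns 3.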
commit 5a4d3f5208d3d8cc1827f8cc90414c764b7ebab3 Author: Devin Matthews Date: Sun Feb 13 17:28:30 2022 -0600 Use -flat_namespace option to link on macOS Fixes #611. commit 26742910a087947780a089360e2baf82ea109e01 Author: Devin Matthews Date: Sun Feb 13 16:53:45 2022 -0600 Update CC_VENDOR logic Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip] commit 2f3872e01d51545c687ae2c8b2650e00552111a7 Author: RuQing Xu Date: Mon Feb 7 17:14:49 2022 +0900 ArmSVE Adopts Label Wrapper For clang (& armclang?) compilation. Hopefully solves #609 . commit 72089bb2917b78d99cf4f27c69125bf213ee54e6 Author: RuQing Xu Date: Sat Feb 5 16:56:04 2022 +0900 ArmSVE Use Predicate in M-Direction No need to query MR during kernel runtime. commit 9cc897f37455d52fbba752e3801f1a9d4a5bfdc1 Author: Ruqing Xu Date: Thu Feb 3 16:40:02 2022 +0000 Fix SVE Compil. commit b5df1811f1bc8212b2cda6bb97b79819afe236a8 Author: RuQing Xu Date: Thu Feb 3 02:31:29 2022 +0900 Armv8a, ArmSVE: Simplify Gen-C commit 35195bb5cea5d99eb3eaf41e3815137d14ceb52d Author: Devin Matthews Date: Mon Jan 31 10:29:50 2022 -0600 Add armclang detection to configure. armclang is treated as regular clang. Fixes #606. [ci skip] commit 0be9282cdccf73342d8571d3f7971a9b0af72363 Author: Field G. Van Zee Date: Wed Jan 26 17:46:24 2022 -0600 Updated zen3 macro constant names. Details: - In config/zen3/bli_family_zen3.h, renamed: BLIS_SMALL_MATRIX_A_THRES_M_GEMMT -> _M_SYRK BLIS_SMALL_MATRIX_A_THRES_N_GEMMT -> _N_SYRK Thanks to Jeff Diamond for helping spot the stale _GEMMT naming. commit 0ab20c0e72402ba0b17fe2c3ed3e16bf2ace0fd3 Author: Jeff Hammond Date: Thu Jan 13 07:29:56 2022 -0800 the Apple local label thing is required by Clang in general @egaudry and I both saw this issue on Linux with Clang 10.
``` Compiling obj/thunderx2/kernels/armv8a/3/sup/bli_gemmsup_rv_armv8a_asm_d4x8m.o ('thunderx2' CFLAGS for kernels) kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c:171:49: fatal error: invalid symbol redefinition " \n\t" ^ :90:5: note: instantiated into assembly here .SLOOPKITER: ^ 1 error generated. ``` Signed-off-by: Jeff Hammond commit 81f93be0561c705ae6823d19e40849facc40bef7 Author: Devin Matthews Date: Mon Jan 10 10:19:47 2022 -0600 Fix row-/column-major pref. in 16x8 haswell sgemm ukr (unused) commit 268ce1f29a717d18304713ecc25a2eafe41838c7 Author: Devin Matthews Date: Mon Jan 10 10:17:17 2022 -0600 Relax alignment constraints Remove alignment of temporary AB buffer in edge case handling macros unless alignment is specifically requested (e.g. Core2, SDB/IVB). Fixes #595. commit 3f2440b0226d5e23a43d12105d74aa917cd6c610 Author: Field G. Van Zee Date: Thu Jan 6 14:57:36 2022 -0600 Added m, n dims to gemmd/gemmlike ukernel calls. Details: - Updated the gemmd addon and the gemmlike sandbox code to use the new microkernel calling sequence, which now includes m and n dimensions so that the microkernel has all the information necessary to handle edge cases. Thanks to Jeff Diamond for catching this, which ideally would have been included in commit 54fa28b. - Retired var2 of both gemmd and gemmlike to 'attic' directories and removed their corresponding prototypes. In both cases, var2 was a variant of the block-panel algorithm where edge-case handling was abstracted away to a microkernel wrapper. (Since this is now the official behavior of BLIS microkernels, I saw no need to have it included as a separate code path.) - Comment updates. commit 864bfab4486ac910ef9a366e9ade4b45a39747fc Author: Field G. Van Zee Date: Tue Jan 4 15:10:34 2022 -0600 CREDITS file update. commit 466b68a3ad118342dc49a8130b7b02f5e7748521 Author: Devin Matthews Date: Sun Jan 2 14:59:41 2022 -0600 Add unique tag to branch labels for Apple ARM64. 
Add `%=` tag to branch labels, which expands to a unique identifier for each inline assembly block. This prevents duplicate symbol errors on Apple Silicon (#594). Fixes #594. [ci skip] since we can't test Apple Silicon anyways... commit 08174a2f6ebbd8ed5aa2bc4edc45da80962f06bb Author: RuQing Xu Date: Sat Jan 1 21:35:19 2022 +0900 Evict Requirement for SVE GEMM For 8<= GCC < 10 compatibility. commit 54fa28bd847b389215cffb57a83dc9b3dce79c86 Author: Devin Matthews Date: Fri Dec 24 08:00:33 2021 -0600 Move edge cases to gemm ukr; more user-custom mods. (#583) Details: - Moved edge-case handling into the gemm microkernel. This required changing the microkernel API to take m and n dimension parameters. This required updating all existing gemm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. We also updated all existing kernels in the 'kernels' directory to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Also removed the assembly code that formerly would handle general stride IO on the microtile, since this can now be handled by the same code that does edge cases. - Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and bli_trsm_cntl_create(), where this function pointer is used in lieu of the default macrokernel when it is non-NULL, and ignored when it is NULL. - Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single function using byte pointers rather than one function for each floating-point datatype. Also, obtain the microkernel function pointer from the .ukr field of the params struct embedded within the obj_t for matrix C (assuming params is non-NULL and contains a non-NULL value in the .ukr field). Communicate both the gemm microkernel pointer to use as well as the params struct to the microkernel via the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params struct) in bli_gemm_var.h. - Retired the separate _md macrokernel for mixed datatype computation. We now use the reimplemented bli_gemm_ker_var2() instead. - Updated gemmt macrokernels to pass m and n dimensions into microkernel calls. - Removed edge-case handling from trmm and trsm macrokernels. - Moved most of bli_packm_alloc() code into a new helper function, bli_packm_alloc_ex(). - Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c. - Added test/syrk_diagonal and test/tensor_contraction directories with associated code to test those operations. commit 961d9d509dd94f3a66f7095057e3dc8eb6d89839 Author: Kiran Date: Wed Dec 8 03:00:38 2021 +0530 Re-add BLIS_ENABLE_ZEN_BLOCK_SIZES macro for 'zen'. Details: - Added previously-deleted cpp macro block to bli_cntx_init_zen.c targeting the Naples microarchitecture that enabled different cache blocksizes when the number of threads exceeds 16. This commit represents PR #573. commit cf7d616a2fd58e293b496770654040818bf5609c Author: Devin Matthews Date: Thu Dec 2 17:10:03 2021 -0600 Enable user-customized packm ukernel/variant. (#549) Details: - Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and .ker_params. These fields store pointers to functions and data that will allow the user to more flexibly create custom operations while recycling BLIS's existing partitioning infrastructure. - Updated typed API to packm variant and structure-aware kernels to replace the diagonal offset with panel offsets, and changed strides of both C and P to inc/ldim semantics. Updated object API to the packm variant to include rntm_t*. - Removed the packm variant function pointer from the packm cntl_t node definition since it has been replaced by the .pack_fn pointer in the obj_t. - Updated bli_packm_int() to read the new packm variant function pointer from the obj_t and call it instead of from the cntl_t node. 
- Moved some of the logic of bli_l3_packm.c to a new file, bli_packm_alloc.c. - Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers instead of typed pointers, allowing a single function to be used regardless of datatype. This obviated having a separate implementation in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a new function, bli_packm_scalar(). - Employed a new standard whereby right-hand matrix operands ("B") are always packed as column-stored row panels -- that is, identically to that of left-hand matrix operands ("A"). This means that while we pack matrix A normally, we actually pack B in a transposed state. This allowed us to simplify a lot of code throughout the framework, and also affected some of the logic in bli_l3_packa() and _packb(). - Simplified bli_packm_init.c in light of the new B^T convention described above. bli_packm_init()--which is now called from within bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns a bool that indicates whether packing should be performed (or skipped). - Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(), which, among other things, defaults the new .pack_fn field of the obj_t to bli_packm_blk_var1() if the field is NULL. - Defined a new function, bli_obj_reset_origin(), which permanently refocuses the view of an object so that it "forgets" any offsets from its original pointer. This function also sets the object's root field to itself. Calls to bli_obj_reset_origin() for each matrix operand appear in the _front() functions, after the obj_t's are aliased. This resetting of the underlying matrices' origins is needed in preparation for more advanced features from within custom packm kernels. - Redefined bli_pba_rntm_set_pba() from a regular function to a static inline function. - Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use libblis_test_pobj_create() to create local packed objects. 
Previously, these packed objects were created by calling lower-level functions. commit e229e049ca08dfbd45794669df08a71dba892925 Author: Field G. Van Zee Date: Wed Dec 1 17:36:22 2021 -0600 Added recu-sed.sh script to 'build' directory. Details: - Added a recursive sed script to the 'build' directory. commit 12c66a4acc77bf4927b01e2358e2ac10b61e0a53 Author: Field G. Van Zee Date: Fri Nov 19 14:43:53 2021 -0600 Minor updates to README.md, docs/Addons.md. Details: - Add additional mentions of addons to README.md, including in the "What's New" section. - Removed mention of sandboxes from the long list of advantages provided by BLIS. - Very minor description update to opening line of Addons.md. commit a4bc03b990fe0572001eb6409efd12cd70677dcf Author: Field G. Van Zee Date: Fri Nov 19 13:29:00 2021 -0600 Brief mention/link to Addons.md in README.md. Details: - Add a blurb about the new addons feature to the "Documentation for BLIS developers" section of the README.md, which also links to the Addons.md document. commit b727645eb7a8df39dee74068f734da66322fe0b3 Merge: 9be97c15 7bde468c Author: Field G. Van Zee Date: Fri Nov 19 13:22:09 2021 -0600 Merge branch 'dev' commit 9be97c150e19fa58bca30cb993a6509ae21e2025 Author: Madan mohan Manokar <86282872+madanm3@users.noreply.github.com> Date: Thu Nov 18 00:46:46 2021 +0530 Support all four dts in test/test_her[2][k].c (#578) Details: - Replaced the hard-coded calls to double-precision real syr, syr2, syrk, and syr2k in the corresponding standalone test drivers in the 'test' directory with conditional branches that will call the appropriate BLAS interface depending on which datatype is enabled. Thanks to Madan mohan Manokar for this improvement. - CREDITS file update. commit 26e4b6b29312b472c3cadf95ccdf5240764777f4 Author: Dipal M Zambare <71366780+dzambare@users.noreply.github.com> Date: Thu Nov 18 00:32:00 2021 +0530 Added support for AMD's Zen3 microarchitecture.
Details: - Added a new 'zen3' subconfiguration targeting support for the AMD Zen3 microarchitecture (#561). Thanks to AMD for this contribution. - Restructured clang and AOCC support for zen, zen2, and zen3 make_defs.mk files. The clang and AOCC version detection now happens in configure, not in the subconfigurations' makefile fragments. That is, we've added logic to configure that detects the version of clang/AOCC, outputs an appropriate variable to config.mk (i.e., CLANG_OT_*, AOCC_OT_*), and then checks for it within the makefile fragment (as is currently done for the GCC_OT_* variables). - Added configure support for a GCC_OT_10_1_0 variable (and associated substitution anchor) to communicate whether the gcc version is older than 10.1.0, and use this variable to check for recent enough versions of gcc to use -march=znver3 in the zen3 subconfig. - Inlined the contents of config/zen/amd_config.mk into the zen and zen2 make_defs.mk so that the files are self-contained, harmonizing the format of all three Zen-based subconfigurations' make_defs.mk files. - Added indenting (with spaces) of GNU make conditionals for easier reading in zen, zen2, and zen3 make_defs.mk files. - Adjusted the range of models checked by bli_cpuid_is_zen() (which was previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is completely disjoint from the models checked by bli_cpuid_is_zen2() (0x30 ~ 0xff). This is normally necessary because Zen and Zen2 microarchitectures share the same family (23, or 0x17), and so the model code is the only way to differentiate the two. But in our case, fixing the model range for zen *wasn't* actually necessary since we checked for zen2 first, and therefore the wide zen range acted like the 'else' of an 'if-else' statement. That said, the change helps improve clarity for the reader by encoding useful knowledge, which was obtained from https://en.wikichip.org/wiki/amd/cpuid . - Added zen2.def and zen3.def files to the collection in travis/cpuid.
Note that support for zen, zen2, and zen3 is now present, and while all three microarchitectures have identical instruction sets from the perspective of BLIS microkernels, they each correspond to different subconfigurations and therefore merit separate testing. Thanks to Devin Matthews for his guidance in hacking these files as slight modifications of zen.def. - Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh. Now, zen, zen2, and zen3 are tested through the SDE via Travis CI builds. - Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils repository on GitHub rather than on Intel's website. This change was made in an attempt to circumvent recent troubles with Travis CI not being able to download the SDE directly from Intel's website via curl. Thanks to Devin Matthews for suggesting the idea. - Updated travis/do_sde.sh to grab the latest version (8.69.1) of the Intel SDE from the flame/ci-utils repository. - Updated .travis.yml to use gcc 9. The file was previously using gcc 8, which did not support -march=znver2. - Created amd64_legacy umbrella family in config_registry for targeting older (bulldozer, piledriver, steamroller, and excavator) microarchitectures and moved those same subconfigs out of the amd64 umbrella family. However, x86_64 retains amd64_legacy as a constituent member. - Fixed a bug in configure related to the building of the so-called config list. When processing the contents of config_registry, configure creates a series of structures and lists that allow for various mappings related to configuration families, subconfigs, and kernel sets. Two of those lists are built via substitution of umbrella families with their subconfig members, and one of those lists was improperly performing the substitution in a way that would erroneously match on partial umbrella family names. That code was changed to match the code that was already doing the substitution properly, via substitute_words().
Also added comments noting the importance of using substitute_words() in both instances. - Comment updates. commit 74c0c622216aba0c24aa2c3a923811366a160cf5 Author: Field G. Van Zee Date: Tue Nov 16 16:06:33 2021 -0600 Reverted cbc88fe. Details: - Reverted the annotation of some markdown code blocks with 'bash' after realizing that the in-browser syntax highlighting was not worthwhile. commit cbc88feb51b949ce562d044cf9f99c4e46bb8a39 Author: Field G. Van Zee Date: Tue Nov 16 16:02:39 2021 -0600 Marked some markdown shell code blocks as 'bash'. Details: - Annotated the code blocks that represent shell commands and output as 'bash' in README.md and BuildSystem.md. commit 78cd1b045155ddf0b9ec6e2ab815f2b216ad9a9e Author: Field G. Van Zee Date: Tue Nov 16 15:53:40 2021 -0600 Added 'Example Code' section to README.md. Details: - Inserted a new 'Example Code' section into the README.md immediately after the 'Getting Started' section. Thanks to Devin Matthews for recommending this addition. - Moved the 'Performance' section of the README down slightly so that it appears after the 'Documentation' section. commit 7bde468c6f7ecc4b5322d2ade1ae9c0b88e6b9f3 Author: Field G. Van Zee Date: Sat Nov 13 16:39:37 2021 -0600 Added support for addons. Details: - Implemented a new feature called addons, which are similar to sandboxes except that there is no requirement to define gemm or any other particular operation. - Updated configure to accept --enable-addon= or -a syntax for requesting an addon be included within a BLIS build. configure now outputs the list of enabled addons into config.mk. It also outputs the corresponding #include directives for the addons' headers to a new companion to the bli_config.h header file named bli_addon.h. Because addons may wish to make use of existing BLIS types within their own definitions, the addons' headers must be included sometime after that of bli_config.h (which currently is #included before bli_type_defs.h). 
This is why the #include directives needed to go into a new top-level header file rather than the existing bli_config.h file. - Added a markdown document, docs/Addons.md, to explain addons, how to build with them, and what assumptions their authors should keep in mind as they create them. - Added a gemmlike-like implementation of sandwich gemm called 'gemmd' as an addon in addon/gemmd. The code uses a 'bao_' prefix for local functions, including the user-level object and typed APIs. - Updated .gitignore so that git ignores bli_addon.h files. commit 7bc8ab485e89cfc6032932e57929e208a28f4be5 Author: Meghana-vankadari <74656386+Meghana-vankadari@users.noreply.github.com> Date: Fri Nov 12 04:16:14 2021 +0530 Added BLAS/CBLAS APIs for axpby, gemm_batch. (#566) Details: - Expanded the BLAS compatibility layer to include support for ?axpby_() and ?gemm_batch_(). The former is a straightforward BLAS-like interface into the axpbyv operation while the latter implements a batched gemm via loops over bli_?gemm(). Also expanded the CBLAS compatibility layer to include support for cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari for submitting these new APIs via #566. - Fixed a long-standing bug in common.mk that for some reason never manifested until now. Previously, CBLAS source files were compiled *without* the location of cblas.h being specified via a -I flag. I'm not sure why this worked, but it may be due to the fact that the cblas.h file resided in the same directory as all of the CBLAS source, and perhaps compilers implicitly add a -I flag for the directory that corresponds to the location of the source file being compiled. This bug only showed up because some CBLAS-like source code was moved into an 'extra' subdirectory of that frame/compat/cblas/src directory. 
After moving the code, compilation for those files failed (because the cblas.h header file, presumably, could not be found in the same location). This bug was fixed within common.mk by explicitly adding the cblas.h directory to the list of -I flags passed to the compiler. - Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory, and updated test/Makefile to build those drivers. - Fixed typo in error message string in cblas_sgemm.c. commit 28b0982ea70c21841fb23802d38f6b424f8200e1 Author: Devin Matthews Date: Wed Nov 10 12:34:50 2021 -0600 Refactored her[2]k/syr[2]k in terms of gemmt. (#531) Details: - Renamed herk macrokernels and supporting files and functions to gemmt, which is possible since at the macrokernel level they are identical. Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal functions rather than cpp macros that instantiate multiple functions. Thanks to Devin Matthews for his efforts on this issue (#531). - Check that the maximum stack buffer size is sufficiently large relative to the register blocksizes for each datatype, and do so when the context is initialized rather than when an operation is called. Note that with this change, users who pass in their own contexts into the expert interfaces currently will *not* have any checks performed. Thanks to Devin Matthews for suggesting this change. commit cfa3db3f3465dc58dbbd842f4462e4b49e7768b4 Author: Field G. Van Zee Date: Wed Nov 3 18:13:56 2021 -0500 Fixed bug in mixed-dt gemm introduced in e9da642. Details: - Fixed a bug that broke certain mixed-datatype gemm behavior. This bug was introduced recently in e9da642 when the code that performs the operation transposition (for microkernel IO preference purposes) was moved up so that it occurred sooner. However, when I moved that code, I failed to notice that there was a cpp-protected "if" conditional that applied to the entire code block that was moved. 
Once the code block was relocated, the orphaned if-statement was now (erroneously) glomming on to the next thing that happened to be in the function, which happened to be the call to bli_rntm_set_ways_for_op(), causing a rather odd memory exhaustion error in the sba due to the num_threads field of the rntm_t still being -1 (because the rntm_t fields were never processed as they should have been). Thanks to @ArcadioN09 (Snehith) for reporting this error and helpfully including relevant memory trace output. commit f065a8070f187739ec2b34417b8ab864a7de5d7e Author: Field G. Van Zee Date: Thu Oct 28 16:05:43 2021 -0500 Removed support for 3m, 4m induced methods. Details: - Removed support for all induced methods except for 1m. This included removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any code that existed only to support those implementations. These implementations were rarely used and posed code maintenance challenges for BLIS's maintainers going forward. - Removed reference kernels for packm that pack 3m and 4m micropanels, and removed 3m/4m-related code from bli_cntx_ref.c. - Removed support for 3m/4m from the code in frame/ind, then reorganized and streamlined the remaining code in that directory. The *ind(), *nat(), and *1m() APIs were all removed. (These additional API layers no longer made as much sense with only one induced method (1m) being supported.) The bli_ind.c file (and header) were moved to frame/base and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to frame/3. - Removed 3m/4m support from the code in frame/1m/packm. - Removed 3m/4m support from trmm/trsm macrokernels and simplified some pointer arithmetic that was previously expressed in terms of the bli_ptr_inc_by_frac() static inline function (whose definition was also removed). - Removed the following subdirectories of level-0 macro headers from frame/include/level0: ri3, rih, ri, ro, rpi.
The level-0 scalar macros defined in these directories were used exclusively for 3m and 4m method codes. - Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in light of 1m being the only induced method left within BLIS. - Removed dt_on_output field within auxinfo_t and its associated accessor functions. - Re-indexed the 1e/1r pack schemas after removing those associated with variants of the 3m and 4m methods. This leaves two bits unused within the pack format portion of the schema bitfield. (See bli_type_defs.h for more info.) - Spun off the basic and expert interfaces to the object and typed APIs into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c and bli_l3_tapi_ex.c. - Moved the level-3 operation-specific _check function calls from the operations' _front() functions to the corresponding _ex() function of the object API. (This change roughly maintains where the _check() functions are called in the call stack but lays the groundwork for future changes that may come to the level-3 object APIs.) Minor modifications to bli_l3_check.c to allow the check() functions to be called from the expert interface APIs. - Removed support within the testsuite for testing the aforementioned induced methods, and updated the standalone test drivers in the 'test' directory to reflect the retirement of those induced methods. - Modified the sandbox contract so that the user is obliged to define bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light of the *nat() functions no longer existing.) Also updated the existing 'power10' and 'gemmlike' sandboxes to come into compliance with the new sandbox rules. - Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation to reflect the retirement of 3m/4m, and also modified Sandboxes.md to bring the document into alignment with new conventions. - Updated various comments; removed segments of commented-out code. commit e8caf200a908859fa5f5ea2049911a9bdaa3d270 Author: Field G.
Van Zee Date: Mon Oct 18 13:04:15 2021 -0500 Updated do_sde.sh to get SDE from GitHub. Details: - Updated travis/do_sde.sh so that the script downloads the SDE tarball from a new ci-utils repository on GitHub rather than from Intel's website. This change is being made in an attempt to circumvent Travis CI's recent troubles with downloading the SDE from Intel's website via curl. Thanks to Devin Matthews for suggesting the idea. commit 290ff4b1c26737b074d5abbf76966bc22af8c562 Author: Field G. Van Zee Date: Thu Oct 14 16:09:43 2021 -0500 Disable SDE testing of old AMD microarchitectures. Details: - Skip testing on piledriver, steamroller, and excavator platforms in travis/do_sde.sh. commit 514fd101742dee557e5eb43d0023a221ae8a7172 Author: Field G. Van Zee Date: Thu Oct 14 13:50:28 2021 -0500 Fixed substitution bug in configure. Details: - Fixed a bug in configure related to the building of the so-called config list. When processing the contents of config_registry, configure creates a series of structures and lists that allow for various mappings related to configuration families, subconfigs, and kernel sets. Two of those lists are built via substitution of umbrella families with their subconfig members, and one of those lists was improperly performing the substitution in a way that would erroneously match on partial umbrella family names. That code was changed to match the code that was already doing the substitution properly, via substitute_words(). - Added comments noting the importance of using substitute_words() in both instances. commit e9da6425e27a9d63c9fef92afc2dd750c601ccd7 Author: Field G. Van Zee Date: Wed Oct 13 14:15:38 2021 -0500 Allow use of 1m with mixing of row/col-pref ukrs. Details: - Fixed a bug that broke the use of 1m for dcomplex when the single- precision real and double-precision real ukernels had opposing I/O preferences (row-preferential sgemm ukernel + column-preferential dgemm ukernel, or vice versa).
The fix involved adjusting the API to bli_cntx_set_ind_blkszs() so that the induced method context init function (e.g., bli_cntx_init__ind()) could call that function for only one datatype at a time. This allowed the blocksize scaling (which varies depending on whether we're doing 1m_r or 1m_c) to happen on a per-datatype basis. This fixes issue #557. Thanks to Devin Matthews and RuQing Xu for helping discover and report this bug. - The aforementioned 1m fix required moving the 1m_r/1m_c logic from bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is called from each level-3 _front() function. The pack_t schemas in the cntx_t were also removed entirely, along with the associated accessor functions. This in turn required updating the trsm1m-related virtual ukernels to read the pack schema for B from the auxinfo_t struct rather than the context. This also required slight tweaks to bli_gemm_md.c. - Repositioned the logic for transposing the operation to accommodate the microkernel IO preference. This mostly only affects gemm. Thanks to Devin Matthews for his help with this. - Updated dpackm pack ukernels in the 'armsve' kernel set to avoid querying pack_t schemas from the context. - Removed the num_t dt argument from the ind_cntx_init_ft type defined in bli_gks.c. The context initialization functions for induced methods were previously passed a dt argument, but I can no longer figure out *why* they were passed this value. To reduce confusion, I've removed the dt argument (including also from the function definition + prototype). - Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This breaks high-level implementations of 3m and 4m, but this is okay since those implementations will be removed very soon. - Removed some older blocks of preprocessor-disabled code. - Comment update to test_libblis.c.
commit 81e103463214d589071ccbe2d90b8d7c19a186e4 Author: Minh Quan Ho <1337056+hominhquan@users.noreply.github.com> Date: Wed Oct 13 20:28:02 2021 +0200 Alloc at least 1 elem in pool_t block_ptrs. (#560) Details: - Previously, the block_ptrs field of the pool_t was allowed to be initialized as any unsigned integer, including 0. However, a length of 0 could be problematic given that the behavior of malloc(0) is implementation-defined and therefore varies across implementations. As a safety measure, we check for block_ptrs array lengths of 0 and, in that case, increase them to 1. - Co-authored-by: Minh Quan Ho commit 327481a4b0acf485d0cbdd8635dd9b886ba3f2a7 Author: Minh Quan Ho <1337056+hominhquan@users.noreply.github.com> Date: Tue Oct 12 19:53:04 2021 +0200 Fix insufficient pool-growing logic in bli_pool.c. (#559) Details: - The current mechanism for growing a pool_t doubles the length of the block_ptrs array every time the array length needs to be increased due to new blocks being added. However, that logic did not take into account the new total number of blocks, and the fact that the caller may be requesting more blocks than would fit even after doubling the current length of block_ptrs. The code comments now contain two illustrating examples that show why, even after doubling, we must always have at least enough room to fit all of the old blocks plus the newly requested blocks. - This commit also happens to fix a memory corruption issue that stems from growing any pool_t that is initialized with a block_ptrs length of 0. (Previously, the memory pool for packed buffers of C was initialized with a block_ptrs length of 0, but because it is unused this bug did not manifest by default.)
- Co-authored-by: Minh Quan Ho commit 32a6d93ef6e2af5e486dfd5e46f8272153d3d53d Merge: 408906fd 2604f407 Author: Devin Matthews Date: Sat Oct 9 15:53:54 2021 -0500 Merge pull request #543 from xrq-phys/armsve-packm-fix ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9 commit 408906fdd8892032aa11bd061b7971128f453bef Merge: 4277fec0 ccf16289 Author: Devin Matthews Date: Sat Oct 9 15:50:25 2021 -0500 Merge pull request #542 from xrq-phys/armsve-zgemm Arm SVE CGEMM / ZGEMM Natural Kernels commit ccf16289d2e71fd9511ccf2d13dcebbfa29deabc Author: RuQing Xu Date: Fri Oct 8 12:34:14 2021 +0900 Arm SVE C/ZGEMM Fix FMOV 0 Mistake FMOV [hsd]M, #imm does not allow zero immediate. Use wzr, xzr instead. commit 82b61283b2005f900101056e6df2a108258db602 Author: RuQing Xu Date: Fri Oct 8 12:17:29 2021 +0900 SH Kernel Unused Eigher commit 1749dfa493054abd2e4ddba7cb21278d337e4f74 Author: RuQing Xu Date: Fri Oct 8 12:11:53 2021 +0900 Arm SVE C/ZGEMM Support *beta==0 commit 4b648e47daad256ab8ab698173a97f71ab9f75eb Author: RuQing Xu Date: Wed Sep 22 16:42:09 2021 +0900 Arm SVE Config armsve Use ZGEMM/CGEMM commit f76ea905e216cf640975e6319c6d2f54aeafed2e Author: RuQing Xu Date: Tue Sep 21 20:38:44 2021 +0900 Arm SVE: Update Perf. Graph Pic. size seems a bit different from upstream. Generated w/ MATLAB. Open to any change.
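Illustrative note (not part of the commit history): the pool-growing fix in 327481a and the zero-length guard in 81e1034 above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in bli_pool.c; the helper names sketch_new_len() and sketch_init_len() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): pick a new length for a
   block_ptrs-style array that currently has room for cur_len pointers,
   holds num_blocks blocks, and must accept num_add additional blocks. */
static size_t sketch_new_len( size_t cur_len, size_t num_blocks, size_t num_add )
{
    /* The array can never hold more blocks than it has slots. */
    assert( num_blocks <= cur_len );

    size_t needed = num_blocks + num_add;

    /* No growth is required if the old and new blocks already fit. */
    if ( needed <= cur_len ) return cur_len;

    /* Doubling alone is not enough when cur_len is 0 or when num_add is
       large, so take the larger of the doubled length and the number of
       slots actually needed. */
    size_t doubled = 2 * cur_len;

    return ( doubled > needed ? doubled : needed );
}

/* Hypothetical helper: never allow an initial length of 0, so that the
   array is never allocated via a zero-byte request. */
static size_t sketch_init_len( size_t requested_len )
{
    return ( requested_len == 0 ? 1 : requested_len );
}
```

For instance, growing a full array of length 4 by 10 blocks yields a length of 14 rather than 8, and growing an empty array of length 0 by 3 blocks yields 3 rather than 0.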
commit 66a018e6ad00d9e8967b67e1aa3e23b20a7efdfe Author: RuQing Xu Date: Mon Sep 20 00:16:11 2021 +0900 Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0 commit 9e1e781cb59f8fadb2a10a02376d3feac17ce38d Author: RuQing Xu Date: Sun Sep 19 23:30:42 2021 +0900 Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0 commit f7c6c2b119423e7ba7a24ae2156790e076071cba Author: RuQing Xu Date: Thu Sep 16 01:47:42 2021 +0900 A64FX Config Use ZGEMM/CGEMM commit e4cabb977d038688688aca39b366f98f9c36b7eb Author: RuQing Xu Date: Thu Sep 16 01:34:26 2021 +0900 Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg commit b677e0d61b23f26d9536e5c363fd6bbab6ee1540 Author: RuQing Xu Date: Thu Sep 16 01:18:54 2021 +0900 Arm SVE Add SGEMM 2Vx10 Unindexed commit 3f68e8309f2c5b31e25c0964395a180a80014d36 Author: RuQing Xu Date: Thu Sep 16 01:00:54 2021 +0900 Arm SVE ZGEMM Support Gather Load / Scatt. St. commit c19db2ff826e2ea6ac54569e8aa37e91bdf7cabe Author: RuQing Xu Date: Wed Sep 15 23:39:53 2021 +0900 Arm SVE Add ZGEMM 2Vx10 Unindexed commit e13abde30b9e0e381c730c496e74bc7ae062a674 Author: RuQing Xu Date: Wed Sep 15 04:19:45 2021 +0900 Arm SVE Add ZGEMM 2Vx7 Unindexed commit 49b9d7998eb86f340ae7b26af3e5a135d6a8feee Author: RuQing Xu Date: Tue Sep 14 04:02:47 2021 +0900 Arm SVE Add ZGEMM 2Vx8 Unindexed commit 4277fec0d0293400497ae8bcfc32be5e62319ae9 Merge: 2329d990 f44149f7 Author: Devin Matthews Date: Thu Oct 7 13:47:22 2021 -0500 Merge pull request #533 from xrq-phys/arm64-hi-bw ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig commit 2329d99016fe1aeb86da4552295f497543cea311 (origin/1m_row_col_problem) Author: Devin Matthews Date: Thu Oct 7 12:37:58 2021 -0500 Update Travis CI badge [ci skip] commit f44149f787ae3d4b53d9c4d8e6f23b2818b7770d Author: RuQing Xu Date: Fri Oct 8 02:35:58 2021 +0900 Armv8 Trash New Bulk Kernels - They didn't make much improvements. - Can't register row-preferral and column-preferral ukrs at the same time. Will break 1m. 
commit 70b52cadc5ef4c16431e1876b407019e6286614e Author: Devin Matthews Date: Thu Oct 7 12:34:35 2021 -0500 Enable testing 1m in `make check`. commit 2604f4071300d109f28c8438be845aeaf3ec44e4 Author: RuQing Xu Date: Thu Oct 7 02:39:00 2021 +0900 Config ArmSVE Unregister 12xk. Move 12xk to Old commit 1e3200326be9109eb0f8c7b9e4f952e45700cbba Author: RuQing Xu Date: Thu Oct 7 02:37:14 2021 +0900 Revert __has_include(). Distinguish w/ BLIS_FAMILY_** commit a4066f278a5c06f73b16ded25f115ca4b7728ecb Author: RuQing Xu Date: Thu Oct 7 02:26:05 2021 +0900 Register firestorm into arm64 Metaconfig commit d7a3372247c37568d142110a1537632b34b8f2ff Author: RuQing Xu Date: Thu Oct 7 02:25:14 2021 +0900 Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo commit 2920dde5ac52e09f84aa42990aab8340421522ce Author: RuQing Xu Date: Thu Oct 7 02:01:45 2021 +0900 Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo commit 14b13583f1802c002e195b3b48874b3ebadbeb20 Author: Devin Matthews Date: Wed Oct 6 10:22:34 2021 -0500 Add test for Apple M1 (firestorm) This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either. commit a024715065532400da6257b8b3124ca5aecda405 Author: RuQing Xu Date: Thu Oct 7 00:15:54 2021 +0900 Firestorm CPUID Dispatcher Commenting out due to possibly a Xcode bug. commit b9da6d55fec447d05c8b67f34ce83617123d8357 Author: RuQing Xu Date: Wed Oct 6 12:25:54 2021 +0900 Armv8 GEMMSUP Edge Cases Require Signed Ints Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c. For safety upon similar strategies in the future, change all [mn]_[iter/left] into signed ints. commit 34919de3df5dda7a06fc09dcec12ca46dc8b26f4 Author: Devin Matthews Date: Sat Oct 2 18:48:50 2021 -0500 Make error checking level a thread-local variable. Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. 
Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex. commit c3024993c3d50236fad112822215f066496c5831 Author: Devin Matthews Date: Tue Oct 5 15:20:27 2021 -0500 Fix data race in testsuite. commit 353a0d82572f26e78102cee25693130ce6e0ea5b Author: Devin Matthews Date: Tue Oct 5 14:24:17 2021 -0500 Update .appveyor.yml [ci skip] commit 4bfadf9b561d4ebe0bbaf8b6d332f07ff531d618 Author: RuQing Xu Date: Wed Oct 6 01:51:26 2021 +0900 Firestorm Block Size Fixes commit 40baf83f0ea2749199b93b5a8ac45c01794b008c Author: RuQing Xu Date: Wed Oct 6 01:00:52 2021 +0900 Armv8 Handle *beta == 0 for GEMMSUP ??r Case. commit 079fbd42ce8cf7ea67a939b0f80f488de5821319 Merge: f5c03e9f 9905f443 Author: Devin Matthews Date: Mon Oct 4 17:21:48 2021 -0500 Merge branch 'master' into arm64-hi-bw commit 9905f44347eea4c57ef4927b81f1c63e76a92739 Merge: 6d3036e3 64a421f6 Author: Devin Matthews Date: Mon Oct 4 15:58:59 2021 -0500 Merge pull request #553 from flame/rpath-fix Add an option to use an @rpath-dependent install_name on macOS commit 6d3036e31d8a2c1acbc1260489eeb8f535a8f97a Merge: 53377fcc eaa554aa Author: Devin Matthews Date: Mon Oct 4 15:58:43 2021 -0500 Merge pull request #545 from hominhquan/clean_error bli_error: more cleanup on the error strings array commit 53377fcca91e595787b38e2a47780ac0c35a7e7c Merge: d0a0b4b8 80c5366e Author: Devin Matthews Date: Mon Oct 4 15:45:53 2021 -0500 Merge pull request #554 from flame/armsve-cleanup Move unused ARM SVE kernels to "old" directory. commit 80c5366e4a9b8b72d97fba1eab89bab8989c44f4 Author: Devin Matthews Date: Mon Oct 4 15:40:28 2021 -0500 Move unused ARM SVE kernels to "old" directory. 
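Illustrative note (not part of the commit history): the change in 34919de above, which makes the error checking level thread-local instead of a mutex-protected global, can be sketched roughly as follows. This is a minimal sketch assuming a C11 compiler; the variable name, the accessor names, and the two level values are hypothetical and may not match what BLIS actually uses.

```c
#include <assert.h>

/* Hypothetical level values for this sketch only. */
enum { SKETCH_NO_ERROR_CHECKING = 0, SKETCH_FULL_ERROR_CHECKING = 1 };

/* Each thread gets its own copy, so no mutex is needed: a value set by one
   thread can never be observed (or overwritten) by another thread. */
static _Thread_local int sketch_err_chk_level = SKETCH_FULL_ERROR_CHECKING;

/* Set the level for the calling thread only. */
static void sketch_set_err_chk_level( int level )
{
    assert( level == SKETCH_NO_ERROR_CHECKING ||
            level == SKETCH_FULL_ERROR_CHECKING );
    sketch_err_chk_level = level;
}

/* Read back the level most recently set by the calling thread. */
static int sketch_get_err_chk_level( void )
{
    return sketch_err_chk_level;
}
```

With this layout, a thread that sets the level and then reads it always sees its own value, which is the behavior the commit message argues for under user threading (e.g. pthreads).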
commit 64a421f6983ab5bc0b55df30a2ddcfff5bfd73be Author: Devin Matthews Date: Mon Oct 4 13:40:43 2021 -0500 Add an option to control whether or not to use @rpath. Adds `--enable-rpath/--disable-rpath` (default disabled) to use an install_name starting with @rpath/. Otherwise, set the install_name to the absolute path of the install library, which was the previous behavior. commit c4a31683dd6f4da3065d86c11dd998da5192740a Author: Devin Matthews Date: Mon Oct 4 13:27:10 2021 -0500 Fix $ORIGIN usage on linux. commit d0a0b4b841fce56b7b2d3c03c5d93ad173ce2b97 Author: Dave Love Date: Mon Oct 4 18:03:04 2021 +0000 Arm micro-architecture dispatch (#344) Details: - Reworked support for ARM hardware detection in bli_cpuid.c to parse the result of a CPUID-like instruction. - Added a64fx support to bli_gks.c. - #include arm64 and arm32 family headers from bli_arch_config.h. - Fix the ordering of the "armsve" and "a64fx" strings in the config_name string array in bli_arch.c. The ordering did not match the ordering of the corresponding arch_t values in bli_type_defs.h, as it should have all along. - Added clang support to make_defs.mk in arm64, cortexa53, cortexa57 subconfigs. - Updated arm64 and arm32 families in config_registry. - Updated docs/HardwareSupport.md to reflect added ARM support. - Thanks to Dave Love, RuQing Xu, and Devin Matthews for their contributions in this PR (#344). commit 91408d161a2b80871463ffb6f34c455bdfb72492 Author: Devin Matthews Date: Mon Oct 4 11:37:48 2021 -0500 Use @rpath-based install name on MacOS and use relocatable RPATH entries for testsuite binaries. - RPATH entries (and DYLD_LIBRARY_PATH) do nothing on macOS unless the install_name of the library starts with @rpath/. While the install_name can be set to the absolute install path, this makes the installation non-relocatable.
When using @rpath in the install_name, install paths within the normal DYLD_LIBRARY_PATH work with no changes on the user side, but for install paths off the beaten track, users must specify an RPATH entry when linking (or modify DYLD_LIBRARY_PATH at runtime). Perhaps this could be made into a configure-time option. - Having relocatable testsuite binaries is not necessarily a priority but it is easy to do with @executable_path (macOS) or $ORIGIN (linux/BSD). commit f5c03e9fe808f9bd8a3e0c62786334e13c46b0fc Author: RuQing Xu Date: Sun Oct 3 16:51:51 2021 +0900 Armv8 Handle *beta == 0 for GEMMSUP ?rc Case. commit abc648352c591e26ceee436bd3a45400115b70c5 Author: RuQing Xu Date: Sun Oct 3 13:14:19 2021 +0900 Armv8 Fix 6x8 Row-Maj Ukr - Fixed for 6x8 only, 4x4 & 4x8 pending; - Installed to config firestorm as benchmark seems to show better perf: Old: blis_dgemm_ukr_c 6 8 320 36.87 2.43e-17 PASS blis_dgemm_ukr_c 6 8 352 40.55 1.04e-17 PASS blis_dgemm_ukr_c 6 8 384 44.24 5.68e-17 PASS blis_dgemm_ukr_c 6 8 416 41.67 3.51e-17 PASS blis_dgemm_ukr_c 6 8 448 34.41 2.94e-17 PASS blis_dgemm_ukr_c 6 8 480 42.53 2.35e-17 PASS New: blis_dgemm_ukr_r 6 8 352 50.69 1.59e-17 PASS blis_dgemm_ukr_r 6 8 384 49.15 5.55e-17 PASS blis_dgemm_ukr_r 6 8 416 50.44 2.86e-17 PASS blis_dgemm_ukr_r 6 8 448 46.92 3.12e-17 PASS blis_dgemm_ukr_r 6 8 480 48.08 4.08e-17 PASS commit 0a45bc0fbc7aee3876c315ed567fc37f19cdc57f Merge: 5013a6cb 13dbd5b5 Author: Devin Matthews Date: Sat Oct 2 18:59:43 2021 -0500 Merge pull request #552 from flame/armsve_beta_0 Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. commit 13dbd5b5d3dbf27e33ecf0e98d43c97019a6339d Author: Devin Matthews Date: Sat Oct 2 20:40:25 2021 +0000 Apply patch from @xrq-phys. commit ae0eeeaf77c77892db17027cef10b95ec97c904f Author: Devin Matthews Date: Wed Sep 29 16:42:33 2021 -0500 Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs. commit 5013a6cb7110746c417da96e4a1308ef681b0b88 Author: Field G.
Van Zee Date: Wed Sep 29 10:38:50 2021 -0500 More edits and fixes to docs/FAQ.md. commit b36fb0fbc5fda13d9a52cc64953341d3d53067ee Author: Field G. Van Zee Date: Tue Sep 28 18:47:45 2021 -0500 Fixed newly broken link to CREDITS in FAQ.md. commit 3442d4002b3bfffd8848f72103b30691df2b19b1 Author: Field G. Van Zee Date: Tue Sep 28 18:43:23 2021 -0500 More minor fixes to FAQ.md and Sandboxes.md. commit 89aaf00650d6cc19b83af2aea6c8d04ddd3769cb Author: Field G. Van Zee Date: Tue Sep 28 18:34:33 2021 -0500 Updates to FAQ.md, Sandboxes.md, and README.md. Details: - Updated FAQ.md to include two new questions, reordered an existing question, and also removed an outdated and redundant question about BLIS vs. AMD BLIS. - Updated Sandboxes.md to use 'gemmlike' as its main example, along with other smaller details. - Added ARM as a funder to README.md. commit c52c43115ec2264fda9380c48d9e6bb1e1ea2ead Merge: 1fc23d21 1f527a93 Author: Field G. Van Zee Date: Sun Sep 26 15:56:54 2021 -0500 Merge branch 'dev' commit 1fc23d2141189c7b583a5bff2cffd87fd5261444 Author: Field G. Van Zee Date: Tue Sep 21 14:54:20 2021 -0500 Safelist 'master', 'dev', 'amd' branches. Details: - Modified .travis.yml so that only commits to 'master', 'dev', and 'amd' branches get built by Travis CI. Thanks to Devin Matthews for helping to track down the syntax for this change. commit 1f527a93b996093e06ef7a8e94fb47ee7e690ce0 Author: Field G. Van Zee Date: Mon Sep 20 17:56:36 2021 -0500 Re-enable and fix fb93d24. Details: - Re-enabled the changes made in fb93d24. - Defined BLIS_ENABLE_SYSTEM in bli_arch.c, bli_cpuid.c, and bli_env.c, all of which needed the definition (in addition to config_detect.c) in order for the configure-time hardware detection binary to be compiled properly. Thanks to Minh Quan Ho for helping identify these additional files as needing to be updated. 
- Added additional comments to all four source files, most notably to prompt the reader to remember to update all of the files when updating any of the files. Also made the cpp code in each of the files as consistent/similar as possible. - Refer to issues #532 and PR #546 for more history. commit 7b39c1492067de941f81b49a3b6c1583290336fd Author: Field G. Van Zee Date: Mon Sep 20 16:13:50 2021 -0500 Reverted fb93d24. Details: - The latest changes in fb93d24 are still causing problems. Reverting and preparing to move them to a branch. commit fb93d242a4fef4694ce2680436da23087bbdd5fe Author: Field G. Van Zee Date: Mon Sep 20 15:42:08 2021 -0500 Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM). Details: - Re-enable the changes originally made in 8e0c425 but quickly reverted in 2be78fc. - Moved the #include of bli_config.h so that it occurs before the #include of bli_system.h. This allows the #define BLIS_ENABLE_SYSTEM or #define BLIS_DISABLE_SYSTEM in bli_config.h to be processed by the time it is needed in bli_system.h. This change should have been in the original 8e0c425, but was accidentally omitted. Thanks to Minh Quan Ho for catching this. - Add #define BLIS_ENABLE_SYSTEM to config_detect.c so that the proper cpp conditional branch executes in bli_system.h when compiling the hardware detection binary. The changes made in 8e0c425 were an attempt to support the definition of BLIS_OS_NONE when configuring with --disable-system (in issue #532). That commit failed because, aside from the required but omitted header reordering (second bullet above), AppVeyor was unable to compile the hardware detection binary as a result of missing Windows headers. This commit, which builds on PR #546, should help fix that issue. Thanks to Minh Quan Ho for his assistance and patience on this matter. 
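Illustrative note (not part of the commit history): the ordering requirement described in fb93d24 and 8e0c425 above can be sketched roughly as follows. The macro names BLIS_ENABLE_SYSTEM, BLIS_DISABLE_SYSTEM, and BLIS_OS_NONE come from those commit messages; everything else, including the single-file layout and the helper sketch_os_is_none(), is a hypothetical stand-in and not the contents of bli_config.h or bli_system.h.

```c
#include <assert.h>

/* Stand-in for bli_config.h, as it might look after configuring with
   --disable-system. It must be processed first. */
#define BLIS_DISABLE_SYSTEM

/* Stand-in for bli_system.h. Because the definition above is already
   visible here, the correct conditional branch is taken. If the two parts
   were swapped, neither macro would be defined yet at this point. */
#if defined( BLIS_DISABLE_SYSTEM )
  #define BLIS_OS_NONE 1
#elif defined( BLIS_ENABLE_SYSTEM )
  /* The usual OS-detecting conditionals would go here. */
#endif

#ifdef BLIS_OS_NONE
/* Compile-time confirmation that the "no OS" branch was selected. */
static_assert( BLIS_OS_NONE == 1, "expected the BLIS_OS_NONE branch" );
#endif

/* Hypothetical helper: returns 1 when no OS was assumed, 0 otherwise. */
static int sketch_os_is_none( void )
{
#ifdef BLIS_OS_NONE
    return 1;
#else
    return 0;
#endif
}
```

The same reasoning explains why config_detect.c and the other standalone sources needed their own #define BLIS_ENABLE_SYSTEM: they are compiled without a generated bli_config.h, so nothing else would select a branch for them.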
commit eaa554aa52b879d181fdc87ba0bfad3ab6131517 Author: Minh Quan HO Date: Wed Sep 15 15:39:36 2021 +0200 bli_error: more cleanup on the error strings array - There was redundancy between the macro BLIS_MAX_NUM_ERR_MSGS (=200) and the enum BLIS_ERROR_CODE_MAX (=170), while they both mean the same thing: the maximal number of error codes/messages. - The previous initialization of error messages at compile time ignored that the 'bli_error_string' array still occupies useless memory due to 2D char[][] declaration. Instead, it should be just an array of pointers, pointing at strings in .rodata section. - This commit makes two modifications: * retired macros BLIS_MAX_NUM_ERR_MSGS and BLIS_MAX_ERR_MSG_LENGTH everywhere * switch bli_error_string from char[][] to char *[] to reduce its footprint from 40KB (200*200) to 1.3KB (170*sizeof(char*)). (No problem to use the enum BLIS_ERROR_CODE_MAX at compile-time, since compiler is smart enough to determine its value is 170.) commit 52f29f739dbbb878c4cde36dbe26b82847acd4e9 Author: Field G. Van Zee Date: Fri Sep 17 08:38:29 2021 -0500 Removed last vestige of #define BLIS_NUM_ARCHS. Details: - Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h and its associated (now outdated) comments. BLIS_NUM_ARCHS has been part of the arch_t enum for some time now, and so this change is mostly about removing any opportunity for confusion for people who may be reading the code. Thanks to Minh Quan Ho for leading me to this cleanup. commit 849aae09f4fbf8d7abf11f4df1471f1d057e874b Author: Field G. Van Zee Date: Thu Sep 16 14:47:45 2021 -0500 Added new packm var3 to 'gemmlike'. Details: - Defined a new packm variant for the 'gemmlike' sandbox. This new variant (bls_l3_packm_var3.c) parallelizes the packing operation over the k dimension rather than the m or n dimensions.
Note that the gemmlike implementation still uses var1 by default, and use of the new code would require changing bls_l3_packm_a.c and/or bls_l3_packm_b.c so that var3 is called instead. Thanks to Jeff Diamond for proposing this (perhaps NUMA-friendly) solution. commit b6f71fd378b7cd0cdc5c780e0b8c975a7abde998 Merge: 9293a68e e3dc1954 Author: Devin Matthews Date: Thu Sep 16 12:24:33 2021 -0500 Merge pull request #544 from flame/haswell-gemmsup-fpe Fix more copy-paste errors in the haswell gemmsup code. commit e3dc1954ffb5eee2a8b41fce85ba589f75770eea Author: Devin Matthews Date: Thu Sep 16 10:59:37 2021 -0500 Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell. The fix is to use the same (valid) source register twice in the horizontal addition. commit 5191c43faccf45975f577c60b9089abee25722c9 Author: Devin Matthews Date: Thu Sep 16 10:16:17 2021 -0500 Fix more copy-paste errors in the haswell gemmsup code. Fixes #486. commit 30c29b256ef13f0141ca9e9169cbdc7a45ce3a61 Author: RuQing Xu Date: Thu Sep 16 05:01:03 2021 +0900 Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9 Affected configs: a64fx. commit bffa85be59dece8e756b9444e762f18892c06ee1 Author: RuQing Xu Date: Thu Sep 16 04:31:45 2021 +0900 Arm SVE: Correct PACKM Ker Name: Intrinsic Kers SVE-Intrinsic-based kernels ought not to use asm in their names. commit 9293a68eb6557a9ea43a846435908c3d52d4218b Merge: ade10f42 98ce6e8b Author: Devin Matthews Date: Fri Sep 10 14:13:29 2021 -0500 Merge pull request #534 from flame/cxx_test Add test to Travis using C++ compiler to make sure blis.h is C++-compatible commit 98ce6e8bc916e952510872caa60d818d62a31e69 Author: Devin Matthews Date: Fri Sep 10 14:12:13 2021 -0500 Do a fast test on OSX. [ci skip] commit c76fcad0c2836e7140b6bef3942e0a632a5f2cda Author: Devin Matthews Date: Fri Sep 10 13:57:02 2021 -0500 Fix AArch64 tests and consolidate some other tests. 
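Illustrative note (not part of the commit history): the footprint reduction described in eaa554aa above (switching the error string table from char[][] to char *[]) can be sketched roughly as follows. The sizes 200 and 170 are taken from that commit message; the type and function names are hypothetical, and the pointer-table size depends on the platform's pointer width.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in the commit message; the identifiers are hypothetical. */
enum { SKETCH_MAX_MSGS = 200, SKETCH_MAX_LEN = 200, SKETCH_CODE_MAX = 170 };

/* Old layout: every message reserves a full fixed-length row, whether or
   not the row is used. */
typedef char sketch_old_table_t[ SKETCH_MAX_MSGS ][ SKETCH_MAX_LEN ];

/* New layout: one pointer per error code, each pointing at a string
   literal stored elsewhere (typically in a read-only data section). */
typedef const char* sketch_new_table_t[ SKETCH_CODE_MAX ];

/* 200 rows of 200 bytes: the "40KB" figure from the commit message. */
static_assert( sizeof( sketch_old_table_t ) == 40000, "old table is 200*200 bytes" );

static size_t sketch_old_size( void ) { return sizeof( sketch_old_table_t ); }
static size_t sketch_new_size( void ) { return sizeof( sketch_new_table_t ); }
```

On a platform with 8-byte pointers, sketch_new_size() works out to 170 * 8 = 1360 bytes, which matches the roughly 1.3KB quoted in the commit message.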
commit e486d666ffefee790d5e39895222b575886ac1ea Author: Devin Matthews Date: Fri Sep 10 13:50:16 2021 -0500 Use C++ cross-compiler for ARM tests. commit fbb3560cb8e2aeab205c47c2b096d4fa306d93db Author: Devin Matthews Date: Fri Sep 10 13:38:27 2021 -0500 Attempt to fix cxx-test for OOT builds. commit 9c0064f3f67d59263c62d57ae19605562bb87cc2 Author: Devin Matthews Date: Fri Sep 10 10:39:04 2021 -0500 Fix config_name in bli_arch.c commit ade10f427835d5274411cafc9618ac12966eb1e7 Author: Field G. Van Zee Date: Fri Aug 27 12:47:12 2021 -0500 Updated travis-ci.org link in README.md to .com. commit 2be78fc97777148c83d20b8509e38aa1fc1b4540 Author: Field G. Van Zee Date: Fri Aug 27 12:17:26 2021 -0500 Disabled (at least temporarily) commit 8e0c425. Details: - Reverted changes in 8e0c425 due to AppVeyor build failures that we do not yet understand. commit 820f11a4694aee5f234e24277aecca40885ae9d4 Author: RuQing Xu Date: Fri Aug 27 13:40:26 2021 +0900 Arm Whole GEMMSUP Call Route is Asm/Int Optimized - `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out. - `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but it's not called by any upper routine. commit 8e0c4255de52a0a5cffecbebf6314aa52120ebe4 Author: Field G. Van Zee Date: Thu Aug 26 15:29:18 2021 -0500 Define BLIS_OS_NONE when using --disable-system. Details: - Modified bli_system.h so that the cpp macro BLIS_OS_NONE is defined when BLIS_DISABLE_SYSTEM is defined. Otherwise, the previous OS- detecting macro conditionals are considered. This change is to accommodate a solution to a cross-compilation issue described in #532. commit d6eb70fbc382ad7732dedb4afa01cf9f53e3e027 Author: Field G. Van Zee Date: Thu Aug 26 13:12:39 2021 -0500 Updated stale calls to malloc_intl() in gemmlike. Details: - Updated two out-of-date calls to bli_malloc_intl() within the gemmlike sandbox. 
These calls to malloc_intl(), which resided in bls_l3_decor_pthreads.c, were missing the err_t argument that the function uses to report errors. Thanks to Jeff Diamond for helping isolate this issue. commit 2f7325b2b770a15ff8aaaecc087b22238f0c67b7 Author: Field G. Van Zee Date: Mon Aug 23 15:04:05 2021 -0500 Blacklist clang10/gcc9 and older for 'armsve'. Details: - Prohibit use of clang 10.x and older or gcc 9.x and older for the 'armsve' subconfiguration. Addresses issue #535. commit 7e2951e61fda1c325d6a76ca9956253482d84924 Author: RuQing Xu Date: Mon Aug 23 17:06:44 2021 +0900 Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref Ref cannot handle panel strides (packed cases) thus cannot be called from the beginning of `gemmsup` (i.e. cannot be dispatch target of gemmsup to other sizes.) commit 4fd82b0e9348553d83e258bd4969e49a81f8fcf0 Author: RuQing Xu Date: Mon Aug 23 05:18:32 2021 +0900 Header Typo commit 35409ebe67557c0e7cf5ced138c8166c9c1c909f Author: RuQing Xu Date: Mon Aug 23 04:51:47 2021 +0900 Arm: DGEMMSUP ??r(rv) Invoke Edge Size Plus some fix at edges. TODO: Should ensure that no ref kernel appear in beginning of gemmsup kernels. As ref does not recognise panel stride. commit a361492c24fdd919ee037763fc6523e8d7d2967a Author: RuQing Xu Date: Mon Aug 23 01:13:39 2021 +0900 Arm: DGEMMSUP ?rc(rd) Invoke Edge Size commit eaea67401c2ab31f2e51eede59725f64c1a21785 Merge: 5fc65cdd e320ec6d Author: Devin Matthews Date: Sat Aug 21 16:09:31 2021 -0500 Merge branch 'master' into cxx_test commit 5fc65cdd9e4134c5dcb16d21cd4a79ff426ca9f3 Author: Devin Matthews Date: Sat Aug 21 15:59:27 2021 -0500 Add test to Travis using C++ compiler to make sure blis.h is C++-compatible. commit e320ec6d5cd44e03cb2e2faa1d7625e84f76d668 Author: Field G. Van Zee Date: Fri Aug 20 17:15:20 2021 -0500 Moved lang defs from _macro_def.h to _lang_defs.h. 
Details: - Moved miscellaneous language-related definitions, including defs related to the handling of the 'restrict' keyword, from the top half of bli_macro_defs.h into a new file, bli_lang_defs.h, which is now #included immediately after "bli_system.h" in blis.h. This change is an attempt to fix a report of recent breakage of C++ compilers due to the recent introduction of 'restrict' in bli_type_defs.h (which previously was being included *before* bli_macro_defs.h and its restrict handling therein. Thanks to Ivan Korostelev for reporting this issue in #527. - CREDITS file update. commit e6799b26a6ecf1e80661a77d857d1c9e9adf50dc Author: RuQing Xu Date: Sat Aug 21 02:39:38 2021 +0900 Arm: Implement GEMMSUP Fallback Method bli_dgemmsup_rv_armv8a_int_6x4mn commit 7d5903d8d7570090eb37c592094424d1c64805d1 Author: RuQing Xu Date: Sat Aug 21 01:55:50 2021 +0900 Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin Forgot to support `alpha`/`beta` in gemmsup_armv8a_int. commit 3b275f810b2479eb5d6cf2296e97a658cf1bb769 Author: Field G. Van Zee Date: Thu Aug 19 16:06:46 2021 -0500 Minor tweaks to gemmlike sandbox. Details: - In the gemmlike sandbox, changed the loop index variable of inner loop of packm_cxk() from 'd' to 'i' (and likewise for the corresponding inlined code within packm_var2()). - Pack matrices A and B using packm_var1() instead of packm_var2(). commit 3eccfd456e7e84052c9a429dcde1183a7ecfaa48 Author: Field G. Van Zee Date: Thu Aug 19 13:22:10 2021 -0500 Added local _check() code to gemmlike sandbox. Details: - Added code to the gemmlike sandbox that handles parameter checking. Previously, the gemmlike implementation called bli_gemm_check(), which resides within the BLIS framework proper. Certain modifications that a user may wish to perform on the sandbox, such as adding a new matrix or vector operand, would have required additional checks, and so these changes make it easier for such a person to implement those checks for their custom gemm-like operation. 
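As a rough illustration of the kind of local parameter checking that 3eccfd4 moves into the gemmlike sandbox, the sketch below validates the dimensions of a gemm-like operation that has one extra vector operand. Every name here (`my_gemmlike_check`, `my_err_t`, `my_mat_t`) and the extra operand itself are hypothetical; none of it is taken from the sandbox source.

```c
#include <stddef.h>

/* Hypothetical error codes; these are NOT the BLIS err_t values. */
typedef enum
{
    MY_SUCCESS = 0,
    MY_NULL_POINTER,
    MY_NONCONFORMAL_DIMS
} my_err_t;

/* Hypothetical minimal matrix descriptor: m x n with row/column strides. */
typedef struct
{
    size_t        m, n;
    ptrdiff_t     rs, cs;
    const double* buf;
} my_mat_t;

/* Check operands for C := beta*C + alpha*A*B, plus an extra vector x of
   length n (the sort of operand a user might add to a custom operation). */
my_err_t my_gemmlike_check( const my_mat_t* a, const my_mat_t* b,
                            const my_mat_t* c,
                            const double* x, size_t len_x )
{
    if ( a == NULL || b == NULL || c == NULL || x == NULL )
        return MY_NULL_POINTER;
    if ( a->buf == NULL || b->buf == NULL || c->buf == NULL )
        return MY_NULL_POINTER;

    /* A is m x k, B is k x n, C is m x n. */
    if ( a->n != b->m ) return MY_NONCONFORMAL_DIMS;
    if ( c->m != a->m ) return MY_NONCONFORMAL_DIMS;
    if ( c->n != b->n ) return MY_NONCONFORMAL_DIMS;

    /* The extra operand must conform to the columns of C. */
    if ( len_x != c->n ) return MY_NONCONFORMAL_DIMS;

    return MY_SUCCESS;
}
```

Keeping such a function local to the sandbox means a new operand only requires a new clause here, rather than a change to the framework's own checks.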
commit 7144230cdb0653b70035ddd91f7f41e06ad8d011 Author: Field G. Van Zee Date: Wed Aug 18 13:25:39 2021 -0500 README.md citation updates (e.g. BLIS7 bibtex). commit 4a955e939044cfd2048cf9f3e33024e3ad1fbe00 Author: Field G. Van Zee Date: Mon Aug 16 13:49:27 2021 -0500 Tweaks to gemmlike to facilitate 3rd party mods. Details: - Changed the implementation in the 'gemmlike' sandbox to more easily allow others to provide custom implementations of packm. These changes include: - Calling a local version of packm_cxk() that can be modified. This version of packm_cxk() uses inlined loops in packm_cxk() rather than querying the context for packm kernels (or even using scal2m). - Providing two variants of packm, one of which calls the aforementioned packm_cxk(), the other of which inlines the contents of packm_cxk() into the variant itself, making it self-contained. To switch from one to the other, simply change which function gets called within bls_packm_a() and bls_packm_b(). - Simplified and cleaned up some variant names in both variants of packm, relative to their parent code. commit 2c0b4150e40c83ea814f69ca766da74c19ed0a58 Merge: c99fae50 4b8ed99d Author: Devin Matthews Date: Sat Aug 14 18:41:35 2021 -0500 Merge pull request #527 from flame/obj_t_makeover Implement proposed new function pointer fields for obj_t. commit 4b8ed99d926876fbf54c15468feae4637268eb6b Author: Field G. Van Zee Date: Fri Aug 13 15:31:10 2021 -0500 Whitespace tweaks. commit c99fae50ac3de0b5380a085aeebebfe67a645407 Merge: e6d68bc4 4f70eb79 Author: Devin Matthews Date: Fri Aug 13 14:48:00 2021 -0500 Merge pull request #530 from flame/fix_clang_warnings Clean up some warnings that show up on clang/OSX. commit e6d68bc4fd0981bea90d7f045779cacfe53f6ae8 Merge: 20a1c401 ec06b6a5 Author: Devin Matthews Date: Fri Aug 13 14:47:46 2021 -0500 Merge pull request #529 from flame/fix_make_check_dependencies Add dependency on the "flat" blis.h file for the BLIS and BLAS testuite objects. 
commit 1772db029e10e0075b5a59d3fb098487b1ad542a Author: Devin Matthews Date: Fri Aug 13 14:46:35 2021 -0500 Add row- and column-strides for A/B in obj_ukr_fn_t. commit 4f70eb7913ad3ded193870361b6da62b20ec3823 Author: Devin Matthews Date: Fri Aug 13 11:12:43 2021 -0500 Clean up some warnings that show up on clang/OSX. commit 3cddce1e2a021be6064b90af30022b99cbfea986 Author: Devin Matthews Date: Thu Aug 12 22:32:34 2021 -0500 Remove schema field on obj_t (redundant) and add new API functions. commit ec06b6a503a203fa0cdb23273af3c0e3afeae7fa Author: Devin Matthews Date: Thu Aug 12 19:27:31 2021 -0500 Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects. This fixes a bug where "make -j check" may fail after a change to one or more header files, or where testsuite code doesn't get properly recompiled after internal changes. commit 20a1c4014c999063e6bc1cfa605b152454c5cbf4 Author: Field G. Van Zee Date: Thu Aug 12 14:44:04 2021 -0500 Disabled sanity check in bli_pool_finalize(). Details: - Disabled a sanity check in bli_pool_finalize() that was meant to alert the user if a pool_t was being finalized while some blocks were still checked out. However, this is exactly the situation that might happen when a pool_t is re-initialized for a larger blocksize, and currently bli_pool_reinit() is implemented as _finalize() followed by _init(). So, this sanity check is not universally appropriate. Thanks to AMD-India for reporting this issue. commit e366665cd2b5ae8d7683f5ba2de345df0a41096f Author: Field G. Van Zee Date: Thu Aug 12 14:06:53 2021 -0500 Fixed stale API calls to membrk API in gemmlike. Details: - Updated stale calls to the bli_membrk API within the 'gemmlike' sandbox. This API is now called bli_pba (packed block allocator). Ideally, this forgotten update would have been included as part of 21911d6, which is when the branch that introduced the membrk->pba changes was merged into 'master'. - Comment updates.
commit e38ca28689f31c5e5bd2347704dc33042e5ea176 Author: RuQing Xu Date: Fri Aug 13 03:21:19 2021 +0900 Added Apple Firestorm (A14/M1) Subconfig - Use the same bulk kernel as Cortex-A53 / ThunderX2; - Larger block size; - Use gemmsup kernels for double precision. commit 3df0e9b653fbb1293cad93010273eea579e753d9 Author: RuQing Xu Date: Sat Jul 17 04:21:53 2021 +0900 Arm64 8x4 Kernel Use Less Regs commit 4e7e225057a05b9722ce65ddf75a9c31af9fbf36 Author: RuQing Xu Date: Wed Jun 9 15:46:36 2021 +0900 Armv8-A Supplimentary GEMMSUP Sizes for RD commit c792d506ba09530395c439051727631fd164f59a Author: RuQing Xu Date: Sat Jun 5 04:20:24 2021 +0900 Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm Suffixed NEON opcode is not supported by GNU assembler commit ce4473520975c2c8790c82c65a69d75f8ad758ea Author: RuQing Xu Date: Sat Jun 5 04:08:14 2021 +0900 Armv8-A Adjust Types for PACKM Kernels GCC does not have full NEON intrinsics support. commit 8a32d19af85b61af92fcab1c316fb3be1a8d42ce Author: RuQing Xu Date: Sat Jun 5 03:31:30 2021 +0900 Armv8-A GEMMSUP-RD 6x8m Armv8-A now has a complete set of GEMMSUP kernels.. commit afd0fa6ad1889ed073f781c8aa8635f99e76b601 Author: RuQing Xu Date: Sat Jun 5 01:19:01 2021 +0900 Armv8-A GEMMSUP-RD 6x8n commit 3c5f7405148ab142dee565d00da331d95a7a07b9 Author: RuQing Xu Date: Fri Jun 4 21:50:51 2021 +0900 Armv8-A s/d Packing Kernels Fix Typo For GCC. commit 49b05df7929ec3abc0d27b475d2d406116fe2682 Author: RuQing Xu Date: Fri Jun 4 18:04:59 2021 +0900 Armv8-A Introduced s/d Packing Kernels Sizes according to the 2014 kernels. commit c3faf93168c3371ff48a2d40d597bdb27021cad4 Author: RuQing Xu Date: Thu Jun 3 23:09:05 2021 +0900 Armv8-A DGEMMSUP 6x8m Kernel Recommended kernels set: ... 
BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE, BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE, ... bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1, -1, 8, -1, -1 ); bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 ); ... commit 3efe707b5500954941061d4c2363d6ed41d17233 Author: RuQing Xu Date: Thu Jun 3 17:20:57 2021 +0900 Armv8-A DGEMMSUP Adjustments commit 8ed8f5e625de9b77a0f14883283effe79af01771 Author: RuQing Xu Date: Thu Jun 3 16:37:37 2021 +0900 Armv8-A Add More DGEMMSUP - Add 6x8 GEMMSUP. - Adjust prefetching. - Workaround for Clang's disability to handle reg clobbering. - Subproduct 6x8 row-major GEMM <- incomplete. commit a9ba79ea14de3b5a271e5970cb473d3c52e2fa5f Author: RuQing Xu Date: Wed Jun 2 15:04:29 2021 +0900 Armv8-A Add GEMMSUP 4x8n Kernel - Compile w/ both GCC & Clang. - Edge cases use ref-kernels. - Can give performance boost in some contexts. commit df40efe8fbfd399d76c6000ec03791a9b76ffbdf Author: RuQing Xu Date: Wed Jun 2 00:04:20 2021 +0900 Armv8-A Add Part of GEMMSUP 8x4m Kernel - Compile w/ both GCC & Clang - Only block part is implement. Edge cases WIP - Not Optimal kernel scheme. Should do 4x8 instead commit 66399992881316514f64d68ec9eb60a87d53f674 Author: RuQing Xu Date: Sat May 29 05:52:05 2021 +0900 Armv8A DGEMM 4x4 Kernel WIP. Slow Quite slow. commit a29c16394ccef02d29141c79b71fb408e20073e6 Author: RuQing Xu Date: Sat May 29 04:58:45 2021 +0900 Armv8-A Add 8x4 Kernel WIP Test result: a bit lower GFlOps than 6x8. commit 64a1f786d58001284aa4f7faf9fae17f0be7a018 Author: Devin Matthews Date: Wed Aug 11 17:53:12 2021 -0500 Implement proposed new function pointer fields for obj_t. The added fields: 1. 
`pack_t schema`: storing the pack schema on the object allows the macrokernel to act accordingly without side-channel information from the rntm_t and cntx_t. The pack schema and "pack_[ab]" fields could be removed from those structs. 2. `void* user_data`: this field can be used to store any sort of additional information provided by the user. The pointer is propagated to submatrix objects and copies, but is otherwise ignored by the framework and the default implementations of the following three fields. User-specified pack, kernel, or ukr functions can do whatever they want with the data, and the user is 100% responsible for allocating, assigning, and freeing this buffer. 3. `obj_pack_fn_t pack`: the function called when a matrix is packed. This functions receives the expected arguments, as well as a mdim_t and mem_t* as memory must be allocated inside this function, and behavior may differ based on which matrix is being backed (i.e. transposition for B). This could also be achieved by passing a desired pack schema, but this would require additional information to travel down the control tree. 4. `obj_ker_fn_t ker`: the function called when we get to the "second loop", or the macro-kernel. Behavior may depend on the pack schemas of the input matrices. The default implementation would perform the inner two loops around the ukr, and then call either the default ukr or a user-supplied one (next field). 5. `obj_ukr_fn_t ukr`: the function called by the default macrokernel. This would replace the various current "virtual" microkernels, and could also be used to supply user-defined behavior. Users could supply both a custom kernel (above) and microkernel, although the user-specified kernel does **not** necessarily have to call the ukr function specified on the obj_t. Note that no macros or functions for accessing these new fields have been defined yet. That is next once these are finalized. Addresses https://github.com/flame/blis/projects/1#card-62357687. 
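The relationship between the `user_data`, `ker`, and `ukr` fields described in 64a1f78 can be pictured with the toy sketch below. It is only an illustration under simplifying assumptions: all types and functions (`my_obj_t`, `my_default_ker`, `my_scaled_ukr`, and so on) are hypothetical stand-ins rather than BLIS definitions, the `pack` and `schema` fields are omitted, matrices are contiguous row-major, and the "microtile" is a single element.

```c
#include <stddef.h>

/* Hypothetical microkernel type: updates one 1x1 "tile" of C with a
   k-length dot product of a row of A and a column of B. */
typedef void (*my_ukr_fn_t)( size_t k,
                             const double* a, size_t inca,
                             const double* b, size_t incb,
                             double* c, void* user_data );

typedef struct my_obj_s my_obj_t;

/* Hypothetical kernel type: the loops around the microkernel. */
typedef void (*my_ker_fn_t)( const my_obj_t* a, const my_obj_t* b,
                             const my_obj_t* c );

struct my_obj_s
{
    size_t      m, n;
    double*     buf;        /* row-major, leading dimension n        */
    void*       user_data;  /* owned by the user; framework ignores  */
    my_ker_fn_t ker;        /* called to run the kernel loops        */
    my_ukr_fn_t ukr;        /* called by the default kernel          */
};

/* Default microkernel: ignores user_data entirely. */
void my_default_ukr( size_t k, const double* a, size_t inca,
                     const double* b, size_t incb,
                     double* c, void* user_data )
{
    double acc = 0.0;
    (void)user_data;
    for ( size_t p = 0; p < k; ++p ) acc += a[ p * inca ] * b[ p * incb ];
    *c += acc;
}

/* User-supplied microkernel: reads a scale factor from user_data. */
void my_scaled_ukr( size_t k, const double* a, size_t inca,
                    const double* b, size_t incb,
                    double* c, void* user_data )
{
    const double scale = *(const double*)user_data;
    double acc = 0.0;
    for ( size_t p = 0; p < k; ++p ) acc += a[ p * inca ] * b[ p * incb ];
    *c += scale * acc;
}

/* Default kernel: loops over C and calls whichever ukr is set on C. */
void my_default_ker( const my_obj_t* a, const my_obj_t* b,
                     const my_obj_t* c )
{
    for ( size_t i = 0; i < c->m; ++i )
        for ( size_t j = 0; j < c->n; ++j )
            c->ukr( a->n, a->buf + i * a->n, 1, b->buf + j, b->n,
                    c->buf + i * c->n + j, c->user_data );
}

/* Entry point: dispatches through the function pointer stored on C. */
void my_gemm_obj( const my_obj_t* a, const my_obj_t* b, const my_obj_t* c )
{
    c->ker( a, b, c );
}
```

Swapping `c.ukr` from `my_default_ukr` to `my_scaled_ukr` (and pointing `c.user_data` at a `double`) changes the computation without touching the kernel loops, which is the kind of customization the commit message describes.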
commit a32257eeab2e9946e71546a05a1847a39341ec6b Author: Field G. Van Zee Date: Thu Aug 5 16:23:02 2021 -0500 Fixed bli_init.c compile-time error on OSX clang. Details: - Fixed a compile-time error in bli_init.c when compiling with OSX's clang. This error was introduced in 868b901, which introduced a post-declaration struct assignment where the RHS was a struct initialization expression (i.e. { ... }). This use of struct initializer expressions apparently works with gcc despite it not being strict C99. The fix included in this commit declares a temporary variable for the purposes of being initialized to the desired value, via the struct initializer, and then copies the temporary struct (via '=' struct assignment) to the persistent struct. Thanks to Devin Matthews for his help with this. commit c8728cfbd19ecde9d43af05829e00bcfe7d86eed Author: Field G. Van Zee Date: Thu Aug 5 15:17:09 2021 -0500 Fixed configure breakage on OSX clang. Details: - Accept either 'clang' or 'LLVM' in vendor string when greping for the version number (after determining that we're working with clang). Thanks to Devin Matthews for this fix. commit 868b90138e64c873c780d9df14150d2a370a7a42 Author: Field G. Van Zee Date: Wed Aug 4 18:31:01 2021 -0500 Fixed one-time use property of bli_init() (#525). Details: - Fixes a rather obvious bug that resulted in segmentation fault whenever the calling application tried to re-initialize BLIS after its first init/finalize cycle. The bug resulted from the fact that the bli_init.c APIs made no effort to allow bli_init() to be called subsequent times at all due to it, and bli_finalize(), being implemented in terms of pthread_once(). This has been fixed by resetting the pthread_once_t control variable for initialization at the end of bli_finalize_apis(), and by resetting the control variable for finalization at the end of bli_init_apis(). 
Thanks to @lschork2 for reporting this issue (#525), and to Minh Quan Ho and Devin Matthews for suggesting the chosen solution. - CREDITS file update. commit 8dba1e752c6846a85dea50907135bbc5cbc54ee5 Author: Field G. Van Zee Date: Tue Jul 27 12:38:24 2021 -0500 CREDITS file update. commit cc9206df667b7c710b57b190b8ad351176de53b8 Author: Field G. Van Zee Date: Fri Jul 16 15:48:37 2021 -0500 Added Graviton2 Neoverse N1 performance results. Details: - Added single-threaded and multithreaded performance results to docs/Performance.md. These results were gathered on a Graviton2 Neoverse N1 server. Special thanks to Nicholai Tukanov for collecting these results via the Arm-HPC/AWS hackathon. - Corrected what was supposed to be a temporary tweak to the legend labels in test/3/octave/plot_l3_perf.m. commit fab5c86d68137b59800715efb69214c0a7e458a7 Merge: 84f9dcd4 d073fc9a Author: Devin Matthews Date: Tue Jul 13 16:46:21 2021 -0500 Merge pull request #516 from nicholaiTukanov/p10-sandbox-rework P10 sandbox rework commit 84f9dcd449fa7a4cf4087fca8ec4ca0d10e9b801 Author: Devin Matthews Date: Tue Jul 13 16:45:44 2021 -0500 Remove unnecessary windows/zen2 directory. commit 21911d6ed3438ca4ba942d05851ba5d7e9835586 Merge: 17729cf4 689fa0f4 Author: Field G. Van Zee Date: Fri Jul 9 18:10:46 2021 -0500 Merge branch 'dev' commit 17729cf449919d1db9777cea5b65d2efc77e2692 Author: Devin Matthews Date: Fri Jul 9 14:59:48 2021 -0500 Add vzeroupper to Haswell microkernels. (#524) Details: - Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm' microkernels so as to avoid a performance penalty when mixing AVX and SSE instructions. These vzeroupper instructions were once part of the haswell kernels, but were inadvertently removed during a source code shuffle some time ago when we were managing duplicate 'haswell' and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down and re-inserting the missing instructions.
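The re-initialization fix in 868b901, together with the follow-up in a32257e, can be sketched roughly as follows. This is a single-threaded illustration under stated assumptions, not the code in bli_init.c: the names (`my_lib_init`, `my_lib_finalize`, `once_fresh`) are hypothetical, and POSIX does not explicitly sanction re-arming a `pthread_once_t` by assignment, so treat the technique as platform-dependent.

```c
#include <pthread.h>

static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static pthread_once_t fini_once = PTHREAD_ONCE_INIT;

/* A pristine control value. PTHREAD_ONCE_INIT may expand to a brace
   initializer on some platforms, which is valid in a declaration but
   not as the right-hand side of a plain assignment. Initializing a
   temporary and copying it via '=' sidesteps that (compare a32257e). */
static const pthread_once_t once_fresh = PTHREAD_ONCE_INIT;

static int is_initialized = 0;
static int init_count     = 0;

static void do_init( void )
{
    is_initialized = 1;
    ++init_count;

    /* Re-arm finalization so the next finalize call does real work. */
    fini_once = once_fresh;
}

static void do_finalize( void )
{
    is_initialized = 0;

    /* Re-arm initialization so the library can be initialized again. */
    init_once = once_fresh;
}

void my_lib_init( void )     { pthread_once( &init_once, do_init ); }
void my_lib_finalize( void ) { pthread_once( &fini_once, do_finalize ); }
```

With this arrangement, calling `my_lib_init()` twice in a row still initializes only once, while an init/finalize/init sequence runs the initializer a second time instead of silently skipping it.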
commit c9a7f59aa84daa54d8f8c771f1f1ef2bd8730da2 Merge: 75f03907 9a8e649c Author: Devin Matthews Date: Thu Jul 8 14:00:38 2021 -0500 Merge pull request #522 from flame/windows-avx512 Fix Win64 AVX512 bug. commit 9a8e649c5ac89eba951bbee7136ca28aeb24d731 Author: Devin Matthews Date: Wed Jul 7 15:23:57 2021 -0500 Fix Win64 AVX512 bug. Use `-march=haswell` for kernels. Fixes #514. commit 75f03907c58385b656c8bd35d111db245814a9f3 Author: Devin Matthews Date: Wed Jul 7 15:44:11 2021 -0500 Add comment about make checkblas on Windows [ci skip] commit 4651583b1204a965e4aa672c7ad6de60f3ab1600 Merge: 69205ac2 174f7fc9 Author: Devin Matthews Date: Wed Jul 7 01:11:20 2021 -0500 Merge pull request #520 from flame/travis-ci-install Test installation in Travis CI commit 69205ac266947723ad4d7bb028b7521fe5c76991 Author: Field G. Van Zee Date: Tue Jul 6 20:39:22 2021 -0500 CREDITS file update. Details: - Thanks to Chengguo Sun for submitting #515 (5ef7f68). - Thanks to Andrew Wildman for submitting #519 (551c6b4). - Whitespace update to configure (spaces to tabs). commit 174f7fc9a11712c7bd1a61510bdc5c262b3e8e1f Author: Devin Matthews Date: Tue Jul 6 19:35:55 2021 -0500 Test installation in Travis CI commit 551c6b4ee8cd9dd2e1d1b46c8dde09eb50b91b2c Merge: 78eac6a0 f648df4e Author: Devin Matthews Date: Tue Jul 6 19:32:53 2021 -0500 Merge pull request #519 from awild82/oot_build_bugfix Fix installation from out-of-tree builds commit f648df4e5588f069b2db96f8be320ead0c1967ef Author: Andrew Wildman Date: Tue Jul 6 16:35:12 2021 -0700 Add symlink to blis.pc.in for out-of-tree builds commit 78eac6a0ab78c995c3f4e46a9e87388b5c3e1af6 Author: Devin Matthews Date: Tue Jul 6 11:05:43 2021 -0500 Revert "Always run `make check`." This reverts commit a201a53440c51244739aaee20e3309b50121cc68. commit a201a53440c51244739aaee20e3309b50121cc68 Author: Devin Matthews Date: Mon Jul 5 21:39:18 2021 -0500 Always run `make check`. 
I'm concerned that problems may lurk for `x86_64` builds on Windows which may be uncovered by a fuller `make check`. commit 5ef7f684dc75fc707c82f919e0836615f90a2627 Merge: aaa10c87 ad6231cc Author: Devin Matthews Date: Mon Jul 5 21:35:07 2021 -0500 Merge pull request #515 from chengguosun/bug-fix Fixed configure script bug. commit ad6231cca3fc1e477752ecd31b1ee2323398a642 Author: sunchengguo Date: Tue Jul 6 07:30:00 2021 -0400 Fixed configure script bug. Details: - Fixed kernel list string substitution error by adding function substitute_words in configure script. if the string contains zen and zen2, and zen need to be replaced with another string, then zen2 also be incorrectly replaced. commit d073fc9acac9d702556cab9fbbb3a253eeb1f998 Author: nicholaiTukanov Date: Fri Jul 2 19:54:33 2021 -0500 Update POWER10.md commit 907226c0af4afb6323b4e02be4f73f5fb89cddaf Author: nicholaiTukanov Date: Fri Jul 2 19:47:18 2021 -0500 Rework POWER10 sandbox - Add a testsuite for gathering performance (in GFLOPs) and measuring correctness for the POWER10 GEMM reduced precision/integer kernels. - Reworked GENERIC_GEMM template to hardcode the cache parameters. - Remove kernel wrapper that checked that only allowed matrices that weren't transposed or conjugated. However, the kernels still assume the matrices are not transposed. This wrapper was removed for performance reasons. - Renamed and restructured files and functions for clarity. - Editted the POWER10 document to reflect new changes. commit aaa10c87e19449674a4ca30fa3b6392bb22c3a66 Author: Field G. Van Zee Date: Mon Jun 21 17:53:52 2021 -0500 Skip clearing temp microtile in gemmlike sandbox. Details: - Removed code from gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. This code, introduced recently in 7f7d726, did not actually fix any bug (despite that commit's log entry). 
The microtile does not need to be initialized because it is completely overwritten by a "beta = 0" invocation of gemm prior to it being read. Any NaNs or Infs present at the outset would have no impact on the output matrix C. Thanks to Devin Matthews for reminding me of this. commit bc10a3f2ff518360c32bea825b3eb62a9e4c8a77 Merge: bf727636 6548ceba Author: Devin Matthews Date: Fri Jun 18 19:01:08 2021 -0500 Merge pull request #492 from flame/thunderx2-clang Allow clang for ThunderX2 config commit bf727636632a368f3247dc8ab1d4b6119e9c511a Merge: e28f2a2d 5fc93e28 Author: Devin Matthews Date: Fri Jun 18 18:59:43 2021 -0500 Merge pull request #506 from xrq-phys/arm64-mac BLIS on Darwin_Aarch64 commit e28f2a2dfcff14e7094fce0b279b3a917b3ab98c Merge: d10e05bb 56ffca6a Author: Devin Matthews Date: Tue Jun 15 19:35:07 2021 -0500 Merge pull request #513 from nicholaiTukanov/asm_warning_p9_fix Fix assembler warning in POWER9 DGEMM commit 56ffca6a9bc67432a7894298739895f406e5f467 Author: nicholai Date: Tue Jun 15 18:17:39 2021 -0500 Fix asm warning commit 689fa0f40399bde1acc5367d6dd4e8fc4eb6f3ea Merge: b683d01b d10e05bb Author: Field G. Van Zee Date: Sun Jun 13 19:44:14 2021 -0500 Merge branch 'master' into dev commit d10e05bbd1ce45ce2c0dfe5c64daae2633357b3f Author: Field G. Van Zee Date: Sun Jun 13 19:36:16 2021 -0500 Sandbox header edits trigger full library rebuild. Details: - Adjusted the top-level Makefile so that any change to a sandbox header file will result in blis.h being regenerated along with a full recompilation of the library. Previously, sandbox files were omitted from the list of header files that, when touched, could trigger a full rebuild. Why was it like that previously? Because originally we only envisioned using sandboxes to *replace* gemm, not augment the library with new functionality. When replacing gemm, blis.h does not need to contain any local sandbox defintions in order for the user to be able to (indirectly) use that sandbox. 
But if you are adding functions to the library, those functions need to be prototyped so the compiler can perform type checking against the user's invocation of those new functions. Thanks to Jeff Diamond for helping us discover this deficiency in the build system. commit 7c3eb44efaa762088c190bb820ef6a3c87db8f65 Author: Devin Matthews Date: Wed Jun 2 11:28:22 2021 -0500 Add vhsubpd/vhsubpd. Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip]. commit 7f7d72610c25f511ba8cd2a53be7b59bdb80f3f3 Author: Field G. Van Zee Date: Mon May 31 16:50:18 2021 -0500 Fixed bugs in cpackm kernels, gemmlike code. Details: - Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the kappa scalar was incorrectly loaded at an offset of 8 bytes (instead of 4 bytes) from the real component. This was almost certainly a copy-paste bug carried over from the corresponding zpackm kernels. Thanks to Devin Matthews for bringing this to my attention. - Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and bls_gemm_bp_var2.c that initializes the elements of the temporary microtile to zero. (This bug was never observed in output but rather noticed analytically. It probably would have also manifested as intermittent failures, this time involving edge cases.) - Minor commented-out/disabled changes to testsuite/src/test_gemm.c relating to debugging. commit 5fc93e280614b4a21a9cff36cf873b4b9407285b Author: RuQing Xu Date: Sat May 29 18:44:47 2021 +0900 Armv8A Rename Regs for Safe Darwin Compile Avoid x18 use in FP32 kernel: - C address lines x[18-26] renamed to x[19-27] (reg index +1) - Original role of x27 fulfilled by x5 which is free after k-loop pert. FP64 does not require changing since x18 is not used there.
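The kappa offset bug fixed in 7f7d726 comes down to the layout of single- versus double-precision complex values. The sketch below uses hypothetical struct and function names (`my_scomplex`, `my_dcomplex`, `load_float_at`) and assumes the common 4-byte `float` and 8-byte `double`; it mimics how an assembly kernel addresses a component by byte offset.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for single- and double-precision complex. */
typedef struct { float  real, imag; } my_scomplex;
typedef struct { double real, imag; } my_dcomplex;

/* Read a float located 'offset' bytes past 'base', the way an assembly
   kernel would address a component of kappa. */
static float load_float_at( const void* base, size_t offset )
{
    float v;
    memcpy( &v, (const char*)base + offset, sizeof( v ) );
    return v;
}

/* Correct load of the imaginary part of a single-precision kappa. */
static float load_imag_c( const my_scomplex* kappa )
{
    return load_float_at( kappa, offsetof( my_scomplex, imag ) );
}
```

With 4-byte floats, `offsetof( my_scomplex, imag )` is 4 while `offsetof( my_dcomplex, imag )` is 8, so a kernel copied from the double-precision version and left at offset 8 reads whatever follows kappa in memory rather than its imaginary part.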
commit 9f4a4a3cfb2244e4024445e127dafd2a11f39fc5 Author: RuQing Xu Date: Sat May 29 17:21:28 2021 +0900 Armv8A Rename Regs for Clang Compile: FP32 Part Roughly the same as 916e1fa , additionally with x15 clobbering removed. - x15: Not used at all. Compilation w/ Clang shows warning about x18 reservation, but compilation itself is OK and all tests got passed. commit 916e1fa8be3cea0e3e2a4a7e8b00027ac2ee7780 Author: RuQing Xu Date: Sat May 29 16:46:52 2021 +0900 Armv8A Rename Regs for Clang Compile: FP64 Part - x7, x8: Used to store address for Alpha and Beta. As Alpha & Beta was not used in k-loops, use x0, x1 to load Alpha & Beta's addresses after k-loops are completed, since A & B's addresses are no longer needed there. This "ldr [addr]; -> ldr val, [addr]" would not cause much performance drawback since it is done outside k-loops and there are plenty of instructions between Alpha & Beta's loading and usage. - x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used any longer. Directly loading cs_c and into x10 and scale by 8 spares x9 straightforwardly. - x11, x12: Not used at all. Simply remove from clobber list. - x13: Alike x9, loaded and scaled by 8 into x14, except that x13 is also used in a conditional branch so that "cmp x13, #1" needs to be modified into "cmp x14, #8" to completely free x13. - x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load these addresses into x0 and x1 after Alpha & Beta are both loaded, since then neigher address of A/B nor address of Alpha/Beta is needed. commit 7fabd896af773623ed01820a71bbff432e8a7d25 Author: RuQing Xu Date: Sat May 29 16:28:03 2021 +0900 Asm Flag Mingling for Darwin_Aarch64 Apple+Arm64 requires additional "tagging" of local symbols. commit 213dce32d2eed8b7a38c6a3f6112072b0a89ecd0 Author: Field G. Van Zee Date: Fri May 28 14:49:57 2021 -0500 Added a new 'gemmlike' sandbox. 
Details: - Added a new sandbox called 'gemmlike', which implements sequential and multithreaded gemm in the style of gemmsup but also unconditionally employs packing. The purpose of this sandbox is to (1) avoid select abstractions, such as objects and control trees, in order to allow readers to better understand how a real-world implementation of high-performance gemm can be constructed; (2) provide a starting point for expert users who wish to build something that is gemm-like without "reinventing the wheel." Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi Parikh for requesting and inspiring this work. - The functions defined in this sandbox currently use the "bls_" prefix instead of "bli_" in order to avoid any symbol collisions in the main library. - The sandbox contains two variants, each of which implements gemm via a block-panel algorithm. The only difference between the two is that variant 1 calls the microkernel directly while variant 2 calls the microkernel indirectly, via a function wrapper, which allows the edge case handling to be abstracted away from the classic five loops. - This sandbox implementation utilizes the conventional gemm microkernel (not the skinny/unpacked gemmsup kernels). - Updated some typos in the comments of a few files in the main framework. commit 82af05f54c34526a60fd2ec46656f13e1ac8f719 Author: Field G. Van Zee Date: Tue May 25 15:25:08 2021 -0500 Updated Fugaku (a64fx) performance results. Details: - Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx entry within Performance.md, and also updated the experiment details accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2 experiments reflected in this commit. - In Performance.md, added an English translation of the project name under which the Fugaku results were gathered, courtesy of RuQing Xu. 
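For readers unfamiliar with the block-panel structure mentioned in 213dce3, the following is a minimal sketch of five loops around a microkernel, with packing and with edge cases handled in a wrapper (in the spirit of the sandbox's variant 2). It is not the sandbox code: the function names, the tiny blocksizes, the row-major layout, and the scalar microkernel are all assumptions chosen for brevity, and alpha/beta scaling and threading are left out.

```c
#include <stddef.h>

/* Toy blocksizes chosen for readability, not performance. */
enum { MR = 4, NR = 4, MC = 8, KC = 8, NC = 8 };

static size_t min_sz( size_t a, size_t b ) { return a < b ? a : b; }

/* Pack an mc x kc block of row-major A into MR-row micropanels,
   zero-padding the last panel when mc is not a multiple of MR. */
static void pack_a( size_t mc, size_t kc, const double* a, size_t lda,
                    double* ap )
{
    for ( size_t ir = 0; ir < mc; ir += MR )
        for ( size_t p = 0; p < kc; ++p )
            for ( size_t i = 0; i < MR; ++i )
                *ap++ = ( ir + i < mc ) ? a[ ( ir + i ) * lda + p ] : 0.0;
}

/* Pack a kc x nc block of row-major B into NR-column micropanels. */
static void pack_b( size_t kc, size_t nc, const double* b, size_t ldb,
                    double* bp )
{
    for ( size_t jr = 0; jr < nc; jr += NR )
        for ( size_t p = 0; p < kc; ++p )
            for ( size_t j = 0; j < NR; ++j )
                *bp++ = ( jr + j < nc ) ? b[ p * ldb + jr + j ] : 0.0;
}

/* Microkernel: always updates a full MR x NR tile of C. */
static void ukr( size_t k, const double* ap, const double* bp,
                 double* c, size_t ldc )
{
    for ( size_t p = 0; p < k; ++p )
        for ( size_t i = 0; i < MR; ++i )
            for ( size_t j = 0; j < NR; ++j )
                c[ i * ldc + j ] += ap[ p * MR + i ] * bp[ p * NR + j ];
}

/* Wrapper: edge tiles go through a temporary microtile so that the
   loops below never need to special-case them. */
static void ukr_edge( size_t m, size_t n, size_t k,
                      const double* ap, const double* bp,
                      double* c, size_t ldc )
{
    if ( m == MR && n == NR ) { ukr( k, ap, bp, c, ldc ); return; }

    double ct[ MR * NR ] = { 0.0 };
    ukr( k, ap, bp, ct, NR );
    for ( size_t i = 0; i < m; ++i )
        for ( size_t j = 0; j < n; ++j )
            c[ i * ldc + j ] += ct[ i * NR + j ];
}

/* C += A * B for row-major A (m x k), B (k x n), C (m x n). */
void my_gemm( size_t m, size_t n, size_t k,
              const double* a, size_t lda,
              const double* b, size_t ldb,
              double* c, size_t ldc )
{
    double ap[ MC * KC ], bp[ KC * NC ];

    for ( size_t jc = 0; jc < n; jc += NC ) {                 /* 5th loop */
        const size_t nc = min_sz( NC, n - jc );
        for ( size_t pc = 0; pc < k; pc += KC ) {             /* 4th loop */
            const size_t kc = min_sz( KC, k - pc );
            pack_b( kc, nc, &b[ pc * ldb + jc ], ldb, bp );
            for ( size_t ic = 0; ic < m; ic += MC ) {         /* 3rd loop */
                const size_t mc = min_sz( MC, m - ic );
                pack_a( mc, kc, &a[ ic * lda + pc ], lda, ap );
                for ( size_t jr = 0; jr < nc; jr += NR )      /* 2nd loop */
                    for ( size_t ir = 0; ir < mc; ir += MR )  /* 1st loop */
                        ukr_edge( min_sz( MR, mc - ir ),
                                  min_sz( NR, nc - jr ), kc,
                                  ap + ir * kc, bp + jr * kc,
                                  &c[ ( ic + ir ) * ldc + jc + jr ], ldc );
            }
        }
    }
}
```

Because the packed panels are zero-padded, the microkernel can always compute a full tile; the wrapper simply discards the padded rows and columns when writing back to C.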
commit e5c85da3763f73854ecd739ba3008bb467ed77c3 Merge: cbd8d393 5feb04e2 Author: Devin Matthews Date: Mon May 24 16:56:22 2021 -0500 Merge pull request #503 from flame/windows-compiler-check Add explicit compiler check for Windows. commit cbd8d3932599485727204479fded66ac19186db4 Merge: 6d4ab022 932dfe6a Author: Devin Matthews Date: Mon May 24 16:32:42 2021 -0500 Merge pull request #500 from xrq-phys/armsve+travis Upgrade Travis CI for Arm SVE commit 5feb04e233e1e6f81c727578ad9eae1367a2562f Author: Devin Matthews Date: Sun May 23 18:46:56 2021 -0500 Add explicit compiler check for Windows. Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes #463. commit 6d4ab0223d9014ac2a66d66759536aa305be5867 Merge: 61584ded 859fb77a Author: Devin Matthews Date: Sun May 23 18:39:53 2021 -0500 Merge pull request #502 from flame/rm-rm-dupls Remove `rm-dupls` function in common.mk. commit 859fb77a320a3ace71d25a8885c23639b097a1b6 Author: Devin Matthews Date: Sun May 23 18:15:23 2021 -0500 Remove `rm-dupls` function in common.mk. AMD requested removal due to unclear licensing terms; original code was from stackoverflow. The function is unused but could easily be replaced by new implementation. commit 932dfe6abb9617223bd26a249e53447169033f8c Author: RuQing Xu Date: Thu May 20 02:07:31 2021 +0900 Travis CI Revert Unnecessary Extras from 91d3636 - Removed `V=1` in make line - Removed `CFLAGS` in configure line - Restored `pwd` surrounding OOT line commit bd156a210d347a073a6939cc4adab3d9256c2e2b Author: RuQing Xu Date: Sun May 16 02:56:14 2021 +0900 Adjust TravisCI - ArmSVE don't test gemmt (seems Qemu-only problem); - Clang use TravisCI-provided version instead of fixing to clang-8 due to that clang-8 seems conflicting with TravisCI's clang-7. commit 91d3636031021af3712d14c9fcb1eb34b6fe2a31 Author: RuQing Xu Date: Sat May 15 17:05:16 2021 +0900 Travis Support Arm SVE - Updated distro to 20.04 focal aarch64-gcc-10. 
This is minimal version required by aarch64-gcc-10. SVE intrinsics would not compile without GCC >=10. - x86 toolchains use official repo instead of ubuntu-toolchain-r/test. 20.04 focal is not supported by that PPA at the moment. - Add extra configuration-time options to .travis.yml. - Add Arm SVE entry to .travis.yml. commit 61584deddf9b3af6d11a811e6e04328d22390202 Author: RuQing Xu Date: Wed May 19 23:52:29 2021 +0900 Added 512b SVE-based a64fx subconfig + SVE kernels. Details: - Added 512-bit specific 'a64fx' subconfiguration that uses empirically tuned block size by Stepan Nassyr. This subconfig also sets the sector cache size and enables memory-tagging code in SVE gemm kernels. This subconfig utilizes (16, k) and (10, k) DPACKM kernels. - Added a vector-length agnostic 'armsve' subconfiguration that computes blocksizes according to the analytical model. This part is ported from Stepan Nassyr's repository. - Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE at size (2*VL, 10). These kernels use unindexed FMLA instructions because indexed FMLA takes 2 FMA units in many implementations. PS: There are indexed-FLMA kernels in Stepan Nassyr's repository. - Implemented 512-bit SVE dpackm kernels with in-register transpose support for sizes (16, k) and (10, k). - Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size (12, k). This dpackm kernel is not currently used by any subconfiguration. - Implemented several experimental dgemmsup kernels which would improve performance in a few cases. However, those dgemmsup kernels generally underperform hence they are not currently used in any subconfig. - Note: This commit squashes several commits submitted by RuQing Xu via PR #424. commit b683d01b9c4ea5f64c8031bda816beccfbf806a0 Author: Field G. Van Zee Date: Thu May 13 15:23:22 2021 -0500 Use extra #undef when including ba/ex API headers. 
Details: - Inserted a "#include bli_xapi_undef.h" after each usage of the basic and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to the previous status quo, in which each header made minimal #undef prior to its own definitions and then a single instance of "#include bli_xapi_undef.h" cleaned up any remaining macro defs after all other headers were used. This commit will guarantee that macro defs from the setup of one header (say, bli_oapi_ex.h) don't "infect" the definitions made in a subsequent header. As with this previous commit, this change does not fix any issue but rather attempts to avoid creating orphaned macro definitions that are only needed within a very limited scope. - Removed minimal #undef from bli_?api_[ba|ex].h. - Removed old commented-out lines from bli_?api_[ba|ex].h. commit d4427a5b2f5cab5d2a64c58d87416628867c2b4a Author: Field G. Van Zee Date: Thu May 13 13:55:11 2021 -0500 Minor preprocessor/header cleanup. Details: - Added frame/include/bli_xapi_undef.h, which explicitly undefines all macros defined in bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. (This is for safety and good cpp coding practice, not because it fixes anything.) - Added #include "bli_xapi_undef.h" to bli_l1v.h, bli_l1d.h, bli_l1f.h, bli_l1m.h, bli_l2.h, bli_l3.h, and bli_util.h. - Comment updates to bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. - Moved frame/3/bli_l3_ft_ex.h to local 'old' directory after realizing that nothing in BLIS used those function pointer types. Also commented out the "#include bli_l3_ft_ex.h" directive in frame/3/bli_l3.h. commit 5aa63cd927b22a04e581b07d0b68ef391f4f9b1f Author: Field G. Van Zee Date: Wed May 12 19:53:35 2021 -0500 Fixed typo in cpp guard in bli_util_ft.h. Details: - Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in bli_util_ft.h. 
This typo was causing some types to be redefined when they weren't supposed to be. commit f0e8634775094584e89f1b03811ee192f2aaf67f Author: Field G. Van Zee Date: Wed May 12 18:45:32 2021 -0500 Defined eqsc, eqv, eqm to test object equality. Details: - Defined eqsc, eqv, and eqm operations, which set a bool depending on whether the two scalars, two vectors, or two matrix operands are equal (element-wise). eqsc and eqv support implicit conjugation and eqm supports diagonal offset, diag, uplo, and trans parameters (in a manner consistent with other level-1m operations). These operations are currently housed under frame/util, at least for now, because they are not computational in nature. - Redefined bli_obj_equals() in terms of eqsc, eqv, and eqm. - Documented eqsc, eqv, and eqm in BLISObjectAPI.md and BLISTypedAPI.md. Also: - Documented getsc and setsc in both docs. - Reordered entry for setijv in BLISTypedAPI.md, and added separator bars to both docs. - Added missing "Observed object properties" clauses to various level-1v entries in BLISObjectAPI.md. - Defined bli_apply_trans() in bli_param_macro_defs.h. - Defined supporting _check() function, bli_l0_xxbsc_check(), in bli_l0_check.c for eqsc. - Programming style and whitespace updates to bli_l1m_unb_var1.c. - Whitespace updates to bli_l0_oapi.c, bli_l1m_oapi.c - Consolidated redundant macro redefinition for copym function pointer type in bli_l1m_ft.h. - Added macros to bli_oapi_ba.h, _ex.h, and bli_tapi_ba.h, _ex.h that allow oapi and tapi source files to forego defining certain expert functions. (Certain operations such as printv and printm do not need to have both basic and expert interfaces. This also includes eqsc, eqv, and eqm.)
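Note on the eqsc/eqv/eqm entry above: the idea of an element-wise equality test that "sets a bool" can be illustrated with a minimal sketch. The function name example_deqv and its signature are hypothetical and chosen only for illustration; this is not the BLIS implementation, it handles only real double-precision vectors, and it ignores conjugation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of BLIS): compares two strided vectors of
   doubles element-wise and reports the result through a bool passed in by
   address, rather than through the return value. */
void example_deqv( size_t n,
                   const double* x, ptrdiff_t incx,
                   const double* y, ptrdiff_t incy,
                   bool* is_eq )
{
    assert( is_eq != NULL );

    /* Assumption of this sketch: two empty vectors compare equal. */
    *is_eq = true;

    for ( size_t i = 0; i < n; ++i )
    {
        /* Stop at the first pair of elements that differ. */
        if ( x[ (ptrdiff_t)i * incx ] != y[ (ptrdiff_t)i * incy ] )
        {
            *is_eq = false;
            return;
        }
    }
}
```

A caller would declare a bool, pass its address as the last argument, and inspect it afterward. Reporting the result through an out-parameter leaves the return value free for other uses.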
commit 5d46dbee4a06ba5a422e19817836976f8574cb4f Author: Devin Matthews Date: Wed May 12 18:42:09 2021 -0500 Replace bli_dlamch with something less archaic (#498) Details: - Added new implementations of bli_slamch() and bli_dlamch() that use constants from the standard C library in lieu of dynamically-computed values (via code inherited from netlib). The previous implementation is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is defined by the subconfiguration at compile-time. Thanks to Devin Matthews for providing this patch, and to Stefano Zampini for reporting the issue (#497) that prompted Devin to propose the patch. commit 6a89c7d8f9ac3f51b5b4d8ccb2630d908d951e6f Author: Field G. Van Zee Date: Sat May 1 18:54:48 2021 -0500 Defined setijv, getijv to set/get vector elements. Details: - Defined getijv, setijv operations to get and set elements of a vector, in bli_setgetijv.c and .h. - Renamed bli_setgetij.c and .h to bli_setgetijm.c and .h, respectively. - Added additional bounds checking to getijm and setijm to prevent actions with negative indices. - Added documentation to BLISObjectAPI.md and BLISTypedAPI.md for getijv and setijv. - Added documentation to BLISTypedAPI.md for getijm and setijm, which were inadvertently missing. - Added a new entry to the FAQ titled "Why does BLIS have vector (level-1v) and matrix (level-1m) variations of most level-1 operations?" - Comment updates. commit 4534daffd13ed7a8983c681d3f5e9de17c9f0b96 Author: Field G. Van Zee Date: Tue Apr 27 18:16:44 2021 -0500 Minor API breakage in bli_pack API. Details: - Changed bli_pack_get_pack_a() and bli_pack_get_pack_b() so that instead of returning a bool, they set a bool that is passed in by address. This does break the public exported API, but I expect very few users actually use this function. (This change is being made in preparation for a much more extensive commit relating to error checking.) commit 6a4aa986ffc060d3e64ed230afe318b82630f8b2 Author: Field G. 
Van Zee Date: Fri Apr 23 13:10:01 2021 -0500 Fixed typo in Table of Contents. commit f6424b5b82160d346a09a0fbb526981ecf66cdb3 Author: Field G. Van Zee Date: Fri Apr 23 13:08:06 2021 -0500 Added dedicated Performance section to README.md. Details: - Spun off the Performance.md and PerformanceSmall.md links in the Documentation section into a new Performance section dedicated to those two links. (The previous entries remain redundantly listed within Documentation section.) Thanks to Robert van de Geijn for suggesting this change. commit 40ce5fd241b9ad140bf57278d440f0598d7f15d8 Merge: 6280757b 1f3461a5 Author: Devin Matthews Date: Wed Apr 21 09:54:25 2021 -0500 Merge pull request #493 from cassiersg/patch-1 Fix typo in FAQ.md commit 1f3461a5a5a88510f913451a93e3190ec1556f39 Author: Gaëtan Cassiers Date: Wed Apr 21 16:49:05 2021 +0200 Fix typo in FAQ.md commit 6548cebaf55a1f9bdb8417cc89dd0444d8f9c2e4 Author: Devin Matthews Date: Wed Apr 14 13:00:42 2021 -0500 Allow clang for ThunderX2 config Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc. commit 6280757be32f90fd77d8dd9357b07d9306e6f80d Author: Field G. Van Zee Date: Wed Apr 7 13:03:56 2021 -0500 Minor updates to a64fx section of Performance.md. commit 1e6ed823c6cd11f9b671779f3c8bdbd2bbb40f34 Author: RuQing Xu Date: Thu Apr 8 02:59:26 2021 +0900 Additional A64fx Comments (#490) * Performance.md Update A64fx Comments - Reason for ARMPL's missing data; - Additional envs / flags for kernel selection; - Update BLIS SRC commit. * Include Another Fix in armsve-cfg-vendor A prototype was forgotten, causing that void* pointer was not fully returned. commit 2688f21a5b073950f6f187c95917fdbb5aac234a Author: Field G. Van Zee Date: Tue Apr 6 19:02:37 2021 -0500 Added Fujitsu A64fx (512-bit SVE) perf results. Details: - Added single-threaded and multithreaded performance results to docs/Performance.md. 
These results were gathered on the "Fugaku" Fujitsu A64fx supercomputer at the RIKEN Center for Computational Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan Nassyr for their work in developing and optimizing A64fx support in BLIS and RuQing for gathering the performance data that is reflected in these new graphs. commit ba3ba8da83d48397162139e11337c036a631ba79 Author: Field G. Van Zee Date: Tue Apr 6 18:39:58 2021 -0500 Minor updates and fixes to test/3/octave scripts. Details: - Fixed an issue where the wrong string was being passed in for the vendor legend string. - Changed the graph in which the legends appear. - Updates to runthese.m. commit 09bd4f4f12311131938baa9f75d27e92b664d681 Author: Field G. Van Zee Date: Wed Mar 31 17:09:36 2021 -0500 Add err_t* "return" parameter to malloc functions. Details: - Added an err_t* parameter to memory allocation functions including bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(), bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions already use the return value to return the allocated memory address, they can't communicate errors to the caller through the return value. This commit does not employ any error checking within these functions or their callers, but this sets up BLIS for a more comprehensive commit that moves in that direction. - Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to bli_type_defs.h. This was done so that what remains of bli_malloc.h can be included after the definition of the err_t enum. (This ordering was needed because bli_malloc.h now contains function prototypes that use err_t.) - Defined bli_is_success() and bli_is_failure() static functions in bli_param_macro_defs.h. These functions provide easy checks for error codes and will be used more heavily in future commits. - Unfortunately, the additional err_t* argument discussed above breaks the API for bli_malloc_user(), which is an exported symbol in the shared library. 
However, it's quite possible that the only application that calls bli_malloc_user()--indeed, the reason it was marked for symbol exporting to begin with--is the BLIS testsuite. And if that's the case, this breakage won't affect anyone. Nonetheless, the "major" part of the so_version file has been updated accordingly to 4.0.0. commit f9ad55ce7e12f59930605753959fcfd41a218d8d Merge: 04502492 90508192 Author: Field G. Van Zee Date: Wed Mar 31 14:20:19 2021 -0500 Merge branch 'master' into dev commit 90508192f2d6ae95adc2a3ba9f4e5bad2c8d6fd2 Author: Devin Matthews Date: Tue Mar 30 21:16:44 2021 -0500 Update do_sde.sh (#489) Update to a newer version of SDE, and do a direct download as it seems you don't have to click-through the license anymore. commit 22c6b5dc4c9cc21942f8ccc30891f9b4385a9504 Author: Nicholai Tukanov Date: Tue Mar 30 19:07:42 2021 -0500 Fixed bug in power10 microkernel I/O. (#488) Details: - Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did not store the microtile result correctly due to incorrect indices calculations. (The error was introduced when I reorganized the 'kernels/power10/3' directory.) commit 04502492671456b94bcdee60b9de347b6763a32d Author: Field G. Van Zee Date: Sun Mar 28 19:11:43 2021 -0500 Always stay initialized after BLAS compat calls. Details: - Removed the option to finalize BLIS after every BLAS call, which also means that BLIS would initialize at the beginning of every BLAS call. This option never really made sense and wasn't even implemented properly to begin with. (Because bli_init_auto() and _finalize_auto() were implemented in terms of bli_init_once() and _finalize_once(), respectively, the application would have only been able to call one BLAS routine before BLIS would find itself in an unusable, permanently uninitialized state.)
Because this option was never meant for regular use, it never made it into configure as an actual configure-time option, and therefore this commit only removes parts of the code affected by the cpp macro guard BLIS_ENABLE_STAY_AUTO_INITIALIZED. commit 3a6f41afb8197e831b6ce2f1ae7f63735685fa0a Author: Field G. Van Zee Date: Sat Mar 27 17:22:14 2021 -0500 Renamed membrk files/vars/functions to pba. Details: - Renamed the files, variables, and functions relating to the packing block allocator from its legacy name (membrk) to its current name (pba). This more clearly contrasts the packing block allocator with the small block allocator (sba). - Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that caused the function to erroneously change the value of the pack_a field of the global rntm_t instead of the pack_b field. (Apparently nobody has used this API yet.) - Comment updates. commit 36cb4116d15cfef2d42ec4a834efd4a958f261b5 Author: Field G. Van Zee Date: Sat Mar 27 15:15:09 2021 -0500 Switch allocator mutexes to static initialization. Details: - Switched the small block allocator (sba), as defined in bli_sba.c and bli_apool.c, to static initialization of its internal mutex. Did a similar thing for the packing block allocator (pba), which appears as global_membrk in bli_membrk.c. - Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex() to ensure they won't be used in the future. - In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp blocks guarded by BLIS_USE_PTHREAD_MUTEX. commit 159ca6f01a5f91b93513134c9470b69ff78f5354 Author: Field G. Van Zee Date: Wed Mar 24 15:57:32 2021 -0500 Made test/3/octave scripts robust to missing data. Details: - Modified the octave scripts in test/3 so that the script does not choke when one or more of the expected OpenBLAS, Eigen, or vendor data files is missing. (The BLIS data set, however, must be complete.) When a file is missing, that data series is simply not included on that particular graph. 
Also factored out a lot of the redundant logic from plot_panel_4x5.m into a separate function in read_data.m. commit 545e6c2f6d09d023b353002a9a43b11aa0c1d701 Author: Field G. Van Zee Date: Mon Mar 22 17:42:33 2021 -0500 CHANGELOG update (0.8.1) commit 8535b3e11d2297854991c4272932ce4974dda629 (tag: 0.8.1) Author: Field G. Van Zee Date: Mon Mar 22 17:42:33 2021 -0500 Version file update (0.8.1) commit e56d9f2d94ed247696dda2cbf94d2ca05c7fc089 Author: Field G. Van Zee Date: Mon Mar 22 17:40:50 2021 -0500 ReleaseNotes.md update in advance of next version. commit ca83f955d45814b7d84f53933cdb73323c0dea2c Author: Field G. Van Zee Date: Mon Mar 22 17:21:21 2021 -0500 CREDITS file update. commit 57ef61f6cdb86957f67212aa59407f2f8e7f3d1a Merge: bf1b578e e7a4a8ed Author: Field G. Van Zee Date: Fri Mar 19 13:05:43 2021 -0500 Merge branch 'master' of github.com:flame/blis commit bf1b578ea32ea1c9dbf7cb3586969e8ae89aa5ef Author: Field G. Van Zee Date: Fri Mar 19 13:03:17 2021 -0500 Reduced KC on skx from 384 to 256. Details: - Reduced the KC cache blocksize for double real on the skx subconfig from 384 to 256. The maximum (extended) KC was also reduced accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting this change. commit e7a4a8edc940942357e8e4c4594383a29a962f93 Author: Nicholai Tukanov Date: Wed Mar 17 19:43:31 2021 -0500 Fix calculation of new pb size (#487) Details: - Added missing parentheses to the i8 and i4 instantiations of the GENERIC_GEMM macro in sandbox/power10/generic_gemm.c. commit 4493cf516e01aba82642a43abe350943ba458fe2 Author: Field G. Van Zee Date: Mon Mar 15 13:12:49 2021 -0500 Redefined BLIS_NUM_ARCHS to update automatically. Details: - Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum value in the arch_t enum. This means that it no longer needs to get updated manually whenever new subconfigurations are added to BLIS. 
Also removed the explicit initial index assignment of 0 from the first enum value, which was unnecessary due to how the C language standard mandates indexing of enum values. Thanks to Devin Matthews for originally submitting this as a PR in #446. - Updated docs/ConfigurationHowTo.md to reflect the aforementioned change. commit a4b73de84cdffcbe5cf71969a0f7f0f8202b3510 Author: Field G. Van Zee Date: Fri Mar 12 17:12:27 2021 -0600 Disabled _self() and _equal() in bli_pthread API. Details: - Disabled the _self() and _equal() extensions to the bli_pthread API introduced in d479654. These functions were disabled after I realized that they aren't actually needed yet. Thanks to Devin Matthews for helping me reason through the appropriate consumer code that will appear in BLIS (eventually) in a future commit. (Also, I could never get the Windows branch to link properly in clang builds in AppVeyor. See the comment I left in the code, and #485, for more info.) commit f9d604679d8715bc3e79a8630268446889b51388 Author: Field G. Van Zee Date: Thu Mar 11 16:57:55 2021 -0600 Added _self() and _equal() to bli_pthread API. Details: - Expanded the bli_pthread API to include equivalents to pthread_self() and pthread_equal(). Implemented these two functions for all three cpp branches present within bli_pthread.c: systemless, Windows, and Linux/BSD. commit fa9b3c8f6b3d5717f19832362104413e1a86dfb0 Author: Field G. Van Zee Date: Thu Mar 11 15:13:51 2021 -0600 Shuffled code in Windows branch of bli_pthreads.c. Details: - Reordered the definitions in the cpp branch in bli_pthreads.c that defines the bli_pthreads API in terms of Windows API calls. Also added missing comments that mark sections of the API, which brings the code into harmony with other cpp branches (as well as bli_pthread.h). commit 95d4f3934d806b3563f6648d57a4e381d747caf5 Author: Field G. Van Zee Date: Thu Mar 11 13:50:40 2021 -0600 Moved cpp macro redef of strerror_r to bli_env.c.
Details: - Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r (in terms of strerror_s) from bli_thread.h to bli_env.c. It was likely left behind in bli_thread.h in a previous commit, when code that now resides in bli_env.c was moved from bli_thread.c. (I couldn't find any other instance of strerror_r being used in BLIS, so I moved the #define directly to bli_env.c rather than place it in bli_env.h.) The code that uses strerror_r is currently disabled, though, so this commit should have no effect on BLIS. commit 8a3066c315358d45d4f5b710c54594455f9e8fc6 Author: Field G. Van Zee Date: Tue Mar 9 17:52:59 2021 -0600 Relocated gemmsup_ref general stride handling. Details: - Moved the logic that checks for general stridedness in any of the matrix operands in a gemmsup problem. The logic previously resided near the top of bli_gemmsup_int(), which is the thread entry point for the parallel region of the current gemmsup implementation. The problem with this setup was that the code would attempt to reject problems with any general-strided operands by returning BLIS_FAILURE, and that return value was then being ignored by the l3_sup thread decorator, which unconditionally returns BLIS_SUCCESS. To solve this issue, rather than try to manage n return values, one from each of n threads, I simply moved the logic into bli_gemmsup_ref(). I didn't move it any higher (e.g. bli_gemmsup()) because I still want the logic to be part of the current gemmsup handler implementation. That is, perhaps someone else will create a different handler, and that author wants to handle general stride differently. (We don't want to force them into a particular way of handling general stride.) - Removed the general stride handling from bli_gemmtsup_int(), even though this function is inoperative for now. - This commit addresses issue #484. Thanks to RuQing Xu for reporting this issue.
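Note on the gemmsup entry above: the restructuring it describes (rejecting general-strided operands once, in the handler, before the parallel region starts) can be sketched roughly as follows. All names here (ex_err_t, ex_mat_t, ex_thread_decorator, ex_sup_handler) are hypothetical stand-ins rather than BLIS symbols, and "general stride" is assumed to mean that neither the row stride nor the column stride is unit in magnitude.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical error codes and matrix descriptor for this sketch only. */
typedef enum { EX_SUCCESS = 0, EX_FAILURE = -1 } ex_err_t;
typedef struct { long rs; long cs; } ex_mat_t; /* row and column strides */

/* Assumption: a matrix is general-strided when neither stride is +/-1. */
bool ex_has_general_stride( const ex_mat_t* m )
{
    return labs( m->rs ) != 1 && labs( m->cs ) != 1;
}

/* Stand-in for a thread decorator that launches the parallel region and,
   like the one described above, unconditionally reports success. */
ex_err_t ex_thread_decorator( const ex_mat_t* a,
                              const ex_mat_t* b,
                              const ex_mat_t* c )
{
    (void)a; (void)b; (void)c;
    return EX_SUCCESS;
}

/* The handler validates the operands once, before any threads exist, so a
   single return value reaches the caller. */
ex_err_t ex_sup_handler( const ex_mat_t* a,
                         const ex_mat_t* b,
                         const ex_mat_t* c )
{
    if ( ex_has_general_stride( a ) ||
         ex_has_general_stride( b ) ||
         ex_has_general_stride( c ) )
        return EX_FAILURE;

    return ex_thread_decorator( a, b, c );
}
```

Performing the check above the decorator avoids having to collect and combine one return value per thread, which is the difficulty the commit message points out.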
commit 670bc7b60f6065893e8ec1bebd2fc9e5ba710dff Author: Nicholai Tukanov Date: Fri Mar 5 13:53:43 2021 -0600 Add low-precision POWER10 gemm kernels (#467) Details: - This commit adds a new BLIS sandbox that (1) provides implementations based on low-precision gemm kernels, and (2) extends the BLIS typed API for those new implementations. Currently, these new kernels can only be used for the POWER10 microarchitecture; however, they may provide a template for developing similar kernels for other microarchitectures (even those beyond POWER), as changes would likely be limited to select places in the microkernel and possibly the packing routines. The new low-precision operations that are now supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more information, refer to the POWER10.md document that is included in 'sandbox/power10'. commit b8dcc5bc75a746807d6f8fa22dc2123c98396bf5 Author: RuQing Xu Date: Tue Mar 2 06:58:24 2021 +0800 Fixed typed API definition for gemmt (#476) Details: - Fixed incorrect definition and prototype of bli_?gemmt() in frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously defined identically to gemm, which was wrong because it did not take into account the uplo property of C. - Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md. Specifically, the document erroneously listed only a single transab parameter instead of transa and transb. commit a0e4fe2340a93521e1b1a835a96d0f26dec8406a Author: Ilknur Date: Tue Mar 2 02:06:56 2021 +0400 Fixed double free() in level1v example (#482) Details: - In examples/tapi/00level1v.c, pointer 'z' was being freed twice and pointer 'a' was not being freed at all. This commit correctly frees each pointer exactly once. commit f5871c7e06a75799251d6b55a8a5fbfa1a92cf95 Author: Field G. Van Zee Date: Sun Feb 28 17:03:57 2021 -0600 Added complex asm packm kernels for 'haswell' set.
Details: - Implemented assembly-based packm kernels for single- and double- precision complex domain (c and z) and housed them in the 'haswell' kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Minor modifications to the corresponding s and d packm kernels that were introduced in 426ad67. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), upon which these complex kernels are partially based. commit 426ad679f55264e381eb57a372632b774320fb85 Author: Field G. Van Zee Date: Sat Feb 27 18:39:56 2021 -0600 Added assembly packm kernels for 'haswell' set. Details: - Implemented assembly-based packm kernels for single- and double- precision real domain (s and d) and housed them in the 'haswell' kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all optimized. - Registered the aforementioned packm kernels in the haswell, zen, and zen2 subconfigs. - Thanks to AMD, who originally contributed the double-precision real packm kernels (d6xk and d8xk), which I have now tweaked and used to create comparable single-precision real kernels (s6xk and s16xk). commit f50c1b7e5886d29efe134e1994d05af9949cd4b6 Merge: 8f39aea1 b3953b93 Author: Devin Matthews Date: Mon Feb 1 11:55:51 2021 -0600 Merge pull request #473 from ajaypanyala/pkgconfig build: generate pkgconfig file commit 8f39aea11f80a805b66cff4b4dc5e72727ea461d Merge: f8db9fb3 2a815d5b Author: Field G. Van Zee Date: Sat Jan 30 17:59:56 2021 -0600 Merge branch 'dev' commit f8db9fb33b48844d6b47fdef699625bd9197745a Author: Field G. Van Zee Date: Thu Jan 28 08:04:52 2021 -0600 Fixed missing parentheses in README.md Citations. 
commit b3953b938eee59f79b4a4162ba583a5cb59fa34e Author: Ajay Panyala Date: Tue Jan 12 17:07:04 2021 -0800 drop CFLAGS in the generated pkgconfig file commit b02d9376bac31c1a1c7916f44c4946277a1425e2 Author: Ajay Panyala Date: Mon Jan 11 20:50:01 2021 -0800 add datadir commit d8d8deeb6d8b84adb7ae5fdb88c6dd4f06624a76 Author: Ajay Panyala Date: Mon Jan 11 17:47:50 2021 -0800 generate pkgconfig file commit 8c65411c7c8737248a6f054ffa0ce008c95cb515 Merge: 328b4f88 874c3f04 Author: Devin Matthews Date: Mon Jan 11 16:01:45 2021 -0600 Merge pull request #471 from flame/fix-470 Fix kernel-to-config mapping for intel64 commit 874c3f04ece9af4d8fdf0e2713e21a259c117656 Author: Devin Matthews Date: Fri Jan 8 13:56:30 2021 -0600 Update configure Choose last sub-config in the kernel-to-config map if the config list doesn't contain the name of the kernel set. E.g. for "zen: skx knl haswell" pick "haswell" instead of "skx" which was chosen previously. Fixes #470. commit 2a815d5b365d934cb351b2f2a8cd1366e997b2e1 Author: Field G. Van Zee Date: Mon Jan 4 18:03:39 2021 -0600 Support trsm pre-inversion in 1m, bb, ref kernels. Details: - Expanded support for disabling trsm diagonal pre-inversion to other microkernel types, including the reference microkernel as well as the kernel implementations for 1m and the pre-broadcast B (bb) format used by the power9 subconfig. This builds on the 'haswell' and 'penryn' kernel support added in 7038bba. Thanks to Bhaskar Nallani for reminding me, in #461 (post-closure), that 1m support was missing from that commit. - Removed cpp branch of ref_kernels/3/bli_trsm_ref.c that contained the omp simd implementation after making a stripped-down copy in 'old'. This code has been disabled for some time and it seemed better suited to rot away out of sight rather than clutter up a file that is already cluttered by the presence of lower and upper versions. - Minor comment update to bli_ind_init(). commit c3ed2cbb9f60100fc9beb2a9d75476de9f711dc5 Author: Field G. 
Van Zee Date: Mon Jan 4 16:16:32 2021 -0600 Enable 1m only if real domain ukr is not reference. Details: - Previously, BLIS would automatically enable use of the 1m method for a given precision if the complex domain microkernel was a reference kernel. This commit adds an additional constraint so that 1m is only enabled if the corresponding real domain microkernel is NOT reference. That is, BLIS now forgoes use of 1m if both the real and complex domain kernels are reference implementations. Note that this does not prevent 1m from being enabled manually under those conditions; it only means that 1m will not be enabled automatically at initialization-time. commit ed50c947385ba3b0b5d550015f38f7f0a31755c0 Merge: 0cef09aa 328b4f88 Author: Field G. Van Zee Date: Mon Jan 4 14:31:44 2021 -0600 Merge branch 'master' into dev commit 328b4f8872b4bca9a53d2de8c6e285f3eb13d196 Author: Devin Matthews Date: Wed Dec 30 17:54:18 2020 -0600 Shared object (dylib) was not built correctly for partial build. The SO build rule used $? instead of $^. Observed on macOS, not sure if it affected Linux or not. commit ae6ef66ef824da9bc6348bf9d1b588cd4f2ded9b Author: Devin Matthews Date: Wed Dec 30 17:34:55 2020 -0600 bli_diag_offset_with_trans had wrong return type. Fixes #468. commit ebcf197fb86fdd0a864ea928140752bc2462e8c6 Merge: 472f138c 21aa67e1 Author: Devin Matthews Date: Sat Dec 5 22:26:27 2020 -0600 Merge pull request #466 from isuruf/patch-3 fix cc_vendor for crosstool-ng toolchains commit 21aa67e11cebbc5a6dd7c6353154256294df3c33 Author: Isuru Fernando Date: Sat Dec 5 21:59:13 2020 -0600 fix cc_vendor for crosstool-ng toolchains commit 472f138cb927b7259126ebb9c68919cfcc7a4ea3 Author: Field G. Van Zee Date: Sat Dec 5 14:13:52 2020 -0600 Fixed typo in README.md to CodingConventions.md. commit 0cef09aa92208441a656bf097f197ea8e22b533b Author: Field G. Van Zee Date: Fri Dec 4 16:40:59 2020 -0600 Consolidated code in level-3 _front() functions.
Details: - Reduced a code segment that appears in all of the bli_*_front() functions except for bli_gemm_front(). Previously, the code looked like this (taken from bli_herk_front()): if ( bli_cntx_method( cntx ) == BLIS_NAT ) { bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local ); bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local ); } else // if ( bli_cntx_method( cntx ) != BLIS_NAT ) { pack_t schema_a = bli_cntx_schema_a_block( cntx ); pack_t schema_b = bli_cntx_schema_b_panel( cntx ); bli_obj_set_pack_schema( schema_a, &a_local ); bli_obj_set_pack_schema( schema_b, &ah_local ); } This code segment is part of a sort-of-hack that allows us to communicate the pack schemas into the level-3 thread decorator, which needs them so that they can be passed into bli_l3_cntl_create_if(), where the control tree is created. However, the first conditional case above is unnecessary because the second case is fully generalized. That is, even in the native case, the context contains correct, queryable schemas. Thus, these code segments were reduced to something like: pack_t schema_a = bli_cntx_schema_a_block( cntx ); pack_t schema_b = bli_cntx_schema_b_panel( cntx ); bli_obj_set_pack_schema( schema_a, &a_local ); bli_obj_set_pack_schema( schema_b, &ah_local ); There's always a small chance that the seemingly unnecessary code in the first branch case has some special use that is not apparent to me, but the testsuite's default input parameters seem to think this commit will be fine. commit 7038bbaa05484141195822291cf3ba88cbce4980 Author: Field G. Van Zee Date: Fri Dec 4 16:08:15 2020 -0600 Optionally disable trsm diagonal pre-inversion. Details: - Implemented a configure-time option, --disable-trsm-preinversion, that optionally disables the pre-inversion of diagonal elements of the triangular matrix in the trsm operation and instead uses division instructions within the gemmtrsm microkernels. Pre-inversion is enabled by default. 
When it is disabled, performance may suffer slightly, but numerical robustness should improve for certain pathological cases involving denormal (subnormal) numbers that would otherwise result in overflow in the pre-inverted value. Thanks to Bhaskar Nallani for reporting this issue via #461. - Added preprocessor macro guards to bli_trsm_cntl.c as well as the gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant to the aforementioned feature. - Added macros to frame/include/bli_x86_asm_macros.h related to division instructions. commit 78aee79452cce2691c40f05b3632bdfc122300af Author: Field G. Van Zee Date: Wed Dec 2 13:02:36 2020 -0600 Allow amaxv testsuite module to run with dim = 0. Details: - Exit early from libblis_test_amaxv_check() when the vector dimension (length) of x is 0. This allows the module to run when the testsuite driver passes in a problem size of 0. Thanks to Meghana Vankadari for alerting us to this issue via #459. - Note: All other testsuite modules appear to work with problem sizes of 0, except for the microkernel modules. I chose not to "fix" those modules because a failure (or segmentation fault, as happens in this case) is actually meaningful in that it alerts the developer that some microkernels cannot be used with k = 0. Specifically, the 'haswell' kernel set contains microkernels that preload elements of B. Those microkernels would need to be restructured to avoid preloading in order to support usage when k = 0. commit 92d2b12a44ee0990c22735472aeaf1c17deb2d9b Author: Field G. Van Zee Date: Wed Dec 2 13:02:00 2020 -0600 Fixed obscure testsuite gemmt dependency bug. Details: - Fixed a bug in the gemmt testsuite module that only manifested when testing of gemmt is enabled but testing of gemv is disabled. The bug was due to a copy-paste error dating back to the introduction of gemmt in 88ad841. commit b43dae9a5d2f078c9bbe07079031d6c00a68b7de Author: Field G. 
Van Zee Date: Tue Dec 1 16:44:38 2020 -0600 Fixed copy-paste bugs in edge-case sup kernels. Details: - Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly instructions that were left over from when the kernels were first written. These instructions would cause segmentation faults in some situations where extra memory was not allocated beyond the end of the matrix buffers. Thanks to Kiran Varaganti for reporting these bugs and to Bhaskar Nallani for identifying the cause and solution. commit 11dfc176a3c422729f453f6c23204cf023e9954d Author: Field G. Van Zee Date: Tue Dec 1 19:51:27 2020 +0000 Reorganized thread auto-factorization logic. Details: - Reorganized logic of bli_thread_partition_2x2() so that the primary guts were factored out into "fast" and "slow" variants. Then added logic to the "fast" variant that allows for more optimal thread factorizations in some situations where there is at least one factor of 2. - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and added comments to that file describing BLIS_THREAD_RATIO_? and BLIS_THREAD_MAX_?R. - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several macros not used in vanilla BLIS and removed the unused macro BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file. - Disabled AMD's small matrix handling entry points in bli_syrk_front.c and bli_trsm_front.c. (These branches of small matrix handling have not been reviewed by vanilla BLIS developers.) - Added commented-out calls to printf() in bli_rntm.c. - Whitespace changes to bli_thread.c. commit 6d3bafacd7aa7ad198762b39490876c172bfbbcb Author: Devin Matthews Date: Sat Nov 28 17:17:56 2020 -0600 Update BuildSystem.md Add git version >= 1.8.5 requirement (see #462). commit 64856ea5a61b01d585750815788b6a775f729647 Author: Field G. Van Zee Date: Mon Nov 23 16:54:51 2020 -0600 Auto-reduce (by default) prime numbers of threads.
Details: - When requesting multithreaded parallelism by specifying the total number of threads (whether it be via environment variable, globally at runtime, or locally at runtime), reduce the number of threads actually used by one if the original value (a) is prime and (b) exceeds a minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set to 11 by default. If, when specifying the total number of threads (and not the individual ways of parallelism for each loop), prime numbers of threads are desired, this feature may be overridden by defining the BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that corresponds to the configuration family targeted at configure-time. (For now, there is no configure option(s) to control this feature.) Thanks to Jeff Diamond for suggesting this change. - Defined a new function in bli_thread.c, bli_is_prime(), that returns a bool that determines whether an integer is prime. This function is implemented in terms of existing functions in bli_thread.c. - Updated docs/Multithreading.md to document the above feature, along with unrelated minor edits. commit 55933b6ff6b9b8a12041715f42bba06273d84b74 Author: Field G. Van Zee Date: Fri Nov 20 10:39:32 2020 -0600 Added missing attribution to docs/ReleaseNotes.md. commit e310f57b4b29fbfee479e0f9fe2040851efdec4f Author: Field G. Van Zee Date: Thu Nov 19 13:33:37 2020 -0600 CHANGELOG update (0.8.0) commit 9b387f6d5a010969727ec583c0cdd067a5274ed8 (tag: 0.8.0) Author: Field G. Van Zee Date: Thu Nov 19 13:33:37 2020 -0600 Version file update (0.8.0) commit 2928ec750d3a3e1e5d55de5b57ddc04e9d0bd796 Author: Field G. Van Zee Date: Wed Nov 18 18:31:35 2020 -0600 ReleaseNotes.md update in advance of next version. Details: - Updated docs/ReleaseNotes.md in preparation for next version. commit b9899bedff6854639468daa7a973bb14ca131a74 Author: Field G. Van Zee Date: Wed Nov 18 16:52:41 2020 -0600 CREDITS file update. commit 9bb23e6c2a44b77292a72093938ab1ee6e6cc26a Author: Field G. 
Van Zee Date: Mon Nov 16 15:55:45 2020 -0600 Added support for systemless build (no pthreads). Details: - Added a configure option, --[enable|disable]-system, which determines whether the modest operating system dependencies in BLIS are included. The most notable example of this on Linux and BSD/OSX is the use of POSIX threads to ensure thread safety for when application-level threads call BLIS. When --disable-system is given, the bli_pthreads implementation is dummied out entirely, allowing the calling code within BLIS to remain unchanged. Why would anyone want to build BLIS like this? The motivating example was submitted via #454 in which a user wanted to build BLIS for a simulator such as gem5 where thread safety may not be a concern (and where the operating system is largely absent anyway). Thanks to Stepan Nassyr for suggesting this feature. - Another, more minor side effect of the --disable-system option is that the implementation of bli_clock() unconditionally returns 0.0 instead of the time elapsed since some fixed point in the past. The reasoning for this is that if the operating system is truly minimal, the system function call upon which bli_clock() would normally be implemented (e.g. clock_gettime()) may not be available. - Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h to remove redundancies. - Removed old comments and commented #include of "bli_pthread_wrap.h" from bli_system.h. - Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md and BLISTypedAPI.md, with a note that both are non-functional when BLIS is configured with --disable-system. commit 88ad84143414644df4c56733b1cf91a36bfacaf8 Author: Field G. Van Zee Date: Sat Nov 14 09:39:48 2020 -0600 Squash-merge 'pr' into 'squash'. (#457) Merged contributions from AMD's AOCL BLIS (#448). Details: - Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. 
For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including: - A bli_gemmtsup() front-end, similar to bli_gemmsup(). - A bli_gemmtsup_ref() reference handler function. - A bli_gemmtsup_int() variant chooser function (with variant calls commented out). - Added support for inducing complex domain gemmt via the 1m method. - Added gemmt APIs to the BLAS and CBLAS compatibility layers. - Added gemmt test module to testsuite. - Added standalone gemmt test driver to 'test' directory. - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md. - Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed when running the 'install' target with INSTALL_HH set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory. - Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions. - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them. - Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in). - Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system, caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure. - Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk. - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS. - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking. - Added cpp guards to test/*.c drivers that facilitate compilation on Windows systems. - Various whitespace changes. commit 234b8b0cf48f1ee965bd7999b291fc7add3b9a54 Author: Field G. Van Zee Date: Thu Nov 12 19:11:16 2020 -0600 Increased dotxaxpyf testsuite thresholds. Details: - Increased the test thresholds used by the dotxaxpyf testsuite module by a factor of five in order to avoid residuals that unnecessarily fall in the MARGINAL range. This commit should fix #455. Thanks to @nagsingh for reporting this issue. commit ed612dd82c50063cfd23576a6b2465213d31b14b Author: Field G. Van Zee Date: Sat Nov 7 13:09:42 2020 -0600 Updated README.md with sgemmsup blurb. Details: - Added an entry to the "What's New" section of the README.md to announce the availability of sgemmsup. commit e14424f55b15d67e8d18384aea45a11b9b772e02 Merge: 0cfe1aac eccdd75a Author: Field G.
Van Zee Date: Sat Nov 7 13:02:50 2020 -0600 Merge branch 'dev' commit 0cfe1aac222008a78dff3ee03ef5183413936706 Author: Field G. Van Zee Date: Fri Oct 30 17:10:36 2020 -0500 Relocated operation index to ToC in API docs. Details: - Moved the "Operation index" section of both the BLISObjectAPI.md and BLISTypedAPI.md docs to appear immediately after the table of contents of each document. This allows the reader to quickly jump to the documentation for any operation without having to scroll through much of the document (when rendered via a web browser). - Fixed a mistake in the BLISObjectAPI.md for the setd operation, which does *not* observe the diag property of its matrix argument. Thanks to Jeff Diamond for reporting this. commit 2a0682f8e5998be536da313525292f0da6193147 Author: Field G. Van Zee Date: Sun Oct 18 18:04:03 2020 -0500 Implemented runtime subconfig selection (#451). Details: - Implemented support for the user manually overriding the automatic subconfiguration selection that happens at runtime. This override can be requested by setting the BLIS_ARCH_TYPE environment variable. The variable must be set to the arch_t id (as enumerated in bli_type_defs.h) corresponding to the desired subconfiguration. If a value outside this enumerated range is given, BLIS will abort with an error message. If the value is in the valid range but corresponds to a subconfiguration that was not activated at configure-time/compile-time, BLIS will abort with a (different) error message. Thanks to decandia50 for suggesting this feature via issue #451. - Defined a new function bli_gks_lookup_id to return the address of an internal data structure within the gks. If this address is NULL, then it indicates that the subconfig corresponding to the arch_t id passed into the function was not compiled into BLIS. This function is used in the second of the two abort scenarios described above. 
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which is returned for the latter of the two abort scenarios mentioned above, along with a corresponding error message and a function to perform the error check. - Added cpp macro branching to bli_env.c to support compilation of the auto-detect.x executable during configure-time. This cpp branch is similar to the cpp code already found in bli_arch.c and bli_cpuid.c. - Cleaned up the auto_detect() function to facilitate easier maintenance going forward. Also added a convenient debug switch that outputs the compilation command for the auto-detect.x executable and exits. commit eccdd75a2d8a0c46e91e94036179c49aa5fa601c Author: Field G. Van Zee Date: Fri Oct 9 15:44:16 2020 -0500 Whitespace tweak in docs/PerformanceSmall.md. commit 7677e9ba60ac27496e3421c2acc7c239e3f860e9 Merge: addcd46b a0849d39 Author: Field G. Van Zee Date: Fri Oct 9 15:41:25 2020 -0500 Merge branch 'dev' of github.com:flame/blis into dev commit addcd46b0559d401aa7d33d4c7e6f63f5313a8e0 Author: Field G. Van Zee Date: Fri Oct 9 15:41:09 2020 -0500 Added Epyc 7742 Zen2 ("Rome") sup perf results. Details: - Added single-threaded and multithreaded sup performance results to docs/PerformanceSmall.md for both sgemm and dgemm. These results were gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2 microarchitecture. Special thanks to Jeff Diamond for facilitating access to the system via the Oracle Cloud. - Updates to octave scripts in test/sup/octave for use with Octave 5.2 and for use with subplot_tight(). - Minor updates to octave scripts in test/3/octave. - Renamed files containing the previous Zen performance results for consistency with the new results. - Decreased line thickness slightly in large/conventional Zen2 graphs. I'm done tweaking those this time. Really. - Added missing line regarding eigen header installation for each microarchitecture section. commit a0849d390d04067b82af937cda8191b049b98915 Author: Field G. 
Van Zee Date: Fri Oct 9 20:22:17 2020 +0000 Register l3 sup kernels in zen2 subconfig. Details: - Registered full suite of sgemm and dgemm sup millikernels, blocksizes, and crossover thresholds in bli_cntx_init_zen2.c. - Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742 system. commit d98368c32d5fbfaab8966ee331d9bcb5c4fe7a59 Author: Field G. Van Zee Date: Thu Oct 8 19:05:51 2020 -0500 Another tweak to line thickness of Zen2 graphs. commit 1855dfbdaafa37892b36c97fd317fd5d8da76676 Author: Field G. Van Zee Date: Thu Oct 8 19:01:00 2020 -0500 Tweaked line thickness in Zen2 graphs once more. Details: - Decreased (relative to previous commit) line thickness in recent Zen2 graphs. commit 0991611e7ed82889c53a5c3f1ef1d49552c50d61 Author: Field G. Van Zee Date: Thu Oct 8 18:54:49 2020 -0500 Increased line thickness in recent Zen2 graphs. Details: - Increased the width of the lines in the graphs introduced in 74ec6b8. commit 8273cbacd7799e9af59e5320d66055f2f5d9cb31 Author: Field G. Van Zee Date: Wed Oct 7 14:51:33 2020 -0500 README.md, docs/FAQ.md updates. Details: - Added a frequently asked question to docs/FAQ.md regarding the difference between upstream (vanilla) BLIS and AMD BLIS. - Updated the name of ICES in the README.md to reflect the Oden rebranding. commit a178a822ad3d5021489a0e61f909d8550ae12a8f Author: Field G. Van Zee Date: Wed Sep 30 16:00:52 2020 -0500 Added Zen2 links to docs/Performance.md Contents. commit 74ec6b8f457cabe37d2382aaab35ba04fc737948 Author: Field G. Van Zee Date: Wed Sep 30 15:54:18 2020 -0500 Added Epyc 7742 Zen2 ("Rome") performance results. Details: - Added single-threaded and multithreaded performance results to docs/Performance.md. These results were gathered on an Epyc 7742 "Rome" server with AMD's Zen2 microarchitecture. Special thanks to Jeff Diamond for facilitating access to the system via the Oracle Cloud. - Renamed files containing the previous Zen performance results for consistency with the new results. 
commit bc4a213a2c3dcf8bbfcbb3a1ef3e9fc9e3226c34 Author: Field G. Van Zee Date: Wed Sep 30 15:28:20 2020 -0500 Updated matlab (now octave) plot code in test/3. Details: - Renamed test/3/matlab to test/3/octave. - Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m files for use with octave (which is free and doesn't crash on me mid-way through my use of subplot). - Updated runthese.m scratchpad for zen2 invocations. - Added Nikolay S.'s subplot_tight() function, along with its license. commit c77ddc418187e1884fa6bcfe570eee295b9cb8bc Author: Field G. Van Zee Date: Wed Sep 30 20:15:43 2020 +0000 Added optional numactl usage to test/3/runme.sh. commit 2d8ec164e7ae4f0c461c27309dc1f5d1966eb003 Author: Nicholai Tukanov Date: Tue Sep 29 16:52:18 2020 -0500 Add POWER10 support to BLIS (#450) commit 4fd8d9fec2052257bf2a5c6e0d48ae619ff6c3e4 Author: Field G. Van Zee Date: Mon Sep 28 23:39:05 2020 +0000 Tweaked zen2 subconfig's MC cache blocksizes. Details: - Updated the MC cache blocksizes registered by the 'zen2' subconfig. - Minor updates to test/3/Makefile and test/3/runme.sh. commit 5efcdeffd58af621476d179afc0c19c0f912baa8 Author: Field G. Van Zee Date: Fri Sep 25 14:25:24 2020 -0500 More minor README.md updates. commit 9e940f8aad6f065ea1689e791b9a4e1fb7900c40 Author: Field G. Van Zee Date: Fri Sep 25 13:53:35 2020 -0500 Added 1m SISC bibtex to README.md. Details: - Added final citation info to 1m bibtex in README.md file. - Updated draft 1m paper link. - Changed some http to https. commit e293cae2d1b9067261f613f25eaa0e871356b317 Author: Field G. Van Zee Date: Tue Sep 15 16:09:11 2020 -0500 Implemented sgemmsup assembly kernels. Details: - Created a set of single-precision real millikernels and microkernels comparable to the dgemmsup kernels that already exist within BLIS. - Added prototypes for all kernels within bli_kernels_haswell.h. - Registered entry-point millikernels in bli_cntx_init_haswell.c and bli_cntx_init_zen.c. 
- Added sgemmsup support to the Makefile, runme.sh script, and source file in test/sup. This included edits that allow for separate "small" dimensions for single- and double-precision as well as for single- vs. multithreaded execution. commit 2765c6f37c11cb7f71cd4b81c64cea6130636c68 Author: Field G. Van Zee Date: Sat Sep 12 17:48:15 2020 -0500 Type saga continues; fixed sgemm ukernel signature. Details: - Changed double* pointers in sgemm function signature to float*. At this point I've lost track of whether this was my fault or another dormant bug like the one described in ece9f6a, but at this point I no longer care. It's one of those days (aka I didn't ask for this). commit 0779559509e0a1af077530d09ed151dac54f32ee Author: Field G. Van Zee Date: Sat Sep 12 17:37:21 2020 -0500 Fixed missing restrict in knl sgemm prototype. Details: - Added a missing 'restrict' qualifier in the sgemm ukernel prototype for knl. (Not sure how that code was ever compiling before now.) commit ece9f6a3ef1b26b53ecf968cd069df7a85b139fb Author: Field G. Van Zee Date: Sat Sep 12 17:22:42 2020 -0500 Fixed dormant type bugs in bli_kernels_knl.h. Details: - Fixed dormant type mismatches in the use of the prototype-generating macros in bli_kernels_knl.h. Specifically, some float prototypes were incorrectly using double as their ctype. This didn't actually matter until the type changes in 645d771, as previously those types were not used since packm was prototyped with void* pointers. commit 8ebb3b60e1c4c045ddb48e02de6e246cecde24a4 Author: Field G. Van Zee Date: Sat Sep 12 17:00:47 2020 -0500 Fixed accidental breakage in 645d771. Details: - In trying to clean up kappa_cast variables in the reference packm kernels, which I initially believed to be redundant given the other void* -> ctype* changes in 645d771, I accidentally ended up violating restrict semantics for 1e/1r packing and possibly other packm kernels.
(Normally, my pre-commit testsuite run would have caught this, but I was unknowingly using an edited input.operations file in which I'd disabled most tests as part of unrelated work.) This commit reverts the kappa_cast changes in 645d771. commit 645d771a14ae89aa7131d6f8f4f4a8090329d05e Author: Field G. Van Zee Date: Sat Sep 12 15:31:56 2020 -0500 Minor packm kernel type cleanup (void* -> ctype*). Details: - Changed all void* function arguments in reference packm kernels to those of the native type (ctype*). These pointers no longer need to be void* and are better represented by their native types anyway. (See below for details.) Updated knl packm kernels accordingly. - In the definition of the PACKM_KER_PROT prototype macro template in frame/1m/bli_l1m_ker_prot.h, changed the pointer types for kappa, a, and p from void* to ctype*. They were originally void* because these function signatures had to share the same type so they could all be stored in a single array of that shared type, from which they were queried and called by packm_cxk(). This is no longer how the function pointers are stored, and so it no longer makes sense to force the caller of packm kernels to use void*, only so that the implementor of the packm kernels can typecast back to the native datatype within the kernel definition. This change has no effect internally within BLIS because currently all packm kernels are called after querying the function addresses from the context and then typecasting to the appropriate function pointer type, which is based upon type-specific function pointers like float* and double*. - Removed a comment in frame/1m/bli_l1m_ft_ker.h that was outdated and misleading due to changes to the handling of packm kernels since moving them into the context. commit 54bf6c35542a297e25bc8efec6067a6df80536f4 Author: Field G. Van Zee Date: Thu Sep 10 15:42:01 2020 -0500 Minor README.md update. Details: - Added a new entry to the "What people are saying about BLIS" section. 
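Illustrative note (not part of the original commit log): the void* -> ctype* change described in 645d771 above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not BLIS code. The names pack_scale_void(), pack_scale_typed(), query_kernel(), void_fp, and dpack_ft are hypothetical and exist only for this illustration; the sketch assumes only what the commit message states, namely that a kernel address is queried in a generic form and typecast to a type-specific function pointer before being called.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "before" style: all pointers are void*, so the kernel
   has to typecast back to the native datatype internally. */
static void pack_scale_void(size_t n, const void* kappa, const void* a, void* p)
{
    const double* kappa_cast = kappa;
    const double* a_cast     = a;
    double*       p_cast     = p;

    assert(n == 0 || (a_cast != NULL && p_cast != NULL));

    for (size_t i = 0; i < n; ++i)
        p_cast[i] = (*kappa_cast) * a_cast[i];
}

/* Hypothetical "after" style: the signature uses the native type
   (ctype == double here), so no internal typecasts are needed. */
static void pack_scale_typed(size_t n,
                             const double* restrict kappa,
                             const double* restrict a,
                             double* restrict       p)
{
    for (size_t i = 0; i < n; ++i)
        p[i] = (*kappa) * a[i];
}

/* Generic function pointer type used to store kernel addresses, and the
   type-specific function pointer type the caller casts to before calling. */
typedef void (*void_fp)(void);
typedef void (*dpack_ft)(size_t,
                         const double* restrict,
                         const double* restrict,
                         double* restrict);

/* Hypothetical stand-in for querying a kernel address from a context. */
static void_fp query_kernel(void)
{
    return (void_fp)pack_scale_typed;
}
```

A caller would write something like `dpack_ft f = (dpack_ft)query_kernel();` and then invoke `f(n, &kappa, a, p)` with non-overlapping buffers, since the typed signature carries restrict qualifiers.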
commit e50b4d40462714ae33df284655a2faf7fa35f37c Author: Field G. Van Zee Date: Wed Sep 9 14:12:53 2020 -0500 Minor update to README.md (SIAM Best Paper Prize). commit a8efb72074691e2610372108becd88b4b392299e Merge: b0c4da17 97e87f2c Author: Devin Matthews Date: Mon Sep 7 16:18:19 2020 -0500 Merge pull request #434 from flame/intel-zdot Add an option to change the complex return type. commit 97e87f2c9f3878a05e1b7c6ec237ee88d9a72a42 Author: Field G. Van Zee Date: Mon Sep 7 15:56:42 2020 -0500 Whitespace/comment updates to #434 PR. commit b0c4da1732b6c6a9ff66f70c36e4722e0f9645ae Merge: 810e90ee b1b5870d Author: Devin Matthews Date: Mon Sep 7 15:47:54 2020 -0500 Merge pull request #436 from flame/s390x Add checks so that s390x is detected as 64-bit. commit 810e90ee806510c57504f0cf8eeaf608d38bd9dd Author: Field G. Van Zee Date: Tue Sep 1 16:11:40 2020 -0500 Minor README.md update. Details: - Added HPE to list of funders. - Changed http to https in funders' website links. commit 7d411282196e036991c26e52cb5e5f85769c8059 Author: Devin Matthews Date: Thu Aug 13 17:50:58 2020 -0500 Use -O2 for all framework code. (#435) It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes #341 and fixes #342. commit 9c5b485d356367b0a1288761cd623f52036e7344 Author: Dave Love Date: Fri Aug 7 20:11:18 2020 +0000 Don't override -mcpu with -march on ARM (#353) * Use -mcpu for ARM See the GCC doc about -march, -mtune, and -mcpu and maybe https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu * Fix typo in flags * Fix typo in cortexa9 flags * Modify cortexa53 compilation flags to fix failing BLAS check (#341) commit c253d14a72a746b670b3ffbb6e81bcafc73d1133 Author: Devin Matthews Date: Fri Aug 7 09:39:04 2020 -0500 Also handle Intel-style complex return in CBLAS interface.
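Illustrative note (not part of the original commit log): the two complex return conventions referred to in c253d14 above (and in 7fdc0fc, which introduced the --complex-return=gnu|intel option) can be sketched in C as follows. This is only a sketch under stated assumptions. The type zcomplex_t and the functions zdotu_gnu_style() and zdotu_intel_style() are hypothetical names invented for this note, and the argument lists are simplified (no pointer-passed lengths or strides as in the real Fortran-callable BLAS interface).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical double-precision complex type for this sketch. */
typedef struct { double real; double imag; } zcomplex_t;

/* GNU-style convention: the complex result is the function's return value. */
static zcomplex_t zdotu_gnu_style(int n, const zcomplex_t* x, const zcomplex_t* y)
{
    zcomplex_t rho = { 0.0, 0.0 };

    assert(n == 0 || (x != NULL && y != NULL));

    /* Unconjugated dot product: rho = sum_i x[i] * y[i]. */
    for (int i = 0; i < n; ++i)
    {
        rho.real += x[i].real * y[i].real - x[i].imag * y[i].imag;
        rho.imag += x[i].real * y[i].imag + x[i].imag * y[i].real;
    }
    return rho;
}

/* Intel-style convention: the result is written through an extra first
   parameter and the function itself returns nothing. */
static void zdotu_intel_style(zcomplex_t* rho, int n,
                              const zcomplex_t* x, const zcomplex_t* y)
{
    *rho = zdotu_gnu_style(n, x, y);
}
```

Because the two forms have different signatures for the same symbol name, a caller compiled for one convention cannot safely call a library built for the other, which is consistent with the note in 7fdc0fc that a single library cannot serve both GNU and Intel Fortran compilers.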
commit 5d653a11a0cc71305d0995507b1733995856f475 Author: Devin Matthews Date: Thu Aug 6 17:58:26 2020 -0500 Update Multithreading.md Addresses the issue raised in #426. commit b1b5870dd3f9b1c78cf5f58a53514d73f001fc4c Author: Devin Matthews Date: Thu Aug 6 17:34:20 2020 -0500 Add checks so that s390x is detected as 64-bit. commit 882dcb11bfc9ea50aa2f9044621833efd90d42be Author: Field G. Van Zee Date: Thu Aug 6 17:28:14 2020 -0500 Mention example code at top of documentation docs. Details: - Steer the reader towards the example code section of each documentation doc (object and typed). - Trivial update to examples/oapi/README, examples/tapi/README. commit f4894512e5bf56ff83701c07dd02972e300741a5 Author: Field G. Van Zee Date: Thu Aug 6 17:20:00 2020 -0500 Very minor updates to previous commit. commit adedb893ae8dfacd1dc54035979e15c44d589dbb Author: Field G. Van Zee Date: Thu Aug 6 17:14:01 2020 -0500 Documented mutator functions in BLISObjectAPI.md. Details: - Added documentation for commonly-used object mutator functions in BLISObjectAPI.md. Previously, only accessor functions were documented. Thanks to Jeff Diamond for pointing out this omission. - Explicitly set the 'diag' property of objects in oapi example modules (08level2.c and 09level3.c). commit 5b5278ff494888509543a79c09ea82089f6c95d9 Author: Devin Matthews Date: Thu Aug 6 14:19:37 2020 -0500 Use #ifdef instead of #if as macro may be undefined. commit 7fdc0fc893d0c6727b725ea842053b65be2c20ba Author: Devin Matthews Date: Thu Aug 6 14:03:55 2020 -0500 Add an option to change the complex return type. ifort apparently does not return complex numbers in registers as in C/C++ (or gfortran), but instead creates a "hidden" first parameter for the return value. The option --complex-return=gnu|intel has been added, as well as a guess based on a provided FC if not specified (otherwise default to gnu). 
This option affects the signatures of cdotc, cdotu, zdotc, and zdotu, and a single library cannot be used with both GNU and Intel Fortran compilers. Fixes #433. commit 6e522e5823b762d4be09b6acdca30faafba56758 Author: Field G. Van Zee Date: Thu Jul 30 19:31:37 2020 -0500 Mention disabling of sup in docs/Sandboxes.md. Details: - Added language to remind the reader to disable sup if the intended behavior is for the sandbox implementation to handle all problem sizes, even the smaller ones that would normally be handled by the sup code path. commit 00e14cb6d849e963a2e1ac35e7dbbe186af00a58 Author: Field G. Van Zee Date: Wed Jul 29 14:24:34 2020 -0500 Replaced use of bool_t type with C99 bool. Details: - Textually replaced nearly all non-comment instances of bool_t with the C99 bool type. A few remaining instances, such as those in the files bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being used not for boolean purposes but to index into an array. - This commit constitutes the third phase of a transition toward using C99's bool instead of bool_t, which was raised in issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit a69a4d7. The second phase, which redefined the bool_t typedef in terms of bool (from gint_t), was implemented by commit 2c554c2. commit 2c554c2fce885f965a425e727a0314d3ba66c06d Author: Field G. Van Zee Date: Fri Jul 24 15:57:19 2020 -0500 Redefined bool_t typedef in terms of C99 bool. 
Details: - Changed the typedef that defines bool_t from: typedef gint_t bool_t; where gint_t is a signed integer that forms the basis of most other integers in BLIS, to: typedef bool bool_t; - Changed BLIS's TRUE and FALSE macro definitions from being in terms of integer literals: #define TRUE 1 #define FALSE 0 to being in terms of C99 boolean constants: #define TRUE true #define FALSE false which are provided by stdbool.h. - This commit constitutes the second phase of a transition toward using C99's bool instead of bool_t, which will address issue #420. The first phase, which cleaned up various typecasts in preparation for using bool as the basis for bool_t (instead of gint_t), was implemented by commit a69a4d7. commit e01dd125581cec87f61e15590922de0dc938ec42 Author: Field G. Van Zee Date: Fri Jul 24 15:41:46 2020 -0500 Fail-safe updates to Makefiles in 'test' dir. Details: - Updated Makefiles in test, test/3, and test/sup so that running any of the usual targets without having first built BLIS results in a helpful error message. For example, if BLIS is not yet configured, make will output: Makefile:327: *** Cannot proceed: config.mk not detected! Run configure first. Stop. Similarly, if BLIS is configured but not yet built, make will output: Makefile:340: *** Cannot proceed: BLIS library not yet built! Run make first. Stop. In previous commits, these actions would result in a rather cryptic make error such as: make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x', needed by 'blis-nat-st'. Stop. commit b4f47f7540062da3463e2cb91083c12fdda0d30a Author: Devin Matthews Date: Fri Jul 24 13:56:13 2020 -0500 Add BLIS_EXPORT_BLIS to bli_abort. (#429) Fixes #428. commit a69a4d7e2f4607c919db30b14535234ce169c789 Author: Field G. Van Zee Date: Wed Jul 22 16:13:09 2020 -0500 Cleaned up bool_t usage and various typecasts. 
Details: - Fixed various typecasts in frame/base/bli_cntx.h frame/base/bli_mbool.h frame/base/bli_rntm.h frame/include/bli_misc_macro_defs.h frame/include/bli_obj_macro_defs.h frame/include/bli_param_macro_defs.h that were missing or being done improperly/incompletely. For example, many return values were being typecast as (bool_t)x && y rather than (bool_t)(x && y) Thankfully, none of these deficiencies had manifested as actual bugs at the time of this commit. - Changed the return type of bli_env_get_var() from dim_t to gint_t. This reflects the fact that bli_env_get_var() needs to be able to return a signed integer, and even though dim_t is currently defined as a signed integer, it does not intuitively appear to necessarily be signed by inspection (i.e., an integer named "dim_t" for matrix "dimension"). Also, updated use of bli_env_get_var() within bli_pack.c to reflect the changed return type. - Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t and added comments to the bli_thrcomm_*.h files that will explain a planned replacement of bool_t with C99's bool type. - Note: These changes are being made to facilitate the substitution of 'bool' for 'bool_t', which will eliminate the namespace conflict with arm_sve.h as reported in issue #420. This commit implements the first phase of that transition. Thanks to RuQing Xu for reporting this issue. - CREDITS file update. commit a6437a5c11d364c6c88af527294d29734d7cc7d6 Author: Field G. Van Zee Date: Mon Jul 20 19:21:07 2020 -0500 Replaced broken ref99 sandbox w/ simpler version. Details: - The 'ref99' sandbox was broken by multiple refactorings and internal API changes over the last two years. Rather than try to fix it, I've replaced it with a much simpler version based on var2 of gemmsup. Why not fix the previous implementation? It occurred to me that the old implementation was trying to be a lightly simplified duplication of what exists in the framework. 
Duplication aside, this sandbox would have worked fine if it had been completely independent of the framework code. The problem was that it was only partially independent, with many function calls calling a function in BLIS rather than a duplicated/simplified version within the sandbox. (And the reason I didn't make it fully independent to begin with was that it seemed unnecessarily duplicative at the time.) Maintaining two versions of the same implementation is problematic for obvious reasons, especially when it wasn't even done properly to begin with. This explains the reimplementation in this commit. The only catch is that the newer implementation is single-threaded only and does not perform any packing on either input matrix (A or B). Basically, it's only meant to be a simple placeholder that shows how you could plug in your own implementation. Thanks to Francisco Igual for reporting this brokenness. - Updated the three reference gemmsup kernels (defined in ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle conjugation of conja and/or conjb. The general storage kernel, which is currently identical to the column-storage kernel, is used in the new ref99 sandbox to provide basic support for all datatypes (including scomplex and dcomplex). - Minor updates to docs/Sandboxes.md, including adding the threading and packing limitations to the Caveats section. - Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new sandbox implementation is based). commit bca040be9da542dd9c75d91890fa7731841d733d Merge: 2605eb4d 171ecc1d Author: Devin Matthews Date: Mon Jul 20 09:27:30 2020 -0500 Merge pull request #425 from gmargari/patch-1 Update Multithreading.md commit 171ecc1dc6f055ea39da30e508f711b49a734359 Author: Giorgos Margaritis Date: Mon Jul 20 12:24:06 2020 +0300 Update Multithreading.md commit 2605eb4d99d3813c37a624c011aa2459324a6d89 Author: Field G. Van Zee Date: Wed Jul 15 15:25:19 2020 -0500 Added missing rv_d?x6 edge cases to sup kernel. 
Details: - Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling various n = 6 edge cases with a single sup kernel call. Previously, only n = {4,2,1} were handled explicitly as single kernel calls; that is, cases where n = 6 were previously being executed via two kernel calls (n = 4 and n = 2). - Added commented debug line to testsuite's test_libblis.c. commit 72f6ed0637dfcb021de04ac7d214d5c87e55d799 Author: Field G. Van Zee Date: Fri Jul 3 17:55:54 2020 -0500 Declare/define static functions via BLIS_INLINE. Details: - Updated all static function definitions to use the cpp macro BLIS_INLINE instead of the static keyword. This allows blis.h to use a different keyword (inline) to define these functions when compiling with C++, which might otherwise trigger "defined but not used" warning messages. Thanks to Giorgos Margaritis for reporting this issue and Devin Matthews for suggesting the fix. - Updated the following files, which are used by configure's hardware auto-detection facility, to unconditionally #define BLIS_INLINE to the static keyword (since we know BLIS will be compiled with C, not C++): build/detect/config/config_detect.c frame/base/bli_arch.c frame/base/bli_cpuid.c - CREDITS file update. commit 5fc701ac5f94c6300febbb2f24e731aa34f0f34a Author: Field G. Van Zee Date: Wed Jul 1 15:48:58 2020 -0500 Added -fomit-frame-pointer option to CKOPTFLAGS. Details: - Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS variable in the following make_defs.mk files: config/haswell/make_defs.mk config/skx/make_defs.mk as well as comments that mention why the compiler option is needed. This option is needed to prevent the compiler from using the rbp frame register (in the very early portion of kernel code, typically where k_iter and k_left are defined and computed), which, as of 1c719c9, is used explicitly by the gemmsup millikernels. 
Thanks to Devin Matthews for identifying this missing option and to Jeff Diamond for reporting the original bug in #417. - The file config/zen/amd_config.mk, which feeds into the make_defs.mk for both zen and zen2 subconfigs, was also touched, but only to add a commented-out compiler option (and the aforementioned explanatory comment) since that file already uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of CKOPTFLAGS. commit 6af59b705782dada47e45df6634b479fe781d4fe Author: Field G. Van Zee Date: Wed Jul 1 14:54:23 2020 -0500 Fixed disabled edge case optimization in gemmsup. Details: - Fixed an inadvertently disabled edge case optimization in the two gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case optimizations allow the last millikernel operation in the jr loop to be executed with an inflated register blocksize if it is the last (or only) iteration. For example, suppose mr=6 and nr=8 and the gemmsup problem is m=8, n=100, k=100. (In this case, the panel-block variant (var1n) is executed, which places the jr loop in the m dimension.) In principle, this problem could be executed as two millikernels: one with dimensions 6x100x100, and one as 2x100x100. However, with the support for inflated blocksizes in the kernel, the entire 8x100x100 problem can be passed to the millikernel function, which will then execute it more favorably as two 4x100x100 millikernel sub-calls. Now, this optimization is disabled under certain circumstances, such as when multithreading. Previously, the is_mt predicate was being set incorrectly such that it was non-zero even when running single-threaded. - Upon fixing the is_mt issue above, another bit of code needed to be moved so that the result of the optimization could have an impact on the assignment of loop bounds ranges to threads. commit b37634540fab0f9b8d4751b8356ee2e17c9e3b00 Author: Field G. Van Zee Date: Thu Jun 25 16:05:12 2020 -0500 Support ldims, packing in sup/test drivers.
Details: - Updated the test/sup source file (test_gemm.c) and Makefile to support building matrices with small or large leading dimensions, and updated runme.sh to support executing both kinds of test drivers. - Updated runme.sh to allow for executing sup drivers with unpacked (the default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B environment variables), and for capturing output to files that encode both the leading dimension (small or large) and packing status into the filenames. - Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt into test/sup/octave and updated the octave code in that consolidated directory to read the new output filename format (encoding ldim and packing). Also added comments and streamlined code, particularly in plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0. - Moved old octave_st, octave_mt directories to test/sup/old. commit ceb9b95a96cc3844ecb43d9af48ab289584e76b6 Author: Field G. Van Zee Date: Thu Jun 18 17:15:25 2020 -0500 Fixed incorrect link to shiftd in BLISTypedAPI.md. Details: - Previously, the entry for shiftd in the Operation index section of BLISTypedAPI.md was incorrectly linking to the shiftd operation entry in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for helping find this incorrect link. commit b3c42016818797f79e55b32c8b7d090f9d0aa0ea Author: Field G. Van Zee Date: Thu Jun 18 14:00:56 2020 -0500 CREDITS file update. commit 31af73c11abae03248d959da0f81eacea015b57a Author: Isuru Fernando Date: Thu Jun 18 13:35:54 2020 -0500 Expand windows instructions (#414) * Expand windows instructions * Windows: both static and shared don't work at the same time commit b5b604e106076028279e6d94dc0e51b8ad48e802 Author: Field G. Van Zee Date: Wed Jun 17 16:42:24 2020 -0500 Ensure random objects' 1-norms are non-zero. 
Details: - Fixed an innocuous bug that manifested when running the testsuite on extremely small matrices with randomization via the "powers of 2 in narrow precision range" option enabled. When the randomization function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will then compute 0.0/0.0 during the normalization process, which leads to NaN residuals. The solution entails smarter implementations of randv, randnv, randm, and randnm, each of which will compute the 1-norm of the vector or matrix in question. If the object has a 1-norm of 0.0, the object is re-randomized until the 1-norm is not 0.0. Thanks to Kiran Varaganti for reporting this issue (#413). - Updated the implementation of randm_unb_var1() so that it loops over a call to the randv_unb_var1() implementation directly rather than calling it indirectly via randv(). This was done to avoid the overhead of multiple calls to norm1v() when randomizing the rows/columns of a matrix. - Updated comments. commit 35e38fb693e7cbf2f3d7e0505a63b2c05d3f158d Author: Isuru Fernando Date: Tue Jun 16 10:59:41 2020 -0500 FIx typo in FAQ commit 1c719c91a3ef0be29a918097652beef35647d4b2 Author: Field G. Van Zee Date: Thu Jun 4 17:21:08 2020 -0500 Bugfixes, cleanup of sup dgemm ukernels. Details: - Fixed a few not-really-bugs: - Previously, the d6x8m kernels were still prefetching the next upanel of A using MR*rs_a instead of ps_a (same for prefetching of next upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given that the upanels might be packed, using ps_a or ps_b is the correct way to compute the prefetch address. - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck, executed as intended even though it was based on faulty pointer management.
Basically, in the rd_d6x8m kernel, the pointer for B (stored in rdx) was loaded only once, outside of the jj loop, and in the second iteration its new position was calculated by incrementing rdx by the *absolute* offset (four columns), which happened to be the same as the relative offset (also four columns) that was needed. It worked only because that loop only executed twice. A similar issue was fixed in the rd_d6x8n kernels. - Various cleanups and additions, including: - Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so that it is loaded only once outside of the loops rather than multiple times inside the loops. - Changed outer loop in rd kernels so that the jump/comparison and loop bounds more closely mimic what you'd see in higher-level source code. That is, something like: for( i = 0; i < 6; i+=3 ) rather than something like: for( i = 0; i <= 3; i+=3 ) - Switched row-based IO to use byte offsets instead of byte column strides (e.g. via rsi register), which were known to be 8 anyway since otherwise that conditional branch wouldn't have executed. - Cleaned up and homogenized prefetching a bit. - Updated the comments that show the before and after of the in-register transpositions. - Added comments to column-based IO cases to indicate which columns are being accessed/updated. - Added rbp register to clobber lists. - Removed some dead (commented out) code. - Fixed some copy-paste typos in comments in the rv_6x8n kernels. - Cleaned up whitespace (including leading ws -> tabs). - Moved edge case (non-milli) kernels to their own directory, d6x8, and split them into separate files based on the "NR" value of the kernels (Mx8, Mx4, Mx2, etc.). - Moved config-specific reference Mx1 kernels into their own file (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory. - Added rd_dMx1 assembly kernels, which seem marginally faster than the corresponding reference kernels.
- Updated comments in ref_kernels/bli_cntx_ref.c and changed to using the row-oriented reference kernels for all storage combos. commit 943a21def0bedc1732c0a2453afe7c90d7f62e95 Author: Isuru Fernando Date: Thu May 21 14:09:21 2020 -0500 Add build instructions for Windows (#404) commit fbef422f0d968df10e598668b427af230cfe07e8 Author: Field G. Van Zee Date: Thu May 21 10:30:41 2020 -0500 Separate OS X and Windows into separate FAQs. Details: - Separated the unified Mac OS X / Windows frequently asked question into two separate questions, one for each OS. commit 28be1a4265ea67e3f177c391aba3dbbcf840bd52 Author: Guodong Xu Date: Thu May 21 02:22:22 2020 +0800 avoid loading twice in armv8a gemm kernel (#403) This bug happens at a corner case, when k_iter == 0 and we jump to CONSIDERKLEFT. In current design, first row/col. of a and b are loaded twice. The fix is to rearrange a and b (first row/col.) loading instructions. Signed-off-by: Guodong Xu commit d51245e58b0beff2717156b980007c90337150d8 Author: Field G. Van Zee Date: Fri May 8 18:00:54 2020 -0500 Add support for Intel oneAPI in configure. Details: - Properly select cc_vendor based on the output of invoking CC with the --version option, including cases where CC is the variant of clang that is included with Intel oneAPI. (However, we continue to treat the compiler as clang for other purposes, not icc.) Thanks to Ajay Panyala and Devin Matthews for reporting on this issue via #402. commit 787adad73bd5eb65c12c39d732723a1ac0448748 Author: Field G. Van Zee Date: Fri May 8 16:18:20 2020 -0500 Defined netlib equivalent of xerbla_array(). Details: - Added a function definition for xerbla_array_(), which largely mirrors its netlib implementation. Thanks to Isuru Fernando for suggesting the addition of this function. commit c53b5153bee585685bf95ce22e058a7af72ecef0 Author: Field G. Van Zee Date: Tue May 5 12:39:12 2020 -0500 Documented Perl prerequisite for build system. 
Details: - Added Perl to list of prerequisites for building BLIS. This is in part (and perhaps completely?) due to some substitution commands used at the end of configure that include '\n' characters that are not properly interpreted by the version of sed included on some versions of OS X. This new documentation addresses issue #398. commit f032d5d4a6ed34c8c3e5ba1ed0b14d1956d0097c Author: Guodong Xu Date: Thu Apr 30 01:08:46 2020 +0800 New kernel set for Arm SVE using assembly (#396) This adds two kernels for Arm SVE vector extensions. 1. a gemm kernel for double at sizes 8x8. 2. a packm kernel for double at dimension 8xk. To achieve best performance, variable length agnostic programming is not used. Vector length (VL) of 256 bits is mandated in both kernels. Kernels to support other VLs can be added later. "SVE is a vector extension for AArch64 execution mode for the A64 instruction set of the Armv8 architecture. Unlike other SIMD architectures, SVE does not define the size of the vector registers, but constrains into a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units. Therefore, any CPU vendor can implement the extension by choosing the vector register size that better suits the workloads the CPU is targeting. Instructions are provided specifically to query an implementation for its register size, to guarantee that the applications can run on different implementations of the ISA without the need to recompile the code." [1] [1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning Signed-off-by: Guodong Xu commit 4d87eb24e8e1f5a21e04586f6df4f427bae0091b Author: Yingbo Ma Date: Mon Apr 27 17:02:47 2020 -0400 Update KernelsHowTo.md (#395) commit 477ce91c5281df2bbfaddc4d86312fb8c8f879e2 Author: Field G. Van Zee Date: Wed Apr 22 14:26:49 2020 -0500 Moved #include "cpuid.h" to bli_cpuid.c.
Details: - Relocated the #include "cpuid.h" directive from bli_cpuid.h to bli_cpuid.c. This was done because cpuid.h (which is pulled into the post-build blis.h developer header) doesn't protect its definitions with a preprocessor guard of the form: #ifndef FOOBAR_H #define FOOBAR_H // header contents. #endif and as a result, applications (previously) could not #include both blis.h and cpuid.h (since the former was already including the latter). Thanks to Bhaskar Nallani for raising this issue via #393 and to Devin Matthews for suggesting this fix. - CREDITS file update. commit 8bde63ffd7474a97c3a3b0b0dc1eae45be0ab889 Author: Field G. Van Zee Date: Sat Apr 18 12:50:12 2020 -0500 Adding missing conjy to her2/syr2 in typed API doc. Details: - Fixed a missing argument (conjy) in the function signatures of bli_?her2() and bli_?syr2() in docs/BLISTypedAPI.md. Thanks to Robert van de Geijn for reporting this omission. commit 976902406b610afdbacb2d80a7a2b4b43ff30321 Author: Field G. Van Zee Date: Fri Apr 17 15:11:10 2020 -0500 Disable packing by default in expert rntm_t init. Details: - Changed the behavior of bli_rntm_init() as well as the static initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t objects by default specify the disabling of packing for A and B. Packing of A/B was already disabled by default when calling non-expert APIs (and enabled only when the user set environment variables BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of using user-initialized rntm_t objects with expert APIs comes into line with the default behavior of non-expert APIs--that is, they now both lead to the avoidance of packing in the sup code path. (Note: The conventional code path is unaffected by the environment variables BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t object when calling an expert API.) This addresses issue #392. Thanks to Kiran Varaganti for bringing this inconsistency to our attention. 
- The above change was accomplished by changing the definitions of static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b() in bli_rntm.h, which are both for internal use only. commit 5f2aee7c5fa5d562acaf8fbde3df0e2a04e1dd1b Author: Field G. Van Zee Date: Tue Apr 7 14:55:15 2020 -0500 README.md update to promote supmt dgemm. Details: - Updated the sup entry in the "What's New" section of the README.md file to promote the multithreaded dgemm sup feature introduced in c0558fd. commit f5923cd9ff5fbd91190277dea8e52027174a1d57 Author: Field G. Van Zee Date: Tue Apr 7 14:41:45 2020 -0500 CHANGELOG update (0.7.0) commit 68b88aca6692c75a9f686187e6c4a4e196ae60a9 (tag: 0.7.0) Author: Field G. Van Zee Date: Tue Apr 7 14:41:44 2020 -0500 Version file update (0.7.0) commit b04de636c1702e4cb8e7ad82bab3cf43d2dbdfc6 Author: Field G. Van Zee Date: Tue Apr 7 14:37:43 2020 -0500 ReleaseNotes.md update in advance of next version. Details: - Updated docs/ReleaseNotes.md in preparation for next version. commit 2cb604ba472049ad498df72d4a2dc47a161d4c3c Author: Field G. Van Zee Date: Mon Apr 6 16:42:14 2020 -0500 Rename more bli_thread_obarrier(), _obroadcast(). Details: - Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast() that were made in the supmt-specific code committed to the 'amd' branch, which has now been merged with 'master'. Prior to the merge, 'master' received commit c01d249, which applied these renamings to the existing, non-sup codebase. commit efb12bc895de451067649d5dceb059b7827a025f Author: Field G. Van Zee Date: Mon Apr 6 15:01:53 2020 -0500 Minor updates/elaborations to RELEASING file. commit 2e3b3782cfb7a2fd0d1a325844983639756def7d Merge: 9f3a8d4d da0c086f Author: Field G.
Van Zee Date: Mon Apr 6 14:55:35 2020 -0500 Merge branch 'master' into amd commit da0c086f4643772e111318f95a712831b0f981a8 Author: Satish Balay Date: Tue Mar 31 17:09:41 2020 -0500 OSX: specify the full path to the location of libblis.dylib (#390) * OSX: specify the full path to the location of libblis.dylib so that it can be found at runtime Before this change: Application gives runtime error [when linked with blis] dyld: Library not loaded: libblis.3.dylib balay@kpro lib % otool -L libblis.dylib libblis.dylib: libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0) /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0) After this change: balay@kpro lib % otool -L libblis.dylib libblis.dylib: /Users/balay/petsc/arch-darwin-c-debug/lib/libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0) /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0) * INSTALL_LIBDIR -> libdir as INSTALL_LIBDIR has DESTDIR Co-Authored-By: Jed Brown * CREDITS file update. Co-authored-by: Jed Brown Co-authored-by: Field G. Van Zee commit 2bca03ea9d87c0da829031a5332545d05e352211 Author: Field G. Van Zee Date: Sat Mar 28 22:10:00 2020 +0000 Updates, tweaks to runme.sh in test/1m4m. Details: - Made several updates to test/1m4m/runme.sh, including: - Added missing handling for 1m and 4m1a implementations when setting the BLIS_??_NT environment variables. - Added support for using numactl to run the test executables. - Several other cleanups. commit c40a33190b94af5d5c201be63366594859b1233f Author: Field G. Van Zee Date: Thu Mar 26 16:55:00 2020 -0500 Warn user when auto-detection returns 'generic'. Details: - Added logic to configure that causes the script to output a warning to the user if/when "./configure auto" is run and the underlying hardware feature detection code is unable to identify the hardware.
In these cases, the auto-detect code will return 'generic', which is likely not what the user expected, and a flag will be set so that a message is printed at the end of the configure output. (Thankfully, we don't expect this scenario to play out very often.) Thanks to Devin Matthews for suggesting this fix #384. commit 492a736fab5b9c882996ca024b64646877f22a89 Author: Devin Matthews Date: Tue Mar 24 17:28:47 2020 -0500 Fix vectorized version of bli_amaxv (#382) * Fix vectorized version of bli_amaxv To match Netlib, i?amax should return: - the lowest index among equal values - the first NaN if one is encountered * Fix typos. * And another one... * Update ref. amaxv kernel too. * Re-enabled optimized amaxv kernels. Details: - Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen' kernel set for use in haswell, zen, zen2, knl, and skx subconfigs. These two kernels (for s and d datatypes) were temporarily disabled in e186d71 as part of issue #380. However, the key missing semantic properties that prompted the disabling of these kernels--returning the index of the *first* rather than of the last element with largest absolute value, and returning the index of the first NaN if one is encountered--were added as part of #382 thanks to Devin Matthews. Thus, now that the kernels are working as expected once more, this commit causes these kernels to once again be registered for the affected subconfigs, which effectively reverts all code changes included in e186d71. - Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c. Co-authored-by: Field G. Van Zee commit e186d7141a51f2d7196c580e24e7b7db8f209db9 Author: Field G. Van Zee Date: Sat Mar 21 18:40:36 2020 -0500 Disabled optimized amaxv kernels. Details: - Disabled use of optimized amaxv kernels, which use vector intrinsics for both 's' and 'd' datatypes. 
We disable these kernels because the current implementations fail to observe a semantic property of the BLAS i?amax_() subroutine, which is to return the index of the *first* element containing the maximum absolute value (that is, the first element if there exist two or more elements that contain the same value). With the optimized kernels disabled, the affected subconfigurations (haswell, zen, zen2, knl, and skx) will use the default reference implementations. Thanks to Mat Cross for reporting this issue via #380. - CREDITS file update. commit 9f3a8d4d851725436b617297231a417aa9ce8c6a Author: Field G. Van Zee Date: Sat Mar 14 17:48:43 2020 -0500 Added missing return to bli_thread_partition_2x2(). Details: - Added a missing return statement to the body of an early case handling branch in bli_thread_partition_2x2(). This bug only affected cases where n_threads < 4, and even then, the code meant to handle cases where n_threads >= 4 executes and does the right thing, albeit using more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti for reporting this bug via issue #377. - Whitespace changes to bli_thread.c (spaces -> tabs). commit 8c3d9b9eeb6f816ec8c32a944f632a5ad3637593 Merge: 71249fe8 0f9e0399 Author: Field G. Van Zee Date: Tue Mar 10 14:03:33 2020 -0500 Merge branch 'amd' of github.com:flame/blis into amd commit 71249fe8ddaa772616698f1e3814d40e012909ea Author: Field G. Van Zee Date: Tue Mar 10 13:55:29 2020 -0500 Merged test/sup, test/supmt into test/sup. Details: - Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able to compile and run both single-threaded and multithreaded experiments. This should help with maintenance going forward. - Created a test/sup/octave_st directory of scripts (based on the previous test/sup/octave scripts) as well as a test/sup/octave_mt directory (based on the previous test/supmt/octave scripts). 
The octave scripts are slightly different and not easily mergeable, and thus for now I'll maintain them separately. - Preserved the previous test/sup directory as test/sup/old/supst and the previous test/supmt directory as test/sup/old/supmt. commit 0f9e0399e16e96da2620faf2c0c3c21274bb2ebd Author: Field G. Van Zee Date: Thu Mar 5 17:03:21 2020 -0600 Updated sup performance graphs; added mt results. Details: - Reran all existing single-threaded performance experiments comparing BLIS sup to other implementations (including the conventional code path within BLIS), using the latest versions (where appropriate). - Added multithreaded results for the three existing hardware types showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc (Zen1). - Various minor updates to the text in docs/PerformanceSmall.md. - Updates to the octave scripts in test/sup/octave, test/supmt/octave. commit 90db88e5729732628c1f3acc96eeefab49f2da41 Author: Field G. Van Zee Date: Mon Mar 2 15:06:48 2020 -0600 Updated sup[mt] Makefiles for variable dim ranges. Details: - Updated test/sup/Makefile and test/supmt/Makefile to allow specifying different problem size ranges for the drivers where one, two, or three matrix dimensions is large. This will facilitate the generation of more meaningful graphs, particularly when two dimensions are tiny. commit 31f11a06ea9501724feec0d2fc5e4644d7dd34fc Author: Field G. Van Zee Date: Thu Feb 27 14:33:20 2020 -0600 Updates to octave scripts in test/sup[mt]/octave. Details: - Optimized scripts in test/sup/octave and test/supmt/octave for use with octave 5.2.0 on Ubuntu 18.04. - Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m, which were not only unnecessary but also causing issues with versions 5.x. commit c01d249d7c546fe2e3cee3fe071cd4c4c88b9115 Author: Field G. Van Zee Date: Tue Feb 25 14:50:53 2020 -0600 Renamed bli_thread_obarrier(), _obroadcast(). 
Details: - Renamed two bli_thread_*() APIs: bli_thread_obarrier() -> bli_thread_barrier() bli_thread_obroadcast() -> bli_thread_broadcast() The 'o' was a leftover from when thrcomm_t objects tracked both "inner" and "outer" communicators. They have long since been simplified to only support the latter, and thus the 'o' is superfluous. commit f6e6bf73e695226c8b23fe7900da0e0ef37030c1 Author: Field G. Van Zee Date: Mon Feb 24 17:52:23 2020 -0600 List Gentoo under supported external packages. Details: - Add mention of Gentoo Linux under the list of external packages in the README.md file. Thanks to M. Zhou for maintaining this package. commit 9e5f7296ccf9b3f7b7041fe1df20b927cd0e914b Author: Field G. Van Zee Date: Tue Feb 18 15:16:03 2020 -0600 Skip building thrinfo_t tree when mt is disabled. Details: - Return early from bli_thrinfo_sup_grow() if the thrinfo_t object address is equal to either &BLIS_GEMM_SINGLE_THREADED or &BLIS_PACKM_SINGLE_THREADED. - Added preprocessor logic to bli_l3_sup_thread_decorator() in bli_l3_sup_decor_single.c that (by default) disables code that creates and frees the thrinfo_t tree and instead passes &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the sup implementation. - The net effect of the above changes is that a small amount of thrinfo_t overhead is avoided when running small/skinny dgemm problems when BLIS is compiled with multithreading disabled. commit 90081e6a64b5ccea9211bdef193c2d332c68492f Author: Field G. Van Zee Date: Mon Feb 17 14:57:25 2020 -0600 Fixed bug(s) in mt sup when single-threaded. Details: - Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of changing function interface for the thread entry point function (of type l3supint_t). - Unfortunately, fixing the interface was not enough, as it caused a memory leak in the sba at bli_finalize() time. 
It turns out that, due to the new multithreading-capable variant code using thrinfo_t objects--specifically, their calling of bli_thrinfo_grow()--we have to pass in a real thrinfo_t object rather than the global objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED. Thus, I inserted the appropriate logic from the OpenMP and pthreads versions so that single-threaded execution would work as intended with the newly upgraded variants. commit c0558fde4511557c8f08867b035ee57dd2669dc6 Author: Field G. Van Zee Date: Mon Feb 17 14:08:08 2020 -0600 Support multithreading within the sup framework. Details: - Added multithreading support to the sup framework (via either OpenMP or pthreads). Both variants 1n and 2m now have the appropriate threading infrastructure, including data partitioning logic, to parallelize computation. This support handles all four combinations of packing on matrices A and B (neither, A only, B only, or both). This implementation tries to be a little smarter when automatic threading is requested (e.g. via BLIS_NUM_THREADS) in that it will recalculate the factorization in units of micropanels (rather than using the raw dimensions) in bli_l3_sup_int.c, when the final problem shape is known and after threads have already been spawned. - Implemented bli_?packm_sup_var2(), which packs to conventional row- or column-stored matrices. (This is used for the rrc and crc storage cases.) Previously, copym was used, but that would no longer suffice because it could not be parallelized. - Minor reorganization of packing-related sup functions. Specifically, bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]() instead of from the variant functions. This has the effect of making the variant functions more readable. - Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h and inserted usage of these functions within bli_thrinfo_init(), which previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2(). - Added an auto_factor field to the rntm_t struct in order to track whether automatic thread factorization was originally requested. - Added new test drivers in test/supmt that perform multithreaded sup tests, as well as appropriate octave/matlab scripts to plot the resulting output files. - Added additional language to docs/Multithreading.md to make it clear that specifying any BLIS_*_NT variable, even if it is set to 1, will be considered manual specification for the purposes of determining whether to auto-factorize via BLIS_NUM_THREADS. - Minor comment updates. commit d7a7679182d72a7eaecef4cd9b9a103ee0a7b42b Author: Field G. Van Zee Date: Fri Feb 7 17:37:03 2020 -0600 Fixed int-to-packbuf_t conversion error (C++ only). Details: - Fixed an error that manifests only when using C++ (specifically, modern versions of g++) to compile drivers in 'test' (and likely most other application code that #includes blis.h). Thanks to Ajay Panyala for reporting this issue (#374). commit d626112b8d5302f9585fb37a8e37849747a2a317 Author: Field G. Van Zee Date: Wed Jan 15 13:27:02 2020 -0600 Removed sorting on LDFLAGS in common.mk (#373). Details: - Removed a line of code in common.mk that passed LDFLAGS through the sort function. The purpose was not to sort the contents, but rather to remove duplicates. However, there is valid syntax in a string of linker flags that, when sorted, yields different/broken behavior. So I've removed the line in common.mk that sorts LDFLAGS. Also, for future use, I've added a new function, rm-dupls, that removes duplicates without sorting. (This function was based on code from a stackoverflow thread that is linked to in the comments for that code.) Thanks to Isuru Fernando for reporting this issue (#373). commit e67deb22aaeab5ed6794364520190936748ef272 Author: Field G.
Van Zee Date: Tue Jan 14 16:01:34 2020 -0600 CHANGELOG update (0.6.1) commit 10949f528c5ffc5c3a2cad47fe16a802afb021be (tag: 0.6.1) Author: Field G. Van Zee Date: Tue Jan 14 16:01:33 2020 -0600 Version file update (0.6.1) commit 5db8e710a2baff121cba9c63b61ca254a2ec097a Author: Field G. Van Zee Date: Tue Jan 14 15:59:59 2020 -0600 ReleaseNotes.md update in advance of next version. Details: - Updated ReleaseNotes.md in preparation for next version. commit cde4d9d7a26eb51dcc5a59943361dfb8fda45dea Author: Field G. Van Zee Date: Tue Jan 14 15:19:25 2020 -0600 Removed 'attic/windows' (to prevent confusion). Details: - Finally removed 'attic/windows' and its contents. This directory once contained "proto" Windows support for BLIS, but we've since moved on to (thanks to Isuru Fernando) providing Windows DLL support via AppVeyor's build artifacts. Furthermore, since 'windows' was the only subdirectory within 'attic', the directory path would show up in GitHub's listing at https://github.com/flame/blis, which probably led to someone being confused about how BLIS provides Windows support. I assume (but don't know for sure) that nobody is using these files, so this is admittedly a case of shoot first and ask questions later. commit 7d3407d4681c6449f4bbb8ec681983700ab968f3 Author: Field G. Van Zee Date: Tue Jan 14 15:17:53 2020 -0600 CREDITS file update. commit f391b3e2e7d11a37300d4c8d3f6a584022a599f5 Author: Dave Love Date: Mon Jan 6 20:15:48 2020 +0000 Fix parsing in vpu_count on workstation SKX (#351) * Fix parsing in vpu_count on workstation SKX * Document Skylake-X as Haswell for single FMA * Update vpu_count for Skylake and Cascade Lake models * Support printing the configuration selected, controlled by the environment Intended particularly for diagnosing mis-selection of SKX through unknown, or incorrect, number of VPUs. * Move bli_log outside the cpp condition, and use it where intended * Add Fixme comment (Skylake D) * Mostly superficial edits to commits towards #351. 
Details: - Moved architecture/sub-config logging-related code from bli_cpuid.c to bli_arch.c, tweaked names, and added more set/get layering. - Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c. - Content, whitespace changes to new bullet in HardwareSupport.md that relates to single-VPU Skylake-Xs. * Fix comment typos Co-authored-by: Field G. Van Zee commit 5ca1a3cfc1c1cc4dd9da6a67aa072ed90f07e867 Author: Field G. Van Zee Date: Mon Jan 6 12:29:12 2020 -0600 Fixed 'configure' breakage introduced in 6433831. Details: - Added a missing 'fi' (endif) keyword to a conditional block added in the configure script in commit 6433831. commit e7431b4a834ef4f165c143f288585ce8e2272a23 Author: Field G. Van Zee Date: Mon Jan 6 12:01:41 2020 -0600 Updated 1m draft article link in README.md. commit 6433831cc3988ad205637ebdebcd6d8f7cfcf148 Author: Jeff Hammond Date: Fri Jan 3 17:52:49 2020 -0800 blacklist ICC 18 for knl/skx due to test failures Signed-off-by: Jeff Hammond commit af3589f1f98781e3a94a8f9cea8d5ea6f155f7d2 Author: Jeff Hammond Date: Fri Jan 3 13:23:24 2020 -0800 blacklist Intel 19+ Signed-off-by: Jeff Hammond commit 60de939debafb233e57fd4e804ef21b6de198caf Author: Jeff Hammond Date: Wed Jan 1 21:30:38 2020 -0800 fix link to docs the comment contains an incorrect link, which is trivially fixed here. @fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything. commit 52711073789b6b84eb99bb0d6883f457ed3fcf80 Author: Field G. Van Zee Date: Mon Dec 16 16:30:26 2019 -0600 Fixed bugs in cblas_sdsdot(), sdsdot_(). Details: - Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar, named 'sb'. This value was already being added by the underlying sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub(). Thanks to Simon Lukas Märtens for reporting this bug via #367. - Fixed a second bug in order of typecasting intermediate products in sdsdot_(). 
Previously, the "alpha" scalar was being added after the "outer" typecast to float. However, the operation is supposed to first add the dot product to the (promoted) scalar and THEN downcast the sum to float. Thanks to Devin Matthews for catching this bug. commit fe2560a4b1d8ef8d0a446df6002b1e7decc826e9 Author: Field G. Van Zee Date: Fri Dec 6 17:12:44 2019 -0600 Annoted missing thread-related symbols for export. Details: - Added BLIS_EXPORT_BLIS annotation to function prototypes for bli_thrcomm_bcast() bli_thrcomm_barrier() bli_thread_range_sub() so that these functions are exported to shared libraries by default. This (hopefully) fixes issue #366. Thanks to Kyungmin Lee for reporting this bug. - CREDITS file update. commit 2853825234001af8f175ad47cef5d6ff9b7a5982 Merge: efa61a6c 61b1f0b0 Author: Field G. Van Zee Date: Fri Dec 6 16:06:46 2019 -0600 Merge branch 'master' into amd commit 61b1f0b0602faa978d9912fe58c6c952a33af0ac Author: Nicholai Tukanov Date: Wed Dec 4 14:18:47 2019 -0600 Add prototypes for POWER9 reference kernels (#365) Updates and fixes to power9 subconfig. Details: - Register s,c,z reference gemm and trsm ukernels that assume elements of B have been broadcast. - Added prototypes for level-3 ukernels that assume elements of B have been broadcast. Also added prototype for an spackm function that employs a duplication/broadcast factor of 4. - Register virtual gemmtrsm ukernels that work with broadcasting of B. - Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h. - Thanks to Nicholai Tukanov for providing these updates. commit efa61a6c8b1cfa48781fc2e4799ff32e1b7f8f77 Author: Field G. Van Zee Date: Fri Nov 29 16:17:04 2019 -0600 Added missing bli_l3_sup_thread_decorator() symbol. Details: - Defined dummy versions of bli_l3_sup_thread_decorator() for Openmp and pthreads so that those builds don't fail when performing shared library linking (especially for Windows DLLs via AppVeyor). 
For now, these dummy implementations of bli_l3_sup_thread_decorator() are merely carbon-copies of the implementation provided for single-threaded execution (i.e., the one found in bli_l3_sup_decor_single.c). Thus, an OpenMP or pthreads build will be able to use the gemmsup code (including the new selective packing functionality), as it did before 39fa7136, even though it will not actually employ any multithreaded parallelism. commit 39fa7136f4a4e55ccd9796fb79ad5f121b872ad9 Author: Field G. Van Zee Date: Fri Nov 29 15:27:07 2019 -0600 Added support for selective packing to gemmsup. Details: - Implemented optional packing for A or B (or both) within the sup framework (which currently only supports gemm). The request for packing either matrix A or matrix B can be made via setting environment variables BLIS_PACK_A or BLIS_PACK_B (to any non-zero value; if set, zero means "disable packing"). It can also be made globally at runtime via bli_pack_set_pack_a() and bli_pack_set_pack_b() or with individual rntm_t objects via bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert interface of either the BLIS typed or object APIs. (If using the BLAS API, environment variables are the only way to communicate the packing request.) - One caveat (for now) with the current implementation of selective packing is that any blocksize extension registered in the _cntx_init function (such as is currently used by haswell and zen subconfigs) will be ignored if the affected matrix is packed. The reason is simply that I didn't get around to implementing the necessary logic to pack a larger edge-case micropanel, though this is entirely possible and should be done in the future. - Spun off the variant-choosing portion of bli_gemmsup_ref() into bli_gemmsup_int(), in bli_l3_sup_int.c. - Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along with corresponding headers, in which higher-level packm-related functions are defined for use within the sup framework.
The actual packm variant code resides in bli_l3_sup_packm_var.c. - Pass the following new parameters into var1n and var2m: packa, packb bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now always NULL), and pointer to a thrinfo_t* (which for now is the address of the global single-threaded packm thread control node). - Added panel strides ps_a and ps_b to the auxinfo_t structure so that the millikernel can query the panel stride of the packed matrix and step through it accordingly. If the matrix isn't packed, the panel stride of interest for the given millikernel will be set to the appropriate value so that the mkernel may step through the unpacked matrix as it normally would. - Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate panel strides (ps_a and ps_b, respectively) instead of computing them on the fly. - Spun off the environment variable getting and setting functions into a new file, bli_env.c (with a corresponding prototype header). These functions are now used by the threading infrastructure (e.g. BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B). - Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER. - Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER, for use within the definition of BLIS_MEM_INITIALIZER. - Moved the global_rntm object to bli_rntm.c and extern it where needed. This means that the function bli_thread_init_rntm() was renamed to bli_rntm_init_from_global() and relocated accordingly. - Added a new file, bli_pack.c, which serves as the home for functions that manage the pack_a and pack_b fields of the global rntm_t, including from environment variables, just as we have functions to manage the threading fields of the global rntm_t in bli_thread.c. - Reorganized naming for files in frame/thread, which mostly involved spinning off the bli_l3_thread_decorator() functions into their own files.
This change makes more sense when considering the further addition of bli_l3_sup_thread_decorator() functions (for now limited only to the single-threaded form found in the _single.c file). - Explicitly initialize the reference sup handlers in both bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more obvious how to customize to a different handler, if desired. - Removed various snippets of disabled code. - Various comment updates. commit bbb21fd0a9be8c5644bec37c75f9396eeeb69e48 Author: Field G. Van Zee Date: Thu Nov 21 18:15:16 2019 -0600 Tweaked SIAM/SC Best Prize language in README.md. commit 043366f92d5f5f651d5e3371ac3adb36baf4adce Author: Field G. Van Zee Date: Thu Nov 21 18:13:51 2019 -0600 Fixed typo in previous commit (SIAM/SC prize). commit 05a4d583e65a46ff2a1100ab4433975d905d91f9 Author: Field G. Van Zee Date: Thu Nov 21 18:12:24 2019 -0600 Added SIAM/SC prize to "What's New" in README.md. commit 881b05ecd40c7bc0422d3479a02a28b1cb48383f Author: Field G. Van Zee Date: Thu Nov 21 16:34:27 2019 -0600 Fixed blastest failure for 'generic' subconfig. Details: - Fixed a subtle and complicated bug that only manifested via the BLAS test drivers in the generic subconfiguration, and possibly any other subconfiguration that did not register complex-domain gemm ukernels, or registered ONLY real-domain ukernels as row-preferential. This is a long story, but it boils down to an exception to the "transpose the operation to bring storage of C into agreement with ukernel pref" optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the proper functioning of the 1m method, but only when the imaginary component of beta is zero. See the comments in issue #342 for more details. Thanks to Dave Love for identifying the commit in which this bug was introduced, and other feedback related to this bug. commit 0c7165fb01cdebbc31ec00124d446161b289942f Author: Field G. Van Zee Date: Thu Nov 14 16:48:14 2019 -0600 Fixed obscure bug in bli_acquire_mpart_[mn]dim(). 
Details: - Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(), and bli_acquire_mpart_mndim() that allowed the use of a blocksize b that is too large given the current row/column index (i.e., the i/j argument) and the size of the dimension being partitioned (i.e., the m/n argument). This bug only affected backwards partitioning/motion through the dimension and was the result of a misplaced conditional check-and-redirect to the backwards code path. It should be noted that this bug was discovered not because it manifested the way it could (thanks to the callers in BLIS making sure to always pass in the "correct" blocksize b), but could have manifested if the functions were used by 3rd party callers. Thanks to Minh Quan Ho for reporting the bug via issue #363. commit fb8bef9982171ee0f60bc39e41a33c4d31fd59a9 Author: Field G. Van Zee Date: Thu Nov 14 13:05:28 2019 -0600 Fixed copy-paste bug in bli_spackm_6xk_bb4_ref(). Details: - Fixed a copy-paste bug in the new bli_spackm_6xk_bb4_ref() that manifested as failures in single-precision real level-3 operations. Also replaced the duplication factor constants with a const-qualified variable, dfac, so that this won't happen again. - Changed NC for single-precision real from 4080 to 8160 so that the packed matrix B will have the same byte footprint in both single and double real. commit 8f399c89403d5824ba767df1426706cf2d19d0a7 Author: Field G. Van Zee Date: Tue Nov 12 15:32:57 2019 -0600 Tweaked/added notes to docs/Multithreading.md. Details: - Added language to docs/Multithreading.md cautioning the reader about the nuances of setting multithreading parameters via the manual and automatic ways simultaneously, and also about how these parameters behave when multithreading is disabled at configure-time. These changes are an attempt to address the issues that arose in issue #362. Thanks to Jérémie du Boisberranger for his feedback on this topic. - CREDITS file update.
commit bdc7ee3394500d8e5b626af6ff37c048398bb27e Author: Field G. Van Zee Date: Mon Nov 11 15:47:17 2019 -0600 Various fixes to support packing duplication in B. Details: - Added cpp macros to trmm and trmm3 front-ends to optionally force those operations to be cast so the structured matrix is on the left. symm and hemm already had such macros, but these too were renamed so that the macros were individual to the operation. We now have four such macros: #define BLIS_DISABLE_HEMM_RIGHT #define BLIS_DISABLE_SYMM_RIGHT #define BLIS_DISABLE_TRMM_RIGHT #define BLIS_DISABLE_TRMM3_RIGHT Also, updated the comments in the symm and hemm front-ends related to the first two macro guards, and added corresponding comments to the trmm and trmm3 front-ends for the latter two guards. (They all functionally do the same thing, just for their specific operations.) Thanks to Jeff Hammond for reporting the bugs that led me to this change (via #359). - Updated config/old/haswellbb subconfiguration (used to debug issues related to duplicating B during packing) to register: a packing kernel for single-precision real; gemmbb ukernels for s, c, and z; trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c and z; and to use non-default cache and register blocksizes for s, c, and z datatypes. Also declared prototypes for all of the gemmbb, trsmbb, and gemmtrsmbb ukernel functions within the bli_cntx_init_haswellbb() function. This should, once applied to the power9 configuration, fix the remaining issues in #359. - Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a duplication factor of 4. This function is defined in the same file as bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c). commit 0eb79ca8503bd7b237994335b9687457227d3290 Author: Field G. Van Zee Date: Fri Nov 8 14:48:48 2019 -0600 Avoid unused variable warning in lread.c (#356). Details: - Replaced the line f = f; with ( void )f; for the unused variable 'f' in blastest/f2c/lread.c.
(Hopefully) addresses issue #356, but since we don't use xlc who knows. Thanks to Jeff Hammond for reporting this. commit f377bb448512f0b578263387eed7eaf8f2b72bb7 Author: Jérôme Duval Date: Thu Nov 7 23:39:29 2019 +0100 Add Haiku to the known OS list (#361) commit e29b1f9706b6d9ed798b7f6325f275df4e6be973 Author: Field G. Van Zee Date: Tue Nov 5 17:15:19 2019 -0600 Fixed failing testsuite gemmtrsm_ukr for power9. Details: - Added code that fixes false failures in the gemmtrsm_ukr module of the testsuite. The tests were failing because the computation (bli_gemv()) that performs the numerical check was not able to properly traverse the matrix operands bx1 and b11 that are views into the micropanel of B, which has duplicated/broadcast elements under the power9 subconfig. (For example, a micropanel of B with duplication factor of 2 needs to use a column stride of 2; previously, the column stride was being interpreted as 1.) - Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride() static functions in bli_obj_macro_defs.h. (Previously, only the function bli_obj_set_strides() was defined. Amazing to think that we got this far without these former functions.) - Updated/expounded upon comments. commit 49177a6b9afcccca5b39a21c6fd8e243525e1505 Author: Field G. Van Zee Date: Mon Nov 4 18:09:37 2019 -0600 Fixed latent testsuite ukr module bugs for power9. Details: - Fixed a latent bug in the testsuite ukernel modules (gemm, trsm, and gemmtrsm) that only manifested once we began running with parameters that mimic those of power9. The problem was rooted in the way those modules were creating objects (and thus allocating memory) for the micropanel operands to the microkernel being tested. Since power9 duplicates/broadcasts elements of B in memory, we needed an easy way of asking for more than one storage element per logical element in the matrix.
I incorrectly expressed this as: bli_obj_create( datatype, k, n, ldbp, 1, &bp ); The problem here is that bli_obj_create() is exceedingly efficient at calculating the size it passes to malloc() and doesn't allocate a full leading dimension's worth of elements for the last column (or row, in this example). This would normally not bother anyone since you're not supposed to access that memory anyway. But here, my attempted "hack" for getting extra elements was insufficient, and needed to be changed to: bli_obj_create( datatype, k, ldbp, ldbp, 1, &bp ); That is, the extra elements needed to be baked into the dimensions of the matrix object in order to have the intended effect on the number of elements actually allocated. Thanks to Jeff Hammond for reporting this bug. - Fixed a typically harmless memory leak in the aforementioned test modules (the objects for the packed micropanels were not being freed). - Updated/expanded a common comment across all three ukr test modules. commit c84391314d4f1b3f73d868f72105324e649f2a72 Author: Field G. Van Zee Date: Mon Nov 4 13:57:12 2019 -0600 Reverted minor temp/wspace changes from b426f9e. Details: - Added missing license header to bli_pwr9_asm_macros_12x6.h. - Reverted temporary changes to various files in 'test' and 'testsuite' directories. - Moved testsuite/jobscripts into testsuite/old. - Minor whitespace/comment changes across various files. commit 4870260f6b8c06d2cc01b7147d7433ddee213f7f Author: Jeff Hammond Date: Mon Nov 4 11:55:47 2019 -0800 blacklist GCC 5 and older for POWER9 (#360) commit b426f9e04e5499c6f9c752e49c33800bfaadda4c Author: Nicholai Tukanov Date: Fri Nov 1 17:57:03 2019 -0500 POWER9 DGEMM (#355) Implemented and registered power9 dgemm ukernel. Details: - Implemented 12x6 dgemm microkernel for power9. This microkernel assumes that elements of B have been duplicated/broadcast during the packing step. 
The microkernel uses a column orientation for its microtile vector registers and thus implements column storage and general stride IO cases. (A row storage IO case via in-register transposition may be added at a future date.) It should be noted that we recommend using this microkernel with gcc and *not* xlc, as issues with the latter cropped up during development, including but not limited to slightly incompatible vector register mnemonics in the GNU extended inline assembly clobber list. commit 58102aeaa282dc79554ed045e1b17a6eda292e15 Merge: 52059506 b9bc222b Author: Field G. Van Zee Date: Mon Oct 28 17:58:31 2019 -0500 Merge branch 'amd' commit 52059506b2d5fd4c3738165195abeb356a134bd4 Author: Field G. Van Zee Date: Wed Oct 23 15:26:42 2019 -0500 Added "How to Download BLIS" section to README.md. Details: - Added a new section to the README.md, just prior to the "Getting Started" section, titled "How to Download BLIS". This section details the user's options for obtaining BLIS and lays out four common ways of downloading the library. Thanks to Jeff Diamond for his feedback on this topic. commit e6f0a96cc59aef728470f6850947ba856148c38a Author: Field G. Van Zee Date: Mon Oct 14 17:05:39 2019 -0500 Updated README.md to ack Facebook as funder. commit b9bc222bfc3db4f9ae5d7b3321346eed70c2c3fb Author: Field G. Van Zee Date: Mon Oct 14 16:38:15 2019 -0500 Call bli_syrk_small() before error checking. Details: - In bli_syrk_front(), moved the conditional call to bli_syrk_check() (if error checking is enabled) and the conditional scaling of C by beta (if alpha is zero) so that they occur after, instead of before, the call to bli_syrk_small(). This sequencing now matches that of bli_gemm_small() in bli_gemm_front() and bli_trsm_small() in bli_trsm_front(). commit f0959a81dbcf30d8a1076d0a6348a9835079d31a Author: Field G. Van Zee Date: Mon Oct 14 15:46:28 2019 -0500 When manual config is blacklisted, output error. 
Details: - Fixed and adjusted the logic in configure so that a more informative error message is output when a user runs './configure ... ' and is present in the configuration blacklist. Previously, this particular set of conditions would result in the message: 'user-specified configuration '' is NOT registered! That is, the error message mis-identified the targeted configuration as the empty string, and (more importantly) mis-identifies the problem. Thanks to Tze Meng Low for reporting this issue. - Fixed a nearby error message somewhat unrelated to the issue above. Specifically, the wrong string was being printed when the error message was identifying an auto-detected configuration that did not appear to be registered. commit 6218ac95a525eefa8921baf8d0d7057dfacebe9c Merge: 0016d541 a617301f Author: Field G. Van Zee Date: Fri Oct 11 11:53:51 2019 -0500 Merge branch 'master' into amd commit 0016d541e6b0da617b1fae6612d2b314901b7a75 Author: Field G. Van Zee Date: Fri Oct 11 11:09:44 2019 -0500 Changed -march=znver2 to =znver1 for clang on zen2. Details: - In config/zen2/make_defs.mk, changed the -march= flag so that -march=znver1 is used instead of -march=znver2 when CC_VENDOR is clang. (The gcc branch attempts to differentiate between various versions, but the equivalent version cutoffs for clang are not yet known by us, so we have to use a single flag for all versions of clang. Hopefully -march=znver1 is new enough. If not, we'll fall back to -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp.) This issue was discovered thanks to AppVeyor. commit e94a0530e5ac4c78a18f09105f40003be2b517f7 Author: Field G. Van Zee Date: Fri Oct 11 10:48:27 2019 -0500 Corrected zen NC that was non-multiple of NR. Details: - Updated an incorrectly set cache blocksize NC for single real within config/zen/bli_cntx_init_zen.c that was not a multiple of the corresponding value of NR. This issue, which was caught by Travis CI, was introduced in 29b0e1e.
commit a2ffac752076bf55eb8c1fe2c5da8d9104f1f85b Merge: 1cfe8e25 29b0e1ef Author: Field G. Van Zee Date: Fri Oct 11 10:31:18 2019 -0500 Merge branch 'amd-master' into amd commit 29b0e1ef4e8b84ce76888d73c090009b361f1306 Merge: 1cfe8e25 fdce1a56 Author: Field G. Van Zee Date: Fri Oct 11 10:24:24 2019 -0500 Code review + tweaks to AMD's AOCL 2.0 PR (#349). Details: - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis into 'amd-master' of flame/blis. - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was inadvertently not incremented when the Zen2 subconfiguration was added. - In bli_gemm_front(), added a missing conditional constraint around the call to bli_gemm_small() that ensures that the computation precision of C matches the storage precision of C. - In bli_syrk_front(), reorganized and relocated the notrans/trans logic that existed around the call to bli_syrk_small() into bli_syrk_small() to minimize the calling code footprint and also to bring that code into stylistic harmony with similar code in bli_gemm_front() and bli_trsm_front(). Also, replaced direct accessing of obj_t fields with proper accessor static functions (e.g. 'a->dim[0]' becomes 'bli_obj_length( a )'). - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is strictly speaking unnecessary, but it serves as a useful visual cue to those who may be reading the files. - Removed cpp macro-protected small matrix debugging code from bli_trsm_front.c. - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc version check for availability of -march=znver2, and added appropriate support to configure script. - Cleanups to compiler flags common to recent AMD microarchitectures in config/zen/amd_config.mk, including: removal of -march=znver1 et al. from CKVECFLAGS (since the -march flag is added within make_defs.mk); setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c. - Cleanups, added comments to config/zen/make_defs.mk. - Cleanups to config/zen2/make_defs.mk, including making use of newly-added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct set of compiler flags based on the version of gcc being used. - Reverted downstream changes to test/test_gemm.c. - Various whitespace/comment changes. commit a617301f9365ac720ff286514105d1b78951368b Author: Field G. Van Zee Date: Tue Oct 8 17:14:05 2019 -0500 Updates to docs/CodingConventions.md. commit 171f10069199f0cd280f18aac184546bd877c4fe Merge: 702486b1 05d58edf Author: Field G. Van Zee Date: Fri Oct 4 11:18:23 2019 -0500 Merge remote-tracking branch 'loveshack/emacs' commit 702486b12560b5c696ba06de9a73fc0d5107ca44 Author: Field G. Van Zee Date: Wed Oct 2 16:35:41 2019 -0500 Removed stray FAQ section introduced in 1907000. commit 1907000ad6ea396970c010f07ae42980b7b14fa0 Author: Field G. Van Zee Date: Wed Oct 2 16:31:54 2019 -0500 Updated to FAQ (AMD-related questions). Details: - Added a couple potential frequently-asked questions/answers related to AMD's fork of BLIS. - Updated existing answers to other questions. commit 834f30a0dad808931c9d80bd5831b636ed0e1098 Author: Field G. Van Zee Date: Wed Oct 2 12:45:56 2019 -0500 Mention mixeddt paper in docs/MixedDatatypes.md. commit 05d58edfe0ea9279971d74f17a5f7a69c4672ed5 Author: Dave Love Date: Wed Oct 2 10:33:44 2019 +0100 Note .dir-locals.el in docs commit 531110c339f199a4d165d707c988d89ab4f5bfe8 Author: Dave Love Date: Wed Oct 2 10:16:22 2019 +0100 Modify Emacs config Confine it to cc-mode and add comment-start/end. commit 4bab365cab98202259c70feba6ec87408cba28d8 Author: Dave Love Date: Tue Oct 1 19:22:47 2019 +0000 Add .dir-locals.el for Emacs (#348) A minimal version that could probably do with extending, but at least gets the indentation roughly right.
commit 4ec8dad66b3d37b0a2b47d19b7144bb62d332622 Author: Dave Love Date: Thu Sep 26 16:27:53 2019 +0100 Add .dir-locals.el for Emacs A minimal version that could probably do with extending, but at least gets the indentation roughly right. commit bc16ec7d1e2a30ce4a751255b70c9cbe87409e4f Author: Field G. Van Zee Date: Mon Sep 23 15:37:33 2019 -0500 Set execute bits of shared library at install-time. Details: - Modified the 0644 octal code used during installation of shared libraries to 0755 (for Linux/OSX only). Thanks to Adam J. Stewart for reporting this issue via #343. - CREDITS file update. commit c60db26aee9e7b4e5d0b031b0881e58d23666b53 Author: Field G. Van Zee Date: Tue Sep 17 18:04:17 2019 -0500 Fixed bad loop counter in bli_[cz]scal2bbs_mxn(). Details: - Fixed a typo in the loop counter for the 'd' (duplication) dimension in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h. They shouldn't be used by anyone yet, but thankfully clang via AppVeyor spit out warnings that alerted me to the issue. commit c766c81d628f0451d8255bf5e4b8be0a4ef91978 Author: Field G. Van Zee Date: Tue Sep 17 18:00:29 2019 -0500 Added missing schema arg to knl packm kernels. Details: - Added the pack_t schema argument to the knl packm kernel functions. This change was intended for inclusion in 31c8657. (Thank you SDE + Travis CI.) commit 31c8657f1d6d8f6efd8a73fd1995e995fc56748b Author: Field G. Van Zee Date: Tue Sep 17 17:42:10 2019 -0500 Added support for pre-broadcast when packing B. Details: - Added support for being able to duplicate (broadcast) elements in memory when packing matrix B (i.e., the right-hand operand) in level-3 operations. This turns out to be advantageous for some architectures that can afford the cost of the extra bandwidth and somehow benefit from the pre-broadcast elements (and thus being able to avoid using broadcast-style load instructions on micro-rows of B in the gemm microkernel). - Support optionally disabling right-side hemm and symm.
If this occurs, hemm_r is implemented in terms of hemm_l (and symm_r in terms of symm_l). This is needed when broadcasting during packing because the alternative--supporting the broadcast of B while also allowing matrix B to be Hermitian/symmetric--would be an absolute mess. - Support alignment factors for packed blocks of A, B, and C separately (as well as for general-purpose buffers). In addition, we support byte offsets from those alignment values (which is different from aligning by align+offset bytes to begin with). The default alignment values are BLIS_PAGE_SIZE in all four cases, with the offset values defaulting to zero. - Pass pack_t schema into bli_?packm_cxk() so that it can be then passed into the packm kernel, where it will be needed by packm kernels that perform broadcasts of B, since the idea is that we *only* want to broadcast when packing micropanels of B and not A. - Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be used to set custom virtual level-3 microkernels in the cntx_t, which would typically be done in the bli_cntx_init_*() function defined in the subconfiguration of interest. - Added a "broadcast B" kernel function for use with NP/NR = 12/6, defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c. - Added gemm, gemmtrsm, and trsm "broadcast B" reference kernels defined in ref_kernels/3/bb. (These kernels have been tested with double real with NP/NR = 12/6.) - Added #ifndef ... #endif guards around several macro constants defined in frame/include/bli_kernel_macro_defs.h. - Defined a few "broadcast B" static functions in frame/include/level0/bb for use by "broadcast B"-style packm reference kernels. For now, only the real domain kernels are tested and fully defined. - Output the alignment and offset values for packed blocks of A and B in the testsuite's "BLIS configuration info" section. - Comment updates to various files. - Bumped so_version to 3.0.0. commit fd9bf497cd4ff73ccdfc030ba037b3cb2f1c2fad Author: Field G.
Van Zee Date: Tue Sep 17 15:45:24 2019 -0500 CREDITS file update. commit 6c8f2d1486ce31ad3c2083e5c2035acfd4409a43 Author: ShmuelLevine Date: Tue Sep 17 16:43:46 2019 -0400 Fix description for function bli_*pxby2v (#340) Fix typo in BLISTypedAPI.md for bli_?axpy2v() description. commit b5679c1520f8ae7637b3cc2313133461f62398dc Author: Field G. Van Zee Date: Tue Sep 17 14:00:37 2019 -0500 Inserted Multithreading links into BuildSystem.md. Details: - Inserted brief disclaimers about default disabled multithreading and default single-threadedness to BuildSystem.md along with links to the Multithreading.md document. Thanks to Jeff Diamond for suggesting these additions. - Trivial reword of sentence regarding automatically-detected architectures. commit f4f5170f8482c94132832eb3033bc8796da5420b Author: Isuru Fernando Date: Wed Sep 11 07:34:48 2019 -0500 Update README.md (#338) commit 1cfe8e2562e5e50769468382626ce36b734741c1 Author: Field G. Van Zee Date: Thu Sep 5 16:08:30 2019 -0500 Reimplemented bli_cpuid_query() for ARM. Details: - Rewrote bli_cpuid_query() for ARM architectures to use stdio-based functions such as fopen() and fgets() instead of popen(). The new code does more or less the same thing as before--searches /proc/cpuinfo for various strings, which are then parsed in order to determine the model, part number, and features. Thanks to Dave Love for suggesting this change in issue #335. commit 7c7819145740e96929466a248d6375d40e397e19 Author: Devin Matthews Date: Fri Aug 30 16:52:09 2019 -0500 Always use sqsumv to compute normfv. (#334) * Always use sqsumv to compute normfv on MacOS. * Unconditionally disable the "dot trick" in normfv. * Added explanatory comment to normfv definition. Details: - Added a comment above the unconditional disabling of the dotv-based implementation to normfv. Thanks to Roman Yurchak, Devin Matthews, and Isuru Fernando for helping with this improvement. - CREDITS file update.
commit 80e6c10b72d50863b4b64d79f784df7befedfcd1 Author: Field G. Van Zee Date: Thu Aug 29 12:12:08 2019 -0500 Added reproduction section to Performance docs. Details: - Added section titled "Reproduction" to both Performance.md and PerformanceSmall.md that briefly nudges the motivated reader in the right direction if he/she wishes to run the same performance benchmarks used to produce the graphs shown in those documents. Thanks to Dave Love for making this suggestion. commit 14cb426414856024b9ae0f84ac21efcc1d329467 Author: Field G. Van Zee Date: Wed Aug 28 17:04:33 2019 -0500 Updated OpenBLAS, Eigen sup results. Details: - Updated the results shown in docs/PerformanceSmall.md for OpenBLAS and Eigen. commit b02e0aae8ce2705e91023b98ed416cd05430a78e Author: Field G. Van Zee Date: Tue Aug 27 14:37:46 2019 -0500 Updated test drivers to iterate backwards. Details: - Updated test driver source in test, test/3, test/1m4m, and test/mixeddt to iterate through the problem space backwards. This can help avoid certain situations where the CPU frequency does not immediately throttle up to its maximum. Thanks to Robert van de Geijn for recommending this fix (originally made to test/sup drivers in 57e422a). - Applied off-by-one matlab output bugfix from b6017e5 to test drivers in test, test/3, test/1m4m, and test/mixeddt directories. commit b6017e53f4b26c99b14cdaa408351f11322b1e80 Author: Field G. Van Zee Date: Tue Aug 27 14:18:14 2019 -0500 Bugfix of output text + tweaks to test/sup driver. Details: - Fixed an off-by-one bug in the output of matlab row indices in test/sup/test_gemm.c that only manifested when the problem size increment was equal to 1. - Disabled the building of rrc, rcr, rcc, crr, crc, and ccr storage combinations for blissup drivers in test/sup. This helps make the building of drivers complete sooner. - Trivial changes to test/sup/runme.sh. 
commit 138d403b6bb15e687a3fe26d3d967b8ccd1ed97b Author: Devin Matthews Date: Mon Aug 26 18:11:27 2019 -0500 Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang. (#331) commit d5a05a15a7fcc38fb2519031dcc62de8ea4a530c Author: Field G. Van Zee Date: Mon Aug 26 16:54:31 2019 -0500 Cropped whitespace from new sup graphs. Details: - Previously forgot to crop whitespace from the new .png graphs added/updated in docs/graphs/sup. commit a6c80171a353db709e43f9e6e7a3da87ce4d17ed Author: Field G. Van Zee Date: Mon Aug 26 16:51:31 2019 -0500 Fixed contents links in docs/PerformanceSmall.md. Details: - Corrected links in contents section of docs/PerformanceSmall.md, which were erroneously directing readers to the corresponding sections of docs/Performance.md. commit 40781774df56a912144ef19cc191ed626a89f0de Author: Field G. Van Zee Date: Mon Aug 26 16:47:37 2019 -0500 Updated sup performance graphs with libxsmm. Details: - Added libxsmm to column-stored sup graphs presented in docs/PerformanceSmall.md. - Updated sup results for BLASFEO. - Added sup results for Lonestar5 (Haswell). - Addresses issue #326. commit bfddf671328e7e372ac7228f72ff2d9d8e03ae18 Author: figual Date: Mon Aug 26 12:01:33 2019 +0200 Fixed context registration for Cortex A53 (#329). commit 4a0a6e89c568246d14de4cc30e3ff35aac23d774 Author: Field G. Van Zee Date: Sat Aug 24 15:25:16 2019 -0500 Changed test/sup alpha to 1; test libxsmm+netlib. Details: - Changed the value of alpha to 1.0 in test/sup/test_gemm.c. This is needed because libxsmm currently only optimizes gemm operations where alpha is unit (and beta is unit or zero). - Adjusted the test/sup/Makefile to test libxsmm with netlib BLAS as its fallback library. This is the library that will be called when the problem dimensions are deemed too large, or any other criteria for optimization are not met.
(This was done not because it is realistic, but rather so that it would be very clear when libxsmm ceased handling gemm calls internally when the data are graphed.) commit 7aa52b57832176c5c13a48e30a282e09ecdabf73 Author: Field G. Van Zee Date: Fri Aug 23 16:12:50 2019 -0500 Use libxsmm API in test/sup; add missing -ldl. Details: - Switch the driver source in test/sup so that libxsmm_?gemm() is called instead of ?gemm_() when compiling for / linking against libxsmm. libxsmm's documentation isn't clear on whether it is even *trying* to provide BLAS API compatibility, and I got tired of trying to figure it out. - Added missing -ldl in LDFLAGS when linking against libxsmm. commit 57e422aa168bee7416965265c93fcd4934cd7041 Author: Field G. Van Zee Date: Fri Aug 23 14:17:52 2019 -0500 Added libxsmm support to test/sup drivers. Details: - Modified test/sup/Makefile to build drivers that test the performance of skinny/small problems via libxsmm. - Modified test/sup/runme.sh to run aforementioned drivers. - Modified test/sup/test_gemm.c so that problem sizes are tested in reverse order (from largest to smallest). This can help avoid certain situations where the CPU frequency does not immediately throttle up to its maximum. Thanks to Robert van de Geijn for recommending this fix. commit 661681fe33978acce370255815c76348f83632bc Merge: 2f387e32 ef0a1a0f Author: Field G. Van Zee Date: Thu Aug 22 14:29:50 2019 -0500 Merge branch 'master' of github.com:flame/blis commit 2f387e32ef5f9a17bafb5076dc9f66c38b52b32d Author: Field G. Van Zee Date: Thu Aug 22 14:27:30 2019 -0500 Added Eigen -march=native hack to perf docs. Details: - Spell out the hack given to me by Sameer Agarwal in order to get Eigen to build with -march=native (which is critically important for Eigen) in docs/Performance.md and docs/PerformanceSmall.md. 
commit ef0a1a0faf683fe205f85308a54a77ffd68a9a6c Author: Devin Matthews Date: Wed Aug 21 17:40:24 2019 -0500 Update do_sde.sh (#330) * Update do_sde.sh Automatically accept SDE license and download directly from Intel * Update .travis.yml [ci skip] * Update .travis.yml Enable SDE testing for PRs. commit 0cd383d53a8c4a6871892a0395591ef5630d4ac0 Author: Field G. Van Zee Date: Wed Aug 21 13:39:05 2019 -0500 Corrected variable type and comment update. Details: - Forgot to save all changes from bli_gemmtrsm4m1_ref.c before commit in 8122f59. Fixed type mismatch and referenced github issue in comment. commit 8122f59745db780987da6aa1e851e9e76aa985e0 Author: Field G. Van Zee Date: Wed Aug 21 13:22:12 2019 -0500 Pacify 'restrict' warning in gemmtrsm4m1 ref ukr. Details: - Previously, some versions of gcc would complain that the same pointer, one_r, is being passed in for both alpha and beta in the fourth call to the real gemm ukernel in bli_gemmtrsm4m1_ref.c. This is understandable since the compiler knows that the real gemm ukernel qualifies all of its floating-point arguments (including alpha and beta) with restrict. A small hack has been inserted into the file that defines a new variable to store the value 1.0, which is now used in lieu of one_r for beta in the fourth call to the real gemm ukernel, which should pacify the compiler now. Thanks to Dave Love for reporting this issue (#328) and for Devin Matthews for offering his 'restrict' expertise. commit e8c6281f139bdfc9bd68c3b36e5e89059b0ead2e Author: Field G. Van Zee Date: Wed Aug 21 12:38:53 2019 -0500 Add -march support for specific gcc version ranges. Details: - Added logic to configure that checks the version of the compiler against known version ranges that could cause problems later in the build process. For example, versions of gcc older than 4.9.0 use different -march labels than version 4.9.0 or later ('-march=corei7-avx' vs '-march=sandybridge', respectively). 
Similarly, before 6.1, compilation on Zen was possible, but you need to start with -march=bdver4 and then disable instruction sets that were discarded during the transition from Excavator to Zen. So now, configure substitutes 'yes'/'no' values into anchors in config.mk.in, which sets various make variables (e.g. GCC_OT_4_9_0), which can be accessed and branched upon by the various configurations' make_defs.mk files when setting their compiler flags. - Updated config/haswell/make_defs.mk to branch on GCC_OT_4_9_0. - Updated config/sandybridge/make_defs.mk to branch on GCC_OT_4_9_0. - Updated config/zen/make_defs.mk to branch on GCC_OT_6_1_0. commit e6ac4ebcb6e6a372820e7f509c0af3342966b84a Author: Field G. Van Zee Date: Tue Aug 20 13:49:47 2019 -0500 Added page size, source location to perf docs. Details: - Added the page size, as returned via 'getconf -a | grep PAGE_SIZE', and the location of the performance drivers to docs/Performance.md (test/3) and docs/PerformanceSmall.md (test/sup). Thanks to Dave Love for suggesting these additions in #325. commit fdce1a5648d69034fab39943100289323011c36f Author: Meghana Date: Wed Jul 24 15:04:41 2019 +0530 changed gcc version check condition from 'ifeq' to 'if greater or equal' Change-Id: Ie4c461867829bcc113210791bbefb9517e52c226 commit c9486e0c4f82cd9f58f5ceb71c0df039e9970a20 Author: Meghana Date: Wed Jul 24 09:45:17 2019 +0530 code to detect version of gcc and set flags accordingly for zen2 Change-Id: I29b0311d0000dee1a2533ee29941acf53f9e9f34 commit 54afe3dfe6828a1aff65baabbf14c98d92e50692 Author: Field G. Van Zee Date: Tue Jul 23 16:54:28 2019 -0500 Added "Education and Learning" ToC entry to README. commit 9f53b1ce7ac702e84e71801fe96986f6aa16040e Author: Field G. Van Zee Date: Tue Jul 23 16:50:35 2019 -0500 Added "Education and Learning" section to README. 
Details: - Added a short section after the Intro of the README.md file titled "Education and Learning" that directs interested readers to the "LAFF-On Programming for High-Performance" massive open online course (MOOC) hosted via edX. commit deda4ca8a094ee18d7c7c45e040e8ef180f33a48 Author: Field G. Van Zee Date: Mon Jul 22 13:59:05 2019 -0500 Added test/1m4m driver directory. Details: - Added a new standalone test driver directory named '1m4m' that can build and run performance experiments for BLIS 1m, 4m1a, assembly, OpenBLAS, and the vendor library (MKL). This new driver directory was used to regenerate performance results for the 1m paper. - Added alternate (commented-out) cache blocksizes to config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to work well on a 12-core Intel Xeon E5-2650 v3. commit dcc0ce12fde4c6dca2b4764a1922a2ab19725867 Author: Meghana Date: Mon Jul 22 17:12:01 2019 +0530 Added a global Makefile for AMD architectures in config/zen folder This Makefile(amd_config.mk) has all the flags that are common to EPYC series Change-Id: Ic02c60a8293ccdd37f0f292e631acd198e6895de commit af17bca26a8bd3dcbee8ca81c18d7b25de09c483 Author: Field G. Van Zee Date: Fri Jul 19 14:46:23 2019 -0500 Updated haswell MC cache blocksizes. Details: - Updated the default MC cache blocksizes used by the haswell subconfig for both row-preferential (the default) and column-preferential microkernels. commit b5e9bce4dde5bf014dd9771ae741048e1f6c7748 Author: Field G. Van Zee Date: Fri Jul 19 14:42:37 2019 -0500 Updated -march flags for sandybridge, haswell. Details: - Updated the '-march=corei7-avx' flag in the sandybridge subconfig to '-march=sandybridge' and the '-march=core-avx2' flag in the haswell subconfig to '-march=haswell'. The older flags were used by older versions of gcc and should have been updated to the newer forms a long time ago. (The older flags were clearly working, even though they are no longer documented in the gcc man page.)
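Editor's illustration (not part of the commit log): read together with e8c6281 above, the two -march entries amount to a version-dependent choice of flag. The sketch below restates that choice in C purely for clarity. The real logic lives in configure and the subconfigurations' make_defs.mk files; the function names here are hypothetical, and the assumption that haswell falls back to '-march=core-avx2' at the same 4.9.0 cutoff is inferred from the log rather than stated in it.

```c
#include <string.h>

/* Returns 1 if gcc version major.minor is older than ref_major.ref_minor.
   (Hypothetical helper; stands in for make variables such as GCC_OT_4_9_0.) */
static int gcc_older_than(int major, int minor, int ref_major, int ref_minor)
{
    return major < ref_major || (major == ref_major && minor < ref_minor);
}

/* Hypothetical illustration: pick the -march label for a subconfiguration
   given the gcc version. Returns NULL for subconfigurations not covered
   by this sketch. */
const char *march_flag_for(const char *subconfig, int major, int minor)
{
    int old = gcc_older_than(major, minor, 4, 9);

    if (strcmp(subconfig, "sandybridge") == 0)
        return old ? "-march=corei7-avx" : "-march=sandybridge";
    if (strcmp(subconfig, "haswell") == 0)
        return old ? "-march=core-avx2" : "-march=haswell";
    return NULL;
}
```

For example, march_flag_for("sandybridge", 4, 8) selects the older label, while any gcc at 4.9 or later selects the newer one.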
commit c22b9dba5859a9fc94c8431eccc9e4eb9be02be1 Author: Field G. Van Zee Date: Tue Jul 16 13:14:47 2019 -0500 More updates to comments in testsuite modules. Details: - Updated most comments in testsuite modules that describe how the correctness test is performed so that it is clear whether the vector (normfv) or matrix (normfm) form of Frobenius norm is used. commit c4cc6fa702f444a05963db01db51bc7d6669e979 Author: Field G. Van Zee Date: Tue Jul 16 13:00:35 2019 -0500 New cntx_t blksz "set" functions + misc tweaks. Details: - Defined two new static functions in bli_cntx.h: bli_cntx_set_blksz_def_dt() bli_cntx_set_blksz_max_dt() which developers may find convenient when experimenting with different values of cache blocksizes. - Updated one- and two-socket multithreaded problem size range and increment values in test/3/Makefile. - Changed default to column storage in test/3/test_gemm.c. - Fixed typo in comment in testsuite/src/test_subm.c. commit b84cee29f42855dc1f263e42b83b1a46ac8def87 Merge: 1f80858a c7dd6e6c Author: Meghana Vankadari Date: Mon Jul 8 02:03:07 2019 -0400 Merge "Added compiler flags for vanilla clang" into amd-staging-rome2.0 commit 1f80858abf5ca220b2998fbe6f9b06c32d3864c3 Author: kdevraje Date: Fri Jul 5 16:05:11 2019 +0530 This checkin solves the dgemm performance issue jira ticket CPUPL 458, as #else was missed during integration, it was always following else path to get the block sizes Change-Id: I0084b5856c2513ab1066c08c15b5086db6532717 commit c7dd6e6cd2f910cbefcdc1e04a5adeb919a23de0 Author: Meghana Date: Thu Jul 4 09:32:51 2019 +0530 Added compiler flags for vanilla clang Change-Id: I13c00b4c0d65bbda4c929848fd48b0ab611952ab commit 2acd49b76457635625a01e31c2abc8902b23cf51 Author: Meghana Date: Mon Jul 1 15:42:38 2019 +0530 fix for test failures using AOCC 2.0 Change-Id: If44eaccc64bbe96bbbe1d32279b1b5773aba08d1 commit ceee2f973ebe115beca55ca77f9e3ce36b14c28a Author: Field G. 
Van Zee Date: Mon Jun 24 17:47:40 2019 -0500 Fixed thrinfo_t printing bug for small problems. Details: - Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c, whereby subnodes of the thrinfo_t tree are "dereferenced" near the beginning of the functions, which may lead to segfaults in certain situations where the thread tree was not fully formed because the matrix problem was too small for the level of parallelism specified. (That is, too small because some problems were assigned no work due to the smallest units in the m and n dimensions being defined by the register blocksizes mr and nr.) The fix requires several nested levels of if statements, and this is one of those few instances where use of goto statements results in (mostly) prettier code, especially in the case of _gemm_paths(). And while it wasn't necessary, I ported this goto usage to the loop body that prints the thrinfo_t work_id and comm_id values for each thread. Thanks to Nicholai Tukanov for helping to find this bug. commit cac127182dd88ed0394ad81e6b91b897198e168a Merge: 565fa385 3a45ecb1 Author: kdevraje Date: Mon Jun 24 13:01:27 2019 +0530 Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis with public repo commit id 565fa3853b381051ac92cff764625909d105644d. Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42 commit c152109e9a3b1cd74760e8a3215a676d25c18d2e Author: Field G. Van Zee Date: Wed Jun 19 13:23:24 2019 -0500 Updated BLASFEO results in PerformanceSmall.md. Details: - Updated the BLASFEO performance graphs shown in PerformanceSmall.md using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md accordingly. - Updated test/sup/octave/plot_l3sup_perf.m so that the .m files containing the mpnpkp results do not need to be preprocessed in order to plot half the problem size range (ie: up to 400 instead of the 800 range of the other shape cases). - Trivial updates to runme.m. 
commit 4d19c98110691d33ecef09d7e1b97bd1ccf4c420 Author: Field G. Van Zee Date: Sat Jun 8 11:02:03 2019 -0500 Trivial change to MixedDatatypes.md link text. commit 24965beabe83e19acf62008366097a7f198d4841 Author: Field G. Van Zee Date: Sat Jun 8 11:00:22 2019 -0500 Fixed typo in README.md's MixedDatatypes.md link. commit 50dc5d95760f41c5117c46f754245edc642b2179 Author: Field G. Van Zee Date: Fri Jun 7 13:10:16 2019 -0500 Adjust -fopenmp-simd for icc's preferred syntax. Details: - Use -qopenmp-simd instead of -fopenmp-simd when compiling with Intel icc. Recall that this option is used for SIMD auto-vectorization in reference kernels only. Support for the -f option has been completely deprecated and removed in newer versions of icc in favor of -q. Thanks to Victor Eijkhout for reporting this issue and suggesting the fix. commit ad937db9507786874c801b41a4992aef42d924a1 Author: Field G. Van Zee Date: Fri Jun 7 11:34:08 2019 -0500 Added missing #include "bli_family_thunderx2.h". Details: - Added a cpp-conditional directive block to bli_arch_config.h that #includes "bli_family_thunderx2.h". The code has been missing since adf5c17f. However, this never manifested as an error because the file is virtually empty and not needed for thunderx2 (or most subconfigs). Thanks to Jeff Diamond for helping to spot this. commit ce671917b2bc24895289247feef46f6fdd5020e7 Author: Field G. Van Zee Date: Thu Jun 6 14:17:21 2019 -0500 Fixed formatting/typo in docs/PerformanceSmall.md. commit 86c33a4eb284e2cf3282a1809be377785cdb3703 Author: Field G. Van Zee Date: Wed Jun 5 11:43:55 2019 -0500 Tweaked language in README.md related to sup/AMD. commit cbaa22e1ca368d36a8510f2b4ecd6f1523d1e1f3 Author: Field G. Van Zee Date: Tue Jun 4 16:06:58 2019 -0500 Added BLASFEO results to docs/PerformanceSmall.md. Details: - Updated the graphs linked in PerformanceSmall.md with BLASFEO results, and added documenting language accordingly. - Updated scripts in test/sup/octave to plot BLASFEO data. 
- Minor tweak to language re: how OpenBLAS was configured for docs/Performance.md. commit 763fa39c3088c0e2c0155675a3ca868a58bffb30 Author: Field G. Van Zee Date: Tue Jun 4 14:46:45 2019 -0500 Minor tweaks to test/sup. Details: - Changed starting problem and increment from 16 to 4. - Added 'lll' (square problems) to list of problem size shapes to compile and run with. - Define BLASFEO location and added BLASFEO-related definitions. commit 5e1e696003c9151b1879b910a1957b7bdd7b0deb Author: Field G. Van Zee Date: Mon Jun 3 18:37:20 2019 -0500 CHANGELOG update (0.6.0) commit 18c876b989fd0dcaa27becd14e4f16bdac7e89b3 (tag: 0.6.0) Author: Field G. Van Zee Date: Mon Jun 3 18:37:19 2019 -0500 Version file update (0.6.0) commit 0f1b3bf49eb593ca7bb08b68a7209f7cd550f912 Author: Field G. Van Zee Date: Mon Jun 3 18:35:19 2019 -0500 ReleaseNotes.md update in advance of next version. Details: - Updated ReleaseNotes.md in preparation for next version. - CREDITS file update. commit 27da2e8400d900855da0d834b5417d7e83f21de1 Author: Field G. Van Zee Date: Mon Jun 3 17:14:56 2019 -0500 Minor edits to docs/PerformanceSmall.md. Details: - Added performance analysis to "Comments" section of both Kaby Lake and Epyc sections. - Added emphasis to certain passages. commit 09ba05c6f87efbaadf085497dc137845f16ee9c5 Author: Field G. Van Zee Date: Mon Jun 3 16:53:19 2019 -0500 Added sup performance graphs/document to 'docs'. Details: - Added a new markdown document, docs/PerformanceSmall.md, which publishes new performance graphs for Kaby Lake and Epyc showcasing the new BLIS sup (small/skinny/unpacked) framework logic and kernels. For now, only single-threaded dgemm performance is shown. - Reorganized graphs in docs/graphs into docs/graphs/large, with new graphs being placed in docs/graphs/sup. - Updates to scripts in test/sup/octave, mostly to allow decent output in both GNU octave and Matlab. - Updated README.md to mention and refer to the new PerformanceSmall.md document. 
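Editor's illustration (not part of the commit log): several sup-related entries in this log (a4e8801, cb788ff, 3df84f1) adjust or describe the thresholds that decide whether the sup implementation handles a problem. A minimal sketch of that decision follows, assuming the strictly-less-than rule described in 3df84f1. The type and function names are hypothetical and are not BLIS API.

```c
#include <stdbool.h>

/* Hypothetical container for the three dimension thresholds. */
typedef struct
{
    long m;
    long n;
    long k;
} sup_thresh_t;

/* Hypothetical predicate: the sup path accepts the problem if at least one
   dimension is strictly less than its threshold. With all thresholds set
   to zero, no problem is ever accepted, including problems that have a
   zero dimension. */
bool sup_accepts(long m, long n, long k, sup_thresh_t t)
{
    return m < t.m || n < t.n || k < t.k;
}
```

With a threshold of 201 on the m dimension, m = 200 would be accepted and m = 201 would fall through to the conventional implementation, provided n and k are at or above their own thresholds.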
commit 6bf449cc6941734748034de0e9af22b75f1d6ba1 Merge: abd8a9fa a4e8801d Author: Field G. Van Zee Date: Fri May 31 17:42:40 2019 -0500 Merge branch 'amd' commit a4e8801d08d81fa42ebea6a05a990de8dcedc803 Author: Field G. Van Zee Date: Fri May 31 17:30:51 2019 -0500 Increased MT sup threshold for double to 201. Details: - Fine-tuned the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 180 to 201 for haswell and 180 to 256 for zen. - Updated octave scripts in test/sup/octave to include a seventh column to display performance for m = n = k. commit 3a45ecb15456249c30ccccd60e42152f355615c1 Merge: 3f867c96 b69fb0b7 Author: Kiran Devrajegowda Date: Fri May 31 06:47:02 2019 -0400 Merge "Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration, this is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16. By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back. In bli_gemm_front.c - code cleanup" into amd-staging-rome2.0 commit b69fb0b74a4756168de270fc9b18f7cf7aa57f17 Author: Kiran Varaganti Date: Fri May 31 15:14:22 2019 +0530 Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration, this is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16.
By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back. In bli_gemm_front.c - code cleanup Change-Id: I9f5d8225254676a99c6f2b09a0825e545206d0fc commit 3f867c96caea3bbbbeeff1995d90f6cf8c9895fb Author: kdevraje Date: Fri May 31 12:22:44 2019 +0530 When running HPL with pure MPI without DGEMM Threading (Single Threaded BLIS ), making this macro 1 gives best performance. Change-Id: I24fd0bf99216f315e49f1c74c44c3feaffd7078d commit abd8a9fa7df4569aa2711964c19888b8e248901f (origin/pfhp) Author: Field G. Van Zee Date: Tue May 28 12:49:44 2019 -0500 Inadvertently hidden xerbla_() in blastest (#313). Details: - Attempted a fix to issue #313, which reports that when building only a shared library (ie: static library build is disabled), running the BLAS test drivers can fail because those drivers provide their own local version of xerbla_() as a clever (albeit still rather hackish) way of checking the error codes that result from the individual tests. This local xerbla_() function is never found at link-time because the BLAS test drivers' Makefile imports BLIS compilation flags via the get-user-cflags-for() function, which currently conveys the -fvisibility=hidden flag, which hides symbols unless they are explicitly annotated for export. The -fvisibility=hidden flag was only ever intended for use when building BLIS (not for applications), and so the attempted solution here is to omit the symbol export flag(s) from get-user-cflags-for() by storing the symbol export flag(s) to a new BUILD_SYMFLAGS variable instead of appending it to the subconfigurations' CMISCFLAGS variable (which is returned by every get-*-cflags-for() function). Thanks to M. Zhou for reporting this issue and also to Isuru Fernando for suggesting the fix. - Renamed BUILD_FLAGS to BUILD_CPPFLAGS to harmonize with the newly created BUILD_SYMFLAGS. - Fixed typo in entry for --export-shared flag in 'configure --help' text.
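Editor's illustration (not part of the commit log): the xerbla_() entry above turns on how -fvisibility=hidden treats symbols that carry no export annotation. The sketch below shows what such an annotation looks like with gcc or clang. It is not the fix that was applied (the fix was to stop passing the flag to applications), and DEMO_EXPORT, demo_xerbla, and demo_last_info are hypothetical names chosen so as not to collide with any real BLAS symbol.

```c
/* With gcc/clang, a symbol compiled under -fvisibility=hidden stays out of
   the dynamic symbol table unless it is marked with default visibility.
   On other compilers this sketch leaves the macro empty. */
#if defined(__GNUC__)
#define DEMO_EXPORT __attribute__((visibility("default")))
#else
#define DEMO_EXPORT
#endif

/* Records the most recent error code passed to the handler, in the spirit
   of a test driver that inspects error codes after each call. */
static int demo_last_info = 0;

/* Hypothetical stand-in for a driver-local error handler. Because it is
   annotated, it would remain visible even if this translation unit were
   compiled with -fvisibility=hidden. */
DEMO_EXPORT int demo_xerbla(const char *srname, const int *info)
{
    (void)srname; /* routine name is unused in this sketch */
    demo_last_info = *info;
    return 0;
}

/* Accessor so that callers can check what the handler recorded. */
DEMO_EXPORT int demo_get_last_info(void)
{
    return demo_last_info;
}
```

An unannotated handler compiled with -fvisibility=hidden would be the situation the entry describes: the definition exists in the driver, but the shared library cannot resolve to it.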
commit 13806ba3b01ca0dd341f4720fb930f97e46710b0 Author: kdevraje Date: Mon May 27 16:24:43 2019 +0530 This check in has changes w.r.t Copyright information, which is changed to (start year) - 2019 Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041 commit ee123f535872510f77100d3d55a43d4ca56047d5 Author: Meghana Date: Mon May 27 15:36:44 2019 +0530 Defined small matrix thresholds for TRSM for various cases for NAPLES and ROME. Updated copyright information for kernels/zen/bli_trsm_small.c file. Removed separate kernels for zen2 architecture. Instead added threshold conditions in zen kernels both for ROME and NAPLES. Change-Id: Ifd715731741d649b6ad16b123a86dbd6665d97e5 commit 9d93a4caa21402d3a90aac45d7a1603736c9fd63 Author: prangana Date: Fri May 24 17:59:13 2019 +0530 update version 2.0 commit 755730608d923538273a90c48bfdf77571f86519 Author: Field G. Van Zee Date: Thu May 23 17:34:36 2019 -0500 Minor rewording of language around mt env. vars. commit ba31abe73c97c16c78fffc59a215761b8d9fd1f6 Author: Field G. Van Zee Date: Thu May 23 14:59:53 2019 -0500 Added BLIS threading info to Performance.md. Details: - Documented the BLIS environment variables that were set (e.g. BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT) for each machine and threading configuration in order to achieve the parallelism reported on in docs/Performance.md. commit cb788ffc89cac03b44803620412a5e83450ca949 Author: Field G. Van Zee Date: Thu May 23 13:00:53 2019 -0500 Increased MT sup threshold for double to 180. Details: - Increased the double-precision real MT threshold (which controls whether the sup implementation kicks in for smaller m dimension values) from 80 to 180, and this change was made for both haswell and zen subconfigurations. This is less about the m dimension in particular and more about facilitating a smoother performance transition when m = n = k. commit 057f5f3d211e7513f457ee6ca6c9555d00ad1e57 Author: Field G. Van Zee Date: Thu May 23 12:51:17 2019 -0500 Minor build system housekeeping.
Details: - Commented out redundant setting of LIBBLIS_LINK within all driver- level Makefiles. This variable is already set within common.mk, and so the only time it should be overridden is if the user wants to link to a different copy of libblis. - Very minor changes to build/gen-make-frags/gen-make-frag.sh. - Whitespace and inconsequential quoting change to configure. - Moved top-level 'windows' directory into a new 'attic' directory. commit e05171118c377f356f89c4daf8a0d5ddc5a4e4f7 Author: Meghana Date: Thu May 23 16:15:27 2019 +0530 Implemented TRSM for small matrices for cases where A is on the right Added separate kernels for zen and zen2 Change-Id: I6318ddc250cf82516c1aa4732718a35eae0c9134 commit 02920f5c480c42706b487e37b5ecc96c3555b851 Author: kdevraje Date: Thu May 23 15:29:59 2019 +0530 make checkblis fails for matrix dimension check at the begining hence reverting it Change-Id: Ibd2ee8c2d4914598b72003fbfc5845be9c9c1e87 commit 84215022f29fb3bfedd254d041635308d177e6c0 Author: kdevraje Date: Thu May 23 11:08:41 2019 +0530 Adding threshold condition to dgemm small matrix kernels, defining the constants in zen2 configuration Change-Id: I53a58b5d734925a6fcb8d8bea5a02ddb8971fcd5 commit a3554eb1dcc1b5b94d81c60761b2f01c3d827ffa Merge: ea082f83 17b878b6 Author: kdevraje Date: Thu May 23 11:51:07 2019 +0530 Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis to configure zen2 Change-Id: I97e17bca9716b80b862925f97bb513c07b4b0cae commit ea082f839071dd9ec555062dc3851c31d12f00e4 Author: kdevraje Date: Thu May 23 10:38:29 2019 +0530 adding empty zen2 directory with .gitignore file Change-Id: Ifa37cf54b2578aa19ad335372b44bca17043fe4b commit b80bd5bcb2be8551a9a21fafc8e6c8b6336c99b5 Author: Kiran Varaganti Date: Tue May 21 15:11:47 2019 +0530 config/zen/bli_cntx_init_zen.c: removed BLIS_ENBLE_ZEN_BLOCK_SIZES macro. 
We have different configurations for both zen and zen2 config/zen/bli_family_zen.h: deleted macro BLIS_ENBLE_ZEN_BLOCK_SIZES config/zen/make_defs.mk: removed compiler flag -mno-avx256-split-unaligned-store frame/base/bli_cpuid.c: ROME family is 17H but model # is from 0x30H. test/test_gemm.c - commented out #define FILE_IN_OUT (some compilation error when BLIS is configured as amd64) Now we can use single configuration has ./configure amd64 - this will work both for ROME & Naples Change-Id: I91b4fc35380f8a35b4f4c345da040c6b5910b4a2 commit a042db011df9a1c3e7c7ac546541f4746b176ea5 Author: Kiran Varaganti Date: Mon May 20 14:17:32 2019 +0530 Modified make_defs.mk for zen2 to get compiled by gcc version less than gcc9.0 Change-Id: I8fcac30538ee39534c296932639053b47b9a2d43 commit a23f92594cf3d530e5794307fe97afc877d853b7 Author: Kiran Varaganti Date: Mon May 20 10:48:06 2019 +0530 config_registry: New AMD zen2 architecture configuration added. frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS] frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...). frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..). frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. 
BLIS_NUM_ARCHS 20 Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866 commit 17b878b66d917d50b6fe23721d8579e826cb3e8c Author: kdevraje Date: Wed May 22 14:02:53 2019 +0530 adding license same as in ut-austin-amd-branch Change-Id: I6790768d2bf5d42369d304ef93e34701f95fbaff commit df755848b8a271323e007c7a628c64af63deab00 Merge: ca4b33c0 c72ae27a Author: kdevraje Date: Wed May 22 13:30:07 2019 +0530 Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis into rome2.0 Change-Id: Ie8aad1ab810f0f3c0b90ec67f9dd3dfb8dcc74cc commit c72ae27adee4726679ee004d02c972582b5285b4 Author: Nisanth M P Date: Mon Mar 19 12:49:26 2018 +0530 Re-enabling the small matrix gemm optimization for target zen Change-Id: I13872784586984634d728cd99a00f71c3f904395 commit ab0818af80f7f683080873f3fa24734b65267df2 Author: sraut Date: Wed Oct 3 15:30:33 2018 +0530 Review comments incorporated for small TRSM. Change-Id: Ia64b7b2c0375cc501c2cb0be8a1af93111808cd9 commit 32392cfc72af7f42da817a129748349fb1951346 Author: Jeff Hammond Date: Tue May 14 15:52:30 2019 -0400 add info about CXX in configure (#311) commit fa7e6b182b8365465ade178b0e4cd344ff6f6460 Author: Field G. Van Zee Date: Wed May 1 19:13:00 2019 -0500 Define _POSIX_C_SOURCE in bli_system.h. Details: - Added #ifndef _POSIX_C_SOURCE #define _POSIX_C_SOURCE 200809L #endif to bli_system.h so that an application that uses BLIS (specifically, an application that #includes blis.h) does not need to remember to #define the macro itself (either on the command line or in the code that includes blis.h) in order to activate things like the pthreads. Thanks to Christos Psarras for reporting this issue and suggesting this fix. - Commented out #include in bli_system.h, since I don't think this header is used/needed anymore. - Comment update to function macro for bli_?normiv_unb_var1() in frame/util/bli_util_unb_var1.c. commit 3df84f1b5d5e1146bb01bfc466ac20c60a9cc859 Author: Field G. 
Van Zee Date: Sat Apr 27 21:27:32 2019 -0500 Minor bugfixes in sup dgemm implementation. Details: - Fixed an obscure bug in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel that only affected the beta == 0, column-storage output case. Thanks to the BLAS test drivers for catching this bug. - Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if k = 0, when the correct action would be to scale by beta (and then return). Thanks to the BLAS test drivers for catching this bug. - Changed the sup threshold behavior such that the sup implementation only kicks in if a matrix dimension is strictly less than (rather than less than or equal to) the threshold in question. - Initialize all thresholds to zero (instead of 10) by default in ref_kernels/bli_cntx_ref.c. This, combined with the above change to threshold testing means that calls to BLIS or BLAS with one or more matrix dimensions of zero will no longer trigger the sup implementation. - Added disabled debugging output to frame/3/bli_l3_sup.c (for future use, perhaps). commit ecbdd1c42dcebfecd729fe351e6bb0076aba7d81 Author: Field G. Van Zee Date: Sat Apr 27 19:38:11 2019 -0500 Ceased use of BLIS_ENABLE_SUP_MR/NR_EXT macros. Details: - Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long conditional that would determine whether to allow the last iteration to be merged with the second-to-last iteration. Functionally, the macros were not needed, and they ended up causing problems when building configuration families such as intel64 and x86_64. commit aa8a6bec3036a41e1bff2034f8ef6766a704ec49 Author: Field G. Van Zee Date: Sat Apr 27 18:53:33 2019 -0500 Fixed typo in --disable-sup-handling macro guard. Details: - Fixed an incorrectly-named macro guard that is intended to allow disabling of the sup framework via the configure option --disable-sup-handling.
In this case, the preprocessor macro, BLIS_DISABLE_SUP_HANDLING, was still named by its name from an older uncommitted version of the code (BLIS_DISABLE_SM_HANDLING). commit b9c9f03502c78a63cfcc21654b06e9089e2a3822 Author: Field G. Van Zee Date: Sat Apr 27 18:44:50 2019 -0500 Implemented gemm on skinny/unpacked matrices. Details: - Implemented a new sub-framework within BLIS to support the management of code and kernels that specifically target matrix problems for which at least one dimension is deemed to be small, which can result in long and skinny matrix operands that are ill-suited for the conventional level-3 implementations in BLIS. The new framework tackles the problem in two ways. First the stripped-down algorithmic loops forgo the packing that is famously performed in the classic code path. That is, the computation is performed by a new family of kernels tailored specifically for operating on the source matrices as-is (unpacked). Second, these new kernels will typically (and in the case of haswell and zen, do in fact) include separate assembly sub-kernels for handling of edge cases, which helps smooth performance when performing problems whose m and n dimension are not naturally multiples of the register blocksizes. In a reference to the sub-framework's purpose of supporting skinny/unpacked level-3 operations, the "sup" operation suffix (e.g. gemmsup) is typically used to denote a separate namespace for related code and kernels. NOTE: Since the sup framework does not perform any packing, it targets row- and column-stored matrices A, B, and C. For now, if any matrix has non-unit strides in both dimensions, the problem is computed by the conventional implementation. - Implemented the default sup handler as a front-end to two variants. 
bli_gemmsup_ref_var2() provides a block-panel variant (in which the 2nd loop around the microkernel iterates over n and the 1st loop iterates over m), while bli_gemmsup_ref_var1() provides a panel-block variant (2nd loop over m and 1st loop over n). However, these variants are not used by default and provided for reference only. Instead, the default sup handler calls _var2m() and _var1n(), which are similar to _var2() and _var1(), respectively, except that they defer to the sup kernel itself to iterate over the m and n dimension, respectively. In other words, these variants rely not on microkernels, but on so-called "millikernels" that iterate along m and k, or n and k. The benefit of using millikernels is a reduction of function call and related (local integer typecast) overhead as well as the ability for the kernel to know which micropanel (A or B) will change during the next iteration of the 1st loop, which allows it to focus its prefetching on that micropanel. (In _var2m()'s millikernel, the upanel of A changes while the same upanel of B is reused. In _var1n()'s, the upanel of B changes while the upanel of A is reused.) - Added a new configure option, --[en|dis]able-sup-handling, which is enabled by default. However, the default thresholds at which the default sup handler is activated are set to zero for each of the m, n, and k dimensions, which effectively disables the implementation. (The default sup handler only accepts the problem if at least one dimension is smaller than or equal to its corresponding threshold. If all dimensions are larger than their thresholds, the problem is rejected by the sup front-end and control is passed back to the conventional implementation, which proceeds normally.) - Added support to the cntx_t structure to track new fields related to the sup framework, most notably: - sup thresholds: the thresholds at which the sup handler is called. 
- sup handlers: the address of the function to call to implement the level-3 skinny/unpacked matrix implementation. - sup blocksizes: the register and cache blocksizes used by the sup implementation (which may be the same or different from those used by the conventional packm-based approach). - sup kernels: the kernels that the handler will use in implementing the sup functionality. - sup kernel prefs: the IO preference of the sup kernels, which may differ from the preferences of the conventional gemm microkernels' IO preferences. - Added a bool_t to the rntm_t structure that indicates whether sup handling should be enabled/disabled. This allows per-call control of whether the sup implementation is used, which is useful for test drivers that wish to switch between the conventional and sup codes without having to link to different copies of BLIS. The corresponding accessor functions for this new bool_t are defined in bli_rntm.h. - Implemented several row-preferential gemmsup kernels in a new directory, kernels/haswell/3/sup. These kernels include two general implementation types--'rd' and 'rv'--for the 6x8 base shape, with two specialized millikernels that embed the 1st loop within the kernel itself. - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference gemmsup microkernels. NOTE: These microkernels, unlike the current crop of conventional (pack-based) microkernels, do not use constant loop bounds. Additionally, their inner loop iterates over the k dimension. - Defined new typedef enums: - stor3_t: captures the effective storage combination of the level-3 problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A special value of BLIS_XXX is used to denote an arbitrary combination which, in practice, means that at least one of the operands is stored according to general stride. - threshid_t: captures each of the three dimension thresholds. 
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create() can be passed "-1, -1" as a lazy request for row storage. (Note that "0, 0" is still accepted as a lazy request for column storage.) - Added support for various instructions to bli_x86_asm_macros.h, including imul, vhaddps/pd, and other instructions related to integer vectors. - Disabled the older small matrix handling code inserted by AMD in bli_gemm_front.c, since the sup framework introduced in this commit is intended to provide a more generalized solution. - Added test/sup directory, which contains standalone performance test drivers, a Makefile, a runme.sh script, and an 'octave' directory containing scripts compatible with GNU Octave. (They also may work with matlab, but if not, they are probably close to working.) - Reinterpret the storage combination string (sc_str) in the various level-3 testsuite modules (e.g. src/test_gemm.c) so that the order of each matrix storage char is "cab" rather than "abc". - Comment updates in level-3 BLAS API wrappers in frame/compat. commit 0d549ceda822833bec192bbf80633599620c15d9 Author: Isuru Fernando Date: Sat Apr 27 22:56:02 2019 +0000 make unix friendly archives on appveyor (#310) commit ca4b33c001f9e959c43b95a9a23f9df5adec7adf Author: Kiran Varaganti Date: Wed Apr 24 15:02:39 2019 +0530 Added compiler option (-mno-avx256-split-unaligned-store) in the file config/zen/make_defs.mk to improve performance of intrinsic codes, this flag ensures compiler generates 256-bit stores for the equivalent intrinsics code. Change-Id: I8f8cd81a3604869df18d38bc42097a04f178d324 commit 945928c650051c04d6900c7f4e9e29cd0e5b299f Merge: 663f6629 74e513eb Author: Field G. Van Zee Date: Wed Apr 17 15:58:56 2019 -0500 Merge branch 'amd' of github.com:flame/blis into amd commit 74e513eb6a6787a925d43cd1500277d54d86ab8f Author: Field G. Van Zee Date: Wed Apr 17 13:34:44 2019 -0500 Support row storage in Eigen gemm test/3 driver. 
Details: - Added preprocessor branches to test/3/test_gemm.c to explicitly support row-stored matrices. Column-stored matrices are also still supported (and is the default for now). (This is mainly residual work leftover from initial integration of Eigen into the test drivers, so if we ever want to test Eigen with row-stored matrices, the code will be ready to use, even if it is not yet integrated into the Makefile in test/3.) commit b5d457fae9bd75c4ca67f7bc7214e527aa248127 Author: Field G. Van Zee Date: Tue Apr 16 12:50:01 2019 -0500 Applied forgotten variable rename from 89a70cc. Details: - Somehow the variable name change (root_file_name -> root_inputname) in flatten-headers.py mentioned in the commit log entry for 89a70cc didn't make it into the actual commit. This commit applies that change. commit 89a70cccf869333147eb2559cdfa5a23dc915824 Author: Field G. Van Zee Date: Thu Apr 11 18:33:08 2019 -0500 GNU-like handling of installation prefix et al. Details: - Changed the default installation prefix from $HOME/lib to /usr/local. - Modified the way configure internally handles the prefix, libdir, includedir, and sharedir (and also added an --exec-prefix option). The defaults to these variables are set as follows: prefix: /usr/local exec_prefix: ${prefix} libdir: ${exec_prefix}/lib includedir: ${prefix}/include sharedir: ${prefix}/share The key change, aside from the addition of exec_prefix and its use to define the default to libdir, is that the variables are substituted into config.mk with quoting that delays evaluation, meaning the substituted values may contain unevaluated references to other variables (namely, ${prefix} and ${exec_prefix}). This more closely follows GNU conventions, including those used by GNU autoconf, and also allows make to override any one of the variables *after* configure has already been run (e.g. during 'make install'). - Updates to build/config.mk.in pursuant to above changes. 
- Updates to output of 'configure --help' pursuant to above changes. - Updated docs/BuildSystem.md to reflect the new default installation prefix, as well as mention EXECPREFIX and SHAREDIR. - Changed the definitions of the UNINSTALL_OLD_* variables in the top-level Makefile to use $(wildcard ...) instead of 'find'. This was motivated by the new way of handling prefix and friends, which leads to the 'find' command being run on /usr/local (by default), which can take a while and almost never yields any benefit (since the user will very rarely use the uninstall-old targets). - Removed periods from the end of descriptive output statements (i.e., non-verbose output) since those statements often end with file or directory paths, which get confusing to read when punctuated by a period. - Trivial change to 'make showconfig' output. - Removed my name from 'configure --help'. (Many have contributed to it over the years.) - In configure script, changed the default state of threading_model variable from 'no' to 'off' to match that of debug_type, where there are similarly more than two valid states. ('no' is still accepted if given via the --enable-debug= option, though it will be standardized to 'off' prior to config.mk being written out.) - Minor variable name change in flatten-headers.py that was intended for 32812ff. - CREDITS file update. commit 9d76688ad90014a11ddc0c2f27253d62806216b1 Author: kdevraje Date: Thu Apr 11 10:22:48 2019 +0530 Fix for single rank crash with HPL application. When computing the offset of the C buffer, integer variables are used for the row and column indices, so the intermediate result overflows and a negative value gets added to the buffer. When the negative value is too large, it indexes the buffer out of range, resulting in a segmentation fault. Although the crash is a result of the dgemm kernel, similar code was added to the sgemm kernel also. Change-Id: I171119b0ec0dfbd8e63f1fcd6609a94384aabd27 commit 32812ff5aba05d34c421fe1024a61f3e2d5e7052 Author: Field G.
Van Zee Date: Tue Apr 9 12:20:19 2019 -0500 Minor bugfix to flatten-headers.py. Details: - Fixed a minor bug in flatten-headers.py whereby the script, upon encountering a #include directive for the root header file, would erroneously recurse and inline the contents of that root header. The script has been modified to avoid recursion into any headers that share the same name as the root-level header that was passed into the script. (Note: this bug didn't actually manifest in BLIS, so it's merely a precaution for usage of flatten-headers.py in other contexts.) commit bec90e0b6aeb3c9b19589c2b700fda2d66f6ccdf Author: Field G. Van Zee Date: Tue Apr 2 17:45:13 2019 -0500 Minor update to docs/HardwareSupport.md document. Details: - Added more details and clarifying language to implications of 1m and the recycling of microkernels between microarchitectures. commit 89cd650e7be01b59aefaa85885a3ea78970351e4 Author: Field G. Van Zee Date: Tue Apr 2 17:23:55 2019 -0500 Use void_fp for function pointers instead of void*. Change void*-typed function pointers to void_fp. - Updated all instances of void* variables that store function pointers to variables of a new type, void_fp. Originally, I wanted to define the type of void_fp as "void (*void_fp)( void )"--that is, a pointer to a function with no return value and no arguments. However, once I did this, I realized that gcc complains with incompatible pointer type (-Wincompatible-pointer-types) warnings every time any such pointer is assigned to its final, type-accurate function pointer type. That is, gcc will silently typecast a void* to another defined function pointer type (e.g. dscalv_ker_ft) during an assignment from the former to the latter, but the same statement will trigger a warning when typecasting from a void_fp type. I suspect an explicit typecast is needed in order to avoid the warning, which I'm not willing to insert at this time.
- Added a typedef to bli_type_defs.h defining void_fp as void*, along with a commented-out version of the aborted definition described above. (Note that POSIX requires that void* and function pointers be interchangeable; it is the C standard that does not provide this guarantee.) - Comment updates to various _oapi.c files. commit ffce3d632b284eb52474036096815ec38ca8dd5f Author: Field G. Van Zee Date: Tue Apr 2 14:40:50 2019 -0500 Renamed armv8a gemm kernel filename. Details: - Renamed kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c to kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c. This follows the naming convention used by other kernel sets, most notably haswell. commit 77867478af02144544b4e7b6df5d54d874f3f93b Author: Isuru Fernando Date: Tue Apr 2 13:33:11 2019 -0500 Use pthreads on MinGW and Cygwin (#307) commit 7bc75882f02ce3470a357950878492e87e688cec Author: Field G. Van Zee Date: Thu Mar 28 17:40:50 2019 -0500 Updated Eigen results in docs/graphs with 3.3.90. Details: - Updated the level-3 performance graphs in docs/graphs with new Eigen results, this time using a development version cloned from their git mirror on March 27, 2019 (version 3.3.90). Performance is improved over 3.3.7, though still noticeably short of BLIS/MKL in most cases. - Very minor updates to docs/Performance.md and matlab scripts in test/3/matlab. commit 20ea7a1217d3833db89a96158c42da2d6e968ed8 Author: Field G. Van Zee Date: Wed Mar 27 18:09:17 2019 -0500 Minor text updates (Eigen) to docs/Performance.md. Details: - Added/updated a few more details, mostly regarding Eigen. commit bfb7e1bc6af468e4ff22f7e27151ea400dcd318a Merge: 044df950 2c85e1dd Author: Field G. Van Zee Date: Wed Mar 27 17:58:19 2019 -0500 Merge branch 'dev' commit 2c85e1dd9d5d84da7228ea4ae6deec56a89b3a8f Author: Field G. Van Zee Date: Wed Mar 27 16:29:51 2019 -0500 Added Eigen results to performance graphs. 
Details: - Updated the Haswell, SkylakeX, and Epyc performance graphs in docs/graphs to report on Eigen implementations, where applicable. Specifically, Eigen implements all level-3 operations sequentially; however, of those operations it only provides multithreaded gemm. Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are omitted. Thanks to Sameer Agarwal for his help configuring and using Eigen. - Updated docs/Performance.md to note the new implementation tested. - CREDITS file update. commit bfac7e385f8061f2e6591de208b0acf852f04580 Author: Field G. Van Zee Date: Wed Mar 27 16:04:48 2019 -0500 Added ability to plot with Eigen in test/3/matlab. Details: - Updated matlab scripts in test/3/matlab to optionally plot/display Eigen performance curves. Whether Eigen is plotted is determined by a new boolean function parameter, with_eigen. - Updated runme.m scratchpad to reflect the latest invocations of the plot_panel_4x5() function (with Eigen plotting enabled). commit 67535317b9411c90de7fa4cb5b0fdb8f61fdcd79 Author: Field G. Van Zee Date: Wed Mar 27 13:32:18 2019 -0500 Fixed mislabeled eigen output from test/3 drivers. Details: - Fixed the Makefile in test/3 so that it no longer incorrectly labels the matlab output variables from Eigen-linked hemm, herk, trmm, and trsm driver output as "vendor". (The gemm drivers were already correctly outputting matlab variables containing the "eigen" label.) commit 044df9506f823643c0cdd53e81ad3c27a9f9d4ff Author: Isuru Fernando Date: Wed Mar 27 12:39:31 2019 -0500 Test with shared on windows (#306) Export macros can't support both shared and static at the same time. When blis is built with both shared and static, headers assume that shared is used at link time and dllimports the symbols with __imp_ prefix. To use the headers with static libraries a user can give -DBLIS_EXPORT= to import the symbol without the __imp_ prefix. commit 5e6b160c8a85e5e23bab0f64958a8acf4918a4ed Author: Field G.
Van Zee Date: Tue Mar 26 19:10:59 2019 -0500 Link to Eigen BLAS for non-gemm drivers in test/3. Details: - Adjusted test/3/Makefile so that the test drivers are linked against Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do this since Eigen's headers don't define implementations to the standard BLAS APIs. - Simplified #included headers in hemm, herk, trmm, and trsm source driver files, since nothing specific to Eigen is needed at compile-time for those operations. commit e593221383aae19dfdc3f30539de80ed05cfec7f Merge: 92fb9c87 c208b9dc Author: Field G. Van Zee Date: Tue Mar 26 15:51:45 2019 -0500 Merge branch 'master' into dev commit 92fb9c87bf88b9f9c401eeecd9aa9c3521bc2adb Author: Field G. Van Zee Date: Tue Mar 26 15:43:23 2019 -0500 Add more support for Eigen to drivers in test/3. Details: - Use compile-time implementations of Eigen in test_gemm.c via new EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS library is not necessary.) However, as of Eigen 3.3.7, Eigen only parallelizes the gemm operation and not hemm, herk, trmm, trsm, or any other level-3 operation. - Fixed a bug in trmm and trsm drivers whereby the wrong function (bli_does_trans()) was being called to determine whether the object for matrix A should be created for a left- or right-side case. This was corrected by changing the function to bli_is_left(), as is done in the hemm driver. - Added support for running Eigen test drivers from runme.sh. commit c208b9dc46852c877197d53b6dd913a046b6ebb6 Author: Isuru Fernando Date: Mon Mar 25 13:03:44 2019 -0500 Fix clang version detection (#305) clang -dumpversion gives 4.2.1 for all clang versions as clang was originally compatible with gcc 4.2.1. Apple clang version and clang version are two different things and the real clang version cannot be deduced from apple clang version programmatically.
Rely on wikipedia to map apple clang to clang version. Also fixes assembly detection with clang. clang 3.8 can't build knl as it doesn't recognize zmm0. commit 53842c7e7d530cb2d5609d6d124ae350fc345c32 Author: Kiran Varaganti Date: Fri Mar 22 13:57:14 2019 +0530 Removed printing alpha and beta values Change-Id: I49102db510311a30f6a936f9d843f35838f50d23 commit 6805db45e343d83d1adaf9157cf0b841653e9ede Author: Kiran Varaganti Date: Fri Mar 22 12:55:35 2019 +0530 Corrected setting alpha & beta values - alpha = -1 and beta = 1 - bli_setc(-1.0, 0, &alpha) should be used rather than bli_setc(0.0, -1.0, &alpha). This is corrected now. Change-Id: Ic1102dfd6b50ccf212386a1211c6f31e8d987ef9 commit feefcab4427a75b0b55af215486b85abcda314f7 Author: Field G. Van Zee Date: Thu Mar 21 18:11:20 2019 -0500 Allow disabling of BLAS prototypes at compile-time. Details: - Modified bli_blas.h so that: - By default, if the BLAS layer is enabled at configure-time, BLAS prototypes are also enabled within blis.h; - But if the user #defines BLIS_DISABLE_BLAS_DEFS prior to including blis.h, BLAS prototypes are skipped over entirely so that, for example, the application or some other header pulled in by the application may prototype the BLAS functions without causing any duplication. - Updated docs/BuildSystem.md to document the feature above, and related text. commit 20153cd4b594bc34f860c381ec18de3a6cc743c7 Author: Kiran Varaganti Date: Thu Mar 21 16:23:53 2019 +0530 Modified test_gemm.c file in test folder. A macro 'FILE_IN_OUT' is defined to read input parameters from a csv file. Format for input file: Each line defines a gemm problem with following parameters: m k n cs_a cs_b cs_c. The operation always implemented is C = C - A*B and column-major format. When macro is disabled - it reverts back to original implementation.
Usage: ./test_gemm_.x input.csv output.csv. GEMM is called through BLAS interface. For BLIS - the test application also prints either 'S' indicating small gemm routine or 'N' - conventional BLIS gemm. For MKL/OpenBLAS - ignore this character. Change-Id: I0924ef2c1f7bdea48d4cdb230b888e2af2c86a36 commit 288843b06d91e1b4fade337959aef773090bd1c9 Author: Field G. Van Zee Date: Wed Mar 20 17:52:23 2019 -0500 Added Eigen support to test/3 Makefile, runme.sh. Details: - Added targets to test/3/Makefile that link against a BLAS library built by Eigen. It appears, however, that Eigen's BLAS library does not support multithreading. (It may be that multithreading is only available when using the native C++ APIs.) - Updated runme.sh with a few Eigen-related tweaks. - Minor tweaks to docs/Performance.md. commit 153e0be21d9ff413e370511b68d553dd02abada9 Author: Field G. Van Zee Date: Tue Mar 19 17:53:18 2019 -0500 More minor tweaks to docs/Performance.md. Details: - Defined GFLOPS as billions of floating-point operations per second, and reworded the sentence after about normalization. commit 05c4e42642cc0c8dbfa94a6c21e975ac30c0517a Author: Field G. Van Zee Date: Tue Mar 19 17:07:20 2019 -0500 CHANGELOG update (0.5.2) commit 9204cd0cb0cc27790b8b5a2deb0233acd9edeb9b (tag: 0.5.2) Author: Field G. Van Zee Date: Tue Mar 19 17:07:18 2019 -0500 Version file update (0.5.2) commit 64560cd9248ebf4c02c4a1eeef958e1ca434e510 Author: Field G. Van Zee Date: Tue Mar 19 17:04:20 2019 -0500 ReleaseNotes.md update in advance of next version. Details: - Updated ReleaseNotes.md in preparation for next version. commit ab5ad557ea69479d487c9a3cb516f43fa1089863 Author: Field G. Van Zee Date: Tue Mar 19 16:50:41 2019 -0500 Very minor tweaks to Performance.md. commit 03c4a25e1aa8a6c21abbb789baa599ac419c3641 Author: Field G. Van Zee Date: Tue Mar 19 16:47:15 2019 -0500 Minor fixes to docs/Performance.md.
Details: - Fixed some incorrect labels associated with the pdf/png graphs, apparently the result of copy-pasting. commit fe6dd8b132f39ecb8893d54cd8e75d4bbf6dab83 Author: Field G. Van Zee Date: Tue Mar 19 16:30:23 2019 -0500 Fixed broken section links in docs/Performance.md. Details: - Fixed a few broken section links in the Contents section. commit 913cf97653f5f9a40aa89a5b79e2b0a8882dd509 Author: Field G. Van Zee Date: Tue Mar 19 16:15:24 2019 -0500 Added docs/Performance.md and docs/graphs subdir. Details: - Added a new markdown document, docs/Performance.md, which reports performance of a representative set of level-3 operations across a variety of hardware architectures, comparing BLIS to OpenBLAS and a vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs, in pdf and png formats, reside in docs/graphs. - Updated README.md to link to new Performance.md document. - Minor updates to CREDITS, docs/Multithreading.md. - Minor updates to matlab scripts in test/3/matlab. commit 9945ef24fd758396b698b19bb4e23e53b9d95725 Author: Field G. Van Zee Date: Tue Mar 19 15:28:44 2019 -0500 Adjusted cache blocksizes for zen subconfig. Details: - Adjusted the zen sub-configuration's cache blocksizes for float, scomplex, and dcomplex based on the existing values for double. (The previous values were taken directly from the haswell subconfig, which targets Intel Haswell/Broadwell/Skylake systems.) commit d202d008d51251609d08d3c278bb6f4ca9caf8e4 Author: Field G. Van Zee Date: Mon Mar 18 18:18:25 2019 -0500 Renamed --enable-export-all to --export-shared=[]. Details: - Replaced the existing --enable-export-all / --disable-export-all configure option with --export-shared=[public|all], with the 'public' instance of the latter corresponding to --disable-export-all and the 'all' instance corresponding to --enable-export-all. Nothing else semantically about the option, or its default, has changed. commit ff78089870f714663026a7136e696603b5259560 Author: Field G. 
Van Zee Date: Mon Mar 18 13:22:55 2019 -0500 Updates to docs/Multithreading.md. Details: - Made extra explicit the fact that: (a) multithreading in BLIS is disabled by default; and (b) even with multithreading enabled, the user must specify multithreading at runtime in order to observe parallelism. Thanks to M. Zhou for suggesting these clarifications in #292. - Also made explicit that only the environment variable and global runtime API methods are available when using the BLAS API. If the user wishes to use the local runtime API (specify multithreading on a per-call basis), one of the native BLIS APIs must be used. commit 3a929a3d0ba0353159a6d4cd188f01b7a390ccfc Author: Kiran Varaganti Date: Mon Mar 18 10:51:41 2019 +0530 Fixed code merging: bli_gemm_small.c - missed conditional checks for L!=0 && K!=0. Now they are added. This fix is done to pass blastest Change-Id: Idc9c9a04d2015a68a19553c437ecaf8f1584026c commit 663f662932c3f182fefc3c77daa1bf8c3394bb8b Merge: 938c05ef 6bfe3812 Author: Field G. Van Zee Date: Sat Mar 16 16:17:12 2019 -0500 Merge branch 'amd' of github.com:flame/blis into amd commit 938c05ef8654e2fc013d39a57f51d91d40cc40fb Merge: 4ed39c09 5a5f494e Author: Field G. Van Zee Date: Sat Mar 16 16:01:43 2019 -0500 Merge branch 'amd' of github.com:flame/blis into amd commit 6bfe3812e29b86c95b828822e4e5473b48891167 Author: Field G. Van Zee Date: Fri Mar 15 13:57:49 2019 -0500 Use -fvisibility=[...] with clang on Linux/BSD/OSX. Details: - Modified common.mk to use the -fvisibility=[hidden|default] option when compiling with clang on non-Windows platforms (Linux, BSD, OS X, etc.). Thanks to Isuru Fernando for pointing out this option works with clang on these OSes. commit 809395649c5bbf48778ede4c03c1df705dd49566 Author: Field G. Van Zee Date: Wed Mar 13 18:21:35 2019 -0500 Annotated additional symbols for export. Details: - Added export annotations to additional function prototypes in order to accommodate the testsuite. 
- Disabled calling bli_amaxv_check() from within the testsuite's test_amaxv.c. commit e095926c643fd9c9c2220ebecd749caae0f71d42 Author: Field G. Van Zee Date: Wed Mar 13 17:35:18 2019 -0500 Support shared lib export of only public symbols. Details: - Introduced a new configure option, --enable-export-all, which will cause all shared library symbols to be exported by default, or, alternatively, --disable-export-all, which will cause all symbols to be hidden by default, with only those symbols that are annotated for visibility, via BLIS_EXPORT_BLIS (and BLIS_EXPORT_BLAS for BLAS symbols), to be exported. The default for this configure option is --disable-export-all. Thanks to Isuru Fernando for consulting on this commit. - Removed BLIS_EXPORT_BLIS annotations from frame/1m/bli_l1m_unb_var1.h, which was intended for 5a5f494. - Relocated BLIS_EXPORT-related cpp logic from bli_config.h.in to frame/include/bli_config_macro_defs.h. - Provided appropriate logic within common.mk to implement variable symbol visibility for gcc, clang, and icc (to the extent that each of these compilers allows). - Relocated --help text associated with debug option (-d) to configure slightly further down in the list. commit 5a5f494e428372c7c27ed1f14802e15a83221e87 Author: Field G. Van Zee Date: Tue Mar 12 18:45:09 2019 -0500 Removed export macros from all internal prototypes. Details: - After merging PR #303, at Isuru's request, I removed the use of BLIS_EXPORT_BLIS from all function prototypes *except* those that we potentially wish to be exported in shared/dynamic libraries. In other words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of functions that can be considered private or for internal use only. This is likely the last big modification along the path towards implementing the functionality spelled out in issue #248.
Thanks again to Isuru Fernando for his initial efforts of sprinkling the export macros throughout BLIS, which made removing them where necessary relatively painless. Also, I'd like to thank Tony Kelman, Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for participating in the initial discussion in issue #37 that was later summarized and restated in issue #248. - CREDITS file update. commit 3dc18920b6226026406f1d2a8b2c2b405a2649d5 Merge: b938c16b 766769ee Author: Field G. Van Zee Date: Tue Mar 12 11:20:25 2019 -0500 Merge branch 'master' into dev commit 766769eeb944bd28641a6f72c49a734da20da755 Author: Isuru Fernando Date: Mon Mar 11 19:05:32 2019 -0500 Export functions without def file (#303) * Revert "restore bli_extern_defs exporting for now" This reverts commit 09fb07c350b2acee17645e8e9e1b8d829c73dca8. * Remove symbols not intended to be public * No need of def file anymore * Fix whitespace * No need of configure option * Remove export macro from definitions * Remove blas export macro from definitions commit 4ed39c0971c7917e2675cf5449f563b1f4751ccc Merge: 540ec1b4 b938c16b Author: Field G. Van Zee Date: Fri Mar 8 11:56:58 2019 -0600 Merge branch 'amd' of github.com:flame/blis into amd commit b938c16b0c9e839335ac2c14944b82890143d02f Author: Field G. Van Zee Date: Thu Mar 7 16:40:39 2019 -0600 Renamed test/3m4m to test/3. Details: - Renamed '3m4m' directory to '3', which captures the directory nicely since it builds test drivers to test level-3 operations. - These test drivers ceased to be used to test the 3m and 4m (or even 1m) induced methods long ago, hence the name change. commit ab89a40582ec7acf802e59b0763bed099a02edd8 Author: Field G. Van Zee Date: Thu Mar 7 16:26:12 2019 -0600 More minor updates and edits to test/3m4m. Details: - Further updates to matlab scripts, mostly for compatibility with GNU Octave. - More tweaks to runme.sh. - Updates to runme.m that allow copy-paste into matlab interactive session to generate graphs. 
commit f0e70dfbf3fee4c4e382c2c4e87c25454cbc79a1 Author: Field G. Van Zee Date: Thu Mar 7 01:04:05 2019 +0000 Very minor updates to test/3m4m for ul252. Details: - Very minor updates to the newly revamped test/3m4m drivers when used on a Xeon Platinum (SkylakeX). commit 7fe44748383071f1cbbc77d904f4ae5538e13065 Author: Kiran Varaganti Date: Wed Mar 6 16:23:31 2019 +0530 Disabled BLIS_ENABLE_ZEN_BLOCK_SIZES in bli_family_zen.h for ROME tuning Change-Id: Iec47fcf51f4d4396afef1ce3958e58cf02c59a57 commit 9f1dbe572b1fd5e7dd30d5649bdf59259ad770d5 Author: Field G. Van Zee Date: Tue Mar 5 17:47:55 2019 -0600 Overhauled test/3m4m Makefile and scripts. Details: - Rewrote much of Makefile to generate executables for single- and dual- socket multithreading as well as single-threaded. Each of the three can also use a different problem size range/increment, as is often appropriate when doubling/halving the number of threads. - Rewrote runme.sh script to flexibly execute as many threading parameter scenarios as is given in the input parameter string (currently set within the script itself). The string also encodes the maximum problem size for each threading scenario, which is used to identify the executable to run. Also improved the "progress" output of the script to reduce redundant info and improve readability in terminals that are not especially wide. - Minor updates to test_*.c source files. - Updated matlab scripts according to changes made to the Makefile, test drivers, and runme.sh script, and renamed 'plot_all.m' to 'runme.m'. commit f5ed95ecd7d5eb4a63e1333ad5cc6765fc8df9fe Author: Kiran Varaganti Date: Tue Mar 5 15:01:57 2019 +0530 Merged BLIS Release 1.3 Modified config/zen/make_defs.mk, now CKVECFLAGS := -mavx2 -mfpmath=sse -mfma -march=znver1 Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81 commit 3bdab823fa93342895bf45d812439324a37db77c Merge: 70f12f20 e2a02ebd Author: Field G. 
Van Zee Date: Thu Feb 28 14:07:24 2019 -0600 Merge branch 'master' into dev commit e2a02ebd005503c63138d48a2b7d18978ee29205 Author: Field G. Van Zee Date: Thu Feb 28 13:58:59 2019 -0600 Updates (from ls5) to test/3m4m/runme.sh. Details: - Lonestar5-specific updates to runme.sh. commit f0dcc8944fa379d53770f5cae5d670140918f00c Author: Isuru Fernando Date: Wed Feb 27 17:27:23 2019 -0600 Add symbol export macro for all functions (#302) * initial export of blis functions * Regenerate def file for master * restore bli_extern_defs exporting for now commit 540ec1b479712d5e1da637a718927249c15d867f Author: Field G. Van Zee Date: Sun Feb 24 19:09:10 2019 -0600 Updated level-3 BLAS to call object API directly. Details: - Updated the BLAS compatibility layer for level-3 operations so that the corresponding BLIS object API is called directly rather than first calling the typed BLIS API. The previous code based on the typed BLIS API calls is still available in a deactivated cpp macro branch, which may be re-activated by #defining BLIS_BLAS3_CALLS_TAPI. (This does not yet correspond to a configure option. If it seems like people might want to toggle this behavior more regularly, a configure option can be added in the future.) - Updated the BLIS typed API to statically "pre-initialize" objects via new initializor macros. Initialization is then finished via calls to static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(), which are similar to the previously-called functions, bli_obj_create_1x1_with_attached_buffer() and bli_obj_create_with_attached_buffer(), respectively. (The BLAS compatibility layer updates mentioned above employ this new technique as well.) - Transformed certain routines in bli_param_map.c--specifically, the ones that convert netlib-style parameters to BLIS equivalents--into static functions, now in bli_param_map.h. (The remaining three classes of conversion routines were left unchanged.)
- Added the aforementioned pre-initializor macros to bli_type_defs.h. - Relocated bli_obj_init_const() and bli_obj_init_constdata() from bli_obj_macro_defs.h to bli_type_defs.h. - Added a few macros to bli_param_macro_defs.h for testing domains for real/complexness and precisions for single/double-ness. commit 8e023bc914e9b4ac1f13614feb360b105fbe44d2 Author: Field G. Van Zee Date: Fri Feb 22 16:55:30 2019 -0600 Updates to 3m4m/matlab scripts. Details: - Minor updates to matlab graph-generating scripts. - Added a plot_all.m script that is more of a scratchpad for copying and pasting function invocations into matlab to generate plots that are presently of interest to us. commit b06244d98cc468346eb1a8eb931bc05f35ff280c Merge: e938ff08 4c7e6680 Author: praveeng Date: Thu Feb 21 12:56:15 2019 +0530 Merge branch 'ut-austin-amd' of ssh://git.amd.com:29418/cpulibraries/er/blis into ut-austin-amd commit e938ff08cea3d108c84524eb129d9e89d701ea90 Author: praveeng Date: Thu Feb 21 12:44:38 2019 +0530 deleted test.txt Change-Id: I3871f5fe76e548bc29ec2733745b29964e829dd3 commit ed13ad465dcba350ad3d5e16c9cc7542e33f3760 Author: mkv Date: Thu Feb 21 01:04:16 2019 -0500 added test file for initial commit commit 4c7e6680832b497468cf50c2399e3ac4de0e3450 Author: praveeng Date: Thu Feb 21 12:44:38 2019 +0530 deleted test.txt Change-Id: I3871f5fe76e548bc29ec2733745b29964e829dd3 commit 95e070581c54ed2edc211874faec56055ea298c8 Author: mkv Date: Thu Feb 21 01:04:16 2019 -0500 added test file for initial commit commit 70f12f209bc1901b5205902503707134cf2991a0 Author: Field G. Van Zee Date: Wed Feb 20 16:10:10 2019 -0600 Changed unsafe-loop to unsafe-math optimizations. Details: - Changed -funsafe-loop-optimizations (re-)introduced in 7690855 for make_defs.mk files' CRVECFLAGS to -funsafe-math-optimizations (to account for a miscommunication in issue #300). Thanks to Dave Love for this suggestion and Jeff Hammond for his feedback on the topic. 
commit 7690855c5106a56e5b341a350f8db1c78caacd89 Author: Field G. Van Zee Date: Mon Feb 18 19:16:01 2019 -0600 Restored -funsafe-loop-optimizations to subconfigs. Details: - Restored use of -funsafe-loop-optimizations in the definitions of CRVECFLAGS (when using gcc), but only for sub-configurations (and not configuration families such as amd64, intel64, and x86_64). This more or less reverts 5190d05 and 6cf1550. commit 44994d1490897b08cde52a615a2e37ddae8b2061 Author: Field G. Van Zee Date: Mon Feb 18 18:35:30 2019 -0600 Disable TBM, XOP, LWP instructions in AMD configs. Details: - Added -mno-tbm -mno-xop -mno-lwp to CKVECFLAGS in bulldozer, piledriver, steamroller, and excavator configurations to explicitly disable AMD's bulldozer-era TBM, XOP, and LWP instruction sets in an attempt to fix the invalid instruction error that has plagued Travis CI builds since 6a014a3. Thanks to Devin Matthews for pointing out that the offending instruction was part of TBM (issue #300). - Restored -O3 to piledriver configuration's COPTFLAGS. commit 1e5b530744c1906140d47f43c5cad235eaa619cf Author: Field G. Van Zee Date: Mon Feb 18 18:04:38 2019 -0600 Reverted piledriver COPTFLAGS from -O3 to -O2. Details: - Debugging continues; changing COPTFLAGS for piledriver subconfig from -O3 to -O2, its original value prior to 6a014a3. commit 6cf155049168652c512aefdd16d74e7ff39b98df Author: Field G. Van Zee Date: Mon Feb 18 17:29:51 2019 -0600 Removed -funsafe-loop-optimizations from all configs. Details: - Error persists. Removed -funsafe-loop-optimizations from all remaining sub-configurations. commit 5190d05a27c5fa4c7942e20094f76eb9a9785c3e Author: Field G. Van Zee Date: Mon Feb 18 17:07:35 2019 -0600 Removed -funsafe-loop-optimizations from piledriver. Details: - Error persists; continuing debugging from bf0fb78c by removing -funsafe-loop-optimizations from piledriver configuration. commit bf0fb78c5e575372060d22f5ceeb5b332e8978ec Author: Field G. 
Van Zee Date: Mon Feb 18 16:51:38 2019 -0600 Removed -funsafe-loop-optimizations from families. Details: - Removed -funsafe-loop-optimizations from the configuration families affected by 6a014a3, specifically: intel64, amd64, and x86_64. This is part of an attempt to debug why the sde, as executed by Travis CI, is crashing via the following error: TID 0 SDE-ERROR: Executed instruction not valid for specified chip (ICELAKE): 0x9172a5: bextr_xop rax, rcx, 0x103 commit 6a014a3377a2e829dbc294b814ca257a2bfcb763 Author: Field G. Van Zee Date: Mon Feb 18 14:52:29 2019 -0600 Standardized optimization flags in make_defs.mk. Details: - Per Dave Love's recommendation in issue #300, this commit defines COPTFLAGS := -O3 and CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations in the make_defs.mk for all Intel- and AMD-based configurations. commit 565fa3853b381051ac92cff764625909d105644d Author: Field G. Van Zee Date: Mon Feb 18 11:43:58 2019 -0600 Redirect trsm pc, ir parallelism to ic, jr loops. Details: - trsm parallelization was temporarily simplified in 075143d to entirely ignore any parallelism specified via the pc or ir loops. Now, any parallelism specified to the pc loop will be redirected to the ic loop, and any parallelism specified to the ir loop will be redirected to the jr loop. (Note that because of inter-iteration dependencies, trsm cannot parallelize the ir loop. Parallelism via the pc loop is at least somewhat feasible in theory, but it would require tracking dependencies between blocks--something for which BLIS currently lacks the necessary supporting infrastructure.) commit a023c643f25222593f4c98c2166212561d030621 Author: Field G. Van Zee Date: Thu Feb 14 20:18:55 2019 -0600 Regenerated symbols in build/libblis-symbols.def. Details: - Reran ./build/regen-symbols.sh after running 'configure --enable-cblas auto' commit 075143dfd92194647da9022c1a58511b20fc11f3 Author: Field G.
Van Zee Date: Thu Feb 14 18:52:45 2019 -0600 Added support for IC loop parallelism to trsm. Details: - Parallelism within the IC loop (3rd loop around the microkernel) is now supported within the trsm operation. This is done via a new branch on each of the control and thread trees, which guide execution of a new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm subproblem corresponds to the macrokernel computation on only the block of A that contains the diagonal (labeled as A11 in algorithms with FLAME-like partitioning), and the corresponding row panel of C. During the trsm subproblem, all threads within the JC communicator participate and parallelize along the JR loop, including any parallelism that was specified for the IC loop. (IR loop parallelism is not supported for trsm due to inter-iteration dependencies.) After this trsm subproblem is complete, a barrier synchronizes all participating threads and then they proceed to apply the prescribed BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT parallelism specified within) to the remaining gemm subproblem (the rank-k update that is performed using the newly updated row-panel of B). Thus, trsm now supports JC, IC, and JR loop parallelism. - Modified bli_trsm_l_cntl_create() to create the new "prenode" branch of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for now, since it is not currently used. (All trsm problems are cast in terms of left-side trsm.) - Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t subnode is only recursed upon if there existed a corresponding thrinfo_t node, which may not always exist (for problems too small to employ full parallelization due to the minimum granularity imposed by micropanels). - Updated other functions in frame/base/bli_cntl.c, such as bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes if they exist. 
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes when they exist, and added support for growing a prenode branch to bli_thrinfo_grow() via a corresponding set of help functions named with the _prenode() suffix. - Added a bszid_t field to thrinfo_t nodes. This field comes in handy when debugging the allocation/release of thrinfo_t nodes, as it helps trace the "identity" of each node as it is created/destroyed. - Renamed bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths() and created a separate bli_l3_thrinfo_print_trsm_paths() function to print out the newly reconfigured thrinfo_t trees for the trsm operation. - Trivial changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c regarding variable declarations. - Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B, BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels (semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which represent the subpartition ahead of and behind, respectively, BLIS_SUBPART1. Updated check functions in bli_check.c accordingly. - Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and bli_acquire_mpart_t2b/b2t(), _l2r/r2l(). - Deprecated old functions in frame/3/bli_l3_thrinfo.c. commit 78bc0bc8b6b528c79b11f81ea19250a1db7450ed Author: Nicholai Tukanov Date: Thu Feb 14 13:29:02 2019 -0600 Power9 sub-configuration (#298) Formally registered power9 sub-configuration. Details: - Added and registered power9 sub-configuration into the build system. Thanks to Nicholai Tukanov and Devangi Parikh for these contributions. - Note: The sub-configuration does not yet have a corresponding architecture-specific kernel set registered, and so for now the sub-config is using the generic kernel set. commit 6b832731261f9e7ad003a9ea4682e9ca973ef844 Author: Field G. Van Zee Date: Tue Feb 12 16:01:28 2019 -0600 Generalized ref kernels' pragma omp simd usage.
Details: - Replaced direct usage of _Pragma( "omp simd" ) in reference kernels with PRAGMA_SIMD, which is defined as a function of the compiler being used in a new bli_pragma_macro_defs.h file. That definition is cleared when BLIS detects that the -fopenmp-simd command line option is unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions that guided this commit. - Updated configure and bli_config.h.in so that the appropriate anchor is substituted in (when the corresponding pragma omp simd support is present). commit b1f5ce8622b682b79f956fed83f04a60daa8e0fc Author: Field G. Van Zee Date: Tue Feb 5 17:38:50 2019 -0600 Minor updates to scripts in test/mixeddt/matlab. commit 38203ecd15b1fa50897d733daeac6850d254e581 Author: Devangi N. Parikh Date: Mon Feb 4 15:28:28 2019 -0500 Added thunderx2 system in the mixeddt test scripts Details: - Added thunderx2 (tx2) as a system in the runme.sh in test/mixeddt commit dfc91843ea52297bf636147793029a0c1345be04 Author: Devangi N. Parikh Date: Mon Feb 4 15:23:40 2019 -0500 Fixed gcc flags for thunderx2 subconfiguration Details: - Fixed -march flag. Thunderx2 is an armv8.1a architecture not armv8a. commit c665eb9b888ec7e41bd0a28c4c8ac4094d0a01b5 Author: Field G. Van Zee Date: Mon Jan 28 16:22:23 2019 -0600 Minor updates to docs, Makefiles. Details: - Changed all occurrences of micro-kernel -> microkernel macro-kernel -> macrokernel micro-panel -> micropanel in all markdown documents in 'docs' directory. This change is being made since we've reached the point in adoption and acceptance of BLIS's insights where words such as "microkernel" are no longer new, and therefore now merit being unhyphenated. - Updated "Implementation Notes" sections of KernelsHowTo.md, which still contained references to nonexistent cpp macros such as BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?. - Added 'run-fast' and 'check-fast' targets to testsuite/Makefile.
- Minor updates to Testsuite.md, including suggesting use of 'make check' and 'make check-fast' when running from the local testsuite directory. - Added a comment to top-level Makefile explaining the purpose behind the TESTSUITE_WRAPPER variable, which at first glance appears to serve no purpose. commit 1aa280d0520ed5eaea3b119b4e92b789ecad78a4 Author: M. Zhou <5723047+cdluminate@users.noreply.github.com> Date: Sun Jan 27 21:40:48 2019 +0000 Amend OS detection for kFreeBSD. (#295) commit fffc23bb35d117a433886eb52ee684ff5cf6997f Author: Field G. Van Zee Date: Fri Jan 25 13:35:31 2019 -0600 CREDITS file update. commit 26c5cf495ce22521af5a36a1012491213d5a4551 Author: Field G. Van Zee Date: Thu Jan 24 18:49:31 2019 -0600 Fixed bug in skx subconfig related to bdd46f9. Details: - Fixed code in the skx subconfiguration that became a bug after committing bdd46f9. Specifically, the bli_cntx_init_skx() function was overwriting default blocksizes for the scomplex and dcomplex microkernels despite the fact that only single and double real microkernels were being registered. This was not a problem prior to bdd46f9 since all microkernels used dynamically-queried (at runtime) register blocksizes for loop bounds. However, post-bdd46f9, this became a bug because the reference ukernels for scomplex and dcomplex were written with their register blocksizes hard-coded as constant loop bounds, which conflicted with the erroneous scomplex and dcomplex values that bli_cntx_init_skx() was setting in the context. The lesson here is that going forward, all subconfigurations must not set any blocksizes for datatypes corresponding to default/reference microkernels. (Note that a blocksize is left unchanged by the bli_cntx_set_blkszs() function if it was set to -1.) commit 180f8e42e167b83a757340ad4bd4a5c7a1d6437b Author: Field G. Van Zee Date: Thu Jan 24 18:01:15 2019 -0600 Fixed undefined behavior trsm ukr bug in bdd46f9.
Details: - Fixed a bug that manifested anytime a configuration was used in which optimized microkernels were registered and the trsm operation (or kernel) was invoked. The bug resulted from the optimized microkernels' register blocksizes conflicting with the hard-coded values--expressed in the form of constant loop bounds--used in the new reference trsm ukernels that were introduced in bdd46f9. The fix was easy: reverting back to the implementation that uses variable-bound loops, which amounted to changing an #if 0 to #if 1 (since I preserved the older implementation in the file alongside the new code based on constant-bound loops). It should be noted that this fix must be permanent, since the trsm kernel code with constant-bound loops can never work with gemm ukernels that use different register blocksizes. commit bdd46f9ee88057d52610161966a11c224e5a026c Author: Field G. Van Zee Date: Thu Jan 24 17:23:18 2019 -0600 Rewrote reference kernels to use #pragma omp simd. Details: - Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified indexing annotated by the #pragma omp simd directive, which a compiler can use to vectorize certain constant-bounded loops. (The new kernels actually use _Pragma("omp simd") since the kernels are defined via templatizing macros.) Modest speedup was observed in most cases using gcc 5.4.0, which may improve with newer versions. Thanks to Devin Matthews for suggesting this via issue #286 and #259. - Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex, respectively, with a default row preference for the gemm ukernel. Also updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4, respectively, for all datatypes. - Modified configure to verify that -fopenmp-simd is a valid compiler option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to which compiler is detected (via macros such as __GNUC__). These prefetch macros are not yet employed anywhere, though. - Updated the year in copyrights of template license headers in build/templates and removed AMD as a default copyright holder. commit 63de2b0090829677755eb5cdb27e73bc738da32d Author: Field G. Van Zee Date: Wed Jan 23 12:16:27 2019 -0600 Prevent redef of ftnlen in blastest f2c_types.h. Details: - Guard typedef of ftnlen in f2c_types.h with a #ifndef HAVE_BLIS_H directive to prevent the redefinition of that type. Thanks to Jeff Diamond for reporting this compiler warning (and apologies for the delay in committing a fix). commit eec2e183a7b7d67702dbd1f39c153f38148b2446 Author: Field G. Van Zee Date: Mon Jan 21 12:12:18 2019 -0600 Added escaping to '/' in os_name in configure. Details: - Add os_name to the list of variables into which the '/' character is escaped. This is meant to address (or at least make progress toward addressing) #293. Thanks to Isuru Fernando for spotting this as the potential fix, and also thanks to M. Zhou for the original report. commit adf5c17f0839fdbc1f4a1780f637928b1e78e389 Author: Field G. Van Zee Date: Fri Jan 18 15:14:45 2019 -0600 Formally registered thunderx2 subconfiguration. Details: - Added a separate subconfiguration for thunderx2, which now uses different optimization flags than cortexa57/cortexa53. commit 094cfdf7df6c2764c25fcbfce686ba29b933942c Author: M. Zhou <5723047+cdluminate@users.noreply.github.com> Date: Fri Jan 18 18:46:13 2019 +0000 Port BLIS to GNU Hurd OS. (#294) Prevent blis.h from misidentifying Hurd as OSX. commit 5d7d616e8e591c2f3c7c2d73220eb27ea484f9c9 Author: Field G. Van Zee Date: Tue Jan 15 20:52:51 2019 -0600 README.md update re: mixeddt TOMS paper. commit 58c7fb4788177487f73a3964b7a910fe4dc75941 Author: Field G. Van Zee Date: Tue Jan 8 17:00:27 2019 -0600 Added more matlab scripts for mixeddt paper. 
Details: - Added a variant set of matlab scripts geared to producing plots that reflect performance data gathered with and without extra memory optimizations enabled. These scripts reside (for now) in test/mixeddt/matlab/wawoxmem. commit 34286eb914b48b56cdda4dfce192608b9f86d053 Author: Field G. Van Zee Date: Tue Jan 8 11:41:20 2019 -0600 Minor update to docs/HardwareSupport.md. commit 108b04dc5b1b1288db95f24088d1e40407d7bc88 Author: Field G. Van Zee Date: Mon Jan 7 20:16:31 2019 -0600 Regenerated symbols in build/libblis-symbols.def. Details: - Reran ./build/regen-symbols.sh after running 'configure --enable-cblas auto' to reflect removal of bli_malloc_pool() and bli_free_pool(). commit 706cbd9d5622f4690e6332a89cf41ab5c8771899 Author: Field G. Van Zee Date: Mon Jan 7 18:28:19 2019 -0600 Minor tweaks/cleanups to bli_malloc.c, _apool.c. Details: - Removed malloc_ft and free_ft function pointer arguments from the interface to bli_apool_init() after deciding that there is no need to specify the malloc()/free() for blocks within the apool. (The apool blocks are actually just array_t structs.) Instead, we simply call bli_malloc_intl()/_free_intl() directly. This has the added benefit of allowing additional output when memory tracing is enabled via --enable-mem-tracing. Also made corresponding changes elsewhere in the apool API. - Changed the inner pools (elements of the array_t within the apool_t) to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL and BLIS_FREE_INTL. - Disabled definitions of bli_malloc_pool() and bli_free_pool() since there are no longer any consumers of these functions. - Very minor comment / printf() updates. 
commit 579145039d945adbcad1177b1d53fb2d3f2e6573 Author: Minh Quan Ho <1337056+hominhquan@users.noreply.github.com> Date: Mon Jan 7 23:00:15 2019 +0100 Initialize error messages at compile time (#289) * Initialize error messages at compile time - Assigning strings directly to the bli_error_string array, instead of snprintf() at execution-time. * Retired bli_error_init(), _finalize(). Details: - Removed functions obviated by changes in 80e8dc6: bli_error_init(), bli_error_finalize(), and bli_error_init_msgs(), as well as calls to the former two in bli_init.c. * Regenerated symbols in build/libblis-symbols.def. Details: - Reran ./build/regen-symbols.sh after running 'configure --enable-cblas auto'. commit aafbca086e36b6727d7be67e21fef5bd9ff7bfd9 Author: Field G. Van Zee Date: Mon Jan 7 12:38:21 2019 -0600 Updated external package language in README.md. Details: - Updated/added comments about Fedora, OpenSUSE, and GNU Guix under the newly-renamed "External GNU/Linux packages" section. Thanks to Dave Love for providing these revisions. commit daacfe68404c9cc8078e5e7ba49a8c7d93e8cda3 Author: Field G. Van Zee Date: Mon Jan 7 12:12:47 2019 -0600 Allow running configure with python 3.4. Details: - Relax version blacklisting of python3 to allow 3.4 or later instead of 3.5 or later. Thanks to Dave Love for pointing out that 3.4 was sufficient for the purpose of BLIS's build system. (It should be noted that we're not sure which, if any, python3 versions prior to 3.4 are insufficient, and that the only thing stopping us from determining this is the fact that these earlier versions of python3 are not readily available for us to test with.) - Updated docs/BuildSystem.md to be explicit about current python2 vs python3 version requirements. 
commit cdbf16aa93234e0d6a80f0d0e385ec81e7b75465 Author: prangana Date: Fri Jan 4 15:59:21 2019 +0530 Update version 1.3 Change-Id: I32a7d24af860e87a60396614075236afb65a28a9 commit cf9c1150515b8e9cc4f12e0d4787b3471b12ba4a Author: kdevraje Date: Thu Jan 3 09:51:46 2019 +0530 This commit adds a macro, which is to be enabled when BLIS is working on single instance mode Change-Id: I7f3fd654b78e64c4e6e24e9f0e245b1a30c492b0 commit ad8d9adb09a7dd267bbdeb2bd1fbbf9daf64ee76 Author: Field G. Van Zee Date: Thu Jan 3 16:08:24 2019 -0600 README.md, CREDITS update. Details: - Added "What's New" and "What People Are Saying About BLIS" sections to README.md. - Added missing github handles to various individuals' entries in the CREDITS file. commit 7052fca5aef430241278b67d24cef6fe33106904 Author: Field G. Van Zee Date: Wed Jan 2 13:48:40 2019 -0600 Apply f272c289 to bli_fmalloc_noalign(). Details: - Perform the same check for NULL return values and error message output in bli_fmalloc_noalign() as is performed by bli_fmalloc_align(). (This change was intended for f272c289.) commit 528e3ad16a42311a852a8376101959b4ccd801a5 Merge: 3126c52e f272c289 Author: Field G. Van Zee Date: Wed Jan 2 13:39:19 2019 -0600 Merge branch 'amd' commit 3126c52ea795ffb7d30b16b7f7ccc2a288a6158d Merge: 61441b24 8091998b Author: Field G. Van Zee Date: Wed Jan 2 13:37:37 2019 -0600 Merge branch 'amd' commit f272c2899a6764eedbe05cea874ee3bd258dbff3 Author: Field G. Van Zee Date: Wed Jan 2 12:34:15 2019 -0600 Add error message to malloc() check for NULL. Details: - Output an error message if and when the malloc()-equivalent called by bli_fmalloc_align() ever returns NULL. Everything was already in place for this to happen, including the error return code, the error string sprintf(), the error checking function bli_check_valid_malloc_buf() definition, and its prototype. Thanks to Minh Quan Ho for pointing out the missing error message. 
- Increased the default block_ptrs_len for each inner pool stored in the small block allocator from 10 to 25. Under normal execution, each thread uses only 21 blocks, so this change will prevent the sba from needing to resize the block_ptrs array of any given inner pool as threads initially populate the pool with small blocks upon first execution of a level-3 operation. - Nix stray newline echo in configure. commit eb97f778a1e13ee8d3b3aade05e479c4dfcfa7c0 Author: Field G. Van Zee Date: Tue Dec 25 20:17:09 2018 -0600 Added missing AMD copyrights to previous commit. Details: - Forgot to add AMD copyrights to several touched files that did not already have them in 2f31743. commit 2f3174330fb29164097d664b7c84e05c7ced7d95 Author: Field G. Van Zee Date: Tue Dec 25 19:35:01 2018 -0600 Implemented a pool-based small block allocator. Details: - Implemented a sophisticated data structure and set of APIs that track the small blocks of memory (around 80-100 bytes each) used when creating nodes for control and thread trees (cntl_t and thrinfo_t) as well as thread communicators (thrcomm_t). The purpose of the small block allocator, or sba, is to allow the library to transition into a runtime state in which it does not perform any calls to malloc() or free() during normal execution of level-3 operations, regardless of the threading environment (potentially multiple application threads as well as multiple BLIS threads). The functionality relies on a new data structure, apool_t, which is (roughly speaking) a pool of arrays, where each array element is a pool of small blocks. The outer pool, which is protected by a mutex, provides separate arrays for each application thread while the arrays each handle multiple BLIS threads for any given application thread. The design minimizes the potential for lock contention, as only concurrent application threads would need to fight for the apool_t lock, and only if they happen to begin their level-3 operations at precisely the same time. 
Thanks to Kiran Varaganti and AMD for requesting this feature. - Added a configure option to disable the sba pools, which are enabled by default; renamed the --[dis|en]able-packbuf-pools option to --[dis|en]able-pba-pools; and rewrote the --help text associated with this new option and consolidated it with the --help text for the option associated with the sba (--[dis|en]able-sba-pools). - Moved the membrk field from the cntx_t to the rntm_t. We now pass in a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we do for bli_sba_acquire() and _release(). - Replaced all calls to bli_malloc_intl() and bli_free_intl() that are used for small blocks with calls to bli_sba_acquire(), which takes a rntm (in addition to the bytes requested), and bli_sba_release(). These latter two functions reduce to the former two when the sba pools are disabled at configure-time. - Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as required by the new usage of bli_sba_acquire() and _release(). - Moved the freeing of "old" blocks (those allocated prior to a change in the block_size) from bli_membrk_acquire_m() to the implementation of the pool_t checkout function. - Miscellaneous improvements to the pool_t API. - Added a block_size field to the pblk_t. - Harmonized the way that the trsm_ukr testsuite module performs packing relative to that of gemmtrsm_ukr, in part to avoid the need to create a packm control tree node, which now requires a rntm_t that has been initialized with an sba and membrk. - Re-enable explicit call bli_finalize() in testsuite so that users who run the testsuite with memory tracing enabled can check for memory leaks. - Manually imported the compact/minor changes from 61441b24 that cause the rntm to be copied locally when it is passed in via one of the expert APIs. - Reordered parameters to various bli_thrcomm_*() functions so that the thrcomm_t* to the comm being modified is last, not first. 
- Added more descriptive tracing for allocating/freeing small blocks and formalized via a new configure option: --[dis|en]able-mem-tracing. - Moved some unused scalm code and headers into frame/1m/other. - Whitespace changes to bli_pthread.c. - Regenerated build/libblis-symbols.def. commit 61441b24f3244a4b202c29611a4899dd5c51d3a1 Author: Field G. Van Zee Date: Thu Dec 20 19:38:11 2018 -0600 Make local copy of user's rntm_t in level-3 ops. Details: - In the case that the caller passes in a non-NULL rntm_t pointer into one of the expert APIs for a level-3 operation (e.g. bli_gemm_ex()), make a local copy of the rntm_t and use the address of that local copy in all subsequent execution (which may change the contents of the rntm_t). This prevents a potentially confusing situation whereby a user-initialized rntm_t is used once (in, say, gemm), and then found by the user to be in a different state before it is used a second time. commit e809b5d2f1023b4249969e2f516291c9a3a00b80 Merge: 76016691 0476f706 Author: Field G. Van Zee Date: Thu Dec 20 16:27:26 2018 -0600 Merge branch 'master' into amd commit 1f4eeee5175a8fc9ac312847c796ce6db5fe75b9 Author: sraut Date: Wed Dec 19 21:21:10 2018 +0530 Fixed BLAS test failures of small matrix SYRK for single and double precision. Details: - SYRK for small matrix was implemented by reusing small GEMM routine. This was resulting in output written to the full C matrix, and C being symmetric the lower and upper triangles of C matrix contained same results. BLAS SYRK API spec demands either lower or upper triangle of C matrix to be written with results. So, this was resulting in BLAS test failures, even though testsuite of BLIS was passing small SYRK operation. - To fix BLAS test failures of small matrix SYRK, separate kernel routines are implemented for small SYRK for both single and double precision. The newly added small SYRK routines are in file kernels/zen/3/bli_syrk_small.c. 
Now the intermediate results of matrix C are written to a scratch buffer. Final results are written from scratch buffer to matrix C using SIMD copy to either lower or upper triangle part of matrix C. - Source and header files frame/3/syrk/bli_syrk_front.c and frame/3/syrk/bli_syrk_front.h are changed to invoke new small SYRK routines. Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb commit 6d267375c3a0543f20604d74cc678ad91db3b6f1 Author: sraut Date: Wed Dec 19 14:22:21 2018 +0530 This commit improves the performance of multi-instance DGEMM when these multiple threads are bound to a CCX. Multi-Instance: Each thread runs a sequential DGEMM. Change-Id: I306920c8061b6dad61efac1dae68727f4ac27df6 commit 0476f706b93e83f6b74a3d7b7e6e9cc9a1a52c3b Author: Field G. Van Zee Date: Tue Dec 18 14:56:20 2018 -0600 CHANGELOG update (0.5.1) commit e0408c3ca3d53bc8e6fedac46ea42c86e06c922d (tag: 0.5.1) Author: Field G. Van Zee Date: Tue Dec 18 14:56:16 2018 -0600 Version file update (0.5.1) commit 3ab231afc9f69d14493908c53c85a84c5fba58aa Author: Field G. Van Zee Date: Tue Dec 18 14:53:37 2018 -0600 ReleaseNotes.md update in advance of next version. Details: - Updated ReleaseNotes.md in preparation for next version. commit d1aa87164e1e82347d62aa98793963c5265ef7e7 Author: Field G. Van Zee Date: Tue Dec 18 14:52:40 2018 -0600 README.md update (External packages section). Details: - Updated External packages section in anticipation of introducing BLIS into Debian package universe. Thanks to M. Zhou for sponsoring BLIS in Debian. commit 7bf901e9265a1acd78e44c06f7178c8152c7e267 Author: sraut Date: Tue Dec 18 14:39:16 2018 +0530 Fix on EPYC machine for multi-instance performance issue. Issue: For the default values of mc, kc and nc in multi-instance mode, the performance across the cores dips drastically. Fix: After experimentation, found a different set of values (mc, kc and nc) that fits in the cache size, so that performance remains the same across all the cores.
Change-Id: I98265e3b7e61cd7602a0cc5596240e86c08c03fe commit d2b2a0819a2fccad9165bc48c0e172d79a87542c Author: Field G. Van Zee Date: Mon Dec 17 19:26:35 2018 -0600 Removed stray sections from Multithreading.md. Details: - Removed unintended section headers from before table of contents. commit 93d56319f2953cf0e9df1ff2cda90b8e41351b2c Author: Field G. Van Zee Date: Mon Dec 17 19:17:30 2018 -0600 Added missing bli_init_once() in bli_thread API. Details: - Fixed an issue with specifying threading globally at runtime via bli_thread_set_num_threads() (the automatic way) or via bli_thread_set_ways() (the manual way), with bli_thread_init_rntm() also affected. These functions were not calling bli_init_once() prior to acting, and therefore their effects on the global rntm_t structure were being wiped out by the eventual call to bli_init_once(), by some other BLIS function. Thanks to Ali Emre Gülcü for reporting the behavior associated with this bug. - Added additional content to docs/Multithreading.md covering topics of choosing between OpenMP and pthreads, and specifying affinity via OpenMP. - CREDITS file update. commit 76016691e2c514fcb59f940c092475eda968daa2 Author: Field G. Van Zee Date: Thu Dec 13 17:23:09 2018 -0600 Improvements to bli_pool; malloc()/free() tracing. Details: - Added malloc_ft and free_ft fields to pool_t, which are provided when the pool is initialized, to allow bli_pool_alloc_block() and bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align() with arbitrary align_size values (according to how the pool_t was initialized). - Added a block_ptrs_len argument to bli_pool_init(), which allows the caller to specify an initial length for the block_ptrs array, which previously suffered the cost of being reallocated, copied, and freed each time a new block was added to the pool. - Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t into a single "buf" field. 
Consolidated the bli_pblk API accordingly and also updated the bli_mem API implementation. This was done because I'd previously already implemented opaque alignment via bli_malloc_align(), which allocates extra space and stores the original pointer returned by malloc() one element before the element whose address is aligned. - Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call bli_fmalloc_align() and bli_ffree_align(), which required adding an align_size field to the membrk_t struct. - Pass the pack schemas directly into bli_l3_cntl_create_if() rather than transmit them via objects for A and B. - Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free(). The function had not been conditionally freeing control trees for quite some time. Also, removed obj_t* parameters since they aren't needed anymore (or never were). - Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a separate function, bli_l3_thread_decorator_thread_check(). - Renamed: bli_malloc_align() -> bli_fmalloc_align() bli_free_align() -> bli_ffree_align() bli_malloc_noalign() -> bli_fmalloc_noalign() bli_free_noalign() -> bli_ffree_noalign() The 'f' is for "function" since they each take a malloc_ft or free_ft function pointer argument. - Inserted various printf() calls for the purposes of tracing memory allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which, for now, is intended to be a "hidden" feature rather than one hooked up to a configure-time option. - Defined bli_rntm_equals(), which compares two rntm_t for equality. (There are no use cases for this function yet, but there may be soon.) - Whitespace changes to function parameter lists in bli_pool.c, .h. commit f808d829c58dc4194cc3ebc3825fbdde12cd3f93 Author: Field G. Van Zee Date: Wed Dec 12 15:22:59 2018 -0600 Handle edge cases, zero-filling in packm kernels. 
Details: - Updated the API and semantics of packm kernels such that they must now handle edge cases, meaning that a c-by-k packm kernel must be able to pack edge cases that are fewer than c rows/columns and be able to zero-fill the remaining elements. They must also be able to zero-fill the equivalent region when copying fewer than k columns/rows (which is needed by trsm). The new packm kernel API is generally: void packm_kernel ( conj_t conja, dim_t cdim, dim_t n, dim_t n_max, ctype* restrict kappa, ctype* restrict a, inc_t inca, inc_t lda, ctype* restrict p, inc_t ldp, cntx_t* restrict cntx ); where cdim and n are the dimensions (short and long, respectively) of the submatrix being copied from the source matrix A, and n_max is the "full" long dimension (corresponding to the k dimension in gemm) of the micropanel. The "full" short dimension (corresponding to the register blocksize MR or NR) is not part of the API because it is known intrinsically by the packm kernel implementation. Thanks to Devin Matthews for prompting us to make this change (#282). - Updated all reference packm kernels in ref_kernels/1m according to above changes, as well as all optimized packm kernels (which only consisted of those for knl). - Bumped the major soname version number in 'so_version' to 2. At first I was considering leaving it unchanged, but I couldn't escape the reality that the packm kernel API is much closer to an expert API than it is some obscure helper function interface within the framework that nobody would ever notice. - Removed reference packm kernels for mr/nr = 30. The only sub-config that would have been using those kernels is knc, which is likely no longer being used by very many people (if any). (This also mostly offset the larger object code footprint incurred by moving the edge-case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in which those implementations were modifying the contexts stored in the gks rather than a local copy. - Fixed a minor bug in the testsuite that prevented non-1m-based induced method implementations of trsm from executing. commit 02ec0be3ba0b0d6b4186386ae140906a96de919b Merge: e275def3 c534da62 Author: Field G. Van Zee Date: Wed Dec 5 19:33:53 2018 -0600 Merge branch 'master' into amd commit c534da62c0015f91391983da5376c9e091378010 Author: Field G. Van Zee Date: Wed Dec 5 15:51:05 2018 -0600 Disabled ARM configuration families in registry. Details: - Disabled (commented out) the arm32 and arm64 configuration families in the config_registry file. Having a configuration family registered only makes sense if BLIS is currently outfitted with runtime hardware detection logic to choose the appropriate sub-configuration. That logic is currently missing for ARM architectures, and thus having the ARM configuration families in the configuration registry only serves to confuse people. Thanks to Devangi Parikh for suggesting this change. commit 6885051a164628904fad0d8a3b39c82f9a7b193c Author: Field G. Van Zee Date: Wed Dec 5 14:45:39 2018 -0600 Generalizations/cleanup to mixeddt matlab scripts. Details: - Parameterized, reorganized, and added comments to matlab scripts in test/mixeddt/matlab. - Reordered some lines of code and added comments to plot_l3_perf.m in test/3m4m/matlab. commit cbdb0566bf3201a495bbdcb8cb50342fa0098649 Author: Field G. Van Zee Date: Wed Dec 5 20:06:32 2018 +0000 Updates to 3m4m, mixeddt test driver files. Details: - Updated 3m4m and mixeddt Makefiles and runme.sh scripts, mostly to port recent changes to the former to the latter. - Disabled (for now) code in 3m4m/test_*.c files that disables all induced methods except for the one that is requested from the Makefile via the IND macro. 
This is done because usually, we want to test whatever method is enabled automatically for complex datatypes. (That is, when native complex microkernels are missing, we usually want to test performance of 1m.) commit 0645f239fbdf37ee9d2096ee3bb0e76b3302cfff Author: Field G. Van Zee Date: Tue Dec 4 14:31:06 2018 -0600 Remove UT-Austin from copyright headers' clause 3. Details: - Removed explicit reference to The University of Texas at Austin in the third clause of the license comment blocks of all relevant files and replaced it with a more all-encompassing "copyright holder(s)". - Removed duplicate words ("derived") from a few kernels' license comment blocks. - Homogenized license comment block in kernels/zen/3/bli_gemm_small.c with format of all other comment blocks. commit 9b688a2d69dd420f4d2582827c5ac87e422cd3bc Author: Field G. Van Zee Date: Tue Dec 4 13:30:25 2018 -0600 Refer to color mm algorithm in Multithreading.md. commit 22384fd2b749aa8cfdfad1084ce5e7dbd4ad2d64 Author: Field G. Van Zee Date: Tue Dec 4 13:09:04 2018 -0600 Minor updates to test_gemm.c in test/mixeddt. commit 2ba3b1780cbca58e43a3948d67bd07e637036125 Author: Field G. Van Zee Date: Mon Dec 3 19:40:39 2018 -0600 Removed symbols from libblis-symbols.def. Details: - Removed bli_gemm_md_front() and bli_gemm_md_zgemm() symbols from build/libblis-symbols.def, which will hopefully appease AppVeyor. commit dcb38c4e59c3395c258799e69bfe2104c578c528 Merge: dc184095 375eb30b Author: Field G. Van Zee Date: Mon Dec 3 18:06:19 2018 -0600 Merge branch 'dev' commit 375eb30b0a63ac06a363a5f75f283584258db48b Author: Field G. Van Zee Date: Mon Dec 3 17:49:52 2018 -0600 Added mixed-precision support to 1m method. Details: - Lifted the constraint that 1m only be used when all operands' storage datatypes (along with the computation datatype) are equal. Now, 1m may be used as long as all operands are stored in the complex domain. 
This change largely consisted of adding the ability to pack to 1e and 1r formats from one precision to another. It also required adding logic for handling complex values of alpha to bli_packm_blk_var1_md() (similar to the logic in bli_packm_blk_var1()). - Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c, bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong ukernel output preference field being read. Previously, the preference for the native complex ukernel was being read instead of the pref for the native real domain ukernel. This bug would not manifest if the preference for the native complex ukernel happened to be equal to that of the native real ukernel. - Added support for testing mixed-precision 1m execution via the gemm module of the testsuite. - Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack schemas are always read from the context, rather than trying to sometimes embed them directly to the A and B objects. (They are still embedded, but now uniformly only after reading the schemas from the context.) - Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only consumer). - Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to bli_gemm_ker_var2_md(). - Added explicit handling for beta == 1 and beta == 0 in the reference gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c. - Rewrote various level-0 macro defs, including axpyris, axpbyris, scal2ris, and xpbyris (and their conjugating counterparts) to explicitly support three operand types and updated invocations to xpbyris in bli_gemmtrsm1m_ref.c. - Query and use the storage datatype of the packed object instead of the storage datatype of the source object in bli_packm_blk_var1(). - Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to frame/3/gemm/ind/bli_gemm_ind_opt.h. - Various whitespace/comment updates. 
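Illustrative note on commit 375eb30b (explicit beta handling): the commit above says only that the reference gemm1m virtual microkernel gained explicit cases for beta == 1 and beta == 0. The following is a minimal real-domain sketch of that general technique, not the code in ref_kernels/ind/bli_gemm1m_ref.c. The function name update_c_tile, the column-major layout, and the use of double are assumptions made for illustration. The rationale in the comments follows the reference BLAS convention that C need not be set on input when beta is zero; whether that was the motivation here is also an assumption.

```c
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): C := beta * C + AB for an
   m-by-n tile. c is column-major with leading dimension ldc; ab is a
   contiguous m-by-n column-major buffer holding the computed product. */
static void update_c_tile(size_t m, size_t n, double beta,
                          const double *ab, double *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double  t   = ab[i + j * m];
            double *cij = &c[i + j * ldc];

            if (beta == 0.0)
                *cij = t;                 /* overwrite; C is never read */
            else if (beta == 1.0)
                *cij += t;                /* accumulate; no scaling multiply */
            else
                *cij = beta * (*cij) + t; /* general case */
        }
    }
}
```

Because the beta == 0 branch never reads C, uninitialized or NaN/Inf values already stored in C do not leak into the result in this sketch.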
commit e275def30ac41cadce296560fa67282704f20a02 Merge: 8091998b dc184095 Author: Field G. Van Zee Date: Fri Nov 30 15:39:50 2018 -0600 Merge branch 'master' into amd commit dc18409551f341125169fe8d4d43ac45e81bdf28 Author: Field G. Van Zee Date: Wed Nov 28 11:58:40 2018 -0600 CREDITS file update. commit ee4d2712963816f84d7e3fdd39d93424e1aaf63d Merge: e81c4b56 3d7e8bc3 Author: Field G. Van Zee Date: Wed Nov 28 11:52:57 2018 -0600 Merge pull request #287 from SuperFluffy/fix_configuration_links Fix configuration links commit 3d7e8bc3b8e77693152138e75676f71573e5e6cd Author: Richard Janis Goldschmidt Date: Wed Nov 28 15:56:37 2018 +0100 Fix configuration links commit 6a4885f8be9ecd81423ebf2eb6da75d7981c979b Merge: 1d8aae22 e81c4b56 Author: Field G. Van Zee Date: Tue Nov 27 13:22:59 2018 -0600 Merge branch 'master' into dev commit e81c4b56660b25a39f8fdc09fbe07459c5bd8e8e Merge: 757043ea cfbdb58d Author: Field G. Van Zee Date: Wed Nov 21 17:00:49 2018 -0600 Merge pull request #285 from isuruf/pthread Move LDFLAGS to the end commit cfbdb58de2e44f2e3a3d8b14fceece7aef4b3006 Author: Isuru Fernando Date: Wed Nov 21 14:23:39 2018 -0600 Move LDFLAGS to the end Otherwise the linker will drop flags like -lpthread commit 757043eae8630c0a76e9bb04f2cb0bd72439a86a Merge: e769bf46 7af8fa01 Author: Field G. 
Van Zee Date: Wed Nov 21 13:07:26 2018 -0600 Merge pull request #283 from isuruf/patch-3 Fix MinGW and Cygwin build failures commit 7af8fa01373b7bb30fa3b1fd110fd201c87ea225 Author: Isuru Fernando Date: Wed Nov 21 02:10:05 2018 -0600 Fix blis dll path commit 2acd8dcd23805203a6821358c5e3e09d521fecdf Author: Isuru Fernando Date: Wed Nov 21 02:02:18 2018 -0600 Fix install path of dll.a commit b7b0ad22b151e89e2a6c7782cf4d8d47b4e60734 Author: Isuru Fernando Date: Wed Nov 21 01:54:44 2018 -0600 Test mingw commit bafe521ed0012b7b8814404b78a6c576d8386370 Author: Isuru Fernando Date: Wed Nov 21 01:54:36 2018 -0600 Fixes for mingw commit be831879bd03edcddff8a345161f749ad92215af Author: Isuru Fernando Date: Wed Nov 21 01:39:32 2018 -0600 test gcc shared commit f6b924648c79c4b1c3d3c7fbf85372680aff8362 Author: Isuru Fernando Date: Wed Nov 21 01:39:19 2018 -0600 Don't use .def for gcc commit ce6e4eae6d5e977e6f699acc9cf239be8ac53771 Author: Isuru Fernando Date: Wed Nov 21 01:34:56 2018 -0600 test no threading commit c9169b4685bfe81bc562cf9128b35a6a9884799b Author: Isuru Fernando Date: Wed Nov 21 01:17:36 2018 -0600 Add mingw64 path commit 0f753090eaf4264b743a49ce15de97514bcbe112 Author: Isuru Fernando Date: Wed Nov 21 01:14:52 2018 -0600 Fix PATH commit d424470b1f2fa8717fa54c0245b21341504665f6 Author: Isuru Fernando Date: Wed Nov 21 01:04:26 2018 -0600 Check openmp and pthreads threading commit c73e7601e58239e2dedec6c9f1b752e949254a42 Author: Isuru Fernando Date: Wed Nov 21 00:50:33 2018 -0600 Revert "enable rdp" This reverts commit 368274bcbd0c9232521d14fa28304f35ced0e6d7. 
commit 6209b2e6060b89e65f3405c31333af8952dd63c0 Author: Isuru Fernando Date: Wed Nov 21 00:50:22 2018 -0600 Remove conda commit 0b1b344447b8a2fcd635a48f0ce7ce89b2107dc4 Author: Isuru Fernando Date: Wed Nov 21 00:42:39 2018 -0600 Fix make name commit 7a9838983ba8dd32ac9f87712255721542ff561f Author: Isuru Fernando Date: Wed Nov 21 00:35:27 2018 -0600 Use m2w64-make commit 4c1dedd6a90087807f16353a5d0bcaaade35a7a5 Author: Isuru Fernando Date: Wed Nov 21 00:28:20 2018 -0600 No activate on gcc commit 368274bcbd0c9232521d14fa28304f35ced0e6d7 Author: Isuru Fernando Date: Tue Nov 20 23:40:26 2018 -0600 enable rdp commit 707a5e7f9b07f554e1e9289dd0ce3b7dc4fded6e Author: Isuru Fernando Date: Tue Nov 20 23:39:31 2018 -0600 No conda for mingw build commit 65b0565c0ad9162d4474bd84eabde491fa971538 Author: Isuru Fernando Date: Tue Nov 20 23:19:38 2018 -0600 Check MinGW-w64 commit 9ddffba5847080e0d77d9e6059d05dc4b1d89ba5 Author: Isuru Fernando Date: Wed Nov 21 00:23:34 2018 -0600 Fix MinGW build failure Fixes https://github.com/flame/blis/issues/278 commit 1d8aae220bc52ce8e3a8afaa64b57e5d83480bdc Author: Field G. Van Zee Date: Tue Nov 20 18:42:07 2018 -0600 Track internal scalar datatypes. Details: - Added a num_t datatype bitfield to the obj_t in the form of a new info2 field in the obj_t. This change was made primarily so that in the case of mixed-datatype gemm, the alpha scalar would not need to be cast to the storage datatype of B (or A) before then being cast to the computation datatype just before the macrokernel is called. This double-casting regime could result in loss of precision if the storage datatype of B (or A) is less than the computation precision. In practice, it was likely not going to be a big deal since most usage of alpha is for -1.0, 0.0, and 1.0 (or integer multiples thereof), which can all be represented exactly in single or double precision. 
- The type of objbits_t was changed to uint32_t, so the new format potentially takes up the same space as the previous obj_t definition, assuming no padding inserted by the compiler. Shrinking info to 32 bits and spilling over into a second field was chosen over using the high 32 bits of a single 64-bit objbits_t info field because many of the bitwise operations are performed with enums such as num_t, dom_t, and prec_t, which may take on the type of 32-bit ints. It's easier to just keep all of those bitwise operations in 32 bits than perform a million typecasts throughout bli_type_defs.h and bli_obj_macro_defs.h to ensure that the integers are treated as 64-bit for the purposes of the ANDs, ORs, and bitshifts. - Many comment updates. - Thanks to Devin Matthews and Devangi Parikh for their feedback and involvement during this commit cycle. commit e769bf46b0931d68031af212110484ec98e16908 Author: Field G. Van Zee Date: Tue Nov 20 16:16:53 2018 -0600 Tweak testsuite to issue FAIL for Nan, Inf (#279). Details: - Adjusted the definition for libblis_test_get_string_for_result() in testsuite/src/test_libblis.c so that the "FAIL" string is returned if the computed residual contains either NaN or Inf. Previously, a residual containing NaN would result in the selection of the "PASS" string. Thanks to Devin Matthews for reporting this issue (#279). - Expounded on comment for the macro definitions of bli_isnan() and bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they must remain macros. commit 279deae18fb8b8106161863b46fcb38232314de4 Author: Field G. Van Zee Date: Fri Nov 16 11:34:19 2018 -0600 Added 4x5 matlab plotting scripts to test/3m4m. Details: - Added a new directory, test/3m4m/matlab, containing matlab scripts for plotting 4x5 panels of performance graphs (using the subplot() function) for gemm, hemm, herk, trmm, and trsm across all four floating-point datatypes. 
I expect to further refine these scripts as time goes on, but their current state constitutes a good start. commit 7b02c726650336c12286c8ba166d1d0fdf7601a8 Author: Field G. Van Zee Date: Wed Nov 14 13:49:55 2018 -0600 CREDITS file update. commit 84dd298a27033945fa2d3b6e5dce1fe625cd2a0a Author: Field G. Van Zee Date: Wed Nov 14 13:47:45 2018 -0600 Patch to fix msys2/Windows build failure (#277). Details: - Expanded cpp guard in frame/include/bli_x86_asm_macros.h to also check __MINGW32__ in addition to _WIN32, __clang__, and __MIC__. Thanks to Isuru Fernando for suggesting this fix, and also to Costas Yamin for originally reporting the issue (#277). commit 8091998b6500e343c2024561c2b1aa73c3bafb0b Merge: 333d8562 7b5ba731 Author: Field G. Van Zee Date: Wed Nov 14 12:36:35 2018 -0600 Merge branch 'master' into amd commit 7b5ba7319b3901ad0e6c6b4fa3c1d96b579efbe9 Merge: ce719f81 52392932 Author: Field G. Van Zee Date: Wed Nov 14 12:32:01 2018 -0600 Merge branch 'dev' of github.com:flame/blis into dev commit 52392932dc1ea3c16220cc4e6978efcb2f5f0616 Author: Field G. Van Zee Date: Tue Nov 13 22:23:38 2018 +0000 Minor fixes to test/3m4m drivers. Details: - Cleanups to Makefile to allow all test drivers to be built for OpenBLAS and MKL in addition to BLIS. - Fixed copy-paste typos in test_hemm in calls to ssymm_() and dsymm_(). - Fixed incorrect types for betap in BLAS cpp macro branch of test_herk.c. commit 4f12e36a0d0e6df146314b4e50e36c5e7a1af3d3 Author: Field G. Van Zee Date: Tue Nov 13 14:23:12 2018 -0600 Fixed number of columns in first output line. Details: - In previous commit, forgot to remove output column corresponding to the k dimension. commit a2e0cdd7debf8109198536d55af05d5631072fb2 Author: Field G. Van Zee Date: Tue Nov 13 14:15:11 2018 -0600 Added hemm test driver to test/3m4m. Details: - Added a new test_hemm.c test driver to test/3m4m, which was modeled after the driver by the similar name in test. 
Also updated Makefile so that blis-nat-[sm]t would trigger builds for the new driver. commit 0f9b53e84b48d8d73a56cc9889eae3595ca58a78 Author: Field G. Van Zee Date: Tue Nov 13 13:03:15 2018 -0600 Fixed a bug in high-level mixeddt conditional. Details: - Fixed a bug in frame/3/bli_l3_oapi.c in the conditional that divides use of induced method (1m) execution from native execution. The former was intended to only be used in cases where all storage datatypes are complex and the datatype of C is equal to the computation datatype. (If mixed datatypes are detected, native execution would be used.) However, the code in bli_gemm() was erroneously checking the execution datatype instead of the computation datatype, which at that point is guaranteed to be equal to the storage datatype even if the computation datatype contains a different value. Thanks to Devangi Parikh for helping in isolating this bug. commit 333d8562f04eea0676139a10cb80a97f107b45b0 Author: Field G. Van Zee Date: Sun Nov 11 14:28:53 2018 -0600 Added debug output to bli_malloc.c. Details: - Added debug output to bli_malloc.c in order to debug certain kinds of memory behavior in BLIS. The printf() statements are disabled and must be enabled manually. - Whitespace/comment updates in bli_membrk.c. commit ce719f816d1237f5277527d7f61123e77180be54 Author: Field G. Van Zee Date: Sat Nov 10 14:48:43 2018 -0600 More edits to mixeddt matlab scripts. Details: - Renamed scripts in test/mixeddt/matlab: plot_case_all.m -> plot_dom_all.m plot_case_md.m -> plot_dom_case.m plot_all_md.m -> plot_dt_all.m - Added plot_dt_select.m in order to plot select graphs for the main body of the mixeddt paper, and added additional related legend handling in plot_gemm_perf.m. - Added test/mixeddt/matlab/output and a .gitkeep file within in order to force git to recognize the directory. commit bf99e7c14baf45725b698d06ad043b531e3a2763 Author: Field G. Van Zee Date: Thu Nov 8 18:47:17 2018 -0600 Minor updates to test/mixeddt driver. 
Details: - Cleaned up test/mixeddt Makefile in preparation for gathering new data for mixeddt paper, including renaming implementations to "internal" and "ad-hoc" to match the terminology to be used in the paper. - Added new matlab scripts for generating 8 figures, each covering all mixed-precision cases for each mixed-domain case. - Updated the runme.sh script according to changes to Makefile. - Fixed a minor bug in test_gemm.c that may have given incorrect performance in complex, homogeneous storage datatype cases where the computation precision was equal to the storage precisions. (Examples: zzzd, cccs.) commit 4bbb454bf3c361af9e97bfa394a73d610cd9002a Author: Field G. Van Zee Date: Sat Nov 3 19:11:01 2018 -0500 Testsuite docs update for mixed-datatype gemm. Details: - Updated docs/Testsuite.md to include mention of the new mixed-domain and mixed-precision settings, including descriptions. - Updated docs/MixedDatatypes.md to include a brief section on running the testsuite to exercise mixed-datatype functionality, which mostly amounts to a link to the Testsuite.md document. - Minor verbiage change to testsuite output to correct a misleading label associated with the value returned by the query function bli_info_get_simd_num_registers(). (The function does not return the number of SIMD registers present in the hardware, but rather a maximum assumed value for the purposes of allocating temporary microtile workspace on the function stack.) commit 16401ae922b1285437cf5f6867b2764650a95fb0 Merge: f19c33af 2d403a15 Author: Field G. Van Zee Date: Sat Nov 3 19:09:43 2018 -0500 Merge branch 'dev' commit 2d403a1535380a2ebe2ae2c0f5ac54ba7564fbeb Merge: e90e7f30 4a12979f Author: Field G. Van Zee Date: Thu Nov 1 20:18:53 2018 -0500 Merge pull request #275 from RhysU/patch-1 Spelling in FAQ commit 4a12979f65697ed79ba290efd59f4b994ac9429b Author: Rhys Ulerich Date: Thu Nov 1 20:20:59 2018 -0400 Spelling in FAQ commit f19c33af4cbe6f5705b96fbf2b8799c3c2bd75c3 Author: Field G. 
Van Zee Date: Fri Oct 26 17:07:15 2018 -0500 Disallow 64b BLAS integers + 32b BLIS integers. Details: - Print an error message from configure if the user attempts to explicitly configure BLIS for simultaneous use of 64-bit integers in the BLAS API with 32-bit integers in the BLIS API. - Added cpp macro conditional to bli_type_defs.h to mandate that BLIS integers be 64 bits if the BLAS integers are 64 bits. This and the above item take care of issue #274. Thanks to Devin Matthews and Jeff Hammond for suggesting these safeguards. - Slight reorganization and relabeling (for clarity) of BLAS/CBLAS sections and BLIS integer size line of the testsuite configuration output. - Very minor edits to docs/MixedDatatypes.md. commit e90e7f309b3f2760a01e8e09a29bf702754fa2b5 (origin/win-pthreads) Author: Field G. Van Zee Date: Thu Oct 25 14:09:43 2018 -0500 CHANGELOG update (0.5.0) commit be7c57819cfd48adb175d9a480cc9f37928645c1 (tag: 0.5.0) Author: Field G. Van Zee Date: Thu Oct 25 14:09:40 2018 -0500 Version file update (0.5.0) commit 75da7f2a208ad7d26ed9c6d3e10d08b2a1caf9d6 Author: Field G. Van Zee Date: Thu Oct 25 14:02:41 2018 -0500 ReleaseNotes.md update in advance of next version. Details: - Updated ReleaseNotes.md in preparation for next version. - Updated docs/FAQ.md to reflect recent developments, and other edits. - Minor updates to RELEASING. commit 6fbc456fb3f4401ec951a618990f15a84fdfa236 Author: Field G. Van Zee Date: Thu Oct 25 13:20:25 2018 -0500 Added SALT testing to Travis CI. Details: - Modified .travis.yml to automatically employ the simulation of application-level threading within the testsuite, with supporting changes to common.mk, the top-level Makefile, and travis/do_testsuite.sh. - Added a new pair of input files to testsuite directory with the '.salt' suffix (similar to those with the '.fast' suffix) for testing application-level threading. - Updated docs/BuildSystem.md to document the new make targets 'testblis-salt' and 'checkblis-salt'. 
commit 0e27963a6770e6b64f3299ad0613d5df45d8b6ae Author: Field G. Van Zee Date: Wed Oct 24 12:16:19 2018 -0500 Add bli_pthread_mutex_trylock(). Details: - Added the missing bli_pthread_mutex_trylock() function and prototype to the non-Windows sections of bli_pthread.c and .h. This function isn't needed by BLIS, but I figured why not make the Windows and non-Windows sections consistent with one another. commit 4b683740c12f83804a51ec610b16ce28607d5c85 Author: Field G. Van Zee Date: Wed Oct 24 11:56:16 2018 -0500 Defined bli_pthread_cond_*() and related defs. Details: - Added function definitions for bli_pthread_cond_*() as well as related types and constants to bli_pthread.c, and corresponding prototypes to bli_pthread.h. commit 4b4f8072b9bb495b3e01d45698b0bad3dac31ba8 Author: Field G. Van Zee Date: Wed Oct 24 11:31:46 2018 -0500 Define bli_pthreads barrier types on OS X. Details: - Fully define bli_pthreads barrier-related types on OS X. Only typedef those types in terms of pthreads types on non-Windows, non-Apple OSes (i.e. Linux). commit ad98790dcef6bd9aab7f13d615b987b5daa58757 Author: Field G. Van Zee Date: Tue Oct 23 20:35:05 2018 -0500 Fix names of Windows pthread initializer macros. Details: - Renamed the PTHREAD_ initializer macros in the Windows cpp case to use BLIS_ prefixes to match their non-Windows counterparts. commit 06c23954e6b17219a50c3d37821544a46defaf89 Author: Field G. Van Zee Date: Tue Oct 23 19:16:54 2018 -0500 Defined unified bli_pthreads_*() API for all OSes. Details: - Expanded the bli_pthread_*() -> pthread_*() wrappers in frame/thread/bli_pthread.c to include cases for Windows taken from frame/base/bli_pthread_wrap.c. Now, bli_thread_*() is always defined and always used by BLIS and the BLIS testsuite (in lieu of calling pthreads directly, as before). The implementation used in this new API depends on whether we are building for Windows, and to a lesser extent, whether we are building on OS X. 
For the core API, Windows uses Windows threads, non-Windows (Linux, OS X) uses pthreads. OS X and Windows get barriers implemented in terms of other bli_pthread_*() functions, and Linux gets barriers implemented in terms of pthread_barrier*(). This commit addresses issue #273. - Fixed a bug in the Linux definition of bli_pthread_mutex_unlock(), which was erroneously calling pthread_mutex_lock(). - Minor changes to configure so that the auto-detection executable can be built given the above changes (most notably, turning on POSIX extensions via -D_GNU_SOURCE). - Removed temporary play-test code for shiftd that accidentally got committed into test/3m4m/test_gemm.c. commit 0ae9585da1e3db1cf8034d4b16305a5883beb0d3 Author: pradeeptrgit Date: Tue Oct 23 09:36:23 2018 +0530 Update version number to 1.2 Change-Id: Ibb31f6683cdecca6b218bc2f0c14701d7e92ebf3 commit eac7d267a017d646a2c5b4fa565f4637ebfd9da7 Author: Field G. Van Zee Date: Mon Oct 22 18:10:59 2018 -0500 Unconditionally define bli_l3_thread_entry(). Details: - Define a dummy bli_l3_thread_entry() function when multithreading is disabled altogether, or enabled via OpenMP. This function was originally necessary when multithreading is enabled via pthreads. By defining the function no matter the threading options given, it is less likely that an AppVeyor Windows build will complain due to a missing symbol in the DLL. (To be clear: AppVeyor was working fine before, but a problem may have arisen if it were switched to an OpenMP build.) - Removed the prototype for bli_l3_thread_entry() from bli_thrcomm_pthreads.c and placed it in bli_thrcomm.h. - Regenerated the symbols list file build/libblis-symbols.def. commit 4ee986f0a74207f4ca29df077929134725d62b80 Author: Field G. Van Zee Date: Mon Oct 22 14:09:44 2018 -0500 Added mixed-datatype testing to Travis CI (#271). 
Details: - Modified .travis.yml to automatically test the mixed-datatype support of the gemm operation, with supporting changes to common.mk, the top-level Makefile, and travis/do_testsuite.sh. - Added a new pair of input files to testsuite directory with the '.mixed' suffix (similar to those with the '.fast' suffix) for testing mixed-datatype gemm. - Updated docs/BuildSystem.md to document the new make targets 'testblis-md' and 'checkblis-md'. commit c3c6ebc9c6244053d654a9b0c955acb2fef42ee8 Author: Field G. Van Zee Date: Sun Oct 21 18:48:54 2018 -0500 Fixed thrinfo_t printing for small problems. Details: - Fixed a bug in the code that prints out the communicator and work ids from the various threads' thrinfo_t nodes. This bug manifested when the dimension being parallelized was not large enough such that every thread was assigned actual work (since the minimum amount of work is determined by the register blocksize in the dimension being parallelized). In those cases, the threads that receive no work in that dimension do not finish building their thrinfo_t tree, leaving lower-level nodes non-existent. (The bug itself was usually observed as a segfault when the printing code attempted to dereference all the way down the thrinfo_t tree.) The solution involves explicitly checking each node as it is dereferenced, and if at any time NULL is found, all subsequent communicator and work ids are set to -1. commit 73a222c0d99dcc221be7dea10eaebf844f31f72e Author: Field G. Van Zee Date: Sat Oct 20 14:13:04 2018 -0500 Minor edits to 'configure --help' text. commit 14f3d5e6df183819a0c393b2661ad15df0786544 Author: Field G. Van Zee Date: Fri Oct 19 20:39:35 2018 -0500 Refresh libblis-symbols.def post-merge 090e4f0. commit 090e4f08fc2f429a1b2db77b0a6f8276f892a7ac Merge: c9be5889 0854e880 Author: Field G. Van Zee Date: Fri Oct 19 18:41:10 2018 -0500 Merge branch 'master' into dev commit 0854e880b0848e0c2e3d0644c93c80b0fd13c0dc Merge: 4e38a8d4 343a2715 Author: Field G.
Van Zee Date: Fri Oct 19 18:05:00 2018 -0500 Merge pull request #261 from flame/win-pthreads Implement missing pthreads function on Windows commit c9be5889fbe947c64ef75740662e4d63032f4c35 Author: Field G. Van Zee Date: Fri Oct 19 17:42:40 2018 -0500 Added "Known issues" section to Multithreading.md. Details: - Added known issues section to Multithreading.md. - Trivial changes to MixedDatatypes.md, Sandboxes.md. commit 343a2715ebee28d250ee41b914abdcd1dc77c344 Author: Field G. Van Zee Date: Fri Oct 19 16:59:19 2018 -0500 Whitespace changes to configure, bli_pthread_wrap. Details: - Mostly whitespace changes (spaces to tabs) to configure and bli_pthread_wrap.c and .h. commit 3678a1cd518df9447b4b1ea86885eb2ba8abcf6e Merge: 85397cd4 4e38a8d4 Author: Field G. Van Zee Date: Fri Oct 19 16:11:31 2018 -0500 Merge branch 'master' into win-pthreads commit 4e38a8d4eebb18ead74e644fac76a4fde8e7f6c6 Author: Field G. Van Zee Date: Fri Oct 19 15:54:15 2018 -0500 Implemented python version checking in configure. Details: - Added python version checking to configure script. (Recall that python is needed to execute the flatten-headers.py script.) Minimum versions of python needed are currently as follows: python2: 2.7 or later python3: 3.5 or later The standard search order for python interpreters is: python python3 python2 The PYTHON environment variable is also supported and will be checked before the standard search order list. - Updated BuildSystem.md to include: a minimum make version; mention that the C compiler must actually be a C99 compiler; and the caveat that Windows builds do not require pthreads since BLIS can provide an implementation of pthreads internally. commit 85397cd4fa52f6c4c33f4fb715478c55533c680e Author: Field G. Van Zee Date: Fri Oct 19 13:12:43 2018 -0500 Added explanatory comment to bli_pthread.c. Details: - Added a verbose comment to bli_pthread.c that explains why a bli_ wrapper to pthreads APIs is useful.
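Illustrative note on commit 4e38a8d4 (python version checking): the rules stated in that commit message are the minimum versions (2.7 for python2, 3.5 for python3), the search order (python, python3, python2), and the precedence of the PYTHON environment variable. They can be restated as a small C sketch. The real check lives in the configure shell script, so this is not its implementation. The function names python_version_ok and python_search_order are hypothetical, and rejecting any major version other than 2 or 3 is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimums quoted in the commit message: python2 >= 2.7, python3 >= 3.5. */
static bool python_version_ok(int major, int minor)
{
    if (major == 2) return minor >= 7;
    if (major == 3) return minor >= 5;
    return false; /* assumption: other major versions are rejected */
}

/* Fill out[] with interpreter names in the order they would be tried.
   env_python is the value of the PYTHON environment variable, or NULL
   (or an empty string) if it is unset. Returns the number of entries
   written, which is at most 4. */
static size_t python_search_order(const char *env_python, const char *out[4])
{
    size_t n = 0;

    if (env_python != NULL && env_python[0] != '\0')
        out[n++] = env_python; /* PYTHON is checked before the standard list */

    out[n++] = "python";
    out[n++] = "python3";
    out[n++] = "python2";

    return n;
}
```

A caller would walk the returned list in order, query each candidate's version, and keep the first one for which python_version_ok() returns true.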
commit 53c07035ef61cc9b8469636d4d8fa5085f37652d Author: Field G. Van Zee Date: Fri Oct 19 12:53:03 2018 -0500 Refresh libblis-symbols.def from bb6df28. Details: - Forgot to regenerate the symbols file after the previous commit (bb6df281) in which shiftd operation was introduced. commit 473ce54f5fbea4860ac0514e7e8b022c1ea03e63 Author: Field G. Van Zee Date: Thu Oct 18 19:03:56 2018 -0500 Added bli_pthread_*() API. Details: - Defined a bli_pthread_*() API so that the testsuite, when being linked against a Windows DLL, will be able to access pthreads functionality without those pthreads functions being explicitly exported by the DLL. Instead, we export the bli_pthread_*() layer, which uses types and functions that are identical to pthreads, but adds a 'bli_' prefix. Only a few basic functions are present in the bli_pthreads_*() API for now. Thanks to Devin Matthews and Isuru Fernando for their help on a related PR (#261) that this commit will hopefully facilitate. - Updated testsuite so that it calls bli_pthread_*() layer instead of pthread_*() functions directly. - Regenerated build/libblis-symbols.def. - Comment updated to build/regen-symbols.sh. commit bb6df2814fcaa2fa62a549379f61be2f8667a598 Author: Field G. Van Zee Date: Thu Oct 18 17:11:39 2018 -0500 Defined a new level-1d operation: shiftd. Details: - Defined a new level-1d operation called 'shiftd', including object and typed APIs. This operation adds a scalar value to every element along an arbitrary diagonal of a matrix. Currently, shiftd is implemented in terms of the addv kernel. (The scalar is passed in as the x vector with an increment of zero.) - Replaced ad-hoc usage of setd and addd (after creating a temporary matrix object) with use of shiftd, which is much more concise, in various test driver files in the testsuite. Similar changes were made to the standalone test drivers and the example code. 
- Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md for bli_shiftd() and bli_?shiftd(), respectively. - Added observed object properties to level-1d documentation in BLISObjectAPI.md. commit 53e0a0c9b38e8525c7224e280342ef56328af567 Merge: 1c7247b6 ec676799 Author: Field G. Van Zee Date: Thu Oct 18 14:54:59 2018 -0500 Merge branch 'master' into win-pthreads commit ec67679990660a60362a49406595383672812287 Author: Field G. Van Zee Date: Thu Oct 18 14:27:02 2018 -0500 Refreshed Windows symbol list; added regen script. Details: - Moved windows/build/libblis-symbols.def to build/libblis-symbols.def. Updated link commands in common.mk accordingly. - Added a new script build/regen-symbols.sh that will regenerate the libblis-symbols.def file in its new location after building a haswell-targeted shared library. Thanks to Isuru Fernando for providing the symbol generation command. - Ran the new script to refresh the symbols file. commit fdad54ab8eee4a7efd04ec4afb3e6902eb22e60a Author: Field G. Van Zee Date: Thu Oct 18 12:43:22 2018 -0500 Removed old symbol from libblis-symbols.def. Details: - Removed bli_gemm_ker_var1() from windows/build/libblis-symbols.def since this function is no longer compiled. commit 49d3f9fcbb4a75553439f97c099ea48d85763eea Merge: 779d64dc 3c527256 Author: Field G. Van Zee Date: Wed Oct 17 18:00:40 2018 -0500 Merge branch 'master' into dev commit 3c52725693d0d7726e1c8fb224f9b1ef786db8b9 Author: Field G. Van Zee Date: Wed Oct 17 14:56:22 2018 -0500 Renamed/moved l3 zen ukernels to haswell kernel set. Details: - Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and then updated the file contents to use the 'haswell' infix. - Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to above function renames. - Moved/updated the corresponding prototypes in bli_kernels_zen.h to bli_kernels_haswell.h. - Updated config_registry according to above changes. 
    - NOTE: This rename reflects the fact that haswell microkernels are specifically written to overcome the floating-point latency for FMA instructions on Intel Haswell-like architectures, which can issue two FMA instructions per cycle. These ukernels happen to work fine on AMD Zen-based architectures. However, Zen only issues one FMA per cycle, which, while halving its floating-point throughput, gives it extra flexibility in the design of its microkernels--namely, mr and nr can be smaller and still overcome the floating-point latency for those single-issue cores. A smaller value of mr and nr allows for a larger value of kc, which may be useful in some situations. In the future, we may write such Zen-specific microkernels to take advantage of this additional flexibility.

commit 71c5832d5f5596f25204980803423d08143a4010
Author: Field G. Van Zee
Date:   Wed Oct 17 14:11:01 2018 -0500

    Consolidated slab/rr-explicit level-3 macrokernels.

    Details:
    - Consolidated the *sl.c and *rr.c level-3 macrokernels into a single file per sl/rr pair, with those files named as they were before c92762e. The consolidation does not take away the *option* of using slab or round-robin assignment of micropanels to threads; it merely *hides* the choice within the definitions of functions such as bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter() rather than expose that choice explicitly in the code. The choice of slab or rr is not always hidden, however; there are some cases involving herk and trmm, for example, that require some part of the computation to use rr unconditionally. (The --thread-part-jrir option controls the partitioning in all other cases.)
    - Note: Originally, the sl and rr macrokernels were separated out for clarity. However, aside from the additional binary code bloat, I later deemed that clarity not worth the price of maintaining the additional (mostly similar) codes.

commit 57eab3a4f0e43099fc2ff189df9fcc0d7801c2cd
Author: Field G. Van Zee
Date:   Wed Oct 17 11:29:20 2018 -0500

    CREDITS file update.

commit 6722ec21817cbab9d86ee63f00984eb407b5e627
Author: Ye Luo
Date:   Wed Oct 17 11:26:00 2018 -0500

    Fix bgclang compilation on BGQ (#270)

    * Fix bgq kernels
    * Support bgq with bgclang

commit 1c7247b6d146fc728d7c4240e4e069e33f8f8868
Merge: c1bc5530 6c5a1aaf
Author: Devin Matthews
Date:   Tue Oct 16 14:44:32 2018 -0500

    Merge branch 'win-pthreads' of github.com:flame/blis into win-pthreads

commit c1bc5530d51bf55b4aa3c35165f6d4452a0fd779
Author: Devin Matthews
Date:   Tue Oct 16 14:44:10 2018 -0500

    Don't call pthread_once in auto-detect.

commit b9c61d03f542a2e92551ff0595415bec3076ab25
Merge: 5a1e461f 3612ecac
Author: Field G. Van Zee
Date:   Tue Oct 16 14:39:57 2018 -0500

    Merge branch 'nested-omp-patch'

commit 5a1e461ffe09ed200ee2fc7aafccf6dd7e8c0080
Author: Field G. Van Zee
Date:   Tue Oct 16 14:21:45 2018 -0500

    Execute flatten-headers.py via $(PYTHON).

    Details:
    - Execute build/flatten-headers.py python script via $(PYTHON) in common.mk. This allows distributions that define the current/preferred python interpreter in the PYTHON environment variable to use that interpreter when executing flatten-headers.py. Thanks to Isuru Fernando for this suggestion, and to Dave Love for submitting the initial issue/request.

commit 6c5a1aaff540b19672e91501e894ed695aee322b
Author: Devin Matthews
Date:   Tue Oct 16 10:15:59 2018 -0500

    Fix type in bli_pthread_wrap.c

commit 29e6245816760b1bd4ac738d7d3e11a9d9d13473
Merge: 0b73209f ed657714
Author: Devin Matthews
Date:   Tue Oct 16 10:12:25 2018 -0500

    Merge branch 'master' into win-pthreads

commit 0b73209f6b22cc024169146d343627f6999b63d8
Author: Devin Matthews
Date:   Tue Oct 16 10:02:06 2018 -0500

    Add missing argument to WaitForSingleObject and use $is_win in configure to turn off pthreads.

commit ed65771482a705f7ed028d822489766327b44e76
Author: Field G. Van Zee
Date:   Mon Oct 15 17:54:45 2018 -0500

    Fixed merge fail on testsuite threading macros.

    Details:
    - Applied the following C preprocessor macro renames
        BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
        BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
        BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
        BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
      in src/test_libblis.c. This is apparently the result of a failure by git to properly merge the 'master' and 'amd' branches in the previous commit. (The 'master' branch contained a commit, 53a9ab1, in which these same cpp macros were renamed throughout the source distribution.)

commit dc5fd898af8c74c2e2a75fc647157da0d04dd922
Merge: 667d3929 637c2ce7
Author: Field G. Van Zee
Date:   Mon Oct 15 17:41:35 2018 -0500

    Merge branch 'amd'

commit 779d64dc3091dea6b7530283304e52878151d218
Author: Field G. Van Zee
Date:   Mon Oct 15 17:13:18 2018 -0500

    Added entry for xpbym to input.operations.fast.

    Details:
    - Forgot to add an entry for the new xpbym operation to input.operations.fast in previous commit.

commit 5fec95b99f61761963834f62a9867f797687813c
Author: Field G. Van Zee
Date:   Mon Oct 15 16:37:39 2018 -0500

    Implemented mixed-datatype support for gemm.

    Details:
    - Implemented support for gemm where A, B, and C may have different storage datatypes, as well as a computational precision (and implied computation domain) that may be different from the storage precision of either A or B. This results in 128 different combinations, all of which are implemented within this commit. (For now, the mixed-datatype functionality is only supported via the object API.) If desired, the mixed-datatype support may be disabled at configure-time.
    - Added a memory-intensive optimization to certain mixed-datatype cases that requires a single m-by-n matrix be allocated (temporarily) per call to gemm. This optimization aims to avoid the overhead involved in repeatedly updating C with general stride, or updating C after a typecast from the computation precision.
      This memory optimization may be disabled at configure-time (provided that the mixed-datatype support is enabled in the first place).
    - Added support for testing mixed-datatype combinations to testsuite. The user may test gemm with mixed domains, precisions, both, or neither.
    - Added a standalone test driver directory for building and running mixed-datatype performance experiments.
    - Defined a new variation of castm, castnzm, which operates like castm except that imaginary values are not touched when casting a real operand to a complex operand. (By contrast, in these situations castm sets the imaginary components of the destination matrix to zero.)
    - Defined bli_obj_imag_is_zero() and substituted calls in lieu of all usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and also simplified the implementation of bli_obj_imag_equals().
    - Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex() when given BLIS_CONSTANT objects.
    - Disabled dt_on_output field in auxinfo_t structure as well as all accessor functions. Also commented out all usage of accessor functions within macrokernels. (Typecasting in the microkernel is still feasible, though probably unrealistic for now given the additional complexity required.)
    - Use void function pointer type (instead of void*) for storing function pointers in bli_l0_fpa.c.
    - Added documentation for using gemm with mixed datatypes in docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
    - Defined level-1d operation xpbyd and level-1m operation xpbym.
    - Added xpbym test module to testsuite.
    - Updated frame/include/bli_x86_asm_macros.h with additional macros (courtesy of Devin Matthews).

commit 3612ecac98a9d36c3fcd64154121d420bb69febd
Author: Field G. Van Zee
Date:   Thu Oct 11 15:16:41 2018 -0500

    Added comments to nested OpenMP handling code.

    Details:
    - Added comments to bli_thrcomm_openmp.c relating to changes made in 6ac0c80 and 1064d79.
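
Note on commit 5fec95b above: the difference between castm and castnzm when a real operand is cast into a complex destination can be sketched as follows. This is only a Python illustration of the semantics stated in the commit message, not BLIS code. The function names castm_real_to_complex and castnzm_real_to_complex, and the list-of-lists matrix representation, are assumptions made for the example.

```python
def castm_real_to_complex(src, dst):
    """Copy real src into complex dst, setting every imaginary part to zero."""
    for i, row in enumerate(src):
        for j, value in enumerate(row):
            dst[i][j] = complex(value, 0.0)


def castnzm_real_to_complex(src, dst):
    """Copy real src into complex dst, leaving dst's imaginary parts untouched."""
    for i, row in enumerate(src):
        for j, value in enumerate(row):
            dst[i][j] = complex(value, dst[i][j].imag)


# Hypothetical data: a 2x2 real source and two pre-filled complex destinations.
a = [[1.0, 2.0], [3.0, 4.0]]
b_castm = [[9 + 9j, 9 + 9j], [9 + 9j, 9 + 9j]]
b_castnzm = [[9 + 9j, 9 + 9j], [9 + 9j, 9 + 9j]]

castm_real_to_complex(a, b_castm)
castnzm_real_to_complex(a, b_castnzm)

print(b_castm)    # imaginary parts overwritten with zero
print(b_castnzm)  # imaginary parts still 9
```

In this sketch only the real parts of the destination change under castnzm, which matches the "imaginary values are not touched" wording above.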

commit 667d3929ee20e94849b4e25b693b4037b7e3f350
Author: Field G. Van Zee
Date:   Thu Oct 11 11:47:57 2018 -0500

    Added Fortran APIs for some thread functions.

    Details:
    - Defined Fortran-77 compatible APIs for bli_thread_set_num_threads() and bli_thread_set_ways(). These wrappers are defined in frame/compat/blis/thread/b77_thread.c. Thanks to Kay Dewhurst for suggesting these new interfaces.
    - Added missing prototype for bli_thread_set_ways() in bli_thread.h and removed prototypes for non-existent functions bli_thread_set_*_nt().
    - CREDITS file update.

commit 1064d79711f03a0541b92d8b8b9b7e25e04097a5
Author: Devin Matthews
Date:   Thu Oct 11 11:14:25 2018 -0500

    Adjust rntm_t struct as well.

commit 6ac0c805609b85616ddb32e50101c4f9feb25a35
Author: Devin Matthews
Date:   Thu Oct 11 10:45:07 2018 -0500

    Fix OMP nesting problem.

    Detect when OpenMP uses fewer threads than requested and correct accordingly, so that we don't wait forever for nonexistent threads. Fixes #267.

commit 78a6935483409ae277c766406e175772e820b1de
Author: sraut
Date:   Thu Oct 11 10:49:40 2018 +0530

    Added comments for the change in syrk small matrix change.

    Change-Id: I958939e9953323730da49ef07d1b10e578837d82

commit 53a9ab1c85be14dcfd2560f5b16e898e3e258797
Author: Field G. Van Zee
Date:   Wed Oct 10 15:11:09 2018 -0500

    Renamed thread auto-factorization macro constants.

    Details:
    - Renamed the following C preprocessor macros whose fallback/default values are specified within frame/include/bli_kernel_macro_defs.h:
        BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
        BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
        BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
        BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
    - Renamed the above cpp macro overrides within the knl, skx, and zen sub-configurations, as well as invocations of those macros in bli_rntm.c.
    - Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer used by any code within BLIS.

commit 637c2ce794b0414ba8b25e9a452f7d64f825d63a
Author: Field G. Van Zee
Date:   Tue Oct 9 17:18:04 2018 -0500

    Updated column index range for irun.py -q.

    Details:
    - Forgot to apply the column index range fix in 10f179f to situations when "quiet" mode (-q) is requested. This commit applies the new column index range modifications to the quiet case.

commit e2a59400bdda7ed7ee0ff00edea70c00ed593b6c
Author: Field G. Van Zee
Date:   Tue Oct 9 15:29:48 2018 -0500

    Allow trsm_l parallelism in the jc loop.

    Details:
    - Previously, trsm was consolidating all ways of parallelism into the jr loop. This was unnecessary and to some degree detrimental on some types of hardware. Now, any parallelism bound for the jc loop will be applied to the jc loop, while all other loops' parallelism is funneled to the jr loop. Thanks to Devangi Parikh for helping investigate this issue and suggesting the fix.
    - NOTE: This change affects only left-side trsm. However, right-side trsm is currently implemented in terms of the left-side case, and thus the change effectively applies to both left and right cases.

commit f1dba506c970f14e612580d3c171e7c5ffd0a5fb
Author: Field G. Van Zee
Date:   Mon Oct 8 17:59:41 2018 -0500

    Output threading status/params from testsuite.

    Details:
    - Updated testsuite to output various parameters related to parallelism in BLIS. These parameters include:
      - threading status: disabled, openmp, or pthreads;
      - thread partitioning for jr/ir loops: slab or rr (round-robin);
      - ways of parallelism from environment variables, and also actual values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for square problems (assuming all dimensions are set to 1000);
      - automatic thread factorization parameters.
    - Also output the status of two relatively new configure-time options: libmemkind and the sandbox.

commit 10f179fb13fc1179921a4ef8efdd2174f01e07da
Author: Field G. Van Zee
Date:   Mon Oct 8 14:36:38 2018 -0500

    Updated irun.py to use updated column index range.

    Details:
    - Updated the irun.py script so that it updates the matlab column index range (if found) to reflect the additional columns of data that are substituted in. Thanks to Devangi Parikh for recognizing and reporting this issue.

commit c244a716c97849dee41f52b5f424116aae1b710b
Author: Field G. Van Zee
Date:   Sun Oct 7 20:59:40 2018 -0500

    Added missing -r option to configure --help output.

    Details:
    - Added inadvertently-omitted mention of -r option-equivalent to --thread-part-jrir to the output for 'configure --help'. Also made minor edits to the same text.

commit c92762ecdca1eb0b08c8acd583b4739a1e3fbd39
Author: Field G. Van Zee
Date:   Sun Oct 7 20:30:32 2018 -0500

    Added option of slab or rr partitioning in jr/ir.

    Details:
    - Updated existing macrokernel function names and definitions to explicitly use slab assignment of micropanels to threads, then created duplicate versions of macrokernels that explicitly use round-robin assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels were not substantially updated in this commit because they are currently disabled in bli_trsm_front.c.
    - Updated existing packing function (in blk_packm_blk_var1.c) to explicitly use slab partitioning, and then duplicated for round-robin.
    - Updated control tree initialization to use the appropriate macrokernel and packm function pointers depending on which method (slab or rr) was enabled at configure-time.
    - Updated configure script to accept new --thread-part-jrir=[slab|rr] option (-m [slab|rr] for short), which allows the user to explicitly request either slab or round-robin assignment (partitioning) of micropanels to threads.
    - Updated sandbox/ref99 according to above changes.
    - Minor updates to build/add-copyright.py.

commit 98e01ea04bfe1032e5bd4781043afd84f864a19e
Merge: ac18949a 541b8a3b
Author: Field G. Van Zee
Date:   Thu Oct 4 20:44:12 2018 -0500

    Merge branch 'master' into amd

commit 541b8a3b3e9af4078f5e6fb2f9608d681839952a
Author: Field G. Van Zee
Date:   Thu Oct 4 20:39:06 2018 -0500

    Removed 1h short-circuit from bli_clock_min_diff().

    Details:
    - Removed a guard from bli_clock_min_diff() that would return 0 if the time delta was greater than 60 minutes. This was originally intended to disregard extremely large values under the assumption that the user probably didn't intend to run a test that long. However, since it is in bli_clock_min_diff(), it doesn't actually help short-circuit an implementation that is hanging or looping infinitely, since such an implementation would first have to finish before the bli_clock_min_diff() is called. Thanks to Kiran Varaganti for reporting this issue.

commit f0c3ef359f7c6c1687fb2671cb35deb346e00597
Author: Kiran V
Date:   Thu Oct 4 16:32:21 2018 +0530

    This is a fix to floating-point exception error for BLIS SGEMM with larger matrix sizes.
    BUG No: CPUPL-197
    fixed by Thangaraj Santanu
    The bli_clock_min_diff() function in BLIS assumed that if the time taken is greater than 1 hour then the reading must be wrong. However this is not the case in general, while the other checks such as time taken closer to zero or nsec is of course valid.
    gerrit review: http://git.amd.com:8080/#/c/118694/1/frame/base/bli_clock.c

    Change-Id: I9dc313d7c5fdc20684f67a516bf3237de3e0694a

commit 8bf30eb4735872388b5317883d99b775a344ce25
Author: Devangi N. Parikh
Date:   Wed Oct 3 22:22:29 2018 -0400

    Fixed runme.sh in test/studies/thunderx2

    Details:
    - Fixed the setting of threads for a single core run.

commit f6f2456ba2afa8f85f43c7c2c90acc439d61d94f
Author: Devangi N. Parikh
Date:   Wed Oct 3 21:43:46 2018 -0400

    Fixed the Makefile in test/studies/thunderx2

    Details:
    - Fixed target for make-all-st and make-all-mt so that the armpl targets are built

commit 743a1a6dec1bd3908f0f15513b501c9bd59715b3
Author: Field G. Van Zee
Date:   Wed Oct 3 14:40:10 2018 -0500

    Fixed misleading version query from gcc 7+.

    Details:
    - gcc 7 introduced new behavior to the -dumpversion option whereby only the major version component is output. However, as part of this change, gcc 7 also introduced a new option, -dumpfullversion, which is guaranteed to always output the major, minor, and revision numbers. If we are using gcc 7 or later, we re-query the version string with this new option and then re-parse the result so as to avoid misleading output from configure (e.g. using gcc 7.3.0 is reported as 7.7.7).

commit de07840ba5672b9d7b2ed2b918974e98c3f249fb
Author: Field G. Van Zee
Date:   Wed Oct 3 13:57:25 2018 -0500

    Whitespace, https updates to README.md.

    Details:
    - Reformatted to fit all lines within 80 columns, unless a link is too long to fit on a single line.
    - Changed some links from http to https.

commit 80a8b3dd8034ec8bc03d31be3f9c837c3f6fc94b
Author: sraut
Date:   Wed Oct 3 15:30:33 2018 +0530

    Review comments incorporated for small TRSM.

    Change-Id: Ia64b7b2c0375cc501c2cb0be8a1af93111808cd9

commit b8dfd82e0d1afda4ee5436662d63515a59b2dee3
Author: Devin Matthews
Date:   Tue Oct 2 15:37:12 2018 -0500

    Get pthreads via blis.h in the test driver.

commit d0c0c20b7bd3ecf914b5910a50f618fb7d7aa355
Author: Devin Matthews
Date:   Tue Oct 2 15:16:00 2018 -0500

    There seems to be a problem with _POSIX_BARRIERS on Travis.

commit 0904d9e4df0c8a256ac35c491f14a587ebe9fca2
Author: Devin Matthews
Date:   Tue Oct 2 15:04:36 2018 -0500

    *Always* use Windows primitives instead of pthreads.

commit 998317d309934cd7129f8c818ea6e5f07534ebc8
Author: Devin Matthews
Date:   Tue Oct 2 14:43:24 2018 -0500

    Remove pthreads from appveyor build.

commit 627d0c5bfd4b7b149803587391c93b164c11ced5
Author: Devin Matthews
Date:   Tue Oct 2 14:40:55 2018 -0500

    Combine the alternative barrier implementation for macOS with the pthread wrapper for Windows. Also implement pthread_{create,join} for Windows.
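
Note on the slab and round-robin (rr) partitioning named in c92762e and f1dba50 above: the two ways of assigning jr/ir loop iterations (micropanels) to threads can be sketched as follows. This is a Python illustration of the general idea, not the BLIS implementation. The function names slab_range and rr_range are hypothetical, and the way leftover iterations are distributed in the slab case (the lowest-numbered threads each take one extra) is an assumption.

```python
def slab_range(n_iter, n_threads, thread_id):
    """Slab: each thread receives one contiguous block of iterations."""
    base, extra = divmod(n_iter, n_threads)
    # Assumption: the first `extra` threads each take one leftover iteration.
    start = thread_id * base + min(thread_id, extra)
    end = start + base + (1 if thread_id < extra else 0)
    return list(range(start, end))


def rr_range(n_iter, n_threads, thread_id):
    """Round-robin: iterations are interleaved across threads."""
    return list(range(thread_id, n_iter, n_threads))


# Hypothetical case: 10 micropanels shared among 4 threads.
slab = [slab_range(10, 4, t) for t in range(4)]
rr = [rr_range(10, 4, t) for t in range(4)]

print(slab)  # contiguous blocks per thread
print(rr)    # strided (interleaved) iterations per thread
```

Both schemes cover every iteration exactly once. Slab keeps each thread's micropanels adjacent in memory, while round-robin spreads iterations of uneven cost more evenly across threads, which may be why the log describes round-robin for the triangular regions of herk and trmm.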

commit 81d2c064a209df7eca7d6103696ca3a137a7f82e
Author: Devin Matthews
Date:   Tue Oct 2 11:46:36 2018 -0500

    Add wrapper for basic pthreads functionality (mutex, once) with MSVC.

commit d33f130ea621fca1dccb30631f454d237918eb04
Author: Devin Matthews
Date:   Tue Oct 2 11:45:43 2018 -0500

    Some configure changes:
    1) Allow environment variables to be set anywhere in the argument list.
    2) Allow any environment variable to be set.
    3) Allow LIBPTHREAD to be set to null without getting defaulted to -lpthread.

commit 9d5f1c4f3bf70c2c0ea84bfa326a0113ae2d176c
Author: Field G. Van Zee
Date:   Mon Oct 1 17:39:26 2018 -0500

    Patch to avoid gcc warning in blastest/f2c/open.c.

    Details:
    - Use the modulo operator to limit the size of an integer that is given to sprintf(). This avoids a warning in some versions of gcc about the integer potentially overflowing the available space in the string into which the integer is being printed.

commit 0c3cd00ba76de607e807f8deb04b1a2ce18ea7a8
Author: Field G. Van Zee
Date:   Mon Oct 1 16:18:25 2018 -0500

    More README.md updates.

    Details:
    - Replaced much of "Getting Started" section with a shortened version of the bullet list of documentation currently shown in the github wiki page. Thanks to Devangi Parikh for her feedback in this change.

commit 8eaf34bd23b30a1857a50d7142ee9811895f24bf
Author: Field G. Van Zee
Date:   Mon Oct 1 14:29:07 2018 -0500

    Very minor README.md update.

commit 599090e0eb41b2706fa1231fa7b90096f3281678
Author: Field G. Van Zee
Date:   Mon Oct 1 14:04:30 2018 -0500

    README.md update.

    Details:
    - Added language mentioning SHPC group to Introduction.

commit ee46fa3efb6e920fa6c3d0b0601007f5de31deb5
Author: sraut
Date:   Mon Oct 1 16:30:30 2018 +0530

    Small TRSM optimization changes :-
    1) single precision small trsm kernels for XAt=B case are further optimized for performance.
    2) double precision small trsm kernels for AX=B and XAtB cases are implemented.
    3) single precision small trsm kernels for AutX=B are implemented in intrinsics to improve the current performance.

    Change-Id: Ic9d67ae6d8522615257dde018903f049dcffa2cf

commit 08045a6c52b6e025652c5b18eb120c0f4e61cf6f
Author: sraut
Date:   Mon Oct 1 15:38:23 2018 +0530

    Corrected the fix made for blastest level-3 failure to check m,n,k non-zero condition in bli_gemm_small.c

    Change-Id: Idaf9f2327c3127b04a2738ae8a058b83d6c57934

commit ac18949a4b9613741b9ea8e5026d8083acef6fe4
Author: Field G. Van Zee
Date:   Sun Sep 30 18:54:56 2018 -0500

    Multithreading optimizations for l3 macrokernels.

    Details:
    - Adjusted the method by which micropanels are assigned to threads in the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly) employ contiguous "slab" partitioning rather than interleaved (round robin) partitioning. The new partitioning schemes and related details for specific families of operations are listed below:
      - gemm: slab partitioning.
      - herk: slab partitioning for region corresponding to non-triangular region of C; round robin partitioning for triangular region.
      - trmm: slab partitioning for region corresponding to non-triangular region of B; round robin partitioning for triangular region. (NOTE: This affects both left- and right-side macrokernels: trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
      - trsm: slab partitioning. (NOTE: This affects only left-side macrokernels trsm_ll, trsm_lu; right-side macrokernels were not touched.)
      Also note that the previous macrokernels were preserved inside of the 'other' directory of each operation family directory (e.g. frame/3/gemm/other, frame/3/herk/other, etc).
    - Updated gemm macrokernel in sandbox/ref99 in light of above changes and fixed a stale function pointer type in blx_gemm_int.c (gemm_voft -> gemm_var_oft).
    - Added standalone test drivers in test/3m4m for herk, trmm, and trsm and minor changes to test/3m4m/Makefile.
    - Updated the arguments and definitions of bli_*_get_next_[ab]_upanel() and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
    - Renamed bli_thread_get_range*() APIs to bli_thread_range*().

commit b952ca8feb6f17f71a4512649c2aa72bdee9c8f4
Author: Field G. Van Zee
Date:   Fri Sep 28 16:12:32 2018 -0500

    CREDITS file update.

commit 7d96fc437ebaa9dd2d7071865b5df16402fadd64
Author: Field G. Van Zee
Date:   Fri Sep 28 15:40:45 2018 -0500

    Allow slashes ('/') in version tags.

    Details:
    - Updated the configure script to allow slashes in version string. This is needed so that downstream maintainers (such as those for Debian) can create local tags such as "upstream/0.4.1". Thanks to M. Zhou for reporting this issue via PR #256 and providing me the information needed to debug the problem.

commit 5fdddf6f37c64da093c7f59e3a85214e819ae652
Author: Field G. Van Zee
Date:   Fri Sep 28 11:25:54 2018 -0500

    Removed 'debian' directory.

    Details:
    - Removed the top-level 'debian' directory. This directory is apparently no longer needed (issue #257). Thanks to M. Zhou and Nico Schlömer for their contributions.

commit 9814cfdf3157ef4726ee604fc895d56e8063d765
Author: Meghana
Date:   Fri Sep 28 11:02:39 2018 +0530

    fixed blastest level-3 failure by adding ((M&N&K) != 0) to check condition in bli_gemm_small.c

    Change-Id: I85e4a32996ebb880f3c00bd293edc38f74700fe6

commit 86330953b14c180862deef3ccdcc6431259be27b
Merge: 7af5283d 807a6548
Author: praveeng
Date:   Fri Sep 28 10:08:06 2018 +0530

    Resolved conflicts and modified bli_trsm_small.c

    Change-Id: I578d419cff658003e0fdd4c4cdc93145d951ce31

commit 60b2650d7406d266feffe232c2d5692a9e3886d0
Author: Field G. Van Zee
Date:   Mon Sep 24 15:04:45 2018 -0500

    Added statistics-collecting irun.py script.

    Details:
    - Added irun.py script to 'build' directory. This irun.py script is a python script for repeatedly invoking a test driver executable, such as those found in test/3m4m, and replacing the performance output column with four columns that aggregate statistics.
      Specifically, the script reports the minimum, average, maximum, and standard deviation for each problem size. This script is useful especially (though not exclusively) when trying to determine the impact of relatively minor changes to the code, or other small optimizations that may be difficult to distinguish from "noise." One way this "noise" manifests is that a test executable may run slightly slower or faster for all problem sizes (and all implementations) tested by the executable over the life of a single execution. The cause of these minor across-the-board perturbations in the overall performance signatures is unknown, though we hypothesize that it may relate to any number of issues such as operating system scheduling, where in memory the program is loaded, or how the CPU clock frequency is throttled at the time of execution. Regardless of the source of these subtle performance anomalies, the statistical properties reported by the irun.py script help the user to more precisely characterize the underlying performance exhibited by any given test driver, which allows him or her to make better judgments about the true difference in performance between two implementations, or minor changes within a single implementation.

commit 807a654888117fb3a27ea36384f1c1c11b882cd5
Author: Field G. Van Zee
Date:   Thu Sep 20 15:41:05 2018 -0500

    Fixed confusing configure message for libmemkind.

    Details:
    - Corrected feedback echoed to user by configure when libmemkind is found but not explicitly requested. In these cases, configure would echo a message that it had received an explicit request to enable libmemkind, which was not accurate, even if the end result was the same--that libmemkind is enabled by default when it is found. Thanks to Devangi Parikh for reporting this issue.

commit 02adab427c779b0aaf38a5877a5f0246b1909e8f
Author: Devangi N. Parikh
Date:   Thu Sep 20 14:38:50 2018 -0400

    Created a 'thunderx2' subdirectory within test/studies

    Details:
    - Created a 'thunderx2' subdirectory within test/studies to house various level-3 test drivers used to measure performance on ThunderX2.

commit d7537fb51dac0636591fc7c68261a2322642ab3c
Merge: dad07245 c03728f1
Author: Field G. Van Zee
Date:   Wed Sep 12 15:24:20 2018 -0500

    Merge branch 'dev'

commit dad07245dbcfaf35232ec379ba756eb133c361c1
Author: Devangi N. Parikh
Date:   Wed Sep 12 04:16:58 2018 -0500

    Fixed yet another bug in runme script in test/studies

    Details:
    - Fixed another copy-paste bug

commit e669057fe35f2037d8111af687d84a0ecf6d7a2a
Author: Devangi N. Parikh
Date:   Tue Sep 11 22:29:42 2018 -0500

    Fixed bug in runme script in test/studies

    Details:
    - Fixed bug in runme script for skx studies that set the number of threads incorrectly

commit 232fdc3df3e01ae3f86d53767bd14eb93b511e6e
Author: Devangi N. Parikh
Date:   Mon Sep 10 18:45:50 2018 -0500

    Updated runme script in test/studies.

    Details:
    - Updated runme script for skx studies to run multithreading tests on 1 and 2 sockets.

commit c03728f1f45edb5e434db90ab8a77ba0184a682b
Author: Field G. Van Zee
Date:   Mon Sep 10 17:54:27 2018 -0500

    Various minor cleanups.

    Details:
    - Rewrote bli_winsys.c to define bli_setenv() and bli_sleep() unconditionally, but differently for Windows and non-Windows, but then disabled the definition of bli_setenv() entirely since BLIS no longer needs to set environment variables. Updated bli_winsys.h accordingly, and call bli_sleep() from within testsuite instead of sleep() directly.
    - Use
        #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
      instead of
        #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
      when guarding against local definition of pthread barrier in testsuite.
      (The description for unistd.h implies that _POSIX_BARRIERS should always be set to 200809L when barriers are supported, though I won't be surprised if we encounter a case in the future where it is set to something else such as 1 while still supported.)
    - Removed old _VERS_CONF_INST definitions and installation rules in top-level Makefile. These are no longer needed because we no longer output libraries with the version and configuration name as substrings.
    - Comment/whitespace updates in Makefile, config.mk.in, common.mk, configure, bli_extern_defs.h, and test_libblis.h.
    - Added mention of 1m to README.md and other trivial tweaks.

commit e249a00a82908054ecd307cf602c8801275903e8
Author: Field G. Van Zee
Date:   Mon Sep 10 16:48:35 2018 -0500

    Imported skx dgemm ukernel from skx-redux branch.

    Details:
    - Added the new bli_dgemm_skx_asm_16x14.c microkernel from the skx-redux branch, along with appropriate blocksizes in bli_cntx_init_skx.c and a prototype in bli_kernels_skx.h. (Devin has not yet written the sgemm analogue, so for now we will continue using the older sgemm ukernel.)
    - Updated frame/include/bli_x86_asm_macros.h with a minor change that was present within the skx-redux branch.
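
Note on the _POSIX_BARRIERS guard change in c03728f above: the effect of the old and new preprocessor conditions can be compared with a small truth table. The sketch below models the two #if expressions in Python, with None standing for "macro not defined". The function names and the sample values are assumptions chosen for illustration; only the two conditions themselves come from the commit message.

```python
POSIX_2008 = 200809  # value the commit text expects when barriers are supported


def old_guard(posix_barriers):
    """Models: #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)"""
    return posix_barriers is None or posix_barriers < 0


def new_guard(posix_barriers):
    """Models: #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)"""
    return posix_barriers is None or posix_barriers != POSIX_2008


# True means the testsuite would compile its own local barrier definition.
for value in [None, -1, 0, 1, POSIX_2008]:
    print(value, old_guard(value), new_guard(value))
```

The two guards differ for values such as 0 or 1: the old guard would rely on the system barrier, while the new one falls back to the local definition. The value 1 is the case the commit message anticipates might still mean "supported" on some future system.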

commit e93b01ff60bf9742baa5eefd93e208d1219e7a43
Author: Isuru Fernando
Date:   Sun Sep 9 15:57:43 2018 -0500

    Windows DLL support (#246)

    * Enable shared
    * Enable rdp
    * Add support for dll
    * Use libblis-symbols.def
    * Fix building dlls
    * Fix libblis-symbols.def
    * Fix soname
    * Fix Makefile error
    * Fix install target
    * Fix missing symbols
    * Add BLIS_MINUS_TWO
    * Add path to dll
    * Fix OSX soname
    * Add declspec for dll
    * Add -DBLIS_BUILD_DLL
    * Replace @enable_shared@ in config
    * switch to auto for now
    * blis_ -> bli_
    * Remove BLIS_BUILD_DLL in make check
    * change auto->haswell
    * enable_shared_01
    * Add wno-macro-redefined
    * print out.cblat3
    * BLIS_BUILD_DLL -> BLIS_IS_BUILDING_LIBRARY
    * Use V=1
    * Remove fpic for windows
    * Remember LIBPTHREAD
    * Remove libm for windows
    * Remember AR
    * Fix remembering libpthread
    * Add Wno-maybe-uninitialized in only gcc
    * Don't do blastest for shared for now
    * Fix install target
      And remove unnecessary change
    * test auto and x86_64
    * Fix install target again
    * Use IS_WIN variable
    * Remove leading dot from LIBBLIS_SO_MAJ_EXT
    * Make is_win yes/no
    * Add comments for windows builds
    * Change if else blocks location

commit 1330d5c4bc3b644ec0af54c3939a5b9f00eacd9c
Author: Field G. Van Zee
Date:   Fri Sep 7 19:37:59 2018 -0500

    Employ "user" cflags for tl Makefile test targets.

    Details:
    - Use get-user-cflags-for() to generate cflags when compiling BLAS test drivers and BLIS testsuite from top-level Makefile. Meant to include these changes in previous commit (4b5437e). Thanks to Isuru Fernando for pointing out this oversight.

commit 4b5437ec7afb2befffffbb83f7872bcb4fc61e51
Author: Field G. Van Zee
Date:   Fri Sep 7 17:24:32 2018 -0500

    Define a cpp macro specific to BLIS compilation.

    Details:
    - Tweaked the cflags functions in common.mk so that a new preprocessor macro, BLIS_IS_BUILDING_LIBRARY, is defined, but only when BLIS itself is being built.
      This macro will not be defined when, for example, the testsuite or example code compiles code local to those applications. This was done in part by defining a new cflags function get-user-cflags-for(), which is now the designated function for application Makefiles if they wish to inherit a basic set of CFLAGS from BLIS. (The compiler flags returned are identical to those of get-frame-cflags-for() except that -DBLIS_IS_BUILDING_LIBRARY is omitted.)
    - Updated all test driver-like makefiles to call get-user-cflags-for() instead of get-frame-cflags-for().

commit cc2cca4f56eb30212a0dce3e5c121e64d9e59560
Merge: e19e7212 fb81c7fc
Author: Field G. Van Zee
Date:   Thu Sep 6 17:12:13 2018 -0500

    Merge branch 'dev'

commit e19e7212872da3d464734199193436faa51f0da0
Merge: 97965b09 b3d0702c
Author: Jeff Hammond
Date:   Thu Sep 6 14:58:49 2018 -0700

    Merge pull request #244 from kali/pthread-barrier-osx

    add an adhoc impl for pthread_barrier

commit b3d0702cf2ef6dda19a23dd8a677be1b6f73c322
Merge: 4e7d0670 97965b09
Author: Jeff Hammond
Date:   Thu Sep 6 14:58:23 2018 -0700

    Merge branch 'master' into pthread-barrier-osx

commit 4e7d06700f176a62952d7d51e41fdcbc6b7a9d5f
Author: Mathieu Poumeyrol
Date:   Thu Sep 6 23:48:31 2018 +0200

    second __APPLE__

commit fb81c7fc665d68e6a2add163feb29acc0bce8936
Author: Field G. Van Zee
Date:   Thu Sep 6 16:29:39 2018 -0500

    Defined cortexa53 sub-configuration.

    Details:
    - Added a new sub-configuration 'cortexa53', which is a mirror image of cortexa57 except that it will use slightly different compiler flags. Thanks to Mathieu Poumeyrol for making this suggestion after discovering that the compiler flags being used by cortexa57 were not working properly in certain OS X environments (the fix to which is currently pending in pull request #245).

commit 24ecc0d94aaa9ab4df1ae6d199c4ec6d7783169f
Author: Mathieu Poumeyrol
Date:   Thu Sep 6 22:10:16 2018 +0200

    use _POSIX_BARRIERS instead of __APPLE__

commit 97965b09059a610db06fb7a22bdfa79c0d37d673
Author: Mathieu Poumeyrol
Date:   Thu Sep 6 21:10:29 2018 +0200

    cortexa9 and cortexa53 travis build + qemu test (#245)

commit a6802eab7d94b5a9de633c53beca8245b74f5dc6
Author: Mathieu Poumeyrol
Date:   Thu Sep 6 17:16:35 2018 +0200

    reinstantiate test on macos

commit d688a2b7e5a19cba44ea398a99e325e19b8fce50
Author: Mathieu Poumeyrol
Date:   Thu Sep 6 15:25:16 2018 +0200

    add an adhoc impl for pthread_barrier

commit ab9f9e684dc3ffbb70cc45b21c67af5d916919e5
Author: Field G. Van Zee
Date:   Thu Aug 30 15:14:02 2018 -0500

    CHANGELOG update (0.4.1)

commit 10fd614031307c46db3d893528d4e5fc31f490b3 (tag: 0.4.1)
Author: Field G. Van Zee
Date:   Thu Aug 30 15:13:59 2018 -0500

    Version file update (0.4.1)

commit 08dd67c4b21244851f8416bd59159bea7a9c5b3d
Author: Field G. Van Zee
Date:   Thu Aug 30 15:12:13 2018 -0500

    ReleaseNotes.md update in advance of next version.

commit 4fa4cb0734e7de6505b5d6f1aeef3a5d5c89dcbb
Author: Field G. Van Zee
Date:   Wed Aug 29 18:06:41 2018 -0500

    Trivial comment header updates.

    Details:
    - Removed four trailing spaces after "BLIS" that occur in most files' commented-out license headers.
    - Added UT copyright lines to some files. (These files previously had only AMD copyright lines but were contributed to by both UT and AMD.)
    - In some files' copyright lines, expanded 'The University of Texas' to 'The University of Texas at Austin'.
    - Fixed various typos/misspellings in some license headers.

commit b051ffb815baf6c3ece2b5118b679fd9219d5780
Merge: 6f33d9de aaa549f4
Author: Field G. Van Zee
Date:   Wed Aug 29 17:06:48 2018 -0500

    Merge branch 'dev'

commit 6f33d9de21fbc2f579846b9104fb9d513753f79c
Author: Mathieu Poumeyrol
Date:   Wed Aug 29 23:48:22 2018 +0200

    fix compilation of armv7a kernels (#242)

commit 8199e339aefdd27019c7f3d8c99818d375d5400b
Author: Field G. Van Zee
Date:   Mon Aug 27 07:00:12 2018 -0500

    Added testsuite threading to input.general.fast.

    Details:
    - Added lines associated with the testsuite's new threading option to input.general.fast. This change was intended for the previous commit (10d0735).

commit 10d07357afbb2d468837aa97369ef9a6d0610817
Author: Field G. Van Zee
Date:   Sun Aug 26 20:34:30 2018 -0500

    Better thread safety; added threading to testsuite.

    Details:
    - Replaced critical sections that were conditional upon multithreading being enabled (via pthreads or OpenMP) with unconditional use of pthreads mutexes. (Why pthreads? Because BLIS already requires it for its initialization mechanism: pthread_once().) This was done in bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's mtx_t object and bli_mutex_*() API with pthread mutexes in bli_thread.c. The previous status quo could result in a race condition if the application called BLIS from more than one thread. The new pthread-based code should be completely agnostic to the application's threading configuration. Thanks to AMD for bringing to our attention the need for a thread-safety review.
    - Added an option to the testsuite to simulate application-level multithreading. Specifically, each thread maintains a counter that is incremented after each experiment. The thread only executes the experiment if: counter % n_threads == thread_id. In other words, the threads simply take turns executing each problem experiment. Also, POSIX guarantees that fprintf() will not intermingle output, so output was switched to fprintf() instead of libblis_test_fprintf().
    - Changed membrk_t objects to use pthread_mutex_t instead of mtx_t and replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with wrappers to pthread_mutex_init()/_destroy().
    - Changed the implementation of bli_l3_ind_oper_enable_only() to fix a race condition; specifically, two threads calling the function with the same parameters could lead to a non-deterministic outcome.
- Added #include to bli_cpuid.c and moved the same in bli_arch.c. - Added 'const' to declaration of OPT_MARKER in bli_getopt.c. - Added #include to bli_system.h. - Added add-copyright.py script to automate adding new copyright lines to (and updating existing lines of) source files. commit aaa549f4d1e63929fe2bea023ce849253cfbbb42 Author: Field G. Van Zee Date: Sun Aug 26 20:13:51 2018 -0500 Minor update to configure --help (--sharedir option). Details: - Fixed/tweaked description for --sharedir=SHAREDIR option. commit 573b8ac373f821a65cc8afd51cdbe03b8ec01081 Author: Field G. Van Zee Date: Sun Aug 26 13:51:32 2018 -0500 Fixed copy-paste typo in previous commit. Details: - Fixed a typo in travis/do_testsuite.sh introduced in 62ea1d3. commit 62ea1d33d3bc1e890420a1e828b9d0e87e87533b Author: Field G. Van Zee Date: Sun Aug 26 13:35:53 2018 -0500 Fixed broken out-of-tree builds. Details: - Fixed stale filepaths to check-blastest.sh and check-blistest.sh in travis/do_testsuite.sh and travis/do_sde.sh. - Create a symbolic link to the 'config' directory so that the top-level Makefile can find the configs' make_defs.mk files during out-of-tree builds. - Added additional case handling to out-of-tree scenario to handle situations where files 'Makefile', 'common.mk', or 'config' exist but are not symbolic links. In such cases, configure warns the user and exits. - Homogenized various error messages throughout configure. - Belated thanks to Victor Eijkhout for requesting the feature added in 0f491e9 whereby lesser Makefiles can compile and link against an existing installation of BLIS. commit 0f491e994a7e14d4dfce26e6a51dba2bccad29a3 Author: Field G. Van Zee Date: Sat Aug 25 20:12:36 2018 -0500 Allow lesser Makefiles to reference installed BLIS. Details: - Updated the build system so that "lesser" Makefiles, such as those belonging to example code or the testsuite, may be run even if the directory is orphaned from the original build tree.
This allows a user to configure, compile, and install BLIS, delete the build tree (that is, the source distribution, or the build directory for out-of-tree builds) and then compile example or testsuite code and link against the installed copy of BLIS (provided the example or testsuite directory was preserved or obtained from another source). The only requirement is that make be invoked while setting the BLIS_INSTALL_PATH variable to the same installation prefix used when BLIS was configured. The easiest syntax is: make BLIS_INSTALL_PATH=/install/prefix though it's also permissible to set BLIS_INSTALL_PATH as an environment variable prior to running 'make'. - Updated all lesser Makefiles to implement the new aforementioned build behavior. - Relocated check-blastest.sh and check-blistest.sh from build to blastest and testsuite, respectively, so that if those directories are copied elsewhere the user can still run 'make check' locally. - Updated docs/Testsuite.md with language that mentions this new option of building/linking against an installed copy of BLIS. commit 36ff92ce0d3b428b15b6cddc6f5944afe22e43ec Author: Field G. Van Zee Date: Fri Aug 24 18:26:09 2018 -0500 Missing C++ compiler no longer fatal to configure. Details: - Changed configure so that the absence of any C++ compiler from the pre-defined search list does not result in an exit. Instead, in this situation, the found_cxx variable is assigned 'c++notfound' and the error message is changed to remind the user that C++ will not be available in the sandbox. Thanks to Devangi Parikh for reporting this issue. - Also tweaked the message when a C++ compiler *is* found to remind any would-be confused user that BLIS will only use C++ if it is needed by code in the sandbox. commit 658f0a129bdc565b072696b6ebddce501132091c Author: Field G. Van Zee Date: Fri Aug 24 17:49:37 2018 -0500 Fixed obscure integer size bug in va_arg() usage.
Details: - Fixed a bug in the way that the variadic bli_cntx_set_l3_nat_ukrs() function was defined. This function is meant to take a microkernel id, microkernel datatype, microkernel address, and microkernel preference as arguments, and is typically called within the bli_cntx_init_*() function defined within a sub-configuration for initializing an appropriate context. The problem is with the final argument: the microkernel preference. These preferences are actually boolean values, 0 or 1 (encoded as FALSE or TRUE). Since the variadic function does not give the compiler any type information for any variadic arguments, they are "promoted" in the course of internal (macroized) processing according to default argument promotion rules. Thus, integer literals such as 0 and 1 become int and floating-point literals (such as 0.0 or 1.0) become double. Previous to this commit, we indicated to va_arg() that the ukernel preference was a 'bool_t', which is a typedef of int64_t on 64-bit systems. On systems where int is defined as 64 bits, no problems manifest since int is the same size as the type we passed in to va_arg(), but on systems where int is 32 bits, the ukernel preference could be misinterpreted as a garbage value. (This was observed on a modern armv8 system.) The fix was to interpret the bool_t value as int and then immediately typecast it to and store it as a bool_t. Special thanks to Devangi Parikh for helping track down this issue, including deciphering the use of va_arg() and its byzantine treatment of types. - Added explicit typecasts for all invocations of va_arg() in bli_cntx.c. commit e71dc389120b032e42091e4d1a928515ed6f7275 Author: Field G. Van Zee Date: Fri Aug 24 15:56:04 2018 -0500 Fixed a very minor memory leak in gks. Details: - Fixed a memory leak in the global kernel structure that resulted in 56 bytes per configured architecture (of which only 18 are presently supported by BLIS). 
The leak would only manifest if BLIS was initialized and then finalized before the application terminated. Thanks to Devangi Parikh for helping track down this leak. commit a7e3a5f9753468c8e665e6c5c3b38d22b7c92500 Author: Field G. Van Zee Date: Fri Aug 24 14:51:11 2018 -0500 Fixed uncallable bli_finalize(). Details: - Previously, bli_finalize_once()--which, like bli_init_once(), was implemented in terms of pthread_once()--was using the same pthread_once_t control object being used by bli_init(), thus guaranteeing that it would never be called as long as BLIS had already been initialized. This could manifest as a rather large memory leak to any application that attempted to finalize BLIS midway through its execution (since BLIS reserves several megabytes of storage for packing buffers per thread used). The fix entailed giving each function its own pthread_once_t object. Thanks to Devangi Parikh for helping track down this very quiet bug. commit a79c21c7c17fb4854fd24c73b81ec5543f74082d Author: Field G. Van Zee Date: Thu Aug 23 14:40:46 2018 -0500 Fixed cleanmk target post-1b0f8d6. Details: - Changed the cleanmk target to delete makefile fragments from their new home in obj/$(CONFIG_NAME). The old definition worked only because of a typo (REFERKN_PATH instead of REFKERN_PATH), and only in the non-verbose (V != 1) case. commit ffb57242f3eb1175c991fe1b492595fdaa175c27 Author: Field G. Van Zee Date: Wed Aug 22 18:22:41 2018 -0500 Cosmetic output changes to configure. Details: - Disable sandbox-related obj directory creation, directory mirroring, and makefile fragment generation when a sandbox is not enabled. - Prevent various duplicate actions by configure (such as those mentioned above for sandboxes). commit ac17454aae9ad430f05aa7c156919c6c695c300c Merge: a77bec76 7afd095a Author: Field G. Van Zee Date: Wed Aug 22 15:34:53 2018 -0500 Merge branch 'master' into dev commit a77bec766a01e42f13f8cacbec8c4cbde8ecefef Author: Field G.
Van Zee Date: Wed Aug 22 15:31:29 2018 -0500 Whitespace changes, minor renames in build system. Details: - Minor whitespace cleanup, mostly in the form of spaces -> tabs. - Shortened certain variables' _FRAGMENT_ infixes to _FRAG_ in common.mk. commit 1b0f8d60d1132b56485cc202ebf1246898d3a2a4 Author: Devin Matthews Date: Wed Aug 22 13:19:29 2018 -0700 Generate makefile fragments in build tree (#240) * Make src dir read-only in out-of-tree build test. * Generate makefile fragments in the build tree. commit 7afd095af33690e0175903852b354c9fe46993f6 Author: Field G. Van Zee Date: Wed Aug 22 14:58:24 2018 -0500 Removed skx from code snippet in previous commit. Details: - The docs/ConfigurationHowTo.md document was written with examples that did not yet contain the skx sub-configuration, but the previous commit included bli_arch.c code copied and pasted from a recent commit that does support skx. To keep things consistent, I've removed skx from the recently-added ConfigurationHowTo.md code snippet. commit 48211a980d78673133076e8eced1007b1980f5e6 Author: Field G. Van Zee Date: Wed Aug 22 14:55:02 2018 -0500 Update to docs/ConfigurationHowTo.md. Details: - Added missing language directing the reader to modify the config_name string array in bli_arch.c when adding a new sub-configuration. Thanks to Devangi Parikh for reporting this missing section. commit 65c9096c6e21f3dc2947fa12be9ea3034f8662dc Author: Field G. Van Zee Date: Fri Aug 17 11:44:12 2018 -0500 Fixed broken -p option to configure. Details: - Fixed some stale code that was preventing the -p option to configure from working as expected (though the --prefix option was unaffected). This bug was most likely introduced in 7e5648c (May 7 2018). Thanks to Dave Love for reporting this issue. commit e358d5e497c77b305af462f44266370a596445e2 Author: Field G. Van Zee Date: Thu Aug 16 12:18:45 2018 -0500 README.md update (Funding section). commit a61dd5e7bcf23f7237d407a5e06dd44e1bec9ad0 Author: Field G.
Van Zee Date: Tue Aug 14 17:08:03 2018 -0500 Changed 'test' target to be more like 'check'. Details: - Redefined the 'test' make target in the top-level Makefile so that the final result ("everything passed" or "at least one failure") is echoed to stdout. Note that 'check' is unchanged, and thus is now effectively a fast version of 'test'. - Updated docs/BuildSystem.md to reflect the above change. commit ce5c3a198a7ae1ca676c27da4541d51ed19d16e1 Merge: 4f6745d6 0bbe69d5 Author: Field G. Van Zee Date: Tue Aug 14 16:52:19 2018 -0500 Merge branch 'master' of github.com:flame/blis commit 4f6745d68a2c66511695eff0beb00a82ffc6bbbe Author: Field G. Van Zee Date: Tue Aug 14 16:50:47 2018 -0500 Fixed link error when building only shared library. Details: - Fixed a linker error that occurred when attempting to compile and link the testsuite and/or BLAS test drivers after having configured BLIS to only generate a shared library (no static library). The chosen solution involved (1) adding the local library path, $(BASE_LIB_PATH), to the search paths for the shared library via the link option -Wl,-rpath,$(BASE_LIB_PATH). (2) adding a local symlink to $(BASE_LIB_PATH) that uses the .so major version number so that ld would find the shared library at execution time. Thanks to Sajid Ali for reporting this issue, to Devin Matthews for pointing out the need for the -rpath option, and to Devangi Parikh for helping Sajid isolate the problem. - Added #include to bli_system.h to avoid a compiler warning resulting from using toupper() from bli_string.c without a prototype. Thanks again to Sajid Ali, whose build log revealed this compiler warning. - Added '*.so.*' to .gitignore. - CREDITS file update. commit 0bbe69d5ed260849297d8f2d35b7668d167482ed Author: Devangi N. Parikh Date: Tue Aug 14 14:49:58 2018 -0500 Updated plotting scripts in test/studies. Details: - Fixed indexing on plots to correspond to the removal of dtime in the test drivers.
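Illustrative note on 658f0a1 (the va_arg() integer size fix listed earlier in this log): the rule behind that fix can be sketched in a few lines of C. This is a minimal sketch, not the code from bli_cntx.c. The names my_bool_t and sum_prefs() are hypothetical and exist only for this example; my_bool_t merely mimics a 64-bit boolean typedef like the one the commit message describes.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit boolean typedef (not the BLIS type). */
typedef int64_t my_bool_t;

/* Sums n boolean preferences that the caller passes as plain int values
   (for example the literals 0 and 1). Variadic arguments carry no type
   information, so each one must be fetched with the type it has after
   default argument promotion, which here is int. */
static int64_t sum_prefs(int n, ...)
{
    va_list args;
    int64_t total = 0;

    assert(n >= 0);

    va_start(args, n);
    for (int i = 0; i < n; ++i) {
        /* Fetch as int, then immediately widen to the 64-bit type. */
        my_bool_t pref = (my_bool_t)va_arg(args, int);

        /* The problematic form would be va_arg(args, my_bool_t): the
           caller passed an int, so where int is 32 bits this reads the
           argument with the wrong type and may yield a garbage value. */
        total += pref;
    }
    va_end(args);

    return total;
}
```

For instance, sum_prefs(3, 1, 0, 1) fetches three int arguments and widens each one before adding it to the total.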
commit e93e0e149e087e08eca2885f1a748a4e88ffe55d Author: Field G. Van Zee Date: Tue Aug 7 15:54:30 2018 -0500 Removed redefinition of axpyv, scal2v func types. Details: - Removed a stray/accidental redefinition of axpyv and scal2v function types in frame/1d/bli_l1d_ft.h (probably a copy/paste leftover during development). commit 1deb33bd16349aaa643694d1bd685ff8a9a5f476 Author: Field G. Van Zee Date: Tue Aug 7 15:02:50 2018 -0500 Updated penryn kernels to use new _ker_ft type names. Details: - Updated older _ft kernel type suffixes used within penryn level-1v and -1f kernels to use the newer _ker_ft suffix that was introduced in 0175483. (Thank you Travis CI.) commit 9cb0b023ca91abdc056d726cdc070062e4954611 Author: Field G. Van Zee Date: Tue Aug 7 14:21:07 2018 -0500 INSTALL file update. commit 017548314f3f78f66fbe3264509ac5302bd8d62b Author: Field G. Van Zee Date: Tue Aug 7 14:13:25 2018 -0500 Replaced function chooser macros w/ func ptr arrays. Details: - Previously, most object API functions (_oapi.c) used a function chooser macro that would expand out to an if-elseif-elseif-else conditional that used a num_t datatype to call the appropriate type-specific API (_tapi.c). This always felt a little hackish, and would get in the way somewhat of adding support for new num_t datatypes in the future. So, I've replaced that functionality with code that queries a function pointer that is then typecast appropriately. This model of function calling was already pervasive for kernels queried from the cntx_t structure. It was also already in use in various other functions, such as macrokernels, and this commit simply extends that pattern. - The above change required many new files, mostly header files, that define the function types (mostly _ft.h) for the queriable functions as well as some source files to define the function pointer arrays and their corresponding query functions (_fpa.c).
Various other function types, mostly for kernel function types, were renamed to reduce the potential for confusion with the function types for expert and basic (non-expert) typed API functions. - Removed definitions for all of the "bli_call_ft_*()" function chooser macros from bli_misc_macro_defs.h. commit addce089664561f9f63efa6f107e58fc48d29871 Author: Field G. Van Zee Date: Mon Aug 6 13:18:20 2018 -0500 Format spec and other updates in test, test/3m4m. Details: - Removed the dtime (delta time, or wallclock time) column from the matlab output of all test drivers in test, test/3m4m, test/studies. This value was rarely (if ever) really needed and usually only served to take up screen space. - Updated format specifier in test/studies/skx to use %7.2f instead of %6.3f. - For the test drivers in 'test' directory, added an initial line of output that sets last entry of matlab matrix to zero in order to induce a pre-allocation of the entire array of performance results. commit 94d5ef42c833a4d43e50a80d46dddbd7a56d2db6 Author: Field G. Van Zee Date: Sat Aug 4 15:57:17 2018 -0500 Adjusted gflops format spec in testsuite, test/3m4m. Details: - Changed the format specifier for the gflops column in the testsuite output from %7.3f to %7.2f. This was done mainly to keep the output aligned properly when the expected performance exceeded 1000 gflops. Also, two decimal places still conveys plenty of precision for all practical applications, including just eyeballing performance deltas between two executions (let alone two implementations). - Changed the format specifier for gflops in the test/3m4m drivers from %6.3f to %7.2f (for the same reasons listed above). commit c7ff06bae92b9b6c6656f2030d13486b95417821 Merge: 6074082c ebe998d0 Author: Devangi N. Parikh Date: Wed Aug 1 14:20:41 2018 -0500 Merge branch 'master' of https://github.com/flame/blis commit 6074082cd359dd775ef72478f8f3a281c5a6a6f9 Author: Devangi N.
Parikh Date: Wed Aug 1 13:30:51 2018 -0500 Fixed bug in bli_cntx_set_packm_ker_dt() implementation. Details: - Fixed bug in static function bli_cntx_set_[packm/unpackm]_ker_dt(), which were incorrectly calling bli_cntx_get_[packm/unpackm]_ker_dt to get the corresponding func_t. commit ebe998d06cc56a9a9d66990b6ebf683d6fd0efdf Author: Field G. Van Zee Date: Wed Aug 1 13:24:00 2018 -0500 Fixed typos in BuildSystem.md from previous commit. commit e72a344e94c5ae253f69b60f41d92ca89a5d1d1c Author: Field G. Van Zee Date: Wed Aug 1 13:00:38 2018 -0500 Added table of 'make' targets to BuildSystem.md. Details: - Added a new section to BuildSystem.md that describes the most useful make targets defined in the top-level Makefile. commit 4f60d0288e00586dc921ff57db851f1266ff8e70 Author: Field G. Van Zee Date: Mon Jul 30 19:22:57 2018 -0500 README.md, comment updates. Details: - Added links, and sandbox language to README.md. - Adjusted some comments in high-level level-3 object functions to make clear what bli_thread_init_rntm() does. commit 455d3f49e5c8362395be14c79e6adb5123e29623 Author: Field G. Van Zee Date: Sun Jul 29 18:31:29 2018 -0500 Edits to object/typed API, multithreading docs. commit 922a1c05e06f52c97fb369870dce07233e61c4c9 Author: Field G. Van Zee Date: Sat Jul 28 20:15:55 2018 -0500 More tweaks to README.md. commit a7a0cf2b5d9f1dea5061c0f20eeaf371dfd4ea12 Author: Field G. Van Zee Date: Sat Jul 28 16:59:31 2018 -0500 More edits to docs/Multithreading.md. commit be21d0cf68c330fd0d2048465a43ddc59d0b9d6c Author: Field G. Van Zee Date: Sat Jul 28 16:46:51 2018 -0500 Fixed typos in docs/Multithreading.md. commit eac07c7b4f7a41c68d63f1e67141b2b58009609e Author: Field G. Van Zee Date: Sat Jul 28 16:45:28 2018 -0500 Edits to docs/Multithreading.md. commit 5438375a032273b46ae626fee909ffc05f48ab72 Author: Field G. Van Zee Date: Sat Jul 28 16:34:21 2018 -0500 Fixed link in README.md. commit 1f1a237d3f0b24d71ce2d7ee52d8a84f8e6a29ad Author: Field G.
Van Zee Date: Sat Jul 28 16:33:28 2018 -0500 Fixed links in BLISTypedAPI.md. commit 89c8806e3aa49310f36c0314c5f6956c83a627a1 Author: Field G. Van Zee Date: Sat Jul 28 16:30:56 2018 -0500 Minor doc fixes to previous commit. commit b8c7574f84873b9c408f70c29c41ce464df57c2d Author: Field G. Van Zee Date: Sat Jul 28 16:27:09 2018 -0500 README.md, typed/object API updates. Details: - Updated the typed and object APIs to include language on the rntm_t parameters in the expert interfaces. - Updated README to include link to object API. commit 29c34c4adb02d91fb34d1ccc0e821d6cfb7ce5c5 Author: Field G. Van Zee Date: Fri Jul 27 16:26:19 2018 -0500 CREDITS file update. commit 55a04edf52ac4f16c51b738bc884684adc1f1777 Author: Field G. Van Zee Date: Fri Jul 27 16:10:46 2018 -0500 CHANGELOG update (0.4.0) commit 4ad61ce905d250dd3ef197f0d06a69ce6d99d309 (tag: 0.4.0) Author: Field G. Van Zee Date: Fri Jul 27 16:10:43 2018 -0500 Version file update (0.4.0) commit b86cf13793b07f35c027a56c9faec8f4b6279d3e Author: Field G. Van Zee Date: Fri Jul 27 16:08:21 2018 -0500 Release Notes update in advance of next version. commit a8b4084a0e04e47ac02ceae93a2018f5363e1205 Author: Field G. Van Zee Date: Fri Jul 27 16:07:26 2018 -0500 CREDITS file update. commit 8e10cac5f388ac961c3d77b0a465214e7c9dc91a Author: Field G. Van Zee Date: Fri Jul 27 14:45:35 2018 -0500 Updates to CREDITS, RELEASING, config/README.md. Details: - Added individuals' github handles to CREDITS file. - Updated RELEASING, config/README.md files. commit 401b69c8f26a86726ac5e1fb4f9fc2d2098ef204 Author: Field G. Van Zee Date: Wed Jul 25 17:55:13 2018 -0500 More indentation in docs/ConfigurationHowTo.md. commit 1c6a1b921ef96999bb449d657cca6d9a556f7245 Author: Field G. Van Zee Date: Wed Jul 25 17:14:58 2018 -0500 Trying new indentation in ConfigurationHowTo.md. Details: - Modified a few sections to take advantage of a feature of markdown that allows a bullet or enumeration to have multiple paragraphs. 
This is a trial run to make sure the indentation looks good when rendered in a web browser. commit 71f978719527fcf17617cb234e48bf349a76c12d Author: Field G. Van Zee Date: Wed Jul 25 15:55:36 2018 -0500 Whitespace changes to macrokernels' func ptr defs. commit 87d57c31c2bfcf4609dfe31ce915e9345150e613 Author: Field G. Van Zee Date: Wed Jul 25 14:20:18 2018 -0500 Various minor updates to typed, object API docs. commit fb6e16268aaafbab2fd78d47cbf821e2152261fd Author: Field G. Van Zee Date: Wed Jul 25 14:17:28 2018 -0500 Consolidated prototypes in bli_l1v_tapi.h. Details: - Consolidated typed API function prototypes in bli_l1v_tapi.h by leveraging identical function signatures between operations. - Removed 'restrict' keyword since it is not actually present in the function definitions. commit af60d738f21340ccb0903e6c87dbf6af4fc44fc0 Author: Field G. Van Zee Date: Tue Jul 24 15:35:52 2018 -0500 Finished object creation part of BLISObjectAPI.md. Details: - Filled in remaining section on object creation function reference of BLISObjectAPI.md. All object management functions demonstrated as part of the example code in examples/oapi are now documented, as well as some other functions that are not shown in the example code. - Updated various links (mostly in function index) to correctly point to the object API reference instead of the typed API reference. - Added documentation to getijm, setijm. commit 8217a6a3b68382c62f016c658d337e6086112fef Author: Field G. Van Zee Date: Tue Jul 24 13:13:10 2018 -0500 Moved sandbox README.md to docs/Sandboxes.md. Details: - Relocated sandbox/ref99/README.md to docs/Sandboxes.md and made minor edits to the document. commit b7db29332394324ffd1a73c3847a75e9a5b38c8d Author: Field G. Van Zee Date: Thu Jul 19 11:14:30 2018 -0500 Explicitly typecast return vals in static funcs.
Details: - Added explicit typecasting to various functions (mostly static functions), primarily those in bli_param_macro_defs.h, bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header files. - This change was prompted by feedback from Jacob Gorm Hansen, who reported that #including "blis.h" from his application caused gcc to output error messages (relating to types being returned mismatching the declared return types) when used via the C++ compiler front-end. This is the first pass of fixes, and we may need to iterate with additional follow-up commits (#233). commit fa08e5ead95f9d757af6ab5b095a8bf131e3874d Author: Field G. Van Zee Date: Tue Jul 17 19:02:15 2018 -0500 Fixed minor issues in ecbebe7 with mt disabled. Details: - Fixed an unused variable warning in frame/base/bli_rntm.c when multithreading is disabled. - Fixed a missing variable declaration in bli_thread_init_rntm_from_env() when multithreading is disabled. commit ecbebe7c2e43950dfa369f71c2b83cabe348a046 Author: Field G. Van Zee Date: Tue Jul 17 18:37:32 2018 -0500 Defined rntm_t to relocate cntx_t.thrloop (#235). Details: - Defined a new struct datatype, rntm_t (runtime), to house the thrloop field of the cntx_t (context). The thrloop array holds the number of ways of parallelism (thread "splits") to extract per level-3 algorithmic loop until those values can be used to create a corresponding node in the thread control tree (thrinfo_t structure), which (for any given level-3 invocation) usually happens by the time the macrokernel is called for the first time. - Relocating the thrloop from the cntx_t remedies a thread-safety issue when invoking level-3 operations from two or more application threads. The race condition existed because the cntx_t, a pointer to which is usually queried from the global kernel structure (gks), is supposed to be read-only. However, the previous code would write to the cntx_t's thrloop field *after* it had been queried, thus violating its read-only status.
In practice, this would not cause a problem when a sequential application made a multithreaded call to BLIS, nor when two or more application threads used the same parallelization scheme when calling BLIS, because in either case all application threads would be using the same ways of parallelism for each loop. The true effects of the race condition were limited to situations where two or more application threads used *different* parallelization schemes for any given level-3 call. - In remedying the above race condition, the application or calling library can now specify the parallelization scheme on a per-call basis. All that is required is that the thread encode its request for parallelism into the rntm_t struct prior to passing the address of the rntm_t to one of the expert interfaces of either the typed or object APIs. This allows, for example, one application thread to extract 4-way parallelism from a call to gemm while another application thread requests 2-way parallelism. Or, two threads could each request 4-way parallelism, but from different loops. - A rntm_t* parameter has been added to the function signatures of most of the level-3 implementation stack (with the most notable exception being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert APIs. (A few internal functions gained the rntm_t* parameter even though they currently have no use for it, such as bli_l3_packm().) This required some internal calls to some of those functions to be updated since BLIS was already using those operations internally via the expert interfaces. For situations where a rntm_t object is not available, such as within packm/unpackm implementations, NULL is passed in to the relevant expert interfaces. This is acceptable for now since parallelism is not obtained for non-level-3 operations. - Revamped how global parallelism is encoded. First, the conventional environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only read once, at library initialization.
(Thanks to Nathaniel Smith for suggesting this to avoid repeated calls to getenv(), which can be slow.) Those values are recorded to a global rntm_t object. Public APIs, in bli_thread.c, are still available to get/set these values from the global rntm_t, though now the "set" functions have additional logic to ensure that the values are set in a synchronous manner via a mutex. If/when NULL is passed into an expert API (meaning the user opted to not provide a custom rntm_t), the values from the global rntm_t are copied to a local rntm_t, which is then passed down the function stack. Calling a basic API is equivalent to calling the expert APIs with NULL for the cntx and rntm parameters, which means the semantic behavior of these basic APIs (vis-a-vis multithreading) is unchanged from before. - Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op() and reimplemented, with the function now being able to treat the incoming rntm_t in a manner agnostic to its origin--whether it came from the application or is an internal copy of the global rntm_t. - Removed various global runtime APIs for setting the number of ways of parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well as the corresponding "get" functions. The new model simplifies these interfaces so that one must either set the total number of threads, OR set all of the ways of parallelism for each loop simultaneously (in a single function call). - Updated sandbox/ref99 according to above changes. - Rewrote/augmented docs/Multithreading.md to document the three methods (and two specific ways within each method) of requesting parallelism in BLIS. - Removed old, disabled code from bli_l3_thrinfo.c. - Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md. commit 323eaaab99752858b12e81e2eb8e416f009a3028 Author: Devangi N. Parikh Date: Fri Jul 13 11:40:06 2018 -0500 Removed left over code from plotting scripts. commit 60c197736495b47ce974ffb9b43874d1ebcfe78c Author: Field G.
Van Zee Date: Thu Jul 12 19:22:14 2018 -0500 Documented accessor functions in BLISObjectAPI.md. Details: - Added documentation to docs/BLISObjectAPI.md for a handful of commonly-used obj_t accessor functions. - Minor updates to docs/BLISTypedAPI.md. commit 77327ad796e11ef67df0cc91d45ed663598ba4df Merge: 73b0b2a3 9fef8575 Author: Devangi N. Parikh Date: Thu Jul 12 17:09:33 2018 -0500 Merge branch 'master' of https://github.com/flame/blis commit 73b0b2a3ac1be6dfbe85c116886b4e29d98ac945 Author: Devangi N. Parikh Date: Thu Jul 12 16:53:10 2018 -0500 Created hardware-specific test driver directory. Details: - Created a 'studies' subdirectory within 'test' to be used to house test drivers, makefiles, run scripts, matlab plot code, and related files that have been customized for collecting performance data on specific host machines or product lines. This new setup will help us catalog, track, and share test driver materials over time, and in a way that facilitates reproducibility. - Created an 'skx' subdirectory within 'test/studies' to house various level-3 test driver files used to measure performance on SkylakeX nodes (specifically, those nodes used by TACC's stampede2 system). commit 9fef85756d15ee0f977fff6e57acd01c20cba184 Author: Field G. Van Zee Date: Wed Jul 11 18:40:30 2018 -0500 Cleaned up loose ends in BLISObjectAPI.md. Details: - Deleted some lines from the API function signatures that did not belong (and were only left over from the copy-paste of the typed API). - Fixed some paragraph-in-bullet indentation. commit 80ddeae4629022b69fdf1f1b053a1fcba643c40c Author: Field G. Van Zee Date: Wed Jul 11 18:31:57 2018 -0500 Added BLISObjectAPI.md to docs. Details: - Added first draft of BLISObjectAPI.md. (Object management section is still missing.) - Small fixes to BLISTypedAPI.md found while writing BLISObjectAPI.md. - In various .md files, changed ``` verbatim blocks to language attributes (e.g. ```c for C code). 
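Illustrative note on ecbebe7 (the rntm_t commit listed earlier in this log): the "NULL means use the global defaults" convention described there can be sketched as follows. This is a minimal sketch under stated assumptions, not BLIS code. Every name in it (settings_t, set_global_threads(), op_ex(), op()) is hypothetical, the struct holds a single field rather than the real rntm_t contents, and taking the mutex while copying the global values is a choice made for this sketch; the commit message only states that the "set" functions use a mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical runtime-settings struct; not the real rntm_t layout. */
typedef struct { int num_threads; } settings_t;

/* Process-wide defaults, written only while holding the mutex. */
static settings_t      global_settings = { 1 };
static pthread_mutex_t global_mutex    = PTHREAD_MUTEX_INITIALIZER;

/* Setter for the global default, serialized by the mutex. */
static void set_global_threads(int n)
{
    assert(n > 0);
    pthread_mutex_lock(&global_mutex);
    global_settings.num_threads = n;
    pthread_mutex_unlock(&global_mutex);
}

/* "Expert" entry point. A NULL argument means the caller supplied no
   custom settings, so the global values are copied into a local struct.
   From then on the call reads only its own copy, so concurrent callers
   with different settings cannot interfere with each other. */
static int op_ex(const settings_t *user)
{
    settings_t local;

    if (user == NULL) {
        pthread_mutex_lock(&global_mutex);
        local = global_settings;
        pthread_mutex_unlock(&global_mutex);
    } else {
        local = *user;
    }

    /* Stand-in for the real computation: report the thread count used. */
    return local.num_threads;
}

/* "Basic" entry point: equivalent to the expert one called with NULL. */
static int op(void)
{
    return op_ex(NULL);
}
```

In this sketch, one thread can call op_ex() with its own settings_t while another calls op() and picks up the global default, which mirrors the per-call parallelism requests the commit describes.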
commit 038442add39ce629fee0d960b212ce0c95138d46 Author: Field G. Van Zee Date: Wed Jul 11 12:24:18 2018 -0500 Added -lpthread to makefile example in BuildSystem.md. Details: - Added missing pthreads library linking to example makefile in docs/BuildSystem.md, as well as similar language to build requirements at the beginning of the document. Thanks to Stefanos Mavros for bringing this to our attention. - Updated CREDITS file. commit bf10d8624e7b5902c9d9189c7c93f318b8e1b9a5 Author: Field G. Van Zee Date: Mon Jul 9 18:40:13 2018 -0500 Small updates to KernelsHowTo.md, BLISTypedAPI.md. Details: - Minor updates to BLISTypedAPI.md, mostly to bring terminology up-to-date with the new "typed API" classification. - Added contents section to KernelsHowTo.md. commit 1fd3bce59e43b422e62f9684bca9d1296a29edc3 Author: Field G. Van Zee Date: Mon Jul 9 18:20:11 2018 -0500 Further updates to KernelsHowTo.md, BLISTypedAPI.md. Details: - Added missing level-1v operations to BLISTypedAPI (e.g. axpbyv, xpbyv). - Updated broken links in KernelsHowTo.md based on misnamed anchors. - Other minor changes. commit c40d30a6c920bd2e5a8353a3cd07a7e2b2265758 Author: Field G. Van Zee Date: Mon Jul 9 17:55:54 2018 -0500 Updated KernelsHowTo.md, BLISTypedAPI.md. Details: - Added missing (basic) information in KernelsHowTo.md for level-1f and level-1v kernels. - Updated section regarding contexts. commit f8913c2bf91c0e0fb4e68aedf64a242a19db92a0 Author: Field G. Van Zee Date: Sat Jul 7 20:35:13 2018 -0500 Fixed outdated scalv() calls in penryn l1f kernels. Details: - Fixed stale calls to dscalv() from the dotxf and dotxaxpyf penryn kernels that were not updated during the basic/expert API separation in e88aeda. commit e78e71d549ac17ecd52c7b33008df1cd78f1b59e Author: Field G. Van Zee Date: Sat Jul 7 20:18:09 2018 -0500 Added README.md mention/link to examples/tapi.
Details: - Added language to README.md to bring the reader's attention to the example code for the typed API (in addition to those for the object API). commit 419ffb158573a26bfec47bac73e4394e7926a7b8 Author: Field G. Van Zee Date: Sat Jul 7 20:14:23 2018 -0500 Updates to README.md. Details: - Updated wiki links according to renamed/relocated files in 'docs'. - Converted links to relative paths. - Added link to docs/Multithreading.md. commit 7d3e8a7e5f1ec299d009fb6c9071f0c1b089b460 Author: Field G. Van Zee Date: Sat Jul 7 20:01:29 2018 -0500 Reverted docs/*.md links to relative paths. Details: - Within the documents in docs/*.md, reverted links to other local documents to relative paths. - Fixed some links/documents that did not yet have the '.md' suffix. - Testing whether we can use relative links ('docs/BLISTypedAPI.md') from within README.md. commit d97c862c2b9170d774f414e63ae365488fffb4f5 Author: Field G. Van Zee Date: Sat Jul 7 19:40:41 2018 -0500 Updated links (URLs) in docs/*.md. Details: - Updated most markdown links in the documents/wikis to use absolute paths instead of the relative paths that were in use previously. A few links were not updated, except for adding a ".md" to reflect the documents' new names, in order to test whether relative linking still works. commit 3a0c12135875e0fb04de9798664e4fae632d994e Merge: 2c7960c8 bcacddfa Author: Field G. Van Zee Date: Sat Jul 7 16:51:38 2018 -0500 Merge branch 'dev' commit bcacddfad75b20969660606751eea6ead6c42ca9 Author: Field G. Van Zee Date: Sat Jul 7 16:45:29 2018 -0500 Added 'docs' directory with wiki markdown files. Details: - Exported all github wikis to a new 'docs' directory. - Renamed 'BLISAPIQuickReference' wiki to 'BLISTypedAPI' and removed all cntx_t* arguments from the (now non-expert) APIs (with the exception of the kernel APIs). - Added section to BuildSystem documenting new ARG_MAX hack. commit 3ee2bc0f7aa3b08da92331d64271bee99eaf8c1d Author: Field G. 
Van Zee Date: Sat Jul 7 16:02:16 2018 -0500 Renamed files that distinguish basic/expert APIs. Details: - Renamed various files that were previously named according to a "with context" or "without context" convention. For example, the following files in frame/3 were renamed: frame/3/bli_l3_oapi_woc.c -> frame/3/bli_l3_oapi_ba.c frame/3/bli_l3_oapi_wc.c -> frame/3/bli_l3_oapi_ex.c frame/3/bli_l3_tapi_woc.c -> frame/3/bli_l3_tapi_ba.c frame/3/bli_l3_tapi_wc.c -> frame/3/bli_l3_tapi_ex.c Here, the "ba" is for "basic" and "ex" is for "expert". This new naming scheme will make more sense especially if/when additional expert parameters are added to the expert APIs (typed and object). commit e88aedae735dfeb6fa5ac28d4527eb3ca58c6510 Author: Field G. Van Zee Date: Fri Jul 6 19:14:02 2018 -0500 Separated expert, non-expert typed APIs. Details: - Split existing typed APIs into two subsets of interfaces: one for use with expert parameters, such as the cntx_t*, and one without. This separation was already in place for the object APIs, and after this commit the typed and object APIs will have similar expert and non-expert APIs. The expert functions will be suffixed with "_ex" just as is the case for expert interfaces in the object APIs. - Updated internal invocations of typed APIs (functions such as bli_?setm() and bli_?scalv()) throughout BLIS to reflect use of the new explicitly expert APIs. - Updated example code in examples/tapi to reflect the existence (and usage) of non-expert APIs. - Bumped the major soname version number in 'so_version'. While code compiled against a previous version/commit will likely still work (since the old typed function symbol names still exist in the new API, just with one less function argument) the semantics of the function have changed if the cntx_t* parameter the application passes in is non-NULL.
For example, calling bli_daxpyv() with a non-NULL context does not behave the same way now as it did before; before, the context would be used in the computation, and now the context would be ignored since the interface for that function no longer expects a context argument. commit 331694e52414c0cd50048daf880a9ace9e29b94a Author: Isuru Fernando Date: Fri Jul 6 09:07:38 2018 -0600 Fix windows build and enable x86_64 on appveyor (#230) * Upload artifacts built on appveyor (#228) * Upload artifacts * Fix install in appveyor * Remove windows.h in bli_winsys.c (#229) Looks like it is unneeded. * Implemented ARG_MAX hack in configure, Makefile. Details: - Added support for --enable-arg-max-hack to configure, which will change the behavior of make when building BLIS so that rather than invoke the archiver/linker with all of the object files as command line arguments, those object files are echoed to a temporary file and then the archiver/linker is fed that temporary file via the @ notation. An example of this can be found in the GNU make docs at https://www.gnu.org/software/make/manual/make.html#File-Function - Thanks to Isuru Fernando for prompting this feature. * Enable x86_64 and arg-max-hack on appveyor * Use gas style assembly for clang on windows commit a64a780d28c99d35f237f59212772e9beff35b3e Merge: 89e178ce 3cb396d1 Author: Devin Matthews Date: Fri Jul 6 09:38:42 2018 -0500 Merge pull request #231 from flame/travis-pr Disable SDE for PRs commit 3cb396d1ae4ee569f862db201c6a976712fd128e Author: Devin Matthews Date: Fri Jul 6 09:19:44 2018 -0500 Disable SDE for PRs Pull requests cannot use Travis secret variables, so SDE needs to be disabled. This PR should suffice as a test. commit 2c7960c8416ee9b67364be5f2b210fd7a0aec4b5 Author: Field G. Van Zee Date: Thu Jul 5 14:38:33 2018 -0500 Implemented ARG_MAX hack in configure, Makefile.
Details: - Added support for --enable-arg-max-hack to configure, which will change the behavior of make when building BLIS so that rather than invoke the archiver/linker with all of the object files as command line arguments, those object files are echoed to a temporary file and then the archiver/linker is fed that temporary file via the @ notation. An example of this can be found in the GNU make docs at https://www.gnu.org/software/make/manual/make.html#File-Function - Thanks to Isuru Fernando for prompting this feature. commit c422a5cd191d47e6aeb9cea6de0e348f46e3e318 Merge: b6470262 89e178ce Author: Field G. Van Zee Date: Thu Jul 5 12:33:35 2018 -0500 Merge branch 'dev' commit b6470262ea66c0f48a5b4d85ca4bf85c1fb2b3af Author: Isuru Fernando Date: Wed Jul 4 19:14:29 2018 -0600 Remove windows.h in bli_winsys.c (#229) Looks like it is unneeded. commit eac4bdf98691c5ec784af0dc11d1ad2269840661 Author: Isuru Fernando Date: Wed Jul 4 18:31:01 2018 -0600 Upload artifacts built on appveyor (#228) * Upload artifacts * Fix install in appveyor commit 89e178ce380439dea951925e33703dc4b979e914 Merge: d868eb3e e32b2ef9 Author: Field G. Van Zee Date: Wed Jul 4 17:51:16 2018 -0500 Merge branch 'master' into dev commit e32b2ef983ea1c3521dd3821116c0078690f125e Author: Field G. Van Zee Date: Wed Jul 4 17:49:39 2018 -0500 Update to CREDITS file. 
commit 14648e137696484e0ff04f89b16c6b4183ea42b8 Author: Isuru Fernando Date: Wed Jul 4 16:48:42 2018 -0600 Native windows support using clang (#227) * Add appveyor file * Build script * Remove fPIC for now * copy as * set CC and CXX * Change the order of immintrin.h * Fix testsuite header * Move testsuite defs to .c * Fix appveyor file * Remove fPIC again and fix strerror_r missing bug * Remove appveyor script * cd to blis directory * Fix sleep implementation * Add f2c_types_win.h * Fix f2c compilation * Remove rdp and rename appveyor.yml * Remove setenv declaration in test header * set CPICFLAGS to empty * Fix another immintrin.h issue * Escape CFLAGS and LDFLAGS * Fix more ?mmintrin.h issues * Build x86_64 in appveyor * override LIBM LIBPTHREAD AR AS * override pthreads in configure * Move windows definitions to bli_winsys.h * Fix LIBPTHREAD default value * Build intel64 in appveyor for now commit b45ea92fc6f77f2313b50dbe95922f838cbead07 Author: Field G. Van Zee Date: Tue Jul 3 18:27:29 2018 -0500 Added typed (BLAS-like) API code examples. Details: - Added new example code to examples/tapi demonstrating how to use the BLIS typed API. These code examples directly mirror the corresponding example code files in examples/oapi. This setup provides a convenient opportunity for newcomers to BLIS to compare and contrast the typed and object APIs when they are used to perform the same tasks. - Minor cleanups to examples/oapi. commit d868eb3e200f657a1284c4cc933e7a4d25260dce Author: Field G. Van Zee Date: Fri Jun 29 12:36:04 2018 -0500 Implemented bli_obj_scalar_cast_to(). Details: - Implemented bli_obj_scalar_cast_to(), which will typecast the value in the internal scalar of an obj_t to a specified datatype. - Changed bli_obj_scalar_attach() so that the scalar value being attached is first typecast to the storage datatype of the destination object rather than the target datatype. 
- Reformatted function type signatures in bli_obj_scalar.c as well as prototypes in its corresponding header file. commit 52d80b5f09517d80ac8a7c96983a576c1ec2080b Author: Field G. Van Zee Date: Fri Jun 29 12:30:44 2018 -0500 Fixed static funcs related to target and exec dts. Details: - Fixed incorrect bit shifts in the following static functions: bli_obj_set_target_domain() bli_obj_set_target_prec() bli_obj_set_exec_domain() bli_obj_set_exec_prec() - Fixed incorrect bitmask in bli_dt_proj_to_single_prec(). - Updated bli_obj_real_part() and bli_obj_imag_part() so that it updates the target and exec datatypes (in addition to the storage datatypes). commit e006f2d0eeb229c1cd05a424496a774c29bdc5d7 Merge: bd8c55fe dafca7a0 Author: Field G. Van Zee Date: Wed Jun 27 15:54:38 2018 -0500 Merge branch 'dev' of github.com:flame/blis into dev commit bd8c55fe268e8e352508341ebd739ef4fc68eb92 Author: Field G. Van Zee Date: Wed Jun 27 15:52:37 2018 -0500 Added dt_on_output field to auxinfo_t. Details: - Added a new field to the auxinfo_t struct that can be used, in theory, to request type conversion before the microkernel stores/accumulates its microtile back to memory. - Added the appropriate get/set static functions to bli_type_defs.h. commit dafca7a0c2c72aaf15cb588b2bef6f246abb1905 Author: Devin Matthews Date: Mon Jun 25 16:20:10 2018 -0500 Fix botched memory addressing in Penryn kernel (no effect for GAS output). commit de493b0f349efebab98ab17f063d4d3d932c24c3 Merge: 195480be a7166feb Author: Devin Matthews Date: Mon Jun 25 14:26:06 2018 -0500 Merge pull request #226 from devinamatthews/dev Finish macroization of assembly ukernels. commit 195480beb589db7d582646f556e855c611d4c3a9 Merge: 07c3d0a9 3f387ca3 Author: Field G. Van Zee Date: Mon Jun 25 13:24:21 2018 -0500 Merge branch 'master' into dev commit 3f387ca35e42519f0d6a154814e4c8800fa2acb8 Author: Field G. Van Zee Date: Mon Jun 25 12:32:03 2018 -0500 Fixed bugs in configure's select_cc() function. 
Details: - This commit fixes several bugs in configure relating to selecting a C compiler. By dumb luck, two of the bugs sort of cancelled each other out in most use cases, which manifested as the expected behavior. Thanks to Mathieu Poumeyrol for bringing this issue to our attention, and to Devin Matthews for suggesting the more portable way of capturing both stdout and stderr and suggesting a return code check instead of testing stdout/stderr. - The first bug: As the values of the compiler search list are iterated over, only stderr is captured when querying a compiler with --version rather than both stdout and stderr. - The second bug: After each query, a conditional attempted to test whether the query resulted in anything being output. That conditional erroneously was using "-z" instead of "-n" for non-emptiness. Thus, most of the time, stderr was empty (because the --version info was being output on stdout), and since it was empty, the -z conditional (intended to execute only when a compiler was found to be responsive) executed. - A third bug was also fixed in the way that the merged stdout/stderr output was tested for non-emptiness (moving the 'cat' invocation to another line and testing the contents of a variable instead). - The three bugs above have been fixed as part of a partial rewrite of the select_cc() function in terms of a return code check, which obviated the need to save the output of stdout and stderr. - The fourth bug involved a misnamed variable in the right-hand side of a statement intended to prepend CC to search_list when CC was non-empty. This typically did not manifest as a bug since usually CC (if it was set) was set to a value that was known to work. commit a7166feb1053814b7dd27f3879ae38acfc9637fc Author: Devin Matthews Date: Mon Jun 25 12:09:18 2018 -0500 Finish macroization of assembly ukernels. commit f986396c2af5de06283b9834112782afd0a8907e Author: Field G.
Van Zee Date: Fri Jun 22 18:12:40 2018 -0500 Added 'configure --help' text for CFLAGS, LDFLAGS. Details: - Added mention of the new support for preset CFLAGS, LDFLAGS to the bottom of the text output by './configure --help'. - Updated usage example to use 'haswell' instead of 'sandybridge'. commit 884175d9ffb62e49535e6c1f7d58fb3b83e7e78f Author: Field G. Van Zee Date: Fri Jun 22 18:08:43 2018 -0500 Added configure support for preset CFLAGS, LDFLAGS. Details: - Any preexisting values set to the CFLAGS environment variable (or the CFLAGS variable if given on the command line) are saved by configure for later inclusion (prepending, to be precise) along with the compiler flags automatically determined by the BLIS build system. LDFLAGS is treated in a similar manner. Thanks to Dave Love for requesting this feature in issue #223 and Mathieu Poumeyrol for his support on this and a previous related issue. - Comment updates to build/config.mk.in. - Strip whitespace from return value of various cflags functions in common.mk. commit 07c3d0a95190bd23f0cd2ef220deb3384d8378d1 Author: Field G. Van Zee Date: Thu Jun 21 12:35:07 2018 -0500 Update to CREDITS file. commit a1ebbbf158c7b34c9032ef45431bc610b6f14858 Merge: 17928b1c c81c6f23 Author: Devin Matthews Date: Wed Jun 20 15:37:53 2018 -0500 Merge pull request #224 from devinamatthews/asm-macros Asm macros commit c81c6f23b9547b5d55ae68fd5a3bbd8a78290b6b Author: Devin Matthews Date: Wed Jun 20 15:20:44 2018 -0500 Fix problem with inc and dec macros. commit 5a63971c822fd452f97ba869625c8e87f6cbeebc Merge: b4d94e54 17928b1c Author: Devin Matthews Date: Wed Jun 20 14:07:49 2018 -0500 Merge remote-tracking branch 'upstream/dev' into asm-macros commit b4d94e54d44cf30e4bb452ca5263be3473c0582d Author: Devin Matthews Date: Wed Jun 20 14:07:24 2018 -0500 Convert x86 microkernels to assembly macros. commit 17928b1c9941aa58aef1f122c793e2b14e705267 Author: Field G.
Van Zee Date: Tue Jun 19 17:59:03 2018 -0500 Added static funcs bli_dt_domain(), bli_dt_prec(). Details: - Added definitions of static functions bli_dt_domain()/bli_dt_prec(), which extract a dom_t domain or prec_t precision value, respectively, from a num_t datatype. - Changed the return types of bli_obj_domain() and bli_obj_prec() from objbits_t to dom_t and prec_t. (Not sure why they were ever set to return objbits_t.) commit 5f7fbb7115b1bf532c169dfd9adef84c41a95031 Author: Field G. Van Zee Date: Tue Jun 19 15:38:55 2018 -0500 Static funcs for projecting dt to single/double. Details: - Added static functions for projecting a datatype to single precision or double precision, both for obj_t's storage datatypes and standalone datatypes. commit d4a22702c7a90273dc14f271db465c2e11e5b87e Author: Field G. Van Zee Date: Tue Jun 19 14:54:57 2018 -0500 Set up haswell config for optional col-pref ukrs. Details: - Added two presently-disabled cpp blocks in bli_cntx_init_haswell.c to easily allow one to switch to a set of column-preferential gemm microkernels (in the haswell subconfiguration). The second column-preferring block sets the register blocksizes to their appropriate values. However, cache blocksizes are left unchanged, and therefore are likely suboptimal. This should be addressed later. commit f317c2e31bfc329cb6bb4e06005e45b9c8a9d6a7 Author: Field G. Van Zee Date: Tue Jun 19 12:21:23 2018 -0500 Added get/set static funcs for exec dt/dom/prec. Details: - Added functions to bli_obj_macro_defs.h to get and set the target domain and target precision bits in the obj_t, and also added the appropriate support in bli_type_defs.h. commit e88a5b8da8c26caebd2b0fb73b30836fb5417c9c Author: Field G. Van Zee Date: Mon Jun 18 15:56:26 2018 -0500 Implemented castm, castv operations. Details: - Implemented castm and castv operations, which behave like copym and copyv except where the obj_t operands can be of different datatypes.
These new operations, however, unlike copym/copyv, do not build upon existing level-1v kernels. - Reorganized projm, projv into a 'proj' subdirectory of frame/base (to match the newly added frame/base/cast directory). - Added new macros to bli_gentfunc_macro_defs.h, _gentprot_macro_defs.h that insert GENTFUNC2/GENTPROT2 macros for all non-homogeneous datatype combinations. Previously, one had to invoke two additional macros--one which mixed domains only and another that included all remaining cases--in order to get full type combination coverage. - Defined a new static function, bli_set_dims_incs_2m(), to aid in the setting of various variables in the implementations of bli_??castm(). This static function joins others like it in bli_param_macro_defs.h. - Comment update to bli_copysc.h. commit 2000cdff59272974438e88e0e82d8e1a32710325 Author: Field G. Van Zee Date: Mon Jun 18 14:17:28 2018 -0500 Update to CREDITS file. commit ed2c8aed848ba2dede18df090cf2e0b6e4cc059f Author: Field G. Van Zee Date: Mon Jun 18 11:49:34 2018 -0500 Temporarily disabled small matrix handling on zen. Details: - Disabled small matrix handling in config/zen/bli_family_zen.h due to what appears to be a bug that manifests as failures in the single and double precision real level-3 BLAS test drivers (visible via out.sblat3 and out.dblat3). Thanks to Robin Christ for reporting this issue. commit ed20392c500940bfc0947795c1ff7c8c24f8e26f Author: Field G. Van Zee Date: Fri Jun 15 16:31:22 2018 -0500 Added get/set static funcs for exec dt/dom/prec. Details: - Added functions to bli_obj_macro_defs.h to get and set the execution domain and execution precision bits in the obj_t. - Added/rearranged a few functions in bli_obj_macro_defs.h. - Renamed some macros in bli_type_defs.h: EXECUTION -> EXEC. commit 22594e8e9ab55f5bc0e69d96a23e128502849999 Author: Field G. Van Zee Date: Thu Jun 14 17:35:23 2018 -0500 Updated sandbox/ref99 according to f97a86f. 
Details: - Applied changes to ref99 sandbox analogous to those applied to framework code in f97a86f. This involves setting the pack schemas of A and B objects temporarily to communicate those desired schemas to the control tree creation function in blx_gemm_cntl.c. This allows us to (henceforth) query the schemas from the control tree rather than the context. commit 1b5d0424d2c7e5eac33e02359c12917ef280949f Author: Field G. Van Zee Date: Wed Jun 13 18:41:32 2018 -0500 Prototype column-preferential zen gemm ukernels. Details: - Added prototypes to bli_kernels_zen.h for each of the four gemm microkernels that prefer outputting to column storage. commit f88c2e7a539e383297e846e6d4647058dd3db128 Author: Field G. Van Zee Date: Wed Jun 13 18:27:46 2018 -0500 Defined static function bli_blksz_scale_def_max(). Details: - Added a new static function to bli_blksz.h that scales both the default (regular) blocksize as well as the maximum blocksize in the blksz_t object. Reminder: maximum blocksizes have different meanings in different contexts. For register blocksizes, they refer to the packing register blocksizes (PACKMR or PACKNR) while for cache blocksizes, they refer to the maximum blocksize to use during the final iteration of a loop. commit 87db5c048e0c7f37351fda486abaf7d19fc5821c Author: Field G. Van Zee Date: Tue Jun 12 19:38:37 2018 -0500 Changed usage of virtual microkernel slots in cntx. Details: - Changed the way virtual microkernels are handled in the context. Previously, there were query routines such as bli_cntx_get_l3_ukr_dt() which returned the native ukernel for a datatype if the method was equal to BLIS_NAT, or the virtual ukernel for that datatype if the method was some other value. Going forward, the context native and virtual ukernel slots will both be initialized to native ukernel function pointers for native execution, and for non-native execution the virtual ukernel pointer will be something else.
This allows us to always query the virtual ukernel slot (from within, say, the macrokernel) without needing any logic in the query routine to decide which function pointer (native or virtual) to return. (Essentially, the logic has been shifted to init-time instead of compute-time.) This scheme will also allow generalized virtual ukernels as a way to insert extra logic in between the macrokernel and the native microkernel. - Initialize native contexts (in bli_cntx_ref.c) with native ukernel function addresses stored to the virtual ukernel slots pursuant to the above policy change. - Renamed all static functions that were native/virtual-ambiguous, such as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt() pursuant to the above policy change. Those routines now use the substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All of these functions were static functions defined in bli_cntx.h, and most uses were in level-3 front-ends and macrokernels. - Deprecated anti_pref bool_t in context, along with related functions such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's panel-block execution is disabled. commit dbaf440540837b03643190cd685ed889fa7fd212 Merge: 22aa44eb 2610fff0 Author: Field G. Van Zee Date: Mon Jun 11 12:37:04 2018 -0500 Merge branch 'master' into dev commit 2610fff0b07bdb345cb2e334ef6bea0c63c8cead Author: Field G. Van Zee Date: Mon Jun 11 12:32:54 2018 -0500 Renamed 1m packm kernels from _1e to _1er. Details: - Renamed the reference packm kernels used by 1m. Previously, they used a _1e suffix, which was confusing since they packed to both 1e and 1r schemas. This was likely an artifact of the time when there were separate kernels for each schema before I decided to combine them into a single function (per datatype and panel dimension), and the 1e functions were the ones to inherit the 1r functionality. The kernels have now been renamed to use a _1er suffix.
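Illustration for 87db5c0 above: a minimal C sketch of the init-time selection that commit describes, in which the virtual microkernel slot is filled once at initialization so that the macrokernel can always call through it with no per-call branching. Every name here (toy_cntx_t, toy_cntx_init(), toy_macrokernel(), the METHOD_* values, and the two toy kernels) is hypothetical and chosen only for illustration; none of it is BLIS's actual API or implementation.

```c
/* Hypothetical sketch only; these are NOT BLIS types or functions. */

typedef void (*ukr_fn)(int k, const double *a, const double *b, double *c);

typedef enum { METHOD_NATIVE = 0, METHOD_INDUCED = 1 } toy_method_t;

typedef struct {
    toy_method_t method;
    ukr_fn       nat_ukr; /* native microkernel slot  */
    ukr_fn       vir_ukr; /* virtual microkernel slot */
} toy_cntx_t;

/* Counters so that the dispatch path can be observed. */
static int native_calls  = 0;
static int virtual_calls = 0;

/* Toy "native" microkernel: accumulates a length-k dot product into *c. */
static void native_ukr(int k, const double *a, const double *b, double *c)
{
    native_calls++;
    for (int p = 0; p < k; p++)
        *c += a[p] * b[p];
}

/* Toy "virtual" microkernel: extra logic wrapped around the native one. */
static void virtual_ukr(int k, const double *a, const double *b, double *c)
{
    virtual_calls++;
    native_ukr(k, a, b, c);
}

/* Init-time decision: for native execution, both slots hold the native
   function pointer; otherwise the virtual slot holds the wrapper. */
static void toy_cntx_init(toy_cntx_t *cntx, toy_method_t method)
{
    cntx->method  = method;
    cntx->nat_ukr = native_ukr;
    cntx->vir_ukr = (method == METHOD_NATIVE) ? native_ukr : virtual_ukr;
}

/* Compute-time: always query the virtual slot; no method check needed. */
static double toy_macrokernel(const toy_cntx_t *cntx, int k,
                              const double *a, const double *b)
{
    double c = 0.0;
    cntx->vir_ukr(k, a, b, &c);
    return c;
}
```

With a context initialized via toy_cntx_init(&cntx, METHOD_NATIVE), the call made by toy_macrokernel() lands directly in native_ukr(); with METHOD_INDUCED it passes through virtual_ukr() first. The caller is identical in both cases, which is the point of moving the decision to init-time.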
commit 7af5283dcc3dded114852d6013d33134021b81aa Author: sraut Date: Mon Jun 11 15:00:22 2018 +0530 added check condition on n-dimension for XA'=B intrinsic code to process till 128 size Change-Id: I95d020a5ca3ea21d446b8c2e379d56e1eea18530 commit 712de9b371a8727682352a2f52cd4880de905f0b Author: Field G. Van Zee Date: Sat Jun 9 14:36:30 2018 -0500 Added missing semicolon in 03obj_view.c Details: - Thanks to Tony Skjellum for pointing out this typo due to a last-minute change to the source prior to committing. commit 043d0cd37ef4a27b1901eeb89d40083cfb2a57ba Author: Field G. Van Zee Date: Sat Jun 9 13:46:49 2018 -0500 Implemented bli_acquire_mpart(), added example code. Details: - Implemented bli_acquire_mpart(), a general-purpose submatrix view function that will alias an obj_t to be a submatrix "view" of an existing obj_t. - Renumbered examples in examples/oapi and inserted a new example file, 03obj_view.c, which shows how to use bli_acquire_mpart() to obtain submatrix views of existing objects, which can then be used to indirectly modify the parent object. commit f1908d39767baef56077def69126d96f805ee27e Author: Field G. Van Zee Date: Fri Jun 8 14:22:22 2018 -0500 Fixed broken input.operations.fast. Details: - Removed three input lines from input.operations.fast (labeled "test sequential micro-kernel") that I intended to remove in bd02c4e. These lines prevented 'make check' (and 'make checkblis-fast') from completing correctly. Note: This bug was fixed in 3df39b3, but that commit has not yet been merged into master, hence this redundant commit. Thanks to Robert van de Geijn for reporting this issue. commit 262a62e3482c5caa947a89cabb562b5887555bd6 Author: Field G. Van Zee Date: Fri Jun 8 12:10:54 2018 -0500 Fixed undefined ref in steamroller/excavator configs. 
Details: - Fixed erroneous calls to bli_cntx_init_piledriver_ref() in bli_cntx_init_steamroller() and bli_cntx_init_excavator(), which should have been to their respectively-named bli_cntx_init_*() functions instead. Thanks to qnerd for bringing these bugs to our attention. commit 22aa44ebec2c7884bdc944775a1aa7534ab53f0d Merge: 65fae950 b65d0b84 Author: Field G. Van Zee Date: Thu Jun 7 17:42:59 2018 -0500 Merge branch 'dev' of github.com:flame/blis into dev commit 65fae95074d239354737355bbe6f202d4f8b2871 Author: Field G. Van Zee Date: Thu Jun 7 17:41:09 2018 -0500 Implemented bli_setrm, _setim, _setrv, _setiv. Details: - Defined new wrappers to setm/setv operations in frame/base/bli_setri.c that will target only the real or only the imaginary parts of a matrix/vector object. - Updated bli_obj_real_part() so that the complex-specific portions of the function are not executed if the object is real. - Defined bli_obj_imag_part(). - Caveat: If bli_obj_imag_part() is called on a real object, it does nothing, leaving the destination object untouched. The caller must take care to only call the function on complex objects. - Reordered some of the static functions in bli_obj_macro_defs.h related to aliasing. commit b65d0b841b7e4357bc2cf743bbb03384a3ab0bfa Author: Field G. Van Zee Date: Thu Jun 7 14:38:41 2018 -0500 Fixed bug in bli_dt_proj_to_complex(). Details: - Fixed a bug identical to the one fixed in 0a4a27e, except this time in the bli_obj_param_defs.h header file. It looks like the only consumers of this static function were in bli_l0_oapi.c, and so this may not have been manifesting (yet). commit 55b6abdf7458e31df3ad01796d67c2332c776948 Author: Field G. Van Zee Date: Thu Jun 7 14:08:12 2018 -0500 Enforce consistent datatypes in most object APIs. Details: - Added logic to level-1v, -1d, -1f, -1m, -2, and -3 operations' _check() functions to ensure that all operands are of the same datatype. 
There are some exceptions that were left out, such as the _check() function for the various norm operations since they have a different idea of datatype consistency (ie: the norm object must be the real projection of the primary input vector/matrix object). commit 513138b1a1ecebd015580423c779810cae5c67f2 Author: Field G. Van Zee Date: Thu Jun 7 12:24:47 2018 -0500 Defined/implemented bli_projv(). Details: - Added an implementation for bli_projv() to go along with the implementation of bli_projm() added in 0a4a27e. The only difference between the two is that bli_projv() may only be used on vectors, whereas bli_projm() is general-purpose. - Added a _check() function corresponding to bli_projv(). commit 5f71c1e719eb482b2a4e40daa280c4f7d05b6963 Merge: b5a641e9 3df39b37 Author: Field G. Van Zee Date: Wed Jun 6 19:06:14 2018 -0500 Merge branch 'dev' of github.com:flame/blis into dev commit b5a641e968469805906eb2c971384d12ad1beac5 Author: Field G. Van Zee Date: Wed Jun 6 19:05:37 2018 -0500 Added char-to-dt and dt-to-char mapping functions. Details: - Defined additional functions in bli_param_map.c: bli_param_map_char_to_blis_dt() bli_param_map_blis_to_char_dt() which will map a char to its corresponding num_t, or vice versa. commit 0a4a27e1a4487480410bc0b1bb034bcf97583214 Author: Field G. Van Zee Date: Wed Jun 6 19:02:29 2018 -0500 Defined/implemented bli_projm(). Details: - Defined a new operation in frame/base/bli_proj.c, bli_projm(), which behaves like bli_copym(), except that operands a and b are allowed to contain data of differing domains (e.g. a is real while b is complex, or vice versa). The file is named bli_proj.c, rather than bli_projm.c, with the intention that a 'v' vector version of the function may be added to the same file (at some point in the future). 
- Added supporting bli_check_*() functions in bli_check.c to confirm consistent precisions between two datatypes/objects, as well as the appropriate error message in bli_error.c and a new error code in bli_type_defs.h. - Wrote a bli_projm_check() function to go along with bli_projm(). - Defined static function bli_obj_real_part() in bli_obj_macro_defs.h, which will initialize an obj_t alias to the real part of the source object. - Fixed a bug in the static function bli_dt_proj_to_complex(), found in bli_param_macro_defs.h. Thankfully, there were no calls to the function to produce buggy behavior. commit 3df39b37a0134befa34b6b6259db98467c7bc965 Author: Field G. Van Zee Date: Wed Jun 6 15:35:05 2018 -0500 Fixed recently broken input.operations.fast. Details: - Removed "test sequential front-end" lines from microkernel test entries of input.operations.fast. This change was meant for inclusion in bd02c4e but was missed due to slightly different wording of the comment (I used "sed //d" to remove the lines). This fixes the broken 'make checkblis-fast' (and 'make check') targets. commit 695cd520e2f5eab938f66afe9fe36201ab2700c5 Author: sraut Date: Wed Jun 6 11:48:56 2018 +0530 AMD Copyright information changed to 2018 Change-Id: Idfd11afd5d252f8063d0158680d24bf7e2854469 commit df1dd24fd896821de60917b429f303bab7fd0d4b Author: sraut Date: Wed Jun 6 11:24:33 2018 +0530 small matrix trsm intrinsics optimization code for AX=B and XA'=B Change-Id: I90123c4d9adbd314c867995cd19dc975150b448c commit 3f48c38164b4135515b5c752c506fdccc4480be2 Author: Field G. Van Zee Date: Tue Jun 5 16:52:35 2018 -0500 Cosmetic fix to configure output in config.mk. Details: - Fixed configure so that MK_ENABLE_MEMKIND is assigned "no" when the option is disabled due to libmemkind not being present. This wasn't affecting anything since the one use of the variable (in common.mk) was formulated as "ifeq ($(MK_ENABLE_MEMKIND),yes)".
That is, the variable being empty was effectively equivalent to it being set to "no". - Comment updates to build/config.mk.in, common.mk. commit 5df201260f64aa98a365931f6d2da70144d69932 Merge: 1b9af85e 96d2774b Author: Field G. Van Zee Date: Tue Jun 5 16:14:19 2018 -0500 Merge branch 'master' into dev commit 1b9af85ec98d91bb2b27aadaa3df344d18faff35 Author: Field G. Van Zee Date: Tue Jun 5 16:07:13 2018 -0500 Updated ref99 call to _cntx_set_thrloop_from_env(). Details: - Reordered the arguments in the ref99 sandbox's call to bli_cntx_set_thrloop_from_env() to be consistent with the updated function signature from f97a86f. Thanks to Devangi Parikh for reporting this issue. commit 96d2774b4cb44ff1e8b5798d7cfc83154a607624 Author: Tyler Michael Smith Date: Tue Jun 5 14:17:39 2018 +0200 Make bli_auxinfo_next_b() return b_next, not a_next (#216) commit d4c24ea5f644eb635046e7fe249d3e8e58b4c98a Author: sraut Date: Tue Jun 5 15:42:59 2018 +0530 copyright message changed to 2018 Change-Id: I33c1ebda41bc7f1973ff19e3b1947bdad62b4d44 commit 3f1ba4e646776699ebfaa042fe24691d9e2f55d0 Author: sraut Date: Tue Jun 5 14:21:13 2018 +0530 copyright changed to 2018 Change-Id: Ie916c7cd6f95aedc3cab6eec3a703c9ddb333bc3 commit bd02c4e9f7fe07487276e61507335d48c8e05f35 Author: Field G. Van Zee Date: Mon Jun 4 13:42:17 2018 -0500 Cleanups to testsuite, input.operations format. Details: - Removed the line in each operation entry in input.operations titled "test sequential front-end" and the corresponding support for the lines in the testsuite input parsing code. This line was included in the some of the earliest versions of the testsuite, back when I intended to eventually have separate multithreaded APIs. Specifically, I envisioned that multithreaded and sequential testing could be enabled or disabled on an operation level. However, BLIS evolved in a different direction and still does not have multithreaded-specific APIs (even if it will eventually someday). 
But even if it did have such APIs, I doubt I would allow the user to enable/disable them on an operation level. Thus, this was a zombie future parameter that was never used and never made sense to begin with. The one instance of the front_seq variable, used in the various libblis_test_() functions to guard the call to the operation test driver, that remains was commented out instead of deleted so that someday it could be easily changed via sed, if desired. - Various minor cleanups to the testsuite code, including consolidating use of DISABLE and DISABLE_ALL and reexpressing certain conditional expressions in the libblis_test_() functions in terms of boolean functions. commit 2c6d99b99e50d70f904da298a0c59be16cc5c180 Author: Field G. Van Zee Date: Sun Jun 3 18:13:36 2018 -0500 Fixed names out of alphabetical order in CREDITS. commit 7a207e8f2c5046f8b295a78e029ff2de765c7409 Author: Field G. Van Zee Date: Sun Jun 3 18:04:27 2018 -0500 Disabled indirect blacklisting (issue #214). Details: - Return early from the function pass_config_kernel_registries(), which implements indirect blacklisting of subconfigurations (during pass 0). In short, I realized that indirect blacklisting is not needed in the situations I envisioned, and can actually cause problems under certain circumstances. Thanks to Tony Skjellum for reporting the issue (#214) that led to this commit, and to Devin Matthews for prompting me to realize that indirect blacklisting was unnecessary, at least as originally envisioned. commit d7fb32682057c7458c8891c0eedafc374fd9beef Author: Field G. Van Zee Date: Sun Jun 3 13:20:37 2018 -0500 Fixed syntax artifacts from 4b36e85 in examples. Details: - Fixed artifacts of malformed recursive sed expressions used when preparing 4b36e85, in which most function-like macros were converted to static functions. The syntactically defective code was contained entirely in examples/oapi. Thanks to Tony Skjellum for reporting this issue. - Update to CREDITS file.
commit ed7dedfd4a07eefeb5a038f9899afb8053b45383 Merge: f97a86f3 469727d4 Author: Field G. Van Zee Date: Sat Jun 2 20:29:53 2018 -0500 Merge branch 'master' into dev commit f97a86f322a6e3e31f33c89befc66189b0b8c64f Author: Field G. Van Zee Date: Sat Jun 2 20:28:20 2018 -0500 Updated setting/querying pack schema (cntx->cntl). - Query pack schemas in level-3 bli_*_front() functions and store those values in the schema bitfields of the corresponding obj_t's when the cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default native schemas are stored to the obj_t's.) - In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in bli_*_front(), clear the schema bitfields, and pass the queried values into bli_gemm_cntl_create() and bli_trsm_cntl_create(). - Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to take schemas for A and B, and use these values to initialize the appropriate control tree nodes. (Also cpp-disabled the panel-block cntl tree creation variant, bli_gemmpb_cntl_create(), as it has not been employed by BLIS in quite some time.) - Simplified querying of schema in bli_packm_init() thanks to above changes. - Updated openmp and pthreads definitions of bli_l3_thread_decorator() so that thread-local aliases of matrix operands are guaranteed, even if aliasing is disabled within the internal back-end functions (e.g. bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c explaining why the extra aliasing is not needed there. - Change bli_gemm() and level-3 friends so that the operation's ind() function is called only if all matrix operands have the same datatype, and only if that datatype is complex. The former condition is needed in preparation for work related to mixed domain operands, while the latter helps with readability, especially for those who don't want to venture into frame/ind.
- Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be consistent with BLIS calling conventions (modified argument(s) are last), and updated all invocations in the level-3 _front() functions. - Comment updates to bli_cntx_set_thrloop_from_env(). commit 965db85d29977d228ea744581edf2b682eb8e8a8 Author: Field G. Van Zee Date: Fri Jun 1 12:32:15 2018 -0500 Updated macro invocations in bli_gemm_ker_var2.c. Details: - Updated "get next a/b micropanel" macro invocations in bli_gemm_ker_var2.c according to changes in 9588625. - Comment update in bli_cntx.c. commit 8749fa0b48a7710f4115023e2c46bc80167bc8f9 Author: Field G. Van Zee Date: Thu May 31 12:34:01 2018 -0500 Cleanups to ref99/README.md, test/3m4m/Makefile. Details: - Minor edits to sandbox/ref99/README.md. - Removed cpp guards in sandbox/ref99/thread/blx_gemm_thread.h to be consistent with other headers in sandbox/ref99. - Additional targets and related cleanups in test/3m4m/Makefile. commit 9588625c43c86ef1bde8140f620a30f52420e6a6 Author: Field G. Van Zee Date: Wed May 30 15:19:53 2018 -0500 Renamed "next micropanel" macros in _l3_thrinfo.h. Details: - Renamed several macros defined in bli_l3_thrinfo.h designed to compute the values of a_next and b_next to insert into an auxinfo_t struct in level-3 macrokernels. (Previously, the macros did not use a bli_ prefix.) - Updated instances of above macro usage within various macrokernels. commit e4420591225fca2f63ca74ef6a23b962fcd4bec0 Merge: 34f974d1 850a8a46 Author: Field G. Van Zee Date: Tue May 29 17:12:22 2018 -0500 Merge branch 'dev' of github.com:flame/blis into dev commit 34f974d1a83a7d29ba09f67e392d361231fdf99c Author: Field G. Van Zee Date: Tue May 29 17:11:52 2018 -0500 More tweaks/updates to sandbox/ref99/README.md. commit 850a8a46c0a569a2652d8c200e5c53b61bcf988d Author: Devin Matthews Date: Tue May 29 13:51:21 2018 -0500 Test all x86_64 configurations*... (#212) * Add custom SDE cpuid files. 
* Set up testing of all x86_64 architectures (except bulldozer) using SDE. * Update .travis.yml [ci skip] * Update do_testsuite.sh [ci skip] * Updated .travis.yml with my secret token. Details: - Replaced Devin's temporary secret token with my own, which is used by Travis when accessing the Intel SDE via Dropbox. * Work around CPUID dispatch in glibc/libm by patching ld.so. * Detect path of loader at runtime. * Attempt to make SDE run on Travis * Allow unpatched ld.so if we don't know how to patch it. I *think* this only happens for older glibc without the multi-arch stuff (e.g. Ubuntu 14.04 on Travis), but who knows? * Upgrade Travis to gcc-6 and binutils-2.26. * Try to get Travis to use the right assembler. * Apparently you need ld-2.26 too. * Try to also patch ld.so from Ubuntu 14.04. * Take the nuclear option. * Account for non-absolute dependencies in ldd output. * String manipulation fail. * Update patch-ld-so.py * Add Zen to SDE testing. * Removed dead variable from travis/do_testsuite.sh. Details: - Removed 'BLIS_ENABLE_TEST_OUTPUT=yes' from make invocations in travis/do_testsuite.sh. This variable is no longer present in the BLIS build system (if it ever was?), and therefore has no effect. commit 42ea02a34e5c144893fe239ae55daef895d92677 Author: Field G. Van Zee Date: Tue May 29 12:48:14 2018 -0500 Renamed c99 sandbox to ref99. Details: - Renamed sandbox/c99 to sandbox/ref99. I wanted to name the sandbox so that it would be thought of as a "reference" sandbox. I kept the "99" to differentiate it from future reference sandboxes that may be written in another language (such as C++). - Updates to sandbox/ref99/README.md. commit 0e7205ccef50dccd4306cf427a63633396472813 Author: Field G. Van Zee Date: Tue May 29 12:36:13 2018 -0500 Remove sandbox/.gitkeep now that dir is non-empty. commit 3a4603858e3819cbd6ed7dd67d0fc0b3f89ed254 Author: Field G. Van Zee Date: Sat May 26 15:51:08 2018 -0500 More README.md updates to sandbox/c99.
Details: - Added a section that walks the reader through how to configure BLIS to use a gemm sandbox. commit 2bad97f6bdf4642884d60fc03970549902a54d74 Author: Field G. Van Zee Date: Sat May 26 15:31:16 2018 -0500 Updates to CREDITS, sandbox/c99/README.md. commit 2b4a447526effa3e847a7e5c15c3758573f12318 Author: Field G. Van Zee Date: Fri May 25 18:51:23 2018 -0500 Initial implementation of c99 "reference" sandbox. Details: - Added a c99 sandbox (in sandbox/c99) to serve as a starting point for others looking to experiment with alternative implementations of gemm in BLIS. Note that this sandbox implementation is a first draft and will be refined over time. - Minor updates to Makefile and common.mk to restrict what source files get recompiled when sandbox files are touched. - Added an initial draft of a README.md in sandbox/c99. commit 469727d4f8a976d8713afb4d0b6235c322498db0 Author: Field G. Van Zee Date: Fri May 25 16:17:13 2018 -0500 Very minor comment updates. commit 66dbe69a0f9359bf1e39b5672ee365213de2e3ee Author: Field G. Van Zee Date: Fri May 25 15:45:53 2018 -0500 Converted macros to static funcs in _packm_cntl.h. Details: - Converted various macros in frame/1m/packm/bli_packm_cntl.h (designed to access fields of a packm_params_t struct) to static functions. commit 22deef2f5463a47e3b3c37fc313d17550f10ee06 Author: Field G. Van Zee Date: Thu May 24 14:28:55 2018 -0500 Support alternative gemm implementation sandboxes. Detail: - configure: - add support for --enable-sandbox=NAME to configure script, where NAME is a subdirectory of a new 'sandbox' directory that contains an alternative implementation of gemm. (For now, only implementations of gemm may be provided via a sandbox.); - add support for C++ compiler. C++ compilers are handled in a manner similar to that of C compilers, in that a default search order is used, and that CXX is searched for first, if the variable is set. 
In practice, the C++ compiler that is selected should correspond to the selected C compiler. (Example: If gcc is selected for C, g++ should be selected for C++.) The result of the search is output to config.mk via build/config.mk.in. NOTE: The use of C++ in BLIS is still hypothetical, but may eventually move to being experimental. This support was intended only for use of C++ within a gemm sandbox. - build/config.mk.in: - define SANDBOX variable containing sandbox subdirectory name. - build/bli_config.in: - define either of the BLIS_ENABLE_SANDBOX or BLIS_DISABLE_SANDBOX macros in bli_config.h. - common.mk: - include makefile fragments that were propagated into the specified sandbox subdirectory; - generate different CFLAGS for sandboxes, as well as a separate CXXFLAGS variable for sandboxes when C++ source files are compiled; - isolate into a single location lists of file suffixes for various purposes. - reorganized/clean up code related to identifying header files and paths. - Makefile: - generate object filepaths for and compile source code files found in sandbox sub-directory; - remove makefile fragments placed in sandbox sub-directory (cleanmk); - various other cleanups. - Added .cc, .cpp, and .cxx to list of suffixes of files to recognize in makefile fragments (via build/gen-make-frags/suffix_list). - Updated blis.h to conditionally #include bli_sandbox.h (via a new file, bli_sbox.h), which each sandbox is assumed to use for any type definitions and function prototypes it wishes to export out to blis.h. - Conditionally disable bli_gemmnat() implementation in frame/3 when BLIS_ENABLE_SANDBOX is defined. commit 25e3501ed57a0db7f860c88b7199b36049aec12a Merge: 216a4cb9 5140ee34 Author: Field G. Van Zee Date: Thu May 24 13:57:16 2018 -0500 Merge branch 'master' into dev commit 5140ee3424c744981a3fed3b5a748ebbfc111388 Author: Field G. Van Zee Date: Wed May 23 16:56:14 2018 -0500 Updated types of bli_is_[un]aligned_to() functions. 
Details: - Changed the void* arguments of the following static functions: bli_is_aligned_to() bli_is_unaligned_to() bli_offset_past_alignment() to siz_t, and the return type of bli_offset_past_alignment() from guint_t to siz_t. This allows for more versatile usage of these functions (e.g. when aligning both pointers and leading dimension). - Updated all invocations of these functions, mostly in kernels/penryn but also in kernels/bgq, to include explicit typecasts to siz_t when pointer arguments are passed in. - Thanks to Devin Matthews for pointing out this potential bug (via issue #211). - Deleted a few trailing spaces in various penryn kernels. - Removed duplicate instances of the words "derived" and "THEORY" from various kernel license headers, likely from a malformed recursive sed performed long ago. commit 216a4cb9cb87fa4c93f6ceb6ae90602e5018b305 Author: Field G. Van Zee Date: Fri May 18 18:47:03 2018 -0500 Minor update to flatten-headers.[py|sh] help text. Details: - Fixed a typo and removed some outdated language from the help text of flatten-headers.py and flatten-headers.sh. commit 962a706a6f56ea070ac4683f0af69c7e59af8ecb Author: Field G. Van Zee Date: Fri May 18 18:19:40 2018 -0500 Updated LICENSE file to mention HP Enterprise. Details: - Added HP Enterprise to the LICENSE file. Previously, only the source files touched by HPE contained the corresponding copyright notices. (This oversight was unintentional.) - Updated file-level copyright notices to include a comma, to match the formatting used for UT and AMD copyrights. commit efa43e13effe901ad31e734ac90f027e89473bd9 Author: Field G. Van Zee Date: Fri May 18 12:20:40 2018 -0500 More updates to CREDITS and RELEASING files. commit f94ab97af8e86baf9ee9a9cbaef8bb3712df2e11 Author: Field G. Van Zee Date: Thu May 17 17:45:31 2018 -0500 Update to CREDITS file. commit 4919b10c005e006a6d818eb8f865f9dbd8aa16df Author: Field G. 
Van Zee Date: Thu May 17 16:38:49 2018 -0500 Minor changes to README.md and CONTRIBUTING.md. commit b89451187e8321b673a1cf7603c8d48028d9d4c8 Author: Field G. Van Zee Date: Thu May 17 16:23:06 2018 -0500 README.md update. Details: - Added "Contributing" section with relevant links. commit af244194e7d76276a1b90fe59f9307dde0429e1d Author: Field G. Van Zee Date: Thu May 17 15:38:02 2018 -0500 Removed explicit critical sec. from bli_memsys.c. Details: - Removed critical sections protecting the initialization/finalization of bli_memsys.c. These synchronization mechanisms are no longer needed now that BLIS initializes all APIs via pthread_once(). commit 10c9e8f95254d8c6436c4d3cb093fa5544b45c90 Author: Field G. Van Zee Date: Thu May 17 15:22:51 2018 -0500 Cache hardware's arch_t id after querying once. Details: - Added logic to bli_arch.c that will call what was previously the body of bli_arch_query_id() only once and then cache the value in a static variable local to the file. (Previously, the arch_t associated with the hardware/configuration was queried every time bli_arch_query_id() was called, which was at least once per level-3 function call.) Thanks to Devin Matthews for suggesting this feature via issue #175. - Added -lpthread to the compile/link command line of the compiler invocation that compiles build/detect/config/config_detect.c, which prints the string identifying the detected configuration, since it is now needed due to new pthread_once() logic in bli_arch.c. - Implementation note: I chose to implement this arch_t caching feature via pthread_once(), using a separate pthread_once_t variable local to the file, rather than calling bli_init_once(). The reason is that I did not want to require bli_init() as a prerequisite to this function. bli_init() already calls several sub-components, some of which make use of bli_arch_query_id(), and therefore it would be easy to fall into a circular self-init situation (which usually causes pthreads to hang indefinitely).
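The arch_t caching described in commit 10c9e8f above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual contents of bli_arch.c: the names demo_arch_t, demo_detect_arch(), demo_arch_init(), demo_arch_query_id(), and demo_arch_query_count() are hypothetical, and the detection body is a placeholder. Only pthread_once() and PTHREAD_ONCE_INIT are real POSIX APIs.

```c
#include <pthread.h>

/* Hypothetical stand-in for BLIS's arch_t; the values are illustrative. */
typedef enum { DEMO_ARCH_GENERIC = 0, DEMO_ARCH_HASWELL = 1 } demo_arch_t;

/* File-local cache and once-control. Keeping a separate pthread_once_t
   here means the query does not depend on any library-wide init. */
static demo_arch_t    demo_cached_id   = DEMO_ARCH_GENERIC;
static pthread_once_t demo_once_id     = PTHREAD_ONCE_INIT;
static int            demo_query_count = 0; /* number of expensive queries */

/* Placeholder for the expensive hardware query (assumed, not real). */
static demo_arch_t demo_detect_arch( void )
{
    demo_query_count++;
    return DEMO_ARCH_HASWELL;
}

/* Executed exactly once, even if several threads race on the first call. */
static void demo_arch_init( void )
{
    demo_cached_id = demo_detect_arch();
}

/* The first call performs detection; later calls return the cached id. */
demo_arch_t demo_arch_query_id( void )
{
    pthread_once( &demo_once_id, demo_arch_init );
    return demo_cached_id;
}

/* Exposed only so a caller can confirm that detection ran once. */
int demo_arch_query_count( void )
{
    return demo_query_count;
}
```

Calling demo_arch_query_id() repeatedly leaves the query count at one. On some toolchains the program must be linked with -lpthread, which matches the build change noted in the commit.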
commit f28a15293890ac6fbceac229fd204dbc9fec6e27 Author: Francisco Igual Date: Thu May 17 09:26:14 2018 +0000 Fixed clobber list bug in ARMv8 ukernel commit 2e31dd7852b4d6a9355899cf9659d4b8130461cb Author: Field G. Van Zee Date: Wed May 16 17:28:33 2018 -0500 Inserted missing integer typecasting into ukernels. Details: - Inserted missing safeguards into most microkernels to ensure that the integers read by the microkernel's assembly instructions are of the appropriate size. In many cases, this bug was going undetected likely because the compiler was inserting zero padding before the integers in the calling function, allowing the assembly code to read 64-bits in a way that did not corrupt the "lower" 32 integer bits with garbage in the higher bits. Thanks to Francisco Igual and Devangi Parikh for finding this issue. commit 12dfa9516428b4092554f0ce70b07571d35de222 Author: Field G. Van Zee Date: Wed May 16 12:46:57 2018 -0500 Fixed a bug in determining default integer size. Details: - Fixed a bug that would cause configurations to inadvertently define their integers to be 32 bits when those environments actually call for 64-bit integers. While either BLIS_ARCH_64 or BLIS_ARCH_32 is defined in bli_system.h (based on whether preprocessor macros such as __x86_64 or __aarch64__ are defined by the environment), bli_system.h was being #included *after* bli_config_macro_defs.h, in which the BLIS_ARCH_64 macro was used to choose an integer type size in the event that BLIS_INT_TYPE_SIZE was not already defined by configure via bli_config.h. And due to the structure of the cpp code in that file, the 32-bit integer case was being chosen. Thanks to Francisco Igual and Devangi Parikh for their help in isolating this bug. - Moved the #include of hbwmalloc.h and related preprocessor code to bli_kernel_macro_defs.h to facilitate the reshuffling of the #include for bli_system.h in blis.h. commit f930cec0f35824c0f9ebbd218614209217d491cb Author: Field G.
Van Zee Date: Tue May 15 17:47:08 2018 -0500 More tweaks to CONTRIBUTING.md. commit 173e30ff7d293ba31f3fab8ab0c0a695eda3d4fd Author: Field G. Van Zee Date: Tue May 15 14:48:34 2018 -0500 Added initial draft of CONTRIBUTING.md file. Details: - Thanks to the Ruby on Rails project for providing a good template off of which to build. commit 6e25e758b444bf725046674e1e64c6a52421749d Author: Nico Schlömer Date: Tue May 15 14:03:20 2018 +0200 Debian config (#206) * add debian config * correct wording in the README commit fcf6c6a3c87da08a7cdb92b102489b991ef7a644 Author: Alex Arslan Date: Mon May 14 18:41:03 2018 -0700 Fix shared library builds on platforms other than Linux and macOS (#209) * Fix detection of systems other than Linux and macOS The way the logic is currently laid out, any platform that isn't Linux gets assigned the .dylib shared library extension and the macOS-specific compiler flags. This reverses the logic to check for macOS first, and have the fallback use the Linux definitions, which apply to most other systems as well. * Use SHLIB_EXT instead of SO_SUF The former is more standard, as jakirkham pointed out in a comment. commit 6f7f51048c48f31d691c06451d0fd2cbc453ad03 Author: Field G. Van Zee Date: Mon May 14 18:41:56 2018 -0500 Echo cc_vendor when printing compiler version. Details: - Echo the ${cc_vendor} when informing the user of the compiler's version. Previously, the actual ${cc} (which could be a path to the executable) was being printed, which has already been printed by that point in the configure script. commit ad67dc4e348b0a381efc057573a6b03cc7e26db0 Author: Field G. Van Zee Date: Mon May 14 18:35:28 2018 -0500 Communicate cc, cc_vendor to make via config.mk. Details: - Historically, the compiler selection has happened statically in the various make_defs.mk and would only be overridden by setting CC (either prior to running configure or as a configure argument).
However, in the last couple months, configure has evolved to contain rather sophisticated compiler detection logic for the purposes of blacklisting sub-configurations. It only makes sense that configure now fully take over the responsibility of selecting a compiler from the GNU make side of the build system. Thanks to Alex Arslan for his help exposing this issue. - Substitute found_cc into CC in config.mk via configure. - Set a new variable, CC_VENDOR, in config.mk via substitution from configure, and disable the corresponding CC_VENDOR code in common.mk. - Disabled default compiler selection (usually gcc) in the sub-configs' various make_def.mk files. commit 20af119fc97ec6120017a7a5ba5f9aaa920c7640 Author: Field G. Van Zee Date: Mon May 14 17:44:58 2018 -0500 Added README.md to 'config' directory. Details: - Added a brief README.md file to the config directory to redirect those who may be exploring the source tree to the ConfigurationHowTo wiki. (Included is a very brief explanation of configurations for those who don't have time to read the wiki.) Thanks to Nico Schlömer for this suggestion. commit 9dbce16269c3e1f27c7a0d64372cc76aed30dfc1 Author: Field G. Van Zee Date: Mon May 14 17:04:54 2018 -0500 Search for 'cc clang gcc' on OpenBSD, FreeBSD. Details: - Swapped gcc and clang in the compiler search list for OpenBSD. - Use the same search list for FreeBSD as above. commit 55ebf24d63128b5fd15b10160485667415a02a55 Author: Field G. Van Zee Date: Mon May 14 16:19:08 2018 -0500 Change compiler search order on OpenBSD. Details: - Set a compiler search list (and order) as a function of the OS detected via 'uname -s'. By default, this list and order is 'gcc clang cc' (for Linux, Darwin (OS X), and any other OS except OpenBSD). On OpenBSD, we use 'cc gcc clang' because OpenBSD's default installation of gcc (4.2.1) is too old for BLIS. Thanks to Alex Arslan for reporting this issue and suggesting a fix.
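The OS-dependent search order introduced in commit 55ebf24 above can be illustrated with a small mapping function. The real logic lives in the configure shell script; this C sketch is only an illustration in another language, and the helper name demo_cc_search_list() is hypothetical. It reflects the state as of 55ebf24 only; the later commit 9dbce16 changed the OpenBSD list to 'cc clang gcc' and applied it to FreeBSD as well.

```c
#include <string.h>

/* Hypothetical helper: maps an OS name, as printed by 'uname -s', to a
   space-separated compiler search list in priority order. */
const char* demo_cc_search_list( const char* os_name )
{
    /* Per the commit message, OpenBSD's default gcc (4.2.1) is too old
       for BLIS, so the system 'cc' is tried first there. */
    if ( strcmp( os_name, "OpenBSD" ) == 0 )
        return "cc gcc clang";

    /* Linux, Darwin, and every other OS use the default order. */
    return "gcc clang cc";
}
```

A caller would pass the detected OS string and then probe each listed compiler in turn, stopping at the first one found.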
commit 4fb353bd90e6642c8aeffd1b1e6329f54eee4bb4 Merge: 4b36e85b 8a2857b5 Author: Field G. Van Zee Date: Sun May 13 17:50:51 2018 -0500 Merge branch 'master' into dev commit 8a2857b5e3c633b18c24f2275110437a702a71d0 Author: Field G. Van Zee Date: Fri May 11 18:42:05 2018 -0500 Fixed README.md typo; mention 'make check'. commit 543935c02f9335142d2e485a15f37dbaebe012ed Author: Field G. Van Zee Date: Fri May 11 18:35:32 2018 -0500 Updated README.md with Ubuntu packages link. Details: - Created a separate section of README.md for external packages, with one bullet each for Dave Love's rpms and Nico Schlömer's Ubuntu apt packages. Thanks to Dave and Nico for their contributions. commit af1d8470b56d3b2a1c8513d366d788dddcb84baa Author: Field G. Van Zee Date: Fri May 11 17:49:58 2018 -0500 Better handling of shared libraries on OS X. Details: - Use the .dylib shared library suffix on OS X (instead of .so in Linux). - Link with the -dynamiclib and -install_name options on OS X (instead of -shared and -soname in Linux). - Determine operating system (e.g. Linux, Darwin) during configure and substitute into config.mk.in rather than run 'uname -s' during make. - Echo operating system during configure. commit 4b72a462d7467cf815422aafac7b05037d2e3b13 Author: Field G. Van Zee Date: Thu May 10 18:35:38 2018 -0500 Enable building shared library by default. Details: - Tweaked configure so that the shared library is generated by default. - Updated --help text and configure's feedback messages reporting the status of the static/shared builds. - Changed the order of build product installation so that headers are installed last, after libraries and symlinks. commit b699bb1ff03c6e9baaa054805b4939983ae7145b Author: Field G. Van Zee Date: Thu May 10 15:54:17 2018 -0500 Adopt Linux-like .so versioning at install-time. Details: - Changed the naming conventions used for installed libraries and symlinks to more closely mirror patterns used by typical GNU/Linux libraries. 
Whereas previously static and shared libraries were installed and symlinked as follows: (library) libblis-0.3.2-15-haswell.a (library) libblis-0.3.2-15-haswell.so (symlink) libblis.a -> libblis-0.3.2-15-haswell.a (symlink) libblis.so -> libblis-0.3.2-15-haswell.so we now use the following naming conventions: (library) libblis.a (symlink) libblis.so -> libblis.so.0.1.2 (symlink) libblis.so.0 -> libblis.so.0.1.2 (library) libblis.so.0.1.2 where 0.1.2 indicates shared library major, minor, and build versions of 0, 1, and 2, respectively. The conventional version string can still be queried by linking to the library in question and then calling bli_info_get_version_str(). (The testsuite binary does this automatically at startup.) - Added logic to common.mk to set the soname field in the shared library via the -soname linker flag. - Added a 'so_version' file to the top-level directory containing two lines. The first line specifies the .so major version number, and the second line specifies the minor and build version numbers joined with a '.'. This file is read by configure and those values substituted into build/config.mk.in to define SO_MAJOR, SO_MINORB, and SO_MMB variables. commit fc2d9ec6bf46f6e5b19d196208415ce433e95b10 Author: Field G. Van Zee Date: Wed May 9 15:19:28 2018 -0500 Tweaks to top-level clean and distclean targets. Details: - Moved the removal of bli_config.h from cleanh to distclean. - Removed cleantest as a dependency of clean. commit bf0350305971e3991861b5117a13fda31ff97b6d Author: Field G. Van Zee Date: Tue May 8 16:49:22 2018 -0500 Renamed (shortened) a few build system variables. 
Details: - Renamed the following variables in config.mk (via build/config.mk.in): BLIS_ENABLE_VERBOSE_MAKE_OUTPUT -> ENABLE_VERBOSE BLIS_ENABLE_STATIC_BUILD -> MK_ENABLE_STATIC BLIS_ENABLE_SHARED_BUILD -> MK_ENABLE_SHARED BLIS_ENABLE_BLAS2BLIS -> MK_ENABLE_BLAS BLIS_ENABLE_CBLAS -> MK_ENABLE_CBLAS BLIS_ENABLE_MEMKIND -> MK_ENABLE_MEMKIND and also renamed all uses of these variables in makefiles and makefile fragments. Notice that we use the "MK_" prefix so that those variables can be easily differentiated (such as via grep) from their "BLIS_" C preprocessor macro counterparts. - Other whitespace changes to build/config.mk.in. - Renamed the following C preprocessor macros in bli_config.h (via build/bli_config.h.in): BLIS_ENABLE_BLAS2BLIS -> BLIS_ENABLE_BLAS BLIS_DISABLE_BLAS2BLIS -> BLIS_DISABLE_BLAS BLIS_BLAS2BLIS_INT_TYPE_SIZE -> BLIS_BLAS_INT_TYPE_SIZE and also renamed all relevant uses of these macros in BLIS source files. - Renamed "blas2blis" variable occurrences in configure to "blas", as was done in build/config.mk.in and build/bli_config.h.in. - Renamed the following functions in frame/base/bli_info.c: bli_info_get_enable_blas2blis() -> bli_info_get_enable_blas() bli_info_get_blas2blis_int_type_size() -> bli_info_get_blas_int_type_size() - Remove bli_config.h during 'make cleanh' target of top-level Makefile. commit 4b36e85be9b516b4089b24768f881dd976668997 Author: Field G. Van Zee Date: Tue May 8 14:26:30 2018 -0500 Converted function-like macros to static functions. Details: - Converted most C preprocessor macros in bli_param_macro_defs.h and bli_obj_macro_defs.h to static functions. - Reshuffled some functions/macros to bli_misc_macro_defs.h and also between bli_param_macro_defs.h and bli_obj_macro_defs.h. - Changed obj_t-initializing macros in bli_type_defs.h to static functions. - Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from bli_constants.h. - Whitespace changes in select files (four spaces to single tab). 
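As a generic illustration of the macro-to-static-function conversion in commit 4b36e85 above, the sketch below contrasts the two forms. The names DEMO_MAX_MACRO and demo_max() are hypothetical and do not appear in BLIS, and the commit message does not state its motivation; the double-evaluation hazard shown here is simply one commonly cited difference between the two forms.

```c
/* Function-like macro: arguments are substituted textually, so an
   argument with side effects may be evaluated more than once. */
#define DEMO_MAX_MACRO( a, b ) ( (a) > (b) ? (a) : (b) )

/* Static function: each argument is evaluated exactly once and is
   checked against the declared parameter type. */
static int demo_max( int a, int b )
{
    return ( a > b ? a : b );
}

/* Returns the value of i after passing i++ through the macro. */
static int demo_i_after_macro( void )
{
    int i = 1;
    int r = DEMO_MAX_MACRO( i++, 0 ); /* i++ is expanded twice */
    (void)r;
    return i;
}

/* Returns the value of i after passing i++ through the function. */
static int demo_i_after_func( void )
{
    int i = 1;
    int r = demo_max( i++, 0 ); /* i++ is evaluated once */
    (void)r;
    return i;
}
```

With the macro, i is incremented twice because the selected branch repeats the argument expression; with the function, it is incremented once.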
commit 7e5648ca150757b874f6823da832f3798c40b9f9 Author: Field G. Van Zee Date: Mon May 7 18:59:19 2018 -0500 Add configure support for --libdir, --includedir. Details: - Added support for two new configure options: --libdir and --includedir. They specify the precise install directories for libraries and header files, respectively, and override any location implied by the --prefix option (including the default install prefix, if --prefix was not given). Thanks to Nico Schlömer for suggesting this via issue #195. - Removed the INSTALL_PREFIX definition/anchor from build/config.mk.in and replaced it with corresponding definitions/anchors for libdir and includedir. - Updated top-level Makefile to use the new variables, INSTALL_LIBDIR and INSTALL_INCDIR, instead of INSTALL_PREFIX (which is now no longer needed by make). - Set default sane values for INSTALL_LIBDIR and INSTALL_INCDIR in common.mk when configure has not been run, as is already done for DIST_PATH. This is to safeguard against statements in the top-level Makefile that use 'find' to locate old libraries and headers for the uninstall targets, which run regardless of make target. Without setting INSTALL_LIBDIR and INSTALL_INCDIR, those variables are empty and the 'find' ends up looking at '/', which is obviously not what we want. (Also enclosed those definitions in an IS_CONFIGURED guard so that they won't get evaluated unless configure has been run.) - Rearranged "ifeq ($(IS_CONFIGURED),yes)" conditionals in Makefile to reduce occurrences and separated "local" and top-level components of cleanblastest and cleanblistest targets to improve readability. - Adjusted out-of-tree builds so that they are no longer oblivious to the .git directories, if present, and thus now properly augment version strings with the appropriate patch number. - Include missing version string in 'configure --help' output. commit b09e4e8852a6c42895910e3bcb9041124dc8bf9f Author: Field G. 
Van Zee Date: Mon May 7 14:37:50 2018 -0500 Allow 'make clean' and friends without configuring. Details: - Modified top-level Makefile so that a user can run 'make distclean', 'make clean', or any of the other clean-related targets prior to running configure (or after a previous 'make distclean'). Thanks to Nico Schlömer for suggesting this via issue #197. - Made the cleanblastest and cleanblistest more comprehensive in that they now clean out build products that would have resulted from local compilation (ie: builds performed within the 'blastest' or 'testsuite' directories). - Added "cc" to list of expected compiler "vendors" since the CC variable seems to automatically be set to "cc" on Ubuntu 16.04 (which is just an alias to gcc). - Comment update to build/config.mk.in. commit 35c5a1449c3efe0b2ec43cdefcfdf00e71828149 Author: Field G. Van Zee Date: Mon May 7 12:04:57 2018 -0500 No longer update version file during configure. Details: - Recycled the core functionality of build/update-version-file.sh into a function in configure, disabling the updating of the 'version' file in the process. Instead of writing the patched version string back to the version file and then reading it again from within configure, the patched version string is now saved directly to a variable in the main() function in configure. This will prevent developers from accidentally committing configure-induced changes to the version file in between releases. commit 8adb2f919b62da4a2885ae04a10925e0e6a2e304 Author: Mathieu Poumeyrol Date: Sun May 6 19:58:16 2018 +0200 Some cross compilations fixes (#198) * cross-compilation fixes * add doc ranlib variable * icc support -dumpversion, posix compatible test, plus one stupid mistake * retab * revert version as requested commit 89acd9ebe516eeb97006dba344354bfc98826645 Merge: 4cff432d 0557eba7 Author: Field G. 
Van Zee Date: Wed May 2 12:53:35 2018 -0500 Merge branch 'amd' commit 4cff432d707891ada705b039a7e043558bbf3c51 Author: Nisanth M P <31736542+nisanthmpamd@users.noreply.github.com> Date: Wed May 2 23:20:42 2018 +0530 AMD specific optimizations for target 'zen' (#194) Re-enabled AMD-specific optimizations for zen. Details: - Re-enabled Zen-specific cache blocksizes for 'zen' sub-configuration. - Re-enabled small matrix gemm optimization for 'zen'. - These were both temporarily disabled during a previous merge simply due to lack of Zen hardware for testing. commit 8eda5fe7f678b413cb274bd84716995a7d0b87a9 Author: Field G. Van Zee Date: Wed May 2 12:20:37 2018 -0500 Typo fix in README.md. commit 0557eba78f5fcf28f0f039f28da79498ffde848c Author: Nisanth M P Date: Mon Mar 19 12:49:26 2018 +0530 Re-enabling the small matrix gemm optimization for target zen Change-Id: I13872784586984634d728cd99a00f71c3f904395 commit df78ceb3d6f33a27fe69017854405edaea7c40e5 Author: Nisanth M P Date: Mon Mar 19 11:34:32 2018 +0530 Re-enabling Zen optimized cache block sizes for config target zen Change-Id: I8191421b876755b31590323c66156d4a814575f1 commit 5e515f9a76f4aaf43dc21315a34d797726ca8069 Author: Field G. Van Zee Date: Tue May 1 13:44:10 2018 -0500 Tweaked new language in README.md. commit 1ddd9e316ad5024af8b606dfcebd1e7d587a130f Author: Field G. Van Zee Date: Tue May 1 13:36:28 2018 -0500 Added link to Dave Love's Fedora Copr page. Details: - Added a blurb to README.md advertising Dave Love's Copr homepage, which contains rpm packages for RHEL/Fedora-like distributions. commit 078a852f738c66c6468bd5e64b06467edc9057fd Author: Field G. Van Zee Date: Mon Apr 30 16:15:26 2018 -0500 Minor tweaks to top-level 'make clean' target. Details: - Execute 'cleanh' target as part of 'clean' - Remove cblas.h file from 'include//' as part of 'cleanh' target. - Updated the echoed (non-verbose) text for uniformity. commit 75d0d1057dda69c655bd1cd8f791cb39b54d99b8 Author: Field G. 
Van Zee Date: Mon Apr 30 14:57:33 2018 -0500 Renamed various datatype-related macros/functions. Details: - Renamed the following macros in bli_obj_macro_defs.h and bli_param_macro_defs.h: - bli_obj_datatype() -> bli_obj_dt() - bli_obj_target_datatype() -> bli_obj_target_dt() - bli_obj_execution_datatype() -> bli_obj_exec_dt() - bli_obj_set_datatype() -> bli_obj_set_dt() - bli_obj_set_target_datatype() -> bli_obj_set_target_dt() - bli_obj_set_execution_datatype() -> bli_obj_set_exec_dt() - bli_obj_datatype_proj_to_real() -> bli_obj_dt_proj_to_real() - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex() - bli_datatype_proj_to_real() -> bli_dt_proj_to_real() - bli_datatype_proj_to_complex() -> bli_dt_proj_to_complex() - Renamed the following functions in bli_obj.c: - bli_datatype_size() -> bli_dt_size() - bli_datatype_string() -> bli_dt_string() - bli_datatype_union() -> bli_dt_union() - Removed a pair of old level-1f penryn intrinsics kernels that were no longer in use. commit 01c4173238baf08e7f6700a3f91a2ea58cca50c1 Author: Field G. Van Zee Date: Sat Apr 28 14:07:34 2018 -0500 CHANGELOG update (0.3.2) commit 2fb440876690bdcec0c11a30e2b33ad100bab529 (tag: 0.3.2) Author: Field G. Van Zee Date: Sat Apr 28 14:07:31 2018 -0500 Version file update (0.3.2) commit cdf041ddadd8725e578e2f59f37ae341f26655af Author: Field G. Van Zee Date: Sat Apr 28 14:05:00 2018 -0500 Use config.mk instead of common.mk in bump-version.sh. Details: - Fixed inadvertent targeting of common.mk when testing whether configure had already been run, rather than config.mk. commit 6ded8f9f0364b3c07255e2532ada3eeb2ed2a715 Author: Field G. Van Zee Date: Sat Apr 28 14:01:29 2018 -0500 Account for recent 'make distclean' in bump-version.sh. Details: - Added logic to build/bump-version.sh that will run './configure auto' if 'common.mk' is not present (usually because 'make distclean' was run recently). commit 7c16fdce433f5dea0e83d5047553c955d8e46fd2 Author: Field G. 
Van Zee Date: Sat Apr 28 13:50:55 2018 -0500 Fixed typo in RELEASING file. commit 5e5ca4984fcf6d72d3036c338bb9cdc64520a325 Author: Field G. Van Zee Date: Sat Apr 28 13:48:01 2018 -0500 README updates. Details: - Updates to the top-level README files in the top-level directory as well as the 'examples/oapi' directory. commit 627b045e301defea6770dc5b64e1110cbec25153 Author: Field G. Van Zee Date: Fri Apr 27 18:11:19 2018 -0500 Added an example of using transposition with gemm. Details: - Added an example to examples/oapi/8level3.c to show how to indicate transposition when performing a gemm operation. commit 13a0eadc69d72933e322901f5b44944834e3c787 Author: Field G. Van Zee Date: Fri Apr 27 18:00:07 2018 -0500 Added more transposition/conjugation examples. Details: - Added code to examples/oapi/5level1m.c that demonstrates transposing (and conjugate-transposing) unstructured matrices. - Comment updates to 6level1m_diag.c to maintain consistency with new examples in 5level1m.c. commit 5606cd8881e75264a96af45dc8ea1905bab054f5 Author: Field G. Van Zee Date: Fri Apr 27 17:13:10 2018 -0500 Added utility module to examples/oapi. Details: - Added a new code example file to examples/oapi demonstrating how to use various utility operations. - Comment updates to other example files. - README updates. commit ff26c94c6486374c709f93c6965ea18903bd6a18 Author: Field G. Van Zee Date: Fri Apr 27 12:31:34 2018 -0500 Added missing gcc version constraint for knl. Details: - Previously forgot to add explicit enforcement of a minimum gcc version in configure script when 'knl' sub-configuration is requested. - Comment updates to configure. commit 4d97574e477b3e55ddbb6044b0542a92cd9bab30 Author: Field G. Van Zee Date: Tue Apr 24 18:48:09 2018 -0500 Added object API example code. Details: - Added an 'examples' directory at the top level. 
- Added an 'oapi' subdirectory in 'examples' that contains a tutorial-like sequence of example code demonstrating the core functionality of BLIS's object-based API, along with a Makefile and README. Thanks to Victor Eijkhout for being the first to suggest including such code in BLIS. commit d6ab25a3232aa52b9b855088fb4b0b46ff2c00c8 Author: Field G. Van Zee Date: Tue Apr 24 18:43:03 2018 -0500 Add setijm, getijm operations. Details: - Added bli_setgetijm.c, which defines bli_setijm(), bli_getijm(), and related functions that can be used to read and write individual elements of an obj_t. - Defined a new function, bli_obj_create_conf_to(), in bli_obj.c that will create a new object with dimensions conformal to an existing object. Transposition and conjugation states on the existing object are ignored, as are structure and uplo fields. - Defined a new function, bli_datatype_string(), in bli_obj.c that returns a char* to a string representation of the name of each num_t datatype. For example, BLIS_DOUBLE is "double" and BLIS_DCOMPLEX is "dcomplex". BLIS_INT is included (as "int"), but BLIS_CONSTANT is not, and thus is not a valid input argument to bli_datatype_string(). - Added calls to bli_init_once() to various functions in bli_obj.c, the most important of which was bli_obj_create_without_buffer(). - Removed unintended/extra newline from the end of printv output. - Whitespace changes to - frame/base/bli_machval.c - frame/base/bli_machval.h - frame/0/copysc/bli_copysc.c - Trivial changes to README.md and common.mk. commit a731a428f7fc02fd6ab4f953ead828c1d06fb5a1 Author: Field G. Van Zee Date: Tue Apr 17 16:44:55 2018 -0500 Another README.md update. commit c734ee928a824b27d280a9a67b1b4bc8423d5795 Author: Field G. Van Zee Date: Tue Apr 17 16:40:05 2018 -0500 README.md update. commit 03ecad372d8eb603ee905a7b944d0544a813460a Author: Field G. Van Zee Date: Tue Apr 17 14:16:59 2018 -0500 Added RELEASING file.
Details: - Added a file named 'RELEASING' that contains basic notes on how to create a new version/release of BLIS. This is mostly just a reminder to myself, but also may become useful if/when others take over development and administration of the project. commit 24b3c3149ce66546b9a1afc2cc794a637a86aa60 Merge: 60366a3f 817b67c0 Author: Field G. Van Zee Date: Mon Apr 16 18:49:38 2018 -0500 Merge branch 'dev' of github.com:flame/blis into dev commit 60366a3faba4e60cee85c3b87a3f69625f4b9026 Author: Field G. Van Zee Date: Mon Apr 16 18:46:21 2018 -0500 Updates to knl kernels and related code. Details: - Imported the 24x16 knl sgemm microkernel (and its corresponding spackm kernel) from TBLIS and enabled its use in the knl sub-config. Also added sgemm microkernel prototype to bli_kernels_knl.h. - Updated dgemm and dpackm microkernels from TBLIS, which included an important change regarding the offsets array (changed from extern declaration to static declaration/definition). - Activated use of level-1v and -1f zen kernels in skx and knl sub-configs. - Removed some old macros no longer needed in bli_family_skx.h now that libmemkind support exists in configure. - Moved bli_avx512_macros.h to frame/include and adjusted #includes in skx and knl kernels accordingly. - Moved unused kernels in kernels/knl/3 to kernels/knl/3/other directory. - Fixed a minor bug in the 'make' output per compile when verboseness is not turned on. The rule-generating function 'make-kernel-rule' was previously passing in the name of the config, rather than the name of the kernel set returned by get-config-for-kset, which could give misleading information to the user when the kconfig_map mapped a kernel set to a sub-configuration that did not share the same name. (This didn't affect the CFLAGS that were actually used.) - Updated test/3m4m/Makefile, removing acml targets and renaming the remaining targets. commit 817b67c01752e0ca8fe230bb8ad23afc7bd0f64e Merge: 67c9c2f8 2b7108a8 Author: Field G.
Van Zee Date: Mon Apr 16 14:06:26 2018 -0500 Merge branch 'dev' of github.com:flame/blis into dev commit 67c9c2f86d5ef2accc439b21581d73d82754a2e3 Author: Field G. Van Zee Date: Mon Apr 16 14:03:12 2018 -0500 Retired haswell gemm microkernels. Details: - Moved microkernels in kernels/haswell/3 to kernels/haswell/3/old. These microkernels were no longer being used and only sowed confusion to anyone inspecting the repository without being fully cognizant of the build system and how it works (and sometimes even to those who wrote the build system). Note that the haswell configuration currently employs the zen microkernels. commit 2b7108a8ef8ce958b3acad028ff07c85ff97fd63 Author: Field G. Van Zee Date: Mon Apr 16 12:35:53 2018 -0500 Minor updates to test driver makefiles. Details: - Cleaned up and homogenized the various test driver Makefiles in testsuite and test directories. - Very minor updates to test driver code. commit 9f56df95570a24587b910b169f342bd356ccbfb6 Author: Field G. Van Zee Date: Wed Apr 11 14:51:36 2018 -0500 Trivial tweaks to configure blacklisting output. Details: - Updated output of information vis-a-vis configuration blacklisting. commit f56481efebd9a7785c0618f3a12c0bec36f46333 Author: Field G. Van Zee Date: Tue Apr 10 19:02:21 2018 -0500 Cleaned up assembler version query on OS X. Details: - Switched from querying version of 'objdump' to 'as' (i.e. the assembler). - Fixed the outputting of the version of 'as' on OS X, which required this beauty: ...=$(as -v /dev/null -o /dev/null 2>&1) - Only add sub-configs to blacklist if the sub-config hasn't already been added. commit 088c474e629535affbe111f141f895af50d109be Author: Field G. Van Zee Date: Tue Apr 10 18:09:56 2018 -0500 Added support for blacklisting via the assembler. Details: - Added logic to configure that attempts to assemble various small files containing select instructions designed to reveal whether binutils (specifically, the assembler) supports emitting those instruction sets.
This information provides additional opportunities to blacklist sub- configurations that are unsupported by the environment. Thanks to Devin Matthews for pointing me towards a similar solution in TBLIS as an example. - Various other cleanups in configure. - Reorganized the detection code in the 'build' directory, bringing the "auto-detect" configuration detection, libmemkind detection, and new instruction set detection codes into a single new subdirectory named 'detect'. commit 78a24e7dada52a3582f8488795bd1a44993989d9 Author: Field G. Van Zee Date: Mon Apr 9 17:02:13 2018 -0500 Updated bli_avx512_macros.h in knl and skx configs. Details: - Downloaded updated version of bli_avx512_macros.h from TBLIS [1] in attempt to address issue #192. [1] https://github.com/devinamatthews/tblis/ commit 388f64d6ade14caa4a6c286845ad2d565378b2bb Author: Field G. Van Zee Date: Mon Apr 9 15:33:10 2018 -0500 Fixed failure to honor CC= argument to configure. Details: - Fixed a failure to observe the value of CC when selecting the compiler in configure. Thanks to Devangi Parikh for reporting this bug. - The semantics now also work for the CC environment variable. That is, if CC is set prior to running configure, that value is used, but will be overridden by specifying the CC= argument to configure. If the CC environment variable is not set, the CC= value is used. If neither the environment variable nor CC= are specified, then the choice is made internally to configure: first attempting to find gcc, then clang, and then cc. commit 45fbe66b3e2ab92f0b4fdf437d57c5d06603803d Author: Field G. Van Zee Date: Mon Apr 9 14:01:08 2018 -0500 Fixed libmemkind dependency for x86_64. Details: - Removed some old conditional code in config/knl/make_defs.mk that added -lmemkind to LDFLAGS if DEBUG_TYPE was not 'sde' and inserted code into common.mk that affirmatively filters out -lmemkind from LDFLAGS if DEBUG_TYPE is 'sde'. (Thanks to Dave Love for reporting this issue.) 
Other minor cleanups to neighboring code in common.mk. - Updated CRVECFLAGS in knl/make_defs.mk to be based on -march=knl, and then AVX-512 functionality is manually removed via various -mno-avx512* flags. Also, make the setting of CRVECFLAGS conditional on CC_VENDOR. Similar change to skx/make_defs.mk. - Comment/whitespace updates. commit ca982148b3b419db063cad2fa74376ec383a5c80 Author: dnp Date: Sun Apr 8 21:27:10 2018 -0500 Fixed bug in SKX sgemm microkernel. Modified SKX dgemm microkernel to be consistent with the sgemm microkernel commit bd0276752ccdd56ff897b1a5ae022f2ffe6e0b38 Author: Field G. Van Zee Date: Fri Apr 6 18:51:43 2018 -0500 Track separate ref kernel flags for each sub-config. Details: - Renamed CVECFLAGS variables in sub-configurations' make_defs.mk files to CKVECFLAGS. - Added default definitions of two new make variables to most sub-configurations' make_defs.mk files--CROPTFLAGS and CRVECFLAGS-- which correspond to reference kernel analogues of the CKOPTFLAGS and CKVECFLAGS, which track optimization and vectorization flags for optimized kernels. Currently, two sub-configurations (knl and skx) explicitly set CRVECFLAGS to non-default values (using AVX2 instead of AVX-512 for reference kernels). Thanks to Jeff Hammond, whose feedback prompted me to make this change (issue #187). - Changed common.mk so that the get-refkern-cflags-for function returns the flags associated with the given sub-configuration's CROPTFLAGS and CRVECFLAGS (instead of CKOPTFLAGS and CKVECFLAGS). commit b9aebce19480448817373e2df2b36bd090eae41a Author: Field G. Van Zee Date: Fri Apr 6 18:37:33 2018 -0500 De-verbosify makefile fragment generation. Details: - Changed from -v1 to -v0 when calling gen-make-frag.sh from configure. The directory-by-directory recursive output didn't add much value to the user, so now we just echo a line for each top-level directory into which we will recurse (e.g. 'config', 'ref_kernels', 'frame', etc.).
This also helps keep more interesting information (from earlier in the execution of configure) from scrolling out of the terminal window. commit b549b91f26948991e13364f1f26a878da0f43aa0 Author: Field G. Van Zee Date: Fri Apr 6 16:31:33 2018 -0500 Added 64-bit integer support to BLAS test drivers. Details: - Updated the build system and BLAS test drivers to use 64-bit integers when BLIS is configured for 64-bit integers in the BLAS layer. Also updated blastest/Makefile accordingly. Thanks to Dave Love for reporting the need for this feature. - Added a 'check' target to blastest/Makefile so that the user can see a summary of the tests. - Commented out the initial definition of INCLUDE_PATHS in common.mk, which was used pre-monolithic header, back when BLIS needed paths to *all* headers, rather than just a select few. This line is no longer needed since the value of INCLUDE_PATHS is overwritten by a later definition limited to only the header paths that are needed now. commit d39fa1c04265869bdf8b6f453076359eec2f3c59 Author: Field G. Van Zee Date: Thu Apr 5 19:38:35 2018 -0500 Adjusted CFLAGS used to compile bli_cntx_ref.c. Details: - Removed CKOPTFLAGS and CVECFLAGS from the set of CFLAGS used to compile bli_cntx_ref.c for each configuration. This is necessary because the file defines functions like bli_cntx_init_skx_ref(), which are called during BLIS's initialization of the global kernel structure, potentially being executed by an architecture that lacks the instruction set used to compile the kernels for, in this example, skx, which would lead to an illegal instruction error. Thanks to Dave Love for reporting this issue. - Further adjusted CFLAGS used when compiling code in the 'config' directory (e.g. bli_cntx_init_skx.c) as well as code in 'frame' so as to avoid the aforementioned issue. commit 08b123084d35680beab379012f8f5a5a8b44a443 Author: Field G. Van Zee Date: Thu Apr 5 14:25:39 2018 -0500 Added color-coding to 'make check' output. 
Details: - Added color coding to output of check-blistest.sh, check-blastest.sh scripts. Success messages are coded green and failures are coded red. This helps draw the eye toward those messages as the 'make checkblis', 'make checkblis-fast', and 'make checkblas' targets are executed. - Changed top-level Makefile so that execution will not halt if 'checkblis', 'checkblis-fast', or 'checkblas' targets fail, which means that the second of the two tests (BLIS and BLAS) run by 'make check' will run even if the first test fails. commit c9e4d7db7410b03c1ffe8c9727e9f1b2ba7fecfe Author: Field G. Van Zee Date: Wed Apr 4 17:13:15 2018 -0500 CHANGELOG update (0.3.1) commit 1f28d7c86e17730f05bd239c8e8d67e3e7510a4f (tag: 0.3.1) Author: Field G. Van Zee Date: Wed Apr 4 17:13:15 2018 -0500 Version file update (0.3.1) commit e6cc9ee26bcf0450f1120d5d12985b04d9fb8516 Merge: 786d15c5 3c91c7ae Author: Field G. Van Zee Date: Wed Apr 4 16:08:18 2018 -0500 Merge branch 'dev' of github.com:flame/blis into dev commit 786d15c5ef09f1f647b126b63d57e76d5810c58e Author: Field G. Van Zee Date: Wed Apr 4 16:06:47 2018 -0500 Added skx, knl to x86_64 configuration family. Details: - Added 'skx' and 'knl' sub-configurations to the 'x86_64' configuration family in the config_registry file. - Added logic to configure that avoids committing certain sub-configs to the configuration/kernel registries if those sub-configs cannot be handled properly by the chosen compiler. (This was modeled after similar logic in TBLIS's configure; thanks to Devin Matthews for pointing this out.) First, the compiler and its version are inspected and, based on the results, certain configurations are added to a "blacklist". Then, as the configuration registries are being created, configurations and/or kernels that match items in the blacklist are skipped over and not committed to the registries.
Under certain circumstances, omitting a blacklisted configuration will indirectly invalidate other configurations due to the loss of availability of the original blacklisted configuration's kernel set. This additional indirect blacklist is also accounted for. - Added output to the beginning of configure that echoes information about the chosen compiler as well as the configurations that are blacklisted and must be stripped from the registries. - Various other cleanups in configure, especially with respect to explicitly declaring local variables in functions. - Comment updates to config/zen/make_defs.mk regarding choice of -march flags based on compiler version. commit 3c91c7aebafb446a2582267beb3b22c8bb475b3b Author: Field G. Van Zee Date: Mon Apr 2 12:40:25 2018 -0500 Fixed 64b type mismatch warning in cblas_xerbla.c. Details: - Fixed a compiler warning concerning a type mismatch between the format specifier of the printf() call in cblas_xerbla.c and its corresponding (info) argument. The warning manifested when the CBLAS layer was enabled and the BLAS/CBLAS integer type size is set to 64 (the default is 32). The warning was fixed by changing the specifier from %d to %jd and typecasting the argument to intmax_t. Thanks to Dave Love for reporting this issue and submitting the patch. commit 71eaf449a812fe2bd640d21513ec83974b2edb45 Merge: 6a628184 ae9a5be5 Author: Field G. Van Zee Date: Tue Mar 27 17:21:43 2018 -0500 Merge branch 'dev' commit ae9a5be56d6f9b87278d6032154d2dcf3fb7d54f Author: dnp Date: Tue Mar 27 17:01:23 2018 -0500 Fixed bug in skx sgemm microkernel commit 3f02af0905b1e2e2e065862f8afe5e9a52f282b2 Author: Field G. Van Zee Date: Mon Mar 26 17:40:04 2018 -0500 Row storage optimizations to zen dotxf kernels. Details: - Split the main loop bodies of zen's [sd]dotxf kernels into two cases: one to handle a column-stored matrix A and one to handle a row-stored matrix A.
This allows vector instructions to be employed even if A is stored by rows (and A^T appears stored as columns). Both storage cases use a common edge case loop. Thanks to Devin Matthews for this idea and for prototyping the change needed for sdotxf kernel. commit 679dcc331dd870ec680e135a3fb65ffa6e3a91c2 Author: Field G. Van Zee Date: Mon Mar 26 15:35:17 2018 -0500 Make k_iter/k_left uint64_t in bulldozer fma ukrs. Details: - Changed the declaration of k_iter and k_left for d, c, z microkernels from dim_t to uint64_t. This is needed to ensure compatibility with the movq instruction used to load the value into registers. This change should have been made a long time ago, but for some reason only recently began showing up via Travis CI. commit 6a628184f6938673440e4cdd4fed0208c51fd1f9 Author: Field G. Van Zee Date: Mon Mar 26 14:48:16 2018 -0500 Fixed a memkind-related compile-time bug on knl. Details: - Fixed a compile-time error that occurred due to the fact that BLIS_ENABLE_MEMKIND, defined in bli_config.h, was not being defined soon enough to be used in bli_system.h where it is needed to determine whether hbwmalloc.h should be #included. bli_system.h is now included after bli_config.h (and bli_config_macro_defs.h). Thanks to Dave Love for reporting this issue. - Tweaked the language used by configure to echo the status of the --with[out]-memkind option. commit e2192a8fd58ec3657434ddd407033e097edad8f4 Author: Field G. Van Zee Date: Fri Mar 23 12:53:48 2018 -0500 Removed vzeroupper intrinsics from zen kernels. Details: - Fixed a bug in the zen (also used by haswell) dotxf kernels whereby a vzeroupper instruction destroyed part of the intermediate result stored by the vdpps instructions that came right before. (The vzeroupper intrinsic was removed.) - Removed remaining vzeroupper intrinsics from other zen kernels. Previously, the vzeroupper instructions were included because BLIS is typically compiled with -mfpmath=sse.
But it was brought to my attention that inserting these vzeroupper instructions is unnecessary for our purposes, since (a) -mfpmath=sse results in VEX-encoded scalar code rather than literal SSE instructions, and (b) compilers already (likely) insert vzeroupper instructions where necessary. Thanks to Devin Matthews for zeroing in on the dotxf bug. - Removed -malign-double from bulldozer make_defs.mk. This alignment was already happening by default since bulldozer is an x86_64 system. commit 22289ad23cd10b81451ce82f60d84b5f97e7fd85 Author: Field G. Van Zee Date: Thu Mar 22 18:21:30 2018 -0500 Added build system support for libmemkind. Details: - Added support for libmemkind to configure. configure attempts to detect the presence of libmemkind by compiling a small program containing #include <hbwmalloc.h> and a call to hbw_malloc(). If successful, it is assumed that libmemkind is present and available. If present, use of libmemkind is enabled by default, and otherwise use is disabled by default. If libmemkind is present, the user may explicitly disable use of the library by running configure with the --without-memkind option. Furthermore, a configuration may disable libmemkind, perhaps conditional on some aspect of the build system, by including -DBLIS_DISABLE_MEMKIND in the configuration's CPPROCFLAGS make variable and setting the BLIS_ENABLE_MEMKIND makefile variable, set in config.mk, to 'no'. (The knl configuration makes use of this latter feature; see below.) - If enabled at configure-time, bli_system.h will #include <hbwmalloc.h> and bli_kernel_macro_defs.h will define BLIS_MALLOC_POOL and BLIS_FREE_POOL to use hbw_malloc() and hbw_free(), respectively. - Deprecated explicit use of BLIS_NO_HBWMALLOC in config/knl/bli_family.knl.h and replaced use of -DBLIS_NO_HBWMALLOC in config/knl/make_defs.mk with -DBLIS_DISABLE_MEMKIND, which overrides (#undefs) the definition of BLIS_ENABLE_MEMKIND in bli_system.h, if it would otherwise be defined.
Also, set the BLIS_ENABLE_MEMKIND makefile variable to 'no'. - common.mk now adds libmemkind to LDFLAGS if libmemkind is enabled. commit 7dc40eafdd9af3e8c4519a8d1b04d25830b4ca7a Author: Field G. Van Zee Date: Wed Mar 21 18:39:16 2018 -0500 Updates to top-level and test driver Makefiles. Details: - Added logic to common.mk that will choose a BLIS library against which to link (LIBBLIS_LINK). The default choice is the static (.a) library; the shared (.so) library is chosen only if the shared library build was enabled and the static one was disabled. - Updated the various test driver Makefiles to reference this common, pre-chosen library against which to link. (Previously, these drivers unconditionally linked against the static library and would have failed if the static library build was disabled at configure-time.) - Renamed many of the variables in common.mk and the top-level Makefile so that variables relating to the libblis.[a|so] files, including paths to those files, begin with "LIBBLIS". - Shuffled around some of the library definitions from the top-level Makefile to common.mk. - Renamed BLIS_ENABLE_DYNAMIC_BUILD to BLIS_ENABLE_SHARED_BUILD, and the @enable_dynamic@ anchor to @enable_shared@ in build/config.mk.in and in configure. - A few other cleanups in the top-level Makefile. commit 97e1eeade3c51df1bae574a9bc1da34b05bf2bd3 Author: Field G. Van Zee Date: Wed Mar 21 15:47:11 2018 -0500 Added input.operations.fast file for 'make check'. Details: - Added an 'input.operations.fast' file to testsuite directory to go along with the 'input.general.fast' file used by the 'make check' target in the top-level Makefile. This will allow the "fast" check to prune operations and/or parameter combinations from the test space in order to save time. - Currently, input.operations.fast prunes trmm3 and all transposition and conjugation parameters from the level-3 test space. - Reduced problem size tested in input.general.fast to 100 and disabled testing of 1m method. 
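The LIBBLIS_LINK selection rule described in commit 7dc40eaf above (static by default, shared only when the shared build is enabled and the static build is disabled) can be illustrated with a small sketch. This is not the actual make logic in common.mk; `choose_libblis_link` is a hypothetical helper, and the file names are assumed from the "libblis.[a|so]" wording in the commit message.

```python
# Hypothetical illustration of the rule stated in the commit message;
# the real selection is implemented in make syntax inside common.mk.
def choose_libblis_link(static_enabled: bool, shared_enabled: bool) -> str:
    """Return the library name a test driver would link against."""
    # The shared library is chosen only if the shared build was enabled
    # and the static build was disabled.
    if shared_enabled and not static_enabled:
        return "libblis.so"
    # In every other case the static library is the default choice.
    return "libblis.a"


if __name__ == "__main__":
    for static_enabled in (True, False):
        for shared_enabled in (True, False):
            choice = choose_libblis_link(static_enabled, shared_enabled)
            print(f"static={static_enabled} shared={shared_enabled} -> {choice}")
```

Note that the sketch returns the static library even when both builds are disabled, since the commit message does not say how that case is handled.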
commit c441caa95aabe69f54e2160eb67bf4ca76a66c34 Author: Field G. Van Zee Date: Tue Mar 20 17:56:02 2018 -0500 README update. Details: - Minor updates to README.md. - Minor change to blastest/Makefile. commit 6fe018eb4ac8c16f2edc916c24f5994848017b7f Author: Field G. Van Zee Date: Tue Mar 20 15:35:45 2018 -0500 Added .gitkeep file to blastest/obj. Details: - Added an empty file named '.gitkeep' to blastest/obj/ so that git will track the otherwise empty directory. (This is already done for the BLIS testsuite in testsuite/obj.) commit 0e6d000db9291342913dc5f8590a28c67bbcbc95 Author: Field G. Van Zee Date: Tue Mar 20 15:08:43 2018 -0500 Updated .gitignore to ignore BLAS test out.* files. commit 40c040a31d96fbadff11f761d0cad1ef03ef2cc5 Author: Field G. Van Zee Date: Tue Mar 20 14:33:50 2018 -0500 Fixes to .travis.yml. Details: - Invoke the full BLIS testsuite via 'make testblis' instead of the fast version via 'blistest-fast' (which was wrong anyway, since the correct fast target is 'testblis-fast'). - Invoke the BLAS tests via 'make testblas' instead of 'blastest'. commit 664ec4813d8b53121cce7a68bef47da656ece9cb Author: Field G. Van Zee Date: Tue Mar 20 13:54:58 2018 -0500 Integrated f2c'ed netlib BLAS test suite. Details: - Created a new test suite that exercises only the BLAS compatibility found in BLIS. The test suite is a straightforward port of code obtained from netlib LAPACK, run through f2c and linked to a stripped-down version of libf2c that is compiled along with the test drivers (to prevent any obvious ABI issues). The new BLAS test suite can be run from within its new local directory, 'blastest' (through its local 'make ; make run' targets) or from the top-level Makefile (via the 'make testblas' target). Output files are created in whatever directory the test drivers are run, whether it be the 'blastest' directory, the top-level source distribution directory, or the out-of-tree directory in which 'configure' was run.
Also, the results of the BLAS test suite can be checked via 'make checkblas', which summarizes the presence or absence of test failures in a single line printed to stdout. - Updated the 'test' target to run both 'testblis' and 'testblas'. - Added a new 'testblis-fast' target that runs the BLIS testsuite with smaller problem sizes, allowing it to finish more quickly. - Added a 'make check' target, which runs 'checkblis-fast' and 'checkblas'. - Changed .travis.yml so that Travis CI runs 'testblis-fast' instead of 'testblis' before (calling the check-blistest.sh script to check the result manually). - Renamed some targets in the top-level Makefile to be consistent between BLAS and BLIS. commit fc53ad6c5b2e39238b1bbbf625cc0c638b9da4e1 Author: Nisanth M P Date: Mon Mar 19 12:49:26 2018 +0530 Re-enabling the small matrix gemm optimization for target zen Change-Id: I13872784586984634d728cd99a00f71c3f904395 commit d12d34e167d7dc32732c0ed135f8065a55088106 Author: Nisanth M P Date: Mon Mar 19 11:34:32 2018 +0530 Re-enabling Zen optimized cache block sizes for config target zen Change-Id: I8191421b876755b31590323c66156d4a814575f1 commit 40fa10396c0a3f9601cf49f6b6cd9922185c932e Author: Field G. Van Zee Date: Mon Mar 19 18:19:43 2018 -0500 Fixed a few obscure bugs in the BLAS API. Details: - Fixed a missing parameter in the definition of sdsdot_(). The 'sb' argument was missing. Strangely, the argument is omitted from dsdot_() in the BLAS API. - Fixed the missing 'c' or 'u' in the "?gerc" or "?geru" operation string passed to xerbla_() by the bla_ger_check() macro. - For bla_syrk_check() and bla_syr2k_check() macros, only allow conjugate-transpose (trans='c') as a valid argument for the real domain functions [sd]syrk_() and [sd]syr2k_(). (Previously, the argument was allowed even for the complex domain equivalents, which was inconsistent with the BLAS API.) commit fe7d7f1e43e4c26249eed83d4188beee1ba96202 Author: Field G. 
Van Zee Date: Sun Mar 18 19:43:06 2018 -0500 Fixed cpp macro parameter "ch" typo in bla_ger.c. Details: - Previously, the BLAS routine-generating macro in bla_ger.c was incorrectly passing MKSTR(ch) into the _check() macro when it should have been passing in the char that was available, chxy. I've instead changed the name of the macro parameter from chxy to ch. Similar change as made to bla_ger.h for consistency. Thanks to Dave Love in helping track this down. (NOTE: This is actually the root cause of the bug that was first patched by increasing the length of the operation name strings passed into xerbla_(), as defined by the constant BLIS_MAX_BLAS_FUNC_STR_LENGTH, in 3d1a5a7. In theory, that change could be backed out now.) - Applied aforementioned chxy->ch change to bla_dot.[ch], as well as frame/compat/cblas/f77_sub/f77_dot_sub.[ch] (not because it needed to happen, but for naming consistency). - Reformatted function signatures/prototypes of CBLAS functions and function calls to BLAS in frame/compat/cblas/f77_sub/*.c. commit cb7ed90752d1ddbac11368c4510641ca4f3a02eb Author: Field G. Van Zee Date: Fri Mar 16 13:05:56 2018 -0500 Convert op names to uppercase before calling xerbla_(). Details: - Defined a new function, bli_string_mkupper(), that calls toupper() on every non-NULL character in a string. - Call bli_string_mkupper() prior to calling xerbla_() in the level-2/-3 BLAS _check() macros. This prevents the BLAS testsuite from complaining that the operation name (e.g. "dgemm") does not match the expected value (e.g. "DGEMM"). Thanks to Dave Love for reporting this issue. commit 3d1a5a7c08fed3ba29f060fe1db2b0dc42dde223 Author: Field G. Van Zee Date: Fri Mar 16 12:24:07 2018 -0500 Fixed printf() format overflow. Details: - Increased the length of operation name strings passed to xerbla_() in the level-2 and level-3 operation _check() functions, found in frame/compat/check. This avoids a format specifier overflow warning by gcc 7. 
Thanks to Dave Love for reporting this issue and suggesting the fix. commit c73055f028684d998e03b2392093c393782bbfe7 Author: Field G. Van Zee Date: Thu Mar 15 16:08:21 2018 -0500 Return after non-zero info in BLAS checks. Details: - Previously, when calling the BLAS compatibility layer, discovering a parameter check failure would result in the proper setting of the info parameter (printed by xerbla_()), but would also come with an immediate abort() rather than a return. This was incorrect behavior for two overlapping reasons. (1) BLAS should return gracefully to the caller in the event of a bad set of parameters, not abort(). (2) When BLIS was being tested via the BLAS testsuite, BLIS's xerbla_() would correctly get preempted/overridden by the xerbla_() in the BLAS testsuite, but execution would then erroneously continue on to the BLIS implementation with bad parameter values. - The previous issue was addressed by disabling the abort() in BLIS's xerbla_(), changing all of the BLAS _check() functions to cpp macros, and adding a return statement to the end of each _check() macro's "if ( info != 0 )" conditional. Thanks to Dave Love for reporting this issue. commit c4f1d18b97a6a8c3ea0366aa759db597a664062a Author: Field G. Van Zee Date: Wed Mar 14 19:10:09 2018 -0500 Minor typo fix to printing arch in testsuite. Details: - Mistakenly was calling bli_cpuid_query_id() instead of bli_arch_query_id() in the recent addition to the testsuite output that prints the active sub-configuration. The former function is only used for multi-architecture builds, whereas the latter is the more general option that also works for single configuration (including 'configure auto') builds. commit 8f2fabec800a720b3e94b33c0048cc8c4ead436d Author: Devin Matthews Date: Wed Mar 14 17:43:42 2018 -0500 Make arm32 and arm64 families work. (#176) commit fc6a1842518a0820c6708c285611346d5a1419da Author: Field G. 
Van Zee Date: Wed Mar 14 15:31:17 2018 -0500 Print sub-configuration name in testsuite output. Details: - Added a line to the testsuite output that prints the name of the current/active sub-configuration. This is useful when linking the testsuite against multi-configuration builds because it confirms the sub-configuration that is actually being employed at runtime. Thanks to Devin Matthews for suggesting this feature. commit 9943a899d64bf7ec4a24106f6f4c70629bbe1f6e Merge: 290dd4a9 b1a15ae6 Author: Devin Matthews Date: Wed Mar 14 13:27:44 2018 -0500 Merge pull request #173 from devinamatthews/dev Fix Cortex-A9 and Cortex-A15 configs. commit b1a15ae6ee0f46c9a95cf59f9555925e0e8e21ff Author: Devin Matthews Date: Wed Mar 14 13:26:44 2018 -0500 Use BLIS_H_FLAT commit 290dd4a9feee447e69b40ad108954af78e196f7e Author: Field G. Van Zee Date: Wed Mar 14 13:15:37 2018 -0500 Allow arbitrarily deep configuration families. Details: - Updated configure so that configuration families specified in the config_registry are no longer constrained as being only one level deep. For example, previously the x86_64 family could not be defined concisely in terms of, say, intel64 and amd64 families, and instead had to be defined as containing "haswell, sandybridge, penryn, zen, etc." In other words, families were constrained to only having singleton configurations as their members. That constraint is now lifted. - Redefined x86_64 family in config_registry in terms of intel64 and amd64. commit 9cee78e006d56543ac02fc9c488905c0434e60ae Author: Devin Matthews Date: Wed Mar 14 13:09:48 2018 -0500 Fix Cortex-A9 and Cortex-A15 configs. Tested with QEMU. commit 1a3031740f7fcbbcc2c99d5c4cb50d0413407455 Author: Field G. Van Zee Date: Tue Mar 13 16:04:40 2018 -0500 Updates to ARM hardware detection support. Details: - Updated/clarified the ARM preprocessor macro branch of bli_cpuid.c. Going forward, cortexa57 (64-bit), cortexa15, and cortexa9 (32-bit) sub-configurations are supported. 
However, the functions that detect features specific to a15 and a9 are identical, and since a15 is tested first, it will always be chosen for arm32 hardware (even if both sub-configurations were enabled at configure-time and the library is linked and run on an a9). Thus, more work needs to be done to distinguish these two. - Added cpp guard around x86_64 portions of bli_cpuid.c. Now, either the x86_64 or ARM code will be compiled (or neither, if neither environment is detected). - In bli_arch_query_id(), call bli_cpuid_query_id() when the BLIS_FAMILY_ARM64 or BLIS_FAMILY_ARM32 macros are defined. - Added arm64 and arm32 configuration families to config_registry. - Added a note to the arch_t typedef enum in bli_type_defs.h reminding the developer to update the string array in bli_arch.c whenever new enum values are added or existing values are reordered. commit 1442d06886ebdc34d8f1cb620229ddc6062c2ce8 Author: Field G. Van Zee Date: Sun Mar 11 16:59:50 2018 -0500 Fixed misnamed kernels in _cntx_init_cortexa57.c. Details: - Changed incorrect kernel function names in bli_cntx_init_cortexa57.c: bli_sgemm_cortexa57_asm_8x12 -> bli_sgemm_armv8a_asm_8x12 bli_dgemm_cortexa57_asm_6x8 -> bli_dgemm_armv8a_asm_6x8 Thanks to Jacob Gorm Hansen for reporting this issue. commit 28bcea37dfcf0eb99a99da6f46de2a2830393d1d Merge: b1ea3092 8b0475a8 Author: praveeng Date: Fri Mar 9 19:13:08 2018 +0530 Merge master code till 06_mar_2018 to amd-staging Change-Id: I12267e5999c92417e3715fef4f36ac2131d00f1a commit 48da9f5805f0a49f6ad181ae2bf57b4fde8e1b0a Author: Field G. Van Zee Date: Wed Mar 7 12:54:06 2018 -0600 Tweaked common.mk, Makefile, skx/knl make_defs.mk. Details: - Reorganized linker-related section of common.mk so that LDFLAGS set in a sub-configuration's make_defs.mk file will not be immediately (and erroneously) overridden by the default values. - Re-enabled redirected (to file) output of the testsuite when run from the top-level Makefile via 'make test'. 
(For some reason, it was commented-out for the non-verbose case.) - Removed old/unnecessary code from the make_defs.mk files of skx and knl sub-configurations. commit 8b0475a87daa177916e2caac0e530c6a57fa07cf Author: Field G. Van Zee Date: Tue Mar 6 06:39:44 2018 -0600 Fixed typo in attempted fix in 1a8350f7. Details: - Mistakenly entered 148 as knl mc blocksize for double real when the value should have been 144. Thanks to Dave Love for reporting this. commit 8912e6886b97eabb4ce0c35a3609a0fd994d347b Author: Field G. Van Zee Date: Mon Mar 5 18:00:45 2018 -0600 Fixed missing flags during shared object build. Details: - Fixed a bug in common.mk that caused warning, position-independent code, miscellaneous, and general preprocessor flags to be omitted from the configuration family-specific variables that hold those values, as registered by the family's make_defs.mk file. This would most obviously manifest when targeting a configuration family such as 'intel64' while simultaneously configuring for a shared object build, as the key '-fPIC' flag would be omitted at compile-time and prevent successful linking. Thanks to Dave Love for reporting this bug. - Other cleanups to common.mk for readability and clarity. commit 1a8350f70557fc53ca0c2eadf2076710dd0d9bc9 Author: Field G. Van Zee Date: Mon Mar 5 13:32:00 2018 -0600 Fixed cache blocksize bug in knl configuration. Details: - Changed the mc blocksize for double real execution in the knl sub-configuration from 160 to 148. The old value was not a multiple of mr (which is 24), and thus the safeguards in bli_gks_register_cntx() were tripping. Thanks to Dave Love for reporting this issue. - Switch knl sub-configuration to use default blocksizes for datatypes not supported by native kernels. - Fixed typos in bli_error.c that prevented certain error strings (which report maximum cache blocksizes not being multiples of their corresponding register blocksize) from properly initializing.
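Editorial note: the blocksize rule behind the two knl fixes above can be checked with simple arithmetic. The following is a minimal sketch that assumes only the values quoted in the commit messages (mr = 24 for double real on knl, and the candidate mc values 160, 148, and 144). The helper name is hypothetical and is not part of BLIS.

```python
# Hypothetical helper, not part of BLIS: checks the rule that a cache
# blocksize (mc) must be a whole multiple of its register blocksize (mr).
def is_multiple_of_register_blocksize(mc: int, mr: int) -> bool:
    return mc % mr == 0


MR_KNL_DOUBLE = 24  # mr for double real on knl, as stated in 1a8350f

# Candidate mc values mentioned in commits 1a8350f and 8b0475a.
for mc in (160, 148, 144):
    remainder = mc % MR_KNL_DOUBLE
    ok = is_multiple_of_register_blocksize(mc, MR_KNL_DOUBLE)
    status = "ok" if ok else "rejected"
    print(f"mc={mc}: mc % mr = {remainder} -> {status}")
```

Of the three values, only 144 leaves no remainder, which is consistent with the follow-up correction in 8b0475a.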
commit c09fffa827fe6241dc20193a1c404496664220de Author: Field G. Van Zee Date: Sat Mar 3 13:13:39 2018 -0600 Added missing cntx_t* arg in knl packm kernels. Details: - Added the missing cntx_t* argument to the function signature of packm kernels in kernels/knl/1m/. Thanks to Dave Love for reporting this issue. commit b1ea30925dff751eced23dfa94ff578a20ea0b94 Author: Field G. Van Zee Date: Fri Feb 23 17:42:48 2018 -0600 CHANGELOG update (0.3.0) Change-Id: Id038b00a62de51c9818ad249651ec5dc662f4415 commit 1ef9360b1fd0209fbeb5766f7a35402fbd080fcb Author: Field G. Van Zee Date: Thu Mar 1 14:36:39 2018 -0600 Enable non-unit vector stride tests by default. Details: - Change "vector storage schemes to test" parameter in testsuite's input.general file to "cj". This means that both unit stride column vectors and non-unit stride column vectors will be tested in operations with vector operands (e.g. level-1v, level-1f, level-2). - Very minor comment (typo) changes to input.operations. commit 8c4e55a1a1ead9a5e970200fee027ffd2c7e8454 Author: Field G. Van Zee Date: Wed Feb 28 17:01:47 2018 -0600 Added individual operation overrides in testsuite. Details: - Updated the testsuite driver so that setting one or more individual operation test switches to "2" in input.operations will enable ONLY those operations and disable all others, regardless of the values of the section overrides and other operation switches. This makes it very easy to quickly test only one or two operations, and equally easy to revert back to the previous combination of operation tests. - Added more comments to input.operations describing the use of individual "enable only" overrides. commit 34862aed89e5d5a8f35aeecd49f3052ada1f337b Author: Field G. Van Zee Date: Wed Feb 28 15:30:14 2018 -0600 Use zen kernels in haswell sub-configuration. Details: - Register use of level-1v zen intrinsic kernels for amaxv, axpyv, dotv, dotxv, and scalv, as well as level-1f zen intrinsic kernels for axpyf and dotxf.
This works because these kernels simply target AVX/AVX2, and therefore work without modification on haswell hardware. - Switch to use of zen microkernels in bli_cntx_init_haswell.c. The zen kernels are essentially identical to those used by haswell, except that now zen kernels are a bit more up-to-date. In the future, I may continue to maintain duplicates, or I may keep the kernels named after one architecture (zen or haswell) but used by both sub-configurations. - In config_registry, enable use of both haswell and zen kernels for the haswell sub-configuration. This is necessary in order to make zen kernels visible when registering kernels in bli_cntx_init_haswell.c. - Enable use of assembly-based complex gemm microkernels for zen, bli_cgemm_zen_asm_3x8() and bli_zgemm_zen_asm_3x4(), in bli_cntx_init_zen.c. This was actually intended for 1681333. commit 709f8361ebc90b96b02ebe5c5ffb6fc3b1b25e58 (tag: 0.3.0) Author: Field G. Van Zee Date: Fri Feb 23 17:42:48 2018 -0600 Version file update (0.3.0) commit d9079655c9cbb903c6761d79194a21b7c0a322bc Author: Field G. Van Zee Date: Fri Feb 23 17:42:48 2018 -0600 CHANGELOG update (0.3.0) commit 3defc7265c12cf85e9de2d7a1f243c5e090a6f9d Author: Field G. Van Zee Date: Fri Feb 23 17:38:19 2018 -0600 Applied 34b72a3 to non-active/unused microkernels. Details: - Applied the read-beyond-bounds bugfix in 34b72a3 to other haswell and zen kernels (ie: other microtile shapes) which are not used by default. This was done mostly in case someone decided to pick up these kernels and start using them, not because it affects BLIS's behavior out-of-the-box. commit 34b72a351745aa0d47bb0b74ebcd0f0a616d613d Author: Field G. Van Zee Date: Fri Feb 23 16:33:32 2018 -0600 Fixed obscure read-beyond-bounds bug in sgemm ukrs. Details: - Fixed an obscure bug in the bli_sgemm_haswell_asm_6x16 and bli_sgemm_zen_asm_6x16 microkernels when the input/output matrix C is stored with general stride (ie: both rs and cs are non-unit). 
The bug was rooted in the way those microkernels read from matrix C-- namely, they used vmovlps/vmovhps instead of movss. By loading two floats at a time, even if one of them was treated as junk, the assembly code could be written in a more concise manner. However, under certain conditions--if m % mr == 0 and n % nr == 0 and the underlying matrix is not an internal "view" into a larger matrix-- this could result in the very last vmovhps of the last (bottom-right) microkernel invocation reading beyond valid memory. Specifically, the low 32 bits read would always be valid, but the high 32 bits could reside beyond the bounds of the array in which the output C matrix is contained. To remedy this situation, we now selectively use movss to load any element that could be the last element in the matrix. commit 5112e1859e7f8888f5555eb7bc02bd9fab9b4442 (origin/rt) Author: Field G. Van Zee Date: Fri Feb 23 14:31:26 2018 -0600 Added missing 'restrict' to some kernels' cntx_t*. Details: - Added missing 'restrict' keyword to cntx_t* argument of function signatures corresponding to level-1v, level-1f, and level-1m kernels. This affected bli_l1v_ker_prot.h, bli_l1f_ker_prot.h, and bli_l1m_ker_prot.h. (The 'restrict' was already being used to qualify cntx_t* arguments for kernels defined in bli_l3_ker_prot.h.) - Added comments to bli_l1v_ker.h, bli_l1f_ker.h, bli_l1m_ker.h, and bli_l3_ukr.h that help explain how those headers function to produce kernel prototypes using the prototype macros defined in the files mentioned above. commit 1fa8af95d807168e0849adb668492601e7009be0 Merge: c084b03b 16813335 Author: Field G. Van Zee Date: Wed Feb 21 17:54:02 2018 -0600 Merge branch 'rt' commit c084b03b31d84427a120e391963db5419f1911ee Merge: 5d03b6e6 fa74af4e Author: Field G. Van Zee Date: Wed Feb 21 17:52:17 2018 -0600 Merge branch 'rt' commit 16813335bdb5978bc9a26cd00a32bd5a130130c4 Merge: fa74af4e 5a7005dd Author: Field G. 
Van Zee Date: Wed Feb 21 17:43:32 2018 -0600 Merge branch 'amd' into rt Details: - Merged contributions made by AMD via 'amd' branch (see summary below). Special thanks to AMD for their contributions to-date, especially with regard to intrinsic- and assembly-based kernels. - Added column storage output cases to microkernels in bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with the extra cost of transposing the microtile in registers, this is much faster than using the general storage case when the underlying matrix is column-stored. - Added s and d assembly-based zen gemmtrsm_u microkernel (including column storage optimization mentioned above). - Updated zen sub-configuration to reflect presence of new native kernels. - Temporarily reverted zen sub-configuration's level-3 cache blocksizes to smaller haswell values. - Temporarily disabled small matrix handling for zen configuration family in config/zen/bli_family_zen.h. - Updated zen CFLAGS according to changes in 1e4365b. - Updated haswell microkernels such that: - only one vzeroupper instruction is called prior to returning - movapd/movupd are used in lieu of movaps/movups for double-real microkernels. (Note that single-real microkernels still use movaps/movups.) - Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is now included via frame/include/bli_arch_config.h. - Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation in testsuite/src/test_amaxv.c). - Added early return for alpha == 0 in bli_dotxv_ref.c. - Integrated changes from f07b176, including a fix for undefined behavior when executing the 1m method under certain conditions. - Updated config_registry; no longer need haswell kernels for zen sub-configuration. - Tweaked marginal and pass thresholds for dotxf. - Reformatted level-1v, -1f, and -3 amd kernels and inserted additional comments. - Updated LICENSE file to explicitly mention that parts are copyright UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates. Summary of previous changes from 'amd' branch. - Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and s and d assembly-based zen gemmtrsm_l microkernels (d6x8). - Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv, and scalv, with extra-unrolling variants for axpyv and scalv. - Added a small matrix handler to bli_gemm_front(), with the handler implemented in kernels/zen/3/bli_gemm_small_matrix.c. - Added additional logic to sumsqv that first attempts to compute the sum of the squares via dotv(). If there is a floating-point exception (FE_OVERFLOW), then the previous (numerically conservative) code is used; otherwise, the result of dotv() is square-rooted and stored as the result. This new implementation is only enabled when FE_OVERFLOW is #defined. If the macro is not #defined, then the previous implementation is used. - Added axpyv and dotv standalone test drivers to test directory. - Added zen support to old cpuid_x86.c driver in build/auto-detect/old. - Added thread-local and __attribute__-related macros to bli_macro_defs.h. commit 5d03b6e6e19d5a07f0cccf1a158f02fbd62dfd99 Author: Devin Matthews Date: Mon Feb 19 11:31:30 2018 -0600 Fix asm macro include line for KNL. Fixes #167. commit f07b176c84dc9ca38fb0d68805c28b69287c938a Author: Field G. Van Zee Date: Thu Feb 15 18:36:54 2018 -0600 Fixed an obscure bug in the 1m implementation. Details: - Fixed a bug in the way the bli_gemm1m_cntx_ref() function (defined in ref_kernels/bli_cntx_ref.c) initializes its context for 1m execution. Previously, the function probed the context that was in the process of being updated for use with 1m--this context being previously initialized/copied from a native context--for its storage preference to determine which "variant" (row- or column-oriented) of 1m would be needed. 
However, the _cntx_ref() function was not updating the method field of the context until AFTER this query, and the conditional which depended on it, had taken place, meaning the storage preference query function would mistakenly think the context was for native execution, since the context's method field would still be set to BLIS_NAT. This would lead it to incorrectly grab the storage preference of the complex domain microkernel rather than the corresponding real domain microkernel, which could cause the storage preference predicate to evaluate to the wrong value, which would lead to the _cntx_ref() function choosing the wrong variant. This could lead to undefined behavior at runtime. The method is now explicitly set within the context prior to calling the storage preference query function. - Updated comments in frame/ind/oapi/bli_l3_3m4m1m_oapi.c. - Fixed a typo in the commented-out CFLAGS in config/zen/make_defs.mk, which are appropriate for gcc 6.x and newer. (Mistakenly used -march=bdver4 instead of -march=znver1.) commit 1f94bb7b96eb2b67257e6c4df89e29c73e9ab386 Author: Field G. Van Zee Date: Fri Jan 19 12:46:53 2018 -0600 Document how to enable zen-specific instructions. Details: - Added as a comment in config/zen/make_defs.mk the list of compiler flags that could be added to manually enable the instructions provided by the Zen microarchitecture that are not already implied by -march=bdver4. This information, along with the previous commit's flags to selectively disable Bulldozer instructions no longer present in Zen, was gathered from [1]. I hesitate to enable use of these instructions since I don't have any Zen hardware to test on yet. [1] https://wiki.gentoo.org/wiki/Ryzen commit 1e4365b21bafa02bd108c5ac4705a25671fb9441 Author: Field G. Van Zee Date: Thu Jan 18 12:03:51 2018 -0600 Augment zen CFLAGS to prevent illegal instruction. 
Details: - Added various compiler flags (-mno-fma4 -mno-tbm -mno-xop -mno-lwp) so that compiling with -march=bdver4 on zen-based architectures does not result in an illegal instruction error at runtime. Note: This fix is only needed for gcc 5.4; gcc 6.3 or later supports the use of -march=znver1, which can be used in lieu of the augmented set of flags based on bdver4. Thanks to Nisanth Padinharepatt for reporting this error. commit fa74af4e1fa7385ac3f3089fe1ea7bb88c906029 Author: Field G. Van Zee Date: Tue Jan 9 13:43:15 2018 -0600 Minor labeling update for './configure -c' output. Details: - Print the name of the configuration in the output of the kernel-to-config map (and chosen pairs list) as a subtle way to remind the user that these only apply to the targeted configuration (whereas the config list and kernel list are printed without regard to which configuration was actually targeted). commit 5cdea756c7391e2c6cbfb38436ef9a205f860237 Merge: 9d8858b5 1e7a4896 Author: Field G. Van Zee Date: Sun Jan 7 19:45:20 2018 -0600 Merge branch 'rt' commit 9d8858b5cff4a4b078b87872847a5710073fff0a Merge: 0b3ca3cf f7df64da Author: Devin Matthews Date: Sun Jan 7 10:03:25 2018 -0600 Merge pull request #164 from devinamatthews/master Don't use memkind for skx configuration. commit f7df64daf6bbe6431effada6e13d8d1fab5aa221 Author: Devin Matthews Date: Sun Jan 7 09:37:25 2018 -0600 Don't use memkind for skx configuration. Fixes #163. commit 1e7a4896e0cbe73c4685fa956278e3f28273cdf9 Author: Field G. Van Zee Date: Fri Jan 5 12:33:48 2018 -0600 Minor error handling in update-version-file.sh. Details: - Added explicit handling of situations when 'git describe --tags' returns an error. This command is used by update-version-file.sh when deciding whether or not to update the version file prior to configuration. - Removed bli_packm.c and bli_unpackm.c, as they contained no source code. commit 0b3ca3cfb682715a3686fd93ebb10d4a695d1162 Author: Field G. 
Van Zee Date: Thu Jan 4 20:51:35 2018 -0600 Intelligently select compiler for auto-detection. Details: - Rewrote code that selects the compiler for the purposes of compiling the auto-detection executable. CC (if specified) is tried first. Then gcc. Then clang. The absolute fallback is cc. The previous code was sort of broken, and seemed to unintentionally always use gcc. - Moved various configuration-agnostic flags from config/*/make_defs.mk files to common.mk. The new mechanism appends the configuration-agnostic flags to the various compiler flag variables initialized in make_defs.mk. Flags specific to the sub-configuration are still set in make_defs.mk. - Added -Wno-tautological-compare to CMISCFLAGS when clang is in use. Also added the flag to the compiler instantiation during configure-time hardware detection (when clang is selected). - Added some missing (but mostly-optional) quotes to configure script. commit 5a7005dd44ed3174abbe360981e367fd41c99b4b Merge: 7be88705 3bc99a96 Author: Nisanth M P Date: Wed Jan 3 12:05:12 2018 +0530 Merge changes in AMD beta release 0.95 into amd branch commit 0b9c5127e91508c115228ca604ee2dac8de8f477 Author: Field G. Van Zee Date: Sat Dec 23 15:53:44 2017 -0600 Enabled C99, added stdint.h to auto-detect build. Details: - Added "-std=c99" to compiler arguments when building auto-detection driver in configure script. - Added #include <stdint.h> to all three source files needed by auto-detection program. commit 0ce5e19c318e04909d3e664d69accb3a0fc6b988 Author: Field G. Van Zee Date: Sat Dec 23 15:32:03 2017 -0600 Reimplemented configure-time hardware detection. Details: - Reimplemented the hardware detection functionality invoked when running "./configure auto". Previously, a standalone script in build/auto-detect that used CPUID was used. However, the script attempted to enumerate all models for each microarchitecture supported. The new approach recycles the same code used for runtime hardware detection introduced in 2c51356.
This has two immediate benefits. First, it reduces and consolidates the code required to detect microarchitectures via the CPUID instruction. Second, it provides an indirect way of testing at configure-time the code that is used to detect hardware at runtime. This code is (a) only activated when targeting a configuration family (such as intel64 or amd64) at configure-time and (b) somewhat difficult to test in practice, since it relies on having access to older microarchitectures. - The above change required placing conditional cpp macro blocks in bli_arch.c and bli_cpuid.c which either #include "blis.h" or #include a bare-bones set of headers that does not rely on the presence of a bli_config.h header. This is needed because bli_config.h has not been created yet when configure-time auto-detection takes place. - Defined a new function in bli_arch.c, bli_arch_string(), which takes an arch_t id and returns a pointer to a string that contains the lowercase name of the corresponding microarchitecture. This function is used by the auto-detection script to printf() the name of the sub-configuration corresponding to the detected hardware. commit 9804adfd405056ec332bb8e13d68c7b52bd3a6c1 (origin/selfinit) Author: Field G. Van Zee Date: Thu Dec 21 19:22:57 2017 -0600 Added option to disable pack buffer memory pools. Details: - Added a new configure option, --[en|dis]able-packbuf-pools, which will enable or disable the use of internal memory pools for managing buffers used for packing. When disabled, the function specified by the cpp macro BLIS_MALLOC_POOL is called whenever a packing buffer is needed (and BLIS_FREE_POOL is called when the buffer is ready to be released, usually at the end of a loop). When enabled, which was the status quo prior to this commit, a memory pool data structure is created and managed to provide threads with packing buffers.
The memory pool minimizes calls to bli_malloc_pool() (i.e., the wrapper that calls BLIS_MALLOC_POOL), but does so through a somewhat more complex mechanism that may incur additional overhead in some (but not all) situations. The new option defaults to --enable-packbuf-pools. - Removed the reinitialization of the memory pools from the level-3 front-ends and replaced it with automatic reinitialization within the pool API's implementation. This required an extra argument to bli_pool_checkout_block() in the form of a requested size, but hides the complexity entirely from BLIS. And since bli_pool_checkout_block() is only ever called within a critical section, this change fixes a potential race condition in which threads using contexts with different cache blocksizes--most likely a heterogeneous environment--can check out pool blocks that are too small for the submatrices they wish to pack. Thanks to Nisanth Padinharepatt for reporting this potential issue. - Removed several functions in light of the relocation of pool reinit, including bli_membrk_reinit_pools(), bli_memsys_reinit(), bli_pool_reinit_if(), and bli_check_requested_block_size_for_pool(). - Updated the testsuite to print whether the memory pools are enabled or disabled. commit 107801aaae180c00022f1b990bc59038c14949d2 Merge: d9c05745 0084531d Author: Field G. Van Zee Date: Mon Dec 18 16:29:28 2017 -0600 Merge branch 'master' into selfinit commit 0084531d3eea730a319ecd7018428148c81bbba7 Author: Field G. Van Zee Date: Sun Dec 17 18:58:25 2017 -0600 Updated flatten-headers.py for python3. Details: - Modified flatten-headers.py to work with python 3.x. This mostly amounted to removing print statements (which I replaced with calls to my_print(), a wrapper to sys.stdout.write()). Thanks to Stefan Husmann for pointing out the script's incompatibility with python 3. - Other minor changes/cleanups. commit 90b11b79c302f208791bdfb1ed754873103c7ce5 Author: Field G.
Van Zee Date: Sun Dec 17 17:34:32 2017 -0600 Modest performance boost to flatten-headers.py. Details: - Updated flatten-headers.py to pre-compile the main regular expression used to isolate #include directives and the header filenames they reference. The compiled regex object is then used over and over on each header file in the tree of referenced headers. This appears to have provided a 1.7-2x performance increase in the best case. - Other minor tweaks, such as renaming the main recursive function from replace_pass() to flatten_header(). commit 99dee87f30b4d437fa6b5e4ba862526d07b9f08b Author: Field G. Van Zee Date: Sun Dec 17 16:47:27 2017 -0600 Reimplemented flatten-headers.sh in python. Details: - Added flatten-headers.py, a python implementation of the bash script flatten-headers.sh. The new script appears to be 25-100x faster, depending on the operating system, filesystem, etc. The python script abides by the same command line interface as its predecessor and targets python 2.7 or later. (Thanks to Devin Matthews for suggesting that I look into a python replacement for higher performance.) - Activated use of flatten-headers.py in common.mk via the FLATTEN_H variable. - Made minor tweaks to flatten-headers.sh such as spelling corrections in comments. commit d9c0574599c3f97c0f9b6c334a077bab9452e1f4 Author: Field G. Van Zee Date: Thu Dec 14 17:13:42 2017 -0600 Allow travis failures of OS X builds that run testsuite. Details: - Added an allowance for OS X builds that run the testsuite to fail. There seems to be an issue with 1m when running in Travis CI under OS X and clang, but only in double-precision. Haven't been able to reproduce the error on my own, and thus, I can't debug it. (Hopefully it is simply a version-specific compiler bug.) commit 86cd23b7379b00a42b4ecc04fa668f1e3f9b54ee Author: Field G. Van Zee Date: Thu Dec 14 15:47:41 2017 -0600 Fixed testsuite Makefile brokenness from 9091a207. 
Details: - Fixed a makefile error encountered when building the testsuite directly in its directory (as opposed to indirectly via 'make test'). The fix involves introducing a new variable, BUILD_PATH, alongside the existing DIST_PATH variable. By default, BUILD_PATH is set to the current directory, and is overridden by other Makefiles used by, for example, the testsuite and standalone test drivers in testsuite or test, respectively. - Some files/directories in common.mk were redefined in terms of BUILD_DIR, such as the locations of config.mk file and the intermediate include directory. commit 6a3a8924c04d25507fc4aa593df30c56c7dc12f7 Author: Field G. Van Zee Date: Thu Dec 14 13:20:02 2017 -0600 Temporarily show Makefile's testsuite output. Details: - Disabled redirection of testsuite output for 'test' target. This is part of an attempt to debug a segmentation fault on OS X via Travis. commit 9a01080dd426915bed18229f70401bfa639dc283 Merge: 83316485 a32e8a47 Author: Field G. Van Zee Date: Thu Dec 14 11:27:19 2017 -0600 Merge branch 'master' into selfinit commit a32e8a47c022b6071302b2956af5728976c83ca9 (origin/travis) Author: Field G. Van Zee Date: Wed Dec 13 16:31:36 2017 -0600 Added an exclusion to .travis.yml. Details: - Added exclusion for out-of-tree builds on OS X (clang). commit b9f7d987df548965c86e16e0ba94d5cad0d9b399 Author: Field G. Van Zee Date: Wed Dec 13 16:22:09 2017 -0600 Cleaned up after previous travis oot debugging. Details: - Removed debugging output from common.mk related to Travis CI out-of-tree builds. - Other minor cleanups to common.mk. commit 9091a207aa8c49e279676ea02be533480b3b0d5a Author: Field G. Van Zee Date: Wed Dec 13 16:12:34 2017 -0600 Attempted fix to travis oot build failure. Details: - Found the likely cause of the Travis CI out-of-tree build failures: config.mk was being read from DIST_PATH, rather than the current directory. commit c01c71c33e236e6c91f5ddd3ec1e3faec89368c1 Author: Field G. 
Van Zee Date: Wed Dec 13 15:58:50 2017 -0600 Added debugging output to Makefile. Details: - Added $(info ...) statements in key locations in an attempt to reveal why Travis CI doesn't like building BLIS out-of-tree. commit 784289d69dd6b3692444d3b3e290f6a014465b72 Author: Field G. Van Zee Date: Wed Dec 13 15:31:27 2017 -0600 Updated SHELL in common.mk from /bin/bash to bash. commit d9bb1d1d4ebc89ea75d9d927d09882162a914f77 Author: Field G. Van Zee Date: Wed Dec 13 15:27:54 2017 -0600 Defined SHELL in common.mk so "echo -n" works. Details: - Defined the SHELL variable in common.mk as "/bin/bash" so that the -n option can be used with echo in the Makefile rule for flattening blis.h. Thanks to Devin Matthews for suggesting this fix. commit 9289a08667df2044f3a37af54d893efe2b56d555 Author: Field G. Van Zee Date: Wed Dec 13 15:14:27 2017 -0600 Attempt 3 on .travis.yml. commit 720bfcf0ef54fdc41df0dcaa94503edb0d5c8972 Author: Field G. Van Zee Date: Wed Dec 13 14:52:28 2017 -0600 More fixes to .travis.yml. Details: - Fixed a mistake (hopefully) in d0c4dd0 that resulted in many more osx/clang sub-tests than intended. - Shortened the variable names in an effort to make them more readable via the Travis CI web interface. commit 8717c9c97fe9b1ecd3b3192049a73976f8390ca7 Author: Field G. Van Zee Date: Wed Dec 13 14:36:37 2017 -0600 Added 'pwd' commands to .travis.yml for debugging. Details: - Added 'pwd' commands to the script portion of the .travis.yml file in an attempt to uncover the problem with the recent out-of-tree build testing changes made in d0c4dd0. commit 83316485ce10f6fcafe92a1c146282de0dd8068a Author: Field G. Van Zee Date: Wed Dec 13 14:14:50 2017 -0600 Simplified/fixed self-initialization. Details: - Fixed a race condition in self-initialization whereby the bli_is_init static variable could be erroneously read as TRUE by thread 1 while thread 0 is still executing bli_init_apis(), thus allowing thread 1 to use the library before it is actually ready. 
Thanks to Minh Quan Ho and Devin Matthews for pointing out this issue. - Part of the solution to the aforementioned race condition involved replacing the runtime initialization of the global scalar constants (e.g., BLIS_ONE, BLIS_ZERO, etc.) in bli_const.c with a static initialization of those same constants. This eliminates the need for bli_const_init() altogether. (The static initialization is made concise via preprocessor macros.) - Defined bli_gks_query_cntx_noinit(), which behaves just like bli_gks_query_cntx(), except that it does not call bli_init_once(). This function is called in lieu of bli_gks_query_cntx() in bli_ind_init() and bli_memsys_init() so as to not result in any recursion into bli_init_once(). - Removed BLIS_ONE_HALF, BLIS_MINUS_ONE_HALF global scalar constants. They have no use in BLIS or its test products, and we have little reason to believe they are used by others. - Removed testsuite/out file, which was accidentally committed as part of 70640a3. commit 6526d1d4ae6dbfa854ca8d1e5f224cd6ab3fa958 Author: Field G. Van Zee Date: Tue Dec 12 13:50:43 2017 -0600 Added temp_dir argument to flatten-headers.sh. Details: - Added "temp_dir" argument to flatten-headers.sh so that the caller can specify where intermediate files should be created as the script runs. - Updated flatten-headers.sh to create intermediate files in temp_dir instead of alongside the corresponding source files. This should now (once again) allow out-of-tree builds where the BLIS distribution is read-only, or where the out-of-tree build is running concurrently with another out-of-tree build. (Thanks to Devin Matthews for pointing out the possibility of simultaneous out-of-tree builds.) commit 94755017c967630daf2e31c1f63ed5e88ab0d6ab Merge: d0c4dd00 5cf7b0c4 Author: Field G. Van Zee Date: Tue Dec 12 12:50:41 2017 -0600 Merge branch 'master' of github.com:flame/blis commit d0c4dd000ff38acc249e8acf7e0655a523991695 Author: Field G.
Van Zee Date: Tue Dec 12 12:47:53 2017 -0600 Added out-of-tree build test to .travis.yml file. Details: - Modified .travis.yml file to include an out-of-tree build test (using the "auto" configure target). Thanks to Devin Matthews for this suggestion. commit 5cf7b0c4e52922069183a87dc2aa177419644e04 Author: Devin Matthews Date: Tue Dec 12 12:38:48 2017 -0600 Ignore blis.h.interm [ci skip] commit 8d8ff74d15b4a584929cec36034ba6d3c53f7d27 Author: Field G. Van Zee Date: Tue Dec 12 12:32:50 2017 -0600 Further attempt to fix out-of-tree builds. Details: - Fix applied in 87978f6 was necessary but not sufficient to fix out-of-tree builds. It turns out that using a source tree that had already built the target erroneously gave the impression that out-of-tree builds were working again, when in fact they were still broken. The additional changes in this commit should complete the fix that was started in the aforementioned commit. Thanks to Devin Matthews and Shaden Smith for their help in isolating this issue. commit 70640a37109290b57c344083c00624e13c496e30 Author: Field G. Van Zee Date: Mon Dec 11 17:18:43 2017 -0600 Implemented library self-initialization. Details: - Defined two new functions in bli_init.c: bli_init_once() and bli_finalize_once(). Each is implemented with pthread_once(), which guarantees that, among the threads that pass in the same pthread_once_t data structure, exactly one thread will execute a user-defined function. (Thus, there is now a runtime dependency against libpthread even when multithreading is not enabled at configure-time.) - Added calls to bli_init_once() to top-level user APIs for all computational operations as well as many other functions in BLIS to all but guarantee that BLIS will self-initialize through the normal use of its functions. - Rewrote and simplified bli_init() and bli_finalize() and related functions. - Added -lpthread to LDFLAGS in common.mk. 
- Modified the bli_init_auto()/_finalize_auto() functions used by the BLAS compatibility layer to take and return no arguments. (The previous API that tracked whether BLIS was initialized, and then only finalized if it was initialized in the same function, was too cute by half and borderline useless because by default BLIS stays initialized when auto-initialized via the compatibility layer.) - Removed static variables that track initialization of the sub-APIs in bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and bli_ind.c. We don't need to track initialization at the sub-API level, especially now that BLIS can self-initialize. - Added a critical section around the changing of the error checking level in bli_error.c. - Deprecated bli_ind_oper_has_avail() as well as all functions bli__ind_get_avail(), where is a level-3 operation name. These functions had no use cases within BLIS and likely none outside of BLIS. - Commented out calls to bli_init() and bli_finalize() in testsuite's main() function, and likewise for standalone test drivers in 'test' directory, so that self-initialization is exercised by default. commit 70a64432ee5a7adbee10fb7ff6d7b608c1940a7a Author: Field G. Van Zee Date: Mon Dec 11 13:14:20 2017 -0600 Fixed off-by-one indexing in bli_cpuid.c. Details: - In bli_cpuid.c, fixed an off-by-one indexing statement in vpu_count() whereby a string-terminating NULL character, '\0', is written beyond the bounds of the model_num string. - Minor whitespace and formatting edits to bli_cpuid.c. commit 87978f6261a080d261d01f9acf4e9cc18855c833 Author: Field G. Van Zee Date: Mon Dec 11 12:49:03 2017 -0600 Fixed broken out-of-tree builds since 52f9e6f. Details: - Added missing $(DIST_PATH)/ prefix to relative path to flatten-headers.sh script in common.mk so that the script could be found during out-of-tree builds. Thanks to Devin Matthews for reporting this bug. commit 513ef4d040f89a18dda5154e8c4cf1aaf7463999 Author: Field G. 
Van Zee Date: Mon Dec 11 12:35:59 2017 -0600 Various typecasting fixes, mis-typed enums, etc. Details: - Fixed implicit typecasting of conj_t to trans_t in bli_[un]packm_cxk.c. - Properly typecast integer arguments to match format specifier in various calls to printf() in bli_l3_thrinfo.c, bli_cntx.c, bli_pool.c, and bli_util_oapi.c. - Fixed "unsigned less-than-comparison with zero" checks in bli_check.c, bli_cntx.h. - Fixed mis-typed enums in bli_cntx.c (e.g., l1mkr_t that should have been l1fkr_t or l1vkr_t). - Fixed instances of opid_t value BLIS_GEMM that should have been l3ukr_t value BLIS_GEMM_UKR in bli_cntx_ref.c. - NOTE: These issues were identified via compiler warnings when building BLIS with clang on a rather old installation of OS X: $ clang --version Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn) Target: x86_64-apple-darwin15.2.0 Thread model: posix commit 3bc99a96a3648f51b9acdc8a8c7e1cf4eb815459 Merge: 3a441183 78199c53 Author: prangana Date: Mon Dec 11 12:53:03 2017 +0530 Fix merge conflicts after rebase with release branch Change-Id: I581b26c6d515f717ff0dce91c7c0c92553aa2630 commit 3a44118398955d6f872e01f73ae5bb4a4f8500f7 Author: Nisanth M P Date: Wed Nov 15 11:11:17 2017 +0530 Added AMD copyright line to the changed files in last 3 commits Change-Id: I37d5dbbbe1b199e07529610a5e9cc9e49d067c66 commit 268a56c06e94d1c388766dbfe81d54efbe432809 Author: Field G. Van Zee Date: Wed Nov 1 11:51:41 2017 -0500 Revert to default SIMD alignment for bulldozer. Details: - Removed the default-overriding #define of BLIS_SIMD_ALIGN_SIZE set in config/bulldozer/bli_kernel.h. Not sure where this value came from, but it would seem to allow for insufficient starting address alignment for any matrices created via bli_malloc_user(), such as via bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that led us to this bug. - This commit is a manual patch of the same fix made to the 'rt' branch in 8f150f2. 
commit 510a6863e28277f9446abfb77f1aea9f01d37e7a Author: Devin Matthews Date: Mon Oct 30 10:04:42 2017 -0500 Fix CVECFLAGS for bulldozer config. commit c669716790bdda5d2b11ea0a026cbc121b228842 Author: Nisanth M P Date: Tue Oct 24 16:36:36 2017 +0530 Adding __attribute__((constructor/destructor)) for CLANG case. CLANG supports __attribute__, but its documentation doesn't mention support for constructor/destructor. Compiling with clang and testing shows that it does support this. Change-Id: Ie115b20634c26bda475cc09c20960d687fb7050b commit 24e64a9d0877d788357fc63d4b947e977f8697f7 Author: Field G. Van Zee Date: Wed Oct 18 13:41:25 2017 -0500 Removed a duplicate bli_avx512_macros.h header. Details: - Removed a duplicate header file that was causing problems during installation for the 'knl' configuration. Thanks to Victor Eijkhout for reporting this issue. commit 9c0a3c4c0260cbfefb9f11532f46508b4fd19ec2 Author: Nisanth M P Date: Mon Oct 16 22:06:57 2017 +0530 Thread Safety: Move bli_init() before and bli_finalize() after main() BLIS provides APIs to initialize and finalize its global context. One application thread can finalize BLIS, while other threads in the application are still using BLIS. This issue can be solved by removing bli_finalize() from the API. One way to do this is by getting bli_finalize() to execute by default after the application exits from main(). GCC supports this behaviour with the help of __attribute__((destructor)) added to the function that needs to be executed after main exits. Similarly bli_init() can be made to run before the application enters main() so that the application need not call it. Change-Id: I7ce6cfa28b384e92c0bdf772f3baea373fd9feac commit 83f31253eb21c5ecd8a5907835e57720daae0b8b Author: Nisanth M P Date: Mon Oct 16 21:07:50 2017 +0530 Thread safety: Make the global induced method status array local to thread BLIS retains a global status array for induced methods, and provides APIs to modify this state during runtime.
So, one application thread can modify the state, before another starts the corresponding BLIS operation. This patch solves this issue by making the induced method status array local to threads. Change-Id: Iff59b6f473771344054c010b4eda51b7aa4317fe commit e923402e68029be379a4297de3ac6fb155ffd928 Author: sthangar Date: Thu Sep 28 12:15:36 2017 +0530 The inner loop parallelization is turned off by default, the JR and IR loop parameters are set to 1 by default Change-Id: I8c3c2ecbbd636259f6ffb92768ec04148205c3e5 commit a64c15de19327c7595376d699be676c7003e850e Author: Field G. Van Zee Date: Tue Sep 26 19:02:53 2017 -0500 Fixed a pthread typo in previous commit. Details: - Misnamed 'pthread_mutex_t' type in bli_memsys.c as 'thread_mutex_t'. commit 42dcd589c37e1a2473ab2e1539207da97aebc07f Author: Field G. Van Zee Date: Tue Sep 26 17:00:04 2017 -0500 Fixed bugs in gemm/gemmtrsm ukr tests in testsuite. Details: - Fixed a bug in gemmtrsm test module that was due to improper partitioning into a k x k triangular matrix for the purposes of obtaining an mr x k micropanel of A with which to test. - Fixed a bug in gemm and gemmtrsm test modules that would only manifest for very large k (depending on the product of mr x kc on that architecture). The bug arose from the fact that the test module was triggering the allocation of blocks from the internal memory pools, which are limited in size. This allocation imposes an implicit assumption that the micro-panel being tested with will fit inside, and this assumption is violated for large values of k. Arbitrarily large k may now be tested for both operation tests. - Added OpenMP/pthread critical sections around the setting or getting of statuses from the induced method operation lookup table in bli_l3_ind.c. - Added the 'static' keyword to all pthread_mutex_t global variables in BLIS. - Thanks to Nisanth Padinharepatt of AMD for reporting the first and third issues. commit 206beb68ff73b75f5c382413967aacbb8a0aac3a Author: Field G.
Van Zee Date: Sat Sep 9 14:10:15 2017 -0500 Updated bibtex info for BLIS5 (3m4m) article. commit 0c8c0363aeb1f4aa88f7ec2d02403dab05a6e014 Author: sthangar Date: Mon Aug 28 16:44:42 2017 +0530 Bug fix for the testsuite build failing Change-Id: I7cd8c9d187387c48b2564e45cbfb8df985e93d77 commit 63d1c84465b50f64787808dd3e8494e683c16821 Author: sthangar Date: Wed Aug 23 13:01:14 2017 +0530 Adding auto hardware detection for Zen Change-Id: I40ce6705dd66b35000c4ccddffad1c5b65998caf commit 537fb2a895b09be94b11947696fd2da629be24dd Author: Devin Matthews Date: Tue Aug 15 10:02:25 2017 -0500 Add vzeroupper to Intel AVX kernels. commit 7628de3f76f78a44788807605a4601ddda445854 Author: Field G. Van Zee Date: Thu Aug 10 16:24:28 2017 -0500 Removed trailing enum commas from bli_type_defs.h. Details: - Removed trailing commas from enums in bli_type_defs.h. Thanks to Erling Andersen for pointing out this inconsistency and suggesting the change. commit a666fd4e267ffae3d4b21f38d569c61ff56adc9e Author: Field G. Van Zee Date: Sat Aug 5 13:04:31 2017 -0500 Added edge handling to _determine_blocksize_b(). Details: - Added explicit handling of situations where i == dim to bli_determine_blocksize_b_sub(). This isn't actually needed by any current use case within BLIS, but handling the situation is nonetheless prudent. Thanks to Minh Quan for reporting this issue and requesting the fix. commit 0c8afa546d7f33760415519ba328d7c49eb7aa06 Author: Field G. Van Zee Date: Fri Aug 4 14:17:44 2017 -0500 Fixed a minor bug in level-3 packm management. Details: - Fixed a bug in bli_l3_packm() that caused cntl_t-cached packed mem_t entries to be released and then re-acquired unnecessarily. (In essence, the "<" operands in the conditional that guards the release-and-reacquire code block simply needed to be swapped.) The bug should have only affected performance (rather than the computed result). Thanks to Minh Quan for identifying and reporting the bug. 
commit 6cf68a185d83fa46d438fcef65258ace78e24b13 Author: Devin Matthews Date: Mon Jul 31 15:19:51 2017 -0500 Change lsame_ signature to match lapacke. commit 6a9bd97295cc4fb1cbcd28f69824a43c073c9a76 Author: Field G. Van Zee Date: Sat Jul 29 20:17:05 2017 -0500 Fixed pthreads compile bug with previous commit. Details: - Erroneously passed family parameter into l3int_t function despite that function not taking the parameter. Oops. commit 95adc43d800431dc0a02ca83a51426dbef641ad6 Author: Field G. Van Zee Date: Sat Jul 29 14:53:39 2017 -0500 Moved 'family' field from cntx_t to cntl_t. Details: - Removed the family field inside the cntx_t struct and re-added it to the cntl_t struct. Updated all accessor functions/macros accordingly, as well as all consumers and intermediaries of the family parameter (such as bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This change was motivated by the desire to keep the context limited, as much as possible, to information about the computing environment. (The family field, by contrast, is a descriptor about the operation being executed.) - Added additional functions to bli_blksz_*() API. - Added additional functions to bli_cntx_*() API. - Minor updates to bli_func.c, bli_mbool.c. - Removed 'obj' from bli_blksz_*() API names. - Removed 'obj' from bli_cntx_*() API names. - Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines that operate only on a single struct to contain the "_node" suffix to differentiate them from those routines that operate on the entire tree. - Added enums for packm and unpackm kernels to bli_type_defs.h. - Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h. They weren't being used and probably never will be. commit a98e4aa547f61ab09dd91d11478c2a2ef9882e11 Author: Devin Matthews Date: Thu Jul 20 14:50:13 2017 -0500 Clang can't make up its mind what to support.
commit 32eb36c3e8c2add2528514272044de16faed0c8f Author: Devin Matthews Date: Thu Jul 20 12:54:58 2017 -0500 Add default #define for __has_extension. commit 2a9aa134f7c29d3d4fdc160022ff257e61885a95 Author: Devin Matthews Date: Thu Jul 20 10:04:34 2017 -0500 Add fallbacks to __sync_* or __c11_atomic_* builtins when __atomic_* is not supported. Fixes #143. commit 6f07a034d575e1e9e30bb6417b8fcb77cf301297 Author: Field G. Van Zee Date: Wed Jul 19 15:40:48 2017 -0500 Updated ar option list used by all configurations. Details: - Dropped 'u' from the list of modifiers passed into the library archiver ar. Previously, "cru" was used, while now we employ only "cr". This change was prompted by a warning observed on Ubuntu 16.04: ar: `u' modifier ignored since `D' is the default (see `U') This caused me to realize that the default mode causes timestamps to be zero, and thus the 'u' option, which causes only changed object files to be inserted, is not applicable. commit 32bc03f9eed8795cfd2f2615d1c9f8673e039c57 Author: Field G. Van Zee Date: Wed Jul 19 13:51:53 2017 -0500 Added --force-version=STRING option to configure. Details: - Added an option to configure that allows the user to force an arbitrary version string at configure-time. The help text also now describes the usage information. - Changed the way the version string is communicated to the Makefile. Previously, it was read into the VERSION variable from the 'version' file via $(shell cat ...). Now, the VERSION variable is instead set in config.mk (via a configure-substituted anchor from config.mk.in). commit befaee6dd8b2a72de9e0461fe2ec1f36e9f88f3c Author: Field G. Van Zee Date: Tue Jul 18 17:56:00 2017 -0500 Updated openmp/pthread barriers with GNU atomics. Details: - Updated the non-tree openmp and pthreads barriers defined in bli_thrcomm_openmp.c and bli_thrcomm_pthreads.c to instead call a common implementation in bli_thrcomm.c, bli_thrcomm_barrier_atomic(). 
This new implementation goes through the same motions as the previous codes, but protects its loads and increments with GNU atomic built-ins. These atomic statements take memory ordering parameters that allow us to specify just enough constraints for the barrier to work as intended on weakly-ordered hardware. The prior implementation was only guaranteed to work on systems with strongly-ordered memory. (Thanks to Devin Matthews for suggesting this change and his crash-course in atomics and memory ordering.) - Removed 'volatile' from structs' barrier field declarations in bli_thrcomm_*.h. - Updated bli_thrcomm_pthread.? files to use renamed struct barrier fields consistent with that of the _openmp.? files. - Updated other bli_thrcomm_* files to rename "communicator" variables to simply "comm". commit 8f739cc847fcff2ddeeb336f8b2b9d080eb16f6c Author: Field G. Van Zee Date: Mon Jul 17 19:03:22 2017 -0500 Added API to set mt environment variables. Details: - Renamed bli_env_get_nway() -> bli_thread_get_env(). - Added bli_thread_set_env() to allow setting environment variables pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS. - Added the following convenience wrapper routines: bli_thread_get_jc_nt() bli_thread_get_ic_nt() bli_thread_get_jr_nt() bli_thread_get_ir_nt() bli_thread_get_num_threads() bli_thread_set_jc_nt() bli_thread_set_ic_nt() bli_thread_set_jr_nt() bli_thread_set_ir_nt() bli_thread_set_num_threads() - Added #include "errno.h" to bli_system.h. - This commit addresses issue #140. - Thanks to Chris Goodyer for inspiring these updates.
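Illustration for befaee6 (barriers with GNU atomics): the idea of protecting a barrier's loads and increments with GNU atomic built-ins can be sketched as a sense-reversing barrier. This is a minimal sketch under stated assumptions, not the body of bli_thrcomm_barrier_atomic(): the names demo_barrier_t and demo_barrier_wait() are hypothetical, and the code assumes a compiler that provides the __atomic_* built-ins (GCC or clang).

```c
#include <assert.h>

/* Hypothetical barrier state; not the BLIS thrcomm_t layout. */
typedef struct
{
    int n_threads;  /* number of threads that must arrive            */
    int arrived;    /* threads that have reached the barrier so far  */
    int sense;      /* flips each time the barrier is released       */
} demo_barrier_t;

static void demo_barrier_wait( demo_barrier_t* b )
{
    assert( b->n_threads > 0 );

    /* Record the sense of the current barrier episode. */
    int my_sense = __atomic_load_n( &b->sense, __ATOMIC_RELAXED );

    /* Acquire-release increment: the last arriver observes the writes
       that every earlier arriver made before reaching the barrier. */
    int n = __atomic_add_fetch( &b->arrived, 1, __ATOMIC_ACQ_REL );

    if ( n == b->n_threads )
    {
        /* Last thread: reset the count first, then release the waiters
           by flipping the sense with release ordering. */
        __atomic_store_n( &b->arrived, 0, __ATOMIC_RELAXED );
        __atomic_fetch_xor( &b->sense, 1, __ATOMIC_RELEASE );
    }
    else
    {
        /* Spin until the last thread flips the sense. The acquire load
           pairs with the release above. */
        while ( __atomic_load_n( &b->sense, __ATOMIC_ACQUIRE ) == my_sense )
            ;
    }
}
```

Each participating thread would call demo_barrier_wait() on the same demo_barrier_t. Because the orderings are stated explicitly, the sketch does not depend on 'volatile' or on strongly-ordered hardware.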
commit 10163833075fd42be5b5b503acc855f91a484cfd Author: Marat Dukhan Date: Thu Jul 13 21:39:24 2017 -0700 Fix Emscripten builds commit c09b30d115eade72f44f37bf90aa848c9c0e79af Author: Minh Quan HO Date: Fri Jul 7 10:52:05 2017 +0200 set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is not set in bli_membrk_init commit 997628ed9793c72e9ef576dd8d715cfec27c4862 Author: sthangar Date: Fri Jun 30 12:23:19 2017 +0530 Reducing the framework overhead of GEMV routines Change-Id: I83607ad767bff74e305e915b54b0ea34ec3e5684 commit ee869066168239b710ad9938bb0e1ae454883f3a Author: Kiran Varaganti Date: Tue Jul 4 12:57:32 2017 +0530 Improved efficiency of dGEMM for large matrices by reducing TLB load misses and majorly L3 cache misses. This is achieved by changing the packed block sizes of matrix A & B. Now the optimum values are MC_D = 510 and KC_D = 1024. Change-Id: I2d8bdd5f62f2d1f8782ae2997f3d7a26587d1ca4 commit 7b933b90b1859c96de49a402d48de82909bc73e5 Author: Devin Matthews Date: Tue Jun 6 20:23:17 2017 -0500 Add new SSI acknowledgment commit 3485abba4b426fbf42b146a9611a0841f6d236c6 Author: sthangar Date: Wed May 24 11:48:16 2017 +0530 Checked in the small matrix code to compute GEMM called with A transpose case Change-Id: I29f40046d43d7a4b037c1cb322503ee26495f462 commit de16beb83b29b4b9748f70db985b0fe04db85f7d Author: Devin Matthews Date: Fri May 26 14:49:31 2017 -0400 PACKDIM_MR=8 didn't work out, but messing with the prefetching helps 2%. commit 25d0e618544b6eea7d3f13c7aec513ac0139801d Author: Devin Matthews Date: Fri May 26 14:47:36 2017 -0400 Revert "Change PACKDIM_MR (double) for haswell to 8." This reverts commit 681eec913d7c2ebcff637cec5c1627ced9a92b99. commit c5bdd84b35bc2a8ebf55b7763fb56c0c945be0cb Author: Devin Matthews Date: Fri May 26 12:28:09 2017 -0500 Change PACKDIM_MR (double) for haswell to 8. commit 172789d562001293b973bbdd8015bd27d37292e8 Author: Field G. 
Van Zee Date: Wed May 17 13:03:52 2017 -0500 Restored deleted lines from makefile fragments. commit 3ea9bd2c8e90dbd35655fa6a5b953dfea1f308fe Author: Devin Matthews Date: Wed May 17 12:29:44 2017 -0500 Change to /bin/sh. All scripts checked with Debian's checkbashisms. Also check for clang first in auto-detect.sh. commit 49438409eedb98d3f0ebf00b8d1eee0ae45f4f8c Author: Devin Matthews Date: Wed May 17 12:27:14 2017 -0500 Remove shebangs from makefiles. commit 497e2640474c016d576dce3530fa6a66891642a0 Author: J M Dieterich Date: Tue May 16 23:11:22 2017 -0400 Fix if/else structure. Thanks to TravisCI. commit 835035c56a8de36ad25bb8d1375db170d489ef57 Author: J M Dieterich Date: Tue May 16 22:23:27 2017 -0400 Mark piledriver compilable w/ clang. commit 6cdb533472ee61af297c1f948307abbf45828887 Author: J M Dieterich Date: Tue May 16 22:12:12 2017 -0400 Mark bulldozer compilable w/ clang. commit a85697d62272da06d28cd1c947f6cf1098df6467 Author: J M Dieterich Date: Tue May 16 22:06:59 2017 -0400 Correct error message. commit e0c64cad271058688a2b999caf8c2767dc3aef7e Author: J M Dieterich Date: Tue May 16 22:03:23 2017 -0400 Indeed one can compile for carrizo also using clang. commit 4aafe0505d3f0954d095ded5459a76976e5093b4 Author: J M Dieterich Date: Tue May 16 21:50:49 2017 -0400 A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash commit abaeaa68ea11e84be1810f564d6f38d506cbeb6a Author: Field G. Van Zee Date: Fri May 5 15:06:56 2017 -0500 Fixed a bug in norm1v, norm1m. Details: - Fixed a bug that manifested as improperly-computed 1-norm for vectors and matrices. This is one of the few operations in BLIS that does not have its own test module within the testsuite, hence why it went undetected for so long. The bad 1-norms were being used to normalize matrices in the testsuite after initialization, which led to some matrices containing a combination of "large" and "small" values.
This tended to push the residuals computed after each test away from zero. In some cases, they were off *just* enough for the testsuite to label it a "failure". Many thanks to Jeff Hammond for reporting this bug. (Wonky details: the bug was due to improperly-defined level-0 scalar macros for abval2, an operation that computes the absolute square, or complex magnitude/modulus. Certain complex domain instances of abval2 were being incorrectly defined in terms of real-only solutions, leading to bad results. This level-0 operation forms the basis of norm1v/norm1m. absq2 was also affected, but almost nothing uses this operation.) commit cc3107ae1c2074f72b724aa748d2e5b4cb290ed5 Author: Devin Matthews Date: Thu May 4 10:35:22 2017 -0500 Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123. commit c8ab91f70d399ee14edd30a3a5c46b24c5d2f910 Author: Field G. Van Zee Date: Wed May 3 15:04:51 2017 -0500 Disable complex 3m/4m in testsuite by default. Details: - Disabled testsuite tests of all level-3 implementations based on 3m and 4m. This will improve testing runtime on Travis CI as well as for anyone manually running the testsuite using default test parameters. Thanks to Devin Matthews for suggesting this change. commit 9700f0e5785007ddafb72a5ca83800dee61fd35c Author: Jeff Hammond Date: Tue May 2 19:25:21 2017 -0700 allow KNL build without hbwmalloc.h (i.e. emulated) we want to be able to run BLIS KNL binaries on non-KNL machines via SDE. although it is possible to install hbwmalloc implementation on such systems, it is easier not to, since obviously the performance of SDE execution is not representative so there is no reason to emulate HBW allocation. commit 17dcd5a33ff91967f67e7c0ba09b4f18754609a4 Author: Field G. Van Zee Date: Tue May 2 16:48:43 2017 -0500 Fixed stray parentheses in README citations. commit 2910d44ff9e1d951d3249313f4ab39d18ea1b48d Author: Field G.
Van Zee Date: Tue May 2 16:38:43 2017 -0500 CHANGELOG update (0.2.2) commit 5ca3863220e07972fcefc6682ddd3f6e54fe4a94 Author: Field G. Van Zee Date: Tue May 2 15:48:30 2017 -0500 Fixed a trsm1m bug that affected right-side cases. Details: - Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result was nondeterministic behavior (usually segmentation faults) for certain problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c which explicitly directed the virtual gemm micro-kernel to use temporary space if the storage preference of the [real domain] gemm ukernel did not match the storage of the output matrix C. In the context of gemm, this handling is not needed because agreement between the storage pref and the matrix is guaranteed by a high-level optimization in BLIS. However, this optimization is not applied to trsm because the storage of C is not necessarily the same as the storage of the micro-panels of B--both of which are updated by the micro-kernel during a trsm operation. Thus, the guarantee of storage/preference agreement is not in place for trsm, which means we must handle that case within the virtual gemm micro-kernel. - Comment updates and a minor macro change to bli_trsm*_cntx_init() for 3m1, 4m1a, and 1m. commit 1af0b09f5c275ee7bac896cc6f36f42af721d9b5 Author: Field G. Van Zee Date: Tue May 2 12:09:39 2017 -0500 README.md update. Details: - Updated bibtex entries for 4th BLIS paper, and adds entries for 5th and 6th BLIS papers. commit db4a0bb8ba7cd697d68be8e5632371ee3e59fd63 Author: Field G. Van Zee Date: Fri Mar 17 12:07:27 2017 -0500 Whitespace reformatting to armv8a kernels file. Details: - Updated formatting of function signature/header in kernels/armv8a/3/bli_gemm_opt_4x4.c. commit e3eb01f6b990e205b15edcbaffd3d54b3ddd1ca4 Author: Field G. Van Zee Date: Tue Feb 21 15:33:39 2017 -0600 Disabled experiment-related 1m code. 
Details: - Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was specifically inserted to facilitate the benchmarking of 1m block-panel and panel-block algorithms. - Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to reflect changes used/needed during benchmarking. commit 4f61528d56eed6a139eeac9db0c44e56f2d2d136 Author: Field G. Van Zee Date: Wed Jan 25 16:25:46 2017 -0600 Added 1m-specific APIs for bp, pb gemm algorithms. Details: - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the body of bli_gemm_cntl_create() replaced with a call to the former. - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now, bli_cntl_free() can check if the thread parameter is NULL, and if so, call the latter, and otherwise call the former. - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in terms of bli_gemm1mxx_cntx_init(), which behaves the same as bli_gemm1m_cntx_init() did before, except that an extra bool parameter (is_pb) is used to support both bp and pb algorithms (including to support the anti-preference field described below). - Added support for "anti-preference" in context. The anti_pref field, when true, will toggle the boolean return value of routines such as bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of causing BLIS to transpose the operation to achieve disagreement (rather than agreement) between the storage of C and the micro-kernel output preference. This disagreement is needed for panel-block implementations, since they induce a transposition of the suboperation immediately before the macro-kernel is called, which changes the apparent storage of C. For now, anti-preference is used only with the pb algorithm for 1m (and not with any other non-1m implementation). 
- Defined new functions, bli_cntx_l3_ukr_eff_prefers_storage_of() bli_cntx_l3_ukr_eff_dislikes_storage_of() bli_cntx_l3_nat_ukr_eff_prefers_storage_of() bli_cntx_l3_nat_ukr_eff_dislikes_storage_of() which are identical to their non-"eff" (effectively) counterparts except that they take the anti-preference field of the context into account. - Explicitly initialize the anti-pref field to FALSE in bli_gks_cntx_set_l3_nat_ukr_prefs(). - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel in terms of the existing block-panel macro-kernel _ker_var2(). This technique requires inducing transposes on all operands and swapping the A and B. - Changed bli_obj_induce_trans() macro so that pack-related fields are also changed to reflect the induced transposition. - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily specify the 1m algorithm (block-panel or panel-block). - Renamed the following cntx_t-related macros: bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block() bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel() bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel() and updated all instantiations. Also updated the field names in the cntx_t struct. - Comment updates. commit 1d728ccb2394e77365e7c42683db6579c5fba014 Author: Field G. Van Zee Date: Fri Nov 25 18:29:49 2016 -0600 Implemented the 1m method. Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. 
This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (ie: "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. - Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver. 
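Illustration for 1d728cc (the 1m method): the commit names the 1e and 1r packing schemas but does not spell out the algebra. The sketch below shows the identity as 1m is commonly described: each element of A is expanded to a 2x2 real block (1e-style), each element of B is split into stacked real and imaginary parts (1r-style), and one real matrix multiply then yields the real and imaginary parts of C. All names (demo_dcomplex, demo_real_gemm(), demo_1m_gemm()) are hypothetical, the layout is plain row-major rather than BLIS's packed micro-panels, and this is an assumption-laden sketch rather than the BLIS implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical complex type for this sketch only. */
typedef struct { double real; double imag; } demo_dcomplex;

/* Plain real multiply, row-major: C (m x n) = A (m x k) * B (k x n). */
static void demo_real_gemm( int m, int n, int k,
                            const double* a, const double* b, double* c )
{
    for ( int i = 0; i < m; ++i )
        for ( int j = 0; j < n; ++j )
        {
            double sum = 0.0;
            for ( int p = 0; p < k; ++p )
                sum += a[ i*k + p ] * b[ p*n + j ];
            c[ i*n + j ] = sum;
        }
}

/* Complex C (m x n) = A (m x k) * B (k x n), row-major, computed with a
   single real multiply of (2m x 2k) by (2k x n). Returns 0 on success. */
static int demo_1m_gemm( int m, int n, int k, const demo_dcomplex* a,
                         const demo_dcomplex* b, demo_dcomplex* c )
{
    assert( m > 0 && n > 0 && k > 0 );

    double* ae = malloc( sizeof( double ) * (size_t)( 2*m ) * (size_t)( 2*k ) );
    double* br = malloc( sizeof( double ) * (size_t)( 2*k ) * (size_t)n );
    double* cr = malloc( sizeof( double ) * (size_t)( 2*m ) * (size_t)n );
    if ( !ae || !br || !cr ) { free( ae ); free( br ); free( cr ); return -1; }

    /* 1e-style expansion: a becomes the block [ ar -ai ; ai ar ]. */
    for ( int i = 0; i < m; ++i )
        for ( int p = 0; p < k; ++p )
        {
            double ar = a[ i*k + p ].real, ai = a[ i*k + p ].imag;
            ae[ ( 2*i     )*( 2*k ) + 2*p     ] =  ar;
            ae[ ( 2*i     )*( 2*k ) + 2*p + 1 ] = -ai;
            ae[ ( 2*i + 1 )*( 2*k ) + 2*p     ] =  ai;
            ae[ ( 2*i + 1 )*( 2*k ) + 2*p + 1 ] =  ar;
        }

    /* 1r-style expansion: b becomes the stacked pair [ br ; bi ]. */
    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < n; ++j )
        {
            br[ ( 2*p     )*n + j ] = b[ p*n + j ].real;
            br[ ( 2*p + 1 )*n + j ] = b[ p*n + j ].imag;
        }

    demo_real_gemm( 2*m, n, 2*k, ae, br, cr );

    /* Rows 2i and 2i+1 of the real product hold Re and Im of row i of C. */
    for ( int i = 0; i < m; ++i )
        for ( int j = 0; j < n; ++j )
        {
            c[ i*n + j ].real = cr[ ( 2*i     )*n + j ];
            c[ i*n + j ].imag = cr[ ( 2*i + 1 )*n + j ];
        }

    free( ae ); free( br ); free( cr );
    return 0;
}
```

The point of the identity is that only a real-domain multiply is needed, which is why the commit can make 1m the default when native complex gemm microkernels are absent.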
commit 0d1b90286e29aa8b768e280b5286d92c02ad87a1 Author: Jeff Hammond Date: Tue Oct 25 21:15:26 2016 -0700 never use libm with Intel compilers Intel compilers include a highly optimized math library (libimf) that should be used instead of GNU libm. yes, this change is for ALL targets, including those that are not supported by the Intel compiler. there is no harm in doing this, and it is future-proof in the event that the Intel compilers support other architectures. commit b150870397e7aee558e61d1bd72a0c0d1d99bee8 Author: Field G. Van Zee Date: Fri Dec 8 16:08:41 2017 -0600 Removed most "old" directories. Details: - Removed the vast majority of directories named "old", which contained deprecated code that I wasn't quite ready to jettison from the source tree. commit 270c65985df849297ba1951aa3b56c03948d7775 Author: Field G. Van Zee Date: Fri Dec 8 15:21:18 2017 -0600 Modified bli_getopt() for thread-safety. Details: - Changed the interface of bli_getopt() to take a new argument, a getopt_t struct, that stores the values of optarg, optind, opterr, and optopt, and updated the implementation accordingly. (Previously, these variables were assumed to be global.) - Added a function for initializing a getopt_t struct. - Changed test_libblis.c--currently the only consumer of bli_getopt()--to utilize the new getopt_t state object. commit ce4d8fabc2e39371f89c12192fb707be82ae021a Merge: 39be59f2 e05a8dfa Author: Field G. Van Zee Date: Thu Dec 7 17:36:44 2017 -0600 Merge branch 'master' of github.com:flame/blis commit 39be59f2a8470f40475907d9dd52639b8a911a92 Author: Field G. Van Zee Date: Thu Dec 7 17:35:20 2017 -0600 Replaced several macros with static function APIs. Details: - Reimplemented several sets of get/set-style preprocessor macros with static functions, including those in the following frame/base headers: auxinfo, cntl, mbool, mem, membrk, opid, and pool. A few headers in frame/thread were touched as well: mutex_*, thrcomm, and thrinfo. 
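Illustration for 39be59f (macros replaced with static function APIs): the difference between a get/set-style preprocessor macro and an equivalent static function can be sketched as follows. The struct and accessor names (demo_pool_t, demo_pool_block_size(), and so on) are hypothetical and are not the BLIS pool API, and the use of 'static inline' is an assumption about how such accessors might be declared in a header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct standing in for a framework object. */
typedef struct
{
    size_t block_size;
    size_t num_blocks;
} demo_pool_t;

/* Macro style: the argument is pasted textually and is not type-checked,
   so any expression with a block_size member is accepted. */
#define demo_pool_block_size_m( pool )  ( (pool)->block_size )

/* Static function style: the compiler checks the argument and return
   types, and a debugger can step into the accessor. */
static inline size_t demo_pool_block_size( const demo_pool_t* pool )
{
    assert( pool != NULL );
    return pool->block_size;
}

static inline void demo_pool_set_block_size( size_t size, demo_pool_t* pool )
{
    assert( pool != NULL );
    pool->block_size = size;
}
```

Both forms return the same value for a valid demo_pool_t; the function form additionally rejects a pointer of the wrong type at compile time.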
commit e05a8dfa7cc7df41e966c1ad04e51c482b308b23 Merge: 79507337 4423e33d Author: dnp Date: Wed Dec 6 16:45:24 2017 -0600 Merge branch 'rt' commit 4423e33dc593115cda92c5763d756d7ad1298aa9 Author: dnp Date: Wed Dec 6 16:35:03 2017 -0600 Adding SKX kernels and configuration. commit 79507337e140daec7639f6eb3ed9cfe6e123d342 Author: Field G. Van Zee Date: Wed Dec 6 16:21:35 2017 -0600 Various checks to ensure that arch_t id is in range. Details: - Expanded checking of the arch_t id in bli_gks.c--either passed in from the caller or as returned from bli_arch_query_id()--against the expected range of id values. Thanks to Devangi Parikh for suggesting these additional sanity checks. commit fde7c1126c58373ecde83471890b257399144876 Author: Field G. Van Zee Date: Mon Dec 4 16:11:01 2017 -0600 Added 'uninstall-old-headers' target to Makefile. Details: - Defined a new 'uninstall-old-headers' target that allows users of BLIS to uninstall no-longer-needed headers left over from previous installations. - Fixed the 'uninstall-old' target so that it will uninstall both .a and .so libraries. - Renamed 'uninstall-old' to 'uninstall-old-libs'. - Added 'uninstall-old' target (different from previous 'uninstall-old' target) that combines 'uninstall-old-libs' and 'uninstall-old-headers'. commit d4ee770bde213a87aa6049245145318324dc6b51 Author: Field G. Van Zee Date: Mon Dec 4 14:53:43 2017 -0600 Create/install monolithic cblas.h. Details: - When CBLAS is enabled at configure-time, BLIS now creates a monolithic cblas.h using the same flatten-header.sh script that was recently introduced for creating monolithic blis.h header files. The top-level Makefile will also install this cblas.h file into the install prefix alongside blis.h when the 'install' target is invoked. The two header files are compatible with one another. Regardless of whether the user's source #includes cblas.h, both blis.h and cblas.h, or just blis.h, the user will get the CBLAS function prototypes and enums, as expected.
commit 52f9e6f1b6468785af8947317656445d4729fc8b Merge: ab57b979 21360dd8 Author: Field G. Van Zee Date: Fri Dec 1 12:28:09 2017 -0600 Merge branch 'rt' commit 21360dd8e2c7287100645e109acaabcc6ba1140c Author: Field G. Van Zee Date: Wed Nov 29 14:11:34 2017 -0600 Fixed cntx_t packm query when ker_id > _NUM_PACKM_KERS. Details: - Fixed a subtle bug in bli_cntx_get_[un]packm_ker_dt() in which the function fails to return NULL when passed a kernel id argument that is equal to or beyond BLIS_NUM_[UN]PACKM_KERS. Instead, the function was attempting to index into the cntx_t's packm kernel array, which resulted in undefined behavior. Thanks to Devangi Parikh for finding this bug. commit 244a6f4e66e8ff091e995f8090ce779c1928aa8b Author: Field G. Van Zee Date: Tue Nov 28 17:48:48 2017 -0600 Fixed POSIX sed non-compliance in flatten-header.sh. Details: - Changed GNU usage of 'i' and 'a' sed commands used in flatten-header.sh to POSIX-compliant usage that will work on OS X's sed. commit 45078621676833e53a2878af8f89479c4f93b8ab Author: Field G. Van Zee Date: Tue Nov 28 15:16:22 2017 -0600 Generate/compile with/install monolithic blis.h. Details: - Rewrote monolithify-header.sh (and renamed to flatten-header.sh) so that headers are inserted recursively. This improves performance by a factor of 3-4x. - Modified configure to create an 'include/' directory in which make can create a monolithic header. - Modified the top-level Makefile so that a monolithic header is generated unconditionally prior to compilation (stored in include/) and so that the single header is installed instead of the 450 or so header files that reside throughout the framework source tree. - Added "include/*/*.h" to .gitignore file. - Removed some pnacl/emscripten leftovers that I intended to include in a1caeba (mostly in testsuite/Makefile). - Trivial comment changes to frame/include/bli_f2c.h. commit 1f30b1301bf6d6047ec29e57a5fde8eb1072a0ee Author: Field G.
Van Zee Date: Sat Nov 25 16:54:26 2017 -0600 Added missing framework support for x86_64 family. Details: - Added support for the x86_64 configuration family to bli_arch.c and bli_arch_config.h. Thanks to Johannes Dieterich for reporting this issue. - Bumped the default value for BLIS_SIMD_NUM_REGISTERS from 16 to 32 and the default value for BLIS_SIMD_SIZE from 32 to 64. This will support configuration families that include Skylake and newer processors without any support needed in the bli_family_*.h file. The semantics of these values have always been "maximum" and not exact values; comments in bli_kernel_macro_defs.h and the github wiki have been adjusted accordingly. commit 9f39806c4ed484c9ed13edf96005838d977722a9 Author: Field G. Van Zee Date: Tue Nov 21 16:03:56 2017 -0600 Fixed a bug in e31f0b3/b131b9a. Details: - Erroneously placed the "don't overwrite existing blocksize" logic in bli_blksz_init*() rather than in bli_cntx_set_blkszs(). It belongs in the latter because that function copies blocksizes as-is from the blksz_t function argument to the appropriate field in the cntx_t. If the blksz_t was previously initialized selectively, based on the sign of the blocksize value passed into bli_blksz_init*(), that just leaves some fields possibly uninitialized (with garbage values), which definitely will not work. - The aforementioned logic has been moved to bli_cntx_set_blkszs() via a new function bli_blksz_copy_if_pos(), which selectively copies only the blocksizes that are greater than zero. commit b131b9a025c15f548d4c2952a9ec85eee3d139b1 Author: Field G. Van Zee Date: Tue Nov 21 14:30:26 2017 -0600 Updated configs to omit setting some blocksizes.
Details: - Employ the new semantics of bli_blksz_init*() in e31f0b3 in various sub-configurations' bli_cntx_init_*() functions by passing in 0 for register and cache blocksizes that correspond to gemm microkernel datatypes that were not registered, allowing the default values set by the bli_cntx_init_*_ref() function call to remain. commit 499a4c002f895744ecaf81ef7f62d2d6d0d7d594 Merge: e31f0b3e 6c3ba502 Author: Field G. Van Zee Date: Tue Nov 21 14:25:08 2017 -0600 Merge branch 'rt' of github.com:flame/blis into rt commit e31f0b3e2dba19ca8a2946bc21beb136a42d0f57 Author: Field G. Van Zee Date: Tue Nov 21 14:21:25 2017 -0600 Subtle update to bli_blksz_init*() API. Details: - Updated the semantics of bli_blksz_init() and bli_blksz_init_ed() so that non-positive blocksize values are ignored entirely. This provides an easy way to indicate that certain existing values should not be touched by the update. Thanks to Devangi Parikh for feedback that led to these changes. commit 6c3ba502a11f87bc67555d26154cfd39d0af1bac Author: Field G. Van Zee Date: Tue Nov 21 13:50:53 2017 -0600 Added 'x86_64' sub-config directory. Details: - Added missing x86_64 configuration directory, which was intended to be part of b7ca580. - Added -Wfatal-errors compiler warning flag to all configurations so that compilation stops after the first error. - Changed the vectorization flags for intel64 configuration to be compatible with 'penryn', the oldest sub-config included in that family. - Changed the vectorization flags for penryn to target the 'core2' microarchitecture and ssse3. commit 25eee3cc49b0631812485d4d5ceef0c23ed1b6dd Author: Field G. Van Zee Date: Tue Nov 21 12:34:20 2017 -0600 Added a dummy file to kernels/generic. Details: - Added a dummy file to kernels/generic, which was previously empty, so that git would begin tracking the otherwise-empty directory.
This directory's existence is necessary for proper execution of configure for any configuration family that contains the 'generic' sub-configuration. Thanks to Johannes Dieterich for reporting the issue that led to this fix. commit ef024ce4cafa217669eaabb31ff8ab6df93cca05 Author: Field G. Van Zee Date: Mon Nov 20 18:08:29 2017 -0600 More tweaks to monolithify-header.sh Details: - Further fixes to the monolithify-header.sh script. - Removed unnecessary #include "blis.h" from frame/3/bli_l3_packm.h. commit 5028e7dec269b62895511453272585da36e591b5 Author: Field G. Van Zee Date: Mon Nov 20 17:00:37 2017 -0600 Second attempt to implement travis_wait. Details: - Corrected accidental misplacement of the travis_wait prefix (on the wrong line of the .travis.yml file) in commit 13e5d91. commit 13e5d9107b3763cba46fb1bae87476852601b47c Author: Field G. Van Zee Date: Mon Nov 20 15:57:06 2017 -0600 Added travis_wait prefix to testsuite via Travis. Details: - It appears that Travis CI has implemented a new policy that results in a test failing if it does not produce any output for more than 10 minutes. (Two test instances are now failing in Travis despite the most recent commit not affecting the library or testsuite.) This issue can be worked around by executing the test run via travis_wait, which takes an optional time parameter. This commit attempts to use 'travis_wait 30' in the .travis.yml file to prevent the early failure at 10 minutes. commit a1caeba0ea79c8fecb1abadca1f91c6367ab3afb Author: Field G. Van Zee Date: Mon Nov 20 13:31:20 2017 -0600 Removed pnacl, emscripten support from Makefile. commit 78199c539beaa50f37893add220261ce0dcb921a Merge: b3d8ab2e ab57b979 Author: praveeng Date: Mon Nov 20 15:51:20 2017 +0530 Merge master code till 01-Nov-2017 to amd-staging Change-Id: I40b53f876db84c8b947b3f2385c9b882245c6603 commit 9df6dda9ec51a0d40166169d2d8a2f84b42266e6 Author: Field G. Van Zee Date: Sat Nov 18 19:03:26 2017 -0600 Improvements, bugfixes to monolithify-header.sh.
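Editor's illustration (not part of the commit history): several commits in this stretch concern monolithify-header.sh (later renamed flatten-header.sh), which recursively replaces #include directives with the contents of the referenced headers. The actual tool is a bash script that reads files. The sketch below only shows the recursive idea in C, using a hypothetical in-memory table of headers (toy_headers) and hypothetical function names. It assumes there are no include cycles and that the output buffer is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "header tree"; a real tool would read files. */
typedef struct { const char* name; const char* text; } toy_header_t;

static const toy_header_t toy_headers[] =
{
    { "top.h", "#include \"a.h\"\nint top;\n" },
    { "a.h",   "#include \"b.h\"\nint a;\n"   },
    { "b.h",   "int b;\n"                      }
};

/* Return the text of the header whose name is the first len bytes of
   name, or NULL if no such header is known. */
static const char* toy_find( const char* name, size_t len )
{
    size_t n = sizeof( toy_headers ) / sizeof( toy_headers[ 0 ] );
    for ( size_t i = 0; i < n; ++i )
        if ( strlen( toy_headers[ i ].name ) == len &&
             memcmp( toy_headers[ i ].name, name, len ) == 0 )
            return toy_headers[ i ].text;
    return NULL;
}

/* Append text to out (a NUL-terminated buffer of capacity cap), replacing
   each known #include "..." line with the flattened header contents.
   Unknown headers are copied through unchanged. */
static void toy_flatten( const char* text, char* out, size_t cap )
{
    static const char prefix[] = "#include \"";
    const size_t plen = sizeof( prefix ) - 1;

    while ( *text != '\0' )
    {
        const char* eol  = strchr( text, '\n' );
        size_t      len  = eol ? ( size_t )( eol - text ) + 1 : strlen( text );
        const char* body = NULL;

        if ( len > plen && strncmp( text, prefix, plen ) == 0 )
        {
            const char* name = text + plen;
            const char* endq = memchr( name, '"', len - plen );
            if ( endq ) body = toy_find( name, ( size_t )( endq - name ) );
        }

        if ( body ) toy_flatten( body, out, cap );  /* recurse into header */
        else
        {
            assert( strlen( out ) + len < cap );    /* sketch: no resizing */
            strncat( out, text, len );
        }
        text += len;
    }
}
```

Flattening a line that includes "top.h" would, under these assumptions, emit the contents of b.h, then a.h, then top.h, in that order.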
commit 21d26201f90b884eb8d5de279ed74bbd244ffcb5 Merge: 43baa3b3 b7ca5806 Author: Field G. Van Zee Date: Sat Nov 18 14:16:53 2017 -0600 Merge branch 'rt' of github.com:flame/blis into rt commit 43baa3b327d5ae1e2ba619432687b4dd849b05e3 Author: Field G. Van Zee Date: Sat Nov 18 14:14:44 2017 -0600 Removed unnecessary flags for generic config. Details: - Removed -D_POSIX_C_SOURCE=200112L and -m64 flags from make_defs.mk file of generic sub-configuration. These flags are generally not necessary, and particularly not desirable for the generic configuration since they unnecessarily restrict the environments in which the configuration can be built. commit b7ca580618f9382b7982168fd035ed058f83e4c2 Author: iotamudelta Date: Sat Nov 18 14:56:05 2017 -0500 [WIP] Add x86 and x86_64 processor families. (#154) * Add x86 and x86_64 processor families. * Use generic config as fallback for more families. After discussion with fgvanzee, a) it's "generic" and b) use it for all the families as a fallback. Goal is that if a specific CPU is not yet supported by a family (say a new Intel microarchitecture on x86_64), it'll fall through to still work with the slower "generic" kernels. commit 870597d1663aaba1b74d7654b1d4946280aa0d3f Author: Field G. Van Zee Date: Fri Nov 17 17:06:42 2017 -0600 Added bash script for creating monolithic headers. Details: - Added a new script, monolithify-header.sh, to the 'build' directory. This script recursively replaces all #include directives in a selected file with the contents of the header files referenced by each directive. The idea is to "flatten" a tree of .h files into a single file, with the script acting as a C preprocessor that only processes #include directives. commit c76f77f4cc1e71988251c5e63cf6ef137477bf9c Author: Field G. Van Zee Date: Fri Nov 17 15:10:52 2017 -0600 Removed unnecessary #include "blis.h" from header. Details: - Removed an errant #include "blis.h" directive from bli_cntx_ind_stage.h.
The general policy is that no header file in BLIS should include blis.h. This will be important in the near future when using a tool to recursively create a monolithic blis.h file from its constituent headers. commit 2bb9bc6e9536fa239fbc19a7efaaf151116e15b4 Author: Field G. Van Zee Date: Fri Nov 17 13:50:14 2017 -0600 Miscellaneous tweaks to gks, rt functionality. Details: - Updated bli_cpuid_query_id() so that BLIS_ARCH_GENERIC is always returned if the hardware fails to test positive for any supported sub-configuration. - Defined bli_gks_init_ref_cntx(), which will call the context initialization function bli_cntx_init_configname() for the sub-configuration 'configname' associated with the arch_t id returned by bli_arch_query_id(). This makes initializing a reference context easy for experts who wish to construct those contexts. commit b3d8ab2ea02c127ab241532abc214624f35bfaab Merge: 189ffbb0 fe71c06e Author: Santanu Thangaraj Date: Wed Nov 15 01:33:12 2017 -0500 Merge "Added AMD copyright line to the changed files in last 3 commits" into amd-staging commit fe71c06e42b072407c83112779055b0afb67173d Author: Nisanth M P Date: Wed Nov 15 11:11:17 2017 +0530 Added AMD copyright line to the changed files in last 3 commits Change-Id: I37d5dbbbe1b199e07529610a5e9cc9e49d067c66 commit d5bf79e50bf97072bbe7117c86b7c45e6e707ea0 Author: Field G. Van Zee Date: Mon Nov 13 14:24:29 2017 -0600 Miscellaneous tweaks and fixes. Details: - Fixed incorrect calling sequence in bli_cntx_init_knl.c--an instance of bli_blksz_init_easy() that should have been bli_blksz_init(). - Fixed a bug in code that is supposed to output the list of sub-directories in the 'config' directory when configure script is run with no arguments. - Expanded the output of "make showconfig" to include more info from config.mk. - Minor changes to build/auto-detect/cpuid_x86.c, mostly in preparation for someone to add excavator and zen support. - Added a link to the ConfigurationHowTo wiki to config_registry.
- Other minor tweaks to configure. commit 673e5184030532c4ebd9fdeecbaa6442bb3ad54f Merge: 2c51356a 8f150f28 Author: Field G. Van Zee Date: Wed Nov 1 17:37:42 2017 -0500 Merge branch 'rt' of github.com:flame/blis into rt commit 2c51356a8b2699c99f9507c80d69c08a35d45fe3 Author: Field G. Van Zee Date: Wed Nov 1 17:37:02 2017 -0500 Implemented runtime hardware detection via cpuid. Details: - Added runtime support for selecting an appropriate arch_t value based on the results of the cpuid instruction (for x86_64). This allows deferral of choosing a context (kernels, blocksizes, etc.) until runtime, which allows BLIS to be built with support for multiple microarchitectures. Currently, only amd64 and intel64 configurations are registered in the config_registry; however, one could create custom configuration families to support arbitrary sets of x86_64 microarchitectures. - Current Intel microarchitectures supported via cpuid are knl, haswell, sandybridge, and penryn. - Current AMD microarchitectures supported via cpuid are: zen, excavator, steamroller, piledriver, and bulldozer. commit ab57b979046479bcda7f83165838a80117c2ad95 Author: Field G. Van Zee Date: Wed Nov 1 11:51:41 2017 -0500 Revert to default SIMD alignment for bulldozer. Details: - Removed the default-overriding #define of BLIS_SIMD_ALIGN_SIZE set in config/bulldozer/bli_kernel.h. Not sure where this value came from, but it would seem to allow for insufficient starting address alignment for any matrices created via bli_malloc_user(), such as via bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that led us to this bug. - This commit is a manual patch of the same fix made to the 'rt' branch in 8f150f2. commit 8f150f28a678c4a0c1591400177ad7cca81fcaec Author: Field G. Van Zee Date: Wed Nov 1 11:41:45 2017 -0500 Revert to default SIMD alignment for bulldozer. Details: - Removed the default-overriding #define of BLIS_SIMD_ALIGN_SIZE set in bli_family_bulldozer.h. 
Not sure where this value came from, but it would seem to allow for insufficient starting address alignment for any matrices created via bli_malloc_user(), such as via bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that led us to this bug. commit e3f10557caf114441fbfff990e3ce3576c177bdc Author: Field G. Van Zee Date: Mon Oct 30 13:37:54 2017 -0500 Use perl for some substitution for OS X compatibility. Details: - Discovered that sed commands where the replacement string contains '\n' are problematic with the version of sed present in OS X. For these cases in the configure script, we instead use 'perl -pe' for search-and-replace functionality. - Various other minor comment/whitespace tweaks to configure. - Removed remaining lines of code related to setting/checking variables to track "unregistered" configurations. commit dd45cfdfc3d8f9acf4cf7f69138d9b83dafc8842 Merge: 3e4f42a4 f60c827b Author: Field G. Van Zee Date: Mon Oct 30 12:23:05 2017 -0500 Merge branch 'master' into rt commit f60c827ba95f452c8454fb914f5564f4895bf644 Author: Devin Matthews Date: Mon Oct 30 10:04:42 2017 -0500 Fix CVECFLAGS for bulldozer config. commit 3e4f42a4d2ebb37b95988933d92e561c5b2cc201 Author: Field G. Van Zee Date: Fri Oct 27 11:41:37 2017 -0500 Typecast l1mkr_t enum value prior to comparison. Details: - Typecast l1mkr_t enum value in bli_cntx.h to guint_t before testing for out-of-range value. This is an attempt to pacify a strange warning from clang on OS X that is seemingly the result of the following compiler warning flag: -Wtautological-constant-out-of-range-compare commit aec6e038d942d35b81bbd723a640cce2c054fb8e Author: Field G. Van Zee Date: Thu Oct 26 16:12:36 2017 -0500 Removed associative arrays from configure. Details: - Implemented a replacement for associative arrays in the configure script that does not utilize arrays, and therefore works in pre-4.0 versions of bash.
(It appears that Mac OS X will be stuck with version 3.2 indefinitely due to bash switching to the GPL 3.0 license starting with version 4.0.) commit 189ffbb0d37262b21acddc0d35b4a22f2cbbca94 Merge: 06e0e635 3eb44f67 Author: Santanu Thangaraj Date: Wed Oct 25 02:00:30 2017 -0400 Merge changes Ie115b206,I7ce6cfa2,Iff59b6f4 into amd-staging * changes: Adding __attribute__((constructor/destructor)) for CLANG case. Thread Safety: Move bli_init() before and bli_finalize() after main() Thread safety: Make the global induced method status array local to thread commit 3eb44f67618b91ae5f5f0aaaba67e38f16042ee4 Author: Nisanth M P Date: Tue Oct 24 16:36:36 2017 +0530 Adding __attribute__((constructor/destructor)) for CLANG case. CLANG supports __attribute__, but its documentation doesn't mention support for constructor/destructor. Compiling with clang and testing shows that it does support this. Change-Id: Ie115b20634c26bda475cc09c20960d687fb7050b commit 07c352188bf5265af242255f8e6fcb97050d973d Author: Field G. Van Zee Date: Mon Oct 23 16:59:22 2017 -0500 Added "generic" configuration. Details: - Added a "generic" configuration that leaves the default blocksizes and kernels unchanged. This replaces the older "reference" configuration. Updated auto-detect script and code accordingly. - Added support for generic configuration to arch_t (bli_type_defs.h), bli_gks_init() (bli_gks.c), and bli_arch_config.h - Moved bli_arch_query_id() to bli_arch.c (and prototype to bli_arch.h). - Whitespace changes to configurations' make_defs.mk files. commit c1a98d6f70608b02a1e6bcad6ba020a60773dace Author: Field G. Van Zee Date: Mon Oct 23 14:24:41 2017 -0500 Minor update to .travis.yml file. commit 75b9383f01caa8b83f8be0117e15085b0d807ba6 Author: Field G. Van Zee Date: Fri Oct 20 16:41:22 2017 -0500 Minor header renaming ahead of bli_arch.c. 
Details: - Renamed the various configurations' "bli_arch_.h" header files (replacing "arch" with "family") to free up the 'bli_arch' namespace for a different purpose (hardware detection). - Renamed "bli_arch.h" and "bli_arch_pre_macro_defs.h" in frame/include to "bli_arch_config.h" and "bli_arch_config_pre.h", respectively. commit 482af51add26d5ed103c3e3f167657f273b32c7a Author: Field G. Van Zee Date: Fri Oct 20 15:44:26 2017 -0500 Fixed 'make test' target from top-level Makefile. Details: - Updated the top-level Makefile's build rule for testsuite object files to properly obtain CFLAGS via get-frame-cflags-for() function instead of simply using the $(CFLAGS) variable (which is empty). This means that 'make test' should now work as expected. commit 3c269f700d207efe6c04193f09d519c88c1d4045 Author: Field G. Van Zee Date: Fri Oct 20 13:57:21 2017 -0500 Makefile updates for test drivers, testsuite. Details: - Fixed semi-broken testsuite Makefile and very-broken test driver Makefiles, as well as those for test/3m4m, test/thread_ranges, and test/exec_sizes sub-directories. - Factored out much of the top-level Makefile into common.mk. A Makefile needs only set DIST_PATH to the relative path to the top level of the BLIS source distribution before including common.mk in order to acquire all of the definitions typically needed in a Makefile that tests BLIS. commit 0557189d463446b4c32077cdcf0467fa71ca68dc Author: Field G. Van Zee Date: Wed Oct 18 15:05:27 2017 -0500 Minor updates to .travis.yml, configure script. commit 2553734d1d62043793f4e783a027349ef6d4d563 Merge: 453deb29 37534279 Author: Field G. Van Zee Date: Wed Oct 18 13:46:50 2017 -0500 Merge branch 'master' into rt commit 375342799cbae981c28d831793af588d7951f3f6 Author: Field G. Van Zee Date: Wed Oct 18 13:41:25 2017 -0500 Removed a duplicate bli_avx512_macros.h header. Details: - Removed a duplicate header file that was causing problems during installation for the 'knl' configuration. 
Thanks to Victor Eijkhout for reporting this issue. commit 453deb29068889698e274f269c9aa90eea99b527 Author: Field G. Van Zee Date: Wed Oct 18 13:29:32 2017 -0500 Implemented runtime kernel management. Details: - Reworked the build system around a configuration registry file, named 'config_registry', that identifies valid configuration targets, their constituent sub-configurations, and the kernel sets that are needed by those sub-configurations. The build system now facilitates the building of a single library that can contain kernels and cache/register blocksizes for multiple configurations (microarchitectures). Reference kernels are also built on a per-configuration basis. - Updated the Makefile to use new variables set by configure via the config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP, in determining which sub-configurations (CONFIG_LIST) and kernel sets (KERNEL_LIST) are included in the library, and which make_defs.mk files' CFLAGS (KCONFIG_MAP) are used when compiling kernels. - Reorganized 'kernels' directory into a "flat" structure. Renamed kernel functions into a standard format that includes the kernel set name (e.g. 'haswell'). Created a "bli_kernels_.h" file in each kernels sub-directory. These files exist to provide prototypes for the kernels present in those directories. - Reorganized reference kernels into a top-level 'ref_kernels' directory. This directory includes a new source file, bli_cntx_ref.c (compiled on a per-configuration basis), that defines the code needed to initialize a reference context and a context for induced methods for the microarchitecture in question. - Rewrote make_defs.mk files in each configuration so that the compiler variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration basis.
- Modified bli_config.h.in template so that bli_config.h is generated with #defines for the config (family) name, the sub-configurations that are associated with the family, and the kernel sets needed by those sub-configurations. - Deprecated all kernel-related information in bli_kernel.h and transferred what remains to new header files named "bli_arch_.h", which are conditionally #included from a new header bli_arch.h. These files are still needed to set library-wide parameters such as custom malloc()/free() functions or SIMD alignment values. - Added bli_cntx_init_.c files to each configuration directory. The files contain a function, named the same as the file, that initializes a "native" context for a particular configuration (microarchitecture). The idea is that optimized kernels, if available, will be initialized into these contexts. Other fields will retain pointers to reference functions, which will be compiled on a per-configuration basis. These bli_cntx_init_*() functions will be called during the initialization of the global kernel structure. They are thought of as initializing for "native" execution, but they also form the basis for contexts that use induced methods. These functions are prototyped, along with their _ref() and _ind() brethren, by prototype-generating macros in bli_arch.h. - Added a new typedef enum in bli_type_defs.h to define an arch_t, which identifies the various sub-configurations. - Redesigned the global kernel structure (gks) around a 2D array of cntx_t structures (pointers to cntx_t, actually). The first dimension is indexed over arch_t and the inner dimension is the ind_t (induced method) for each microarchitecture. When a microarchitecture (configuration) is "registered" at init-time, the inner array for that configuration in the 2D array is initialized (and allocated, if it hasn't been already). 
The cntx_t slot for BLIS_NAT is initialized immediately and those for other induced method types are initialized and cached on-demand, as needed. At cntx_t registration, we also store function pointers to cntx_init functions that will initialize (a) "reference" contexts and (b) contexts for use with induced methods. We don't cache the full contexts for reference contexts since they are rarely needed. The functions that initialize these two kinds of contexts are generated automatically for each targeted sub-configuration from cpp-templatized code at compile-time. Induced method contexts that need "stage" adjustments can still obtain them via functions in bli_cntx_ind_stage.c. - Added new functions and functionality to bli_cntx.c, such as for setting the level-1f, level-1v, and packm kernels, and for converting a native context into one for executing an induced method. - Moved the checking of register/cache blocksize consistency from being cpp macros in bli_kernel_macro_defs.h to being runtime checks defined in bli_check.c and called from bli_gks_register_cntx() at the time that the global kernel structure's internal context is initialized for a given microarchitecture/configuration. - Deprecated all of the old per-operation bli_*_cntx.c files and removed the previous operation-level cntx_t_init()/_finalize() invocations. Instead, we now query the gks for a suitable context, usually via bli_gks_query_cntx(). - Deprecated support for the 3m2 and 3m3 induced methods. (They required hackery that I was no longer willing to support.) - Consolidated the 1e and 1r packm kernels for any given register blocksize into a single kernel that will branch on the schema and support packing to both formats. - Added the cntx_t* argument to all packm kernel signatures. - Deprecated the local function pointer array in all bli_packm_cxk*.c files and instead obtain the packm kernel from the cntx_t. - Added bli_calloc_intl(), which serves as the calloc-equivalent to bli_malloc_intl().
Useful when we wish to allocate and initialize to zero/NULL. - Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h, bli_cntx.h into static functions. commit 4607aac297e55ad540cbe5fffbe02e6b1889c181 Author: Nisanth M P Date: Mon Oct 16 22:06:57 2017 +0530 Thread Safety: Move bli_init() before and bli_finalize() after main() BLIS provides APIs to initialize and finalize its global context. One application thread can finalize BLIS, while other threads in the application are still using BLIS. This issue can be solved by removing bli_finalize() from the API. One way to do this is by getting bli_finalize() to execute by default after application exits from main(). GCC supports this behaviour with the help of __attribute__((destructor)) added to the function that needs to be executed after main exits. Similarly bli_init() can be made to run before application enters main() so that application need not call it. Change-Id: I7ce6cfa28b384e92c0bdf772f3baea373fd9feac commit 0f5ce26fc597cda6e8ae93a7526f52eb8cba01e9 Author: Nisanth M P Date: Mon Oct 16 21:07:50 2017 +0530 Thread safety: Make the global induced method status array local to thread BLIS retains a global status array for induced methods, and provides APIs to modify this state during runtime. So, one application thread can modify the state, before another starts the corresponding BLIS operation. This patch solves this issue by making the induced method status array local to threads. Change-Id: Iff59b6f473771344054c010b4eda51b7aa4317fe commit b882648af87deb1b365fc6b3e94151e69c5ccfa4 Merge: 8b379069 e02d3cb8 Author: Field G.
Van Zee Date: Wed Oct 11 16:32:21 2017 -0500 Merge branch 'master' into rt commit 06e0e6351acb9481225975ad9a4e0b8925336621 Author: sthangar Date: Thu Sep 28 12:15:36 2017 +0530 The inner loop parallelization is turned off by default; the JR and IR loop parameters are set to 1 by default Change-Id: I8c3c2ecbbd636259f6ffb92768ec04148205c3e5 commit e02d3cb84190a345ebe9b32f53db03a1838976b1 Author: Field G. Van Zee Date: Tue Sep 26 19:02:53 2017 -0500 Fixed a pthread typo in previous commit. Details: - Misnamed 'pthread_mutex_t' type in bli_memsys.c as 'thread_mutex_t'. commit f5962a1aae0fb3c9be104d0035c0d73210e7f670 Author: Field G. Van Zee Date: Tue Sep 26 17:00:04 2017 -0500 Fixed bugs in gemm/gemmtrsm ukr tests in testsuite. Details: - Fixed a bug in gemmtrsm test module that was due to improper partitioning into a k x k triangular matrix for the purposes of obtaining an mr x k micropanel of A with which to test. - Fixed a bug in gemm and gemmtrsm test modules that would only manifest for very large k (depending on the product of mr x kc on that architecture). The bug arose from the fact that the test module was triggering the allocation of blocks from the internal memory pools, which are limited in size. This allocation imposes an implicit assumption that the micropanel being tested with will fit inside, and this assumption is violated for large values of k. Arbitrarily large k may now be tested for both operation tests. - Added OpenMP/pthread critical sections around the setting or getting of statuses from the induced method operation lookup table in bli_l3_ind.c. - Added the 'static' keyword to all pthread_mutex_t global variables in BLIS. - Thanks to Nisanth Padinharepatt of AMD for reporting the first and third issues. commit 8e917b256ca2d4bcdc059fe98d86be8775c69561 Author: Field G. Van Zee Date: Sat Sep 9 14:10:15 2017 -0500 Updated bibtex info for BLIS5 (3m4m) article.
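Editor's illustration (not part of the commit history): commit f5962a1 above adds critical sections around reads and writes of a shared status lookup table and marks the global mutexes 'static'. The following is a minimal pthreads sketch of that pattern. The table dimensions and all toy_* names are hypothetical and do not reflect the layout in bli_l3_ind.c. Note that the later commit 0f5ce26 (listed earlier in this log) takes a different approach and makes the status array thread-local instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical table dimensions, for illustration only. */
enum { TOY_NUM_METHODS = 3, TOY_NUM_OPS = 4 };

/* Shared state and the file-local ('static') mutex that guards it. */
static bool            toy_status[ TOY_NUM_METHODS ][ TOY_NUM_OPS ];
static pthread_mutex_t toy_status_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Set one entry while holding the mutex. */
static void toy_status_set( int method, int op, bool enabled )
{
    assert( 0 <= method && method < TOY_NUM_METHODS );
    assert( 0 <= op && op < TOY_NUM_OPS );

    pthread_mutex_lock( &toy_status_mutex );
    toy_status[ method ][ op ] = enabled;
    pthread_mutex_unlock( &toy_status_mutex );
}

/* Read one entry while holding the mutex. */
static bool toy_status_get( int method, int op )
{
    assert( 0 <= method && method < TOY_NUM_METHODS );
    assert( 0 <= op && op < TOY_NUM_OPS );

    pthread_mutex_lock( &toy_status_mutex );
    bool enabled = toy_status[ method ][ op ];
    pthread_mutex_unlock( &toy_status_mutex );

    return enabled;
}
```

The mutex makes each individual get or set atomic. It does not stop another thread from changing an entry between a caller's get and the operation that relies on it, which is the scenario the thread-local approach addresses.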
commit 7be887057358df4978a4833eeae0c17e15acd9d1 Author: Nisanth M P Date: Mon Aug 28 17:38:22 2017 +0530 Merging "Adding auto hardware detection for Zen" Change-Id: Id450fb0c4f91a5cd5cbdc06970f4f9ed28dd8520 commit e056d810d16621891ead032603de0c2105cfc0f7 Author: sthangar Date: Mon Aug 28 16:44:42 2017 +0530 Bug fix for the testsuite build failing Change-Id: I7cd8c9d187387c48b2564e45cbfb8df985e93d77 commit 83796b7caf745fafc263e9e5e1bfcf5eff00c025 Merge: 8176f4e4 d1ee7762 Author: Kiran Varaganti Date: Mon Aug 28 05:23:28 2017 -0400 Merge "Adding auto hardware detection for Zen" into amd-staging commit d1ee776202b26874333af7a91b6d2686342c4c81 Author: sthangar Date: Wed Aug 23 13:01:14 2017 +0530 Adding auto hardware detection for Zen Change-Id: I40ce6705dd66b35000c4ccddffad1c5b65998caf commit 8176f4e43872714b997f1a5f83056daadb0ff1a5 Merge: 12413018 adafe974 Author: praveeng Date: Mon Aug 28 12:21:16 2017 +0530 resolving conflicts bli_gemm_front.c and LICENCE Change-Id: Id24ce53896d4c1c7ceccc3e004014a0ecceb5474 commit 57e1e5cd51e7ffe8612c96a20b6a041b55426ddb Merge: f86ce54d d6ef56c6 Author: Nisanth M P Date: Tue Aug 22 17:07:44 2017 +0530 Merge AMD authored changes commit adafe974b4bc3fc0663bc2f6f4ce2fde71a97988 Merge: f86ce54d 7dc78b49 Author: Devin Matthews Date: Tue Aug 15 15:17:21 2017 -0500 Merge pull request #150 from devinamatthews/vzeroupper Add vzeroupper to Intel AVX kernels. commit 7dc78b49f97e6b3cd6d72fcdc588ace534d0e700 Author: Devin Matthews Date: Tue Aug 15 10:02:25 2017 -0500 Add vzeroupper to Intel AVX kernels. commit f86ce54d6f315006984534fe29e47a2deaacc9f5 Author: Field G. Van Zee Date: Thu Aug 10 16:24:28 2017 -0500 Removed trailing enum commas from bli_type_defs.h. Details: - Removed trailing commas from enums in bli_type_defs.h. Thanks to Erling Andersen for pointing out this inconsistency and suggesting the change. commit 60a1eeb2317939d732b9eb6ff1e0d6d668c9a1e5 Author: Field G. 
Van Zee Date: Sat Aug 5 13:04:31 2017 -0500 Added edge handling to _determine_blocksize_b(). Details: - Added explicit handling of situations where i == dim to bli_determine_blocksize_b_sub(). This isn't actually needed by any current use case within BLIS, but handling the situation is nonetheless prudent. Thanks to Minh Quan for reporting this issue and requesting the fix. commit b01c80829907d50ec79977fba8e7b53cfe7db80a Author: Field G. Van Zee Date: Fri Aug 4 14:17:44 2017 -0500 Fixed a minor bug in level-3 packm management. Details: - Fixed a bug in bli_l3_packm() that caused cntl_t-cached packed mem_t entries to be released and then re-acquired unnecessarily. (In essence, the "<" operands in the conditional that guards the release-and-reacquire code block simply needed to be swapped.) The bug should have only affected performance (rather than the computed result). Thanks to Minh Quan for identifying and reporting the bug. commit 8b379069fcd4811669855b1248ece831f190dff6 Merge: 1f3a5819 05925dd5 Author: Field G. Van Zee Date: Tue Aug 1 15:30:40 2017 -0500 Merge branch 'master' into rt commit 05925dd5d30e8f403bb671ce33029170d65ce7c0 Merge: 803bbef0 cecdc05d Author: Devin Matthews Date: Tue Aug 1 09:31:02 2017 -0500 Merge pull request #146 from devinamatthews/master Change lsame_ signature to match lapacke. commit cecdc05d2834786a84ff85775d3f99a958c0765a Author: Devin Matthews Date: Mon Jul 31 15:19:51 2017 -0500 Change lsame_ signature to match lapacke. commit 803bbef0a386dd0571ad389f69d55154dbfe3c50 Author: Field G. Van Zee Date: Sat Jul 29 20:17:05 2017 -0500 Fixed pthreads compile bug with previous commit. Details: - Erroneously passed family parameter into l3int_t function despite that function not taking the parameter. Oops. commit c63980f4ca750618f359031d0691289b1abf5146 Author: Field G. Van Zee Date: Sat Jul 29 14:53:39 2017 -0500 Moved 'family' field from cntx_t to cntl_t. 
Details: - Removed the family field inside the cntx_t struct and re-added it to the cntl_t struct. Updated all accessor functions/macros accordingly, as well as all consumers and intermediaries of the family parameter (such as bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This change was motivated by the desire to keep the context limited, as much as possible, to information about the computing environment. (The family field, by contrast, is a descriptor about the operation being executed.) - Added additional functions to bli_blksz_*() API. - Added additional functions to bli_cntx_*() API. - Minor updates to bli_func.c, bli_mbool.c. - Removed 'obj' from bli_blksz_*() API names. - Removed 'obj' from bli_cntx_*() API names. - Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines that operate only on a single struct to contain the "_node" suffix to differentiate with those routines that operate on the entire tree. - Added enums for packm and unpackm kernels to bli_type_defs.h. - Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h. They weren't being used and probably never will be. commit 07837395560d413a1ba828163b41186e21a7bcfe Merge: ca1d1d85 ad8610b4 Author: Field G. Van Zee Date: Fri Jul 21 16:49:48 2017 -0500 Merge pull request #139 from Maratyszcza/emscripten Fix Emscripten builds commit ad8610b4415cc7982804d74f9aba29875e9e2b6c Merge: 8772a0b3 ca1d1d85 Author: Field G. Van Zee Date: Fri Jul 21 15:18:33 2017 -0500 Merge branch 'master' into emscripten commit ca1d1d8560c9ab1a7e3b0ac43ac70d08075bf904 Merge: b537b5bb 733faf84 Author: Devin Matthews Date: Fri Jul 21 09:49:50 2017 -0500 Merge pull request #144 from devinamatthews/fix_atomics_on_bgq Add fallbacks to __sync_* or __c11_atomic_* builtins... commit 733faf848dcc54834fcdfbb0185dc644978d8864 Author: Devin Matthews Date: Thu Jul 20 14:50:13 2017 -0500 Clang can't make up its mind what to support.
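Editor's illustration (not part of the commit history): pull request #144 above, together with commits 7425d07 and 8823f91 below, adds fallbacks for compilers that lack the __atomic_* builtins. The sketch below shows one way such a fallback could be selected at compile time. It is an assumption-laden simplification: it covers only the __atomic_* and __sync_* families (not __c11_atomic_*), it tests with __has_builtin rather than the __has_extension macro named in the commit title, and the TOY_/toy_ names are hypothetical.

```c
#include <assert.h>

/* Compilers without this feature-test macro get "builtin absent". */
#ifndef __has_builtin
#define __has_builtin( x ) 0
#endif

/* Prefer the __atomic_* builtins when the compiler reports them, or
   when it is GCC 4.7 or later (which predates __has_builtin). */
#if __has_builtin( __atomic_add_fetch ) || \
    ( defined( __GNUC__ ) && !defined( __clang__ ) && \
      ( __GNUC__ > 4 || ( __GNUC__ == 4 && __GNUC_MINOR__ >= 7 ) ) )
#define TOY_HAVE_GNU_ATOMICS 1
#else
#define TOY_HAVE_GNU_ATOMICS 0
#endif

/* Atomically add val to *ptr and return the new value. */
#if TOY_HAVE_GNU_ATOMICS
#define toy_add_fetch( ptr, val ) \
        __atomic_add_fetch( ( ptr ), ( val ), __ATOMIC_ACQ_REL )
#else
/* Older fallback; implies a full memory barrier. */
#define toy_add_fetch( ptr, val ) \
        __sync_add_and_fetch( ( ptr ), ( val ) )
#endif

/* Single-threaded smoke check: n increments must yield n. */
static long toy_count_to( long n )
{
    long counter = 0;
    for ( long i = 0; i < n; ++i ) toy_add_fetch( &counter, 1 );
    assert( counter == n );
    return counter;
}
```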
commit 7425d0744d9e9cd29a887120e57c2b43ba287040 Author: Devin Matthews Date: Thu Jul 20 12:54:58 2017 -0500 Add default #define for __has_extension. commit b537b5bbe8cbee459a85bac11458498ae2bce4de Merge: 1f1ec0db 7f41bb0a Author: Devin Matthews Date: Thu Jul 20 10:58:39 2017 -0500 Merge pull request #133 from devinamatthews/haswell-packdim Fix prefetching in haswell ukernel commit 8823f91a14638ce6f4e45e67df03212bb61609d6 Author: Devin Matthews Date: Thu Jul 20 10:04:34 2017 -0500 Add fallbacks to __sync_* or __c11_atomic_* builtins when __atomic_* is not supported. Fixes #143. commit 1f1ec0db9380b87679d5c771c4594daa1cfc5f0d Author: Field G. Van Zee Date: Wed Jul 19 15:40:48 2017 -0500 Updated ar option list used by all configurations. Details: - Dropped 'u' from the list of modifiers passed into the library archiver ar. Previously, "cru" was used, while now we employ only "cr". This change was prompted by a warning observed on Ubuntu 16.04: ar: `u' modifier ignored since `D' is the default (see `U') This caused me to realize that the default mode causes timestamps to be zero, and thus the 'u' option, which causes only changed object files to be inserted, is not applicable. commit 5caaba2d61cbbc36d63102a0786ece28ff797f72 Author: Field G. Van Zee Date: Wed Jul 19 13:51:53 2017 -0500 Added --force-version=STRING option to configure. Details: - Added an option to configure that allows the user to force an arbitrary version string at configure-time. The help text also now describes the usage information. - Changed the way the version string is communicated to the Makefile. Previously, it was read into the VERSION variable from the 'version' file via $(shell cat ...). Now, the VERSION variable is instead set in config.mk (via a configure-substituted anchor from config.mk.in). commit 13175c5fb70fb6a378d5fff6ecede62e5ea6a1f6 Author: Field G. Van Zee Date: Tue Jul 18 17:56:00 2017 -0500 Updated openmp/pthread barriers with GNU atomics. 
Details: - Updated the non-tree openmp and pthreads barriers defined in bli_thrcomm_openmp.c and bli_thrcomm_pthreads.c to instead call a common implementation in bli_thrcomm.c, bli_thrcomm_barrier_atomic(). This new implementation goes through the same motions as the previous codes, but protects its loads and increments with GNU atomic built-ins. These atomic statements take memory ordering parameters that allow us to specify just enough constraints for the barrier to work as intended on weakly-ordered hardware. The prior implementation was only guaranteed to work on systems with strongly- ordered memory. (Thanks to Devin Matthews for suggesting this change and his crash-course in atomics and memory ordering.) - Removed 'volatile' from structs' barrier field declarations in bli_thrcomm_*.h. - Updated bli_thrcomm_pthread.? files to use renamed struct barrier fields consistent with that of the _openmp.? files. - Updated other bli_thrcomm_* files to rename "communicator" variables to simply "comm". commit 0e58ba1b3aa84700ca51a96f1c0eed6067562fba Author: Field G. Van Zee Date: Mon Jul 17 19:03:22 2017 -0500 Added API to set mt environment variables. Details: - Renamed bli_env_get_nway() -> bli_thread_get_env(). - Added bli_thread_set_env() to allow setting environment variables pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS. - Added the following convenience wrapper routines: bli_thread_get_jc_nt() bli_thread_get_ic_nt() bli_thread_get_jr_nt() bli_thread_get_ir_nt() bli_thread_get_num_threads() bli_thread_set_jc_nt() bli_thread_set_ic_nt() bli_thread_set_jr_nt() bli_thread_set_ir_nt() bli_thread_set_num_threads() - Added #include "errno.h" to bli_system.h. - This commit addresses issue #140. - Thanks to Chris Goodyer for inspiring these updates. 
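As an illustration of the barrier entry above (13175c5f), the following is a minimal sketch of a sense-reversing barrier whose loads and increments go through GNU atomic built-ins with explicit memory orderings. The type and function names (barrier_sketch_t, barrier_sketch_init(), barrier_sketch_wait()) are hypothetical and are not the BLIS API; the real code lives in bli_thrcomm_barrier_atomic() and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical barrier state; field names are illustrative, not BLIS's. */
typedef struct
{
    size_t n_threads;       /* number of threads that must arrive */
    size_t barrier_sense;   /* flips between 0 and 1 each time the barrier opens */
    size_t threads_arrived; /* arrivals counted in the current phase */
} barrier_sketch_t;

void barrier_sketch_init( barrier_sketch_t* b, size_t n_threads )
{
    assert( n_threads > 0 );
    b->n_threads       = n_threads;
    b->barrier_sense   = 0;
    b->threads_arrived = 0;
}

void barrier_sketch_wait( barrier_sketch_t* b )
{
    /* Remember the sense of the phase this thread is entering. */
    size_t my_sense = __atomic_load_n( &b->barrier_sense, __ATOMIC_RELAXED );

    /* Atomically count this thread's arrival. */
    size_t arrived = __atomic_add_fetch( &b->threads_arrived, 1, __ATOMIC_ACQ_REL );

    if ( arrived == b->n_threads )
    {
        /* Last arrival: reset the counter, then publish the flipped sense.
           The release ordering makes the reset visible to any thread that
           observes the new sense with an acquire load. */
        __atomic_store_n( &b->threads_arrived, 0, __ATOMIC_RELAXED );
        __atomic_fetch_xor( &b->barrier_sense, 1, __ATOMIC_RELEASE );
    }
    else
    {
        /* Spin until the last arrival flips the sense. */
        while ( __atomic_load_n( &b->barrier_sense, __ATOMIC_ACQUIRE ) == my_sense )
            ;
    }
}
```

The acquire/release pairing on barrier_sense is what the entry means by specifying "just enough constraints" for weakly-ordered hardware; a 'volatile' qualifier alone would not provide that ordering.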
commit 8772a0b33a90154c80d88b381dcdd66f824e041f Author: Marat Dukhan Date: Thu Jul 13 21:39:24 2017 -0700 Fix Emscripten builds commit 72c8b49bb8d3b9370b2cc37718da22f065de9c57 Merge: 70cc825b ba7cada5 Author: Field G. Van Zee Date: Wed Jul 12 14:58:12 2017 -0500 Merge pull request #138 from hominhquan/membrk_set_free_fp Set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers commit ba7cada51a238d320528e3504ed0f0a17a6b022a Author: Minh Quan HO Date: Fri Jul 7 10:52:05 2017 +0200 set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is not set in bli_membrk_init commit 1241301869957c96f16a2c6567e3ad70afa547de Merge: 969b67e8 25ead66f Author: Kiran Varaganti Date: Wed Jul 5 02:24:00 2017 -0400 Merge "Reducing the framework overhead of GEMV routines" into amd-staging commit 25ead66fb78557f73af48bac305724d5d8aa3309 Author: sthangar Date: Fri Jun 30 12:23:19 2017 +0530 Reducing the framework overhead of GEMV routines Change-Id: I83607ad767bff74e305e915b54b0ea34ec3e5684 commit 969b67e8800fbd5d14a086606f3b5afbf66ed093 Author: Kiran Varaganti Date: Tue Jul 4 12:57:32 2017 +0530 Improved efficiency of dGEMM for large matrices by reducing TLB load misses and majorly L3 cache misses. This is achieved by changing the packed block sizes of matrix A & B. Now the optimum values are MC_D = 510 and KC_D = 1024. Change-Id: I2d8bdd5f62f2d1f8782ae2997f3d7a26587d1ca4 commit 70cc825b552dec05165b9d70f9e6eb33d8abb118 Author: Devin Matthews Date: Tue Jun 6 21:58:21 2017 -0500 Update LICENSE Remove totally unnecessary first 9 lines and hopefully get Github to recognize it as 3BSD [ci skip]. 
commit cf54c77bc79a0f33a514be72c80a654c4e6e6f63 Author: Devin Matthews Date: Tue Jun 6 20:23:17 2017 -0500 Add new SSI acknowledgment commit d6ef56c6dbaf6df8ee1af1ca6a0f0792a811396a Author: prangana Date: Thu Jun 1 16:11:09 2017 +0530 Update version number Change-Id: Ib6e52d1d34c0791367ab9152dfab31f94deedeb4 commit 897bfa0e92082c30bbb74229562d7d7327cbbac8 Author: prangana Date: Thu Jun 1 16:11:09 2017 +0530 Update version number Change-Id: Ib6e52d1d34c0791367ab9152dfab31f94deedeb4 commit 99d0ba5606d4b63e6a9c639aa78d4defc2455f79 Merge: be2c7eb8 6d17e012 Author: Santanu Thangaraj Date: Thu Jun 1 02:19:02 2017 -0400 Merge "Checked in the small matrix code to compute GEMM called with A transpose case" into amd-staging commit 6d17e0120fe5c127b941136ad2c0c08e91439535 Author: sthangar Date: Wed May 24 11:48:16 2017 +0530 Checked in the small matrix code to compute GEMM called with A transpose case Change-Id: I29f40046d43d7a4b037c1cb322503ee26495f462 commit 9d93f8481a1404695f7b78a3ced8ca47e890b649 Author: prangana Date: Tue May 30 09:58:10 2017 +0530 Update Licence File Change-Id: I4c5cf1690d0cef92a68400f9a89e454ab6856ad2 commit be2c7eb85168937bd4318f4d05ded37620119310 Author: prangana Date: Tue May 30 09:58:10 2017 +0530 Update Licence File Change-Id: I4c5cf1690d0cef92a68400f9a89e454ab6856ad2 commit 7f41bb0a0becde6a7de7df0f99668d7b4686c3b0 Author: Devin Matthews Date: Fri May 26 14:49:31 2017 -0400 PACKDIM_MR=8 didn't work out, but messing with the prefetching helps 2%. commit d87614af3f3d9187be94d6e77984b282bf890928 Author: Devin Matthews Date: Fri May 26 14:47:36 2017 -0400 Revert "Change PACKDIM_MR (double) for haswell to 8." This reverts commit 681eec913d7c2ebcff637cec5c1627ced9a92b99. commit 681eec913d7c2ebcff637cec5c1627ced9a92b99 Author: Devin Matthews Date: Fri May 26 12:28:09 2017 -0500 Change PACKDIM_MR (double) for haswell to 8. 
commit 0a3ae0ecaa0ddcb5887005d7051fa234499f1120 Merge: 0f4e6652 6e04f9df Author: praveeng Date: Sat May 20 16:53:50 2017 +0530 frame/3/gemm/bli_gemm_front.c Change-Id: I52a0fbc1d33bb948d430942323bbc5fe44e3ca13 commit 6e04f9df01d79c1b0e673943ca0d5d0a6095eb2e Author: Field G. Van Zee Date: Wed May 17 13:03:52 2017 -0500 Restored deleted lines from makefile fragments. commit ec5c0c0448275280dca0991f6f33afeb73650450 Author: Devin Matthews Date: Wed May 17 12:29:44 2017 -0500 Change to /bin/sh. All scripts checked with Debian's checkbashisms. Also check for clang first in auto-detect.sh. commit 555ddc30d4c7e44f3f335e436c98606f56e1598b Author: Devin Matthews Date: Wed May 17 12:27:14 2017 -0500 Remove shebangs from makefiles. commit f26bd7f42e0c2a47fe321b2c452644990b689654 Merge: cbf8710a 169fb05f Author: Devin Matthews Date: Wed May 17 11:58:41 2017 -0500 Merge pull request #128 from iotamudelta/master Portability and clang commit 169fb05f225c2f060265bcaa872f7f80dc638b70 Author: J M Dieterich Date: Tue May 16 23:11:22 2017 -0400 Fix if/else structure. Thanks to TravisCI. commit 0579dfea0bcfbb90ebc073fcf78b92a5cf7238e1 Author: J M Dieterich Date: Tue May 16 22:58:07 2017 -0400 Restore version. commit a75b05c23dc786a1fdc45dc1627a5ce2299f1a7b Author: J M Dieterich Date: Tue May 16 22:23:27 2017 -0400 Mark piledriver compilable w/ clang. commit 7541d46e2ba8659bb2e36b444edef112fefa1345 Author: J M Dieterich Date: Tue May 16 22:12:12 2017 -0400 Mark bulldozer compilable w/ clang. commit 91f897073ec0df3330ede449c4d6af8158266ae3 Author: J M Dieterich Date: Tue May 16 22:06:59 2017 -0400 Correct error message. commit f5131e1e49167f948bddd714bb1af1761829c212 Author: J M Dieterich Date: Tue May 16 22:03:23 2017 -0400 Indeed one can compile for carrizo also using clang.
commit 5fa4e9439c04f35f89dd7d26ff742cb2dadc3180 Author: J M Dieterich Date: Tue May 16 21:50:49 2017 -0400 A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash commit 1f3a58197e5d5f9ac862bda91e7527cbfbab5d76 Author: Field G. Van Zee Date: Mon May 8 16:10:03 2017 -0500 Housekeeping, induced method file/function renames. Details: - Renamed all level-3 induced method files to use the "_vir.c" suffix instead of "_ref.c". Also renamed functions within these files accordingly. - Renamed cpp macro definitions in frame/ind/include according to the above changes. - Removed frame/3/old. commit cbf8710a1ba63e25aadaa6fc5da51ea81b3d596d Merge: cf39d3ef fdc66f12 Author: Tyler Michael Smith Date: Mon May 8 11:21:20 2017 -0500 Merge pull request #127 from devinamatthews/fix_blis_nt_xx Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS commit cf39d3ef3b29b8058c39fb4638c1a734fe64aaed Author: Field G. Van Zee Date: Fri May 5 15:06:56 2017 -0500 Fixed a bug in norm1v, norm1m. Details: - Fixed a bug that manifested as improperly-computed 1-norm for vectors and matrices. This is one of the few operations in BLIS that does not have its own test module within the testsuite, hence why it went undetected for so long. The bad 1-norms were being used to normalize matrices in the testsuite after initialization, which led to some matrices containing a combination of "large" and "small" values. This tended to push the residuals computed after each test away from zero. In some cases, they were off *just* enough for the testsuite to label it a "failure". Many thanks to Jeff Hammond for reporting this bug. (Wonky details: the bug was due to improperly-defined level-0 scalar macros for abval2, an operation that computes the absolute square, or complex magnitude/modulus. Certain complex domain instances of abval2 were being incorrectly defined in terms of real-only solutions, leading to bad results. This level-0 operation forms the basis of norm1v/norm1m.
absq2 was also affected, but almost nothing uses this operation.) commit 799485124f4d823e908d2e5d38b0c3a1e6172ade Merge: 773a24ef 0df3541f Author: Devin Matthews Date: Thu May 4 10:52:09 2017 -0500 Merge pull request #121 from jeffhammond/not-real-knl allow KNL build without hbwmalloc (i.e. emulated) commit fdc66f12d40754ff46179804bff592fddafbca02 Author: Devin Matthews Date: Thu May 4 10:35:22 2017 -0500 Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123. commit 773a24efb2fa1c3a220bf0ce1dd621a3176196da Merge: dd58c954 b8854259 Author: Field G. Van Zee Date: Wed May 3 15:07:59 2017 -0500 Merge branch 'master' of github.com:flame/blis commit dd58c9545c877c3f7553eaebca7b5e9720a66f5d Author: Field G. Van Zee Date: Wed May 3 15:04:51 2017 -0500 Disable complex 3m/4m in testsuite by default. Details: - Disabled testsuite tests of all level-3 implementations based on 3m and 4m. This will improve testing runtime on Travis CI as well as for anyone manually running the testsuite using default test parameters. Thanks to Devin Matthews for suggesting this change. commit 0df3541f54b7fe0c604ab2ec47ba814f12391798 Author: Jeff Hammond Date: Tue May 2 19:25:21 2017 -0700 allow KNL build without hbwmalloc.h (i.e. emulated) we want to be able to run BLIS KNL binaries on non-KNL machines via SDE. although it is possible to install hbwmalloc implementation on such systems, it is easier not to, since obviously the performance of SDE execution is not representative so there is no reason to emulate HBW allocation. commit b88542591d4dd0cde366e5ae35afd3205cb81bdc Merge: 43007f7b c2c91e09 Author: Field G. Van Zee Date: Tue May 2 19:22:41 2017 -0500 Merge pull request #107 from jeffhammond/intel-compilers-no-use-libm never use libm with Intel compilers commit 43007f7b65ec7926cbbfc39965ff733fa251c15f Author: Field G. Van Zee Date: Tue May 2 16:48:43 2017 -0500 Fixed stray parentheses in README citations.
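The precedence rule recorded in fdc66f12 above (any per-loop thread setting overrides the global thread count, and unset per-loop values default to 1) can be sketched as follows. This is an illustrative reading of the commit message, not the BLIS implementation: nt_sketch_t, parse_nt_sketch() and resolve_nt_sketch() are hypothetical names, and the function takes the variable values as strings so that it does not depend on the exact environment variable spellings.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical result type for this sketch. */
typedef struct
{
    int  per_loop;       /* 1 if explicit per-loop values are in effect */
    long num_threads;    /* total number of threads requested */
    long jc, ic, jr, ir; /* per-loop ways of parallelism */
} nt_sketch_t;

/* Parse a positive integer; return 'fallback' if s is NULL, empty, or invalid. */
static long parse_nt_sketch( const char* s, long fallback )
{
    assert( fallback > 0 );
    if ( s == NULL || *s == '\0' ) return fallback;
    char* end = NULL;
    long  v   = strtol( s, &end, 10 );
    return ( *end == '\0' && v > 0 ) ? v : fallback;
}

nt_sketch_t resolve_nt_sketch( const char* num_threads,
                               const char* jc, const char* ic,
                               const char* jr, const char* ir )
{
    nt_sketch_t r;

    r.per_loop = ( jc != NULL || ic != NULL || jr != NULL || ir != NULL );

    if ( r.per_loop )
    {
        /* At least one per-loop value is set: ignore the global count
           and default every missing per-loop value to 1. */
        r.jc = parse_nt_sketch( jc, 1 );
        r.ic = parse_nt_sketch( ic, 1 );
        r.jr = parse_nt_sketch( jr, 1 );
        r.ir = parse_nt_sketch( ir, 1 );
        r.num_threads = r.jc * r.ic * r.jr * r.ir;
    }
    else
    {
        /* Only the global count applies. How the total is spread over
           the loops is outside the scope of this sketch. */
        r.num_threads = parse_nt_sketch( num_threads, 1 );
        r.jc = r.ic = r.jr = r.ir = 1;
    }
    return r;
}
```

A caller would typically pass the results of getenv() for BLIS_NUM_THREADS and for the per-loop variables (BLIS_JC_NT and its ic/jr/ir counterparts); getenv() returns NULL for an unset variable, which is what the sketch treats as "missing".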
commit a4f1d0b8801c114e9ef8be39df01e1b8d27ebcb3 Author: Field G. Van Zee Date: Tue May 2 16:38:43 2017 -0500 CHANGELOG update (0.2.2) commit 940a707ac78de975110e17c95765e65b89aa5e10 (tag: 0.2.2) Author: Field G. Van Zee Date: Tue May 2 16:38:42 2017 -0500 Version file update (0.2.2) commit d5a5e003ea9b24bb6abf12e88862e8eb61ffb03d Author: Field G. Van Zee Date: Tue May 2 15:48:30 2017 -0500 Fixed a trsm1m bug that affected right-side cases. Details: - Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result was nondeterministic behavior (usually segmentation faults) for certain problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c which explicitly directed the virtual gemm micro-kernel to use temporary space if the storage preference of the [real domain] gemm ukernel did not match the storage of the output matrix C. In the context of gemm, this handling is not needed because agreement between the storage pref and the matrix is guaranteed by a high-level optimization in BLIS. However, this optimization is not applied to trsm because the storage of C is not necessarily the same as the storage of the micro-panels of B--both of which are updated by the micro-kernel during a trsm operation. Thus, the guarantee of storage/preference agreement is not in place for trsm, which means we must handle that case within the virtual gemm micro-kernel. - Comment updates and a minor macro change to bli_trsm*_cntx_init() for 3m1, 4m1a, and 1m. commit e80993e71f4d571e9650a8e90ed386e32059eae5 Merge: a509fbd5 ca3a7924 Author: Field G. Van Zee Date: Tue May 2 12:30:28 2017 -0500 Merge branch 'master' into 1m commit ca3a7924770d6cf203cce4ca9f5482e1d0d4e961 Author: Field G. Van Zee Date: Tue May 2 12:09:39 2017 -0500 README.md update. Details: - Updated bibtex entries for 4th BLIS paper, and adds entries for 5th and 6th BLIS papers. 
commit 0f4e6652dfe9b30105d3bab328ac26d9d5c11182 Merge: 42e7f6fb 6e7de6ef Author: praveeng Date: Wed Apr 19 17:54:10 2017 +0530 Merge master code till 2017_04_19 to amd-staging Change-Id: Ibebe83c8ea2e7eb15798c2bcf214b7228a1c9518 commit 42e7f6fb2a531429ee600b2fe0293b67371c7ccb Author: sthangar Date: Tue Mar 28 18:10:03 2017 +0530 fixed license attribute issues in AMD added files Change-Id: I303f870a777c7cd1c1af29ea0b93f3e0a27948e4 commit 5600001e973c6cea048bd3fdb28117f1d7c98b9d Merge: 0b190293 b3ed4933 Author: prangana Date: Mon Mar 20 13:56:33 2017 +0530 Fix merge conflicts after sync with release branch Change-Id: Icf14a09f728befb69a73fff9fa79c4128e728310 commit 6e7de6ef84babb273dc5528a9b9d01f0febe394b Author: Field G. Van Zee Date: Fri Mar 17 12:10:24 2017 -0500 Minor updates to test/3m4m. Details: - Updated initial problem size and increment in Makefile. - Updated code in test_gemm.c to correctly query kc from context. commit f484c6cd4389dc7ae5b972849e12e98ad5bbf9a4 Author: Field G. Van Zee Date: Fri Mar 17 12:07:27 2017 -0500 Whitespace reformatting to armv8a kernels file. Details: - Updated formatting of function signature/header in kernels/armv8a/3/bli_gemm_opt_4x4.c. 
commit 0b19029342ffc530fa22ef20398a26221cb8f6ec Author: Kiran Varaganti Date: Tue Mar 14 14:51:31 2017 +0530 Code cleanup, removed warnings from trsm, removed unused routines in axpyv & scalv Change-Id: I02867f394c5f416194c4b1769a6c75f39243ec81 commit 825363bd2a5a60a923d4a6d9691dc143845a9cab Merge: 093bdb80 513944e4 Author: praveeng Date: Wed Mar 8 15:42:49 2017 +0530 Merge code from master to amd-staging as on 2017_03_08 by praveeng Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d commit 093bdb80c86b06367e595aa17487139ae983822f Author: sthangar Date: Tue Mar 7 13:35:50 2017 +0530 Checked in Unpacked DGEMM code Change-Id: I39dcc7b238b328f73ee2675d21a5e521d0488723 commit 33923da9a108854590d386e74b6ee66b971e7796 Author: Kiran Varaganti Date: Mon Mar 6 14:31:31 2017 +0530 Added variant 10 for double precision axpyv microkernel Change-Id: I7a20cc113a422603250bc450825c965136354974 commit bc828f7f8e3ddb9f58af07edc0b935b21759fb0f Author: Kiran Varaganti Date: Fri Mar 3 14:45:35 2017 +0530 Added new axpyv (single precision) microkernel where it performs 10 FMAs per loop- This gives better performance than all other implementations of axpyv Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972 commit c9949f4603419267c10973adf1d63ec38497475d Author: sthangar Date: Fri Feb 17 14:16:33 2017 +0530 Checked in DGEMMTRSM and edge case handling routine in DDOTXF Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e commit a509fbd5ac04fafd4e51b43d2f59ca56432dc212 Merge: 69b4846a 513944e4 Author: Field G. Van Zee Date: Tue Feb 21 17:06:16 2017 -0600 Merge branch 'master' into 1m commit 69b4846ae9adb157c4171b52e159684db2867853 Author: Field G. Van Zee Date: Tue Feb 21 15:33:39 2017 -0600 Disabled experiment-related 1m code. Details: - Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was specifically inserted to facilitate the benchmarking of 1m block-panel and panel-block algorithms. 
- Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to reflect changes used/needed during benchmarking. commit 513944e4a951d8823b4de161b86ad7a965b4d99b Merge: 8b462a0e 0e18f68c Author: Devin Matthews Date: Mon Feb 20 10:04:33 2017 -0500 Merge pull request #118 from devinamatthews/master Handle k=0 correctly in KNL dgemm ukernel. commit 0e18f68cf12eb9189ba901a20040b1cdae417670 Author: Devin Matthews Date: Mon Feb 20 09:03:21 2017 -0600 Handle k=0 correctly in KNL dgemm ukernel. commit 8b462a0e8c3e9252f0401940849e53cc772256fa Merge: c362afc5 7d42fc07 Author: Devin Matthews Date: Sun Feb 19 23:03:03 2017 -0500 Merge pull request #117 from devinamatthews/master Cast dim_t and inc_t parameters to 64-bit in KNL microkernels. commit 7d42fc0796ef0c010375fd8e59b1240ba41ce4d2 Author: Devin Matthews Date: Sun Feb 19 21:10:55 2017 -0500 Cast dim_t and inc_t parameters to 64-bit in KNL microkernels. commit 04245c9ff7f8b3c70d61003029c964bb9a4320ee Author: Kiran Varaganti Date: Fri Feb 10 14:24:30 2017 +0530 Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5 commit c362afc525bab4050581d1b0fcea2fe4d582c608 Author: Field G. Van Zee Date: Thu Feb 9 11:54:59 2017 -0600 Added missing "level-0" BLAS [sd]cabs1_(). Details: - Fixed issue #115 by adding implementations for scabs1_() and dcabs1_() to the BLAS compatibility layer. Thanks to heroxbd for pointing out their absence. commit 018180c938c32efbeaaf626ba71ec5b780664db1 Author: Field G. Van Zee Date: Wed Feb 8 11:20:52 2017 -0600 Fixed a minor bug in configure (issue #114). Details: - Fixed a bug in the configure script whereby a non-preferred value for --enable-threading would cause problems in common.mk vis-a-vis detecting which threading model was chosen. Thanks to heroxbd for reporting this issue. 
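For context on the [sd]cabs1_() entry above (c362afc5): in the reference BLAS, dcabs1 returns |Re(z)| + |Im(z)| rather than the Euclidean modulus. A minimal sketch of that computation follows. The struct and function names are hypothetical stand-ins, not the symbols exported by the BLIS compatibility layer, and NaN and signed-zero handling are not addressed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical double-precision complex type for this sketch. */
typedef struct
{
    double real;
    double imag;
} dcomplex_sketch_t;

/* Absolute value without relying on libm. */
static double abs_sketch( double x )
{
    return ( x < 0.0 ) ? -x : x;
}

/* Sum of the absolute values of the real and imaginary parts. */
double dcabs1_sketch( const dcomplex_sketch_t* z )
{
    assert( z != NULL );
    return abs_sketch( z->real ) + abs_sketch( z->imag );
}
```

For z = -3 + 4i this yields 7, whereas the modulus would be 5, which is why the routine cannot simply be aliased to a complex absolute-value function.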
commit 58b5b77e5fdb179ea465e398e416e6a00d917e05 Author: Kiran Varaganti Date: Wed Feb 8 21:43:34 2017 +0530 Fixed a bug in axpyv, the arguments passed to intrinsic fmad instruction are corrected Change-Id: If12f24c6bc74b22ac9e4acd6b9378e06d79f2f5e commit 85de4ebf74d0a5587d5a12724eb5489d51674db3 Author: Kiran Varaganti Date: Wed Feb 8 14:41:04 2017 +0530 variant 4 axpyv single precision modified: explicitly used FMA intrinsics, replaced vector multiply and add operations Change-Id: I975feef56696d479d2b9e9441b0660021cf4f6ff commit 3fa53e8af31d634779f40258c51483ae8af494fa Merge: b5291a44 95be7b04 Author: Kiran Varaganti Date: Wed Feb 8 11:46:34 2017 +0530 Merged axpyv and gemm small in bli_kernel.h Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging modified: config/zen/bli_kernel.h modified: frame/3/gemm/bli_gemm_front.c modified: kernels/x86_64/zen/3/bli_gemm_small_matrix.c Change-Id: If181cf9345178c448b3530beb8bef453917fe295 commit 95be7b04709e688a4cb01fba680081e30f4258ef Author: sthangar Date: Tue Feb 7 14:01:27 2017 +0530 Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0 commit b5291a445b1313e01f1e0e8102c5f3660ab07f69 Author: Kiran Varaganti Date: Tue Feb 7 12:39:31 2017 +0530 Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9 commit f4bfc1662af82aa4b98185334c44835e51f1cbec Author: Kiran Varaganti Date: Mon Feb 6 15:04:27 2017 +0530 New routines implemented for axpyv to improve performance for small vector sizes, vectorization is done for vectors as small as 8 (single precision) 4(double precision), since this operation has low compute to memory ratio, higher matrix sizes memory operations are dominating and hence not much gain - This still needs some work- added saxpyv and daxpyv var 3 routines in the file 
bli_axpyv_opt_var1.c Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072 commit ddf45e71770c55ea4a58ca24ea4913fe5d8beb9b Merge: a6ab91bc 78e1b16e Author: Devin Matthews Date: Fri Jan 27 14:25:40 2017 -0600 Merge pull request #113 from devinamatthews/knl_thread_params Change default threading parameters for KNL. commit 78e1b16e16d589ed31b2e712115ee282097f114d Author: Devin Matthews Date: Fri Jan 27 14:22:20 2017 -0600 Change default threading parameters for KNL. commit 574472ba5a89924eca7dbd10055d0e1dcd7f4c71 Author: sthangar Date: Tue Jan 10 14:51:46 2017 +0530 checked in unpacked SGEMM optimization Change-Id: I8e4ea374415c0c402c660b656fb076af15354181 commit 1c732d3ddc4ac0861d3b0e0dd15eb7e071615502 Author: Field G. Van Zee Date: Wed Jan 25 16:25:46 2017 -0600 Added 1m-specific APIs for bp, pb gemm algorithms. Details: - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the body of bli_gemm_cntl_create() replaced with a call to the former. - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now, bli_cntl_free() can check if the thread parameter is NULL, and if so, call the latter, and otherwise call the former. - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in terms of bli_gemm1mxx_cntx_init(), which behaves the same as bli_gemm1m_cntx_init() did before, except that an extra bool parameter (is_pb) is used to support both bp and pb algorithms (including to support the anti-preference field described below). - Added support for "anti-preference" in context. The anti_pref field, when true, will toggle the boolean return value of routines such as bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of causing BLIS to transpose the operation to achieve disagreement (rather than agreement) between the storage of C and the micro-kernel output preference. 
This disagreement is needed for panel-block implementations, since they induce a transposition of the suboperation immediately before the macro-kernel is called, which changes the apparent storage of C. For now, anti-preference is used only with the pb algorithm for 1m (and not with any other non-1m implementation). - Defined new functions, bli_cntx_l3_ukr_eff_prefers_storage_of() bli_cntx_l3_ukr_eff_dislikes_storage_of() bli_cntx_l3_nat_ukr_eff_prefers_storage_of() bli_cntx_l3_nat_ukr_eff_dislikes_storage_of() which are identical to their non-"eff" (effectively) counterparts except that they take the anti-preference field of the context into account. - Explicitly initialize the anti-pref field to FALSE in bli_gks_cntx_set_l3_nat_ukr_prefs(). - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel in terms of the existing block-panel macro-kernel _ker_var2(). This technique requires inducing transposes on all operands and swapping the A and B. - Changed bli_obj_induce_trans() macro so that pack-related fields are also changed to reflect the induced transposition. - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily specify the 1m algorithm (block-panel or panel-block). - Renamed the following cntx_t-related macros: bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block() bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel() bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel() and updated all instantiations. Also updated the field names in the cntx_t struct. - Comment updates. 
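The "anti-preference" behavior described in 1c732d3 above can be pictured with a small sketch: the effective query is the ordinary preference query with its result toggled when the anti_pref field is set. The types and the "_sketch" function names below are hypothetical simplifications; the real queries operate on obj_t and cntx_t and take more information into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical storage descriptor and context for this sketch. */
typedef enum { STORAGE_SKETCH_ROWS, STORAGE_SKETCH_COLS } storage_sketch_t;

typedef struct
{
    storage_sketch_t ukr_pref;  /* output storage preferred by the gemm microkernel */
    bool             anti_pref; /* true for panel-block (pb) algorithms */
} cntx_sketch_t;

/* Plain query: does the microkernel prefer the storage of C? */
bool ukr_prefers_storage_of_sketch( const cntx_sketch_t* cntx, storage_sketch_t c )
{
    assert( cntx != NULL );
    return cntx->ukr_pref == c;
}

/* "Effective" query: same answer, toggled when anti_pref is set. */
bool ukr_eff_prefers_storage_of_sketch( const cntx_sketch_t* cntx, storage_sketch_t c )
{
    bool prefers = ukr_prefers_storage_of_sketch( cntx, c );
    return cntx->anti_pref ? !prefers : prefers;
}

/* The operation is transposed whenever the effective query is false. */
bool should_transpose_sketch( const cntx_sketch_t* cntx, storage_sketch_t c )
{
    return !ukr_eff_prefers_storage_of_sketch( cntx, c );
}
```

With anti_pref set, a column-stored C paired with a column-preferring microkernel is transposed, producing the disagreement that the panel-block macro-kernel later undoes with its own induced transposition.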
commit 41595e98eedaf3f1f93802c14dcae490402f933f Merge: d625c49e a6ab91bc Author: praveeng Date: Wed Dec 7 15:13:21 2016 +0530 Merge master code as on 2016_12_07 to amd-staging Change-Id: I5d9ecef9bff960aeb9b51ca4e4b21714e789e44f commit d625c49e20bd3c50d6d44e330e34076cced114a3 Author: sthangar Date: Tue Nov 29 15:05:19 2016 +0530 checked-in SGEMMTRSM microkernel for Zen Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f commit a6ab91bc61432490fadf18d596de4589645f37dd Merge: 145a551d 7f31a630 Author: Field G. Van Zee Date: Wed Nov 30 09:26:58 2016 -0600 Merge pull request #111 from figual/master Fixed missing cntx argument in ARMv8 microkernels. commit 7f31a6307b7bd35f913c895947552c3a176f789b Author: Francisco Igual Date: Sun Nov 27 14:40:47 2016 +0100 Fixed missing cntx argument in ARMv8 microkernels. commit 126482a3b609b9ad7026ba348f6c4bf6a29be8a1 Author: Field G. Van Zee Date: Fri Nov 25 18:29:49 2016 -0600 Implemented the 1m method. Details: - Implemented the 1m method for inducing complex domain matrix multiplication. 1m support has been added to all level-3 operations, including trsm, and is now the default induced method when native complex domain gemm microkernels are omitted from the configuration. - Updated _cntx_init() operations to take a datatype parameter. This was needed for the corresponding function for 1m (because 1m requires us to choose between column-oriented or row-oriented execution, which requires us to query the context for the storage preference of the gemm microkernel, which requires knowing the datatype) but I decided that it made sense for consistency to add the parameter to all other cntx initialization functions as well, even though those functions don't use the parameter. - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take a second scalar for each blocksize entry. The semantic meaning of the two scalars now is that the first will scale the default blocksize while the second will scale the maximum blocksize. 
This allows scaling the two independently, and was needed to support 1m, which requires scaling for a register blocksize but not the register storage blocksize (i.e., "packdim") analogue. - Deprecated bli_blksz_reduce_dt_to() and defined two new functions, bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing default and maximum blocksizes to some desired blocksize multiple. These functions are needed in the updated definitions of bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs(). - Added support for the 1e and 1r packing schemas to packm, including 1e/1r packing kernels. - Added a minor optimization to bli_gemm_ker_var2() that allows, under certain circumstances (specifically, real domain beta and row- or column-stored matrix C), the real domain macrokernel and microkernel to be called directly, rather than using the virtual microkernel via the complex domain macrokernel, which carries a slight additional amount of overhead. - Added 1m support to the testsuite. - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified some code in test_gemm.c driver. commit d8f13beeea90338e0ecb0a3aeaa2d59d8ebd6c36 Merge: c25a9205 145a551d Author: praveeng Date: Fri Nov 25 17:31:08 2016 +0530 Merge master code till 2016_11_25 to amd-staging commit c25a9205fd8c8d8de7fd81b1e5621e7ac79f4e87 Merge: 65298762 bdc0a264 Author: praveeng Date: Fri Nov 25 17:06:36 2016 +0530 Merge master code till Switched to simpler trsm_r 2016_11_25 to amd-staging Change-Id: Ibf71d224d8fb6cf0bc497f84d50c27d276512cc1 commit 145a551d524ae5492667a05fc248923d922df850 Author: Field G. Van Zee Date: Wed Nov 23 17:59:06 2016 -0600 Switched to simpler trsm_r implementation. Details: - Disabled the implementation of trsm_r that allows the right-hand matrix B to be triangular, and switched to the implementation that simply transposes the operation (and thus the storage of C) in order to recast the operation as trsm_l.
This avoids the need to use trsm_rl and trsm_ru macrokernels, which require an awkward swapping of MR and NR. For now, the support for trsm_r macrokernels, via separate control trees, remains. - Modified bli_config_macro_defs.h so that BLIS_RELAX_MCNR_NCMR_CONSTRAINTS is defined by default. This is mostly a safety precaution in case someone tries to switch back to the previous trsm_r implementation, but also serves as a convenience on some systems where one does not naturally choose blocksizes in a way that satisfies MC % NR = 0 and NC % MR = 0. commit b3e58ee30307cf1e11529f2113acb9abbeda25af Author: Field G. Van Zee Date: Wed Nov 23 17:58:26 2016 -0600 Reimplemented 4x12 haswell ukernels (real only). Details: - Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which defines 4x24 single real and 4x12 double real gemm microkernels, with broadcast-based implementations. (The previous microkernel file has been moved to an 'old' subdirectory.) commit 65298762ff15c45e8588e0c279a9feaa98c927a0 Author: sthangar Date: Tue Nov 22 12:15:33 2016 +0530 removed a redundant copy operation in DNRM2 Change-Id: I673b08efde4480e871779716f7715566740ad9ce commit d6863e851adeef037e4d1476fe63bb293fb9d987 Author: sthangar Date: Mon Nov 21 11:30:30 2016 +0530 checked-in DNRM2 optimizations Change-Id: I3b31d768bd7f4fbf43042aa5a0762995c73c4522 commit bdc0a264d2fb5940bfd09298b1de823674a39053 Author: Field G. Van Zee Date: Wed Nov 16 14:13:08 2016 -0600 Adjusted stride selection of ct in macrokernels. Details: - Updated the changes introduced in 618f433 so that the strides of the temporary microtile ct used in the macrokernels is determined based on the storage preference of the microkernel (via the new functions below), rather than the strides of c. In almost all cases, presently, this change results in no net effect, as a high-level optimization in the _front() functions aligns the storage of c to that of the microkernel's preference. 
However, I encountered some cases where this is not always the case in some development code that has yet to be committed, and therefore I'm generalizing the framework code in advance. - Defined two new functions in bli_cntx.c: bli_cntx_l3_ukr_prefers_rows_dt() bli_cntx_l3_ukr_prefers_cols_dt() which return bool_t's based on the current micro-kernel's storage preferences. For induced methods, the preference of the underlying real domain microkernel is returned. - Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of the above functions, rather than querying the preferences of the native microkernel directly (which did the wrong thing for induced methods). commit 031978d2647cf08316858baf29c84ebba9c3133e Author: Field G. Van Zee Date: Wed Nov 16 14:04:33 2016 -0600 Fixed inactive trsm_r blocksize constraint code. Details: - Changed a cpp macro that was meant to prevent using certain trsm_r code if BLIS_RELAX_MCNR_NCMR_CONSTRAINTS was defined. It was actually coded incorrectly at first. I've now fixed its location and changed its consequence to a compile-time #error message. commit 9772218cae57d55c252595b01e3669d8bed84944 Author: sthangar Date: Wed Nov 16 15:19:19 2016 +0530 Added optimized DAMAX routines for Zen Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8 commit 9c448e30174e5eb76a94b43b30819704a5dfcb3f Merge: 998d8240 e35d3c23 Author: Santanu Thangaraj Date: Wed Nov 16 04:18:57 2016 -0500 Merge "Added new optimized micro-kernel for dotxv routine" into amd-staging commit 998d824044adac0d54c921dcd44fb58f3d54aad2 Merge: 0d13e9a4 6b5a4032 Author: praveeng Date: Wed Nov 16 14:22:42 2016 +0530 Merge master code till devinamatthews/omp_num_thrds 2016_11_16 to amd-staging Change-Id: I601ff1d3ec8a680e1be039ffc7b299744e8a27c5 commit 6b5a4032d2e3ed29a272c7f738b7e3ed6657e556 Merge: 3b524a08 a8220e3a Author: Field G. 
Van Zee Date: Thu Nov 10 15:28:24 2016 -0600 Merge pull request #109 from devinamatthews/omp_num_threads Add automatic loop thread assignment. commit a8220e3a86433b5d76789e32ea7ca014a11b6d17 Author: Devin Matthews Date: Thu Nov 10 14:19:34 2016 -0600 - Fix typo in bli_cntx.c - Bump BLIS_DEFAULT_NR_THREAD_MAX to 4 commit e35d3c23f28784e50ee13d2e77a69d60e0c24c1f Author: Kiran Varaganti Date: Thu Nov 10 14:30:53 2016 +0530 Added new optimized micro-kernel for dotxv routine Change-Id: I2c544e9b25a454d971ad690353502a55cd668391 commit 0d13e9a4f6f2fcda08f205215240cdf86442d6c6 Merge: e044fa62 3b524a08 Author: praveeng Date: Mon Nov 7 14:40:41 2016 +0530 bli_kernel.h Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091 commit c05b3862f6241486442b313eff0c8bee7b5e1274 Author: Devin Matthews Date: Fri Nov 4 15:48:02 2016 -0500 Add automatic loop thread assignment. - Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before. - Threads are assigned to loops (ic, jc, ir, and jc) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h. - All level-3 BLAS covered. commit 3b524a08e3fb8380e7b8b2ba835312c51a331570 Author: Field G. Van Zee Date: Wed Nov 2 17:45:18 2016 -0500 Consolidated 3m1/4m1 gemmtrsm, trsm ukernel code. Details: - Consolidated the macros that define the lower and upper versions of the gemmtrsm microkernels into a single macro that is instantiated twice. Did this for both 3m1 and 4m1 microkernels. - Consolidated lower and upper versions of the trsm microkernels for 3m1 and 4m1 into single files (each). commit ead231aca635deb3db270f118454e4222c627f31 Merge: d25e6f8b 62987f60 Author: Field G. 
Van Zee Date: Wed Nov 2 13:03:50 2016 -0500 Merge pull request #108 from devinamatthews/patch-2 Update .travis.yml with additional tests commit 62987f60a6a6ff0a75b31d0404f493593ce35ccc Author: Devin Matthews Date: Wed Nov 2 11:20:37 2016 -0500 Allow KNL to fail commit 8f9010542c751ae3cbfe6121cb011d8985c1e00d Author: Devin Matthews Date: Wed Nov 2 11:18:32 2016 -0500 Fix some problems with OSX builds: - Update CPU detection for Intel archs (esp. Skylake) - Allow clang for the reference config commit d25e6f8b63c57f30b8a67dffbf4995977cf9f235 Author: Field G. Van Zee Date: Tue Nov 1 14:35:15 2016 -0500 Can disable trsm_r-specific blocksize constraints. Details: - Added cpp guards around the constraints in bli_kernel_macro_defs.h that enforce MC % NR = 0 and NC % MR = 0. These constraints are ONLY needed when handling right-side trsm by allowing the matrix on the right (matrix B) to be triangular, because it involves swapping register, but not cache, blocksizes (packing A by NR and B by MR) and then swapping the operands to gemmtrsm just before that kernel is called. It may be useful to disable these constraints if, for example, the developer wishes to test the configuration with a different set of cache blocksizes where only MC % MR = 0 and NC % NR = 0 are enforced. - In summary, #defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS will bypass the enforcement of MC % NR = 0 and NC % MR = 0. commit 1a67e3688edb073a9d44c160e7b0798e08796b8a Author: Devin Matthews Date: Tue Nov 1 13:53:18 2016 -0500 Bogus commit Need to trigger another Travis build. commit 2cd82d67b372cad1bed50cfd99e524f1f40b4e24 Author: Devin Matthews Date: Tue Nov 1 13:25:50 2016 -0500 Some fixes for .travis.yml - Switch to gcc-5 to support knl - Don't run tests in parallel -- it is super slow. - Use clang on OSX since gcc is only a zombie husk. 
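The blocksize constraints discussed in commit d25e6f8 above can be pictured with a small runtime check. This is only a sketch: the commit describes compile-time cpp guards in bli_kernel_macro_defs.h, whereas the type sketch_blksz_t, the functions below, and the `relax` flag (standing in for #defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS) are hypothetical names invented for illustration and are not part of BLIS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for one datatype's blocksizes; not a BLIS type. */
typedef struct
{
    int mc, nc; /* cache blocksizes    */
    int mr, nr; /* register blocksizes */
} sketch_blksz_t;

/* Always enforced: MC % MR = 0 and NC % NR = 0. */
static bool sketch_basic_ok( const sketch_blksz_t* b )
{
    return b->mc % b->mr == 0 && b->nc % b->nr == 0;
}

/* Needed only by the right-side trsm scheme that packs A by NR and B by MR:
   MC % NR = 0 and NC % MR = 0. */
static bool sketch_trsm_r_ok( const sketch_blksz_t* b )
{
    return b->mc % b->nr == 0 && b->nc % b->mr == 0;
}

/* A nonzero 'relax' plays the role of #defining
   BLIS_RELAX_MCNR_NCMR_CONSTRAINTS (an assumption of this sketch). */
bool sketch_blksz_valid( const sketch_blksz_t* b, int relax )
{
    assert( b != NULL );

    if ( !sketch_basic_ok( b ) ) return false;
    if ( relax ) return true;

    return sketch_trsm_r_ok( b );
}
```

With MR = 6 and NR = 8, for example, MC = 78 satisfies MC % MR = 0 but not MC % NR = 0, so such a configuration passes only when the constraints are relaxed.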
commit a3db4e6bdfe745083acf704ab0f51f74ea869538 Author: Devin Matthews Date: Tue Nov 1 10:33:18 2016 -0500 Update .travis.yml with additional tests - Test knl configuration (without running of course). - Test openmp and pthreads threading for auto configuration with 4 threads. - Test auto configuration with and without pthreads on OSX. - Also, run make in parallel. I don't know how the `addons:` section works on OSX; hopefully it is just ignored. commit 8a11a2174a1a5b9426f13bbc5338dc86ab138cdd Author: Field G. Van Zee Date: Mon Oct 31 19:07:55 2016 -0500 Updates to non-default haswell microkernels. Details: - Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment constraints. - Added missing c and z microkernels, which are based on the corresponding kernels in the d6x8 set. - This completes the d8x6 set (which may be used for situations when it is desirable to have a microkernel with a column preference). commit 618f4331eba209803ecab99747872eceb1b5f091 Author: Field G. Van Zee Date: Mon Oct 31 14:40:51 2016 -0500 Align strides of ct in macrokernels to that of c. Details: - Previously, rs_ct and cs_ct, the strides of the temporary microtile used primarily in the macrokernels' edge case handling, were unconditionally set to 1 and MR, respectively. However, Devin Matthews noted that this ought to be changed so that the strides of ct were in agreement with the strides of C. (That is, if C was row-stored, then ct should be accessed as by rows as well.) The implicit assumption is that the strides of C have already been adjusted, via induced transposition, if the storage preference of the microkernel is at odds with the storage of C. So, if the microkernel prefers row storage, the macrokernel's interior cases would present row-stored (ideal) microkernel subproblems to the microkernel, but for edge cases, it would still see column-stored subproblems (not ideal). This commit fixes this issue. Thanks to Devin for his suggestion. 
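The stride change described in commit 618f433 above can be sketched as follows. This is an illustration under stated assumptions, not the macrokernel code itself: sketch_strides_t and both functions are hypothetical, and the test used here for "C is row-stored" (unit column stride, non-unit row stride) is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair of strides for the temporary microtile ct. */
typedef struct
{
    ptrdiff_t rs; /* row stride    */
    ptrdiff_t cs; /* column stride */
} sketch_strides_t;

/* Choose strides for an MR x NR temporary microtile so that it is stored
   the same way as C. Assumption: C counts as row-stored when its column
   stride is 1 and its row stride is not. */
sketch_strides_t sketch_ct_strides( ptrdiff_t rs_c, ptrdiff_t cs_c,
                                    ptrdiff_t mr,   ptrdiff_t nr )
{
    sketch_strides_t ct;

    assert( mr > 0 && nr > 0 );

    if ( cs_c == 1 && rs_c != 1 )
    {
        /* Row storage: consecutive elements of a row are contiguous. */
        ct.rs = nr;
        ct.cs = 1;
    }
    else
    {
        /* Column storage: the previous, unconditional choice (1 and MR). */
        ct.rs = 1;
        ct.cs = mr;
    }

    return ct;
}

/* Linear offset of element (i,j) within ct. */
ptrdiff_t sketch_ct_offset( sketch_strides_t ct, ptrdiff_t i, ptrdiff_t j )
{
    return i * ct.rs + j * ct.cs;
}
```

The point of matching the strides is that an edge-case microtile then presents the microkernel with the same storage scheme it sees for interior microtiles.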
commit c2c91e09b4893cb81314774557f728a95080f81e Author: Jeff Hammond Date: Tue Oct 25 21:15:26 2016 -0700 never use libm with Intel compilers Intel compilers include a highly optimized math library (libimf) that should be used instead of GNU libm. yes, this change is for ALL targets, including those that are not supported by the Intel compiler. there is no harm in doing this, and it is future-proof in the event that the Intel compilers support other architectures. commit 630391002325a589063aec2ab0a7d89ef2e178c0 Merge: 956b3edf 216206c1 Author: Field G. Van Zee Date: Tue Oct 25 19:34:51 2016 -0500 Merge pull request #105 from devinamatthews/knl Support for Intel Knight's Landing. commit 216206c1d328a865c2192e35a4df6e9aff79a85b Author: Devin Matthews Date: Tue Oct 25 13:56:18 2016 -0500 Fix up for merge to master. commit 11eb7957abbcdf02d5e312898e094260eadb1209 Merge: cd5b6681 956b3edf Author: Devin Matthews Date: Tue Oct 25 13:51:07 2016 -0500 Merge branch 'master' into knl # Conflicts: # frame/thread/bli_thread.h commit cd5b6681838899283cd94e5427dfda206e7fbabe Author: Devin Matthews Date: Tue Oct 25 13:49:27 2016 -0500 Don't use %rbp in KNL packing kernels. commit 956b3edf8eb09480f31f2e861c1b10f9ecbb2e52 Merge: b7e41d71 0662a3c1 Author: Field G. Van Zee Date: Tue Oct 25 13:02:57 2016 -0500 Merge pull request #104 from devinamatthews/misspellings Add flexible options for thread model (pthread/posix for pthreads etc.). commit 0662a3c1b1f4644a86bf8e5073d1391808c91b4a Author: Devin Matthews Date: Tue Oct 25 12:42:44 2016 -0500 Add flexible options for thread model (pthread/posix for pthreads etc.). 
commit e044fa624008c161de32a39d734cddf1dd22dd41 Author: Kiran Varaganti Date: Tue Oct 25 13:03:05 2016 +0530 Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a commit b3ed4933aa0da72ad771fb0fdf1727e5ba9ad7b4 Author: Kiran Varaganti Date: Tue Oct 25 13:03:05 2016 +0530 Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a commit b7e41d71b07d2af6d22d632c70e0c5f7ce46852c Merge: 4bd905bd 5117d444 Author: Field G. Van Zee Date: Mon Oct 24 16:47:46 2016 -0500 Merge pull request #103 from devinamatthews/patch-1 Change .align to .p2align in Bulldozer ukernels. commit 5117d444f7f3a2bc327f067926eaf2398212edda Author: Devin Matthews Date: Mon Oct 24 16:20:47 2016 -0500 Change .align to .p2align in Bulldozer ukernels Apparently OSX doesn't allow .align directives for >16B, so I've changed these to their .p2align counterparts. commit 4bd905bd4597e0ad7bedf31e25e779d3e2dfda29 Merge: 936d5fdc 7f32dd57 Author: Field G. Van Zee Date: Fri Oct 21 14:48:44 2016 -0500 Merge pull request #93 from ShadenSmith/config_check Adds sanity check to configuration choice. commit 936d5fdc26c6c4dab199a8d11fde948975cfa1d6 Author: Field G. Van Zee Date: Fri Oct 21 14:34:27 2016 -0500 Fixed multithreading compilation bug in 970745a. Details: - Moved the definition of the cpp macro BLIS_ENABLE_MULTITHREADING from bli_thread.h to bli_config_macro_defs.h. Also moved the sanity check that OpenMP and POSIX threads are not both enabled. - Thanks to Krzysztof Drewniak for reporting this bug. commit d250e6a3af3af8beedcda28f508ac03e94efb3c8 Author: Kiran Varaganti Date: Thu Oct 20 14:34:39 2016 +0530 Merged TRSM and scalv routines into zen folder Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9 commit 8feb0f85a674e84bec2417486e3bcea584b14c04 Author: Field G. 
Van Zee Date: Wed Oct 19 16:05:41 2016 -0500 Removed auto-prototyping of malloc()/free() substitutes. Details: - Removed the header file, bli_malloc_prototypes.h, which automatically generated prototypes for the functions specified by the following cpp macros: BLIS_MALLOC_INTL BLIS_FREE_INTL BLIS_MALLOC_POOL BLIS_FREE_POOL BLIS_MALLOC_USER BLIS_FREE_USER These prototypes were originally provided primarily as a convenience to those developers who specified their own malloc()/free() substitutes for one or more of the following. However, we generated these prototypes regardless, even when the default values (malloc and free) of the macros above were used. A problem arose under certain circumstances (e.g., gcc in C++ mode on Linux with glibc) when including blis.h that stemmed from the "throw" specification which was added to the glibc's malloc() prototype, resulting in a prototype mismatch. Therefore, going forward, developers who specify their own custom malloc()/free() substitutes must also prototype those substitutes via bli_kernel.h. Thanks to Krzysztof Drewniak for reporting this bug, and Devin Matthews for researching the nature and potential solutions. commit 970745a5fc7c29de3e202988e5eb104fabca4fdc Author: Field G. Van Zee Date: Wed Oct 19 15:58:03 2016 -0500 Reorganized typedefs to avoid compiler warnings. Details: - Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h. - Moved #include of bli_malloc.h from blis.h to bli_type_defs.h. - Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h. - Moved #include of bli_mutex.h from bli_thread.h to bli_typedefs.h. - The redundant typedefs of membrk_t and mtx_t caused a warning on some C compilers. Thanks to Tyler Smith for reporting this issue. 
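As a rough illustration of what commit 8feb0f8 above asks of developers, the sketch below prototypes a pair of custom malloc()/free() substitutes by hand before pointing BLIS_MALLOC_USER and BLIS_FREE_USER at them. The function names my_malloc, my_free, and my_live_block_count, and the block counter, are hypothetical. In an actual configuration the prototypes and macro definitions would be supplied via bli_kernel.h, as the commit message states; here everything sits in one file so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Prototypes for the hypothetical user-supplied substitutes. Since BLIS no
   longer generates these automatically, the developer declares them. */
void* my_malloc( size_t size );
void  my_free( void* p );

/* Point the macros named in the commit at the substitutes. */
#define BLIS_MALLOC_USER my_malloc
#define BLIS_FREE_USER   my_free

/* Number of blocks currently allocated through my_malloc(). */
static size_t my_live_blocks = 0;

void* my_malloc( size_t size )
{
    void* p = malloc( size );

    if ( p != NULL ) ++my_live_blocks;

    return p;
}

void my_free( void* p )
{
    if ( p == NULL ) return;

    assert( my_live_blocks > 0 );
    --my_live_blocks;

    free( p );
}

size_t my_live_block_count( void )
{
    return my_live_blocks;
}
```

When the default values (malloc and free) are kept, no such prototypes are needed, since the standard header already declares them.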
commit 1c2f7b57d557c05f5ef6148cccafaf0f70d910da Author: sthangar Date: Tue Oct 18 15:06:35 2016 +0530 Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048 commit d864ea9f4f039fe2b2dc395d0015bd9e8902bc8e Merge: 7045fcbf 28b2af8a Author: praveeng Date: Fri Oct 14 17:00:57 2016 +0530 Merge master code 2016_10_14 till Added disabled code thrinfo_t structures Change-Id: If7db98d286c1471fcd30f00757abee9b253ef987 commit 28b2af8a71133ce68774e153b6e05afb05affba8 Author: Field G. Van Zee Date: Thu Oct 13 14:50:08 2016 -0500 Added disabled code to print thrinfo_t structures. Details: - Added cpp-guarded code to bli_thrcomm_openmp.c that allows a curious developer to print the contents of the thrinfo_t structures of each thread, for verification purposes or just to study the way thread information and communicators are used in BLIS. - Enabled some previously-disabled code in bli_l3_thrinfo.c for freeing an array of thrinfo_t* values that is used in the new, cpp-guarded code mentioned above. - Removed some old commented lines from bli_gemm_front.c. commit 11eed3f683d09e65f721567b346b0f733bff9a64 Author: Field G. Van Zee Date: Thu Oct 13 14:23:23 2016 -0500 Fixed a configure -t omp/openmp bug from fd04869. Details: - Forgot to update certain occurrences of "omp" in common.mk during commit fd04869, which changed the preferred configure option string for enabling OpenMP from "omp" to "openmp".
commit 7045fcbf0bd349ebe6cb9ac4508c6a387bb05966 Merge: 7e044900 9cda6057 Author: praveeng Date: Thu Oct 13 12:02:28 2016 +0530 Merge master code 2016_10_13 Removed previously renamed/old files Change-Id: I8106d371afaa0af474a8967388d44481b05de923 commit 7e04490002206d3557fcfb7dd893838a7f36916f Author: sthangar Date: Wed Oct 12 16:43:02 2016 +0530 Checked in the SAMAX optimizations Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd commit 9cda6057eaa16a24ac8785a9fa167df6c9edba44 Author: Field G. Van Zee Date: Tue Oct 11 13:21:26 2016 -0500 Removed previously renamed/old files. Details: - Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h, both of which were renamed/removed in 701b9aa. For some reason, these files survived when the compose branch was merged back into master. (Clearly, git's merging algorithm is not perfect.) - Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed memory allocator that I was keeping around for no particular reason). commit 22377abd84b9e560ffe1c4e4d284eb443ddb7133 Author: Field G. Van Zee Date: Mon Oct 10 13:43:56 2016 -0500 Fixed bli_gemm() segfault on empty C matrices. Details: - Fixed a bug that would manifest in the form of a segmentation fault in bli_cntl_free() when calling any level-3 operation on an empty output matrix (ie: m = n = 0). Specifically, the code previously assumed that the entire control tree was built prior to it being freed. However, if the level-3 operation performs an early exit, the control tree will be incomplete, and this scenario is now handled. Thanks to Elmar Peise for reporting this bug. commit 0b571cd94d9b175331c9453258a6b1389a718ae8 Author: Field G. Van Zee Date: Thu Oct 6 14:48:15 2016 -0500 Fixed segfault in bli_free_align() for NULL ptrs. Details: - Fixed a bug in bli_free_align() caused by failing to handle NULL pointers up-front, which led to performing pointer arithmetic on NULL pointers in order to free the address immediately before the pointer. 
Thanks to Devin Matthews for reporting this bug. commit cd84fb95182514601d72c78ee0e36a394d0284d7 Author: praveeng Date: Thu Oct 6 15:08:21 2016 +0530 syntax erros in configure file Change-Id: Ibe8a6071aad97df550df64c009fec33a9d8f43a1 commit f2e7ea113aa93b74f1d42408d5db2c5a7b00a653 Merge: 133983c3 86969873 Author: praveeng Date: Thu Oct 6 12:35:30 2016 +0530 conflicts merge for bli_kernel.h Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0 commit 133983c36fa01c7acb6d666b3744f77f216314a5 Author: sthangar Date: Thu Oct 6 11:26:22 2016 +0530 code clean up in bli_kernel.h Change-Id: I11d9cdf2af8e8199209eb084f6c3a7c910b83d5d commit 4fb9b4ef2e4cf2626a6e000a41628fb823f16da8 Author: Field G. Van Zee Date: Wed Oct 5 14:41:35 2016 -0500 CHANGELOG update (0.2.1) commit 866b2dde3f41760121115fb25f096d4344e8b4f9 (tag: 0.2.1) Author: Field G. Van Zee Date: Wed Oct 5 14:41:34 2016 -0500 Version file update (0.2.1) commit 87fddeab3c8a5ccb1bbf02e5f89db1464e459ba9 Merge: 86969873 6f71cd34 Author: Field G. Van Zee Date: Wed Oct 5 13:35:01 2016 -0500 Merge branch 'compose' commit 6f71cd344951854e4cff9ea21bbdfe536e72611d (origin/compose) Merge: c0630c40 8d55033c Author: Field G. Van Zee Date: Tue Oct 4 15:53:46 2016 -0500 Merge pull request #94 from flame/distcomm Implemented distributed thrinfo_t management. commit 86969873b5b861966d717d8f9f370af39e3d9de6 Author: Field G. Van Zee Date: Tue Oct 4 14:24:59 2016 -0500 Reclassified amaxv operation as a level-1v kernel. Details: - Moved amaxv from being a utility operation to being a level-1v operation. This includes the establishment of a new amaxv kernel to live beside all of the other level-1v kernels. - Added two new functions to bli_part.c: bli_acquire_mij() bli_acquire_vi() The first acquires a scalar object for the (i,j) element of a matrix, and the second acquires a scalar object for the ith element of a vector. - Added integer support to bli_getsc level-0 operation. 
This involved adding integer support to the bli_*gets level-0 scalar macros. - Added a new test module to test amaxv as a level-1v operation. The test module works by comparing the value identified by bli_amaxv() to the value found by reference-like code local to the test module source file. In other words, it (intentionally) does not guarantee the same index is found; only the same value. This allows for different implementations in the case where a vector contains two or more elements containing exactly the same floating point value (or values, in the case of the complex domain). - Removed the directory frame/include/old/. commit 8d55033c966feed99fcca2a58017c3ab5b1646dc Author: Field G. Van Zee Date: Tue Sep 27 15:20:58 2016 -0500 Implemented distributed thrinfo_t management. Details: - Implemented Ricardo Magana's distributed thread info/communicator management. Rather than fully constructing the thrinfo_t structures, from root to leaf, prior to spawning threads, the threads individually construct their thrinfo_t trees (or, chains), and do so incrementally, as needed, reusing the same structure nodes during subsequent blocked variant iterations. This required moving the initial creation of the thrinfo_t structure (now, the root nodes) from the _front() functions to the bli_l3_thread_decorator(). The incremental "growing" of the tree is performed in the internal back-end (ie: _int()) function, and so is mostly invisible. Also, the incremental growth of the thrinfo_t tree is done as a function of the current and parent control tree nodes (as well as the parent thrinfo_t node), further reinforcing the parallel relationship between the two data structures. - Removed the "inner" communicator from thrinfo_t structure definition, as well as its id. Changed all APIs accordingly. Renamed bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm(). - Defined bli_l3_thrinfo_print_paths(), which prints the information in an array of thrinfo_t* structure pointers.
(Used only as a debugging/verification tool.) - Deprecated the following thrinfo_t creation functions: bli_packm_thrinfo_create() bli_l3_thrinfo_create() because they are no longer used. bli_thrinfo_create() is now called directly when creating thrinfo_t nodes. commit fd04869ae4d4a3b0ebb9052557c296456bce7c0d Author: Field G. Van Zee Date: Tue Sep 27 14:14:11 2016 -0500 Changed configure's 'omp' threading to 'openmp'. Details: - Changed the configure script so that the expected string argument to the -t (or --enable-threading=) option that enables OpenMP multithreading is 'openmp'. The previous expected string, 'omp', is still supported but should be considered deprecated. commit 9424af87209e4e435e2e742430945152690170b0 Merge: efa7341d c0630c40 Author: Field G. Van Zee Date: Tue Sep 27 12:51:08 2016 -0500 Merge branch 'compose' commit 7f32dd57c6bd41c0704341752842277dd6a4c8eb Author: Shaden Smith Date: Sat Sep 17 11:33:57 2016 -0500 Adds sanity check to configuration choice. commit efa7341df0b0115926aa8a6e8a4ebfb24fdbf11e Merge: 121c39d4 e1453f68 Author: Field G. Van Zee Date: Fri Sep 16 11:01:57 2016 -0500 Merge pull request #92 from ShadenSmith/readme_fix Fixes broken URL in README.md commit e1453f68f6afd90ae9a29b7a5faa46aa79bbf741 Author: Shaden Smith Date: Fri Sep 16 09:29:28 2016 -0500 Fixes broken URL in README.md commit b922d7563422e14c49a4677bc6ae088a408861ed Author: Field G. Van Zee Date: Tue Aug 23 13:38:36 2016 -0500 Avoid compiling BLAS/CBLAS files when disabled. Details: - Updated the top-level Makefile, build/config.mk.in template, and configure script so that object files corresponding to source files belonging to the BLAS compatibility layer are not compiled (or archived) when the compatibility layer is disabled. (Same for CBLAS.) Thanks to Devin Matthews for suggesting this optimization. - Slight change to the way configure handles internal variables. 
Instead of converting (overwriting) some, such as enable_blas2blis and enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are now stored in new variables that live alongside the originals (with the suffix "_01"). This is convenient since some values need to be sed-substituted into the config.mk.in template, which requires "yes" or "no", while some need to be written to the bli_config.h.in template, which requires "0" or "1". Updated BLIS4 TOMS citation in README.md. Added complex gemm micro-kernels for haswell. Details: - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based architectures. As with their real domain brethren, these kernels prefer row storage (though this doesn't affect most users due to high-level optimizations in most level-3 operations that induce a transpose to whatever storage preference the kernel may have). Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1 commit 69826110bab2a064ec76457c24843d28f2581281 Merge: 64598ee4 a58dd35e Author: Pradeep Rao Date: Wed Sep 14 03:26:25 2016 -0400 Merge "Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.cfiles modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision" into amd-staging commit c0630c4024b08750043a2942a3e8a037aa6b6259 Author: Field G. Van Zee Date: Mon Sep 12 13:59:02 2016 -0500 Added debugging printf()'s to bli_l3_thrinfo.c. Details: - Added optional printf() statements to print out thread communicator info as the thrinfo_t structure is built in bli_l3_thrinfo.c. - Minor changes to frame/thread/bli_thrinfo.h. commit 7b3bf1ffcd7160ccbf6c2518af6d88f6742e4977 Merge: 35509818 121c39d4 Author: Field G. Van Zee Date: Tue Sep 6 15:47:13 2016 -0500 Merge branch 'master' into compose commit 121c39d455f2db6f7ce6802ba7f73ad5e088c68c Author: Field G. Van Zee Date: Mon Sep 5 13:11:42 2016 -0500 Added complex gemm micro-kernels for haswell.
Details: - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based architectures. As with their real domain brethren, these kernels prefer row storage (though this doesn't affect most users due to high-level optimizations in most level-3 operations that induce a transpose to whatever storage preference the kernel may have). commit 35509818cbea1598b123421f81c42120889a03c3 Author: Field G. Van Zee Date: Wed Aug 31 17:34:15 2016 -0500 Added, moved some thread barriers. Details: - Removed thread barriers from the end of the loop bodies of bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(), and bli_trsm_blk_var2(). - Moved the thread barrier at the end of bli_packm_int() to the end of bli_l3_packm(), and added missing barriers to that function. - Removed the no longer necessary (and now incorrect) ochief guard in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C. - Thanks to Tyler Smith for help with these changes. commit 64598ee4cfb86f64abbd4bcef5a82ba0d5565b67 Author: sthangar Date: Wed Aug 31 12:54:50 2016 +0530 fixed the symlink issue Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955 commit abd61f9fa75d77a96d1491b3e035451ee73238fe Author: Field G. Van Zee Date: Tue Aug 30 12:34:19 2016 -0500 Updated BLIS4 TOMS citation in README.md. commit 8a2373f26ba8fcd5b2d7b2cc72cb8b2e1f841a03 Author: sthangar Date: Mon Aug 29 14:10:45 2016 +0530 Norm 2 optimization Change-Id: Ide9decaccd20bf0ccc32c9abb6556e038dceed2b commit fdc663902347aa252ea88cf09ce24ab748958dff Author: sthangar Date: Mon Aug 29 10:43:38 2016 +0530 Placed 1 and 1f AMD optimized AVX routines under zen folder Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328 commit 701b9aa3ff028decbf90efac0dca5bd64fe26269 Author: Field G. Van Zee Date: Fri Aug 26 19:04:45 2016 -0500 Redesigned control tree infrastructure.
Details: - Altered control tree node struct definitions so that all nodes have the same struct definition, whose primary fields consist of a blocksize id, a variant function pointer, a pointer to an optional parameter struct, and a pointer to a (single) sub-node. This unified control tree type is now named cntl_t. - Changed the way control tree nodes are connected, and what computation they represent, such that, for example, packing operations are now associated with nodes that are "inline" in the tree, rather than off-shoot branches. The original tree for the classic Goto gemm algorithm was expressed (roughly) as: blk_var2 -> blk_var3 -> blk_var1 -> ker_var2 | | -> packb -> packa and now, the same tree would look like: blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2 Specifically, the packb and packa nodes perform their respective packing operations and then recurse (without any loop) to a subproblem. This means there are now two kinds of level-3 control tree nodes: partitioning and non-partitioning. The blocked variants are members of the former, because they iteratively partition off submatrices and perform suboperations on those partitions, while the packing variants belong to the latter group. (This change has the effect of allowing greatly simplified initialization of the nodes, which previously involved setting many unused node fields to NULL.) - Changed the way thrinfo_t tree nodes are arranged to mirror the new connective structure of control trees. That is, packm nodes are no longer off-shoot branches of the main algorithmic nodes, but rather connected "inline". - Simplified control tree creation functions. Partitioning nodes are created concisely with just a few fields needing initialization. By contrast, the packing nodes require additional parameters, which are stored in a packm-specific struct that is tracked via the optional parameters pointer within the control tree struct.
(This parameter struct must always begin with a uint64_t that contains the byte size of the struct. This allows us to use a generic function to recursively copy control trees.) gemm, herk, and trmm control tree creation continues to be consolidated into a single function, with the operation family being used to select among the parameter-agnostic macro-kernel wrappers. A single routine, bli_cntl_free(), is provided to free control trees recursively, whereby the chief thread within a group releases the blocks associated with mem_t entries back to the memory broker from which they were acquired. - Updated internal back-ends, e.g. bli_gemm_int(), to query and call the function pointer stored in the current control tree node (rather than index into a local function pointer array). Before being invoked, these function pointers are first cast to a gemm_voft (for gemm, herk, or trmm families) or trsm_voft (for trsm family) type, which is defined in frame/3/bli_l3_var_oft.h. - Retired herk and trmm internal back-ends, since all execution now flows through gemm or trsm blocked variants. - Merged forwards- and backwards-moving variants by querying the direction from routines as a function of the variant's matrix operands. gemm and herk always move forward, while trmm and trsm move in a direction that is dependent on which operand (a or b) is triangular. - Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(), each of which takes additional arguments and hides complexity in managing the difference between the way ranges are computed for the four families of operations. - Simplified level-3 blocked variants according to the above changes, so that the only steps taken are: 1. Query partitioning direction (forwards or backwards). 2. Prune unreferenced regions, if they exist. 3. Determine the thread partitioning sub-ranges. 4. Determine the partitioning blocksize (passing in the partitioning direction) 5.
Acquire the current iteration's partitions for the matrices affected by the current variant's partitioning dimension (m, k, n). 6. Call the subproblem. - Instantiate control trees once per thread, per operation invocation. (This is a change from the previous regime in which control trees were treated as stateless objects, initialized with the library, and shared as read-only objects between threads.) This once-per-thread allocation is done primarily to allow threads to use the control tree as a place to cache certain data for use in subsequent loop iterations. Presently, the only application of this caching is a mem_t entry for the packing blocks checked out from the memory broker (allocator). If a non-NULL control tree is passed in by the (expert) user, then the tree is copied by each thread. This is done in bli_l3_thread_decorator(), in bli_thrcomm_*.c. - Added a new field to the context, an opid_t, which tracks the "family" of the operation being executed. For example, gemm, hemm, and symm are all part of the gemm family, while herk, syrk, her2k, and syr2k are all part of the herk family. Knowing the operation's family is necessary when conditionally executing the internal (beta) scalar reset on C in blocked variant 3, which is needed for gemm and herk families, but must not be performed for the trmm family (because beta has only been applied to the current row-panel of C after the first rank-kc iteration). - Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind to conform with the new control tree design, and renamed the macro-kernel codes corresponding to 3m2 and 4m1b. - Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h. - Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching optimization in the various level-3 front-ends was not being applied properly when the context indicated that execution would be via an induced method. (Before, we always checked the native micro-kernel corresponding to the datatype being executed, whereas now we check the native micro-kernel corresponding to the datatype's real projection, since that is the micro-kernel that is actually used by induced methods. - Added an option to the testsuite to skip the testing of native level-3 complex implementations. Previously, it was always tested, provided that the c/z datatypes were enabled. However, some configurations use reference micro-kernels for complex datatypes, and testing these implementations can slow down the testsuite considerably. commit a58dd35ed7b5b77a6b272655d2edd7a822b8fa87 Author: Kiran Varaganti Date: Fri Aug 26 14:55:12 2016 +0530 Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.cfiles modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9 commit 73517f522b69de429dd7f3df60a70c068149ab28 Merge: c6f5c215 50293da3 Author: Field G. Van Zee Date: Tue Aug 23 13:46:59 2016 -0500 Merge branch 'master' into compose commit 50293da38d5f2b7be9bbc94b9e85aacb6a10f672 Author: Field G. Van Zee Date: Tue Aug 23 13:38:36 2016 -0500 Avoid compiling BLAS/CBLAS files when disabled. Details: - Updated the top-level Makefile, build/config.mk.in template, and configure script so that object files corresponding to source files belonging to the BLAS compatibility layer are not compiled (or archived) when the compatibility layer is disabled. (Same for CBLAS.) Thanks to Devin Matthews for suggesting this optimization. - Slight change to the way configure handles internal variables. 
Instead of converting (overwriting) some, such as enable_blas2blis and enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are now stored in new variables that live alongside the originals (with the suffix "_01"). This is convenient since some values need to be sed-substituted into the config.mk.in template, which requires "yes" or "no", while some need to be written to the bli_config.h.in template, which requires "0" or "1". commit 22dd6a353ddb56614309c01533b1a94c9fd32bca Merge: cdfb3c3f f20ed388 Author: praveeng Date: Tue Aug 23 15:15:35 2016 +0530 Merge master code as on 2016_08_23 to amd-staging branch by praveeng Changes to be committed: modified: frame/thread/bli_mutex_openmp.h modified: frame/thread/bli_mutex_pthreads.h Change-Id: Ica522edbb1d0173f53f38d5057b1f7aef73666be commit c6f5c215ee793d03ea834469fc2adc53feaffc42 Merge: d52cb767 16a4c7a8 Author: Field G. Van Zee Date: Mon Aug 22 17:33:02 2016 -0500 Merge branch 'master' into compose commit f20ed3885d628992fab88690f629a5a2bab3eb88 Merge: 02ac597e 4bc842ca Author: praveeng Date: Mon Aug 22 15:27:33 2016 +0530 Merge branch 'master' of https://github.com/clMathLibraries/blis-amd for "Fixed bugs in bli_mutex_init() and friends." 
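For readers following the control tree redesign in commit 701b9aa above, the unified node layout and the reason the parameter struct must begin with its own byte size can be sketched as below. This is a simplified illustration, not BLIS's cntl_t: every name prefixed with sketch_ is hypothetical, the field types are guesses, and the release of mem_t blocks to the memory broker that bli_cntl_free() performs is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder type for a variant function pointer. */
typedef void (*sketch_var_fp)( void );

/* One node type for the whole tree: blocksize id, variant function,
   optional parameters, and a single sub-node. */
typedef struct sketch_cntl_s
{
    int                   bszid;
    sketch_var_fp         var_func;
    void*                 params;   /* NULL, or a struct led by its size */
    struct sketch_cntl_s* sub_node; /* NULL at the leaf */
} sketch_cntl_t;

/* Hypothetical packm parameter struct; the size field must come first. */
typedef struct
{
    uint64_t size;        /* byte size of this struct */
    int      pack_schema; /* illustrative payload     */
} sketch_packm_params_t;

/* Create a node. The node takes ownership of params and sub. */
sketch_cntl_t* sketch_cntl_create( int bszid, sketch_var_fp fp,
                                   void* params, sketch_cntl_t* sub )
{
    sketch_cntl_t* n = malloc( sizeof( *n ) );

    if ( n == NULL ) return NULL;

    n->bszid    = bszid;
    n->var_func = fp;
    n->params   = params;
    n->sub_node = sub;

    return n;
}

/* Free a chain of nodes along with their parameter structs. */
void sketch_cntl_free( sketch_cntl_t* n )
{
    while ( n != NULL )
    {
        sketch_cntl_t* next = n->sub_node;

        free( n->params );
        free( n );
        n = next;
    }
}

/* Generic recursive copy. The leading uint64_t lets this function
   duplicate params without knowing the concrete struct type. */
sketch_cntl_t* sketch_cntl_copy( const sketch_cntl_t* src )
{
    sketch_cntl_t* dst;

    if ( src == NULL ) return NULL;

    dst = sketch_cntl_create( src->bszid, src->var_func, NULL, NULL );
    if ( dst == NULL ) return NULL;

    if ( src->params != NULL )
    {
        uint64_t size;

        memcpy( &size, src->params, sizeof( size ) );
        dst->params = malloc( ( size_t )size );
        if ( dst->params == NULL ) { sketch_cntl_free( dst ); return NULL; }
        memcpy( dst->params, src->params, ( size_t )size );
    }

    if ( src->sub_node != NULL )
    {
        dst->sub_node = sketch_cntl_copy( src->sub_node );
        if ( dst->sub_node == NULL ) { sketch_cntl_free( dst ); return NULL; }
    }

    return dst;
}
```

A per-thread copy made this way shares no storage with the original, which is what allows each thread to cache data in its own tree.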
commit 02ac597e4b9be2670d9fff65d28552f8e1ec81b3 Author: praveeng Date: Thu Jul 28 15:11:08 2016 +0530 Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414 Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99 commit 84e41cc73c9c87ce64582acd4264b8e1b5316482 Author: praveeng Date: Thu Jul 28 15:01:36 2016 +0530 Revert commits 8aee306 Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189 commit 30ccfcee82db93d0109d1571242e2db925e95d0a Author: praveeng Date: Mon Jul 25 14:14:00 2016 +0530 removed changes from readme file which are giving confilcts Change-Id: Ic71ad1313e1404fed444e899466043704d875af6 commit aeca25cd63fc8971f8fe7809599c57853f976548 Author: praveeng Date: Tue Jul 5 16:51:23 2016 +0530 first commit Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2 commit 6b2274864b36fd1019d97bcc4ca6dd7a57ef16d9 Author: praveeng Date: Tue Jul 5 15:00:31 2016 +0530 small modification to readme for git push test Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a commit daa7a9ecb25982f2551adbd95e65f8ba97cfe944 Author: praveeng Date: Tue Jul 5 16:51:23 2016 +0530 first commit Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2 commit 5f66a4aa05aeffcb6eb587851d78d9527319466c Author: praveeng Date: Tue Jul 5 15:00:31 2016 +0530 small modification to readme for git push test Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a commit c6cbd78d2388c08824822b91a1c36ac4349bb67f Author: praveeng Date: Thu Jul 28 15:11:08 2016 +0530 Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414 Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99 commit 9219a9060762525f87ebbf556d78fe8621858513 Author: praveeng Date: Thu Jul 28 15:01:36 2016 +0530 Revert commits 8aee306 Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189 commit 728573296efa7cf14d2381570e116509dfe2a240 Author: praveeng Date: Mon Jul 25 14:14:00 2016 +0530 removed changes from readme file which are giving confilcts Change-Id: Ic71ad1313e1404fed444e899466043704d875af6 commit ad7862e291c240505c733a41d231b1a126ade73c Author: 
praveeng Date: Tue Jul 5 16:51:23 2016 +0530 first commit Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2 commit ad4b471a25ce77867295e5529dfc787e7c18b03f Author: praveeng Date: Tue Jul 5 15:00:31 2016 +0530 small modification to readme for git push test Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a commit 55d641363fcd8bdfdabbd7c22822fa2d0b7f3fa6 Author: praveeng Date: Tue Jul 5 16:51:23 2016 +0530 first commit Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2 commit f3b6b15f6d591d323802bd6c81c522a02056506d Author: praveeng Date: Tue Jul 5 15:00:31 2016 +0530 small modification to readme for git push test Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a commit 16a4c7a823d60707ed9272f5d36e5c5d54c0ba4b Author: Field G. Van Zee Date: Fri Aug 19 11:38:36 2016 -0500 Fixed bugs in bli_mutex_init() and friends. Details: - Fixed a couple of bugs that affected OpenMP and POSIX threads configurations that resulted in compiler errors and warnings due to type mismatch, and in the case of pthreads, a missing function argument. The bugs are fairly recent, introduced in a017062. commit c8e4ef93953ba2b79fb7e0973c08469c0e28a2cd Author: Devin Matthews Date: Wed Aug 3 16:13:03 2016 -0500 Add prefetchw to 30x8 kernel. commit 4b5a2f3d6e7ffeb5cc2be8448554f5c2083ad68f Merge: 380736bf 9f52a587 Author: Devin Matthews Date: Wed Aug 3 16:09:51 2016 -0500 Merge remote-tracking branch 'origin/knl' into knl # Conflicts: # kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c commit 380736bfe955efbdd7274c90b6fd635688e83bc4 Author: Devin Matthews Date: Wed Aug 3 16:08:28 2016 -0500 Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug. commit 9f52a587dee855daa73c194e41b6951416544e9a Author: Devin Matthews Date: Wed Aug 3 16:03:53 2016 -0500 Try prefetchw[t1] instead of regular prefetch for C. commit 8945a1512d366bc6a8a85718d12cbf5de6f2898b Author: Devin Matthews Date: Wed Aug 3 11:28:24 2016 -0500 This version gets ~1550 GFLOPs on KNL wuth 16x4. 
commit cdfb3c3f29d321033fca106aa58ab67ead90a95d Merge: 50a2f2ef 4bc842ca Author: praveeng Date: Fri Jul 29 12:45:04 2016 +0530 Merge master code as on 2016_07_29 to amd-staging branch by praveeng Change-Id: Ic78b84d8b8d10158fb2a612f9a64bbc7b1f9b486 commit 4bc842ca3a64e658c0808bfe4c5693a5ace97923 Merge: 117f8838 b0d510bf Author: praveeng Date: Thu Jul 28 17:32:12 2016 +0530 Merge branch 'master' of publicrepo commit 117f8838511a478aa16137e770d27dd21f4227c5 Author: praveeng Date: Thu Jul 28 15:11:08 2016 +0530 Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414 Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99 commit 2fcdc28f1055d385b2e662aa920fb97c472394d7 Author: praveeng Date: Thu Jul 28 15:01:36 2016 +0530 Revert commits 8aee306 Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189 commit 1b5d104afe0628b8b6c0650f1e58cfb08be67004 Author: praveeng Date: Mon Jul 25 14:14:00 2016 +0530 removed changes from readme file which are giving confilcts Change-Id: Ic71ad1313e1404fed444e899466043704d875af6 commit d81273047bff56501e9413a90991d3d1f8b56a06 Author: praveeng Date: Tue Jul 5 16:51:23 2016 +0530 first commit Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2 commit 65905c3011a11cda95761681d4ae84337e46bdb5 Author: praveeng Date: Tue Jul 5 15:00:31 2016 +0530 small modification to readme for git push test Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a commit 23cca231be10fe1797aed451bcbc69d38c78bc0c Author: praveeng Date: Tue Jul 5 16:51:23 2016 +0530 first commit Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2 commit 922e3091702f25e3287b417719a33adbd5bbf138 Author: praveeng Date: Tue Jul 5 15:00:31 2016 +0530 small modification to readme for git push test Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a commit b0d510bf0e4dfd177f9e4ae0069f41921e2ecdc1 Author: praveeng Date: Thu Jul 28 15:11:08 2016 +0530 Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414 Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99 commit 
5ebeece5b4a8df81d59ca7558b278a4263d15128 Author: praveeng Date: Thu Jul 28 15:01:36 2016 +0530 Revert commits 8aee306 Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189 commit 6ce4c022ebdea00c2b951090e3c2e9e88735b9ce Author: Devin Matthews Date: Wed Jul 27 16:26:36 2016 -0500 Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved. commit d52cb7671509592a8078729477b40b60380518a2 Merge: 95abea46 c31b1e7b Author: Field G. Van Zee Date: Wed Jul 27 16:04:55 2016 -0500 Merge branch 'master' into compose commit c31b1e7b9d659b96433a87e5aecb90e457a104cc Author: Field G. Van Zee Date: Wed Jul 27 15:58:07 2016 -0500 Relax alignment restrictions for sandybridge ukrs. Details: - Relaxed the base pointer and leading dimension alignment restrictions in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd instead of vmovaps/vmovapd. These change mimic those made to the haswell microkernels in e0d2fa0 and ee2c139. - Updated testsuite modules as well as standalone test drivers in 'test' directory to use DBL_MAX as the initial time candidate. Thanks to Devin Matthews for suggesting this change. - Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX). - Minor update (vis-a-vis contexts) to driver code in test/3m4m. commit b8f2b55532849d45d379afbdd05a52ff6100800d Author: Devin Matthews Date: Wed Jul 27 15:22:55 2016 -0500 Try an 8x24 kernel for the hell of it. commit 7ede5863ae3567f7c0852efc2d5cd649ca19e0f3 Author: Devin Matthews Date: Wed Jul 27 13:41:27 2016 -0600 Allocate pack buffer on MCDRAM for KNL. commit ad89ed2e829c7b261d8ba0998a3cb83ad576ee04 Merge: 2c9de740 81e2b05f Author: Devin Matthews Date: Wed Jul 27 11:45:40 2016 -0500 Merge branch 'knl' of github.com:devinamatthews/blis into knl commit 2c9de740edb66c4692c200731763bbd1d3171ccb Author: Devin Matthews Date: Wed Jul 27 11:44:54 2016 -0500 This version gets ~26GF on one core. 
commit 81e2b05f31bca4e1e1676e7b533d1868d9f9be33 Author: Devin Matthews Date: Wed Jul 27 11:39:05 2016 -0500 Add optimized packing kernels for KNL. commit a7d8ca97b8d835c32d90ff20a565c82733f014a8 Author: Devin Matthews Date: Mon Jul 25 15:15:13 2016 -0500 All fixed. commit 963d0393b023f4134bb0c682923faf9964c0e645 Author: Devin Matthews Date: Mon Jul 25 14:40:53 2016 -0500 Add 24xk pack kernel. commit 117b76739afba481768897d2580f8365d3345417 Author: Devin Matthews Date: Mon Jul 25 13:53:07 2016 -0500 In the midst of debugging. commit 8c0a4fd1d3535d608a9a309a61ffee0a73c3646f Author: Devin Matthews Date: Mon Jul 25 13:09:24 2016 -0500 Fix some row/column confusion. commit c44f9f96930312125b15e64c326ab5ab5cc02633 Author: Devin Matthews Date: Mon Jul 25 12:02:24 2016 -0500 Simplify displacements -- clang assembler was badly botching EVEX compressed displacements giving false alarms for instruction length. commit e0cce177cc1b47ec9f11ac0556241feaa3564df1 Author: Devin Matthews Date: Mon Jul 25 10:02:25 2016 -0500 Minor fixes for 8x24 KNL kernel. commit 50a2f2efcbeb46537f1deaa8e44dc579a4e49eb8 Merge: 1aa77dfc cfd46c88 Author: praveeng Date: Mon Jul 25 17:01:20 2016 +0530 Merge master code as on 2016_07_25 to amd-staging branch by praveeng Change-Id: I84886ae241db2aac0bef6b7ef399f04aa8bca16d commit cfd46c88d59c8f61d5e7cf768d606e4c44623584 Merge: f493bf4d a017062f Author: praveeng Date: Mon Jul 25 15:38:13 2016 +0530 Merge remote-tracking branch 'publicrepo/master' commit f493bf4d704fe0e967783cd6e6877d3302c056a1 Author: praveeng Date: Mon Jul 25 14:14:00 2016 +0530 removed changes from readme file which are giving confilcts Change-Id: Ic71ad1313e1404fed444e899466043704d875af6 commit 65735bbedf75784c48bd11e05b3fdc98fc66b4bc Author: Devin Matthews Date: Sun Jul 24 21:50:32 2016 -0500 Switch to 24x8 kernel, unrolled by 16. commit 45d5dc97177117220bd9dd0abf85aafc185acad1 Author: Devin Matthews Date: Sun Jul 24 14:25:26 2016 -0500 Add 24x8 "KNC-style" kernel for KNL. 
commit 95abea46f86816fddfc9ff0abfa52880801461be Merge: d0dfe5b5 a017062f Author: Field G. Van Zee Date: Sat Jul 23 15:38:33 2016 -0500 Merge branch 'master' into compose commit a017062fdf763037da9d971a028bb07d47aa1c8a Author: Field G. Van Zee Date: Fri Jul 22 17:02:59 2016 -0500 Integrated "memory broker" (membrk_t) abstraction. Details: - Integrated a patch originally authored and submitted by Ricardo Magana of HP Enterprise. The changeset inserts use of a new object type, membrk_t, (memory broker) that allows multiple sets of memory pools on, for example, separate NUMA nodes, each of which has a separate memory space. - Added membrk field to cntx_t and defined corresponding accessor macros. - Added membrk field to mem_t object and defined corresponding accessor macros. - Created new bli_membrk.c file, which contains the new memory broker API, including: bli_membrk_init(), bli_membrk_finalize() bli_membrk_acquire_[mv](), bli_membrk_release(), bli_membrk_init_pools(), bli_membrk_reinit_pools(), bli_membrk_finalize_pools(), bli_membrk_pool_size() - In bli_mem.c, changed function calls to bli_mem_init_pools() -> bli_membrk_init() bli_mem_reinit_pools() -> bli_membrk_reinit() bli_mem_finalize_pools() -> bli_membrk_finalize() - In bli_packv_init.c, bli_packm_init.c, changed function calls to: bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]() bli_mem_release() -> bli_membrk_release() - Added bli_mutex.c and related files to frame/thread. These files define abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or single-threaded execution. This new API is employed within functions such as bli_membrk_acquire_[mv]() and bli_membrk_release(). commit 8ff2e069c48c12fd06b9c48c6b3aeb4ea9b0e6e1 Author: Devin Matthews Date: Fri Jul 22 16:22:26 2016 -0500 Add 4x unrolled variant for KNL microkernel. commit 9cb2ed9b0c25f31a22c1c9719b062fa665ad7adf Author: Devin Matthews Date: Fri Jul 22 16:10:30 2016 -0500 Git rid of one RBX update. 
commit 451bde076f0320d60cd2475cfb048ac4a2b798bb Author: Devin Matthews Date: Fri Jul 22 15:43:00 2016 -0500 Add some more knobs to twiddle for KNL microkernel. commit 8c6e621c099521e7a4d87e007bb8224faa5f33a3 Author: Devin Matthews Date: Fri Jul 22 15:05:15 2016 -0500 Make knl conform to new kernel dir structure. commit ce7214c6618d6f22f4ce2ee452336236916d1f30 Merge: 119d0399 ce59f811 Author: Devin Matthews Date: Fri Jul 22 14:59:53 2016 -0500 Merge remote-tracking branch 'origin/master' into knl commit ce59f81108ec9aea918a7e77030da8acfdd397ce Merge: ff41153f 707a2b7f Author: Field G. Van Zee Date: Fri Jul 22 14:48:14 2016 -0500 Merge pull request #88 from devinamatthews/32bit-dim_t Handle 32-bit dim_t in 64-bit microkernels. commit 707a2b7faca137cca7cab7b11a12c44ddaf7ad53 Author: Devin Matthews Date: Fri Jul 22 13:49:44 2016 -0500 Somehow forgot the most important microkernel. commit 47ec045056351ac4f0791c071fa0daaa81699c8c Merge: 08f1d6b6 ff41153f Author: Devin Matthews Date: Fri Jul 22 13:45:23 2016 -0500 Merge remote-tracking branch 'upstream/master' into 32bit-dim_t commit 08f1d6b6fa344275de0f675f69737145ccf6646a Author: Devin Matthews Date: Fri Jul 22 13:44:37 2016 -0500 Use 64-bit intermediate variable for k for architectures that do 64-bit loads in case dim_t is 32-bit. commit ff41153f4eb7f38ed94bdd9a3fd81fb979f3f401 Merge: f9214ced e0d2fa0d Author: Field G. Van Zee Date: Fri Jul 22 13:21:03 2016 -0500 Merge pull request #86 from devinamatthews/haswell-vmovups Remove alignment restrictions on C in haswell kernel. commit e0d2fa0d835ab49366aeb790363bb2b571d36ed8 Author: Devin Matthews Date: Fri Jul 22 12:56:51 2016 -0500 Relax alignment restrictions for haswell sgemm. commit f9214ced97392861f5a0ea72abfcf6f41faf674c Merge: 413d62ac 08666eaa Author: Field G. Van Zee Date: Fri Jul 22 12:16:39 2016 -0500 Merge pull request #85 from devinamatthews/qopenmp Change -openmp to -fopenmp for icc. 
commit ee2c139df6ad53c6aec8a67ab23b3b1912e8d259 Author: Devin Matthews Date: Fri Jul 22 12:06:03 2016 -0500 Remove alignment restrictions on C in haswell kernel. commit 08666eaa20d8a31f2f92f944e5bfa7c1558c53e4 Author: Devin Matthews Date: Fri Jul 22 11:07:34 2016 -0500 Change -openmp to -fopenmp for icc. commit 119d0399428905053265f3aca1cc8cc1fde3b363 Author: Devin Matthews Date: Fri Jul 22 10:23:31 2016 -0500 Add 8x24 KNL kernel. commit 1aa77dfc1dc183d16e0b6a1196d9c263f021e83d Merge: 9101a9c8 ec9f5983 Author: praveeng Date: Thu Jul 21 14:22:40 2016 +0530 Merge master code as on 2016_07_21 to amd-staging branch by praveeng Change-Id: Ic7d0a21101358f08147736e7f1884e7409937344 commit b58cda9eba0c1e175460aae109baf792d29ba5bf Merge: 318f063d 413d62ac Author: Devin Matthews Date: Tue Jul 19 14:09:09 2016 -0500 Merge remote-tracking branch 'origin/master' into knl # Conflicts: # frame/base/bli_threading.h # frame/include/blis.h # frame/thread/bli_thread.c commit ec9f59836b32260c29ff1cd24e629c7d8de14992 Merge: 197e182f 763babe4 Author: praveeng Date: Mon Jul 18 12:56:25 2016 +0530 Merge branch 'master' of https://github.com/clMathLibraries/blis-amd commit 197e182fcbf1340fd4a202fac58bea6cfcfa9e2f Author: praveeng Date: Tue Jul 5 16:51:23 2016 +0530 first commit Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2 commit 41fb32711031e7ec86b062aa7f53255d1f5905e2 Author: praveeng Date: Tue Jul 5 15:00:31 2016 +0530 small modification to readme for git push test Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a commit d0dfe5b5372cc7558ee9c4104b29f82eecc7ed61 Merge: 31def12e 413d62ac Author: Field G. Van Zee Date: Thu Jul 14 11:01:06 2016 -0500 Merge branch 'master' into compose commit 9101a9c880e3934f8a63ffc7fe15f5fc1077a73d Author: sthangar Date: Wed Jul 13 16:51:14 2016 +0530 Checked in optimized 1V kernels along with benchmark codes. 
Also incorporated review comments for 1F kernels Change-Id: I035c0d39e6b0bed28e6e2041242186c49f6ed55b commit 763babe488880b42c86c7fc207aa7665bd0ff9f7 Merge: 357c990b 413d62ac Author: praveeng Date: Wed Jul 13 11:57:19 2016 +0530 Merge remote-tracking branch 'publirepo/master' commit 413d62aca28edabba56605a9f87d5b715831e1db Author: Field G. Van Zee Date: Tue Jul 12 15:02:52 2016 -0500 README update (use official ACM TOMS links). commit dfa431f696db2df4065ea454df268a2e0bc02eac Author: Field G. Van Zee Date: Tue Jul 12 14:21:19 2016 -0500 README update (BLIS2 TOMS article now in-print). commit 357c990bdd7bd5667aac5adf1bab3712973e7414 Author: praveeng Date: Tue Jul 5 16:51:23 2016 +0530 first commit Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2 commit 8aee306300adb099b66036f2c2f7f3996433cf49 Author: praveeng Date: Tue Jul 5 15:00:31 2016 +0530 small modification to readme for git push test Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a commit 31def12e2629f187e40f93f6bae9e26a6c2660e2 Author: Field G. Van Zee Date: Thu Jun 30 15:19:20 2016 -0500 First phase of control tree redesign. Details: - These changes constitute the first set of changes in preparation for revamping the structure and use of control trees in BLIS. Modifications in this commit don't affect the control tree code yet, but rather lay the groundwork. - Defined wrappers for the following functions, where the wrappers each take a direction parameter of a new enumerated type, dir_t (BLIS_BWD or BLIS_FWD), and execute the correct underlying function: - bli_acquire_mpart_*() and _vpart_*() - bli_*_determine_kc_[fb]() - bli_thread_get_range_*() and bli_thread_get_range_weighted_*() - Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving) blocked variants for trmm and trsm, and renamed gemm and herk variants accordingly. The direction is now queried via routines such as bli_trmm_direct(), which determines the direction from the implied side and uplo parameters.
For gemm and herk, it is unconditionally BLIS_FWD. - Defined wrappers to parameter-specific macrokernels for herk, trmm, and trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying macrokernel based on the implied parameters. The same logic used to choose the dir_t in _direct() functions is used here. - Simplified the function pointer arrays in _int() functions given the consolidation and dir_t querying mentioned above. - Function signature (whitespace) reformatting for various functions. - Removed old code in various 'old' directories. commit 405c9d46344d93c3eab5572b233900b50ca50d68 Author: sthangar Date: Wed Jun 22 12:18:54 2016 +0530 Check-in the fused kernels optimized for Zen Change-Id: I7b2f467b960e7b9a285f06e47be87de122e5fa24 commit 232754feecf29452987666b9f5ebba2619bfd0b0 Author: Field G. Van Zee Date: Tue Jun 21 14:25:39 2016 -0500 Fixed compiler warning in rand[vm], randn[vm]. Details: - Fixed compiler warnings about unused variables related to the disabling of normalization in the structured cases of the rand[vm] and randn[vm] operations. commit a89555d1605574f3685813dcc972b636dd61264d Author: Field G. Van Zee Date: Fri Jun 17 14:08:35 2016 -0500 Added randn[vm] operations, support in testsuite. Details: - Defined a new randomization operation, randn, on vectors and matrices. The randnv and randnm operations randomize each element of the target object with values drawn from a narrow range. Presently, those values are all integer powers of two, but they do not need to be powers of two in order to achieve the primary goal, which is to initialize objects that can be operated on with plenty of precision "slack" available to allow computations that avoid roundoff. Using this method of randomization makes it much more likely that testsuite residuals of properly-functioning operations are close to zero, if not exactly zero.
- Updated existing randomization operations randv and randm to skip special diagonal handling and normalization for matrices with structure. This is now handled by the testsuite modules by explicitly calling a testsuite function that loads the diagonal (and scales off-diagonal elements). - Added support for randnv and randnm in the testsuite with a new switch in input.general that universally toggles between use of the classic randv/randm, which use real values on the interval [-1,1], and randnv/randnm, which use only values from a narrow range. Currently, the narrow range is: +/-{2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6}, as well as 0.0. - Updated testsuite modules so that a testsuite wrapper function is called instead of directly calling the randomization operations (such as bli_randv() and bli_randm()). This wrapper also takes a bool_t that indicates whether the object's elements should be normalized. (NOTE: As alluded to above, in the test modules of triangular solve operations such as trsv and trsm, we perform the extra step of loading the diagonal.) - Defined a new level-0 operation, invertsc, which inverts a scalar. - Updated the abval2ris and sqrt2ris level-0 macros to avoid an unlikely but possible divide-by-zero. - Updated function signature and prototype formatting in testsuite. commit 318f063dcbd8b594969e401bc99146d24b01066a Author: Devin Matthews Date: Wed Jun 8 17:46:50 2016 -0500 Add new KNL microkernel derived from Haswell. commit 096895c5d538a7f8817603d7cf28c52e99340def Author: Field G. Van Zee Date: Mon Jun 6 13:32:04 2016 -0500 Reorganized code, APIs related to multithreading. Details: - Reorganized code and renamed files defining APIs related to multithreading. All code that is not specific to a particular operation is now located in a new directory: frame/thread. Code is now organized, roughly, by the namespace to which it belongs (see below). - Consolidated all operation-specific *_thrinfo_t object types into a single thrinfo_t object type.
Operation-specific level-3 *_thrinfo_t APIs were also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*() functions (aside from a few general purpose bli_thrinfo_*() functions). - Renamed thread_comm_t object type to thrcomm_t. - Renamed many of the routines and functions (and macros) for multithreading. We now have the following API namespaces: - bli_thrinfo_*(): functions related to thrinfo_t objects - bli_thrcomm_*(): functions related to thrcomm_t objects. - bli_thread_*(): general-purpose functions, such as initialization, finalization, and computing ranges. (For now, some macros, such as bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the bli_thread_ namespace prefix, even though bli_thrinfo_ may be more appropriate.) - Renamed thread-related macros so that they use a bli_ prefix. - Renamed control tree-related macros so that they use a bli_ prefix (to be consistent with the thread-related macros that were also renamed). - Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This #undef was a temporary fix to some macro defaults which were being applied in the wrong order, which was recently fixed. commit 232530e88ff99f37abcae5b6fb5319a9a375a45f Merge: 4bcabd1b eef37f8b Author: Tyler Michael Smith Date: Wed Jun 1 15:14:10 2016 -0500 Merge commit 'refs/pull/81/head' of https://github.com/flame/blis Conflicts: frame/base/bli_threading_pthreads.c frame/base/bli_threading_pthreads.h commit 4bcabd1bf60688c38cf562459fc5e8be8b831756 Author: Tyler Michael Smith Date: Wed Jun 1 13:27:28 2016 -0500 Use spin locks instead of pthread barriers commit eef37f8b4d81845a6ba4bf25586d32b50c3e8a68 Author: Jeff Hammond Date: Sun May 29 22:28:13 2016 -0700 use GCC intrinsic instead of pthread_mutex for atomic increment and fetch commit 9dcd6f05c4c3ff2ce7cd87a9951a96ebef22681e Author: Field G. Van Zee Date: Tue May 24 13:15:32 2016 -0500 Implemented developer-configurable malloc()/free(). 
Details: - Replaced all instances of bli_malloc() and bli_free() with one of: - bli_malloc_pool()/bli_free_pool() - bli_malloc_user()/bli_free_user() - bli_malloc_intl()/bli_free_intl() each of which can be configured to call malloc()/free() substitutes, so long as the substitute functions have the same function type signatures as malloc() and free() defined by C's stdlib.h. The _pool() function is called when allocating blocks for the memory pools (used for packing buffers, primarily), the _user() function is called when obj_t's are created (via bli_obj_create() and friends), and the _intl() function is called for internal use by BLIS, such as when creating control tree nodes or temporary buffers for manipulating internal data structures. Substitutes for any of the three types of bli_malloc() may be specified by #defining the following pairs of cpp macros in bli_kernel.h: - BLIS_MALLOC_POOL/BLIS_FREE_POOL - BLIS_MALLOC_USER/BLIS_FREE_USER - BLIS_MALLOC_INTL/BLIS_FREE_INTL to be the name of the substitute functions. (Obviously, the object code that contains these functions must be provided at link-time.) These macros default to malloc() and free(). Substitute functions are also automatically prototyped by BLIS (in bli_malloc_prototypes.h). - Removed definitions for bli_malloc() and bli_free(). - Note that bli_malloc_pool() and bli_malloc_user() are now defined in terms of a new function, bli_malloc_align(), which aligns memory to an arbitrary (power of two) alignment boundary, but does so manually, whereas before alignment was performed behind the scenes by posix_memalign(). Currently, bli_malloc_intl() is defined in terms of bli_malloc_noalign(), which serves as a simple wrapper to the designated function that is passed in (e.g. BLIS_MALLOC_INTL). Similarly, there are bli_free_align() and bli_free_noalign(), which are used in concert with their bli_malloc_*() counterparts.
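Note: the manual alignment described in the entry above can be sketched as follows. This is a minimal illustration, not the code from the commit: the names sketch_malloc_align() and sketch_free_align() are hypothetical, and the approach shown (over-allocate, round the address up to the boundary, and stash the original pointer just below the aligned address) is one common way to align memory on top of any malloc()-compatible function. It assumes the substitute allocator returns memory at least as aligned as malloc() does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch; not the bli_malloc_align() implementation itself.
   malloc_fp may be any function with the same signature as malloc(). */
void* sketch_malloc_align( void* (*malloc_fp)( size_t ), size_t size, size_t align )
{
    /* The alignment must be a nonzero power of two. */
    assert( align != 0 && ( align & ( align - 1 ) ) == 0 );

    /* Extra room: alignment slack plus one pointer to remember the
       address that malloc_fp actually returned. */
    size_t extra = align + sizeof( void* );
    if ( size > SIZE_MAX - extra ) return NULL;

    void* raw = malloc_fp( size + extra );
    if ( raw == NULL ) return NULL;

    /* Skip past the stashed pointer, then round up to the boundary. */
    uintptr_t start   = ( uintptr_t )raw + sizeof( void* );
    uintptr_t aligned = ( start + align - 1 ) & ~( uintptr_t )( align - 1 );

    /* Store the original address immediately below the aligned block. */
    ( ( void** )aligned )[ -1 ] = raw;

    return ( void* )aligned;
}

/* Counterpart to sketch_malloc_align(): recover the original address and
   hand it to a free()-compatible function. */
void sketch_free_align( void (*free_fp)( void* ), void* p )
{
    if ( p == NULL ) return;
    free_fp( ( ( void** )p )[ -1 ] );
}
```

For example, sketch_malloc_align( malloc, 100, 64 ) returns a pointer whose address is a multiple of 64, and that pointer must later be released with sketch_free_align( free, p ) rather than with free() directly.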
commit 9dd440109a9d964f5cd286e9f83c487ad703e1e4 Author: Jeff Hammond Date: Sat May 21 15:21:58 2016 -0700 fix 404 link to BuildSystem Google Code is dead. Long live GitHub! commit d309f20b7376a68efa3b864ad790c2021c071655 Author: Field G. Van Zee Date: Wed May 18 15:13:53 2016 -0500 Added alignment switch to testsuite. Details: - Added a new input parameter to input.general that globally toggles whether testsuite tests are performed on objects whose buffers and leading dimensions have been aligned, and changed the implementation of libblis_test_mobj_create() to employ alignment (or not) regardless of whether row, column, or general storage is being tested. - Updated configure script's "--help" text to indicate default behavior for internal integer type size and BLAS/CBLAS integer type size options. commit 32db0adc218ea4ae370164dbe8d23b41cd3526d3 Author: Field G. Van Zee Date: Tue May 17 15:20:16 2016 -0500 Generate prototypes for user-defined packm kernels. Details: - Created template prototypes for packm kernels (in bli_l1m_ker.h), and then redefined reference packm kernels' prototyping headers in terms of this template, as is already done for level-1v, -1f, and -3 kernels. - Automatically generate prototypes for user-defined packm kernels in bli_kernel_prototypes.h (using the new template prototypes in bli_l1m_ker.h). - Defined packm kernel function types in bli_l1m_ft.h, including for packm kernels specific to induced methods, which are now used in bli_packm_cxk.c and friends rather than using a locally-defined function type. - In bli_packm_cxk.c, extended function pointer for packm kernels array from out to index 31 (from previous maximum of 17). This allows us to store the unrolled 30xk kernel in the array for use (on knc, for example). Note: This should have been done a long time ago. commit e3bd5ca64ae7c190ba689396c0de687b829a11fe Author: Devin Matthews Date: Thu May 12 20:54:13 2016 -0500 Fix SIMD definitions in KNL config, and a couple of fixes to C update. 
commit 4fe02e3d497995d94d34d3fcf5af895084cfc8b9 Author: Devin Matthews Date: Thu May 12 20:53:58 2016 -0500 Move bli_kernel.h before bli_threading.h in order of inclusion in blis.h. commit 4bcf1b35abea3f3dfc8f2fe462dcf155cf199e55 Author: Field G. Van Zee Date: Wed May 11 16:09:49 2016 -0500 Fixed bli_get_range_*() bugs in trsm variants. Details: - Fixed incorrect calls to bli_get_range_*() from within trsm blocked variants 1f, 2b, and 2f. The bug somehow went undetected since the big commit (537a1f4), and, strangely, did not manifest via the BLIS testsuite. The bug finally came to our attention when running the libflame test suite while linking to BLIS. Thanks to Kiran Varaganti for submitting the initial report that led to the discovery of this bug. commit 9cfa33023f123a6c17e987f72fba174ce073f0b6 Author: Field G. Van Zee Date: Wed May 11 16:02:30 2016 -0500 Minor updates to bli_f2c.h. Details: - Added #undef guards to certain #define statements in bli_f2c.h, and renamed the file guard to BLIS_F2C_H. This helps when #including "blis.h" from an application or library that already #includes an "f2c.h" header. commit a09a2e23eacf5328858c8318bb637c5ff3b71d08 Merge: 4dcd37eb 7c604e1c Author: Tyler Michael Smith Date: Wed May 11 10:47:11 2016 -0500 Merge pull request #76 from devinamatthews/move_simd_defs Move default SIMD-related definitions to bli_kernel_macro_defs.h commit 4dcd37eb1b12a6e08cc13df7b61391ef8363f5d8 Author: Tyler Smith Date: Tue May 10 16:28:59 2016 -0500 fixing knc simd align size commit 619dee0daec3474b4e5a55df90a61aabcae194f2 Merge: b790b3d9 7c604e1c Author: Devin Matthews Date: Tue May 10 12:13:24 2016 -0500 Merge branch 'move_simd_defs' into knl commit 7c604e1cbc1609b6e12d3ee973c08b7af5035be4 Author: Devin Matthews Date: Tue May 10 12:11:55 2016 -0500 Move default SIMD-related definitions to bli_kernel_macro_defs.h. Otherwise, configurations which customize these fail as these are now defined in bli_kernel.h.
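Note: the include-order issue behind the entry above can be illustrated with a small sketch. It assumes, without confirmation from the commit itself, that the defaults are supplied through #ifndef guards evaluated after the configuration's bli_kernel.h has been read. The two header stand-ins are inlined into one file, the values 64 and 32 are purely illustrative, and sketch_simd_align_size() is a hypothetical helper.

```c
#include <stddef.h>

/* --- Stand-in for a configuration's bli_kernel.h --- */
/* Hypothetical override; the value 64 is illustrative only. */
#define BLIS_SIMD_ALIGN_SIZE 64

/* --- Stand-in for bli_kernel_macro_defs.h, processed afterwards --- */
/* Supply a default only if the configuration did not define one.
   The default of 32 is an assumption made for this sketch. */
#ifndef BLIS_SIMD_ALIGN_SIZE
#define BLIS_SIMD_ALIGN_SIZE 32
#endif

/* Hypothetical helper reporting the value the rest of the code sees. */
size_t sketch_simd_align_size( void )
{
    return BLIS_SIMD_ALIGN_SIZE;
}
```

Because the guarded default comes second, the configuration's value is the one that takes effect. If the default were defined first and unguarded, the later #define in the configuration header would redefine the macro, which a compiler would typically report as a diagnostic.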
commit b790b3d9e1820f3b691676de48c291cae083452d Merge: 4f8c05c9 a7be2d28 Author: Devin Matthews Date: Tue May 10 11:49:47 2016 -0500 Merge branch 'master' into knl commit a7be2d28e8930b154d0da1d6929b54a96e210af6 Merge: 97b512ef 4b1e55ed Author: Field G. Van Zee Date: Tue May 10 11:48:51 2016 -0500 Merge pull request #74 from devinamatthews/fix_common_symbols Default-initialize all extern global variables to avoid generating common symbols. commit 4b1e55edbfe0e1cb2e7b9428424903497cb7a841 Author: Devin Matthews Date: Tue May 10 10:08:47 2016 -0500 Default-initialize all extern global variables to avoid generating common symbols. Fixes #73. commit 97b512ef62c7e25c97ed5e9eca81cd7015b2ac91 Author: Field G. Van Zee Date: Fri May 6 10:24:30 2016 -0500 Include headers from cblas.h to pull in f77_int. Details: - Added #include statements for certain key BLIS headers so that the definition of f77_int is pulled in when a user compiles application code with only #include "cblas.h" (and no other BLIS header). This is necessary since f77_int is now used within the cblas API. commit c3a4d39d03665135f1616588b5ef7c3e9ef5688d Author: Field G. Van Zee Date: Wed May 4 17:22:56 2016 -0500 Updates to haswell gemm micro-kernels. Details: - Added two new sets of [sd]gemm micro-kernels for haswell architectures, one that is 4x24/4x12 (s and d) and one that is 6x16/6x8. - Changed the haswell configuration to use the 6x16/6x8 micro-kernels by default. - Updated various Makefiles, in test, test/3m4m, and testsuite. commit 0b01d355ae861754ae2da6c9a545474af010f02e Author: Field G. Van Zee Date: Wed Apr 27 15:21:10 2016 -0500 Miscellaneous cleanups, fixes to recent commits. Details: - Fixed a typo in bli_l1f_ref.h, introduced into bbb8569, that only manifested when non-reference level-1f kernels were used. - Added an #undef BLIS_SIMD_ALIGN_SIZE to bli_kernel.h of dunnington configuration to prevent a compile-time warning until I can figure out the proper permanent fix. 
- Moved frame/1f/kernels/bli_dotxaxpyf_ref_var1.c out of the compilation path (into 'other' directory). _ref_var2 is used by default, which is the variant that is built on axpyf and dotxf instead of dotaxpyv. - Removed section of frame/include/bli_config_macro_defs.h pertaining to mixed datatype support. commit ed7326c836f427e2f8420b015220ce293207b10c Author: Field G. Van Zee Date: Wed Apr 27 14:57:40 2016 -0500 Added 'restrict' to l1v/l1f code in 'kernels' dir. Details: - Added 'restrict' keyword to existing kernel definitions in 'kernels' directory. These changes were meant for inclusion in bbb8569. commit bbb8569b2a08c3bcd631d5a05eb389d01d94ac07 Author: Field G. Van Zee Date: Wed Apr 27 14:13:46 2016 -0500 Use 'restrict' in all kernel APIs; wspace changes. Details: - Updated level-1v, level-1f kernel function types (bli_l1?_ft.h) and generic kernel prototypes (bli_l1?_ker.h) to use 'restrict' for all numerical operand pointers (ie: all pointers except the cntx_t). - Updated level-1f reference kernel definitions to use 'restrict' for all numerical operand pointers. (Level-1v reference kernel definitions were already updated in bdbda6e.) - Rewrote the level-1v and level-1f reference kernel prototypes in bli_l1v_ref.h and bli_l1f_ref.h, respectively, to simply #include bli_l1v_ker.h and bli_l1f_ker.h with redefined function base names (as was already being done for the level-3 micro-kernel prototypes in bli_l3_ref.h), rather than duplicate the signatures from the _ker.h files. - Added definitions to frame/include/bli_kernel_prototypes.h for axpbyv and xpbyv, which were probably meant for inclusion in bdbda6e. - Converted a number of instances of four spaces, as introduced in bdbda6e, to tabs. commit 4ea419c72c789825e1f93a1eee88219bbf873930 Merge: f1e9be2a bdbda6e6 Author: Field G. 
Van Zee Date: Tue Apr 26 12:50:45 2016 -0500 Merge pull request #70 from devinamatthews/daxpby Give the level1v operations some love commit bdbda6e6acc682ab1b6ca680edebd09ae12a832c Author: Devin Matthews Date: Mon Apr 25 11:05:57 2016 -0500 Give the level1v operations some love: - Add missing axpby and xpby operations (plus test cases). - Add special case for scal2v with alpha=1. - Add restrict qualifiers. - Add special-case algorithms for incx=incy=1. commit f1e9be2aba1a057eedb947bbae96848597777408 Author: Field G. Van Zee Date: Fri Apr 22 15:34:02 2016 -0500 Minor tweak to test/Makefile. Details: - Just committing a minor change to test/Makefile that has been lingering in my local working copy for longer than I can remember. commit aa0bceec277938328dabeb744680623f24fb0b61 Merge: 4136553f e2784b4c Author: Field G. Van Zee Date: Fri Apr 22 12:01:31 2016 -0500 Merge branch 'master' of github.com:flame/blis commit 4136553f0d0661a668dfdb9edcd7ce1c5773dde7 Author: Field G. Van Zee Date: Fri Apr 22 11:53:53 2016 -0500 Clear level-3 cntx_t's via memset() before use. Details: - In all level-3 operations' _cntx_init() functions, replaced calls to bli_cntx_obj_init() with calls to bli_cntx_obj_clear(), and in all level-3 operations' _cntx_finalize() functions, removed calls to bli_cntx_obj_finalize(), leaving those function definitions empty. - Changed the definition of bli_cntx_obj_clear() so that the clearing occurs via a single call to memset(). commit 4f8c05c9e2ef4cbb82b35a3ebf1f0a0ac665830e Author: Devin Matthews Date: Thu Apr 21 10:00:59 2016 -0500 Rearrange KNL dgemm kernel again to streamline usage of ymm register. sgemm and dgemm now both working with Intel SDE. commit e2784b4c921f706e756df3e146e20a4cb63f53e3 Merge: dd0ab1d9 a9b6c3ab Author: Field G. 
Van Zee Date: Wed Apr 20 18:34:09 2016 -0500 Merge pull request #67 from devinamatthews/cblas-f77-int Change CBLAS integer type to f77_int commit a9b6c3abda6222a8b240361643932e83cf726c4f Merge: e4c54c81 dd0ab1d9 Author: Devin Matthews Date: Wed Apr 20 16:00:10 2016 -0500 Merge remote-tracking branch 'origin/master' into cblas-f77-int # Conflicts: # config/haswell/bli_config.h commit e4c54c81463c2a19c9bb6b1f0f1be3fa9d018a45 Author: Devin Matthews Date: Wed Apr 20 15:56:46 2016 -0500 Change integer type in CBLAS function signatures to f77_int, and add proper const-correctness to BLAS layer. commit dd0ab1d93f33abca6af9edd7b8e52da62dcfa5b1 Author: Field G. Van Zee Date: Wed Apr 20 14:38:23 2016 -0500 Converted some bli_cntx query functions to macros. Details: - Commented out several datatype-aware query functions (those ending in _dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and added equivalent cpp query macros to bli_cntx.h. - Added 'bli_config.h' to .gitignore. commit 7193230f7d35edbd1d2f77842a613971f1603463 Author: Devin Matthews Date: Wed Apr 20 09:37:30 2016 -0500 Work around missing VPMULLQ on KNL. commit a30ccbc4c6a6e6460e78af6b5c530ee0d06f98fb Merge: eb2f18e4 0e1a9821 Author: Field G. Van Zee Date: Tue Apr 19 15:04:33 2016 -0500 Merge pull request #66 from devinamatthews/blas-configure Add configure options and generate bli_config.h automatically. commit bd44cf13e886069bc66c10ac0db178be96629a0d Author: Devin Matthews Date: Tue Apr 19 13:43:04 2016 -0500 Fix copy-paste errors in KNL kernels. commit eb2f18e4844d985715df20798f50f9cc12e3b5ad Author: Field G. Van Zee Date: Tue Apr 19 12:50:32 2016 -0500 More compile-time fixes to bgq gemm ukernel code. commit 0e1a9821d860f6c1d818baf4c48d21a23726c132 Author: Devin Matthews Date: Tue Apr 19 11:44:37 2016 -0500 Add configure options and generate bli_config.h automatically. Options to configure have been added for: - Setting the internal BLIS and BLAS/CBLAS integer sizes. 
- Enabling and disabling the BLAS and CBLAS layers. Additionally, configure options which require defining macros (the above plus the threading model), write their macros to the automatically-generated bli_config.h file in the top-level build directory. The old bli_config.h files in the config dirs were removed, and any kernel-related macros (SIMD size and alignment etc.) were moved to bli_kernel.h. The Makefiles were also modified to find the new bli_config.h file. Lastly, support for OMP in clang has been added (closes #56). commit a11eec05928ddc5c43fa5dbcd35f2edd24ff35a1 Author: Devin Matthews Date: Mon Apr 18 13:13:36 2016 -0500 Add sgemm ukernels for KNL. vpmullq is not implemented on KNL -- needs workaround. commit ff84469a4575f1ef8a0010046fde52240a312cae Author: Field G. Van Zee Date: Mon Apr 18 12:29:09 2016 -0500 Applied various compilation fixes to bgq kernels. commit c38e0dab05b2dc36672eab96e1248fb7fb2d785b Merge: bd5e2296 cbcd0b73 Author: Devin Matthews Date: Mon Apr 18 10:21:35 2016 -0500 Merge remote-tracking branch 'origin/master' into knl commit bd5e2296e98e042c31f1e8ece2c1ca8e4bdc2d4c Merge: 4745def0 49f85177 Author: Devin Matthews Date: Mon Apr 18 10:15:22 2016 -0500 Merge remote-tracking branch 'origin/knl' into knl commit 4745def0c87377ae83ad73ac514d7de08a96b2ac Author: Devin Matthews Date: Mon Apr 18 10:15:05 2016 -0500 Add 64-bit offset vector so we can use vgatherqpd. commit 49f85177f886f38889b60503a4e12fa7f04be1fd Author: Devin Matthews Date: Mon Apr 18 10:14:11 2016 -0500 KNL ukernel compiles with gcc. commit cbcd0b739dc54bd14fbb46aeda267c26725cd70f Author: Tyler Michael Smith Date: Mon Apr 18 03:12:57 2016 -0500 Changing ifdef for OSX pthread barriers commit 58b2c3cf040134d1be913c585a3c6905629116c0 Author: Devin Matthews Date: Sat Apr 16 16:12:24 2016 -0500 Rewrite of KNL kernel in GNU extended asm syntax. commit dd62080cea78f3a23616200d6640e52c102b2bb9 Author: Field G. 
Van Zee Date: Fri Apr 15 11:15:41 2016 -0500 Compile-time fix to bgq l1f kernels. Details: - Fixed an old reference to bli_daxpyf_fusefac, which no longer exists, by replacing it with the axpyf fusing factor (8), and cleaned up the relevant section of config/bgq/bli_kernel.h. - Removed most of the details of the level-3 kernels from the template kernel code in config/template/kernels/3 and replaced it with a reference to the relevant kernel wiki maintained on the BLIS github website. commit d5a915dd8d7a6ead42a68772e4420eb3647e6f1a Merge: 4320b725 41694675 Author: Field G. Van Zee Date: Thu Apr 14 12:56:36 2016 -0500 Merge branch 'master' of github.com:flame/blis commit 4320b725a1f8fd34101470b6cf52ad504a79c517 Author: Field G. Van Zee Date: Thu Apr 14 12:51:29 2016 -0500 Use kernel CFLAGS on "ukernels" directories. Details: - Updated the top-level Makefile so that the CFLAGS variable designated for kernel source code is applied not only to source code in directories named "kernels" but source code in any directory that contains the substring "kernels", such as "ukernels". - Formally disabled some code in gen-make-frag.sh script that was already effectively disabled. The code was related to handling "noopt" and "kernel" directories, which is now handled independently within the top-level Makefile without needing to place these source files into a separate makefile variable. commit 41694675e4cb56e2e0323c7a7db48e0819606a31 Author: Tyler Smith Date: Wed Apr 13 15:51:08 2016 -0500 pthreads bugfixes Getting pthreads to work on my Mac Implemented a pthread barrier when _POSIX_BARRIER isn't defined Now spawn n-1 threads instead of n threads so that master thread isn't just spinning the whole time Add -lpthread instead of -pthread to LDFLAGS (for clang) commit f756dbfa0d542cbc497724981520c83abf049c4b Author: Field G. Van Zee Date: Wed Apr 13 11:25:33 2016 -0500 Removed stale #include from bgq configuration.
Details: - Removed an old #include statement ("bli_gemm_8x8.h") from the bli_kernel.h file in the bgq configuration. It turns out this file was no longer needed even prior to 537a1f4. commit 0bd4169ea75f690714e7d2912229932a75d8a7e2 Author: Field G. Van Zee Date: Mon Apr 11 18:08:32 2016 -0500 Fixed context-broken dunnington/penryn kernels. Details: - Added missing context parameters to several instances where simpler kernels, or reference kernels, are called instead of executing the main body code contained in the kernel function in question. - Renamed axpyv and dotv kernel files to use "opt" instead of "int" substring, for consistency with level-1f kernels. commit 7912af5db45b7372d19a9a3dfeb82df302a05628 Author: Field G. Van Zee Date: Mon Apr 11 17:32:13 2016 -0500 CHANGELOG update (0.2.0) commit 898614a555ea0aa7de4ca07bb3cb8f5708b6a002 (tag: 0.2.0) Author: Field G. Van Zee Date: Mon Apr 11 17:32:09 2016 -0500 Version file update (0.2.0) commit 537a1f4f85ce1aa008901857cb3182e6b4546d7f Author: Field G. Van Zee Date: Mon Apr 11 17:21:28 2016 -0500 Implemented runtime contexts and reorganized code. Details: - Retrofitted a new data structure, known as a context, into virtually all internal APIs for computational operations in BLIS. The structure is now present within the type-aware APIs, as well as many supporting utility functions that require information stored in the context. User-level object APIs were unaffected and continue to be "context-free," however, these APIs were duplicated/mirrored so that "context-aware" APIs now also exist, differentiated with an "_ex" suffix (for "expert"). These new context-aware object APIs (along with the lower-level, type-aware, BLAS-like APIs) contain the address of a context as a last parameter, after all other operands.
Contexts, or specifically, cntx_t object pointers, are passed all the way down the function stack into the kernels and allow the code at any level to query information about the runtime, such as kernel addresses and blocksizes, in a thread-friendly manner--that is, one that allows thread-safety, even if the original source of the information stored in the context changes at run-time (see next bullet for more on this "original source" of info). (Special thanks go to Lee Killough for suggesting the use of this kind of data structure in discussions that transpired during the early planning stages of BLIS, and also for suggesting such a perfectly appropriate name.) - Added a new API, in frame/base/bli_gks.c, to define a "global kernel structure" (gks). This data structure and API will allow the caller to initialize a context with the kernel addresses, blocksizes, and other information associated with the currently active kernel configuration. The currently active kernel configuration within the gks cannot be changed (for now), and is initialized with the traditional cpp macros that define kernel function names, blocksizes, and the like. However, in the future, the gks API will be expanded to allow runtime management of kernels and runtime parameters. The most obvious application of this new infrastructure is the runtime detection of hardware (and the implied selection of appropriate kernels). With contexts in place, kernels may even be "hot swapped" at runtime within the gks. Once execution enters a level-3 _front() function, the memory allocator will be reinitialized on-the-fly, if necessary, to accommodate the new kernels' blocksizes.
If another application thread is executing with another (previously loaded) kernel, it will finish in a deterministic fashion because its kernel information was loaded into its context before computation began, and also because the blocks it checked out from the internal memory pools will be unaffected by the newer threads' reinitialization of the allocator. - Reorganized and streamlined the 'ind' directory, which contains much of the code enabling use of induced methods for complex domain matrix multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as those APIs' functionality is now mostly subsumed within the global kernel structure. - Updated bli_pool.c to define a new function, bli_pool_reinit_if(), that will reinitialize a memory pool if the necessary pool block size has increased. - Updated bli_mem.c to use bli_pool_reinit_if() instead of bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed usage of contexts where appropriate to communicate cache and register blocksizes to bli_mem_compute_pool_block_sizes(). - Simplified control trees now that much of the information resides in the context and/or the global kernel structure: - Removed blocksize object pointers (blksz_t*) fields from all control tree node definitions and replaced them with blocksize id (bszid_t) values instead, which may be passed into a context query routine in order to extract the corresponding blocksize from the given context. - Removed micro-kernel function pointers (func_t*) fields from all control tree node definitions. Now, any code that needs these function pointers can query them from the local context, as identified by a level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or level-1v kernel id (l1vkr_t). 
- Removed blksz_t object creation and initialization, as well as kernel function object creation and initialization, from all operation-specific control tree initialization files (bli_*_cntl.c), since this information will now live in the gks and, secondarily, in the context. - Removed blocksize multiples from blksz_t objects. Now, we track blocksize multiples for each blocksize id (bszid_t) in the context object. - Removed the bool_t's that were required when a func_t was initialized. These bools are meant to allow one to track the micro-kernel's storage preferences (by rows or columns). This preference is now tracked separately within the gks and contexts. - Merged and reorganized many separate-but-related functions into single files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and util directories, but has the most obvious effect of allowing BLIS to compile noticeably faster. - Reorganized execution paths for level-1v, -1d, -1m, and -2 operations in an attempt to reduce overhead for memory-bound operations. This includes removal of default use of object-based variants for level-2 operations. Now, by default, level-2 operations will directly call a low-level (non-object based) loop over a level-1v or -1f kernel. - Converted many common query functions in blk_blksz.c (renamed from bli_blocksize.c) and bli_func.c into cpp macros, now defined in their respective header files. - Defined bli_mbool.c API to create and query "multi-bools", or heterogeneous bool_t's (one for each floating-point datatype), in the same spirit as blksz_t and func_t. - Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS and BLIS_SIMD_SIZE. These values are needed in order to compute a third new parameter, which may be set indirectly via the aforementioned macros or directly: BLIS_STACK_BUF_MAX_SIZE.
This value is used to statically allocate memory in macro-kernels and the induced methods' virtual kernels to be used as temporary space to hold a single micro-tile. These values are now output by the testsuite. The default value of BLIS_STACK_BUF_MAX_SIZE is computed as "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE". - Cleaned up top-level 'kernels' directory (for example, renaming the embarrassingly misleading "avx" and "avx2" directories to "sandybridge" and "haswell," respectively) and gave more consistent and meaningful names to many kernel files (as well as updating their interfaces to conform to the new context-aware kernel APIs). - Updated the testsuite to query blocksizes from a locally-initialized context for test modules that need those values: axpyf, dotxf, dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr. - Reformatted many function signatures into a standard format that will more easily facilitate future API-wide changes. - Updated many "mxn" level-0 macros (ie: those used to inline double loops for level-1m-like operations on small matrices) in frame/include/level0 to use more obscure local variable names in an effort to avoid variable shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings, which are only output using -Wshadow.) - Added a conj argument to setm, so that its interface now mirrors that of scalm. The semantic meaning of the conj argument is to optionally allow implicit conjugation of the scalar prior to being populated into the object. - Deprecated all type-aware mixed domain and mixed precision APIs. Note that this does not preclude supporting mixed types via the object APIs, where it produces absolutely zero API code bloat. commit dd856c2cb75a2221a503a73dde27790c34b91570 Author: Devin Matthews Date: Mon Apr 11 10:39:18 2016 -0500 Translated MIC kernel to KNL and cleaned up a bit. Only real change is lack of swizzle modifiers for FMA instructions (used bcast from memory instead).
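Illustrative sketch for the BLIS_STACK_BUF_MAX_SIZE default described under commit 537a1f4 above. Only the formula "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE" comes from the commit message. The EX_-prefixed names, the register count, and the register width are hypothetical values chosen for this example, and the sketch assumes the SIMD size is expressed in bytes. The helper checks whether a single micro-tile would fit in a buffer of that size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example values (assumptions, not from any real BLIS
   configuration): 16 SIMD registers, each 32 bytes wide. */
#define EX_SIMD_NUM_REGISTERS 16
#define EX_SIMD_SIZE          32

/* Default formula quoted in the commit message:
   2 * (number of SIMD registers) * (SIMD register size). */
#define EX_STACK_BUF_MAX_SIZE ( 2 * EX_SIMD_NUM_REGISTERS * EX_SIMD_SIZE )

/* Returns nonzero if one mr x nr micro-tile, whose elements are
   elem_size bytes each, fits within the example stack buffer. */
static int ex_utile_fits( size_t mr, size_t nr, size_t elem_size )
{
    return mr * nr * elem_size <= ( size_t )EX_STACK_BUF_MAX_SIZE;
}
```

With these example values the buffer is 1024 bytes, so a 6x8 tile of doubles (384 bytes) fits, while a 16x16 tile of doubles (2048 bytes) does not.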
commit 7f27431d3fffdda99c282ec412731d0a90cb32a7 Author: Devin Matthews Date: Fri Apr 8 10:04:39 2016 -0500 Copy mic kernel to knl for transliteration. commit f8f02f0334ac020021e15a415bcd33aeea01deb4 Merge: 32c92d94 d1f8e5d9 Author: Devin Matthews Date: Wed Apr 6 11:37:05 2016 -0500 Merge branch 'master' into const_correctness commit 32c92d945c55708da0eb63be1771f8c5430e3910 Merge: 62914ccb 20af937b Author: Devin Matthews Date: Wed Apr 6 11:36:02 2016 -0500 Merge branch 'master' into const_correctness commit d1f8e5d9b2ecd054ed103f4d642d748db2d4f173 Merge: 20af937b c11d28ee Author: Field G. Van Zee Date: Tue Apr 5 12:21:27 2016 -0500 Merge pull request #60 from esauvage/master sgemm µkernel for bulldozer : bug correction for k%4 != 0 commit c11d28eed89d65494bc4019f04d046520866c0ff Author: Etienne Sauvage Date: Sat Apr 2 21:15:48 2016 +0200 cgemm µkernel for bulldozer : bug correction for k%4 != 0 commit 20af937b57f82bb3acb09418d5c0206e1b24f2c7 Merge: 36c3abb0 fc61a114 Author: Field G. Van Zee Date: Thu Mar 31 14:37:30 2016 -0500 Merge pull request #59 from devinamatthews/fix_testsuite_makefile Fix testsuite makefile commit fc61a1143edeba4946d4b9915f1775bb08e643fc Author: Devin Matthews Date: Thu Mar 31 10:53:01 2016 -0500 Fix formatting in configure. commit 26379b14de630e3a6c6eef5dfe87ff001558a8a6 Author: Devin Matthews Date: Thu Mar 31 10:45:48 2016 -0500 Adjust paths in common.mk to support building from testsuite dir. commit 36c3abb05fecb02d4a9ab13b2b69d133adf34583 Merge: 64b41fa5 917ce754 Author: Field G. Van Zee Date: Thu Mar 31 10:26:17 2016 -0500 Merge pull request #58 from esauvage/master cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer confi… commit 356d854fc9e34642cc46e0e02a8ceb56114878af Author: Devin Matthews Date: Wed Mar 30 16:33:15 2016 -0500 Make symlink to common.mk in build directory. 
commit edbb8470044f82ef959583ee09613a5a985292b5 Author: Devin Matthews Date: Wed Mar 30 16:27:11 2016 -0500 Refactor out some definitions which moved from make_defs.mk to Makefile for use in testsuite Makefile. commit 917ce75482a543fef46553efff6c246939761e59 Author: Etienne Sauvage Date: Wed Mar 30 22:03:09 2016 +0200 cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel commit 62914ccbcdb3c594f065dcfa65bd7e7b95c79283 Merge: bbf704bf 64b41fa5 Author: Devin Matthews Date: Tue Mar 29 15:24:25 2016 -0500 Merge branch 'master' into const_correctness commit 64b41fa554dff44b2f9ad48901b67c63836407a8 Merge: 1b09e343 0171ad58 Author: Field G. Van Zee Date: Tue Mar 29 15:19:41 2016 -0500 Merge pull request #54 from devinamatthews/more_config_opts More config opts commit 1b09e343dfe5b48b4842e2cb96f41c8cc249bad0 Author: Field G. Van Zee Date: Tue Mar 29 12:55:28 2016 -0500 Updated gcc version from 4.8 to 4.9 in .travis.yml. commit 0171ad58997b3a5a9b76301511dbe0751fffc940 Author: Devin Matthews Date: Mon Mar 28 13:55:06 2016 -0500 Add icc and clang support for Intel architectures, fixes #47. 2bd036f fixes #49 BTW. commit 3090fff64cc87ff2519a09f38e6b8699cf3cba11 Merge: 8624e365 4ca5d5b1 Author: Field G. Van Zee Date: Mon Mar 28 12:36:25 2016 -0500 Merge pull request #44 from esauvage/master sgemm micro-kernel for FMA4 instruction set commit e6e566426ac3ded7ef87cd8ff9be98accfdc4acc Merge: 469429ec 8624e365 Author: Devin Matthews Date: Sat Mar 26 14:10:15 2016 -0500 Merge branch 'master' into more_config_opts commit 8624e36543160739d954c4dbcc5a5594458f3a12 Merge: a315833f 2bd036f1 Author: Field G. Van Zee Date: Sat Mar 26 13:56:28 2016 -0500 Merge pull request #50 from devinamatthews/fix_noopt_avx Fix configuration issue where instruction set flags are not specified for debug builds. commit 469429ec34e5b1a172ce35596f9c7afdaacac131 Author: Devin Matthews Date: Fri Mar 25 20:45:41 2016 -0500 Fix LD_FLAGS -> LDFLAGS. 
commit 8442d65c9ead0376fc5f2dfad62fd4862ab9b2b3 Author: Devin Matthews Date: Fri Mar 25 20:06:48 2016 -0500 Replace -march=native with specific architecture flags to support cross-compiling, and add icc support for Intel architectures. commit 76099f20be1b49ac960f7e3c5a8296bbf4e1782d Author: Devin Matthews Date: Fri Mar 25 17:22:58 2016 -0500 Add threading option to configure. commit ad43eab4c7899d56d8d7caa6e2d92bc0581ea5a5 Merge: 9452bdb3 2bd036f1 Author: Devin Matthews Date: Fri Mar 25 15:00:02 2016 -0500 Merge branch 'fix_noopt_avx' into more_config_opts commit 9452bdb3afbf2d7f898134a091d7790817e7be9c Author: Devin Matthews Date: Fri Mar 25 14:59:50 2016 -0500 Add options for verbose make output and static/shared linking to configure. commit 2bd036f1f9ce1ee0864365557f66d9415dd42de3 Author: Devin Matthews Date: Fri Mar 25 12:16:49 2016 -0500 Fix configuration issue where instruction set flags are not specified for debug builds. commit bbf704bf7501411964a63a68f1af541f612cf92d Author: Devin Matthews Date: Fri Mar 25 09:55:35 2016 -0500 Add missing const to bli_read_nway_from_env. commit a315833f067944fb0bc14cf60f0c7dcb5dc897b6 Merge: 1d1a426d af92773f Author: Field G. Van Zee Date: Thu Mar 24 12:30:21 2016 -0500 Merge pull request #48 from figual/master Updated and improved ARMv8 micro-kernels. commit af92773f4f85a2441fe0c6e3a52c31b07253d08e Author: figual Date: Wed Mar 23 22:07:02 2016 +0100 Updated and improved ARMv8 micro-kernels. commit a4d7729776d17d9bdf2341eacd70b9770b9ba8d2 Author: Devin Matthews Date: Mon Mar 21 09:55:21 2016 -0500 Set default value for debug_type variable. commit 0e2447fa55d8c5fa2b1fc4150073512495c5f9eb Author: Devin Matthews Date: Thu Mar 17 16:32:05 2016 -0500 Add const correctness to auxinfo_t struct (microkernels need update theoretically). commit 1d1a426d18ec03754021456862a1f4d1dfec1fbf Merge: 5a978fff d226dfa0 Author: Field G. 
Van Zee Date: Mon Mar 7 15:17:53 2016 -0600 Merge pull request #46 from devinamatthews/new-config-opts Add several changes to the build system. commit d226dfa05190eb477b33563b1edccf8603973336 Author: Devin Matthews Date: Sat Mar 5 16:18:14 2016 -0600 Add several changes to the build system. 1) Add -- options. 2) Add -d/--enable-debug option to enable debugging symbols with and without optimization. 3) Allow user to specify CC at configure time, and determine vendor (gcc/icc/etc.). For now configurations enforce a particular vendor. 4) Add make V=[0,1] option to control build verbosity. commit 5a978fffdb8f09a81c89541d541d4a6830cd70a4 Merge: adb2b4e0 63e26423 Author: Field G. Van Zee Date: Fri Mar 4 17:26:58 2016 -0600 Merge pull request #45 from devinamatthews/high_prec_timers Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday commit 63e264239053b913164a849dd8a45829087eaddc Author: Devin Matthews Date: Fri Mar 4 13:17:50 2016 -0600 Make sure that -lrt is linked on Linux. commit 44fddd48dc1708a956803d1948f04429ec0d8700 Author: Devin Matthews Date: Fri Mar 4 12:36:38 2016 -0600 Add missing \. commit 7cabd2131f953de23e7015d760b0ddfda51b1251 Author: Devin Matthews Date: Thu Mar 3 11:43:07 2016 -0600 Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday. commit adb2b4e096c78e8b2f85fd372cf0d5eb04af5be8 Author: Tyler Smith Date: Wed Mar 2 14:48:12 2016 -0600 Fixing guard for non implemented partitioning through packed matrices commit 4ca5d5b1fd6f2e4a8b2e139c5405475239581e51 Author: Etienne Sauvage Date: Tue Mar 1 21:33:01 2016 +0100 sgemm micro-kernel for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel commit 627d59b5ba06866b26f46e4434a0435b600925e3 Author: Etienne Sauvage Date: Mon Feb 29 21:53:12 2016 +0100 symbolic link for bulldozer configuration to kernels commit 2dc5c0ae038ed175fab85751803ada05734d1ba1 Merge: f2809fc5 3d0fae81 Author: Field G. 
Van Zee Date: Mon Feb 29 12:22:51 2016 -0600 Merge pull request #40 from tkelman/bulldozer-symlink Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer commit f2809fc5f74466c755da6a5b4632853e634060b5 Merge: f86b94f2 8624a33c Author: Field G. Van Zee Date: Sat Feb 27 13:06:03 2016 -0600 Merge pull request #39 from devinamatthews/fix_f2c_conflicts Devin's f2c type namespace update. Details: - Added "bla_" prefix to f2c type names to prevent conflicts with external user code. - Removed most of the body of bli_f2c.h, which was unused. commit 3d0fae810d942085d8f2d389820b4e0027577db8 Author: Tony Kelman Date: Thu Feb 25 23:24:03 2016 -0800 Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer to fix linking issue mentioned in #37 and https://groups.google.com/forum/#!topic/blis-devel/iypwljcaeEI commit 8624a33ccc12dff6f6c4f92992ca5636af1576a6 Author: Devin Matthews Date: Thu Feb 25 13:51:26 2016 -0600 Fix remaining f2c conflicts. commit 372eef0b6c0a535bf88d4b46b72f61266e8491ba Author: Devin Matthews Date: Thu Feb 25 12:01:58 2016 -0600 Fixed most conflicts after hack-n-slash for bli_f2c.h, cleanup in progress. commit f86b94f206e2e09fa3221cc55c3dc5b05ca4775a Author: Field G. Van Zee Date: Tue Feb 23 18:12:34 2016 -0600 Included missing blas2blis integer def to CBLAS. Details: - Added #include "bli_config_macro_defs" to all cblas_*.c files in compat/cblas/src. This has the effect of defining BLIS_BLAS2BLIS_INT_TYPE_SIZE to the default value if bli_config.h does not define it. Thanks to Tony Kelman for reporting this bug. - In cblas_i?amax.c, changed the type of the variable 'iamax' from 'int' to 'f77_int'. This eliminates a compiler warning and a potential runtime bug and/or crash when the size of an int differs from the size of f77_int (as determined by BLIS_BLAS2BLIS_INT_TYPE_SIZE). commit 0b126de1342c11c65623bcb38e258e21e9244e3d Author: Field G. Van Zee Date: Fri Nov 13 16:29:12 2015 -0600 Consolidated packm_blk_var1 and packm_blk_var2.
Details: - Consolidated the two blocked variants for packm into a single implementation (packm_blk_var1) and removed the other variant. - Updated all induced method _cntl_init() functions in frame/cntl/ind/ to use the new blocked variant 1. - Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(), to detect pack_t schemas for induced methods and native execution, respectively. commit 30e5eb29e060b97752f702d2ea5d101d950f53b2 Author: Field G. Van Zee Date: Fri Nov 13 12:14:19 2015 -0600 Minor changes to treatment of rs, cs in bli_obj.c. Details: - Applied a patch submitted by Devin Matthews that: - implements subtle changes to handling of somewhat unusual cases of row and column strides to accommodate certain tensor cases, which includes adding dimension parameters to _is_col_tilted() and _is_row_tilted() macros, - simplifies how buffers are sized when requesting BLIS-allocated objects, - re-consolidates bli_adjust_strides_*() into one function, and - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99 environments. commit f0a4f41b5acf55b41707ec821c4c5f9076dfbc24 Author: Field G. Van Zee Date: Thu Nov 12 15:22:50 2015 -0600 Fixed unimplemented case in core2 sgemm ukernel. Details: - Implemented the "beta == 0" case for general stride output for the dunnington sgemm micro-kernel. This case had been, up until now, identical to the "beta != 0" case, which does not work when the output matrix has nan's and inf's. It had manifested as nan residuals in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin Matthews for reporting this bug. commit 42810bbfa0b8f006ecc5128d903909ec13ea63f9 Author: Field G. Van Zee Date: Thu Nov 12 12:07:46 2015 -0600 Fixed minor bugs for uncommon obj_create cases. Details: - Separated bli_adjust_strides() into _alloc() and _attach() flavors so that the latter can avoid a test performed by the former, in which the rs and cs are overridden and set to zero if either matrix dimension is zero.
Actually, we also disable this overriding behavior, even for the _alloc() case, since keeping the original strides (probably) does not hurt anything. The original code has been kept commented-out, though, in case an unintended consequence is later discovered. - Fixed a typo in an error check for general stride cases where rs == cs. commit 3e6dd11467643fbc2cb45c13cec8dd6024232833 Author: Field G. Van Zee Date: Tue Nov 3 10:30:08 2015 -0600 Minor re-expression in quadratic partitioning code. Details: - Minor change to quadratic equation solution code that avoids recomputation of the sqrt() parameter when the compiler is not smart enough to perform this optimization automatically. commit 0694b722f7e4df00efb32639095a2aca80e67f52 Merge: 3e116f0a 33557ecc Author: Field G. Van Zee Date: Mon Nov 2 17:24:25 2015 -0600 Merge branch 'master' of github.com:flame/blis commit 3e116f0a2953f50b3c068759a775ad7ffae04e49 Author: Field G. Van Zee Date: Mon Nov 2 17:18:23 2015 -0600 Fixed imaginary bug in quadratic partitioning code. Details: - Fixed a bug in the relatively new quadratic partitioning code that, under the right conditions, would perform sqrt() on a negative value. If the solution is imaginary, we discard it and use an alternate partition width that assumes no diagonal intersection. That alternate width is actually already computed, so, the fix was quite simple. Thanks to Devangi Parikh for reporting this bug. commit 33557ecccaf49b2569b7f3d7bcea52c2aab94c68 Author: Jeff Hammond Date: Mon Nov 2 12:18:43 2015 -0800 add Travis CI build status icon to the README commit 4a502fbe77bd0f701108baaa559d9cfb483f88de Author: Field G. Van Zee Date: Mon Nov 2 13:28:34 2015 -0600 Laid groundwork for runtime memory pool resizing. Details: - Changed bli_pool_finalize() so that the freeing begins with the block at top_index instead of block 0. This allows us to use the function for terminal finalization as well as temporary cleanup prior to reinitialization.
Also, clear the pool_t struct upon _pool_finalize() in case it is called in the terminal case with some blocks still checked out to threads (in which case the threads will see the new block size as 0 and thus release the block as intended). - Added bli_pool_reinit(), which calls _pool_finalize() followed by _pool_init() with new parameters. - Added bli_mem_reinit(), which is based on bli_pool_reinit(). - Added new wrapper, _mem_compute_pool_block_sizes(), which calls _mem_compute_pool_block_sizes_dt(). - Updated bli_mem_release() so that the pblk_t is freed, via _pool_free_block(), if the block size recorded in the mem_t at the time the pblk_t was acquired is now different from the value in the pool_t. commit 37e55ca39bdbddaec03ad30d43e8ad2b3e549c96 Author: Field G. Van Zee Date: Fri Oct 30 18:25:04 2015 -0500 Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm. Details: - Fixed a family of bugs in the triangular level-3 operations for certain complex implementations (3m1 and 4m1a) that only manifest if one of the register blocksizes (PACKMR/PACKNR, actually) is odd: - Fixed incorrect imaginary stride computation in bli_packm_blk_var2() for the triangular case. - Fixed the incorrect computation of imaginary stride, as stored in the auxinfo_t struct in trmm and trsm macro-kernels. - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the cases where the register blocksize for the triangular matrix is odd. Introduced a new byte-granular pointer arithmetic macro, bli_ptr_add(), that computes the correct value. - Added cpp macro to bli_macro_defs.h for typeof() operator, defined in terms of __typeof__, which is used by bli_ptr_add() macro. - Disabled the row- vs. column-storage optimization in bli_trmm_front() for singleton problems because the inherent ambiguity of whether a scalar is row-stored or column-stored causes the wrong parameter combination code to be executed (by dumb luck of our checking for row storage first).
- Added commented-out debugging lines to 3m1/4m1a and reference micro-kernels, and trsm_ll macro-kernel. commit 46294d80e5a79c598e200e1c8ec2a642ff839971 Merge: d3159c57 a0a7b85a Author: Field G. Van Zee Date: Tue Oct 27 12:41:23 2015 -0500 Merge pull request #35 from figual/master Fixed incomplete code in the double precision ARMv8 microkernel. commit a0a7b85ac3e157af53cff8db0e008f4a3f90372c Author: Francisco Igual Date: Tue Oct 27 08:59:15 2015 +0000 Fixed incomplete code in the double precision ARMv8 microkernel. commit d3159c5740c9ee7f8c0b661003aab6f00646ad6f Merge: b489152e 7e03e45b Author: Field G. Van Zee Date: Wed Oct 21 14:54:00 2015 -0500 Merge branch 'master' of github.com:flame/blis commit b489152e112644ec3b6d19e687231a9607f7694f Author: Field G. Van Zee Date: Wed Oct 21 14:53:17 2015 -0500 Use vzeroall in haswell micro-kernels. commit 7e03e45bfe6c27c4fdbf06b1caa7f49e9a5fef49 Merge: 77ddb0b1 4f88c29f Author: Field G. Van Zee Date: Wed Oct 14 13:26:07 2015 -0500 Merge pull request #33 from xianyi/master Enable Travis CI commit 4f88c29f9e634cbb6fb22d8c88931f0ec78ad7db Author: Zhang Xianyi Date: Wed Oct 14 12:57:50 2015 -0500 Detect Intel Broadwell (using Haswell config). commit 4b0ac1a9984a93f7ad4369b10fca63991107d9f5 Merge: fe3e355c 77ddb0b1 Author: Zhang Xianyi Date: Wed Oct 14 12:51:05 2015 -0500 Merge branch 'upstream_master' commit 77ddb0b1d31ada111dadf392766ba6d9210ed9fb Author: Field G. Van Zee Date: Tue Oct 13 12:53:06 2015 -0500 Removed flop-counting mechanism. Details: - Removed the optional flop-counting feature introduced in commit 7574c994. commit 276da366187460a4c8e6e0910e79cb39ce780bfe Author: Field G. Van Zee Date: Mon Oct 12 11:43:03 2015 -0500 Minor formatting change to README.md. commit d17057446f5404824478e8a6cd08f242ab75544a Author: Field G. Van Zee Date: Mon Oct 12 11:39:49 2015 -0500 Added "Getting Started" section to README.md. Details: - Added section to README.md file containing links to wikis with brief descriptions. 
commit e7e1f2f7b601b21b50e3cdad8972cb3fe11018d3 Author: Field G. Van Zee Date: Fri Oct 2 16:51:52 2015 -0500 Minor updates to CREDITS, README files. commit 55329906ecd7ce1ab910e4d30a29354a9172e7ea Author: Field G. Van Zee Date: Sat Sep 26 20:47:19 2015 -0500 Minor edits to README.md, testsuite. Details: - Fixed typos in README.md. - Fixed column heading alignment for testsuite when matlab output is enabled. - Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile. commit bbebdb5793a8fd6aaf257012ab0272beaa04a0de Author: Field G. Van Zee Date: Fri Sep 25 14:47:27 2015 -0500 Replaced README with README.md. Details: - Replaced the old (and short) README file with a much more comprehensive version written in github-flavored markdown. The new file is based on content taken from the old Google Code homepage. commit e2e9d64a63485461192d9c2a6dd0183a8b71013c Author: Field G. Van Zee Date: Thu Sep 24 12:14:03 2015 -0500 Load balance thread ranges for arbitrary diagonals. Details: - Expanded/updated interface for bli_get_range_weighted() and bli_get_range() so that the direction of movement is specified in the function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b()) and also so that the object being partitioned is passed instead of an uplo parameter. Updated invocations in level-3 blocked variants, as appropriate. - (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to carefully take into account the location of the diagonal when computing ranges so that the area of each subpartition (which, in all present level-3 operations, is proportional to the amount of computation engendered) is as equal as possible. - Added calls to a new class of routines to all non-gemm level-3 blocked variants: bli_<opname>_prune_unref_mparts_[mnk]() where <opname> is herk, trmm, or trsm and [mnk] is chosen based on which dimension is being partitioned.
These routines call a more basic routine, bli_prune_unref_mparts(), to prune unreferenced/unstored regions from matrices and simultaneously adjust other matrices which share the same dimension accordingly. - Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the new pruning routines. - Fixed incorrect blocking factors passed into bli_get_range_*() in bli_trsm_blk_var[12][fb].c. - Added a new test driver in test/thread_ranges that can exercise the new bli_get_range_*() and bli_get_range_weighted_*() under a range of conditions. - Reimplemented m and n fields of obj_t as elements in a "dim" array field so that dimensions could be queried via index constant (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification macros accordingly. - Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values. - Added bli_round() macro, which calls C math library function round(), and bli_round_to_mult(), which rounds a value to the nearest multiple of some other value. - Added miscellaneous pruning- and mdim_t-related macros. - Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to bli_obj_row_off(), bli_obj_col_off(). commit fe3e355c9c5a6f65b8736b009e2d501b62a83ea1 Merge: efa641e3 4dd9dd3e Author: Zhang Xianyi Date: Fri Aug 21 14:38:36 2015 -0500 Merge branch 'upstream_master' commit efa641e36b73abee34166a252e90e28a6281d92d Author: Zhang Xianyi Date: Sat Aug 22 03:15:50 2015 +0800 Try to fix the compiling bug on travis. commit 4dd9dd3e1de626b51bfe85d9ee65f193d60e8d38 Author: Field G. Van Zee Date: Fri Aug 21 11:52:37 2015 -0500 Fixed minor alignment ambiguity bug in bli_pool.c. Details: - Fixed a typecasting ambiguity in bli_pool_alloc_block() in which pointer arithmetic was performed on a void* as if it were a byte pointer (such as char*). Some compilers may have already been interpreting this situation as intended, despite the sloppiness. Thanks to Aleksei Rechinskii for reporting this issue.
- Redefined pointer alignment macros to typecast to uintptr_t instead of siz_t. commit 12ffd568b04feda57147c13b67717416a01c82f8 Author: Zhang Xianyi Date: Sat Aug 22 00:24:28 2015 +0800 Add Travis CI. commit ecc3ebb749e0861c27deda52b5f87236ede4901b Author: Field G. Van Zee Date: Wed Jul 29 13:31:12 2015 -0500 CHANGELOG update (0.1.8) commit 47caa33485b91ea6f2a5e386e61210c90c5f489f (tag: 0.1.8) Author: Field G. Van Zee Date: Wed Jul 29 13:31:09 2015 -0500 Version file update (0.1.8) commit ef0fbbbdb6148b96938733fce72cb4ed7dad685e Merge: fdfe14f1 d4b89136 Author: Field G. Van Zee Date: Thu Jul 9 13:54:54 2015 -0500 Merge branch 'master' of github.com:flame/blis commit fdfe14f1e17ba5a2f8dfa0bdb799c6b0e730211b Author: Field G. Van Zee Date: Thu Jul 9 13:52:39 2015 -0500 Added support for Intel Haswell/Broadwell. Details: - Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors and FMA instructions. (Complex support is currently provided by default induced method, 4m1a.) - Added a 'haswell' configuration, which uses the aforementioned kernels. - Inserted auto-detection support for haswell configuration in build/auto-detect/cpuid_x86.c. - Modified configure script to explicitly echo when automatic or manual configuration is in progress. - Changed beta scalar in test_gemm.c module of test suite to -1.0 to 0.9. commit d4b891369c1eb0879ade662ff896a5b9a7fca207 Author: Field G. Van Zee Date: Tue Jul 7 10:06:53 2015 -0500 Added 'carrizo' configuration. Details: - Added a new configuration for AMD Excavator-based hardware also known as Carrizo when referring to the entire APU. This configuration uses the same micro-kernels as the piledriver, but with different cache blocksizes. commit 0b7255a642d56723f02d7ca1f8f21809967b8515 Author: Field G. Van Zee Date: Fri Jun 19 12:01:50 2015 -0500 CHANGELOG update (0.1.7) commit 267253de8a7be546ce87626443ee38701c1d411f (tag: 0.1.7) Author: Field G. 
Van Zee Date: Fri Jun 19 12:01:49 2015 -0500 Version file update (0.1.7) commit 7cd01b71b5e757a6774625b3c9f427f5e7664a76 Author: Field G. Van Zee Date: Fri Jun 19 11:31:53 2015 -0500 Implemented dynamic allocation for packing buffers. Details: - Replaced the old memory allocator, which was based on statically-allocated arrays, with one based on a new internal pool_t type, which, combined with a new bli_pool_*() API, provides a new abstract data type that implements the same memory pool functionality but with blocks from the heap (i.e., malloc() or equivalent). Hiding the details of the pool in a separate API also allows for a much simpler bli_mem.c family of functions. - Added a new internal header, bli_config_macro_defs.h, which enables sane defaults for the values previously found in bli_config. Those values can be overridden by #defining them in bli_config.h the same way kernel defaults can be overridden in bli_kernel.h. This file most resembles what was previously a typical configuration's bli_config.h. - Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which defaults to BLIS_PAGE_SIZE, to specify the alignment of individual blocks in the memory pool. Also added a corresponding query routine to the bli_info API. - Deprecated (once again) the micro-panel alignment feature. Upon further reflection, it seems that the goal of more predictable L1 cache replacement behavior is outweighed by the harm caused by non-contiguous micro-panels when k % kc != 0. I honestly don't think anyone will even miss this feature. - Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call bli_cntl_init() instead of bli_init(). - Removed query functions from bli_info.c that are no longer applicable given the dynamic memory allocator. - Removed unnecessary definitions from configurations' bli_config.h files, which are now pleasantly sparse. - Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite modules.
Thanks to Devangi Parikh for pointing out these miscalculations. - Comment, whitespace changes. commit 9848f255a3bab17d1139c391cca13ff3f1ffe6ed Author: Field G. Van Zee Date: Thu Jun 11 19:14:22 2015 -0500 Added early return to API-level _init() routines. Details: - Added conditional code that returns early from the API-level _init() routines if the API is already initialized. Actually meant for this to be included in 5f93cbe8. commit 5f93cbe870f3478870e15581e7fd450dad5bba1e Author: Field G. Van Zee Date: Thu Jun 11 18:52:12 2015 -0500 Introduced API-level initialization. Details: - Added API-level initialization state to _const, _error, _mem, _thread, _ind, and _cntl APIs. While this functionality will mostly go unused, adding minuscule overhead at init-time, there will be at least one instance in the near future where, in order to avoid an infinite loop, a certain portion of the initialization will call a query function that itself attempts to call bli_init(). API-level initialization will allow this later stage to verify that an earlier stage of initialization has completed, even if the overall call to bli_init() has not yet returned. - Added _is_initialized() functions for each API, setting the underlying bool_t during _init() and unsetting it during _finalize(). - Comment, whitespace changes. commit ee129c6b028bc5ac88da7c74fde72c49803742ff Author: Field G. Van Zee Date: Wed Jun 10 12:53:28 2015 -0500 Fixed bugs in _get_range(), _get_range_weighted(). Details: - Fixed some bugs that only manifested in multithreaded instances of some (non-gemm) level-3 operations. The bugs were related to invalid allocation of "edge" cases to thread subpartitions. (Here, we define an "edge" case to be one where the dimension being partitioned for parallelism is not a whole multiple of whatever register blocksize is needed in that dimension.) In BLIS, we always require edge cases to be part of the bottom, right, or bottom-right subpartitions.
(This is so that zero-padding only has to happen at the bottom, right, or bottom-right edges of micro-panels.) The previous implementations of bli_get_range() and _get_range_weighted() did not adhere to this implicit policy and thus produced bad ranges for some combinations of operation, parameter cases, problem sizes, and n-way parallelism. - As part of the above fix, the functions bli_get_range() and _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b, and _b2t suffixes, similar to the partitioning functions. This is an easy way to make sure that the variants are calling the right version of each function. The function signatures have also been changed slightly. - Comment/whitespace updates. - Removed unnecessary '/' from macros in bli_obj_macro_defs.h. commit 9135dfd69d39f3bbd75034f479f27a78dbfebcce Author: Field G. Van Zee Date: Fri Jun 5 13:37:44 2015 -0500 Minor updates to test/3m4m files. commit d62ceece943b20537ec4dd99f25136b9ba2ae340 Author: Field G. Van Zee Date: Wed Jun 3 12:56:45 2015 -0500 Minor update to test/3m4m/runme.sh. Details: - Removed some stale script code that should have been removed during 590bb3b8c. commit b6ee82a3d421c9c4f1eb6848c7c6e37aa46de799 Author: Field G. Van Zee Date: Wed Jun 3 12:14:23 2015 -0500 Minor cleanup to bli_init() and friends. Details: - Spun-off initialization of global scalar constants to bli_const_init() and of threading stuff to bli_thread_init(). - Added some missing _finalize() functions, even when there is nothing to do. commit 1213f5cebabc1637ce9dd45c4bfa87bb93677c29 Author: Field G. Van Zee Date: Tue Jun 2 13:27:47 2015 -0500 POSIX thread bugfixes/edits to bli_init.c, _mem.c. Details: - Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex was used to lock access to initialization/finalization actions. But everything worked out okay as long as bli_init() was called by single-threaded code. 
- Changed to static initialization for memory allocator mutex in bli_mem.c, and moved mutex to that file (from bli_init.c). - Fixed some type mismatches in bli_threading_pthreads.c that resulted in compiler warnings. - Fixed a small memory leak with allocated-but-never-freed (and unused) pthread_attr_t objects. - Whitespace changes to bli_init.c and bli_mem.c. commit 590bb3b8c5c0389159c5a9451b6c156c5f237e8a Author: Field G. Van Zee Date: Sun May 24 16:02:53 2015 -0500 Backed-out adjusted dim changes to test/3m4m. Details: - Reverted most changes applied during commit ec25807b. commit ec25807b26da943868f0d0517c3720e50181b8f9 Author: Field G. Van Zee Date: Fri Apr 10 13:23:50 2015 -0500 Tweaks to test/3m4m to test with adjusted dims. Details: - Updated test/3m4m driver files to build test drivers that allow comparison of real "asm_blis" results to complex "asm_blis" results, except with the latter's problem sizes adjusted so that problems are generated with equal flop counts. commit 426b6488580a92bf071a62dc319a9c837ce39821 Author: Field G. Van Zee Date: Wed Apr 8 15:12:21 2015 -0500 Fixed a packing bug that manifested in trsm_r. Details: - Fixed a bug that caused a memory leak in the contiguous memory allocator. Because packm_init() was using simple aliasing when a subpartition object was marked as zeros by bli_acquire_mpart_*(), the "destination" pack object's mem_t entry was being overwritten by the corresponding field of the "source" object (which was likely NULL). This prevented the block from being released back to the memory allocator. But this bug only manifested when changing the location of packing B from outside the var1 loop to inside the var3 loop, and only for trsm with triangular B (side = right). The bug was fixed by changing the type of alias used in packm_init() when handling zero partition cases. Specifically, we now use bli_obj_alias_for_packing(), which does not clobber the destination (pack) object's mem_t field.
Thanks to Devangi Parikh for this bug report. commit c84286d5cef48f16d83831baac1f46b9856b9a36 Author: Field G. Van Zee Date: Sat Apr 4 15:39:14 2015 -0500 More minor tweaks to test/3m4m. Details: - Added a line of output that forces matlab to allocate the entire array up-front. - Re-enabled real domain benchmarks in runme.sh, which were temporarily disabled. commit 309717c8ebf4ef1369f15cf41340e13c25b41573 Author: Field G. Van Zee Date: Fri Apr 3 19:28:49 2015 -0500 More tweaks to test/3m4m, configurations. Details: - Fixed incorrect number of mc_x_kc memory blocks in sandybridge/bli_config.h. - Enabled OpenMP multithreading in piledriver/bli_config.h. - More updates to test/3m4m driver files. commit 4baf3b9c69b2f648be9e46e07ccc9859dd675828 Author: Field G. Van Zee Date: Fri Apr 3 16:44:32 2015 -0500 Tweaked test/3m4m driver, including acml support. Details: - Added ACML support to test/3m4m driver Makefile and runme.sh script. commit a32f7c49ca4ea869d2a6c66818780f4321743d67 Merge: 349e075a 4bfd1ce8 Author: Field G. Van Zee Date: Fri Apr 3 08:28:11 2015 -0500 Merge pull request #23 from xianyi/master Add auto-detecting CPU on configure stage. commit 349e075ad6a8e2a1211d94f36d24828c9d44b052 Author: Field G. Van Zee Date: Thu Apr 2 18:12:28 2015 -0500 Tweaks to sandybridge config, test/3m4m driver. Details: - Enable OpenMP support by default in sandybridge's bli_config.h. - Reorganized sandybridge's bli_kernel.h. - Updated 3m4m Makefile, runme.sh to also test MKL implementation. commit 4bfd1ce8ca93f93d170dd2715f0a32027b417b46 Author: Zhang Xianyi Date: Thu Apr 2 16:40:21 2015 -0500 Detect NEON for cortex-a9 and cortex-a15. commit aa6eec4f43137057276fe6119bdbfb5c52682527 Author: Zhang Xianyi Date: Thu Apr 2 16:03:44 2015 -0500 Detect the CPU architecture. Support ARM cores. Detect the CPU architecture by compiler's predefined macros. Then, detect the CPU cores. Support detecting x86 and ARM architectures.
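Illustration (not part of the commit above): a minimal sketch of what "detect the CPU architecture by compiler's predefined macros" can look like in C. The helper names arch_family() and arch_family_is_supported() are hypothetical and do not come from the BLIS auto-detect code; the macros tested are the ones commonly predefined by GCC, Clang, and MSVC.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not a BLIS function): report the architecture family
   the compiler is targeting, using only compiler-predefined macros. */
static const char *arch_family(void)
{
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    return "x86";
#elif defined(__aarch64__) || defined(__arm__) || defined(_M_ARM64) || defined(_M_ARM)
    return "arm";
#else
    return "unknown";
#endif
}

/* Hypothetical helper: nonzero if the family is one of the two the commit
   message names (x86 or ARM). */
static int arch_family_is_supported(const char *family)
{
    assert(family != NULL);
    return strcmp(family, "x86") == 0 || strcmp(family, "arm") == 0;
}
```

A configure-time probe would compile and run something like this first, then go on to identify the specific core (the "detect the CPU cores" step), which this sketch does not attempt.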
commit 2947cfb749c937b0f62fac36cc92f123bd45b53c Author: Zhang Xianyi Date: Wed Apr 1 12:24:00 2015 -0500 Add auto-detecting CPU on configure stage. e.g. /Path_to_BLIS/configure auto Now, it only supports detecting x86 CPUs. commit 26a4b8f6f985597f80e0174990bf541f1d9bafac Author: Field G. Van Zee Date: Wed Apr 1 10:44:54 2015 -0500 Implemented 3m2, 3m3 induced algorithms (gemm only). Details: - Defined a new "3ms" (separated 3m) pack schema and added appropriate support in packm_init(), packm_blk_var2(). - Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p) as an argument instead of computing it locally. Exception: for trmm, is_p must be computed locally, since it changes for triangular packed matrices. Also exposed is_p in interface to dt-specific packm_blk_var2 (and _var1, even though it does not use imaginary stride). - Renamed many functions/variables from _3mi to _3mis to indicate that they work for either interleaved or separated 3m pack schemas. - Generalized gemm and herk macro-kernels to pass in imaginary stride rather than compute them locally. - Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2- and 3m3-specific virtual micro-kernels. - Added special gemm macro-kernels to support 3m2 and 3m3. - Added support for 3m2 and 3m3 to testsuite. - Corrected the type of the panel dimension (pd_) in various macro-kernels from inc_t to dim_t. - Renamed many functions defined in bli_blocksize.c. - Moved most induced-related macro defs from frame/include to frame/ind/include. - Updated the _ukernel.c files so that the micro-kernel function pointers are obtained from the func_t objects rather than the cpp macros that define the function names. - Updated test/3m4m driver, Makefile, and run script.
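Illustration (not part of the commit above): the 3m-family methods named in this log rest on computing a complex product with three real multiplications instead of four, using the real parts, the imaginary parts, and their sums. The scalar sketch below shows only that arithmetic identity. The type cplx_t and the functions mul_4m(), mul_3m(), and check_3m_matches_4m() are hypothetical names for this example; the BLIS implementation works on packed micro-panels and real-domain micro-kernels, which this sketch does not model.

```c
#include <assert.h>

/* Hypothetical illustration type; BLIS defines its own complex types. */
typedef struct { double re, im; } cplx_t;

/* Conventional product: four real multiplications. */
static cplx_t mul_4m(cplx_t a, cplx_t b)
{
    cplx_t c = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return c;
}

/* 3m-style product: three real multiplications plus extra additions. */
static cplx_t mul_3m(cplx_t a, cplx_t b)
{
    double p_rr  = a.re * b.re;                    /* real * real          */
    double p_ii  = a.im * b.im;                    /* imag * imag          */
    double p_sum = (a.re + a.im) * (b.re + b.im);  /* summed parts product */
    cplx_t c = { p_rr - p_ii, p_sum - p_rr - p_ii };
    return c;
}

/* Assert that both forms agree to within tol; in general they can differ
   by floating-point rounding because the operation order is different. */
static void check_3m_matches_4m(cplx_t a, cplx_t b, double tol)
{
    cplx_t x = mul_4m(a, b);
    cplx_t y = mul_3m(a, b);
    double dr = x.re - y.re;
    double di = x.im - y.im;
    assert(dr <= tol && -dr <= tol);
    assert(di <= tol && -di <= tol);
}
```

For example, (1 + 2i)(3 + 4i) gives p_rr = 3, p_ii = 8, p_sum = 21, so the result is -5 + 10i, matching the four-multiply form.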
commit ddf62ba7d2da08225b201585b85e06c967767dea Author: Tyler Smith Date: Fri Mar 27 14:27:51 2015 -0500 Refuse to free the packm thread info if it uses the single threaded version commit 016fc587584d958a0e430a56a5e2c05022ac2f17 Author: Tyler Smith Date: Fri Mar 27 14:23:02 2015 -0500 Don't free packm thread info if it is null commit 00a443c529a60862a57b93e303a0b3212c9b1df4 Author: Tyler Smith Date: Fri Mar 27 14:11:07 2015 -0500 Use bli_malloc instead of malloc for the thread info paths commit f1a6b7d02861ccebdc500ea98778cc0f6cddad17 Author: Field G. Van Zee Date: Wed Mar 18 15:37:10 2015 -0500 Reorganized code for induced complex methods. Details: - Consolidated most of the code relating to induced complex methods (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods are now enabled on a per-operation basis. The current "available" (enabled and implemented) implementation can then be queried on an operation basis. Micro-kernel func_t objects as well as blksz_t objects can also be queried in a similar manner. - Redefined several micro-kernel and operation-related functions in bli_info_*() API, in accordance with above changes. - Added mr and nr fields to blksz_t object, which point to the mr and nr blksz_t objects for each cache blocksize (and are NULL for register blocksizes). Renamed the sub-blocksize field "sub" to "mult" since it is really expressing a blocksize multiple. - Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and trsm to correctly query mr and nr (for purposes of nudging kc). - Introduced an enumerated opid_t in bli_type_defs.h that uniquely identifies an operation. For now, only level-3 id values are defined, along with a generic, catch-all BLIS_NOID value. - Reworked testsuite so that all induced methods that are enabled are tested (one at a time) rather than only testing the first available method.
- Reformatted summary at the beginning of testsuite output so that blocksize and micro-kernel info is shown for each induced method that was requested (as well as native execution). - Reduced the number of columns needed to display non-matlab testsuite output (from approx. 90 to 80). commit 8d5169ccda954e5f72944308a036dcb7ebfc9097 Author: Field G. Van Zee Date: Wed Mar 18 11:38:08 2015 -0500 Fixed bug in release of mem_t buffer. Details: - Fixed a bug that affects all level-2 and level-3 blocked variants. The bug only manifested, however, if the packing of operands (A and B in gemm, for example) spanned multiple nodes in the control tree. Until recently, the main consumers of packm were level-3 operations, all of which packed both input operands from blocked variant 1 (B outside of the loop, and A within the loop). This particular usage masked a flaw in the code whereby bli_obj_release_pack() would always release the underlying mem_t buffer (provided it was allocated), even if the buffer was not allocated in the current variant. This has been fixed by replacing all calls to bli_obj_release_pack() with calls to a new function, bli_packm_release(), which takes the same control tree node argument passed into the object's corresponding call to packm_init() or packv_init(). bli_packm_release() then proceeds to invoke bli_obj_release_pack() only if the control tree node indicates that packing was requested. Thanks to Devangi Parikh for identifying this bug. commit c0acca0f5182ba96fd39c9d10b34a896a6e74206 Author: Field G. Van Zee Date: Tue Mar 3 10:56:22 2015 -0600 Clarified comments in testsuite input.operations. commit 03ba9a6b17861d9e1adc0cf924439c4d7e860d19 Author: Field G. Van Zee Date: Tue Feb 24 10:33:28 2015 -0600 Removed some 'old' directories. commit a86db60ee270cdeb745ae7cf68f9e0becc9f522d Author: Field G. Van Zee Date: Mon Feb 23 18:42:39 2015 -0600 Extensive renaming of 3m/4m-related files, symbols.
Details: - Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi ('i' for "interleaved"). Similar changes to 3M/4M macros. - Renamed all 3m/4m files and functions to 3m1/4m1. - Whitespace changes. commit 8cf8da291a0fb2f491f410969a76ec0fbda47faf Author: Field G. Van Zee Date: Fri Feb 20 15:24:27 2015 -0600 Minor updates to induced complex mode management. Details: - Relocated bli_4mh.c, bli_4mb.c, bli_4m.c, bli_3mh.c, bli_3m.c (and associated headers) from frame/base to frame/base/induced. - Added bli_xm.? to frame/base/induced, which implements bli_xm_is_enabled(), which detects whether ANY induced complex method is currently enabled. - The new function bli_xm_is_enabled() is now used in bli_info.c to detect when an induced complex method is used, so we know when to return blocksizes from one of the induced methods' blocksize objects. commit 411e637ee7d1083a84f58f08938d51e63d7c3c9a Merge: c2569b88 fc0b7712 Author: Tyler Michael Smith Date: Fri Feb 20 20:39:25 2015 -0600 Merge branch 'master' of http://github.com/flame/blis commit c2569b8803d4ccc1d7b6f391713461b51443601d Author: Tyler Michael Smith Date: Fri Feb 20 20:38:19 2015 -0600 Fixed a memory leak in freeing the thread infos commit fc0b771227abf86d81f505b324f69f6e83db1d8f Author: Field G. Van Zee Date: Fri Feb 20 11:47:44 2015 -0600 Added max(mr,nr) to kc in static mem pools. Details: - Changed the static memory definitions to compute the maximum register blocksize for each datatype and add it to kc when computing the size of blocks of A and B. This formally accounts for the nudging of kc up to a multiple of mr or nr at runtime for triangular operations (e.g. trmm). commit af32e3a608631953ef770341df10a14a991bf290 Author: Tyler Michael Smith Date: Thu Feb 19 22:51:11 2015 -0600 Fixed a bug where get_range_weighted would return end = 0 for small problem sizes commit 441d47542a64e131578d00da7404c1ed387a721c Author: Field G.
Van Zee Date: Thu Feb 19 17:06:10 2015 -0600 Renamed 3m and 4m symbols/macros to 3mi and 4mi. Details: - Renamed several variables and macros from 3m/4m to 3mi/4mi. This is because those packing schemas were always implicitly "interleaved". This new naming scheme will make way for new schemas that separate instead of interleave the real and imaginary (and summed) parts. - Expanded the pack format sub-field of the pack schema field of the info_t to 4 bits (from 3). This will allow for more schema types going forward. - Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m. commit 518a1756ccf02122b96fc437b538604a597df42a Author: Field G. Van Zee Date: Thu Feb 19 14:27:09 2015 -0600 Fixed indexing bug for trmm3 via 3mh, 4mh. Details: - Fixed a bug that only affected trmm3 when performed via 3mh or 4mh, whereby micro-panels of the triangular matrix were packed with "dead space" between them due to failing to adjust for the fact that pointer arithmetic was occurring in units of complex elements while the data being packed consisted of real elements. It turns out that the macro-kernel suffered from the same bug, meaning the panels were actually being packed and read consistently. The only way I was able to discover the bug in the first place was because the packed block of A was overflowing into the beginning of the packed row panel of B using the sandybridge configuration. commit 493087d730f01d5169434f461644e5633f48a42f Merge: 650d2a6f 25021299 Author: Field G. Van Zee Date: Wed Feb 18 09:45:51 2015 -0600 Merge branch 'master' of github.com:flame/blis commit 25021299b670775df8ca9c87910c63d7e74ed946 Merge: fe2b8d39 f05a5763 Author: Field G. Van Zee Date: Wed Feb 11 20:03:21 2015 -0600 Merge branch 'master' of github.com:flame/blis commit fe2b8d39a445ac848686e78c7540fd046cb95492 Author: Field G. Van Zee Date: Wed Feb 11 19:33:10 2015 -0600 Fixed an obscure bug in 3mh/3m/4mh/4m packing.
Details: - Modified bli_packm_blk_var1.c and _var2.c to increase the triangular case's panel increment by 1 if it would otherwise be odd. This is particularly necessary in _var2.c when handling the interleaved 3m or ro/io/rpi pack schemas, since division of an odd number by 2 can happen if both the panel length and the panel packing dimension (register packing blocksize) are odd, thus making their product odd. - Modified bli_packm_init.c so that panel strides are increased by 1 if they would otherwise be odd, even for non-3m related packing. - Modified the trmm and trsm macro-kernels so that triangular packed micro-panels are traversed with this new "increment by 1 if odd" policy. - Added sanity checks in trmm and trsm macro-kernels that would result in an abort() if the conditions that would lead to a "divide odd integer by 2" scenario ever manifest. - Defined bli_is_odd(), _is_even() macros in bli_scalar_macro_defs.h. commit 650d2a6ff2e593151a296ca86b5214afcc747afc Author: Field G. Van Zee Date: Mon Feb 9 14:59:20 2015 -0600 Added initial support for imaginary stride. Details: - Added an imaginary stride field ("is") to obj_t. - Renamed bli_obj_set_incs() macro to bli_obj_set_strides(). - Defined bli_obj_imag_stride() and bli_obj_set_imag_stride() and added invocations in key locations. - Added some basic error-checking related to imaginary stride. - For now, imaginary stride will not be exposed into the most-used BLIS APIs such as bli_obj_create(), and certainly not the computational APIs such as bli_dgemm(). commit f05a57634a7c8e3864b25b3335d1194c1ea1aeb9 Author: Field G. Van Zee Date: Sun Feb 8 19:40:34 2015 -0600 Defined gemm cntl function to query ukrs func_t. Details: - Added a new function, bli_gemm_cntl_ukrs(), that returns the func_t* for the gemm micro-kernels from the leaf node of the control tree. 
This allows all the func_t* fields from higher-level nodes in the tree to be NULL, which makes the function that builds the control trees slightly easier to read. - Call bli_gemm_cntl_ukrs() instead of the cntl_gemm_ukrs() macro in all bli_*_front() functions (which is needed to apply the row/column preference optimization). - In all level-3 bli_*_cntl_init() functions, changed the _obj_create() function arguments corresponding to the gemm_ukrs fields in higher-level cntl tree nodes to NULL. - Removed some old her2k macro-kernels. commit cefd3d5d2001264de17cf63dae541f890cb9daaf Author: Tyler Smith Date: Thu Feb 5 11:09:12 2015 -0600 A couple of functions were incorrectly ifdeffed away on Xeon Phi. Fixed this commit 7574c9947d57a19f613880e3b9f62f8c8f6df4ec Author: Field G. Van Zee Date: Wed Feb 4 12:11:55 2015 -0600 Added basic flop-counting mechanism (level-3 only). Details: - Added optional flop counting to all level-3 front-ends, which is enabled via BLIS_ENABLE_FLOP_COUNT. The flop count can be reset at any time via bli_flop_count_reset() and queried via bli_flop_count(). Caveats: - flop counts are approximate for her[2]k, syr[2]k, trmm, and trsm operations; - flop counts ignore extra flops due to non-unit alpha; - flop counts do not account for situations where beta is zero. commit ceda4f27d1f1bcf19320e09848e0f2e3b9941e6c Author: Field G. Van Zee Date: Thu Jan 29 13:22:54 2015 -0600 Implemented bli_obj_imag_equals(). Details: - Implemented a new function, bli_obj_imag_equals(), which compares the imaginary part of the first argument to the second argument, which may be a BLIS_CONSTANT or of a regular real datatype. commit 81114824a05a9053229efd577a8a94a856deda93 Author: Field G. Van Zee Date: Tue Jan 6 12:15:21 2015 -0600 Minor 4m/3m consolidation to mem_pool_macro_defs.h. Details: - Merged the 4m and 3m definitions in bli_mem_pool_macro_defs.h to reduce code and improve readability.
commit 36a9b7b7436d9423ba4de2a9f85cfcd43577b783 Author: Tyler Michael Smith Date: Wed Dec 17 21:53:50 2014 +0000 reduced the default number of MC by KC blocks for bgq commit c60619c7c3568f044a849abbab60209aa7455423 Author: Field G. Van Zee Date: Tue Dec 16 17:08:22 2014 -0600 Minor tweaks for 3m4m test drivers. Details: - Changed gemm_kc blocksizes to be reduced by two-thirds instead of half. - Changed 3m4m/test_gemm.c driver to divide by 3 instead of 2 when computing the fixed k dimension. - Fixed runme.sh so that it would use multiple threads for s/dgemm cases. commit c6929ba6a5e6f633a7295e979a2b8df8c7ecdb1b Author: Field G. Van Zee Date: Tue Dec 16 11:27:50 2014 -0600 Added 4m_1b to test/3m4m test driver and script. commit 785d480805fc0d6f4251b5499933515740b6b2a7 Merge: 9456f330 4156c088 Author: Field G. Van Zee Date: Fri Dec 12 14:34:19 2014 -0600 Merge branch 'master' of github.com:flame/blis commit 9456f330af4617f9ee32972d51f974aa2d84f97b Author: Field G. Van Zee Date: Fri Dec 12 14:31:57 2014 -0600 Added 4m_1b implementation for gemm. Details: - Added yet another 4m-based implementation for complex domain level-3 operations. This method, which the 3m/4m paper identifies as Algorithm "4m_1b", fissures the first loop around the micro-kernel so that the real sub-panel of the current micro-panel of B is multiplied against (both sub-panels of) all micro-panels of A, before doing the same for the imaginary sub-panel of the micro-panel of B. For now, only gemm is supported, and 4m_1b (labeled "4mb" within the framework) is not yet integrated into the test suite. commit 4156c0880d9aea4ff04a9c4fa139ba8c437d8bfb Author: Field G. Van Zee Date: Tue Dec 9 16:03:14 2014 -0600 Fixed obscure level-2 packing / general stride bug. Details: - Fixed a bug in certain structured level-2 operations that manifested only when the structured matrix was provided to BLIS as a matrix stored with general stride.
The bug was introduced in c472993b when the densify field was removed from the packm control tree node and associated APIs. Since then, the packed object was unconditionally marked with an uplo field of BLIS_DENSE. This is fine for level-3 operations where micro-panels are always densified, but in level-2 contexts, the underlying unblocked variant (fused or unfused) of structured operations (e.g. trmv) still needs to know whether to execute its "lower" or "upper" branches of code. Since this field was unconditionally being set to BLIS_DENSE, the unblocked variants were always executing the "else" branch, which happened to be the "lower" case code. Thus, running an upper case produced the wrong answer. This most obviously manifested in the form of failures for trmm, trmm3, and trsm in the test suite. The bug was fixed by setting the packed object's uplo field to BLIS_DENSE only if the schema indicated that micro-panels were to be packed. Otherwise, we can assume we are packing to regular row or column storage, as is the case with level-2 packing. Thanks to Francisco Igual for reporting the test suite failures and ultimately leading us to this bug. commit 689f60a578b461119e9ea90c74f642b9eb79addb Merge: bef24e67 483e4d6a Author: Field G. Van Zee Date: Sun Dec 7 14:03:30 2014 -0600 Merge pull request #21 from figual/master Adding armv8a configuration and micro-kernels. commit 483e4d6a3fdbef9d9ab47fb674c9476c70ca9f0f Author: Francisco D. Igual Date: Sun Dec 7 20:27:49 2014 +0100 Adding armv8a configuration and micro-kernels. Only sgemm micro-kernel is fully functional at this point. commit bef24e67e0f93579c2a80315348dc2e227f72a72 Author: Tyler Smith Date: Wed Nov 26 18:00:56 2014 -0600 Fixed a type of race condition exposed by pthreads implementation. Lead thread of the inner thread communicator could exit subproblem, move on to the next iteration of the loop, and modify a1_pack, b1_pack, or c1_pack while other threads were still using those.
Barriers were inserted to fix this. commit 76bde44411f0e34266bab9d666a54ef22be97320 Merge: e56e6143 f3d729e5 Author: Field G. Van Zee Date: Wed Nov 26 17:25:24 2014 -0600 Merge branch 'master' of github.com:flame/blis commit f3d729e504ec012e7dc7e02b2ecd42e004c6894d Author: Tyler Michael Smith Date: Wed Nov 26 22:25:24 2014 -0600 Added static mutex to bli_init and bli_finalize commit d71cc797866ff502ad1127527016f463267eef80 Author: Tyler Michael Smith Date: Wed Nov 26 21:35:39 2014 -0600 Refactored bli_threading files and added support for pthreads commit e56e61438ff7fcf25a48c0b7603f18df782b50b6 Author: Field G. Van Zee Date: Wed Nov 26 17:20:35 2014 -0600 Minor cleanups to bli_threading.h and friends. Details: - No longer need to define BLIS_ENABLE_MULTITHREADING manually in bli_config.h; it now gets defined when BLIS_ENABLE_OPENMP or BLIS_ENABLE_PTHREADS is defined. - Added sanity check to prevent both BLIS__ENABLE_OPENMP and BLIS_ENABLE_PTHREADS from being enabled simultaneously. - Reorganization of bli_threading*.h header files, which led to simplification of threading-related part of blis.h. - Added "-fopenmp -lpthread" to LDFLAGS of sandybridge make_defs.mk file. commit 3be2744cbe2c56d38c23fd818aa5c1f10cc7ea51 Author: Field G. Van Zee Date: Fri Nov 21 12:28:08 2014 -0600 Update to template gemm ukernel comments. Details: - Updated comments on alignment of a1 and b1 to match wiki. commit 994429c6881b2ade92d9d7949bcaebfbf2cc65eb Merge: 58796abd 694029d9 Author: Field G. Van Zee Date: Thu Nov 20 13:55:35 2014 -0600 Merge pull request #20 from TimmyLiu/master #define PASTEF773 required by cblas compatibility layer commit 694029d9d7db857d642ab536955c0621791108c8 Author: Timmy Date: Wed Nov 19 15:25:14 2014 -0600 #define PASTEF773 required by cblas compatibility layer commit 58796abda66b133346f8d523b39178afc336351f Author: Field G. Van Zee Date: Thu Nov 6 14:31:52 2014 -0600 Removed KC constraint comments from _kernel.h files.
Details: - Since 4674ca8c, the constraint that KC be a multiple of both MR and NR has been relaxed, and thus it was time to remove the comments from the top of the bli_kernel.h files of all configurations. commit 7bbc95a54f706d43c7f7951f0e5995f86130cd52 Author: Field G. Van Zee Date: Wed Oct 29 10:52:23 2014 -0500 Added new piledriver micro-kernels. Details: - Added new micro-kernels for the AMD piledriver architecture (one for each datatype). - Updates and tweaks to piledriver configuration. - Added 3xk packm micro-kernel support. - Explicitly unrolled some of the smaller packm micro-kernels. - Added notes to avx/sandybridge and piledriver micro-kernel files acknowledging the influence of the corresponding kernel code in OpenBLAS. commit 59613f1d5500f6279963327db2fbc84bc9135183 Author: Field G. Van Zee Date: Thu Oct 23 17:21:37 2014 -0500 Added separate micro-panel alignment for A and B. Details: - Changed the recently-added micro-panel alignment macros so that we now have two sets--one for micro-panels of matrix A and one for micro-panels of matrix B: BLIS_UPANEL_[AB]_ALIGN_SIZE_?. - Store each set of alignment values into a separate blksz_t object in bli_gemm_cntl_init(). - Adjusted packm_init() to use the separate alignment values. - Added query routines for the new alignment values to bli_info.c. - Modified test suite output accordingly. commit a8e12884ee1fddd3fd77ca5a68aa0cb857f3af57 Author: Field G. Van Zee Date: Thu Oct 23 11:35:48 2014 -0500 CHANGELOG update (0.1.6) commit 38ea5022e4ed846112198c4e1672fcdaeb90dc71 (tag: 0.1.6) Author: Field G. Van Zee Date: Thu Oct 23 11:35:45 2014 -0500 Version file update (0.1.6) commit a3e6341bdb0e28411f935d6b4708a6389663e004 Author: Field G. Van Zee Date: Thu Oct 23 11:13:28 2014 -0500 Factored common code from blocksize functions. Details: - Split bli_determine_blocksize_[fb]() into two functions each, the newer ones ending with the _sub suffix.
These new sub-functions are now called from bli_[gemm|trmm|trsm]_determine_kc_[fb](), which eliminates redundant code and will allow any future tweaks to the core sub-functions to automatically be inherited by the operation-specific versions. commit 4674ca8cffb58331ff7edf23bbe0e3f6a7558489 Author: Field G. Van Zee Date: Thu Oct 23 10:50:59 2014 -0500 Extended newly relaxed KC to hemm, symm. Details: - These changes were intended for the previous commit. - Defined bli_gemm_determine_kc_[fb]() and bli_gemm_determine_kc_[fb](), which determine blocksizes for gemm-based operations, taking special care to "nudge" the kc dimension up to a multiple of MR or NR for hemm and symm operations, as needed. - Changed bli_gemm_blk_var3f.c to call bli_gemm_determine_kc_f() instead of bli_determine_blocksize_f(). - Comment updates to bli_trmm_blocksize.c, bli_trsm_blocksize.c. commit ab954ba6f874eaca7b001804491f866ef6b9b327 Author: Field G. Van Zee Date: Wed Oct 22 17:21:58 2014 -0500 Relaxed constraint that KC be multiple of MR, NR. Details: - Relaxed a long-held requirement in register blocksizes that required the kernel programmer to choose a KC that was divisible by both MR and NR. This was very constraining on some architectures that did not use register blocksizes that were powers of two. The constraint is now enforced only for trmm and trsm, where it is needed, and it is now handled by "nudging" kc upward at runtime, if necessary, to be a multiple of MR or NR, as needed. - Defined bli_trmm_determine_kc_[fb]() and bli_trsm_determine_kc_[fb](), which determine blocksizes for trmm and trsm, taking special care to "nudge" the kc dimension up to a multiple of MR or NR, as needed. - Changed bli_trmm_blk_var3[fb].c to call bli_trmm_determine_kc_[fb]() instead of bli_determine_blocksize_[fb](). - Added safeguard to bli_align_dim_to_mult() that returns the dimension unmodified if the dimension multiple is zero (to avoid division by zero).
- Removed cpp guard/check for KC % MR == 0 and KC % NR == 0 from bli_kernel_macro_defs.h. - Whitespace, variable name changes to bli_blocksize.c. - Removed old commented code from bli_gemm_cntl.c. commit 95cdae65d6b88e043ee14bcd53cd2e800d7aecb4 Author: Tyler Smith Date: Wed Oct 22 16:30:16 2014 -0500 Fixed bug in KNC microkernel where k=0 and beta != 1 commit e64dba5633fc49b768b5edc7762f2b5d8a4d0588 Author: Field G. Van Zee Date: Mon Oct 20 19:23:06 2014 -0500 Re-implemented micro-panel alignment. Details: - This commit re-implements a feature that was removed in commit c2b2ab62. It was removed because, at the time, I wasn't sure how the micro-panel alignment feature would interact with the 4m method (when applied at the micro-kernel level), and so it seemed safer to disable the feature entirely rather than allow possible breakage. This commit revisits the issue and safely re-implements the feature in a way that is compatible with 4m, 3m, 4mh, and 3mh (and native execution). - Modified the static memory pool to account for micro-panel alignment space. - Modified packm_init and blocked variants to align whole micro-panels by a datatype-specific alignment value that may be set by the configuration. (If it is not set by the configuration, it will default to BLIS_SIZEOF_?.) - Modified macro-kernels so that: - storage stride is handled properly given the new micro-panel alignment behavior; - indexing through 3m/4m/rih-type sub-panels, as is done by trmm and trsm, is more robust (e.g. will work if the applicable packing register blocksize is odd); - imaginary strides are computed and stored within auxinfo_t structs, which allows the virtual micro-kernels to more easily determine how to index into the micro-panel operands. - Modified virtual 3m and 4m micro-kernels to use the imaginary strides within the auxinfo_t structs instead of panel strides. - Deprecated the panel stride fields from the auxinfo_t structs.
- Updated test suite to print out the micro-panel alignment values. commit add16b0e5402924301e7078e4ca5e3ef725bff0b Author: Field G. Van Zee Date: Fri Oct 17 11:49:24 2014 -0500 Added 3m4m test driver subdir of 'test'. Details: - Added a modified test driver for [cz]gemm that will test all 3m/4m as well as assembly-based and OpenBLAS implementations of gemm in single and multithreaded modes. commit e171504a72406c61a173241d8bccf0a5ceb10582 Author: Field G. Van Zee Date: Fri Oct 17 11:25:59 2014 -0500 Use correct definition of bli_is_last_iter(). Details: - As intended for previous commit, the new definition of bli_is_last_iter() is now disabled in favor of the old definition. commit 0d954087b2b55d2f5f3c5e57d702b318ca2300f6 Author: Field G. Van Zee Date: Fri Oct 17 11:19:34 2014 -0500 Minor changes and fixes. Details: - Redefined bli_is_last_iter() to take thread_id and num_thread arguments, which allows the macro to correctly compute whether a given iteration is the last that the thread will compute in that particular loop. The new definition, however, remains disabled (commented out) until someone can look at this more closely, as the new definition seems to actually hurt performance slightly. - Whitespace and related updates to level-3 macro-kernels. - Updated test suite so that performance results in the hundreds of gigaflops does not disrupt the column alignment of the output. commit d1e86e1876e433f54b501ec5a005b4ba7c5ce4e6 Author: Field G. Van Zee Date: Sun Oct 12 13:43:47 2014 -0500 More minor tweaks to sandybridge/avx micro-kernel. Details: - Re-enabled use of b_next for dgemm and cgemm micro-kernels. commit 7b6fe4cae57cb22c09c1a97595e1a201a02cbcd2 Author: Field G. Van Zee Date: Sun Oct 12 12:01:51 2014 -0500 Minor tweaks to sandybridge/avx micro-kernels. Details: - Changed the MC blocksize for zgemm micro-kernel from 128 to 64. - Removed usage of b_next in all x86_64/avx gemm micro-kernels. commit a6a156e9feec47154e7a0fd43bcc006b1fc04aba Author: Field G. 
Van Zee Date: Fri Oct 10 14:26:41 2014 -0500 Added cgemm ukernel for avx/sandybridge. Details: - Implemented AVX-based cgemm micro-kernel (via GNU extended inline assembly syntax). - Updated sandybridge configuration accordingly. commit 6f8575ab2580e167a022293b76ddf0514f71b613 Author: Field G. Van Zee Date: Fri Oct 10 10:01:45 2014 -0500 Added zgemm ukernel for avx/sandybridge. Details: - Implemented AVX-based zgemm micro-kernel (via GNU extended inline assembly syntax). - Updated sandybridge configuration accordingly. commit 23ce7ee542a12ca40b4b6090ad2558d180e16d37 Merge: 99fd9a39 7a8ad47f Author: Field G. Van Zee Date: Thu Oct 9 16:41:22 2014 -0500 Merge branch 'master' of github.com:flame/blis commit 99fd9a39718cb7281f6fb23f9fef7cca4fe514f4 Author: Field G. Van Zee Date: Thu Oct 9 16:38:04 2014 -0500 Fixed two minor bugs. Details: - Fixed a bug in the test suite for the trsm_ukr and gemmtrsm_ukr test modules whereby the uplo bits of some packed matrix objects were not being set properly, resulting in false FAILURE results for those tests. Thanks to Tyler Smith for bringing this issue to my attention. - Fixed a bug in bli_obj_alloc_buffer() that caused an unnecessary "not yet implemented" abort() when creating a 1x1 object with non-unit strides. commit 7a8ad47fb2d100a9da93aa8cab774fcceeaab733 Author: Tyler Smith Date: Wed Oct 8 15:52:13 2014 -0500 Minor changes to knc configuration, including preference row major storage Also fixed a bug in the knc micro-kernel where it would fail if k == 0 commit 76b7c34af0c09f47d9615b18857a356acddc788a Author: Field G. Van Zee Date: Thu Oct 2 14:15:38 2014 -0500 Fixed a bug in the pack schema-related bit macros. Details: - Expanded the BLIS_PACK_SCHEMA_BITS value in bli_type_defs.h to include all six bits presently used in the pack schema bitfield of the info field of obj_t structs. Prior to this commit, the macro constant only included the lowest five bits, which excluded the "is or is not packed" bit. 
This manifested as a strange bug in probably many level-2 codes that invoked packing, though we only observed it in ger before fixing. Thanks to Devin Matthews for finding and reporting this bug. commit a5763e332226598d70c47dfa9cad4578e15ef5f4 Author: Field G. Van Zee Date: Thu Oct 2 13:28:17 2014 -0500 Added extra output to bli_obj_print(). Details: - Print extra values from info field of obj_t struct within bli_obj_print(). commit 9bba209fc44fbfce943ba6a51cd8278a0cb6b159 Author: Tyler Smith Date: Mon Sep 29 14:56:36 2014 -0500 Fixed bug when packing anywhere besides in blk_var_1 for gemm. commit 614a4afc9272adb47e5a8b83b39d56c2804d95d6 Merge: b541b667 4a7df04e Author: Tyler Smith Date: Fri Sep 26 10:49:57 2014 -0500 Merge branch 'master' of http://github.com/flame/blis commit 4a7df04e8a4ffdb9561d26426afd35e4fe15b013 Author: Field G. Van Zee Date: Mon Sep 22 16:06:15 2014 -0500 Added 30xk support for packm ukernels. Details: - Updated bli_kernel_*_macro_defs.h headers to include default definitions for 30xk packm kernels. - Extended function pointer arrays in bli_packm_cxk_*() out to 31 and included 30xk kernels. - Added 30xk kernels to frame/1m/packm/ukernels/bli_packm_ref_cxk_*.c. commit b6d4bd792e0d44ce4b28afef343f5ff3ba89c285 Author: Field G. Van Zee Date: Mon Sep 22 16:02:37 2014 -0500 Fixed missing tabs from Makefile patch. commit 32630f9b6f0d5ba28d5b56dae4c7288a37158743 Author: Field G. Van Zee Date: Fri Sep 19 17:18:20 2014 -0500 Comment update to virtual micro-kernels. commit 13447cffead7c6d137a7a3ccbf9e552ed0477467 Author: Field G. Van Zee Date: Fri Sep 19 13:00:48 2014 -0500 Minor bugfix to top-level Makefile. Details: - Applied a patch that allows the top-level Makefile to work on certain systems. The patch simply separates out the source-to-object code generation rules for .c and .S files into two separate rules. Thanks to Devin Matthews for submitting this patch. commit e80a4537846416719c067ae08a53aeda978c572d Author: Field G.
Van Zee Date: Thu Sep 18 10:24:20 2014 -0500 Fixed bug introduced by bugfix in 25b258d. Details: - We actually need to check alignment of lda*sizeof(double) and NOT a+lda because in the latter case, alignment could cancel out and still allow the optimized code to run when it shouldn't. Thanks to Devin for pointing this out. commit 25b258d61f9c8cee64e922f4131784b6edb196dd Author: Field G. Van Zee Date: Thu Sep 18 10:10:49 2014 -0500 Fixed a non-fatal problem with bugfix in a68b316c. Details: - The bugfix in a68b316c was inadvertently checking alignment of the leading dimension itself, rather than the byte size of the leading dimension. Now, we simply check alignment of a+lda. commit 96302d4fc81363410e41c3a3c43a65df44d97ad9 Author: Field G. Van Zee Date: Thu Sep 18 09:43:40 2014 -0500 Renamed bli_info_get_*_ukr_type() functions. Details: - Added _string() suffix to bli_info_get_*_ukr_type() function names. This makes them consistent with the bli_info_get_*_impl_string() functions. commit a68b316ca4852509f84ed50e01afac486bf70f58 Author: Field G. Van Zee Date: Wed Sep 17 11:10:07 2014 -0500 Fixed alignment bugs in level-1f kernels. Details: - Fixed bugs whereby the level-1f dotxf, axpyxf, and dotxaxpyf kernels were attempting to compute problems with unaligned leading dimensions with optimized code, rather than (correctly) using the reference implementations. Thanks to Devin Matthews for reporting this bug. commit 870761eb902e4866090d1d3446a345df3d6d4599 Merge: e9899be0 a2b59a37 Author: Field G. Van Zee Date: Tue Sep 16 18:20:49 2014 -0500 Merge branch 'master' of github.com:flame/blis commit e9899be09044829e23386bd73e394f1dd7778210 Author: Field G. Van Zee Date: Tue Sep 16 18:19:32 2014 -0500 Added high-level implementations of 4m, 3m. Details: - Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at high levels, respectively.
APIs for trmm and trsm were NOT added due to the fact that these approaches are inherently incompatible with implementing 4m or 3m at high levels (because the input right-hand side matrix is overwritten). - Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and 3m so that all are stylistically consistent. - Added new "rih" packing kernels (both low-level and structure-aware) to support both 4mh and 3mh. - Defined new pack_t schemas to support real-only, imaginary-only, and real+imaginary packing formats. - Added various level0 scalar macros to support the rih packm kernels. - Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh. - Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in that order) and execute the first one that is enabled, or the native implementation if none are enabled. - Added implementation query functions for each level-3 operation so that the user can query a string that describes the implementation that is currently enabled. - Updated test suite to output implementation types for each level-3 operation, as well as micro-kernel types for each of the five micro-kernels. - Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX. - Fixed an obscure bug when packing Hermitian matrices (regular packing type) whereby the diagonal elements of the packed micro-panels could get tainted if the source matrix's imaginary diagonal part contained garbage. commit a2b59a37f166f70a6dd5793db2530823ef590c2b Author: Tyler Smith Date: Mon Sep 15 10:44:44 2014 -0500 Fixed make defs so that they actually compile for bulldozer commit 86fc7e40764f78ec217f50216ef4fa5b57dbfbc7 Author: Tyler Smith Date: Mon Sep 15 10:35:46 2014 -0500 Added bulldozer configuration and updated piledriver micro-kernel commit 0644e61a79a57f136be5f4c47b9099cff2af06e0 Author: Field G. Van Zee Date: Thu Sep 11 12:55:34 2014 -0500 Minor updates to bli_packm_init.c.
commit 9dc9b44a057a08e20ad4d423344f0ecad54c1eb2 Author: Field G. Van Zee Date: Thu Sep 11 12:03:28 2014 -0500 Renamed bli_obj_pack_status() to _pack_schema(). Details: - Renamed the bli_obj_pack_status() macro to bli_obj_pack_schema() in order to help avoid confusion as to what the macro returns. commit cf5efdde0588a0d5b6ea57fe7d7be5000be06f8e Author: Field G. Van Zee Date: Thu Sep 11 11:47:56 2014 -0500 Pass pack_t schemas into ukernels via auxinfo_t. Details: - Modified macro-kernels to pass the pack_t schema values for matrices A and B into the datatype-specific functions, where they are now inserted into a newly-expanded auxinfo_t struct. This gives the micro-kernels access to the pack_t schema values embedded in the control trees, which determine the precise format into which the matrix elements are packed. - Updated a call to bli_packm_init_pack() in src/test_libblis.c to remove densify argument. Meant to include this in commit c472993b. commit cc8d2b82775cca3c2d51bf427f4e77c8024a6d15 Author: Field G. Van Zee Date: Tue Sep 9 13:48:22 2014 -0500 Updated old test drivers in 'test'. commit c472993bbccb69e9ffc409c79b742426c8ad2ad4 Author: Field G. Van Zee Date: Tue Sep 9 13:42:04 2014 -0500 Removed densify argument to packm_cntl_obj_create(). Details: - Removed the "densify" bool_t argument to bli_packm_cntl_obj_create(). This argument was inserted very early in BLIS's development, when it was anticipated that the developer may sometimes wish to pack a Hermitian, symmetric, or triangular matrix without making it dense. But as it turns out, if we are packing a matrix, we always want to make it dense in some way or another due to the fact that the micro-kernel only multiplies dense micro-panels. Thus, unless/until there is a real need for the feature, it seems reasonable to remove it from the packm_cntl API. commit 5c43ee387146cd76dc59b730dac6683a8446b834 Author: Field G.
Van Zee Date: Mon Sep 8 15:19:29 2014 -0500 Moved trmm4m/3m_cntl files to 'old' directory. Details: - Meant to include this in previous commit. commit 7b2f469d5465ed73b1ca88124bc9a1987388aa27 Author: Field G. Van Zee Date: Mon Sep 8 14:49:50 2014 -0500 Retired trmm_t control tree definitions, usage. Details: - Replaced all trmm_t control tree instances and usage with that of gemm_t. This change is similar to the recent retirement of the herk_t control tree. - Tweaked packm blocked variants so that the triangular code does NOT assume that k is a multiple of MR (when A is triangular) or NR (when B is triangular). This means that bottom-right micro-panels packed for trmm will have different zero-padding when k is not already a multiple of the relevant register blocksize. While this creates a seemingly arbitrary and unnecessary distinction between trmm and trsm packing, it actually allows trmm to be handled with one control tree, instead of one for left and one for right side cases. Furthermore, since only one tree is required, it can now be handled by the gemm tree, and thus the trmm control tree definitions can be disposed of entirely. - Tweaked trmm macro-kernels so that they do NOT inflate k up to a multiple of MR (when A is triangular) or NR (when B is triangular). - Misc. tweaks and cleanups to bli_packm_struc_cxk_4m.c and _3m.c, some of which are to facilitate above-mentioned changes whereby k is no longer required to be a multiple of register blocksize when packing triangular micro-panels. - Adjusted trmm3 according to above changes. - Retired trmm_t control tree creation/initialization functions. commit 576e9e9255a79dba9cd3c804267f51e0b4aa6e8a Author: Field G. Van Zee Date: Sun Sep 7 16:12:52 2014 -0500 Retired herk_t control tree definitions, usage. Details: - Replaced all herk_t control tree instances and usage with that of gemm_t, since the two types presently have the same fields. 
This means that herk, her2k, syrk, and syr2k can simply use the gemm control tree as-is, just as hemm and symm have been doing for some time now. - Retired herk_t control tree creation/initialization functions. - Retired many _target.c and .h files into 'old' directories. commit b2fed052c9a23d858ef0afbe220b342bce9aa7f7 Author: Field G. Van Zee Date: Wed Sep 3 17:07:25 2014 -0500 Minor code cleanup to bli_packm_struc_cxk*.c Details: - Realized that we don't need to track rs_p11 and cs_p11 for Hermitian/symmetric case of bli_packm_struc_cxk*(). They are always equal to rs_p and cs_p. commit 023ce770966b3b5a98bba729c5af1f45e15ebb97 Author: Field G. Van Zee Date: Wed Sep 3 10:47:53 2014 -0500 Minor update to packm_cxk kernels. Details: - Changed m and n dimension parameter names to panel_dim and panel_len, respectively, in packm_cxk, packm_cxk_3m, packm_cxk_4m kernel wrapper functions. This makes the code a little easier to read since "m" and "n" have connotations that are not applicable here. - Comment updates. commit 189def3667d9218adbeec45e2801fd074341a679 Author: Field G. Van Zee Date: Mon Sep 1 16:23:17 2014 -0500 Retired portions of bli_kernel_3m/4m_macro_defs.h. Details: - Removed sections of bli_kernel_[4m|3m]_macro_defs.h that defined 4m/3m-specific blocksizes after realizing that this can be done in bli_gemm[4m|3m]_cntl.c, since that is (mostly) the only place they are used. - The maximum cache values for 4m/3m are still needed when computing mem pool dimensions in bli_mem_pool_macro_defs.h. As a workaround, "local" definitions in terms of the regular cache blocksizes are now in place. - Similarly, the register blocksizes for 4m/3m are still needed in bli_kernel_post_macro_defs.h. As a workaround, "local" definitions in terms of the regular register blocksizes are now in place. commit af521ee6f2a77d61c98b833e85c09969987bc00d Author: Field G. Van Zee Date: Mon Sep 1 14:06:46 2014 -0500 Changed semantics of blocksize extensions.
Details: - Changed semantics of cache and register blocksize extensions so that the extended values are tracked, rather than just the marginal extensions. - BLIS_EXTEND_[MKN]C_? has been renamed BLIS_MAXIMUM_[MKN]C_?. - BLIS_EXTEND_[MKN]R_? has been renamed BLIS_PACKDIM_[MKN]R_?. - bli_blksz_ext_*() APIs have been renamed to bli_blksz_max_*(). Note that these "max" query routines grab the maximum value for cache blocksizes and the packdim value for register blocksizes. - bli_info_*() API has been updated accordingly. - All configurations have been updated accordingly. commit 07f23aefd52f5ba4960dbd46e59b180a2136b8e9 Author: Field G. Van Zee Date: Sun Aug 31 11:58:50 2014 -0500 Pass pack schema into packm_struc_cxk*(). Details: - Changed the interface to the packm_struc_cxk*() kernels to include the pack_t schema. This allows the implementation to more easily determine how the micro-panel is stored (row-stored column panel or column-stored row panel). - Updated packm blocked variants to pass in the schema. - Updated packm_ker_t function pointer definition accordingly. commit f032ba9b1186cb02184574d339565f53d733aa42 Author: Field G. Van Zee Date: Sat Aug 30 16:21:20 2014 -0500 Reorganized packm implementation. Details: - Reorganized packm variants and structure-aware kernels so that all routines for a given pack format (4m, 3m, regular) reside in a single file. - Renamed _blk_var4 to _blk_var2 and generalized so that it will work for both 4m and 3m, and adjusted 4m/3m _cntl_init() functions accordingly. - Added a new packm_ker_t function pointer type to bli_kernel_type_defs.h to facilitate function pointer typecasting in the datatype-specific packm_blk_var2() functions. - Deprecated _blk_var3. - Fixed a bug in the triangular micro-panel packing facility that affected trmm and trmm3 with unit diagonals. commit c6793cecb70788bdf2c76ab8102504ea97be9d2a Author: Field G. Van Zee Date: Thu Aug 28 17:14:48 2014 -0500 Reorganized #includes for scalar macro headers. 
Details: - Reordered the #include statements in bli_scalar_macro_defs.h so that conventional, ri-, and ri3-based macros are grouped together. - Renamed bli_eqri.h (and macros within) to end with 'ris' suffix. commit b4da8907284345be4374f87a88679c4886ab866e Author: Field G. Van Zee Date: Thu Aug 28 14:10:32 2014 -0500 Whitespace, comments updates on packm_blk_var?.c. commit 46e46a1d83da586c3dd9fd7a01eb16067abbaee1 Author: Field G. Van Zee Date: Thu Aug 28 12:05:45 2014 -0500 Minor updates to packm blocked, cxk_3m/4m code. Details: - Added 'const' qualifier to inlined packing code that handles micro-panel packing that is too large for an existing packm ukernel. - Comment updates. commit 908dc688b5979995eaacb3aa937f241551a8df00 Author: Field G. Van Zee Date: Thu Aug 28 11:55:12 2014 -0500 Pass pack schema into blocked packm routines. Details: - Rather than passing the packm blocked routines a boolean value that represents whether the matrix is being packed to row or column storage, we now pass in the pack schema itself. commit a0ff6066e06075ab5f92b19247b39b92ed15f1bf Merge: c4c99c48 d40b32bc Author: Field G. Van Zee Date: Sun Aug 24 15:56:21 2014 -0500 Merge branch 'master' of github.com:flame/blis commit c4c99c4813bf9817592a7899c5d33412fe22313f Author: Field G. Van Zee Date: Sun Aug 24 15:52:22 2014 -0500 Renamed packm scalar from beta to kappa. Details: - The packm implementation (i.e. sources files in frame/1m/packm and frame/1m/packm/ukernels), interchangeably used the names "beta" and "kappa" to refer to the optional scalar to be applied during packing. This commit renames all uses of "beta" to be "kappa", since "beta" sometimes evokes the scalar specifically on the output matrix of a level-2 or level-3 operation. commit d40b32bc24ffbae24123e054307b3138969bb095 Merge: 9331f794 6c25c379 Author: Field G. Van Zee Date: Sun Aug 24 13:46:36 2014 -0500 Merge branch 'master' of github.com:flame/blis commit 6c25c379fadb50834146e1614f7b80c093c2aad0 Author: Field G. 
Van Zee Date: Sun Aug 24 13:44:10 2014 -0500 Consolidated unpackm ukernels into single file. Details: - Reorganized unpackm ukernels into a single file, bli_unpackm_ref_cxk.c, in a manner similar to what was done for packm ukernels in commit 4cc2b46. commit 9331f79443223fe267676ee54c439e1ed320380c Merge: 7fc48a7d 670b6392 Author: Field G. Van Zee Date: Sun Aug 24 10:54:21 2014 -0500 Merge branch 'master' of github.com:flame/blis commit 670b63926a7f4fc694abc5b1582ef8a4f367f5a8 Author: Field G. Van Zee Date: Sun Aug 24 10:46:27 2014 -0500 Added whitespace to bli_obj_scalar_ routine calls. Details: - Added extra spaces to align arguments of bli_obj_scalar_init_detached_copy_of(). This misalignment was due to the fact that the function was previously named bli_obj_init_scalar_copy_of() and the name change, performed in b444489f, was done via recursive sed commands which left subsequent lines untouched. commit 7fc48a7d920e07fd8e9528ab2565123f8f4e67f9 Author: Field G. Van Zee Date: Sat Aug 23 16:50:58 2014 -0500 Combined 4m/3m bits into an expanded bitfield. Details: - Combined the 4m/3m bits into an expanded bitfield, which will encode the packing "format" of the micro-panels. This will allow for more easily and compactly encoding additional formats. - Other minor comment/whitespace updates to bli_type_defs.h. - Updated bli_obj_macro_defs.h and bli_param_macro_defs.h to use the new format bitfield. - Comment update to bli_kernel_post_macro_defs.h. - Whitespace changes to bli_kernel_3m_macro_defs.h, _4m_macro_defs.h. commit ef0143cc1417e4815e4cafd5a464cc83fe7a1e86 Author: Field G. Van Zee Date: Sat Aug 23 14:02:27 2014 -0500 Renamed _ri, _ri3 packm ukernels to _4m, _3m. Details: - Renamed packm ukernels, _cxk dispatcher, and structure-aware _cxk helper functions to use _4m and _3m instead of _ri and _ri3 suffixes. - Updated names of cpp macros that correspond to packm ukernels. commit b0ccac116158b5ed3316d34798748ba0c6d78672 Author: Field G. 
Van Zee Date: Thu Aug 21 19:21:52 2014 -0500 Cleaned up front-end layering for 4m/3m. Details: - Added an extra layer to level-3 front-ends (examples: bli_gemm_entry() and bli_gemm4m_entry()) to hide the control trees from the code that decides whether to execute native or 4m-based implementations. The layering was also applied to 3m. - Branch to 4m code based on the return value of bli_4m_is_enabled(), rather than the cpp macros BLIS_ENABLE_?COMPLEX_VIA_4M. This lays the groundwork for users to be able to change at runtime which implementation is called by the main front-ends (e.g. bli_gemm()). - Retired some experimental gemm code that hadn't been touched in months. commit bedec95451cabfa7a8906b51018a5e0572998a5e Author: Field G. Van Zee Date: Thu Aug 21 18:25:48 2014 -0500 Added bli_4m API for querying 4m enabled state. Details: - Added bli_4m.c (and header), which defines a simple API that can be used to query, enable, and disable 4m-based complex support in BLIS. The macros BLIS_ENABLE_?COMPLEX_VIA_4M are now used to initialize the variable that determines the state (enabled or disabled). - Changed bli_info*() API so that all cache and register blocksize- related query routines return the blksz_t objects' values as they exist at runtime, rather than return the values as determined by the configuration system (e.g. bli_kernel.h, or defaults for those values not specified). This sets the foundation for being able to change those blocksizes at runtime. commit b541b667cabfa6d41b50ad1e49209651ee6812cc Merge: 699a8151 dd61307f Author: Tyler Smith Date: Wed Aug 20 14:44:51 2014 -0500 Merge branch 'master' of http://github.com/flame/blis Conflicts: frame/3/trsm/bli_trsm_blk_var2b.c frame/3/trsm/bli_trsm_blk_var2f.c commit 699a8151ca3d5021e834a1784ef45dcc3a3d17cd Author: Tyler Smith Date: Wed Aug 20 14:43:17 2014 -0500 Some improvements to trsm parallelism commit dd61307f55bb6bc762fe0ef0446479d6c0536723 Author: Field G. 
Van Zee Date: Wed Aug 20 09:52:16 2014 -0500 Minor update to sandybridge MC_S, KC_S. Details: - Changed sandybridge MC and KC for single-precision real to 128 and 384, respectively. - Updated comments in template configuration's gemm micro-kernel file to document the new "contiguous row preference" macro. commit d0eec4bddd740ce360d0f655362c551287cf925b Author: Field G. Van Zee Date: Tue Aug 19 15:49:19 2014 -0500 Added optional row preference to ukernel config. Details: - Added the ability for the kernel developer to indicate the gemm micro- kernel as having a preference for accessing the micro-tile of C via contiguous rows (as opposed to contiguous columns). This property may be encoded in bli_kernel.h as BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS, which may be defined or left undefined. Leaving it undefined leads to the default assumption of column preference. - Changed conditionals in frame/3/*/*_front.c that induce transposition of the operation so that the transposition is induced only if there is disagreement between the storage of C and the preference of the micro-kernel. Previously, the only conditional that needed to be met was that C was row-stored, which is to say that we assumed the micro- kernel preferred column-contiguous access on C. - Added a "prefers_contig_rows" property to func_t objects, and updated calls to bli_func_obj_create() in _cntl.c files in order to support the above changes. - Removed the row-storage optimization from bli_trsm_front.c because it is actually ineffective. This is because the right-side case of trsm flips the A and B micro-panel operands (since BLIS only requires left-side gemmtrsm/trsm kernels), meaning any transposition done at the high level is then undone at the low level. - Tweaked trmm, trmm3 _front.c files to eliminate a possible redundant invocation of the bli_obj_swap() macro. commit 4cc2b464f29cafbfef9295b073b857fe0752f710 Author: Field G. Van Zee Date: Fri Aug 15 11:49:15 2014 -0500 Reorganized packm ukernels. 
Details: - Previously, packm micro-kernels were organized by the implied register blocksize (panel dimension) assumed by the kernel, meaning conventional, ri, and ri3 variations of some micro-kernel size were housed in the same file. This commit reorganizes the micro-kernels so that all sizes reside in the same file for each format type (conventional, ri, and ri3). commit fcc10054a11b6fc3976986f57feccf741596cbf6 Author: Field G. Van Zee Date: Wed Aug 13 12:32:06 2014 -0500 Tweaks to gemm4m, gemm3m virtual ukernels. Details: - Fixed a potential, but as-yet unobserved, bug in gemm3m that would allow undesirable inf/NaN propagation, since C was being scaled by beta even if beta was equal to zero. - In gemm3m micro-kernel, we now avoid copying C to the temporary micro-tile if beta is zero. - Rearranged computation in gemm4m so that the temporary C micro-tile is accessed less, and C is accessed only after the micro-kernel calls. This improves performance marginally in most situations. - Comment updates to both gemm4m and gemm3m micro-kernels. commit cdcbacc2fa871317c8e7ef961ecc6d70ab22dc34 Author: Field G. Van Zee Date: Tue Aug 12 12:45:38 2014 -0500 Removed redundant redef of packm ukr prototypes. Details: - Removed redundant macro code that redefined packm ukernel prototypes when the previous macro was already sufficient. This helps de-clutter the packm ukernel prototyping headers a little bit. commit 82dac98d9032ccb598068a55ddf23d7898491e9e Author: Field G. Van Zee Date: Tue Aug 12 12:36:25 2014 -0500 Relocated packm ukernel #includes. Details: - Consolidated the #include statements for packm ukernel headers from bli_packm_cxk.h, bli_packm_cxk_ri.h, and bli_packm_cxk_ri3.h to bli_packm.h. - Comment/whitespace updates to bli_packm_blk_var3.c, _var4.c. commit 7f77856e25aad5fc6f172ed3e57b6351804e31a4 Author: Field G. Van Zee Date: Tue Aug 12 12:20:15 2014 -0500 Removed unused 4m/3m-related packm macro defs.
Details: - Removed unused and unneeded s- and d-flavored macro definitions for packm ukernels related to the complex 4m and 3m methods, as implemented in BLIS. commit bc1d86b2d4d436b1dfba2d0098501aaca9cbb8b5 Author: Field G. Van Zee Date: Thu Aug 7 19:01:20 2014 -0500 Sandy Bridge configuration, micro-kernel update. Details: - Minor updates to bli_config and bli_kernel.h for sandybridge configuration. - Renamed existing AVX intrinsic-based micro-kernel file to bli_gemm_int_d8x4.c. - Added new file, bli_gemm_asm_d8x4.c, which provides assembly-based gemm micro-kernels for single- and double-precision real. commit 98ec95877a95242e159b2bf0c879115a59e4c6e2 Author: Field G. Van Zee Date: Thu Aug 7 18:28:32 2014 -0500 Corrected comment for _obj_is_[row|col]_stored(). Details: - Fixed a mistake in the comments introduced in the previous commit for bli_obj_is_row_stored() and bli_obj_is_col_stored(). commit 43d5e419e1b424d2143817103dbee8ead797e8aa Author: Field G. Van Zee Date: Thu Aug 7 18:20:40 2014 -0500 Reverted _obj_is_[row|col]_stored() macros. Details: - Rolled back recent changes to bli_obj_is_row_stored() and bli_obj_is_col_stored() so that those macros now only inspect the strides (row or column). It turns out that the more sophisticated definitions introduced in a51e32e are not necessary, because these "obj" macros are virtually never used on packed matrices, and when they are, they can use bli_obj_is_[row|col]_packed() macros, which inspect the info bitfield. commit 45692e3ad4b7e1d05ac4302398df4efce04b4284 Author: Field G. Van Zee Date: Thu Aug 7 13:21:15 2014 -0500 Reverted some accidental changes. Details: - Reverted some changes that were unintentionally included in the previous commit (9526ce98). Thanks to Tony Kelman for pointing this out. (Note: a few select changes were not reverted.) commit 9526ce98812be908bc4915f2849b657fb6ce1b49 Author: Field G. Van Zee Date: Wed Aug 6 14:13:46 2014 -0500 Updated copyright headers of emscripten configuration files.
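Editorial illustration (not part of the commit history): commit 43d5e419 above restores storage tests that look only at the row and column strides. A minimal sketch of such stride-only tests follows, assuming that "row-stored" means a unit column stride and "column-stored" means a unit row stride. The names abs_stride(), is_row_stored(), and is_col_stored() are hypothetical and are not the BLIS macro definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: magnitude of a (possibly negative) stride. */
ptrdiff_t abs_stride(ptrdiff_t s) { return s < 0 ? -s : s; }

/* Assumption: a matrix is row-stored when moving to the next column
   advances by one element, i.e. the column stride cs is unit. */
bool is_row_stored(ptrdiff_t rs, ptrdiff_t cs)
{
    (void)rs; /* only cs is inspected */
    return abs_stride(cs) == 1;
}

/* Assumption: a matrix is column-stored when moving to the next row
   advances by one element, i.e. the row stride rs is unit. */
bool is_col_stored(ptrdiff_t rs, ptrdiff_t cs)
{
    (void)cs; /* only rs is inspected */
    return abs_stride(rs) == 1;
}

/* Usage checks for a 3x4 matrix in each layout. */
void demo_storage_checks(void)
{
    assert(is_row_stored(4, 1) && !is_col_stored(4, 1)); /* row-major    */
    assert(is_col_stored(1, 3) && !is_row_stored(1, 3)); /* column-major */
    /* With both strides unit, the two tests cannot tell the cases apart. */
    assert(is_row_stored(1, 1) && is_col_stored(1, 1));
}
```

The last check shows why stride inspection alone is ambiguous when both strides are unit, which is the situation commit a51e32e (further down this log) addressed for packed micro-panels.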
commit 30833ed71d56f231ddba21e632bcbbc90b12a97c Author: Field G. Van Zee Date: Wed Aug 6 12:12:03 2014 -0500 Minor edits to configurations' make_defs.mk files. Details: - Redefined CFLAGS, CFLAGS_NOOPT, and CFLAGS_KERNELS so that CFLAGS_NOOPT is defined first and then the other two are defined in terms of CFLAGS_NOOPT. This textually cleans up the definitions and makes them a little easier to read. commit 9d61afeae2ba70fe1df07e7546f6954ea83aed12 Author: Field G. Van Zee Date: Mon Aug 4 16:01:59 2014 -0500 CHANGELOG update (0.1.5) commit bde56d0ecfd0ec20330fac290b91a6dca0cf94e9 (tag: 0.1.5) Author: Field G. Van Zee Date: Mon Aug 4 16:01:58 2014 -0500 Version file update (0.1.5) commit 4c6ceea4be35d089630986eb5b959b9e97214077 Author: Field G. Van Zee Date: Mon Aug 4 15:49:59 2014 -0500 Added CBLAS compatibility layer. Details: - Added a new section in bli_config.h files of all configurations for enabling CBLAS support. (Currently, the default is for the CBLAS layer to be disabled.) - Added a directory, frame/compat/cblas, to house CBLAS source code. A subdirectory 'f77_sub' holds subroutine wrappers corresponding to subroutines found in CBLAS that allow calling some BLAS routines with the return value passed as the last argument rather than as an actual (function) return value. This was probably intended to allow CBLAS to avoid the whole f2c debacle altogether. However, since BLIS does not assume the presence of a Fortran compiler, we had to provide similar routines in C. - A script, integrate-cblas-tarball.sh, is included to streamline the integration of future revisions of the CBLAS source code. - The current tarball, cblas.tgz, that was used with the above script to generate the present set of CBLAS source code is also included. - Updated blis.h to include necessary CBLAS-related headers. commit caab62dac0fb0bd0d674118f409c81680db94d29 Merge: 383631b5 db97ce97 Author: Field G. 
Van Zee Date: Sun Aug 3 14:36:18 2014 -0500 Merge pull request #19 from kevinoid/fix-install-perms-error Fix permissions error installing to non-owned directory commit db97ce979b88c051922c2f946ce52d523c7a12c6 Author: Kevin Locke Date: Sun Aug 3 12:48:04 2014 -0600 Fix permissions error installing to non-owned directory When installing to a directory which is not owned by the installing user, even when the user has write permission for the directory, the installation can fail with an error similar to the following: Installing libblis-0.1.4-7-sandybridge.a into /usr/local/lib/ install: cannot change permissions of ‘/usr/local/lib’: Operation not permitted Makefile:658: recipe for target '/usr/local/lib/libblis-0.1.4-7-sandybridge.a' failed make: *** [/usr/local/lib/libblis-0.1.4-7-sandybridge.a] Error 1 In the example case, the error occurred because the user attempted to install to /usr/local and /usr/local/lib is owned by root with mode 2755 which the Makefile unsuccessfully attempted to change to 0755. Given that installing to /usr/local is likely to be quite common and the ownership/permissions are the default for Debian and Debian-derived Linux distributions (perhaps others as well), this commit attempts to support that use case by using mkdir rather than install to create the directory (which is the same approach as Automake). Signed-off-by: Kevin Locke commit 383631b514c3d42b724640f57644eea276cc418c Author: Field G. Van Zee Date: Thu Jul 31 14:51:48 2014 -0500 Redefined bit field macros with bitshift operator. Details: - Redefined many of the macros that define bit fields and bit values in the obj_t info field using the bitshift operator (<<). This makes it easier to reorder bit fields, or expand existing bit fields, or add new fields. The bitshifting should be evaluated by the compiler at compile-time. commit 137143345dc93cc9a83da5ba88b25bac7502de86 Author: Field G. Van Zee Date: Thu Jul 31 12:12:45 2014 -0500 Reimplemented unit blocksize fix in prev commit. 
Details: - Instead of inferring the storage format of the micro-panels from within the packm variants, we now pass in a bool_t value that denotes whether the packed matrix contains row-stored column panels or column-stored row panels. This value can then be tested more easily inside the main packm variant loop. - Renumbered pack_t schema values in bli_type_defs.h so that there are now five bits, each with different meaning: - 4: packed or not packed? - 3: packed for 3m? - 2: packed for 4m? - 1: packed to panels? - 0: stored by rows or columns? - Added new macros that test for status of above bits in schema bit subfield, and renamed some existing macros related to 4m/3m. commit a51e32ec061941cd10119ea80115c82a40b1673f Author: Field G. Van Zee Date: Wed Jul 30 10:41:48 2014 -0500 Fixed unit register blocksize brokenness. Details: - Fixed a breakdown in BLIS's ability to differentiate between row-stored and column-stored micro-panels when MR or NR is unit. When either register blocksize (or both) is equal to one, inspecting the strides of the affected packed micro-panel is no longer sufficient to determine whether the micro-panel is a row-stored column panel or a column-stored row panel (because both strides are unit). At that point, dimension information is necessary when invoking the bli_is_row_stored_f() and bli_is_col_stored_f() macros (and their "obj" counterparts). Thanks to Ilya Polkovnichenko for reporting this bug. - Added panel dimensions (m and n) to obj_t, which are set in packm_init() and then passed into the blocked variants to support the aforementioned update. commit c2732272f0ac680a0ad19fa9db5d587398a1479a Author: Field G. Van Zee Date: Tue Jul 29 16:37:18 2014 -0500 Removed old/unused packm variants. commit b97fa9a5a70fe0123e5eebd999b947461d38445f Author: Field G. Van Zee Date: Sun Jul 27 18:54:09 2014 -0500 Minor usage update to build/bump-version.sh. commit b18ba5f62d98629cdd519ff4c96fc67ec1a62fb9 Author: Field G. 
Van Zee Date: Sun Jul 27 18:52:05 2014 -0500 Added missing 'bla_' prefix to r_imag(), d_imag(). Details: - Added "bla_" to f2c functions r_imag() and d_imag(). Thanks to Murtaza Ali for pointing out the misnamed functions. commit af7a8e6c042cade452130a6729377f1a3ef4e19e Author: Field G. Van Zee Date: Sun Jul 27 18:20:13 2014 -0500 CHANGELOG update (0.1.4) commit a7537071b152ecff671f8716595d37dc09e4fd51 (tag: 0.1.4) Author: Field G. Van Zee Date: Sun Jul 27 18:20:12 2014 -0500 Version file update (0.1.4) commit acff74041bf02c7b9fdfa24b507bca782a4c5fce Merge: cdb9413e 47b243ef Author: Tyler Smith Date: Wed Jul 23 15:07:30 2014 -0500 Merge branch 'master' of https://github.com/flame/blis commit cdb9413e140f8a198666250ec88fa34b5425a9c3 Author: Tyler Smith Date: Wed Jul 23 15:05:15 2014 -0500 Enabled threading for a couple more loops in TRSM JC loop is now enabled for the left-sided case IC loop is now enabled for the right-sided case commit 47b243ef08f4101de3d936f2373343e67eaa4dd5 Author: Field G. Van Zee Date: Wed Jul 23 13:41:13 2014 -0500 Call setid for early return from herk/her2k. Details: - Added setid call (to zero imaginary parts of diagonal elements) to early return branches of herk_front() and her2k_front() for cases where alpha is zero. Thanks to Murtaza Ali for suggesting this fix. - Comment update. commit 3e7b0db5b0e24f5fd66c60bacabc019885ddbec5 Merge: 2f8a357d ed3e33d5 Author: Tyler Smith Date: Wed Jul 23 13:40:44 2014 -0500 Merge branch 'master' of https://github.com/flame/blis commit 2f8a357de5fb55163a969d888cf059f24b78125c Author: Tyler Smith Date: Wed Jul 23 13:40:12 2014 -0500 Some TRSM threading fixes/additions commit ed3e33d548047be3283ff41268fdf716563bc542 Author: Field G. Van Zee Date: Tue Jul 22 14:40:43 2014 -0500 Tweaked behavior of herk, her2k for BLAS compat. Details: - Updated herk_front() and her2k_front() to explicitly set the imaginary components of the diagonal entries of C to zero after the computation is complete.
This is needed in case downstream applications read the full diagonal entries (i.e., including imaginary part), which could, in the absence of this modification, accumulate numerical error from subsequent rank-k/rank-2k updates. - Updated BLAS compatibility wrappers for herk and her2k to return early if: n == 0 || ( ( alpha == 0 || k == 0 ) && beta == 1 ) This also results in the imaginary components of diagonal entries NOT being set to zero (see above), which is consistent with BLAS. - Updated mkherm to use setid instead of an inlined loop over the diagonal. commit ea59a5c93cde1467a3715abc53dda4aecf961873 Author: Field G. Van Zee Date: Tue Jul 22 14:36:02 2014 -0500 Added new level-1d operation: setid. Details: - Defined a new level-1d operation, setid, which sets the imaginary elements of an object's diagonal to a single scalar. This can be useful, for example, when trying to make the diagonal of a Hermitian matrix real-valued. commit 8965a965931318619ceaebd7c32edccf3022d0c7 Merge: 1785efb5 5b73e80b Author: Field G. Van Zee Date: Tue Jul 22 14:34:32 2014 -0500 Merge branch 'master' of github.com:flame/blis commit 1785efb5420bc7b9c850a068cb5d99837071e877 Author: Field G. Van Zee Date: Tue Jul 22 14:33:01 2014 -0500 Minor improvements to invertd and setd. Details: - Added missing call to invertd_check() from front-end. - Changed setd front-end call of scald_check() to setd_check(). commit 5b73e80b71c054c1945a06aff044ef629bc1a9a0 Merge: a41e68e0 20690fe3 Author: Field G. Van Zee Date: Fri Jul 18 12:21:20 2014 -0500 Merge pull request #16 from Maratyszcza/emscripten Emscripten port commit a41e68e09e73b999fab0bb430a43dccfc63aab45 Author: Field G. Van Zee Date: Thu Jul 17 13:25:56 2014 -0500 Reimplemented BLIS initialization/finalization. Details: - Rewrote bli_init() and bli_finalize() with OpenMP critical sections for thread-safety. Also added lots of explanatory comments. 
- Renamed bli_init_safe() and bli_finalize_safe() with the _auto() suffix, and reimplemented for simplicity. Updated all invocations in BLAS compatibility layer to use _auto() suffix. commit 36358948ea75074bda32a9f8c008f835b87d21db Author: Field G. Van Zee Date: Thu Jul 17 10:58:10 2014 -0500 Retired frame/3/gemm/other directory. Details: - Removed frame/3/gemm/other directory, which contained some outdated and/or experimental variants. commit c73261f17edf589e76bdbe297702a1fbbd69275f Author: Field G. Van Zee Date: Mon Jul 14 16:23:51 2014 -0500 More minor cleanups post-copyright update. commit 2a09d24463d358be6243b24f112fad057c2aefe0 Author: Field G. Van Zee Date: Mon Jul 14 16:17:09 2014 -0500 Reverted power7 symlinks destroyed by sed script. Details: - Reverted two symlinks, in kernels/power7/3/test, back to being symlinks after recursive-sed.sh mistakenly replaced them with copies of the actual files to which they referred. Meant to include this in previous commit. commit 7ed415824d3b2e78541b6f64e404ca5347c06d3d Author: Field G. Van Zee Date: Mon Jul 14 16:14:33 2014 -0500 Updated copyright headers (continued). Details: - Inserted "at Austin" into third clause of license declarations. Meant to include this change in previous commit. commit 5c2c6c85616834ff2716ece083118201d9df6dde Author: Field G. Van Zee Date: Mon Jul 14 16:05:03 2014 -0500 Updated copyright headers to contain "at Austin". Details: - Updated copyright headers to include "at Austin" in the name of the University of Texas. - Updated the copyright years of a few headers to 2014 (from 2011 and 2012). commit fcec68cda3f6e90ae055e7304e6674c1c5c8d010 Merge: 94c0df79 4a20ed1a Author: Field G. Van Zee Date: Mon Jul 14 11:35:34 2014 -0500 Merge branch 'master' of github.com:flame/blis commit 94c0df797eda377931f29a41ba6a89c0ed58daca Author: Field G. Van Zee Date: Mon Jul 14 11:24:36 2014 -0500 Changed order of zero dim / error checking. 
Details: - Updated level-2 and level-3 internal back-ends so that the operation's _check() function is called BEFORE any attempt to return early due to the presence of zero dimensions. This ordering makes more sense because (for example) object dimensions should match even if one of them is zero. Previously, a dimension mismatch could result in an early return with no error message. - Updated bli_check_object_buffer() so that NULL buffers result in an error only if the object is dimensionally non-empty (i.e., only if both of the object's dimensions are non-zero). This allows BLIS operations to be performed on dimensionally empty objects (i.e., where at least one dimension is zero). - Updated the error message associated with bli_check_object_buffer() to mention the newly relaxed constraint mentioned above, vis-a-vis non-zero dimensions. commit 20690fe3018ce17c8df61ce0bffecaa7911dc3a5 Author: Marat Dukhan Date: Sun Jul 13 22:50:56 2014 -0700 Emscripten port commit 4a20ed1a3f5e9e5232df30aa0e568e6c00c56ce1 Merge: 6a515e98 8ccdfaef Author: Field G. Van Zee Date: Sun Jul 13 17:45:01 2014 -0500 Merge pull request #14 from Maratyszcza/master Support "make test" for PNaCl configuration commit 6a515e988f2ae1628258a6dec2c0e9cf2d04790f Author: Field G. Van Zee Date: Sun Jul 13 17:38:33 2014 -0500 Implemented dsdot() and sdsdot() in compat layer. Details: - Replaced "not yet implemented" error messages in dsdot() and sdsdot() with actual implementations. (These routines are so rarely used that this log message will probably lead to some people learning of their existence for the first time.) commit 255668ddd1004552c6cc65035ec6486671ce99bb Author: Field G. Van Zee Date: Sun Jul 13 17:30:44 2014 -0500 Inserted gemv beta-scaling bug into compat layer. Details: - BLAS has a peculiar bug (or feature) whereby calling gemv on a vector y of non-zero length and a vector x of zero length results in no action. Given that the operation is y := beta*y + A*x, many (most?) 
individuals would expect vector y to still be scaled by beta. BLIS, when called natively, handles these cases intuitively (with beta scaling). Unfortunately, many BLAS test suites actually check for the way this situation is handled. Therefore, we have decided to implement this "bug" in the compatibility layer so as to provide "bug-for-bug" compatibility with BLAS. commit 570a154581bdb353fa13a219c7cb3c81d3dceffd Author: Field G. Van Zee Date: Sat Jul 12 17:51:05 2014 -0500 Comment/formatting updates to build scripts. Details: - Minor updates to comments and formatting in bump-version.sh and update-version-file.sh scripts. commit 26cd81990631ff799791629206e068126ff9e3a1 Author: Field G. Van Zee Date: Thu Jul 10 13:16:07 2014 -0500 Added bli_info_*() query functions. Details: - Added a new API family, bli_info_*(), which can be used to query information about how BLIS was configured. Most of these values are returned as gint_t, with the exception of the version string which is char*. - Changed how the testsuite driver queries information about how BLIS was configured (from using macro constants directly to using the new bli_info API). - Removed bli_version.c and its header file. - Added STRINGIFY_INT() macro to bli_macro_defs.h - Renamed info_t type in bli_type_defs.h to objbits_t (not because of an actual naming conflict, but because the name 'info_t' would now be somewhat misleading in the presence of the new bli_info API, as the two are unrelated). commit 970b43141697d8c31a033f59513bb59d7cc78ab0 Author: Field G. Van Zee Date: Thu Jul 10 09:30:00 2014 -0500 Minor bugfixes to BLAS compatibility layer. Details: - Changed bla_amax.c so that i?amax() routines now correctly return 0 if ( n < 1 || incx <= 0 ). - Changed bla_rotg.c and bla_rotmg.c to use bli_fabs() macro instead of f2c's abs() macro for float and double cases. - Thanks to Murtaza Ali for suggesting the two fixes above. - Updated label of fnormv to normfv in testsuite/input.operations. 
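Editorial illustration (not part of the commit history): commit 970b4314 above makes the i?amax() compatibility routines return 0 when ( n < 1 || incx <= 0 ). The sketch below shows that guard in a self-contained reference routine for the double-precision case. The name ref_idamax() is hypothetical, the code is not taken from bla_amax.c, and the 1-based return value follows the usual BLAS convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine: returns the 1-based index of the first
   element of x with the largest absolute value, or 0 for invalid input. */
int ref_idamax(int n, const double *x, int incx)
{
    /* Guard described in the commit: nothing to search, so return 0. */
    if (n < 1 || incx <= 0) return 0;

    assert(x != NULL); /* a valid n requires a valid buffer */

    int    best_idx = 1;
    double best_abs = x[0] < 0.0 ? -x[0] : x[0];

    for (int i = 1; i < n; ++i) {
        double v = x[(size_t)i * (size_t)incx];
        double a = v < 0.0 ? -v : v;
        /* Strict comparison keeps the first occurrence on ties. */
        if (a > best_abs) { best_abs = a; best_idx = i + 1; }
    }
    return best_idx;
}
```

For example, with x = { 1.0, -7.0, 3.0, 7.0 } and incx = 1, the routine returns 2 (the first element of magnitude 7), while any call with n = 0 or incx = 0 returns 0 without reading x.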
commit 8ccdfaef4c42ad8957af8607a1a9ee29b9277d4b Author: Marat Dukhan Date: Tue Jul 8 23:14:36 2014 -0700 Replicated logic from testsuite/Makefile in top-level Makefile to support make test commit caa6507ff3724c80d60987f309b8bbc5b50a9841 Author: Field G. Van Zee Date: Tue Jul 8 10:25:27 2014 -0500 Minor cleanup to standalone test drivers. Details: - Very minor code changes to standalone test drivers in 'test' directory. - Added *.so files to '.gitignore'. commit 6c65e9a58fe55990ebb99ec3986443e18af35338 Merge: cb12e456 daca500d Author: Field G. Van Zee Date: Tue Jul 8 10:13:49 2014 -0500 Merge branch 'master' of github.com:flame/blis commit cb12e456f94c196c093e52f02a7cbca0032fc86e Author: Field G. Van Zee Date: Tue Jul 8 10:07:46 2014 -0500 Fixed possible level-3 inf/NaN issue when beta=0. Details: - Redefined xpbys_mxn and xpbys_mxn_u/_l macros to employ a copy (instead of scaling by beta) when beta is zero. This will stamp out any possible infs or NaNs in the output matrix, if it happens to be uninitialized. Thanks to Tony Kelman for isolating this bug. commit daca500db5e2448ba0da8047b75eb0f88d9f40e3 Merge: ab3bc915 47023502 Author: Tyler Smith Date: Thu Jul 3 12:52:52 2014 -0500 Merge branch 'master' of http://github.com/flame/blis commit 4702350278af31f662b458127777dd4d85a3192f Author: Field G. Van Zee Date: Thu Jul 3 11:48:23 2014 -0500 Defined _ukernel_void() wrappers to micro-kernels. Details: - Added wrappers for micro-kernels so that users may invoke the micro-kernels without knowing what the function names actually are. This is useful when an application wishes to call the micro-kernel from a shared library instance of BLIS, where the application may not necessarily have the luxury of grabbing the micro-kernel name(s) from C preprocessor macros at compile-time. Also, since the wrappers use void* pointers, one's environment does not need to be aware of some BLIS types such as scomplex and dcomplex. 
These wrappers now join the level-1 and level-1f kernel wrappers, which pre-dated this commit. - Removed the wrapper definitions and prototypes from the micro-kernel test suite modules, and replaced calls to them with calls to the new wrappers mentioned above. commit ab3bc9153b914fbaf259e15b66c91d628e7c8661 Author: Tyler Smith Date: Thu Jul 3 11:19:43 2014 -0500 Fixed a bug for TRSM when BLIS_ENABLE_MULTITHREADING is not set but the multithreading environment variables are turned on commit b8134b720b985783ee6a582a3eb5d6c51f00d051 Author: Tyler Smith Date: Wed Jul 2 16:02:39 2014 -0500 Quick and dirty multithreading for TRSM Should work fine for small number of threads (up to 8 or maybe even 16). However, performance is yet untested. This parallelizes the "JR" loop for the left sided cases and the "IR" loop for the right sided cases. Future work is to parallelize the outer loops as well. commit e8ef69692831db07ddbe9485a5e504ac3f03e496 Author: Field G. Van Zee Date: Wed Jul 2 14:59:27 2014 -0500 Added shared library support to build system. Details: - Modified top-level Makefile to support building shared (dynamic) libraries. - Updated most configurations' make_defs.mk files to include necessary compiler/linker flags needed by top-level Makefile. - Note that by default, all configurations presently do NOT build shared libraries. To enable, one must change the value of BLIS_ENABLE_DYNAMIC_BUILD to 'yes'. commit b80df0f2cffb015da02e70a82b8512da9891ab67 Author: Field G. Van Zee Date: Mon Jun 23 13:52:39 2014 -0500 Added bump-version.sh script to 'build' directory. Details: - Added a bash script, bump-version.sh, to aid in incrementing the BLIS version string. commit 9ef1f1e21d083697fc730e48d7d9169c201f3da2 Author: Field G. Van Zee Date: Mon Jun 23 13:48:17 2014 -0500 CHANGELOG update (0.1.3) commit 036cc634918463b1caa0fd89c9a211f2f5639af7 (tag: 0.1.3) Author: Field G. 
Van Zee Date: Mon Jun 23 13:48:17 2014 -0500 Version file update (0.1.3) commit 09d9a3bf6763932d9f571085b2cfd1b8631eccba Author: Field G. Van Zee Date: Mon Jun 23 13:43:26 2014 -0500 Reverting version file to test new version script. Details: - Changed version file contents to 0.1.2 so that I can test out a new version file bumping script. commit ebb33965981dcb2b0bdee5fc7fdf6c959420f311 Author: Field G. Van Zee Date: Mon Jun 23 11:22:50 2014 -0500 Added 'version' file. commit 2cb9a5501a3cbeb6692cf68e896087ba73b6af69 Author: Field G. Van Zee Date: Mon Jun 23 10:42:29 2014 -0500 Removed 'version' from .gitignore file. commit b40dcefc5ee31f67aa3990e2e9d2ef8ed1386a25 Merge: 7101a8ee b693b0cd Author: Field G. Van Zee Date: Mon Jun 23 10:39:05 2014 -0500 Merge pull request #11 from Maratyszcza/stable [sc]axpy kernels for PNaCl commit b693b0cddcfb41450e3c09a3ab97acb44c1ccdec Author: Marat Dukhan Date: Sun Jun 22 13:44:25 2014 -0700 [SC]AXPY kernels for PNaCl commit 7101a8eec0327d6c3a7eb36eb4b0fd45c1c6d162 Merge: ad48dca2 020a831b Author: Field G. 
Van Zee Date: Thu Jun 19 21:46:50 2014 -0500 Merge pull request #10 from Maratyszcza/stable Portable Native Client port commit 020a831bc5f61744cb8354886aa679b99b1285f6 Author: Marat Dukhan Date: Thu Jun 19 00:58:26 2014 -0700 Code clean-up in PNaCl port commit 491be4f91ed725522f5cc7184053857c6c376ada Author: Marat Dukhan Date: Thu Jun 19 00:45:44 2014 -0700 Optimized dot product kernels for PNaCl commit 4b8e71aab80182873a2e138eb07902b8d8fd5480 Author: Marat Dukhan Date: Thu Jun 19 00:43:25 2014 -0700 Use AR rcs flags for PNaCl target to avoid warning commit 031deb2a5c718d569bde842590a791b812f4cf1d Author: Marat Dukhan Date: Wed Jun 18 03:11:34 2014 -0700 PNaCl configuration: use pnacl-ar instead of ar (fixes build issue on Mac) commit 68a02976e3c3638f0a9821342e269a1743e3ace3 Author: Marat Dukhan Date: Wed Jun 18 03:10:25 2014 -0700 Compile pnacl configuration in GNU11 mode to avoid warning about non-standard features commit 6f8462eb0ec278b89731e73ef583386a3371d095 Author: Marat Dukhan Date: Wed Jun 18 03:08:46 2014 -0700 Fix inconsistent VERBOSE macro in Makefile commit b2ffb4de8b6872cb23537ad282e557d11dcd9c8b Author: Marat Dukhan Date: Sun Jun 15 18:41:30 2014 -0400 Reformatted PNaCl GEMM kernels commit 6de2d472d98baa215264a776f3d5291780a6a085 Author: Marat Dukhan Date: Sun Jun 15 08:44:31 2014 -0400 CGEMM and ZGEMM kernels for PNaCl commit f064711a5e6fb3852c17c7520909b09dc27665f2 Author: Marat Dukhan Date: Sun Jun 15 06:27:37 2014 -0400 SGEMM and DGEMM kernels for PNaCl commit ad48dca22913a363899f0bef45553898718eebb1 Merge: ee2b6792 7118f87e Author: Field G.
Van Zee Date: Sat Jun 14 15:10:13 2014 -0500 Merge pull request #9 from tkelman/memalign_windows Use _aligned_malloc instead of posix_memalign on Windows commit 7118f87e18b4941423472afc00215c1d1f2a1fcd Author: Tony Kelman Date: Sat Jun 14 06:53:20 2014 -0700 Use _aligned_malloc instead of posix_memalign on Windows commit ee2b679281ca45fb40b2198e293bc3bc3d446632 Author: Tyler Smith Date: Fri Jun 6 12:41:55 2014 -0500 Only include omp.h if BLIS_ENABLE_OPENMP is set commit 19c05dfaac43c627f86e897c8c00f1f9440754aa Author: Field G. Van Zee Date: Thu Jun 5 10:54:16 2014 -0500 CHANGELOG update (for 0.1.2). commit 00f232f8ed1f7c41619b12ebf779ebe2c3b2d3cd (tag: 0.1.2) Author: Tyler Smith Date: Mon Jun 2 13:40:57 2014 -0500 Added single-precision micro-kernel for Knights Corner aka MIC aka Xeon Phi commit 3fc60e491426f6248c0feae88d971e4d1f88fb95 Author: Field G. Van Zee Date: Wed May 21 11:34:42 2014 -0500 Fixed ldim alignment bug in core2 gemm ukernel. Details: - Fixed a bug in the dunnington/core2 gemm micro-kernels that resulted in a segmentation fault if a column-stored matrix's starting address was aligned, but its leading dimension was such that its second column was unaligned. Basically, the micro-kernel was assuming that aligned load instructions were safe when they actually were not. An extra condition that checks the alignment of cs_c (ie: the leading dimension in the column storage case) has now been added. Thanks to Michael Lehn for reporting this bug. commit 77a2d8dac8b242d7a202c9aabda3927ab68cf987 Merge: 8c5d6071 21fb0893 Author: Field G. Van Zee Date: Tue May 20 09:53:19 2014 -0500 Merge pull request #8 from tlrmchlsmth/master Added multithreading to most level-3 operations. 
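Editorial illustration (not part of the commit history): commit 3fc60e49 above describes a micro-kernel that assumed aligned loads were safe whenever the starting address of a column-stored C was aligned, even though an unsuitable leading dimension (cs_c) could leave the second column unaligned. The sketch below shows the kind of combined check the commit describes. The function name can_use_aligned_loads() and its parameters are hypothetical, and the actual condition in the dunnington/core2 kernels may be written differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: aligned vector loads on every column of a
   column-stored matrix of doubles are safe only if BOTH the base address
   and the column stride (in bytes) are multiples of the alignment. */
bool can_use_aligned_loads(const double *c, ptrdiff_t cs_c, size_t align)
{
    assert(align != 0);

    bool   base_aligned = ((uintptr_t)c % align) == 0;
    size_t cs_elems     = (size_t)(cs_c < 0 ? -cs_c : cs_c);
    bool   ldim_aligned = ((cs_elems * sizeof(double)) % align) == 0;

    /* Checking only base_aligned reproduces the bug described above. */
    return base_aligned && ldim_aligned;
}
```

With 16-byte alignment and 8-byte doubles, a leading dimension of 4 keeps every column aligned, whereas a leading dimension of 3 places the second column 24 bytes past the base, so the check fails even when the base address itself is aligned.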
commit 21fb089387ee7c87f6dc53b0f60f68b48d3ff3e8 Author: Tyler Smith Date: Mon May 19 20:38:55 2014 -0700 Reverting changes dunnington and reference configs Now they are unchanged from the main branch of BLIS commit 8a0ef0e0db5880730425926f8ba56b457a2ba764 Author: Tyler Smith Date: Fri May 16 13:44:14 2014 -0500 Fixed rounding error in bli_get_range_weighted commit 0b4b1680334528b1b60bc696537600f763198e92 Author: Tyler Smith Date: Fri May 16 12:23:37 2014 -0500 Fixed bug with disabling JC loop threading for right sided trmm commit 5c048a90d8dfa1dbde4e45fbc10ffcbdfe59d960 Author: Tyler Smith Date: Wed May 14 16:20:06 2014 -0500 Disabled parallelism for right-sided TRMM JC loop The loop has dependent iterations. commit 13a4c717ed0e273359dbaf5554cc4fa70b087d71 Author: Tyler Smith Date: Wed May 14 14:59:04 2014 -0500 Fixed bug with bli_get_range_weighted commit 45957cc7745e9bb1698408d72f53ef192e960820 Author: Tyler Smith Date: Tue May 13 17:14:46 2014 -0500 Allowed threading to be turned off No longer requires OpenMP to compile Define the following in bli_config.h in order to enable multithreading: BLIS_ENABLE_MULTITHREADING BLIS_ENABLE_OPENMP Also fixes a bug with bli_get_range_weighted commit bd1dc98ce599d74513a553fe3b37a2ebca1c3812 Author: Tyler Smith Date: Mon May 12 17:26:19 2014 -0500 Disabled multithreading of the kc loop commit 456df0372170bd7ca2c7e2d85365a69f1f04de88 Author: Tyler Smith Date: Wed Apr 30 12:28:00 2014 -0500 Replaced register blocksize hack with querying the register blocksize for determining parallelism granularity commit f4fdfe8fc573553eb36795b79cdf681270dab71b Merge: 31bb065b 8c5d6071 Author: Tyler Smith Date: Wed Apr 30 11:46:35 2014 -0500 Merge http://github.com/flame/blis commit 8c5d6071e24ba10a53669390a47287e86ff354ce Author: Field G. Van Zee Date: Tue Apr 29 12:26:12 2014 -0500 Added _check() routines for fprint[mv], rand[mv]. Details: - Added _check() routines for fprintm, fprintv, randm, and randv. 
- Added invocations to the above routines from their respective front-ends. commit 262cdabcc885bcf6636f4d8bb7d320f95e81d820 Author: Field G. Van Zee Date: Mon Apr 28 16:48:25 2014 -0500 Changed treatment of NULL object buffers. Details: - Relaxed the constraint in bli_obj_attach_buffer_check(), which required the buffer address being attached to be non-NULL. This is acceptable because the user was already able to create and use objects with NULL buffers (via bli_obj_create_without_buffer(), which initializes the buffer to NULL). - Inserted calls to newly defined function, bli_check_object_buffer(), into nearly all operations' _check() or _int_check() functions. This allows BLIS to abort peacefully if a computational routine is called with an object containing a NULL buffer. By contrast, under such conditions, BLAS would typically fail with a segmentation fault. - Within operation front-ends, moved the calls to _check()/_int_check() so that zero dimensions are checked first (and if found, execution returns with trivial or no computation). This resolves issue #7. Thanks to Jack Poulson for reporting this bug. commit 31bb065ba40ae0c5a614e743b8025abca012b99e Merge: 20e24430 7c619599 Author: Tyler Smith Date: Wed Apr 23 12:30:19 2014 -0500 Merge http://github.com/flame/blis commit 7c61959955c8ba78160d0ed4d1979022029d963b Author: Field G. Van Zee Date: Thu Apr 10 17:18:36 2014 -0500 Can now query register blocksizes from blk algs. Details: - Added a new field to blksz_t objects that allows one to attach a sub-object. Doing this allows us to associate a register blocksize with any given cache blocksize. That way, the register blocksize can be queried wherever the cache blocksize would normally be accessible (e.g. a blocked algorithm). - Modified bli_gemm_cntl.c (and 4m/3m variants) so that the register blocksizes are attached to the cache blocksizes after they are created. commit 58671597d3d450817b2eda576c05ed6dadd8af6d Author: Field G. 
Van Zee Date: Thu Apr 10 15:35:30 2014 -0500 Minor cleanups to level-2 _cntl.c files. Details: - Changed level-2 _cntl.c files so that the blocksizes for gemv are imported and used, rather than blocksizes being declared locally. - Whitespace changes to gemv_cntl.c and gemm_cntl.c files (as well as 4m/3m variants). - Removed test/old/test_blis2.c. commit 20e24430a772bc0fbaf24dec2f8c544096fd3f4e Author: Tyler Michael Smith Date: Tue Apr 8 17:50:44 2014 +0000 Some fixes for the bgq kernels commit bde697f75ec1e7f2decebee0c9bd620b4c134cd5 Author: Tyler Smith Date: Fri Apr 4 16:43:44 2014 -0500 Add -openmp to ldflags as well commit c332be8cd471eeace7b4fa4ae7443088b6a68ec3 Author: Tyler Smith Date: Fri Apr 4 16:37:50 2014 -0500 Added -openmp flag to Xeon Phi build for convenience commit e7ca9e4b4a24d585c9aec8293fc7bb79e4171ad0 Author: Tyler Smith Date: Fri Apr 4 16:31:15 2014 -0500 Used BLIS_DEFAULT_*_MR for rounding partitioning instead of BLIS_DEFAULT_*_MC commit 7b9b228c6fa4cfb70b1ebb855b009a036e85fac3 Author: Tyler Smith Date: Fri Apr 4 16:29:10 2014 -0500 Fix for tree barrier freeing bug commit 5ec93bd9a76096312d51c326ccde1e9bd0a436ab Author: Tyler Smith Date: Fri Apr 4 15:09:10 2014 -0500 Bunch of minor fixes Removed barrier after unpackm in all level3 blocked variants Now there is an implicit barrier inside unpackm that only occurs if C is packed (which is usually not the case) Moved the enabling of the tree barriers into bli_config.h Fed the default MR and NR for double precision into bli_get_range instead of the number 8 commit 575fb9b0b08f3bdb56ccde056da619d1585617c1 Author: Tyler Smith Date: Fri Apr 4 12:13:29 2014 -0500 Changed default blocking factor to default double precision MR and NR commit ab9c7880335c281432d5809fe0dec46753d22569 Author: Tyler Smith Date: Fri Apr 4 11:38:11 2014 -0500 Added faster tree barriers necessary for performance for Xeon Phi Fixed up some stuff in the thread info free functions Disabled threading for TRSM so that it actually 
works when threading environment variables are set commit ec58a7923cccac08632670caadf3cf6ff5dce766 Author: Tyler Smith Date: Fri Apr 4 10:22:48 2014 -0500 Freeing thread info paths. Also made herk IC and JC loops do weighted partitioning commit 2b6848b2397d6d84ca4e5f792fc51ad05e351a36 Merge: 4e3eb39a 21a0efb3 Author: Tyler Smith Date: Fri Apr 4 09:54:54 2014 -0500 Merge http://github.com/flame/blis Conflicts: kernels/bgq/1/bli_axpyv_opt_var1.c kernels/bgq/1/bli_dotv_opt_var1.c commit 4e3eb39aca4df0b9fdc003d468f368a2f2ba597d Author: Tyler Michael Smith Date: Fri Apr 4 14:50:03 2014 +0000 Some fixes to the bgq config MR and NR for double complex were wrong Default fusing factor for double precision was wrong as well commit 21a0efb33d7435139e9c43c1a4787a6bff533e26 Author: Field G. Van Zee Date: Thu Apr 3 16:38:44 2014 -0500 Fixed follow-up to issue #6. commit c318157a9bee8ea6e59be16f99f65d9271fe0d27 Author: Field G. Van Zee Date: Thu Apr 3 16:24:34 2014 -0500 Fixed issue #6 (incorrect 'restrict' usage). Details: - Fixed improper usage of restrict keyword in axpyv and dotv bgq kernels. (However, there may be other instances of similar misuse elsewhere in BLIS.) Thanks to Jeff Hammond for reporting this issue. commit b5150a1bf3bd89598e2b3aeac110eb5b44ac6c12 Author: Field G. Van Zee Date: Thu Apr 3 12:25:45 2014 -0500 Added #include "arm_neon.h" to ARM gemm ukernel. Details: - Inserted #include "arm_neon.h" into gemm ukernel source file for arm/neon. Thanks to Jean-Michel Hautbois for suggesting this fix. commit 2041c264517b6c590fd4f7e8253e6911b622d1c3 Author: Tyler Smith Date: Thu Apr 3 10:30:03 2014 -0500 Added barriers needed prior to doing scalar reset for rank-k updates. commit 47a90e69dfde3f4f8fdf90654248a6b499fbadbc Author: Field G. Van Zee Date: Tue Apr 1 14:34:31 2014 -0500 Attempted to fix uninitialized variable warnings. Details: - Added initialization statements to various macros used in level 1m and 1m-like operations. 
I wasn't able to reproduce the reported behavior, so hopefully this takes care of it. Thanks to Jeff Hammond for the report. commit d27b4f690c14b1f836f8c7a3c0e91e09d852f02e Author: Field G. Van Zee Date: Tue Apr 1 12:57:24 2014 -0500 Use generic paths for toolchain in POWER7. Details: - Fixed issue #4. Thanks to Jeff Hammond for contributing changes. commit 1584ae1c83c3a8c1af76acb46404747507650f19 Author: Tyler Smith Date: Fri Mar 28 15:15:48 2014 -0500 Fixed race condition involving scalar reset commit 459dde4acc09e49380da58fb7b246db488884ad9 Author: Tyler Smith Date: Thu Mar 27 17:06:45 2014 -0500 Made barrier after packing implicit. This also fixed a bug where barriers in the blocked variants were inserted after the inner packing routines, but not the outer packing routines. This allowed, for instance, computation to begin before the block of B had finished being packed. commit 9f78ec6e7e95fcad89a167b27cad7e2d74b6d122 Author: Tyler Smith Date: Thu Mar 27 14:18:46 2014 -0500 Some fixes for the internal functions, which were inappropriately having only the thread chief do some things.
commit a6fd48345424e097f71652be013aa897e098b41e Author: Tyler Michael Smith Date: Wed Mar 26 17:19:46 2014 +0000 Added test drivers for level 3 BLAS that run tests in parallel using MPI commit 73b3db594864be0f9be9a0eb29bf961fa9c95f29 Author: Tyler Michael Smith Date: Wed Mar 26 15:39:05 2014 +0000 Some fixes for the bgq configuration commit f0824a04fc75e231c3a3d7757fa4e7294173282f Author: Tyler Smith Date: Mon Mar 24 15:21:42 2014 -0500 Initial commit to enable threading in TRSM, Also enabled weighted partitioning for herk, trmm Fixed bug where multiple threads would try to modify the same state in the internal level 3 functions Correctly computed a_next and b_next for gemm, herk macrokernels a_next and b_next point to the current micropanels in trmm commit 23d9eab354fbc88165889832955e126772bf8488 Merge: 5d5dc2ee fd3e32a5 Author: Tyler Smith Date: Thu Mar 20 16:54:35 2014 -0500 Merge https://github.com/flame/blis commit 5d5dc2eedef2f7c90d61371a1b457be5c06cf583 Author: Tyler Smith Date: Thu Mar 20 16:43:36 2014 -0500 Parallelized trmm and trmm3 Also fixed bugs in packm commit fd3e32a5f419fa412f46afe4dd1c3a26e15f3eb4 Author: Field G. Van Zee Date: Thu Mar 20 13:59:48 2014 -0500 Refined INSERT_GENTFUNC macro usage. Details: - Defined new INSERT_GENTFUNC macros so that the macro always takes exactly the number of arguments needed for the particular operation or variant being defined. Many operations were using INSERT_GENTFUNC macros that expected one auxiliary argument even though none were needed. Those instances have now been updated. Most of these instances were in the level-0 and -1v operations, as well as some operations defined in frame/util. commit 9b0e715f29338a1a1d6445907d2445c35f011121 Author: Field G. Van Zee Date: Wed Mar 19 15:47:54 2014 -0500 Minor simplifications to trmm, trsm macro-kernels. Details: - Simplified some code that would have allowed the diagonal of a trmm or trsm triangular matrix to intersect the short end of a micro-panel. 
This is disallowed via higher-level constraints on cache blocksizes, so this code was never needed and only served to obfuscate. - Updated some comments in trmm, trsm macro-kernels. commit a3902750b9ab4923433f7e353f3669c3c419f8e4 Author: Field G. Van Zee Date: Wed Mar 19 12:35:17 2014 -0500 Reorganized norm operations. Details: - Completely reorganized norm operations: - Renames: - fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm) - absumv -> norm1v (vector 1-norm) - New operations: - norm1m (matrix 1-norm) - normiv, normim (infinity-norm) - amaxv (BLAS-like absolute maximum value index) - asumv (BLAS-like absolute sum) - Deprecated absumm, as it did not correspond to any actual norm. (However, an inlined version now exists in the testsuite module for randm.) commit c0140cb752f27e99742f85d23be2181c00a1335e Author: Tyler Smith Date: Wed Mar 19 11:21:16 2014 -0500 Fixed packm variants 3 and 4 where every thread was trying to manipulate the same state. This is now performed only by the master thread.
commit fb42983bd9943711baa7d1c6496de1215bb816ef Author: Tyler Smith Date: Tue Mar 18 16:37:28 2014 -0500 Fixed a barrier bug and a thread decorator bug commit aa2405f8b23d0f8d2ec04790882f2176ef2e8fd8 Author: Tyler Smith Date: Tue Mar 18 15:23:09 2014 -0500 Fixing function pointer issues with thread decorator commit ec8b88f93533942d3711191873310e7ff281bda6 Author: Tyler Smith Date: Tue Mar 18 14:35:37 2014 -0500 Enabled threading for packm blocked variants 3 and 4 commit 0ac534cdf657bbf04601abfe719ba2887aab5da7 Author: Tyler Smith Date: Tue Mar 18 13:26:27 2014 -0500 Added decorator for calling parallelized internal functions Will allow for easy support for different threading models commit 5296f58975f7d351f88909cc80b6d0cffd73def7 Author: Tyler Smith Date: Mon Mar 17 17:15:35 2014 -0500 Fixing some bugs with herk parallelization commit c51d0110831eb89361b4720bf7ed75edbd26ebce Author: Tyler Smith Date: Mon Mar 17 15:00:47 2014 -0500 Initial multithreading support for HERK commit c720b141568d1f289146bf34ded08001f2c0dfbb Author: Tyler Smith Date: Mon Mar 17 11:39:32 2014 -0500 Switched to using environment variables to control threading. The environment variables all follow the format BLIS_X_NT, where X is the index of the loop as described in our paper Anatomy of High Performance Many-Threaded Matrix Multiplication. These indices are IR, JR, IC, KC, and JC. Also enabled parallelism for hemm and symm, but these are currently untested.
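Note on commit c720b141 above: the BLIS_X_NT naming scheme can be illustrated with a minimal sketch. This is not the BLIS implementation; the helper names parse_nt() and read_loop_nt(), the loop_nt_t struct, and the fallback to 1 when a variable is unset or invalid are all assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for the five per-loop thread counts. */
typedef struct { long jc, kc, ic, jr, ir; } loop_nt_t;

/* Hypothetical helper: convert an environment string to a thread count.
   Falls back to 1 (assumed default) if the value is missing or invalid. */
long parse_nt(const char *val)
{
    if (val == NULL) return 1;
    char *end;
    long n = strtol(val, &end, 10);
    if (end == val || *end != '\0' || n < 1) return 1;
    return n;
}

/* Hypothetical helper: build the name BLIS_<loop>_NT and look it up. */
long read_loop_nt(const char *loop)
{
    char name[32];
    snprintf(name, sizeof name, "BLIS_%s_NT", loop);
    return parse_nt(getenv(name));
}

/* Query all five loop indices named in the commit message. */
loop_nt_t read_all_loop_nt(void)
{
    loop_nt_t nt;
    nt.jc = read_loop_nt("JC");
    nt.kc = read_loop_nt("KC");
    nt.ic = read_loop_nt("IC");
    nt.jr = read_loop_nt("JR");
    nt.ir = read_loop_nt("IR");
    return nt;
}
```

Under these assumptions, launching a program with BLIS_JC_NT=2 and BLIS_IC_NT=4 set in the environment would make read_all_loop_nt() report 2 and 4 for those loops and 1 for the rest.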
commit 92233cf64274b27b2217c5cfffe75443ff6137a4 Author: Tyler Smith Date: Tue Mar 11 14:16:08 2014 -0500 Some fixes to gemm thread info tree creation, Changed microkernel tests to use the new BLIS_PACKM_SINGLE_THREADED instead of BLIS_SINGLE_THREADED commit 020f80c30289d8bcaa688bf600b01fae9b23b54f Author: Tyler Smith Date: Tue Mar 11 12:08:17 2014 -0500 Added files specific to threading for gemm and packm operations commit 8d8f4352a41926bc923e47be836365b6b726aff2 Author: Tyler Smith Date: Mon Mar 10 15:47:28 2014 -0500 Added single threaded thread info data structures specifically for gemm and packm commit 0e8677761175189583ca7d855e24b2bbdd2dada8 Merge: 2e727a02 b3bff631 Author: Tyler Smith Date: Mon Mar 10 15:16:21 2014 -0500 Merge branch 'master' of https://github.com/tlrmchlsmth/blis commit 2e727a025a8f796d2b6bd14f489d0ee72e7d1fc7 Author: Tyler Smith Date: Mon Mar 10 15:14:33 2014 -0500 Modifying the thread info data structures This change makes each operation have its own thread info type, allowing more fine control of threading in operations that have different types of suboperations commit a770590cf21a459f04bf941c58ee2afd272cc441 Author: Field G. Van Zee Date: Mon Mar 3 14:31:44 2014 -0600 Minor fixes to sumsqv, abmaxv. Details: - Minor update to bli_sumsqv_unb_var1() to bring it up-to-date with LAPACK 3.5.0's zlassq.f, which, starting with 3.4.2, returns NaN when the vector (or matrix) contains a NaN. - Minor change to bli_abmaxv_unb_var1() to more closely mimic the behavior of netlib BLAS's izamax(). There, a "less than or equal to" operator is used in the search instead of "less than", which would change the element index returned if there were multiple maximum values. - Added macro function definitions for bli_isinf() and bli_isnan(), which are currently implemented in terms of isinf() and isnan() from math.h. 
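Note on commit a770590c above: the choice of comparison operator in an absolute-maximum search decides which index is returned when several elements tie for the maximum. The sketch below is not the BLIS or netlib code; the function names are hypothetical and the input is assumed to be real, NaN-free, and of length n >= 1.

```c
#include <stddef.h>

/* Local absolute value, to avoid depending on math.h. */
static double absval(double v) { return v < 0.0 ? -v : v; }

/* Skip a candidate that is "less than or equal to" the current maximum.
   Ties leave the index unchanged, so the FIRST maximal element wins. */
size_t amax_skip_le(const double *x, size_t n)
{
    size_t idx = 0;
    for (size_t i = 1; i < n; ++i) {
        if (absval(x[i]) <= absval(x[idx])) continue;
        idx = i;
    }
    return idx;
}

/* Skip a candidate only if it is strictly "less than" the current maximum.
   Ties replace the index, so the LAST maximal element wins. */
size_t amax_skip_lt(const double *x, size_t n)
{
    size_t idx = 0;
    for (size_t i = 1; i < n; ++i) {
        if (absval(x[i]) < absval(x[idx])) continue;
        idx = i;
    }
    return idx;
}
```

For the vector { 1, -3, 2, 3 }, the first function returns index 1 and the second returns index 3, which is the kind of difference the commit message refers to.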
commit b3bff631eadf98b15cb422fb4a8e2f855c23e8a7 Merge: 2c158fb8 e8757b03 Author: Tyler Smith Date: Thu Feb 27 16:53:24 2014 -0600 Merge https://github.com/flame/blis commit 2c158fb885c27f7b599dc1e85b57edd684f19223 Merge: e4738c48 c2b2ab62 Author: Tyler Smith Date: Thu Feb 27 16:46:23 2014 -0600 Merge https://github.com/flame/blis Conflicts: frame/1m/packm/bli_packm_blk_var1.c commit e8757b03a74f9891632242e9a90efb32150826f5 Author: Field G. Van Zee Date: Thu Feb 27 16:40:07 2014 -0600 Use "%ld" as int format specifier in fprintm. Details: - Changed "%d" to "%ld" when printing integers via bli_fprintm(). - Meant to include this in previous commit. commit c663ce3b5170fee7dfb5b528b650d70c8e932cac Author: Field G. Van Zee Date: Thu Feb 27 16:32:57 2014 -0600 Fixed various bugs when C99 complex is enabled. Details: - Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and elsewhere in the framework that were not yet set up to work properly when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h - Extensive changes to f2c-derived files in frame/compat/f2c to allow C99 complex storage. Most of these changes center around accessing real and imaginary components via bli_?real()/bli_?imag() accessor macros, and setting of values via bli_?sets() assignment macros. (Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX was broken.) commit e4738c48e00b89391d9baa1fd0aa62d1ea2f95e6 Author: Tyler Smith Date: Thu Feb 27 16:29:46 2014 -0600 Added support for parallelism in gemm micro-kernel commit bfe214b633765ed40b57b330fbb84c332663aa40 Author: Tyler Smith Date: Thu Feb 27 15:53:10 2014 -0600 Fixed bug with parallel packing, and bug with allocating an array of thread infos In packm variant 1, the variable p_begin was incremented each iteration, causing a dependency. This dependency was removed, allowing each iteration to be executed in parallel. Somewhere in bli_threading.c, I was allocating an array of pointers instead of an array of structs.
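Note on commit bfe214b6 above: both bug classes it mentions can be sketched in a few lines. This is an illustration only, not the code from bli_threading.c or packm variant 1; thread_info_t, alloc_thread_infos(), and pack_panels() are hypothetical names, and the exact form of the original bugs is assumed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical thread info struct (larger than a pointer on common ABIs). */
typedef struct {
    int thread_id;
    int n_ways;
    int work_id;
    int n_work;
} thread_info_t;

/* Allocation bug class: sizing the buffer by the pointer type, e.g.
       malloc(n * sizeof(thread_info_t *))
   reserves room for n pointers, not n structs. Sizing by the pointed-to
   object avoids the mismatch. */
thread_info_t *alloc_thread_infos(int n)
{
    thread_info_t *infos = malloc((size_t)n * sizeof *infos);
    if (infos == NULL) return NULL;
    for (int i = 0; i < n; ++i) {
        infos[i].thread_id = i;
        infos[i].n_ways    = n;
        infos[i].work_id   = i;
        infos[i].n_work    = n;
    }
    return infos;
}

/* Dependency bug class: a destination pointer that is incremented once per
   iteration ("p_begin += panel_stride") makes iteration i depend on i-1.
   Computing it from the loop index makes every iteration independent, so
   the loop could be divided among threads. */
void pack_panels(const double *src, double *p,
                 size_t n_panels, size_t panel_len, size_t panel_stride)
{
    for (size_t i = 0; i < n_panels; ++i) {
        double *p_begin = p + i * panel_stride;   /* depends only on i */
        memcpy(p_begin, src + i * panel_len, panel_len * sizeof *src);
    }
}
```

In pack_panels(), panel_stride may exceed panel_len, in which case the gap between packed panels is left untouched.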
commit 6193d9ceea552e67170dba45abde04c64271c705 Author: Tyler Smith Date: Thu Feb 27 14:09:19 2014 -0600 Fixed bug in thread trees commit ac5a2de1d17ffd460b00fee9757898525a09abae Merge: 01b125e8 bd3c7ecf Author: Tyler Smith Date: Thu Feb 27 11:59:33 2014 -0600 Merge branch 'master' of https://github.com/tlrmchlsmth/blis commit 01b125e815f19410e8e0611d088b84570e499e93 Author: Tyler Smith Date: Thu Feb 27 11:55:45 2014 -0600 First pass at adding parallelism to BLIS. Added a multithreading infrastructure that should be independent of multithreading implementation in the future. Currently, gemm blocked variants 1f and 2f, and packm blocked variant 1 are parallelized. commit c2b2ab62707e4174892aff3ce65f36f54878fae5 Author: Field G. Van Zee Date: Wed Feb 26 12:46:45 2014 -0600 Deprecated panel stride alignment in bli_config.h. Details: - Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE from bli_config.h of all configurations. It was already going unused in packm_init() since the recent 4m/3m commit. This setting was rarely, if ever, useful, and its existence only posed a potential risk for 4m/3m-based implementations. - Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE usage from mem_pool_macro_defs.h. - Updated comments regarding CONTIG_STRIDE_ALIGN_SIZE in template micro-kernels. commit f18aee83a5ac1b14808686fc3c5a3c846a1d99b9 Author: Field G. Van Zee Date: Tue Feb 25 17:58:42 2014 -0600 CHANGELOG update (for 0.1.1). commit fde5f1fdece19881f50b142e8611b772a647e6d2 (tag: 0.1.1) Author: Field G. Van Zee Date: Tue Feb 25 13:34:56 2014 -0600 Added extensive support for configuration defaults. Details: - Standard names for reference kernels (levels-1v, -1f and 3) are now macro constants. Examples: BLIS_SAXPYV_KERNEL_REF BLIS_DDOTXF_KERNEL_REF BLIS_ZGEMM_UKERNEL_REF - Developers no longer have to name all datatype instances of a kernel with a common base name; [sdcz] datatype flavors of each kernel or micro-kernel (level-1v, -1f, or 3) may now be named independently.
This means you can now, if you wish, encode the datatype-specific register blocksizes in the name of the micro-kernel functions. - Any datatype instances of any kernel (1v, 1f, or 3) that is left undefined in bli_kernel.h will default to the corresponding reference implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined, it will be defined to be BLIS_DGEMM_UKERNEL_REF. - Developers no longer need to name level-1v/-1f kernels with multiple datatype chars to match the number of types the kernel WOULD take in a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is sufficient, as in bli_daxpyv_opt(). - There is no longer a need to define an obj_t wrapper to go along with your level-1v/-1f kernels. The framework now provides a _kernel() function which serves as the obj_t wrapper for whatever kernels are specified (or defaulted to) via bli_kernel.h - Developers no longer need to prototype their kernels, and thus no longer need to include any prototyping headers from within bli_kernel.h. The framework now generates kernel prototypes, with the proper type signature, based on the kernel names defined (or defaulted to) via bli_kernel.h. - If the complex datatype x (of [cz]) implementation of the gemm micro-kernel is left undefined by bli_kernel.h, but its same-precision real domain equivalent IS defined, BLIS will use a 4m-based implementation for the datatype x implementations of all level-3 operations, using only the real gemm micro-kernel. commit 15b51e990f1d21333b5f7af97c211756247336e5 Merge: 6363a9f6 fc04b5eb Author: Field G. Van Zee Date: Fri Feb 21 09:04:32 2014 -0600 Merge branch 'master' of github.com:fgvanzee/blis commit fc04b5eb69868c341ce03f5ef1f02de4b8c121b0 Merge: b29e1c2b d1813c9d Author: Field G.
Van Zee Date: Fri Feb 21 09:04:13 2014 -0600 Merge pull request #3 from figual/master New ARM armv7a kernels and Assembly file consideration in Makefile commit d1813c9dee34410833db5061e6588ec1a6c9ecd4 Author: Francisco Igual Date: Fri Feb 21 15:14:31 2014 +0100 Added new armv7a micro-kernels and configuration files from Werner Saar. commit 0cd098c03a000ed9426a7e9135190696da8cadbc Author: Francisco Igual Date: Fri Feb 21 15:12:30 2014 +0100 o Modified Makefile to consider .S assembly microkernels. commit 6363a9f658257fe3d814a3dce5308f807adb54a2 Author: Field G. Van Zee Date: Wed Feb 19 17:00:52 2014 -0600 Added level-3 support for complex via 4m-/3m. Details: - Added the ability to induce complex domain level-3 operations via new virtual complex micro-kernels which are implemented via only real domain micro-kernels. Two new implementations are provided: 4m and 3m. 4m implements complex matrix multiplication in terms of four real matrix multiplications, whereas 3m uses only three and thus is capable of even higher (than peak) performance. However, the 3m method has somewhat weaker numerical properties, making it less desirable in general. - Further refined packing routines, which were recently revamped, and added packing functionality for 4m and 3m. - Some modifications to trmm and trsm macro-kernels to facilitate indexing into micro-panels which were packed for 4m/3m virtual kernels. - Added 4m and 3m interfaces for each level-3 operation. - Various other minor changes to facilitate 4m/3m methods. commit b29e1c2b278c177e104c84ba462820ee8296df6c Merge: ee60377e bd3c7ecf Author: Field G. Van Zee Date: Fri Feb 14 14:11:54 2014 -0600 Merge pull request #2 from tlrmchlsmth/master Fixes and improvements to xeon phi implementation.
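Note on commit 6363a9f6 above: the arithmetic behind the 4m and 3m methods can be sketched on scalars. In BLIS the products are real matrix multiplications performed by real-domain micro-kernels; here each product is a single real multiplication, and the type and function names (cplx_t, mul_4m, mul_3m) are hypothetical. The 3m formula shown is the standard three-multiplication identity, which is assumed to be what the commit refers to.

```c
/* Hypothetical complex type with separate real and imaginary parts. */
typedef struct { double re, im; } cplx_t;

/* 4m: (a + bi)(c + di) = (ac - bd) + (ad + bc)i, using four real products. */
cplx_t mul_4m(cplx_t x, cplx_t y)
{
    cplx_t z;
    z.re = x.re * y.re - x.im * y.im;
    z.im = x.re * y.im + x.im * y.re;
    return z;
}

/* 3m: three real products, at the cost of extra additions.
       p1 = ac, p2 = bd, p3 = (a + b)(c + d)
       real = p1 - p2, imag = p3 - p1 - p2 */
cplx_t mul_3m(cplx_t x, cplx_t y)
{
    double p1 = x.re * y.re;
    double p2 = x.im * y.im;
    double p3 = (x.re + x.im) * (y.re + y.im);
    cplx_t z;
    z.re = p1 - p2;
    z.im = p3 - p1 - p2;
    return z;
}
```

For (1 + 2i)(3 + 4i), both functions give -5 + 10i. In the 3m form the imaginary part is obtained by subtracting p1 and p2 from p3, and that cancellation is one plausible source of the weaker numerical properties the commit message mentions.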
commit bd3c7ecfb54a9b9851c7d364f41c21e4cff52f6f Author: Tyler Smith Date: Fri Feb 14 14:05:57 2014 -0600 Removing changes to input.general and input.operations commit ce066863683cb4e910270cf8ab8e138b01ff3358 Author: Tyler Smith Date: Fri Feb 14 13:40:24 2014 -0600 Fixed more Xeon Phi bugs, especially with scattered update commit 31134b5c7076423aee1b4f494e925f27171d97e6 Author: Tyler Smith Date: Fri Feb 14 11:19:44 2014 -0600 Some fixes, changes, and improvements to the microkernel to the Xeon Phi commit ee60377e467862b9d8a7205c45dce5cf66c78c46 Author: Field G. Van Zee Date: Thu Feb 13 14:03:31 2014 -0600 Shifted some fields in info_t. Details: - Shifted the pack order, pack buffer type, and structure type fields to make room for an extra bit in the pack type/status field. commit bd3ab1ad4cf42f8bc30ab262acf8eccb49bb1a08 Author: Field G. Van Zee Date: Thu Feb 13 09:29:55 2014 -0600 Minor fixes to trsm consistent with prev on trmm. Details: - Removed use of bli_min() and bli_max() that were only being used to try to support situations where the diagonal would intersect the short end of some micro-panels, which is a situation that is disallowed at a higher level by various constraints on the register and cache blocksize. This only affected trsm_ll and trsm_lu. - Use panel stride as passed into the macro-kernel rather than compute it via k and PACKMR/PACKNR. This affects all macro-kernels of trsm. commit 6260b0b5f8bd248f3f66e5a1c6854bdbd9d02ad0 Author: Field G. Van Zee Date: Thu Feb 13 09:19:56 2014 -0600 Fixed obscure bug in trmm_ll, trmm_lu. Details: - Fixed an obscure bug in left-hand trmm that would only manifest when non-zero register blocksize extensions (PACKMR > MR or PACKNR > NR) are used.
- Removed use of bli_min() and bli_max() that were only being used to try to support situations where the diagonal would intersect the short end of some micro-panels, which is a situation that is disallowed at a higher level by various constraints on the register and cache blocksize. This only affected trmm_ll and trmm_lu. - Use panel stride as passed into the macro-kernel rather than compute it via k and PACKMR/PACKNR. This affects all macro-kernels of trmm. commit 16915c1c1e55c660bf82141cdadf7c0860d5b464 Author: Field G. Van Zee Date: Tue Feb 11 10:54:19 2014 -0600 Fixed an obscure bug in packm_cxk(). Details: - Fixed a bug in packm_cxk() whereby the packm ukernel was being chosen from ldp, which is always equal to PACKMR or PACKNR. The problem with this is that the pack ukernels were implicitly assuming that the panel dimension of the panel being packed was equal to ldp, which is not the case when the register blocksize extensions are non-zero (i.e., when PACKMR > MR or PACKNR > NR, whichever is applicable). This problem has been fixed by passing ldp into the pack ukernels, which now walk through the packed micro-panel region by incrementing by this value, rather than incrementing by the inherent panel dimension value assumed by each packm ukernel (e.g. 4 in the case of packm_ref_4xk). - Also fixed a very minor edge case inefficiency whereby pack ukernels smaller than the default were not being used in edge cases, and instead those situations were being handled by scal2m. This is related to the issue above, because the pack ukernel itself was being chosen based on ldp instead of the panel dimension. commit b7da57b282c5a5e2208946e60309d2352f55351d Author: Field G. Van Zee Date: Tue Feb 11 10:28:23 2014 -0600 Updated calls to packm_blk_var2() in testsuite. Details: - In ukernel testsuite modules, replaced calls to packm_blk_var2() with _var1(). Meant to include this in previous commit. commit c255a293e25b2223c88e8800267cd06ad2a90041 Author: Field G.
Van Zee Date: Mon Feb 10 14:31:24 2014 -0600 Consolidated packm_blk_var2 and var3. Details: - Consolidated the functionality previously supported by packm_blk_var2() and packm_blk_var3() into a new variant, packm_blk_var1(). - Updates to packm_gen_cxk(), packm_herm_cxk.c(), and packm_tri_cxk() to accommodate above changes. - Removed packm_blk_var3() and retired packm_blk_var2() to frame/1m/packm/old. - Updated all level-3 _cntl_init() functions so that the new, more versatile packm_blk_var1 is used for all level-3 matrix packing. commit 32d8f264ae7b28155f5d7b21dcc5ecb78da2e0ab Author: Field G. Van Zee Date: Sun Feb 9 10:07:37 2014 -0600 Refactored packm variants. Details: - Revised packm_blk_var2() and _var3() by encapsulating the general, hermitian/symmetric, and triangular panel-packing subproblems into separate functions: packm_gen_cxk(), packm_herm_cxk(), and packm_tri_cxk(), respectively. Also, homogenized the packm code as well as the new specialized packm_*_cxk() code to further improve readability. commit 6c8067028707947fcdf4f856a272e15bb9ed91e3 Author: Field G. Van Zee Date: Fri Feb 7 11:27:15 2014 -0600 Renamed enumerated type in testsuite and modules. Details: - Renamed the test suite's "mt_impl_t" enumerated type to "iface_t", and renamed all corresponding "impl" variables to "iface". commit 6c12598b1bc567f0b08f58aebdc753a1c1390378 Author: Field G. Van Zee Date: Thu Feb 6 18:26:35 2014 -0600 Employ simpler INSERT_ macro for ref ukernels. Details: - Defined a new macro, INSERT_GENTFUNC_BASIC0, which takes only one argument--the base name of the function--and employed this macro in the reference micro-kernel files instead of the _BASIC macro, which takes one auxiliary argument. That argument was not being used and probably just acted to unnecessarily obfuscate. commit 32cae66326b68706d0e695cfd60c9ca5bc32c534 Author: Field G. Van Zee Date: Thu Feb 6 18:06:42 2014 -0600 Fixed some instances of sloppy 'restrict' usage. 
Details: - Fixed some technical incorrectness with some usage of the 'restrict' keyword in the reference trsm micro-kernels. - Tweak to testsuite/Makefile that causes rebuild if libblis was touched. commit 7aceef7683e2a2aff3c7ec2a73508036af2e19e2 Author: Field G. Van Zee Date: Thu Feb 6 17:31:19 2014 -0600 Updated comments in macro-kernels. Details: - Updated (and fixed some errors in) the "Assumptions/assertions" comment section of macro-kernels. - Changed register blocksizes of reference configuration to MR = 8 and NR = 4. It's always good for MR != NR in the reference configuration since it may help uncover bugs related to non-square micro-kernels. commit 8fd292aa78950bcdf556605718f09d13f9575abc Author: Field G. Van Zee Date: Thu Feb 6 14:32:21 2014 -0600 Pass panel dimensions into macro-kernels. Details: - Modified the interfaces to the datatype-specific macro-kernels so that: - pd_a and pd_b are passed in (which contain the panel dimensions of packed panels of a and b). - rs_a and cs_b are no longer passed in (they were guaranteed to be 1). - Modified implementations of datatype-specific macro-kernels so pd_a, pd_b, cs_a, and rs_b are used instead of cpp macros for MR, NR, PACKMR, and PACKNR, respectively. - Declare temporary c matrices (ct) as being maxmr-by-maxnr, which for now is equivalent to being mr-by-nr. maxmr and maxnr are declared in a new header file bli_kernel_post_macro_defs.h. commit 3404e6657eabb017cd1580a2f1dd8e6fb13df923 Author: Field G. Van Zee Date: Wed Feb 5 11:19:10 2014 -0600 Deprecated incremental blocksize macro const defs. Details: - Removed macro constant definitions related to incremental blocksizes from all configurations' bli_kernel.h files. This change is minor and is mostly a cleanup related to a previous commit. commit 1e9afd39a63e0a58167d4439c1a0a880a4a35657 Author: Field G. Van Zee Date: Tue Feb 4 20:15:19 2014 -0600 Comment updates (removed vestiges of "bd"). commit 5cf58f7c2d5bc0d2d94d9576f7158d8f133b7aac Author: Field G. 
Van Zee Date: Tue Feb 4 09:15:19 2014 -0600 Added early returns for "object is zeros" case. Details: - Added some logic to packm_init(), pack_int() and gemm_int() so that (a) objects marked as BLIS_ZEROS are not packed, and (b) those objects are not computed with. This functionality is not currently needed by any existing implementations, but may be used in the future. commit 6bbd4be769a9b344a55abe5ddaca1a99fd29f7b4 Author: Field G. Van Zee Date: Mon Feb 3 13:15:25 2014 -0600 Added 'f' on some gemm and trmm blocked variants. Details: - Added 'f' to some block variant files/functions to be consistent with other file/functions' naming convention. Here, the f indicates partitioning in the "forward" direction. commit eb13cb2c6b182df5e2a9b88c76f50e2cee25b9e0 Author: Field G. Van Zee Date: Mon Feb 3 11:07:01 2014 -0600 Removed redundant non-gemm blksz_t creation. Details: - Removed code that creates duplicate blksz_t objects for herk, trmm, and trsm. Instead, the gemm blksz_t objects are accessed via extern and used directly. This reduces the amount of code associated with each of the three _cntl_init() and _cntl_finalize() function. commit 0a023a7d9e58e53b8c204a5f49aa8ca9afeba938 Author: Field G. Van Zee Date: Wed Jan 29 14:02:08 2014 -0600 Introduced new level-3 front-end layer. Details: - Added new _front() functions for each level-3 operation. This is done so that the choosing of the control tree (and *only* the choosing of the control tree) happens in what was previously the "front end" (e.g. bli_gemm()). That control tree is then passed into the _front() function, which then performs up-front tasks such as parameter checking. commit 251c5d112196d37b183e554bc9d406104aed65fb Author: Field G. Van Zee Date: Tue Jan 28 19:40:29 2014 -0600 Removed redundant hemm, her2k control trees. Details: - Removed code that generated a control tree specifically for hemm and symm. Instead, the gemm control tree is now configured so that it works for gemm, hemm, or symm. 
- Retired most her2k code, as it was not being used. (Currently, her2k is implemented as two invocations of herk.) I couldn't think of many situations where her2k variants were needed. - Removed some older her2k code. commit 5a36e5bf2f59d1e85d6dbce32a07d604c5e82d11 Author: Field G. Van Zee Date: Mon Jan 27 11:13:00 2014 -0600 Embed func_t microkernel objects in control trees. Details: - Modified all control tree node definitions to include a new field of type func_t*, which is similar to a blksz_t except that it contains one function pointer (each typed simply as void*) for each datatype. We use the func_t* to embed pointers to the micro-kernels to use for the leaf-level nodes of each control tree. This change is a natural extension of control trees and will allow more flexibility in the future. - Modified all macro-kernel wrappers to obtain the micro-kernel pointers from the incoming (previously ignored) control tree node and then pass the queried pointer into the datatype-specific macro-kernel code, which then casts the pointer to the appropriate type (new typedefs residing in bli_kernel_type_defs.h) and then uses the pointer to call the micro-kernel. Thus, the micro-kernel function is no longer "hard-coded" (that is, determined when the datatype-specific macro-kernel functions are instantiated by the C preprocessor). - Added macros to bli_kernel_macro_defs.h that build datatype-specific base names if they do not exist already, and then uses those to build datatype-specific micro-kernel function names. This will allow developers extra flexibility if they wanted to, for example, name each of their datatype-specific micro-kernels differently (e.g. double real might be named bli_dgemm_opt_4x4() while double complex might be named bli_zgemm_opt_2x2()). - Inserted appropriate code into _cntl_init() functions that allocates and initializes a func_t object for the corresponding micro-kernels.
The gemm ukernel func_t object is created once, in bli_gemm_cntl_init(), and then reused via extern wherever possible. commit 6cbd6f1c7f1915180aa28939833afde48665c5ae Author: Field G. Van Zee Date: Fri Jan 24 10:38:29 2014 -0600 Removed commented mixed domain macro-kernel code. Details: - Removed commented-out code from macro-kernels that was supposed to facilitate implementing mixed domain (complex times real) matrix multiplication. This functionality is (probably) still possible, but I'm getting tired of looking at the code every time I edit a macro-kernel. Plus, there are probably ways of doing it at a higher level, via control trees. commit 29778be1119f1a884330d7f8dc424a2df4101d58 Author: Field G. Van Zee Date: Wed Jan 22 16:03:11 2014 -0600 Removed b_aux field from cntl nodes. Details: - Removed b_aux field from all control tree node definitions. This field was being used in certain optimizations (incremental blocking) that were not actually being employed within BLIS, and are probably not employed by others. - Updated all _cntl_obj_create() function definitions and invocations according to above change. - Retired bli_gemm_blk_var4.c, which was one such function that employed incremental blocking, but which was never called by BLIS itself. commit 06ac727a42ec9e832c7832745036702014638f99 Author: Field G. Van Zee Date: Wed Jan 15 16:44:52 2014 -0600 Updated some comments in level-3 front ends. commit d628bf1da1560f1f5126a1ddfed8714f0a4b8da3 Author: Field G. Van Zee Date: Wed Jan 15 11:40:12 2014 -0600 Consolidated pack_t enums; retired VECTOR value. Details: - Changed the pack_t enumerations so that BLIS_PACKED_VECTOR no longer has its own value, and instead simply aliases to BLIS_PACKED_UNSPEC. This makes room in the three pack_t bits of the info field of obj_t so that two values are now unused, and may be used for other future purposes. - Updated sloppy terminology usage in comments in level-2 front-ends.
(Replaced "is contiguous" with more accurate "has unit stride".) commit ddc8c1c379b4787be5954802906593d7ea144452 Author: Field G. Van Zee Date: Mon Jan 13 14:55:43 2014 -0600 Suppress warning in Makefile (UNINSTALL_LIBS). Details: - Redirect errors to /dev/null when using 'find' to locate libraries that would be uninstalled upon executing "make uninstall-old". Before, if the Makefile was read before $(INSTALL_PREFIX)/lib existed, a "No such file or directory" message was emitted. This message was harmless, but is now suppressed in this situation. commit f8f67d7251bffc05020e20527c100c8115fd5e55 Author: Field G. Van Zee Date: Fri Jan 10 09:06:11 2014 -0600 Typecast bli_getopt() return value in testsuite. Details: - In the test suite driver, inserted an explicit typecast of the return value of bli_getopt() prior to parsing. The lack of typecast caused a problem on at least one system whereby a return value of -1 was interpreted as a garbage character. Thanks to Francisco Igual for finding and submitting this fix. commit e7f154fe2ed3e10e2323cefe5d25c2c23ac902c4 Author: Field G. Van Zee Date: Fri Jan 10 08:48:07 2014 -0600 Applied edge case fix to arm/neon microkernel. Details: - Applied an edge case bugfix, courtesy of Francisco Igual, to the current double precision real gemm microkernel in kernels/arm/neon/3. commit 89c76a8a51d070d263c13bfa5ace65769509f2b4 Author: Field G. Van Zee Date: Thu Jan 9 12:08:37 2014 -0600 Allow building outside source distribution. Details: - Modified build system (mostly configure and top-level Makefile) so that a user can build a BLIS library outside of the top-level directory of the source distribution. - Added "test" target to Makefile so that the user can run "make test", which will compile, link, and run the testsuite binary. This works even if the build directory is externally located, thanks to the test suite binary's new -g and -o command-line options.
Also, when creating the test suite via the top-level Makefile, the linking is against the local archive, in lib/, rather than at /lib. - Modified testsuite/Makefile so that it links against the library built locally, in ../lib/. - Added "-lm" to LDFLAGS of most configurations' make_defs.mk. - Various other cleanups to build system. commit 12fa82ec12cc340ab28552997d9d50f7c98691f8 Author: Field G. Van Zee Date: Wed Jan 8 16:09:26 2014 -0600 Implemented bli_getopt(). Details: - Added bli_getopt.c and .h files to frame/base. These files implement a custom version of getopt(), which may be used to parse command line options passed into a program via argc/argv. I am implementing this function myself, as opposed to using the version available via unistd.h, for portability reasons, as the only requirement is string.h (which is available via the standard C library). - Modified test suite to allow the user to specify the file name (and/or path) to the parameters and operations input files (-g may be used to specify the general input file and -o to specify the operations input file). If -g or -o or both are not given, default filenames are assumed (as well as their existence in the current directory). commit cafb58e86ea5cfb21b9eedc57ca8ebbf24252098 Author: Field G. Van Zee Date: Mon Jan 6 13:28:36 2014 -0600 Updated template micro-kernels to use auxinfo_t. Details: - Updated template micro-kernel implementations (located in config/template/kernels), to adhere to the new auxinfo_t interface. Meant to include this change in a0331fb1. - Changed template configuration to use 64-bit integers (for both BLIS and the BLAS compatibility layer). commit 9ab126b499c3805045020cb89a8a5848e28d3bf5 Author: Field G. Van Zee Date: Mon Jan 6 12:13:26 2014 -0600 Removed error checks in netlib->BLIS param mapping Details: - Disabled error checking in netlib-to-BLIS parameter mapping functions.
If the char value input to these functions was not one of the defined values, bli_check_error_code() with the appropriate error code value would be called, resulting in an abort(). This was unnecessary and redundant since these routines are currently only used within the BLAS compatibility layer, and they are only called AFTER parameter checking has already been performed on the original BLAS char values. If the application tried to override xerbla() to prevent an abort() from being called, this error checking would still get in the way. Thus, instead of reporting the error situation to the framework (ie: calling abort()), an arbitrary BLIS parameter value is now chosen and the function returns normally. Thanks to Jeff Hammond for finding and reporting this issue. commit 2cb13600f9f9601c60e7f96f4ca159d169ade9cb Author: Field G. Van Zee Date: Fri Jan 3 12:29:13 2014 -0600 Updated year in copyright headers to 2014. commit 290fa54e0083c9c837188b8321b13b1b282e7b0c Author: Field G. Van Zee Date: Fri Dec 20 14:10:26 2013 -0600 Store variable panel strides in trmm/trsm auxinfo. Details: - Changed the value being stored into the auxinfo_t structure in trmm and trsm macro-kernels. Whereas before we stored whatever value was provided to the macro-kernel implementation via ps_a/ps_b, now we store the stride that will advance to the next variable-length micro-panel of the triangular matrix A (left) or B (right). - Whitespace changes to the files affected above. commit e3a6c7e77667fd749248df3f75f880266c3136ec Author: Field G. Van Zee Date: Thu Dec 19 16:29:31 2013 -0600 Macroized conditionals for a2/b2 in macro-kernels. Details: - Replaced conditional expressions in macro-kernels related to computing the addresses a2 and b2 (a_next and b_next) with a preprocessor macro invocation, bli_is_last_iter(), that tests the same condition. - Updated gemm_ukr module to use auxinfo_t argument. - Whitespace changes in test suite ukr modules. 
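Note on commit e3a6c7e7 above: the following is a minimal sketch, in C, of the kind of last-iteration test that commit describes for choosing the a2/b2 (a_next/b_next) addresses. IS_LAST_ITER, next_panel, and panel_stride are hypothetical names chosen for illustration; this is not the BLIS definition of bli_is_last_iter(), and the wraparound to the first panel is an assumption based on the wording of commit 03106d65.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the condition the commit macroizes:
   true only on the final iteration of a loop over n_iter panels. */
#define IS_LAST_ITER(i, n_iter) ((i) == (n_iter) - 1)

/* Return the address of the micro-panel that iteration i would hint as
   "next". On the last iteration the address wraps around to the first
   panel instead of pointing past the end of the packed buffer. */
static const double *next_panel(const double *base, size_t i,
                                size_t n_iter, size_t panel_stride)
{
    assert(n_iter > 0 && i < n_iter);
    return IS_LAST_ITER(i, n_iter) ? base
                                   : base + (i + 1) * panel_stride;
}
```

With three panels of stride 4, iterations 0 and 1 yield base + 4 and base + 8, and iteration 2 wraps back to base.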
commit a0331fb10a50393e31d16339053b75b944132da1 Author: Field G. Van Zee Date: Thu Dec 19 14:50:11 2013 -0600 Introduced auxinfo_t argument to micro-kernels. Details: - Removed a_next and b_next arguments to micro-kernels and replaced them with a pointer to a new datatype, auxinfo_t, which is simply a struct that holds a_next and b_next. The struct may hold other auxiliary information that may be useful to a micro-kernel, such as micro-panel stride. Micro-kernels may access struct fields via accessor macros defined in bli_auxinfo_macro_defs.h. - Updated all instances of micro-kernel definitions, micro-kernel calls, as well as macro-kernels (for declaring and initializing the structs) according to above change. commit 392428dea4001fe4384efe29f6cde32f8abeeb35 Author: Field G. Van Zee Date: Thu Dec 12 19:01:47 2013 -0600 Added "ri" scalar macros. Details: - Added set of basic scalar macros that take arguments' real and imaginary components separately, named like the previous set except with the "ris" (instead of "s") suffix. - Redefined the previous set of scalar macros (those that take arguments "whole") in terms of the new "ri" set. - Renamed setris and getris macros to sets and gets. - Renamed setimag0 macros to seti0s. - Use bli_?1 macro instead of a local constant in bla_trmv.c, bla_trsv.c. commit f60c8adc2f61eaba06b892f4e73000159de93056 Author: Field G. Van Zee Date: Tue Dec 10 14:39:56 2013 -0600 Minor updates to dunnington configuration. Details: - Added commented alternatives to dunnington configuration's bli_kernel.h. - Minor reformatting of optimization flag variables in make_defs.mk. commit 4ef20150492db254b5baf2368add62e19b0ac11b Author: Field G. Van Zee Date: Mon Dec 9 18:53:03 2013 -0600 Tweaks to dunnington configuration (x86_64/core2). Details: - Updated BLIS_DEFAULT_KC_D from 256 to 384. - Enabled cache blocksize extension of up to 25% for MC and KC (for double-precision real). commit 5ad2ce7bf5ba3ea955e6d517bfd270e02820263b Author: Field G. 
Van Zee Date: Mon Dec 9 18:30:49 2013 -0600 Minor x86_64 (core2) kernel fixes. Details: - Fixed copy-and-paste bug whereby [scz]gemmtrsm_u_opt_d4x4 kernels for x86_64/core2 were calling the wrong reference code (l instead of u). - Fixed some unused variables in x86_64/core2 dotaxpyv and dotxaxpyf kernels. - Minor typecasting fix in testsuite/src/test_libblis.c. - Makefile updates. commit d289f5d3a9c0e1a68a17c1c32b736e282a289c4c Author: Field G. Van Zee Date: Thu Dec 5 10:56:13 2013 -0600 Whitespace changes to level-2 blocked variants. Details: - Joined some lines in level-2 blocked variants to match formatting used in level-3 blocked variants. - Streamlined implementation of bli_obj_equals() in bli_query.c. commit b444489f100d218bc8ef29b01ff8489c358559f9 Author: Field G. Van Zee Date: Tue Dec 3 16:08:30 2013 -0600 Added new "attached" scalar representation. Details: - Added infrastructure to support a new scalar representation, whereby every object contains an internal scalar that defaults to 1.0. This facilitates passing scalars around without having to house them in separate objects. These "attached" scalars are stored in the internal atom_t field of the obj_t struct, and are always stored to be the same datatype as the object to which they are attached. Level-3 variants no longer take scalar arguments; however, level-3 internal back-ends still do. This is so that the calling function can perform subproblems such as C := C - alpha * A * B on-the-fly without needing to change either of the scalars attached to A or B. - Removed scalar argument from packm_int(). - Observe and apply attached scalars in scalm_int(), and removed scalar from interface of scalm_unb_var1().
- Renamed the following functions (and corresponding invocations): bli_obj_init_scalar_copy_of() -> bli_obj_scalar_init_detached_copy_of() bli_obj_init_scalar() -> bli_obj_scalar_init_detached() bli_obj_create_scalar_with_attached_buffer() -> bli_obj_create_1x1_with_attached_buffer() bli_obj_scalar_equals() -> bli_obj_equals() - Defined new functions: bli_obj_scalar_detach() bli_obj_scalar_attach() bli_obj_scalar_apply_scalar() bli_obj_scalar_reset() bli_obj_scalar_has_nonzero_imag() bli_obj_scalar_equals() - Placed all bli_obj_scalar_* functions in a new file, bli_obj_scalar.c. - Renamed the following macros: bli_obj_scalar_buffer() -> bli_obj_buffer_for_1x1() bli_obj_is_scalar() -> bli_obj_is_1x1() - Defined new macros to set and copy internal scalars between objects: bli_obj_set_internal_scalar() bli_obj_copy_internal_scalar() - In level-3 internal back-ends, added conditional blocks where alpha and beta are checked for non-unit-ness. Those values for alpha and beta are applied to the scalars attached to aliases of A/B/C, as appropriate, before being passed into the variant specified by the control tree. - In level-3 blocked variants, pass BLIS_ONE into subproblems instead of alpha and/or beta. - In level-3 macro-kernels, changed how scalars are obtained. Now, scalars attached to A and B are multiplied together to obtain alpha, while beta is obtained directly from C. - In level-3 front-ends, removed old function calls meant to provide future support for mixed domain/precision. These can be added back later once that functionality is given proper treatment. Also, removed the creating of copy-casts of alpha and beta since typecasting of scalars is now implicitly handled in the internal back-ends when alpha and beta are applied to the attached scalars. commit 992de486d6f23e69a623abd15ae77d7881d13871 Merge: 9552e6ee fd4ac636 Author: Field G. Van Zee Date: Mon Dec 2 13:58:46 2013 -0600 Unimplemented kernels now call reference. 
Details: - Updated arm, bgq, loongson3a, and x86_64 kernels so that unimplemented datatypes call the corresponding reference kernel. Previously, these kernel functions called abort() with a "not yet implemented" error message. commit fd4ac636d9a55cec1476a444bd4e70def219dc8f Author: Field G. Van Zee Date: Mon Dec 2 13:50:36 2013 -0600 Unimplemented kernels now call reference. Details: - Updated micro-kernels for arm, bgq, loongson3a, and x86_64 so that unimplemented kernel functions simply call the corresponding reference implementation. (Previously, these unimplemented functions would abort() with a "not yet implemented" message.) commit 9552e6ee824d4345d5e908e869e071d19829819a Author: Field G. Van Zee Date: Sun Nov 24 11:40:31 2013 -0600 Removed optional scaling from packm control tree. Details: - Removed does_scale field from packm control tree node and bli_packm_cntl_obj_create() interface. Adjusted all invocations of _cntl_obj_create() accordingly. - Redefined/renamed macros that are used in aliasing so that now, bli_obj_alias_to() does a full alias (shallow copy) while bli_obj_alias_for_packing() does a partial alias that preserves the pack_mem-related fields of the aliasing (destination) object. - Removed bli_trmm3_cntl.c, .h after realizing that the trmm control tree will work just fine for bli_trmm3(). - Removed some commented vestiges of the typecasting functionality needed to support heterogeneous datatypes. commit e65c476284db9ef64b23191a21c2584b1083342f Author: Field G. Van Zee Date: Tue Nov 19 10:05:35 2013 -0600 Minor updates to packm_blk_var2.c and _blk_var3.c. Details: - Comment updates to packm_blk_var2.c and packm_blk_var3.c. - In packm_blk_var2(), call setm_unb_var1(), scal2m_unb_var1() directly instead of setm(), scal2m(). commit 9e1d0d4bca48eda54301d8976f203e2544c9df3a Author: Field G. Van Zee Date: Mon Nov 18 18:11:07 2013 -0600 Added trsm_l, trsm_u ukernels for x86_64/core2.
Details: - Added standalone trsm_l/trsm_u micro-kernels for x86_64 (core2). These kernels are based on the gemmtrsm_l/gemmtrsm_u micro-kernels that already existed in kernels/x86_64/core2-sse3/3. commit 85e7e02ea3a9190b6fcff5d46b00d41c79cb1242 Merge: 67761e22 70720054 Author: Field G. Van Zee Date: Mon Nov 18 12:02:00 2013 -0600 Merge branch 'master'. Forgot to git-pull. commit 67761e224c92500eecf9c1540cc72bdd2fb27679 Author: Field G. Van Zee Date: Mon Nov 18 11:57:40 2013 -0600 Attempting to fix errors in bgq build. Details: - Removed restrict declaration from b_cast and c_cast from bli_trsm_lu_ker_var2.c and bli_trsm_rl_ker_var2.c. Curiously, they are causing problems for xlc only in those two files and no other macro-kernels. - Fixed (hopefully) kernel function parameter type declarations in kernels/bgq/1f/bli_axpyf_opt_var1.c and kernels/bgq/3/bli_gemm_8x8.c. commit 707200541d344f98cf34c9801954dbb36fbe0447 Author: Field G. Van Zee Date: Mon Nov 18 11:17:31 2013 -0600 Syntax error fix in x86_64/core2 gemmtrsm_u ukr. commit bbe2b84a49e7785d4d0c514cda34adfbe66478b0 Author: Field G. Van Zee Date: Mon Nov 18 11:11:06 2013 -0600 Updated Makefile in test, testsuite. Details: - Updated Makefiles in test and testsuite directories to use the new BLIS header installation directory scheme, which is to compile with -I/include/blis instead of -I/include. commit 9bd7fcfd436625ca2108128086671319362f4d92 Author: Field G. Van Zee Date: Mon Nov 18 10:58:09 2013 -0600 Outer-to-inner 'restrict' fix in macro-kernels. Details: - Fixed sloppy placement of 'restrict' pointer declarations in level-3 macro-kernels. Previously, all restricted pointers were being declared at the outer-most function scope level. While this violates the C99 standard, very few of the compilers used with BLIS so far have seemed to care. The lone exception has been IBM's xlc. Thanks to Tyler Smith for identifying this bug (and suggesting the fix). commit 50549a6a31dd26cf63a013e0ede16b2c7ce835b6 Author: Field G. 
Van Zee Date: Sun Nov 17 18:31:27 2013 -0600 Changed header install directory to include/blis. Details: - Changed top-level Makefile so that headers are installed to $(INSTALL_PREFIX)/include/blis/. (Header directories are no longer named by version/configuration and then symlinked.) - Added uninstall targets, including uninstall-old to clean out old library archives. - Added GREP makefile definitions to all configurations' make_defs.mk. commit d70733abddfb9a95661897e1e4f3c1f3cfa7cbaa Author: Field G. Van Zee Date: Sat Nov 16 17:34:25 2013 -0600 Added ARM kernels, configurations. Details: - Added kernels for ARM, and configurations for Cortex-A9 and Cortex-A15. Thanks to Francisco Igual for contributing these kernels and configurations. commit d37c2cff62089c86983c2f79762f4b5329037373 Author: Field G. Van Zee Date: Wed Nov 13 10:47:11 2013 -0600 Minor comment and Makefile changes. Details: - Added missing 'check-config' and 'check-make-defs' targets to testsuite/Makefile. - Removed unused 'test' target from top-level Makefile. - Comment changes to testsuite input files. commit 19885f893a17b91ee79bead0620d0f913392d4c5 Author: Field G. Van Zee Date: Mon Nov 11 12:09:21 2013 -0600 Updated some kernel comment headers. Details: - Updated bgq and piledriver comment headers to use BLIS copyright header instead of libflame. commit 1a4d698f42981d74fe5f29b980031e1ee7dc42d5 Author: Field G. Van Zee Date: Mon Nov 11 10:15:40 2013 -0600 CHANGELOG update (for 0.1.0). commit 089048d5895a30221b6b1976c9be93ad6443420d (tag: 0.1.0) Author: Field G. Van Zee Date: Sat Nov 9 17:18:00 2013 -0600 Added object wrappers to 1f test suite modules. Details: - Added missing object wrappers to level-1f test suite modules. This was only apparent if you were configuring with something other than the reference configuration. - Commented out object-wrappers in level-1f front-ends. 
These were not working as intended when the reference configuration was selected, because most kernel sets, such as those in the template set, do not have object wrappers. - Whitespace changes to template micro-kernels. - Comment changes to template level-1f kernel headers. commit 9ef3752079de10124bed906b5d28479d04aa8187 Author: Field G. Van Zee Date: Fri Nov 8 17:20:47 2013 -0600 Updated template kernels wrt KernelsHowTo wiki. Details: - Merged latest state of KernelsHowTo wiki into template micro-kernels located in config/template/kernels/3. commit 376bbb59c8944e29c5c1ff6637920d8451370afa Author: Field G. Van Zee Date: Fri Nov 8 11:17:34 2013 -0600 Removed support for duplication. Details: - Removed support for duplication from the gemmtrsm/trsm micro-kernels and all framework code. - Updated test suite modules according to above changes. commit 68a5910974b62b4df853fae2a68cb04df9d5a19c Author: Field G. Van Zee Date: Thu Nov 7 11:36:11 2013 -0600 Added comments to testsuite/input.operations. Details: - Added extensive comments to the top of testsuite/input.operations, which describe how to edit the file. - Removed input.operations.0 and input.operations.1. - Changed input.general to test all datatypes ("sdcz") by default. commit a98f78b715fb256a519870071bb5266130d70b21 Author: Field G. Van Zee Date: Wed Nov 6 15:32:47 2013 -0600 Changed dim_t and inc_t to be signed integers. Details: - Redefined dim_t and inc_t in terms of gint_t (instead of guint_t). This will facilitate interoperability with Fortran in the future. (Fortran does not support unsigned integers.) - Redefined many instances of stride-related macros so that they return or use the absolute value of the strides, rather than the raw strides which may now be signed. Added new macros bli_is_row_stored_f() and bli_is_col_stored_f(), which assume positive (forward-oriented) strides, and changed the packm_blk_var[23] variants to use these macros instead of the existing bli_is_row_stored(), bli_is_col_stored().
- Added/adjusted typecasting to various functions/macros, including bli_obj_alloc_buffer(), bli_obj_buffer_at_off(), and various pointer-related macros in bli_param_macro_defs.h. - Redefined bli_convert_blas_incv() macro so that the BLAS compatibility layer properly handles situations where vector increments are negative. Thanks to Vladimir Sukharev for pointing out this issue. - Changed type of increment parameters in bli_adjust_strides() from dim_t to inc_t. Likewise in bli_check_matrix_strides(). - Defined bli_check_matrix_object(), which checks for negative strides. - Redefined bli_check_scalar_object() and bli_check_vector_object() so that they also check for negative stride. - Added instances of bli_check_matrix_object() to various operations' _check routines. commit 1f8afc3e08a4312cfe810be86aedeacbc57275c5 Author: Field G. Van Zee Date: Wed Nov 6 10:09:10 2013 -0600 Minor comment update to BLAS compat files. commit 1abbf768afafc158d44e4d5c4a135cfd9e277f13 Author: Field G. Van Zee Date: Mon Nov 4 15:50:00 2013 -0600 Fixed bugs in scalv and setv. Details: - Fixed bugs similar to those addressed in cca1e1f51dc6, whereby a segmentation fault may occur if beta is not the same type as the vector operand for scalv and setv. - Changed axpyv and scal2v front-ends in a similar fashion. commit f5953259a1842ee48e5833c22ac86e68a337bfe1 Author: Field G. Van Zee Date: Mon Nov 4 14:43:55 2013 -0600 Fixed a bug related to Hermitian matrix diagonals. Details: - Fixed a bug whereby BLIS assumed that the imaginary components of the diagonal elements of Hermitian matrices were already zero. This property is now enforced when the matrix is packed (bli_packm_blk_var2). Thanks to Vladimir Sukharev for reporting this bug. - Minor comment updates to template kernels. commit d70f2b089dac8b9e4c19295dfa6014c36afee2ec Author: Field G. Van Zee Date: Sat Nov 2 17:19:40 2013 -0500 Added scaling to abval2s, sqrt2s macros.
Details: - Re-defined abval2s and sqrt2s macros to use scaling to avoid underflow and overflow from squaring the real and imaginary components. (This is the same technique used to fix recent bugs in invscals/invscaljs and inverts.) commit c5b1ed9409ae2f71d04041eef5da9a0080b5784a Author: Field G. Van Zee Date: Fri Nov 1 10:28:04 2013 -0500 Added new dotxaxpyf variant 2. Details: - Added a new variant for dotxaxpyf that is based on dotxf and axpyf kernels. By default, this variant is not used by any other operation. commit 97f89fbcf202d72fc440b614708e352ea31633e2 Author: Field G. Van Zee Date: Fri Nov 1 10:16:39 2013 -0500 Fixed bug in complex invscals. Details: - Fixed complex inversion in invscals and invscaljs whereby the imaginary component was being computed incorrectly. - Use bli_fmaxabs() instead of bli_fabs() when choosing the scalar in inverts, invscals, and invscaljs. - Changed bli_abs() and bli_fabs() macro definitions to use "<=" operator instead of "<". commit eda42a21d17a2742eab69ab801ed530b82488c8a Author: Field G. Van Zee Date: Thu Oct 31 18:00:44 2013 -0500 Defined missing symbols in bla_rotg.c Details: - Defined local equivalents of libf2c's r_sign(), d_sign(), c_abs(), and z_abs(), which are needed by bla_rotg.c. Also defined r_abs() and d_abs() for completeness. Thanks to Vladimir Sukharev for reporting these bugs. commit cca1e1f51dc67a2c3725d5c1837256831aaf70f8 Author: Field G. Van Zee Date: Wed Oct 30 14:39:01 2013 -0500 Fixed bugs in scalm and setm. Details: - Fixed bugs in scalm and setm that resulted in segmentation faults when beta is not the same type as the matrix operand. Thanks to Vladimir Sukharev for reporting this bug. - Changed axpym and scal2m front-ends in a fashion similar to that of scalm and setm; namely, the alpha scalar is copy-cast to the type of the first matrix operand.
- Changed the template and reference configurations' bli_config.h files so that the number of memory allocator blocks of A and B are set based on BLIS_MAX_NUM_THREADS. - Comment updates to bli_obj.c and variable rename in bla_nrm2.c. commit 2807013a4761c2b84b3944de64d23483ad7ef2fb Author: Field G. Van Zee Date: Thu Oct 24 14:32:20 2013 -0500 Fixed over/under-flow in complex inversion. Details: - Fixed the complex bli_?inverts() macros, which were inverting elements in an "unsafe" manner, such that very large and very small values were unnecessarily over/under-flowing. Thanks to Vladimir Sukharev for reporting this bug. - Comment update to bli_sumsqv_unb_var1.c. - Removed redundant bli_min() macro in bli_scalar_macro_defs.h. - Changed 1.0F to 1.0 for bli_drands() macro. commit 45a80c625f84edb2ade6ac25efe2b9c589d7e0df Author: Field G. Van Zee Date: Wed Oct 23 12:15:25 2013 -0500 Fixed parameter checking issue in BLAS syr[2]k. Details: - Fixed a minor parameter checking bug in the BLAS compatibility layer for [sd]syrk and [sd]syr2k. Specifically, if 'C' is passed in for the trans parameter of either operation, it is (a) allowed, and (b) treated as 'T' (whereas previously it was disallowed). Thanks to Vladimir Sukharev for finding and reporting this bug. commit a091a219bda55e56817acd4930c2aa4472e53ba5 Author: Field G. Van Zee Date: Mon Oct 14 10:11:29 2013 -0500 Minor fixes to piledriver configuration, ukernel. Details: - Applied a patch from Tyler that fixes minor staleness in the piledriver configuration and gemm micro-kernel. - Very minor changes to test suite input files. commit dacdde27aee4fb90b14880136d7f20c6b234e2c6 Author: Field G. Van Zee Date: Fri Oct 11 11:37:19 2013 -0500 Added Fran's Sandy Bridge kernels/configuration. Details: - Added a kernel directory for kernels developed by Francisco Igual for the Sandy Bridge architecture, including a dgemm ukernel coded with AVX intrinsics. - Added a configuration for Sandy Bridge using values supplied by Fran.
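Note on commits 2807013a, 97f89fbc, and d70f2b08 above: the following is a minimal sketch, in C, of the scaling idea those messages describe, namely dividing by the larger-magnitude component before forming the denominator so that the real and imaginary parts are never squared at full magnitude. The names dcomplex_sketch, abs_d, and invert_scaled are hypothetical and chosen for illustration; this is not the text of the bli_?inverts() macros.

```c
#include <assert.h>

/* Hypothetical complex type for this sketch only. */
typedef struct { double real; double imag; } dcomplex_sketch;

/* Absolute value without depending on libm. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Compute 1/x for nonzero complex x.
   A naive formula divides by real^2 + imag^2, which overflows or
   underflows for very large or very small components. Here both
   components are first divided by s = max(|real|, |imag|), so the
   denominator is (real^2 + imag^2) / s and stays in range. */
static dcomplex_sketch invert_scaled(dcomplex_sketch x)
{
    double ar = abs_d(x.real);
    double ai = abs_d(x.imag);
    double s  = ar > ai ? ar : ai;
    assert(s > 0.0);                         /* caller must not pass zero */

    double xr = x.real / s;                  /* scaled real part */
    double xi = x.imag / s;                  /* scaled imaginary part */
    double denom = x.real * xr + x.imag * xi;

    dcomplex_sketch y = { xr / denom, -xi / denom };
    return y;
}
```

For x = 1e200 + 1e200i, squaring either component overflows in double precision, while the scaled form keeps every intermediate value finite.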
commit 03106d650e4030d4c9831683448376f92fc52d41 Author: Field G. Van Zee Date: Fri Oct 11 10:40:38 2013 -0500 Fixed minor perf bug in gemm_ker_var2. Details: - Fixed a minor performance bug in bli_gemm_ker_var2.c (and the experimental bli_gemm_ker_var5.c) whereby the addresses for a_next and b_next are not computed correctly (ie: do not wraparound) at the edge cases. Thanks to Tze Meng for helping me identify this bug. commit b053337387dbdef9035be03538222670a21707ca Author: Field G. Van Zee Date: Thu Oct 10 18:26:55 2013 -0500 Added fusing factors, MR/NR to test suite output. Details: - Updated the test suite driver (and modules where appropriate) so that the level-1f fusing factors are output along with the variable dimension. While this is not strictly necessary, since the fusing factors are output in the initial parameter summary, it allows extra reassurance to the user since the fusing factors appear alongside the variable dimension, which together give a complete picture of the problem size. Similar changes were made for outputting the register blocksizes when reporting results for the micro-kernel test modules. commit be4833bd91c5a58d0bfc52daaadf7ba543a77acf Author: Field G. Van Zee Date: Thu Oct 10 14:20:06 2013 -0500 Added test suite modules for level-1f, 3 kernels. Details: - Added test modules in test suite for level-1f kernels and level-3 micro-kernels. (Duplication in the micro-kernels, for now, is NOT supported by these test modules.) - Added section override switches to test suite's input.operations file. - Added obj_t APIs for level-1f front-ends and their unblocked variants to facilitate the level-1f test modules. Also added front-end for dupl operation. - Added obj_t-based check routines for level-1f operations, which are called from the new front-ends mentioned above. - Added query routines for axpyf, dotxf, and dotxaxpyf that return fusing factors as a function of datatype, which is needed by their respective test modules. 
- Whitespace changes to bli_kernel.h of all existing configurations. commit 680188d46bb15b9a1a2867638104939dc77ca2a1 Author: Field G. Van Zee Date: Thu Oct 10 13:23:37 2013 -0500 Cleaned up old test drivers. Details: - Minor updates to old test drivers in preparation for our participation in ACM TOMS's replicated results initiative. commit 3690bdd4f95769c935c410414112102cc3e108b1 Author: Field G. Van Zee Date: Thu Oct 10 11:45:33 2013 -0500 More updates to level-1f kernels for core2-sse3. Details: - Changed types in function signatures to match new prototypes. Meant to include this in previous commit. commit 661d5120cd7071f9b0c5cefc95f99f1361370ade Author: Field G. Van Zee Date: Thu Oct 10 11:27:27 2013 -0500 Fixed outdated fusing factor macros in 1f kernels. Details: - Updated level-1f kernels for x86_64 and bgq to use renamed fusing factor macros. Meant to include this in 5e54f46c. Thanks to Fran for pointing this out. commit 73aa1e9f31d1b2a319c7e711ced6db3f9835c832 Author: Field G. Van Zee Date: Tue Oct 1 17:01:18 2013 -0500 Added section overrides to test suite. Details: - Added new lines of input to the test suite's input.operations file, which allows the user to disable entire sections (levels) of tests. Before this change, the user had to manually disable each operation test's "master switch". (This is why input.operations.0 existed: to allow a more convenient starting point for someone who only wanted to test one or a few operations.) commit 5e54f46ccb76beab892d530b693e07c6bf6db7cf Author: Field G. Van Zee Date: Mon Sep 30 12:58:18 2013 -0500 Added template implementations and other tweaks. Details: - Added a 'template' configuration, which contains stub implementations of the level 1, 1f, and 3 kernels with one datatype implemented in C for each, with lots of in-file comments and documentation. - Modified some variable/parameter names for some 1/1f operations. (e.g. renaming vector length parameter from m to n.)
- Moved level-1f fusing factors from axpyf, dotxf, and dotxaxpyf header files to bli_kernel.h. - Modified test suite to print out fusing factors for axpyf, dotxf, and dotxaxpyf, as well as the default fusing factor (which are all equal in the reference and template implementations). - Cleaned up some sloppiness in the level-1f unb_var1.c files whereby these reference variants were implemented in terms of front-end routines rather than directly in terms of the kernels. (For example, axpy2v was implemented as two calls to axpyv rather than two calls to AXPYV_KERNEL.) - Changed the interface to dotxf so that it matches that of axpyf, in that A is assumed to be m x b_n in both cases, and for dotxf A is actually used as A^T. - Minor variable naming and comment changes to reference micro-kernels in frame/3/gemm/ukernels and frame/3/trsm/ukernels. commit 97aaf220a847363b4da35935eca17790c0ef71f6 Author: Field G. Van Zee Date: Tue Sep 17 10:51:36 2013 -0500 Added new kernels, configurations. Details: - Added various micro-kernels for the following architectures: Intel MIC IBM BG/Q IBM Power7 AMD Piledriver Loongson 3A and reorganized kernels directory. Thanks to Tyler Smith, Mike Kistler, and Xianyi Zhang for contributing these kernels. - Added configurations corresponding to above architectures, and renamed "clarksville" configuration to "dunnington". commit fe979c5a114c877506a5697cdab1fc8cf2bcd303 Author: Field G. Van Zee Date: Fri Sep 13 14:31:53 2013 -0500 Removed default configuration behavior. Details: - Changed the configure script so that it no longer defaults to the reference configuration. This change is being made so that the developer has a firm awareness of which configuration is being used to configure BLIS. Thanks to Mike Kistler and Bryan Marker for this suggested change. commit da77e9614f54f92f703f01e3b9bd67a83280150c Author: Field G. Van Zee Date: Fri Sep 13 12:00:37 2013 -0500 Minor improvements to static memory allocator.
Details: - Expanded on cpp macro definitions from bli_mem.c and relocated them to a new header file, frame/include/bli_mem_pool_macro_defs.h. The expanded functionality includes computing the pool size for each datatype (using that datatype's cache blocksizes) and using the maximum to size the actual pool array. This addresses the somewhat common pitfall whereby a developer updates cache blocksizes in bli_kernel.h for only one datatype (say, single-precision real), while the memory pools are sized using the double-precision real values. Then, when the developer attempts to link to and run a level-3 BLIS routine (e.g. dgemm), the library aborts with a message saying the static memory pool was exhausted. Clearly, this message is misleading when the pool was not sized properly to begin with. - Removed previously disabled code in bli_kernel_macro_defs.h that was meant to check for size consistency among the various cache blocksizes. (Obviously the memory pool size-based solution mentioned above is better.) - Added BLIS_SIZEOF_? cpp macros to bli_type_defs.h. This seemed like a reasonable place to put these constants, rather than further crowd up bli_config.h. - Updated testsuite driver to output memory pool sizes for A, B, and C. - Minor comment updates to bli_config.h. - Removed 'flame' configuration. It was beginning to get out-of-date, and I hadn't used it in months. We can always re-create it later. commit 631f347b7a99cb02757c534fd3ec5f723a2fdb0e Author: Field G. Van Zee Date: Tue Sep 10 17:17:28 2013 -0500 Added ESSL and Accelerate targets to test drivers. Details: - Added ESSL and Accelerate (OS X) targets to standalone test drivers' Makefile in "test" directory. Thanks to Jeff Hammond for suggesting / providing this patch. commit 7ae4d7a41d13ef5f1ceee217c000a5cf77a11128 Author: Field G. Van Zee Date: Tue Sep 10 16:35:12 2013 -0500 Various changes to treatment of integers. 
Details: - Added a new cpp macro in bli_config.h, BLIS_INT_TYPE_SIZE, which can be assigned values of 32, 64, or some other value. The former two result in defining gint_t/guint_t in terms of 32- or 64-bit integers, while the latter causes integers to be defined in terms of a default type (e.g. long int). - Updated bli_config.h in reference and clarksville configurations according to above changes. - Updated test drivers in test and testsuite to avoid type warnings associated with format specifiers not matching the types of their arguments to printf() and scanf(). - Inserted missing #include "bli_system.h" into blis.h (which was slated for inclusion in d141f9eeb6d1). - Added explicit typecasting of dim_t and inc_t to macros in bli_blas_macro_defs.h (which are used in BLAS compatibility layer). - Slight changes to CREDITS and INSTALL files. - Slight tweaks to Windows build system, mostly in the form of switching to Windows-style CRLF newlines for certain files. commit 068437736b41d51a1f5ec47839f059bf58a20413 Author: Field G. Van Zee Date: Mon Sep 9 14:07:58 2013 -0500 Fixed set-but-not-used compiler (gcc) warnings. Details: - Used void-casts of certain variables to appease gcc (and perhaps other compilers) when such variables are only used in the complex instances of the functions. Special thanks to Karl Rupp for suggesting a portable fix for these warnings. commit 6dc85f63dcd5282340c9e00d585e97d70a21edc3 Author: Field G. Van Zee Date: Mon Sep 9 13:48:52 2013 -0500 Small fix to Windows defs.mk makefile fragment. Details: - Commented out a !include statement that was attempting to include a version file that does not yet exist. For now, the version string is hard-coded into defs.mk. commit d141f9eeb6d1de7044b7429adf52d11c6fca620c Author: Field G. Van Zee Date: Mon Sep 9 13:09:16 2013 -0500 Added Windows build system. Details: - Added a 'windows' directory, which contains a Windows build system similar to that of libflame's. 
Thanks to Martin for getting this up and running. - Spun off system header #includes into bli_system.h, which is included in blis.h - Added a Windows section to bli_clock.c (similar to libflame's). commit 9b320e7406fb69e8b61a0085abe2ed89a96bdb68 Author: Field G. Van Zee Date: Mon Sep 9 11:04:46 2013 -0500 Edited bli_?lamch.c to avoid Windows keyword. Details: - Renamed "small" variable to "smnum" to avoid collision with Windows type by the same name. This change is needed in advance of the upcoming Windows build system. commit 9013ad6ff2e9ace35e0cf44c32795c2f3d5be628 Author: Field G. Van Zee Date: Wed Sep 4 13:36:07 2013 -0500 Switched integer typedefs (again) to C types. Details: - Redefined gint_t and guint_t in terms of the standard C types long int and unsigned long int, respectively. - Changed testsuite default max problem size to 500. - Changed testsuite input.operations to use square problems for level-3 operation tests. commit 981a60cfa07abac2e93697dfe12b0f076ab00a38 Author: Field G. Van Zee Date: Wed Sep 4 12:09:11 2013 -0500 Falling back to 32-bit integers for dim_t, etc. Details: - In light of recent segfaulting issues when compiling on 32-bit systems, I've changed the default typedef for gint_t and guint_t from int64_t and uint64_t to int32_t and uint32_t, respectively. - Disabled 64-bit integers in the blas2blis layer for the reference configuration. - Added type sizes of gint_t, guint_t, and the four floating-point datatypes to introductory output of the testsuite. commit b776ddcd4338b34f172ef78da0ac1d771a771ab4 Author: Field G. Van Zee Date: Tue Sep 3 21:58:07 2013 -0500 Applied temp fix to typecasting bug in testsuite. Details: - Applied a temporary fix to the typecasting bug in the testsuite driver. The fix involves casting both numerator and denominator to unsigned long. This fix is more voodoo than science, as I can't be sure why it even works. commit 9ee6e125373869c4213c017ce772c38ecefba103 Author: Field G. 
Van Zee Date: Tue Sep 3 21:53:27 2013 -0500 Changed dimension spec for gemm in testsuite. Details: - Encountered a bizarre typecasting bug whereby the test suite was not computing the proper dimension from the problem size and dimension specification when the latter was set to -3. Will investigate. Thanks to Fran for finding this "bug". commit e8be081e68c385ab44d0fea8dade21d40c200b79 Author: Field G. Van Zee Date: Wed Aug 28 15:52:34 2013 -0500 Generalized matlab and file output in testsuite. Details: - Added a new option in input.general that allows outputting in matlab/octave format so that one can output in matlab format independently from outputting to files. - Adjusted input.operations according to above. - Added input.operations.0 and input.operations.1 with all options disabled and enabled, respectively. commit d352c746e5683037d41b5061dfb5ce08e1d0843b Author: Field G. Van Zee Date: Tue Aug 27 13:41:46 2013 -0500 Added single/real gemm micro-kernel for x86_64. Details: - Added a single-precision real gemm micro-kernel in kernels/x86_64/3/bli_gemm_opt_d4x4.c. - Adjusted the single-precision real register blocksizes in config/clarksville/bli_kernel.h to be 8x4. - Added a missing comment to bli_packm_blk_var2.c that was present in bli_packm_blk_var3.c commit dedda523dc5dc779ecc34e6a03dc74cb8eb220de Author: Field G. Van Zee Date: Mon Aug 19 12:07:41 2013 -0500 Fixed bug in bli_acquire_mpart_t2b(), _l2r(). Details: - Fixed a bug in bli_acquire_mpart_t2b() and bli_acquire_mpart_l2r() that caused incorrect partitioning when SUBPART0 was requested. This bug was introduced in 46d3d09d49ad. Thanks to Bryan for isolating this bug. - Removed dupl kernels from kernels/x86_64/3 directory. - Uncommented beta == 0 optimization code in kernels/x86_64/3/bli_gemm_opt_d4x4.c. commit 12dbd2f33455e9384fe2070cbdd660fd4a7fceb5 Author: Field G. Van Zee Date: Thu Aug 8 14:39:35 2013 -0500 Moved init_safe(), finalize_safe() to BLAS compat.
Details: - Moved the bli_init_safe() and bli_finalize_safe() function calls from the BLAS-like BLIS layer to the BLAS compatibility layer. Having these auto- initializers in the BLIS layer wasn't buying us anything because the user could still call the library with uninitialized global scalar constants, for example. Thus, we will just have to live with the constraint that bli_init() MUST be called before calling ANY routine with a bli_ prefix. - Added the missing _init_safe() and _finalize_safe() calls to the level-1 BLAS compatibility wrappers. commit 8abfe55f2ae5d89df18e1b26a5a28d94b0936683 Author: Field G. Van Zee Date: Thu Aug 8 13:30:19 2013 -0500 Miscellaneous updates. Details: - Changed the BLIS_HEAP_STRIDE_ALIGN_SIZE in the configurations from 16 to BLIS_CACHE_LINE_SIZE (typically 64). - Changed the use of nr in sizing of bd buffer to packnr in level-3 macro- kernels. - Reformulated gemm_ker_var2 to look more like the other level-3 macro- kernels, in that the interior and edge-case handling is expressed once inside the loops in the n and m dimensions, rather than the edge-case handling being "unrolled" and expressed as distinct code regions. The previous macro-kernel now lives in retired form in the subdirectory other/bli_gemm_ker_var2.c.old. - Updated experimental gemm_ker_var5 according to above change. - Fixed bug in bli_her2k.c whereby incorrect transformations were being applied to optimize the macro-kernel access pattern on C when C is row-stored. - Various updates inside of test/exec_sizes. commit 1aa05736ff49e7cc5f121acf615460fe9a87852c Author: Field G. Van Zee Date: Wed Aug 7 12:27:04 2013 -0500 Fixed bug in interface of bla_ger_check(). Details: - Fixed the misplaced lda parameter in the function signature of bla_ger_check(). Thanks to Tyler for finding this bug. commit 685aad25353fb200de4ca97a8bc0feeebde51d0f Author: Field G. Van Zee Date: Tue Aug 6 12:25:51 2013 -0500 Fixed cpp guard typos in frame/compat/check files.
Details: - Fixed instances of BLIS_ENABLE_BLIS2BLAS that should have been BLIS_ENABLE_BLAS2BLIS. Thanks to Tyler for catching this. - Fixed various syntax errors in the code that had yet to be compiled due to the aforementioned bug. commit f4ec28e723d28d998f1038f82da6986e44320ef6 Author: Field G. Van Zee Date: Thu Aug 1 11:24:23 2013 -0500 Added basic OpenMP-based gemm and packm files. Details: - Integrated Tyler's parallelized packm_blk_var2 and gemm_ker_var2 into the following auxiliary files frame/1m/packm/other/bli_packm_blk_var2.c frame/3/gemm/other/bli_gemm_ker_var2.c The routine in the first file uses a basic OpenMP parallel region to parallelize the packing of blocks of A and panels of B, while the second uses a similar parallel region to parallelize along the n dimension of the gemm macro-kernel. commit f8980edf9c318453bb1962ac4939c06bf11e6d5e Merge: 67a8b949 6e7e4523 Author: Field G. Van Zee Date: Fri Jul 26 11:14:27 2013 -0500 Merge branch 'master' of https://code.google.com/p/blis commit 67a8b9498d13b038deb316ac163e62c5b17da2ec Author: Field G. Van Zee Date: Fri Jul 26 11:12:37 2013 -0500 Added missing cpp kernel blocksize constraints. Details: - Added missing C preprocessor guards in bli_kernel_macro_defs.h that enforce constraints on the register blocksizes relative to the cache blocksizes. Thanks to Tyler for helping me stumble across this issue. commit 6e7e452343014e8f86640874dc1dbadca4a642a1 Author: Field G. Van Zee Date: Mon Jul 22 14:50:57 2013 -0500 Fixed minor warnings and misc issues. Details: - Fixed various warnings output by gcc 4.6.3-1, including removing some set-but-not-used variables and addressing some instances of typecasting of pointer types to integer types of different sizes. commit 03f6c3599743bc837a7d40eb5b415b1bf4f2a4e9 Author: Field G. Van Zee Date: Mon Jul 22 12:54:32 2013 -0500 Tightened some macros that detect datatypes. 
Details: - Modified the definitions of some macros, such as bli_is_real(), so that the "special" bit is taken into account so that BLIS_INT is differentiated from BLIS_FLOAT. - Whitespace changes to bli_obj_macro_defs.h. - Removed BLIS_SPECIAL_BIT definition from bli_type_defs.h, since it wasn't being used. commit b33e2f4443b9043b554963320280ff7783773652 Author: Field G. Van Zee Date: Fri Jul 19 17:15:03 2013 -0500 CHANGELOG update (for 0.0.9). commit 0680916fdd532f7a4716b11a2515243b2c08d00f (tag: 0.0.9) Author: Field G. Van Zee Date: Thu Jul 18 18:04:34 2013 -0500 Added BLAS error checking to compatibility layer. Details: - Added frame/compat/check directory, which now houses companion _check() routines for each of the BLAS wrappers in frame/compat. These _check() routines are called from the compatibility wrappers and mimic the error-checking present in the netlib BLAS. - Edited bla_xerbla.c so that xerbla() translates the operation string to uppercase before printing. - Redefined util routines in frame/compat/f2c/util in terms of level0 macros. - Added prototypes for util routines, f2c routines, lsame(), and xerbla(). - Commented out prototypes in test/test_*.c since Fortran integers are now int64_t by default (and the prototypes that were present in the files used int). - Removed redundant #include "bli_f2c.h" in bli_?lamch.c and bli_lsame.c, since blis.h was already being included. - Other minor changes to code in frame/compat/f2c. commit 4e80ad28c97273db3366428ec44020da7944964d Author: Field G. Van Zee Date: Thu Jul 18 17:53:31 2013 -0500 Added support for C99 complex types/arithmetic. Details: - Added support for C99 complex types to bli_type_defs.h and overloaded complex arithmetic to the scalar-level macros in include/level0. This includes a somewhat substantial reorganization and re-layering of much of the existing machinery present in the level0 macros. 
- Added new #define for BLIS_ENABLE_C99_COMPLEX to bli_config.h files, commented-out by default, which optionally enables the use of built-in C99 complex types and arithmetic. - Minor changes to clarksville and reference configs' make_defs.mk files. - Removed a macro definition from bli_param_macro_defs.h that was not being used (bli_proj_dt_to_real_if_imag_eq0). commit 6072d7c848e837ba20d607f7b727438ada31bdcf Author: Field G. Van Zee Date: Wed Jul 17 12:27:45 2013 -0500 Fixed bugs in trsm, trmm macro-kernels. Details: - Fixed a bug in trsm_rl_ker_var2() caused by incorrect edge case handling. - Fixed a bug in trsm_rl_ker_var2() and trsm_ru_ker_var2() whereby k was incorrectly being adjusted upward by MR, instead of NR. The rl and ru trmm macro-kernels were updated in a similar fashion. - Fixed a bug in trsm_ru_ker_var2() that was due to a missing negation on diagoffb when recomputing k to skip a zero region below where the diagonal intersects the right side of the block. The corresponding trmm macro-kernel was also updated. - Fixed a bug in trsm_ru_ker_var2() where the adjustment of k (by NR) needed to be placed AFTER the block that recomputes k to skip the zero region (if present). The other three trsm macro-kernels, as well as the trmm macro-kernels, were updated in the same manner, for consistency. - Fixed a bug in trmm_lu_ker_var2() in which the wrong dimension (n) was being updated to skip a zero region to the left of where the diagonal of A intersects the top edge of the block. - Comment updates to all trsm and trmm macro-kernels. - Comment updates to bli_packm_init.c. commit 47410a48f9b91e94ce4c67633686ffd1f2ad0275 Author: Field G. Van Zee Date: Wed Jul 10 14:53:59 2013 -0500 Added f2c'ed Givens rotation wrappers. Details: - Retired (for now) existing ?rot*() BLAS compatibility wrappers to 'attic' along with other wrappers for which no BLIS implementation exists.
- Added f2c-generated codes for applicable datatype flavors of rot, rotg, rotm, and rotmg operations. commit e5f90f3a8dbe671104bcb9d8b4e3409de01805da Author: Field G. Van Zee Date: Wed Jul 10 13:40:12 2013 -0500 Removed copynz defs from bli_kernel.h files. Details: - Removed COPYNZ_KERNEL definition from the bli_kernel.h files in each configuration. (Meant to include this in previous commit.) commit aec12d90f596e8c04b1ad178258a1cd38108f59d Author: Field G. Van Zee Date: Wed Jul 10 13:33:30 2013 -0500 Removed copynzv, copynzm and related codes. Details: - Removed copynzv and copynzm operation directories. These operations implemented a variation of copyv/m that, in the case of real source and complex destination operands, leaves the imaginary component untouched (rather than setting it to zero). I realize now that the special case(s) (e.g. gemm with real A and B but complex C) that I thought required this operation actually can be handled more simply. - Removed level0 scalar macros implementing copynzs, copynzjs. commit b0a0a0f274a761788531b5d281cc3b411b7124ed Author: Field G. Van Zee Date: Tue Jul 9 17:15:38 2013 -0500 Added handling of restrict, stdint.h for non-C99. Details: - Removed the #include <stdint.h> from blis.h and inserted a cpp macro block in bli_type_defs.h that #includes <stdint.h> for C++ and C99, and otherwise manually typedefs the types we need (which, for now, are unconditionally int64_t and uint64_t). - Moved basic typedefs to top of bli_type_defs.h, and comment changes. - Added cpp macro block to bli_macro_defs.h that #defines restrict as nothing for C++ and non-C99. commit 4b7e7970f1af4a1ab121e07657e2b78b9fcd7671 Author: Field G. Van Zee Date: Mon Jul 8 15:20:34 2013 -0500 Migrated integer usage to stdint.h types. Details: - Changed the way bli_type_defs.h defines integer types so that dim_t, inc_t, doff_t, etc. are all defined in terms of gint_t (general signed integer) or guint_t (general unsigned integer).
- Renamed Fortran types fchar and fint to f77_char and f77_int. - Define f77_int as int64_t if a new configuration variable, BLIS_ENABLE_BLIS2BLAS_INT64, is defined, and int32_t otherwise. These types are defined in stdint.h, which is now included in blis.h. - Renamed "complex" type in f2c files to "singlecomplex" and typedef'ed in terms of scomplex. - Renamed "char" type in f2c files to "character" and typedef'ed in terms of char. - Updated bla_amax() wrappers so that the return type is defined directly as f77_int, rather than letting the prototype-generating macro decide the type. This was the only use of GENTFUNC2I/GENTPROT2I-related macros, so I removed them. Also, changed the body of the wrapper so that a gint_t is passed into abmaxv, which is THEN typecast to an f77_int before returning the value. - Updated f2c code that accessed .r and .i fields of complex and doublecomplex types so that they use .real and .imag instead (now that we are using scomplex and dcomplex). commit 372501398564fdba3d5a3db86c30bc1039b185ff Author: Field G. Van Zee Date: Mon Jul 8 11:24:18 2013 -0500 Added experimental bli_gemm_ker_var5(). Details: - Added support for an experimental gemm macro-kernel that incrementally packs one micro-panel of B at a time. This is useful for certain special cases of gemm where m is small. - Minor changes to default values of clarksville configuration. - Defined BLIS_PACKED_BLOCKS as part of pack_t type, even though we do not yet have any use (or implementation support) for block storage. - Comment update to bli_packm_init.c. commit 9915d667a79f23e3a2a2516247c560e9063a1646 Author: Field G. Van Zee Date: Sun Jul 7 13:28:39 2013 -0500 Defined "total" blocksize query functions. Details: - Defined bli_blksz_total_for_type() and bli_blksz_total_for_obj() to query the default blocksize plus blocksize extension (using the type or the type of an object). - Comment update in bli_packm_cxk.c. commit 46d3d09d49aded1d9f1b468c83fce75e07d631dc Author: Field G.
Van Zee Date: Thu Jun 27 13:19:56 2013 -0500 Consolidated lower/upper her[2]k blocked variants. Details: - Consolidated lower and upper blocked variants for herk and her2k, and renamed the resulting variants, according to the same changes recently made to trmm and trsm. - Implemented support for four new subpartition types: BLIS_SUBPART1T BLIS_SUBPART1B BLIS_SUBPART1L BLIS_SUBPART1R which correspond to "merged" partitions that include the middle "1" partition as well as either the neighboring "0" or "2" partition. This is used to clean up code in herk/her2k var2 that attempts to partition away the strictly zero region above or below the diagonal of a matrix operand that is being marched through diagonally. - Added safeguards to herk macro-kernels that skip any leading or trailing zero region in the panel of C that is passed in. This is now needed given that herk/her2k var1 no longer partitions off this zero region before calling the macro-kernel (via bli_her[2]k_int()). - Updated comments and other whitespace changes to trmm/trsm macro-kernels. commit 02002ef6f3d2746665982793db36714bd69bccc9 Author: Field G. Van Zee Date: Mon Jun 24 17:08:14 2013 -0500 Added row-storage optimizations for trmm, trsm. Details: - Implemented algorithmic optimizations for trmm and trsm whereby the right side case is now handled explicitly, rather than induced indirectly by transposing and swapping strides on operands. This allows us to walk through the output matrix with favorable access patterns no matter how it is stored, for all parameter combinations. - Renamed trmm and trsm blocked variants so that there is no longer a lower/upper distinction. Instead, we simply label the variants by which dimension is partitioned and whether the variant marches forwards or backwards through the corresponding partitioned operands. - Added support for row-stored packing of lower and upper triangular matrices (as provided by bli_packm_blk_var3.c).
- Fixed a performance bug in bli_determine_blocksize_b() whereby the cache blocksize extensions (if non-zero) were not being used to appropriately size the first iteration (ie: the bottom/right edge case). - Updated comments in bli_kernel.h to indicate that both MC and NC must be whole multiples of MR AND NR. This is needed for the case of trsm_r where, in order to reuse existing left-side gemmtrsm fused micro-kernels, the packing of A (left-hand operand) and B (right-hand operand) is done with NR and MR, respectively (instead of MR and NR). commit d1e81ddc848ee47bc188735883d14582bdd0cabc Author: Field G. Van Zee Date: Thu Jun 13 11:14:21 2013 -0500 Minor generalizing tweaks to trmm blk var1, var2. commit 0efb7974f104206ba3985276f2180a9b14fe9f9b Author: Field G. Van Zee Date: Wed Jun 12 16:40:04 2013 -0500 CHANGELOG update. commit 5b641c3bab31eac6a1795b9f6e3f86c59651ca50 (tag: 0.0.8) Author: Field G. Van Zee Date: Wed Jun 12 16:02:12 2013 -0500 Use separate CFLAGS for "kernels" directories. Details: - Added a new "special" directory type: any source code within directories named "kernels" will be compiled with a separate CFLAGS_KERNELS set of compiler flags. This allows the developer to specify a separate set of flags (e.g. optimization flags) for compiling kernels while maintaining a standard set for regular framework code. - Fixed a bug in the top-level Makefile that was causing "noopt" code to be compiled with the standard set of compilation flags. - Updated make_defs.mk in reference, flame, and clarksville configurations according to above changes. commit 08475e7c7653ba598665071a617d10f0d8f763c2 Author: Field G. Van Zee Date: Tue Jun 11 12:18:39 2013 -0500 Various level-3 optimizations for row storage. Details: - Implemented remaining two cases within bli_packm_blk_var2(), which allow packing from a lower or upper-stored symmetric/Hermitian matrix to column panels (which are row-stored). Previously one could only pack to row panels (which are column-stored). 
- Implemented various optimizations in the level-3 front-ends that allow more favorable access through row-stored matrices for gemm, hemm, herk, her2k, symm, syrk, and syr2k. - Cleaned up code in level-3 front-ends that has to do with setting target and execution datatypes. commit 05a657a6b92e8d34efa5c57ae6a18a4f35ec0841 Author: Field G. Van Zee Date: Fri Jun 7 11:04:10 2013 -0500 Added beta == 0 optimization to x86_64 ukernel. Details: - Modified x86_64 gemm microkernel so that when beta is zero, C is not read from memory (nor scaled by beta). - Fixed minor bug in test suite driver when "Test all combinations of storage schemes?" switch is disabled, which would result in redundant tests being executed for matrix-only (e.g. level-1m, level-3) operations if multiple vector storage schemes were specified. - Restored debug flags as default in clarksville configuration. commit f1aa6b81cc421516dd77dd0f18f7c432724e6ef2 Author: Field G. Van Zee Date: Thu Jun 6 13:36:06 2013 -0500 Whitespace changes to old test drivers. Details: - Replaced tabs with four spaces in places where indentation was already in place. commit 9feb4c23d2e36f3d8b5417a3802c69f94b29f749 Author: Field G. Van Zee Date: Tue Jun 4 14:57:46 2013 -0500 Fixed unaligned handling in axpyf, dotxaxpyf. Details: - Fixed over-cautious handling of unaligned operands in vector intrinsic implementation of axpyf kernel. - Fixed over- and under-cautious handling of unaligned operands in vector intrinsic implementation of dotxaxpyf kernel. commit 22b06cfcd2e3205c8325a246c2279e4b1047c066 Author: Field G. Van Zee Date: Mon Jun 3 16:54:52 2013 -0500 Updated level-1/-1f [vector intrinsic] kernels. Details: - Updated level-1/-1f kernels so that non-unit and un-aligned cases are handled by reference implementation (rather than aborted). - Added -fomit-frame-pointer to default make_defs.mk for clarksville configuration. - Defined bli_offset_from_alignment() macro. - Minor edits to old test drivers.
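Note on commit 22b06cfc: the fallback described there (send non-unit-stride or unaligned operands to the reference implementation rather than aborting) can be illustrated with the rough sketch below. All names here (offset_from_alignment, axpyv_ref, axpyv_fast, axpyv) are hypothetical stand-ins and are not the BLIS kernels or the bli_offset_from_alignment() macro itself; the 16-byte alignment requirement is likewise an assumption, and the "fast" path is plain C so that the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes ptr lies past the previous
   multiple of align. Zero means ptr is aligned. */
static size_t offset_from_alignment(const void *ptr, size_t align)
{
    assert(align != 0);
    return (size_t)((uintptr_t)ptr % align);
}

/* Reference path: works for any stride and any alignment. */
static void axpyv_ref(size_t n, double alpha,
                      const double *x, ptrdiff_t incx,
                      double *y, ptrdiff_t incy)
{
    for (size_t i = 0; i < n; ++i)
        y[(ptrdiff_t)i * incy] += alpha * x[(ptrdiff_t)i * incx];
}

/* Stand-in for a vector-intrinsic path. It assumes unit stride and
   (by assumption in this sketch) 16-byte aligned operands. */
static void axpyv_fast(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {   /* two doubles per 16-byte vector */
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
    }
    for (; i < n; ++i)             /* leftover element, if any */
        y[i] += alpha * x[i];
}

/* Dispatcher: returns 1 if the fast path ran, 0 if the reference
   path handled the call. It never aborts on awkward operands. */
static int axpyv(size_t n, double alpha,
                 const double *x, ptrdiff_t incx,
                 double *y, ptrdiff_t incy)
{
    int use_ref = incx != 1 || incy != 1
               || offset_from_alignment(x, 16) != 0
               || offset_from_alignment(y, 16) != 0;
    if (use_ref) {
        axpyv_ref(n, alpha, x, incx, y, incy);
        return 0;
    }
    axpyv_fast(n, alpha, x, y);
    return 1;
}
```

Both paths compute the same result, so a caller only sees a difference in speed, not in behavior.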
commit 0288c827d3659bb225ac9c10f168b623ed0106a2 Author: Field G. Van Zee Date: Sat Jun 1 08:02:23 2013 -0500 Updated ukernels for x86_64. Details: - Tweaked micro-kernels and configuration for clarksville. - Updated/cleaned up old test drivers in test directory. - Fixed syntax bug in trsv_unb_var1 and trsv_unf_var1 (introduced recently). commit 85a6d1c9a52c2b27c71a3a3e341c51d7ba263749 Author: Field G. Van Zee Date: Mon May 6 11:05:08 2013 -0500 Replaced axpys usage with subs in trsv. Details: - Replaced instances of axpys with alpha equal to -1 with subs. - Use BLIS_MAX_TYPE_SIZE to define BLIS_CONSTANT_SLOT_SIZE instead of sizeof(dcomplex). commit 2d9c667f3c48a12cab64e5ad09d5fcb9f4c19d78 Author: Field G. Van Zee Date: Fri May 24 16:28:10 2013 -0500 Fixed x86_64 kernel bugs and other minor issues. Details: - Fixed bugs in trmv_l and trsv_u due to backwards iteration resulting in unaligned subpartitions. We were already going out of our way a bit to handle edge cases in the first iteration for blocked variants, and this was simply the unblocked-fused extension of that idea. - Fixed control tree handling in her/her2/syr/syr2 that was not taking into account how the choice of variant needed to be altered for upper-stored matrices (given that only lower-stored algorithms are explicitly implemented). - Added bli_determine_blocksize_dim_f(), bli_determine_blocksize_dim_b() macros to provide inlined versions of bli_determine_blocksize_[fb]() for use by unblocked-fused variants. - Integrated new blocksize_dim macros into gemv/hemv unf variants for consistency with that of the bugfix for trmv/trsv (both of which now use the same macros). - Modified bli_obj_vector_inc() so that 1 is returned if the object is a vector of length 1 (ie: 1 x 1). This fixes a bug whereby under certain conditions (e.g. 
dotv_opt_var1), an invalid increment was returned, which was invalid only because the code was expecting 1 (for purposes of performing contiguous vector loads) but got a value greater than 1 because the column stride of the object (e.g. rho) was inflated for alignment purposes (albeit unnecessarily since there is only one element in the object). - Replaced some old invocations of set0 with set0s. - Added alpha parameter to gemmtrsm ukernels for x86_64 and use accordingly. - Fixed increment bug in cleanup loop of gemm ukernel for x86_64. - Added safeguard to test modules so that testing a problem with a zero dimension does not result in a failure. - Tweaked handling of zero dimensions in level-2 and level-3 operations' internal back-ends to correctly handle cases where output operand still needs to be scaled (e.g. by beta, in the case of gemm with k = 0). commit d57ec42b34f8447c88adeffa95cf22f8c115ad51 Author: Field G. Van Zee Date: Fri May 3 17:35:32 2013 -0500 Renamed _trans_status() macro. Details: - Mistakenly forgot to rename the _trans_status() macro and instances in previous commit. commit 9e2b227866af429a4a6fb7dbb8c457bbdda2f136 Author: Field G. Van Zee Date: Fri May 3 17:24:58 2013 -0500 Renamed _set_trans(), _trans_status() macros. Details: - Renamed the following macros: bli_obj_set_trans() -> bli_obj_set_onlytrans() bli_obj_trans_status() -> bli_obj_onlytrans_status() to remove ambiguity as to which bits are read/updated. commit 2f8174509ea9f844db11ebd9389de5168e85b132 Author: Field G. Van Zee Date: Wed May 1 15:06:30 2013 -0500 Unconditionally check memory pool(s) for errors. Details: - Changed bli_mem_acquire_m() in bli_mem.c so that we still check if the memory pool is exhausted before checking out and returning a block, even if BLIS error checking has been disabled. These errors are useful because they likely indicate that BLIS was improperly configured for the code being run. commit 75405a2b83679b6aff38d7e7425199d623a7b0a9 Author: Field G. 
Van Zee Date: Wed May 1 15:00:30 2013 -0500 CHANGELOG update. commit 6bfa96f84887dec0b4cf8be5d38dd634c2f8951d (tag: 0.0.7) Author: Field G. Van Zee Date: Tue Apr 30 19:35:54 2013 -0500 Absorbed blocksize extensions into main objects. Details: - Revamped some parts of commit b6ef84fad1c9 by adding blocksize extension fields to the blksz_t object rather than have them as separate structs. - Updated all packm interfaces/invocations according to above change. - Generalized bli_determine_blocksize_?() so that edge case optimization happens if and only if cache blocksizes are created with non-zero extensions. - Updated comments in bli_kernel.h files to indicate that the edge case blocksize extension mechanism is now available for use. commit bc7c8005cedbe50961ac2a99aeeabf4e9f9a8e9e Author: Field G. Van Zee Date: Thu Apr 25 17:16:59 2013 -0500 Added option to disable err checking in testsuite. Details: - Added a new line to input.general that allows one to specify the error- checking level to use for each BLIS experiment. The only two levels supported for now are "no error checking" and "full error checking". commit 096b366ddcfe386f44419ef84d8df8be13825f86 Author: Field G. Van Zee Date: Thu Apr 25 16:43:43 2013 -0500 Use cntl trees that block in n dimension. Details: - Updated _cntl.c files for each level-3 operation to induce blocked algorithms that first partition in the n dimension with a blocksize of NC. Typically this is not an issue since only very large problems exceed that of NC. But developers often run very large problems, and so this extra blocking should be the default. - Removed some recently introduced but now unused macros from bli_param_macro_defs.h. commit b6e24b23cb4dfc488c1c9c70d596539c2287f72e Author: Field G. Van Zee Date: Thu Apr 25 12:06:12 2013 -0500 Use PASTEMAC in macro-kernels (over MAC2 or MAC3). Details: - Replaced multi-type invocations of copys_mxn, xpbys_mxn, etc. (PASTEMAC2 and PASTEMAC3) with those that only use a single type (PASTEMAC).
- Added extra macros to bli_adds_mxn_uplo.h and bli_xpbys_mxn_uplo.h to accommodate above change. - Fixed comment typo in bli_config.h files. - Added .nfs* pattern to .gitignore. commit df80acf517dde180ddcc5835c6136b2fa7556d4b Author: Field G. Van Zee Date: Tue Apr 23 19:43:23 2013 -0500 Fixed computation of b_next in L3 macro-kernels. Details: - Restructured herk_l and herk_u macro-kernels in the image of trmm and trsm, in that the edge cases are captured by the main loop, rather than trying to have "cleanup" sections that result in four distinct parts (interior, bottom edge, right edge, bottom-right edge) of the code. - Fixed the way b_next was being computed in the non-gemm level-3 macro-kernels (herk, trmm, trsm). The way they are computed now matches that of gemm. commit 3671528cf8efe4b445d196665143a5c50c2c6048 Author: Field G. Van Zee Date: Tue Apr 23 19:12:14 2013 -0500 Fixed minor bug in computing b_next in gemm. commit db072a5b4a039a9a668ef951333ecfb5bd3a74b9 Author: Field G. Van Zee Date: Tue Apr 23 17:49:10 2013 -0500 Fixed rare edge case bug in herk_l macro-kernel. Details: - Fixed a potential bug in herk_l at the m_left edge case. If MR was chosen to be much larger than NR, then one could encounter edge cases in the MC dimension that fall entirely below the diagonal, which the previous implementation of the herk_l macro-kernel was not allowing for. commit 1dab11e37d1cb403cbe75b73a644c00de534f104 Author: Field G. Van Zee Date: Tue Apr 23 17:17:11 2013 -0500 Updated x86 gemmtrsm ukernels to use alpha. commit 9d10d7dd9bc92a993fea7162bfa5983f75506f49 Author: Field G. Van Zee Date: Tue Apr 23 16:00:18 2013 -0500 Added a_next, b_next arguments to micro-kernels. Details: - Added two more arguments to the gemm and gemmtrsm microkernels: the addresses of the next micro-panels of A and B.
By passing these pointers into the micro-kernel, we allow the micro-kernel author to prefetch micro-panels of A and B as necessary (though this is completely optional; these addresses may also be safely ignored). - Updated all seven macro-kernels so that they compute and pass in a_next and b_next. Note that ONLY the gemm macro-kernel computes a_next and b_next with the precise semantics we want. I will go back and fix the other macro-kernels in the near future. - Added 'restrict' to various micro-kernels from which it was missing. commit f3815dc84d385c514a5acaf1e925424a57be2f51 Author: Field G. Van Zee Date: Tue Apr 23 11:12:33 2013 -0500 Added code for backward edge-case blocking. Disabled: - Edited bli_determine_blocksize_b() to include experimental (and currently disabled) code that computes extended blocks. - Updated comments related to above changes. - Enabled use of x86 gemmtrsm ukernel in config/flame/bli_kernel.h. commit 4fe1435f20e8fc7dd72f795ac58c8e236e6c631b Author: Field G. Van Zee Date: Mon Apr 22 19:00:43 2013 -0500 Updated dupl implementation to use PACKNR and NR. Details: - Updated frame/util/dupl/bli_dupl_unb_var1.c to utilize PACKNR and NR explicitly to navigate b1 so that situations where PACKNR > NR are supported. - Moved the 4x2 and 4x4 reference micro-kernels in frame/3/gemm/ukernels and frame/3/trsm/ukernels to kernels/c99/. - Updated clarksville and flame configurations. commit 2d6f9e83799a46d52d7901e275f8fd67f0a0edc6 Author: Field G. Van Zee Date: Sun Apr 21 15:10:34 2013 -0500 Disabled blocksize checks for memory pools. Details: - Temporarily disabled checks that ensure that enough memory will be allocated by the contiguous memory allocator for all types, given that the values for double precision real are the ones used to allocate the space. These checks can easily go awry in certain situations, especially if you are developing for only one datatype. So for now, they are probably more trouble than they are worth.
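Note on commit 9d10d7dd: the a_next and b_next micro-kernel arguments described there might look roughly like the sketch below. This is an illustration only, not the BLIS micro-kernel interface: the function name gemm_ukr_sketch, the 2 x 2 register blocksizes, and the argument order are assumptions made for the example. The prefetch uses the GCC/Clang builtin where available and otherwise simply ignores the two addresses, which the commit message says is always safe.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 2, NR = 2 };   /* illustrative register blocksizes */

/* Hypothetical reference micro-kernel computing
       C := beta * C + alpha * A * B
   for an MR x NR block of C, where
     a holds k columns of MR contiguous elements (packed micro-panel of A),
     b holds k rows of NR contiguous elements (packed micro-panel of B),
     a_next and b_next point at the micro-panels the caller will use next. */
static void gemm_ukr_sketch(size_t k, double alpha,
                            const double *restrict a,
                            const double *restrict b,
                            double beta,
                            double *restrict c,
                            ptrdiff_t rs_c, ptrdiff_t cs_c,
                            const double *a_next,
                            const double *b_next)
{
    assert(a != NULL && b != NULL && c != NULL);

#if defined(__GNUC__)
    /* Optional hint: start pulling the next micro-panels into cache. */
    __builtin_prefetch(a_next);
    __builtin_prefetch(b_next);
#else
    /* The addresses may be safely ignored. */
    (void)a_next;
    (void)b_next;
#endif

    double ab[MR * NR] = {0.0};   /* accumulator, column-major */

    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                ab[i + j * MR] += a[p * MR + i] * b[p * NR + j];

    for (size_t j = 0; j < NR; ++j)
        for (size_t i = 0; i < MR; ++i) {
            double *cij = c + (ptrdiff_t)i * rs_c + (ptrdiff_t)j * cs_c;
            *cij = beta * (*cij) + alpha * ab[i + j * MR];
        }
}
```

A macro-kernel looping over micro-panels would pass the addresses for its next iteration; how those addresses are chosen at the last iteration is left to the caller in this sketch.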
commit b6ef84fad1c9884c84b7f1350a0bcdfe1737e8f2 Author: Field G. Van Zee Date: Sun Apr 21 15:00:24 2013 -0500 Allow ldim of packed micro-panels != MR, NR. Details: - Made substantial changes throughout the framework to decouple the leading dimension (row or column stride) used within each packed micro-panel from the corresponding register blocksize. It appears advantageous on some systems to use, for example, packed micro-panels of A where the column stride is greater than MR (whereas previously it was always equal to MR). - Changes include: - Added BLIS_EXTEND_[MNK]R_? macros, which specify how much extra padding to use when packing micro-panels of A and B. - Adjusted all packing routines and macro-kernels to use PACKMR and PACKNR where appropriate, instead of MR and NR. - Added pd field (panel dimension) to obj_t. - New interface to bli_packm_cntl_obj_create(). - Renamed bli_obj_packed_length()/_width() macros to bli_obj_padded_length()/_width(). - Removed local #defines for cache/register blocksizes in level-3 *_cntl.c. - Print out new cache and register blocksize extensions in test suite. - Also added new BLIS_EXTEND_[MNK]C_? macros for future use in using a larger blocksize for edge cases, which can improve performance at the margins. commit 59fca58dbe678d79c1df0916b022afbeac7c48fa Author: Field G. Van Zee Date: Fri Apr 19 15:26:29 2013 -0500 Fixed bug in compatibility layer (her2k/syr2k). Details: - Fixed a bug in the BLAS compatibility layer, specifically in bla_her2k.c and bla_syr2k.c, that caused incorrect computation to occur when the BLAS interface caller requests the [conjugate-]transpose case. Thanks to Bryan Marker for reporting the behavior that led to this bug. commit 09eacbd1ab1380a95a0e9625726b45e43ed102d6 Author: Field G. Van Zee Date: Thu Apr 18 19:39:13 2013 -0500 Changed old level3 test drivers to call front-ends. 
Details: - Changed old level-3 test drivers, in 'test' directory, to always call the front-end object API instead of the internal back-end with the locally defined control tree. commit 83e45de23e565138b8fde06fb11cfedc973b7246 Author: Field G. Van Zee Date: Thu Apr 18 18:33:03 2013 -0500 Allow packm_init() to reacquire a too-small mem_t. Details: - Changed bli_packm_init() to react differently to a situation where a pack obj_t has an already-allocated mem_t entry that has a buffer that is smaller than what will be needed to hold the block/panel that now needs to be packed. Previously, this situation was treated with an abort() since I assumed something was horribly wrong. I have changed the code so that it now reacts by releasing the previous mem_t and re-acquiring a new mem_t with the new information. (This change was done at the request of Bryan Marker to facilitate code generation via DxT.) commit a6990434173b0cf651f8521194f3aef738deb7d2 Author: Field G. Van Zee Date: Thu Apr 18 13:52:47 2013 -0500 Fixed bug in packing block of A for hemm/symm. Details: - Fixed a bug in bli_packm_blk_var2() that affected the packing functionality of hemm and symm. The bug occurs whenever attempting to pack a Hermitian or symmetric matrix where the block of A being packed intersects the diagonal, but some of its micro-panels do not intersect the diagonal and lie completely in the unstored region. Thanks to Francisco Igual for reporting this bug. - Comment updates to both _blk_var2.c and _blk_var3.c. commit c92e7590e1934f830814ab614c794215ebe0c415 Author: Field G. Van Zee Date: Wed Apr 17 20:53:29 2013 -0500 Activated bli_packm_acquire_mpart_t2b(). Details: - Removed the overly-paranoid bli_abort() from the end of bli_packm_acquire_mpart_t2b(), to allow others to experiment with partitioning through packed blocks of A. Also, and more importantly, changed an earlier check that was causing an erroneous (but coincidentally redundant) abort().
Also, updated some of the comments in bli_packm_part.c. commit bea579e9f009a44e08008eb14d09f38748ab2b53 Author: Field G. Van Zee Date: Tue Apr 16 19:43:14 2013 -0500 Allow creation of "empty" objects. Details: - Modified bli_obj_alloc_buffer() to allow allocating an empty buffer, and modified bli_adjust_strides() to explicitly handle m = n = 0. - Updated bli_check_matrix_strides() to allow cases where m = n = 0. commit 7904e20f2e6908571ee5008da2a08084198eefae Author: Field G. Van Zee Date: Tue Apr 16 17:37:16 2013 -0500 Fixed "root" object bug in bli_her[2]k/syr[2]k. Details: - Fixed an obscure bug in the front-ends for herk, her2k, syrk, and syr2k, that manifested as the incorrect triangle being updated. It occurred when the user would pass in a matrix object that was correctly marked as symmetric/Hermitian and lower-stored, but whose root object was never marked as lower (or upper). We now alias and re-assign root status for matrix C within the front-ends. Note that trmm and trsm were already doing this, albeit for a slightly different reason (to allow the internal back-end to choose which algorithm to run--lower or upper--based on the uplo of the root object for both left and right side cases). Thanks to Bryan Marker for leading me to this bug. commit 19155a768dd97b57cfb59c32fa8e54a344ec66e1 Author: Field G. Van Zee Date: Tue Apr 16 11:24:03 2013 -0500 Fixed overzealous type-checking in bli_getsc(). Details: - Relaxed type checking in getsc so that the input object could be a constant and not just a proper floating-point type. (If it is a constant, default to extracting the dcomplex values.) Thanks to Bryan Marker for reporting this bug. - Added definition for bli_is_constant() in bli_param_macro_defs.h - Comment updates to various level-0 scalar routines. commit 2ee6bbca2953d04c967685da9735b3eaf8a4b813 Author: Field G. Van Zee Date: Mon Apr 15 19:27:57 2013 -0500 Fixed bug in bli_obj_is_packed() and renamed. 
Details: - This macro is used to determine whether the partitioning routines should call a corresponding packm_part routine instead. However, it was unintentionally catching matrices that were marked as "packed" by virtue of them simply being marked as BLIS_PACKED_UNSPEC in, say, bli_gemv(). The macro has now been renamed to bli_obj_is_panel_packed(), and now only checks for row or column panel packing. (Note that I first attempted to fix this bug in a571af816d72.) Thanks to Bryan Marker for reporting the erroneous behavior that led me to this bug. commit 99b99eebe70336b5f28039a4a084aa7f5fa7059d Author: Field G. Van Zee Date: Mon Apr 15 17:54:43 2013 -0500 Removed local reference ukernel blocksize macros. Details: - Removed locally defined gemm microkernel blocksize macros from _mxn reference microkernel definition and header. Meant to include this in a recent/previous commit (0020ef7c8271). commit 6a538fa7b164655f41cea5b9c8d3902438bda66b Author: Field G. Van Zee Date: Mon Apr 15 14:40:31 2013 -0500 Formatting change to mods in previous commit. commit ea079d35591e808971d2d98a1a7d9f89bc1f7c2f Author: Field G. Van Zee Date: Mon Apr 15 14:31:40 2013 -0500 Set structure of objects in level-2 BLIS APIs. Details: - Added missing statement to set structure field of local objects in top-level BLIS (BLAS-like) API wrappers. Thanks to Bryan Marker for reporting this bug. commit d9948c541c0446e20e249a1ccc83709ce51b7aa8 Author: Field G. Van Zee Date: Mon Apr 15 10:21:26 2013 -0500 Tweak to test suite function string construction. Details: - Fixed a minor bug in the way that the test suite would construct function name strings when the user anchored all parameters in input.operations. In this case, the test driver would mistake this situation for one where the operation simply had no parameters to begin with, and thus would not include the parameter string in the function string that is output for every result. commit ca9e435c57c5c7a000d2a32681dd8070ba850abd Author: Field G. 
Van Zee Date: Mon Apr 15 09:59:46 2013 -0500 Fixed a bug in reference implementation of dupl. Details: - Fixed a bug in reference implementation of dupl (bli_dupl_unb_var1.c), which resulted in incorrect duplication. - Updated old test drivers according to recently updated packm control tree creation interface. - Added 'restrict' to x86 gemm microkernel interface. commit 26cbd52e364bbe439e3744101cd5a6cbcb82dffd Author: Field G. Van Zee Date: Sun Apr 14 19:05:33 2013 -0500 Modified bli_kernel.h include order in blis.h. Details: - Delayed #include of bli_kernel.h in blis.h to prevent a situation where _kernel.h includes an optimized microkernel header, which uses BLIS types such as dim_t and inc_t, which would precede the definition of those types in bli_type_defs.h. - Moved the #include of bli_kernel_macro_defs.h in bli_macro_defs.h to blis.h (immediately after that of bli_kernel.h). commit 3414a23c38b0de45a8034b3dda2fc4b5a755e4e1 Author: Field G. Van Zee Date: Sat Apr 13 16:53:16 2013 -0500 CHANGELOG update. commit ec16c52f2ecf419c749175ce0a297441c10f1c68 (tag: 0.0.6) Author: Field G. Van Zee Date: Sat Apr 13 16:41:16 2013 -0500 Updated INSTALL file (now redirects to website). commit 0020ef7c82711a7ebf08e5174f939bee2563184c Author: Field G. Van Zee Date: Sat Apr 13 15:26:35 2013 -0500 Removed gemmtrsm-, trsm-specific blocksize macros. Details: - Modified gemmtrsm micro-kernel wrappers to use new aliased blocksize macros instead of operation-specific ones. - Removed local, gemmtrsm-specific blocksize macro definitions found in micro-kernel header files. (Meant to include above changes in 31b100e7bf4a.) - Added comments to reference gemmtrsm micro-kernel wrapper implementation. commit 1a9f427b85bb95aaa9e54c8ff8ecad8734b361ee Author: Field G. Van Zee Date: Fri Apr 12 15:25:54 2013 -0500 Added/renamed alignment constants to _config.h. 
Details: - Added new memory alignment constants: BLIS_HEAP_STRIDE_ALIGN_SIZE (previously assumed to be same as SYSTEM_MEM) BLIS_CONTIG_ADDR_ALIGN_SIZE (previously assumed to be same as PAGE_SIZE) BLIS_STACK_BUF_ALIGN_SIZE (previously not enforced) and renamed existing ones BLIS_SYSTEM_MEM_ALIGN_SIZE -> BLIS_HEAP_ADDR_ALIGN_SIZE BLIS_CONTIG_MEM_ALIGN_SIZE -> BLIS_CONTIG_STRIDE_ALIGN_SIZE to better convey what the alignment factor is used for (and what it is not used for). - Removed BLIS_ENABLE_SYSTEM_MEM_ALIGN. Dynamic memory alignment is now disabled by setting BLIS_HEAP_STRIDE_ALIGN_SIZE to 1. - Inserted instances of __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))) into macro-kernels to specify stack alignment of temporary buffers. - Modified test suite driver to output new constants. - Removed bli_align_dim_to_sys() and bli_align_dim_to_cmem(). Instead, we now use bli_align_dim_to_size(), which takes a third argument (the desired alignment). commit a77d10e87e3c0ab55ec14d74c285bc95c06285c3 Author: Field G. Van Zee Date: Fri Apr 12 11:40:55 2013 -0500 Fixed a bug in axpyv/axpym when alpha is unit. Details: - Fixed bug whereby axpyv and axpym were incorrectly simplifying to a copy, rather than an add, when alpha = 1. Thanks to Bryan Marker for identifying this bug. commit 0495bd1d6de5995fe2fb79b321eec79e961eb7a5 Author: Field G. Van Zee Date: Thu Apr 11 16:39:25 2013 -0500 Moved _POSIX_C_SOURCE def to compiler cmd line. Details: - Removed the #define of _POSIX_C_SOURCE in bli_config.h (for both reference and clarksville configurations) and added "-D_POSIX_C_SOURCE=200112L" to the compiler command line arguments in make_defs.mk (for both configs). Thanks to Devin Matthews for suggesting this change. commit d43d1a0a2ef6de4bc57627566aef8e3fdb458b8c Author: Field G. Van Zee Date: Thu Apr 11 16:28:17 2013 -0500 Appended 'f2c_' to abs, min, max macros in f2c.h.
Details: - Renamed abs, min, max, dmin, and dmax macros in bli_f2c.h so that they would not conflict with anything defined by the user (or the language). Thanks to Devin Matthews for suggesting this fix. - Updated all instances of the above macros accordingly. commit 31b100e7bf4aeaa4ceafefd2b6c3102d5fbc4cbb Author: Field G. Van Zee Date: Thu Apr 11 11:11:52 2013 -0500 Added new kernel blocksize macro aliases. Details: - Added new macros that alias level-3 cache and register blocksize macros to names that can be constructed via the PASTEMAC macro. These aliased macro definitions live inside bli_kernel_macro_defs.h, which is now #included after bli_kernel.h. - Modified macro-kernels to use new aliased blocksize macros instead of operation-specific ones. - Removed local, operation-specific kernel blocksize macro definitions (found in macro-kernel header files). commit bd2b24ba65b36d7c07c5918a3838ce2ff57c4b48 Author: Field G. Van Zee Date: Thu Apr 11 10:35:39 2013 -0500 Updated CREDITS file. commit 79328c15410215737f3f14cd069328cf52aa11fd Author: Field G. Van Zee Date: Thu Apr 11 10:32:14 2013 -0500 Reverted testsuite object files' home to 'obj'. Details: - Removed 'obj' and 'lib' from .gitignore. - Added testsuite/obj/.gitkeep (which is an empty file). - Updated testsuite/Makefile accordingly. - Thanks to Vernon Austel for pointing out the .gitkeep trick to tracking empty directories in git. commit 4afe3bfd82c03e1e97b58b7d250588a0d28541e5 Author: Field G. Van Zee Date: Tue Apr 9 17:45:39 2013 -0500 Renamed/moved object scalar constant macros. Details: - Replaced scalar constant macro definitions in bli_const_defs.h with a single, simpler macro in bli_obj_macro_defs.h. - Updated invocations of old macros accordingly. - Removed bli_const_defs.h. commit 357893f5be5c56ab7b062874005e77e614b23f06 Author: Field G.
Van Zee Date: Tue Apr 9 14:48:15 2013 -0500 Applied fix from prev commit to gemmtrsm_?_ref_4x4 Details: - Fixed hard-coded kernels in bli_gemmtrsm_l_ref_4x4.c and bli_gemmtrsm_u_ref_4x4.c. commit 54988e8dca44475610bcaee5a7bc1c40e8921402 Author: Field G. Van Zee Date: Mon Apr 8 19:08:43 2013 -0500 Fixed a performance bug in trsm. Details: - Fixed a bug in the reference implementations of the gemmtrsm wrappers (bli_gemmtrsm_l_ref_mxn.c and bli_gemmtrsm_u_ref_mxn.c) whereby the reference gemm microkernel was hard-coded, and thus always called, even when GEMM_UKERNEL was defined to point to an optimized microkernel. This manifested as artificially low trsm performance for all problem sizes, but especially for small problem sizes as it only affected blocks of A that intersected the diagonal. Thanks to Mike Kistler of IBM for helping me find this bug. commit a7252e40b5c351eef9a1df531ea0ef25cb5fb705 Author: Field G. Van Zee Date: Mon Apr 8 16:08:22 2013 -0500 Generate testsuite objects 'src'. Details: - Tweaked the testsuite makefile so that object files are stored in 'src' rather than 'obj', since (a) the top-level .gitignore dictates that obj directories are to be ignored, and (b) since git has problems tracking empty directories. Now, users do not need to create their own obj directories within their own local clones of BLIS. commit 803871c55b60d3c225ad9a0607fa507a9c16aab7 Author: Field G. Van Zee Date: Mon Apr 8 15:18:42 2013 -0500 Minor formatting changes. commit a571af816d72727e16cad37007e7043b9d6fa362 Author: Field G. Van Zee Date: Mon Apr 8 15:00:13 2013 -0500 Fixed definition of bli_is_packed_object() macro. Details: - Changed the definition of bli_is_packed_object() so that it keys off of the value of the pack schema bits in the info field of obj_t, rather than comparing the obj_t buffer with that of the mem_t entry.
This was the cause of a very low probability bug whereby uninitialized memory caused the macro to evaluate to TRUE even though the object in question was not packed. Thanks to Vernon Austel of IBM for helping discover this bug. - Changed an abort() in bli_packm_part() to a not-yet-implemented. commit 3be14c32f735ecc6169d3ab6370cf8b69162acec Author: Field G. Van Zee Date: Sat Apr 6 12:54:45 2013 -0500 Updated information in testsuite output header. Details: - Added to the information that is echoed at the beginning of the test suite's output, and also re-labeled some existing information. commit 874707c1b183a4dd9a91dbfd4ea1522384c190df Author: Field G. Van Zee Date: Fri Apr 5 17:19:43 2013 -0500 Fixed edge case handling bug in herk macrokernels. Details: - Fixed a bug present in bli_herk_l_ker_var2() and bli_herk_u_ker_var2() that only manifests when BLIS is configured such that MR != NR. The bug involves incorrectly detecting edge cases, which resulted in some parts of matrix C potentially being skipped and not updated, depending on the problem size. - Updated the default values of MR and NR in config/reference/bli_kernel.h to 8 and 4, respectively, so that I can better stress the framework on a day-to-day basis. (The fact that they were both equal to 4 for so long is why I did not stumble upon this bug much sooner.) commit 7cbda15291d3e01300e71c286b9657b7ef0708bf Author: Field G. Van Zee Date: Thu Apr 4 15:25:43 2013 -0500 Added reference microkernels for arbitrary MR, NR. Details: - Added a new set of reference gemm, gemmtrsm, and trsm micro-kernels that contain explicit loops over MR and NR, thus allowing them to be used unmodified by developers who want to build a reference library with custom register blocksizes. - Changed config/reference/bli_kernel.h to use above ukernels by default. - Changed interfaces of new and existing gemm, gemmtrsm, and trsm micro-kernels to use 'restrict' keyword. - Added -funroll-loops option to config/reference/make_defs.mk. 
- Updated comments in bli_kernel.h describing constraints on register and cache blocksizes. - Updated _adds_mxn.h, _copys_mxn.h, and _xpbys_mxn.h macro files so that single-char macros are also defined. commit 6684b73d5501f91d24a79e26655a42819c9b3114 Author: Field G. Van Zee Date: Tue Apr 2 13:06:20 2013 -0500 Implemented amax operation and related changes. Details: - Implemented amax operation in BLIS. - Activated BLAS2BLIS routine mapping for new amax BLIS implementation. - Added integer support to [f]printv, [f]printm. - Added integer support to level-0 copys macros. - Updated printing of configuration information in test suite driver. - Comment changes to _config.h files. - Added comments to bla_dot.c to remind the reader what sdsdot()/dsdot() are used for. commit fb68087f8727cd5fd656a742a110e54fb1c91db9 Author: Field G. Van Zee Date: Tue Mar 26 15:10:16 2013 -0500 More memory alignment-related tweaks. Details: - Renamed BLIS_MEMORY_ALIGNMENT_SIZE to BLIS_CONTIG_MEM_ALIGN_SIZE. - Renamed BLIS_ENABLE_MEMORY_ALIGNMENT to BLIS_ENABLE_SYSTEM_MEM_ALIGN. - Added BLIS_SYSTEM_MEM_ALIGN_SIZE, which controls only the alignment passed into posix_memalign() or equivalent. - Defined new function, bli_align_dim_to_cmem(), which applies the contiguous memory alignment (rather than the system/malloc alignment). commit 9682ef61dbf9a8846c8b0826d4de24bc216cd641 Author: Field G. Van Zee Date: Tue Mar 26 14:14:53 2013 -0500 Always define memory alignment size cpp constant. Details: - Removed guard around #define for memory alignment size constant. Memory alignment should always be enabled, and so this value should always be defined. commit 3a787cccaae16531474f34398e3c0cf4f49b8cd8 Author: Field G. Van Zee Date: Tue Mar 26 13:59:19 2013 -0500 Renamed memory alignment macro constant. Details: - Renamed all occurrences of BLIS_MEMORY_ALIGNMENT_BOUNDARY to BLIS_MEMORY_ALIGNMENT_SIZE. commit 37308f9a502b56d94fa52a7df71c676a46c3be3d Author: Field G.
Van Zee Date: Tue Mar 26 12:43:14 2013 -0500 Align packed panel strides with system alignment. Details: - Pass panel strides through bli_align_dim_to_sys() to ensure that each subsequent packed panel of A and B begins at an aligned address. (The first panel is presumably aligned to system alignment because it is aligned to a page boundary, which is typically much larger.) - Rearranged code in packm_init_pack() to prevent additional conditional blocks as a result of the aforementioned change. - Adjusted contiguous memory allocator so that the system memory alignment is used to allocate enough space for each block no matter what kind of register blocking is used (even if register blocksize is unit and every row/column needs maximal padding). - Adjusted default blocksizes in reference configuration so that MC*KC and KC*NC result in identical footprints for all datatypes. commit 40a0654ada5f256beb3da80ebba015a3c71fb61f Author: Field G. Van Zee Date: Sun Mar 24 20:18:12 2013 -0500 CHANGELOG update. commit b65cdc57d9e51fa00e3c03539cfb7e045707d0f4 (tag: 0.0.5) Author: Field G. Van Zee Date: Sun Mar 24 20:01:49 2013 -0500 Migrated 'bl2' prefix to 'bli'. Details: - Changed all filename and function prefixes from 'bl2' to 'bli'. - Changed the "blis2.h" header filename to "blis.h" and changed all corresponding #include statements accordingly. - Fixed incorrect association for Fran in CREDITS file. commit 132bffcef7441f32d02cc7485aef6a0648e0ef1e Author: Field G. Van Zee Date: Sun Mar 24 18:49:36 2013 -0500 Removed several 'old' directories and files. Details: - Removed most of the 'old' directories scattered throughout the framework, which includes alternate/half-baked/broken implementations. commit 551ea4767a3ea6c263f12aaca94bc2642cee4cfa Author: Field G. Van Zee Date: Sun Mar 24 18:00:10 2013 -0500 Removed #include "blis2.h" from low-level headers. Details: - Removed #include of "blis2.h" from various lower-level, operation-specific header files throughout the framework. 
Given that these low-level headers are included within blis2.h in a very specific order, #include'ing blis2.h within them directly is unnecessary. commit bc7b318ed0960edeb4537797dd8c91de0d942ca9 Author: Field G. Van Zee Date: Fri Mar 22 17:18:58 2013 -0500 Added cpp guards to conflicting libflame typedefs. Details: - Added cpp guards around the definitions of dim_t, scomplex, and dcomplex. This is a temporary hack to allow interoperability with libflame. (Similarly temporary changes are being made to libflame's type definitions file.) commit f469907503fcdc24dff0174c569170e6e756e045 Author: Field G. Van Zee Date: Fri Mar 22 15:20:15 2013 -0500 Renamed MAX_PREFETCH_BYTE_OFFSET to MAX_PRELOAD_. Details: - Renamed BLIS_MAX_PREFETCH_BYTE_OFFSET to BLIS_MAX_PRELOAD_BYTE_OFFSET since "prefetch" is kind of a loaded word (e.g. "prefetch" instructions, which are different than the particular kind of prefetching/preloading referred to by this constant). commit d1023bfbc6668a58a01ee4f82ded2319911e7b19 Author: Field G. Van Zee Date: Fri Mar 22 15:09:59 2013 -0500 Removed build/old directory. commit 718888849c48d99f83eea6b8f83bc1998cffef7e Author: Field G. Van Zee Date: Fri Mar 22 15:07:01 2013 -0500 Deprecated 'flame' configuration. Details: - Removed 'flame' configuration, as it was horribly out-of-date. - Comment changes to bl2_blocksize.c and bl2_mem.c. commit bba38cf4e9d28058c14483f44fa074a6d2852ad9 Author: Field G. Van Zee Date: Tue Mar 19 18:07:40 2013 -0500 Added missing conjbeta argument to scald. commit 1f82b51d06d0279dded3f2b87ba59403f3ed0af6 Author: Field G. Van Zee Date: Mon Mar 18 15:37:20 2013 -0500 Relocated packed mem_t dimension fields to obj_t. Details: - Removed the m and n (and elem_size) fields from the mem_t object, and added m_packed and n_packed fields to obj_t. These new fields track the same as the old ones. From an abstraction standpoint, it seemed awkward to store those dimensions inside the mem_t.
- Updated interfaces to bl2_mem_acquire_*() so that only a byte size argument is passed in, instead of m, n, and elem_size. - Updated bl2_packm_init_pack() and bl2_packv_init_pack() to inline the functionality of bl2_mem_alloc_update_m() and bl2_mem_alloc_update_v(), respectively. - Updated packm variants to access the packed length and width fields from their new locations. commit 36c782857bf9b8ac1b1dac47a70f689a4407e2cc Author: Field G. Van Zee Date: Mon Mar 18 10:37:03 2013 -0500 CHANGELOG update. commit e7d41229d3b1674e74f47d7f29fae004a745201a (tag: 0.0.4) Author: Field G. Van Zee Date: Fri Mar 15 17:12:36 2013 -0500 Re-implemented contiguous memory allocator. Details: - Completely re-wrote the contiguous memory allocator (bl2_mem.c). The new allocator instantiates and initializes three separate memory pool objects, each one associated with a separate array of contiguous memory blocks, each block of fixed and uniform size. (The three pools are for allocating mc-by-kc blocks of A, kc-by-nc panels of B, and mc-by-nc panels of C.) The pool objects use a stack structure internally to track which blocks in the region have been "checked out" to a thread and which are still available. Critical regions are now clearly marked and adaptable to parallel environments (e.g. OpenMP). Memory pools are set up when bl2_init() is called. - Added a new field to the packm control tree node, which indicates what kind of packed buffer is being allocated. The enumerated type for this argument is defined as packbuf_t in bl2_type_defs.h. - Updated level-3 _cntl.c files to pass in the appropriate value for a new packbuf_t argument to bl2_packm_cntl_obj_create(). - Moved some macros called by packm_init_pack() from bl2_obj_macro_defs.h to bl2_mem_macro_defs.h. - Added BLIS_MAX_NUM_THREADS to bl2_config.h, which we use as the default number of blocks of A reserved for the memory allocator. - Deprecated bl2_align_dim(). Replaced usage with that of bl2_align_dim_to_mult(). 
Turns out that typically we don't need to align a dimension to the system alignment, since that value has to do with starting addresses, whereas the values we are dealing with are unitless dimensions. commit 1e76cae00cb0a04544aaae1ade878686b238d283 Author: Field G. Van Zee Date: Fri Mar 15 12:21:42 2013 -0500 Perform her2k var1 loops in sequence. Details: - Changed variant 1 of her2k so that the two rank-k products are computed and accumulated in sequence rather than fused into one loop. This is necessary if BLIS is to be configured to provide only enough contiguous memory for one panel of B. commit c95c270eba91ae4efc26603beddfd0292caa919b Author: Field G. Van Zee Date: Thu Mar 7 14:42:15 2013 -0600 Enhanced tracking of dimensions for mem_t objects. Details: - Added new fields to mem_t struct definition to track the allocated (as opposed to the currently used) dimensions of the memory region. This allows packm_init() to be more robust in situations where memory is already allocated but is more than needed for the current packing job. - Updated logic in bl2_obj_set_buffer_with_cached_packm_mem() macro, used in packm_init(), to update the "currently used" dimensions of the mem_t object if the requested dimensions are smaller than the allocated dimensions. commit e99281a0f41d482fddeffa239bfc8e13e6d13d4b Author: Field G. Van Zee Date: Thu Mar 7 14:00:10 2013 -0600 Fixed test suite flop formulas for ops with side. Details: - Fixed incorrect flop counts in test suite modules for hemm, symm, trmm, trmm3, and trsm. - Comment updates in herk macro-kernels. commit ef8cbfc44dd620fdcbdb51cdb173217194bebe31 Author: Field G. Van Zee Date: Sat Mar 2 12:47:06 2013 -0600 Added "version" to .gitignore. Details: - Added "version" to .gitignore file so that the file does not show up when running 'git status', or accidentally get pulled into the index when running 'git add' or 'git add --all'. commit e9e0747c2f6c178f53ac46ab794acbb7b8c4fea8 Author: Field G. 
Van Zee Date: Sat Mar 2 12:43:54 2013 -0600 Removed version file from version control. Details: - Removed version file from version control to prevent git errors that occur when trying to pull new commits. commit bb612f864e9c17dd9805e9446840f02259619469 Author: Field G. Van Zee Date: Fri Mar 1 12:55:42 2013 -0600 Updated behavior of bl2_obj_induce_trans() macro. Details: - Changed bl2_obj_induce_trans() so that the transposition bit is no longer updated as part of the macro. All current uses of the macro have been coupled with instances of bl2_obj_set_trans() to clear the bit. - Added Jed to CREDITS file. commit f24e29b789e7314764a818ceb3063126936c986f Author: Field G. Van Zee Date: Fri Feb 22 18:15:41 2013 -0600 Replaced banded/packed BLAS2 stubs with f2c code. Details: - Retired the blas2blis wrappers that simply called abort with a "not yet implemented" message. This includes all of the level-2 banded and packed routines. - Replaced the aforementioned with the corresponding netlib implementations having been run through f2c (with some customization). - Added directories named 'attic' to build/gen-make-frags/ignore_list. commit 1454c1a14207766dfed372b8e38b47fa384f5198 Author: Field G. Van Zee Date: Fri Feb 22 12:38:45 2013 -0600 Moved Fortran name-mangling macro to bl2_config.h. Details: - Moved the Fortran-77 name-mangling macros from bl2_blas_macro_defs.h to the configuration directory (bl2_config.h, specifically) given that it can be expected to be tweaked by some developers. commit ede75693e5a36c6006087c4a7df834175b604504 (tag: 0.0.3) Author: Field G. Van Zee Date: Fri Feb 22 12:11:24 2013 -0600 Implemented blas2blis compatibility layer. Details: - Added the blas2blis compatibility layer, located in frame/compat. This includes virtually all of the BLAS, including banded and packed level-2 operations. - Defined bl2_init_safe(), bl2_finalize_safe(). 
The former allows a conditional initialization, which stores the "exit status" in an err_t, which is then read by the latter function to determine whether finalization should actually take place. - Added calls to bl2_init_safe(), bl2_finalize_safe() to all level-2 and level-3 BLAS-like wrappers. - Added configuration option to instruct BLIS to remain initialized whenever it automatically initializes itself (via bl2_init_safe()), until/unless the application code explicitly calls bl2_finalize(). - Added INSERT_GENTFUNC* and INSERT_GENTPROT* macros to facilitate type templatization of blas2blis wrappers. - Defined level-0 scalar macro bl2_??swaps(). - Defined level-1v operation bl2_swapv(). - Defined some "Fortran" types to bl2_type_defs.h for use with BLAS wrappers. commit 995edf43e21c1868732dbdd7fee14b08730218bd Author: Field G. Van Zee Date: Thu Feb 21 14:30:50 2013 -0600 Updated version file. (Forgot to in prev commit). commit e823b08aaf7b65ecc6ddc30570709ea8a4b52aa7 Author: Field G. Van Zee Date: Thu Feb 21 12:00:17 2013 -0600 Fixed some scalar types in BLAS-like Herm APIs. Details: - Some of the scalars of Hermitian operations, such as alpha in her, alpha and beta in herk, and beta in her2k, need to be real. These arguments were typed incorrectly as the complex types. This has been fixed. Note the issue was only present in the BLAS-like APIs for these operations (not the native object-based interfaces). commit 5ece050a669e74ba4a711d1d4669239d22d45642 Author: Field G. Van Zee Date: Wed Feb 20 15:50:54 2013 -0600 Updated version file. (Forgot to in prev commit). commit f243034b8b430d4684680ea8eddfd246e73fefc0 Author: Field G. Van Zee Date: Wed Feb 20 14:11:36 2013 -0600 Changed API of packm_init_pack() to use blksz_t. Details: - Changed the interface of packm_init_pack() so that mult_m and mult_n are passed in as type blksz_t* instead of dim_t. - Make similar change for packv_init_pack(). commit da0c22f24107be9f33e0ea2dae52e5534b1fd0e5 Author: Field G. 
Van Zee Date: Fri Feb 15 09:59:48 2013 -0600 Minor changes to lower levels of scalm and setm. Details: - Removed diagx parameter from lower-level interfaces of scalm. - Modified scalm_basic_check() to expect an object with a nonunit diagonal. - Changed setm_unb_var1() so that having an implicit unit diagonal results in only the strictly lower or upper triangle of the matrix being modified. commit 2c836adadcd2a7d7f217033ac4d7fcad03d5bd55 Author: Field G. Van Zee Date: Thu Feb 14 10:42:56 2013 -0600 Updated beta == zero semantics of mulsc. Details: - Updated beta == zero semantics of mulsc. Hopefully this is the last operation that needed updating. - Added Devin to CREDITS file. commit 722b66c7dcaaaa1b109e7c8b1d53fd71a9af8240 Author: Field G. Van Zee Date: Thu Feb 14 10:18:00 2013 -0600 Removed some calls to setv() in test modules. Details: - Removed calls to setv() in test modules whose sole purpose was to initialize vectors to zero to ensure that nan's and inf's would not taint the computation. Now that beta == zero semantics have been updated to clear the output operand (when beta is zero), rather than multiply against it, these setv() calls are no longer needed. commit e6ac623a902f776c42f85eadbf76996d9770a0db Author: Field G. Van Zee Date: Wed Feb 13 18:44:59 2013 -0600 Properly implemented beta == 0 semantics. Details: - Changed name of set0 and set0_mxn macros to set0s and set0s_mxn, respectively. - Added code to the following operations that sets the output operand to zero if the corresponding scalar is zero (rather than performing the floating-point multiply, or in the case of setv, copying the value). This will prevent nan's and inf's from creeping into results from uninitialized memory. - axpy - dotxv - scalv - scal2v - setv - gemv - ger - hemv - her - her2 - gemm reference ukernels commit aedccbc85d491e41711a0c6eb0d246d8700a199a Author: Field G. Van Zee Date: Wed Feb 13 18:29:53 2013 -0600 Fixed stale interface to packm_unb_var1(). 
Details: - Removed the control tree from the interface to packm_unb_var1(), which I meant to do when it was un-deprecated. commit c23135669f7a8a545e2e11ef559bf284be8bc65c Author: Field G. Van Zee Date: Wed Feb 13 13:21:00 2013 -0600 Un-deprecated packm_unb_var1.c (needed by l2 ops). Details: - Added bl2_packm_unb_var1() back into the mix once I realized that level-2 operations still need this routine for packing matrices. Now, whether level-2 operations should be packing matrices to begin with is another matter. But this fixes the segmentation fault one would have gotten when running bl2_gemv() on a general stride matrix. commit cf49e35f9819f9d93ebdca4703ade5abab28f6f6 Author: Field G. Van Zee Date: Tue Feb 12 18:39:35 2013 -0600 Removed cntl tree usage from packm implementation. Details: - Added new fields to obj_t info field: - invert_diag - pack_order_if_upper - pack_order_if_lower These fields allow packm_init() to embed information that begins in the control tree into the object so that the packm implementation does not need to use control trees at all. This is being done to aid Bryan's DxT code generation. - Added macros that operate on above fields. - Changed packm_init(), packm_blk_var2(), and packm_blk_var3() according to above changes. - Made similar (but much simpler) changes to packv. - Deprecated packm_blk_var1(), packm_unb_var1(), and packm_densify(). These were part of prototype implementations and are no longer needed. commit eb139ae256651af7820b93ef982626180195b87f Author: Field G. Van Zee Date: Tue Feb 12 12:39:30 2013 -0600 Replaced bl2_abs() with _fabs() where appropriate. commit 474bac30c99928f9e87315972bcb45c632c0b7ec Author: Field G. Van Zee Date: Tue Feb 12 12:23:48 2013 -0600 Removed level-0 macros projrs, grabis. Details: - Replaced instances of projrs and grabis macros with newer, more general-purpose getris. commit 03a260a457c8964e4603a655cee0d40ac17affba Author: Field G. 
Van Zee Date: Tue Feb 12 11:45:34 2013 -0600 Restored executable permissions to scripts. Details: - Restored executable (0755) permissions to scripts that were touched by the recursive sed script that updated the copyright headers in the previous commit. commit 1274e1243775e5e705114257a43176f63635227f Author: Field G. Van Zee Date: Mon Feb 11 14:37:47 2013 -0600 Updated copyright headers from 2012 to 2013. commit 3b620cc8e90c53c79129bd9dd89ae6b77c2446f1 Author: Field G. Van Zee Date: Mon Feb 11 13:38:07 2013 -0600 CHANGELOG update. commit 768fcebaa8be0eb936a6e7a02cd8a19438c79d99 (tag: 0.0.2) Author: Field G. Van Zee Date: Mon Feb 11 13:20:44 2013 -0600 Added unified test suite, and many fixes. Details: - Added a highly configurable, unified test suite. - Removed DUPB configuration constant from bl2_kernel.h and macro-kernel header files. Now, instead, DUPB is computed as (NDUP != 1) within each macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into incorrectly when DUPB was set to FALSE but the NDUP was still non-unit. By encoding both pieces of information into one constant in _kernel.h, it seems somewhat less likely others will encounter this bug in the future. - Added level-2 cache blocksizes to _kernel.h for reference configuration, and defined blocksizes in _cntl.c files to these default values. - Changed semantics of her2k and syr2k such that these operations no longer expect the B matrix to already be conjugate-transposed (or just transposed for syr2k). However, these semantics are preserved for the internal mechanics of the implementations, including the internal back-end and all blocked variants. - Inserted checks for real-valued alpha and beta for herk/her2k and herk, respectively. - Relaxed general object structure constraints in _basic_check() for gemv, ger. 
- Changed her front-end to NOT copy-cast to real projection; instead, this is replaced by selecting either the real part or both parts within the unblocked algorithm implementation, depending on the value of conjh. - Added conjh to all _check routines for her so that the code knows when to verify that alpha has an imaginary component equal to zero (for her, but not syr). - Changed control tree for her to forgo packing. - Added unit diagonal support to fnormm. - Redefined real versions of abval2s macros in terms of fabs(), fabsf(). - Redefined complex versions of sqrt2s macros using the actual "complex square root" formula. - Created new level-0 object-based routines, suffixed with "sc" (for "scalar"). - Defined new level-1v, -1d, and -1m versions of add and sub operations (two-operand add and subtract). - Added new scalar macros: - getris: acquire real and imaginary components. - setris: set real and imaginary components. - addjs: addition with conjugated x. - subjs: subtraction with conjugated x. - Defined new utility operations: - absumv: element-wise sum of absolute values for vector elements. - absumm: element-wise sum of absolute values for matrix elements. - mkherm: convert existing matrix to Hermitian. - mksymm: convert existing matrix to symmetric. - mktrim: convert existing matrix to triangular. - Added various error checking routines. - Added bl2_clock_min_diff(), which is used to more cleanly measure the wall clock time of a code block. - Added general stride support to bl2_obj_alloc_buffer(). - Added bl2_obj_init_scalar(). - Updated parameter mapping in bl2_param_map.c. - Added support for queriable version string. - Fixed a bug in the her2k macro-kernels (which currently are simply implemented in terms of two invocations of herk) whereby beta was being applied to both the first and second rank-k updates, rather than only the first. 
- Fixed a bug in trmm/trsm whereby transpose and right side cases were not properly implemented due to erroneous assumptions regarding aliasing and root objects. - Fixed a bug in the upper triangular trsm macro-kernel in which the wrong MR x NR block of B was being updated. - Fixed a bug in the inverts macro in the double real case whereby the value was typecast to float before inversion. This affected non-unit cases of dtrsm. - Fixed a bug in the reference kernels for gemmtrsm whereby the minus one constant was being applied incorrectly. - Fixed a bug in the overall treatment of non-unit alpha for trsm. The code now mimics the rank-k strategy of gemm, whereby alpha is applied during the first iteration of variant 3, with BLIS_ONE passed in instead for subsequent iterations. This also required passing alpha into the macro-kernels as well as the fused gemmtrsm micro-kernels. - Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being called for blocks strictly above the diagonal. While this sounds good in theory, this cannot be done because gemm_ker_var2 expects row panels of A to be packed from top to bottom, while for trsm_u, A is actually packed from bottom to top due to the reverse (BR->TL) nature of the algorithm. - Fixed a bug in packm_cxk() whereby panel packings with unit panel dimensions were mishandled due to incorrect arguments to the copyv kernel. Also changed the copyv kernel invocation to scal2v so that these edge cases are properly handled when scaling is requested. - Fixed a bug in packv_int() whereby an uninitialized object is passed in instead of the source object. - Fixed a bug whereby level-2 code could allocate memory dynamically via bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed a potential future bug whereby a mem_t object that is actually no longer "allocated" from the static pool is mistaken for being allocated due to failure to NULLify the buffer when the block was most recently released.
- Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly toggled when the requested subpartition needed to be "reflected" due to it residing in an unstored region. commit be94fb84c0351602d7585269f29998e3bf83f899 Author: Field G. Van Zee Date: Fri Jan 4 10:55:21 2013 -0600 Added missing 'd' to fused gemmtrsm function name. commit 879a179e1dee36f0c56765f2ab91a26861019b34 Author: Field G. Van Zee Date: Fri Jan 4 10:37:27 2013 -0600 Added debug statements to bl2_mm_acquire_m(). Details: - Added printf() statements to bl2_mm_acquire_m() to help debug issues with prematurely exhausted memory pool. - Removed 'd' from kernel names of reference kernels in clarksville configuration's bl2_kernel.h commit 806e74beb4eafeef620a555ffbb3f6779e29c7b6 Author: Field G. Van Zee Date: Thu Dec 20 17:07:50 2012 -0600 Defined Frobenius norm operations. Details: - Added level-0 grabis macro operation to grab imaginary component of one variable and copy it to the real component of another variable. - Defined sumsqv operation, which computes the sum of the absolute squares of the elements of a vector. This implementation is modeled after ?lassq in netlib LAPACK. - Defined fnormv and fnormm operations, which compute the Frobenius norm on vectors and matrices, respectively. These operations are treated as one-operand operations where the output norm value is the real projection of the datatype of the input operand. Both operations are implemented in terms of sumsqv. commit 66e80ce1aec099b2b2b0c4f295e38add2c921383 Author: Field G. Van Zee Date: Thu Dec 20 17:02:55 2012 -0600 Added GENT*R macros; tweaked bl2_machval defs. Details: - Added function and prototype macro-generating macros for GENTFUNCR and GENTPROTR, which are one-operand macros with auxiliary real projection types. - Tweaked bl2_machval files to use new macros. commit 2fecc88ca22142020573f168da715e8e9f3dd7de Author: Field G.
Van Zee Date: Thu Dec 20 11:35:14 2012 -0600 Fixed harmless macro bug in level-1m operations. Details: - Fixed some inconsistent usage of n_iter_max and n_iter in the two bl2_set_dims_incs_uplo_[12]m macros. The right thing ended up happening despite the bug, which is why I had not discovered it until now. commit 8945db6ec9f82168cf72411ad408b4fdb44ae0d1 Author: Field G. Van Zee Date: Tue Dec 18 15:07:36 2012 -0600 Renamed x86,x86_64 kernels to indicate 'd' fusing. Details: - Renamed x86 and x86_64 kernels to contain a 'd' before the fusing shape to emphasize that the fusing shape is not for all datatype instances, but rather just for one (that of double-precision real). Other fusing shapes would be proportional to their precision and domain "byte footprints". - Corresponding changes to config/clarksville/bl2_kernel.h. commit 6fbbdd4e194d06096ad08c5db61127be338067db Author: Field G. Van Zee Date: Tue Dec 18 14:34:02 2012 -0600 More tweaks to _config.h, _kernel.h; smem tweaks. Details: - Moved kernel-related definitions from bl2_config.h to bl2_kernel.h. - Replaced #define of _GNU_SOURCE with #define of _POSIX_C_SOURCE. This accomplishes the same thing (enabling posix_memalign()) without enabling all of the GNU extensions we don't need. - Defined the size of the static memory pool in terms of MC, KC, and NC, as well as two new constants that determine how many MCxKC blocks and how many KCxNC blocks should be allocated (defined in bl2_config.h). - In the case of static memory pool exhaustion, replaced the generic bl2_abort() with a specific error code call. commit 5d8bdb21c48e8fb11bef6128a242122cc1470a99 Author: Field G. Van Zee Date: Mon Dec 17 16:07:36 2012 -0600 Minor reordering of bl2_config.h definitions. commit 4a83f67490136a898f558e273b76a687aed8b893 Author: Field G. Van Zee Date: Mon Dec 17 12:35:54 2012 -0600 Consolidated configuration headers. Details: - Merged contents of bl2_arch.h into bl2_config.h for reference and clarksville configurations.
- Updated CREDITS, INSTALL, LICENSE, README files. commit 0670c33cc14612f636ef09ede4133404ae0af6ba Author: Field G. Van Zee Date: Fri Dec 14 12:45:26 2012 -0600 Fixed bug in reference gemm ukernels. Details: - Fixed a bug whereby, for the reference gemm ukernels, the matrix product was not correctly accumulated and scaled (by alpha) into the output matrix C. (Thanks to Fran for finding this bug.) - Whitespace changes to reference trsm kernels. commit e2e7cb2fbe615be4d375bc2dce88d03d98fadc9e Author: Field G. Van Zee Date: Thu Dec 13 18:17:54 2012 -0600 Expanded reference packm/unpackm kernel set to 16. Details: - Added 10xk, 12xk, 14xk, and 16xk reference kernels for packm and unpackm. - Updated bl2_[un]packm_cxk() to silently use scal2m if "out of range" kernel size is requested. (Thanks to Tyler for finding this bug.) - Updated bl2_kernel.h to contain new _KERNEL definitions, according to above changes, for 'reference' and 'clarksville' configurations. - Updated CHANGELOG. - Removed "output*.m" from .gitignore. commit 17455a8bce038dd570356ab0c5c11d9a89f20248 Author: Field G. Van Zee Date: Mon Dec 10 17:23:32 2012 -0600 Minor updates towards 0.0.1. commit 7ad4ebef38b8e6eea9b6091844ba7294ec870271 (tag: 0.0.1) Author: Field G. Van Zee Date: Mon Dec 10 16:18:40 2012 -0600 Tweaks to get BLIS compiling again on clarksville. Details: - Updated header files and make_defs.mk in config/clarksville. - Fixes to bl2_mem.c (now that SMEM_M, SMEM_N are gone). - Moved definition of blksz_t from bl2_cntl.h to bl2_type_defs.h. - Shuffled include statements in blis2.h. commit cc58ea86010b1f046134d13b546c878389df9af5 Author: Field G. Van Zee Date: Mon Dec 10 14:55:12 2012 -0600 Added template fragment.mk; updated .gitignore. commit 714c527b0eb153b7e2040b79349edc8372f743fd Author: Field G. Van Zee Date: Fri Dec 7 19:54:04 2012 -0600 Added 'changelog' make target; other tweaks. Details: - Updated CHANGELOG.
- Added 'changelog' target to Makefile that runs 'git log --decorate' and overwrites CHANGELOG with the output. - Other trivial changes. commit e4e5404d26aded4873278e85faf6f14ac32115b5 Author: Field G. Van Zee Date: Fri Dec 7 17:34:53 2012 -0600 Define static memory pool size in bl2_config.h. commit 19bb507d0de6a2bd3ce37cf616bdcd6b419ed641 Author: Field G. Van Zee Date: Fri Dec 7 17:18:00 2012 -0600 Refined INSTALL text; added 'showconfig' target. Details: - Added 'showconfig' target to Makefile. - Added header files and ./config//make_defs.mk as prerequisites to object file rules. - Added config.mk as prerequisite to library install rules. - Edited and added to INSTALL file. commit 26cb659dd79636489db5a051aa60fff80273a7b9 Author: Field G. Van Zee Date: Thu Dec 6 15:34:53 2012 -0600 Added auto-detection of version string (via git). Details: - Added build/update-version-file.sh script for auto-detecting "version" string and updating 'version' file accordingly. (If .git directory is not present, then it is assumed this copy of BLIS is a downloaded release, in which case 'version' file is left unchanged.) - Added invocation of update-version-file.sh to configure script. commit b0ecd0ff52fa6ffc9e1d9eb44c365f7f009a6204 Author: Field G. Van Zee Date: Thu Dec 6 14:27:11 2012 -0600 Wrote first draft of INSTALL file. commit bcbe81235a35ccfdbcc2f2319a0ca6e04f75a785 (tag: 0.0.0) Author: Field G. Van Zee Date: Thu Dec 6 12:42:35 2012 -0600 Updated standalone test Makefile and other fixes. Details: - Major edits to test/Makefile to bring up-to-date wrt new build system; should no longer be broken. - Minor edits to top-level Makefile. - Fixed copy-and-paste bugs in - frame/1m/packm/ukernels/bl2_packm_ref_?xk.c - frame/1m/unpackm/ukernels/bl2_unpackm_ref_?xk.c commit 2f272b40f43307909736327f49d17737c7a05d37 Author: Field G. Van Zee Date: Tue Dec 4 19:22:14 2012 -0600 Added build system and continued reorganization. Details: - Added/renamed packm, unpackm kernels. 
- Added machine value routines. - Added param_map facility. - Renamed AUTHORS to CREDITS. - Added Makefile; continued to expand upon existing configure script. - #define fuse_fac macros in operation headers if not defined already (by the user in bl2_kernels.h). commit 00f3498a8943be1b387f0d5c029c8c7891687ad5 Author: Field G. Van Zee Date: Mon Dec 3 12:36:11 2012 -0600 Initial commit.