# SIMD Everywhere

SIMDe provides fast, portable implementations of SIMD intrinsics on
hardware which doesn't natively support them, such as calling SSE
functions on ARM.

The current focus is on writing complete portable implementations,
though a large number of functions already have accelerated
implementations using one (or more) of the following:

 * SIMD intrinsics from other ISA extensions (e.g., using NEON to
   implement SSE).
 * Compiler-specific vector extensions and built-ins such as
   [`__builtin_shufflevector`](http://clang.llvm.org/docs/LanguageExtensions.html#langext-builtin-shufflevector)
   and
   [`__builtin_convertvector`](http://clang.llvm.org/docs/LanguageExtensions.html#langext-builtin-convertvector)
 * Compiler auto-vectorization hints, using:
   * [OpenMP 4 SIMD](http://www.openmp.org/)
   * [Cilk Plus](https://www.cilkplus.org/)
   * [GCC loop-specific pragmas](https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html)
   * [clang pragma loop hint directives](http://llvm.org/docs/Vectorizers.html#pragma-loop-hint-directives)

For an example of a project using SIMDe, see
[LZSSE-SIMDe](https://github.com/nemequ/LZSSE-SIMDe).

## Current Status

[![Travis](https://api.travis-ci.org/nemequ/simde.svg?branch=master)](https://travis-ci.org/nemequ/simde) [![AppVeyor](https://ci.appveyor.com/api/projects/status/1f3wp712w1ium5vi/branch/master?svg=true)](https://ci.appveyor.com/project/quixdb/simde/branch/master) [![Codecov](https://img.shields.io/codecov/c/github/nemequ/simde.svg)](https://codecov.io/gh/nemequ/simde)

There are currently complete implementations of the following instruction
sets:

 * MMX
 * SSE
 * SSE2
 * SSE3
 * SSSE3
 * SSE4.1

As well as partial support for many others; see the
[instruction-set-support](https://github.com/nemequ/simde/issues?q=is%3Aissue+is%3Aopen+label%3Ainstruction-set-support+sort%3Aupdated-desc)
label in the issue tracker for details on progress.  If you'd like to
be notified when an instruction set is available you may subscribe to
the relevant issue.

If you have a project you're interested in with SIMDe but we don't yet
support all the functions you need, please file an issue with a list
of what's missing so we know what to prioritize.

## Want to help?

There are a *lot* of instructions to get through, so any help would be
greatly appreciated!  It's pretty straightforward work, and a great
way to learn about the instructions.

There are three places you'll want to modify in order to implement a
new function:

 * ${arch}/${isax}.h — this is where the implementations live
 * test/${isax}/${isax}.c — tests comparing the implementation with
   the expected result.
 * test/${arch}/${isax}/compare.c — tests comparing the portable
   implementation with the "native" version, using random data for
   inputs.

The comparison test is optional, but very nice to have.  The regular
tests are required.

Hopefully it's clear what to do by using other functions in those
files as a template, but if you have trouble please feel free to
contact us; we're happy to help!

## Usage

Each instruction set has a separate file; `x86/mmx.h` for MMX,
`s`x86/se.h` for SSE, ``x86/sse2.h` for SSE2, and so on.  Just include
the header for whichever instruction set(s) you want, and SIMDe will
provide the fastest implementation it can given which extensions
you've enabled in your compiler (i.e., if you want to use NEON to
implement SSE, you'll need to pass something like `-mfpu=neon`).

Symbols are prefixed with `simde_`.  For example, the MMX
`_mm_add_pi8` intrinsic becomes `simde_mm_add_pi8`, and `__m64`
becomes `simde__m64`.

Since SIMDe is meant to be portable, many functions which assume types
are of a specific size have been altered to use fixed-width types
instead.  For example, Intel's APIs assume `int` is 32 bits, so
`simde_mm_set_pi32`'s arguments are `int32_t` instead of `int`.  On
platforms where the native API's assumptions hold (*i.e.*, if `int`
really is 32-bits) SIMDe's types should be compatible, so existing
code needn't be changed unless you're porting to a new platform.

For best performance, you should enable OpenMP 4 SIMD support by
defining `SIMDE_ENABLE_OPENMP` before including any SIMDe headers, and
enabling OpenMP support in your compiler.  GCC and ICC both support a
flag to enable only OpenMP SIMD support instead of full OpenMP (the
SIMD support doesn't require the OpenMP run-time library); for GCC the
flag is `-fopenmp-simd`, for ICC `-openmp-simd`.  SIMDe also supports
using [Cilk Plus](https://www.cilkplus.org/), [GCC loop-specific
pragmas](https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html),
or [clang pragma loop hint
directives](http://llvm.org/docs/Vectorizers.html#pragma-loop-hint-directives),
though these are not as well tested.

## Portability

### Compilers

SIMDe requires C99.

Every commit is tested with several different versions of GCC, clang,
and PGI via [Travis CI](https://travis-ci.org/nemequ/simde) on Linux.
Microsoft Visual C++ is tested on Windows using
[AppVeyor](https://ci.appveyor.com/project/quixdb/simde).  Intel C/C++
Compiler is also tested sporadically (mostly because their
optimization reports are excellent).

I'm generally willing to accept patches to add support for other
compilers, as long as they're not too disruptive, *especially* if we
can get CI support going.  Travis and AppVeyor are great, but feel
free to use whatever works.

### Hardware

Currently only x86_64, x86, and ARMv7 receive any sort of regular
testing.  If you'd like to see more thorough testing of other
architectures, please consider finding a way to integrate it into CI.
One example might be running qemu on Travis CI (or some other hosted
CI).

## Related Projects

 * The "builtins" module in
   [portable-snippets](https://github.com/nemequ/portable-snippets)
   does much the same thing, but for compiler-specific intrinsics
   (think `__builtin_clz` and `_BitScanForward`), **not** SIMD
   intrinsics.
 * Intel offers an emulator, the [Intel® Software Development
   Emulator](https://software.intel.com/en-us/articles/intel-software-development-emulator/)
   which can be used to develop software which uses Intel intrinsics
   without having to own hardware which supports them, though AFAIK it
   doesn't help for deployment.
 * I'm not aware of anyone else trying to create portable
   implementations of an instruction set, but there are a few projects
   trying to implement one set with another:
   * [ARM_NEON_2_x86_SSE](https://github.com/intel/ARM_NEON_2_x86_SSE)
     — implementing NEON using SSE.  Quite extensive, Apache 2.0
     license.
   * [sse2neon](https://github.com/jratcliff63367/sse2neon) —
     implementing SSE using NEON.  This code has already been merged
     into SIMDe.
   * [veclib](https://github.com/IvantheDugtrio/veclib) — implementing
     SSE2 using AltiVec/VMX, using a non-free IBM library called
     [powerveclib](https://www.ibm.com/developerworks/community/groups/community/powerveclib/)
   * [SSE-to-NEON](https://github.com/otim/SSE-to-NEON) — implementing
     SSE with NEON.  Non-free.
 * [arm-neon-tests](https://github.com/christophe-lyon/arm-neon-tests)
   contains tests te verify NEON implementations.

If you know of any other related projects, please [let us
know](https://github.com/nemequ/simde/issues/new)!

## Caveats

Sometime features can't be emulated.  If SIMDe is operating in native
mode the functions will work as expected, but if there is no native
support the following caveats apply:

### SSE

 * `simde_MM_SET_ROUNDING_MODE()` will use `fesetround()`, altering
   the global rounding mode.
 * `simde_mm_getcsr` and `simde_mm_setcsr` only implement bits 13 and
   14 (rounding mode).

## License

SIMDe is distributed under an MIT-style license; see COPYING for
details.