\mainsection{Introduction}\label{sec:introduction}
The SCALE system consists of three main sub-systems: an offline phase, an online phase and a compiler. Unlike the earlier SPDZ system, in SCALE the online and offline phases are fully integrated; thus you can no longer time just the online phase, or just the offline phase. We shall refer to the combined online/offline phase as SCALE; the compiler takes a program written in our special language MAMBA and turns it into byte-code which can be executed by the SCALE system. We provide switches (see below) to obtain different behaviours in how the online and offline phases are integrated together, which can allow for some form of timing approximation.

The main reason for this change is to ensure that the system is ``almost'' secure out of the box, even if it means it is less good for obtaining super-duper numbers for research papers. An issue though is that the system takes a while to warm up the offline queues before the online phase can execute. This is especially true when using a Full Threshold secret sharing scheme; indeed in this case it is likely that the online phase will run so fast that the offline phase is always trying to catch up. In addition, in this situation the offline phase needs to do a number of high-cost operations before it can even start. Thus using the system to run very small programs is going to be inefficient: the execution time you get is indicative of the total run time you should expect in a real system, it is just not going to be very impressive.

In order to enable efficient operation in the case where the offline phase is expensive (e.g.\ for Full Threshold secret sharing) we provide a mechanism to enable the SCALE system to run as a separate process (generating offline data); a MAMBA program can then be compiled in a just-in-time manner and dispatched to the SCALE system for execution. Our methodology for this is not perfect, but it has been driven by a real use case of the system. See Section \ref{sec:restart} for more details.

But note that SCALE/MAMBA is {\em an experimental research system}; there has been no effort to ensure that the system meets rigorous production-quality code control. This also means it comes with limited support. If you make changes to any files/components you are on your own. If you have a question about the system we will endeavour to answer it.
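To give a flavour of what the compiler consumes, the following is a minimal MAMBA program (a \verb+.mpc+ file) in the style of the tutorial programs shipped with the system; the language itself is described later in this document. At run time the single secret multiplication will consume one of the multiplication triples produced by the integrated offline phase.
\begin{verbatim}
# Minimal illustrative MAMBA (.mpc) program.
# The secret multiplication below consumes one multiplication
# triple produced by the integrated offline phase.
a = sint(3)                     # secret shared integer
b = sint(4)
c = a * b                       # online (Beaver) multiplication
print_ln('c = %s', c.reveal())  # open the result and print it
\end{verbatim}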
\vspace{5mm}

\noindent
{\bf Warnings:}
\begin{itemize}
\item The Overdrive system \cite{KPR} for the offline phase for full-threshold access structures requires a distributed key generation phase for the underlying homomorphic encryption scheme. The SPDZ-2 system and paper do describe such a protocol, but it is only covertly secure. A newer paper \cite{SPDZKG} presents an actively secure method to generate the keys, which is specifically tailored for the keys used in SCALE in this case. A program in the directory \verb|KeyGen| implements the protocol in this paper; see Chapter \ref{sec:keygen} for more details. In the normal execution of the \verb|Setup.x| program in the full-threshold case this protocol is not executed; instead \verb|Setup.x| internally generates a suitable key and distributes it to the different players.
\end{itemize}

\subsection{Architecture}
The basic internal runtime architecture is as follows:
\begin{itemize}
\item Each MAMBA program (\verb+.mpc+ file) will initiate a number of threads to perform the online stage. The number of ``online threads'' needed is worked out by the compiler. You can programmatically start and stop threads using the python-like language (see later). Using multiple threads enables you to get high throughput; almost all of our experimental results are produced using multiple threads.
\item Each online thread is associated with another four ``feeder'' threads: one produces multiplication triples, one produces square pairs, one produces shared bits and one produces data for input/output of data items. The chain of events is that the multiplication thread (say) produces a fully checked triple. This triple is added to a triple-list (in batches, for efficiency) for consumption by the online phase. The sizes of these lists can be controlled (somewhat) by the values in the file \verb+config.h+. One can control the number of entries {\em ever} placed on the sacrificed-list by use of the run-time flag \verb+max+.
\item By aligning the production threads with an online thread we avoid complex machinery to match producers with consumers; a plain-Python analogy of this queue layout is sketched after this list. This however may (more likely will) result in the over-production of offline data for small applications.
\item In the case of Full Threshold we have another set of global threads (called the FHE Factory threads, or the FHE Industry) which produce level-one random FHE ciphertexts which have passed the ZKPoK from the Top Gear protocol \cite{TopGear}, itself a variant of the High Gear protocol in Overdrive \cite{KPR}. This is done in a small number of global threads, as the ZKPoK would blow up memory if it were run in every thread that consumes data from the FHE Factory. These threads are numbered from 10000 upwards in the code. Any offline production thread can request a new ciphertext/plaintext pair from the FHE Factory threads. Experiments show that having two FHE Factory threads is usually optimal.
\item In the case of full-threshold access structures, or Q2 access structures generated by a generic MSP, we also implement a thread (number 20000 in the code) which implements pairwise OTs via OT-extension. This thread produces a bunch of authenticated bits in various queues, one for each online thread. A similar thread (number 20001 in the code) does the same for authenticated ANDs, i.e.\ aANDs. Each main online thread then removes aBits and aANDs from its respective queue when it needs them to produce daBits or execute the garbled circuit protocols. The first time an online thread meets an instruction which requires such data there can be a little lag as the sub-queues fill up, but this disappears on subsequent instructions (until the queues need filling again). These aBits and aANDs are used in the HSS protocol to execute binary circuits \cite{AC:HazSchSor17}.
\item For Shamir sharing or Replicated sharing instances we have a single thread 20000 which generates shared triples for AND gates. This executes a generalisation of the protocol from \cite{DBLP:conf/sp/ArakiBFLLNOWW17} for triple generation (in particular Protocol 3.1 from that paper). The generalisation is the obvious one from the three-party case to a general Q2 Replicated secret sharing. The base multiplication protocol is that of Maurer. The only reason we do not support general Q2 MSPs is that we currently have a bug in generating the associated Replicated scheme from the generic MSP.
The triples produced by this thread are then used in an online phase which is essentially the modulo-2 variant of the protocol of Smart and Wood \cite{SW18}; thus it basically uses Beaver multiplication (sketched after this list), with `checking' performed by hashing the reconstructed shares.
\end{itemize}
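The per-online-thread queue layout described in the list above can be pictured with the following plain Python analogy. This is {\em not} SCALE code (the real machinery is C++, with the list sizes bounded via \verb+config.h+), and all names in it are invented for the illustration; it merely shows why giving each online thread its own feeder queues removes any need to match producers with consumers.
\begin{verbatim}
# Plain Python analogy (NOT SCALE code) of the per-online-thread
# feeder queues: each online thread owns its own triple queue, so
# producers never need to be matched with consumers.
import threading, queue

NUM_ONLINE_THREADS = 2  # in SCALE, worked out by the compiler
MAX_LIST_SIZE = 100     # in SCALE, bounds of this kind sit in config.h

triple_queues = [queue.Queue(maxsize=MAX_LIST_SIZE)
                 for _ in range(NUM_ONLINE_THREADS)]

def feeder(q):
    """Stand-in for a triple-production thread."""
    for i in range(10):
        q.put(('a%d' % i, 'b%d' % i, 'c%d' % i))  # a checked triple
    q.put(None)                                   # toy shutdown marker

def online(q, tid):
    """Stand-in for an online thread consuming its own queue."""
    while True:
        triple = q.get()
        if triple is None:
            break
        # ... consume the triple in a Beaver multiplication ...
    print('online thread %d finished' % tid)

threads = []
for tid, q in enumerate(triple_queues):
    threads.append(threading.Thread(target=feeder, args=(q,)))
    threads.append(threading.Thread(target=online, args=(q, tid)))
for t in threads:
    t.start()
for t in threads:
    t.join()
\end{verbatim}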
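Several items above refer to Beaver multiplication, i.e.\ multiplying secret values by consuming a pre-processed triple. The following self-contained Python sketch shows the basic idea using toy two-party additive sharing over a small prime; again this is not SCALE code, just an illustration of why a random triple $(a,b,c)$ with $c = a \cdot b$ reduces an online multiplication to local arithmetic plus two openings.
\begin{verbatim}
# Toy Beaver multiplication over Z_p with two-party additive
# sharing (NOT SCALE code).  The offline phase supplies a random
# triple (a, b, c) with c = a*b; the online phase then multiplies
# x and y using only local arithmetic and two openings.
import random

p = 101  # toy prime modulus

def share(v):
    """Split v into two additive shares modulo p."""
    r = random.randrange(p)
    return [r, (v - r) % p]

def open_value(sh):
    """Reconstruct a shared value (the 'opening')."""
    return sum(sh) % p

# Offline phase: a random, fully checked triple.
a, b = random.randrange(p), random.randrange(p)
a_sh, b_sh, c_sh = share(a), share(b), share(a * b % p)

# Online phase: multiply secrets x = 7 and y = 9.
x_sh, y_sh = share(7), share(9)
epsilon = open_value([(x_sh[i] - a_sh[i]) % p for i in (0, 1)])
delta   = open_value([(y_sh[i] - b_sh[i]) % p for i in (0, 1)])

# Each party computes its share of the product locally ...
z_sh = [(c_sh[i] + epsilon * b_sh[i] + delta * a_sh[i]) % p
        for i in (0, 1)]
# ... and one party adds the public correction term.
z_sh[0] = (z_sh[0] + epsilon * delta) % p

assert open_value(z_sh) == (7 * 9) % p
\end{verbatim}
In SCALE the same pattern runs over the actual secret sharing scheme in use, with the opened values additionally checked (in the modulo-2 variant above, by hashing the reconstructed shares) so as to retain active security.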