\mainsection{Introduction}\label{sec:introduction}
The SCALE system consists of three main sub-systems: an offline phase, an online phase and a compiler. Unlike the earlier SPDZ system, in SCALE the online and offline phases are fully integrated; thus you can no longer time just the online phase, or just the offline phase. We shall refer to the combined online/offline phase as SCALE; the compiler takes a program written in our special language MAMBA and turns it into byte-code which can be executed by the SCALE system. We provide switches (see below) to obtain different behaviours in how the online and offline phases are integrated together, which can allow for some form of timing approximation.

The main reason for this change is to ensure that the system is ``almost'' secure out of the box, even if it means it is less good for obtaining super-duper numbers for research papers. An issue though is that the system takes a while to warm up the offline queues before the online phase can execute. This is especially true when using a Full Threshold secret sharing scheme; indeed in this case it is likely that the online phase will run so fast that the offline phase is always trying to catch up. In addition, in this situation the offline phase needs to do a number of high-cost operations before it can even start. Thus using the system to run very small programs is going to be inefficient: the execution time you get is indicative of the total run time you should expect in a real system, it is just not going to be very impressive.

In order to enable efficient operation in the case where the offline phase is expensive (e.g.\ for Full Threshold secret sharing) we provide a mechanism to enable the SCALE system to run as a separate process (generating offline data); a MAMBA program can then be compiled in a just-in-time manner and dispatched to the SCALE system for execution. Our methodology for this is not perfect, but it has been driven by a real use case of the system. See Section \ref{sec:restart} for more details.

But note that SCALE/MAMBA is {\em an experimental research system}; there has been no effort to ensure that the system meets rigorous production-quality code control. This also means it comes with limited support. If you make changes to any files/components you are on your own. If you have a question about the system we will endeavour to answer it.
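To give a flavour of what the compiler consumes, the following is a minimal MAMBA program (a \verb+.mpc+ file) in the style of the tutorial programs shipped with the system; the language itself is described later in this document. At run time the single secret multiplication will consume one of the multiplication triples produced by the integrated offline phase.
\begin{verbatim}
# Minimal illustrative MAMBA (.mpc) program.
# The secret multiplication below consumes one multiplication
# triple produced by the integrated offline phase.
a = sint(3)                     # secret shared integer
b = sint(4)
c = a * b                       # online (Beaver) multiplication
print_ln('c = %s', c.reveal())  # open the result and print it
\end{verbatim}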
\vspace{5mm}

\noindent
{\bf Warnings:}
\begin{itemize}
\item The Overdrive system \cite{KPR} for the offline phase for full-threshold access structures requires a distributed key generation phase for the underlying homomorphic encryption scheme. The SPDZ-2 system and paper do describe such a protocol, but it is only covertly secure. A newer paper \cite{SPDZKG} presents an actively secure method to generate the keys, which is specifically tailored for the keys used in SCALE in this case. A program in the directory \verb|KeyGen| implements the protocol in this paper; see Chapter \ref{sec:keygen} for more details. In the normal execution of the \verb|Setup.x| program in the full-threshold case this protocol is not executed; instead \verb|Setup.x| internally generates a suitable key and distributes it to the different players.
\end{itemize}

\subsection{Architecture}
The basic internal runtime architecture is as follows:
\begin{itemize}
\item Each MAMBA program (\verb+.mpc+ file) will initiate a number of threads to perform the online stage. The number of ``online threads'' needed is worked out by the compiler. You can programmatically start and stop threads using the python-like language (see later). Using multiple threads enables you to get high throughput; almost all of our experimental results are produced using multiple threads.
\item Each online thread is associated with another four ``feeder'' threads: one produces multiplication triples, one produces square pairs, one produces shared bits and one produces data for input/output of data items. The chain of events is that the multiplication thread (say) produces a fully checked triple. This triple is added to a triple-list (in batches, for efficiency) for consumption by the online phase. The sizes of these lists can be controlled (somewhat) by the values in the file \verb+config.h+. One can control the number of entries {\em ever} placed on the sacrificed-list by use of the run-time flag \verb+max+.
\item By aligning the production threads with an online thread we avoid complex machinery to match producers with consumers; a plain-Python analogy of this queue layout is sketched after this list. This however may (more likely will) result in the over-production of offline data for small applications.
\item In the case of Full Threshold we have another set of global threads (called the FHE Factory threads, or the FHE Industry) which produce level-one random FHE ciphertexts which have passed the ZKPoK from the Top Gear protocol \cite{TopGear}, itself a variant of the High Gear protocol in Overdrive \cite{KPR}. This is done in a small number of global threads, as the ZKPoK would blow up memory if it were run in every thread that consumes data from the FHE Factory. These threads are numbered from 10000 upwards in the code. Any offline production thread can request a new ciphertext/plaintext pair from the FHE Factory threads. Experiments show that having two FHE Factory threads is usually optimal.
\item In the case of full-threshold access structures, or Q2 access structures generated by a generic MSP, we also implement a thread (number 20000 in the code) which implements pairwise OTs via OT-extension. This thread produces a bunch of authenticated bits in various queues, one for each online thread. A similar thread (number 20001 in the code) does the same for authenticated ANDs, i.e.\ aANDs. Each main online thread then removes aBits and aANDs from its respective queue when it needs them to produce daBits or execute the garbled circuit protocols. The first time an online thread meets an instruction which requires such data there can be a little lag as the sub-queues fill up, but this disappears on subsequent instructions (until the queues need filling again). These aBits and aANDs are used in the HSS protocol to execute binary circuits \cite{AC:HazSchSor17}.
\item For Shamir sharing or Replicated sharing instances we have a single thread 20000 which generates shared triples for AND gates. This executes a generalisation of the protocol from \cite{DBLP:conf/sp/ArakiBFLLNOWW17} for triple generation (in particular Protocol 3.1 from that paper). The generalisation is the obvious one from the three-party case to a general Q2 Replicated secret sharing. The base multiplication protocol is that of Maurer. The only reason we do not support general Q2 MSPs is that we currently have a bug in generating the associated Replicated scheme from the generic MSP.
The triples produced by this thread are then used in an online phase which is essentially the modulo-2 variant of the protocol of Smart and Wood \cite{SW18}; thus it basically uses Beaver multiplication (sketched after this list), with `checking' performed by hashing the reconstructed shares.
\end{itemize}
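The per-online-thread queue layout described in the list above can be pictured with the following plain Python analogy. This is {\em not} SCALE code (the real machinery is C++, with the list sizes bounded via \verb+config.h+), and all names in it are invented for the illustration; it merely shows why giving each online thread its own feeder queues removes any need to match producers with consumers.
\begin{verbatim}
# Plain Python analogy (NOT SCALE code) of the per-online-thread
# feeder queues: each online thread owns its own triple queue, so
# producers never need to be matched with consumers.
import threading, queue

NUM_ONLINE_THREADS = 2  # in SCALE, worked out by the compiler
MAX_LIST_SIZE = 100     # in SCALE, bounds of this kind sit in config.h

triple_queues = [queue.Queue(maxsize=MAX_LIST_SIZE)
                 for _ in range(NUM_ONLINE_THREADS)]

def feeder(q):
    """Stand-in for a triple-production thread."""
    for i in range(10):
        q.put(('a%d' % i, 'b%d' % i, 'c%d' % i))  # a checked triple
    q.put(None)                                   # toy shutdown marker

def online(q, tid):
    """Stand-in for an online thread consuming its own queue."""
    while True:
        triple = q.get()
        if triple is None:
            break
        # ... consume the triple in a Beaver multiplication ...
    print('online thread %d finished' % tid)

threads = []
for tid, q in enumerate(triple_queues):
    threads.append(threading.Thread(target=feeder, args=(q,)))
    threads.append(threading.Thread(target=online, args=(q, tid)))
for t in threads:
    t.start()
for t in threads:
    t.join()
\end{verbatim}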
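Several items above refer to Beaver multiplication, i.e.\ multiplying secret values by consuming a pre-processed triple. The following self-contained Python sketch shows the basic idea using toy two-party additive sharing over a small prime; again this is not SCALE code, just an illustration of why a random triple $(a,b,c)$ with $c = a \cdot b$ reduces an online multiplication to local arithmetic plus two openings.
\begin{verbatim}
# Toy Beaver multiplication over Z_p with two-party additive
# sharing (NOT SCALE code).  The offline phase supplies a random
# triple (a, b, c) with c = a*b; the online phase then multiplies
# x and y using only local arithmetic and two openings.
import random

p = 101  # toy prime modulus

def share(v):
    """Split v into two additive shares modulo p."""
    r = random.randrange(p)
    return [r, (v - r) % p]

def open_value(sh):
    """Reconstruct a shared value (the 'opening')."""
    return sum(sh) % p

# Offline phase: a random, fully checked triple.
a, b = random.randrange(p), random.randrange(p)
a_sh, b_sh, c_sh = share(a), share(b), share(a * b % p)

# Online phase: multiply secrets x = 7 and y = 9.
x_sh, y_sh = share(7), share(9)
epsilon = open_value([(x_sh[i] - a_sh[i]) % p for i in (0, 1)])
delta   = open_value([(y_sh[i] - b_sh[i]) % p for i in (0, 1)])

# Each party computes its share of the product locally ...
z_sh = [(c_sh[i] + epsilon * b_sh[i] + delta * a_sh[i]) % p
        for i in (0, 1)]
# ... and one party adds the public correction term.
z_sh[0] = (z_sh[0] + epsilon * delta) % p

assert open_value(z_sh) == (7 * 9) % p
\end{verbatim}
In SCALE the same pattern runs over the actual secret sharing scheme in use, with the opened values additionally checked (in the modulo-2 variant above, by hashing the reconstructed shares) so as to retain active security.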