# Goal We want to generate x86_64 code directly into RAM, fast. We don't need to be able to assemble any x86_64 instruction. I think a small and fairly regular subset will suffice. This document chooses the subset and copies the necessary data from reference manuals, noticing only 64-bit mode. ## Sources [This](https://wiki.osdev.org/X86-64_Instruction_Encoding) looks like a good overview. [This](https://www.felixcloutier.com/x86/) has encodings. [This](http://ref.x86asm.net/coder64.html) has decodings, and often reveals the patterns. [Wikipedia](https://en.wikipedia.org/wiki/X86-64#Differences_between_AMD64_and_Intel_64) has history and some compatibility notes. [Intel's version](https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf) [AMD's version](https://www.amd.com/system/files/TechDocs/24594.pdf) ## Compatibility The basic x86_64 instruction set is supported by essentially all x86 CPUs back to Intel Sandy Bridge (2011) or newer; AMD Bulldozer (2011) or newer, and all VIA CPUs from about 2011. ### Vector extensions. We won't have a requirement to vectorise loops for quite some time, but vector registers might be useful for copying bits of memory around and possibly as spill slots. I don't think we have any reason to generate different code for different CPUs; we are not yet at that level of optimization. It's not much of a burden to generate code that runs on the vast majority of recent x86-compatible CPUs, back to 2014 or even earlier. AVX is supported back to Intel Sandy Bridge (2011) or newer; AMD Bulldozer (2011) or newer, and VIA CPUs since about 2015. There are still a few CPUs in use that don't support AVX2. However, more importantly, none of Intel's low power cores (Atom onwards) support AVX, and this appears to remain their policy in 2020. I think we can't assume AVX. By asking for SSE4 instead, we can support Intel's low-power CPUs since about 2014 (Atom Z3xxx). I think there are quite a few of those about, and they're still appearing in new machines. SSE4 was introduced in dribs and drabs; all modern cores support SSE4.1 and SSE4.2 but only AMD ever implemented SSE4a, I think. By asking for only SSSE3 we can be compatible back to 2008ish (the first Intel Atoms and AMD Phenoms and Bobcats) but I doubt there are many still in use. ### CPU differences Don't use these instructions, whose behaviour varies between CPUs: - 16-bit branch offsets. - Unaligned SSE stores (loads are fine). - CMPXCHG16B ### Obsolete Don't use these, which might not be supported in newer CPUs and OSes: - x87 - MMX - 3DNow - [XOP](https://en.wikipedia.org/wiki/XOP_instruction_set) # Encoding concepts The general format of an x86 instruction is the concatenation of up to six pieces: - Legacy prefixes. The main one we need is `REX`. We need the operand-width prefix byte only to encode 16-bit stores. We need the `LOCK` prefix byte only for atomic operations. - Opcode. This is a string of 1 to 4 bytes. - `ModR/M`. This is one byte. It contains register numbers and provides some opcode bits. - `SIB`. This is one byte. It contains register numbers and a 2-bit shift. - Displacement. This is an address offset. It can be 1, 2, 4 or 8 bytes. I think we mainly want the 4-byte variant. - Immediate. This is a constant operand. It can be 1, 2, 4 or 8 bytes. We mostly want the 4-byte variant. We want the 1-byte variant for shifts, and the 8-byte variant for 64-bit `MOV`. The opcode is always present. Depending on the opcode, the `ModR/M` byte might be present. Depending on the `ModR/M` byte, the `SIB` byte might be present. The `ModR/M` byte also determines the presence and size of the displacement, whereas the opcode determines the presence and size of the immediate. ## Patterns For each instruction we want to assemble, I suggest we view the `REX` byte, the opcode bytes, and the `ModR/M` and `SIB` bytes as an indivisible string of 2 to 7 bytes, which I will call a "pattern". We can write a whole pattern into the instruction buffer as a 64-bit integer (corrupting some following bytes). The `REX` byte is always in byte 0 in the pattern, but the last two bytes could be anywhere from byte 2 to byte 7. If a displacement or immediate is required, I suggest we separately write it into the instruction buffer as an integer. ### Naming patterns Let's name the patterns with a string of the letters "ROMS" each representing one byte (`REX`, opcode, `ModR/M`, `SIB`). The letter "O" can be repeated. Please maintain this enumeration of the patterns we need: - OO - RO - ROM - ROOM - ... Note that e.g. the ROM pattern can be used with `reg` as a source and `rm` as a destination, or vice versa. Arguably these are different patterns, but I have not so far given them different names. ## `REX` I suggest we always assemble a [`REX` prefix](https://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix) on instructions that mention a register. This gives access to registers 8 to 15 and to the registers `SPL`, `BPL`, `SIL` and `DIL` (which are aliases for the low byte of `SP`, `BP`, `SI` and `DI` respectively). It also makes the encoding of register numbers more regular, and explicitly specifies whether to use 32 bit or 64-bit arithmetic. A few relevant instructions always use 64-bit operands, with or without a `REX` prefix. They are `CALL`, `JMP`, `Jcc`, `PUSH` and `POP`. There is very little downside to putting a `REX` prefix on these too, though it will only be useful for instructions that mention a register. The [encoding](https://wiki.osdev.org/X86-64_Instruction_Encoding#Encoding) is the union of the following bitmasks: - 0x40 - always (identifies a `REX` prefix) - 0x08 - W (1 means 64-bit, 0 means default, usually 32-bit) - 0x04 - R (top bit of the register number whose bottom bits are in the `reg` field of the ModR/M byte) - 0x02 - X (top bit of the register number whose bottom bits are in the `index` field of the SIB byte) - 0x01 - B (top bit of the register number whose bottom bits are in the `rm` field of the ModR/M byte or the `base` field of the SIB byte) An alternative to a `REX` prefix is a [`VEX` prefix](https://wiki.osdev.org/X86-64_Instruction_Encoding#VEX.2FXOP_opcodes); I think it requires AVX support, and I don't think we need it. ## opcode There are several possible opcode encodings. We mostly only need 1-byte opcodes. We need 2-byte opcodes for `Jcc`, `CMOVcc`, `MOVZX` and `MOVSX`. All the 2-byte opcodes have the `0x0F` in the first byte. ## Operands Instructions which only have one operand sometimes omit the `ModR/M` byte. The bottom three bits of the register number are encoded in the `rd` field (bits 0, 1 and 2) of the opcode and the top bit in the `R` bit of the `REX` byte. More often, one-operand instructions are encoded as two-operand instructions, with the bits that would otherwise be used to encode `reg` used as an opcode extension. A commonly used notation for identifying these cases is to write "opcode/reg", where "opcode" is the opcode byte and "reg" is the `reg` value. Two-operand instructions are generally encoded with the bottom three bits of the register numbers in the `reg` and `rm` fields of the `ModR/M` byte, and the top bit in the `R` and `B` bits of the `REX` byte. The second operand (`B.rm`) of instructions with a `ModR/M` byte can alternatively be a memory location. In all cases that I recommend using, the effective address used for accessing memory is the sum of three components: - `base` or `rm` (a register) - `index` (a register) shifted left by `scale` (a 2-bit immediate constant) - `displacement` (an immediate constant) `base`, `index` and `scale` all come from the `SIB` byte. If `index` is not needed, the `SIB` byte can be omitted, in which case `rm` is used instead of `base` and zero is used instead of `index`. The detailed encoding follows. ### `ModR/M` The presence of a [`ModR/M`](https://wiki.osdev.org/X86-64_Instruction_Encoding#ModR.2FM) byte is indicated by the opcode. The `ModR/M` byte has three fields: - `rm` - bits 0, 1 and 2 - bottom three bits of the register number whose top bit is in `REX.B` - `reg` - bits 3, 4 and 5 - bottom three bits of the register number whose top bit is in `REX.R` - `mod` - bits 6 and 7 - interpreted as follows provided `rm` is not 100 (`SP` or `R12`): - 00 - I recommend we only use this for program-relative addressing, i.e. with `rm` being `101`. In this case, a 32-bit displacement is added to the address of the following instruction. The other cases are a bit irregular. - 01 - word at address `rm` plus an 8-bit displacement. Perhaps we don't need this. - 10 - word at address `rm` plus a 32-bit displacement. - 11 - the register `rm` itself. Sometimes, the `reg` field is used as an opcode. When it is a register number, it can be a source or a destination. `rm` can also be a source or a destination. ### `SIB` A `SIB` byte always follows a `ModR/M` byte, I think, because the presence of a `SIB` byte is indicated by setting the `rm` field of the `ModR/M` byte to 100 (which would otherwise indicate `SP` or `R12`). The `SIB` byte has three fields: - `base` - bits 0, 1 and 2 - bottom three bits of the register number whose top bit is in `REX.B`. Note that a `SIB` byte can only be present if the `m` field of `ModR/M` is not a register number; the `base` register number essentially replaces the missing `rm` register number. - `index` - bits 3, 4 and 5 - bottom three bits of the register number whose top bit is in `REX.X`. - `scale` - bits 6 and 7 - a 2-bit shift distance. If `X.index` is 0.100 (`SP`), zero is used instead. This provides an alternative encoding of `register + displacement` which is one byte longer than the encoding without a `SIB` byte. The alternative encoding is more general, because `base` (unlike `rm`) can be `SP` or `R12`. # Registers Theoretically we have 16 general purpose registers available. In practice we have somewhere between 11 and 14, because: - `SP` and `R12` cannot be used in the `rm` field of the `ModR/M` byte, because the bit pattern 100 is used to indicate the presence of a `SIB` byte. Moreover, `SP` cannot be used as the `index` field of a `SIB` byte, because the bit pattern 100 is used to indicate a zero index. - `A` and `D` have special roles in some instructions, e.g. long multiplication, and `C` has a special role for shifting by a computed amount. These might be a good choice for temporary workspace registers, or alternatively we can teach our register allocator the constraints. ## Fixed registers Mijit needs a few "ambient" values which could compete for register space: - We need a stack pointer. I suggest we use `SP` as it has special powers. - We need to reserve a register to point to the read/writable data block. I suggest we use `R8`. Virtual machine registers, including `PC` and `IR`, are not treated specially by Mijit. ## `BP` and `R13` are not special If the `mod` field of a `ModR/M` byte is 00, then `BP` and `R13` cannot be used in certain positions, because the bit pattern 101 is used variously to indicate program-relative addressing or a zero `base` value. However, `BP` and `R13` behave exactly the same as any other register provided we avoid that `mod` value, which I recommend we do. ## Byte registers There is an alias for byte 0 of each register. For example, `AL` is the bottom byte of `A`. I'm not sure if we ever need these. There are alternative ways of loading and storing single bytes, and we don't need 8-bit arithmetic. We will never need the registers `AH`, `CH`, `BH` or `DH` (which are aliases for byte 1 of the `A`, `C`, `B` and `D` registers). ## Encoding Registers have 4-bit codes. The top bit goes in the `REX` prefix and the bottom 3 bits typically go in the `ModR/M` byte or in the `SIB` byte. The numbering is as follows: ``` Number 0 1 2 3 4 5 6 7 8 9 A B C D E F Register A C D B SP BP SI DI R8 R9R 10 R11 R12 R13 R14 R15 ``` ### Masks I suggest that for each register we construct a 64-bit *register mask* that includes the register number in all bit positions in which it might be needed. Parts of register numbers appear in various bytes, each of which may be in various positions: - The `REX` byte contains the top bit of three register numbers. - The opcode bytes (only the first?) can contain the bottom three bits of a register number: `rd` in bits 0, 1 and 2. - The `ModR/M` byte can contain the bottom three bits of two register numbers: `rm` in bits 0, 1 and 2, and `reg` in bits 3, 4 and 5. - The `sib` byte can contain the bottom three bits of two register numbers: `base` in bits 0, 1 and 2, and `index` in bits 3, 4 and 5. Here are the register masks for register numbers 1, 2, 4 and 8, which can be ORed together to obtain all the other register masks: ``` 1 - 0x09090909090900 2 - 0x12121212121200 4 - 0x24242424242400 8 - 0x00000000000007 ``` For each pattern, we must define up to three *field masks*, corresponding to the three register numbers that can be mentioned in a single x86 instruction. ANDing a field mask with a register mask leaves the register number only in the required field. For example, here are the field masks for the "RO", "ROM", "ROMS" and "ROOMS" patterns: ``` pattern RO ROM ROMS ROOMS reg N/A 0x380004 0x00380004 0x0038000004 index N/A N/A 0x38000002 0x3800000002 rd/rm/base 0x0701 0x070001 0x00070001 0x0700000001 ``` # Recommended instructions ## Move A RISC instruction set would have separate load, store and move instructions, but x86 covers all three cases with move instructions. ``` Pattern Mask Dest Source Meaning ROM 0xC08940 rm reg Move register to register ROM 0x808940 rm reg Store register in memory (disp32) ROM 0xC08B40 reg rm Move register to register ROM 0x808B40 reg rm Load register from memory (disp32) ROM 0x058B40 reg - Load register from [IP + disp32] ROM 0xC0C740 rm imm32 Move constant to register (imm32) ROM 0x80C740 rm imm32 Move constant to memory (disp32 imm32) RO 0xB840 rd imm32 Move constant to register (imm32) ``` Opcode bytes 0x89 and 0x8B can both be used to move from a register to a register; there doesn't seem to be much to choose between them. For opcode 0xC7 (move constant to register or memory), the `reg` field is unused, and must be zero. Opcode byte 0xB8 (really 0xB8 to 0xBF) is unusual, in that the destination register is part of the opcode byte. The 64-bit version of this instruction has an 8-byte immediate constant; I think it is the only such opcode. For 64-bit versions, set the `REX.W` bit. Except for opcode 0xB8, the displacements and immediates remain 4 bytes long (sign extended). ## Binary arithmetic Add instructions are encoded as follows: ``` Pattern Mask Dest Source Meaning ROM 0xC08140 rm imm32 Add constant to register (imm32) ROM 0x808140 rm imm32 Add constant to memory (disp32 imm32) ROM 0xC00140 rm reg Add register to register ROM 0x800140 rm reg Add register to memory (disp32) ROM 0xC00340 reg rm Add register to register ROM 0x800340 reg rm Add memory to register (disp32) ``` Opcode bytes 0x01 and 0x03 can both be used to add a register to a register; there doesn't seem to be much to choose between them. For 64-bit versions, set the `REX.W` bit. The displacements and the immediates remain 4 bytes long (sign extended). Other binary arithmetic instructions follow the same basic pattern. It suffices to give, for each version of each arithmetic operation, the value of the opcode byte (byte 1 of the pattern) and, in some cases, the opcode extension in the `reg` field. In general, the following versions exist: - Combine a signed immediate constant with `rm`, and write `rm`. - Combine `reg` with `rm` and write `rm`. - Combine `rm` with `reg` and write `reg`. ``` Operation ADD OR AND SUB XOR CMP const to rm 0x81/0 0x81/1 0x81/4 0x81/5 0x81/6 0x81/7 reg to rm 0x01 0x09 0x21 0x29 0x31 0x39 rm to reg 0x03 0x0B 0x23 0x2B 0x33 0x3B ``` Note that opcode byte 0x81 has multiple meanings, disambiguated by a number after a "/". I think this must mean that the `reg` field of the `ModR/M` byte is used as an opcode extension, i.e. `reg` must be set to the number after the "/". I note also that the number after the "/" in the opcode for the version with an immediate constant matches bits 3, 4 and 5 of the opcode for the other two versions. The `CMP` instruction does the same as `SUB` but only sets the processor status flags, leaving the destination register unmodified. Note that the carry flag is *set* to indicate a borrow; this convention differs from other architectures. Addition can also be done using the `LEA` instruction, which allows a destination register number to be specified in addition to two operand register numbers, but which does not allow an operand to be in memory. This might be useful later, as an optimization, but for now let's use a preceding move instruction if necessary. ## Shifts Rotate left instructions are encoded as follows: ``` Pattern Mask Dest Source Meaning ROM 0xC0C140 rm imm8 Rotate register left by constant (imm8) ROM 0x80C140 rm imm8 Rotate memory left by constant (imm8) ROM 0xC0D340 rm C Rotate register left by register ROM 0x80D340 rm C Rotate memory left by register (disp32) ``` The shift amount can be a 1-byte constant, or it can be the value of the C register. Only the bottom 5 bits of the shift amount are used; the other bits are masked to zero. For 64-bit versions, set the `REX.W` bit. Only the bottom 6 bits of the shift amount are used. Note that if the shift amount cannot be in any register but C; this is an irregularity in the instruction set. Other shift operations follow the same basic pattern. It suffices to give, for each version of each shift operation, the value of the opcode byte (byte 1 of the pattern) and the opcode extension in the `reg` field. ``` Operation ROL ROR SAL/SHL SHR SAR rm by const 0xC1/0 0xC1/1 0xC1/4 0xC1/5 0xC1/7 rm by C 0xD3/0 0xD3/1 0xD3/4 0xD3/5 0xD3/7 ``` Shifts can also be done using the `LEA` instruction, which allows a destination register number to be specified in addition to a source operand register number. The shift amount must be constant, and must be 0, 1, 2 or 3. This might be useful later, as an optimization, but for now let's use a preceding move instruction if necessary. ## Multiplication ``` Pattern Mask Dest Source Meaning ROOM 0xC0AF0F40 reg rm Multiply register by register ROOM 0x80AF0F40 reg rm Multiply register by memory (disp32) ROM 0xC06940 reg rm Multiply register by constant (imm32) ROM 0xE0F740 (D, A) rm Unsigned long multiply A by register ROM 0xE8F740 (D, A) rm Signed long multiply A by register ``` For 64-bit versions, set the `REX.W` bit. The immediates remain 4 bytes long (sign extended). The ordinary multiplication instructions are similar to other arithmetic instructions; they can use any register as a source or destination. Multiplication by a constant is more flexible; the source and destination can be different registers. Long multiplication is less flexible; register A is always a source, and registers A and D are the destinations. The low word of the result is written to register A and the high word to register D. ## Division ``` Pattern Mask Dest Source Meaning ROM 0xF0F740 (D, A) rm Unsigned long divide by register ROM 0xB0F740 (D, A) rm Unsigned long divide by memory (disp32) ROM 0xF8F740 (D, A) rm Signed long divide by register ROM 0xB8F740 (D, A) rm Signed long divide by memory (disp32) ``` For 64-bit versions, set the `REX.W` bit. The 32-bit versions divide a 64-bit numerator by a 32-bit denominator. The 64-bit versions divide a 128-bit numerator by a 64-bit denominator. Registers D and A hold respectively the high and low word of the numerator. The `rm` operand holds the denominator: a single word. The quotient is returned in A and the remainder in D: both single words. For signed division, the quotient is rounded towards zero. The remainder is calculated accordingly. Overflow or division by zero causes a hardware exception. Avoid it by explicitly checking for overflow before dividing. ## Conditional branch The `JO` and `JNO` instructions are encoded as follows: ``` Pattern Mask Target Meaning OO 0x800F imm32 Add a constant to RIP if the overflow flag is set OO 0x810F imm32 Add a constant to RIP if the overflow flag is clear ``` Do not add a `REX` prefix. It is redundant. Other conditional branch instructions follow the same basic pattern. It suffices to give bits 9, 10 and 11 of the opcode; bit 8 always inverts the condition. ``` Condition Nibble Meaning O 0 Overflow flag is set B = C = NAE 2 ULT: carry/borrow flag is set Z 4 EQ: zero flag is set BE = NA 6 ULE carry or zero is set S 8 Sign flag is set P A Parity flag is set L = NGE C LT: Sign and overflow differ LE = NG E LE: Sign and overflow differ, or zero ``` ## Conditional moves. ``` Pattern Mask Dest Source Meaning ROOM 0x80400F40 reg rm Load from memory if overflow set (disp32) ROOM 0xC0400F40 reg rm Move from register if overflow is set ROOM 0x80410F40 reg rm Load from memory if overflow clear (disp32) ROOM 0xC0410F40 reg rm Move from register if overflow is clear ``` Other conditional branch instructions follow the same basic pattern. It suffices to give bits 9, 10 and 11 of the opcode; bit 8 always inverts the condition. The conditions are the same as for conditional branches. For 64-bit versions, set the `REX.W` bit. Note that the 32-bit versions set the upper 32 bits of the destination to zero whether or not the condition is satisfied. ## JMP, CALL and RET ```` Pattern Mask Target Meaning RO 0xE940 imm32 Add a constant to RIP (i.e. direct jump) (imm32) ROM 0xE0FF40 rm Set RIP to register (i.e. indirect jump) ROM 0xA0FF40 rm Set RIP to memory (i.e. indirect jump) (disp32) RO 0xE840 imm32 Push RIP then JMP imm (i.e. direct call) (imm32) ROM 0xD0FF40 rm Push RIP then JMP reg (i.e. indirect call) ROM 0xB0FF40 rm Push RIP then JMP mem (i.e. indirect call) (disp32) RO 0xC340 Pop RIP. ```` For opcodes `0xC3`, `0xE8` and `0xE9`, the `REX` prefix is pointless, but I have included it. Do not set the `REX.W` prefix. It changes the meaning of opcodes `0xFF/2` and `0xFF/4`. ### Calling conventions According to [Wikipedia](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions) there are two main 64-bit calling conventions in use. I'm not sure how much to trust Wikipedia. Linux uses the System V AMD64 ABI: - Arguments: - The first six integer or pointer arguments are passed in `RDI`, `RSI`, `RDX`, `RCX`, `R8` and `R9`. - The first eight floating-point arguments are passed in `XMM0` to `XMM7`. - Additional arguments are pushed on the stack, rightmost first. - The return address is pushed immediately after the arguments. - The 128 bytes just beyond (below) SP can be used and "will not be clobbered by any signal or interrupt handlers". If you need more space, adjust SP. - Return value: - If the return value is up to two integer or pointer values, they are returned in RAX and RDX. - If the return value is up to two floating-point values, they are returned in XMM0 and XMM1. - Otherwise, the return value is stored in caller-allocated memory whose address is passed as an extra (first) argument. - Registers RBX, RBP and R12 to R15 are callee-saves, and I guess RSP counts as callee-saves too; all other registers (i.e. R10, R11 and unused argument and return registers) are caller-saves. The Microsoft calling convention is different. It uses different registers, and different numbers of registers. It replaces structs with their address. It pushes four unused words between the arguments and the return address. As far as I can see, there is no subset of C function signatures for which the calling conventions are compatible, except for functions with no arguments and up to one integer return value. ## PUSH and POP ``` Pattern Mask Operand Meaning ROM 0x86FF40 rm Push a value from memory RO 0x5040 rd Push a register RO 0x6840 imm32 Push a constant (imm32) ROM 0x808F40 rm Pop a value into memory RO 0x5840 rd Pop a register ``` For opcode `0x68`, the `REX` prefix is pointless, but I have included it. In 64-bit mode, there are no 32-bit pushes or pops, and `REX.W` is ignored ([this page](https://www.felixcloutier.com/x86/push) is wrong). Immediate constants are 32-bit, and are sign-extended. "Push" means subtract 8 from SP, then store the source value at SP. If pushing SP, it is read before it is decremented. "Pop" means load the destination value from SP then add 8 to SP. If popping SP, it is written after it is incremented. If using SP as a base address for specifying a memory location, RTFM. ## Rare operations ### Narrow data The "traditional" way of encoding 8-bit memory accesses is to use the 8-bit variant of the opcode (most operations have one). The traditional way to encode 16-bit memory accesses is to prepend a prefix byte of `0x66`. These encodings affect the width used for arithmetic, which we don't want, and preserve the upper bits of the destination register, which we also don't want. I suggest using this encoding only for store instructions. Instead, use the `MOVZX` and `MOVSX` instructions are available for loading 8- and 16-bit words (two opcodes each). The `MOVSX` instructions do sign extension. There is also a `MOVSXD` instruction for sign-extending a 32-bit word. ``` Pattern Mask Dest Source Meaning FIXME: Missing registers in the next two instructions. ROM 0x808840 rm reg Store 8 bits to memory (disp32) ROM 0x808940 rm reg Store 16 or 32 bits to memory (disp32) ROOM 0xC0B60F40 reg rm Zero-extend 8 bits from register ROOM 0x80B60F40 reg rm Zero-extend 8 bits from memory (disp32) ROOM 0xC0B70F40 reg rm Zero-extend 16 bits from register ROOM 0x80B70F40 reg rm Zero-extend 16 bits from memory (disp32) ROM 0xC08B40 reg rm Zero-extend 32 bits from register ROM 0x808B40 reg rm Zero-extend 32 bits from memory (disp32) ROOM 0xC0BE0F40 reg rm Sign-extend 8 bits from register ROOM 0x80BE0F40 reg rm Sign-extend 8 bits from memory (disp32) ROOM 0xC0BF0F40 reg rm Sign-extend 16 bits from register ROOM 0x80BF0F40 reg rm Sign-extend 16 bits from memory (disp32) ROM 0xC06348 reg rm Sign-extend 32 bits from register ROM 0x806348 reg rm Sign-extend 32 bits from memory (disp32) ``` To store 16 bits to memory, assemble an operand-width prefix byte (`0x66`) before the instruction. For loads into 64-bit registers, set the `REX.W` bit. This only affects the width of the destination register. The displacements remain 4 bytes long (sign extended). Zero-extending a 32-bit value to 64 bits is equivalent to a 32-bit move instruction (`0x8B`), because 32-bit instructions clear the upper 32 bits of the destination register. The `REX.W` bit must be clear for this to work. In contrast, sign-extending a 32-bit value to 64 bits requires its own opcode (`0x63`). The `REX.W` bit must be set for this to work. ### Atomic operations I anticipate wanting atomic operations for updating reference counts in Welly. The [`LOCK` prefix](https://wiki.osdev.org/X86-64_Instruction_Encoding#LOCK_prefix) can be put on read-modify-write instructions to make them atomic. I suggest we special-case the only cases we need (adding or subtracting a constant from a word in memory) and do not bother to generalise everything else. ``` LOCK 0xF0 ```