TOYF processor design --- by Hugh Aguilar --- copyright November 2018
contact: hughaguilar96@gmail.com


Abstract:

The TOYF (Toy Forth) processor is somewhat similar to the MiniForth at Testra that is also 16-bit Harvard (built on the Lattice isp1048 PLD).
The MiniForth was later upgraded to an FPGA and the name changed to RACE.
I wrote the MFX, the development system for the RACE --- the TOYF development system is based on MFX.
The TOYF has a maximum of 125 instructions. At this time however, 9 are undefined. These are available for application-specific uses.
The TOYF does out-of-order execution, so as many as three instructions can execute in one clock cycle.
The opcode fields are:  6-bit group-A (32-bit ALU), 5-bit group-M (16-bit ALU), 5-bit group-M (no ALU).
The TOYF has only a crude IRQ facility that is mostly provided to support a UART.
In general, you would program the TOYF with a paced-loop.
The TOYF has support for multiplication and division. It could be used for motion-control.
The TOYF has support for 1-bit data, so it could be used as a PLC (Programmable Logic Controller).
The TOYF has a RND instruction. It could be used for a video-game machine. It would need a coprocessor to do the audio though.
The TOYF might need to be ponied up, so it would have a coprocessor (such as a C8051) that could also do sound.
A language standard limits us to a few primitives. TOYF has an assembly-language though, so there can be a lot of primitives.
Classic Forth processors (B16, RTX2000, BUGS18, etc.) have a very few primitives. The TOYF can have hundreds of primitives!
The TOYF also supports local variables and quotations --- the only other Forth processor with locals is the MiniForth/RACE.
The TOYF quotations can access the parent function's locals, despite the fact that the HOF has locals of its own.



Section 1.)  the architecture

We have the following registers (those that access data-memory have an Rx alternate name):
CF           1-bit      carry flag                  CF should be 0 prior to SBC of low bytes (like the MC6800 not the 6502)
VF           1-bit      overflow flag
IF           1-bit      IRQ flag                    this masks IRQs --- a 1 blocks IRQs --- it is 1 on start-up
IV           2-bit      zero or ISR vector          high-byte is $FF, low-byte is xx000000 --- this is the ISR vector
PC          16-bit      program counter             this is the machine-code (the only register that points to code-memory)
AX  R0      16-bit      general purpose
BX  R1      16-bit      general purpose             this is the top value of the data-stack
CX  R2      16-bit      used in countdown loops and list loops; also multiplication and division
DX  R3      16-bit      general purpose             mostly used for multiplication and division, and sometimes as the second value of the data-stack
EX  R4      16-bit      used in list loops; also multiplication and division
IP  R5      16-bit      instruction pointer         this is the Forth threaded-code
HF  R6       5-bit      local frame pointer of HOF  high-byte is $00, low-byte is 010xxxxx --- this holds LF during a quotation
LF  R7       5-bit      local frame pointer         high-byte is $00, low-byte is 010xxxxx --- this is the local-frame pointer
RS  R8       5-bit      return stack pointer        high-byte is $00, low-byte is 010xxxxx --- this is the return-stack pointer
SS  R9       5-bit      single stack pointer        high-byte is $00, low-byte is 011xxxxx --- this is the data-stack for single-cell data
LX  RA      16-bit      primarily for use by LDR    most memory reads use LX as the destination
MA  RB      16-bit      memory access pointer       all memory loads and stores use MA as the address

The IV register is normally zero, but if there is an IRQ (Interrupt ReQuest), it is the address of the ISR (Interrupt Service Routine)
of the highest-priority IRQ that is pending. The ISR is a colon word that ends in ISR_EXIT (like EXIT but also does XOR 1,IF to unmask IRQs).
The IF mask is set on start-up and XOR 1,IF is used after initialization to enable interrupts.
We don't have CLI and SEI like in most processors. The programmer knows at compile-time whether XOR 1,IF is masking or unmasking.

The processor is Harvard Architecture. Code-memory and data-memory can be accessed concurrently in the same clock cycle.
Both have a 16-bit data-bus and are word addressable. Code-memory has a 15-bit address-bus and data-memory has a 16-bit address-bus.
I/O ports and important global variables can be in the lower 64 words of zero-page. The two stacks are above that.
An FPGA that has 128 words (256 bytes) of RAM internal will include all of the above in RAM.
Presumably the code-memory (32KW max) would be internal to the FPGA so that it is fast.

Forth processors that do Forth instructions such as DUP SWAP + etc. in one clock cycle, typically have the entire data-stack
in registers. This allows them to push and pull data to the data-stack in parallel with the operation, such as addition for + etc..
This requires the data-stack to be small, such as 8 elements. This can result in stack-overflow causing hard-to-find bugs.
Because they don't support local variables, they need all their data on the data-stack, or in global variables.
This isn't a bad design for small programs, but it is error-prone in large programs.
This is especially error-prone if code-libraries are used, so you don't know how deep into the data-stack your sub-functions go.
I would not use such a processor unless the program was quite small and could be analyzed thoroughly for data-stack usage.
The TOYF provides 32 elements for data-stack and return-stack each. This is more than enough for any practical program.
It is possible that some recursive programs could overflow one or the other stack, but this is very unlikely.
The TOYF uses BX as TOS. The TOYF uses DX as SOS sometimes (I would predict about 1/3 of the time). This helps the speed a lot.
Registers use a lot of FPGA resources --- the TOYF keeps the data-stack in memory to allow a smaller FPGA to be used.

There are no instructions that access two adjacent memory locations, so the processor is neither little-endian or big-endian.
As a convention though, little-endian should be used, except for the Forth data stack that is big-endian traditionally.

Code-memory has a 15-bit address-bus (32KW). The cfa has the high-bit clear for primitives, so 32KW is all we can address.
Data-memory has a 16-bit address-bus (64KW) and a 16-bit data-bus, and the data registers are all 16-bit.
The opcodes are 16-bit wide. We have group-A (64 instructions), group-B (32 instructions) and group-M (32-instructions).
If the opcode were upgraded to 18-bit wide, group-B and group-M could each be upgraded to 64 instructions.
This is a possibility for a super version of the TOYF --- this might be useful for an application-specific FPGA.

Data-memory has a 16-bit data-bus (64KW). We address words not bytes! ADD 1,BD advances BX and DX to the next 16-bit words.
Forth threaded-code must be in the upper 32KW because the cfa has the high-bit set for colon words.

Code-memory and data-memory are 32KW and 64KW respectively --- or 64KB and 128KB in terms of bytes.

Each opcode has fields, with an instruction in each field. All of these instructions execute concurrently in one clock cycle.
The assembler rearranges the instructions and packs them into the fields inserting NOP instructions as necessary.
The goal is to minimize how many NOP instructions get inserted while guaranteeing that the program does the same thing
as it would if we had one instruction per opcode in the same order that the instructions appeared in the source-code.

The opcode has these fields:
MMMMMAAAAAABBBBB
FEDCBA9876543210        these are the bit positions of the fields

There are three fields: MMMMM AAAAAA BBBBB  We are executing up to three instructions per clock cycle.
To help the assembler pack instructions into opcodes efficiently (minimum NOP instructions inserted),
the programmer should hold data in a register for a while before using that register.
Don't modify a register and then immediately use that register for something.
Try to interleave instructions that modify different registers as these instructions can parallelize.
This is very similar to Pentium programming in which UV-either and V-specific instructions were interleaved.
The TOYF is more complicated though because there are three "pipes" (opcode fields) rather than just two.
The TOYF arranges the out-of-order execution at compile-time though, whereas the Pentium does this at run-time.

Not counting NOP, there are 31 MMMMM instructions, 63 AAAAAA instructions, and 31 BBBBB instructions.
So, the TOYF has 125 instructions total, which is a lot, but this shouldn't make implementation unreasonably difficult.
The TOYF has simple instructions that execute in one clock cycle, rather than complicated multi-cycle instructions.

An FPGA can support an 18-bit opcode. The TOYF is currently only using a 16-bit opcode.
It might be possible to upgrade to an 18-bit opcode and get a 2-bit field in addition to our group-A, group-B and group-M.
I can't think of any general-purpose use for this, but there might be an application-specific use.
We also have a lot of undefined instructions in group-A and group-B that could have some application-specific use.

Upon start-up, IP is set to zero. Use BCS to initialize it.
Upon start-up, PC is set to 3. This is the start-up code just past the NEXT code.
Upon start-up, RS and SS are set to zero. They can't be modified except incremented and decremented.


Section 2.)  the instruction set

There are no addressing modes. All of the instructions are inherent. None of the opcodes have an operand.
Most processors have an addressing-mode in which the operand is the address in memory to access. TOYF doesn't have that.

The assembler packs the instructions together into the opcodes. TOYF has out-of-order execution.
The assember is single-pass, which is typical of Forth assemblers.
To assemble the current instruction, the following rules are followed:

1.) The current instruction has to be assembled after any label that appeared previously in the source-code.
2.) The current instruction has to be assembled after any instruction that modifies a register(s) that it uses.
3.) The current instruction has to be assembled after any instruction that modifies a register(s) that it modifies.
4.) The current instruction has to be concurrent with or after any instruction that uses a register(s) that it modifies.

To assemble the current instruction, this is done:
A.) Search back into the machine-code as far as possible to find an opcode that meets the four rules above.
B.) Starting at that opcode, search forward to find an opcode with an empty group-field appropriate for our instruction.
C.) If there is no empty slot available, then append a new opcode onto the end of the machine-code
D.) Assemble our current instruction in the opcode we have obtained above.

If we have ADC LX,BX already assembled, then a MOV 0,CF would have to assemble in an opcode after it.
Rule #3 says an instruction that modifies a register (CF) has to be after a previous instruction that modifies that register.

If we have MOV 0,AX already assembled, then a MOV 0,CF could assemble anywhere. The MOV 0,AX is irrelevant.
It could assemble prior to the MOV 0,AX or in the same opcode or in an opcode afterward.
So long as the four rules are followed, instructions may be assembled out of order to how they appeared in the source-code.

If we have SUM BX,AX already assembled, then a MOV 0,CF could assemble in the same opcode or in an opcode afterward.
The MOV 0,CF could not assemble in an opcode prior to the SUM BX,AX however, because that would violate rule #4.
Rule #4 says that the current instruction (MOV 0,CF) has to assemble after or with any instruction that uses CF.
If an instruction appears in the source-code after any instruction that could modify PC (BCS, NXT, NXP, etc.),
then it will not pack with or before that PC-modifier. This means that code can go after the BCS and before the NXT.
If the assembler were multi-pass, then rule #3 could be modified to allow the current instruction to be assembled prior to
any instruction that modifies any register(s) that it modifies if no other instruction uses that register.
This might allow for better packing in some circumstances. Writing a multi-pass assembler is above my pay grade however.
The single-pass assembler provides pretty good packing. This is all that I did at Testra when I wrote MFX for the MiniForth.
I was getting paid $10/hour at Testra. For that, you get a single-pass assembler that follows simple rules.

The assembly-language programmer doesn't have to worry about any of this out-of-order complexity.
You just assume that your program does the same thing as if the instructions were assembled one per opcode in the same order
as they appeared in your source-code. Don't let the out-of-order execution freak you out! The complexity is under the hood.

This is an example primitive:

DX_MINMAX:          ; a b -- max            ; sets DX= min
    xch BX,DX       ; DX= b
    pls 
    ldr LX          ; LX= a
    mov LX,BX       ; BX= a
    mov DX,AX       ; AX= b
    mov 0,CF        ; the TOYF's SBC needs CF set to 0 as done in the MC6800, not to 1 as done in the popular 6502
    sbc AX,BX       ; CF= b(AX,DX) > a(BX)              ; CF indicates that b(DX) > a(BX)
    not CF          ; CF= b(AX,DX) <= a(BX)             ; CF indicates that b(DX) is the minimum and a(BX) is the maximum
    mov LX,BX       ; BX= a                             ; the SBC AX,BX screwed up BX, so BX has to be restored
    mov 1,AX        ; AX= offset to branch to next primitive            
    add IP,AX
    bcs             ; if BX is maximum and DX is minimum, then we are done
SWAP_DX:            ; a -- b        ; needs DX=b, sets DX= a
    xch BX,DX
    nxt

                group-A     group-B     group-M
1st opcode:                 XCH BX,DX   PLS
2nd opcode:     MOV DX,AX   MOV 0,CF    LDR LX          ; MOV 0,CF moved all the way back here becaused it has no dependencies
3rd opcode:                 MOV LX,BX
4th opcode:     MOV 1,AX    SBC AX,BX                   ; it is legal to use a register and store into that register concurrently
5th opcode:     ADD IP,AX   NOT CF
6th opcode:                 MOV LX,BX   BCS             ; BCS will terminate the primitive if BX>=DX
7th opcode:                 XCH BX,DX   NXT             ; swap BX and DX so BX>=DX then terminate

Our DX_MINMAX above had 14 instructions that packed into 7 opcodes, so it came out to an average of 2 instructions per opcode.
It is typical to get about 2 instructions per opcode in primitives.
Some primitives are heavy on one particular group of instructions and don't pack this efficiently, but others get better efficiency.

In the following instruction descriptions, the | separates events that are done concurrently.

These are group-A instructions that modify AX CX IF:
    nop
    xor 1,IF        ; move 1 -xor- IF to IF                             ; toggle the IF mask
    mov LX,AX       ; move LX to AX
    mov BX,AX       ; move BX to AX
    mov DX,AX       ; move DX to AX
    mov EX,AX       ; move EX to AX
    sum EB,CA       ; if low-bit of DX, then move CX:AX + EX:BX to CX:AX        ; used in multiplication
    vng AX          ; if VF then move 0 - AX to AX                      ; used by SMULT to fix the sign afterward
    vng CA          ; if VF then move 0 - CX:AX to CX:AX                ; used by DX_SMULT32 to fix the sign afterward
    shl CA          ; shift CX:AX left 1 bit                            ; used in division to shift numerator left
    sub DB,CA       ; move CX:AX - DX:BX to CX:AX                       ; used in division to do trial subtraction 
    cad DB,CA       ; if CF then move CX:AX + DX:BX to CX:AX            ; used in division to undo trial subtraction
    and CF,AX       ; move AX<>0 -and- CF to AX                         ; bit-0 is the result, and the rest are zero'd
    ior CF,AX       ; move AX<>0 -ior- CF to AX                         ; bit-0 is the result, and the rest are zero'd
    xor CF,AX       ; move AX<>0 -xor- CF to AX                         ; bit-0 is the result, and the rest are zero'd
    set CF,LX,AX    ; if CF then  move LX -ior- AX to AX  else  move LX -and- ~AX to AX     ; AX should be the bit-mask and LX the datum
    cng AX          ; if CF then move 0 - AX to AX
    sh2 0,AX        ; shift AX left 2 times, move %00 into low-bits     ; also called: SHL 2,AX     ; this multiplies AX by 4
    sh2 1,AX        ; shift AX left 2 times, move %01 into low-bits
    sh2 2,AX        ; shift AX left 2 times, move %10 into low-bits
    sh2 3,AX        ; shift AX left 2 times, move %11 into low-bits
    shl 1,AX        ; shift AX left 1 time, move 0 into low-bit                                     ; this multiplies AX by 2
    shl 4,AX        ; shift AX left 4 times, move 0 into low-bits                                   ; this multiplies AX by 16
    shr 1,CA        ; shift CX:AX right 1 time, move 0 into high-bits   ; this divides CX:AX by 2
    shr 4,CA        ; shift CX:AX right 4 times, move 0 into high-bits  ; SHR is used to divide the product by unity
    add IP,AX       ; move AX + IP to AX        ; ADD LX,IP,AX (move LX+IP to AX) would speed up branches by 1 clock cycle
    add RS,AX       ; move AX + RS to AX
    add LF,AX       ; move AX + LF to AX
    add LX,AX       ; move AX + LX to AX
    sub LX,AX       ; move AX - LX to AX
    and LX,AX       ; move LX -and- AX to AX
    ior LX,AX       ; move LX -ior- AX to AX
    xor LX,AX       ; move LX -xor- AX to AX  
    not AX          ; move -1 -xor- AX to AX
    swp AX          ; exchange low byte and high byte in AX             ; BYT AX,BX and SWP are useful for accessing bytes
    mov MA,CX       ; move MA to CX                                     ; allows MA to hold CX during multiplication
    mov LX,CX       ; move LX to CX                                     ; used by LIST_START etc.
    xch CX,AX       ; exchange CX and AX values                         ; also called:  XCH AX,CX
    sub 1,CX        ; move CX - 1 to CX                                 ; used in countdown loops
    rnd AX          ; move bit-1 -xor- bit-2 -xor- bit-4 -xor- bit-15 to bit-0 | shift left bits 0..14 to bits 1..15 (discarding bit-15)
    mov -1,AX       ; move -1 to AX             ; -1 is needed for primitives that conditionally iterate over themselves
    mov 0,AX        ; move 0 to AX
    mov 1,AX        ; move 1 to AX
    mov 2,AX        ; move 2 to AX
    mov 3,AX        ; move 3 to AX 
    mov 4,AX        ; move 4 to AX
    mov 5,AX        ; move 5 to AX
    mov 6,AX        ; move 6 to AX
    mov 7,AX        ; move 7 to AX 
    mov 8,AX        ; move 8 to AX 
    mov 9,AX        ; move 9 to AX
    mov 10,AX       ; move 10 to AX             ; also called: MOV $A,AX
    mov 11,AX       ; move 11 to AX             ; also called: MOV $B,AX
    mov 12,AX       ; move 12 to AX             ; also called: MOV $C,AX
    mov 13,AX       ; move 13 to AX             ; also called: MOV $D,AX
    mov 14,AX       ; move 14 to AX             ; also called: MOV $E,AX
    mov 15,AX       ; move 15 to AX             ; also called: MOV $F,AX
    undefined
    undefined
    undefined
    undefined
    undefined
    undefined
    undefined

Literal values from -1 to 15 can be constructed in AX with one instruction.
For larger values, use SH2 to append 2 bits at a time. 
Also, SHL 4,AX and SWP AX are useful for setting the high bits when the low bits are zero.
In general, small values are faster than big values, so 1-bit I/O ports should be in the lower bits if possible.

These are group-B instructions that modify BX DX EX VF CF:
    nop
    mov LX,DX       ; move LX to DX
    mov AX,DX       ; move AX to DX
    add 1,BD        ; move BX + 1 BX | move DX + 1 to DX    ; used when moving blocks of data with post-increment
    sub 1,BD        ; move BX - 1 BX | move DX - 1 to DX    ; used when moving blocks of data with pre-decrement
    mov LX,BX       ; move LX to BX
    mov AX,BX       ; move AX to BX
    xch BX,DX       ; exchange BX and DX                ; also called: XCH DX,BX
    byt AX,BX       ; move AX -and- $FF to BX
    mov AX,EX       ; move AX to EX
    bit EX          ; shift EX left 1 bit, move ~CF into low-bit of EX      ; used in division with EX = quotient
    mov CF,BX       ; move CF to BX                     ; sets BX= 1 if CF, else BX= 0
    sgn DX,BX       ; if BX < 0 then move 0 - BX to BX | if DX < 0 then move 0 - DX to DX | set VF to -xor- original sign bits
    adj DX,EB       ; shift DX right, move 0 into high-bit | shift EX:BX left, move 0 into low-bit, move high-bit into CF
    ror BX          ; shift BX right, move CF into high-bit, move low-bit into CF
    rol BX          ; shift BX left, move CF into low-bit, move high-bit into CF
    adc LX,BX       ; move BX + LX + CF to BX, move carry into CF, set VF if LX and AX had same high-bit and it changes
    sbc AX,BX       ; move BX - AX - CF to BX, move borrow into CF, set VF if BX and AX had same high-bit and it changes
    neg BX          ; move 0 - BX to BX
    mov 0,CF        ; move 0 to CF
    xor VF,CF       ; move VF -xor- CF to CF            ; set CF=0 first to just move VF to CF
    not CF          ; move -not- CF to CF
    cmv AX,DX       ; if CF then move AX to DX
    flg AX,CF       ; move AX<>0 to CF                  ; set CF if AX is nonzero
    flg BX,CF       ; move BX<>0 to CF                  ; set CF if BX is nonzero       ; this one is only marginally useful
    flg CX,CF       ; move CX<>0 to CF                  ; set CF if CX is nonzero       ; used in LOOP
    flg DX,CF       ; move DX<>0 to CF                  ; set CF if DX is nonzero
    not BX,CF       ; move BX=0 to CF                   ; set CF if BX is zero   
    mov CN,CF       ; move CX high-bit to CF            ; set CF if CX is negative
    mov BN,CF       ; move BX high-bit to CF            ; set CF if BX is negative
    undefined
    undefined

If it is necessary to make room for another instruction in group-B, FLG BX,CF could be discarded. Use  NOT BX,CF  NOT CF  instead.


These are group-M instructions that modify LX PC MA IP LF HF RS SS:
    nop
    mov AX,LX       ; move AX to LX                         ; this allows EXECUTE and QEXECUTE to work
    ldr LX,BX       ; load (MA) into LX | move BX to MA     ; this is for moving blocks of data, usually from DX to BX
    ldr LX          ; load (MA) into LX
    ldr LF          ; load (MA) into LF
    ldr IP          ; load (MA) into IP
    str AX          ; store AX to (MA)
    str BX          ; store BX to (MA)
    str LX          ; store LX to (MA)
    mov LF,HM       ; move LF to HF and to MA
    mov AX,MA       ; move AX to MA                         ; there is no MOV BX,MA however NXP moves BX to MA for FETCH etc..
    mov CX,MA       ; move CX to MA
    mov DX,MA       ; move DX to MA
    mov R0,MA       ; move RS to MA                         ; this is the address of the first item on the return-stack
    mov S0,MA       ; move SS to MA                         ; this is the address of the second-of-stack (BX is top-of-stack)
    mov S1,MA       ; move SS + 1 to MA                     ; this is the address of the third-of-stack
    phs             ; move SS - 1 to SS and to MA
    pls             ; move SS to MA | move SS + 1 to SS
    phr             ; move RS - 1 to RS and to MA
    plr             ; move RS to MA | move RS + 1 to RS
    phr LX          ; move RS - 1 to RS and to MA | load (MA) into LX                                       ; for ENTER1 etc.
    plr IV          ; if IV -and- ~IF then ( move 1 to IF | move 0 to IV | move IV to IP and to MA | move NEXT to PC ) else ( move RS to MA | move RS + 1 to RS )
    rst LF          ; load (MA) into LF | move AX to RS                                                     ; for LEAVE1 etc.
    lit             ; move IP + 1 to IP and to MA                                                           ; preps literal value fetch
    lit BX          ; store BX to (MA) | move IP + 1 to IP and to MA                                        ; for literal primitives
    bcs             ; if CF then ( move AX to IP and to MA | move NEXT to PC )                              ; branch on CF set
    nxt AX          ; move IP + 1 to IP and to MA | move NEXT to PC | store AX to (MA)                      ; for storing to memory
    nxt LF          ; move IP + 1 to IP and to MA | move NEXT to PC | store LF to (MA) | move RS to LF      ; for ENTER0 etc.
    nxt QX          ; move IP + 1 to IP and to MA | move NEXT to PC | move HF to LF                         ; for QEXIT
    nxt             ; move IP + 1 to IP and to MA | move NEXT to PC                                         ; for ending primitives
    nxp             ; if high-bit of LX clear, then ( move LX to PC | move BX to MA | move DX to LX ) else ( move RS - 1 to RS and to MA )  ; needs LX = cfa
    nxf             ; store IP to (MA) | move LX to IP and to MA | move NEXT to PC                          ; for colon words

NXP expects LX to contain the cfa. If the cfa is of a primitive, NXP executes it, else NXP does PHR in preparation for NXF.
NXF stores the IP to the return-stack, moves the CFA to IP and to MA, and jumps to NEXT.

If NXP executes a primitive, it does some prep for the primitive:
It moves BX to MA to boost the speed of FETCH etc..
It moves DX to LX to boost the speed of DX_AND DX_OR DX_XOR etc..

Group-M does not need an ALU because the only arithmetic done is incrementing and decrementing of registers.
Group-A needs a 32-bit ALU and Group-B needs a 16-bit ALU.

For BCS the offset should be 0 to iterate over the same primitive, -1 to go back one primitive, etc..
An offset of 1 would be the same as NXT that adds 1 to IP internally. 
An offset of 2 would skip over 1 primitive, etc. (unintuitive!)

macro EXECUTE_LX
    mov 1,AX        ; every primitive starts with AX=1 --- this speeds up various primitives
    nxp             ; either do primitive or prep pushing IP --- every primitive starts with MA=BX to speed up FETCH etc.
    nxf             ; push IP and do function
endm

Every primitive starts with AX = 1 because this is done in the EXECUTE_LX macro.
We don't want to set CF inside of EXECUTE_LX because sometimes primitives pass data in CF to the next primitive.
Examples of this are the NEXT_MATCH and LIST_LOOP primitives that pass CF to the POST_LIST primitive.
Also, the last instruction of the primitive is usually group-B, so a group-B in EXECUTE_LX doesn't pack with it.

Notice that there can't be any code between NXP and NXF because NXP may change the PC so nothing afterward will execute.

NEXT:               ; this is at address 0 of code-memory and is hard-coded into NXT and NXF
    ldr LX
    execute_LX

Normally, every primitive ends with a NXT instruction. The NXT will hopefully pack with a group-A or group-B instruction nearby.
Most primitives are dominated by group-M instructions, so packing more group-M instructions doesn't work very well.
In some cases, a primitive is dominated by group-A and group-B instructions, so instead of NXT the following can be inlined:
    lit
    ldr LX
    execute_LX

This is an example:

ABS:                ; n -- |n|              ; absolute value
    mov BN,CF
    mov BX,AX
    lit
    cng AX
    ldr LX
    mov AX,BX
    execute_LX

The LIT packs with surrounding code.
The LDR LX packs with surrounding code.
The MOV 1,AX and NXP in EXECUTE_LX pack with surrounding code (because MOV AX,BX is group-B and they are group-A and group-M).
By comparison, if we used NXT it would pack with MOV AX,BX and it does LIT internally so that is parallelized as well,
but the LDR LX in NEXT would cost an extra clock cycle because it doesn't pack with anything. 
Using EXECUTE_LX can save one clock cycle if both LIT and LDR LX pack, and the last instruction of the primitive is group-B.
The downside is that the primitive is one more opcode in length as compared to NXT (because NXP and NXF each take an opcode).

Note that we have multiple versions of NXT that internally do another group-M operation in parallel, to save one clock cycle.
For example, instead of:  STR AX  NXT  we have:  NXT AX

These are combination instructions (have to be special-cased by the assembler because otherwise they won't pack together):
    xch AX,LX       ; MOV AX,LX  and  MOV LX,AX  packed together        ; also called XCH LX,AX
    xch AX,BX       ; MOV AX,BX  and  MOV BX,AX  packed together        ; also called XCH BX,AX
    xch AX,DX       ; MOV AX,DX  and  MOV DX,AX  packed together        ; also called XCH DX,AX
    xch AX,EX       ; MOV AX,EX  and  MOV EX,AX  packed together        ; also called XCH EX,AX
    xan LX,AX       ; AND LX,AX  and  MOV AX,LX  packed together
    xio LX,AX       ; IOR LX,AX  and  MOV AX,LX  packed together
    xxo LX,AX       ; XOR LX,AX  and  MOV AX,LX  packed together
There are some others that can be done like this if needed.


Section 3.)  some basic example code

LIT_23241:          ; -- 23,241  ; binary: 0101,1010,1100,1001       ; inlining the NEXT code
    phs
    lit BX
    ldr LX
    mov 5,AX        ; leftmost bits: 0101
    sh2 2,AX        ; leftmost bits: 10
    sh2 2,AX        ; leftmost bits: 10
    sh2 3,AX        ; leftmost bits: 11
    sh2 0,AX        ; leftmost bits: 00
    sh2 2,AX        ; leftmost bits: 10
    sh2 1,AX        ; leftmost bits: 01
    mov AX,BX
    execute_LX

Literal value primitives follow the pattern of LIT_23241 shown above --- LIT BX was provided to boost their speed.
Note that LIT BX and LDR LX each pack with one of the ,AX instructions, so they cost nothing.
The MOV AX,BX is group-B so it packs with the MOV 1,AX and NXP in EXECUTE_LX for good speed.
Small values are faster than large values because there are fewer ,AX instructions needed, and the ,AX instructions don't pack with each other at all.
Sometimes SHL 4,AX and SWP AX can be used when you have blocks of zeros in the low bits.

LIT_N250:           ; -- -250   ; binary: 1111,1111,0000,0110       ; this is the obvious way to do it
    phs
    lit BX
    ldr LX
    mov $F,AX       ; leftmost bits: 1111
    sh2 3,AX        ; leftmost bits: 11
    sh2 3,AX        ; leftmost bits: 11
    shl 4,AX        ; leftmost bits: 0000
    sh2 1,AX        ; leftmost bits: 01
    sh2 2,AX        ; leftmost bits: 10
    mov AX,BX
    execute_LX

Here it is again using NOT AX to save 2 clock cycles:

LIT_N250:           ; -- -250   ; binary: 1111,1111,0000,0110  ~0000,0000,1111,1001
    phs
    lit BX
    ldr LX
    mov $F,AX       ; leftmost bits: 1111
    sh2 2,AX        ; leftmost bits: 10
    sh2 1,AX        ; leftmost bits: 01
    not AX
    mov AX,BX
    execute_LX

The NOT AX instruction was mostly provided for generating small negative literals.

LIT_4096:            ; -- 4095  ; binary: 0001,0000,0000,0000
    phs
    lit BX
    ldr LX
    mov 1,AX        ; leftmost bits: 0001
    shl 4,AX        ; leftmost bits: 0000
    swp AX          ; leftmost bits: 0000,0000
    mov AX,BX
    execute_LX

The SWP AX instruction is useful for 1-bit masks in the high byte.

macro DUP           ; n -- n n
    phs
    str BX
end    

macro DROP          ; n --
    pls
    ldr LX
    mov LX,BX
endm    

We have some producers that leave SOS in DX rather than in memory.
Also, we have some consumers that expect SOS in DX rather than in memory.
The compiler has to notice the possibility of using these as part of its optimization.

QEXECUTE:           ; cfa --
    mov LF,HM       ; move LF to HF and to MA
    ldr LF          ; load LF with parent function's LF
EXECUTE:            ; cfa --
    mov BX,AX
    drop
    mov AX,LX
    execute_LX

LOCAL1_QEXECUTE:    ; this executes a quotation cfa held in local1      ; depends on AX=1 already
;   mov n,AX        ; AX= offset to local                               ; this instruction needed for any local offset other than 1
    mov LF,HM       ; move LF to HF and to MA
    ldr LF          ; load LF with parent function's LF
    add LF,AX
    mov AX,MA
    ldr LX
    execute_LX

A function with locals starts with one of ZERO1 ZERO2 etc. if there are any zero-initialized locals.
It then does ENTER2 ENTER1 or ENTER0 for how many stack-initialized locals there are.
ENTER0 is for when there are some zero-initialized locals but no stack-initialized locals.
ENTER0 is not needed when there are no locals at all. Nothing is needed in this case.

ZERO2:              ; --
    mov 0,AX
    phr
    str AX
ZERO1:              ; --
    mov 0,AX
    phr
    nxt AX

ENTER2:             ; a b --
    phr
    str BX          ; transfer B
    pls
    phr LX         
    str LX          ; transfer A
    pls
    phr LX
    mov LX,BX       ; --
    nxt LF          ; push old LF to the return-stack, and move RS to LF as the new local-frame

ENTER1:             ; a --
    phr
    str BX          ; transfer A
    pls
    phr LX
    mov LX,BX       ; --
    nxt LF          ; push old LF to the return-stack, and move RS to LF as the new local-frame

ENTER0:             ; --
    phr
    nxt LF          ; push old LF to the return-stack, and move RS to LF as the new local-frame

Every colon word ends with LEAVEx corresponding to how many locals there were, or just EXIT if no locals.

macro _EXIT         ; --
    plr IV                                              ; if an IRQ is pending, will goto the ISR for that IRQ
    ldr IP          ; restore the old IP
    nxt
endm

LEAVE1:             ; --                                ; assumes that there is one local on the stack      ; depends on AX=1 already
;   mov n,AX        ; AX= size of local frame           ; this instruction needed for any number of locals other than 1
    plr
    add RS,AX       ; prep RST LF
    rst LF          ; pull old LF and discard local variables
EXIT:               ; --
    _exit

QEXIT:              ; --                                ; assumes that this is a quotation called with QEXECUTE
    plr
    ldr IP          ; restore the old IP
    nxt QX          ; restore the old LF

Originally, QEXECUTE would push the HOF's LF to the return-stack, and QEXIT would pull it off.
Now QEXECUTE moves the HOF's LF to the HF register (HF means HOF's LF) and QEXIT moves it back.
This is faster because we don't have memory access.
This means that quotations can't be be nested though. A quotation can't call a HOF that executes another quotation.
This limitation should never be a problem. The compiler can do error-checking to catch this at compile-time

Note that EXIT uses PLR IV rather than PLR. This works the same as PLR if there is no IRQ pending.
If there is an IRQ pending, and IF isn't on, then it goes to the ISR (a colon word) for that IRQ rather than do an exit. 
If more than one IRQ is pending, the highest-priority IRQ will be chosen by the hardware to get its address put in IV.
The ISRs are at $FFC0 $FF80 $FF40 $FF00 (highest-priority to lower priority).
Every ISR must end in ISR_EXIT, which will exit back to the caller of the interrupted function that never did its EXIT.

PLR IV should only be used in EXIT --- PLR should be used any other time data is pulled from the return-stack
PLR IV can also be used in LEAVE2 etc. that do an exit as part of their operation.
PLR IV can't be used in QEXIT because the ISR ends in ISR_EXIT --- this won't undo the quotation properly.

ISR_EXIT:           ; --                                ; like EXIT except changes IF back to 0 again
    xor 1,IF 
    _exit

We don't normally want ISRs to get interrupted, so PLR IV sets the IF mask to 1 when it executes an ISR.
Interrupts are blocked during execution of the ISR, but ISR-EXIT sets IF back to 0 so the main-program can be interrupted.
The XOR 1,IF won't parallelize with the PLR IV in _EXIT so this PLR IV can be interrupted if another IRQ is pending.

The downside of XOR 1,IF not parallelizing however, is that ISR_EXIT takes one clock cycle more than EXIT does.
This isn't too bad --- most processors burn up a lot of clock cycles saving and restoring registers.

The ISR can have local variables. Locals are actually faster than globals (because small literals are fast).
We would need ISR_LEAVEx primitives, because we can't use the LEAVEx primitives that don't clear IF.
Initializing the local-frame with ZEROx ENTER0, and getting rid of it with ISR_LEAVEx, adds a lot of overhead however.
For the most part, if your ISR needs locals, it is too big and complicated --- it is doing too much.

No registers need to be saved and restored for the ISR.
The ISR must ISR_EXIT with the return-stack and data-stack as they were, so RS SS and BX are unchanged.
IP can be corrupted because ISR_EXIT will restore the calling function's IP from the return-stack (this is how we get back).
CX and EX will be unchanged so long as loops are concluded before EXIT is done.
AX DX VF CF can be corrupted by the ISR without harm, because they are always invalid when EXIT is done in the main-program.
DX is only used to pass data between primitives, but never between colon words, so corrupting it causes no harm.

Upon start-up, IF is set to 1 to block interrupts. During initialization, XOR 1,IF should be done to enable interrupts.
After than, IF is normally 0 while the main-program is executing, and 1 while ISRs are executing.

We effectively only service IRQs inside of EXIT (or the exit code in LEAVE2 etc.). 
This should be adequate because EXIT generally executes pretty often (Forth code tends to be heavily factored into colon words).
Some words however may be iterative, and the loop contains only primitives, so EXIT isn't being done for a long time.
This may cause interrupt latency to be excessively long. In this case, the loop should contain a call to the NOOP colon word:

: NOOP ( -- )  ;    \ does nothing except EXIT --- allows any pending IRQ to be serviced by the PLR IV inside of EXIT.

Even given judicious use of NOOP, there may still be too much interrupt latency.
Some primitives are pretty lengthy (multiplication and division, for example), so even bracketed by NOOP too much time passes.
The TOYF clock-speed should be pretty fast (maybe 100 Mhz.), so hopefully EXIT will be done often enough for the I/O.
The TOYF only has 64KW of data-memory, so you can't really buffer a lot of I/O data anyway.
If you have a high-speed UART you don't have to keep up with it. For such small files the upload speed doesn't matter much anyway.
A timer for an ADC or DAC might be problematic though, because you do have to keep up with it --- it has a specified speed.
For the most part, we are still relying on a paced loop, as with the original design. 
The interrupts are for non-critically timed I/O, especially a UART --- we aren't going to expect ISRs to execute right away.
We want the ISRs to be very short and quick, so our paced-loop doesn't run past its time allotment.

My original thought was to have interrupts on the NXP instruction.
They can't happen at arbitrary instructions because we sometimes have data under either the data-stack or return-stack pointer.
If NXP is interrupted, the ISR would be written in machine-code as a primitive.
This is a bad idea however. For one thing, there are no conditional branches in machine-code.
Conditional execution is done by changing IP and then doing a NXT, so it is only done in colon words. 
The ISR has to be a colon word so it can do conditional execution, which is almost always needed.
If NXP is interrupted, then it has to save a lot of state information, start up a colon word, then afterward restore the state.
It is much simpler to have PLR IV that is interrupted, only in EXIT, because there are no registers to save and restore.

RND_BIT:            ; seed-adr -- [0,1]         ; depends upon MA=BX
    ldr LX
    mov LX,AX
    rnd AX
    flg AX,CF
    mov CF,BX       ; BX= low bit of AX
    nxt AX          ; update 16-bit seed

RND_BYTE:           ; seed-adr -- [0,255]       ; depends upon MA=BX
    ldr LX
    mov LX,AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    byt AX,BX       ; BX= low 8 bits of AX
    nxt AX          ; update 16-bit seed

RND_BIT and RND_BYTE aren't adequate for encryption. They are provided for use in games.
The TOYF uses a paced-loop which should work for games --- 100 Hz. (10ms) would update the display fast enough to look like motion.
The major problem is that there is no low-latency IRQ that can be used to generate sound. A separate sound chip would be needed for music.
It may be necessary to use a coprocessor to pony up the TOYF (the TOYF may not be able to start itself on power-up without help).
The coprocessor (a C8051, for example) could also act as the sound chip. The TOYF would squirt a "sound file" over by way of the UART.
The coprocessor would interpret this sound-file to play a tune by toggling a speaker.
The sound quality should be comparable to the video games of the late 1980s (probably not at the level of the C64's 6581 SID though).
The 1 Mhz. Apple-II could do speach synthesis (somewhat robotic, but understandable). A 72 Mhz. C8051 should be significantly better.
The Super Nintendo (SNES) used a 3.58 Mhz. Ricoh 5A22 (a 65c816 derivative). It came out in 1990 and went kaput about 10 years later.
The SNES was programmed in assembly-language (using self-modifying code for speed), and supported some pretty cool games.
The TOYF should be an order of magnitude faster, and you get to program in Forth. High-school students might like it. :-)

LESS_THAN:          ; x y -- x<y            ; a NOT of this is GREATER_THAN_OR_EQUAL
    mov 0,CF
    pls
    ldr LX
    mov BX,AX       ; AX= y
    mov LX,BX       ; BX= x
    sbc AX,BX       ; BX= x-y
    mov BN,CF       ; CF= y>x?
    xor VF,CF       ; adjust for sign       ; this would not be used in the unsigned version
    mov CF,BX
    lit
    ldr LX
    execute_LX

GREATER_THAN:       ; x y -- x>y            ; a NOT of this is LESS_THAN_OR_EQUAL
    mov 0,CF
    pls
    ldr LX
    mov LX,AX       ; AX=x BX=y
    sbc AX,BX       ; BX= y-x
    mov BN,CF       ; CF= x>y?
    xor VF,CF       ; adjust for sign       ; this would not be used in the unsigned version
    mov CF,BX
    lit
    ldr LX
    execute_LX

All of the comparison and test primitives should be straight-forward.

BRANCH:             ; --                    ; depends on AX=1 already
    flg AX,CF       ; CF= 1
    lit
    ldr LX          ; LX= offset
    mov LX,AX
    add IP,AX
    bcs             ; NXT not needed because CF certainly set

0BRANCH:            ; flag --   
    not BX,CF       ; CF= zero?
    drop
    lit
    ldr LX          ; LX= offset
    mov LX,AX
    add IP,AX
    bcs
    nxt

CF0BRANCH:          ; --                    ; branches if CF=0
    lit
    ldr LX          ; LX= offset
    not CF          ; CF= zero?
    mov LX,AX
    add IP,AX
    bcs
    nxt

Some of our logic primitives, such as with 1-bit I/O ports, return the flag in CF --- use CF0BRANCH for this.
Passing a flag in CF is faster than pushing it to the data-stack and then pulling it off the data-stack again.

BRANCH-1:           ; --                    ; depends on AX=1 already
    flg AX,CF       ; CF= 1
    mov -1,AX       ; AX= offset back to previous primitive
    add IP,AX
    bcs             ; NXT not needed because CF certainly set

0BRANCH-1:          ; flag --  
    not BX,CF       ; CF= zero?
    drop
    mov -1,AX       ; AX= offset back to previous primitive
    add IP,AX
    bcs
    nxt

CF0BRANCH-1:        ; --                    ; branches if CF=0
    not CF          ; CF= zero?
    mov -1,AX       ; AX= offset back to previous primitive
    add IP,AX
    bcs
    nxt

It is faster, and results in smaller threaded-code, if BRANCH-1 0BRANCH-1 CF0BRANCH-1 are used instead of BRANCH 0BRANCH CF0BRANCH.
Primitives like BRANCH-1 o0BRANCH-1 CF0BRANCH-1 should be provided for all the common small sizes of code blocks.
Also, if the code block consists of a call to a colon word, then EXIT is done which allows IRQs to be serviced.

GREATER_THAN_BRANCH:        ; x y --  
    mov 0,CF
    pls
    ldr LX
    mov LX,AX       ; AX=x BX=y
    sbc AX,BX       ; BX= y-x
    mov BN,CF       ; CF= x>y?
    xor VF,CF       ; adjust for sign       ; this would not be used in the unsigned version
    drop            ; this is like GREATER_THAN except that we don't store CF into BX, but instead branch on it
    lit
    ldr LX          ; LX= offset to branch
    mov LX,AX
    add IP,AX
    bcs
    nxt

TRUE:               ; -- true                               ; depends on AX=1 already
    dup
    mov AX,BX
    nxt

DX_TRUE:            ; -- true           ; sets DX= SOS      ; depends on AX=1 already
    xch BX,DX
    mov AX,BX
    nxt

Our TRUE is 1 --- unlike in ANS-Forth in which it is -1

TO_R:               ; val --        ; this is: >R
    mov BX,AX
    drop
    phr
    nxt AX

R_FETCH:            ; -- val        ; this is:  R@
    mov R0,MA
    ldr LX
    dup
    mov LX,BX
    nxt

R_FROM:             ; -- val        ; this is:  R>
    plr
    ldr LX
    dup
    mov LX,BX
    nxt

For the most part, local variables should be used. >R R@ R> can also be used though.

DX_DUP:             ; n -- n            ; sets DX= n
    mov BX,AX
    mov AX,DX
    nxt

DX_PREP:            ; a b -- b          ; sets DX= a        ; preps a primitive that needs SOS in DX
    pls
    ldr LX
    mov LX,DX
    nxt

DX_POST:            ; b -- a b          ; needs DX=a        ; puts the SOS onto the memory stack
    phs
    str DX
    nxt

NIP:                ; a b -- b 
    pls
    nxt

RIP:                ; a b c -- b c
    pls
    ldr LX
    mov LX,AX
    mov S0,MA
    nxt AX

OVER:               ; a b -- a b a
    mov S0,MA
    ldr LX
    dup
    mov LX,BX
    nxt    

DX_OVER:            ; a b -- a a        ; sets DX= b
    mov S0,MA
    ldr LX
    xch BX,DX
    mov LX,BX
    nxt

TUCK:               ; a b -- b a b
    mov S0,MA
    ldr LX          
    mov LX,AX       ; AX= a
    str BX
    phs
    nxt AX

DX_TUCK:            ; a b -- b b        ; sets DX= a
    mov S0,MA
    ldr LX
    mov LX,DX
    str BX
    nxt

SWAP:               ; a b -- b a 
    mov BX,AX
    mov S0,MA
    ldr LX
    mov LX,BX
    nxt AX

SLOW_SWAP:          ; a b -- b a 
    mov S0,MA
    ldr LX
    str BX
    mov LX,BX
    nxt

SLOW_SWAP was the original version that is obsolete. 
In SWAP we use NXT AX to save one clock cycle. Note that the MOV BX,AX parallelizes, so it doesn't cost anything.
The idea is to have fewer group-M instructions that don't parallelize with each other. Use MOV BX,AX that is group-A.
ROT and NROT also both now use NXT AX for greater efficiency from their previous versions.

DX_SWAP:            ; a b -- a          ; sets DX= b
    xch BX,DX
    pls
    ldr LX
    mov LX,BX
    nxt

ROT:                ; a b c -- b c a 
    mov BX,AX       ; AX= c
    mov S0,MA
    ldr LX
    mov LX,BX       ; BX= b
    mov S1,MA
    ldr LX          ; LX= a
    str BX
    mov LX,BX
    mov S0,MA
    nxt AX

DX_ROT:             ; a b c -- b a      ; sets DX= c
    xch BX,DX       ; DX= c
    pls
    ldr LX
    mov LX,AX       ; AX= b
    mov S0,MA
    ldr LX          ; LX= a
    mov LX,BX
    nxt AX

NROT:               ; a b c -- c a b
    mov BX,AX       ; AX= c
    mov S1,MA
    ldr LX
    mov LX,BX       ; BX= a
    mov S0,MA
    ldr LX          ; LX= b
    str BX
    mov S1,MA
    mov LX,BX
    nxt AX

DX_NROT:            ; a b c -- c b      ; sets DX= a
    mov S1,MA
    ldr LX
    mov LX,DX       ; DX= a
    mov BX,AX       ; AX= c
    pls
    ldr LX
    mov LX,BX       ; BX= b
    nxt AX

In ANS-Forth, NOT AND OR XOR all do a logical operation on all the bits. The programmer almost never wants this though.
In our Forth, all of these logical operators first normalize the parameters. That is, if it is non-zero, it is changed to 1.
Then the logical operation is done, so the result is normalized.
We also have MASK_NOT MASK-AND MASK-OR MASK-XOR that operate on all the bits, for the rare cases in which this is needed.

NOT:                ; n -- ~flag                            ; 0= in ANS-Forth                       ; normalizes the flags
    not BX,CF       ; CF= zero?
    mov CF,BX
    nxt

FLAG:               ; n -- flag                             ; 0<> in ANS-Forth                      ; normalizes the flags
    flg BX,CF       ; CF= nonzero?
    mov CF,BX
    nxt

DX_DUP_FLAG:        ; n -- flag         ; sets DX= N        ; 0<> in ANS-Forth                      ; normalizes the flags
    lit
    xch BX,DX
    ldr LX
    flg DX,CF       ; CF= nonzero?
    mov CF,BX
    execute_LX

DX_XOR:             ; b -- DX -xor- b       ; expects the SOS to be in DX rather than in memory     ; normalizes the flags
    flg BX,CF
    mov DX,AX
    xor CF,AX
    mov AX,BX
    lit
    ldr LX
    execute_LX

XOR:                ; a b -- a -xor- b      ; expects the SOS to be in DX rather than in memory     ; normalizes the flags
    pls
    ldr LX
    mov LX,AX
    xor CF,AX
    mov AX,BX
    nxt

DX_MASK_XOR:        ; b -- DX -xor- b       ; expects the SOS to be in DX rather than in memory     ; depends on LX=DX
    mov BX,AX
    xor LX,AX
    mov AX,BX
    nxt

MASK_XOR:           ; a b -- a -xor- b
    mov BX,AX
    pls
    ldr LX
    xor LX,AX
    mov AX,BX
    nxt

SLOW_MASK_NOT:      ; n -- ~n           ; effectively the same as:  -1 XOR
    mov BX,AX
    not AX
    mov AX,BX
    nxt

SLOW_MASK_NOT is the obvious way to do it, but it is inefficient.    

MASK_NOT:           ; n -- ~n           ; effectively the same as:  -1 XOR
    neg BX
    sub 1,BD        ; this screws up DX, but we aren't using DX anyway so it doesn't matter
    nxt

EQUAL:              ; x y -- x=y
    mov BX,AX       ; AX= y
    pls
    ldr LX          ; LX= x
    xor LX,AX
    flg AX,CF       ; CF= x<>y
    not CF
    mov CF,BX
    lit
    ldr LX
    execute_LX

DX_EQUAL:           ; y -- x=y              ; DX = x    ; thanks to NXP, LX contains DX when DX_EQUAL is executed
    mov BX,AX       ; AX= y
    xor LX,AX
    flg AX,CF       ; CF= x<>y
    not CF
    mov CF,BX
    lit
    ldr LX
    execute_LX

MINUS:              ; a b -- a-b
    neg BX
    pls
    ldr LX
    mov 0,CF    
    adc LX,BX
    nxt

PLUS:               ; a b -- a+b
    pls
    ldr LX
    mov 0,CF    
    adc LX,BX
    nxt

DX_MINUS:           ; b -- DX-b             ; expects the SOS to be in DX rather than in memory     ; depends on LX=DX AX=1 already 
    flg AX,CF       ; CF= 1
    mov BX,AX
    cng AX          ; AX= -BX
    add LX,AX
    mov AX,BX       ; BX= DX-BX
    lit
    ldr LX
    execute_LX

DX_PLUS:            ; b -- DX+b             ; expects the SOS to be in DX rather than in memory     ; depends on LX=DX already
    mov BX,AX
    add LX,AX
    mov AX,BX 
    lit
    ldr LX
    execute_LX

ABS:                ; n -- |n|              ; absolute value
    mov BN,CF
    mov BX,AX
    lit
    cng AX
    ldr LX
    mov AX,BX
    execute_LX

DX_MINMAX:          ; a b -- max            ; sets DX= min
    xch BX,DX       ; DX= b
    pls 
    ldr LX          ; LX= a
    mov LX,BX       ; BX= a
    mov DX,AX       ; AX= b
    mov 0,CF        ; the TOYF's SBC needs CF set to 0 as done in the MC6800, not to 1 as done in the popular 6502
    sbc AX,BX       ; CF= b(AX,DX) > a(BX)              ; CF indicates that b(DX) > a(BX)
    not CF          ; CF= b(AX,DX) <= a(BX)             ; CF indicates that b(DX) is the minimum and a(BX) is the maximum
    mov LX,BX       ; BX= a                             ; the SBC AX,BX screwed up BX, so BX has to be restored
    mov 1,AX        ; AX= offset to branch to next primitive            
    add IP,AX
    bcs             ; if BX is maximum and DX is minimum, then we are done
SWAP_DX:            ; a -- b        ; needs DX=b, sets DX= a
    xch BX,DX
    nxt

DX_MAXMIN:          ; a b -- min            ; sets DX= max
This is just like DX_MINMAX above except without the NOT CF in there.

MIN and MAX are just DX_MAXMIN and DX_MINMAX except that you ignore the DX value.
MINMAX and MAXMIN can be accomplished with DX_MINMAX and DX_MAXMIN followed by POST_DX.

SWAP_DX_POST:       ; b -- b a      ; needs DX=a        ; puts the SOS onto the memory stack (DX is now the new SOS b)
    phs
    str BX
    xch DX,BX
    nxt


Section 4.)  some example code with global variales

FETCH:              ; adr -- val    ; expects MA=BX
    ldr LX
    mov LX,BX
    nxt

FETCH depends on MA getting set by NXP to the needed value.

STORE:              ; val adr --
    mov BX,AX
    pls
    ldr LX          ; LX= val
    mov AX,MA
    str LX
    drop
    nxt

DX_STORE:           ; adr --        ; expects DX=SOS
    mov DX,AX       ; AX= val
    xch BX,DX       ; DX= adr
    drop
    mov DX,MA
    nxt AX

DX_STORE assembles as:
                group-A     group-B     group-M
1st opcode:     MOV DX,AX   XCH BX,DX   PHS         ; the PHS comes from the DROP macro
2nd opcode:                             STR BX      ; the STR BX comes from the DROP macro
3rd opcode:                             MOV DX,MA
4th opcode:                             NXT AX

This is an alternate version of DX_STORE that takes advantage of MA=BX that we now have:

DX_STORE:           ; adr --        ; expects DX=SOS, MA=BX (MA is set to BX by NXP)
    str DX
    drop
    nxt

This assembles as:
                group-A     group-B     group-M
1st opcode:                             STR DX
2nd opcode:                             PHS         ; the PHS comes from the DROP macro
3rd opcode:                             STR BX      ; the STR BX comes from the DROP macro
4th opcode:                             NXT

It seems as if DX_STORE could benefit from MA=BX that we now have, but it actually comes out to 4 clock-cycles either way.

The global variables [0,8] are fastest, so they should generally be the I/O ports.
They are still pretty slow --- accessing I/O is going to be a drag.


Section 5.)  some example code with moving blocks of data

macro MOVE_WORD                 ; dst -- dst        ; needs DX=src
    mov DX,MA
    ldr LX,BX
    str LX
endm

macro MOVE_WORD_POST_INC        ; dst -- dst        ; needs DX=src
    mov DX,MA
    ldr LX,BX
    add 1,BD
    str LX
endm

macro MOVE_WORD_PRE_DEC         ; dst -- dst        ; needs DX=src
    add 1,BD
    mov DX,MA
    ldr LX,BX
    str LX
endm

If the size of the block is known at compile-time, the above macros can be used to make a primitive to move the block.
If MOVE_WORD_PRE_DEC is used repeatedly, the STR LX at the end parallelizes with the ADD 1,BD at the front.

MOVE_POST_INC:      ; dst -- dst                    ; needs DX=src CX=count     ; CX can't be 0
    sub 1,CX        ; decrement count
    flg CX,CF       ; CF= nonzero?
    mov DX,MA
    not CF          ; CF= zero?
    ldr LX,BX 
    add 1,BD
    mov 0,AX        ; AX= offset back to ourself
    str LX
    add IP,AX
    bcs
    nxt

MOVE_PRE_DEC:       ; dst -- dst                    ; needs DX=src CX=count         ; CX can't be 0
    sub 1,CX        ; decrement count
    flg CX,CF       ; CF= nonzero?
    sub 1,BD
    mov DX,MA
    not CF          ; CF= zero?
    ldr LX,BX
    mov 0,AX        ; AX= offset back to ourself
    str LX
    add IP,AX
    bcs
    nxt

Our MOVE_POST_INC and MOVE_PRE_DEC primitives are somewhat slow.
They are heavy on group-M instructions because they are moving data between memory locations.
They also cause a lot of interrupt latency when moving a large block of data due to EXIT not being done.

PRE_MOVE:           ; src dst count --              ; used prior to MOVE_POST_INC and MOVE_PRE_DEC
    mov BX,AX
    flg AX,CF       ; CF= non-zero?
    xch CX,AX       ; CX= count, AX= old CX
    phr
    str AX          ; push old CX to return-stack
    pls
    ldr LX
    mov LX,DX       ; DX= src, BX=dst
    not CF          ; CF= zero?
    mov 2,AX        ; AX= offset to POST_MOVE
    add IP,AX 
    bcs
    nxt

POST_MOVE:          ; --                            ; used after MOVE_POST_INC and MOVE_PRE_DEC
    plr
    ldr LX
    mov LX,CX       ; restore old CX
    nxt


Section 6.)  some example code with local variables

LOCAL1_ADR:         ; -- adr                        ; depends on AX=1
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    dup
    add LF,AX
    mov AX,BX
    nxt

LOCAL1_FETCH:       ; -- val                        ; depends on AX=1
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    dup
    add LF,AX
    mov AX,MA
    ldr LX
    mov LX,BX
    nxt

LOCAL1_STORE:       ; val --                        ; depends on AX=1
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    add LF,AX
    mov AX,MA
    str BX
    drop
    nxt

DX_DUP_LOCAL1_FETCH     ; -- val                    ; depends on AX=1       ; sets DX= SOS
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    add LF,AX
    mov AX,MA
    xch BX,DX
    ldr LX
    mov LX,BX
    nxt

DX_LOCAL1_STORE:    ; val --                        ; depends on AX=1       ; expects SOS in DX rather than in memory
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    add LF,AX
    mov AX,MA
    str BX
    xch DX,BX
    nxt

LOCAL1_PLUS_STORE:  ; val --                        ; depends on AX=1
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    add LF,AX
    mov AX,MA
    ldr LX          ; LX= current value
    mov 0,CF    
    adc LX,BX
    str BX
    drop
    nxt

DX_LOCAL1_PLUS_STORE:  ; val --                     ; depends on AX=1       ; expects SOS in DX rather than in memory
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    add LF,AX
    mov AX,MA
    mov BX,AX
    ldr LX          ; LX= current value
    add LX,AX
    xch DX,BX
    nxt AX

LOCAL1_PLUS_STORE is 7 clock cycles, because nothing parallelizes. This is very inefficient!
DX_LOCAL1_PLUS_STORE is 3 clock cycles --- this shows the advantage of getting SOS into DX as much as possible.

LOCAL1_PRE_INC_FETCH:   ; -- val                    ; depends on AX=1
    dup
    mov AX,BX       ; BX= 1
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    add LF,AX
    mov AX,MA
    ldr LX 
    mov 0,CF    
    adc LX,BX       ; BX= 1+val
    str BX
    nxt

LOCAL1_POST_INC_FETCH:  ; -- val                ; depends on AX=1
    dup
;   mov n,AX        ; AX= offset to local           ; needed for any local other than the first
    add LF,AX
    mov AX,MA
    ldr LX
    mov LX,BX
    mov 1,AX        ; AX= offset to local
    add LX,AX
    nxt AX

In MFX (for the MiniForth) I had "ghost primitives."
For these I saved the source-code, but didn't assemble them into machine-code.
The first time they were used, they assembled into machine-code. Afterward, if they were used, I had the cfa already.
All of these primitives that access local variables should be ghost primitives.
Dozens or hundreds of them can be meta-compiled, but they won't use code-space unless they are actually used.
Most programs won't need the primitives that are specific to more than four local variables.


Section 7.)  some example code with multiplication

macro MOV_LXEX 
    xch AX,LX
    mov AX,EX
    xch AX,LX
endm

macro MULTIPLY      ; multiply DX * BX setting CX:AX= product, LX= old EX, MA= old CX
    mov 0,AX
    xch AX,EX       ; EX= 0, AX= old EX
    mov AX,LX       ; LX= old EX
    mov 0,AX
    xch AX,CX       ; CX= 0, AX= old CX
    mov AX,MA       ; MA= old CX
    mov 0,AX        ; AX= 0
rept 16             ; the instructions in here parallelize, so we have 1 clock cycle per iteration, for 16 total
    sum EB,CA       ; if low-bit of DX, then move CX:AX + EX:BX to CX:AX
    adj DX,EB       ; shift DX right | shift EX:BX left
endr                ; CX:AX= product
endm

DX_SMULT:           ; y -- x*y          ; needs DX = x      ; signed
    sgn DX,BX       ; this would not be used in DX_UMULT
    multiply        ; needs DX= x, BX = y, sets CX:AX= product
    mov_LXEX        ; EX= old EX
    mov MA,CX       ; CX= old CX
    vng AX          ; this would not be used in DX_UMULT
    mov AX,BX
    lit
    ldr LX
    execute_LX

Use DX_SMULT for * in Forth.

DX_SMULT32:         ; y -- product_high     ; needs DX = x, sets DX= product_low        ; signed
    sgn DX,BX       ; this would not be used in DX_UMULT32
    multiply        ; needs DX= x, BX = y, sets CX:AX= product
    vng CA          ; this would not be used in DX_UMULT32
    mov AX,DX       ; DX= product_low
    xch CX,AX
    mov AX,BX       ; BX= product_high
    xch LX,AX
    mov AX,EX       ; EX= old EX
    mov MA,CX       ; CX= old CX
    nxt

Use DX_SMULT32 for mixed-precision arithmetic, such as */ the scalar.

DX_ UMULT64:         ; y -- product         ; needs DX = x      ; assumes unity = 64    ; unsigned         
    multiply        ; needs DX= x, BX = y, sets CX:AX= product
    mov_LXEX        ; EX= old EX
    mov MA,CX       ; CX= old CX
    shr 4,CA
    shr 1,CA
    shr 1,CA        ; all of these SHR effectively divide CX:AX by 64
    mov AX,BX
    lit
    ldr LX
    execute_LX

Use DX_SMULT64 when unity is 64 --- any power of 2 for unity is fast.

DX_SUM32:           ; y --                  ; needs DX = x      ; return: low high -- low high          ; signed
    sgn DX,BX 
    multiply        ; needs DX= x, BX = y, sets CX:AX= product
    vng CA  
    mov AX,BX       ; BX= low
    xch CX,AX       ; AX= high
    mov MA,CX       ; CX= old CX
    mov_LXEX        ; EX= old EX
    plr             ; now R0 is the low-word
    mov R0,MA
    ldr LX
    mov 0,CF  
    adc LX,BX
    str BX
    phr             ; now R0 is the high-word again --- and MA is set to R0
    mov AX,BX       ; BX= high
    ldr LX
    mov 0,CF
    adc LX,BX
    str BX
    drop
    nxt

Use DX_SMULT32 to approximate polynomials, summing up the 32-bit products in an accumulator on the return-stack.
Afterward, divide the sum by unity for the result. A unity that is a power of 2 is fast.
Note that we don't have MOV R1,MA so we need to use PLR and PHR to adjust RS.
Adjusting RS like this is one reason we can't have interrupts after arbitrary instructions.
It would be possible to have interrupts only after NXP though.


Section 8.)  some example code with division

macro DIV_BIT       ; needs CX:AX = numerator, DX:BX = denominator, sets EX= quotient, CX:AX= remainder
    shl CA          ; CX:AX= 2*numerator (remainder)
    sub DB,CA       ; CX:AX= remainder - denominator
    mov CN,CF       ; CF= CX:AX negative?
    bit EX          ; if negative, set quotient bit to 0, else set quotient bit to 1
    cad DB,CA       ; if negative, then undo trial subtraction
endm

DX_SDIV:            ; denominator -- quotient       ; needs DX = numerator, sets DX= remainder      ; signed
    sgn dx,bx       ; this would not be used in DX_UDIV
    mov EX,AX
    phr
    str AX          ; push old EX to return-stack
    mov DX,AX 
    xch CX,AX       ; CX= numerator, AX= old CX
    phr
    str AX          ; push old CX to return-stack
    mov 0,AX        ; CX:AX= numerator shifted left 16 bits
    mov AX,DX       ; DX:BX= denominator (high word set to zero)
rept 16
    div_bit         ; sets EX= quotient, CX= remainder
endr
    xch CX,AX       ; AX= low-word of remainder, CX= high-word of remainder (should be 0 by this time)
    mov AX,DX       ; DX= remainder
    mov EX,AX
    vng AX          ; this would not be used in DX_UDIV
    mov AX,BX       ; BX= quotient
    plr
    ldr LX
    mov LX,CX       ; pull old CX from return-stack
    plr
    ldr LX
    mov LX,AX
    mov AX,EX       ; pull old EX from return-stack
    nxt

Our DX_SDIV takes a 16-bit numerator and converts it into a 32-bit numerator by shifting left by 16 bits.
It is easy to write a function that takes a 32-bit numerator, such as provided by DX_SMULT32 above.


Section 9.)  some example code with countdown loops

It is possible to implement DO LOOP in TOYF because we have an overflow flag (the MiniForth didn't have an overflow flag).
I don't really like DO LOOP however, because it is very confusing for novices to learn, so I may not bother with this.
Much more common will be a countdown loop as shown below:

PRE_LOOP:           ; count --          ; push old CX to return-stack, set CX= count, branch to POST_LOOP if zero
    mov BX,AX
    xch CX,AX       ; CX= count, AX= old CX
    phr
    str AX          ; push old CX to return-stack
    not BX,CF       ; CF= zero?
    drop            ; --
    lit
    ldr LX          ; LX= offset to POST_LOOP
    mov LX,AX
    add IP,AX
    bcs
    nxt

LOOP:               ; branch if CX is non-zero, then decrement CX
    lit
    ldr LX          ; LX= offset to just past PRE_LOOP
    flg CX,CF       ; CF= nonzero?
    sub 1,CX        ; post-decrement CX
    mov LX,AX
    add IP,AX
    bcs
    nxt

INC_LOOP:           ; n -- n+1          ; depends upon AX=1
    add BX,AX
    mov AX,BX       ; BX incremented
    lit
    ldr LX          ; LX= offset to just past PRE_LOOP
    flg CX,CF       ; CF= nonzero?
    sub 1,CX        ; post-decrement CX
    mov LX,AX
    add IP,AX
    bcs
    nxt
    
LOCAL1_INC_LOOP:    ; --
    mov 1,AX        ; AX= offset to local
    add LF,AX
    mov AX,MA
    mov 1,AX        ; AX= increment
    ldr LX          ; LX= current value
    add LX,AX
    lit
    ldr LX          ; LX= offset to just past PRE_LOOP
    flg CX,CF       ; CF= nonzero?
    sub 1,CX        ; post-decrement CX
    mov LX,AX
    add IP,AX
    bcs
    nxt AX

POST_LOOP:          ; --                ; pull CX from return-stack
    plr
    ldr LX
    mov LX,AX
    xch AX,CX       ; CX= old CX pulled from return-stack
    nxt

For a countdown loop, use:
PRE_LOOP (offset to POST_LOOP)  code...  LOOP (offset to just past PRE_LOOP)  POST_LOOP
The code will execute as many times as count indicates. At POST_LOOP, our CX is -1 before it gets restored.
Use INC_LOOP for the common case of incrementing an index or pointer on the data-stack.
Use LOCALn_INC_LOOP for the even more common case of incrementing an index or pointer in a local variable.


Section 10.)  some example code with list loops

PRE_LIST_LOOP:      ; node --               ; sets EX= node, CX= node           ; return-stack: -- old-EX old-CX prior-node
    mov EX,AX
    phr
    str AX          ; push old EX to return-stack
    mov 0,AX
    mov BX,AX
    xch CX,AX       ; CX= node, AX= old CX
    phr
    str AX          ; push old CX to return-stack
    mov 0,AX
    phr
    str AX          ; push 0 to return-stack as prior-node
    not BX,CF       ; CF= nil?
    mov BX,AX
    drop
    mov AX,EX       ; EX= node
    lit
    ldr LX          ; LX= offset to POST_LIST_LOOP
    mov LX,AX
    add IP,AX
    bcs
    nxt

LIST_START:         ; --                    ; needs CX=node (can't be nil), sets CX= next node              ; EX is still the node
    mov CX,MA
    ldr LX
    mov LX,CX       ; CX= next node
    nxt

LIST_START_LOCAL1_QEXECUTE:                 ; needs CX=node (can't be nil), sets CX= next node              ; EX is still the node
    mov CX,MA
    ldr LX
    mov 1,AX        ; AX= offset to local
    mov LF,HM       ; move LF to HF and to MA
    mov LX,CX       ; CX= next node
    ldr LF          ; load LF with parent function's LF
    add LF,AX
    mov AX,MA
    ldr LX
    execute_LX

In many cases, executing a quotation is the first thing that is done inside of a list loop.
That is what LIST_START_LOCALn_QEXECUTE is for.

FIND_LIST_MATCH:    ; flag --               ; set CF= early-exit?   ; depends upon AX=1
    flg BX,CF       ; CF= true?
    mov 2,AX        ; AX= offset over LIST_LOOP to POST_LIST
    add IP,AX
    bcs             ; if CF then jump to POST_LIST
    nxt             ; else fall into LIST_LOOP

LIST_LOOP:          ; --                    ; needs CX = next-node      ; sets EX= CX, CF= false
    mov EX,AX
    mov R0,MA
    str AX          ; store node to prior-node slot on return-stack
    xch CX,AX       ; CX= EX, AX= CX
    mov AX,EX       ; update EX with next-node
    lit
    ldr LX          ; LX= offset to LIST_START
    flg CX,CF       ; CF= new node is not nil
    mov LX,AX
    add IP,AX
    bcs             ; if CF then jump to LIST_START 
    nxt             ; else fall into POST_LIST

SHORT_LIST_LOOP:    ; --                    ; needs CX = next-node      ; sets EX= CX, CF= false
    mov EX,AX
    mov R0,MA
    str AX          ; store node to prior-node slot on return-stack
    xch CX,AX       ; CX= EX, AX= CX
    mov AX,EX       ; update EX with next-node
    mov -1,AX       ; AX= offset to LIST_START_LOCALn_QEXECUTE that is just prior to SHORT_LIST_LOOP
    flg CX,CF       ; CF= new node is not nil
    add IP,AX
    bcs             ; if CF then jump to LIST_START 
    nxt             ; else fall into POST_LIST

In many cases, LIST_START_LOCALn_QEXECUTE is the only thing that is done inside of a list loop.
That is what SHORT_LIST_LOOP is for. It is possible to have something other than LIST_START_LOCALn_QEXECUTE though.

POST_LIST:          ; -- prior-node node early-exit?    ; needs CF= early-exit?     ; return_stack: old-EX old-CX prior-node --
    dup
    plr
    ldr LX    
    mov LX,BX       ; -- prior-node
    mov EX,AX
    dup
    mov AX,BX       ; -- prior-node node
    dup
    mov CF,BX       ; -- prior-node node early-exit?
    plr
    ldr LX
    mov LX,CX       ; CX= old CX pulled from return-stack
    plr
    ldr LX
    mov LX,AX
    mov AX,EX       ; EX= old EX pulled from return-stack
    nxt

To traverse a list, use:
PRE_LIST_LOOP (offset to POST_LIST)  LIST_START  code...  LIST_LOOP (offset to LIST_START)  POST_LIST
The code inside of the loop will be given the current node in EX --- it leaves EX unmodified.

To obtain a matching node in a list, use:
PRE_LIST_LOOP (offset to POST_LIST)  LIST_START  code...  FIND_LIST_MATCH LIST_LOOP (offset to LIST_START)  POST_LIST
The code inside of the loop will be given the current node in EX --- it leaves EX unmodified.
This code must push a flag indicating if a match was found or not. FIND_LIST_MATCH consumes this flag.
POST_LIST leaves this on the data-stack: -- prior-node node early-exit?
If the loop completed, then node will be the tail node and early-exit? will be FALSE.
If the loop early-exited, then node will be the matching node and early-exit? will be TRUE.
Prior-node is the node prior to the node returned, or nil if the node returned is the first node.

It is okay for the code to modify the link in the node (such as when moving the node to another list).
This is because CX already holds the next-node so we aren't depending upon the old node's link anymore.
If this is done, the prior-node returned by POST_LIST will not be meaningful, so discard it.

The CX register is only used for iteration. Any other use would corrupt any loop that you are inside of.

NODE:           ; -- node                               ; use inside of a list loop to obtain the current node
    dup
    mov EX,AX
    mov AX,BX
    nxt

NODE_PLUS:      ; field-offset -- field-adr             ; use inside of a list loop to obtain a field in the current node
    mov EX,AX
    add BX,AX
    mov AX,BX   
    nxt

NODE and NODE_PLUS are two easy examples of accessing the node inside of a list loop.
Note that modifying EX is not allowed.
Also, if list loops are nested, NODE etc. refer to the inner loop's current node.
If you want to use the outer loop's node, it must be copied somewhere, such as the data-stack or a local variable.
A local variable would work well, because a quotation can be passed in that can access it.

NODE1:          ; -- field-adr                          ; use inside of a list loop to obtain field1 in the current node
\   mov n,AX    ; this instruction needed for any field other than 1
    dup
    mov AX,BX
    mov EX,AX
    add BX,AX
    mov AX,BX
    nxt

The field-offset is generally known at compile-time, so custom primitives can be generated.
Our NODE1 provides the field-address assuming a field-offset of 1.

In my experience in writing the novice-package, a merge-sort is not significantly faster than an insertion-sort.
The insertion-sort has the advantage of not using any extra memory.
My novice-package merge-sort used the return-stack. On the TOYF though, the return-stack is only 32 elements max.
To write a merge-sort on the TOYF, you would need a stack in the heap for temporary storage.
This is way more trouble than it is worth. Just use the insertion-sort --- it is reasonably fast.
For insertion-sort to work, we need the prior-node to the matching node. POST_LIST above provides this.


Section 11.)  some example code for lists

PRE_NTH:            ; head index -- head        ; sets DX= index
    mov 2,AX        ; AX= offset to skip over NTH if index is zero
    not BX,CF       ; CF= index is zero?
    xch BX,DX       ; DX= index
    drop            ; BX= node
    add IP,AX
    bcs
    mov BX,LX       ; LX= node if we are doing NTH (during NTH the BX register is invalid)
    nxt

NTH:                ; invalid -- next-node      ; needs LX=node DX=index
    flg DX,CF       ; CF= not zero?
    sub 1,BD        ; post-decrement DX (also decrements BX, but that doesn't matter because BX is invalid)    
    mov LX,MA 
    ldr LX          ; LX= next node
    mov 0,AX        ; AX= offset back to this primitive again
    add IP,AX
    bcs
    mov LX,BX       ; -- next-node if we aren't doing NTH anymore
    nxt

To calculate the N'th node in a list, use:  PRE_NTH  NTH
The index is 0 for the 1st node, 1 for the 2nd node, etc..
Use GOOD_LENGTH ahead of time to ensure that the list is long enough to have a valid node at the given index.

PRE_GOOD_LENGTH:    ; head count -- count       ; sets DX= head, skips one thread if head is nil
    pls
    ldr LX          
    mov LX,DX       ; -- count      ; DX= head
    mov 2,AX        ; AX= offset skipping over next thread
    flg DX,CF
    not CF          ; CF= nil?
    add IP,AX
    bcs
    nxt

GOOD_LENGTH:        ; count -- remaining-count      ; needs DX=node
    mov -1,AX       ; AX= -1
    mov DX,MA
    ldr LX
    mov LX,DX       ; DX= new-node
    not BX,CF       ; CF= count zero?
    add BX,AX
    mov AX,BX       ; post-decrement BX
    mov 0,AX        ; AX= nil, and offset back to this primitive again
    cmv AX,DX       ; change DX into nil if count is zero, to stop the iteration
    flg DX,CF       ; CF= not nil?
    add IP,AX
    bcs
    nxt

To ensure that a list is at least COUNT nodes long, use:
PRE_GOOD_LENGTH  GOOD_LENGTH  NOT
This leaves a TRUE if the list is long enough, and a FALSE if the list is too short.

PRE_LENGTH:         ; head -- 0         ; sets DX= head, skips one thread if head is nil        ; depends on AX=1
    not BX,CF       ; CF= nil?
    xch BX,DX
    mov 0,AX
    mov AX,BX       ; -- 0              ; initial count is zero
    mov 2,AX        ; AX= offset skipping over next thread       
    add IP,AX
    bcs
    nxt

LENGTH:             ; 0 -- count        ; needs DX=node, finishes with DX= nil, BX= count       ; needs AX=1, DX<>NIL
    add BX,AX
    mov AX,BX       ; increment BX
    mov 0,AX        ; AX= offset back to this primitive again
    mov DX,MA
    ldr LX
    mov LX,DX       ; DX= new node
    flg DX,CF       ; CF= not nil?
    add IP,AX
    bcs
    nxt

To obtain the length, use:  PRE_LENGTH LENGTH
PRE_LENGTH will skip over LENGTH if head is nil, leaving 0 on the stack.
LENGTH will set BX= BX+1, set DX= new-node, then branch back to itself if DX is not nil.

PRE_TAIL:           ; head -- head      ; skips over TAIL if head is nil
    mov 2,AX        ; AX= offset skipping over next thread
    not BX,CF       ; CF= nil?
    add IP,AX
    bcs
    nxt

TAIL depends on DX not being nil initially, which PRE_TAIL guarantees.

TAIL:               ; node -- tail      ; needs BX=node; finishes with DX= nil, BX= tail        ; depends on MA=BX
    ldr LX
    mov LX,DX       ; DX= new node
    mov 0,AX        ; AX= offset back to this primitive again
    flg DX,CF       
    not CF          ; CF= not nil?
    add IP,AX
    xch DX,BX       ; update BX assuming DX is not nil, hold old BX in DX in case we are done
    bcs             ; if BCS ends TAIL BX will still be the old node (DX that is nil)
    xch DX,BX       ; restore BX to tail (it got set to nil in the earlier XCH DX,BX) and we are done
    nxt

To obtain the tail, use:  PRE_TAIL TAIL
PRE_TAIL will skip over TAIL if head is nil, leaving the nil on the stack.
TAIL will set BX= old node, DX= new node, then branch back to itself if DX is not nil.
The tail can also be obtained with LIST_LOOP, but that is much slower if all you need is the tail.

PRE_LINK:           ; 1st-head 2nd-head -- 1st-head 2nd-head 1st-head   ; depends on AX=1     ; branch past LINK if 1st-head is nil
    mov S0,MA
    ldr LX          ; LX= 1st-head
    dup
    mov LX,BX       ; -- 1st-head 2nd-head 1st-head
    flg BX,CF       ; CF= 1st-head is not nill, so we can do LINK
    add IP,AX       ; AX was 1, so BCS will do the same as NXT --- it will do the next primitive which is TAIL
    bcs
    mov 3,AX        ; AX= offset past LINK
    flg AX,CF       ; CF= 1
    drop            ; -- 1st-head 2nd-head
    pls             ; -- 2nd-head
    add IP,AX
    bcs             ; branch past LINK --- no need for NXT as CF is certainly set

LINK:               ; 1st-head 2nd-head 1st-tail -- 1st-head        ; TAIL fell through, so 1st-head is not nil
    pls
    ldr LX,BX       ; -- 1st-head 1st-tail      ; LX= 2nd-head
    str LX          ; store 2nd-head into 1st-tail link field
    drop            ; -- 1st-head
    nxt

To link two lists, do this:  PRE_LINK TAIL LINK

The following is a previous version of the code for linking two lists:

PRE_LINK:           ; 1st-head 2nd-head -- 1st-head 2nd-head 1st-head   ; branch to BAD_LINK if 1st-head nil
    mov 3,AX        ; AX= offset past TAIL and GOOD_LINK to BAD_LINK
    mov S0,MA
    ldr LX          ; LX= 1st-head
    dup
    mov LX,BX       ; -- 1st-head 2nd-head 1st-head
    not BX,CF       ; CF= 1st-head is nil?
    add IP,AX
    bcs
    nxt

GOOD_LINK:          ; 1st-head 2nd-head 1st-tail -- 1st-head        ; TAIL fell through, so 1st-head is not nil
    mov 2,AX        ; AX= offset past BAD_LINK
    pls
    ldr LX,BX       ; LX= 2nd-head
    flg AX,CF       ; CF= 1
    str LX          ; store 2nd-head into 1st-tail link field
    add IP,AX
    drop            ; -- 1st-head
    bcs             ; no need for NXT because CF is certainly set

BAD_LINK:           ; 1st-head 2nd-head 1st-head -- 2nd-head        ; assumes 1st-head is nil
    drop            ; -- 1st-head 2nd-head
    pls             ; -- 2nd-head
    nxt

To link two lists, do this:  PRE_LINK TAIL GOOD_LINK BAD_LINK
PRE_LINK will skip to BAD_LINK if 1st-head is nil, otherwise fall into TAIL, which falls into GOOD_LINK.
GOOD_LINK will link the lists, then skip over BAD_LINK. If BAD-LINK executes it returns 2nd-head (might be nil).

This is somewhat more complicated, and it uses 3 cells in the threaded code compared to 2 for the first version.


section 11.)  some example code with 1-bit I/O

Normally we would have our I/O ports in low memory, because it is slightly faster to set AX to low literal values.

If several 1-bit I/O ports will be logic'd together, it is best if all of them are in the same cell.
This is so  MOV AX,MA  LDR LX  need only be done once to get all of them into LX where they can be extracted one at a time.

CF_STORE7_5;                ; --        ; needs CF=flag, stores CF into address 7, bit-5
    mov 7,AX
    mov AX,MA
    ldr LX
    mov ^5,AX
    set CF,LX,AX    ; if CF then  move LX -ior- AX to AX  else  move LX -and- ~AX to AX
    nxt AX

FETCH7_5:                   ; -- flag   ; flag at address 7, bit-5
    dup
    mov 7,AX
    mov AX,MA
    ldr LX
    mov ^5,AX  
    and LX,AX
    flg AX,CF       ; note that we can't use MOV AX,BX because AX is not normalized
    mov CF,BX       ; BX is now 1 or 0 --- it is normalized
    lit
    ldr LX
    execute_LX

CF_FETCH7_5xor9_ior_7:      ; --        ; CF= flag at address 7, bit 5 -xor- bit 9
    mov 7,AX
    mov AX,MA
    ldr LX
    mov ^5,AX
    and LX,AX
    flg AX,CF       ; CF= bit-5
    mov ^9,AX
    and LX,AX       ; AX= bit-9
    xor CF,AX
    flg AX,CF       ; CF= bit-5 -xor- bit-9
    mov ^7,AX
    and LX,AX
    ior CF,AX
    flg AX,CF       ; CF= (bit-5 -xor- bit-9) -ior- bit-7
    lit
    ldr LX
    execute_LX

CF_FETCH7_5xor9_and_3iorB:  ; -- flag       ; flag at address 7, (bit 5 -xor- bit 9) -and- (bit 3 -ior- bit B)      ; uses DX
    mov 7,AX
    mov AX,MA
    ldr LX
    mov ^5,AX
    and LX,AX
    flg AX,CF       ; CF= bit-5
    mov ^9,AX
    and LX,AX       ; AX= bit-9
    xor CF,AX       
    mov AX,DX       ; DX= bit-5 -xor- bit-9
    mov ^3,AX
    and LX,AX
    flg AX,CF       ; CF= bit-3
    mov ^B,AX
    and LX,AX       ; AX= bit-B
    ior CF,AX
    mov AX,LX       ; LX= bit-3 -ior- bit-B
    mov DX,AX       ; AX= bit-5 -xor- bit-9
    and LX,AX
    flg AX,CF       ; CF= (bit-5 -xor- bit-9) -and- (bit-3 -ior- bit-B)
    lit
    ldr LX
    execute_LX