bitRAKE 30 Sep 2024, 09:45
What: the MNIST data set is an image database of handwritten digits.

Why: working with the data in the legacy form is not always ideal.

First, we examine the IDX file format the data is presented in:
; IDX files can hold data of any dimensionality. The least significant byte
; of the magic number gives the number of dimensions. The sizes of the
; dimensions follow as 32-bit values. These are all big-endian.
;       data type indicated by the second-least significant byte of magic:
;               0x08: unsigned byte
;               0x09: signed byte
;               0x0B: short (2 bytes)
;               0x0C: int (4 bytes)
;               0x0D: float (4 bytes)
;               0x0E: double (8 bytes)
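;
; Worked example, assuming the standard train-images.idx3-ubyte header
; (60000 images of 28x28 unsigned bytes):
;       offset  0: 00 00 08 03  magic: type 0x08, 3 dimensions
;       offset  4: 00 00 EA 60  dimension 1 = 60000 (images)
;       offset  8: 00 00 00 1C  dimension 2 = 28 (rows)
;       offset 12: 00 00 00 1C  dimension 3 = 28 (columns)
;       offset 16: element data, one byte per pixel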

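;IFILE equ "data\train-images.idx3-ubyte" ; for testing
match =IFILE, IFILE ; IFILE is still undefined (it matches its own name)
        err "no IDX file specified"
else

virtual at 0
        DAT:: ; area label referenced by the load directives below
        file IFILE
        bytes := $
end virtual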

load type:3 from DAT:0
type = type bswap 3

load dims:1 from DAT:3
repeat dims
        load dim.%:4 from DAT:4*%
        dim.% = dim.% bswap 4
end repeat

iterate <SIZE,  TYPE>,\
        1,      "0x08: unsigned byte",\
        1,      "0x09: signed byte",\
        0,      "(unknown)",\
        2,      "0x0B: short (2 bytes)",\
        4,      "0x0C: int (4 bytes)",\
        4,      "0x0D: float (4 bytes)",\
        8,      "0x0E: double (8 bytes)"

        index = type - 0x08 + 1
        if 0 < index & index <= %%
                indx index
                display "element data type: "
                display TYPE,10
                item_bytes := SIZE
        else
                item_bytes := 0
        end if
        break ; one pass is enough: indx already selected the matching entry
end iterate

est = item_bytes
repeat dims
        est = est * dim.%
end repeat
file_bytes := est + 4*(dims+1)
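; e.g. assuming train-images.idx3-ubyte (type 0x08, dims 60000 x 28 x 28):
;       file_bytes = 1*60000*28*28 + 4*(3+1) = 47040016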

if file_bytes = 0 | file_bytes > bytes
        display "error: IDX file is incomplete",10
else
        display "dimensions: "
        repeat dims
                repeat 1,D:dim.%
                        display `D
                end repeat
                if % <> %%
                        display ' x '
                else
                        display 10
                end if
        end repeat
end if

end match ; IFILE    
... the IFILE symbol value is set on the command line.
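
As a rough sketch of what that could look like: fasmg's -i switch inserts an instruction at the beginning of the source, so the symbol can be defined without editing the script. The script name idxinfo.asm is only a placeholder here, and the exact quoting depends on your shell.

```
fasmg -i "IFILE equ 'data\train-images.idx3-ubyte'" idxinfo.asm
```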

Preprocessing the input data can speed up training and/or affect the model created. The MNIST digit images can be reduced to 20x20, but how should the image data be aligned, and what is the impact on the model?
;IFILE equ "data\t10k-images.idx3-ubyte"        ; 10k images
;IFILE equ "data\train-images.idx3-ubyte"       ; 60k images
virtual at 0
        DAT:: ; area label referenced by the load directives below
        file IFILE
end virtual
offset = 0
iterate <SIZE, ITEM>,\
        4,      magic,\
        4,      images,\
        4,      rows,\
        4,      columns

        load temp:SIZE from DAT:offset
        offset = offset + SIZE
        ITEM := temp bswap 4
end iterate
assert magic = 0x803

repeat images, I:1
        ; initialize all bounds to invalid states
        h_min.I = -1
        h_max.I = -1
        col_mask = 0 ; note: empty images are not allowed
        repeat rows
                load row:columns from DAT:offset
                offset = offset + columns
                if row <> 0
                        if h_min.I < 0
                                h_min.I = % ; first row with data
                        end if
                        h_max.I = % ; last row with data (updated on every non-empty row)
                        col_mask = col_mask or row
                end if
        end repeat
        ; note: each pixel is one byte, so (bit index) shr 3 is the 0-based column
        w_min.I = (bsf col_mask) shr 3
        w_max.I = (bsr col_mask) shr 3

;       debug filtering...
;if 4 > w_max.I - w_min.I + 1 \
;| 11 > h_max.I - h_min.I + 1
;       repeat 1, X:w_min.I, Y:h_min.I, W:w_max.I - w_min.I + 1, H:h_max.I - h_min.I + 1
;               display '#',`I,': (',`X,',',`Y,',',`W,',',`H,')',9
;       end repeat
;end if

; Now that we have the image bounds we can reload and position the data
; within a 20x20 area:

        offset = offset - (columns * rows) ; backup
        repeat rows
                if h_min.I <= % & % <= h_max.I
                        load row:columns from DAT:offset
                        emit 20: row shr (8*w_min.I)
                end if
                offset = offset + columns
        end repeat
        ; tail pad image if additional rows are needed
        db 20 * (19 - h_max.I + h_min.I) dup 0x00
end repeat

; sanity checks
assert offset = 16 + images * rows * columns
assert $-$$ = images * 20 * 20    
... lots of error checking: this conversion is typically done only once, so it should be correct. For the same reason the code also collects per-image stats, which makes it easy to refactor for other uses, prune images, etc.

Note how little code it takes to refactor the image data into other forms.

¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Roman 30 Sep 2024, 15:01

db 20 * (19 - h_max.I + h_min.I) dup 0x00

Does this macro only generate data?
Is that right?
bitRAKE 30 Sep 2024, 15:16
Roman wrote:
Does this macro only generate data?
Yes, the line
emit 20: row shr (8*w_min.I)    
... also outputs data.

... the goal here is to reformat a common data source for inclusion into the program. I'm so lazy - I'll do anything to simplify my coding, lol.

After I prune some images I append something like:
repeat 1, I:images
virtual as 'params'
        db "images := ",`I,10
end virtual
end repeat    
... now the script generates a parameter constants file as well.
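
For illustration, a consuming program might pull both outputs in along these lines. This is only a sketch: the file names mnist20.bin and mnist20.params are hypothetical, and it assumes the conversion script was assembled to mnist20.bin so that the 'params' auxiliary output landed next to it.

```
; hypothetical consumer of the converted data (file names are assumptions)
include 'mnist20.params'        ; defines the constant: images
digits:
        file 'mnist20.bin'      ; the 20x20 images, one byte per pixel
; same sanity check as in the conversion script
assert $ - digits = images * 20 * 20
```

Keeping the count in a generated file means the program never hard-codes how many images survived pruning.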

