flat assembler
Message board for the users of flat assembler.

[fasmg] Working with MNIST data (IDX files)

bitRAKE 30 Sep 2024, 09:45
What: the MNIST data set is an image database of handwritten digits.

Why: working with the data in the legacy form is not always ideal.

First, we examine the IDX file format the data is presented in:
Code:
; IDX files can hold data of any dimensionality. The lowest byte of the magic
; number gives the number of dimensions. The size of each dimension follows
; as a 32-bit value. All of these are big-endian.
;       data type indicated by the third byte of the file (second-lowest byte of the magic):
;               0x08: unsigned byte
;               0x09: signed byte
;               0x0B: short (2 bytes)
;               0x0C: int (4 bytes)
;               0x0D: float (4 bytes)
;               0x0E: double (8 bytes)

;IFILE equ "data\train-images.idx3-ubyte" ; for testing
match =IFILE?,IFILE
        err "no IDX file specified"
else

virtual at 0
DAT::
        file IFILE
        bytes := $
end virtual

load type:3 from DAT:0
type = type bswap 3

load dims:1 from DAT:3
repeat dims
        load dim.%:4 from DAT:4*%
        dim.% = dim.% bswap 4
end repeat

iterate <SIZE,  TYPE>,\
        1,      "0x08: unsigned byte",\
        1,      "0x09: signed byte",\
        0,      "(unknown)",\
        2,      "0x0B: short (2 bytes)",\
        4,      "0x0C: int (4 bytes)",\
        4,      "0x0D: float (4 bytes)",\
        8,      "0x0E: double (8 bytes)"

        index = type - 0x08 + 1
        if 0 < index & index <= %%
                indx index
                display "element data type: "
                display TYPE,10
                item_bytes := SIZE
        else
                item_bytes := 0
        end if
        break
end iterate

est = item_bytes
repeat dims
        est = est * dim.%
end repeat
file_bytes := est + 4*(dims+1)

if est = 0 | file_bytes > bytes ; est = 0: unknown element type or an empty dimension
        display "error: IDX file is incomplete or has an unknown element type",10
else
        display "dimensions: "
        repeat dims
                repeat 1,D:dim.%
                        display `D
                end repeat
                if % <> %%
                        display ' x '
                else
                        display 10
                end if
        end repeat
end if

end match ; IFILE    
... set the IFILE symbol value on the command line, for example with fasmg's -i switch, which inserts a line at the start of the source.
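
For cross-checking the header logic outside fasmg, here is a minimal Python sketch of the same steps: read the type code and dimension count from the magic, read the big-endian dimensions, and compare the expected size with the actual size. The names parse_idx_header and ITEM_BYTES are hypothetical, and the sample buffer is made up for illustration rather than taken from the MNIST files.

```python
import struct

# Element sizes keyed by the IDX type code (third byte of the file).
ITEM_BYTES = {0x08: 1, 0x09: 1, 0x0B: 2, 0x0C: 4, 0x0D: 4, 0x0E: 8}

def parse_idx_header(buf):
    """Return (type_code, dims, data_offset) for an IDX byte string."""
    if len(buf) < 4 or buf[0] != 0 or buf[1] != 0:
        raise ValueError("not an IDX file")
    type_code, ndims = buf[2], buf[3]
    if type_code not in ITEM_BYTES:
        raise ValueError("unknown element type 0x%02X" % type_code)
    data_offset = 4 * (ndims + 1)          # magic plus one 32-bit size per dimension
    if len(buf) < data_offset:
        raise ValueError("IDX header is truncated")
    dims = struct.unpack(">%dI" % ndims, buf[4:data_offset])  # big-endian sizes
    expected = ITEM_BYTES[type_code]
    for d in dims:
        expected *= d
    if len(buf) < data_offset + expected:
        raise ValueError("IDX file is incomplete")
    return type_code, dims, data_offset

# Made-up in-memory example: two 2x3 images of unsigned bytes.
sample = struct.pack(">BBBBIII", 0, 0, 0x08, 3, 2, 2, 3) + bytes(12)
print(parse_idx_header(sample))
```

With a real file, the buffer would come from open(path, "rb").read(); the path is whatever IFILE points at.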

Preprocessing input data can speed up training and/or affect the model created. The MNIST digit images can be reduced to 20x20, but how should the image data be aligned, and what is the impact on the model?
Code:
;IFILE equ "data\t10k-images.idx3-ubyte"        ; 10k images
;IFILE equ "data\train-images.idx3-ubyte"       ; 60k images
virtual at 0
DAT::
        file IFILE
end virtual
offset = 0
iterate <SIZE, ITEM>,\
        4,      magic,\
        4,      images,\
        4,      rows,\
        4,      columns

        load temp:SIZE from DAT:offset
        offset = offset + SIZE
        ITEM := temp bswap 4
end iterate
assert magic = 0x803

repeat images, I:1
        ; initialize all bounds to invalid states
        h_min.I = -1
        h_max.I = -1
        col_mask = 0 ; note: empty images are not allowed
        repeat rows
                load row:columns from DAT:offset
                offset = offset + columns
                if row <> 0
                        if h_min.I < 0
                                h_min.I = % ; first row with data
                        end if
                        h_max.I = % ; last row with data so far (also covers one-row images)
                        col_mask = col_mask or row
                end if
        end repeat
        ; note: w_min/w_max are 0-based column indices; h_min/h_max are 1-based row numbers (from %)
        w_min.I = (bsf col_mask) shr 3
        w_max.I = (bsr col_mask) shr 3

;       debug filtering...
;if 4 > w_max.I - w_min.I + 1 \
;| 11 > h_max.I - h_min.I + 1
;       repeat 1, X:w_min.I, Y:h_min.I, W:w_max.I - w_min.I + 1, H:h_max.I - h_min.I + 1
;               display '#',`I,': (',`X,',',`Y,',',`W,',',`H,')',9
;       end repeat
;end if

; Now that we have the image bounds we can reload and position the data
; within a 20x20 area:

        offset = offset - (columns * rows) ; backup
        repeat rows
                if h_min.I <= % & % <= h_max.I
                        load row:columns from DAT:offset
                        emit 20: row shr (8*w_min.I)
                end if
                offset = offset + columns
        end repeat
        ; tail pad image if additional rows are needed
        db 20 * (19 - h_max.I + h_min.I) dup 0x00
end repeat

; sanity checks
assert offset = 16 + images * rows * columns
assert $-$$ = images * 20 * 20    
... lots of error checking, since this conversion is typically done only once and should be correct. For this reason the code also collects per-image stats; I can easily refactor the code for other uses, prune images, etc.

Note how little code it takes to refactor the image data into other forms.
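
To compare results against the fasmg output, the cropping step can be sketched in Python. This is a rough equivalent under stated assumptions, not the script above: it finds the bounding box of non-zero pixels, copies it to the top-left corner of a size x size area, and zero-pads the rest. The function name crop_top_left and the tiny 4x4 test image are hypothetical.

```python
def crop_top_left(image, rows, cols, size=20):
    """Move the non-zero bounding box of one image to the top-left of a
    size x size area. `image` holds rows*cols pixel bytes, row by row."""
    lines = [image[r * cols:(r + 1) * cols] for r in range(rows)]
    used_rows = [r for r, line in enumerate(lines) if any(line)]
    if not used_rows:
        raise ValueError("empty images are not allowed")
    top, bottom = used_rows[0], used_rows[-1]
    used_cols = [c for c in range(cols) if any(line[c] for line in lines)]
    left, right = used_cols[0], used_cols[-1]
    if bottom - top + 1 > size or right - left + 1 > size:
        raise ValueError("digit does not fit in the target area")
    out = bytearray()
    for line in lines[top:bottom + 1]:
        out += line[left:right + 1].ljust(size, b"\0")  # pad each row on the right
    out += bytes(size * (size - (bottom - top + 1)))    # tail-pad the missing rows
    return bytes(out)

# Made-up 4x4 image with a 2x2 blob, cropped into a 3x3 area.
tiny = bytes([0, 0, 0, 0,
              0, 5, 6, 0,
              0, 0, 7, 0,
              0, 0, 0, 0])
print(list(crop_top_left(tiny, 4, 4, size=3)))
```

Top-left alignment is only the choice made in the code above; centering the box would be a different preprocessing decision with its own effect on the model.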

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Roman 30 Sep 2024, 15:01
Quote:

db 20 * (19 - h_max.I + h_min.I) dup 0x00

Does this line only generate data?
Is that right?
bitRAKE 30 Sep 2024, 15:16
Roman wrote:
Does this line only generate data?
Yes, the line
Code:
emit 20: row shr (8*w_min.I)    
... also outputs data.
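
The shift works because a row loaded as one little-endian number keeps the first column in the lowest byte, so a bit index divided by 8 is a column index. A small Python sketch of that idea follows; bsf, bsr and left_align are hypothetical helper names, and the 8-pixel row is made up.

```python
def bsf(x):
    """Index of the lowest set bit (x must be non-zero)."""
    return (x & -x).bit_length() - 1

def bsr(x):
    """Index of the highest set bit (x must be non-zero)."""
    return x.bit_length() - 1

def left_align(row, col_mask, width):
    """Mimic `emit width: row shr (8*w_min)` for one row of pixel bytes."""
    w_min = bsf(col_mask) >> 3          # bit index -> byte (column) index
    value = int.from_bytes(row, "little") >> (8 * w_min)
    # to_bytes raises OverflowError if the shifted row needs more than
    # `width` bytes; this sketch treats that as the "does not fit" case.
    return value.to_bytes(width, "little")

row = bytes([0, 0, 0, 7, 9, 0, 0, 0])    # made-up 8-pixel row
mask = int.from_bytes(row, "little")      # single row, so the mask equals the row
print(bsf(mask) >> 3, bsr(mask) >> 3)     # first and last used column
print(left_align(row, mask, 4))
```

In the full script the mask is the OR of every row of the image, so all rows of one image are shifted by the same amount.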

... the goal here is to reformat a common data source for inclusion in the program. I'm so lazy that I'll do anything to simplify my coding, lol.

After I prune some images I append something like:
Code:
repeat 1, I:images
virtual as 'params'
        db "images := ",`I,10
end virtual
end repeat    
... now the script generates a parameter constants file as well.
