flat assembler
Message board for the users of flat assembler.
[fasmg] Working with MNIST data (IDX files)
bitRAKE 30 Sep 2024, 09:45
What: the MNIST data set is an image database of handwritten digits.
Why: working with the data in the legacy form is not always ideal. First, we examine the IDX file format the data is presented in:
Code:
    ; IDX files can be used for any dimensional data. The magic number indicates
    ; the number of dimensions in the least byte. Size of dimensions follow as
    ; 32-bit values. These are all big-endian.

    ; data type indicated by second byte of magic:
    ;   0x08: unsigned byte
    ;   0x09: signed byte
    ;   0x0B: short (2 bytes)
    ;   0x0C: int (4 bytes)
    ;   0x0D: float (4 bytes)
    ;   0x0E: double (8 bytes)

    ;IFILE equ "data\train-images.idx3-ubyte" ; for testing

    match =IFILE?,IFILE
        err "no IDX file specified"
    else
        virtual at 0
            DAT:: file IFILE
            bytes := $
        end virtual

        load type:3 from DAT:0
        type = type bswap 3
        load dims:1 from DAT:3
        repeat dims
            load dim.%:4 from DAT:4*%
            dim.% = dim.% bswap 4
        end repeat

        iterate <SIZE, TYPE>,\
            1, "0x08: unsigned byte",\
            1, "0x09: signed byte",\
            0, "(unknown)",\
            2, "0x0B: short (2 bytes)",\
            4, "0x0C: int (4 bytes)",\
            4, "0x0D: float (4 bytes)",\
            8, "0x0E: double (8 bytes)"

            index = type - 0x08 + 1
            if 0 < index & index <= %%
                indx index
                display "element data type: "
                display TYPE,10
                item_bytes := SIZE
            else
                item_bytes := 0
            end if
            break
        end iterate

        est = item_bytes
        repeat dims
            est = est * dim.%
        end repeat
        file_bytes := est + 4*(dims+1)
        if file_bytes = 0 | file_bytes > bytes
            display "error: IDX file is incomplete",10
        else
            display "dimensions: "
            repeat dims
                repeat 1,D:dim.%
                    display `D
                end repeat
                if % <> %%
                    display ' x '
                else
                    display 10
                end if
            end repeat
        end if
    end match ; IFILE

Preprocessing the input data can speed up training and/or affect the model created. The MNIST digit images can be reduced to 20x20, but how should the image data be aligned, and what is the impact on the model?
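Before turning to that question, the header layout described above can be cross-checked outside fasmg. The following is only a minimal Python sketch, not part of the original listing: the names `parse_idx_header` and `IDX_ITEM_BYTES` are hypothetical, and the sample buffer is synthetic rather than real MNIST data. It assumes the first two bytes of the magic number are zero, which is what the 3-byte `load type:3` in the fasmg code implicitly relies on.

```python
import struct

# Element size in bytes, keyed by the type code stored at file offset 2.
IDX_ITEM_BYTES = {0x08: 1, 0x09: 1, 0x0B: 2, 0x0C: 4, 0x0D: 4, 0x0E: 8}


def parse_idx_header(buf):
    """Return (type_code, dims, data_offset) for an IDX byte string.

    Hypothetical helper: mirrors the checks in the fasmg listing,
    but is not a line-by-line port of it.
    """
    if len(buf) < 4:
        raise ValueError("IDX file is incomplete")
    # Magic number, big-endian: two zero bytes, type code, dimension count.
    zero, type_code, ndims = struct.unpack(">HBB", buf[:4])
    if zero != 0 or type_code not in IDX_ITEM_BYTES:
        raise ValueError("unrecognised IDX magic number")
    # One 32-bit big-endian size per dimension follows the magic number.
    data_offset = 4 * (ndims + 1)
    if len(buf) < data_offset:
        raise ValueError("IDX file is incomplete")
    dims = struct.unpack(">%dI" % ndims, buf[4:data_offset])
    # Expected payload size = item size times the product of all dimensions.
    expected = IDX_ITEM_BYTES[type_code]
    for d in dims:
        expected *= d
    if len(buf) < data_offset + expected:
        raise ValueError("IDX file is incomplete")
    return type_code, dims, data_offset


# Synthetic 2 x 3 x 4 unsigned-byte file (24 zero bytes of payload).
sample = bytes([0, 0, 0x08, 3]) + struct.pack(">3I", 2, 3, 4) + bytes(24)
type_code, dims, data_offset = parse_idx_header(sample)
print(hex(type_code), dims, data_offset)  # 0x8 (2, 3, 4) 16
```

For the MNIST image files the same routine would be expected to report three dimensions (images, rows, columns) and a data offset of 16, matching the `assert offset = 16 + images * rows * columns` check used in the fasmg code.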
Code:
    ;IFILE equ "data\t10k-images.idx3-ubyte" ; 10k images
    ;IFILE equ "data\train-images.idx3-ubyte" ; 60k images

    virtual at 0
        DAT:: file IFILE
    end virtual

    offset = 0
    iterate <SIZE, ITEM>,\
        4, magic,\
        4, images,\
        4, rows,\
        4, columns

        load temp:SIZE from DAT:offset
        offset = offset + SIZE
        ITEM := temp bswap 4
    end iterate
    assert magic = 0x803

    repeat images, I:1
        ; initialize all bounds to invalid states
        h_min.I = -1
        h_max.I = -1
        col_mask = 0 ; note: empty images are not allowed
        repeat rows
            load row:columns from DAT:offset
            offset = offset + columns
            if row <> 0
                if h_min.I < 0
                    h_min.I = % ; first row with data
                else
                    h_max.I = % ; last row with data
                end if
                col_mask = col_mask or row
            end if
        end repeat
        ; note: these are 0-based indices
        w_min.I = (bsf col_mask) shr 3
        w_max.I = (bsr col_mask) shr 3

        ; debug filtering...
        ;if 4 > w_max.I - w_min.I + 1 \
        ;| 11 > h_max.I - h_min.I + 1
        ;   repeat 1, X:w_min.I, Y:h_min.I, W:w_max.I - w_min.I + 1, H:h_max.I - h_min.I + 1
        ;       display '#',`I,': (',`X,',',`Y,',',`W,',',`H,')',9
        ;   end repeat
        ;end if

        ; Now that we have the image bounds we can reload and position the data
        ; within a 20x20 area:

        offset = offset - (columns * rows) ; backup
        repeat rows
            if h_min.I <= % & % <= h_max.I
                load row:columns from DAT:offset
                emit 20: row shr (8*w_min.I)
            end if
            offset = offset + columns
        end repeat
        ; tail pad image if additional rows are needed
        db 20 * (19 - h_max.I + h_min.I) dup 0x00
    end repeat

    ; sanity checks
    assert offset = 16 + images * rows * columns
    assert $-$$ = images * 20 * 20

Note how little code it takes to refactor the image data into other forms.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
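The cropping step above (find the rows and columns that contain ink, then shift that bounding box to the top-left of a 20x20 area and zero-pad the rest) can also be sanity-checked outside the assembler. This is a rough Python sketch of the same idea under stated assumptions, not a port of the fasmg code: `crop_to_20x20` is a hypothetical name, the test image is synthetic, and the explicit size checks are an addition of this sketch.

```python
def crop_to_20x20(image, rows=28, columns=28):
    """Shift the non-zero bounding box of one image to the top-left of a
    zero-padded 20x20 canvas (hypothetical helper, row-major bytes in)."""
    lines = [image[r * columns:(r + 1) * columns] for r in range(rows)]
    used_rows = [r for r, line in enumerate(lines) if any(line)]
    if not used_rows:
        raise ValueError("empty images are not allowed")
    top, bottom = used_rows[0], used_rows[-1]
    # Columns holding at least one non-zero pixel (0-based indices).
    used_cols = [c for c in range(columns) if any(line[c] for line in lines)]
    left, right = used_cols[0], used_cols[-1]
    if bottom - top + 1 > 20 or right - left + 1 > 20:
        raise ValueError("bounding box does not fit in 20x20")
    out = bytearray()
    for line in lines[top:bottom + 1]:
        # Drop the empty leading columns, then pad or trim the row to 20.
        out += (line[left:] + bytes(20))[:20]
    # Tail-pad with empty rows up to the full 20x20 size.
    out += bytes(20 * 20 - len(out))
    return bytes(out)


# Synthetic 28x28 image with two lit pixels at (row 5, col 7) and (row 6, col 9).
img = bytearray(28 * 28)
img[5 * 28 + 7] = 1
img[6 * 28 + 9] = 2
small = crop_to_20x20(bytes(img))
print(len(small), small[0], small[22])  # 400 1 2
```

Top-left alignment is only one possible answer to the alignment question raised earlier; centring the bounding box inside the 20x20 area would need an extra offset on both axes, and the sketch makes no claim about which choice trains better.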
Roman 30 Sep 2024, 15:01
This macro only generates data? Is that right?
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.