flat assembler
Message board for the users of flat assembler.

flat assembler > Heap > Section .text or .data for placing small predefined arrays?

Gyricon



Joined: 09 Jun 2011
Posts: 4
Location: Ukraine

Often a procedure works with a small table of predefined data: for example,
proc crc32, proc codepage_change, proc week2text, and so on.

Of course, these tables are stored in ordinary memory (DRAM). But the processor
does not work with the "slow" DRAM directly; it works with fast caches (SRAM).
Simplifying a little, there are at least two physical caches: one for code (bytes
from section .text) and one for data (bytes from section .data).

These two caches have different microarchitectures and sizes. For example, the
AMD A8-7600 has:

L1 instruction cache: 2x96 KB.
L1 data cache: 4x16 KB.
L2 cache (unified): 2x2048 KB.
L3 cache: absent.

Which of the two options is optimal? I believe that the first option (placing the
data in the cache meant for code) is WORSE, because our array bytes are more likely
to be evicted from that cache. For example, after a branch misprediction the
pipeline has to discard the micro-operations it has already prepared and fetch
new portions of code from DRAM into the L1 instruction cache.

I am interested in your opinion. Thank you.


№1

Code:

        section '.text' code readable executable

                ConvertIndexToPrimeNumber:
                ; eax = zero-based index on entry, the prime number on return
                mov eax,dword[PrimeNumbers+eax*4]
                ret

                align 4
                PrimeNumbers dd 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79

    


№2

Code:

        section '.text' code readable executable

                ConvertIndexToPrimeNumber:
                ; eax = zero-based index on entry, the prime number on return
                mov eax,dword[PrimeNumbers+eax*4]
                ret

        section '.data' data readable writeable

                align 4
                PrimeNumbers dd 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79

    
Post 07 Feb 2018, 13:11
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16133
Location: Hyperborea
Gyricon wrote:
Which of the two options is optimal? I believe that the first option (placing the
data in the cache meant for code) is WORSE, because our array bytes are more likely
to be evicted from that cache. For example, after a branch misprediction the
pipeline has to discard the micro-operations it has already prepared and fetch
new portions of code from DRAM into the L1 instruction cache.
There is no answer here because it depends upon what your data and code access patterns are and which CPU/system combo you are testing with.

The only way to know for each system is to time the results and see which method works best for your application on that system.

But do take note that the CPU doesn't "know" which piece of RAM is code and which is data. If you access memory with MOV, then that memory is treated as data. If you access the same memory by executing it, then that memory is treated as code. Some CPUs have exclusive caches and others have inclusive caches, so that will also affect the results.
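
To illustrate the "time it" advice, here is a rough sketch of one way to measure the lookup with RDTSC. The MeasureLookup label, the iteration count and the index are invented for the example; on real hardware the RDTSC pair should be serialized (for instance with CPUID or LFENCE) and the run repeated several times for stable numbers.

Code:

                ; rough timing sketch (not from the thread): call the routine in a
                ; loop and take the difference of two RDTSC readings; to be placed
                ; in the '.text' section next to ConvertIndexToPrimeNumber.
                ; clobbers eax, ecx, edx, esi.
                MeasureLookup:
                rdtsc                           ; start timestamp, low 32 bits in eax
                mov esi,eax
                mov ecx,1000000                 ; arbitrary repeat count
                .again:
                mov eax,5                       ; arbitrary index into the table
                call ConvertIndexToPrimeNumber
                dec ecx
                jnz .again
                rdtsc
                sub eax,esi                     ; eax = elapsed reference cycles (low 32 bits)
                ret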
Post 07 Feb 2018, 13:17
Furs



Joined: 04 Mar 2016
Posts: 1318
Yeah, but don't CPUs have separate instruction and data caches? In that case having "data" in the insn cache can waste it since the data will be duplicated in the data cache anyway.

I mean, in the end you probably still use one cache line for both code and data here, but at least if they were separate, the code-cache and data-cache lines would fit "more stuff" in them (adjacent code or data).
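
A rough illustration of that shared-line point, assuming a 64-byte cache line and the routine starting exactly on a line boundary (neither is stated in the thread): with layout №1 the table begins a few bytes after the ret, inside the same line, so executing the routine pulls the first table entries into the instruction cache and reading the table pulls the code bytes into the data cache.

Code:

        ; hypothetical offsets for layout 1, assuming a 64-byte cache line and that
        ; ConvertIndexToPrimeNumber starts exactly on a line boundary
        section '.text' code readable executable

                ConvertIndexToPrimeNumber:
                mov eax,dword[PrimeNumbers+eax*4]       ; 7-byte instruction, bytes 0x00..0x06
                ret                                     ; byte 0x07
                align 4                                 ; offset 0x08 is already dword-aligned
                PrimeNumbers dd 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79
                ; the table starts at byte 0x08 of the same 64-byte line, so that line
                ; holds both the code and the first fourteen table entries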
Post 07 Feb 2018, 16:28
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16133
Location: Hyperborea
Furs wrote:
Yeah, but don't CPUs have separate instruction and data caches? In that case having "data" in the insn cache can waste it since the data will be duplicated in the data cache anyway.
Data can only enter the iCache when you execute code. So the data would have to exist in the same block of RAM as the code. If the data fits exactly into a multiple of a cache line size, and it is aligned correctly then it can never enter the iCache. This is easily achieved with padding and alignment directives.

If the goal here is absolute performance then proper alignment is a must, and should really be done as a matter of normal coding to avoid "silly" performance problems. But it also has the downside of inflating the memory footprint and might cause cache thrashing. Also, cache line sizes differ between CPUs, so one would need to optimise for each target CPU to get the greatest benefit.
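
As a sketch of the padding and alignment being described (the 64-byte line size is an assumption, common on current x86 CPUs but not guaranteed): aligning the table to a line boundary and padding up to the next boundary gives the table its own cache lines inside the code section, so no line that is fetched for execution also holds table bytes.

Code:

        ; sketch of the padding idea, assuming a 64-byte cache line: the table gets
        ; its own cache lines inside the code section, so no line is shared between
        ; executed bytes and table bytes
        section '.text' code readable executable

                ConvertIndexToPrimeNumber:
                mov eax,dword[PrimeNumbers+eax*4]
                ret

                align 64                        ; the table starts on a line boundary
                PrimeNumbers dd 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79
                align 64                        ; pad to the next boundary so following
                                                ; code does not share the table's last line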
Post 07 Feb 2018, 16:42
rugxulo



Joined: 09 Aug 2005
Posts: 2311
Location: Usono (aka, USA)
Wikipedia wrote:

The data segment is read-write, since the values of variables can be altered at run time. This is in contrast to the read-only data segment (rodata segment or .rodata), which contains static constants rather than variables; it also contrasts to the code segment, also known as the text segment, which is read-only on many architectures. Zero-initialized data, both variables and constants, is instead in the BSS segment.
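
Tying that back to the fasm examples above: for a constant table there is a third option besides '.text' and a writeable '.data', namely a read-only data section. A minimal sketch for PE output (the '.rdata' name is only the usual convention, not something fasm requires):

Code:

        ; constant table in a read-only data section; a write to it would typically
        ; fault instead of silently corrupting the data
        section '.rdata' data readable

                align 4
                PrimeNumbers dd 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79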
Post 07 Feb 2018, 21:01


Copyright © 1999-2018, Tomasz Grysztar.

Powered by rwasa.