Dex4u
4.6 Sysenter and the vsyscall page
It has been observed that a 2 GHz Pentium 4 was much slower than an 850 MHz Pentium III on certain tasks, and that this slowness was caused by the very large overhead of the traditional int 0x80 interrupt on a Pentium 4.
Some models of the i386 family do have faster ways to enter the kernel. Starting with the Pentium II there is the sysenter instruction, and AMD has a similar syscall instruction. It would be good if these could be used.
Another point is that some applications call gettimeofday() very often, for example to timestamp every transaction. It would be nice if it could be implemented with very low overhead.
One way of obtaining a fast gettimeofday() is by writing the current time in a fixed place, on a page mapped into the memory of all applications, and updating this location on each clock interrupt. These applications could then read this fixed location with a single instruction - no system call required.
There might be other data that the kernel could make available in a read-only way to the process, like perhaps the current process ID. A vsyscall is a "system" call that avoids crossing the userspace-kernel boundary.
Linux is in the process of implementing such ideas. Since Linux 2.5.53 there is a fixed page, called the vsyscall page, filled by the kernel. At kernel initialization time the routine sysenter_setup() is called. It sets up a non-writable page and writes code for the sysenter instruction if the CPU supports that, and for the classical int 0x80 otherwise. Thus, the C library can use the fastest type of system call by jumping to a fixed address in the vsyscall page.
Concerning gettimeofday(), a vsyscall version for the x86-64 is already part of the vanilla kernel. Patches for i386 exist. (An example of the kind of timing differences: John Stultz reports on an experiment where he measures gettimeofday() and finds 1.67 us for the int 0x80 way, 1.24 us for the sysenter way, and 0.88 us for the vsyscall.)
Some details
The kernel maps a page (0xffffe000-0xffffefff) in the memory of every process. (This is the last but one addressable page. The last is not mapped - maybe to avoid bugs related to wraparound.) We can read it:
/* get vsyscall page */
#include <unistd.h>
#include <string.h>

int main(void) {
    char *p = (char *) 0xffffe000;   /* fixed address of the vsyscall page */
    char buf[4096];
#if 0
    write(1, p, 4096);
    /* this gives EFAULT */
#else
    /* copy the page to an ordinary buffer first, then write that out */
    memcpy(buf, p, 4096);
    write(1, buf, 4096);
#endif
    return 0;
}
and if we do, we find an ELF binary.
% ./get_vsyscall_page > syspage
% file syspage
syspage: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), stripped
% objdump -h syspage
syspage: file format elf32-i386
Sections:
Idx Name            Size      VMA       LMA       File off  Algn
  0 .hash           00000050  ffffe094  ffffe094  00000094  2**2
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .dynsym         000000f0  ffffe0e4  ffffe0e4  000000e4  2**2
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .dynstr         00000056  ffffe1d4  ffffe1d4  000001d4  2**0
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .gnu.version    0000001e  ffffe22a  ffffe22a  0000022a  2**1
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 .gnu.version_d  00000038  ffffe248  ffffe248  00000248  2**2
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 .text           00000047  ffffe400  ffffe400  00000400  2**5
                    CONTENTS, ALLOC, LOAD, READONLY, CODE
  6 .eh_frame_hdr   00000024  ffffe448  ffffe448  00000448  2**2
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 .eh_frame       0000010c  ffffe46c  ffffe46c  0000046c  2**2
                    CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 .dynamic        00000078  ffffe578  ffffe578  00000578  2**2
                    CONTENTS, ALLOC, LOAD, DATA
  9 .useless        0000000c  ffffe5f0  ffffe5f0  000005f0  2**2
                    CONTENTS, ALLOC, LOAD, DATA
% objdump -d syspage
syspage: file format elf32-i386
Disassembly of section .text:
ffffe400 <.text>:
ffffe400:       51                      push   %ecx
ffffe401:       52                      push   %edx
ffffe402:       55                      push   %ebp
ffffe403:       89 e5                   mov    %esp,%ebp
ffffe405:       0f 34                   sysenter
ffffe407:       90                      nop
ffffe408:       90                      nop
... more nops ...
ffffe40d:       90                      nop
ffffe40e:       eb f3                   jmp    0xffffe403
ffffe410:       5d                      pop    %ebp
ffffe411:       5a                      pop    %edx
ffffe412:       59                      pop    %ecx
ffffe413:       c3                      ret
... zero bytes ...
ffffe420:       58                      pop    %eax
ffffe421:       b8 77 00 00 00          mov    $0x77,%eax
ffffe426:       cd 80                   int    $0x80
ffffe428:       90                      nop
ffffe429:       90                      nop
... more nops ...
ffffe43f:       90                      nop
ffffe440:       b8 ad 00 00 00          mov    $0xad,%eax
ffffe445:       cd 80                   int    $0x80
The interesting addresses here are found via
% grep ffffe System.map
ffffe000 A VSYSCALL_BASE
ffffe400 A __kernel_vsyscall
ffffe410 A SYSENTER_RETURN
ffffe420 A __kernel_sigreturn
ffffe440 A __kernel_rt_sigreturn
%
So __kernel_vsyscall pushes a few registers and executes a sysenter instruction, and SYSENTER_RETURN pops the registers again and returns. The routines __kernel_sigreturn and __kernel_rt_sigreturn do system calls 119 (0x77) and 173 (0xad), that is, sigreturn and rt_sigreturn, respectively.
What about the jump just before SYSENTER_RETURN? It is a trick to handle restarting of system calls with 6 parameters. As Linus said: "I'm a disgusting pig, and proud of it to boot."
The code involved is most easily seen from a slightly earlier patch.
A tiny demo program.
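#include <stdio.h>

int pid;    /* global, so that the asm below can refer to it by name */

int main(void) {
    __asm__(
        "movl $20, %eax    \n"    /* __NR_getpid */
        "call 0xffffe400   \n"    /* __kernel_vsyscall */
        "movl %eax, pid    \n"    /* the result comes back in %eax */
    );
    printf("pid is %d\n", pid);
    return 0;
}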
This does the getpid() system call (__NR_getpid is 20) using call 0xffffe400 instead of int 0x80.
However, the proper thing to do is not call 0xffffe400 but call *%gs:0x18. If %gs has been set up so that it addresses 0xffffe000, then location 0xffffe018 holds the value of __kernel_vsyscall, the entry point of the kernel vsyscalls (offset 0x18 is the e_entry field of the ELF header). Such a general setup requires parsing the ELF headers of the vsyscall page, but it is future-proof.