====== Piotr Jaroszyński: Usermode debugging under Linux ======

====== How usermode under Linux is done ======

===== Intro =====

Porting any code to a substantially different environment is hardest when no other ports have been done yet. Fortunately gPXE already supports two ''ARCH''s (''i386'' and ''x86_64'') and two ''PLATFORM''s (''pcbios'' on ''i386'' and ''efi'' on both). Because ''efi'' and ''pcbios'' differ, extra layers that make up for those differences have already been introduced. That makes the Linux usermode port, despite being quite different conceptually (usermode versus hardware), a much easier task.

Before focusing on the specific layers (called subsystems later for lack of a better name), let's first look at how the necessary kernel interface is provided (it's not as trivial as one might think).

===== Kernel API =====

Regardless of the specific usage (discussed later in subsystems), some way of accessing the kernel is necessary.

==== Background ====

By its nature, gPXE was designed and implemented to be completely self-contained. It doesn't link to stdlib (glibc) or to any other library. That's a nice feature to have considering the crazy size constraints it has to meet. For example, it allows gPXE to be compiled with the ''-mregparm=3'' and ''-mrtd'' flags, which reduce code size but also make it incompatible with code compiled without them. On the other hand, the stdlib APIs were needed to make the programming environment feel natural, and hence many of them were reimplemented internally.

==== linux_ prefix to the rescue ====

To avoid confusion (and in many cases collisions) between gPXE internals and the kernel interface, it was decided that all of the kernel API functions would be prefixed with ''linux_''.
For example:

''include/linux_api.h'':
<code c>
extern int linux_open(const char *pathname, int flags);
extern int linux_close(int fd);
</code>

''include/gpxe/posix_io.h'':
<code c>
extern int open(const char *uri_string);
extern int close(int fd);
</code>

==== Linking to stdlib (glibc) ====

Linking to stdlib is non-trivial, forces some compile flags to be disabled (namely the ''-mrtd'' and ''-mregparm'' flags mentioned earlier) and has [[#the_other_problem_with_stdlib|some other problems]], but it was still the quickest approach for prototyping. It will also come in handy when debugging problems with the [[#being_self-contained|other, superior approach]].

To work around the symbol collisions with stdlib, all the necessary libs are copied with the offending symbols prefixed with ''linux_''. ''objcopy'' with ''--redefine-syms=remap_file'' is used to achieve that. An example line from ''remap_file'' simply says:

  read linux_read

All the build/linker details can be seen in ''arch/x86/Makefile.linux'':
<code>
MEDIA = linux

STDLIBS_BEGIN = $(BIN)/remapped_crt1.o $(BIN)/remapped_crti.o $(BIN)/remapped_crtbeginT.o
STDLIBS_LIBS = $(BIN)/remapped_libc.a $(BIN)/remapped_libgcc.a $(BIN)/remapped_libgcc_eh.a
STDLIBS_LIBS_L = $(foreach lib, $(STDLIBS_LIBS), -l:$(lib))
STDLIBS_END = $(BIN)/remapped_crtend.o $(BIN)/remapped_crtn.o

SYMBOLS_REMAP = arch/x86/linux/symbols_remap

$(BIN)/remapped_% : $(SYMBOLS_REMAP)
	$(QM)$(ECHO) " [REMAP] $*"
	$(Q)objcopy --redefine-syms=$(SYMBOLS_REMAP) $(shell gcc $(CFLAGS) --print-file-name $*) $@

.PRECIOUS : $(BIN)/remapped_%

TGT_EXTRA_DEPS += $(STDLIBS_BEGIN) $(STDLIBS_LIBS) $(STDLIBS_END)
TGT_LD_FLAGS_PRE += -static $(STDLIBS_BEGIN)
TGT_LD_FLAGS_POST += --start-group $(STDLIBS_LIBS_L) --end-group $(STDLIBS_END)

$(BIN)/%.linux : $(BIN)/%.linux.tmp
	$(QM)$(ECHO) " [FINISH] $@"
	$(Q)cp -p $< $@
</code>

=== Linker script ===

Amazingly, the default ''ld'' scripts work with just the addition of tables (see ''include/gpxe/tables.h'') in the ''.data'' section:
<code>
.data :
{
  *(.data .data.* .gnu.linkonce.d.*)
  SORT(CONSTRUCTORS)
  *(SORT(.tbl.*))
}
</code>

==== Being self-contained ====

Work in progress.

===== Subsystems =====

With a kernel API in place, the next step is providing all the necessary subsystems on top of it.

==== Background ====

Subsystems provided by a ''PLATFORM'' can be seen in ''config/defaults/$PLATFORM.h''. Let's look at one of them.

''config/defaults/efi.h'':
<code>
#define UACCESS_EFI
#define IOAPI_EFI
#define PCIAPI_EFI
#define CONSOLE_EFI
#define TIMER_EFI
#define NAP_EFIX86
#define UMALLOC_EFI
#define SMBIOS_EFI
</code>

For each subsystem there is, in general, a corresponding ''include/gpxe/$subsystem.h'' header which includes the headers for specific implementations. Where those headers live depends on whether the implementation is ''ARCH''-specific.

Most of the subsystems are single-implementation APIs, that is, only one implementation of each can be used. See ''include/gpxe/api.h'' for details.

''CONSOLE'' is a bit different, as every ''ARCH''/''PLATFORM'' can have many of them, and hence it has to use another widely adopted concept within gPXE: linker tables. Details are in ''include/gpxe/tables.h''. That header also explains why ''#ifdef''s are bad and why so many objects are compiled despite not being used in the final target.

==== CONSOLE ====

''CONSOLE'' is used for all the input and output that gPXE does. As I/O is trivial in userspace, so is ''LINUX_CONSOLE''. Look at ''include/console.h'' for details on the API.
A slightly simplified ''interface/linux/linux_console.c'':
<code c>
static void linux_putchar(int c)
{
	/* Write exactly one byte; passing &c directly would depend on byte order */
	char ch = c;

	linux_write(1, &ch, 1);
}

static int linux_getchar(void)
{
	char c;

	linux_read(0, &c, 1);
	return c;
}

/* Registered through the linker table of console drivers */
struct console_driver linux_console __console_driver = {
	.putchar = linux_putchar,
	.getchar = linux_getchar,
};
</code>

==== TIMER ====

''TIMER'' is about two things: delaying execution:
<code c>
void udelay(unsigned long usecs);
</code>
and a monotonically increasing counter (used mostly for measuring time intervals):
<code c>
unsigned long currticks(void);
unsigned long ticks_per_sec(void);
</code>

''udelay()'' trivially maps to ''(linux_)usleep()''. ''currticks()'' is a bit trickier, as there is no sensible way of getting the value of ''jiffies'' (the Linux kernel tick counter) in userspace. Instead, ''(linux_)gettimeofday()'' is used to emulate ''1000'' ticks per second, starting on the first call to ''currticks()''.

==== UACCESS ====

''UACCESS'' handles access to different kinds of memory. Currently this is a non-issue in Linux usermode, as it accesses only the process memory, which has flat addressing.

==== UMALLOC ====

''UMALLOC'' provides, as the name suggests, the well-known malloc gang:
<code c>
userptr_t urealloc(userptr_t userptr, size_t new_size);

static inline userptr_t umalloc(size_t size)
{
	return urealloc(UNULL, size);
}

static inline void ufree(userptr_t userptr)
{
	urealloc(userptr, 0);
}
</code>

As can be seen, only ''urealloc()'' needs to be implemented, and it trivially maps to ''(linux_)realloc()''.

==== NAP ====

''NAP'' is about giving the CPU a break:
<code c>
void cpu_nap(void);
</code>

In the context of Linux usermode that means the process giving up the processor, which can be achieved with a simple ''(linux_)usleep(0)''.

==== SMBIOS ====

''SMBIOS'' doesn't seem to be used by anything currently. The Linux implementation just returns an error.

==== IOAPI ====

Not used in Linux usermode currently.

==== PCIAPI ====

Not used in Linux usermode currently.
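
Going back to ''TIMER'' for a moment: the ''currticks()'' emulation described above can be sketched roughly as follows. This is only an illustration, not the actual gPXE code. It calls the plain libc ''gettimeofday()'' instead of the ''linux_'' wrapper so that it compiles on its own, and the ''TICKS_PER_SEC'' macro and the static variable names are made up for this example.

<code c>
#include <stddef.h>
#include <sys/time.h>

/* Assumption for this sketch: 1000 emulated ticks per second, as in the text */
#define TICKS_PER_SEC 1000UL

unsigned long ticks_per_sec(void)
{
	return TICKS_PER_SEC;
}

unsigned long currticks(void)
{
	static struct timeval start;	/* time of the first call */
	static int started;		/* non-zero once start is valid */
	struct timeval now;
	long long elapsed_us;

	gettimeofday(&now, NULL);

	/* The counter starts at zero on the first call */
	if (!started) {
		start = now;
		started = 1;
	}

	elapsed_us = (long long)(now.tv_sec - start.tv_sec) * 1000000LL
		   + (now.tv_usec - start.tv_usec);

	/* gettimeofday() follows the wall clock, which may be set backwards */
	if (elapsed_us < 0)
		elapsed_us = 0;

	/* 1000 ticks per second means one tick per 1000 microseconds */
	return (unsigned long)(elapsed_us / (1000000LL / TICKS_PER_SEC));
}
</code>

Since the wall clock can jump, a counter built this way is only approximately monotonic; the clamp to zero is one possible way to soften that, not something taken from the real implementation.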
===== Networking =====

With the essentials in place, we can look at how networking is provided in Linux usermode.

==== Devices background ====

gPXE handles devices in a hierarchical manner. The building blocks are in ''include/gpxe/device.h''.
<code c>
struct device {
	...
};

struct root_device {
	struct device dev;
	struct root_driver *driver;
};

struct root_driver {
	int (*probe)(struct root_device * rootdev);
	void (*remove)(struct root_device * rootdev);
};
</code>

The basic idea is that you have one ''root_device'' and a corresponding ''root_driver'' per BUS (or something else that makes sense, like Linux usermode). The exact implementation is of course BUS-specific, but a common way of doing things is having ''$BUS_device''s and ''$BUS_driver''s, similar to ''root_device'' and ''root_driver''.

During initialization the ''root_driver'''s ''probe()'' scans the BUS for hardware. Upon finding a device it iterates over all ''$BUS_driver''s looking for the one that can handle it (e.g. in the PCI case based upon the pci-id of the device). A matching driver is supposed to initialize the device. Even more importantly, it is supposed to register a new ''net_device'', which represents a piece of networking hardware (or software in Linux usermode). The ''net_device'' is responsible for transmitting the actual data.

==== Linux usermode devices ====

Linux usermode devices follow the scheme described above. The only difference is that instead of physically scanning the BUS, the Linux ''root_driver'' just iterates over a list of requested devices based on the [[#command_line_options|command line options]]. The details can be seen in ''include/gpxe/linux.h'' and ''drivers/linux/linux.c''.

==== Tap linux driver ====

=== Why tap? ===

Tap was chosen over raw sockets because it has many advantages, and its only disadvantage is a slightly harder setup:
  * makes it possible to connect to localhost
  * easier to tcpdump
  * faster
  * doesn't have to be run with root privileges

=== Implementation ===

The tap driver is as simple as it gets.

''drivers/linux/tap.c'':
<code c>
static int tap_transmit(struct net_device * netdev, struct io_buffer * iobuf)
{
	struct tap_nic * nic = netdev->priv;
	int rc;

	/* Pad to the minimum Ethernet frame length */
	iob_pad(iobuf, ETH_ZLEN);

	rc = linux_write(nic->fd, iobuf->data, iobuf->tail - iobuf->data);
	DBGC(nic, "tap %p wrote %d bytes\n", nic, rc);

	/* The packet has been handed to the kernel, so it is done immediately */
	netdev_tx_complete(netdev, iobuf);

	return 0;
}
</code>

In ''transmit()'' it can just send out the packet immediately with a simple ''(linux_)write()''.

<code c>
static void tap_poll(struct net_device * netdev)
{
	struct tap_nic * nic = netdev->priv;
	int r;
	char buf[RX_BUF_SIZE];
	struct io_buffer * iobuf;

	/* The fd is non-blocking, so the loop ends once no packets are left */
	while ((r = linux_read(nic->fd, buf, RX_BUF_SIZE)) > 0) {
		iobuf = alloc_iob(RX_BUF_SIZE);
		if (!iobuf)
			break;	/* out of memory, give up for now */

		memcpy(iobuf->data, buf, r);
		iob_put(iobuf, r);

		netdev_rx(netdev, iobuf);
	}
}
</code>

In ''poll()'' it can just loop on a non-blocking ''(linux_)read()'' to get all the available packets.

==== Mapping the driver API to kernel ====

Work in progress.

===== Command line options =====

Command line options were introduced to control some aspects of the gPXE usermode. Currently the only option is for setting up a network device:
<code>
--net <driver>[,option=value[,option=value[,...]]]
</code>

The only driver currently is ''tap'' and it requires the ''if'' option, so it's more like:
<code>
--net tap,if=<ifname>[,option=value[,option=value[,...]]]
</code>

Note that ''if'' doesn't have to be the first option. Multiple ''--net'' options can be passed.

==== Implementation ====

The implementation of parsing the command line options is pretty straightforward. It can be seen in ''hci/linux_args.c''.
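
To make the ''--net'' syntax above more concrete, here is a minimal sketch of how the value of a single ''--net'' option could be split into a driver name and ''option=value'' pairs. It is not the code from ''hci/linux_args.c''; the names ''parse_net_arg()'', ''net_opt_get()'', ''struct net_request'', ''struct net_opt'' and ''MAX_OPTS'' are all hypothetical and exist only for this illustration.

<code c>
#include <stddef.h>
#include <string.h>

#define MAX_OPTS 8	/* arbitrary limit chosen for this sketch */

struct net_opt {
	char *name;
	char *value;
};

struct net_request {
	char *driver;
	struct net_opt opts[MAX_OPTS];
	int nopts;
};

/*
 * Split "<driver>[,option=value[,...]]" in place (arg is modified).
 * Returns 0 on success and -1 on a malformed argument.
 */
int parse_net_arg(char *arg, struct net_request *req)
{
	char *item = arg;
	char *comma;
	char *eq;

	req->driver = NULL;
	req->nopts = 0;

	while (item != NULL) {
		/* Terminate the current item at the next comma, if any */
		comma = strchr(item, ',');
		if (comma != NULL)
			*comma = '\0';

		if (req->driver == NULL) {
			/* The first item is the driver name */
			if (*item == '\0')
				return -1;
			req->driver = item;
		} else {
			/* Every later item must look like name=value */
			eq = strchr(item, '=');
			if (eq == NULL || eq == item || req->nopts == MAX_OPTS)
				return -1;
			*eq = '\0';
			req->opts[req->nopts].name = item;
			req->opts[req->nopts].value = eq + 1;
			req->nopts++;
		}

		item = (comma != NULL) ? comma + 1 : NULL;
	}

	return 0;
}

/* Look an option up by name, so the order of options doesn't matter */
const char *net_opt_get(const struct net_request *req, const char *name)
{
	int i;

	for (i = 0; i < req->nopts; i++) {
		if (strcmp(req->opts[i].name, name) == 0)
			return req->opts[i].value;
	}
	return NULL;
}
</code>

With this sketch, parsing ''tap,foo=bar,if=tap0'' (where ''foo'' is a made-up option) gives the driver ''tap'', and ''net_opt_get(&req, "if")'' returns ''tap0'' regardless of where ''if'' appears. A driver such as ''tap'' could then reject the request when its required option is missing.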
==== The other problem with stdlib ====

When linking with stdlib the only way of grabbing command line arguments is by modifying ''core/main.c'', which isn't particularly nice:
<code c>
#ifdef PLATFORM_linux
__asmcall int main ( int argc, char * argv[] ) {
#else
__asmcall int main ( void ) {
#endif
	...
#ifdef PLATFORM_linux
	if (parse_args(argc, argv) != 0) {
		return -1;
	}
#endif
</code>

It can be avoided by implementing a custom ''_start'' routine, which could save ''argc'' and ''argv'' somewhere accessible from a simple ''__init_fn'' (functions that are run as part of the initialization), making the ''core/main.c'' modification unnecessary. That's part of the [[#being_self-contained|being self-contained work]].