====== Differences ====== This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
soc:2008:stefanha:journal:week4 [2008/06/16 04:18] stefanha created |
soc:2008:stefanha:journal:week4 [2008/06/23 04:58] (current) stefanha |
||
---|---|---|---|
Line 5: | Line 5: | ||
**Milestones:** | **Milestones:** | ||
* Get latest GDB stub work into mainline. | * Get latest GDB stub work into mainline. | ||
+ | * Modern bzImage prefix for gPXE. | ||
==== Mon Jun 16 ==== | ==== Mon Jun 16 ==== | ||
- | The GDB stub is almost ready for mainline review. I need to polish it a little and update the documentation. The real test will be when developers begin using it for real work. | + | **The ''gdbstub2'' branch is now ready for mainline review**. Diffs against gPXE ''master'' are [[http://etherboot.org/share/stefanha/gdbstub2.diff|here]]. Once it is merged I will update the documentation and encourage others to use GDB. |
- | I want to get the GDB stub work ready today or tomorrow at latest. Once it goes into mainline I will put effort into making developers know that GDB debugging is there and how to use it. | + | **gPXE needs modern bzImage support so that GRUB, lilo, and SYSLINUX can load it**. This is my next piece of work after the GDB stub. There is already code in etherboot to make a bzImage. The old code doesn't work by default on today's popular bootloaders since the Linux bzImage header it supplies is outdated. I am investigating what needs to be done for GRUB, lilo, SYSLINUX, etherboot, and gPXE to load a gPXE bzImage. |
- | Next steps: | + | ==== Tue Jun 17 ==== |
- | * Choose and document a simple way to manually call into the debugger. | + | Git commit: |
- | * Improve flow control so that GDB does not print warnings. | + | * [[http://git.etherboot.org/?p=people/stefanha/gpxe.git;a=commit;h=bfd885802fd6af9938f2b703f6c48a9259cd7657|[bzImage] Make gpxe.lkrn a zImage 2.07]] |
- | * Update [[:dev:gdbstub|GDB stub page]] and screencast when UDP code is merged into mainline. See [[http://grub.enbug.org/DebuggingWithGDB|GRUB GDB wiki page]] for inspiration. | + | |
+ | **I am trying out bootloaders on ''gpxe.lkrn'' images**. We were afraid that the outdated Linux zImage prefix no longer works with modern bootloaders. Here are results for unmodified gPXE (I have not yet attempted to implement bzImage): | ||
+ | * **GRUB** boots ''gpxe.lkrn'' successfully. Here is a script to create a GRUB/gPXE boot floppy: | ||
+ | <code> | ||
+ | #!/bin/sh | ||
+ | set -e | ||
+ | dd if=/dev/zero of=grub.img bs=1024 count=1440 | ||
+ | losetup /dev/loop0 grub.img | ||
+ | mkfs /dev/loop0 | ||
+ | mount /dev/loop0 /mnt | ||
+ | mkdir -p /mnt/boot/grub | ||
+ | cp /boot/grub/stage1 /boot/grub/stage2 /mnt/boot/grub/ | ||
+ | cat >/mnt/boot/grub/menu.lst <<EOF | ||
+ | title=gPXE | ||
+ | root (fd0) | ||
+ | kernel /boot/gpxe.lkrn | ||
+ | EOF | ||
+ | cp bin/gpxe.lkrn /mnt/boot/ | ||
+ | umount /mnt | ||
+ | grub --device-map=/dev/null <<EOF | ||
+ | device (fd0) /dev/loop0 | ||
+ | root (fd0) | ||
+ | setup (fd0) | ||
+ | quit | ||
+ | EOF | ||
+ | losetup -d /dev/loop0 | ||
+ | </code> | ||
+ | |||
+ | * **SYSLINUX** boots ''gpxe.lkrn'' successfully. Here is a script to create a boot floppy: | ||
+ | <code> | ||
+ | #!/bin/sh | ||
+ | set -e | ||
+ | dd if=/dev/zero of=syslinux.img bs=1024 count=1440 | ||
+ | mkfs.msdos syslinux.img | ||
+ | mount -o loop syslinux.img /mnt | ||
+ | cp bin/gpxe.lkrn /mnt/gpxe.zi | ||
+ | cat >/mnt/SYSLINUX.CFG <<EOF | ||
+ | default gpxe.zi | ||
+ | EOF | ||
+ | umount /mnt | ||
+ | syslinux syslinux.img | ||
+ | </code> | ||
+ | |||
+ | * **lilo** fails to boot ''gpxe.lkrn'': QEMU stops with a triple fault. I still need to look into this. Here is a script to create a boot floppy: | ||
+ | <code> | ||
+ | #!/bin/sh | ||
+ | set -e | ||
+ | dd if=/dev/zero of=lilo.img bs=1024 count=1440 | ||
+ | losetup /dev/loop0 lilo.img | ||
+ | mkfs /dev/loop0 | ||
+ | mount /dev/loop0 /mnt | ||
+ | mkdir /mnt/etc /mnt/boot | ||
+ | cp bin/gpxe.lkrn /mnt/gpxe.zi | ||
+ | cat >/mnt/etc/lilo.conf <<EOF | ||
+ | boot =/dev/loop0 | ||
+ | disk =/dev/loop0 | ||
+ | bios =0x00 # 1.44MB disk geometry | ||
+ | sectors =18 | ||
+ | heads =2 | ||
+ | cylinders =80 | ||
+ | install =/mnt/boot/boot.b | ||
+ | map =/mnt/boot/map | ||
+ | backup =/dev/null | ||
+ | image =/mnt/gpxe.zi | ||
+ | EOF | ||
+ | /tmp/lilo/sbin/lilo -C /mnt/etc/lilo.conf | ||
+ | umount /mnt | ||
+ | losetup -d /dev/loop0 | ||
+ | </code> | ||
+ | |||
+ | * **gPXE** fails to boot ''gpxe.lkrn'' because it supports only the newer bzImage format, not the old zImage format. Testing was easy: | ||
+ | <code> | ||
+ | qemu -bootp gpxe.lkrn -tftp bin bin/gpxe.usb | ||
+ | </code> | ||
+ | |||
+ | * **Etherboot 5.4.3** boots ''gpxe.lkrn'' successfully. I used [[http://freshmeat.net/projects/wraplinux/|wraplinux]] to make an NBI file from ''gpxe.lkrn''. | ||
+ | |||
+ | **Updated ''lkrnprefix.S'' to zImage 2.07**. The image is still only a zImage since the non-real code loads at 0x10000. A bzImage loads non-real code at 0x100000, i.e. right after the 1 MB low memory. Perhaps ''gpxe.lkrn'' can be a full bzImage, but I think that the A20 line will prevent us from accessing 0x100000. | ||
+ | |||
+ | * **GRUB** boots successfully. | ||
+ | * **Lilo** still fails. I need to investigate this; I am probably not using it properly. | ||
+ | * **SYSLINUX** boots successfully. | ||
+ | * **gPXE** boots successfully with a small patch to ''bzimage.c''. Need to discuss this with mcb30. | ||
+ | * **Etherboot** boots successfully. | ||
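The zImage/bzImage distinction described above can be checked directly on an image file. The following is a minimal sketch, not part of gPXE: it assumes the field offsets from the Linux x86 boot protocol (''HdrS'' magic at 0x202, 16-bit little-endian version at 0x206, ''loadflags'' at 0x211 with bit 0 meaning "loaded high"), and the helper names ''read_bytes'' and ''check_image'' are made up for illustration.

```shell
#!/bin/sh
# Sketch: report the boot protocol version and load type of a kernel image.
# Offsets assumed from the Linux x86 boot protocol:
#   0x202 (514): "HdrS" magic
#   0x206 (518): version, little-endian, (major << 8) + minor
#   0x211 (529): loadflags, bit 0 = LOADED_HIGH (code loads at 0x100000)

# read_bytes FILE OFFSET COUNT
# Print COUNT bytes starting at OFFSET as space-separated decimal values.
read_bytes() {
    # Unquoted expansion collapses od's padding into single spaces.
    echo $(od -An -v -t u1 -j "$2" -N "$3" "$1" 2>/dev/null)
}

# check_image FILE (hypothetical helper, not a gPXE tool)
check_image() {
    img=$1
    # "HdrS" is 72 100 114 83 in decimal.
    if [ "$(read_bytes "$img" 514 4)" != "72 100 114 83" ]; then
        echo "$img: no HdrS magic (pre-2.00 header or not a kernel image)"
        return 1
    fi
    # Little-endian: the first byte is the minor, the second the major.
    set -- $(read_bytes "$img" 518 2)
    minor=$1
    major=$2
    flags=$(read_bytes "$img" 529 1)
    if [ $(( ${flags:-0} & 1 )) -eq 1 ]; then
        kind="bzImage (loads at 0x100000)"
    else
        kind="zImage (loads at 0x10000)"
    fi
    printf '%s: boot protocol %d.%02d, %s\n' "$img" "$major" "$minor" "$kind"
}
```

After sourcing the functions, ''check_image bin/gpxe.lkrn'' prints one summary line. For an image that carries a 2.07 header with the "loaded high" bit clear, it should report boot protocol 2.07 and zImage.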
+ | |||
+ | ==== Wed Jun 18 ==== | ||
+ | **Lilo still triple-faults when loading ''gpxe.lkrn''**. I set up a virtual machine with [[http://damnsmalllinux.org/|Damn Small Linux]] to ensure a clean environment. The DSL kernel boots successfully while ''gpxe.lkrn'' fails. Here is the triple fault information from QEMU: | ||
+ | <code> | ||
+ | qemu: fatal: triple fault | ||
+ | EAX=60000000 EBX=0000fee8 ECX=00002900 EDX=00001d8a | ||
+ | ESI=0001ffff EDI=0000ff51 EBP=0000f9c4 ESP=0000f96e | ||
+ | EIP=0000074c EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0 | ||
+ | ES =0018 00000000 ffffffff 00cf9300 | ||
+ | CS =0008 0000f600 0000ffff 00009b00 | ||
+ | SS =0010 00090000 0000ffff 00009309 | ||
+ | DS =0018 00000000 ffffffff 00cf9300 | ||
+ | FS =0018 00000000 ffffffff 00cf9300 | ||
+ | GS =0018 00000000 ffffffff 00cf9300 | ||
+ | LDT=0000 00000000 0000ffff 00008000 | ||
+ | TR =0000 00000000 00000000 00000000 | ||
+ | GDT= 0009f99c 0000001f | ||
+ | IDT= 00000000 000003ff | ||
+ | CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000 | ||
+ | CCS=00000000 CCD=0000f97e CCO=ADDB | ||
+ | FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80 | ||
+ | FPR0=0000000000000000 0000 FPR1=0000000000000000 0000 | ||
+ | FPR2=0000000000000000 0000 FPR3=0000000000000000 0000 | ||
+ | FPR4=0000000000000000 0000 FPR5=0000000000000000 0000 | ||
+ | FPR6=0000000000000000 0000 FPR7=0000000000000000 0000 | ||
+ | XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000 | ||
+ | XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000 | ||
+ | XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000 | ||
+ | XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000 | ||
+ | Aborted | ||
+ | </code> | ||
+ | |||
+ | I don't see an obvious clue in the crash dump, so I'll wait until after speaking with mcb30 about bzImage. If we decide to go in a different direction, time spent debugging this would be wasted. | ||
+ | |||
+ | In the meantime I'll investigate real-mode GDB debugging. I already tried ''set architecture i8086'' for 16-bit disassembly. GDB still treats memory as a flat 32-bit space and will probably require some address translation inside the GDB stub. | ||
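Until the stub does this translation itself, real-mode addresses can be converted by hand before handing them to GDB. A small sketch of the standard real-mode rule (linear = segment * 16 + offset); the helper name ''linear_addr'' is hypothetical, and wrap-around at 1 MB when A20 is disabled is deliberately ignored:

```shell
#!/bin/sh
# linear_addr SEGMENT OFFSET
# Convert a real-mode segment:offset pair to the flat linear address
# that GDB expects. Arguments may be decimal or 0x-prefixed hex.
# This does not model wrap-around at 1 MB when the A20 line is disabled.
linear_addr() {
    printf '0x%05x\n' $(( (($1) << 4) + ($2) ))
}
```

For example, ''linear_addr 0x9000 0xf96e'' prints ''0x9f96e'', which can then be used with GDB commands such as ''x/i''. This applies only to real mode; in protected mode the segment base comes from the descriptor table instead.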
+ | |||
+ | Another concern is that loading ''gpxe.lkrn'' recursively fails. That potentially means you cannot load another zImage after gPXE has been loaded from ''gpxe.lkrn''. <del>My theory is that gPXE has been loaded to the default zImage load address, i.e. 0x10000. If gPXE then tries to load another image there, it overwrites itself and crashes</del>. It looks unlikely that gPXE is overwriting itself because it relocates as high up as possible. | ||
+ | |||
+ | ==== Thu Jun 19 ==== | ||
+ | Git commit: [[http://git.etherboot.org/?p=people/stefanha/gpxe.git;a=commit;h=2c5b2a45b7c33b1cd45e99c29f0a833c458e2ecb|[b44] Create skeleton driver for Broadcom 4401 NIC]] | ||
+ | |||
+ | **Brought up ROM-o-matic for Etherboot top-of-git-tree**. I have been occasionally assisting mdc with his [[http://rom-o-matic.net/|ROM-o-matic.net]] online boot ROM generator. He recently enabled ROM-o-matic for gPXE top-of-git-tree. That way users can get ROMs for the latest development version of gPXE without having to set up a development environment and build from source. This is now also possible for Etherboot. | ||
+ | |||
+ | **Beginning work to port Linux b44 (Broadcom 4401) driver**. My laptop has a BCM4401-B0 NIC, which is currently not supported by gPXE. The idea is to port the Linux driver to gPXE. I am looking forward to learning more about network drivers and device driver development in general. | ||
+ | |||
+ | ==== Fri Jun 20 ==== | ||
+ | Git commit: [[http://git.etherboot.org/?p=people/stefanha/gpxe.git;a=commit;h=598d23c5b0ec778dcbaafe634569ca9f8f8d1854|[b44] Minimal TX path]] | ||
+ | |||
+ | **The b44 driver is transmitting Ethernet frames**. Thanks to Michael Decker's excellent [[:soc:2008:mdeck:notes:gpxe_driver_api|gPXE Driver API Documentation]] I got the skeleton for the driver working very quickly last night. This morning I started porting the Linux b44 driver code. | ||
+ | |||
+ | After getting the initialization working (mainly by copy-paste) and reading the MAC address from the card, I decided to pursue the TX path. Getting transmit working early is useful since gPXE will attempt to do DHCP automatically and therefore needs to send packets. | ||
+ | |||
+ | Copy-pasting the Linux driver was not a good tactic since the Linux code is much more complex. Eventually I just focused on understanding how the hardware supports transmitting frames (there is no public documentation available!), and then implemented a simple TX path resembling the gPXE natsemi driver. | ||
+ | |||
+ | ==== Sat Jun 21 ==== | ||
+ | Git commit: [[http://git.etherboot.org/?p=people/stefanha/gpxe.git;a=commit;h=39144f971c261bcd92f5052059d444d742b0010f|[b44] Working RX path]] | ||
+ | |||
+ | **The b44 driver is receiving Ethernet frames**. I just booted PXELINUX and HTTP-booted Linux 2.6.25 on this card for the first time! Getting the RX path working has been painful. | ||
+ | |||
+ | I think some of the Linux driver code is misleading/incorrect. Luckily there are drivers for OpenBSD, FreeBSD, and Solaris. Those drivers might even be based on the Linux driver, but they do some things differently and it helps to compare them to each other. My main issue with the RX path was a comment in the Linux driver claiming that the hardware writes a header structure 30 bytes //before// the DMA address of the I/O buffer. | ||
+ | |||
+ | This is false. The Linux driver does offset the DMA address by 30 bytes, but it also offsets the IO buffer by 30 bytes. In the end, it makes no difference and all that has happened is that 30 bytes of the IO buffer have been wasted. The header structure gets written //to// the DMA address, not before it. | ||
+ | |||
+ | The next steps for the b44 driver are cleaning it up, making it robust, and testing. Most of the initialization code is straight from the Linux driver. I want to get to grips with it and then simplify it for gPXE. | ||
+ | |||
+ | I have omitted performance optimizations from the Linux driver. The Linux driver has a "copy threshold" which dictates whether to copy a received packet to a fresh IO buf to hand off to the network stack, or whether to remove the current IO buf from the RX ring and pass it straight to the network stack (allocating a fresh IO buf for the RX ring in its place). I'll talk to Balaji about performance measurement since he's been optimizing his USB driver. | ||
+ | |||
+ | **Lilo bzImage debugging still underway**. I made a little bit of progress tonight by determining that the triple fault happens in the call to ''install''. I think that EIP goes astray somewhere inside ''install'', hence the triple fault. I'm sure the issue triggers inside ''install'' since I've placed infinite loops before and after the call, and the loop after the call is never reached. | ||
+ | |||
+ | My current debugging cycle is to boot Damn Small Linux in QEMU, copy over my latest ''gpxe.lkrn'', run ''lilo'', and reboot into gPXE. This is slow and frustrating. I need to script it, but my DSL install seems to be read-only. | ||
+ | |||
+ | ===== Next week ===== | ||
+ | On to [[.:week5|Week 5]]. |