====== Joshua Oreman: 802.11 wireless development ======

===== Journal Week 4 =====

==== Monday, 15 June ====

Worked on a bunch of small things today. My video card didn't come until late in the day, so I was only able to do a little iSCSI testing, but what I found was revealing.

Commits:

  * On branch **mainline-review**:
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=18e6470d06d8846d531d97d881be6f1278bd2f15| [nvs] Add init function for Atmel 93C66 EEPROM]] [+0 bytes]
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=d429b31ac28760004e753dc79178400d507975e2| [netdevice] Add netdev argument to link-layer push and pull handlers]] [-12 bytes!]
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=f77b486f42b2b12604ce94d65cbba33b55a589e5| [netdevice] Adjust maximum link-layer header length for 802.11]] [+0 bytes]
  * On branch **wireless**:
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=37f97824d5ac03f8af16afab61d473c99b3ace7c| [802.11] Debug and output cleanup, minor association improvements]]
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=08cfc42994e6f35dc8b0796d863861678f835470| [netdevice] Add print_status callback for link-layer-specific state]]
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=66f79b57a1ea0354b8339fc508c655f2d46d0ec9| [Makefile] Remove -Wformat-nonliteral command-line option]]
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=e418cc6fab843f4831d8f016f75cf6813be4b166| [802.11] Clean up channel and rate handling]]

I have no idea how adding a function argument managed to //decrease// code size, but it did. (Measuring uncompressed text size of gpxe.lkrn.tmp with no debugging.) So far the changes on mainline-review are all the commits I made before yesterday that are not in my rtl8180 driver or in 802.11-specific code.
I want to get a few things sorted out with 802.11 handling before I push it into mainline review - specifically some kind of rate handling, as working at 1 Mbps is a little undesirable :-)

Questions I'd like to get answered by someone with the authority to decide such things:

  * Is my ''print_status()'' addition consistent with the gPXE Way? I think having the 802.11 state is important - if a user says "it's really slow" the first thing we'd ask is an ifstat to show signal strength and transmission speed. An alternative would be to special-case 802.11 in the ''ifstat()'' code using ''net80211_get()''; if we did that, we could only avoid always linking in 802.11 support by using linker tricks like weak symbols.
  * Is removing -Wformat-nonliteral an acceptable tradeoff for 60 bytes of uncompressed code size?
  * How asynchronous should association be? Currently the probe part is synchronous (freezes gPXE for a few seconds) and the rest is asynchronous. I think we should probably push it all in one direction or the other. A blocking association with a status message would have the advantage of obvious error reporting (user doesn't have to check ifstat to see it failed and compile with ''DEBUG=net80211'' to see why), and of preventing the possibility of over-eager networking code sending packets before the link is up, but it detracts somewhat from the "it works just like a wired link with respect to link-up" abstraction.

iSCSI testing using DOS scandisk produced a complete scan with no errors, but one significantly slower than wired and quite jagged in its progress. (The unit of progress, 16 clusters, took anywhere between 3 seconds and a minute.) I did this testing before I fixed a bug in the rate-choosing code; what was supposed to be a "conservative" choice of the first 802.11b rate (1Mbps) actually had its logic inverted to use the first 802.11g rate, which in this case was 18Mbps.
My host-AP and test machines are about a meter apart, which 802.11 RF modulation is not designed for, so I expect there was a great deal of packet loss between the two. The periodic stalling suggests some TCP strangeness that's borne out by the packet captures. I don't know enough about TCP to diagnose this, but I've posted the packet captures for posterity anyway. Both represent a scan of about 500 clusters on the same disk.

  * [[http://etherboot.org/share/oremanj/iSCSI-test-wired.cap.gz|gzipped wired capture file (3.8MB)]]
  * [[http://etherboot.org/share/oremanj/iSCSI-test-wireless.cap.gz|gzipped wireless capture file (2.7MB)]]

**Update:** Testing at 1Mbps produced even more horrible results, with upwards of 43 duplicate ACKs detected from my computer to gPXE for the same segment. For almost every single TCP segment. I think this may be an issue with capturing on the same node that's hosting the Access Point; I'll test more with the WRT54G when I receive it tomorrow.

==== Tuesday, 16 June ====

  * On branch **mainline-review**:
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=a9a0567225493046f70e6252e21ab5c6d8219e87| [image] Modify imgfree command to accept an argument]] [+88 bytes]

No other commits today - I spent the day debugging the iSCSI retransmission issue. It was a productive if slightly maddening time, and I'm hopeful that I'll get this thing squashed tomorrow.

The wireless packet captures referenced yesterday were indeed exacerbated by capturing on the access point; it turns out that running Wireshark on ''mon.wlan0'' (or running ''airodump-ng'') does very poorly in that situation, while Wiresharking on ''wlan0'' works fine. You don't get 802.11 headers, but this is a TCP-level issue where those aren't really necessary.

At about 1am, after setting up a "serial" console over FireWire and poring over the debugging information thus created, I finally identified the root cause of the problem.

gPXE debugging output:

  TCP 0x1da94 TX 1024->3260 3ee9656a..3ee9676a c227fd74 512 PSH ACK   [send a packet]
  TCP 0x1da94 RX 1024<-3260 3ee9676a c227fd74..c227fd74 0 ACK         [packet ACKed ok]
  TCP 0x1da94 TX 1024->3260 3ee9676a..3ee9679a c227fd74 48 PSH ACK    [send another packet]
  TCP 0x1da94 RX 1024<-3260 3ee9676a c227fd74..c227fd74 0 ACK         [that ACK seems to be stale...]
  TCP 0x1da94 TX 1024->3260 3ee9676a..3ee9679a c227fd74 48 PSH ACK    [so resend packet]
  TCP 0x1da94 RX 1024<-3260 3ee9676a c227fd74..c227fd74 0 ACK         [get another stale ACK]
  TCP 0x1da94 TX 1024->3260 3ee9676a..3ee9679a c227fd74 48 PSH ACK    [resend it again]
  TCP 0x1da94 RX 1024<-3260 3ee9679a c227fd74..c227fd74 0 ACK         [finally, ACKed properly]

Wireshark summary output, prettied up a bit and with sequence and ACK numbers in absolute hex:

  iSCSI SCSI: Data Out LUN: 0x00 (Write(10) Request Data) Seq=3ee9656a
  TCP   iscsi-target > 1024 [ACK] Ack=3ee9676a
  TCP   [TCP segment of a reassembled PDU] Seq=3ee9676a
  TCP   iscsi-target > 1024 [ACK] Ack=3ee9679a
  TCP   [TCP Retransmission] [TCP segment of a reassembled PDU] Seq=3ee9676a
  TCP   [TCP Dup ACK 4864#1] iscsi-target > 1024 [ACK] Ack=3ee9679a
  TCP   [TCP Retransmission] [TCP segment of a reassembled PDU] Seq=3ee9676a
  TCP   [TCP Dup ACK 4864#2] iscsi-target > 1024 [ACK] Ack=3ee9679a

gPXE thinks the iSCSI target is ACKing the packet //before// the one it just sent, so good TCP denizen that it is, it assumes the packet it just sent didn't get through and resends it. This continues an increasing number of times; the capture just shown is from early on, with only 2 retransmissions per packet, but I've seen it over 75 per packet. At some point, about a minute after the retransmissions start to occur, they abruptly cut off and the link is fine for another minute or so. Wireshark shows that the iSCSI target is actually ACKing the packet gPXE just sent.
This could be a bug in gPXE's TCP stack (unlikely), an rtl8180 driver-level issue causing it to resubmit stale received packets, memory corruption somewhere, or something to do with 802.11's longer link-layer header. Tomorrow I'll try to figure out which one it is. Wheee!

==== Wednesday, 17 June ====

It was duplicate ACKs: a silly bug (signed versus unsigned) in the 802.11 layer caused the duplicate RX elimination code to only work half the time, and gPXE's TCP stack did not elegantly handle the duplicate ACKs thus generated. (802.11 can generate duplicate packets when a packet is received but its link-layer ACK is not, causing a retransmission which is also received.) I've patched the issue in both the 802.11 layer and the TCP stack, since TCP is meant to be resilient against such things. RFC 793 allows my fix: "If the ACK is a duplicate, it can be ignored" (p.72).

I also found an unrelated bug in rtl8180 that caused it to cycle through its whole TX ring whenever one packet was completed, reporting the spurious TX completions with iob set to NULL. I believe the Linux driver does this too. No symptoms, but it's best to fix such things.

Thus, commits:

  * On branch **wireless**:
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=5e27fab092e4be7ee2bcfb466c36e90b9895d2bc| [802.11] Fix packet duplication elimination state]]
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=0b6003f7167d2d80876053c8908e394cf5dd246e| [drivers rtl8180] Only report TX status once per packet]]
  * On branch **mainline-review**:
    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=8041741323b40d9f5c482d3c6e1391bee7be759d| [tcp] Ignore duplicate ACKs in TCP ESTABLISHED state]] [+16 bytes]

Remaining things for this week: rate control, answers to the questions from Monday's entry, and pushing 802.11 code to mainline-review after I get the first two sorted out.
Figuring out the iSCSI issue took much longer than I anticipated, but hopefully I'll still be able to get everything done.
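
For readers who want to see the RFC 793 rule quoted in Wednesday's entry in concrete form, here is a minimal sketch in C. It is not the code from the [tcp] commit: ''struct tcp_sketch'', ''seq_diff()'' and ''handle_ack()'' are hypothetical names made up for illustration, and the sketch assumes that an ACK which does not advance the oldest unacknowledged sequence number can simply be dropped without triggering a resend.

```c
#include <stdint.h>

/* Hypothetical connection state; these names are illustrative and are
 * not gPXE's actual TCP structures. */
struct tcp_sketch {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t snd_nxt;	/* next sequence number to be sent */
};

enum ack_action {
	ACK_IGNORE,	/* duplicate or stale: acknowledges nothing new */
	ACK_ADVANCE,	/* acknowledges previously unacknowledged data */
	ACK_INVALID	/* acknowledges data that was never sent */
};

/* Signed distance from b to a in sequence space. TCP sequence numbers
 * wrap modulo 2^32, so ordering is decided by the signed difference.
 * (The cast assumes two's complement conversion, as on common compilers.) */
int32_t seq_diff(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b);
}

/* Classify an incoming ACK number and update snd_una only when the ACK
 * covers new data. Assumption: ACKs at or behind snd_una are ignored
 * outright instead of being treated as a reason to retransmit. */
enum ack_action handle_ack(struct tcp_sketch *tcp, uint32_t ack)
{
	if (seq_diff(ack, tcp->snd_una) <= 0)
		return ACK_IGNORE;
	if (seq_diff(ack, tcp->snd_nxt) > 0)
		return ACK_INVALID;
	tcp->snd_una = ack;
	return ACK_ADVANCE;
}
```

Plugging in the numbers from Tuesday's log (''snd_una'' = 3ee9676a, ''snd_nxt'' = 3ee9679a), an ACK of 3ee9676a is classified as ''ACK_IGNORE'' rather than prompting a resend, while an ACK of 3ee9679a advances ''snd_una''.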