====== Differences ====== This shows you the differences between two versions of the page.

soc:nikhilcrao [2006/06/15 13:10]
nikhilcrao
soc:nikhilcrao [2006/07/17 21:13] (current)
nikhilcrao
Line 1: Line 1:
 ====== Nikhil Rao, Implementing IPv6 support in gPXE ====== ====== Nikhil Rao, Implementing IPv6 support in gPXE ======
 +
 +
 +//​Quicklink:​ [[revised timeline|Timeline]]//​
  
 ===== About the project ===== ===== About the project =====
 My project is titled Implementing IPv6 support in gPXE. Here are some snippets from [[soc:​nikhil:​proposal|my proposal]]. My project is titled Implementing IPv6 support in gPXE. Here are some snippets from [[soc:​nikhil:​proposal|my proposal]].
 +
  
 ==== Synopsis ==== ==== Synopsis ====
Line 45: Line 49:
  
 I have tried to update my blog as frequently as possible with my thoughts. Here are some important notes which have been edited after IRC conversations,​ e-mails and IM sessions with Michael and Marty. I have tried to update my blog as frequently as possible with my thoughts. Here are some important notes which have been edited after IRC conversations,​ e-mails and IM sessions with Michael and Marty.
 +
 +==== Implementing TCP ====
  
 ==== gPXE network infrastructure ==== ==== gPXE network infrastructure ====
Line 255: Line 261:
 Internally, we use packet buffers to hold packets and information about them. When the uIP stack is invoked, data is transferred from the packet buffer into the uip_buffer (which is a statically allocated space for the data in uIP). This is an inefficient use of memory. Internally, we use packet buffers to hold packets and information about them. When the uIP stack is invoked, data is transferred from the packet buffer into the uip_buffer (which is a statically allocated space for the data in uIP). This is an inefficient use of memory.
  
-==== Redesigning ​the network infrastructure ​==== +==== Notes about the interface ​====
- +
-Over the last couple of weeks, after many IRC conversations,​ e-mails and IM sessions with Michael and Marty, I have come up with the following interface structure.+
  
 In addition to the net_header, net_protocol,​ ll_header and ll_protocol data structures, we would need to represent trans_protocol,​ tcp_header and udp_header. We can first discuss UDP, since it is the simpler of the two transport protocols. We can then extend our discussions to TCP. In addition to the net_header, net_protocol,​ ll_header and ll_protocol data structures, we would need to represent trans_protocol,​ tcp_header and udp_header. We can first discuss UDP, since it is the simpler of the two transport protocols. We can then extend our discussions to TCP.
  
-transport layer protocol is represented as:
 +We can use the following structures to represent a UDP header, TCP header and a transport protocol respectively:
 + 
 +  struct udp_header { 
 +    uint16_t source_port;​ // source port number 
 +    uint16_t dest_port; // destination port number 
 +    uint16_t length; // length of the udp segment 
 +    uint16_t chksum; // UDP checksum 
 +  }; 
 + 
 +  struct tcp_header { // without options 
 +    uint16_t source_port;​ // source port number 
 +    uint16_t dest_port; // destination port number 
 +    uint32_t seq_num; // sequence number 
 +    uint32_t ack_num; // acknowledgement number 
 +    uint8_t offset; // offset, the last four bits are 0000 (reserved) 
 +    uint8_t flags; // flags, the first two bits are 00 when ECN is not used 
 +    uint16_t window; // advertised window 
 +    uint16_t chksum; // TCP checksum 
 +    uint16_t urg_ptr; // urgent data pointer, not used in our stack 
 +  };
  
   struct trans_protocol {   struct trans_protocol {
Line 289: Line 312:
   trans_protocol->​rx_process(pkb)   trans_protocol->​rx_process(pkb)
  
-The checksum offset is the offset of the 16 bit checksum in the transport layer header. This value for TCP is 16 octets ​and for UDP is 6 octets. We can use the following structures ​to represent a UDP header and TCP header respectively:+The ''​rx_process()''​ function processes ​the transport layer headers ​and passes ​the information ​to the application layer. The application'​s callback function ''​xxx_operations::​newdata()''​ is invoked to send data to the application layer.
  
-  struct udp_header { 
-    uint16_t source_port;​ // source port number 
-    uint16_t dest_port; // destination port number 
-    uint16_t length; // length of the udp segment 
-    uint16_t chksum; // UDP checksum 
-  }; 
- 
-  struct tcp_header { // without options 
-    uint16_t source_port;​ // source port number 
-    uint16_t dest_port; // destination port number 
-    uint32_t seq_num; // sequence number 
-    uint32_t ack_num; // acknowledgement number 
-    uint8_t offset; // offset, the last four bits are 0000 (reserved) 
-    uint8_t flags; // flags, the first two bits are 00 when ECN is not used 
-    uint16_t window; // advertised window 
-    uint16_t chksum; // TCP checksum 
-    uint16_t urg_ptr; // urgent data pointer, not used in our stack 
-  }; 
  
-==== Receiving packets ​[using UDP/​IPv4/​Ethernet] ====+==== Receiving packets ​====  
 +**(using UDP/​IPv4/​Ethernet)**
  
 === Device layer processing === === Device layer processing ===
  
-Receiving a packet is completed in a single time slice in which net_step() runs. net_step() calls net_poll() which polls the device for data. If data is available, it enqueues the data in rx_queue and sets the link layer protocol in the packet buffer to the protocol implemented in the network device. net_step() then checks if a packet is available to process and calls net_rx_process() to process the packet.
 +Receiving a packet is completed in a single time slice in which ''net_step()'' runs. ''net_step()'' calls ''net_poll()'', which polls the device for data. If data is available, it enqueues the data in ''rx_queue'' and sets the link layer protocol in the packet buffer to the protocol implemented in the network device. ''net_step()'' then checks if a packet is available to process and calls ''net_rx_process()'' to process the packet.
  
 === Link layer processing === === Link layer processing ===
  
-net_rx_process() parses the link layer header in the received packet. It determines which network layer protocol to use and sets the network protocol pointer of the packet buffer to this protocol. It strips off the link layer header and sends the packet to the network layer using the rx_process() routine of the network layer protocol+''​net_rx_process()'' ​parses the link layer header in the received packet. It determines which network layer protocol to use and sets the network protocol pointer of the packet buffer to this protocol. It strips off the link layer header and sends the packet to the network layer using the ''​rx_process()'' ​routine of the network layer protocol
  
   net_protocol = find_net_protocol(llhdr.net_proto);​   net_protocol = find_net_protocol(llhdr.net_proto);​
Line 326: Line 332:
 === Network layer processing === === Network layer processing ===
  
-Currently ipv4_protocol.rx_process() points to ipv4_rx(), which copies the packet to uip_buf and invokes the uIP stack. We need to replace this function with one that processes the network layer headers and transmits the packet to the transport layer. Something that looks like this:+Currently ​''​ipv4_protocol.rx_process()'' ​points to ''​ipv4_rx()''​, which copies the packet to ''​uip_buf'' ​and invokes the uIP stack. We need to replace this function with one that processes the network layer headers and transmits the packet to the transport layer. Something that looks like this:
  
   struct ipv4_hdr {   struct ipv4_hdr {
Line 395: Line 401:
   }   }
  
-There are bunch of functions which need to be implemented. ipv4_reassemble() for example, takes the fragment, reassembles the whole packet and then calls trans_rx_send() to send it to the transport layer.
 +There are a bunch of functions which need to be implemented. ''ipv4_reassemble()'', for example, takes the fragment, reassembles the whole packet and then calls ''trans_rx_send()'' to send it to the transport layer.
  
 === Transport layer processing === === Transport layer processing ===
  
-Let us assume that UDP is the transport layer protocol specified in iphdr->​protocol (=17), then trans_protocol->​rx_process() points to udp_rx_process(). UDP processing is simple: calculate and verify the checksum, demux and get the connection, invoke the application'​s callback functions.+Let us assume that UDP is the transport layer protocol specified in ''​iphdr->​protocol'' ​(=17), then ''​trans_protocol->​rx_process()'' ​points to ''​udp_rx_process()''​. UDP processing is simple: calculate and verify the checksum, demux and get the connection, invoke the application'​s callback functions.
  
   static int udp_rx_process(struct pk_buff *pkb) {   static int udp_rx_process(struct pk_buff *pkb) {
   struct udp_header *udphdr;   struct udp_header *udphdr;
   ​   ​
-  /* We need to create the UDP pseudo-header in order to compute the checksum. This depends on the network layer protocol. We could encapsulate this information in the generic network layer header but we need to pass that around along with the packet buffer. Another idea is to store the pseudo-header in the packet buffer in the network layer and use it to compute the cheksum.+  /* We need to create the UDP pseudo-header in order to compute the checksum. ​ 
 +   ​* ​This depends on the network layer protocol. We could store the pseudo-header ​ 
 +   * in the packet buffer in the network layer and use it to compute the checksum.
   * // in trans_rx_send(pkb) - after find_trans_protocol()   * // in trans_rx_send(pkb) - after find_trans_protocol()
   * ...   * ...
Line 423: Line 431:
   }   }
  
-TCP would require a much more complicated trans_protocol->​rx_process() function which would check the state of the TCP connection and proceed accordingly.+TCP would require a much more complicated ​''​trans_protocol->​rx_process()'' ​function which would check the state of the TCP connection and proceed accordingly. 
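Coming back to the UDP checksum step mentioned above: for IPv4 the pseudo-header is 12 octets (source address, destination address, a zero octet, the protocol number 17 and the UDP length). A minimal sketch of the computation follows. The function names ''udp4_checksum()'', ''csum_add()'' and ''csum_fold()'' are my own for illustration and are not part of gPXE; the addresses are assumed to be passed in as raw bytes in network order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add the big-endian 16-bit words of buf to a running one's-complement sum. */
static uint32_t csum_add(uint32_t sum, const uint8_t *buf, size_t len) {
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8; /* pad the odd trailing octet with zero */
    return sum;
}

/* Fold the carries back in and take the one's complement. */
static uint16_t csum_fold(uint32_t sum) {
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Checksum over the IPv4 pseudo-header plus the UDP header and payload.
 * src and dst are 4-octet addresses in network order; udp points at the
 * UDP header and udp_len covers header and payload. */
uint16_t udp4_checksum(const uint8_t src[4], const uint8_t dst[4],
                       const uint8_t *udp, uint16_t udp_len) {
    uint8_t pseudo[12];
    uint32_t sum;

    memcpy(pseudo, src, 4);
    memcpy(pseudo + 4, dst, 4);
    pseudo[8] = 0;                        /* zero octet */
    pseudo[9] = 17;                       /* protocol number for UDP */
    pseudo[10] = (uint8_t)(udp_len >> 8); /* UDP length, big-endian */
    pseudo[11] = (uint8_t)(udp_len & 0xff);

    sum = csum_add(0, pseudo, sizeof(pseudo));
    sum = csum_add(sum, udp, udp_len);
    return csum_fold(sum);
}
```

On the receive side, running this over a segment whose checksum field is already filled in should give 0 for an intact datagram. An IPv6 version would only need a different pseudo-header, which is why keeping the pseudo-header construction in the network layer looks attractive.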
 + 
 + 
 + 
 +==== IP fragment reassembly ==== 
 + 
 +IPv4 and IPv6 both support fragmentation and reassembly of IP packets, but the manner in which they do it is very different. IPv4 carries the fragmentation fields in its regular header, while IPv6 uses an extension header for fragmented packets.
 + 
 +gPXE should have a framework to support IP fragment reassembly, whichever IP protocol is used. The ''​reass_buffer''​ structure can be used to handle the reassembly. 
 + 
 +  struct reass_buffer { 
 +    uint16_t ident; // identification number 
 +    uint8_t net_addr[MAX_NET_ADDR_LEN];​ // datagram source address 
 +    struct net_protocol *net_protocol; // network layer protocol
 +    struct pk_buff *reass_pkb; // the reassembled packet
 +    // one of these two:
 +    struct bitmap *bitmap; // bitmap to check if all the fragments have been received
 +    struct list_head frag_headers; // list of fragment headers to do the above task
 +    uint8_t flags; // flags - bitwise OR of zero or more IP_FRAG_XXX values
 +    struct retry_timer reass_timer;​ // reassembly timer 
 +  }; 
 + 
 +We can create an instance of the ''​reass_buffer''​ structure every time a new fragment series is started. This way, I think multiple reassembly processes can be handled at any given time. We also need to maintain a list of all reassembly buffers. 
 + 
 +We can collect fragments of the same fragment series using the identification number and the source network address to determine which fragment series a fragment belongs to. Every time a new fragment arrives, its ''(ident, source_addr)'' is compared with each reassembly buffer's ''(ident, source_addr)'', and if it matches, the contents are merged into that buffer's ''reass_pkb''.
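The lookup described above could be sketched like this. It is only an illustration: ''struct reass_entry'' is a cut-down stand-in for ''reass_buffer'' with a plain ''next'' pointer instead of a list head, ''reass_find()'' is a hypothetical name, and the value of ''MAX_NET_ADDR_LEN'' is an assumption (16 octets, enough for an IPv6 address).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NET_ADDR_LEN 16 /* assumption: big enough for an IPv6 address */

/* Cut-down stand-in for reass_buffer: only the fields used for matching. */
struct reass_entry {
    uint16_t ident;                     /* identification number */
    uint8_t net_addr[MAX_NET_ADDR_LEN]; /* datagram source address */
    struct reass_entry *next;           /* next reassembly buffer in the list */
};

/* Return the reassembly buffer matching (ident, source address), or NULL
 * if this fragment starts a series we have not seen yet. */
struct reass_entry *reass_find(struct reass_entry *head, uint16_t ident,
                               const uint8_t *src, size_t addr_len) {
    struct reass_entry *e;

    if (addr_len > MAX_NET_ADDR_LEN)
        return NULL;
    for (e = head; e != NULL; e = e->next) {
        if (e->ident == ident && memcmp(e->net_addr, src, addr_len) == 0)
            return e;
    }
    return NULL;
}
```

A NULL result is the cue to create a new reassembly buffer and add it to the list.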
 + 
 +''reass_pkb'' is a packet buffer in which the actual reassembly takes place. The goal is that after the reassembly process, ''reass_pkb'' should be identical to the hypothetical packet buffer we would have received had fragmentation not occurred. I am still a little hazy about the details, but I guess this is how I would proceed. On receiving the first fragment, i.e. when a new ''reass_buffer'' is created, the ''reass_pkb'' packet buffer will be allocated with a size equal to a small multiple of that fragment's length (from sources like [[http://www.caida.org/analysis/workload/fragments/sdscposter.xml|this]], we can put this factor = 2 or 3). If the fragment that arrives is the last fragment in the fragment series, then we know exactly how much space needs to be allocated (offset + total length).
 + 
 +As new fragments come, the data is merged into this packet buffer (''​reass_pkb''​). We might need to do some juggling around with the size of the packet buffer - reallocate, shift data, etc. - which might be a little messy. Any suggestion to make this part clean is welcome :) and greatly appreciated :) I guess the moment we receive the last fragment the total size of the packet can be determined and all is well. 
 + 
 +We would also need to remember whether the first fragment has been received, the last fragment has been received and if the reassembly has been completed. We can use the flags field to keep all this information. 
 + 
 +  IP_FRAG_FST 0x01 
 +  IP_FRAG_LST 0x02 
 +  IP_FRAG_FIN 0x04 
 + 
 +The flags field is a bitwise OR of zero or more of these values. The ''IP_FRAG_LST'' flag is set when the received fragment does not have the "more fragments" bit set in its IP header (in the case of IPv4). ''IP_FRAG_FST'' is set when the IP fragment is the first in the sequence, i.e. when the offset field of the IP header (again, IPv4) is 0. Setting the ''IP_FRAG_FIN'' flag is a little more complex.
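For IPv4, the first two flags can be derived directly from the 16-bit flags/fragment-offset field of the header: the "more fragments" bit is 0x2000 and the low 13 bits hold the offset in units of 8 octets. A small sketch, where ''ipv4_frag_flags()'' and the two mask names are my own and the field is assumed to have been converted to host byte order already:

```c
#include <stdint.h>

#define IP_FRAG_FST 0x01 /* first fragment received */
#define IP_FRAG_LST 0x02 /* last fragment received */
#define IP_FRAG_FIN 0x04 /* reassembly complete */

#define IPV4_MASK_MF     0x2000 /* "more fragments" bit */
#define IPV4_MASK_OFFSET 0x1fff /* fragment offset, in units of 8 octets */

/* Work out which of IP_FRAG_FST / IP_FRAG_LST a fragment contributes.
 * frags is the flags/offset field of the IPv4 header in host byte order. */
uint8_t ipv4_frag_flags(uint16_t frags) {
    uint8_t flags = 0;

    if ((frags & IPV4_MASK_OFFSET) == 0)
        flags |= IP_FRAG_FST; /* offset 0: first in the series */
    if ((frags & IPV4_MASK_MF) == 0)
        flags |= IP_FRAG_LST; /* no more fragments follow */
    return flags;
}
```

The result would be OR-ed into the ''flags'' field of the reassembly buffer. A packet that yields both flags at once was never fragmented and can skip reassembly altogether.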
 + 
 +In order to determine whether or not we have received all fragments, we could use either one of the following two approaches. I prefer the first one since it is more space efficient. 
 + 
 +We can maintain a list of the (offset, total-length) values of the fragments received. When the first fragment of a fragment series is received, we initialize this list using the info in this fragment's header. As more fragments are received, we add more elements to the list depending on the position of the fragment, i.e. a new element k is placed between elements i and j in the linked list such that
 +
 +  offset_i <= offset_k <= offset_j
 +  offset_i + total-length_i <= offset_k
 +  offset_k + total-length_k <= offset_j
 + 
 +The flag ''​IP_FRAG_FIN''​ is set when, 
 +  - the ''​IP_FRAG_FST''​ flag has been set 
 +  - the ''​IP_FRAG_LST''​ flag has been set 
 +  - for every element i in the list (followed by element j), the following holds: 
 +       
 +  offset_i + total-length_i = offset_j 
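A sketch of the list approach, under some simplifying assumptions: ''struct frag_hdr'' uses a plain ''next'' pointer rather than a list head, offsets and lengths are kept in octets, and overlapping fragments are not handled. The names ''frag_insert()'' and ''frag_series_complete()'' are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One received fragment: where it starts and how many octets it carries. */
struct frag_hdr {
    uint32_t offset;       /* offset in octets */
    uint32_t len;          /* payload length in octets */
    struct frag_hdr *next;
};

/* Insert k so that the list stays sorted by offset; returns the new head. */
struct frag_hdr *frag_insert(struct frag_hdr *head, struct frag_hdr *k) {
    struct frag_hdr **pp = &head;

    while (*pp != NULL && (*pp)->offset < k->offset)
        pp = &(*pp)->next;
    k->next = *pp;
    *pp = k;
    return head;
}

/* The series is complete when the last fragment has been seen, the list
 * starts at offset 0, and every element ends where the next one begins. */
int frag_series_complete(const struct frag_hdr *head, int last_seen) {
    uint32_t expected = 0;

    if (!last_seen || head == NULL)
        return 0;
    for (; head != NULL; head = head->next) {
        if (head->offset != expected)
            return 0; /* gap (or overlap) before this fragment */
        expected += head->len;
    }
    return 1;
}
```

Starting ''expected'' at 0 covers the ''IP_FRAG_FST'' condition implicitly, since a list whose first element does not start at offset 0 fails the first comparison.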
 + 
 +The other approach uses a bitmap. Fragment offsets are calculated in units of 8 octets. Let the total reassembled packet length be L octets; then we can create a bitmap of length L/8 bits (rounded up) such that each bit corresponds to one 8-octet block of the packet. As fragments are received and merged into ''reass_pkb'', the corresponding bits in the bitmap are set to 1. When all bits are 1, ''IP_FRAG_FIN'' is set and the packet buffer is sent to the higher level protocol for processing.
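The bitmap approach might look roughly like this. The sketch assumes a fixed-size bitmap sized for the largest possible IPv4 datagram (65536 octets, i.e. 8192 blocks of 8 octets); ''struct frag_bitmap'', ''bitmap_mark()'' and ''bitmap_complete()'' are names I made up for illustration.

```c
#include <stdint.h>

#define REASS_MAX_BLOCKS 8192 /* 65536 octets / 8 octets per block */

/* One bit per 8-octet block of the reassembled packet. */
struct frag_bitmap {
    uint8_t bits[REASS_MAX_BLOCKS / 8];
};

/* Mark the blocks covered by a fragment starting at offset (in octets)
 * and carrying len octets. A short final block still counts as one block. */
void bitmap_mark(struct frag_bitmap *bm, uint32_t offset, uint32_t len) {
    uint32_t first = offset / 8;
    uint32_t end = (offset + len + 7) / 8; /* exclusive */
    uint32_t b;

    for (b = first; b < end && b < REASS_MAX_BLOCKS; b++)
        bm->bits[b / 8] |= (uint8_t)(1u << (b % 8));
}

/* Return 1 when every block of a total_len-octet packet has been marked. */
int bitmap_complete(const struct frag_bitmap *bm, uint32_t total_len) {
    uint32_t nblocks = (total_len + 7) / 8;
    uint32_t b;

    if (nblocks == 0 || nblocks > REASS_MAX_BLOCKS)
        return 0;
    for (b = 0; b < nblocks; b++) {
        if (!(bm->bits[b / 8] & (1u << (b % 8))))
            return 0;
    }
    return 1;
}
```

Note that ''total_len'' is only known once the last fragment has arrived, so ''bitmap_complete()'' cannot succeed before ''IP_FRAG_LST'' is set. The fixed 1 KB bitmap per reassembly buffer is the space cost that makes me prefer the list.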
 + 
 +Another aspect is the reassembly timer. When a ''reass_buffer'' is created, the reassembly timer is created and set to the ''MAX_FRAG_TIME'' value, which is the maximum time in which the reassembly should occur. If the timer expires before the ''IP_FRAG_FIN'' flag is raised, we assume that one or more fragments are lost and the reassembly buffer is discarded. IMO the reassembly timer should be decremented in the time slice allotted to the network layer, i.e. in ''net_step()''.
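As a rough sketch of the timer idea, assuming the timer is a simple tick counter decremented once per ''net_step()'' call. This does not use the real ''retry_timer''; ''struct reass_slot'', ''reass_tick()'' and the value of ''MAX_FRAG_TIME'' are placeholders.

```c
#include <stddef.h>

#define MAX_FRAG_TIME 30 /* assumption: measured in net_step() ticks */

/* Stand-in for the timer-related state of a reassembly buffer. */
struct reass_slot {
    int in_use;          /* slot holds an active reassembly */
    int done;            /* IP_FRAG_FIN has been raised */
    unsigned ticks_left; /* set to MAX_FRAG_TIME when the buffer is created */
};

/* Called once per network time slice. Decrements every pending timer and
 * discards buffers whose timer has run out. Returns the number discarded. */
size_t reass_tick(struct reass_slot *slots, size_t n) {
    size_t i, dropped = 0;

    for (i = 0; i < n; i++) {
        if (!slots[i].in_use || slots[i].done)
            continue;
        if (slots[i].ticks_left > 0)
            slots[i].ticks_left--;
        if (slots[i].ticks_left == 0) {
            slots[i].in_use = 0; /* fragments assumed lost: discard */
            dropped++;
        }
    }
    return dropped;
}
```

In the real stack the discard step would also free ''reass_pkb'' and unlink the buffer from the list of reassembly buffers.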
 + 
 +When the ''​IP_FRAG_FIN''​ flag is set, the packet is sent to the transport layer protocol for processing. The ''​reass_buffer''​ is discarded and memory is released.  
 + 
 +==== More on IP fragment reassembly ==== 
 + 
 +Last night, Michael and I had a long IM session in which we discussed the reassembly requirements. Here are the main points: 
 + 
 +  - We assume that the fragments are received in order. We do not need to remember where in the original packet the received fragment fits. This greatly simplifies the process of reassembling fragments.
 +  - If a fragment is received out of order, we simply assume that the missing fragment (before it) is lost and we discard the reassembly buffer. 
 +  - If the offset field of a fragment is equal to 0, then we create a new reassembly buffer for the fragment series as described in the earlier note. 
 +  - If the offset field is nonzero, we identify the fragment series. If the offset + total_length of the last received fragment in the fragment series is equal to the offset of the received fragment, we add the fragment to the fragment series. Otherwise we discard the fragment series.
 +  - If the more fragments field is not set, i.e. this is the last fragment of the series, on successfully adding the fragment, we invoke the transport layer's ''rx_process()'' function and pass the reassembled buffer into it.
 +  - We need to write ''​realloc_pkb()''​ which will take a packet buffer and a new length, allocate a new packet buffer for the new length, copy the contents of the old packet buffer into the new one, and then free up the old packet buffer. This function will be useful in maintaining the reassembled buffer. 
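The points above can be sketched as follows. This is not gPXE code: a plain heap buffer grown with ''realloc()'' stands in for ''reass_pkb'' and the proposed ''realloc_pkb()'', a single ''struct reass_state'' stands in for the reassembly buffer of one fragment series, and ''reass_add()'' and ''reass_reset()'' are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum reass_result { REASS_MORE, REASS_DONE, REASS_DROP };

/* State for one fragment series. */
struct reass_state {
    uint8_t *data; /* stand-in for reass_pkb */
    size_t len;    /* octets received so far, i.e. the next expected offset */
    int active;    /* a series is in progress */
};

/* Discard the reassembly buffer and release its memory. */
void reass_reset(struct reass_state *rs) {
    free(rs->data);
    rs->data = NULL;
    rs->len = 0;
    rs->active = 0;
}

/* Add one fragment. offset is in octets; more_frags is the MF bit. */
enum reass_result reass_add(struct reass_state *rs, size_t offset,
                            const uint8_t *payload, size_t plen,
                            int more_frags) {
    uint8_t *grown;

    if (plen == 0) {
        reass_reset(rs);
        return REASS_DROP;
    }
    if (offset == 0) {
        reass_reset(rs);  /* offset 0 starts a new series */
        rs->active = 1;
    } else if (!rs->active || offset != rs->len) {
        reass_reset(rs);  /* out of order: assume a fragment was lost */
        return REASS_DROP;
    }
    grown = realloc(rs->data, rs->len + plen); /* plays the role of realloc_pkb() */
    if (grown == NULL) {
        reass_reset(rs);
        return REASS_DROP;
    }
    memcpy(grown + rs->len, payload, plen);
    rs->data = grown;
    rs->len += plen;
    /* REASS_DONE is where the transport layer's rx_process() would be called. */
    return more_frags ? REASS_MORE : REASS_DONE;
}
```

Because fragments must arrive in order, the only bookkeeping needed is the next expected offset, so neither the list nor the bitmap from the earlier note is required in this simplified scheme.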
 + 
 +(I apologise for the verbose and hasty notes. I will compile a more elaborate one with code snippets soon)
  
 ==== To do list ==== ==== To do list ====
 +
 +  - Document revised TX and RX paths through the network stack((Last night, Michael was hard at work simplifying the TX and RX data paths through the network stack.)).
 +  - Figure out a way to pass generic information between layers
 +  - Complete the UDP implementation
 +  - Complete the IPv4 implementation
 +  - Debug/test the UDP/IPv4 stack
 +  - Implement fragment reassembly
  
 ===== Some fun stuff ===== ===== Some fun stuff =====
