We'll talk about computer networks from an OS perspective.

Local-area vs wide-area networks

A local-area network or LAN connects machines in the same room or building. A wide-area network or WAN connects machines on the same planet. Both provide a mechanism for doing message-passing between computers (see InterProcessCommunication); the main difference from the OS view is speed and possibly reliability.

LANs

For a LAN, the connection between two machines is likely to be either direct or through a single switch or hub. Most LAN transports now are implemented using radio signals, either sent through the air as in 802.11 or along twisted pairs of wires as in Ethernet. Networking hardware takes responsibility for transmitting packets generated by the participating machines and detection collisions, which are attempts by two machines to transmit at the same time. Lost packets are retransmitted, with increasing random delays to avoid congestion.

There are various configuration details that may complicated this picture. Although Ethernet is designed to allow computers to be connected directly across a single shared line, it is common to instead connect machines to a hub (dumb version that echoes packets to everybody) or switch (smart version that echoes packets only to intended recipients); the hub or switch may also act as a gateway that passes packets between separate LANs or to a larger WAN. Gateways are particularly necessary for LANs built from multiple transports; for example, a given local-area network may include both wired and wireless connections, and packets from the wired side must be translated through a gateway to the wireless side and vice versa.

Speeds on local area networks can be very high, with bandwidth ranging from 1 Mib/sec to 54 Mib/sec for wireless connections to 10 Mib/sec to 1 Gib/sec for wired connections using current consumer-grade hardware (that's bits, not bytes; to get bytes, divide by 8). Latency is typically on the order of a few hundred microseconds. Note that bandwidth is usually shared between multiple computers and applications, which can reduce the maximum effective bandwidth substantially.

WANs

When our pile of gateways and hubs become large enough, we have to start exercising still more intelligence in where we send packets, and our gateways/switches/etc. graduate to being routers. A router may be connected to many transports, and has a routing table that tells it where to forward packets based on their destination addresses (see below). In a wide-area network, a packet may be forwarded through dozens of routers and intermediate transports before reaching its destination.

Speeds on wide-area networks are limited by the effective bandwidth of the slowest link and delays incurred in translating and forwarding packets between transports. Even though the transports used for the core of a WAN may have very high bandwidth (e.g. 14 Tib/sec for long-distance optical fiber), this bandwidth is typically divided between many packets, so the actual bandwidth available to a single application is likely to be much more limited—and proportional to what the user is willing to pay for. Consumer and business connections typically involve running a line to an Internet Service Provider or ISP (often multiple ISPs in some cases for businesses that worry about reliability or maintaining competition among their suppliers). The capacity of this line is generally capped by the limits of the transport medium or by a contract between the purchaser and the ISP. Bandwidths here may range from 56 kib/sec or worse for modem connections across the telephone network to 1-2 Mib/sec for DSL connections, 6-12 Mib/sec for cable modem connections, or 100 Mib/sec and up for fiber optic connections (largely legendary in the US, but rumored to be big in Japan). Actual effective bandwidths are likely to be lower because of congestion later in the network, and because of asymmetries in the transport mechanism; for example, typical upload bandwidths on DSL or cable modem lines are an order of magnitude smaller than download bandwidths.

Latency for wide-area networks is necessarily much higher than for local-area networks, both because of delays inside routers and because of fundamental limitations of physics: a signal traveling 20,000 kilometers around the surface of the earth will take approximately 70 ms to arrive even traveling at the speed of light; this is 2-3 orders of magnitude slower than we would expect from a signal traveling across a local-area network, where travel time is much less of an issue. Signals going through roundabout routes (e.g. via satellite transmission) will be even slower. The difference in speed between LANs and WANs has similar effects to the difference in speed between various levels of the memory hierarchy, and encourages the use of similar tools like caching to work around it.

Addressing: IP

How do we know where to send a packet to? At the bottom layer, every wired or wireless Ethernet card ever built has a unique 48-bit MAC address that is baked into by its manufacturer (different manufacturers get different prefixes of this address space); for local-area networks, this address is included in the header of each packet and polite Ethernet cards drop packets that aren't addressed to them.

For wide-area networks, pretty much every network in the world now uses Internet Protocol, in either its "Classic" IPv4 or "New Coke" IPv6 incarnations, to do addressing and routing. IPv4 provides 32-bit address (the classic 128.36.29.174 style addresses you've probably seen before); IPv6 provides a much larger 128-bit address space (e.g. fe80::213:72ff:fe07:b068/64, an abbreviated address written mostly in hexadecimal with blocks of zeros omitted).

IP packets consist of some small number of bytes (the payload) prefixed with a short header containing routing information. This information is essentially just the source and destination addresses, together with a checksum (for the header only) and some status flags. A router or machine attempting to deliver an IP packet will translate the address using its routing table (which must be initialized somehow, often directly by a human being or using some distributed route-discovery algorithm) and send the packet out across the appropriate transport. Here's a simple routing table for one of the Zoo machines: ignoring the two middle lines, it says to send any packets for the 128.36.232 subnet out across the Ethernet and send any other packets to anger.net.yale.edu (the router for the CS department) so that anger can figure out where to send them next.

$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
128.36.232.0    *               255.255.255.0   U     0      0        0 eth0
link-local      *               255.255.0.0     U     0      0        0 eth0
loopback        *               255.0.0.0       U     0      0        0 lo
default         anger.net.yale. 0.0.0.0         UG    0      0        0 eth0

Assuming that the network engineers did their job right, each packet is eventually forwarded to its correct destination. Or not: IP only guarantees best effort delivery, meaning that a router is perfectly entitled to drop, reorder, or duplicate packets if it gets overwhelmed, and transport links are perfectly entitled to damage the data inside packets in various ways if it would be more expensive to avoid such damage. (Except for the possibility of duplication, this is similar in many ways to the situation with physical mail delivery.) IP also only delivers packets to a specific machine, and not to any particular service or process running on that machine. So most applications use higher-level protocols built on top of IP.

UDP and TCP

The two most common protocols built on top of IP are UDP (User Datagram Protocol) and TCP (Transmission Control Protocol). These add their own headers (which IP thinks of as part of its payload) that can be used to provide a saner interface to the user.

For UDP packets, a checksum stored in the header on the UDP payload detects most accidental corruption of the data (but not all—it's a pretty weak checksum). The UDP header also provides 16-bit source and destination ports, which are used to distinguish different processes or services at the endpoints. Finally, UDP packets are guaranteed to be delivered in one piece or not at all; this is an improvement in theory on IP packets, which can be fragmented if necessary for transport media that can't handle large packets, but is not necessarily an improvement in practice since a common way of enforcing this condition is to discard over-large UDP packets. UDP packets do not add any more reliability; like IP packets, UDP packets may be dropped, reordered, or duplicated, and are only useful for applications that can tolerate such abuse. (If you are curious, you can find the specification of UDP at http://tools.ietf.org/html/rfc768.) The selling point of UDP packets is that they are lightweight—no state needs to be maintained within an operating system to handle sending or receiving UDP packets (except perhaps to record which process is interested in receiving packets on which port).

TCP packets add a much larger header that includes sequencing information used to simulate reliable, FIFO delivery of data. The TCP protocol describes elaborate mechanisms for acknowledging successful packet delivery and retransmitting packets that fail to be delivered in a sufficient amount of time. Lost packets are further used as a signaling mechanism to control transmission rates; by dropping excess TCP packets, an overloaded router will cause polite TCP stacks to cut back their rate and resolve the overload (there is a tragedy_of_the_commons situation here if TCP stacks become impolite). The interface presented to the user is one of two-way, byte-oriented stream connections essentially identical to local pipes. So for example, TCP connections can be accessed directly by by C's stdio library; after opening a TCP connection to a remote host, you can run fscanf or fprintf on it just as you would on any open file.

Reliability is still an issue for TCP connections, but it is now all-or-nothing: so long as the TCP connection stays up, bytes are delivered without corruption and in the correct order, but if the connection drops (because packet loss or corruption gets so bad that acknowledgments can't get through within the maximum timeout), some suffix of the transmitted data may be lost. It is generally not possible to detect exactly which bytes get through and which are lost, which creates some interesting theoretical problems for things like purchasing goods over the Internet exactly once, which in practice don't seem to come up except on really low-budget small business websites.

TCP has evolved over the years and is now specified by a number of RFCs; the page TCP gives a list of some of the more important ones. The basic guts of TCP are in http://tools.ietf.org/html/rfc793.

TCP: more details

The basic mechanism for ensuring in-order packet delivery is the use of a 32-bit sequence number on each packet. The sequence number specifies the position in the stream of the first byte in the packet; so sending a 100-byte packet increases the sequence number for the next packet by 100. (The reason for labeling bytes rather than packets is that it allows retransmitted packets to be consolidated or split.) Packets traveling in the other direction contain an acknowledgment number that indicates the next byte that the receiver expects to receive. When the sender emits a packet it places a copy in a local queue, which is retransmitted after a timeout unless an acknowledgment is received first. For streams with high traffic in both directions, no extra packets are needed for acknowledgments; instead, the acknowledgments are piggy-backed on data packets. If there is no outgoing data, acknowledgments can be sent using otherwise empty ACK packets. Note that sequence numbers for each direction of the connection are independent of each other.

Connection setup

In order to use a TCP connection, the connection must first be established using a three-step handshake protocol. The main purpose of this protocol is to initialize the sequence numbers. Because a TCP connection is identified only by the source and destination addresses and source and destination ports, it is possible that a sender may attempt to re-establish a TCP connection that was previously closed. While this is not a problem in general, it does mean that we have to be careful to make sure any old packets from the previous connection don't cause trouble with the new connection (e.g. by getting delivered as part of the new stream or by being treated as acknowledgments for data that hasn't actually gone through). So a connection initiator chooses a new starting sequence number that must be greater than all previous sequence numbers and sends it to the receiver using a special SYN (synchronization) packet. The receiver chooses its own starting sequence number (again subject to the constrain of avoiding reuse) and sends it back in a SYN-ACK packet. A final SYN-ACK-ACK packet from the sender (actually just an ordinary ACK packet) acknowledges receipt of the SYN-ACK; at this point the connection is established and either end can send data packets.

Because SYN, SYN-ACK, and ACK packets can all be lost, timeouts are used to retransmit them just as with regular data packets.

One complication with connection startup is that it is possible that the sender or receiver has no idea whatsoever what packets might still be floating around; for example, a host might have just brought up its network interface after recovering from a crash that wipes its memory. There is essentially nothing that can be done about this other than waiting; TCP defines a maximum segment lifetime of two minutes, and demands that the underlying network not deliver any TCP packet older than this. This means (a) that a crashed machine can't establish any new connections for at least two minutes, and (b) that you can't use TCP to talk to anybody more than 120 light-seconds away, or across communication links with particularly long delays.

Receive window

One of the goals of TCP is to allow data to be shipped in bulk with low overhead. So TCP defines a receive window, which is the amount of data a sender is allowed to send before receiving an acknowledgment. This window is under control of the receiver and is included in the TCP header of each outgoing packet, allowing the receiver to dynamically adjust the window as needed; the trade-off is that large windows allow for higher throughput (particularly on high-latency connections) but may require more buffering at the receiver. The original TCP specification allowed window sizes between 2 and 216 bytes; a later extension (negotiated during the setup handshake) allows window sizes up to 230 bytes. One downside of using a very large window is that it is possible to lose data early in a stream causing all subsequent packets to be retransmitted (since the acknowledgment number only indicates the last byte successfully received, and not whether subsequent packets need to be retransmitted). Further extensions allow selective acknowledgment, which uses extensions to the basic TCP header to allow acknowledging particular ranges of packets.

Retransmission control

Retransmission is controlled by setting a retransmission timeout. An ideal retransmission timeout for maximizing throughput will be slightly larger than the typical round-trip time for the network connection, since this the first time at which the sender can conclude that a packet has in fact been lost. The round-trip time can be estimated by the delay between when a packet is first sent and when it is acknowledged. More sophisticated mechanisms can adjust the round-trip timeout to achieve better congestion control by responding to packet loss; how to do this in a way that optimizes throughput while minimizing congestion (i.e. overloaded routers tossing packets), maintaining fairness (each stream gets a fair share of the underlying pipes), and being compatible with previous widely-deployed mechanisms is an area of continuing active research.

Shutting down a connection

There are two ways to shut down a TCP connection:

Failure

If a sender doesn't get an acknowledgment after retransmitting a packet too many times, it gives up and sends a FIN packet, which initiates the standard shutdown procedure described below.

Shutdown

A sender can send a CLOSE packet, the TCP equivalent of EOF, to indicate it has no more data. If both endpoints send a CLOSE packet, this initiates a shutdown handshake using FIN packets (which must in turn be acknowledged). A particularly impatient sender can also send a FIN packet without CLOSE to hang up on a remote receiver who just won't shut up.

If the shutdown procedure works, both endpoints can clear out the connection (although they may need to retain some state for the connection for 2 MSL intervals to handle delayed FIN acknowledgments).

Higher-level protocols

Higher-level protocols (e.g. HTTP, SMTP, SSH, NFS) are usually built on top of UDP or TCP. This may involve defining yet another header-and-payload structure, so that for example an HTTP request might consist of a packet with several layers of headers, all of which need to be unwrapped to process it:

Ethernet frame header | IP header | TCP header | HTTP request header | HTTP request data

Protocols implemented on top of HTTP (e.g. XML-RPC or SOAP) will contain even more data.

OS implications

An OS can treat most of the network as one gigantic blob of communication sitting on the other side of its network hardware, and doesn't need to worry too much about the specific details of routing protocols. However, it at minimum needs to provide a device driver for its network hardware and a dispatch facility that delivers incoming packets to the appropriate processes, and most likely will provide a kernel-level implementation of the TCP protocol to maximize performance.

Some local routing may also be the responsibility of the OS; for example, to route IP packets across a local Ethernet it is necessary to translate IP addresses to Ethernet addresses using the address resolution protocol (ARP), which essentially involves broadcasting the message "Hey! Who is 128.36.229.174?" and waiting for somebody to pipe up. For efficiency these ARP responses are typically cached (so we don't need to keep asking) for limited times (in case 128.36.229.174 moves or changes its MAC address), which implies all the usual overhead and maintenance of any cache system.

Back in the days of 300 baud modems, the network subsystem did not have to be terribly efficient, since the network itself acted as the main performance bottleneck (this is still somewhat true in the modern era of distant, underpowered web servers). But for high-speed local-area networks running at speeds only a few orders of magnitude slower than main memory, bad design choices in the OS can significantly cut into network performance, which in turn can produce visible, painful slowdowns for users of e.g. remote filesystems. The main design principles here are similar to those for any other I/O: to the maximum extent possible, avoid context switches (arguing against, for example, having a microkernel-style "networking daemon" that processes all network traffic) and avoid copying data unnecessarily. This last design goal is met for very high-performance systems by mapping the internal buffers in network hardware directly into the address space of the sending or receiving processes, so that data is never copied; but more typical systems accept one layer of copying by the kernel between the network hardware and the user process's address space.

Socket interface

Essentially all modern OSes provide access to networking through a standard API known as the Berkeley socket interface. Historically, this was the networking interface developed for BSD Unix, which spread to other Unices and Unix-alikes as well as Windows along with the Berkeley TCP stack that implemented it after the TCP stack became free for general use in 1989.

A socket is an abstraction of an endpoint of a network connection or service as a file. Berkeley sockets are implemented using several different system calls, which between them allow very detailed control over the construction and handling of sockets representing TCP streams, servers capable of accepting TCP connections, servers capable of accepting incoming UDP packets, etc. This control is provided at a very low level; using the socket interface to create a TCP connection often feels like building a car out of individual Lego blocks, and is suspiciously similar to the setup and teardown sequences for TCP connections. Part of the reason for this may be that the C calling syntax (especially in the form of system calls) limits the ability to specify default or conditional arguments, so that basic socket setup operations are handled by initial system calls while later specialized operations (which would probably be methods on socket objects or effects of subclassing in an object-oriented language) require subsequent specialized system calls.

Some of the major system calls provided by the socket API are:

socket

Used to create a socket. Arguments specify a domain (e.g. IPv4, IPv6, various more exotic options), type (e.g. sequential vs datagrams) and protocol (e.g. UDP vs TCP); there is a certain amount of overlap between then last two arguments. Unlike file descriptors returned by open, newly-created sockets aren't really attached to anything, and with a few exceptions (sending UDP packets using sendto) can't do much.

bind

Binds a socket to a local address. This is used to specify what address and port a server will listen for incoming connections on, or what port a remote recipient will see for packets originating from this socket. It is still not enough to actually allow you to send or receive data on TCP connections. For that, you need to do one of the following two system calls, depending on whether you want to be a server (accepting incoming connections) or a client (connecting to some remote server).

listen and accept

For servers, the listen system call declares that we will accept incoming connections, which we can then turn into new sockets representing our end of the TCP connection using still another system call accept.

connect

Attempt to connect to a remote server. If successful (i.e. the remote serving is listening and executes accept), the socket becomes one endpoint of a TCP connection.

shutdown

Like close for sockets; politely signals the other end of our connection that we want to leave. Of course, we still have to call close after the shutdown succeeds (or fails, in case of network trouble).

You can read the details of each of these system calls by typing man 2 socket, man 2 bind, etc. The 2 is important, since similarly-named (and often more useful) functions show up elsewhere in the Unix manual pages.

Many of these functions require highly-processed and magical address arguments that are very different from the nice clean addresses humans are used to. So there is a whole family of additional library routines for translating human-readable names like www.google.com to the low-level crap that the socket library demands. See for example getaddrinfo.

Sensible people generally don't use any of these calls directly unless they are implementing some higher-level client or server library. Instead, they use higher-level client or server libraries already implemented by somebody else.

Effect of failures

Failures create fundamental problems for applications running across machines, which can be mitigated by sufficiently sophisticated protocols but cannot be avoided outright. For example, the TwoGenerals problem means that it is impossible to guarantee with an unreliable network that both the sender and receiver of a packet agree in all circumstances whether it has been delivered. Failures at the machine level can create even bigger headaches, allowing communication protocols to be driven to arbitrary states (Jayaram and Varghese, Crash failures can drive protocols to arbitrary states, PODC 1996; http://portal.acm.org/citation.cfm?id=248052.248104) in the worst case, and rendering a wide variety of perfectly ordinary tasks impossible in theory (see CS425/Notes). In practice most of these problems can be worked around with enough use of timeouts and redundancy, if we are willing to accept a small probability of failures at the application level.


CategoryOperatingSystemsNotes

Networking (last edited 2007-12-25 23:42:17 by localhost)