Fetching A Web Page, Part 1 - Background

Few people realize just how difficult it is, and how many separate processes are involved, in the "simple" request of fetching a web page. It may seem like the Internet is just one gigantic "thing" that works. People may have heard about IP addresses, routers and so forth, but maybe never really understood what they are all about or how they fit together.

The web is an amazing tool. It has transformed our lives to the extent that it defines the computer experience today. So I thought it could be interesting to take a closer look at some of the technologies that drive it, for instance the process of fetching a single web page.

I intend to write this as a series, where we will be looking at each step in fetching a web page over the internet in sequence. I'm probably going to sprinkle the text with references to Wikipedia, where you can read more in detail about each step.

But to understand it all, some introductory knowledge may be required. The history goes something like this:

The internet began as a U.S. Department of Defense science project, more or less. The idea was to build a network of computers that was very loosely bound together, in order to be able to withstand, among other things, a nuclear attack. The reasoning went that if a network was organized extremely hierarchically, then one single missile hit could take out some very important central part of a nation-wide network. But if it was loosely bound together, there would be no single point of attack that would take out a disproportionately large part of the network. Thus the first network, called ARPANET, was based on the idea of packet switching: all communication takes place through a series of individual packets. Each packet is sent out with source and destination addresses attached to it, and the routers along the way each choose the best next hop for the packet, using the available channels, until it is delivered to the correct destination. If one part of the network went down, the packets would ideally just find a different way through the network, and everything would work anyway (although maybe a bit slower).
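To make the "find a different way" idea concrete, here is a toy sketch in Python. The four-router network and the `find_route` helper are made up for illustration, and a real router only decides the next hop using routing protocols rather than computing a whole path like this. Still, it shows why a loosely bound network survives losing a node.

```python
from collections import deque

# Hypothetical network: each router lists the routers it is directly linked to.
LINKS = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}

def find_route(links, source, destination, down=frozenset()):
    """Return the shortest list of hops from source to destination, or None.

    Routers named in `down` are treated as destroyed and are skipped.
    """
    if source in down or destination in down:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        # Sorted only so that the result is the same on every run.
        for neighbour in sorted(links[path[-1]]):
            if neighbour not in visited and neighbour not in down:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(find_route(LINKS, "A", "D"))              # ['A', 'B', 'D']
print(find_route(LINKS, "A", "D", down={"B"}))  # ['A', 'C', 'D']
```

With router B knocked out, the packet simply travels via C instead. Only when both B and C are gone does A lose contact with D.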

ARPANET first went online in 1969, and then evolved over the years, adding a bit of technology here, growing by a couple of million users there, until the internet opened up for commercial purposes in the early 1990s. And the rest, as they say, is history.

To further expand upon the idea of the network, it should be clarified that there are different types of networks. The two most talked about these days are Local Area Networks and Wide Area Networks (or LAN and WAN, for short). The internet is an example of a wide area network: a network of computers that spans a large area, such as a city, a country, or in this case, the entire world. Local area networks are much more ... well, local. :) They typically encompass a house, a company, or some other small, geographically limited, contiguous area. The internet (a WAN) is actually built out of millions of local networks (LANs), all connected together.

What, then, defines a local network? From the internet's point of view, a network is a group of internet addresses, all bunched together. An internet address, as you may have seen, consists of four groups of digits, each a number between 0 and 255, like 200.47.96.1. This is, so to say, the "phone number" of a single computer on the internet. The maximum number of unique addresses is about 4.3 billion (2 to the power of 32), which may seem like a lot, but not all of these numbers can be used, and it is actually getting kind of crowded out there, so it's not as much as people initially thought. Each computer, then, has a different phone number, which uniquely identifies that machine on the internet. However, much like ordinary phone numbers have area codes and prefixes, internet addresses also have area codes. They are called, precisely, networks. Usually, the way it goes these days is that the first three groups of digits define the network, and the last group defines the computer on that individual network (the split can fall elsewhere, but this is the common case). Of the address 200.47.96.1, the network is called 200.47.96, and the .1 tells the system that it's computer number one on that network. This makes it all easier, as we shall see later on, to know where individual packets should go.
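The "area code" split can be tried out with Python's standard `ipaddress` module. This is just a small sketch: the `/24` suffix is my own addition to the example address, and it is the usual notation for "the first 24 bits, that is, the first three groups, name the network".

```python
import ipaddress

# "/24" means the first 24 bits (three groups of digits) identify the network.
interface = ipaddress.ip_interface("200.47.96.1/24")

network = interface.network  # the "area code" part
# Mask away the network bits to leave only the computer's number.
host_number = int(interface.ip) & int(network.hostmask)

print(network)            # 200.47.96.0/24
print(host_number)        # 1
print(int(interface.ip))  # the whole address as one 32-bit number
print(2 ** 32)            # 4294967296 possible addresses in total
```

Under the hood an address is just one 32-bit number. The four groups are simply a friendlier way of writing its four bytes.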

When computers talk to each other over the network, there are many different layers of communication going on. The fancy name for this is the OSI reference model, published by ISO, which defines seven different layers of communication. To picture the first, most basic layer of computer-to-computer talk, imagine two computers linked together by a physical cable, where all they can do is say "yes" or "no" to each other (typically, high voltage means "yes" and low voltage means "no").
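As a rough illustration of how far you can get with only "yes" and "no", here is a small Python sketch that turns text into a string of H (high voltage) and L (low voltage) and back again. The function names are made up, and real cables use more elaborate signal encodings than this. The principle is the same, though.

```python
def to_signal(message: str) -> str:
    """Encode ASCII text as H (high = "yes" = 1) and L (low = "no" = 0)."""
    bits = "".join(f"{byte:08b}" for byte in message.encode("ascii"))
    return bits.replace("1", "H").replace("0", "L")

def from_signal(signal: str) -> str:
    """Decode a string of H and L back into ASCII text, eight signals per letter."""
    bits = signal.replace("H", "1").replace("L", "0")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("ascii")

signal = to_signal("Hi")
print(signal)               # LHLLHLLLLHHLHLLH
print(from_signal(signal))  # Hi
```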

The next level of complexity in this model is when we attach many computers to the same cable. We then need some sort of communication protocol to make sure that no two computers are saying "yes" or "no" at the same time, thereby garbling the electrical signals on the wire, and that they take turns speaking instead. Fancy names for these protocols include, for instance, IEEE 802.3, better known as Ethernet.

On the levels above this, we start working with addresses and many, many computers, connected through different cables: identifying computers, and making sure that individual pieces of a transmission arrive safely (and not garbled in any way). At the top is the ultimate level of communication, known as the application level, where we don't concern ourselves with anything more technical than the idea of fetching web pages ("fetch a web page over this big internet thing, I say, and work out all the technical details while you're at it"). The purpose of these layers is to split the problem of communication up into many isolated problems, and solve them one at a time. Once one layer of abstraction is built, we simply start building the next communication layer on top of it.
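One way to picture the layering is as envelopes inside envelopes: each layer wraps whatever it got from the layer above in its own header, without caring what is inside. The sketch below uses made-up, human-readable headers and a hypothetical destination address. Real protocols use compact binary headers with many more fields.

```python
# Simplified, hypothetical headers; each function stands in for one layer.
def application_layer(path):
    return f"GET {path}"

def transport_layer(payload, port):
    return f"[TCP port={port}]{payload}"

def network_layer(payload, src, dst):
    return f"[IP {src}->{dst}]{payload}"

def link_layer(payload):
    return f"[ETH]{payload}"

frame = link_layer(
    network_layer(
        transport_layer(application_layer("/index.html"), port=80),
        src="200.47.96.1",
        dst="200.47.96.2",
    )
)
print(frame)
# [ETH][IP 200.47.96.1->200.47.96.2][TCP port=80]GET /index.html
```

On the receiving side the envelopes are opened in the opposite order, each layer stripping off its own header and handing the rest upward.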

TCP/IP is the main protocol that drives the internet. TCP/IP actually embodies two different layers in this communication stack. IP means Internet Protocol, and it is the part that puts addresses on each individual packet and gets it passed along, router by router, toward its destination. IP does its best, but makes no promises that a packet actually arrives. TCP, which means Transmission Control Protocol, builds on top of this and uses IP packets to hold conversations between computers, making sure that no part of the conversation is lost or garbled in transmission, and that the "sentences" don't arrive in the wrong order. TCP is also good for saying "hello" and "goodbye", something very important in computer conversations, and which IP doesn't care about at all.
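The "right order, nothing lost" part can be sketched in a few lines of Python. This is a simplification under my own assumptions: the `reassemble` helper is made up, and it numbers whole segments 0, 1, 2 and so on, whereas real TCP numbers individual bytes and also handles acknowledgements and timeouts.

```python
def reassemble(segments):
    """Rebuild a message from (sequence_number, text) pairs in arrival order.

    Returns (message, missing): `message` is the text that can be delivered
    so far without gaps, and `missing` lists the sequence numbers that have
    not arrived yet and would have to be sent again.
    """
    received = {}
    for seq, text in segments:
        received.setdefault(seq, text)  # a duplicate segment is ignored

    in_order = []
    expected = 0
    while expected in received:  # stop at the first gap
        in_order.append(received[expected])
        expected += 1

    highest = max(received, default=-1)
    missing = [seq for seq in range(expected, highest + 1) if seq not in received]
    return "".join(in_order), missing

# Segments arrive out of order, one is duplicated, and number 2 is lost.
arrived = [(1, "lo, "), (0, "Hel"), (3, "d!"), (1, "lo, ")]
print(reassemble(arrived))                  # ('Hello, ', [2])

# After the lost segment has been sent again:
print(reassemble(arrived + [(2, "worl")]))  # ('Hello, world!', [])
```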

Okay, having gotten this far in our story, I hope I haven't lost you in all the technical details. For now, it's important to know that there are a lot of computers out there, all talking to each other, and that they form networks. Now, let's use this network. Let's imagine that you type the address "www.cnn.com" into your favorite browser, and next time, we'll look at what happens when you press ENTER.



Blog contents copyright © 2005 Mats Gefvert. All rights reserved.