Fetching A Web Page, Part 3 - First Contact

When we left off last time, our computer had just been given the IP address of "www.cnn.com", as the first part in the process of connecting to it to fetch the main web page. It turned out that the IP address was 64.236.16.11, and the computer is now ready to connect to it.

But in order to connect to it, we must first know what service we wish to use.

On any given server, there are (usually) a number of services, or programs, running. These programs listen for incoming connections from the internet, and process commands - usually handing out information, such as web pages. One of these programs is the web server. Another program may handle FTP, or file transfers; a third one may provide yet other services. Which of these programs should we connect to? Only one of them understands how to handle web pages. How do we tell the remote server which program we want to use?

It turns out that there is a huge list of standardized ports. A port is sort of a door which we can go through. Depending on which door we choose, we get to see different programs. All of these doors have numbers, and as it turns out, door number 80 (port 80) is the standardized number for fetching web pages. A web server that wishes to serve web pages out to the global internet stands idly by door number 80 and waits for incoming requests. When a request bursts through the door, it says "hello, dear friend! What can I do for you?" But if we had connected to port 79 instead, the door might have been locked; or worse still, there might have been a program on the other side that had no idea what a web page is.

These port numbers are not set in stone. Some companies choose to put their web server at a different port number. This may happen because they want to try to "hide" their server, or at least not make it so dashedly obvious that it's there. This is why, sometimes, you have to type in really complicated addresses like "http://some-server.net:8080/". This URL tells the system to avoid port 80, and go looking for port 8080 instead.
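
To make the door-picking concrete, here is a minimal sketch of how a client could work out which port a URL points at. The helper name port_for and the little DEFAULT_PORTS table are my own inventions for illustration; only urlsplit comes from Python's standard library.

```python
from urllib.parse import urlsplit

# Default doors for the two common web schemes (illustrative table).
DEFAULT_PORTS = {"http": 80, "https": 443}

def port_for(url):
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not name a port explicitly.
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]

print(port_for("http://www.cnn.com/"))           # 80
print(port_for("http://some-server.net:8080/"))  # 8080
```

When the URL says nothing about a port, we fall back to the standard door; when it names one, we go where we are told.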

Your average server answers on far from all ports, though. In fact, the fewer ports the server answers on, the less chance there is of having a security hole somewhere. Ideally, a web server should only answer port 80 and no other port at all! Hackers usually scan servers, meaning that they go through all possible doors (ports) to see if any are open, and what may be behind the door.
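
Trying a single door is simple enough to sketch. The snippet below is a rough illustration, not a real scanner: is_port_open is a hypothetical helper of mine, and it only ever pokes at a port on our own machine (127.0.0.1) that we open ourselves first.

```python
import socket

def is_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return s.connect_ex((host, port)) == 0

# Open one "door" ourselves so there is something to find.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]

print(open_port, is_port_open("127.0.0.1", open_port))
listener.close()
```

A scan is just this check repeated over many port numbers, which is exactly why it pays to keep unneeded doors shut.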

But onwards we go! We have packets to send!

First of all, we need to connect to the web server at 64.236.16.11, port 80. This is done by asking the computer to send a very special packet, called a SYN packet. The SYN packet is very short, and typically consists of little more than our IP address (so the server knows where the packet came from), the destination IP address (so the routers on the way know where to send it) and the destination port. SYN means, effectively, "knock on the door to see if anyone is there". So the SYN packet is added to the queue of outgoing packets in our machine.
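
For the curious, here is a sketch of what the TCP part of such a packet looks like as bytes. Treat it as an illustration under assumptions: build_tcp_header is my own helper, the source port 49152 and sequence number 1000 are made up, the checksum is left at zero (a real network stack computes it), and the two IP addresses are not shown because they travel in the separate IP header wrapped around this one.

```python
import struct

SYN = 0x02  # the SYN bit in the TCP flags field

def build_tcp_header(src_port, dst_port, seq, flags):
    """Pack a 20-byte TCP header with no options (checksum left at 0)."""
    ack_number = 0
    data_offset = 5 << 4   # header length: 5 words of 32 bits each
    window = 65535
    checksum = 0           # placeholder; a real stack fills this in
    urgent = 0
    return struct.pack("!HHIIBBHHH", src_port, dst_port, seq, ack_number,
                       data_offset, flags, window, checksum, urgent)

# Hypothetical values: source port 49152, sequence number 1000.
header = build_tcp_header(49152, 80, 1000, SYN)
print(len(header), header.hex())
```

Twenty bytes, one flag set, destination port 80: that is the whole knock on the door.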

At a lower level (remember the OSI model?), the network subsystem of the computer looks into its queue. It finds an outgoing packet, queued for immediate delivery. "Oooh", it thinks, "a packet!" And it proceeds to send it immediately to your Internet Service Provider, since there is little use in trying to send it anywhere else.

Now a long, long chain of routers starts looking at the packet that was sent. The first router comes along, and looks at the destination IP address of the packet, which is 64.236.16.11. This first router is probably responsible for handling a reasonable segment of your ISP's customers, and looks through its list of known customers to see if the packet is going to any of them. "Nope", it says, "I can't find this address in my list. I must send it upwards." It proceeds to forward the packet to its uplink, which handles a much larger part of the network. The process is similar to a lieutenant in the army, who doesn't know what to do with a specific situation, and sends it up to his captain. The captain doesn't know what to do either, and sends it up to the colonel; and so it goes until it ends up all the way up at the Supreme Army Headquarters.
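
The router's "is it one of mine? no? then upwards" decision can be sketched as a tiny routing table lookup. Everything here is hypothetical: the customer networks are documentation-only address ranges, the next-hop names are made up, and real routers do this in specialised hardware rather than a Python list.

```python
import ipaddress

# Hypothetical routing table: (network, next hop).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "uplink"),         # default route
    (ipaddress.ip_network("192.0.2.0/24"), "customer-a"),
    (ipaddress.ip_network("198.51.100.0/24"), "customer-b"),
]

def next_hop(destination):
    """Pick the most specific route that contains the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # The longest prefix (most specific network) wins.
    net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop

print(next_hop("192.0.2.7"))      # a known customer
print(next_hop("64.236.16.11"))   # not in our list, so it goes upwards
```

The default route 0.0.0.0/0 matches everything, so it plays the part of "send it to the captain" whenever no customer network fits.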

Army Headquarters, in this case, isn't a special router, though. It's the World Wide Internet. Up until now, we've only traveled within "our" Internet Service Provider, like Telia or Com Hem. But now, we leave them for a journey on the high seas.

The high seas of the Internet are ruled by a rather complicated protocol called BGP, or Border Gateway Protocol. It attempts to define routes between different networks, based on link connections, policies, and a multitude of other parameters. BGP is the protocol that controls the routes between all the myriad networks, so you can find just the right one you need. It can be compared to the little GPS device you may have in the car, which finds a route all the way from Podgorica to the Ice Hotel in Jukkasjärvi. And just as your GPS device has a CD (or DVD) with all roads in Europe, the routers on the Internet have a humongous list of all networks on the Internet. Trust me, you really don't want to know more about this. But it's not uncommon for a long-distance packet to travel through 20 or even 30 different routers on the way - and it all takes less than a tenth of a second.

Ultimately, your packet travels through networks with strange names, such as "telia.net", "atdn.net" and so forth until it reaches the target network. The routers on that side forward your SYN packet to the right sub-net, and finally to the right computer. The SYN packet knocks on door number 80, and a friendly program opens and says "hello there, what can I do for you?"

Well, not quite. But the SYN packet informs CNN's web server that we wish to start communicating, and it responds by transmitting a SYN/ACK packet (ACK stands for acknowledge). This SYN/ACK packet retraces the steps all over the Internet, and finally makes it home to your computer again. Your computer responds by sending an ACK packet (no SYN this time), and in doing so, both your computer and CNN's web server consider a connection to have been opened and agreed upon.
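
The three packets above can be written down as a small simulation. This is only a sketch of the exchange, not a working network stack: three_way_handshake is my own helper and the starting sequence numbers (1000 and 5000) are made up. The one detail added beyond the story so far is that each side acknowledges the other's starting sequence number plus one.

```python
def three_way_handshake(client_isn, server_isn):
    """Return the three segments as (sender, flags, seq, ack) tuples."""
    # 1. Client knocks: SYN carrying its initial sequence number.
    syn = ("client", "SYN", client_isn, None)
    # 2. Server answers: its own SYN plus an ACK of the client's number + 1.
    syn_ack = ("server", "SYN/ACK", server_isn, client_isn + 1)
    # 3. Client confirms: ACK of the server's number + 1, no SYN this time.
    ack = ("client", "ACK", client_isn + 1, server_isn + 1)
    return [syn, syn_ack, ack]

# Hypothetical starting sequence numbers.
for segment in three_way_handshake(1000, 5000):
    print(segment)
```

In everyday code you never do this by hand; asking the operating system to open a TCP connection makes it perform the whole exchange for you.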

We are now ready to send our initial request for the main web page. We will cover this next week.


Blog contents copyright © 2005 Mats Gefvert. All rights reserved.