Internet Topics + CVu Journal Vol 12, #2 - Mar 2000

Title: An Introduction to CGI Programming

Author: Administrator

Date: Fri, 03 March 2000 13:15:35 +00:00

CGI, the Common Gateway Interface, is a mechanism for allowing your programs to be invoked by a Web server. Note that your program is not downloaded and run on the user's computer, but is run by the server. This allows the program to use the server's resources and store things on the server, and it also means that CGI programs (commonly called "CGI scripts") can be used by almost any Web browser, no matter how old (as long as it supports forms), on any operating system. CGI programs are much more accessible to disadvantaged users than the Java and JavaScript equivalents, even if the latter may be a little faster on some computers.

The downside of CGI is that you need to be more careful about security. Because it is the web server that runs your program, a misbehaving CGI program can interfere with the web server, so anyone trying to break the web server may well start by giving your program silly requests in an attempt to make it misbehave. For this reason, many system administrators who are happy to give you space for Web pages get a bit more worried when you speak of CGI. Some sysadmins ask to read through your program first, expecting it to be short (they get a bit funny when I dump my 200K of C++ on them and tell them I want to update it often); others are more confident about the restrictions that they have put in place. A program called cgiwrap allows your CGI scripts to be run from your account instead of the web server's account, and this helps with security because your CGI program can't do anything that you can't do, but some risks are still there. Many sysadmins insist that CGIs be written in Perl or PHP-3. There is no technical reason why they can't be written in C, C++ or almost any other language, but Perl's string manipulation makes simple scripts easier to write in it and more likely (but still not guaranteed) to be secure.

The execution of CGI programs should generally be quite brief. They take input, do something, print out the results and exit. They should ideally do this in less than one second (the user will see more delay because of the network and so forth), but several seconds is allowable if it is doing some kind of retrieval (such as a search). Even a minute or two might be OK if the user knows it's going to take that long (although in this case it helps to output things as you go, which some web servers will propagate through to the browser as it happens). If your program takes more than a few minutes then the web server will probably kill it and say (perhaps misleadingly) "server overloaded". If you have administrative control over the web server then you can set it to wait for longer, but after a while the web browser will give up, and probably long before that the user will give up. The server administrator may also have something to say about long-running scripts, since while they are running they take up resources on the server. It is easy to make a "denial of service" attack that gives the server several simultaneous requests to fill up a number of its connections for a while. But then again, there is no known cure for denial of service attacks if the attacker can break into enough machines to have more bandwidth than you, and apart from knocking your server off the Internet for a while they are mostly harmless.

The FORM

To get started with CGI, write a web page that goes something like this:

<HTML><BODY>
<FORM METHOD=get ACTION="http://path/to/your/cgi/script">
 <LABEL FOR=L1>
 What is your favourite colour?
 <INPUT TYPE=text NAME=colour ID=L1>
 </LABEL>
 <INPUT TYPE=submit VALUE="OK">
 <INPUT TYPE=reset VALUE="Reset">
</FORM>
</BODY></HTML>

(HTML purists will not like the above, but I'm trying to be brief.) If you then look at the above in a web browser, you can see how it comes out in that particular browser. Netscape starts on a new line when it finds a FORM tag, but other browsers do not. Essentially you can use any HTML you like within a FORM, but if you want the form to do anything useful then you should include some INPUT tags, as in the above example, which demonstrates the input types "text", "submit" and "reset". Notice also the use of the LABEL tag, and its matching ID attribute in the INPUT line. What this does (in the more advanced browsers) is to cause the question and its text box to be treated as one thing if you are using the keyboard to navigate through the form. This is a great help to blind users, since they will always hear the question when they find the text box.

If you want, you can limit the amount of text that the user can type in the box by giving it a MAXLENGTH attribute (e.g. <INPUT TYPE=text MAXLENGTH=10 NAME=...). Note, however, that this only restricts the amount of text that a web browser will let the user type; a malicious user who wants to break your program can still force a longer input by hand, so your program has to be ready for inputs of any length. You can control the width of the text box on the screen by using SIZE (which need not be the same as MAXLENGTH). SIZE is measured in characters, which depends on the user's font size in some browsers but is fixed in others (e.g. if you have large fonts, Netscape will enlarge the forms also, but Internet Explorer will not). When allocating SIZE, if you want the entirety of the text to fit in the visible area then remember to allow one more place for the cursor. It is not advisable to set a large SIZE value, since users with large fonts or small windows may find that not all of the resulting text box fits on the screen.

You can also include a default value in the text box, by using the VALUE attribute. For example, <INPUT TYPE=text VALUE="Unspecified" NAME=...> will cause the word "Unspecified" to appear in the box by default, but the user can change it. Note that, when the "Reset" button is pressed, the value of the text box now reverts to "Unspecified" (rather than empty), and this is why it is called "reset" rather than "clear".

Another useful thing you can do is to put hidden values in the form, such as the following:

<INPUT TYPE=hidden NAME=language VALUE=Japanese>

I use this for the Japanese entry page to my access gateway so that the users are not confronted with irrelevant options; it can also be useful when the CGI program itself is generating another form (more on this below) and needs to include some values from the first form. (Another use is for automating the selection of your favourite options in a search engine. Simply copy the search engine's form into your web page, taking care to ensure that the action URL points to the right server, and change every option except the "what to search for" box into hidden values. You do not have to publish the URL of the resulting web page of course.)

There are, of course, other input types, such as radio buttons, Netscape's "file" type (which allows the user to upload any file from their computer to your program, although handling it is more complicated), and other types that new browsers might implement before you read this. There are also some special cases where INPUT tags are not used at all, as the following two examples show:

<SELECT NAME="colour">
<OPTION VALUE="red">Red</OPTION>
<OPTION VALUE="green">Green</OPTION>
</SELECT>

(Note that the VALUE gives what is actually sent to your CGI program, as though the user had typed it into a text box, whereas the text before the </OPTION> gives what is displayed by the browser; these need not be the same.)

<TEXTAREA NAME="text" WRAP="hard" ROWS=3 COLS=20>
Here is some default text.
</TEXTAREA>

If you make changes to the form while you are viewing it in your web browser, you should, of course, tell the browser to re-load the page to ensure you are looking at the latest version.

The CGI Program

Let's get the form to actually do something. To do this, you need to write a CGI program. Try the following C:

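#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* getenv returns NULL when the variable is not set */
    const char *query = getenv("QUERY_STRING");

    puts("Content-type: text/html");
    puts(""); /* blank line ends the MIME header */
    puts("<HTML><BODY>");
    puts("The query string is:");
    puts(query != NULL ? query : "(not set)");
    puts("</BODY></HTML>");
    return 0;
}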

If you compile that and run it from the command line, you will find nothing interesting. It simply prints out the value of the environment variable QUERY_STRING, which is probably unset, wrapped in a load of HTML and a MIME header (the "Content-type"). But if you try copying the executable into your CGI directory, make sure the ACTION attribute of your FORM points to its URL (see your sysadmin if you don't know what the URL is), load the form's page in your web browser, reload just in case, type something in the box, and press OK, you should get something like

The query string is: colour=yellow

in your web browser. This is the output of the CGI program after it has been interpreted by the web browser, which, of course, understands HTML. You can get CGI programs to output "text/plain" instead of "text/html". Then you do not have to get them to write HTML, but the result (besides looking very amateurish) causes some web browsers to load a text editor instead of displaying it in the browser, which is not ideal. Of course, the HTML that is output by the CGI can include links (which may be automatically generated) and other forms (perhaps pointing back to the same CGI program and including some values from the first). Some people write their entire websites using CGI, and others (like myself) use a CGI mediator for the whole World Wide Web.

Note that, when I printed out the query string, I used two separate statements to do it, rather than doing something like this:

printf("The query string is: %s\n",
             getenv("QUERY_STRING"));

This is insecure, since some implementations of printf format the string into a fixed-length buffer. If a malicious person puts tens of thousands of characters in the box, and the web server lets it through, then your program may get a buffer overflow, overwriting its memory with the query string, and crash. A very skilled attacker can even cause the web server to execute arbitrary instructions in this way. This is why many prefer Perl, since it's harder to make that kind of mistake. C++ streams (cin, cout) should be fine; it's the printf functions that you need to be careful of. (Yes, you can always say %.500s instead of %s, since it is the precision, not the field width, that limits how many characters are printed; but if you get into the habit of doing that then your program will be riddled with undocumented arbitrary limits.)

Aside: In fact, some web servers do not set the QUERY_STRING environment variable at all if there is no string present, so in C/C++ you should check that the value returned by getenv is not NULL before you use it. You should do the same for any other environment variables that you get.
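A minimal sketch of that check follows. The helper name env_or is hypothetical (it is not part of any library); it simply substitutes a fallback string of your choosing when getenv returns NULL.

```c
#include <stdlib.h>

/* Return the value of the environment variable `name`, or `fallback`
   when the variable is not set at all. env_or is a hypothetical
   helper name used only for this illustration. */
const char *env_or(const char *name, const char *fallback) {
    const char *value = getenv(name);
    return value != NULL ? value : fallback;
}
```

With this in place, something like puts(env_or("QUERY_STRING", "")) never hands a NULL pointer to puts.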

Parsing the Query String

Notice the value of the query string, "colour=yellow". Try adding another text box (with its question and LABEL) to the form, for example, asking "What is your friend's favourite colour?" and with NAME=friend. Then go back and reload in the web browser and you should see two text boxes (if you want them to start on separate lines then use <BR> or whatever in the HTML); type something in both of them and you should get something like:

The query string is: colour=yellow&friend=blue

It helps if you read & as "and". In fact, your program can go through the query string, splitting it whenever it comes to '&' (this is particularly easy for Perl programmers), and splitting names and values whenever it comes to '='. If someone types & or = (or any other special character) in one of the text boxes, then the special characters will be encoded as hexadecimal sequences (see below). However, malicious queries can of course be constructed by hand, and you need to be aware of this. In your early CGI programs, you may not need to parse the query string at all, but you will eventually (and note that the order of the boxes, while often the same as the order they appear on the form, is not guaranteed at all).
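The splitting described above might be sketched in C as follows. The function name query_field and the choice to truncate a value that does not fit the caller's buffer are assumptions made for this illustration; a real program might prefer to allocate the space it needs. The value is returned exactly as it appears in the query string, without any decoding, and the order of the fields does not matter.

```c
#include <stddef.h>
#include <string.h>

/* Look up one field in a query string such as "colour=yellow&friend=blue".
   Copies at most outsize-1 characters of the (still encoded) value into
   out and returns 1 if the field is present, 0 if it is not.
   query_field is a hypothetical name used for this sketch. */
int query_field(const char *query, const char *name,
                char *out, size_t outsize) {
    size_t namelen = strlen(name);
    const char *p = query;

    if (outsize == 0) return 0;
    while (*p != '\0') {
        size_t pairlen = strcspn(p, "&");          /* one name=value pair */
        const char *eq = memchr(p, '=', pairlen);  /* NULL if no '=' */
        size_t keylen = eq ? (size_t)(eq - p) : pairlen;

        if (keylen == namelen && strncmp(p, name, namelen) == 0) {
            const char *val = eq ? eq + 1 : p + pairlen;
            size_t vallen = (size_t)(p + pairlen - val);
            if (vallen >= outsize) vallen = outsize - 1;  /* truncate */
            memcpy(out, val, vallen);
            out[vallen] = '\0';
            return 1;
        }
        p += pairlen;
        if (*p == '&') p++;                        /* skip the separator */
    }
    return 0;
}
```

Because the function works from lengths rather than assuming well-formed input, a hand-built query with stray '&' or missing '=' characters is simply treated as a series of odd-looking pairs.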

Note that spaces in the text are usually (but not always) changed to + (pluses in the text will be encoded in hexadecimal). So if you say your favourite colour is "blue turquoise", the query string will contain colour=blue+turquoise.

The hexadecimal values that I mentioned are of the form %xx, where xx is a hexadecimal number (e.g. 9F or 9f). Your program does need to be able to understand these, since what is and what is not encoded in hex varies from browser to browser, and any text that the user inputs might be encoded in this way. Essentially, whenever you see a % sign, treat the next two characters as a hexadecimal number giving the ASCII code of what should have been sent.
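As a rough sketch, the decoding of '+' and %xx could be done in place like this, since the decoded text is never longer than the encoded text. The names url_decode and hex_value are hypothetical, and leaving a malformed sequence (such as a '%' at the very end) untouched is an assumption of this sketch rather than a rule.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit; the caller has already checked the
   character with isxdigit. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/* Decode s in place: '+' becomes a space and %xx becomes the character
   with that hexadecimal code. Malformed sequences are copied unchanged.
   Returns the length of the decoded string. */
size_t url_decode(char *s) {
    char *in = s, *out = s;

    while (*in != '\0') {
        if (*in == '+') {
            *out++ = ' ';
            in++;
        } else if (in[0] == '%' && isxdigit((unsigned char)in[1])
                                && isxdigit((unsigned char)in[2])) {
            *out++ = (char)(hex_value(in[1]) * 16 + hex_value(in[2]));
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Decode each name and value only after splitting on '&' and '=', otherwise an encoded %26 or %3D in the user's text would be mistaken for a separator. Note also that a hand-built %00 decodes to a zero byte, which C will treat as the end of the string.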

If you leave one of the boxes blank, then you might get something like:

The query string is: colour=&friend=blue

Or, depending on the web browser, you might get:

The query string is: friend=blue

You need to be prepared for such variations in browser behaviour. This is particularly the case if you are using check boxes ("check" is the American word for tick) on your form. Try adding a checkbox, using

<INPUT TYPE=checkbox NAME=box>

(add CHECKED if you want the box to be ticked to begin with). If you tick the box, you may get any of the following:

The query string is: ...&box=on

The query string is: ...&box=1

The query string is: ...&box=

The last result (from Konqueror) can be particularly confusing, since an empty string for text boxes means an empty text box, but an empty string for check boxes does NOT mean an empty check box. If the check box is empty (unticked), then the "box=" will not be present in the query string at all.
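One way to cope with all three variations is to test only whether the name is present and to ignore the value. The sketch below assumes a hypothetical function name, field_present, and works on the raw query string.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if a field called `name` appears anywhere in the query
   string, whatever its value ("on", "1" or nothing at all), else 0.
   field_present is a hypothetical name used for this sketch. */
int field_present(const char *query, const char *name) {
    size_t namelen = strlen(name);
    const char *p = query;

    while (*p != '\0') {
        size_t pairlen = strcspn(p, "&");   /* one name=value pair */
        /* The name matches if it is followed by '=' or ends the pair. */
        if (pairlen >= namelen && strncmp(p, name, namelen) == 0 &&
            (pairlen == namelen || p[namelen] == '=')) {
            return 1;
        }
        p += pairlen;
        if (*p == '&') p++;                 /* skip the separator */
    }
    return 0;
}
```

A ticked box then gives field_present(query, "box") == 1 for "box=on", "box=1" and "box=" alike, and an unticked box gives 0.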

If you want more than one submit button on a form, then you need to give each button a NAME, and which button was pressed will be indicated in the query string as name=value in the usual way (using the NAME and VALUE of the button). The VALUE is what will be printed on the button. If you want an image to be printed on the button instead, then you can do something like

<INPUT TYPE=image SRC="fancy-button.gif" ALT="OK" HEIGHT=... WIDTH=...>

but there is a bug in Netscape that will complicate matters if you use NAME/VALUE. What should happen in this case is for "OK" to be used as the ALT text for the image, to be displayed if images are turned off or to be used as "bubble help" for the image. However, if the user happens to be running a recent version of Netscape, then the NAME attribute will be used instead of the ALT. This is a bug, but browser bugs are something that you have to put up with and work around (bearing in mind that they should one day be fixed).

The attitude of a certain prominent Web standards consortium of ignoring bugs on the grounds that they will eventually be fixed leads to web pages that don't work on today's browsers (certain CSS techniques being a prime example). It is therefore a good idea to make such NAME attributes at least vaguely meaningful, since it is quite possible that the user will actually see them.

The MIME Header

The output of the CGI script before the first blank line is the MIME header. It should at least contain "Content-type"; most web servers will fill in the other details, and at this stage if you come across a web server that doesn't then it's probably best to use a better server (Apache is pretty good, and free).

If your script does not return the correct headers, then the web server may return its "500 Server Error" page to the browser, which is not particularly helpful but there might be a better message in the server's error log. The way to debug a CGI that does this is to set the query string at the command line, run it directly (not through the web server) and look at the output.

The content type need not necessarily be "text/html". It could be something like "image/gif" and you could return an image file instead of an HTML document. If you do this and your web server is running on a Windows machine, remember to make sure that stdout is in binary ("raw") mode, otherwise the C runtime library will turn every character 10 (\n) into 13 10 (\r\n), which will corrupt a binary file. The best way of returning binary files that are not automatically generated is to use the "Location" header in the MIME header, as in

Location: http://whatever

This is used for CGIs that take you to a random page or return a random MIDI file. However, you do need to return some HTML as well that points users in the right direction, in case their browser does not honour such redirections.
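Both ways of returning a binary file can be sketched in C as below. The function names are hypothetical. The _setmode, _fileno and _O_BINARY names are those of the Microsoft C runtime and are only compiled on Windows; on Unix-like systems no translation takes place, so nothing needs doing. The sketch also assumes that the URL passed to send_redirect is one your program chose itself, not text taken from the user.

```c
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

/* Put stdout into binary mode where text and binary streams differ.
   Returns 0 on success, -1 on failure. */
int stdout_to_binary(void) {
#ifdef _WIN32
    return _setmode(_fileno(stdout), _O_BINARY) == -1 ? -1 : 0;
#else
    return 0;   /* no newline translation to switch off */
#endif
}

/* Send len bytes of GIF data as the whole CGI response. */
int send_gif(const unsigned char *data, size_t len) {
    if (stdout_to_binary() != 0) return -1;
    fputs("Content-type: image/gif\n\n", stdout);
    if (fwrite(data, 1, len, stdout) != len) return -1;
    return fflush(stdout) == 0 ? 0 : -1;
}

/* Redirect the browser to url, with an HTML link as a fallback for
   browsers that do not follow the Location header. */
int send_redirect(const char *url) {
    fputs("Location: ", stdout);
    fputs(url, stdout);
    fputs("\nContent-type: text/html\n\n", stdout);
    fputs("<HTML><BODY>Please follow <A HREF=\"", stdout);
    fputs(url, stdout);
    fputs("\">this link</A>.</BODY></HTML>\n", stdout);
    return ferror(stdout) ? -1 : 0;
}
```

A given run of the program would call only one of the two senders, since each writes a complete response including its own MIME header.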

You can also use the headers to set cookies (spare us) and things like keywords for search engines, although sometimes it is better to use META tags as these are saved with the document if someone saves it to disk (the MIME header is not saved).

The MIME header that the browser sends to the server is put in the environment, with HTTP_ prefixed to each variable. Most web browsers send the "User-Agent" header to identify the browser, so this is available in HTTP_USER_AGENT (which you can read with getenv in C). It is inadvisable to return a completely different page to each browser as this can make testing rather awkward (as well as making it difficult for others who are testing your page if you do not make it clear). Both Netscape and Internet Explorer announce themselves as "Mozilla"; IE says "Mozilla compatible" so that it works with servers and CGIs that search for the word Mozilla. My access gateway uses a similar trick, since some web servers refuse access to browsers not containing "Mozilla 4".

It can be instructive to write a CGI program that executes the "env" (or "set") command so that you can see all the variables that are available on a particular server. Never rely on any HTTP_ variable being set (or being set to a sensible value), since they are all coming from the browser and a malicious attacker can of course construct bogus ones by hand. If you write a "robot trap" that denies access to email-collecting robots based on the user agent header, it is advisable to include a message explaining the situation, just in case someone is viewing your page through a personal proxy that removes or changes this information (it can happen).
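If you would rather not run an external command, a small C program can list the variables itself. This sketch assumes a POSIX system, where the environment is available through the global variable environ; the function name dump_environment is hypothetical. It uses "text/plain" so that no HTML escaping of the values is needed, which is acceptable for a debugging aid.

```c
#include <stdio.h>

extern char **environ;   /* POSIX: NULL-terminated list of NAME=value strings */

/* Write every environment variable, one per line, as a plain-text CGI
   response. Returns the number of variables written. */
int dump_environment(FILE *out) {
    int count = 0;

    fputs("Content-type: text/plain\n\n", out);
    for (char **e = environ; e != NULL && *e != NULL; e++) {
        fputs(*e, out);
        fputc('\n', out);
        count++;
    }
    return count;
}
```

Calling dump_environment(stdout) from main is all that is needed. Remove such a script once you have finished with it, since the listing tells a stranger a good deal about the server.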

The REMOTE_HOST variable is usually set to the domain name of the user's computer, which may be something like pc003.joh.cam.ac.uk or it might be unset if the DNS lookup fails (or if the server is set to not make a lookup). REMOTE_ADDR is set to the IP address. This can be used if you need to only allow local users to use your script, but again it is best to include an explanatory message when denying access in case for some reason access is denied to someone for whom it should not be.
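A minimal sketch of such a check, comparing the start of REMOTE_ADDR with a network prefix, might look like this. The function names and the prefix "192.168.1." are assumptions for illustration only; substitute your own network. The prefix should end with a dot, otherwise "192.168.1" would also match 192.168.10.7.

```c
#include <stdlib.h>
#include <string.h>

/* Return 1 if addr is set and begins with prefix, otherwise 0. */
int addr_has_prefix(const char *addr, const char *prefix) {
    return addr != NULL && strncmp(addr, prefix, strlen(prefix)) == 0;
}

/* Return 1 if the request comes from the (hypothetical) local network
   192.168.1.x according to REMOTE_ADDR. */
int is_local_user(void) {
    return addr_has_prefix(getenv("REMOTE_ADDR"), "192.168.1.");
}
```

When is_local_user returns 0, the program should print the explanatory message rather than failing silently.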

The POST Method

You may notice when you try out your form that the URL actually requested by the browser is formed by adding a '?' to your CGI URL and then the query string. This is useful, because the complete query (including all the input) is a single URL that can be included in links and bookmarks. It can also be used to construct and modify CGI queries "by hand". Note: If you are typing one of these things on the Lynx command line, you must be careful to quote it correctly, since & happens to be the Unix "background" operator; a command line of the form a&b&c does a, b and c all at the same time, and the results can be quite spectacular (but not necessarily what you wanted).

While GET URLs are useful, they do have their disadvantages. You may not want it to be possible to include all the user input in links and bookmarks, perhaps because running your program will cause a credit card to be billed, in which case you do not want any accidents with the program being run twice. (You should of course make sure that the server has encryption if you are going into that market.) Also, URLs that include all the user input tend to be very long, and can get unwieldy especially in links (take a look at the HTML source of a page processed by the access gateway to see what I mean). If you do not like them then you may wish to use METHOD=post instead of METHOD=get.

In the POST method, instead of QUERY_STRING being set, the query string is placed on the standard input (stdin) by the web server. (Some web servers can be set to set QUERY_STRING as well, but this does not always happen.) The CONTENT_LENGTH environment variable is set to the number of bytes placed on stdin. Read exactly that many bytes rather than reading until EOF, since the server is not obliged to signal end-of-file after them. It is probably worth writing your CGI in such a way that it can get its query string from either the environment or the standard input, so that it doesn't matter which method you use and you can change it later.
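A sketch of a reader that accepts either method is given below. It relies on the standard CGI variable REQUEST_METHOD, which is not discussed in this article, to tell the two cases apart. The name read_query and the MAX_POST limit are assumptions of this sketch; the limit is there so that a bogus CONTENT_LENGTH cannot make the program allocate an enormous buffer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_POST 1048576L   /* hypothetical upper limit on POST data, in bytes */

/* Return a newly allocated copy of the query string: read from `in`
   for POST requests, taken from QUERY_STRING otherwise. Returns NULL
   on failure. The caller must free the result. */
char *read_query(FILE *in) {
    const char *method = getenv("REQUEST_METHOD");

    if (method != NULL && strcmp(method, "POST") == 0) {
        const char *lenstr = getenv("CONTENT_LENGTH");
        long len = lenstr != NULL ? strtol(lenstr, NULL, 10) : 0;
        if (len < 0 || len > MAX_POST) return NULL;

        char *buf = malloc((size_t)len + 1);
        if (buf == NULL) return NULL;
        /* Read at most CONTENT_LENGTH bytes; do not wait for EOF. */
        size_t got = fread(buf, 1, (size_t)len, in);
        buf[got] = '\0';
        return buf;
    }

    const char *query = getenv("QUERY_STRING");
    if (query == NULL) query = "";          /* unset means no query */
    char *copy = malloc(strlen(query) + 1);
    if (copy != NULL) strcpy(copy, query);
    return copy;
}
```

The rest of the program can then call read_query(stdin) once and parse the result in the same way whichever METHOD the form uses.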

Wanted

There is also another method that is sometimes used, which is a variation on POST with the encoding (the ENCTYPE attribute of the FORM) set to "multipart/form-data". It is not supported by all browsers. You need to use this method if you are using Netscape's FILE input type to allow the user to upload files, but some people use it anyway, particularly if they've built their website out of Microsoft tools, which seem to use it as the default. I would very much like an article on how to deal with this encoding method, since my access gateway currently does not support it and so any web page that uses it is currently inaccessible to me.
