Reading CGI Data: url-encoding and the CGI protocol

This is a first draft. Comments are welcome.
What makes a script a CGI script is its ability to read and understand form data submitted to it from a web form. When a user submits her answers on a form, her browser bundles them up in a special format and sends them back to the web server, which passes them on to your script. Your script must know how to acquire that bundle of data and then unbundle it. The CGI protocol is the language that specifies how the data is bundled and supplied to a CGI script.

In this tutorial, we'll provide an overview of the CGI protocol, from the point of view of the CGI script developer and web page author. First, we'll describe some of the most common methods for transmitting data to your script, using methods GET and POST, as well as the query string and path info. Then we'll describe how the web browser packages the data in a special format using url-encoding. We will not describe the underlying HTTP protocol, or multipart/form-data, which is used for file upload. Each of these is complicated enough to require a separate discussion.

Transmitting the Form Data

When the user submits her form data on your web form, her browser bundles it up and transmits it to your web server, which in turn passes it to your CGI script. Let's first talk about that second step, since it's the first thing your script must do: acquire the raw data. The script must do different things for each method to read the data, so we need to understand each.

You can choose among several methods to transmit form data to your script. Two methods--the query string and path info--place the data inside the URL that points to the cgi script. (Strictly speaking, Uniform Resource Identifier, or URI, is the more general term that covers URLs, with or without form data.) Two other methods--GET and POST--are specified as values of the METHOD attribute of the `<FORM METHOD=...>` tag in the html source of the web form. Each method has its own advantages.

The Query_String and Method Get

The query string is the simplest method of passing form data to a script. If you append a question mark (?) to the url of your script, then any characters after the question mark will be passed to your script in the environment variable called `QUERY_STRING`. A perl script can access its environment through the `%ENV` associative array. So, suppose your script is called `script.cgi` and you make a hypertext link to your script (or type into your browser):
http://www.your_site.com/script.cgi?howdy_doody
The string "howdy_doody" will be the QUERY_STRING and will be available to your script as the contents of `$ENV{'QUERY_STRING'}`. The script can read that variable and process it however you wish.

The environment of a program is not a mysterious concept. Almost every script or program in Unix has an environment associated with it. It contains information that the script might find useful, such as what its name is and who launched it. And you can customize the environment of a script. Web servers usually supply cgi scripts with additional environment information, letting them know who the web server is, and who the user is. As is described in the CGI Tour, you can get a list of all the environment variables by examining the contents of the associative array, `%ENV`.

Give it a try with the Web Form Analyzer: go to the URL
http://www.halcyon.com/sanford/cgi/perl_form.cgi?howdy_doody
or type in the url yourself and supply your own query string. You'll see a report of all the environment variables available to the script, including the QUERY_STRING, which will have the value "howdy_doody".

By using the QUERY_STRING in the url of a link, you can supply form data to a script without having a form. (Read that again.) It's handy if you want to link to a script and supply a small amount of information without bothering to construct an elaborate form or asking the user to press a Submit button.

If you do construct a form, and specify METHOD=GET in your `<FORM METHOD=...>` tag, then your data will be bundled into the QUERY_STRING and submitted to your script, just as above. This is the way many Internet search engines work, such as Yahoo or Lycos. After you type your search terms and press the submit button, look at the url of the search results page. It will have a query string following a ?; somewhere in it will be the data you supplied. For web forms using METHOD=GET, the url will have a query string appended even though you didn't put it there yourself. The browser did it automatically when you submitted the form.

The difference between submitting a form with METHOD=GET and typing the query string into the script's url yourself (e.g., as a link in a web page) is that the browser url-encodes the form data before sending it. If you place it in the url yourself, you are responsible for url-encoding. We'll talk about this below, but for the moment, just know that url-encoding is complicated. It's easy to make a mistake when encoding by hand, with unpredictable and generally unhappy results. Unless the data you want to send is very simple, you should have the browser do it for you, by using a form with METHOD=GET instead of supplying your own query string in the URL.

The main disadvantage of METHOD=GET and of passing data through the query string in a web form is the restriction on length.
Most operating systems have limits on the length of the path to a file (256 characters is common, but the limit on some systems can be much shorter). The web server treats the local portion of the url, including the query string, as a kind of file path. So you can't pass very much data this way.

Another disadvantage is that the url, including the query string, is often collected in the access logs that most servers maintain. If your access logs are public, you may not object to having your hits recorded, but your form data might contain information you'd prefer not to expose so easily.

Extra Path Info

Another simple method of sending data to a script is through the PATH_INFO environment variable. Similar to the query string, path info is whatever comes after the script name in the url. You need to start the path info with a slash (/) to let the web server know where the script name ends. Try the Web Form Analyzer again, this time with path info instead of a query string.
http://www.halcyon.com/sanford/cgi/perl_form.cgi/howdy.doo
You'll see now that "/howdy.doo" is listed as the value of the PATH_INFO environment variable. You can supply both path info and a query string in the url as in the following link:
http://www.halcyon.com/sanford/cgi/perl_form.cgi/howdy.doo?Fine,thanks
though you must supply the path info first and the query string second. Otherwise the query string will gobble up the path info. Try it by reversing the two and typing this example into your browser.

Path info need not be a path to any particular file or directory, though this was its original intent and is still the most common use. For example, many hit counter scripts are installed for system-wide use. If one script serves many users, the script must be told which page is to be counted. Path info will often be the (file system or url) path to the particular file that contains the current count.

Path info has the same disadvantages as the query string. It is not automatically url-encoded, it is subject to the same path length limitations, and it is reported in the server logs.

Standard Input and Method Post

Because of these limitations, yet another method of transmitting form data to a script was developed, and is now the most common and recommended method. Method POST sends data to a script's STANDARD INPUT. It's less public (it's not reported in the server logs) and in principle there are no length limitations. (But not in practice. I discovered the hard way that America Online's browser is limited to 7 to 15 kilobytes, depending on RAM and other machine configuration.) On the other hand, you cannot supply POST data to a script directly in the url, as can be done with the query string or path info. You really need a web form to use METHOD=POST.

So what's required to get data from STDIN into a script? In scripts run by hand, STDIN is normally interactive keyboard input, or input from the contents of a file if shell redirection (`script.pl < data_file`) is used. In these cases, perl scripts commonly read input line by line with a locution like: `while ($next_line = <STDIN>) { ... }` The script would continue reading from STDIN until the end of the data file is reached or the user types the EOF character (usually Ctrl-D).
But the web server is running the CGI script, not you, so there is no interactive keyboard input and no data file. Or at least it would be awkward for the server to put the data in a file and then redirect it to the script (though I understand that servers on MS Windows do essentially use a data file since they have no concept of STDIN.) There is a simpler method of reading input for CGI scripts using Perl's `read()` function, which works like this:
$bytes_read = read (STDIN, $form_data, $num_bytes);
This code specifies where to read from (STDIN or any open file handle), how many bytes to read (`$num_bytes`) and where to put the data (in the `$form_data` scalar variable). The `read` function returns the number of bytes that were successfully read.

Of course, unlike the query string or path info, reading METHOD=POST data this way requires knowing the number of bytes of form data being supplied. This could be a tedious job if you had to do it yourself, but you don't. The web browser automatically counts the bytes of form data for you, and the web server passes that count to the script in the CONTENT_LENGTH environment variable. So you can set
$num_bytes = $ENV{'CONTENT_LENGTH'};
without having to calculate it yourself. You can see an example of this in the Web Form Analyzer. Click on that link (without supplying a query string or path info) and an example form will appear which uses METHOD=POST. After you submit, you'll see the CONTENT_LENGTH reported. There is another example script below, where you can actually count the bytes of form data yourself.

Bundling and url-encoding

Having discussed the common methods of transmitting form data, we know how a CGI script can acquire the data. But we must also discuss how the script interprets the data, since browsers will bundle it up in a special sort of encoding.

Why is it encoded? Form data often consists of a number of distinct items, each with a name and a value. It's all packaged into a single string or stream, so the script needs to unpack it into distinct items and differentiate the names and values. For example, if you have some text input tags in your web form:
<input type=text name="user_name" value="">
<input type=text name="user_personality" value="">
then blank text entry fields will appear on the form where the user can type her name and personality. When she submits the form, your script must be told that form data called `user_name` and `user_personality` were submitted with values `Sally Smith` and `happy & care-free`, for example. Browsers will bundle this data into a single string using these rules of url-encoding:

1. All the submitted form data will be concatenated into a single string of ampersand (&) separated name=value pairs, one pair for each form tag. Like this:

   form_tag_name_1=value_1&form_tag_name_2=value_2&...

2. Any spaces occurring in a name or value will be replaced by a plus (+) sign. This is because url's cannot have spaces in them and, under METHOD=GET, the form data is supplied in the query string in the url.

3. Other punctuation characters (for example, equal signs and ampersands) occurring in names or values will be replaced with a percent sign (%) followed by the two-digit hexadecimal equivalent of the punctuation character in the ASCII character set. (Hexadecimal is base 16.) Otherwise, it would be hard to distinguish these characters inside a form variable from those between the form variables in the first rule above.

So, this is the string that encodes the form data above that would be passed to your script:
user_name=Sally+Smith&user_personality=happy+%26+care-free
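The encoding rules can also be sketched in Perl, which may help when a query string has to be built by hand. This is only an illustration, not the code any browser actually uses: the helper names `url_encode` and `bundle_form_data` are made up for this sketch, the input is assumed to be plain ASCII text, and the set of characters left unencoded (letters, digits, hyphen, underscore, period) is an assumption, since browsers differ on the details.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the tutorial): url-encode one name or value.
# Assumes plain ASCII input.
sub url_encode {
    my ($text) = @_;
    # Replace everything except letters, digits, '-', '_', '.' and space
    # with a percent sign followed by two hexadecimal digits.
    $text =~ s/([^A-Za-z0-9\-_. ])/sprintf("%%%02X", ord($1))/ge;
    # Spaces become plus signs.
    $text =~ tr/ /+/;
    return $text;
}

# Hypothetical helper: bundle name/value pairs into a single string.
sub bundle_form_data {
    my @pairs = @_;    # flat list: name1, value1, name2, value2, ...
    my @encoded;
    while (@pairs) {
        my $name  = shift @pairs;
        my $value = shift @pairs;
        push @encoded, url_encode($name) . '=' . url_encode($value);
    }
    return join('&', @encoded);
}

# Reproduces the example string shown above.
print bundle_form_data(
    'user_name'        => 'Sally Smith',
    'user_personality' => 'happy & care-free',
), "\n";
```

Encoding the percent-escapes before turning spaces into plus signs matters here: a literal plus sign in the data is first turned into %2B, so it cannot later be confused with an encoded space.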
%26 is the hex value for the ampersand (&). Here's a very short form that will let you experiment with url-encoded form data. It points to a script that's much like the Web Form Analyzer, but it doesn't decode the url-encoded data. It merely reads in the raw encoded data as it was sent from the web browser and reports on it verbatim. Try it with spaces and various punctuation characters to see how they are encoded into hex values.
<form method=get action="read.cgi">
<input type=text name="user_name" value="">
<input type=text name="user_personality" value="">
<input type=submit name=submit value="Go">
</form>
This form uses METHOD=GET so you can see the query string also reported in the url. You can point your own web form to the script at: `http://www.halcyon.com/sanford/cgi/readdata/read.cgi` If you use METHOD=POST in your form, you'll see CONTENT_LENGTH reported as an environment variable, which will allow you to verify the count of bytes in the form data. You'll also see another environment variable reported, called CONTENT_TYPE, with a value of `application/x-www-form-urlencoded`, indicating that the form data has been encoded.

Some Perl code

So how do you decode this? Essentially, just by following the rules of url-encoding in reverse. In fact, let's build some Perl code which will first read in some data and then decode it, very much as the ReadParse function, which we use in many of the scripts here, does.

First, we'll need to acquire the data, which can be supplied with method GET in the query string, or method POST in STDIN. In either case, let's put the data into a scalar, `$in`, so we can decode it later.
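if ( $ENV{'REQUEST_METHOD'} eq "GET" ) {
    # METHOD=GET: the form data arrives in the query string
    $in = $ENV{'QUERY_STRING'};
} elsif ($ENV{'REQUEST_METHOD'} eq "POST") {
    # METHOD=POST: read CONTENT_LENGTH bytes from standard input
    read(STDIN, $in, $ENV{'CONTENT_LENGTH'});
}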
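The fragment above trusts that a single `read` delivers all of the form data. A slightly more defensive variant might look like the following sketch. The sub name `read_form_data` and its two arguments (a reference to the environment hash and a file handle) are assumptions made here so that the code can be tried without a web server; they are not part of ReadParse or of the tutorial's scripts.

```perl
use strict;
use warnings;

# Hypothetical helper (name and arguments are assumptions for this sketch):
# return the raw, still url-encoded form data.
sub read_form_data {
    my ($env, $fh) = @_;    # e.g. read_form_data(\%ENV, \*STDIN)
    my $method = $env->{'REQUEST_METHOD'} || '';

    if ($method eq 'GET') {
        return defined $env->{'QUERY_STRING'} ? $env->{'QUERY_STRING'} : '';
    }
    elsif ($method eq 'POST') {
        my $num_bytes = $env->{'CONTENT_LENGTH'} || 0;
        my $data      = '';
        # read() may return fewer bytes than requested, so keep reading
        # until the expected count has arrived or the input ends.
        while (length($data) < $num_bytes) {
            my $got = read($fh, $data, $num_bytes - length($data), length($data));
            die "error reading form data: $!" unless defined $got;
            last if $got == 0;    # input ended before CONTENT_LENGTH bytes
        }
        return $data;
    }
    return '';    # any other method: nothing to read in this sketch
}

# In a CGI script this would be called with the real environment and STDIN.
my $in = read_form_data(\%ENV, \*STDIN);
```

Passing the environment and the file handle as arguments is only a convenience for experimenting; the logic is the same as in the fragment above, with a loop added around `read`.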
At this point, we have acquired the data and need no longer be concerned about the method. We do need to decode it and store it in some convenient form. Having loaded this url-encoded string into a variable called `$in`, a convenient way to access this form data later in a script would be an associative array, keyed on tag names. So our goal is to process the string into an associative array or hash called `%in`. The keys will be the tag names and the corresponding values in the hash will be the form values as the user submitted them. For example, `$in{'user_name'}` will be the string `Sally Smith`.

So, let's next split the url-encoded string in the scalar `$in` into the separate form data. We split on ampersand (&), since url-encoding guarantees this character won't occur inside a name or value, and then store the resulting list in an array called `@in`:
@in = split (/&/, $in);
Now, each element of `@in` is form data, in the format `name=value`. Next, we'll process each element of this array, first putting back all the spaces in place of the plus signs, and then splitting the array element into its name and value components. We need to do this before we undo the hexadecimal encoding since we don't want to put back any equal signs that might occur in the name or value before splitting.
foreach (@in) {
    s/\+/ /g;
    ($name, $value) = split(/=/,$_);
Then decode each component.
    ($name, $value) = split(/=/,$_);    # repeated from the previous fragment
    $name  =~ s/%(..)/pack("c",hex($1))/ge;
    $value =~ s/%(..)/pack("c",hex($1))/ge;
These two substitutions do the decoding from hexadecimal to ASCII characters. They look for a percent sign and remember the following two characters in the `$1` variable. `hex($1)` does the hexadecimal (base 16) to decimal (base 10) translation. Then the `pack` function (with its "c" argument) translates this number into its corresponding character in the ASCII character set. Finally, the substitution replaces the percent sign and the following two characters in the string with the resulting ASCII character. The `g` at the end means to make this substitution as many times as possible in the string (globally), not just once. The `e` means to evaluate the replacement as an expression; otherwise, it would just substitute the characters `pack(...)`.

Finally, we load the name and value into the associative array `%in`. In case the same form tag name has been supplied twice (for example, multiple options were selected in a `<select>` tag), we save both values, separated by the null character (\0), so they can be recalled and separated later in your script.
    $in{$name} .= "\0" if (defined($in{$name}));
    $in{$name} .= $value;
}
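For reference, the fragments above can be gathered into a single sub, as in the sketch below. This is one possible arrangement rather than the ReadParse code itself: the name `parse_form_data` is made up, the `color` field in the example is hypothetical, and a few details differ from the fragments (the sketch runs under `use strict`, uses the unsigned `"C"` template for `pack`, splits each pair on the first equal sign only, and skips empty pairs).

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption for this sketch): decode a
# url-encoded string into a hash keyed on form tag names.
sub parse_form_data {
    my ($raw) = @_;
    my %form;
    foreach my $pair (split /&/, $raw) {
        next if $pair eq '';                    # skip empty pairs such as "a=1&&b=2"
        $pair =~ tr/+/ /;                       # plus signs back to spaces
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;      # a name with no "=value" part
        for ($name, $value) {
            # %XX back to the corresponding character
            s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;
        }
        # Repeated names: keep every value, separated by the null character.
        $form{$name} .= "\0" if defined $form{$name};
        $form{$name} .= $value;
    }
    return %form;
}

my %in = parse_form_data(
    'user_name=Sally+Smith&user_personality=happy+%26+care-free&color=red&color=blue'
);
print "$in{'user_name'} is $in{'user_personality'}\n";

# Values that share a name can be separated again on the null character.
my @colors = split /\0/, $in{'color'};
print "colors: @colors\n";
```

The last two lines show the other half of the null-character convention: a script that expects several values for one name splits the stored string on `\0` to get them back as a list.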
That's all that's required. Now your script can get a list of the names of all the submitted form data as `keys %in` and access the value of any particular form variable as `$in{$name}`.

Further Reading

These days, most definitive documentation on the web seems to be maintained at, or linked from, the pages of the World Wide Web Consortium. In particular, you can find the formal specifications linked from their page on CGI: Common Gateway Interface. CGI is, of course, intimately connected with the web, so for the most complete understanding, you should also look at the specifications for HTTP.
Copyright 1995-97, Sanford Morton. Last modified: Sun Mar 16 01:14:09 PDT