Reading CGI Data: url-encoding and the CGI protocol
Transmitting the Form Data
The Query_String and Method Get
Extra Path Info
Standard Input and Method Post
Bundling and url-encoding
Some Perl code
Further Reading
This is a first draft.
Comments are welcome.
What makes a script a CGI script is its ability to read and understand
form data submitted to it from a web form. When a user submits her
answers on a form, her browser bundles them up in a special format and
sends them back to the web server, which passes them on to your
script. Your script must know how to acquire that bundle of data and
then unbundle it. The CGI protocol is the language that specifies how
the data is bundled and supplied to a CGI script.
In this tutorial, we'll provide an overview of the CGI protocol, from
the point of view of the CGI script developer and web page
author. First, we'll describe some of the most common methods for
transmitting data to your script, using methods GET and POST, as well
as with the query string and path info. Then we'll describe how the
web browser packages the data in a special format using
url-encoding.
We will not describe the underlying HTTP protocol, or
multipart/form-data, which is used for file upload. Each of these is
complicated enough to require a separate discussion.

Transmitting the Form Data

When the user submits her form data on your web form, her browser
bundles it up and transmits it to your web server, which in turn
passes it to your CGI script.
Let's first talk about that second step, since it's the first thing your script must do: acquire the raw data.
The script must read the data differently for each method, so we need to
understand each.
You can choose among several methods to transmit form data to your script.
Two methods--the query string and path info--place the data inside the
URL that points to the CGI script. (The URL is still an ordinary URL
when it carries this data; Uniform Resource Identifier, or URI, is
simply the broader term that includes URLs.) Two other
methods--GET and POST--are specified as attributes of the <FORM
METHOD=...> tag in the HTML source of the web form. Each
method has its own advantages.

The Query_String and Method Get

The query string is the simplest method of passing form data to a
script. If you append a question mark (?) to the url of your script,
then any characters after the question mark will be passed to your
script in the environment variable called QUERY_STRING. A
Perl script can access its environment through the %ENV
associative array. So, if your script is called script.cgi and you make a hypertext link to your script
(or type into your browser)
|
http://www.your_site.com/script.cgi?howdy_doody
|
then the string "howdy_doody" will be the QUERY_STRING and will be
available to your script as the contents of
$ENV{'QUERY_STRING'}. The script can read that variable and
process it however you wish.
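As a minimal sketch of the idea, a script might fetch the query string like this. The helper name query_string is hypothetical, and the sketch assumes the script should fall back to an empty string when the server supplies no QUERY_STRING (for example, when you run the script by hand).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the raw query string, or an empty string if none was supplied.
# query_string is a hypothetical helper name, not part of any library.
sub query_string {
    my ($env) = @_;    # reference to an environment hash such as \%ENV
    return defined $env->{'QUERY_STRING'} ? $env->{'QUERY_STRING'} : '';
}

# A CGI script prints a header and a blank line before its output.
print "Content-type: text/plain\n\n";
print "The query string is: ", query_string(\%ENV), "\n";
```

Passing the environment in as a hash reference, rather than reading %ENV inside the helper, makes it easy to try the function without a web server.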
The environment of a program is not a mysterious concept. Almost every script
or program in Unix has an environment associated with it. It
contains information that the script might find useful, such as what
its name is and who launched it. You can also customize the environment
of a script. Web servers usually supply CGI scripts with additional
environment information, letting them know who the web server is and
who the user is. As is described in the CGI
Tour, you can get a list of all the environment variables by
examining the contents of the associative array %ENV.
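Listing the environment takes only a few lines. The following is a sketch rather than the code of any particular script; env_report is a hypothetical helper name, and sorting the names is simply a choice made here for readability.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format every variable in an environment hash as "NAME = value" lines,
# sorted by name so the report is easy to scan.
# env_report is a hypothetical helper name.
sub env_report {
    my ($env) = @_;
    return join '', map { "$_ = $env->{$_}\n" } sort keys %$env;
}

print "Content-type: text/plain\n\n";
print env_report(\%ENV);
```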
Give it a try with the Web Form Analyzer: go to the URL
|
http://www.halcyon.com/sanford/cgi/perl_form.cgi?howdy_doody
|
or type in the url yourself and supply your own query string. You'll
see a report of all the environment variables available to the script,
including the QUERY_STRING, which will have the value "howdy_doody".
By using the QUERY_STRING in the url of a link, you can supply form
data to a script without having a form. (Read that again.) It's handy
if you want to link to a script and supply a small amount of
information but without bothering to construct an elaborate form, or
asking the user to press a Submit button.
If you do construct a form, and specify METHOD=GET in your
<FORM METHOD=...> tag, then your data will be bundled
into the QUERY_STRING and submitted to your script, just as
above. This is the way many Internet search engines work, such as Yahoo or Lycos. After you type your search
terms and press the submit button, look at the url of the search
results page. It will have a query string following a ?; somewhere in
it will be the data you supplied. For web forms using METHOD=GET, the
url will have a query string appended even though you didn't put it
there yourself. The browser did it automatically when you submitted
the form.
The difference between a form using METHOD=GET and a query string you
type into the script's url yourself (eg, as a link in a web page) is that the form
data will be url-encoded by the browser before it is sent. If you
place it in the url yourself, you are responsible for url-encoding.
We'll talk about this below, but for the moment, just know that
url-encoding is complicated, and it's easy to make a mistake with
unpredictable and generally unhappy results. Unless the data
you want to send is very simple, you should have the browser do it for
you, by using a form with METHOD=GET instead of supplying your own
query string in the URL.
The main disadvantage of METHOD=GET and passing data through the query
string in a web form is restrictions on length. Browsers and web servers
each limit how long a url they will accept, and the query string must
also fit within the operating system's limits on the size of the
environment. The limits vary from system to system, and some older
browsers and proxies do not reliably handle urls longer than about 255
characters. So you can't count on passing very much data this way.
Another disadvantage is that the url, including the query string, is
often collected in access logs most servers maintain. If your access
logs are public, you may not object to having your hits recorded, but
your form data might contain information you'd prefer not to so easily
expose.

Extra Path Info

Another simple method of sending data to a script is through the
PATH_INFO environment variable. Similar to the query string, path info
is whatever comes after the script name in the url. You need to start
the path info with a slash (/) to let the web server know where the
script name ends.
Try the Web Form Analyzer again, this time with path info instead of a
query string.
|
http://www.halcyon.com/sanford/cgi/perl_form.cgi/howdy.doo
|
You'll see now that "/howdy.doo" is listed as the value of the
PATH_INFO environment variable.
You can supply both path info and a query string in the url as in the
following link:
|
http://www.halcyon.com/sanford/cgi/perl_form.cgi/howdy.doo?Fine,thanks
|
though you must supply the path info first and the query string
second. Otherwise the query string will gobble up the path info. Try
it by reversing the two and typing this example into your browser.
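A script can pick up both pieces from its environment. This is a small sketch, assuming that a missing variable should be treated as an empty string; path_and_query is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the extra path info and the query string out of an environment
# hash, defaulting each to an empty string when it is not set.
# path_and_query is a hypothetical helper name.
sub path_and_query {
    my ($env) = @_;
    my $path  = defined $env->{'PATH_INFO'}    ? $env->{'PATH_INFO'}    : '';
    my $query = defined $env->{'QUERY_STRING'} ? $env->{'QUERY_STRING'} : '';
    return ($path, $query);
}

my ($path, $query) = path_and_query(\%ENV);
print "Content-type: text/plain\n\n";
print "PATH_INFO:    $path\n";
print "QUERY_STRING: $query\n";
```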
Path info need not be a path to any particular file or directory,
though this was its original intent and is still the most common
use. For example, many hit counter scripts are installed for
system-wide use. If one script serves many users, the script must be
told which page is to be counted. Path info will often be the (file
system or url) path to the particular file that contains the current
count.
Path info has the same disadvantages as the query string. It is not
automatically url-encoded, it is subject to the same length
limitations, and it is reported in the server logs.

Standard Input and Method Post

Because of these limitations, yet another method of transmitting form
data to a script was developed, and is now the most common and
recommended method. Method POST sends data to a script's STANDARD
INPUT. It's less public (it's not reported in the server logs) and in
principle there are no length limitations. (But not in practice. I
discovered the hard way that America Online's browser is limited to 7
to 15 kilobytes, depending on RAM and other machine configuration.) On
the other hand, you cannot supply POST data to a script directly in
the url, as can be done with the query string or path info. You really
need a web form to use METHOD=POST.
So what's required to get data from STDIN into a script? In scripts
run by hand, STDIN is normally interactive keyboard input, or input
from the contents of a file if shell redirection (script.pl <
data_file) is used. In these cases, Perl scripts commonly read
input line by line with a locution like: while ($next_line =
<STDIN>) { ... }. The script would continue reading from
STDIN until the end of the data file is reached or the user types the
EOF character (usually Ctrl-D).
But the web server is running the CGI script, not you, so there is no
interactive keyboard input and no data file. Or at least it would be
awkward for the server to put the data in a file and then redirect it
to the script (though I understand that servers on MS Windows do
essentially use a data file, since they have no concept of STDIN).
There is a simpler method of reading input for CGI scripts using
Perl's read() function, which works like this:
|
$bytes_read = read (STDIN, $form_data, $num_bytes);
|
This code specifies where to read (STDIN or any open file handle), how
many bytes to read ($num_bytes) and where to put the data (in
the $form_data scalar variable). The read function
returns the number of bytes that were successfully read.
Of course, unlike the query string or path info, reading with METHOD=POST
requires knowing the number of bytes of form data being supplied. This could be a
tedious job if you had to do it yourself, but you don't. The web
browser automatically counts the bytes of form data and sends the count
along with the request, and the web server supplies it in the script's
CONTENT_LENGTH environment variable. So you can set
|
$num_bytes = $ENV{'CONTENT_LENGTH'};
|
without having to calculate it yourself. You can see an example of
this in the Web Form Analyzer. Click on
that link (without supplying a query string or path info) and an
example form will appear which uses METHOD=POST. After you submit,
you'll see the CONTENT_LENGTH reported. There is another example
script below, where you can actually count the bytes of form data
yourself.
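Putting the two pieces together, a slightly more defensive reader might look like the sketch below. It assumes that a single read() call may return fewer bytes than requested, so it loops until CONTENT_LENGTH bytes have arrived or the input ends. The helper name read_post_data is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read up to $length bytes of POST data from the filehandle $fh.
# The fourth argument to read() is an offset, so each call appends to
# what has already been collected in $data.
# read_post_data is a hypothetical helper name.
sub read_post_data {
    my ($fh, $length) = @_;
    my $data = '';
    while (length($data) < $length) {
        my $got = read($fh, $data, $length - length($data), length($data));
        last unless $got;    # undef on error, 0 at end of input
    }
    return $data;
}

# Only touch STDIN when the server says the form used METHOD=POST.
if (defined $ENV{'REQUEST_METHOD'} && $ENV{'REQUEST_METHOD'} eq 'POST') {
    my $form_data = read_post_data(\*STDIN, $ENV{'CONTENT_LENGTH'} || 0);
    print "Content-type: text/plain\n\n";
    print "Read ", length($form_data), " bytes: $form_data\n";
}
```

Comparing length($form_data) with CONTENT_LENGTH afterwards is one way to notice a request that was cut short.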

Bundling and url-encoding

Having discussed the common methods of transmitting form data, we know
how a CGI script can acquire the data. But we must also discuss how
the script interprets the data, since browsers will bundle it up in a
special sort of encoding.
Why is it encoded? Form data often consists of a number of distinct
items, each with a name and a value. It's all packaged into a single
string or stream, so the script needs to unpack it into distinct items
and differentiate the names and values. For example, if you have some
text input tags in your web form:
|
<input type=text name="user_name" value = "">
<input type=text name="user_personality" value = "">
|
then blank text entry fields will appear on the form where the
user can type her name and personality. When she submits the form,
your script must be told that form data called user_name and
user_personality were submitted with values Sally
Smith and happy & care-free, for example.
Browsers will bundle this data into a single string using these rules
of url-encoding:
- all the submitted form data will be concatenated into a single
string of ampersand (&) separated name=value pairs, one
pair for each form tag. Like this:
|
form_tag_name_1=value_1&form_tag_name_2=value_2&...
|
- any spaces occurring in a name or value will be replaced by a plus
(+) sign. This is because urls cannot have spaces in them and under
METHOD=GET, the form data is supplied in the query string in the url.
- other punctuation characters (for example, equal signs,
ampersands and literal plus signs) occurring in names or values will be replaced with a
percent sign (%) followed by the two-digit hexadecimal equivalent of
the punctuation character in the ASCII character set. (Hexadecimal is
base 16.) Otherwise, it would be hard to distinguish these characters
inside a form variable from those between the form
variables in the first rule above.
So, this is the string that encodes the form data above that would be
passed to your script:
|
user_name=Sally+Smith&user_personality=happy+%26+care-free
|
%26 is the encoded form of the ampersand (&), whose ASCII code is 26 in hexadecimal.
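To see the rules at work from the other side, here is a sketch of an encoder that produces the string above. It is an illustration, not what any particular browser runs: the helper names url_encode and encode_form are hypothetical, the set of characters left unescaped (letters, digits, underscore, period and hyphen) is an assumption, since browsers differ slightly on this, and the data is assumed to be plain ASCII.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encode one name or value: escape everything except letters, digits
# and a few safe punctuation characters as %XX, then turn the spaces
# into plus signs. url_encode is a hypothetical helper name.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.\- ])/sprintf("%%%02X", ord($1))/ge;
    $text =~ s/ /+/g;
    return $text;
}

# Join name/value pairs into a single url-encoded string. The pairs are
# passed as a flat list so that their order is preserved.
# encode_form is a hypothetical helper name.
sub encode_form {
    my @fields = @_;
    my @pairs;
    while (@fields) {
        my ($name, $value) = splice(@fields, 0, 2);
        push @pairs, url_encode($name) . '=' . url_encode($value);
    }
    return join '&', @pairs;
}

# prints: user_name=Sally+Smith&user_personality=happy+%26+care-free
print encode_form(
    user_name        => 'Sally Smith',
    user_personality => 'happy & care-free',
), "\n";
```

The percent escapes are applied before the spaces become plus signs, so a plus sign typed by the user is sent as %2B and cannot be mistaken for an encoded space.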
Here's a very short form that will let you experiment with url-encoded
form data. It points to a script that's much like the Web Form Analyzer, but it doesn't decode
the url-encoded data. It merely reads in the raw encoded data as it
was sent from the web browser and reports on it verbatim. Try it with
spaces and various punctuation characters to see how they are encoded
into hex values.
|
|
This form uses METHOD=GET so you can see the query string also
reported in the url. You can point your own web form to the script
at: http://www.halcyon.com/sanford/cgi/readdata/read.cgi If
you use METHOD=POST in your form, you'll see CONTENT_LENGTH reported
as an environment variable which will allow you to verify the count of
bytes in the form data. You'll also see another environment variable
reported, called CONTENT_TYPE, with a value of
application/x-www-form-urlencoded, indicating that the form
data has been encoded.
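A script that only understands url-encoded data could check CONTENT_TYPE before trying to decode a POST. The sketch below assumes the value may carry parameters after a semicolon (such as a charset) and may differ in letter case, so it compares only the leading media type. The helper name is_urlencoded is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the content type names url-encoded form data, else 0.
# is_urlencoded is a hypothetical helper name.
sub is_urlencoded {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $content_type =~ m{^\s*application/x-www-form-urlencoded\s*(;|$)}i ? 1 : 0;
}

# Refuse POST data in any other format, such as multipart/form-data.
if (defined $ENV{'REQUEST_METHOD'} && $ENV{'REQUEST_METHOD'} eq 'POST'
    && !is_urlencoded($ENV{'CONTENT_TYPE'})) {
    print "Content-type: text/plain\n\n";
    print "Sorry, this script only understands url-encoded form data.\n";
    exit;
}
```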

Some Perl code

So how do you decode this? Essentially, just by following the rules of
url-encoding in reverse. In fact, let's build some Perl code which
will first read in some data and then decode it, very much like the ReadParse function does, which we use in
many of the scripts here.
First, we'll need to acquire the data, which can be supplied with
method GET in the query string, or method POST in STDIN. In either
case, let's put the data into a scalar, $in, so we can decode
it later.
|
if ( $ENV{'REQUEST_METHOD'} eq "GET" ) {
    $in = $ENV{'QUERY_STRING'};
} elsif ($ENV{'REQUEST_METHOD'} eq "POST") {
    read(STDIN, $in, $ENV{'CONTENT_LENGTH'});
}
|
At this point, we have acquired the data and need no longer be
concerned about the method. We do need to decode it and store it in
some convenient form.
Having loaded this url-encoded string into a variable called
$in, a convenient way to access this form data later in a
script would be an associative array, keyed on tag names. So our goal
is to process the string into an associative array or hash called
%in. The keys will be the tag names and the corresponding
values in the hash will be the form values as the user submitted
them. For example, $in{'user_name'} will be the string
Sally Smith.
So, let's next split the url-encoded string in the scalar $in into the
separate form data. We split on ampersand (&), since url-encoding
guarantees this character won't occur inside a name or value, and then
store the resulting list in an array called @in:
|
@in = split (/&/, $in);
|
Now, each element of @in is form data, in the format
name=value. Next, we'll process each element of this array,
first putting back all the spaces in place of the plus signs, and then
splitting the array element into its name and value components. We
need to do this before we undo the hexadecimal encoding since we don't
want to put back any equal signs that might occur in the name or value
before splitting.
|
foreach (@in) {
    s/\+/ /g;
    ($name, $value) = split(/=/,$_);
|
Then decode each component.
|
    $name =~ s/%([0-9A-Fa-f]{2})/pack("C",hex($1))/ge;
    $value =~ s/%([0-9A-Fa-f]{2})/pack("C",hex($1))/ge;
|
These two substitutions do the decoding from hexadecimal to ASCII
characters. They look for a percent sign followed by two hexadecimal
digits and remember the digits in the $1 variable. hex($1) does the
hexadecimal (base 16) to decimal (base 10) translation. Then the
pack function (with its "C" argument) translates this number
into its corresponding character in the ASCII character set. Finally,
the substitution replaces the percent sign and the following two
characters in the string with the resulting character.
The g at the end means to make this substitution as many
times as possible in the string (globally) not just once. The
e means to evaluate the replacement as an expression;
otherwise, it would just substitute the characters
pack(...).
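A worked example may make the decoding step concrete. This sketch decodes the %26 from the earlier example one step at a time and then with a single substitution. It uses pack's "C" (unsigned character) template, which gives the same result as "c" for ASCII codes, and a pattern that accepts only hexadecimal digits after the percent sign; both are choices made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Step by step: the two hex digits become a number, then a character.
my $code = hex("26");           # hexadecimal 26 is decimal 38
my $char = pack("C", $code);    # character number 38 is "&"
print "$code $char\n";          # prints: 38 &

# The same thing inside a global substitution with the e modifier.
my $value = "happy %26 care-free";
$value =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;
print "$value\n";               # prints: happy & care-free
```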
Finally, we load the name and value into the associative array
%in. In case the same form tag name has been supplied twice
(for example, multiple options were selected in a <select> tag),
we save both values, separated by the null character (\0), so they can
be recalled and separated later in your script.
|
    $in{$name} .= "\0" if (defined($in{$name}));
    $in{$name} .= $value;
}
|
That's all that's required. Now your script can get a list of the names
of all the submitted form data with keys %in and access the value of any
particular form variable as $in{$name}.
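For reference, the steps can be gathered into one subroutine, sketched below. This is an illustration, not the ReadParse function itself: the name parse_form is hypothetical, the data is passed in as a string rather than read from the environment, and two small assumptions are added (empty pairs are skipped, and a name with no equal sign gets an empty value).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode a url-encoded string into a hash of names and values.
# Repeated names are joined with "\0". parse_form is a hypothetical name.
sub parse_form {
    my ($in) = @_;
    my %in;
    foreach my $pair (split /&/, $in) {
        next if $pair eq '';                        # skip empty pairs
        $pair =~ s/\+/ /g;                          # plus signs back to spaces
        my ($name, $value) = split /=/, $pair, 2;   # split before hex decoding
        $value = '' unless defined $value;          # a bare name has no value
        $name  =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;
        $value =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;
        $in{$name} .= "\0" if defined $in{$name};
        $in{$name} .= $value;
    }
    return %in;
}

my %form = parse_form(
    'user_name=Sally+Smith&user_personality=happy+%26+care-free&pet=cat&pet=dog'
);
print "$form{'user_name'}\n";                       # prints: Sally Smith
print "$form{'user_personality'}\n";                # prints: happy & care-free
print join(', ', split /\0/, $form{'pet'}), "\n";   # prints: cat, dog
```

The last line shows how a repeated name, such as the pet field made up for this example, can be separated again by splitting on the null character.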

Further Reading

These days, most definitive documentation on the web seems to be
maintained at, or linked from, the pages of the World Wide Web Consortium. In
particular, you can find the formal specifications linked from their
page on CGI: Common Gateway
Interface. CGI is, of course, intimately connected with the web, so
for the most complete understanding, you should also look at the
specifications for HTTP.

|
Copyright 1995-97,
Sanford Morton
Last modified: Sun Mar 16 01:14:09 PDT
|
|