Introduction
Installation
Dual-nature API
Basic Tidy Usage
Configuring Tidy
• Converting Documents
• Reducing Bandwidth Usage
• Beautifying Documents
Tidy output buffering
Summary
About the Author
Intended Audience
This article is aimed at the web developer who would like to make use of HTML Tidy from within PHP scripts. No particular level of expertise is assumed.
Introduction
The Tidy extension is new in PHP 5, and is available from PHP version 5.0b3 upward. It is based on the TidyLib library, and allows the developer to validate, repair, and parse HTML, XHTML and XML documents from within PHP. This article will introduce some of the functionality of the extension, and explain how it can be used to check your web documents against their respective W3C standards.
Installation
If you are using PHP on a Windows system, all you need to do
to enable the extension is uncomment the line
extension=php_tidy.dll in your php.ini
file. The official win32 binary distribution has built-in Tidy support.
ext/tidy is provided
as part of the official PHP 5 source distribution. However, in order to compile
it, you must also have the TidyLib library and headers installed. The source
code for the library can be found through the HTML Tidy project homepage at
http://tidy.sourceforge.net/.
(When selecting the appropriate download, it’s useful to know that the TidyLib
project uses dates to control its versioning.)
Once the library has been built and installed, PHP 5 can be
compiled to provide built-in support for it by using the --with-tidy
configuration option:
[john@localhost]# ./configure --with-tidy=/path/to/libtidy
To verify that the extension has been built in, call the phpinfo() function and look for the
"Tidy" section. Alternatively, you can test that the Tidy extension is loaded in
the cli version of PHP by using the -m
option and looking for 'tidy' in the list:
[john@localhost]# php -m
[PHP Modules]
...
tidy
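The same check can also be made from inside a script. The sketch below relies on PHP's standard extension_loaded() function; the helper name tidy_available() is hypothetical and is introduced here only for illustration.

```php
<?php
/* Hypothetical helper: reports whether the Tidy extension is loaded. */
function tidy_available()
{
    return extension_loaded('tidy');
}

if (tidy_available()) {
    echo "Tidy support is enabled\n";
} else {
    echo "Tidy support is missing; check php.ini or rebuild with --with-tidy\n";
}
```

A guard like this can be useful in code that must also run on installations where the extension has not been enabled.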
Dual-nature API
In common with many of the new PHP 5 extensions, the Tidy extension supports interchangeable procedural and object-oriented APIs. Although the examples in this article use the procedural syntax:
$tidy = tidy_parse_file(...);
tidy_clean_repair($tidy);
echo tidy_get_output($tidy);
an object-oriented syntax could equally well be used:
$tidy = new tidy();
$tidy->parseFile(...);
$tidy->cleanRepair();
echo $tidy;
Furthermore, the two kinds of syntax can be mixed:
$tidy = tidy_parse_file(...);
$tidy->cleanRepair();
echo tidy_get_output($tidy);
As a general rule of thumb, the syntax difference between object-oriented and procedural APIs
comes down to the omission of a single parameter in function calls. When using Tidy in a
procedural way, a resource is required for every function call:
tidy_clean_repair($tidy);
whereas when the object-oriented syntax is used, this parameter is omitted (note the use of studlyCaps in method calls):
$tidy->cleanRepair();
It is recommended for the sake of consistency that one syntax be used throughout your scripts.
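To make the comparison concrete, here is a minimal sketch that performs the same parse, clean and output steps in each style. The function names repair_procedural() and repair_object_oriented() are hypothetical; the Tidy calls inside them are the ones shown above, applied to a string rather than a file.

```php
<?php
/* Procedural style: the Tidy handle is passed to every function. */
function repair_procedural($html)
{
    $tidy = tidy_parse_string($html);
    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}

/* Object-oriented style: the same steps as methods on a tidy object. */
function repair_object_oriented($html)
{
    $tidy = new tidy();
    $tidy->parseString($html);
    $tidy->cleanRepair();
    return (string) $tidy;
}
```

Both functions are expected to return the same repaired markup for the same input, so the choice between them is a matter of style.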
Basic Tidy Usage
Now that you have TidyLib built into your copy of PHP, let me
introduce you to the extension and its features.
The most fundamental feature of HTML Tidy is its ability to
parse, diagnose, clean and repair HTML and XHTML documents. To begin using the
Tidy extension, you must first load and parse a specified document. This task is
accomplished through the use of the
tidy_parse_file() function with the
following syntax:
tidy_parse_file($filename [, $options [, $encoding [, $use_include_path]]]);
where $filename is the file to parse (either
a local or remote file) and
$use_include_path is a boolean value
indicating whether the file should be found in PHP’s include path. When
reading documents from the file system that are of a specific character
encoding, the $encoding parameter can
be passed with the character set to use (for instance "utf8"). The
remaining parameter, $options, can
safely be ignored for now, as configuration settings and their use will be
discussed in detail later in this article.
When tidy_parse_file() is called, it will
attempt to load and parse the named document and return a resource representing
that document. During this parsing process, Tidy will attempt to determine the
format of the document (HTML, XHTML, etc) and will perform some basic repairs on
the resource to make it syntactically correct. Although the exact changes made
will vary from file to file, common mistakes will be corrected and tags will be
re-organized into the proper order.
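To see what Tidy found while parsing, the extension also provides tidy_get_error_buffer(), which is not otherwise covered in this article. The sketch below is an illustration only: the helper name parse_messages() is hypothetical, and it parses a string rather than a file to stay self-contained.

```php
<?php
/* Hypothetical helper: returns the warnings and errors Tidy recorded
   while parsing $html, one message per array element. */
function parse_messages($html)
{
    $tidy = tidy_parse_string($html);
    $buffer = tidy_get_error_buffer($tidy);

    /* The buffer is empty (or false) when Tidy had nothing to report. */
    if (!is_string($buffer) || trim($buffer) === '') {
        return array();
    }
    return explode("\n", trim($buffer));
}
```

Looping over the returned array and echoing each element gives a quick diagnostic report for a document before any repairs are relied upon.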
Once the document has been parsed it can be further
manipulated using the remainder of the functions available in the Tidy
extension. To retrieve the modified version of the document, the extension
provides the tidy_get_output()
function:
tidy_get_output($tidy);
where $tidy is the resource representing the
document.
Note that the resource returned from
tidy_parse_file() or equivalent can
also be treated as a string. Thus the resource can be passed directly to
functions that accept string parameters, or alternatively it can be cast to a
string:
echo tidy_get_output($tidy);
/* Alternative method */
echo $tidy;
/* Casting the resource to a string */
$data = (string)$tidy;
If you are dealing with data from documents that have
already been loaded from another source, such as a database or user input, use
the tidy_parse_string() function:
tidy_parse_string($data [, $options [,$encoding]]);
where $data is the variable containing the
data and $encoding is the character set
to use when reading it. As was the case earlier, the
$options parameter will be discussed
later in the article and can safely be ignored for now.
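As a minimal sketch of this pattern, the helper below repairs markup that is already held in a variable. The name repair_snippet() is hypothetical, and the default encoding of "utf8" is an assumption made for the example.

```php
<?php
/* Hypothetical helper: repairs markup that is already held in a variable,
   for example a row fetched from a database or a submitted form field. */
function repair_snippet($data, $encoding = 'utf8')
{
    /* No options are passed yet, so Tidy's default configuration applies. */
    $tidy = tidy_parse_string($data, array(), $encoding);
    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}
```

Calling repair_snippet('<p>Unclosed <em>paragraph') returns a string in which the open tags have been closed.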
Although the Tidy parsing functions modify the document data
to a certain extent, as stated earlier, these changes simply correct syntax
errors. In order to perform operations such as making the contents of a document
fully standard-compliant, Tidy provides the
tidy_clean_repair() function:
tidy_clean_repair($tidy);
where $tidy is a valid Tidy resource. When
executed, this function will apply the current configuration to the provided
document. As the Tidy extension has a default configuration, the script below
will parse an HTML snippet and automatically generate a complete document in
HTML 3.2 format:
<?php
$tidy = tidy_parse_string("<B>Hello</I> How are <U> you?</B>");
tidy_clean_repair($tidy);
echo $tidy;
?>
When executed, the following output is generated:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<b>Hello</b> How are <u>you?</u>
</body>
</html>
As you can see, not only has the Tidy extension repaired the original HTML snippet's tags, but it has also generated a complete document from it. This is the default behavior of the extension, and in order to change it we must discuss how to work with Tidy's configuration settings.
Configuring Tidy
Earlier on, when I introduced the
tidy_parse_file() and
tidy_parse_string() functions, I
intentionally ignored the $options
parameter in each. This parameter, as its name implies, is used to modify the
default configuration of Tidy for the document being parsed. These configuration
manipulations can be done in either of two ways.
- The configuration options can be loaded from a Tidy configuration file by passing the name of the configuration file as the $options parameter.
- Alternatively, Tidy configuration options can be set by passing an associative array of configuration option/value pairs.
<?php
/* Specify configuration options as an array */
$options = array("output-xhtml" => true, "clean" => true);
$tidy = tidy_parse_file("http://www.coggeshall.org/", $options);
tidy_clean_repair($tidy);
echo $tidy;
/* Specify a configuration file */
$tidy_two = tidy_parse_file("http://www.coggeshall.org/", "path/to/myconfig.tcfg");
tidy_clean_repair($tidy_two);
echo $tidy_two;
?>
Tidy configuration files are nothing more than simple text files that provide a series of option/value pairs. An example of a Tidy configuration file is shown below:
indent-spaces: 4
wrap: 72
indent: auto
tidy-mark: no
show-body-only: yes
force-output: yes
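Any of these settings can equally be passed as an array. As a small sketch, the helper below uses the show-body-only option from the file above to suppress the generated document wrapper described earlier; the name repair_fragment() is hypothetical.

```php
<?php
/* Hypothetical helper: repairs a snippet and returns only the body
   content, without the generated <html>, <head> and <body> wrapper. */
function repair_fragment($html)
{
    $options = array('show-body-only' => true);
    $tidy = tidy_parse_string($html, $options);
    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}
```

This form is convenient when the repaired markup will be embedded in a larger page that supplies its own document structure.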
Tidy configuration options provide most
of the functionality available in the extension. Because of the sheer quantity
of options available, I cannot provide a listing of them all here. Rather, I
will only be introducing a small and particularly useful subset. If you are
interested in exploring other configuration options and their meanings, you can
find them online at
http://www.w3.org/People/Raggett/tidy/.
To determine the value of a configuration option for a given document, use the
tidy_getopt() function:
<?php
$tidy = tidy_parse_file("http://www.coggeshall.org/");
$optval = tidy_getopt($tidy, "show-body-only");
echo "The value of 'show-body-only' is: $optval\n";
?>
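The same function can be used to confirm that options passed at parse time actually took effect. The following is a sketch under stated assumptions: effective_options() is a hypothetical helper, and it parses a string so that it does not depend on a remote site.

```php
<?php
/* Hypothetical helper: returns the effective value of each named option
   after $options has been applied to a parsed string. */
function effective_options($html, array $options, array $names)
{
    $tidy = tidy_parse_string($html, $options);
    $values = array();
    foreach ($names as $name) {
        $values[$name] = tidy_getopt($tidy, $name);
    }
    return $values;
}
```

For example, passing array('wrap' => 72) as $options and array('wrap') as $names returns an array keyed by option name, which makes it easy to spot a misspelled or ignored setting.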
Converting Documents
One of the most useful ways that Tidy configuration directives
can be applied is in the process of converting documents from one format, such
as HTML 4.01, to another, such as XHTML 1.0. In most situations, Tidy will
automatically identify the format and format version of the document being
processed when you parse it, using tidy_parse_file().
To convert it to another type, you can use one of the following configuration options:
| Option | Value | Effect |
| output-html | Boolean | Outputs the data in HTML format |
| output-xml | Boolean | Outputs the data as well-formed XML |
| output-xhtml | Boolean | Outputs the data in XHTML format |
Each of these options has a Boolean value, and should be set
to true or false, as appropriate, when parsing the
file using tidy_parse_file(). Note that
only one output directive at a time may be set for a given document; doing
otherwise can have unpredictable consequences.
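One way to guard against setting two output directives by accident is to build the options array in a single place. The helper below is a hypothetical sketch, not part of the Tidy extension; it enables exactly one of the three directives listed above and explicitly disables the others.

```php
<?php
/* Hypothetical helper: builds an options array that enables exactly one
   of the three output directives and explicitly disables the others. */
function output_format_options($format)
{
    $directives = array('html'  => 'output-html',
                        'xml'   => 'output-xml',
                        'xhtml' => 'output-xhtml');

    if (!isset($directives[$format])) {
        throw new InvalidArgumentException("Unknown output format: $format");
    }

    $options = array();
    foreach ($directives as $name => $directive) {
        $options[$directive] = ($name === $format);
    }
    return $options;
}
```

The result of output_format_options('xhtml') can be passed directly as the $options argument, or merged with other settings using array_merge().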
Beyond converting the format of a document, Tidy is also able
to convert deprecated <FONT> HTML
tags into their cascading style sheet (CSS) counterparts automatically through
the use of the clean option. The
generated output contains an inline style declaration.
<?php
/* Convert coggeshall.org to stylesheets and XHTML */
$opts = array("clean" => true, "output-xhtml" => true);
$tidy = tidy_parse_file("http://www.coggeshall.org/", $opts);
tidy_clean_repair($tidy);
echo $tidy;
?>
Reducing Bandwidth Usage
When Tidy generates output, it can be configured to significantly reduce the size of the file by removing all data that is not required by the browser for rendering. There are a number of useful options for doing this, as listed below:
| Option | Value | Effect |
| drop-proprietary-attributes | Boolean | Removes all attributes that are not part of a web standard |
| drop-font-tags | Boolean | Removes deprecated <FONT> tags |
| drop-empty-paras | Boolean | Removes <P> tags that contain no data |
| hide-comments | Boolean | Strips all comments |
| join-classes | Boolean | Combines CSS classes |
| join-styles | Boolean | Combines CSS styles |
| word-2000 | Boolean | Removes all proprietary data when an MS Word document has been saved as HTML |
In the example below, most of these options are applied to the php.net homepage:
<?php
$options = array("clean" => true,
"drop-proprietary-attributes" => true,
"drop-font-tags" => true,
"drop-empty-paras" => true,
"hide-comments" => true,
"join-classes" => true,
"join-styles" => true);
$tidy = tidy_parse_file("http://www.php.net/", $options);
tidy_clean_repair($tidy);
echo $tidy;
?>
Although the gains that can be made from this may seem minimal at first sight, their significance can be substantial. Even a reduction as small as 800 bytes per request saves 80 MB of bandwidth on a site serving 100,000 copies of that document. As an added benefit, source documents that are processed with Tidy before delivery can still contain important non-rendering information, such as development comments, without wasting bandwidth transmitting it to the end user.
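The arithmetic behind that figure, and a way to measure the saving for your own documents, can be sketched as follows. Both function names are hypothetical, and bytes_trimmed() simply compares string lengths before and after Tidy has run.

```php
<?php
/* Total bytes saved when each of $requests responses shrinks by
   $bytes_per_request bytes. */
function bandwidth_saved($bytes_per_request, $requests)
{
    return $bytes_per_request * $requests;
}

/* Bytes removed from $html by Tidy under $options. The result can be
   negative if Tidy adds more markup than it removes. */
function bytes_trimmed($html, array $options)
{
    $tidy = tidy_parse_string($html, $options);
    tidy_clean_repair($tidy);
    return strlen($html) - strlen(tidy_get_output($tidy));
}

/* The figures from the text: 800 bytes saved on each of 100,000 requests. */
$saved = bandwidth_saved(800, 100000);
echo $saved / 1000000, " MB\n";   /* prints "80 MB" */
```

Feeding the result of bytes_trimmed() into bandwidth_saved() gives a rough estimate for a given page and traffic level.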
Beautifying Documents
Just as Tidy can strip unneeded data from a document, it can also make a document easier for people to read by indenting it intelligently. This is accomplished using the three options shown below:
| Option | Value | Effect |
| indent | "true", "false" or "auto" | Toggles whether the output is indented |
| indent-spaces | integer | Sets the number of spaces to use for each level of indentation |
| wrap | integer | Sets the number of characters allowed before a line is soft-wrapped |
When used together as demonstrated below, these three options can quickly take a machine-generated document and make it easy for a developer to read.
<?php
$options = array("indent" => true, /* Turn on beautification */
"indent-spaces" => 4, /* Spaces per indenting level */
"wrap" => 72); /* Line length before wrapping */
$tidy = tidy_parse_file("http://www.php.net/", $options);
tidy_clean_repair($tidy);
echo $tidy;
?>
Tidy output buffering
When the Tidy extension is installed, it can be automatically
applied to every PHP document output by setting the
tidy.clean_output php.ini directive to
true. Tidy can also be specified as the
output handler when output buffering is enabled, by passing the
"ob_tidyhandler"
string anywhere the ob_start() function is used:
<?php
ob_start("ob_tidyhandler");
/* Do your outputting here */
?>
When the buffered output in the above snippet is flushed, Tidy will automatically
be called to process it before the end user
receives it. This can be particularly useful when used in conjunction with the
tidy.default_config configuration directive.
Although it is useful, be aware that the tidy.clean_output
directive should not be enabled in situations where the output generated by
PHP is not a markup document (for example, when outputting an image using the gd library).
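When the handler is installed from code rather than through php.ini, the same caution can be applied per request. The sketch below is one possible approach: start_tidy_buffer() is a hypothetical helper, and the list of markup content types is an assumption made for the example.

```php
<?php
/* Hypothetical helper: starts Tidy output buffering only when the
   response is a markup document. Returns true if buffering was started. */
function start_tidy_buffer($content_type)
{
    $markup_types = array('text/html', 'application/xhtml+xml');

    if (!in_array($content_type, $markup_types, true)) {
        return false;   /* e.g. image/png produced with the gd library */
    }
    return ob_start('ob_tidyhandler');
}
```

A script that emits an image would call start_tidy_buffer('image/png'), which returns false and leaves the output untouched.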
Summary
As you can see, the Tidy extension is a very powerful tool that can be taken advantage of in PHP 5, particularly during development. Although this article has only scratched the surface of the Tidy extension, the information given here should be enough to get you started using this new technology.
About the Author
John Coggeshall is a PHP consultant and author who started losing sleep over PHP around five years ago. Lately you'll find him losing sleep meeting deadlines for books or online columns on a wide range of PHP topics. You can find his work online at O'Reilly Networks’ onlamp.com, Zend Technologies, and at his website.
