Tidying up your HTML with PHP 5

March 16, 2004


Intended Audience
Introduction

Installation
Dual-nature API
Basic Tidy Usage
Configuring Tidy
•  Converting Documents
•  Reducing Bandwidth Usage
•  Beautifying Documents
Tidy output buffering
Summary

About the Author


Intended Audience

This article is aimed at the web developer who would like to
make use of HTML Tidy from within PHP scripts. No particular level of expertise
is assumed.


Introduction

The Tidy extension is new in PHP 5, and is available from PHP
version 5.0b3 upward. It is based on the TidyLib library, and allows the
developer to validate, repair, and parse HTML, XHTML and XML documents from
within PHP. This article will introduce some of the functionality of the
extension, and explain how it can be used to validate your web documents against
their respective W3C standards.


Installation

If you are using PHP on a Windows system, all you need to do
to enable the extension is uncomment the line
extension=php_tidy.dll in your php.ini
file. The official win32 binary distribution has built-in Tidy support.



ext/tidy is provided
as part of the official PHP 5 source distribution. However, in order to compile
it, you must also have the TidyLib library and headers installed. The source
code for the library can be found through the HTML Tidy project homepage at
http://tidy.sourceforge.net/.
(When selecting the appropriate download, it’s useful to know that the TidyLib
project uses dates to control its versioning.)


Once the library has been built and installed, PHP 5 can be
compiled to provide built-in support for it by using the --with-tidy
configuration option:

[john@localhost]# ./configure --with-tidy=/path/to/libtidy


Note that if no path is provided, the configuration script will automatically attempt
to find the required libraries in the default locations. Of course, after
properly configuring PHP with support for Tidy and any other extensions or SAPI
modules you desire, you can compile and install PHP in the usual way. To check
whether the extension has been added correctly, check the output of the
phpinfo() function and look for the
“Tidy” section. Alternatively, you can test that the Tidy extension is loaded in
the cli version of PHP by using the -m
option and looking for ‘tidy’ in the list:

[john@localhost]# php -m

[PHP Modules]



tidy
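The same check can also be made from inside a script. The following is a minimal sketch, assuming only the standard extension_loaded() function; the messages are illustrative:

```php
<?php
// Sketch: report at run time whether the Tidy extension is available.
$tidyLoaded = extension_loaded("tidy");

if ($tidyLoaded) {
    echo "Tidy is available\n";
} else {
    echo "Tidy is not loaded; check php.ini or the configure options\n";
}
```

A guard like this lets a script fall back gracefully on servers where the extension has not been enabled.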


Dual-nature API

In common with many of the new PHP 5 extensions, the Tidy
extension supports both a procedural and an object-oriented API, and the two
are interchangeable. The examples in this article use the procedural syntax:



$tidy = tidy_parse_file(...);
tidy_clean_repair($tidy);
echo tidy_get_output($tidy);


an object-oriented syntax could equally well be used:



$tidy = new tidy();
$tidy->parseFile(...);
$tidy->cleanRepair();
echo $tidy;


Furthermore, the two kinds of syntax can be mixed:



$tidy = tidy_parse_file(...);
$tidy->cleanRepair();
echo tidy_get_output($tidy);


As a general rule of thumb, the syntax difference between object-oriented and procedural APIs
comes down to the omission of a single parameter in function calls. When using Tidy in a
procedural way, a resource is required for every function call:



tidy_clean_repair($tidy);


whereas when the
object-oriented syntax is used, this parameter is omitted (note the use of
studlyCaps in method calls):



$tidy->cleanRepair();

It is recommended for the
sake of consistency that one syntax be used throughout your
scripts.


Basic Tidy Usage

Now that you have TidyLib built into your copy of PHP, let me
introduce you to the extension and its features.

The most fundamental feature of HTML Tidy is its ability to
parse, diagnose, clean and repair HTML and XHTML documents. To begin using the
Tidy extension, you must first load and parse a specified document. This task is
accomplished through the use of the
tidy_parse_file() function with the
following syntax:



tidy_parse_file($filename [, $options [, $encoding [, $use_include_path]]]);

where $filename is the file to parse (either
a local or remote file) and
$use_include_path is a boolean value
indicating whether the file should be found in PHP’s include path. When
reading documents from the file system that use a specific character
encoding, the $encoding parameter can
be passed with the character set to use (for instance “utf8”). The
remaining parameter, $options, can
safely be ignored for now, as configuration settings and their use will be
discussed in detail later in this article.


When tidy_parse_file() is called, it will
attempt to load and parse the named document and return a resource representing
that document. During this parsing process, Tidy will attempt to determine the
format of the document (HTML, XHTML, etc) and will perform some basic repairs on
the resource to make it syntactically correct. Although the exact changes made
will vary from file to file, common mistakes will be corrected and tags will be
re-organized into the proper order.
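To see what Tidy noticed during parsing, the messages it recorded can be inspected. The sketch below is illustrative only: it assumes the extension's tidy_get_error_buffer() function is available, and the temporary file and its contents are invented for the example.

```php
<?php
// Minimal sketch: show what Tidy reported while parsing a file.
// The temporary file and its contents are made up for this example.
$path = tempnam(sys_get_temp_dir(), "tidy");
file_put_contents($path, "<B>Hello</I> How are <U> you?</B>");

$report = null;
if (extension_loaded("tidy")) {
    $tidy = tidy_parse_file($path);
    if ($tidy !== false) {
        // The error buffer holds the warnings and errors Tidy recorded
        $report = tidy_get_error_buffer($tidy);
        echo $report, "\n";
    }
} else {
    echo "Tidy extension not loaded\n";
}

// Remove the temporary file again
unlink($path);
```

The exact wording of the messages depends on the TidyLib version, so they are best treated as diagnostics for a developer rather than something to parse.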



Once the document has been parsed it can be further
manipulated using the remainder of the functions available in the Tidy
extension. To retrieve the modified version of the document, the extension
provides the tidy_get_output()
function:



tidy_get_output($tidy);


where $tidy is the resource representing the
document.



Note that the resource returned from
tidy_parse_file() or equivalent can
also be treated as a string. Thus the resource can be passed directly to
functions that accept string parameters, or alternatively it can be cast to a
string:



echo tidy_get_output($tidy);

/* Alternative method */

echo $tidy;

/* Casting the resource to a string */

$data = (string)$tidy;


If you are dealing with data from documents that have
already been loaded from another source, such as a database or user input, use
the tidy_parse_string() function:



tidy_parse_string($data [, $options [,$encoding]]);


where $data is the variable containing the
data and $encoding is the character set
to use when reading it. As was the case earlier, the
$options parameter will be discussed
later in the article and can safely be ignored for now.

Although the Tidy parsing functions modify the document data
to a certain extent, as stated earlier, these changes simply correct syntax
errors. In order to perform operations such as making the contents of a document
fully standard-compliant, Tidy provides the
tidy_clean_repair() function:



tidy_clean_repair($tidy);

where $tidy is a valid Tidy resource. When
executed, this function will apply the current configuration to the provided
document. As the Tidy extension has a default configuration, the script below
will parse an HTML snippet and automatically generate a complete document in
HTML 3.2 format:


<?php
    $tidy = tidy_parse_string("<B>Hello</I> How are <U> you?</B>");
    tidy_clean_repair($tidy);
    echo $tidy;
?>


When executed, the following output is generated:



<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>

<head>

<title></title>

</head>

<body>

<b>Hello</b> How are <u>you?</u>

</body>

</html>

As you can see, not only has the Tidy extension repaired the
original HTML snippet’s tags, but it has also generated a complete document from
it. This is the default behavior of the extension, and in order to change it we
must discuss how to work with Tidy’s configuration
settings.


Configuring Tidy

Earlier on, when I introduced the
tidy_parse_file() and
tidy_parse_string() functions, I
intentionally ignored the $options
parameter in each. This parameter, as its name implies, is used to modify the
default configuration of Tidy for the document being parsed. These configuration
manipulations can be done in either of two ways.


  • The configuration options can be loaded from a Tidy configuration file by passing the name of the
    configuration file as the $options parameter.
  • Alternatively, Tidy configuration options can be set by passing an associative array of configuration
    option/value pairs.



<?php
    /* Specify configuration options as an array */
    $options = array("output-xhtml" => true, "clean" => true);
    $tidy = tidy_parse_file("http://www.coggeshall.org/", $options);
    tidy_clean_repair($tidy);
    echo $tidy;

    /* Specify a configuration file */
    $tidy_two = tidy_parse_file("http://www.coggeshall.org/", "path/to/myconfig.tcfg");
    tidy_clean_repair($tidy_two);
    echo $tidy_two;
?>


Tidy configuration files are nothing more than
simple text files that provide a series of option/value pairs. An example of a
Tidy configuration file is shown below:


indent-spaces: 4

wrap: 72

indent: auto

tidy-mark: no

show-body-only: yes

force-output: yes
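As a rough illustration, a single setting from a file like the one above can equally be passed as an array. The sketch below reuses the snippet from the earlier example and assumes that show-body-only suppresses the generated document wrapper; the guard and variable names are illustrative.

```php
<?php
// Sketch: the array equivalent of "show-body-only: yes" in a config file.
$options = array("show-body-only" => true);
$output = null;

if (extension_loaded("tidy")) {
    $tidy = tidy_parse_string("<B>Hello</I> How are <U> you?</B>", $options);
    tidy_clean_repair($tidy);
    // Fetch the repaired markup as a string
    $output = tidy_get_output($tidy);
    echo $output;
}
```

This is one way to repair fragments, such as user input, that will be embedded in a larger page.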

Tidy configuration options provide most
of the functionality available in the extension. Because of the sheer quantity
of options available, I cannot provide a listing of them all here. Rather, I
will only be introducing a small and particularly useful subset. If you are
interested in exploring other configuration options and their meanings, you can
find them online at
http://www.w3.org/People/Raggett/tidy/.
To determine the value of a configuration option for a given document, use the
tidy_getopt() function:



<?php
    $tidy = tidy_parse_file("http://www.coggeshall.org/");
    $optval = tidy_getopt($tidy, "show-body-only");
    echo "The value of 'show-body-only' is: $optval\n";
?>


Converting Documents

One of the most useful ways that Tidy configuration directives
can be applied is in the process of converting documents from one format, such
as HTML 4.01, to another, such as XHTML 1.0. In most situations, Tidy will
automatically identify the format and format version of the document being
processed when you parse it, using tidy_parse_file().
To convert it to another type, you can use one of the following configuration options:

Option          Value     Effect
output-html     Boolean   Outputs the data in HTML format
output-xml      Boolean   Outputs the data as well-formed XML
output-xhtml    Boolean   Outputs the data in XHTML format

Each of these options has a Boolean value, and should be set
to true or false, as appropriate, when parsing the
file using tidy_parse_file(). Note that
only one output directive at a time may be set for a given document; doing
otherwise can have unpredictable consequences.
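One way to guard against setting more than one output directive is to build the options array through a small helper. The function below is hypothetical (it is not part of the Tidy extension) and is only a sketch of the idea:

```php
<?php
/**
 * Hypothetical helper: return an options array that enables exactly one
 * of Tidy's output directives.
 */
function build_output_options($format)
{
    $directives = array(
        "html"  => "output-html",
        "xml"   => "output-xml",
        "xhtml" => "output-xhtml",
    );

    if (!isset($directives[$format])) {
        throw new InvalidArgumentException("Unknown output format: $format");
    }

    // Only the requested directive is set, so two can never conflict
    return array($directives[$format] => true);
}

$options = build_output_options("xhtml");
print_r($options);
```

The resulting array can be merged with other settings and passed as the $options parameter of tidy_parse_file() or tidy_parse_string().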


Beyond converting the format of a document, Tidy is also able
to convert deprecated <FONT> HTML
tags into their cascading style sheet (CSS) counterparts automatically through
the use of the clean option. The
generated output contains an inline style declaration.



<?php
    /* Convert coggeshall.org to stylesheets and XHTML */
    $opts = array("clean" => true, "output-xhtml" => true);
    $tidy = tidy_parse_file("http://www.coggeshall.org/", $opts);
    tidy_clean_repair($tidy);
    echo $tidy;
?>



Reducing Bandwidth Usage

When Tidy generates output, it can be configured to
significantly reduce the size of the file by removing all data that is not
required by the browser for rendering. There are a number of useful options for
doing this, as listed below:

Option                        Value     Effect
drop-proprietary-attributes   Boolean   Removes all attributes that are not part of a web standard
drop-font-tags                Boolean   Removes deprecated <FONT> tags
drop-empty-paras              Boolean   Removes <P> tags that contain no data
hide-comments                 Boolean   Strips all comments
join-classes                  Boolean   Combines CSS classes
join-styles                   Boolean   Combines CSS styles
word-2000                     Boolean   Removes all proprietary data when an MS Word document has been saved as HTML



In the example below, these options are used against the
php.net homepage:




<?php
    $options = array("clean" => true,
                     "drop-proprietary-attributes" => true,
                     "drop-font-tags" => true,
                     "drop-empty-paras" => true,
                     "hide-comments" => true,
                     "join-classes" => true,
                     "join-styles" => true);

    $tidy = tidy_parse_file("http://www.php.net/", $options);
    tidy_clean_repair($tidy);
    echo $tidy;
?>




Although the gains that can be made from this may
seem minimal at first sight, their significance can be substantial. Even a
reduction as small as 800 bytes per request saves 80 MB of bandwidth on a site
serving 100,000 copies of that document. As an added benefit, source documents
can still contain important non-rendering information, such as development
comments, because Tidy strips it from the output before it is transmitted to
the end user.
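The arithmetic behind that figure can be sketched as follows. The function name is made up for this illustration, and it assumes decimal megabytes (1 MB = 1,000,000 bytes):

```php
<?php
// Worked arithmetic: bytes saved per request multiplied by the number of requests.
function bandwidth_saved_mb($bytesPerRequest, $requests)
{
    // Assumes decimal megabytes (1 MB = 1,000,000 bytes)
    return ($bytesPerRequest * $requests) / 1000000;
}

echo bandwidth_saved_mb(800, 100000), " MB\n"; // prints "80 MB"
```

To estimate the saving for a real page, the per-request figure could be taken as the difference in strlen() between the original markup and the Tidy output.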


Beautifying Documents



Just as Tidy can strip a document down to what the browser needs,
it can also make documents easier for people to read,
through intelligent indentation of an existing document. This is accomplished
using the three options shown below:

Option          Value                       Effect
indent          “true”, “false” or “auto”   Toggles whether the output is indented
indent-spaces   integer                     Sets the number of spaces to use for each level of indentation
wrap            integer                     Sets the number of characters allowed before a line is soft-wrapped

When used together as demonstrated below, these three options
can quickly take a machine-generated document and make it easily read by a
developer.



<?php
    $options = array("indent" => true,       /* Turn on beautification */
                     "indent-spaces" => 4,   /* Spaces per indenting level */
                     "wrap" => 72);          /* Line length before wrapping */

    $tidy = tidy_parse_file("http://www.php.net/", $options);
    tidy_clean_repair($tidy);
    echo $tidy;
?>



Tidy output buffering

When the Tidy extension is installed, it can be automatically
applied to every PHP document output by setting the
tidy.clean_output php.ini directive to
true. Tidy can also be specified as the
output handler when output buffering is enabled, by passing the string
“ob_tidyhandler” anywhere the ob_start() function is used:



<?php
    ob_start("ob_tidyhandler");

    /* Do your outputting here */
?>


When the output buffered in the above
snippet is flushed, Tidy will automatically be called to process it before the end user
receives it. This can be particularly useful when used in conjunction with the
tidy.default_config configuration directive.



Although it is useful, be aware that the tidy.clean_output
directive should not be enabled in situations where the output generated by
PHP is not a markup document (for example, when outputting an image using the gd library).
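One way to avoid that problem without touching php.ini is to start the Tidy output buffer only for markup responses. The helper below is hypothetical, and its list of content types is an assumption made for this sketch:

```php
<?php
/**
 * Hypothetical helper: start Tidy output buffering only for markup output.
 * Returns true if the buffer was started, false otherwise.
 */
function start_tidy_buffer($contentType)
{
    // Assumed list of content types that Tidy should process
    $markupTypes = array("text/html", "application/xhtml+xml");

    if (!in_array($contentType, $markupTypes, true)) {
        return false; // e.g. an image produced with the gd library
    }
    if (!function_exists("ob_tidyhandler")) {
        return false; // Tidy extension not loaded
    }

    return ob_start("ob_tidyhandler");
}

// An image response is left untouched
var_dump(start_tidy_buffer("image/png")); // bool(false)
```

A script that produces HTML would call start_tidy_buffer("text/html") before sending any output.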


Summary

As you can see, the Tidy extension is a very powerful tool
that can be taken advantage of in PHP 5, particularly during development.
Although this article has really only scratched the surface of the Tidy
extension, the information given here should be enough to get you started using
this new technology.


About The Author

John Coggeshall is a PHP consultant and author who started
losing sleep over PHP around five years ago. Lately you’ll find him losing sleep
meeting deadlines for books or online columns on a wide range of PHP topics. You
can find his work online at O’Reilly Network’s onlamp.com,
Zend Technologies, and at his website.


4 Responses to “Tidying up your HTML with PHP 5”

  1. tyrondis Says:

    Nice article.
    In my projects I use Tidy to tell me what is wrong with my markup instead of having it corrected automatically.
    I wrote a little post in my blog about how I do it. Feel free to have a look – if you want to ;-)

    http://www.webdevblog.info/php/use-tidy-to-validate-the-html-markup-your-scripts-generate/

  2. patnaik Says:

    The htmLawed PHP script is an alternative to using HTMLTidy; does not require an external library or PHP extension.

    http://bioinformatics.org/phplabware/internal_utilities/htmLawed

  3. stosh1985 Says:

    The indent setting does not function as intended when set to ‘auto’. Instead, the tidy extension reads ‘auto’ as a false and does not indent. If you want tidy to indent using it’s ‘auto’ functionality, which is highly recommended as it is rather sensitive to browser quirks then set indent = 2 (numeric value, not inside quotes as a string), this will use the ‘auto’ indentation setting.

  4. jondblackburn Says:

    Hi I have PHP 4.3.9 installed on my server – do I have to upgrade to PHP 5 to use these functions, or can I just install the TidyLib library and still do much of what I want.

    My aim is to convert a mysql database containing html and php snippets into XHTML compliant pages.

    HTML Tidy seems like the best thing out there, but it does not seem to easily work on html contained in database tables.

    Jon :