Tidying up your HTML with PHP 5


Intended Audience
Introduction
Installation
Dual-nature API
Basic Tidy Usage
Configuring Tidy
•  Converting Documents
•  Reducing Bandwidth Usage
•  Beautifying Documents
Tidy output buffering
Summary
About the Author

Intended Audience

This article is aimed at the web developer who would like to make use of HTML Tidy from within PHP scripts. No particular level of expertise is assumed.

Introduction

The Tidy extension is new in PHP 5, and is available from PHP version 5.0b3 onward. It is based on the TidyLib library, and allows the developer to validate, repair, and parse HTML, XHTML and XML documents from within PHP. This article introduces some of the functionality of the extension, and explains how it can be used to validate your web documents against their respective W3C standards.

Installation

If you are using PHP on a Windows system, all you need to do to enable the extension is uncomment the line extension=php_tidy.dll in your php.ini file. The official win32 binary distribution has built-in Tidy support.

ext/tidy is provided as part of the official PHP 5 source distribution. However, in order to compile it, you must also have the TidyLib library and headers installed. The source code for the library can be found through the HTML Tidy project homepage at http://tidy.sourceforge.net/. (When selecting the appropriate download, it’s useful to know that the TidyLib project uses dates to control its versioning.)

Once the library has been built and installed, PHP 5 can be compiled to provide built-in support for it by using the --with-tidy configuration option:

[john@localhost]# ./configure --with-tidy=/path/to/libtidy

Note that if no path is provided, the configuration script will automatically attempt to find the required libraries in the default locations. After configuring PHP with support for Tidy and any other extensions or SAPI modules you desire, you can compile and install PHP in the usual way. To check whether the extension has been added correctly, look for the "Tidy" section in the output of the phpinfo() function. Alternatively, you can test that the Tidy extension is loaded in the CLI version of PHP by using the -m option and looking for 'tidy' in the list:

[john@localhost]# php -m
[PHP Modules]
...
tidy
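
The same check can also be made from inside a script. The following is a minimal sketch using PHP's extension_loaded() function; the variable name and the message wording are only illustrative.

```php
<?php
/* Report whether the Tidy extension is available to this script */
$has_tidy = extension_loaded("tidy");

if ($has_tidy) {
    echo "Tidy is loaded\n";
} else {
    echo "Tidy is not loaded; check php.ini or rebuild with --with-tidy\n";
}
```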

Dual-nature API

In common with many of the new PHP 5 extensions, the Tidy extension supports both a procedural and an object-oriented API, and the two are interchangeable. The examples in this article use the procedural syntax:

$tidy = tidy_parse_file(...);
tidy_clean_repair($tidy);
echo tidy_get_output($tidy);

an object-oriented syntax could equally well be used:

$tidy = new tidy();
$tidy->parseFile(...);
$tidy->cleanRepair();
echo $tidy;

Furthermore, the two kinds of syntax can be mixed:

$tidy = tidy_parse_file(...);
$tidy->cleanRepair();
echo tidy_get_output($tidy);

As a general rule of thumb, the syntax difference between object-oriented and procedural APIs comes down to the omission of a single parameter in function calls. When using Tidy in a procedural way, a resource is required for every function call:

tidy_clean_repair($tidy);

whereas when the object-oriented syntax is used, this parameter is omitted (note the use of studlyCaps in method calls):

$tidy->cleanRepair();

It is recommended for the sake of consistency that one syntax be used throughout your scripts.

Basic Tidy Usage

Now that you have TidyLib built into your copy of PHP, let me introduce you to the extension and its features.

The most fundamental feature of HTML Tidy is its ability to parse, diagnose, clean and repair HTML and XHTML documents. To begin using the Tidy extension, you must first load and parse a specified document. This task is accomplished through the use of the tidy_parse_file() function with the following syntax:

tidy_parse_file($filename [, $options [, $encoding [, $use_include_path]]]);

where $filename is the file to parse (either a local or remote file) and $use_include_path is a boolean value indicating whether the file should be searched for in PHP’s include path. When reading documents from the file system that are of a specific character encoding, the $encoding parameter can be passed with the character set to use (for instance "utf8"). The remaining parameter, $options, can safely be ignored for now, as configuration settings and their use will be discussed in detail later in this article.

When tidy_parse_file() is called, it will attempt to load and parse the named document and return a resource representing that document. During this parsing process, Tidy will attempt to determine the format of the document (HTML, XHTML, etc.) and will perform some basic repairs to make it syntactically correct. Although the exact changes made will vary from file to file, common mistakes will be corrected and tags will be re-organized into the proper order.

Once the document has been parsed it can be further manipulated using the remainder of the functions available in the Tidy extension. To retrieve the modified version of the document, the extension provides the tidy_get_output() function:

tidy_get_output($tidy);

where $tidy is the resource representing the document.

Note that the resource returned from tidy_parse_file() or equivalent can also be treated as a string. Thus the resource can be passed directly to functions that accept string parameters, or alternatively it can be cast to a string:

echo tidy_get_output($tidy);
/* Alternative method */
echo $tidy;
/* Casting the resource to a string */
$data = (string)$tidy;

If you are dealing with data from documents that have already been loaded from another source, such as a database or user input, use the tidy_parse_string() function:

tidy_parse_string($data [, $options [, $encoding]]);

where $data is the variable containing the data and $encoding is the character set to use when reading it. As was the case earlier, the $options parameter will be discussed later in the article and can safely be ignored for now.

Although the Tidy parsing functions modify the document data to a certain extent, as stated earlier, these changes simply correct syntax errors. In order to perform operations such as making the contents of a document fully standards-compliant, Tidy provides the tidy_clean_repair() function:

tidy_clean_repair($tidy);

where $tidy is a valid Tidy resource. When executed, this function will apply the current configuration to the provided document. As the Tidy extension has a default configuration, the script below will parse an HTML snippet and automatically generate a complete document in HTML 3.2 format:

<?php
    $tidy = tidy_parse_string("<B>Hello</I> How are <U> you?</B>");
    tidy_clean_repair($tidy);
    echo $tidy;
?>

When executed, the following output is generated:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<b>Hello</b> How are <u>you?</u>
</body>
</html>

As you can see, not only has the Tidy extension repaired the original HTML snippet's tags, but it has also generated a complete document from it. This is the default behavior of the extension, and in order to change it we must discuss how to work with Tidy's configuration settings.

Configuring Tidy

Earlier on, when I introduced the tidy_parse_file() and tidy_parse_string() functions, I intentionally ignored the $options parameter in each. This parameter, as its name implies, is used to modify the default configuration of Tidy for the document being parsed. The configuration can be changed in either of two ways:

  • The configuration options can be loaded from a Tidy configuration file by passing the name of the configuration file as the $options parameter.
  • Alternatively, Tidy configuration options can be set by passing an associative array of configuration option/value pairs.

<?php
    /* Specify configuration options as an array */
    $options = array("output-xhtml" => true, "clean" => true);
    $tidy = tidy_parse_file("http://www.coggeshall.org/", $options);
    tidy_clean_repair($tidy);
    echo $tidy;

    /* Specify a configuration file */
    $tidy_two = tidy_parse_file("http://www.coggeshall.org/", "path/to/myconfig.tcfg");
    tidy_clean_repair($tidy_two);
    echo $tidy_two;
?>

Tidy configuration files are nothing more than simple text files that provide a series of option/value pairs. An example of a Tidy configuration file is shown below:

indent-spaces: 4
wrap: 72
indent: auto
tidy-mark: no
show-body-only: yes
force-output: yes
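
As a rough sketch of how such a file might be put to use, the script below writes the example settings to a temporary file and passes the file name as $options to tidy_parse_string(). The temporary file location, the variable names and the sample HTML fragment are assumptions made purely for illustration, and the script checks that the extension is present before calling it.

```php
<?php
/* Sketch: save the example configuration to a temporary file and apply it */
$config = "indent-spaces: 4\n"
        . "wrap: 72\n"
        . "indent: auto\n"
        . "tidy-mark: no\n"
        . "show-body-only: yes\n"
        . "force-output: yes\n";

$output = "";

if (extension_loaded("tidy")) {
    /* Hypothetical location for the configuration file */
    $config_file = tempnam(sys_get_temp_dir(), "tidy");
    file_put_contents($config_file, $config);

    /* A string passed as $options is treated as a configuration file name */
    $tidy = tidy_parse_string("<p>Hello <b>world</p>", $config_file);
    tidy_clean_repair($tidy);
    $output = tidy_get_output($tidy);
    echo $output;

    unlink($config_file);
} else {
    echo "Tidy extension not available\n";
}
```

Because the file sets show-body-only, the output should contain only the repaired fragment rather than a complete generated document.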

Tidy configuration options provide most of the functionality available in the extension. Because of the sheer quantity of options available, I cannot provide a listing of them all here. Rather, I will only be introducing a small and particularly useful subset. If you are interested in exploring other configuration options and their meanings, you can find them online at http://www.w3.org/People/Raggett/tidy/. To determine the value of a configuration option for a given document, use the tidy_getopt() function:

<?php
    $tidy = tidy_parse_file("http://www.coggeshall.org/");
    $optval = tidy_getopt($tidy, "show-body-only");
    echo "The value of 'show-body-only' is: $optval\n";
?>

Converting Documents

One of the most useful ways that Tidy configuration directives can be applied is in the process of converting documents from one format, such as HTML 4.01, to another, such as XHTML 1.0. In most situations, Tidy will automatically identify the format and format version of the document being processed when you parse it, using tidy_parse_file(). To convert it to another type, you can use one of the following configuration options:

Option        Value    Effect
output-html   Boolean  Outputs the data in HTML format
output-xml    Boolean  Outputs the data as well-formed XML
output-xhtml  Boolean  Outputs the data in XHTML format

Each of these options has a Boolean value, and should be set to true or false, as appropriate, when parsing the file using tidy_parse_file(). Note that only one output directive at a time may be set for a given document; doing otherwise can have unpredictable consequences.
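
One way to guard against setting more than one output directive by accident is to count them before parsing. The helper below is a hypothetical sketch, not part of the Tidy extension; its name and the two sample arrays are invented for illustration.

```php
<?php
/* Hypothetical helper: counts how many output-format directives are enabled */
function count_output_directives($options)
{
    $directives = array("output-html", "output-xml", "output-xhtml");
    $enabled = 0;
    foreach ($directives as $name) {
        if (!empty($options[$name])) {
            $enabled++;
        }
    }
    return $enabled;
}

$good = array("output-xhtml" => true, "clean" => true);
$bad  = array("output-xhtml" => true, "output-xml" => true);

echo count_output_directives($good), "\n";   /* prints 1 */
echo count_output_directives($bad), "\n";    /* prints 2 */
```

A script could refuse to call tidy_parse_file() whenever the count is greater than one.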

Beyond converting the format of a document, Tidy is also able to convert deprecated <FONT> HTML tags into their cascading style sheet (CSS) counterparts automatically through the use of the clean option. The generated output contains an inline style declaration.

<?php
    /* Convert coggeshall.org to stylesheets and XHTML */
    $opts = array("clean" => true, "output-xhtml" => true);
    $tidy = tidy_parse_file("http://www.coggeshall.org/", $opts);
    tidy_clean_repair($tidy);
    echo $tidy;
?>

Reducing Bandwidth Usage

When Tidy generates output, it can be configured to significantly reduce the size of the file by removing all data that is not required by the browser for rendering. There are a number of useful options for doing this, as listed below:

Option                       Value    Effect
drop-proprietary-attributes  Boolean  Removes all attributes that are not part of a web standard
drop-font-tags               Boolean  Removes deprecated <FONT> tags
drop-empty-paras             Boolean  Removes <P> tags that contain no data
hide-comments                Boolean  Strips all comments
join-classes                 Boolean  Combines CSS classes
join-styles                  Boolean  Combines CSS styles
word-2000                    Boolean  Removes all proprietary data when an MS Word document has been saved as HTML

In the example below, all of these options except word-2000 are used, together with clean, against the php.net homepage:

<?php
    $options = array("clean" => true,
                     "drop-proprietary-attributes" => true,
                     "drop-font-tags" => true,
                     "drop-empty-paras" => true,
                     "hide-comments" => true,
                     "join-classes" => true,
                     "join-styles" => true);

    $tidy = tidy_parse_file("http://www.php.net/", $options);
    tidy_clean_repair($tidy);
    echo $tidy;
?>

Although the gains from this may seem minimal at first sight, they add up quickly. Even a reduction as small as 800 bytes per request saves 80 MB of bandwidth on a site serving 100,000 copies of that document. As an added benefit, your source documents can still contain important non-rendering information, such as development comments, because Tidy strips it out before anything is transmitted to the end user.
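
The arithmetic behind that figure can be sketched in a few lines. The helper name is hypothetical, and the calculation assumes decimal megabytes (1 MB = 1,000,000 bytes).

```php
<?php
/* Hypothetical helper: total bytes saved, expressed in megabytes
   (assumes 1 MB = 1,000,000 bytes) */
function bandwidth_saved_mb($bytes_per_request, $requests)
{
    return ($bytes_per_request * $requests) / 1000000;
}

/* 800 bytes saved on each of 100,000 requests */
echo bandwidth_saved_mb(800, 100000), " MB\n";   /* prints "80 MB" */
```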

Beautifying Documents

Just as Tidy can strip everything non-essential from a document to reduce its size, it can also make documents easier for people to read, through intelligent indentation of an existing document. This is accomplished using the three options shown below:

Option         Value                      Effect
indent         "true", "false" or "auto"  Toggles whether the output is indented
indent-spaces  integer                    Sets the number of spaces to use for each level of indentation
wrap           integer                    Sets the number of characters allowed before a line is soft-wrapped

When used together as demonstrated below, these three options can quickly take a machine-generated document and make it easy for a developer to read.

<?php
    $options = array("indent" => true,       /* Turn on beautification */
                     "indent-spaces" => 4,   /* Spaces per indenting level */
                     "wrap" => 72);          /* Line length before wrapping */
    $tidy = tidy_parse_file("http://www.php.net/", $options);
    tidy_clean_repair($tidy);
    echo $tidy;
?>

Tidy output buffering

When the Tidy extension is installed, it can be automatically applied to every PHP document output by setting the tidy.clean_output php.ini directive to true. Tidy can also be specified as the output handler when output buffering is enabled, by passing the "ob_tidyhandler" string anywhere the ob_start() function is used:

<?php
    ob_start("ob_tidyhandler");
    /* Do your outputting here */
?>

When the buffered output in the above snippet is flushed, Tidy will automatically be called to process it before the end user receives it. This can be particularly useful when used in conjunction with the tidy.default_config configuration directive.

Although it is useful, be aware that the tidy.clean_output directive should not be enabled in situations where the output generated by PHP is not a markup document (for example, when outputting an image using the gd library).
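
The same caution applies when ob_tidyhandler is installed by hand with ob_start(). One possible guard, sketched below, starts the Tidy buffer only when the response is HTML. The helper name and the idea of passing the content type in as a string are assumptions for illustration, not part of the extension.

```php
<?php
/* Hypothetical helper: start Tidy output buffering only for HTML responses */
function start_tidy_buffer($content_type)
{
    $is_html = (stripos($content_type, "text/html") === 0);
    if ($is_html && extension_loaded("tidy")) {
        return ob_start("ob_tidyhandler");
    }
    return false;   /* leave images and other non-markup output untouched */
}

/* An image response is never routed through Tidy */
var_dump(start_tidy_buffer("image/png"));   /* bool(false) */
```

A page that emits markup would call start_tidy_buffer("text/html") before producing any output.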

Summary

As you can see, the Tidy extension is a very powerful tool that can be taken advantage of in PHP 5, particularly during development. Although this article has only scratched the surface of the Tidy extension, the information given here should be enough to get you started using this new technology.

About the Author

John Coggeshall is a PHP consultant and author who started losing sleep over PHP around five years ago. Lately you'll find him losing sleep meeting deadlines for books or online columns on a wide range of PHP topics. You can find his work online at O'Reilly Networks’ onlamp.com, Zend Technologies, and at his website.
