HTML::Tidy - some observations

This is not a complaint, merely some observations from my exciting evening of playing with HTML::Tidy last night. I'll be sending Andy Lester a slightly more sane write-up when I have a moment:

1. Passing it a config file

The documentation doesn't make this clear, but the config file you pass is one for HTML Tidy (libtidy, as opposed to HTML::Tidy). Have a look at the HTML Tidy quick reference documentation on Sourceforge to see what these files look like and which options you can put in them.

2. Quiet in the config file

Setting quiet = yes in the config file cuts down a lot on the number of errors thrown back from libtidy. This is handy if you are trying to use the clean method to format your messy HTML into something clean and, erm, tidy.
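As a rough sketch, a config file along these lines could be as small as the fragment below. The file name tidy.cfg is only an assumption for illustration; the option itself is from the HTML Tidy quick reference, written in the name: value form the quick reference uses.

```
# tidy.cfg (hypothetical file name)
# Suppress libtidy's non-essential output
quiet: yes
```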

3. use warnings in Tidy.pm

Because Tidy.pm has use warnings switched on, warnings get thrown about if you pass an undefined (or no) argument to the clean method. I hacked use warnings out of my copy of Tidy.pm to test this, and that worked fine.

4. The error parsing is out of date

Lots of the error messages returned by libtidy have changed syntax slightly since the latest release of HTML::Tidy. This means messages that HTML::Tidy intends to ignore are being warned out unnecessarily.

5. clean method returns cleaned up X/HTML

This is not in the documentation, but if clean works it returns your X/HTML nicely tidied up. This is the most useful feature of HTML::Tidy!

6. HTML Tidy (libtidy) has limited DTD support

It appears to support only HTML 4.01 and XHTML 1.0. You can tell HTML::Tidy which one you want clean to produce in the config file you supply.
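For example, assuming you want XHTML 1.0 with a strict DOCTYPE, the relevant lines of the config file might look something like this. Both option names are from the HTML Tidy quick reference; the particular values are just one possible choice.

```
# Ask libtidy for XHTML output with a strict DOCTYPE
output-xhtml: yes
doctype: strict
```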

7. clear_messages is good

Save memory when processing lots of X/HTML by making sure you call clear_messages after each document. Messages in HTML::Tidy are relatively large objects that pile up quite quickly.

I processed 14,000 HTML files of varying legibility with only one or two warnings thrown out when I used the config_file with quiet = yes and hacked out use warnings in Tidy.pm. I do not recommend hacking Tidy.pm. Instead, make sure that when you call clean you pass it a defined variable: do the checking in your script rather than messing with Tidy.pm. Hopefully, when I do give Andy something legible and useful to work from, HTML::Tidy will be updated properly.
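Putting those points together, a batch script might be sketched roughly as follows. This is not the script I actually ran, just a minimal illustration: tidy_document and main are hypothetical helper names, and tidy.cfg is an assumed config file name. The new (with config_file), clean and clear_messages calls are the HTML::Tidy methods discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy one document with an HTML::Tidy-like object.
# Checks the input is defined BEFORE calling clean, so there is no need
# to touch Tidy.pm, and clears the message queue after every document.
sub tidy_document {
    my ($tidy, $html) = @_;
    return unless defined $html && length $html;

    my $clean = $tidy->clean($html);   # returns the tidied X/HTML
    $tidy->clear_messages;             # stop message objects piling up
    return $clean;
}

# Hypothetical driver: tidy every file named on the command line.
sub main {
    my @files = @_;
    return unless @files;

    require HTML::Tidy;
    # 'tidy.cfg' is an assumed file name; it is a libtidy config file.
    my $tidy = HTML::Tidy->new({ config_file => 'tidy.cfg' });

    for my $file (@files) {
        open my $fh, '<', $file or do { warn "Cannot read $file: $!\n"; next };
        my $html = do { local $/; <$fh> };   # slurp the whole file
        close $fh;

        my $clean = tidy_document($tidy, $html);
        if (defined $clean) {
            print $clean;
        }
        else {
            warn "Skipped $file: nothing to clean\n";
        }
    }
    return;
}

main(@ARGV) unless caller;
```

Run it as something like perl tidy_batch.pl page1.html page2.html (the script name is made up too). Keeping the defined check in tidy_document means the warnings from Tidy.pm never get triggered in the first place.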

About this Entry

This page contains a single entry by Robbie Bow published on October 19, 2006 9:47 AM.