
Microformats, PHP, and hKit


Last week, Drew McLellan released version 0.2 of his toolkit, hKit, for extracting and parsing microformats in PHP. Microformats are a hot topic on the web at the moment and something that all developers need to at least be aware of. So we caught up with Drew this week and spent some quality Skype time with him. We talked to him about microformats, his toolkit and PHP. Here’s what Drew had to say to us.

Drew, start us off by giving us a short primer on exactly what Microformats are.

Microformats are a way of embedding additional semantic information into a normal XHTML document. This relieves us of the need to extend XHTML with new tags; instead we can just use the class attribute. All the class names in a microformat are already defined, so you just tag your data to add richness to it.

Ok, all of that makes sense but why are microformats important?

I guess because we are moving more towards this idea called the ‘semantic web’ where we’ve got more of a web of data than anything else. Potentially every endpoint, every site, every machine, on the web could be a source of data for either a person or another machine. Microformats enable that in a very simple and straightforward way that people can start using right away. There is no need to wait for browsers to adopt standards and no roll-out time. People can start putting these in their documents now and it enables the concept of the ‘semantic web’ where people can just go to a web page and grab the data they are looking for from it.

So is there really any difference between a microformat and just a bunch of semantic class names?

Anybody can put in class names that describe the data in their documents. The key here is that a microformat defines the names you should use. Those names are defined through a process involving research and discussion that gets the whole community involved. This way it’s not just someone arbitrarily picking class names, because that would be completely open to debate. People could decide to change them and you would lose the interoperability. The process that you have to go through involves researching the need. You have to find out how this data is being used currently. You have to find out if there are people already marking up this data. You have to research if this data is currently being published on the web, and if it’s not, then why do we need a microformat? If it is, how are they doing it and what can we learn from that? The other part of the process is to look and see what established standards are already available in that space. Take hCard, for example, which is a pretty established format. It’s a 1-to-1 copy of the existing vCard standard. vCard was already out there in the real world and working, so hCard is just a translation of that to HTML rather than something completely new.

Since you brought up the hCard, can you tell us exactly what the hCard is in relation to a vCard.

OK, basically an hCard takes the field names from a vCard and applies them to an HTML document.

So I could take my name and address as they appear on my personal homepage, wrap the data in tags with the proper class names and my information would then be in a microformat ready for any hCard reader to consume?

Yes. You just mark the data up in the page itself. All the data remains visible. That’s a key point because there are specifications like RDF that embed the data in comments in the page. The danger with that approach is that because no one ever sees it, no one is bothered about it; it never gets updated and could become out of sync with the document. The idea of having metadata that is visible is quite important in making sure that it stays up to date and relevant.
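
Editor's note: to make the vCard-to-hCard mapping concrete, here is a minimal sketch in Python (used purely for illustration; hKit itself is PHP and this is not its code). It assumes a flat set of vCard-style fields and made-up sample data, and renders each field as visible text wrapped in a tag whose class is the lowercased field name. The helper name `to_hcard` is hypothetical.

```python
from html import escape

# Flat vCard-style fields for a made-up contact (sample data, not from the interview).
contact = {
    "FN": "Jane Example",
    "ORG": "Example Widgets",
    "TEL": "+1-555-0100",
}


def to_hcard(fields):
    """Render vCard-style fields as visible HTML, using lowercased field names as classes."""
    spans = [
        f'  <span class="{name.lower()}">{escape(value)}</span>'
        for name, value in fields.items()
    ]
    # "vcard" is the root class name that marks the enclosing element as an hCard.
    return '<div class="vcard">\n' + "\n".join(spans) + "\n</div>"


print(to_hcard(contact))
```

The data stays in the page as ordinary readable text; only the class names are added.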

Before we go on, give us a little background. Exactly how did you get involved with Microformats?

I’m involved in a project called Web Standards which is at webstandards.org. Part of what we do is campaign for web authors, browser makers and tool makers to fully support W3C standards. Microformats go along really closely with that, in that it’s about ensuring you have very valid and rich documents. I got involved with them through that.

So does the W3C officially support microformats?

No. One of the advantages of microformats right now is that it’s very much a community-driven effort. There’s a very active community right now working to ensure the quality of each microformat through a system of peer reviews. The way it’s currently structured is that anyone contributing to a microformat spec agrees to the licensing terms of both the W3C and the IETF. This way, down the road, if any microformat gained enough popularity that adoption by one of those standards bodies became advantageous, it would be possible. At the moment, however, it’s very much community driven.

All of that is fascinating. Now let’s talk about your toolkit for the moment. Would you describe for us exactly what your toolkit does?

In the current version it only deals with the hCard microformat. Basically, what it does is this: you feed it the URL of a web page, or the HTML itself. It parses the HTML, looks for instances of the hCard microformat and extracts them. It returns the found microformats as a PHP array structure. Really very simple.
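
Editor's note: the following is a rough sketch of that parse-and-extract idea, written in Python with the standard library's `html.parser` rather than PHP. It is not hKit's implementation or API; the class `HCardParser`, the function `extract_hcards`, the property subset and the sample page are all assumptions made for illustration. It also assumes reasonably well-formed HTML and simply concatenates text if a property appears more than once.

```python
from html.parser import HTMLParser

PROPS = {"fn", "org", "tel", "email", "url"}          # subset of hCard property names
VOID = {"br", "hr", "img", "input", "link", "meta"}   # tags that never get a closing tag


class HCardParser(HTMLParser):
    """Collect one dict per element whose class includes "vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []   # finished hCards
        self.roles = []   # one entry per open element: "vcard", a property name, or None
        self.card = None  # the hCard currently being filled in, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        role = None
        if self.card is None and "vcard" in classes:
            role, self.card = "vcard", {}
        elif self.card is not None:
            role = next((c for c in classes if c in PROPS), None)
        self.roles.append(role)

    def handle_data(self, data):
        if self.card is None:
            return
        # Credit the text to the innermost enclosing property element, if there is one.
        for role in reversed(self.roles):
            if role in PROPS:
                self.card[role] = self.card.get(role, "") + data
                break

    def handle_endtag(self, tag):
        if tag in VOID or not self.roles:
            return
        if self.roles.pop() == "vcard":
            self.cards.append({k: v.strip() for k, v in self.card.items()})
            self.card = None


def extract_hcards(html):
    """Return a list of dicts, one per hCard found in the HTML string."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


page = """
<p>Contact us:</p>
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Widgets</span>,
  <span class="tel">+1-555-0100</span>
</div>
"""

print(extract_hcards(page))
# [{'fn': 'Jane Example', 'org': 'Example Widgets', 'tel': '+1-555-0100'}]
```

The list of dictionaries plays the role of the array structure Drew mentions: once the data is in that shape, the calling code can do whatever it likes with it.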

Being a developer I’m always curious as to motive. Why did you build hKit?

I’m working on an application and I need to be able to consume hCards. I sat down to do that and realized that while there are some really good tools to convert hCards to vCards, if you just want to grab the hCards out of a page and do something with them (turn them into a SQL string and blast them into a database or some such thing), there weren’t any tools to help you do that. Then I realized that what started out as a couple of hours of hacking was going to take me significantly longer. So I decided to get the problem solved and make something that other people could use as well.

Ok, so it parses HTML looking for microformats. What formats does it support?

Currently it only supports the hCard format. The way it’s engineered, it’s basically a core engine with microformat definitions in a plugin style. To keep things simple, I’ve only defined the hCard format since it’s very common and the one I know best. In theory, tomorrow I could sit down and write one for hCalendar or hReview and it would just plug right in. I haven’t actually tried that out yet so it may not work out that way; but that’s the idea.
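
Editor's note: one way to picture "a core engine with microformat definitions in a plugin style" is a registry of definitions that the engine consults by name. The Python sketch below is only an illustration of that design idea, not how hKit is actually organised; `PROFILES`, `register_profile` and `recognised_classes` are hypothetical names, and the property lists are small subsets of the real hCard and hCalendar vocabularies.

```python
# Registry of microformat definitions, keyed by a short profile name.
PROFILES = {}


def register_profile(name, root, properties):
    """Add a definition: the root class name plus the property class names it allows."""
    PROFILES[name] = {"root": root, "properties": frozenset(properties)}


# Each definition is pure data, so supporting a new format means adding one call.
register_profile("hcard", root="vcard",
                 properties=["fn", "org", "adr", "tel", "email", "url"])
register_profile("hcalendar", root="vevent",
                 properties=["summary", "dtstart", "dtend", "location"])


def recognised_classes(profile_name, class_attr):
    """Return the class names from a class attribute that the chosen profile defines."""
    profile = PROFILES[profile_name]
    return [
        name for name in class_attr.split()
        if name == profile["root"] or name in profile["properties"]
    ]


print(recognised_classes("hcard", "contact fn org"))      # ['fn', 'org']
print(recognised_classes("hcalendar", "vevent featured"))  # ['vevent']
```

Keeping the definitions as data means the parsing code never needs to know which format it is looking for.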

Where do you see the toolkit headed?

I want to get it to a point where it can achieve my initial goal, which is that it can parse out any microformat that you give it a definition for. This sounds easy, but the way that microformats are designed, they can be embedded in each other. You might have an hReview of, say, a restaurant. In that review you will obviously list the name of the restaurant and maybe what street it’s on and in what town. So in that review you would be embedding an hCard. Obviously, from a parsing point of view, that makes things a lot more difficult because you aren’t dealing with just one thing at a time; you are dealing with several different microformats, all potentially nested within each other. Some of them are very small, others are quite big, and my library has to be able to pick them all up. Once I get to that point and that all works, I think my hair will probably be all gray and we’ll be on PHP9.
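
Editor's note: the nesting problem can be sketched with a small recursive walk. The Python example below assumes well-formed XHTML (so the standard library's `xml.etree.ElementTree` can read it) and uses tiny, illustrative subsets of the hReview and hCard vocabularies. The function names and the sample review are made up, and this is not how hKit approaches the problem.

```python
import xml.etree.ElementTree as ET

# Root class name -> property class names (small subsets, for illustration only).
FORMATS = {
    "hreview": {"summary", "rating", "item"},
    "vcard": {"fn", "adr"},
}

REVIEW = """
<div class="hreview">
  <span class="summary">Great pasta</span>
  <div class="item vcard">
    <span class="fn">Trattoria Example</span>,
    <span class="adr">1 Example Street, Exampleton</span>
  </div>
  Rating: <span class="rating">4</span>
</div>
"""


def text_of(el):
    """All text inside an element, with runs of whitespace collapsed."""
    return " ".join("".join(el.itertext()).split())


def collect(el, current, out):
    """Walk the tree, opening a new record whenever a known root class appears."""
    classes = (el.get("class") or "").split()
    root = next((c for c in classes if c in FORMATS), None)
    if root:
        record = {"type": root}
        if current is None:
            out.append(record)
        else:
            # Nested microformat: file it under the parent's property if it names one.
            slot = next((c for c in classes if c in FORMATS[current["type"]]), root)
            current[slot] = record
        current = record
    elif current is not None:
        for c in classes:
            if c in FORMATS[current["type"]]:
                current[c] = text_of(el)
    for child in el:
        collect(child, current, out)
    return out


reviews = collect(ET.fromstring(REVIEW), None, [])
print(reviews[0]["summary"])     # Great pasta
print(reviews[0]["item"]["fn"])  # Trattoria Example
```

The element carrying both `item` and `vcard` is where the two formats meet: it is a property of the review and the root of an hCard at the same time, which is exactly what makes a general-purpose parser harder to write.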

With all the language choices available and given your history of working in ASP, why did you select PHP to build your toolkit in?

I’ve been working with PHP as my main language in my day job for nearly 3 years. Before that I was mainly an ASP programmer. PHP seemed the natural choice for something that is going to be reusable. The point of entry is so low that my code will be useful to a lot of people.

What resources do you recommend to someone just learning about microformats?

The best place to go is microformats.org. You’ll find a description of all the current microformats. You’ll also find the wiki, which is linked from there, with links to podcasts, presentations, tutorials and all sorts of stuff that will get you started. It’s a great jumping-off point.

Finally, what license is your code released under?

I’ve licensed my toolkit under the LGPL in hopes that people will use it, extend it, fix things and contribute it back. So it’s all open source.

Thanks Drew for taking the time out of your busy schedule to talk with us today. We will keep an eye on your toolkit and expect great things from it.

=C=

Comments


Monday, July 3, 2006
THANKS AND IDEA
4:50PM PDT · Brett Zamir