An Old Idea

I’ve been giving some thought to parsing microformats lately. A few threads seem to be converging…

The first is that it’s hard to parse microformats. You can hand-write a parser in a little bit of time that’s 80% right, but encoding all of the rules for a format like hcard is tricky. It’s reasonable to assume, therefore, that there are a lot of 80% parsers out there, like the one I wrote for my Ray Ozzie Clipboard example.

The second issue relates to hatom, which uses different class names for the same concept at different scopes. For example, the entry title is called “entry-title” not “title”. I asked Ryan about this when I saw him at www2006, and he told me that they vacillated on this decision, but they settled on “entry-title” because people can nest other microformats inside hatom, and so it would be easier for the parser writers if there were no colliding class names, even in different microformats. In fact, he suggested that they’d probably made a mistake with hcard, since the class names were so likely to collide with other microformats. Ok, so in other words “entry-title” is a hack around the problem of it being hard to parse microformats, and we can expect more of these.

When I bumped into Brian at the same event, I commented that microformats really have a problem with nesting. He agreed. He said it put a burden on the parser writer to potentially have to understand all microformats in order to reliably parse web pages that contain them.


So, to summarize the problems:

  1. It’s a lot of trouble to write a parser
  2. Bad parsers will proliferate
  3. Microformats are evolving toward being easier to parse, not easier to create
  4. It’s not clear how you can nest microformats w/o knowing how parsers will behave
  5. Users are discouraged from inventing their own specialized microformats, presumably because of the risk of collisions and difficulty others will have in parsing them

My proposal is that we employ a very old solution to this problem: create proper, machine-readable schemas or grammars for each microformat.

The schema…

  1. is a formal specification of the microformat
  2. can be used to generate parsers (like yacc)
  3. can be used to dynamically parse new microformats
  4. is language-neutral

Here’s a fragment of a schema for hcard in a BNF-inspired syntax:

{vcard} ::= {fn} {n} [{org}] [{url}] [{email}] [{photo}] [{tel}]
{n} ::= {fn}
{tel} ::= ({tel-entry})
{tel-entry} ::= [{type}] {value}
{url} ::= a@href
{email} ::= a@href
{photo} ::= img@src | object@data
{fn} ::= body
{org} ::= body
{type} ::= body
{value} ::= body

Note that it has domain knowledge of HTML: “img@src” means pull the value out of the src attribute of an img tag, and “body” means pull the body of the tag. This syntax doesn’t encode all of the kinds of rules you’ll find in the hcard spec, but it probably could be extended to do so. (Note that a link could be added to the header of web pages pointing to the schema.)
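To make the idea concrete, here is a rough Python sketch of a parser that is driven by the schema text rather than by hand-written rules. It is only an illustration under several assumptions: the names `load_terminals`, `SchemaParser`, and `parse_properties` are made up for this example, only the terminal rules (“body” and “tag@attr”) are interpreted, and the structural rules (nesting, optional parts, lists) are ignored, so the output is a flat set of properties.

```python
import re
from html.parser import HTMLParser

# Part of the schema fragment above, kept as plain text.
GRAMMAR = """
{vcard} ::= {fn} {n} [{org}] [{url}] [{email}] [{photo}] [{tel}]
{n} ::= {fn}
{url} ::= a@href
{email} ::= a@href
{photo} ::= img@src | object@data
{fn} ::= body
{org} ::= body
"""

RULE = re.compile(r"\{([\w-]+)\}\s*::=\s*(.+)")
TERMINAL = re.compile(r"body|\w+@[\w-]+")
VOID_TAGS = {"img", "br", "hr", "input", "link", "meta"}  # never closed


def load_terminals(grammar):
    """Return {property: [alternatives]} for rules made only of terminals."""
    terminals = {}
    for line in grammar.splitlines():
        match = RULE.match(line.strip())
        if not match:
            continue
        name, rhs = match.groups()
        alternatives = [alt.strip() for alt in rhs.split("|")]
        if all(TERMINAL.fullmatch(alt) for alt in alternatives):
            terminals[name] = alternatives
    return terminals


class SchemaParser(HTMLParser):
    """Collects property values according to the terminal rules."""

    def __init__(self, terminals):
        super().__init__()
        self.terminals = terminals
        self.result = {}
        self.open = []  # stack of (tag, properties that want body text)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        body_props = []
        for prop in (attrs.get("class") or "").split():
            for rule in self.terminals.get(prop, []):
                if rule == "body":
                    body_props.append(prop)
                    self.result.setdefault(prop, "")
                else:
                    rule_tag, rule_attr = rule.split("@")
                    if tag == rule_tag and rule_attr in attrs:
                        self.result[prop] = attrs[rule_attr]
        if tag not in VOID_TAGS:
            self.open.append((tag, body_props))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.open) - 1, -1, -1):
            if self.open[i][0] == tag:
                del self.open[i:]
                break

    def handle_data(self, data):
        # Text belongs to every open element that asked for its body.
        for _, props in self.open:
            for prop in props:
                self.result[prop] += data


def parse_properties(html, grammar=GRAMMAR):
    parser = SchemaParser(load_terminals(grammar))
    parser.feed(html)
    parser.close()
    return parser.result
```

Supporting a new microformat in this sketch would mean swapping the grammar text, not rewriting the parser. A fuller version would also have to honor the non-terminal rules, which is where most of the real work lies.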

So in addition to making it trivial to generate or find correct parsers for microformats in any language or environment, how does this solve the nesting problem? First, the parser will only “find” data that matches the schema. So if you stick an hcard inside an hatom entry, the hatom parser won’t look for the “title” beneath the “author”, since that’s not in the schema. Second, if you wanted a rule such as using DOM depth to disambiguate two “title” properties, you could enforce it once at the parser-generator level, not in every parser in the world. Third, link tags could point to every schema used in the web page, so a parser could understand all of the microformats the page contains without any additional work.
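The first point can be sketched in a few lines of Python. Everything here is hypothetical: the nested `HENTRY` and `VCARD` dictionaries stand in for real schemas, the entry deliberately uses a plain “title” class (real hatom uses “entry-title”), and the markup is assumed to be well-formed XHTML so that the standard library’s ElementTree can read it. The search for a property stops at any element that matches a property in the current schema, so the entry-level search never reaches the “title” inside the author.

```python
import xml.etree.ElementTree as ET

# Hypothetical schemas: "body" means text content, a dict means a nested structure.
VCARD = {"fn": "body", "title": "body"}
HENTRY = {"title": "body", "author": VCARD}

SAMPLE = """
<div class="hentry">
  <div class="author vcard">
    <span class="fn">Steve</span>
    <span class="title">Editor</span>
  </div>
  <h1 class="title">An Old Idea</h1>
</div>
"""


def find_props(element, schema, found):
    """Map each schema property to the first matching descendant.

    The walk does not descend into an element once it matches a
    property, so nested structures keep their own properties.
    """
    for child in element:
        matched = [c for c in child.get("class", "").split() if c in schema]
        if matched:
            for prop in matched:
                found.setdefault(prop, child)
        else:
            find_props(child, schema, found)
    return found


def parse(element, schema):
    """Build a nested dict for element according to schema."""
    result = {}
    for prop, node in find_props(element, schema, {}).items():
        rule = schema[prop]
        if isinstance(rule, dict):
            result[prop] = parse(node, rule)
        else:
            result[prop] = "".join(node.itertext()).strip()
    return result


entry = parse(ET.fromstring(SAMPLE), HENTRY)
```

A naive search for the first element with class “title” would pick up “Editor” here, because the author block comes first in the document. The schema-scoped walk instead assigns “Editor” to the author and “An Old Idea” to the entry.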

The other thing that’s interesting is that this specification actually implies a JSON-compatible data model. The “( … )” notation refers to a list, the terminals refer to values, and each of the labels (e.g., “fn”) refers to a key in a name/value pair list. So we’d expect to parse

<a class="url fn" href="http://smackman.com">Steve</a>

into

{"vcard": {"fn": "Steve", "url": "http://smackman.com"}}

in JSON-syntax. (Don’t confuse JSON-syntax with JSON-data-model: the former is a serialization format, while the latter can be represented in (almost?) any programming language using built-in language constructs.)
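The distinction is easy to see in Python, used here purely as an illustration. The data model is an ordinary nested dictionary built from language constructs, and the syntax only appears when that dictionary is serialized with the standard `json` module. Note that strict JSON requires double-quoted keys.

```python
import json

# JSON-data-model: plain built-in dicts and strings, no serialization involved.
card = {"vcard": {"fn": "Steve", "url": "http://smackman.com"}}

# JSON-syntax: the same data written out as text.
text = json.dumps(card)

# Reading the text back yields an equal data structure.
round_tripped = json.loads(text)
```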

So the schema spec allows you to parse from HTML to a JSON-data-model. In contrast to yacc, there is no need for application-specific instructions in the spec, because the grammar itself determines the shape of the output. I’d also point out that going in the opposite direction, from JSON-data-model to HTML, is exactly what microtemplates buy you.
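For a sense of what the opposite direction looks like, here is a tiny Python sketch that turns the data model back into markup. It does not use microtemplates or their syntax. `render_hcard` is a hypothetical helper that hard-codes the one-element layout from the example above and escapes the values with the standard library.

```python
from html import escape


def render_hcard(card):
    """Render a dict with 'fn' and 'url' keys as a single hcard link."""
    return '<a class="url fn" href="{}">{}</a>'.format(
        escape(card["url"], quote=True),  # safe inside the attribute
        escape(card["fn"]),               # safe as element text
    )


html = render_hcard({"fn": "Steve", "url": "http://smackman.com"})
```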

That’s the gist of the idea… a lot more details to be worked out, of course.


3 Responses to “An Old Idea”

  1. June 1, 2006 at 1:32 am

    It’s not clear to me what problem this would solve. It doesn’t appear to solve the problem of collision, as knowing that vcard contains email doesn’t tell you that email can also be part of hreview.

    Can you give a specific example of a document that could be parsed automatically with a schema, and couldn’t be parsed automatically without?

  2. June 1, 2006 at 5:56 am


    The point is not whether the document can or cannot be parsed with a schema. The point is that the schema allows parsers to be generated automatically rather than hand-coded for each microformat and each programming language.

    As for collisions, it’s not that you couldn’t write a parser that knew that an email address inside a vcard was different from an email address inside an hreview, or that hreviews might contain vcards which, in turn, might contain email addresses that should be associated with the latter. The point is that if there is a clear grammar, then you don’t have to worry about whether a hand-written parser will misread a microformat when another, unexpected microformat is nested inside it.
