Archive for the 'News' Category

21
Oct
09

Gap

The last post on this blog (actually imported from smackman.com) was 3 years ago.  In that time I (a) stopped blogging (b) did a bunch of cool stuff at IBM Research before leaving this past May (c) learned to speak Spanish in Mexico and South America (d) built payyattention and the hourlypress with Christian Gromoll, Matt Gibson, and Lyn Headley.

I’ve been having an awesome time doing some technical work lately.  I’ve been processing twitter streams and distilling them into meaningful chunks.  I wanted to start writing about it, so I’m resuscitating my technical blog and rebranding it as PROJ 89.  The name comes from a huge (apparently) abandoned warehouse I walk by every day in the Dogpatch.  (I presume this is a long defunct name and I can appropriate it freely… if you know otherwise let me know!)

01
Jun
06

An Old Idea

I’ve been giving some thought to parsing microformats lately. A few threads seem to be converging…

The first is that it’s hard to parse microformats. You can hand-write a parser that’s 80% right in a little bit of time, but getting all of the rules encoded (for hcard, e.g.) is tricky. It’s reasonable to assume, therefore, that there are a lot of 80% parsers out there, like the one I wrote for my Ray Ozzie Clipboard example.

The second issue relates to hatom, which uses different class names for the same concept at different scopes. For example, the entry title is called “entry-title” not “title”. I asked Ryan about this when I saw him at www2006, and he told me that they vacillated on this decision, but they settled on “entry-title” because people can nest other microformats inside hatom, and so it would be easier for the parser writers if there were no colliding class names, even in different microformats. In fact, he suggested that they’d probably made a mistake with hcard, since the class names were so likely to collide with other microformats. Ok, so in other words “entry-title” is a hack around the problem of it being hard to parse microformats, and we can expect more of these.

When I bumped into Brian at the same event, I commented that microformats really have a problem with nesting. He agreed. He said it put a burden on the parser writer to potentially have to understand all microformats in order to reliably parse web pages that contain them.

So,

  1. It’s a lot of trouble to write a parser
  2. Bad parsers will proliferate
  3. Microformats are evolving toward being easier to parse, not easier to create
  4. It’s not clear how you can nest microformats w/o knowing how parsers will behave
  5. Users are discouraged from inventing their own specialized microformats, presumably because of the risk of collisions and difficulty others will have in parsing them

My proposal is that we employ a very old solution to this problem: create proper, machine-readable schemas or grammars for each microformat.

The schema…

  1. is a formal specification of the microformat
  2. can be used to generate parsers (like yacc)
  3. can be used to dynamically parse new microformats
  4. is language-neutral

Here’s a fragment of a schema for hcard in a BNF-inspired syntax:

{vcard} ::= {fn} {n} [{org}] [{url}] [{email}] [{photo}] [{tel}]
{n} ::= {fn}
{tel} ::= ({tel-entry})
{tel-entry} ::= [{type}] {value}
{url} ::= a@href
{email} ::= a@href
{photo} ::= img@src | object@data
{fn} ::= body
{org} ::= body
{type} ::= body
{value} ::= body

Note that it has domain-knowledge of HTML (e.g., “img@src”, which means pull the value out of the src attribute of an img tag, and “body” means pull the body of the tag). This syntax doesn’t encode all of the kinds of rules you’ll find in the hcard spec, but it probably could be extended to do so. (Note that a link could be added to the header of web pages pointing to the schema.)

So in addition to making it trivial to generate or find correct parsers for microformats in any language or environment, how does this solve the nesting problem? First, the parser will only “find” data that matches the schema. So if you stick an hcard inside an hatom entry, then the hatom parser wouldn’t be looking for the “title” beneath the “author”, since that’s not in the schema. Second, if you wanted a rule such as using DOM depth to disambiguate two “title” properties, you could enforce it at the parser-generator level, not at the level of every parser in the world. Third, it’s actually possible to use link tags to refer to every schema inside the web page, making it feasible that the parser would understand all of the microformats contained in the page without any additional work.

The other thing that’s interesting is that this specification actually implies a json-compatible data-model. The “( … )” notation refers to a list, the terminals refer to values, and each of the labels (e.g., “fn”) refers to a key in a name/value pair list. So we’d expect to parse,

<a class="url fn" href="http://smackman.com">Steve</a>

to

{vcard: {fn: "Steve", url: "http://smackman.com" }}

in JSON-syntax. (Don’t confuse JSON-syntax with JSON-data-model. The latter can be represented in (almost?) any programming language using built-in language constructs while the former is a serialization format).
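To make the HTML-to-data-model step concrete, here is a minimal JavaScript sketch of a schema-driven extractor. It is only an illustration under stated assumptions: the node shape ({tag, classes, attrs, text, children}) is a hypothetical stand-in for DOM elements, and the names HCARD_TERMINALS, extractValue, and parseProperties are invented for this example. It covers only some of the terminal rules from the fragment above, not the full hcard spec.

```javascript
// Terminal rules from the schema fragment: class name -> where the value lives.
// "body" means the element's text; "tag@attr" means an attribute on that tag.
const HCARD_TERMINALS = new Map([
  ["fn", ["body"]],
  ["org", ["body"]],
  ["url", ["a@href"]],
  ["email", ["a@href"]],
  ["photo", ["img@src", "object@data"]],
]);

// Try each alternative in order and return the first value that applies.
function extractValue(node, sources) {
  for (const source of sources) {
    if (source === "body") return node.text;
    const [tag, attr] = source.split("@");
    const attrs = node.attrs || {};
    if (node.tag === tag && attr in attrs) return attrs[attr];
  }
  return undefined;
}

// Walk a node tree (hypothetical shape, not the real DOM API) and collect
// only the properties the schema names. Unknown class names are ignored.
function parseProperties(node, terminals, out = {}) {
  for (const cls of node.classes || []) {
    if (terminals.has(cls) && out[cls] === undefined) {
      const value = extractValue(node, terminals.get(cls));
      if (value !== undefined) out[cls] = value;
    }
  }
  for (const child of node.children || []) {
    parseProperties(child, terminals, out);
  }
  return out;
}

// The example link from above, written as a plain object.
const steve = {
  tag: "a",
  classes: ["url", "fn"],
  attrs: { href: "http://smackman.com" },
  text: "Steve",
};
const card = { vcard: parseProperties(steve, HCARD_TERMINALS) };
console.log(JSON.stringify(card));
```

Because the walker only looks for class names listed in the schema, a class such as “title” on a surrounding element is skipped, which is the behavior the nesting argument relies on.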

So this means that the schema spec allows you to parse from HTML to a JSON-data-model. This means that, in contrast to yacc, there isn’t a need to have application-specific instructions in the spec. I’d also point out that the process of going in the opposite direction—from JSON-data-model to HTML—is exactly what microtemplates buy you.

That’s the gist of the idea… a lot more details to be worked out, of course.

20
May
06

Fried pizza, really?

I’m off to Edinburgh, Scotland today for WWW2006. I have a paper in the tagging workshop. Here’s my presentation (done with S5). I used microtemplates to generate a lot of the tag visualizations.

13
May
06

Promoting microtemplates

It was cool to see that Elias blogged about microtemplates and got straight to the point: it’s easy.

My goal now is to try to make this point to a few of the right people, have them get it and say something about it, and then others will pay attention. It’s, honestly, kind of a funny position to be in. I guess I do promote my ideas, at least inside the four walls of my workplace, but not usually so deliberately.

I’ve started with approaching the microformats folks, since it is, to some extent, derivative, and also the adoption of microtemplates greatly facilitates the adoption of microformats. I’ve gotten a few “very promising” remarks, but not the whoah! that I kinda expected. But I think I was being a little optimistic — it will take some time and some compelling examples for the potential to be apparent. Also, I don’t know if the microformat people are generally as concerned with creating dynamic or ajax web applications as others might be, so there’s a bit of a mismatch.

Ok, so what about the Rails folks? I started to dissect an example I found of rails programming at OnLamp and make some recommendations on the microtemplates wiki. I’ll find the discussion list and forward this to them…. but I probably still need to implement what I describe and show some examples. It would be particularly compelling if I did the same dynamic table as in this example.

The other item on my agenda is the ROCB. One idea is that if you drop a vcard on my web page, I want to be able to create the rendering of that vcard using microtemplates. I met Ray once… maybe I can drop him a note when this demo works?

10
May
06

hCalendar and timezones

I was thinking that hCalendar might help with timezones. The basic idea, just like ecmanaut says, is to send the time in GMT and let the browser do the conversion. So I’m thinking if we use the microformat for dates, hCalendar, then the date gets formatted as,

<abbr class="dtstart" title="2006-05-01T12:15:03.0Z">5:15am</abbr>

where the “title” attribute is machine readable and in GMT, and the body is human readable and in, presumably, the time zone of the page author. All that’s needed is a script (or greasemonkey plugin) like this one that walks the DOM, finds these hCalendar fragments, and replaces the human-readable part of the date with the time in the user’s timezone. So,

5:15am

gets displayed,

5:15am

Ok… but there are a couple of problems. The first is that the formatting has changed. The resolution to this would be to write a function that deduces the format from the example and then fills in that format with the time in the local timezone. Seems doable, at least in a way that works 80% of the time (and fails gracefully with a generic date format). The second is that the intent of the time has, in fact, been changed a little bit. The user needs to know that this has happened (the greenish background is a hint at that), and needs to be able to see the original string to compare. The user might also prefer to see the data formatted as the author intended, but to be able to hover over the date and see it transformed into his own timezone. This also seems doable.
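The conversion step itself can be sketched in a few lines of JavaScript. This is a rough illustration, not the script linked above: parseUtcTitle and formatLocalTime are names made up for this example, the output format is hard-coded to the “5:15am” style rather than deduced from the page, and the DOM walk is left out so that the snippet runs anywhere.

```javascript
// Parse the machine-readable UTC timestamp from an hCalendar title attribute.
// Returns milliseconds since the epoch, or null if the string is not recognized.
function parseUtcTitle(title) {
  const m = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2})(?::(\d{2})(?:\.\d+)?)?Z$/.exec(title);
  if (!m) return null;
  const [, y, mo, d, h, mi, s] = m;
  return Date.UTC(+y, +mo - 1, +d, +h, +mi, +(s || 0));
}

// Format a UTC instant as "h:mmam" or "h:mmpm" at a given offset from UTC.
// offsetMinutes is minutes east of UTC (for example -420 for UTC-7). By
// default it is taken from the timezone of the environment running the code.
function formatLocalTime(utcMs, offsetMinutes = -new Date(utcMs).getTimezoneOffset()) {
  const shifted = new Date(utcMs + offsetMinutes * 60000);
  const h24 = shifted.getUTCHours();
  const minutes = String(shifted.getUTCMinutes()).padStart(2, "0");
  const h12 = h24 % 12 === 0 ? 12 : h24 % 12;
  return `${h12}:${minutes}${h24 < 12 ? "am" : "pm"}`;
}

const utcMs = parseUtcTitle("2006-05-01T12:15:03.0Z");
console.log(formatLocalTime(utcMs, -7 * 60)); // "5:15am" at UTC-7
console.log(formatLocalTime(utcMs));          // whatever the viewer's zone gives
```

In a browser, the same two functions would be applied to each element with class “dtstart”, keeping the original body text around so the reader can compare.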

08
May
06

application/atom+json

JSON looks to be an extremely useful data format for Ajax (client-side web) applications because it is javascript, and so it can be parsed efficiently and loaded from any URL (not just the host that served the web page), opening up the door to a new class of applications that do client-side data integration in the browser. Yahoo has a great writeup of how to use JSON services.

Now, most of the services that are currently coming out are in the lingua franca, XML. XML is great, but it doesn’t have the specific advantages that JSON does for Ajax (ironically, since the “x” in Ajax is XML… but then Ajax sounds better than Ajaj). So what to do? There does exist a universal mapping from XML to JSON called Badgerfish. This is good, but the problem is that the JSON output is funky. Who wants to have variables named “$”?

Can we do a nicer mapping in the case of a particular XML schema, namely Atom? Atom doesn’t use attributes too much, and that’s the stumbling block when mapping from XML to JSON. What if we came up with a JSON representation of Atom that was as similar to the XML as possible, but was actually a different representation? Let’s call it application/atom+json.

Looking at the canonical Atom example, we can picture the atom+json starting something like this:

feed={
  title:"Example Feed",
  link:{href:"http://example.org"},
  updated:"2003-12-13T18:30:02Z",
  author:{name:"John Doe"},
  id:"urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6", ...

Ok, but with entry we have a bit of a problem, because there is typically more than one. So do we rename it to entries, and have it refer to a list? Seems reasonable…

  entries: [{
    title:"Atom-Powered Robots Run Amok",
    link:{href:"http://example.org/2003/12/13/atom03"},
    id:"urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a",
    updated:"2003-12-13T18:30:02Z",
    summary:"Some text."
   }, ...]
}

But wait, it can’t be that simple? Well, I guess there can be multiple authors, so do we make the authors always a list (frequently with only one element), or do we allow author as a shorthand for authors with only one entry? The other issue is that in this example, link had no body, and none of the tags with bodies had attributes. The reason badgerfish goes into “$” and “@attr” syntax is because it’s possible to have both. But why pollute the Atom mapping with awkward constructs that rarely occur? An alternate mapping might be that, if you wanted to put an attribute on the title tag, you’d write title_attr:value.

Hmmm….
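One way to picture the mapping discussed above is a small converter from a generic element tree to the atom+json shape. This is a sketch only, under assumptions made up for illustration: the input node shape ({name, attrs, text, children}), the function name toAtomJson, the LIST_KEYS table, and reading title_attr as “element name, underscore, attribute name”. It keeps author singular; adding author to LIST_KEYS would be the always-a-list choice.

```javascript
// Element names that are collected into a list under a plural key (assumption).
const LIST_KEYS = { entry: "entries" };

// Convert the children of an element into a plain object:
//  - elements with children become nested objects,
//  - elements with only text become strings (attributes go to name_attr keys),
//  - elements with only attributes become an object of those attributes.
function toAtomJson(element) {
  const out = {};
  for (const child of element.children || []) {
    const attrs = child.attrs || {};
    const kids = child.children || [];
    const text = (child.text || "").trim();
    let value;
    if (kids.length > 0) {
      value = toAtomJson(child);
    } else if (text !== "") {
      value = text;
      for (const [k, v] of Object.entries(attrs)) {
        out[`${child.name}_${k}`] = v;
      }
    } else {
      value = { ...attrs };
    }
    const plural = LIST_KEYS[child.name];
    if (plural) {
      (out[plural] = out[plural] || []).push(value);
    } else {
      out[child.name] = value;
    }
  }
  return out;
}

// A cut-down version of the Atom example, as a hypothetical parsed tree.
const feedTree = {
  name: "feed",
  children: [
    { name: "title", text: "Example Feed" },
    { name: "link", attrs: { href: "http://example.org" } },
    { name: "author", children: [{ name: "name", text: "John Doe" }] },
    {
      name: "entry",
      children: [
        { name: "title", attrs: { type: "text" }, text: "Atom-Powered Robots Run Amok" },
        { name: "summary", text: "Some text." },
      ],
    },
  ],
};
const feed = toAtomJson(feedTree);
console.log(JSON.stringify(feed, null, 2));
```

The sketch does not handle a list element that carries both text and attributes, since the name_attr key would land on the parent and could collide across items.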

01
May
06

Microtemplates

Microtemplates are a way of creating templates in HTML that can be evaluated in the browser. More info here: microtemplates.org.