Twitter API encoding is somewhat bonkers

Take a look at the Encoding section in the Twitter API docs:

The Twitter API supports UTF-8 encoding. Please note that angle brackets (“<” and “>”) are entity-encoded to prevent Cross-Site Scripting attacks for web-embedded consumers of JSON API output. The resulting encoded entities do count towards the 140 character limit.

Does anyone notice the weirdness there? Apart from the MAGIC_QUOTES smell.

If I were feeling pathological, I could tweet a message of 140 characters, all drawn from the Unicode code-point range U+010000-U+10FFFF. Each of those takes four bytes in UTF-8, so I think that would end up as 560 bytes. And I think that would be all fine with Twitter. Which is another way of saying that Twitter would, I assume, be happy to exceed 140 bytes for a message if it were written in, say, Japanese.
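A quick sketch of that arithmetic, assuming (as I do above) that the limit is counted in code points rather than bytes. The helper name utf8_size and the choice of sample characters are mine, purely for illustration; nothing here talks to Twitter.

```python
# Assumption: the 140 limit counts Unicode code points, not bytes.
LIMIT = 140


def utf8_size(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# U+10000 is the first code point above the Basic Multilingual Plane;
# everything from there up to U+10FFFF needs four bytes in UTF-8.
astral = "\U00010000" * LIMIT

# Hiragana "a" (U+3042), a typical three-byte character.
japanese = "\u3042" * LIMIT

print(len(astral), utf8_size(astral))      # 140 560
print(len(japanese), utf8_size(japanese))  # 140 420
```

So a 140-character message can legitimately be anywhere from 140 to 560 bytes, depending on what is in it.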

By contrast, while on my pathological holiday from good sense, I would only be able to tweet a message of 35 angle brackets. Each one is encoded as a four-character entity (“&lt;” or “&gt;”), so 35 of them come to 140 characters, 140 bytes in UTF-8, because the encoded angle brackets count toward the number of characters. Seems a bit backwards, doesn’t it?
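Here is the same sum for the angle brackets. This is only my reading of the docs: encode_angle_brackets is a hypothetical stand-in for whatever Twitter actually does, and I am assuming nothing other than “<” and “>” gets rewritten.

```python
# Assumption: only "<" and ">" are entity-encoded, and the limit is
# applied to the encoded text, as the quoted docs suggest.
LIMIT = 140


def encode_angle_brackets(text: str) -> str:
    """Hypothetical model of the entity-encoding described in the docs."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def max_angle_brackets(limit: int = LIMIT) -> int:
    """How many brackets fit if each one costs len('&lt;') characters."""
    return limit // len("&lt;")


tweet = "<" * max_angle_brackets()
encoded = encode_angle_brackets(tweet)

print(len(tweet))                    # 35
print(len(encoded))                  # 140
print(len(encoded.encode("utf-8")))  # 140
```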

Does anyone know the reasoning here? Or are the docs at fault?

Back to the angle-bracket quoting. Just as the PHP folk are finally ending their own embarrassing journey through that silliness, it looks to me like Twitter are now making a similar mistake. JSON should safely encapsulate angle-brackets, so perhaps I don’t understand the problem that they are trying to solve?
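To show what I mean by JSON safely encapsulating angle brackets, here is a small sketch using Python's standard json module. The sample tweet is made up. The second half shows one alternative a producer could use if it really wanted no literal brackets on the wire: the JSON \u escape. That is my suggestion, not something the Twitter docs mention.

```python
import json

# A made-up payload containing raw angle brackets.
tweet = {"text": "<script>alert(1)</script>"}

# Plain JSON: the brackets travel inside a string literal and come
# back unchanged when the consumer parses it.
wire = json.dumps(tweet)
assert json.loads(wire) == tweet

# Alternative: escape the brackets at the JSON level. "<" and ">" can
# only occur inside string values in JSON text, so a blanket replace
# on the serialized form is safe here.
escaped_wire = wire.replace("<", "\\u003c").replace(">", "\\u003e")
assert "<" not in escaped_wire and ">" not in escaped_wire

# Any conforming parser decodes the escapes back to the original text,
# so the data itself is never altered.
assert json.loads(escaped_wire) == tweet
print(escaped_wire)
```

Either way the consumer gets back exactly the string that was tweeted, and it is then up to the consumer to escape it for whatever context it is embedded in.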

One more question: what if I tweet “&gt;”? When using the API, can that be distinguished from a “>”?
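The worry, sketched out. Again encode_angle_brackets is a hypothetical model of the behaviour the docs describe, under the assumption that ampersands are left alone. For comparison, the second half uses html.escape from the standard library, which does escape the ampersand as well.

```python
import html


def encode_angle_brackets(text: str) -> str:
    """Hypothetical model: only "<" and ">" are rewritten, "&" is not."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


literal_entity = "&gt;"  # the user really typed ampersand, g, t, semicolon
plain_bracket = ">"      # the user typed a single angle bracket

# Under the assumed scheme both tweets produce identical output, so a
# consumer cannot tell which one was originally written.
print(encode_angle_brackets(literal_entity))  # &gt;
print(encode_angle_brackets(plain_bracket))   # &gt;

# If the ampersand is escaped too, the two stay distinguishable.
print(html.escape(literal_entity, quote=False))  # &amp;gt;
print(html.escape(plain_bracket, quote=False))   # &gt;
```

If the API really does skip the ampersand, the encoding is lossy and there is no way to reverse it reliably.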

(You might have noticed that I’ve so far been too lazy to experiment with all this stuff; I just wanted to write it down before I forgot. I’ll add a comment if I get the time to play.)