Executive summary: tagging your digital resources to indicate what language they are written in allows:

  1. Proper rendering,
  2. Correct behaviour of some software,
  3. Choice of the right tools,
  4. Correct filtering.

What is tagging?

Tagging is the process of attaching language tags to a digital resource. For instance, in legacy HTML, it is done with the lang attribute, for example:

    <p lang="fr">Bonjour tout le monde</p>

and in XML with the xml:lang special attribute, for example:

    <para xml:lang="fr">Bonjour tout le monde</para>
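
To illustrate how software might read such tags, here is a minimal Python sketch using the standard library's ElementTree. The sample document and the function name effective_languages are made up for the illustration. The sketch relies on the XML rule that xml:lang applies to an element and to its descendants until another xml:lang overrides it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (tag, language) for an element and all its descendants.

    An element without its own xml:lang inherits the language of the
    nearest ancestor that declares one.
    """
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


# Hypothetical sample document, for illustration only.
SAMPLE = """<article xml:lang="en">
  <title>Language tagging</title>
  <quote xml:lang="fr">Bonjour tout le monde</quote>
</article>"""

root = ET.fromstring(SAMPLE)
result = list(effective_languages(root))
for tag, lang in result:
    print(tag, lang)
```

Here the title has no tag of its own and so is reported as English, while the quote is reported as French.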

What is tagging for?

The purpose of tagging is to give unambiguous information to the software that will handle the resource. For instance, rendering the content properly on screen requires knowing the language it is written in. Typographic rules differ from one language to another, so language-independent rendering can only be an approximation. In the same way, speech synthesis needs to know the language used.

Some programs may need the language to know what to do with requests like CSS's "first-letter" pseudo-element. The first letter of Llobregat is 'l' in English but 'll' in Spanish (in the traditional Spanish alphabet, 'll' counts as a single letter).
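
The Llobregat case can be sketched in a few lines of Python. This is only an illustration, not how any particular browser implements first-letter: the DIGRAPHS table and the first_letter function are hypothetical, and the table is deliberately tiny.

```python
# Hypothetical, deliberately incomplete table of multi-character "letters".
DIGRAPHS = {
    "es": ("ch", "ll"),  # traditional Spanish alphabet
    "nl": ("ij",),       # Dutch
}


def first_letter(word, lang):
    """Return the first letter of word according to the given language tag."""
    # Only the primary subtag matters here: "es-ES" is treated like "es".
    primary = lang.lower().split("-")[0]
    for digraph in DIGRAPHS.get(primary, ()):
        if word.lower().startswith(digraph):
            return word[:len(digraph)]
    return word[:1]


print(first_letter("Llobregat", "en"))
print(first_letter("Llobregat", "es"))
```

Without a language tag, the function has no way to choose between the two answers, which is the point made above.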

Tools like spell checkers or online dictionaries must also be chosen according to the language used.

Language tagging also allows filters to keep only the documents written in a language that the user understands. At present, most search engines, like Google, use heuristics to guess the language of a Web page. While this works well for telling German from Japanese, it is much harder with closely related languages like Danish and Norwegian, especially if the text is short.
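
When tags are present, filtering needs no heuristics at all. The following Python sketch assumes each document carries a "lang" field holding its language tag (the field name, the sample documents and both function names are invented for the example). The matching rule is the simple prefix comparison known as basic filtering in RFC 4647.

```python
def matches(language_range, tag):
    """Basic filtering: "da" matches "da" and "da-DK", but not "dan"."""
    rng, tag = language_range.lower(), tag.lower()
    return rng == "*" or tag == rng or tag.startswith(rng + "-")


def filter_by_language(documents, accepted_ranges):
    """Keep the documents whose tag matches one of the accepted ranges.

    Untagged documents (lang is None or empty) are dropped, since nothing
    is known about their language.
    """
    return [
        doc for doc in documents
        if doc["lang"] and any(matches(r, doc["lang"]) for r in accepted_ranges)
    ]


# Hypothetical documents, for illustration only.
DOCS = [
    {"title": "Vejret i dag", "lang": "da"},
    {"title": "Været i dag", "lang": "nb-NO"},
    {"title": "Wetter heute", "lang": "de"},
    {"title": "Untagged page", "lang": None},
]

kept = filter_by_language(DOCS, ["da", "de"])
for doc in kept:
    print(doc["lang"], doc["title"])
```

The short Danish and Norwegian titles, which a heuristic would struggle to tell apart, are separated trivially because the tags say which is which.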

Current situation

At present, we are somewhat stuck in a chicken-and-egg problem: many applications (like the search engines mentioned above) do not use the language information because it is missing or unreliable. As a result, webmasters and other document maintainers are not eager to tag their resources, because doing so brings no short-term benefit. Things are improving, but certainly too slowly.

Further reading