Executive summary: tagging your digital resources to indicate
what language they are written in allows:
- Proper rendition,
- Correct behaviour of some software,
- Choice of the right tools,
- Correct filtering.
What is tagging?
Tagging is the process of attaching language
tags to a digital resource. Legacy HTML has its own
mechanism for this, and XML provides the special
xml:lang attribute.
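As a minimal sketch of how software can read such tags, the Python
fragment below parses a small XML document and reports the language in
force for each element. The document and its element names (doc, para)
are invented for illustration; only the xml:lang attribute itself comes
from the XML specification.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is bound to this namespace by the XML specification,
# so ElementTree exposes xml:lang under this expanded attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document: element names are invented for illustration.
SAMPLE = """<doc xml:lang="en">
  <para>The river flows to the sea.</para>
  <para xml:lang="fr">Le fleuve coule vers la mer.</para>
</doc>"""


def languages(element, inherited=None):
    """Yield (element name, effective language) for the whole tree.

    xml:lang applies to an element and to its descendants until one of
    them overrides it, so the current value is passed down the tree.
    """
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


result = list(languages(ET.fromstring(SAMPLE)))
# result == [("doc", "en"), ("para", "en"), ("para", "fr")]
```

The first para carries no tag of its own and inherits "en" from its
parent, while the second one overrides it with "fr".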
What is tagging for?
The purpose of tagging is to give unambiguous information
to the software processes that will handle the resource. For instance,
rendering the content properly on screen requires knowing the language
it is written in. Typography rules differ from one language to
another, so language-independent rendition can only be an
approximation. In the same way, knowing the language used is
necessary for speech synthesis.
Some programs may need the language to know what to do with
requests like the CSS "first-letter"
pseudo-element. The first letter of Llobregat
is 'l' in English but
'll' in Spanish.
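The Llobregat example can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the DIGRAPHS table
and the first_letter function are hypothetical, and the table contains
just the single entry taken from the example in the text. Real software
would rely on proper locale data instead.

```python
# Hypothetical table of multi-character "letters", keyed by primary
# language subtag. Only the entry used in the text is filled in.
DIGRAPHS = {
    "es": ("ll",),
}


def first_letter(word, lang):
    """Return the first letter of word under the conventions of lang."""
    if not word:
        return ""
    primary = lang.lower().split("-")[0]  # "es-ES" -> "es"
    for digraph in DIGRAPHS.get(primary, ()):
        if word.lower().startswith(digraph):
            # Slice the original word so that its casing is preserved.
            return word[:len(digraph)]
    return word[0]


english = first_letter("Llobregat", "en")  # "L"
spanish = first_letter("Llobregat", "es")  # "Ll"
```

Without the language tag, the function has no way to choose between
the two answers, which is exactly the ambiguity that tagging removes.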
Tools like spell checkers or online dictionaries must also be
chosen depending on the language used.
Language tagging also allows filters to keep only the documents
written in a language that the user understands. At present,
most search engines, like
Google, use heuristics
to find out the language of a Web page. While this works fine to tell
German apart from
Japanese, it is much
more difficult with closely related languages like Danish and Norwegian, especially if the text is short.
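When documents do carry reliable tags, filtering needs no guessing at
all. The Python sketch below keeps only the documents whose tag matches
one of the languages the user accepts. It assumes a simple prefix rule
(so that "nb" accepts "nb-NO"); the function names and the sample
collection are hypothetical and invented for illustration.

```python
def matches(tag, language_range):
    """Prefix match: "en" accepts "en" and "en-GB", but not "eng"."""
    tag, language_range = tag.lower(), language_range.lower()
    return tag == language_range or tag.startswith(language_range + "-")


def keep_readable(documents, accepted):
    """Return the titles of documents tagged with an accepted language.

    documents: iterable of (title, language tag or None) pairs.
    Untagged documents are dropped in this sketch; a real system would
    have to fall back on heuristics for them.
    """
    return [
        title
        for title, tag in documents
        if tag is not None and any(matches(tag, r) for r in accepted)
    ]


# Hypothetical collection of tagged documents.
DOCS = [
    ("Vejrudsigt", "da"),
    ("Værmelding", "nb-NO"),
    ("Wetterbericht", "de"),
    ("Untitled", None),
]

readable = keep_readable(DOCS, ["da", "nb"])
# readable == ["Vejrudsigt", "Værmelding"]
```

Note that the Danish and the Norwegian documents are told apart by
their tags alone, however short or similar their texts may be.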
Current situation
At present, we are somewhat stuck in a
chicken-and-egg problem: many applications
(like the search engines mentioned before) do not use the language
information because it is missing or unreliable. As a result,
webmasters and other document maintainers are not eager to tag their
documents, because it brings no short-term benefit. Things are improving,
but certainly too slowly.
Further reading