Web-LangTag/why-tagging.xml

<?xml version="1.0" encoding="utf-8"?>
<page title="Why tagging?">
<p>Executive summary: tagging your digital resources to indicate
which <wikipedia>language</wikipedia> they are written in allows:</p>
<ol>
<li>Proper rendition,</li>
<li>Correct behaviour of some software,</li>
<li>Choice of the right tools,</li>
<li>Correct filtering.</li>
</ol>
<h2>What is tagging?</h2>
<p>Tagging is the process of giving <a href="whatare.html">language
tags</a> to a digital resource. For instance, in legacy
<wikipedia>HTML</wikipedia>, it is done with:</p>
<pre>
<![CDATA[
<html lang="ar">
<!-- Text in Arabic -->
]]>
</pre>
<p>and in <wikipedia>XML</wikipedia> with the <code>xml:lang</code>
special attribute:</p>
<pre>
<![CDATA[
<book xml:lang="uk">
<!-- Text in Ukrainian -->
]]>
</pre>
<h2>What is tagging for?</h2>
<p>The purpose of tagging is to give <em>unambiguous</em> information
to the software processes that will handle the resource. For instance,
rendering the content properly on screen requires knowing the language it is
written in. The actual <wikipedia>typography</wikipedia> rules differ from one
language to another, so language-independent rendition can only be an
approximation. In the same way, knowing the language used is
necessary for <wikipedia>speech synthesis</wikipedia>.</p>
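<p>Hyphenation is one concrete case. The following is a minimal sketch,
assuming a renderer that supports the CSS <code>hyphens</code> property and
has hyphenation patterns for the languages involved; the sample sentences are
only illustrative. Automatic hyphenation needs the language to pick the right
patterns, so the single style rule below can give language-appropriate line
breaks only because each paragraph is tagged:</p>
<pre>
<![CDATA[
<style>
/* Let the renderer hyphenate, using the patterns of each element's language */
p { hyphens: auto; }
</style>
<p lang="en">Internationalization requires language information.</p>
<p lang="de">Die Internationalisierung erfordert Sprachinformationen.</p>
]]>
</pre>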
<p>Some programs may need the language to know what to do with
requests like the <wikipedia>CSS</wikipedia> <code>first-letter</code>
pseudo-element. The first letter of <wikipedia>Llobregat</wikipedia>
is 'l' in <wikipedia name="English language">English</wikipedia> but
'll' in <wikipedia name="Spanish language">Spanish</wikipedia>, where the
digraph has traditionally counted as a single letter.</p>
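<p>As a minimal sketch of that case, assume a stylesheet that enlarges the
first letter of each introductory paragraph (the <code>intro</code> class is
invented for this example). The same rule applies to both paragraphs, and the
<code>lang</code> attribute is the only thing that tells the renderer which
convention to follow. Whether a given browser actually treats "Ll" as one
letter depends on the implementation:</p>
<pre>
<![CDATA[
<style>
/* Enlarge the first letter of each introductory paragraph */
p.intro::first-letter { font-size: 200%; }
</style>
<p class="intro" lang="en">Llobregat is a river in Catalonia.</p>
<p class="intro" lang="es">Llobregat es un río de Cataluña.</p>
]]>
</pre>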
<p>Tools like <wikipedia name="Spell checker">spell checkers</wikipedia> or online dictionaries must also be
chosen according to the language used.</p>
<p>Language tagging also allows filters to keep only the documents
written in a language that the user understands. At present,
most <wikipedia>search engines</wikipedia>, like
<wikipedia>Google</wikipedia>, use <a
href="http://www.macchiato.com/slides/unicode_at_google.ppt">heuristics</a>
to find out the language of a Web page. While this works well for telling
<wikipedia name="German language">German</wikipedia> apart from
<wikipedia name="Japanese language">Japanese</wikipedia>, it is much
more difficult with closely related languages like <wikipedia name="Danish language">Danish</wikipedia>
and <wikipedia name="Norwegian language">Norwegian</wikipedia>,
especially if the text is short.</p>
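<p>The short-text problem can be sketched as follows, assuming a page that
quotes one sentence in each language (the sentence is chosen because, as far
as we know, it is spelled the same way in Danish and in Norwegian Bokmål).
A heuristic has nothing to go on here, while the <code>lang</code> attribute
settles the question:</p>
<pre>
<![CDATA[
<!-- Same spelling, different languages: only the tag distinguishes them -->
<p lang="da">Jeg har en hund.</p>
<p lang="nb">Jeg har en hund.</p>
]]>
</pre>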
<h2>Current situation</h2>
<p>At present, we are somewhat stuck in a
<wikipedia name="The chicken or the egg">chicken-and-egg</wikipedia> problem: many applications
(like the search engines mentioned above) do not use the language
information because it is absent or unreliable. As a result,
webmasters and other document maintainers are not eager to tag
their resources, because doing so brings no short-term benefit. Things are
improving, but certainly too slowly.</p>
<h2>Further reading</h2>
<ul>
<li><a href="http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.091505539">Why specify language?</a> by the W3C</li>
</ul>
</page>