forked from bortzmeyer/Web-LangTag
<?xml version="1.0" encoding="utf-8"?>
<page title="Why tagging?">
<p>Executive summary: tagging your digital resources to indicate
what <wikipedia>language</wikipedia> they are written in allows:</p>
<ol>
<li>Proper rendition,</li>
<li>Correct behaviour of some software,</li>
<li>Choice of the right tools,</li>
<li>Correct filtering.</li>
</ol>
<h2>What is tagging?</h2>
<p>Tagging is the process of attaching <a href="whatare.html">language
tags</a> to a digital resource. For instance, in legacy
<wikipedia>HTML</wikipedia>, it is done with the <code>lang</code> attribute:</p>
<pre>
<![CDATA[
<html lang="ar">
<!-- Text in Arabic -->
]]>
</pre>
<p>and in <wikipedia>XML</wikipedia> with the <code>xml:lang</code>
special attribute:</p>
<pre>
<![CDATA[
<book xml:lang="uk">
<!-- Text in Ukrainian -->
]]>
</pre>
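<p>To illustrate how software might consume such tags, here is a minimal
Python sketch that reads <code>xml:lang</code> with the standard
<code>xml.etree.ElementTree</code> module and lets child elements inherit
the language of their parent. The sample document and the
<code>languages</code> helper are assumptions made for this illustration,
not part of any existing tool:</p>
<pre>
<![CDATA[
```python
import xml.etree.ElementTree as ET

# Namespace-qualified name under which ElementTree exposes xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(element, inherited=None):
    """Yield (element name, language tag) pairs.

    An element without its own xml:lang inherits the tag of its parent.
    """
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


# Hypothetical sample document, made up for this example.
SAMPLE = """<book xml:lang="uk">
  <title>Кобзар</title>
  <quote xml:lang="en">Hello</quote>
</book>"""

for name, lang in languages(ET.fromstring(SAMPLE)):
    print(name, lang)
```
]]>
</pre>
<p>In this sketch, the <code>title</code> element carries no tag of its
own and is therefore reported as Ukrainian, while the <code>quote</code>
element overrides the inherited value with English.</p>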
<h2>What is tagging for?</h2>
<p>The purpose of tagging is to give <em>unambiguous</em> information
to the software that will handle the resource. For instance,
rendering the content properly on the screen requires knowing the language it is
written in. Actual <wikipedia>typography</wikipedia> rules differ from one
language to another, so language-independent rendition can only be an
approximation. In the same way, knowing the language used is
necessary for <wikipedia>speech synthesis</wikipedia>.</p>
<p>Some programs need the language to know how to handle
requests like <wikipedia>CSS</wikipedia>'s <code>::first-letter</code>
pseudo-element. The first letter of <wikipedia>Llobregat</wikipedia>
is 'l' in <wikipedia name="English language">English</wikipedia> but
'll' in <wikipedia name="Spanish language">Spanish</wikipedia>, where
the digraph has traditionally been treated as a single letter.</p>
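<p>The Llobregat example can be sketched in a few lines of Python. The
<code>DIGRAPHS</code> table and the <code>first_letter</code> function
are hypothetical and deliberately tiny; a real renderer would need far
more complete data for each language:</p>
<pre>
<![CDATA[
```python
# Hypothetical, deliberately incomplete table of multi-character "letters",
# keyed by primary language subtag.
DIGRAPHS = {
    "es": ("ch", "ll"),  # traditional Spanish alphabet
    "nl": ("ij",),       # Dutch
}


def first_letter(word, language_tag):
    """Return the first letter of word under the conventions of language_tag."""
    # Only the primary subtag is used here: "es-ES" is handled like "es".
    primary = language_tag.lower().split("-")[0]
    for digraph in DIGRAPHS.get(primary, ()):
        if word.lower().startswith(digraph):
            return word[:len(digraph)]
    return word[:1]


print(first_letter("Llobregat", "en"))     # only the initial "L"
print(first_letter("Llobregat", "es-ES"))  # the digraph "Ll"
```
]]>
</pre>
<p>Without a language tag, a program has no reliable way of choosing
between the two answers.</p>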
<p>Tools like <wikipedia name="Spell checker">spell checkers</wikipedia> or online dictionaries must also be
chosen according to the language used.</p>
<p>Language tagging also allows filters to keep only some documents,
those written in a language that the user understands. At present,
most <wikipedia name="Search engine">search engines</wikipedia>, like
<wikipedia name="Google Search">Google</wikipedia>, use heuristics
to find out the language of a Web page. While this works well for telling
<wikipedia name="German language">German</wikipedia> apart from
<wikipedia name="Japanese language">Japanese</wikipedia>, it is much
more difficult with closely related languages like <wikipedia name="Danish language">Danish</wikipedia>
and <wikipedia name="Norwegian language">Norwegian</wikipedia>, especially if the text is short.</p>
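<p>When tags are present, filtering needs no heuristics at all. The
following Python sketch keeps only the documents whose tag matches one of
the languages a user understands, using a simple prefix match in the
style of the basic filtering described in RFC 4647. The document list and
the function names are assumptions made for this illustration:</p>
<pre>
<![CDATA[
```python
def matches(language_range, tag):
    """Return True if tag equals the range or extends it with subtags."""
    language_range = language_range.lower()
    tag = tag.lower()
    if language_range == "*":
        return True
    return tag == language_range or tag.startswith(language_range + "-")


def keep_understood(documents, understood):
    """Keep the (name, tag) pairs whose tag matches an understood range."""
    return [
        (name, tag)
        for name, tag in documents
        if any(matches(language_range, tag) for language_range in understood)
    ]


# Hypothetical collection of (document name, language tag) pairs.
DOCUMENTS = [
    ("weather.html", "da"),
    ("news.html", "nb-NO"),
    ("manual.html", "de-CH"),
    ("haiku.html", "ja"),
]

print(keep_understood(DOCUMENTS, ["da", "de"]))
```
]]>
</pre>
<p>Here a reader of Danish and German gets the Danish page and the Swiss
German manual, and the short Norwegian page is excluded without any
guessing.</p>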
<h2>Current situation</h2>
<p>At present, we are somewhat stuck in a
<wikipedia name="The chicken or the egg">chicken-and-egg</wikipedia> problem: many applications
(like the search engines mentioned above) do not use the language
information because it is absent or unreliable. Therefore,
webmasters and other document maintainers are not eager to tag their resources,
because doing so brings no short-term benefit. Things are improving,
but certainly too slowly.</p>
<h2>Further reading</h2>
<ul>
<li><a href="https://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.091505539">Why specify language?</a> by the W3C</li>
</ul>
</page>