Initial import
This commit is contained in:
parent
505d8f1a04
commit
85f4c9e1e5
26
Makefile
Normal file
26
Makefile
Normal file
@ -0,0 +1,26 @@
|
||||
ALLHTML=$(shell ls *.xml 2> /dev/null | sed 's/.xml$$/.html/' )
|
||||
STYLESHEET=page.xslt
|
||||
IMAGES=ltag-icon-en.png favicon.ico
|
||||
ME=$(shell hostname)
|
||||
ifeq ("${ME}","lilith")
|
||||
WEBSERVER=/var/www/www.langtag.net
|
||||
else
|
||||
WEBSERVER=bortzmeyer@www.langtag.net:/var/www/www.langtag.net
|
||||
endif
|
||||
GOOGLEVERIF=google75f3cadf7e9fc996.html
|
||||
|
||||
all: ${ALLHTML}
|
||||
|
||||
%.html: %.xml ${STYLESHEET} language-subtag-registry-version
|
||||
xsltproc --stringparam lsr-version `cat language-subtag-registry-version` \
|
||||
--output $@ ${STYLESHEET} $< && xmllint --noout --valid $@
|
||||
|
||||
install: all
|
||||
cp -a ../SQL/* .
|
||||
touch ${GOOGLEVERIF}
|
||||
rsync -q -a ${ALLHTML} ${GOOGLEVERIF} ${IMAGES} ltru.css registries test-suites PostgreSQL SQLite ${WEBSERVER}
|
||||
|
||||
clean:
|
||||
rm -f *.html
|
||||
|
||||
|
@ -1,3 +1,8 @@
|
||||
# Web-LangTag
|
||||
|
||||
Code for the www.langtag.net Web site (language tag registry)
|
||||
Code for the [www.langtag.net](https://www.langtag.net/) Web site
|
||||
(language tag registry).
|
||||
|
||||
The programs depend on the [GaBuZoMeu
|
||||
tools](https://www.bortzmeyer.org/gabuzomeu-parsing-language-tags.html)
|
||||
and of some programs like xsltproc and trang.
|
||||
|
4
TODO
Normal file
4
TODO
Normal file
@ -0,0 +1,4 @@
|
||||
* Check all the links
|
||||
* Export database schemas (link to GaBuZoMeu?)
|
||||
* Configure Web site
|
||||
* HTTPS
|
BIN
favicon.ico
Normal file
BIN
favicon.ico
Normal file
Binary file not shown.
After Width: | Height: | Size: 766 B |
68
find-subtags.xml
Normal file
68
find-subtags.xml
Normal file
@ -0,0 +1,68 @@
|
||||
<page title="Find a subtag for my language / dialect / script">
|
||||
<p>Very often, people who are not used to <a href="whatare.html">language
|
||||
tagging</a> hesitate before choosing a tag for a given
|
||||
combination of language, region, script, etc. You can find the result
|
||||
of these hesitations on the Web, with people tagging, for instance,
|
||||
<wikipedia name="Japanese language">japanese</wikipedia> as <code>jp</code> (the proper
|
||||
subtag for the japanese <em>language</em> is <code>ja</code>,
|
||||
<code>jp</code> is for the country) or using subtags that are not
|
||||
registered
|
||||
because they did not find the valid
|
||||
ones. The purpose of this small text is to explain how to find a
|
||||
subtag registered in the <em>Language Subtag Registry</em>. There are
|
||||
many ways to do so, of course, and you are free to report
|
||||
better ways.</p>
|
||||
<p>The first thing to try is probably to use <a
|
||||
href="http://people.w3.org/rishida/utils/subtags/">Richard Ishida's
|
||||
Language Subtag Registry Search</a>. You can enter text which appears
|
||||
in the Description or Comments field of the registry and the
|
||||
corresponding subtags will be displayed. For instance, for japanese,
|
||||
it will correctly report <code>ja</code> (and the script
|
||||
<code>Jpan</code>, which indicates the mix of
|
||||
<wikipedia name="Kanji">Han</wikipedia>, <wikipedia>Hiragana</wikipedia> and
|
||||
<wikipedia>Katakana</wikipedia>).</p>
|
||||
<p>A more powerful, but probably less user-friendly, method, is to use
|
||||
the registry directly. Since its canonical form is more adapted to
|
||||
computer programs than to humans (for instance,
|
||||
<wikipedia>Unicode</wikipedia> characters are reported as XML escapes,
|
||||
like &#xE7;), it may be better to use of the <a
|
||||
href="registries.html/">many unofficial forms</a>, automatically
|
||||
computed from the official one, and available for
|
||||
various environments. For instance, you may load the <a
|
||||
href="registries/language-subtag-registry-utf8">text version</a> and use the Find
|
||||
function of your Web browser (Control-F in
|
||||
<wikipedia>Firefox</wikipedia>). Say that you are not sure of the
|
||||
proper subtag for the <wikipedia name="Canadian Aboriginal Syllabics">canadian
|
||||
aboriginal script</wikipedia>, searching "canadian" that way soon discovers
|
||||
the subtag <code>Cans</code>.</p>
|
||||
<p>Both Richard Ishida's Web service and the above method have a
|
||||
limit: they only use information that is in the registry. If the relationship between common
|
||||
names and the tags is not in the registry, you will not find it. For
|
||||
instance, if you want to identify "<wikipedia name="British English">British English</wikipedia>", you have to realize
|
||||
that that is done by constructing the tag <code>en-GB</code> (not <code>en-UK</code>) from
|
||||
subtags in the registry. Similarly, if you want to
|
||||
write texts in <wikipedia name="Alsatian language">Alsatian</wikipedia>, and search this word
|
||||
in the registry, you won't find anything.</p>
|
||||
<p>You have to use external tools, but please check their results
|
||||
against the registry, with the tools mentioned above. A good search
|
||||
tool is
|
||||
<wikipedia>Wikipedia</wikipedia>. The english-speaking Wikipedia
|
||||
displays the <wikipedia>ISO</wikipedia> code names for most of the
|
||||
languages it talks about. Since most subtags in the registry are based
|
||||
on ISO standards, this works most of the time. For instance, the article on Alsatian will
|
||||
show the language code <code>gsw</code>
|
||||
(<wikipedia name="Alemannic German">Alemannic</wikipedia>), which, even if it is broader than the Alsatian dialect, is a good start.</p>
|
||||
<p>For languages (but not scripts or dialects), another very useful
|
||||
source is <a href="http://www.ethnologue.com/">Ethnologue</a>, which is
|
||||
managed by the <wikipedia>ISO 639</wikipedia> registration agency. It
|
||||
has a <a href="http://www.ethnologue.com/site_search.asp">search function</a> that allows you to use words that are not in the
|
||||
formal standard. For example,
|
||||
searching for "Alsatian", you'll find
|
||||
<code><a href="http://www.ethnologue.com/show_language.asp?code=gsw">http://www.ethnologue.com/show_language.asp?code=gsw</a></code>, where there is the
|
||||
comment: "Called 'Schwyzerdütsch' in Switzerland, and 'Alsatian' in France". But be careful:
|
||||
Ethnologue displays only 3-letters code, while the registry
|
||||
uses 2-letters code whenever they are available. For instance, <wikipedia name="French language">French</wikipedia> is
|
||||
<code>fr</code>, not <code>fra</code>.</p>
|
||||
<p>TODO: endonyms and exonyms</p>
|
||||
</page>
|
||||
|
47
index.xml
Normal file
47
index.xml
Normal file
@ -0,0 +1,47 @@
|
||||
<page title="Home" pagetitle="Language Tags">
|
||||
<div class="menu3">
|
||||
<h2>For users</h2>
|
||||
<ul>
|
||||
<li><a href="whatare.html">What are language tags?</a></li>
|
||||
<li><a href="why-tagging.html">Why tagging</a>?</li>
|
||||
<li>Tag <a href="tag-wisely.html">wisely</a></li>
|
||||
<li>The <wikipedia name="IETF language code">Wikipedia article</wikipedia></li>
|
||||
<li>The <a
|
||||
href="http://www.w3.org/International/articles/language-tags/">excellent
|
||||
article from the W3C</a> about language tags</li>
|
||||
<li>Also from the <wikipedia>W3C</wikipedia>, a <a href="http://www.w3.org/International/tutorials/language-decl/">tutorial "Declaring Language in XHTML and HTML"</a></li>
|
||||
<li><a href="http://www.inter-locale.com">Other resources</a></li>
|
||||
<li><a href="http://people.w3.org/rishida/utils/subtags/">Search subtags online in the registry</a> (some explanations are provided, but this tool is more useful if you know the concepts behind the registry)</li>
|
||||
<li>Browse <a href="http://unicode.org/cldr/utility/languageid.jsp">language tags</a> and see their meaning</li>
|
||||
</ul>
|
||||
</div>
|
||||
<div class="menu3">
|
||||
<h2>For software developers</h2>
|
||||
<ul>
|
||||
<li>The subtag registry (version <em><lsr-version/></em>) in <a href="registries.html">various formats</a></li>
|
||||
<li><a href="http://unicode.org/cldr/data/tools/java/org/unicode/cldr/util/data/langtagRegex.txt">Mark Davis' parser (ICU)</a>, written as a <wikipedia name="Regular expression">regexp</wikipedia> suitable for <wikipedia>Perl</wikipedia> or <wikipedia>Java</wikipedia></li>
|
||||
<li><a href="http://www.dpawson.co.uk/java/rfc4646.html">Dave Pawson's implementation</a> in <wikipedia name="Java (programming language)">Java</wikipedia></li>
|
||||
<li>Martin Dürst's implementation in <wikipedia name="Ruby (programming language)">Ruby</wikipedia> is available as a <wikipedia name="RubyGems">Gem</wikipedia> named "langtag" at <wikipedia>RubyForge</wikipedia></li>
|
||||
<li><a href="http://www.bortzmeyer.org/gabuzomeu-parsing-language-tags.html">Stéphane Bortzmeyer's implementation</a> in <wikipedia name="Haskell (programming language)">Haskell</wikipedia>.</li>
|
||||
<li><a href="philips-regexp.html">Addison Phillips' code</a>, as a <wikipedia name="Regular expression">regexp</wikipedia> for <wikipedia name="Java (programming language)">Java</wikipedia></li>
|
||||
<li>An <a
|
||||
href="http://www.sasakiatcf.com/felix/lta/05/lta.xsl">implementation</a>
|
||||
in <wikipedia>XSLT</wikipedia> (requires a XSLT 2.0 processor) and the
|
||||
<a href="http://www.sasakiatcf.com/felix/lta/">Web page it powers</a> (by Felix Sasaki).</li>
|
||||
<li><a href="test-suites.html">Test suites</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
<div class="menu3">
|
||||
<h2>For standard authors</h2>
|
||||
<ul>
|
||||
<li>The current standard: <a href="http://www.rfc-editor.org/rfc/rfc5646.txt">RFC 5646</a> and <a href="http://www.rfc-editor.org/rfc/rfc4647.txt">RFC 4647</a></li>
|
||||
<li>How to <a href="register-new-subtag.html">register a new
|
||||
subtag</a> (for instance, for a variant of a language),</li>
|
||||
<li>A <a href="registries/lsr.atom">syndication feed</a> (<wikipedia name="Atom (standard)">Atom</wikipedia> format) of the changes in the registry</li>
|
||||
<li>The IETF <a href="http://www.ietf.org/html.charters/ltru-charter.html">LTRU Working Group</a></li>
|
||||
<li>Working on <a href="web-site.html">this Web site</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<p xml:lang="fr"><cite>Tout est aléa, confusion et précarité, sauf le Catalogue.</cite> (<wikipedia>Fred Vargas</wikipedia>)</p>
|
||||
</page>
|
64
ltru.css
Normal file
64
ltru.css
Normal file
@ -0,0 +1,64 @@
|
||||
.menu3 {
|
||||
float: left;
|
||||
width: 30%;
|
||||
padding: 1%;
|
||||
}
|
||||
|
||||
/* .menu3 ul {
|
||||
list-style-type: none;
|
||||
} */
|
||||
|
||||
.menu3 h2 {
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
.back-to-normal {
|
||||
margin-top: 7%;
|
||||
clear: both;
|
||||
float: none;
|
||||
padding: 1%;
|
||||
width: 98%;
|
||||
}
|
||||
|
||||
body {
|
||||
padding: 1%;
|
||||
color: #000000;
|
||||
background-color: #ffffff;
|
||||
background-image: none;
|
||||
}
|
||||
|
||||
.main-title {
|
||||
text-align: center;
|
||||
float: none;
|
||||
clear: both;
|
||||
}
|
||||
|
||||
em {
|
||||
font-weight: bolder;
|
||||
}
|
||||
|
||||
a, p, li, h1, h2, h3, pre {
|
||||
}
|
||||
|
||||
a:visited {
|
||||
/* text-decoration: line-through; /* Blog-like :-) */
|
||||
}
|
||||
|
||||
a:active {
|
||||
font-size: 125%;
|
||||
}
|
||||
|
||||
code {
|
||||
font-size: 130%;
|
||||
}
|
||||
|
||||
#ltag-icon {
|
||||
float: left;
|
||||
}
|
||||
|
||||
#headline {
|
||||
margin-left: 2%;
|
||||
font-size: 140%;
|
||||
font-weight: bolder;
|
||||
}
|
||||
|
101
page.xslt
Normal file
101
page.xslt
Normal file
@ -0,0 +1,101 @@
|
||||
<?xml version='1.0' encoding='UTF-8'?>
|
||||
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
|
||||
xmlns:str="http://exslt.org/strings"
|
||||
exclude-result-prefixes = "str"
|
||||
version='1.0'>
|
||||
|
||||
<xsl:output method = "xml"
|
||||
encoding = "UTF-8"
|
||||
omit-xml-declaration = "no"
|
||||
doctype-system = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
|
||||
doctype-public = "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
indent = "yes"/>
|
||||
|
||||
<xsl:param name="lsr-version">VERSION UNDEFINED</xsl:param>
|
||||
|
||||
<xsl:template match="page">
|
||||
<xsl:variable name="title">
|
||||
<xsl:value-of select="@title"/>
|
||||
</xsl:variable>
|
||||
<xsl:variable name="pagetitle">
|
||||
<xsl:choose>
|
||||
<xsl:when test="@pagetitle">
|
||||
<xsl:value-of select="@pagetitle"/>
|
||||
</xsl:when>
|
||||
<xsl:otherwise>
|
||||
<xsl:value-of select="@title"/>
|
||||
</xsl:otherwise>
|
||||
</xsl:choose>
|
||||
</xsl:variable>
|
||||
<html xml:lang="en">
|
||||
<head>
|
||||
<link rel="stylesheet" type="text/css" href="ltru.css" />
|
||||
<title>Language Tags: <xsl:value-of select="$title"/></title>
|
||||
</head>
|
||||
<body>
|
||||
<div><img id="ltag-icon" src="ltag-icon-en.png" alt=""/></div>
|
||||
<h1 class="main-title"><xsl:value-of select="$pagetitle"/></h1>
|
||||
<xsl:apply-templates select="*"/>
|
||||
<hr class="before-footer"/>
|
||||
<p class="footer"><a href="index.html">Home</a>. Web site maintained by
|
||||
<code><a href="mailto:webmaster@langtag.net">webmaster@langtag.net</a></code>.
|
||||
Hosted by <a href="http://www.afnic.fr/">AFNIC</a>.
|
||||
</p>
|
||||
</body>
|
||||
</html>
|
||||
</xsl:template>
|
||||
|
||||
<xsl:template match="lsr-version">
|
||||
<xsl:value-of select="$lsr-version"/>
|
||||
</xsl:template>
|
||||
|
||||
<xsl:template name="wikipedia">
|
||||
<xsl:param name="link"/>
|
||||
<xsl:param name="text"/>
|
||||
<xsl:variable name="actuallink">
|
||||
<xsl:choose>
|
||||
<xsl:when test="$link = ''">
|
||||
<xsl:value-of select="$text"/>
|
||||
</xsl:when>
|
||||
<xsl:otherwise>
|
||||
<xsl:value-of select="$link"/>
|
||||
</xsl:otherwise>
|
||||
</xsl:choose>
|
||||
</xsl:variable>
|
||||
<a><xsl:attribute name="href">http://en.wikipedia.org/wiki/<xsl:value-of select="$actuallink"/></xsl:attribute><xsl:value-of select="$text"/></a>
|
||||
</xsl:template>
|
||||
|
||||
<xsl:template match="wikipedia">
|
||||
<xsl:variable name="word">
|
||||
<xsl:choose>
|
||||
<xsl:when test="@name">
|
||||
<xsl:value-of select="@name"/>
|
||||
</xsl:when>
|
||||
<xsl:otherwise>
|
||||
<xsl:value-of select="text()"/>
|
||||
</xsl:otherwise>
|
||||
</xsl:choose>
|
||||
</xsl:variable>
|
||||
<xsl:variable name="path">
|
||||
<xsl:choose>
|
||||
<xsl:when test="function-available('str:encode-uri')">
|
||||
<xsl:value-of select="str:encode-uri($word, true())"/>
|
||||
</xsl:when>
|
||||
<xsl:otherwise>
|
||||
<xsl:value-of select="$word"/>
|
||||
</xsl:otherwise>
|
||||
</xsl:choose>
|
||||
</xsl:variable>
|
||||
<xsl:call-template name="wikipedia">
|
||||
<xsl:with-param name="link" select="$path"/>
|
||||
<xsl:with-param name="text" select="text()"/>
|
||||
</xsl:call-template>
|
||||
</xsl:template>
|
||||
|
||||
<xsl:template match="@*|node()">
|
||||
<xsl:copy>
|
||||
<xsl:apply-templates select="@*|node()"/>
|
||||
</xsl:copy>
|
||||
</xsl:template>
|
||||
|
||||
</xsl:stylesheet>
|
15
philips-regexp.xml
Normal file
15
philips-regexp.xml
Normal file
@ -0,0 +1,15 @@
|
||||
<page title="Addison Phillips Java regexp for language tags">
|
||||
<p>Here is a <wikipedia>regular expression</wikipedia> to parse the <em>future</em> versions of
|
||||
<a href="index.html">language tags</a>. Suitable for the syntax of the RFC 5646. Written by Addison Phillips, <code>addison - at - amazon.com</code> for the <wikipedia name="Java (programming language)">Java programming language</wikipedia>.</p>
|
||||
<pre>
|
||||
static final String langtag_ex =
|
||||
"(\\A[xX]([\\x2d]\\p{Alnum}{1,8})*\\z)"
|
||||
+ "|(((\\A\\p{Alpha}{2,8}(?=\\x2d|\\z)){1}"
|
||||
+ "(([\\x2d]\\p{Alpha}{3})(?=\\x2d|\\z)){0,3}"
|
||||
+ "([\\x2d]\\p{Alpha}{4}(?=\\x2d|\\z))?"
|
||||
+ "([\\x2d](\\p{Alpha}{2}|\\d{3})(?=\\x2d|\\z))?"
|
||||
+ "([\\x2d](\\d\\p{Alnum}{3}|\\p{Alnum}{5,8})(?=\\x2d|\\z))*)"
|
||||
+ "(([\\x2d]([a-wyzA-WYZ](?=\\x2d))([\\x2d](\\p{Alnum}{2,8})+)*))*"
|
||||
+ "([\\x2d][xX]([\\x2d]\\p{Alnum}{1,8})*)?)\\z";
|
||||
</pre>
|
||||
</page>
|
196
register-new-subtag.xml
Normal file
196
register-new-subtag.xml
Normal file
@ -0,0 +1,196 @@
|
||||
<?xml version="1.0" encoding="us-ascii"?>
|
||||
<page title="How to register a new subtag">
|
||||
<p>The <em><a href="registries.html">language subtag registry</a></em>
|
||||
includes many <a href="whatare.html">subtags</a> identifying countries, languages or variants
|
||||
such as local dialects. Your favorite language and/or variant is
|
||||
probably already there. But, if it is not, you can ask for the
|
||||
registration of a new subtag. This text explains how, but the
|
||||
complete and authoritative explanation is in <a
|
||||
href="http://www.rfc-editor.org/rfc/rfc5646.txt">RFC 5646</a>,
|
||||
specially its section 3.5, <a href="http://tools.ietf.org/html/rfc5646#section-3.5">Registration Procedure for Subtags</a>". It is
|
||||
recommended to read at least this section.</p>
|
||||
<p>Before you start, a warning: the process takes time, documentation
|
||||
and the ability to defend your proposal and to back it with facts and
|
||||
references. Just sending an email saying "People in my hometown speaks
|
||||
<wikipedia name="Alsatian language">Alsatian</wikipedia>, I want 'als' to be registered as a language subtag" is not sufficient.</p>
|
||||
<p>You have different sorts of subtags and the rules are not the same
|
||||
for all:</p>
|
||||
<ul>
|
||||
<li>Subtags for types "countries" or "scripts" cannot be registered
|
||||
directly with the <wikipedia>IETF</wikipedia>. You have to go through
|
||||
the maintenance agencies of <wikipedia name="International Organization for Standardization">ISO</wikipedia>, the language
|
||||
subtag registry managed by IETF copies the ISO standards here.</li>
|
||||
<li>Only subtags of types "language" and "variant" are therefore
|
||||
considered here. In practice, chances that a "language" subtag
|
||||
registration succeeds seem limited (you will probably be redirected to
|
||||
the maintenance agencies of <wikipedia>ISO 639</wikipedia>; if you
|
||||
already ask them and were turned out, prepare a very good proposal if
|
||||
you want the IETF to make another choice). We then concentrate on
|
||||
"variant" subtags.</li>
|
||||
</ul>
|
||||
<p>The process is the following (it is a simplified version; did I
|
||||
tell you to read the full story in <a href="http://tools.ietf.org/html/rfc5646#section-3.5">section 3.5</a> of <a
|
||||
href="http://www.rfc-editor.org/rfc/rfc5646.txt">RFC 5646</a>?):</p>
|
||||
<ol>
|
||||
<li>Collect background information, typically references to published
|
||||
descriptions of the language or dialect. A Wikipedia page is possible but
|
||||
may be insufficient, specially since the page may change
|
||||
easily. Stable references are preferred.</li>
|
||||
<li>Choose a subtag which must conform to the syntax rules explained
|
||||
in RFC 5646 (<a href="http://tools.ietf.org/html/rfc5646#section-2.1">section 2.1</a>). A variant subtag must be either a string of
|
||||
five to eight alphanumeric characters, <em>or</em> a string of four
|
||||
alphanumeric characters, starting with a digit. So, <code><wikipedia
|
||||
name="Valencia (province)">valencian</wikipedia></code> is illegal
|
||||
(too long) while <code>valencia</code> is legal. <code><wikipedia
|
||||
name="German spelling reform of 1996">1996</wikipedia></code> is legal, too, but
|
||||
not <code>732</code>.</li>
|
||||
<li>Fill-in the registration form whose template is:
|
||||
<pre>
|
||||
LANGUAGE SUBTAG REGISTRATION FORM
|
||||
1. Name of requester:
|
||||
2. E-mail address of requester:
|
||||
3. Record Requested:
|
||||
|
||||
Type:
|
||||
Subtag:
|
||||
Description:
|
||||
Prefix:
|
||||
Preferred-Value:
|
||||
Deprecated:
|
||||
Suppress-Script:
|
||||
Comments:
|
||||
|
||||
4. Intended meaning of the subtag:
|
||||
5. Reference to published description
|
||||
of the language (book or article):
|
||||
6. Any other relevant information:
|
||||
</pre>
|
||||
|
||||
Pay special attention to Prefix (in practice, most variants have a
|
||||
Prefix, which is the main language of this variant, such as
|
||||
<code><wikipedia name="Catalan language">ca</wikipedia></code> for
|
||||
valencian).<br/> Think twice about Description (in general a short
|
||||
one-line sentence) and Comments (which may be longer), because the
|
||||
consistency of tagging among different taggers will heavily depend on
|
||||
the quality of these fields.<br/>Keep
|
||||
detailed scholar references for the Reference section of the request:
|
||||
the registry is not a library.<br/> Some
|
||||
fields are typically not used for a variant such as
|
||||
Suppress-Script.</li>
|
||||
<li>Send it to the mailing list <code><a
|
||||
href="mailto:ietf-languages@iana.org">ietf-languages@iana.org</a></code>
|
||||
(you may choose to <a href="http://www.alvestrand.no/mailman/listinfo/ietf-languages">subscribe to the mailing list</a> before, to get an
|
||||
idea of the people and discussions, and to be sure to have the
|
||||
complete thread).</li>
|
||||
<li>Reply to questions, address objections, be prepared to modify your
|
||||
registration form and keep cool.</li>
|
||||
</ol>
|
||||
<p>Let's see a complete example showing many issues (thanks to CE
|
||||
Whitehead for the nice example). The current registry has three
|
||||
entries for the <wikipedia>french language</wikipedia>,
|
||||
<code>fr</code> for today's French, <code>frm</code> for Middle French
|
||||
(the language spoken during the <wikipedia>Renaissance</wikipedia>)
|
||||
and <code>fro</code> for Old French (the language spoken during the
|
||||
<wikipedia name="Middle Ages">Middle Age</wikipedia>). This is not
|
||||
always sufficiently fine-grained to classify some old texts. So, here
|
||||
is a possible proposal to register a variant, <code>1606Nict</code>,
|
||||
for the late Middle French, as described in the famous <wikipedia
|
||||
name="Jean Nicot">Nicot</wikipedia>'s book:</p>
|
||||
<pre>
|
||||
LANGUAGE SUBTAG REGISTRATION FORM
|
||||
1. Name of requester: C. E. Whitehead
|
||||
2. E-mail address of requester: cewcathar@hotmail.com
|
||||
3. Record Requested:
|
||||
Type: Variant
|
||||
Subtag: 1606Nict
|
||||
(or alternately 16siecle)
|
||||
Description: Late Middle French
|
||||
Prefix: frm
|
||||
Preferred-Value:
|
||||
Deprecated:
|
||||
Suppress-Script:
|
||||
Comments: French as catalogued in Jean Nicot, "Thresor de la langue francoyse" 1606
|
||||
|
||||
4. Intended meaning of the subtag:
|
||||
5. Reference to published description
|
||||
of the language (book or article):
|
||||
|
||||
* Joachim du Bellay, La deffence et illustration de la langue francoyse,
|
||||
1549; ed critique by Henri Chamard, Geneve, Slatkine Rpt. 1969
|
||||
|
||||
* Jean Nicot, "Thresor de la langue francoyse" 1606; ARTFL Project,
|
||||
University of Chicago:
|
||||
http://portail.atilf.fr/dictionnaires/TLF-NICOT/index.htm
|
||||
|
||||
6. Any other relevant information:
|
||||
See second request below
|
||||
</pre>
|
||||
<p>Do note the detailed references and the use of Prefix to
|
||||
clearly state that it is a variant of Middle French.</p>
|
||||
<p>Let's see a second example from the same author, with the added
|
||||
difficulty that we use <wikipedia>XML</wikipedia>-like encoding for
|
||||
the composed characters (see section 3.1 of RFC 5646). This specificies the early modern French, as
|
||||
described by the <wikipedia name="Academie francaise">french academy</wikipedia>:</p>
|
||||
<pre>
|
||||
LANGUAGE SUBTAG REGISTRATION FORM
|
||||
1. Name of requester: C. E. Whitehead
|
||||
2. E-mail address of requester: cewcathar@hotmail.com
|
||||
3. Record Requested:
|
||||
|
||||
Type: Variant
|
||||
Subtag: 1694acad
|
||||
(alternately 17siecle)
|
||||
Description: Early modern French
|
||||
Prefix: fr
|
||||
Preferred-Value:
|
||||
Deprecated:
|
||||
Suppress-Script:
|
||||
Comments: As catalogued in the "Dictionnaire de
|
||||
l'acad&#xe9;me fran&#xe7;oise", 4eme ed. 1694; includes
|
||||
elements of Middle French; also new terms from the Americas
|
||||
|
||||
4. Intended meaning of the subtag:
|
||||
5. Reference to published description
|
||||
of the language (book or article):
|
||||
|
||||
* Dictionnaire de l'académie françoise, 4eme ed. 1694; RTFL Project,
|
||||
University of Chicago:
|
||||
http://portail.atilf.fr/dictionnaires/ACADEMIE/index.htm
|
||||
|
||||
* Fénelon, François de Salignac de La Mothe (1984), Fenelon's Letter to the
|
||||
French Academy : with an introduction and commentary.
|
||||
|
||||
* Ayres-Bennett, Wendy (2004), Sociolinguistic variation in
|
||||
seventeenth-century France : methodology and case studies.
|
||||
|
||||
also:
|
||||
* http://www.tsl.state.tx.us/treasures/giants/lasalle/lasalle-cover.html
|
||||
http://teacherweb.com/FL/Cocoa/CEWhitehead/HTMLPage15.stm
|
||||
</pre>
|
||||
<p>It is probably useful to list some mistakes that people seem to
|
||||
make often. Keep in mind that:</p>
|
||||
<ul>
|
||||
<li>Language issues are always extremely passionate, both for
|
||||
psychological (people feel very strong about their language) and
|
||||
political reasons (wars have been fought about languages). Please, try
|
||||
to keep easy and do not forget that it is perfectly normal that an
|
||||
international audience does not know your language (or the language
|
||||
you champion) and does not see things they way you do.</li>
|
||||
<li>Pay attention to syntax issues (you may wish to ask a computer
|
||||
person, may be with the help of some of the software tools listed <a
|
||||
href="/">on the home page</a>) and also be sure to fill in the form
|
||||
properly - or do not be suprised if the first reactions are on the
|
||||
syntax, not on the proposal itself. If the IETF yells at your form
|
||||
errors, do not assume it is a refusal of your language: it is simply a
|
||||
desire to enforce the documented process.</li>
|
||||
<li>The IETF is not a general appeal mechanism for other standard
|
||||
bodies decisions. Other standards are imperfect, true, but so is IETF
|
||||
work, too. Please do not use the IETF language registration mechanism just
|
||||
because ISO turned you down. Variant registration is typically fine
|
||||
because no other standard body do it.</li>
|
||||
</ul>
|
||||
<p>Thanks
|
||||
for reading and good luck for your future subtag
|
||||
registrations. Remember: it may seems difficult but it is worth
|
||||
it.</p>
|
||||
</page>
|
35
registries.xml
Normal file
35
registries.xml
Normal file
@ -0,0 +1,35 @@
|
||||
<page title="The registry in various formats">
|
||||
|
||||
<p>Files available here were automatically produced from the official
|
||||
registry maintained by <wikipedia>IANA</wikipedia>. The current version of
|
||||
the registry is <em><lsr-version/></em>.</p>
|
||||
|
||||
<ul>
|
||||
<li><a href="https://www.iana.org/assignments/language-subtag-registry">Official format</a></li>
|
||||
<li>As text files (one line per record, fields separated by
|
||||
tabulations, the subtag is the first field, the addition date the
|
||||
second):
|
||||
<ul>
|
||||
<li><a href="registries/lsr-language.txt">List of languages</a></li>
|
||||
<li><a href="registries/lsr-script.txt">List of scripts</a></li>
|
||||
<li><a href="registries/lsr-region.txt">List of regions</a> (including countries)</li>
|
||||
<li><a href="registries/lsr-variant.txt">List of variants</a> (orthography, local dialects)</li>
|
||||
<li><a href="registries/lsr-grandfathered.txt">List of grand-fathered</a> (tags which would otherwise be illegal but are maintained for compatibility, because they were previously used)</li>
|
||||
<li><a href="registries/lsr-redundant.txt">List of redundants</a> (subtags redundant with other subtags)</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>As <wikipedia>XML</wikipedia> :
|
||||
<ul>
|
||||
<li>According to <a href="registries/ltru.rnc">this schema</a> (written in <wikipedia>RelaxNG</wikipedia>): <a href="registries/language-subtag-registry.xml">registry in XML</a>, Stéphane Bortzmeyer's version</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>As <wikipedia>SQL</wikipedia>. Since SQL is not really portable, there are
|
||||
several versions :
|
||||
<ul>
|
||||
<li>For <wikipedia>PostgreSQL</wikipedia> (create the database with <a href="PostgreSQL/create-db-subtag.sql">this schema</a>) : <a href="registries/lsr-postgres.sql">registry in SQL</a>,</li>
|
||||
<li>For <wikipedia>SQLite</wikipedia> (create the database with <a href="SQLite/create-db-subtag.sql">this schema</a>) : <a href="registries/lsr-sqlite.sql">registry in SQL</a>,</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
</page>
|
9
registries/Makefile
Normal file
9
registries/Makefile
Normal file
@ -0,0 +1,9 @@
|
||||
all:
|
||||
./copy-and-convert.sh
|
||||
cp ./language-subtag-registry-version ..
|
||||
|
||||
ltru.rng: ltru.rnc
|
||||
trang -Irnc -Orng ltru.rnc $@
|
||||
|
||||
clean:
|
||||
rm -f language-subtag-registry language-subtag-registry.xml language-subtag-registry2.xml lsr-*.txt
|
93
registries/copy-and-convert.sh
Executable file
93
registries/copy-and-convert.sh
Executable file
@ -0,0 +1,93 @@
|
||||
#!/bin/sh
|
||||
|
||||
MYURL=https://www.langtag.net/
|
||||
LTR_URL=https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
|
||||
LTR_LOCAL=language-subtag-registry
|
||||
PROGRAMS_DIR=../../GaBuZoMeu
|
||||
TEST_PROGRAM=${PROGRAMS_DIR}/check-registry
|
||||
OS="$(uname)"
|
||||
if [ "$OS" = "FreeBSD" ]; then
|
||||
# FreeBSD's mktemp is stupid enough to have *no*
|
||||
# default template :-(
|
||||
OUTPUT=`mktemp /tmp/$(basename $0).tmp.XXX)`
|
||||
TMPDIFF=`mktemp /tmp/$(basename $0).tmp.XXX)`
|
||||
else
|
||||
OUTPUT=`mktemp`
|
||||
TMPDIFF=`mktemp`
|
||||
fi
|
||||
MAINTAINER=stephane+langtag@bortzmeyer.org
|
||||
|
||||
# Conversions
|
||||
CONVERT_XML_BORTZMEYER=${PROGRAMS_DIR}/registry2xml
|
||||
CONVERT_XML_ELLERMANN="awk -f ltru2xml.awk "
|
||||
CONVERT_POSTGRESQL=${PROGRAMS_DIR}/registry2postgresql
|
||||
CONVERT_SQLITE=${PROGRAMS_DIR}/registry2sqlite
|
||||
CONVERT_TXT=${PROGRAMS_DIR}/registry2txt
|
||||
CONVERT_HTML=${PROGRAMS_DIR}/registry2mulhtml
|
||||
FILL_DATABASE=./fill-in-database.sh
|
||||
# --force is to avoid spurious warnings about "Ambiguous output"
|
||||
#CRLF_TO_LOCAL="recode --force /CR-LF..US-ASCII "
|
||||
|
||||
trap "rm -f $OUTPUT $TMPDIFF; exit 1" 1 2 3 15
|
||||
trap "rm -f $OUTPUT $TMPDIFF" EXIT
|
||||
|
||||
if [ -e ${LTR_LOCAL} ]; then
|
||||
ltr_date=`head -n 1 ${LTR_LOCAL} | cut -d" " -f2`
|
||||
# Allow time to elapse. The date of the file at IANA is often the day after
|
||||
# the date written in the LSR. Heuristically, we add one day and a few hours.
|
||||
current_date=`date +"%Y%m%d %H:%M:%S" --date="${ltr_date} +1 day +4 hour"`
|
||||
else
|
||||
# Trick to force a downloading
|
||||
current_date="19700101"
|
||||
#current_date=`date --utc +"%Y%m%d"`
|
||||
fi
|
||||
curl --silent --output ${LTR_LOCAL}.TMP \
|
||||
--compressed \
|
||||
--referer ${MYURL} \
|
||||
--proxy "" \
|
||||
--time-cond "${current_date}" \
|
||||
--header "From: ${MAINTAINER}" \
|
||||
${LTR_URL} 2>&1 > ${OUTPUT}
|
||||
if [ $? != 0 ]; then
|
||||
cat ${OUTPUT} | mutt -s "Network error getting ${LTR_URL}" ${MAINTAINER}
|
||||
exit 1
|
||||
fi
|
||||
if [ -e ${LTR_LOCAL}.TMP ]; then
|
||||
#$CRLF_TO_LOCAL ${LTR_LOCAL}.TMP
|
||||
${TEST_PROGRAM} ${LTR_LOCAL}.TMP 2>&1 >> ${OUTPUT}
|
||||
if [ $? = 0 ]; then
|
||||
if [ -e ${LTR_LOCAL} ]; then
|
||||
diff -u ${LTR_LOCAL} ${LTR_LOCAL}.TMP > $TMPDIFF
|
||||
if [ ! -z $TMPDIFF ]; then
|
||||
mutt -s "New LTR registry at ${MYURL}" ${MAINTAINER} < $TMPDIFF
|
||||
fi
|
||||
fi
|
||||
mv ${LTR_LOCAL}.TMP ${LTR_LOCAL}
|
||||
# Now, the various conversions
|
||||
${CONVERT_XML_BORTZMEYER}
|
||||
# trang is in Java and therefore fails frequently
|
||||
# trang -Irnc -Orng ltru.rnc ltru.rng
|
||||
xmllint --noout --relaxng ltru.rng ${LTR_LOCAL}.xml
|
||||
${CONVERT_TXT}
|
||||
#${CONVERT_XML_ELLERMANN} < ${LTR_LOCAL} > ${LTR_LOCAL}2.xml
|
||||
#xmllint --noout --valid ${LTR_LOCAL}2.xml
|
||||
${CONVERT_POSTGRESQL} > lsr-postgres.sql
|
||||
${CONVERT_SQLITE} > lsr-sqlite.sql
|
||||
# TODO: UTF-8 support on SQLite was never tested
|
||||
./utf82ncr.py lsr-sqlite.sql
|
||||
mv lsr-sqlite.sql lsr-sqlite-utf8.sql
|
||||
mv lsr-sqlite-ncr.sql lsr-sqlite.sql
|
||||
${CONVERT_HTML}
|
||||
${FILL_DATABASE}
|
||||
# Needs to be ported away from.DateTime
|
||||
#./lsr2atom.py > lsr.atom
|
||||
version=`head -n 1 ${LTR_LOCAL} | awk '{print $2}'`
|
||||
echo $version > ${LTR_LOCAL}-version
|
||||
exit 0
|
||||
else
|
||||
cat ${OUTPUT} | mutt -s "Invalid registry ${LTR_URL}" ${MAINTAINER}
|
||||
exit 1
|
||||
fi
|
||||
else # File not downloaded, probably because there was nothing new.
|
||||
exit 0
|
||||
fi
|
6
registries/fill-in-database.sh
Executable file
6
registries/fill-in-database.sh
Executable file
@ -0,0 +1,6 @@
|
||||
#!/bin/sh
|
||||
|
||||
DATABASE=lsr
|
||||
|
||||
psql -f clean-postgres.sql ${DATABASE}
|
||||
psql -f lsr-postgres.sql ${DATABASE}
|
103
registries/lsr2atom.py
Executable file
103
registries/lsr2atom.py
Executable file
@ -0,0 +1,103 @@
|
||||
#!/usr/bin/env python3
|
||||
|
||||
__version__ = "BETA"
|
||||
domain = "langtag.net"
|
||||
tag_prefix = "tag:%s,2007-05:LSR" % domain
|
||||
|
||||
import sys
|
||||
import urllib.request, urllib.parse, urllib.error
|
||||
import psycopg2
|
||||
# ElementTree is painful, with all its renamings :-(
|
||||
try:
|
||||
import cElementTree as ET
|
||||
except ImportError:
|
||||
try:
|
||||
import ElementTree as ET
|
||||
except ImportError:
|
||||
# Now a standard part of Python >= 2.5
|
||||
import xml.etree.ElementTree as ET
|
||||
import mx.DateTime as DateTime # TODO move to another package
|
||||
|
||||
max = 10
|
||||
|
||||
db_module = psycopg2
|
||||
|
||||
def process_type(tree, type="language"):
|
||||
request = ("SELECT code,description, added FROM %ss_with_descr" % type) + \
|
||||
" ORDER BY added DESC LIMIT %(max)s"
|
||||
cursor.execute(
|
||||
request,
|
||||
{'max': max})
|
||||
for tuplee in cursor.fetchall():
|
||||
code = tuplee[0]
|
||||
description = tuplee[1]
|
||||
added = tuplee[2]
|
||||
utype = type.capitalize()
|
||||
entry = ET.SubElement(tree, "entry")
|
||||
title = ET.SubElement(entry, "title")
|
||||
title.text = "%s: %s" % (utype, description)
|
||||
entry_id = ET.SubElement(entry, "id")
|
||||
entry_id.text = tag_prefix + "/" + urllib.parse.quote_plus("%s %s" % (type, code))
|
||||
published = ET.SubElement(entry, "published")
|
||||
published.text = added.strftime("%Y-%m-%dT00:00:00Z")
|
||||
# TODO: records in the LSR are sometimes updated but it is not obvious to see it,
|
||||
# since there is only an "Added" field.
|
||||
updated = ET.SubElement(entry, "updated")
|
||||
updated.text = published.text
|
||||
category = ET.SubElement(entry, "category")
|
||||
category.attrib["scheme"] = tag_prefix
|
||||
category.attrib["term"] = type
|
||||
category.attrib["label"] = utype
|
||||
link = ET.SubElement(entry, "link")
|
||||
link.attrib["rel"] = "alternate"
|
||||
link.attrib["href"] = "http://www.%s/registries/registry-html/%s/%s.html" % \
|
||||
(domain, type, code)
|
||||
content = ET.SubElement(entry, "content")
|
||||
content.attrib["type"] = "text"
|
||||
content.text = """
|
||||
%s
|
||||
|
||||
%s
|
||||
|
||||
%s
|
||||
|
||||
Added on %s
|
||||
""" % (type, code, description, added.strftime("%Y-%m-%d"))
|
||||
# TODO: an alternate Content in HTML?
|
||||
|
||||
connection = db_module.connect("dbname=lsr")
|
||||
cursor = connection.cursor()
|
||||
|
||||
feed = ET.Element("feed")
|
||||
feed.attrib["xmlns"] = "http://www.w3.org/2005/Atom"
|
||||
title = ET.SubElement(feed, "title")
|
||||
title.text = "Language Tag Registry syndication feed"
|
||||
updated = ET.SubElement(feed, "updated")
|
||||
updated.text = DateTime.now().strftime("%Y-%m-%dT%H:%M:00Z")
|
||||
link_html = ET.SubElement(feed, "link")
|
||||
link_html.attrib["rel"] = "alternate"
|
||||
link_html.attrib["type"] = "text/html"
|
||||
link_html.attrib["href"] = "http://www.%s/" % domain
|
||||
link_self = ET.SubElement(feed, "link")
|
||||
link_self.attrib["rel"] = "self"
|
||||
link_self.attrib["type"] = "application/atom+xml"
|
||||
link_self.attrib["href"] = "http://www.%s/registries/lsr.atom" % domain
|
||||
author = ET.SubElement(feed, "author")
|
||||
name = ET.SubElement(author, "name")
|
||||
name.text = "Stephane Bortzmeyer"
|
||||
email = ET.SubElement(author, "email")
|
||||
email.text = "webmaster@langtag.net"
|
||||
feed_id = ET.SubElement(feed, "id")
|
||||
feed_id.text = tag_prefix
|
||||
generator = ET.SubElement(feed, "generator")
|
||||
generator.text = "%s %s running with Python %s" % \
|
||||
("lsr2atom", __version__, sys.version.split()[0])
|
||||
|
||||
process_type(feed, "language")
|
||||
process_type(feed, "variant")
|
||||
process_type(feed, "script")
|
||||
process_type(feed, "region")
|
||||
process_type(feed, "extlang")
|
||||
cursor.close()
|
||||
connection.close()
|
||||
print(ET.tostring(feed, encoding="UTF-8"))
|
74
registries/ltru.dtd
Normal file
74
registries/ltru.dtd
Normal file
@ -0,0 +1,74 @@
|
||||
<!--
|
||||
|
||||
Written by Frank Ellermann
|
||||
|
||||
TODO: does not seem consistent with the awk script which produces the XML
|
||||
|
||||
-->
|
||||
|
||||
<!ELEMENT ltru (language*, extlang*, script*, region*, variant*,
|
||||
grandfathered*, redundant*)>
|
||||
<!ATTLIST ltru
|
||||
date NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT language (suppress?, deprecated?, description+, comment*)>
|
||||
<!ATTLIST language
|
||||
date NMTOKEN #REQUIRED
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT extlang (prefix, deprecated?, description+, comment*)>
|
||||
<!ATTLIST extlang
|
||||
date NMTOKEN #REQUIRED
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT script (deprecated?, description+, comment*)>
|
||||
<!ATTLIST script
|
||||
date NMTOKEN #REQUIRED
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT region (deprecated?, description+, comment*)>
|
||||
<!ATTLIST region
|
||||
date NMTOKEN #REQUIRED
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT variant (prefix*, deprecated?, description+, comment*)>
|
||||
<!ATTLIST variant
|
||||
date NMTOKEN #REQUIRED
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT grandfathered (deprecated?, description+, comment*)>
|
||||
<!ATTLIST grandfathered
|
||||
date NMTOKEN #REQUIRED
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT redundant (deprecated?, description+, comment*)>
|
||||
<!ATTLIST redundant
|
||||
date NMTOKEN #REQUIRED
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT suppress EMPTY>
|
||||
<!ATTLIST suppress
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT prefix EMPTY>
|
||||
<!ATTLIST prefix
|
||||
tag NMTOKEN #REQUIRED
|
||||
>
|
||||
|
||||
<!ELEMENT deprecated EMPTY>
|
||||
<!ATTLIST deprecated
|
||||
date NMTOKEN #REQUIRED
|
||||
tag NMTOKEN #IMPLIED
|
||||
>
|
||||
|
||||
<!ELEMENT description (#PCDATA)>
|
||||
<!ELEMENT comment (#PCDATA)>
|
71
registries/ltru.rnc
Normal file
71
registries/ltru.rnc
Normal file
@ -0,0 +1,71 @@
|
||||
# RelaxNG schema for the "language tag" registry specified in RFC 4646
|
||||
# and available at http://www.iana.org/assignments/language-subtag-registry
|
||||
|
||||
# Not standard in any way, just an individual proposal
|
||||
|
||||
# Stephane Bortzmeyer <bortzmeyer@nic.fr>
|
||||
|
||||
# TODO: add Schematron rules for constraints such as "Records that
|
||||
# contain a 'Preferred-Value' field MUST also have a 'Deprecated'
|
||||
# field. " This specific constraint does not really require
|
||||
# Schematron, but others may.
|
||||
|
||||
start = registry
|
||||
|
||||
registry = element registry {date & languages & extlangs & scripts & regions & variants &
|
||||
redundants & grandfathereds} # TODO: extensions
|
||||
|
||||
date = element date {xsd:date}
|
||||
|
||||
languages = language*
|
||||
|
||||
language = element language {subtag & common & scope? & suppress-script? & macrolanguage?}
|
||||
|
||||
extlangs = extlang*
|
||||
|
||||
extlang = element extlang {subtag & common & scope? & macrolanguage?}
|
||||
|
||||
scripts = script*
|
||||
|
||||
script = element script {subtag & common}
|
||||
|
||||
regions = region*
|
||||
|
||||
region = element region {subtag & common}
|
||||
|
||||
variants = variant*
|
||||
|
||||
variant = element variant {subtag & common & prefix*} # "Records of type 'variant'
|
||||
# MAY have more than one field of type" 'Prefix'.
|
||||
|
||||
grandfathereds = grandfathered*
|
||||
|
||||
grandfathered = element grandfathered {tag & common}
|
||||
|
||||
redundants = redundant*
|
||||
|
||||
redundant = element redundant {tag & common}
|
||||
|
||||
common = added & descriptions & deprecated? & preferred-value?
|
||||
|
||||
added = element added {xsd:date}
|
||||
|
||||
suppress-script = element suppress-script {text}
|
||||
|
||||
descriptions = description+ # Each record MUST contain the following fields
|
||||
|
||||
description = element description {text}
|
||||
|
||||
subtag = element subtag {text}
|
||||
|
||||
tag = element tag {text}
|
||||
|
||||
prefix = element prefix {text}
|
||||
|
||||
macrolanguage = element macrolanguage {text}
|
||||
|
||||
deprecated = element deprecated {xsd:date}
|
||||
|
||||
preferred-value = element preferred-value {text}
|
||||
|
||||
scope = element scope {text}
|
330
registries/ltru2xml.awk
Normal file
330
registries/ltru2xml.awk
Normal file
@ -0,0 +1,330 @@
|
||||
#! /usr/common/bin/gawk -f
|
||||
#
|
||||
# Usage: ltru2xml.awk registry > registry.xml
|
||||
# Or : ltru2xml.awk /dev/null > registry.dtd
|
||||
#
|
||||
# The File-Date record is noted in a 'date' attribute of the root
|
||||
# element <LanguageSubtagRegistry>.
|
||||
#
|
||||
# Other records are converted to elements <language>, <extlang>,
|
||||
# <script>, <region>, <variant>, <grandfathered>, or <redundant>.
|
||||
#
|
||||
# The fields Added (required), Deprecated (optional), Description
|
||||
# (one or more), and Comment (optional) are converted to elements
|
||||
# <added>, <deprecated>, <description>, and <comment> in that order.
|
||||
#
|
||||
#### version 0.6 ##################################################
|
||||
#
|
||||
# Multiple descriptions and comments can be separated by an empty
|
||||
# <alt /> element, squeezing them all into a single <description>
|
||||
# or <comment> resp. The cleaner approach is to allow multiple
|
||||
# <description> or <comment> elements. Modify the line MULT = 0
|
||||
# below to MULT = 1 for this style.
|
||||
#
|
||||
# The tags zh-cmn, zh-hakka, and yi-latn are handled as special
|
||||
# cases for RFC 4646 registries, for 4646bis cmn is an ordinary
|
||||
# <extlang> subtag.
|
||||
#
|
||||
# Known issues: If the language subtag review list introduces a
|
||||
# language subtag with 5-8 characters which clashes with a variant
|
||||
# subtag these subtags would get the same xml:id resulting in an
|
||||
# XML syntax error. In practice it's unlikely that there ever
|
||||
# will be any IANA language subtag not derived from ISO 936, let
|
||||
# alone using the same string also for a variant subtag.
|
||||
#
|
||||
# "XML Notepad 2007", an experimental Microsoft tool, does not yet
|
||||
# support xml:id attributes without an explicit declaration of the
|
||||
# xml namespace. The W3C validator accepts xml:id without explicit
|
||||
# namespace declaration.
|
||||
#
|
||||
#### version 0.7 ##################################################
|
||||
#
|
||||
# The Internet drafts 4646bis-08 and 4645bis-02 introduced a new
|
||||
# optional field "Macrolanguage" for language and extlang subtags.
|
||||
#
|
||||
# Frank Ellermann, 2007
|
||||
#
|
||||
#### remove leading and trailing spaces ###########################
|
||||
function STRIP( STR )
|
||||
{ sub( /^[\t ]+/, "", STR )
|
||||
sub( /[\t ]+$/, "", STR )
|
||||
return STR
|
||||
}
|
||||
#### add underscore to subtags starting with a digit ##############
|
||||
function XMLID( STR )
|
||||
{ sub( /^[0-9]/, "_&", STR )
|
||||
return STR
|
||||
}
|
||||
#### convert tag to IDREFS (subtags) ##############################
|
||||
function IDREF( STR )
|
||||
{ N = split( STR, REF, "-" )
|
||||
STR = ""
|
||||
for ( I = 1; I <= N; ++I )
|
||||
STR = STR " " XMLID( REF[ I ] )
|
||||
return substr( STR, 2 )
|
||||
}
|
||||
#### escape less-than (and greater-than) characters ###############
|
||||
function CANON( STR )
|
||||
{ gsub( /</, "<", STR )
|
||||
gsub( />/, ">", STR )
|
||||
return STR
|
||||
}
|
||||
#### error ########################################################
|
||||
function FATAL( STR )
|
||||
{ print "error near line " NR ": " STR
|
||||
OKAY = 0 ; return 1
|
||||
}
|
||||
#### save unfolded field body #####################################
|
||||
function FIELD()
|
||||
{ if ( NAME == "description" ) D[ ++DD ] = BODY
|
||||
else if ( NAME == "comments" ) C[ ++CC ] = BODY
|
||||
else if ( NAME == "prefix" ) P[ ++PP ] = BODY
|
||||
else if ( F[ NAME ] == "" ) F[ NAME ] = BODY
|
||||
else exit FATAL( NAME ": " BODY )
|
||||
return
|
||||
}
|
||||
#### output record elements #######################################
|
||||
function READY()
|
||||
{ T = tolower( F[ "type" ] )
|
||||
if ( T == "" ) exit FATAL( "missing type" )
|
||||
L = "<" T
|
||||
|
||||
S = F[ "subtag" ]
|
||||
if ( S == "" )
|
||||
{ S = F[ "tag" ]
|
||||
if ( S == "" ) exit FATAL( "missing tag" )
|
||||
if ( T != "redundant" ) print L ">"
|
||||
else if ( S == "yi-latn" )
|
||||
print L " subtags='yi Latn'>"
|
||||
else print L " subtags='" IDREF( S ) "'>"
|
||||
B = "\t<tag> " S " </tag>"
|
||||
HACK = HACK && ( S != "hak" )
|
||||
}
|
||||
else if ( F[ "tag" ] == "" )
|
||||
{ print L " xml:id='" XMLID( S ) "'>"
|
||||
B = "\t<subtag> " S " </subtag>"
|
||||
}
|
||||
else exit FATAL( "conflicting subtag " S )
|
||||
|
||||
S = F[ "suppress-script" ]
|
||||
if ( S != "" && T == "language" )
|
||||
{ L = "\t<suppress script='" S "'> "
|
||||
print L S " </suppress>"
|
||||
}
|
||||
else if ( S != "" )
|
||||
exit FATAL( "unexpected Suppress-Script" S )
|
||||
|
||||
if ( T == "extlang" && PP != 1 )
|
||||
exit FATAL( "missing or extraneous prefix" )
|
||||
while ( PP )
|
||||
{ L = "\t<prefix subtags='" IDREF( P[ PP ] ) "'> "
|
||||
print L P[ PP-- ] " </prefix>"
|
||||
}
|
||||
# modified:
|
||||
S = F[ "macrolanguage" ]
|
||||
if ( S != "" && ( T == "language" || T == "extlang" ))
|
||||
{ L = "\t<macro language='" S "'> "
|
||||
print L S " </macro>"
|
||||
}
|
||||
else if ( S != "" )
|
||||
exit FATAL( "unexpected Macrolanguage" S )
|
||||
|
||||
A = F[ "added" ]
|
||||
if ( A == "" ) exit FATAL( "missing date" )
|
||||
print B "<added> " A " </added>"
|
||||
|
||||
S = F[ "preferred-value" ]
|
||||
A = F[ "deprecated" ]
|
||||
if ( A != "" )
|
||||
{ L = "\t"
|
||||
if ( S != "" )
|
||||
{ L = L "<preferred"
|
||||
H = S == "zh-cmn" || S == "zh-hakka"
|
||||
if ( HACK && H )
|
||||
L = L " subtags='zh'> "
|
||||
else L = L " subtags='" IDREF( S ) "'> "
|
||||
L = L S " </preferred>"
|
||||
}
|
||||
print L "<deprecated> " A " </deprecated>"
|
||||
}
|
||||
else if ( S != "" )
|
||||
exit FATAL( "missing deprecated" )
|
||||
|
||||
if ( DD )
|
||||
{ L = "\t<description> "
|
||||
while ( DD )
|
||||
{ L = L CANON( D[ DD-- ] )
|
||||
if ( DD )
|
||||
{ if ( MULT )
|
||||
{ print L " </description> "
|
||||
L = "\t<description>"
|
||||
}
|
||||
else
|
||||
{ print L
|
||||
L = "\t<alt /> "
|
||||
} } }
|
||||
print L " </description>"
|
||||
}
|
||||
else exit FATAL( "missing description" )
|
||||
|
||||
if ( CC )
|
||||
{ L = "\t<comment> "
|
||||
while ( CC )
|
||||
{ L = L CANON( C[ CC-- ] )
|
||||
if ( CC )
|
||||
{ if ( MULT )
|
||||
{ print L " </comment> "
|
||||
L = "\t<comment>"
|
||||
}
|
||||
else
|
||||
{ print L
|
||||
L = "\t<alt /> "
|
||||
} } }
|
||||
print L " </comment>"
|
||||
}
|
||||
|
||||
print "</" T ">"
|
||||
return
|
||||
}
|
||||
#### output DOCTYPE ###############################################
|
||||
BEGIN { ROOT = "LanguageSubtagRegistry"
|
||||
HACK = 1
|
||||
OKAY = 0
|
||||
|
||||
VERS = "ltru2xml/0.7"
|
||||
MULT = 1
|
||||
if ( ! MULT ) VERS = VERS "alt"
|
||||
|
||||
L = "<?xml version=\"1.0\" "
|
||||
L = L "encoding=\"UTF-8\" "
|
||||
print L "standalone=\"yes\" ?>"
|
||||
print "<!DOCTYPE " ROOT " ["
|
||||
|
||||
A = "\tdate NMTOKEN #REQUIRED"
|
||||
S = "grandfathered*, redundant*"
|
||||
print "<!ELEMENT " ROOT " (language*, extlang*,"
|
||||
print "\tscript*, region*, variant*, " S ")>"
|
||||
print "<!ATTLIST " ROOT
|
||||
print A ">"
|
||||
|
||||
A = "\txml:id ID #REQUIRED"
|
||||
S = "added, (preferred?, deprecated)?,"
|
||||
if ( MULT )
|
||||
S = S " description+, comment*"
|
||||
else S = S " description, comment?"
|
||||
# modified:
|
||||
L = "macro?, subtag,"
|
||||
print "<!ELEMENT language (suppress?, " L
|
||||
print "\t" S ")>"
|
||||
print "<!ATTLIST language"
|
||||
print A ">"
|
||||
|
||||
print "<!ELEMENT extlang (prefix, " L
|
||||
print "\t" S ")>"
|
||||
print "<!ATTLIST extlang"
|
||||
print A ">"
|
||||
|
||||
print "<!ELEMENT script (subtag,"
|
||||
print "\t" S ")>"
|
||||
print "<!ATTLIST script"
|
||||
print A ">"
|
||||
|
||||
print "<!ELEMENT region (subtag,"
|
||||
print "\t" S ")>"
|
||||
print "<!ATTLIST region"
|
||||
print A ">"
|
||||
|
||||
print "<!ELEMENT variant (prefix*, subtag,"
|
||||
print "\t" S ")>"
|
||||
print "<!ATTLIST variant"
|
||||
print A ">"
|
||||
|
||||
A = "\tsubtags IDREFS #REQUIRED"
|
||||
print "<!ELEMENT grandfathered (tag,"
|
||||
print "\t" S ")>"
|
||||
print "<!ELEMENT redundant (tag,"
|
||||
print "\t" S ")>"
|
||||
print "<!ATTLIST redundant"
|
||||
print A ">"
|
||||
|
||||
L = "<!-- a script subtag -->"
|
||||
S = "\tscript IDREF #REQUIRED"
|
||||
print "<!ELEMENT suppress (#PCDATA)>" L
|
||||
print "<!ATTLIST suppress"
|
||||
print S ">"
|
||||
# modified:
|
||||
L = "<!-- a macrolang subtag -->"
|
||||
S = "\tlanguage IDREF #REQUIRED"
|
||||
print "<!ELEMENT macro (#PCDATA)>" L
|
||||
print "<!ATTLIST macro"
|
||||
print S ">"
|
||||
|
||||
L = "<!-- a prefix tag -->"
|
||||
print "<!ELEMENT prefix (#PCDATA)>" L
|
||||
print "<!ATTLIST prefix"
|
||||
print A ">"
|
||||
|
||||
L = "<!-- a date -->"
|
||||
print "<!ELEMENT added (#PCDATA)>" L
|
||||
print "<!ELEMENT deprecated (#PCDATA)>" L
|
||||
|
||||
L = "<!-- a (sub)tag -->"
|
||||
print "<!ELEMENT preferred (#PCDATA)>" L
|
||||
print "<!ATTLIST preferred"
|
||||
print A ">"
|
||||
|
||||
print "<!ELEMENT subtag (#PCDATA)>" L
|
||||
print "<!ELEMENT tag (#PCDATA)>" L
|
||||
|
||||
if ( MULT )
|
||||
L = "(#PCDATA)><!-- text -->"
|
||||
else L = "(#PCDATA | alt)*>"
|
||||
print "<!ELEMENT description " L
|
||||
print "<!ELEMENT comment " L
|
||||
if ( ! MULT )
|
||||
{ L = "<!-- separator -->"
|
||||
print "<!ELEMENT alt EMPTY> " L
|
||||
}
|
||||
|
||||
print "]>"
|
||||
print
|
||||
}
|
||||
|
||||
#### record separator #############################################
|
||||
/^\%\%$/ { FIELD()
|
||||
if ( NN++ ) READY()
|
||||
else
|
||||
{ A = F[ "file-date" ]
|
||||
if ( A == "" )
|
||||
exit FATAL( "missing File-Date" )
|
||||
print "<" ROOT " date='" A "'>"
|
||||
OKAY = 1
|
||||
}
|
||||
|
||||
for ( NAME in F ) delete F[ NAME ]
|
||||
NAME = "" ; CC = 0 ; PP = 0
|
||||
BODY = "" ; DD = 0 ; next
|
||||
}
|
||||
#### start of new field ###########################################
|
||||
/^[A-Za-z0-9]/ { FIELD()
|
||||
if ( ! match( $0, ":" )) exit FATAL( $0 )
|
||||
NAME = tolower( substr( $0, 1, RSTART - 1 ))
|
||||
BODY = STRIP( substr( $0, RSTART + 1 ))
|
||||
next
|
||||
}
|
||||
#### unfold field body ############################################
|
||||
/^[\t ]/ { BODY = BODY " " STRIP( $0 )
|
||||
next
|
||||
}
|
||||
#### garbage ######################################################
|
||||
{ exit FATAL( $0 )
|
||||
}
|
||||
###################################################################
|
||||
END { if ( NN++ )
|
||||
{ FIELD() ; READY()
|
||||
print "</" ROOT ">"
|
||||
}
|
||||
else OKAY = 0
|
||||
print "<!-- " VERS " (" --NN " records) -->"
|
||||
exit 1 - OKAY
|
||||
}
|
33
registries/utf82ncr.py
Executable file
33
registries/utf82ncr.py
Executable file
@ -0,0 +1,33 @@
|
||||
#!/usr/bin/env python3
|
||||
|
||||
""" Converts an UTF-8 text file to an ASCII file with hexadecimal
|
||||
Numeric Character References (like œ). """
|
||||
|
||||
import sys
|
||||
import re
|
||||
|
||||
extension = re.compile("^(.*)\.([a-z0-9_-]+)$", re.IGNORECASE)
|
||||
|
||||
def convert(thematch):
|
||||
codepoint = int(thematch.group(1), 16)
|
||||
return chr(codepoint)
|
||||
|
||||
for ifilename in sys.argv[1:]:
|
||||
print("Converting %s..." % ifilename)
|
||||
match = extension.search (ifilename)
|
||||
if match:
|
||||
ext_ifile = match.group(2)
|
||||
ofilename = match.group(1) + "-ncr." + ext_ifile
|
||||
else:
|
||||
ofilename = ifilename + "-ncr"
|
||||
ifile = open(ifilename, "r")
|
||||
ofile = open(ofilename, "w")
|
||||
data = ifile.read()
|
||||
for ch in data:
|
||||
if ord(ch) > 127:
|
||||
ch = "&#x%x;" % ord(ch)
|
||||
ofile.write(ch)
|
||||
ifile.close()
|
||||
ofile.close()
|
||||
|
||||
|
59
tag-wisely.xml
Normal file
59
tag-wisely.xml
Normal file
@ -0,0 +1,59 @@
|
||||
<page title="Tag wisely">
|
||||
<p>(Most of the content on this page comes directly from
|
||||
<wikipedia name="Request for Comments">RFC</wikipedia> 5646.)</p>
|
||||
|
||||
<p>For the same body of text, you may have several possible
|
||||
tags. Interoperability is best served when all users use the same
|
||||
language tag for the same language. The rules here are intended to
|
||||
help in that respect.</p>
|
||||
|
||||
<p>Subtags should only be used where they add useful distinguishing
|
||||
information; extraneous subtags interfere with the meaning,
|
||||
understanding, and processing of language tags. In particular, fields
|
||||
<code>Suppress-Script</code> in the registry should be obeyed: for
|
||||
instance, <code>fr</code> (<wikipedia name="French language">French</wikipedia>) has a
|
||||
<code>Suppress-Script: Latn</code> because the overwhelming majority
|
||||
of French texts are in the <wikipedia>Latin script</wikipedia>. Therefore, tagging text in French as
|
||||
<code>fr-Latn</code> is useless and confusing. A simple
|
||||
<code>fr</code> is enough. In the unlikely case that you meet French
|
||||
texts in the <wikipedia>Arabic script</wikipedia>, then you can add a subtag for the script:
|
||||
<code>fr-Arab</code>. (This is specially important since the former
|
||||
standard, in RFC 3066, did not have subtags for scripts and therefore
|
||||
old applications will have problems to handle them.)</p>
|
||||
|
||||
<p>Use as precise a tag as possible, but no more specific than is
|
||||
justified. Avoid using subtags that are not important for
|
||||
distinguishing content in an application. For example, <code>de</code>
|
||||
might suffice for tagging an email written in
|
||||
<wikipedia name="German language">German</wikipedia>, while <code>de-CH-1996</code>, while
|
||||
legal,is probably unnecessarily precise for such a task.</p>
|
||||
|
||||
<p>But do not be too vague: the primary language subtag might not be
|
||||
sufficient to give all the information necessary to understand the
|
||||
text. For
|
||||
example, the tag <code>az</code> (for
|
||||
<wikipedia name="Azerbaijani language">Azerbaidjani</wikipedia>) is probably insufficient in the
|
||||
absence of context, because this language has no dominant script. A person fluent in
|
||||
one script might not be able to read the other, even though the text
|
||||
might be identical. Content tagged as <code>az</code> most probably is written
|
||||
in just one script and thus might not be intelligible to a reader
|
||||
familiar with the other script. <code>az-Latn</code>,
|
||||
<code>az-Cyrl</code> or <code>az-Arab</code> are probably necessary.</p>
|
||||
|
||||
<p>If a tag or subtag has a <code>Preferred-Value</code> field in its registry
|
||||
entry, then the value of that field should be used to form the
|
||||
language tag. For example, use <code>he</code> for <wikipedia
|
||||
name="Hebrew language">Hebrew</wikipedia> in preference to
|
||||
<code>iw</code>.</p>
|
||||
|
||||
<p>Validity of a tag is not everything. A tag may be both valid and
|
||||
meaningless. This is unavoidable with a generative system like the
|
||||
language subtag mechanism. So, <code>ar-Cyrl-AQ</code>
|
||||
(<wikipedia>Arabic</wikipedia> written with the <wikipedia name="Cyrillic alphabet">cyrillic
|
||||
script</wikipedia>, as used in <wikipedia>Antarctica</wikipedia>) is
|
||||
perfectly valid but should nevertheless be avoided because it has no
|
||||
relationship with the reality (there is not a single document with
|
||||
these characteristics).</p>
|
||||
|
||||
</page>
|
||||
|
24
test-suites.xml
Normal file
24
test-suites.xml
Normal file
@ -0,0 +1,24 @@
|
||||
<page title="Test suites for language tag software">
|
||||
|
||||
<ul>
|
||||
<li>Stéphane Bortzmeyer's test suites. The format is "tag
|
||||
whitespace optional-comment":
|
||||
<ul>
|
||||
<li><a href="test-suites/well-formed-tags.txt">Well-formed tags</a>
|
||||
(they may be invalid)</li>
|
||||
<li><a href="test-suites/broken-tags.txt">Not well-formed tags</a></li>
|
||||
<li><a href="test-suites/valid-tags.txt">Valid tags</a> (for the
|
||||
2006-11-15 registry)</li>
|
||||
<li><a href="test-suites/invalid-tags.txt">Invalid tags</a> (for the
|
||||
2006-11-15 registry, they may be valid now, the current registry is <lsr-version/>!)</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li><a href="http://unicode.org/cldr/data/tools/java/org/unicode/cldr/util/data/langtagTest.txt">ICU test suite</a></li>
|
||||
<li>From the <wikipedia name="Formal grammar">grammar</wikipedia> in
|
||||
the RFC, you can use programs like <a href="http://www.quut.com/abnfgen/">abnfgen</a> or <a
|
||||
href="http://www.bortzmeyer.org/eustathius-test-grammars.html">eusthathius</a> to generate tags for testing a
|
||||
parser. <em>Warning</em>: these tags will follow the grammar but may
|
||||
not be well-formed, since some rules are not in the grammar (for
|
||||
instance the rule that no two singletons must be identical).</li>
|
||||
</ul>
|
||||
</page>
|
18
web-site.xml
Normal file
18
web-site.xml
Normal file
@ -0,0 +1,18 @@
|
||||
<page title="Management of this Web site">
|
||||
<p>The manager, <code><a
|
||||
href="mailto:webmaster@langtag.net">webmaster@langtag.net</a></code>
|
||||
is always glad to receive bug reports, fixes and improvments.</p>
|
||||
<p>The ideal way to send <wikipedia name="Patch
|
||||
(computing)">patches</wikipedia> is to retrieve the source and work on
|
||||
them.</p>
|
||||
<p>You can retrieve the source of the pages for this Web site with
|
||||
<wikipedia name="Subversion (software)">Subversion</wikipedia> at URL
|
||||
<code>https://svn.langtag.net/langtag/</code>. These sources are in
|
||||
<wikipedia>XML</wikipedia>, using a small superset of
|
||||
<wikipedia>XHTML</wikipedia> in a <code><page></code> element.</p>
|
||||
<p>The recommended method
|
||||
to request a change to the Web site is to send the <wikipedia
|
||||
name="Patch (computing)">patches</wikipedia> in <code>diff -u</code>
|
||||
format. If you cannot, send the modified XML page, and please try to
|
||||
convince your XML editor to do as little reformatting as possible.</p>
|
||||
</page>
|
29
whatare.xml
Normal file
29
whatare.xml
Normal file
@ -0,0 +1,29 @@
|
||||
<page title="What are they?">
|
||||
<p><em>Language tags</em> are a way to <em>tag</em> digital resources
|
||||
to indicate in what <wikipedia name="language">human
|
||||
language</wikipedia> they are. They are also used by software to tell
|
||||
an user's preference about languages.</p>
|
||||
<p>They can express the language itself but also the writing system,
|
||||
the national variant and many other things.</p>
|
||||
<p>A few examples of language tags:</p>
|
||||
<ul>
|
||||
<li><code>fr</code>: <wikipedia name="French language">French</wikipedia> language,</li>
|
||||
<li><code>en-AU</code>: <wikipedia name="English language">English</wikipedia> language, as
|
||||
written and spoken in <wikipedia>Australia</wikipedia>,</li>
|
||||
<li><code>az-Latn-IR</code>, <wikipedia name="Azerbaijani language">Azeri</wikipedia> language,
|
||||
written in the <wikipedia name="Latin alphabet">Latin</wikipedia> script, as used in <wikipedia>Iran</wikipedia>.</li>
|
||||
</ul>
|
||||
<p>They are specified in <wikipedia>IETF</wikipedia>
|
||||
<wikipedia name="Request for Comments">RFC</wikipedia> <em>5646</em> (<a
|
||||
href="http://www.rfc-editor.org/rfc/rfc5646.txt">available
|
||||
online</a>).</p>
|
||||
<p>Language tags are made of <em>subtags</em> separated by
|
||||
hyphens. The list of possible subtags is mostly directly copied from
|
||||
various <wikipedia>ISO</wikipedia> standards such as <wikipedia name="ISO
|
||||
639">ISO 639</wikipedia>.</p>
|
||||
<p>They are used in many formats and protocols for instance in
|
||||
<wikipedia>XML</wikipedia> (through the <code>xml:lang</code>
|
||||
attribute) and in <wikipedia>HTTP</wikipedia> (the browser can
|
||||
indicate to the <wikipedia>Web</wikipedia> server what language the
|
||||
user prefers, should the Web server have several versions).</p>
|
||||
</page>
|
67
why-tagging.xml
Normal file
67
why-tagging.xml
Normal file
@ -0,0 +1,67 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<page title="Why tagging?">
|
||||
<p>Executive summary: tagging your digital resources to indicate in
|
||||
what <wikipedia>language</wikipedia> they are allow</p>
|
||||
<ol>
|
||||
<li>Proper rendition,</li>
|
||||
<li>Correct behaviour of some software,</li>
|
||||
<li>Choice of the right tools,</li>
|
||||
<li>Correct filtering.</li>
|
||||
</ol>
|
||||
<h2>What is tagging</h2>
|
||||
<p>Tagging is the process of giving <a href="whatare.html">language
|
||||
tags</a> to a digital resource. For instance, in legacy
|
||||
<wikipedia>HTML</wikipedia>, it is done with:</p>
|
||||
<pre>
|
||||
<![CDATA[
|
||||
<html lang="ar">
|
||||
<!-- Text in Arabic -->
|
||||
]]>
|
||||
</pre>
|
||||
<p>and in <wikipedia>XML</wikipedia> with the <code>xml:lang</code>
|
||||
special attribute:</p>
|
||||
<pre>
|
||||
<![CDATA[
|
||||
<book xml:lang="uk">
|
||||
<!-- Text in Ukrainian -->
|
||||
]]>
|
||||
</pre>
|
||||
<h2>What is tagging for?</h2>
|
||||
<p>The purpose of tagging is to give <em>unambiguous</em> information
|
||||
to the software processes that will handle the resource. For instance,
|
||||
properly rendering the content on the screen requires to know the language it is
|
||||
written in. Actual <wikipedia>typography</wikipedia> rules are different for each
|
||||
language, language-independant rendition can only be an
|
||||
approximation. In the same way, knowing the language used is
|
||||
necessary for <wikipedia>speech synthesis</wikipedia>.</p>
|
||||
<p>Some programs may need the language to know what to do with
|
||||
requests like <wikipedia>CSS</wikipedia>' "first-letter"
|
||||
pseudo-property. The first letter of <wikipedia>Llobregat</wikipedia>
|
||||
is 'l' in <wikipedia name="English language">English</wikipedia> but
|
||||
'll' in <wikipedia name="Spanish language">Spanish</wikipedia>.</p>
|
||||
<p>Tools like <wikipedia name="Spell checker">spell checkers</wikipedia> or an online dictionary must also be
|
||||
choosen depending on the language used.</p>
|
||||
<p>Language tagging also allow filters to keep only some documents,
|
||||
those written in a language that the user understands. At the present
|
||||
time, most <wikipedia>search engines</wikipedia>, like
|
||||
<wikipedia>Google</wikipedia>, use <a
|
||||
href="http://www.macchiato.com/slides/unicode_at_google.ppt">heuristics</a>
|
||||
to find out the language of a Web page. While it works fine to tell
|
||||
apart <wikipedia name="German language">German</wikipedia> from
|
||||
<wikipedia name="Japanese language">Japanese</wikipedia>, it is much
|
||||
more difficult with close languages like <wikipedia name="Danish
|
||||
language">Danish</wikipedia> and <wikipedia name="Norwegian
|
||||
language">Norwegian</wikipedia>, specially if the text is short.</p>
|
||||
<h2>Current situation</h2>
|
||||
<p>At the present day, we are a bit stuck in a
|
||||
<wikipedia name="The chicken or the egg">chicken-and-egg</wikipedia> problem: many applications
|
||||
(like the search engines mentioned before) do not use the language
|
||||
information because it is not present or unreliable. Therefore,
|
||||
webmasters and other document maintainers are not eager of tagging
|
||||
because it brings no short-term benefits. Things are becoming better
|
||||
but certainly too slowly.</p>
|
||||
<h2>More readings</h2>
|
||||
<ul>
|
||||
<li><a href="http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.091505539">Why specify language?</a> by the W3C</li>
|
||||
</ul>
|
||||
</page>
|
Loading…
Reference in New Issue
Block a user