User:Eden2004/Transcription


The problem:

Many unnecessary conflicts (verbal disputes, edit wars, rename wars) and edits are due to foreign-script transcription matters.

As an admin on fr.wikipedia working mainly on Japanese topics, I see every day several interwiki links being changed because an article was renamed with "ō", then with "ou", then with just "o", and so on. Article content changes as well, depending on who is editing. Even on fr.wikipedia, where the problem was settled by an encyclopedia-wide survey leading to a rule, there are still fights about this, and the encyclopedia's content is now split between "ō" (the official form) and "ô" (and even quite a few plain "o"). Even among the admins, the rule is not respected by all. I suppose it is the same for other non-Latin scripts such as Arabic, Korean, Chinese (although pinyin is well recognized now, I think) or Russian.

Additionally, the transcription convention may well be modified again in the future (I think it is probable: ISO adopted Kunrei-shiki 15 years ago, but apart from Japan the countries themselves have not yet followed; the transliteration convention also changed 5 years ago...). As things stand, each such change would mean a massive amount of difficult edits, modifying tens of thousands of pages for this sole issue.


Idea:

This goes further with the concept of the widely deployed WikiHiero extension. It is also slightly in the spirit of the Multilang extension, and offers an additional service in the spirit of the Polyglot extension.

Editors give the software everything needed to compute the transcriptions, and the reader simply chooses the one he wants (or sees the default one; the default alone allows changing every item at once by modifying just the configuration). The extension would itself be able to use add-ons, to customize the set of supported languages.
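
As a rough illustration of that pluggable design (a Python sketch with invented names, purely to fix ideas; the real thing would be a MediaWiki extension written in PHP):

  # Hypothetical sketch: add-ons register one function per (language, scheme).
  TRANSCRIBERS = {}

  def register(lang, scheme):
      def decorator(func):
          TRANSCRIBERS.setdefault(lang, {})[scheme] = func
          return func
      return decorator

  @register("ja", "hepburn_macron")
  def ja_hepburn_macron(text):
      # Conversion rules omitted here; see the worked example under Proposal 1.
      return text

  def transcribe_all(lang, text):
      """One-to-many: produce every registered transcription of the input."""
      return {scheme: func(text) for scheme, func in TRANSCRIBERS.get(lang, {}).items()}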

A good side effect is avoiding some transcription mistakes.

Contrary to Multilang, the different versions would here be generated automatically (the only way to get this used: there is too much content and there are too many languages).

Proposal 1

We use a template that generates markup like <language name="ja">My Japanese Text</language>. The MediaWiki extension converts it to all the transcriptions registered for this language (i.e. known by the system), with one span per type. In MediaWiki:Monobook.css, all of them are hidden except the default one. In your user monobook.css (or, even better, through the Preferences panel), you can override this to see the transcription you prefer.

The text to transcribe can only be specified in one form per language (it is a one-to-many process).


Example:

{{jap|watasi 'ha Kousuke}} expands to <language name="ja">watasi 'ha Kousuke</language> (here, as an example, the text is in annotated 99-shiki transliteration), which generates:

<span class="automatic_japanese_hepburn_macron" lang="ja" title="Japanese text in Hepburn transcription">watashi wa Kōsuke</span>
<span class="automatic_japanese_hepburn_circ" lang="ja" title="Japanese text in Hepburn transcription">watashi wa Kôsuke</span>
<span class="automatic_japanese_transliteration" lang="ja" title="Transliterated japanese text">watashi ha Kousuke</span>
<span class="automatic_japanese_kunrei" lang="ja" title="Japanese text in Kunrei transcription">watasi wa Kôsuke</span>

Then, MediaWiki:Monobook.css would contain ".automatic_japanese_hepburn_circ, .automatic_japanese_transliteration, .automatic_japanese_kunrei { display:none; }" (class selectors, since the spans carry class attributes).
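
The span-emission step of Proposal 1 is small enough to sketch (in Python, for illustration only; the class names and title attributes follow the example above, everything else is invented):

  import html

  TITLES = {
      "hepburn_macron": "Japanese text in Hepburn transcription",
      "hepburn_circ": "Japanese text in Hepburn transcription",
      "transliteration": "Transliterated Japanese text",
      "kunrei": "Japanese text in Kunrei transcription",
  }

  def render_spans(transcriptions):
      """Emit one <span> per transcription; site CSS hides all but the default."""
      return "\n".join(
          '<span class="automatic_japanese_%s" lang="ja" title="%s">%s</span>'
          % (scheme, TITLES[scheme], html.escape(text))
          for scheme, text in transcriptions.items()
      )

  # render_spans({"hepburn_macron": "watashi wa Kōsuke", ...}) yields the
  # markup shown above.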


Proposal 2

We still use a template that generates markup like <language name="ja">My Japanese Text</language>, but this time we use something similar to the "variant" parameter of the :zh version. With it, we specify the transcription we want for each specific language we care about (the others fall back to the user default or the wiki default), and the page contains only that transcription; no CSS modification is needed. In the HTML page, links forward the parameter as they received it, so as to keep consistency between pages (i.e. after viewing a page in Hepburn, clicking a link to a page containing some transcribed Japanese will display it in Hepburn too, even if that is not the default); a sketch of this forwarding follows the list below.

  • The article title can also be visually altered (this is already done on the French Wikipedia for some articles such as iMac, with the help of some JavaScript) to match the current transcription selection.
  • Making a link to an article with a transcribed title would become something like [[Kōbe|{{jap|Koube}}]] (or a template {{jlink}}). Either we choose a convention for article names (less of a problem, since we can alter their displayed form, and it makes jlink shorter because the page name can be computed), or we keep creating redirects systematically (as we currently do).
  • Searching with the internal engine may become difficult. You would not find any "Masaaki" (besides the page with that title, if we use redirects), only the cryptic "Masa'aki". Since Google indexes the final page, if we generate all transcriptions it will be able to find every real occurrence in the encyclopedia.
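
To make the parameter forwarding concrete, here is a speculative sketch of how internal links could carry the reader's current choices (the parameter naming and URL layout are invented here; zh.wikipedia uses a single "variant" parameter):

  from urllib.parse import urlencode

  def internal_link(title, current_variants):
      """Build an internal link that forwards the reader's transcription
      choices, e.g. {'ja': 'hepburn_macron'}, so the next page stays consistent."""
      params = {"title": title}
      for lang, scheme in sorted(current_variants.items()):
          params["variant_" + lang] = scheme
      return "/w/index.php?" + urlencode(params)

  # internal_link("Kōbe", {"ja": "hepburn_macron"})
  # -> '/w/index.php?title=K%C5%8Dbe&variant_ja=hepburn_macron'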


Side-by-side comparison

CPU cost
  • Proposal 1: generating all transcriptions takes some time; only fast algorithms should probably be allowed.
  • Proposal 2: generating a single transcription is very fast.

Coding
  • Proposal 1: a small extension with add-ons.
  • Proposal 2: a small extension with add-ons, plus some work on "variant" or an equivalent mechanism.

Search
  • Proposal 1: Google should index all transcriptions, because I believe it does not care about CSS (but it may present all the alternatives side by side, which is not pretty).
  • Proposal 2: Google will index them all (with a nice presentation) if it can find a link to every version. We can add such links under "search" or under the article to give access to the different versions.

Accessibility
  • Proposal 1: for blind readers, there is no 100%-sure method to stop a screen reader from reading all the transcriptions, but most readers should be covered right now.
  • Proposal 2: no problem for blind readers, but modifying the full address is less easy (it contains not only the title but also index.php and parameters). This mostly affects power users, so it may not be a huge problem.

Cache impact
  • Proposal 1: none.
  • Proposal 2: there are several versions of each page, and users with several non-default language settings require a distinct page for nothing (example: a page contains only some Japanese, but the user also has an option set for Russian). To optimize, the cache system should ignore the Russian parameter and fetch from the cache the page corresponding to the Japanese parameter alone, but I don't think this is possible (see the sketch below the table, though).
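
If the parser recorded which languages actually appear on a page, the optimization above might become possible after all. A speculative sketch of such a cache key:

  def cache_key(title, page_languages, user_variants):
      """Only variant choices for languages present on the page count,
      so an unused Russian preference does not fragment the cache."""
      relevant = {lang: scheme for lang, scheme in user_variants.items()
                  if lang in page_languages}
      return (title, tuple(sorted(relevant.items())))

  # A Japanese-only page gives the same key for both of these readers:
  # cache_key("Kōbe", {"ja"}, {"ja": "hepburn_macron"})
  # cache_key("Kōbe", {"ja"}, {"ja": "hepburn_macron", "ru": "iso9"})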


Proposal 2 expanded

We can imagine that the language element accepts an additional parameter "to", enforcing the use of one precise conversion rather than the reader's choice. This secures the production of a correct transcription when one specific form is expected. Example: "<language name="ja">Koube</language>, often written <language name="ja" to="ja-Hepburn-no-diacritic">Koube</language> in newspapers", where the second occurrence always renders without diacritics, whatever the reader's preference.
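
The resolution order this implies is small enough to sketch directly (names invented for illustration):

  SITE_DEFAULT = {"ja": "hepburn_macron"}

  def target_scheme(lang, user_pref=None, to=None):
      """A forced "to" wins over the reader's preference and the wiki default."""
      return to or user_pref or SITE_DEFAULT[lang]

  # target_scheme("ja", user_pref="kunrei")                   -> 'kunrei'
  # target_scheme("ja", "kunrei", to="hepburn_no_diacritic")  -> 'hepburn_no_diacritic'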


Proposal 3

A third proposal could be to use JavaScript on the client side, but this would have major drawbacks (no search possible, and the page would first appear in a cryptic form before being "transcribed"...).


Language specification
The Japanese example: to be able to generate all the transcriptions, a marker is needed. The quote is a marker indicating that "ha" (or "he") is the special particle, written "ha" or "wa" ("he" or "e") depending on the method used. In the same way, Masa'aki becomes Masaaki, the quote being a marker that prevents the generation of a Japanese long vowel ('ā', 'â' or plain 'a' in this case). Finally, KIISU (in capital letters) would lead to kīsu, because ī is legal for katakana but not for usual hiragana, so we need to distinguish the two (this is a usual convention in books about Japan anyway).
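
To show that these markers are enough to drive the conversion, here is a toy converter for the macron-Hepburn output (heavily simplified, with only one consonant rule included; a real converter needs the full kana correspondence table):

  import re

  def hepburn_macron(text):
      # 1. 'ha / 'he mark the particles, read wa / e in Hepburn.
      text = re.sub(r"'ha\b", "wa", text)
      text = re.sub(r"'he\b", "e", text)
      # 2. Capitalised words are katakana: any doubled vowel is long,
      #    so KIISU -> kīsu (ī exists in katakana but not in usual hiragana).
      pairs = (("aa", "ā"), ("ii", "ī"), ("uu", "ū"), ("ee", "ē"), ("oo", "ō"), ("ou", "ō"))
      def katakana(match):
          word = match.group(0).lower()
          for two, one in pairs:
              word = word.replace(two, one)
          return word
      text = re.sub(r"[A-Z]{2,}", katakana, text)
      # 3. Elsewhere "ii" stays "ii", and an apostrophe blocks the merge:
      #    Masa'aki -> Masaaki, while Kousuke -> Kōsuke.
      for two, one in (("aa", "ā"), ("uu", "ū"), ("oo", "ō"), ("ou", "ō")):
          text = text.replace(two, one)
      text = text.replace("si", "shi")   # one 99-shiki -> Hepburn consonant rule of many
      return text.replace("'", "")       # spent markers are dropped

  # hepburn_macron("watasi 'ha Kousuke")  -> 'watashi wa Kōsuke'
  # hepburn_macron("Masa'aki")            -> 'Masaaki'
  # hepburn_macron("KIISU")               -> 'kīsu'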


Further thoughts:

  • Some transcriptions will occur very often; capital cities, for example. It might be useful to have a transcription cache in addition to the article cache system (sketched below).
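
For instance, a per-string cache layered under the converters (sketched with Python's functools here; the real extension would plug into MediaWiki's existing cache infrastructure instead):

  from functools import lru_cache

  def transcribe(lang, scheme, text):
      """Placeholder for the comparatively expensive rule-based conversion."""
      return text

  @lru_cache(maxsize=65536)
  def cached_transcription(lang, scheme, text):
      # Frequent strings ("Toukyou", country names, ...) are converted once.
      return transcribe(lang, scheme, text)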