Thinking about this just makes me think more and more about an
"alternate titles" feature.
It would seem nice to have a way of entering _smart_ alternate
versions for every "name" in the database. Think of something like the
sort-name or the alias for artists' names, but with an additional
field that defines the meaning of the alternative.
As an example, we could have:
Title(*): ローマ字
  alt:transliteration:rōmaji:JSL: roomazi
  alt:transliteration:rōmaji:Shin-kunrei-shiki: rômazi
  alt:transliteration:rōmaji:Hepburn: rōmaji
  alt:transliteration:rōmaji:common: romanji
(*: I can't read Japanese, just picked up something from the web.)

Title: ?????????????
  alt:transliteration:rōmaji: Paper Moon ni koshikakete
  alt:translation:English: Paper Moon Stay by Me (**)
(**: Just guessing)

Name: Пётр Ильич Чайковский
  alt:sortname: Tchaikovsky, Pyotr Ilyich
  alt:translation:English: Pyotr Ilyich Tchaikovsky
  alt:translation:Korean: ??????
  alt:translation:Romanian: Piotr Ilici Ceaicovsky
  alt:part:family-name: Чайковский
  alt:part:first-name: Пётр
  alt:part:middle-name: Ильич
  alt:version: Peter I. Tchaikovsky

Title: 100 Años de cine, Volume 1 (disc 1)
  alt:as-on-cover: 100 Años De Cine, Vol.1
  alt:translation:English: 100 Years of Cinema, Volume 1 (disc 1)
  alt:translation:Romanian: 100 ani de cinema, volumul 1 (disc 1)

Title: ♠
  alt:transliteration:ASCII: Spade

The problem with this system is that it obviously adds a lot of
complexity. First we need to define an extensible hierarchy of
alternate types (an ontology), which defines what each class should
contain. Then people will start to populate the classes with content.
Both steps are an occasion for errors, especially when combined (e.g.,
what do we do when we need to change an alt: class that is already
populated?).
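
To make the ontology idea a bit more concrete, here is a minimal
sketch in Python. Everything in it is made up for illustration (the
ALT_TYPES table, the "*" wildcard convention, the find_type and
add_alternate helpers); none of it is an existing part of the site.

```python
# Hypothetical registry of alternate classes. Each pattern says whether
# several values may coexist for one name. "*" stands for exactly one
# free segment (a language, a script, a transliteration scheme...).
ALT_TYPES = {
    "alt:sortname": {"multiple": False},
    "alt:as-on-cover": {"multiple": False},
    "alt:version": {"multiple": True},
    "alt:common-misspelling": {"multiple": True},
    "alt:translation:*": {"multiple": False},
    "alt:transliteration:*": {"multiple": False},
    "alt:transliteration:*:*": {"multiple": False},
    "alt:part:*": {"multiple": False},
}


def find_type(key):
    """Return the registry pattern matching key, or None if unknown."""
    parts = key.split(":")
    for pattern in ALT_TYPES:
        pattern_parts = pattern.split(":")
        if len(pattern_parts) != len(parts):
            continue
        if all((p == "*" and q) or p == q
               for p, q in zip(pattern_parts, parts)):
            return pattern
    return None


def add_alternate(entry, key, value):
    """Store value under key in entry, enforcing the registry rules."""
    pattern = find_type(key)
    if pattern is None:
        raise ValueError("unknown alternate type: " + key)
    values = entry.setdefault(key, [])
    if values and not ALT_TYPES[pattern]["multiple"]:
        raise ValueError("only one value allowed for " + key)
    values.append(value)


# Alternates for one name, keyed by class.
entry = {}
add_alternate(entry, "alt:sortname", "Tchaikovsky, Pyotr Ilyich")
add_alternate(entry, "alt:translation:English", "Pyotr Ilyich Tchaikovsky")
add_alternate(entry, "alt:version", "Peter I. Tchaikovsky")
```

A second "alt:sortname" for the same entry raises an error, while
"alt:version" accepts any number of values. Changing a class that is
already populated is exactly the hard case: the sketch does nothing
about migrating existing values.
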
However, take a look at the advantages:
(1) It would give us a relatively simple (technically) solution for
the many complications caused by the overloading of fields we
currently do (eg, sort-name/transliteration).
(2) It would simplify things and eliminate a lot of duplication, for
instance by reuniting pseudo-releases with the actual release. We'd
have a single entry for ?????????????, with all the alternate titles
together (so we don't have to duplicate the release info).
(3) It suddenly reunites some different "points of view" and
eliminates _some_ of the arguing about guidelines. For instance,
people who want things capitalized "as on the cover" can have that
info, and those who want things "right" can have it too.
(4) It would make a huge step towards solving
localization/internationalization issues: People who need ASCII-only
titles can just set their tagger to pick that info for the tags (and
even just for the filenames). If I knew (say) English, French and
Japanese, I could set both the tagger and the web-site preferences to
display those languages in original, and transliterate the rest. Or it
could show original and "closest match to what I can read" next to it.
(5) The best part: despite the huge explosion in alternatives I'm sure
everyone expects from this, there's a big opportunity here. Having a
very good ontology for alternatives (i.e., well-defined meanings), even
if it's partial, means that a _huge_ amount of work can be done by
machines. Check this out:
(a) If we define "alt:transliteration:ASCII", we can:
  - have a "check" script that only allows ASCII characters in that
    field
  - have an "auto-generate" script that knows common replacements
    for most characters (e.g., ♠=spade), etc.
  - the fun is this: even if the script isn't correct all the
    time (it can't be), no biggie: its output is only used when no
    human-defined alternative is there. I can just go to a track and
    say that the real ASCII title is, e.g., "Ace of Spades" instead of
    "ASpades" for "A♠".
  - the script can run as a web-service on the site, or in the
    tagger. I can imagine the tagger doing a query like "give me the
    cover, guideline and ASCII-transliterated titles of this song",
    and then the server picking out those that are defined
    (guideline), generating those it doesn't have (ASCII), and
    ignoring what it can't generate (the "on-the-cover" version where
    it's not defined).
(b) The checks can be extended to other cases: sort-names can only
contain Latin characters and some punctuation, translated names should
not fail a spell-check for the target language (a warning, can be
overridden by votes), ASCII transliterations should only contain ASCII
(and we can extend this to any character set), etc.
(c) We can actually have automatic transliterations: since we can
have several, we can have (say) two modules that use two different
methods of romaji. Users can manually correct those that are wrong.
Only users who need or can use the info would have to look at it. The
server can populate the fields with automatic data on demand. (E.g.,
the first time someone asks for a certain transliteration of a song,
if there is none defined, the server can use even a complicated script
(dictionary lookup, web-service) to build one; it then remembers the
result, marks that field as auto-generated, and returns it.
The next time it's needed it's there. If the guess was wrong, anyone
who can tell can fix it with an edit.) You can set your tagger to
never use guessed info, or to ask you to check it. (After a manual
check it can even put it up to vote and upgrade it from auto-generated
to voted info.)
(d) Much of this can be done in a rather decentralized and automated
way. Example: if we decide (on the style-list) that we allow
transliteration to different char-sets _and_ scripts in the
alt:transliteration namespace, we can [i] automatically allow people
to add entries for everything that matches
"alt:transliteration:[charset]" (from whatever list of charsets we
add), and "alt:transliteration:[script]" for our list of scripts; [ii]
for any charset/script we can add a program (say, in Python) that
checks if a string can be represented in it; [iii] for any
charset/script we can add a program that converts from Unicode to it;
[iv] we can give some users permission to change these programs
(because they know the charset or script), and we can even put the
programs through something like a voting process, editing them inline
on the site.(***)
(e) For every language in our list we can allow
"alt:translation:[language]" versions, and we can give some editors
auto-edit rights for some languages. (We can trust what they put in
the EditorLanguage script on the wiki, or we can put in a "vote on
this" process.)
(f) We can add special cases for some situations. For instance,
"alt:transliteration:romaji" can be a special case; we can
have some special cases for some languages: e.g., German ö and ü can
be turned into oe and ue (I'm not sure what that's called). Romanian
uses the relatively uncommon characters ș and ț (s and t with comma
below), which are often replaced with ş and ţ (the same with cedilla);
the former are less common in fonts, so some people might prefer to
use the latter. Again, this only concerns people who use some
languages, so it can easily be hidden from others with settings.
(g) There aren't really many other classes to define, after we include
"alt:as-on-cover", "alt:common-misspelling", "alt:other-edition"
and... I can't think of anything else.
(h) We can set some alternatives to be "unique" (eg, just one
"alt:transliteration:ASCII" is allowed), and some to allow multiples
(eg, "alt:common-misspelling").
(i) Since it came up, we can even have a script that checks that an
"alt:common-misspelling" value actually comes up in a Google search
:)
(j) All of these can be plugged into the search quite easily. Even
more, I think we can plug the "generating" scripts (where defined)
into the indexing engine, so it can find stuff even when an alternate
is not explicitly defined.
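
Points (a) and (b) above could look roughly like this in Python. It's
only a sketch under assumptions: the SYMBOL_NAMES table and the
function names are invented here, and the guessing is deliberately
naive (it uses Unicode NFKD decomposition to strip accents and simply
drops anything it can't handle, so Cyrillic or Japanese input comes
out empty).

```python
import unicodedata

# Hypothetical replacement table for symbols that have no ASCII
# decomposition: U+2660 is the black spade, U+2665 the black heart.
SYMBOL_NAMES = {"\u2660": "Spade", "\u2665": "Heart"}


def is_ascii(text):
    """The "check" script: accept only ASCII characters."""
    return all(ord(ch) < 128 for ch in text)


def guess_ascii(text):
    """The "auto-generate" script: a best-effort ASCII rendering."""
    out = []
    for ch in text:
        if ch in SYMBOL_NAMES:
            out.append(SYMBOL_NAMES[ch])
            continue
        # NFKD splits an accented letter into the base letter plus a
        # combining mark; the non-ASCII mark is then dropped.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)


def ascii_title(title, human_defined=None):
    """Prefer a human-entered value; fall back to the guess."""
    if human_defined is not None:
        if not is_ascii(human_defined):
            raise ValueError("alt:transliteration:ASCII must be ASCII")
        return human_defined, "human"
    return guess_ascii(title), "auto-generated"


# "\u00f1" is n with tilde, as in the "100 A?os de cine" example.
print(ascii_title("100 A\u00f1os de cine"))
print(ascii_title("A\u2660"))
print(ascii_title("A\u2660", human_defined="Ace of Spades"))
```

The second call produces a mechanical guess for the spade title, which
is exactly the kind of result a human edit is meant to override, as
the third call shows.
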
(***: remember, the best part of all this is that it's trivial to have
a setting that hides the automatically-generated alternates. In fact
it won't generate and return them (in the web-service, for instance)
unless explicitly requested. Which means that we don't have to be
always-correct except for stuff people put in, which goes through the
votes.)
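
The on-demand behaviour from (c), together with the "hidden unless
requested" rule from the footnote, might be sketched like this. The
AlternateStore class, its method names and the stand-in generator are
all hypothetical; a real server would obviously keep this in the
database rather than in memory.

```python
class AlternateStore:
    """Hypothetical in-memory store for the alternates of one title."""

    def __init__(self, title, generators=None):
        self.title = title
        self.generators = generators or {}  # key -> function(title)
        self.values = {}  # key -> (text, source)

    def set_human(self, key, text):
        """Record an edit made by a person; it replaces any guess."""
        self.values[key] = (text, "human")

    def get(self, key, include_generated=False):
        """Return (text, source), or None if nothing usable exists."""
        if key in self.values:
            text, source = self.values[key]
            if source == "auto-generated" and not include_generated:
                return None
            return text, source
        if include_generated and key in self.generators:
            # First request: generate, remember, and mark as a guess.
            text = self.generators[key](self.title)
            self.values[key] = (text, "auto-generated")
            return self.values[key]
        return None


def drop_non_ascii(title):
    """Stand-in generator: just throws away non-ASCII characters."""
    return title.encode("ascii", "ignore").decode("ascii")


KEY = "alt:transliteration:ASCII"
# "\u00f1" is n with tilde.
store = AlternateStore("100 A\u00f1os De Cine, Vol.1", {KEY: drop_non_ascii})

print(store.get(KEY))                          # nothing defined by a human
print(store.get(KEY, include_generated=True))  # generated and remembered
store.set_human(KEY, "100 Anos De Cine, Vol.1")
print(store.get(KEY))                          # the human edit wins
```

A tagger set to "never use guessed info" would simply never pass
include_generated=True.
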
What do you think?
Post by Bogdan Butnaru
1) Japanese releases keep everything as on the cover (consistency).
2) Transliterated releases keep their capitalization where applicable
(i.e., already-Latin characters remain as they are).
3) Translated releases get turned to English rules (or whatever the
translation language is) for all tracks.
Rationale: for case (3), it would look weird if the transliterated
tracks used normal capitalization and those that were in Latin chars
already didn't. But for (2), the release still has a "Japanese
quality" about it, since some names would be in romanized Japanese
(those transliterated). So they should keep the caps as in the
original.
(a) Symbolic characters, e.g. the heart (♥). I think these should be
left alone, except for the special-case "ASCII" transliteration. (What
happens there would be at the editors' judgement.)
(b) Japanese releases where _all_ titles are in English (or Latin
characters, anyway), but with the atypical Japanese capitalization.
They would be at the same time a (1) and an official (2) in the
summary above; for consistency we might want to add the (3)-version,
but that's neither a transliteration nor exactly a translation.
"I think I am a fallen star, I should wish on myself." - O.
--
Bogdan Butnaru - ***@gmail.com