Discussion:
[mb-style] Capitalisation Standards for Transl*ted Pseudo-Releases?
Chidade
2007-09-22 07:30:27 UTC
I have been doing a few edits on the pillows
<http://musicbrainz.org/artist/e69c8ccb-25b1-4c4a-89c6-8fdad1172951.html> releases.
I own most of the catalogue so I thought I'd fill in the few
missing barcodes and label information. The pillows fans on MB are pretty
good about making pseudo-releases for each of the official Japanese ones.
But going through them, I noticed that some of the older releases didn't
have ALL CAPS where necessary, which I felt I could change easily enough
thanks to ArtistIntent and this:
http://musicbrainz.org/doc/CapitalizationStandard/JapaneseArtistsException

But then I noticed that the ALL CAPS titles were listed under the transl*ted
pseudo-release instead...

My first impulse was to change it to the Guess Case that MB came up with.
After all, these are pseudo-releases that have been Latin-ised so Artist
Intent wouldn't apply for most of the Japanese titles...but what about the
song titles that were originally in Latin script? Should they follow artist
intent?

I can't find anything on the wiki that talks about it. Has this been
discussed before?

Personally, I think that transl*ted pseudo-releases should follow standard
capitalisation. But this may cause problems. With translated titles, that's
easy enough...but with transliterated? Which words in this title should have
the first letter capitalised?

Kono mama koko de<http://musicbrainz.org/track/62374b63-7b25-478a-aa73-0c792873c2b4.html>

There are more examples on the album that track is from, if anyone is
interested.

So! Thoughts?

- Chidade
Kuno Woudt
2007-09-22 08:14:52 UTC
Post by Chidade
But going through them, I noticed that some of the older releases didn't
have ALL CAPS where necessary, which I felt I could change easily enough
http://musicbrainz.org/doc/CapitalizationStandard/JapaneseArtistsException
[2]

If I understand the guidelines correctly, it is not really ArtistIntent
which causes Japanese releases in general to keep their caps. The intent
of an artist is very hard to determine; the only way to be sure about an
artist's intent is to ask them, or perhaps in some rare cases you will
find statements on the topic on official websites or in interviews.

The basis for keeping the caps on most Japanese releases is that, once
they've chosen the name for a track, they will consistently write it
like that everywhere -- CD cover, official website, same track on
compilation albums, etc... This is ConsistentOriginalData [1], as
described on http://musicbrainz.org/doc/StylePrinciple .
Post by Chidade
My first impulse was to change it to the Guess Case that MB came up with.
After all, these are pseudo-releases that have been Latin-ised so Artist
Intent wouldn't apply for most of the Japanese titles...but what about the
song titles that were originally in Latin script? Should they follow artist
intent?
I don't think there is a specific guideline for this, nor do I think we
really need one. Yes, based on the same rules for the original
tracklisting, for those tracks which were already in Latin script, it would
make sense to retain the caps used on the official tracklisting. From
what I can see, none of the romanization systems for Japanese use
caps, so romanized track titles should probably have sentence caps
(http://en.wikipedia.org/wiki/Romanization_of_Japanese).

-- kuno.

[1] ConsistentOriginalData is overloaded with two entirely different
concepts on the wiki, this still needs to be changed.
[2] As I explained on that page, JapaneseArtistsException is a rather
misleading name for that page -- that page needs to be rewritten and
renamed.
Bogdan Butnaru
2007-09-22 12:04:40 UTC
Post by Kuno Woudt
Post by Chidade
My first impulse was to change it to the Guess Case that MB came up with.
After all, these are pseudo-releases that have been Latin-ised so Artist
Intent wouldn't apply for most of the Japanese titles...but what about the
song titles that were originally in Latin script? Should they follow artist
intent?
I don't think there is a specific guideline for this, nor do I think we
really need one. Yes, based on the same rules for the original
tracklisting, for those tracks which were already in Latin script, it would
make sense to retain the caps used on the official tracklisting. From
what I can see, none of the romanization systems for Japanese use
caps, so romanized track titles should probably have sentence caps
(http://en.wikipedia.org/wiki/Romanization_of_Japanese).
Correct as far as I know, except for proper names, of course, which
should be capitalized.

I'd summarize my preference like this:
1) Japanese releases keep everything as on the cover (consistency).
2) Transliterated releases keep their capitalization where applicable
(ie. already-Latin characters remain as they are)
3) Translated releases get turned to English rules (or whatever the
translation language is) for all tracks.

Rationale: for case (3), it would look weird if the transliterated
tracks used normal capitalization and those that were in Latin chars
already didn't. But for (2), the release still has a "Japanese
quality" about it, since some names would be in romanized Japanese
(those transliterated). So they should keep the caps as in the
original.

This still leaves two undiscussed cases:
(a) Symbolic characters, e.g. the heart (♥). I think these should be
left alone, except for the special-case "ASCII" transliteration. (What
happens there would be at the editors' judgement.)
(b) Japanese releases where _all_ titles are in English (or Latin
characters, anyway), but with the atypical Japanese capitalization.
They would be at the same time a (1) and an official (2) in the
summary above; for consistency we might want to add the (3)-version,
but that's neither a transliteration nor exactly a translation.

-- Bogdan Butnaru ? ***@gmail.com
"I think I am a fallen star, I should wish on myself." ? O.
Bogdan Butnaru
2007-09-22 13:28:30 UTC
Thinking about this just makes me think more and more about an
"alternate titles" feature.

It would seem nice to have a way of entering _smart_ alternate
versions for every "name" in the database. Think of something like the
sort-name or the alias for artists' names, but with an additional
field that defines the meaning of the alternative.

As an example, we could have:

Title(*): ローマ字
alt:transliteration:rōmaji:JSL: roomazi
alt:transliteration:rōmaji:Shin-kunrei-shiki: rômazi
alt:transliteration:rōmaji:Hepburn: rōmaji
alt:transliteration:rōmaji:common: romanji

(*: I can't read Japanese, just picked up something from the web.)

Title: ?????????????
alt:transliteration:rōmaji: Paper Moon ni koshikakete
alt:translation:English: Paper Moon Stay by Me (**)

(**: Just guessing)

Name: Пётр Ильич Чайковский
alt:sortname: Tchaikovsky, Pyotr Ilyich
alt:translation:English: Pyotr Ilyich Tchaikovsky
alt:translation:Korean: ??????
alt:translation:Romanian: Piotr Ilici Ceaicovsky
alt:part:family-name: Чайковский
alt:part:first-name: Пётр
alt:part:middle-name: Ильич
alt:version: Peter I. Tchaikovsky

Title: 100 Años de cine, Volume 1 (disc 1)
alt:as-on-cover: 100 Años De Cine, Vol.1
alt:translation:English: 100 Years of Cinema, Volume 1 (disc 1)
alt:translation:Romanian: 100 ani de cinema, volumul 1 (disc 1)

Title: ♠
alt:transliteration:ASCII: Spade
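
The examples above could be modelled roughly as follows. This is only a
minimal sketch of one possible data structure, not a proposal for the
actual MusicBrainz schema: the class names Alternate and Name, the
auto_generated flag and the find() helper are all hypothetical, and the
leading "alt:" is dropped from the type path for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Alternate:
    kind: Tuple[str, ...]         # e.g. ("translation", "English")
    value: str
    auto_generated: bool = False  # True if a script guessed the value


@dataclass
class Name:
    primary: str
    alternates: List[Alternate] = field(default_factory=list)

    def add(self, kind: str, value: str, auto_generated: bool = False) -> None:
        # "translation:English" is stored as ("translation", "English").
        self.alternates.append(
            Alternate(tuple(kind.split(":")), value, auto_generated))

    def find(self, kind: str) -> List[str]:
        # Match on a prefix of the type path, so "translation" returns
        # every translation regardless of language.
        prefix = tuple(kind.split(":"))
        return [a.value for a in self.alternates
                if a.kind[:len(prefix)] == prefix]


tchaikovsky = Name("Пётр Ильич Чайковский")
tchaikovsky.add("sortname", "Tchaikovsky, Pyotr Ilyich")
tchaikovsky.add("translation:English", "Pyotr Ilyich Tchaikovsky")
tchaikovsky.add("translation:Romanian", "Piotr Ilici Ceaicovsky")

print(tchaikovsky.find("translation"))
print(tchaikovsky.find("translation:Romanian"))
```

Storing the type as a path is what makes the hierarchy extensible: a
tagger could ask for "translation" in general or for one language in
particular without the structure changing.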

The problem with this system is that it obviously adds a lot of
complexity. First we need to define an extensible hierarchy of
alternate types (an ontology), which define what each class should
contain. Then people will start to populate them with content. Both of
those will be an occasion for errors, especially when combined (eg,
what to do when you need to change an alt: class and it's already
populated).

However, take a look at the advantages:
(1) It would give us a relatively simple (technically) solution for
the many complications caused by the overloading of fields we
currently do (eg, sort-name/transliteration).

(2) It would simplify and eliminate a lot of duplication, for instance
by reuniting pseudo-releases with the actual release. We'd have a
single entry for ?????????????, with all the alternate titles together
(thus we don't have to duplicate release info and whatever).

(3) It suddenly reunites some different "points of view" and
eliminates _some_ of the arguing about guidelines. For instance,
people who want things capitalized "as on the cover" can have that
info, and those who want things "right" can have it too.

(4) It would make a huge step towards solving
localization/internationalization issues: People who need ASCII-only
titles can just set their tagger to pick that info for the tags (and
even just for the filenames). If I knew (say) English, French and
Japanese, I could set both the tagger and the web-site preferences to
display those languages in original, and transliterate the rest. Or it
could show original and "closest match to what I can read" next to it.

(5) The best part: Despite the huge explosion in alternatives I'm sure
everyone expects from this, there's a huge opportunity here:
Having a very good ontology for alternatives (ie, well-defined
meanings), even if it's partial, means that a _huge_ amount of work
can be done by machines. Check this out:
(a) If we define "alt:transliteration:ASCII", we can:
- have a "check" script that only allows ASCII characters in that field
- have an "auto-generate" script that knows common replacements
for most characters (eg, ♠=spade), etc.
- the fun is this: even if the script isn't correct all the
time (it can't be), no biggie: it only works if no human-defined
alternative is there. I can just go to a track and say that the real
ASCII title is, eg, "Ace of Spades" instead of "ASpades" for "A♠".
- the script can be done as a web-service on the site, or in
the tagger. I can imagine the tagger doing the query like "give me the
cover, guideline and ASCII-transliterated titles of this song", and
then the server picking out those that are defined (guideline),
generating those it doesn't have (ASCII), and ignoring what it can't
("on-the-cover"-version where it's not defined).
(b) The checks can be extended to other cases: sort-names can only
contain Latin characters and some punctuation, translated names should
not fail a spell-check for the target language (a warning, can be
overridden by votes), ASCII transliterations should only contain ASCII
(and we can extend this to any character set), etc.
(c) We can actually have automatic transliterations: since we can
have several, we can have (say) two modules that use two different
methods of romaji. Users can manually define those that are wrong.

Only users that need/can use the info would have to look at it. The
server can populate the fields with automatic data on demand: the
first time someone asks for a certain transliteration of a song, if
there is none defined, the server can use even a complicated script
(dictionary lookup, web-service) to build one, remember it (marking
that field as auto-generated), and return it.

The next time it's needed it's there. If the guess was wrong, anyone
who can tell can fix it with an edit. You can set your tagger to
never use guessed info, or to ask you to check it. (After a manual
check it can even put it up to vote and upgrade it from auto-generated
to voted info.)

(d) Much of this can be done in a rather decentralized and automated way.
Example: if we decide (on the style-list) that we allow
transliteration to different char-sets _and_ scripts in the
alt:transliteration namespace, we can [i] automatically allow people
to add entries for everything that matches
"alt:transliteration:[charset]" (from whatever list of charsets we
add), and "alt:transliteration:[script]" for our list of scripts; [ii]
for any charset/script we can add a script (say, Python) that checks
if a string can be represented as that; [iii] for any charset/script
we can add a script that makes the translation from Unicode to that;
[iv] we can give some users the permissions to change the scripts
(because they know it), and we can even put the scripts to something
like a voting process, editing them inline on the site.(***)
(e) For every language in our list we can allow
"alt:translation:[language]" versions, and we can give some editors
auto-edit rights for some languages. (We can trust what they put in
the EditorLanguage script on the wiki, or we can put in a "vote on
this" process.)
(f) We can add special cases for some situations. For
instance, "alt:transliteration:romaji" can be a special case; we can
have some special cases for some languages: eg, German ö and ü can be
turned to oe and ue (I'm not sure what that's called). Romanian uses
the relatively uncommon characters ș and ț (s and t with comma below),
which are often replaced with ş and ţ (the same with cedilla); the
former are less common in fonts, so some people might prefer to use
the latter. Again, this only concerns people who use some languages, so it
can be easily hidden from others with settings.
(g) There aren't really many others, after we include
"alt:as-on-cover", "alt:common-misspelling", "alt:other-edition"
and... I can't think of anything else.
(h) We can set some alternatives to be "unique" (eg, just one
"alt:transliteration:ASCII" is allowed), and some to allow multiples
(eg, "alt:common-misspelling").
(i) Since it came up, we can even have a script that checks for
"alt:common-misspelling" that it actually comes up on a Google search
:)
(j) All of these can be plugged into the search quite easily. Even
more, I think we can plug the "generating" scripts (where defined)
into the indexing engine, so it can find stuff even when an alternate
is not explicitly defined.
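
To make points (a) and (c) a bit more concrete, here is a rough sketch
of the "check" script, the "auto-generate" script and the rule that a
human-entered value always wins over a guess. Everything in it is an
assumption made for illustration: the AsciiAlternates class, the
SYMBOL_WORDS table and the allow_guessed switch are hypothetical names,
and the guessing logic is deliberately crude.

```python
import unicodedata

# Hypothetical replacement table; a real one would be far larger.
SYMBOL_WORDS = {"♠": "Spades", "♥": "Hearts"}


def is_ascii(text):
    """'Check' script: True only if every character is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in text)


def guess_ascii(text):
    """'Auto-generate' script: a crude best-effort ASCII rendering."""
    parts = []
    for ch in text:
        if ch in SYMBOL_WORDS:
            parts.append(SYMBOL_WORDS[ch])
        else:
            # Split accented letters into base letter + combining marks,
            # then keep only the ASCII pieces.
            decomposed = unicodedata.normalize("NFKD", ch)
            parts.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(parts)


class AsciiAlternates:
    def __init__(self):
        self._entries = {}  # title -> (ascii_value, auto_generated)

    def set_manual(self, title, value):
        # A human edit is validated by the check script and replaces
        # any earlier guess.
        if not is_ascii(value):
            raise ValueError("ASCII alternate must be pure ASCII")
        self._entries[title] = (value, False)

    def get(self, title, allow_guessed=True):
        if title in self._entries:
            value, auto_generated = self._entries[title]
            return value if (allow_guessed or not auto_generated) else None
        if not allow_guessed:
            return None
        # Nothing stored yet: generate on demand and remember the guess,
        # marked as auto-generated.
        guess = guess_ascii(title)
        self._entries[title] = (guess, True)
        return guess


store = AsciiAlternates()
print(store.get("A♠"))                    # ASpades
store.set_manual("A♠", "Ace of Spades")
print(store.get("A♠"))                    # Ace of Spades
```

A tagger that never wants guessed data would call get() with
allow_guessed=False and simply receive nothing until a human has
entered a value.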

(***: remember, the best part of all this is that it's trivial to have
a setting that hides the automatically-generated alternates. In fact
it won't generate and return them (in the web-service, for instance)
unless explicitly requested. Which means that we don't have to be
always-correct except for stuff people put in, which goes through the
votes.)

What do you think?
--
Bogdan Butnaru ? ***@gmail.com
"I think I am a fallen star, I should wish on myself." ? O.
Bogdan Butnaru
2007-09-22 13:29:26 UTC
Post by Bogdan Butnaru
Thinking about this just makes me think more and more about an
"alternate titles" feature.
BTW, I volunteer to implement this (with a bit of support from the
current developers) if people think it's a good idea.

-- Bogdan Butnaru ? ***@gmail.com
"I think I am a fallen star, I should wish on myself." ? O.
Philip Jägenstedt
2007-09-22 15:50:45 UTC
I like many of these ideas as it could remedy problems that I
experience. I edit many Chinese releases with 2 or even 3
pseudo-releases (simplified characters, English translation, pinyin
transliteration) which is not pretty at all. The as-on-cover
alternative is also very attractive since it's often the case that I
want to put that down without having it in the proper title (it might
be a note that this is the song for this and that movie). It is also
often the case that a release or a song has an official name in both
Chinese and English so I suppose this could come in handy there as
well.

Apart from liking the idea, I do feel that it's fairly complex and
something that could easily snowball into including almost anything,
making it something like a basement for rubbish in MB. All those
automated transcriptions and translations could easily become a large
unmaintained mess that no one would double-check (if you know the
language you're not very likely to look at the
translation/transcription).

Anyway, nice idea(s)!

Philip
_______________________________________________
Musicbrainz-style mailing list
http://lists.musicbrainz.org/mailman/listinfo/musicbrainz-style
Arturus Magi
2007-09-23 02:02:16 UTC
Post by Bogdan Butnaru
(h) We can set some alternatives to be "unique" (eg, just one
"alt:transliteration:ASCII" is allowed), and some to allow multiples
(eg, "alt:common-misspelling").
I would suggest not making any unique unless someone's prepared to
write a lengthy definition/explanation for it (and if we go with that,
we should probably define a few of the other 7/8-bit common sets too).
There are a lot of common extensions to/variants of ASCII, and some
of them (US-ASCII, DOS CP850, Windows-1251, and ISO Latin-1,
primarily) have become nearly synonymous with the original ASCII
standard to most people.
Philip Jägenstedt
2007-09-23 02:09:25 UTC
Why do we want to mess around with character sets though? We already
use the Universal Character Set (Unicode). Is there anything that
couldn't be converted by the tagger without involvement of the server
and alt-systems?

Philip
Bogdan Butnaru
2007-09-23 08:14:32 UTC
There are some things. The tagger can of course check each character,
but what does it do when there is no equivalent?

Take ♥ for instance: in a title like "I ♥ NY" it should probably be
translated like "I Love NY", but in "10♥" it would be "10 of Hearts".
Another example I mentioned above: German umlauts are traditionally
replaced with "e" when they are not available; however, in Romanian,
the "comma below" is just ignored if not available.
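
The difference between the two conventions could be sketched like this.
It is only an illustration under assumptions: the to_ascii() function
and the LANGUAGE_RULES table are hypothetical, the German table covers
only the umlauts and ß, and anything without a rule falls back to
dropping the diacritic.

```python
import unicodedata

# Hypothetical per-language replacement rules, keyed by language code.
LANGUAGE_RULES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"},
    "ro": {},  # no special rules: the comma below is simply dropped
}


def to_ascii(text, language=None):
    rules = LANGUAGE_RULES.get(language, {})
    parts = []
    # Normalise to composed form first so the rule lookup sees whole letters.
    for ch in unicodedata.normalize("NFC", text):
        if ch in rules:
            parts.append(rules[ch])
        else:
            # Fallback: decompose and keep only the ASCII base letter.
            decomposed = unicodedata.normalize("NFD", ch)
            parts.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(parts)


print(to_ascii("grüße", "de"))      # gruesse
print(to_ascii("Timișoara", "ro"))  # Timisoara
```

Without a language hint the same German word loses information (the ß
has no decomposition and disappears), which is the kind of case where a
human-supplied alternative would be needed.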

Also, what's the ASCII trans_literation_ of ?????????????? The closest
we can do is turn it to one romaji version that uses only ASCII chars
(many use macrons, so wouldn't work).

The advantage of the system I proposed is that (1) if an automatic
conversion is possible and good enough, it would just work
automatically, and (2) if it doesn't work, any editor can propose a
better version, which is put to a vote and remembered for future
users.

Another advantage is that it's (relatively) easy to accumulate many
complex translation/transliteration tools on the server, for instance
containing dictionaries and big look-up tables. It's much harder to
put that in the tagger.

--Bogdan Butnaru
Philip Jägenstedt
2007-09-23 08:41:22 UTC
The examples you raise aren't about character encodings (as in
UTF-8/ISO-8859-1/etc) but about mappings between scripts (a linguistic
concept I suppose, and not a computer science one). Perhaps I'm just
arguing over terminology here, but I think it would be a mistake to
make mappings between character encodings as opposed to scripts.

About converting Latin script with accents and diacritics to 7-bit
ASCII, this can actually be done programmatically. There is a Unicode
decomposition which converts, for example, the Swedish character Å to "A"
and "ring over preceding". So for characters in the Unicode Latin or
Latin Extended blocks you could simply do such a decomposition and
throw away that which isn't 7-bit ASCII (the "ring over preceding"
codepoint in the example). I've done something like this for a file
renamer script once and it was 2 lines in Python.
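
For reference, the decompose-and-discard approach described here can be
written with Python's standard unicodedata module roughly as below. This
is a reconstruction of the idea, not the original renamer script, and
the function name strip_to_ascii is made up for the example.

```python
import unicodedata


def strip_to_ascii(text):
    # Decompose each character into base letter + combining marks...
    decomposed = unicodedata.normalize("NFKD", text)
    # ...then throw away everything that is not 7-bit ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_to_ascii("Ångström"))    # Angstrom
print(strip_to_ascii("Jägenstedt"))  # Jagenstedt
```

Note that characters with no decomposition (ß, ø, and anything outside
the Latin blocks) are silently dropped rather than transliterated.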

I am not opposed to automatic (or manual) transliterations, I'm just
not seeing where character sets come in. Are we misunderstanding each
other?

Philip
Kuno Woudt
2007-09-23 10:36:14 UTC
Post by Philip Jägenstedt
About converting Latin script with accents and diacritics to 7-bit
ASCII, this can actually be done programmatically. There is a Unicode
decomposition which converts, for example, the Swedish character Å to "A"
and "ring over preceding".
Of course almost any transliteration can be done programmatically [1],
the quality of the conversion will vary though :)

But for many languages a generated transliteration will probably be a
good starting point for someone entering a transliterated
pseudo-release.

--kuno.
[1] some libraries which can transliterate many scripts:
- IBM ICU, http://www.icu-project.org/index.html
- GNU iconv, http://www.gnu.org/software/libiconv

Perl also has a bunch of romanization modules, which may
or may not use the libraries I mentioned:
http://search.cpan.org/search?query=romanize&mode=all

I expect specific libraries exist for certain languages which do a
better job than the generic transliterators included in ICU/iconv.
Philip Jägenstedt
2007-09-23 11:25:19 UTC
iconv is a character set converter, which doesn't change the script
but only the encoding. Converting the Swedish letter Å between UTF-8
and latin-1 is not a transcription. Perhaps I'm just being
anal-retentive, but MusicBrainz is all in Unicode and I personally
believe it ought to stay that way. I'm not sure anyone has said
anything to the contrary, but anyway...

That being said, I can see some legitimate needs to want to limit a
release to only the characters that can be represented by character
set X (for example if your portable music player can only handle that
encoding), but I would really like to explore other ways to do this than
using alt:latin1, alt:gbk, alt:big5, alt:shift-jis and so on...

Philip
Kuno Woudt
2007-09-23 11:36:30 UTC
Post by Philip Jägenstedt
iconv is a character set converter, which doesn't change the script
but only the encoding.
It has features for transliteration:

***@ararita:~$ echo grüße | LC_ALL=de_DE.UTF-8 iconv --from utf-8 --to ascii//translit
gruesse

--kuno.
Philip Jägenstedt
2007-09-23 11:38:06 UTC
What do you know, my bad :)
Kuno Woudt
2007-09-23 11:42:08 UTC
Post by Philip Jägenstedt
That being said, I can see some legitimate needs to want to limit a
release to only the characters that can be represented by character
set X (for example if your portable music player can only handle that
encoding)
That is only one of the reasons people want transliterations, the other
being obviously that you cannot read whatever is the source script.

example:
http://musicbrainz.org/show/release/?releaseid=580582

the original is latin script, the transliteration is to kana.

--kuno.
Philip Jägenstedt
2007-09-23 11:53:08 UTC
Yes, I am not at all questioning transliterations, I am trying to make
the point that transliteration and character set conversion are not the
same thing. Just because we transliterate an English release
into Japanese katakana doesn't mean that it will be stored with
character encoding Shift-JIS (or whatever they use nowadays).
Everything should be in Unicode (except for aliases which are supposed
to catch misencodings; those might not be valid UTF-8, I suppose).

Philip
Kuno Woudt
2007-09-23 11:56:01 UTC
Post by Philip Jägenstedt
Everything should be in Unicode
Ah!, ok, I obviously agree with that :).

--kuno.
Bogdan Butnaru
2007-09-24 16:43:46 UTC
I might have been a bit unclear about the distinction between the two,
because I named both "transliterations".

There are two different classes of issues that appear occasionally,
and each of the two transliteration types is meant to solve a different
one:

(1) User X can't read script A (say, Katakana), and they'd like to be
able to read a song's title. This is solved by transliterating the
title into a script they can read, for instance in Latin script using
a romaji method. (There are several, and I don't see why we shouldn't
support more than one, but that's a different discussion.) This is
clearly a case of "transliterating between two scripts".

(2) User X (for some reason) can't use Unicode. Maybe his favorite
player doesn't work with it, or his OS can't use it, etc. He'd like to
have song titles converted to a character set he can use. Probably 99%
of the time this would mean converting to ASCII, and I can imagine
just ignoring other charsets and having "alt:ASCII" as a special case.
However there are cases where other charsets are needed (Asian
languages aren't as happy with Unicode as we are, in particular), so I
don't see why we shouldn't allow the other cases.

Now, this case would normally fall under the definition "character set
conversion", except that it's a bit more complex than that. Some
examples (I've already mentioned them, but it helps noting them here):
(2.a) You can't do a charset conversion from Kana (or Punjabi or
Cyrillic or Runes) to ASCII. You need to transliterate that to the
Latin script. _And_ often that's not enough to allow changing the
charset: some romaji methods use non-ASCII characters (like vowels
with macron), and you need a way to deal with that.
(2.b) The previous case extends to every character that (i) doesn't
have an equivalent in ASCII but (ii) can be expressed in words
(heart-symbols and other ideograms).
(2.c) Regional typographic conventions. As someone mentioned earlier,
the German word "grüße" can be written as "gruesse". However, this
_doesn't_ work for every language; they might have other conventions.

The point of all these examples is that we don't need "character
conversion" (since often we'd have to drop characters), but a
"transliteration to a character set". This means finding a close
approximation of the meaning of the characters in the source charset
that can be expressed using characters in the target charset. We can
call it "transcharacterization" if you want.

This operation can't always be automatic. It _can_ work, but sometimes
a human can do much better, which is exactly the reason why I
suggested the approach before: have a tool that does its best at the
change (iconv is a smart one, we might actually use it), and allow
people to refine the results if necessary/possible.
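
A minimal sketch of that "tool does its best, human refines" idea, in
Python. It uses the standard `unicodedata` module rather than iconv, and
the `REGIONAL` table, the `to_ascii` function and the `overrides`
parameter are hypothetical names made up for illustration. The German
mappings are just one regional convention; nothing here is an existing
MusicBrainz feature.

```python
import unicodedata

# Hypothetical per-language conventions, applied before the generic fallback.
# German commonly writes "ü" as "ue" and "ß" as "ss"; other languages differ.
REGIONAL = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"},
}


def to_ascii(title, lang=None, overrides=None):
    """Best-effort ASCII version of a title.

    A human-entered override always wins. Otherwise regional conventions
    are applied, combining marks are stripped, and anything still outside
    ASCII is replaced with "?".
    """
    if overrides and title in overrides:
        return overrides[title]
    for src, dst in REGIONAL.get(lang, {}).items():
        title = title.replace(src, dst)
    # NFKD splits e.g. "ō" into "o" plus a combining macron, dropped below.
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.encode("ascii", "replace").decode("ascii")


print(to_ascii("Grüße", lang="de"))   # regional convention: Gruesse
print(to_ascii("Tōkyō"))              # macrons dropped: Tokyo
print(to_ascii("I ♥ You"))            # no automatic answer: I ? You
print(to_ascii("I ♥ You", overrides={"I ♥ You": "I Heart You"}))
```

The third call shows where the automatic step gives up, and the fourth
shows a human-supplied refinement taking precedence over it.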

MusicBrainz is Unicode-oriented, that's a very good thing and it
should stay that way. In fact, in ten years' time we might find out
that nobody uses the other character sets, and just remove the
feature. But now it seems like a nice thing to have. Anyway, it's
the least important and hardest to implement of the several kinds of
"alt:" alternatives, so I think we can safely ignore it for now and
concentrate on the others.
Post by Philip Jägenstedt
Yes, I am not at all questioning transliterations, I am trying to make
the point that transliteration and character set conversion are not the
same thing. Just because we transliterate an English release
into Japanese katakana doesn't mean that it will be stored with
character encoding Shift JIS (or whatever they use nowadays).
Everything should be in Unicode (except for aliases which are supposed
to catch misencodings; those need not be valid UTF-8, I suppose).
Philip
Post by Kuno Woudt
Post by Philip Jägenstedt
That being said, I can see some legitimate reasons to want to limit a
release to only the characters that can be represented by character
set X (for example if your portable music player can only handle that
encoding)
That is only one of the reasons people want transliterations, the other
being obviously that you cannot read whatever is the source script.
http://musicbrainz.org/show/release/?releaseid=580582
the original is Latin script, the transliteration is to kana.
--kuno.
_______________________________________________
Musicbrainz-style mailing list
http://lists.musicbrainz.org/mailman/listinfo/musicbrainz-style
--
Bogdan Butnaru ? ***@gmail.com
"I think I am a fallen star, I should wish on myself." ? O.
Philip Jägenstedt
2007-09-24 23:23:49 UTC
Permalink
Thank you for those clarifications. Let's move on to other matters for now :)

One thing that I was wondering about is if this will work only on a
track level. If it does, I think that only one instance of the same
alternative should be allowed. Take this example. Jay Chou releases
a new album. MB editor Zhang adds alt:transliteration for all tracks
in hanyu pinyin (the system used on the mainland) but user Liu wants
to have them in tongyong pinyin (used in parts of Taiwan) and adds a
second set of alt:transliteration. The tagger now has no concept of
which versions belong together for that set, and if the French guy
Pierre gets to choose by himself he will a) be frustrated and b) get
it wrong and mix transliteration systems in his tags.

Solutions:

1. Allow no duplicates. This would force us to come up with a new
alternative if it is in fact different, for example
alt:transliteration:han-to-latin:hanyu-pinyin and
alt:transliteration:han-to-latin:tongyong-pinyin.

2. Grouping of the alternatives, very much like the pseudo-releases we
already have but with some added information apart from
script/language (which doesn't allow us to specify transliteration
method).
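
Solution 1 above could look roughly like the following Python sketch.
The dict-based storage and the names `add_alternative`, `titles_for`
and `DuplicateAlternative` are assumptions made for illustration, and
the sample titles are made up (they only show the zh/jh difference
between the two pinyin systems).

```python
class DuplicateAlternative(Exception):
    """Raised when a track already has an alternative of the same kind."""


def add_alternative(track_alts, kind, title):
    """Store `title` under `kind`, refusing a second entry of that kind.

    `track_alts` is a plain dict mapping kind -> title for one track.
    """
    if kind in track_alts:
        raise DuplicateAlternative(
            f"{kind} already set to {track_alts[kind]!r}")
    track_alts[kind] = title


def titles_for(release_alts, kind):
    """Return the titles of one kind for every track, or None if any
    track lacks that kind (so a tagger never mixes systems)."""
    titles = [alts.get(kind) for alts in release_alts]
    return titles if all(t is not None for t in titles) else None


HANYU = "alt:transliteration:han-to-latin:hanyu-pinyin"
TONGYONG = "alt:transliteration:han-to-latin:tongyong-pinyin"

track = {}
add_alternative(track, HANYU, "Zhongguo")      # editor Zhang
add_alternative(track, TONGYONG, "Jhongguo")   # editor Liu, distinct kind
print(titles_for([track], HANYU))
```

Because each system has its own kind, Pierre's tagger can ask for one
kind across the whole release and gets either a consistent set or
nothing.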

And BTW, we do want to allow different transliteration methods between
the same two scripts, as we would otherwise have to take sides on
political issues. This is true at least of the Chinese case, and
probably some others too.

Philip
Bogdan Butnaru
2007-09-25 07:45:10 UTC
Permalink
Post by Philip Jägenstedt
Thank you for those clarifications. Let's move on to other matters for now :)
One thing that I was wondering about is if this will work only on a
track level.
Yes, the intent was to have this work on tracks, releases and artist
names---which is where the discussion started ;)
Post by Philip Jägenstedt
If it does, I think that only one instance of the same
alternative should be allowed. Take this example. Jay Chou releases
a new album. MB editor Zhang adds alt:transliteration for all tracks
in hanyu pinyin (the system used on the mainland) but user Liu wants
to have them in tongyong pinyin (used in parts of Taiwan) and adds a
second set of alt:transliteration. The tagger now has no concept of
which versions belong together for that set, and if the French guy
Pierre gets to choose by himself he will a) be frustrated and b) get
it wrong and mix transliteration systems in his tags.
1. Allow no duplicates. This would force us to come up with a new
alternative if it is in fact different, for example
alt:transliteration:han-to-latin:hanyu-pinyin and
alt:transliteration:han-to-latin:tongyong-pinyin.
Something like this was in my initial proposal. We'll have
"alt:transliteration:hanyu-pinyin" and
"alt:transliteration:tongyong-pinyin", and only one of these would be
allowed. (Note that other alternatives, like "alt:common-misspelling",
can have duplicates.)

Exactly how the "alt:" classes will be named is subject to change. I
prefer "alt:transliteration:tongyong-pinyin" rather than
"alt:transliteration:han-to-latin:tongyong-pinyin" because it's a bit
less redundant: that transliteration (i) only applies to titles with
han characters, and (ii) always results in Latin characters.
Analogously, romaji always transliterates from the three Japanese
scripts to Latin. So it's a bit redundant to include it in the name.

That said, it might be useful for classification, more exactly for
quickly finding the one you need.
Post by Philip Jägenstedt
2. Grouping of the alternatives, very much like the pseudo-releases we
already have but with some added information apart from
script/language (which doesn't allow us to specify transliteration
method).
I don't like this one. The main reason for proposing the "alt:" thing
was to keep all translations together, and eliminate the
pseudo-releases as much as possible.
Post by Philip Jägenstedt
And BTW, we do want to allow different transliteration methods between
the same two scripts, as we would otherwise have to take side on
political issues. This is true at least of the Chinese case, and
probably some other too.
Sure!

We'll need some smart preferences system to allow people to use the
many alternatives available. I'm still thinking about it, but I think
it would go like this: We'll add an options page (actually, one on the
site and one on the tagger) that has the following options:

== Quick options: ==
[x] Preferred language [ English |v]
[x] Preferred script [ Latin |v]
[x] Transliterate foreign scripts to preferred one (Latin)
[x] Translate foreign titles when possible
[x] Only use human-verified data (don't try to guess trans*ations).

== Advanced ==
(o) Never translate these languages:
(o) Only translate these languages:
[ user enters list of languages they know/want to keep/translate ]

(o) Never transliterate these scripts:
(o) Only transliterate these scripts:
[ user enters list of scripts they know/want to keep/transliterate ]

* Let me edit these scripts:
[ Katakana Latin Greek ] <- user list
* Let me edit these languages:
[ English French German Japanese Greek ]

[x] Custom list:
| Users can enter custom rules like this:
| Japanese/Kanji -> Japanese/Katakana
| /Cyrillic -> /Latin
| French -> English
| Korean -> Japanese/Katakana

Each rule in the list above requests a translation and/or
transliteration from one language/script to another. Exactly what is
done (or attempted) depends on which parts are given (script and/or
language).
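
As a rough sketch of how the custom rules listed above might be read,
here is some Python. The "Language/Script -> Language/Script" syntax is
taken from the mock-up (without the leading "|" box decoration);
`Spec`, `parse_spec`, `parse_rule` and `operations` are hypothetical
names, and the way the requested operation is inferred is my own
assumption, not a settled design.

```python
from typing import NamedTuple, Optional


class Spec(NamedTuple):
    language: Optional[str]
    script: Optional[str]


def parse_spec(text):
    """Parse "Language/Script", "Language" or "/Script" into a Spec."""
    language, _, script = text.strip().partition("/")
    return Spec(language.strip() or None, script.strip() or None)


def parse_rule(line):
    """Parse one "source -> target" line into a (source, target) pair."""
    source, arrow, target = line.partition("->")
    if not arrow:
        raise ValueError(f"missing '->' in rule: {line!r}")
    return parse_spec(source), parse_spec(target)


def operations(source, target):
    """Infer what a rule asks for from the parts the target specifies."""
    ops = []
    if target.language and target.language != source.language:
        ops.append("translate")
    if target.script and target.script != source.script:
        ops.append("transliterate")
    return ops


RULES = [
    "Japanese/Kanji -> Japanese/Katakana",
    "/Cyrillic -> /Latin",
    "French -> English",
    "Korean -> Japanese/Katakana",
]
for rule in RULES:
    src, dst = parse_rule(rule)
    print(rule, "=>", operations(src, dst))
```

The last rule changes both language and script, so under these
assumptions it asks for a translation followed by a transliteration.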

For the "let me edit these scripts", the view/edit interface changes as follows:
i) If there are any translations to those languages, there'll be
either icons or similar widgets to allow the user to see each of them.
(The other translations are available, just not as easy to pick.) The
"primary" titles are always displayed.
ii) There are also special buttons for "translate to ..." and
"transliterate to ...", with the preferred language/script at the top.
When selected, the user is presented with something similar to the
"Edit all" for VA albums, but instead of the "artist" tags it'll have
the "original" info, and the "titles" fields with the "best-effort"
automatic transl*ation (or the previously-entered content). Checkboxes
to the left allow the user to skip translating certain tracks.

Something similar will happen in the tagger: if the user has any
trans*ation options activated, and such info is available, a little
icon near the album name will allow the user to pick which version
they want. (We might pre-select the one that's currently in tags; for
instance, if I'm tagging the Ghost in the Shell soundtrack and my mp3s
are in Japanese/Latin, that will be preselected, but I'll also be
presented with the options for Japanese/Kana+Kanji and English/Latin.)

If no other version is available and I didn't activate guessing, I'll
just have to go to the site and add a trans*ation, then reload the
album.

(I hereby propose cute little flags as language icons and a remarkable
character for script icons, but that's open for discussion.)
--
Bogdan Butnaru ? ***@gmail.com
"I think I am a fallen star, I should wish on myself." ? O.
Alex Dupuy
2007-10-05 03:24:05 UTC
Permalink
Post by Bogdan Butnaru
Thinking about this just makes me think more and more about an
"alternate titles" feature.
It would seem nice to have a way of entering _smart_ alternate
versions for every "name" in the database. Think of something like the
sort-name or the alias for artists' names, but with an additional
field that defines the meaning of the alternative.
I filed a ticket for something like this several years ago
(http://bugs.musicbrainz.org/ticket/820) although that's only for artist
names/aliases.

I am certainly in favor of providing a more structured and effective way of
representing transl*ations than our current scheme (multiple pseudo-releases
and aliases) but I have several problems with your proposal.

1. While the Next Generation Schema (NGS) may be a long time coming (and may
never be implemented in its full glory) I think any near-term solution
should at least specify how it would be mapped to (or modify) the NGS
objects for artists, releases, songs, etc. and you haven't done this.

2. You're conflating way too many things into one implementation, without
any clear mechanism for choosing which one to use. It's messy enough to
have both language and charset codes taking the same place, without any
apparent distinction in type between "English" and "ASCII", but adding in
"part" breakdowns of names (and you forgot that Ilyich is a patronymic, not
a middle name) is just reaching way too far. Throwing in "as-on-cover" as
well makes it even worse (and conflicts with NGS approach).

You also add explicit "transliteration" and "translation" tags, when those
can be deduced mechanically from the original and alternate language (if
they are the same, it's a transliteration, else a translation) but have no
mechanism for noting the language and script of the original (at least for
artist and track names). While technically there may be some benefit to
providing multiple transliteration-style tags, I don't think this is very
useful. Taking the example of Japanese, I don't think that anybody but a
student of Japanese (who would probably be able to read kanji, or at least
kana, anyhow) would find JSL or Shin-kunrei-shiki transliterations useful,
and Hepburn / common aren't different enough to make it worthwhile to
represent them separately. And nobody is going to need to know which
transliteration style was used; either it will be obvious, or there won't be
any difference (or the user wouldn't know enough to care).
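
The mechanical deduction described above is small enough to sketch in
Python. The `classify` function and the (language, script) tuples are
hypothetical; the "unknown" branch covers the case where the original's
language and script are not recorded.

```python
def classify(original, alternate):
    """Classify an alternate title from (language, script) pairs.

    Different language -> "translation".
    Same language, different script -> "transliteration".
    `original` may be None when nothing is recorded for the main title.
    """
    if original is None:
        return "unknown"
    orig_lang, orig_script = original
    alt_lang, alt_script = alternate
    if orig_lang != alt_lang:
        return "translation"
    if orig_script != alt_script:
        return "transliteration"
    return "same"


print(classify(("Japanese", "Kana"), ("Japanese", "Latin")))
print(classify(("Japanese", "Kana"), ("English", "Latin")))
print(classify(None, ("English", "Latin")))
```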

Mainland Mandarin Chinese transliteration standard (Pinyin) vs. Taiwanese
Mandarin vs. Cantonese etc. is a different issue, but rather than specifying
the transliteration technique, I think it is more useful to use country
and/or newer ISO-639-3 three-letter language codes to distinguish these
(e.g. zh_CN:Latin vs. zh_TW:Latin vs. zh_HK/zh-yue/yue:Latin).

3. While it may seem to add flexibility to have these alternates on a track
level, I think that the increased complexity and possibilities for confusion
outweigh any benefit over a scheme that provides transl*ations for all the
names in a release as one group. Having individual track transl*ations is
like taking a poem and translating (or transliterating) lines on an
individual basis. Making sure that they are all consistent is much more
useful than the very limited space savings or flexibility of only
transl*ating some tracks but not others (or doing them in different ways).
It does make sense to have just one set of DiscIDs and PUIDs/TRMs for an
album that are shared by all the different transl*ations of the titles
(something that NGS already addresses, I believe) but that's different than
mix-n-match titles.

@alex
Bogdan Butnaru
2007-10-05 16:57:41 UTC
Permalink
Post by Alex Dupuy
Post by Bogdan Butnaru
Thinking about this just makes me think more and more about an
"alternate titles" feature.
It would seem nice to have a way of entering _smart_ alternate
versions for every "name" in the database. Think of something like the
sort-name or the alias for artists' names, but with an additional
field that defines the meaning of the alternative.
I am certainly in favor of providing a more structured and effective way of
representing transl*ations than our current scheme (multiple pseudo-releases
and aliases) but I have several problems with your proposal.
1. While the Next Generation Schema (NGS) may be a long time coming (and may
never be implemented in its full glory) I think any near-term solution
should at least specify how it would be mapped to (or modify) the NGS
objects for artists, releases, songs, etc. and you haven't done this.
Hmm. You're right, I didn't mention this. I _think_ the two should be
mostly orthogonal, ie, everything we'd do with "alternates" now we'd
do the same way in the NGS. That is, we would need translations and
other kinds of alternates for the NGS objects, too. The only thing
that might be useful is a new grouping object for related alternates,
though I'm not completely sure of that. So as far as I'm concerned,
this should be independent from NGS, ie it should work pretty much the
same between now and then, and anything that conflicts is probably my
fault.
Post by Alex Dupuy
2. You're conflating way too many things into one implementation, without
any clear mechanism for choosing which one to use. It's messy enough to
have both language and charset codes taking the same place, without any
apparent distinction in type between "English" and "ASCII", but adding in
"part" breakdowns of names (and you forgot that Ilyich is a patronymic, not
a middle name) is just reaching way too far. Throwing in "as-on-cover" as
well makes it even worse (and conflicts with NGS approach).
You're sort-of right. Language-conversion, script-conversion and
charset conversion are logically different operations. I certainly
agree that the terminology should improve (I was just making things up
as I made the proposal).

However, they are obviously operations with a similar "prototype":
given a string, provide a different string that has a similar
(well-defined) meaning, but some different (well-defined) properties.

Each of the examples I gave can be implemented as a special case of
that generic rule (kind of like ARs, or maybe tags), or as a specific
separate feature (like release dates). We can debate on that, but I
think that the first case is more powerful and extensible.

I agree it might be a bit harder to understand at first, but I remind
you that it's not intended for everyone, everywhere: Just like in ARs,
the finer distinctions would only need to be understood by those who
need them (ie, I'm not going to bother with Japanese transliteration
methods; only Japanese-speakers and dedicated fans would).

Those that are generally useful would be presented more visibly. It's
easier IMO to add UI tweaks for specific kinds of alternates than to
add a completely new feature (see tags and ARs again).
Post by Alex Dupuy
You also add explicit "transliteration" and "translation" tags, when those
can be deduced mechanically from the original and alternate language (if
they are the same, it's a transliteration, else a translation) but have no
mechanism for noting the language and script of the original (at least for
artist and track names).
Yes, you can deduce this automatically _if_ you know the
language/script/etc of the "main" title. However, we don't currently
have this. (The 'album language' isn't supposed to be precise, for
instance for releases in multiple languages.)

So, keeping in mind that (1) we have a huge number of tags already
without very good info on what language/etc they're in, and (2) the
alts are not supposed (for now) to be applied to everything (compare
how many 'normal' versus 'pseudo-' releases we have now), I think it's
better to encode this in tags (where needed) rather than try to guess
it. Adding a field and defining the language of every string we have
is a much more expansive kind of change, which is why I avoided it.
There's also the case of original titles that mix languages and
scripts (eg, English/Latin and Japanese/Kana); we can have a
well-defined 'target' (eg, French/Latin), but not a well-defined
source. (Though I agree that's not needed for noticing
transliterations, it might be for other things.)

Second, if we do decide at some point to encode that info somewhere, it
would be very easy to convert the tags to the new format automatically,
since half the info would already be there.

Third, while a mechanical deduction can be made in some cases (notably
for script), it's hard or impossible for others (notably languages).
Post by Alex Dupuy
While technically there may be some benefit to
providing multiple transliteration-style tags, I don't think this is very
useful. Taking the example of Japanese, I don't think that anybody but a
student of Japanese (who would probably be able to read kanji, or at least
kana, anyhow) would find JSL or Shin-kunrei-shiki transliterations useful,
and Hepburn / common aren't different enough to make it worthwhile to
represent them separately.
(a) Only users who are students of Japanese would actually use those
tags, and (b) even if they can tell the difference, the computer can't
(in general), so it's useful to have the info there. This way it can
be instructed (for instance) to use any transliteration to Latin it
has, but prefer one if available.
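
For what it's worth, "use any transliteration it has, but prefer one if
available" could be as simple as the Python sketch below. The tag names
("alt:transliteration:hepburn" and so on) and the `pick_title` function
are hypothetical, and the prefix fallback assumes every tag under that
prefix is a transliteration to Latin.

```python
def pick_title(alternatives, preferred, fallback_prefix):
    """Pick the best available alternative title.

    `alternatives` maps tag name -> title. Tags in `preferred` are tried
    in order; failing that, any tag starting with `fallback_prefix` is
    accepted (in sorted order, so the result is deterministic).
    Returns None when nothing matches.
    """
    for tag in preferred:
        if tag in alternatives:
            return alternatives[tag]
    for tag in sorted(alternatives):
        if tag.startswith(fallback_prefix):
            return alternatives[tag]
    return None


alts = {"alt:transliteration:kunrei-shiki": "Huzisan"}
prefs = ["alt:transliteration:hepburn"]

# Only Kunrei-shiki is available, so it is used as the fallback.
print(pick_title(alts, prefs, "alt:transliteration:"))

# Once a Hepburn version is entered, it takes precedence.
alts["alt:transliteration:hepburn"] = "Fujisan"
print(pick_title(alts, prefs, "alt:transliteration:"))
```

The user never has to look at the tag names; they only matter to the
tagger when more than one version exists.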
Post by Alex Dupuy
And nobody is going to need to know which
transliteration style was used; either it will be obvious, or there won't be
any difference (or the user wouldn't know enough to care).
Again, the tag is there mainly for the computers (eg, so I can tell
Picard that I prefer one of them), not only for people.
Post by Alex Dupuy
Mainland Mandarin Chinese transliteration standard (Pinyin) vs. Taiwanese
Mandarin vs. Cantonese etc. is a different issue, but rather than specifying
the transliteration technique, I think it is more useful to use country
and/or newer ISO-639-3 three-letter language codes to distinguish these
(e.g. zh_CN:Latin vs. zh_TW:Latin vs. zh_HK/zh-yue/yue:Latin).
Of course, nothing prevents us from doing that where it makes sense.
Each "alt:trans*:" tag name would be picked by people who (a) care
about it and (b) know how.
Post by Alex Dupuy
3. While it may seem to add flexibility to have these alternates on a track
level, I think that the increased complexity and possibilities for confusion
outweigh any benefit over a scheme that provides transl*ations for all the
names in a release as one group. Having individual track transl*ations is
like taking a poem and translating (or transliterating) lines on an
individual basis. Making sure that they are all consistent is much more
useful than the very limited space savings or flexibility of only
transl*ating some tracks but not others (or doing them in different ways).
It does make sense to have just one set of DiscIDs and PUIDs/TRMs for an
album that are shared by all the different transl*ations of the titles
(something that NGS already addresses, I believe) but that's different than
mix-n-match titles.
Having 'complete' translations is nice in theory, but I'm not sure
about the practical aspects. Consider these cases:
- releases with multiple languages. It doesn't make lots of sense to
have a complete "to-English" translation of a release that has half of
the songs in English and the others in Arabic.
- also, for multiple languages/scripts on a release, I might know some
of them (thus want them in original), but not others (thus I might
prefer an alternative).
- an editor might only know the translation for some tracks, I don't
see why they shouldn't be able to add it (together with NGS, that
translation can be even more useful on other releases).
- think of the automatic transl*ators I suggested; if the automatic
translation is correct for 9 tracks but not for the tenth, I could
enter that one manually and leave the others to the machines.
- I know you didn't like the idea of "as-on-the-cover" or
"alternate-spelling" alternates (though I don't think I understand
why, can you elaborate?), but if added they could be useful where just
some tracks have misspellings or other issues on the cover.
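
The per-track cases above boil down to a fallback chain for each track,
which might look like this in Python. `resolve_titles` and its
parameters are hypothetical names, and the track titles are
placeholders.

```python
def resolve_titles(originals, human, automatic, allow_guessing=True):
    """Choose one title per track.

    For each track index a human-entered alternate wins, then an
    automatic one (only if guessing is allowed), then the original.
    `human` and `automatic` map track index -> title and may be sparse.
    """
    resolved = []
    for index, original in enumerate(originals):
        if index in human:
            resolved.append(human[index])
        elif allow_guessing and index in automatic:
            resolved.append(automatic[index])
        else:
            resolved.append(original)
    return resolved


originals = ["orig-1", "orig-2", "orig-3"]
automatic = {0: "auto-1", 1: "auto-2"}   # machine output for two tracks
human = {1: "human-2"}                   # editor corrected the second one

print(resolve_titles(originals, human, automatic))
print(resolve_titles(originals, human, automatic, allow_guessing=False))
```

A track with no alternate at all simply keeps its original title, which
also covers the half-English, half-Arabic release.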

-- Bogdan Butnaru ? ***@gmail.com
"I think I am a fallen star, I should wish on myself." ? O.
