Wu Phonetics Corpus | 吳國之記事 (Annals of Wu)

Introduction
One of the hurdles in learning Shanghainese, or for that matter any dialect of Wu, is the lack of easily accessible data. There are services that provide phonetic transcription based on character input, but often the results given are some proprietary form of pinyin. For more phonetically accurate results, i.e. IPA, there are books that provide that information, though often for only a limited number of characters.

In an effort to fix that, I’ve compiled a list of characters with their corresponding IPA pronunciation. It’s a tab-delimited text file in UTF-8 encoding. The original file is loosely based on a similar list of about 450 entries provided by Tatoeba.org.
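As a rough illustration of how a tab-delimited UTF-8 file like this could be read, here is a minimal Python sketch. The column order (character first, IPA second) and the file name in the comment are assumptions for the sake of the example, not a description of the actual file. The two sample rows reuse readings from the example sentence in the Usage section.

```python
import csv
import io


def load_corpus(stream):
    """Read tab-delimited rows into a dict mapping character -> list of IPA readings.

    Assumes column 1 is the character and column 2 is its IPA reading;
    the real file's layout may differ.
    """
    table = {}
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip blank or incomplete lines rather than failing on them.
        if len(row) < 2 or not row[0].strip():
            continue
        char, ipa = row[0].strip(), row[1].strip()
        # A character may have more than one reading, so keep a list.
        table.setdefault(char, []).append(ipa)
    return table


# In-memory stand-in for the file. With a real file one would use something like
# open("wu_ipa.txt", encoding="utf-8", newline="") (hypothetical file name).
sample = io.StringIO("伊\tɦi⁵³\n来\tlɛ⁵³\n")
corpus = load_corpus(sample)
print(corpus)
```

Keeping a list per character leaves room for characters with several readings, which a plain one-to-one dict would silently overwrite.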

Content
The data set covers the most commonly used characters for writing Wu, as well as a number of other characters to cover things like family names and Wu-specific modal particles (语气词). It started as a list of just over 450 entries, quickly expanded to 1,400, then to just over 5,300, and now stands at over 7,520. More entries are continually being added.

Usage
Who uses this? For starters, this data set has been integrated into Tatoeba.org, both in its Shanghainese-to-IPA tool and in its general Shanghainese sentences. Sentences entered on the site using characters will be converted as below.

伊杭州来个。
ɦi⁵³ ɦɑ̃⁵³ ʦɤ lɛ⁵³ gəˀ¹²
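A conversion like the one above can be approximated with a simple character-by-character lookup. The sketch below is only that: it is not Tatoeba's implementation, the function name `to_ipa` is made up for illustration, and the small table holds just the five readings shown in the example. A real converter might also need to deal with multiple readings per character or tone changes in connected speech, which this ignores.

```python
# Readings copied from the example sentence; a full table would come from the corpus file.
READINGS = {
    "伊": "ɦi⁵³",
    "杭": "ɦɑ̃⁵³",
    "州": "ʦɤ",
    "来": "lɛ⁵³",
    "个": "gəˀ¹²",
}


def to_ipa(sentence, readings):
    """Look up each character and join the readings with spaces.

    Characters without an entry (punctuation, unknown characters) are skipped.
    """
    syllables = []
    for char in sentence:
        if char in readings:
            syllables.append(readings[char])
    return " ".join(syllables)


result = to_ipa("伊杭州来个。", READINGS)
print(result)
```

Skipping unknown characters keeps the output clean for punctuation, though a stricter version could instead pass them through unchanged so that gaps in the table are visible.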

It will also be included as part of the upcoming release of the Eclectus dictionary created by Christoph Burgmer and the related cjklib project.

Expect to see the data appear elsewhere in the near future.

If you’re interested in using the data for your project, send me an email at kellenparker在sinoglot.com explaining what the project is and how you plan to use the data.

The only thing I ask is that you credit me in some way for the many, many hours I’ve put into collecting the data. I’m releasing this under the Creative Commons CC-BY license.

Plans
I’m looking in a few different directions as to what else to do to improve the data. I don’t want to get too much into it just yet, but keep an eye on this space for updates.

Thanks
Thanks to Allan Simon of Tatoeba.org for providing me with an initial set of 450+ entries and for allowing me to contribute to Tatoeba’s data set. Thanks also to Christoph Burgmer for helping work out some kinks and for being willing to include the data in Eclectus.



Except where otherwise noted, content on this site is
licensed under a Creative Commons Attribution 3.0 License
 
   
© 2009-2010 Kellen Parker. Annals of Wu is part of the Sinoglot network.