Well, things have been busy as always. There are finals this week, and I have been busier than a bee with everything going on. Stick around to the end of this post if you want to see a tiny human 😉
Okay, so as I mentioned at the end of my last update, I've been spending some time learning about machine translation: basically, how Google Translate does what it does, and how I might be able to do something better for machine-translating webnovels. If you're even a little bit curious, I'll try to make this as simple as I can. If you have any questions, feel free to ask me in the comments, or send me an email at tynkerd at gmail. Here goes!
This is an image depicting something called a "sequence to sequence" model. It's a type of Neural Machine Translation (NMT) model, loosely inspired by how the human brain processes language. Basically, it takes a properly "prepared" input sentence and uses the current word, along with memory of past and future words, to figure out the context of the text and come up with a sentence that best corresponds to it. This is fairly new tech, and it beats the older rule-based ("grammar rules") and statistical ("probability of being the right sentence") models. Google started using it at the end of 2016!
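To make the encoder/decoder idea a little more concrete, here's a toy sketch in NumPy: an encoder folds the source words into a single "context" vector, and a decoder unrolls from that context to emit target words one at a time. Everything here (the vocab sizes, the random untrained weights, the greedy decoding loop) is made up purely for illustration; a real NMT model is vastly bigger, learns its weights from data, and uses fancier machinery like attention.

```python
# Toy sequence-to-sequence sketch: all sizes and weights are illustrative
# assumptions, and the weights are random (untrained), so the "translation"
# is gibberish -- the point is only to show the data flow.
import numpy as np

rng = np.random.default_rng(0)
src_vocab, tgt_vocab, hidden = 10, 12, 8  # toy sizes (assumptions)

W_emb = rng.normal(size=(src_vocab, hidden))  # source word embeddings
W_enc = rng.normal(size=(hidden, hidden))     # encoder recurrence
W_dec = rng.normal(size=(hidden, hidden))     # decoder recurrence
W_out = rng.normal(size=(hidden, tgt_vocab))  # hidden state -> target words

def encode(src_ids):
    """Fold the source word IDs into a single context vector."""
    h = np.zeros(hidden)
    for i in src_ids:
        h = np.tanh(W_emb[i] + W_enc @ h)  # current word + memory of the past
    return h

def decode(context, max_len=5):
    """Greedily emit target word IDs starting from the context vector."""
    h, out = context, []
    for _ in range(max_len):
        h = np.tanh(W_dec @ h)
        out.append(int(np.argmax(h @ W_out)))  # pick the "most likely" word
    return out

translation = decode(encode([1, 4, 2]))
print(translation)  # five target-vocab word IDs
```

Training is the part this sketch leaves out entirely: the whole trick is adjusting those weight matrices over millions of sentence pairs until the decoder's output matches real translations.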
Google Translate, IBM's Watson…all the machine translation systems that use NMT have one thing in common: they need tons and tons of data. For Ja-En translation, that means tons of Japanese-English translated sentence pairs that can be fed into the system to teach it how a Japanese sentence should end up translated into English. Google is great at gathering this kind of data, but that data probably doesn't talk a lot about magic, demons, or zombification, so you end up with rougher translations.
This is where an open-source translation library known as OpenNMT comes in. There are several options out there, but I've found it by far the easiest to use. It also scales well with NVIDIA GPUs, which makes processing/learning times much faster. (20 hours of CPU training took 2 hours on a 750 Ti.)
I'll speed things up a little. Once you get this system up and running, you basically set a bunch of hyper-parameters for how you want the training to go, and then feed it a bunch of prepared Ja-En sentence pairs.
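"Prepared" mostly means getting your data into the plain format these toolkits expect: two text files, one tokenized sentence per line, where line N of the Japanese file matches line N of the English file. Here's a tiny sketch of that step; the sentences and file names are my own toy examples, and the space-separated tokenization is an assumption (real Japanese tokenization needs a morphological analyzer).

```python
# Minimal sketch of preparing line-aligned parallel data files.
# The sentence pairs and file names are illustrative, not real training data.
from pathlib import Path

pairs = [
    ("その 国 は 戦争 の 準備 を して いた 。", "the country was preparing for war ."),
    ("猫 が 好き です 。", "i like cats ."),
]

src_lines = [ja for ja, _ in pairs]
tgt_lines = [en for _, en in pairs]

# One sentence per line; line N of each file must correspond to the same pair.
Path("train.ja").write_text("\n".join(src_lines) + "\n", encoding="utf-8")
Path("train.en").write_text("\n".join(tgt_lines) + "\n", encoding="utf-8")

# Sanity check: if the files fall out of alignment, the model learns garbage.
assert len(src_lines) == len(tgt_lines)
```

From there, the toolkit's preprocess/train commands take over; the hyper-parameters (layer sizes, training epochs, etc.) are just knobs on that training run.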
I used this small_parallel_enja Ja/En parallel data set to train a model with…decent "beginner-level" results, in about 30 minutes. Here are some results from text I wanted my own DIY translation model to translate:
As you can see:
“the country was preparing for the war”
should be: The country was preparing for war.
Not bad for 30 minutes of training on a tiny 50k parallel corpus. Some sentences were craaazzy though.
“you know if you’re in which time”
should be: As time passes, you’ll come to know which is right.
So, here's the issue I'm working on. The biggest challenge, I think, for webnovel translation using an NMT system is vocabulary, and example sentences that relate to that vocabulary. So I'm working on a way to strip the text of an entire novel out of the syosetsu website, break the text down to pick out all the "vocab" words, then send those words to a website like ejje.weblio.jp to get 20-50+ example Ja/En sentences for each word. Like this:
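The "pick out the vocab words" step might look something like the sketch below. A real pipeline would use a proper morphological analyzer like MeCab to split Japanese correctly; here I fake it with a regex over kanji runs just to show the shape of the idea. The chapter text is made up, and the weblio lookup is only a placeholder (actually fetching from ejje.weblio.jp would need a scraper and polite rate limiting).

```python
# Rough sketch of vocab extraction from a chapter of text.
# The regex "tokenizer" and sample text are assumptions for illustration;
# real Japanese tokenization should use a tool like MeCab.
import re
from collections import Counter

chapter = "魔王は魔法で王国を滅ぼした。勇者は魔法の剣で魔王に挑んだ。"

# Crude stand-in for tokenization: grab runs of kanji as candidate vocab words.
words = re.findall(r"[\u4e00-\u9fff]+", chapter)
vocab = Counter(words)

# Most frequent words first -- the ones most worth fetching examples for.
for word, count in vocab.most_common():
    print(word, count)

def fetch_example_sentences(word, n=20):
    """Placeholder for the weblio lookup step (hypothetical, not implemented)."""
    raise NotImplementedError("would scrape ejje.weblio.jp here")
```

Ranking by frequency matters because a novel reuses its core vocabulary (names, spells, titles) constantly, so example sentences for the top words give the most training value per lookup.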
If everything goes well, there should be millions upon millions of parallel sentences available for training, all directly related to the content of the webnovel. That should make for translations more accurate than Google Translate, because all the data directly relates to the novel!
Anyway, this is just something I'm working on when I have a few minutes. On a fun note, the trained model is saved as a single file, so anyone can run it and translate their own sentences, as long as they have the proper environment set up.
And, because I know you’ve been waiting for this…I’ll be posting a translation for “Amaterasu” this Saturday or Sunday. (Exams/Reports due this week…sorry…X_x)
Alright, now for the tiny human 😉