Microsoft Translator Hub - FAQ


    1. When should I request deployment of a trained translation language model?
    It may take several trainings to create the optimal translation system for your project. You may want to try using more training data, more or different additional target language material, or more carefully filtered data. Be very strict and careful in designing your tuning set and your test set, so that they are fully representative of the terminology and style of the material you want to translate. You can be more liberal in composing your training data, and experiment with different options. Request deployment of a training when you are satisfied with the training results, have no more data to add to improve your trained system, want to access the trained system via APIs, and/or want to involve your community to review and submit translations.

    2. How can I skip the sentence alignment and sentence breaking steps in MT Hub if my data is already sentence aligned?
    MT Hub skips sentence alignment and sentence breaking for .tmx files and for text files with the “.align” extension. “.align” files give users an option to skip MT Hub’s sentence breaking and alignment process for files that are perfectly aligned and need no further processing. We recommend using the “.align” extension only for files that are perfectly aligned.
    If the number of extracted sentences does not match for parallel documents, the Hub will still run the sentence aligner on “.align” files.
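
    As a rough illustration of preparing such files, the Python sketch below writes a pair of parallel “.align” files with one segment per line and checks that both files have the same number of lines before upload. The function names, the file names and the UTF-8 encoding are assumptions made for this example; they are not part of MT Hub.

```python
from pathlib import Path


def write_align_pair(pairs, source_path, target_path):
    """Write parallel segments to two files, one segment per line."""
    source_lines = []
    target_lines = []
    for source, target in pairs:
        # A line break inside a segment would shift every later line,
        # so collapse all internal whitespace to single spaces.
        source_lines.append(" ".join(source.split()))
        target_lines.append(" ".join(target.split()))
    Path(source_path).write_text("\n".join(source_lines) + "\n", encoding="utf-8")
    Path(target_path).write_text("\n".join(target_lines) + "\n", encoding="utf-8")


def line_counts_match(source_path, target_path):
    """Return True when both files contain the same number of lines."""
    def count(path):
        with open(path, encoding="utf-8") as handle:
            return sum(1 for _ in handle)
    return count(source_path) == count(target_path)
```

    For example, write_align_pair(pairs, "sample_en.align", "sample_de.align") followed by line_counts_match("sample_en.align", "sample_de.align") gives a quick sanity check; a mismatch means the files are not line-aligned.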

    3. Is there a way to upload a TMX file and get it machine translated on the server side?
    The machine translations can be viewed via the test console or can be retrieved via an API. We do not currently offer a direct TMX translation utility.

    4. When can I expect my trainings to be deployed? Is the 6-business-day requirement for deployment a hard constraint?
    The 6 business days for deployment is not a hard constraint, and you might see your training deployed sooner.
    We have improved the deployment process so that all deployment requests submitted before 1 AM PST are processed on the same day. On some days in a month, unplanned maintenance or a planned release may delay processing of a deployment request by 2-3 days. In such a case, we will keep you informed if your deployments are impacted.

    5. I tried uploading my TMX, but it said "document processing failed"!
    Please ensure that the TMX file conforms to specification 1.1 or 1.4b: http://www.localization.org/tmx/tmx.htm

    6. How much time will it take for my training to complete?
    Training time depends on two factors: the amount of data used for training and whether you choose to use Microsoft models. The time taken for training is directly proportional to the amount of data used to train a system. Using MS models also increases the training time, because the MS models are very large. Typically, a training with MS models takes anywhere from 4 to 12 hours to complete. Trainings without MS models may complete in less than 6 hours.

    7. Can the deployed trainings be accessed via APIs?
    Yes. Deployed trainings can be accessed programmatically via the Microsoft Translator API (specifying the category). Details of the API can be found at the following link
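
    To show where the category fits in a request, here is a minimal Python sketch that only builds a query string; it does not call any service. The “category” and “contentType” parameters are the ones named in this FAQ. The base URL is a placeholder, the “text”, “from” and “to” parameter names are assumptions, and authentication is left out, so check the API documentation for the real endpoint and parameters.

```python
from urllib.parse import urlencode


def build_translate_url(base_url, text, source_lang, target_lang, category,
                        content_type="text/plain"):
    """Return a GET URL that carries the category of a deployed training."""
    query = urlencode({
        "text": text,            # assumed parameter name
        "from": source_lang,     # assumed parameter name
        "to": target_lang,       # assumed parameter name
        "contentType": content_type,
        "category": category,    # identifies the deployed training
    })
    return f"{base_url}?{query}"
```

    A call such as build_translate_url("https://example.invalid/Translate", "Hello world", "en", "de", "my-category-id") returns a URL whose query string includes category=my-category-id.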

    8. Why do the results from the “Test Translation” page of MS Translator Hub differ from those returned by the MS Translator API for the same MS Translator Hub system? Is it due to the difference between the two content types "text/plain" and "text/html"?
    Yes. The web interface in the Hub uses contentType="text/plain". In plain text, tags that look like <one letter><number> are left untouched and move with the word they are next to. This may result in a tag ordering that would be illegal in XML. Tags in any other format are not treated as tags. The Hub forces all tags it sees in the sample documents into the <one letter><number> format, but the API does not.

    In text/html, proper HTML processing is done, and tags will be in legal order with legal nesting. However, you must pass balanced HTML, and self-closing tags will be expanded in the process. You will want to use text/plain for most content, except when you have balanced HTML, or balanced XML that you can transform to HTML. With contentType=text/html you may also exclude any span of text from translation by using the notranslate attribute.

    When using HTML, the engine does a better job at positioning the tags properly. If you use plain text and have tags in there, you will need to ensure the correct tag placement yourself.
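
    One way to decide between the two content types before calling the API is to test whether a fragment is balanced. The Python sketch below does this with the standard library's html.parser module. It is only an approximation written for this example: the function name and the list of void elements are choices made here, and the service may apply different rules.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class _BalanceChecker(HTMLParser):
    """Track open tags and flag any closing tag that is out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.saw_tag = False
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.saw_tag = True
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        self.saw_tag = True
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.balanced = False


def choose_content_type(fragment):
    """Return "text/html" only for markup whose tags are properly nested."""
    checker = _BalanceChecker()
    checker.feed(fragment)
    checker.close()
    if checker.saw_tag and checker.balanced and not checker.open_tags:
        return "text/html"
    return "text/plain"
```

    With this check, a fragment such as <p>Hello <b>world</b></p> is sent as text/html, while a fragment with an unclosed or misnested tag falls back to text/plain, where tag placement remains your responsibility.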

    9. My training failed! How can I avoid my trainings from failing?
    Trainings can fail if they do not meet the minimum number of required sentences in the testing, training or tuning data. The minimum number of aligned sentences required for a training to succeed is 500 for the tuning set and the testing set, and 2000 for the training set.
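
    These minimums can be checked before a training is submitted. The Python sketch below is a hypothetical helper, not an MT Hub feature. It reads the 500-sentence minimum as applying to the tuning set and the testing set separately, and it expects aligned sentence counts that you have computed yourself.

```python
# Minimum aligned sentences per data set, taken from the answer above.
MINIMUM_SENTENCES = {"training": 2000, "tuning": 500, "testing": 500}


def find_shortfalls(sentence_counts):
    """Return {set_name: missing_sentences} for every set below its minimum."""
    shortfalls = {}
    for name, minimum in MINIMUM_SENTENCES.items():
        count = sentence_counts.get(name, 0)
        if count < minimum:
            shortfalls[name] = minimum - count
    return shortfalls
```

    An empty result means all three sets meet the minimum; otherwise the result shows how many more aligned sentences each set needs.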

    If your training fails with the message “An error occurred while building the translation system. Please try again after some time. If you continue to get the same error, please email mthubsup@microsoft.com.”, we recommend waiting a few hours before re-submitting the system for training. If you encounter these errors on a regular basis and the Hub team has not already reached out to you, send an email to mthubsup@microsoft.com.

    10. How many trainings can be deployed in a Project?
    Only one training can be deployed per project. It may take several trainings to create an accurate translation system for your project, and we encourage you to request deployment of the training that gives you the best result. You can judge the quality of a training by its BLEU score and by consulting with reviewers before deciding that the quality of translations is suitable for deployment.

    11. How does BLEU work? Is there a reference for the BLEU score? For example, what is a good score, and what is the range?
    BLEU is a measurement of the differences between an automatic translation and one or more human-created reference translations of the same source sentence.  The BLEU algorithm compares consecutive phrases of the automatic translation with the consecutive phrases it finds in the reference translation, and counts the number of matches, in a weighted fashion. These matches are position independent. A higher match degree indicates a higher degree of similarity with the reference translation. Intelligibility and grammatical correctness are not taken into account. BLEU’s strength is that it correlates well with human judgment by averaging out individual sentence judgment errors over a test corpus rather than attempting to devise the exact human judgment for every sentence.

    All that being said, BLEU results depend strongly on the breadth of your domain, the consistency of the test data with the training and tuning data, and how much data you have available to train. If your models have been trained on a narrow domain, and your training data is very consistent with your test data, you can expect a high BLEU score. Please note that a comparison between BLEU scores is only justifiable when the results come from the same test set, the same language pair, and the same MT engine. A BLEU score from a different test set is bound to be different.

    For further discussion on BLEU score, please see here
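
    To make the phrase matching concrete, here is a simplified, sentence-level Python sketch of a BLEU-style score against a single reference. It is not the Hub's scoring code. It assumes whitespace tokenization, equal weights for 1- to 4-word phrases, no smoothing, and the usual brevity penalty for short outputs, so its values will not match the scores the Hub reports over a whole test corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Score one candidate sentence against one reference, from 0.0 to 1.0."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate phrase count by its count in the reference,
        # so repeating a matching phrase earns no extra credit.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(matched / total)
    # Brevity penalty: outputs shorter than the reference are penalized.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum / max_n)
```

    An output identical to the reference scores 1.0, an output sharing no four-word phrase with the reference scores 0.0 in this unsmoothed form, and partial overlaps fall in between. Word order inside a phrase matters, but where the phrase appears in the sentence does not.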

    12. Do the corpora need to be perfectly aligned at sentence boundaries?  Though the corpora are aligned by verse, they do not always match at the sentence level.  For example, a given verse might be one sentence in English, but two sentences in the target language.
    Where a given verse is one sentence in English but two sentences in the target language, put the whole verse on one line in each file and upload the files with the “.align” extension. Sentences in an “.align” file are not broken at sentence-end punctuation such as “.” or “;”, so you can safely handle such cases with “.align” files. In “.align” files, a line break (the Enter key) marks the end of a line / sentence.

    13. How can I download community translations?
    Community translations can be downloaded via the “Download Community Translations” link found in the “Review Corrections” section of a deployed system. Please note this link is only accessible via a deployed system.

    14. Is there a feature in MT HUB which would enable a project owner to approve all the submitted translations?
    Yes. Translations provided by the community or reviewers can be approved all at once. To do so, navigate to the “Review Corrections” section of a deployed system and select the “Suggested” radio button to view all the submitted translations. To approve all of them, select the checkbox and click the “Approve checked” button.

    15. The PDF file I tried to upload failed with an error saying it might be corrupt.
    The PDF file that failed to upload may be a secured PDF file. Currently the Hub cannot extract sentences from a secured PDF file.

    16. My TMX file fails to upload with an unknown language error message.
    MT Hub looks for RFC 3066 compliant language codes in TMX files. This error occurs when the TMX file contains a wrong language code, such as “ES-EM” instead of “ES-ES”, or a code in the wrong format, such as “en_US” instead of the “en-US” expected by MT Hub.
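
    The format half of this problem can be caught before upload. The Python sketch below is a hypothetical pre-check, not part of MT Hub: it replaces underscores with hyphens and tests the result against the general tag syntax (a primary subtag of 1 to 8 letters, followed by optional hyphen-separated subtags of 1 to 8 letters or digits). It checks syntax only, so a well-formed but nonexistent code such as “ES-EM” still passes and has to be verified by hand.

```python
import re

# Syntax only: primary subtag of 1-8 letters, then optional subtags of
# 1-8 letters or digits, separated by hyphens.
_TAG_PATTERN = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def normalize_language_code(code):
    """Return the code with "_" replaced by "-", or None if it is malformed."""
    candidate = code.strip().replace("_", "-")
    if _TAG_PATTERN.fullmatch(candidate):
        return candidate
    return None
```

    For instance, normalize_language_code("en_US") returns "en-US", while a code containing a space or other stray characters returns None.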

    17. Uploading a gz file gives an error: “The document has no extension. Please upload a document with a supported file extension.”
    Certain versions of gz files are not supported by the MT Hub gz extractor. The workaround is to create a new gz file in 7Zip.


    The Microsoft Translator team

    Wednesday, March 14, 2012 10:28 PM