Fine-tuning using a glossary

  • Question

  • Hello, 

    Please clarify if a relatively small glossary can be used to "hard-wire" or "lock in" some proper names and fixed constructions in the trained engine. As I understand, this can be done after cloning and launching a training with the glossary included in the training data set. 

    I think I saw a post somewhere on the MS website saying this should be in Excel with specific row headings, but I cannot locate that source now. Please provide a link explaining the procedure and the format for that glossary in detail. 

    Many thanks in advance, 

    Sergei

    Thursday, February 5, 2015 10:06 AM

Answers

  • Hi Sergei,

    The use of the dictionary is described in section 3.3.2 of the Hub user guide: https://hub.microsofttranslator.com/Help/Download/Microsoft%20Translator%20Hub%20User%20Guide.pdf. The part that is relevant to your question:

    To use a dictionary, follow the steps listed here.
    1. Create a dictionary of terms using Microsoft Excel.
    a. Create an Excel file. In this release, Hub supports only “.xlsx” files created using Excel 2007 and later. This file contains a list of source-language terms and a list of corresponding target-language equivalents in the first sheet of the Workbook. Other sheets in the workbook will be ignored.
    b. In cell A1 of the first sheet, enter the 3-letter or 2-letter ISO language code for the source language (e.g., “enu” or “en” for English).
    c. In cell B1 of the first sheet, enter the 3-letter or 2-letter ISO language code for the target language (e.g., “esn” or “es” for Spanish).
    d. Enter the source language terms in Column A, and the equivalent translations for these terms in the target Language in Column B. HTML tags in the dictionary will be ignored. The image below shows an Excel file containing a dictionary of terms mapped from English to Spanish.
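
Steps a to d above can be sketched in Python. This is a minimal sketch and not part of the Hub user guide: `dictionary_rows` and `write_xlsx` are hypothetical helper names, the language-code check only tests for two or three letters, and `write_xlsx` assumes the third-party openpyxl package is installed. A file produced this way has not been verified against the Hub.

```python
import re


def dictionary_rows(source_lang, target_lang, term_pairs):
    """Build the rows of the first sheet: language codes in row 1, terms below."""
    for code in (source_lang, target_lang):
        # Assumption: a code is acceptable if it is 2 or 3 ASCII letters.
        if not re.fullmatch(r"[A-Za-z]{2,3}", code):
            raise ValueError(f"expected a 2- or 3-letter language code, got {code!r}")

    rows = [[source_lang, target_lang]]  # becomes cells A1 and B1
    for source_term, target_term in term_pairs:
        source_term, target_term = source_term.strip(), target_term.strip()
        if not source_term or not target_term:
            raise ValueError("every entry needs both a source and a target term")
        rows.append([source_term, target_term])  # column A, column B
    return rows


def write_xlsx(rows, path):
    """Write the rows to the first sheet of a new .xlsx workbook."""
    from openpyxl import Workbook  # third-party: pip install openpyxl

    workbook = Workbook()
    sheet = workbook.active  # the first (and only) sheet
    for row in rows:
        sheet.append(row)
    workbook.save(path)


if __name__ == "__main__":
    rows = dictionary_rows("en", "es", [("hard drive", "disco duro")])
    write_xlsx(rows, "dictionary.xlsx")
```

Keeping the row construction separate from the file writing makes it possible to check the glossary contents before any workbook is created.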

    Please consider the warnings about dictionary use that are listed in the user guide. It is always better to teach the system with actual prose than with dictionaries. Use a dictionary only if, and only for as long as, you have no actual prose that shows your preferred terminology.

    Chris Wendt
    Microsoft Translator


    Thursday, February 5, 2015 5:07 PM

All replies

  • Thanks a lot, Chris! I knew it would be simple enough to find it, but when you read a lot of stuff during the day, you tend to lose the trail. 

    Many thanks again, 

    Sergei

    Thursday, February 5, 2015 5:16 PM
  • Hello Chris, 

    Per your advice, I used a small glossary to try to force certain translations on the MT output:

    Please note placeholders 

    However, the output of the trained system still looks like this: 

    Please note that the engine adds a space between the two symbols of the placeholder, even though this particular placeholder had been added to the glossary.

    Could you explain what is wrong and how that can be corrected?

    Thanks,

    Sergei 


    Friday, February 13, 2015 9:29 AM
  • Hello Chris, 

    I never got an answer to the question above. 

    I'd appreciate your looking into this at your convenience. 

    Many thanks, 

    Sergei

    Saturday, February 28, 2015 9:59 AM
  • Hi Sergei,

    Use placeholders that look like an email address, a Twitter handle, or a tag:

    @handle

    #tag

    Consider the warnings about dictionary use in the Hub user guide.
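
One way to apply this suggestion is to swap the original placeholders for tag-like tokens before translation and restore them afterwards. The sketch below is only an illustration: the `%s`/`%d` placeholder pattern, the `#ph0`-style token format, and the function names `mask` and `unmask` are all assumptions, and whether the engine leaves such tokens intact has to be checked on real output.

```python
import re

# Hypothetical placeholder syntax; adjust the pattern to the real placeholders.
PLACEHOLDER = re.compile(r"%[sd]")


def mask(text):
    """Replace each placeholder with a tag-like token such as #ph0, #ph1, ..."""
    found = []

    def to_token(match):
        found.append(match.group(0))
        return f"#ph{len(found) - 1}"

    return PLACEHOLDER.sub(to_token, text), found


def unmask(text, found):
    """Put the original placeholders back in place of the #phN tokens."""
    return re.sub(r"#ph(\d+)", lambda m: found[int(m.group(1))], text)


if __name__ == "__main__":
    masked, found = mask("Copied %d files to %s")
    print(masked)                 # text to send for translation
    print(unmask(masked, found))  # restored after translation
```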

    HTH,
    Chris Wendt
    Microsoft Translator

    Saturday, February 28, 2015 5:24 PM
  • Hello Chris,

    I have a similar issue that I am also trying to resolve with a dictionary. In the example below, the engine adds a space between a name and the symbol that follows it.

    Input: "3rd Generation Intel® Core™ i7-37xx Processors"

    Output (French): "3e génération de processeurs Intel ® Core ™ i7-37xx"

    I've added "Intel®" and "Core™" as separate entries in the dictionary, but it doesn't seem to have any effect, and this happens with most languages. Is there any way to fix this issue?

    thanks,

    --Yoshi

    Friday, December 11, 2015 6:20 PM
  • Hi Yoshi,

    This is most likely influenced by the tokenization before the actual translation. Can you try adding

    Intel ® Core ™    --> Intel® Core™

    to the dictionary, with the spaces on the source ("from") side?

    I haven't actually tried it. I'll verify later whether this works...
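
The spaced source side of such an entry can be generated rather than typed by hand. The sketch below assumes that tokenization separates symbols such as ® and ™ from the preceding word, which is the hypothesis above and has not been confirmed; `spaced_variant` and `dictionary_pair` are hypothetical helper names.

```python
import re

# Symbols assumed to be split off from the preceding word by the tokenizer.
SYMBOLS = "®™©"


def spaced_variant(term):
    """Insert a space before each listed symbol that directly follows a non-space character."""
    return re.sub(rf"(?<=\S)([{SYMBOLS}])", r" \1", term)


def dictionary_pair(term):
    """Return (source side with spaces, desired output without spaces)."""
    return spaced_variant(term), term


if __name__ == "__main__":
    print(dictionary_pair("Intel® Core™"))
```

Each returned pair corresponds to one dictionary row: the first element goes in column A and the second in column B.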

    Chris Wendt
    Microsoft Translator

    Monday, December 14, 2015 6:17 PM
  • Hi,

    I have read the information about using a dictionary in the manual, and it seems the dictionary has to be uploaded before training to be taken into account. I would like to know whether, after the models are built, it is possible to force the translation of some terms. This would be a glossary applied at run time rather than at training time, which would allow very fast updates. The terms and their required translations could be submitted with the translation job. Is this at all possible?

    Many thanks,

    Patrik Lambert

    Wednesday, January 13, 2016 11:58 AM
  • Hi Patrik,

    The runtime dictionary is explained here:

    https://social.msdn.microsoft.com/Forums/en-US/71c306e1-666c-43f1-b433-9a11d7e220d1/new-feature-dynamic-dictionary?forum=microsofttranslator

    I took a note to add a reference to this option in the Hub user guide as well.
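
For readers who want to prepare such requests programmatically, here is a minimal sketch. It assumes the dynamic dictionary is expressed as inline markup of the form `<mstrans:dictionary translation="...">term</mstrans:dictionary>` around each term in the text to translate; check the linked post for the exact syntax and its restrictions. The function name `apply_dynamic_dictionary` is hypothetical, and the sketch matches terms literally and case-sensitively.

```python
import re
from html import escape


def apply_dynamic_dictionary(text, glossary):
    """Wrap every glossary term found in text in dynamic-dictionary markup."""
    if not glossary:
        return text

    # Try longer terms first so that a short term does not split a longer one.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))

    def wrap(match):
        term = match.group(0)
        translation = escape(glossary[term], quote=True)  # safe inside the attribute
        return f'<mstrans:dictionary translation="{translation}">{term}</mstrans:dictionary>'

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(apply_dynamic_dictionary(
        "The word wordomatic is a dictionary entry.",
        {"wordomatic": "wordomatic"},
    ))
```

Because the glossary is applied to each request, it can be changed without retraining, which is the fast-update behavior asked about above.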

    Chris Wendt
    Microsoft Translator

    Wednesday, January 13, 2016 3:52 PM