Transliterating Language Fonts (Indic Languages)

  • Question

  • Hi Experts,

    Greetings. I work for Microsoft and my colleague Mr. Stuart Stuple (Sr. Program Manager for MS Office) suggested that I write to this forum to get some thoughts on how to achieve the following for a personal project of mine.

    In my spare time, I do a bit of community service by bringing out many rare Sanskrit documents in a Devanagari Unicode font. I write commentary on the Sanskrit content and create a PDF, with all formatting, from an MS Word 201 file for distribution. (I use the iTRANS encoding scheme for encoding Sanskrit.)

    I often get requests from my distribution list to provide the same document with the Devanagari replaced by a different Indian language (e.g. Tamil). I wrote an MS Word macro that tries to transliterate by searching for Devanagari font/characters and replacing them with the desired language's font. The VBA script offsets the Unicode character values after due mapping and saves the modified file as a different PDF with a language extension. But this did not work; that is when I realized my ignorance of Unicode fonts and their complexities, and I contacted Stuart Stuple about it.

    Stuart advised that a character displayed in the document may consist of one or more glyphs, that each glyph may correspond to one or more Unicode character values, and that it is not possible to control these directly from within an MS VBA macro. He suggested that I write to this group to seek ideas for achieving this result.

    Thanks a ton in advance for your help. Please send me a mail to kmurali_sg@yahoo.com with your inputs.

    Thanks & Regards,

    Krishnan Muralidharan

    Monday, September 5, 2011 5:48 AM

All replies

  • Hi Krishnan,

    Thanks for your post.

    We are doing research on this issue. If we have any updates, we will post here. Thanks for your understanding.

    Have a nice day.

    Best regards


    Liliane Teng [MSFT]

    Thursday, September 8, 2011 10:12 AM
  • Hi Krishnan:

     

    Could you please email me internally? I think the following article, with some modification, might be what you are looking for: http://support.microsoft.com/kb/886954

     

    The following Visual Basic for Applications macro locates all combinations of characters and combining diacritic marks. It replaces each combination with a string. The string is the text "HELLO."

    Microsoft provides programming examples for illustration only, without warranty either expressed or implied. This includes, but is not limited to, the implied warranties of merchantability or fitness for a particular purpose. This article assumes that you are familiar with the programming language that is being demonstrated and with the tools that are used to create and to debug procedures. Microsoft support engineers can help explain the functionality of a particular procedure, but they will not modify these examples to provide added functionality or construct procedures to meet your specific requirements.

    For more information about how to create, to edit, and to run a Word macro, click Microsoft Office Word Help on the Help menu, type macro in the Search for box in the Assistance pane, and then click Start searching to view the topic.

    'This Word macro replaces all combinations of a character 
    'and a combining diacritic mark for a selected part of 
    'a document. The replacement text is "HELLO." 
    
    Sub findcombined()
    
    'Clear any formatting that is specified for a find operation.
    '
    Selection.Find.ClearFormatting
    
    'Clear any formatting that is specified for a replace operation.
    '
    Selection.Find.Replacement.ClearFormatting
    
    'Set up a Find and Replace operation.
    '
    With Selection.Find
    
    'Set the Find What string to find all combinations of a character and 
    'a combining diacritic mark. With wildcards enabled, ? matches any single 
    'character, and ChrW(&H300) and ChrW(&H36F) are the first and last 
    'code points of the Unicode Combining Diacritical Marks block.
    '
      .Text = "?[" & ChrW(&H300) & "-" & ChrW(&H36F) & "]"
    
    'Set the Replace with text to "HELLO."
    '
      .Replacement.Text = "HELLO"
    
    'Set the search direction to search forward in the document.
    '
      .Forward = True
    
    'Set the Find and Replace operation to wrap around to the start 
    'of the document when the end of the document is reached. This makes sure 
    'that all of the selected part of the document is searched.
    '
      .Wrap = wdFindContinue
    
    'Clear any Format settings.
    '
      .Format = False
    
    'Clear the Match case setting.
    '
      .MatchCase = False
    
    'Clear the Find whole words only setting.
    '
      .MatchWholeWord = False
    
    'Turn on the Use wildcards setting, which the Find What pattern requires.
    '
      .MatchWildcards = True
    
    'Clear the Sounds like setting.
    '
      .MatchSoundsLike = False
    
    'Clear the Find all word forms setting.
    '
      .MatchAllWordForms = False
    
    End With
    
    'Execute the Find and Replace operation
    
    Selection.Find.Execute Replace:=wdReplaceAll
    
    End Sub

     

     

    Sincerely,

    Susan Buchanan

    Microsoft Community Support


    Thursday, September 8, 2011 6:59 PM
  • I'm just catching up after a holiday, so this may all be sorted now, but I don't really understand the problem. If Stuart doesn't know the answer, of course, it's unlikely that I do, but I'd still be interested.
     
    Firstly, I note, from the ITRANS home page, that ITRANS is, to some extent, redundant, having been superseded by widespread support for Unicode. That aside, can I ask a question to aid my understanding?
     
     ... I thought Devanagari was a script used for various languages, including Sanskrit (as you say you are doing). Tamil, on the other hand, is a language with its own script. If my understanding is correct then what, exactly, is it that you want to transliterate?
     
    What Stuart has told you, that a displayed character can contain multiple glyphs, is correct although, leaving aside surrogate pairs, I wouldn't say that those glyphs, themselves, could consist of multiple characters. However, regardless of the displayed image, each character (each Unicode code point, if you like), should appear separately in the stream that you see in VBA, or that you can interact with in a Find and Replace operation and the KB article that Susan has referenced should point you in the right direction if a simple transliteration is all that you need. I'm not sure, however, exactly what you may have as a result of using ITRANS - does your document consist of Devanagari characters, or romanisations of them?
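
    One way to see the stored code points directly is sketched below. It is only an illustration: the names CodePointList and ShowSelectionCodePoints are made up here and come from neither the KB article nor the original macro. The function lists each UTF-16 code unit of a string in U+XXXX form, so the sequence can be checked in the Immediate window even when the script itself cannot be rendered there.

    ```vba
    'Returns the UTF-16 code units of s as a space-separated list,
    'for example "U+0915 U+093F".
    Function CodePointList(ByVal s As String) As String
        Dim i As Long
        Dim n As Long
        Dim out As String
        For i = 1 To Len(s)
            'AscW returns a signed Integer; the mask maps it to 0..65535
            n = AscW(Mid$(s, i, 1)) And &HFFFF&
            'Pad the hexadecimal value to four digits
            out = out & "U+" & Right$("0000" & Hex$(n), 4) & " "
        Next i
        CodePointList = Trim$(out)
    End Function

    'Prints the code units of the current selection to the Immediate window.
    Sub ShowSelectionCodePoints()
        Debug.Print CodePointList(Selection.Text)
    End Sub
    ```

    For example, CodePointList(ChrW(&H915) & ChrW(&H93F)) returns "U+0915 U+093F": the Devanagari syllable "कि" is stored as the consonant followed by the vowel sign, even though the sign is drawn to the left of the consonant.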
     

    Enjoy,
    Tony
    www.WordArticles.com
    Wednesday, September 14, 2011 1:43 PM
  • Here is another way of getting at the problem. The following VB function will transliterate a Telugu language string into a Romanized string. You may be able to use a similar strategy for your purposes. Indeed, it may be a lot easier: I have a feeling that the same consonant sound sits at the same position within each Indic script's Unicode block, so you may just need to add or subtract a constant offset from the AscW value of each character in a string. (Refer to the Unicode charts for the scripts you are transliterating to and from. For Telugu: http://unicode.org/charts/PDF/U0C00.pdf)

    To try this macro, select a paragraph in Telugu script and then run TestTeluguToEnglish.  

    Sub TestTeluguToEnglish()

    'Selection has no Paragraph property; use the first selected paragraph.
    MsgBox FToEnglish(Selection.Paragraphs(1).Range.Text)

    End Sub

                          

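    'Romanizes Telugu text. eChars maps U+0C01 to U+0C4D, in code point order,
    'to Latin strings. "#" marks a position with no letter in this table, and
    '"^nnn" gives the decimal code point of a romanized consonant.
    Function FToEnglish(szT As String) As String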

    Dim eChars As Variant
    eChars = Array("~", "M", "H", "#", "a", "A", "i", "I", "u", "U", _
                    "???", "???", "#", "e", "E", "ai", "#", "o", "O", "au", _
                    "ka", "kha", "ga", "gha", "^7749", _
                    "ca", "cha", "ja", "jha", "ña", _
                    "Ta", "Tha", "Da", "Dha", "Na", _
                    "ta", "tha", "da", "dha", "na", "#", _
                    "pa", "pha", "ba", "bha", "ma", _
                    "ya", "ra", "Ra", "la", "La", "#", _
                    "va", "^347", "Sa", "sa", "ha", "#", "#", "#", "-", _
                    "A", "i", "I", "u", "U", "R", "RR", "#", _
                    "e", "E", "ai", "#", "o", "O", "au", "VIRAMA")

        Dim szE As String
        szE = ""

        'Index of the last table entry; &HC01 + nLast is U+0C4D (virama).
        Dim nLast As Integer
        nLast = UBound(eChars)

        Dim i As Long          'Long, so long paragraphs do not overflow
        Dim nChar As Integer
        Dim szMap As String

        For i = 1 To Len(szT)
            nChar = AscW(Mid$(szT, i, 1))

            'A dependent vowel sign or virama (U+0C3E to U+0C4D) replaces the
            'inherent "a" added with the preceding consonant, so strip it.
            If nChar >= &HC3E And nChar <= &HC01 + nLast Then
                If Right$(szE, 1) = "a" Then
                    szE = Left$(szE, Len(szE) - 1)
                End If
            End If

            If nChar >= &HC01 And nChar < &HC01 + nLast Then
                'Everything in the table below the virama
                szMap = eChars(nChar - &HC01)
                If Left$(szMap, 1) = "^" Then
                    '"^nnn": emit the character with that code point, plus "a"
                    szE = szE & ChrW(CLng(Mid$(szMap, 2))) & "a"
                Else
                    szE = szE & szMap
                End If
            ElseIf nChar = &HC01 + nLast Then
                'Virama: the inherent vowel is already stripped; emit nothing
            Else
                'Not a Telugu letter at all, so copy it through unchanged
                szE = szE & Mid$(szT, i, 1)
            End If

        Next i

        FToEnglish = szE

    End Function

    **********************

    Unfortunately (on my system at least) you can't display Indic scripts in message boxes or in the VBA immediate window, so these scripts can be nearly impossible to debug.
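
    The "subtract a common value" idea from the start of this post might look roughly like the sketch below. ShiftIndicBlock and TestShift are hypothetical names, and the assumption that every letter has a counterpart at the same offset does not always hold: Tamil, for example, has no letter at the position of Devanagari KHA (U+0916), so the shifted value U+0B96 is unassigned. A real tool would need a per-script exception table on top of this.

    ```vba
    'Shifts every character in the 128-code-point block starting at srcBase
    'into the block starting at dstBase. Characters outside the source block
    '(Latin text, spaces, punctuation) are copied unchanged.
    Function ShiftIndicBlock(ByVal s As String, _
                             ByVal srcBase As Long, _
                             ByVal dstBase As Long) As String
        Dim i As Long
        Dim n As Long
        Dim out As String
        For i = 1 To Len(s)
            'AscW returns a signed Integer; the mask maps it to 0..65535
            n = AscW(Mid$(s, i, 1)) And &HFFFF&
            If n >= srcBase And n <= srcBase + &H7F Then
                out = out & ChrW(n - srcBase + dstBase)
            Else
                out = out & Mid$(s, i, 1)
            End If
        Next i
        ShiftIndicBlock = out
    End Function

    Sub TestShift()
        'Devanagari KA + vowel sign I (U+0915 U+093F), shifted to the
        'Telugu block, should give U+0C15 U+0C3F.
        Dim r As String
        r = ShiftIndicBlock(ChrW(&H915) & ChrW(&H93F), &H900, &HC00)
        Debug.Print Hex$(AscW(Left$(r, 1)))     'C15
        Debug.Print Hex$(AscW(Right$(r, 1)))    'C3F
    End Sub
    ```

    Because the results are printed as hexadecimal code points rather than as text, they can be checked against the Unicode charts even when the Immediate window cannot display the script.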

    Monday, May 28, 2012 2:52 AM
  • Hi Tony/ZipBlack/Others,

    Greetings and thanks indeed for the reply - apologies for my reply which is more than a year late!

    I originally thought of creating a document containing Devanagari Unicode content interspersed with English content/commentary (also in a Unicode font). I imagined running a macro that would ask for the source and target languages (Indic), select all source (Devanagari) text, transliterate it into the target language (I originally imagined this was as simple as adding or subtracting a Unicode offset!) and save the document under a different name. The expectation was that the original formatting and English commentary would stay intact, as would the paragraph formatting of the Devanagari (and target language) text. The only change would be that the source language content had been replaced with target language content.

    After my interaction with Stuart and a few other people, I realized that what I wanted to achieve is not as simple as it sounds, at least for me, as I have no real grounding in Unicode, fonts, or programming.

    So I resorted to an interim measure, though a somewhat lengthy one. I found a good transliteration tool (http://www.virtualvinodh.com/aksharamukha) and corresponded with its author as well. I transliterate from iTRANS encoding (or Devanagari) into the target language, cut and paste the transliterated text into Word, and manually save it as a different file. Before pasting, I manually change the font name and size for the style attached to the Devanagari text, and then paste as unformatted Unicode text. This is nearly manual, but it works fine.
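
    The manual font change in this workflow could probably be scripted. The following is a minimal sketch, assuming the Devanagari paragraphs share one dedicated style. The procedure names, the style name "Sanskrit Text" and the font "Latha" are placeholders, not details of the actual document. It sets both the regular and the complex-script font properties of the style, on the assumption that Word renders Indic text with the complex-script settings.

    ```vba
    'Points an existing style at a different font and size.
    Sub RetargetStyleFont(ByVal styleName As String, _
                          ByVal fontName As String, _
                          ByVal fontSize As Single)
        With ActiveDocument.Styles(styleName).Font
            .Name = fontName      'font for ordinary text
            .NameBi = fontName    'font for complex-script text
            .Size = fontSize
            .SizeBi = fontSize
        End With
    End Sub

    Sub TestRetarget()
        'Placeholder style and font names; replace with the real ones
        RetargetStyleFont "Sanskrit Text", "Latha", 14
    End Sub
    ```

    Changing the style rather than the pasted text keeps the paragraph formatting in one place, which matches the paste-as-unformatted-text step described above.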

    I asked Mr. Vinod whether he could make the transliteration a routine callable from VBA, in which case I could automate a large part of what I wanted to do. He said he would consider it.

    Please let me know whether there are any better methods.

    Thanks & Regards,

    K. Muralidharan



    Tuesday, October 30, 2012 9:57 AM
  • Dear Krishnan Muralidharan,

    As you work for Microsoft, it would be a great idea if you could persuade Microsoft to add a transliteration capability to Microsoft Office 365 (at least to Microsoft Word 365/64-bit) for all languages, for the benefit of those who are only users and do not care about computer code. I need this capability for Farsi (Persian) so that I do not have to use a Farsi keyboard each time I want to write something in Farsi. Google offers this functionality for Gmail, and it makes life a lot easier.

    I am very grateful for your attention.

    With best Regards

    Reza Beheshti

    reza.beheshti7@gmail.com

    Sunday, January 4, 2015 12:48 PM