Using the AscW function, I get two negative numbers for each character. Why?

  • Question

  • I have an Arabic PDF. When I copied its text into a Word document, the characters changed into square-shaped characters. When I used the AscW function to see the character codes, I got two negative numbers for each character. The first number is always -9280 and the second number varies from character to character. Why does this happen?

    This is what has been copied in the word document:

    􀂽􀁙􀃂􀀯􀃀􀂟

    And this is my program to get character codes:

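    Sub Chr_Codes()
        ' Declare each variable explicitly: "Dim a, b, c As String" makes a and b Variants.
        Dim i As Long
        Dim In_String As String, Out_String As String, In_String_Char As String

        In_String = Selection.Text

        ' Len and Mid count 16-bit UTF-16 units, not whole characters.
        For i = 1 To Len(In_String)
            In_String_Char = Mid$(In_String, i, 1)
            Out_String = Str(AscW(In_String_Char)) & " " & In_String_Char

            ' Write the result as a new paragraph after the current line.
            Selection.EndKey Unit:=wdLine
            Selection.TypeParagraph
            Selection.TypeText Text:=Out_String
        Next i
    End Sub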

    Copy the text into a Word document, select it, and run the program. Why does this happen?

    I will be thankful for your answer.
    Friday, May 31, 2013 9:39 AM

Answers

  • Your characters appear to be 2-byte (16-bit) characters that normally ought to display in Word, yet the characters you pasted into your original post, i.e. in HTML, do not render on my system either; they show as squares.

    It seems the underlying characters are correct, as we both get the same values, e.g. for the first character AscW returns -9280 for both of us. In other words, assuming you pasted from your PDF into your post, Word (or Office) is not the issue.

    Note that AscW returns an Integer, i.e. a signed 16-bit value between -32768 and 32767. The byte-array example I posted returns the true value of the first character as 56256, and 65536 - 9280 = 56256, so that at least solves one small detail.

    I vaguely recall that it is possible to identify the fonts embedded in a PDF document. Maybe look into that and see if the font, or rather an equivalent font (i.e. one that maps characters in the region of, say, 56000 to 57000), exists on your system. If so, apply that font in Word before pasting.

    Peter Thornton



    Saturday, June 1, 2013 10:33 AM
    Moderator
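
The arithmetic in the answer above (65536 - 9280 = 56256) can be wrapped in a small helper. This is a minimal sketch rather than code from the thread: `UnsignedAscW` and `Demo_UnsignedAscW` are hypothetical names, and it relies only on the built-in `AscW` and `ChrW` functions.

```vba
' Hypothetical helper (not from the original posts): map AscW's signed
' 16-bit Integer result onto the unsigned range 0 to 65535.
Function UnsignedAscW(ByVal ch As String) As Long
    ' &HFFFF& is a Long literal (65535); masking with it removes the sign extension.
    UnsignedAscW = AscW(ch) And &HFFFF&
End Function

Sub Demo_UnsignedAscW()
    ' ChrW(-9280) builds the same 16-bit unit that AscW reported in the question.
    Debug.Print UnsignedAscW(ChrW(-9280))   ' prints 56256
    Debug.Print UnsignedAscW("A")           ' prints 65
End Sub
```

56256 is &HDBC0, which lies in the UTF-16 high-surrogate range (&HD800 to &HDBFF). That suggests each pasted symbol is stored as a surrogate pair, which would explain why the loop reports two numbers per visible character.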

All replies

  • Only a guess, but maybe your PDF includes an embedded font that is not available on your system.

    FYI, you can get a better look at wide characters by converting the string to a byte array, e.g.

    Sub Chr_Codes2()
        Dim i As Long, j As Long
        Dim code As Long
        Dim In_String As String, Out_String As String, In_String_Char As String
        Dim b() As Byte

        In_String = Selection.Text
        b = In_String   ' assigning a String to a Byte array copies its raw UTF-16 bytes, low byte first

        For i = 1 To Len(In_String)
            j = (i - 1) * 2   ' index of the low byte of the i-th 16-bit unit
            In_String_Char = Mid$(In_String, i, 1)
            code = CLng(b(j)) + CLng(b(j + 1)) * 256   ' unsigned value, 0 to 65535
            ' Output: AscW value, character, low byte, high byte, unsigned value, ChrW of that value
            Out_String = Str(AscW(In_String_Char)) & " " & In_String_Char & " " & _
                         CStr(b(j)) & " " & CStr(b(j + 1)) & "  " & _
                         CStr(code) & " " & ChrW(code)
            Selection.EndKey Unit:=wdLine
            Selection.TypeParagraph
            Selection.TypeText Text:=Out_String
        Next i
    End Sub

    Peter Thornton


    Friday, May 31, 2013 10:21 AM
    Moderator
  • Thank you for your time and consideration, Peter Thornton. I have almost all Arabic fonts installed, and I can see the Arabic characters in my PDF. This is a common problem when converting Arabic PDFs to Word documents. I think that when I copy the PDF text into the Word document, each character is mapped to another character, so I could convert an Arabic PDF to a Word document if I knew the mapping.
    But there is a problem: sometimes there is no mapping! Some characters are omitted and some are merged with others.

    I think Word uses a variable-length encoding like UTF-8 for the characters that I copied, so each character is 4 bytes long, and that is why I get two numbers. Is that true?

    Thank you again.


    Saturday, June 1, 2013 5:13 AM
  • Dear Peter Thornton, as you suggested, I embedded the fonts in the PDF and copied the text. It worked for some PDFs, but other PDFs still have the problem.

    Thank you for your time and consideration again.

    Monday, June 3, 2013 6:17 AM
  • According to the MS Developer Network:
    AscW returns the Unicode code point for the input character. This can be 0 through 65535. The returned value is independent of the culture and code page settings for the current thread.

    However, there is a catch: in VBA, AscW returns a signed 16-bit Integer, so values above 32767 come back as negative numbers. Add 65536 to a negative result to get the correct unsigned value. A Unicode code point is simply a number (including numbers over 65535) that identifies each character in every writing system.

    Character encoding is how each character is stored. VBA strings use UTF-16, which stores most characters in two bytes and characters above 65535 as a pair of two-byte surrogates (four bytes in total). Most files these days use UTF-8, which uses between 1 and 4 bytes for each character. The original UTF-8 design allowed up to 6 bytes, but the current standard stops at 4.

    I found learning Unicode and foreign-language text handling to be like learning a completely new language. Here's a really good discussion of Unicode and UTF: since I cannot include a link yet, search for "Characters, Symbols and the Unicode Miracle - Computerphile" for a nine-minute (9:37) explanation of character encoding, from binary through ASCII to UTF.

    Wednesday, March 5, 2014 5:02 PM
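
To make the encoding discussion concrete: in UTF-16, a code point above 65535 is stored as two 16-bit units, a high surrogate followed by a low surrogate. The sketch below recombines such a pair. `CodePointFromPair` and `Demo_SurrogatePair` are hypothetical names, and the low surrogate &HDCBD is an assumed value chosen only for illustration; the thread reports only the first unit (-9280, i.e. &HDBC0).

```vba
' Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
' hi must be in &HD800..&HDBFF and lo in &HDC00..&HDFFF (unsigned values).
Function CodePointFromPair(ByVal hi As Long, ByVal lo As Long) As Long
    CodePointFromPair = (hi - &HD800&) * &H400& + (lo - &HDC00&) + &H10000
End Function

Sub Demo_SurrogatePair()
    Dim s As String, hi As Long, lo As Long

    ' &HDBC0 is the unit AscW reports as -9280; &HDCBD is an assumed example value.
    s = ChrW(&HDBC0) & ChrW(&HDCBD)

    ' Mask with &HFFFF& to undo the sign of AscW's Integer result.
    hi = AscW(Mid$(s, 1, 1)) And &HFFFF&
    lo = AscW(Mid$(s, 2, 1)) And &HFFFF&

    Debug.Print Len(s)                            ' 2: one visible character, two 16-bit units
    Debug.Print LenB(s)                           ' 4: size in bytes
    Debug.Print Hex$(CodePointFromPair(hi, lo))   ' 1000BD
End Sub
```

This matches the symptom in the question: a loop over `Mid(In_String, i, 1)` visits each 16-bit unit separately, so a character outside the first 65536 code points yields two AscW values, and both are negative because surrogates lie above 32767. The four bytes come from UTF-16 surrogate pairs, not from UTF-8.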