Read Ansi files with characters above 127 in SB ??

  • Question

  • By default, Windows Notepad (via ShellNew or the command line) creates new txt files as ANSI.
    So most text files on my machine are ANSI encoded (OK), BUT (not so good for SB) most of them also contain
    characters > 127 from extended ASCII, code page Windows-1252 (Western Europe).

    Opening/reading them in SB gives  � (65533, 0xFFFD)  for these characters. SB saves plain
    ASCII text files or .sb files as 1.) ANSI or else 2.) UTF-8 (no signature), but it can't read ANSI files containing
    characters above 127, like the frequently used German umlauts äÄöÖüÜ, ß, € (or µ²³ on the keyboard, doesn't matter much) etc.

    Is there a way to read such ANSI text files (with 127 < Char < 256) into SB, or at least to detect the wrong encoding,
    so that any further processing can be canceled (before � leads to wrong results)?

    Thanks

    Sunday, February 2, 2014 7:13 PM

All replies

  • I believe you can change the encoding and end-of-line sequence using advanced text editors like Notepad2 or Notepad++!

    I bet Small Basic goes well w/ UTF-8!  :D



    Sunday, February 2, 2014 7:38 PM
  • I tried a little sample program that creates a text with characters 128 to 255, writes them to a file, then reads them and displays in a Text object.

    Does this show the problem, or is it about loading files into the Small Basic interface, or something else?

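    ' Set up a wide window to display the text that was read back
    GraphicsWindow.Width = 1000
    GraphicsWindow.Height = 50
    GraphicsWindow.FontName = "Arial"
    GraphicsWindow.FontSize = 12
    ' Build a string containing the characters 128 to 255
    allText = ""
    For i = 128 To 255
      allText = allText + Text.GetCharacter(i)
    EndFor
    ' Write the string to a file, read it back and show it as a Text shape
    File.WriteContents(Program.Directory+"\allText.txt",allText)
    allText = File.ReadContents(Program.Directory+"\allText.txt")
    txt = Shapes.AddText(allText)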

    Sunday, February 2, 2014 7:40 PM
  • Thanks both for your suggestions,

    @GoToLoop
    Of course, I have used NP2 (NP2mod, NP2 BE, e.g. with the 'Visual Basic' syntax scheme for ;sb; smallbasic) for many years. Same for NPP.
    (for German speakers, see Notepad2 deutsch)
    Its status bar is where I check/change encodings in this case.
    But there's NO problem with Unicode (UTF8) files at all. SB reads/imports them without any trouble.

    @LitDev
    As expected, my result from your demo code (allText.txt) is identical to yours. But the created allText.txt is saved as UTF8, so there is NO problem.

    I hope this issue is reproducible on an English system as well.

    Sample: On a German system, I create a text file, e.g. via the desktop background context menu (New - Text Document), and write something like this into it:

    "ANSI-InclCharsGreater127: ABCabc123+-*/|#?!_äÄöÖüÜ;߀°".

    Notice the umlauts 'äÄöÖüÜ' and e.g. '߀°' (their char codes are > 127, and most German text files will contain some of them). Such a file is automatically saved as ANSI (not Unicode) on a German system!! And hundreds of them have accumulated over the years.
    If I select such a file to import into SB (e.g. into a TextBox or GW, or for further processing like searching through it, picking out a line, encrypting etc.), there is always a � (65533, 0xFFFD) in SB where
    ä, € or ß ... should be. One never knows ...
    At the moment I have to inspect every text file that is to be opened by SB to see whether it contains umlauts, ß or € characters, or simply convert it to Unicode beforehand.

    Sample:

    I have uploaded the 4 text files from the image above (2 ANSI and 2 UTF8 files, with and without umlauts etc.) and the .sb. As you can see, the content of the 4th TB (at the bottom) is the problem, and I would like to import that correctly into SB.

    I hope you can verify this and get the same results:

    AnsiUCTest.zip

    I think this problem could also exist on e.g. French, Scandinavian, Spanish etc. systems, whose languages use chars > 127 and where such text is usually saved as ANSI ??



    • Edited by Pappa LapubEditor Monday, February 3, 2014 6:39 PM 'encode' lol, meant encrypt
    Monday, February 3, 2014 6:11 PM
  • I can reproduce it: the SB File commands default to UTF-8 only, with no checking.

    Strangely, there doesn't seem to be a simple way to detect the encoding, e.g. see here.

    I can write an extension to do a conversion of your file to UTF8 (works on my PC anyway) if this is of interest.
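
To illustrate the failure mode outside Small Basic, here is a minimal Python sketch. It assumes the file's bytes are Windows-1252 and that the reader decodes them as UTF-8 with replacement of invalid bytes, which matches the � results reported above; the sample string is made up for the demonstration.

```python
# Hypothetical sample text; encoding it as cp1252 mimics an ANSI file from Notepad.
sample = "äÄöÖüÜ ß €"
ansi_bytes = sample.encode("cp1252")

# Decoding ANSI bytes as UTF-8: every byte above 127 here is an invalid
# UTF-8 sequence, so it is replaced by U+FFFD (decimal 65533).
as_utf8 = ansi_bytes.decode("utf-8", errors="replace")

# Decoding with the code page the file was written in recovers the text.
as_ansi = ansi_bytes.decode("cp1252")

print(as_utf8)
print(as_ansi)
print("contains U+FFFD:", "\ufffd" in as_utf8)
```

Plain ASCII bytes (0 to 127) decode identically in both encodings, which is why ANSI files without umlauts read fine.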

    Monday, February 3, 2014 7:33 PM
  • Hi LitDev,

    Funny thing, I have already been searching around for some weeks, landed on the same StackOverflow page as you and checked out all the links leading on from there, e.g. on how Notepad manages it:

    Sorting it all Out (Michael Kaplan about NP, 30 Jan 2005)

    Why I don't like the IsTextUnicode API

    find-out-the-encoding-of-a-file-c-sharp

    Unicode, UTF, ASCII, ANSI format differences

    Encoding Class

    http://csharpindepth.com/Articles/General/Unicode.aspx

    http://www.yoda.arachsys.com/csharp/unicode.html

    and so on ... (I've got a whole text file full of links), then tried it with  ID: MKV472  IsUnicode.vbs (cscript host)

    and ID: LWC709 IsUnicode.sb (intermediate VBS in %TEMP%\SB), but it's a rough, dirty and crude solution without conversion.

    An extension to import a possibly ANSI-encoded text file into SB (as UTF8) would be great and most welcome, and would ensure that one is on the safe side when opening a text file.

    PS: A useful find for quickly jumping to the archive site (on a 404 error; works fine in FF and IE via a js shortcut: javascript:location.href='http://web.archive.org/web/*/'+document.location    )

    http://www.sven-of-nine.de/site/doku.php/faq:web#bookmarkspielerei

    Monday, February 3, 2014 9:41 PM
  • Reading ANSI and writing out a copy in UTF8 works well.

    The issue is that I cannot reliably detect whether a file is ANSI, so if it is actually UTF8 the resulting output will be garbage. The input therefore MUST be ANSI.

    I can detect UTF8 if it has a BOM (byte order mark).  Your UTF8 files don't, so they appear as ANSI.  UTF8 files created by many MS products don't have one, but those from Notepad++ do, for example.

    I can also write out UTF8 with or without BOM (advantage having it is it can then be detected, disadvantage is that some applications don't want it e.g. XML parsers).

    Any preferences or even an argument option?
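
The BOM check and the "with or without BOM" output option can be sketched in Python as follows. This is only an illustration of the idea, not the extension's code; the function names `has_utf8_bom` and `to_utf8` are hypothetical.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the raw file bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def to_utf8(text: str, with_bom: bool = False) -> bytes:
    """Encode text as UTF-8, optionally prefixed with a BOM.

    Python's "utf-8-sig" codec writes the BOM when encoding.
    """
    return text.encode("utf-8-sig" if with_bom else "utf-8")

print(has_utf8_bom(to_utf8("äöü", with_bom=True)))
print(has_utf8_bom(to_utf8("äöü", with_bom=False)))
```

Without a BOM, UTF-8 and ANSI bytes cannot be told apart by their first three bytes alone, which is the detection problem described above.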

    Monday, February 3, 2014 10:03 PM
  • Uploaded a version to test.
    Monday, February 3, 2014 10:38 PM
  • Oops, supersonic :-). Great, LitDev, I read about that problem and just got it (v98).

    It's morning already, the night is over and the day will begin soon. So, I'll be back .. later today.

    In the meantime, thank you so much.

    Great

    Monday, February 3, 2014 11:16 PM
  • Hi LitDev,

    The German xml is here: LitDev.xml (v1.0.0.98)

    I'm going to do some testing now and thought about first doing an additional check for that ugly '�' via IsSubText(File.ReadContents(OrigText),"�"), as a first indication of an SB-incompatible text file.

    Thanks again, and I guess it'll do the job.

    Tuesday, February 4, 2014 6:42 PM
  • Thanks for the de xml.

    Perhaps there is some way to detect that the generated UTF8 is garbled, indicating that the conversion was bad. My guess is that the � represents a variety of bad bytes but should be detectable in some way. If you discover anything, let me know.

    Tuesday, February 4, 2014 7:08 PM
  • Gotcha!

    Tuesday, February 4, 2014 7:24 PM
  • Uploaded an update to the extension (same version number) to test. It correctly reads all the ANSI and UTF8 (BOM or no BOM) files I have.  Basically, if the encoding is detected as ANSI (which may be wrong), it first tries to read as UTF8; if it then finds �, which is character 65533 (unknown character), it reads as ANSI instead.  If the encoding is detected as not ANSI, it uses that encoding.  In any case, if � is still found, it returns "".

    If this all works I may modify the xml later for next release to explain a bit more what it should cope with (as above) - should also actually handle other encodings if they are detected properly.
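
The fallback logic described in this post can be sketched in Python roughly as below. This is an assumption-laden illustration, not the extension's actual code: the function name `read_text` is hypothetical, only the UTF-8 BOM counts as a "detected" encoding, and Windows-1252 stands in for the system ANSI code page.

```python
import codecs

REPLACEMENT = "\ufffd"  # the character shown as � (decimal 65533)

def read_text(data: bytes, ansi_codepage: str = "cp1252") -> str:
    """Decode raw file bytes, preferring UTF-8 and falling back to ANSI."""
    if data.startswith(codecs.BOM_UTF8):
        # Encoding detected via BOM: use it ("utf-8-sig" strips the BOM).
        text = data.decode("utf-8-sig", errors="replace")
    else:
        # No BOM, so the file looks like ANSI, but it may be BOM-less UTF-8.
        # Try UTF-8 first; a replacement character means it was not valid UTF-8.
        text = data.decode("utf-8", errors="replace")
        if REPLACEMENT in text:
            text = data.decode(ansi_codepage, errors="replace")
    # Whatever path was taken, give up if undecodable bytes remain.
    return "" if REPLACEMENT in text else text

print(read_text("äöü ß €".encode("cp1252")))
print(read_text("äöü ß €".encode("utf-8")))
```

One limitation of this approach: a valid UTF-8 file that legitimately contains U+FFFD would be pushed down the ANSI path and come out garbled.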

    Tuesday, February 4, 2014 8:02 PM
  • Thanks LitDev,

    Got it & I'll go through it and skip the additional � verification in my SB code.


    Tuesday, February 4, 2014 8:57 PM