Is there a way to detect non-printable characters?

  • Question

  • I have written a program that downloads webpages, minus all the dangerous tags.   One of the reasons is to make the webpages safe to look at, in case the user is dubious about viewing them in his browser.

    I find that in some cases, the downloaded file has strange looking characters.   For instance, I tried my program on the N.Y. Post, and it downloaded a page that started with:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

    and at one point I got a string that looked like this:

    <p>“Rent me!” they fairly scream. “Please!”</p>

    Those strange characters do not show up if you view the original article in a browser. The original article was at:

    https://nypost.com/2014/04/26/the-hidden-proof-the-economy-is-still-awful/ 

    I checked what the characters were using the "Asc" function.   According to that function, they correspond to: 157 and 226.   But maybe this method is wrong because it assumes the characters are ASCII, while the charset in the meta tag is some variety of Unicode.

    So my question is, is there a way in general to know if a character is not within the normal range of a language (such as English), without consulting tables for every possible character set in a meta tag?  If so, I can just replace such characters by blanks.

    Thanks

    Monday, May 28, 2018 7:21 PM

Answers

  • Well, I don't know how you would go about determining printable and non-printable characters. If you can view those characters as text when you display them, then they are printable. I suspect the document in the web browser uses them somehow, possibly for formatting or aligning the text, but I have no idea. You would have to read and understand the entire code of the webpage to see what those characters are used for. And then how would a program figure out what they are used for, so as to remove them?

    You typically cannot see control characters; however, some of them give instructions to documents and printers, such as tab, carriage return, and line feed.


    La vida loca

    • Edited by Mr. Monkeyboy Tuesday, May 29, 2018 12:10 AM
    • Marked as answer by Gidmaestro Tuesday, May 29, 2018 9:30 AM
    Tuesday, May 29, 2018 12:06 AM
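
For the control characters mentioned above, here is a minimal sketch of one way to detect them without a table per character set. It is in Python rather than the poster's language, and it assumes the text has already been decoded into a Unicode string. The names `KEEP`, `is_non_printable`, and `replace_non_printable` are made up for illustration. The check relies on Unicode general categories: everything whose category starts with "C" (control, format, surrogate, private use, unassigned) is treated as non-printable, except tab, carriage return, and line feed.

```python
import unicodedata

# Layout controls that are usually worth keeping (assumption for this sketch).
KEEP = {"\t", "\n", "\r"}

def is_non_printable(ch: str) -> bool:
    """True for control, format, surrogate, private-use, or unassigned code points."""
    if ch in KEEP:
        return False
    # General categories starting with "C" are Cc, Cf, Cs, Co, and Cn.
    return unicodedata.category(ch).startswith("C")

def replace_non_printable(text: str, filler: str = " ") -> str:
    """Replace every non-printable character with the filler string."""
    return "".join(filler if is_non_printable(ch) else ch for ch in text)

sample = "\u201cRent me!\u201d\x9d\tok\x00"
print(repr(replace_non_printable(sample)))
```

Note that curly quotes and accented letters count as printable here, so only true control and format characters are replaced. The "Cf" category also covers characters such as the soft hyphen, which you may prefer to keep.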
  • You say you don't see the characters in the page but when I look at the data in that page (I think it is the following) I see:

    So the "garbage" is actually the quotation characters. I assume they are using open and close quotation characters instead of just one quotation character. So your result will look strange if you replace those characters with blanks.

    You need to understand ISO/IEC 8859-1 (see the Wikipedia article on it); it is not as simple as you want it to be.



    Sam Hobbs
    SimpleSamples.Info


    Tuesday, May 29, 2018 1:43 AM
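
The values 157 and 226 reported in the question are consistent with curly quotation marks stored as UTF-8 but read one byte per character. The sketch below (Python, not the poster's language) illustrates this under the assumption that the page bytes really were UTF-8 despite the meta tag. The `repair` helper is hypothetical, not part of any library.

```python
closing_quote = "\u201d"                  # right double quotation mark
raw = closing_quote.encode("utf-8")       # three bytes: E2 80 9D
misread = raw.decode("iso-8859-1")        # one character per byte
print([ord(c) for c in misread])          # [226, 128, 157]

def repair(text: str) -> str:
    """Undo a UTF-8-read-as-ISO-8859-1 mix-up; return the text unchanged if it does not apply."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(repair(misread) == closing_quote)   # True
```

If this is what happened, replacing those characters with blanks would throw away the quotation marks, whereas decoding the bytes as UTF-8 keeps them.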

All replies

  • So my question is, is there a way in general to know if a character is not within the normal range of a language (such as English)

    Be careful of what you say. Implying that languages other than English are not normal might be slightly offensive to some. It is unlikely anyone will say anything but it is better to be nice.

    If the article is not in English then those characters are likely normal for that language. If the article is in English then there is a reason why they were used, you should first determine that. We don't know what you are actually doing so we can't be sure what the best solution will be. First learn about HTML ISO-8859-1. Then if you need help explain why you can't use ISO-8859-1. For processing HTML pages like you are doing the character sets and fonts can get complicated and trying to make it simple is likely to be very successful at making you frustrated. It is not as easy as you want it to be. I constantly get frustrated with things like this.



    Sam Hobbs
    SimpleSamples.Info

    Monday, May 28, 2018 7:47 PM
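
As a rough illustration of why the declared charset cannot always be taken at face value, here is a sketch in Python (not the poster's language) that decodes downloaded bytes by trying strict UTF-8 first and only then falling back to the charset named in the meta tag. The function name `decode_page`, the `META_CHARSET` pattern, and the 4096-byte search window are assumptions made for this example. It is not how browsers resolve encodings, which also take the HTTP headers into account.

```python
import re

# Hypothetical pattern: finds charset=... in a <meta> tag near the top of the page.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def decode_page(raw: bytes, default: str = "iso-8859-1") -> str:
    """Decode page bytes: strict UTF-8 first, then the declared charset, then the default."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass  # not valid UTF-8, so fall back to the declared charset
    match = META_CHARSET.search(raw[:4096])
    declared = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(declared)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: use the default and never raise.
        return raw.decode(default, errors="replace")

page = (b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />'
        b'<p>\xe2\x80\x9cRent me!\xe2\x80\x9d</p>')
print(decode_page(page))
```

Strict UTF-8 decoding rarely succeeds by accident on text that contains non-ASCII ISO-8859-1 bytes, which is why it is tried first in this sketch.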