Removing non-ASCII characters from text in Python

I was handling some text scraped with Scrapy, and the text contained Unicode escape sequences like \u003e.
My first attempt didn’t work:

        html_text = response.text.encode('ascii', errors='ignore').decode()

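Here response.text is a str holding the response body (Scrapy has already decoded it to Unicode for you).
After this call, html_text still contained sequences like \u003e. Each one is written out as plain ASCII characters (a backslash, a u and four hex digits), so errors='ignore' has nothing to drop.
This worked: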

        html_text = response.text.encode('ascii', errors='ignore').decode('unicode-escape')

Note the 'unicode-escape' argument to decode. That made the difference: it interprets sequences like \u003e and replaces each one with the character it stands for (for \u003e, that is >).
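Here is a small sketch of the difference between the two calls, outside of Scrapy. The sample string is made up for illustration (it stands in for response.text); it mixes a real non-ASCII character (é) with a literal \u003e escape. Keep in mind that 'unicode-escape' also interprets other backslash sequences such as \n, so it may change more than you expect if your text contains backslashes.

```python
# Hypothetical sample standing in for response.text: a real non-ASCII
# character (é) plus the six literal characters \u003e.
raw = "café " + r"a \u003e b"

# errors='ignore' drops é, but the escape sequence is already plain ASCII,
# so it passes through untouched.
ascii_only = raw.encode("ascii", errors="ignore").decode()

# 'unicode-escape' additionally interprets \u003e as the character '>'.
unescaped = raw.encode("ascii", errors="ignore").decode("unicode-escape")

print(ascii_only)  # caf a \u003e b
print(unescaped)   # caf a > b
```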

Author: Tek Shinobi

