Removing non-ASCII characters from text in Python

I was handling some text scraped using Scrapy, and the text contained Unicode escape sequences like \u003e written out as literal characters.
My first attempt didn't work:

        html_text = response.text.encode('ascii', errors='ignore').decode()

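Here response.text is a Python str holding Unicode text (Scrapy returns the response body already decoded).
After this call, html_text still contained sequences like \u003e. Each one is made up of plain ASCII characters (a backslash, a u and four hex digits), so encoding to ASCII leaves it untouched.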
This worked:

        html_text = response.text.encode('ascii', errors='ignore').decode('unicode-escape')

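Note the 'unicode-escape' argument to decode. That made the difference: it turns escape sequences like \u003e into the characters they stand for (\u003e becomes >), while errors='ignore' in encode drops any characters that are genuinely non-ASCII.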
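
To see the difference side by side, here is a small sketch. The sample string is made up for illustration and simply stands in for response.text; it contains one literal \u003e sequence and one real non-ASCII character (é).

```python
# Hypothetical sample standing in for response.text.
# '\\u003e' is six literal ASCII characters; '\u00e9' is a real non-ASCII 'é'.
raw = 'a \\u003e b caf\u00e9'

# errors='ignore' drops the 'é' but keeps the literal escape sequence.
ascii_only = raw.encode('ascii', errors='ignore')

# Plain decode: the escape sequence is still there as text.
plain = ascii_only.decode()

# unicode-escape decode: the escape sequence is interpreted as '>'.
unescaped = ascii_only.decode('unicode-escape')

print(repr(plain))
print(repr(unescaped))
```

Keep in mind that 'unicode-escape' interprets every backslash escape in the text, not only \uXXXX, so a literal \n in the scraped text would become a newline.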



