Why do I get odd characters instead of quotes in my documents?

The way characters are represented within computer documents and email isn't always the same everywhere, and things often get misinterpreted.

I have noticed for years that certain emails and documents have strange characters where punctuation and other characters should be. An example is this word: yesterday’s, where the characters ’ should clearly be an apostrophe. Why is this happening, and what can I do to stop it? I suspect that it happens more often when the originating computer system is a Mac.

It’s all about character encoding.

And that simple sentence represents a bit of complexity.

Let me cover a few concepts, and throw out a few tips on how it can sometimes be avoided.

Encoding

As I’ve discussed before, typically in the context of email, there are several ways to “encode” the characters – the letters and numbers and symbols – you see on the screen.

The fundamental concept is that all characters are actually stored as numbers. The uppercase letter “A”, for example, is the number 65. “B” is 66, and so on.
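If you want to see those numbers for yourself, here is a minimal sketch in Python (my choice of language for illustration only; nothing here depends on it). It uses the built-in ord() and chr() functions to go from a character to its number and back.

```python
# Look up the number stored behind a few familiar characters.
letters = ["A", "B", "a", "0"]
codes = {ch: ord(ch) for ch in letters}

for ch, number in codes.items():
    print(f"{ch!r} is stored as {number}")

# And the reverse: turn a number back into its character.
print(chr(65))  # A
```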

The “ASCII” character set or encoding uses a single byte – values from 0 to 255 – to represent up to 256 different characters. (Technically, ASCII only uses 7 bits of that byte, or values from 0 to 127. The most common true 8-bit encoding used on the internet today is “ISO-8859-1”.)

The problem, of course, is that there are way more than 256 possible characters. While we might spend most of our time with common characters like A-Z, a-z, 0-9 and a handful of punctuation, in reality there are thousands of other possible characters – particularly if you think globally.

At the other end of the spectrum is the “Unicode” encoding, which uses two (or more) bytes, giving many more possible different characters. “A” is still 65, but if we look at it in hexadecimal the single-byte ASCII “A” is 41, while the two-byte Unicode “A” is 0041.
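To make the 41 versus 0041 comparison concrete, here is a small Python sketch. One assumption on my part: I'm using Python's "utf-16-be" codec as a stand-in for "two-byte Unicode", since it writes each basic character as two bytes with the high byte first.

```python
# The same letter, stored one byte wide and two bytes wide.
one_byte = "A".encode("ascii")
two_bytes = "A".encode("utf-16-be")  # assumption: stands in for "two-byte Unicode"

print(one_byte.hex())   # 41
print(two_bytes.hex())  # 0041
print(len(one_byte), len(two_bytes))  # 1 2
```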

At this point, it should be clear that switching from ASCII to Unicode would immediately double the size of every email, every document, and everything else that stored text. Possible, and in some cases even the right solution, but when you consider that the majority of communications, particularly in the western world, focus on the basic Roman alphabet and a few numbers and punctuation, it starts to seem wasteful.

Enter “UTF-8”, for “8 bit Unicode Transformation Format”.

In UTF-8 the entire Unicode character set is broken down by an algorithm into byte sequences that are either 1, 2, 3 or 4 bytes long. The reason is simple: the vast majority of characters in common usage in Western languages fall into the 1 byte range. Messages remain smaller, but should one of those “other” characters be needed it can be incorporated by using its “longer” representation.
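As a rough illustration of those four sizes, the Python sketch below encodes one sample character from each length class and counts the bytes. The sample characters are my own picks, not anything special.

```python
# One sample character from each UTF-8 length class.
samples = {
    "A": "Latin capital letter A",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001F600": "grinning face emoji",
}

# Count how many bytes UTF-8 needs for each one.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {lengths[ch]} byte(s)")
```

Plain English text stays at one byte per character, which is exactly why UTF-8 doesn't bloat the average message.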

All that is a lot of back story to the problem.

Mis-Interpretation

When you see funny characters, it’s most likely because data encoded using UTF-8 is being interpreted as ISO-8859-1 (or, more precisely, as its close Windows cousin, Windows-1252).

Let’s use an example: that apostrophe.

First, let’s be clear as mud: there are apostrophes, and apostrophes. In reality the characters we often refer to as apostrophes could be:

  • the apostrophe: (')
  • the acute accent: (´)
  • the grave accent: (`)
  • the right single quote (’)
  • the left single quote (‘)

(Those might look similar, different, or not appear at all depending on the fonts and character sets available on your computer. I told you this was complex. :-)

Each, of course, has a different encoding. Let’s take the right single quote (for reasons I’ll explain below):

  • ASCII: doesn’t exist
  • ISO-8859-1: doesn’t exist either (the closely related Windows-1252 encoding puts it at 0x92 in hexadecimal)
  • Unicode: 0x2019 in hexadecimal
  • UTF-8: 0xE28099

I don’t expect you to care about the actual numbers there, but simply notice how dramatically different they are.
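If you'd rather check the numbers than take them on faith, a sketch along these lines asks Python for the right single quote's code point and its bytes in several encodings. The try_encode helper is something I made up for this example; "cp1252" and "latin-1" are Python's names for Windows-1252 and ISO-8859-1.

```python
quote = "\u2019"  # the right single quotation mark


def try_encode(text, encoding):
    """Return the encoded bytes as hex, or None if the encoding lacks the character."""
    try:
        return text.encode(encoding).hex()
    except UnicodeEncodeError:
        return None


print(hex(ord(quote)))               # 0x2019
print(try_encode(quote, "utf-8"))    # e28099
print(try_encode(quote, "cp1252"))   # 92
print(try_encode(quote, "ascii"))    # None
print(try_encode(quote, "latin-1"))  # None
```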

Now, what happens when the UTF-8 series of numbers is interpreted as if it were ISO-8859-1?

’

Look familiar?

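0xE28099 breaks down as 0xE2 (â), 0x80 (€) and 0x99 (™). What was one character in UTF-8 (’) gets mistakenly displayed as three (’) when misinterpreted as ISO-8859-1 (strictly speaking, as Windows-1252, the Windows variant in which 0x80 and 0x99 are printable characters).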
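You can reproduce the mix-up in a few lines of Python. This is only a sketch of the failure, assuming the reader's software decodes with Windows-1252 (which Python calls "cp1252"); I've also included a strict ISO-8859-1 ("latin-1") decode for comparison, where the second and third bytes turn into invisible control characters instead.

```python
original = "yesterday\u2019s"  # written with a right single quote

# The sender stores the text as UTF-8 bytes...
raw = original.encode("utf-8")

# ...and the receiver wrongly decodes those bytes as Windows-1252.
garbled = raw.decode("cp1252")
print(garbled)  # yesterday’s

# A strict ISO-8859-1 decode gives "â" followed by two control characters.
strict = raw.decode("latin-1")
print(repr(strict))
```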

The Culprits

There are typically two.

Email programs: email messages can include, as part of the header information you don’t see, the type of encoding used to represent the contents of the message. The problem is that some programs get it wrong, or, as you compose mail, you enter characters that cannot actually be represented by the current encoding scheme. In the latter case the email program has to do “something”, and that may include sending the character anyway, in one encoding scheme, even though the message is flagged as being in another.
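To show where that hidden label lives, here is a small sketch using the email package from Python's standard library. The subject and body text are made up for the example; the part to notice is the charset recorded in the Content-Type header, which is what the receiving program relies on.

```python
from email.message import EmailMessage

# Build a message whose body contains a right single quote.
msg = EmailMessage()
msg["Subject"] = "Encoding test"
msg.set_content("yesterday\u2019s news", charset="utf-8")

# The header you normally never see carries the encoding label.
print(msg["Content-Type"])
print(msg.get_content_charset())  # utf-8

# Decoding with the declared charset gives the text back intact.
print(msg.get_content().strip())
```

If the label and the actual bytes disagree, the receiving program has no way to know, and that's when the odd characters appear.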

I can hear you saying “but I didn’t type in any special characters!”.

Use Word to edit your email or your web page? Then you probably did. Microsoft Word is culprit number 2.

In particular, the “Smart Quotes” option in Word will often replace a plain apostrophe (') with an acute accent (´) or – as we saw above – a right single quote (’). When that gets sent or displayed using ISO-8859-1 encoding, you get the results above.

The solution? Ideally, watch what you’re typing. I know that “Smart Quotes”, while nice in printed documents, causes me enough grief elsewhere that it’s one of the first options I turn off when configuring Microsoft Word.

If you can, configure your email program to send in UTF-8 encoding (many, if not most, don’t make this easily configurable).
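If you're comfortable with a little scripting, both ideas can be sketched in Python. The two helper functions below are hypothetical names of my own: plain_quotes swaps curly quotes for plain ones before you send text, and repair_mojibake tries to undo the damage by reversing the mistaken decoding. Treat the second as a best-effort guess. It assumes the text was UTF-8 misread as Windows-1252, and it won't rescue every case.

```python
# Map the four common "smart" quotes to their plain ASCII equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
})


def plain_quotes(text):
    """Replace curly quotes with plain ones."""
    return text.translate(SMART_TO_PLAIN)


def repair_mojibake(text):
    """Try to reverse UTF-8 text that was decoded as Windows-1252.

    Returns the text unchanged if the round trip does not work.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(plain_quotes("yesterday\u2019s"))
print(repair_mojibake("yesterday\u00e2\u20ac\u2122s"))
```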

But regardless of how you got here, at least now you’ll know why.

There are 25 comments:

  1. Greg Bulmash

    I just liked to blame Word, because MSFT is an easy target. Thanks for the detailed explanation.

  2. Shawn

    Outlook is culprit #3, since it can be configured (maybe the default config) to use Word as the email editor.

  3. Don Taber

    The first clear explanation I’ve seen for why I get the strange “a Euro-sign TM” characters — which I see a lot. Thanks.

  4. Dan

    I rarely get this in emails. However, I come across this very often while navigating the web. If a site has SOME, for example, Japanese type characters included, I get the little boxes with numbers. Even if I use Google’s “translate this page” function there will still be “little boxes w/ numbers in them”. Can be really frustrating.

  5. Chris

    There’s another type of cause for these unanticipated character swaps that database developers are accustomed to dealing with. Each database system has its own set of special characters, which need to be ‘escaped’ with other special characters or sequences whenever they are used in a text representation – in order to prevent them from being interpreted as instructions. When working with multiple database types, it’s easy to use the wrong escape sequence. Also, escaping can be overlooked, which, technically, is a ‘bug’. These are just some additional reasons quote characters get messed up – especially on mass-produced interactive web pages.

  6. Michael Smith

    I think we’ve all been seeing these more lately, and your explanation was about as clear as it could be, I guess. It was also the first I’ve ever seen where anybody even attempted to detail it out – and now I see why. Good job!

    One thing, though – you didn’t tell us what you thought would be the best “fix”. What should savvy PC folks be working towards?

    Good question, and I’m not sure there is a simple fix. Ideally, I suppose, we’d all just use a single character encoding, like perhaps Unicode, all the time. Today I guess I’d be happy if everyone just settled on UTF-8, but because of all the different combinations of systems, tools and legacy documents it’s not very likely.

    Leo
    16-Sep-2009
  7. Robin Clay

    Not helped, of course, by SOME people not knowing when (and when not) to use an apostrophe !

    “It’s” – as shown a cuppla times on this page – Acksherly is the abbreviation for “it is”. The possesive “its” has no apostrophe.

    <grin>

    It’s also my Achilles heel. Fortunately I have about three people that pounce every time I get it wrong. :-)

    Leo
    16-Sep-2009
  8. James

    It’s worse than that.

    ‘Unicode’ is an encoding of characters as integers, which it calls code points.

    UTF-8 is a method of encoding each code point as a variable number of (8-bit) bytes, at least one and possibly as many as three.

    UCS-2 is actually a subset of Unicode, which encodes the first 64,000 code points (there are more!) as two bytes.

    Where UTF-8 needs only one byte, UCS-2 wastes one byte; however, the fixed character width of UCS-2 is easier to process for such apps as Word.

    ISO-8859-1 is a one-byte encoding that happens to be identical to UTF-8, for the first 128 characters only. If an interpreter thinks it’s looking at 8859-1, it goes wrong when it sees a byte with the top bit set (i.e. a character beyond 127). And vice versa of course.

    If you want more, there’s an extensive article on Unicode at Wikipedia. It may not help much. I have the problem with Thunderbird, which allows me to specify the character encoding, though it will Auto-Detect. I suspect the problem lies at the sender, for example, by pasting UCS-2 text into a UTF-8 message.

  9. George Jensen

    When I received an email (in Eudora) with those odd characters in it, I copied it into MS word – the correct characters appeared. Then I copied it back into Eudora and the weird characters were gone!

  10. Dick Victor

    Very helpful article which enabled me to understand and resolve the problem for my wife who is still using Eudora 7.

    For those still using “old” Eudora, there’s a nice free plugin that will convert received UTF-8 emails to ISO-8859-1 which is what Eudora understands. You can even put a button for the plugin in the button bar so it’s easy to use when you hit a message with odd characters. It’s at:

    http://www.windharp.de/software/utf8iso.htm

    Dick

  11. Ganymede

    This reminds me of the old saw, “When you ask him what time it is, he tells you how to build a watch.”

    All we need is to know which settings to change.

    This detailed information is only good for annoying people at cocktail parties and showing off for family at Thanksgiving.

    KISS.

    It’s not simple. Sorry my explanation disappoints. There is no magic setting. If you take the time to understand the explanation you’ll see why.

    Leo
    29-Sep-2010

  12. Anna Toschlog

    I read all of your reply, but I still do not understand how to resolve this problem. I am now having this problem when I forward an email that did not have this problem.
    Thank you kindly for any more help you could offer.

  13. Zamboni Tony

    What I don’t get is why a program isn’t smart enough to see ’ and say to itself “Gee – THAT isn’t very common, is it? NO! What else could it be? AH – it’s a ‘. Or a smart version of that. WHATEVER – I *do* know that ’ isn’t what the person typing me wanted to have down, so I’ll just do a quick conversion.”

    Now – HOW HARD WOULD THAT BE?

    ’’’’’’’’’’’

  14. Ragini

    Hi,

    I am facing a problem with the apostrophe symbol that is copied from MS-WORD. Once it’s been stored in the database, the apostrophe symbol is stored as a “?” mark. Please help in selecting the proper encoding type that supports the MS-WORD apostrophe symbol. Thanks, Ragini

  15. Mike

    First, thank you, Leo. I’d read this article before which provided a basic understanding. Now I need to address it, so the extended explanation gives me a direction.

    Second, Dick, your automated response reminds ME of another automated response: “Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.”

    Lazy people want a singular, simplistic answer to their singular local problem. Since it’s never singular, and it’s never simplistic, and the audience SURE isn’t local, Leo’s site is a learning experience that helps deal with future problems encountered that are related to an earlier one.

    Since my issue can’t be resolved with one simple answer, I’ve learned how to apply the general principle to dealing with it. And if I still can’t, at least now I’m armed with more information to search for an answer that does apply to my particular situation.

  16. Jonathan

    Thanks for your answer, Leo. That was very informative.

    And for those who complained, the question did not ask how to fix the problem, it asked why the characters were appearing.

    If you are seeing the ’ characters, chances are the fault is with the source, not your browser. The source was probably written using something like Word that uses “smart quotes” and when converted to HTML those smart quotes were simply converted into what the coding conversion indicated should be used. In the case of a smart apostrophe, it is converted to the ’ characters.

    When creating HTML code, be sure to use a text-based editor or convert it into plain text before saving it as HTML.

    As far as what shows up on someone else’s web page, if they have a comment page or a means to contact them, bring it to their attention in a courteous and humble manner, and perhaps they will fix the problem.

  17. Jaxsun

    Leo’s info is correct but I finally found the solution. Open WORD, go to TOOLS then AutoCorrectOptions. Select the AutoFormat Tab and de-select “Plain Text Wordmail documents” under Always Autoformat. This stops reformatting of a plain text email message!!

  18. Scott

    Thanks Leo for this explanation. Now (I think) I understand what’s going on. My client has a website that is Joomla (a Content Management System, or CMS) based, which stores page content in an SQL database (in this case, MySQL). They switched over to a new hosting provider back in November 2011, and their content (about 1400 articles, newsletters, ezines, etc) was displaying just fine – until last weekend. Someone inadvertently deleted the domain and the files, directories, etc. and they had to be restored. After they were restored, and Joomla re-installed, suddenly these strange characters started popping up in almost every single article – yup, you got it, the ’ in place of every single apostrophe. So I’m trying to figure out how to help them resolve this issue, without going in to over 1400 documents and editing/modifying them all manually. I first checked Joomla, and it’s got its encoding character set as UTF-8. Then I checked MySQL, thinking it must be set for some different character encoding set, but no, it’s actually UTF-8 also! But when I look into the MySQL database and run search queries against the entire database for the ‘ ’ ‘ designation, sure enough, it finds that exact string, right throughout the database. Hundreds and hundreds of places.

    So, I’m trying to figure out how the data got there like that in the first place, and it must be that, PRIOR to the “restore”, it was being stored and displayed, encoded and decoded, as ISO-8859-1 because it was being displayed properly. So now what can I do to fix the problem?!? Anyone?!? HELP?!?

  19. connie

    @Scott
    Does a new article end up with the characters, or just the restored database? Since you can find the characters in the database, you could do a search and replace for them all. But if new documents are having the same error, you are going to have to go deeper.

  20. George

    I have this problem in received e-mail. Am I correct in understanding that there’s nothing I can do about it?

  21. bob

    Just a correction : ’ is not the result of interpreting UTF-8 as ISO-8859-1, but as Windows-1252. Just because € doesn’t exist in iso-8859-1 charset !
    And the charset where 0x80 = € and 0x99 = ™ is windows-1252.

  22. Shay

    € : What does that symbol alone stand for when it takes the place of a box in a document where the box needs to be fill-able? And why is it doing so when I did not put it there? And how can I make it stop & go back to the box when the document is printed out? Even on the print preview the box is showing but when printed out the symbol here is printing out…? Frustrating. Thanks!

    • Leo

      That’s the currency sign for the Euro – their equivalent of $. No idea of where it’s showing up for you, sorry.

  23. Phil Lewin

    Using Eudora 7.1.0.9. I know there is a toggle command to turn on and off the foreign characters. Right now I press ? and get an e with a mark over it. Do you know the command? Something like shift-control D…. but that isn’t it! Please help! TIA Phil
