I have noticed for years that certain emails and documents have strange characters where punctuation and other characters should be. An example is this word: yesterday’s, where the characters ’ should clearly be an apostrophe. Why is this happening, and what can I do to stop it? I suspect that it happens more often when the originating computer system is a Mac.
It’s all about character encoding.
And that simple sentence represents a bit of complexity.
Let me cover a few concepts, and throw out a few tips on how it can sometimes be avoided.
Encoding
As I’ve discussed before, typically in the context of email, there are several ways to “encode” the characters – the letters and numbers and symbols – you see on the screen.
The fundamental concept is that all characters are actually stored as numbers. The uppercase letter “A”, for example, is the number 65. “B” is 66, and so on.
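A quick way to see those numbers for yourself is a couple of lines of Python. The language is just a convenient choice for illustration here, not something the discussion depends on.

```python
# Each character is stored as a number: ord() reveals it, chr() goes back.
code_a = ord("A")
code_b = ord("B")
back_to_letter = chr(65)

print(code_a, code_b, back_to_letter)
```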
The “ASCII” character set or encoding uses a single byte – values from 0 to 255 – to represent up to 256 different characters. (Technically ASCII actually only uses 7 bits of that byte, or values from 0-127. The most common true 8-bit encoding used on the internet today is “ISO-8859-1”.)
The problem, of course, is that there are way more than 256 possible characters. While we might spend most of our time with common characters like A-Z, a-z, 0-9 and a handful of punctuation, in reality there are thousands of other possible characters – particularly if you think globally.
At the other end of the spectrum is the “Unicode” encoding, which uses two (or more) bytes, giving many more possible different characters. “A” is still 65, but if we look at it in hexadecimal the single-byte ASCII “A” is 41, while the two-byte Unicode “A” is 0041.
At this point, it should be clear that switching from ASCII to Unicode would immediately double the size of every email, every document, and everything else that stored text. Possible, and in some cases even the right solution, but when you consider that the majority of communications, particularly in the western world, focus on the basic Roman alphabet and a few numbers and punctuation, it starts to seem wasteful.
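The doubling is easy to check. This is a small sketch that assumes the “two-byte Unicode” described here corresponds to what Python calls `utf-16-be` (big-endian UTF-16 without a byte-order mark); other two-byte layouts exist.

```python
# The same letter stored one byte wide and two bytes wide.
one_byte = "A".encode("ascii")
two_byte = "A".encode("utf-16-be")   # assumption: stands in for "two-byte Unicode"

# A short plain-English message in both forms.
message = "See you tomorrow"
narrow_size = len(message.encode("ascii"))
wide_size = len(message.encode("utf-16-be"))

print(one_byte.hex(), two_byte.hex(), narrow_size, wide_size)
```

For text made up only of basic letters, the wide form comes out at exactly twice the size of the narrow one.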
Enter “UTF-8”, for “8 bit Unicode Transformation Format”.
In UTF-8 the entire Unicode character set is broken down by an algorithm into byte sequences that are either 1, 2, 3 or 4 bytes long. The reason is simple: the vast majority of characters in common usage in Western languages fall into the 1 byte range. Messages remain smaller, but should one of those “other” characters be needed it can be incorporated by using its “longer” representation.
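To make the 1-to-4-byte idea concrete, here is a minimal Python sketch that measures a few sample characters. The characters chosen are just illustrations.

```python
# Sample characters, written as escapes so the source stays plain ASCII.
samples = {
    "A": "A",                 # basic Latin letter
    "e-acute": "\u00e9",      # accented letter
    "euro": "\u20ac",         # euro sign
    "emoji": "\U0001F600",    # a character outside the first 65,536 code points
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

print(utf8_lengths)
```

Plain letters stay at one byte each, which is why ordinary English text is no bigger in UTF-8 than in ASCII.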
All that is a lot of back story to the problem.
Mis-Interpretation
When you see funny characters it’s because data encoded using UTF-8 is likely being interpreted as ISO-8859-1.
Let’s use an example: that apostrophe.
First, let’s be clear as mud: there are apostrophes, and apostrophes. In reality the characters we often refer to as apostrophes could be:
- the apostrophe: (‘)
- the acute accent: (´)
- the grave accent: (`)
- the right single quote (’)
- the left single quote (‘)
(Those might look similar, different, or not appear at all depending on the fonts and character sets available on your computer. I told you this was complex. )
Each, of course has a different encoding. Let’s take the right single quote (for reasons I’ll explain below):
- ASCII: doesn’t exist
- ISO-8859-1: doesn’t exist (the closely related Windows-1252 encoding uses 0x92 in hexadecimal)
- Unicode: 0x2019 in hexadecimal
- UTF-8: 0xE28099
I don’t expect you to care about the actual numbers there, but simply notice how dramatically different they are.
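If you do care about the numbers, Python’s built-in codecs can serve as a cross-check on figures like these. The helper name `try_encode` below is made up for this sketch.

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK


def try_encode(text, encoding):
    """Return the encoded bytes as hex, or None if the encoding lacks the character."""
    try:
        return text.encode(encoding).hex()
    except UnicodeEncodeError:
        return None


# "latin-1" is Python's name for ISO-8859-1; "cp1252" is Windows-1252.
results = {enc: try_encode(quote, enc) for enc in ("ascii", "latin-1", "cp1252", "utf-8")}
code_point = hex(ord(quote))

print(code_point, results)
```

A `None` in the output means that encoding has no slot for the character at all.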
Now, what happens when the UTF-8 series of numbers is interpreted as if it were ISO-8859-1?
’
Look familiar?
0xE28099 breaks down as 0xE2 (â), 0x80 (€) and 0x99 (™). What was one character in UTF-8 (’) gets mistakenly displayed as three (’) when misinterpreted as ISO-8859-1.
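The misreading can be reproduced in a few lines of Python. One assumption to flag: the sketch decodes with `cp1252` (Windows-1252), which is how most software actually treats text labelled ISO-8859-1. Python’s strict `latin-1` codec turns the second and third bytes into invisible control characters instead.

```python
original = "yesterday\u2019s"          # the word with a right single quote
raw_bytes = original.encode("utf-8")   # what gets stored or sent

# Reading the same bytes with the wrong table: one character becomes three.
garbled = raw_bytes.decode("cp1252")

# Strict ISO-8859-1 reading, for comparison.
strict = raw_bytes.decode("latin-1")

print(raw_bytes.hex(), garbled, len(original), len(garbled))
```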
The Culprits
There are typically two.
Email programs: email messages can include, as part of the header information you don’t see, the type of encoding used to represent the contents of the message. The problem is that some get it wrong, or, as you compose mail, you enter characters that cannot actually be represented by the current encoding scheme. In the latter case the email program has to do “something”, and that may include sending the character anyway, in one encoding scheme, even though the message is flagged as being in another.
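To show what that hidden header looks like, here is a rough sketch using Python’s standard `email` package. The subject and body text are invented for the example, and real mail programs build their messages in their own ways.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Encoding test"
# Declare the charset explicitly so the label matches the bytes actually sent.
msg.set_content("yesterday\u2019s news", charset="utf-8")

declared = msg.get_content_charset()   # charset named in the Content-Type header
content_type = str(msg["Content-Type"])
roundtrip = msg.get_content()          # body decoded using the declared charset

print(content_type)
```

When the label and the bytes agree, as here, the receiving program can decode the body correctly. The trouble described above starts when they disagree.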
I can hear you saying “but I didn’t type in any special characters!”.
Use Word to edit your email or your web page? Then you probably did. Microsoft Word is culprit number 2.
In particular, the “Smart Quotes” option in Word will often replace a plain apostrophe (') with an acute accent (´) or – as we saw above – right single quote (’). When that gets sent or displayed using ISO-8859-1 encoding, you get the results above.
The solution? Ideally, watch what you’re typing. I know that “Smart Quotes”, while nice in printed documents, causes me enough grief elsewhere that it’s one of the first options I turn off when configuring Microsoft Word.
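For text that has already picked up smart quotes, one possible clean-up is to swap them for their plain equivalents before sending. This is only a sketch: `straighten_quotes` is a hypothetical helper, and it covers just the four most common curly quotes.

```python
# Map the curly single and double quotes to their plain ASCII counterparts.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote (the usual "smart" apostrophe)
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by plain ones."""
    return text.translate(SMART_TO_PLAIN)


cleaned = straighten_quotes("\u201cyesterday\u2019s\u201d")
print(cleaned)
```

The result contains only characters that exist in plain ASCII, so it survives whichever encoding the recipient assumes.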
If you can, configure your email program to send in UTF-8 encoding (many, if not most, don’t make this easily configurable).
But regardless of how you got here, at least now you’ll know why.
I just liked to blame Word, because MSFT is an easy target. Thanks for the detailed explanation.
The first clear explanation I’ve seen for why I get the strange “a Euro-sign TM” characters — which I see a lot. Thanks.
I rarely get this in emails. However, I come across it very often while navigating the web. If a site has SOME, for example, Japanese-type characters included, I get the little boxes with numbers. Even if I use Google’s “translate this page” function, there will still be “little boxes w/ numbers in them”. Can be really frustrating.
There’s another type of cause for these unanticipated character swaps that database developers are accustomed to dealing with. Each database system has its own set of special characters, which need to be ‘escaped’ with other special characters or sequences whenever they are used in a text representation – in order to prevent them from being interpreted as instructions. When working with multiple database types, it’s easy to use the wrong escape sequence. Also, escaping can be overlooked, which, technically, is a ‘bug’. These are just some additional reasons quote characters get messed up – especially on mass-produced interactive web pages.
I think we’ve all been seeing these more lately, and your explanation was about as clear as it could be, I guess. It was also the first I’ve ever seen where anybody even attempted to detail it out – and now I see why. Good job!
One thing, though – you didn’t tell us what you thought would be the best “fix”. What should savvy PC folks be working towards?
16-Sep-2009
Not helped, of course, by SOME people not knowing when (and when not) to use an apostrophe !
“It’s” – as shown a cuppla times on this page – Acksherly is the abbreviation for “it is”. The possessive “its” has no apostrophe.
<grin>
16-Sep-2009
It’s worse than that.
‘Unicode’ is an encoding of characters as integers, which it calls code points.
UTF-8 is a method of encoding each code point as a variable number of (8-bit) bytes, at least one and possibly as many as four.
UCS-2 is actually a subset of Unicode, which encodes the first 65,536 code points (there are more!) as two bytes.
Where UTF-8 needs only one byte, UCS-2 wastes one byte; however, the fixed character width of UCS-2 is easier to process for such apps as Word.
ISO-8859-1 is a one-byte encoding that happens to be identical to UTF-8, for the first 128 characters only. If an interpreter thinks it’s looking at 8859-1, it goes wrong when it sees a byte with the top bit set (i.e. a character beyond 127). And vice versa of course.
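That overlap, and the point where it ends, can be checked directly. A small Python sketch, where `latin-1` is Python’s name for ISO-8859-1 and “café” is just a sample word:

```python
# Bytes 0-127 decode to the same characters under both encodings.
ascii_range = bytes(range(128))
same_below_128 = ascii_range.decode("utf-8") == ascii_range.decode("latin-1")

# A byte with the top bit set: "café" as stored in ISO-8859-1.
high_byte = b"caf\xe9"
as_latin1 = high_byte.decode("latin-1")

# The same bytes are not valid UTF-8, so a strict decoder refuses them.
try:
    high_byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(same_below_128, as_latin1, utf8_ok)
```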
If you want more, there’s an extensive article on Unicode at Wikipedia. It may not help much. I have the problem with Thunderbird, which allows me to specify the character encoding, though it will Auto-Detect. I suspect the problem lies at the sender, for example, by pasting UCS-2 text into a UTF-8 message.
When I received an email (in Eudora) with those odd characters in it, I copied it into MS word – the correct characters appeared. Then I copied it back into Eudora and the weird characters were gone!
This reminds me of the old saw, “When you ask him what time it is, he tells you how to build a watch.”
All we need is to know which settings to change.
This detailed information is only good for annoying people at cocktail parties and showing off for family at Thanksgiving.
KISS.
29-Sep-2010
What I don’t get is why a program isn’t smart enough to see ’ and say to itself “Gee – THAT isn’t very common, is it? NO! What else could it be? AH – it’s a ‘. Or a smart version of that. WHATEVER – I *do* know that ’ isn’t what the person typing me wanted to have down, so I’ll just do a quick conversion.”
Now – HOW HARD WOULD THAT BE?
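For what it’s worth, a “quick conversion” along those lines can be sketched in Python. It assumes the damage came from UTF-8 text being read as Windows-1252, and `repair_mojibake` is a made-up name. The hard part is not the conversion but knowing when it is safe: the three characters could be intentional, so this version only changes text when the round trip fits cleanly.

```python
def repair_mojibake(text):
    """Undo UTF-8-read-as-Windows-1252 damage; return text unchanged if it does not fit."""
    try:
        # Turn the wrongly decoded characters back into bytes, then decode properly.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or not damaged at all): leave it alone.
        return text


fixed = repair_mojibake("yesterday\u00e2\u20ac\u2122s")   # the garbled form
untouched = repair_mojibake("caf\u00e9")                  # healthy text with an accent

print(fixed, untouched)
```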
’’’’’’’’’’’
Hi,
I am facing a problem with the apostrophe symbol that is copied from MS-WORD. Once it’s been stored in the database, the apostrophe symbol is stored as a “?” mark. Please help in selecting the proper encoding type that supports the MS-WORD apostrophe symbol. Thanks, Ragini
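One plausible mechanism for the “?” is sketched below in Python. It is an assumption, since the comment names neither the database nor its settings: when text is forced into an encoding that has no slot for the curly apostrophe, the converter substitutes a question mark. Storing the text as UTF-8 avoids the loss.

```python
text = "it\u2019s"   # "it's" with Word's curly apostrophe

# Forcing the text into ISO-8859-1, replacing anything that does not fit.
stored_narrow = text.encode("latin-1", errors="replace")

# Storing as UTF-8 keeps the character, and it reads back intact.
stored_utf8 = text.encode("utf-8")
read_back = stored_utf8.decode("utf-8")

print(stored_narrow, read_back == text)
```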
First, thank you, Leo. I’d read this article before which provided a basic understanding. Now I need to address it, so the extended explanation gives me a direction.
Second, Dick, your automated response reminds ME of another automated response: “Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.”
Lazy people want a singular, simplistic answer to their singular local problem. Since it’s never singular, and it’s never simplistic, and the audience SURE isn’t local, Leo’s site is a learning experience that helps deal with future problems encountered that are related to an earlier one.
Since my issue can’t be resolved with one simple answer, I’ve learned how to apply the general principle to dealing with it. And if I still can’t, at least now I’m armed with more information to search for an answer that does apply to my particular situation.
Thanks for your answer, Leo. That was very informative.
And for those who complained, the question did not ask how to fix the problem, it asked why the characters were appearing.
If you are seeing the ’ characters, chances are the fault is with the source, not your browser. The source was probably written using something like Word that uses “smart quotes”, and when converted to HTML those smart quotes were simply converted into what the coding conversion indicated should be used. In the case of a smart apostrophe, it is converted to the ’ characters.
When creating HTML code, be sure to use a text-based editor or convert it into plain text before saving it as HTML.
As far as what shows up on someone else’s web page, if they have a comment page or a means to contact them, bring it to their attention in a courteous and humble manner, and perhaps they will fix the problem.
Leo’s info is correct but I finally found the solution. Open WORD, go to TOOLS then AutoCorrectOptions. Select the AutoFormat Tab and de-select “Plain Text Wordmail documents” under Always Autoformat. This stops reformatting of a plain text email message!!
Thanks Leo for this explanation. Now (I think) I understand what’s going on. My client has a website that is Joomla (a Content Management System, or CMS) based, which stores page content in an SQL database (in this case, MySQL). They switched over to a new hosting provider back in November 2011, and their content (about 1400 articles, newsletters, ezines, etc.) was displaying just fine – until last weekend. Someone inadvertently deleted the domain and the files, directories, etc. and they had to be restored. After they were restored, and Joomla re-installed, suddenly these strange characters started popping up in almost every single article – yup, you got it, the ’ in place of every single apostrophe. So I’m trying to figure out how to help them resolve this issue, without going in to over 1400 documents and editing/modifying them all manually. I first checked Joomla, and it’s got its encoding character set as UTF-8. Then I checked MySQL, thinking it must be set for some different character encoding set, but no, it’s actually UTF-8 also! But when I look into the MySQL database and run search queries against the entire database for the ’ designation, sure enough, it finds that exact string, right throughout the database. Hundreds and hundreds of places.
So, I’m trying to figure out how the data got there like that in the first place, and it must be that, PRIOR to the “restore”, it was being stored and displayed, encoded and decoded, as ISO-8859-1 because it was being displayed properly. So now what can I do to fix the problem?!? Anyone?!? HELP?!?
@Scott
Does a new article end up with the characters, or just the restored database? Since you can find the characters in the database, you could do a search and replace for them all. But if new documents are having the same error, you are going to have to go deeper.
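A search and replace of that kind could be sketched as follows in Python. The character list and the function name are illustrative only, and it assumes the garbling came from UTF-8 being read as Windows-1252. On a real database you would want a backup and a trial run on a copy first.

```python
# Characters Word commonly substitutes. The right double quote (U+201D) is left out
# because one of its UTF-8 bytes, 0x9D, has no Windows-1252 mapping in Python's codec.
COMMON = "\u2018\u2019\u201c\u2013\u2014\u2026"

# Build the table of garbled sequence -> intended character.
REPLACEMENTS = {ch.encode("utf-8").decode("cp1252"): ch for ch in COMMON}


def search_and_replace(text):
    """Replace each known garbled sequence with the character it should have been."""
    for garbled, intended in REPLACEMENTS.items():
        text = text.replace(garbled, intended)
    return text


restored = search_and_replace("the client\u00e2\u20ac\u2122s site")
print(restored)
```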
Just a correction: ’ is not the result of interpreting UTF-8 as ISO-8859-1, but as Windows-1252, simply because € doesn’t exist in the ISO-8859-1 charset!
And the charset where 0x80 = € and 0x99 = ™ is windows-1252.
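This difference is easy to confirm in Python, where `cp1252` is Windows-1252 and `latin-1` is ISO-8859-1. A minimal check:

```python
two_bytes = b"\x80\x99"

# Windows-1252 assigns printable characters to these bytes.
as_cp1252 = two_bytes.decode("cp1252")

# ISO-8859-1 maps the same bytes to invisible control characters.
as_latin1 = two_bytes.decode("latin-1")

# And the euro sign cannot be written in ISO-8859-1 at all.
try:
    "\u20ac".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(as_cp1252, euro_in_latin1)
```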
€ : What does that symbol alone stand for when it takes the place of a box in a document where the box needs to be fill-able? And why is it doing so when I did not put it there? And how can I make it stop & go back to the box when the document is printed out? Even on the print preview the box is showing but when printed out the symbol here is printing out…? Frustrating. Thanks!
That’s the currency sign for the Euro – their equivalent of $. No idea of where it’s showing up for you, sorry.
Using Eudora 7.1.0.9. I know there is a toggle command to turn on and off the foreign characters. Right now I press ? and get an e with a mark over it. Do you know the command? Something like shift-control D…. but that isn’t it! Please help! TIA Phil
Did you ever stop to think most of us asking how to fix this problem are not computer software engineers? All your lengthy jargon does absolutely no good. If I wanted to spend hours/days/weeks trying to understand, I could just take a course. Next time your car transmission goes out, contact me. I’ll be happy to give you plenty of theory and nothing useful to fix it and if you don’t understand, I can just call you “lazy”. So take a deep breath, and slowly explain STEP BY STEP what to try. Your first step comes right after turn the computer on. UTF-8 means nothing to me and I have no idea how or where to change it.
Whenever possible, Leo gives step-by-step instructions, however, in this case, it’s not possible to give step by step instructions for this as the procedure is completely different for each email program and webmail interface. You’ll just have to look around for the settings for your email program or website. Or you could mention how you access email here and someone might be able to explain where to find it.
I have a problem where if I cut and paste from Windows into an old program, the program crashes completely. I understand now. It must be because it only expects single-byte data. And Don, if I came to you with a question about a noise that an automatic transmission makes (any make or model) I would expect you could probably fix it but I don’t think that you could explain to me how to fix it. Similar problem with the odd characters.
Try turning off “replace straight quotes with smart quotes” in the options for autoformat as you type. This seems to work if you are having trouble with apostrophes.
Great Article. One thing that’s strange is that for the same incoming email, if I open in Outlook 2013, I get the dreaded: ’. If I open the SAME email via a browser, like via Gmail, it looks fine! Is this an issue in the sender or in Outlook and how can this be fixed? If it’s possible to find a way to fix this in Outlook that would be great because I can’t control the sender.
BTW, I’m getting the same issue with foreign language symbols. Looks fine when the email is opened in Gmail or a browser, but is jumbled when I open it in Outlook. Please help.
Dan
The explanatory material seems to refer only to written work I create. However, after nearly 20 years computing, the problem has only just now appeared and only in incoming emails (via Win 7 Live Mail) and external websites (like that I am writing in). It is accompanied by what I call ‘font corruption’, ie, scraggly thin characters instead of those I always previously saw in every application – firm black (not bold).
The only change I have made to my applications is to install Firefox browser because my bank (NAB, Australia) could no longer perform as it had always previously done in Win IE and Chrome.
No matter where I look, I have failed to find anything in Win 7 that controls this aspect of screen appearance.
Here’s an example from an incoming email:
That’s pretty bad. It’s nearly a 10% drop.
I think I have found an answer having installed Live Mail 12 to no effect:
– click on the little box in the top left corner of Live Mail; then
– select Options
– select Mail
– select Send tab
– click on International Settings
– use drop down menu to select Unicode UTF-8
– ok
– ok
Okay, then, can you (or anyone) explain this? I only receive this junk from one mailing list and one person who sends email from her Samsung Galaxy Note 4 phone. Using my UTF-8>ISO plug-in (in Eudora 7.1) just makes it all disappear, doesn’t recode to anything readable. (Whereas it does convert ’s stuff.) Sending it to a browser doesn’t change anything, either.
Mc¦j:+é^jǬ¶¬š‰í„Ú/zfÞ¯]0jËaz››’¶*’u«^~ŠÓ„ˆHË4DÅ„
0@À4ä@,rLäX8TÇd@LAT,1
¼•¨«%§$²‰ÚÐÚ¾’Fj{¢·¢ú¶m§ÿî²fœš)ejw(›öè¢K?÷¿5Û4çÎzÑ’žméhÁú+nŠ$z÷§¶ÉÚ™êÂœ†Ö¤z™Zq觚ë”r÷§¹ëž¯z—«~Šæjw±«miÈ^tC3)ÞÁ©[ºj¶yû¥
Eudora is known to have problems with character encodings other than what it expects. (And it DOESN’T expect UTF-8.) I know of no solution.
You might consider switching to something like Thunderbird.
Well written article. I find very often, particularly on the internet when commenting on threads on, say, a YouTube video, those detailed comments I leave behind often turn (‘) into ('). So (they’re) will become (they're) and make the reading experience very awkward and messy.
Sorry & #39; (without spaces) so they’re will become they& #39;re
I’ve been getting that very recently on my YouTube comments as well. Except I’m getting it converting my double-quotes ” into & quot; in the comment, and when I edit the comment, it’s actually entered in the comment as & quot; instead of a double-quote. But when I edit the comment and change those back into ” and re-post the edited comment, it shows up correctly as a double-quote.
I’m suspicious of an extension, because I haven’t seen anyone else complaining about the issue online recently, so it’s probably pretty isolated. My suspicion is that it’s my Lazarus extension that saves copies of my comments as I’m entering them so that they don’t get lost if I inadvertently close the tab or get navigated away from a web form. I’m going to try a little experiment with disabling that and see if it’s my culprit or not. Otherwise, I’ll try a clean profile in my Chrome browser and see if that does the trick.
By the way, the technical term for what you and I are seeing are HTML entities. http://dev.w3.org/html5/html-author/charref
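For anyone curious how those entities come and go, Python’s standard `html` module shows both directions. This is only an illustration of the mechanism, not a claim about what YouTube or any extension actually does.

```python
import html

# Escaping turns quote characters into entities so they are safe inside HTML.
escaped_apostrophe = html.escape("they're")
escaped_quote = html.escape('say "hi"')

# Unescaping is what a page is supposed to do before showing the text to you.
decoded = html.unescape("they&#39;re")

print(escaped_apostrophe, escaped_quote, decoded)
```

When text is escaped but never unescaped, or escaped twice, the raw entity is what ends up on the screen.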
I sort of understand from this explanation that there are different coding languages, and that one would interpret differently from another. What I still do not get is why, if this is the problem, only the apostrophe would be affected. I am getting HTML format emails from an apple user into Outlook that appear completely fine, except for the one character that isn’t read properly.
There are something like five or six different types of apostrophes. Seriously. (Straight, curly, left, right… it goes on.) So what’s usually happening is that they’re using one of the curly apostrophes in a character set your program isn’t displaying properly.
Oops. in your Mis-interpretation section, the example you give of an actual apostrophe is, in fact, a left (opening) single quote. An apostrophe is, of course, actually a right (closing) single quote. We see this mistake all the time now, even in major advertising billboards, magazines, and many self-published books. As this article shows, it’s easy to fix, just usually not done. Too bad.
Excellent explanation! I write code for a living and am embarrassed to say I never researched how UTF-8 worked. I simply used that encoding because that’s what most of the example code did. Thanks so much for clearing this up.
THANK YOU!! I code fiction for a fan site, and I suspected that “smart quotes” were the culprit (since those wonky symbols seemed to mainly affect quotes and apostrophes) but I didn’t know the reason for it or how to prevent it. Your explanation was excellent (forget the troll who said it wasn’t). I admit shame-facedly that I had no clue what that UTF-8 thing was, and I’ve been coding for almost 20 years!
Frankly, I think that all these explanations, not just yours, are pretty lame.
It really shouldn’t be so complicated: When I type in “A”, for example, that’s what I want to type. What I want to appear at the receiver’s end is an “A”, not some complex code for an “A”.
How many commonly used punctuation marks are there? Certainly not so many as to present a big major problem to include in the memory capacity of a computer.
I think the whole system needs to be re-ramped and simplified, because it looks ridiculous to see such gibberish printed out, when all we wanted was an apostrophe, or quotation marks.
Leo, your ten-year-old article still finds a widespread and receptive audience. For me, it addressed a sore point in distributing messages containing pre-formatted content. In my business, I re-distribute pre-formatted content regularly, and often meet the dreaded apostrophe natively/originally coded in UTF8.
The browser Firefox is apparently set to 8859-1, and when it finds a UTF8 apostrophe in prepared content, dutifully renders the carated “a”, the Euro sign and TM mark. Although I have searched through the Mozilla database, it seems Firefox cannot be set to understand UTF8 characters and to display the original, intended apostrophe. Although I have searched (briefly) among Firefox settings, I have found no way to set Firefox to speak UTF8, at my discretion.
* BTW– your fine-print “Disclosure: I may recieve a fee…” may be misspelled. I have found many good coders try to spell rationally, yet find English spelling is not completely a set, rational system, but a conglomerate of exceptions. That is why becoming a spelling-bee champion is a triumph, in and of itself.
Although I could not edit or supplement my still-warm comments, made earlier, I found a solution for Thunderbird encoding options under the top menu, under View/Text encoding. The options are Unicode and Western, as well as many national languages.
However, no explicit option for UTF8 encoding is offered– unless western means UTF8. In the context of your article, the word western seems to describe the “most common true 8-bit encoding used on the internet today”, ISO8859-1.
That Firefox of any version offers only multiple foreign languages, but no options for encoding between 8859-1 and Unicode or UTF8 implies Firefox may be internally set for 8859-1.
UTF-8 is Unicode. Technically UTF-8 is Unicode based on 8-bit bytes. UTF-16 is Unicode based on 16-bit words.
Whoops. Thanks for catching that. That whole “i before e” thing is weird.
My run of luck continues– I found and executed the steps that provide Unicode character encoding for Firefox. In Firefox, select the menu icon, which drops down and reveals the option “Customize”. In customize, drag the “Text Encoding” icon into the menu/sidebar on the right. Close the customize menu window, then reopen the menu, click on Text Encoding again, and select either Unicode or Western. A nearby “Auto-detect” option offers even more convenience, since it appears to relieve me of the need to set the encoding in advance of processing new text.
The “Auto-detect” option in Firefox does NOT refer to detection of encoding (as I had hoped) but to the use of Ukrainian, Russian or Japanese. I post this FINAL discovery only so people will not be confused.