Smart quotes, em dashes, and en dashes

Steve Simon

2020-03-02

If you work with text data a lot, you will encounter some characters that are sort of close to what you need, but sort of not. These include the smart quotes, em dashes, and en dashes.

Smart quotes

Programmers use a set of standard quotation marks in their work. These are called straight quotes by some and dumb quotes by others.

The straight double quote (") is one of the standard codes that works pretty much the same on any computer system.

The straight single quote (') is another standard code.

On early computer systems, these were all you had. You might also have a backwards slanting single quote (`), often called a backtick.

This was a step backwards from Gutenberg’s printing press. Actually, the regression occurred when the typewriter was invented. The limited number of keys that you could fit onto a typewriter prevented the use of a greater variety of quote marks.

Now you probably already know this, but if you want to assign a double quote to a variable, you surround it with single quotes or precede the double quote with a backslash.
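Here is a minimal sketch of both approaches. The variable names q1 and q2 are just for illustration.

```r
# Surround the double quote with single quotes...
q1 <- '"'
# ...or escape it with a backslash inside double quotes.
q2 <- "\""

# Both produce the same one-character string.
identical(q1, q2)
nchar(q2)

# print() shows the escaped form; cat() shows the character itself.
print(q2)
cat(q2, "\n")
```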

Use the charToRaw function to see the underlying code for the double quote mark (22), for the single quote mark (27), and for the backtick (60).
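A short sketch of those three calls follows. The variable names dq, sq, and bt are mine, chosen only to keep the results around.

```r
dq <- charToRaw("\"")  # straight double quote
sq <- charToRaw("'")   # straight single quote
bt <- charToRaw("`")   # backtick

dq  # [1] 22
sq  # [1] 27
bt  # [1] 60
```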

These values are hexadecimal, so 27 in hexadecimal is 2*16+7=39 in decimal.
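If you would rather let R do the arithmetic, here is one way to check the conversion. This is a sketch using base R functions; the variable names are illustrative.

```r
# Hexadecimal 27 is 2 * 16 + 7 in decimal.
by_hand <- 2 * 16 + 7

# strtoi() parses a string in the given base.
from_string <- strtoi("27", base = 16L)

# A raw value converts directly to its decimal equivalent.
from_raw <- as.integer(charToRaw("'"))

c(by_hand, from_string, from_raw)  # all 39

# And back again: decimal 39 written in hexadecimal.
format(as.hexmode(39L))  # "27"
```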

If you know the hexadecimal code, you can convert it to the character equivalent using the \x prefix.

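For example, the three characters discussed so far can be rebuilt from their hexadecimal codes. This is a minimal sketch; the variable names are just for illustration.

```r
# \x followed by two hexadecimal digits gives the character with that code.
dq <- "\x22"  # straight double quote
sq <- "\x27"  # straight single quote
bt <- "\x60"  # backtick

cat(dq, sq, bt, "\n")

# rawToChar() does the same conversion starting from a raw value.
rawToChar(as.raw(0x27))
```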

When personal computers expanded beyond this limited character set, you could use the left double quote (“), the right double quote (”), the left single quote (‘), and the right single quote (’).

These quote marks are part of a larger character set known as Unicode. The charToRaw function provides a surprising result.
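Here is a minimal sketch of that surprise. I write the left double quote as the escape \u201c so the example does not depend on how your script file is saved; R stores a string written this way as UTF-8. Instead of a single byte like the straight quotes, you get three.

```r
left_dq <- "\u201c"  # left double quote, written as an escape

# One character as far as R is concerned...
nchar(left_dq)

# ...but three bytes in storage.
charToRaw(left_dq)  # [1] e2 80 9c
length(charToRaw(left_dq))
```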

Surprise! When you open the world up to different typographic characters, you have to include characters with accents (such as é), cedillas (ç), and tildes (ñ). You have to have room for the sharp S in German (ß), the thorn in Icelandic (þ), and a whole host of new characters in Greek, Arabic, and Chinese. When you add various emojis, the list becomes quite long. The system that encodes all of these values is Unicode.

You specify Unicode values in R with the \U prefix followed by the hexadecimal code. The left double quote, for example, is \U201C.
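A short sketch of the \U prefix in action, using only base R functions. How the character displays with cat depends on your console, so treat the printed appearance as an assumption about your setup.

```r
left_dq <- "\U201C"
cat(left_dq, "\n")

# utf8ToInt() recovers the code: 8220 in decimal, 201c in hexadecimal.
code_point <- utf8ToInt(left_dq)
code_point
format(as.hexmode(code_point))

# intToUtf8() goes the other way, from the code to the character.
intToUtf8(0x201C)

# The lowercase \u prefix (up to four hexadecimal digits) gives the same string.
identical("\u201c", left_dq)
```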

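Now you might wonder why the internal code for the left double quote (e2 80 9c) does not match the 201C shown above. The value 201C is the Unicode code point, but the character is stored internally using an encoding called UTF-8, which spreads that code point across three bytes. UTF-8 maintains storage efficiency and backwards compatibility with earlier coding systems: the original 128 ASCII characters still take a single byte with the same code as before.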
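To see where e2 80 9c comes from, here is a sketch that builds the three bytes by hand. It assumes the standard three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx), which applies to codes from 0800 through FFFF in hexadecimal. The variable names are mine.

```r
cp <- 0x201CL  # code point for the left double quote

# Split the 16 bits into groups of 4, 6, and 6 and add the UTF-8 marker bits.
b1 <- bitwOr(0xE0L, bitwShiftR(cp, 12L))                 # 1110xxxx
b2 <- bitwOr(0x80L, bitwAnd(bitwShiftR(cp, 6L), 0x3FL))  # 10xxxxxx
b3 <- bitwOr(0x80L, bitwAnd(cp, 0x3FL))                  # 10xxxxxx

by_hand <- as.raw(c(b1, b2, b3))
by_hand  # [1] e2 80 9c

# These are the same bytes that R stores for the character.
identical(by_hand, charToRaw("\U201C"))

# Backwards compatibility: a plain ASCII character still takes a single byte.
charToRaw("'")
```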

The other smart quote marks are the right double quote (\U201D), the left single quote (\U2018), and the right single quote (\U2019).
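Here is a sketch showing the UTF-8 bytes for each of these three. The variable names are illustrative.

```r
right_dq <- "\U201D"  # right double quote
left_sq  <- "\U2018"  # left single quote
right_sq <- "\U2019"  # right single quote

charToRaw(right_dq)  # [1] e2 80 9d
charToRaw(left_sq)   # [1] e2 80 98
charToRaw(right_sq)  # [1] e2 80 99
```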

There are a couple of additional codes that I should mention. Most programmers use the minus sign (-) in their coding, but there are two similar characters that you might see. The em dash (\U2014) is a longer dash. It has a width equal to the letter “m” in most proportional width fonts. There is another dash, the en dash (\U2013), that is also longer than a minus sign, but about half the length of the em dash. It has a width that is equal to the letter “n” in most proportional width fonts.
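The three characters are easy to tell apart once you look at the underlying bytes. A minimal sketch, with variable names of my own choosing:

```r
minus   <- "-"
em_dash <- "\U2014"
en_dash <- "\U2013"

charToRaw(minus)    # [1] 2d
charToRaw(em_dash)  # [1] e2 80 94
charToRaw(en_dash)  # [1] e2 80 93

# They may look alike, but they are not the same character.
minus == en_dash
```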

The em dash and en dash often cause confusion because they look so much like the minus sign. R does not treat them as a minus sign, though, so they often cause problems when they sneak into R code or data.
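Here is a sketch of the kind of problem you might run into, along with one possible fix. The function straighten is a hypothetical helper that I made up for this example, not something built into R, and it only handles the six characters discussed on this page.

```r
# Hypothetical helper: swap smart quotes and dashes for plain ASCII characters.
straighten <- function(x) {
  x <- gsub("[\u201c\u201d]", "\"", x)  # left and right double quotes
  x <- gsub("[\u2018\u2019]", "'", x)   # left and right single quotes
  x <- gsub("[\u2013\u2014]", "-", x)   # en dash and em dash
  x
}

pasted <- "5 \u2013 3"  # looks like subtraction, but uses an en dash

# R cannot parse the en dash as an operator, so parse() throws an error.
result <- tryCatch(parse(text = pasted), error = function(e) "parse error")
result

# After cleaning, the expression parses and evaluates normally.
cleaned <- straighten(pasted)
eval(parse(text = cleaned))  # [1] 2
```

Whether you want to replace these characters everywhere is a judgment call. In text meant for people to read, the smart quotes and dashes are often exactly what you want.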

There’s a nice web page about the historical developments of computer codes for quote marks and dashes and another page that talks about computer codes in general from the perspective of an R programmer.

There are some other variants, such as the prime symbols, described in this Wikipedia page.