Escaping Escaping


Escaping is a common solution to the problem of representing sequences of symbols (I will call them strings) that are part of structures (program code) that are themselves represented as sequences of symbols (text). For example, an assignment of the two-character string value "\ to the variable a is typically written a="\"\\". Understanding what that value is would be easier if every character could represent itself. The quote character cannot represent itself because it denotes the end of the string, and neither can the backslash character because it’s used as the escape character. So, is escaping a necessary evil in our programming languages? No, it’s a tradeoff and there are alternatives.

What Makes Escaping Necessary?

If the string can include characters that are not in the alphabet used to represent the structure, then those characters cannot be represented by themselves. There’s no good way around that, and that’s why non-printable characters often have assigned escape sequences.

Barring that though, it takes quite a bit for escaping to be necessary:

  1. A structure that contains a string is to be represented as a sequence of symbols.
  2. The end of the string must be represented by an established sequence of symbols located at the end of the string. (I assume here that strings are recognized by a lexer rather than a parser, so it doesn’t really matter what kind of formal grammar the language uses.)
  3. It’s not possible to choose a sequence of symbols that cannot be a subsequence of the string, using the same alphabet that is used to represent the structure.

Unless all of these conditions hold, there is an alternative to escaping.

Condition 1 is void, for example, when the structure is represented as a GUI rather than a sequence of symbols. You don’t have to escape characters when you input data in a GUI textbox, which is so obvious that you don’t even think of it. But you don’t have to go all the way to a visual programming language1. It’s conceivable to build a programming language in semi-rich text and let e.g. underlined, green text represent string literals. There could be escaping in the file format, but the editor can show the strings without escaping. Picolisp2 actually tried this but removed the functionality in 2018. Most programming languages both old and new are staying as plain text (for good reasons) and want string literals, so condition 1 isn’t helping us.

Condition 2 can be void if we find another way to communicate the end of the string. The primary example would be representing the length of the string followed by the string itself, which is very common in protocols and formats that are mostly read and written by machines. To use this technique in programming languages would mean making humans with no interest in counting symbols do just that, which is so painful that escaping is a better tradeoff.
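As a minimal sketch of the length-prefix technique, the Python below writes the byte length, a colon, and then the string itself. The `length:payload` layout and the names `encode` and `decode` are assumptions made for illustration (loosely in the spirit of netstrings), not a format taken from any particular protocol.

```python
def encode(s: str) -> bytes:
    """Represent s as b'<byte length>:<utf-8 bytes>' with no escaping at all."""
    data = s.encode("utf-8")
    return str(len(data)).encode("ascii") + b":" + data


def decode(buf: bytes) -> tuple[str, bytes]:
    """Read one length-prefixed string; return it and the unread remainder."""
    head, _, rest = buf.partition(b":")
    n = int(head)  # the reader knows where the string ends before reading it
    return rest[:n].decode("utf-8"), rest[n:]


if __name__ == "__main__":
    # Quotes and backslashes need no special treatment.
    stream = encode('"\\') + encode("it's fine")
    first, stream = decode(stream)
    second, stream = decode(stream)
    print(first, second)
```

Every character represents itself here, but a human writing this by hand would have to count bytes, which is the tradeoff described above.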

Condition 3 is more interesting, and it is voiding this condition that can help us avoid escaping in mainstream programming languages.

Dynamic Quote Sequences

Usually we want to be able to represent any Unicode string in a text-based language, and we certainly don’t want that language to use characters that aren’t part of Unicode. How do we then choose a sequence of symbols from the alphabet used to represent the structure (a subset of Unicode) that cannot be a subsequence of an arbitrary Unicode string? By making the choice when the string is known. Then you can choose a sequence that cannot be a subsequence of the string simply because it isn’t.

So in order to avoid escaping, the language has to support more than one string representation. For example, Python uses apostrophe ('), quote ("), triple-apostrophe (''') and triple-quote (""") as quote sequences.3 However, there will always be a string that requires escaping unless the language allows for an essentially infinite number of quote sequences.
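The idea of choosing the quote sequence once the string is known can be sketched in a few lines of Python. The delimiter family used here (`$$`, `$q1$`, `$q2$`, ...) and the function names are assumptions made for illustration; the family is inspired by dollar-style delimiters but is not meant to reproduce any real language's lexer rules.

```python
from itertools import count


def pick_delimiter(s: str) -> str:
    """Return the first delimiter in the family $$, $q1$, $q2$, ... that is safe for s.

    A delimiter is safe when its first occurrence in s + delimiter is the
    closing one. This also rejects cases where the end of s and the start
    of the closing delimiter would combine into an early match.
    """
    for n in count():
        delim = "$$" if n == 0 else f"$q{n}$"
        if (s + delim).find(delim) == len(s):
            return delim


def quote(s: str) -> str:
    """Wrap s in a delimiter chosen for it; s itself is left untouched."""
    delim = pick_delimiter(s)
    return delim + s + delim


def unquote(text: str) -> str:
    """Recover the string: the opening delimiter runs up to the second '$'."""
    end_open = text.index("$", 1) + 1
    delim = text[:end_open]
    close = text.index(delim, end_open)
    return text[end_open:close]


if __name__ == "__main__":
    print(quote("then $$yes$$ else $$no$$"))
```

Because the family of delimiters is unbounded, the loop always terminates with a usable delimiter, which is what makes escaping unnecessary for any input string.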

Use Case Example

This isn’t merely a theoretical exercise; it is a useful feature in real life. In many languages we need strings to represent program code, either in the same language or another. Sometimes there are several levels of such quoting, and the number of escape characters can grow exponentially. Let’s take an arguably realistic example of nested quotation4. (This example aims to illustrate issues with quoting and escaping, nothing else.)

Here’s PL/pgSQL code that evaluates to yes if the number a is even, otherwise to no.

begin return case when (a % 2)=0 then 'yes' else 'no' end; end

In order to make use of it, we need to define a PostgreSQL function for it, and pass the code as a string in the definition. Then let’s test that by finding out whether 8 is even. In PostgreSQL, apostrophe (') is escaped as two apostrophes ('').

create or replace function iseven (a integer)
returns text language 'plpgsql'
as 'begin return case when (a % 2)=0 then ''yes'' else ''no'' end; end';
select iseven(8);

If we use Python to access the database we might want to assign the PostgreSQL code to a variable. Let’s do that and print the code for testing. Within a Python string using apostrophe (') as the quote sequence, apostrophe (') is escaped by prepending a backslash (\').

query='create or replace function iseven (a integer) returns text language \'plpgsql\' as \'begin return case when (a % 2)=0 then \'\'yes\'\' else \'\'no\'\' end; end\'; select iseven(8);'
print(query)

Now maybe we want to test this from a shell one-off and pipe it through python and psql. We want to print the Python code including the newline, so we use echo -e. This actually adds two layers of escaping, one in bash argument parsing and one in the echo built-in, so a backslash (\) escaped twice becomes four backslashes (\\\\).

echo -e "query='create or replace function iseven (a integer) returns text language \\\\'plpgsql\\\\' as \\\\'begin return case when (a % 2)=0 then \\\\'\\\\'yes\\\\'\\\\' else \\\\'\\\\'no\\\\'\\\\' end; end\\\\'; select iseven(8);'\nprint(query)"|python|psql

Which of course is completely unreadable, and what is unreadable is often useless. Now, if we use better quoting sequences, it can look like this instead:

(cat <<'WHATEVER'
query="""
  create or replace function iseven (a integer)
    returns text language $$plpgsql$$ as
    $ANYTHING$ begin
      return
        case when (a % 2)=0
          then $$yes$$
          else $$no$$
        end;
    end $ANYTHING$;
  select iseven(8);
"""
print(query)
WHATEVER
) | python | psql

It might not be perfect, but it’s readable and there is no escaping here at any level. Every level of code is represented as itself, i.e. you can read, write and copy-paste it with no handicap. This is possible thanks to both bash and PostgreSQL having a mechanism for creating new quote sequences on the fly.

In bash, new quote sequences can be created with here documents5. In the code above, the quoted string begins after 'WHATEVER' followed by newline, and ends before WHATEVER.

In PostgreSQL, new quote sequences can be created with dollar-quoting6. In the code above, two quote sequences are used, $ANYTHING$ and $$.

With this kind of mechanism, languages don’t have to be designed with each other in mind in order to avoid escaping, since anything can be quoted. In the case of the Python code above we were merely lucky that a suitable quote sequence was available.
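To get a rough feel for why the escaped one-liner earlier in this section became unreadable, the following Python sketch applies the same backslash-escaping rule repeatedly to a single apostrophe. The rule (prefix backslash and apostrophe with a backslash) and the function names are simplifying assumptions; the real example mixes several different escaping conventions.

```python
def escape(s: str) -> str:
    """One layer of backslash escaping: protect backslashes first, then apostrophes."""
    return s.replace("\\", "\\\\").replace("'", "\\'")


def nest(s: str, layers: int) -> str:
    """Apply the escaping rule once per layer of quoting."""
    for _ in range(layers):
        s = escape(s)
    return s


if __name__ == "__main__":
    for layers in range(4):
        escaped = nest("'", layers)
        print(layers, len(escaped), escaped)
```

Under this rule, each layer doubles the length of the representation of one apostrophe, so the cost grows exponentially with the nesting depth.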

Indent Quoting

There is actually one type of escaping that comes without any significant readability cost. Let’s call it indent quoting. This would only work for indentation-sensitive languages like Python. I came up with it in 2012 but I haven’t seen it in the wild. Let me know if you have!

It goes like this:

  • The beginning of a string is represented by a keyword or punctuation mark, followed by a newline and an indentation increase.
  • The newline character is the only one to be escaped. It is represented by a newline, followed by the current indentation whitespace.
  • The end of the string is represented by a newline, followed by an indentation decrease.
  • Editors would automatically escape and unescape in copy/paste operations, which essentially just means managing indentation correctly.

Imagine if Python had this:

helptext = string:
    Usage: This is the first line
        This is the second line
    ^--- This whitespace is part of the string.
    <--- This whitespace is the current indent.
    We can include """, ''', " and ' just as they
    are, so we can quote any code this way,
    whether python or other.
def help():
    print(helptext)
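The rules above are simple enough to sketch as a pair of Python functions that an editor or lexer might use. The function names and the fixed four-space indent are assumptions made for illustration; this handles only the string body, not the surrounding `string:` syntax imagined above.

```python
def indent_quote(s: str, indent: str = "    ") -> str:
    """Encode s as an indented block: every line gets the current indent in front."""
    return "\n".join(indent + line for line in s.split("\n"))


def indent_unquote(block: str, indent: str = "    ") -> str:
    """Decode an indented block by stripping exactly one indent from each line."""
    lines = block.split("\n")
    if not all(line.startswith(indent) for line in lines):
        raise ValueError("every line of the block must start with the indent")
    return "\n".join(line[len(indent):] for line in lines)


if __name__ == "__main__":
    text = 'Usage: first line\n    second line with """, \'\'\', " and \''
    block = indent_quote(text)
    print(block)
    print(indent_unquote(block) == text)
```

Only the newline is affected by the encoding. Whitespace beyond the current indent, and every kind of quote character, passes through unchanged.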

Summary

  • Escaping is a nuisance.
  • Escaping escaping is desirable because it increases both readability and writability.
  • When a language supports dynamic quote sequences, escaping escaping is possible, simple, and has no significant downsides.
  • The design of dynamic quote sequences in a language (or format or protocol) is independent of the design of any language or other kind of string that is to be quoted.

Finally, a thought for you language designers out there: Do you support escaping escaping?



Thanks to Drew DeVault and Lauren Svensson for reading drafts of this.