9 Compile-Time Computation and Information

2-0523.15

9 Compile-Time Computation and Information

The section discusses macros that do computation at compile time, and it introduces a shape for compile-time information bound to an identifier.

9.1 Compile-Time Computation

This example is based on the scramble/regexp library. The parser-tools/lex library implements a similar notation.

I hate writing regular expressions. At least, I hate writing them once they get over twenty characters long, or have more than two “report” groups, or have character classes involving special characters, or....

Let’s design a macro that takes a pleasant, compositional S-expression notation and automatically translates it at compile time to a regular expression literal — specifically, a pregexp literal.

9.1.1 Computation, Whenever

Wait! Why make this a macro? I can define an ordinary run-time AST datatype (call it RE) for representing regular expressions, and I can write a function that translates an RE to a pregexp string.

Here is the RE type. For simplicity, it only represents handles a subset of Regexp Syntax.

;; An RE is one of
;; - (re:or (NonemptyListof RE))
;; - (re:cat (Listof RE))
;; - (re:repeat Boolean RE)
;; - (re:report RE)
;; - (re:chars (NonemptyListof Range))
;; A Range is (rng Char Char)

(struct re:or (res) #:prefab)
(struct re:cat (res) #:prefab)
(struct re:repeat (plus? re) #:prefab)
(struct re:report (re) #:prefab)
(struct re:chars (ranges) #:prefab)
(struct rng (lo hi) #:prefab)

I’ll explain why I use #:prefab structs in a later section.

Here is the code to translate an RE value into a pregexp-compatible string. The functions are organized according to what nonterminal in the ‹regexp› grammar they produce, to handle the precedence of regular expression syntax. For example, producing an ‹atom› from a concatentation RE requires wrapping its ‹regexp› form with (?:␣).

; emit-regexp : RE -> String
(define (emit-regexp re)
  (match re
    [(re:or res) (string-join (map emit-pces res) "|")]
    [_ (emit-pces re #f)]))
; emit-pces : RE [Boolean] -> String
(define (emit-pces re [rec? #t])
  (match re
    [(re:cat res) (string-join (map emit-pces res) "")]
    [_ (emit-pce re rec?)]))
; emit-pce : RE [Boolean] -> String
(define (emit-pce re [rec? #t])
  (match re
    [(re:repeat #f re) (format "~a*" (emit-atom re))]
    [(re:repeat #t re) (format "~a+" (emit-atom re))]
    [_ (emit-atom re rec?)]))
; emit-atom : RE [Boolean] -> String
(define (emit-atom re [rec? #t])
  (match re
    [(re:report re) (format "(~a)" (emit-regexp re))]
    [(re:chars ranges) (format "[~a]" (string-join (map emit-range ranges) ""))]
    [_ (cond [rec? (format "(?:~a)" (emit-regexp re))]
             [else (error 'emit-regexp "bad RE: ~e" re)])]))
; emit-range : Range -> String
(define (emit-range r)
  (match r
    [(rng c c) (emit-char c)]
    [(rng lo hi) (format "~a-~a" (emit-char lo) (emit-char hi))]))
; emit-char : Char -> String
(define (emit-char c)
  (define (special? c) (for/or ([sp (in-string "()*+?[]{}.^\\|")]) (eqv? c sp)))
  (if (special? c) (string #\\ c) (string c)))

Here is an example:

> (emit-regexp
   (re:repeat #f
    (re:cat (list (re:report (re:repeat #t (re:chars (list (rng #\a #\z)))))
                  (re:repeat #t (re:chars (list (rng #\space #\space))))))))
"(?:([a-z]+)[ ]+)*"

So, the ergonomics leave a bit to be desired. It would be possible to improve the interface by using friendlier functions instead of the raw AST constructors, of course. Or we could even define an S-expression notation and parse it into the RE type using match. All of potentially incurs additional run-time overhead, and there is also the overhead of pregexp call itself.

In any case, this code represents a complete, self-contained unit of functionality. Let’s wrap up the code above as module:

(module re-ast racket/base
  (require racket/match racket/string)
  (provide (struct-out re:or)
           (struct-out re:cat)
           ····
           emit-regexp)
  ····)

We can leave a friendlier front end as a task for a separate module.

9.1.2 A Macro Front-End

Let’s add a macro “front end” to the RE type and emit-regexp function. Specifically, the macro should support a friendlier notation that it parses into a compile-time RE value, translates to a string, and converts to a pregexp literal, all at compile time.

Here is a shape for representing (a subset of) regular expressions:

;; RE ::= (or RE ...+)
;;      | (cat RE ...)
;;      | (* RE)
;;      | (+ RE)
;;      | (report RE)
;;      | (chars CharRange ...+)
;; CharRange ::= Char | [Char Char]

I have called the shape RE, the same as the RE type. In fact, the interpretation of the RE term is a compile-time RE value. We can import the RE type and emit-regexp function into the compile-time environment as follows:

(require (for-syntax 're-ast))

The re syntax class, then, should have a single attribute whose value is an RE value.

Before we define the syntax class, we should decide how to recognize the literals in the shape definition (aka grammar) above. In Quasiquotation, I said there are two options: symbolic literals and reference literals. In this case, I want to use names that are already defined by Racket, but their interpretations here have nothing to do with their Racket meanings. More importantly, I don’t plan to support macro-like extensions to this syntax, which is one major reason to recognize literals by reference instead of symbolically. So let’s recognize the RE literals symbolically. We can do that by declaring them with #:datum-literals instead of #:literals. Here are the syntax class definitions:

(begin-for-syntax
  (define-syntax-class char-range
    #:attributes (ast) ; Range
    (pattern c:char
             #:attr ast (let ([c (syntax->datum #'c)]) (rng c c)))
    (pattern [lo:char hi:char]
             #:attr ast (rng (syntax->datum #'lo) (syntax->datum #'hi))))
  (define-syntax-class re
    #:attributes (ast) ; RE
    #:datum-literals (or cat * + report chars)
    (pattern (or e:re ...+)
             #:attr ast (re:or (datum (e.ast ...))))
    (pattern (cat e:re ...)
             #:attr ast (re:cat (datum (e.ast ...))))
    (pattern (* e:re)
             #:attr ast (re:repeat #f (datum e.ast)))
    (pattern (+ e:re)
             #:attr ast (re:repeat #t (datum e.ast)))
    (pattern (report e:re)
             #:attr ast (re:report (datum e.ast)))
    (pattern (chars r:char-range ...+)
             #:attr ast (re:chars (datum (r.ast ...))))))

The my-px macro simply calls emit-regexp on the parsed RE value, then calls pregexp to convert that to a (compile-time) regular expression value.

; (my-px RE) : Expr
(define-syntax my-px
  (syntax-parser
    [(_ e:re)
     #`(quote #,(pregexp (emit-regexp (datum e.ast))))]))

Note that we use quote to wrap the value returned by pregexp.

Here is the example from the previous section translated to use the macro’s notation:

> (my-px (* (cat (report (+ (chars [#\a #\z])))
(+ (chars #\space)))))
#px"(?:([a-z]+)[ ]+)*"

Exercise 20: Update the RE shape with the following case:
;; RE ::= ... | String
You can use string as a syntax class annotation to recognizes string terms.
A string is interpreted as the concatenation of the singleton character sets of each character in the string. For example:
(my-px (* "ab"))
; expect #px"(?:ab)*"

9.2 The Static Shape

The Static[T] shape is a parameterized shape that recognizes identifiers that refer to compile-time information of type T. The interpretation of a Static[T] reference is the compile-time T value.

The corresponding static syntax class is parameterized by a predicate and a description. The syntax class matches an identifier if the identifier is bound in the normal environment to a compile-time value accepted by the predicate; the value attribute contains the compile-time value.

That is, the Static shape contains identifiers bound with define-syntax, let-syntax, and so on. I’ll call this a static binding, as opposed to a variable bindings created by define, let, and so on. Static bindings are also created by macros such as struct: the name of a struct type carries compile-time information about the struct type, including references to its predicate and accessor functions. This information is used by macros like match to implement pattern matching; it is also used by struct-out to get the names to export. (And remember, a static binding in the normal environment is different from a variable binding in the compile-time environment, even though both refer to compile-time values.)

Terminology: I don’t like “bound as syntax”. I’m not totally happy with “bound statically” either, though. “bound to compile-time information” is too verbose. “bound at compile time” is wrong. Alternatives?

Let’s extend our little regular expression language with the ability to define and use RE names.

Here is the shape of the definition form:

Notation: ~ or :: or ??

;; (define-re x:Id RE) : Body[{x ~ RE}]

I’m using the notation from Shapes, Types, and Scopes (★) to indicate the bindings introduced by a Body term, but I have extended it with the notation x ~ RE to mean that x is bound statically to a compile-time value of type RE — as opposed to x : T, which means that x is bound as a variable to a run-time value of type T.

The variants of the RE type are represented by prefab structs, which are readable and — more importantly — quoteable. So we can implement the definition macro as follows:

(define-syntax define-re
  (syntax-parser
    [(_ name:id e:re)
     #`(define-syntax name '#,(datum e.ast))]))

That is, define-re parses the RE term to get a compile-time RE value. But it cannot directly create a static binding for name; it must do so by expanding to a define-syntax term. So the macro must convert the compile-time RE value that it has now into an expression that will produce an equivalent value later, when the macro expander processes the define-syntax form. Since the value is readable, we can do that with quote. (If the value were not readable, then this would create “3D syntax”, and modules using the macro would fail to compile. Or more precisely, compilation would succeed but the compiler would be unable to serialize the compiled code to a .zo file.)

To allow references to static RE bindings, we extend the RE shape as follows:

;; RE ::= ... | Static[RE]

The re syntax class needs a new pattern for references to RE names, and that pattern needs a helper predicate to recognize RE values. The existing patterns are unchanged.

(begin-for-syntax
  ; re-ast? : Any -> Boolean
  ; Shallow check for RE AST constructor. Sufficient?
  (define (re-ast? v)
    (or (re:or? v) (re:cat? v) (re:repeat? v) (re:report? v) (re:chars? v)))

  (define-syntax-class re
    #:attributes (ast) ; RE
    #:datum-literals (or cat * + report chars)
    ····
    (pattern (~var name (static re-ast? "name bound to RE"))
             #:attr ast (datum name.value))))

Now we can define intermediate RE names and compose them into more complicated regular expressions:

> (define-re word (+ (chars [#\a #\z])))
> (define-re spacing (+ (chars #\space)))
> (my-px (* (cat (report word) spacing)))
#px"(?:([a-z]+)[ ]+)*"

If we attempt to refer to a name that is not defined as a RE, then we get an appropriate error:

> (my-px (cat word list))
eval:28:0: my-px: expected name bound to RE
  at: list
  in: (my-px (cat word list))
  parsing context:
   while parsing re
    term: list
    location: eval:28:0
   while parsing re
    term: (cat word list)
    location: eval:28:0

We can inspect the compile-time value bound to an RE name by using syntax-local-value, which is is the low-level mechanism underneath static:

> (phase1-eval (syntax-local-value (quote-syntax word)))
'#s(re:repeat #t #s(re:chars (#s(rng #\a #\z))))

9.3 Static Information with Multiple Interfaces

There are still a few issues:

It would be nice if we could also use RE names like word and spacing as variables.
The shallow re-ast? test doesn’t guarantee that the name was defined using define-re. After all, anyone can create a prefab struct named re:repeat.
This design does not allow forward references: an RE name must be defined before it is used. But it is often preferable to define complex objects in a top-down order.

We can fix the first two issues by adding a generative (that is, not prefab) struct wrapper around the RE value, and making it support the procedure interface so it acts as an identifier macro. The next section shows how to support forward references, at least in most contexts.

The macro expander considers any name that is statically bound to a procedure to be a macro. It invokes the macro’s transformer on uses of the macro both in operator position and as a solitary identifier. A macro that allows being used as a solitary identifier is called an identifier macro. (If the macro’s value is a set!-transformer, it is also invoked when the macro is used as the target of a set! expression.)

In Racket, any non-prefab struct can act as a procedure by implementing the procedure interface, represented by the prop:procedure struct type property. The macro system defines other interfaces, such as prop:rename-transformer and prop:set!-transformer, and macros can also define their own interfaces. For example, the struct form defines an interface, prop:struct-info, for representing compile-time information about struct types; the match macro defines an interface, prop:match-expander, for implementing new match pattern forms; and so on. Thus one name can carry multiple kinds of static information and behavior by being bound to a struct type that implements multiple interfaces.

We could even define and export our own interface for values representing RE names. But that would conflict with our goal of restricting RE names to those defined through our define-re macro, which enforces invariants on the values carried by RE names. Furthermore, it commits us to a representation; if we change how we represent information (as we will in the next section), it can break code that relies on the public interface. These problems can be mitigated with well-formedness checks and adapters, but that is additional effort and complexity, and it often doesn’t completely fix the problem. In this example, the costs and risks don’t seem worth the (absent) benefits. Another possibility is to define an interface but keep it private, only using it within a library. That avoids the problems above. It still doesn’t seem useful in this particular example, though.

So, we will define a new struct type that implements the procedure interface so RE names can be used as expressions, but our macros will recognize the struct type specifically, without going through an additional interface indirection. Here is the definition:

(begin-for-syntax
  ; A RE-Binding is (re-binding RE (Syntax -> Syntax))
  (struct re-binding (ast transformer)
    #:property prop:procedure (struct-field-index transformer)))

When used as a procedure, an re-binding instance just forwards the call to its transformer field, so when define-re constructs an re-binding instance, it must provide a suitable transformer function. We can use the make-variable-like-transformer library function to construct an identifier macro that always produces the same expansion. Here is the updated define-re:

(require (for-syntax syntax/transformer))
(define-syntax define-re
  (syntax-parser
    [(_ name:id e:re)
     #`(define-syntax name
         (re-binding (quote #,(datum e.ast))
                     (make-variable-like-transformer
                      (quote-syntax (my-px name)))))]))

Now we must update re to extract the RE AST value in the RE-name case. Again, the other variants remain unchanged:

(begin-for-syntax
  (define-syntax-class re
    #:attributes (ast) ; RE
    #:datum-literals (or cat * + report chars)
    ····
    (pattern (~var name (static re-binding? "name bound to RE"))
             #:attr ast (re-binding-ast (datum name.value)))))

Now the following example works; we can use word and spacing like variables:

> (define-re word (+ (chars [#\a #\z])))
> (define-re spacing (+ (chars #\space)))
> (define-re word+spacing (cat (report word) spacing))
> (list word spacing (my-px (* word+spacing)))
'(#px"[a-z]+" #px"[ ]+" #px"(?:([a-z]+)[ ]+)*")

We can also verify that the compile-time information stored by an RE name is no longer a prefab struct; it is an opaque wrapper which prints as a procedure:

> (phase1-eval (syntax-local-value (quote-syntax word)))
#<procedure:...tax/transformer.rkt:19:3>

Exercise 21: Update the definition of re-binding so that instances of the struct print as "#<RE>". You can do this by implementing the prop:custom-write interface.
(phase1-eval (format "~s" (syntax-local-eval (quote-syntax word))))
; expect "#<RE>"

Exercise 22 (★): Update the definition of re-binding so that it also acts as a match pattern name. As a match pattern, it takes some number of pattern arguments; these are the patterns used to match the regular expression’s report results. That is, an RE name used as a match pattern expands like this:
(word+spacing pat)
⇒
(pregexp word+spacing (list _ pat))
See match for an explanation of the syntax of match patterns; see prop:match-expander for an explanation of the interface.

Exercise 23 (★★): Update your solution to Exercise 22 to check that when used as a match pattern, the RE name receives the correct number of arguments. That is, the first example below should succeed (because word+spacing has one report), but the others should cause an error:
(word+spacing the-word)         ; okay
(word+spacing the-word extra)   ; error: wrong number of patterns, expected 1
(spacing the-spaces)            ; error: wrong number of patterns, expected 0

9.4 Two-Pass Expansion

To support forward references requires knowing a little about how Racket processes definitions.

The Racket macro expander processes definition contexts (module bodies, lambda bodies, and so on) in two passes. The first pass discovers definitions; the second pass expands (remaining) expressions.

In the first pass, Racket expands each body term until it reaches a core form, but it does not recur into the core form’s sub-expressions. Once it reaches a core form, it does a shallow case analysis. If the form is define-values, it marks the names as variables in the local environment. If the form is define-syntaxes, it evaluates the right-hand side as a compile-time expression and binds the names statically. If the form is begin, it flattens it away and recurs on the contents. (So a macro can expand into multiple definitions by grouping them with begin.) Otherwise the form is an expression form, and it leaves that until the next pass. After the case analysis, it continues to the next body term.

In the second pass, the expander knows all of the names bound in the recursive definition context, both variable names and static names. The expander then processes the remaining expressions: the sub-expressions of core expression forms and the right-hand sides of define-values definitions.

9.4.1 Forward References

How can we support forward references for compile-time data?

Since static definitions are evaluated in order, the static information for a RE name cannot just contain an RE AST value. It must contain a thunk or promise or something similar that allows us to get the AST value on demand. Let’s use a promise. Here is the updated struct definition:

(require (for-syntax racket/promise))
(begin-for-syntax
  ; A RE-Binding is (re-binding (Promise RE) (Syntax -> Syntax))
  (struct re-binding (astp transformer)
    #:property prop:procedure (struct-field-index transformer)))

Now define-re cannot eagerly parse the RE term; instead, it must create a promise that parses it later. Here’s one implementation:

(begin-for-syntax
  ; parse-re-from-def : Syntax -> RE
  ; Receives the entire `define-re` term.
  (define (parse-re-from-def stx)
    (syntax-parse stx
      [(_ _ e:re) (datum e.ast)])))

(define-syntax define-re
  (syntax-parser
    [(_ name:id _)
     #`(begin
         (define-syntax name
           (re-binding (delay (parse-re-from-def (quote-syntax #,this-syntax)))
                       (make-variable-like-transformer
                        (quote-syntax (my-px name)))))
         (void (my-px name)))]))

Within a syntax-parser clause, this-syntax is bound to the syntax object currently being parsed. In this case, that’s the syntax of the define-re use. The reason for passing the whole definition syntax to parse-re-from-def instead of just the RE term is that by default syntax-parse reports syntax errors using the leading identifier of its argument as the “complaining party”. (This behavior can be overridden with the #:context argument, though.)

Why does the expansion include (void (my-px name))? That expression ensures that the promise eventually gets forced. Since it occurs within an expression, it is delayed until pass two, when all forward references should have been defined. If we left it out, then a syntactically invalid RE definition would be accepted as long as it was never used.

Finally, we must update re to force the promise from a RE name. Here is a basic implementation:

(begin-for-syntax
  (define-syntax-class re
    #:attributes (ast) ; RE
    #:datum-literals (or cat * + report chars)
    ····
    (pattern (~var name (static re-binding? "name bound to RE"))
             #:attr ast (force (re-binding-astp (datum name.value))))))

One flaw in this implementation is that if it is given a recursive RE definition, it produces an internal error about re-entrant promises. Here is a version that uses a parameter to detect that situation and signals a better error:

(begin-for-syntax
  ; running : (Parameter (Promise RE))
  ; Currently running RE promises, used to detect cycles.
  (define running (make-parameter null))

  (define-syntax-class re
    #:attributes (ast) ; RE
    #:datum-literals (or cat * + report chars)
    ····
    (pattern (~var name (static re-binding? "name bound to RE"))
             #:attr ast (let ([p (re-binding-ast (datum name.value))])
                          (cond [(member p (running)) #f]
                                [else (parameterize ((running (cons p (running))))
                                        (force p))]))
             #:fail-when (if (datum ast) #'name #f) "recursive RE")))

With those changes, forward references work, at least within modules and within internal definition contexts like let bodies:

> (let ()
    (define-re para (* (cat word spacing)))
    (define-re word (+ (chars [#\a #\z])))
    (printf "word = ~s\n" word)
    (define-re spacing (+ (chars #\space)))
    para)
word = #px"[a-z]+"
#px"(?:[a-z]+[ ]+)*"

We get reasonable messages for the following error cases:

> (let ()
    (define-re uses-undef (cat (chars #\a) undef))
    'whatever)
eval:52:0: define-re: expected name bound to RE
  at: undef
  in: (define-re uses-undef (cat (chars #\a) undef))
  parsing context:
   while parsing re
    term: undef
    location: eval:52:0
   while parsing re
    term: (cat (chars #\a) undef)
    location: eval:52:0
> (let ()
    (define-re rec (cat (chars #\a) rec))
    'whatever)
eval:53:0: define-re: recursive RE
  at: rec
  in: (define-re rec (cat (chars #\a) rec))
  parsing context:
   while parsing re
    term: rec
    location: eval:53:0
   while parsing re
    term: (cat (chars #\a) rec)
    location: eval:53:0

There are other ways we could manage the delayed resolution of RE names. For example, we could extend the AST type with a new variant for names, eagerly parse most of the AST and create promises only for instances of the name variant. One benefit of that approach is that most RE syntax errors could be caught when the definition is processed instead of when the promise is forced. Some drawbacks are that it involves either changing the RE type or creating a substantially similar RE-With-Promises type, and it requires adding a new function to traverse the AST forcing the name nodes.

9.4.2 The Peculiarities of Scoping in Two-Pass Expansion

The two-pass expansion and its treatment of macros means that definition contexts in Racket are not purely recursive; they also have a slight sequential aspect. Consider the behavior of the following example:

> (let ()
    (define-syntax m
      (syntax-parser
        [(_ e:expr) #'(printf "outer ~s\n" e)]))
    (let ()
      (m (begin (m 123) 456))
      (define-syntax m
        (syntax-parser
          [(_ e:expr) #'(printf "inner ~s\n" e)]))
      (m 789)))
inner 123
outer 456
inner 789

Simple recursive scoping would predict that all three references to m in the inner let body refer to the inner definition of m. But the first use of m is head-expanded before the inner definition of m is discovered, so it refers to the outer definition. It’s argument, though, is not expanded until pass two, so it refers to the inner m, as does the third use of m. The third use of m is expanded before the second, though.

What if we simply delete the outer macro definition?

> (let ()
    (let ()
      (m (begin (m 123) 456))
      (define-syntax m
        (syntax-parser
          [(_ e:expr) #'(printf "inner ~s\n" e)]))
      (m 789)))
inner 123
inner 456
inner 789

Now in pass one the expander assumes that the first use of m is a function application, and m is a variable that might be defined later. So it saves the whole expression for pass two. Then in pass two it realizes that the expression is not a function application but a macro application, and it expands the macro. This is good, right? It’s what one would expect given a macro definition in a recursive scope.

But there are limits. Consider the following example, where the macro produces a definition instead of an expression:

> (let ()
    (m a)
    (define-syntax m
      (syntax-parser
        [(_ x:id) #'(define x 1)]))
    (m b)
    (+ a b))
eval:56:0: define: not allowed in an expression context
  in: (define a 1)

As in the previous example, the macro expansion initially classifies the first use of m as a function application. In the second pass, though, when it expands the macro, it expands it in a strict expression context. That is because it is unwilling to make further changes to the environment in the second pass; it is frozen at the end of the first pass.

The greatest scoping peculiarities of definition contexts arise from macro names that are shadowed in the middle of an inner scope.

Lesson: Don’t shadow macro names.

Unfortunately, many names in Racket that seem like variable names are actually implemented as macro bindings. One example is functions with keyword arguments, to reduce run-time overhead for keyword checking and default arguments. Another example is bindings exported with contract-out, to compute the negative blame party from the use site.

Lesson: As much as possible, avoid shadowing entirely.

contents ← prev up next →

1	Introduction
2	Terms and Shapes
3	Basic Shapes
4	Compound Shapes
5	Shape Definitions
6	Enumerated Shapes
7	Multi-Term Shapes
8	Recursive Shapes
9	Compile-Time Computation and Information
10	Unhygienic Macros

9.1	Compile-Time Computation
9.2	The Static Shape
9.3	Static Information with Multiple Interfaces
9.4	Two-Pass Expansion