A digression from obfuscation to representation of code

Faré has a nice obfuscated program in his signature:

(labels(({(] &rest [)(apply([
])[))([(>)(elt(]())>))(](<)(do-external-symbols(] :cl)(push ] <))(sort
<`string<`:key`string))(}({ + ^)({`816`1/5)({`688({`875({`398()"~{~A~^
~}"(]())){(+ { +)))({`381)^))(do*(({`5248({`584 }`36063))([`874({`395
{`6))(]`4({`584 {`6))(}`#36RH4G6HUTA1NVC1ZHC({`395 }`36063)))((} [ ]
({`977 ]))({`902)({`381))))

Whitespacelessness is nice for fitting programs into signatures (although this one is still not strictly McQ), but as a tool of obfuscation it has been obsolete since pprint was invented:

(LABELS (({ (] &REST [)
           (APPLY ([ ]) [))
         ([ (>)
           (ELT (] NIL) >))
         (] (<)
           (DO-EXTERNAL-SYMBOLS (] :CL) (PUSH ] <))
           (SORT < 'STRING< ':KEY 'STRING))
         (} ({ + ^)
           ({ 816 1/5)
           ({ 688
              ({ 875
                 ({ 398
                    NIL
                    "~{~A~^
~}"
                    (] NIL))
                 {
                 (+ { +)))
           ({ 381)
           ^))
  (DO* (({ 5248 ({ 584 } 36063))
        ([ 874 ({ 395 { 6))
        (] 4 ({ 584 { 6))
        (} 3785580492276528215065056 ({ 395 } 36063)))
       ((} [ ] ({ 977 ])) ({ 902) ({ 381))))

So in addition to the formatting, this program has — well, I won't spoil it. Check out that { operator, though.

Wait a minute — what happened to the backquotes and #36r? They're both purely read-time constructs, so there's nothing left of them to pretty-print. This is fine for purposes of reading obfuscated code, but annoying to anyone trying to build editing tools, because the canonical representation of code does not contain the whole program.

Many CL implementations (including SBCL, evidently) expand backquote at read time, because it's slightly simpler to implement. They try to preserve it by expanding into something distinctive (e.g. using sb-impl::backq-list instead of list) so they can unexpand it when pretty-printing, but this doesn't always work — there are lots of holes and ambiguous cases. Fortunately there's a better way: make backquote a simple abbreviation for (quasiquote ...) (as in Scheme), and do the expansion in a macro.

Number representations are harder to preserve. One could read #16rF00 as something like (radix 16 "F00"), and define radix as a macro. But this doesn't work for unevaluated data. In a system where code is more than just lists (like syntax-case), radix could be preserved along with the other extra information (line numbers etc.). The same approach can preserve #. and comments, but the conceptual cost is high — programs are no longer as simple as they appear.

I think it's worth investigating anyway. It is possible to have a richer representation of code than lists without going to the opaque lengths of syntax-case. By avoiding structural abbreviations, it may even be possible to make something easier to work with than lists. There are obviously a lot of challenges here, but the possibility of improving what Lisp does best — metaprogramming — is worth spending some effort on.

And think of the possibilities for obfuscated code!

1 comment:

It's OK to comment on old posts.