Common Lisp Runtime Speed

Common Lisp has got fancy compilers that compile to native code, so it runs typically ~10 times faster than Python, see e.g. Peter Norvig's Python for Lisp Programmers. It's a dynamically typed language though, so it runs somehow slower (say 50%-100%) than statically typed OCaml/C++. Can one optimise things further?

For example, consider a simple square function:

(defun square (x)
  (* x x))

you can easily see the assembler code for it:

(compile 'square)
(disassemble 'square)

to find out that relatively expensive GENERIC-* stuff is called:

480B22B8:       .ENTRY SQUARE(x)             ; (FUNCTION (T) NUMBER)
     2D0:       POP   DWORD PTR [EBP-8]
     2D3:       LEA   ESP, [EBP-32]

     2D6:       CMP   ECX, 4
     2D9:       JNE   L0
     2DB:       MOV   ESI, EDX
     2DD:       MOV   [EBP-12], ESI          ; No-arg-parsing entry point
     2E0:       MOV   EDX, ESI
     2E2:       MOV   EDI, ESI
     2E4:       CALL  #x100002C8             ; GENERIC-*
     2E9:       MOV   ESP, EBX
     2EB:       MOV   ESI, [EBP-12]
     2EE:       MOV   ECX, [EBP-8]
     2F1:       MOV   EAX, [EBP-4]
     2F4:       ADD   ECX, 2
     2F7:       MOV   ESP, EBP
     2F9:       MOV   EBP, EAX
     2FB:       JMP   ECX
     2FD:       NOP
     2FE:       NOP
     2FF:       NOP
     300: L0:   BREAK 10                     ; Error trap
     302:       BYTE  #x02
     303:       BYTE  #x19                   ; INVALID-ARGUMENT-COUNT-ERROR
     304:       BYTE  #x4D                   ; ECX

If you measure its performance:

(time (dotimes (i 100000000) (square '1234)))

it takes 1.80 sec. Now if you set some optimization parameters and declare argument type:

(defun square2 (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (unsigned-byte 32) x))
  (* x x))

you would get these hints from the compiler:

Note: Unable to recode as shift and add due to type uncertainty:
      The result is a (MOD 18446744065119617026), not a (UNSIGNED-BYTE 32).
Note: Forced to do GENERIC-* (cost 30).
      Unable to do inline fixnum arithmetic (cost 4) because:
      The first argument is a (UNSIGNED-BYTE 32), not a FIXNUM.
      The second argument is a (UNSIGNED-BYTE 32), not a FIXNUM.
      The result is a (MOD 18446744065119617026), not a FIXNUM.
      Unable to do inline (signed-byte 32) arithmetic (cost 5) because:
      The first argument is a (UNSIGNED-BYTE 32), not a (SIGNED-BYTE 32).
      The second argument is a (UNSIGNED-BYTE 32), not a (SIGNED-BYTE 32).
      The result is a (MOD 18446744065119617026), not a (SIGNED-BYTE 32).
      etc.

and if you take into account these hints, you could rewrite your function into, for example:

(defun square3 (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (values (unsigned-byte 32))
           (type (unsigned-byte 32) x))
  (* x x))

and you would get to 1.28 sec (as compared to 1.80 sec) and the resulting assembler code would become:

(disassemble 'square3)
4811C888:       .ENTRY SQUARE3()             ; FUNCTION
     8A0:       POP   DWORD PTR [EBP-8]
     8A3:       LEA   ESP, [EBP-32]
     8A6:       MOV   EAX, EDX
     8A8:       TEST  AL, 3
     8AA:       JEQ   L0
     8AC:       MOV   EAX, [EAX-3]
     8AF:       JMP   L1
     8B1: L0:   SAR   EAX, 2
     8B4: L1:   MOV   ECX, EAX
     8B6:       MOV   EAX, ECX               ; No-arg-parsing entry point
     8B8:       MUL   EAX, ECX
     8BA:       MOV   ECX, EAX
     8BC:       TEST  ECX, 3758096384
     8C2:       JNE   L3
     8C4:       LEA   EDX, [ECX*4]
     8CB: L2:   MOV   ECX, [EBP-8]
     8CE:       MOV   EAX, [EBP-4]
     8D1:       ADD   ECX, 2
     8D4:       MOV   ESP, EBP
     8D6:       MOV   EBP, EAX
     8D8:       JMP   ECX
     8DA:       NOP
     8DB:       NOP
     8DC:       NOP
     8DD:       NOP
     8DE:       NOP
     8DF:       NOP
     8E0: L3:   JNS   L4
     8E2:       MOV   EDX, 522
     8E7:       JMP   L5
     8E9: L4:   MOV   EDX, 266
     8EE: L5:   MOV   BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
     8F5:       MOV   BYTE PTR [#x280001BC], 4 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
     8FC:       MOV   EAX, 16
     901:       ADD   EAX, [#x806A404]       ; current_region_free_pointer
     907:       CMP   EAX, [#x806A3D8]       ; current_region_end_addr
     90D:       JBE   L6
     90F:       CALL  #x8053358              ; alloc_overflow_eax
     914: L6:   XCHG  EAX, [#x806A404]       ; current_region_free_pointer
     91A:       MOV   [EAX], EDX
     91C:       LEA   EDX, [EAX+7]
     91F:       MOV   [EDX-3], ECX
     922:       MOV   BYTE PTR [#x280001BC], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
     929:       CMP   BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
     930:       JEQ   L7
     932:       BREAK 9                      ; Pending interrupt trap
     934: L7:   JMP   L2

that you could further optimize by assmebler inlining or whatever, if you wish. (Of course, nothing like this is possible in Python, but it's very well possible with OCaml/C/C++.)

If one prototypes in Lisp, then one does not have to worry about the speed as one may worry with Python.