Common Lisp has fancy compilers that compile to native code, so it typically runs ~10 times faster than Python; see e.g. Peter Norvig's Python for Lisp Programmers. It is a dynamically typed language, though, so it runs somewhat slower (say 50%-100%) than statically typed OCaml/C++. Can one optimise things further?
For example, consider a simple square function:
(defun square (x)
  (* x x))
you can easily see the assembler code for it:
(compile 'square)
(disassemble 'square)
to find out that relatively expensive GENERIC-* stuff is called:
480B22B8:       .ENTRY SQUARE(x)             ; (FUNCTION (T) NUMBER)
     2D0:       POP     DWORD PTR [EBP-8]
     2D3:       LEA     ESP, [EBP-32]
     2D6:       CMP     ECX, 4
     2D9:       JNE     L0
     2DB:       MOV     ESI, EDX
     2DD:       MOV     [EBP-12], ESI        ; No-arg-parsing entry point
     2E0:       MOV     EDX, ESI
     2E2:       MOV     EDI, ESI
     2E4:       CALL    #x100002C8           ; GENERIC-*
     2E9:       MOV     ESP, EBX
     2EB:       MOV     ESI, [EBP-12]
     2EE:       MOV     ECX, [EBP-8]
     2F1:       MOV     EAX, [EBP-4]
     2F4:       ADD     ECX, 2
     2F7:       MOV     ESP, EBP
     2F9:       MOV     EBP, EAX
     2FB:       JMP     ECX
     2FD:       NOP
     2FE:       NOP
     2FF:       NOP
     300: L0:   BREAK   10                   ; Error trap
     302:       BYTE    #x02
     303:       BYTE    #x19                 ; INVALID-ARGUMENT-COUNT-ERROR
     304:       BYTE    #x4D                 ; ECX
If you measure its performance:
(time (dotimes (i 100000000) (square '1234)))
it takes 1.80 sec. Now if you set some optimization parameters and declare the argument type:
(defun square2 (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (unsigned-byte 32) x))
  (* x x))
you would get these hints from the compiler:
Note: Unable to recode as shift and add due to type uncertainty:
      The result is a (MOD 18446744065119617026), not a (UNSIGNED-BYTE 32).

Note: Forced to do GENERIC-* (cost 30).
      Unable to do inline fixnum arithmetic (cost 4) because:
      The first argument is a (UNSIGNED-BYTE 32), not a FIXNUM.
      The second argument is a (UNSIGNED-BYTE 32), not a FIXNUM.
      The result is a (MOD 18446744065119617026), not a FIXNUM.
      Unable to do inline (signed-byte 32) arithmetic (cost 5) because:
      The first argument is a (UNSIGNED-BYTE 32), not a (SIGNED-BYTE 32).
      The second argument is a (UNSIGNED-BYTE 32), not a (SIGNED-BYTE 32).
      The result is a (MOD 18446744065119617026), not a (SIGNED-BYTE 32).
etc.
and if you take these hints into account, you could rewrite the function as, for example:
(defun square3 (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (values (unsigned-byte 32))
           (type (unsigned-byte 32) x))
  (* x x))
and you would get down to 1.28 sec (compared to 1.80 sec), and the resulting assembler code would become:
(disassemble 'square3)

4811C888:       .ENTRY SQUARE3()             ; FUNCTION
     8A0:       POP     DWORD PTR [EBP-8]
     8A3:       LEA     ESP, [EBP-32]
     8A6:       MOV     EAX, EDX
     8A8:       TEST    AL, 3
     8AA:       JEQ     L0
     8AC:       MOV     EAX, [EAX-3]
     8AF:       JMP     L1
     8B1: L0:   SAR     EAX, 2
     8B4: L1:   MOV     ECX, EAX
     8B6:       MOV     EAX, ECX             ; No-arg-parsing entry point
     8B8:       MUL     EAX, ECX
     8BA:       MOV     ECX, EAX
     8BC:       TEST    ECX, 3758096384
     8C2:       JNE     L3
     8C4:       LEA     EDX, [ECX*4]
     8CB: L2:   MOV     ECX, [EBP-8]
     8CE:       MOV     EAX, [EBP-4]
     8D1:       ADD     ECX, 2
     8D4:       MOV     ESP, EBP
     8D6:       MOV     EBP, EAX
     8D8:       JMP     ECX
     8DA:       NOP
     8DB:       NOP
     8DC:       NOP
     8DD:       NOP
     8DE:       NOP
     8DF:       NOP
     8E0: L3:   JNS     L4
     8E2:       MOV     EDX, 522
     8E7:       JMP     L5
     8E9: L4:   MOV     EDX, 266
     8EE: L5:   MOV     BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
     8F5:       MOV     BYTE PTR [#x280001BC], 4 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
     8FC:       MOV     EAX, 16
     901:       ADD     EAX, [#x806A404]     ; current_region_free_pointer
     907:       CMP     EAX, [#x806A3D8]     ; current_region_end_addr
     90D:       JBE     L6
     90F:       CALL    #x8053358            ; alloc_overflow_eax
     914: L6:   XCHG    EAX, [#x806A404]     ; current_region_free_pointer
     91A:       MOV     [EAX], EDX
     91C:       LEA     EDX, [EAX+7]
     91F:       MOV     [EDX-3], ECX
     922:       MOV     BYTE PTR [#x280001BC], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
     929:       CMP     BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
     930:       JEQ     L7
     932:       BREAK   9                    ; Pending interrupt trap
     934: L7:   JMP     L2
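Much of what remains in this listing is boxing of the 32-bit result (the pseudo-atomic allocation section), since a full (unsigned-byte 32) value does not fit in a fixnum here. A minimal sketch of going one step further, assuming you know the argument is small enough that the product stays in fixnum range; the name square4 and the (unsigned-byte 14) bound are my own illustrative choices, not part of the measurements above:

;; Sketch: if x fits in (unsigned-byte 14), then x*x < 2^28, which is a
;; fixnum, so the compiler can use inline fixnum multiplication and skip
;; the allocation/boxing seen in the disassembly above.
(defun square4 (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (unsigned-byte 14) x))
  (the (unsigned-byte 28) (* x x)))

With the product guaranteed to be a fixnum, the GENERIC-* note should disappear and no heap allocation is needed for the result.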
which you could optimize further with assembler inlining or the like, if you wish. (Of course, nothing like this is possible in Python, but it is very much possible with OCaml/C/C++.)
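Short of hand-written assembler, another cheap step is to let the compiler open-code the function at its call sites with the standard INLINE declamation. A sketch; sum-of-squares is just a hypothetical caller for illustration:

;; Ask the compiler to open-code SQUARE3 at call sites. Note that the
;; declamation normally has to be seen before the DEFUN of SQUARE3 is
;; compiled, so that the compiler records its definition for inlining.
(declaim (inline square3))

;; Hypothetical caller: with SQUARE3 open-coded and the argument types
;; visible, the compiler can apply the same type-driven optimizations
;; inside the caller instead of emitting an out-of-line call per square.
(defun sum-of-squares (a b)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (unsigned-byte 14) a b))
  (+ (square3 a) (square3 b)))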
If one prototypes in Lisp, one does not have to worry about speed the way one might with Python.
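For completeness, here is a minimal way to repeat the measurement for all three variants on your own machine (the iteration count is the same as above; absolute timings will of course differ between systems and implementations):

;; Make sure all three variants are compiled to native code before timing.
(mapc #'compile '(square square2 square3))

;; The same measurement as above, once per variant; only the relative
;; differences between the three timings are meaningful.
(time (dotimes (i 100000000) (square  1234)))
(time (dotimes (i 100000000) (square2 1234)))
(time (dotimes (i 100000000) (square3 1234)))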