Common Lisp has good native-code compilers, so it typically runs ~10 times faster than Python; see e.g. Peter Norvig's Python for Lisp Programmers. It is dynamically typed, though, so it runs somewhat slower (say 50%-100% slower) than statically typed OCaml/C++. Can one optimise things further?
For example, consider a simple square function:
(defun square (x)
  (* x x))
you can easily see the assembler code for it:
(compile 'square)
(disassemble 'square)
to find that the relatively expensive GENERIC-* routine is called:
480B22B8: .ENTRY SQUARE(x) ; (FUNCTION (T) NUMBER)
2D0: POP DWORD PTR [EBP-8]
2D3: LEA ESP, [EBP-32]
2D6: CMP ECX, 4
2D9: JNE L0
2DB: MOV ESI, EDX
2DD: MOV [EBP-12], ESI ; No-arg-parsing entry point
2E0: MOV EDX, ESI
2E2: MOV EDI, ESI
2E4: CALL #x100002C8 ; GENERIC-*
2E9: MOV ESP, EBX
2EB: MOV ESI, [EBP-12]
2EE: MOV ECX, [EBP-8]
2F1: MOV EAX, [EBP-4]
2F4: ADD ECX, 2
2F7: MOV ESP, EBP
2F9: MOV EBP, EAX
2FB: JMP ECX
2FD: NOP
2FE: NOP
2FF: NOP
300: L0: BREAK 10 ; Error trap
302: BYTE #x02
303: BYTE #x19 ; INVALID-ARGUMENT-COUNT-ERROR
304: BYTE #x4D ; ECX
If you measure its performance:
(time (dotimes (i 100000000) (square '1234)))
it takes 1.80 sec. Now, if you set some optimization parameters and declare the argument type:
(defun square2 (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (unsigned-byte 32) x))
  (* x x))
you would get these hints from the compiler:
Note: Unable to recode as shift and add due to type uncertainty:
The result is a (MOD 18446744065119617026), not a (UNSIGNED-BYTE 32).
Note: Forced to do GENERIC-* (cost 30).
Unable to do inline fixnum arithmetic (cost 4) because:
The first argument is a (UNSIGNED-BYTE 32), not a FIXNUM.
The second argument is a (UNSIGNED-BYTE 32), not a FIXNUM.
The result is a (MOD 18446744065119617026), not a FIXNUM.
Unable to do inline (signed-byte 32) arithmetic (cost 5) because:
The first argument is a (UNSIGNED-BYTE 32), not a (SIGNED-BYTE 32).
The second argument is a (UNSIGNED-BYTE 32), not a (SIGNED-BYTE 32).
The result is a (MOD 18446744065119617026), not a (SIGNED-BYTE 32).
etc.
and if you take these hints into account, you could rewrite your function as, for example:
(defun square3 (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (values (unsigned-byte 32))
           (type (unsigned-byte 32) x))
  (* x x))
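Timing it with the same loop as before (just with square3 substituted in):
(time (dotimes (i 100000000) (square3 '1234)))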
you would get 1.28 sec (compared to 1.80 sec), and the resulting assembler code becomes:
(disassemble 'square3)
4811C888: .ENTRY SQUARE3() ; FUNCTION
8A0: POP DWORD PTR [EBP-8]
8A3: LEA ESP, [EBP-32]
8A6: MOV EAX, EDX
8A8: TEST AL, 3
8AA: JEQ L0
8AC: MOV EAX, [EAX-3]
8AF: JMP L1
8B1: L0: SAR EAX, 2
8B4: L1: MOV ECX, EAX
8B6: MOV EAX, ECX ; No-arg-parsing entry point
8B8: MUL EAX, ECX
8BA: MOV ECX, EAX
8BC: TEST ECX, 3758096384
8C2: JNE L3
8C4: LEA EDX, [ECX*4]
8CB: L2: MOV ECX, [EBP-8]
8CE: MOV EAX, [EBP-4]
8D1: ADD ECX, 2
8D4: MOV ESP, EBP
8D6: MOV EBP, EAX
8D8: JMP ECX
8DA: NOP
8DB: NOP
8DC: NOP
8DD: NOP
8DE: NOP
8DF: NOP
8E0: L3: JNS L4
8E2: MOV EDX, 522
8E7: JMP L5
8E9: L4: MOV EDX, 266
8EE: L5: MOV BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
8F5: MOV BYTE PTR [#x280001BC], 4 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
8FC: MOV EAX, 16
901: ADD EAX, [#x806A404] ; current_region_free_pointer
907: CMP EAX, [#x806A3D8] ; current_region_end_addr
90D: JBE L6
90F: CALL #x8053358 ; alloc_overflow_eax
914: L6: XCHG EAX, [#x806A404] ; current_region_free_pointer
91A: MOV [EAX], EDX
91C: LEA EDX, [EAX+7]
91F: MOV [EDX-3], ECX
922: MOV BYTE PTR [#x280001BC], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
929: CMP BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
930: JEQ L7
932: BREAK 9 ; Pending interrupt trap
934: L7: JMP L2
which you could further optimize by assembler inlining or whatever else, if you wish. (Of course, nothing like this is possible in Python, but it is very much possible with OCaml/C/C++.)
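As one more step in that direction, here is a hedged sketch (square4 is a hypothetical variant, not from the measurements above, and the exact fixnum width is implementation-dependent, roughly 30 bits on 32-bit CMUCL): if you can promise that the argument is narrow enough for its square to always fit in a fixnum, the compiler can use the inline fixnum multiply priced at cost 4 in the notes above, and the bignum-allocating path at the end of the listing should disappear.
;; Hypothetical variant, assuming ~30-bit fixnums (as on 32-bit CMUCL):
;; a 14-bit argument squares to less than 2^28, so the result is
;; always a fixnum and no bignum allocation is needed.
(defun square4 (x)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (unsigned-byte 14) x))
  (the fixnum (* x x)))
The trade-off is that the declaration is a promise: with safety 0, passing a larger argument is undefined behaviour.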
So, if one prototypes in Lisp, one does not have to worry about speed the way one might with Python.