Francesco Montanari

So far, two performance bottlenecks have made me wary of using the regular Racket BC (before Chez Scheme) variant in scientific computing applications. One is that, while Racket generally performs within a factor of a few (≲10) of highly efficient compilers like Chez Scheme, programs that capture and invoke continuations can be a factor 100 slower. The other is that the performance of the built-in futures and places for parallel computing is subtle. The Racket guide (archive link corresponding to the version I checked) shows that a parallel Mandelbrot set computation based on futures has no advantage over a serial one unless explicit flonum operations are used to define the Mandelbrot procedure. Furthermore, the pmap package for high-level parallel forms shows drastic performance differences depending on whether futures or places are used (archive link).

The Racket CS (built on Chez Scheme) variant seems to improve the situation on both fronts. Benchmarks in the Racket-on-Chez Status report show that there is no longer a performance bottleneck when dealing with continuations: runtime speed is comparable to Chez Scheme. Let's then also repeat the Mandelbrot set computation from the Racket guide, this time running Racket CS.

#lang racket

(define (mandelbrot iterations x y n)
  (let ([ci (- (/ (* 2.0 y) n) 1.0)]
        [cr (- (/ (* 2.0 x) n) 1.5)])
    (let loop ([i 0] [zr 0.0] [zi 0.0])
      (if (> i iterations)
          i
          (let ([zrq (* zr zr)]
                [ziq (* zi zi)])
            (cond
              [(> (+ zrq ziq) 4) i]
              [else (loop (add1 i)
                          (+ (- zrq ziq) cr)
                          (+ (* 2 zr zi) ci))]))))))

;;; Test performance of Racket futures.

(time (list (mandelbrot 10000000 62 500 1000)
            (mandelbrot 10000000 62 501 1000)))
; cpu time: 1194 real time: 1200 gc time: 59

(time (let ([f (future (lambda () (mandelbrot 10000000 62 501 1000)))])
        (list (mandelbrot 10000000 62 500 1000)
              (touch f))))
; cpu time: 1070 real time: 551 gc time: 15

We evaluated the Mandelbrot procedure twice with large arguments. The parallel version, using the futures library, takes about half the time of the serial version (551 ms versus 1200 ms of real time). Hence, with Racket CS we do get a performance improvement even without using explicit flonum operators in the definition of the Mandelbrot procedure. (Measurements were performed on an Intel Core i7-8550U CPU @ 1.80GHz.)
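The two-task pattern above extends naturally to a list of tasks. The following is a minimal sketch that assumes nothing beyond `future` and `touch` from `racket/future`; the helper name `future-map` is hypothetical and is not part of any library, and a packaged parallel map may well be implemented differently.

```racket
#lang racket
(require racket/future)

;; future-map: hypothetical helper, not a library procedure.
;; Start one future per list element, then collect the results in order.
(define (future-map f lst)
  (define futures
    (for/list ([x (in-list lst)])
      (future (lambda () (f x)))))
  ;; touch blocks until the corresponding result is available.
  (map touch futures))

(future-map (lambda (x) (* x x)) '(1 2 3 4))
; → '(1 4 9 16)
```

All futures are created before the first `touch`, so they can run concurrently when enough cores are available and the body does not block on operations that futures cannot run in parallel.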

Let's also consider the pmap package, which provides the parallel map procedures pmapf (based on futures) and pmapp (based on places).

(require pmap)

;;; Define a quoted mandelbrot definition to use with pmapp. This ugly trick
;;; is not required for pmapf.
(define quoted-mandelbrot
  '(define (mandelbrot iterations x y n)
     (let ([ci (- (/ (* 2.0 y) n) 1.0)]
           [cr (- (/ (* 2.0 x) n) 1.5)])
       (let loop ([i 0] [zr 0.0] [zi 0.0])
         (if (> i iterations)
             i
             (let ([zrq (* zr zr)]
                   [ziq (* zi zi)])
               (cond
                 [(> (+ zrq ziq) 4) i]
                 [else (loop (add1 i)
                             (+ (- zrq ziq) cr)
                             (+ (* 2 zr zi) ci))])))))))

;;; (mandelbrot 10000000 62 500 1000), four calculations

(define func
  (lambda (x)
    (mandelbrot 10000000 62 500 1000)))

(time (map func '(0 0 0 0)))
; cpu time: 1965 real time: 1965 gc time: 45

(time (pmapf func '(0 0 0 0)))
; cpu time: 2576 real time: 684 gc time: 20

(define funcp
  `(lambda (x)
     ,quoted-mandelbrot
     (mandelbrot 10000000 62 500 1000)))

(time (pmapp funcp '(0 0 0 0)))
; cpu time: 10379 real time: 4110 gc time: 1513

pmapf is about 3 times faster than the serial computation (evaluating the Mandelbrot procedure 4 times). The improvement is good and competitive with what can usually be achieved with parallel map procedures in other languages (e.g., Clojure, Common Lisp, but also Python). Also, according to the pmap documentation, the improvement for Racket BC is only a factor 1.5. However, here pmapp is about 2 times slower than the serial computation, while for Racket BC it is 7 times faster. Following the pmap documentation, let's also repeat the computation using a smaller argument.

;;; (mandelbrot 100000 62 500 1000), four calculations

(time (map (lambda (x) (mandelbrot 100000 62 500 1000)) '(0 0 0 0)))
; cpu time: 30 real time: 30 gc time: 8

(time (pmapf (lambda (x) (mandelbrot 100000 62 500 1000)) '(0 0 0 0)))
; cpu time: 54 real time: 28 gc time: 15

;;; A distinct name avoids defining funcp twice in the same module.
(define funcp-small
  `(lambda (x)
     ,quoted-mandelbrot
     (mandelbrot 100000 62 500 1000)))

(time (pmapp funcp-small '(0 0 0 0)))
; cpu time: 7581 real time: 3432 gc time: 1531

pmapf now has only a marginal improvement over the serial computation, i.e., overheads are comparable to the computational cost of the Mandelbrot procedure given the smaller argument. This is expected to happen eventually, although in the case of Racket BC there is still an improvement of a factor 1.5. pmapp is now about 100 times slower than the serial version; for Racket BC, pmapp takes 10 times longer than the serial version in this case.

Hence, in this particular benchmark the performance of pmapp (parallel map based on places) is considerably worse for Racket CS than for Racket BC. However, the pmapp user interface is far from optimal in any case. The performance of pmapf (based on futures) is considerably better in Racket CS than in Racket BC (for heavy computations) and, importantly, it does not depend on subtleties like operations mixing floating-point and integer numbers.
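The factors quoted in this comparison are simply ratios of the real times printed by `time`. As a small sketch (the helper name `speedup` is made up for illustration), they can be recomputed from the Racket CS numbers reported above:

```racket
#lang racket

;; speedup: hypothetical helper; ratio of serial to parallel real time (ms).
;; Values above 1 mean the parallel run was faster than the serial one.
(define (speedup serial-ms parallel-ms)
  (exact->inexact (/ serial-ms parallel-ms)))

;; Real times taken from the measurements above (Racket CS).
(speedup 1965 684)   ; pmapf, large argument
(speedup 1965 4110)  ; pmapp, large argument (below 1: slower than serial)
(speedup 30 28)      ; pmapf, small argument
(speedup 30 3432)    ; pmapp, small argument (below 1: slower than serial)
```

The first ratio is just under 3, and the last one corresponds to a slowdown of roughly two orders of magnitude, consistent with the rounded figures in the text.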

To conclude, note that despite the improved futures implementation, a larger performance gain (about 7x) is obtained by defining the Mandelbrot procedure in terms of explicit flonum operations. Hence, while Racket CS can profit from futures in both Mandelbrot versions (contrary to Racket BC), mixing floating-point and integer numbers is still a performance bottleneck (as in Racket BC).
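For reference, a flonum-specific variant can be sketched as follows, along the lines of the version shown in the Racket guide. It assumes only the `racket/flonum` library; the name `mandelbrot/fl` is chosen here for illustration, and this is not necessarily the exact code behind the 7x figure.

```racket
#lang racket
(require racket/flonum)

;; mandelbrot/fl: illustrative name. Same algorithm as mandelbrot, but every
;; floating-point operation uses a flonum-specific operator, so no generic
;; arithmetic on mixed exact/inexact numbers remains in the inner loop.
(define (mandelbrot/fl iterations x y n)
  (let ([ci (fl- (fl/ (fl* 2.0 (->fl y)) (->fl n)) 1.0)]
        [cr (fl- (fl/ (fl* 2.0 (->fl x)) (->fl n)) 1.5)])
    (let loop ([i 0] [zr 0.0] [zi 0.0])
      (if (> i iterations)
          i
          (let ([zrq (fl* zr zr)]
                [ziq (fl* zi zi)])
            (cond
              [(fl> (fl+ zrq ziq) 4.0) i]
              [else (loop (add1 i)
                          (fl+ (fl- zrq ziq) cr)
                          (fl+ (fl* 2.0 (fl* zr zi)) ci))]))))))

(mandelbrot/fl 1000 62 500 1000)
```

The loop counter `i` stays an exact integer and is still compared with the generic `>`; only the floating-point arithmetic is specialized, including the literals `4.0` and `2.0` that replace the exact `4` and `2` of the generic version.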

Parallel computing with Racket on Chez Scheme