2013-04-22
String vs. ByteString/Text
[a] vs. Vector
Size requirements of common data structures
(ignoring shared instances; sizes in Words and Bytes):
| Type | Size (to be rounded up to full Ws) |
|---|---|
| (ta,tb) | 3 W + size(ta) + size(tb) |
| [t] | 3N W + N size(t) |
| ByteString (strict) | 9 W + N B |
| Text (strict) | 6 W + 2N B |
| Map tk tv | 6N W + N (size(tk) + size(tv)) |
| Set t | 5N W + N size(t) |
| IntMap t | (3N + 5(N-1)) W + N size(t) |
| IntSet | (2N + 5(N-1)) W |
| HashMap tk tv | 4.5N W + N (size(tk) + size(tv)) |
| HashSet t | 4.5N W + N size(t) |
| Vector t (boxed) | (6 + N) W + N size(t) |
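The table's formulas are easy to turn into a small estimator. The following sketch (function names are illustrative, not from the slides) assumes a 64-bit GHC where one word is 8 bytes, no sharing, and size(Int) = 2 W (header + payload):

```haskell
-- Sketch: heap-size estimates (in bytes) from the table above,
-- assuming a 64-bit platform (1 word = 8 bytes) and no sharing.
wordSize :: Int
wordSize = 8

-- [Int]: 3N words of cons cells plus 2N words for the boxed Ints
listIntBytes :: Int -> Int
listIntBytes n = (3 * n + 2 * n) * wordSize

-- strict ByteString: 9 words of overhead plus N bytes of payload
byteStringBytes :: Int -> Int
byteStringBytes n = 9 * wordSize + n

-- strict Text: 6 words of overhead plus 2 bytes per UTF-16 code unit
textBytes :: Int -> Int
textBytes n = 6 * wordSize + 2 * n

main :: IO ()
main = mapM_ print
  [ listIntBytes 1000000      -- a 10^6-element [Int] costs ~40 MB
  , byteStringBytes 1000000   -- the same payload as a ByteString: ~1 MB
  , textBytes 1000000
  ]
```

The 40x gap between `[Int]` and `ByteString` for the same number of elements is the whole argument for the packed representations.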
GHC's frontend AST HsSyn is parameterized over the name type: HsSyn RdrName, HsSyn Name, HsSyn Id.

- HsSyn Id is desugared into the much simpler, typed CoreExpr representation (i.e. GHC Core).
- CoreExpr is transformed into the STG representation for code generation, and after that fed as "C--" into the selected code-generator backend.
- The intermediate representations can be inspected via the -ddump-… flags.
- Core is similar to the Haskell kernel sub-language shown in the 1st lecture.
- Formal foundation: "System F with Type Equality Coercions".

GHC's representation of CoreExpr:
type CoreExpr = Expr Var
data Expr b = Var Id
| Lit Literal
| App (Expr b) (Arg b)
| Lam b (Expr b)
| Let (Bind b) (Expr b)
| Case (Expr b) b Type [Alt b]
| Cast (Expr b) Coercion
| Tick (Tickish Id) (Expr b)
| Type Type
| Coercion Coercion
deriving (Data, Typeable)
For debugging purposes, there's a more convenient "External Representation for the GHC Core Language" documented in the GHC User's Guide.
Core (after all simplifier passes) can be easily extracted via
ghc -fforce-recomp -ddump-simpl -dsuppress-uniques \
-dsuppress-coercions -dsuppress-type-applications \
-O2 -c Foo.hs
(alternatively, -dsuppress-all removes most annotations)
Sometimes more convenient, and with syntax coloring: the ghc-core tool (install with cabal install ghc-core):

ghc-core --no-asm -- Foo.hs
The following -dsuppress-* options exist as of GHC 7.6.3:

- all: in Core dumps, suppress everything that is suppressable (except for uniques)
- uniques: suppress the printing of uniques in debug output (easier to use diff)
- idinfo: suppress extended information about identifiers where they are bound
- module-prefixes: suppress the printing of module qualification prefixes
- signatures: suppress type signatures
- applications: suppress type applications (i.e. all @-terms)
- coercions: suppress the printing of coercions in Core dumps to make them shorter

Fun fact: -ddump-simpl works in GHCi too!
The following simple Haskell example
{-# LANGUAGE BangPatterns #-}
mkPair a = (a,a)
mkPair' !a = (a,a) -- same as: mkPair' a = seq a (a,a)
gets compiled to Core (-ddump-simpl -dsuppress-all):
mkPair = \ @ t_aeT a_aeH -> (a_aeH, a_aeH)
mkPair' = \ @ t_aeM a_aeI ->
case a_aeI of a1_XeL { __DEFAULT -> (a1_XeL, a1_XeL) }
- @t_aeT and @t_aeM denote type-arguments.
- The case-expression has an additional binder a1_XeL.
- Even though the case-expression only matches the wildcard pattern __DEFAULT, the scrutinee a_aeI is forced to WHNF!

The case expression in Core
case a of a1 { __DEFAULT -> (a1,a1); pat1 -> exp1; pat2 -> exp2 }
can be thought of in terms of H98 as
case a of
a1 | seq a1 False -> undefined
a1@pat1 -> exp1
a1@pat2 -> exp2
a1@_ -> (a1,a1)
- __DEFAULT comes first but is matched last.
- Pattern failures are explicit in Core (and are attached to __DEFAULT).
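The forcing behaviour of the Core case binder can be observed directly. This sketch (the main wrapper is mine, not from the slides) shows that the lazy mkPair happily wraps ⊥, while the banged mkPair' forces its argument:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Exception (SomeException, evaluate, try)

mkPair :: a -> (a, a)
mkPair a = (a, a)    -- lazy: never inspects a

mkPair' :: a -> (a, a)
mkPair' !a = (a, a)  -- the bang forces a to WHNF first

main :: IO ()
main = do
  -- lazy version: building the pair is fine even for undefined
  _ <- evaluate (mkPair (undefined :: Int))
  putStrLn "mkPair: ok"
  -- strict version: the Core case forces the argument, which throws
  r <- try (evaluate (mkPair' (undefined :: Int)))
  case (r :: Either SomeException (Int, Int)) of
    Left _  -> putStrLn "mkPair': exception"
    Right _ -> putStrLn "mkPair': ok"
```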
Let's try again with a non-polymorphic version:
{-# LANGUAGE BangPatterns #-}
mkPair, mkPair' :: Int -> (Int,Int)
mkPair a = (a,a)
mkPair' !a = (a,a)
Core now reads:
mkPair = \ a_aeI -> (a_aeI, a_aeI)
mkPair' = \ a_aeJ ->
case a_aeJ of a1_XeM { I# ipv_seR -> (a1_XeM, a1_XeM) }
- The @-arguments have vanished.
- __DEFAULT has been replaced by an I# _ pattern.

The simple class & instance definition:
class Monoid a where
(<>) :: a -> a -> a
ie :: a
instance Monoid () where
() <> () = ()
ie = ()
...results in the following Core definitions:
-- created by class definition:
<> = \ @a (d :: Monoid a) -> case d of _ { D:Monoid m1 _ -> m1 }
ie = \ @a (d :: Monoid a) -> case d of _ { D:Monoid _ m2 -> m2 }
-- created by instance definition:
$fMonoid()_$c<> = \ (v1 :: ()) (v2 :: ()) ->
case v1 of _ { () -> v2 }
$fMonoid() = D:Monoid @() $fMonoid()_$c<> ()
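The Core above shows that a class is compiled to a data type of methods and an instance to a value of that type. This can be modelled in plain Haskell with an explicit record (MonoidDict, dAppend, dIe and mconcatD are made-up names for illustration):

```haskell
-- Sketch: a class dictionary is just a record of methods.
data MonoidDict a = MonoidDict
  { dAppend :: a -> a -> a  -- corresponds to (<>); cf. the m1 selector
  , dIe     :: a            -- corresponds to ie;  cf. the m2 selector
  }

-- the "instance" is a concrete record value (cf. $fMonoid())
unitDict :: MonoidDict ()
unitDict = MonoidDict (\_ _ -> ()) ()

-- a Monoid-constrained function receives the dictionary explicitly
mconcatD :: MonoidDict a -> [a] -> a
mconcatD d = foldr (dAppend d) (dIe d)

main :: IO ()
main = print (mconcatD unitDict [(), (), ()])
```

Class methods like `<>` are then just field selectors applied to the dictionary argument, exactly as in the generated Core.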
Data constructors are not defined explicitly in Core unless strict fields occur
(in which case a constructor-wrapper is defined):
data Foo a = Foo a
data Bar a = Bar !a
data Doo = Doo { unDoo :: Int }
mkFoo = Foo
mkBar = Bar
results in
$WBar :: forall a. a -> Bar a
$WBar = \ @a (v :: a) -> case v of v2 { __DEFAULT -> Bar @a v2 }
unDoo = \ (d :: Doo) -> case d of _ { Doo d -> d }
mkBar = $WBar
mkFoo = Foo
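The wrapper's forcing behaviour is observable: constructing Bar with ⊥ throws as soon as the constructor application is evaluated, while Foo keeps the field as a thunk. A sketch (the main wrapper is mine):

```haskell
import Control.Exception (SomeException, evaluate, try)

data Foo a = Foo a   -- lazy field: the constructor is used directly
data Bar a = Bar !a  -- strict field: GHC generates the $WBar wrapper

main :: IO ()
main = do
  -- lazy field: the thunk is stored untouched
  _ <- evaluate (Foo (undefined :: Int))
  putStrLn "Foo: ok"
  -- strict field: $WBar's case forces the field, which throws
  r <- try (evaluate (Bar (undefined :: Int)))
  case (r :: Either SomeException (Bar Int)) of
    Left _  -> putStrLn "Bar: exception"
    Right _ -> putStrLn "Bar: ok"
```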
Newtypes result in coercions in Core:
newtype Foo = Foo Int
incFoo (Foo i) = Foo (i+1)
results in
incFoo :: Foo -> Foo
incFoo = a `cast` (<Foo> -> Sym <(NTCo:Foo)>
:: (Foo -> Int) ~# (Foo -> Foo))
a :: Foo -> Int
a = \ ds -> + @ Int $fNumInt (ds `cast`
(<NTCo:Foo> :: Foo ~# Int)) (I# 1)
Use -dsuppress-coercions to reduce the clutter caused by cast-terms.

Now let's try a more complex function:
sum :: [Int] -> Int
sum (x:xs) = x + sum xs
sum [] = 0
compiled to Core w/ -O0
(and keeping type-annotations):
sum = \ (ds :: [Int]) ->
case ds of _ {
[] -> I# 0;
: x xs -> + @Int $fNumInt x (sum xs)
}
- : and + are in prefix position.
- $fNumInt is the Num typeclass dictionary for Int, and + is its field-accessor.

When using -O2, GHC produces different Core:
sum = \ (w :: [Int]) ->
case $wsum w of ww { __DEFAULT -> I# ww }
$wsum = \ (w :: [Int]) ->
case w of _ {
[] -> 0;
: x xs -> case x of _ {
I# x1 -> case $wsum xs of ww {
__DEFAULT -> +# x1 ww
}
}
}
- The worker $wsum returns an unboxed value ww.
- (+#) :: Int# -> Int# -> Int# is primitive addition on unboxed integers.
- No boxed Int needs to be allocated per recursion step, since $wsum returns an unboxed Int# instead of a boxed Int (think of the heap!).

The previous definition of sum was equivalent to foldr (+) 0 and is not tail-recursive, thus causing a stack overflow.
Let's try again with a foldl-style accumulator pattern:
sum :: [Int] -> Int
sum = go 0
where
go s (x:xs) = go (s+x) xs
go s [] = s
...and the resulting -O0 Core:
go = \ s ds -> case ds of _ {
[] -> s;
: x xs -> go (+ $fNumInt s x) xs
}
sum = go (I# 0)
The -O0 Core above accumulates unevaluated (+) thunks in s, i.e. a thunk leak.
Compiling with -O1 eliminates the thunk leak magically:
$wgo :: Int# -> [Int] -> Int#
$wgo = \ ww w -> case w of _ {
[] -> ww;
: x xs -> case x of _ {
I# y -> $wgo (+# ww y) xs
}
}
sum = \ w -> case $wgo 0 w of ww { __DEFAULT -> I# ww }
However, if we switch to a polymorphic sum :: Num a => [a] -> a, the leak is back even with -O2:
sum = \ @a ($dNum :: Num a) ->
let { lvl = + @a $dNum } in
letrec { go = \ (s :: a) (ds :: [a]) ->
case ds of _ {
[] -> s;
: x xs -> go (lvl s x) xs
};
} in go (fromInteger @a $dNum (__integer 0))
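A bang pattern on the accumulator avoids relying on the strictness analyser altogether, so the polymorphic version stays leak-free at any optimization level. A sketch (sumStrict is an illustrative name):

```haskell
{-# LANGUAGE BangPatterns #-}
-- Sketch: forcing the accumulator ourselves instead of relying on
-- GHC's strictness analyser; works even at -O0 and for Num-polymorphic sums.
sumStrict :: Num a => [a] -> a
sumStrict = go 0
  where
    go !s (x:xs) = go (s + x) xs  -- s is in WHNF before each recursive call
    go !s []     = s

main :: IO ()
main = print (sumStrict [1 .. 10000000 :: Int])
```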
Let's look at another example:
count :: [Int] -> (Int,Int) -> (Int,Int)
count [] np = np
count (x:xs) (n,p) = count xs (if x<0 then (n+1,p) else (n,p+1))
$wcount :: [Int] -> Int -> Int -> (# Int, Int #)
$wcount = \ w n p ->
case w of _ {
[] -> (# n, p #);
: x xs -> case x of _ {
I# x1 -> case <# x1 0 of _ {
False -> $wcount xs n (case p of _ { I# x2 -> I# (+# x2 1) }) ;
True -> $wcount xs (case n of _ { I# x2 -> I# (+# x2 1) }) p
}
}
}
count = \ w w1 ->
case w1 of _ {
(n, p) -> case $wcount w n p of _ {
(# n', p' #) -> (n', p')
}
}
The (# Int, Int #) in the Core output represents GHC's unboxed tuples:

- Their primary use is for functions that return multiple values.
- They cannot be bound to a variable, and thus need to be deconstructed immediately.
- They avoid the heap allocation normally associated with ordinary tuples.
- However, (# Int, Int #) is not the same as (# Int#, Int# #).
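Unboxed tuples can also be used directly with the UnboxedTuples extension. A minimal sketch (minMax is a made-up example function):

```haskell
{-# LANGUAGE UnboxedTuples #-}
-- Sketch: returning two results without allocating a heap pair.
minMax :: Int -> Int -> (# Int, Int #)
minMax a b = if a <= b then (# a, b #) else (# b, a #)

main :: IO ()
main =
  -- an unboxed tuple cannot be bound to a variable;
  -- it must be deconstructed immediately with case
  case minMax 5 3 of
    (# lo, hi #) -> print (lo, hi)
```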
In summary:

- CoreExpr consists of variables, literals, applications, lambdas, let, and case.
- @ is used to denote explicit type-arguments/applications in Core.
- The case d of b { __DEFAULT -> e } expression is different from H98's.
- let(rec) bindings perform heap allocations.
- case expressions perform/force evaluations.
- Unboxed types (such as Int#) are implicitly strict.
- Unboxed tuples (# … #) eliminate heap allocation for the tuple-constructor.
- Newtype wrapping/unwrapping shows up as cast expressions.

The next intermediate representation, STG, can be dumped via -ddump-stg.
It is similar to Core but with more explicit allocation semantics:

- let is the only place where heap allocations occur.
- This is useful for seeing more clearly where thunks are created.
Let's evaluate the expression
main = evaluate $ count [1..12345678] (0,0)
with +RTS -s -hT -i0.01.
This generates summary GC statistics:
1,283,991,912 bytes allocated in the heap
3,062,980,520 bytes copied during GC
261,147,880 bytes maximum residency (26 sample(s))
4,701,992 bytes maximum slop
515 MB total memory in use (0 MB lost due to fragmentation)
...
Productivity 4.0% of total user, 3.8% of total elapsed
...so something's definitely wrong here!
The -hT -i0.01 RTS options cause a heap residency profile to be written to the file <program>.hp.
The .hp file contains heap samplings which look like:
BEGIN_SAMPLE 0.02
base:GHC.Conc.Sync.ThreadId 16
FUN_1_0 32
base:Data.Dynamic.Dynamic 24
ghc-prim:GHC.Tuple.(,) 24
base:Data.Maybe.Just 16
MUT_ARR_PTRS_CLEAN 552
base:GHC.Arr.STArray 40
FUN_0_1 16
MVAR_CLEAN 32
STACK 912
base:GHC.MVar.MVar 16
WEAK 48
TSO 112
BLACKHOLE 32
THUNK_1_0 64305024
END_SAMPLE 0.02
.hp files can be plotted by e.g. hp2ps, hp2pretty, or hp2html.
This looks like a thunk leak.
Let's try adding strictness to force evaluation:
count :: [Int] -> (Int,Int) -> (Int,Int)
count [] !np = np
count (x:xs) !(n,p) = count xs (if x<0 then (n+1,p) else (n,p+1))
...but this doesn't seem to make any difference at all!
The bang only forces the pair constructor itself: forcing a value like (⊥,⊥) to WHNF is a no-op, so the thunks inside the pair survive.
Let's try again, adding strictness to force evaluation:
count :: [Int] -> (Int,Int) -> (Int,Int)
count [] np = np
count (x:xs) (!n,!p) = count xs (if x<0 then (n+1,p) else (n,p+1))
Core with -O2:
count_$s$wcount :: [Int] -> Int# -> Int# -> (# Int, Int #)
count_$s$wcount = \ w n p ->
case w of _ {
[] -> (# I# n, I# p #);
: x xs -> case x of _ {
I# x1 -> case <# x1 0 of _ {
False -> count_$s$wcount xs n (+# p 1) ;
True -> count_$s$wcount xs (+# n 1) p
}
}
}
…
Now, the curried n and p function arguments are unboxed Int#s.
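For reference, here is the fixed version as a complete runnable program (the sample list is made up for illustration; n counts negatives, p counts non-negatives):

```haskell
{-# LANGUAGE BangPatterns #-}
-- the bang-pattern version from above, runnable end-to-end
count :: [Int] -> (Int, Int) -> (Int, Int)
count []     np      = np
count (x:xs) (!n,!p) = count xs (if x < 0 then (n+1, p) else (n, p+1))

main :: IO ()
main = print (count [-3, 1, 4, -1, 5] (0, 0))  -- 2 negatives, 3 non-negatives
```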
New statistics:
987,695,672 bytes allocated in the heap
84,648 bytes copied during GC
36,080 bytes maximum residency (11 sample(s))
21,264 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
…
Productivity 100.0% of total user, 92.6% of total elapsed
...versus the previous statistics:
1,283,991,912 bytes allocated in the heap
3,062,980,520 bytes copied during GC
261,147,880 bytes maximum residency (26 sample(s))
4,701,992 bytes maximum slop
515 MB total memory in use (0 MB lost due to fragmentation)
…
Productivity 4.0% of total user, 3.8% of total elapsed
From 4% to 100% productivity by adding just two !
annotations!
The first version of sum merely caused a stack overflow (not a heap leak), because it wasn't tail-recursive.
The count
example had a thunk leak (which often goes hand in hand with stack overflows), because thunks were accumulating on the heap.
And then there's the (ordinary) space leak...
Consider
let xs = [1..10000000::Int] -- that's 1e7 items
in sum xs * product xs
which results in
800,041,392 bytes allocated in the heap
2,777,788,352 bytes copied during GC
399,735,000 bytes maximum residency (19 sample(s))
68,504,360 bytes maximum slop
908 MB total memory in use (0 MB lost due to fragmentation)
…
Productivity 4.1% of total user, 3.8% of total elapsed
Remember that [Int] requires up to (3+2)N words (3 per cons cell, 2 per boxed Int).
The reason can be explained by looking at the Core output for
let xs = [1..10000000::Int] -- that's 1e7 items
in sum xs * product xs
which is
result = case $wsum' xs 0 of ww {
__DEFAULT -> case $wprod xs 1 of ww1 {
__DEFAULT -> I# (*# ww ww1)
}
}
xs = eftInt 1 10000000 -- [1..10000000]
Here xs
is allocated and kept alive until $wprod
has been evaluated.
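One common fix (not shown on the slides) is to traverse the list only once, so that consumed cons cells become garbage immediately instead of being retained for the second traversal; sumProd is an illustrative name:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Sketch: computing sum and product in a single pass, so the list
-- can be garbage-collected incrementally as it is consumed.
sumProd :: [Int] -> (Int, Int)
sumProd = go 0 1
  where
    go !s !p (x:xs) = go (s + x) (p * x) xs
    go !s !p []     = (s, p)

main :: IO ()
main = print (sumProd [1 .. 5])  -- prints (15,120)
```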
CSE can be harmful: the code below produces the same Core output!
let x = product [1..10^7::Int]
y = sum [1..10^7::Int]
in x*y
Further reading:

- "Profiling Memory Usage" chapter in the GHC User's Guide
- "Profiling and Optimization" chapter in Real World Haskell
- E.Z. Yang's "Space Leak"-tagged blog posts