|
| 1 | +--- |
| 2 | +title: "How to Write an Interpreter" |
| 3 | +page: true |
| 4 | +aside: true |
| 5 | +--- |
| 6 | + |
| 7 | +# How to Write an Interpreter |
| 8 | +This article explains how to write an interpreter, based on two open source projects: [kylin-go](https://github.com/zmh-program/kylin-go) and [picol](https://github.com/antirez/picol). kylin-go is written in Go; picol is written in C. |
| 9 | + |
| 10 | +I recommend reading **kylin-go** first: it is simpler, more self-explanatory, and easier to understand, especially for beginners.
| 11 | + |
| 12 | +## Overview |
| 13 | +It's not easy to write an interpreter. A mature interpreter combines many techniques, such as GC (garbage collection) and JIT (just-in-time compilation). For beginners, these techniques are not essential, so we look at the basic theory first.
| 14 | + |
| 15 | +First, you need grammar rules.
| 16 | +
| 17 | +Second, transform source code into tokens, following the grammar rules. This is what the lexer does.
| 18 | +
| 19 | +Third, transform tokens into execution sequences. This is what the parser does.
| 20 | +
| 21 | +Finally, prepare scopes and run the execution sequences. This is what the runtime does.
| 22 | +
| 23 | +OK, let's dive into these parts.
| 24 | + |
| 25 | +## Lexer |
| 26 | +The lexer transforms source code into tokens. Take this source code, for example:
| 27 | +```js |
| 28 | +let age = 0; |
| 29 | +function hello() { |
| 30 | + console.log("hi") |
| 31 | +} |
| 32 | +hello(); |
| 33 | +``` |
| 34 | + |
| 35 | +After the lexer processes it, we get tokens like:
| 36 | +```js |
| 37 | +tokens = [ |
| 38 | + { type: 'keyword', value: 'let' }, |
| 39 | + { type: 'identifier', value: 'age' }, |
| 40 | + { type: 'operator', value: '=' }, |
| 41 | + { type: 'number', value: '0' }, |
| 42 | + { type: 'semicolon', value: ';'}, |
| 43 | + { type: 'keyword', value: 'function' }, |
| 44 | + { type: 'identifier', value: 'hello' }, |
| 45 | + { type: 'left-bracket', value: '('}, |
| 46 | + { type: 'right-bracket', value: ')' }, |
| 47 | + { type: 'left-brace', value: '{' }, |
| 48 | + { type: 'identifier', value: 'console' }, |
| 49 | + { type: 'dot', value: '.' }, |
| 50 | + { type: 'identifier', value: 'log' }, |
| 51 | + { type: 'left-bracket', value: '(' },
| 52 | + { type: 'left-double-quotation', value: '"' },
| 53 | + { type: 'string', value: 'hi' },
| 54 | + { type: 'right-double-quotation', value: '"' },
| 55 | + { type: 'right-bracket', value: ')' },
| 56 | + { type: 'right-brace', value: '}' },
| 57 | + { type: 'identifier', value: 'hello' },
| 58 | + { type: 'left-bracket', value: '(' },
| 59 | + { type: 'right-bracket', value: ')' }, { type: 'semicolon', value: ';' },
| 60 | +] |
| 61 | +``` |
| 62 | + |
| 63 | +The lexer cares about individual words, not about the relationships between them. As a result, it produces tokens word by word.
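The word-by-word scan described above can be sketched as follows. This is a minimal illustration, not the lexer used by kylin-go or picol: the `tokenize` function, the `KEYWORDS` set, and the `PUNCTUATION` table are names assumed for this sketch, and a string literal is emitted as one `string` token instead of separate quotation tokens to keep the code short.

```js
// Hypothetical keyword and punctuation tables for this sketch.
const KEYWORDS = new Set(['let', 'function']);
const PUNCTUATION = new Map([
  ['=', 'operator'], [';', 'semicolon'], ['.', 'dot'],
  ['(', 'left-bracket'], [')', 'right-bracket'],
  ['{', 'left-brace'], ['}', 'right-brace'],
]);

function tokenize(source) {
  const tokens = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (/\s/.test(ch)) { i++; continue; } // skip whitespace
    if (/[A-Za-z_]/.test(ch)) {
      // Read a whole word, then decide: keyword or identifier.
      let j = i;
      while (j < source.length && /[A-Za-z0-9_]/.test(source[j])) j++;
      const word = source.slice(i, j);
      tokens.push({ type: KEYWORDS.has(word) ? 'keyword' : 'identifier', value: word });
      i = j;
      continue;
    }
    if (/[0-9]/.test(ch)) {
      // Read a run of digits as one number token (integers only here).
      let j = i;
      while (j < source.length && /[0-9]/.test(source[j])) j++;
      tokens.push({ type: 'number', value: source.slice(i, j) });
      i = j;
      continue;
    }
    if (ch === '"') {
      // Simplification: the whole literal becomes one token, no escapes.
      const end = source.indexOf('"', i + 1);
      if (end === -1) throw new SyntaxError('unterminated string');
      tokens.push({ type: 'string', value: source.slice(i + 1, end) });
      i = end + 1;
      continue;
    }
    if (PUNCTUATION.has(ch)) {
      tokens.push({ type: PUNCTUATION.get(ch), value: ch });
      i++;
      continue;
    }
    throw new SyntaxError(`unexpected character: ${ch}`);
  }
  return tokens;
}
```

Calling `tokenize('let age = 0;')` returns five tokens, one per word or symbol. Note that the function never looks at neighbouring tokens; deciding what they mean together is left to the parser.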
| 64 | + |
| 65 | +## Parser |
| 66 | +The parser transforms tokens into execution sequences.
| 67 | +
| 68 | +When the parser finds the keyword `let` followed by the operator `=`, it creates an assignment execution sequence like:
| 69 | +```js |
| 70 | +executionSequence = { |
| 71 | + type: 'assignment', |
| 72 | + variableName: 'age', |
| 73 | + value: 0 |
| 74 | +} |
| 75 | +``` |
| 76 | + |
| 77 | +When the parser finds that the next token is a semicolon, it simply skips it.
| 78 | +
| 79 | +When the parser finds that the next token is the keyword `function`, it creates a function-definition execution sequence:
| 80 | +```js |
| 81 | +executionSequence = { |
| 82 | + type: 'function-definition', |
| 83 | + functionName: 'hello', |
| 84 | + args: [], |
| 85 | + body: [ |
| 86 | + { |
| 87 | + type: 'method-call', |
| 88 | + objName: 'console',
| 89 | + methodPath: ['log'], |
| 90 | + args: [ |
| 91 | + { type: 'literal', value: 'hi'} |
| 92 | + ] |
| 93 | + } |
| 94 | + ] |
| 95 | +} |
| 96 | +``` |
| 97 | + |
| 98 | +As a result, we get these sequences: |
| 99 | +```js |
| 100 | +sequences = [ |
| 101 | + { |
| 102 | + type: 'assignment', |
| 103 | + variableName: 'age', |
| 104 | + value: 0 |
| 105 | + }, |
| 106 | + { |
| 107 | + type: 'function-definition', |
| 108 | + functionName: 'hello', |
| 109 | + args: [], |
| 110 | + body: [ |
| 111 | + { |
| 112 | + type: 'method-call', |
| 113 | + objName: 'console', |
| 114 | + methodPath: ['log'], |
| 115 | + args: [ |
| 116 | + { type: 'literal', value: 'hi'} |
| 117 | + ] |
| 118 | + } |
| 119 | + ] |
| 120 | + }, |
| 121 | + { |
| 122 | + type: 'function-call', |
| 123 | + functionName: 'hello', |
| 124 | + args: [], |
| 125 | + } |
| 126 | +] |
| 127 | +``` |
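The parsing steps above can be sketched as a small recursive-descent parser. This is an illustration under stated assumptions, not the parser of kylin-go or picol: `parse` and its helpers are hypothetical names, string literals are assumed to arrive as a single `string` token, functions take no parameters, and a call accepts at most one literal argument.

```js
function parse(tokens) {
  let pos = 0;
  // Does the current token have this type (and, optionally, this value)?
  const is = (type, value) => {
    const t = tokens[pos];
    return t !== undefined && t.type === type && (value === undefined || t.value === value);
  };
  // Consume the current token or fail with a syntax error.
  const expect = (type, value) => {
    if (!is(type, value)) throw new SyntaxError(`expected ${value ?? type} at token ${pos}`);
    return tokens[pos++];
  };
  const skipSemicolon = () => { if (is('semicolon')) pos++; };

  function parseStatement() {
    if (is('keyword', 'let')) {
      pos++;
      const variableName = expect('identifier').value;
      expect('operator', '=');
      const value = Number(expect('number').value);
      skipSemicolon();
      return { type: 'assignment', variableName, value };
    }
    if (is('keyword', 'function')) {
      pos++;
      const functionName = expect('identifier').value;
      expect('left-bracket');
      expect('right-bracket'); // no parameters in this sketch
      expect('left-brace');
      const body = [];
      while (!is('right-brace')) body.push(parseStatement());
      expect('right-brace');
      return { type: 'function-definition', functionName, args: [], body };
    }
    return parseCall();
  }

  function parseCall() {
    const path = [expect('identifier').value];
    while (is('dot')) { pos++; path.push(expect('identifier').value); }
    expect('left-bracket');
    const args = [];
    if (is('string') || is('number')) { // at most one literal argument
      const t = tokens[pos++];
      args.push({ type: 'literal', value: t.type === 'number' ? Number(t.value) : t.value });
    }
    expect('right-bracket');
    skipSemicolon();
    if (path.length === 1) return { type: 'function-call', functionName: path[0], args };
    return { type: 'method-call', objName: path[0], methodPath: path.slice(1), args };
  }

  const sequences = [];
  while (pos < tokens.length) sequences.push(parseStatement());
  return sequences;
}

// Tokens for `hello();`, written by hand to keep the example self-contained.
const sequences = parse([
  { type: 'identifier', value: 'hello' },
  { type: 'left-bracket', value: '(' },
  { type: 'right-bracket', value: ')' },
  { type: 'semicolon', value: ';' },
]);
```

Unlike the lexer, the parser does look at relationships between tokens: `let` only becomes an assignment because an identifier, `=`, and a number follow it.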
| 128 | + |
| 129 | +## Runtime |
| 130 | +The runtime prepares scopes and executes the sequences.
| 131 | +
| 132 | +From the parser, we have these sequences:
| 133 | +```js |
| 134 | +sequences = [ |
| 135 | + { |
| 136 | + type: 'assignment', |
| 137 | + variableName: 'age', |
| 138 | + value: 0 |
| 139 | + }, |
| 140 | + { |
| 141 | + type: 'function-definition', |
| 142 | + functionName: 'hello', |
| 143 | + args: [], |
| 144 | + body: [ |
| 145 | + { |
| 146 | + type: 'method-call', |
| 147 | + objName: 'console', |
| 148 | + methodPath: ['log'], |
| 149 | + args: [ |
| 150 | + { type: 'literal', value: 'hi'} |
| 151 | + ] |
| 152 | + } |
| 153 | + ] |
| 154 | + } |
| 155 | +] |
| 156 | +``` |
| 157 | +But there are some questions we should answer:
| 158 | +1. Where is `age` stored?
| 159 | +2. Where is `console` found?
| 160 | +
| 161 | +Answering them is the runtime's job. It starts by preparing a scope:
| 162 | +```js |
| 163 | +globalScope = { parent: null } |
| 164 | +``` |
| 165 | + |
| 166 | +The runtime executes the first sequence:
| 167 | +```js |
| 168 | +globalScope = { parent: null, age: 0 } |
| 169 | +``` |
| 170 | + |
| 171 | +The runtime executes the second sequence:
| 172 | +```js |
| 173 | +globalScope = { |
| 174 | + parent: null, |
| 175 | + age: 0, |
| 176 | + hello: { |
| 177 | + args: [], |
| 178 | + body: [ |
| 179 | + { |
| 180 | + type: 'method-call', |
| 181 | + objName: 'console', |
| 182 | + methodPath: ['log'], |
| 183 | + args: [ |
| 184 | + { type: 'literal', value: 'hi'} |
| 185 | + ] |
| 186 | + } |
| 187 | + ] |
| 188 | + } |
| 189 | +} |
| 190 | +``` |
| 191 | + |
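| 192 | +The runtime executes the last sequence, the call to `hello`. It creates a new scope for the call and runs the `method-call` in the function body: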
| 193 | +```js |
| 194 | +scope = { parent: globalScope } |
| 195 | + |
| 196 | +if (sequence.type === 'method-call') { |
| 197 | + // find obj following parent link of scope |
| 198 | + const obj = scope[sequence.objName] || scope.parent[sequence.objName] |
| 199 | + const method = sequence.methodPath.reduce((prev, state) => prev[state], obj) |
| 200 | + const args = sequence.args.map(arg => { |
| 201 | + if (arg.type === 'literal') return arg.value;
| 202 | + if (arg.type === 'method-call') {
| 203 | + const temp_scope = { parent: scope } |
| 204 | + // create another scope, execute recursively |
| 205 | + } |
| 206 | + }) |
| 207 | + scope.result = method(...args) |
| 208 | +} |
| 209 | + |
| 210 | +scope.parent = null; |
| 211 | +``` |
| 212 | + |
| 213 | +Now you have learned how the runtime works. A runtime scope is basically a data structure in the native language used to write the interpreter, e.g. a C struct, C++ class, Go struct, Rust struct, Swift class, or Zig struct.
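The runtime steps above can be put together in one small sketch. It is an illustration under several assumptions rather than the runtime of kylin-go or picol: `newScope`, `lookup`, `evaluate`, and `execute` are hypothetical names, variables live in a `vars` map so they cannot clash with the `parent` link, a function value also remembers the scope it was defined in, and `console` is a stub host object that records output in an array instead of printing.

```js
function newScope(parent) {
  return { parent, vars: new Map() };
}

// Walk the parent links until a scope that owns `name` is found.
function lookup(scope, name) {
  for (let s = scope; s !== null; s = s.parent) {
    if (s.vars.has(name)) return s.vars.get(name);
  }
  throw new ReferenceError(`${name} is not defined`);
}

function evaluate(arg, scope) {
  if (arg.type === 'literal') return arg.value;
  return execute(arg, scope); // a nested call used as an argument
}

function execute(seq, scope) {
  switch (seq.type) {
    case 'assignment':
      scope.vars.set(seq.variableName, seq.value);
      return undefined;
    case 'function-definition':
      // Assumption: keep the defining scope so calls can resolve names from it.
      scope.vars.set(seq.functionName, { args: seq.args, body: seq.body, scope });
      return undefined;
    case 'function-call': {
      const fn = lookup(scope, seq.functionName);
      const values = seq.args.map(arg => evaluate(arg, scope));
      const callScope = newScope(fn.scope); // fresh scope for this call
      fn.args.forEach((param, i) => callScope.vars.set(param, values[i]));
      let result;
      for (const step of fn.body) result = execute(step, callScope);
      return result;
    }
    case 'method-call': {
      const obj = lookup(scope, seq.objName);
      const path = seq.methodPath;
      // Resolve the receiver first so the method is called on its own object.
      const receiver = path.slice(0, -1).reduce((o, key) => o[key], obj);
      const values = seq.args.map(arg => evaluate(arg, scope));
      return receiver[path[path.length - 1]](...values);
    }
    default:
      throw new Error(`unknown sequence type: ${seq.type}`);
  }
}

// Host setup: the global scope is where `console` is found.
const output = [];
const globalScope = newScope(null);
globalScope.vars.set('console', { log: (...xs) => output.push(xs.join(' ')) });

const program = [
  { type: 'assignment', variableName: 'age', value: 0 },
  {
    type: 'function-definition', functionName: 'hello', args: [],
    body: [{ type: 'method-call', objName: 'console', methodPath: ['log'],
             args: [{ type: 'literal', value: 'hi' }] }],
  },
  { type: 'function-call', functionName: 'hello', args: [] },
];
for (const seq of program) execute(seq, globalScope);
```

After the loop, `output` holds the single string `'hi'` and `age` is stored in the global scope. The call scope created for `hello` becomes unreachable as soon as the call returns, which is where scope lifetime starts to matter.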
| 214 | + |
| 215 | +## Difficulty |
| 216 | +I have introduced how to write a simple interpreter. A real interpreter is more sophisticated.
| 217 | +
| 218 | +For example, how should the lifetime of a `scope` be managed? This is the topic of GC.
| 219 | + |
| 220 | + |
| 221 | +## How to Speed Up
| 222 | +We use a native language to write an interpreter, e.g. C, C++, Go, or Rust. We can transform source code into tokens, transform tokens into execution sequences, and save the execution sequences to disk. In short, we call this preprocessing step **compiling**. Afterwards, the runtime only has to execute the compiled file, which is faster than lexing and parsing the source code on every run.
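As a rough sketch of this idea, the execution sequences can be written to disk once and loaded on later runs. The JSON format, the file name, and the helpers `saveCompiled` and `loadCompiled` are assumptions made for this example; a real interpreter would more likely use a compact binary format.

```js
const fs = require('fs');
const os = require('os');
const path = require('path');

// "Compile" step: persist the parser's output so later runs can skip lexing and parsing.
function saveCompiled(sequences, file) {
  fs.writeFileSync(file, JSON.stringify(sequences));
}

// Later runs: load the sequences and hand them straight to the runtime.
function loadCompiled(file) {
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}

const sequences = [
  { type: 'assignment', variableName: 'age', value: 0 },
  { type: 'function-call', functionName: 'hello', args: [] },
];

// Hypothetical output location, chosen only for this example.
const file = path.join(os.tmpdir(), 'hello.compiled.json');
saveCompiled(sequences, file);
const loaded = loadCompiled(file);
```

JSON works here because the sequences are plain data: objects, arrays, strings, and numbers.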
| 223 | + |
| 224 | +There's another way to speed up execution. When a compiled file is executed, functions defined in the source code are eventually transformed into native-language functions. We can cache these functions. The next time the same function is executed, no transformation is needed: we take the function out of the cache and execute it directly. This makes execution faster too.
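The caching idea can be sketched as follows. Here the "native function" is a JavaScript closure built once from a function definition; `compile`, `callCached`, `compiledCache`, and the `host` object are hypothetical names, and only `method-call` steps with literal arguments are supported to keep the example short.

```js
const compiledCache = new Map();
let compileCount = 0; // counts how often the transform actually runs

// Transform a 'function-definition' into one native closure.
function compile(definition) {
  compileCount++;
  const steps = definition.body.map(seq => {
    if (seq.type !== 'method-call') throw new Error(`unsupported: ${seq.type}`);
    const values = seq.args.map(arg => arg.value); // literals only in this sketch
    const path = seq.methodPath;
    return host => {
      const receiver = path.slice(0, -1).reduce((o, key) => o[key], host[seq.objName]);
      return receiver[path[path.length - 1]](...values);
    };
  });
  return host => {
    let result;
    for (const step of steps) result = step(host);
    return result;
  };
}

// Look in the cache first; transform only on a miss.
function callCached(definition, host) {
  let fn = compiledCache.get(definition.functionName);
  if (fn === undefined) {
    fn = compile(definition);
    compiledCache.set(definition.functionName, fn);
  }
  return fn(host);
}

const hello = {
  type: 'function-definition', functionName: 'hello', args: [],
  body: [{ type: 'method-call', objName: 'console', methodPath: ['log'],
           args: [{ type: 'literal', value: 'hi' }] }],
};
const output = [];
const host = { console: { log: (...xs) => output.push(xs.join(' ')) } };
callCached(hello, host); // first call: transforms and caches
callCached(hello, host); // second call: served from the cache
```

After both calls, `compileCount` is 1: the second call reuses the cached closure. Keying the cache by function name is a simplification; it assumes a name is never redefined.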
| 225 | + |