Skip to content

Experiment with compiling (ordinary) variants to objects. #3801

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

cristianoc
Copy link
Collaborator

@cristianoc cristianoc commented Sep 2, 2019

Experiment with the runtime representation of (non-polymorphic) variants, to see if they can be produced/consumed from JS without the need for conversion.

Example

type v =
  | A1
  | A2
  | B(int)
  | C(int, int)
  | D((int, int));

let a1 = A1;
let a2 = A2;
let b = B(34);
let c = C(4, 2);
let d = D((4, 2));

Is represented as:

var a1 = "A1";

var a2 = "A2";

var b = /* constructor */{
  tag: "B",
  Arg0: 34
};

var c = /* constructor */{
  tag: "C",
  Arg0: 4,
  Arg1: 2
};

var d = /* constructor */{
  tag: "D",
  Arg0: /* tuple */[
    4,
    2
  ]
};

Ocaml objects, exceptions, polymorphic variants, and lazy values for now keep the existing representation.

Changes

Requires operations to:

  • Access the tag
  • Access the length with e.g. Caml_obj_extern.size_of_t and Caml_obj_extern.length.
  • Copy/traverse the object as in Caml_hash.caml_hash or Caml_obj.caml_obj_dup.

The pattern-matching compiler had to be modified so it does not apply a range-discovering algorithm to the numeric tags. But only uses equality on the string constants.

TODO

1 Disabled tests that use jsConverter which is not compatible:

  • ast_js_mapper_poly_test.ml
  • ast_abstract_test.ml
    One specific feature might need revisiting: | A1 [@bs.as 3].

2 Removed tests involving generated parsers:
Supporting generated parsers requires passing some information differently.
Currently, the internal representation of variants is implicitly assumed in the generated code.
E.g. the table yytransl_const is an array, assuming that the index is the numeric tag of terminals.
Since these are only used for bigger internal tests, it's possible to adapt ocamllex to generate different code.

@bobzhang
Copy link
Member

we discussed such before

var c = /* constructor */{
  tag: "C",
  "0": 4,
  "1": 2
};

such encoding is slow (with numeric val as key)

@cristianoc
Copy link
Collaborator Author

cristianoc commented Sep 18, 2019

such encoding is slow (with numeric val as key)

How do we get some numbers? We need something a bit more precise to evaluate the trade-offs.
There are many possibilities including:

  1. Use numbers by default, and let people opt into strings.
  2. Use strings by default, and let people opt into numbers.
  3. Use genType to generate singleton types for the numeric constants, to act as a form of documentation. So it's still usable from JS.

@bobzhang
Copy link
Member

see discussions in #24, it should be fine to switch tag between number and string.
It is that {"tag" : 0 , "0":a,"1":b} is quite different from [a,b].tag = 0

@cristianoc
Copy link
Collaborator Author

cristianoc commented Sep 18, 2019

@bobzhang ooh so I've tested the difference is 10X. Looks like the situation is exactly the same as 3 years ago.
That makes tradeoffs really easy: no way we're going to use this.

However, it still makes sense to explore representing tags as strings, and measure if there's a perf difference. This would also mean that for zero-ary variants, the interop is still perfect.
Also, we can reuse a lot of what's done here to do the same for polymorphic variants.

@bobzhang
Copy link
Member

However, it still makes sense to explore representing tags as strings, and measure if there's a perf difference.

It makes sense to me, as it also makes debug easier

This would also mean that for zero-ary variants, the interop is still perfect.

For zero-arity variants, polyvar makes more sense to be compiled into string, as it is structual typing and it has more flexibity with names (name can start with lowercase letter). We also need be careful about a speical case, take list for example type list = | Cons of int * list | Nil, in the compiler we take advantage of the fact that Nil is false, if we make it string, such invariant no longer holds

@bloodyowl
Copy link
Collaborator

would something like {tag: "Name", contents: [1, 2]} have the same negative performance impact? the ability to serialise BucklSscript's data structures would be a huge help for server-side rendering as we generally do something like : window.initialState = JSON.stringify(data)

@cristianoc
Copy link
Collaborator Author

For reference, here's the bernchmark: 37cd85b#diff-7074723a3e2e5fe591d29f327ab490e7R1.

@cristianoc
Copy link
Collaborator Author

@bloodyowl serialization is related but also can be orthogonal. E.g. see here: https://github.com/cristianoc/REInfer/blob/master/src/Serialize.re

(It might require updating, but that's the gist of what kind of thing is required -- just a bit of knowledge of what the runtime representations in general look like)

@cristianoc
Copy link
Collaborator Author

And here's the change to use block representation yet keep tags as strings: 45924dc#diff-28556b6d23fb9668ee5320bf37ed6cc2R189-R192.

@cristianoc
Copy link
Collaborator Author

cristianoc commented Sep 20, 2019

@chenglou has identified that the slowdown was due to having integer keys in objects.
The new representation uses Arg0, Arg1 etc as keys. And the benchmark jscomp/test/record_bench.js is now faster than with arrays.

Add length property. For now, an an explicit field.

Checking in little test.

size_of_t

Object length.

Remove "length" property from representation.

Adapt caml_update_dummy to also support objects.

Remove all the magic for length from Caml_obj_extern.

Now, the representation of objects is not hardcoded anymore. They could equally be objects originating from JS that happen to have the same fields and contents.

Don't print `caml_update_dummy0` as it disturbs tests.

Some tests towards strings as tags.

Tweak.

rebase on master

First attempt at emitting tags as strings.

More variant examples.

Add support for Pisstring in if-then-else generated for pattern matching.

comment

Sync pattern matching changes from ocaml.

Pull in ocaml change for booleans, and fix some tests.

NOTE: the comparison needs at least revisiting.
As now lists don't compare in the intuitive way.
The base case "[]" is a string, and is the largest element in the current comparison.

Tweak printexc

Support compilation of recursive modules.

The shape is a variant constructed in an untyped way from OCaml, needs to generate the new representation.

Disable test with jsConverter that does not fit in the new runtime representation.

This does not make much sense:
```
  | A1 [@bs.as 3]
```

Compile () to 0, not "()".

Disable one more test with deriving jsConverter.

Tweak caml_parser to use the new representation.

Remove the tests involving generated parsers.

Supporting generated parsers requires passing some information differently.
Currently, the internal representation of variants is implicitly assumed in the generated code.
E.g. the table `yytransl_const` is an array, assuming that the index is the numeric tag of terminals.

Update hash test.

Delete arith_lexer.mll

Restore some lexer files.

Add record runtime representation benchmark.

Use blocks for payloads but string for tags.

Use new representaiton where args are strings not numbers.

Instead of using 0, 1, 2 for variant arguments, use "Arg0", "Arg1", "Arg2".
This avoids the perf issues with `record_bench.js`, in fact the test becomes faster.
@jacobp100
Copy link

would something like {tag: "Name", contents: [1, 2]} have the same negative performance impact? the ability to serialise BucklSscript's data structures would be a huge help for server-side rendering as we generally do something like : window.initialState = JSON.stringify(data)

You could also use polymorphic variants, which can (currently) be safely run through JSON stringify and parse

@cristianoc cristianoc closed this Nov 6, 2019
@cristianoc cristianoc reopened this Nov 6, 2019
@Niveous
Copy link

Niveous commented Nov 17, 2019

How about using shorter keys since object keys can't be shortened by minifiers? Elm uses $ for the tag and a, b, etc for the fields.

{$: 'C', a: 4, b: 2}

or, name the keys like Scala tuples:

{$: 'C', _0: 4, _1: 2}

@chenglou
Copy link
Member

@Niveous I've explained this elsewhere, though I can't find that issue currently. But the short version is that we want to expose these underlying representations as public, for trivial, cost-free (code-free!) interop with JS.

@Niveous
Copy link

Niveous commented Nov 17, 2019

@chenglou tag Arg0 etc are not exposed right now right? Could still rename them to something shorter before commiting to a public API?

@chenglou
Copy link
Member

They're not. Temporary names. We'll name them into something more readable before public release.

@Niveous
Copy link

Niveous commented Nov 17, 2019

Sounds good then! Maybe they should be modifiable using attributes similar to https://serde.rs/enum-representations.html for even better zero cost interop.

@sikanhe
Copy link

sikanhe commented Nov 21, 2019

Would this work for Inline Records?

@chenglou
Copy link
Member

Not considering it until variants are a public representation too

@bobzhang
Copy link
Member

I am going to move this forward .
For the transition, I am not going to changing the pattern match compiler, this is my proposed representation:

{ _0 : field0, _1 : field1, ..., tag  }

In debug mode, we can have one more

{ _0 : field0, _1 : field1, ..., tag  , name}

Once landed, I think performance wise, we have reached optimal for data representation then we can stabilize the ABI

@chenglou
Copy link
Member

Sure. Then let's make it clear that folks should not leverage the new representation still

@cristianoc
Copy link
Collaborator Author

Splitting into phases is good, even letting people test etc. But I would try to reach a final form before creating an actual release. So any tooling needs to sync up only once.

@jacobp100
Copy link

Btw, if you put tag first then _0, _1 etc. in the object it should perform better. Then every key will always be at the same offset so you can benefit from the inline cache optimisations

@jfrolich
Copy link

Just out of interest, what will happen to variants with inline records? These would open up an amazing opportunity for interop potentially.

@bobzhang
Copy link
Member

@jfrolich For inline records, I am not sure, we will see how difficult it goes to remapping keys.
@cristianoc I tend to keep tag point to number instead of string, we can add new fields for debugging. The reason is that change tag from number to string is not strictly better, it increases the bundle size, slow dow a little bit, but it requires much more work to do. We need more benefit to justify the complexity

@Risto-Stevcev
Copy link

Could we get this for poly variants as well? #4295
They tend to be more useful for most of my use cases because you can merge sets (ie: type encoding = [ input_encoding | output_encoding ]) and you don't need to bring the constructors into scope

@yawaramin
Copy link
Contributor

yawaramin commented Apr 17, 2020

One caveat: TypeScript and Flow's discriminated union support requires all cases of the DU type to have the same tag field. E.g., to adapt @cristianoc 's original example:

type v =
  | A1
  | A2
  | B(int)
  | C(int, int)
  | D((int, int));

To interop seamlessly with Flow/TS, it would need to output something like:

{ tag: 0 } // A1
{ tag: 1 } // A2
{ tag: 2, _1: 1 } // B(1)
{ tag: 3, _1: 1, _2: 2 } // C(1, 2)
{ tag: 4, _1: [1, 2] } // D((1, 2))

Of course, if a variant type has no payloads in any cases, it could just be compiled to 0, 1, 2, etc. like currently.

@Risto-Stevcev
Copy link

Risto-Stevcev commented Apr 17, 2020

What the advantage of using tag and _1, _2,... in this representation?
I read #24 but I'm not sure I understand what the performance issue is here. You normally wouldn't have that many variants in ordinary code so it seems like any kind of optimization would be negligible in terms of perf but would dramatically hurt ergonomics.

@yawaramin It seems completely reasonable afaict to use this kind of rep, which would be way more expressive, and ts/flow could still wrangle it just fine:

type v =
  | A1
  | A2
  | B(int)
  | C(int, int)
  | D((int, int));
"A1" // A1
"A2" // A2
{ B: 1 } // B(1)
{ C: [1, 2] } // C(1, 2)
{ D: [[1, 2]] } // D((1, 2))

That way you could write converter helpers to/from objects/records that would let you write really expressive code for information processing, and bucklescript would be much more capable of writing libraries that target vanilla js like typescript is, which is one of the goals right?

I'm in favor of an ergonomics-first approach like it was with the records. This seems like a premature optimization to me if it does have an perf benefits, and a separate type like Js.t could be provided for those cases where you need to perf gains

@jacobp100
Copy link

@Risto-Stevcev the tags would be a TS enum (or equivalent), so you wouldn't use the numbers directly

@Risto-Stevcev
Copy link

Risto-Stevcev commented Apr 17, 2020

That's assuming that you would be using typescript on the other side though and not plain javascript. It would also mean you can't really write libraries that target both bucklescript and plain js because the js representation wouldn't be ergonomic. Typescript gives you the tools to write variants that are like what I'm proposing if you want to also target plain js:

// if you want to write a typescript-only lib -- enums aren't really ergonomic with plain js
enum Res {
    No = 0,
    Yes = 1,
}
// if you want to write a lib that also targets plain js
// (bucklescript would have to choose one of them since it can't just add new features to ocaml)
type Res2 = "Yes" | "No"

const yes: Res2 = "Yes"
console.log(Res.Yes, yes);

// the value proposition (what `result` would compile to) described using typescript:
type Result<T, E> = { Ok: T } | { Error: E }
const foo: Result<number, Error> = { Ok: 123 }

The last two examples show what you would get if you kept the approach I'm proposing -- you could write libraries in bucklescript that also target plain javascript in a very straightforward way with a clean API.

Also note that typescript can handle the proposed syntax just fine because it has support for union types but enums would be much less ergonomic for a library targeting plain js.

@jfrolich
Copy link

Perhaps it would be possible to customize the representation with annotations? For instance it would be quite nice to have variants map to Redux-like actions easily:

type action = 
| Increment
| Decrement
| Set({number: int})

to

{type: "Increment"}
{type:  "Decrement"}
{type: "Set", number: 1}

I can imagine the tag is the default tag, but perhaps customizable with an annotation to type?

@yawaramin
Copy link
Contributor

@jacobp100 @Risto-Stevcev people using TS/Flow discriminated unions feature don't write types and pattern matches like that. They will want to write a single switch statement. See https://www.typescriptlang.org/docs/handbook/advanced-types.html#discriminated-unions and https://flow.org/en/docs/types/unions/#toc-disjoint-unions . The common element to both is that each case has to have a 'discriminant' i.e. the tag field that is being discussed here. Without it, if you want TS/Flow to do exhaustiveness checking, you have to resort to workarounds.

Anyway this may all be moot, because the proposed tags and field identifiers here are essentially numeric, so not really what TS/Flow people would normally do. But what I'm suggesting will at least be simple to model on that side.

@Risto-Stevcev
Copy link

That's a fair point, you could model pattern matching like that in js as well. But why the _1, _2, etc? why not just do something like:

{ tag: "C", value: /*tuple*/[1,2] }

Ocaml already does type erasure anyway (tuples reuse array rep on the js side)

@bloodyowl
Copy link
Collaborator

@Risto-Stevcev I believe that'd be because the VM would be able to optimise maps better

@Risto-Stevcev
Copy link

Risto-Stevcev commented Apr 17, 2020

@bloodyowl I saw that the comment that it would benefit from inline cache optimizations, but it brings me back to my skepticism about the old record representation -- what are the benchmarks on this and are they realistic in terms of how variants are used in ocaml and js? for example, is it only noticeable about 1000 variations? which would literally never happen in real code since it's really rare that you'd ever go over 10 variations for any data structure. It seems like a premature optimization. Typescript isn't doing this and it doesn't seem like anyone's complaining about perf

@jacobp100
Copy link

jacobp100 commented Apr 17, 2020

To clarify, my comment was only that putting tag first before the _0 fields would benefit inline caches.

But as a side note, I think the _0 stuff goes away if you use inline records

type t =
  | Test { x: 0, y: 0 }
  | Other { a: 0, b: 0 }

Would have the TS type { tag: 0, x: number, y: number } | { tag: 1, a: number, b: number }, which is pretty friendly!

(Although that might take future work)

@TheSpyder
Copy link
Contributor

@Risto-Stevcev arrays in JS (particularly V8) are optimised for homogeneous contents. I don't think that has anything to do with cache lines. As soon as you start mixing data types in an array v8 switches to a less optimal representation of the array. There's plenty of posts about this, but the first search result I came up with seems good:
https://stackoverflow.com/questions/50939690/v8-heterogeneous-array-literals

I don't think TypeScript developers are storing 90% of their data values in heterogeneous arrays, as BuckleScript generated code used to before records-as-objects. They use unions and other things to simulate what we do with variants and records.

@Risto-Stevcev
Copy link

@TheSpyder Yeah, I think I get all the arguments for the rep now. But is it worth the perf gains to use _1, _2,... considering how few variants you'd work with? I'm just asking, I don't know what the benchmarks are. In any case, @jacobp100 mentioned that inline records seem would solve the issue with the rep

@bobzhang bobzhang closed this Oct 10, 2020
@cristianoc cristianoc deleted the variant_to_object branch June 18, 2022 02:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.