-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Proposal] Optics language rework #140
Comments
First of all, this is an incredibly well thought out proposal! I very much agree that the ergonomics of optics could be much better. We have a bit of an odd requirement for optics in the sense that we need to make sure it is not turing complete, and that we can prove that any program written in the language halts. Otherwise, someone could easily take down the production servers. This imposes a fair bit of constraints on the features we can allow directly in the language. I think this approach would have multiple benefits over the approach we use today
Perhaps we should also go ahead and remove the ranking signal adjustments from optics and expose them as a I agree that we should internally move as much as possible out from the What do you think about this? I don't really know what the syntax for the libraries should be yet, but I think we can come up with something that's really nice to use. (huge credits to @oeb25 here) |
This sounds a lot like configuration languages like Nickel, Dhall, Starlark and others. Would it be worth adopting one of these existing languages that already support many/all of the features described in the top post? |
The optics language, as it is, is usable, but not extremely useful as a DSL. It could be entirely expressed as JSON. Additionally, the current syntax is strangely restrictive in some aspects. Actions always need to be at the end of a rule, for example. Commas are... semi-optional. These are mostly quirks of how the LR(1) language is written, because Optics is constructed to not even really be compiled, just parsed.
A future objective is to make the language actually work like a language, to allow for more flexible filtering. I would like this issue to serve as a central repository for this rework, serving both as a place to think this out cleanly, as well as a sort of initial documentation. I'm going to write
Background
The current optics pipeline goes from parsing
RawRule
s and similar into their parsed versions. The parsed versions are mostly identical to their raw versions, with a few simplifications and reductions to make them easier to manage. Optics are then converted into Tantivy queries every time a request is made. Additionally, this conversion to Tantivy queries is done in core.Several other places in core convert things to optics,
Internal architecture
Optics should be kept in their own crate as much as possible. The crate itself should only output Tantivy requests, and optics should be opaque from a core crate perspective. This allows a much more flexible system, with Optic language reworks, alternative syntaxes etc. only being another trait implementation away, and not requiring touching the core at all.
My proposed pipeline is
Optic "program" -Lexing/Parsing-> Executable -Execution-> Unrealized Query -Realization-> Query
Lexing/parsing and execution is done beforehand, and only depends on the optic. Realization is done by the core, calling onto the unrealized query, turning into a Tantivy query. This is necessary because
Matching
requires a schema as well as aFastFieldReader
to be turned into a Tantivy query. Realization should be close to simply cloning data directly and filling up some blanks.I haven't fully investigated, but it seems like neither Schemas nor FastFieldReaders change... pretty much ever (FastFieldReaders get cloned every time, but they only have an inner Arc, so... idk?), so potentially realization could be done once, which allows realization to be slower.
The advantage of this structure is that it allows the Optic language to be much more complex, and affords a large margin in the implementation of the execution (meaning said implementation is infinitely easier as it doesn't require nearly as much thinking). Since it removes Optic compilation from the code executed during a search, searches using optics can also be faster.
A possible language
Basing off of the existing language:
Matching
is a condition that can either be true or false for a specific pageMatching
can either be a composite of several otherMatching
s (more on that later), or a condition on a field of a page.Action
is something that impacts a search result, such as boosting/demoting or discarding.Rule
is aMatching
and anAction
. Only one action, because two different actions can always be combined into 1 rather easily.Rule
s. Rules are applied independently.These definitions effectively define all of the non-standard types of almost any possible Optic language:
Matching
,Action
,Rule
.Notice that effectively all complex programs are essentially iterations on a list. This would therefore be an array of strings. Often, these are "iterated" over, and turned into a single matching with a straightforward mapping and combination. This is why I propose a functional-ish language for Optic, based off of Rust's functional features. Here's a syntax example.
Please note that I am very much open to better syntaxes.
This language has bools, ints, floats, strings, and lists. Since it's functional, functions are also types.
Note that all variables are immutable.
Standard library:
List<I>
:map(apply: I -> G) -> List<G>
zip(other: List<I2>) -> List<(I, I2)>
(the tuple will really be a list, but for docs this is clearer)combine(combiner: List<I> -> G) -> G
call combiner on this list. Mostly syntax sugar.There are a lot of other iterator functions, but these three work for almost all optics. More can be added later.
Match
:Site
,Domain
etc: Return a newMatch
that matches that field. The base Matches.And(in: List<Match>) -> Match
: Returns a newMatch
that only matches when all of its interiorMatch
es match. If the input list is empty, always matches (NOTE TO IMPLEMENTORS: BooleanQuery by default never matches when empty).Or(in: List<Match>) -> Match
: Returns a newMatch
that matches when any of its interiorMatch
es match. If the input list is empty, never matches.Not(self) -> Match
: Returns a newMatch
that only matches when its interior match doesn't match.Never
: A Match that never matches.Always
: A Match that matches everything.Implementation-wise, many combinations of these can be simplified to simpler Tantivy queries, since a BooleanQuery is more general than just And, Or or Not.
Action
:Boost(value: float)
,Downrank(value: float)
,Discard
,Keep
.publish
is a global function/syntax sugar which takes in aMatch
and an action, and applies it as a Rule. Boost and Downranks combine, and Keep always takes priority over Discard.NOTE: this means
DiscardNonMatching
goes from magic to justpublish Match.Always, Action.Discard;
at the start. However it doesn't seem easy to define as a BooleanQuery. Maybe a custom Query type is necessary for this kind of linear combination. Hey, maybe that can be another kind ofMatch
, linearly applying one rule after another. Food for thought idk.Something to think about is how to allow large amounts of data to be inputted without mixing code and data. One possibility is to have some kind of "data section" at the end of the optic, which can contain constants, to dump lists of websites. Something like
data urls;
which promises that we'll defineurls
in the data section.Liking and disliking websites becomes a global function like publish, and take a string. I don't like it much, but internally liking stuff is really distinct from rules, so... yeah.
The text was updated successfully, but these errors were encountered: