Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Json extract #524

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Json extract #524

wants to merge 4 commits into from

Conversation

madejejej
Copy link
Contributor

This is currently a WIP and rebased on top of #504

@madejejej
Copy link
Contributor Author

I hoped this feature wouldn't drop me into a rabbit hole 😅 Unfortunately, there are some quirks in the JSON syntax, ex:

-- "\x61" is an escaped ASCII character for 'a', but this doesn't match in neither sqlite nor postgres:
SELECT json_extract('{"\x61": 1}', 'a');

-- you need to use the exact sequence of characters:
SELECT json_extract('{"\x61": 1}', '\x61')
1

-- the other way around also wouldn't work:
SELECT json_extract('{"a": 1}', '\x61')


#[derive(Parser)]
#[grammar_inline = r#"
array_locator = @{ "[" ~ array_index ~ "]" }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to copy a large part of grammar from core/json/de.rs. Perhaps I could extract the grammar from there into a constant and add additional rules for parsing the path? However, this would leave us with a few unused rules.

The other way to DRY it out would be to extract only the common rules, but I feel like it's brittle and very unhandy if we ever have to patch the JSON grammar.

@jussisaurio
Copy link
Collaborator

i merged #504 so base is now main

@madejejej madejejej force-pushed the json-extract branch 2 times, most recently from cc089af to 332b831 Compare December 31, 2024 08:40
@@ -9,137 +9,7 @@ use std::f64;
use crate::json::error::{self, Error, Result};

#[derive(Parser)]
#[grammar_inline = r#"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to a separate file

@madejejej
Copy link
Contributor Author

@petersooley I saw your PR with json_array_length got merged while I was working on json_extract.

I like the idea of a smaller, hand-rolled parser but I also think the JSON grammar is pretty complicated, even in terms of what you can do with the JSON path in SQLite (see some tests attached). Let me know your thoughts.

@madejejej madejejej force-pushed the json-extract branch 3 times, most recently from 0b639f4 to d970d00 Compare December 31, 2024 08:52
@madejejej
Copy link
Contributor Author

Test FAILED: 'SELECT json_extract(1, null, null, null)'
returned ''
expected '[null,null,null]'

SQLite seems to be changing some quirky behavior from version to version:

sqlite> .version
SQLite 3.43.2 2023-10-10 13:08:14 1b37c146ee9ebb7acd0160c0ab1fd11017a419fa8a3187386ed8cb32b709aapl
zlib version 1.2.12
clang-15.0.0 (64-bit)
sqlite> SELECT json_extract(1, null, null, null);
[null,null,null]

@madejejej madejejej marked this pull request as ready for review December 31, 2024 09:00
// see https://spec.json5.org/#syntactic-grammar and
// https://spec.json5.org/#lexical-grammar

COMMENT = _{ "/*" ~ (!"*/" ~ ANY)* ~ "*/" | "//" ~ (!line_terminator ~ ANY)* }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this grammar hand-written based on the above spec or sourced from somewhere else?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, this file was moved instead of created here, so you might not know

struct Parser;

#[derive(Clone, Debug, PartialEq)]
pub enum PathElement {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also comment these using triple-slash comments? Each enum variant / struct field can also have its own triple-slash comment. IDEs pick those up which is helpful

pub elements: Vec<PathElement>,
}

// Parses path into a Vec of Strings, where each string is a key or an array locator.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triple slash, please

),
}
}
_ => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this branch be hit? doesn't the parser produce discreet unions where we can handle every possible union variant?

match pair.as_rule() {
Rule::EOI => (),
Rule::root => result.push(PathElement::Root()),
Rule::json_path_key => result.push(PathElement::Key(pair.as_str().to_string())),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these be made into pascal case? Rule::JsonPathKey

Rule::negative_index_indicator => {
let negative_offset = array_locator.next().unwrap();
// TODO: sqlite is able to parse arbitrarily big numbers, but they
// always get overflown and cast to i32. Handle this.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to handle it in this PR?

SELECT json_extract(1, '$')
} {{1}}

# TODO: fix me
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we plan to keep these commented out when merging this, please make the comments descriptive enough that anyone can pick them up and work on them

_ => {
let val = &state.registers[*start_reg];
let reg_values =
&state.registers[*start_reg + 1..*start_reg + arg_count];
Copy link
Collaborator

@jussisaurio jussisaurio Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this range should be ..= instead of ..

consider:

// start_reg = 0
// arg_count = 2

let reg_values = &state.registers[1..2]; // <-- only picks one arg since range end is not inclusive

In fact there seem to be no tests for json_extract() with multiple valid paths - should probably add some?


for path in paths {
match path {
OwnedValue::Text(p) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this would be simpler if json_extract_single took an &OwnedValue instead of str so we wouldn't have to pattern match here at all.

let mut result = "".to_string();
if paths.len() > 1 {
    result.push('[');
}
for path in paths.iter() {
    let extracted = json_extract_single(&json, path);
    if paths.len() == 1 && extracted == Val::Null {
        return OwnedValue::Null;
    }
    result.push_str(&crate::json::to_string(&extracted).unwrap());
}

etc

}

if paths.len() > 1 {
result.pop();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment saying that this removes the final comma

@petersooley
Copy link
Contributor

@petersooley I saw your PR with json_array_length got merged while I was working on json_extract.

I like the idea of a smaller, hand-rolled parser but I also think the JSON grammar is pretty complicated, even in terms of what you can do with the JSON path in SQLite (see some tests attached). Let me know your thoughts.

@madejejej The path parsing is much simpler than the JSON parsing, for sure. It's also very limited in sqlite (i.e. no glob/wildcard patterns). It's mostly array indexes and object property paths with a few extra cases.

What I like about the hand-rolled solution is that it doesn't separate out the path parsing from accessing the value in the JSON. That allows returning early as soon as the JSON value has no match for the path. No matter which way we go, there's always a loop required to drill into the JSON and extract a value at the end of the given path. Both solutions are doing that loop anyway, it's just that the hand-rolled solution is doing it during path parsing.

@jussisaurio
Copy link
Collaborator

@petersooley I saw your PR with json_array_length got merged while I was working on json_extract.
I like the idea of a smaller, hand-rolled parser but I also think the JSON grammar is pretty complicated, even in terms of what you can do with the JSON path in SQLite (see some tests attached). Let me know your thoughts.

@madejejej The path parsing is much simpler than the JSON parsing, for sure. It's also very limited in sqlite (i.e. no glob/wildcard patterns). It's mostly array indexes and object property paths with a few extra cases.

What I like about the hand-rolled solution is that it doesn't separate out the path parsing from accessing the value in the JSON. That allows returning early as soon as the JSON value has no match for the path. No matter which way we go, there's always a loop required to drill into the JSON and extract a value at the end of the given path. Both solutions are doing that loop anyway, it's just that the hand-rolled solution is doing it during path parsing.

Yeah this is a decent point, we'd save some deserialization overhead by traversing the JSON object on demand while parsing the path. Maybe not the most important thing in the world and can be optimized later, but it's also nice to implement things right the first time around. I wouldn't block this PR from going forward with the eagerly-parsed version so I'll let @madejejej decide

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants