adds support for writing hextuples, fixes #44. #45
Thanks. I'm not familiar with this format and the "spec" seems... pretty sketchy, but I'll take a look. The basic idea seems reasonable enough, but the linked repo seems to just have a README (full of typos) and not even a rudimentary test suite, so I'm not sure how to go about actually landing this.
Ideally at some point there's support for JSON-LD so applications can provide a JSON interface that's actually nice, but a simple flat format is a good idea too and much easier to implement right now. That said, I vaguely recall there's been quite a few proposals for simple JSON encodings of RDF. Why this one?
src/writer.c
Outdated
   }
   if (!strcmp((const char*)node->buf, NS_RDF "nil")) {
-    return esink("()", 2, writer);
+    if (writer->syntax == SERD_HEXTUPLES) {
This is starting to push the limits of what makes sense in a single writer. It's probably still (barely) fine, but probably time to split this thing up by syntax.
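One way to read "split this thing up by syntax" is a per-syntax function table, so shared code never checks the syntax itself. The sketch below is only an illustration under that assumption: `Syntax`, `Output`, `emit`, and the `write_nil_*` functions are hypothetical stand-ins, not serd's API, and writing rdf:nil as a plain URI string in the JSON case is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration, not serd's real types */
typedef enum { SYNTAX_TURTLE, SYNTAX_HEXTUPLES, N_SYNTAXES } Syntax;

typedef struct {
  char   buf[128];
  size_t len;
} Output;

/* Append text to the output, returning 0 on success or 1 on overflow */
static int
emit(Output* const out, const char* const text)
{
  const size_t n = strlen(text);
  assert(out);
  if (out->len + n >= sizeof(out->buf)) {
    return 1;
  }
  memcpy(out->buf + out->len, text, n + 1); /* Copies the terminator too */
  out->len += n;
  return 0;
}

/* Each syntax gets its own function for writing an empty list (rdf:nil) */
static int
write_nil_turtle(Output* const out)
{
  return emit(out, "()");
}

/* Assumption: a JSON-based syntax writes rdf:nil as a plain URI string */
static int
write_nil_hextuples(Output* const out)
{
  return emit(out, "\"http://www.w3.org/1999/02/22-rdf-syntax-ns#nil\"");
}

typedef int (*WriteNilFunc)(Output*);

/* Table indexed by syntax, so the shared code has no syntax checks */
static const WriteNilFunc write_nil_funcs[N_SYNTAXES] = {
  write_nil_turtle,
  write_nil_hextuples,
};

static int
write_nil(const Syntax syntax, Output* const out)
{
  return write_nil_funcs[syntax](out);
}
```

With this shape, adding a syntax means adding one function and one table entry rather than another branch inside an existing writer function.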
I truthfully went for this one over JSON-LD because it looked the quickest
to do. JSON-LD has a pretty complex spec. Basically, I just wanted JSON
triples, mostly because it looked easier than getting python-serd to work.
I'm trying to find the ingen bug I reported, but wanted to run the latest
serd etc.
…On Sun, 8 Sept 2024, 12:23 am David Robillard, ***@***.***> wrote:
In src/writer.c
> @@ -554,9 +554,9 @@ write_sep(SerdWriter* writer, const Sep sep)
// Write newline or space before separator if necessary
if (pre_line) {
- TRY(st, write_newline(writer));
+ TRY(st, write_newline(writer));
nit: Lots of mangled whitespace. Serd is formatted with clang-format,
please take advantage of this to avoid mangled formatting in diffs.
Integrations are available for pretty much every editor under the sun, you
can also use ninja clang-format to fix it after the fact or just run it
manually.
------------------------------
In src/writer.c
> @@ -755,7 +776,18 @@ write_uri_node(SerdWriter* const writer,
TRY(st, write_uri_from_node(writer, node));
}
- return esink(">", 1, writer);
+ if (writer->syntax == SERD_HEXTUPLES) {
+ /* TRY(st, esink("$CUE", 4, writer)); */
?
------------------------------
In src/writer.c
> @@ -946,6 +985,42 @@ serd_writer_write_statement(SerdWriter* writer,
TRY(st, esink(" .\n", 3, writer));
return SERD_SUCCESS;
}
+ else if (writer->syntax == SERD_HEXTUPLES) {
+ TRY(st, esink("[", 1, writer));
+ TRY(st, write_node(writer, subject, NULL, NULL, FIELD_SUBJECT, flags));
+ TRY(st, esink(", ", 2, writer));
+ TRY(st, write_node(writer, predicate, NULL, NULL, FIELD_PREDICATE, flags));
+ TRY(st, esink(", ", 2, writer));
+ // object
+ TRY(st, esink("\"", 1, writer));
+ TRY(st, write_text(writer, WRITE_STRING, object->buf, object->n_bytes));
+ st = esink("\"", 1, writer);
+
+ TRY(st, esink(", ", 2, writer));
+ //datatype
+ if (datatype && datatype->buf) {
+ st = write_node(writer, datatype, NULL, NULL, FIELD_NONE, flags);
This status isn't handled.
------------------------------
In src/writer.c
> + TRY(st, esink("\"", 1, writer));
+ TRY(st, write_text(writer, WRITE_STRING, object->buf, object->n_bytes));
+ st = esink("\"", 1, writer);
+
+ TRY(st, esink(", ", 2, writer));
+ //datatype
+ if (datatype && datatype->buf) {
+ st = write_node(writer, datatype, NULL, NULL, FIELD_NONE, flags);
+ } else {
+ TRY(st, esink("\"\"", 2, writer));
+ }
+ TRY(st, esink(", ", 2, writer));
+ // lang
+ TRY(st, esink("\"", 1, writer));
+ if (lang && lang->buf) {
+ st = esink(lang->buf, lang->n_bytes, writer);
Ditto.
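The two unhandled statuses share one cause: a plain `st = ...` assignment is overwritten by the next `TRY(st, ...)`, so a failed write is silently lost. A minimal sketch of the fix follows, using a stand-in `TRY` macro, `Status` enum, and `sink` function defined here for illustration (these are assumptions and may differ from serd's own definitions).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in status type and sink, hypothetical rather than serd's own */
typedef enum { ST_SUCCESS, ST_ERR_WRITE } Status;

/* Assign the status of an expression and return early if it failed */
#define TRY(st, exp)      \
  do {                    \
    if (((st) = (exp))) { \
      return (st);        \
    }                     \
  } while (0)

typedef struct {
  char   buf[32];
  size_t len;
} Sink;

/* Append len bytes to the sink, failing if they do not fit */
static Status
sink(Sink* const s, const char* const text, const size_t len)
{
  if (s->len + len >= sizeof(s->buf)) {
    return ST_ERR_WRITE;
  }
  memcpy(s->buf + s->len, text, len);
  s->len += len;
  s->buf[s->len] = '\0';
  return ST_SUCCESS;
}

/* Write the language field as a quoted, possibly empty, string */
static Status
write_lang_field(Sink* const s, const char* const lang)
{
  Status st = ST_SUCCESS;
  TRY(st, sink(s, "\"", 1));
  if (lang) {
    /* Checked with TRY, so a failure here is not overwritten below */
    TRY(st, sink(s, lang, strlen(lang)));
  }
  TRY(st, sink(s, "\"", 1));
  return st;
}
```

Every write goes through `TRY`, so the first failure is returned to the caller instead of being replaced by the status of a later write.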
|
Sure, I meant aside from JSON-LD, which is indeed much harder to implement, and isn't really intended as "just another RDF serialization" anyway (really using it well requires support for its vocabularies and introduces a schema context to documents etc).
I haven't looked into this in detail since I didn't plan to implement a simpler JSON one, I just know there's a bunch of these. See for example https://www.w3.org/wiki/JSON%2BRDF
> Mostly because it looked easier than getting python-serd to work
Fair enough, but just to make sure we're on the same page about the general landscape here, landing whole new features in the stable branch of serd has a pretty high bar. To get merged and released, any new syntax implementation would need to be comprehensively tested like everything else in serd. Being very fast and very solid is more or less the entire point of this project, I wouldn't dream of haphazardly messing with the stable release just because some unreleased unstable branches and wholly new bindings happen to be out of sync at the moment. I take the quality of serd very seriously and strive to have it always increase (ingen is just an app, so more of a "hey it works" situation is fine there).
That said, I'm happy to add an implementation of a simpler syntax that makes talking to Javascript (and/or Python?) easier. In principle, that shouldn't be very hard for any simple "flat" syntax. To actually be "shippable" in serd, though, it'd have to be long-term stable, and exhaustively tested in some way or another. I'm not sure if the latter is feasible without reading support, for one thing. I'm also not sure about the choice of syntax, the page linked above shows some, there's probably even more, but I'm not familiar with this space. Ideally, whatever's supported most by the most common implementations, which comes with its own set of test cases, because then it's a simple matter of just making all those pass.
This approach seems to be using the existing implementation to write most of the actual text, which makes for an easy change, but that means it's using the escaping/etc rules of the W3C syntaxes... which I doubt is correct? Does swapping a few delimiters produce something significantly easier to parse than actual NTriples anyway?
|
On Sun, 8 Sept 2024, 2:51 am David Robillard, ***@***.***> wrote:
> Sure, I meant aside from JSON-LD, which is indeed much harder to
> implement, and isn't really intended as "just another RDF serialization"
> anyway (really using it well requires support for its vocabularies and
> introduces a schema context to documents etc).
> I haven't looked into this in detail since I didn't plan to implement a
> simpler JSON one, I just know there's a bunch of these. See for example
> https://www.w3.org/wiki/JSON%2BRDF
https://ontola.io/blog/rdf-serialization-formats is where I looked first.
JSON-LD looked good, but also much harder to implement.
> > Mostly because it looked easier than getting python-serd to work
> Fair enough, but just to make sure we're on the same page about the
> general landscape here, landing whole new features in the stable branch of
> serd has a pretty high bar. To get merged and released, any new syntax
> implementation would need to be comprehensively tested like everything else
> in serd. Being very fast and very solid is more or less the entire point of
> this project, I wouldn't dream of haphazardly messing with the stable
> release just because some unreleased unstable branches and wholly new
> bindings happen to be out of sync at the moment. I take the quality of serd
> very seriously and strive to have it always increase (ingen is just an app,
> so more of a "hey it works" situation is fine there).
Totally happy if this stays in my fork if need be. I've switched my version
of ingen to use it, but I would be happy to add a non-serd-based JSON
communication mechanism to Ingen. I'm just attempting to get everything as
close as possible to what you're running so I can fix some Ingen crashes and
other bugs. I don't want to waste your time with pull requests if newer
dependencies have fixed them.
> That said, I'm happy to add an implementation of a simpler syntax that
> makes talking to Javascript easier. In principle, that shouldn't be very
> hard for any simple "flat" syntax. To actually be "shippable" in serd,
> though, it'd have to be long-term stable, and exhaustively tested in some
> way or another. I'm not sure if the latter is feasible without reading
> support, for one thing. I'm also not sure about the choice of syntax, the
> page linked above shows some, there's probably even more, but I'm not
> familiar with this space. Ideally, whatever's supported most by the most
> common implementations, which comes with its own set of test cases, because
> then it's a simple matter of just making all those pass.
> This approach seems to be using the existing implementation to write most
> of the actual text, which makes for an easy change, but that means it's
> using the escaping/etc rules of the W3C syntaxes... which I doubt is
> correct? Does swapping a few delimiters produce something significantly
> easier to parse than actual NTriples anyway?
Seemed easier for me, this change was relatively quick and I can now parse
with pretty much the same code in any json environment on the other end,
making new front-end stuff easier. If I tell a hired front-end person these
are the JSON messages that make stuff happen, they are happy; if I say
here's the NTriples or Turtle, their heads explode.
I can reproduce some of the ingen bugs just running the unchanged ingen
front end and server, but they are timing-based, so I'm keen to write some
test cases in scripts.
I'll fix the whitespace and debugging code if you're keen to get this
merged; if not, that's okay too.
|
Just realised we could check whether there's a sensible test suite in RDFlib for it, as RDFlib supports hextuples as well. |
> Seemed easier for me, this change was relatively quick and I can now parse with pretty much the same code in any json environment on the other end
I doubt you actually can, with anything but the simplest data anyway. That's the real problem here. You've essentially changed the delimiters of an NTriples document, but the actual nodes are still NTriples. The escaping rules for JSON and NTriples are quite different. The document format itself is a trivial line-based thing, almost all of the complexity is in the nodes themselves.
If it's a kludge on a fork that makes things work for you, whatever, but yeah, in order to land in serd, readers and writers need to be able to handle arbitrary data correctly without loss. I don't know how much work that is here, but the escaping rules certainly aren't identical, so it's not zero. You might as well just naively chop an NTriples line at delimiters with basic string operations in Python really, that'd be about the same level of sketchy, although I have a hard time believing that parsing NTriples isn't relatively easy in Python (the whole grammar only has 17 rules). Obviously naively loading the whole thing into an rdflib model is obscenely slow, but judging by a quick web search, even with rdflib you can do streaming parsing without much fuss.
The bindings, as it happens, are mainly why things are in flux again. The easy way to make language bindings easy is to use the same OO-ish pattern for everything, so that's what I did. The problem is, along the way, I ended up baking some severe performance problems right into the guts of the library itself, and also exposing too many internal implementation details that would mean I have no ability to improve that. So the whole way that nodes and statements work in the API needs to be redone, so that's what I'm doing. If unreleased git repositories don't work for interlopers in the mean time... oh well. If you're doing that, keep track of what refs you're using and don't blindly pull. There is absolutely zero guarantee or even minimal effort that anything there remains at all compatible, in any way, at all, ever. That's the whole point of a new major version.
I am working on it, but I'm currently in the middle of an intense period for paid work. Obviously that, and many other things besides, have a much higher priority than inherently unstable WIP things being, well, unstable.
|
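To make the escaping mismatch concrete: N-Triples string escapes include `\'` and the eight-digit `\UXXXXXXXX` form, neither of which is a valid JSON escape, and JSON requires every control character below U+0020 to be escaped. A JSON-based writer therefore needs its own escaping routine rather than the N-Triples one. The following is a minimal sketch of such a routine, not serd code: `json_escape` is a hypothetical name, and the input is assumed to be valid UTF-8 that is passed through unchanged above U+001F.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write `in` (assumed valid UTF-8) to `out` as the body of a JSON string.
   Returns 0 on success, or -1 if `out` is too small. */
static int
json_escape(const char* in, char* const out, const size_t out_size)
{
  size_t n = 0;
  if (out_size == 0) {
    return -1;
  }

  for (; *in; ++in) {
    const unsigned char c = (unsigned char)*in;
    char                esc[7]; /* Longest escape is \u00XX plus terminator */

    switch (c) {
    case '"':  snprintf(esc, sizeof(esc), "\\\""); break;
    case '\\': snprintf(esc, sizeof(esc), "\\\\"); break;
    case '\n': snprintf(esc, sizeof(esc), "\\n"); break;
    case '\r': snprintf(esc, sizeof(esc), "\\r"); break;
    case '\t': snprintf(esc, sizeof(esc), "\\t"); break;
    default:
      if (c < 0x20) {
        /* Other control characters must use the four-digit \u form */
        snprintf(esc, sizeof(esc), "\\u%04x", (unsigned)c);
      } else {
        /* Everything else, including ' and UTF-8 bytes, is written as-is */
        esc[0] = (char)c;
        esc[1] = '\0';
      }
    }

    const size_t len = strlen(esc);
    if (n + len >= out_size) {
      return -1; /* No room for this piece and the terminator */
    }
    memcpy(out + n, esc, len);
    n += len;
  }

  out[n] = '\0';
  return 0;
}
```

Note that an apostrophe is written unescaped here, whereas an N-Triples writer is allowed to emit `\'`, which a strict JSON parser would reject.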
All good, my needs were very specific and ingen only. I've been able to
get the stuff I needed working and found the crashes in Ingen. You can
ignore this pull request and I'll send a separate one with the Ingen
changes. There are a few timing-related segfaults that had an easy fix, and
others that I needed to write some complicated test cases to find. Thanks
again for checking this out.
…On Thu, 12 Sept 2024, 1:05 am David Robillard, ***@***.***> wrote:
|
This adds support for the NDJSON hextuples format, explained here: https://github.com/ontola/hextuples, an easy-to-parse NDJSON-based RDF representation. It is faster and easier to deal with across languages that have good JSON support.
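For readers unfamiliar with the format, here is a rough sketch of what one output line looks like, assuming (as I understand the linked README) that each line is a JSON array of six strings in the order subject, predicate, value, datatype, language, graph. `format_hextuple` is a hypothetical helper, not part of this change, and it assumes every field is already JSON-escaped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format one statement as a line of six JSON strings.
   Assumes each field is already JSON-escaped (no escaping is done here).
   Returns 0 on success, or -1 if the line does not fit in `out`. */
static int
format_hextuple(char* const       out,
                const size_t      out_size,
                const char* const subject,
                const char* const predicate,
                const char* const value,
                const char* const datatype,
                const char* const language,
                const char* const graph)
{
  const int n = snprintf(out,
                         out_size,
                         "[\"%s\", \"%s\", \"%s\", \"%s\", \"%s\", \"%s\"]\n",
                         subject,
                         predicate,
                         value,
                         datatype,
                         language,
                         graph);

  return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

A consumer can then split the stream on newlines and hand each line to any JSON parser, which is the main appeal over a dedicated N-Triples parser.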