adds support for writing hextuples, fixes #44. #45

loki42 · 2024-09-07T05:30:52Z

This adds support for the NDJSON HexTuples format, explained here: https://github.com/ontola/hextuples. It is an easy-to-parse, NDJSON-based RDF representation that is faster and easier to deal with across languages with good JSON support.
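For readers unfamiliar with the format, here is a rough Python sketch of what consuming it might look like. It assumes the six-field layout from the linked README as I understand it (subject, predicate, value, datatype, language, graph, with `globalId` as the datatype marker for IRI objects). The `HexTuple` type, the `parse_hextuples` helper, and the sample data are made up for illustration and are not part of serd or this PR.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class HexTuple(NamedTuple):
    """One statement; field order is an assumption taken from the README."""
    subject: str
    predicate: str
    value: str
    datatype: str
    language: str
    graph: str


def parse_hextuples(lines: Iterable[str]) -> Iterator[HexTuple]:
    """Yield one HexTuple per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        fields = json.loads(line)
        if not isinstance(fields, list) or len(fields) != 6:
            raise ValueError(f"expected a 6-element array, got: {line!r}")
        yield HexTuple(*fields)


# Hypothetical sample document: one literal object, one IRI object.
SAMPLE = """\
["http://example.org/a", "http://example.org/name", "Alice", "http://www.w3.org/2001/XMLSchema#string", "", ""]
["http://example.org/a", "http://example.org/knows", "http://example.org/b", "globalId", "", ""]
"""

statements = list(parse_hextuples(SAMPLE.splitlines()))
```

Each line is an independent JSON value, so a consumer can process statements one at a time without loading the whole document.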

drobilla

Thanks. I'm not familiar with this format and the "spec" seems... pretty sketchy, but I'll take a look. The basic idea seems reasonable enough, but the linked repo seems to just have a README (full of typos) and not even a rudimentary test suite, so I'm not sure how to go about actually landing this.

Ideally at some point there's support for JSON-LD so applications can provide a JSON interface that's actually nice, but a simple flat format is a good idea too and much easier to implement right now. That said, I vaguely recall there's been quite a few proposals for simple JSON encodings of RDF. Why this one?

drobilla · 2024-09-07T14:15:44Z

src/writer.c

    }

    if (!strcmp((const char*)node->buf, NS_RDF "nil")) {
-      return esink("()", 2, writer);
+	  if (writer->syntax == SERD_HEXTUPLES) {


This is starting to push the limits of what makes sense in a single writer. It's probably still (barely) fine, but probably time to split this thing up by syntax.

loki42 · 2024-09-07T16:07:21Z

I honestly went for this one over JSON-LD because it looked the quickest to do. JSON-LD has a pretty complex spec. Basically just wanted JSON triples. Mostly because it looked easier than getting python-serd to work. I'm trying to find the Ingen bug I reported, but wanted to run the latest serd etc.

…

On Sun, 8 Sept 2024, 12:23 am David Robillard wrote:

In src/writer.c:

    @@ -554,9 +554,9 @@ write_sep(SerdWriter* writer, const Sep sep)
       // Write newline or space before separator if necessary
       if (pre_line) {
    -    TRY(st, write_newline(writer));
    +    TRY(st, write_newline(writer));

nit: Lots of mangled whitespace. Serd is formatted with clang-format, please take advantage of this to avoid mangled formatting in diffs. Integrations are available for pretty much every editor under the sun, you can also use ninja clang-format to fix it after the fact or just run it manually.

In src/writer.c:

    @@ -755,7 +776,18 @@ write_uri_node(SerdWriter* const writer,
         TRY(st, write_uri_from_node(writer, node));
       }
    -  return esink(">", 1, writer);
    +  if (writer->syntax == SERD_HEXTUPLES) {
    +    /* TRY(st, esink("$CUE", 4, writer)); */

?

In src/writer.c:

    @@ -946,6 +985,42 @@ serd_writer_write_statement(SerdWriter* writer,
         TRY(st, esink(" .\n", 3, writer));
         return SERD_SUCCESS;
       }
    +  else if (writer->syntax == SERD_HEXTUPLES) {
    +    TRY(st, esink("[", 1, writer));
    +    TRY(st, write_node(writer, subject, NULL, NULL, FIELD_SUBJECT, flags));
    +    TRY(st, esink(", ", 2, writer));
    +    TRY(st, write_node(writer, predicate, NULL, NULL, FIELD_PREDICATE, flags));
    +    TRY(st, esink(", ", 2, writer));
    +    // object
    +    TRY(st, esink("\"", 1, writer));
    +    TRY(st, write_text(writer, WRITE_STRING, object->buf, object->n_bytes));
    +    st = esink("\"", 1, writer);
    +
    +    TRY(st, esink(", ", 2, writer));
    +    //datatype
    +    if (datatype && datatype->buf) {
    +      st = write_node(writer, datatype, NULL, NULL, FIELD_NONE, flags);

This status isn't handled.

In src/writer.c:

    +    TRY(st, esink("\"", 1, writer));
    +    TRY(st, write_text(writer, WRITE_STRING, object->buf, object->n_bytes));
    +    st = esink("\"", 1, writer);
    +
    +    TRY(st, esink(", ", 2, writer));
    +    //datatype
    +    if (datatype && datatype->buf) {
    +      st = write_node(writer, datatype, NULL, NULL, FIELD_NONE, flags);
    +    } else {
    +      TRY(st, esink("\"\"", 2, writer));
    +    }
    +    TRY(st, esink(", ", 2, writer));
    +    // lang
    +    TRY(st, esink("\"", 1, writer));
    +    if (lang && lang->buf) {
    +      st = esink(lang->buf, lang->n_bytes, writer);

Ditto.

drobilla · 2024-09-07T16:51:16Z

Sure, I meant aside from JSON-LD, which is indeed much harder to implement, and isn't really intended as "just another RDF serialization" anyway (really using it well requires support for its vocabularies and introduces a schema context to documents etc).

I haven't looked into this in detail since I didn't plan to implement a simpler JSON one, I just know there's a bunch of these. See for example https://www.w3.org/wiki/JSON%2BRDF

Mostly because it looked easier than getting python-serd to work

Fair enough, but just to make sure we're on the same page about the general landscape here, landing whole new features in the stable branch of serd has a pretty high bar. To get merged and released, any new syntax implementation would need to be comprehensively tested like everything else in serd. Being very fast and very solid is more or less the entire point of this project, I wouldn't dream of haphazardly messing with the stable release just because some unreleased unstable branches and wholly new bindings happen to be out of sync at the moment. I take the quality of serd very seriously and strive to have it always increase (ingen is just an app, so more of a "hey it works" situation is fine there).

That said, I'm happy to add an implementation of a simpler syntax that makes talking to Javascript (and/or Python?) easier. In principle, that shouldn't be very hard for any simple "flat" syntax. To actually be "shippable" in serd, though, it'd have to be long-term stable, and exhaustively tested in some way or another. I'm not sure if the latter is feasible without reading support, for one thing. I'm also not sure about the choice of syntax, the page linked above shows some, there's probably even more, but I'm not familiar with this space. Ideally, whatever's supported most by the most common implementations, which comes with its own set of test cases, because then it's a simple matter of just making all those pass.
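To illustrate why reading support matters for testing, here is a sketch of the kind of round-trip check a writer/reader pair makes possible. It uses Python's `json` module as a stand-in for both directions, so it says nothing about serd itself; `write_hextuple`, `read_hextuple`, `roundtrip_failures`, and the list of tricky values are all hypothetical names and data chosen for this example.

```python
import json

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def write_hextuple(fields):
    """Serialize six strings as one JSON array (stand-in for a writer)."""
    return json.dumps(list(fields), ensure_ascii=False)


def read_hextuple(line):
    """Parse one line back into a tuple (stand-in for a reader)."""
    return tuple(json.loads(line))


# Literal values that tend to expose escaping bugs.
TRICKY_VALUES = [
    "plain",
    'with "quotes"',
    "back\\slash",
    "line\nbreak",
    "tab\there",
    "it's",
    "emoji \U0001F600",
]


def roundtrip_failures(values):
    """Return the values that do not survive write-then-read unchanged."""
    failures = []
    for value in values:
        original = ("http://example.org/s", "http://example.org/p",
                    value, XSD_STRING, "", "")
        line = write_hextuple(original)
        # A record must stay on one line, and must read back identically.
        if "\n" in line or read_hextuple(line) != original:
            failures.append(value)
    return failures


failures = roundtrip_failures(TRICKY_VALUES)
```

With only a writer, the same inputs could be checked against expected output files, but someone would have to produce and vouch for those files by hand.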

This approach seems to be using the existing implementation to write most of the actual text, which makes for an easy change, but that means it's using the escaping/etc rules of the W3C syntaxes... which I doubt is correct? Does swapping a few delimiters produce something significantly easier to parse than actual NTriples anyway?

loki42 · 2024-09-08T01:05:22Z

https://ontola.io/blog/rdf-serialization-formats is where I looked first. JSON-LD looked good but also much harder to implement.

Totally happy if this stays in my fork if need be. I've switched my version of Ingen to use it, but I would be happy to add a non-serd-based JSON communication mechanism to Ingen. I'm just attempting to get everything as close as possible to what you're running so I can fix some Ingen crashes and other bugs. I don't want to waste your time with pull requests if newer dependencies have already fixed them.

Seemed easier for me, this change was relatively quick and I can now parse with pretty much the same code in any json environment on the other end, making new front-end stuff easier. If I tell a hired front-end person "these are the JSON messages that make stuff happen", they are happy; if I say "here's the NTriples or Turtle", their heads explode. I can reproduce some of the Ingen bugs just by running the unchanged Ingen front end and server, but they are timing-based, so I'm keen to write some test cases in scripts. I'll fix the whitespace and debugging code if you're keen to get this merged; if not, that's okay too.

loki42 · 2024-09-08T07:06:46Z

Just realised we could check whether there's a sensible test suite for it in RDFLib, as RDFLib supports HexTuples as well.
https://github.com/RDFLib/rdflib/blob/main/test/test_serializers/test_serializer_hext.py Not sure about translating it, though.

drobilla · 2024-09-11T15:05:25Z

Seemed easier for me, this change was relatively quick and I can now parse
with pretty much the same code in any json environment on the other end

I doubt you actually can, with anything but the simplest data anyway. That's the real problem here. You've essentially changed the delimiters of an NTriples document, but the actual nodes are still NTriples. The escaping rules for JSON and NTriples are quite different. The document format itself is a trivial line-based thing, almost all of the complexity is in the nodes themselves.
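A small Python sketch of the mismatch, for concreteness. The N-Triples grammar allows the string escapes `\'` and `\UXXXXXXXX`, and allows a raw tab inside a quoted literal, none of which are valid inside a JSON string. Whether serd's writer actually emits these particular forms for a given input is not something this sketch checks; the sample bodies and the `is_valid_json_string` helper are invented for illustration.

```python
import json


def is_valid_json_string(body: str) -> bool:
    """Return True if `body`, wrapped in double quotes, parses as JSON."""
    try:
        json.loads('"' + body + '"')
        return True
    except json.JSONDecodeError:
        return False


# String bodies that are legal between quotes in N-Triples.
NTRIPLES_ONLY = {
    "escaped apostrophe": r"it\'s",        # ECHAR \' has no JSON equivalent
    "8-digit escape": r"\U0001F600",       # JSON only defines \uXXXX
    "raw tab": "a\tb",                     # JSON requires control chars escaped
}

# Escapes that both syntaxes happen to share.
SHARED = {
    "newline escape": r"line\nbreak",
    "4-digit escape": r"caf\u00e9",
    "escaped quote": r"say \"hi\"",
}

ntriples_only_ok = {k: is_valid_json_string(v) for k, v in NTRIPLES_ONLY.items()}
shared_ok = {k: is_valid_json_string(v) for k, v in SHARED.items()}
```

Simple data only uses the shared subset, which is why output like this can appear to work until a string with one of the other forms shows up.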

If it's a kludge on a fork that makes things work for you, whatever, but yeah, in order to land in serd, readers and writers need to be able to handle arbitrary data correctly without loss. I don't know how much work that is here, but the escaping rules certainly aren't identical, so it's not zero. You might as well just naively chop an NTriples line at delimiters with basic string operations in Python really, that'd be about the same level of sketchy, although I have a hard time believing that parsing NTriples isn't relatively easy in Python (the whole grammar only has 17 rules). Obviously naively loading the whole thing into an rdflib model is obscenely slow, but judging by a quick web search, even with rdflib you can do streaming parsing without much fuss.

The bindings, as it happens, are mainly why things are in flux again. The easy way to make language bindings easy is to use the same OO-ish pattern for everything, so that's what I did. The problem is, along the way, I ended up baking some severe performance problems right into the guts of the library itself, and also exposing too many internal implementation details that would mean I have no ability to improve that. So the whole way that nodes and statements work in the API needs to be redone, so that's what I'm doing. If unreleased git repositories don't work for interlopers in the mean time... oh well. If you're doing that, keep track of what refs you're using and don't blindly pull. There is absolutely zero guarantee or even minimal effort that anything there remains at all compatible, in any way, at all, ever. That's the whole point of a new major version.

I am working on it, but I'm currently in the middle of an intense period for paid work. Obviously that, and many other things besides, have a much higher priority than inherently unstable WIP things being, well, unstable.

loki42 · 2024-09-11T15:22:53Z

All good, my needs were very specific and Ingen-only. I've been able to get the stuff I needed working and found the crashes in Ingen. You can ignore this pull request and I'll send a separate one with the Ingen changes. There were a few timing-related segfaults that had an easy fix, and others that I needed to write some complicated test cases to find. Thanks again for checking this out.

adds support for writing hextuples, fixes drobilla#44.

ad7e911

run clang format, fix debugging code I left in and check some return status

3f29150
