adds support for writing hextuples, fixes #44. #45

Open
wants to merge 2 commits into
base: main

Conversation

loki42

@loki42 loki42 commented Sep 7, 2024

This adds support for the NDJSON hextuples format described here: https://github.com/ontola/hextuples, an easy-to-parse NDJSON-based RDF representation. It is faster and easier to deal with across languages that have good JSON support.

Owner

@drobilla drobilla left a comment

Thanks. I'm not familiar with this format and the "spec" seems... pretty sketchy, but I'll take a look. The basic idea seems reasonable enough, but the linked repo seems to just have a README (full of typos) and not even a rudimentary test suite, so I'm not sure how to go about actually landing this.

Ideally at some point there's support for JSON-LD so applications can provide a JSON interface that's actually nice, but a simple flat format is a good idea too and much easier to implement right now. That said, I vaguely recall there's been quite a few proposals for simple JSON encodings of RDF. Why this one?

src/writer.c Outdated
}

if (!strcmp((const char*)node->buf, NS_RDF "nil")) {
return esink("()", 2, writer);
if (writer->syntax == SERD_HEXTUPLES) {
Owner

This is starting to push the limits of what makes sense in a single writer. It's probably still (barely) fine, but probably time to split this thing up by syntax.

@loki42
Author

loki42 commented Sep 7, 2024 via email

@drobilla
Owner

drobilla commented Sep 7, 2024

Sure, I meant aside from JSON-LD, which is indeed much harder to implement, and isn't really intended as "just another RDF serialization" anyway (really using it well requires support for its vocabularies and introduces a schema context to documents etc).

I haven't looked into this in detail since I didn't plan to implement a simpler JSON one, I just know there's a bunch of these. See for example https://www.w3.org/wiki/JSON%2BRDF

> Mostly because it looked easier than getting python-serd to work

Fair enough, but just to make sure we're on the same page about the general landscape here, landing whole new features in the stable branch of serd has a pretty high bar. To get merged and released, any new syntax implementation would need to be comprehensively tested like everything else in serd. Being very fast and very solid is more or less the entire point of this project. I wouldn't dream of haphazardly messing with the stable release just because some unreleased unstable branches and wholly new bindings happen to be out of sync at the moment. I take the quality of serd very seriously and strive to have it always increase (ingen is just an app, so more of a "hey it works" situation is fine there).

That said, I'm happy to add an implementation of a simpler syntax that makes talking to JavaScript (and/or Python?) easier. In principle, that shouldn't be very hard for any simple "flat" syntax. To actually be "shippable" in serd, though, it'd have to be long-term stable, and exhaustively tested in some way or another. I'm not sure if the latter is feasible without reading support, for one thing. I'm also not sure about the choice of syntax. The page linked above shows some, and there are probably even more, but I'm not familiar with this space. Ideally, it would be whatever is most widely supported by common implementations and comes with its own set of test cases, because then it's a simple matter of making all of those pass.

This approach seems to be using the existing implementation to write most of the actual text, which makes for an easy change, but that means it's using the escaping/etc rules of the W3C syntaxes... which I doubt is correct? Does swapping a few delimiters produce something significantly easier to parse than actual NTriples anyway?
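
As a rough illustration of the escaping concern above, here is a minimal sketch of what escaping a node's text as a JSON string involves. It assumes the RFC 8259 rules: the quotation mark, the backslash, and control characters below U+0020 must be escaped, and everything else (including UTF-8 bytes) may pass through. The name `json_escape` is hypothetical; it is not part of serd or of this patch. N-Triples, by contrast, also permits escapes such as `\'` and `\UXXXXXXXX`, which a JSON parser would reject.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of serd: write `in` to `out` escaped as the
   body of a JSON string (without the surrounding quotes).
   Returns 0 on success, or -1 if `out` is too small. */
static int
json_escape(const char* in, char* out, size_t out_size)
{
  assert(in && out);
  if (out_size == 0) {
    return -1;
  }

  size_t n = 0; /* Number of bytes written so far */
  for (const unsigned char* p = (const unsigned char*)in; *p; ++p) {
    char        buf[8] = {0};
    const char* esc    = NULL;
    switch (*p) {
    case '"':  esc = "\\\""; break;
    case '\\': esc = "\\\\"; break;
    case '\b': esc = "\\b";  break;
    case '\f': esc = "\\f";  break;
    case '\n': esc = "\\n";  break;
    case '\r': esc = "\\r";  break;
    case '\t': esc = "\\t";  break;
    default:
      if (*p < 0x20) { /* Remaining control characters need a hex escape */
        snprintf(buf, sizeof(buf), "\\u%04x", (unsigned)*p);
        esc = buf;
      }
      break;
    }

    const size_t len = esc ? strlen(esc) : 1;
    if (n + len + 1 > out_size) { /* Keep room for the terminator */
      return -1;
    }

    if (esc) {
      memcpy(out + n, esc, len);
    } else {
      out[n] = (char)*p; /* Plain ASCII and UTF-8 bytes pass through */
    }
    n += len;
  }

  out[n] = '\0';
  return 0;
}
```

A writer that reuses N-Triples node output unchanged would skip this step, so its output would be valid JSON only for data that happens to need no escapes that differ between the two syntaxes.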

@loki42
Author

loki42 commented Sep 8, 2024 via email

@loki42
Author

loki42 commented Sep 8, 2024

Just realised we could check whether there is a sensible test suite in RDFLib for it, as RDFLib supports hextuples as well:
https://github.com/RDFLib/rdflib/blob/main/test/test_serializers/test_serializer_hext.py
Not sure about translating it though.

@drobilla
Owner

> Seemed easier for me, this change was relatively quick and I can now parse with pretty much the same code in any json environment on the other end

I doubt you actually can, with anything but the simplest data anyway. That's the real problem here. You've essentially changed the delimiters of an NTriples document, but the actual nodes are still NTriples. The escaping rules for JSON and NTriples are quite different. The document format itself is a trivial line-based thing, almost all of the complexity is in the nodes themselves.
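
One concrete difference, sketched here under the assumption that a writer escapes non-ASCII characters at all: N-Triples can write a character outside the Basic Multilingual Plane as a single `\UXXXXXXXX` escape, while JSON only has four-digit `\uXXXX` escapes and so needs a UTF-16 surrogate pair. The helper name `json_uchar` is hypothetical and not part of serd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of serd: write code point `c` to `out` as a
   JSON escape.  Returns the length of the escape (as snprintf does), or -1
   if `c` is not a valid Unicode scalar value. */
static int
json_uchar(uint32_t c, char* out, size_t out_size)
{
  assert(out);
  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
    return -1; /* Out of range, or a lone surrogate */
  }

  if (c < 0x10000) {
    /* Basic Multilingual Plane: a single four-digit escape suffices */
    return snprintf(out, out_size, "\\u%04x", (unsigned)c);
  }

  /* Supplementary planes: split into a high and a low surrogate */
  c -= 0x10000;
  return snprintf(out,
                  out_size,
                  "\\u%04x\\u%04x",
                  (unsigned)(0xD800u + (c >> 10)),
                  (unsigned)(0xDC00u + (c & 0x3FFu)));
}
```

For U+1F600, this produces the twelve-character pair `\ud83d\ude00`, whereas the N-Triples form `\U0001F600` is not a valid JSON escape.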

If it's a kludge on a fork that makes things work for you, whatever, but yeah, in order to land in serd, readers and writers need to be able to handle arbitrary data correctly without loss. I don't know how much work that is here, but the escaping rules certainly aren't identical, so it's not zero. You might as well just naively chop an NTriples line at delimiters with basic string operations in Python really, that'd be about the same level of sketchy, although I have a hard time believing that parsing NTriples isn't relatively easy in Python (the whole grammar only has 17 rules). Obviously naively loading the whole thing into an rdflib model is obscenely slow, but judging by a quick web search, even with rdflib you can do streaming parsing without much fuss.

The bindings, as it happens, are mainly why things are in flux again. The easy way to make language bindings easy is to use the same OO-ish pattern for everything, so that's what I did. The problem is, along the way, I ended up baking some severe performance problems right into the guts of the library itself, and also exposing too many internal implementation details that would mean I have no ability to improve that. So the whole way that nodes and statements work in the API needs to be redone, so that's what I'm doing. If unreleased git repositories don't work for interlopers in the meantime... oh well. If you're doing that, keep track of what refs you're using and don't blindly pull. There is absolutely zero guarantee or even minimal effort that anything there remains at all compatible, in any way, at all, ever. That's the whole point of a new major version.

I am working on it, but I'm currently in the middle of an intense period for paid work. Obviously that, and many other things besides, have a much higher priority than inherently unstable WIP things being, well, unstable.

@loki42
Author

loki42 commented Sep 11, 2024 via email
