Create repository of extractors #8

Open
vsoch opened this issue Nov 13, 2018 · 6 comments

Comments

@vsoch
Member

vsoch commented Nov 13, 2018

Right now these are living in the dockerfiles repo (a full example), but we should also provide simple examples in a separate repo, with the goal of being able to plug easily into other tools (e.g., datalad, @yarikoptic).

These extractors (in progress!) will be here: https://github.com/openschemas/extractors

@yarikoptic I'm done with the schemaorg python tooling, and I'm waiting to hear from the library about use cases to do the first implementations with datalad. I'll also have "ImageDefinition" examples finished soon, just waiting on a few PRs into container-diff to get all the metadata that I want. There will be a full "dockerfiles" example with embedded metadata for schemaorg also soon (it's parsing now).

The general goal will be that if there is a datalad user with some dataset thing that fits a schema.org definition, they can grab one of these extractors to use with datalad (and schemaorg) to generate the metadata (web view) for their dataset.
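To make that goal concrete, here is a minimal sketch of the kind of output such an extractor might produce: a schema.org `Dataset` record rendered as JSON-LD for a web view. It uses only the Python standard library and does not reflect the actual schemaorg or datalad APIs; the function names, field choices, and example URL are all assumptions for illustration.

```python
import json


def to_schemaorg_dataset(name, description, url, keywords=()):
    """Build a minimal schema.org Dataset record as a JSON-LD dict.

    Hypothetical helper: not part of schemaorg or datalad.
    """
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    if keywords:
        record["keywords"] = list(keywords)
    return record


def render_jsonld_script(record):
    """Wrap the record in a script tag so it can be embedded in a web page."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


example = to_schemaorg_dataset(
    name="dockerfiles",
    description="Example collection of Dockerfiles with extracted metadata.",
    url="https://example.org/datasets/dockerfiles",  # placeholder URL (assumption)
    keywords=["docker", "containers"],
)
print(render_jsonld_script(example))
```

A real extractor would presumably fill these fields from the dataset itself rather than from hard-coded arguments.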

Another question for you - do you have any datasets / community needs that would do well with a Python extractor with datalad? Since these are ready to go and I'm really wanting to get started working (and I'm not sure how long the library would take) it might be faster to find another use case too.

@yarikoptic

oh, you were busy indeed, weren't you @vsoch ?

> any datasets / community needs that would do well with a Python extractor with datalad?

I am not quite sure what you are asking for: all the extractors we have are written in Python as well. Here you seem to concentrate on container/image definitions, so if you are asking about those, then we do not have many of them in datalad land yet.
Within our niceman project we are trying to achieve similar extraction, though we concentrate on information sufficient to identify the entire component (package, container image, etc.), so that we have clear versioning semantics (where available) and origin information. Later on, the same environment could then be reconstructed, or multiple environments compared (similarly to container-diff).
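As a rough illustration of the comparison idea described above (and not niceman's actual data model or API), the sketch below identifies each component by name, version, and origin, then reports what differs between two environments. The `Component` type, the `diff_environments` function, and the sample package data are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Component(NamedTuple):
    """Hypothetical record identifying one component of an environment."""
    name: str
    version: str
    origin: str


def diff_environments(a: List[Component], b: List[Component]) -> Dict[str, List[str]]:
    """Report component names present in only one environment or differing between them."""
    by_name_a = {c.name: c for c in a}
    by_name_b = {c.name: c for c in b}
    shared = set(by_name_a) & set(by_name_b)
    return {
        "only_in_a": sorted(set(by_name_a) - set(by_name_b)),
        "only_in_b": sorted(set(by_name_b) - set(by_name_a)),
        # "changed" means the version or the origin differs for the same name
        "changed": sorted(n for n in shared if by_name_a[n] != by_name_b[n]),
    }


# Made-up sample data for illustration only
env_a = [
    Component("python", "3.6.7", "debian"),
    Component("numpy", "1.15.4", "pypi"),
]
env_b = [
    Component("python", "3.7.1", "debian"),
    Component("numpy", "1.15.4", "pypi"),
    Component("git", "2.19.2", "debian"),
]
print(diff_environments(env_a, env_b))
```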

@vsoch
Member Author

vsoch commented Nov 15, 2018

It doesn't have to be containers; my aim is to develop the integration with datalad, so I'm good with whatever :) I am using Dockerfiles (containers) just because I spent a day last year creating a little database of over 100K, so it's good to test things with. An extractor, the way I'm doing it, would likely use datalad with a schemaorg extraction so the metadata also plugs nicely into search.

@vsoch
Member Author

vsoch commented Nov 15, 2018

Here is the little writeup for the dockerfiles example and extractors, although I haven't finished up doing the ImageDefinition (new schemaorg definition that will get metadata via container-diff) yet. https://vsoch.github.io/2018/datasets/

@yarikoptic

Have you looked at the extractors we already have in DataLad? e.g.

$> datalad_ search --show-keys full | nl
     1	annex.MRI
     2	 in  1 datasets
     3	 has 1 unique values: u'yes'
     4	annex.age
     5	 in  1 datasets
     6	 has 1 unique values: 'unhashable 1688 out of 1690 entries'
     7	annex.dcterms_format
     8	 in  1 datasets
     9	 has 1 unique values: u'image/nifti'
    10	annex.diagnosis
...
  3712	xmp.xmpTPg-PlateNames
  3713	 in  1 datasets
  3714	 has 1 unique values: 'unhashable 0 out of 1 entries'
  3715	xmp.xmpTPg-SwatchGroups<xmpG-groupName>
  3716	 in  1 datasets
  3717	 has 1 unique values: 'unhashable 0 out of 1 entries'
  3718	xmp.xmpTPg-SwatchGroups<xmpG-groupType>
  3719	 in  1 datasets
  3720	 has 1 unique values: 'unhashable 0 out of 1 entries'

Harmonization, at least at the level of a dataset description, is also needed in our case for our rudimentary datasets browser (again on the same http://datasets.datalad.org): https://github.com/datalad/datalad/issues/2403, and for our datasets to finally get indexed by Google Datasets (https://github.com/datalad/datalad/issues/2793).

On our end, within http://datasets.datalad.org we could at least

  • include your datasets into our distribution and thus make them searchable etc
  • adopt your setup for visualizing metadata
  • looking forward, probably adopt the schemaorg extractor as the one to provide that basic metadata (description) for our webview, with or without any other harmonization effort.

@vsoch
Member Author

vsoch commented Apr 25, 2019

@nsheff you might want to take a look at Datalad for another way to have (some) metadata be parsed automatically. I shamefully have not worked on it yet because I don't have many (real use case) datasets to manage.

@vsoch
Member Author

vsoch commented Nov 11, 2019

Just want to add another note here - if anyone has a dataset that would conform to Google Datasets (or schema.org) and wants a Datalad extractor, I'm looking for this use case to better develop, and I can offer to help out.
