Create repository of extractors #8
oh, you were busy indeed, weren't you @vsoch?
I am a bit unsure what you are asking for - all extractors we have are written in Python as well... here you seem to concentrate on container/image definitions - so if you are asking about those, then we do not have many of them in datalad land yet.
It doesn't have to be containers; my aim is to develop the integration with datalad, so I'm good with whatever :) I am using Dockerfiles (containers) just because I spent a day last year creating a little database of over 100K of them, so it's good to test things with. An extractor, the way I'm doing it, would likely use datalad with a schemaorg extraction so the metadata also plugs nicely into search - see the sketch below.
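To make that concrete, here is a rough sketch of the idea - the type and fields are placeholders (I'd swap in the ImageDefinition once it exists), and this is not the actual extractor code, just plain stdlib JSON-LD:

```python
import json

def extract_dockerfile_metadata(dockerfile_path, name):
    """Map a Dockerfile to a schema.org-flavored JSON-LD record (sketch).

    The choice of @type and properties here is illustrative only; a real
    extractor would use a dedicated ImageDefinition type and more fields.
    """
    with open(dockerfile_path) as f:
        lines = [l.strip() for l in f if l.strip() and not l.startswith("#")]

    # Pull the base image out of the first FROM instruction, if any
    base = None
    for line in lines:
        parts = line.split()
        if parts and parts[0].upper() == "FROM" and len(parts) > 1:
            base = parts[1]
            break

    return {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",  # stand-in until ImageDefinition lands
        "name": name,
        "description": "Container build recipe (Dockerfile)",
        "isBasedOn": base,
    }

if __name__ == "__main__":
    print(json.dumps(extract_dockerfile_metadata("Dockerfile", "example"), indent=2))
```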
Here is the little writeup for the dockerfiles example and extractors, although I haven't finished the ImageDefinition (a new schemaorg definition that will get metadata via container-diff) yet: https://vsoch.github.io/2018/datasets/
Have you looked at the extractors we already have in DataLad? e.g.
```
$> datalad search --show-keys full | nl
     1  annex.MRI
     2          in 1 datasets
     3          has 1 unique values: u'yes'
     4  annex.age
     5          in 1 datasets
     6          has 1 unique values: 'unhashable 1688 out of 1690 entries'
     7  annex.dcterms_format
     8          in 1 datasets
     9          has 1 unique values: u'image/nifti'
    10  annex.diagnosis
   ...
  3712  xmp.xmpTPg-PlateNames
  3713          in 1 datasets
  3714          has 1 unique values: 'unhashable 0 out of 1 entries'
  3715  xmp.xmpTPg-SwatchGroups<xmpG-groupName>
  3716          in 1 datasets
  3717          has 1 unique values: 'unhashable 0 out of 1 entries'
  3718  xmp.xmpTPg-SwatchGroups<xmpG-groupType>
  3719          in 1 datasets
  3720          has 1 unique values: 'unhashable 0 out of 1 entries'
```

Harmonization, at least at the level of a dataset description, is also needed in our case for our rudimentary datasets browser (again on the same http://datasets.datalad.org): https://github.com/datalad/datalad/issues/2403, and for our datasets to finally get indexed by Google Datasets (https://github.com/datalad/datalad/issues/2793). On our end, we could within http://datasets.datalad.org at least
@nsheff you might want to take a look at Datalad for another way to have (some) metadata parsed automatically. I shamefully have not worked on it yet because I don't have many (real use case) datasets to manage.
Just want to add another note here - if anyone has a dataset that would conform to Google Datasets (or schema.org) and wants a Datalad extractor, I'm looking for exactly this use case to develop against, and I can offer to help out.
Right now these are living in the dockerfiles repo (a full example), but we should also provide simple examples in a separate repo, with the goal of being able to plug easily into other tools (e.g., datalad), @yarikoptic.
These extractors (in progress!) will be here: https://github.com/openschemas/extractors
@yarikoptic I'm done with the schemaorg Python tooling, and I'm waiting to hear from the library about use cases for the first implementations with datalad. I'll also have "ImageDefinition" examples finished soon; I'm just waiting on a few PRs into container-diff to get all the metadata that I want. There will also be a full "dockerfiles" example with embedded schemaorg metadata soon (it's parsing now).
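For reference, roughly how the container-diff step would be driven (a sketch - it assumes container-diff is installed and on PATH, and the analyzer types shown are just examples):

```python
import json
import subprocess

def analyze_image(image, types=("history", "pip")):
    """Run container-diff analyze on an image and return parsed JSON (sketch).

    'daemon://' reads from the local Docker daemon; each --type adds an
    analyzer. Requires Python 3.7+ for capture_output.
    """
    cmd = ["container-diff", "analyze", "daemon://" + image, "--json"]
    for t in types:
        cmd += ["--type", t]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

# e.g. metadata = analyze_image("ubuntu:18.04")
```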
The general goal is that if a datalad user has some dataset that fits a schema.org definition, they can grab one of these extractors to use with datalad (and schemaorg) to generate the metadata (web view) for their dataset.
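So the datalad side isn't hand-wavy, a minimal sketch of what plugging one of these in could look like - I'm assuming the BaseMetadataExtractor interface in datalad.metadata.extractors.base here, and the exact class layout and signatures may differ between DataLad versions:

```python
from datalad.metadata.extractors.base import BaseMetadataExtractor

class MetadataExtractor(BaseMetadataExtractor):
    """Sketch of a schema.org extractor for DataLad (illustrative only)."""

    def get_metadata(self, dataset, content):
        # Dataset-level metadata as a JSON-LD-ish dict; per-file (content)
        # metadata is left empty in this sketch.
        dsmeta = {
            "@context": "https://schema.org",
            "@type": "Dataset",
            "name": self.ds.path,  # stand-in; a real extractor would read this
        }
        return dsmeta, []
```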
Another question for you - do you have any datasets / community needs that would do well with a Python extractor and datalad? Since these are ready to go and I really want to get started working (and I'm not sure how long the library will take), it might be faster to find another use case too.