Support for Extracting Skills from Custom Skill Lists #46

yonglin-wang · 2022-05-04T15:39:43Z

Hi 👋 Thanks for this great repo--I really liked how smart the tool is, especially being able to extract "Project Management" from the phrase "manage projects". I'd love to hear what you think about the following use case:

Is your feature request related to a problem? Please describe.
I would like to use your tool with a custom skill list other than EMSI, e.g. the O*NET skill lists.

Describe the solution you'd like

It would be great to have API support for a custom skill list. However, I understand that this could involve a lot of work.
Alternatively, instructions on how to create the skill_db_relax_20.json and token_dist.json files for custom skill lists would also be much appreciated.

Describe alternatives you've considered
I have traced the code a little bit, and found that we would probably need

- skill_db_relax_20.json, which seems to be generated with skills_processor/create_surf_db.py based on token_dist.json and skills_processed.json, and
- token_dist.json, which seems to be generated with skills_processor/create_token_dist.py based on skill_db_relax_20.json

My questions are:

Could you provide more description, or a script, showing how skills_processed.json is generated? More specifically, what rules (or data sources) determine the following fields: unique_token, match_on_stemmed?
Per my observation above, the files required to generate skill_db_relax_20.json and token_dist.json seem to be circular: each requires the other in order to be generated. What is the correct order?
- Correct me if I'm wrong, but it looks like token_dist.json could be generated first, with n_grams in this line being a list of lowercased, lemmatized skill titles (only if the skill title is more than one word; otherwise it's the lowercased skill title without the parentheses).
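
To make the guess above concrete, here is a minimal sketch of how a token distribution might be built directly from a list of skill titles, which would break the apparent circular dependency. It is not the code in skills_processor/create_token_dist.py: the function names `normalize_title` and `build_token_dist` are hypothetical, lemmatization is replaced by plain lowercasing to keep the example dependency-free, and dropping the parenthesised qualifier from every title is an assumption.

```python
import re
from collections import Counter


def normalize_title(title: str) -> str:
    """Lowercase a skill title and drop any parenthesised qualifier.

    Assumption: a real pipeline would also lemmatize multi-word titles;
    that step is skipped here to avoid an external NLP dependency.
    """
    without_parens = re.sub(r"\s*\([^)]*\)", "", title)
    return " ".join(without_parens.lower().split())


def build_token_dist(titles):
    """Count how often each token occurs across the normalized titles."""
    n_grams = [normalize_title(t) for t in titles]
    counts = Counter()
    for n_gram in n_grams:
        counts.update(n_gram.split())
    return dict(counts)


# Illustrative titles only, not taken from any real skill list file.
skills = [
    "Project Management",
    "Risk Management",
    "Python (Programming Language)",
]
token_dist = build_token_dist(skills)
print(token_dist)
```

If this reading is right, token_dist.json depends only on the raw skill titles, so it could be produced before skill_db_relax_20.json.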

Additional context
Once the two questions are resolved, I would be happy to write a modularized script that generates skills_processed.json, skill_db_relax_20.json, and token_dist.json from any given skill list/table, and create a pull request for it.

Looking forward to hearing from you 😃

AnasAito · 2022-05-04T18:27:18Z

Hello @yonglin-wang, thank you for this detailed issue. Actually, we had the same idea back when we started SkillNer, but we put it on hold to see first how it works with a simple hard-coded list.
But yes, as you said, it would be good to put in place a pipeline that generates the two JSON files, token_dist.json and skill_db_relax_20.json.
What's next?

On my side, I will add a simple MD file describing the steps/functions needed to go from a list of skills (a label and possibly some metadata) to both files.
You can then start building the pipeline.
I will also change the code to support a general format for the two files (instead of a skill type field, let's have a column called metadata that holds skill metadata, such as skill type in the case of EMSI skills).
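
As a rough illustration of the generalized format proposed above, an entry could carry a free-form metadata mapping rather than a fixed skill type field. This is only a sketch: the `SkillRecord` class, its field names, the IDs, and the metadata values are all hypothetical and do not reflect the actual schema of skill_db_relax_20.json.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SkillRecord:
    """Hypothetical generalized skill entry (not SkillNer's real schema)."""
    skill_id: str
    label: str
    # Free-form metadata replaces a dedicated skill type field.
    metadata: dict = field(default_factory=dict)


# Placeholder records; IDs and metadata values are made up for illustration.
records = [
    SkillRecord("skill-1", "Project Management", {"skill_type": "Hard Skill"}),
    SkillRecord("skill-2", "Active Listening", {"source": "custom list"}),
]

# Key the database by skill ID, as a JSON-serializable mapping.
skill_db = {r.skill_id: asdict(r) for r in records}
print(json.dumps(skill_db, indent=2))
```

With a layout like this, a list that has no notion of skill type can simply leave that key out of metadata.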

What do you think?

yonglin-wang · 2022-05-04T19:38:43Z

Thank you for the prompt follow-up, @AnasAito! Yes, it sounds like an awesome plan to me. I'll keep an eye out for the MD file and start building the pipeline once it's ready.

Having generalized skill metadata sounds like a great idea too, but personally I would give it a slightly lower priority than generating the two files, since skill type seems to be quite a common tag in the other skill lists I've seen as well.

Looking forward to the collaboration 😁

AnasAito · 2022-05-04T23:35:36Z

Hello @yonglin-wang, I finished the MD file, with code utils that will speed up your pipeline creation. I guess it will be straightforward now to generate the two files.
Check this: how_new_db
Happy coding!

AnasAito added the enhancement (New feature or request) label May 4, 2022

yonglin-wang mentioned this issue May 12, 2022

ENH - Add support for custom skills #47

Closed
