Kf/generic pipeline #59
There is a lot of new code, though much of it is essentially copy-pasted from individual scripts in
Hi @KasperFyhn, sorry I didn't notice this. Feel free to let me know on Slack if I haven't been responsive (I should get a notification).
Will do. I figured you were preoccupied and decided to go ahead and merge. I wonder if you didn't get notified because it started as a draft PR which I then opened for review later on.
Might be the case. It might also have come in while I was at the conference in Norway, in which case I may have missed it in the pile of git-related stuff I returned to.
Introducing the "generic pipeline". The idea is as follows:
You bring your data, e.g. text files, tweets in JSON lines or whatever. There are different preprocessors to handle that. They implement the method `Preprocessor._do_preprocessing()`. If your data is special, you or someone else can implement a new preprocessor to handle that data format. The rest kind of works out of the box since the data is streamlined after preprocessing.
Let's say that you have a bunch of text files in a folder under `input/my_input`. The whole thing is run with the `run.py` script as follows.
You'll get this output:
It can all be refined, but I figured this was a good time for review and maybe merging it in since it basically works. The rest can come down the road.
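To make the preprocessor idea above concrete, here is a minimal sketch of how it could look. Only the names `Preprocessor` and `_do_preprocessing()` come from this PR; the public `preprocess()` method, the two subclasses, and the `{"id", "text"}` record shape are assumptions for illustration and may not match the actual code.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Iterable, Iterator, Union


class Preprocessor(ABC):
    """Turns one input format into a uniform stream of records."""

    def preprocess(self, source: Union[str, Path]) -> Iterator[Dict[str, str]]:
        # Hypothetical public entry point: delegate to the format-specific
        # hook, then drop records whose text is empty or whitespace only.
        for record in self._do_preprocessing(Path(source)):
            if record.get("text", "").strip():
                yield record

    @abstractmethod
    def _do_preprocessing(self, source: Path) -> Iterable[Dict[str, str]]:
        """Yield {'id': ..., 'text': ...} dicts (assumed record shape)."""


class TextFilePreprocessor(Preprocessor):
    """Hypothetical preprocessor: one record per .txt file in a folder."""

    def _do_preprocessing(self, source: Path) -> Iterable[Dict[str, str]]:
        for path in sorted(source.glob("*.txt")):
            yield {"id": path.stem, "text": path.read_text(encoding="utf-8")}


class JsonLinesTweetPreprocessor(Preprocessor):
    """Hypothetical preprocessor: one record per line in .jsonl files."""

    def _do_preprocessing(self, source: Path) -> Iterable[Dict[str, str]]:
        for path in sorted(source.glob("*.jsonl")):
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    if line.strip():  # skip blank lines
                        tweet = json.loads(line)
                        yield {"id": str(tweet["id"]), "text": tweet["text"]}
```

Because every subclass yields the same record shape, the later pipeline steps would not need to know which input format the data came from.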
Next up is finding English components for the pipeline such that one can set `"en"` as language in configuration (or CLI argument, perhaps).
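One way the language setting could be resolved, sketched with the standard `argparse` module. The `--language` flag, the `"language"` configuration key, and the `resolve_language()` helper are all hypothetical; nothing here reflects how `run.py` currently handles arguments.

```python
import argparse
from typing import Dict, List, Optional


def resolve_language(config: Dict[str, str],
                     argv: Optional[List[str]] = None) -> str:
    """Pick the language from the CLI if given, else from the config."""
    parser = argparse.ArgumentParser(description="Language selection sketch.")
    # Hypothetical flag: overrides the configuration value when present.
    parser.add_argument("--language", default=None,
                        help="language code, e.g. 'en'")
    args = parser.parse_args(argv)

    language = args.language or config.get("language")
    if language is None:
        raise ValueError("no language set in configuration or on the CLI")
    return language
```

Letting the CLI argument win over the configuration value keeps one-off runs easy without editing the configuration file.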