added NumberParser(InternationalSystemEnglish) #647
Codecov Report
@@ Coverage Diff @@
## master #647 +/- ##
=======================================
Coverage 95.21% 95.21%
=======================================
Files 302 302
Lines 2506 2506
=======================================
Hits 2386 2386
Misses 120 120
|
Hi @kishan3 and @marcosmadr. I want to participate in GSoC 2020 under the project "Number Parser" and similar ones. As discussed in #46, a machine learning model would not outperform parsing. Therefore I have created a small implementation for the English system. The main code is here, and minor edits are still needed to make it work for larger numbers. Improving the code style and adding tests are also pending, but should not be too difficult. Please explain what more you expect from a person doing this project. I know deep learning and RNN (or CNN) architectures and can use them if required for this project. I also have good knowledge of data structures and algorithms such as Knuth-Morris-Pratt, tries (understanding and implementation of Kasai's algorithm), trees/graphs and traversals. |
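For readers following along, here is a minimal sketch of the kind of rule-based English parsing mentioned above. It is not the code from this PR; the function name `parse_english_number` and the word tables are hypothetical and only cover cardinals up to the billions.

```python
# Hypothetical sketch of rule-based English cardinal parsing (not the PR's code).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def parse_english_number(text):
    """Convert English cardinal words to an int."""
    tokens = text.lower().replace("-", " ").split()
    if not tokens:
        raise ValueError("empty input")
    total = 0    # sum of completed groups (thousands, millions, ...)
    current = 0  # value of the group being built (below 1000)
    for token in tokens:
        if token == "and":
            continue  # "two hundred and five"
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            # A bare "hundred" is read as "one hundred".
            current = max(current, 1) * 100
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            raise ValueError(f"not a number word: {token!r}")
    return total + current
```

For example, `parse_english_number("two hundred and five")` walks the tokens left to right, multiplying on "hundred" and adding the trailing "five". The sketch does no validation of word order, so it accepts some strings a stricter grammar would reject.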
Hi, I thought about this again. Since you are working with so many languages (of which I might know only one), I don't think it would be possible to write a parser for all of them, because you need to understand each one. Therefore, once the code above is working, I can use Google Translate to generate data for all numbers in various languages, which I can feed to a model to get the numeric value as output. Please reply, @kishan3 and @marcosmadr. |
I believe the main goal would be to implement such a parser within Dateparser in a way that it can be made to work with different languages. While I don’t think the project needs to implement that language support, I believe it should implement all the pieces needed for it, so that experts in a language can create a data file, similar to those YAML files we use at the moment, to add natural number parsing support for a language. And, of course, for Dateparser supporting cardinal numbers would not be enough. Days of the month are usually expressed as ordinal numbers in English. |
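To make the data-file idea concrete, here is a minimal sketch in which a plain dictionary stands in for the per-language YAML file. The layout (`cardinal`, `ordinal`, `ignore`), the `LANGUAGE_DATA` table and the function `words_to_number` are all assumptions for illustration, not Dateparser's actual data format or API. The parser is purely additive, so it handles "twenty-first" or "dix-sept" but not multipliers such as "hundred" or "quatre-vingts".

```python
# Hypothetical per-language data; in practice this could be loaded from a
# YAML file written by someone who knows the language.
LANGUAGE_DATA = {
    "en": {
        "cardinal": {"one": 1, "seven": 7, "ten": 10, "seventeen": 17, "twenty": 20},
        "ordinal": {"first": 1, "seventh": 7, "seventeenth": 17, "twentieth": 20},
        "ignore": {"and"},
    },
    "fr": {
        "cardinal": {"un": 1, "sept": 7, "dix": 10, "vingt": 20},
        "ordinal": {"premier": 1, "septième": 7, "vingtième": 20},
        "ignore": {"et"},
    },
}


def words_to_number(text, lang, language_data=LANGUAGE_DATA):
    """Sum the values of number words, using only the data for `lang`."""
    data = language_data[lang]
    tokens = [
        token
        for token in text.lower().replace("-", " ").split()
        if token not in data["ignore"]
    ]
    if not tokens:
        raise ValueError("no number words found")
    value = 0
    for position, token in enumerate(tokens):
        is_last = position == len(tokens) - 1
        if token in data["cardinal"]:
            value += data["cardinal"][token]
        elif is_last and token in data["ordinal"]:
            # An ordinal word may only close the expression ("twenty-first").
            value += data["ordinal"][token]
        else:
            raise ValueError(f"unknown word for {lang!r}: {token!r}")
    return value
```

The parsing code contains no language-specific logic, so adding a language under this sketch would only mean adding another entry to the data.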
In that case a machine learning model would be a must, since I only know English. My current idea is to use the cardinal/English model and Google Translate to generate data for months, dates and years in each language, and train the model on that data to produce the numeric target value. I will submit a proposal by the end of the day.
|
Psst! You already got most of the translations you need under https://github.com/scrapinghub/dateparser/tree/master/dateparser_data/cldr_language_data/numeral_translation_data. |
Thanks! That makes it easier, but as discussed in issue #46, I don’t think there’s a way to create all that parsing without knowledge of all those languages or machine learning. I studied French for 6 years, which is why I’m saying this. Consider the following:
"seventeenth april" (in English), "dix-sept avril" (in French).
The order of the words gets swapped in French. Considering the number of languages in the world, it is impossible to recognise patterns and parse the string to its numerical form, and that too for day, month, etc. The best we can do is to use a machine learning/deep learning model and leave it to train continuously. By the way, spaCy already has “label data” which recognises dates, organisations, etc. in a sentence. It can be used to collect data from a large corpus. I will try adding it to the proposal. Please tell me whether I should add machine learning or not, since I do not have any other method in mind to achieve this.
|
Stop word removal can be achieved with any of the available libraries, such as gensim or spaCy. |
If you can figure out a system that can work for both English and French, and similar languages provided data from ICU or data provided by users in YML format, I think that would be a good approach. I would personally be satisfied if a GSoC project yields at least that. Once such a system is in place, people from the community with knowledge of other languages can give feedback about the issues that it presents for other languages, and we can improve the system as needed with feedback. Even if only mentors give feedback, that should be 2-3 more languages which you could work on supporting. |
Hi, I would need a little more info about what is to be covered. Can we have a chat on Gitter or IRC? |
I would prefer to discuss them here unless there is a good reason not to. That said, given how close we are to the deadline, I’ll try and be available at Freenode’s |
Hi, I have sent a message on the IRC chat. Please reply so that I can send the proposal as soon as possible. There may still be a few hours left to ask for amendments too. |
Hi, @Gallaecio @asadurski @kishan3 @marcosmadr I am interested in this project and would like to work towards it. Please find attached my GSoC proposal. Would appreciate if you could go through it and provide feedback. Sorry for the short notice. |
@Gallaecio What kind of text do you scrape with dateparser? Is it dates written in HTML tables, or dates in the middle of large paragraphs of text? |
@XtremeGood Mainly |
ok thanks for the info @harsh9200 . |
I have shared the draft through the Python org under Scrapy. Please check. |
@XtremeGood What are your current plans on addressing input of unknown language? Rely on some kind of language detection? Apply all word-to-number changes regardless of the input language? |
Well, if you have a text, split it into words and count how many of them match each language's dictionary. The language dictionary with the most words in common wins. I have no other approach in mind except machine learning. Can we talk on IRC? |
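The word-overlap idea described above can be sketched as follows. The tiny vocabularies and the function name `detect_language` are hypothetical placeholders; a real version would presumably build the vocabularies from the per-language data files.

```python
# Hypothetical vocabularies, kept tiny for illustration.
VOCABULARIES = {
    "en": {"one", "two", "seventeen", "april", "of", "the"},
    "fr": {"un", "deux", "dix", "sept", "avril", "le"},
}


def detect_language(text, vocabularies=VOCABULARIES):
    """Return the language whose vocabulary shares the most words with text.

    Returns None when no word matches any vocabulary. On a tie, the
    language listed first in `vocabularies` wins.
    """
    tokens = text.lower().replace("-", " ").split()
    scores = {
        lang: sum(token in vocab for token in tokens)
        for lang, vocab in vocabularies.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Counting every token, rather than the set of distinct tokens, means repeated words weigh more; whether that is desirable for short date strings is an open design choice.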
Sounds like language detection, then 🙂 |
If you look at it, this is called bag of words, as the approach suggests, and it is, umm... machine learning. I also have a better approach in mind: "Light" Gradient Boosting Machines (LGBM), which are very fast; people are winning competitions on Kaggle, the hub for machine learning, thanks to the speed of LGBM. During the community bonding period I will try to build it. We can then compare whether it is fast enough. |
Hi @XtremeGood, I have looked for possible ML-based solutions to this problem. Do you have any good literature, or better, similar projects to back up this approach, especially considering the speed and the multitude of languages to accommodate? Most of the similar projects that I have come across stick to rule-based approaches only. |
I will close this PR as we are handling this with another approach: #711 @XtremeGood Thank you for your efforts here, really appreciated 💪 |
Number Parser for solving issue #46 and GSoC 2020.