added NumberParser(InternationalSystemEnglish) #647

Closed
wants to merge 1 commit into from


Teut2711

Number parser for solving issue #46 and for GSoC 2020.

@codecov

codecov bot commented Mar 24, 2020

Codecov Report

Merging #647 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #647   +/-   ##
=======================================
  Coverage   95.21%   95.21%           
=======================================
  Files         302      302           
  Lines        2506     2506           
=======================================
  Hits         2386     2386           
  Misses        120      120           

Powered by Codecov. Last update 48c4563...119db1d. Read the comment docs.

@Teut2711
Author

Teut2711 commented Mar 24, 2020

Hi @kishan3 and @marcosmadr. I want to participate in GSoC 2020 under the project "Number Parser" and similar ones. As discussed in #46, a machine learning model would not outperform parsing. Therefore, I have created a small implementation for the English number system. The main code is here; minor edits are still needed to make it work for larger numbers. Code style improvements and tests are also still pending, but they should not be difficult to add. Please explain what more do you expect from a person doing this project. I know deep learning and RNN (or CNN) architectures and can use them if required for this project. I also have good knowledge of data structures and algorithms, such as Knuth-Morris-Pratt, tries (understanding and implementation of Kasai's algorithm), and tree/graph traversals.
Thanks,
XtremeGood
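
For readers following the thread, here is a minimal sketch of the kind of rule-based parsing discussed above. It is not the code from this PR (which is not shown in the conversation); the table names and the function `parse_english_number` are hypothetical, and only cardinal numbers up to the billions are covered.

```python
# Hypothetical word tables; illustrative only, not taken from the PR.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def parse_english_number(text: str) -> int:
    """Convert an English cardinal such as 'two hundred and five' to an int."""
    total = 0    # sum of the groups already closed by a scale word
    current = 0  # value of the group being built (0..999)
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue  # filler word, carries no value
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            # A bare "hundred" is read as "one hundred".
            current = max(current, 1) * 100
        elif word in SCALES:
            # A scale word closes the current group and resets it.
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unknown number word: {word!r}")
    return total + current


print(parse_english_number("two hundred and five"))
```

The two accumulators are the core of the approach: `current` collects everything below one thousand, and each scale word moves it into `total`.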

@Teut2711
Author

Hi, I thought about this again. Since you are working with so many languages (of which I might know only one), I don't think it would be possible to write a parser for all of them, since you need to understand each one. Therefore, once I have the code above working, I can use Google Translate to generate data for all numbers in various languages, which I can feed to a model to get the numeric value as output. Please reply, @kishan3 and @marcosmadr.

@Gallaecio
Member

Please explain what more do you expect from a person doing this project.

I believe the main goal would be to implement such a parser within Dateparser in a way that it can be made to work with different languages.

While I don’t think the project needs to implement that language support, I believe it should implement all the pieces needed for it, so that experts in a language can create a data file, similar to those YAML files we use at the moment, to add natural number parsing support for a language.

And, of course, for Dateparser supporting cardinal numbers would not be enough. Days of the month are usually expressed as ordinal numbers in English.

@Teut2711
Author

Teut2711 commented Mar 27, 2020 via email

@asadurski
Member

Psst! You already got most of the translations you need under https://github.com/scrapinghub/dateparser/tree/master/dateparser_data/cldr_language_data/numeral_translation_data.

@Teut2711
Author

Teut2711 commented Mar 27, 2020 via email

@Teut2711
Author

Teut2711 commented Mar 28, 2020

Please explain what more do you expect from a person doing this project.

I believe the main goal would be to implement such a parser within Dateparser in a way that it can be made to work with different languages.

While I don’t think the project needs to implement that language support, I believe it should implement all the pieces needed for it, so that experts in a language can create a data file, similar to those YAML files we use at the moment, to add natural number parsing support for a language.

And, of course, for Dateparser supporting cardinal numbers would not be enough. Days of the month are usually expressed as ordinal numbers in English.

Stop word removal can be done with any of the available libraries, such as gensim or spaCy.
Ordinal numbers just have "th" or "nd" at the end, so handling both cardinal and ordinal numbers is pretty simple. One can just create new keys in the dictionary in the code, dictionary["first"] = 1 and so on. The main thing is the algorithm, which I think works at least for pricing. Now, if I assume all languages follow a structure similar to English, I can think of creating multiple keys for a single target. Is there any chat room to communicate?
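
As an illustration of the ordinal idea in this comment, the sketch below maps English ordinal words back to their cardinal form so that a cardinal parser can be reused. The names `IRREGULAR_ORDINALS`, `ordinal_to_cardinal_word`, and `digit_ordinal_to_int` are hypothetical; irregular forms such as "first" or "fifth" need explicit dictionary keys, while regular forms can be handled by suffix rules.

```python
import re
from typing import Optional

# Hypothetical table: ordinals that cannot be derived by stripping a suffix.
IRREGULAR_ORDINALS = {
    "first": "one", "second": "two", "third": "three", "fifth": "five",
    "eighth": "eight", "ninth": "nine", "twelfth": "twelve",
}


def ordinal_to_cardinal_word(word: str) -> str:
    """Return the cardinal word for an English ordinal word."""
    word = word.lower()
    if word in IRREGULAR_ORDINALS:
        return IRREGULAR_ORDINALS[word]
    if word.endswith("ieth"):  # twentieth -> twenty
        return word[:-4] + "y"
    if word.endswith("th"):    # fourth -> four, hundredth -> hundred
        return word[:-2]
    return word                # not an ordinal; returned unchanged


def digit_ordinal_to_int(token: str) -> Optional[int]:
    """Parse digit ordinals such as '21st' or '3rd'; None if no match."""
    match = re.fullmatch(r"(\d+)(st|nd|rd|th)", token.lower())
    return int(match.group(1)) if match else None


print(ordinal_to_cardinal_word("twentieth"), digit_ordinal_to_int("21st"))
```

The suffix rule is deliberately naive: a word like "month" also ends in "th", so a real parser would only accept the result if it appears in its cardinal word table.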

@Gallaecio
Member

Gallaecio commented Mar 28, 2020

If you can figure out a system that can work for both English and French, and for similar languages, given data from ICU or data provided by users in YAML format, I think that would be a good approach. I would personally be satisfied if a GSoC project yields at least that.

Once such a system is in place, people from the community with knowledge of other languages can give feedback about the issues that it presents for other languages, and we can improve the system as needed with feedback. Even if only mentors give feedback, that should be 2-3 more languages which you could work on supporting.
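
To make the idea of one engine driven by per-language data concrete, here is a rough sketch under stated assumptions. The dictionary `LANGUAGE_DATA` stands in for what a YAML file might contain; its keys (`ignore`, `add`, `vigesimal`, `multiply`, `scale`) and the function `parse_number` are hypothetical and are not Dateparser's or ICU's actual format. The `vigesimal` category is one possible way to cover French forms such as "quatre-vingts" (4 × 20).

```python
def _count_from_one(words: str) -> dict:
    """Map space-separated words to 1, 2, 3, ... in order."""
    return {word: value for value, word in enumerate(words.split(), start=1)}


# Hypothetical per-language data, shaped like what a data file could hold.
LANGUAGE_DATA = {
    "en": {
        "ignore": {"and"},
        "add": {
            **_count_from_one(
                "one two three four five six seven eight nine ten eleven "
                "twelve thirteen fourteen fifteen sixteen seventeen "
                "eighteen nineteen"
            ),
            **dict(zip(
                "twenty thirty forty fifty sixty seventy eighty ninety".split(),
                range(20, 100, 10),
            )),
        },
        "vigesimal": {},
        "multiply": {"hundred": 100},
        "scale": {"thousand": 10**3, "million": 10**6},
    },
    "fr": {
        "ignore": {"et"},
        "add": {
            **_count_from_one(
                "un deux trois quatre cinq six sept huit neuf dix onze "
                "douze treize quatorze quinze seize"
            ),
            "une": 1, "trente": 30, "quarante": 40,
            "cinquante": 50, "soixante": 60,
        },
        "vigesimal": {"vingt": 20, "vingts": 20},
        "multiply": {"cent": 100, "cents": 100},
        "scale": {"mille": 10**3, "million": 10**6, "millions": 10**6},
    },
}


def parse_number(text: str, lang: str) -> int:
    """Parse a cardinal number using only the data table for `lang`."""
    data = LANGUAGE_DATA[lang]
    total = 0    # groups already closed by a scale word
    current = 0  # group being built
    for word in text.lower().replace("-", " ").split():
        if word in data["ignore"]:
            continue
        if word in data["vigesimal"]:
            base = data["vigesimal"][word]
            low = current % 100
            if 0 < low < base:
                # "quatre vingt": the preceding small number multiplies.
                current = current - low + low * base
            else:
                current += base
        elif word in data["add"]:
            current += data["add"][word]
        elif word in data["multiply"]:
            current = max(current, 1) * data["multiply"][word]
        elif word in data["scale"]:
            total += max(current, 1) * data["scale"][word]
            current = 0
        else:
            raise ValueError(f"unknown {lang} number word: {word!r}")
    return total + current


print(parse_number("two hundred and five", "en"))
print(parse_number("quatre-vingt-dix-sept", "fr"))
```

The point of the design is that adding a language means adding a table, not code; languages whose number grammar does not fit these four categories would show where the engine needs to grow.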

@Teut2711
Author

Teut2711 commented Mar 28, 2020

Hi, I would need a bit more information about what is to be covered. Can we have a chat on Gitter or IRC?

@Gallaecio
Member

Gallaecio commented Mar 28, 2020

I would prefer to discuss them here unless there is a good reason not to. That said, given how close we are to the deadline, I’ll try and be available at Freenode’s #dateparser room as _gallaecio.

@Teut2711
Author

Teut2711 commented Mar 28, 2020

Hi, I have sent a message on the IRC channel. Please reply so that I can send the proposal as soon as possible. There may still be a few hours left to ask for amendments, too.

@arnavkapoor
Contributor

Hi @Gallaecio, @asadurski, @kishan3, and @marcosmadr. I am interested in this project and would like to work towards it. Please find my GSoC proposal attached. I would appreciate it if you could go through it and provide feedback. Sorry for the short notice.
https://docs.google.com/document/d/17ElRkrvvMBqHXOhFd4_VXc0gG69kpu3GPrAffYFTf4g/edit#

@Teut2711
Author

@Gallaecio What kind of text do you scrape with dateparser? Is it dates written in HTML tables, or dates embedded in large paragraphs of text?

@harsh9200

@XtremeGood

dateparser.parse() is mainly used to parse a date string in a particular format, but it can also be used on paragraphs, whereas the function dateparser.search.search_dates() is used to find dates within a paragraph. You can find the various available formats in dateparser.parse.py.

@Teut2711
Author

OK, thanks for the info, @harsh9200.

@Teut2711
Author

I have shared the draft through the Python org under Scrapy. Please check.

@Gallaecio
Member

@XtremeGood What are your current plans on addressing input of unknown language? Rely on some kind of language detection? Apply all word-to-number changes regardless of the input language?

@Teut2711
Author

Teut2711 commented Mar 30, 2020

Well, if you have a text, split it into words and count how many of its words match each language's dictionary. The language dictionary with the most words in common wins. I have no other approach in mind apart from machine learning. Can we talk on IRC?
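
The word-overlap idea described here could look roughly like the sketch below. The vocabularies are tiny, hypothetical samples, and `detect_language` is an illustrative name rather than an existing Dateparser function.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny vocabularies of number words per language.
VOCABULARIES = {
    "en": {"one", "two", "three", "twenty", "hundred", "thousand", "and"},
    "fr": {"un", "deux", "trois", "vingt", "cent", "mille", "et"},
    "es": {"uno", "dos", "tres", "veinte", "cien", "mil", "y"},
}


def detect_language(text: str) -> Optional[str]:
    """Return the language whose vocabulary matches the most words."""
    # Runs of letters only; digits, punctuation and underscores are dropped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocabulary for word in words)
        for lang, vocabulary in VOCABULARIES.items()
    }
    best = max(scores, key=scores.get)
    # No match at all means the language is unknown.
    return best if scores[best] > 0 else None


print(detect_language("deux cent vingt et un"))
```

On a tie, `max` returns the first language in dictionary order, so short inputs made of words shared between languages remain ambiguous with this scheme.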

@Gallaecio
Member

Sounds like language detection, then 🙂

@Teut2711
Author

Teut2711 commented Mar 30, 2020

If you look at it, this approach is called bag of words, and it is, um, machine learning. I also have a better approach in mind, "Light" Gradient Boosting Machines (LGBM), which are very fast; people are winning competitions on Kaggle, the hub for machine learning, thanks to the speed of LGBM. During the community bonding period I will try to build it. We can then compare whether it is fast enough.
https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm

@arnavkapoor
Contributor

Hi @XtremeGood, I have looked for possible ML-based solutions to this problem. Do you have any good literature, or better, similar projects, to back up this approach, especially considering the speed and the multitude of languages to accommodate? Most of the similar projects that I have come across stick to rule-based approaches only.

@noviluni
Collaborator

noviluni commented Jul 3, 2020

I will close this PR as we are handling this with another approach: #711

@XtremeGood Thank you for your efforts here, really appreciated 💪

@noviluni noviluni closed this Jul 3, 2020
6 participants