added NumberParser(InternationalSystemEnglish) #647

Teut2711 · 2020-03-24T19:47:46Z

Number parser for solving issue #46, and for GSoC 2020.

codecov · 2020-03-24T19:50:19Z

Codecov Report

Merging #647 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #647   +/-   ##
=======================================
  Coverage   95.21%   95.21%           
=======================================
  Files         302      302           
  Lines        2506     2506           
=======================================
  Hits         2386     2386           
  Misses        120      120

Last update 48c4563...119db1d.

Teut2711 · 2020-03-24T19:56:22Z

Hi @kishan3 and @marcosmadr. I want to participate in GSoC 2020 under the project "Number Parser" and similar ones. As discussed in #46, a machine learning model would not outperform parsing. Therefore I have created a small implementation for the English system. The main code is here; minor edits are still needed to make it work for larger numbers. Code style improvements and tests are also still pending, but they should not be too difficult to add. Please explain what more you expect from a person doing this project. I know deep learning and RNN (or CNN) architectures and can use them if required for this project. I also have good knowledge of data structures and algorithms such as Knuth-Morris-Pratt, tries (understanding and implementation of Kasai's algorithm), trees/graphs and traversals.
Thanks,
XtremeGood

Teut2711 · 2020-03-26T10:03:13Z

Hi, I thought about this again. Since you are working with so many languages (of which I might know only one), I don't think it would be possible to create the parser by hand for all of them, because you need to understand each one. Therefore, now that I have written the code above, I can use Google Translate to generate data for all numbers in various languages, which I can feed to a model to get the numeric value as output. Please reply, @kishan3 and @marcosmadr.

Gallaecio · 2020-03-26T21:01:04Z

> Please explain what more do you expect from a person doing this project.

I believe the main goal would be to implement such a parser within Dateparser in a way that it can be made to work with different languages.

While I don’t think the project needs to implement that language support, I believe it should implement all the pieces needed for it, so that experts in a language can create a data file, similar to those YAML files we use at the moment, to add natural number parsing support for a language.

And, of course, for Dateparser supporting cardinal numbers would not be enough. Days of the month are usually expressed as ordinal numbers in English.

Teut2711 · 2020-03-27T07:41:05Z

In that case a machine learning model would be a must, since I only know English. My current idea is to use the cardinal/English model and Google Translate to generate data for months, dates and years in each language, and to train the model on that data to produce the numeric/target value. I will submit a proposal by the end of the day.

asadurski · 2020-03-27T10:30:15Z

Psst! You already got most of the translations you need under https://github.com/scrapinghub/dateparser/tree/master/dateparser_data/cldr_language_data/numeral_translation_data.

Teut2711 · 2020-03-27T10:51:55Z

Thanks! That makes it easier, but as discussed in issue #46, I don’t think there’s a way to create all that parsing without knowledge of all those languages, or machine learning. I studied French for 6 years, which is why I'm saying this. Consider the following: "seventeenth april" (in English) and "dix-sept avril" (in French). The order of the words gets swapped in French. Now, considering the number of languages in the world, it is impossible to recognise patterns and parse a string to its numerical form, and that for days, months, etc. too. The best we can do is to use a machine learning/deep learning model and leave it to train continuously. By the way, spaCy already has "label data" which recognises dates, organisations, etc. in a sentence. It can be used to collect data from a large corpus. I will try adding it to the proposal. Please tell me whether I should add machine learning or not, since I do not have any other method in mind to achieve this.

Teut2711 · 2020-03-28T13:45:48Z

> > Please explain what more do you expect from a person doing this project.
>
> I believe the main goal would be to implement such a parser within Dateparser in a way that it can be made to work with different languages.
>
> While I don’t think the project needs to implement that language support, I believe it should implement all the pieces needed for it, so that experts in a language can create a data file, similar to those YAML files we use at the moment, to add natural number parsing support for a language.
>
> And, of course, for Dateparser supporting cardinal numbers would not be enough. Days of the month are usually expressed as ordinal numbers in English.

Stop word removal can be achieved with any of the available libraries, like gensim or spaCy.
Ordinal numbers just have "th" or "nd" at the end, so handling both cardinal and ordinal numbers is pretty simple to do. One can just create new keys in the dictionary in the code, dictionary["first"] = 1 and so on. The main thing is the algorithm, which I think works at least for pricing. Now, if I assume that all languages follow a structure similar to English, then I can think of creating multiple keys for a single target. Is there any chat room to communicate?
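
The dictionary idea above can be sketched as follows. This is a minimal illustration, not the code from this PR: the table names `UNITS`, `MULTIPLIERS` and `ORDINALS` and both functions are hypothetical. It assumes that regular ordinals can be reduced to a cardinal by stripping a suffix, while irregular ones ("first", "fifth", "twelfth", ...) need their own keys.

```python
# Hypothetical lookup tables; the PR's actual dictionary is not shown in this thread.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
MULTIPLIERS = {"hundred": 100, "thousand": 1000, "million": 1000000}
# Irregular ordinals cannot be derived by stripping "th", so they get their own keys.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fifth": 5,
            "eighth": 8, "ninth": 9, "twelfth": 12}


def word_value(word):
    """Return (value, is_multiplier) for one English number word, or None."""
    if word in UNITS:
        return UNITS[word], False
    if word in MULTIPLIERS:
        return MULTIPLIERS[word], True
    if word in ORDINALS:
        return ORDINALS[word], False
    if word.endswith("ieth"):  # "twentieth" -> "twenty"
        base = word[:-4] + "y"
        if base in UNITS:
            return UNITS[base], False
    if word.endswith("th"):  # "seventeenth" -> "seventeen", "hundredth" -> "hundred"
        base = word[:-2]
        if base in UNITS:
            return UNITS[base], False
        if base in MULTIPLIERS:
            return MULTIPLIERS[base], True
    return None


def parse_number(text):
    """Convert an English cardinal or ordinal phrase to an int."""
    total = current = 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        found = word_value(word)
        if found is None:
            raise ValueError(f"unknown number word: {word!r}")
        value, is_multiplier = found
        if not is_multiplier:
            current += value
        elif value == 100:
            # "hundred" scales the group being built ("two hundred").
            current = max(current, 1) * 100
        else:
            # "thousand"/"million" close the current group.
            total += max(current, 1) * value
            current = 0
    return total + current


print(parse_number("twenty-first"))
print(parse_number("one hundred and fifth"))
```

Because ordinals map to the same integers as cardinals, the caller cannot tell "seventeen" from "seventeenth" in this sketch; a real implementation might need to keep that distinction.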

Gallaecio · 2020-03-28T18:09:52Z

If you can figure out a system that can work for both English and French, and similar languages, given data from ICU or data provided by users in YAML format, I think that would be a good approach. I would personally be satisfied if a GSoC project yields at least that.

Once such a system is in place, people from the community with knowledge of other languages can give feedback about the issues that it presents for other languages, and we can improve the system as needed with feedback. Even if only mentors give feedback, that should be 2-3 more languages which you could work on supporting.
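
One way to picture such a data-driven system is sketched below. It is only an assumption about how the pieces could fit together: the `NumberData` structure, the tiny `ENGLISH` and `FRENCH` tables and `parse_number` are hypothetical names, and the tables are hard-coded here where a real implementation would load them from ICU data or a user-provided data file.

```python
from typing import Dict, NamedTuple, Set


class NumberData(NamedTuple):
    """Per-language data; hypothetical stand-in for a language data file."""
    units: Dict[str, int]        # words whose values are added
    multipliers: Dict[str, int]  # words that scale what precedes them
    ignored: Set[str]            # connector words to skip


# Deliberately tiny tables, just enough for the examples below.
ENGLISH = NumberData(
    units={"one": 1, "two": 2, "seven": 7, "ten": 10,
           "seventeen": 17, "twenty": 20},
    multipliers={"hundred": 100, "thousand": 1000},
    ignored={"and"},
)
FRENCH = NumberData(
    units={"un": 1, "deux": 2, "sept": 7, "dix": 10, "vingt": 20},
    multipliers={"cent": 100, "mille": 1000},
    ignored={"et"},
)


def parse_number(text: str, data: NumberData) -> int:
    """Parse a number phrase using only the tables in `data`."""
    total = current = 0
    for token in text.lower().replace("-", " ").split():
        if token in data.ignored:
            continue
        if token in data.units:
            current += data.units[token]
        elif token in data.multipliers:
            factor = data.multipliers[token]
            if factor < 1000:
                # Hundreds scale the group being built.
                current = max(current, 1) * factor
            else:
                # Thousands and above close the current group.
                total += max(current, 1) * factor
                current = 0
        else:
            raise ValueError(f"unknown number word: {token!r}")
    return total + current


print(parse_number("seventeen", ENGLISH))
print(parse_number("dix-sept", FRENCH))
```

The parsing function never mentions a specific language, so adding one means adding a table. A purely additive scheme like this does not cover every construction, though: French "quatre-vingts" (4 × 20) would need an extra rule, which is the kind of feedback speakers of other languages could provide.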

Teut2711 · 2020-03-28T18:25:41Z

Hi, I would need a bit more info regarding what is to be covered. Can we have a chat on Gitter or IRC?

Gallaecio · 2020-03-28T18:36:21Z

I would prefer to discuss them here unless there is a good reason not to. That said, given how close we are to the deadline, I’ll try and be available at Freenode’s #dateparser room as _gallaecio.

Teut2711 · 2020-03-28T18:54:10Z

Hi, I have messaged you on IRC. Please reply so that I can send the proposal ASAP. There may still be a few hours left to ask for amendments too.

arnavkapoor · 2020-03-29T01:11:01Z

Hi @Gallaecio, @asadurski, @kishan3 and @marcosmadr. I am interested in this project and would like to work towards it. Please find attached my GSoC proposal. I would appreciate it if you could go through it and provide feedback. Sorry for the short notice.
https://docs.google.com/document/d/17ElRkrvvMBqHXOhFd4_VXc0gG69kpu3GPrAffYFTf4g/edit#

Teut2711 · 2020-03-29T06:38:19Z

@Gallaecio What kind of text do you scrape with dateparser? Is it dates written in HTML tables, or dates in the middle of large paragraphs of text?

harsh9200 · 2020-03-29T07:18:24Z

@XtremeGood

dateparser.parse() is mainly used to parse dates in a particular format, but it can also be used for paragraphs, whereas the function dateparser.search.search_dates() is used to find dates in a paragraph. You can find the various available formats in dateparser.parse.py.

Teut2711 · 2020-03-29T07:22:48Z

ok thanks for the info @harsh9200 .

Teut2711 · 2020-03-29T08:34:36Z

I have shared the draft through the Python org under Scrapy. Please check.

Gallaecio · 2020-03-30T17:47:55Z

@XtremeGood What are your current plans on addressing input of unknown language? Rely on some kind of language detection? Apply all word-to-number changes regardless of the input language?

Teut2711 · 2020-03-30T17:53:26Z

Well, if you have a text, split it into words and compare how many words of the text match each language dictionary. The language dictionary with the most words in common wins. I have no other approach in mind except machine learning. Can we talk on IRC?
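
The word-matching idea described above can be sketched like this. The vocabularies are tiny, made-up samples and `detect_language` is a hypothetical helper; in practice the word lists would come from each language's data.

```python
from typing import Dict, Optional, Set

# Made-up sample vocabularies, for illustration only.
LANGUAGE_WORDS: Dict[str, Set[str]] = {
    "en": {"one", "two", "seventeen", "april", "of", "the"},
    "fr": {"un", "deux", "dix", "sept", "avril", "le"},
}


def detect_language(
    text: str, vocabularies: Dict[str, Set[str]] = LANGUAGE_WORDS
) -> Optional[str]:
    """Return the language whose vocabulary matches the most words of `text`."""
    words = text.lower().replace("-", " ").split()
    scores = {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in vocabularies.items()
    }
    # On a tie, max() keeps the first language in dictionary order.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(detect_language("the seventeenth of april"))
print(detect_language("le dix-sept avril"))
```

Returning `None` when nothing matches lets the caller fall back to another strategy, for example trying every language's word-to-number rules.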

Gallaecio · 2020-03-30T19:05:24Z

Sounds like language detection, then 🙂

Teut2711 · 2020-03-30T19:25:41Z

If you look closely, this approach is called bag of words, and it is, umm... machine learning. I also have a better approach in mind, "Light" Gradient Boosting Machines (LGBM), which are very fast; people are winning competitions on Kaggle, the hub for machine learning, thanks to the speed of LGBM. During community bonding I will try to build it. We can then compare whether it is fast enough.
https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm

arnavkapoor · 2020-03-30T20:14:47Z

Hi @XtremeGood, I have looked for possible ML-based solutions to this problem. Do you have any good literature or, better, similar projects to back up this approach, especially considering the speed and the multitude of languages to accommodate? Most of the similar projects that I have come across stick to rule-based approaches only.

noviluni · 2020-07-03T18:41:53Z

I will close this PR as we are handling this with another approach: #711

@XtremeGood Thank you for your efforts here, really appreciated 💪

added NumberParser(InternationalSystemEnglish)

119db1d

noviluni closed this Jul 3, 2020
