Simple arithmetic for the words #46

Open
Allactaga opened this issue Jan 28, 2015 · 16 comments

@Allactaga
Contributor

We need to be able to transform token sequences like "seven hundred and sixty-five thousand, four hundred and thirty-two" into "765432". Different languages may need different handling of such tokens (for example, Roman numerals involve subtraction), so for now let's focus only on how English tokens are transformed into numbers. Let's call this approach "general" (later we would define which approach each language should use in the languages.yaml file).
The initial idea is to iterate through the list of tokens, skipping tokens that are in skip or that match [\W_]+. Every remaining token should be present in the dictionary (the numbers section of the language).

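So if the number represented by the current token is less than the previous one, we use addition. If it is greater than several of the nearby preceding numbers, then those smaller numbers describe the bigger one and we use multiplication. Be sure to use multiplication only with those preceding numbers that are 1) less than the current one and 2) directly chained with the current one. For example, in "four hundred" the 4 precedes the larger 100, so the result is 4 × 100, while in "hundred and thirty" the 30 follows the larger 100, so it is added.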

This approach should of course be properly tested.
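
The addition/multiplication rule described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the project's implementation: the `NUMBERS` and `SKIP` tables and the `words_to_number` function are hypothetical stand-ins for the "numbers" and "skip" sections of a language file, and only a small English vocabulary is covered.

```python
import re

# Hypothetical stand-ins for the "numbers" and "skip" sections of a language.
UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

NUMBERS = {word: value for value, word in enumerate(UNITS)}
NUMBERS.update({word: (index + 2) * 10 for index, word in enumerate(TENS)})
NUMBERS.update({"hundred": 100, "thousand": 1000, "million": 1000000})
SKIP = {"and"}


def words_to_number(text):
    """Convert an English number phrase to an int (sketch of the 'general' approach)."""
    stack = []  # partial values collected so far, in order of appearance
    for token in re.split(r"[\W_]+", text.lower()):
        if not token or token in SKIP:
            continue
        value = NUMBERS[token]  # raises KeyError for tokens outside the dictionary
        # Smaller numbers directly chained before a bigger one describe it:
        # sum them up and multiply by the bigger number.
        group = 0
        while stack and stack[-1] < value:
            group += stack.pop()
        stack.append(group * value if group else value)
    # Whatever remains is ordered from bigger to smaller, so it is added.
    return sum(stack)


print(words_to_number(
    "seven hundred and sixty-five thousand, four hundred and thirty-two"))
```

Because the rule multiplies whenever a bigger number follows a smaller one, digit-by-digit input such as "one two" is not handled meaningfully by this sketch; a real implementation would probably restrict multiplication to words like "hundred" or "thousand".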

@Gallaecio
Member

There is a great potential for code sharing between dateparser and price-parser here. I’ve recently proposed an English-only approach for price-parser (scrapinghub/price-parser#11).

Time for number-parser? 😁

@Eveneko

Eveneko commented Feb 22, 2020

Hi, I'm interested in this idea. However, I have an NLP course this semester. I'm not sure if it is a bit late, but I really want to participate, and I believe I will be able to start work in the summer.

@varunagarwal18

There are many implementations of number parsers on Stack Overflow. There is also a Python library called word2number.

@Gallaecio
Member

There are many implementations for English. But the end goal is to support different locales. And in unambiguous cases, without the knowledge of the locale of the input text. That can be hard.

@ShantanuDube

There are many implementations for English. But the end goal is to support different locales. And in unambiguous cases, without the knowledge of the locale of the input text. That can be hard.

So what should be the main aim of the project: to solve this issue with respect to natural language processing in English, or in all the different locales?

@noviluni
Collaborator

Hi @ShantanuDube @varunagarwal18 @Eveneko !

The idea is to support every supported language, however, if you check the code, most of the things are first translated to English and then processed to get the date. So this could be done as "X language" --> "English" --> "numbers". The first natural step would be "English" --> "numbers", but we also need to develop a "framework" to easily add support for the other languages.
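
The first step of the "X language" --> "English" --> "numbers" pipeline mentioned above could look roughly like this. It is only a sketch: the `ES_TO_EN` table and the `translate_number_words` function are hypothetical and do not reflect dateparser's real language data or API.

```python
import re

# Hypothetical Spanish -> English table, for illustration only.
ES_TO_EN = {
    "uno": "one",
    "dos": "two",
    "tres": "three",
    "cien": "hundred",
    "mil": "thousand",
    "y": "and",
}


def translate_number_words(text, table):
    """Translate known number words token by token; unknown tokens pass through."""
    tokens = re.split(r"[\W_]+", text.lower())
    return " ".join(table.get(token, token) for token in tokens if token)


print(translate_number_words("dos mil tres", ES_TO_EN))
```

The translated string could then be handed to an English-only number parser. A word-by-word table like this would not be enough for fused forms such as Spanish "veintiuno" or German "einundzwanzig", which is part of why a more general framework is needed.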

On the other hand, there are some open PRs trying to address this issue, and we even have some natural numbers directly included in the main code ("one", "two"...). Feel free to investigate it and open issues or draft PRs with ideas. Don't be afraid to code! 😄

@aditya-hari

Why can't we use an existing library, like say https://github.com/jduff/numerizer?

@Gallaecio
Member

Using an existing library is not out of the question, provided that they can be used to achieve the desired goal. Internationalization may be an issue, so that’s something to account for when looking for existing libraries.

They should also be Python libraries or have Python bindings; Ruby libraries are probably not a good fit 😛

@aditya-hari

Okay, I could have sworn that I linked a Python library. A Ruby library is not ideal for a Python library, yes, I tend to agree. Sorry!

@heraclex12

I think we just need to use regex to resolve this. You can see this example: https://github.com/facebook/duckling/blob/master/Duckling/Numeral/EN/Rules.hs
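
As a rough illustration of the regex idea, a pattern can be built from the number vocabulary to locate number phrases in free text. This is a sketch under assumptions and not how Duckling implements its rules: the `NUMBER_WORDS` list, the `NUMBER_PHRASE` pattern and the `find_number_phrases` function are hypothetical names, and the vocabulary is English only.

```python
import re

# Hypothetical English vocabulary; a real one would come from the language data.
NUMBER_WORDS = (
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty "
    "thirty forty fifty sixty seventy eighty ninety hundred thousand million"
).split()

# Longest words first, so "seventeen" is tried before "seven".
WORD = "(?:" + "|".join(sorted(NUMBER_WORDS, key=len, reverse=True)) + ")"

# A number word followed by more number words, joined by whitespace,
# commas, hyphens, or the word "and".
NUMBER_PHRASE = re.compile(
    rf"\b{WORD}(?:(?:\s+and\s+|[\s,-]+){WORD})*\b", re.IGNORECASE
)


def find_number_phrases(text):
    """Return every run of number words found in the text."""
    return [match.group(0) for match in NUMBER_PHRASE.finditer(text)]


print(find_number_phrases(
    "I paid twenty-one dollars and got three hundred and five points"))
```

The regex only finds the spans; turning each span into a value would still need an arithmetic step on the matched words.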

@Teut2711

This problem can be solved by LSTMs. If we can parse a date from bizarre text into one format, then with the help of various parsing libraries we can parse dates in any format. But we will need data with one column containing all the dates (in English or some other language) and another containing the target date. The language variation will make the model tough to train, but I think it will work if we have sufficient data. A major problem might be languages like Chinese or Japanese, which are totally different from English in the way they are written. It doesn't seem that parsing can be the right solution when someone wants to write 3 jan 1978 and someone else 3 January '78, and all kinds of different shortcuts can exist in different languages.

@asadurski
Member

@heraclex12 - True, as this is mostly what is done with the dates, but... see answer #46 (comment).

@asadurski
Member

@XtremeGood - I don't think this is a viable solution. I mean, yes, I believe it would generally work, but:

  • we would not get the performance we need - and we need it really fast,
  • it wouldn't run on any hardware (imagine running this in a Flask app on a tiny server),
  • the size of the library with required libraries to run it would be enormous.

So it's a good approach, just for a separate library.

@Teut2711

I think regex is also slow, and Python too for that matter.

@Teut2711

Teut2711 commented Mar 24, 2020

What we can do is use 1D convolutional neural nets in place of RNNs. I have heard of this approach. Those are even used on mobile devices.

@Teut2711

Or use this: https://spacy.io/

10 participants