Simple arithmetic for the words #46
Comments
There is great potential for code sharing between dateparser and price-parser here. I’ve recently proposed an English-only approach for price-parser (scrapinghub/price-parser#11). Time for number-parser? 😁
Hi, I'm interested in this idea. However, I have an NLP course this semester. I'm not sure if it is a bit late, but I really want to participate, and I believe I will be able to start work in the summer.
There are many implementations of number parsers on Stack Overflow. There is also a community-contributed Python library called word2number.
There are many implementations for English, but the end goal is to support different locales and, in unambiguous cases, to do so without knowing the locale of the input text. That can be hard.
So what should be the main aim of the project: to solve this issue with respect to natural language processing in English, or in all different locales?
Hi @ShantanuDube @varunagarwal18 @Eveneko ! The idea is to support every supported language; however, if you check the code, most things are first translated to English and then processed to get the date. So this could be done as "X language" --> "English" --> "numbers". The first natural step would be "English" --> "numbers", but we also need to develop a "framework" to easily add support for the other languages. On the other hand, there are some open PRs trying to address this issue, and we even have some natural numbers directly included in the main code ("one", "two"...). Feel free to investigate it and open issues or draft PRs with ideas. Don't be afraid to code! 😄
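A minimal sketch of the "X language" --> "English" --> "numbers" pipeline described above. Everything here is an assumption for illustration: the `ES_TO_EN` table, the `EN_NUMBERS` and `EN_SKIP` data, and the function names are hypothetical and do not reproduce dateparser's actual language data or API. The English step is addition-only to keep the example short.

```python
# Hypothetical Spanish -> English token table (not dateparser's real data).
ES_TO_EN = {"uno": "one", "dos": "two", "tres": "three", "treinta": "thirty", "y": "and"}

# Hypothetical English data: number words and tokens to skip.
EN_NUMBERS = {"one": 1, "two": 2, "three": 3, "thirty": 30}
EN_SKIP = {"and"}


def translate_to_english(tokens, table):
    """Step 1: map each source-language token to its English equivalent."""
    return [table.get(token.lower(), token.lower()) for token in tokens]


def english_to_number(tokens):
    """Step 2: addition-only English parser (a deliberate simplification)."""
    return sum(EN_NUMBERS[token] for token in tokens if token not in EN_SKIP)


def parse(text, table):
    """Run both steps: "X language" -> "English" -> number."""
    return english_to_number(translate_to_english(text.split(), table))


print(parse("treinta y dos", ES_TO_EN))
```

With this split, adding a language would only mean supplying another translation table, while the English step stays shared.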
Why can't we use an existing library, like, say, https://github.com/jduff/numerizer?
Using an existing library is not out of the question, provided that it can be used to achieve the desired goal. Internationalization may be an issue, so that’s something to account for when looking for existing libraries. They should also be Python libraries or have Python bindings; Ruby libraries are probably not a good fit 😛
Okay, I could have sworn that I linked a Python library. A Ruby library is not ideal for a Python library, yes, I tend to agree. Sorry!
I think we just need to use regex to resolve this. You can see this example: https://github.com/facebook/duckling/blob/master/Duckling/Numeral/EN/Rules.hs
This problem can be solved with LSTMs. If we can parse the date into one format from bizarre text, then with the help of various parsing libraries we can produce the date in any format. But we will need data with one column containing all the dates (in English or some other language) and another containing the target date. The language variation will make the model tough to train, but I think it will work if we have sufficient data. The major problem might be with languages like Chinese or Japanese, which are totally different from English in the way they are written. Parsing doesn't seem like the right solution when someone wants to write 3 jan 1978 and someone else 3 January '78, and all sorts of different shortcuts can exist in different languages.
@heraclex12 - True, as this is mostly what is done with the dates, but... see answer #46 (comment).
@XtremeGood - I don't think this is a viable solution. I mean, yes, I believe it would generally work, but:
So it's a good approach, just for a separate library.
I think regex is also slow, and so is Python in that respect.
What we can do is use 1D convolutional neural nets in place of RNNs. I have heard of this approach. Those are even used on mobile devices.
Or use this: https://spacy.io/
We need to be able to transform token sequences like "seven hundred and sixty-five thousand, four hundred and thirty-two" into "765432". Tokens could be handled differently in different languages (for example, Roman numerals involve subtraction). So for now let's focus only on how English tokens are transformed into numbers. Let's call this approach "general" (later we would define which approach should be used in the `languages.yaml` file).

The initial idea is to iterate through the list of tokens, skipping tokens that are in `skip` or match `[\W_]+`. Each token should be present in the dictionary (the `numbers` section of the language).

If the number represented by the current token is less than the previous one, we use addition. If it is greater than several of the previous nearby numbers, then those smaller numbers describe this bigger one, and we use multiplication. Be sure to use multiplication only with those preceding numbers that are 1) less than the current one and 2) directly chained with the current one.
This approach should of course be properly tested.
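A minimal sketch of the "general" approach, as one possible reading of the description above rather than a definitive implementation. The `NUMBERS` and `SKIP` tables are hypothetical stand-ins for the `numbers` and `skip` sections of a language file, and `words_to_number` is an illustrative name. A stack keeps the partial values: smaller numbers directly preceding a bigger one are summed and multiplied by it, and everything left on the stack is added at the end.

```python
import re

# Hypothetical stand-ins for the "numbers" and "skip" sections of a language file.
NUMBERS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
    "hundred": 100, "thousand": 1000, "million": 1000000,
}
SKIP = {"and"}
SEPARATOR = re.compile(r"[\W_]+")  # punctuation, whitespace, underscores


def words_to_number(text):
    """Convert English number words to an int using addition and multiplication."""
    # Splitting on the separator pattern drops those tokens; SKIP words go too.
    tokens = [t for t in SEPARATOR.split(text.lower()) if t and t not in SKIP]
    stack = []
    for token in tokens:
        value = NUMBERS[token]  # KeyError if the token is not a known number word
        # Collect directly preceding partial values that are smaller than the
        # current number: they describe it, so they become its multiplier.
        smaller = 0
        while stack and stack[-1] < value:
            smaller += stack.pop()
        stack.append(smaller * value if smaller else value)
    # Whatever remains is a descending chain of values, combined by addition.
    return sum(stack)


print(words_to_number("seven hundred and sixty-five thousand, four hundred and thirty-two"))
```

The sketch does no validation: a sequence such as "one two" is multiplied rather than rejected, so the testing mentioned above would need to cover such inputs.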