Added control string for sentence breaks #676
base: master
I think the code is fine. I'm a little worried about ramifications, but don't let that hold it back.
WHITESPACE=22
SEQ_OF_UNICODES=23
ErrorCharacter=24
'[SB]'=8
This is odd, but I assume it is correct.
I.e., line 25.
It is what ANTLR generates, so I am assuming it is correct.
@@ -119,6 +119,16 @@ abstract class RuleBasedSentenceSplitter extends SentenceSplitter {
endPositions += lastPosition + raw.head.length
}

// found the control string that enforces sentence breaks
// note that this token is NOT added to the sentences produced
else if(crt.word == SENTENCE_BREAK_CONTROL_STRING) {
I would be reassured if this were also dependent on this.useControlStrings or SentenceSplitter.useControlStrings, which defaults to false. Those who want to use the feature can turn it on if necessary.
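A minimal sketch of what that gating could look like, outside the real splitter. The object name, the `split` helper, and the exact flag spelling (`useControlStrings`; the thread also writes `useControlString`) are assumptions for illustration, and the real RuleBasedSentenceSplitter also handles end-of-sentence punctuation, which is left out here.

```scala
object ControlStringGateSketch {
  // Same value as in the diff under review.
  val SENTENCE_BREAK_CONTROL_STRING = "[SB]"

  // Assumed flag name, taken from the review comment; off by default.
  var useControlStrings: Boolean = false

  /** Groups already-tokenized words into sentences (hypothetical helper). */
  def split(words: Seq[String]): Seq[Seq[String]] = {
    val sentences = Seq.newBuilder[Seq[String]]
    var current = Vector.empty[String]
    for (word <- words) {
      if (useControlStrings && word == SENTENCE_BREAK_CONTROL_STRING) {
        // The control token closes the current sentence and is NOT emitted.
        if (current.nonEmpty) sentences += current
        current = Vector.empty
      } else {
        // With the flag off, "[SB]" is treated like any other token.
        current :+= word
      }
    }
    if (current.nonEmpty) sentences += current
    sentences.result()
  }
}
```

With the flag left at its default, `split(Seq("Hello", "[SB]", "world"))` keeps all three tokens in one sentence; after setting `useControlStrings = true`, the same input yields two sentences and the control token disappears.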
Good idea. Can you please add it?
Can this be set without needing to create a custom Processor?
No, because we would need to adjust the corresponding ANTLR grammar. Unless we come up with a generic format for the control string, e.g., anything between square brackets? Or anything between double square brackets, e.g., [[SB]]? Then we could let people set the string to whatever value they want.
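To make the double-bracket idea concrete, here is a rough sketch of the Scala side only. The object name, the regex, and the `controlName` helper are hypothetical; the ANTLR lexer would need a matching rule so that such a string survives as a single token, and that part is not shown.

```scala
import scala.util.matching.Regex

object GenericControlStringSketch {
  // Assumed generic format: a non-empty, bracket-free name between double square brackets.
  val ControlString: Regex = """\[\[([^\[\]]+)\]\]""".r

  /** Returns the name inside the brackets if the whole token is a control string. */
  def controlName(token: String): Option[String] = token match {
    case ControlString(name) => Some(name) // Regex extractors match the entire string
    case _                   => None
  }
}
```

Under this format `controlName("[[SB]]")` returns `Some("SB")`, while a single-bracket `[SB]` or a token with surrounding text is not recognized.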
@@ -186,6 +196,10 @@ class SpanishSentenceSplitter extends RuleBasedSentenceSplitter {
object SentenceSplitter {
val EOS: Regex = """^[\.!\?\s]+$""".r

// Control string that enforces a sentence break
// If you change this value, change also the SENTENCEBREAK in OpenDomainLexer.g to the same value (and recompile the Antlr grammar)
val SENTENCE_BREAK_CONTROL_STRING = "[SB]"
So the procedure would be to figure out before tokenization, i.e., before Processor.mkDocument, where the control strings should be, such as wherever there's a <br>, and change them to [SB]. These two strings happen to be the same length, so one could take the resulting Document and substitute the old text for the new text in order to preserve the original. If the strings are of different lengths, all the offsets would be off and the substitution won't work. We would lose (easy) access to the original document text. Will that be a problem?
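The same-length case described here can be sketched as a plain string preprocessing step. The object and the `markBreaks` helper are hypothetical and do not call Processor.mkDocument; they only show that offsets computed on the marked text still line up with the original.

```scala
object SameLengthReplacementSketch {
  val SENTENCE_BREAK_CONTROL_STRING = "[SB]"

  /** Replaces each literal marker with the control string; refuses markers of a different length. */
  def markBreaks(original: String, marker: String): String = {
    require(
      marker.length == SENTENCE_BREAK_CONTROL_STRING.length,
      "marker and control string differ in length, so offsets would shift"
    )
    original.replace(marker, SENTENCE_BREAK_CONTROL_STRING)
  }
}
```

For `"First line<br>Second line"` the marked text has the same length as the original, so a character offset found in one is valid in the other. Passing `"\n"` as the marker fails the `require`, which is the different-length situation raised above.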
I think it is likely that this will change offsets (e.g., when replacing newlines with '[SB]'). Users need to be aware of this.
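One way a user could cope with shifted offsets is to record, while rewriting, where each character of the new text came from. This is only a sketch of that idea; the object and `replaceWithOffsets` are hypothetical and nothing like it is part of this PR.

```scala
object OffsetMapSketch {
  val SENTENCE_BREAK_CONTROL_STRING = "[SB]"

  /** Returns the rewritten text and, for each of its characters, the offset of its source in the original. */
  def replaceWithOffsets(original: String, marker: String): (String, Vector[Int]) = {
    require(marker.nonEmpty, "marker must not be empty")
    val text = new StringBuilder
    val offsets = Vector.newBuilder[Int]
    var i = 0
    while (i < original.length) {
      if (original.startsWith(marker, i)) {
        // Every character of the control string points back at the start of the marker.
        text.append(SENTENCE_BREAK_CONTROL_STRING)
        SENTENCE_BREAK_CONTROL_STRING.foreach(_ => offsets += i)
        i += marker.length
      } else {
        text.append(original.charAt(i))
        offsets += i
        i += 1
      }
    }
    (text.toString, offsets.result())
  }
}
```

Replacing the newline in `"a\nb"` gives `"a[SB]b"`, and the offset vector maps the final `b` (position 5 in the new text) back to position 2 in the original.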
I'm not sure we need to worry about supporting reinsertion of the original token in this case of sentence boundaries (at least not at this stage).
That said, I think it is something we should think about supporting for cases where a user wants to preserve unrecognized tokens (e.g., through re-insertion).
@kwalcock, would you feel more comfortable using a control string with higher entropy (e.g., <[*^[SB]^*]>)?
I'm happy with it just being off by default. If someone turns it on (a simple SentenceSplitter.useControlString = true) and it's important, they can be responsible for making sure that the control string is not already in their text and, if necessary, escaping it before and unescaping it after, etc.
We do in general have cases in which provenance is important and the original text needs to be preserved. This new feature is still useful and can be used when the original is not so important, though.
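The escaping and unescaping mentioned above could look roughly like this on the user's side. The placeholder `[_SB_]` and both helpers are hypothetical; whether the tokenizer keeps such a placeholder intact is not verified here, and escaping changes text length, so the offset caveat applies to it as well.

```scala
object ControlStringEscapeSketch {
  val SENTENCE_BREAK_CONTROL_STRING = "[SB]"

  // Hypothetical placeholder; the caller must pick one that cannot occur in their text.
  val ESCAPED = "[_SB_]"

  /** Hides literal occurrences of the control string before the user inserts real ones. */
  def escape(text: String): String = {
    require(!text.contains(ESCAPED), "placeholder already occurs in the text")
    text.replace(SENTENCE_BREAK_CONTROL_STRING, ESCAPED)
  }

  /** Restores the literal occurrences after processing. */
  def unescape(text: String): String =
    text.replace(ESCAPED, SENTENCE_BREAK_CONTROL_STRING)
}
```

The `require` in `escape` is what makes the round trip safe: without it, a placeholder already present in the input would be turned into `[SB]` by `unescape`.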
Make use of control strings optional
@MihaiSurdeanu @kwalcock , I had need of this again today and it got me thinking: It would be helpful to have a test related to what we expect the value of a Document's
@myedibleenso, can you check TestMkCombinedDocuments?
What do you think, @kwalcock, @myedibleenso?
See the unit test for the expected behavior.