Added control string for sentence breaks #676

MihaiSurdeanu · 2022-11-18T00:47:44Z

What do you think @kwalcock, @myedibleenso ?
See the unit test for the expected behavior.

kwalcock

I think the code is fine. I'm a little worried about ramifications, but don't let that hold it back.

kwalcock · 2022-11-18T16:10:14Z

main/src/main/java/org/clulab/processors/clu/tokenizer/OpenDomainLexer.tokens

+WHITESPACE=22
+SEQ_OF_UNICODES=23
+ErrorCharacter=24
+'[SB]'=8


This is odd, but I assume it is correct.

I.e., line 25.

It was what Antlr generates, so I am assuming it is correct.

kwalcock · 2022-11-18T16:12:50Z

main/src/main/scala/org/clulab/processors/clu/tokenizer/SentenceSplitter.scala

@@ -119,6 +119,16 @@ abstract class RuleBasedSentenceSplitter extends SentenceSplitter {
        endPositions += lastPosition + raw.head.length
      }

+      // found the control string that enforces sentence breaks
+      // note that this token is NOT added to the sentences produced
+      else if(crt.word == SENTENCE_BREAK_CONTROL_STRING) {


I would be reassured if this were also dependent on this.useControlStrings or SentenceSplitter.useControlStrings, which would default to false. Those who want to use the feature can turn it on if necessary.

Good idea. Can you please add it?

Can this be set without needing to create a custom Processor?

No... Because we need to adjust the corresponding antlr grammar. Unless we come up with a generic format for the control string, e.g., anything between square brackets? Or, anything between double square brackets, e.g., [[SB]]? Then we can let people set the string to whatever values they want.
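To make the "generic format" idea concrete, here is a minimal sketch of recognizing any control string written between double square brackets. The object name, the regex, and the restriction to uppercase letters are assumptions made for illustration; none of this is in the PR, and the Antlr grammar would still need a matching rule.

```scala
import scala.util.matching.Regex

// Hypothetical sketch: not part of the processors code base.
object ControlStringSketch {
  // Assumed generic format: uppercase letters inside double square brackets, e.g. [[SB]].
  val ControlString: Regex = """\[\[([A-Z]+)\]\]""".r

  // Returns the names of all control strings found in the text, in order of appearance.
  def controlNames(text: String): List[String] =
    ControlString.findAllMatchIn(text).map(_.group(1)).toList
}
```

With a pattern like this, a single-bracket string such as [SB] in ordinary text would not be treated as a control string, which lowers the risk of accidental matches.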

kwalcock · 2022-11-18T16:27:32Z

main/src/main/scala/org/clulab/processors/clu/tokenizer/SentenceSplitter.scala

@@ -186,6 +196,10 @@ class SpanishSentenceSplitter extends RuleBasedSentenceSplitter {
 object SentenceSplitter {
  val EOS: Regex = """^[\.!\?\s]+$""".r

+  // Control string that enforces a sentence break
+  // If you change this value, change also the SENTENCEBREAK in OpenDomainLexer.g to the same value (and recompile the Antlr grammar)
+  val SENTENCE_BREAK_CONTROL_STRING = "[SB]"


So the procedure would be to decide before tokenization (that is, before Processor.mkDocument) where the control strings should go, for example wherever there is a <br>, and replace those markers with [SB]. These two strings happen to be the same length, so one could take the resulting Document and substitute the old text for the new text in order to preserve the original. If the strings are of different lengths, all the offsets will be off and the substitution won't work, so we would lose (easy) access to the original document text. Will that be a problem?

I think it is likely that this will change offsets (e.g., when replacing newlines with '[SB]'). Users need to be aware of this.
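The offset concern can be illustrated with a small sketch that uses plain string replacement as a stand-in for the preprocessing step. The object and method names are hypothetical; only the "[SB]" value comes from the diff above.

```scala
// Hypothetical sketch of the preprocessing step: not part of the processors code base.
object OffsetSketch {
  val SentenceBreak = "[SB]" // value of SENTENCE_BREAK_CONTROL_STRING in the diff

  // Replaces every occurrence of `marker` with the control string.
  def insertControlStrings(text: String, marker: String): String =
    text.replace(marker, SentenceBreak)

  def main(args: Array[String]): Unit = {
    // Same-length marker: "<br>" and "[SB]" are both 4 characters, so offsets are unchanged.
    val sameLength = insertControlStrings("One<br>Two", "<br>")
    println(sameLength.indexOf("Two") == "One<br>Two".indexOf("Two"))

    // Shorter marker: each "\n" grows by 3 characters, so later offsets shift.
    val shifted = insertControlStrings("One\nTwo", "\n")
    println(shifted.indexOf("Two") - "One\nTwo".indexOf("Two"))
  }
}
```

Under these assumptions, any offsets computed on the modified text would have to be mapped back by hand whenever the marker and the control string differ in length.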

I'm not sure we need to worry about supporting reinsertion of the original token in this case of sentence boundaries (at least not at this stage).

That said, I think that is something we should think about supporting for cases where a user wants to preserve unrecognized tokens (ex. through re-insertion).

@kwalcock , would you feel more comfortable using a control string with higher entropy (ex. <[*^[SB]^*]>)?

I'm happy with it just being off by default. If someone turns it on, (a simple SentenceSplitter.useControlString = true) and it's important, they can be responsible for making sure that the control string is not already in their text and if necessary, escaping it before and unescaping it after, etc.

We do in general have cases in which provenance is important and the original text needs to be preserved. This new feature is still useful and can be used when the original is not so important, though.
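As a rough illustration of the "off by default" behavior discussed in this thread, the sketch below splits an already tokenized sequence at the control string only when a flag is set. It is a simplified stand-in rather than the RuleBasedSentenceSplitter logic: the object name, the method, and passing the flag as a parameter are all assumptions.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: not the actual SentenceSplitter implementation.
object GatedSplitSketch {
  val SentenceBreak = "[SB]" // value of SENTENCE_BREAK_CONTROL_STRING in the diff

  // With the flag off, the tokens come back as one sentence and "[SB]" stays an ordinary token.
  // With the flag on, each "[SB]" ends the current sentence and is itself dropped.
  def split(tokens: Seq[String], useControlStrings: Boolean = false): List[List[String]] = {
    if (!useControlStrings) List(tokens.toList)
    else {
      val sentences = ArrayBuffer.empty[List[String]]
      val current = ArrayBuffer.empty[String]
      for (token <- tokens) {
        if (token == SentenceBreak) {
          // Close the sentence collected so far; skip empty ones.
          if (current.nonEmpty) sentences += current.toList
          current.clear()
        }
        else current += token
      }
      if (current.nonEmpty) sentences += current.toList
      sentences.toList
    }
  }
}
```

Callers who opt in remain responsible for checking that "[SB]" does not already occur in their text.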

Make use of control strings optional

myedibleenso · 2024-01-27T02:00:22Z

@MihaiSurdeanu @kwalcock , I had need of this again today and it got me thinking: It would be helpful to have a test related to what we expect the value of a Document's text to be when the control string is used:

c/o B.A.Z. Bub[SB]
Morning Star Industries, Ltd.[SB]
666 Ring of Fire Circle[SB]
Lake of Fire, AZ 85666[SB]

signs you might be living in a simulation (recognize these warning signs)...[SB]
  - the earliest sound you remember from your childhood is the Windows startup theme[SB]
  - ....[SB]

kwalcock · 2024-01-27T05:27:36Z

@myedibleenso, can you check TestMkCombinedDocuments?

Mihai Surdeanu and others added 20 commits December 13, 2021 19:40

added vs code files to ignore

126c8d1

Merge branch 'master' of https://github.com/clulab/processors

35d5608

Merge branch 'master' of https://github.com/clulab/processors

f5e5f07

Merge branch 'master' of https://github.com/clulab/processors

afb6b8d

Merge branch 'master' of https://github.com/clulab/processors

2f8f2f0

Merge branch 'master' of https://github.com/clulab/processors

39460a4

Merge branch 'master' of https://github.com/clulab/processors

6487405

Merge branch 'master' of https://github.com/clulab/processors

efe157b

Merge branch 'master' of https://github.com/clulab/processors

1aadc76

Merge branch 'master' of https://github.com/clulab/processors

09fd9b7

Merge branch 'master' of https://github.com/clulab/processors

6cd8b39

Merge branch 'master' of https://github.com/clulab/processors

027a08a

Merge branch 'master' of https://github.com/clulab/processors

6debd6f

Merge branch 'master' of https://github.com/clulab/processors

76fed8c

Merge branch 'master' of https://github.com/clulab/processors

3cfb3d5

Merge branch 'master' of https://github.com/clulab/processors

9c03764

Merge branch 'master' of https://github.com/clulab/processors

6db7073

Merge branch 'master' of https://github.com/clulab/processors

c841522

Merge branch 'master' of https://github.com/clulab/processors

a88e7d2

Added control string for sentence breaks

def90ec

MihaiSurdeanu requested review from myedibleenso and kwalcock November 18, 2022 00:47

kwalcock approved these changes Nov 18, 2022


kwalcock added 3 commits November 18, 2022 12:20

Make use of control strings optional

b047378

Test better

cef093a

Merge pull request #678 from clulab/kwalcock/sentencebreak

3091960

Make use of control strings optional

myedibleenso approved these changes Nov 19, 2022


kwalcock mentioned this pull request Nov 30, 2022

Split a text into smaller pieces and make a single document of them. #685

Merged

kwalcock marked this pull request as draft February 15, 2023 17:01
