Member

@sjanssen2 sjanssen2 commented Mar 7, 2023

This PR is huge. Sorry. It adds the ability of automatic outside grammar generation to gapc, via the parameter --outside_grammar ALL.

The current implementation can "consume" a subword i,j from the user-provided input sequence(s) until every symbol between i and j is part of a candidate (the inside direction). With this PR, we reverse the direction: given i,j, we "consume" characters 0...i and j...n until we reach 0 and n (the outside direction). In practice, this lets us conceptually split all candidates at a given i,j and compute the inside parts (already possible) and the outside parts (new), which combined produce complete candidates. With this, we can easily compute e.g. posteriors or other useful information about the candidate space.
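To make the inside/outside split concrete, here is a minimal sketch that counts candidates for a toy grammar S -> S S | 'a' on the input a^n. The grammar, the names `inside_counts` and `outside_counts`, and the table layout are assumptions chosen for illustration; this is not the code gapc generates.

```cpp
#include <cstddef>
#include <vector>

// Toy grammar (hypothetical, not from this PR):  S -> S S | 'a'
// Input is a^n, so only the length n matters. Tables are indexed by
// subword boundaries 0 <= i < j <= n.
using Table = std::vector<std::vector<long long>>;

// Inside: number of candidates that derive the subword (i, j).
Table inside_counts(std::size_t n) {
  Table in(n + 1, std::vector<long long>(n + 1, 0));
  for (std::size_t len = 1; len <= n; ++len) {      // small spans first
    for (std::size_t i = 0; i + len <= n; ++i) {
      std::size_t j = i + len;
      if (len == 1) {
        in[i][j] = 1;                               // S -> 'a'
        continue;
      }
      for (std::size_t k = i + 1; k < j; ++k) {
        in[i][j] += in[i][k] * in[k][j];            // S -> S S
      }
    }
  }
  return in;
}

// Outside: number of ways to complete a candidate around (i, j),
// i.e. by consuming 0..i and j..n.
Table outside_counts(std::size_t n, const Table &in) {
  Table out(n + 1, std::vector<long long>(n + 1, 0));
  if (n == 0) return out;
  out[0][n] = 1;                                    // nothing left to consume
  for (std::size_t len = n - 1; len >= 1; --len) {  // large spans first
    for (std::size_t i = 0; i + len <= n; ++i) {
      std::size_t j = i + len;
      // (i, j) is the right child; the left sibling covers (k, i).
      for (std::size_t k = 0; k < i; ++k) {
        out[i][j] += out[k][j] * in[k][i];
      }
      // (i, j) is the left child; the right sibling covers (j, k).
      for (std::size_t k = j + 1; k <= n; ++k) {
        out[i][j] += out[i][k] * in[j][k];
      }
    }
  }
  return out;
}
```

For n = 4, `in[0][4]` is 5, and `in[i][i + 1] * out[i][i + 1]` is also 5 at every position i, because every candidate contains the leaf at position i exactly once. Note that the outside loop runs from large spans to small ones, the reverse of the inside loop.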

I've tried to encapsulate the code in three phases and to offload most of the new functions into src/outside, but of course the changes also touch existing inside parts and thus require changes to existing code, which I tried to keep minimal. The phases are:

  1. grammar_transformation: convert the user defined inside grammar into a grammar that additionally contains outside rules which reflect the structure of the inside grammar, but operate from inside to outside
  2. middle_end: the new direction requires running indices from i,j towards 0 and n. Thus, moving boundaries and loops require reverse order. Code in middle_end produces these different non-terminal functions.
  3. codegen: the result of the algorithm is no longer the single DP cell (0,n) for the axiom, but the values of every (tabulated) NT at every position i,j. Thus, codegen produces a function print_insideoutside_report_fn to report these many values (also consider multi-track grammars with more than two dimensions). To limit the output, a user can define which NTs shall be reported via the gapc parameter --outside_grammar X, where X is one non-terminal. Repeated use of --outside_grammar with different X will lead to multiple NTs being reported. If the user provides ALL as non-terminal, all NTs will be reported.
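The reporting selection described in phase 3 can be sketched roughly as follows. This is a simplified stand-in that works on plain strings, not the actual gapc code: the function name `select_reported_nts`, its signature, and the treatment of ALL when it is mixed with other names are assumptions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of gapc): decide which non-terminals to
// report, given the NTs of the grammar and the values collected from
// repeated --outside_grammar arguments.
std::vector<std::string> select_reported_nts(
    const std::vector<std::string> &grammar_nts,
    const std::vector<std::string> &requested,
    std::vector<std::string> *unknown) {
  // Assumption: "ALL" anywhere in the request selects every NT,
  // in the order in which the NTs occur in the grammar.
  if (std::find(requested.begin(), requested.end(), std::string("ALL")) !=
      requested.end()) {
    return grammar_nts;
  }
  std::vector<std::string> result;
  for (const std::string &name : requested) {
    bool in_grammar = std::find(grammar_nts.begin(), grammar_nts.end(),
                                name) != grammar_nts.end();
    if (!in_grammar) {
      // Requested NT does not exist: collect it so the caller can warn.
      if (unknown != nullptr) unknown->push_back(name);
      continue;
    }
    // Keep the order of the user arguments and skip duplicates.
    if (std::find(result.begin(), result.end(), name) == result.end()) {
      result.push_back(name);
    }
  }
  return result;
}
```

With grammar NTs `struct`, `hairpin`, `stack` (made-up names) and the request `stack`, `struct`, `foo`, the sketch returns `stack`, `struct` in that order and records `foo` as unknown.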

We run multiple semantic checks to warn users if outside grammar generation is not meaningful (e.g. the empty word cannot be parsed) or if algebra functions would use mixed data types.

The new function shall work with CYK, CYK+openMP (only single track) and Unger code generation, checkpointing, multitrack.

  • semantic checks:
    • grammar cannot parse the empty word
    • are all requested NTs in the grammar, for --outside_grammar?
    • does the algebra use mixed types, i.e. answer foo(answer_bar), if so outside cannot be generated as we will lack answer_bar foo(answer) for outside parts
      • complete definition of is_terminal_type for ALL types!
  • resolve_blocks:
    • check if multi_filter really needs to become public
  • other ToDos
    • remove #ifdef LOOPDEBUG parts
    • add documentation for middle_end and codegen
    • don't forget to revert actions back to master branch of gapc-test-suite!
    • cyk
    • cyk + checkpointing
    • cyk + openMP
    • cyk + openMP + checkpointing
    • how to deal with non-terminals that have no choice function applied?

P.S. this PR shall replace #122

@sjanssen2 sjanssen2 added the WIP work in progress, do not (yet) merge label Mar 7, 2023
Member

@kmaibach kmaibach left a comment

This is indeed a very large PR and I fear that I will not understand everything that's going on here. But I try to work through it and hope it's fine that I'll ask some questions while doing so.

  1. One general question regarding grammar transformation:
    Can the grammar be saved to a file or is it only reported to stdout?
    Would it be useful to include that so the user can have a look at it later?

  2. Just for my understanding: Why would I want to exclude NTs from outside transformation? Are there cases where some NTs don't influence the probabilities of the rest of a grammar? Or are those only NTs of the form X -> a ?

  3. Not necessary for this PR but out of curiosity: Did you check the runtime of the code generation regarding grammar size (number of NTs)? Would be interesting how it changes with larger grammars.

template<typename alphabet, typename pos_type, typename T>
inline bool complete_track(
    const Basic_Sequence<alphabet, pos_type> &seq, T i, T j) {
  return ((i == seq.n) && (j == seq.n));
Member

What exactly is checked here?



std::list<Symbol::NT*> *NTs_to_report(const AST &ast) {
/* define which non-terminals shell be reported to the user
Member

'shall' not 'shell'

/* define which non-terminals shell be reported to the user
* order of user arguments (--outside_grammar) shall take precedence over
* NTs as occurring in source code of grammar.
* - User shell be warned, if outside version of NT has not been generated.
Member

See above.

@sjanssen2
Member Author

  1. One general question regarding grammar transformation:
    Can the grammar be saved to a file or is it only reported to stdout?
    Would it be useful to include that so the user can have a look at it later?

The grammar is NOT printed to stdout, but you can "plot" it via --plot-grammar 1. This will not report gap-L code but a visual representation. And yes, I think it is useful to inspect this automatic generation. I've added this hint to the help message of gapc.

@sjanssen2
Member Author

2. Just for my understanding: Why would I want to exclude NTs from outside transformation? Are there cases where some NTs don't influence the probabilities of the rest of a grammar? Or are those only NTs of the form X -> a ?

You would not manually consider excluding NTs, but yes, there are "inside" productions that lack outside analogues because they have no r.h.s. non-terminals as in your example! (I guess a is a terminal)

@sjanssen2
Member Author

3. Not necessary for this PR but out of curiosity: Did you check the runtime of the code generation regarding grammar size (number of NTs)? Would be interesting how it changes with larger grammars.

Theory says that every CFG can be transformed into Chomsky Normal Form (CNF), i.e. max. width is 1 and each production would have at most 2 outside productions. Thus, the asymptotic run time will not change; only a constant factor is added. For CNF, this yields a factor of ~3. For ADP CFGs with width > 1, this factor can be higher (it depends on the number of r.h.s. NTs), but the asymptotic run time remains the same.
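As a rough illustration of where the factor of ~3 comes from, consider a single CNF production A -> B C. Assuming plain sum/product scoring (an assumption made here for readability; the symbols in and out are shorthand, not gapc identifiers), the inside rule and the two outside contributions derived from it can be written as:

```latex
\begin{align*}
\mathrm{in}_A(i,j)
  &= \sum_{i<k<j} \mathrm{in}_B(i,k)\,\mathrm{in}_C(k,j) \\
\mathrm{out}_B(i,k)\big|_{A \to B\,C}
  &= \sum_{k<j\le n} \mathrm{out}_A(i,j)\,\mathrm{in}_C(k,j) \\
\mathrm{out}_C(k,j)\big|_{A \to B\,C}
  &= \sum_{0\le i<k} \mathrm{in}_B(i,k)\,\mathrm{out}_A(i,j)
\end{align*}
```

Each outside value is the sum of such contributions over all productions in which the NT occurs on the r.h.s. One inside rule thus leads to two outside rules of the same cost, i.e. three rules in total.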

@sjanssen2
Member Author

hope it's fine that I'll ask some questions

these are great questions! Please don't hesitate to ask more of those!

@kmaibach
Member

  2. Just for my understanding: Why would I want to exclude NTs from outside transformation? Are there cases where some NTs don't influence the probabilities of the rest of a grammar? Or are those only NTs of the form X -> a ?

You would not manually consider excluding NTs, but yes, there are "inside" productions that lack outside analogues because they have no r.h.s. non-terminals as in your example! (I guess a is a terminal)

Yes, a is a terminal.

I think I have seen it somewhere but are there checks to see if an inclusion or exclusion of certain NTs is problematic? Does the user get a warning if they don't include NTs?

@sjanssen2
Member Author

Outside grammar generation is fully automatic, i.e. the user has no say in which NTs are processed. Therefore, they cannot do anything wrong here, except design an inside grammar that cannot parse the empty word; in that case, a corresponding warning is reported to the user.
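For illustration, the empty-word condition can be tested with the textbook "nullable" fixpoint on a plain CFG. The sketch below is an assumption about how such a check could look, not gapc's implementation: the types `Rhs` and `Grammar`, the function `can_parse_empty_word`, and the convention that the symbol name `EMPTY` stands for the empty word are all hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// One right-hand side is a list of symbol names. Symbols that are not keys
// of the grammar map are treated as terminals; "EMPTY" is the empty word.
using Rhs = std::vector<std::string>;
using Grammar = std::map<std::string, std::vector<Rhs>>;

// Returns true if the axiom can derive the empty word.
bool can_parse_empty_word(const Grammar &g, const std::string &axiom) {
  std::set<std::string> nullable;
  bool changed = true;
  while (changed) {  // iterate until no new nullable NT is found
    changed = false;
    for (const auto &entry : g) {
      if (nullable.count(entry.first) > 0) continue;
      for (const Rhs &rhs : entry.second) {
        bool all_nullable = true;
        for (const std::string &sym : rhs) {
          if (sym != "EMPTY" && nullable.count(sym) == 0) {
            all_nullable = false;
            break;
          }
        }
        if (all_nullable) {  // every symbol of this alternative can vanish
          nullable.insert(entry.first);
          changed = true;
          break;
        }
      }
    }
  }
  return nullable.count(axiom) > 0;
}
```

Under these assumptions, a grammar with the alternatives INT, formula '+' formula and formula '*' formula is not nullable; adding an EMPTY alternative makes it nullable.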

@kmaibach
Member

I think I completely misunderstood this passage in your explanation:

To limit the output, a user can define which NTs shall be reported via the gapc parameter --outside_grammar X, where X is one non-terminal. Repeated use of --outside_grammar with different X will lead to multiple NTs being reported. If the user provides ALL as non-terminal, all NTs will be reported.

I thought that you could exclude the NTs in the code generation, NOT (only) in the visual report. But I think the latter is what you really mean, right?

So, my question would be irrelevant then. Sorry for the confusion.

@sjanssen2
Member Author

Ah, that is the misunderstanding. Correct, it only controls which results are reported on stdout. One is typically not interested in all cell values of all NTs, so restricting the report saves output lines. But in general, we cannot know what the user is interested in.

@fymue
Collaborator

fymue commented Jun 22, 2023

@sjanssen2 Regarding the checkpointing test that keeps failing: I can't really tell what exactly is going wrong by just looking at the log, but I assume that the test input is simply too short for this test if it is executed on multiple threads (with OMP). I ran into some issues before whenever I checkpointed programs that had execution times close to 1s, which is the checkpointing interval used in all tests. We could maybe increase the input length a bit so the test runs a bit longer and see if it still keeps failing.

@sjanssen2
Member Author

I've increased test input sequences as suggested by @fymue - seems to help :-)
Do you have further issues regarding this PR @fymue @kmaibach ?

@fymue
Collaborator

fymue commented Jun 27, 2023

I've increased test input sequences as suggested by @fymue - seems to help :-)

Very good.

Do you have further issues regarding this PR @fymue @kmaibach ?

Nothing else from my side.

Member

@kmaibach kmaibach left a comment

Alright, so here are my final comments.
I looked through everything and have some remarks and questions left.

But other than that it should be it.

formula = number(INT)
        | add(formula, CHAR('+'), formula)
        | mult(formula, CHAR('*'), formula)
        | nil(EMPTY)
Member

heinz and minus are missing here

Member Author

interestingly, the compiler does not complain if an algebra defines the body of an algebra function despite the fact that this algebra function is not declared in the signature! Bug or feature? I am using the behavior to smuggle in a normalization algebra function for computation of derivatives, part of #151
Here, I was testing that gapc really does not throw an error about heinz.

minus is not used in the grammar, as this would violate Bellman's Principle in combination with the defined algebra functions. It's basically a leftover from copying and pasting the code from the teaching example.

}
}

//algebra alg_dotBracket implements sig_foldrna(alphabet = char, answer = string) {
Member

multi-line comments would be better

Member Author

or remove this unused algebra definition altogether :-)

@sjanssen2
Member Author

Hi @kmaibach I hope I have addressed your latest issues appropriately. Can you check again?

@sjanssen2 sjanssen2 merged commit df87aca into master Jul 5, 2023

Labels

enhancement New feature or request

3 participants