-
Notifications
You must be signed in to change notification settings - Fork 6
Outside #187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed a very large PR and I fear that I will not understand everything that's going on here. But I try to work through it and hope it's fine that I'll ask some questions while doing so.
-
One general question regarding grammar transformation:
Can the grammar be saved to a file or is it only reported to stdout?
Would it be useful to include that so the user can have a look at it later? -
Just for my understanding: Why would I want to exclude NTs from outside transformation? Are there cases where some NTs don't influence the probabilities of the rest of a grammar? Or are those only NTs of the form X -> a ?
-
Not necessary for this PR but out of curiosity: Did you check the runtime of the code generation regarding grammar size (number of NTs)? Would be interestring how it changes with larger grammars.
template<typename alphabet, typename pos_type, typename T> | ||
inline bool complete_track( | ||
const Basic_Sequence<alphabet, pos_type> &seq, T i, T j) { | ||
return ((i == seq.n) && (j == seq.n)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is checked here?
src/outside/codegen.cc
Outdated
|
||
|
||
std::list<Symbol::NT*> *NTs_to_report(const AST &ast) { | ||
/* define which non-terminals shell be reported to the user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'shall' not 'shell'
src/outside/codegen.cc
Outdated
/* define which non-terminals shell be reported to the user | ||
* order of user arguments (--outside_grammar) shall take precedence over | ||
* NTs as occurring in source code of grammar. | ||
* - User shell be warned, if outside version of NT has not been generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s. above
The grammar is NOT printed to stdout, but you can "plot" it via |
You would not manually consider excluding NTs, but yes, there are "inside" productions that lack outside analogues because they have no r.h.s. non-terminals as in your example! (I guess |
Theory says that every CFG can be transformed into Chomsky Normal Form (CNF), i.e. max. width is 1 and would have at most 2 outside productions. Thus, the asymptotic run time will not change; only a constant factor is added. For CNF, this yields a factor of ~3. For ADP CFGs with width > 1, this factor can be higher (depends on the number of r.h.s. NTs) but the asymptotic remains the same. |
these are great questions! Please don't hesitate to ask more of those! |
Yes, a is a terminal. I think I have seen it somewhere but are there checks to see if an inclusion or exclusion of certain NTs is problematic? Does the user get a warning if they don't include NTs? |
outside grammar generation is fully automatic, i.e. the user has no saying in which NTs to process. Therefore, he/she cannot do anything wrong here, except designing an inside grammar that cannot parse the empty word - an according warning will be reported to the user. |
I think I completely misunderstood this passage in your explanation:
I thought that you could exclude the NTs in the code generation NOT (only) in die visual report. But I think the latter is what you really mean, right? So, my question would be irrelevant then. Sorry for the confusion. |
ah, that is the misunderstanding. Correct, it only controls which results are reported on stdout. One is typically not interested in all cell values of all NTs, thus one might save printing lines. But in general, we cannot know what the user is interested in. |
@sjanssen2 Regarding the checkpointing test that keeps failing: I can't really tell what exactly is going wrong by just looking at the log, but I assume that the test input is simply too short for this test if it is executed on multiple threads (with OMP). I ran into some issues before whenever I checkpointed programs that had execution times close to 1s, which is the checkpointing interval used in all tests. We could maybe increase the input length a bit so the test runs a bit longer and see if it still keeps failing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, so here are my final comments.
I looked through everything and have some remarks and questions left.
But other than that it should be it.
formula = number(INT) | ||
| add(formula, CHAR('+'), formula) | ||
| mult(formula, CHAR('*'), formula) | ||
| nil(EMPTY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heinz
and minus
are missing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interestingly, the compiler does not complain if an algebra defines the body of an algebra function despite the fact that this algebra function is not declared in the signature! Bug or feature? I am using the behavior to smuggle in a normalization algebra function for computation of derivatives, part of #151
Here, I was testing that gapc really does not throw an error about heinz
.
minus
is not used in the grammar as this would violate Bellman's Principle in combination with the defined algebra functions. It's basically a left over from copy and pasting the code from the teaching example.
} | ||
} | ||
|
||
//algebra alg_dotBracket implements sig_foldrna(alphabet = char, answer = string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multi-line comments would be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or remove this unused algebra definition altogether :-)
Hi @kmaibach I hope I have addressed your latest issues appropriately. Can you check again? |
This PR is huge. Sorry. It adds the ability of automatic outside grammar generation to gapc, via the parameter
--outside_grammar ALL
.The current implementation can "consume" a subword i,j from the user provided input sequence(s) until every symbol between i and j are parts of candidates (the inside direction). With this PR, we revert the direction: given i,j we "consume" characters 0...i and j...n until we reach 0 and n (the outside direction). In practice, this allows to conceptually split all candidates at a give i,j and compute inside (was already possible) and outside (new) parts which combined will produce complete candidates. With this, we can easily compute e.g. posteriors or other useful information of the candidate space.
I've tried to capsule the code into three phases and create offload most of the new functions into
src/outside
, but of course changes also touch existing inside parts and thus need changes of existing code - which I tried to keep minimal. The phases aregrammar_transformation
: convert the user defined inside grammar into a grammar that additionally contains outside rules which reflect the structure of the inside grammar, but operate from inside to outsidemiddle_end
: the new direction requires running indices from i,j towards 0 and n. Thus, moving boundaries and loops require reverse order. Code in middle_end produces these different non-terminal functions.codegen
: result of the algorithm is no longer the single DP cell (0,n) for the axiom, but values of every (tabulated) NT at every position i,j. Thus, codegen produces a functionprint_insideoutside_report_fn
to report these many values (also consider multi track grammar with more than two dimensions). To limit the output, a use can define which NTs shall be reported via the gapc parameter--outside_grammar X
where X is one non-terminal. Repeatedly use of--outside_grammar
with different X will lead to multiple NTs being reported. If the user providesALL
as non-terminal, all NTs will be reported.We ran multiple semantic checks to warn users if outside grammar generation is not meaningful (e.g. empty word cannot be parsed) or algebra functions would use mixed data types
The new function shall work with CYK, CYK+openMP (only single track) and Unger code generation, checkpointing, multitrack.
--outside_grammar
?answer foo(answer_bar)
, if so outside cannot be generated as we will lackanswer_bar foo(answer)
for outside partsis_terminal_type
for ALL types!multi_filter
really needs to becomepublic
#ifdef LOOPDEBUG
partsP.S. this PR shall replace #122