Semantics of null array elements in logical structure objects that can be arrays #308

Closed
petervwyatt opened this issue Jul 6, 2023 · 63 comments
Assignees
Labels
ISO approved Resolved issue approved by ISO

Comments

@petervwyatt
Copy link
Member

petervwyatt commented Jul 6, 2023

Related to Errata #157 and Arlington Issue #90, but specifically for logical structure. This is effectively checking the rules for any objects that can be an array (either directly or as a value in a name or number-tree) - can the following array objects in logical structure contain null objects as array elements?

  • StructTreeRoot/K array - can this have null array elements?

  • StructTreeRoot/ParentTree - specifically when the value in the number-tree is an array "For a page object or content stream containing marked-content sequences that are content items, the value shall be an array of references to the parent elements of those marked-content sequences." - can an element in that array be null? It would appear at least one popular implementation uses 'null'...

  • StructElem/K - noting that the array format can be an array of various kinds of dicts, or an array of integers

  • StructElem/A - noting also that this array can have an optional revision number after a dict/stream entry so there is potential ambiguity as to whether a null is a dict/stream or revision in the array elements

  • StructElem/C - noting also that this array can have an optional revision number after a name so there is potential ambiguity as to whether a null is a name or revision in the array elements

  • StructElem/Ref

@petervwyatt petervwyatt added the bug Something isn't correct label Jul 6, 2023
@petervwyatt petervwyatt added this to the Tagged PDF related milestone Jul 6, 2023
@petervwyatt
Copy link
Member Author

@bdoubrov @DuffJohnson @mrbhardy - please weigh in. Happy if you wish to take discussion to another TWG first...

@petervwyatt petervwyatt added documentation Improvements or additions to documentation question Further information is requested and removed bug Something isn't correct labels Jul 6, 2023
@bdoubrov
Copy link

bdoubrov commented Jul 6, 2023

I would say that null is allowed only in StructTreeRoot/ParentTree and not permitted in all other places listed above.

To me the presence of null in StructTreeRoot/ParentTree should be an implication of a more general statement that null is permitted as a value in both name-tree and number-tree data structures with the same provision as null values in dictionaries:

A dictionary entry whose value is null (see 7.3.9, "Null object") shall be treated the same as if the entry does not exist.

@car222222
Copy link

Re: StructTreeRoot/ParentTree . . .

  • can an element in that array be null?

This would presumably indicate that there is a MCS content item that has no parent element (whose reference must be at that index in the array)!

Such orphaned content items (with an MCID) seem unlikely to actually exist, so what meaning does this "popular implementation" give to such use of 'null'? (Quoting: "It would appear at least one popular implementation uses 'null'".)

@car222222
Copy link

Re StructTreeRoot/K :

Table 354 clearly states that this is either a single dictionary or an array of dictionaries.

But the null object is definitely not a dictionary, so how can this value be legal here?

@petervwyatt
Copy link
Member Author

... or an array of dictionaries.

So where an element in the array is null: [ ... null ... ] - not a null array.

So consider these situations:

  • /K [ ]
  • /K [ null ]
  • /K [ 10 0 R null 11 0 R ]

Which ones do you think are valid?

@petervwyatt
Copy link
Member Author

StructTreeRoot/ParentTree is more awkward to do as an example as it is a number tree. From Table 354 "the value shall be an array of references to the parent elements of those marked-content sequences" - can this array have null array elements? [ 10 0 R null 11 0 R ] where 10 0 R and 11 0 R are valid parents?

@bdoubrov
Copy link

bdoubrov commented Jul 6, 2023

Using null as a value in the StructTreeRoot/ParentTree (or in any name-tree or number-tree) is a very pragmatic solution for deleting elements from the tree. The only alternative would be rebuilding the tree, which is often too inefficient and, in the case of StructTreeRoot/ParentTree, might require modifying MCID properties of marked content sequences in the content stream.

@petervwyatt
Copy link
Member Author

Using null as a value in the StructTreeRoot/ParentTree ...

I'm referring to the value of the name-/number-tree that is an array and that array then having null elements.
So not the value elements of the Nums or Names arrays, although what you say is also relevant.

But I would be nervous about making global assumptions that every name-tree or number-tree in PDF can always accept null as an acceptable value without first explicitly checking the wording of each use. In the same way, we cannot assume that an empty array [ ] and an array containing only nulls [ null ] are the same, as we need to check for any requirements about array lengths, etc.

@bdoubrov
Copy link

bdoubrov commented Jul 7, 2023

You are right, I meant null elements inside arrays forming values in StructTreeRoot/ParentTree. Like this example taken from a real world document:

<<
/Type /StructTreeRoot
...
/ParentTree
    <<
     /Nums [0 [23 0 R null null 24 0 R 24 0 R] 2 [18 0 R 19 0 R 20 0 R 21 0 R] ]     
    >> 
>>

I agree this is different from having null as an element in the array value of Nums in the number-tree. So, I'm no longer sure why a creator of this particular example does not remove null values and sometimes (as in the example above) duplicates elements in the arrays in the ParentTree.

@car222222
Copy link

car222222 commented Jul 7, 2023

Yes indeed, so what can the null values possibly mean here? :

[23 0 R null null 24 0 R 24 0 R]

But note that it is not possible to remove values from these arrays as they are indexed by the fixed MCIDs of actual content items.

But since each such content item has a parent structure, the use of null here must surely be an error.

@bdoubrov
Copy link

bdoubrov commented Jul 7, 2023

In this particular example the use case was to merge two structure elements into one. So, in the structure tree two elements were deleted and one added. The two nulls in the parent tree correspond to the deleted structure elements and the duplicated one (24 0 R) corresponds to the new merged structure element.

@car222222
Copy link

car222222 commented Jul 7, 2023

It seems that is just wrong. These entries are not indexed by structure elements but by content items with MCIDs.

Merging a structure element does not remove its content items from the stream, they are still there and have MCIDs. The structure element that is now the parent will need to be changed, but surely every content item with an MCID will still have a parent structure item.

If content items with MCIDs can exist that do not have any parent structure item, then how to represent this case should be explicitly documented in 14.7.5.4.

Currently it states only this:

"For a content stream containing marked-content sequences that are content items, the value shall be an array of indirect references to the sequences’ parent structure elements.
The array element corresponding to each sequence shall be found by using the sequence’s marked-content identifier as a zero-based index into the array."
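
As a minimal sketch of the lookup that the quoted text describes, assuming a hypothetical in-memory model in which an indirect reference is a string such as "23 0 R" and a PDF null is Python's None (the function name find_parent and the choice to treat an out-of-range MCID as "no parent" are illustrative assumptions, not taken from the specification):

```python
from typing import Optional, Sequence

# Hypothetical model: a reference is a string like "23 0 R"; PDF null is None.
Ref = Optional[str]


def find_parent(parent_array: Sequence[Ref], mcid: int) -> Ref:
    """Use the MCID as a zero-based index into the parent tree array.

    Returns None when the element is null or (by assumption) when the
    MCID falls outside the array.
    """
    if mcid < 0 or mcid >= len(parent_array):
        return None
    return parent_array[mcid]


# Array value taken from the real-world example earlier in this thread.
page0 = ["23 0 R", None, None, "24 0 R", "24 0 R"]
print(find_parent(page0, 0))  # 23 0 R
print(find_parent(page0, 1))  # None
print(find_parent(page0, 4))  # 24 0 R
```

The sketch makes the open question concrete: for MCID 1 and 2 the lookup yields null, and nothing in the quoted text says what a processor should conclude from that.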

@petervwyatt
Copy link
Member Author

I just ran a test over about 100,000 PDFs and a LOT of PDFs have this idiom with null in the array form of StructTreeRoot/ParentTree number-tree values (some with 100s and 1000s of instances)! But none of the other logical structure-related dict entries listed in my original post had null entries.

I did not check the content streams and whether any MCID still indexed those null array elements.

@car222222
Copy link

I think I can see one (more?) good reason for there being many nulls in these arrays.

This cause highlights a more general failing with the documentation of how these arrays, indexed by MCIDs, should work.

Some explanation of how these nulls arise:

The MCID numbers of the content items in a content stream can be arbitrary integers, they need not form a 'complete integer interval' such as {0,1,2,3,4,5}; they can be any set, such as {0,2,4}, or even (in theory) {0,2,4,652}.

But such a set of MCID numbers themselves must be used as indices into these arrays, thus in the second example above, the array would need to be of length 5, and in the final example 653. This is somewhat strange but it is the reality to be dealt with. But it is not covered in 14.7.5.4 or, I believe, anywhere else.

Sets of MCIDs such as {0,2,4} arise very naturally and there is no prescription for what entry to use in the "missing indices", 1 and 3 in this case. Thus "null" is used as a pragmatic solution that seems to work.

This pragmatic solution needs to be documented (as does the one below).
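
To illustrate the point about array length, here is a small sketch that builds a parent tree array from a sparse set of MCIDs, padding the unused indices with null. It assumes the same hypothetical model as elsewhere (references as strings, null as None); build_parent_array and the reference values are made up for illustration:

```python
from typing import Dict, List, Optional


def build_parent_array(parents_by_mcid: Dict[int, str]) -> List[Optional[str]]:
    """Build an array indexed by MCID; unused indices are filled with None (null)."""
    if any(mcid < 0 for mcid in parents_by_mcid):
        raise ValueError("MCIDs must be non-negative integers")
    if not parents_by_mcid:
        return []
    # The array must reach the largest MCID, so its length is max + 1.
    array: List[Optional[str]] = [None] * (max(parents_by_mcid) + 1)
    for mcid, ref in parents_by_mcid.items():
        array[mcid] = ref
    return array


sparse = build_parent_array({0: "10 0 R", 2: "11 0 R", 4: "12 0 R"})
print(sparse)  # ['10 0 R', None, '11 0 R', None, '12 0 R']

extreme = build_parent_array(
    {0: "10 0 R", 2: "11 0 R", 4: "12 0 R", 652: "13 0 R"}
)
print(len(extreme), extreme.count(None))  # 653 649
```

With MCIDs {0,2,4} the array has length 5 and two nulls; adding MCID 652 forces a length of 653, of which 649 entries are null.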

The following, distinct, possibility was previously indicated by @bdoubrov:

An existing MCID number may belong to a content item that has no structural parent. It is not clear whether this should also be indicated by using a null in this array; it may need a different value, such as the root node of the structure tree.

@petervwyatt petervwyatt added proposed solution Proposed solution is ready for review and removed question Further information is requested labels Jul 13, 2023
@petervwyatt petervwyatt self-assigned this Jul 13, 2023
@petervwyatt
Copy link
Member Author

There is an existing discussion in subclause 14.7.5.4 "Finding structure elements from content items" regarding MCIDs and the ParentTree, but it does not cover the cases we're discussing here. Interestingly searching for MCID does not find subclause 14.7.5.4 so this info is difficult to find...

Trying to summarize:

  • as far as my original post goes, we've settled that ONLY StructTreeRoot/ParentTree in its array form can validly have array elements that are null.
  • subclause 14.7.5.4 "Finding structure elements from content items" needs to state that null array entries can be used for unused marked-content identifiers - e.g. in the 2nd bullet above the NOTE and Table 354. This is a "can"/"may", not a "shall". This is @car222222's point above.
  • subclause 14.7.5.4 "Finding structure elements from content items" needs to state that null array entries shall be used for marked-content identifiers that do not have any structural parent - e.g. in the 2nd bullet above the NOTE. This is @bdoubrov's point.
  • all the other entries I mentioned in the original post shall not have array elements that are null - these file format requirements need to be added to the description of each of those entries in Table 354 or 355.
  • subclause 14.7.5.4 "Finding structure elements from content items" needs an editorial change so that the acronym "MCID" is mentioned as this is what most people search for. e.g. as a parenthetical in the 2nd bullet above the NOTE.

These are all file format requirements, not processor requirements.

Did I miss anything?

@petervwyatt
Copy link
Member Author

@bdoubrov @mrbhardy - can you please confirm the above prior to our next PDF TWG

@car222222
Copy link

car222222 commented Sep 13, 2023

I have not checked all the details here (apologies) but I can see no obvious deficiencies.

Here is one extra possible refinement to consider, re:

"* subclause 14.7.5.4 "Finding structure elements from content items" needs to state that null array entries shall be used for marked-content identifiers that do not have any structural parent"

It might be better to recommend using something other than a "null" value for such an MCID index in the array, so as to clearly distinguish this case, when the index is the MCID of a content item, from the case of a "non-existent" MCID (the previous bullet point).

One possibility is to use for this case the structure tree root.

This makes some sense since, even if there are no surviving structure elements in the tree, the structure tree root must survive in order for the "parents" number tree to be defined, without which these arrays cannot even be accessed.
If there is no structure tree root, then there is no "parents tree" and then there are no arrays indexed by MCIDs.

A useful result of this change will be a better outcome from the following events:
a content item in a content stream starts off with an MCID and a normal structure tree parent, but then subsequently the structure tree gets pruned in such a way that the content item gets totally orphaned (i.e., it has no structure parent).
If this happens then it is natural to make the structure tree root become the new parent of the content item.

@u-fischer
Copy link

One possibility is to use for this case the structure tree root.

That would imho not be allowed. If an MC points to a structure element as its parent, this structure element should also contain this MC as a kid, but the root can't contain content items as kids.

@car222222
Copy link

A good point.

So, is there a better way to make this important distinction between existing and non-existent MCIDs?

@car222222
Copy link

car222222 commented Sep 13, 2023

Although, of course, the "structure tree root" is not in fact a "structure element", so this stricture (that it must have the MC item as a kid) does not apply to it.

The idea is that this is simply better than using "null" as the "distinguished value" for both cases. Like "null" it is really just a name, not an actual "structure parent".

@u-fischer
Copy link

Although, of course, the "structure tree root" is not in fact a "structure element"

StructTreeRoot is explicitly mentioned in the matrix with a list of allowed children, and content items are not one of them:

image

But even if that were allowed: How would you then distinguish between "wanted" MCIDs put there on purpose and "unwanted" MCIDs which only got pushed out of other structures, e.g. because they are empty and should be ignored? Actually the main use of the null entry for us is to hide empty MCIDs the code produces, and it would be a bit counterproductive to make them visible again by moving them into the root.

@car222222
Copy link

But no one is adding any children to the root, it is merely used as an arbitrary marker, that is not "null".

If no one likes it then choose something else, such as "the integer 0".

@petervwyatt
Copy link
Member Author

It would be interesting if @petervwyatt could spot the same pattern in his own huge PDF set to pinpoint major culprits (at least we could stop this plague from parasitizing yottabytes of valuable memory around the world... 😉 ).

I will do so... but it will take some time.

@petervwyatt
Copy link
Member Author

It would also be good to get Adobe's input on this thread... @mrbhardy?

@car222222
Copy link

@stechio wrote: "it's beyond my imagination that a processor could be forced by its users to torture tiny bunches of innocent marked-content sequences . . . "

Maybe your ire should in fact be directed at the use of an array here: this could be a hangover from many decades ago when this was (intended to be) a document-wide list and hence too long for the use of any other data structure?

Since it is now only page-wide, there would be no real-world problems with using a dictionary here: this would contain keys only for the actually “current content items” that are also MCSs on this page.

Surely such a dictionary will never get too large for any processor limits? — although I can foresee scenarios that will produce potential numbers of such content items per page in the 100s (but only quite low 100s).

Another possibility is to still allow also the use of an array, but recommend that it is used for a page only when the size of the dictionary gets too large. Such alternatives would preserve compatibility whilst making life much easier for future developers.

@stechio
Copy link

stechio commented Sep 16, 2023

@car222222:

@stechio wrote: "it's beyond my imagination that a processor could be forced by its users to torture tiny bunches of innocent marked-content sequences . . . "

Maybe your ire should in fact be directed at the use of an array here: this could be a hangover from many decades ago when this was (intended to be) a document-wide list and hence too long for the use of any other data structure?

I can't grasp what you are referring to... The structure element lookup mechanism has been the same since the very inception of logical structure in PDF 1.3 (see 8.4.3 (Logical Structure)) — parent structure elements for a content stream were already indexed in page-scoped arrays; MCID was already defined as "an integer marked-content identifier that uniquely identifies the marked-content sequence within its content stream", it even recommended: "Because marked-content identifiers serve as indices into an array in the structural parent tree, their assigned values should be as small as possible to conserve space in the array". Did any precursor to PDF 1.3 logical structure ever exist?

My rage was about a blatantly abusive implementation: no matter how annoying a constraint may feel, a diligent implementer should always have the consequences of its decisions in mind! Paraphrasing a popular saying (uncertain attribution): always code as if the guys who end up maintaining the PDFs generated by your processor will be violent psychopaths who know where you live 😁 . (As I previously noted, the problem here was that PDF 1.3 spec didn't explicitly enforce a formal processor requirement for MCID assignment, leaving a degree of leeway to lazy/cynical interpretations)

To my understanding, this array structure is the most suitable for the purpose (indexed lookups), provided that its implementers judiciously handle it (ie, canonical (zero-based, contiguous) MCIDs only).

Since it is now only page-wide, there would be no real-world problems with using a dictionary here: this would contain keys only for the actually “current content items” that are also MCSs on this page.

Surely such a dictionary will never get too large for any processor limits? — although I can foresee scenarios that will produce potential numbers of such content items per page in the 100s (but only quite low 100s).

Another possibility is to still allow also the use of an array, but recommend that it is used for a page only when the size of the dictionary gets too large. Such alternatives would preserve compatibility whilst making life much easier for future developers.

My feeling is that trading indexes (efficiency) for keys (flexibility) would be a lazy game without significant benefits and with certain drawbacks. In real-world applications, how would a dictionary of structural parents behave over time (editing session after editing session) compared to a (canonically generated and managed) array? Sure, the array could fragment with discarded MCIDs, or possibly expand when all discarded MCIDs are reused (as enforced by my algorithm proposal); the dictionary could shrink and expand, but its memory footprint would be often larger both in serialized and deserialized forms than its counterpart. Dictionaries would be somewhat more comfortable to juggle, but where is the solid gain of them as structural parents instead of current arrays? IMO, only a benchmark over a statistically-significant collection of real-world PDFs could clarify this point.

@stechio
Copy link

stechio commented Sep 17, 2023

I updated my analysis over the null-ridden PDF file with the overall count of null occurrences — the results are mind-blowing 🤯...

@mrbhardy
Copy link

To add to @stechio's point, we generally try to avoid unconstrained dictionaries in PDF (even though in reality we do have a few, such as the role map dictionary). Historically, this was to reserve the key names as "atoms" in a popular implementation, but since dictionaries have no order, walking them to find an entry would be inefficient. Arrays seem like the right structure.

I agree with the outcome above, which is that only StructParents can have nulls and I don't think we can kill this off. We also can't really mandate processor behaviors in generating these, since we're trying to just tell them how to build syntax. A note added to the section on StructParents explaining that they can have null entries, but that it has negative impacts if used broadly (e.g. global MCID counters as mentioned previously) would be the right way forward.

@car222222
Copy link

Agreed that one would not want, in general, to use large dictionaries (for such purposes) unless they are implemented as hashed arrays so that look-up remains efficient.

@petervwyatt
Copy link
Member Author

I ran a test over 442 random PDFs.

For that mini-corpus, the record was 67,524 null entries in StructElems for this PDF (shareable here, since publicly available on the 'net): cgc-principles-and-recommendations-fourth-edn.pdf - for those that are interested... (Warning: it's compressed so it's difficult to see - and my mileage varied depending on the tool. Be warned!).

There were a few other PDFs with over 60,000 null entries in StructElems, with a total of 14 PDFs (of 442 total) having more than 1,000 null entries in StructElems but because of PII, licensing, etc I cannot share those. My tech doesn't report the total number of structure elements or page breakdowns so I cannot produce the other metrics @stechio was able to.

I could run more but I don't think it's necessary.

@mrbhardy
Copy link

@petervwyatt as I said previously, I think this is legal if a bad choice and the most we should do is add a note saying this is a bad idea.

@petervwyatt
Copy link
Member Author

Would you care to draft something? :-)

@petervwyatt
Copy link
Member Author

petervwyatt commented Sep 21, 2023

So circling back around to my original question of where is null valid, have we agreed on the following?

  • StructTreeRoot/K, StructElem/K, StructTreeRoot/IDTree, StructElem/A, StructElem/C and StructElem/Ref - "shall not" contain null elements. Where these entries are name/number-trees, this means that the value-i of the tree Names/Nums array "shall not" be null (i.e. for the odd array indices (zero-based): 1, 3, 5, ...) - as per Tables 36 and 37.

  • StructTreeRoot/ParentTree - "may" contain null elements and we'll add an informative note explaining this.
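
The two rules above could be checked along the following lines. This is only a sketch under assumptions: arrays are modelled as Python lists with None for null, the entry names are plain strings, and the helper names (null_indices, null_tree_values, violations) are hypothetical rather than part of any existing validator:

```python
from typing import List, Optional, Sequence

# Entries for which the thread proposes that null elements "shall not" occur.
NULL_FORBIDDEN = {
    "StructTreeRoot/K",
    "StructElem/K",
    "StructTreeRoot/IDTree",
    "StructElem/A",
    "StructElem/C",
    "StructElem/Ref",
}


def null_indices(array: Sequence[Optional[object]]) -> List[int]:
    """Zero-based indices of the null elements in an array."""
    return [i for i, element in enumerate(array) if element is None]


def null_tree_values(pairs: Sequence[Optional[object]]) -> List[object]:
    """Keys of a flat Names/Nums array [key, value, ...] whose value is null.

    Values sit at the odd zero-based indices 1, 3, 5, ...
    """
    return [pairs[i - 1] for i in range(1, len(pairs), 2) if pairs[i] is None]


def violations(entry: str, array: Sequence[Optional[object]]) -> List[int]:
    """Indices of null elements that break the proposed rule for this entry."""
    return null_indices(array) if entry in NULL_FORBIDDEN else []


print(violations("StructElem/K", ["10 0 R", None, "11 0 R"]))               # [1]
print(violations("StructTreeRoot/ParentTree", ["10 0 R", None, "11 0 R"]))  # []
print(null_tree_values(["a", "1 0 R", "b", None]))                          # ['b']
```

Note that for StructTreeRoot/ParentTree the sketch only tolerates nulls inside an array value; whether a null may appear directly as a value in Nums is a separate question raised earlier in the thread.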

@stechio
Copy link

stechio commented Sep 21, 2023

@petervwyatt:

I ran a test over 442 random PDFs.

When you say "random", what kind of collection method did you apply to gather those PDFs? Were they automatically harvested on the 'net with some randomized crawling strategy, or picked from existing collections, or...? (just to have an idea of their statistical significance)

For that mini-corpus, the record was 67,524 null entries in StructElems for this PDF (shareable here, since publicly available on the 'net): cgc-principles-and-recommendations-fourth-edn.pdf - for those that are interested... (Warning: it's compressed so it's difficult to see - and my mileage varied depending on the tool. Be warned!).

There were a few other PDFs with over 60,000 null entries in StructElems, with a total of 14 PDFs (of 442 total) having more than 1,000 null entries in StructElems but because of PII, licensing, etc I cannot share those. My tech doesn't report the total number of structure elements or page breakdowns so I cannot produce the other metrics @stechio was able to.

Thanks for sharing, @petervwyatt; here is the analysis of cgc-principles-and-recommendations-fourth-edn.pdf (in the meantime, I refined my inspection code further, with additional information (see column descriptions in the footnote here below)). As you can see, it shows the same denormalization anti-pattern as I previously encountered:

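| Page index | Indirect reference | Parent tree array size | Non-null element count | Actual structural MCS count | Null element count | Overhead (%) | Cumulative null element count |
|---|---|---|---|---|---|---|---|
| 0 | 1 0 R | 52 | 17 | 17 | 35 | 56 | 35 |
| 1 | 5 0 R | 113 | 61 | 61 | 52 | 34 | 87 |
| 2 | 7 0 R | 197 | 84 | 84 | 113 | 45 | 200 |
| 3 | 9 0 R | 288 | 91 | 91 | 197 | 60 | 397 |
| 4 | 11 0 R | 409 | 121 | 121 | 288 | 62 | 685 |
| . . . | . . . | . . . | . . . | . . . | . . . | . . . | . . . |
| 35 | 77 0 R | 3,269 | 97 | 97 | 3,172 | 95 | 57,280 |
| 36 | 79 0 R | 3,309 | 40 | 40 | 3,269 | 98 | 60,549 |
| 37 | 81 0 R | 3,437 | 128 | 128 | 3,309 | 94 | 63,858 |
| 38 | 83 0 R | 3,501 | 64 | 64 | 3,437 | 97 | 67,295 |
| 39 | 1716 0 R | 29 | 27 | 27 | 2 | 4 | 67,297 |
| 39 | 1731 0 R | 17 | 1 | 1 | 16 | 91 | 67,313 |
| 39 | 1733 0 R | 23 | 1 | 1 | 22 | 94 | 67,335 |
| 39 | 1732 0 R | 30 | 1 | 1 | 29 | 95 | 67,364 |
| 39 | 1730 0 R | 31 | 1 | 1 | 30 | 95 | 67,394 |
| 39 | 1737 0 R | 32 | 1 | 1 | 31 | 95 | 67,425 |
| 39 | 1736 0 R | 33 | 1 | 1 | 32 | 95 | 67,457 |
| 39 | 1734 0 R | 34 | 1 | 1 | 33 | 95 | 67,490 |
| 39 | 1735 0 R | 35 | 1 | 1 | 34 | 96 | 67,524 |
| Average | | 1,775 | 87 | 87 | 1,688 | 93 | |
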
Columns
  • Page index: zero-based page index (NOTE: additional occurrences represent nested content streams, ie form XObjects painted inside current page)
  • Indirect reference: content stream reference (either page or form XObject)
  • Parent tree array size: all the elements in the parent tree array associated to content stream
  • Non-null element count: non-null elements in the parent tree array associated to content stream
  • Actual structural MCS count: actual marked-content sequences as content items found in content stream (NOTE: this value may be greater than 'Non-null element count' in case of orphaned content items)
  • Null element count: null elements in the parent tree array associated to content stream
  • Overhead (%): excess memory usage due to null values in the parent tree array (100% means array is completely filled with null elements, without payload; 0% means array is completely filled with non-null elements, ie 100% payload)
  • Cumulative null element count: null elements in parent tree arrays up to current, inclusive
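
The count columns defined above can be derived from the parent tree arrays alone. The following sketch is not the inspection code used for the analysis; it assumes arrays are given as Python lists with None for null, and null_stats is a hypothetical name. The "Overhead (%)" column is left out because its exact formula is not stated here, and "Actual structural MCS count" is left out because it requires parsing the content stream:

```python
from typing import List, Optional, Sequence, Tuple


def null_stats(
    arrays: Sequence[Sequence[Optional[str]]],
) -> List[Tuple[int, int, int, int]]:
    """Per array: (size, non-null count, null count, cumulative null count)."""
    rows = []
    cumulative = 0
    for array in arrays:
        size = len(array)
        nulls = sum(1 for element in array if element is None)
        cumulative += nulls
        rows.append((size, size - nulls, nulls, cumulative))
    return rows


# The two array values from the /Nums example quoted earlier in this thread.
arrays = [
    ["23 0 R", None, None, "24 0 R", "24 0 R"],
    ["18 0 R", "19 0 R", "20 0 R", "21 0 R"],
]
for row in null_stats(arrays):
    print(row)
# (5, 3, 2, 2)
# (4, 4, 0, 2)
```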

Nested content streams

As you may have noticed, the last page paints several form XObjects whose structural marked-content sequences are mapped in their own parent tree arrays. The following screenshot shows one of such occurrences (marked-content sequence with MCID=33, corresponding to the vectorial "FSC" logo, here highlighted in magenta):

page-39_MCID-33_nestedFormXObjectMCS

And this is the associated structure element (Stm=1734 0 R, Pg=1716 0 R):

page-39_MCID-33_structureElement

Overall null element count

Its overall null element count of 67,524 (Case 2) may not look as horrifying as the 648,198 of my previous analysis (Case 1), but we have to keep in mind their respective cumulative functions:

cdfChart

As you can see:

  • cumulative growth is more than proportional: its slope isn't constant, as null element count between contiguous pages is ever-increasing
  • cumulative growth in Case 2 is much faster than Case 1 because of their respective average non-null element counts per page (87 vs 47): at page 39, cumulative null element count of Case 1 (25,580) is far below Case 2 (67,524); since Case 1 counts 648,198 null elements at page 171, it is reasonable to infer that if Case 2 had 172 pages itself, its overall null element count would be much higher than 648,198!

This demonstrates that global MCID counters are a really clumsy and insidious solution, ever more detrimental to PDF files as their page count and their average non-null element count per page increase.
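
The growth pattern can be reproduced with a small simulation of a document-wide MCID counter. This is a sketch of the hypothesis discussed here, not a description of how any particular producer works; simulate_global_counter is a hypothetical name, and the starting offset of 35 is simply read off the null count of the first row of the table:

```python
from typing import List, Sequence, Tuple


def simulate_global_counter(
    items_per_stream: Sequence[int], start: int = 0
) -> List[Tuple[int, int, int]]:
    """Per content stream: (array size, null count, cumulative null count).

    Models a producer that never resets its MCID counter between content
    streams, so each stream's array must be padded with nulls for every
    MCID already handed out elsewhere.
    """
    rows = []
    next_mcid = start
    cumulative = 0
    for count in items_per_stream:
        nulls = next_mcid          # indices 0 .. next_mcid - 1 are unused here
        next_mcid += count         # this stream consumes `count` new MCIDs
        cumulative += nulls
        rows.append((next_mcid, nulls, cumulative))
    return rows


# Non-null element counts of pages 0 to 4 from the table above.
for row in simulate_global_counter([17, 61, 84, 91, 121], start=35):
    print(row)
# (52, 35, 35)
# (113, 52, 87)
# (197, 113, 200)
# (288, 197, 397)
# (409, 288, 685)
```

The output matches the array size, null count and cumulative null count reported for pages 0 to 4. With a counter that restarts at zero for each content stream, every array would instead have exactly as many elements as content items and no nulls at all.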

@petervwyatt
Copy link
Member Author

When you say "random", what kind of collection method did you apply to gather those PDFs? Were they automatically harvested on the 'net with some randomized crawling strategy, or picked from existing collections, or...? (just to have an idea of their statistical significance)

A true random sample across millions of PDFs including CC_MAIN..., "Issue Tracker", GovDocs1, my own collections, PDF Association test suites, etc (many listed in https://github.com/pdf-association/pdf-corpora). I stopped sampling once it had accumulated 1GB of files - in this case that was at 442 PDFs.

@stechio
Copy link

stechio commented Sep 22, 2023

Comparative graph added to last null-element analysis.

@mkl-public
Copy link

@stechio

as an aside,

I refined my inspection code further

it looks like that code uses an updated version of your PDF library. Is that version available somewhere?

@stechio
Copy link

stechio commented Sep 22, 2023

@stechio

as an aside,

I refined my inspection code further

it looks like that code uses an updated version of your PDF library. Is that version available somewhere?

Hi @mkl-public, I saw your comment last year on the project's website, but had no heart to tell you that, after your previous request on stackoverflow, I still had no idea about its release 😢 — because of your passionate interest in PDF technology, I thought to offset my unforgivable delay by offering you a sneak peek of some of the ongoing work, but it's still too early... I am obsessed with details, never satisfied, always pushing code to a better balance between simplicity and power...

The new version is already packed with several new usable (and useful) features developed over the last three years (and I have even more ideas yet to implement, BUT for another dev cycle, of course 😅 ); I'm currently completing the implementation of logical structure and tagged PDF, then the current dev cycle will be closed 🥳! I will publish a pre-release here on github, waiting for feedback. Stay assured, I didn't forget you! 😃 thank you

@stechio
Copy link

stechio commented Sep 24, 2023

Both sample files analyzed in this thread (this for Case 1 and this for Case 2) were apparently produced by Adobe PDF Library (version 15.0), so in these cases the global MCID counter cannot be a whimsical choice dictated by laziness.

Trying to figure out a rationale behind that, page 39 of Case 2 seems to give a possible clue: marked-content sequences nested in form XObjects (see EXAMPLE 5 in subclause 14.7.5.2 (Marked-content sequences as content items)). Since form XObjects are inherently reusable, one might think that the MCIDs assigned to them couldn't be assigned elsewhere in the same file because of possible collisions. But subclause 14.7.5.4 (Finding structure elements from content items) clearly states that each content stream (page object or other kind) has its own set of zero-based MCIDs (emphasis is mine):

For a content stream containing marked-content sequences that are content items, the value shall be an array of indirect references to the sequences' parent structure elements. The array element corresponding to each sequence shall be found by using the sequence's marked-content identifier as a zero-based index into the array.
[...]
Because a marked-content sequence is not an object in its own right, its parent tree key shall be found in the StructParents entry of the page object or other content stream in which the sequence resides. The value retrieved from the parent tree shall not be a reference to the parent structure element itself but to an array of such references — one for each marked-content sequence contained within that content stream.

So: an MCID is a ZERO-based index into an array of references to parent structure elements, ONE FOR EACH marked-content sequence contained WITHIN THAT content stream. Because of this definition, no collision is possible, is it? In turn, the only plausible reason for a global MCID counter vanishes miserably... (that wouldn't be decisive, anyway, since the overwhelming majority of null elements existing in those files are unrelated to form XObjects)

As Adobe PDF Library is, de facto, the gold standard among the implementations of ISO PDF, it would be great if @mrbhardy could let us reach out to its dev team to understand the rationale of their solution.

@mrbhardy
Copy link

mrbhardy commented Oct 5, 2023

@stechio the PDF Library doesn't stop people doing very low-level control or setting things up incorrectly like this. It powers most of our Desktop products and they generally produce good Tagged PDF, so I don't think this is a problem in the core library. It is possible that a product using it is setting it up in a way that produces this unfortunate outcome.

I'm happy to draft a note saying that it is important to start MCIDs (start just meaning the lowest value, not first) at zero on each page and within XObjects.

@stechio
Copy link

stechio commented Oct 5, 2023

@mrbhardy I understand your point (also my initial analysis supposed it could be the result of a custom arrangement in the generation of the logical structure) and I am certain that the PDF Library delivers top-quality documents per-se; yet, IMHO, it could be useful to report this possible issue, just in case.

@petervwyatt petervwyatt added help wanted Extra attention is needed and removed proposed solution Proposed solution is ready for review labels Oct 15, 2023
@petervwyatt
Copy link
Member Author

PDF TWG agree:

  • explicitly acknowledge that nulls do occur in the array
  • add an informative note in 14.7.5.2 below current note after example 1: for efficient creation of StructParents within a given page it is highly recommended that MCIDs start at zero and be contiguous to avoid excessive nulls. Each page or XObject is unique and can start at 0.

@MatthiasValvekens
Copy link
Member

very minor nit:

add an informative note in 14.7.5.2 below current note after example 1: for efficient creation of StructParents within a given page it is highly recommended that MCIDs start at zero and be contiguous to avoid excessive nulls. Each page or XObject is unique and can start at 0.

I think this is a good change, but "highly recommended" reads a little too normative to me (and turning this into a normative recommendation is probably too heavy-handed). Maybe we can write the note more matter-of-factly as follows:

MCIDs are scoped by content stream, so the same MCID can reappear across pages or XObjects. Ensuring that the MCID counter starts at 0 and is contiguous for any given page allows for efficient creation of StructParents without excessive null objects.
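
For a writer that wants to follow this advice when rewriting an existing file, renumbering could look roughly like the sketch below. It assumes that every null in the array marks an unused MCID (which, as discussed earlier in the thread, is not guaranteed, since a null might also stand for an orphaned content item) and that the caller separately rewrites the MCID properties in the content stream and the MCID references in the structure elements using the returned mapping. The name compact_mcids is hypothetical:

```python
from typing import Dict, List, Optional, Sequence, Tuple


def compact_mcids(
    parent_array: Sequence[Optional[str]],
) -> Tuple[Dict[int, int], List[str]]:
    """Return (old MCID -> new MCID, null-free parent array).

    New MCIDs start at 0 and are contiguous within this one content stream.
    """
    mapping: Dict[int, int] = {}
    compact: List[str] = []
    for old_mcid, ref in enumerate(parent_array):
        if ref is not None:
            mapping[old_mcid] = len(compact)
            compact.append(ref)
    return mapping, compact


mapping, compact = compact_mcids(["23 0 R", None, None, "24 0 R", "24 0 R"])
print(mapping)  # {0: 0, 3: 1, 4: 2}
print(compact)  # ['23 0 R', '24 0 R', '24 0 R']
```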

@petervwyatt petervwyatt added proposed solution Proposed solution is ready for review and removed help wanted Extra attention is needed labels Oct 16, 2023
@petervwyatt petervwyatt added ISO approved Resolved issue approved by ISO and removed documentation Improvements or additions to documentation proposed solution Proposed solution is ready for review labels Nov 12, 2024