CTSM/FATES enters an 'infinite loop' on Dec. 31 when trying to demote cohorts to balance areas #171

@mvertens

Brief summary of bug

I have been working with @mvdebolskiy to try to understand a crash that happens at the end of year 1 on Dec. 31 with the following behavior.

code repository: 
   https://github.com/mvdebolskiy/CTSM.git
code branch: 
   updt-noresm-to-5.3.084
test case: 
   SMS_D_Ld366_P1024.ne30pg3_ne30pg3_mtn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp.matvey/
  • The code first dies on Dec. 31 in EDCanopyStructureMod.F90 with an array-out-of-bounds error for arealayer(nclmax+5). It turns out that arealayer does not need to be an array, so I made it a scalar to get past this point.
  • It now aborts when the condition (patch_area_counter > max_patch_iterations .and. area_not_balanced) becomes true.
  • This happens at numerous gridcells. I picked one failing gridcell to look at more carefully:
    lat, lon = 1.33058, 27.50000, where the crash happens with pft=13. I'm using 1024 tasks, and task 134 contains the problem gridcell.
  • I have restart files written at both Dec. 30 and Dec. 31.
  • On Dec. 30:
134: DEBUG: lat, lon, year, mon, day, tod =       1.33058       27.50000  2000    12    30  1800
134:
134: DEBUG: area not balanced loop: pft, counter =   13     0
134: DEBUG: z1 =        3
134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    1        720.9159737710        714.2857142857          6.6302594853
134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   1     6.6302594853   631.1649323585
134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    2        724.1804081137        714.2857142857          9.8946938280
134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   1     9.8946938280   634.4244105726
134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    3        726.7865900232        714.2857142857         12.5008757375
134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   1     0.6813581681     0.6813581681
134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   2    11.8195175694   716.2105380271
134: DEBUG: z2  =        4
13

On Dec. 31:

 134: DEBUG: lat, lon, year, mon, day, tod =       1.33058       27.50000  2000    12    31  1800
 134:
 134: DEBUG: area not balanced loop: pft, counter =   13     0
 134: DEBUG: z1 =        3
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    1       8088.8092821351        714.2857142857       7374.5235678494
 134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   1  7374.5235678494  7661.6637274354
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    2      12887.8364490678        714.2857142857      12173.5507347821
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   1    37.8177039835    37.8177039835
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   2  5475.4951772349  5475.4951772349
 134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   3  6660.2378535637  7374.5235678494
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    3      15777.8579507025        714.2857142857      15063.5722364168
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   1     0.6853056261     0.6853056261
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   2    37.8177039835    37.8177039835
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   3  3603.6219102943  3603.6219102943
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   4  5475.4951772349  5475.4951772349
 134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   5  5945.9521392780  6660.2378535637
 134: DEBUG: z2  =        4
 134:
  134:
 134: DEBUG: area not balanced loop: pft, counter =   13     1
 134: DEBUG: z1 =        4
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    1        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    2        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    3        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   1     0.0000000000   714.2857142857
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    4      15063.5722364168        714.2857142857      14349.2865221311
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   1     0.6853056261     0.6853056261
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   2    37.8177039835    37.8177039835
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   3  3603.6219102943  3603.6219102943
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   4  5475.4951772349  5475.4951772349
 134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   5  5231.6664249923  5945.9521392780
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   6     0.0000000000     0.0000000000
 134: DEBUG: z2  =        5
 134:
 134:
 134: DEBUG: area not balanced loop: pft, counter =   13     2
 134: DEBUG: z1 =        5
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    1        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    2        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    3        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    4        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    5      14349.2865221311        714.2857142857      13635.0008078454
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   1     0.6853056261     0.6853056261
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   2    37.8177039835    37.8177039835
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   3  3603.6219102943  3603.6219102943
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   4  5475.4951772349  5475.4951772349
 134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   5  4517.3807107066  5231.6664249923
 134: DEBUG: z2  =        6
......
Demotion continues in the same way until the final iteration of the loop (below), which causes the abort.
.......
 134:
 134: DEBUG: area not balanced loop: pft, counter =   13    10
 134: DEBUG: z1 =       13
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    1        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    2        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    3        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    4        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    5        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    6        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    7        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    8        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    9        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =   10        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: target_area is less than nearzero - returning
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =   11        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =   12        714.2857142857        714.2857142857          0.0000000000
 134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   1     0.0000000000   482.6192892933
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =   13       8635.0008078455        714.2857142857       7920.7150935598
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   1     0.6853056261     0.6853056261
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   2    37.8177039835    37.8177039835
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   3  3603.6219102943  3603.6219102943
 134: DEBUG: demotion partial: cohort#, pd_area, cohort_area   4  4278.5901736559  4992.8758879416
 134: DEBUG: demotion total  : cohort#, pd_area, cohort_area   5     0.0000000000     0.0000000000
 134: DEBUG: z2  =       14
 134:
 134:  PATCH AREA CHECK NOT CLOSING
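
To compare the two days more easily, the canopy_structure lines can be pulled out of the log and reduced to one row per layer. The following is only a sketch: `parse_layers` and `LAYER_RE` are hypothetical helper names, and the regular expression assumes the exact DEBUG line format printed above (which comes from print statements added for this debugging session, not from stock FATES output).

```python
import re

# Matches the debugging lines shown above, e.g.
# "134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area = 1 720.91 714.28 6.63"
LAYER_RE = re.compile(
    r"canopy_structure i_lyr, arealayer, currentPatch%area, target_area =\s+"
    r"(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)"
)

def parse_layers(log_text):
    """Return (layer, arealayer, patch_area, excess) tuples found in log_text."""
    rows = []
    for line in log_text.splitlines():
        m = LAYER_RE.search(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)),
                         float(m.group(3)), float(m.group(4))))
    return rows

# Layer 1 on Dec. 30 and on Dec. 31, copied from the log above.
sample = """\
134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    1        720.9159737710        714.2857142857          6.6302594853
 134: DEBUG: canopy_structure i_lyr, arealayer, currentPatch%area, target_area =    1       8088.8092821351        714.2857142857       7374.5235678494
"""

for lyr, area, patch, excess in parse_layers(sample):
    print(f"layer {lyr}: arealayer/patch_area = {area / patch:.2f}, excess = {excess:.2f}")
```

In every line of the log, the printed target_area equals arealayer minus currentPatch%area, so the ratio arealayer/patch_area is a quick way to see how far a layer is over-filled: just above 1 on Dec. 30, but above 11 for layer 1 on Dec. 31.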

The problem seems to be that the arealayer for pft 13 coming into this routine on Dec. 31 is huge, and there are not enough demotion iterations to fully resolve it. I'm still trying to spin up on FATES, so maybe I am missing how this is happening. @mvdebolskiy, do you have other data to add since we last talked?
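
A rough back-of-the-envelope check of the numbers in the log supports this. The sketch below assumes, based only on the log above and not on the FATES source, that each pass fills one more layer to currentPatch%area and pushes the whole excess into the next layer, so that total crown area is conserved. The function names are hypothetical.

```python
import math

patch_area = 714.2857142857  # currentPatch%area from the log

def layers_needed(total_area, patch_area):
    """Smallest number of layers that can hold total_area at patch_area each."""
    return math.ceil(total_area / patch_area)

def excess_after(total_area, patch_area, n_layers_filled):
    """Area still to be demoted once n_layers_filled layers are full."""
    return max(total_area - n_layers_filled * patch_area, 0.0)

# Counter = 0 on each day: layers 1 and 2 are filled to patch_area and
# the remainder sits in layer 3 (arealayer values taken from the log).
total_dec30 = 2 * patch_area + 726.7865900232
total_dec31 = 2 * patch_area + 15777.8579507025

print(layers_needed(total_dec30, patch_area))      # layers needed on Dec. 30
print(layers_needed(total_dec31, patch_area))      # layers needed on Dec. 31
print(excess_after(total_dec31, patch_area, 13))   # excess left at layer 13
```

Under these assumptions, Dec. 30 needs 4 layers, which matches z2 = 4 in the log, while Dec. 31 needs 25. The excess remaining after 13 filled layers comes out at about 7920.715, the same as the target_area printed for layer 13 just before the abort, so the loop appears to be making steady progress of one patch area per pass rather than cycling. That would point at the jump in incoming crown area between Dec. 30 and Dec. 31 as the thing to explain.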
