Attempt to fix 3489 #3533

Draft: wants to merge 16 commits into master

Conversation

heplesser (Contributor) commented:

This PR is an attempt, so far not entirely successful, to solve #3489, i.e., to address the problem that spikes are delivered by the event delivery manager after a connection has been removed. See #3489 for a reproducer (not added here as a test yet). I am posting this now so that further work can build on my attempts.

My approach is to run update_connection_infrastructure() from the main update_() loop, after events have been delivered and structural plasticity (SP) has done its work. In addition, the connection infrastructure must be built once before the first simulation step, so that waveform relaxation (WFR) can already work in the first time slice. This is unproblematic, since no spikes can be available for delivery at time 0.

Most tests pass, but I needed to adapt some: a few worked more by coincidence than robustly, and others needed changes so that comparisons of SynapseCollections no longer depend on the order of synapses in the collection. I also ignore port in almost all comparisons, because it changes with connection infrastructure updates and is an internal implementation detail. See the changes in testsuite for how I modified the tests.
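
For the adapted tests, the comparison boils down to roughly the following sketch (the helper name is mine, not the actual testsuite code): build an unordered set of (source, target, synapse_model) tuples per collection and compare those, deliberately leaving out port.

    # Sketch only: order-insensitive SynapseCollection comparison that ignores 'port'.
    def conns_equal_ignoring_order_and_port(conns_a, conns_b):
        keys = ["source", "target", "synapse_model"]

        def as_set(conns):
            data = conns.get(keys)  # dict of lists (scalars for a single connection)
            cols = [data[k] if isinstance(data[k], (list, tuple)) else [data[k]] for k in keys]
            return set(zip(*cols))

        return as_set(conns_a) == as_set(conns_b)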

I finally ran into trouble that I haven't fully solved yet with test_spike_transmission_after_disconnect(), which does the following:

    import nest

    n = nest.Create("parrot_neuron", 10)
    nest.Connect(n, n)  # default all_to_all rule: 100 connections

    c = nest.GetConnections()
    c[::3].disconnect()  # remove every third connection

    g = nest.Create("spike_generator", params={"spike_times": [1]})
    nest.Connect(g, n)

    nest.Simulate(3)

This fails on the disconnect() step, claiming that the connection to be deleted does not exist. This is somehow related to ConnectionManager::find_connection(). Introducing nest.Simulate(0.1) before GetConnections(), and thus forcing the connection infrastructure to be updated, solves the problem and makes the test pass. Note that this test works in master only if NEST uses compressed spikes; otherwise it fails in the same way as on this branch (see #3532).
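
Spelled out, the workaround described above looks like this (ResetKernel added to make the snippet self-contained):

    import nest

    nest.ResetKernel()

    n = nest.Create("parrot_neuron", 10)
    nest.Connect(n, n)

    nest.Simulate(0.1)  # workaround: forces the connection infrastructure update

    c = nest.GetConnections()
    c[::3].disconnect()

    g = nest.Create("spike_generator", params={"spike_times": [1.0]})
    nest.Connect(g, n)

    nest.Simulate(3)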

The following tests currently fail; I haven't had time to investigate why:

- install.share.nest.testsuite.pytests.test_spike_transmission_after_disconnect.test_spike_transmission_after_disconnect
- install.share.nest.testsuite.pytests.test_synapsecollection.TestSynapseCollection.test_GetConnectionsOnSubset
- install.share.nest.testsuite.pytests.test_synapsecollection.TestSynapseCollection.test_GetConnectionsSynapse
- install.share.nest.testsuite.pytests.test_synapsecollection.TestSynapseCollection.test_basic
- install.share.nest.testsuite.pytests.test_synapsecollection.TestSynapseCollection.test_get
- install.share.nest.testsuite.pytests.test_synapsecollection.TestSynapseCollection.test_getWithPandasOutput
- install.share.nest.testsuite.pytests.test_synapsecollection.TestSynapseCollection.test_string
- install.share.nest.testsuite.pytests.test_sp.test_disconnect_multiple.TestDisconnect.test_disconnect_all_to_all
- install.share.nest.testsuite.pytests.test_sp.test_disconnect_multiple.TestDisconnect.test_disconnect_defaults
- install.share.nest.testsuite.pytests.test_sp.test_disconnect_multiple.TestDisconnect.test_disconnect_static_synapse
- install.share.nest.testsuite.pytests.test_sp.test_disconnect_multiple.TestDisconnect.test_multiple_synapse_deletion_all_to_all
- install.share.nest.testsuite.pytests.test_sp.test_disconnect_multiple.TestDisconnect.test_multiple_synapse_deletion_one_to_one
- install.share.nest.testsuite.pytests.test_sp.test_disconnect_multiple.TestDisconnect.test_multiple_synapse_deletion_one_to_one_no_sp
- install.share.nest.testsuite.pytests.test_sp.test_disconnect_multiple.TestDisconnect.test_single_synapse_deletion_sp
- install.share.nest.testsuite.pytests.test_spatial.test_dumping.DumpingTestCase.test_DumpConns
- install.share.nest.testsuite.pytests.sli2py_mpi.test_issue_1957.test_issue_1957
- install.share.nest.testsuite.pytests.sli2py_mpi.test_self_get_conns_with_empty_ranks.test_get_conns_with_empty_ranks

@heplesser added the labels T: Bug, S: High, and I: Behavior changes on Jul 13, 2025
@heplesser added this to the Kernel project on Jul 13, 2025
@github-project-automation (bot) moved this to In progress in Kernel on Jul 13, 2025
heplesser (Contributor, Author) commented:

I just pushed another update fixing one corner case and added a regression test. Besides the problem with test_spike_transmission_after_disconnect mentioned above, there is also a deadlock issue in some MPI-based tests, which leads to timeouts. This happens if one rank has connections and the other doesn't. One example is test_issue_3099.py, which does the following:

    import nest

    nrn = nest.Create("parrot_neuron")
    nest.Connect(nrn, nrn)
    conns = nest.GetConnections()  # on 2 MPI ranks, only one rank owns the connection
    if conns:  # an empty SynapseCollection is falsy, so only that rank enters the branch
        conns.weight = 2.5

Here, GetConnections() completes fine. Running on two MPI ranks, only one rank will have a connection in conns; for the other, the SynapseCollection is empty. Therefore, only one rank enters the if. Now the following happens:

  1. At the Python level, we call SynapseCollection.set(), which begins with the guard

         if self.__len__() == 0 or GetKernelStatus("network_size") == 0:

  2. The rank without a connection returns here immediately, while the rank with a connection calls GetKernelStatus().
  3. GetKernelStatus() calls update_delay_extrema_(), which initiates MPI communication if and only if connections_have_changed() is true:

         if ( not kernel().connection_manager.get_user_set_delay_extrema()
              and kernel().connection_manager.connections_have_changed()
              and kernel().mpi_manager.get_num_processes() > 1 )

Now, in current master, because GetConnections() triggers an update of the connection infrastructure, connections_have_changed() returns false on both ranks at this point, so neither rank tries to communicate and all is well. But in my PR this is no longer the case, so only the rank with connections tries to communicate here, and then we are deadlocked. However, GetConnections() must not update the connection infrastructure, because that would lead to undeliverable spikes after disconnection.
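
The deadlock mechanism itself is generic and independent of NEST: a collective MPI operation entered on only some ranks blocks those ranks forever. A minimal standalone illustration with mpi4py (not NEST code; run with something like mpirun -np 2 python hang.py):

    # Only rank 0 enters the collective call, so it waits forever for the other
    # ranks, just like the rank that reaches update_delay_extrema_() above.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    have_local_data = comm.Get_rank() == 0  # stands in for "this rank has connections"

    if have_local_data:
        total = comm.allreduce(1, op=MPI.SUM)  # collective: needs all ranks to participate
        print(f"rank {comm.Get_rank()}: total = {total}")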

A "workaround" is to call Simulate(0.1) after creating the connections, because then the delay extrema are fixed and update_delay_extrema_() is skipped.

Clearly, in this case, where we only need the network size, no MPI communication would be necessary at all; we only trigger it because the sole way to get data out of the kernel is GetKernelStatus(), and that always returns everything. Incidentally, SynapseCollection.set() later calls SynapseCollection.get() via

    node_params = self[0].get()

which goes through the same GetKernelStatus() dance all over again:

    if self.__len__() == 0 or GetKernelStatus("network_size") == 0:

By the way, this code is the same in PyNEST-NG for now.
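
For illustration only (not part of this PR; the function name is made up): if such a guard has to touch GetKernelStatus() at all, the call would need to happen on every rank before any rank bails out on its local length, so that whatever MPI communication it triggers stays collective. A sketch of a rank-symmetric variant:

    import nest

    def set_weights_rank_symmetric(conns, weight):
        # Query the kernel first, on every rank, so that any MPI communication
        # triggered inside GetKernelStatus() happens on all ranks; only then
        # bail out based on the purely local length of the SynapseCollection.
        network_is_empty = nest.GetKernelStatus("network_size") == 0
        if network_is_empty or len(conns) == 0:
            return
        conns.weight = weight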

heplesser (Contributor, Author) commented:

I manually canceled the Actions workflow because the remaining cases had gotten stuck and our timeout mechanisms don't seem to work for MPI-based tests right now.
