Skip to content

Conversation

@peterrrock2
Copy link
Collaborator

Adding a PR here so that that I can respond to comments more easily. We'll see how well this operates as a workflow and change it if need be.

Fred Mueller added 2 commits May 6, 2025 14:05
I added a new Graph object in graph.py that is NOT a subclass of NetworkX
but instead has two internal data members, nxgraph and rxgraph.  The
new Graph object supports most (almost all) of the syntax of the old
Graph object, which should allow us to add RX support seamlessly in the
near future...
Merge branch 'frm_rustworkx' of https://github.com/chief-dweeb/GerryChain into frm_rustworkx
Copy link
Collaborator Author

@peterrrock2 peterrrock2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, lots of comments here, but they were good to read through! Hopefully you find my responses somewhat enlightening

Comment on lines +36 to +39
1) It seemed possible to me that users took advantage in their own code that the
Graph object was in fact a NetworkX Graph object. If that is so, then we can
make life easier for them by providing them an easy way to gain access to the
internal NetworkX Graph object so they can continue to use that code.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very common thing that our users will do (esp. undergraduate students), so we do need to maintain easy access to the NetworkX object, and, ideally, we should implement a routine that gracefully converts from a RustworkX object to NetworkX for people to use

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - the RustworkX for NetworkX page actually has code to do just that (for folks to copy)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yep! We will need to adapt it, however. The code in the RX documentation will work for a graph without adornment, but there is a bit of nuance that we should be careful about. Namely, we will sometimes want to be considerate of particular node and edge attributes associated to the graph. The biggest concerns here are obviously the population column and any surcharges that we would like to impose on edges to impact the MST distribution we are sampling from.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding to this late - sorry.

I think that this is resolved. Users can get access to the underlying graph (NX or RX) with get_nx_graph() and get_rx_graph(). They can also get an NX graph after running MarkovChain by converting the RX graph used in the MarkovChain with to_networkx_graph() which preserves all of the node and edge data on the RX graph.

with the new Graph object and some of it hacked to operate on the underlying
NX Graph data member.
In the future, if #1 is not an issue, we can just expunge NetworkX completely.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we will be able to do this completely, unfortunately. While we could conceivably get rid of the NetworkX part and then give users a way to do some nice conversion from the graph object we have defined and an equivalent NetworkX object, we will run into some issues for users that want to write custom acceptance / rejection functions. There are some situations in which users will want to take graph diameter, earth-mover-distance, etc. into account, and I do not think that the RustworkX interface (specifically the way that it handles subgraphs) is sufficiently intuitive to allow for novice users to take advantage of its functionality.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An additional thought:

Many of the things that we would like to do with graphs rely on the node attributes of the graph. The way that RustworkX stores node attributes. In fact, it might be nice to store a lot of these in a Pandas Dataframe so that we can implement an index that converts between the NetworkX index and the RustworkX index quickly (I am thinking for things like subgraph aggregation). An additional note: in my heart of hearts, I would use polars instead of pandas, but we are likely to use the same df for the Partition and the Graph, and Partitions will often need geo-spatial operations that are not possible in polars at this time.

Feedback on this thought is welcome!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know enough about pandas (or similar libraries) to have an intelligent opinion.

The general issue, however, of wanting to allow users to write functions as if the underlying graph is a NetworkX graph is unfortunate. It would seem to require converting back and forth from NetworkX to RustworkX each time the user's function was called.

Perhaps we could come up with a sane way to support most of the use cases that would prompt users to do this (that would NOT require converting back and forth) and then 1) support the converting back and forth, but 2) issue a warning saying that in the future we will drop support and that there is a new way to do it...

I will think about this some...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could come up with a sane way to support most of the use cases that would prompt users to do this (that would NOT require converting back and forth) and then 1) support the converting back and forth, but 2) issue a warning saying that in the future we will drop support and that there is a new way to do it...

I like this as an option. This gives us time to get feedback from users to see how obstructive this will be

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - this sounds like an issue we should discuss. The strategy for users in the current codebase using RX is: 1) create a graph in NX, 2) have it be automatically converted to RX when a partition is created (and the graph essentially frozen), and then 3) convert what is needed back to "original" node_ids if needed after running MarkovChain(). In many cases, no translation back to "original" node_ids is required - cases when the user is just interested in calculated properties of districts (which just depend on district ids which do not depend on NX/RX). In cases where the user cares about nodes, we can translate the graph back to NX and we can translate the assignments for each element in the chain back to the "original" NX node_ids.

So, the questions seem to be: 1) are the other use cases where translating node_ids after running MarkovChain() is needed and 2) do users want to write code (for updaters, constraints, etc.) that uses NX idioms that are not compatible with RX?

Note that if users want to write code that operates on a real NX graph during MarkovChain() processing, then I expect that any performance improvement from RX will disappear into translating back and forth from RX to NX...

Comment on lines 6 to 7
# frm ???: Is this intended to be externally visible / useful?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal

Comment on lines 30 to 31
# frm ???: Is this intended to be externally visible / useful?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal

Comment on lines 47 to 48
# frm ???: Is this intended to be externally visible / useful?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal

Comment on lines 84 to 90
# Note that I would lobby for the names "part" and "parts" to be
# changed to be "district" and "districts" just to avoid confusion
# with "partition" - parts of partitions warps my mind, and this
# is all about re-DISTRICTing isn't it???
#
# I would also lobby to have "crosses_parts" changed to "crosses_districts"
#
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind that change. We used "parts" historically because we are talking about edges between components of a graph partition, so can use this to study clustering in networks as well and not just districts in a redistricting plan. In fact, we have used this package to explore exactly these sort of network questions a few times over the years.

I would keep the names short to "dist" and "dists" as a matter of style, but feel free to disagree

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will defer to whatever coding style you would prefer. I apologize for not doing PEP8 (I hope that is the correct acronym) - I had not expected my code to be reviewed before going back and making it pretty.

As I said in an earlier email, I come from the old-school, very large program world, where we decided that verbosity was in general a good thing (with all sorts of caveats about how to do so without creating commentary that became out of date and hence anti-helpful). Descriptive variable and function names were one of the ways we thought to make programs more easy for others to grok without gumming up the works with comments - that is, let the code be self-describing.

I am sure we can find a comfortable middle ground... ;-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will defer to whatever coding style you would prefer. I apologize for not doing PEP8 (I hope that is the correct acronym) - I had not expected my code to be reviewed before going back and making it pretty.

No worries! I'm generally not that picky about style. We have moved towards using Black as our default at the lab for reference. So long as you generally adhere to the "each function is a single thought and fits on one screen (ish)" principle, I'm sure it will be fine.

As I said in an earlier email, I come from the old-school, very large program world, where we decided that verbosity was in general a good thing (with all sorts of caveats about how to do so without creating commentary that became out of date and hence anti-helpful). Descriptive variable and function names were one of the ways we thought to make programs more easy for others to grok without gumming up the works with comments - that is, let the code be self-describing.

I'm all for this. Go ahead and do as you usually would, and I'll comment if I ever have an issue. The "dist" vs "district" suggestion was mainly from a user perspective: the main way that we access components of a Partition is (currently) through the 'parts' attribute, so changing to 'districts' does increase the character count. Most of our users do not know how to set up a decent LSP (many only access a coding environment so that they can use our library), so a good chunk of them will have to type everything in the public API. Not a big deal if you can type fast, but I don't want to make that assumption. Thus, whenever I am dealing with variables or functions that will become part of the public API, if I can, I try to find distinct abbreviations for common things to lighten the load a little bit. If, however, the variable is going to be entirely internal I'll be happy so long as it stays below the 50 character mark or so.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I am convinced that the code is really solid, I will go back and make it both "pretty" and well documented.

We can discuss at that time what standards would be appropriate. I am flexible...

Comment on lines 103 to 106
#
# frm ???: I think that this only operates on cut-edges and not on all of the
# edges in a partition. A "cut-edge" is an edge that spans two districts.
#
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

@chief-dweeb
Copy link

I am inclined to think next about creating RustworkX versions of NX functions (like laplacian_matrix, is_connected, bfs_successors/predecessors, etc. I do not anticipate any great problems here, and doing this will confirm that the delegation approach works fine.

I also plan to spend a good bit of time grokking what your code samples do. I would like to feel as if I could whip up GerryChain code without having to ask myself, "what exactly do updaters do?"

Does that sound like the right next step? Stated differently, if the goal is to mitigate risk and convince ourselves that we are on the right track, what should I do next?

@peterrrock2
Copy link
Collaborator Author

I like this as a next step. Getting the library under your fingers in a general use case is always a good thing in my opinion. Getting those RX versions of NX functions implemented would also be great.

As far as getting the code under your fingers goes, I think that it would also be good to get an idea of how acceptance functions play into things. So, in addition to being able to spin up a vanilla chain (one with only a population updater and alway_accept), I would recommend that you also experiment with some custom updater and acceptance functions since making sure that these remain adaptable will be the main pain point for this migration. A good candidate for these would be something like Reock score, or "maximum commute time within a district." For Reock, I would grab one of the shapefiles either on the User Guide, or from our MGGG States github page, and for the commute time, you can just make up some random weights and add them as attributes on the edges of the graph.

Fred Mueller added 4 commits May 23, 2025 16:48
a Partition.  In particular, support for translating node_ids
from subgraphs to parent graph node_ids.  Total PITA...

Lots of cleanup needed and also lots of tests, but the
regression test works again - tests/frm_tests/test_frm_regression.py
… the Graph class. Also updated a bunch of tests to work with the new RX based code.
Copy link
Collaborator Author

@peterrrock2 peterrrock2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick General Notes:

  • I like the change to graph.node_data. Much easier to read and reason about
  • Some things need changed to snake_case
  • There were a couple of comments in the tests talking about difficulty converting from RustworkX graphs back to NetworkX graphs that I left notes on. Using rx.networkx_converter with keep_attributes=True stores the old node id on the node attribute __networkx_node__

Otherwise looking good so far!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire file is largely a bit of legacy that we can take or leave. I think that the original intention behind it is that being able to set up a quick grid-graph for experimentation is pretty nice, and that remains true when you are mucking about with new metrics, but it's just as easy to use

from gerrychain import Graph
import networkx as nx

nx_graph = nx.grid_2d_graph(width, height)
nx_graph = nx.convert_labels_to_integers(nx_graph)

for node, data in graph.nodes(data = True):
    data['pop'] = 1
    # set other data here
graph = Graph(nx_graph)

The auxiliary functions that are in this file are pretty nice to have (i.e. setting a constant attribute), but they can definitely be improved to allow for arbitrary graphs rather than just grid graphs.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - I think it is worth preserving. I have mucked around with it so that all of the tests pass now.

There were some interesting issues with tests using NX node_ids that then needed to be translated into RX node_ids. In the future, we probably want to make this less painful (both for ourselves when writing tests or doing quick one-offs, but also for users). Ideally we would completely shield users from NX vs. RX node_ids, but it is not clear to me at present if that is possible...

Comment on lines +29 to +31
# Set the random seed so that the results are reproducible!
import random
random.seed(2024)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably did this in your environment, but it might be worth adding

assert os.environ["PYTHONHASHSEED"] == <known_value>

at the top when turning this into a test since the ordering of set objects in python is controlled by this environment variable and not by the random module.

Comment on lines 86 to 90
# I am going to assume that this test is just verifying that
# we correctly create the appropriate subgraphs (which we do).
# Rewrite the test so that it first verifies the nodes in the parent graph
# and their assignments, then reach into the subgraphs to verify that the
# parent node mapping for the subgraphs is correct.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have it correct here. All this test really does is make sure that when you partitioned the graph, the partitions are what you expect and that the objects have recorded that partition correctly

Comment on lines +71 to +76
# frm: TODO: the following statement blows up because we do not copy
# geometry data from NX to RX when we convert to RX.
# Need to grok what the right way to deal with geometry
# data is (is it only an issue for from_geodataframe() or
# are there other ways a geometry value might be set?)
#
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The geometry data should only exist on the attached geodataframe. In fact, if there is no "geometry" column in the dataframe, this call should fail.

Fixing the plotting functions is a low-priority. I need to set up snapshot tests for these anyway, so if you find working with matplotlib a PITA (because it is), then don't worry about the plotting functions for now.

Worst-case scenario, I can just add some temporary verbage to readthedocs telling people to use

my_partition.df.plot()

Which will just use all of the plotting stuff that Pandas has set up internally.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test now passes because I attach the geometry data to the graph...

Comment on lines +240 to +243
# frm: TODO: stmt below fails - saying too many attempts:
#
# raise RuntimeError(f"Could not find a possible cut after {max_attempts} attempts.")
# RuntimeError: Could not find a possible cut after 10000 attempts.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be an environment issue. My testing environment has the hash seed set to 0, so this passed, but it might be good to revisit this and build a better graph so that we don't need to rely on the hash seed being set

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the problem was, but the test now passes in my current code

Comment on lines +177 to +179
# frm: TODO: Once you have a partition, you cannot change the total population
# in the Partition, so why don't we cache the total population as
# a data member in Partition?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this would be a good thing to do.

totpop += partition.graph.get_node_data_dict(node)[self.pop_col]
totpop += partition.graph.node_data(node)[self.pop_col]

# frm: TODO: Ditto with num_districts - isn't this a constant once you create a Partition?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note on this file: A ton of the code in here is inefficient. This was made 6 years ago and hasn't really been touched since then other than when I was doing an overhaul on many of the doc strings

else:
locality_population[locality_name] += locality_pop
ideal_population_per_district = totpop / num_districts
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could probably store this on the partition as well

allowed_pieces[loc] = math.ceil(pop / (totpop / num_districts))

"""
frm: new version of this code that is less clever...
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better. The subgraph thing was... interesting...

@chief-dweeb
Copy link

chief-dweeb commented Jul 29, 2025 via email

Copy link

@chief-dweeb chief-dweeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed all of your comments and have added a few responses to them, but we seem to be on the same page.

I have progressed my local branch, and essentially all of the tests pass now (the only failures are due to: normalized_laplacian_matrix() not being implemented, geometry information not cached on RX-based graphs, and an "interesting" issue with NX node_ids being used for an initial Tally that screws up because the test assumes it knows what partition those nodes map to, but of course, node_ids change when converting to RX (as does the assignment).

Let me know if you would like me to just push those changes or whether you want me to do something else.

@peterrrock2
Copy link
Collaborator Author

Wonderful! Thank you for going through all of that. I agree, we seem to be pretty much on the same page about things.

As a note, I have recently (in the past couple of weeks) changed my schedule so that Tuesdays are my main meeting day and OSS management and PR review has a whole day on Wednesday. So, go ahead and push your most recent changes to the branch, and I will get to them on the 6th.

As far as this comment goes:

Peter, I have continued to plow ahead to get tests to pass. At present three tests fail: - One because we don't store geometry information - One because I have not yet implemented normalized_laplacian_matrix() - One that I have not yet looked at - the results of doing a MarkovChain do not match the expected results. - I assume I will have figured that out and fixed whatever needs fixing as soon as I can get back to coding (going to take a day or two off to enjoy the summer). After getting the "interesting" MarkovChain test to pass, I will look at your comments on my previous work. Left to my own devices, I would take stock of the code and add additional tests to convince myself that all is really well and that there are not any edge-case bugs lurking, and then I would do some performance testing to see whether the code is actually significantly faster - presumably identifying some opportunities to speed things up. And THEN, I would analyze what the migration path would be for a legacy user (starting with the code in the User Guide, but also looking at the code samples you sent earlier) - I assume some issues would come up that would lead to changes in the code to make the API and/or conversion path "better". And THEN, I would go back and update the documentation everywhere - in the code - but perhaps in the User Guide or perhaps in a new Migration Guide. However, at some point, I need to get my work in the appropriate branch. I am not clear yet on where you would like this work to ultimately reside. You said something about this before, but I do not remember what you wanted done. I would also like to get clear on how to manage branches and merges etc going forward. Is anyone else working on this codebase? Do you have a rules-of-the-road process - that is what we used to call it way back when... At some point, it would be really nice to have a phone call. I am on the east coast and am typically fully caffeinated by 10am EST. Where are you? Lastly - let me know if you want me to push my latest changes, or whether that would actually muck up the works for you.

-Fred
On Fri, May 9, 2025 at 3:11 PM Peter @.> wrote: @.* commented on this pull request. ------------------------------ In gerrychain/graph/graph.py <#428 (comment)>: > + 1) It seemed possible to me that users took advantage in their own code that the + Graph object was in fact a NetworkX Graph object. If that is so, then we can + make life easier for them by providing them an easy way to gain access to the + internal NetworkX Graph object so they can continue to use that code. Ah, yep! We will need to adapt it, however. The code in the RX documentation will work for a graph without adornment, but there is a bit of nuance that we should be careful about. Namely, we will sometimes want to be considerate of particular node and edge attributes associated to the graph. The biggest concerns here are obviously the population column and any surcharges that we would like to impose on edges to impact the MST distribution we are sampling from. — Reply to this email directly, view it on GitHub <#428 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVB2BB6777G7V72EB4FIP325T4XLAVCNFSM6AAAAAB4XAFXHGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDQMRZGQ2DOOBRGY . You are receiving this because you commented.Message ID: @.***>

I think that we can resolve all of this in a meeting. This upcoming Tuesday (August 5), I do not have any meetings scheduled currently, so I have all day open where we can meet. I live in Colorado, but I work on Eastern Time, so anytime after 10ET works great. Let me know!

@chief-dweeb
Copy link

chief-dweeb commented Aug 2, 2025 via email

@chief-dweeb
Copy link

chief-dweeb commented Aug 2, 2025 via email

@peterrrock2 peterrrock2 merged commit d747493 into mggg:wip/rustworkx-migration Aug 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants