-
Notifications
You must be signed in to change notification settings - Fork 76
Frm rustworkx - cleanup... #432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Frm rustworkx - cleanup... #432
Conversation
* Some changes are cosmetic * Some are results of resolving a TODO: item All tests pass...
No new functionality in GerryChain...
with a subtag that said whether the TODO: was for Performance, Documentation, Refactoring (basically any stylistic or software engineering improvement that did not affect functionality), and Code (which was a catchall for stuff that should be done sooner rather than later for one reason or another). This commit is just to allow a reviewer to be able to ignore these changes as trivial.
updated function signature comments, lots of comments more generally, updated TODO: comments (actually "did" a lot of them)
peterrrock2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Thank you for adding the tags to each of your "TODO" items.
- There are a lot of missing doc strings (esp. in
graph.py) but I think that we can worry about that more after a refactor and when we get closer to a full release
| split = {name: 0 for name in all_reg_names} | ||
|
|
||
| # frm: TODO: Grok what this tests - not clear to me at this time... | ||
| # frm: TODO: Testing: Grok what this tests - not clear to me at this time... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a test; it's just a helper function and is not picked up by pytest. Tests are prefixed with the word "test" in the function name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point about it not being a test, but I still want to "grok" what it does ;-)
| subgraph_stored_node_id = subgraph_rx.node_data(subgraph_node_id)["nx_node_id"] | ||
| subgraph_stored_node_id = subgraph_rx.node_data(subgraph_node_id)["nx_node_id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| subgraph_stored_node_id = subgraph_rx.node_data(subgraph_node_id)["nx_node_id"] | |
| subgraph_stored_node_id = subgraph_rx.node_data(subgraph_node_id)["nx_node_id"] | |
| subgraph_stored_node_id = subgraph_rx.node_data(subgraph_node_id)["nx_node_id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch... I am not sure this test should remain in the test suite, though. I will take a look at it when I get around to cleaning up the testing stuff...
| def test_subgraph(four_by_five_graph_rx): | ||
|
|
||
| """ | ||
| Subgraphs are one of the most dangerous areas of the code. | ||
| In NX, subgraphs preserve node_ids - that is, the node_id | ||
| in the subgraph is the same as the node_id in the parent. | ||
| However, in RX, that is not the case - RX always creates | ||
| new node_ids starting at 0 and increasing by one | ||
| sequentially, so in general a node in an RX subgraph | ||
| will have a different node_id than it has in the parent | ||
| graph. | ||
| To deal with this, the code creates a map from the | ||
| node_id in a subgraph to the node_id in the parent | ||
| graph, _node_id_to_parent_node_id_map. This test verifies | ||
| that this map is properly created. | ||
| In addition, all RX based graphs that came from an NX | ||
| based graph record the "original" NX node_ids in | ||
| another node_id map, _node_id_to_original_nx_node_id_map | ||
| When we create a subgraph, this map needs to be | ||
| established for the subgraph. This test verifies | ||
| that this map is properly created. | ||
| Note that this test is only configured to work on | ||
| RX based Graph objects because the only uses of subgraph | ||
| in the gerrychain codebase is on RX based Graph objects. | ||
| """ | ||
|
|
||
| # Create a subgraph for an arbitrary set of nodes: | ||
| subgraph_node_ids = [2, 4, 5, 8, 11, 13] | ||
| parent_graph_rx = four_by_five_graph_rx # make the code below clearer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add a fuzzed version of this where we do a random connected graph on 1000 nodes and then take random subgraphs of random size (say with between 100 and 200 nodes) and then repeat a few times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - good idea. I will put a TODO: Testing: comment in the code to remind me to do this
| * Aliasing concerns: | ||
| It occurs to me that the RX node_data is aliased with the NX node_data. | ||
| That is, the data dictionaries in the NX Graph are just retained | ||
| when the NX Graph is converted to be an RX Graph - so if you change | ||
| the data in the RX Graph, the NX Graph from which we created the RX | ||
| graph will also be changed. | ||
| I believe that this is also true for subgraphs for both NX and RX, | ||
| meaning that the node_data in the subgraph is the exact same | ||
| data dictionary in the parent graph and the subgraph. | ||
| I am not sure if this is a problem or not, but it is something | ||
| to be tested / thought about... | ||
| * NX allows node_ids to be almost anything - they can be integers, | ||
| strings, even tuples. I think that they just need to be hashable. | ||
| I don't know if we need to test that non-integer NX node_ids | ||
| don't cause a problem. There are tests elsewhere that have | ||
| NX node_ids that are tuples, and that test passes, so I think | ||
| we are OK, but there are no tests specifically targeting this | ||
| issue that I know of. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we don't really need to worry about the aliasing problem here. Once they get passed to a recom method, we treat graphs as immutable, read-only structures. So long as a user doesn't try to do something silly in an acceptance function or in an updater, things should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is more about the data dictionaries (node data and edge data) as these dictionaries are also shared.
I agree that it is probably not a problem, but aliasing creeps me out, and I would like to make sure that we really are OK...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
| # frm: TODO: Testing: how to handle geometry? | ||
| # | ||
| # Originally, the following statement blew up because we do not copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll probably leave the geometry in a geodataframe attached as an attribute of the "graph" class rather than trying to add data onto the nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted this entire comment - I went ahead and added geometry data to the RX-based Graph object, and it all just worked after that...
I have gone through my copious TODO: comments and have dealt with the ones that seemed important to resolve before releasing it to alpha/beta users. I have also updated the function signature documentation for routines in graph.py and tree.py where it seemed most important. Also some cosmetic changes - longer variable names etc.
I have just redone performance tests - comparing NX performance against RX performance. There is one line in side _first_time() in partition.py that will revert partitions to use NX graphs internally. I use this to run the current code with an embedded graph that is either NX or RX, and then compare the results.
I have created an Excel spreadsheet that compares NX vs. RX, and I would be happy to share it with you, but the TLDR is that on a test that does 100,000 iterations in MarkovChain() the RX version runs more than twice as fast with essentially all of the speedup happening in random_spanning_tree() due to the underlying nxtree.minimum_spanning_tree() vs. rx.minimum_spanning_tree() which is an astonishingly 4 times faster than the NX version.
So - let me know what else you would like done before we let real users loose on this codebase...
-Fred