A taxi company operating in New York City (NYC) is rethinking its fleet allocation strategy. Its restructuring process is motivated by recent changes in passenger travel patterns (provoked by Covid) and the arrival of new competitors. To minimise costs, the company wants to station its fleet in a maximum of 20 city zones, spread across the city. At the same time, the company wants to maximise trips served and therefore choose zones that have hub-like characteristics, i.e. that are well-connected and near centres of high user activity. Lastly, the company is considering operating purely outside the Manhattan borough, where there is less competition.
To help with this task, the taxi company contracts you to analyse its historical trip records and recommend a list of city zones where it should station its fleet. To that end, the company gives you a copy of its Neo4j graph database, available here, containing aggregate statistics of taxi trips undertaken in 2021. There are two types of nodes: borough and zone. Two zones are connected by a relationship of type :CONNECTS if at least one trip was recorded between them. In addition, :CONNECTS relationships have a property trips representing the total number of trips observed in the year. Lastly, a zone node is related to its borough node via a relationship of type :IN.
Based on preliminary exploration of the database and the NYC zones and boroughs map (shown above), you decide to tackle the challenge by combining two methods of network analysis available in the Neo4j Data Science Library. Your action plan is divided into three parts:
- Perform community detection to identify clusters of strongly connected zones.
- Perform centrality analysis to measure the hub-like features of each zone.
- Combine the results above by identifying the top 3 zones with the highest centrality score within each community cluster. This process is done twice for the entire city:
  - once including Manhattan, and
  - once excluding Manhattan.
In addition to your Neo4j analysis, you will use the visualisation app below to help interpret the results of your network analysis and present them to your client (the taxi company). To use the app, export your results at the end of each stage to a csv file and load it into the app (more details given in each task).
There are four tasks:
- Task 0: Find isolated nodes
- Task 1: Compute the community cluster of each node
- Task 2: Compute the centrality score of each node
- Task 3: Find the top centrality zones within each community
In task 0 you will develop a Cypher query to find nodes and relationships of a certain type.
In tasks 1 and 2 you will execute two different algorithms available in the Neo4j Graph Data Science (GDS) library. Both tasks follow the typical workflow described here:
- Create an (in-memory) graph projection.
- Estimate the memory necessary to run the algorithm (optional).
- Run the algorithm in stats mode to summarise the output of the algorithm.
- Run the algorithm in stream mode to return the results without storing them in the original graph.
IMPORTANT: Since everyone is reading from the same database, you must NOT execute write queries against the database (create/update/delete) NOR run the GDS library algorithms in write mode (as this would cause the results to be written to the original graph, shared by everyone).
In task 3 you will develop a Cypher query that combines the outcomes of tasks 1 and 2.
Due to database writing restrictions, the coursework is split across three Neo4j databases: one for task 0, another for tasks 1 and 2, and a third for task 3. Connection details are given at the end of each task's description.
Task 0: Find isolated nodes
Find all:
1. Self-pointing relationships, that is, all relationship instances of type :CONNECTS where the start node is the same as the end node.
2. Isolated nodes, i.e. all nodes that have no relationship instances of type :CONNECTS to other nodes, except possibly to themselves.
IMPORTANT: Do not delete the nodes/relationships found this way. Execute MATCH-only queries.
Deliverables: Cypher queries for 1 and 2.
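To make the two definitions concrete, here is a minimal Python sketch on a toy edge list. It is not the Cypher deliverable and does not touch the database; the zone ids, trip counts and the helper names self_loops and isolated_nodes are all made up for illustration.

```python
# Toy stand-in for the graph: zone ids and (start, end, trips) tuples
# playing the role of :CONNECTS relationships. All values are invented.
zones = {1, 2, 3, 4, 5}
connects = [(1, 2, 10), (2, 1, 4), (3, 3, 7), (2, 4, 1), (5, 5, 2), (4, 4, 3)]


def self_loops(edges):
    """Relationships whose start node is also their end node."""
    return [(s, e, t) for (s, e, t) in edges if s == e]


def isolated_nodes(nodes, edges):
    """Nodes with no relationship to any *other* node (self-loops are ignored)."""
    linked = set()
    for s, e, _ in edges:
        if s != e:
            linked.update((s, e))
    return sorted(nodes - linked)


loops = self_loops(connects)
isolated = isolated_nodes(zones, connects)
```

Note that zone 4 has a self-loop but is not isolated, because it is also connected to zone 2; zones 3 and 5 are isolated even though each has a self-loop.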
Neo4j database connection details:
- url: http://csc8101-neo4j-task0.uksouth.cloudapp.azure.com:7474/browser/
- Neo4j Connect URL: bolt://20.90.189.157:7687
- Neo4j username: neo4j
- Neo4j password: neo4j
Task 1: Compute the community cluster of each node
Run the Louvain algorithm for community detection, available in the GDS library, with the following arguments:
- Graph projection of type UNDIRECTED.
- Name the graph projection USERNAME-communities, where USERNAME is your username (this is necessary to avoid naming conflicts between students).
- Weighted by the trips property of the :CONNECTS relationships.
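Louvain searches for a partition with high modularity, a score that rewards communities whose internal weight exceeds what the node degrees alone would predict. The Python sketch below computes weighted modularity for a given partition of a small undirected toy graph. It only illustrates the quantity being optimised, not the GDS implementation; the edge list and the function name modularity are hypothetical, and self-loops are assumed to be absent.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Weighted modularity Q of a partition of an undirected graph.

    edges: list of (a, b, weight) with a != b (no self-loops assumed).
    community_of: dict mapping each node to its community label.
    """
    total = sum(w for _, _, w in edges)  # m, the total edge weight
    degree = defaultdict(float)          # weighted degree of each node
    internal = defaultdict(float)        # edge weight inside each community
    for a, b, w in edges:
        degree[a] += w
        degree[b] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += w
    degree_sum = defaultdict(float)      # summed degree of each community
    for node, k in degree.items():
        degree_sum[community_of[node]] += k
    return sum(
        internal[c] / total - (degree_sum[c] / (2 * total)) ** 2
        for c in degree_sum
    )


# Two busy triangles of zones joined by a single low-traffic link.
toy_edges = [(1, 2, 5), (2, 3, 5), (1, 3, 5),
             (4, 5, 5), (5, 6, 5), (4, 6, 5),
             (3, 4, 1)]
two_clusters = modularity(toy_edges, {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"})
one_cluster = modularity(toy_edges, {n: "a" for n in range(1, 7)})
```

Splitting the toy graph into its two triangles scores higher than lumping all zones together, which is the kind of partition a modularity-based method favours.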
As specified above, run the algorithm in two modes, stats and stream:
- Report the number of communities using the stats mode.
- Export the results of running the algorithm in stream mode as a CSV file with two columns, zone_id and community_id.
- Visualise the results in the app by uploading the produced CSV file (right-click to download the image).
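If the stream results are post-processed outside the Neo4j Browser, the two-column CSV can be produced with Python's standard csv module. The sketch below assumes the rows are already available as (zone_id, community_id) pairs; the sample values and the helper name to_csv are invented, and only the two column names come from the task description.

```python
import csv
import io


def to_csv(rows, header):
    """Serialise rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Invented sample of streamed results: (zone_id, community_id).
stream_rows = [(4, 0), (12, 0), (13, 1)]
csv_text = to_csv(stream_rows, ["zone_id", "community_id"])

# To save it for upload, write csv_text to a file, e.g.:
# with open("communities.csv", "w", newline="") as f:
#     f.write(csv_text)
```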
Tip: See the examples.
IMPORTANT: Do not run the algorithm in write mode.
Deliverables: all Cypher queries (projection, stats mode, stream mode) and the resulting visualisation image.
Neo4j database connection details:
- url: http://csc8101-neo4j-task1.uksouth.cloudapp.azure.com:7474/browser/
- Neo4j Connect URL: bolt://20.90.189.157:7687
- Neo4j username: neo4j
- Neo4j password: neo4j
Note: isolated nodes found in Task 0 have been removed for this task.
Task 2: Compute the centrality score of each node
Run the PageRank algorithm for centrality analysis, available in the GDS library, with the following arguments:
- Directed graph projection.
- Name the graph projection USERNAME-centrality, where USERNAME is your username (this is necessary to avoid naming conflicts between students).
- Damping factor: 0.75.
- Weighted by the trips property of the :CONNECTS relationships.
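To see how the direction, the trips weights and the damping factor interact, here is a small Python sketch of weighted PageRank by power iteration on a toy directed graph. It is a textbook variant in which scores sum to 1, so absolute values need not match what GDS reports; the edge list, node names and the function name weighted_pagerank are hypothetical.

```python
def weighted_pagerank(edges, damping=0.75, iterations=50):
    """Power iteration on a directed, weighted edge list [(src, dst, weight)]."""
    nodes = sorted({n for s, d, _ in edges for n in (s, d)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing trips is spread evenly.
        dangling = sum(score[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new = {n: base for n in nodes}
        # Each node passes its rank along outgoing edges, in proportion to weight.
        for s, d, w in edges:
            new[d] += damping * score[s] * w / out_weight[s]
        score = new
    return score


# Invented trips between three zones: most trips leaving A go to B.
toy_trips = [("A", "B", 90), ("A", "C", 10),
             ("B", "A", 50), ("C", "A", 50), ("C", "B", 50)]
scores = weighted_pagerank(toy_trips)
```

In this toy graph, zone C receives only a small share of A's outgoing trips and ends up with the lowest score, while A, which receives all trips leaving B and half of those leaving C, ranks highest.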
As specified above, run the algorithm in two modes, stats and stream:
- Report the maximum and minimum centrality score using the stats mode.
- Export the results of running the algorithm in stream mode as a CSV file with two columns, zone_id and centrality_score.
- Visualise the results in the app by uploading the produced CSV file (right-click to download the image).
Tip: See the examples.
IMPORTANT: Do not run the algorithm in write mode.
Deliverables: all Cypher queries (projection, stats mode, stream mode) and the resulting visualisation image.
Neo4j database connection details:
- url: http://csc8101-neo4j-task1.uksouth.cloudapp.azure.com:7474/browser/
- Neo4j Connect URL: bolt://20.90.189.157:7687
- Neo4j username: neo4j
- Neo4j password: neo4j
Note: isolated nodes found in Task 0 have been removed for this task.
Task 3: Find the top centrality zones within each community
Find the top 3 highest-centrality zones in each community:
- Including zones in the 'Manhattan' borough.
- Excluding zones in the 'Manhattan' borough.
Do this using the available zone node properties community (representing the community id obtained in Task 1) and centrality (representing the centrality score obtained in Task 2).
For each case, return two columns, zone_id and community_id, and export the query results to a CSV file. Then, using the visualisation app, upload these (one at a time) to the 'Task-3' file input and produce two separate images.
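The selection logic (group by community, rank by centrality, keep three, optionally drop one borough first) can be sketched in Python on toy rows. This is not the Cypher deliverable; the row layout, the sample values and the function name top_zones are assumptions made for illustration.

```python
from collections import defaultdict

# Invented rows: (zone_id, borough, community, centrality).
toy_zones = [
    (1, "Manhattan", 0, 9.0),
    (2, "Manhattan", 0, 8.0),
    (3, "Queens", 0, 7.0),
    (4, "Queens", 0, 6.0),
    (5, "Queens", 0, 5.0),
    (6, "Brooklyn", 1, 4.0),
    (7, "Brooklyn", 1, 3.0),
]


def top_zones(rows, n=3, exclude_borough=None):
    """Return (zone_id, community_id) for the n highest-centrality zones per community."""
    by_community = defaultdict(list)
    for zone_id, borough, community, centrality in rows:
        if borough != exclude_borough:
            by_community[community].append((centrality, zone_id))
    result = []
    for community in sorted(by_community):
        # Highest centrality first; keep at most n zones per community.
        best = sorted(by_community[community], reverse=True)[:n]
        result.extend((zone_id, community) for _, zone_id in best)
    return result


with_manhattan = top_zones(toy_zones)
without_manhattan = top_zones(toy_zones, exclude_borough="Manhattan")
```

Excluding Manhattan changes the picks for community 0 from zones 1, 2 and 3 to zones 3, 4 and 5. A community with fewer than three eligible zones simply contributes fewer rows.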
IMPORTANT: Do not delete any nodes/relationships. Execute MATCH queries only.
Deliverables: Cypher queries and the resulting visualisation images.
Neo4j database connection details:
- url: http://csc8101-neo4j-task3.uksouth.cloudapp.azure.com:7474/browser/
- Neo4j Connect URL: bolt://20.117.94.144:7687
- Neo4j username: neo4j
- Neo4j password: neo4j
Note: Tasks 1 and 2 have been run in write mode, so that the zone node properties 'centrality' and 'community' are available for this task.