GraphML Doc Updates #689

Open
wants to merge 5 commits into
base: main
161 changes: 83 additions & 78 deletions site/content/3.10/data-science/arangographml/_index.md
@@ -7,32 +7,24 @@ description: >-
aliases:
- graphml
---
Traditional machine learning overlooks the connections and relationships
Traditional Machine Learning overlooks the connections and relationships
between data points, which is where graph machine learning excels. However,
accessibility to GraphML has been limited to sizable enterprises equipped with
specialized teams of data scientists. ArangoGraphML, on the other hand,
simplifies the utilization of GraphML, enabling a broader range of personas to
extract profound insights from their data.
specialized teams of data scientists. ArangoGraphML simplifies the utilization of GraphML,
enabling a broader range of personas to extract profound insights from their data.

## How GraphML works

GraphML focuses on the utilization of neural networks specifically for
graph-related tasks. It is well-suited for addressing vague or fuzzy problems
and facilitating their resolution. The process involves incorporating a graph's
topology (node and edge structure) and the node and edge characteristics and
features to create a numerical representation known as an embedding.
Graph machine learning leverages the inherent structure of graph data, where entities (nodes) and their relationships (edges) form a network. Unlike traditional ML, which primarily operates on tabular data, GraphML applies specialized algorithms like Graph Neural Networks (GNNs), node embeddings, and link prediction to uncover complex patterns and insights.

![GraphML Embeddings](../../../images/GraphML-Embeddings.webp)
1. **Graph Construction** – Raw data is transformed into a graph structure, defining nodes and edges based on real-world relationships.
2. **Featurization** – Nodes and edges are enriched with features that help in training predictive models.
3. **Model Training** – Machine learning techniques, such as GNNs, are applied to identify patterns and make predictions.
4. **Inference & Insights** – The trained model is used to classify nodes, detect anomalies, recommend items, or predict future connections.
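
The first two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `transactions` records, the `build_graph` helper, and the two features (`out_degree`, `total_sent`) are hypothetical and are not part of the ArangoGraphML API.

```python
# Hypothetical raw records: each transaction links a sender to a receiver.
transactions = [
    {"sender": "alice", "receiver": "bob", "amount": 120.0},
    {"sender": "bob", "receiver": "carol", "amount": 40.0},
    {"sender": "alice", "receiver": "carol", "amount": 75.5},
]


def build_graph(records):
    """Turn raw records into nodes, edges, and simple per-node features."""
    # Graph construction: every party becomes a node, every transfer an edge.
    nodes = set()
    edges = []
    for rec in records:
        nodes.update((rec["sender"], rec["receiver"]))
        edges.append((rec["sender"], rec["receiver"], rec["amount"]))

    # Featurization: out-degree and total amount sent per node.
    features = {n: {"out_degree": 0, "total_sent": 0.0} for n in nodes}
    for src, _dst, amount in edges:
        features[src]["out_degree"] += 1
        features[src]["total_sent"] += amount
    return nodes, edges, features


nodes, edges, features = build_graph(transactions)
```

In a real deployment the nodes and edges would live in ArangoDB collections rather than in Python containers; the sketch only shows what "defining nodes and edges" and "enriching with features" mean in practice.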

Graph Neural Networks (GNNs) are explicitly designed to learn meaningful
numerical representations, or embeddings, for nodes and edges in a graph.
ArangoGraphML streamlines these steps, providing an intuitive and scalable framework to integrate GraphML into various applications, from fraud detection to recommendation systems.

By applying a series of steps, GNNs effectively create graph embeddings,
which are numerical representations that encode the essential information
about the nodes and edges in the graph. These embeddings can then be used
for various tasks, such as node classification, link prediction, and
graph-level classification, where the model can make predictions based on the
learned patterns and relationships within the graph.
![GraphML Embeddings](../../../images/GraphML-Embeddings.webp)

![GraphML Workflow](../../../images/GraphML-How-it-works.webp)

@@ -45,71 +37,84 @@ The platform comes preloaded with all the tools needed to prepare your graph
for machine learning, high-accuracy training, and persisting predictions back
to the database for application use.

### Classification

Node classification is a natural fit for graph databases as it can leverage
existing graph analytics insights during model training. For instance, if you
have performed some community detection, potentially using ArangoDB's built-in
Pregel support, you can use these insights as inputs for graph machine learning.

#### What is Node Classification

The goal of node classification is to categorize the nodes in a graph based on
their neighborhood connections and characteristics in the graph. Based on the
behaviors or patterns in the graph, the Graph Neural Network (GNN) will be able
to learn what makes a node belong to a category.

Node classification can be used to solve complex problems such as:
- Entity Categorization
- Email
- Books
- WebPage
- Transaction
- Social Networks
- Events
- Friends
- Interests
- BioPharmaceutical
- Protein-protein interaction
- Drug Categorization
- Sequence grouping
- Behavior
- Fraud
- Purchase/decision making
- Anomaly

Many use cases can be solved with node classification. With many challenges,
there are multiple ways to attempt to solve them, and that's why the
ArangoGraphML node classification is only the first of many techniques to be
introduced. You can sign up to get immediate access to our latest stable
features and also try out other features included in the pipeline, such as
embedding similarity or link prediction.

For more information, [get in touch](https://www.arangodb.com/contact/)
with the ArangoDB team.

### Metrics and Compliance

#### Training Performance

Before using a model to provide predictions to your application, there needs
to be a way to determine its level of accuracy. Additionally, a mechanism must
be in place to ensure the experiments comply with auditor requirements.

ArangoGraphML supports these objectives by storing all relevant training data
and metrics in a metadata graph, which is only available to you and is never
viewable by ArangoDB. This metagraph contains valuable training metrics such as
average accuracy (the general metric for determining model performance), F1,
Recall, Precision, and confusion matrix data. This graph links all experiments
## Supported Tasks

### Node Classification

Node classification is a **supervised learning** task where the goal is to predict the label of a node based on both its own features and its relationships within the graph. It requires a set of labeled nodes to train a model, which then classifies unlabeled nodes based on learned patterns.

**How It Works in ArangoGraphML**
- A portion of the nodes in a graph is labeled for training.
- The model learns patterns from both **node features** and **structural relationships** (neighboring nodes and connections).
- It predicts labels for unlabeled nodes based on these learned patterns.
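
To make the role of structural relationships concrete, here is a deliberately simplified stand-in: a neighbor-majority vote that spreads labels from labeled to unlabeled nodes. It is not a GNN, it ignores node features, and it is not how ArangoGraphML trains models. The `propagate_labels` function and the toy graph are assumptions made for illustration.

```python
from collections import Counter


def propagate_labels(adjacency, labels, rounds=3):
    """Give each unlabeled node the most common label among its labeled neighbors."""
    known = dict(labels)
    for _ in range(rounds):
        updates = {}
        for node, neighbors in adjacency.items():
            if node in known:
                continue
            votes = Counter(known[n] for n in neighbors if n in known)
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        if not updates:
            break  # nothing left to label
        known.update(updates)
    return known


# Hypothetical undirected graph given as adjacency lists.
adjacency = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["a", "c", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}
seed_labels = {"a": "fraud", "b": "fraud", "f": "legit"}
predicted = propagate_labels(adjacency, seed_labels)
```

A trained GNN replaces the majority vote with learned functions over both features and neighborhoods, but the input and output have the same shape: a few labeled nodes go in, and a label for every node comes out.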

**Example Use Cases**

**1. Fraud Detection in Financial Networks**
- **Problem:** Fraudsters often create multiple accounts or interact within suspicious clusters to evade detection.
- **Solution:** A transaction graph is built where nodes represent users and edges represent transactions. The model learns patterns from labeled fraudulent and legitimate users, detecting hidden fraud rings based on **both user attributes and transaction relationships**.

**2. Customer Segmentation in E-Commerce & Social Media**
- **Problem:** Businesses need to categorize customers based on purchasing behavior and engagement.
- **Solution:** A graph is built where nodes represent customers and edges represent interactions (purchases, reviews, social connections). The model predicts the category of each user based on how similar they are to other users, **not just by their personal data, but also by how they are connected to others**.

**3. Disease Classification in Biomedical Networks**
- **Problem:** Identifying proteins or genes associated with a disease.
- **Solution:** A protein interaction graph is built where nodes are proteins and edges represent biochemical interactions. The model classifies unknown proteins based on their interactions with known disease-related proteins, rather than just their individual properties.

### Node Embedding Generation

Node embedding is an **unsupervised learning** technique that converts nodes into numerical vector representations, preserving their **structural relationships** within the graph. Unlike simple feature aggregation, node embeddings **capture the influence of neighboring nodes and graph topology**, making them powerful for downstream tasks like clustering, anomaly detection, and link prediction. Consider using [ArangoDB's Vector Search](https://arangodb.com/2024/11/vector-search-in-arangodb-practical-insights-and-hands-on-examples/) capabilities to find similar nodes based on their embeddings.

**Feature Embeddings vs Node Embeddings**

**Feature Embeddings** are vector representations derived from the attributes or features associated with nodes. These embeddings aim to capture the inherent characteristics of the data. For example, in a social network, a feature embedding might encode user attributes like age, location, and interests. Techniques like **Word2Vec**, **TF-IDF**, or **autoencoders** are commonly used to generate such embeddings.

In the context of graphs, **Node Embeddings** are a **combination of a node’s feature embedding and the structural information from its connected edges**. Essentially, they aggregate both the node’s attributes and the connectivity patterns within the graph. This fusion helps capture not only the individual properties of a node but also its position and role within the network.
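
The distinction can be shown with a minimal sketch, assuming each node already has a feature vector. A single round of mean aggregation over neighbors, with no learned weights, is concatenated with the node's own features. The `node_embedding` helper and the toy data are hypothetical; a real GNN learns the aggregation instead of hard-coding it.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def node_embedding(node, features, adjacency):
    """Concatenate a node's own features with the mean of its neighbors' features."""
    own = features[node]
    neighbors = adjacency.get(node, [])
    if neighbors:
        context = mean_vector([features[n] for n in neighbors])
    else:
        context = [0.0] * len(own)  # isolated node: no structural signal
    return own + context


# Hypothetical feature embeddings (attributes only).
features = {
    "u1": [1.0, 0.0],
    "u2": [0.0, 1.0],
    "u3": [1.0, 1.0],
}
adjacency = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1"]}

# Node embeddings: attributes plus neighborhood context.
embeddings = {n: node_embedding(n, features, adjacency) for n in features}
```

Here `u2` and `u3` have different attributes but the same neighborhood, so the second half of their embeddings is identical, which is the structural information a feature embedding alone cannot express.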

**How It Works in ArangoGraphML**
- The model learns an embedding (a vector representation) for each node based on its **position within the graph and its connections**.
- It **does not rely on labeled data**—instead, it captures structural patterns through graph traversal and aggregation of neighbor information.
- These embeddings can be used for similarity searches, clustering, and predictive tasks.
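
As an example of the similarity searches mentioned above, the following sketch ranks nodes by cosine similarity with a brute-force scan. The embeddings and the `most_similar` helper are made up for illustration, and this is not the ArangoDB Vector Search API, which would typically use an index instead of comparing every pair.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(target, embeddings, top_k=2):
    """Return the top_k other nodes ranked by similarity to the target node."""
    scores = [
        (other, cosine_similarity(embeddings[target], vector))
        for other, vector in embeddings.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Hypothetical embeddings, e.g. produced by an embedding job.
embeddings = {
    "item_a": [0.9, 0.1, 0.0],
    "item_b": [0.8, 0.2, 0.1],
    "item_c": [0.0, 0.1, 0.9],
}
nearest = most_similar("item_a", embeddings, top_k=1)
```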

**Example Use Cases**

**1. Recommendation Systems (E-commerce & Streaming Platforms)**
- **Problem:** Platforms like Amazon, Netflix, and Spotify need to recommend products, movies, or songs.
- **Solution:** A user-item interaction graph is built where nodes are users and products, and edges represent interactions (purchases, ratings, listens). **Embeddings encode relationships**, allowing the system to recommend similar items based on user behavior and network influence rather than just individual preferences.

**2. Anomaly Detection in Cybersecurity & Finance**
- **Problem:** Detecting unusual activity (e.g., cyberattacks, money laundering) in complex networks.
- **Solution:** A network of IP addresses, users, and transactions is represented as a graph. Nodes with embeddings that significantly deviate from normal patterns are flagged as potential threats. The key advantage here is that anomalies are detected based on **network structure, not just individual activity logs**.

**3. Link Prediction (Social & Knowledge Graphs)**
- **Problem:** Predicting new relationships, such as suggesting friends on social media or forecasting research paper citations.
- **Solution:** A social network graph is created where nodes are users, and edges represent friendships. **Embeddings capture the likelihood of connections forming based on shared neighborhoods and structural similarities, even if users have never interacted before**.
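
The "shared neighborhoods" idea behind link prediction can be illustrated without any embeddings at all. The sketch below scores unconnected pairs by the Jaccard overlap of their neighbor sets, a classic heuristic used here as a stand-in for embedding-based scoring. The `friends` graph and helper names are assumptions for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidate_links(adjacency):
    """Score every unconnected pair of nodes by neighborhood overlap."""
    scored = []
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already connected, nothing to predict
        scored.append(((u, v), jaccard(adjacency[u], adjacency[v])))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Hypothetical friendship graph as sets of neighbors.
friends = {
    "ann": {"bea", "cy"},
    "bea": {"ann", "cy", "dan"},
    "cy": {"ann", "bea"},
    "dan": {"bea"},
}
suggestions = rank_candidate_links(friends)
```

Embedding-based link prediction follows the same pattern, with the Jaccard score replaced by a similarity or learned scoring function over the two nodes' embeddings.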

### Key Differences

| Feature | Node Classification | Node Embedding Generation |
|------------------------|--------------------|--------------------------|
| **Learning Type** | Supervised | Unsupervised |
| **Input Data** | Labeled nodes | Graph structure & features |
| **Output** | Predicted labels | Node embeddings (vectors) |
| **Key Advantage** | Learns labels based on node connections and attributes | Learns structural patterns and node relationships |
| **Use Cases** | Fraud detection, customer segmentation, disease classification | Recommendations, anomaly detection, link prediction |

ArangoGraphML provides the infrastructure to efficiently train and apply these models, helping users extract meaningful insights from complex graph data.

## Metrics and Compliance

ArangoGraphML supports tracking your ML pipeline by storing all relevant metadata
and metrics in a graph called ArangoPipe. This graph is only available to you and is never
viewable by ArangoDB. This metadata graph links all experiments
to the source data, feature generation activities, training runs, and prediction
jobs. Having everything linked across the entire pipeline ensures that, at any
time, anything done that could be considered associated with sensitive user data,
it is logged and easily accessible.
jobs, allowing you to track the entire ML pipeline without having to leave ArangoDB.

### Security

Each deployment that uses ArangoGraphML has an `arangopipe` database created,
which houses all this information. Since the data lives with the deployment,
which houses all ML metadata. Since this data lives within the deployment,
it benefits from the ArangoGraph SOC 2 compliance and Enterprise security features.
All ArangoGraphML services live alongside the ArangoGraph deployment and are only
accessible within that organization.