This repository demonstrates the application of text embeddings, Uniform Manifold Approximation and Projection (UMAP), and visualization tools such as Plotly Dash to gain insights into a dataset of research paper abstracts and associated metadata.
To run the demo on Google Colab, follow these steps:
- Copy the URL of the abstract_summary.ipynb notebook: Notebook URL
- Navigate to Google Colab.
- Click on File > Open notebook.
- Switch to the GitHub tab, paste the copied URL into the search bar, and press Enter.
- Open the notebook from the search results.
To install and run the demo locally, follow the steps below in the terminal:
$ conda create -n text-embedding-demo python=3.11
$ conda activate text-embedding-demo
$ pip install torch # Use the exact installation command from https://pytorch.org/
$ git clone https://github.com/NoviaIntSysGroup/text-embedding-demo.git
$ cd text-embedding-demo
$ pip install -e .
NOTE: conda, Python, and git must already be installed.
- Description: Utilizes UMAP for visualizing data similarity in a 2D scatter plot, where each point corresponds to an individual research paper.
- Interactive Features:
  - Hover: Displays the paper's summary or abstract and its metadata.
  - Selection: Enables selection of multiple points for comparative analysis.
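
The scatter plot described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the umap-learn package is installed, and the function names `project_2d` and `hover_text`, the record fields `title`, `year`, and `abstract`, and the UMAP parameter values are all hypothetical.

```python
import textwrap


def project_2d(embeddings, n_neighbors=15, min_dist=0.1, seed=42):
    """Reduce high-dimensional text embeddings to 2D coordinates with UMAP.

    Assumes umap-learn is installed. The import is done lazily so that
    hover_text below stays usable without it.
    """
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(
        n_components=2,          # 2D output for a scatter plot
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        metric="cosine",         # a common choice for text embeddings
        random_state=seed,       # fixed seed for a reproducible layout
    )
    return reducer.fit_transform(embeddings)  # shape: (n_papers, 2)


def hover_text(record, width=60, max_chars=300):
    """Build hover text for one paper: a title line plus a wrapped abstract.

    Lines are joined with <br>, which Plotly renders as a line break.
    """
    abstract = textwrap.shorten(record["abstract"], width=max_chars, placeholder="...")
    lines = [f"<b>{record['title']}</b> ({record['year']})"]
    lines.extend(textwrap.wrap(abstract, width=width))
    return "<br>".join(lines)
```

Each row returned by `project_2d` would become one point in the scatter plot, with the matching `hover_text` string attached as that point's hover label.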
- Description: A horizontal bar graph representing the similarity of keywords in the dataset, offering insights into dominant themes.
- Description: Provides detailed information about a highlighted or selected data entry from the scatter plot.
- Description: Visualizes metrics related to the dataset using a radar chart (the Spider Chart).
- Features:
  - Mean Line: Depicts the average value for each metric across the dataset.
  - Hover Line: Updates to show the metric values of a data point highlighted in the UMAP plot.
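
The mean line and hover line described above come down to simple per-metric arithmetic. The sketch below shows one way to compute them. The metric names and values are made up for illustration, and `mean_line` and `close_polygon` are hypothetical helpers rather than functions from this repository.

```python
from statistics import fmean


def mean_line(records, metrics):
    """Average each metric across all records (the chart's mean line)."""
    return [fmean(r[m] for r in records) for m in metrics]


def close_polygon(values):
    """Repeat the first value so the radar trace forms a closed loop."""
    values = list(values)
    return values + values[:1]


# Hypothetical metric names and scores, for illustration only.
metrics = ["novelty", "clarity", "impact"]
records = [
    {"novelty": 0.25, "clarity": 1.0, "impact": 0.5},
    {"novelty": 0.75, "clarity": 0.5, "impact": 0.0},
]

# Mean line: one averaged value per metric, closed into a loop.
mean_trace = close_polygon(mean_line(records, metrics))

# Hover line: the metric values of the currently highlighted record.
hover_trace = close_polygon([records[0][m] for m in metrics])
```

In Plotly, lists like these could be passed as the `r` values of a `Scatterpolar` trace, with the metric names (closed in the same way) as `theta`.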
- Explore Clusters: Begin with the 2D UMAP scatter plot to explore data clusters and patterns.
- Inspect Details: Hover over points to view abstracts, metadata, and an updated Spider Chart.
- Group Analysis: Select multiple points for comparative analysis.
- Identify Themes: Color-code abstracts by their similarity to a user-provided query or a selected topic sentence. Analyze the Keyword Bar Graph for prevalent themes, and use the UMAP scatter plot to group abstracts thematically.
- Deep Dive: Use the Detailed Data View and Spider Chart for deeper insights, comparing individual metrics against the dataset mean.
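
The similarity-based color-coding mentioned under Identify Themes can be illustrated with cosine similarity between a query embedding and each abstract embedding. This is a sketch under stated assumptions: the source does not say which similarity measure the demo uses, the three-dimensional vectors below are toy stand-ins for real text embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_scores(query_vec, abstract_vecs):
    """Score every abstract against the query; the scores can drive a color scale."""
    return [cosine_similarity(query_vec, v) for v in abstract_vecs]


# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 0.0]
abstracts = [
    [2.0, 0.0, 0.0],  # same direction as the query
    [0.0, 3.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # halfway between the two
]
scores = similarity_scores(query, abstracts)
```

Here the first abstract scores highest and the second lowest, so a continuous color scale mapped to `scores` would make the points most related to the query stand out in the scatter plot.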