Skip to content

Automated Exploratory Data Analysis (EDA) pipeline for any generic dataset as input.

License

Notifications You must be signed in to change notification settings

saraswathub/Automated-Generic-EDA-as-LLM-feed

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Automated Generic EDA as LLM Feed

This project provides an Automated Exploratory Data Analysis (EDA) pipeline that generates detailed insights into a dataset using natural language outputs. The pipeline leverages Large Language Models (LLMs) to interpret and present statistical summaries, visualizations, and key findings from structured datasets.

Features

  • Data Summary: Automatically generates descriptive statistics for numerical, categorical, and mixed datasets.
  • Outlier Detection: Identifies potential outliers using statistical methods.
  • Data Cleaning Suggestions: Highlights missing values, duplicates, and inconsistencies, and suggests preprocessing steps.
  • Correlation Analysis: Computes correlation metrics and highlights significant relationships between variables.
  • Automated Visualizations: Generates relevant visualizations (e.g., histograms, scatter plots, heatmaps) to support the findings.
  • LLM Integration: Translates technical EDA results into human-readable summaries and business insights.

In this script:

  1. The load_dataset function loads the dataset from the given path.
  2. The perform_detailed_textual_eda function performs the EDA and extracts relevant data values into a dictionary.
  3. The generate_vector_embeddings function generates vector embeddings for the extracted values using a pre-trained sentence-transformers model.

Architecture

  1. Input: Upload a structured dataset (CSV, Excel, etc.).
  2. EDA Processing:
    • Data profiling
    • Statistical computations
    • Visualization generation
  3. LLM Feed: Processed EDA results are converted into natural language summaries using an LLM.
  4. Output: Detailed EDA report as text, images, or PDF.

Installation

  1. Clone the repository:
    git clone https://github.com/your-username/automated-generic-eda-llm-feed.git
    cd automated-generic-eda-llm-feed
  2. Install dependencies:
    pip install -r requirements.txt
  3. Set up your LLM API credentials (e.g., OpenAI API or other providers). Update config.yaml or environment variables with your API key.

Dependencies

Python 3.8+
Libraries:
    pandas
    numpy
    matplotlib
    seaborn
    scikit-learn
    openai (or another LLM library)

About

Automated Exploratory Data Analysis (EDA) pipeline for any generic dataset as input.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages