Automated Generic EDA as LLM Feed

This project provides an automated Exploratory Data Analysis (EDA) pipeline that describes a structured dataset in natural language. It uses Large Language Models (LLMs) to interpret and present the statistical summaries, visualizations, and key findings that the pipeline computes.

Features

  • Data Summary: Automatically generates descriptive statistics for numerical and categorical columns, including datasets that mix both.
  • Outlier Detection: Identifies potential outliers using statistical methods.
  • Data Cleaning Suggestions: Highlights missing values, duplicates, and inconsistencies, and suggests preprocessing steps.
  • Correlation Analysis: Computes correlation metrics and highlights significant relationships between variables.
  • Automated Visualizations: Generates relevant visualizations (e.g., histograms, scatter plots, heatmaps) to support the findings.
  • LLM Integration: Translates technical EDA results into human-readable summaries and business insights.
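The outlier and correlation features above do not name a specific method, so the following is only a minimal sketch under assumptions: it uses the common 1.5 × IQR rule for outliers and a Pearson correlation threshold of 0.8 for "significant" relationships. The function names `iqr_outliers` and `strong_correlations`, the threshold values, and the sample data are hypothetical and not taken from the project.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of a numeric series outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


def strong_correlations(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List (col_a, col_b, r) for numeric column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


# Illustrative data only: one extreme price and a perfectly inverse "units" column.
df = pd.DataFrame({
    "price": [10, 12, 11, 13, 12, 95],
    "units": [100, 98, 99, 97, 98, 15],
    "region": ["N", "S", "N", "E", "W", "S"],
})
outliers = iqr_outliers(df["price"])
pairs = strong_correlations(df)
print(outliers.tolist())
print(pairs)
```

Returning plain Python values (a list of tuples, a filtered series) keeps the results easy to serialize into text for the LLM step.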

In this script:

  1. The load_dataset function loads the dataset from the given path.
  2. The perform_detailed_textual_eda function performs the EDA and extracts relevant data values into a dictionary.
  3. The generate_vector_embeddings function generates vector embeddings for the extracted values using a pre-trained sentence-transformers model.
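The three functions listed above could look roughly like the sketch below. Only the function names come from this README; the signatures, the dictionary keys, the CSV-only loader, and the model name `all-MiniLM-L6-v2` are assumptions made for illustration. The sentence-transformers import is deferred so the EDA steps run even when that package is not installed.

```python
import io

import pandas as pd


def load_dataset(path_or_buffer):
    """Load a CSV file (path or file-like object) into a DataFrame."""
    return pd.read_csv(path_or_buffer)


def perform_detailed_textual_eda(df):
    """Collect basic EDA facts keyed by topic (keys are hypothetical)."""
    results = {
        "shape": f"{df.shape[0]} rows, {df.shape[1]} columns",
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
        "duplicates": int(df.duplicated().sum()),
    }
    for col in df.select_dtypes(include="number").columns:
        s = df[col]
        results[f"stats:{col}"] = (
            f"{col}: mean={s.mean():.2f}, min={s.min():.2f}, max={s.max():.2f}"
        )
    return results


def generate_vector_embeddings(results, model_name="all-MiniLM-L6-v2"):
    """Embed each extracted value as text (requires sentence-transformers)."""
    # Imported lazily so the EDA steps work without the optional dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    keys = list(results)
    vectors = model.encode([str(results[k]) for k in keys])
    return dict(zip(keys, vectors))


# Small in-memory CSV so the example runs without a file on disk.
csv_text = "age,city\n30,Oslo\n41,Rome\n,Oslo\n"
df = load_dataset(io.StringIO(csv_text))
eda = perform_detailed_textual_eda(df)
print(eda)
```

Calling `generate_vector_embeddings(eda)` would then return one vector per dictionary key; it is left out of the demo because it downloads a model on first use.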

Architecture

  1. Input: Upload a structured dataset (CSV, Excel, etc.).
  2. EDA Processing:
    • Data profiling
    • Statistical computations
    • Visualization generation
  3. LLM Feed: Processed EDA results are converted into natural language summaries using an LLM.
  4. Output: Detailed EDA report as text, images, or PDF.
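The LLM Feed step in the flow above can be pictured as turning the EDA results into a text prompt. The sketch below is one possible way to do that, not the project's actual prompt: the function name `build_llm_prompt`, the instruction wording, and the sample results dictionary are all hypothetical. It stops at building the prompt string and makes no API call.

```python
import json


def build_llm_prompt(eda_results: dict, dataset_name: str = "dataset") -> str:
    """Render EDA results as a plain-text prompt for a chat-style LLM."""
    lines = [
        f"You are a data analyst. Summarize the EDA results for '{dataset_name}'",
        "in plain language and point out data quality issues.",
        "",
        "EDA results (JSON):",
        # default=str keeps non-JSON types (e.g. timestamps) from raising.
        json.dumps(eda_results, indent=2, sort_keys=True, default=str),
    ]
    return "\n".join(lines)


prompt = build_llm_prompt(
    {"rows": 3, "missing": {"age": 1}, "outliers": {"price": [95]}},
    dataset_name="sales.csv",
)
print(prompt)
```

Serializing the results as sorted JSON keeps the prompt deterministic for a given dataset, which makes LLM outputs easier to compare between runs.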

Installation

  1. Clone the repository:
    git clone https://github.com/your-username/automated-generic-eda-llm-feed.git
    cd automated-generic-eda-llm-feed
  2. Install dependencies:
    pip install -r requirements.txt
  3. Set up your LLM API credentials (e.g., for the OpenAI API or another provider) by adding your API key to config.yaml or to an environment variable.
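For the environment-variable route in step 3, a small helper like the one below can fail early with a clear message when the key is missing. This is a sketch under assumptions: the helper name `get_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is the one the openai client library conventionally reads, which this project may or may not use.

```python
import os


def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the pipeline."
        )
    return key
```

Reading the key at startup, rather than at the first LLM call, avoids running the whole EDA stage only to fail at the final step.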

Dependencies

Python 3.8+
Libraries:
    pandas
    numpy
    matplotlib
    seaborn
    scikit-learn
    openai (or another LLM library)
    sentence-transformers (used by generate_vector_embeddings)