This project provides an Automated Exploratory Data Analysis (EDA) pipeline that generates detailed insights into a dataset using natural language outputs. The pipeline leverages Large Language Models (LLMs) to interpret and present statistical summaries, visualizations, and key findings from structured datasets.
- Data Summary: Automatically generates descriptive statistics for numerical, categorical, and mixed datasets.
- Outlier Detection: Identifies potential outliers using statistical methods.
- Data Cleaning Suggestions: Highlights missing values, duplicates, and inconsistencies, and suggests preprocessing steps.
- Correlation Analysis: Computes correlation metrics and highlights significant relationships between variables.
- Automated Visualizations: Generates relevant visualizations (e.g., histograms, scatter plots, heatmaps) to support the findings.
- LLM Integration: Translates technical EDA results into human-readable summaries and business insights.
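As a rough illustration of the statistical features listed above, the following minimal sketch uses pandas. The function names (`summarize`, `iqr_outliers`, `strong_correlations`, `cleaning_report`), the 1.5 × IQR rule, and the 0.8 correlation threshold are assumptions made for this example; the project may use different methods and defaults.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Descriptive statistics for numeric and categorical columns alike."""
    return df.describe(include="all")


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


def strong_correlations(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            # NaN correlations (e.g. constant columns) fail this test and are skipped.
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


def cleaning_report(df: pd.DataFrame) -> dict:
    """Count missing values per column and fully duplicated rows."""
    return {
        "missing": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }
```

Each helper returns plain pandas or Python objects, so the results can be passed to a plotting step or serialized for the LLM without further conversion.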
In this script:
- The load_dataset function loads the dataset from the given path.
- The perform_detailed_textual_eda function performs the EDA and extracts relevant data values into a dictionary.
- The generate_vector_embeddings function generates vector embeddings for the extracted values using a pre-trained sentence-transformers model.
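The three functions described above could look roughly like this. Only the function names come from the script; the bodies, the extracted keys (`shape`, `missing`, `mean_<column>`), and the model name `all-MiniLM-L6-v2` are assumptions chosen for illustration, not the script's actual implementation.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path: str) -> pd.DataFrame:
    """Load a CSV or Excel file into a DataFrame, chosen by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".xls", ".xlsx"}:
        return pd.read_excel(path)
    return pd.read_csv(path)


def perform_detailed_textual_eda(df: pd.DataFrame) -> dict:
    """Collect EDA facts as short text strings keyed by topic."""
    values = {
        "shape": f"{df.shape[0]} rows, {df.shape[1]} columns",
        "missing": f"missing cells: {int(df.isna().sum().sum())}",
    }
    # One sentence per numeric column (assumed key naming: mean_<column>).
    for col in df.select_dtypes("number").columns:
        values[f"mean_{col}"] = f"mean of {col} is {df[col].mean():.3f}"
    return values


def generate_vector_embeddings(values: dict, model_name: str = "all-MiniLM-L6-v2") -> dict:
    """Embed each extracted string and return {key: vector}."""
    # Imported lazily so the functions above work without sentence-transformers.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # assumed model; downloads on first use
    keys = list(values)
    vectors = model.encode([values[k] for k in keys])
    return dict(zip(keys, vectors))
```

Keeping the extracted values as short sentences rather than raw numbers gives the sentence-transformers model text it can embed meaningfully.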
- Input: Upload a structured dataset (CSV, Excel, etc.).
- EDA Processing:
  - Data profiling
  - Statistical computations
  - Visualization generation
- LLM Feed: Processed EDA results are converted into natural language summaries using an LLM.
- Output: Detailed EDA report as text, images, or PDF.
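The LLM Feed step in the workflow above can be sketched as follows. The helper names `build_prompt` and `llm_feed`, the prompt wording, and the `complete` callable are hypothetical; `complete` stands in for whichever LLM client the project is configured with, so the sketch does not depend on any particular provider's API.

```python
import json
from typing import Callable


def build_prompt(eda_results: dict) -> str:
    """Serialize EDA results into a plain-text prompt for an LLM."""
    # default=str keeps non-JSON types (timestamps, NumPy scalars) from raising.
    payload = json.dumps(eda_results, indent=2, sort_keys=True, default=str)
    return (
        "You are a data analyst. Summarize the following exploratory data "
        "analysis results in plain language and note any data quality "
        "issues.\n\n" + payload
    )


def llm_feed(eda_results: dict, complete: Callable[[str], str]) -> str:
    """Send the prompt to any text-completion callable and return its reply."""
    return complete(build_prompt(eda_results))
```

Passing the completion function in as an argument makes it easy to swap providers, or to substitute a stub when testing the pipeline offline.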
- Clone the repository:
git clone https://github.com/your-username/automated-generic-eda-llm-feed.git
cd automated-generic-eda-llm-feed
- Install dependencies:
pip install -r requirements.txt
- Set up your LLM API credentials (e.g., OpenAI API or other providers). Update config.yaml or environment variables with your API key.
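One possible way to resolve the API key at runtime is sketched below. The function name `load_api_key`, the `api_key` field in config.yaml, and the use of PyYAML are assumptions for this example; the project's actual config layout is not documented here. `OPENAI_API_KEY` is the environment variable the OpenAI client reads by default.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(config_path: str = "config.yaml",
                 env_var: str = "OPENAI_API_KEY") -> Optional[str]:
    """Return the API key from the environment, else from the config file."""
    key = os.environ.get(env_var)
    if key:
        return key
    path = Path(config_path)
    if path.is_file():
        import yaml  # PyYAML; only needed when falling back to the file

        config = yaml.safe_load(path.read_text()) or {}
        return config.get("api_key")  # hypothetical field name
    return None
```

Checking the environment first keeps the key out of version control when config.yaml is committed.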
- Python 3.8+
- Libraries:
  - pandas
  - numpy
  - matplotlib
  - seaborn
  - scikit-learn
  - openai (or another LLM library)