ISAW AI Librarian

Overview

The ISAW AI Librarian is an end-to-end pipeline tailored for humanities scholars. It enables:

Data Collection: Scraping and downloading scholarly articles from online sources.
Data Preprocessing: Splitting text into manageable chunks with metadata.
Knowledge Augmentation: Building a retrieval-augmented system for querying domain-specific knowledge.
User Interface: Deploying a user-friendly interface to retrieve and explore insights from the data.

The tool is designed for researchers in history, archaeology, and related fields. It gives them easier access to their own data and more effective information retrieval.

Discover more about the ISAW AI Librarian on the Institute for the Study of the Ancient World (ISAW) website.

Features

Web Scraping: Automatically download scholarly PDFs from web pages.
Data Preprocessing: Split long texts into manageable, metadata-tagged chunks for efficient storage and querying.
Text Embedding and Search: Create a searchable database of vector embeddings for similarity-based retrieval.
QA Chat Interface: Use an interactive Gradio interface for querying domain-specific knowledge, with options to refine queries and retrieve sources.
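
The web-scraping feature can be illustrated with a minimal sketch that finds PDF links on a page using only the Python standard library. The names `PdfLinkParser` and `find_pdf_links` and the `example.org` URLs are hypothetical; the project's own scraper may work differently, and actually downloading the files is left out here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags that point to .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return absolute URLs of all PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/papers/1.pdf">Paper 1</a> <a href="about.html">About</a>'
print(find_pdf_links(page, "https://example.org/isaw-papers/"))
```

Each returned URL could then be fetched and saved to disk in a separate step.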

How It Works

Pipeline Overview

Data Collection: Download scholarly articles from URLs.
Chunking: Split documents into manageable chunks, retaining metadata like source URLs and page numbers.
Embedding: Convert text into embeddings using OpenAIEmbeddings for similarity-based search.
Query Interface: Implement an AI-powered librarian using LangChain to answer questions and retrieve information interactively.
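
The retrieval idea behind these steps can be sketched without any external service. In the sketch below, `toy_embed` is a deliberately simple bag-of-words stand-in for a real embedding model such as OpenAIEmbeddings, and the chunk texts, URLs, and function names are made up for illustration. It shows only how metadata-tagged chunks are ranked by similarity to a query.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, chunks, k=1):
    """Return the k chunks most similar to the query, best first."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c["text"])), reverse=True)
    return ranked[:k]


# Each chunk keeps its metadata (source URL and page number) alongside the text.
chunks = [
    {"text": "Roman coins found at the excavation site.", "source": "https://example.org/a.pdf", "page": 3},
    {"text": "Astronomical diaries from Babylon record eclipses.", "source": "https://example.org/b.pdf", "page": 12},
]
best = search("Which texts record eclipses?", chunks)[0]
print(best["source"], best["page"])
```

Because the metadata travels with each chunk, the interface can cite the source and page for every retrieved passage.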

Installation

Clone the repository:

git clone https://github.com/federicodip/isaw_ai_librarian.git
cd isaw_ai_librarian

Install dependencies:
```
pip install -r requirements.txt
```

Set the environment variable for your OpenAI API key:

export OPENAI_API_KEY=your_openai_api_key
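
A script can check up front that the key is available and fail with a clear message if it is not. This is a minimal sketch; the helper name `get_openai_api_key` is hypothetical and is not part of the repository.

```python
import os


def get_openai_api_key(env=os.environ):
    """Return the API key, or raise a clear error if it is not set."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running the pipeline.")
    return key


try:
    get_openai_api_key()
    print("API key found.")
except RuntimeError as err:
    print(err)
```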

User Interface

The Gradio-powered interface offers:

Textbox: For asking questions or inputting research summaries.
Refinement Options: Use checkboxes for domain-specific perspectives like "Art History" or "Archaeology."
Multilingual Support: Receive responses in English, Mandarin, Arabic, Spanish, or Russian.
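
The options above could be wired together roughly as follows. This is a sketch under assumptions: the helper names `build_prompt` and `build_interface`, the prompt wording, and the use of a dropdown for language selection are illustrative choices, not the project's actual code. Gradio is imported lazily so that the prompt logic runs even where Gradio is not installed.

```python
LANGUAGES = ["English", "Mandarin", "Arabic", "Spanish", "Russian"]
PERSPECTIVES = ["Art History", "Archaeology"]


def build_prompt(question, perspectives, language):
    """Combine the user's question with the selected refinements."""
    if language not in LANGUAGES:
        raise ValueError(f"Unsupported language: {language}")
    parts = [question.strip()]
    if perspectives:
        parts.append("Answer from the perspective of: " + ", ".join(perspectives) + ".")
    parts.append(f"Respond in {language}.")
    return "\n".join(parts)


def build_interface(answer_fn):
    """Wire the prompt builder to Gradio widgets (requires `pip install gradio`)."""
    import gradio as gr  # imported lazily so build_prompt works without Gradio

    return gr.Interface(
        fn=lambda q, p, lang: answer_fn(build_prompt(q, p, lang)),
        inputs=[
            gr.Textbox(label="Question or research summary", lines=4),
            gr.CheckboxGroup(choices=PERSPECTIVES, label="Perspectives"),
            gr.Dropdown(choices=LANGUAGES, value="English", label="Response language"),
        ],
        outputs=gr.Textbox(label="Answer"),
    )


print(build_prompt("Who built the Pharos?", ["Archaeology"], "Spanish"))
```

Here `answer_fn` stands for whatever function sends the finished prompt to the question-answering chain; calling `build_interface(answer_fn).launch()` would start the web UI.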

Example

Research Summary:

Key Components

download_pdf_from_html(): Scrapes PDFs from a webpage.
split_chunk_into_halves(): Splits large text into smaller chunks using punctuation.
convert_json_to_txt_clean(): Cleans and formats JSON into readable text.
ChatOpenAI: GPT-4-based chat model for Q&A.
Gradio Interface: Easy-to-use UI for querying the dataset.
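
To illustrate the idea behind punctuation-based splitting, here is a minimal sketch that cuts a text at the punctuation mark closest to its midpoint and repeats until every piece fits a size limit. The function names and the choice of punctuation marks are assumptions made for this example; the repository's split_chunk_into_halves() may behave differently.

```python
import re


def split_in_half_at_punctuation(text):
    """Split text in two at the punctuation mark closest to the middle."""
    # Candidate split points: just after '.', '!', '?' or ';' followed by whitespace.
    cuts = [m.end() for m in re.finditer(r"[.!?;](?=\s)", text)]
    if not cuts:
        # No punctuation to split on: fall back to a plain midpoint cut.
        cuts = [len(text) // 2]
    mid = len(text) / 2
    cut = min(cuts, key=lambda pos: abs(pos - mid))
    return text[:cut].strip(), text[cut:].strip()


def split_to_max_length(text, max_chars):
    """Recursively halve text until every piece fits within max_chars."""
    if len(text) <= max_chars:
        return [text]
    left, right = split_in_half_at_punctuation(text)
    if not left or not right:  # cannot split further
        return [text]
    return split_to_max_length(left, max_chars) + split_to_max_length(right, max_chars)


for piece in split_to_max_length("One two. Three four. Five six.", 12):
    print(piece)
```

Splitting at punctuation rather than at a fixed character offset keeps sentences intact, which tends to make retrieved chunks easier to read and cite.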

Contributions

Contributions are welcome! Submit a pull request or open an issue to suggest improvements or report bugs.

License

This project is licensed under the MIT License.

Acknowledgments

Developed with LangChain and Gradio.
My gratitude goes to Sebastian Heath (ISAW), Riccardo Torlone (Roma Tre University), Patrick J. Burns (ISAW), and David Ratzan (ISAW) for their guidance on this project. I also wish to thank the scholars of the Institute for the Study of the Ancient World at NYU for their valuable input and for curating the collections.

Enjoy extracting knowledge from your domain-specific datasets with ease!

Repository: federicodip/isaw_ai_librarian