The ISAW AI Librarian is an end-to-end pipeline tailored for humanities scholars. It enables:
- Data Collection: Scraping and downloading scholarly articles from online sources.
- Data Preprocessing: Splitting text into manageable chunks with metadata.
- Knowledge Augmentation: Creating an augmented retrieval system to query domain-specific knowledge.
- User Interface: Deploying a user-friendly interface to retrieve and explore insights from the data.
This tool is specifically designed for researchers working in domains like history, archaeology, and related fields, facilitating easy access to their data and enhanced information retrieval.
Discover more about the ISAW AI Librarian on the Institute for the Study of the Ancient World (ISAW) website.
- Web Scraping: Automatically download scholarly PDFs from web pages.
- Data Preprocessing: Split long texts into manageable, metadata-tagged chunks for efficient storage and querying.
- Text Embedding and Search: Create a searchable database using advanced vector-based embeddings.
- QA Chat Interface: Use an interactive Gradio interface for querying domain-specific knowledge, with options to refine queries and retrieve sources.
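As a rough illustration of the web scraping feature, the sketch below collects PDF links from a page's HTML using only the Python standard library. The names PdfLinkParser and find_pdf_links and the example.org URLs are hypothetical and are not taken from the project's code, which may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Inspect only the URL path so query strings do not hide the extension.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.links:
            self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return de-duplicated PDF links found in the given HTML, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample page used only for demonstration.
sample_html = """
<a href="/papers/a.pdf">A</a>
<a href="https://example.org/b.PDF?download=1">B</a>
<a href="/about.html">About</a>
<a href="/papers/a.pdf">A again</a>
"""
pdf_links = find_pdf_links(sample_html, "https://example.org/index.html")
print(pdf_links)
```

The sketch stops at link discovery so that it runs offline. Fetching each link (for example with urllib.request.urlretrieve) would be the next step.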
- Data Collection: Download scholarly articles from URLs.
- Chunking: Split documents into manageable chunks, retaining metadata like source URLs and page numbers.
- Embedding: Convert text into embeddings using OpenAIEmbeddings for similarity-based search.
- Query Interface: Implement an AI-powered librarian using langchain to answer questions and retrieve information interactively.
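The embedding and retrieval steps can be illustrated with a toy example. The sketch below ranks chunks by cosine similarity between a query vector and each chunk vector, then returns the best matches together with their source metadata. The three-dimensional vectors are made-up stand-ins for real embeddings, and the helper names cosine_similarity and top_k are hypothetical; a real vector store would likely handle this step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Toy chunks: each keeps its text, source URL, page number, and a fake embedding.
chunks = [
    {"text": "Roman trade routes in the Levant",
     "source": "https://example.org/a.pdf", "page": 3, "vector": [0.9, 0.1, 0.0]},
    {"text": "Bronze Age pottery typology",
     "source": "https://example.org/b.pdf", "page": 12, "vector": [0.1, 0.9, 0.1]},
    {"text": "Maritime commerce in late antiquity",
     "source": "https://example.org/c.pdf", "page": 7, "vector": [0.8, 0.2, 0.1]},
]

query_vector = [1.0, 0.0, 0.0]  # stand-in for an embedded user question
results = top_k(query_vector, chunks, k=2)
for r in results:
    print(r["source"], r["page"], r["text"])
```

Because each chunk keeps its source URL and page number, the retrieved passages can be cited back to the original document.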
- Clone the repository (replace your-username with the actual account name):
  git clone https://github.com/your-username/isaw-ai-librarian.git
  cd isaw-ai-librarian
- Install dependencies:
  pip install -r requirements.txt
- Set up the environment variable for the OpenAI API:
  export OPENAI_API_KEY=your_openai_api_key
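It can help to check for the key at startup so that a missing variable fails with a clear message rather than deep inside an API call. The helper below is a minimal sketch; the name get_openai_api_key is hypothetical and not part of the project.

```python
import os


def get_openai_api_key(env=None):
    """Return the API key from the environment mapping, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it before running the pipeline."
        )
    return key


# Demonstration with an explicit mapping, so the example runs without a real key.
demo_key = get_openai_api_key({"OPENAI_API_KEY": " sk-demo "})
print(demo_key)
```

Calling get_openai_api_key() with no argument reads from the real process environment.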
The Gradio-powered interface offers:
- Textbox: For asking questions or inputting research summaries.
- Refinement Options: Use checkboxes for domain-specific perspectives like "Art History" or "Archaeology."
- Multilingual Support: Receive responses in English, Mandarin, Arabic, Spanish, or Russian.
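The three controls above could be wired together roughly as follows. This is a sketch under assumptions, not the project's actual interface code: build_prompt and build_interface are hypothetical names, the perspective list contains only the two examples named above, and the default callback simply echoes the assembled prompt where a real application would send it to the chat model.

```python
LANGUAGES = ["English", "Mandarin", "Arabic", "Spanish", "Russian"]
PERSPECTIVES = ["Art History", "Archaeology"]  # only the two examples named above


def build_prompt(question, perspectives, language):
    """Combine the textbox, checkbox, and language inputs into one instruction string."""
    parts = [f"Question: {question.strip()}"]
    if perspectives:
        parts.append("Answer from these perspectives: " + ", ".join(perspectives) + ".")
    parts.append(f"Respond in {language}.")
    return "\n".join(parts)


def build_interface(answer_fn=build_prompt):
    """Create a Gradio UI with the three inputs. Requires `pip install gradio`."""
    import gradio as gr  # imported lazily so build_prompt works without Gradio

    return gr.Interface(
        fn=answer_fn,  # placeholder: echoes the prompt instead of calling a model
        inputs=[
            gr.Textbox(label="Question or research summary", lines=4),
            gr.CheckboxGroup(choices=PERSPECTIVES, label="Perspectives"),
            gr.Dropdown(choices=LANGUAGES, value="English", label="Response language"),
        ],
        outputs=gr.Textbox(label="Answer"),
        title="ISAW AI Librarian (sketch)",
    )


example_prompt = build_prompt(" Who built Dura-Europos? ", ["Archaeology"], "Spanish")
print(example_prompt)
```

With Gradio installed, build_interface().launch() would start the local web UI.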
- download_pdf_from_html(): Scrapes PDFs from a webpage.
- split_chunk_into_halves(): Splits large text into smaller chunks using punctuation.
- convert_json_to_txt_clean(): Cleans and formats JSON into readable text.
- ChatOpenAI: GPT-4-based chat model for Q&A.
- Gradio Interface: Easy-to-use UI for querying the dataset.
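To make the punctuation-based splitting concrete, here is an independent sketch of one way such a step might work. It is not the project's split_chunk_into_halves() implementation; the function name split_into_halves_sketch, the chunk dictionary layout, and the choice of punctuation marks are all assumptions made for illustration.

```python
import re


def split_into_halves_sketch(chunk):
    """Split chunk["text"] in two at the punctuation mark nearest the midpoint.

    All other keys (for example source URL and page number) are copied to both halves.
    """
    text = chunk["text"]
    midpoint = len(text) // 2
    # Candidate cut positions: just after ., !, ?, or ; followed by whitespace.
    cuts = [m.end() for m in re.finditer(r"[.!?;]\s+", text)]
    # Without usable punctuation, fall back to the midpoint (may split a word).
    cut = min(cuts, key=lambda pos: abs(pos - midpoint)) if cuts else midpoint
    parts = [text[:cut].strip(), text[cut:].strip()]
    metadata = {k: v for k, v in chunk.items() if k != "text"}
    return [{"text": part, **metadata} for part in parts if part]


# Hypothetical chunk used only for demonstration.
chunk = {
    "text": "Palmyra was a caravan city. It linked Rome and Persia. Its ruins survive today.",
    "source": "https://example.org/a.pdf",
    "page": 5,
}
halves = split_into_halves_sketch(chunk)
for half in halves:
    print(half)
```

Applying the function repeatedly to any half that is still too long would yield progressively smaller chunks that all carry the original metadata.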
Contributions are welcome! Submit a pull request or open an issue to suggest improvements or report bugs.
This project is licensed under the MIT License.
- Developed with LangChain and Gradio.
- My gratitude goes to Sebastian Heath (ISAW) and Riccardo Torlone (Roma Tre University), as well as Patrick J. Burns and David Ratzan (ISAW), for their guidance on this project. I also wish to acknowledge the scholars at the Institute for the Study of the Ancient World at NYU for their valuable input and for curating the collections.
Enjoy extracting knowledge from your domain-specific datasets with ease!


