sayandebnath-creator/HTML-Content-Search-Engine-SPA

🔍 HTML Search SPA

A full-stack semantic search tool for crawling, chunking, and vectorizing HTML content from any public website URL.

🎥 Demo

HTML Search Demo

🌐 Features

  • Accepts any public website URL.
  • Parses and chunks HTML content.
  • Indexes with Weaviate vector database.
  • Lets you search semantically across website content.
  • Shows match percentage, context preview, and raw HTML.
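The parse-and-chunk step listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the repository's actual code: the names `TextExtractor` and `chunk_html` and the `max_words` default are assumptions made for this example, and the real backend may use a different parser or chunk size.

```python
from html.parser import HTMLParser

# Tags whose text content is never shown to the reader.
SKIPPED_TAGS = {"script", "style", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text from an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def chunk_html(html, max_words=500):
    """Split the visible text of `html` into chunks of at most `max_words` words.

    The 500-word default is an assumption, not a value taken from the project.
    """
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    words = " ".join(extractor.parts).split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each chunk would then be embedded and stored in the vector database. Splitting on a fixed word count is the simplest option; splitting on headings or paragraphs is a common alternative when chunks should follow the page structure.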

🛠️ Tech Stack

  • Frontend: Next.js
  • Backend: Python (Flask)
  • Vector DB: Weaviate
  • NLP Embedding: HuggingFace SentenceTransformer (all-MiniLM-L6-v2)
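One plausible way to turn embeddings into the match percentage shown in the results is cosine similarity between the query vector and each chunk vector. The sketch below assumes that approach; the README does not say how the score is computed, and `cosine_similarity` and `match_percentage` are hypothetical names. In the real pipeline the vectors would come from the embedding model (all-MiniLM-L6-v2 produces 384-dimensional vectors); short hand-written vectors stand in for them here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as no match
    return dot / (norm_a * norm_b)


def match_percentage(query_vec, chunk_vec):
    """Map cosine similarity to a 0-100 score (assumed scoring rule).

    Negative similarities are clamped to 0 so the score never goes below 0%.
    """
    similarity = cosine_similarity(query_vec, chunk_vec)
    return round(max(0.0, similarity) * 100, 1)
```

Clamping at zero is a design choice for display purposes: a negative cosine means the texts point in opposite directions in embedding space, which a user would read as "no match" rather than a negative percentage.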

🚀 Running the Project

  1. Backend
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python app.py
  2. Frontend
cd frontend
npm install
npm run dev
  3. Vector DB
docker-compose up -d
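After starting the vector database container, it can help to confirm that Weaviate is accepting requests before indexing anything. The sketch below polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) with the standard library. The base URL `http://localhost:8080` is an assumption, since the README does not state which port the compose file exposes, and `weaviate_ready` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def weaviate_ready(base_url="http://localhost:8080", timeout=2.0):
    """Return True if the Weaviate readiness endpoint answers with HTTP 200.

    The default base_url is an assumption; adjust it to match the port
    mapping in your docker-compose.yml.
    """
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


if __name__ == "__main__":
    print("Weaviate ready:", weaviate_ready())
```

If this prints `False`, check that the container is running (`docker ps`) and that the port in the URL matches the compose file.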

📁 Folder Structure

html-search-spa/
├── backend/
│   ├── app.py
│   ├── utils/
│   └── requirements.txt
├── frontend/
│   ├── public/
│   ├── src/
│   │   ├── App.jsx
│   │   └── components/
│   ├── tailwind.config.js
│   └── package.json
├── weaviate/
│   └── docker-compose.yml
├── README.md
└── .gitignore

To Push to GitHub

git init
git add .
git commit -m "Initial commit: HTML Search SPA"
git remote add origin https://github.com/your-username/your-repo.git
git push -u origin main

📃 License

MIT License © 2025 Sayan Debnath

About

A web application where users input a website URL and a search query. It crawls the URL, chunks HTML content, indexes it using Weaviate (vector DB), and lets users semantically search content. Matching results show highlighted text, HTML view, and relevance score.
