A full-stack semantic search tool for crawling, chunking, and vectorizing HTML content from any public website URL.
- Accepts any public website URL.
- Parses and chunks HTML content.
- Indexes the chunks in a Weaviate vector database.
- Lets you search semantically across website content.
- Shows match percentage, context preview, and raw HTML.
- Frontend: Next.js
- Backend: Python (Flask)
- Vector DB: Weaviate
- NLP Embedding: HuggingFace SentenceTransformer (all-MiniLM-L6-v2)
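
The chunking and match-percentage steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `backend/`: the helper names `chunk_html` and `match_percent`, the 200-word chunk size, and the cosine-to-percentage mapping are all assumptions made for the example. It uses only the Python standard library, so plain lists of floats stand in for the embedding vectors.

```python
from html.parser import HTMLParser
import math


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def chunk_html(html, max_words=200):
    """Hypothetical helper: split visible text into chunks of at most max_words words."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.parts).split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def match_percent(query_vec, chunk_vec):
    """Hypothetical helper: cosine similarity clamped to [0, 1], as a percentage."""
    dot = sum(a * b for a, b in zip(query_vec, chunk_vec))
    norm = math.sqrt(sum(a * a for a in query_vec)) * math.sqrt(sum(b * b for b in chunk_vec))
    if norm == 0:
        return 0.0
    return round(max(0.0, min(1.0, dot / norm)) * 100, 1)
```

In a setup like the one described here, the vectors would presumably come from the `all-MiniLM-L6-v2` model and the similarity from Weaviate's query results, so `match_percent` only shows one plausible way a similarity score could be turned into the displayed percentage.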
- Backend
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python app.py
- Frontend
cd frontend
npm install
npm run dev
- Vector DB
cd weaviate
docker-compose up -d
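
After `docker-compose up -d`, it can help to confirm that Weaviate is accepting requests before starting the backend. The sketch below polls Weaviate's readiness endpoint (`/v1/.well-known/ready`). The function name `weaviate_ready` is hypothetical, and the base URL `http://localhost:8080` is an assumption based on Weaviate's usual default port; adjust it to whatever `docker-compose.yml` actually maps.

```python
import urllib.error
import urllib.request


def weaviate_ready(base_url="http://localhost:8080", timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200, else False."""
    # Assumed default address; the port mapping in docker-compose.yml may differ.
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False
```

Calling `weaviate_ready()` from a Python shell should return `True` once the container has finished starting, and `False` while it is still booting or if the port is different.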
html-search-spa/
├── backend/
│   ├── app.py
│   ├── utils/
│   └── requirements.txt
├── frontend/
│   ├── public/
│   ├── src/
│   │   ├── App.jsx
│   │   └── components/
│   ├── tailwind.config.js
│   └── package.json
├── weaviate/
│   └── docker-compose.yml
├── README.md
└── .gitignore
git init
git add .
git commit -m "Initial commit: HTML Search SPA"
git remote add origin https://github.com/your-username/your-repo.git
git push -u origin main

MIT License © 2025 Sayan Debnath
