- Project Lead: Phong Cao
- AI Developer: Phong Cao
- Backend: Phong Cao, Hien Hoang
- Frontend: Doanh Phung, Minh Bui
SyntheSearch: A smart research tool that finds and synthesizes the most relevant papers, saving researchers time and enhancing insight.
- Navigate to the backend directory: `cd backend`
- Set up a virtual environment: `python3 -m venv venv`
- Activate the virtual environment: `source venv/bin/activate`
- Install dependencies: `pip install -r requirement.txt`
  ⚠️ Note: on macOS, ensure `pywin32` is removed from `requirement.txt`.
- Run the server: `uvicorn main:app --reload`
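The `uvicorn main:app --reload` command expects an ASGI application object named `app` inside `backend/main.py`. The real endpoints live in the repository; the following is only a minimal, hypothetical sketch of what that entry point could look like (the `/search` route and its parameters are assumptions, not the project's actual API):

```python
# main.py - minimal FastAPI entry point (hypothetical sketch, not the project's actual code)
from fastapi import FastAPI

app = FastAPI(title="SyntheSearch API")

@app.get("/health")
def health() -> dict:
    # Simple liveness check so the frontend can verify the backend is running.
    return {"status": "ok"}

@app.get("/search")
def search(query: str, k: int = 5) -> dict:
    # In the real service this would embed the query, look up the nearest
    # papers in the vector database, and return summaries plus a synthesis.
    return {"query": query, "k": k, "results": []}
```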
- Navigate to the frontend directory: `cd frontend`
- Install dependencies: `npm i`
- Start the development server: `npm run dev`
SyntheSearch is a web application designed to streamline the research process for students and researchers by efficiently locating relevant research papers. Researchers often spend hours sifting through papers, hoping to find the studies that best match their interests. SyntheSearch aims to reduce this time by intelligently suggesting the most relevant papers and generating a synthesis to reveal how the studies interrelate, offering users an insightful overview that saves time and enhances understanding.
The inspiration for SyntheSearch came from our own experiences as students. Before HackUMass XII, one team member struggled to find research papers on machine-learning applications in cancer detection. The process of locating credible sources was exhausting and time-consuming, even with optimized library search tools. This frustration inspired us to develop a more efficient search engine that leverages Large Language Models (LLMs) and vector databases to quickly surface relevant research and summarize it.
We chose Python for the back end because of its extensive frameworks for AI development. Databricks was used to streamline our machine-learning pipeline. Here's how we approached building SyntheSearch:
- Data Collection: We started by scraping data from the CORE collection of open-access research papers.
- Embedding: Using LangChain, we used OpenAI's text-embedding-3-large model to convert paper texts into vector embeddings (see the first sketch after this list).
- Storage: We used LanceDB as our vector database, storing the embedded vectors for fast and efficient retrieval.
- Summarization and Synthesis: We employed OpenAI's GPT-4o-mini model to generate summaries, suggestions, and synthesized insights (see the second sketch after this list).
- Front-End: We built the user interface with React and a TypeScript template, providing a clean and responsive experience for users.
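A condensed sketch of the embedding and storage steps, assuming hypothetical table and column names (`papers`, `title`, `text`) rather than the project's actual schema:

```python
# embed_and_store.py - hypothetical sketch of the embedding/storage pipeline
import lancedb
from langchain_openai import OpenAIEmbeddings  # requires OPENAI_API_KEY

# Papers collected from CORE; in practice these are the full scraped texts.
papers = [
    {"title": "Deep Learning for Cancer Detection", "text": "..."},
    {"title": "Vector Databases in Information Retrieval", "text": "..."},
]

# Convert paper texts into vector embeddings with text-embedding-3-large.
embedder = OpenAIEmbeddings(model="text-embedding-3-large")
vectors = embedder.embed_documents([p["text"] for p in papers])

# Store the embeddings alongside the paper metadata in a LanceDB table.
db = lancedb.connect("./lancedb")
table = db.create_table(
    "papers",
    data=[{"vector": v, **p} for v, p in zip(vectors, papers)],
    mode="overwrite",
)
```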
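And a sketch of the retrieval and synthesis step, again with assumed names and an assumed prompt; the actual ranking and prompting logic live in the backend code:

```python
# search_and_synthesize.py - hypothetical sketch of retrieval + GPT-4o-mini synthesis
import lancedb
from langchain_openai import OpenAIEmbeddings
from openai import OpenAI

embedder = OpenAIEmbeddings(model="text-embedding-3-large")
client = OpenAI()

def synthesize(query: str, k: int = 5) -> str:
    # Embed the user's query and find the k nearest papers in LanceDB.
    table = lancedb.connect("./lancedb").open_table("papers")
    hits = table.search(embedder.embed_query(query)).limit(k).to_list()

    # Ask GPT-4o-mini to summarize the hits and explain how they interrelate.
    context = "\n\n".join(f"{h['title']}:\n{h['text']}" for h in hits)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Summarize these papers and synthesize how they relate to the query."},
            {"role": "user", "content": f"Query: {query}\n\nPapers:\n{context}"},
        ],
    )
    return response.choices[0].message.content

print(synthesize("machine learning for cancer detection"))
```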
- GitHub Workflow Issues: Frequent merge conflicts on pull requests slowed our progress.
- Communication Gaps: Miscommunication led to duplicated work and inefficiencies.
This project was an invaluable learning experience. As it was our first LLM project, we gained hands-on experience with GenAI technologies, particularly the power of vector databases. We learned the importance of clear team communication, and we now have a deeper understanding of LLMs and their capabilities in revolutionizing information retrieval.
- Python (Backend Development)
- LangChain (Embedding)
- LanceDB (Vector Database)
- OpenAI GPT Models (Summarization and Synthesis)
- React with TypeScript (Front-End Development)
- TailwindCSS (Styling)
- Vite (Tooling)
- Databricks (Machine Learning Pipeline)
Through SyntheSearch, we're excited to contribute to the efficiency of the research process, empowering researchers to focus on insights rather than information overload.