From b93568ead8deae6f7f513e4d885a061db70f22d8 Mon Sep 17 00:00:00 2001
From: andkret
Date: Thu, 28 Nov 2024 11:33:54 +0100
Subject: [PATCH] Inserted GenAI project

Added the instructions and the code
---
 sections/04-HandsOnCourse.md | 137 +++++++++++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)

diff --git a/sections/04-HandsOnCourse.md b/sections/04-HandsOnCourse.md
index 43589d7..f712027 100644
--- a/sections/04-HandsOnCourse.md
+++ b/sections/04-HandsOnCourse.md
@@ -3,11 +3,148 @@ Data Engineering Course: Building A Data Platform
 
 ## Contents
 
+- [GenAI Retrieval Augmented Generation with Ollama and Elasticsearch](04-HandsOnCourse.md#genai-retrieval-augmented-generation-with-ollama-and-elasticsearch)
 - [Free Data Engineering Course with AWS, TDengine, Docker and Grafana](04-HandsOnCourse.md#free-data-engineering-course-with-aws-tdengine-docker-and-grafana)
 - [Monitor your data in dbt & detect quality issues with Elementary](04-HandsOnCourse.md#monitor-your-data-in-dbt-and-detect-quality-issues-with-elementary)
 - [Solving Engineers 4 Biggest Airflow Problems](04-HandsOnCourse.md#solving-engineers-4-biggest-airflow-problems)
 - [The best alternative to Airlfow? Mage.ai](04-HandsOnCourse.md#the-best-alternative-to-airlfow?-mage.ai)
 
+## GenAI Retrieval Augmented Generation with Ollama and Elasticsearch
+
+- This how-to is based on this one from Elasticsearch: https://www.elastic.co/search-labs/blog/rag-with-llamaIndex-and-elasticsearch
+- Instead of Elasticsearch cloud we're going to run everything locally
+- The simplest way to get this done is to just clone this GitHub repo for the code and Docker setup
+- I've tried this on an M1 Mac. Changes for Windows with WSL will come later.
+- The biggest problems that I had were actually installing the dependencies rather than the code itself.
+
+### Install Ollama
+1. Download Ollama from here https://ollama.com/download/mac
+2. Unzip, drag into applications and install
+3. Run `ollama run mistral` in a terminal (this downloads the Mistral 7B model, about 4.1 GB)
+4. Create a new folder in Documents "Elasticsearch-RAG"
+5. Open that folder in VSCode
+
+### Install Elasticsearch & Kibana (Docker)
+1. Use the docker-compose file from the Log Monitoring course: https://github.com/team-data-science/GenAI-RAG/blob/main/docker-compose.yml
+2. Download Docker Desktop from here: https://www.docker.com/products/docker-desktop/
+3. Install Docker Desktop, then sign in or create a user in the app (this sends you to the browser)
+
+**For Windows Users**
+Configure WSL2 to use at most 4 GB of RAM:
+```
+wsl --shutdown
+notepad "$env:USERPROFILE/.wslconfig"
+```
+.wslconfig file:
+```
+[wsl2]
+memory=4GB # Limits VM memory in WSL 2 up to 4GB
+```
+**Modify the Linux kernel map count in WSL**
+Do this before starting Elasticsearch, because it requires a higher value to work:
+`sudo sysctl -w vm.max_map_count=262144`
+
+4. Go to the Elasticsearch-RAG folder and run `docker compose up`
+5. If you want to use your own Elasticsearch image, make sure you have Elasticsearch 8.11 or later (we use 8.16 in this project)
+6. If you get this error on a Mac, just open the console in the Docker app: *error getting credentials - err: exec: docker-credential-desktop: executable file not found in $PATH, out:*
+7. Install the Xcode command line tools: `xcode-select --install`
+8. Make sure you're on Python 3.8.1 or later (I installed 3.13.0 from https://www.python.org/downloads/)
+
+### Setup the virtual Python environment
+
+#### Preparation on a Mac
+##### Install brew
+which brew
+/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
+export PATH="/opt/homebrew/bin:$PATH"
+brew --version
+brew install pyenv
+brew install pyenv-virtualenv
+
+##### Install pyenv
+```
+brew install pyenv
+brew install pyenv-virtualenv
+```
+
+Modify the path so that pyenv is in the path variable
+`nano ~/.zshrc`
+
+```
+export PYENV_ROOT="$HOME/.pyenv"
+export PATH="$PYENV_ROOT/bin:$PATH"
+eval "$(pyenv init --path)"
+eval "$(pyenv init -)"
+eval "$(pyenv virtualenv-init -)"
+```
+
+Install dependencies for building Python versions
+`brew install openssl readline sqlite3 xz zlib`
+
+Reload to apply changes
+`source ~/.zshrc`
+
+Install Python
+```
+pyenv install 3.11.6
+pyenv version
+```
+
+Set Python version system wide
+`pyenv global 3.11.6`
+
+```
+pyenv virtualenv <python-version> <env-name>   # create a virtual environment
+pyenv activate <env-name>                      # activate it
+pyenv virtualenv-delete <env-name>             # delete it
+```
+
+#### Windows without pyenv
+Set up the virtual Python environment: go to the Elasticsearch-RAG folder and run
+`python3 -m venv .elkrag`
+Enable the environment
+`source .elkrag/bin/activate`
+
+
+### Install required libraries (do one at a time so you see errors):
+```
+pip install llama-index  # alternatively: python3 -m pip install <package name>
+pip install llama-index-embeddings-ollama
+pip install llama-index-llms-ollama
+pip install llama-index-vector-stores-elasticsearch
+pip install python-dotenv
+```
+
+### Write the data to Elasticsearch
+1. Create / copy in the index.py file
+2. Download the conversations.json file from the folder Code Examples/GenAI-RAG
+3. If the execution fails with an error, check whether the pydantic version is <2.0 with `pip show pydantic`; if it is not, run `pip install "pydantic<2.0"`
+4. Run the program index.py: https://github.com/andkret/Cookbook/blob/master/Code%20Examples/GenAI-RAG/index.py
+
+### Check the data in Elasticsearch
+1. Go to Kibana http://localhost:5601/app/management/data/index_management/indices and see the new index called `calls`
+2. Go to Dev Tools (http://localhost:5601/app/dev_tools#/console/shell) and try out this query: `GET calls/_search?size=1`
+
+### Query data from Elasticsearch and create an output with Mistral
+1. If everything is good, run the query.py file https://github.com/andkret/Cookbook/blob/master/Code%20Examples/GenAI-RAG/query.py
+2. Try a few queries :)
+
+### Install libraries to extract text from pdfs
+
+
+### Extract data from CV and put it into Elasticsearch
+I created a CV with ChatGPT https://github.com/andkret/Cookbook/blob/master/Code%20Examples/GenAI-RAG/Liam_McGivney_CV.pdf
+
+Install the library to extract text from the pdf
+`pip install PyMuPDF`
+I had to press Shift+Command+P, then run the Python clear workspace cache and reload window command. Only then did VSCode see the library :/
+
+The file cvpipeline.py has the Python code for the indexing. It's not working right now though!
+https://github.com/andkret/Cookbook/blob/master/Code%20Examples/GenAI-RAG/cvpipeline.py
+
+
+I'll keep developing this and update it once it's working.
+
 ## Free Data Engineering Course with AWS TDengine Docker and Grafana