
added changes for RFC for Ollama support for agentsQnA workflow in OPEA
pbharti0831 committed Jan 29, 2025
1 parent 0b784fa commit 677e378
Showing 1 changed file with 19 additions and 19 deletions.
@@ -1,6 +1,6 @@
# 24-11-25-GenAIExamples-Ollama_Support_for_Xeon
# 24-11-25-GenAIExamples-Ollama_Support_for_CPU_Server

The AgentQnA workflow in GenAIExamples uses LLMs as agents to intelligently manage the control flow in the pipeline. Currently, it relies on the OpenAI paid API for LLM services on the Xeon platform, which incurs costs and does not utilize Xeon capability for LLM computation. This RFC aims to add support for open-source small language models (SLMs) locally deployed on Xeon through Ollama for LLM engines.
The AgentQnA workflow in GenAIExamples leverages large language models (LLMs) as agents to intelligently manage control flow within the pipeline. Currently, it depends on cloud-hosted, paid APIs for LLM services on the CPU server platform, which incurs significant costs and does not utilize the full computational capabilities of the CPU. This RFC proposes adding support for open-source small language models (SLMs) deployed locally on x86 CPU servers using Ollama, thereby enabling LLM computation on on-prem CPUs and reducing operational expenses.
## Author(s)

[Pratool Bharti](https://github.com/pbharti0831/)
@@ -12,30 +12,33 @@ The AgentQnA workflow in GenAIExamples uses LLMs as agents to intelligently mana
## Objective

### Problems This Will Solve
- **Access to Open-source SLMs on Xeon**: Provides access to open-source SLMs through Ollama on Xeon. SOTA open-source SLMs model work fine for less complex agentic workflow. Given an elaborated prompt, Llama 3.1 and 3.2 small models are fairly accurate for tool calling, an important feature for Agents.
- **Cost Reduction**: Eliminates the need for paid API services by using open-source SLMs.
- **Access to Open-source SLMs on CPU Servers**: Enables the use of open-source SLMs through Ollama on x86 CPU servers. State-of-the-art open-source SLMs are suitable for less complex agent workflows. A critical task for agents is accurately invoking the correct tools for specific tasks. As demonstrated in the [Berkeley Function-Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html), Llama 70B and 8B models perform similarly in tool-calling tasks, indicating that smaller language models can effectively support agent workflows.
- **Cost Reduction**: Eliminates the need for paid cloud-based API services by running open-source SLMs locally on on-prem CPUs.
- **Data Privacy**: Ensures data privacy by processing data locally.
- **Performance Optimization**: Leverages the computational power of Intel Xeon CPUs for efficient LLM execution.
- **Performance Optimization**: Leverages the computational power of x86 CPU servers for efficient LLM execution.

### Goals

- **Local Deployment**: Enable local deployment of open-source SLMs on Intel Xeon CPUs.
- **Local Deployment**: Enable local deployment of open-source SLMs on on-prem x86 CPU servers.
- **Integration with Ollama**: Seamless integration of Ollama framework to access open-source SLMs.
- **Maintain Functionality**: Ensure the AgentQnA workflow continues to function effectively with the new setup.
- **Integration of popular serving framework**: Integration of Ollama serving framework in AgentQnA.
- **Integration of a popular serving framework**: Integrate the Ollama serving framework into the AgentQnA workflow in OPEA.

### Non-Goals

- **New Features**: No new features will be added to the AgentQnA workflow beyond the support for local SLMs as an agent.
- **Support for Non-Xeon Platforms**: This RFC is specific to Intel Xeon CPUs and does not cover other hardware platforms.
- **Support for Non-x86 Platforms**: This RFC is specific to x86 CPU servers and does not cover other hardware platforms.

## Motivation

### SLMs Performance on CPU
Open-source small language models (SLMs) are optimized to run efficiently on CPUs, including Intel Xeon processors. These models are designed to balance performance and resource usage, making them suitable for deployment in environments where GPU resources are limited or unavailable. By leveraging the computational capabilities of Xeon CPUs, SLMs can achieve satisfactory performance for various agent tasks within the AgentQnA workflow. Given a right prompt, smaller Llama models are fairly accurate in tool calling which is an essential features for agents.
Open-source small language models (SLMs) are optimized to run efficiently on CPUs, including Intel Xeon processors. These models are designed to balance performance and resource usage, making them suitable for deployment in environments where GPU resources are limited or unavailable. By leveraging the computational capabilities of x86 CPU servers, SLMs can achieve satisfactory performance for various agent tasks within the AgentQnA workflow. Given the right prompt, smaller Llama models are fairly accurate at tool calling, which is an essential feature for agents.
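
For illustration only (not part of the proposed changes), the sketch below sends a tool-calling request to a locally running Ollama server. The `/api/chat` endpoint and the `tools` field follow Ollama's documented REST API; the model name and the example tool definition are assumptions.

```python
# Minimal sketch: tool calling against a local Ollama server.
# Assumes Ollama is running on the default port (11434) and a small
# Llama model (e.g. "llama3.1") has already been pulled.
import json
import requests

# A hypothetical tool definition in the OpenAI-style schema accepted by /api/chat.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest price for a stock ticker",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker symbol"}
            },
            "required": ["ticker"],
        },
    },
}]

payload = {
    "model": "llama3.1",  # assumed model name; any tool-capable SLM works
    "messages": [{"role": "user", "content": "What is Intel's stock price today?"}],
    "tools": tools,
    "stream": False,
}

resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=120)
resp.raise_for_status()
message = resp.json()["message"]

# If the model decided to call a tool, the response carries "tool_calls".
for call in message.get("tool_calls", []):
    print(call["function"]["name"], json.dumps(call["function"]["arguments"]))
```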

### Ollama Popularity and Wide Range of Models
Ollama provides a comprehensive set of libraries and tools to facilitate the deployment and management of open-source language models. These libraries are designed to integrate seamlessly with existing workflows, enabling developers to easily incorporate SLMs into their applications. Ollama's model libraries support a wide range of open-source models, ensuring compatibility and ease of use for different use cases.
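
As a small usage sketch (assuming the official `ollama` Python client is installed and a model such as `llama3.2` has already been pulled locally), accessing a locally served model takes only a few lines:

```python
# Minimal sketch using the `ollama` Python client (pip install ollama).
# Assumes the Ollama server is running locally and "llama3.2" has been
# pulled beforehand (e.g. with `ollama pull llama3.2`).
import ollama

response = ollama.chat(
    model="llama3.2",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize what an agent workflow does."}],
)
print(response["message"]["content"])
```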

### Ollama vs vLLM
vLLM is an optimized inference engine designed for high-throughput token generation and efficient memory utilization, making it suitable for large-scale AI deployments. Ollama is a lightweight and intuitive framework that facilitates the execution of open-source LLMs on local, on-prem hardware. In terms of popularity, the [vLLM](https://github.com/vllm-project/vllm) GitHub repository has 35K stars, while [Ollama](https://github.com/ollama/ollama) has 114K stars.
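
Both frameworks also expose OpenAI-compatible chat endpoints, so agent code written against an OpenAI-style client can target either backend by changing only the base URL. The sketch below assumes default local ports (11434 for Ollama, 8000 for vLLM) and an illustrative model name.

```python
# Minimal sketch: one OpenAI-style client, two interchangeable local backends.
# Default ports are assumed (Ollama: 11434, vLLM: 8000); the model name is
# illustrative and must match a model pulled/served on the chosen backend.
from openai import OpenAI

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible route
VLLM_BASE = "http://localhost:8000/v1"     # vLLM's OpenAI-compatible server

client = OpenAI(base_url=OLLAMA_BASE, api_key="not-needed-for-local")

completion = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Pick the right tool for this request."}],
)
print(completion.choices[0].message.content)
```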

#### Key Features of Ollama
- **Extensive Model Support**: Ollama supports a variety of open-source language models, including state-of-the-art models that are continuously updated.
@@ -150,30 +153,27 @@ The proposed design for Ollama serving support entails the following changes:

1. **Data Privacy and Security**:
- **Scenario**: A healthcare organization needs to process sensitive patient data for generating medical reports and insights.
- **Solution**: By using Ollama service on Intel Xeon CPUs, the organization can run LLM agents locally, ensuring that sensitive patient data remains within their secure infrastructure. This preserves privacy and complies with data protection regulations.
   - **Solution**: By using the Ollama service on on-prem x86 CPU servers, the organization can run LLM agents locally, ensuring that sensitive patient data remains within their secure infrastructure. This preserves privacy and complies with data protection regulations.

2. **Cost Efficiency**:
- **Scenario**: A startup is developing an AI-driven customer support system but has limited budget for cloud services.
- **Solution**: Deploying Ollama service on Intel Xeon CPUs allows the startup to run LLM agents locally, reducing dependency on expensive cloud-based APIs like OpenAI. This significantly lowers operational costs and makes the solution more affordable.
   - **Solution**: Deploying the Ollama service on on-prem x86 CPU servers allows the startup to run LLM agents locally, reducing dependency on expensive cloud-based APIs such as OpenAI, Anthropic, and Gemini. This significantly lowers operational costs and makes the solution more affordable.

3. **Low Latency and High Performance**:
- **Scenario**: A financial institution requires real-time analysis of market data to make quick trading decisions.
- **Solution**: Running Ollama service on Intel Xeon CPUs provides high computational power locally, enabling the institution to achieve low latency and high performance. This ensures timely and accurate analysis without the delays associated with cloud-based services.
   - **Solution**: Running the Ollama service on on-prem CPU servers provides high computational power locally, enabling the institution to achieve low latency and high performance. This ensures timely and accurate analysis without the delays associated with cloud-based services.

4. **Scalability and Control**:
- **Scenario**: An enterprise wants to scale its AI capabilities across multiple departments while maintaining control over the infrastructure.
- **Solution**: Deploying Ollama service on Intel Xeon CPUs enables the enterprise to scale LLM agents locally across various departments. This provides better control over the infrastructure and ensures consistent performance and reliability.

5. **Compliance with Regulations**:
- **Scenario**: A legal firm needs to process confidential client information while adhering to strict regulatory requirements.
- **Solution**: Running Ollama service on Intel Xeon CPUs ensures that all data processing happens locally, helping the firm comply with regulations and maintain client confidentiality.
   - **Solution**: Running the Ollama service on-prem ensures that all data processing happens locally, helping the firm comply with regulations and maintain client confidentiality.

6. **Enhanced Reliability**:
- **Scenario**: A manufacturing company relies on AI-driven predictive maintenance to avoid equipment downtime.
- **Solution**: By using Ollama service on Intel Xeon CPUs, the company can run LLM agents locally, ensuring reliable and uninterrupted operation even in environments with limited internet connectivity.
   - **Solution**: By running the Ollama service on on-prem CPU servers, the company can keep LLM agents local, ensuring reliable and uninterrupted operation even in environments with limited internet connectivity.

7. **Energy Efficiency**:
- **Scenario**: An environmental organization aims to minimize its carbon footprint while leveraging AI for data analysis.
- **Solution**: Deploying Ollama service on Intel Xeon CPUs allows the organization to run energy-efficient LLM agents locally, reducing the need for energy-intensive cloud data centers.

The proposed design for Ollama serving support on Intel Xeon CPUs integrates Ollama as an additional LLM service alongside existing services like vLLM, TGI, and OpenAI. This setup enhances data privacy by keeping processing local, reduces operational costs by leveraging on-premise hardware, and provides flexibility and control over AI deployments. The workflow includes embedding, retrieval, and reranking microservices, ensuring efficient and secure handling of user queries and data preparation.
The proposed design for Ollama serving support on on-prem x86 CPU servers integrates Ollama as an additional LLM service alongside existing services like vLLM and TGI. This setup enhances data privacy by keeping processing local, reduces operational costs by leveraging on-premise hardware, and provides flexibility and control over AI deployments. The workflow includes embedding, retrieval, and re-ranking microservices, ensuring efficient and secure handling of user queries and data preparation.
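
The exact wiring is not shown in this excerpt of the diff; the following is a hypothetical sketch of how an agent microservice could select the local backend through environment variables. The variable names (`LLM_ENGINE`, `LLM_ENDPOINT_URL`, `LLM_MODEL_ID`) and defaults are assumptions for illustration, not OPEA's actual configuration keys.

```python
# Hypothetical sketch of LLM backend selection for an agent microservice.
# Environment variable names and defaults are assumptions for illustration,
# not OPEA's actual configuration.
import os
from openai import OpenAI

LLM_ENGINE = os.environ.get("LLM_ENGINE", "ollama")            # "ollama" | "vllm" | "tgi"
LLM_ENDPOINT_URL = os.environ.get("LLM_ENDPOINT_URL", "http://localhost:11434")
LLM_MODEL_ID = os.environ.get("LLM_MODEL_ID", "llama3.1")

def build_llm_client() -> OpenAI:
    """Return an OpenAI-compatible client pointed at the selected local backend."""
    if LLM_ENGINE in ("ollama", "vllm"):
        # Both backends expose an OpenAI-compatible /v1 route, so a single
        # client construction covers either choice.
        return OpenAI(base_url=f"{LLM_ENDPOINT_URL}/v1", api_key="local")
    raise ValueError(f"Unsupported LLM_ENGINE: {LLM_ENGINE}")

client = build_llm_client()
```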
