LLM Model Management

Upload and manage LLMs for your RAG pipeline
| Name | Description | Type | Status | Parameters | Quantization | Context Window | Last Updated |
|---|---|---|---|---|---|---|---|
| Llama 3.1 (8B) | Meta's Llama 3.1 8B model, optimized for local deployment | Local | Ready | 8B | GGUF (Q4_K_M) | 8,192 tokens | 2023-10-15 |
| Mistral (7B) | Mistral 7B model, good balance of performance and efficiency | Local | Ready | 7B | GGUF (Q5_K_M) | 8,192 tokens | 2023-09-20 |
| GPT-4o | OpenAI's GPT-4o model, accessed via API | API | Ready | Unknown | N/A | 128,000 tokens | 2023-10-22 |
| Custom BERT-based Model | Custom fine-tuned model for specific domain knowledge | Custom | Error | 330M | GGUF (Q4_0) | 4,096 tokens | 2023-10-18 |
Model Hosting
Configure how models are hosted and served

- Continuous batching improves throughput for multiple concurrent requests.
- Optimizes memory usage for handling multiple requests.
- Distributes the model across multiple GPUs (requires compatible hardware).
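The throughput gain from continuous batching comes from finished requests freeing their batch slot immediately, so queued requests join mid-flight instead of waiting for the whole batch to drain. A toy decode-step simulation (a hypothetical helper for illustration, not part of the serving stack) makes this concrete:

```python
from collections import deque

def continuous_batching_steps(request_lengths, max_batch):
    """Count decode steps to serve all requests when a finished request
    frees its batch slot immediately (continuous batching)."""
    queue = deque(request_lengths)  # tokens still to generate per queued request
    active = []                     # tokens remaining per in-flight request
    steps = 0
    while queue or active:
        # refill any free slots from the queue before the next decode step
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        steps += 1                  # one token decoded for every active request
        active = [r - 1 for r in active if r > 1]
    return steps

# five requests with a batch cap of 2: short requests hand their slot
# to queued ones as soon as they finish
print(continuous_batching_steps([5, 1, 1, 1, 5], max_batch=2))  # → 8
```

In this toy model the same workload takes 11 steps if each batch must fully drain before the next is admitted (batches [5,1], [1,1], [5]), versus 8 with continuous refill.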

Model Hub
Download pre-configured models from the hub
| Model | Description | Size | Context |
|---|---|---|---|
| Llama 3.1 (8B) | Meta's latest 8B parameter model | 4.2 GB (Q4_K_M) | 8K tokens |
| Mistral (7B) | Efficient 7B parameter model | 3.8 GB (Q4_K_M) | 8K tokens |
| Phi-3 (3.8B) | Microsoft's compact but powerful model | 2.1 GB (Q4_K_M) | 4K tokens |
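The listed download sizes follow roughly from parameter count times average bits per weight. A back-of-the-envelope sketch (assuming about 4.5 bits per weight for Q4_K_M, an approximation, and ignoring GGUF metadata overhead):

```python
def gguf_size_gib(n_params, bits_per_weight):
    """Approximate on-disk size of a quantized model in GiB:
    parameter count times average bits per weight, metadata ignored."""
    return n_params * bits_per_weight / 8 / 2**30

# 8B parameters at ~4.5 bits/weight lands near the 4.2 GB listed above
print(round(gguf_size_gib(8e9, 4.5), 1))  # → 4.2
```

The same arithmetic explains why the 3.8B Phi-3 download is roughly half the 8B Llama download at the same quantization level.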