Well, after waiting 24 hours for a reply and getting no help, I decided to resolve it myself. The problem is with this code:
client = LlamaStackClient(
    base_url=LLAMA_STACK_API_TOGETHER_URL,
)
which looks for the Llama Stack server at https://llama-stack.together.ai. That URL is either no longer reachable, or the API token behind it has a usage limit that runs out by the end of lesson 6.
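On the notebook side, the eventual fix is simply to point the client at a locally hosted server instead. A minimal sketch, assuming the local server from the steps below is listening on port 5001:
from llama_stack_client import LlamaStackClient

# Point the client at the locally hosted Llama Stack server
# (assumes the Docker-based server set up in the steps below is running on port 5001)
client = LlamaStackClient(
    base_url="http://localhost:5001",
)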
Here is how I resolved the issue: I ran the Llama Stack server locally using WSL and Docker, with my own API key from an account I created on https://together.ai
Below are the steps, which I am sharing in case you are also curious about running the stack yourself:
Installing and Using Llama Stack
Step 1: Download the Git Repository
Open WSL (Windows Subsystem for Linux). I am using Debian Linux for this setup.
To begin setting up Llama Stack, clone the official repository using the following command:
git clone https://github.com/meta-llama/llama-stack.git
Step 2: Build Llama Stack
Navigate to the cloned repository and update your package lists:
cd llama-stack/
sudo apt update
Step 3: Check Python Version and Set Up Virtual Environment
Ensure you have Python version 3.10 or higher (on Debian WSL the interpreter is typically python3):
python3 --version
Install venv if it is not already installed:
sudo apt install python3.11-venv
On my WSL setup, Python version 3.11 is used. Adjust accordingly for your system.
Create a virtual environment:
python3 -m venv llama_stack_env
Activate the virtual environment:
source llama_stack_env/bin/activate
Step 4: Install Llama Stack
Install Llama Stack using:
pip install llama-stack
Alternatively, install it from the Git repository:
pip install -e .
I installed it from PyPI. Verify the installation with:
llama --help
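If you also want to confirm which version landed in the virtual environment, an optional Python check is shown below (it assumes the PyPI install route and the package name llama-stack):
# Optional: print the installed llama-stack version from inside the virtual environment
from importlib.metadata import version

print(version("llama-stack"))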
Step 5: Run Llama Stack
Ensure Docker is installed and running. Then, build the Llama Stack container:
llama stack build --template together --image-type docker
If you encounter the following error:
curl: command not found
Install the required package:
sudo apt install curl
Run the build command again:
llama stack build --template together --image-type docker
This process may take some time. Then, export the necessary environment variables:
export LLAMA_STACK_PORT=5001
export TOGETHER_API_KEY=<Your_API_Key>
Obtain your TOGETHER_API_KEY by creating an account on https://together.ai. They provide an API key that comes with $1.00 of credit.
Now, run the Docker container:
docker run -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT llamastack/distribution-together --port $LLAMA_STACK_PORT --env TOGETHER_API_KEY=$TOGETHER_API_KEY
The server will download information about the available Llama models, then start and listen on localhost:5001:
Listening on ['::', '0.0.0.0']:5001
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: ASGI 'lifespan' protocol appears unsupported.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit)
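Before switching to the Windows side, you can optionally confirm that the port is reachable; WSL2 normally forwards localhost ports, so this should work from a Windows prompt as well as from WSL. A minimal sketch using only the standard library (it only checks that something is listening on port 5001, not that the Llama Stack API itself is healthy):
import socket

# Plain TCP check against the port the Docker container publishes
try:
    with socket.create_connection(("localhost", 5001), timeout=5):
        print("Port 5001 is reachable - the Llama Stack server appears to be up.")
except OSError as exc:
    print(f"Could not reach port 5001: {exc}")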
Step 6: Check Inference on Windows
On Windows, open cmd and create a virtual environment:
mkdir llama-stack
cd llama-stack
python -m venv llama-stack-client-env
llama-stack-client-env\Scripts\activate
Install the client package:
pip install llama-stack-client
Configure the client:
llama-stack-client configure --endpoint http://localhost:5001
The above command will prompt for the TOGETHER_API_KEY; enter the same key you created on together.ai earlier.
Step 7: Verify Your Inference Code
In the same command prompt, create a script named verify.py inside the llama-stack folder to check which models are available on your server. Contents of verify.py:
from llama_stack_client import LlamaStackClient

# Initialize the client
client = LlamaStackClient(base_url="http://localhost:5001")

models = client.models.list()
print("Available models:")
for model in models:
    print(f"- {model.identifier}")
Run it using:
python verify.py
Step 8: Run Inference Code
Create an inference script named inference.py in the llama-stack folder:
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

# Initialize the client
client = LlamaStackClient(base_url="http://localhost:5001")

# Define the user message
user_message = UserMessage(
    content="Hello, can you write a short poem about the stars?",
    role="user",
)

# Perform chat completion
response = client.inference.chat_completion(
    messages=[user_message],
    model_id="meta-llama/Llama-3.2-3B-Instruct",  # Ensure this model ID is available on your server
    stream=False,
)

# Print the response
if response.completion_message:
    print(response.completion_message.content)
else:
    print("No completion message received.")
Run it using:
python inference.py
Sample Output:
Here is a short poem about the stars:
Twinkling diamonds in the night,
A celestial show, a wondrous sight.
The stars up high, a twinkling sea,
A million points of light, for you and me.
Their gentle sparkle, a guiding light,
A beacon in the darkness, shining bright.
In the vast expanse, of space and time,
The stars remain, a constant rhyme.
Their beauty is, a gift to see,
A reminder of, the mystery.
Of the universe, and all its might,
The stars shine on, through the dark of night.
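If the script fails instead, the most common cause is that the Docker container on the WSL side is not running. Below is an optional variant of the same call with basic error handling; it uses exactly the client calls shown above, and the helper function ask() is just an illustrative name:
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

client = LlamaStackClient(base_url="http://localhost:5001")

def ask(question: str):
    # Send a single user message and return the reply text,
    # reporting connection problems instead of crashing
    try:
        response = client.inference.chat_completion(
            messages=[UserMessage(content=question, role="user")],
            model_id="meta-llama/Llama-3.2-3B-Instruct",  # must appear in the verify.py output
            stream=False,
        )
    except Exception as exc:  # e.g. connection refused when the container is stopped
        return f"Request failed: {exc}"
    if response.completion_message:
        return response.completion_message.content
    return "No completion message received."

print(ask("Hello, can you write a short poem about the stars?"))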
Conclusion
You have successfully installed and configured Llama Stack. You can now use it for inference and explore its capabilities further. Happy coding!