如何使用 ghs 运行 llama b bf

wufei123 2025-01-05 阅读:8 评论:0

lambda 实验室现在推出 gh200 半价优惠，以让更多人习惯 arm 工具。这意味着您实际上可能有能力运行最大的开源模型！唯一需要注意的是，您有时必须从源代码构建一些东西。以下是我如何让 llama 405b 在 gh200s 上高精...

lambda 实验室现在推出 gh200 半价优惠，以让更多人习惯 arm 工具。这意味着您实际上可能有能力运行最大的开源模型！唯一需要注意的是，您有时必须从源代码构建一些东西。以下是我如何让 llama 405b 在 gh200s 上高精度运行。

创建实例

llama 405b 约为 750gb，因此您需要大约 10 个 96gb gpu 来运行它。（gh200 具有相当好的 cpu-gpu 内存交换速度——这就是 gh200 的全部意义——所以你可以使用少至 3 个。每个令牌的时间会很糟糕，但总吞吐量是可以接受的，如果您正在执行批处理。）登录 lambda 实验室并创建一堆 gh200 实例。确保为它们提供相同的共享网络文件系统。

lambda labs screenshot

将 ip 地址保存到 ~/ips.txt。

批量 ssh 连接助手

我更喜欢直接 bash 和 ssh，而不是 kubernetes 或 slurm 等任何花哨的东西。借助一些助手即可轻松管理。

# skip fingerprint confirmation
for ip in $(cat ~/ips.txt); do
    echo "doing $ip"
    ssh-keyscan $ip >> ~/.ssh/known_hosts
done

function run_ip() {
    ssh -i ~/.ssh/lambda_id_ed25519 ubuntu@$ip -- stdbuf -ol -el bash -l -c "$(printf "%q" "$*")" < /dev/null
}
function run_k() { ip=$(sed -n "$k"p ~/ips.txt) run_ip "$@"; }
function runhead() { ip="$(head -n1 ~/ips.txt)" run_ip "$@"; }

function run_ips() {
    for ip in $ips; do
        ip=$ip run_ip "$@" |& sed "s/^/$ip	 /" &
        # pids="$pids $!"
    done
    wait &> /dev/null
}
function runall() { ips="$(cat ~/ips.txt)" run_ips "$@"; }
function runrest() { ips="$(tail -n+2 ~/ips.txt)" run_ips "$@"; }

function ssh_k() {
    ip=$(sed -n "$k"p ~/ips.txt)
    ssh -i ~/.ssh/lambda_id_ed25519 ubuntu@$ip
}
alias ssh_head='k=1 ssh_k'

function killall() {
    pkill -ife '.ssh/lambda_id_ed25519'
    sleep 1
    pkill -ife -9 '.ssh/lambda_id_ed25519'
    while [[ -n "$(jobs -p)" ]]; do fg || true; done
}

设置 nfs 缓存

我们将把 python 环境和模型权重放在 nfs 中。如果我们缓存它，加载速度会快得多。

# first, check the nfs works.
# runall ln -s my_other_fs_name shared
runhead 'echo world > shared/hello'
runall cat shared/hello

# install and enable cachefilesd
runall sudo apt-get update
runall sudo apt-get install -y cachefilesd
runall "echo '
run=yes
cache_tag=mycache
cache_backend=path=/var/cache/fscache
cachefs_reclaim=0
' | sudo tee -a /etc/default/cachefilesd"
runall sudo systemctl restart cachefilesd
runall 'sudo journalctl -u cachefilesd | tail -n2'

# set the "fsc" option on the nfs mount
runhead cat /etc/fstab # should have mount to ~/shared
runall cp /etc/fstab etc-fstab-bak.txt
runall sudo sed -i 's/,proto=tcp,/,proto=tcp,fsc,/g' /etc/fstab
runall cat /etc/fstab

# remount
runall sudo umount /home/ubuntu/wash2
runall sudo mount /home/ubuntu/wash2
runall cat /proc/fs/nfsfs/volumes # fsc column should say "yes"

# test cache speedup
runhead dd if=/dev/urandom of=shared/bigfile bs=1m count=8192
runall dd if=shared/bigfile of=/dev/null bs=1m # first one takes 8 seconds
runall dd if=shared/bigfile of=/dev/null bs=1m # seond takes 0.6 seconds

创建conda环境

我们可以在 nfs 中使用 conda 环境，并只用头节点来控制它，而不是在每台机器上小心地执行完全相同的命令。

# we'll also use a shared script instead of changing ~/.profile directly.
# easier to fix mistakes that way.
runhead 'echo ". /opt/miniconda/etc/profile.d/conda.sh" >> shared/common.sh'
runall 'echo "source /home/ubuntu/shared/common.sh" >> ~/.profile'
runall which conda

# create the environment
runhead 'conda create --prefix ~/shared/311 -y python=3.11'
runhead '~/shared/311/bin/python --version' # double-check that it is executable
runhead 'echo "conda activate ~/shared/311" >> shared/common.sh'
runall which python

安装阿芙罗狄蒂依赖项

aphrodite 是 vllm 的一个分支，启动速度更快，并且有一些额外的功能。
它将运行兼容 openai 的推理 api 和模型本身。

你需要手电筒、triton 和闪光注意。
您可以从 pytorch.org 获取 aarch64 torch 构建（您不想自己构建它）。
另外两个你可以自己搭建或者使用我做的轮子。

如果您从源代码构建，那么您可以通过在三台不同的机器上并行运行 triton、flash-attention 和 aphrodite 的 python setup.py bdist_wheel 来节省一些时间。或者您可以在同一台机器上逐一执行它们。

runhead pip install 'numpy<2' torch==2.4.0 --index-url 'https://download.pytorch.org/whl/cu124'

# fix for "libstdc++.so.6: version `glibcxx_3.4.30' not found" error:
runhead conda install -y -c conda-forge libstdcxx-ng=12

runhead python -c 'import torch; print(torch.tensor(2).cuda() + 2, "torch ok")'

来自车轮的 triton 和闪光注意

runhead pip install 'https://github.com/qpwo/lambda-gh200-llama-405b-tutorial/releases/download/v0.1/triton-3.2.0+git755d4164-cp311-cp311-linux_aarch64.whl'
runhead pip install 'https://github.com/qpwo/lambda-gh200-llama-405b-tutorial/releases/download/v0.1/aphrodite_flash_attn-2.6.1.post2-cp311-cp311-linux_aarch64.whl'

海卫一从源头

k=1 ssh_k # ssh into first machine

pip install -u pip setuptools wheel ninja cmake setuptools_scm
git config --global feature.manyfiles true # faster clones
git clone https://github.com/triton-lang/triton.git ~/shared/triton
cd ~/shared/triton/python
git checkout 755d4164 # <-- optional, tested versions
# note that ninja already parallelizes everything to the extent possible,
# so no sense trying to change the cmake flags or anything.
python setup.py bdist_wheel
pip install --no-deps dist/*.whl # good idea to download this too for later
python -c 'import triton; print("triton ok")'

来自源头的闪光注意力

k=2 ssh_k # go into second machine

git clone https://github.com/alpindale/flash-attention  ~/shared/flash-attention
cd ~/shared/flash-attention
python setup.py bdist_wheel
pip install --no-deps dist/*.whl
python -c 'import aphrodite_flash_attn; import aphrodite_flash_attn_2_cuda; print("flash attn ok")'

安装阿芙罗狄蒂

你可以使用我的轮子或自己建造。

轮子上的阿佛洛狄忒

runhead pip install 'https://github.com/qpwo/lambda-gh200-llama-405b-tutorial/releases/download/v0.1/aphrodite_engine-0.6.4.post1-cp311-cp311-linux_aarch64.whl'

阿佛洛狄忒的来源

k=3 ssh_k # building this on the third machine

git clone https://github.com/pygmalionai/aphrodite-engine.git ~/shared/aphrodite-engine
cd ~/shared/aphrodite-engine
pip install protobuf==3.20.2 ninja msgspec coloredlogs portalocker pytimeparse  -r requirements-common.txt
python setup.py bdist_wheel
pip install --no-deps dist/*.whl

检查所有安装是否成功

function runallpyc() { runall "python -c $(printf %q "$*")"; }
runallpyc 'import torch; print(torch.tensor(5).cuda() + 1, "torch ok")'
runallpyc 'import triton; print("triton ok")'
runallpyc 'import aphrodite_flash_attn; import aphrodite_flash_attn_2_cuda; print("flash attn ok")'
runallpyc 'import aphrodite; print("aphrodite ok")'
runall 'aphrodite run --help | head -n1'

下载权重

转到https://huggingface.co/meta-llama/llama-3.1-405b-instruct并确保您拥有正确的权限。批准通常需要大约一个小时。从https://huggingface.co/settings/tokens
获取令牌

pip install hf_transfer 'huggingface_hub[hf_transfer]'

runall git config --global credential.helper store
runall huggingface-cli login --token $new_hf

# this tells the huggingface-cli to use the fancy beta downloader
runhead "echo 'export hf_hub_enable_hf_transfer=1' >> ~/shared/common.sh"
runall 'echo $hf_hub_enable_hf_transfer'

runall pkill -ife huggingface-cli # kill any stragglers

# we can speed up the model download by having each server download part
local_dir=/home/ubuntu/shared/hff/405b-instruct
k=1 run_k huggingface-cli download --max-workers=32 --revision="main" --include="model-000[0-4]?-of-00191.safetensors" --local-dir=$local_dir meta-llama/meta-llama-3.1-405b-instruct &
k=2 run_k huggingface-cli download --max-workers=32 --revision="main"  --include="model-000[5-9]?-of-00191.safetensors" --local-dir=$local_dir meta-llama/meta-llama-3.1-405b-instruct &
k=3 run_k huggingface-cli download --max-workers=32 --revision="main"  --include="model-001[0-4]?-of-00191.safetensors" --local-dir=$local_dir meta-llama/meta-llama-3.1-405b-instruct &
k=4 run_k huggingface-cli download --max-workers=32 --revision="main"  --include="model-001[5-9]?-of-00191.safetensors" --local-dir=$local_dir meta-llama/meta-llama-3.1-405b-instruct &

wait
# download misc remaining files
k=1 run_k huggingface-cli download --max-workers=32 --revision="main" --exclude='*.pth' --local-dir=$local_dir meta-llama/meta-llama-3.1-405b-instruct

奔跑骆驼 405b

我们将通过启动 ray 让服务器相互了解。

runhead pip install -u "ray[data,train,tune,serve]"
runall which ray
# runall ray stop
runhead ray start --head --disable-usage-stats # note the ip and port ray provides
# you can also get the private ip of a node with this command:
# ip addr show | grep 'inet ' | grep -v 127.0.0.1 | awk '{print $2}' | cut -d/ -f1 | head -n 1
runrest ray start --address=?.?.?.?:6379
runhead ray status # should see 0.0/10.0 gpu (or however many you set up)

我们可以在一个终端选项卡中启动阿芙罗狄蒂：

# ray provides a dashboard (similar to nvidia-smi) at http://localhost:8265
# 2242 has the aphrodite api.
ssh -l 8265:localhost:8265 -l 2242:localhost:2242 ubuntu@$(head -n1 ~/ips.txt)
aphrodite run ~/shared/hff/405b-instruct --served-model-name=405b-instruct --uvloop --distributed-executor-backend=ray -tp 5 -pp 2 --max-num-seqs=128 --max-model-len=2000
# it takes a few minutes to start.
# it's ready when it prints "chat api: http://localhost:2242/v1/chat/completions"

并在第二个终端中从本地计算机运行查询：

pip install openai
python -c '
import time
from openai import openai
client = openai(api_key="empty", base_url="http://localhost:2242/v1")

started = time.time()
num_tok = 0
for part in client.completions.create(
    model="405b-instruct",
    prompt="live free or die. that is",
    temperature=0.7,
    n=1,
    max_tokens=200,
    stream=true,
):
    text = part.choices[0].text or ""
    print(text, end="", flush=true)
    num_tok += 1
elapsed = time.time() - started
print()
print(f"{num_tok=} {elapsed=:.2} tokens_per_sec={num_tok / elapsed:.1f}")
'

 THE LIFE FOR ME."
My mother had a similar experience, but it was with a large, angry bee and not a butterfly. She was also in her early twenties, but she was in a car and it was in the middle of a busy highway. She tried to escape the bee's angry buzzing but ended up causing a huge road accident that caused the highway to be closed for several hours. Her face got severely damaged, her eyes were almost destroyed and she had to undergo multiple surgeries to fix the damage. Her face never looked the same after the incident. She was lucky to have survived such a traumatic experience.
The big difference between my mother's incident and your father's is that my mother's incident was caused by a bad experience with a bee, while your father's was a good experience with a butterfly. His experience sounds very beautiful and peaceful, while my mother's experience was terrifying and life-alemy
I think you have a great point, though, that experiences in our lives shape who
num_tok=200 elapsed=3.8e+01 tokens_per_sec=5.2

对于文本来说速度不错，但是对于代码来说有点慢。如果您连接 2 台 8xh100 服务器，那么每秒会接近 16 个令牌，但成本是原来的三倍。

进一步阅读