
Minimal Deployment of an Inference Service: Using the Dedicated vLLM Image

Introduction

This article describes how to deploy a large language model (LLM) with a dedicated container image. The inference-service image we provide comes with FastChat and vLLM preinstalled, so you can simplify the deployment process and focus on serving and optimizing your model.

Dedicated images currently available:

  • cr.infini-ai.com/infini-ai/inference-base:v1-vllm0.4.0-torch2.1-cuda12.3-ubuntu22.04
  • cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04
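
If you want to check an image locally before deploying (for example, to confirm the preinstalled vLLM and PyTorch versions), you can pull it with Docker. This is a minimal sketch, assuming you have Docker installed, network access to cr.infini-ai.com (run docker login first if the registry requires authentication), and that python3/pip are available inside the image:

bash
# Pull the v2 image locally
docker pull cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04

# List the preinstalled packages without starting any services
# (the grep runs on the host and filters the output)
docker run --rm --entrypoint python3 \
  cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04 \
  -m pip list | grep -i -E "vllm|torch"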

Prerequisites

Before deploying, make sure your LLM has been placed in the shared high-performance storage service. When creating the inference service, you can add a mount point so the container can access the model files.
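
It is also worth verifying that the model directory on the shared storage is complete. The sketch below uses a hypothetical mount path (/mnt/models/your-model); substitute the mount point and model directory you actually configure.

bash
# Hypothetical mount point and model directory; adjust to your own setup
MODEL_DIR=/mnt/models/your-model

# A Hugging Face style checkpoint usually contains config.json, tokenizer files,
# and one or more weight shards (*.safetensors or *.bin)
ls -lh "${MODEL_DIR}"
test -f "${MODEL_DIR}/config.json" || echo "warning: config.json not found in ${MODEL_DIR}"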

Creating an Inference Service

Go to the Inference Service page of the 智算云控制台 (AI cloud console) to create an inference service.

Startup Command

When using the dedicated vLLM image, the startup command only needs to specify the model path and launch the corresponding FastChat services.

TIP

Pay special attention to container lifecycle management: the startup command must keep a main process running in the foreground. If every service is started with nohup, the startup command exits as soon as it finishes, the platform concludes that the container's main process has ended, and the container falls into a loop of being destroyed and recreated.
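
To make this concrete, the sketch below shows the two common ways to keep a main process alive, using the controller command from this guide as the example: either leave the last service in the foreground, or background everything and block the script with wait. The startup command below takes the first approach, with the vLLM worker as the foreground process.

bash
# Option 1: leave the last service in the foreground (used in the startup command below)
# Option 2: background every service, then block the script with `wait`
nohup python3 -m fastchat.serve.controller --host 0.0.0.0 > controller.log 2>&1 &
# ... start the remaining services the same way ...
wait   # keeps the script, and therefore the container's main process, running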

bash
#!/bin/bash

# Set the required environment variable
export MODEL="/path/to/model"  # path to the model on your tenant's shared high-performance storage

# Check that the environment variable is set
if [ -z "$MODEL" ]; then
    echo "Error: MODEL must have value."
    exit 1
fi

echo "Loading model (${MODEL})"
echo "Running with log ..."

# Start the metrics service, which exposes monitoring metrics; with port 20000 it can integrate with the platform's monitoring features
echo "Running metrics (1/4) ... (see /app/metrics.log)"
nohup python3 -m fastchat.serve.metrics --host 0.0.0.0 --port 20000 > metrics.log 2>&1 &

# Start the controller service
echo "Running controller (2/4) ... (see /app/controller.log)"
nohup python3 -m fastchat.serve.controller --host 0.0.0.0 > controller.log 2>&1 &

# Start the api_server
echo "Running api_server (3/4) ... (see /app/api_server.log)"
nohup python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 > api_server.log 2>&1 &

# Start the vLLM worker
echo "Running vllm_worker (4/4) ... (see /app/vllm_worker.log)"
python3 -m fastchat.serve.vllm_worker --host 0.0.0.0  --model-path ${MODEL} 2>&1 | tee vllm_worker.log

You can wrap the above startup commands in a single shell script for easier management, for example as shown below.
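
A minimal sketch: save the commands to a script on the shared storage and reference it from the service's startup command. The path below is only a placeholder.

bash
# Placeholder path on the shared high-performance storage
chmod +x /path/to/shared-storage/start_llm.sh

# Use this as the inference service's startup command
/bin/bash /path/to/shared-storage/start_llm.sh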

Details

  1. Environment variables

    • MODEL: path of the model to serve.
  2. Command options

    • fastchat.serve.metrics: starts the metrics service, listening on port 20000.
    • fastchat.serve.controller: starts the controller service, listening on port 21001 on all interfaces.
    • fastchat.serve.vllm_worker: starts the vLLM worker service, listening on port 21002 and using the specified model path.
    • fastchat.serve.openai_api_server: starts the OpenAI-compatible API server, listening on port 8000 and connecting to the controller (a sample request is shown after this list).
  3. Log handling

    • Using > log_file 2>&1 & redirects stdout and stderr to a log file and runs the command in the background, capturing each service's logs for debugging and monitoring.
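
Once all four services are up, you can verify the OpenAI-compatible endpoint. A minimal sketch, assuming the requests are made from inside the container (localhost:8000) and that "your-model-name" is replaced with a model id returned by /v1/models:

bash
# List the models registered with the API server
curl http://localhost:8000/v1/models

# Send a test chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'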

With the steps above, you can use the container image preinstalled with FastChat and vLLM to deploy and manage large language models with ease. We hope this guide helps you better understand and apply the approach.

Reference Script

The prebuilt image ships with an /app/entrypoint script, which you can inspect yourself; it is provided for reference only.
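
One way to inspect it (assuming local Docker access to the registry) is to print it straight from the image without starting any services:

bash
# Print the packaged entrypoint script
docker run --rm --entrypoint cat \
  cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04 \
  /app/entrypoint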

The following file contents come from the image cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04:

bash
#!/bin/bash

export LOGDIR="" # don't output to local file
VLLM_USE_MODELSCOPE=False # use local model path
GMU=${GMU:-0.90}              # --gpu-memory-utilization, 0.90 by default
MAXLR=${MAXLR:-16}            # --max-lora-rank, 16 by default (matches params["r"] in the LoRA adapter_config.json)
TOKENMODE=${TOKENMODE:-auto}  # --tokenizer-mode, 'auto' by default; choices are ['auto', 'slow']. 'slow' behaves like use_fast=false (needed by the MT-infini-3B model)
SFT=${SFT:-0}                 # sft-model, 0 by default; set to 1 if the model is an SFT model
BLOCK_SIZE=${BLOCK_SIZE:-16}
MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}

# MML: --max-model-len, unset (None) by default
# MNBT: --max-num-batched-tokens, unset (None) by default
# BLOCK_SIZE: 16 by default, choices=[8, 16, 32, 128]. export BLOCK_SIZE=16
# MAX_NUM_SEQS: 256 by default, the maximum number of sequences per iteration. export MAX_NUM_SEQS=16
# QUANTIZATION: method used to quantize the weights, choices=['awq', 'gptq', 'squeezellm', None]. If None,
#   the `quantization_config` attribute in the model config file is checked first; if that is also None, the
#   weights are assumed to be unquantized and `dtype` determines their data type.
# FIRST_TOKEN_METRICS: first-token metric buckets in seconds. export FIRST_TOKEN_METRICS=0.01,0.025,0.05,0.075,0.1,0.15,0.2,0.3,0.4,0.5,0.75,1.0,2.5
# PRE_TOKEN_METRICS: per-token metric buckets in seconds. export PRE_TOKEN_METRICS=0.001,0.005,0.01,0.02,0.04,0.06,0.08,0.1,0.25,0.5,0.75,1.0,2.5,5.0,7.5,10.0
# PREFIXCACHING: export PREFIXCACHING=True
# CHUNKEDPREFILL: export CHUNKEDPREFILL=True
# ENFORCEEAGER: export ENFORCEEAGER=True
# ENABLE_LORA: export ENABLE_LORA=True

# check envs
LOG=${LOG:-1} # LOG=1 as default
DEBUG=${DEBUG:-0} # DEBUG=0 as default

# enable infini log while debugging
if [ $DEBUG -eq 1 ]; then
    echo "Debug Mode is ON"
    exec 1> >(tee -a debug.log) 2>&1
fi

if [ -z "$MODEL" ] || [ -z "$TP" ]; then
    echo "Error: MODEL and TP must have values."
    exit 1
fi
echo "Load model (${MODEL}) with tp (${TP}) ..."

# build lora options
if [ "${ENABLE_LORA}" = "True" ]; then
    ENABLE_LORA_OPTION="--enable-lora "
fi
if [ -n "${LORA_MODELS}" ]; then
    LORA_OPTIONS="--enable-lora --lora-modules "
    IFS=',' read -ra LORA_MODELS <<< "${LORA_MODELS}"
    for LORA_MODEL in "${LORA_MODELS[@]}"; do
        LORA_OPTIONS+="${LORA_MODEL}=/app/model/${LORA_MODEL} "
    done
    LORA_OPTIONS+="--max-lora-rank ${MAXLR}"
fi

# first token metrics seconds
if [ -n "${FIRST_TOKEN_METRICS}" ]; then
    FIRST_TOKEN_METRICS_SECONDS="--first-token-metrics-seconds "
    IFS=',' read -ra FIRST_TOKEN_METRICS <<< "${FIRST_TOKEN_METRICS}"
    for FIRST_TOKEN in "${FIRST_TOKEN_METRICS[@]}"; do
        FIRST_TOKEN_METRICS_SECONDS+="${FIRST_TOKEN} "
    done
fi

# pre token metrics seconds
if [ -n "${PRE_TOKEN_METRICS}" ]; then
    PRE_TOKEN_METRICS_SECONDS="--pre-token-metrics-seconds "
    IFS=',' read -ra PRE_TOKEN_METRICS <<< "${PRE_TOKEN_METRICS}"
    for PRE_TOKEN in "${PRE_TOKEN_METRICS[@]}"; do
        PRE_TOKEN_METRICS_SECONDS+="${PRE_TOKEN} "
    done
fi

if [ -n "${MML}" ]; then
    MAX_MODEL_LEN="--max-model-len "
    MAX_MODEL_LEN+="${MML}"
fi

if [ -n "${QUANTIZATION}" ]; then
    QUANTIZATION_OPTION="--quantization "
    QUANTIZATION_OPTION+="${QUANTIZATION}"
fi

if [ -n "${MNBT}" ]; then
    MAXNUMBATCHEDTOKENS="--max-num-batched-tokens "
    MAXNUMBATCHEDTOKENS+="${MNBT}"
fi

if [ "${PREFIXCACHING}" = "True" ]; then
    PREFIXCACHING_OPTION="--enable-prefix-caching"
fi

if [ "${CHUNKEDPREFILL}" = "True" ]; then
    CHUNKEDPREFILL_OPTION="--enable-chunked-prefill True"
fi

if [ "${ENFORCEEAGER}" = "True" ]; then
    ENFORCEEAGER_OPTION="--enforce-eager"
fi


if [ "${TP}" -ge "1" ]; then
    export VLLM_WORKER_MULTIPROC_METHOD=spawn
fi

_term() {
  echo "entrypoint caught SIGTERM signal!"
  kill -TERM "$vllm_worker_pid" 2>/dev/null
  wait $vllm_worker_pid
  echo "send term signal finish"
}
trap _term SIGTERM

if [ "${LOG}" = "1" ]; then
    LOG_PRINT="Running with log ..."
    METRICS_LOG="metrics.log"
    CONTROLLER_LOG="controller.log"
    API_SERVER_LOG="api_server.log"
    VLLM_WORKER_LOG="vllm_worker.log"
    HEALTH_LOG="health.log"
else
    LOG_PRINT="Running without log ..."
    METRICS_LOG="/dev/null"
    CONTROLLER_LOG="/dev/null"
    API_SERVER_LOG="/dev/null"
    VLLM_WORKER_LOG="/dev/null"
    HEALTH_LOG="/dev/null"
fi

echo ${LOG_PRINT}
echo "Running metrics (1/5) ... (see /app/${METRICS_LOG})"
nohup python3 -m fastchat.serve.metrics \
                --host 0.0.0.0 --port 20000 > ${METRICS_LOG} 2>&1 &

echo "Running controller (2/5) ... (see /app/${CONTROLLER_LOG})"
nohup python3 -m fastchat.serve.controller \
                --host 0.0.0.0 > ${CONTROLLER_LOG} 2>&1 &

echo "Running api_server (3/5) ... (see /app/${API_SERVER_LOG})"
nohup python3 -m fastchat.serve.openai_api_server \
                --host 0.0.0.0 --port 8000 > ${API_SERVER_LOG} 2>&1 &

echo "Running health_server (4/5) ... (see /app/${HEALTH_LOG})"
nohup python3 -m fastchat.serve.health_server \
                --host 0.0.0.0 --port 9000 > ${HEALTH_LOG} 2>&1 &

echo "Running vllm_worker (5/5) ... (see /app/vllm_worker.log)"
python3 -m fastchat.serve.vllm_worker \
        --host 0.0.0.0 \
        --model-path /app/model/${MODEL} \
        --trust-remote-code \
        --tensor-parallel-size ${TP} \
        --gpu-memory-utilization ${GMU} \
        --dtype auto \
        --tokenizer-mode ${TOKENMODE} \
        --sft-model ${SFT} \
        --block-size ${BLOCK_SIZE} \
        --max-num-seqs ${MAX_NUM_SEQS} \
        --debug-enable ${DEBUG} \
        ${FIRST_TOKEN_METRICS_SECONDS} \
        ${PRE_TOKEN_METRICS_SECONDS} \
        ${QUANTIZATION_OPTION}  \
        ${MAX_MODEL_LEN} \
        ${PREFIXCACHING_OPTION} \
        ${CHUNKEDPREFILL_OPTION} \
        ${ENFORCEEAGER_OPTION} \
        ${ENABLE_LORA_OPTION} \
        ${LORA_OPTIONS} 2>&1 | tee ${VLLM_WORKER_LOG} &
vllm_worker_pid=$(pgrep -f "python3 -m fastchat.serve.vllm_worker")
echo "vllm worker pid:$vllm_worker_pid"
wait $vllm_worker_pid
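
If you choose to run the packaged /app/entrypoint (or a copy of it) as your startup command, it is configured entirely through environment variables. The sketch below sets the two required ones (MODEL and TP) plus a few optional overrides taken from the script above; the concrete values are illustrative, and note that the script resolves the model as /app/model/${MODEL}.

bash
#!/bin/bash
# Required: model directory name under /app/model and tensor-parallel size
export MODEL="your-model-dir"   # placeholder; use the directory name of your mounted model
export TP=1

# Optional overrides; defaults are defined inside /app/entrypoint
export GMU=0.90          # --gpu-memory-utilization
export MAX_NUM_SEQS=256  # --max-num-seqs
# export MML=4096        # --max-model-len (left unset by default)

# Run the packaged entrypoint script
bash /app/entrypoint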

References

  • For a detailed introduction to FastChat, see FastChat