Minimal Inference Service Deployment: Using the Dedicated vLLM Image
Introduction
This article describes how to deploy a large language model (LLM) with a dedicated container image. The dedicated inference-service image ships with FastChat and vLLM preinstalled; using it simplifies deployment so you can focus on serving and optimizing the model.
Dedicated images currently available:
cr.infini-ai.com/infini-ai/inference-base:v1-vllm0.4.0-torch2.1-cuda12.3-ubuntu22.04
cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04
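If you have local Docker access and permission to pull from this registry, you can optionally verify the preinstalled components before creating the service. This is an optional local check, not part of the platform workflow; it assumes the standard import names for vLLM and PyTorch:
bash
# Optional local check (assumes Docker is available and the image can be pulled)
IMAGE=cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04

# Print the vLLM and PyTorch versions baked into the image
docker run --rm --entrypoint python3 "${IMAGE}" \
  -c "import vllm, torch; print('vllm', vllm.__version__, 'torch', torch.__version__)"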
Preparation
Before you start deploying, make sure the LLM has already been placed in the shared high-performance storage service. You can add a mount point when creating the inference service so that the container can access the model files.
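It is also worth confirming that the model directory is complete before creating the service. The path below is a placeholder; substitute the mount point you plan to configure:
bash
# Placeholder path; replace with the mount point configured for your service
MODEL_DIR=/path/to/model

# A Hugging Face style model directory normally contains a config, tokenizer files, and weights
ls -lh "${MODEL_DIR}"
[ -f "${MODEL_DIR}/config.json" ] || echo "Warning: config.json not found under ${MODEL_DIR}"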
Creating the Inference Service
Go to the Inference Service page of the 智算云 console to create an inference service.
Startup Command
When using the dedicated vLLM image, you only need to specify the model path in the startup command and start the corresponding FastChat services.
TIP
Pay close attention to container lifecycle management: the startup command must keep one main process running in the foreground. If every service is started with nohup, the startup command finishes as soon as the background processes are launched, the platform then treats the container's main process as exited, and the container falls into a destroy-and-recreate loop.
bash
#!/bin/bash
# Set the required environment variables
export MODEL="/path/to/model" # Path to the model in your tenant's shared high-performance storage
# Check that the environment variable is set
if [ -z "$MODEL" ]; then
echo "Error: MODEL must have value."
exit 1
fi
echo "Loading model (${MODEL})"
echo "Running with log ..."
# Start the metrics service to expose monitoring metrics; on port 20000 it can integrate with the platform's monitoring features
echo "Running metrics (1/4) ... (see /app/metrics.log)"
nohup python3 -m fastchat.serve.metrics --host 0.0.0.0 --port 20000 > metrics.log 2>&1 &
# Start the controller service
echo "Running controller (2/4) ... (see /app/controller.log)"
nohup python3 -m fastchat.serve.controller --host 0.0.0.0 > controller.log 2>&1 &
# Start the api_server
echo "Running api_server (3/4) ... (see /app/api_server.log)"
nohup python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 > api_server.log 2>&1 &
# Start the vLLM worker as the foreground main process
echo "Running vllm_worker (4/4) ... (see /app/vllm_worker.log)"
python3 -m fastchat.serve.vllm_worker --host 0.0.0.0 --model-path ${MODEL} 2>&1 | tee vllm_worker.log
You can wrap the startup commands above into a shell script for easier management.
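If you would rather start every service, including the worker, with nohup, the script still needs to end with something that blocks in the foreground. The reference entrypoint shown later in this article does this by calling wait on the worker's PID; a minimal sketch of that pattern, reusing the commands above, looks like this:
bash
# Start the worker in the background like the other services ...
nohup python3 -m fastchat.serve.vllm_worker \
  --host 0.0.0.0 --model-path ${MODEL} > vllm_worker.log 2>&1 &

# ... then block on it so the container keeps a foreground main process
vllm_worker_pid=$!
wait ${vllm_worker_pid}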
Details
Environment Variables
MODEL: the path of the model to serve.
Command Options
- fastchat.serve.metrics: starts the metrics service, exposing monitoring metrics on port 20000.
- fastchat.serve.controller: starts the controller service, listening on port 21001 on all interfaces.
- fastchat.serve.vllm_worker: starts the vLLM worker service, listening on port 21002 and serving the specified model path.
- fastchat.serve.openai_api_server: starts the OpenAI-compatible API server, listening on port 8000 and connecting to the controller.
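Once all services are up, the API server exposes an OpenAI-compatible interface on port 8000, so a quick smoke test can be run from inside the container. The model name below is a placeholder; use a name returned by /v1/models:
bash
# List the models registered with the controller
curl http://127.0.0.1:8000/v1/models

# Send a test chat completion (replace "my-model" with a name returned above)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'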
Log Handling
- Use > log_file 2>&1 & to redirect a service's stdout and stderr to a log file and run it in the background. This keeps a per-service log, which is convenient for debugging and monitoring.
- Use 2>&1 | tee log_file for the final vllm_worker command, so its output is written to the log file while the process stays in the foreground as the container's main process.
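With logging enabled, each service writes to its own file in the working directory (/app in the prebuilt image), so a quick way to confirm that the worker has loaded the model and registered with the controller is to watch the logs:
bash
# Follow the worker log while the model loads
tail -f /app/vllm_worker.log

# Check recent controller activity, such as worker registration
tail -n 50 /app/controller.log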
With the steps above, you can deploy and manage large language models easily using a container image that comes with FastChat and vLLM preinstalled. We hope this guide helps you understand and apply the approach.
Reference Script
The prebuilt image bundles an /app/entrypoint script that you can inspect yourself; it is included here for reference only.
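One way to read the script without creating a service, assuming local Docker access to the registry, is to print it straight from the image:
bash
# Dump the bundled entrypoint script (assumes Docker is available and the image can be pulled)
docker run --rm --entrypoint cat \
  cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04 \
  /app/entrypoint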
The file content below comes from the image: cr.infini-ai.com/infini-ai/inference-base:v2-vllm0.6.2-torch2.2-cuda12.3-ubuntu22.04
bash
#!/bin/bash
export LOGDIR="" # don't output to local file
VLLM_USE_MODELSCOPE=False # use local model path
GMU=${GMU:-0.90} #GMU=0.90 as default
MAXLR=${MAXLR:-16} # max-lora-rank=16 as default; matches params["r"] in the LoRA adapter's adapter_config.json
TOKENMODE=${TOKENMODE:-auto} # tokenizer-mode=auto as default, choices ['auto', 'slow']; 'slow' is equivalent to use_fast=False (needed by the MT-infini-3B model)
SFT=${SFT:-0} # sft-model=0 as default; set to 1 if the model is an SFT model
BLOCK_SIZE=${BLOCK_SIZE:-16}
MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
# MML = None as default max-model-len
# MNBT = None as default max-num-batched-tokens
# BLOCK_SIZE = 16 as default. choices=[8, 16, 32, 128].export BLOCK_SIZE=16
# MAX_NUM_SEQS = 256 as default. maximum number of sequences per iteration. export MAX_NUM_SEQS=16
# QUANTIZATION choices: ['awq', 'gptq', 'squeezellm', None]. Method used to quantize the weights. If None, we first check the `quantization_config` attribute in the model config file; if that is also None, we assume the model weights are not quantized and use `dtype` to determine the data type of the weights.
# FIRST_TOKEN_METRICS first token metrics seconds. export FIRST_TOKEN_METRICS=0.01,0.025,0.05,0.075,0.1,0.15,0.2,0.3,0.4,0.5,0.75,1.0,2.5
# PRE_TOKEN_METRICS pre token metrics seconds. export PRE_TOKEN_METRICS=0.001,0.005,0.01,0.02,0.04,0.06,0.08,0.1,0.25,0.5,0.75,1.0,2.5,5.0,7.5,10.0
# PREFIXCACHING export PREFIXCACHING=True
# CHUNKEDPREFILL export CHUNKEDPREFILL=True
# ENFORCEEAGER export ENFORCEEAGER=True
# ENABLE_LORA export ENABLE_LORA=True
# check envs
LOG=${LOG:-1} # LOG=1 as default
DEBUG=${DEBUG:-0} # DEBUG=0 as default
# enable infini log while debugging
if [ $DEBUG -eq 1 ]; then
echo "Debug Mode is ON"
exec 1> >(tee -a debug.log) 2>&1
fi
if [ -z "$MODEL" ] || [ -z "$TP" ]; then
echo "Error: MODEL and TP must have values."
exit 1
fi
echo "Load model (${MODEL}) with tp (${TP}) ..."
# build lora options
if [ "${ENABLE_LORA}" = "True" ]; then
ENABLE_LORA_OPTION="--enable-lora "
fi
if [ -n "${LORA_MODELS}" ]; then
LORA_OPTIONS="--enable-lora --lora-modules "
IFS=',' read -ra LORA_MODELS <<< "${LORA_MODELS}"
for LORA_MODEL in "${LORA_MODELS[@]}"; do
LORA_OPTIONS+="${LORA_MODEL}=/app/model/${LORA_MODEL} "
done
LORA_OPTIONS+="--max-lora-rank ${MAXLR}"
fi
# first token metrics seconds
if [ -n "${FIRST_TOKEN_METRICS}" ]; then
FIRST_TOKEN_METRICS_SECONDS="--first-token-metrics-seconds "
IFS=',' read -ra FIRST_TOKEN_METRICS <<< "${FIRST_TOKEN_METRICS}"
for FIRST_TOKEN in "${FIRST_TOKEN_METRICS[@]}"; do
FIRST_TOKEN_METRICS_SECONDS+="${FIRST_TOKEN} "
done
fi
# pre token metrics seconds
if [ -n "${PRE_TOKEN_METRICS}" ]; then
PRE_TOKEN_METRICS_SECONDS="--pre-token-metrics-seconds "
IFS=',' read -ra PRE_TOKEN_METRICS <<< "${PRE_TOKEN_METRICS}"
for PRE_TOKEN in "${PRE_TOKEN_METRICS[@]}"; do
PRE_TOKEN_METRICS_SECONDS+="${PRE_TOKEN} "
done
fi
if [ -n "${MML}" ]; then
MAX_MODEL_LEN="--max-model-len "
MAX_MODEL_LEN+="${MML}"
fi
if [ -n "${QUANTIZATION}" ]; then
QUANTIZATION_OPTION="--quantization "
QUANTIZATION_OPTION+="${QUANTIZATION}"
fi
if [ -n "${MNBT}" ]; then
MAXNUMBATCHEDTOKENS="--max-num-batched-tokens "
MAXNUMBATCHEDTOKENS+="${MNBT}"
fi
if [ "${PREFIXCACHING}" = "True" ]; then
PREFIXCACHING_OPTION="--enable-prefix-caching"
fi
if [ "${CHUNKEDPREFILL}" = "True" ]; then
CHUNKEDPREFILL_OPTION="--enable-chunked-prefill True"
fi
if [ "${ENFORCEEAGER}" = "True" ]; then
ENFORCEEAGER_OPTION="--enforce-eager"
fi
if [ "${TP}" -ge "1" ]; then
export VLLM_WORKER_MULTIPROC_METHOD=spawn
fi
_term() {
echo "entrypoint caught SIGTERM signal!"
kill -TERM "$vllm_worker_pid" 2>/dev/null
wait $vllm_worker_pid
echo "send term signal finish"
}
trap _term SIGTERM
if [ "${LOG}" = "1" ]; then
LOG_PRINT="Running with log ..."
METRICS_LOG="metrics.log"
CONTROLLER_LOG="controller.log"
API_SERVER_LOG="api_server.log"
VLLM_WORKER_LOG="vllm_worker.log"
HEALTH_LOG="health.log"
else
LOG_PRINT="Running without log ..."
METRICS_LOG="/dev/null"
CONTROLLER_LOG="/dev/null"
API_SERVER_LOG="/dev/null"
VLLM_WORKER_LOG="/dev/null"
HEALTH_LOG="/dev/null"
fi
echo ${LOG_PRINT}
echo "Running metrics (1/5) ... (see /app/${METRICS_LOG})"
nohup python3 -m fastchat.serve.metrics \
--host 0.0.0.0 --port 20000 > ${METRICS_LOG} 2>&1 &
echo "Running controller (2/5) ... (see /app/${CONTROLLER_LOG})"
nohup python3 -m fastchat.serve.controller \
--host 0.0.0.0 > ${CONTROLLER_LOG} 2>&1 &
echo "Running api_server (3/5) ... (see /app/${API_SERVER_LOG})"
nohup python3 -m fastchat.serve.openai_api_server \
--host 0.0.0.0 --port 8000 > ${API_SERVER_LOG} 2>&1 &
echo "Running health_server (4/5) ... (see /app/${HEALTH_LOG})"
nohup python3 -m fastchat.serve.health_server \
--host 0.0.0.0 --port 9000 > ${HEALTH_LOG} 2>&1 &
echo "Running vllm_worker (5/5) ... (see /app/vllm_worker.log)"
python3 -m fastchat.serve.vllm_worker \
--host 0.0.0.0 \
--model-path /app/model/${MODEL} \
--trust-remote-code \
--tensor-parallel-size ${TP} \
--gpu-memory-utilization ${GMU} \
--dtype auto \
--tokenizer-mode ${TOKENMODE} \
--sft-model ${SFT} \
--block-size ${BLOCK_SIZE} \
--max-num-seqs ${MAX_NUM_SEQS} \
--debug-enable ${DEBUG} \
${FIRST_TOKEN_METRICS_SECONDS} \
${PRE_TOKEN_METRICS_SECONDS} \
${QUANTIZATION_OPTION} \
${MAX_MODEL_LEN} \
${PREFIXCACHING_OPTION} \
${CHUNKEDPREFILL_OPTION} \
${ENFORCEEAGER_OPTION} \
${ENABLE_LORA_OPTION} \
${LORA_OPTIONS} 2>&1 | tee ${VLLM_WORKER_LOG} &
vllm_worker_pid=$(pgrep -f "python3 -m fastchat.serve.vllm_worker")
echo "vllm worker pid:$vllm_worker_pid"
wait $vllm_worker_pid
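If you use the prebuilt /app/entrypoint as the startup command instead of writing your own script, everything is configured through environment variables. The values below are illustrative only; MODEL and TP are required by the script, and the others fall back to the defaults shown above:
bash
# Illustrative configuration for the prebuilt entrypoint (example values only)
export MODEL="my-model-dir"   # the script resolves this to /app/model/${MODEL}
export TP=2                   # tensor-parallel size, required
export GMU=0.90               # gpu-memory-utilization, optional (default 0.90)
export MML=8192               # max-model-len, optional (unset keeps the vLLM default)

exec /app/entrypoint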
References
- For a detailed introduction to FastChat, see FastChat