
OpenStack in Practice: 2023 Edition


Physical Infrastructure Layer

Base Hardware

Server hardware must be chosen for high availability and reliability
Physical switch stacking (IRF) to remove the switch as a single point of failure
Switch port aggregation (LACP) for mutual redundancy and better throughput
Multi-NIC bonding on the servers (bond mode=2/4/6) for redundancy and better throughput
Compute nodes should use identical CPUs so that virtual machines can be migrated smoothly
Compute node memory should also be identical so that VM scheduling stays balanced; avoid overselling memory where possible
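As a reference for the bonding point above, a minimal sketch of an LACP (mode 4) bond in the legacy network-scripts style used later in this guide; the interface names and address are placeholders:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=4 miimon=100 xmit_hash_policy=layer3+4"
BOOTPROTO=static
IPADDR=100.100.62.3
PREFIX=24
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-ens1f0  (one file per slave NIC)
DEVICE=ens1f0
MASTER=bond0
SLAVE=yes
ONBOOT=yes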

System Installation and Requirements

Hardware requirements:

Usage model: dedicated, bare metal
 
Server: liquid-cooled AI appliance
CPU: 2 * Intel Xeon Platinum 8352V (3rd Gen Intel® Xeon® Scalable), 36 cores with Hyper-Threading, up to 3.50 GHz
Memory: 16 * 64GB Samsung DDR4 ECC, 1024GB total
System disk: 1 * U.2 NVMe SSD, Samsung PM983 Gen3, 1.92TB
Data disks: 3 * Samsung PM9A3 NVMe® U.2 7.68TB PCIe 4.0 SSD
NIC: Mellanox ConnectX-5 25Gb, single port, with optical module
Power: CRPS 2400W hot-swap PSU modules (2+2 redundancy)
GPU: 8 * RTX 4090 24GB
FP16 performance: 8 * 165.2 TFLOPS
FP32 performance: 8 * 82.58 TFLOPS
 
At the egress, deploy two 10GbE firewalls to harden network security and keep the uplinks stable.

Operating system:

Network configuration:

  * Each server is cabled with two network ports, 10GbE or faster;
  * one public-facing port with a public IP assigned directly;
  * one internal port, with all internal ports placed in the same VLAN.

System partitioning and minimal installation:

/     root partition    40G
swap  swap partition    16G
No LVM, no RAID; leave the remaining space unpartitioned, we handle it ourselves later (for example, give the leftover space to /var/log).
# Isolate the remaining cores (the host keeps the first few)
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=4,5,6,7,8,9,10,11,12"
 
# Recommended: reserve the first few physical CPUs for the host
vcpu_pin_set = 4-31
 
# Physical CPU overcommit ratio; default is 16. A hyper-thread counts as one physical CPU
cpu_allocation_ratio = 8
 
# Allow resize on the same host, which is fastest
allow_resize_to_same_host = true
 
### Memory configuration ###  
# Disable KVM memory sharing (KSM). pages_shared/pages_sharing are read-only counters;
# the actual switch is the "run" knob:
echo 0 > /sys/kernel/mm/ksm/run
 
# Enable transparent huge pages (and disable defrag)
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
 
# Memory overcommit ratio; default is 1.5. Overcommit is not recommended in production
ram_allocation_ratio = 1 
 
# Memory reserved for the host; VMs cannot use it, keeping the system itself healthy
reserved_host_memory_mb = 10240

OpenStack Cloud Platform Layer

Distributed OpenStack deployment and OpenStack HA

Network nodes

# The DHCP scheduler will start X DHCP agents for the same network
dhcp_agents_per_network = X 
 
# All routers use HA mode by default
l3_ha = True
 
# Set the max/min according to the number of network nodes
max_l3_agents_per_router = 2
min_l3_agents_per_router = 2
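To verify that a router actually has two HA instances scheduled, one per network node, you can list the L3 agents hosting it (a quick check, assuming a router named router1 already exists):

# with l3_ha=True there should be one "active" and one "standby" entry
openstack network agent list --router router1 --long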

Compute nodes

Storage nodes

Logical diagram: how Cinder, Glance and Nova access the Ceph cluster

The volumes pool is persistent storage, vms is the ephemeral backend for instances, and images holds the Glance images.
  • The Ceph public network and cluster network should each run on their own, separate links.
  • Bond multiple NICs on the Ceph storage nodes.
  • Build a smaller but fast cache pool (cache tier) on SSDs.
  • Build a large but slower storage pool (storage tier) on SATA disks.
  • In production, create 4 separate pools, one for each of the 4 OpenStack storage services.

Multi-Region, Multi-AZ, Multi-HA

Multi-region deployment

Simply put, these three go from the widest scope to the narrowest, each containing the next:
  • Region
  • Availability Zone
  • Host Aggregate

A geographic Region contains multiple availability zones (AZs),
and the compute nodes inside one AZ can in turn be logically grouped by some rule (a host aggregate).
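In practice an AZ is just a host aggregate with a zone name attached; a quick sketch (the names az-gpu, agg-gpu and the host are placeholders):

openstack aggregate create --zone az-gpu agg-gpu
openstack aggregate add host agg-gpu gpu-01.service.iqn
openstack availability zone list --compute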

OpenStack Networking Concepts

What ML2 and L3 do

In OpenStack, ML2 is the Neutron core plugin that manages how virtual networks are mapped onto the physical network. ML2 supports multiple type and mechanism drivers, including VLAN, GRE and VXLAN, which cover different deployment scenarios.

L3 routing is a network service that lets virtual machines talk to each other across subnets and to external networks. It is provided by (virtual) routers, which forward packets from one subnet to another.

In OpenStack the L3 routing service is provided by Neutron. When you use the ML2 plugin, the L3 service is what connects virtual machines to external networks; it is configured and managed through the Neutron API and CLI.

In short, ML2 manages the connection between virtual and physical networks, while L3 routing provides communication between VMs and with external networks. Together they are what you use to build and manage virtual networks in OpenStack.
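As a concrete reference, a minimal ml2_conf.ini for the VLAN plus Linux bridge setup used later in this guide might look like the following sketch (the physical network name and VLAN range are assumptions):

[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vlan
mechanism_drivers = linuxbridge
extension_drivers = port_security

[ml2_type_vlan]
network_vlan_ranges = provider:1000:2000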

Neutron: OVN or Linux bridge?

In OpenStack, Neutron can use several backends to manage virtual networks. OVN and Linux bridge are two common choices, and both can be used to build and manage virtual networks.

OVN is an OVS (Open vSwitch) based backend that supports a wide range of topologies and services: virtual machines, containers, L2/L3 networks, security groups and so on. OVN provides distributed logical routers and logical switches, spreading the virtual-network workload across the compute nodes, which improves network performance and reliability.

Linux bridge is a simpler backend that uses the kernel's bridging to implement virtual networks. It suits small OpenStack deployments and provides the basic features: VLAN isolation, DHCP, routing and security groups.

When choosing between OVN and Linux bridge, consider the following:

Performance: if you need high-performance virtual networking, pick OVN; its distributed logical routing and switching improve performance and reliability.

Features: if you need more complex topologies and services (VMs, containers, L2/L3 networks, security groups, etc.), pick OVN.

Operational complexity: if you only need simple topologies and services, Linux bridge is easier to run, since it relies on the kernel bridge.

Community support and documentation: also weigh the quality and amount of community support and documentation, so you can get help when you need it.

To sum up, the choice depends on your needs: pick OVN for more advanced features and performance, and Linux bridge for simple topologies and services.
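Whichever backend a deployment uses, you can confirm it from the ML2 settings and the registered agents:

grep -E 'mechanism_drivers|type_drivers' /etc/neutron/plugins/ml2/ml2_conf.ini
openstack network agent list -c "Agent Type" -c Host -c Alive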

Choosing between VLAN and VXLAN

The difference between VLAN and VXLAN: VLAN is a flat layer-2 technology that needs no virtual-routing translation, so it performs somewhat better than VXLAN or GRE, supports 4094 networks, and is simple to architect and operate. VXLAN is an overlay tunneling technology that wraps layer-2 frames inside layer-3 UDP packets; it needs routing plus encapsulation and decapsulation, so it performs somewhat worse than VLAN and its architecture is more complex, but in return it supports more than 16 million networks and scales well.

When VXLAN and VLAN networks have to talk to each other, i.e. tenant private networks routed to floating-IP external networks, OpenStack's traditional centralized routing sends both north-south traffic and cross-network east-west traffic through the network nodes. As the number of compute nodes grows, the network nodes quickly become the bottleneck of the whole system, which is why the Distributed Virtual Router (DVR) was introduced. With DVR, routing is pushed down to the compute nodes: north-south traffic and cross-subnet east-west traffic are routed by a virtual router on the compute node hosting the VM, improving both stability and performance.
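Enabling DVR is mostly a Neutron configuration change; a sketch follows (note that DVR requires the Open vSwitch agent rather than Linux bridge):

# /etc/neutron/neutron.conf on the controllers: new routers are distributed by default
[DEFAULT]
router_distributed = true

# /etc/neutron/l3_agent.ini
# on compute nodes:  agent_mode = dvr
# on network nodes:  agent_mode = dvr_snat   (centralized SNAT still lives there)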

The difference between Keystone's two ports

In Keystone v2 the two ports served different APIs: one the Identity Admin API, the other the public Identity API.

With Keystone v3 there is essentially no difference between the two ports; the API is the same.

openstack endpoint create --region RegionOne identity admin http://controller:35357/v3

Batch-editing endpoints

openstack endpoint list|awk '/db.service/{gsub("db.service","cc.service",$(NF-1)); print "openstack endpoint set --url",$(NF-1),$2}'

Ironic bare metal vs. Nova virtual machines

Ironic manages bare metal, roughly comparable to an enterprise IT asset system, while Nova is what actually provides the bare-metal service, i.e. hands physical servers out to users. At the implementation level, Ironic is one of Nova's compute drivers, sitting alongside the libvirt driver; each bare-metal node maps to one Nova hypervisor instance.

Creating a virtual machine works like this: the compute node downloads the image from Glance into a local directory, defines the libvirt XML for the VM, maps the image file as a virtual block device, and calls libvirt to start the VM. Bare-metal provisioning is more involved. The conductor node runs a TFTP service holding the operating system's bootloader, while Neutron provides DHCP and hands out the TFTP server address. Ironic first downloads the deploy initramfs image from Glance into the conductor's TFTP path, then powers the server on over IPMI and sets it to PXE boot (PXE must be enabled on the NIC beforehand). Once the bare-metal server has PXE-booted, it automatically fetches the bootloader from TFTP; the bootloader tells the server to load the deploy kernel and deploy initramfs and boot the deploy OS. The deploy OS is only temporary and is never installed to disk; think of it as a slimmed-down in-memory system that, besides the necessary drivers and tools, also ships ironic-python-agent.

How VMs fetch metadata via cloud-init

Neutron can hand out DHCP addresses, yet cloud-init inside the VM may still fail to reach the metadata service and pick up its internal IP.
Check the libvirtd logs, and if necessary also restart the main Neutron services.
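From inside the guest you can probe the metadata path directly; cloud-init uses the same well-known link-local endpoint:

# EC2-style metadata
curl http://169.254.169.254/latest/meta-data/local-ipv4
# OpenStack-native metadata used by cloud-init
curl http://169.254.169.254/openstack/latest/meta_data.json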

Advanced OpenStack Configuration

VT-d: enable Advanced ⇒ System Agent Configuration ⇒ Intel VT for Directed I/O

VT-x: enable Advanced ⇒ CPU Configuration ⇒ Intel Virtualization Technology

VT-d must be enabled for I/O virtualization. On AMD platforms the equivalent is the IOMMU; on some OEM boards the option is labelled SR-IOV.

Configuring GPU passthrough for VMs

✅ vfio-pci: probe failed with error -22

  1. VT-d is not enabled on the motherboard
  2. The device's group must show up under /sys/kernel/iommu_groups/* and be owned by vfio before passthrough works
  3. The GPU driver was not blacklisted
  4. The vfio drivers were not added under dracut.conf.d
  5. The boot image was not regenerated, so the vfio-pci driver never loads
  6. Special case: did you boot from a USB stick and end up on an old kernel?

✅ vfio X: group X is not viable

lspci -nnv -s 02:00|grep -i "Kernel driver"

Every device in the group must be bound to the vfio-pci driver, especially the audio and USB functions that come with the GPU. Switch the driver with:

driverctl set-override 0000:0c:00.4 vfio-pci

Setting up the IOMMU

intel_iommu=on iommu=pt

Isolating PCIe devices with the vfio driver

Isolating the GPU with VFIO

vfio-pci.ids=10de:1b81,10de:10f0 vfio_iommu_type1.allow_unsafe_interrupts=1 modprobe.blacklist=nvidiafb,nouveau

Mapping the PCIe ID to the card inside a VM instance

nvidia-smi -a |grep -i 'bus id'
 
lspci | grep -i vga | grep -i nvidia
 
sed -r -n '/hostdev/, /\/hostdev/{/address/p}' /etc/libvirt/qemu/instance-00000e37.xml

GPU RTX4090 PCI-E Slots

Getting the hardware IDs

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070] [10de:1b81] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)

# check the GPU link bandwidth
lspci -vv | awk '/VGA.+NVIDIA/{print $1}' | xargs -i sh -c 'echo {}; lspci -s {} -vvv | grep -i LnkSta:'
 
lspci -nn | sed -r -n '/VGA.*NVIDIA/s@(.*) VGA.*\[(.*)\].*\[(.*):(.*)\].*@\1 \3 \4 #\2@gp'
02:00.0 10de 1e89 #GeForce RTX 2060
Some GPUs include a USB controller; disable the default driver and use vfio-pci instead
GRUB_CMDLINE_LINUX="modprobe.blacklist=xhci_hcd"

Regenerating the initramfs with dracut

echo force_drivers+=\" vfio vfio_iommu_type1 vfio_pci vfio_virqfd\" > /etc/dracut.conf.d/10-vfio.conf
#dracut --force --add-drivers "ses hpsa megaraid megaraid_sas mpt3sas mpt2sas aacraid smartpqi" -f --kver
 
dracut -f --kver `uname -r`
#dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
Note that we use force_drivers rather than the usual add_drivers option; this guarantees the drivers are loaded early via modprobe (Dracut: early kernel module loading).

So the kernel module options also need to be set when the image is built:

GRUB_CMDLINE_LINUX="ixgbe.allow_unsupported_sfp=1"

OpenStack API usage examples

Creating flavors with GPU properties

openstack flavor create c2m4d10g1  \
  --vcpus 2 --ram 4096 --disk 10 \
  --property "pci_passthrough:alias"="geforce_rtx_4090:1"
 
openstack flavor create c100m800d500g8-rtx_4090 \
  --vcpus 100 --ram 819200 --disk 500 \
  --property "pci_passthrough:alias"="geforce_rtx_4090:8, audio:1"
The trailing :1 means the flavor attaches at most one GPU to the instance; :8 means up to eight GPUs can be attached.
[pci]
passthrough_whitelist = [{"vendor_id":"10de", "product_id":"22ba"}, {"vendor_id":"10de", "product_id":"2684"}]
alias = {"name": "geforce_4090","vendor_id":"10de", "product_id":"2684", "device_type":"type-PCI"}
alias = {"name": "audio_4090","vendor_id":"10de", "product_id":"22ba", "device_type":"type-PCI"}

Creating a VM instance from the command line

#!/bin/sh
readonly domain="zhy2"
readonly uuid=$(uuidgen | cut -d- -f1)
 
#  a hostname must be specified
host=$1
[ -z $host ] && echo "$0 cpu-xx" && exit 0
host=$(echo $host|awk -F. '{print $1}')
 
cat > clouduser.txt<<EOF
#cloud-config
chpasswd:
  list: |
    root:upyunxxxx
  expire: False
 
users:
  - name: root
    ssh_authorized_keys:
      - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC+DMdQ+IKj1TWVjp0qe+R9cQ5HhSIinr+oJx8FoInVWsNParq4WxqXdRk4jJH/d3gFFXBgQwh2Ie6gF23KjS0O6nfNFhaN5AzcJaXPF+H9RIs6fsRmmK0Wdr6sq9W8//WTdD69MIS04u582XyUzayRy0WjCj/HoeHZFRHu8FtlcUfVvqosC0teUvEY9qRK0uehWU84lxONeJYJPq5JMU1NwJ/lnJrMkqCDBlTLcpCeNfcha4atxXvCk8QjYvRDDUc6dQlnnvWRoXF6xtOpWbcK/k1wyzQVRWP011KfBGdQ/gm8x4R9tfEwZMHUdtsgNdSot2HXIQjFqR2MtTWVkTQ1Tc8qbmTVG5CFZRT2YNPtlAsi/YQ4/9FmvBAespRBZjKI0rQX+zk8Z1oAWYvZQIxa7htqTOe0dcwXxu226ji5Ku2gD1BvCtSCGnfv6c/GpY6umo0//Vfd8IwQc54sbuAK1ms6rA7RcN1yLksfxlZry7wCuZJichNnJc+WbFxQtRGTYPi+QWG2iuj40Mh2+A323dK2G0B9bonvbmQjhPCfNOyTnwh8YfS+9PsfgRXc4hFDlVu+VaecnQXk+IxZMU4MITjF8PJa3F4/Vptaon/cZpdjSUXr7rylRRleVBOEWCtA2sBQxrjFC7z16nc6YqTPRLzmOQRuj6CGAAQqGsKriQ== shaohaiyang@gmail.com-yoga
EOF
 
#image="hashcat_gpu_hold"
#image="debian12_nocloud"
image="ubuntu2204"
 
gpu_flavor="c2m4d20"
 
gpu_zone="nova"                 # default: nova
#gpu_zone="GPU-4090"            # default: nova
 
gpu_host=$(echo $host|awk -F. '{print $1".service."}')"$domain" # default: *
#gpu_test_image=$(openstack  image list | awk -F'|' '/'"$image"'/{print $2}')
 
# check whether this instance already exists; if not, create it
[ ! -s /tmp/.server_list ] && openstack server list > /tmp/.server_list
grep "$domain-$host" /tmp/.server_list
if [ $? != 0 ];then
        openstack server create $domain-$host-$uuid --flavor $gpu_flavor \
        --image $image --availability-zone $gpu_zone:$gpu_host \
        --network provider  --user-data clouduser.txt
 
        rm -rf clouduser.txt
fi
 
# if there is no cached volume list, dump it first
if [ ! -s /tmp/.volume_list ];then
        openstack volume list > /tmp/.volume_list
fi
 
openstack server list > /tmp/.server_list
 
instance_id=$(grep "$domain-$host" /tmp/.server_list | awk '{print $2}')
echo $instance_id
 
openstack server list | awk '/ACTIVE/{split($8,a,"=");print $4,"ansible_ssh_host="a[2],"ansible_ssh_port=22"}' > list_vms
 
for volume_id in $(awk '/'"$host"'-vdisk.*available/{print $2}' /tmp/.volume_list);do
        echo " ----------->>  mount $volume_id << -------------"
        openstack server add volume $instance_id $volume_id
done
[Unit]
After=network-online.target
 
[Service]
Environment="SELINUX=disabled"
ExecStart=/usr/local/bin/hashcat -m 1400 -a 3 1EE9146B48AE09BF64A572BF48D727B01D024EEEA614272E90DC9D480B59BC3E --force
 
[Install]
WantedBy=multi-user.target

Disabling compute node services from the command line

openstack compute service list
openstack network agent list
openstack hypervisor list
 
openstack compute service set --disable ops-yoga-c2 nova-compute
 
# the VMs above must be live- or cold-migrated away before the compute node can be deleted
openstack compute service delete 1e0b442c-d138-420....
 
# Example
# curl -g -i -X PUT http://{service_host_ip}:8774/v2.1/{tenant_id}/os-services/force-down -H "Content-Type: application/json" -H "Accept: application/json" -H "X-OpenStack-Nova-API-Version: 2.11" -H "X-Auth-Token: {token}" -d '{"binary": "nova-compute", "host": "compute1", "forced_down": true}'

Rediscovering and registering compute nodes

# run on the controller node
su -s /bin/sh -c "nova-manage cell_v2 discover_hosts --verbose" nova
Also restart nova-compute and the neutron-*-agent on the compute nodes for this to take effect.
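On the RDO-based install used in this guide, that typically means:

systemctl restart openstack-nova-compute neutron-linuxbridge-agent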

OpenStack network configuration and forwarding

#!/bin/bash
 
# -h
if [ -z "$1" ]; then
  echo -e "show: show floating ip list \nport: show vm port list\nfip [fip]: show fip port forwarding list\npfa [internal ip] [vm port] [vm ip port] [external port] [fip]\npfd [fip] [fip port forwaring id]"
  exit 1
fi
 
 
case "$1" in
  "show")
    openstack floating ip list
    ;;
  "port")
    openstack port  list
    ;;
  "fip")
    openstack floating ip port forwarding list $2
    ;;
  "pfa")
    openstack floating ip port forwarding create --internal-ip-address $2 --port $3  --protocol tcp --internal-protocol-port $4 --external-protocol-port $5 $6
    ;;
  "pfd")
    openstack floating ip port forwarding delete $2 $3
    ;;
  *)
    echo "unknow command"
    ;;
esac

OpenStack database backup

✅ Back up the databases and keep N days of copies

#!/bin/bash
daytime=$(date +%Y%m%d%H)
keepday=30
dirname=/var/log/yoga_sqlbackup_$daytime
dbs="mysql keystone nova nova_api nova_cell0 placement cinder neutron"
passwd="upyunxxxx"
 
mkdir -p $dirname
for db in $dbs; do
  mysqldump --skip-extended-insert -uroot -p$passwd $db > $dirname/$db.sql
done
tar zcvf $dirname.tgz $dirname
rm -rf $dirname
 
find /var/log/ -name yoga_sqlbackup*.tgz -a -ctime +$keepday -exec rm -rf {} \;

The nova_cell0 database has the same schema as nova. Its job: when scheduling an instance fails, the instance does not belong to any cell, so its record is stored in nova_cell0; in other words, cell0 is where scheduling-failure data is kept for centralized handling.
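You can confirm that cell0 and the regular cell are both registered (run on a controller):

su -s /bin/sh -c "nova-manage cell_v2 list_cells --verbose" nova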

✅ Get the IPs of all virtual machines

mysql -uroot -p$passwd nova -e "select network_info from instance_info_caches where network_info!='[]'" > .instance.sql
while read line;do
  echo $line | jq '.[0].network.subnets[0].ips[0].address'
done < .instance.sql
 
exit
for i in $(grep geforce-rtx nova.sql | awk -F, '{print $6}');do
  echo $i
  mysql -uroot -p$passwd nova -e "delete from instance_extra where instance_uuid like $i;"
done

Smooth live migration of running VMs

✅ libvirt configuration on the compute nodes

vim /etc/libvirt/libvirtd.conf
listen_tls = 0
listen_tcp = 1
unix_sock_group = "root"
unix_sock_rw_perms = "0777"
auth_unix_ro = "none"
auth_unix_rw = "none"
log_filters="2:qemu_monitor_json 2:qemu_driver"
log_outputs="2:file:/var/log/libvirt/libvirtd.log"
tcp_port = "16509"
listen_addr = "0.0.0.0"
auth_tcp = "none"
vim /etc/sysconfig/libvirtd
LIBVIRTD_ARGS="--listen"
pre-creation of storage targets for incremental storage migration is not supported

The causes are:

  1. The nova instances directory must be identical on every compute node!
  2. Every compute node and controller must have the same /etc/hosts

Restart the libvirtd service and run the migration again.
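Once libvirtd on both sides accepts TCP connections, the migration itself can be driven from a controller, for example with the legacy nova client used elsewhere in this guide (the instance ID and target host are placeholders):

nova live-migration 06d9d410-xxxx gpu-02.service.iqn
openstack server show 06d9d410-xxxx -c OS-EXT-SRV-ATTR:host -c status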

🏆 OpenStack Automation in Practice

A stock CentOS 8 system needs to be initialized first
#!/bin/sh
repo="yoga_repo.tgz"
 
grep 114.114.114 /etc/resolv.conf
[ $? == 0 ] || ( echo "nameserver 114.114.114.114" > /etc/resolv.conf)
 
[ -s /root/$repo ] || curl devops.upyun.com/$repo -o $repo
 
grep aliyun /etc/yum.repos.d/ -r -l
[ $? == 0 ] || ( rm -rf /etc/yum.repos.d/* ; tar zxvf /root/$repo -C /)
 
dnf makecache --refresh
dnf install -y epel-release
dnf install -y python3 sshpass wget nc lldpd vim-enhanced rsyslog supervisor pciutils chrony tar screen bind-utils --enablerepo=epel
 
cat >> /root/.ssh/authorized_keys<<EOF
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3K0pbD055f0dU/z2tQTG9RfnExjC9eKoBemxZCkQaiCHfPC/6XUEKReykZEzRLNEhI/0cOUshZjIHC9zjxHGfnJBPktjUvL4QvsA4qETk3T5uuy6PJx6P1ta4XaZy1mtEKh+Anqx8Q/w4ClrYiaUFEc62Gef+f4JUCCwlRqCvFcc5IoxNngFfUh2Wxp7hINyljVFNRcGBHs4cNSsxmE37ye+usJT98iafFi1A9SOwzUfhuB4zemNxTWoZpKNRf75gyHU+fOJ9FOuGveIPz7aoh2JLjltD1WrFnEZXsKnoftaiGaovrvO0ks++p73Q9y5o/9/9o59vcPcy7w16XgLT libo.huang@localhost.localdomain
EOF
 
dnf update -y

Setting up the jump host

Install Python and speed up pip

dnf install -y python3 python3-pip python3-libvirt sshpass libvirt --enablerepo=epel
 
cat > /etc/pip.conf <<EOF
[global]
target = /usr/lib/python3.6/site-packages
index-url = https://mirrors.aliyun.com/pypi/simple
trusted-host = aliyun.com
timeout = 120
EOF
 
python3 -m pip install -U pip
pip3 install -U pyOpenSSL pyyaml
pip3 install ansible -t /usr/lib/python3.6/site-packages/
On CentOS 8, python3 defaults to Python 3.6.

Libvirt must be upgraded to a version above 7.5 but below 8.0 (8.0 requires a CA); otherwise volume attach has bugs.

dnf list --installed | awk '/libvirt/{print $1}'|xargs -i dnf remove -y {}

A minimal ansible configuration file

mkdir -p /etc/ansible; cat >> /etc/ansible/ansible.cfg <<EOF
[defaults]
inventory      = /root/list_hosts
remote_tmp     = $HOME/.ansible/tmp
pattern        = *
forks          = 5
poll_interval  = 15
sudo_user      = root
transport      = smart
remote_port    = 22
module_lang    = C
gathering = implicit
host_key_checking = False
sudo_exe = sudo
timeout = 30
module_name = shell
deprecation_warnings = False
fact_caching = memory
 
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no
 
[accelerate]
accelerate_port = 5099
accelerate_timeout = 30
accelerate_connect_timeout = 5.0
accelerate_daemon_timeout = 30
EOF

The ansible host inventory

[control]
ccm-01.service.iqn ansible_host=194.101.0.1
ccm-02.service.iqn ansible_host=194.101.0.2
ccm-03.service.iqn ansible_host=194.101.0.3
 
[network]
ccn-01.service.iqn ansible_host=194.101.0.4
ccn-02.service.iqn ansible_host=194.101.0.5
 
[compute]

Operating system preparation

Enable persistence mode on the RTX 4090 GPUs

sudo nvidia-smi -pm 1

To enable persistence at boot:

cat > /lib/systemd/system/nvidia-persistenced.service <<EOF
[Unit]
Description=NVIDIA Persistence Daemon
After=syslog.target
 
[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/*
TimeoutSec=300
 
[Install]
WantedBy=multi-user.target
 
EOF
 
systemctl enable --now nvidia-persistenced

Enabling P2P on RTX 4090 cards 🚀

In a traditional multi-GPU system, data moves over the PCIe bus and is relayed through the CPU. This design imposes two limits:

first, PCIe 4.0 x16 has a theoretical bandwidth of only 31.5GB/s, about 3% of the RTX 4090's memory bandwidth; second, the extra latency of involving the CPU noticeably increases communication overhead.

In typical models such as ResNet-152, parameter synchronization can take more than 40% of total training time. This architectural weakness is especially painful in workloads that exchange data frequently, such as multi-node feature passing in graph neural networks (GNNs).

P2P breaks out of the traditional memory hierarchy and lets GPUs access each other's memory directly over NVLink or a PCIe switch.

This peer-to-peer mode brings three advantages:

  1. It removes the CPU middleman, cutting end-to-end latency to the microsecond level, which matters for real-time inference;
  2. on data-center GPUs, third-generation NVLink (600GB/s bidirectional) gives a direct channel roughly 20x faster than PCIe; the consumer RTX 4090 has no NVLink, which is exactly why the patched open kernel modules below enable P2P over the PCIe BAR instead;
  3. it supports asymmetric access topologies, so every GPU in a 4-card system can establish a direct communication path.

Removing the old drivers

sudo apt-get --purge remove "*nvidia*"
sudo apt-get --purge remove "*cuda*" "*cudnn*" "*cublas*" "*cufft*" "*cufile*" "*curand*" "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*" "*libnccl*"
 
sudo apt install git cmake gcc g++

BIOS settings

  1. Enable Resizable BAR
  2. Disable Intel VT-d or the AMD IOMMU
  3. Disable PCIe ACS

SR-IOV && Re-Size BAR PCIe-ACS

With the RTX 3090/4090 introducing large-BAR support (on the 4090, BAR1 exposes up to 32GB of VRAM), and NVIDIA adding a BAR1 P2P mode on the H100, peer-to-peer transfers can go directly through the PCIe BAR.

Disabling the Nouveau driver

echo blacklist nouveau >  /etc/modprobe.d/blacklist-nouveau.conf
echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nouveau.conf
 
sudo update-initramfs -u
sudo reboot

Installing the driver (userspace only)

wget -c https://us.download.nvidia.com/XFree86/Linux-x86_64/565.57.01/NVIDIA-Linux-x86_64-565.57.01.run
 
# during installation, choose to skip the kernel modules (the P2P kernel modules are rebuilt later)
sudo sh ./NVIDIA-Linux-x86_64-565.57.01.run --no-kernel-modules

Building the patched kernel modules

Clone the open kernel modules matching the installed driver version from GitHub (for other versions, switch branches after cloning, or just download the ZIP).

git clone git@github.com:tinygrad/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
sudo ./install.sh

Testing the P2P feature

nvidia-smi topo -p2p p

nvidia p2p

Switching the package mirrors

# switch to the Aliyun mirror
sed -r -i -e 's|^mirrorlist=|#mirrorlist=|g' \
  -e '/^baseurl/s|(.*releasever)/(.*)|baseurl=https://mirrors.aliyun.com/centos-vault/8.5.2111/\2|g' \
    /etc/yum.repos.d/CentOS-*.repo
 
dnf -y install epel-release
dnf install -y wget nc lldpd vim-enhanced rsyslog supervisor pciutils chrony tar screen bind-utils --enablerepo=epel

Pre-configuring a qcow2 image with virt-customize

yum install libguestfs-tools
 
virt-customize -a centos8.qcow2 --root-password password:passw0rd \
--run-command "sed -i 's/^mirrorlist/#mirrorlist/g' /etc/yum.repos.d/*.repo" \
--run-command "sed -i 's|^#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|' /etc/yum.repos.d/*.repo" \
--install python3,python3-pip,epel-release,nc,wget,lldpd,vim-enhanced,rsyslog,chrony,tar,screen,bind-utils \
--delete /etc/yum.repos.d/epel-test*.repo \
--copy-in /etc/pip.conf:/etc/ \
--update --selinux-relabel

One-shot installation of the full OpenStack package set

#!/bin/sh
RDO_URL="https://repos.fedorapeople.org/repos/openstack/archived/openstack-yoga/rdo-release-yoga-1.el8.noarch.rpm"
 
POWERTOOLS=$(grep -i '\[powertools\]' /etc/yum.repos.d/*.repo | sed -r -n 's@.*\[(.*)\].*@\1@gp'|sort -u|head -1)
 
VERSION="yoga"
 
dnf config-manager --set-enabled $POWERTOOLS
#dnf --enablerepo=$POWERTOOLS --enablerepo=openstack-$VERSION install -y 
 
# install the RDO repository
# switch to the Aliyun mirror
  sed -r -i -e 's|^mirrorlist=|#mirrorlist=|g' \
    -e '/^baseurl/s|(.*releasever)/(.*)|baseurl=https://mirrors.aliyun.com/centos-vault/8.5.2111/\2|g' \
    /etc/yum.repos.d/CentOS-*.repo
  dnf install -y epel-release.noarch
  dnf install -y $RDO_URL
 
pip3 install osc-placement
 
for svc in httpd mod_ssl mariadb-server mariadb-server-galera rabbitmq-server memcache \
  driverctl tar wget nc nmap lvm2 lldpd rsyslog pciutils chrony screen vim-enhanced \
  bind-utils libvirt ceph-common ebtables supervisor bridge-utils ipset iperf3 htop \
  haproxy device-mapper targetcli qemu-img python3-libvirt python3-rbd sysstat \
  python3-keystone python3-mod_wsgi python3-openstackclient nmon ;do
    echo -e "${YELLOW_COL}-> Installing $svc ... ${NORMAL_COL}"
    dnf list installed | grep -iq $svc
    [ $? != 0 ] && dnf install -y $svc --enablerepo=epel
  systemctl disable --now $svc
done
 
for svc in keystone dashboard glance cinder placement-api neutron neutron-ml2 neutron-linuxbridge\
  nova-api nova-metadata-api nova-conductor nova-novncproxy nova-scheduler nova-compute ;do
    svc="openstack-$svc"
    echo -e "${YELLOW_COL}-> Installing $svc ... ${NORMAL_COL}"
    dnf list installed | grep -iq $svc 
    [ $? != 0 ] && dnf install -y $svc --enablerepo=epel
    systemctl disable --now $svc
done
 
yum update -y --nobest

Switching the network service

When deploying OpenStack, its networking service conflicts with NetworkManager; the two cannot work together, so switch to the legacy network service.

dnf install -y network-scripts --enablerepo=epel
 
# stop NetworkManager and disable it at boot
systemctl unmask NetworkManager
systemctl stop NetworkManager && systemctl disable NetworkManager
 
# start network and enable it at boot
systemctl start network && systemctl enable network

Manual time and network synchronization

[root@ccm-01 ~]# timedatectl 
               Local time: Fri 2024-01-26 18:14:39 CST
           Universal time: Fri 2024-01-26 10:14:39 UTC
                 RTC time: Fri 2024-01-26 10:14:39
                Time zone: Asia/Shanghai (CST, +0800)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no
 
 
 
chronyc sources -v
 
  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current best, '+' = combined, '-' = not combined,
| /             'x' = may be in error, '~' = too variable, '?' = unusable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^- tick.ntp.infomaniak.ch        1  10   377   462  -7859us[-8245us] +/-  115ms
^+ time.neu.edu.cn               1  10   177   132    -22ms[  -22ms] +/-   28ms
^* time.neu.edu.cn               1  10   177   360    -26ms[  -26ms] +/-   34ms
^- ntp6.flashdance.cx            2  10   377   431    -29ms[  -30ms] +/-   91ms
If timedatectl set-timezone Asia/Shanghai times out, e.g.:
Failed to set ntp: Failed to activate service 'org.freedesktop.timedate1': timed out (service_start_timeout=25000ms)
Failed to set local RTC: Connection timed out
dnf reinstall -y  mozjs60-60.9.0-4.el8.x86_64.rpm 

Enabling IP forwarding in the kernel

Before creating the bridge, enable IP routing at runtime by setting the kernel parameters with sysctl.

tee /etc/sysctl.d/iprouting.conf<<EOF
net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1
net.ipv4.ip_nonlocal_bind=1
EOF
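The two net.bridge keys only exist once the br_netfilter module is loaded, so load it first and then apply everything:

modprobe br_netfilter
echo br_netfilter > /etc/modules-load.d/br_netfilter.conf
sysctl --system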

Disk checks and logical volumes

mkdir -p /disk/nvme-disk
 
pvcreate -ff -y /dev/nvme0n1 /dev/nvme1n1 
vgcreate -y nvme-disk /dev/nvme0n1 /dev/nvme1n1 
 
lvcreate -n nova-volume -L 2000G -y nvme-disk
 
# not always needed; depends on whether the data disk should live on NVMe
#lvcreate -n cinder-volume --type raid0 --stripes 2 -l 100%free -I 128k nvme-disk
 
# verify the LVs were created
lvs -a -o +devices,segtype nvme-disk
 
# format the volume, add it to /etc/fstab and mount it at /disk/nvme-disk
mkfs.ext4 -L /disk/nvme-disk /dev/nvme-disk/nova-volume
 
sed -r -i '/nvme-disk/d' /etc/fstab
blkid | awk '/disk-nova/{print $3"\t/disk/nvme-disk\t\text4\tdefaults\t0 0"}' >> /etc/fstab
mount -a
Requirement for the /var/lib/nova partition that holds VM instances: an LVM logical volume, striped to improve performance.
# this raid0 layout made mkfs far too slow (possibly a bug), so we went back to the default
#lvcreate -n nova-volume --type raid0 --stripes 2 -L 2000G -I 128k nvme-disk

Kernel support for vfio

On the compute nodes, check whether the PCIe devices are actually set up for vfio:

ls -adl /sys/kernel/iommu_groups/*
 
lspci -nn| sed -r -n '/VGA.*NVIDIA/s@.*\[(.*)\].*\[.*@\1@gp'|tr ' ' '_'|sort -u
lspci -nn |awk '/NVIDIA/{split($1,a,".");print a[1]}'|sort -u
lspci -nnv -s $gpuid | grep -iE "$id|Kernel driver"
Reboot after rebuilding the kernel!!!

Kernel upgrade: How to Install or upgrade to Kernel 6.x on CentOS 8

Hostname and service discovery

Build automatic registration and discovery on top of consul and coredns, and check consul's health.

# list the members:
consul members -http-addr 127.0.0.1:8500
 
# check the state of each node
consul operator raft list-peers -http-addr=127.0.0.1:8500
 
# find the leader:
curl 127.0.0.1:8500/v1/status/leader
 
# consul hostname discovery, configured and tested
dig ccm-01.service.iqn @127.0.0.1 -p8600
 
# coredns name resolution, configured and tested
dig ccm-01.service.iqn

coredns configuration

.:5353 {
  # add this template plugin line to intercept AAAA queries and return empty (NODATA)
  template IN AAAA .
 
  bind 192.168.0.1 100.100.8.1
  acl {
       allow net 172.16.0.0/12 100.100.0.0/16
       block
  }
 
  errors
  log stdout
  health localhost:8080
  prometheus 127.0.0.1:9253
  cache {
        success 30000 300
        denial 1024 5
        prefetch 1 1m
  }
  loadbalance round_robin
 
  rewrite continue {
    ttl regex (.*) 30
  }
 
  rewrite name regex (.*).ngb1s3.dc.huixingyun.com ngb1i.dc.huixingyun.com
  rewrite name ngb1.dc.huixingyun.com ngb1i.dc.huixingyun.com
 
  forward service.ngb1 100.100.8.1:8600 100.100.8.2:8600
 
  forward . 119.29.29.29 180.76.76.76 114.114.114.114 {
      except service.ngb1
  }
}

Building the distributed database

Build the distributed Galera database and initialize it (initializing one node first and then letting the others replicate also works).

./easyStack_yoga.sh control_init
 
UPDATE user SET Password=PASSWORD("xxx") WHERE User="root";
grant all privileges on *.* to root@'100.100.%' identified by 'xxx' with grant option;
SHOW STATUS LIKE 'wsrep%';
SHOW GLOBAL STATUS LIKE '%aborted_connects%';

For a normal first start of the cluster, use: galera_new_cluster
Other versions may differ; check their own documentation.

MySQL + Galera cluster tuning

### shy_begin
default-storage-engine = innodb
innodb_file_per_table = on
back_log = 10240
max_connections = 10240
thread_cache_size = 10240
max_connect_errors = 10240
thread_pool_idle_timeout = 7200
connect_timeout = 7200
net_read_timeout = 7200
net_write_timeout = 7200
interactive_timeout = 7200
wait_timeout = 7200
host_cache_size = 0
thread_pool_size = 1024
query_cache_size = 512M
max_allowed_packet = 512M
collation-server = utf8_general_ci
character-set-server = utf8
bind-address = 10.33.66.1
port = 3308
### shy_end
 
# Enable wsrep
wsrep_on=1
wsrep_cluster_name="yoga_wsrep_db"
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
bind-address="10.33.66.1"
wsrep_node_address="10.33.66.1"
wsrep_cluster_address="gcomm://10.33.66.1,10.33.66.2,10.33.66.3"
wsrep_provider_options="gmcast.listen_addr=tcp://10.33.66.1:4567; gcs.fc_limit = 2048; gcs.fc_factor = 0.99; gcs.fc_master_slave = yes"
wsrep_retry_autocommit = 50
wsrep_slave_threads=300
wsrep_max_ws_rows = 0
wsrep_max_ws_size = 2147483647
  vim /var/lib/mysql/grastate.dat
 
  # GALERA saved state
  version: 2.1
  uuid: <your cluster id>
  seqno: -1
  safe_to_bootstrap: 0
 
Change to:
  seqno: 1
  safe_to_bootstrap: 1

Restart the cluster with: galera_new_cluster

If the MariaDB Galera cluster fails and will not start, find the node holding the newest data (the highest seqno) and bring the cluster up from that node with the wsrep new-cluster bootstrap.

Detection scripts:

grep "New cluster view" /var/log/mariadb.log|awk  -F: 'END { print $1":"$2":"$3 $6":"$7}'
#!/bin/sh
PASS="xxxxx"
mysql -uroot -p$PASS -e "show full processlist" | awk '/Waiting for table flush/ || /UPDATE agents/ {print $1}'

Distributed RabbitMQ

Build the distributed RabbitMQ message queue and check its health

scp /var/lib/rabbitmq/.erlang.cookie node:/var/lib/rabbitmq/.erlang.cookie
 
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl forget_cluster_node ccm-01
rabbitmqctl join_cluster --disc rabbit@ccm-01
rabbitmqctl start_app
 
rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all"}'
rabbitmqctl cluster_status
rabbitmqctl change_password guest $DBPASSWD
rabbitmqctl set_permissions guest ".*" ".*" ".*" 
rabbitmq-plugins enable rabbitmq_management

If a machine that was in the cluster has gone offline, it must be removed from the cluster before it can rejoin

rabbitmqctl forget_cluster_node "$node"
rabbitmqctl cluster_status  # check the rabbitmq cluster status
RabbitMQ needs epmd running and reachable at startup; the default port is 4369
Since the network here comes up via rc-local, make sure these services start after rc-local:
for srv in mariadb@ memcached rabbitmq-server keepalived;do
  sed -r -i '/^After/s^$^ rc-local.service^g' /usr/lib/systemd/system/$srv.service
done

EasyStack Automated Deployment

Pinning the RDO version

dnf -y install https://repos.fedorapeople.org/repos/openstack/openstack-yoga/rdo-release-yoga-1.el8.noarch.rpm
dnf install -y network-scripts wget nc lldpd vim-enhanced rsyslog supervisor pciutils chrony tar screen bind-utils --enablerepo=epel
When using the RDO repos we recommend disabling EPEL, since updates in EPEL can break backward compatibility. Alternatively, pin package versions with the yum-versionlock plugin.
dnf config-manager --set-disabled epel
dnf config-manager --set-enabled powertools

Download the script and configure it

wget -c http://xxxx.upyun.com/easyStack_yoga.sh
chmod +x /root/easyStack_yoga.sh
 
./easyStack_yoga.sh 
# it asks a few questions and generates the easystackrc file dynamically
 
HOSTNAME="ccm-01.service.iqn"
NODE_TYPE="control" # network or compute
REGION="Region021"
CCVIP=db.service.iqn
MY_IP=100.100.62.3
VIRT_TYPE="kvm"
PROVIDER_INTERFACE="ens118f0"
GLERA_SRV="100.100.62.1,100.100.62.2,100.100.62.3"
MEMCACHES="db.service.iqn:11211"
STORE_BACKEND=ceph
 
NOVA_URL="http://db.service.iqn:8774/v2.1"
IMAGE_URL="http://db.service.iqn:9292/v2"
VOLUME_URL="http://db.service.iqn:8776/v3"
NEUTRON_URL="http://net.service.iqn:9696"
PLACEMENT_URL="http://db.service.iqn:8778"
KEYS_AUTH_URL="http://db.service.iqn:5000/v3"
KEYS_ADMIN_URL="http://db.service.iqn:35357/v3"
 
# then run the tuning step
./easyStack_yoga.sh adjust_sys
Sync the SSH key to every machine so they can all reach each other (ssh-copy-id or ansible copy)
ansible all -m copy -a "src=authorized_keys dest=/root/.ssh/"

With KVM virtualization, nova.conf also needs one of:

libvirt.cpu_mode = (custom, host-model, host-passthrough)
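That is, pick one of the three values in the [libvirt] section of nova.conf; host-passthrough gives guests the full host CPU feature set (fastest, but the CPUs must then match across nodes for live migration, as noted at the start of this guide):

[libvirt]
virt_type = kvm
cpu_mode = host-passthrough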

OpenStack node role configuration

Control cluster deployment

All three controller nodes install every service, in turn

./easyStack_yoga.sh keys_init
./easyStack_yoga.sh gls_init
./easyStack_yoga.sh cinder_init
./easyStack_yoga.sh nova_init    # initialize the database and the control-plane services
./easyStack_yoga.sh neutron_init # initialize the database only
 
./easyStack_yoga.sh probe_hypervisor # the controllers sync the fernet keys automatically

Initialize the fernet keys and sync them to every controller

# pick any controller to initialize the fernet keys; the keys and directories are created under /etc/keystone/
 
  keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone
  keystone-manage credential_setup --keystone-user keystone --keystone-group keystone
# sync the keys to the controller02/03 nodes
 
#!/bin/sh
for ccm in ccm-02 ccm-03;do
        rsync -avz -e "ssh" /etc/keystone/ $ccm:/etc/keystone/
        rsync -avz -e "ssh" /var/lib/keystone/ $ccm:/var/lib/keystone/
done
# after syncing, fix the key ownership on controller02/03
 
  chown keystone:keystone /etc/keystone/credential-keys/ -R
  chown keystone:keystone /etc/keystone/fernet-keys/ -R

Stateless API services, for example:

  1. nova-api: accepts and responds to external requests, EC2 API compatible
  2. nova-conductor: the middleware through which the database is accessed
  3. nova-scheduler: schedules instances onto hosts
  4. nova-novncproxy: the VNC proxy
  5. glance-api: handles image discovery, registration and retrieval
  6. keystone-api: identity, service rules and service tokens
  7. neutron-server: exposes the OpenStack networking API and its extensions
  8. neutron-api: accepts API requests and calls the plugin to create networks, subnets, routers, etc.

These can be given HA load balancing by HAProxy/Nginx plus a keepalived VIP,
which forwards each request to the API service on one of the nodes according to the chosen algorithm.
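A minimal sketch of such an haproxy entry for one stateless API (Keystone on port 5000), in the same style as the MySQL listen block shown later; the VIP and node addresses are placeholders:

listen keystone-api
        bind 10.33.66.254:5000
        balance roundrobin
        server ccm-01 10.33.66.1:5000 check
        server ccm-02 10.33.66.2:5000 check
        server ccm-03 10.33.66.3:5000 check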

Stateful services include the MySQL database and the AMQP message queue. For HA of stateful-style services such as:

  1. neutron-l3-agent / neutron-linuxbridge-agent: the agents on these nodes are what actually execute the network-related commands
  2. neutron-metadata-agent: forwards the metadata requests it receives to nova-api (the metadata service)
  3. nova-compute: maintains and manages the cloud's compute resources and the VM lifecycle
  4. cinder-volume, and so on

the simplest approach is to deploy them on multiple nodes (with consul auto-registration and discovery).

Network cluster deployment

Both network nodes install the neutron services, in turn

./easyStack_yoga.sh neutron_init | neutron_start | neutron_restart # start the network services

Compute node deployment

The compute nodes install the nova services

./easyStack_yoga.sh nova_init | nova_start | nova_restart  # start the compute services

Check that the services are listening

ss -tnplu | grep -E ":80 |3306|8774|8776|9292|9696|11211|15672|5000|35357"| awk '{print $1,$2,$5}'
 
# 80 httpd
# 3306 mysql
# 8774 nova
# 8776 cinder
# 9292 glance
# 9696 neutron
# 5000, 35357 keystone
# 11211 memcache
# 15672 rabbitmq

Creating the required Ceph pools

ceph osd pool create volumes 128 128 replicated rep_ssd
ceph osd pool create backups 128 128 replicated rep_ssd
ceph osd pool create vms 128 128 replicated rep_ssd
ceph osd pool create images 64 64 replicated rep_ssd

Enabling Ceph authentication for Glance and Cinder

ceph auth get-or-create client.cinder mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rx pool=images'
ceph auth get-or-create client.glance mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'
ceph auth get-or-create client.cinder-backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=backups'
 
ceph auth get-or-create client.cinder > /etc/ceph/ceph.client.cinder.keyring
ceph auth get-or-create client.cinder-backup > /etc/ceph/ceph.client.cinder-backup.keyring
ceph auth get-or-create client.glance > /etc/ceph/ceph.client.glance.keyring
 
chown cinder.cinder /etc/ceph/*cinder*
chown glance.glance /etc/ceph/*glance*
The cinder and glance Ceph credentials have to be synced to every compute node.

Enabling dual Ceph and LVM backends in Cinder

enabled_backends = ceph,lvm
 
[ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 5
rbd_store_chunk_size = 4
rados_connect_timeout = -1
rbd_user = cinder
rbd_secret_uuid = 85b90a2a-3072-4fef-b4cb-17de9346a4ea
volume_backend_name=ceph-gpu-01
 
[lvm]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_group = ssd1
volume_backend_name = lvm-gpu-01
target_helper = lioadm
  • Install targetcli and enable the iscsid and target services
  • Create the iSCSI IQN and write it to /etc/iscsi/initiatorname.iscsi
  • cinder service-list → openstack volume service list
Cinder's /var/lib/cinder/conversion directory should be moved onto the large disk:
if [ ! -L /var/lib/cinder/conversion ] ;then
        mv /var/lib/cinder/conversion /disk/ssd1/
        mkdir -p /disk/ssd1/conversion
        chown -R cinder.cinder /disk/ssd1/conversion
        ln -snf /disk/ssd1/conversion /var/lib/cinder/conversion
fi

Using NVMe-oF with Cinder for better performance

#!/bin/bash
readonly TYPE="rdma"  # tcp
DISK="/dev/nvme0n1"
MY_IP=$(ip a | awk '/inet.*100.100.*internal/{split($2,a,"/");{print a[1]}}'|head -1)
NAME="nvmeof_$TYPE"
 
for mod in nvmet nvmet-tcp nvmet-rdma nvme-fabrics; do
    modprobe $mod
done
 
TT=/sys/kernel/config/nvmet/subsystems/$NAME
if [ ! -s $TT/attr_allow_any_host ]; then
    mkdir -p $TT/namespaces/1
    echo -n $DISK > $TT/namespaces/1/device_path
    echo 1 > $TT/attr_allow_any_host
    echo 1 > $TT/namespaces/1/enable
fi  
 
PP=/sys/kernel/config/nvmet/ports/1
if [ ! -s $PP/addr_traddr ]; then
    mkdir -p $PP
    echo $TYPE > $PP/addr_trtype
    echo ipv4 > $PP/addr_adrfam
    echo $MY_IP > $PP/addr_traddr
    echo 4420 > $PP/addr_trsvcid
fi  
 
ln -s /sys/kernel/config/nvmet/subsystems/$NAME $PP/subsystems/$NAME
echo " -t $TYPE -a $MY_IP -s 4420 " > /etc/nvme/discovery.conf 
 
if [ ! -s /etc/nvme/hostnqn ];then
    nvme gen-hostnqn > /etc/nvme/hostnqn
    uuidgen > /etc/nvme/hostid
fi

Then enable the nvme-pool backend in cinder.conf

[nvme-pool]
image_volume_cache_enabled = true
image_volume_cache_max_size_gb = 1000
image_volume_cache_max_count = 100
target_protocol = nvmet_rdma
target_helper = nvmet
target_ip_address = $my_ip
target_port = 4420
volume_group = nvme-disk
volume_backend_name = nvme-pool
Detaching a volume outright while it is in use will make writes to the data disk fail!

If a server restarts uncleanly, VMs with attached disks may fail to start properly. First look up the attached volume IDs:

openstack server show 7c8f3d4e-xxxx -c volumes_attached

Log into the database and remove the association manually:

update block_device_mapping SET deleted=1 WHERE instance_uuid="7c8f3d4e-xxx" and volume_id="8a532442-xxx";

Enabling volume encryption in Cinder

Deploying load balancing and high availability

Octavia OpenStack LBaaS

High availability: keepalived

Load balancing: haproxy

Containerized deployment with Kolla

pip3 install git+https://opendev.org/openstack/kolla-ansible@stable/zed
 
git clone --branch stable/zed https://opendev.org/openstack/kolla
pip install ./kolla 

Automated deployment with DevStack

Download a specific DevStack version

dnf install -y git 
git clone -b stable/victoria https://opendev.org/OpenStack/devstack
On CentOS 9, additionally install redhat-lsb-core

Create the stack user

useradd -s /bin/bash -d /opt/stack -m stack
echo "stack ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/stack
chmod -R 755 /opt/
 
# switch to the stack user
su - stack

local.conf explained

[[local|localrc]]
WSGI_MODE=mod_wsgi
LIBVIRT_TYPE=qemu
SERVICE_IP_VERSION=4
 
ADMIN_USER=admin
ADMIN_PASSWORD=upyunxxxx
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
 
GIT_BASE=http://git.trystack.cn
NOVNC_REPO=http://git.trystack.cn/kanaka/noVNC.git
SPICE_REPO=http://git.trystack.cn/git/spice/spice-html5.git
 
HOST_IP=10.0.6.40
SERVICE_HOST=$HOST_IP
MYSQL_HOST=$HOST_IP
RABBIT_HOST=$HOST_IP
GLANCE_HOSTPORT=$HOST_IP:9292
KEYSTONE_AUTH_HOST=$HOST_IP
KEYSTONE_SERVICE_HOST=$HOST_IP
Q_HOST=$HOST_IP
 
# Neutron ML2 with OpenVSwitch
Q_PLUGIN=ml2
Q_AGENT=openvswitch
ENABLE_TENANT_VLANS=True
ML2_VLAN_RANGES=physnet1:1000:2000
NEUTRON_CREATE_INITIAL_NETWORKS=False
 
 
 
## The floating IP range used by OpenStack cloud instances
FLOATING_RANGE="203.0.113.0/24"
Q_FLOATING_ALLOCATION_POOL=start=203.0.113.5,end=203.0.113.200
FIXED_RANGE="10.0.6.200/28"
FIXED_NETWORK_SIZE=200
 
IDENTITY_API_VERSION=3
OS_IDENTITY_API_VERSION=3
OS_AUTH_URL="http://$KEYSTONE_AUTH_HOST/identity/"
 
# Reclone each time
RECLONE=no
 
 
DOWNLOAD_DEFAULT_IMAGES=False
IMAGE_URLS=http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
 
# Enabling Neutron (network) Service
disable_service n-net tempest
enable_service neutron
If you are starting VMs inside a VM (nested), you need virt_type=qemu.

Running it

If the DevStack run (stack.sh) completes successfully, it installs the sub-modules specified in local.conf on the current host; if local.conf names no modules, all sub-modules are installed.

As the stack user, run the following to log into the OpenStack client as admin.

source openrc admin admin
 
# check the status of each service
sudo systemctl status "devstack@*"
 
# list the services
nova service-list
 
# clean up compute services that are down
nova service-list | awk '/nova.down/{print $2}' | xargs -i nova service-delete {}
 
 
# list the images
openstack image list
 
# list the networks
openstack network list
 
# list the flavors
openstack flavor list
 
# create a virtual machine
openstack server create --flavor m1.nano --image cirros-0.5.1-x86_64-disk --nic net-id=<network name or ID> --security-group <security group name or ID> <instance name>
 
 
# check the VM status
openstack server list
 
systemctl restart httpd && systemctl enable httpd

Common installation errors

Clean up and reinstall
./unstack.sh
./clean.sh
# and run again
./stack.sh

✅ clouds_file missing 1 required positional argument: 'Loader'

PyYAML 5.1 deprecated the plain yaml.load(file) call as unsafe; since 5.1 a Loader must be specified, and the default FullLoader refuses to execute arbitrary functions, which makes load much safer.

Any of the following three forms works:

d1=yaml.load(file,Loader=yaml.FullLoader)
d1=yaml.safe_load(file)
d1 = yaml.load(file, Loader=yaml.CLoader)

✅ etcd cannot be downloaded because GitHub is unreachable

Download the matching etcd release in advance and install the binary manually.

✅ Failed to discover available identity versions when contacting

WSGI_MODE=mod_wsgi   # uwsgi seems buggy; running uwsgi on its own is not recommended

no uwsgi in

http_plugin.so python3_plugin.so No such file or directory

dnf install -y uwsgi uwsgi-plugin-python3 uwsgi-plugin-common --nobest

✅ Failure creating NET_ID for private

When DevStack deploys with the default ml2+ovs setup, creating a network first has to grab a VLAN ID; with no VLAN range configured, network creation fails.

Fix: add the VLAN range settings to local.conf, for example:

Q_PLUGIN=ml2
Q_AGENT=openvswitch
ENABLE_TENANT_VLANS=True
ML2_VLAN_RANGES=physnet1:100:4000

physical_network unknown for VLAN provider

✅ RabbitMQ will not install because of repo/dependency issues

dnf config-manager --set-enabled powertools
dnf --enablerepo=powertools -y install rabbitmq-server 

✅ Block Device Mapping is Invalid

When creating an instance in the dashboard, do not create a new volume directly; click No, and attach a volume after the instance has been created.

✅ Can not find requested image

[glance]
api_servers = http://10.0.1.232:9292 

In nova.conf this address must not include /v2; this looks like a bug.

✅ New compute node not mapped to any cell

nova-manage cell_v2 discover_hosts --verbose
Is the openstack-nova-compute process on the compute node healthy?
journalctl -xeu openstack-nova-compute | grep -i error

Are the Ceph commands that nova depends on working?

/usr/libexec/platform-python -s /usr/bin/ceph df --format=json --id cinder --conf /etc/ceph/ceph.conf

Are the neutron-linuxbridge-agent logs clean?

Python source changes need to be recompiled to take effect

Recompile config.py and replace the existing config.pyc

python -m py_compile config.py

✅ PortBindingFailed: Binding failed for port

journalctl -xeu openstack-nova-compute
journalctl -xeu openstack-nova-api
journalctl -xeu neutron-linuxbridge-agent
 
726 08:52:15 ops-yoga-m1 neutron-linuxbridge-agent[1472536]: oslo_config.cfg.ConfigFilesPermissionDeniedError: Fa>

A permissions error on linuxbridge-agent.ini caused the linuxbridge agent process to fail.

✅ Host is blocked because of many connection errors

SET global max_connect_errors=10000;
set global max_connections = 200;
flush hosts;

Or add host_cache_size = 0 to my.cnf

✅ Deadlock: wsrep aborted transaction

After a day and a night of debugging and testing, the following remedies were tried:

  1. Tuned the mariadb parameters, including cache sizes and timeouts; performance improved but the errors remained;
  2. Tuned the galera parameters such as wsrep_slave_threads; performance improved but the errors were not avoided;
  3. Backed up the databases, detached one of the three nodes, ran it standalone with the data imported, and reset the pymysql connection pool; the errors dropped noticeably;
  4. A MariaDB Galera cluster does not cope well with many clients hitting several nodes at once, conflicts appear (a quirky cluster mode, perhaps just not tuned well);
  5. A load balancer (nginx vs haproxy) was put in front; in testing haproxy clearly outperformed nginx (matching earlier experience).

Conclusion: put haproxy load balancing in front of the Galera cluster to protect it, and add health checks.

global 
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     10240
    user        haproxy
    group       haproxy
    daemon
 
defaults
    mode                    tcp
    log                     global
    retries                 10
    timeout queue           10m
    timeout connect         10m
    timeout check           30s
 
listen  mysqld-load
        bind 10.33.66.1:3306
        balance source
        server mysqld0 10.33.66.3:3308 check weight 9
        server mysqld1 10.33.66.2:3308 check weight 7
        server mysqld2 10.33.66.1:3308 check weight 5

✅ Manually recovering vm_state when a Nova instance is stuck in a bad state

In day-to-day operations, hardware failures or Neutron/Nova services going down regularly leave virtual machines either in Error, or stuck forever in Hard Reboot or Soft Reboot. That is when the Nova commands below come in.

1. First reset the instance state; the VM name or ID can be used below.

nova reset-state 06d9d410-***********   
nova reset-state --active 06d9d410-***********

2. Stop the VM

nova stop  06d9d410-***********   

3. Start the VM

nova start 06d9d410-*********** 
nova reboot --hard 06d9d410-*********** 

✅ Resetting Cinder volume state and forcing a detach

cinder reset-state --state available xxx
cinder reset-state --state available --attach-status detached xxx

✅ Port forwarding from a floating IP to a VM

✅ MTU for GRE- and VXLAN-encapsulated packets

In VXLAN mode the maximum MTU inside a VM is 1450; anything larger forces Open vSwitch to fragment, which causes retransmissions inside the VM and degrades network performance. In GRE mode the maximum VM MTU is 1462.

The arithmetic:

vxlan mtu = 1450 = 1500 - 20 (IP header) - 8 (UDP header) - 8 (VXLAN header) - 14 (Ethernet header)
gre mtu   = 1462 = 1500 - 20 (IP header) - 4 (GRE header) - 14 (Ethernet header)

You can configure the Neutron DHCP component so that VMs pick up the MTU automatically:

dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

vi /etc/neutron/dnsmasq-neutron.conf
dhcp-option-force=26,1450

Show floating IPs

openstack floating ip list

Show VM IPs

openstack server list

Show the port list

openstack port list --server ${SERVER_ID} -c id -f value

Create a port forwarding rule

openstack floating ip port forwarding create --internal-ip-address 192.168.2.167 --port f3b67c8c-9f39-42ca-a0cd-f131121db8d4  --protocol tcp --internal-protocol-port 22 --external-protocol-port 60122 112.13.174.28

List the port forwarding rules

openstack floating ip port forwarding list 1.2.3.4

no attribute 'X509_V_FLAG_CB_ISSUER_CHECK'

The root cause is a mismatch between the system's current python and pyOpenSSL versions

Fix:

  pip3 install -U pyOpenSSL

✅ libvirtError

A PCIe card has dropped off the bus, no valid host found

journalctl -xeu libvirtd | grep virPCIDeviceNew
 
Feb 05 15:03:33 gpu-01.service.yiw libvirtd[1190917]: 1190983: error : virPCIDeviceNew:1478 : Device 0000:4e:00.0 not found: could not access /sys/bus/pci/devices/0000:4e:00.0/config: No such file or directory
Feb 05 15:03:33 gpu-01.service.yiw libvirtd[1190917]: 1190983: error : virPCIDeviceNew:1478 : Device 0000:4e:00.1 not found: could not access /sys/bus/pci/devices/0000:4e:00.1/config: No such file or directory
 
 
lspci -v -s 0000:4e:00 | grep 'Physical Slot'
        Physical Slot: 2-1
 
# power-cycle the slot
echo 0 > /sys/bus/pci/slots/2-1/power
echo 1 > /sys/bus/pci/slots/2-1/power 

Secret not found: no secret with matching uuid

cat > /root/ceph_secret_virsh.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>$MY_UUID</uuid>
  <usage type='ceph'>
    <name>client.cinder secret</name>
  </usage>
</secret>
EOF
 
  for node in $(openstack hypervisor list -c "Hypervisor Hostname" -f value);do
    scp /etc/ceph/client.cinder.keyring $node:/etc/ceph/
    scp /root/ceph_secret_virsh.xml $node:/root/
    echo -en "${YELLOW_COL} ------------- Virsh Patch $node -------------${NORMAL_COL}\n"
    ssh $node "virsh secret-define --file /root/ceph_secret_virsh.xml; virsh secret-set-value --secret $MY_UUID --base64 \$(awk '/key/{print \$NF}' /etc/ceph/client.cinder.keyring) ; virsh secret-list"
  done

✅ Cleaning VM and image records out of the database

#!/bin/sh
PASS="upyunxxxx"
UUID=$1
 
cls_nova(){
 DBS="nova"
 UUID=$1
 for DB in $DBS;do
    CMD="mysqldump -uroot -p$PASS --skip-extended-insert $DB"
    $CMD > $DB.sql
    TABLES=$(grep $UUID $DB.sql | awk '{print $3}'|sort -ur)
    for table in $TABLES;do
      table=$(echo $table|sed -r 's@`@@g')
      echo $table
      #mysql -uroot -p$PASS $DB -e "desc $table"
      if [ $table == "instance_id_mappings" -o $table == "instances" ];then
        key=uuid
      else
        key=instance_uuid
      fi
      mysql -uroot -p$PASS $DB -e "select count(*) from $table where $key=\"$UUID\""
      #mysql -uroot -p$PASS $DB -e "SET FOREIGN_KEY_CHECKS=0; delete from $table where $key=\"$UUID\"; SET FOREIGN_KEY_CHECKS=1;"
    done
  done
}
 
cls_glance(){
 DBS="glance"
 for DB in $DBS;do
    CMD="mysqldump -uroot -p$PASS --skip-extended-insert $DB"
    $CMD > $DB.sql
    TABLES=$(grep $UUID $DB.sql | awk '{print $3}'|sort -ur)
    for table in $TABLES;do
      table=$(echo $table|sed -r 's@`@@g')
      echo $table
      mysql -uroot -p$PASS $DB -e "desc $table"
      if [ $table == "images" ];then
        key=id
      else
        key=image_id
      fi
      mysql -uroot -p$PASS $DB -e "select count(*) from $table where $key=\"$UUID\""
      #mysql -uroot -p$PASS $DB -e "SET FOREIGN_KEY_CHECKS=0; delete from $table where $key=\"$UUID\"; SET FOREIGN_KEY_CHECKS=1;"
    done
  done
}
 
cls_glance
#cls_nova

✅ ConflictNovaUsingAttachment: Detach

journalctl -xef |grep -vE 'consul|sshd|conmon|prometheus' | grep -iw error
Aug 12 11:31:28 ccn-01.service.yoga nova-compute[1340408]: 2023-08-12 11:31:28.608 1340408 ERROR nova.volume.cinder [req-4f63de2e-ee43-4d88-a191-768ee16b2ef0 7fabcbb084c84ec6ac89af7a85b3d6f2 7be91688344a4457af9e670e051089a9 - default default] Delete attachment failed for attachment 972faa64-0a35-40f4-ac76-d75ad866d0af. Error: ConflictNovaUsingAttachment: Detach volume from instance 6e942805-13b4-4297-a691-ed956b85cb01 using the Compute API (HTTP 409) (Request-ID: req-66735ff4-9cf4-43e8-bb8f-5cb4c5f0569d) Code: 409: cinderclient.exceptions.ClientException: ConflictNovaUsingAttachment: Detach volume from instance 6e942805-13b4-4297-a691-ed956b85cb01 using the Compute API (HTTP 409) (Request-ID: req-66735ff4-9cf4-43e8-bb8f-5cb4c5f0569d)

🆘 Volumes that cannot be detached normally: reclaim them by script

#!/bin/sh
readonly passwd="upyunxxxx"
 
# source the admin credentials
[ -s /var/lib/keystone/ks_rc_admin ] && source /var/lib/keystone/ks_rc_admin
[ -s ~/.easystackrc ] && source ~/.easystackrc
 
#for vol_id in a77d32d7-4a8f-4655-b3a6-58cf1726722e;do
for vol_id in $(openstack volume list -f value -c ID -c Status | awk '/detaching/ || /reserved/ || /attaching/ {print $1}');do
  #instance_id=$(openstack --os-volume-api-version 3.37 volume attachment list -c "Server ID" -c "Volume ID" -f value | awk '/'"$vol_id"'/{print $NF}')
  #nova volume-detach instance_id volume_id
  echo "detached volume for $instance_id $vol_id"
  cinder reset-state --state available --attach-status detached $vol_id
  mysql -ucinder -p$passwd cinder -e "delete from volume_attachment where volume_id=\"$vol_id\""
  mysql -unova -p$passwd nova -e "delete from block_device_mapping where volume_id=\"$vol_id\""
done
 
for vol_id in $(openstack volume list -f value -c ID -c Status | awk '/ reserved/{print $1}');do
  echo "attached volume for $instance_id $vol_id"
  cinder reset-state --state in-use --attach-status attached $vol_id
done

Glance image error

RADOS permission denied

Check whether the cinder volume services are healthy

openstack volume service list

volume Could not find any available weighted backend

Check the logs under /var/log/cinder/; most likely the VG or LV name or size is wrong

An LVM volume stuck in deletion for a long time, which even a state reset would not fix, was resolved by restarting cinder

From Yoga onward, nova.conf needs this parameter (in the [service_user] section):
send_service_user_token = True

✅ Dashboard error: Invalid service catalog service

✅ Network partition detected

This is a split brain in the cluster caused by a network problem. Temporary workaround:

Mnesia reports that this RabbitMQ cluster has experienced a network partition. There is a risk of losing data
 
   On the affected node, run:  sbin/rabbitmqctl stop_app 
   then, on the same node, run:  sbin/rabbitmqctl start_app 

the scheduler has made an allocation against this compute node but the instance has yet to start

✅ ResourceProviderCreationFailed: Failed to create resource provider

Below is the brute-force fix

With neutron-linuxbridge-agent ruled out and the network fine, the nova-compute log shows it cannot create a new resource provider (a UUID conflict?).
So delete the stale records first, then re-run probe_kvm.

  journalctl -xeu openstack-nova-compute -f
 
#!/bin/sh
# color definitions
readonly RED_COL="\\033[1;31m"      # red color
readonly GREEN_COL="\\033[32;1m"     # green color
readonly BLUE_COL="\\033[34;1m"     # blue color
readonly YELLOW_COL="\\033[33;1m"    # yellow color
readonly NORMAL_COL="\\033[0;39m"
 
PASS="xxxxxx"
 
cls_instance(){
 [ -z $2 ] && echo -e "$0 $1 ${GREEN_COL}uuid ${NORMAL_COL}" && exit 0
 ID=$2
 DBS="nova"
 for DB in $DBS;do
    CMD="mysqldump -uroot -p$PASS --skip-extended-insert $DB"
    $CMD > $DB.sql
    echo $ID 
    TABLES=$(grep $ID $DB.sql | awk '{print $3}'|sort -ur)
    echo $TABLES---------
    for table in $TABLES;do
      table=$(echo $table|sed -r 's@`@@g')
      echo $table
      #mysql -uroot -p$PASS $DB -e "desc $table"
      if [ $table == "instance_id_mappings" -o $table == "instances" ];then
        key=uuid
      else
        key=instance_uuid
      fi
      #mysql -uroot -p$PASS $DB -e "select count(*) from $table where $key=\"$ID\""
      mysql -uroot -p$PASS $DB -e "SET FOREIGN_KEY_CHECKS=0; delete from $table where $key=\"$ID\"; SET FOREIGN_KEY_CHECKS=1;"
    done
  done
}
 
cls_glance(){
 [ -z $2 ] && echo -e "$0 $1 ${GREEN_COL}uuid ${NORMAL_COL}" && exit 0
 ID=$2
 DBS="glance"
 for DB in $DBS;do
    CMD="mysqldump -uroot -p$PASS --skip-extended-insert $DB"
    $CMD > $DB.sql
    TABLES=$(grep $ID $DB.sql | awk '{print $3}'|sort -ur)
    for table in $TABLES;do
      table=$(echo $table|sed -r 's@`@@g')
      echo $table" -------------------------"
      mysql -uroot -p$PASS $DB -e "desc $table"
      if [ $table == "images" ];then
        key=id
      else
        key=image_id
      fi
      #mysql -uroot -p$PASS $DB -e "select count(*) from $table where $key=\"$ID\""
      mysql -uroot -p$PASS $DB -e "SET FOREIGN_KEY_CHECKS=0; delete from $table where $key=\"$ID\"; SET FOREIGN_KEY_CHECKS=1;"
    done
  done
}
 
cls_compute(){
 [ -z $2 ] && echo -e "$0 $1 ${GREEN_COL}hostname ${NORMAL_COL}" && exit 0
 ID=$2
 DBS="nova nova_api placement"
 for DB in $DBS;do
    CMD="mysqldump -uroot -p$PASS --skip-extended-insert $DB"
    $CMD > $DB.sql
    TABLES=$(grep $ID $DB.sql | awk '{print $3}'|sort -ur)
    for table in $TABLES;do
      table=$(echo $table|sed -r 's@`@@g')
      if [ $table == "resource_providers" ];then
        key=name
      elif [ $table == "migrations" ];then
        key=source_compute
      else
        key=host
      fi
        echo  $ID -- $DB -- $table -- $key
        mysql -uroot -p$PASS $DB -e "SET FOREIGN_KEY_CHECKS=0; delete from $table where $key=\"$ID\"; SET FOREIGN_KEY_CHECKS=1;"
    done
 done
}
case $1 in
  cls_glance)
        cls_glance $*;;
  cls_instance)
        cls_instance $*;;
  cls_compute)
        cls_compute $*;;
  *)
        echo -e "${RED_COL}$0 ${YELLOW_COL}cls_glance|cls_instance|cls_compute${NORMAL_COL}";;
esac

🏆✅ The right way to clean up compute nodes and VMs

openstack server list | awk '/tenant=/{print $2,$4}' > server.list
 
#!/bin/sh
while read host;do
        read -r tid t  <<<  $host
        h_host=$(openstack server show $tid | awk '/hypervisor_hostname/{print $(NF-1)}')
        grep -wq $h_host list_nouse
        if [ $? = 0 ];then
                echo  "----->  delete $tid $t"
                #openstack server delete $tid
        fi
done < server.list
 
openstack compute service list|awk '/down/{print $2}'|xargs -i openstack compute service delete {}
 
openstack network agent list|awk '/XXX/{print $2}'|xargs -i openstack network agent delete {}
 
#!/bin/sh
types="cinder-scheduler cinder-backup cinder-volume"
for type in $types;do
  for node in $(openstack volume service list | awk '/'"$type"'.*gpu.*down/{print $4}');do
        echo "$type -> $node"
        openstack volume service set --disable $node $type
        cinder-manage service remove $type $node
  done
done

🏆 ✅ Procedure for adding a compute node

Disable power saving / switch to performance mode (folded into create_nvmeof.sh):
  cpupower idle-set -D 0
  cpupower frequency-set -g performance
If the glance/cinder/nova services are unhealthy, check in the relevant logs whether
the RBD configuration or authorization in Ceph (probe_ceph) is correct.

Must-check items:

  • vt-d and NUMA enabled in the BIOS, pci=realloc in grub
  • nova (controller) / cinder / os_brick (zed) patched and upgraded
  • haproxy forwarded ports must not conflict
  • open a CPU and a GPU instance to verify the whole flow works
dmesg | grep -c iommu
dmesg | grep -wE 'vfio_iommu_type1|realloc'
grep -E 'novncproxy_base|passthrough_' /etc/nova/nova.conf 
grep -w service_plugins /etc/neutron/neutron.conf
grep can_disconnect /usr/lib/python3.6/site-packages/os_brick/initiator/connectors/nvmeof.py
mkvnc(){
  #ngb1.dc.huixingyun.com:6081/vnc_auto.html
  sed -r -i '/novncproxy_base_url/s@=.*@= https://xxxx.s.upyun.com/vnc_auto.html@g' /etc/nova/nova.conf
  systemctl daemon-reload
  systemctl restart openstack-nova-compute
}

Get the VNC console link for an instance:

nova get-vnc-console ad42e72d-59de-4907-9d7a-a6157a5b8e10 novnc