🌱 煎茶

Original, high-quality, in-depth, and thoughtful content.

openFuyao NPU-Operator Troubleshooting

Failing pod: describe

[root@master1 ~]# kubectl -n kube-system describe pod ascend-device-plugin-ll46f
Name:                 ascend-device-plugin-ll46f
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      ascend-device-plugin-sa
Node:                 master1/10.17.30.131
Start Time:           Mon, 30 Mar 2026 11:08:32 +0800
Labels:               app.kubernetes.io/managed-by=npu-operator
                      controller-revision-hash=7df5dcb887
                      helm.sh/chart=npu-operator-0.15.0
                      name=ascend-device-plugin-ds
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: c1f2adcaeaaf2bdcf0a6e09730f68231a293074e31d58f61997f714dfb520878
                      cni.projectcalico.org/podIP: 192.168.137.118/32
                      cni.projectcalico.org/podIPs: 192.168.137.118/32
                      scheduler.alpha.kubernetes.io/critical-pod:
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Running
IP:                   192.168.137.118
IPs:
  IP:           192.168.137.118
Controlled By:  DaemonSet/ascend-device-plugin
Init Containers:
  init-permission:
    Container ID:  containerd://4406968a522bea48dfefebae81ec53644312762af4781c25de689952ed6c2d27
    Image:         cr.openfuyao.cn/openfuyao/busybox:1.36.1
    Image ID:      cr.openfuyao.cn/openfuyao/busybox@sha256:4b8407fadd8100c61b097d63efe992b2c033e7d371c9117f7a9462fe87e31176
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      chown 9000:9000 /var/log/mindx-dl /var/log/mindx-dl/devicePlugin
      chmod 750 /var/log/mindx-dl/devicePlugin
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 30 Mar 2026 15:28:32 +0800
      Finished:     Mon, 30 Mar 2026 15:28:32 +0800
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/log/mindx-dl/devicePlugin from log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Containers:
  device-plugin-01:
    Container ID:  containerd://fcc0c4742285847e2621a9a9217502307fc7e28644fbf86b32f9c11d67a2c0ab
    Image:         cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0
    Image ID:      cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin@sha256:a5b9612b21bcd35384f9f19a05b2d7915b865e7b2be6a30bfd7806a9b8a86f58
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      device-plugin -useAscendDocker=true -volcanoType=false -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 31 Mar 2026 10:28:58 +0800
      Finished:     Tue, 31 Mar 2026 10:28:58 +0800
    Ready:          False
    Restart Count:  274
    Limits:
      cpu:     500m
      memory:  500Mi
    Requests:
      cpu:     500m
      memory:  500Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /tmp from tmp (rw)
      /usr/local/Ascend/driver from hiai-driver (ro)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/lib/kubelet/pod-resources from pod-resource (rw)
      /var/log/mindx-dl/devicePlugin from log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  pod-resource:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pod-resources
    HostPathType:
  hiai-driver:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver
    HostPathType:
  log-path:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/mindx-dl/devicePlugin
    HostPathType:  DirectoryOrCreate
  tmp:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp
    HostPathType:
  kube-api-access-gfldg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  openfuyao.com/npu.present=
Tolerations:     CriticalAddonsOnly op=Exists
                 device-plugin=v2:NoSchedule
                 huawei.com/Ascend910:NoSchedule op=Exists
                 node-role.kubernetes.io/control-plane:NoSchedule
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   16m (x205 over 18h)     kubelet  (combined from similar events): Successfully pulled image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0" in 403ms (403ms including waiting). Image size: 48017174 bytes.
  Warning  BackOff  2m47s (x5216 over 18h)  kubelet  Back-off restarting failed container device-plugin-01 in pod ascend-device-plugin-ll46f_kube-system(8edcd384-ab2d-4998-8077-5ac58801c79e)
  Normal   Pulling  66s (x227 over 19h)     kubelet  Pulling image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0"

Failing pod: /dev check

[root@master1 fuyao-26.3-rc3]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- ls /dev
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
autofs null tty10 tty34 tty58 vcs5
bsg ppp tty11 tty35 tty59 vcs6
btrfs-control ptmx tty12 tty36 tty6 vcsa
bus pts tty13 tty37 tty60 vcsa1
core random tty14 tty38 tty61 vcsa2
cpu_dma_latency raw tty15 tty39 tty62 vcsa3
cuse relationship_ctrl tty16 tty4 tty63 vcsa4
davinci0 rfkill tty17 tty40 tty7 vcsa5
davinci_manager rtc0 tty18 tty41 tty8 vcsa6
devmm_svm sda tty19 tty42 tty9 vcsu
dri sda1 tty2 tty43 ttyAMA0 vcsu1
fb0 sda2 tty20 tty44 ttyS0 vcsu2
fd sg0 tty21 tty45 ttyS1 vcsu3
full sg1 tty22 tty46 ttyS2 vcsu4
fuse sg2 tty23 tty47 ttyS3 vcsu5
hidraw0 shm tty24 tty48 uhid vcsu6
hidraw1 snapshot tty25 tty49 uinput vfio
hisi_hdc sr0 tty26 tty5 urandom vga_arbiter
hwrng sr1 tty27 tty50 usbmon0 vhost-net
input stderr tty28 tty51 usbmon1 vhost-vsock
kmsg stdin tty29 tty52 usbmon2 vport2p1
loop-control stdout tty3 tty53 vcs zero
mapper termination-log tty30 tty54 vcs1
mem tty tty31 tty55 vcs2
mqueue tty0 tty32 tty56 vcs3
net tty1 tty33 tty57 vcs4

Failing pod: driver check

[root@master1 fuyao-26.3-rc3]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- ls -lha /usr/local/Ascend/driver
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
total 44K
drwxr-xr-x  8 root root 4.0K Mar 27 08:03 .
drwxr-xr-x  3 root root 4.0K Mar 31 02:34 ..
drwxr-xr-x  2 root root 4.0K Mar 27 08:01 bin
-r--r--r--  1 root root   20 Mar 27 08:01 build.info
dr-xr-x---  2 root root 4.0K Mar 27 08:01 device
dr-x------ 41 root root 4.0K Mar 27 08:01 kernel
drwxr-xr-x  6 root root 4.0K Mar 27 08:01 lib64
-r--r-----  1 root root   56 Mar 27 08:01 scene.info
dr-xr-x---  2 root root 4.0K Mar 27 08:01 script
drwxr-xr-x  2 root root 4.0K Mar 27 08:01 tools
-r--r--r--  1 root root  352 Mar 27 08:03 version.info

Failing pod: logs

[root@master1 ~]# kubectl -n kube-system logs daemonsets/ascend-device-plugin --previous
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
[INFO] 2026/03/31 06:46:54.593254 1 hwlog/api.go:108 devicePlugin.log's logger init success
[INFO] 2026/03/31 06:46:54.593449 1 main.go:187 ascend device plugin starting and the version is v6.0.0_linux-aarch64
[INFO] 2026/03/31 06:46:54.593494 1 main.go:188 ascend device plugin starting scene is center
[INFO] 2026/03/31 06:46:54.787930 1 devmanager/devmanager.go:104 the dcmi version is 24.1.rc3
[ERROR] 2026/03/31 06:46:54.788019 1 devmanager/devmanager.go:211 get error card quantity: 0
[ERROR] 2026/03/31 06:46:54.788052 1 devmanager/devmanager.go:195 get card list failed for init
[ERROR] 2026/03/31 06:46:54.788101 1 main.go:203 init devmanager failed, err: auto init failed, err: get card list failed for init

Failing pod: driver check

[root@master1 ~]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- bash -c 'find /usr/local/Ascend/driver -name libdcmi.so 2>/dev/null; echo $LD_LIBRARY_PATH'
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
/usr/local/Ascend/driver/lib64/driver/libdcmi.so
command terminated with exit code 137
[root@master1 ~]# ps -ef | grep -E 'dmp_daemon|slogd' | grep -v grep
root 21578 1 0 Mar30 ? 00:00:19 /usr/sbin/rsyslogd -n -i/var/run/rsyslogd.pid

Check service status?

[root@master1 ~]# systemctl status ascend-dmi
Unit ascend-dmi.service could not be found.
[root@master1 ~]# systemctl status ascend-dkms
Unit ascend-dkms.service could not be found.
[root@master1 ~]# systemctl status npu-smi
Unit npu-smi.service could not be found.
[root@master1 ~]# find / -name dmp_daemon 2>/dev/null
[root@master1 ~]# find / -name slogd 2>/dev/null
[root@master1 ~]# ls -l /var/dmp_daemon /var/slogd 2>/dev/null
[root@master1 ~]#

A dcmi problem; hardware-level troubleshooting is needed ...
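The device-plugin log above reports a card quantity of 0 before the container exits. As a minimal sketch (not part of the device plugin or NPU-Operator), the following Python pulls the ERROR entries and the reported card quantity out of such a log. The line layout in the regular expression is inferred from the excerpt only, and the function names are hypothetical.

```python
import re

# Assumed layout of one log line, inferred from the excerpt above:
# [LEVEL] date time <numeric field> source-file:line message
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<num>\d+)\s+"
    r"(?P<source>\S+)\s+(?P<message>.*)$"
)

# Lines copied from the log excerpt above.
SAMPLE_LOG = """\
[INFO] 2026/03/31 06:46:54.787930 1 devmanager/devmanager.go:104 the dcmi version is 24.1.rc3
[ERROR] 2026/03/31 06:46:54.788019 1 devmanager/devmanager.go:211 get error card quantity: 0
[ERROR] 2026/03/31 06:46:54.788052 1 devmanager/devmanager.go:195 get card list failed for init
[ERROR] 2026/03/31 06:46:54.788101 1 main.go:203 init devmanager failed, err: auto init failed, err: get card list failed for init
"""


def error_entries(log_text):
    """Return (source, message) pairs for every ERROR line in the log text."""
    entries = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            entries.append((match.group("source"), match.group("message")))
    return entries


def reported_card_quantity(log_text):
    """Return the card quantity from the 'get error card quantity' line, or None."""
    match = re.search(r"get error card quantity: (\d+)", log_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for source, message in error_entries(SAMPLE_LOG):
        print(f"{source}: {message}")
    print("card quantity:", reported_card_quantity(SAMPLE_LOG))
```

In practice the text would come from `kubectl -n kube-system logs daemonsets/ascend-device-plugin --previous`; a quantity of 0 only confirms what the log already says and does not identify the underlying cause.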

April 13, 2026 | 5 min | 2277 characters | Tianlun Song

openFuyao 2603 Joint Test Report

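Related links
Feature list: https://gitcode.com/openFuyao/release-management/blob/main/openFuyao-26.03/release-plan.md
User guide for the pre-installation environment check tool: https://gitcode.com/openFuyao/sig-installation/blob/master/docs/zh/user_guide/cluster_installation_deployment/environment_pre_check_tool_guide.md

Test environment
CPU: Kunpeng-920
OS: openEuler 24.03 LTS SP3 aarch64
Fuyao Version: v26.03 rc3
docker: 2:18.09.0-346.oe2403sp3

Features tested
Online deployment; offline package preparation; offline deployment; pre-installation check tool; NPU Operator; AI inference suite.

Suggested improvements
The environment check tool should verify whether the iptables default policy allows traffic; if it does not, the cluster may be unreachable even after a successful deployment. The default firewall policy here is FORWARD DROP, which can cause problems for cluster operation and access.
Before running, the cli should check that the commands it depends on exist and raise an error promptly. In particular, check that tar / unzip are installed: the installation uses them in many places, and when they are missing there is no obvious extraction-failure error, which makes the problem hard to locate.
The installation command has changed; has backward and forward compatibility been considered?

Scenario notes
Offline deployment of the management-plane and workload-plane clusters
CPU: Kunpeng-920
OS: openEuler 24.03 LTS SP3 aarch64
Fuyao Version: v26.03 rc3
docker: 2:18.09.0-346.oe2403sp3
Why does building the offline artifact package in an arm64 environment execute amd64 binaries ...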
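The two suggested pre-checks above (required commands present, iptables FORWARD default policy) could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the openFuyao check tool's implementation: the function names are hypothetical, and the rules text is assumed to come from `iptables -S FORWARD`, whose policy line has the form `-P FORWARD DROP`.

```python
import shutil


def missing_commands(required, which=shutil.which):
    """Return the names in `required` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without touching the host.
    """
    return [name for name in required if which(name) is None]


def forward_policy(iptables_rules):
    """Return the default policy of the FORWARD chain from `iptables -S` style text.

    Looks for a line of the form '-P FORWARD <POLICY>'; returns None if absent.
    """
    for line in iptables_rules.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "-P" and parts[1] == "FORWARD":
            return parts[2]
    return None


def precheck(required, iptables_rules, which=shutil.which):
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    for name in missing_commands(required, which=which):
        problems.append(f"required command not found: {name}")
    policy = forward_policy(iptables_rules)
    if policy == "DROP":
        problems.append("iptables FORWARD default policy is DROP")
    return problems


if __name__ == "__main__":
    # Hypothetical rules text; on a real host it would be the output of
    # `iptables -S FORWARD` (running that command needs root privileges).
    rules = "-P FORWARD DROP\n-A FORWARD -j DOCKER-USER\n"
    for problem in precheck(["tar", "unzip"], rules):
        print(problem)
```

Reporting every problem in one pass, rather than stopping at the first, matches the complaint in the report that failures surfaced late and were hard to trace.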

April 13, 2026 | 16 min | 7872 characters | Tianlun Song

openFuyao InferNex AI Inference Integrated Deployment on 310P (300I Pro): Environment Issues and Fixes

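AI Inference Integrated Deployment (InferNex) is an end-to-end integrated deployment solution designed to optimize AI inference services in cloud-native environments. Built on the Kubernetes Gateway API Inference Extension (GIE) and the mainstream LLM technology stack, it uses Helm Charts to integrate its core acceleration modules: an open-source gateway, intelligent routing, a high-performance inference backend, global KVCache management, a scaling decision framework, and an inference observability stack. It provides a complete acceleration path from request ingress and dynamic routing through inference execution to resource management and monitoring, aiming to raise inference throughput and lower TTFT/TPOT latency for a one-stop, efficient AI service deployment experience. ...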

April 13, 2026 | 24 min | 11831 characters | Tianlun Song

Fixing the ceph mon "Operation not permitted" Error

On a self-built ceph, the mon would not start and reported the following error:

Apr 03 11:14:30 debian systemd[1]: Started Ceph cluster monitor daemon.
░░ Subject: A start job for unit ceph-mon@debian.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit ceph-mon@debian.service has finished successfully.
░░
░░ The job identifier is 6997.
Apr 03 11:14:31 debian ceph-mon[374601]: 2026-04-03T11:14:31.084+0800 ffffaf907040 -1 load: jerasure load: lrc load dlopen(/usr/lib/ceph/erasure-code/libec_isa.so): /usr/lib/ceph/erasure-code/libec_isa.so: cannot make segment writable for relocation: Operation not permitted
Apr 03 11:14:31 debian systemd[1]: ceph-mon@debian.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit ceph-mon@debian.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 03 11:14:31 debian systemd[1]: ceph-mon@debian.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit ceph-mon@debian.service has entered the 'failed' state with result 'exit-code'.

Temporary workaround
According to claude-sonnet 4.6: ...

April 3, 2026 | 2 min | 625 characters | Tianlun Song

Ascend 310P + openFuyao + NPU-Operator Troubleshooting

April 1, 2026 | 5 min | 2278 characters | Tianlun Song

KDE Plasma 6: Disable the Global Menu and Restore Normal Application Menus

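Background
At some point, KDE Plasma began enabling a macOS-style global application menu by default. That is, the menu is no longer shown below the application window's title bar but is moved into the "Global Menu" widget in the top menu bar. The problem is that the Linux desktop ecosystem is complex: with X11, Wayland, Qt, GTK, and other technologies in the mix, it is hard to guarantee that commonly used software displays the global menu correctly. ...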

April 1, 2026 | 1 min | 431 characters | Tianlun Song

The Ultimate Guide: Quickly Deploying Moltbot (formerly Clawbot) on a Bare-Metal Linux Server with Feishu Integration

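Introduction
In early 2026, an open-source AI agent framework called Moltbot (formerly Clawbot) swept through the developer community. The framework lets users integrate powerful AI models (for example OpenAI's GPT series or Anthropic's Claude) with everyday messaging tools such as WhatsApp, Telegram, and Discord, so that they can control a computer, run tasks, and retrieve information simply by chatting. A recommendation from Andrej Karpathy, Tesla's former head of AI, made it take off even faster, and its GitHub project collected more than 60,000 stars in a short time. ...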

January 29, 2026 | 5 min | 2263 characters | Tianlun Song

Configuring Claude Code on Windows: Fixing settings.json Not Taking Effect

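TL;DR
By default, Claude Code installed on Windows reads its configuration from this location:
C:\Users\<YOUR_NAME>\.claude
On other systems, look for the equivalent ~/.claude path.
The official flow is complete once installation finishes, and you can log in and use it right away. To connect Claude Code to a third-party API, you need to edit the settings.json configuration file in this directory, either with CC-Switch or by hand, but after editing you will find that the changes do not take effect. ...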
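Before digging further, one thing that can be ruled out quickly is whether settings.json exists at the location described above and parses as JSON. The following is a minimal Python sketch with hypothetical helper names; it only checks the `.claude/settings.json` path under the home directory mentioned in the post and does not address the cause discussed there.

```python
import json
from pathlib import Path


def claude_settings_path(home=None):
    """Return <home>/.claude/settings.json; defaults to the current user's home."""
    base = Path(home) if home is not None else Path.home()
    return base / ".claude" / "settings.json"


def load_settings(path):
    """Return the parsed settings, or None if the file does not exist.

    Raises json.JSONDecodeError if the file exists but is not valid JSON.
    """
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    settings_file = claude_settings_path()
    try:
        settings = load_settings(settings_file)
    except json.JSONDecodeError as exc:
        print(f"{settings_file}: not valid JSON: {exc}")
    else:
        if settings is None:
            print(f"{settings_file}: not found")
        else:
            print(f"{settings_file}: OK, top-level keys: {sorted(settings)}")
```

On Windows, `Path.home()` resolves to the `C:\Users\<YOUR_NAME>` directory, so the same script covers both cases described in the TL;DR.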

January 9, 2026 | 1 min | 344 characters | Tianlun Song

Configuring Claude Code on Windows: The Full Process

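Today I finally got the full process of running Claude Code on Windows working, natively and without WSL. It started because I needed a cloud desktop that can run long-lived tasks, and Windows is still the most convenient option for that. I have to say that, compared with Linux/macOS, running Claude Code on Windows has far too many pitfalls. ...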

January 9, 2026 | 2 min | 943 characters | Tianlun Song

2025-12-31 | Year-End Review

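2025 has passed just like that. I did a lot and a lot happened; it was a year of turning points, an unforgettable year, a year worth looking back on. This year my life took many turns and my thinking changed a great deal as well. Having to write a year-end review all of a sudden, I hardly know where to begin, so I will just write whatever comes to mind. ...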

December 31, 2025 | 6 min | 2557 characters | Tianlun Song