Category Archives: G_Tips

Deploying OpenClaw in an Ubuntu Multipass VM on a Home Network

OpenClaw is all the rage, so I wanted to deploy it on the NAS at home.
Since OpenClaw has local system access, I would never dare run it directly on the NAS's host OS, so it needs an isolated environment. Here are my notes from the process.

Docker deployment

My first instinct was a containerized Docker deployment; after all, most services on my NAS run in containers.
According to the GitHub repo it is supported: just run docker-setup.sh. Reading the docs and the script shows that these environment variables customize the config and workspace directories:

OPENCLAW_WORKSPACE_DIR=/path/to/openclaw/workspace
OPENCLAW_CONFIG_DIR=/path/to/openclaw/config_dir

They point to the workspace (containing README, MEMORY, SKILL and other markdown files) and to OpenClaw's own configuration (mainly openclaw.json).

The container can also use a proxy at build time and runtime by adding this to the Dockerfile:

ENV https_proxy=http://<proxy>:<port>

Then just run:

./docker-setup.sh

This builds the container image, starts the container, and runs the onboarding automatically.

It looked great. However...

Pitfalls of the Docker deployment

Pitfall 1: the CLI container cannot connect to the gateway

OpenClaw's docker-compose.yaml defines two containers by default: openclaw-gateway and openclaw-cli. The gateway container runs the long-lived service; the other one is for the CLI tool.
However, while the gateway container runs fine with its port mapped, the CLI container fails right away with:

 Error: gateway closed (1006 abnormal closure (no close frame)): no close reason
 Source: local loopback

This is clearly a network problem: with two containers, the gateway binds its service to the loopback interface, so the other container trying to reach the gateway at 127.0.0.1:18789 can never connect...

The fix is to add network_mode: "service:openclaw-gateway" to the cli service in docker-compose.yml so that it shares the gateway's network namespace. With that, the CLI can reach the gateway and check health "normally":

docker compose run --rm openclaw-cli health
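For reference, the change amounts to one line under the cli service in docker-compose.yml (a sketch; only the service names from the upstream compose file are assumed here):

```yaml
services:
  openclaw-gateway:
    # ... unchanged ...
  openclaw-cli:
    # ... existing settings ...
    network_mode: "service:openclaw-gateway"   # share the gateway's network namespace
```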

This change has been submitted as a PR: https://github.com/openclaw/openclaw/pull/13941

Pitfall 2: no openclaw inside the gateway

At runtime, OpenClaw often responds to user instructions by invoking all sorts of CLI tools.
For example, asking it to create a scheduled task makes it run a command like openclaw cron. The log then shows:

openclaw command not found

Checking inside the gateway container confirms it: there really is no openclaw or openclaw-cli command-line tool at all!

In fact, in the containerized deployment:

  • the gateway and the CLI tool share one image
  • the two services have different command/entrypoint settings
  • but the image really does not ship an openclaw CLI tool, only openclaw-cli

Essentially this looks like a bug in the containerized deployment, but given the potential for further problems, I gave up on this approach.

Virtual machine deployment

Without containers, the natural alternative is a virtual machine; my NAS runs qemu anyway.
But I had recently heard about Ubuntu's Multipass, so this was a good chance to try it.

Installing and configuring Multipass

Multipass is only distributed as a snap, and I wanted its data directory off the root filesystem. A few things need special care; most importantly, the external directory it uses must live under a path like /mnt or /media (what snap's removable-media interface allows). See "Configure where Multipass stores external data".

# install multipass
sudo snap install multipass

# set the external data directory to /mnt/storage/multipass/
sudo snap stop multipass
sudo snap connect multipass:removable-media
mkdir -p /mnt/storage/multipass/
sudo chown root /mnt/storage/multipass/

# set the environment variable via a systemd unit override
sudo mkdir -p /etc/systemd/system/snap.multipass.multipassd.service.d/
sudo tee /etc/systemd/system/snap.multipass.multipassd.service.d/override.conf <<EOF
[Service]
Environment=MULTIPASS_STORAGE=/mnt/storage/multipass/
EOF

# reload systemd
sudo systemctl daemon-reload

# copy the existing data
sudo cp -r /var/snap/multipass/common/data/multipassd /mnt/storage/multipass/data
sudo cp -r /var/snap/multipass/common/cache/multipassd /mnt/storage/multipass/cache

# start the multipass service
sudo snap start multipass

Multipass data now lives under /mnt/storage/multipass/ instead of eating space on the root filesystem.

Launching and configuring the Multipass VM

The simplest way is a one-liner:

multipass launch --name openclaw-vm

A VM launched this way gets the default number of cores, memory, and disk, which is not enough for installing OpenClaw, so the disk and memory need adjusting. That is simple too (the same values can also be given at launch time via --cpus, --memory, and --disk):

multipass stop openclaw-vm
multipass set local.openclaw-vm.cpus=2
multipass set local.openclaw-vm.disk=64G
multipass set local.openclaw-vm.memory=4G
multipass start openclaw-vm

That gives a VM with 2 cores, 4G of RAM, and a 64G disk.

Next I wanted to bring the configuration from the earlier Docker deployment into the VM. That turned out to be convenient as well: one command mounts it at /extra/openclaw inside the VM, ready to use.

multipass mount /mnt/path/to/openclaw openclaw-vm:/extra/openclaw

To make the VM easier to reach from the local network, enable bridging so it gets a LAN IP:

multipass set local.openclaw-vm.bridged=true

Then configure a gateway that can reach the open internet plus a nameserver, and OpenClaw is ready to play with:

# cat /etc/netplan/50-cloud-init.yaml
network:
  version: 2
      # some settings omitted...
      routes:
        - to: default
          via: 192.168.1.xxx
          metric: 50
      nameservers:
        addresses: [192.168.1.xxx]
      dhcp4-overrides:
        use-dns: false
        use-routes: false

Installing and configuring OpenClaw

The VM is a clean environment with working unrestricted internet access, so installation and configuration are simple and direct.

# make sure build-essential is present
sudo apt update && sudo apt install -y build-essential

# download and install openclaw
wget https://openclaw.ai/install.sh
bash install.sh

# link .openclaw to /extra/openclaw and set the needed environment variables
systemctl --user stop openclaw-gateway.service
ln -s /extra/openclaw ~/.openclaw
EDITOR=vim systemctl --user edit openclaw-gateway.service --full
# add Environment=OPENCLAW_GATEWAY_HOST=0.0.0.0
systemctl --user start openclaw-gateway.service

The service is now up and listening on all interfaces inside the VM, ready for use.
Check its state with:

openclaw health
openclaw status

You can also test it directly from the local TUI:

openclaw tui

Other notes

  1. Plenty has been written about the rest of openclaw.json. The main gotcha: models.providers.xxx only defines a provider; the model must also be set under agent.defaults.models, otherwise a custom model API will not actually be used.
  2. Creating and using a Telegram bot is very convenient; recommended.
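As a sketch of point 1, the two places that need to agree might look roughly like this (everything here is illustrative: the provider name, model id, and the field names inside the provider block are placeholders, not the real schema; only the models.providers and agent.defaults.models paths come from the text above):

```json
{
  "models": {
    "providers": {
      "my-provider": {
        "base_url": "<endpoint>",
        "api_key": "<key>"
      }
    }
  },
  "agent": {
    "defaults": {
      "models": ["my-provider/my-model"]
    }
  }
}
```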

Replacing a disk in SnapRAID

An earlier post, "my budget NAS and its configuration", described the NAS setup.
After working normally for almost 5 years, one of the disks started reporting SMART errors.

SMART Error

The first SMART error looked like this:

# SMART error (CurrentPendingSector) detected on host
Device: /dev/sde [SAT], 1 Currently unreadable (pending) sectors

After a few stable days, the pending sector count became:

Device: /dev/sde [SAT], 41 Currently unreadable (pending) sectors

The number of unreadable sectors had clearly grown.

Let's look at the output of snapraid smart:

$ sudo snapraid smart
SnapRAID SMART report:

   Temp  Power   Error   FP Size
      C OnDays   Count        TB  Serial           Device    Disk
 -----------------------------------------------------------------------
     32   1750       0   5%  6.0  WD-WX31D88R41HT  /dev/sdb  d1
     35   1749 selferr 100%  6.0  WD-WX31D88AEC9Z  /dev/sde  d2
     33   1749       0   4%  6.0  WD-WX31D88AERKL  /dev/sdd  parity

Disk d2 is nearing the end of its life.

Replacing the disk

Replacing a disk in SnapRAID is not complicated; here are the steps for the record.

  1. Stop every service that writes data, for example all the day-to-day docker containers:

    docker stop $(docker ps -a -q)
  2. Make sure the data is consistent; if there are differences, run a sync:

    $ sudo snapraid diff # check for inconsistencies
    Loading state from /var/snapraid/snapraid.content...
    Comparing...
    ...
    6 removed
    18 updated
    There are differences!
    
    # there are differences, so run a sync
    $ sudo snapraid sync
    ...
    Everything OK
    ...
    Verifying...
    Verified /mnt/data/disk2/snapraid.content in 0 seconds
    Verified /var/snapraid/snapraid.content in 0 seconds
    Verified /mnt/data/disk1/snapraid.content in 0 seconds
    
    # make sure the data is consistent again
    $ sudo snapraid diff
    Loading state from /var/snapraid/snapraid.content...
    Comparing...
    ...
    No differences
  3. Create a partition on the new disk and format it, the same way as in the earlier post:

    $ sudo parted -a optimal /dev/sdc
    ...
    $ sudo mkfs.ext4 -m 2 -T largefile4 /dev/sdc1
  4. Copy the data from the old disk. This step took several hours…

    # temporarily mount the new disk at /mnt/tmp
    sudo mount -t auto /dev/sdc1 /mnt/tmp
    # copy the old disk's data
    cp -av /mnt/data/disk2/. /mnt/tmp
  5. Remount the filesystems:

    # umount mergerfs
    $ sudo umount /mnt/storage
    
    # umount the old disk
    $ sudo umount /mnt/data/disk2
    
    # edit /etc/fstab to mount the new disk at the original location
    /dev/disk/by-id/ata-XXXXXXXXXXX-part1 /mnt/data/disk2 ext4 defaults 0 2
    
    # remount
    sudo mount /mnt/data/disk2/
  6. Verify the data with snapraid; note how snapraid detects that d2's UUID has changed.
    The snapraid check step also takes several hours (the brave may skip it):

    $ sudo snapraid diff
    Loading state from /var/snapraid/snapraid.content...
    UUID change for disk 'd2' from 'dd5c3760-1e4e-4d72-b710-56c782f416c3' to '55a67bea-3ca1-4cbe-b9ed-83ca9825d627'
    Comparing...
    WARNING! UUID is changed for disks: 'd2'. Not using inodes to detect move operations.
    No differences
    
    # check data consistency
    $ sudo snapraid check -a -d d2
    Self test...
    Loading state from /var/snapraid/snapraid.content...
    UUID change for disk 'd2' from 'dd5c3760-1e4e-4d72-b710-56c782f416c3' to '55a67bea-3ca1-4cbe-b9ed-83ca9825d627'
    Selecting...
    Using 1235 MiB of memory for the file-system.
    Initializing...
    Selecting...
    Hashing...
    8%, 191365 MB, 123 MB/s, 524 stripe/s, CPU 0%, 4:22 ETA
  7. Last step: run a sync:

    $ sudo snapraid sync

At this point the SnapRAID side of things is done.

Other notes

Compared with the aufs setup in the earlier post, I later switched to mergerfs once Ubuntu stopped supporting aufs; its performance is decent.

My configuration is:

/mnt/data/* /mnt/storage fuse.mergerfs defaults,allow_other,use_ino,cache.files=partial,dropcacheonclose=true,category.create=mfs,minfreespace=20G,fsname=mergerfs 0 0

Finally, pull out the old disk and remount /mnt/storage, and everything is done.


A single-NIC side router, and routing different devices through different gateways

I have been tinkering with a "side router" (a secondary gateway on the LAN) lately, which raised a requirement: different devices should be given different routes.
For example:

  • the vast majority of ordinary devices (all the IoT gadgets, say) keep using the original router and stay entirely inside the wall;
  • a few chosen devices (phones, iPads, etc.) use the side router, which splits traffic between domestic and foreign destinations.

It turns out dnsmasq makes this easy to solve, so here is a quick write-up.

The side router

My home network is simple: the telecom modem is switched to bridge mode, the router connects to it, and all devices sit in one LAN.
From benchmarks I had seen, proxying on the router itself hits a performance bottleneck, while the Woniu Xingji box I use as a NAS is mostly idle, making it a perfect side router.
The goal: run LEDE in a KVM virtual machine, with nothing but the proxy tooling enabled.

Configuring KVM

My Woniu Xingji is the single-NIC gigabit version, so a bridge must be configured for the KVM guest to use.
With netplan on Ubuntu this is simple:

$ cat /etc/netplan/01-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enp3s0:
      dhcp4: no
      dhcp6: no
  bridges:
    br0:
      interfaces: [enp3s0]
      dhcp4: no
      dhcp6: no
      addresses: [<lan-ip>/24]
      gateway4: <lan-gateway>
      nameservers:
        addresses: [<nameservers>]
      parameters:
        stp: true

Then netplan apply does the job.

After installing KVM, create the VM with virt-manager's GUI (a bit more convenient), load the LEDE image, and in the network settings just remember to manually select the br0 bridge created above and set Device model to virtio (best performance).

Then make the VM start automatically on boot:

virsh autostart lede # lede is the name given when creating the VM

Configuring LEDE

On first boot, LEDE's default IP is 192.168.1.1; you can change it to whatever you want from the console first.

virsh console lede
## everything below happens inside the LEDE VM
# cat /etc/config/network
...
config interface 'lan'
        option ifname 'eth0'
        option proto 'static'
        option ipaddr '<ip>'   # set LEDE's LAN IP
        option netmask '255.255.255.0'
        option gateway '<gateway>'
        option ip6assign '60'
        option multipath 'off'
        option dns '<dns, or gateway>'
...

# /etc/init.d/network restart # restart the network service

Then open LEDE in a browser and, under Network -> Interfaces:

  • turn off DHCP on the LAN interface
  • in the WAN interface's physical settings, untick "bridge interfaces" and pick eth0 (since I only have one NIC)
  • in the WAN interface's general settings, choose "DHCP client"

LEDE can now happily reach the internet.

Configuring v2ray

The whole point of a side router is getting past the GFW; I normally use v2ray.
The package used to appear in LEDE's software center, but for reasons "you know" it was delisted, so it has to be downloaded and installed offline.

  • Google "离线 v2ray_2.3.7.tar.gz" yourself (at your own risk), e.g. https://github.com/hq450/fancyss_history_package/tree/master/fancyss_X64, and install the package.
  • In the server list, import links that start with vmess://.
  • In the account settings, pick a server, choose "mainland whitelist" or gfwlist as you prefer, and turn on the proxy switch. Once foreign sites load normally, it works.

Per-device routing

At this point, manually pointing a device's gateway at LEDE's LAN IP is enough to give that device proxied access.
But I want the DHCP server to hand out different gateways to specific devices (selected by MAC address, say).
Some googling showed that dnsmasq does exactly this, and my main router's Merlin firmware happens to use dnsmasq (it also relies on it for telecom IPTV).
The idea: tag MAC addresses, then set the gateway and nameserver per tag (see http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html):

# cat /jffs/configs/dnsmasq.d/customgateway.conf
dhcp-mac=red,<mac1>   # list the MACs that should use the side router, all under the same tag
dhcp-mac=red,<mac2>
dhcp-option = net:red, option:router, <gateway-IP>   # the side router's IP as the gateway
dhcp-option = net:red, 6, <nameserver-IP>   # the side router's IP as the nameserver (or any custom nameserver, as needed)

Then restart the dnsmasq service:

# service restart_dnsmasq

Note that some phones use a randomized MAC address; set the phone to use its real MAC on the home Wi-Fi.
After the phone or iPad reconnects, check its gateway: if it is already the side router's address, the setup works and the device can browse freely.
All other devices are unaffected and keep using the original route.

Done.


Off-site NAS backup

An earlier post, "my budget NAS and its configuration", described the single NAS at home. With SnapRAID it survives one failed disk, but if for some reason every disk in that NAS died, the data would be gone.
Backups have a 3-2-1 rule:

  • 3 copies of the data
  • 2 kinds of storage media
  • 1 off-site backup

After going back to my hometown for the Spring Festival, I was stuck there because of the coronavirus (2019-nCoV), which turned out to be a good chance to set up off-site NAS backup, giving me my own 2-1-1:

  • 2 copies of the data
  • 1 kind of storage medium
  • 1 off-site backup

For home use that is safe enough.
Here is the configuration of my off-site NAS backup.

Hardware

  • Same as before: a Woniu Xingji model A with a single gigabit NIC, 350 CNY shipped (Feb 2020 price, 100 more than last time!)
  • WD Purple 4TB × 1 (enough for the current data; more can be added when needed)

Software

For the NAS at my parents' home, cost rules out RAID; and since it has only one disk for now, LVM is the most convenient way to keep future expansion easy.
As this is an off-site backup, my first thought was that plain rsync would do.
But then again: when I am at my parents' place, I can back data up directly onto that NAS, and my parents' phones can back up locally too, so what I actually need is a two-way synchronization tool. Two rsync commands would of course work, but there ought to be a more suitable tool.
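For comparison, the two-rsync variant would look roughly like this (a sketch using two local directories to stand in for the two NASes; over SSH the other side would be a user@host: path). Note that -u only skips files that are newer on the receiving side, and nothing here handles deletions or real conflicts:

```shell
# two local directories stand in for the two NASes
mkdir -p /tmp/nas_local /tmp/nas_remote
echo "photo" > /tmp/nas_local/photo.txt
echo "doc"   > /tmp/nas_remote/doc.txt

# -a preserves attributes, -u skips files that are newer on the receiver
rsync -au /tmp/nas_local/  /tmp/nas_remote/
rsync -au /tmp/nas_remote/ /tmp/nas_local/

ls /tmp/nas_local /tmp/nas_remote    # both sides now hold photo.txt and doc.txt
```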

Given the needs above, I settled on:

  • Ubuntu (currently 19.10)
  • LVM, to make growing the storage easy later
  • WireGuard, to connect the two NASes
  • unison, for two-way synchronization
  • the same miscellaneous tools as the original NAS (hdparm, smartmontools, etc.)

LVM

I used to avoid LVM because it seemed like a hassle, but after setting it up on a server at work I found it quite pleasant, and this off-site NAS is a good place to use it.
The 4TB disk is split into two LVs (logical volumes), both ext4: one for files that need not be kept long-term (downloaded videos and the like), the other for the off-site backup of my main NAS.

# assume fdisk or parted was used to create /dev/sdb1 spanning all of /dev/sdb
pvcreate /dev/sdb1   # create the physical volume
vgcreate nas_data_vg /dev/sdb1   # create a volume group named nas_data_vg
lvcreate -n nas_download -L 800G nas_data_vg   # create an 800G LV
lvcreate -n nas_data nas_data_vg -l 100%FREE   # create an LV with all the remaining space

# format
mkfs.ext4 /dev/nas_data_vg/nas_download
mkfs.ext4 /dev/nas_data_vg/nas_data

# look up the blkids
blkid /dev/nas_data_vg/nas_data
blkid /dev/nas_data_vg/nas_download

# edit /etc/fstab for automatic mounting
# cat /etc/fstab
UUID=<blk-id-of-nas_data> /mnt/storage ext4 defaults 0 0
UUID=<blk-id-of-nas_download> /mnt/downloads ext4 defaults 0 0

Wireguard

I wrote earlier about setting up a WireGuard server and using WireGuard on a phone; configuring a client on Ubuntu is just as trivial. Enable it with systemd so it starts on boot, and the off-site NAS connects to my main NAS automatically after powering on.

sudo apt install wireguard

# cat /etc/wireguard/wg0.conf
[Interface]
Address = <Self IP in the VPN>/24
PrivateKey = <private-key>
[Peer]
PublicKey = <public-key-on-wireguard-server>
# AllowedIPs = 0.0.0.0/0 # to send the NAS's default route through the remote network
AllowedIPs = 192.168.2.0/24 # to allow only the WireGuard subnet to reach each other
Endpoint = <wireguard-server>:<port>

sudo systemctl enable wg-quick@wg0.service   # enable the service
sudo systemctl start wg-quick@wg0.service    # start the service

unison

A search for two-way sync tools turned up several open-source options; following the recommendations in https://askubuntu.com/questions/727304/automatically-do-a-two-way-sync-of-two-directories I picked unison, which looked clean and simple.
Installation is trivial: it is in the official Ubuntu repos, so a single apt command installs it.

sudo apt install unison

unison's basic usage is simple: unison root1 root2 synchronizes root1 and root2. Try it on a.tmp and b.tmp as in the official tutorial first, then write your own script.

$ cat ~/bin/sync_between_nas.sh
#!/bin/bash
# Note: $user and $nas (the ssh user and host of the other NAS) are
# expected to be set in the environment.

if [ $# -ne 1 ]; then
  echo "Usage:"
  echo "$0 <dir>"
  echo ""
  echo "Two-way sync <dir> between my NASes under /mnt/storage"
  exit 1
fi

dir_to_sync=$1
src=ssh://$user@$nas//mnt/storage/$dir_to_sync
dst=/mnt/storage/$dir_to_sync

# the directory must already exist on the remote NAS
if ssh "$user@$nas" "[ ! -d /mnt/storage/$dir_to_sync ]"; then
  echo "/mnt/storage/$dir_to_sync does not exist on $nas!"
  exit 1
fi

echo "To sync between \"$src\" and \"$dst\"..."
sudo unison $src $dst -batch -owner -group -prefer newer -times -nodeletion $src -nodeletion $dst

This script syncs a directory under /mnt/storage/ between the two NASes; the real work happens in the last line:

  • -batch: run without user interaction (otherwise unison asks which action to take for each change)
  • -owner, -group: preserve the owner:group of files. Note:
  • both NASes must have the same owners and groups created
  • since files need a chown after transfer, the script has to run as root (I do not know of a better way)
  • -prefer newer: on conflict, always prefer the newer file
  • -times: synchronize file modification times
  • -nodeletion $src -nodeletion $dst: never delete on either side, so that an accidental deletion cannot propagate to the other NAS

Then a small helper script lists the directories to back up and calls the script above:

#!/bin/bash

set -e
dirs_to_sync="<dir1> <dir2> ..."
for d in $dirs_to_sync; do
  echo $d
  /home/mine/bin/sync_between_nas.sh $d
done

Finally, a cron job runs the sync happily every day. Set MAILTO so cron emails the result, making it easy to check the sync status:

$ sudo crontab -l
MAILTO="<my-email-address>"
10 2 * * * /usr/bin/flock /tmp/sync_my_nas.lock /path/to/sync_my_nas_dirs.sh
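The flock in that crontab line is what prevents a still-running sync from overlapping with the next day's run. Its behavior can be seen with a toy example (nothing NAS-specific here; the lock file path is arbitrary):

```shell
# grab the lock in the background and hold it briefly
flock /tmp/demo.lock -c 'sleep 2' &

sleep 0.5   # let the background job acquire the lock first

# -n makes flock fail immediately instead of waiting
if flock -n /tmp/demo.lock -c 'true'; then
  echo "lock was free"
else
  echo "lock was held"
fi
wait
```

With the lock held by the background job, the second flock fails and the script prints `lock was held`; the crontab line omits -n, so a second sync would simply wait instead.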

Afterword

A friend later recommended Syncthing, which looks a bit like BTSync; maybe I will give it a try some day.


From a CI bug to systemd, to GCC

I was tracing a bug in sdbusplus found by OpenBMC CI; it led me into the code of systemd and eventually into GCC.
For a short introduction, see https://lists.ozlabs.org/pipermail/openbmc/2019-December/019884.html
Here is the full story of the investigation.

The sdbusplus CI issue

A CI issue was found in sdbusplus: valgrind reports the error below.

==5290== Syscall param epoll_ctl(event) points to uninitialised byte(s)
==5290== at 0x4F2FB08: epoll_ctl (syscall-template.S:79)
==5290== by 0x493A8F7: UnknownInlinedFun (sd-event.c:961)
==5290== by 0x493A8F7: sd_event_add_time (sd-event.c:1019)
==5290== by 0x190BB3: phosphor::Timer::Timer(sd_event*, std::function) (timer.hpp:62)
==5290== by 0x192B93: TimerTest::TimerTest() (timer.cpp:25)
==5290== by 0x193A13: TimerTest_timerExpiresAfter2seconds_Test::TimerTest_timerExpiresAfter2seconds_Test() (timer.cpp:85)
...
==5290== by 0x4A90917: main (gmock_main.cc:69)
==5290== Address 0x1fff00eafc is on thread 1's stack
==5290== in frame #0, created by epoll_ctl (syscall-template.S:78)
==5290==

Clearly, valgrind detects that some uninitialized data is used.
However, the issue is not 100% reproducible; it only occurs sometimes. How could that be?

At first glance, it is sdbusplus's Timer class that invokes sd_event_add_time() from libsystemd, which eventually invokes epoll_ctl().
So I suspected something might be wrong in Timer, such as passing uninitialized data to sd_event_add_time().

Investigation in sdbusplus

Let’s see the related code.

void initialize()
{
    ...
    auto r = sd_event_add_time(
        event, &eventSource,
        CLOCK_MONOTONIC, // Time base
        UINT64_MAX,      // Expire time - way long time
        0,               // Use default event accuracy
        [](sd_event_source* eventSource, uint64_t usec, void* userData) {
            auto timer = static_cast<Timer*>(userData);
            return timer->timeoutHandler();
        },     // Callback handler on timeout
        this); // User data
    ...
}

The event is a pointer that is already initialized;
the eventSource is the out-parameter;
the others are plain data or a lambda. Nothing suspicious.

Investigation in libsystemd

So let's dive into libsystemd to see what exactly happens.
The relevant code is in sd_event_add_time():

_public_ int sd_event_add_time(sd_event *e, ...) {
    ...
    if (d->fd < 0) {
        r = event_setup_timer_fd(e, d, clock);
        if (r < 0)
            return r;
    }
    ...
}

Where:

  • e is sd_event* and clock is clockid_t, both passed into this function
  • d is struct clock_data*, initialized in this function

So nothing is wrong here.

Let's look at event_setup_timer_fd() then.

static int event_setup_timer_fd(...) {
    struct epoll_event ev;
    ...
    ev = (struct epoll_event) {
        .events = EPOLLIN,
        .data.ptr = d,
    };
    r = epoll_ctl(e->epoll_fd, EPOLL_CTL_ADD, fd, &ev);
    ...
}

The epoll_fd, fd, and ev are all initialized... or are they?

Let's see how epoll_ctl is implemented in the kernel source:

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
        struct epoll_event __user *, event)
{
    ...
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;
    ...
}

Note that valgrind says Syscall param epoll_ctl(event) points to uninitialised byte(s), and here we see the kernel copying the whole event struct from userspace, which means the event contains uninitialized bytes.
Let's go back to how the event struct is initialized:

    ev = (struct epoll_event) {
        .events = EPOLLIN,
        .data.ptr = d,
    };

And let’s see how the struct is defined in glibc.

typedef union epoll_data
{
  void *ptr;
  int fd;
  uint32_t u32;
  uint64_t u64;
} epoll_data_t;

struct epoll_event
{
  uint32_t events;  /* Epoll events */
  epoll_data_t data;    /* User data variable */
} __EPOLL_PACKED;

Hmm, events is a uint32_t and is initialized, and data is initialized as well. It looks fine...
Is it really fine?
data is a union that is at least 64 bits, while events is a uint32_t, only 32 bits, so there could be padding inside epoll_event if the struct is not packed.
And hey, there is __EPOLL_PACKED... let's grep for it in glibc:

$ grep -r __EPOLL_PACKED .
ChangeLog.old/ChangeLog.18:     (__EPOLL_PACKED): Define to empty if not defined by
ChangeLog.old/ChangeLog.18:     (struct epoll_event): Use __EPOLL_PACKED to make possibly packed.
sysdeps/unix/sysv/linux/sys/epoll.h:#ifndef __EPOLL_PACKED
sysdeps/unix/sysv/linux/sys/epoll.h:# define __EPOLL_PACKED
sysdeps/unix/sysv/linux/sys/epoll.h:} __EPOLL_PACKED;
sysdeps/unix/sysv/linux/x86/bits/epoll.h:#define __EPOLL_PACKED __attribute__ ((__packed__))

It is defined as __attribute__ ((__packed__)) for x86 and left empty for the other architectures.
Remember that the issue is not 100% reproducible?
The OpenBMC CI backend has both x86-64 and ppc64le servers, so we can guess that the padding causes the valgrind error, and that it happens only on ppc64le, because on x86-64 there is no padding at all.
The CI log confirms the guess: the issue only occurs on the ppc64le CI server!

So let's go back to the code in question:

    ev = (struct epoll_event) {
        .events = EPOLLIN,
        .data.ptr = d,
    };

It uses GCC's designated-initializer syntax.
I tried to google how GCC initializes the padding; there are discussions on Stack Overflow and in blogs, e.g. according to https://stackoverflow.com/questions/37642026/does-c-initialize-structure-padding-to-zero, it looks like this:

padding for the remaining objects is guaranteed to be 0, but not for the members which have received initializers.

But I could not find an official GCC document about this.
Let's do some experiments.

Testing GCC

The demo code below tests how the padding is initialized:

#include <string.h>
#include <stdint.h>
#include <stdio.h>

struct struct_with_padding {
        uint32_t a;
        uint64_t b;
        uint32_t c;
};
int main()
{
        struct struct_with_padding s;
        memset(&s, 0xff, sizeof(s));
        s = (struct struct_with_padding) {
                .a = 0xaaaaaaaa,
                .b = 0xbbbbbbbbbbbbbbbb,
#ifdef SHOW_GCC_BUG
                .c = 0xdddddddd,
#endif
        };
        uint8_t* p8 = (uint8_t*)(&s);
        printf("data: ");
        for (size_t i = 0; i < sizeof(s); ++i)
        {
                printf("0x%02x ", p8[i]);
        }
        printf("\n");
        return 0;
}

Compiling with and without SHOW_GCC_BUG gives different results:

$ gcc -o test_padding test_padding.c
$ ./test_padding
data: 0xaa 0xaa 0xaa 0xaa 0x00 0x00 0x00 0x00 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

$ gcc -DSHOW_GCC_BUG -o test_padding test_padding.c
$ ./test_padding
data: 0xaa 0xaa 0xaa 0xaa 0xff 0xff 0xff 0xff 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xdd 0xdd 0xdd 0xdd 0xff 0xff 0xff 0xff

GCC behaves like this:

  • if a struct is partially initialized, all the padding is initialized to zero;
  • if a struct is fully initialized, the padding keeps whatever old data was there. The latter is exactly what happens in my case!

How about clang?

$ clang -o test_padding test_padding.c
$ ./test_padding
data: 0xaa 0xaa 0xaa 0xaa 0x00 0x00 0x00 0x00 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

$ clang -DSHOW_GCC_BUG -o test_padding test_padding.c
$ ./test_padding
data: 0xaa 0xaa 0xaa 0xaa 0x00 0x00 0x00 0x00 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xdd 0xdd 0xdd 0xdd 0x00 0x00 0x00 0x00

OK, clang initializes the padding in both cases. Good!

When I went to file a GCC bug, I found that the exact same bug had already been reported in 2016, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77992, and it looks like GCC is not going to fix it...

Follow-ups

  1. I sent a PR to systemd as a workaround that manually zero-initializes the struct epoll_event.
    It is under discussion, but the systemd maintainer (@poettering) will likely not accept it, because it is not really a systemd bug; instead, @poettering treats it as a valgrind bug (if not a GCC bug).
  2. Although I do not think it is a valgrind bug, one was filed at https://bugs.kde.org/show_bug.cgi?id=415621; there is no further feedback yet.
  3. Without a fix in GCC, systemd, or valgrind, I had to add a valgrind suppression to OpenBMC CI: https://gerrit.openbmc-project.xyz/c/openbmc/sdbusplus/+/25548. Problem solved.

Summary

  • GCC has a bug where struct padding is left uninitialized when every member receives an initializer.
  • systemd hits the bug when initializing struct epoll_event on non-x86 systems.
  • OpenBMC CI has both x86-64 and ppc64le machines; when a run lands on ppc64le, the issue occurs.
  • Adding a valgrind suppression file fixes (or rather works around) the issue.