Tag Archives: c++

From a CI bug to systemd, to GCC

I was tracing a bug in sdbusplus found by OpenBMC CI, it leads me into the code in systemd, and eventually get me into GCC.
To get a short introduction, refer to https://lists.ozlabs.org/pipermail/openbmc/2019-December/019884.html
Here is the full story of the investigation.

The sdbusplus CI issue

A CI issue is found in sdbusplus, that the valgrind reports the below error.

==5290== Syscall param epoll_ctl(event) points to uninitialised byte(s)
==5290== at 0x4F2FB08: epoll_ctl (syscall-template.S:79)
==5290== by 0x493A8F7: UnknownInlinedFun (sd-event.c:961)
==5290== by 0x493A8F7: sd_event_add_time (sd-event.c:1019)
==5290== by 0x190BB3: phosphor::Timer::Timer(sd_event*, std::function) (timer.hpp:62)
==5290== by 0x192B93: TimerTest::TimerTest() (timer.cpp:25)
==5290== by 0x193A13: TimerTest_timerExpiresAfter2seconds_Test::TimerTest_timerExpiresAfter2seconds_Test() (timer.cpp:85)
...
==5290== by 0x4A90917: main (gmock_main.cc:69)
==5290== Address 0x1fff00eafc is on thread 1's stack
==5290== in frame #0, created by epoll_ctl (syscall-template.S:78)
==5290==

Clearly, valgrind detects that some uninitialized data is used.
However, this issue is not 100% reproduced, it only occurs sometimes, how could that be?

At a first glance, it’s sdbusplus‘s Timer class that invokes sd_event_add_time() from libsystemd, and eventually invokes epoll_ctl().
So I would suspect something may be wrong in Timer that it may pass uninitialized data to sd_event_add_time().

Investigation in sdbusplus

Let’s see the related code.

void initialize()
{
    ...
    auto r = sd_event_add_time(
        event, &eventSource,
        CLOCK_MONOTONIC, // Time base
        UINT64_MAX,      // Expire time - way long time
        0,               // Use default event accuracy
        [](sd_event_source* eventSource, uint64_t usec, void* userData) {
            auto timer = static_cast<Timer*>(userData);
            return timer->timeoutHandler();
        },     // Callback handler on timeout
        this); // User data
    ...
}

The event is a pointer that is already initialized;
The eventSource is the out-parameter.
Others are just simple data or a lambda, nothing suspicious.

Investigation in libsystemd

So let’s dive into libsystemd to see what exactly happens.
The related code is in sd_event_add_time().

_public_ int sd_event_add_time(sd_event *e, ...) {
    ...
    if (d->fd < 0) {
        r = event_setup_timer_fd(e, d, clock);
        if (r < 0)
            return r;
    }
    ...
}

Where:

e is sd_event*, and clock is clockid_t, both are passed into this function
d is struct clock_data* initialized in this function So nothing is wrong.

Let’s see event_setup_timer_fd() then.

static int event_setup_timer_fd(...) {
    struct epoll_event ev;
    ...
    ev = (struct epoll_event) {
        .events = EPOLLIN,
        .data.ptr = d,
    };
    r = epoll_ctl(e->epoll_fd, EPOLL_CTL_ADD, fd, &ev);
    ...
}

The epoll_fd, fd, and ev, are all initialized, are they?

Let’s see how epoll_ctl is implemented in kernel source.

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
        struct epoll_event __user *, event)
{
    ...
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;
    ...
}

Be noted that valgrind says Syscall param epoll_ctl(event) points to uninitialised byte(s), and here we do see that it’s copying the whole event struct from userspace, which means the event contains uninitialized bytes.
Let’s go back to see how the event struct is initialized:

    ev = (struct epoll_event) {
        .events = EPOLLIN,
        .data.ptr = d,
    };

And let’s see how the struct is defined in glibc.

typedef union epoll_data
{
  void *ptr;
  int fd;
  uint32_t u32;
  uint64_t u64;
} epoll_data_t;

struct epoll_event
{
  uint32_t events;  /* Epoll events */
  epoll_data_t data;    /* User data variable */
} __EPOLL_PACKED;

Hmm, the events is a uint32_t and is initialized, and the data is initialized as well, it looks fine…
Is it really fine?
data is a union and should be at least 64 bits, while events is uint32_t, which is 32 bits, there could be padding inside the epoll_event padding if it’s not packed.
Hey, there is __EPOLL_PACKED… let’s grep it in glibc:

$ gg __EPOLL_PACKED
ChangeLog.old/ChangeLog.18:     (__EPOLL_PACKED): Define to empty if not defined by
ChangeLog.old/ChangeLog.18:     (struct epoll_event): Use __EPOLL_PACKED to make possibly packed.
sysdeps/unix/sysv/linux/sys/epoll.h:#ifndef __EPOLL_PACKED
sysdeps/unix/sysv/linux/sys/epoll.h:# define __EPOLL_PACKED
sysdeps/unix/sysv/linux/sys/epoll.h:} __EPOLL_PACKED;
sysdeps/unix/sysv/linux/x86/bits/epoll.h:#define __EPOLL_PACKED __attribute__ ((__packed__))

It is defined to __attribute__ ((__packed__)) for x86, and not defined for other architectures.
Remember that the issue is not 100% reproduced, right?
The OpenBMC CI backend has both x86-64 and ppc64le servers, so we could guess that the padding causes the valgrind error, and it only happens on ppc64le but not on x86-64, because on x86-64 there is no padding at all.
From the CI log, it confirms that the guess is correct: the issue only occurs on the ppc64le CI server!

So let’s go back to the question code:

    ev = (struct epoll_event) {
        .events = EPOLLIN,
        .data.ptr = d,
    };

It’s using GCC’s Designated Initializers extension
I tried to google how GCC initialize the padding, there are discussions in StackOverlows and blogs, e.g. according to https://stackoverflow.com/questions/37642026/does-c-initialize-structure-padding-to-zero, it looks like the case:

padding for the remaining objects are guaranteed to be 0, but not for the members which has received the initializers.

But I do not see an official GCC doc talking about this.
Let’s do some experiments.

Testing in GCC

Giving below demo code to test how the padding is initialized:

#include <string.h>
#include <stdint.h>
#include <stdio.h>

struct struct_with_padding {
        uint32_t a;
        uint64_t b;
        uint32_t c;
};
int main()
{
        struct struct_with_padding s;
        memset(&s, 0xff, sizeof(s));
        s = (struct struct_with_padding) {
                .a = 0xaaaaaaaa,
                .b = 0xbbbbbbbbbbbbbbbb,
#ifdef SHOW_GCC_BUG
                .c = 0xdddddddd,
#endif
        };
        uint8_t* p8 = (uint8_t*)(&s);
        printf("data: ");
        for (size_t i = 0; i < sizeof(s); ++i)
        {
                printf("0x%02x ", p8[i]);
        }
        printf("\n");
        return 0;
}

Compile with or without SHOW_GCC_BUG, the result is different:

$ gcc -o test_padding test_padding.c
$ ./test_padding
data: 0xaa 0xaa 0xaa 0xaa 0x00 0x00 0x00 0x00 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

$ gcc -DSHOW_GCC_BUG -o test_padding test_padding.c
$ ./test_padding
data: 0xaa 0xaa 0xaa 0xaa 0xff 0xff 0xff 0xff 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xdd 0xdd 0xdd 0xdd 0xff 0xff 0xff 0xff

GCC behaves like:

If a struct is partial initialized, the all the padding are initialized to zero;
If a struct is fully initialized, the padding remains the old data. This is exactly what happens in my case!

How about clang?

$ clang -o test_padding test_padding.c
$ ./test_padding
data: 0xaa 0xaa 0xaa 0xaa 0x00 0x00 0x00 0x00 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

$ clang -DSHOW_GCC_BUG -o test_padding test_padding.c
$ ./test_padding
data: 0xaa 0xaa 0xaa 0xaa 0x00 0x00 0x00 0x00 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xbb 0xdd 0xdd 0xdd 0xdd 0x00 0x00 0x00 0x00

OK, clang initializes the padding in both cases, good!

While I tried to file a bug to GCC, I found an exact same bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77992, which is reported in 2016, and it looks like GCC is not going to fix it…

Follow-ups

I tried to send a PR to systemd as a workaround that manually zero initialize the struct epoll_event.
It’s under discussion, but likely the maintainer of systemd (@poettering) will not accept the PR, because it’s not really a systemd bug, instead, @poettering treats it a Valgrind bug (if not a GCC bug).
Although I do not think it’s a Valgrind bug, a bug is filed to https://bugs.kde.org/show_bug.cgi?id=415621, there is no further feedback yet.
Without GCC or systemd or valgrind’s fix, I have to add a valgrind suppression to OpenBMC CI https://gerrit.openbmc-project.xyz/c/openbmc/sdbusplus/+/25548, problem solved.

Summary

GCC has a bug of not initializing the padding.
systemd hits the bug on initializing struct epoll_event on non-x86 systems.
OpenBMC CI has both x86-64 and ppc64le systems. If a CI is run on ppc64le, the issue occurs.
Adding a valgrind suppression file fixes (or workarounds) the issue.

记两个gcc的bug

24 Replies

一般来说程序的编译、运行遇到问题，只会怀疑code有问题，而不会怀疑是编译器的问题。不过日子久了，总会遇到各种各样奇怪的问题，最近遇到了两个gcc相关的bug，记录一下。

问题一

在编译一个并不复杂的文件的时候，gcc报内存不够的错误。
这个文件只是定义了一个嵌套了几层的std::map，而且，如果用GCC 4.8来编译，很快就编译完了，占的内存也不多。
然而，如果用GCC 6.2 (或者最新的7.1)来编译，会发现占用的时间非常长，而且占用内存特别多，最终（如果内存不够）会报internal compiler error: Killed的错。
观察发现，在编译的过程中gcc占用的内存越来越多，分配了足够多的swap之后能编译成功，最终内存要占到16~20GiB左右。
同事去submit了一个bug，https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80290，被confirm了。
不过，从comments看，似乎还挺复杂的，什么时候能修好，就难说了。。。
另外，用clang的话也不会出现这个bug。

问题二

在一个动态库里，遇到一个std::unique_ptr相关的segment fault。这个变量在constructor里明明创建过了，但是运行时发现它还是nullptr。
简化一下问题，如果一个动态库里有类似这样的code:

std::unique_ptr<MyStruct> s = nullptr;
void onLoad() __attribute__((constructor));
void onLoad()
{
  s = std::make_unique<MyStruct>();
}

当这个动态库被dlopen的时候，我期望s会被onLoad()初始化好，然而在别的函数里用这个s的时候，会segment fault，因为s里面是个空指针！
关于这个问题，我在SO上提了个问题，结果发现，跟这个GCC的这个bug有关：

简单的说，动态库里 constructor 和 global/static 变量的初始化顺序是unspecified，所以有可能onLoad()先执行，然后再初始化global变量s, 把它初始化成nullptr。注意这个时候只是初始化，原来的变量并不会被析构，其实也是内存泄露。
不过，这个bug看上去暂时不会被修复，只是会在文档里注明这种情况初始化顺序是不定的。
要修复这个问题，需要给变量指定初始化顺序：

std::unique_ptr<MyStruct> s __attribute__((init_priority(101))) = nullptr;

注意init_priority的值设为101~65534都可以。 0~100是reserved，用了会有warning；最大值是65535，也是默认值，要是设成65535，就和__attribute__((constructor))一样，顺序就变成unspecified了。
另外，clang也支持这个属性，表现也gcc一样，有同样的bug。

Brief introduction to C++11 Smart Pointer

关于动态链接库(dynamic library)里static变量和函数的问题

Leave a reply

前段时间遇到一个问题，如果有多个动态链接库同时link另外一个静态链接库，这个静态库里的全局变量（或者static变量）会怎么样呢？
拍脑袋想想，总共也就这么几种可能：

link(dlopen)时报错，变量重定义
link(dlopen)时没错，执行时用用同一个变量
link(dlopen)时没错，执行时分别有不同的变量
link(dlopen)时报错，未定义的变量

仔细想想，分析一下，各种可能性都有，得看这些库和可执行文件是怎么编译链接的才行，具体看下面的各种case。
同时，测试的code里在dynamic里加上了同名的函数，看看函数会有什么表现。

假设有如下文件(code见Gist)：

    common.h
    common.cpp  // 生成libcommon.a
    dynamic1.h
    dynami1.cpp  // 生成libdynamic1.so
    dynamic2.h
    dynamic2.cpp // 生成libdynamic2.so
    main.cpp  // 生成main

生成libcommon.a总是这么编译：

    g++  -g -Wall -Wextra  -c -o common.o common.cpp
    ar rcs libcommon.a common.o

Case 0: 动态链接库不link .a，main直接link .so，生成main的时候也不link .a

    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic1.so dynamic1.cpp
    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic2.so dynamic2.cpp
    g++ -DDIRECT_CALL_SO -o main main.cpp -L. -ldynamic1 -ldynamic2

这种情况结果显然是4，link时出错，找不到.a里的函数。

Case 1: 动态链接库不link .a，main直接link .so，生成main的时候link .a
```
    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic1.so dynamic1.cpp
    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic2.so dynamic2.cpp
    g++ -DDIRECT_CALL_SO -o main main.cpp -L. -ldynamic1 -ldynamic2 -lcommon
    LD_LIBRARY_PATH=. ./main
```
这种情况下执行的结果是2，link时没错，执行时看到的也是同一个变量。
GetInt()返回值是1，它依赖于-l的顺序，如果-ldynamic2在前，返回值就是2了。

Case 2: 动态链接库link .a，main直接link .so，生成main的时候无所谓要不要link .a

    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic1.so dynamic1.cpp -L. -lcommon
    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic2.so dynamic2.cpp -L. -lcommon
    g++ -DDIRECT_CALL_SO -o main main.cpp -L. -ldynamic1 -ldynamic2
    LD_LIBRARY_PATH=. ./main

执行结果同上

Case 3: 动态链接库不link .a，main不链接.so，通过dlopen()调用so，无所谓链接.a

    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic1.so dynamic1.cpp
    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic2.so dynamic2.cpp
    g++ -o main main.cpp -ldl [-L. -lcommon]
    LD_LIBRARY_PATH=. ./main

执行结果是4，dlopen的时候找不到 GetGlobalStatic()。

Case 4: 动态链接库link .a，main不链接.so，通过dlopen()调用so(不用RTLD_GLOBAL)，无所谓链接.a

    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic1.so dynamic1.cpp -L. -lcommon
    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic2.so dynamic2.cpp -L. -lcommon
    g++ -o main main.cpp -ldl
    LD_LIBRARY_PATH=. ./main

执行结果是3，各有各自的变量。
GetInt()返回的也是各自的变量。

Case 5: 动态链接库link .a，main不链接.so，通过dlopen()调用so(使用RTLD_GLOBAL)，无所谓链接.a

    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic1.so dynamic1.cpp -L. -lcommon
    g++ -g -Wall -Wextra -fPIC -shared -o libdynamic2.so dynamic2.cpp -L. -lcommon
    g++ -DUSE_RTLD_GLOBAL -o main main.cpp -ldl -L. -lcommon
    LD_LIBRARY_PATH=. ./main

执行结果变成2了，执行时看到了同一个变量。
GetInt()返回的仍然是各自的变量。

看了这么多case，结果1怎么没有出现呢？别急，如果编译的时候全部编译在一起：

    g++ -DDIRECT_CALL_SO -o main main.cpp dynamic1.cpp dynamic2.cpp common.cpp

就出错了，GetInt()重定义。

结论

对于不同so link的.a里的变量:
1. 如果是直接link的，总是用同一个变量。仔细想想肯定是这样，否则一定会出现multiple definition
2. 如果是dlopen，它依赖dlopen()的flag：
  - 如果是RTLD_LOCAL(默认)，各个so会使用各自的.a里的变量。
  - 如果用RTLD_GLOBAL，就跟直接link一样，用同一个变量了。
对于不同so里的同名函数:
1. 通过dynamic link或者dlopen时不会出现变量重定义;
2. 直接link时的顺序决定了main使用哪个so的实现
3. dlopen的话，它们有各自的函数

比较boost::ptr_vector和std::vector

17 Replies

今天有同事问起来关于boost的smart pointer的事情，原因是别人有一段code用了boost::scoped_ptr<>*，review的时候被揪出来说这不符合常理，讨论应该用啥容器。

基本情况就是有一些资源需要new出来放在一个容器里，这个容器的生命周期由自己控制，但是需要把new出来的东西作为一个数组（或者容器）传给别人。

原来的code写成了一个boost::scoped_ptr<>的数组，然后传给别人，像下面这样：

boost::scoped_ptr someResource[number];
for (int i = 0; i < number; ++i) {
  someResource[i].reset(new T);
}
OtherFunction(someResource);

很显然，这样的code能编译，能正常工作（只要OtherFunction的实现没问题）；只是，真的太不符合common sense了。怎么整？

直觉上来说，既然是一个指针的数组，而且要传给别人，那用std::vector<boost::shared_ptr<T>>最合适了，然后传个const&给别人，搞定。

不过看到瑞典同事有人用boost::ptr_vector，这个新鲜的玩意儿不常见，研究一下，原来是Boost.Pointer Container的一部分，用来保存heap-allocated objects，有放进去的指针会在出了作用域之后自动删除，所以有”own”的语义。

相比起shared_ptr的容器，有各种优点（见上面link里的advantages 1~8）。
当然，最主要的是语义上的不同：

boost::ptr_vector保存的是“own”的对象;
std::vector<boost::shared_ptr<>>保存的对象可以被别人own

然后，从效率上来说，ptr_vector显然要更好一点，因为创建shared_ptr还是有开销的。

回到上面的case，最简单的做法就是用shared_ptr的容器；更合适的做法是用ptr_vector。

那么，它们的效率到底能差多少呢？写段code跑跑看。

#include <boost/ptr_container/ptr_vector.hpp>
#include <boost/shared_ptr.hpp>
#include <vector>
#include <cstdio>
#include <ctime>
#include <stdint.h>

class TSomeData
{
private:
  int data;
public:
  TSomeData(int d)
    : data(d)
  {
    // Empty
  }
};

const int TEST_ITERATIONS = 10000000;

typedef std::vector<boost::shared_ptr > TVectorOfShared;
typedef boost::ptr_vector TPtrVector;

int main()
{
  clock_t start;
  clock_t end;

  start = ::clock();
  TVectorOfShared vectorOfShared;
  for (int i = 0; i < TEST_ITERATIONS; ++i) {
  // Test vector of shared_ptr
    boost::shared_ptr data(new TSomeData(i));
    vectorOfShared.push_back(data);
  }
  end = ::clock();
  printf("Vector of shared:\n  Time executed: %u\n",
         static_cast<uint32_t>((end - start) / (CLOCKS_PER_SEC/1000)));

  start = ::clock();
  TPtrVector ptrVector;
  for (int i = 0; i < TEST_ITERATIONS; ++i) {
  // Test ptr_vector
    TSomeData* data = new TSomeData(i);
    ptrVector.push_back(data);
  }
  end = ::clock();
  printf("PtrVector:\n  Time executed: %u\n",
         static_cast<uint32_t>((end - start) / (CLOCKS_PER_SEC/1000)));
  return 0;
}

跑一下结果如下(老式T400,跑着Ubuntu 14.04).

# g++ -O0
Vector of shared:
  Time executed: 7227
PtrVector:
  Time executed: 1507

# g++ -O2
Vector of shared:
  Time executed: 5090
PtrVector:
  Time executed: 731

无论是-O0还是-O2，都有着明显的差距。
结论：在语义合适的情况下，用ptr_vector有更好的效率。

最后，在stackoverflow上看到，如果项目的编译环境已经用c++11了，可以用std::vector<std::unique_ptr<T>>，测试一下,结果有点吃惊。

首先，unique_ptr相关的测试code如下：

  for (int i = 0; i < TEST_ITERATIONS; ++i) {
    std::unique_ptr data(new TSomeData(i));
    vectorOfUnique.push_back(std::move(data));
  }

测试结果：

# g++ -O0 -std=c++0x
Vector of Unique:
  Time executed: 5057
Vector of shared:
  Time executed: 4080
PtrVector:
  Time executed: 1681

# g++ -O2 -std=c++0x
Vector of Unique:
  Time executed: 835
Vector of shared:
  Time executed: 1794
PtrVector:
  Time executed: 743

让我吃惊的是：

在-O0下，unique_ptr反而是最慢的，不明白（如果有人知道为什么，请告诉我）
在-O2下，unique_ptr和ptr_vector有着差不多的效率；但是vector<shared_ptr>相比没有c++0x的时候效率也提升了很多！

结论：

在合适的语义下，ptr_vector最好（好吧，我更习惯用boost…）
能体会到，c++0x在标准库下有着更好的效率，至于是不是适合项目使用，看项目情况吧
在不用考虑效率的时候（个人觉得，开发中不需要在执行效率上太抠，把优化留到实际使用发现性能之后），vector<shared_ptr>最万能。

Q.E.D

This is Mine

on my way…