Donatas Abraitis

Busy polling with e1000 driver

Last week I spent a few hours playing with the SO_BUSY_POLL socket option. It looks promising if you are working on latency-bound applications, e.g. caching.
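
In userspace, enabling it per socket is a one-liner. A minimal sketch (enable_busy_poll() is just an illustrative helper, not from any real application):

#include <sys/socket.h>

/* Busy-poll the device queue for up to 50 microseconds on blocking
 * reads instead of sleeping. Unprivileged users may only lower the
 * value; raising it requires CAP_NET_ADMIN. */
static int enable_busy_poll(int fd)
{
        int usec = 50;

        return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec));
}

The same thing can be set system-wide via the net.core.busy_read (for read()/recv()) and net.core.busy_poll (for poll()/select()) sysctls.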

I set up my lab in VirtualBox with an Ubuntu 16.04 virtual machine. I picked Redis as an application to play around with and patched it to use busy polling. I ran benchmarks with rpc-perf, a very nice tool from Twitter. The results with and without busy polling were almost identical; no visible difference at all.

Hence, I started to dig into how busy polling actually works in the Linux kernel.

OK, the first stop is tcp_recvmsg():

int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
                int flags, int *addr_len)
{
        ...
        if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue) &&
            (sk->sk_state == TCP_ESTABLISHED))
                sk_busy_loop(sk, nonblock);
        ...
}

So, to call sk_busy_loop() you first have to satisfy sk_can_busy_loop(), which is:

static inline bool sk_can_busy_loop(struct sock *sk)
{
        return sk->sk_ll_usec && sk->sk_napi_id &&
               !need_resched() && !signal_pending(current);
}

This is nothing more than checking whether SO_BUSY_POLL is set for the socket (or the global sysctl setting is non-zero) and whether sk_napi_id is non-zero. Easy to verify with a trivial bcc program:

int kprobe__tcp_recvmsg(struct pt_regs *ctx, struct sock *sk)
{
        if (sk->sk_ll_usec)
                bpf_trace_printk("usec: %d, napi_id: %d\n", sk->sk_ll_usec, sk->sk_napi_id);
        return 0;
}

This produced the following output:

usec: 50, napi_id: 0

sk_ll_usec is set correctly, but why is sk_napi_id zero? According to the code, a valid NAPI ID should be higher than NR_CPUS; newer kernels even add an explicit check for this. Hence, the condition fails and busy polling is silently skipped.
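
Newer kernels (4.12+) spell this rule out in include/net/busy_poll.h, shown here just for reference; on 4.4 the check is simply that sk->sk_napi_id is non-zero:

/* IDs below this can be sender_cpu values, since skb->napi_id and
 * skb->sender_cpu share the same storage. */
#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))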

Let’s dive deeper. First, skb_mark_napi_id() is called by the receive handler to set napi_id on the skb. Later this napi_id is propagated to the socket by sk_mark_napi_id(). So somewhere along this path the napi_id is being carried as zero.
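
Both helpers are trivial ID copies (simplified from include/net/busy_poll.h; the exact #ifdef guards vary between kernel versions):

static inline void skb_mark_napi_id(struct sk_buff *skb,
                                    struct napi_struct *napi)
{
        skb->napi_id = napi->napi_id;
}

static inline void sk_mark_napi_id(struct sock *sk, struct sk_buff *skb)
{
        sk->sk_napi_id = skb->napi_id;
}

So whatever napi->napi_id holds is exactly what the socket ends up with.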

For instance, the ixgbe driver has this code:

static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
                         struct sk_buff *skb)
{
        skb_mark_napi_id(skb, &q_vector->napi);
        if (ixgbe_qv_busy_polling(q_vector))
                netif_receive_skb(skb);
        else
                napi_gro_receive(&q_vector->napi, skb);
}

This explicitly marks napi_id on the skb. If you look into the e1000 implementation, there is no such explicit marking; it happens indirectly via e1000_receive_skb() -> napi_gro_receive():

gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
        skb_mark_napi_id(skb, napi);
        ...
}
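
For completeness, here is the calling side in e1000 (abbreviated from drivers/net/ethernet/intel/e1000/e1000_main.c):

static void e1000_receive_skb(struct e1000_adapter *adapter, u8 status,
                              __le16 vlan, struct sk_buff *skb)
{
        skb->protocol = eth_type_trans(skb, adapter->netdev);
        ...
        napi_gro_receive(&adapter->napi, skb);
}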

So although different drivers handle this differently, the skb gets marked either way. The real problem is that napi->napi_id itself is never assigned, because napi_hash_add() is never called. That’s why you always get zero (not a valid NAPI ID).

Ubuntu 16.04 ships with a 4.4 kernel, where netif_napi_add() itself does not call napi_hash_add(); that tail-call only appeared in later kernels. Other drivers, virtio_net for example, call napi_hash_add() explicitly after netif_napi_add(), which works, but e1000 doesn’t.
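
The working pattern looks roughly like this (a sketch based on e1000’s probe path; the second call is the one e1000 is missing):

/* Register the NAPI poll function for the adapter... */
netif_napi_add(netdev, &adapter->napi, e1000_clean, 64);
/* ...and assign a valid napi_id. On 4.4 drivers had to do this
 * themselves; newer kernels call it from netif_napi_add() directly. */
napi_hash_add(&adapter->napi);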

Quick verification:

# stap -e 'probe kernel.function("netif_napi_add") { printf("napi_id: %d\n", $napi->napi_id); }'

And from another console, reloading the driver:

# rmmod e1000 ; modprobe e1000

The output from the tracing command was 0, which is nothing more than:

napi->napi_id = 0;
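
A complementary check (assuming napi_hash_add() is not inlined) is to probe it the same way; it should never fire while reloading e1000, since nothing ever assigns the ID:

# stap -e 'probe kernel.function("napi_hash_add") { printf("napi_hash_add called\n"); }'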

Final cuts

  • As always, the network subsystem is fun;
  • The best documentation is no documentation; just write readable code, because code is a form of communication;
  • If you want to play with busy polling, make sure you run at least a 4.8 kernel.