Busy polling with e1000 driver
Last week I spent a few hours playing with the SO_BUSY_POLL
socket option. It looks promising if you are building latency-sensitive applications, e.g. caching servers.
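For context, here is how an application opts in per socket; a minimal sketch (the value is the busy-poll budget in microseconds, and the net.core.busy_read sysctl sets the same thing globally):

#include <stdio.h>
#include <sys/socket.h>

/* Sketch: enable busy polling on an already-created TCP socket.
 * Raising the value above its current setting requires CAP_NET_ADMIN. */
static int enable_busy_poll(int fd)
{
	int usec = 50;	/* spin up to 50 us waiting for new packets */

	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof(usec)) < 0) {
		perror("setsockopt(SO_BUSY_POLL)");
		return -1;
	}
	return 0;
}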
I set up my lab in VirtualBox with an Ubuntu 16.04 virtual machine. I picked Redis and patched it to play with busy polling support, then ran benchmarks with rpc-perf, a very nice tool from Twitter. The results with and without busy polling were almost identical; no visible change at all.
Hence, I started to dig into how busy polling actually works in the Linux kernel.
OK, here is the first suspect:
int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
		int flags, int *addr_len)
{
	...
	if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue) &&
	    (sk->sk_state == TCP_ESTABLISHED))
		sk_busy_loop(sk, nonblock);
	...
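For reference, sk_busy_loop() is where the actual spinning happens: it looks up the NAPI context by the socket's sk_napi_id and repeatedly calls the driver's busy-poll hook. Roughly, from 4.4's include/net/busy_poll.h (a sketch, most details elided):

static inline bool sk_busy_loop(struct sock *sk, int nonblock)
{
	...
	napi = napi_by_id(sk->sk_napi_id);	/* hash lookup by NAPI id */
	if (!napi)
		goto out;

	ops = napi->dev->netdev_ops;
	if (!ops->ndo_busy_poll)	/* on 4.4 the driver must provide this hook */
		goto out;

	do {
		rc = ops->ndo_busy_poll(napi);
		...
	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
		 !need_resched() && !busy_loop_timeout(end_time));
	...
}

Note the napi_by_id() lookup: if sk_napi_id does not resolve to a hashed NAPI context, busy polling quietly does nothing.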
So, to call sk_busy_loop() you first have to pass the sk_can_busy_loop() check, which is:
static inline bool sk_can_busy_loop(struct sock *sk)
{
	return sk->sk_ll_usec && sk->sk_napi_id &&
	       !need_resched() && !signal_pending(current);
}
This is nothing more than checking whether SO_BUSY_POLL is set for the socket (sk_ll_usec, which is also seeded from the global net.core.busy_read sysctl) and whether sk_napi_id is non-zero. Looks trivial and obvious, so let's verify with a small bcc kprobe:
#include <net/sock.h>

int kprobe__tcp_recvmsg(struct pt_regs *ctx, struct sock *sk)
{
	/* log the busy-poll budget and NAPI id for sockets that opted in */
	if (sk->sk_ll_usec)
		bpf_trace_printk("usec: %d, napi_id: %d\n", sk->sk_ll_usec, sk->sk_napi_id);
	return 0;
}
This produced the following output:
usec: 50, napi_id: 0
sk_ll_usec is set correctly, but why is sk_napi_id zero? According to the code, a valid NAPI id should be higher than NR_CPUS; newer kernels even turn this into an explicit check on napi_id. Hence, the condition fails and busy polling is silently skipped.
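Later kernels spell this rule out as a constant in include/net/busy_poll.h:

/* ids 0..NR_CPUS are reserved (0 means "no id", 1..NR_CPUS are reused
 * for skb->sender_cpu), so a valid NAPI id starts above that range */
#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))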
Let's dive deeper. First, skb_mark_napi_id() is called by the driver's receive handler to set napi_id on the skb. Later this napi_id is propagated to the socket by sk_mark_napi_id(). So somewhere along this path the napi_id is carried as zero.
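Both helpers are trivial one-line copies, from include/net/busy_poll.h (roughly as they appear in 4.4):

static inline void skb_mark_napi_id(struct sk_buff *skb,
				    struct napi_struct *napi)
{
	skb->napi_id = napi->napi_id;	/* NAPI context -> skb */
}

static inline void sk_mark_napi_id(struct sock *sk, struct sk_buff *skb)
{
	sk->sk_napi_id = skb->napi_id;	/* skb -> socket */
}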
For instance, the ixgbe driver has the following code:
static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
			 struct sk_buff *skb)
{
	skb_mark_napi_id(skb, &q_vector->napi);
	if (ixgbe_qv_busy_polling(q_vector))
		netif_receive_skb(skb);
	else
		napi_gro_receive(&q_vector->napi, skb);
}
This explicitly marks the napi_id on the skb. If you look into the e1000 implementation, there is no such marking anywhere; it is handled only via the e1000_receive_skb() -> napi_gro_receive() path:
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
	skb_mark_napi_id(skb, napi);
	...
}
So although different drivers handle the marking differently, the skb does get marked in both cases. The real suspect is napi_hash_add(), the function that actually assigns napi_id: it looks like it is never called here, and that's why you always see zero (not a valid NAPI id).
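For reference, napi_hash_add() is the only place a non-zero napi_id ever gets assigned; slightly abbreviated from 4.4's net/core/dev.c:

void napi_hash_add(struct napi_struct *napi)
{
	if (!test_and_set_bit(NAPI_HASHED, &napi->state)) {
		spin_lock(&napi_hash_lock);

		/* 0 is not a valid id, we also skip an id that is taken */
		napi->napi_id = 0;
		while (!napi->napi_id) {
			napi->napi_id = ++napi_gen_id;
			if (napi_by_id(napi->napi_id))
				napi->napi_id = 0;
		}
		hlist_add_head_rcu(&napi->napi_hash_node,
				   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);

		spin_unlock(&napi_hash_lock);
	}
}

If nobody calls this for a given NAPI context, its id simply stays at whatever the driver's zeroed allocation left there, i.e. zero.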
Ubuntu 16.04 ships with a 4.4 kernel, where netif_napi_add() lacks the napi_hash_add() tail-call that later kernels have. Other drivers, even virtio_net, call napi_hash_add() explicitly after netif_napi_add(), which works correctly, but e1000 doesn't.
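The fix in later kernels is exactly that missing tail-call: netif_napi_add() does the hashing itself, so every NAPI driver gets a valid id for free. A sketch of both patterns (abbreviated, exact shape varies between versions):

/* later kernels: hashing happens inside the core */
void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
		    int (*poll)(struct napi_struct *, int), int weight)
{
	...
	set_bit(NAPI_STATE_SCHED, &napi->state);
	napi_hash_add(napi);	/* the tail-call 4.4 is missing */
}

/* 4.4: the driver has to opt in itself, as virtio_net does */
netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll, napi_weight);
napi_hash_add(&vi->rq[i].napi);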
Quick verification with SystemTap:
# stap -e 'probe kernel.function("netif_napi_add") { printf("napi_id: %d\n", $napi->napi_id); }'
And from another console, reloading the driver:
# rmmod e1000 ; modprobe e1000
The output from the tracing command was 0. Nothing ever assigns the id for e1000, so it keeps its zero-initialized value; effectively:

napi->napi_id = 0;
Final cuts
- As always, the network subsystem is fun;
- The best documentation is no documentation: just write readable code, because code is communication;
- If you want to play with busy polling, make sure you run at least a 4.8 kernel.