Donatas Abraitis

NetDev 2.2, South Korea

TL;DR;

This is a really different kind of conference format, 101% geekish. I’m sure at least 95% of the people there commit to the kernel tree, so one huge difference is that you feel like a total loser/idiot/you name it sitting among such super smart people ;-) But otherwise, I always remember that I’m an SRE/DevOps/engineer, where we clean up all the shit every day and tackle a different kind of problem. That’s fine.

TS;WM;

  • A little bit of history of the Netfilter ecosystem (ipfwadm, ipchains, iptables, nftables), including some tools from the past like patch-o-matic and nfsim. In addition, some people are asking for IPv6-to-IPv6 NA(P)T support in Netfilter, but NAT is what broke IPv4 end-to-end connectivity; let’s not do the same to IPv6. ulogd: everyone uses -j LOG, but ulogd seems to be the right way to do logging;
  • Linux Kernel Library - it’s actively used by Android, especially for MultiPath TCP purposes;
  • Mellanox Innova-TLS for offloading TLS on the receive side;
  • Arachne - a very nice tool for testing the limits of the Linux kernel. It’s somewhat comparable to Ansible, Vagrant (Kitchen?), or Cumulus VX, but very lightweight if you need to test a network topology with hundreds or thousands of racks/servers. There was an idea to use Docker for this, but the few words said about Docker were really cool (“overengineered crap”) ;-)
  • Lockless RTNetlink operations. This is still a really huge problem for networking, since it uses one global lock, which means that all operations are serialized. It affects the tc subsystem and iproute2 as well. There are some modifications to take the lock on demand, for specific functions only (watch out for ABBA deadlocks);
  • Multi-PCI-socket network devices are coming up. This is a really cool feature for the big players. Say you have a standard 200G NIC: your traffic will saturate the PCIe bus, because a single slot tops out at roughly 100G of usable transfer bandwidth (see the back-of-the-envelope numbers after this list). Even if saturation is not the problem, multi-socket NICs help with cache locality on NUMA architectures (pin a CPU socket to a PCI socket to avoid moving packets through QPI).
  • tc inserts around 50k rules/s today; what about 1M? It could be solved in a few ways: by batching or compounding rtnetlink messages and running them in parallel across different CPU workqueues (a batching sketch follows this list). There were some doubts from the audience (especially from David S. Miller) about the parallelism, and about who sanely needs to insert 1M rules into the kernel at once;
  • Lawrence Brakmo gave a great talk about TCP-BPF, which has been upstream since 4.13. It allows you to override socket-related options, and it could even be used as a congestion-avoidance mechanism. Facebook is testing it as a replacement for their existing LD_PRELOAD solution, setting socket buffers for different workloads: inter-DC, intra-DC, … Another useful example was how they distinguish inter-DC from intra-DC traffic: just by comparing a few bits of the IPv6 addresses. Frankly speaking, it’s nothing else but IPv6 as a metadata store (a minimal sockops sketch follows this list);
  • While XDP is getting to the top today, others are thinking about how to do even better. This is where HW-based hints come in. They would allow you to offload almost anything to hardware. Currently, there are a few types of hints, e.g. for filtering out packets by matching the source/destination address. Basically, those hints are put into the data_meta field (just before data) inside the skb’s available headroom, and an additional ndo function is called to load an ELF file into the NIC (see the data_meta sketch after this list). (slides: XDP HW hints)
  • Performance, hear hear… Intel presented yet another AF_PACKET version, V4, with zero-copy. It’s quite a young branch, but it sounds promising. In v2/v3 the kernel polls ring descriptors from the device, and user space communicates through the kernel. Pushing packets directly between user space and kernel space needs modifications, and that’s exactly what v4 does: when operating in PACKET_ZEROCOPY (ZC) mode, the device allocates packet buffers directly in process memory, and TX/RX are bundled into one memory space. Hardware-specific descriptors are not touched by user space, which means a translation is needed between the (virtual) descriptors pointing at the packet buffers and the hardware ones; that task is done by the kernel (a V3 ring setup is sketched after this list for contrast). (slides: AF_PACKET V4)
  • An inspiring keynote by David S. Miller about putting network-related structures on a diet. This is valid not only for network structures, but they made a great example. In short, the talk was about optimising existing data structures such as dst_entry and friends. In some places it would be even better to use an index rather than a pointer for lookups (a tiny illustration follows this list). (slides: Dietary_Kernel_Networking) “Real men compute structure offsets by hand.” - David S. Miller
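
Back-of-the-envelope numbers for the multi-PCI-socket point, assuming a PCIe 3.0 x16 slot (the common case for 100/200G NICs today):

    PCIe 3.0: 8 GT/s per lane, 128b/130b encoding  ->  ~7.88 Gbit/s per lane
    x16 slot: 16 lanes x ~7.88 Gbit/s              ->  ~126 Gbit/s raw
    minus TLP/DLLP protocol overhead               ->  roughly 100-110 Gbit/s usable

So a single x16 slot simply cannot feed a 200G NIC at line rate; splitting the device across two slots, one per CPU socket, attacks both the bandwidth and the NUMA locality problem at once.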
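
On the tc insertion rate: a minimal sketch of what “batching rtnetlink messages” means, using libmnl’s batch helpers, so that many rules travel in a single sendto() instead of one syscall per rule. The build_filter_msg() helper is hypothetical (building a real RTM_NEWTFILTER flower payload is elided); the batching calls are the actual libmnl API:

    #include <libmnl/libmnl.h>
    #include <linux/rtnetlink.h>
    #include <time.h>

    /* hypothetical helper: fills one RTM_NEWTFILTER message (rule #i)
       into the given buffer; the flower attribute building is elided */
    extern void build_filter_msg(void *buf, unsigned int i, unsigned int seq);

    void insert_rules_batched(struct mnl_socket *nl, unsigned int nrules)
    {
        char buf[MNL_SOCKET_BUFFER_SIZE * 2];
        struct mnl_nlmsg_batch *b = mnl_nlmsg_batch_start(buf, sizeof(buf));
        unsigned int seq = time(NULL);

        for (unsigned int i = 0; i < nrules; i++) {
            build_filter_msg(mnl_nlmsg_batch_current(b), i, seq + i);
            /* batch full? flush it with one sendto() and start over */
            if (!mnl_nlmsg_batch_next(b)) {
                mnl_socket_sendto(nl, mnl_nlmsg_batch_head(b),
                                  mnl_nlmsg_batch_size(b));
                mnl_nlmsg_batch_reset(b);
            }
        }
        if (!mnl_nlmsg_batch_is_empty(b))
            mnl_socket_sendto(nl, mnl_nlmsg_batch_head(b),
                              mnl_nlmsg_batch_size(b));
        mnl_nlmsg_batch_stop(b);
    }

Note that the kernel still takes the RTNL lock per message; spreading such batches across CPU workqueues was the (doubted) parallelism part of the talk.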
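
A minimal sockops sketch for the TCP-BPF part, in the spirit of the kernel’s samples/bpf tcp_* examples. INTER_DC_PREFIX is a made-up placeholder for whatever bits Facebook actually compares; the callbacks and the bpf_setsockopt() helper are the real 4.13 interface:

    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"   /* SEC(), bpf_setsockopt() */
    #include "bpf_endian.h"    /* bpf_ntohl() */

    /* values from asm-generic/socket.h (common arches) */
    #define SOL_SOCKET 1
    #define SO_SNDBUF  7
    #define SO_RCVBUF  8
    #define AF_INET6   10

    /* placeholder: high bits of an IPv6 address marking "another DC" */
    #define INTER_DC_PREFIX 0x2a030000

    SEC("sockops")
    int dc_bufs(struct bpf_sock_ops *skops)
    {
        int bufsize = 1500000;  /* bigger buffers for long inter-DC paths */

        switch (skops->op) {
        case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
            /* "IPv6 as a metadata store": a few bits of the remote
               address tell us whether the peer sits in another DC */
            if (skops->family == AF_INET6 &&
                (bpf_ntohl(skops->remote_ip6[0]) & 0xffff0000) ==
                INTER_DC_PREFIX) {
                bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
                               &bufsize, sizeof(bufsize));
                bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
                               &bufsize, sizeof(bufsize));
            }
            break;
        }
        return 1;
    }
    char _license[] SEC("license") = "GPL";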
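
And the data_meta side of the HW-hints idea (data_meta landed in the kernel right around this time): once the NIC or driver has written a hint just in front of the packet, an XDP program can act on it without touching the payload. The single-u32 “drop mark” layout is purely an assumption here; real layouts would be driver/firmware specific:

    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"

    #define DROP_MARK 0xdead  /* hypothetical hint value written by the NIC */

    SEC("xdp")
    int xdp_read_hint(struct xdp_md *ctx)
    {
        void *data      = (void *)(long)ctx->data;
        void *data_meta = (void *)(long)ctx->data_meta;
        __u32 *hint     = data_meta;

        /* no hint present (or shorter than we expect) */
        if ((void *)(hint + 1) > data)
            return XDP_PASS;

        return *hint == DROP_MARK ? XDP_DROP : XDP_PASS;
    }
    char _license[] SEC("license") = "GPL";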
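
No example for AF_PACKET V4 itself, since its descriptors and setsockopts were not upstream yet and may still change; for contrast, here is roughly what today’s TPACKET_V3 RX ring setup looks like (error checks omitted), i.e. the model where the kernel still copies every packet into a ring it shares with user space:

    #include <arpa/inet.h>       /* htons() */
    #include <linux/if_ether.h>  /* ETH_P_ALL */
    #include <linux/if_packet.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int setup_v3_rx_ring(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        int ver = TPACKET_V3;

        setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

        struct tpacket_req3 req = {
            .tp_block_size     = 1 << 22,
            .tp_block_nr       = 64,
            .tp_frame_size     = 1 << 11,
            .tp_frame_nr       = ((1 << 22) / (1 << 11)) * 64,
            .tp_retire_blk_tov = 60,  /* block retire timeout, ms */
        };
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        /* the ring lives in kernel memory mapped into the process; the
           kernel still copies each packet into it - exactly the copy
           that V4's PACKET_ZEROCOPY mode is designed to remove */
        void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        (void)ring;  /* walk blocks/frames from here */
        return fd;
    }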
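
Finally, a tiny illustration of the “index instead of pointer” trick from the keynote (generic userspace C, not actual dst_entry code): on 64-bit, a pointer costs 8 bytes per reference, while an index into a shared table costs 4 or fewer, and that adds up across millions of cached entries:

    #include <stdint.h>
    #include <stdio.h>

    struct net_device { char name[16]; };

    /* pointer-based reference: 8 bytes each on 64-bit */
    struct ref_ptr { struct net_device *dev; };

    /* index-based reference: 4 bytes; the table itself is shared */
    static struct net_device dev_table[256];
    struct ref_idx { uint32_t dev_index; };

    static inline struct net_device *ref_dev(const struct ref_idx *r)
    {
        return &dev_table[r->dev_index];
    }

    int main(void)
    {
        struct ref_idx r = { .dev_index = 3 };

        printf("ref_ptr: %zu bytes, ref_idx: %zu bytes\n",
               sizeof(struct ref_ptr), sizeof(struct ref_idx));
        printf("dev #%u lives at %p\n", r.dev_index, (void *)ref_dev(&r));
        return 0;
    }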

Wrapping up

  • I have much more to add, but I neither remember it all that well nor have any pictures on my phone;
  • The uniqueness of this conference is that the talks you listen to haven’t been given anywhere else. That was a requirement when submitting a talk;
  • No marketing shit;
  • Jetlag is guaranteed;
  • Seoul is awesome;
  • A fun fact about this conference: when people raised questions to the speaker, I had two questions in my mind: do I not understand English, or am I dumb? I think it’s absolutely fine when you attend this marathon called netdev.