donatas abraitis

$Cisco high CPU usage with RPKI enabled validation

I don’t know how many ISPs now use RPKI-based validation of BGP prefixes to guard against hijacking attacks, but a few years ago most ISPs still used the old-school method (prefix lists/ACLs) to separate “good” prefixes from “bad” ones received from neighbors. I remember that when I worked at an ISP we dealt with this problem as well.

Today I found an old email that led to RPKI being implemented across that ISP’s whole infrastructure. Translated, the email reads:

Hi there,

I would like to propose a cool new feature named "RPKI for BGP". It's basically a simple idea: filter out bad prefixes from neighbors.

More information is available at this link: https://www.ripe.net/manage-ips-and-asns/resource-management/certification/bgp-origin-validation

In short, it has these benefits:

* You no longer have to explicitly define ACLs listing which prefixes to allow from neighbors;
* You get better protection against route hijacks;
* It's a centralized database (ROA records) that the BGP process uses as the source for validation.

If you are interested, please contact me and we will find the best way to implement this!

That was in 2013. My calendar now shows 2017, but the situation is still poor, as the graph below depicts:

[Graph: RPKI ROA records]

A few emails further up I found a thread about $Cisco high CPU usage. It wasn’t about a SOHO/home router; it was about a 7600 series core router.

Here are a few implementation details on the Cisco side. First, specify the RPKI replication server, which is basically a local Linux-based server with the Java rpki-validator-app application installed:

router bgp 21412
  bgp rpki server tcp X.X.X.X port 8282 refresh 600
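
For context, the router pulls ROA data from that server over the RPKI-to-Router (RTR) protocol. Below is a minimal Python sketch that decodes a single IPv4 Prefix PDU using the field layout from RFC 6810. The function name and the sample prefix/ASN (documentation values) are my own for illustration; this is not how IOS implements it.

```python
import ipaddress
import struct

# RFC 6810 IPv4 Prefix PDU (20 bytes, network byte order):
# version, PDU type (4), reserved, length, flags,
# prefix length, max length, reserved, IPv4 prefix, origin ASN.
IPV4_PREFIX_PDU = struct.Struct("!BBHIBBBB4sI")


def decode_ipv4_prefix_pdu(data: bytes) -> dict:
    """Decode one IPv4 Prefix PDU (hypothetical helper, illustration only)."""
    (version, pdu_type, _reserved, length, flags,
     prefix_len, max_len, _zero, raw_prefix, asn) = IPV4_PREFIX_PDU.unpack(data)
    if pdu_type != 4 or length != IPV4_PREFIX_PDU.size:
        raise ValueError("not an IPv4 Prefix PDU")
    prefix = ipaddress.ip_network(
        f"{ipaddress.IPv4Address(raw_prefix)}/{prefix_len}")
    return {
        "version": version,
        "announce": bool(flags & 1),  # 1 = announcement, 0 = withdrawal
        "prefix": str(prefix),
        "max_length": max_len,
        "asn": asn,
    }


# Made-up sample: announce 192.0.2.0/24, max length 24, origin AS64500.
sample = IPV4_PREFIX_PDU.pack(
    0, 4, 0, 20, 1, 24, 24, 0,
    ipaddress.IPv4Address("192.0.2.0").packed, 64500)
print(decode_ipv4_prefix_pdu(sample))
```

Each such record ends up as one (prefix, max length, origin AS) entry that the BGP process can validate received routes against.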

The route-map applied to a specific neighbor:

route-map LT:LITNET-IN permit 10
  match rpki valid
  set local-preference 110
route-map LT:LITNET-IN permit 15
  match rpki unknown
  set local-preference 100
route-map LT:LITNET-IN deny 20
  match rpki invalid
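
To make the three states concrete, here is a rough Python sketch of origin validation along the lines of RFC 6811, followed by the local-preference policy from the route-map above. The `Roa` type, the function names and the sample ROA are assumptions of mine for illustration; "unknown" corresponds to what the RFC calls "not found".

```python
import ipaddress
from typing import Iterable, NamedTuple, Optional


class Roa(NamedTuple):
    prefix: str       # e.g. "192.0.2.0/24"
    max_length: int   # longest prefix length the ROA authorizes
    asn: int          # authorized origin AS


def validation_state(route_prefix: str, origin_asn: int,
                     roas: Iterable[Roa]) -> str:
    """Return "valid", "invalid" or "unknown" for one received route."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA covers the route if the route lies inside the ROA prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # AS 0 in a ROA never matches any origin.
        if (roa.asn != 0 and roa.asn == origin_asn
                and route.prefixlen <= roa.max_length):
            return "valid"
    # Covered by some ROA but no match: invalid. No covering ROA: unknown.
    return "invalid" if covered else "unknown"


def local_preference(state: str) -> Optional[int]:
    """Mirror the route-map: 110 for valid, 100 for unknown, None = denied."""
    return {"valid": 110, "unknown": 100}.get(state)


roas = [Roa("192.0.2.0/24", 24, 64500)]  # made-up ROA
for prefix, asn in [("192.0.2.0/24", 64500), ("192.0.2.0/25", 64500),
                    ("192.0.2.0/24", 64501), ("198.51.100.0/24", 64500)]:
    state = validation_state(prefix, asn, roas)
    print(prefix, asn, state, local_preference(state))
```

Note that a more-specific announcement (the /25 here) from the right AS is still invalid when it exceeds the ROA's max length.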

A few minutes after turning this on, the BGP Router process alone was using ~35% CPU, which is really abnormal:


CPU utilization for five seconds: 99%/2%; one minute: 31%; five minutes: 14%
 PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
  589     9404348    61152106        153 35.55% 10.90%  2.97%   0 BGP Router
  318     2944984     8481011        347 31.86%  6.22%  1.66%   0 IP RIB Update

It was due to the heavy route processing needed for ~500k routes (the global table). The best route is always selected from those with a valid RPKI state. If you turn RPKI on for a single router, let’s say one inside a cluster, you should notice INVALID RPKI states on routes from iBGP peers, because those peers are not RPKI-capable. In addition, peers have to send the extended community to each other. Applied on a single router, this breaks traffic load balancing, so RPKI MUST be applied on all routers, at least inside the cluster. Because the other iBGP peers didn’t have it implemented, I avoided this behavior by disabling prefix validation:

bgp bestpath prefix-validate disable
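
Why a state mismatch hurts load balancing can be shown with a toy Python model. It assumes the validation state is compared before anything else and reduces the rest of best-path selection to local preference only, which is a big simplification of what the router really does; the function, field names and sample paths are hypothetical.

```python
# Toy model: higher rank wins when prefix validation is enabled.
RANK = {"valid": 2, "unknown": 1, "invalid": 0}


def best_paths(paths, prefix_validate=True):
    """Return the next hops that tie for best and could share traffic."""
    def key(path):
        # With validation disabled, every path gets the same state rank.
        state_rank = RANK[path["state"]] if prefix_validate else 0
        return (state_rank, path["local_pref"])

    top = max(key(p) for p in paths)
    return [p["next_hop"] for p in paths if key(p) == top]


# Same prefix learned twice: once via the validating eBGP session,
# once from an iBGP peer that carries no usable RPKI state.
paths = [
    {"next_hop": "eBGP-peer", "state": "valid", "local_pref": 110},
    {"next_hop": "iBGP-peer", "state": "invalid", "local_pref": 110},
]

print(best_paths(paths))                         # only one path survives
print(best_paths(paths, prefix_validate=False))  # both paths tie again
```

In this model the two otherwise equal paths stop being equal as soon as only one of them has a valid state, which is the effect described above.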

Another interesting behavior was that there was no RPKI state inside a VRF, only inside the global table:

#sh ip bgp vpnv4 vrf lt 193.219.60.0
BGP routing table entry for 21412:5:193.219.60.0/24, version 181245
Paths: (2 available, best #1, table lt)
  Advertised to update-groups:
     8          36
  Refresh Epoch 1
  2847
    X.X.X.X from X.X.X.X (Y.Y.Y.Y)
      Origin IGP, localpref 170, valid, external, best
      Community: 21412:2847
      Extended Community: RT:21412:5
      rx pathid: 0, tx pathid: 0x0

I don’t know whether this still holds for current 15.x releases, but unfortunately it did back then.

Summing up

  • RPKI adoption is growing year by year, but, like IPv6, not as fast as we would like;
  • Believe me, RPKI is the way to go to improve security (even against untrusted neighbors or at a public IXP).