$Cisco high CPU usage with RPKI validation enabled
I don’t know how many ISPs now use RPKI-based validation of BGP prefixes to guard against hijacking attacks, but a few years ago most ISPs still used the old-school method (prefix lists/ACLs) to separate “good” and “bad” prefixes from neighbors. When I was working at an ISP, we dealt with this problem as well.
Today I found an old email that led to the RPKI implementation across that ISP's whole infrastructure. Translated, the email reads:
Hi there,

I would like to propose the new cool feature named "RPKI for BGP". It's basically a simple idea to filter out bad prefixes from neighbors. More information is detailed in this link: https://www.ripe.net/manage-ips-and-asns/resource-management/certification/bgp-origin-validation

In short, it has these benefits:
* You no longer have to explicitly define ACLs for which prefixes to allow from neighbors;
* You have increased safety regarding route hijacks;
* It's a centralized database (ROA records) which is used as a source for validation by the BGP process.

If you are interested, please contact me, we will find the best way to implement this!
That was in 2013. My calendar now shows 2017, but the situation is still bad, as depicted in the graph below:
A few emails further up I found a thread about $Cisco high CPU usage. This wasn’t a SOHO/home router; it was a 7600 series core router.
Here are a few implementation details on the Cisco side. First, specify the RPKI replication server, which is basically a local Linux-based server with the Java rpki-validator-app application installed:
router bgp 21412
 bgp rpki server tcp X.X.X.X port 8282 refresh 600
Route-map applied to a specific neighbor:
route-map LT:LITNET-IN permit 10
 match rpki valid
 set local-preference 110
route-map LT:LITNET-IN permit 15
 match rpki unknown
 set local-preference 100
route-map LT:LITNET-IN deny 20
 match rpki invalid
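To make the three states matched by the route-map more concrete, here is a minimal Python sketch of origin validation in the spirit of RFC 6811. It is not what IOS runs internally. The `Roa` class, the `validate` and `route_map_local_pref` helpers, and the sample ROA (documentation prefix 192.0.2.0/24, documentation AS 64500) are all made up for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Roa:
    """Hypothetical, simplified ROA record: prefix, max length, origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate(prefix, origin_asn, roas):
    """Return 'valid', 'invalid' or 'unknown' for a route (RFC 6811 style)."""
    route = ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # A ROA covers the route if the route lies inside the ROA prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches if length and origin AS both fit.
        if route.prefixlen <= roa.max_length and roa.asn == origin_asn and roa.asn != 0:
            return "valid"
    # Covered but never matched -> invalid; not covered at all -> unknown.
    return "invalid" if covered else "unknown"


def route_map_local_pref(state):
    """Mirror the route-map above: local-preference, or None if denied."""
    return {"valid": 110, "unknown": 100}.get(state)


# Made-up sample data using documentation ranges.
roas = [Roa("192.0.2.0/24", 24, 64500)]

for pfx, asn in [("192.0.2.0/24", 64500), ("192.0.2.0/25", 64500),
                 ("192.0.2.0/24", 64501), ("198.51.100.0/24", 64500)]:
    state = validate(pfx, asn, roas)
    print(pfx, asn, state, route_map_local_pref(state))
```

Note that a more specific prefix than the ROA's maximum length ends up invalid even with the right origin AS, which is the case that catches more-specific hijacks.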
A few minutes after turning this on, you see the BGP Router process using ~35% CPU, which is really abnormal:
CPU utilization for five seconds: 99%/2%; one minute: 31%; five minutes: 14%
 PID Runtime(ms)  Invoked  uSecs   5Sec   1Min   5Min TTY Process
 589     9404348 61152106    153 35.55% 10.90%  2.97%   0 BGP Router
 318     2944984  8481011    347 31.86%  6.22%  1.66%   0 IP RIB Update
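If you want to watch for this kind of spike automatically, output like the above is easy to parse. The following is a rough sketch, assuming you have already collected the text from the router somehow (SSH, a script, whatever you use). The `busy_processes` helper and the 10% threshold are my own invention, not part of any Cisco tooling.

```python
def busy_processes(output, threshold=10.0):
    """Return (process name, 5-second CPU %) pairs at or above the threshold."""
    busy = []
    for line in output.splitlines():
        # Columns: PID Runtime Invoked uSecs 5Sec 1Min 5Min TTY Process
        # The process name may contain spaces, so split at most 8 times.
        fields = line.split(None, 8)
        if len(fields) != 9 or not fields[0].isdigit():
            continue  # skip the summary and header lines
        five_sec = float(fields[4].rstrip("%"))
        if five_sec >= threshold:
            busy.append((fields[8].strip(), five_sec))
    return busy


# The output from this post, used as sample input.
sample = """\
CPU utilization for five seconds: 99%/2%; one minute: 31%; five minutes: 14%
 PID Runtime(ms)  Invoked  uSecs   5Sec   1Min   5Min TTY Process
 589     9404348 61152106    153 35.55% 10.90%  2.97%   0 BGP Router
 318     2944984  8481011    347 31.86%  6.22%  1.66%   0 IP RIB Update
"""

for name, cpu in busy_processes(sample):
    print(f"{name}: {cpu}%")
```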
The cause was heavy route processing for ~500k routes (the global table). The best route is always selected from those with a valid RPKI state. If you turn RPKI on for a single router, let’s say inside a cluster, you should notice INVALID RPKI states on routes from iBGP peers, because those peers are not RPKI compatible. In addition, iBGP peers have to send the validation state to each other in an extended community. Applied on a single router, this breaks traffic load balancing, so RPKI MUST be applied on all routers, at least inside the cluster. Because the other iBGP peers didn’t have it implemented, I disabled prefix validation in best-path selection to avoid this behavior:
bgp bestpath prefix-validate disable
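To illustrate why this hurts load balancing, here is a toy Python model. It is not the real IOS best-path algorithm: the `Path` class, the `best_paths` helper, and the assumption that validation state is compared ahead of local preference are simplifications of mine. The next hops are documentation addresses.

```python
from dataclasses import dataclass

# Lower rank wins; ordering is an assumption for this toy model.
STATE_RANK = {"valid": 0, "unknown": 1, "invalid": 2}


@dataclass(frozen=True)
class Path:
    next_hop: str
    local_pref: int
    rpki_state: str


def best_paths(paths, use_validation):
    """Return every path tied for best (candidates for load balancing)."""
    def key(p):
        rank = STATE_RANK[p.rpki_state] if use_validation else 0
        return (rank, -p.local_pref)  # higher local-pref is better
    best = min(key(p) for p in paths)
    return [p for p in paths if key(p) == best]


# Two otherwise equal paths: one validated locally, one learned from an
# iBGP peer that does not do RPKI and therefore shows up as invalid.
paths = [
    Path("192.0.2.1", 100, "valid"),
    Path("192.0.2.2", 100, "invalid"),
]

print(len(best_paths(paths, use_validation=True)))
print(len(best_paths(paths, use_validation=False)))
```

With validation in play only one path survives, so there is nothing left to balance across. With it disabled, both paths stay tied, which matches what I saw after applying the command above.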
Another interesting behavior was that there was no RPKI state inside a VRF, only inside the global table:
#sh ip bgp vpnv4 vrf lt 126.96.36.199
BGP routing table entry for 21412:5:188.8.131.52/24, version 181245
Paths: (2 available, best #1, table lt)
  Advertised to update-groups:
     8          36
  Refresh Epoch 1
  2847
    X.X.X.X from X.X.X.X (Y.Y.Y.Y)
      Origin IGP, localpref 170, valid, external, best
      Community: 21412:2847
      Extended Community: RT:21412:5
      rx pathid: 0, tx pathid: 0x0
I don’t know if this still holds for current 15.x releases, but unfortunately it did back then.
- I see RPKI is growing year by year, but not as fast as we wanted (like IPv6);
- Believe me, RPKI is the way to go to improve security (even from untrusted neighbors or public IXP).