$Cisco high CPU usage with RPKI validation enabled
I don’t know how many ISPs now use RPKI-based validation of BGP prefixes to guard against hijacking attacks, but a few years ago most ISPs still used the old-school method (prefix lists/ACLs) to filter “good” and “bad” prefixes from neighbors. When I was working at an ISP, we dealt with this problem as well.
Today I found an old email that led to RPKI being implemented across that ISP's whole infrastructure. Translated, the email reads:
Hi there,
I would like to propose a cool new feature named "RPKI for BGP". It's basically a simple idea for filtering out bad prefixes from neighbors.
More information is available at this link: https://www.ripe.net/manage-ips-and-asns/resource-management/certification/bgp-origin-validation
In short, it has these benefits:
* You no longer have to explicitly define ACLs for which prefixes to allow from neighbors;
* You get increased safety against route hijacks;
* It's a centralized database (ROA records) which the BGP process uses as its source for validation.
If you are interested, please contact me and we will find the best way to implement this!
That was in 2013. My calendar now shows 2017, but the situation is still poor, as depicted in the graph below:
A few emails further up I found a thread about $Cisco high CPU usage. This wasn’t a SOHO/home router; it was a 7600 series core router.
Here are a few implementation details on the Cisco side. First, specify the RPKI replication server, which is basically a local Linux-based server with the Java rpki-validator-app application installed:
router bgp 21412
 bgp rpki server tcp X.X.X.X port 8282 refresh 600
Route-map entries applied to a specific neighbor:
route-map LT:LITNET-IN permit 10
 match rpki valid
 set local-preference 110
route-map LT:LITNET-IN permit 15
 match rpki unknown
 set local-preference 100
route-map LT:LITNET-IN deny 20
 match rpki invalid
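To make the three states in the route-map concrete, here is a minimal Python sketch of origin validation in the spirit of RFC 6811, followed by the same local-preference policy. It is only an illustration, not what IOS runs internally: the `Roa` class, the `validate` and `apply_policy` functions, and the ROA entries (documentation prefixes and AS numbers) are all made up for this example. RFC 6811 calls the third state "NotFound"; the sketch says "unknown" to match the route-map.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Roa:
    """Hypothetical, simplified ROA record: prefix, max length, origin AS."""
    prefix: str
    max_length: int
    origin_as: int


# Made-up ROA data using documentation prefixes and AS numbers.
EXAMPLE_ROAS = [
    Roa("192.0.2.0/24", 24, 64500),
    Roa("2001:db8::/32", 48, 64500),
]


def validate(prefix: str, origin_as: int, roas: Iterable[Roa]) -> str:
    """Return 'valid', 'invalid' or 'unknown' for a route announcement."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so check first.
        if route.version != roa_net.version:
            continue
        if not route.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this route
        if (route.prefixlen <= roa.max_length
                and roa.origin_as == origin_as
                and roa.origin_as != 0):  # AS 0 never matches a route
            return "valid"
    # Covered but no matching ROA means invalid; no covering ROA means unknown.
    return "invalid" if covered else "unknown"


# Mirrors the route-map: valid -> 110, unknown -> 100, invalid -> denied.
LOCAL_PREF = {"valid": 110, "unknown": 100}


def apply_policy(state: str) -> Optional[int]:
    """Return the local preference to set, or None if the route is denied."""
    return LOCAL_PREF.get(state)


if __name__ == "__main__":
    for pfx, asn in [("192.0.2.0/24", 64500), ("192.0.2.0/25", 64500),
                     ("192.0.2.0/24", 64501), ("198.51.100.0/24", 64500)]:
        state = validate(pfx, asn, EXAMPLE_ROAS)
        print(pfx, asn, state, apply_policy(state))
```

Note how a more specific announcement (the /25) from the right AS is still invalid, because it exceeds the ROA's maximum length. That is exactly the kind of hijack a plain origin-AS check would miss.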
A few minutes after turning this on, the BGP Router process alone was using ~35% CPU, which is really abnormal:
CPU utilization for five seconds: 99%/2%; one minute: 31%; five minutes: 14%
 PID Runtime(ms)   Invoked  uSecs   5Sec   1Min   5Min TTY Process
 589     9404348  61152106    153 35.55% 10.90%  2.97%   0 BGP Router
 318     2944984   8481011    347 31.86%  6.22%  1.66%   0 IP RIB Update
This was due to the heavy route processing for ~500k routes (the global table). Best-path selection always prefers a route with a valid RPKI state. If you turn RPKI on for a single router, say one inside a cluster, you should notice INVALID RPKI states on routes from iBGP peers, because those peers are not RPKI-capable. In addition, peers have to send the extended community to each other. Because the other iBGP peers didn’t have this implemented, I disabled validation in best-path selection to avoid this behavior. Applied on a single router, it breaks traffic load balancing. RPKI MUST be applied on all routers, at least inside the cluster:
bgp bestpath prefix-validate disable
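The load-balancing problem can be illustrated with a toy model in Python. This is not the IOS best-path algorithm, just a sketch under the assumption that the validation state is compared before everything else, with two paths that tie on all other attributes. The function name, the ranking table and the peer names are hypothetical.

```python
# Lower rank is preferred. Assumed ordering for this toy model only.
RANK = {"valid": 0, "unknown": 1, "invalid": 2}

# Two otherwise equal paths to the same prefix, as seen on the one router
# that has RPKI turned on. The iBGP path shows up as invalid, as observed.
PATHS = [("ebgp-peer", "valid"), ("ibgp-peer", "invalid")]


def multipath_candidates(paths, use_rpki_state):
    """Return next hops that remain eligible for load balancing.

    paths: list of (next_hop, rpki_state) tuples that tie on every
    attribute other than the RPKI state.
    """
    if not use_rpki_state:
        # Comparable to 'bgp bestpath prefix-validate disable':
        # the state is ignored and all tied paths stay eligible.
        return [next_hop for next_hop, _ in paths]
    best = min(RANK[state] for _, state in paths)
    return [next_hop for next_hop, state in paths if RANK[state] == best]


if __name__ == "__main__":
    print(multipath_candidates(PATHS, use_rpki_state=True))
    print(multipath_candidates(PATHS, use_rpki_state=False))
```

With the state taken into account only one next hop survives, so traffic that used to be shared across two paths collapses onto one. Ignoring the state restores both candidates.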
Another interesting behavior: there was no RPKI state inside a VRF, only in the global table:
#sh ip bgp vpnv4 vrf lt 193.219.60.0
BGP routing table entry for 21412:5:193.219.60.0/24, version 181245
Paths: (2 available, best #1, table lt)
  Advertised to update-groups:
     8          36
  Refresh Epoch 1
  2847
    X.X.X.X from X.X.X.X (Y.Y.Y.Y)
      Origin IGP, localpref 170, valid, external, best
      Community: 21412:2847
      Extended Community: RT:21412:5
      rx pathid: 0, tx pathid: 0x0
I don’t know whether this still holds for current 15.x releases, but unfortunately it did back then.
Summing up
- RPKI adoption is growing year by year, but not as fast as we would like (much like IPv6);
- Believe me, RPKI is the way to go to improve security (even against untrusted neighbors or at a public IXP).