"No Internet": Dealing with a slow ISP PPPoE server
A few weeks ago, just as I was about to leave for Japan, the internet connection from one of my ISPs went down.
I didn't notice for a while (thanks to my multi-ISP setup), until I tried to tunnel into my home network from the airport. I managed to tunnel in through another ISP and began troubleshooting.
I first assumed a fiber cut had caused a loss of signal, but the fiber link was actually up, and the landline was still working. So the focus shifted to the PPPoE connection between the ISP and my router. I checked whether the modem was still in bridge mode, and it was. My brother (who was at home), the on-site technician (who eventually arrived after days of back-and-forth with the helpdesk) and I were stumped: the connection worked with the modem in routing mode, so it should have worked in bridge mode... right?
Then I tried debugging it through my router. That's when I found the issue.
While I'm not an expert in PPPoE stuff, I could already see a couple of peculiar things in the logs:
- Multiple PADI (PPPoE Active Discovery Initiation) packets; ideally, the server should be able to respond immediately after receiving a PADI.
- PADO (PPPoE Active Discovery Offer) received with a different host-uniq value, specifically, the value from the previous PADI. This means the server is taking too long to respond.
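To make the host-uniq check concrete, here is a minimal Python sketch of how the Host-Uniq tag sits inside PPPoE discovery packets. The packet codes and tag types come from RFC 2516; the helper names (build_discovery, parse_tags, pado_matches) are made up for illustration, and this is not how the router itself implements the check.

```python
import struct

PADI, PADO = 0x09, 0x07      # discovery packet codes (RFC 2516)
TAG_SERVICE_NAME = 0x0101    # tag types (RFC 2516)
TAG_HOST_UNIQ = 0x0103

def build_discovery(code, tags):
    """Build a PPPoE discovery payload (without the Ethernet header)."""
    body = b"".join(struct.pack("!HH", t, len(v)) + v for t, v in tags)
    # ver/type is 0x11; the session ID is 0 during PADI/PADO
    return struct.pack("!BBHH", 0x11, code, 0x0000, len(body)) + body

def parse_tags(payload):
    """Return (code, {tag_type: value}) for a discovery payload."""
    _vertype, code, _session, length = struct.unpack("!BBHH", payload[:6])
    tags, offset, end = {}, 6, 6 + length
    while offset + 4 <= end:
        tag_type, tag_len = struct.unpack("!HH", payload[offset:offset + 4])
        tags[tag_type] = payload[offset + 4:offset + 4 + tag_len]
        offset += 4 + tag_len
    return code, tags

def pado_matches(padi, pado):
    """True if the PADO echoes the Host-Uniq value sent in the PADI."""
    _, sent = parse_tags(padi)
    code, received = parse_tags(pado)
    return code == PADO and sent.get(TAG_HOST_UNIQ) == received.get(TAG_HOST_UNIQ)

# The router is already on its second PADI, but the late PADO echoes the first one.
current_padi = build_discovery(PADI, [(TAG_SERVICE_NAME, b""), (TAG_HOST_UNIQ, b"\x00\x00\x00\x02")])
late_pado = build_discovery(PADO, [(TAG_SERVICE_NAME, b""), (TAG_HOST_UNIQ, b"\x00\x00\x00\x01")])
stale = not pado_matches(current_padi, late_pado)  # a strict client would drop this PADO
```

The server is supposed to echo the Host-Uniq tag unmodified, so a PADO that carries the previous PADI's value is exactly what a slow reply looks like from the client's side.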
Mikrotik routers send PADIs with an incrementing host-uniq value on each connection attempt. Because the PPPoE server doesn't respond quickly enough, each PADO arrives after the router has already moved on to the next value, so the host-uniq values never match. Apparently the router is pretty strict about host-uniq values, hence the "received PADO with unknown host-uniq, dropping" error.
The "fix"? Have the router send the same host-uniq value every time. I made it send 0x00, and it's been working fine ever since.
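Why a constant value helps can be shown with a toy simulation in Python. This is a simplified model under my own assumptions, not the router's actual logic: each attempt sends one PADI, the server's reply arrives a fixed number of attempts late, and the client only accepts a PADO whose host-uniq equals the one from its current attempt. The function name simulate and its parameters are hypothetical.

```python
def simulate(attempts, server_delay, fixed_host_uniq=None):
    """Return the attempt number on which a PADO is accepted, or None.

    Toy model: the PADO arriving during attempt n answers the PADI that was
    sent during attempt n - server_delay.
    """
    sent = []
    for attempt in range(1, attempts + 1):
        # Incrementing host-uniq by default, or one constant value if given
        host_uniq = fixed_host_uniq if fixed_host_uniq is not None else attempt
        sent.append(host_uniq)
        answered = attempt - server_delay  # which PADI the arriving PADO echoes
        if answered >= 1 and sent[answered - 1] == host_uniq:
            return attempt  # host-uniq matches, discovery can continue
    return None  # every PADO was dropped as "unknown host-uniq"

fast_server = simulate(10, server_delay=0)                        # replies in time
slow_incrementing = simulate(10, server_delay=1)                  # always one PADI behind
slow_fixed = simulate(10, server_delay=1, fixed_host_uniq=0x00)   # the workaround
```

In this model the slow server with incrementing values never gets accepted, while the fixed value succeeds as soon as the first late reply arrives. It only works around the symptom; the server is still slow.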
I still shared my analysis with the affected ISP. Whether they actually fix the root cause is up to them, and I'm not getting my hopes up. But at least the connection works now.