Standard CDN from Microsoft routes South African POP misses via a London origin shield, not directly to the South Africa North VM origin

    I have end-users in South Africa for an art site hosted in Europe. I'm trying to serve them with the lowest latency via a combination of Azure's Standard CDN from Microsoft and an Azure VM deployed in South Africa North, with the goal of fetching thumbnails from Europe only once. I suspect, though, that this is being frustrated by an origin shield used by the CDN nodes in Johannesburg (JNB), Cape Town (CPT) and elsewhere, which directs missed requests to a node in London (LON).

    My idea was for the VM to act as a South Africa-wide node caching ~50GB of our thumbnails and icons - probably more than the CDN endpoints in Johannesburg and Cape Town (or Lagos or Nairobi) will cache for any length of time. The number of end-users we have in the region is relatively low, and on a cache miss latency could be huge. Users are only directed to this endpoint if we believe, via IP geolocation, that they are from South Africa or nearby.
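The routing step described above can be sketched as follows. This is a hypothetical illustration, not our actual code: the hostnames and the country set are made up, and a stub set stands in for the real GeoIP database lookup.

```python
# Hypothetical sketch of the geo-routing step: pick an asset host from the
# viewer's ISO country code. A real implementation would consult a GeoIP
# database; the set below is a made-up stand-in, as are the hostnames.

# Countries routed to the CDN endpoint backed by the South Africa North VM.
SOUTHERN_AFRICA = {"ZA", "NA", "BW", "ZW", "MZ", "LS", "SZ"}

def asset_host(country_code: str) -> str:
    """Return the thumbnail host for a viewer's ISO country code."""
    if country_code in SOUTHERN_AFRICA:
        return "https://za-thumbs.example.azureedge.net"  # CDN -> SA VM origin
    return "https://thumbs.example.com"                   # European origin

print(asset_host("ZA"))  # South African viewer -> regional endpoint
print(asset_host("DE"))  # European viewer -> default origin
```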

    Unfortunately, what I'm finding is that misses in Cape Town and Johannesburg do not result in a direct fetch from the origin VM in South Africa North; instead, they typically result in another miss and a fetch from London - which then appears to go all the way back to South Africa (at least) for the actual file.

    See this example of requests over a period from an IP address in the suburb of Retreat, Cape Town:

    Table of hits and misses showing Cape Town requests redirecting to origin shield in London

    Chart of hits and misses for Cape Town and London (which is essentially all the misses of Cape Town).

    The number of misses in CPT is equal to the sum of misses and hits from London (which in this case are mostly misses, though that is not always the case). When CPT missed, it hit LON rather than going to the origin, as highlighted in the table; "sentToOriginShield_b" appears to always be true for CPT, except where a request matches a rule redirecting it directly to our European servers (we know some files are too big for our own node to cache, and it's better to serve those from Europe).
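For anyone wanting to reproduce the tally behind the table, a rough sketch of the analysis is below. It assumes the raw CDN log lines have already been parsed into dicts with a POP code, a cache status, and the origin-shield flag; the sample records are invented, and the field names are simplified from the raw log schema.

```python
from collections import Counter

# Made-up sample records standing in for parsed raw CDN log lines.
records = [
    {"pop": "CPT", "cacheStatus": "MISS", "sentToOriginShield": True},
    {"pop": "LON", "cacheStatus": "MISS", "sentToOriginShield": False},
    {"pop": "CPT", "cacheStatus": "HIT",  "sentToOriginShield": False},
    {"pop": "CPT", "cacheStatus": "MISS", "sentToOriginShield": True},
    {"pop": "LON", "cacheStatus": "HIT",  "sentToOriginShield": False},
]

tally = Counter((r["pop"], r["cacheStatus"]) for r in records)
for (pop, status), n in sorted(tally.items()):
    print(pop, status, n)

# The shielding signature: every CPT miss reappears in London, so
# CPT misses == LON hits + LON misses.
cpt_misses = tally[("CPT", "MISS")]
lon_total = tally[("LON", "HIT")] + tally[("LON", "MISS")]
assert cpt_misses == lon_total
```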

    Is there a way to configure the CDN to stop this shielding behaviour? A magic query parameter, or an endpoint rule that could be set globally? Alternatively, could Microsoft beef up Cape Town or Johannesburg so one of them can be a shield node itself?

    I get that you want to reduce origin traffic in the typical case and centralize caching; in this specific case, though, I'm trying to do exactly that myself, only closer and more comprehensively. Microsoft's CDN shield is ~175ms from South Africa, and I doubt it'll store 50GB of small, relatively infrequently accessed files for long, given the other content competing for space in your CDN nodes.

    Perhaps you could exclude IP ranges of your own VMs in the same region from being shielded if they're an origin, as such requests should transit your own links in the area, and that way you won't use Seacom cable bandwidth? (Maybe you already do this for some Azure services, but apparently not for custom origins like X.southafricanorth.cloudapp.azure.com.)

    As it is, a full CDN miss now costs ~350-500ms - it goes to CPT/JNB, LON, then back to my VM in South Africa; maybe back to Europe again if it's a totally new file - whereas it could be ~10ms (if it's in our VM, the CDN's origin) or at most ~180-200ms (if my VM has to go to the actual origin in Europe). And our miss rate is non-trivial:

    Chart of hit rates over twelve hours showing lots of misses on London, and about 30% hits on JNB and CPT

    This behaviour is also non-obvious; I only noticed it after doing the above log analysis and spotting a London CDN in my server logs. The documentation says "If no edge servers in the POP have the file in their cache, the POP requests the file from the origin server." not "it may go 180ms to a different POP to see if it's there first; if not, it hits the origin from there, even if it's 180ms in the opposite direction".

    There is a hint it may be served from a "regional" cache, and the raw logs document briefly outlines the purpose of an origin shield; but the fact that the parent cache could be 8,000 miles away was... surprising, and is problematic from a jurisdictional standpoint: would orgs expect/want CDN requests for South African VMs by South Africans to go via the UK?

    It'd make more sense with the current setup for my VM to be in London, or else for me not to have a VM at all. But neither option makes much sense, and neither gives the best performance to end-users. I could also serve directly from the VM, without a CDN, but I'd prefer not to, since:

    1. It'd expose my node, potentially making it a target for a network attack
    2. It'd cost 223% as much per GB: outbound transfer from VMs to the CDN is free - "For Azure CDN from Microsoft, any data transfer from an origin hosted in Azure is included in the base Azure CDN pricing" - and Africa is in EMEA, so I pay Standard CDN rates of $0.081/GB, while raw VM transit in South Africa North is Zone 3: $0.181/GB
    3. I doubt I could get a VM in Cape Town, as it's meant for disaster recovery (Johannesburg was hard enough to get); Microsoft CDN also has POPs in Lagos and Nairobi, and it seems like they'd be close enough to serve from South Africa - but I don't know that they'll get proper regions any time soon
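The 223% figure in point 2 comes straight from the two quoted 2020 per-GB rates:

```python
# Per-GB rates quoted above (2020 prices).
cdn_standard_emea = 0.081  # $/GB, Standard CDN from Microsoft, EMEA zone
vm_egress_zone3 = 0.181    # $/GB, raw VM internet egress, South Africa North

ratio = vm_egress_zone3 / cdn_standard_emea
print(f"{ratio:.0%}")  # -> 223%: serving direct from the VM costs ~2.2x
```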

    If I'm missing anything, I'd be glad for a tip. Right now the closest thing I have to a solution is to set "Bypass caching for query strings" and add a constant query string to URLs for South African users, turning POPs into simple request proxies. That seems like it'd solve 1) and 2), and render 3) irrelevant - but it foregoes POP caching, and is a bit ugly code-wise.
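That workaround can be sketched as below. This is an illustration only: the hostnames and the `za` parameter name are made up (any constant parameter would do, since "Bypass caching for query strings" keys off the presence of a query string).

```python
# Sketch of the workaround: append a constant query string to thumbnail
# URLs for suspected South African users, so that "Bypass caching for
# query strings" turns the POPs into plain proxies to the SA VM.
# Hostnames and the "za" parameter are hypothetical.
from urllib.parse import urlsplit, urlunsplit

def thumb_url(path: str, is_south_african: bool) -> str:
    base = ("https://za-thumbs.example.azureedge.net"
            if is_south_african else "https://thumbs.example.com")
    scheme, netloc, _, _, _ = urlsplit(base)
    query = "za=1" if is_south_african else ""
    return urlunsplit((scheme, netloc, path, query, ""))

print(thumb_url("/t/1234.jpg", True))   # ?za=1 -> bypasses POP caching
print(thumb_url("/t/1234.jpg", False))  # cached normally at the POP
```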

    Apologies if this comes across as a bit of a rant, but I spent a while setting this up, and I fear it's partly wasted effort due to this shield component that isn't useful in this use-case, and which I can't seem to turn off without turning caching off entirely.

    Update (May 8) - For the curious: I've tried the method above and it seems to work, with the caveat that the only CDN cache hits now are for truly static files, whereas previously some thumbnails could have been HITs at the POPs as well (as many people look at recent work). It might have been possible to use a separate "bypass cache" rule rather than the query string; but some paths need to be cached and others don't, and several rules are already in use, so the query string is effectively an extra rule for free - aside from a few ugly hacks on the app side.

    Update 2 (May 14) - When testing local usage, I ran into another issue: Azure's own VM IP ranges in Johannesburg appear not to be routed to the local CDN endpoint, but to one in Europe instead. If I ping (or PSPing from PSTools) my CDN endpoint from the VM, I get:

    Pinging standard.t-0001.t-msedge.net [] with 32 bytes of data:
    Reply from bytes=32 time=158ms TTL=116

    ...just ~16ms lower than our Netherlands cache, suggesting it's picked up a node in London or Spain. This is also shown by webpagetest.org, whose South Africa node is hosted on Azure - giving an unfavourable view of how the CDN works for end-users.

    Fortunately, dotcom-tools.com does show what I've done above appears to be working, with latencies of ~20-30ms where a given thumbnail is already in cache (and ~160-170ms where it's not).

    In case it matters, current IP geolocation databases correctly pick up the VM's IP address as being in South Africa, though ones which are a year or two old do not appear to have it.

    Wednesday, May 6, 2020 1:51 PM