[Idpy-discuss] Questions about pyFF metadata downloads

Alex Stuart Alex.Stuart at jisc.ac.uk
Mon Sep 16 09:40:30 UTC 2019


Hi Leif,

> On 13 Sep 2019, at 14:08, Leif Johansson <leifj at sunet.se> wrote:
> 
> On 2019-09-13 11:51, Alex Stuart wrote:
>> Hello folks,
>> 
>> The UK federation team have discovered that a pyFF deployment is making a large number of metadata aggregate downloads from our Metadata Publication Service. In August, 34.74.200.81 made over 3,000 gzipped downloads of our metadata, downloading 36GB of metadata. As we update metadata once per day, this deployment is clearly downloading excessively.
> 
> So it's a thumb drive worth of data :-)

That's not the excessive part :-) We're getting in touch with deployments that are obviously downloading too much data (one IdP is downloading 1/4 TB per month) and also getting in touch with folks where the metadata client is doing something we don't expect.

We expect clients to use the conditional GET mechanism. Our metadata server sends ETag and Last-Modified response headers [1][2]. On subsequent requests, an HTTP client sends those values back in the 'If-None-Match' and 'If-Modified-Since' request headers. If the metadata has changed, the server sends the updated metadata; if not, it sends an HTTP 304 Not Modified response with no body.
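To make the exchange concrete, here is a minimal Python sketch of a conditional GET. It is not pyFF code. The names conditional_headers, fetch_if_changed and the validators dict are made up for illustration, and session is assumed to be any object with a requests-style get() method.

```python
from typing import Optional, Tuple


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build conditional GET request headers from previously stored validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(session, url: str, validators: dict) -> Tuple[Optional[bytes], dict]:
    """Fetch url; return (body, validators). body is None when the server says 304.

    `session` is assumed to expose get(url, headers=..., timeout=...) and to
    return an object with status_code, headers, content and raise_for_status().
    """
    response = session.get(
        url,
        headers=conditional_headers(
            validators.get("etag"), validators.get("last_modified")
        ),
        timeout=30,
    )
    if response.status_code == 304:
        # Not Modified: the copy we already hold is still current.
        return None, validators
    response.raise_for_status()
    # Remember the new validators so the next request can be conditional.
    new_validators = {
        "etag": response.headers.get("ETag"),
        "last_modified": response.headers.get("Last-Modified"),
    }
    return response.content, new_validators
```

With the requests library, the first call would look like fetch_if_changed(requests.Session(), url, {}), and later calls would pass in the validators returned by the previous call. A client would need to store the validators alongside its cached copy of the metadata.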

> 
> By default pyFF uses python-requests with the requests-caching on
> by default which means that if the endpoint implements normal HTTP
> cache controls the pyFF user-agent shouldn't download unless the
> cache is expired (which is configurable of course).
> 

Thanks. This looks like a different mechanism from the ETag/Last-Modified headers. With time-based caching, the client has to download and parse the metadata on every request. I can see that this is OK for clients using MDQ requests, and arguably for metadata clients which aggregate and re-publish metadata, but it looks less efficient for IdP/SP clients that consume the full metadata aggregate. With conditional GET, clients can make frequent requests and download updated metadata soon after it's published, while keeping the overall network load low.

It looks like federation operators expect the conditional GET mechanism for metadata clients. I've surveyed a few metadata feeds and all of them send the ETag and Last-Modified headers:

# Federation, metadata URL, Last-Modified, ETag, Expires, Cache-Control
UK-FEDERATION, http://metadata.ukfederation.org.uk/ukfederation-metadata.xml, Last-Modified, ETag, none, none
SWAMID, https://mds.swamid.se/md/swamid-2.0.xml, Last-Modified, ETag, none, none
INCOMMON, http://md.incommon.org/InCommon/InCommon-metadata.xml, Last-Modified, ETag, none, none
SWITCH, http://metadata.aai.switch.ch/metadata.switchaai.xml, Last-Modified, ETag, Expires, Cache-Control
AAF, https://md.aaf.edu.au/aaf-metadata.xml, Last-Modified, ETag, none, none
ACONET, https://eduid.at/md/aconet-interfed.xml, Last-Modified, ETag, none, none
IDEM, http://md.idem.garr.it/metadata/idem-metadata-sha256.xml, Last-Modified, ETag, none, none
SURFCONEXT, https://metadata.surfconext.nl/idp-metadata.xml, Last-Modified, ETag, none, Cache-Control

Only a couple send Cache-Control headers.
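For anyone who wants to repeat the survey, a row in the format above can be built from a set of response headers with something like the sketch below. This is not the script used for the survey. The names survey_row and CACHE_HEADERS are hypothetical, and the header mapping is assumed to come from whatever HTTP client you prefer.

```python
# The four response headers reported in the survey, in column order.
CACHE_HEADERS = ("Last-Modified", "ETag", "Expires", "Cache-Control")


def survey_row(name: str, url: str, headers: dict) -> str:
    """Return one CSV-style survey line for a feed's response headers."""
    # HTTP header names are case-insensitive, so compare in lower case.
    present = {key.lower() for key in headers}
    cells = [h if h.lower() in present else "none" for h in CACHE_HEADERS]
    return ", ".join([name, url] + cells)
```

With the requests library, the mapping could be taken from requests.head(url).headers, assuming the server answers HEAD requests with the same headers it sends for GET.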

I'm new to Python, so I've not yet used the requests library [3] with the If-None-Match header. I'll keep trying to get that to work.

It also looks like the requests-cache library supports time-based caching but not ETag: "requests-cache ignores all cache headers, it just caches the data for the time you specify" [4]. However, the requests-cache documentation points to the CacheControl library, which does support both ETag and time-based headers [5].

Cheers,
Alex

[1] https://docs.ukfederation.org.uk/fts/1.5/mps/#43-support-for-conditional-get

[2] https://tools.ietf.org/html/rfc7232

[3] http://2.python-requests.org/en/master/

[4] https://pypi.org/project/requests-cache/

[5] https://cachecontrol.readthedocs.io/en/latest/etags.html


—
Alex Stuart, Principal technical support specialist (UK federation)               
alex.stuart at jisc.ac.uk
UK federation helpdesk: service at ukfederation.org.uk


