Hi Leif,
On 13 Sep 2019, at 14:08, Leif Johansson <leifj at sunet.se> wrote:
On 2019-09-13 11:51, Alex Stuart wrote:
Hello folks,
The UK federation team have discovered that a pyFF deployment is making a large number of
metadata aggregate downloads from our Metadata Publication Service. In August,
34.74.200.81 made over 3,000 gzipped downloads of our metadata, downloading 36GB of
metadata. As we update metadata once per day, this deployment is clearly downloading
excessively.
So it's a thumb drive worth of data :-)
That's not the excessive part :-) We're getting in touch with deployments that are
obviously downloading too much data (one IdP is downloading 1/4 TB per month) and also
getting in touch with folks where the metadata client is doing something we don't
expect.
We expect that clients use the conditional GET mechanism. Our metadata server sends ETag
and Last-Modified headers [1][2]. HTTP clients send the 'If-None-Match' and
'If-Modified-Since' request headers on subsequent requests. If the metadata has
changed, the server sends the updated metadata; if not, it sends an HTTP 304 Not Modified
response.
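For readers less familiar with the exchange, the server's side of a conditional GET can be sketched roughly as below. This is a simplified illustration of the RFC 7232 rules [2], not the code of any federation's metadata server. The function name `is_not_modified` is hypothetical, header lookup is assumed to use the exact names shown, and entity tags are split naively on commas.

```python
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop a leading W/ so that tags can be compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, current_etag, last_modified):
    """Return True if the server may answer 304 Not Modified.

    request_headers: dict of request header names to values.
    current_etag / last_modified: validators of the current metadata file.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence; If-Modified-Since is then ignored.
        if if_none_match.strip() == "*":
            return True
        candidates = {_strip_weak(tag) for tag in if_none_match.split(",")}
        return _strip_weak(current_etag) in candidates

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # an unparseable date is ignored, so send the full body
        return parsedate_to_datetime(last_modified) <= since

    # No validators sent: this is an ordinary, unconditional GET.
    return False
```

A client that sends neither header always gets the full aggregate, which is the behaviour described for the deployment above.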
By default pyFF uses python-requests with requests-cache enabled,
which means that if the endpoint implements normal HTTP
cache controls, the pyFF user-agent shouldn't download unless the
cache has expired (which is configurable, of course).
Thanks. This looks like a different mechanism to the ETag/Last-Modified headers. With
time-based caching, the client has to download and parse the metadata on every request. I can see
that this is OK for clients using MDQ requests, and arguably for metadata clients which
aggregate and re-publish metadata, but it looks less efficient for IdP/SP clients that
consume the full metadata aggregate. With conditional GET you can have clients making
frequent requests and downloading updated metadata soon after it's published, while
keeping the overall network load low.
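As a rough back-of-the-envelope check, the figures quoted earlier in this thread (about 3,000 downloads totalling 36GB in August, with one metadata update per day) can be plugged into a small calculation. The size of a 304 response (500 bytes here) is an assumption made purely for illustration, as is the idea that every download was the same size.

```python
# Figures from the thread: roughly 3,000 downloads totalling 36 GB in August,
# with the aggregate republished once per day.
TOTAL_BYTES = 36e9
DOWNLOADS = 3000
DAYS_IN_AUGUST = 31

# Assumption for illustration only: a 304 response costs about 500 bytes.
NOT_MODIFIED_BYTES = 500

# Average size of one gzipped download (about 12 MB).
aggregate_bytes = TOTAL_BYTES / DOWNLOADS


def monthly_transfer(requests_made, full_downloads):
    """Bytes transferred when only `full_downloads` requests return a body."""
    revalidations = requests_made - full_downloads
    return full_downloads * aggregate_bytes + revalidations * NOT_MODIFIED_BYTES


# Every request downloads the aggregate, as observed.
unconditional = monthly_transfer(DOWNLOADS, DOWNLOADS)
# Same request rate, but only one full download per daily update.
conditional = monthly_transfer(DOWNLOADS, DAYS_IN_AUGUST)

print(f"per download:  {aggregate_bytes / 1e6:.0f} MB")
print(f"unconditional: {unconditional / 1e9:.1f} GB")
print(f"conditional:   {conditional / 1e9:.2f} GB")
```

Under these assumptions the same polling frequency would move well under half a gigabyte per month instead of 36GB.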
It looks like federation operators expect the conditional GET mechanism for metadata
clients. I've surveyed a few metadata feeds and all of them send the ETag and
Last-Modified headers:
# Federation, metadata URL, Last-Modified, ETag, Expires, Cache-Control
UK-FEDERATION, http://metadata.ukfederation.org.uk/ukfederation-metadata.xml, Last-Modified, ETag, none, none
SWAMID, https://mds.swamid.se/md/swamid-2.0.xml, Last-Modified, ETag, none, none
INCOMMON, http://md.incommon.org/InCommon/InCommon-metadata.xml, Last-Modified, ETag, none, none
SWITCH, http://metadata.aai.switch.ch/metadata.switchaai.xml, Last-Modified, ETag, Expires, Cache-Control
AAF, https://md.aaf.edu.au/aaf-metadata.xml, Last-Modified, ETag, none, none
ACONET, https://eduid.at/md/aconet-interfed.xml, Last-Modified, ETag, none, none
IDEM, http://md.idem.garr.it/metadata/idem-metadata-sha256.xml, Last-Modified, ETag, none, none
SURFCONEXT, https://metadata.surfconext.nl/idp-metadata.xml, Last-Modified, ETag, none, Cache-Control
Only a couple send Cache-Control headers.
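A survey like this could be repeated with a short script along the following lines. It is only a sketch using the Python standard library: the helper names `summarise_headers` and `survey` are made up here, and it assumes each server answers a HEAD request with the same caching headers it would send on a GET, which is not guaranteed.

```python
import urllib.request

SURVEYED_HEADERS = ("Last-Modified", "ETag", "Expires", "Cache-Control")


def summarise_headers(header_names):
    """Return each surveyed header's name if present, else 'none'."""
    # Header names are case-insensitive, so compare in lower case.
    present = {name.lower() for name in header_names}
    return [name if name.lower() in present else "none" for name in SURVEYED_HEADERS]


def survey(label, url, timeout=30):
    """Issue a HEAD request and format one row in the style of the table above."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        row = summarise_headers(response.headers.keys())
    return ", ".join([label, url] + row)
```

Calling `survey("SWAMID", "https://mds.swamid.se/md/swamid-2.0.xml")` would return one comma-separated row; the headers a feed sends today may of course differ from those recorded here.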
I'm new to Python, so I've not used the requests library [3] with the If-None-Match
header. I'll keep trying to get that to work.
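For what it's worth, a conditional GET with the requests library [3] might look something like the sketch below. The function names `fetch_if_changed` and `poll_once` and the `state` dictionary are hypothetical; the only requests features relied on are `Session.get()` with a `headers` argument, `status_code`, `content`, `headers` and `raise_for_status()`. This is not how pyFF does it.

```python
import requests


def fetch_if_changed(session, url, etag=None, last_modified=None, timeout=30):
    """Conditional GET: return (body, etag, last_modified); body is None on 304."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    response = session.get(url, headers=headers, timeout=timeout)
    if response.status_code == 304:
        # Not Modified: no body was sent, so keep the existing copy and validators.
        return None, etag, last_modified

    response.raise_for_status()
    return (
        response.content,
        response.headers.get("ETag"),
        response.headers.get("Last-Modified"),
    )


def poll_once(url, state):
    """One polling step; `state` holds the validators from the previous fetch."""
    with requests.Session() as session:
        body, etag, last_modified = fetch_if_changed(
            session, url, state.get("etag"), state.get("last_modified")
        )
    state.update(etag=etag, last_modified=last_modified)
    return body
```

Starting with an empty `state = {}`, the first `poll_once(url, state)` would return the full aggregate, and later calls would return `None` for as long as the server keeps answering 304.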
It also looks like the requests-cache library supports time-based caching but not ETag:
"requests-cache ignores all cache headers, it just caches the data for the time you
specify" [4]. However, the requests-cache documentation points to the CacheControl
library, which does support both ETag and time-based headers [5].
Cheers,
Alex
[1] https://docs.ukfederation.org.uk/fts/1.5/mps/#43-support-for-conditional-get
[2] https://tools.ietf.org/html/rfc7232
[3] http://2.python-requests.org/en/master/
[4] https://pypi.org/project/requests-cache/
[5] https://cachecontrol.readthedocs.io/en/latest/etags.html
--
Alex Stuart, Principal technical support specialist (UK federation)
alex.stuart at jisc.ac.uk
UK federation helpdesk: service at ukfederation.org.uk