I want to run pyFF using the RedisWhooshStore to leverage the superior
indexing that Whoosh provides to support API search.
The example API runner script run-pyff-api.sh uses this command:
gunicorn \
--log-config logger.ini \
--bind 0.0.0.0:8000 \
-t 600 \
-e PYFF_PIPELINE=$pipeline \
-e PYFF_STORE_CLASS=pyff.store:RedisWhooshStore \
-e PYFF_UPDATE_FREQUENCY=300 \
--threads 4 \
--preload \
pyff.wsgi:app
My testing shows this combination of arguments will not work: if you
start from a new deployment, the Whoosh index is never generated.
The reason is the --preload option to gunicorn. When that option is used
the application code (pyff.wsgi:app) is loaded/evaluated before the
worker process is forked. As such the APScheduler BackgroundScheduler
instance is created before gunicorn forks to create the worker process.
It is realized as a thread running in the parent.
The forked child process does not inherit threads (other than the main)
from the parent. So the BackgroundScheduler only runs in the parent,
which initially seems to be the desired behavior--you would not want two
copies of the BackgroundScheduler, one in the parent and one in the
child.
Further, with --preload it is the parent process that adds or schedules
the "call" job to run the update/load cycle. Since the
BackgroundScheduler also runs in the parent it immediately "sees" the
added "call" job and runs it. The "call" job does an HTTP GET to
localhost to cause the update/load cycle.
By this point the parent has forked to create the child, and it is the
child worker that services that GET call. As part of servicing that GET
call the child worker eventually schedules the reindex job.
But since it is the child and not the parent that schedules the reindex
job, the BackgroundScheduler thread running in the parent never "sees"
that the reindex job has been scheduled. If using the default memory
scheduler job store the reindex job will never be seen, and hence the
Whoosh index will never be created.
If the command above is changed to configure the scheduler with a Redis
job store, then the job is stored in Redis. While that helps because now the
BackgroundScheduler thread in the parent will "see" the reindex job, it
will not "see" it until it wakes up to run the "call" job. So the
reindexing does not happen until PYFF_UPDATE_FREQUENCY later. I want to
run with an update frequency of one hour, but I do not want to wait an
hour for the initial Whoosh index to be created, nor do I want changes
from the update/load to take an hour to appear in the index.
I think the right workaround for now is to NOT use --preload. In this
scenario the BackgroundScheduler thread runs in the forked child process
and so it "sees" both the "call" and "reindex" jobs as they are
scheduled. The Whoosh index is created immediately after the update/load
cycle completes.
The downside of this approach is that gunicorn should only be run with a
single worker (the default). When run with more than one worker there
would be multiple BackgroundSchedulers doing the same work, and that is
probably not desirable.
One gunicorn worker with multiple threads should be able to easily serve
the loads I expect for most of my deployments, but if pyff.wsgi:app
is to really scale I think the approach with APScheduler needs to be
redesigned. The scheduler will need to run in a dedicated process that
is managed via some type of remote procedure call, or another background
task/job manager will need to be used instead of APScheduler.
Leif, do you agree with this assessment?
If so, I will submit a PR for run-pyff-api.sh that removes the --preload
option and explicitly uses "--workers 1", and that includes a comment
about only using one worker.
Sorry for the crosspost...
After a few weeks of spending all of my available development bits on
the various parts of RA21 (cf github.com/TheIdentitySelector, yes it's
all nodejs!) I'm back to working on pyFF for a bit.
Here is what I have planned for in the quite near term:
1. merge the api-refactory branch which includes a Pyramid-based API
2. merge documentation PR from Hannah Sebuliba (thx!)
3. tag and release the last monolithic version of pyFF
4. in HEAD which becomes the new 1.0.0 release:
- remove all frontend bits (old discovery, management web app)
- pyffd will now start the Pyramid-based API server
- wsgi will be available/recommended
- create a new "frontend app" as a separate webpack+nodejs project
- create docker-compose.yaml that starts pyffd (API) + frontend app
5. tag and release 1.0.0 thereby moving pyFF over to semantic versioning
After 4 it makes sense to talk about things like...
- new redis/nosql backends
- work on reducing memory footprint
- pubsub for notifications between MDQ servers
- more instrumentation & monitoring
- adaptive aggregation for large-scale deployments
- Elasticsearch
- management APIs for integrated editing of local metadata
- generating offline MDQ directory structures (cf scripts/mirror-mdq.sh)
Thoughts etc are as usual welcome.
I drafted a new consent service based on Django:
I weighed the complexity of CMservice, its lack of documentation and community support, against its being an already deployed project. I think I will drop CMservice and go ahead with developing simpleconsent in the second half of September, unless someone proposes an alternative.
Any encouragement or dissuasion? A consideration is that Django does not work with SQLAlchemy; its own ORM is a different design. But I would need to stick with Django for development speed.
@Heather: Is there an RFC that dismisses the use of "simple" in project names? My excuse is that SCAR (Simple Consent for Attribute Release) did not sound good, either.
> On 2019-08-15 at 20:16, Rainer Hoerbe <rainer at hoerbe.at> wrote:
> Thanks for the quick answer. I hope that we can cover this in the idpy call next week, as I will be on vacation for 2 weeks afterwards.
> I would be interested in your assessment of the code. On my side, I am unhappy that the API is undocumented and has to be reverse engineered from the view definitions etc.
> - Rainer
>> On 2019-08-15 at 20:06, Christos Kanellopoulos <christos.kanellopoulos at geant.org> wrote:
>> Hi Rainer
>> We have done some further work on the CM service and we have fixed various bugs. Right now both myself and Ivan are on holiday. Next week we will be back and will share the updated code.
>> Having said this, we are seriously thinking of abandoning this code base and developing a CM component from scratch.
>> From: Rainer Hoerbe <rainer at hoerbe.at>
>> Sent: Thursday, August 15, 2019 8:22:13 PM
>> To: Christos Kanellopoulos <christos.kanellopoulos at geant.org>
>> Subject: Re: CMservice gitlab export
>> Hi Christos,
>> The integration of CMservice into SATOSA is again at the top of my todo list. When I added your tarball from 22 May, I noticed that the unit tests had not been updated to reflect the changes in src. I fixed this in https://github.com/its-dirg/CMservice/pull/11, and a few dependency issues.
>> Is there any new status on the GÉANT branch of the project? Any new commits? I would like to know if there is a chance to consolidate efforts wrt this project. Do you know, or do you know someone who might know?
>> Cheers, Rainer
>>> On 2019-05-22 at 12:02, Christos Kanellopoulos <christos.kanellopoulos at geant.org> wrote:
>>> Hello Rainer,
>>> Find it attached. Yesterday afternoon became really late night.
>>> On 22 May 2019, at 11:54, Rainer Hoerbe wrote:
>>> May I send a friendly reminder?
>>>> On 2019-05-21 at 08:47, Christos Kanellopoulos <christos.kanellopoulos at geant.org> wrote:
>>>> Hello Rainer
>>>> I am at the hospital, but I will be able to send it to you later this afternoon
>>>> From: Rainer Hoerbe <rainer at hoerbe.at>
>>>> Sent: Tuesday, May 21, 2019 9:46 AM
>>>> To: Christos Kanellopoulos
>>>> Subject: CMservice gitlab export
>>>> Hi Christos,
>>>> You mentioned in the last idpy meeting that I might get a copy of Geant’s CMService repo on gitlab. Whom would I ask to get it?
>>>> Thanks and best regards
>>> Christos Kanellopoulos
>>> Senior Trust & Identity Manager
>>> M: +31 611 477 919
>>> Networks • Services • People
>>> Learn more at www.geant.org
>>> GÉANT Vereniging (Association) is registered with the Chamber of Commerce in Amsterdam with registration number 40535155 and operates in the UK as a branch of GÉANT Vereniging. Registered office: Hoekenrode 3, 1102BR Amsterdam, The Netherlands. UK branch address: City House, 126-130 Hills Road, Cambridge CB2 1PQ, UK.
Attendees: Scott Koranda, Heather Flanagan, Leif Johansson, Giuseppe de Marco, Johan Lundberg, John Paraskevopoulos, Alex Stuard, Hannah Sebuliba.
Virtual IdP front end to Satosa - can expose multiple virtual IdPs through the Satosa front end. Can configure various options for the IdP, including the name of the IdP, the scope the IdP wants to use, etc. That config belongs to the front end. Scott also wants to have some microservices that operate on the assertions as they go through the system, and the microservices should have the same access to that configuration (e.g., so they can see the scope). Waiting on a decision from Ivan on how to implement this.
Can you run a single Satosa instance with multiple front and back ends? Example: a SAML back end that authenticates against eduGAIN, another that authenticates to ORCID, and front ends that work with either SAML or OIDC.
Question: has anyone set up an OIDC front end and had it work with mod_auth_openidc? mod_auth_openidc is complaining. Giuseppe is planning to try this in the next month or so.
With the OIDC front end, it won’t automatically work with multiple backends (cannot select between multiple backends). Need a custom routing service. Does anyone have such a routing service available? Giuseppe wrote one; can find it in the Satosa PRs. It intercepts the call and uses a map of entity IDs that need this.
Update on pyFF
Current release = 1.1.1; there are some bug fixes that need to go into 1.1.2 asap.
Code is stabilizing, but not sure he’d bet on 1.1.2 being stable.
2.0 will start with Leif removing the front end bits; he will provide a bash script to help people who are used to calling pyffd. There will still be a wsgi app (and it will be the main entry point).
Hannah has been working on some interesting memory things; she is looking for memory leaks. Scott thinks that when pyFF is running as a server, it should never create a really large DOM: avoid ever reading the eduGAIN feed as a single DOM object, because that creates a huge, unnecessary memory request. It also has to avoid creating lists of many things; even if you don't load the whole DOM, a list of many small DOMs held in memory before being handed to the backend store still consumes a lot of memory. Scott suggests the architecture needs to shift from parsing large chunks of metadata to parsing small chunks, handing them off to the backend, then garbage collecting. Leif points out that as soon as you're dealing with signed metadata, you have to handle all of it at once. Could try to do something by making the pipeline smaller.
One suggestion: switch to the Redis backend, which could work in some use cases, but not for the full aggregate.
Could do an offline fetch as another way to control size.
One goal is to keep pyFF from needing a server with more than 4GB, but that is not likely to hold as eduGAIN gets larger.
In eduTEAMS, pyFF does take up the largest memory footprint.
Could start pyFF, ingest all you need, use mirror MDQ to produce an offline copy, then shut down the pyFF service until you need to reingest. The offline MDQ could be used for discovery for as long as the signature is valid. Can use the thiss.io MDQ (thiss-mdq <https://github.com/TheIdentitySelector/thiss-mdq>) for a miniature search function.
Giuseppe uses pyFF with a scheduler.
Another alternative is to use the default discovery service being put together by SeamlessAccess.org <http://seamlessaccess.org/> (based on RA21)
Would be interesting to compare Whoosh+Redis to a JSON-only index store. Action item for Hannah.
How to exclude entityIDs from pyFF? It works as expected up to 0.9.3. You can use a filter, but the previous approach of fork, merge, remove does not work. (The latter impacts the current working document, and should not actually work.) Suggestion: look at load-cleanup - there is a way to run a pipeline early, before you update the backend store, and that might be it.
We have a call on the calendar for tomorrow, 6 August 2019.
While Ivan is on holiday and dreaming of functional programming in
Python, we will still have a call. Leif has agreed to join the call so
that we can spend some time talking about the latest changes to pyFF.
We can also cover other topics as time permits.
There used to be a satosa-dev slack workspace. This workspace has been
inactive for more than a year. I have now renamed it to
identity-python. Anyone can join https://identity-python.slack.com/ by
self-inviting with the link below:
In case these instructions or endpoints change, the website should be updated
Personally, I do prefer the mailing list for archiving reasons. As
Chris Philips put it:
> I have been using satosa-users list as the starting place
> to congregate/share info/challenges and find it's a good
> start. It is searchable more easily than slack will ever be
> (and won't delete history after a certain size). Slack is good
> for real time-ness but poor on search and retrieval.
I cannot agree more. Any chat is good for real-time discussions, but
it is essentially unstructured and locked into the platform.
We already have multiple channels to communicate and discuss:
- the mailing lists
- the github PRs and Issues
- and, slack
Nobody should be forced to join and follow every communication
channel. Let's try to be conservative and use one for each discussion
subject. If the conversation is to be moved between channels, it
should be accompanied by a small summary of what has already been
discussed on the originating channel.
Ivan c00kiemon5ter Kanakarakis >:3