Hi,
I want to run pyFF using the RedisWhooshStore to leverage the superior
indexing that Whoosh provides to support API search.
The example API runner script at
https://github.com/IdentityPython/pyFF/blob/master/scripts/run-pyff-api.sh
uses this command:
gunicorn \
--preload \
--log-config logger.ini \
--bind 0.0.0.0:8000 \
-t 600 \
-e PYFF_PIPELINE=$pipeline \
-e PYFF_STORE_CLASS=pyff.store:RedisWhooshStore \
-e PYFF_UPDATE_FREQUENCY=300 \
--threads 4 \
--worker-tmp-dir=/dev/shm \
--worker-class=gthread pyff.wsgi:app
My testing shows that this combination of arguments does not work: starting
from a fresh deployment, the Whoosh index is never generated.
The reason is the --preload option to gunicorn. With that option the
application code (pyff.wsgi:app) is loaded/evaluated before the worker
process is forked, so the APScheduler BackgroundScheduler instance is
created before gunicorn forks to create the worker process. Its scheduler
loop runs as a thread in the parent (gunicorn master) process.
The forked child process does not inherit threads (other than the main)
from the parent. So the BackgroundScheduler only runs in the parent,
which seems initially to be the desired behavior--you would not want two
copies of the BackgroundScheduler, one in the parent and one in the
child.
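To make the thread behavior concrete, here is a small generic Python sketch
(not pyFF code): a thread started before os.fork() keeps running only in the
parent, and the child comes up with just the thread that called fork().

import os
import threading
import time

def ticker():
    # Stand-in for the BackgroundScheduler's scheduler thread.
    while True:
        print(f"ticker alive in pid {os.getpid()}")
        time.sleep(1)

threading.Thread(target=ticker, daemon=True).start()

pid = os.fork()
if pid == 0:
    # Child: only the thread that called fork() survives, so the
    # ticker/scheduler thread does not exist here.
    print(f"child {os.getpid()} has {threading.active_count()} thread(s)")
    os._exit(0)

# Parent: the ticker keeps printing, always with the parent's pid.
time.sleep(3)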
Further, with --preload it is the parent process that adds or schedules
the "call" job to run the update/load cycle. Since the
BackgroundScheduler also runs in the parent it immediately "sees" the
added "call" job and runs it. The "call" job does an HTTP GET to
localhost to cause the update/load cycle.
By this point the parent has forked to create the child, and it is the
child worker that services that GET call. As part of servicing that GET
call the child worker eventually schedules the job
"RedisWhooshStore._reindex".
But since it is the child and not the parent that schedules the reindex
job, the BackgroundScheduler thread running in the parent never "sees"
that the reindex job has been scheduled. With the default in-memory
scheduler job store the reindex job will never be seen, and hence the
Whoosh index will never be created.
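Here is a rough sketch of that failure mode outside of pyFF (it assumes
APScheduler is installed; the job id and function are made up). A job added
by the forked child only lands in the child's copy of the in-memory job
store, so the scheduler thread in the parent never sees it:

import os
import time
from apscheduler.schedulers.background import BackgroundScheduler

def reindex():
    print("reindex ran in pid", os.getpid())

# Parent (gunicorn master with --preload): scheduler thread starts here.
scheduler = BackgroundScheduler()   # default in-memory job store
scheduler.start()
time.sleep(0.5)                     # let the scheduler thread settle before forking

pid = os.fork()
if pid == 0:
    # Child (gunicorn worker): this only updates the child's copy of the
    # in-memory job store; the parent's scheduler thread is unaware of it.
    scheduler.add_job(reindex, 'interval', seconds=1, id='reindex')
    os._exit(0)

time.sleep(3)
# Parent: its job store is still empty, so reindex() never runs here.
print("jobs visible to the parent scheduler:", scheduler.get_jobs())
scheduler.shutdown()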
If the command above is changed to include
-e PYFF_SCHEDULER_JOB_STORE=redis
then the job is stored in Redis. That helps, because the BackgroundScheduler
thread in the parent will now "see" the reindex job, but it will not "see"
it until it wakes up to run the "call" job. So the reindexing does not
happen until PYFF_UPDATE_FREQUENCY seconds later. I want to run with an
update frequency of one hour, but I do not want to wait an hour for the
initial Whoosh index to be created, nor do I want changes from an
update/load cycle to take an hour to appear in the index.
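Again as a rough sketch, not pyFF code (it assumes APScheduler, a local
Redis, and made-up function names): with a shared RedisJobStore the job
added by the forked child does land in Redis, but the parent's scheduler
thread is asleep until its next scheduled wakeup, so the new job is only
picked up then.

import os
import time
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.redis import RedisJobStore

def call():
    print("call ran in pid", os.getpid())

def reindex():
    print("reindex ran in pid", os.getpid())

# Parent: scheduler backed by a shared Redis job store.
scheduler = BackgroundScheduler(jobstores={'default': RedisJobStore()})
scheduler.start()
# Stand-in for the update/load "call" job (think PYFF_UPDATE_FREQUENCY).
scheduler.add_job(call, 'interval', seconds=15, id='call',
                  replace_existing=True)
time.sleep(1)   # let the scheduler thread settle before forking

pid = os.fork()
if pid == 0:
    # Child: the job is written to Redis, but the wakeup only reaches the
    # child's (nonexistent) scheduler thread, not the parent's.
    scheduler.add_job(reindex, 'date', id='reindex',
                      misfire_grace_time=60, replace_existing=True)
    os._exit(0)

# Parent: reindex is due immediately, but nothing wakes the parent's
# scheduler thread, so it only runs ~15 seconds later, together with call.
time.sleep(20)
scheduler.shutdown()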
I think the right workaround for now is to NOT use --preload. In this
scenario the BackgroundScheduler thread runs in the forked child process
and so it "sees" both the "call" and "reindex" jobs as they are scheduled.
The Whoosh index is created immediately after the update/load
cycle.
The downside of this approach is that gunicorn should only be run with a
single worker (the default). With more than one worker there would be
multiple BackgroundSchedulers doing the same work, which is probably not
desirable.
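For what it is worth, here is how I read the same settings as a gunicorn
config file (gunicorn.conf.py), with --preload dropped and the worker count
pinned to one. This is just a sketch; the pipeline value comes from the same
$pipeline variable the wrapper script uses and assumes it is exported.

# gunicorn.conf.py -- equivalent of the command above, minus --preload,
# and pinned to a single worker.
import os

bind = "0.0.0.0:8000"
workers = 1                 # one worker only, so there is one BackgroundScheduler
threads = 4
worker_class = "gthread"
worker_tmp_dir = "/dev/shm"
timeout = 600
logconfig = "logger.ini"
preload_app = False         # let the scheduler thread start in the worker

raw_env = [
    "PYFF_PIPELINE=" + os.environ.get("pipeline", ""),  # $pipeline from the wrapper script
    "PYFF_STORE_CLASS=pyff.store:RedisWhooshStore",
    "PYFF_UPDATE_FREQUENCY=300",
]

That would be started with something like: gunicorn -c gunicorn.conf.py pyff.wsgi:app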
One gunicorn worker with multiple threads should be able to easily serve
the loads I expect for most of my deployments, but if pyff.wsgi:app
is to really scale I think the approach with APScheduler needs to be
redesigned. I think the scheduler will need to run in a dedicated
process that is managed via some type of remote procedure call, or
another background task/job manager will need to be used instead of
APScheduler.
Leif, do you agree with this assessment?
If so, I will submit a PR for run-pyff-api.sh that removes the --preload
option and explicitly uses "--workers 1", and that includes a comment
about only using one worker.
Thanks,
Scott K