Open Microscopy Environment

by **PaulVanSchayck** » Mon Jun 23, 2014 11:53 am

Dear all,

Using OMERO.script (Python) I'm submitting image analysis jobs to a cluster queue. This is all working quite nicely.

Whenever an error occurs, or whenever the user (or developer) interrupts the script, I would like to clean up any remaining jobs from the queue. To do so, I've wrapped the execution of jobs in a `try:` / `finally:` block. When using `bin/omero script run`, the script can be cancelled using Ctrl+C; the `finally` block is then executed and any remaining jobs are removed.

However, when sending a SIGINT (`kill [pid]`) to the process, this block is never executed. As an alternative to `finally:`, I tried the following block of code, which also does not get executed.

Code:

    import signal
    import sys

    def signal_handler(signal, frame):
        print "Got SIGINT"

    signal.signal(signal.SIGINT, signal_handler)

Because the SIGINT signal is also sent when the script is cancelled from, for example, OMERO.insight, cancelling there doesn't clean up the remaining cluster jobs either.

What's a workable solution to this problem? Is there a way to make a shutdown handler for OMERO.script?

Thanks,

Paul

by **jmoore** » Mon Jun 23, 2014 1:53 pm

Hi Paul,

Interesting idea! I must admit I have not tried this before. Could you paste the entire example you were attempting? Testing briefly, though, I put this under lib/scripts:

Code:

    import signal
    import sys
    import time
    import omero
    import omero.scripts

    c = omero.scripts.client("sig_example")

    def handler(signal, frame):
        print signal
        sys.exit(signal)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGQUIT, handler)

    while True:
        time.sleep(1)

and I could certainly see a "15" being printed if I "kill PID"'d the process.

~Josh

by **PaulVanSchayck** » Tue Jun 24, 2014 7:34 am

Hey Josh,

Thanks for your response. I've no idea why I only placed a handler on signal.SIGINT when signal.SIGTERM is of course the one being sent. So that fixed that, and the shutdown handler is now being executed when `kill`ing the PID manually.
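For readers who want the original `try:` / `finally:` cleanup to run on a plain `kill PID` as well, one possible approach is to have the SIGTERM handler raise `SystemExit`, so the stack unwinds through the `finally` block. The following is a minimal sketch, not code from this thread: it assumes Python 3 on a POSIX system, it must run in the main thread, and the names `terminate`, `run_jobs`, `cancelled_work` and `cleanup_log` are hypothetical stand-ins (no OMERO or cluster calls are made).

```python
import os
import signal
import sys
import time

cleanup_log = []  # hypothetical stand-in for "remove remaining cluster jobs"


def terminate(signum, frame):
    # Raising SystemExit from the handler unwinds the stack like any other
    # exception, so enclosing `finally` blocks get a chance to run.
    sys.exit(128 + signum)


def run_jobs(work):
    # Install the handler only for the duration of the work.
    previous = signal.signal(signal.SIGTERM, terminate)
    try:
        work()
    finally:
        cleanup_log.append("remaining jobs removed")
        signal.signal(signal.SIGTERM, previous)


def cancelled_work():
    # Simulate `kill PID` from another shell: plain `kill` sends SIGTERM.
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(1)  # pretend to keep working; normally never completes


try:
    run_jobs(cancelled_work)
except SystemExit as exc:
    exit_code = exc.code  # 128 + signal number, a common shell convention
```

The `except SystemExit` at the end exists only so the sketch can be inspected after it runs; a real script would normally let the exit propagate.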

However, when cancelling the process from OMERO.insight (in the activities window), it seems like the kill command is sent twice. This can be seen in the Processor.log:

Code:

    2014-06-24 08:41:22,366 INFO [ omero.processor.ProcessI] (Dummy-6 ) <proc:19619,rc=-,uuid=f4ddad5f-2c94-43e6-b547-b67f1d043669> : Polling
    2014-06-24 08:41:22,366 INFO [ omero.remote] (Dummy-6 ) Rslt: None
    2014-06-24 08:41:22,380 INFO [ omero.remote] (Dummy-8 ) Meth: ProcessI.cancel
    2014-06-24 08:41:22,380 INFO [ omero.processor.ProcessI] (Dummy-8 ) <proc:19619,rc=-,uuid=f4ddad5f-2c94-43e6-b547-b67f1d043669> : os.kill(TERM)
    2014-06-24 08:41:22,380 INFO [ omero.processor.ProcessI] (Dummy-8 ) <proc:19619,rc=-,uuid=f4ddad5f-2c94-43e6-b547-b67f1d043669> : Callback processCancelled
    2014-06-24 08:41:22,381 INFO [ omero.remote] (Dummy-8 ) Rslt: False
    2014-06-24 08:41:22,386 INFO [ omero.remote] (Dummy-6 ) Meth: ProcessI.cancel
    2014-06-24 08:41:22,386 INFO [ omero.processor.ProcessI] (Dummy-6 ) <proc:19619,rc=-,uuid=f4ddad5f-2c94-43e6-b547-b67f1d043669> : os.kill(TERM)
    2014-06-24 08:41:22,386 INFO [ omero.processor.ProcessI] (Dummy-6 ) <proc:19619,rc=-,uuid=f4ddad5f-2c94-43e6-b547-b67f1d043669> : Callback processCancelled
    2014-06-24 08:41:22,386 INFO [ omero.remote] (Dummy-6 ) Rslt: False
    2014-06-24 08:41:38,470 INFO [ omero.remote] (Thread-2 ) Meth: ProcessI.poll
    2014-06-24 08:41:38,470 INFO [ omero.processor.ProcessI] (Thread-2 ) <proc:19619,rc=-,uuid=f4ddad5f-2c94-43e6-b547-b67f1d043669> : Polling
    2014-06-24 08:41:38,470 INFO [ omero.remote] (Thread-2 ) Rslt: None
    2014-06-24 08:41:38,474 INFO [ omero.remote] (Thread-2 ) Meth: ProcessI.poll

You can see the os.kill(TERM) log twice, in short succession (6 ms). And indeed, sometimes this is long enough to clean up all jobs. But most of the time it is not. In those cases I can see from the traceback that the second kill comes in the middle of the job cleaning procedure.
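The 6 ms figure can be checked directly from the two `os.kill(TERM)` timestamps in the log excerpt. This is a small sketch in Python 3 using only the standard library; the helper name `gap_in_milliseconds` and the format constant are illustrative, and the format string is an assumption read off the quoted log lines.

```python
from datetime import datetime, timedelta

# Timestamps of the two "os.kill(TERM)" entries quoted from Processor.log.
first_kill = "2014-06-24 08:41:22,380"
second_kill = "2014-06-24 08:41:22,386"

# Assumed layout of the timestamp prefix: date, time, comma, milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_in_milliseconds(earlier, later):
    """Return the whole number of milliseconds between two log timestamps."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return (end - start) // timedelta(milliseconds=1)


gap_ms = gap_in_milliseconds(first_kill, second_kill)
print(gap_ms)  # 6
```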

The cancellation is also accompanied by an exception in OMERO.insight:

Code:

    java.lang.Exception: org.openmicroscopy.shoola.env.data.ProcessException: Cannot cancel the following script:
        at org.openmicroscopy.shoola.env.data.ScriptCallback.cancel(ScriptCallback.java:138)
        at org.openmicroscopy.shoola.env.data.views.ProcessCallback.cancel(ProcessCallback.java:101)
        at org.openmicroscopy.shoola.env.ui.ScriptRunner.cancel(ScriptRunner.java:111)
        at org.openmicroscopy.shoola.env.ui.ScriptRunner.update(ScriptRunner.java:134)
        at org.openmicroscopy.shoola.env.data.events.DSCallAdapter.eventFired(DSCallAdapter.java:75)
        at org.openmicroscopy.shoola.env.data.views.BatchCallMonitor$1.run(BatchCallMonitor.java:124)
        at java.awt.event.InvocationEvent.dispatch(Unknown Source)
        at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
        at java.awt.EventQueue.access$200(Unknown Source)
        at java.awt.EventQueue$3.run(Unknown Source)
        at java.awt.EventQueue$3.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
        at java.awt.EventQueue.dispatchEvent(Unknown Source)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.run(Unknown Source)

Thanks,

Paul

[edit]
I suspect I found where the 6ms comes from:
https://github.com/openmicroscopy/openm ... or.py#L607
If that's the source, isn't that a bit short?

by **jmoore** » Tue Jun 24, 2014 8:19 am

Hi Paul,

PaulVanSchayck wrote: So that fixed that, and the shutdown handler is now being executed when `kill`ing the PID manually.

Excellent.

PaulVanSchayck wrote: However, when cancelling the process from OMERO.insight (in the activities window), it seems like the kill command is sent twice.
...
You can see the os.kill(TERM) log twice, in short succession (6 ms). And indeed, sometimes this is long enough to clean up all jobs. But most of the time it is not. In those cases I can see from the traceback that the second kill comes in the middle of the job cleaning procedure.

Hmmm.... I'm not sure where this would be coming from. We'll do some investigation under https://trac.openmicroscopy.org.uk/ome/ticket/12405

PaulVanSchayck wrote: This is accompanied with an exception in OMERO.insight:

    java.lang.Exception: org.openmicroscopy.shoola.env.data.ProcessException: Cannot cancel the following script:

I saw this myself while trying to test the signal handler and filed the ticket https://trac.openmicroscopy.org.uk/ome/ticket/12404.

PaulVanSchayck wrote: I suspect I found where the 6ms comes from... If that's the source, isn't that a bit short?

The "6" there is seconds as opposed to milliseconds, but thanks for taking such a close look!

Cheers,
~Josh

by **PaulVanSchayck** » Tue Jun 24, 2014 8:59 am

Dear Josh,

Thanks for the quick response. I realized after I posted that wait(6) is indeed in seconds.

Thanks for the tickets. I've also made a small test case. To display the behavior, it's important not to call sys.exit() within handler() itself, but after a few seconds in a separate function.

Code:

    import signal
    import sys
    import time
    import omero
    import omero.scripts

    c = omero.scripts.client("sig_example")

    def handler(signal, frame):
        print "SIGTERM received"
        shutdown()

    signal.signal(signal.SIGTERM, handler)

    def shutdown():
        print "before sleep"
        time.sleep(10)
        print "after sleep"
        sys.exit(0)

    while True:
        time.sleep(1)

So the behavior I see is that, upon pressing cancel, the log shows the kill signal being sent twice. After a while the Python process goes zombie and then closes. At that point the stdout is still uploaded to /OMERO/Files. The output is:

Code:

    SIGTERM received
    before sleep
    SIGTERM received
    before sleep
    after sleep

So the shutdown handler is executed twice.

A different, but slightly related question is the use of "client.enableKeepAlive(60)". Is this the preferred way to allow the execution of long jobs, and not have the session timeout in between? We're intending to run jobs that can take several hours.

I was also slightly surprised at first that logging out of the web interface also forces any script jobs to be cancelled (BTW, that's NOT sending the kill signal twice!). This is not the behavior we're looking for, as these long jobs may run overnight. It can possibly be worked around by telling users to close their browser/insight client and not to log out. I realize the reasons why this happens, and also found the ticket about it:
http://trac.openmicroscopy.org.uk/ome/ticket/5887

Paul

by **jmoore** » Wed Jun 25, 2014 7:22 pm

Hi Paul,

PaulVanSchayck wrote: So the behavior I see is that, upon pressing cancel, the log shows the kill signal being sent twice. After a while the Python process goes zombie and then closes. At that point the stdout is still uploaded to /OMERO/Files. The output is:
...
So the shutdown handler is executed twice.

As a part of 12405 we'll definitely take a look at that. Until then, you might set some thread-safe state in your handler or shutdown function so that your method only runs once:

Code:

    import signal
    import sys
    import time
    import omero
    import omero.scripts
    import omero.util.concurrency as ouc

    c = omero.scripts.client("sig_example")
    event = ouc.get_event("my_script")

    def handler(signal, frame):
        print "SIGTERM received"
        shutdown()

    signal.signal(signal.SIGTERM, handler)

    def shutdown():
        if event.is_set():
            print "Event already set"
        else:
            event.set()
            print "before sleep"
            time.sleep(10)
            print "after sleep"
            sys.exit(0)

    while not event.is_set():
        event.wait(1)
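The same run-once idea can be sketched with the standard library alone, which makes it easy to try outside OMERO. This is a variation on the suggestion above rather than the same code: it assumes Python 3 and swaps the event for a non-blocking `threading.Lock` acquire, which tests and sets the guard in one step. The names `RunOnce`, `remove_cluster_jobs` and `calls` are hypothetical, and the second SIGTERM is simulated by re-entering the guard from inside the cleanup.

```python
import threading


class RunOnce:
    """Wrap a cleanup callable so that repeated requests run it only once."""

    def __init__(self, cleanup):
        self._cleanup = cleanup
        self._guard = threading.Lock()

    def __call__(self, signum=None, frame=None):
        # A non-blocking acquire either takes the lock or fails immediately,
        # so a second signal arriving during cleanup is simply ignored.
        if not self._guard.acquire(blocking=False):
            return False
        self._cleanup()  # the lock is deliberately never released
        return True


calls = []


def remove_cluster_jobs():
    calls.append("cleanup")
    # Simulate the second SIGTERM arriving in the middle of the cleanup.
    calls.append("second signal handled: %s" % shutdown())


shutdown = RunOnce(remove_cluster_jobs)
first_result = shutdown()
```

Because `__call__` accepts `(signum, frame)`, an instance could be registered directly with `signal.signal(signal.SIGTERM, shutdown)`.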

PaulVanSchayck wrote: A different, but slightly related question is the use of "client.enableKeepAlive(60)". Is this the preferred way to allow the execution of long jobs, and not have the session timeout in between? We're intending to run jobs that can take several hours.

Interesting you should mention that. See https://github.com/openmicroscopy/openmicroscopy/pull/2681. There's currently no need to use enableKeepAlive, but there is a hard one-hour limit at the moment that 2681 should correct for you. How long do you expect your scripts to go without connecting to the server?

PaulVanSchayck wrote: I was also slightly surprised at first that logging out of the web interface also forces any script jobs to be cancelled (BTW, that's NOT sending the kill signal twice!). This is not the behavior we're looking for, as these long jobs may run overnight. It can possibly be worked around by telling users to close their browser/insight client and not to log out. I realize the reasons why this happens, and also found the ticket about it:
http://trac.openmicroscopy.org.uk/ome/ticket/5887

Right, which is still an issue. I'll see where we are with that.

Cheers,
~Josh.

by **PaulVanSchayck** » Thu Jun 26, 2014 8:53 am

Hi Josh,

Thanks for the event suggestion as a workaround. That's worth trying.

jmoore wrote: Interesting you should mention that. See https://github.com/openmicroscopy/openm ... /pull/2681. There's currently no need to use enableKeepAlive, but there is a hard one-hour limit at the moment that 2681 should correct for you. How long do you expect your scripts to go without connecting to the server?

Yes, I saw your work on this. What do you define as "without connecting"? Right now I'm polling for the completion of the jobs on the cluster in a non-blocking way, so the Python script thread has time to perform actions. I've not tested with analyses running over an hour. Should there be any problem right now, before the mentioned PR, with KeepAlive enabled?

I'm also looking at a way to kill jobs from a web app. Right now I'm passing messages from the script to the web app by (ab)using omero.model.job.message like this (code shortened a bit for readability). This is a workaround for the fact that you cannot create callbacks from your script (Trac ticket).

Code:

    jobid = conn.getProperty("omero.job")
    jobObj = queryService.findByQuery("from Job where id = :jobid", p)
    jobObj.setMessage(rstring(msg))
    updateService.saveObject(jobObj)

I could abuse this to pass the PID of the script to the web app, which could then kill the script. However, I've been looking at a nicer way to achieve this by getting the omero.api.JobHandle(). Something like:

Code:

    handle = conn.c.sf.createJobHandle()
    handle.attach(long(jobid))
    handle.cancelJob()

This works, and sets the jobStatus to cancel. But does not actually send a kill signal (as seen in Processor.log). I suspect that this may not be possible as I do not see a possibility to get the omero.grid.process().

Thanks,

Paul

by **jmoore** » Thu Jun 26, 2014 9:53 am

Morning, Paul

PaulVanSchayck wrote: What do you define as "without connecting"?

Sorry for being unclear. The two issues at the moment with long-running scripts are: 1) if they run over an hour, a limit which should soon be configurable, and 2) if they make no invocation on OMERO for 10 minutes. If you think a single processing step will take more than the 10-minute limit, it might make sense to use "enableKeepAlive" so that a separate thread does this for you. Otherwise, it shouldn't be necessary.
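To illustrate the "separate thread" idea in general terms, here is a minimal Python 3 sketch of a background thread that calls a function at a fixed interval until it is stopped. It is not how `enableKeepAlive` is implemented and makes no OMERO calls; `PeriodicPing` and the `ping` callable are hypothetical, and the very short interval is only there so the example finishes quickly.

```python
import threading
import time


class PeriodicPing:
    """Call `ping` every `interval_seconds` on a background thread."""

    def __init__(self, ping, interval_seconds):
        self._ping = ping
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop pings on every timeout and ends promptly on stop.
        while not self._stop.wait(self._interval):
            self._ping()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


pings = []
keep_alive = PeriodicPing(lambda: pings.append(time.time()), interval_seconds=0.01)
keep_alive.start()
time.sleep(0.2)  # stands in for a long processing step in the main thread
keep_alive.stop()
pings_after_stop = len(pings)
```

In a real script the `ping` callable would be whatever cheap server invocation keeps the session active; that choice is left open here.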

PaulVanSchayck wrote: ...Should there be any problem right now, before the mentioned PR, with KeepAlive enabled?

Other than the above stipulation, it shouldn't be necessary.

PaulVanSchayck wrote: I'm also looking at a way to kill jobs from a web app. Right now I'm passing messages from the script to the web app by (ab)using omero.model.job.message... However, I've been looking at a nicer way to achieve this by getting the omero.api.JobHandle().... This works, and sets the jobStatus to cancel. But does not actually send a kill signal (as seen in Processor.log). I suspect that this may not be possible as I do not see a possibility to get the omero.grid.process().

It certainly sounds like it's not working as it should (#12415). That being said, if you have a cluster and you're looking to provide visualization into that, it may be easier to have OMERO.scripts submit directly to the cluster rather than having OMERO hold on to the process itself. The folks from NERSC/LBL have recently tried something similar.

Cheers,
~Josh.

by **PaulVanSchayck** » Fri Jun 27, 2014 6:31 am

Dear Josh,

Thanks for all the explaining and the tickets.

The reason for having OMERO hold on to the process is that some of the output is added back in the form of annotations, comments, tags and attachments. Having an OMERO.script manage that seemed to be the most logical choice. So yes, the OMERO.script is submitting jobs to the cluster, and yes I'm trying to provide some visualization into that submitting process and script progress callbacks.

One final question: should I consider any performance (or other) issues when using jobObj.setMessage() as a callback method for my script? I noticed it also creates update events in the database. What do you think of, let's say, calling it several times per minute?

Thanks,

Paul

by **jmoore** » Fri Jun 27, 2014 1:30 pm

PaulVanSchayck wrote: One final question: should I consider any performance (or other) issues when using jobObj.setMessage() as a callback method for my script? I noticed it also creates update events in the database. What do you think of, let's say, calling it several times per minute?

It has a slightly higher overhead than some other methods, but you should still be fine. Another alternative would be to attach an annotation to the source object (image/dataset/etc.) and just modify it in a loop:

Code:

    annotation.textValue = rstring("new status")
    annotation = client.sf.getUpdateService().saveAndReturnObject(annotation)

Or if you don't need the value to be persistent at all, you could store it in the session itself:

Code:

    client.setOutput("my.value", "new status")

Cheers,
~Josh.

Shutdown handler in an OMERO.script
