We're Hiring!

Blitz Log Errors

Having a problem deploying OMERO? Please ask new questions at https://forum.image.sc/tags/omero
Please note:
Historical discussions about OMERO. Please look for and ask new questions at https://forum.image.sc/tags/omero

The OMERO.server installation documentation begins here and you can find OMERO.web deployment documentation here.

Re: Blitz Log Errors

Postby manics » Tue Feb 17, 2015 9:18 am

Hi Paul, Blair,

Could you both give us more details about your NFS mounts, such as configuration options, underlying hardware/networking, filesystems, whether they're shared, and anything else you think might be relevant. OMERO sometimes has problems with network mounts if they're incorrectly configured: https://www.openmicroscopy.org/site/sup ... ote-shares

Could you also check your system logs e.g. /var/log/messages for anything that looks relevant?

Thanks, Simon
User avatar
manics
Team Member
 
Posts: 261
Joined: Mon Oct 08, 2012 11:01 am
Location: Dundee

Re: Blitz Log Errors

Postby PaulVanSchayck » Tue Feb 17, 2015 10:08 am

Hi Simon,

Some outputs:

Code: Select all
$ nfsstat -m
/mnt/omero-nas from ***:/pools/***/OMERO-NAS
Flags: rw,noatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=*.*.*.*,mountvers=3,mountport=45027,mountproto=tcp,local_lock=none,addr=*.*.*.*


The connection to the nfs server is a 10Gbit connection with a latency < 1ms.

Visible in the syslog are these messages every now and then. But they do not correspond (not within a day) of the same time the errors appear in Blitz.0.log.
Code: Select all
Feb 13 15:13:20 **** kernel: [189123.786623] lockd: server *** not responding, still trying
Feb 13 15:13:20 **** kernel: [189123.796487] lockd: server *** OK


I've also not had any complains or error reports from users. So I'm not sure how much of an issue this is.

Thanks,

Paul
PaulVanSchayck
 
Posts: 41
Joined: Tue May 20, 2014 7:04 am

Re: Blitz Log Errors

Postby PaulVanSchayck » Tue Feb 17, 2015 10:45 am

After some additional log analyzing and asking around I think this may be related to downloading large amount (500+) of original files using the Insight client.
The symptom for the user was that the download took very long to complete (starting with hunderds of files per minute, and then slowing to a crawl of only several files every minute).

This is an example of one of the servants IDs that would later crash with "Error on close, stage1". Note that this 20 hours before the final error.

Code: Select all
2015-02-12 12:57:44,862 INFO  [                      omero.cmd.SessionI] (Server-878) Added servant to adapter: f53082c7-6ca4-480d-a4f4-ee68c458b197/9a88464f-63be-4c1b-859c-a9669258a74comero.api.RawFileSt
ore(omero.api._RawFileStoreTie@fe9700a)
2015-02-12 12:57:44,865 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Executor.doWork -- ome.services.blitz.repo.RepositoryDaoImpl.getFile(132806)
2015-02-12 12:57:44,865 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Args:    [null, InternalSF@1451537047]
2015-02-12 12:57:44,865 INFO  [    ome.security.basic.BasicEventContext] (Server-874)  cctx:    group=-1
2015-02-12 12:57:44,866 INFO  [         ome.security.basic.EventHandler] (Server-874)  Auth:    user=452,group=-1,event=null(User),sess=f53082c7-6ca4-480d-a4f4-ee68c458b197
2015-02-12 12:57:44,867 INFO  [                 org.perf4j.TimingLogger] (Server-874) start[1423742264865] time[2] tag[omero.call.success.ome.services.blitz.repo.RepositoryDaoImpl$14.doWork]
2015-02-12 12:57:44,867 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Rslt:    MDV_705/2015-02/10/08-10-55.458/Img3445_GKWI0005-33#20150103-1339.png
2015-02-12 12:57:44,867 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Executor.doWork -- ome.services.blitz.repo.RepositoryDaoImpl.getOriginalFile(132806)
2015-02-12 12:57:44,867 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Args:    [null, InternalSF@1451537047]
2015-02-12 12:57:44,867 INFO  [    ome.security.basic.BasicEventContext] (Server-874)  cctx:    group=-1
2015-02-12 12:57:44,868 INFO  [         ome.security.basic.EventHandler] (Server-874)  Auth:    user=452,group=-1,event=null(User),sess=f53082c7-6ca4-480d-a4f4-ee68c458b197
2015-02-12 12:57:44,870 INFO  [                 org.perf4j.TimingLogger] (Server-874) start[1423742264867] time[3] tag[omero.call.success.ome.services.blitz.repo.RepositoryDaoImpl$5.doWork]
2015-02-12 12:57:44,870 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Rslt:    ome.model.core.OriginalFile:Id_132806
2015-02-12 12:57:44,871 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Executor.doWork -- ome.services.blitz.repo.RepositoryDaoImpl.getOriginalFile(132806)
2015-02-12 12:57:44,871 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Args:    [null, InternalSF@1451537047]
2015-02-12 12:57:44,871 INFO  [    ome.security.basic.BasicEventContext] (Server-874)  cctx:    group=-1
2015-02-12 12:57:44,872 INFO  [         ome.security.basic.EventHandler] (Server-874)  Auth:    user=452,group=-1,event=null(User),sess=f53082c7-6ca4-480d-a4f4-ee68c458b197
2015-02-12 12:57:44,874 INFO  [                 org.perf4j.TimingLogger] (Server-874) start[1423742264871] time[2] tag[omero.call.success.ome.services.blitz.repo.RepositoryDaoImpl$5.doWork]
2015-02-12 12:57:44,874 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Rslt:    ome.model.core.OriginalFile:Id_132806
2015-02-12 12:57:44,874 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Executor.doWork -- ome.services.blitz.repo.RepositoryDaoImpl.setFileIdWithBuffer(132806, CheckedPath(MDV_705/2015-02/
10/08-10-55.458/Img3445_GKWI0005-33#20150103-1339.png), r)
2015-02-12 12:57:44,874 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Args:    [null, InternalSF@1850180323]
2015-02-12 12:57:44,875 INFO  [    ome.security.basic.BasicEventContext] (Server-874)  cctx:    group=104
2015-02-12 12:57:44,875 INFO  [         ome.security.basic.EventHandler] (Server-874)  Auth:    user=452,group=104,event=null(User),sess=f53082c7-6ca4-480d-a4f4-ee68c458b197
2015-02-12 12:57:44,876 INFO  [                 org.perf4j.TimingLogger] (Server-874) start[1423742264874] time[1] tag[omero.call.success.ome.services.blitz.repo.RepositoryDaoImpl$1.doWork]
2015-02-12 12:57:44,876 INFO  [        ome.services.util.ServiceHandler] (Server-874)  Rslt:    null
2015-02-12 12:57:44,876 INFO  [                      omero.cmd.SessionI] (Server-874) Added servant to adapter: f53082c7-6ca4-480d-a4f4-ee68c458b197/6ea41435-3790-48fa-93fd-dd30e424de79-RepoRawFileStoreI(
omero.api._RawFileStoreTie@1daa653)
2015-02-12 12:57:44,877 INFO  [        ome.services.util.ServiceHandler] (Server-880)  Meth:    interface ome.api.RawFileStore.read
2015-02-12 12:57:44,877 INFO  [        ome.services.util.ServiceHandler] (Server-880)  Args:    [0, 1013169]
2015-02-12 12:57:44,878 INFO  [         ome.security.basic.EventHandler] (Server-880)  Auth:    user=452,group=104,event=null(User),sess=f53082c7-6ca4-480d-a4f4-ee68c458b197
2015-02-12 12:57:44,879 INFO  [                 org.perf4j.TimingLogger] (Server-880) start[1423742264877] time[1] tag[omero.call.success.ome.services.RawFileBean.read]
2015-02-12 12:57:44,879 INFO  [        ome.services.util.ServiceHandler] (Server-880)  Rslt:    [-119, 80, 78, ... 1 more]
2015-02-12 12:57:44,906 INFO  [        ome.services.util.ServiceHandler] (Server-873)  Meth:    interface ome.api.StatefulServiceInterface.close
2015-02-12 12:57:44,906 INFO  [        ome.services.util.ServiceHandler] (Server-873)  Args:    ()
2015-02-12 12:57:44,913 INFO  [         ome.security.basic.EventHandler] (Server-873)  Auth:    user=452,group=104,event=2196840(User),sess=f53082c7-6ca4-480d-a4f4-ee68c458b197
2015-02-12 12:57:44,914 INFO  [                 org.perf4j.TimingLogger] (Server-873) start[1423742264906] time[7] tag[omero.call.success.ome.services.RawFileBean.close]
2015-02-12 12:57:44,914 INFO  [        ome.services.util.ServiceHandler] (Server-873)  Rslt:    null
2015-02-12 12:57:44,914 INFO  [        ome.services.util.ServiceHandler] (Server-873) Cleanup: ome.services.RawFileBean@3820396a
2015-02-12 12:57:44,914 INFO  [                      omero.cmd.SessionI] (Server-873) Unregistered servant:f53082c7-6ca4-480d-a4f4-ee68c458b197/6ea41435-3790-48fa-93fd-dd30e424de79-RepoRawFileStoreI(omero
.api._RawFileStoreTie@1daa653)
2015-02-12 12:57:44,914 WARN  [                      omero.cmd.SessionI] (Server-878) Holder is null for f53082c7-6ca4-480d-a4f4-ee68c458b197/9a88464f-63be-4c1b-859c-a9669258a74comero.api.RawFileStore(ome
ro.api._RawFileStoreTie@1d55e116)
2015-02-12 12:57:44,914 INFO  [                      omero.cmd.SessionI] (Server-878) Unregistered servant:f53082c7-6ca4-480d-a4f4-ee68c458b197/9a88464f-63be-4c1b-859c-a9669258a74comero.api.RawFileStore(o
mero.api._RawFileStoreTie@1d55e116)
PaulVanSchayck
 
Posts: 41
Joined: Tue May 20, 2014 7:04 am

Re: Blitz Log Errors

Postby brossetti » Tue Feb 17, 2015 12:37 pm

There is nothing in my sys logs. Here is a bit more on my mount:

Code: Select all
# nfsstat -m
/OMERODATA from ***.***.***.***:/volume1/data
Flags:   rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=***.***.***.***,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=***.***.***.***


I don't know if I've downloaded 500+ files from the insight client at once. My downloads and uploads are typically around 50 files. That said, my mount connection is limited by a gigabit switch. I will do some larger downloads today to see if errors pop up.

Cheers,
Blair
brossetti
 
Posts: 32
Joined: Wed Feb 12, 2014 8:57 pm

Re: Blitz Log Errors

Postby jmoore » Thu Feb 19, 2015 10:21 am

Hi Blair and Paul,

first, thanks for all the data. There certainly is a chance that OMERO isn't entirely happy with your NFS. But coming back to this after some time away, I'd like to be clear. I understand:

* the primary cause for concern is the presence of ERRORS in the logs
* large downloads are sometimes?/always? slow for some ? / all? users
* otherwise, there aren't any usability significant problems with the server

If that's roughly true, we should probably focus on the download speeds, since those combined with sessions timing out could be leading to the secondary messages we were looking at previously.

All the best,
~Josh.
User avatar
jmoore
Site Admin
 
Posts: 1591
Joined: Fri May 22, 2009 1:29 pm
Location: Germany

Re: Blitz Log Errors

Postby PaulVanSchayck » Mon Feb 23, 2015 9:23 am

Hi Josh,

Yes, the primary cause for concern is ERROR in the log. I'm not 100% sure whether the error also has a behavior for users. But one of the strong hunches I have is that it can be triggered by downloading large sets (250+) of original files.

Right now, I've managed to trigger the ERROR by:

- Downloading the original files of a dataset of 470 images (JPG files, several GB) using the INSIGHT 64 bit windows client.

- The download is quick (several files per few seconds) for the first 200 images. In then slows down, to eventually a crawl of 1 or 2 images every 10 seconds. At that point also in the log the following can be seen:

Code: Select all
2015-02-23 09:53:20,778 INFO  [ ome.services.util.ServiceHandler] (Server-203) Cleanup: ome.services.RawFileBean@339d41dc
2015-02-23 09:53:20,778 INFO  [omero.cmd.SessionI] (Server-203) Unregistered servant:99af1836-e928-46cd-abe5-75bf3c2361d0/f0a5d7c3-230f-4b73-96bf-7402a21b50d2-RepoRawFileStoreI(omero.api._RawFileStoreTie@23ddff02)
2015-02-23 09:53:20,778 WARN  [ omero.cmd.SessionI] (Server-260) Holder is null for 99af1836-e928-46cd-abe5-75bf3c2361d0/1627895d-5d6f-42b8-8859-3dfcf838d4ebomero.api.RawFileStore(omero.api._RawFileStoreTie@3e0ee0ec)
2015-02-23 09:53:20,778 INFO  [  omero.cmd.SessionI] (Server-260) Unregistered servant:99af1836-e928-46cd-abe5-75bf3c2361d0/1627895d-5d6f-42b8-8859-3dfcf838d4ebomero.api.RawFileStore(omero.api._RawFileStoreTie@3e0ee0ec)
2015-02-23 09:53:26,236 INFO  [ ome.services.util.ServiceHandler] (Server-270)  Meth:    interface ome.api.IContainer.getImages


What can be seen is that there is a pause between the second "Unregistered servant" line and the "interface ome.api.IContainer.getImages" call of 6 seconds. This seems to correspond at which moment the new image appears.

What may be a related but, very annoying UI behavior to say the least, is that the file dialog for the Download will stay visible during the entire download, and will keep getting refocused repeatedly. Also the Activity window will flicker behind this, but never come into view.

- Now, a way I've found to trigger the final ERROR quite quickly is to kill the client process (task manager) while the Download is in progress. Then restarting the client and starting a second download. Several minutes later the ERROR will show.

- What's important to observe is that the servant UUIDs in the final ERROR are the same as those shown in the WARN log above.

I'm BTW less sure that this is related to the fact that we run some parts of the binary repository over NFS.

With kind regards,

Paul
PaulVanSchayck
 
Posts: 41
Joined: Tue May 20, 2014 7:04 am

Re: Blitz Log Errors

Postby PaulVanSchayck » Thu Mar 19, 2015 11:20 am

Hi,

As an update, I can positively confirm that this error only occurs when downloading original files. It does not matter whether you use the java or web client. Nor does it seem to matter what number of files you select for download.

The moment the final error occurs "Error on close, stage1" is several hours (up to days) later than the export was performed.

I still cannot see any negative impact for users. But for monitoring purposes, having ignorable ERRORs is not desirable.

I'm also still not sure it is related to our NFS setup. But it somehow seems unlikely to me.

Thanks,

Paul
PaulVanSchayck
 
Posts: 41
Joined: Tue May 20, 2014 7:04 am

Re: Blitz Log Errors

Postby Dominik » Fri Mar 20, 2015 11:27 am

Thanks, that was a crucial lead. I think I finally fixed it: https://github.com/openmicroscopy/openm ... /pull/3643

Cheers,
Dominik
User avatar
Dominik
Team Member
 
Posts: 149
Joined: Mon Feb 10, 2014 11:26 am

Re: Blitz Log Errors

Postby PaulVanSchayck » Tue Mar 24, 2015 7:22 am

Thanks Dominik, glad you managed to track this down. Our monitoring system will be happy to get rid of these errors.
PaulVanSchayck
 
Posts: 41
Joined: Tue May 20, 2014 7:04 am

Re: Blitz Log Errors

Postby brossetti » Thu Apr 02, 2015 3:11 pm

Thanks, Dominik! Based on the discussion on github, it looks like the fix for this bug is not included in 5.1.0. Is this correct?
brossetti
 
Posts: 32
Joined: Wed Feb 12, 2014 8:57 pm

PreviousNext

Return to Installation and Deployment

Who is online

Users browsing this forum: No registered users and 1 guest