
open files not released

Postby evenhuis » Tue Feb 12, 2019 1:09 am

Hi,

this came up in another post:
viewtopic.php?f=5&t=8611

We are running into problems with open files. When I select a batch of 32 files to delete, this results in about 7,000 files being opened and never released.

What has been happening is that we run into the open file limit (~20K for our system), which degrades the system enough that either OMERO is forced to restart itself or we have to restart it manually.

Is there a way (other than restarting) to release these files?
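
(For reference, here's roughly how we check the limit and the current usage on the server; a sketch, assuming the Blitz java process can be found with pgrep:)

Code:
# per-process open file limit for the Blitz java process
PID=$(pgrep -f Blitz-0 | head -1)
grep 'open files' /proc/$PID/limits

# how many descriptors it currently holds
ls /proc/$PID/fd | wc -l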


Thanks,

Chris

Re: open files not released

Postby mtbc » Tue Feb 12, 2019 10:07 am

Chris,

That's definitely unexpected: checking the server code, I indeed see calls simply to File.exists(), File.delete() and the like, so I don't see why the files would be opened at all. There are also open sockets for connecting to the database to remove OriginalFile objects and so on, but those are assigned from a fixed pool.

Let's confirm some basics: Which version of OMERO, on what platform? How is the managed repository mounted on the server? (It's some kind of network mount?) How exactly are you deleting: is this selecting them in the left-hand tree view in OMERO.web then using the delete function there?
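
(If it's easier, something like this should show the mount type; /OMERO/ManagedRepository is just an assumption about where your repository lives:)

Code:
# show the filesystem source and type backing the managed repository
findmnt -T /OMERO/ManagedRepository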

Do you have a sense of what these 7000 files might be? How many originally imported files would be entailed in this deletion? (Then maybe we have a pyramid file and an import log for each fileset? Companion files or attachments maybe?)

Using lsof or similar, you're determining that it's this deletion that's incrementing the number of open files by this much? Do you know if it's the OMERO blitz process holding these open? (I also wonder if it could be something in the web stack: is OMERO.web deployed on the same machine?)
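
(For instance, something along these lines; the pgrep pattern is an assumption about how Blitz appears in your process list:)

Code:
# count open descriptors held by the Blitz java process
lsof -p "$(pgrep -f Blitz-0 | head -1)" | wc -l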

Maybe we can try to pin things down a bit to generate clues or try to reproduce at our end. For instance, perhaps there is some code path left that's not doing the deletion as a single bulk operation.

Cheers,
Mark

Re: open files not released

Postby evenhuis » Tue Feb 12, 2019 11:37 am

Hi Mark,

> Let's confirm some basics: Which version of OMERO, on what platform?

I'm not at work at the moment, so I need to check the details later. We updated the server and web in November; we haven't applied the Jan update yet. The web version is OMERO.web 5.4.9-ice36-b101.

> How is the managed repository mounted on the server? (It's some kind of network mount?)

It is on an EMC Isilon storage device. I guess it's an NFS mount; I'll check tomorrow.

> How exactly are you deleting: is this selecting them in the left-hand tree view in OMERO.web then using the delete function there?

Selected in the left-hand tree in OMERO.web and the trash can icon pressed, with "yes" to removing annotations. I have also deleted from the Insight client.

> Do you have a sense of what these 7000 files might be? How many originally imported files would be entailed in this deletion? (Then maybe we have a pyramid file and an import log for each fileset? Companion files or attachments maybe?)

32 tif files were selected for deletion, resulting in 7,000 files opened and not released. The files are associated with slide scanner output. For each root file name there is:
  • an ndpi, which OMERO was able to construct the pyramid for;
  • a macro.tif and map.tif, which OMERO handles fine (imported separately from the ndpi);
  • a bunch of x10_i1j1.tif, x20_i1j1.tif files (imported separately from the ndpi).

It looks like the x20_i1j1.tif files were pyramid tiffs exported from some viewer software.

These files are cursed and cause big problems for Bio-Formats and OMERO. A single 16 MB tif in OMERO will clog the pixel data service, taking about a day to open before exhausting the memory. In Fiji with Bio-Formats it takes 30 minutes at 100% CPU to open the 16 MB file. I've uploaded an example in the other thread, and this behaviour was replicated on the OME servers.

Maybe when OMERO attempted to construct the pyramid it spewed a bunch of files around (hence 32 original files -> 7,000 to delete)?
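
(A hypothetical way to check, assuming the default /OMERO/Pixels data directory:)

Code:
# count pyramid files under the pixel data directory
find /OMERO/Pixels -name '*_pyramid*' | wc -l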

> Using lsof or similar, you're determining that it's this deletion that's incrementing the number of open files by this much? Do you know if it's the OMERO blitz process holding these open?


Attached is a screenshot from our monitoring service.

[Attachment: Screen Shot 2019-02-12 at 10.45.03 am copy.png (37.45 KiB) - open file count from the monitoring service]

  • 12:00: I deleted about 50 files and locked the server up until it was manually restarted at 16:00.
  • 20:00: I deleted about 50 files again and triggered a restart.
  • 22:00: I deleted 32 files; the open file count went up by 7,000 and these were never released.

> (I also wonder if it could be something in the web stack: is OMERO.web deployed on the same machine?)

Server, web and db are on separate machines. The screenshot is for the OMERO.server machine.

I'll get the rest of the info in the morning. Is there anywhere I should look on the server for any errant files?

Thanks,

Chris

Re: open files not released

Postby evenhuis » Wed Feb 13, 2019 8:56 am

Hi,

No word back from the infrastructure team about the server specs yet.

I had a bit of a play around this morning. I restarted the server to clear the opened files (8:35), then deleted two files (8:44 and 8:54).


[Attachment: Screen Shot 2019-02-13 at 7.34.28 pm.png (28.57 KiB) - open file count around the two deletion events]


According to the monitoring system, 386 and 399 files were opened for the two deletion events. Using lsof the numbers were different, but the trends were the same. I'm not sure why; I'm not an expert.

According to lsof, 610 new files were opened by omero after deletion. Grouped by name (with counts), they are:

Code:
Name                                                                                                                                          count
prdomeapp01.itd.int.uts.edu.au:59666->prdomedb01.itd.int.uts.edu.au:postgres (ESTABLISHED)   54
prdomeapp01.itd.int.uts.edu.au:38939->prdomeapp01.itd.int.uts.edu.au:54482 (ESTABLISHED)    54
prdomeapp01.itd.int.uts.edu.au:38939->prdomeapp01.itd.int.uts.edu.au:54470 (ESTABLISHED)    54
localhost:45736->localhost:32881 (ESTABLISHED)                                              54
prdomeapp01.itd.int.uts.edu.au:59664->prdomedb01.itd.int.uts.edu.au:postgres (ESTABLISHED)  54
prdomeapp01.itd.int.uts.edu.au:59672->prdomedb01.itd.int.uts.edu.au:postgres (ESTABLISHED)  54
prdomeapp01.itd.int.uts.edu.au:59662->prdomedb01.itd.int.uts.edu.au:postgres (ESTABLISHED)  54
prdomeapp01.itd.int.uts.edu.au:54482->prdomeapp01.itd.int.uts.edu.au:38939 (ESTABLISHED)    54
prdomeapp01.itd.int.uts.edu.au:59652->prdomedb01.itd.int.uts.edu.au:postgres (ESTABLISHED)  54
prdomeapp01.itd.int.uts.edu.au:59670->prdomedb01.itd.int.uts.edu.au:postgres (ESTABLISHED)  54
prdomeapp01.itd.int.uts.edu.au:59668->prdomedb01.itd.int.uts.edu.au:postgres (ESTABLISHED)  54


These look like connections from the OMERO.server computer to the db computer.
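
(For reference, the grouping above came from a pipeline along these lines:)

Code:
# group lsof output for the omero user by the NAME column and count
lsof -u omero | awk '{ name=""; for (i=9; i<=NF; i++) name = name $i " "; print name }' \
    | sort | uniq -c | sort -rn | head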

Looking at the details for a single name, the PID and TID vary systematically:
Code:
COMMAND    PID      TID     USER
java       12819            omero
java       12819    12820   omero
java       12819    12821   omero
java       12819    12822   omero
java       12819    12823   omero
java       12819    12824   omero
java       12819    12825   omero
java       12819    12826   omero
java       12819    12827   omero
java       12819    12828   omero
java       12819    12829   omero
java       12819    12830   omero
java       12819    12831   omero
java       12819    12832   omero
java       12819    12833   omero
java       12819    13015   omero
java       12819    13016   omero
java       12819    13017   omero
java       12819    13018   omero
java       12819    13020   omero
java       12819    13037   omero
java       12819    13040   omero
java       12819    13049   omero
java       12819    13050   omero
java       12819    13053   omero
java       12819    13063   omero
java       12819    13066   omero
java       12819    13067   omero
java       12819    13068   omero
java       12819    13069   omero
java       12819    13071   omero
java       12819    13613   omero
java       12819    13680   omero
java       12819    13851   omero
java       12819    13872   omero
java       12819    13900   omero
java       12819    13962   omero
java       12819    14000   omero
java       12819    14026   omero
java       12819    14030   omero
java       12819    14031   omero
java       12819    14033   omero
java       12819    14066   omero
java       12819    14069   omero
java       12819    14091   omero
java       12819    14155   omero
java       12819    14222   omero
java       12819    14225   omero
java       12819    14231   omero
java       12819    14248   omero
java       12819    14250   omero
java       12819    14345   omero
java       12819    14346   omero
java       12819    14347   omero
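
(To see the per-thread breakdown directly, recent versions of lsof can list one row per task with -K:)

Code:
# one row per task (thread) of the Blitz process, PID from the table above
lsof -K -p 12819 | head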


I don't have the Blitz log handy at the moment, but after deleting it is filled with tens to hundreds of messages like these:

Code:
2019-02-13 08:55:03,441 INFO  [  ome.services.pixeldata.PixelDataThread] (l.Server-7) Received: ome.io.messages.MissingPyramidMessage[source=ome.io.nio.PixelsService@6077d2e7]
2019-02-13 08:55:03,442 INFO  [        ome.services.util.ServiceHandler] (2-thread-4)  Executor.doWork -- ome.services.pixeldata.PixelDataThread.getGroupId[]
2019-02-13 08:55:03,442 INFO  [        ome.services.util.ServiceHandler] (2-thread-4)  Args:   [null, InternalSF@618785034]
2019-02-13 08:55:03,443 INFO  [    ome.security.basic.BasicEventContext] (2-thread-4)  cctx:   group=-1
2019-02-13 08:55:03,445 INFO  [         ome.security.basic.EventHandler] (2-thread-4)  Auth:   user=0,group=-1,event=null(Internal),sess=46c67a0b-125f-4275-90cc-940083d52704
2019-02-13 08:55:03,447 INFO  [                 org.perf4j.TimingLogger] (2-thread-4) start[1550008503442] time[5] tag[omero.call.success.ome.services.pixeldata.PixelDataThread$3.doWork]
2019-02-13 08:55:03,448 INFO  [        ome.services.util.ServiceHandler] (2-thread-4)  Rslt:   206
2019-02-13 08:55:03,448 INFO  [        ome.services.util.ServiceHandler] (2-thread-4)  Executor.doWork -- ome.services.pixeldata.PixelDataThread.createEvent[]
2019-02-13 08:55:03,448 INFO  [        ome.services.util.ServiceHandler] (2-thread-4)  Args:   [null, InternalSF@618785034]
2019-02-13 08:55:03,449 INFO  [    ome.security.basic.BasicEventContext] (2-thread-4)  cctx:   group=206
2019-02-13 08:55:03,456 INFO  [         ome.security.basic.EventHandler] (2-thread-4)  Auth:   user=0,group=206,event=4366228(Internal),sess=46c67a0b-125f-4275-90cc-940083d52704
2019-02-13 08:55:03,457 INFO  [  ome.services.pixeldata.PixelDataThread] (2-thread-4) Creating PIXELDATA event for pixels id:22363
2019-02-13 08:55:03,467 INFO  [                 org.perf4j.TimingLogger] (2-thread-4) start[1550008503448] time[19] tag[omero.call.success.ome.services.pixeldata.PixelDataThread$4.doWork]
2019-02-13 08:55:03,467 INFO  [        ome.services.util.ServiceHandler] (2-thread-4)  Rslt:   ome.model.meta.EventLog:Id_13154053
2019-02-13 08:55:03,467 INFO  [                ome.io.nio.PixelsService] (l.Server-7) Missing pyramid:/OMERO/Pixels/Dir-022/22363_pyramid
2019-02-13 08:55:03,474 INFO  [                 org.perf4j.TimingLogger] (l.Server-7) start[1550008503467] time[7] tag[omero.transcodeSVG]
2019-02-13 08:55:03,485 INFO  [             ome.io.nio.FilePathResolver] (l.Server-7) Metadata only file, resulting path: /OMERO/ManagedRepository/11414253_316/2018-05/21/12-00-23.267/407_slide4_6.3.18_a4_x40_z0_i1j14.tif
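
(To gauge how many of these messages a deletion generates, something like this works; Blitz-0.log is the standard log under the server's var/log directory:)

Code:
# count pyramid-related messages since the last log rotation
grep -c 'MissingPyramidMessage' var/log/Blitz-0.log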


Hope it sheds some light.

Chris

Re: open files not released

Postby evenhuis » Wed Feb 13, 2019 9:35 am

Hi,

I found the result of a `tail -f` in my terminal history. This begins just before the second deletion event (about 8:55) and runs until I thought the deletion had completed and I canceled the tail.

I've attached the file here because the upload site
https://www.openmicroscopy.org/qa2/qa/upload/
keeps failing to load the Flash plugin (tested on Safari and Firefox).

Cheers,

Chris

[Attachment: Blitz0.log.tar.gz (62.24 KiB)]

Re: open files not released

Postby jmoore » Thu Feb 14, 2019 5:22 pm

Hi Chris,

sorry for coming in late; I'm still trying to get my head around this. It sounds like delete is opening hundreds of connections to PostgreSQL and not letting them go. If so, it's no wonder that your server hangs. (It also surprises me, since PostgreSQL is usually quite strict about the number of connections it permits.)

It would be good to know both what the server is blocking on and what the database server thinks it's doing.

For OMERO, if you have jstack installed, then `jstack $PID` (where you can set PID with `PID=$(bin/omero admin ice server pid Blitz-0)`) should show you what the server is doing. If you don't have jstack installed, then `kill -QUIT $PID` will print the stack trace to OMERO's var/log/master.out log.
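
(Putting that together, something like:)

Code:
# capture a thread dump from the Blitz server process
PID=$(bin/omero admin ice server pid Blitz-0)
jstack "$PID" > /tmp/blitz_jstack.txt

# without jstack: ask the JVM to dump its stacks to var/log/master.out
kill -QUIT "$PID"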

For PostgreSQL, `select * from pg_stat_activity` would be useful, or even initially `ps auxw | grep postgres`. It might be that the PG logs will tell us more.
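
(For instance, assuming the database is named omero:)

Code:
# what each backend connection is doing right now
psql -d omero -c "SELECT pid, state, query FROM pg_stat_activity;"

# quick look at the postgres processes themselves
ps auxw | grep postgres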

Sorry I don't have anything more helpful. Definitely looking forward to figuring this out!
~Josh

P.S. Have any configuration changes been recently applied regarding the database server?

Re: open files not released

Postby evenhuis » Sun Apr 07, 2019 11:15 pm

Hi Josh,

here's the output from jstack from the last batch of deletions we did.

Hope it helps.

Cheers,

Chris
[Attachment: jstack_omero.tar.gz (145.23 KiB) - jstack output when deletion leaves files open]

Re: open files not released

Postby evenhuis » Mon Apr 08, 2019 6:02 am

Hi Josh,

after deleting the cursed tif files from OMERO, the pixel data service is able to construct the pyramids for stitched images again, and it is in the process of working its way through a backlog from the last couple of months.

Attached is a plot of the open files on our system throughout the day.

The two peaks around 9:30am are when I deleted the last batch of cursed tifs. As discussed before, this creates a bunch of open files that require a restart to release.

After this, the pixel service is happily processing pyramids again, but I noticed the open file count is creeping up again.

At 3pm I interacted with a couple of directories with outstanding pyramids to check the progress. This caused the open file count to jump up.

So it seems that whenever a file with an outstanding pyramid is touched (either deleted or viewed), files are opened and not released.
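
(We're keeping an eye on it with something like the following; a sketch, using the PID lookup Josh mentioned above:)

Code:
# poll the Blitz process's descriptor count once a minute
PID=$(bin/omero admin ice server pid Blitz-0)
watch -n 60 "ls /proc/$PID/fd | wc -l"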

Cheers,

Chris


[Attachment: Screen Shot 2019-04-08 at 3.51.57 pm.png (49.77 KiB) - open file count throughout the day]

Re: open files not released

Postby mtbc » Tue Apr 09, 2019 9:01 am

Dear Chris,

Thank you very much for the great detective work: that gives us a clear idea of what code to start reviewing suspiciously. I've also noted the issue at https://trello.com/c/KRiYv9NY/612-bug-o ... -file-leak to make sure we get to it. We've been puzzling over some surprising open file issues with our own production servers that also show sudden jumps in the "base level" of open files, so this might all have the same underlying cause.

Cheers,
Mark

