Open Microscopy Environment

by **PaulVanSchayck** » Mon Sep 01, 2014 9:58 am

Thanks Sebastien for your help. That enabled me to debug this issue a bit further and led me to a possible solution.

It led me to the `find_next_pixels_data_per_user_for_null_repo` query in `components/model/resources/ome/util/actions/default.properties`. This is the query (I've filled in `572962` according to my own situation):

Code: Select all
select * from (
  select *, row_number() over (partition by entityid) as dupe from (
    select e.experimenter, el.id as eventlog, entityid,
           row_number() over (partition by experimenter) as row
    from event e, eventlog el, pixels p
    where e.id = el.event and el.id > 572962
      and action = 'PIXELDATA' and entitytype = 'ome.model.core.Pixels'
      and p.id = el.entityid
    group by e.experimenter, el.id, el.entityid
  ) as x
  where row <= 5
  order by row, eventlog asc
) as y
where dupe = 1

The output of this query is as follows:

Code: Select all
experimenter;eventlog;entityid;row;dupe
252;573263;7169;1;1
252;573476;7224;2;1
252;573541;7212;3;1
252;572965;7505;4;1
252;573381;7287;5;1

Take note of the eventlogId (the `eventlog` column) and the `row` column. As you can see, the eventlogId jumps around quite randomly. This causes the image handling in pixeldata/PersistentEventLogLoader.java to skip forward over eventlogIds it hasn't processed yet: if it picks up an eventlogId from later in the list, it executes the query above with this new ID and then never completes any remaining eventlogIds lower than that number.

To confirm my suspicion that this was the issue, I've changed the query to no longer order on `row` but only on the eventlogId, and instead of keeping `row <= 5`, I've limited the output to 5 rows. This is the final query:
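The skipping described above can be illustrated with a small simulation. This is only a sketch and not the OMERO code: `query` and `run_loader` are hypothetical names, and it assumes the loader handles one eventlog per query and then re-queries for IDs greater than the one it just handled. The unordered list is the `eventlog` column from the output above.

```python
def query(db_order, current_id, limit=5):
    """Stand-in for the SQL: up to `limit` eventlog IDs above current_id,
    in whatever order the database happens to return them."""
    return [e for e in db_order if e > current_id][:limit]


def run_loader(db_order, start_id):
    """Handle one eventlog per query, then re-query from that ID (assumed behaviour)."""
    current_id = start_id
    processed = []
    while True:
        batch = query(db_order, current_id)
        if not batch:
            break
        current_id = batch[0]          # plays the role of setCurrentId()
        processed.append(current_id)
    skipped = sorted(set(db_order) - set(processed))
    return processed, skipped


unordered = [573263, 573476, 573541, 572965, 573381]  # order from the output above
print(run_loader(unordered, 572962))
# ([573263, 573476, 573541], [572965, 573381])
print(run_loader(sorted(unordered), 572962))
# ([572965, 573263, 573381, 573476, 573541], [])
```

Under these assumptions, two of the five eventlogs are never handled when the rows arrive out of order, and none are lost when they arrive sorted by ID.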

Code: Select all
select * from (
  select *, row_number() over (partition by entityid) as dupe from (
    select e.experimenter, el.id as eventlog, entityid,
           row_number() over (partition by experimenter) as row
    from event e, eventlog el, pixels p
    where e.id = el.event and el.id > 573849
      and action = 'PIXELDATA' and entitytype = 'ome.model.core.Pixels'
      and p.id = el.entityid
    group by e.experimenter, el.id, el.entityid
  ) as x
  order by eventlog asc limit 5
) as y
where dupe = 1

Typical output of this query is:

Code: Select all
experimenter;eventlog;entityid;row;dupe
252;573850;7132;23;1
252;573851;7127;4;1
252;573852;7142;15;1
252;573853;7140;11;1
252;573854;7145;8;1

Note that the `eventlog` column is now in order, and the `row` column no longer is.

This indeed fixed the issue: eventlogIds are now processed in order, and no images are skipped anymore. However, I have no idea whether this is the correct fix. As a side effect, only one thread is processing pixels, while all the others are trying to handle the previous eventlogId (log message SKIPPED ...).

Please tell me if you need additional log files.

[update]This query did not fix all cases. But I still think the root cause of the issue might be related to this query.[/update]

by **jmoore** » Tue Sep 02, 2014 3:52 pm

Again, thanks for the detailed sleuthing, Paul! I'll spend some time this week trying to reproduce your results for the query. By way of a longer explanation, what the query is trying to do is find the next event per user, so that one user with a long list of pyramids won't flood the queue. I may well have done something wrong though, so I'll set up a test to try to reproduce it. If you have time, a small dump of the last couple of hundred event logs in which you see this issue would be useful.
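The per-user selection described here can be sketched in a few lines. This is an illustration only, not the server implementation: `next_per_user` is a hypothetical name, the user names are made up, and it assumes rows are numbered per experimenter in eventlog order, which the SQL query as posted does not guarantee.

```python
from collections import defaultdict


def next_per_user(events, per_user):
    """events: iterable of (experimenter, eventlog) pairs."""
    rows = defaultdict(int)
    picked = []
    for experimenter, eventlog in sorted(events, key=lambda e: e[1]):
        rows[experimenter] += 1                 # like row_number() per experimenter
        if rows[experimenter] <= per_user:      # like "where row <= N"
            picked.append((rows[experimenter], eventlog, experimenter))
    picked.sort()                               # like "order by row, eventlog"
    return [(experimenter, eventlog) for _, eventlog, experimenter in picked]


# One user with ten pending events, two users with one each.
events = [("alice", i) for i in range(1, 11)] + [("bob", 11), ("carol", 12)]
print(next_per_user(events, per_user=2))
# [('alice', 1), ('bob', 11), ('carol', 12), ('alice', 2)]
```

With this ordering, each user's first event is served before any user's second one, so the user with ten events cannot crowd out the other two.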

All the best,
~Josh.

by **PaulVanSchayck** » Wed Sep 03, 2014 10:28 am

Hi Josh,

Yeah, I figured from the code and the commit log that preventing one user from flooding the queue was your goal.

However, I'm wondering if we can ever get to that goal. Is this not trying to process a FIFO queue in a non-sequential way, and thereby making life very hard?

Say, you have 3 users importing images, and a pixelThread size of 2.

All at once this happens:
User 1 starts eventLog 1-10
User 2 starts eventLog 11
User 3 starts eventLog 12

Now the query would keep returning eventLog 11 and 12, as each is the first image of its user, and each will be processed once. But then the queue remains stuck, as PersistentEventLogLoader.setCurrentId() is never called with a higher ID. It is not called because currentId will never get higher than an ID that is never processed.

So why does this deadlock not happen? Because the random eventLog jumping means the queue never comes across such a situation. My theory for why you're not seeing the bug I've been experiencing is that you're running with more pixelThreads (4 or more?), which makes skipping eventLogs less likely. Is this assumption correct?

So, I'm now using this query:

Code: Select all
select * from (
  select *, row_number() over (partition by entityid) as dupe from (
    select e.experimenter, el.id as eventlog, entityid,
           row_number() over (partition by experimenter) as row
    from event e, eventlog el, pixels p
    where e.id = el.event and el.id > ?
      and action = 'PIXELDATA' and entitytype = 'ome.model.core.Pixels'
      and p.id = el.entityid
    group by e.experimenter, el.id, el.entityid
    order by eventlog
  ) as x
  where row <= ?
  order by row, eventlog asc
) as y
where dupe = 1

Note the addition of "order by eventlog" in the innermost query.

I think this fixes the issue of skipping over events. It does, however, introduce (or expose) other problems: there is no parallel processing when just one user is importing, and duplicate events (hitting refresh many times) are slow to process.

In the end, would this queue system not benefit from:
- Setting a Handled flag on an event so that you query for events left to process
- Not submitting more PIXELDATA events for an entity that already has such an event running?
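The two suggestions above could look roughly like the following toy queue. It is purely a sketch of the idea: `PixelDataQueue` and its methods are hypothetical, everything is held in memory, and nothing here corresponds to an existing OMERO class or table.

```python
class PixelDataQueue:
    """Toy in-memory queue; event IDs increase in submission order."""

    def __init__(self):
        self._events = []        # each event: {"id", "entity", "handled"}
        self._next_id = 1

    def submit(self, entity_id):
        """Add an event unless the entity already has an unhandled one."""
        if any(e["entity"] == entity_id and not e["handled"] for e in self._events):
            return None
        event = {"id": self._next_id, "entity": entity_id, "handled": False}
        self._next_id += 1
        self._events.append(event)
        return event["id"]

    def unhandled(self):
        """IDs of events still to process, oldest first."""
        return [e["id"] for e in self._events if not e["handled"]]

    def mark_handled(self, event_id):
        """Set the handled flag on one event."""
        for e in self._events:
            if e["id"] == event_id:
                e["handled"] = True


queue = PixelDataQueue()
first = queue.submit(entity_id=7169)
duplicate = queue.submit(entity_id=7169)   # refused: 7169 is still pending
second = queue.submit(entity_id=7224)
queue.mark_handled(second)                 # finishing out of order loses nothing
print(duplicate, queue.unhandled())
# None [1]
```

Because the remaining work is found through the flag rather than through "ID greater than the last one seen", handling events out of order no longer hides the older ones.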

The output of the following query, for 1100 events from 2 users:

Code: Select all: select e.experimenter, el.id as eventlog, entityid from event e, eventlog el, pixels p where e.id = el.event and el.id > ? and action = 'PIXELDATA' and entitytype = 'ome.model.core.Pixels' and p.id = el.entityid

can be found here: http://pastebin.com/AEfc5nsF
The forum does not allow csv or txt extensions...

Sorry.. for the wall of text..

by **jmoore** » Thu Sep 04, 2014 1:41 pm

Hi Paul,

PaulVanSchayck wrote:However, I'm wondering if we can ever get to that goal? Is this not trying to process a FIFO queue in a non sequential way, and thereby making life very hard?

Quite possibly! It was driven, though, by another issue you mention, namely that there's currently no straightforward way to know whether a PIXELDATA event has already been submitted, which leads to many duplicates.

PaulVanSchayck wrote:In the end, would this queue system not benefit from:
- Setting a Handled flag on an event so that you query for events left to process
- Not submitting more PIXELDATA events for entity that you already have such an event running for?

Better, I think, would be to just move to a proper message queue, which is currently in the works, though I don't know if it will make it into 5.1.0. As an alternative, I'll look into using the new "EventLogQueue" (still not a proper MQ though!) which was introduced in 5.0.x. If so, backporting would be quite simple.

PaulVanSchayck wrote:The output of 1100 events from 2 users, of the query:
Can be found here: http://pastebin.com/AEfc5nsF
Forum does not allow csv or txt extensions...
Sorry.. for the wall of text..

No worries, and many thanks!

~Josh.

by **PaulVanSchayck** » Mon Sep 08, 2014 1:51 pm

Hey Josh,

I saw your pull request. I'm in the process of testing it. I'm especially wondering what the change from pixelDataEventLogLoader to pixelDataEventLogQueue will do.

BTW, with just the change to the query I'm not losing the feature to prevent flooding by one user.

Regarding testing, I've been trying to get the integration tests of the server component to work, which has been kind of tricky. I've got it working to a point where a few hundred of the tests pass using:

Code: Select all: ./build.py -f components/server/build.xml integration

It's however skipping the pixelData tests for reasons I cannot see. Also I would like to run just the pixelData threads, but this still executes all tests:

Code: Select all: ./build.py -f components/server/build.xml integration -DGROUPS=pixeldata

Am I missing something?

Paul

by **jmoore** » Mon Sep 08, 2014 2:27 pm

Hi Paul,

PaulVanSchayck wrote:I saw your pull request. I'm in progress of testing it. I'm especially wondering about what the change from pixelDataEventLogLoader to pixelDataEventLogQueue will do.

The 2 real changes should be:
* event logs should come in order of creation regardless of ownership
* but all duplicate requests will be ignored.
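As a rough picture of those two changes, the sketch below orders event logs by ID and keeps only the first request per entity. It is an assumption-laden illustration, not how the EventLogQueue is implemented: `ordered_without_duplicates` is a hypothetical helper, and the second experimenter ID and the duplicate entry are made up.

```python
def ordered_without_duplicates(event_logs):
    """event_logs: iterable of (eventlog_id, experimenter, entity_id)."""
    seen = set()
    result = []
    for eventlog_id, experimenter, entity_id in sorted(event_logs):
        if entity_id in seen:
            continue                  # a request for this entity was already queued
        seen.add(entity_id)
        result.append(eventlog_id)
    return result


logs = [
    (573852, 252, 7142),
    (573850, 252, 7132),
    (573853, 300, 7132),   # duplicate request for entity 7132 by another user
    (573851, 300, 7127),
]
print(ordered_without_duplicates(logs))
# [573850, 573851, 573852]
```

Ownership plays no part in the ordering here; the experimenter is carried along only to show that it is ignored.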

PaulVanSchayck wrote:BTW, with just the change to the query I'm not losing the feature to prevent flooding by one user.

Noted. I'll also outline the "lack of parallelism" issue, but there's currently an issue with that which needs to be sorted out first, i.e. the PR won't be included in tomorrow's build.

PaulVanSchayck wrote:Regarding testing, I've been trying to get the integration tests of the server component to work, which has been kind of tricky. I've got it working to a point where a few hundred of the tests pass using
Code: Select all
./build.py -f components/server/build.xml integration

It's however skipping the pixelData tests for reasons I cannot see. Also I would like to run just the pixelData threads, but this still executes all tests:
Code: Select all
./build.py -f components/server/build.xml integration -DGROUPS=pixeldata

Am I missing something?

Try:

Code: Select all: ant -f components/server/build.xml test -DTEST=ome/server/itests/pixeldata/PersistentEventLogLoaderTest

additionally passing any non-standard DB options like this:

Code: Select all: ant -f ... -Domero.db.name=test -Domero.data.dir=/tmp/test

Cheers,
~Josh
