Python Upload, optimize verifyUpload

Postby austinMLB » Tue Jan 15, 2019 10:13 pm

Hi, All,
I have a Python upload script that I recently converted to do, essentially, an in-place upload instead of a standard upload that copies the image file. But that change did not result in a significant speedup of my uploads. A lot of my time seems to be spent in the "verifyUpload" process.

Code:
handle = proc.verifyUpload(hashes)
cb = CmdCallbackI(self.connection.c, handle)
rsp = self.assert_passes(cb)


where that last call is a wrapper around cb.getResponse(). Given that this upload is effectively "in-place", can I safely skip this step in some way? Or can I otherwise improve its performance for my case?

Thanks for any advice,
Michael
austinMLB
 
Posts: 19
Joined: Wed Jul 25, 2018 3:26 pm

Re: Python Upload, optimize verifyUpload

Postby mtbc » Wed Jan 16, 2019 12:08 pm

Dear Michael,

You can't skip the verifyUpload, but for in-place imports it may be needlessly delayed by checksumming the content of the large image files. Taking http://downloads.openmicroscopy.org/lat ... itory.html as the reference: note that importFileset takes an ImportSettings argument which includes a checksumAlgorithm member. SHA1-160 (omero.model.enums.ChecksumAlgorithmSHA1160 from Python) is the default, but File-Size-64 (omero.model.enums.ChecksumAlgorithmFileSize64) will ignore the file content and just use the file size as the checksum. You probably just need to choose that in your arguments to importFileset.
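
As a rough illustration of that trade-off (a standard-library sketch only, not OMERO's own implementation; the helper names and the two sample files are made up for the example): a content hash such as SHA-1 has to read every byte of the file, whereas a size-based checksum needs only the file's metadata, so it cannot tell apart two files of equal length.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hex SHA-1 digest of the file content; reads the whole file in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def size_of_file(path):
    """File size in bytes, taken from file metadata without reading the content."""
    return os.path.getsize(path)


def write_temp_file(content):
    """Hypothetical helper: write bytes to a new temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(content)
        return handle.name


# Two files with the same length but different content.
path_a = write_temp_file(b"AAAA")
path_b = write_temp_file(b"BBBB")
try:
    same_size = size_of_file(path_a) == size_of_file(path_b)
    same_sha1 = sha1_of_file(path_a) == sha1_of_file(path_b)
    print(same_size, same_sha1)  # True False
finally:
    os.remove(path_a)
    os.remove(path_b)
```

The size comparison is cheap but blind to content changes, which is why switching back to a content hash after import may still be worthwhile.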

Note that there is a bit of related API that you may someday wish to use. suggestChecksumAlgorithm can be used to negotiate an algorithm with the server, though present versions of the server all support the same set of algorithms. After your import, setChecksumAlgorithm can be used to change the algorithm for the imported files, giving you a greater chance of using the checksums stored in the database to detect server-side file corruption in the long term. That will take the extra time you saved at import time with the algorithm change.

Cheers,
Mark
mtbc
Team Member
 
Posts: 282
Joined: Tue Oct 23, 2012 10:59 am
Location: Dundee, Scotland

Re: Python Upload, optimize verifyUpload

Postby austinMLB » Tue Jan 22, 2019 10:12 pm

Thanks, Mark. That sounds good, but I haven't quite made it work, yet. If I make the checksum method File-Size-64, what do I need to provide as the "hash" of the original file? Based on your second paragraph, I initially thought I'd still need to provide the sha1. If I need to provide the size as the hash, what format does it need to be in?
The error I'm getting is essentially that the checksum fails:
failingChecksums =
{
key = 0
value = b4de320000000000
}


The "value", in that exception, appears to be a reverse ordering of the hex value of my file size (which is 0x32deb4).
Any thoughts on where I went wrong?
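
One way to check that reading of the value, as a minimal sketch using only the Python 3 standard library (the variable names are illustrative): interpret the eight reported bytes as a little-endian 64-bit integer and compare the result with the file size.

```python
# Checksum string reported in the failingChecksums exception above.
reported = "b4de320000000000"

# Interpret the 8 bytes as an unsigned little-endian 64-bit integer.
decoded_size = int.from_bytes(bytes.fromhex(reported), byteorder="little")

print(hex(decoded_size))         # 0x32deb4
print(decoded_size == 0x32deb4)  # True
```

If the decoded number matches the file size, the reported value is presumably the size written as 16 hex digits in little-endian byte order.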
Thanks,
Michael

Re: Python Upload, optimize verifyUpload

Postby mtbc » Wed Jan 23, 2019 8:48 am

Dear Michael,

Well-spotted re. the reverse ordering of the hex value of the file size. Did you try supplying that to verifyUpload as the "hash"? If that doesn't work then if you paste your script somewhere we'll give it a try and investigate.

In mentioning setChecksumAlgorithm I was thinking that you may want to use it post-import to switch the hash from being the file size to the SHA1. In that case the server would calculate the SHA1 itself; it wouldn't ask you to do so too, since you may no longer have the file in hand locally.

Cheers,
Mark

Re: Python Upload, optimize verifyUpload

Postby austinMLB » Wed Jan 23, 2019 3:30 pm

Hi, Mark,
Yes, I just hardcoded that value as the hash, and the checksum passes. So, my test code has
hashes.append("b4de320000000000")

That upload was successful.

In Python, from the file, I can get the file size as hex
0x32deb4

Would I need to text-manipulate that value into "b4de320000000000"? I'm guessing there is a method somewhere that just directly computes the correct formatting from my file. For example, when I was using SHA1, I was calling
client.c.sha1(my_file)

Do you think I should just text-manipulate the value?
--Michael

Re: Python Upload, optimize verifyUpload

Postby mtbc » Thu Jan 24, 2019 11:10 am

Dear Michael,

I am not aware of such a function already provided. For me this kind of thing looks promising:
Code:
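from os import path
from struct import pack, unpack

filename = '/my/file.dat'
size = path.getsize(filename)
# Swap the byte order of the 64-bit size: pack as big-endian, read back as little-endian.
size = unpack('<Q', pack('>Q', size))[0]
# Format as 16 zero-padded hex digits, e.g. 0x32deb4 becomes b4de320000000000.
hash = '{:016x}'.format(size)
print(hash)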
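
The same conversion can be wrapped in a small helper for building the list of hashes. This is only a sketch, assuming Python 3 and assuming the server expects exactly the 16-digit form seen earlier in the thread; `file_size_64_hash` and `file_size_64_hash_of` are made-up names, not part of the OMERO API.

```python
import os


def file_size_64_hash(size):
    """Format a byte count as 16 hex digits in little-endian byte order."""
    # to_bytes raises OverflowError if size is negative or does not fit in 8 bytes.
    return size.to_bytes(8, byteorder="little").hex()


def file_size_64_hash_of(path):
    """Same as above, taking the size from a file on disk."""
    return file_size_64_hash(os.path.getsize(path))


print(file_size_64_hash(0x32deb4))  # b4de320000000000
```

Using `int.to_bytes` avoids the pack/unpack round trip and fails loudly on sizes that cannot be represented in 64 bits.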

I hope that helps.

Cheers,
Mark

Re: Python Upload, optimize verifyUpload

Postby austinMLB » Mon Feb 04, 2019 10:24 pm

Mark, thanks for all the help you provided in this thread, and I apologize for my long delays between posts.

Your technique for reconstructing the Hash from the File Size does seem to work. My upload performance improved using the File-Size-64 verification instead of SHA1. It is about 20-40% faster.

While the time is significantly better, I'm interested to know if I can reduce it further. Specifically, my data is on a network drive. While the connection to that drive is good, for large data sets it is still costly to bring the files over to the server machine, which is still happening. I noticed the command line has options to skip "checksum", "minmax", and "thumbnails".

So, my follow-up questions:
1. Is it possible, and do you believe it would be valuable, to also address "thumbnails" and "minmax"?
2. Is it possible to avoid the file transfer to the server altogether? My images are (often) being created in parallel. I could pre-process them, in parallel, in some way if helpful.

Thanks for any insight, and please let me know if my questions aren't clear,
Michael

Re: Python Upload, optimize verifyUpload

Postby mtbc » Tue Feb 05, 2019 1:08 pm

Dear Michael,

Can you tell us a little more about your in-place-alike import (even share code fragments?): are the files still being slowly brought over to the server at upload time or is the problem when the server processes the files in later import steps? It does have to do a substantial amount of reading at some point of course. Given your efforts to date you may find the code in https://gitlab.com/openmicroscopy/incub ... n-importer interesting.

Have you experimented with the parallelization options as at https://docs.openmicroscopy.org/latest/ ... lel-import? Also, have you studied the import time breakdown? Using OMERO.cli, for image 1234, you can find its fileset ID with,
Code:
bin/omero obj get Image:1234 fileset

then its import time with,
Code:
bin/omero fs importtime 567

We do skip import steps when importing publication data submissions to http://idr.openmicroscopy.org/. For those we use the kind of approach discussed at https://docs.openmicroscopy.org/latest/ ... -bulk.html: an example configuration for importing one of those datasets skipping all the deferrable steps is presently at https://github.com/IDR/idr0053-faas-vir ... A-bulk.yml.

Bio-Formats 6 is to include some exploratory work on importing remotely hosted files such as from Amazon S3. Depending on what you are trying, it might be worth discussing more about that angle?

Cheers,
Mark

