Open Microscopy Environment

by **austinMLB** » Tue Jan 15, 2019 10:13 pm

Hi, All,
I have a Python upload script that I recently converted to do, essentially, an in-place upload instead of a standard upload that copies the image file. But that change did not result in a significant speed-up of my uploads. A lot of my time seems to be spent in the "verifyUpload" process.

Code:

    handle = proc.verifyUpload(hashes)
    cb = CmdCallbackI(self.connection.c, handle)
    rsp = self.assert_passes(cb)

where that last call is a wrapper around cb.getResponse(). Given that this upload is effectively "in-place", can I safely skip this step in some way? Or can I otherwise improve its performance for my case?

Thanks for any advice,
Michael

by **mtbc** » Wed Jan 16, 2019 12:08 pm

Dear Michael,

You can't skip the verifyUpload, but for in-place imports it may be delayed by needlessly checksumming the content of the large image files. Taking http://downloads.openmicroscopy.org/lat ... itory.html as the reference: note that importFileset takes an ImportSettings argument which includes a checksumAlgorithm member. SHA1-160 (omero.model.enums.ChecksumAlgorithmSHA1160 from Python) is the default, but File-Size-64 (omero.model.enums.ChecksumAlgorithmFileSize64) will ignore the file content and just use the file size as the checksum. You probably just need to choose that in your arguments to importFileset.
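As a rough illustration of why the algorithm choice matters, here is a minimal sketch in plain Python (Python 3). It is not OMERO's own implementation, and the helper name `sha1_of_file` and the temporary demo file are made up for the example. A SHA1 has to read every byte of the file, whereas the file size comes from the file system's metadata without opening the file at all:

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash the whole file content; every byte has to be read."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical stand-in for a large image file.
content = b"\x00" * 3000000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    demo_path = tmp.name

sha1_hex = sha1_of_file(demo_path)      # cost grows with the file size
file_size = os.path.getsize(demo_path)  # metadata lookup only
os.remove(demo_path)

print(sha1_hex)
print(file_size)
```

For files on a network drive the difference is larger still, since hashing the content means pulling the whole file across the network.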

Note that there is a bit of related API that you may someday wish to use. suggestChecksumAlgorithm can be used to negotiate an algorithm with the server though present versions of the server all support the same set of algorithms. After your import setChecksumAlgorithm can be used to change the algorithm for the imported files to give you a greater chance of using the checksums stored in the database to detect server-side file corruption in the long term. That will take the extra time you saved at import time with the algorithm change.

Cheers,
Mark

by **austinMLB** » Tue Jan 22, 2019 10:12 pm

Thanks, Mark. That sounds good, but I haven't quite made it work, yet. If I make the checksum method File-Size-64, what do I need to provide as the "hash" of the original file? Based on your second paragraph, I initially thought I'd still need to provide the sha1. If I need to provide the size as the hash, what format does it need to be in?
The error I'm getting is essentially that the checksum fails:

failingChecksums =
{
key = 0
value = b4de320000000000
}

The "value", in that exception, appears to be a reverse ordering of the hex value of my file size (which is 0x32deb4).
Any thoughts on where I went wrong?
Thanks,
Michael

by **mtbc** » Wed Jan 23, 2019 8:48 am

Dear Michael,

Well-spotted re. the reverse ordering of the hex value of the file size. Did you try supplying that to verifyUpload as the "hash"? If that doesn't work then if you paste your script somewhere we'll give it a try and investigate.
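The relationship between the two values quoted in the thread can be checked with a few lines of Python 3. This is only a sketch of the arithmetic and makes no OMERO calls: the "reversed" string is the file size written as an 8-byte little-endian integer and then hex-encoded.

```python
size = 0x32deb4                # file size reported in the thread
observed = "b4de320000000000"  # value from the failingChecksums error

# 8-byte little-endian encoding of the size, as hex text.
little_endian_hex = size.to_bytes(8, byteorder="little").hex()

# The "natural" big-endian rendering, zero-padded to 16 hex digits.
big_endian_hex = "{:016x}".format(size)

# Going the other way: decode the observed string back into a size.
recovered = int.from_bytes(bytes.fromhex(observed), byteorder="little")

print(little_endian_hex)
print(big_endian_hex)
print(recovered == size)
```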

In mentioning setChecksumAlgorithm I was thinking that you may want to use it post-import to switch the hash from being the file size to the SHA1. In that case the server would calculate the SHA1 itself; it wouldn't ask you to do so as well, as you may no longer have the file in hand locally.

Cheers,
Mark

by **austinMLB** » Wed Jan 23, 2019 3:30 pm

Hi, Mark,
Yes, I just hardcoded that value as the hash, and the checksum passes. So, my test code has

hashes.append("b4de320000000000")

That upload was successful.

In Python, from the file, I can get the file size as hex

0x32deb4

Would I need to text-manipulate that value into "b4de320000000000"? I'm guessing there is a method somewhere that just directly computes the correct formatting from my file. For example, when I was using SHA1, I was calling

client.c.sha1(my_file)

Do you think I should just text-manipulate the value?
--Michael

by **mtbc** » Thu Jan 24, 2019 11:10 am

Dear Michael,

I am not aware of such a function already provided. For me this kind of thing looks promising:

Code:

    from os import path
    from struct import pack, unpack

    filename = '/my/file.dat'
    size = path.getsize(filename)
    # Pack as big-endian, unpack as little-endian: this reverses the byte order.
    size = unpack('<Q', pack('>Q', size))[0]
    hash = '{:016x}'.format(size)
    print(hash)
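The same idea can be wrapped in a small helper for building the list passed to verifyUpload. This is a sketch for Python 3 only: the function name `file_size_64_hash` and the temporary demo file are made up for illustration, and it assumes the server expects exactly the format observed earlier in this thread. Packing the size directly as little-endian avoids the pack/unpack round trip:

```python
import os
import struct
import tempfile


def file_size_64_hash(path):
    """Return the file size as 16 hex digits in little-endian byte order."""
    size = os.path.getsize(path)
    # '<Q' = little-endian unsigned 64-bit integer.
    return struct.pack("<Q", size).hex()


# Demo file with the same size as the one discussed above (0x32deb4 bytes).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 0x32deb4)
    demo_path = tmp.name

# One entry per uploaded file, in the same order as the upload.
hashes = [file_size_64_hash(demo_path)]
os.remove(demo_path)

print(hashes)
```

For a file of 0x32deb4 bytes this gives the same string that was hardcoded in the earlier test.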

I hope that helps.

Cheers,
Mark

by **austinMLB** » Mon Feb 04, 2019 10:24 pm

Mark, thanks for all the help you provided in this thread, and I apologize for my long delays between posts.

Your technique for reconstructing the Hash from the File Size does seem to work. My upload performance improved using the File-Size-64 verification instead of SHA1. It is about 20-40% faster.

While the time is significantly better, I'm interested to know if I can reduce it further. Specifically, my data is on a network drive. While the connection to that drive is good, for large data sets it is still costly to bring the files over to the server machine, which is still happening. I noticed the command line has options to skip "checksum", "minmax", and "thumbnails".

So, my follow-up questions:
1. Is it possible, and do you believe it would be valuable, to also address "thumbnails" and "minmax"?
2. Is it possible to avoid the file transfer to the server altogether? My images are (often) being created in parallel. I could pre-process them, in parallel, in some way if helpful.

Thanks for any insight, and please let me know if my questions aren't clear,
Michael

by **mtbc** » Tue Feb 05, 2019 1:08 pm

Dear Michael,

Can you tell us a little more about your in-place-alike import (even share code fragments?): are the files still being slowly brought over to the server at upload time or is the problem when the server processes the files in later import steps? It does have to do a substantial amount of reading at some point of course. Given your efforts to date you may find the code in https://gitlab.com/openmicroscopy/incub ... n-importer interesting.

Have you experimented with the parallelization options as at https://docs.openmicroscopy.org/latest/ ... lel-import? Also, have you studied the import time breakdown? Using OMERO.cli, for image 1234, you can find its fileset ID with,

Code:

    bin/omero obj get Image:1234 fileset

then its import time with,

Code:

    bin/omero fs importtime 567

We do skip import steps when importing publication data submissions to http://idr.openmicroscopy.org/. For those we use the kind of approach discussed at https://docs.openmicroscopy.org/latest/ ... -bulk.html: an example configuration for importing one of those datasets skipping all the deferrable steps is presently at https://github.com/IDR/idr0053-faas-vir ... A-bulk.yml.

Bio-Formats 6 is to include some exploratory work on importing remotely hosted files such as from Amazon S3. Depending on what you are trying, it might be worth discussing more about that angle?

Cheers,
Mark

Python Upload, optimize verifyUpload