Bulk Data Import and Key:Value annotation

Posted: Mon Oct 22, 2018 4:35 pm
by DaveMellert
Hello everyone,

I have a question about the available tools and/or best practices for bulk import.

My use case: I have about 5000 images (each a 1-2 GB slide scan) that I need to import into a new OMERO instance. These will fall into one Project and several hundred Datasets. I think I understand the Bulk Import option pretty well, and should be able to handle setting up import targets and creating the import configuration files programmatically. What is less clear to me is how to handle annotating these files with key:value metadata (e.g., species, tissue, gene, etc.) in bulk. I haven't had much luck finding a clear solution in the documentation--the Populate Metadata script seems tailored toward annotation of screens.

There are several key:value pairs (falling under 2+ namespaces) that I could annotate files with at the same time as import, but there doesn't seem to be a straightforward way to include these annotations in the configuration file/settings, unless I am missing something. Nonetheless, I will still have to come back later and annotate all ~5000 images with an additional key:value pair in which the value will be unique to each image.

Are there any tools already available to, for example, take a tsv of image ids + key/value annotations and automatically populate? My backup plan is to write a tool, probably a shell script that uses the omero CLI, to do this very thing. But if I can save myself the work and/or follow best practices/conventions, I would prefer that.
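
A minimal sketch of the backup plan described above (a shell loop over a TSV that calls the omero CLI), offered as an illustration rather than an established tool. It assumes a hypothetical three-column, tab-separated file (image id, key, value; no header row), an already active CLI session, and the `omero obj` subcommands `new` and `map-set`; check `omero obj --help` on your version before relying on it. The namespace is a placeholder.

```shell
#!/bin/sh
# Sketch only: attach one key:value pair per TSV line to an image.
# Assumptions: TSV columns are image_id<TAB>key<TAB>value (hypothetical layout),
# and you are already logged in with the CLI.

OMERO="${OMERO:-omero}"                                # path to the CLI, e.g. bin/omero
NAMESPACE="${NAMESPACE:-example.org/my-annotations}"   # placeholder namespace
TAB="$(printf '\t')"

annotate_from_tsv() (
    # Body runs in a subshell, so the variables below stay local.
    # The TSV is read on file descriptor 3 so the CLI cannot consume it via stdin.
    while IFS="$TAB" read -r image_id key value <&3; do
        [ -n "$image_id" ] || continue    # skip blank lines
        # Create an empty map annotation; the CLI prints e.g. "MapAnnotation:123".
        ann="$("$OMERO" obj new MapAnnotation ns="$NAMESPACE")" || exit 1
        # Add the key:value pair to the annotation.
        "$OMERO" obj map-set "$ann" mapValue "$key" "$value" || exit 1
        # Link the annotation to the image.
        "$OMERO" obj new ImageAnnotationLink parent="Image:$image_id" child="$ann" || exit 1
    done 3< "$1"
)
```

Usage would be `annotate_from_tsv annotations.tsv`. Each CLI call starts a new process, so this is likely to be slow for thousands of images; it also creates one annotation per line rather than grouping pairs per image.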

Re: Bulk Data Import and Key:Value annotation

Posted: Tue Oct 23, 2018 2:40 pm
by manics
This sounds very similar to what we're doing with the IDR https://idr.openmicroscopy.org/ where we've got several large screens and datasets, all of which are heavily annotated.

Most of the new functionality was included in a separate OMERO CLI plugin, https://github.com/ome/omero-metadata

Unfortunately documentation is lacking, but all the raw files including CSVs and YAML configurations used to create the IDR annotations are available at https://github.com/IDR/idr-metadata

This is our most recent example of annotating a dataset in the manner you require: https://github.com/IDR/idr0045-reichman ... xperimentA

idr0045-experimentA-annotation.csv is a CSV of the Dataset and Image names along with the required annotations.
idr0045-experimentA-bulkmap-config.yml is a configuration file which controls the creation and grouping of the map-annotations, and is designed for use with https://github.com/ome/omero-mapr which enables easy querying of annotation terms across the IDR. For example, it allows us to split annotations into separate namespaces such as Organism. You may not need this configuration file.

If you click on some of the images in https://idr.openmicroscopy.org/webclien ... roject-405 under attributes in the right-hand pane you will see the annotations corresponding to the CSV.

The other files are used to manage imports, or change the rendering of images.

Re: Bulk Data Import and Key:Value annotation

Posted: Wed Oct 24, 2018 8:06 pm
by DaveMellert
This was quite helpful, but I am a bit stuck.

I was able to use:

Code:
bin/omero --sudo ADMIN metadata populate Project:1 --cfg /path/to/bulkmap.yml --file /path/to/annotation.csv


This successfully attached "bulk_annotations" to the right project. And the image-specific annotations did end up associated with the right images, but there were a couple of problems:

1) The data ended up under "Tables" on the right, whereas IDR has "Attributes", which would be preferable (as would Key-Value Pairs)
2) All of the columns from the annotation.csv show up with the column headings ("name") as the keys rather than my specified "clientname". Also, columns I had set to ignore were included. In other words, my bulkmap.yml file was seemingly ignored.
3) All of the other images (the ones not included in annotation.csv) in the target datasets were annotated with the same keys but empty values.

I am assuming I am missing something important here. Also, I should note that I do not (yet) have omero-mapr installed. Is that the problem?

Re: Bulk Data Import and Key:Value annotation

Posted: Thu Oct 25, 2018 12:17 pm
by manics
You're right, I forgot to mention an important point. The original design goals of the metadata plugin were that the raw data (from the CSV) should be attached unchanged as a table ("bulk-annotations"), with the map-annotations serving as an alternative "view" of the data.

Creating the map-annotations therefore requires a second step to convert the table into annotations, something along the lines of:
Code:
omero metadata populate --context bulkmap --cfg bulkmap.yml Project:1
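
The two steps discussed in this thread could be wrapped together roughly as follows. This is only a sketch: it reuses the exact command forms quoted in the posts above, the function name `populate_and_map` is made up for illustration, and it assumes you are already logged in with the CLI.

```shell
#!/bin/sh
# Sketch only: run the table step, then the bulkmap step, stopping on failure.

OMERO="${OMERO:-omero}"    # path to the CLI, e.g. bin/omero

populate_and_map() (
    # Body runs in a subshell, so these variables stay local.
    target="$1"; csv="$2"; cfg="$3"
    # Step 1: attach the CSV unchanged as a "bulk_annotations" table.
    "$OMERO" metadata populate "$target" --cfg "$cfg" --file "$csv" || exit 1
    # Step 2: convert that table into map-annotations according to the config.
    "$OMERO" metadata populate --context bulkmap --cfg "$cfg" "$target" || exit 1
)
```

For example: `populate_and_map Project:1 /path/to/annotation.csv /path/to/bulkmap.yml`.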


If this works for you we'd love to hear your feedback on what's good and bad about it (other than the lack of documentation!). omero-mapr is a web app for exploring the annotations; it's not required for creating them.

Simon

Re: Bulk Data Import and Key:Value annotation

Posted: Thu Oct 25, 2018 2:53 pm
by DaveMellert
Thanks, that did work!

A few points of feedback and a question:

--If you attach two separate bulk annotations and run the bulkmap step with two different respective config files, you get two separate groups of key-value pairs, each with the ns: openmicroscopy.org/omero/bulk_annotations. Hovering over each group reveals the reason for the separation--each group gets a separate ID and creation date. However, it seems like it would be cleaner to collect groups under a common namespace, and use some kind of subgrouping to indicate separate addition of those k-v pairs to that namespace.

--Just for testing purposes, I tried including the mapr section in the cfg file (without omero-mapr installed), because I wanted to see if I could annotate with separate namespaces of my choosing. This did work, which is great! However, there is the same issue of having multiple groups of annotations with the same namespace, which I am not thrilled about. Still, this is largely covering my needs, so it's great! We want to be able to specify namespaces for ontologies when we do the annotation, and we need to do it in bulk. This tool will do that.

--Another issue is that if you (accidentally) run the bulkmap command twice with the same cfg file and bulk_annotation, you get a duplicated group of k-v pairs. I can just be careful to avoid this, but it seems like some sort of check for this kind of mistake might be good.

--I am still wondering about how the IDR has the "Attributes" section rather than using the Key-Value pairs section. Also, the overall appearance of how the IDR displays the key-value groups/namespace info is more attractive than the default key-value view. Are these web client customizations, or a feature of omero-mapr?
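
On the accidental double run mentioned in the feedback above, one client-side workaround is to record locally which target/config combinations have already been converted. The sketch below is purely illustrative: the stamp directory and the function name `bulkmap_once` are hypothetical, it only knows about runs made from the same machine, and it does not inspect what is actually on the server.

```shell
#!/bin/sh
# Sketch only: skip the bulkmap step if it was already run from this machine
# for the same target and the same config file contents.

OMERO="${OMERO:-omero}"                      # path to the CLI, e.g. bin/omero
STAMP_DIR="${STAMP_DIR:-.bulkmap-stamps}"    # hypothetical local directory

bulkmap_once() (
    # Body runs in a subshell, so these variables stay local.
    target="$1"; cfg="$2"
    # cksum on stdin prints "<crc> <size>"; keep the CRC as a content fingerprint.
    sum="$(cksum < "$cfg" | cut -d' ' -f1)" || exit 1
    stamp="$STAMP_DIR/$(printf '%s' "$target" | tr ':/' '__')-$sum"
    if [ -e "$stamp" ]; then
        echo "skipping: bulkmap already run for $target with $cfg" >&2
        exit 0
    fi
    "$OMERO" metadata populate --context bulkmap --cfg "$cfg" "$target" || exit 1
    # Record the successful run.
    mkdir -p "$STAMP_DIR" && : > "$stamp"
)
```

Calling `bulkmap_once Project:1 bulkmap.yml` twice would then run the conversion only the first time, unless the config file changes.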

Re: Bulk Data Import and Key:Value annotation

Posted: Thu Oct 25, 2018 3:45 pm
by manics
Thanks for the feedback! It's probably worth mentioning a bit more of the history behind this plugin and mapr.

Ideally this information would be stored in a graph-database since ontologies are hierarchical and you want to follow a chain of links between images based on common terms. A graph-db would be useful but is a lot of work to integrate seamlessly into OMERO, so instead we opted to use map-annotations and do matching based on key-value pairs.

You've run into several of the limitations. For example, to prevent duplications we have to simulate "primary keys" in the metadata plugin since it's not in the database, which can be a time-consuming process. If you use the mapr configuration of the bulkmap-config, this is what we attempt to do to prevent some duplicate annotations. The downside is that if you accidentally attempt to create a duplicate, the error messages can be quite cryptic.

The label of the "Attributes" section is controlled by the omero.web.ui.metadata_panes property in the standard OMERO.web client https://docs.openmicroscopy.org/omero/5 ... data_panes.

The nicer display of the namespaced key-value pairs is a mapr extension: it overrides one of the OMERO.web Django templates.

All of the IDR deployment is managed with Ansible. If you're familiar with Ansible playbooks and roles it might be helpful to look at the IDR web config https://github.com/IDR/deployment/blob/ ... -hosts.yml
If not I can give you a dump of any OMERO.web config settings you require.

Finally, regarding your feedback, would you like to open some issues on GitHub? https://github.com/ome/omero-mapr

Re: Bulk Data Import and Key:Value annotation

Posted: Thu Oct 25, 2018 7:00 pm
by DaveMellert
Thank you so much, this is very helpful and useful to know.

It is interesting that you bring up using a graphDB. At the Jackson Laboratory (where I work), we are in the process of creating a "Data Services Core" that we want to position as the main entry point for data discovery and metadata management for users. We will keep "ground truth" metadata in a graphDB and/or triplestore that covers not just images, but multiple domains (so sample info from LIMS, sequence data...anything created at the lab that users want to integrate, basically). For image data, we'll have a permalink in the form of the OMERO id. When the data moves to deep archive, that will get changed to the location on tape.

Anyway, the plan is to import some minimal set of metadata into OMERO, including the UID created by the Data Services Core, to provide nice functionality for the users, but the triples/graphDB will be the ultimate source of truth.

If OMERO were to use a graphDB, that would certainly make my life easier! But it's all good, OMERO is such a nice system for interacting with the data/metadata that we wanted to find a way to make it work with the broader vision. If any of this sounds backward or crazy, please let me know!

All that said, I will definitely continue hammering on my dev box here and you may indeed see some issues from me in the near future :)