Unescaped ampersands in bfconvert output

Historical discussions about the Bio-Formats library. Please look for and ask new questions at https://forum.image.sc/tags/bio-formats

Post by dwight » Fri Aug 31, 2012 11:26 am

I have a few Olympus FluoView files in which some metadata fields contain ampersands (from the directory name of the saved image). When I use bfconvert to save such a file as OME-XML, the converted file cannot be read. For example, the showinf command aborts with a long stack trace:


Code:
Exception in thread "main" loci.formats.FormatException: Malformed OME-XML
   at loci.formats.in.OMEXMLReader.initFile(OMEXMLReader.java:241)
   at loci.formats.FormatReader.setId(FormatReader.java:1178)
   at loci.formats.ImageReader.setId(ImageReader.java:727)
   at loci.formats.ReaderWrapper.setId(ReaderWrapper.java:529)
   at loci.formats.tools.ImageInfo.testRead(ImageInfo.java:988)
   at loci.formats.tools.ImageInfo.main(ImageInfo.java:1031)
Caused by: java.io.IOException
   at loci.common.xml.XMLTools.parseXML(XMLTools.java:350)
   at loci.common.xml.XMLTools.parseXML(XMLTools.java:318)
   at loci.formats.in.OMEXMLReader.initFile(OMEXMLReader.java:237)
   ... 5 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 4274; columnNumber: 59; The reference to entity "iso_1" must end with the ';' delimiter.

...etc...


As far as I can see, the problem is that the ampersand in the original OIB file ends up in an OriginalMetadata Value element, and the XML parser rejects a bare & character.
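To illustrate the point above, here is a minimal sketch using only the JDK's SAX parser (the same parser family that appears in the stack trace). The class and method names are hypothetical and not part of Bio-Formats. It checks whether a small XML fragment is well-formed with a bare & and with the escaped form &amp;.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Bio-Formats.
class AmpersandParseDemo {

  /** Returns true if the given string parses as well-formed XML. */
  static boolean isWellFormed(String xml) {
    try {
      SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
      return true;
    }
    catch (SAXException e) {
      // SAXParseException (a SAXException) signals malformed input
      return false;
    }
    catch (IOException | ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // bare ampersand, as in the directory name from the OIB file
    System.out.println(isWellFormed("<Value>a&b_1/</Value>"));
    // the same text with the ampersand escaped
    System.out.println(isWellFormed("<Value>a&amp;b_1/</Value>"));
  }
}
```

The first fragment is rejected because the parser treats everything after & as the start of an entity reference and expects a terminating semicolon.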

I made a quick hack to solve the issue (for me at least) by modifying the sanitizeXML method to look for standalone & characters, without (hopefully) altering the rest of the output of the method. This included using a StringBuffer instead of a character array as the total length of the string being sanitized does not necessarily remain constant.
Code:
diff --git a/components/common/src/loci/common/xml/XMLTools.java b/components/common/src/loci/common/xml/XMLTools.java
index 6665abf..c00d74d 100644
--- a/components/common/src/loci/common/xml/XMLTools.java
+++ b/components/common/src/loci/common/xml/XMLTools.java
@@ -181,17 +181,29 @@ public final class XMLTools {

   /** Remove invalid characters from an XML string. */
   public static String sanitizeXML(String s) {
-    final char[] c = s.toCharArray();
-    for (int i=0; i<s.length(); i++) {
-      if (Character.isISOControl(c[i]) ||
-        !Character.isDefined(c[i]) || c[i] > '~')
-      {
-        c[i] = ' ';
+    StringBuffer sb = new StringBuffer();
+    for (int i=0; i < s.length(); i++) {
+      if (Character.isISOControl(s.charAt(i)) ||
+        !Character.isDefined(s.charAt(i)) || s.charAt(i) > '~'){
+        sb.append(' ');
       }
       // eliminate invalid &# sequences
-      if (i > 0 && c[i - 1] == '&' && c[i] == '#') c[i - 1] = ' ';
+      else if (i < s.length() - 1 && s.substring(i, i + 2).equals("&#")){
+        sb.append(" #");
+        i += 1;
+      }
+      else if (s.charAt(i) == '&'){
+        if (i < s.length() - 4 && s.substring(i, i + 5).equals("&amp;")){
+          i += 4;
+        }
+        sb.append("&amp;");
+      }
+      else{
+        sb.append(s.charAt(i));
+      }
+
     }
-    return new String(c);
+    return sb.toString();
   }

   /** Indents XML to be more readable. */
diff --git a/components/scifio/src/loci/formats/services/OMEXMLServiceImpl.java b/components/scifio/src/loci/formats/services/OMEXMLServiceImpl.java
index ba4fdb8..cab4d08 100644
--- a/components/scifio/src/loci/formats/services/OMEXMLServiceImpl.java
+++ b/components/scifio/src/loci/formats/services/OMEXMLServiceImpl.java
@@ -881,7 +881,7 @@ public class OMEXMLServiceImpl extends AbstractService implements OMEXMLService
       Element valueElement =
         document.createElementNS(ORIGINAL_METADATA_NS, "Value");
       keyElement.setTextContent(key);
-      valueElement.setTextContent(value);
+      valueElement.setTextContent(XMLTools.sanitizeXML(value));

       Element originalMetadata =
         document.createElementNS(ORIGINAL_METADATA_NS, "OriginalMetadata");



This could easily be extended to recognise XML entities other than &amp; if necessary. Let me know if there's a better way of getting bare & characters out of my converted OME-XML files (other than not having them in the metadata in the first place ;) )
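As a rough sketch of that extension (not the patch above, and not Bio-Formats code), the bare-ampersand check could be written as a single regular expression that leaves the five predefined XML entities and numeric character references alone. The class and method names are hypothetical. Note that, unlike the patch, this keeps well-formed &#...; references instead of blanking them.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Bio-Formats.
class AmpersandEscaper {

  // An '&' that does not start a predefined entity (amp, lt, gt, quot, apos)
  // or a decimal/hexadecimal character reference.
  private static final Pattern BARE_AMP = Pattern.compile(
    "&(?!(?:amp|lt|gt|quot|apos);|#[0-9]+;|#x[0-9A-Fa-f]+;)");

  /** Replaces every bare '&' with "&amp;" and leaves existing entities intact. */
  static String escapeBareAmpersands(String s) {
    return BARE_AMP.matcher(s).replaceAll("&amp;");
  }

  public static void main(String[] args) {
    // prints a&amp;b_1/
    System.out.println(escapeBareAmpersands("a&b_1/"));
    // already escaped input is returned unchanged
    System.out.println(escapeBareAmpersands("a&amp;b_1/"));
  }
}
```

This kind of escaping only makes sense for text that is written into the XML as a raw string. If the text is handed to a DOM API that escapes on serialization, escaping it beforehand would escape it twice.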

Re: Unescaped ampersands in bfconvert output

Post by rleigh » Mon Sep 03, 2012 1:09 pm

Hello,

I've opened a ticket for this here: https://trac.openmicroscopy.org.uk/ome/ticket/9572 and added you to the Cc, so that you will be notified of the progress of this ticket.

Regards,
Roger

Re: Unescaped ampersands in bfconvert output

Post by dwight » Wed Sep 05, 2012 2:13 pm

Thanks for letting me know about the ticket. However, even though the ticket has now been marked 'fixed' the fix does not solve the problem for my file. I'm not sure if the bug report actually corresponds to the issue I have (which is probably why my problem doesn't get fixed by the update). In the ticket the & character is added to an element attribute. In my case the offending & is not in an element attribute but, rather, the value itself:
Code:
<XMLAnnotation ID="Annotation:847">
<Value>
<OriginalMetadata xmlns="openmicroscopy.org/OriginalMetadata">
<Key>[File Info] Path</Key>
<Value>D:/FV10-ASW/Users/confo/Image/120807/a&b_1/</Value></OriginalMetadata></Value></XMLAnnotation>


Apologies for not being specific enough earlier.

Re: Unescaped ampersands in bfconvert output

Post by mlinkert » Wed Sep 05, 2012 2:55 pm

dwight wrote: Thanks for letting me know about the ticket. However, even though the ticket has now been marked 'fixed' the fix does not solve the problem for my file. I'm not sure if the bug report actually corresponds to the issue I have (which is probably why my problem doesn't get fixed by the update). In the ticket the & character is added to an element attribute. In my case the offending & is not in an element attribute but, rather, the value itself:


That ticket fixes two things:

* properly escaped '&' (i.e. '&amp;') values are now read correctly
* '&' in the metadata values will now be correctly written as '&amp;'

If you had previously written files that contained an unescaped '&', those will still not be readable (as it is invalid XML), but if you re-generate the OME-XML using the latest build then the new OME-XML should be correctly escaped and readable.
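For readers wondering how '&' ends up written as '&amp;', here is a minimal sketch using only the JDK's DOM and Transformer APIs. It is not the Bio-Formats writer code, and the class and method names are hypothetical. It shows that text set with setTextContent is escaped when the document is serialized, and that text escaped by hand beforehand comes out escaped twice.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration only; not part of Bio-Formats.
class DomEscapeDemo {

  /** Builds a single Value element holding the text and serializes it. */
  static String serializeValue(String text) {
    try {
      Document doc =
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element value = doc.createElement("Value");
      value.setTextContent(text); // raw text; no manual escaping here
      doc.appendChild(value);

      Transformer t = TransformerFactory.newInstance().newTransformer();
      t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      t.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // raw ampersand is escaped once by the serializer
    System.out.println(serializeValue("D:/Image/120807/a&b_1/"));
    // text that was escaped by hand is escaped again
    System.out.println(serializeValue("a&amp;b_1/"));
  }
}
```

Under these assumptions, escaping belongs in one place only: either the serializer does it, or the code that writes raw strings does it, but not both.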

Re: Unescaped ampersands in bfconvert output

Post by dwight » Wed Sep 05, 2012 10:20 pm

Yes, I understand that the ome-xml file would have to be regenerated. I have done that and the error message persists. Also, at least when I run the bfconvert command, the sanitizeXML method does not even get called, so I am not sure how this one line fix can do anything about invalid characters. If you want I can upload the file causing the problem so you can try it for yourself. For the sake of completeness, this is the output from converting the file:
Code:
./bfconvert Image0019.oib out.ome

Image0019.oib
Initializing helper readers
Reading additional metadata
Populating metadata
Reading bitmap header
Populating metadata
Unknown LaserMedium value 'fluo-3' will be stored as "Other"
[Olympus FV1000] -> out.ome [OME-XML]
   Series 0: converted 1/1 planes (100%)
   Series 1: converted 1/1 planes (100%)
[done]
11.872s elapsed (27.0+138.0ms per plane, 524ms overhead)

The output is identical with or without a System.out.println("In sanitizeXML"); statement in the sanitizeXML method.

Running showinf on the converted file gives:
Code:
./showinf out.ome

Checking file format [OME-XML]
Initializing reader
Exception in thread "main" loci.formats.FormatException: Malformed OME-XML
   at loci.formats.in.OMEXMLReader.initFile(OMEXMLReader.java:241)
   at loci.formats.FormatReader.setId(FormatReader.java:1178)
   at loci.formats.ImageReader.setId(ImageReader.java:727)
   at loci.formats.ReaderWrapper.setId(ReaderWrapper.java:529)
   at loci.formats.tools.ImageInfo.testRead(ImageInfo.java:988)
   at loci.formats.tools.ImageInfo.main(ImageInfo.java:1031)
Caused by: java.io.IOException
   at loci.common.xml.XMLTools.parseXML(XMLTools.java:338)
   at loci.common.xml.XMLTools.parseXML(XMLTools.java:306)
   at loci.formats.in.OMEXMLReader.initFile(OMEXMLReader.java:237)
   ... 5 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 4274; columnNumber: 59; The reference to entity "iso_1" must end with the ';' delimiter.
   at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
   at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
   at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:391)
   at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1404)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1826)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3009)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:625)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:488)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:819)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:748)
   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1208)
   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:525)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
   at loci.common.xml.XMLTools.parseXML(XMLTools.java:330)

I pasted the offending line in my previous post. I even made a pristine clone of Bio-Formats straight from the repository, in case old files were still lying around.

Re: Unescaped ampersands in bfconvert output

Post by mlinkert » Tue Sep 11, 2012 2:37 am

Ah, I see the problem. Nearly everything was fixed by the ticket mentioned previously, but there was one lingering bug in the OME-XML writer which is fixed here:

https://github.com/melissalinkert/biofo ... 386005c1f0

This only showed up when fully converting files to OME-XML; converting to OME-TIFF, or just generating the OME-XML metadata for a file, did not exhibit the problem. So, if you check out the 'sprint5-bug-fixes' branch of the above repository and rebuild, OME-XML files converted with the new build should be readable.

As an aside, is there a reason why you are converting to OME-XML and not OME-TIFF? We recommend OME-TIFF over OME-XML, as the image data is stored in a much nicer fashion (and both formats store exactly the same metadata).

