Hi all,
The problem is caused by using a StringWriter to collect the XML output.
A StringWriter only holds characters and never encodes them to bytes, so the
UTF-8 named in the XML declaration is not guaranteed to match the bytes that
are eventually written out. Use an OutputStreamWriter instead, for example:
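java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
javax.xml.transform.stream.StreamResult result =
    new javax.xml.transform.stream.StreamResult(
        new java.io.OutputStreamWriter(baos, "UTF-8"));
// After transformer.transform(source, result), baos.toString("UTF-8")
// contains the XML. Decode with the same charset the writer used; a plain
// baos.toString() would fall back to the platform default charset.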
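
For completeness, here is a minimal sketch of the whole round trip, assuming a reasonably recent JDK (it uses java.nio.charset.StandardCharsets, which is newer than the Tomcat 4.1 setup in the question). The class name Utf8DomSerializer and its two methods are made up for illustration. It hands the ByteArrayOutputStream directly to StreamResult, a small variation on the OutputStreamWriter approach above, and sets OutputKeys.ENCODING explicitly so the declaration and the bytes agree.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of JAXP.
final class Utf8DomSerializer {

    // Builds the same small document as in the quoted question.
    // The \u escapes spell "æøå" so the source file encoding does not matter.
    public static Document buildSampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element el = doc.createElement("element");
        el.setAttribute("attr1", "attr1value");
        el.appendChild(doc.createTextNode(
                "Danish < \u00e6\u00f8\u00e5 > characters!"));
        doc.appendChild(el);
        return doc;
    }

    // Serializes the DOM to bytes; the transformer does the byte encoding
    // itself because it is given an OutputStream rather than a Writer.
    public static byte[] serialize(Document doc) throws Exception {
        Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(baos));
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] xmlBytes = serialize(buildSampleDocument());
        // Decode with the same charset that was used for encoding.
        System.out.println(new String(xmlBytes, StandardCharsets.UTF_8));
    }
}
```

If the XML is going out over HTTP or into a file, write the byte array as it is; converting it back to a String and printing it through another Writer reintroduces the original problem.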
/Michael
www.hyperpal.com
> Hi all!
>
> I have posted a message in comp.lang.java.programmer about serializing
> DOM objects via JAXP, but so far I have not received any replies. So I am
> hoping that somewhere in this group there is an XML guru who can help me
> with it.
>
> The post is in English - live with it ..
>
> Regards, Michael
>
> ###
>
> I'm trying to serialize an xml document with JAXP. The xml may or may not
> contain international characters, and so I want any text elements to be
> UTF-8 encoded. Consider the following (a brief summary is included below
> the code):
>
> ---- code begin ----
>
> org.w3c.dom.Document doc =
>     javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
>
> org.w3c.dom.Element el = doc.createElement("element");
> el.setAttribute("attr1","attr1value");
> el.appendChild(doc.createTextNode("Danish < æøå > characters!"));
> doc.appendChild(el);
>
> javax.xml.transform.TransformerFactory transformerFactory =
> javax.xml.transform.TransformerFactory.newInstance();
> javax.xml.transform.Transformer transformer =
> transformerFactory.newTransformer();
>
>
> transformer.setOutputProperty(javax.xml.transform.OutputKeys.INDENT,"yes");
> transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount","4");
>
> java.io.StringWriter xmlout = new java.io.StringWriter();
> javax.xml.transform.stream.StreamResult result = new
> javax.xml.transform.stream.StreamResult(xmlout);
> transformer.transform(new javax.xml.transform.dom.DOMSource(doc),result);
>
> System.out.println(xmlout.getBuffer());
>
> ---- code end ----
>
> So, I'm creating a document (DOM), setting an attribute and appending a
> text node with international characters (and a couple of brackets just for
> fun). Then I create a transformer instance, I ask it to indent the output
> nicely and finally to actually serialize my DOM into xml.
>
> When I run this code (in a jsp file on a tomcat 4.1.x server with the
> latest xerces2-j version installed) I get this output:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <element attr1="attr1value">Danish &lt; æøå &gt; characters!</element>
>
> Okay. So I got the < and > escaped as I expected. However, the international
> characters have not been encoded to UTF-8 or anything else for that matter.
> In fact, the above isn't even a valid xml document, and several parsers I
> tried (including Microsoft XML) reject it because of the illegal character
> data.
>
> Clearly there is a mismatch between the encoding declared in the xml
> declaration (UTF-8) and what actually appears in the text nodes of the
> document. It's very curious that JAXP will transform a DOM into a result
> that isn't valid.
>
> Interestingly, when I run the same code interactively inside my WebSphere
> Studio Application Developer 5 (using what is known as a scrapbook page),
> I get this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <element attr1="attr1value">Danish &lt; &#230;&#248;&#229; &gt; characters!</element>
>
> Well. I'm not sure that #230 is a correct UTF-8 encoding of "æ" (in fact
> I'm sure it isn't), but at least the document is now valid and even
> Microsoft XML will parse it without complaints.
>
> I am hoping that someone out there can shed some light on this problem and
> tell me what I am doing wrong. Exactly how do I instruct JAXP to encode
> the text nodes in my DOM so that it doesn't break my XML parser?
>
> Regards,
> Michael Berg
>
> www.hyperpal.com
>
>
>