[LUDOC-7] character set problems in HTML manual Created: 06/Jul/11  Updated: 13/Feb/14  Resolved: 08/Oct/13

Status: Closed
Project: Lustre Documentation
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Richard Henwood (Inactive)
Resolution: Won't Fix Votes: 0
Labels: QInfrastructure
Environment:

Mac OS/X 10.6.6 + Firefox 3.6.17
Fedora 13 + Firefox 3.5.15


Issue Links:
Related
is related to LUDOC-192 Lustre manual doesn't open in some br... Closed
is related to LUDOC-217 Lustre manual should be indexed by go... Closed
Business Value: 1
Severity: 3
Rank (Obsolete): 7203

 Description   

There are a large number of "unknown" characters in the HTML version of the Lustre manual. On my system they appear as black diamonds with a question mark like '�'.

For example, the copyright (C) character right at the start of the manual and all of the accented characters in the Oracle boilerplate are shown this way, as is the hard-space (I guess) character in every section and subsection title.



 Comments   
Comment by Richard Henwood (Inactive) [ 06/Jul/11 ]

It seems this might be an encoding issue. The xml is currently specified as 'UTF-8'. UTF-8 is not a good choice.

I propose: ISO-8859-1.

Comment by Richard Henwood (Inactive) [ 06/Jul/11 ]

More detail is now available:

xsltproc produces HTML output with a directive:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

When you point your browser at Jenkins for the HTML build, Jenkins tells the browser that the file is encoded: UTF-8.

Testing with my browser, it seems to take the UTF-8 encoding as correct, and shows the '�' characters. I can change the encoding manually in the browser and the document renders fine.
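
The mismatch can be reproduced outside the browser. The following is a minimal sketch in Python, not part of the manual build; the sample string is an assumption chosen for illustration. Text written as ISO-8859-1 and then decoded as UTF-8 yields the U+FFFD replacement character for each non-ASCII byte, while decoding as ISO-8859-1 recovers the original text.

```python
# Illustrative only: the sample text is hypothetical, not copied from the manual.
sample = "Copyright \u00a9 2011, Oracle et/ou ses affili\u00e9s."

# xsltproc writes the HTML as ISO-8859-1 (one byte per character).
raw = sample.encode("iso-8859-1")

# A browser that is told "UTF-8" by the server decodes those bytes as UTF-8.
# The bytes 0xA9 and 0xE9 are not valid UTF-8 on their own, so each one
# is replaced with U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Manually selecting ISO-8859-1 in the browser corresponds to this decode.
as_latin1 = raw.decode("iso-8859-1")

print(as_utf8)
print(as_latin1)
```

The first line printed contains the replacement character in place of the copyright sign and the accented letter; the second line matches the original text.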

xsltproc with the docbook.xsl specifies ISO-8859-1 encoding because that is the encoding it uses for its output. For example, the 'copyright � date DDDD' lines are generated from <copyright> and <date> docbook elements.

A work-around is to manually specify encoding = ISO-8859-1 in your browser for the manual.

Two alternative solutions may also be possible:

  • Tell Jenkins to serve HTML content with ISO-8859-1
  • MORE SPECULATIVE: Create our own xsl to handle generating of the copyright bit at the beginning and convert all '�' chars to html entities.
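
The second option could also be approximated as a post-processing step on the generated HTML instead of a custom xsl. This is a rough sketch only, and a stand-in for the xsl idea rather than the xsl itself: the helper name to_char_refs is hypothetical, and nothing like it exists in the manual build. It replaces every non-ASCII character with a numeric character reference, so the output is plain ASCII and reads the same whether the browser assumes UTF-8 or ISO-8859-1.

```python
def to_char_refs(text: str) -> str:
    """Replace each non-ASCII character with a numeric character reference."""
    # 'xmlcharrefreplace' is a standard Python codec error handler that
    # emits &#NNN; for any character the target encoding cannot represent.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical sample line, for illustration only.
converted = to_char_refs("Copyright \u00a9 2011, affili\u00e9s")
print(converted)
```

The input would have to be decoded with its real encoding (ISO-8859-1 for the xsltproc output) before being passed to such a helper.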

Comment by Jessica A. Popp (Inactive) [ 20/Dec/11 ]

Would this be something for Joshua to help resolve?

Comment by Andreas Dilger [ 20/Dec/11 ]

Yes, Joshua could probably understand what is going on here a lot faster than I could.

Comment by Richard Henwood (Inactive) [ 31/Jan/12 ]

One additional piece of information I've just noticed:

lustre_manual.diff.html <- encoding appears correct
lustre_manual.html <- encoding appears incorrect

Comment by Andreas Dilger [ 23/Apr/13 ]

It looks like the .xhtml version of the manual does not have this problem. Is that just because the filename extension is not .html?

Comment by Andreas Dilger [ 23/Apr/13 ]

I just found that there are HTML codes for these:

Comment by Richard Henwood (Inactive) [ 23/Apr/13 ]

I believe the reason the 'x' makes a difference to the rendering is because the webserver that Jira uses is not very clever... I've discussed this in: https://jira.hpdd.intel.com/browse/LUDOC-7?focusedCommentId=17320&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17320

Comment by Richard Henwood (Inactive) [ 02/May/13 ]

The config of the webserver is both:

  • making browsers render some characters incorrectly.
  • prohibiting the manual appearing on search engine results.

Comment by Andreas Dilger [ 24/Sep/13 ]

http://review.whamcloud.com/7739

Some non-ASCII characters, such as accented letters, copyright (c),
and single quotes (rather than apostrophes) were being rendered
incorrectly in the HTML version of the manual, because of confusion
between character sets (UTF-8 vs. ISO-8859-1).

Instead of using the encoded characters directly in the manual,
use the HTML escape sequences such as &copy;, &agrave;, etc.
These can be rendered correctly for both the HTML and PDF manuals.

This doesn't resolve the use of ISO-8859-1 hard spaces in the titles,
but at least fixes the most visible mess at the start of the manual.

Comment by Andreas Dilger [ 24/Sep/13 ]

Sadly, this doesn't work:

legalnoticeOracle.xml:17: parser error : Entity 'copy' not defined
<para>Copyright &copy; 2011, Oracle et/ou ses affili&eacute;s. Tous droits
^
legalnoticeOracle.xml:17: parser error : Entity 'eacute' not defined
<para>Copyright &copy; 2011, Oracle et/ou ses affili&eacute;s. Tous droits
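
The errors are expected for plain XML: only five named entities (lt, gt, amp, apos, quot) are predefined, and names such as copy and eacute come from HTML and need a DTD that declares them. Numeric character references need no declaration. The sketch below shows the difference using Python's standard-library XML parser as a stand-in for the xmllint/xsltproc toolchain (an assumption; the sample lines are shortened from the error output above). Whether the DocBook stylesheets then render these identically in the HTML and PDF manuals was not tested here.

```python
import xml.etree.ElementTree as ET

# Named HTML entities, as in the failing legalnoticeOracle.xml line.
named = "<para>Copyright &copy; 2011, Oracle et/ou ses affili&eacute;s.</para>"
# The same text using numeric character references instead.
numeric = "<para>Copyright &#169; 2011, Oracle et/ou ses affili&#233;s.</para>"

try:
    ET.fromstring(named)
    named_ok = True
except ET.ParseError as err:
    # Without a DTD the parser has no definition for 'copy' or 'eacute'.
    named_ok = False
    print("named entities rejected:", err)

# Numeric references are resolved by the parser itself.
text = ET.fromstring(numeric).text
print(text)
```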

Comment by Richard Henwood (Inactive) [ 08/Oct/13 ]

We won't be fixing this. The xhtml version renders fine - and that version is the one that is most commonly linked to.

Generated at Sat Feb 10 03:39:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.