Wrong letter in title

On Sun, 20 May 2018 at 18:35, Davide Liessi wrote:

The file
\version "2.19.81"
\header { title = "č" }
{ b1 }
results in a PDF with correct printed title (lowercase c with caron)
but wrong title field in metadata (Ċ, i.e. uppercase c with dot
above).
Ghostscript bug when converting PostScript output to PDF. The
PostScript reads (pasted from less' display)
mark /Creator (LilyPond 2.21.0)
/Title (<FE><FF>^A^M)
/DOCINFO pdfmark
which is the correct UTF-16BE string with BOM (FE FF, then 01 0D for
U+010D). Ghostscript however converts the ^M (0x0d) into ^J (0x0a),
basically converting an ASCII CR to an ASCII LF. Unfortunately, we are
not in the middle of ASCII here.
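
The effect of that single byte change can be checked outside of Ghostscript. The following is a minimal sketch in Python, used here purely for illustration and not part of LilyPond or Ghostscript; the variable names are made up for this example. It decodes the title bytes before and after the CR has become an LF:

```python
# Title bytes as they appear in the PostScript file: BOM FE FF, then 01 0D.
original = bytes([0xFE, 0xFF, 0x01, 0x0D])
# The same bytes after the CR (0x0D) has been turned into an LF (0x0A).
altered = bytes([0xFE, 0xFF, 0x01, 0x0A])

# The "utf-16" codec reads the BOM to pick the byte order and strips it.
title_before = original.decode("utf-16")
title_after = altered.decode("utf-16")

print(f"U+{ord(title_before):04X}")  # code point of the intended title
print(f"U+{ord(title_after):04X}")   # code point that ends up in the PDF
```

The first code point is the lowercase c with caron from the header, the second is the uppercase c with dot above seen in the metadata.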

Actually, it turns out that the behaviour of GhostScript is not wrong
and this is probably a bug in how LilyPond produces the PostScript
file.

PostScript strings must either properly escape non-ASCII or ASCII
non-printable bytes, e.g., as \ddd with ddd the octal representation,
or they must be defined as a hexadecimal string (see [1], pages
29-31).
See the attached files: gs-string-wrong.ps defines the title the way
LilyPond does, with literal UTF-16BE bytes, gs-string-hex.ps uses a
hexadecimal string and gs-string-oct.ps escapes characters as octals.
Using ps2pdf on them results in the wrong Ċ (uppercase c with dot
above) in the first case and in the correct č (lowercase c with caron)
in the second and third cases.
See also [2] where exactly this problem was reported to GhostScript.

I think that LilyPond should properly escape strings or use their
hexadecimal representation (probably easier, if strings are encoded as
UTF-16BE regardless of the characters appearing in them).
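
For illustration, the two safe spellings of the title can be generated with a few lines of Python. This is only a sketch: the helper names are hypothetical, and it does not reproduce the attached .ps files or LilyPond's own code. It assumes the title is always encoded as UTF-16BE with a leading BOM:

```python
def utf16be_with_bom(text: str) -> bytes:
    """Encode text as UTF-16BE, preceded by the FE FF byte order mark."""
    return b"\xfe\xff" + text.encode("utf-16-be")


def ps_hex_string(data: bytes) -> str:
    """Spell the bytes as a PostScript hexadecimal string: <...>."""
    return "<" + data.hex().upper() + ">"


def ps_octal_string(data: bytes) -> str:
    """Spell every byte as a three-digit octal escape inside (...)."""
    return "(" + "".join(f"\\{b:03o}" for b in data) + ")"


title = utf16be_with_bom("č")
print("/Title", ps_hex_string(title))
print("/Title", ps_octal_string(title))
```

Both spellings stay within printable ASCII, so no raw CR or LF byte ever reaches the PostScript scanner.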

Best wishes.
Davide

[1] https://www.adobe.com/content/dam/acom/en/devnet/actionscript/articles/PLRM.pdf
[2] https://bugs.ghostscript.com/show_bug.cgi?id=693614

David Kastrup

2018-09-30 12:52:42 UTC

Uh WHAT? To quote:

The \ddd form may be used to include any 8-bit character constant in
a string. One, two, or three octal digits may be specified, with
high-order overflow ignored. This notation is preferred for
specifying a character outside the recommended ASCII character set
for the PostScript language, since the notation itself stays within
the standard set and thereby avoids possible difficulties in
transmitting or storing the text of the program. It is recommended
that three octal digits always be used, with leading zeros as
needed, to prevent ambiguity. The string (\0053) , for example,
contains two characters—an ASCII 5 (Control-E) followed by the digit
3—whereas the strings (\53) and (\053) contain one character, the
ASCII character whose code is octal 53 (plus sign).

Recommended/preferred is not at all equivalent to "must". However, one
problem indeed is that strings as such have no notion of encoding and
CR, LF, CRLF are all equivalent. So at least those bytes, when they
occur as part of UTF-16, would warrant escaping.
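
The end-of-line equivalence can be modelled in a few lines. The sketch below is in Python, for illustration only, and the function name is hypothetical. It assumes the PostScript rule that an unescaped end-of-line inside a string literal (CR, LF or CR LF) is read as a single LF, and it deliberately ignores backslash escapes and everything else the scanner does:

```python
import re


def scan_string_literal_eol(raw: bytes) -> bytes:
    """Model only the end-of-line rule: CR LF, CR and LF each become one LF."""
    return re.sub(rb"\r\n|\r", b"\n", raw)


# BOM followed by 01 0D, as in the /Title string under discussion.
title = b"\xfe\xff\x01\x0d"
print(scan_string_literal_eol(title).hex())
```

Under this model the low byte 0D of the UTF-16 code unit is rewritten to 0A, which matches the wrong character observed in the PDF metadata.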

--
David Kastrup

David Kastrup

2018-09-30 13:58:53 UTC

Tracker issue: 5422 (https://sourceforge.net/p/testlilyissues/issues/5422/)
Rietveld issue: 345090043 (https://codereview.appspot.com/345090043)
Issue description:
Escape nul, cr, newline in PDF metadata

I wasn't really aware that the strings remain pure 8-bit strings on
input and that the UTF-16 interpretation is the private business of the
pdfmark command. So thanks for that pointer, which makes it possible to
tackle this fairly long-known bug.
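
The idea named in the issue description can be sketched as follows. This is a Python illustration and not the actual patch from the Rietveld issue; the table and function names are hypothetical. It assumes that only NUL, CR, LF, parentheses and the backslash need escaping and that every other byte may pass through unchanged:

```python
# Bytes that get an escape sequence; everything else is copied verbatim.
ESCAPES = {
    0x00: b"\\000",       # NUL
    0x0A: b"\\012",       # LF
    0x0D: b"\\015",       # CR
    ord("("): b"\\(",
    ord(")"): b"\\)",
    ord("\\"): b"\\\\",
}


def escape_ps_bytes(data: bytes) -> bytes:
    """Escape the critical bytes of an already encoded metadata string."""
    return b"".join(ESCAPES.get(b, bytes([b])) for b in data)


print(escape_ps_bytes(b"\xfe\xff\x01\x0d"))
```

With this escaping the CR byte is written as the four ASCII characters \015, so the scanner no longer sees an end-of-line inside the string.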

--
David Kastrup

Thomas Morley

2018-09-30 18:34:17 UTC

Hi David,

I tested your patch with a .ly-file containing
\header { title = "fooüČč" }
checking meta-data with exiftool.

With 2.19.82:
Title : foo�Čč

Recent master:
Title : fooüČč

Recent master with guile-2.0.14 and the patches from branch guile-v2-work:
Title : fooüČč

Recent master with guile-2.2.4 and the patches from branch
guile-v2-work and some others:
Title : ??foo?....

Looks like a change in guile-2.2.x, so ly:encode-string-for-pdf does
not work as before.
But enabling the commented code in 'handle-metadata', i.e.:

(use-modules (ice-9 iconv))
(use-modules (rnrs bytevectors))
;;; Create DOCINFO pdfmark containing metadata
;;; header fields with pdf prefix override those without the prefix
(define (handle-metadata header port)
  (define (metadata-encode val)
    ;; First, call ly:encode-string-for-pdf to encode the string (latin1 or
    ;; utf-16be), then escape all parentheses and backslashes
    ;;
    ;; NOTE: with guile-2.0+ ly:encode-string-for-pdf is not really needed and
    ;; could be replaced.
    ;; For guile-2.2.+ this is a 'must do'
    ;;
    (ps-quote
     (let* ((utf16be-bom #vu8(#xFE #xFF)))
       (string-append (bytevector->string utf16be-bom "ISO-8859-1")
                      (bytevector->string (string->utf16 val 'big)
                                          "ISO-8859-1")))))
  ...)

Returns
Title : fooüČč
as desired.
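
For readers less familiar with Guile, the encoding step can be mirrored in Python. This is a rough analogue for illustration only, with a hypothetical function name; it leaves out ps-quote and is not LilyPond code. It assumes that decoding as ISO-8859-1 is used purely to get a string with one character per byte:

```python
def metadata_bytes_as_latin1(val: str) -> str:
    """BOM plus UTF-16BE bytes, carried in a string of one char per byte."""
    bom = bytes([0xFE, 0xFF])
    payload = val.encode("utf-16-be")
    # Latin-1 maps each byte 0..255 to the code point of the same value,
    # so the resulting string has exactly one character per byte.
    return (bom + payload).decode("latin-1")


encoded = metadata_bytes_as_latin1("fooüČč")
print(len(encoded))
print([f"{ord(c):02X}" for c in encoded])
```

The second line of output lists the byte values that would be handed on to the quoting step.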

I tried to create something like below with a guile-v2-condition:

(use-modules (ice-9 iconv))
(use-modules (rnrs bytevectors))
;;; Create DOCINFO pdfmark containing metadata
;;; header fields with pdf prefix override those without the prefix
(define (handle-metadata header port)
  (define (metadata-encode val)
    ;; First, call ly:encode-string-for-pdf to encode the string (latin1 or
    ;; utf-16be), then escape all parentheses and backslashes
    ;;
    ;; NOTE: with guile-2.0+ ly:encode-string-for-pdf is not really needed and
    ;; could be replaced.
    ;; For guile-2.2.+ this is a 'must do'
    ;;
    (ps-quote
     (if guile-v2
         (let* ((utf16be-bom #vu8(#xFE #xFF)))
           (string-append (bytevector->string utf16be-bom "ISO-8859-1")
                          (bytevector->string (string->utf16 val 'big)
                                              "ISO-8859-1")))
         (ly:encode-string-for-pdf val))))
  ...)

Though, this does not work, because guile-1.8 would issue an error
about the unknown syntax.
Any chance to create something which will work in guilev1 and guilev2?

Cheers,
Harm

David Kastrup

2018-09-30 23:27:07 UTC

Post by Thomas Morley
Looks like a change in guile-2.2.x, so ly:encode-string-for-pdf does
not work as before.

Well, then we should try to make it work again if possible.

Post by Thomas Morley
[...]
Though, this does not work, because guile-1.8 would issue an error
about the unknown syntax.

You can always write something like #vu8(#xFE #xFF) as a function call
rather than as "read syntax".

Post by Thomas Morley
Any chance to create something which will work in guilev1 and guilev2?

I think the sane perspective would be fixing the problem where it
appears rather than at some later point of time.

--
David Kastrup

Thomas Morley

2018-10-01 09:43:53 UTC

Post by David Kastrup
Well, then we should try to make it work again if possible.

You can always write something like #vu8(#xFE #xFF) as a function call
rather than as "read syntax".

Don't understand. Could you give a code-example?

Post by Thomas Morley
Any chance to create something which will work in guilev1 and guilev2?

I think the sane perspective would be fixing the problem where it
appears rather than at some later point of time.

Agreed.
Though, to fix ly:encode-string-for-pdf, some C++ work is required,
which is beyond my capabilities.

Cheers,
Harm

David Kastrup

2018-10-01 10:08:23 UTC

Thomas Morley <***@gmail.com> writes:

[...]

You can always write something like #vu8(#xFE #xFF) as a function call
rather than as "read syntax".

Don't understand. Could you give a code-example?

(u8-list->bytevector '(#xFE #xFF))

Post by Thomas Morley
Any chance to create something which will work in guilev1 and guilev2?

I think the sane perspective would be fixing the problem where it
appears rather than at some later point of time.

Agreed.
Though, to fix ly:encode-string-for-pdf, some C++ work is required,
which is beyond my capabilities.

Sure, but the 2.0/2.2 differences will have more pressing consequences
elsewhere. Document metadata is not the highest priority and may even
"happen" to be fixed as a side effect of other fixes.

--
David Kastrup

Thomas Morley

2018-10-01 19:26:50 UTC

Post by David Kastrup
[...]

You can always write something like #vu8(#xFE #xFF) as a function call
rather than as "read syntax".

Don't understand. Could you give a code-example?

(u8-list->bytevector '(#xFE #xFF))

Works.
Many thanks!

For the record, here is the code that works with guile-1.8 and guile-2.2.4:

(if (guile-v2)
    (use-modules
     (ice-9 iconv)
     (rnrs bytevectors)))
;;; Create DOCINFO pdfmark containing metadata
;;; header fields with pdf prefix override those without the prefix
(define (handle-metadata header port)
  (define (metadata-encode val)
    ;; First, call ly:encode-string-for-pdf to encode the string (latin1 or
    ;; utf-16be), then escape all parentheses and backslashes
    ;;
    ;; NOTE: with guile-2.0+ ly:encode-string-for-pdf is not really needed and
    ;; could be replaced.
    ;; For guile-2.2.+ this is a 'must do'
    ;;
    (ps-quote
     (if (guile-v2)
         (let* (;(utf16be-bom #vu8(#xFE #xFF))
                (utf16be-bom (u8-list->bytevector '(#xFE #xFF))))
           (string-append (bytevector->string utf16be-bom "ISO-8859-1")
                          (bytevector->string (string->utf16 val 'big)
                                              "ISO-8859-1")))
         (ly:encode-string-for-pdf val))))
  ...)

Post by Thomas Morley
Any chance to create something which will work in guilev1 and guilev2?

I think the sane perspective would be fixing the problem where it
appears rather than at some later point of time.

Agreed.
Though, to fix ly:encode-string-for-pdf, some C++ work is required,
which is beyond my capabilities.

Sure, but the 2.0/2.2 differences will have more pressing consequences
elsewhere. Document metadata is not the highest priority and may even
"happen" to be fixed as a side effect of other fixes.

We already know the problems with guile-2.0.x.

With guile-2._2_.x (currently I use 2.2.4) the list gets more entries:
- ly:protects doesn't work any more
- messages/warnings about missing (ice-9 threads)
- special characters in file names don't work (not sure whether this is
already a problem with guile-2.0.x)
- metadata with special characters

I've already tried some experimental fixes for the first three problems.
See attachment to
http://lists.gnu.org/archive/html/bug-lilypond/2018-06/msg00011.html
patch 13-15

The code above is at least a workaround for the last problem.

Are there other known problems new with guile-2.2.x (not yet known
with guile-2.0.x)?

Thanks,
Harm

-

Davide Liessi

2018-09-30 18:23:21 UTC