ASTERISK-20167: [patch] UTF-8 cyrillic characters in voicemail email subject cause subject corruption

Summary: ASTERISK-20167: [patch] UTF-8 cyrillic characters in voicemail email subject cause subject corruption

Reporter: Arcadiy Ivanov (arcivanov)
Labels: patch

Date Opened: 2012-07-23 23:27:11
Date Closed:

Priority: Major
Regression?

Status: Open/New
Components: Applications/app_voicemail

Versions: 1.8.8.2, 13.18.4
Frequency of Occurrence: Constant

Related Issues:

Environment: Linux myhost.mydomain 2.6.18-308.11.1.el5 #1 SMP Tue Jul 10 08:49:28 EDT 2012 i686 i686 i386 GNU/Linux Cent-OS 5.8

Attachments: ( 0) issueA20167_break_early_for_q_encoding.patch

Description: This has been happening ever since 1.4.x.

In {{voicemail.conf}}:
{noformat}
emailsubject=[PBX]: Сообщение от ${VM_CALLERID} в ${VM_DATE}
{noformat}

The emails arrive with the following subject:

{noformat}
[PBX]: Сообще�в Monday, July 23, 2012 at 11:45:46 PM
{noformat}

The subject should appear as follows:

{noformat}
[PBX]: Сообщение от "anonymous" <anonymous> в Monday, July 23, 2012 at 11:45:46 PM
{noformat}

The raw subject header as it appears in the email message is:

{noformat}
Subject: =?UTF-8?Q?=5BPBX=5D=3A_=D0=A1=D0=BE=D0=BE=D0=B1=D1=89=D0=B5=D0?=
=?UTF-8?Q?=BD=D0=B8=D0=B5_=D0=BE=D1=82_=22anonymous=22_=3Canonymous=3E_?=
=?UTF-8?Q?=D0=B2_Monday=2C_July_23=2C_2012_at_11=3A45=3A46_PM?=
{noformat}
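
As a sanity check on the header above, the encoded-text of each encoded-word can be Q-decoded on its own and tested for complete UTF-8 sequences. The following is only a sketch, not Asterisk code: `hex_val`, `q_decode`, `utf8_complete` and `q_word_is_whole_utf8` are hypothetical helper names, and the UTF-8 test looks at sequence structure only (it does not reject overlong forms or surrogates).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = toupper(c);
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode the encoded-text of ONE Q encoded-word into out, which must hold
 * at least strlen(q) bytes. Returns the byte count, or -1 if malformed.
 */
static int q_decode(const char *q, unsigned char *out)
{
    int n = 0;
    while (*q) {
        if (*q == '=') {
            int hi = hex_val((unsigned char)q[1]);
            int lo = hi < 0 ? -1 : hex_val((unsigned char)q[2]);
            if (hi < 0 || lo < 0) return -1;
            out[n++] = (unsigned char)(hi * 16 + lo);
            q += 3;
        } else {
            out[n++] = (*q == '_') ? ' ' : (unsigned char)*q;
            q++;
        }
    }
    return n;
}

/* Returns 1 if buf consists only of complete UTF-8 sequences, else 0. */
static int utf8_complete(const unsigned char *buf, int len)
{
    int i = 0;
    while (i < len) {
        int need;
        if (buf[i] < 0x80) need = 0;
        else if ((buf[i] & 0xE0) == 0xC0) need = 1;
        else if ((buf[i] & 0xF0) == 0xE0) need = 2;
        else if ((buf[i] & 0xF8) == 0xF0) need = 3;
        else return 0;              /* stray continuation or invalid lead */
        i++;
        while (need-- > 0) {
            if (i >= len || (buf[i] & 0xC0) != 0x80) return 0;
            i++;
        }
    }
    return 1;
}

/* Returns 1 if the encoded-text decodes to whole UTF-8 characters. */
int q_word_is_whole_utf8(const char *encoded_text)
{
    unsigned char buf[128];
    int n;
    if (strlen(encoded_text) >= sizeof(buf)) return 0;
    n = q_decode(encoded_text, buf);
    return n >= 0 && utf8_complete(buf, n);
}
```

Fed the three encoded-text portions from the header above, the first two fail the check (the first ends on a lone 0xD0 lead byte, the second starts with its 0xBD continuation byte) and the third passes.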

Comments: By: David Woolley (davidw) 2012-07-24 06:14:31.624-0500

It looks like the line break escape has been applied in the middle of a character. I'd have to check the RFC to see if this is legitimate, but as the encoding could change, I suspect it is not.

Have you checked that the mail transport hasn't added the line breaks?
By: Arcadiy Ivanov (arcivanov) 2012-07-24 14:10:00.140-0500

The transport is whatever Asterisk uses by default (I assume "sendmail -t"). The code responsible for Q-encoding (including the new lines) is in Asterisk, not in the transport. The call structure goes like this:

app_voicemail.c -> make_email_file -> ast_str_encode_mime(&str2, 0, ast_str_buffer(str1), strlen("Subject: "), 0)

I suspect there is a bug somewhere in this section:
{code}
if ((first_section && need_encoding && preamble + ast_str_strlen(tmp) > 70) ||
    (first_section && !need_encoding && preamble + ast_str_strlen(tmp) > 72) ||
    (!first_section && need_encoding && ast_str_strlen(tmp) > 70) ||
    (!first_section && !need_encoding && ast_str_strlen(tmp) > 72)) {
    /* Start new line */
    ast_str_append(end, maxlen, "%s%s?=", first_section ? "" : " ", ast_str_buffer(tmp));
    ast_str_set(&tmp, -1, "=?%s?Q?", charset);
    first_section = 0;
}
{code}

On a side note, wouldn't it be more prudent to use B-encoding (base64) in all cases where a multi-byte encoding (UTF-8, UTF-16LE/BE, UTF-32) is requested? Base64 emits 4 bytes for every 3 encoded (133%), while Q-encoding emits 3 bytes for every byte that needs encoding (300%). In fact, unless the text contains an overwhelming proportion of characters from the Latin1 subset that Q-encoding can represent as an unencoded atom, it always makes more sense to use Base64.
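
To put rough numbers on that comparison, here is a small sketch that computes the encoded-text length under both schemes. It assumes the most conservative Q literal set from RFC 2047 (letters, digits and "!*+-/", with space written as "_"); the set Asterisk actually uses may differ. `q_is_literal`, `q_encoded_len` and `b_encoded_len` are names made up for this example.

```c
#include <stddef.h>

/* Bytes that Q-encoding may emit as-is in any header context (assumed set). */
static int q_is_literal(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '!' || c == '*' || c == '+' || c == '-' || c == '/';
}

/* Length of the Q encoded-text for n raw bytes: 1 per literal or space, 3 otherwise. */
size_t q_encoded_len(const unsigned char *s, size_t n)
{
    size_t i, len = 0;
    for (i = 0; i < n; i++)
        len += (q_is_literal(s[i]) || s[i] == ' ') ? 1 : 3;
    return len;
}

/* Length of the base64 encoded-text for n raw bytes, padding included. */
size_t b_encoded_len(size_t n)
{
    return 4 * ((n + 2) / 3);
}
```

For the 18 UTF-8 bytes of "Сообщение" this gives 54 characters with Q and 24 with B, while for plain "Monday" it gives 6 with Q and 8 with B, which is the trade-off in a nutshell.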

By: Arcadiy Ivanov (arcivanov) 2012-11-14 00:11:33.067-0600

Is there any progress on this?
By: Walter Doekes (wdoekes) 2012-11-14 03:41:30.480-0600

Here is a patch that should fix your problem.

It's not a panacea, but it should work for the majority of cases.

BTW: in Europe it is quite common to use only a few non-ASCII characters, so base64 is inferior there.
By: Walter Doekes (wdoekes) 2012-11-14 03:53:45.040-0600

P.S. http://www.ietf.org/rfc/rfc2047.txt
{noformat}
An 'encoded-word' may not be more than 75 characters long, including
'charset', 'encoding', 'encoded-text', and delimiters. If it is
desirable to encode more text than will fit in an 'encoded-word' of
75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
be used.

While there is no limit to the length of a multiple-line header
field, each line of a header field that contains one or more
'encoded-word's is limited to 76 characters.
...
Some character sets use code-switching techniques to switch between
"ASCII mode" and other modes. If unencoded text in an 'encoded-word'
contains a sequence which causes the charset interpreter to switch
out of ASCII mode, it MUST contain additional control codes such that
ASCII mode is again selected at the end of the 'encoded-word'. (This
rule applies separately to each 'encoded-word', including adjacent
'encoded-word's within a single header field.)
{noformat}

I'd call the multibyte tokens "mode switching", so breaking mid char is indeed illegal.
By: Arcadiy Ivanov (arcivanov) 2012-11-14 03:59:24.670-0600

bq. BTW: in Europe it is quite common to use only a few non-ascii characters, so base64 is inferior there.
Not in the Far East, Middle East and most of Eastern Europe :)

Thanks a lot for the patch, I'll give it a whirl.
By: Walter Doekes (wdoekes) 2012-11-14 04:10:22.526-0600

I know. Sorry for sounding colonialistic (is that a word?).

The benefit of *possibly* having something readable in there outweighs (to me) the drawback of using more bytes in other situations. We're talking 100 bytes here. If you're worried about those, you probably shouldn't be sending audio files over SMTP ;)
By: Arcadiy Ivanov (arcivanov) 2012-11-14 05:06:20.163-0600

Which raises a point: if the text volume in headers is not expected to be large or at all significant, why not base64 it no matter what and be done? A 33% increase in the volume of an insignificant amount of text beats having to figure out whether the encoding involved is multibyte (to FFFF or to 10FFFF?) and trying to arrange for character code alignment.

By: Walter Doekes (wdoekes) 2012-11-14 05:26:52.063-0600

Wrong (unfortunately). Base64'ing it does not help with the problem at hand.

If it did, I'd go with that solution, but it doesn't. See the following, keeping in mind that asterisk is encoding-agnostic here:

{noformat}"AB€" == "AB\xe2\x82\xac"{noformat}

If I attempt to make base64, I get this:

{noformat}"QULigqw="{noformat}

However, if I only have room for four bytes left, we get this:

{noformat}"QULi" + "gqw="{noformat}

Which translates to:

{noformat}"AB\xe2" + "\x82\xac"{noformat}

Same problem as before..
By: Arcadiy Ivanov (arcivanov) 2012-11-14 08:04:49.212-0600

But Base64 would first be reassembled and decoded as a whole, whereas with Q-encoding you parse it segment by segment. It makes perfect sense to parse Q-encoding the way you're describing, but for B-encoding, which is monolithic but cannot exceed 75 chars (with encoding specs) per line, that would seem to be simply an invalid approach to parsing.

http://www.ietf.org/rfc/rfc2047.txt

{quote}
8. Examples

The following are examples of message headers containing 'encoded-
word's:

From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

Note: In the first 'encoded-word' of the Subject field above, the
last "=" at the end of the 'encoded-text' is necessary because each
'encoded-word' must be self-contained (the "=" character completes a
group of 4 base64 characters representing 2 octets). ****An additional
octet could have been encoded in the first 'encoded-word' (so that
the encoded-word would contain an exact multiple of 3 encoded
octets), except that the second 'encoded-word' uses a different
'charset' than the first one.****
{quote}

See the section with added emphasis - if encoding didn't switch it would've been perfectly valid to carry over a hanging byte.

{quote}
The 'encoded-text' in an 'encoded-word' must be self-contained;
'encoded-text' MUST NOT be continued from one 'encoded-word' to
another. This implies that the 'encoded-text' portion of a "B"
'encoded-word' will be a multiple of 4 characters long; for a "Q"
'encoded-word', any "=" character that appears in the 'encoded-text'
portion will be followed by two hexadecimal characters.
{quote}

and

{quote}
Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.
{quote}
By: Walter Doekes (wdoekes) 2012-11-14 08:25:07.606-0600

I think the only thing you did was point out that I'm right:

{quote}Each 'encoded-word' MUST represent an integral number of characters.{quote}

.. whether that's base64 or Q-encoded.

I didn't write that we MUST break up the base64 at four bytes. It'd be perfectly legal to break it up into:
{noformat}"QUI=" (AB) + "4oKs" (€){noformat}

But that requires Asterisk to know about the encoding.. *which* *it* (again) *doesn't* *right* *now*. Asterisk sees 5 bytes, not 3 characters. So it won't split things up like you want it to.
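
To illustrate the point, here is a minimal sketch of such a character-aware split, assuming the input is valid UTF-8. It is not what Asterisk does today; `b64_encode` and `utf8_split_point` are hypothetical helpers written for this example.

```c
#include <stddef.h>

static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Base64-encode n bytes into out (NUL-terminated); out needs 4*((n+2)/3)+1 bytes. */
void b64_encode(const unsigned char *in, size_t n, char *out)
{
    size_t i;
    for (i = 0; i + 2 < n; i += 3) {
        *out++ = b64_alphabet[in[i] >> 2];
        *out++ = b64_alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        *out++ = b64_alphabet[((in[i + 1] & 0x0F) << 2) | (in[i + 2] >> 6)];
        *out++ = b64_alphabet[in[i + 2] & 0x3F];
    }
    if (i < n) {                    /* one or two bytes left: pad with '=' */
        *out++ = b64_alphabet[in[i] >> 2];
        if (i + 1 < n) {
            *out++ = b64_alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
            *out++ = b64_alphabet[(in[i + 1] & 0x0F) << 2];
        } else {
            *out++ = b64_alphabet[(in[i] & 0x03) << 4];
            *out++ = '=';
        }
        *out++ = '=';
    }
    *out = '\0';
}

/*
 * Largest cut <= max that does not land inside a UTF-8 sequence, found by
 * backing up while the byte after the cut is a continuation byte (10xxxxxx).
 */
size_t utf8_split_point(const unsigned char *s, size_t n, size_t max)
{
    size_t cut = max < n ? max : n;
    while (cut > 0 && cut < n && (s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

With the five bytes of "AB€" and room for three bytes, `utf8_split_point` backs up to offset 2, and encoding the two pieces separately gives "QUI=" and "4oKs". Cutting at offset 3 instead gives the "QULi" + "gqw=" pair from the earlier comment.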

By: Arcadiy Ivanov (arcivanov) 2012-11-14 08:33:57.651-0600

{quote}
An additional octet could have been encoded in the first 'encoded-word' (so that
the encoded-word would contain an exact multiple of 3 encoded
octets), except that the second 'encoded-word' uses a different
'charset' than the first one.
{quote}

This basically means that base64 must be integral (encoded-word) but the underlying text (encoded-text) doesn't have to be. What you're quoting must hold true for Q, but not necessarily for B.
By: Walter Doekes (wdoekes) 2012-11-14 08:35:37.297-0600

This is not going anywhere :)
Find me on irc (wdoekes) if you want to continue this discussion.
By: Arcadiy Ivanov (arcivanov) 2012-11-14 09:09:18.727-0600

You're right. Objections withdrawn.
By: Walter Doekes (wdoekes) 2012-11-14 09:14:31.217-0600

Let me know if the patch works for you.
By: Arcadiy Ivanov (arcivanov) 2013-04-04 08:12:02.933-0500

FYI, patch works very well for UTF-8 and Cyrillic. Thanks a lot!
By: Arcadiy Ivanov (arcivanov) 2013-06-03 02:58:48.947-0500

There are still minor problems with this provisional patch. I get the following subject (notice the non-character):

{noformat}
[PBX]: Сообщение от "+7+33870460000" <+7+33870460000> в Понедельни�Июнь 03, 2013 at 03:55:19
{noformat}

encoded as follows:

{noformat}
Subject: =?UTF-8?Q?=5BPBX=5D=3A_?=
=?UTF-8?Q?=D0=A1=D0=BE=D0=BE=D0=B1=D1=89=D0=B5=D0=BD=D0=B8=D0=B5_?=
=?UTF-8?Q?=D0=BE=D1=82_?=
=?UTF-8?Q?=22+7+33870460000=22_?=
=?UTF-8?Q?=3C+7+33870460000=3E_?=
=?UTF-8?Q?=D0=B2_?=
=?UTF-8?Q?=D0=9F=D0=BE=D0=BD=D0=B5=D0=B4=D0=B5=D0=BB=D1=8C=D0=BD=D0=B8=D0?=
=?UTF-8?Q?=BA=2C_?=
=?UTF-8?Q?=D0=98=D1=8E=D0=BD=D1=8C_?=
=?UTF-8?Q?03=2C_?=
=?UTF-8?Q?2013_?=
=?UTF-8?Q?at_?=
=?UTF-8?Q?03=3A55=3A19_?=
=?UTF-8?Q??=
{noformat}

By: Walter Doekes (wdoekes) 2013-06-03 03:24:20.455-0500

Unfortunately, that's expected with that patch.
{noformat}
+ "=?UTF-8?Q?=AC=E2=82=AC?="; /* <-- this break mid utf-8 is not cool, but it is the current implementation */
{noformat}

The patch tries to break on any space boundary; if there are too few spaces in your text, you get that problem.

For your Понедельни�Июнь-word, it's so long that the 72-byte limit kicks in.

Three choices:
- leave it as is
- drop the 72 char limit if need_encoding; we break on spaces now anyway, so that should still result in short lines
- add utf8 support (more work)
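
The third option could look roughly like the sketch below: walk the text one whole UTF-8 character at a time, add up its Q-encoded cost, and stop before the character that would exceed the budget. This is an assumption-laden illustration rather than a proposed patch. It assumes valid UTF-8 input and treats only ASCII letters, digits and space as one-character output; `utf8_seq_len`, `q_cost` and `next_break` are hypothetical names.

```c
#include <stddef.h>

/*
 * Length of the UTF-8 sequence introduced by lead byte c. ASCII, stray
 * continuation bytes and invalid leads count as 1 so scanning always advances.
 */
static size_t utf8_seq_len(unsigned char c)
{
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Q-encoded cost of one byte under the assumed literal set. */
static size_t q_cost(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9') || c == ' ')
        return 1;
    return 3;
}

/*
 * Starting at s[start], return the offset where the next encoded-word
 * should end so that its encoded-text is at most budget characters long
 * and no UTF-8 sequence is split. If even the first character does not
 * fit, start is returned and the caller has to handle that case.
 */
size_t next_break(const unsigned char *s, size_t n, size_t start, size_t budget)
{
    size_t pos = start, used = 0;
    while (pos < n) {
        size_t seq = utf8_seq_len(s[pos]), cost = 0, k;
        if (seq > n - pos)
            seq = n - pos;          /* truncated input: take what is left */
        for (k = 0; k < seq; k++)
            cost += q_cost(s[pos + k]);
        if (used + cost > budget)
            break;
        used += cost;
        pos += seq;
    }
    return pos;
}
```

With a budget of 63 characters (75 minus the 12 characters of "=?UTF-8?Q?" and "?="), the 23 bytes of "Понедельник," break after the tenth letter at offset 20, and the second encoded-word carries the remaining "к,".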
By: Arcadiy Ivanov (arcivanov) 2013-06-03 09:39:09.411-0500

Actually the "Понедельни�Июнь" is "Понедельник, Июнь 03, 2013 at 03:55:19" which is a Russian version of ${VM_DATE} in voicemail.conf

Shouldn't "the patch tr[ying] to break on any space boundary" have stopped after "Понедельник,"?

Thanks!
By: Walter Doekes (wdoekes) 2013-06-03 09:51:22.966-0500

It does:

{noformat}
=?UTF-8?Q? =D0=9F =D0=BE =D0=BD =D0=B5 =D0=B4 =D0=B5 =D0=BB =D1=8C =D0=BD =D0=B8 =D0?=
=?UTF-8?Q? =BA =2C_?=
{noformat}

Note the =2C_ <-- that's {{,<space>}}

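"Понедельник," takes 11 Russian characters + 1 ASCII character. Each Russian character is 2 bytes at 3 characters per byte, and the comma is encoded as =2C, so that's 6*11 + 3 = 69 characters. Add the 10 + 2 for head and tail and you get 81, which is more than 72.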
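
The same arithmetic can be turned into a small budget calculation against the 75-character encoded-word limit from RFC 2047. This is a sketch under stated assumptions: every byte of a character is taken to need "=XX" in Q, `bytes_per_char` must be at least 1, and the function names are invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define ENCODED_WORD_MAX 75     /* RFC 2047 limit for one encoded-word */

/* Characters left for encoded-text once "=?", charset, "?Q?" and "?=" are paid for. */
size_t encoded_text_budget(const char *charset)
{
    size_t overhead = 2 + strlen(charset) + 3 + 2;
    return overhead < ENCODED_WORD_MAX ? ENCODED_WORD_MAX - overhead : 0;
}

/* Whole characters that fit in one Q encoded-word when every byte becomes "=XX". */
size_t max_q_chars(const char *charset, size_t bytes_per_char)
{
    return encoded_text_budget(charset) / (3 * bytes_per_char);
}

/* Whole characters that fit in one B encoded-word (4 output chars per 3 bytes). */
size_t max_b_chars(const char *charset, size_t bytes_per_char)
{
    size_t bytes = encoded_text_budget(charset) / 4 * 3;
    return bytes / bytes_per_char;
}
```

For "UTF-8" the budget is 63 characters, so one Q encoded-word holds at most 10 two-byte characters. An 11-letter Cyrillic word therefore cannot fit in a single Q encoded-word no matter where the spaces are, whereas a B encoded-word has room for 22 such characters.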

By: Arcadiy Ivanov (arcivanov) 2013-06-03 10:06:06.304-0500

Ugh!