Languages, fonts and encodings
as Unicode incarnation
The main feature of the files and philosophy of Office-97, and first
of all of Word-97, is the almost complete transition to Unicode: the text part of Word-97 documents is stored in Unicode, and while the program runs, text input and processing use (with some reservations, of course) most probably Unicode exclusively. However, as in previous versions of Word, normal work with a language requires that Word itself supports that language, regardless of whether the font is a Unicode font.
- Fonts. Unicode fonts are required in which the letters of national languages have correct Unicode values and in which Windows'9x sees the appropriate character set.
Fonts that lack the appropriate character set are, as a rule, fonts with only the Western script or with symbol encoding. Word-97 can use such fonts to some extent, but with numerous inconveniences and restrictions. In particular, when the keyboard is switched from English to a national layout (for example, Russian), the font is switched to one that contains the necessary character set (usually Times New Roman). Some further features of the behaviour in these cases:
- When a font with only the Western script is used, the text should be
marked as English. Russian spell checking and hyphenation will then not work (though if the text is
subsequently marked as Russian, even spell checking works).
Changing the font to a Unicode font
loses the "Russian features".
Changing the font from a Unicode font to a Western-only one
turns Russian and similar text into "small squares".
Such a Russian font can be converted to Unicode
with the converters available in the section Converters of the font encoding
and with other tools.
- A font with symbol encoding causes slightly fewer problems. In this case, too, language functions such as spell checking will not work, and hyphenation can occur at any place in a word. Changing away from this font can turn the text into "small squares" (probably because the Unicode font lacks the Unicode
range U+F020...U+F0FF), though changing from a Unicode font
to a symbol font, and even changing back afterwards, works correctly (probably because the "alphabetic" Unicode values are translated through a code page into the "private use" Unicode range of the symbol font).
Such a Russian font can be "converted to Unicode",
perhaps, only with programs such as R. Kinch's ttf_edit from the tools section, because besides modifying the Unicode values one must change the encoding label
from 3.0 to 3.1 and change the name table
accordingly.
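The translation into the "private use" range suggested above can be illustrated with a short Python sketch. It assumes the usual convention for symbol-encoded TrueType fonts, in which a one-byte character code c is addressed as U+F000 + c; the function names are illustrative and are not taken from any tool mentioned here.

```python
# Sketch: relation between one-byte codes of a symbol-encoded font and
# the Unicode private use range U+F020...U+F0FF mentioned in the text.
# Assumption: a one-byte code c is addressed as U+F000 + c.

SYMBOL_BASE = 0xF000


def to_symbol_range(data: bytes) -> str:
    """Map one-byte codes 0x20..0xFF into U+F020..U+F0FF."""
    return "".join(chr(SYMBOL_BASE + b) for b in data if b >= 0x20)


def from_symbol_range(text: str) -> bytes:
    """Inverse mapping: U+F020..U+F0FF back to one-byte codes."""
    return bytes(ord(ch) - SYMBOL_BASE for ch in text
                 if 0xF020 <= ord(ch) <= 0xF0FF)


russian = "Привет".encode("cp1251")   # one-byte codes, as typed via CP1251
private = to_symbol_range(russian)
print([f"U+{ord(ch):04X}" for ch in private])
print(from_symbol_range(private).decode("cp1251"))
```

A Unicode font normally has no glyphs at these private use positions, which is consistent with the "small squares" observed after the font change.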
- Printing. As it turned out, a correct appearance of multilingual
text on the screen in Word-97 is no guarantee that it will be printed correctly on a printer or sent correctly by fax. Strictly speaking, this is not an error in Word-97 but a deficiency of most (!) printer drivers:
their Unicode printing functions work improperly to one degree
or another. There are exceptions, for example the HP LaserJet 4L driver built into Windows'95. Other drivers may print small squares instead of the required characters.
In the worst case even national letters (Russian, Greek etc.) from code pages supported by both Windows and Word-97 are not printed. In the best
case only characters outside the standard Windows character sets (charsets) are missing, in particular the pseudo-graphics (contained in all standard
"large" fonts). The driver for the HP LaserJet 5-6 currently offered on the Hewlett-Packard server belongs to this last, best case: it can print everything that WordPad can print with standard fonts, but not everything that Word-97 can extract from the font.
Only the driver developers can solve the problem completely, but some
partial solutions are possible on the Word-97 side as well. They are given on the Microsoft server in Knowledge Base article Q159418:
- If Word-97 Service Release 1 (SR-1) is installed,
the following key must be added to the system Registry:
REGEDIT4
[HKEY_CURRENT_USER\Software\Microsoft\Office\8.0\Word\Options]
"NoWideTextPrinting"="1"
- If an earlier Word-97 (non-SR1) is installed,
the Registry must be edited using the exact printer name from the Printers folder, shown in the example as Printer_name:
REGEDIT4
[HKEY_CURRENT_USER\Software\Microsoft\Office\8.0\Word\Printer_name]
"Flags"="8192"
In this case the characters printed are those from the code pages in the "intersection" of what Windows'9x and Word-97 know, i.e. not only the pseudo-graphics but also Hebrew and Arabic will be missing, while Cyrillic and Greek, as well as Western and East European
diacritical marks, will be present.
- For some Hewlett-Packard printer models the article gives specific settings that solve the problem (to what extent?). In particular, the above-mentioned HP LaserJet 4L starts printing everything correctly only if the option "Print TrueType as graphics" is set.
- Finally, if none of the above helps, a radical remedy can sometimes help: for every Unicode font with Cyrillic that is in use, one
must manually enter in the FontSubstitutes section of WIN.INI
a line such as the following:
Font,0=Font,204
One must then give up hope of printing with this font not only in Hebrew and Arabic but also in German, French and probably Greek; Russian and English text, however, will all be printed correctly.
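The WIN.INI lines for this last remedy can be generated for several fonts at once. The following Python sketch only builds the text; the list of font names is a hypothetical example, and the numbers are the Windows charset identifiers (0 for the Western ANSI charset, 204 for the Russian one).

```python
# Sketch: build the [FontSubstitutes] lines of the form "Font,0=Font,204".
# 0 is the Western (ANSI) charset, 204 is the Russian charset.

ANSI_CHARSET = 0
RUSSIAN_CHARSET = 204


def font_substitute_lines(font_names):
    """Return one 'Font,0=Font,204' line per font name."""
    return [f"{name},{ANSI_CHARSET}={name},{RUSSIAN_CHARSET}"
            for name in font_names]


# Hypothetical list of the Unicode fonts with Cyrillic that are in use.
fonts = ["Times New Roman", "Arial", "Courier New"]

print("[FontSubstitutes]")
for line in font_substitute_lines(fonts):
    print(line)
```

The printed lines would still have to be pasted into WIN.INI by hand; the sketch deliberately does not touch the file.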
- The converter for documents created in earlier versions of Microsoft Word.
A Word-97 extension from Agama that changes the fonts and language of an unreadable Russian document to match the requirements of Word-97.
- MS Office'97 and Russian fonts: a digest from Pavel Gorodjansky.
Internet Explorer 4, including most of its components (the MSIE browser, the Outlook Express mail and news client and, to a somewhat lesser degree, the FrontPage Express HTML editor), uses Unicode as the main encoding for internal purposes and in input windows. As described in
a number of Microsoft Knowledge Base articles, MSIE implements the concept of a "multilingual object": a class and functions for working with it are defined. All this makes it possible to support practically any encoding on the Internet, including the Unicode-based ones
(UTF-7 and UTF-8). To render characters of a language (character set) that the default font does not support, a font that contains this language is selected. This font substitution works for those Unicode values that can be associated with a certain character set (a problem similar to that of recognizing the character sets available in a font). The substitution is not made if the character set is present in the font but some characters of the set are absent; those characters are then shown as small squares.
- Remarkably, multilingual support in MSIE4 can be controlled by the user: it is based on the national language support (NLS) files and on a branch of the Registry,
[HKEY_CLASSES_ROOT\MIME\Database].
The scheme described here was valid up to MSIE 5 beta 1; in MSIE 5 beta 2, however, the controllability of language support
was reduced and moved into MLANG.DLL. Modifications in the Registry no longer work as needed.
The NLS files are named CP_nnnn.NLS in
Windows'95/'98 (in the Windows\System directory) and C_nnnn.NLS in Windows NT (in the
Windows\System32 directory), where nnnn is
the number of a code page (1251, 866 etc.).
These files contain the definition of their code page in terms of Unicode,
a mapping table from Unicode to the code page, and a number of other tables. The approximate description of
the NLS file structure
(CP_nnnn.NLS and C_nnnn.NLS) contains the results of my own observations, as I could not find official documentation.
(The file LOCALE.NLS contains information specific to the languages of the world, in particular the formats of date, time and currency, information about alphabetical order, title and capitals etc.)
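The naming rule for the code page files can be summed up in a small Python helper. This is only a sketch of the rule stated above; the default Windows directory `C:\Windows` and the function name are assumptions made for the illustration.

```python
import ntpath  # Windows path rules, usable on any platform


def nls_path(code_page: int, windows_nt: bool,
             windows_dir: str = r"C:\Windows") -> str:
    """Expected location of the NLS file for a code page.

    Windows'95/'98: <windows_dir>\\System\\CP_nnnn.NLS
    Windows NT:     <windows_dir>\\System32\\C_nnnn.NLS
    """
    if windows_nt:
        return ntpath.join(windows_dir, "System32", f"C_{code_page}.NLS")
    return ntpath.join(windows_dir, "System", f"CP_{code_page}.NLS")


print(nls_path(1251, windows_nt=False))
print(nls_path(866, windows_nt=True))
```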
The koi8-r encoding is assigned page number 20866 (ache asserts
that IBM had already chosen number 878 for koi8-r, but Microsoft either did not know about that or ignored it). ISO-8859-5 is assigned 28595, and koi8-ru is assigned 21866 (this is the Ukrainian code page better known as koi8-u; under that name it received the status of RFC2319 in April 1998 and was registered with IANA as a
MIME charset).
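To make the numbers concrete, the sketch below pairs the code page numbers from the text with the codec names of the Python standard library and recodes a word between them. The codec names belong to Python, not to Microsoft, and mapping 21866 to `koi8_u` follows the identification of koi8-ru with koi8-u made above, which is an assumption of this example.

```python
# Windows code page numbers from the text -> Python codec names.
CODE_PAGES = {
    20866: "koi8_r",
    21866: "koi8_u",     # assumption: treated as koi8-u, as in the text
    28595: "iso8859_5",
    1251: "cp1251",
    866: "cp866",
}


def recode(data: bytes, src_page: int, dst_page: int) -> bytes:
    """Recode bytes from one code page to another by way of Unicode."""
    text = data.decode(CODE_PAGES[src_page])
    return text.encode(CODE_PAGES[dst_page])


word = "Привет"
koi = word.encode("koi8_r")
win = recode(koi, 20866, 1251)
print(koi.hex(), "->", win.hex())
```

Going through Unicode in both directions is the same idea as the tables in an NLS file: one table into Unicode, one table out of it.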
Some files absent from the standard delivery, in particular
CP_1255.NLS
(Hebrew) and
CP_1256.NLS
(Arabic),
can be found in the national versions of Windows and on the VC++ 5 disk,
and now also in the MSIE5 distribution kit.
The Registry branch mentioned above is divided into four parts:
- Content Type, where the correspondence between MIME types
and file extensions is set (it has no relation to multilingual
support);
- Codepage, whose keys list the supported code pages
and determine the "presence"
of a code page in MSIE and Outlook Express:
[HKEY_CLASSES_ROOT\MIME\Database\Codepage\nnnn]
Variables:
"Level":
- bit 0 of the first of the four bytes is responsible for the
presence of the encoding in the Outlook Express encodings list;
- bit 1 of the same byte is responsible for the presence of the encoding
in the Internet Explorer encodings list;
- the meaning of bit 2 of the first byte, and of the second of the four bytes (usually equal to the first), is not clear to me (it seems to play
some role for Unicode encodings and ideographic languages);
- the third and fourth bytes are equal to zero; they are probably
reserved.
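The two documented bits of "Level" can be decoded as in the following Python sketch. It covers only bits 0 and 1 of the first byte, as described above; the constant and function names are made up for the illustration, and the sample values are hypothetical.

```python
# Sketch: decode the two known bits of the four-byte "Level" value.
OUTLOOK_EXPRESS = 0x01     # bit 0 of the first byte
INTERNET_EXPLORER = 0x02   # bit 1 of the first byte


def describe_level(level: bytes) -> dict:
    """Report in which encodings lists a code page is present."""
    first = level[0]
    return {
        "outlook_express": bool(first & OUTLOOK_EXPRESS),
        "internet_explorer": bool(first & INTERNET_EXPLORER),
    }


# Hypothetical sample values: 01 01 00 00 and 03 03 00 00.
print(describe_level(bytes([0x01, 0x01, 0x00, 0x00])))
print(describe_level(bytes([0x03, 0x03, 0x00, 0x00])))
```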
"Family" - code page, base for the given coding (1251 for a cyrillic);
"Description" - verbal description of the encoding (for example, "Cyrillic (Windows)")
"FixedWidthFont" - monospaced font;
"ProportionalFont" - proportional font
(Probably the default fonts are defined here; the choice of fonts for MSIE
and Outlook Express is specified in the "Internet Explorer"
and "Outlook Express" sections, respectively, of an entirely different Registry branch),
"BodyCharset" - maybe the default encoding
for e-mail.
"WebCharset" - maybe the default encoding
for WWW.
"Encoding" - unknown.
- Charset - the definitions of the allowed values for
MIME-charset, which comprise main values and synonyms.
- "Main" charset, variables:
- "InternetEncoding" - the encoding (code page)
nnnn defined in the Codepage branch described above;
- "Codepage" - the main Windows code page for the given language (1251 for all Russian encodings).
- "Synonym" charset, variable:
"AliasForCharSet" - the name of the
"main" charset; the synonym itself can be anything,
as long as it does not coincide with another name.
- Rfc1766 - contains a list of
hexadecimal codes of the languages of the
world in accordance with
RFC1766, with their mnemonic labels and descriptions (where it is used is not clear), approximately like this:
"043D"="ji;Yiddish"
"0435"="zu;Zulu"
- It is possible to add support for Hebrew
and Arabic to various components of MSIE 4 and to Windows'9x as a whole. However, I know of no method at all to achieve writing from
right to left and bidirectionality. I suspect
that it is impossible; otherwise there would have been no need to develop
the Hebrew- and Arabic-"enabled" versions (Windows with an English interface and support for bidirectionality and
Semitic languages).
See the detailed instructions for
adding Hebrew and Arabic, in broken English.
- Support of the perverted encodings:
an attempt has been made to teach Outlook Express (4 and 5beta1) for Windows'95/'98 and for Windows NT (NT is less thoroughly tested) to read messages that were originally in koi8-r and were recoded once more (by the sender's server) from CP1251 to koi8-r. In the archive
cyrperv.zip the files for Windows'95/'98 are in the subdirectory W95, and those for Windows NT in the subdirectory NT.
To install the perverted encoding, place the file CP_24866.NLS in Windows\System (for NT, C_24866.NLS in Windows\System32) and execute the file PERVRS95.REG (for NT, PERVRSNT.REG).
After a reboot, "CyrPerversion (Win to KOI)" will appear among the OE encodings. If the digits 1 in the line with Level are replaced with digits 3, the same encoding will also appear in the browser, but what for?...
I cannot say that I have completely understood how it all works, whether MS has plans for the number 24866, and what the "correct" method of choosing a code page number for one's own purposes is. It would also be desirable to prohibit sending messages in the perverted encoding (and in some standardized encodings), but I do not know how...
- Support of the
"lower pseudo-graphics" (control characters with codes 0...31), popular in FIDO conferences.
The file
ctrlnls.zip contains a modified NLS file for koi8-r, which must replace the original one (save the old one first). In Win'9x, copy the file CPA20866.NLS under the
name CP_20866.NLS into the directory Windows\System (in
"MS-DOS mode"); in Windows NT, copy the file CA20866
under the name C_20866 into the directory Windows\System32.
(The temporary name is deliberately made different to avoid accidental damage.)
- FidoLook Express - a modification of Outlook Express for more convenient work with Fidonet echo conferences (adding an X-Comment-To header and so on),
which includes the support of the win-to-koi perversion described above.
- Ukrainization of the programs "Internet Explorer 4" and
"Outlook Express". Instructions on the homepage of
Valery Zhurachovsky on the correct setup of the koi8-u encoding;
the page is in Ukrainian.
Multilingual support in MSIE5 has been extended even further,
and it now includes, as standard installation options,
besides the pan-European language pack, also Hebrew, Arabic, Thai,
Vietnamese, Japanese, Korean and Chinese.
The label for the Ukrainian encoding has become the standard KOI8-U. Keyboard support for Hebrew, Arabic and Thai is not installed, but, at least
for Hebrew, if the language is registered in the system and one's own keyboard layout is installed, Outlook Express will support
correct bidirectional input. Hebrew fragments in Unicode pages
(UTF-8 etc.) are now represented "logically",
i.e. with a change of direction, as in the products from Accent,
instead of "visually", as in the previous version.
As promised, the management of encodings through the Registry (the MIME section) has been very heavily reduced in this version and moved into the tables of mlang.dll (and not into its resources at all,
which mainly hold the names of encodings in different languages,
such as "Cyrillic (Windows)").
- Support of the perverted encodings in Outlook Express 5beta2 is possible, but not by as natural a method
as in Outlook Express 4 and 5beta1
(and it is described, too, with the reservations that apply to a beta). The easiest way is to sacrifice one of the European encodings, for example the Turkish one: Turkish (Windows). For this purpose, as before, place the file CP_24866.NLS in Windows\System (for NT, C_24866.NLS in Windows\System32) from the archive
cyrperv.zip, and then change the Registry
values, for Windows'9x:
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\control\Nls\Codepage]
"1254"="cp_24866.nls"
And for Windows NT:
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\control\Nls\Codepage]
"1254"="c_24866.nls"
After that, when Turkish (Windows) is chosen, the win-to-koi perversion
becomes readable.
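What the "win-to-koi perversion" does to a message, and why a dedicated code page can undo it, can be reproduced with the codecs of the Python standard library. The sketch below is an illustration only and has no connection with the contents of cyrperv.zip; the function names are invented for the example.

```python
def pervert(text: str) -> bytes:
    """Simulate a server that recodes koi8-r text from CP1251 to koi8-r again."""
    koi = text.encode("koi8_r")        # the correct koi8-r bytes
    misread = koi.decode("cp1251")     # wrongly treated as CP1251
    return misread.encode("koi8_r")    # and recoded to koi8-r once more


def unpervert(data: bytes) -> str:
    """Undo the double recoding, which is what the extra code page achieves."""
    misread = data.decode("koi8_r")
    koi = misread.encode("cp1251")
    return koi.decode("koi8_r")


message = "Привет, мир"
damaged = pervert(message)
print(damaged.decode("koi8_r"))   # what an ordinary koi8-r reader shows
print(unpervert(damaged))         # the restored text
```

The round trip is safe for the basic Russian alphabet, whose bytes are Cyrillic letters in both code pages; other characters may fail to convert, so the sketch is not a general repair tool.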
In its fourth version Netscape became Unicode-aware and
multilingual, with support for UTF-7 and UTF-8; however, this support is
much more broken than in MSIE4.
Unicode fonts are required (or can be used), but for single-byte
encodings Netscape does not interpret the encoding itself
but converts it to the main Windows encoding for the language.
In particular, all Russian encodings are recoded to CP1251,
with the loss of the pseudo-graphics from koi8-r.
Only "ampersand" entities known in advance are understood,
except in the Unicode encodings UTF-7 and UTF-8, where the
characters for any "ampersand" Unicode value are taken
from the specific font by their Unicode values.
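The loss caused by recoding koi8-r to CP1251 can be listed explicitly. The Python sketch below uses the standard library codecs as a stand-in for Netscape's own tables, which is an assumption; it collects the characters from the upper half of koi8-r that have no place in CP1251.

```python
def lost_in_cp1251() -> list:
    """Characters of koi8-r (codes 0x80..0xFF) that CP1251 cannot hold."""
    lost = []
    for code in range(0x80, 0x100):
        ch = bytes([code]).decode("koi8_r")
        try:
            ch.encode("cp1251")
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


lost = lost_in_cp1251()
print(len(lost), "characters are lost, for example:", "".join(lost[:10]))
```

The letters of the Russian alphabet survive the conversion; the characters that drop out are mainly the box-drawing pseudo-graphics.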
- I know of no possibilities for serious setup of language support in
Netscape.
- Cyrillic for MS Windows Netscape.
Paul Gorodyansky's homepage about the Russian adaptation of Netscape,
or how to teach this program to handle Russian letters correctly.
Each version has its own highlights :-)
For the first time in the long "history of the newest browsers" one can speak of substantial innovations, some of which are absent in MSIE: in the recently published Netscape 6 Preview Release 1, then 2 and
3, and, at last, in
the Netscape 6 release.
(We note in particular that the interface tends
towards full customizability; many interface elements are made with
XML and JavaScript.)
Language support is now based completely on Unicode and includes Hebrew, Arabic, and also Thai, Vietnamese (these are what Microsoft calls "complex scripts", but the "complexity" is not supported) and Armenian (the
ARMSCII-8 encoding).
There are 9 (nine) Cyrillic encodings
(KOI8-R, Windows-1251, ISO-8859-5, ISO-IR-111, IBM-866, IBM-855, MacCyrillic, KOI8-U, MacUkrainian).
Autorecognition of encodings is performed separately within several groups of languages (in MSIE,
across the whole spectrum of languages). In addition, the menu of encodings can be adjusted so that it keeps
only the necessary encodings, for example only the Cyrillic ones.
The definitions of the encodings are contained in the libraries ucv*.dll in the subdirectory (Netscape 6\components\),
and there is probably no "legal" method of changing them. At the same time, the set of viewable encodings can be
set in the file
(Netscape 6\res\charsetData.properties): for an encoding that should be removed from the menu View - Character Coding,
a line such as the following should be placed in the file:
windows-1255.notForBrowser = true.
(By default all, or nearly all, Hebrew and Arabic
encodings are not shown.)
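Lines of this kind can be produced for a whole list of encodings. The Python sketch below only prints the text; the list of encodings to hide is a hypothetical example, and it assumes that the period after "true" in the line quoted above is sentence punctuation rather than part of the value.

```python
def hide_lines(charsets):
    """Build 'name.notForBrowser = true' lines for charsetData.properties."""
    return [f"{name}.notForBrowser = true" for name in charsets]


# Hypothetical choice of encodings to remove from View - Character Coding.
hidden = ["windows-1255", "windows-1256"]
print("\n".join(hide_lines(hidden)))
```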
The search for characters that are absent from the current font seems to be made not by Unicode ranges and languages in the TrueType font, as in MSIE 4-5, but at the level
of single Unicode values. That is, if a correct system
of fonts is installed
(under Windows, a set of TrueType fonts compatible with
Unicode both in form and in meaning), then a character
used on a page will be shown, and shown correctly,
if it is contained in even one installed font.
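A search at the level of single Unicode values can be modelled in a few lines of Python. This is a sketch of the idea only, not of Netscape's implementation; the font names and their coverage sets are invented for the example.

```python
def pick_fonts(text, fonts, current):
    """Return (character, font name or None) for every character."""
    result = []
    for ch in text:
        code = ord(ch)
        if code in fonts[current]:
            result.append((ch, current))       # the current font has it
            continue
        for name, coverage in fonts.items():   # otherwise try every font
            if code in coverage:
                result.append((ch, name))
                break
        else:
            result.append((ch, None))          # nowhere: a "small square"
    return result


# Hypothetical installed fonts and the Unicode values they cover.
FONTS = {
    "LatinOnly": set(range(0x20, 0x7F)),
    "CyrillicFont": set(range(0x20, 0x7F)) | set(range(0x0410, 0x0450)),
    "HebrewFont": set(range(0x05D0, 0x05EB)),
}

for ch, font in pick_fonts("a\u042f\u05d0", FONTS, "LatinOnly"):
    print(f"U+{ord(ch):04X}", font)
```

Because the test is made per character and not per character set, a font with only partial coverage of a script can still supply the characters it does have.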
See samples of multilingual pages.
The latest versions of Adobe products have become Unicode-aware,
apparently with support for Chinese/Japanese/Korean ideographs;
however, of the European code pages only the Western CP1252 is supported. When the keyboard is switched, for example, to Russian, the layout changes but the language does not,
and Latin characters with diacritics are displayed on the screen instead of Russian letters.
For one language, for example Russian, the problem can be worked around
by replacing the NLS file for page
1252
(CP_1252.NLS
in Windows'9x, C_1252.NLS in NT) with the NLS file for page
1251.
This can be
done either by direct copying (CP_1251.NLS over CP_1252.NLS,
in DOS mode) or by editing the Registry and then rebooting
the computer:
REGEDIT4
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\control\Nls\Codepage]
"1252" = "cp_1251.nls"
(For NT, in a similar way:
"1252" = "c_1251.nls").
As was to be expected, this blocks all Latin
diacritics (French, Spanish, German and other languages) in other
Unicode-aware programs working through NLS as well: in the Western encoding
in MSIE4 and during input in Word'97 and WordPad'98. At the same time, data stored directly in Unicode (UTF-7/UTF-8 in MSIE4, a previously created Word'97 file) is shown correctly.
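The side effect of this substitution can be imitated in Python: bytes that were written in the Western code page are interpreted with the table of page 1251. The sketch uses the standard library codecs in place of the NLS files, which is an assumption; the function name is invented for the example.

```python
def seen_after_swap(text: str) -> str:
    """Western (CP1252) text as shown when page 1252 is served by the 1251 table."""
    return text.encode("cp1252").decode("cp1251", errors="replace")


print(seen_after_swap("café"))                     # the accented letter turns Cyrillic
print("café".encode("utf-8").decode("utf-8"))      # Unicode data is not affected
```

Plain ASCII text passes through unchanged, which is why English remains readable after the swap.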
The (commercial) mail and news reader
Forte Agent
is not a Unicode-based program; however,
in recent versions,
1.6 and later,
encoding handling is performed via
Unicode. The program now also supports UTF-7/UTF-8. For Russian-oriented settings
for all versions of the program, and also a (Russian) article about Forte Agent 1.5 (applicable, with small modifications, to later versions as well), see the page with
lovely software.