Sometimes it is necessary to manipulate PO files in a way that is better performed automatically than by hand. GNU gettext includes a complete set of tools for this purpose.
When merging two packages into a single package, the resulting POT file will be the concatenation of the two packages' POT files. Thus the maintainer must concatenate the two existing package translations into a single translation catalog, for each language. This is best performed using ‘msgcat’. It is then the translators' duty to deal with any possible conflicts that arose during the merge.
When a translator takes over the translation job from another translator, but she uses a different character encoding in her locale, she will convert the catalog to her character encoding. This is best done through the ‘msgconv’ program.
When a maintainer takes a source file with tagged messages from another package, he should also take the existing translations for this source file (and not let the translators do the same job twice). One way to do this is through ‘msggrep’, another is to create a POT file for that source file and use ‘msgmerge’.
When a translator wants to adjust some translation catalog for a special dialect or orthography -- for example, German as written in Switzerland versus German as written in Germany -- she needs to apply some text processing to every message in the catalog. The tool for doing this is ‘msgfilter’.
Another use of msgfilter is to produce approximately the POT file for which a given PO file was made. This can be done through a filter command like ‘msgfilter sed -e d | sed -e '/^# /d'’. Note that the original POT file may have had different comments and different plural message counts; that is why it is better to use the original POT file if available.
When a translator wants to check her translations, for example according to orthography rules or using a non-interactive spell checker, she can do so using the ‘msgexec’ program.
When third party tools create PO or POT files, sometimes duplicates cannot be avoided. But the GNU gettext tools give an error when they encounter duplicate msgids in the same file and in the same domain. To merge duplicates, the ‘msguniq’ program can be used.
‘msgcomm’ is a more general tool for keeping or throwing away duplicates, occurring in different files.
‘msgcmp’ can be used to check whether a translation catalog is completely translated.
‘msgattrib’ can be used to select and extract only the fuzzy or untranslated messages of a translation catalog.
‘msgen’ is useful as a first step for preparing English translation catalogs. It copies each message's msgid to its msgstr.
Finally, for those applications where all these various programs are not sufficient, a library ‘libgettextpo’ is provided that can be used to write other specialized programs that process PO files.
msgcat Program
msgcat [option] [inputfile]...
The msgcat program concatenates and merges the specified PO files. It finds messages which are common to two or more of the specified PO files. By using the --more-than option, greater commonality may be requested before messages are printed. Conversely, the --less-than option may be used to specify less commonality before messages are printed (i.e. ‘--less-than=2’ will only print the unique messages). Translations, comments and extracted comments will be cumulated, except that if --use-first is specified, they will be taken from the first PO file to define them. File positions from all PO files will be cumulated.
If inputfile is ‘-’, standard input is read.
The results are written to standard output if no output file is specified or if it is ‘-’.
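As an illustration of the merge described above, here is a minimal sketch that concatenates two small catalogs. The file names a.po, b.po and merged.po and their contents are made up for this example; only options named in this section are used, and the script simply stops if msgcat is not installed.

```shell
# Hypothetical sample catalogs; names and contents are illustrative only.
command -v msgcat >/dev/null 2>&1 || exit 0   # stop quietly if GNU gettext is absent
cd "$(mktemp -d)" || exit 1

cat > a.po <<'EOF'
msgid "Open"
msgstr "Ouvrir"
EOF

cat > b.po <<'EOF'
msgid "Open"
msgstr "Ouvrir le fichier"

msgid "Close"
msgstr "Fermer"
EOF

# Merge both catalogs. "Open" is translated differently in each file;
# --use-first keeps the translation from the first file that defines it.
msgcat --use-first -o merged.po a.po b.po
cat merged.po
```

Without ‘--use-first’, the two translations of "Open" would be cumulated instead, leaving the conflict for the translator to resolve.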
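‘-P’, ‘--properties-input’: Assume the input files are Java ResourceBundles in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input files are NeXTstep/GNUstep localized resource files in .strings syntax, not in PO file syntax.
‘--color’, ‘--color=when’: Specify whether or when to use colors and other text attributes. See the --color option for details.
‘--style=style_file’: Specify the CSS style rule file to use for --color. See section 9.11.3 The --style option for details.
‘-p’, ‘--properties-output’: Write out a Java ResourceBundle in Java .properties syntax. Note that this file format doesn't support plural forms and silently drops obsolete messages.
‘--stringtable-output’: Write out a NeXTstep/GNUstep localized resource file in .strings syntax. Note that this file format doesn't support plural forms.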
msgconv Program
msgconv [option] [inputfile]
The msgconv program converts a translation catalog to a different character encoding.
If no inputfile is given or if it is ‘-’, standard input is read.
The results are written to standard output if no output file is specified or if it is ‘-’.
The default target encoding (selected with ‘-t’ or ‘--to-code’) is the current locale's encoding.
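The following sketch shows one such conversion, from ISO-8859-1 to UTF-8. The catalog latin1.po and its single message are invented for the example; the header entry is included because msgconv needs the charset declaration to know the source encoding.

```shell
command -v msgconv >/dev/null 2>&1 || exit 0   # stop quietly if GNU gettext is absent
cd "$(mktemp -d)" || exit 1

# Build a hypothetical ISO-8859-1 catalog.
# \351 is the single byte that encodes "é" in ISO-8859-1.
{
  printf '%s\n' 'msgid ""' 'msgstr ""' \
    '"Content-Type: text/plain; charset=ISO-8859-1\n"' \
    '"Content-Transfer-Encoding: 8bit\n"' \
    '' 'msgid "coffee"'
  printf 'msgstr "caf\351"\n'
} > latin1.po

# Convert the catalog to UTF-8 and show the updated charset declaration.
msgconv -t UTF-8 -o utf8.po latin1.po
grep 'charset=' utf8.po
```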
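‘-P’, ‘--properties-input’: Assume the input file is a Java ResourceBundle in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input file is a NeXTstep/GNUstep localized resource file in .strings syntax, not in PO file syntax.
‘--color’, ‘--color=when’: Specify whether or when to use colors and other text attributes. See the --color option for details.
‘--style=style_file’: Specify the CSS style rule file to use for --color. See section 9.11.3 The --style option for details.
‘-p’, ‘--properties-output’: Write out a Java ResourceBundle in Java .properties syntax. Note that this file format doesn't support plural forms and silently drops obsolete messages.
‘--stringtable-output’: Write out a NeXTstep/GNUstep localized resource file in .strings syntax. Note that this file format doesn't support plural forms.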
msggrep Program
msggrep [option] [inputfile]
The msggrep program extracts all messages of a translation catalog that match a given pattern or belong to some given source files.
If no inputfile is given or if it is ‘-’, standard input is read.
The results are written to standard output if no output file is specified or if it is ‘-’.
[-N sourcefile]... [-M domainname]... [-J msgctxt-pattern] [-K msgid-pattern] [-T msgstr-pattern] [-C comment-pattern]
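A message is selected if it comes from one of the specified source files (‘-N’), or if it belongs to one of the specified domains (‘-M’), or if ‘-J’ is given and its context (msgctxt) matches msgctxt-pattern, or if ‘-K’ is given and its key (msgid or msgid_plural) matches msgid-pattern, or if ‘-T’ is given and its translation (msgstr) matches msgstr-pattern, or if ‘-C’ is given and the translator's comment matches comment-pattern.
When more than one selection criterion is specified, the set of selected messages is the union of the selected messages of each criterion.
msgctxt-pattern or msgid-pattern or msgstr-pattern syntax:
[-E | -F] [-e pattern | -f file]...
Patterns are basic regular expressions by default, or extended regular expressions if -E is given, or fixed strings if -F is given.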
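‘-P’, ‘--properties-input’: Assume the input file is a Java ResourceBundle in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input file is a NeXTstep/GNUstep localized resource file in .strings syntax, not in PO file syntax.
‘--color’, ‘--color=when’: Specify whether or when to use colors and other text attributes. See the --color option for details.
‘--style=style_file’: Specify the CSS style rule file to use for --color. See section 9.11.3 The --style option for details.
‘-p’, ‘--properties-output’: Write out a Java ResourceBundle in Java .properties syntax. Note that this file format doesn't support plural forms and silently drops obsolete messages.
‘--stringtable-output’: Write out a NeXTstep/GNUstep localized resource file in .strings syntax. Note that this file format doesn't support plural forms.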
To extract the messages that come from the source files gnulib-lib/error.c and gnulib-lib/getopt.c:
msggrep -N gnulib-lib/error.c -N gnulib-lib/getopt.c input.po
To extract the messages that contain the string “Please specify” in the original string:
msggrep --msgid -F -e 'Please specify' input.po
To extract the messages that have a context specifier of either “Menu>File” or “Menu>Edit” or a submenu of them:
msggrep --msgctxt -E -e '^Menu>(File|Edit)' input.po
To extract the messages whose translation contains one of the strings in the file wordlist.txt:
msggrep --msgstr -F -f wordlist.txt input.po
msgfilter Program
msgfilter [option] filter [filter-option]
The msgfilter program applies a filter to all translations of a translation catalog.
During each filter invocation, the environment variable MSGFILTER_MSGID is bound to the message's msgid, and the environment variable MSGFILTER_LOCATION is bound to the location in the PO file of the message. If the message has a context, the environment variable MSGFILTER_MSGCTXT is bound to the message's msgctxt, otherwise it is unbound.
If no inputfile is given or if it is ‘-’, standard input is read.
The results are written to standard output if no output file is specified or if it is ‘-’.
The filter can be any program that reads a translation from standard input and writes a modified translation to standard output. A frequently used filter is ‘sed’. A few particular built-in filters are also recognized.
Note: If the filter is not a built-in filter, you have to take care of encodings: It is your responsibility to ensure that the filter can cope with input encoded in the translation catalog's encoding. If the filter wants input in a particular encoding, you can in a first step convert the translation catalog to that encoding using the ‘msgconv’ program, before invoking ‘msgfilter’. If the filter wants input in the locale's encoding, but you want to avoid the locale's encoding, then you can first convert the translation catalog to UTF-8 using the ‘msgconv’ program and then make ‘msgfilter’ work in a UTF-8 locale, by using the LC_ALL environment variable.
Note: Most translations in a translation catalog don't end with a newline character. For this reason, it is important that the filter recognizes its last input line even if it ends without a newline, and that it doesn't add an undesired trailing newline at the end. The ‘sed’ program on some platforms is known to ignore the last line of input if it is not terminated with a newline. You can use GNU sed instead; it does not have this limitation.
The filter ‘recode-sr-latin’ is recognized as a built-in filter. The command ‘recode-sr-latin’ converts Serbian text, written in the Cyrillic script, to the Latin script. The command ‘msgfilter recode-sr-latin’ applies this conversion to the translations of a PO file. Thus, it can be used to convert an ‘sr.po’ file to an ‘sr@latin.po’ file.
The use of built-in filters is not sensitive to the current locale's encoding. Moreover, when used with a built-in filter, ‘msgfilter’ can automatically convert the message catalog to the UTF-8 encoding when needed.
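‘-P’, ‘--properties-input’: Assume the input file is a Java ResourceBundle in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input file is a NeXTstep/GNUstep localized resource file in .strings syntax, not in PO file syntax.
‘--color’, ‘--color=when’: Specify whether or when to use colors and other text attributes. See the --color option for details.
‘--style=style_file’: Specify the CSS style rule file to use for --color. See section 9.11.3 The --style option for details.
‘-p’, ‘--properties-output’: Write out a Java ResourceBundle in Java .properties syntax. Note that this file format doesn't support plural forms and silently drops obsolete messages.
‘--stringtable-output’: Write out a NeXTstep/GNUstep localized resource file in .strings syntax. Note that this file format doesn't support plural forms.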
To convert German translations to Swiss orthography (in a UTF-8 locale):
msgconv -t UTF-8 de.po | msgfilter sed -e 's/ß/ss/g'
To convert Serbian translations in Cyrillic script to Latin script:
msgfilter recode-sr-latin < sr.po
msguniq Program
msguniq [option] [inputfile]
The msguniq program unifies duplicate translations in a translation catalog. It finds duplicate translations of the same message ID. Such duplicates are invalid input for other programs like msgfmt, msgmerge or msgcat. By default, duplicates are merged together. When using the ‘--repeated’ option, only duplicates are output, and all other messages are discarded. Comments and extracted comments will be cumulated, except that if ‘--use-first’ is specified, they will be taken from the first translation. File positions will be cumulated. When using the ‘--unique’ option, duplicates are discarded.
If no inputfile is given or if it is ‘-’, standard input is read.
The results are written to standard output if no output file is specified or if it is ‘-’.
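The three modes described above can be compared on a small catalog that defines one message twice. This is a sketch only; the file names dup.po and uniq.po and the messages in them are invented for the example.

```shell
command -v msguniq >/dev/null 2>&1 || exit 0   # stop quietly if GNU gettext is absent
cd "$(mktemp -d)" || exit 1

# Hypothetical catalog in which "Save" is defined twice.
cat > dup.po <<'EOF'
msgid "Save"
msgstr "Enregistrer"

msgid "Quit"
msgstr "Quitter"

msgid "Save"
msgstr "Enregistrer"
EOF

# Default: merge the duplicates, keeping every distinct msgid once.
msguniq -o uniq.po dup.po
cat uniq.po

# Only the messages that occur more than once.
msguniq --repeated dup.po

# Only the messages that occur exactly once.
msguniq --unique dup.po
```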
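‘-P’, ‘--properties-input’: Assume the input file is a Java ResourceBundle in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input file is a NeXTstep/GNUstep localized resource file in .strings syntax, not in PO file syntax.
‘--color’, ‘--color=when’: Specify whether or when to use colors and other text attributes. See the --color option for details.
‘--style=style_file’: Specify the CSS style rule file to use for --color. See section 9.11.3 The --style option for details.
‘-p’, ‘--properties-output’: Write out a Java ResourceBundle in Java .properties syntax. Note that this file format doesn't support plural forms and silently drops obsolete messages.
‘--stringtable-output’: Write out a NeXTstep/GNUstep localized resource file in .strings syntax. Note that this file format doesn't support plural forms.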
msgcomm Program
msgcomm [option] [inputfile]...
The msgcomm program finds messages which are common to two or more of the specified PO files. By using the --more-than option, greater commonality may be requested before messages are printed. Conversely, the --less-than option may be used to specify less commonality before messages are printed (i.e. ‘--less-than=2’ will only print the unique messages). Translations, comments and extracted comments will be preserved, but only from the first PO file to define them. File positions from all PO files will be cumulated.
If inputfile is ‘-’, standard input is read.
The results are written to standard output if no output file is specified or if it is ‘-’.
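As a small sketch of the two selection options, the following compares two catalogs that share one message. The files a.po and b.po and their contents are hypothetical.

```shell
command -v msgcomm >/dev/null 2>&1 || exit 0   # stop quietly if GNU gettext is absent
cd "$(mktemp -d)" || exit 1

# Two hypothetical catalogs that have only "Open" in common.
cat > a.po <<'EOF'
msgid "Open"
msgstr "Ouvrir"

msgid "Close"
msgstr "Fermer"
EOF

cat > b.po <<'EOF'
msgid "Open"
msgstr "Ouvrir"

msgid "Print"
msgstr "Imprimer"
EOF

# Messages defined in more than one file: "Open".
msgcomm --more-than=1 a.po b.po

# Messages defined in fewer than two files: "Close" and "Print".
msgcomm --less-than=2 a.po b.po
```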
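‘-P’, ‘--properties-input’: Assume the input files are Java ResourceBundles in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input files are NeXTstep/GNUstep localized resource files in .strings syntax, not in PO file syntax.
‘--color’, ‘--color=when’: Specify whether or when to use colors and other text attributes. See the --color option for details.
‘--style=style_file’: Specify the CSS style rule file to use for --color. See section 9.11.3 The --style option for details.
‘-p’, ‘--properties-output’: Write out a Java ResourceBundle in Java .properties syntax. Note that this file format doesn't support plural forms and silently drops obsolete messages.
‘--stringtable-output’: Write out a NeXTstep/GNUstep localized resource file in .strings syntax. Note that this file format doesn't support plural forms.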
msgcmp Program
msgcmp [option] def.po ref.pot
The msgcmp program compares two Uniforum style .po files to check that both contain the same set of msgid strings. The def.po file is an existing PO file with the translations. The ref.pot file is the last created PO file, or a PO Template file (generally created by xgettext). This is useful for checking that you have translated each and every message in your program. Where an exact match cannot be found, fuzzy matching is used to produce better diagnostics.
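A completeness check of this kind can be scripted by looking at the exit status of msgcmp. The sketch below assumes a template ref.pot with two messages and a catalog def.po that translates only one of them; both files are invented for the example.

```shell
command -v msgcmp >/dev/null 2>&1 || exit 0   # stop quietly if GNU gettext is absent
cd "$(mktemp -d)" || exit 1

# Hypothetical template with two messages.
cat > ref.pot <<'EOF'
msgid "Open"
msgstr ""

msgid "Close"
msgstr ""
EOF

# Hypothetical catalog that lacks the "Close" message.
cat > def.po <<'EOF'
msgid "Open"
msgstr "Ouvrir"
EOF

# msgcmp writes its diagnostics to standard error and fails
# when a message of ref.pot is missing from def.po.
if msgcmp def.po ref.pot; then
  echo "catalog is complete"
else
  echo "catalog is incomplete"
fi
```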
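‘-P’, ‘--properties-input’: Assume the input files are Java ResourceBundles in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input files are NeXTstep/GNUstep localized resource files in .strings syntax, not in PO file syntax.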
msgattrib Program
msgattrib [option] [inputfile]
The msgattrib program filters the messages of a translation catalog according to their attributes, and manipulates the attributes.
If no inputfile is given or if it is ‘-’, standard input is read.
The results are written to standard output if no output file is specified or if it is ‘-’.
Attributes are modified after the message selection/removal has been performed. If the ‘--only-file’ or ‘--ignore-file’ option is specified, the attribute modification is applied only to those messages that are listed in the only-file and not listed in the ignore-file.
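The following sketch selects messages by attribute and then shows the ordering rule just described: selection first, attribute modification second. The catalog fr.po, the output file reviewed.po and the messages are hypothetical.

```shell
command -v msgattrib >/dev/null 2>&1 || exit 0   # stop quietly if GNU gettext is absent
cd "$(mktemp -d)" || exit 1

# Hypothetical catalog: one translated, one fuzzy, one untranslated message.
cat > fr.po <<'EOF'
msgid "Open"
msgstr "Ouvrir"

#, fuzzy
msgid "Close"
msgstr "Fermer"

msgid "Print"
msgstr ""
EOF

# Work list for the translator: untranslated entries, then fuzzy ones.
msgattrib --untranslated fr.po
msgattrib --only-fuzzy fr.po

# Keep only the fuzzy entries, then clear their fuzzy flag.
msgattrib --only-fuzzy --clear-fuzzy -o reviewed.po fr.po
cat reviewed.po
```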
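‘-P’, ‘--properties-input’: Assume the input file is a Java ResourceBundle in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input file is a NeXTstep/GNUstep localized resource file in .strings syntax, not in PO file syntax.
‘--color’, ‘--color=when’: Specify whether or when to use colors and other text attributes. See the --color option for details.
‘--style=style_file’: Specify the CSS style rule file to use for --color. See section 9.11.3 The --style option for details.
‘-p’, ‘--properties-output’: Write out a Java ResourceBundle in Java .properties syntax. Note that this file format doesn't support plural forms and silently drops obsolete messages.
‘--stringtable-output’: Write out a NeXTstep/GNUstep localized resource file in .strings syntax. Note that this file format doesn't support plural forms.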
msgen Program
msgen [option] inputfile
The msgen program creates an English translation catalog. The input file is the last created English PO file, or a PO Template file (generally created by xgettext). Untranslated entries are assigned a translation that is identical to the msgid.
Note: ‘msginit --no-translator --locale=en’ performs a very similar task. The main difference is that msginit cares specially about the header entry, whereas msgen doesn't.
If inputfile is ‘-’, standard input is read.
The results are written to standard output if no output file is specified or if it is ‘-’.
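A minimal sketch of this step, using a made-up template app.pot with a single message:

```shell
command -v msgen >/dev/null 2>&1 || exit 0   # stop quietly if GNU gettext is absent
cd "$(mktemp -d)" || exit 1

# Hypothetical template with one untranslated message.
cat > app.pot <<'EOF'
msgid "Open"
msgstr ""
EOF

# Create the English catalog: each empty msgstr is filled with its msgid.
msgen -o en.po app.pot
cat en.po
```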
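‘-P’, ‘--properties-input’: Assume the input file is a Java ResourceBundle in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input file is a NeXTstep/GNUstep localized resource file in .strings syntax, not in PO file syntax.
‘--color’, ‘--color=when’: Specify whether or when to use colors and other text attributes. See the --color option for details.
‘--style=style_file’: Specify the CSS style rule file to use for --color. See section 9.11.3 The --style option for details.
‘-p’, ‘--properties-output’: Write out a Java ResourceBundle in Java .properties syntax. Note that this file format doesn't support plural forms and silently drops obsolete messages.
‘--stringtable-output’: Write out a NeXTstep/GNUstep localized resource file in .strings syntax. Note that this file format doesn't support plural forms.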
msgexec Program
msgexec [option] command [command-option]
The msgexec program applies a command to all translations of a translation catalog. The command can be any program that reads a translation from standard input. It is invoked once for each translation. Its output becomes msgexec's output. msgexec's return code is the maximum return code across all invocations.
A special builtin command called ‘0’ outputs the translation, followed by a null byte. The output of ‘msgexec 0’ is suitable as input for ‘xargs -0’.
During each command invocation, the environment variable MSGEXEC_MSGID is bound to the message's msgid, and the environment variable MSGEXEC_LOCATION is bound to the location in the PO file of the message. If the message has a context, the environment variable MSGEXEC_MSGCTXT is bound to the message's msgctxt, otherwise it is unbound.
Note: It is your responsibility to ensure that the command can cope with input encoded in the translation catalog's encoding. If the command wants input in a particular encoding, you can in a first step convert the translation catalog to that encoding using the ‘msgconv’ program, before invoking ‘msgexec’. If the command wants input in the locale's encoding, but you want to avoid the locale's encoding, then you can first convert the translation catalog to UTF-8 using the ‘msgconv’ program and then make ‘msgexec’ work in a UTF-8 locale, by using the LC_ALL environment variable.
If no inputfile is given or if it is ‘-’, standard input is read.
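To make the per-message invocation concrete, here is a sketch that prints each msgid next to its translation. The catalog fr.po is invented for the example, and the input file is passed with the ‘-i’ option; the small sh command reads the translation from standard input and the msgid from MSGEXEC_MSGID.

```shell
command -v msgexec >/dev/null 2>&1 || exit 0   # stop quietly if GNU gettext is absent
cd "$(mktemp -d)" || exit 1

# Hypothetical catalog with two translated messages.
cat > fr.po <<'EOF'
msgid "Open"
msgstr "Ouvrir"

msgid "Close"
msgstr "Fermer"
EOF

# The command runs once per message. printf shows the msgid from the
# environment, cat copies the translation from standard input, and echo
# supplies the newline that the translation itself does not end with.
msgexec -i fr.po sh -c 'printf "%s => " "$MSGEXEC_MSGID"; cat; echo'
```

A checker such as a non-interactive spell checker could take the place of the sh command, provided it reads its text from standard input.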
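‘-P’, ‘--properties-input’: Assume the input file is a Java ResourceBundle in Java .properties syntax, not in PO file syntax.
‘--stringtable-input’: Assume the input file is a NeXTstep/GNUstep localized resource file in .strings syntax, not in PO file syntax.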
Translators are usually only interested in seeing the untranslated and fuzzy messages of a PO file. Also, when a message is set fuzzy because the msgid changed, they want to see the differences between the previous msgid and the current one (especially if the msgid is long and only few words in it have changed). Finally, it's always welcome to highlight the different sections of a message in a PO file (comments, msgid, msgstr, etc.).
Such highlighting is possible through the msgcat options ‘--color’ and ‘--style’.
The --color option
The ‘--color=when’ option specifies under which conditions colorized output should be generated. The when part can be one of the following:
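always, yes: The output will be colorized.
never, no: The output will not be colorized.
auto, tty: The output will be colorized if the output device is a tty, i.e. when the output goes directly to a text screen or terminal emulator window.
html: The output will be colorized and be in HTML format.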
‘--color’ is equivalent to ‘--color=yes’. The default is ‘--color=auto’.
Thus, a command like ‘msgcat vi.po’ will produce colorized output when called by itself in a command window. Whereas in a pipe, such as ‘msgcat vi.po | less -R’, it will not produce colorized output. To get colorized output in this situation nevertheless, use the command ‘msgcat --color vi.po | less -R’.
The ‘--color=html’ option will produce output that can be viewed in a browser. This can be useful, for example, for Indic languages, because the rendering of Indic scripts in browsers is usually better than in terminal emulators.
Note that the output produced with the --color option is not a valid PO file in itself. It contains additional terminal-specific escape sequences or HTML tags. A PO file reader will give a syntax error when confronted with such content. Except for the ‘--color=html’ case, you therefore normally don't need to save output produced with the --color option in a file.
The environment variable TERM
The environment variable TERM contains an identifier for the text window's capabilities. You can get a detailed list of these capabilities by using the ‘infocmp’ command, using ‘man 5 terminfo’ as a reference.
When producing text with embedded color directives, msgcat looks at the TERM variable. Text windows today typically support at least 8 colors. Often, however, the text window supports 16 or more colors, even though the TERM variable is set to an identifier denoting only 8 supported colors. It can be worth setting the TERM variable to a different value in these cases:
xterm: xterm is in most cases built with support for 16 colors. It can also be built with support for 88 or 256 colors (but not both). You can try to set TERM to either xterm-16color, xterm-88color, or xterm-256color.
rxvt: rxvt is often built with support for 16 colors. You can try to set TERM to rxvt-16color.
konsole: konsole too is often built with support for 16 colors. You can try to set TERM to konsole-16color or xterm-16color.
After setting TERM, you can verify it by invoking ‘msgcat --color=test’ and seeing whether the output looks like a reasonable color map.
The --style option
The ‘--style=style_file’ option specifies the style file to use when colorizing. It has an effect only when the --color option is effective.
If the --style option is not specified, the environment variable PO_STYLE is considered. It is meant to point to the user's preferred style for PO files.
The default style file is ‘$prefix/share/gettext/styles/po-default.css’, where $prefix is the installation location.
A few style files are predefined:
You can use these styles without specifying a directory. They are actually located in ‘$prefix/share/gettext/styles/’, where $prefix is the installation location.
You can also design your own styles. This is described in the next section.
The same style file can be used for styling of a PO file, for terminal output and for HTML output. It is written in CSS (Cascading Style Sheet) syntax. See http://www.w3.org/TR/css2/cover.html for a formal definition of CSS. Many HTML authoring tutorials also contain explanations of CSS.
In the case of HTML output, the style file is embedded in the HTML output.
In the case of text output, the style file is interpreted by the msgcat program. This means, in particular, that when @import is used with relative file names, the file names are relative to the resulting HTML file in the case of HTML output, and relative to the style sheet containing the @import in the case of text output. (Actually, @imports are not yet supported in this case, due to a limitation in libcroco.)
CSS rules are built up from selectors and declarations. The declarations specify graphical properties; the selectors specify when they apply.
In PO files, the following simple selectors (based on "CSS classes", see the CSS2 spec, section 5.8.3) are supported.
.header
.translated
.untranslated
.fuzzy
.obsolete
white-space
#  translator-comments
#. extracted-comments
#: reference...
#, flag...
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string
.comment
.translator-comment
.extracted-comment
.reference-comment
.reference
.flag-comment
.flag
.fuzzy-flag
.previous-comment
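.previous
This matches the previous untranslated string including the string delimiters, the associated keywords (msgid etc.) and the spaces between them.
.msgid
This matches the untranslated string including the string delimiters, the associated keywords (msgid etc.) and the spaces between them.
.msgstr
This matches the translated string including the string delimiters, the associated keywords (msgstr etc.) and the spaces between them.
.keyword
This matches the keywords (msgid, msgstr, etc.).
.string
.text
.escape-sequence
.format-directive
This matches a format string directive (starting with a ‘%’ sign in the case of most programming languages, with a ‘{’ in the case of java-format and csharp-format, with a ‘~’ in the case of lisp-format and scheme-format, or with ‘$’ in the case of sh-format).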
.invalid-format-directive
.added
.changed
.removed
These selectors can be combined to hierarchical selectors. For example,
.msgstr .invalid-format-directive { color: red; }
will highlight the invalid format directives in the translated strings.
In text mode, pseudo-classes (CSS2 spec, section 5.11) and pseudo-elements (CSS2 spec, section 5.12) are not supported.
The declarations in HTML mode are not limited; any graphical attribute supported by the browsers can be used.
The declarations in text mode are limited to the following properties. Other properties will be silently ignored.
color (CSS2 spec, section 14.1)
background-color (CSS2 spec, section 14.2.1)
font-weight (CSS2 spec, section 15.2.3): Most terminals can only render two different weights, normal and bold. Values >= 600 are rendered as bold.
font-style (CSS2 spec, section 15.2.3): The values italic and oblique are rendered the same way.
text-decoration (CSS2 spec, section 16.3.1): This property is limited to the values none and underline.
less for viewing PO files
The ‘less’ program is a popular text file browser for use in a text screen or terminal emulator. It also supports text with embedded escape sequences for colors and text decorations.
You can use less to view a PO file like this (assuming a UTF-8 environment):
msgcat --to-code=UTF-8 --color xyz.po | less -R
You can simplify this to the following command:
less xyz.po
after these three preparations:
1. Add the options ‘-R’ and ‘-f’ to the LESS environment variable. In sh shells:
   $ LESS="$LESS -R -f"
   $ export LESS
2. Set up the LESSOPEN and LESSCLOSE environment variables, as indicated in the manual page (‘man less’).
3. Add to the script named by LESSOPEN a piece of script that recognizes PO files through their file extension and invokes msgcat on them, producing a temporary file. Like this:
   case "$1" in
     *.po)
       tmpfile=`mktemp "${TMPDIR-/tmp}/less.XXXXXX"`
       msgcat --to-code=UTF-8 --color "$1" > "$tmpfile"
       echo "$tmpfile"
       exit 0
       ;;
   esac
For the tasks for which a combination of ‘msgattrib’, ‘msgcat’ etc. is not sufficient, a set of C functions is provided in a library, to make it possible to process PO files in your own programs. When you use this library, you don't need to write routines to parse the PO file; instead, you retrieve a pointer in memory to each of the messages contained in the PO file. Functions for writing PO files are not provided at this time.
The functions are declared in the header file ‘<gettext-po.h>’, and are defined in a library called ‘libgettextpo’.
The po_file_read function reads a PO file into memory. The file name is given as argument. The return value is a handle to the PO file's contents, valid until po_file_free is called on it. In case of error, the return value is NULL, and errno is set.
The po_file_free function frees a PO file's contents from memory, including all messages that are only implicitly accessible through iterators.
The po_file_domains function returns the domains for which the given PO file has messages. The return value is a NULL terminated array which is valid as long as the file handle is valid. For PO files which contain no ‘domain’ directive, the return value contains only one domain, namely the default domain "messages".
The po_message_iterator function returns an iterator that will produce the messages of file that belong to the given domain. If domain is NULL, the default domain is used instead. To list the messages, use the function po_next_message repeatedly.
The po_message_iterator_free function frees an iterator previously allocated through the po_message_iterator function.
The po_next_message function returns the next message from iterator and advances the iterator. It returns NULL when the iterator has reached the end of its message list.
The following functions return details of a po_message_t. Recall that the results are valid as long as the file handle is valid.
The po_message_msgid function returns the msgid (untranslated English string) of a message. This is guaranteed to be non-NULL.
The po_message_msgid_plural function returns the msgid_plural (untranslated English plural string) of a message with plurals, or NULL for a message without plural.
The po_message_msgstr function returns the msgstr (translation) of a message. For an untranslated message, the return value is an empty string.
The po_message_msgstr_plural function returns the msgstr[index] of a message with plurals, or NULL when the index is out of range or for a message without plural.
Here is example code showing how these functions can be used.
const char *filename = ...;
po_file_t file = po_file_read (filename);

if (file == NULL)
  error (EXIT_FAILURE, errno, "couldn't open the PO file %s", filename);
{
  const char * const *domains = po_file_domains (file);
  const char * const *domainp;

  for (domainp = domains; *domainp; domainp++)
    {
      const char *domain = *domainp;
      po_message_iterator_t iterator = po_message_iterator (file, domain);

      for (;;)
        {
          po_message_t *message = po_next_message (iterator);

          if (message == NULL)
            break;
          {
            const char *msgid = po_message_msgid (message);
            const char *msgstr = po_message_msgstr (message);

            ...
          }
        }
      po_message_iterator_free (iterator);
    }
}
po_file_free (file);