There are many text-processing programs, such as grep, groff, head, sort, wc, uniq, nl, expand, and so on. There are also many scripting languages that are often used for text processing, such as sed, awk, perl, python, ruby, and so on. All of this software needs to be internationalized.
From a user's point of view, software can use any internal encoding as long as I/O is done correctly, because the user cannot tell which internal code the software uses.
There are two candidates for the internal encoding: wide characters and UCS-4. You can also use a Mule-type encoding, in which each unit is a pair of numbers, one identifying the CCS and the other identifying a character within it.
I recommend using wide characters, for the reasons I already explained in Chapter 6, LOCALE technology: wide characters can be encoding-independent, can support the various encodings of the world including UTF-8, can give users a common, unified way to choose encodings, and so on.
Here are a few examples of handling wchar_t.
The following program is a small example of stream I/O of wide characters.
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>

    int main(void)
    {
        wint_t c;

        setlocale(LC_ALL, "");
        while (1) {
            c = getwchar();
            if (c == WEOF)
                break;
            putwchar(c);
        }
        return 0;
    }

I think you can easily imagine the corresponding version using char. Since this program does not manipulate characters at all, an ordinary char version would work just as well.
There are a few points to note. First, never forget to call setlocale(). Second, putwchar(), getwchar(), and WEOF replace putchar(), getchar(), and EOF, respectively. Third, use wint_t instead of int to hold the return value of getwchar().
Here is an example of character classification using wchar_t. First, this is the non-internationalized version.
    /*
     * wc.c
     *
     * Word Counter
     *
     */

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>      /* isdigit() */

    int main(int argc, char **argv)
    {
        int n, d=0, c=0, w=0, l=0;

        while ((n=getchar()) != EOF) {
            c++;
            if (isdigit(n)) d++;
            if (strchr(" \t\n", n)) w++;
            if (n == '\n') l++;
        }

        printf("%d characters, %d digits, %d words, and %d lines\n",
               c, d, w, l);
        return 0;
    }
Here is the internationalized version.
    /*
     * wc-i.c
     *
     * Word Counter (internationalized version)
     *
     */

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>
    #include <wchar.h>      /* getwchar(), wcschr(), WEOF */
    #include <wctype.h>     /* iswdigit() */

    int main(int argc, char **argv)
    {
        int d=0, c=0, w=0, l=0;
        wint_t n;

        setlocale(LC_ALL, "");

        /* getwchar() signals end of input with WEOF, not EOF */
        while ((n=getwchar()) != WEOF) {
            c++;
            if (iswdigit(n)) d++;
            if (wcschr(L" \t\n", (wchar_t)n)) w++;
            if (n == L'\n') l++;
        }

        printf("%d characters, %d digits, %d words, and %d lines\n",
               c, d, w, l);
        return 0;
    }

This example shows that iswdigit() is used instead of isdigit() and wcschr() instead of strchr(). In addition, L"string" and L'char' are used for wide character string and wide character literals.
The following is a sample program that obtains the length of the input string. Note that it does not distinguish between the number of bytes, the number of characters, and the display width.
    /* length.c
     *
     * a sample program to obtain the length of the input string
     * NOT INTERNATIONALIZED
     */

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int len;

        if (argc < 2) {
            printf("Usage: %s [string]\n", argv[0]);
            return 0;
        }

        printf("Your string is: \"%s\".\n", argv[1]);

        len = strlen(argv[1]);
        printf("Length of your string is: %d bytes.\n", len);
        printf("Length of your string is: %d characters.\n", len);
        printf("Width of your string is: %d columns.\n", len);
        return 0;
    }

The following is an internationalized version of the program using wide characters.
    /* length-i.c
     *
     * a sample program to obtain the length of the input string
     * INTERNATIONALIZED
     */

    #define _XOPEN_SOURCE 700   /* for wcswidth() */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <locale.h>
    #include <wchar.h>

    int main(int argc, char **argv)
    {
        int len;
        size_t n, r;
        wchar_t *wp;

        /* All software using locales should include this line */
        setlocale(LC_ALL, "");

        if (argc < 2) {
            printf("Usage: %s [string]\n", argv[0]);
            return 0;
        }

        printf("Your string is: \"%s\".\n", argv[1]);

        /* The concept of 'byte' is universal. */
        len = strlen(argv[1]);
        printf("Length of your string is: %d bytes.\n", len);

        /* The easiest way to obtain the number of characters is */
        /* to convert the string into a wide string.  The number */
        /* of characters is equal to the number of wide          */
        /* characters.  It does not exceed the number of bytes.  */
        /* mbstowcs() takes its limit in wide characters, not    */
        /* bytes; +1 leaves room for the terminating null.       */
        n = strlen(argv[1]) + 1;
        wp = (wchar_t *)malloc(n * sizeof(wchar_t));
        if (wp == NULL) return 1;
        r = mbstowcs(wp, argv[1], n);
        if (r == (size_t)-1) {
            printf("Invalid multibyte sequence for this locale.\n");
            return 1;
        }
        len = (int)r;
        printf("Length of your string is: %d characters.\n", len);
        printf("Width of your string is: %d columns.\n", wcswidth(wp, r));
        return 0;
    }

This program counts multibyte characters correctly. Of course, the user has to set the LANG variable properly.
For example, on a UTF-8 xterm:

    $ export LANG=ko_KR.UTF-8
    $ ./length-i (a Hangul character)
    Your string is: "(the character)"
    Length of your string is: 3 bytes.
    Length of your string is: 1 characters.
    Width of your string is: 2 columns.
The following program extracts all characters contained in the given string.
    /* extract.c
     *
     * a sample program to extract each character contained in the string
     * not internationalized
     */

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char *p;
        int c;

        if (argc < 2) {
            printf("Usage: %s [string]\n", argv[0]);
            return 0;
        }

        printf("Your string is: \"%s\".\n", argv[1]);

        c = 0;
        for (p=argv[1] ; *p ; p++) {
            printf("Character #%d is \"%c\".\n", ++c, *p);
        }
        return 0;
    }

Using wide characters, the program can be rewritten as follows.
    /* extract-i.c
     *
     * a sample program to extract each character contained in the string
     * INTERNATIONALIZED
     */

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>
    #include <stdlib.h>
    #include <limits.h>     /* MB_LEN_MAX */

    int main(int argc, char **argv)
    {
        wchar_t *wp;
        /* MB_CUR_MAX depends on the locale, which is not set yet at */
        /* this point; MB_LEN_MAX is the compile-time upper bound.   */
        char p[MB_LEN_MAX+1];
        int c, x;
        size_t n, len;

        /* Don't forget. */
        setlocale(LC_ALL, "");

        if (argc < 2) {
            printf("Usage: %s [string]\n", argv[0]);
            return 0;
        }

        printf("Your string is: \"%s\".\n", argv[1]);

        /* To obtain each character of the string, it is easy to convert */
        /* the string into a wide string and re-convert each wide        */
        /* character into a multibyte character.                         */
        /* mbstowcs() takes its limit in wide characters, not bytes.     */
        n = strlen(argv[1]) + 1;
        wp = (wchar_t *)malloc(n * sizeof(wchar_t));
        if (wp == NULL) return 1;
        len = mbstowcs(wp, argv[1], n);
        if (len == (size_t)-1) {
            printf("Invalid multibyte sequence for this locale.\n");
            return 1;
        }

        for (c=0; (size_t)c<len; c++) {
            /* re-convert from wide character to multibyte character */
            x = wctomb(p, wp[c]);
            /* One multibyte character may be two or more bytes. */
            /* Thus "%s" is used instead of "%c". */
            if (x < 0) x = 0;
            p[x] = 0;
            printf("Character #%d is \"%s\" (%d byte(s)) \n", c+1, p, x);
        }
        return 0;
    }
Note that this program does not work well if the multibyte encoding is stateful.
Introduction to i18n
14 February 2003
kubota@debian.org