7.6. String Processing¶
Caution
This version is already obsolete. Please check the latest guideline.
Index
7.6.1. Overview¶
7.6.2. How to use¶
7.6.2.1. Trim¶
The String#trim method is sufficient for trimming leading and trailing half width blanks. However, for more specific trim operations, such as trimming only the leading side or only the trailing side, or trimming an arbitrary character, org.springframework.util.StringUtils provided by Spring should preferably be used.
String str = " Hello World!!";
StringUtils.trimWhitespace(str); // => "Hello World!!"
StringUtils.trimLeadingCharacter(str, ' '); // => "Hello World!!"
StringUtils.trimTrailingCharacter(str, '!'); // => " Hello World"
Note
The behaviour does not change even if a string containing a surrogate pair is specified in the first argument of StringUtils#trimLeadingCharacter and StringUtils#trimTrailingCharacter. Also note that a surrogate pair cannot be specified as the character to be trimmed, since the second argument is of char type.
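Since the second argument of the Spring methods is a char, trimming a character outside the BMP needs separate handling. The following is a minimal JDK-only sketch of one way to do it. The class CodePointTrim and both method names are hypothetical names introduced for illustration and are not part of Spring.

```java
final class CodePointTrim {

    // Hypothetical helper (not a Spring API): removes every leading occurrence
    // of the given code point, which may be a supplementary character.
    static String trimLeadingCodePoint(String str, int codePoint) {
        int begin = 0;
        while (begin < str.length() && str.codePointAt(begin) == codePoint) {
            begin += Character.charCount(codePoint);
        }
        return str.substring(begin);
    }

    // Hypothetical helper (not a Spring API): removes every trailing occurrence
    // of the given code point.
    static String trimTrailingCodePoint(String str, int codePoint) {
        int end = str.length();
        while (end > 0 && str.codePointBefore(end) == codePoint) {
            end -= Character.charCount(codePoint);
        }
        return str.substring(0, end);
    }

    public static void main(String[] args) {
        int kichi = 0x20BB7; // encoded in a String as the surrogate pair \uD842\uDFB7
        String str = "\uD842\uDFB7\uD842\uDFB7Hello";
        System.out.println(trimLeadingCodePoint(str, kichi)); // => "Hello"
    }
}
```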
7.6.2.2. Padding, Suppress¶
Padding and suppress (removal of the padding) can be carried out by using the methods of the String class.
int num = 1;
String paddingStr = String.format("%03d", num); // => "001"
String suppressStr = paddingStr.replaceFirst("^0+(?!$)", ""); // => "1"
Warning
If a surrogate pair is included while padding to an apparent length, appropriate results cannot be obtained since String#format does not take surrogate pairs into account.
To pad a string that contains a surrogate pair, it is necessary to count the number of characters with the surrogate pair taken into account (described later), calculate the number of characters that should be padded, and join the strings.
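The procedure described in the warning (count code points, calculate the shortfall, join the strings) can be sketched as follows. CodePointPadding and padLeft are hypothetical names used only for this illustration.

```java
final class CodePointPadding {

    // Hypothetical helper: left-pads str with padChar until it contains
    // `width` code points. Strings that are already long enough are returned as is.
    static String padLeft(String str, int width, char padChar) {
        int codePoints = str.codePointCount(0, str.length());
        StringBuilder sb = new StringBuilder();
        for (int i = codePoints; i < width; i++) {
            sb.append(padChar);
        }
        return sb.append(str).toString();
    }

    public static void main(String[] args) {
        String name = "\uD842\uDFB7\u7530"; // 2 code points, 3 char values
        String padded = padLeft(name, 5, ' ');
        System.out.println(padded.codePointCount(0, padded.length())); // => 5
    }
}
```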
7.6.2.3. Processing of a string considered as a surrogate pair¶
7.6.2.3.1. Fetching string length¶
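If the length of a string that contains a surrogate pair is fetched with the String#length method, the number of char values is returned rather than the number of characters. In the following example, 5 is assigned to len although the string consists of 4 characters.
String str = "𠮷田太郎"; // "𠮷" (U+20BB7) is represented by a surrogate pair
int len = str.length(); // => 5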
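From Java SE 5, the String#codePointCount method is defined to fetch the length of a string with surrogate pairs taken into account. The number of characters can be fetched by specifying the start and end index of the target range in the arguments of String#codePointCount.
String str = "𠮷田太郎"; // "𠮷" (U+20BB7) is represented by a surrogate pair
int lenOfChar = str.length(); // => 5
int lenOfCodePoint = str.codePointCount(0, lenOfChar); // => 4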
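Unicode also has combining characters. \u304c indicating [が] and \u304b\u3099 indicating [か] plus [voiced sound mark] look the same, however [か] plus [voiced sound mark] is counted as two characters. To count such a sequence as one character, count the code points after carrying out text normalization by using java.text.Normalizer.
public int getStrLength(String str) {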
String normalizedStr = Normalizer.normalize(str, Normalizer.Form.NFC);
int length = normalizedStr.codePointCount(0, normalizedStr.length());
return length;
}
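As a rough illustration of how the three ways of counting differ, the sketch below applies them to a combining sequence and to a surrogate pair. NormalizedLengthDemo is a hypothetical class name, and its getStrLength mirrors the method shown above.

```java
import java.text.Normalizer;

final class NormalizedLengthDemo {

    // Same logic as getStrLength above: normalize to NFC, then count code points.
    static int getStrLength(String str) {
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFC);
        return normalized.codePointCount(0, normalized.length());
    }

    public static void main(String[] args) {
        String combined = "\u304b\u3099"; // [か] plus [voiced sound mark]
        System.out.println(combined.length());                             // => 2
        System.out.println(combined.codePointCount(0, combined.length())); // => 2
        System.out.println(getStrLength(combined));                        // => 1

        String surrogate = "\uD842\uDFB7"; // U+20BB7
        System.out.println(surrogate.length());     // => 2
        System.out.println(getStrLength(surrogate)); // => 1
    }
}
```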
7.6.2.3.2. Fetch string in the specified range¶
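If String#substring is used as is to fetch a string in the specified range, an unintended result may be obtained when the string contains a surrogate pair.
String str = "𠮷田太郎"; // "𠮷" (U+20BB7) is represented by a surrogate pair
int startIndex = 0;
int endIndex = 2;
String subStr = str.substring(startIndex, endIndex);
System.out.println(subStr); // => "𠮷" (only one character, since "𠮷" occupies two char values)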
To fetch 2 characters from the beginning, the String#substring method must be used after searching the start and end positions with the surrogate pair taken into account, by using String#offsetByCodePoints.
String str = "𠮷田太郎"; // "𠮷" (U+20BB7) is represented by a surrogate pair
int startIndex = 0;
int endIndex = 2;
int startIndexSurrogate = str.offsetByCodePoints(0, startIndex); // => 0
int endIndexSurrogate = str.offsetByCodePoints(0, endIndex); // => 3
String subStrSurrogate = str.substring(startIndexSurrogate, endIndexSurrogate); // => "𠮷田"
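The two offsetByCodePoints calls can be wrapped in a small helper so that callers work only with character positions. This is a sketch under that assumption; CodePointSubstring and substringByCodePoints are hypothetical names.

```java
final class CodePointSubstring {

    // Hypothetical helper: begin and end are positions counted in code points,
    // not in char values.
    static String substringByCodePoints(String str, int begin, int end) {
        int beginIndex = str.offsetByCodePoints(0, begin);
        int endIndex = str.offsetByCodePoints(beginIndex, end - begin);
        return str.substring(beginIndex, endIndex);
    }

    public static void main(String[] args) {
        String str = "\uD842\uDFB7\u7530\u592a\u90ce"; // U+20BB7 followed by three BMP kanji
        System.out.println(substringByCodePoints(str, 0, 2).length()); // => 3 (char values)
    }
}
```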
7.6.2.3.3. String split¶
The String#split method handles surrogate pairs by default.
String str = "𠮷田 太郎"; // "𠮷" (U+20BB7) is represented by a surrogate pair
str.split(" "); // => {"𠮷田", "太郎"}
Note
A surrogate pair can also be specified as the delimiter in the argument of String#split.
Note
Please note that the behaviour when an empty string is passed to String#split differs between Java SE 7 and Java SE 8. Refer to the Pattern#split Javadoc.
String str = "ABC";
String[] elems = str.split("");
// Java SE 7 => {, A, B, C}
// Java SE 8 => {A, B, C}
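Both notes can be tried with the JDK alone. The sketch below splits on a delimiter given as a code point and on an empty string; SurrogateSplitDemo and splitOnCodePoint are hypothetical names, and the empty-string result shown assumes Java SE 8 or later.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SurrogateSplitDemo {

    // Hypothetical helper: splits str on a delimiter given as a code point.
    // Pattern.quote keeps the delimiter from being read as a regex metacharacter.
    static String[] splitOnCodePoint(String str, int delimiter) {
        String literal = new String(Character.toChars(delimiter));
        return str.split(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        // U+20BB7 (a surrogate pair) used as the delimiter
        String[] parts = splitOnCodePoint("\u7530\uD842\uDFB7\u592a\u90ce", 0x20BB7);
        System.out.println(parts.length); // => 2

        // Empty-string delimiter on Java SE 8 or later
        System.out.println(Arrays.toString("ABC".split(""))); // => [A, B, C]
    }
}
```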
7.6.2.4. Full width, half width string conversion¶
Full width and half width character conversion is carried out by using the API of the org.terasoluna.gfw.common.fullhalf.FullHalfConverter class provided by the common library.
The FullHalfConverter class adopts a style wherein the pair definitions of full width and half width characters used for conversion (org.terasoluna.gfw.common.fullhalf.FullHalfPair) are registered in advance.
A FullHalfConverter object in which the default pair definitions are registered is provided as the INSTANCE constant of the org.terasoluna.gfw.common.fullhalf.DefaultFullHalf class in the common library.
For the default pair definitions, refer to the DefaultFullHalf source.
Note
When the default pair definitions provided by the common library do not meet the conversion requirements, a FullHalfConverter object in which unique pair definitions are registered should be created.
For the basic method of creation, refer to Creating FullHalfConverter class for which a unique full width and half width character pair definition is registered.
7.6.2.4.1. How to apply common library¶
<dependencies>
<dependency>
<groupId>org.terasoluna.gfw</groupId>
<artifactId>terasoluna-gfw-string</artifactId>
</dependency>
</dependencies>
7.6.2.4.2. Conversion to full width string¶
The toFullwidth method of FullHalfConverter is used to convert half width characters to full width characters.
String fullwidth = DefaultFullHalf.INSTANCE.toFullwidth("ｱﾞ!A8ｶﾞザ"); // (1)
Sr. No. | Description |
---|---|
(1) | Pass a string which contains half width characters to the argument of the toFullwidth method and convert it to a full width string. In this example, it is converted to "ア゛！Ａ８ガザ". Also note that characters for which a pair is not defined ("ザ" in this example) are returned without any change. |
7.6.2.4.3. Conversion to half width string¶
The toHalfwidth method of FullHalfConverter is used to convert full width characters to half width characters.
String halfwidth = DefaultFullHalf.INSTANCE.toHalfwidth("Ａ！アガｻ"); // (1)
Sr. No. | Description |
---|---|
(1) | Pass a string which contains full width characters to the argument of the toHalfwidth method and convert it to a half width string. In this example, it is converted to "A!ｱｶﾞｻ". Also note that characters for which a pair is not defined ("ｻ" in this example) are returned without any change. |
Note
FullHalfConverter cannot convert combining characters that represent a single character using 2 or more characters (Example: "シ" (\u30b7) + voiced sound mark (\u3099)) to half width characters (Example: "ｼﾞ").
When combining characters are to be converted to half width characters, FullHalfConverter must be used after converting them to integrated characters (Example: "ジ" (\u30b8)) by carrying out text normalization.
java.text.Normalizer is used for text normalization.
Note that NFC or NFKC is used as the normalization format when combining characters are to be converted to integrated characters.
Implementation example wherein NFD (decompose by canonical equivalence) is used as the normalization format
String str1 = Normalizer.normalize("モジ", Normalizer.Form.NFD); // str1 = "モシ + Voiced sound mark(\u3099)"
String str2 = Normalizer.normalize("ﾓｼﾞ", Normalizer.Form.NFD); // str2 = "ﾓｼﾞ"
Implementation example wherein NFC (decompose by canonical equivalence, then compose again) is used as the normalization format
String mojiStr = "モシ\u3099"; // "モシ + Voiced sound mark(\u3099)"
String str1 = Normalizer.normalize(mojiStr, Normalizer.Form.NFC); // str1 = "モジ(\u30b8)"
String str2 = Normalizer.normalize("ﾓｼﾞ", Normalizer.Form.NFC); // str2 = "ﾓｼﾞ"
Implementation example wherein NFKD (decompose by compatibility equivalence) is used as the normalization format
String str1 = Normalizer.normalize("モジ", Normalizer.Form.NFKD); // str1 = "モシ + Voiced sound mark(\u3099)"
String str2 = Normalizer.normalize("ﾓｼﾞ", Normalizer.Form.NFKD); // str2 = "モシ + Voiced sound mark(\u3099)"
Implementation example wherein NFKC (decompose by compatibility equivalence, then compose again) is used as the normalization format
String mojiStr = "モシ\u3099"; // "モシ + Voiced sound mark(\u3099)"
String str1 = Normalizer.normalize(mojiStr, Normalizer.Form.NFKC); // str1 = "モジ(\u30b8)"
String str2 = Normalizer.normalize("ﾓｼﾞ", Normalizer.Form.NFKC); // str2 = "モジ"
For details, refer Normalizer JavaDoc.
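To see what each normalization format actually produces, it can help to print the resulting code points rather than the rendered glyphs. The following sketch uses only the JDK; NormalizerFormsDemo and hex are hypothetical names, and the inputs are written as Unicode escapes so that the half width and full width forms cannot be confused.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

final class NormalizerFormsDemo {

    // Hypothetical helper: renders a string as "U+XXXX U+XXXX ..." for inspection.
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String fullwidth = "\u30E2\u30B8";       // full width "mo" + "ji"
        String halfwidth = "\uFF93\uFF7C\uFF9E"; // half width "mo" + "shi" + voiced sound mark
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + ": " + hex(Normalizer.normalize(fullwidth, form))
                    + " / " + hex(Normalizer.normalize(halfwidth, form)));
        }
    }
}
```

Only NFKD and NFKC change the half width input, because the mapping from half width to full width katakana is a compatibility mapping rather than a canonical one.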
7.6.2.4.4. Creating FullHalfConverter class for which a unique full width and half width character pair definition is registered¶
A FullHalfConverter for which unique full width and half width character pair definitions are registered can also be used instead of DefaultFullHalf.
How to use a FullHalfConverter for which unique pair definitions are registered is shown below.
Implementation example of a class that provides FullHalfConverter for which a unique pair definition is registered
public class CustomFullHalf {
private static final int FULL_HALF_CODE_DIFF = 0xFEE0;
public static final FullHalfConverter INSTANCE;
static {
// (1)
FullHalfPairsBuilder builder = new FullHalfPairsBuilder();
// (2)
builder.pair("ー", "-");
// (3)
for (char c = '!'; c <= '~'; c++) {
String fullwidth = String.valueOf((char) (c + FULL_HALF_CODE_DIFF));
builder.pair(fullwidth, String.valueOf(c));
}
// (4)
builder.pair("。", "｡").pair("「", "｢").pair("」", "｣").pair("、", "､")
.pair("・", "･").pair("ァ", "ｧ").pair("ィ", "ｨ").pair("ゥ", "ｩ")
.pair("ェ", "ｪ").pair("ォ", "ｫ").pair("ャ", "ｬ").pair("ュ", "ｭ")
.pair("ョ", "ｮ").pair("ッ", "ｯ").pair("ア", "ｱ").pair("イ", "ｲ")
.pair("ウ", "ｳ").pair("エ", "ｴ").pair("オ", "ｵ").pair("カ", "ｶ")
.pair("キ", "ｷ").pair("ク", "ｸ").pair("ケ", "ｹ").pair("コ", "ｺ")
.pair("サ", "ｻ").pair("シ", "ｼ").pair("ス", "ｽ").pair("セ", "ｾ")
.pair("ソ", "ｿ").pair("タ", "ﾀ").pair("チ", "ﾁ").pair("ツ", "ﾂ")
.pair("テ", "ﾃ").pair("ト", "ﾄ").pair("ナ", "ﾅ").pair("ニ", "ﾆ")
.pair("ヌ", "ﾇ").pair("ネ", "ﾈ").pair("ノ", "ﾉ").pair("ハ", "ﾊ")
.pair("ヒ", "ﾋ").pair("フ", "ﾌ").pair("ヘ", "ﾍ").pair("ホ", "ﾎ")
.pair("マ", "ﾏ").pair("ミ", "ﾐ").pair("ム", "ﾑ").pair("メ", "ﾒ")
.pair("モ", "ﾓ").pair("ヤ", "ﾔ").pair("ユ", "ﾕ").pair("ヨ", "ﾖ")
.pair("ラ", "ﾗ").pair("リ", "ﾘ").pair("ル", "ﾙ").pair("レ", "ﾚ")
.pair("ロ", "ﾛ").pair("ワ", "ﾜ").pair("ヲ", "ｦ").pair("ン", "ﾝ")
.pair("ガ", "ｶﾞ").pair("ギ", "ｷﾞ").pair("グ", "ｸﾞ")
.pair("ゲ", "ｹﾞ").pair("ゴ", "ｺﾞ").pair("ザ", "ｻﾞ")
.pair("ジ", "ｼﾞ").pair("ズ", "ｽﾞ").pair("ゼ", "ｾﾞ")
.pair("ゾ", "ｿﾞ").pair("ダ", "ﾀﾞ").pair("ヂ", "ﾁﾞ")
.pair("ヅ", "ﾂﾞ").pair("デ", "ﾃﾞ").pair("ド", "ﾄﾞ")
.pair("バ", "ﾊﾞ").pair("ビ", "ﾋﾞ").pair("ブ", "ﾌﾞ")
.pair("ベ", "ﾍﾞ").pair("ボ", "ﾎﾞ").pair("パ", "ﾊﾟ")
.pair("ピ", "ﾋﾟ").pair("プ", "ﾌﾟ").pair("ペ", "ﾍﾟ")
.pair("ポ", "ﾎﾟ").pair("ヴ", "ｳﾞ").pair("\u30f7", "ﾜﾞ")
.pair("\u30fa", "ｦﾞ").pair("゛", "ﾞ").pair("゜", "ﾟ").pair("　", " ");
// (5)
INSTANCE = new FullHalfConverter(builder.build());
}
}
Sr. No. | Description |
---|---|
(1) | Use org.terasoluna.gfw.common.fullhalf.FullHalfPairsBuilder to create org.terasoluna.gfw.common.fullhalf.FullHalfPairs, which represents a set of pair definitions of full width and half width characters. |
(2) | In DefaultFullHalf, the half width character corresponding to the full width character "ー" is set to "ｰ" (\uFF70). In this example, it is changed to "-" (\u002D). Although "-" (\u002D) is also included in the processing target of (3) below, the pair definition defined earlier takes precedence. |
(3) | In this example, pairs are defined for the full width characters from "！" to "～" and the half width characters from "!" to "~" by a loop which uses the characteristic that their Unicode code values are arranged in the same sequence. |
(4) | Since the code value sequence of the characters other than those in (3) does not match between full width and half width characters, define a pair individually for each character. |
(5) | Use the FullHalfPairs created by FullHalfPairsBuilder to create a FullHalfConverter. |
Note
For the values that can be specified in the arguments of the FullHalfPairsBuilder#pair method, refer to the FullHalfPair constructor JavaDoc.
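The loop in (3) relies on the fact that the full width forms of "!" to "~" sit exactly 0xFEE0 above their half width counterparts. The JDK-only sketch below illustrates just that offset, without the common library; AsciiFullwidthSketch and its methods are hypothetical names and are not a substitute for FullHalfConverter.

```java
final class AsciiFullwidthSketch {

    private static final int FULL_HALF_CODE_DIFF = 0xFEE0;

    // Hypothetical helper: shifts '!'..'~' up to U+FF01..U+FF5E, leaves the rest as is.
    static String toFullwidthAscii(String str) {
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            sb.append(c >= '!' && c <= '~' ? (char) (c + FULL_HALF_CODE_DIFF) : c);
        }
        return sb.toString();
    }

    // Hypothetical helper: the reverse shift for U+FF01..U+FF5E.
    static String toHalfwidthAscii(String str) {
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            sb.append(c >= '\uFF01' && c <= '\uFF5E' ? (char) (c - FULL_HALF_CODE_DIFF) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHalfwidthAscii(toFullwidthAscii("A8!"))); // => "A8!"
    }
}
```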
How to use FullHalfConverter for which a unique pair definition is registered
String halfwidth = CustomFullHalf.INSTANCE.toHalfwidth("ハローワールド！"); // (1)
Sr. No. | Description |
---|---|
(1) | Use the toHalfwidth method of the FullHalfConverter object for which a unique pair definition is registered and convert the string containing full width characters to a half width string. In this example, it is converted to "ﾊﾛ-ﾜ-ﾙﾄﾞ!" ("-" is \u002D). |
7.6.2.5. Code point set check (character type check)¶
A code point set function provided by common library should be used for checking character type.
Here, how to implement a character type check by using a code point set function is explained.
- Creating code point set
- Set operation of the code point sets
- String check by using code point set
- String check linked with Bean Validation
- Creating new code point set class
7.6.2.5.1. How to apply common library¶
7.6.2.5.2. Creating code point set¶
org.terasoluna.gfw.common.codepoints.CodePoints is a class that represents a code point set. How to create a CodePoints instance is shown below.
When an instance is created by calling a factory method (cache)
How to create an instance from a code point set class (Class<? extends CodePoints>) and cache the created instance is explained below.
CodePoints codePoints = CodePoints.of(ASCIIPrintableChars.class); // (1)
Sr. No. | Description |
---|---|
(1) | Pass a code point set class to the CodePoints#of method (factory method) and fetch an instance. In this example, an instance of the code point set class for ASCII printable characters (org.terasoluna.gfw.common.codepoints.catalog.ASCIIPrintableChars) is fetched. |
Note
Several code point set classes exist in the same module as the CodePoints class.
Other modules which provide code point sets also exist; these modules must be added to your own project when required.
For details, refer to Code point set class provided by common library.
Further, a new code point set class can also be created. For details, refer to Creating new code point set class.
When an instance is created by calling a constructor of code point set class
CodePoints codePoints = new ASCIIPrintableChars(); // (1)
Sr. No. | Description |
---|---|
(1) | Call the constructor by using the new operator and generate an instance of the code point set class. In this example, an instance of the code point set class for ASCII printable characters (ASCIIPrintableChars) is generated. |
When an instance is created by calling constructor of CodePoints
How to create an instance by calling the constructor of CodePoints is shown below.
When the code point (int) is to be passed by using a variable length argument
CodePoints codePoints = new CodePoints(0x0061 /* a */, 0x0062 /* b */); // (1)
Sr. No. | Description |
---|---|
(1) | Generate an instance by passing int code points to the CodePoints constructor. In this example, an instance of the code point set for the characters "a" and "b" is generated. |
When a Set of code points (int) is to be passed
Set<Integer> set = new HashSet<>();
set.add(0x0061 /* a */);
set.add(0x0062 /* b */);
CodePoints codePoints = new CodePoints(set); // (1)
Sr. No. | Description |
---|---|
(1) | Add int code points to a Set and generate an instance by passing the Set to the constructor of CodePoints. In this example, an instance of the code point set for the characters "a" and "b" is generated. |
When code point set string is to be passed by using variable length argument
CodePoints codePoints = new CodePoints("ab"); // (1)
CodePoints codePoints = new CodePoints("a", "b"); // (2)
Sr. No. | Description |
---|---|
(1) | Generate an instance by passing a code point set string to the constructor of CodePoints. In this example, an instance of the code point set for the characters "a" and "b" is generated. |
(2) | The code point set string can also be divided across several arguments. The result is the same as (1). |
7.6.2.5.3. Set operation of the code point sets¶
When an instance of code point set is created by using union set method
CodePoints abCp = new CodePoints(0x0061 /* a */, 0x0062 /* b */);
CodePoints cdCp = new CodePoints(0x0063 /* c */, 0x0064 /* d */);
CodePoints abcdCp = abCp.union(cdCp); // (1)
Sr. No. | Description |
---|---|
(1) | Calculate the union of two code point sets by using the CodePoints#union method and create an instance of a new code point set. In this example, the union of “the code point set included in the string "ab"” and “the code point set included in the string "cd"” is calculated and an instance of a new code point set (the code point set included in the string "abcd") is created. |
When an instance of code point set is created by using difference set method
CodePoints abcdCp = new CodePoints(0x0061 /* a */, 0x0062 /* b */,
0x0063 /* c */, 0x0064 /* d */);
CodePoints cdCp = new CodePoints(0x0063 /* c */, 0x0064 /* d */);
CodePoints abCp = abcdCp.subtract(cdCp); // (1)
Sr. No. | Description |
---|---|
(1) | Calculate the difference set of two code point sets by using the CodePoints#subtract method and create an instance of a new code point set. In this example, the difference set of “the code point set included in the string "abcd"” and “the code point set included in the string "cd"” is calculated and an instance of a new code point set (the code point set included in the string "ab") is created. |
When an instance of code point set is created by using intersection set method
CodePoints abcdCp = new CodePoints(0x0061 /* a */, 0x0062 /* b */,
0x0063 /* c */, 0x0064 /* d */);
CodePoints cdeCp = new CodePoints(0x0063 /* c */, 0x0064 /* d */, 0x0065 /* e */);
CodePoints cdCp = abcdCp.intersect(cdeCp); // (1)
Sr. No. | Description |
---|---|
(1) | Calculate the intersection set of two code point sets by using the CodePoints#intersect method and create an instance of a new code point set. In this example, the intersection set of “the code point set included in the string "abcd"” and “the code point set included in the string "cde"” is calculated and an instance of a new code point set (the code point set included in the string "cd") is created. |
7.6.2.5.4. String check by using code point set¶
A string can be checked by using the methods provided by CodePoints.
containsAll method
Determine whether the entire string for checking is included in the code point set.
CodePoints jisX208KanaCp = CodePoints.of(JIS_X_0208_Katakana.class);
boolean result;
result = jisX208KanaCp.containsAll("カ"); // true
result = jisX208KanaCp.containsAll("カナ"); // true
result = jisX208KanaCp.containsAll("カナa"); // false
firstExcludedCodePoint method
Return the first code point which is not included in the code point set, from the string targeted for checking.
Return CodePoints#NOT_FOUND when the entire string targeted for checking is included in the code point set.
CodePoints jisX208KanaCp = CodePoints.of(JIS_X_0208_Katakana.class);
int result;
result = jisX208KanaCp.firstExcludedCodePoint("カナa"); // 0x0061 (a)
result = jisX208KanaCp.firstExcludedCodePoint("カaナ"); // 0x0061 (a)
result = jisX208KanaCp.firstExcludedCodePoint("カナ"); // CodePoints#NOT_FOUND
allExcludedCodePoints method
Return a Set of the code points which are not included in the code point set, from the string targeted for checking.
CodePoints jisX208KanaCp = CodePoints.of(JIS_X_0208_Katakana.class);
Set<Integer> result;
result = jisX208KanaCp.allExcludedCodePoints("カナa"); // [0x0061 (a)]
result = jisX208KanaCp.allExcludedCodePoints("カaナb"); // [0x0061 (a), 0x0062 (b)]
result = jisX208KanaCp.allExcludedCodePoints("カナ"); // []
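The semantics of the three check methods can be approximated with plain JDK collections. The sketch below is not the CodePoints implementation. CodePointCheckSketch, setOf and the sentinel value chosen for NOT_FOUND are assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

final class CodePointCheckSketch {

    // Arbitrary sentinel chosen for this sketch; the library's own value may differ.
    static final int NOT_FOUND = Integer.MIN_VALUE;

    // Hypothetical helper: the set of code points contained in chars.
    static Set<Integer> setOf(String chars) {
        return chars.codePoints().boxed()
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    static boolean containsAll(Set<Integer> allowed, String s) {
        return s.codePoints().allMatch(cp -> allowed.contains(cp));
    }

    static int firstExcludedCodePoint(Set<Integer> allowed, String s) {
        return s.codePoints().filter(cp -> !allowed.contains(cp))
                .findFirst().orElse(NOT_FOUND);
    }

    static Set<Integer> allExcludedCodePoints(Set<Integer> allowed, String s) {
        return s.codePoints().filter(cp -> !allowed.contains(cp)).boxed()
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        Set<Integer> kana = setOf("\u30AB\u30CA"); // katakana "ka" and "na"
        System.out.println(containsAll(kana, "\u30AB\u30CAa"));            // => false
        System.out.println(firstExcludedCodePoint(kana, "\u30ABa\u30CA")); // => 97
    }
}
```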
7.6.2.5.5. String check linked with Bean Validation¶
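By specifying a code point set class in the @org.terasoluna.gfw.common.codepoints.ConsistOf annotation, whether a string consists only of the code points in the specified set can be checked in conjunction with Bean Validation.
When there is only one code point set used for checking
@ConsistOf(JIS_X_0208_Hiragana.class) // (1)
private String firstName;
Sr. No. | Description |
---|---|
(1) | Check whether the string specified in the targeted field consists entirely of “Hiragana of JIS X 0208”. |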
When there are multiple code point sets used for checking
@ConsistOf({JIS_X_0208_Hiragana.class, JIS_X_0208_Katakana.class}) // (1)
private String firstName;
Sr. No. | Description |
---|---|
(1) | Check whether the string specified in the targeted field consists entirely of “Hiragana of JIS X 0208” or “Katakana of JIS X 0208”. |
Note
If a string of length N is checked against M code point sets, a checking process of N x M is carried out. When the string is long, this is likely to cause performance degradation. Hence, a new code point set class that is the union of the code point sets used for checking should be created, and that class alone should be specified.
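The idea behind this note, computing the union once and then checking each character against a single set, can be sketched with plain JDK sets. The sketch also shows the difference and intersection operations for comparison. CodePointSetOpsSketch and its methods are hypothetical names and do not reproduce the CodePoints class.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class CodePointSetOpsSketch {

    // Hypothetical helper: the set of code points contained in chars.
    static Set<Integer> setOf(String chars) {
        return chars.codePoints().boxed().collect(Collectors.toCollection(TreeSet::new));
    }

    static Set<Integer> union(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    static Set<Integer> subtract(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        // The union is built once, so each character is looked up in one set only.
        Set<Integer> merged = union(setOf("ab"), setOf("cd"));
        boolean ok = "abcd".codePoints().allMatch(cp -> merged.contains(cp));
        System.out.println(ok); // => true
    }
}
```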
7.6.2.5.6. Creating new code point set class¶
A new code point set class is created by inheriting the CodePoints class and specifying the code points in the constructor.
When a new code point set class is created by specifying code point
How to create code point set which is formed by “only numbers”
public class NumberChars extends CodePoints {
public NumberChars() {
super(0x0030 /* 0 */, 0x0031 /* 1 */, 0x0032 /* 2 */, 0x0033 /* 3 */,
0x0034 /* 4 */, 0x0035 /* 5 */, 0x0036 /* 6 */,
0x0037 /* 7 */, 0x0038 /* 8 */, 0x0039 /* 9 */);
}
}
When a new code point set class is created by using a set operation method of code point set class
How to create a code point set using a union set consisting of “Hiragana” and “Katakana”
public class FullwidthHiraganaKatakana extends CodePoints {
public FullwidthHiraganaKatakana() {
super(new JIS_X_0208_Hiragana().union(new JIS_X_0208_Katakana()));
}
}
How to create a code point set using difference set consisting of “half width katakana excluding symbols (｡｢｣､･)”
public class HalfwidthKatakana extends CodePoints {
public HalfwidthKatakana() {
// super(...) must be the first statement of the constructor, so the symbol set is created inline
super(new JIS_X_0201_Katakana().subtract(new CodePoints(0xFF61 /* ｡ */, 0xFF62 /* ｢ */,
0xFF63 /* ｣ */, 0xFF64 /* ､ */, 0xFF65 /* ･ */)));
}
}
Note
When a code point set class used in a set operation (JIS_X_0208_Hiragana, JIS_X_0208_Katakana etc. in this example) is not used individually, its constructor should be called with the new operator so that the code points are not needlessly cached.
If it is cached by using the CodePoints#of method, a code point set used only during the set operation remains in the heap and puts a load on memory.
On the other hand, if it is also used individually, it should be cached by using the CodePoints#of method.
7.6.2.5.7. Code point set class provided by common library¶
The code point set classes provided by the common library (classes of the org.terasoluna.gfw.common.codepoints.catalog package) and the artifact information to be incorporated when using them are given below.
Sr. No. | Class name | Description | Artifact information |
---|---|---|---|
(1)
|
ASCIIControlChars |
ASCII control characters set.
(0x0000-0x001F, 0x007F) |
<dependency>
<groupId>org.terasoluna.gfw</groupId>
<artifactId>terasoluna-gfw-codepoints</artifactId>
</dependency>
|
(2)
|
ASCIIPrintableChars |
ASCII printable characters set.
(0x0020-0x007E) |
(Same as above)
|
(3)
|
CRLF |
Linefeed code set.
0x000A (LINE FEED) and 0x000D (CARRIAGE RETURN). |
(Same as above)
|
(4)
|
JIS_X_0201_Katakana |
JIS X 0201 katakana set.
Symbols (。「」、・) included as well.
|
<dependency>
<groupId>org.terasoluna.gfw.codepoints</groupId>
<artifactId>terasoluna-gfw-codepoints-jisx0201</artifactId>
</dependency>
|
(5)
|
JIS_X_0201_LatinLetters |
JIS X 0201 Latin characters set.
|
(Same as above)
|
(6)
|
JIS_X_0208_SpecialChars |
Row 2 of JIS X 0208: Special characters set.
|
<dependency>
<groupId>org.terasoluna.gfw.codepoints</groupId>
<artifactId>terasoluna-gfw-codepoints-jisx0208</artifactId>
</dependency>
|
(7)
|
JIS_X_0208_LatinLetters |
Row 3 of JIS X 0208: Alphanumeric set.
|
(Same as above)
|
(8)
|
JIS_X_0208_Hiragana |
Row 4 of JIS X 0208: Hiragana set.
|
(Same as above)
|
(9)
|
JIS_X_0208_Katakana |
Row 5 of JIS X 0208: Katakana set.
|
(Same as above)
|
(10)
|
JIS_X_0208_GreekLetters |
Row 6 of JIS X 0208: Greek letters set.
|
(Same as above)
|
(11)
|
JIS_X_0208_CyrillicLetters |
Row 7 of JIS X 0208: Cyrillic letters set.
|
(Same as above)
|
(12)
|
JIS_X_0208_BoxDrawingChars |
Row 8 of JIS X 0208: Box drawing characters.
|
(Same as above)
|
(13)
|
JIS_X_0208_Kanji |
6355 kanji characters specified in JIS X 0208.
First and second level kanjis.
|
<dependency>
<groupId>org.terasoluna.gfw.codepoints</groupId>
<artifactId>terasoluna-gfw-codepoints-jisx0208kanji</artifactId>
</dependency>
|
(14)
|
JIS_X_0213_Kanji |
Kanji 10050 characters specified in JIS X 0213:2004.
First, second, third and fourth level kanjis.
|
<dependency>
<groupId>org.terasoluna.gfw.codepoints</groupId>
<artifactId>terasoluna-gfw-codepoints-jisx0213kanji</artifactId>
</dependency>
|
Note
The JIS_X_0208_SpecialChars code point set class is the special character set corresponding to rows 01-02 of JIS Kanji (JIS X 0208).
The double byte dash (-) of JIS Kanji is EM DASH, and the corresponding code point in UCS (ISO/IEC 10646-1, JIS X 0221, Unicode) is usually U+2014.
However, in the conversion table offered by the Unicode Consortium, the corresponding Unicode character is HORIZONTAL BAR (U+2015) instead of EM DASH.
Since the conversion rules in general use differ from the Unicode conversion table, problems may occur in actual use if the code point set is defined as per the Unicode conversion table. Therefore, in the JIS_X_0208_SpecialChars code point set class, the code point set is defined by converting HORIZONTAL BAR (U+2015) to EM DASH (U+2014).