[Source Code Study] Character Encoding and Decoding in Java

2023-11-25

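In Java, the classes related to encoding and decoding live in the java.nio.charset package. The main ones are Charset, CharsetDecoder, CharsetEncoder, and CoderResult.

- Charset: a character set
- CoderResult: the result status of a coding operation
- CharsetDecoder: a decoder (bytes to characters)
- CharsetEncoder: an encoder (characters to bytes)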

1. Obtaining an encoder or decoder from a Charset

```java
CharsetEncoder encoder = charset.newEncoder();
CharsetDecoder decoder = charset.newDecoder();
```
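As a minimal sketch of how the two objects are used together, the example below encodes a string to bytes and decodes it back. The class name `CharsetRoundTrip` and its helper methods are made up for illustration; the checked `CharacterCodingException` is wrapped in an unchecked exception only to keep the example short.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
public class CharsetRoundTrip {

    // Encodes text with a fresh encoder and copies the result into a byte array.
    public static byte[] encode(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            byte[] result = new byte[bytes.remaining()];
            bytes.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes bytes with a fresh decoder.
    public static String decode(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder();
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "abc \u7f16\u7801"; // four ASCII chars plus two CJK chars
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 4 one-byte chars + 2 three-byte chars
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

Encoders and decoders carry internal state, so this sketch creates a new one per call rather than sharing an instance.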

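2. CoderResult has the following states:

- `private static final int CR_UNDERFLOW = 0;` The input ByteBuffer has been fully consumed: decoding is complete, or more input is needed.
- `private static final int CR_OVERFLOW = 1;` The output CharBuffer has run out of space, so decoding is not yet finished.
- `private static final int CR_ERROR_MIN = 2;` The lower bound of the error states.
- `private static final int CR_MALFORMED = 2;` The bytes in the input ByteBuffer are not a legal sequence for this charset.
- `private static final int CR_UNMAPPABLE = 3;` The input is well-formed but has no mapping to a Unicode character in this charset.

Of these, underflow and overflow are normal states, while malformed input and unmappable characters are error states.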

The corresponding state-query methods on CoderResult are:

```java
public boolean isUnderflow() {
    return this.type == 0;
}

public boolean isOverflow() {
    return this.type == 1;
}

public boolean isError() {
    return this.type >= 2;
}

public boolean isMalformed() {
    return this.type == 2;
}

public boolean isUnmappable() {
    return this.type == 3;
}
```
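To make the four states concrete, here is a small sketch that provokes each of them. The class `CoderResultStates` and its methods are hypothetical. Unmappable results are easier to trigger when encoding, so that case uses a US-ASCII encoder rather than a decoder.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
public class CoderResultStates {

    // Decodes input as UTF-8 into an output buffer holding at most outCapacity chars.
    public static CoderResult decodeOnce(byte[] input, int outCapacity) {
        return StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(input), CharBuffer.allocate(outCapacity), true);
    }

    // Encodes text as US-ASCII into a 16-byte output buffer.
    public static CoderResult encodeAsciiOnce(String text) {
        return StandardCharsets.US_ASCII.newEncoder()
                .encode(CharBuffer.wrap(text), ByteBuffer.allocate(16), true);
    }

    // Names the state using only the public query methods.
    public static String describe(CoderResult cr) {
        if (cr.isUnderflow()) return "UNDERFLOW";
        if (cr.isOverflow()) return "OVERFLOW";
        if (cr.isMalformed()) return "MALFORMED[" + cr.length() + "]";
        if (cr.isUnmappable()) return "UNMAPPABLE[" + cr.length() + "]";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        byte[] abc = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(describe(decodeOnce(abc, 8)));                      // all input consumed
        System.out.println(describe(decodeOnce(abc, 2)));                      // output too small
        System.out.println(describe(decodeOnce(new byte[] {(byte) 0xFF}, 8))); // 0xFF never occurs in UTF-8
        System.out.println(describe(encodeAsciiOnce("\u00e9")));               // no ASCII mapping
    }
}
```

Note that `length()` may only be called on error results; on underflow or overflow it throws `UnsupportedOperationException`.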

3. CharsetDecoder

3.1 The three-argument decode returns a CoderResult status

public final CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)

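The core work is done by decodeLoop, which reads and writes starting at each buffer's current position. It reads at most in.remaining() bytes and writes at most out.remaining() characters, then returns a CoderResult.

After decodeLoop reports underflow, if the input buffer still holds bytes and endOfInput is true, decode replaces the result with a malformed-input result whose length is in.remaining(). That result is then handled according to the configured error action; under the default, REPORT, it is returned to the caller. Otherwise decode returns the underflow result unchanged.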
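The effect of endOfInput can be seen by feeding a decoder the first two bytes of a three-byte UTF-8 sequence. This is a sketch only; `TruncatedInput` and `decodePrefix` are hypothetical names, and the byte values are the start of the UTF-8 encoding of U+7F16.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
public class TruncatedInput {

    // Decodes an incomplete UTF-8 sequence (0xE7 0xBC, missing the final 0x96).
    public static CoderResult decodePrefix(boolean endOfInput) {
        byte[] prefix = {(byte) 0xE7, (byte) 0xBC};
        return StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(prefix), CharBuffer.allocate(4), endOfInput);
    }

    public static void main(String[] args) {
        // More input may follow: the decoder simply asks for more bytes.
        System.out.println(decodePrefix(false).isUnderflow());
        // No more input: the two leftover bytes are reported as malformed.
        CoderResult atEnd = decodePrefix(true);
        System.out.println(atEnd.isMalformed() + " " + atEnd.length());
    }
}
```

A streaming caller would pass `false` while more chunks are expected and keep the unread bytes for the next call, passing `true` only with the last chunk.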

Compared with the one-argument overload, the three-argument method begins with a state check:

```java
int newState = endOfInput ? ST_END : ST_CODING;
if ((state != ST_RESET) && (state != ST_CODING)
    && !(endOfInput && (state == ST_END)))
    throwIllegalStateException(state, newState);
state = newState;
```
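The following sketch shows the check from the caller's side: once a decoder has been told the input has ended, another call with endOfInput set to false is rejected until reset() is called. `DecoderStateCheck` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
public class DecoderStateCheck {

    // Returns true if decoding more input after end-of-input throws IllegalStateException.
    public static boolean throwsAfterEnd() {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(4);
        decoder.decode(ByteBuffer.wrap(new byte[] {'a'}), out, true); // state becomes "end"
        try {
            decoder.decode(ByteBuffer.wrap(new byte[] {'b'}), out, false);
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // Same sequence, but with reset() between the two calls.
    public static String decodeAfterReset() {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(4);
        decoder.decode(ByteBuffer.wrap(new byte[] {'a'}), out, true);
        decoder.reset(); // back to the initial state
        decoder.decode(ByteBuffer.wrap(new byte[] {'b'}), out, false);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(throwsAfterEnd());
        System.out.println(decodeAfterReset());
    }
}
```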

The core of decode looks like this:

```java
for (;;) {
    CoderResult cr;
    try {
        // decodeLoop does the actual decoding
        cr = decodeLoop(in, out);
    } catch (BufferUnderflowException x) {
        throw new CoderMalfunctionError(x);
    } catch (BufferOverflowException x) {
        throw new CoderMalfunctionError(x);
    }

    // 1. The CharBuffer is out of space
    if (cr.isOverflow())
        return cr;

    // 2. Input consumed. If endOfInput is false, return now and let the
    //    caller invoke decode again with more input
    if (cr.isUnderflow()) {
        if (endOfInput && in.hasRemaining()) {
            cr = CoderResult.malformedForLength(in.remaining());
            // Fall through to malformed-input case
        } else {
            return cr;
        }
    }

    // 3. Error handling. The default action is REPORT, which leaves
    //    the error to the caller
    CodingErrorAction action = null;
    if (cr.isMalformed())
        action = malformedInputAction;
    else if (cr.isUnmappable())
        action = unmappableCharacterAction;
    else
        assert false : cr.toString();

    // 3.1 REPORT: hand the error to the caller
    if (action == CodingErrorAction.REPORT)
        return cr;

    // 3.2 REPLACE: append the replacement string
    if (action == CodingErrorAction.REPLACE) {
        if (out.remaining() < replacement.length())
            return CoderResult.OVERFLOW;
        out.put(replacement);
    }

    // 3.3 IGNORE and REPLACE both skip the bad input and keep decoding
    if ((action == CodingErrorAction.IGNORE)
        || (action == CodingErrorAction.REPLACE)) {
        // Skip erroneous input either way
        in.position(in.position() + cr.length());
        continue;
    }

    assert false;
}
```
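The three error actions can be compared on the same input. In this sketch the input is `a`, an invalid byte 0xFF, then `b`; the class `ErrorActions` and the method `decodeWith` are hypothetical, and the 16-char output buffer is simply assumed to be large enough.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
public class ErrorActions {

    // Decodes input as UTF-8 with the given action for malformed input
    // and returns whatever characters were produced.
    public static String decodeWith(CodingErrorAction action, byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        CharBuffer out = CharBuffer.allocate(16);
        CoderResult cr = decoder.decode(ByteBuffer.wrap(input), out, true);
        if (cr.isUnderflow()) {
            decoder.flush(out); // only legal once decoding has reached end-of-input cleanly
        }
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] input = {'a', (byte) 0xFF, 'b'};
        // REPORT stops at the bad byte, so only "a" has been written.
        System.out.println(decodeWith(CodingErrorAction.REPORT, input));
        // REPLACE writes the decoder's replacement (U+FFFD by default) and continues.
        System.out.println(decodeWith(CodingErrorAction.REPLACE, input));
        // IGNORE drops the bad byte and continues.
        System.out.println(decodeWith(CodingErrorAction.IGNORE, input));
    }
}
```

With REPORT the caller also gets the input buffer positioned at the offending byte, which is what makes custom recovery possible.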

3.2 The one-argument decode returns the decoded buffer

public final CharBuffer decode(ByteBuffer in)

This method calls decode(in, out, true) in a loop until decoding is complete:

```java
// out and its capacity n are initialized before the loop
for (;;) {
    CoderResult cr = in.hasRemaining() ?
        decode(in, out, true) : CoderResult.UNDERFLOW;

    // 1. Decoding finished: call flush and stop
    if (cr.isUnderflow())
        cr = flush(out);
    if (cr.isUnderflow())
        break;

    // 2. The CharBuffer is out of space: grow it and keep decoding
    if (cr.isOverflow()) {
        n = 2*n + 1;    // Ensure progress; n might be 0!
        CharBuffer o = CharBuffer.allocate(n);
        out.flip();
        o.put(out);
        out = o;
        continue;
    }

    // 3. Decoding error: throw it
    cr.throwException();
}
out.flip();
```
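From the caller's point of view, the one-argument overload either returns a ready-to-read buffer or throws. A minimal sketch, with the hypothetical names `StrictDecode` and `decodeStrict`:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
public class StrictDecode {

    // Decodes input as UTF-8 with the default REPORT actions and
    // turns a decoding failure into a short description.
    public static String decodeStrict(byte[] input) {
        try {
            // The returned CharBuffer is already flipped, so toString() yields the text.
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        } catch (MalformedInputException e) {
            return "malformed input of length " + e.getInputLength();
        } catch (CharacterCodingException e) {
            return "coding error: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeStrict("hi".getBytes(StandardCharsets.UTF_8)));
        System.out.println(decodeStrict(new byte[] {'a', (byte) 0xFF}));
    }
}
```

This contrasts with `new String(bytes, charset)`, which silently substitutes replacement characters instead of throwing.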

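flush mainly performs a state transition:

```java
// 2 and 3 are the numeric values of ST_END and ST_FLUSHED
if (this.state == 2) {
    CoderResult cr = this.implFlush(out);
    if (cr.isUnderflow()) {
        this.state = 3;
    }

    return cr;
}
```

A buffer's flip() switches it from writing to reading: it sets the limit to the current position and resets the position to 0.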
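A short sketch of what flip() changes, using a plain CharBuffer outside of any decoder. `FlipDemo` and its methods are hypothetical names, and the extra 4 chars of capacity are an arbitrary choice to leave unused space.

```java
import java.nio.CharBuffer;

// Hypothetical demo class, not part of the JDK.
public class FlipDemo {

    // Writes text into a buffer, flips it, and reads the text back.
    public static String writeThenRead(String text) {
        CharBuffer buf = CharBuffer.allocate(text.length() + 4);
        buf.put(text); // position = text.length(), limit = capacity
        buf.flip();    // limit = text.length(), position = 0
        return buf.toString(); // the chars between position and limit
    }

    // Without flip(), remaining() counts the unused space, not the written chars.
    public static int remainingWithoutFlip(String text) {
        CharBuffer buf = CharBuffer.allocate(text.length() + 4);
        buf.put(text);
        return buf.remaining();
    }

    public static void main(String[] args) {
        System.out.println(writeThenRead("abc"));
        System.out.println(remainingWithoutFlip("abc"));
    }
}
```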

4. CharsetEncoder

CharsetEncoder works much like CharsetDecoder, so it is not covered in detail here.
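For completeness, here is a sketch of the mirror-image behaviour on the encoding side, where the typical error is a character the target charset cannot represent. `EncoderErrors` and `encodeAscii` are hypothetical names; US-ASCII is chosen only because it makes unmappable characters easy to produce.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

// Hypothetical demo class, not part of the JDK.
public class EncoderErrors {

    // Encodes text as US-ASCII with the given action for unmappable characters
    // and returns the resulting bytes as a string, or a description of the failure.
    public static String encodeAscii(String text, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(action);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            byte[] raw = new byte[bytes.remaining()];
            bytes.get(raw);
            return new String(raw, StandardCharsets.US_ASCII);
        } catch (UnmappableCharacterException e) {
            return "unmappable input of length " + e.getInputLength();
        } catch (CharacterCodingException e) {
            return "coding error: " + e;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // the last char has no ASCII mapping
        System.out.println(encodeAscii(text, CodingErrorAction.REPORT));
        System.out.println(encodeAscii(text, CodingErrorAction.REPLACE)); // default replacement byte is '?'
        System.out.println(encodeAscii(text, CodingErrorAction.IGNORE));
    }
}
```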
