.Pp
Each call to
.Nm :
.Bl -enum -compact
.It
examines up to
.Fa n
bytes starting at
.Fa s ,
.It
yields a UTF-8 code unit if available by storing it at
.Li * Ns Fa pc8 ,
.It
saves state at
.Fa ps ,
and
.It
returns either the number of bytes consumed if any or a special return
value.
.El
.Pp
Specifically:
.Bl -bullet
.It
If the multibyte sequence at
.Fa s
is invalid after any previous input saved at
.Fa ps ,
or if an error occurs in decoding,
.Nm
returns
.Li (size_t)-1
and sets
.Xr errno 2
to indicate the error.
.It
If the multibyte sequence at
.Fa s
is still incomplete after
.Fa n
bytes, including any previous input saved in
.Fa ps ,
.Nm
saves its state in
.Fa ps
after all the input so far and returns
.Li "(size_t)-2" .
.Sy All
.Fa n
bytes of input are consumed in this case.
.It
If
.Nm
had previously decoded a multibyte character but has not yet yielded
all the code units of its UTF-8 encoding, it stores the next UTF-8
code unit at
.Li * Ns Fa pc8
and returns
.Li "(size_t)-3" .
.Sy No
input is consumed in this case.
.It
If
.Nm
decodes the null multibyte character, then it stores zero at
.Li * Ns Fa pc8
and returns zero.
.It
Otherwise,
.Nm
decodes a single multibyte character, stores the first (and possibly
only) code unit in its UTF-8 encoding at
.Li * Ns Fa pc8 ,
and returns the number of bytes consumed to decode the first multibyte
character.
.El
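.Pp
The cases above can be sketched, under stated assumptions, as a helper
that collects every UTF-8 code unit of the first multibyte character
in a buffer.
The name
.Fn first_char_units
is hypothetical and chosen only for this illustration, and a C23
environment whose
.In uchar.h
provides
.Vt char8_t
is assumed:
.Bd -literal -offset indent
#include <stddef.h>
#include <uchar.h>

/*
 * Hypothetical helper: store the UTF-8 code units of the first
 * multibyte character of s (at most n bytes) in out[0..3] and
 * return how many were stored.  Returns 0 on an encoding error,
 * incomplete input, or a null character.
 */
size_t
first_char_units(const char *s, size_t n, char8_t out[4])
{
	mbstate_t mbs = {0};	/* initial conversion state */
	size_t len, k = 0;

	len = mbrtoc8(&out[k], s, n, &mbs);
	if (len == 0 || len == (size_t)-1 || len == (size_t)-2)
		return 0;
	k++;			/* first (possibly only) code unit */

	/* Pending code units are yielded without consuming input. */
	while (k < 4 &&
	    mbrtoc8(&out[k], s + len, n - len, &mbs) == (size_t)-3)
		k++;
	return k;		/* only out[0..k-1] is meaningful */
}
.Ed
.Pp
In a UTF-8 locale, the two input bytes 0xc3 0xa9 would be expected to
yield the same two bytes as code units: the first from a call that
consumes both input bytes, and the second from a call that returns
.Li (size_t)-3 .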
.Pp
If
.Fa pc8
is a null pointer, nothing is stored, but the effects on
.Fa ps
and the return value are unchanged.
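.Pp
One possible use of a null
.Fa pc8
is to count code units without storing them, for example to size an
output buffer.
The following is a minimal sketch, not a definitive implementation;
the helper name
.Fn utf8_units_needed
is hypothetical:
.Bd -literal -offset indent
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper: return the number of UTF-8 code units that
 * the null-terminated multibyte string s converts to, not counting
 * the terminator, or (size_t)-1 if s cannot be decoded.
 */
size_t
utf8_units_needed(const char *s)
{
	mbstate_t mbs = {0};		/* initial conversion state */
	size_t n = strlen(s) + 1;	/* include the terminator */
	size_t len, count = 0;

	for (;;) {
		len = mbrtoc8(NULL, s, n, &mbs); /* nothing stored */
		if (len == 0)
			return count;	/* reached the terminator */
		if (len == (size_t)-1 || len == (size_t)-2)
			return (size_t)-1; /* invalid or truncated */
		count++;		/* one code unit yielded */
		if (len != (size_t)-3) { /* input was consumed */
			s += len;
			n -= len;
		}
	}
}
.Ed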
.Pp
If
.Fa s
is a null pointer, the
.Nm
call is equivalent to:
.Bd -ragged -offset indent
.Fo mbrtoc8
.Li NULL ,
.Li \*q\*q ,
.Li 1 ,
.Fa ps
.Fc
.Ed
.Pp
This always returns zero, and has the effect of resetting
.Fa ps
to the initial conversion state, without writing to
.Fa pc8 ,
even if it is nonnull.
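.Pp
A caller that wants to abandon a partially decoded character, for
instance after an error, might use this form to reset the state.
The sketch below assumes a hypothetical helper name,
.Fn reset_conversion ,
and uses
.Xr mbsinit 3
only to confirm the result:
.Bd -literal -offset indent
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/*
 * Hypothetical helper: discard any partially decoded character or
 * pending code units in *ps, and report whether *ps is now in the
 * initial conversion state.
 */
int
reset_conversion(mbstate_t *ps)
{
	/* With s == NULL, the pc8 and n arguments are ignored. */
	(void)mbrtoc8(NULL, NULL, 0, ps);
	return mbsinit(ps) != 0;
}
.Ed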
.Pp
If
.Fa ps
is a null pointer,
.Nm
uses an internal
.Vt mbstate_t
object with static storage duration, distinct from all other
.Vt mbstate_t
objects
.Po
including those used by
.Xr mbrtoc16 3 ,
.Xr mbrtoc32 3 ,
.Xr c8rtomb 3 ,
.Xr c16rtomb 3 ,
and
.Xr c32rtomb 3
.Pc ,
which is initialized at program startup to the initial conversion
state.
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh IMPLEMENTATION NOTES
On well-formed input, the
.Nm
function yields either a Unicode scalar value in US-ASCII range, i.e.,
a 7-bit Unicode code point, or, over two to four successive calls, the
leading and trailing code units in order of the UTF-8 encoding of a
Unicode scalar value outside the US-ASCII range.
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh RETURN VALUES
The
.Nm
function returns:
.Bl -tag -width Li
.It Li 0
.Bq null
if
.Nm
decoded a null multibyte character.
.It Ar i
.Bq code unit
where
.Li 1
\*(Le
.Ar i
\*(Le
.Fa n ,
if
.Nm
consumed
.Ar i
bytes of input to decode the next multibyte character, yielding a
UTF-8 code unit.
.It Li (size_t)-3
.Bq continuation
if
.Nm
consumed no new bytes of input but yielded a UTF-8 code unit that was
pending from previous input.
.It Li (size_t)-2
.Bq incomplete
if
.Nm
found only an incomplete multibyte sequence after all
.Fa n
bytes of input and any previous input, and saved its state to restart
in the next call with
.Fa ps .
.It Li (size_t)-1
.Bq error
if any encoding error was detected;
.Xr errno 2
is set to reflect the error.
.El
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh EXAMPLES
Print the UTF-8 code units of a multibyte string in hexadecimal text:
.Bd -literal -offset indent
char *s = ...;
size_t n = ...;
mbstate_t mbs = {0};	/* initial conversion state */

while (n) {
	char8_t c8;
	size_t len;

	len = mbrtoc8(&c8, s, n, &mbs);
	switch (len) {
	case 0:		/* NUL terminator */
		assert(c8 == 0);
		goto out;
	default:	/* consumed input and yielded a byte c8 */
		printf("0x%02hhx\en", c8);
		break;
	case (size_t)-3:	/* yielded a pending byte c8 */
		printf("continue 0x%02hhx\en", c8);
		continue;	/* no input consumed: do not advance s */
	case (size_t)-2:	/* incomplete */
		printf("incomplete\en");
		goto readmore;
	case (size_t)-1:	/* error */
		printf("error: %d\en", errno);
		goto out;
	}
	s += len;
	n -= len;
}
.Ed
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh ERRORS
.Bl -tag -width Bq
.It Bq Er EILSEQ
The multibyte sequence cannot be decoded in the current locale as a
Unicode scalar value.
.It Bq Er EIO
An error occurred in loading the locale's character conversions.
.El
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh SEE ALSO
.Xr c8rtomb 3 ,
.Xr c16rtomb 3 ,
.Xr c32rtomb 3 ,
.Xr mbrtoc16 3 ,
.Xr mbrtoc32 3 ,
.Xr uchar 3
.Rs
.%B The Unicode Standard
.%O Version 15.0 \(em Core Specification
.%Q The Unicode Consortium
.%D September 2022
.%U https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf
.Re
.Rs
.%A F. Yergeau
.%T UTF-8, a transformation format of ISO 10646
.%R RFC 3629
.%D November 2003
.%I Internet Engineering Task Force
.%U https://datatracker.ietf.org/doc/html/rfc3629
.Re
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh STANDARDS
The
.Nm
function conforms to
.St -isoC-2023 .
.\" XXX PR misc/58600: man pages lack C17, C23, C++98, C++03, C++11, C++17, C++20, C++23 citation syntax
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh HISTORY
The
.Nm
function first appeared in
.Nx 11.0 .