Commit 76b8b2d

docs: add docs
Signed-off-by: Okiki <[email protected]>
1 parent 39ebe4c commit 76b8b2d

7 files changed, +550 -254 lines changed

Diff for: .gitpod.yml

-17
@@ -9,23 +9,6 @@ ports:
   - port: 3001
     onOpen: ignore

-github:
-  prebuilds:
-    # enable for the master/default branch (defaults to true)
-    master: true
-    # enable for all branches in this repo (defaults to false)
-    branches: true
-    # enable for pull requests coming from this repo (defaults to true)
-    pullRequests: true
-    # enable for pull requests coming from forks (defaults to false)
-    pullRequestsFromForks: true
-    # add a "Review in Gitpod" button as a comment to pull requests (defaults to true)
-    addComment: true
-    # add a "Review in Gitpod" button to pull requests (defaults to false)
-    addBadge: false
-    # add a label once the prebuild is ready to pull requests (defaults to false)
-    addLabel: prebuilt-in-gitpod
-
 # List the start up tasks. You can start them in parallel in multiple terminals. See https://www.gitpod.io/docs/44_config_start_tasks/
 tasks:
   - init: >

Diff for: byte_methods.ts

+61-22
@@ -1,3 +1,31 @@
+/**
+ * @module
+ * Provides utility functions for working with UTF-8 encoded characters in TypeScript.
+ * It includes methods for determining the byte length of UTF-8 characters, converting bytes to Unicode code points,
+ * extracting code points from buffers, and dealing with UTF-16 code units in strings.
+ *
+ * @example
+ * ```ts
+ * import { getByteLength, bytesToCodePoint, bytesToCodePointFromBuffer, codePointAt } from 'jsr:@okikio/codepoint-iterator/byte_methods';
+ *
+ * // Determine the byte length of a UTF-8 encoded character
+ * const leadByte = 0xF0; // Leading byte of a 4-byte UTF-8 character
+ * console.log(getByteLength(leadByte)); // Expected output: 4
+ *
+ * // Convert a sequence of UTF-8 bytes to a Unicode code point
+ * const bytes = [0xF0, 0x9F, 0x92, 0xA9]; // UTF-8 encoded representation of the 💩 emoji
+ * console.log(bytesToCodePoint(4, bytes)); // Expected output: 128169 (code point for 💩)
+ *
+ * // Extract a Unicode code point from a buffer
+ * const buffer = new Uint8Array([0xF0, 0x9F, 0x92, 0xA9]);
+ * console.log(bytesToCodePointFromBuffer(4, buffer, 0)); // Expected output: 128169
+ *
+ * // Calculate the Unicode code point of a character in a string
+ * const str = '🌍';
+ * console.log(codePointAt(str, 0)); // Expected output: 127757 (code point for 🌍)
+ * ```
+ */
+
 import {
   BITS_FOR_2B,
   BITS_FOR_3B,
@@ -18,12 +46,15 @@ import {
 /**
  * Calculates the number of bytes required to represent a single UTF-8 character.
  *
- * UTF-8 can be represented by 1 to 4 bytes.
+ * Determines the byte length of a UTF-8 encoded character based on its leading byte.
+ * This is crucial for correctly interpreting or encoding text in UTF-8,
+ * where characters may vary in byte length from 1 to 4 bytes.
+ *
  * Given the byte value of the leading byte of a UTF-8 character, this function
  * calculates how many bytes are required to represent the character;
  * this allows emojis and other symbols to be represented in UTF-8.
  *
- * @param byte - The lead byte of a UTF-8 character.
+ * @param byte The lead byte of a UTF-8 character.
  * @returns The number of bytes in a Uint8Array required to represent the UTF-8 character (the number of bytes ranges from 1 to 4).
  */
 export function getByteLength(byte: number): number {
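The body of `getByteLength` falls outside this hunk, so the following is only a minimal sketch of how a lead byte can be classified with the standard UTF-8 bit patterns. The name `utf8ByteLength` and the fallback of 1 for unexpected bytes are assumptions made for illustration, not the library's implementation.

```typescript
// Hypothetical stand-in for getByteLength; the real body is not shown in this diff.
function utf8ByteLength(leadByte: number): number {
  if ((leadByte & 0x80) === 0x00) return 1; // 0xxx xxxx: ASCII
  if ((leadByte & 0xE0) === 0xC0) return 2; // 110x xxxx
  if ((leadByte & 0xF0) === 0xE0) return 3; // 1110 xxxx
  if ((leadByte & 0xF8) === 0xF0) return 4; // 1111 0xxx
  // Continuation or invalid lead byte: treated as length 1 here (an assumption).
  return 1;
}

console.log(utf8ByteLength(0x41)); // 1
console.log(utf8ByteLength(0xCE)); // 2
console.log(utf8ByteLength(0xE2)); // 3
console.log(utf8ByteLength(0xF0)); // 4
```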
@@ -37,8 +68,8 @@ export function getByteLength(byte: number): number {
 }

 /**
- * UTF-8 bytes to codepoint.
- * Calculates the Unicode code point from the bytes of a UTF-8 character.
+ * Converts a sequence of bytes into a Unicode code point. This function is a key part of
+ * decoding UTF-8 encoded text, as it translates the raw bytes back into the characters they represent.
  *
  * UTF-8 can be represented by 1 to 4 bytes.
  * This function given the byte length of the utf-8 character
@@ -48,10 +79,10 @@ export function getByteLength(byte: number): number {
  * Due to the dynamic length of utf-8 characters,
  * it's faster to just grab the bytes from the Uint8Array and calculate its codepoint
  * than trying to decode said Uint8Array into a string and then converting
- * said string into codepoints.
+ * said string into codepoints.
  *
  * @param byteLength The number of bytes in a Uint8Array required to represent a single UTF-8 character (the number of bytes ranges from 1 to 4).
- * @param [bytes] - An array of length `byteLength` bytes that make up the UTF-8 character.
+ * @param bytes An array of length `byteLength` bytes that make up the UTF-8 character.
  * @returns The Unicode code point of the UTF-8 character.
  */
 export function bytesToCodePoint(byteLength: number, [byte1, byte2, byte3, byte4]: number[]): number {
@@ -79,16 +110,20 @@ export function bytesToCodePoint(byteLength: number, [byte1, byte2, byte3, byte4
         MASK_FOR_1B & byte4

       // 1-byte UTF-8 sequence (fallback)
+      // Default to 1-byte sequence if length is unexpected
       : byte1
   );
 }

-/**
- * Calculates the Unicode code point from a given buffer using indexed access.
- * @param byteLength - The number of bytes representing the code point.
- * @param buffer - The Uint8Array buffer containing the bytes.
- * @param head - The starting index of the code point in the buffer.
- * @returns The calculated Unicode code point.
+/**
+ * Extracts a Unicode code point from a given buffer starting at a specified index.
+ * This method is useful for parsing a stream or array of data where UTF-8 characters
+ * are embedded within a larger set of binary data.
+ *
+ * @param byteLength The byte length of the UTF-8 encoded character to extract.
+ * @param buffer The buffer (array or Uint8Array) containing the UTF-8 data.
+ * @param head The index in the buffer where the UTF-8 encoded character starts.
+ * @returns The Unicode code point extracted from the buffer.
  */
 export function bytesToCodePointFromBuffer<T extends number = number>(
   byteLength: number,
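Only fragments of the decoding functions are visible in these hunks, so here is a rough sketch of how a code point could be decoded from a buffer at a given index. The constant values mirror `constants.ts` in this commit; the helper name `decodeAt` is hypothetical, and unlike the real function this sketch does not wrap indices around the buffer size.

```typescript
// Values mirror constants.ts in this commit.
const BITS_FOR_2B = 6;
const BITS_FOR_3B = 12;
const BITS_FOR_4B = 18;
const MASK_FOR_1B = 0x3F; // payload bits of a continuation byte
const MASK_FOR_2B = 0x1F;
const MASK_FOR_3B = 0x0F;
const MASK_FOR_4B = 0x07;

// Hypothetical helper: decodes one code point of `byteLength` bytes starting at `head`.
function decodeAt(byteLength: number, buffer: ArrayLike<number>, head: number): number {
  switch (byteLength) {
    case 2:
      return ((MASK_FOR_2B & buffer[head]) << BITS_FOR_2B) |
        (MASK_FOR_1B & buffer[head + 1]);
    case 3:
      return ((MASK_FOR_3B & buffer[head]) << BITS_FOR_3B) |
        ((MASK_FOR_1B & buffer[head + 1]) << BITS_FOR_2B) |
        (MASK_FOR_1B & buffer[head + 2]);
    case 4:
      return ((MASK_FOR_4B & buffer[head]) << BITS_FOR_4B) |
        ((MASK_FOR_1B & buffer[head + 1]) << BITS_FOR_3B) |
        ((MASK_FOR_1B & buffer[head + 2]) << BITS_FOR_2B) |
        (MASK_FOR_1B & buffer[head + 3]);
    default:
      // 1-byte sequence, or an unexpected length
      return buffer[head];
  }
}

// 'A', then 'Ω' (2 bytes), then '💩' (4 bytes)
const data = new Uint8Array([0x41, 0xCE, 0xA9, 0xF0, 0x9F, 0x92, 0xA9]);
console.log(decodeAt(1, data, 0)); // 65
console.log(decodeAt(2, data, 1)); // 937
console.log(decodeAt(4, data, 3)); // 128169
```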
@@ -121,23 +156,26 @@ export function bytesToCodePointFromBuffer<T extends number = number>(
         MASK_FOR_1B & buffer[(head + 3) % bufferSize]
       );
     default:
+      // Default case for unexpected byteLength
       return buffer[head];
   }
 }

 /**
  * Extracts the Unicode code point and its size in UTF-16 code units from a string at a given position.
- * @param str - The input string.
- * @param index - The position in the string to extract the code point from.
- * @returns A number represent the code point in UTF-16 code units.
+ *
+ * Calculates the Unicode code point of a character at a specific index in a string,
+ * taking into account UTF-16 encoding which may represent characters using one or two code units (surrogates).
+ * This function is particularly useful for strings containing emoji or other characters
+ * that may be represented as surrogate pairs in JavaScript.
+ *
+ * @param str The string to extract the code point from.
+ * @param index The index of the character within the string.
+ * @returns The Unicode code point of the character, considering potential surrogate pairs.
  */
-export function codePointAt(str: string, index: number): number {
+export function codePointAt(str: string, index: number): number | undefined {
   const size = str.length;
-
-  // Account for out-of-bounds indices:
-  if (index < 0 || index >= size) {
-    return undefined;
-  }
+  if (index < 0 || index >= size) return undefined; // Guard clause for out-of-bounds index

   // Get the first code unit
   const first = str.charCodeAt(index);
@@ -174,9 +212,10 @@ export function codePointAt(str: string, index: number): number {
       // Use bitwise shift instead of multiplication and addition
       // Bitwise left shift (<< 10) is used here as an efficient way to multiply by 2^10 (or 2**10) (or 1024).
       // This is equivalent to the expression (first - 0xD800) * 0x400, since 0x400 in decimal is 1024.
-      return ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000;
+      return ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000; // Calculate and return surrogate pair code point
     }
   }

+  // Return the code unit if not a surrogate pair
   return first;
 }

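The surrogate-pair arithmetic in the last hunk can be checked by hand. The sketch below isolates that one expression in a hypothetical helper, `fromSurrogatePair`, and compares the result with the built-in `String.prototype.codePointAt`; it is an illustration of the formula, not the library's full `codePointAt`.

```typescript
// Hypothetical helper that isolates the formula from the hunk above.
function fromSurrogatePair(first: number, second: number): number {
  return ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000;
}

const str = "🌍";
const first = str.charCodeAt(0);  // 0xD83C, the high surrogate
const second = str.charCodeAt(1); // 0xDF0D, the low surrogate

console.log(fromSurrogatePair(first, second)); // 127757
console.log(fromSurrogatePair(first, second) === str.codePointAt(0)); // true
```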
Diff for: constants.ts

+108-3
@@ -1,24 +1,129 @@
+/**
+ * @module
+ * This module defines constants used for UTF-8 character encoding,
+ * covering 1-byte to 5-byte sequences, including their leading bits
+ * and masks for identifying and extracting the encoded character bits.
+ *
+ * Defines constants for UTF-8 encoding operations, including lead bytes, masks, and bits required for different byte sequences.
+ * These constants are essential for encoding and decoding UTF-8 characters, from simple ASCII to complex symbols and emojis.
+ *
+ * @example
+ * Imagine encoding the character '𝄞' (the G Clef symbol in music), which requires a 4-byte UTF-8 sequence.
+ *
+ * 1. Identify the lead byte for a 4-byte sequence: `LEAD_FOR_4B` (1111 0000 in binary)
+ * 2. The mask for extracting significant bits from the first byte in a 4-byte sequence: `MASK_FOR_4B` (0000 0111 in binary)
+ * 3. To encode '𝄞', shift by `BITS_FOR_4B` (18) so that the highest bits (19 -> 21) land in the first byte.
+ *
+ * The process involves:
+ * - Using `LEAD_FOR_4B` to start the encoding sequence.
+ * - Applying `MASK_FOR_4B` to extract the first few significant bits of the character.
+ * - Shifting by `BITS_FOR_4B`, `BITS_FOR_3B`, and `BITS_FOR_2B` to position the remaining bits correctly.
+ *
+ * For a 2-byte character like 'Ω' (Omega):
+ * - Start with `LEAD_FOR_2B` (1100 0000 in binary) to indicate a 2-byte sequence.
+ * - Use `MASK_FOR_2B` (0001 1111 in binary) for the first byte's significant bits.
+ * - The shift amount is `BITS_FOR_2B` (6 bits for positions 7 to 12).
+ *
+ * A 1-byte ASCII character, such as 'A':
+ * - Is stored as a single byte with its high bit clear; `LEAD_FOR_1B` (1000 0000 in binary) and `MASK_FOR_1B` (0011 1111 in binary) match the `10xx xxxx` pattern of the continuation bytes in multi-byte sequences.
+ */
+
 // 1-byte encoding
+/**
+ * Bit pattern `1000 0000`: the high bit, which is clear in every 1-byte (ASCII) character
+ * and set, with the next bit clear, in every continuation byte (`10xx xxxx`).
+ *
+ * @example `1000 0000`
+ */
 export const LEAD_FOR_1B = 0x80; // 1000 0000
+
+/**
+ * Mask for extracting the six payload bits from a continuation byte (`10xx xxxx`).
+ *
+ * @example `0011 1111`
+ */
 export const MASK_FOR_1B = 0x3F; // 0011 1111

 // 2-byte encoding
-export const BITS_FOR_2B = 6; // bits 7 -> 12
+/**
+ * Number of significant bits in a 2-byte sequence, used for characters beyond the ASCII range.
+ *
+ * @example highest bits 7 -> 12
+ */
+export const BITS_FOR_2B = 6; // highest bits 7 -> 12
+
+/**
+ * Leading bits for a 2-byte sequence, indicating the start of a 2-byte encoded character.
+ *
+ * @example `1100 0000`
+ */
 export const LEAD_FOR_2B = 0xC0; // 1100 0000
+
+/**
+ * Mask for extracting the significant bits from a 2-byte encoded character.
+ *
+ * @example `0001 1111`
+ */
 export const MASK_FOR_2B = 0x1F; // 0001 1111

 // 3-byte encoding
-export const BITS_FOR_3B = 12; // bits 13 -> 18
+/**
+ * Number of significant bits in a 3-byte sequence, typically used for characters in many non-Western alphabets.
+ *
+ * @example highest bits 13 -> 18
+ */
+export const BITS_FOR_3B = 12; // highest bits 13 -> 18
+
+/**
+ * Leading bits for a 3-byte sequence, indicating the start of a 3-byte encoded character.
+ *
+ * @example `1110 0000`
+ */
 export const LEAD_FOR_3B = 0xE0; // 1110 0000
+
+/**
+ * Mask for extracting the significant bits from a 3-byte encoded character.
+ *
+ * @example `0000 1111`
+ */
 export const MASK_FOR_3B = 0x0F; // 0000 1111

 // 4-byte encoding
+/**
+ * Number of significant bits in a 4-byte sequence, used for characters that are less common in daily use.
+ *
+ * @example highest bits 19 -> 21
+ */
 export const BITS_FOR_4B = 18; // highest bits 19 -> 21
+
+/**
+ * Leading bits for a 4-byte sequence, indicating the start of a 4-byte encoded character.
+ *
+ * @example `1111 0000`
+ */
 export const LEAD_FOR_4B = 0xF0; // 1111 0000
+
+/**
+ * Mask for extracting the significant bits from a 4-byte encoded character.
+ *
+ * @example `0000 0111`
+ */
 export const MASK_FOR_4B = 0x07; // 0000 0111

 // 5-byte encoding
+/**
+ * Leading bits for a 5-byte sequence. This is not officially used in UTF-8 encoding
+ * and is included for completeness.
+ *
+ * @example `1111 1000`
+ */
 export const LEAD_FOR_5B = 0xF8; // 1111 1000

-// The maximum number of bytes required to represent a UTF-8 character
+// UTF-8 encoding specifics
+/**
+ * The maximum number of bytes required to represent any UTF-8 character.
+ * This constant defines the upper limit for UTF-8 encoded character size.
+ *
+ * @example 4
+ */
 export const UTF8_MAX_BYTE_LENGTH = 4;

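The module comment walks through encoding '𝄞' and 'Ω' in words. As a rough illustration, the sketch below applies the same lead bytes, masks, and shift amounts to produce the bytes. The constant values are copied from this diff; the helpers `encodeCodePoint` and `hex` are hypothetical, and the sketch assumes a valid code point (no surrogate or range checks).

```typescript
// Values copied from constants.ts in this commit.
const LEAD_FOR_1B = 0x80; // 10xx xxxx, marks a continuation byte
const MASK_FOR_1B = 0x3F;
const BITS_FOR_2B = 6;
const LEAD_FOR_2B = 0xC0;
const BITS_FOR_3B = 12;
const LEAD_FOR_3B = 0xE0;
const BITS_FOR_4B = 18;
const LEAD_FOR_4B = 0xF0;

// Hypothetical helper: encodes one valid Unicode code point as UTF-8 bytes.
function encodeCodePoint(codePoint: number): number[] {
  if (codePoint < 0x80) return [codePoint]; // ASCII is stored as-is
  if (codePoint < 0x800) {
    return [
      LEAD_FOR_2B | (codePoint >> BITS_FOR_2B),
      LEAD_FOR_1B | (codePoint & MASK_FOR_1B),
    ];
  }
  if (codePoint < 0x10000) {
    return [
      LEAD_FOR_3B | (codePoint >> BITS_FOR_3B),
      LEAD_FOR_1B | ((codePoint >> BITS_FOR_2B) & MASK_FOR_1B),
      LEAD_FOR_1B | (codePoint & MASK_FOR_1B),
    ];
  }
  return [
    LEAD_FOR_4B | (codePoint >> BITS_FOR_4B),
    LEAD_FOR_1B | ((codePoint >> BITS_FOR_3B) & MASK_FOR_1B),
    LEAD_FOR_1B | ((codePoint >> BITS_FOR_2B) & MASK_FOR_1B),
    LEAD_FOR_1B | (codePoint & MASK_FOR_1B),
  ];
}

// Formats bytes as upper-case hex for display.
const hex = (bytes: number[]): string =>
  bytes.map((b) => b.toString(16).toUpperCase()).join(" ");

console.log(hex(encodeCodePoint(0x41)));    // "41"          ('A')
console.log(hex(encodeCodePoint(0x3A9)));   // "CE A9"       ('Ω')
console.log(hex(encodeCodePoint(0x1D11E))); // "F0 9D 84 9E" ('𝄞')
```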
Diff for: iterable.ts

+45-2
@@ -1,11 +1,54 @@
 /**
- * Converts ReadableStream into async iterables
+ * @module
+ * Provides extensions for `ReadableStream` to enhance its usability in JavaScript environments.
+ * This module includes a function to convert a `ReadableStream` into an asynchronous iterable,
+ * allowing for easier consumption of streamed data in a more modern and convenient syntax.
+ *
+ * This is particularly useful in environments or scenarios where `ReadableStream` does not natively support async iteration.
+ *
+ * @example
+ * ```ts
+ * // Assuming you have a function that returns a ReadableStream, e.g., fetching some data
+ * async function fetchDataAsStream() {
+ *   const response = await fetch('https://example.com/data');
+ *   return response.body; // This is a ReadableStream
+ * }
+ *
+ * // Utilize `getIterableStream` to consume the ReadableStream as an async iterable
+ * async function processStreamData() {
+ *   const stream = await fetchDataAsStream();
+ *   for await (const chunk of getIterableStream(stream)) {
+ *     console.log(chunk); // Process each chunk of data as it's read from the stream
+ *   }
+ * }
+ *
+ * processStreamData();
+ * ```
+ * Consuming a `ReadableStream` of data (e.g., from a network response) using the `getIterableStream` function,
+ * enabling the use of an async for-loop to process the data in chunks as it's received.
+ */
+
+/**
+ * Converts a `ReadableStream` into an async iterable. This allows for easier consumption
+ * of stream data using asynchronous iteration, providing a more modern approach to handling streamed data.
  *
  * Ideally this would already be built into ReadableStream,
  * but it currently is not, so this should help tide things over until
  * JS runtimes support async iterables for ReadableStreams.
  *
- * @param stream ReadableStream to convert into async iterable
+ * @param stream The `ReadableStream` to be converted into an async iterable. This stream can contain any type of data, typically `Uint8Array` for binary data.
+ * @returns An `AsyncIterable` that yields data chunks from the `ReadableStream` as they are read.
+ * @template T The type of data chunks contained within the `ReadableStream`, defaulting to `Uint8Array`.
+ *
+ * @example
+ * ```ts
+ * const responseStream = fetch('https://example.com/data').then(res => res.body);
+ * for await (const chunk of getIterableStream(await responseStream)) {
+ *   console.log(new TextDecoder().decode(chunk)); // Assuming the stream is text data
+ * }
+ * ```
+ * Converting a `ReadableStream` from a fetch request into an async iterable,
+ * and then asynchronously iterating over each chunk of data, decoding and logging the text content.
  */
 export async function* getIterableStream<T = Uint8Array>(stream: ReadableStream<T>): AsyncIterable<T> {
   const reader = stream.getReader();

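Only the first line of the `getIterableStream` body appears in this hunk. For orientation, here is one common way such an adapter can be written with the standard reader API; the name `iterateStream`, the `try`/`finally` cleanup, and the in-memory demo stream are assumptions made for this sketch and may differ from the actual implementation.

```typescript
// Hypothetical adapter: wraps a ReadableStream in an async generator.
async function* iterateStream<T>(stream: ReadableStream<T>): AsyncIterable<T> {
  const reader = stream.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) return; // stream is exhausted
      yield value;
    }
  } finally {
    // Release the lock whether the loop finished, threw, or was exited early.
    reader.releaseLock();
  }
}

// Demo with an in-memory stream, so no network access is needed.
async function demo(): Promise<string[]> {
  const stream = new ReadableStream<string>({
    start(controller) {
      controller.enqueue("hello");
      controller.enqueue("world");
      controller.close();
    },
  });

  const chunks: string[] = [];
  for await (const chunk of iterateStream(stream)) chunks.push(chunk);
  return chunks;
}

demo().then((chunks) => console.log(chunks)); // logs both chunks in order
```

Releasing the lock in `finally` lets another reader attach to the stream if the consumer breaks out of the loop early.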