Commit aa65c31

Rollup merge of #147932 - thaliaarchi:utf8-osstring, r=tgross35
Create UTF-8 version of `OsStr`/`OsString`

Implement a UTF-8 version of `OsStr`/`OsString`, in addition to the existing bytes and WTF-8 platform-dependent encodings. This is applicable for several platforms, but I've currently only implemented it for Motor OS:

- WASI uses Unicode paths, but currently reexports the Unix bytes-assuming `OsStrExt`/`OsStringExt` traits.
  - [wasi:filesystem](https://wa.dev/wasi:filesystem) APIs:
    > Paths are passed as interface-type `strings`, meaning they must consist of a sequence of Unicode Scalar Values (USVs). Some filesystems may contain paths which are not accessible by this API.
  - In WebAssembly/wasi-filesystem#17 (comment), it was decided that applications can use any Unicode transformation format, so we're free to use UTF-8 (and probably already do). This was chosen over specifically UTF-8 or an ad hoc encoding which preserves paths not representable in UTF-8.
    > The current API uses strings for filesystem paths, which contains sequences of Unicode scalar values (USVs), which applications can work with using strings encoded in UTF-8, UTF-16, or other Unicode encodings.
    >
    > This does mean that the API is unable to open files which do not have well-formed Unicode encodings, which may want separate APIs for handling such paths or may want something like the arf-strings proposal, but if we need that we should file a new issue for it.
- As of Redox OS [0.7.0](https://www.redox-os.org/news/release-0.7.0/), "All paths are now required to be UTF-8, and the kernel enforces this". This appears to have been implemented in commit [d331f72f](https://gitlab.redox-os.org/redox-os/kernel/-/commit/d331f72f2a51fa577072f24bc2587829fd87368b) (Use UTF-8 for all paths, 2021-02-14). Redox does not have `OsStrExt`/`OsStringExt`.
- Motor OS guarantees that its OS strings are UTF-8 in its [current `OsStrExt`/`OsStringExt` traits](https://github.com/moturus/rust/blob/a828ffcf5f04be5cdd91b5fad608102eabc17ec7/library/std/src/os/motor/ffi.rs), but they're still internally bytes like Unix.

This is an alternate approach to #147797, which reuses the existing bytes `OsString` and relies on the safety properties of `from_encoded_bytes_unchecked`. Compared to that, this also gains efficiency from propagating the UTF-8 invariant to the whole type, as it never needs to test for UTF-8 validity.

Note that Motor OS currently does not build until #147930 merges.

cc `@tgross35` (for earlier review)
cc `@alexcrichton`, `@rylev`, `@loganek` (for WASI)
cc `@lasiotus` (for Motor OS)
cc `@jackpot51` (for Redox OS)
2 parents 8c40b9c + 7e2b76e commit aa65c31
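
The efficiency point in the commit message (a UTF-8 invariant carried by the type means validity never has to be re-tested) can be illustrated with a minimal sketch. `BytesBuf` and `Utf8Buf` are hypothetical names chosen for this illustration; they are not the types in this commit, and the real `Buf` has a much larger API.

```rust
use std::str;

/// Hypothetical bytes-backed OS string: any byte sequence is allowed,
/// so viewing it as `&str` requires a validity scan on each call.
pub struct BytesBuf {
    inner: Vec<u8>,
}

impl BytesBuf {
    pub fn from_bytes(inner: Vec<u8>) -> Self {
        BytesBuf { inner }
    }

    /// O(n): re-validates the whole buffer every time it is called.
    pub fn to_str(&self) -> Option<&str> {
        str::from_utf8(&self.inner).ok()
    }
}

/// Hypothetical UTF-8-backed OS string: validity is checked once, at
/// construction, and the `String` field carries the invariant afterwards.
pub struct Utf8Buf {
    inner: String,
}

impl Utf8Buf {
    /// The only place where bytes are validated.
    pub fn from_bytes(bytes: Vec<u8>) -> Option<Self> {
        String::from_utf8(bytes).ok().map(|inner| Utf8Buf { inner })
    }

    /// O(1): no scan and no failure case.
    pub fn as_str(&self) -> &str {
        &self.inner
    }
}
```

With the second shape, accessors such as `to_str` or `into_string` can return `Ok(..)` unconditionally, which matches the shape of the `utf8.rs` implementation in this commit.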

3 files changed: +347 -8 lines changed

library/std/src/os/motor/ffi.rs

Lines changed: 13 additions & 8 deletions
@@ -3,35 +3,40 @@

 use crate::ffi::{OsStr, OsString};
 use crate::sealed::Sealed;
+use crate::sys_common::{AsInner, IntoInner};

-/// Motor OS-specific extensions to [`OsString`].
+/// Motor OS‑specific extensions to [`OsString`].
 ///
 /// This trait is sealed: it cannot be implemented outside the standard library.
 /// This is so that future additional methods are not breaking changes.
 pub trait OsStringExt: Sealed {
-    /// Motor OS strings are utf-8, and thus just strings.
-    fn as_str(&self) -> &str;
+    /// Yields the underlying UTF-8 string of this [`OsString`].
+    ///
+    /// OS strings on Motor OS are guaranteed to be UTF-8, so are just strings.
+    fn into_string(self) -> String;
 }

 impl OsStringExt for OsString {
     #[inline]
-    fn as_str(&self) -> &str {
-        self.to_str().unwrap()
+    fn into_string(self) -> String {
+        self.into_inner().inner
     }
 }

-/// Motor OS-specific extensions to [`OsString`].
+/// Motor OS‑specific extensions to [`OsString`].
 ///
 /// This trait is sealed: it cannot be implemented outside the standard library.
 /// This is so that future additional methods are not breaking changes.
 pub trait OsStrExt: Sealed {
-    /// Motor OS strings are utf-8, and thus just strings.
+    /// Gets the underlying UTF-8 string view of the [`OsStr`] slice.
+    ///
+    /// OS strings on Motor OS are guaranteed to be UTF-8, so are just strings.
     fn as_str(&self) -> &str;
 }

 impl OsStrExt for OsStr {
     #[inline]
     fn as_str(&self) -> &str {
-        self.to_str().unwrap()
+        &self.as_inner().inner
     }
 }
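
The sealed extension trait pattern used by `OsStringExt` and `OsStrExt` above can be reproduced outside the standard library roughly as follows. This is a sketch under assumptions: `Utf8OsString` and `Utf8OsStringExt` are hypothetical stand-ins, since `crate::sealed::Sealed` and the inner `Buf` are private to `std`.

```rust
mod sealed {
    /// Supertrait that downstream crates cannot name, so they cannot
    /// implement the extension trait for their own types.
    pub trait Sealed {}
}

/// Hypothetical stand-in for an OS string whose storage is a `String`.
pub struct Utf8OsString {
    inner: String,
}

impl Utf8OsString {
    pub fn new(s: &str) -> Self {
        Utf8OsString { inner: s.to_owned() }
    }
}

impl sealed::Sealed for Utf8OsString {}

/// Hypothetical extension trait. Because it is sealed, adding methods
/// later is not a breaking change for downstream code.
pub trait Utf8OsStringExt: sealed::Sealed {
    /// Infallible: the storage is already a `String`, so there is no
    /// validation step and no `unwrap`.
    fn into_string(self) -> String;
}

impl Utf8OsStringExt for Utf8OsString {
    #[inline]
    fn into_string(self) -> String {
        self.inner
    }
}
```

The point of the change in this file is the same as in the sketch: the new bodies read the inner string directly instead of calling `to_str().unwrap()`.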

library/std/src/sys/os_str/mod.rs

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,10 @@ cfg_select! {
55
mod wtf8;
66
pub use wtf8::{Buf, Slice};
77
}
8+
any(target_os = "motor") => {
9+
mod utf8;
10+
pub use utf8::{Buf, Slice};
11+
}
812
_ => {
913
mod bytes;
1014
pub use bytes::{Buf, Slice};

library/std/src/sys/os_str/utf8.rs

Lines changed: 330 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,330 @@
1+
//! An OsString/OsStr implementation that is guaranteed to be UTF-8.
2+
3+
use core::clone::CloneToUninit;
4+
5+
use crate::borrow::Cow;
6+
use crate::collections::TryReserveError;
7+
use crate::rc::Rc;
8+
use crate::sync::Arc;
9+
use crate::sys_common::{AsInner, FromInner, IntoInner};
10+
use crate::{fmt, mem};
11+
12+
#[derive(Hash)]
13+
#[repr(transparent)]
14+
pub struct Buf {
15+
pub inner: String,
16+
}
17+
18+
#[repr(transparent)]
19+
pub struct Slice {
20+
pub inner: str,
21+
}
22+
23+
impl IntoInner<String> for Buf {
24+
fn into_inner(self) -> String {
25+
self.inner
26+
}
27+
}
28+
29+
impl FromInner<String> for Buf {
30+
fn from_inner(inner: String) -> Self {
31+
Buf { inner }
32+
}
33+
}
34+
35+
impl AsInner<str> for Buf {
36+
#[inline]
37+
fn as_inner(&self) -> &str {
38+
&self.inner
39+
}
40+
}
41+
42+
impl fmt::Debug for Buf {
43+
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
44+
fmt::Debug::fmt(&self.inner, f)
45+
}
46+
}
47+
48+
impl fmt::Display for Buf {
49+
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
50+
fmt::Display::fmt(&self.inner, f)
51+
}
52+
}
53+
54+
impl fmt::Debug for Slice {
55+
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
56+
fmt::Debug::fmt(&self.inner, f)
57+
}
58+
}
59+
60+
impl fmt::Display for Slice {
61+
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
62+
fmt::Display::fmt(&self.inner, f)
63+
}
64+
}
65+
66+
impl Clone for Buf {
67+
#[inline]
68+
fn clone(&self) -> Self {
69+
Buf { inner: self.inner.clone() }
70+
}
71+
72+
#[inline]
73+
fn clone_from(&mut self, source: &Self) {
74+
self.inner.clone_from(&source.inner)
75+
}
76+
}
77+
78+
impl Buf {
79+
#[inline]
80+
pub fn into_encoded_bytes(self) -> Vec<u8> {
81+
self.inner.into_bytes()
82+
}
83+
84+
#[inline]
85+
pub unsafe fn from_encoded_bytes_unchecked(s: Vec<u8>) -> Self {
86+
unsafe { Self { inner: String::from_utf8_unchecked(s) } }
87+
}
88+
89+
#[inline]
90+
pub fn into_string(self) -> Result<String, Buf> {
91+
Ok(self.inner)
92+
}
93+
94+
#[inline]
95+
pub const fn from_string(s: String) -> Buf {
96+
Buf { inner: s }
97+
}
98+
99+
#[inline]
100+
pub fn with_capacity(capacity: usize) -> Buf {
101+
Buf { inner: String::with_capacity(capacity) }
102+
}
103+
104+
#[inline]
105+
pub fn clear(&mut self) {
106+
self.inner.clear()
107+
}
108+
109+
#[inline]
110+
pub fn capacity(&self) -> usize {
111+
self.inner.capacity()
112+
}
113+
114+
#[inline]
115+
pub fn push_slice(&mut self, s: &Slice) {
116+
self.inner.push_str(&s.inner)
117+
}
118+
119+
#[inline]
120+
pub fn push_str(&mut self, s: &str) {
121+
self.inner.push_str(s);
122+
}
123+
124+
#[inline]
125+
pub fn reserve(&mut self, additional: usize) {
126+
self.inner.reserve(additional)
127+
}
128+
129+
#[inline]
130+
pub fn try_reserve(&mut self, additional: usize) -> Result<(), TryReserveError> {
131+
self.inner.try_reserve(additional)
132+
}
133+
134+
#[inline]
135+
pub fn reserve_exact(&mut self, additional: usize) {
136+
self.inner.reserve_exact(additional)
137+
}
138+
139+
#[inline]
140+
pub fn try_reserve_exact(&mut self, additional: usize) -> Result<(), TryReserveError> {
141+
self.inner.try_reserve_exact(additional)
142+
}
143+
144+
#[inline]
145+
pub fn shrink_to_fit(&mut self) {
146+
self.inner.shrink_to_fit()
147+
}
148+
149+
#[inline]
150+
pub fn shrink_to(&mut self, min_capacity: usize) {
151+
self.inner.shrink_to(min_capacity)
152+
}
153+
154+
#[inline]
155+
pub fn as_slice(&self) -> &Slice {
156+
Slice::from_str(&self.inner)
157+
}
158+
159+
#[inline]
160+
pub fn as_mut_slice(&mut self) -> &mut Slice {
161+
Slice::from_mut_str(&mut self.inner)
162+
}
163+
164+
#[inline]
165+
pub fn leak<'a>(self) -> &'a mut Slice {
166+
Slice::from_mut_str(self.inner.leak())
167+
}
168+
169+
#[inline]
170+
pub fn into_box(self) -> Box<Slice> {
171+
unsafe { mem::transmute(self.inner.into_boxed_str()) }
172+
}
173+
174+
#[inline]
175+
pub fn from_box(boxed: Box<Slice>) -> Buf {
176+
let inner: Box<str> = unsafe { mem::transmute(boxed) };
177+
Buf { inner: inner.into_string() }
178+
}
179+
180+
#[inline]
181+
pub fn into_arc(&self) -> Arc<Slice> {
182+
self.as_slice().into_arc()
183+
}
184+
185+
#[inline]
186+
pub fn into_rc(&self) -> Rc<Slice> {
187+
self.as_slice().into_rc()
188+
}
189+
190+
/// Provides plumbing to `Vec::truncate` without giving full mutable access
191+
/// to the `Vec`.
192+
///
193+
/// # Safety
194+
///
195+
/// The length must be at an `OsStr` boundary, according to
196+
/// `Slice::check_public_boundary`.
197+
#[inline]
198+
pub unsafe fn truncate_unchecked(&mut self, len: usize) {
199+
self.inner.truncate(len);
200+
}
201+
202+
/// Provides plumbing to `Vec::extend_from_slice` without giving full
203+
/// mutable access to the `Vec`.
204+
///
205+
/// # Safety
206+
///
207+
/// The slice must be valid for the platform encoding (as described in
208+
/// `OsStr::from_encoded_bytes_unchecked`). For this encoding, that means
209+
/// `other` must be valid UTF-8.
210+
#[inline]
211+
pub unsafe fn extend_from_slice_unchecked(&mut self, other: &[u8]) {
212+
self.inner.push_str(unsafe { str::from_utf8_unchecked(other) });
213+
}
214+
}
215+
216+
impl Slice {
217+
#[inline]
218+
pub fn as_encoded_bytes(&self) -> &[u8] {
219+
self.inner.as_bytes()
220+
}
221+
222+
#[inline]
223+
pub unsafe fn from_encoded_bytes_unchecked(s: &[u8]) -> &Slice {
224+
Slice::from_str(unsafe { str::from_utf8_unchecked(s) })
225+
}
226+
227+
#[track_caller]
228+
#[inline]
229+
pub fn check_public_boundary(&self, index: usize) {
230+
if !self.inner.is_char_boundary(index) {
231+
panic!("byte index {index} is not an OsStr boundary");
232+
}
233+
}
234+
235+
#[inline]
236+
pub fn from_str(s: &str) -> &Slice {
237+
// SAFETY: Slice is just a wrapper over str.
238+
unsafe { mem::transmute(s) }
239+
}
240+
241+
#[inline]
242+
fn from_mut_str(s: &mut str) -> &mut Slice {
243+
// SAFETY: Slice is just a wrapper over str.
244+
unsafe { mem::transmute(s) }
245+
}
246+
247+
#[inline]
248+
pub fn to_str(&self) -> Result<&str, crate::str::Utf8Error> {
249+
Ok(&self.inner)
250+
}
251+
252+
#[inline]
253+
pub fn to_string_lossy(&self) -> Cow<'_, str> {
254+
Cow::Borrowed(&self.inner)
255+
}
256+
257+
#[inline]
258+
pub fn to_owned(&self) -> Buf {
259+
Buf { inner: self.inner.to_owned() }
260+
}
261+
262+
#[inline]
263+
pub fn clone_into(&self, buf: &mut Buf) {
264+
self.inner.clone_into(&mut buf.inner)
265+
}
266+
267+
#[inline]
268+
pub fn into_box(&self) -> Box<Slice> {
269+
let boxed: Box<str> = self.inner.into();
270+
unsafe { mem::transmute(boxed) }
271+
}
272+
273+
#[inline]
274+
pub fn empty_box() -> Box<Slice> {
275+
let boxed: Box<str> = Default::default();
276+
unsafe { mem::transmute(boxed) }
277+
}
278+
279+
#[inline]
280+
pub fn into_arc(&self) -> Arc<Slice> {
281+
let arc: Arc<str> = Arc::from(&self.inner);
282+
unsafe { Arc::from_raw(Arc::into_raw(arc) as *const Slice) }
283+
}
284+
285+
#[inline]
286+
pub fn into_rc(&self) -> Rc<Slice> {
287+
let rc: Rc<str> = Rc::from(&self.inner);
288+
unsafe { Rc::from_raw(Rc::into_raw(rc) as *const Slice) }
289+
}
290+
291+
#[inline]
292+
pub fn make_ascii_lowercase(&mut self) {
293+
self.inner.make_ascii_lowercase()
294+
}
295+
296+
#[inline]
297+
pub fn make_ascii_uppercase(&mut self) {
298+
self.inner.make_ascii_uppercase()
299+
}
300+
301+
#[inline]
302+
pub fn to_ascii_lowercase(&self) -> Buf {
303+
Buf { inner: self.inner.to_ascii_lowercase() }
304+
}
305+
306+
#[inline]
307+
pub fn to_ascii_uppercase(&self) -> Buf {
308+
Buf { inner: self.inner.to_ascii_uppercase() }
309+
}
310+
311+
#[inline]
312+
pub fn is_ascii(&self) -> bool {
313+
self.inner.is_ascii()
314+
}
315+
316+
#[inline]
317+
pub fn eq_ignore_ascii_case(&self, other: &Self) -> bool {
318+
self.inner.eq_ignore_ascii_case(&other.inner)
319+
}
320+
}
321+
322+
#[unstable(feature = "clone_to_uninit", issue = "126799")]
323+
unsafe impl CloneToUninit for Slice {
324+
#[inline]
325+
#[cfg_attr(debug_assertions, track_caller)]
326+
unsafe fn clone_to_uninit(&self, dst: *mut u8) {
327+
// SAFETY: we're just a transparent wrapper around [u8]
328+
unsafe { self.inner.clone_to_uninit(dst) }
329+
}
330+
}
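
The core technique in `utf8.rs` is a `#[repr(transparent)]` newtype over `str`, converted from `&str` and `Arc<str>` without copying or re-validating. The following is a minimal standalone sketch of that idea on stable Rust. `Utf8Slice` and `is_boundary` are hypothetical names for illustration, and the sketch uses raw pointer casts where the commit uses `mem::transmute`.

```rust
use std::sync::Arc;

/// Hypothetical transparent wrapper over `str`, mirroring the shape of `Slice`.
#[repr(transparent)]
pub struct Utf8Slice {
    inner: str,
}

impl Utf8Slice {
    /// Reinterprets a `&str` as a `&Utf8Slice` without copying.
    pub fn from_str(s: &str) -> &Utf8Slice {
        // SAFETY: `Utf8Slice` is `#[repr(transparent)]` over `str`, so both
        // pointers have the same layout and the same length metadata.
        unsafe { &*(s as *const str as *const Utf8Slice) }
    }

    pub fn as_str(&self) -> &str {
        &self.inner
    }

    /// Copies the contents into a new shared allocation.
    pub fn into_arc(&self) -> Arc<Utf8Slice> {
        let arc: Arc<str> = Arc::from(&self.inner);
        // SAFETY: `str` and `Utf8Slice` have identical size and alignment,
        // so the raw pointer can be handed back to `Arc::from_raw`.
        unsafe { Arc::from_raw(Arc::into_raw(arc) as *const Utf8Slice) }
    }

    /// Non-panicking variant of the boundary test: for a UTF-8 string, the
    /// valid split points are exactly the `char` boundaries.
    pub fn is_boundary(&self, index: usize) -> bool {
        self.inner.is_char_boundary(index)
    }
}
```

For the string `"héllo"`, byte index 1 is a boundary and byte index 2 is not, because `é` occupies two bytes in UTF-8. This is the same condition that `check_public_boundary` enforces with a panic.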
