18.3 The String Type
Rust’s String type represents a growable, mutable, owned sequence of UTF-8 encoded text. Its buffer lives on the heap and its memory is managed automatically. It is conceptually similar to Vec<u8>, but specifically designed for string data, with the critical guarantee that its contents are always valid UTF-8.
18.3.1 Understanding String vs. &str
This distinction is fundamental in Rust and often a point of confusion for newcomers:
- String: An owned, heap-allocated buffer containing UTF-8 text. It owns the data it holds. It is mutable (can be modified, e.g., by appending text) and responsible for freeing its memory when it goes out of scope. Think of it like a Vec<u8> specialized for UTF-8.
- &str (string slice): A borrowed, immutable view into a sequence of UTF-8 bytes. It consists of a pointer to the data and a length. It does not own the data it points to. It can refer to part of a String, an entire String, or a string literal embedded in the program’s binary.
  - String literals: Expressions like "hello" in your code have the type &'static str. The 'static lifetime means the reference is valid for the entire duration of the program, because the underlying string data (hello) is embedded directly into the program’s binary data segment and thus lives forever.
  - The str type: You might wonder about str without the &. str itself is the primitive sequence type, but it’s an unsized type (Dynamically Sized Type or DST) because its length isn’t known at compile time. Because variables and function arguments must have a known size, Rust requires that we always interact with str via pointers like &str (a “fat pointer” containing address and length) or Box<str> (an owned pointer). &str is the ubiquitous borrowed form.
You can get an immutable &str slice from a String easily (e.g., &my_string[..], or often implicitly via deref coercion), but converting a &str to an owned String usually involves allocating memory and copying the data (e.g., using .to_string() or String::from()).
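The borrow-versus-own distinction above can be sketched as follows. The helper names shout and first_word are hypothetical, chosen only for illustration; the point is that a function taking &str accepts both a borrowed String and a literal, while going back to String costs an allocation.

```rust
// Accepts any borrowed string slice; callers can pass &String or a literal.
fn shout(text: &str) -> String {
    // to_uppercase allocates and returns a new owned String.
    text.to_uppercase()
}

// Returns the first whitespace-separated word as a slice borrowed from `text`.
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

fn main() {
    let owned: String = String::from("hello world"); // owned, heap-allocated
    let literal: &'static str = "hello world"; // stored in the program binary

    // &String coerces to &str via deref coercion, so both calls compile.
    println!("{}", shout(&owned));
    println!("{}", shout(literal));

    // Borrowing part of the String: no allocation, just pointer + length.
    let word: &str = first_word(&owned);
    println!("{}", word);

    // Converting &str back into a String allocates and copies the bytes.
    let copy: String = word.to_string();
    println!("{}", copy);
}
```

Taking &str in function signatures is usually the more flexible choice, since callers holding either type can use the function without converting first.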
18.3.2 String vs. Vec<u8>
While a String is internally backed by a buffer of bytes (like Vec<u8>), its primary difference is the UTF-8 guarantee. String methods ensure that the byte sequence remains valid UTF-8. If you need to handle arbitrary binary data, raw byte streams, or text in an encoding other than UTF-8, you should use Vec<u8> instead. Attempting to create a String from invalid UTF-8 bytes fails: String::from_utf8 returns an Err (which panics only if you unwrap it), while String::from_utf8_lossy substitutes the replacement character U+FFFD for each invalid sequence.
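As a minimal sketch of the UTF-8 check, the following converts byte vectors with String::from_utf8 and String::from_utf8_lossy, both from the standard library. The decode wrapper is a hypothetical helper used only to keep the example short.

```rust
// Tries to interpret raw bytes as UTF-8 text without panicking.
// Returns None when the bytes are not valid UTF-8.
fn decode(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

fn main() {
    let valid = vec![0xE2, 0x9C, 0x93]; // UTF-8 encoding of U+2713
    let invalid = vec![0xFF, 0xFE, 0x00]; // 0xFF never appears in valid UTF-8

    println!("{:?}", decode(valid));
    println!("{:?}", decode(invalid));

    // The lossy variant never fails; invalid sequences become U+FFFD.
    let lossy = String::from_utf8_lossy(&[b'h', b'i', 0xFF]);
    println!("{}", lossy);
}
```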
18.3.3 Creating and Modifying Strings
    #![allow(unused)]
    fn main() {
        // Create an empty String
        let mut s1 = String::new();

        // Create from a string literal (&str)
        let s2 = String::from("initial content");
        let s3 = "initial content".to_string(); // Equivalent, often preferred style

        // Appending content
        let mut s = String::from("foo");
        s.push_str("bar"); // Appends a &str slice. s is now "foobar"
        s.push('!'); // Appends a single char. s is now "foobar!"
    }
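Appending uses a reallocation strategy similar to Vec’s: when the buffer is full, a larger one is allocated and the contents are moved, so pushing a char is amortized O(1) and push_str is amortized linear in the length of the appended slice.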
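To make the reallocation behaviour visible, here is a small sketch that prints the capacity each time it changes. The exact capacities printed are an implementation detail of the standard library and may differ between versions; repeat_x is a hypothetical helper showing how String::with_capacity avoids the intermediate reallocations.

```rust
// Builds a string of `n` copies of 'x', reserving the space up front
// so the loop does not need to grow the buffer.
fn repeat_x(n: usize) -> String {
    let mut s = String::with_capacity(n);
    for _ in 0..n {
        s.push('x');
    }
    s
}

fn main() {
    let mut s = String::new();
    let mut last = s.capacity();
    for _ in 0..64 {
        s.push('a');
        // Report only the pushes that triggered a reallocation.
        if s.capacity() != last {
            last = s.capacity();
            println!("len {:>2} -> capacity {}", s.len(), last);
        }
    }

    let t = repeat_x(64);
    println!("preallocated: len {}, capacity {}", t.len(), t.capacity());
}
```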
18.3.4 Concatenation
There are several ways to combine strings:
- Using the + operator (via the add method of the Add trait): This operation consumes ownership of the left-hand String and requires a borrowed &str on the right.

      #![allow(unused)]
      fn main() {
          let s1 = String::from("Hello, ");
          let s2 = String::from("world!");
          // s1 is moved here and can no longer be used directly.
          // &s2 works because &String coerces to &str (deref coercion).
          let s3 = s1 + &s2;
          println!("{}", s3); // Prints "Hello, world!"
          // println!("{}", s1); // Compile Error: value used after move
      }

  Because + moves the left operand and each addition may reallocate, chaining multiple additions (s1 + &s2 + &s3 + ...) can be inefficient and verbose.
- Using the format! macro: This is generally the most flexible and readable approach, especially for combining multiple pieces or non-string data. It does not take ownership of its arguments (it borrows them via references) and returns a newly allocated, owned String.

      #![allow(unused)]
      fn main() {
          let name = "Rustacean";
          let level = 99;
          let s1 = String::from("Status: ");
          let greeting = format!("{}{}! Your level is {}.", s1, name, level);
          println!("{}", greeting); // Prints "Status: Rustacean! Your level is 99."
          // s1, name, and level are still usable here because format! borrowed them.
          println!("{} still exists.", s1);
      }
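The two approaches above can be compared side by side. This sketch also shows the slice methods concat and join from the standard library, which the text does not cover but which are handy when all the pieces are already string slices. The function names join_with_plus and join_with_format are hypothetical.

```rust
// Concatenates with `+`: takes ownership of `a`, borrows the rest.
fn join_with_plus(a: String, b: &str, c: &str) -> String {
    a + b + c
}

// Concatenates with format!: borrows everything, allocates a new String.
fn join_with_format(a: &str, b: &str, c: &str) -> String {
    format!("{}{}{}", a, b, c)
}

fn main() {
    let head = String::from("alpha");
    println!("{}", join_with_plus(head, "beta", "gamma"));
    // `head` was moved into join_with_plus and is no longer usable here.

    println!("{}", join_with_format("alpha", "beta", "gamma"));

    // For a slice of &str, concat and join build the result in one call.
    let parts = ["alpha", "beta", "gamma"];
    println!("{}", parts.concat());
    println!("{}", parts.join(", "));
}
```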
18.3.5 UTF-8, Characters, and Indexing
Because String guarantees UTF-8, where a character can occupy 1 to 4 bytes, direct indexing by integer position (s[i]) to get a char is disallowed: String does not implement integer indexing, so the expression fails to compile. A byte index might fall in the middle of a multi-byte character, and the byte found there would not be a valid character on its own.
Instead, Rust provides methods to work with strings correctly:
- Iterating over Unicode scalar values (char):

      #![allow(unused)]
      fn main() {
          let hello = String::from("Здравствуйте"); // Russian "Hello" (multi-byte chars)
          for c in hello.chars() {
              print!("'{}' ", c); // Prints 'З' 'д' 'р' 'а' 'в' 'с' 'т' 'в' 'у' 'й' 'т' 'е'
          }
          println!("\nNumber of chars: {}", hello.chars().count()); // 12 chars
      }
- Iterating over raw bytes (u8):

      #![allow(unused)]
      fn main() {
          let hello = String::from("Здравствуйте");
          for b in hello.bytes() {
              print!("{} ", b); // Prints the underlying UTF-8 bytes (2 bytes per char here)
          }
          println!("\nNumber of bytes: {}", hello.len()); // 24 bytes
      }
- Slicing (&s[start..end]): You can create &str slices using byte indices, but this will panic the current thread if the start or end indices do not fall exactly on UTF-8 character boundaries. Use with caution.

      #![allow(unused)]
      fn main() {
          let s = String::from("hello");
          let h = &s[0..1]; // Ok, slice is "h"

          let multi_byte = String::from("नमस्ते"); // Hindi "Namaste"
          // Each char is 3 bytes: न=bytes 0-2, म=3-5, स=6-8, ्=9-11, त=12-14, े=15-17
          let first_char_slice = &multi_byte[0..3]; // Ok, slice is "न"
          // let bad_slice = &multi_byte[0..1]; // PANIC! 1 is not on a char boundary
      }
For operations sensitive to grapheme clusters (user-perceived characters, such as ‘e’ followed by a combining acute accent), use an external crate such as unicode-segmentation.
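When the byte offsets are not known to be safe, the panic can be avoided with non-panicking standard-library calls. The sketch below uses str::get, str::is_char_boundary, and str::char_indices; take_chars is a hypothetical helper that returns the first n chars without ever splitting one.

```rust
// Returns a slice holding the first `n` chars of `s`.
// If `s` has fewer than `n` chars, the whole string is returned.
fn take_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // byte_idx is the starting byte of the (n+1)-th char: a valid boundary.
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    let word = "Здравствуйте"; // every char here is 2 bytes long

    // get() returns None instead of panicking on a bad boundary.
    println!("{:?}", word.get(0..1));
    println!("{:?}", word.get(0..2));

    // is_char_boundary lets you test an offset before slicing.
    println!("{}", word.is_char_boundary(1));
    println!("{}", word.is_char_boundary(2));

    println!("{}", take_chars(word, 3));
}
```

Note that take_chars counts Unicode scalar values, not grapheme clusters, so it can still separate a base character from a following combining mark.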
18.3.6 Common String Methods
- len() -> usize: Returns the length of the string in bytes (not characters). O(1).
- is_empty() -> bool: Checks if the string has zero bytes. O(1).
- contains(pattern: &str) -> bool: Checks if the string contains a given substring.
- replace(from: &str, to: &str) -> String: Returns a new String with all occurrences of from replaced by to.
- split(pattern) -> Split: Returns an iterator over &str slices separated by a pattern (char, &str, etc.).
- trim() -> &str: Returns a &str slice with leading and trailing whitespace removed.
- as_str() -> &str: Borrows the String as an immutable &str slice covering the entire string. Often done implicitly via deref coercion.
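A short sketch exercising the methods listed above. The fields helper is hypothetical; it combines trim and split to break a comma-separated line into cleaned-up pieces that still borrow from the input.

```rust
// Splits a comma-separated line into trimmed fields.
// The returned slices borrow from `line`; nothing is copied.
fn fields(line: &str) -> Vec<&str> {
    line.trim().split(',').map(|f| f.trim()).collect()
}

fn main() {
    let s = String::from("  Hello, world  ");

    println!("bytes: {}", s.len());
    println!("empty: {}", s.is_empty());
    println!("has 'world': {}", s.contains("world"));

    // replace returns a new String; `s` itself is unchanged.
    let replaced = s.replace("world", "Rust");
    println!("[{}]", replaced);

    // trim returns a slice into `s`, not a new allocation.
    println!("[{}]", s.trim());

    // as_str makes the borrow explicit where coercion does not apply.
    let view: &str = s.as_str();
    println!("{:?}", fields(view));
}
```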
18.3.7 Summary: String vs. C Strings
Traditional C strings (char*, usually null-terminated) present several challenges that Rust’s String and &str system addresses:
- Encoding Ambiguity: C strings lack inherent encoding information. They might be ASCII, Latin-1, UTF-8, or another encoding depending on context and platform. Rust’s String/&str guarantee UTF-8.
- Length Calculation: Finding the length of a C string (strlen) requires scanning for the null terminator (\0), an O(n) operation. Rust’s String stores its byte length, making len() an O(1) operation. &str also includes the length as part of its fat pointer.
- Memory Management: Manual allocation, resizing (malloc/realloc), and copying (strcpy/strcat) in C are common sources of buffer overflows and memory leaks. Rust’s String handles memory automatically and safely.
- Mutability Risks: Modifying C strings in place requires careful buffer management to avoid overflows. String provides safe methods like push_str. &str is immutable, preventing accidental modification through slices.
- Interior Null Bytes: C strings cannot contain null bytes (\0) as they signal termination. Rust Strings can contain \0 like any other valid UTF-8 character (though this is uncommon in text data).
- Null Termination and FFI: Crucially, Rust Strings and &strs are not null-terminated. Passing a pointer from String::as_ptr() or a &str directly to a C function expecting a null-terminated const char* is unsafe and incorrect, as the C code might read past the end of the Rust string’s data. For safe interoperability when passing strings to C, Rust provides std::ffi::CString, which creates an owned, null-terminated byte sequence (checking for and prohibiting interior nulls). Interacting with C strings received from C typically uses std::ffi::CStr. (FFI details are covered elsewhere).
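The null-termination point can be illustrated without calling any C code. This is a minimal sketch using only std::ffi::CString and std::ffi::CStr; to_c_string is a hypothetical wrapper, and a real FFI call would pass the pointer from as_ptr() to an extern function.

```rust
use std::ffi::{CStr, CString};

// Converts Rust text into an owned, null-terminated C string.
// Returns None if the text contains an interior NUL byte.
fn to_c_string(text: &str) -> Option<CString> {
    CString::new(text).ok()
}

fn main() {
    let c = to_c_string("hello").expect("no interior NUL");

    // The terminator is appended by CString; the original &str had none.
    println!("{:?}", c.as_bytes_with_nul());

    // Viewing the same data as a borrowed CStr and converting back to &str.
    let borrowed: &CStr = c.as_c_str();
    println!("{:?}", borrowed.to_str());

    // Interior NUL bytes are rejected rather than silently truncating.
    println!("{}", to_c_string("he\0llo").is_none());
}
```

The CString must outlive any pointer handed to C; letting it drop while C still holds the pointer leaves a dangling pointer.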
String and &str provide a robust, safe, and Unicode-aware system for handling text data, significantly improving upon the limitations and unsafety of traditional C strings, while offering specific mechanisms for safe C interoperability when needed.