Benchmark a custom string normalisation function #139

Merged
wojtek merged 4 commits from 138---benchmark-a-custom-string-normalisation-function into main 2024-02-19 20:56:04 +01:00
Owner

Closes #138
wojtek added 1 commit 2024-02-19 20:40:58 +01:00
Use a representative sample for benchmarking
Some checks failed
Cargo CI / Build and Test (pull_request) Successful in 3m3s
Cargo CI / Lint (pull_request) Failing after 1m12s
a6febda864
wojtek added 1 commit 2024-02-19 20:42:18 +01:00
Remove stray comment
Some checks failed
Cargo CI / Build and Test (pull_request) Successful in 1m41s
Cargo CI / Lint (pull_request) Failing after 1m12s
8cf68b6ea5
wojtek added 1 commit 2024-02-19 20:44:09 +01:00
Clippy lint
All checks were successful
Cargo CI / Build and Test (pull_request) Successful in 1m41s
Cargo CI / Lint (pull_request) Successful in 1m14s
074f539ea7
wojtek added 1 commit 2024-02-19 20:44:51 +01:00
Benches go to the bottom
All checks were successful
Cargo CI / Build and Test (pull_request) Successful in 1m41s
Cargo CI / Lint (pull_request) Successful in 1m14s
bda2059054
wojtek changed title from Use a representative sample for benchmarking to Benchmark a custom string normalisation function 2024-02-19 20:48:51 +01:00
Author
Owner

Detailed history is in #138.

The summary is:

AhoCorasick:

test tui::app::machine::search::benches::is_char_sensitive ... bench:           4 ns/iter (+/- 0)
test tui::app::machine::search::benches::normalize_search  ... bench:          78 ns/iter (+/- 1)

Specialised custom solution:

    fn replace_chars<const LOWERCASE: bool>(original: &str) -> String {
        // `original.len()` is already a byte count, so 4x is a generous over-estimate rather than a tight worst case.
        // Artist names are short, so the extra capacity is cheap.
        let mut new = String::with_capacity(4 * original.len());
        for ch in original.chars() {
            if let Some(index) = SPECIAL.iter().position(|sp| sp == &ch) {
                new.push_str(REPLACE[index]);
            } else if LOWERCASE {
                for lch in ch.to_lowercase() {
                    new.push(lch);
                }
            } else {
                new.push(ch);
            }
        }
        new
    }

test tui::app::machine::search::benches::is_char_sensitive ... bench:           6 ns/iter (+/- 0)
test tui::app::machine::search::benches::normalize_search  ... bench:          82 ns/iter (+/- 1)
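
The snippet above refers to `SPECIAL` and `REPLACE`, which are not shown in this PR. As a minimal sketch, assuming the two tables simply mirror the characters listed in the "String methods only" variant, a self-contained version could look like this. The table contents and their order are an assumption, not the project's actual definitions.

```rust
// Hypothetical lookup tables: SPECIAL and REPLACE are not shown in this PR.
// The entries are inferred from the "String methods only" variant.
// REPLACE[i] is the ASCII replacement for SPECIAL[i].
const SPECIAL: [char; 11] = ['‐', '‒', '–', '—', '―', '−', '‘', '’', '“', '”', '…'];
const REPLACE: [&str; 11] = ["-", "-", "-", "-", "-", "-", "'", "'", "\"", "\"", "..."];

fn replace_chars<const LOWERCASE: bool>(original: &str) -> String {
    // Same capacity choice as the PR snippet: over-allocate once, never regrow.
    let mut new = String::with_capacity(4 * original.len());
    for ch in original.chars() {
        if let Some(index) = SPECIAL.iter().position(|sp| sp == &ch) {
            // Special character: push its ASCII replacement as-is.
            new.push_str(REPLACE[index]);
        } else if LOWERCASE {
            // `to_lowercase` can yield more than one char, so push each one.
            for lch in ch.to_lowercase() {
                new.push(lch);
            }
        } else {
            new.push(ch);
        }
    }
    new
}
```

Under these assumed tables, `replace_chars::<true>("“Weird Al” Yankovic…")` returns `"weird al" yankovic...`, and the `::<false>` variant keeps the original casing.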

String methods only:

    fn replace_chars<const LOWERCASE: bool>(original: &str) -> String {
        let new = original
            .replace(['‐', '‒', '–', '—', '―', '−'], "-")
            .replace(['‘', '’'], "'")
            .replace(['“', '”'], "\"")
            .replace(['…'], "...");

        if LOWERCASE {
            new.to_lowercase()
        } else {
            new
        }
    }

test tui::app::machine::search::benches::is_char_sensitive ... bench:           6 ns/iter (+/- 0)
test tui::app::machine::search::benches::normalize_search  ... bench:         142 ns/iter (+/- 3)
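
The `ns/iter` lines above are in the format printed by the built-in `#[bench]` harness, which is only available on nightly. For a rough sanity check on stable, a hand-rolled timing loop such as the sketch below could be used. `mean_time_per_iter` and `time_normalize` are hypothetical helpers, not part of this repository, and the figures they produce are much coarser than the bench output and not directly comparable to it.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// "String methods only" variant, copied from the comment above.
fn replace_chars<const LOWERCASE: bool>(original: &str) -> String {
    let new = original
        .replace(['‐', '‒', '–', '—', '―', '−'], "-")
        .replace(['‘', '’'], "'")
        .replace(['“', '”'], "\"")
        .replace(['…'], "...");

    if LOWERCASE {
        new.to_lowercase()
    } else {
        new
    }
}

/// Hypothetical helper: calls `f` exactly `iters` times and returns the
/// mean wall-clock time per call.
fn mean_time_per_iter<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

/// Hypothetical helper: times one normalisation of `sample`.
/// `black_box` keeps the optimiser from discarding the work.
fn time_normalize(sample: &str, iters: u32) -> Duration {
    mean_time_per_iter(iters, || {
        black_box(replace_chars::<true>(black_box(sample)));
    })
}
```

Calling `time_normalize` with a string of your choice and a large iteration count gives a per-call estimate. The representative sample used for the numbers in this PR is not shown here, so any input passed in is a stand-in.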
Author
Owner

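The conclusion is to keep AhoCorasick, as it performs best on the representative sample: `normalize_search` takes 78 ns/iter, against 82 ns/iter for the specialised custom solution and 142 ns/iter for the string-methods version.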
wojtek merged commit dcc33d62b1 into main 2024-02-19 20:56:04 +01:00
wojtek deleted branch 138---benchmark-a-custom-string-normalisation-function 2024-02-19 20:56:04 +01:00