Benchmark a custom string normalisation function #138

Closed
opened 2024-02-19 07:58:27 +01:00 by wojtek · 8 comments
Owner

A custom written string normalisation function might be faster than AC simply by being specialised for the data being processed. Things this custom function can assume:

  • lowercasing and replacements can be done at the same time since they don't conflict - what's more it can only be one or other
  • the result string will in almost all cases be the same length or less as the original string as the characters go towards ASCII and only the ellipsis creates more characters (check how many bytes is ellipsis character vs three dots)
A custom written string normalisation function might be faster than AC simply by being specialised for the data being processed. Things this custom function can assume: - lowercasing and replacements can be done at the same time since they don't conflict - what's more it can only be one or other - the result string will in almost all cases be the same length or less as the original string as the characters go towards ASCII and only the ellipsis creates more characters (check how many bytes is ellipsis character vs three dots)
wojtek added the
enhancement
label 2024-02-19 15:41:09 +01:00
Author
Owner
    fn replace_chars<const LOWERCASE: bool>(original: &str) -> String {
        // UTF8 chars are max 4 bytes. Artist names are short, so just allocate for the worst case.
        let mut new = String::with_capacity(4 * original.len());
        for ch in original.chars() {
            if let Some(index) = SPECIAL.iter().position(|sp| sp == &ch) {
                new.push_str(REPLACE[index]);
            } else {
                if LOWERCASE {
                    for lch in ch.to_lowercase() {
                        new.push(lch)
                    }
                } else {
                    new.push(ch);
                }
            }
        }
        new
    }
```rust fn replace_chars<const LOWERCASE: bool>(original: &str) -> String { // UTF8 chars are max 4 bytes. Artist names are short, so just allocate for the worst case. let mut new = String::with_capacity(4 * original.len()); for ch in original.chars() { if let Some(index) = SPECIAL.iter().position(|sp| sp == &ch) { new.push_str(REPLACE[index]); } else { if LOWERCASE { for lch in ch.to_lowercase() { new.push(lch) } } else { new.push(ch); } } } new } ```
Author
Owner

Running with AhoCorasick

test tui::app::machine::search::benches::is_char_sensitive_alphanumeric ... bench:           3 ns/iter (+/- 0)
test tui::app::machine::search::benches::is_char_sensitive_utf8         ... bench:          16 ns/iter (+/- 0)                                                                        
test tui::app::machine::search::benches::normalize_search_alphanumeric  ... bench:          77 ns/iter (+/- 1)
test tui::app::machine::search::benches::normalize_search_utf8          ... bench:         265 ns/iter (+/- 4)

Running with custom code above

test tui::app::machine::search::benches::is_char_sensitive_alphanumeric ... bench:           7 ns/iter (+/- 0)
test tui::app::machine::search::benches::is_char_sensitive_utf8         ... bench:          17 ns/iter (+/- 0)
test tui::app::machine::search::benches::normalize_search_alphanumeric  ... bench:          80 ns/iter (+/- 1)
test tui::app::machine::search::benches::normalize_search_utf8          ... bench:         205 ns/iter (+/- 3)
Running with AhoCorasick ``` test tui::app::machine::search::benches::is_char_sensitive_alphanumeric ... bench: 3 ns/iter (+/- 0) test tui::app::machine::search::benches::is_char_sensitive_utf8 ... bench: 16 ns/iter (+/- 0) test tui::app::machine::search::benches::normalize_search_alphanumeric ... bench: 77 ns/iter (+/- 1) test tui::app::machine::search::benches::normalize_search_utf8 ... bench: 265 ns/iter (+/- 4) ``` Running with custom code above ``` test tui::app::machine::search::benches::is_char_sensitive_alphanumeric ... bench: 7 ns/iter (+/- 0) test tui::app::machine::search::benches::is_char_sensitive_utf8 ... bench: 17 ns/iter (+/- 0) test tui::app::machine::search::benches::normalize_search_alphanumeric ... bench: 80 ns/iter (+/- 1) test tui::app::machine::search::benches::normalize_search_utf8 ... bench: 205 ns/iter (+/- 3) ```
Author
Owner

Not sure how to explain tui::app::machine::search::benches::is_char_sensitive_alphanumeric... It seems consistent.

In the most likely case, i.e., alphanumeric - the performance is similar. In the most unlikely and difficult case, i.e. UTF8, the custom solution wins by about 20%.

Not sure how to explain `tui::app::machine::search::benches::is_char_sensitive_alphanumeric`... It seems consistent. In the most likely case, i.e., alphanumeric - the performance is similar. In the most unlikely and difficult case, i.e. UTF8, the custom solution wins by about 20%.
Author
Owner

A quick test using just "Heaven's Basement" revealed a definite advantage for the custom solution. A better sample is needed.

A quick test using just "Heaven's Basement" revealed a definite advantage for the custom solution. A better sample is needed.
Author
Owner

Using the artist list from 2024-02-19 the results are as follows:

Running with AhoCorasick:

test tui::app::machine::search::benches::is_char_sensitive ... bench:           4 ns/iter (+/- 0)
test tui::app::machine::search::benches::normalize_search  ... bench:          78 ns/iter (+/- 1)

Running with custom solution:

test tui::app::machine::search::benches::is_char_sensitive ... bench:           6 ns/iter (+/- 0)
test tui::app::machine::search::benches::normalize_search  ... bench:          82 ns/iter (+/- 1)
Using the artist list from 2024-02-19 the results are as follows: Running with AhoCorasick: ``` test tui::app::machine::search::benches::is_char_sensitive ... bench: 4 ns/iter (+/- 0) test tui::app::machine::search::benches::normalize_search ... bench: 78 ns/iter (+/- 1) ``` Running with custom solution: ``` test tui::app::machine::search::benches::is_char_sensitive ... bench: 6 ns/iter (+/- 0) test tui::app::machine::search::benches::normalize_search ... bench: 82 ns/iter (+/- 1) ```
Author
Owner

Still don't know what's up with test tui::app::machine::search::benches::is_char_sensitive since that method is untouched...

Still don't know what's up with `test tui::app::machine::search::benches::is_char_sensitive` since that method is untouched...
Author
Owner

Conclusion is that on the most likely input which is mostly alphanumeric with some non-ASCII characters and the occasional special character AhoCorasick is ever so slightly better. An additional dependence in exchange for increased confidence of handling future edge cases.

Conclusion is that on the most likely input which is mostly alphanumeric with some non-ASCII characters and the occasional special character AhoCorasick is ever so slightly better. An additional dependence in exchange for increased confidence of handling future edge cases.
wojtek added this to the v0.2.0 milestone 2024-02-19 20:26:24 +01:00
Author
Owner

For final reference, an approximation of the original method which used just string methods:

    fn replace_chars<const LOWERCASE: bool>(original: &str) -> String {
        let new = original
            .replace(['‐', '‒', '–', '—', '―', '−'], "-")
            .replace(['‘', '’'], "'")
            .replace(['“', '”'], "\"")
            .replace(['…'], "...");

        if LOWERCASE {
            new.to_lowercase()
        } else {
            new
        }
    }

Performs with:

test tui::app::machine::search::benches::is_char_sensitive ... bench:           6 ns/iter (+/- 0)
test tui::app::machine::search::benches::normalize_search  ... bench:         142 ns/iter (+/- 3)
For final reference, an approximation of the original method which used just string methods: ```rust fn replace_chars<const LOWERCASE: bool>(original: &str) -> String { let new = original .replace(['‐', '‒', '–', '—', '―', '−'], "-") .replace(['‘', '’'], "'") .replace(['“', '”'], "\"") .replace(['…'], "..."); if LOWERCASE { new.to_lowercase() } else { new } } ``` Performs with: ``` test tui::app::machine::search::benches::is_char_sensitive ... bench: 6 ns/iter (+/- 0) test tui::app::machine::search::benches::normalize_search ... bench: 142 ns/iter (+/- 3) ```
wojtek added reference 138---benchmark-a-custom-string-normalisation-function 2024-02-19 20:40:37 +01:00
Sign in to join this conversation.
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: wojtek/musichoard#138
No description provided.