annotate _posts/2024-06-09-schism-unicode-and-you.html @ 118:503e22dd6cf5

blog: add (unfinished) series on OMS I'll update this as I do more research into the inner workings of OMS. It's much more interesting (and more convoluted) than ASIO is unfortunately, but it means the blog posts will probably be more interesting
author Paper <paper@tflc.us>
date Sun, 19 Oct 2025 23:15:02 -0400
parents 60f77a3de847
children
rev   line source
85
52d59a351bf5 add post about unicode in schism
Paper <paper@paper.us.eu.org>
parents:
diff changeset
1 ---
2 layout: post
3 author: Paper
4 title: 'Schism Tracker, Unicode, and you'
86
1fed81c848a5 html: add plugs
Paper <paper@paper.us.eu.org>
parents: 85
diff changeset
5 nowplaying: 'Holy Fuck - LP'
85
6 ---
7 <span>Recently I've taken on adding real Unicode-awareness to Schism, and it was <i>surprisingly</i> easy, to say the least.</span>
8 <br><br>
9 <span>I was expecting to have to convert lots of things to real Unicode, but nope! All that really needed to be done was to convert UTF-8 to CP437 where necessary to actually <i>draw</i> the data, while keeping the internal form pure UTF-8, and then bundle everything up into a neat macro to keep everything consistent:</span>
87
60f77a3de847 css: improve appearance on mobile
Paper <paper@paper.us.eu.org>
parents: 86
diff changeset
10 <figure><pre class="code-block"><code>#define CHARSET_EASY_MODE_EX(MOD, in, inset, outset, x) \
85
11 do { \
12 MOD uint8_t* out; \
13 charset_error_t err = charset_iconv(in, (uint8_t**)&out, inset, outset); \
14 if (err) \
15 out = in; \
16 \
17 x \
18 \
19 if (!err) \
20 free((uint8_t*)out); \
21 } while (0)
22 </code></pre></figure>
23 <span>I just shoved this macro anywhere necessary and it works perfectly fine for loading any Unicode path. For example, the Spanish word "mañana" gets displayed correctly now:</span>
24 <br><br>
25 <img class="drop-shadow-box center-image" src="/media/blog/schism-spanish-file-listing.png">
26 <br>
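<span>To make the convert-or-fall-back pattern of the macro a bit more concrete, here is a minimal, self-contained sketch. It is not Schism's code: <code>toy_utf8_to_cp437</code>, <code>TOY_EASY_MODE</code> and <code>toy_drawn_length</code> are hypothetical names made up for illustration, and the converter is a stand-in for <code>charset_iconv</code> that only understands ASCII and "ñ".</span>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* HYPOTHETICAL stand-in for charset_iconv(): UTF-8 to CP437, but it only
 * knows ASCII and U+00F1. Returns 0 on success, nonzero on failure. */
static int toy_utf8_to_cp437(const uint8_t *in, uint8_t **out)
{
	size_t len = strlen((const char *)in), o = 0;
	/* one CP437 byte per character, so the output never outgrows the input */
	uint8_t *buf = malloc(len + 1);

	if (!buf)
		return 1;

	for (size_t i = 0; i < len; i++) {
		if (in[i] < 0x80) {
			buf[o++] = in[i];
		} else if (in[i] == 0xC3 && in[i + 1] == 0xB1) {
			buf[o++] = 0xA4; /* n with tilde lives at 0xA4 in CP437 */
			i++;
		} else {
			free(buf);
			return 1; /* anything else is unknown to this toy */
		}
	}

	buf[o] = '\0';
	*out = buf;
	return 0;
}

/* Same shape as the macro above: convert, fall back to the input on error,
 * run the block `x` with `out` in scope, then free only what was allocated. */
#define TOY_EASY_MODE(in, x) \
	do { \
		uint8_t *converted_ = NULL; \
		int err_ = toy_utf8_to_cp437((in), &converted_); \
		const uint8_t *out = err_ ? (in) : converted_; \
		x \
		if (!err_) \
			free(converted_); \
	} while (0)

/* Number of bytes that would be handed to the drawing code. */
static size_t toy_drawn_length(const uint8_t *utf8)
{
	size_t n = 0;

	assert(utf8 != NULL);
	TOY_EASY_MODE(utf8, {
		n = strlen((const char *)out);
	});
	return n;
}
```

<span>With this sketch, "mañana" is 7 bytes of UTF-8 going in and 6 bytes of CP437 coming out, while a string the toy converter can't handle is simply passed through untouched.</span>
<br><br>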
27 <span>The file sorting algorithms were a different beast though, and even now strverscmp doesn't have a real charset-independent variant. For strcasecmp, I had to implement (simple) Unicode case folding, which meant having a <a class="prettylink" href="https://github.com/schismtracker/schismtracker/blob/b858a5917ee7e83f7cb4da1ad698dd24159f241b/schism/charset_data.c#L183">switch statement that is almost 1500 lines long</a> and takes up about 20K of space in the binary.</span>
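<br><br>
<span>As a rough idea of what a case-folding comparison looks like, here is a minimal sketch. It is an assumption-laden toy, not the Schism implementation: <code>toy_casefold</code>, <code>toy_utf8_next</code> and <code>toy_casecmp</code> are hypothetical names, the fold table only covers ASCII plus three hand-picked code points, and the decoder assumes well-formed UTF-8.</span>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simple (one-to-one) case folding for a tiny, hand-picked subset of Unicode.
 * A real table has to cover far more code points than this. */
static uint32_t toy_casefold(uint32_t c)
{
	if (c >= 'A' && c <= 'Z')
		return c + 32;

	switch (c) {
	case 0x00D1: return 0x00F1; /* capital N with tilde -> small */
	case 0x0391: return 0x03B1; /* Greek capital alpha -> small alpha */
	case 0x03C2: return 0x03C3; /* Greek final sigma -> small sigma */
	default:     return c;
	}
}

/* Decodes one code point from well-formed UTF-8 and advances *s past it.
 * No validation is done; malformed input is out of scope for this sketch. */
static uint32_t toy_utf8_next(const uint8_t **s)
{
	const uint8_t *p = *s;
	uint32_t c = *p++;
	int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : (c >= 0xC0) ? 1 : 0;

	if (extra)
		c &= 0x3F >> extra; /* strip the length bits from the lead byte */
	while (extra-- && *p)
		c = (c << 6) | (*p++ & 0x3F);

	*s = p;
	return c;
}

/* strcasecmp-like comparison over folded code points:
 * negative, zero or positive, like the C library function. */
static int toy_casecmp(const char *a, const char *b)
{
	const uint8_t *pa = (const uint8_t *)a, *pb = (const uint8_t *)b;

	assert(a && b);

	for (;;) {
		uint32_t ca = *pa ? toy_casefold(toy_utf8_next(&pa)) : 0;
		uint32_t cb = *pb ? toy_casefold(toy_utf8_next(&pb)) : 0;

		if (ca != cb)
			return (ca < cb) ? -1 : 1;
		if (!ca)
			return 0;
	}
}
```

<span>Here "MAÑANA" and "mañana" compare equal because both fold to the same sequence of code points before comparison. The switch is the part that balloons once every cased character has to be listed.</span>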
28 <br><br>
29 <span>Schism currently does not do any Unicode normalization when comparing strings. This is primarily a problem with decomposed strings (which will likely not get converted properly), though filenames like that probably shouldn't exist anyway...</span>
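<br><br>
<span>To show why the missing normalization matters, here is a small sketch of the same word in precomposed and decomposed form. The names <code>toy_precomposed</code>, <code>toy_decomposed</code> and <code>toy_same_bytes</code> are hypothetical and only exist for this example.</span>

```c
#include <assert.h>
#include <string.h>

/* "ñ" as a single code point, U+00F1 (UTF-8: C3 B1) */
static const char toy_precomposed[] = "ma\xC3\xB1" "ana";

/* "n" followed by U+0303 COMBINING TILDE (UTF-8: CC 83) */
static const char toy_decomposed[] = "man\xCC\x83" "ana";

/* Byte-wise equality, which is all a comparison without normalization sees. */
static int toy_same_bytes(const char *a, const char *b)
{
	assert(a && b);
	return strcmp(a, b) == 0;
}
```

<span>Both strings render as the same word, but they are 7 and 8 bytes long respectively and never compare equal byte for byte, so without normalization they are treated as two different names.</span>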
30 <br><br>
31 <span>Anyway, Unicode is easy; if you can't use it properly, it's a skill issue :p</span>