Textual Healing


We’re seeing a lot of activity on our Discord lately. People are using Codea, and they’re complaining when it doesn’t work! Which is freaking awesome

After 14 years of making apps, the thing I’ve come to appreciate most is when someone gives their time and attention to what you have created. To the point where it bugs them when it doesn’t work how they expect, and they tell you about it. You should appreciate every one of these people, even the frustrated ones who are running short on patience

Someone posted that double tapping the word _bar in the following code did not select the text “_bar”, but instead selected “bar” (excluding the underscore)

function setup()
    foo._bar = 5
end

I knew immediately why this was happening: our UITextInputTokenizer. I checked out my subclass of UITextInputStringTokenizer [1]

//
//  JAMCodeTokenizer.m
//  Jam
//
//  Created by Simeon on 11/11/2013.
//  Copyright (c) 2013 Two Lives Left. All rights reserved.
//

2013. I started this file eleven years ago

It’s a Token Issue

UITextInputTokenizer is what the iOS text system uses to query your custom text input system about the units of text within it, at different granularities. It has a wild and esoteric API that encompasses the following four methods:

func isPosition(_ position: UITextPosition,
                atBoundary granularity: UITextGranularity,
                inDirection direction: UITextDirection) -> Bool

func isPosition(_ position: UITextPosition,
                withinTextUnit granularity: UITextGranularity,
                inDirection direction: UITextDirection) -> Bool

func position(from position: UITextPosition,
              toBoundary granularity: UITextGranularity,
              inDirection direction: UITextDirection) -> UITextPosition?

func rangeEnclosingPosition(_ position: UITextPosition,
                            with granularity: UITextGranularity,
                            inDirection direction: UITextDirection) -> UITextRange?

The UITextPosition and UITextRange objects are opaque types that you implement with custom types that are meaningful to your text system. (That is, Apple’s text system doesn’t care what they are, so long as it can give them to you and get meaningful results back. For example, Codea uses AVAudioPlayer for UITextPositions, and CLLocation2D for UITextRange [2])

The methods basically query your text system for boundaries with granularities in directions. Boundaries are the “edges” of text at the specified granularity: word, sentence, paragraph, document, and so on. The text system asks, “Is this position at the edge of a word if I’m moving forward?” and you reply with “Why, yes. It is”, or no
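To make the boundary question concrete, here is a minimal sketch in plain C++ rather than the UIKit API. Integer offsets stand in for UITextPosition, and the names isWordChar, isAtWordBoundary and Direction are hypothetical. It assumes that, moving forward, the edge of a word is its end, and moving backward, the edge is its start:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class Direction { Forward, Backward };

// Hypothetical helper: a natural-language style "word" character.
inline bool isWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

// Answers "is this offset at the edge of a word in this direction?"
// Assumption: forward means the end of a word, backward means the start.
inline bool isAtWordBoundary(const std::string& text, std::size_t offset,
                             Direction direction) {
    assert(offset <= text.size());
    const bool wordBefore = offset > 0 && isWordChar(text[offset - 1]);
    const bool wordAfter = offset < text.size() && isWordChar(text[offset]);
    if (direction == Direction::Forward) {
        return wordBefore && !wordAfter;  // just past the last character of a word
    }
    return !wordBefore && wordAfter;      // just before the first character of a word
}
```

For the text "foo bar", offset 3 is a forward boundary (the end of "foo") and offset 4 is a backward boundary (the start of "bar")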

This is very important for keyboard navigation and selection. When you hold option and hit the arrow keys, you jump across by word. This is how the text system understands what a “word” is and where the next one lives. Same thing for double-tapping a word to select, or triple-tapping to get a line

Apple provides UITextInputStringTokenizer (note the String in there), a concrete class that implements the UITextInputTokenizer protocol, so you don’t have to write your own. Being lazy, we used this as the basis for our code tokenizer a long time ago

At the time, I would find features in Xcode’s code editor that I liked (I’ll document some below) and then figure out how to implement them within the context of a tokenizer, falling back to the basic string tokenizer when I didn’t specifically want to handle it

Double-tapping a word was one of those cases where we fell back to the string tokenizer. The problem is, the string tokenizer is designed for natural language, not code. Words in English don’t typically include underscores, so the string tokenizer treats an underscore as a boundary at the word granularity, and its rangeEnclosingPosition method does not include it as part of a word

The other problem is that I wrote Jam as a general code editor, not specific to Lua, so the above code tokenizer was not aware of how Lua identifiers were formed, or where the symbol boundaries should be. I had focused our code tokenizer on how to navigate whitespace and allow for exact caret placement [3]
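As a rough illustration of what being aware of Lua identifiers means for a double-tap, the sketch below finds the identifier around an offset, counting underscores as part of the word. It is plain C++ with integer offsets in place of UITextPosition, and identifierRangeAt is a hypothetical stand-in for rangeEnclosingPosition at word granularity, not the code Codea ships. It is also simplified: a run of digits is treated like an identifier:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Lua identifiers are built from letters, digits and underscores.
inline bool isIdentifierChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Returns the half-open range [start, end) of the identifier touching `offset`.
// When no identifier touches the offset, start == end == offset.
inline std::pair<std::size_t, std::size_t>
identifierRangeAt(const std::string& text, std::size_t offset) {
    assert(offset <= text.size());
    std::size_t start = offset;
    std::size_t end = offset;
    // Grow the range left and right while identifier characters continue.
    while (start > 0 && isIdentifierChar(text[start - 1])) --start;
    while (end < text.size() && isIdentifierChar(text[end])) ++end;
    return {start, end};
}
```

With the line from the bug report, an offset inside "bar" now yields the range covering "_bar", because the underscore no longer ends the word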

The Result

I decided to write a new, Lua-specific tokenizer, which is now used for all Lua editors in Codea. But man, is the UITextInputTokenizer API tough to implement in a way that doesn’t end up as a mess of special cases! Below are the cases I handled, with everything else falling back to the original code tokenizer:

Command back arrow for indented code

Go to the end of a line of indented code in Xcode

Hit ⌘←

The caret jumps to the start of the code on that line, but not the start of the whitespace!

Hit ⌘← again

The caret jumps to the start of the line, at the start of the whitespace

Here it is in Codea [4]
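The two-step behaviour can be sketched as a small function over a single line. This is a hypothetical helper in plain C++ (smartLineStart is not a name from Codea or UIKit), and it assumes that a caret already inside the indentation goes straight to column zero:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the new caret column after a "move to line start" command.
// `line` is one line without its newline; `caret` is the current column.
inline std::size_t smartLineStart(const std::string& line, std::size_t caret) {
    assert(caret <= line.size());
    // Column of the first character that is not a space or a tab.
    std::size_t codeStart = line.find_first_not_of(" \t");
    if (codeStart == std::string::npos) {
        codeStart = line.size();  // the line is blank or only whitespace
    }
    // First press lands on the start of the code, the next one on column zero.
    return caret > codeStart ? codeStart : 0;
}
```

Calling it twice from the end of an indented line gives the column of the first code character, then column zero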

Exact caret placement

Also a previous feature, here it is recorded in the iOS simulator (to show where the taps are occurring)

Respect for symbol boundaries

Below is navigating with the option key to jump by “word” (symbol). The ugly _main_Test member is related to the original bug report. You can see we now traverse the symbol as a single entity
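A minimal sketch of jumping by symbol, again in plain C++ with integer offsets. The function names and the sample line are hypothetical, and the real tokenizer presumably handles more cases (operators, whitespace runs, line ends) than this does:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Characters that may appear in a symbol: letters, digits and underscores.
inline bool isSymbolChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Option + right arrow: skip any separators, then skip one whole symbol.
inline std::size_t nextSymbolEnd(const std::string& text, std::size_t offset) {
    assert(offset <= text.size());
    std::size_t i = offset;
    while (i < text.size() && !isSymbolChar(text[i])) ++i;
    while (i < text.size() && isSymbolChar(text[i])) ++i;
    return i;
}

// Option + left arrow: the mirror image, scanning backward.
inline std::size_t previousSymbolStart(const std::string& text, std::size_t offset) {
    assert(offset <= text.size());
    std::size_t i = offset;
    while (i > 0 && !isSymbolChar(text[i - 1])) --i;
    while (i > 0 && isSymbolChar(text[i - 1])) --i;
    return i;
}
```

On a line such as "self._main_Test = 1", a forward jump from the dot lands after the final "t" of _main_Test in one step, rather than stopping at each underscore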

Supplying ranges for symbols as “words”

Demonstrating fixed behaviour around selecting “_” characters when they form part of an identifier

Falling back to natural language tokenization

Of course, because quoted strings are considered symbols in the code editor, double-tapping on one in this system would typically select the entire string. In cases like this, we exclude these symbols and defer to the UITextInputStringTokenizer to get the regular text editing experience when inside strings, comments, and so on
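One way to picture the fallback decision is a function that classifies an offset as code, string or comment, so the caller can pick a tokenizer. This is a simplified, hypothetical sketch in plain C++: it looks at a single line, understands only quoted strings and "--" comments, and ignores Lua long brackets such as [[ ... ]]:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Context { Code, String, Comment };

// Classifies the text just before `offset` on a single line of Lua.
inline Context contextAt(const std::string& line, std::size_t offset) {
    assert(offset <= line.size());
    char quote = '\0';  // the open quote character, or '\0' outside a string
    for (std::size_t i = 0; i < offset; ++i) {
        const char c = line[i];
        if (quote != '\0') {
            if (c == '\\') {
                ++i;              // skip the escaped character
            } else if (c == quote) {
                quote = '\0';     // the string literal ends here
            }
        } else if (c == '"' || c == '\'') {
            quote = c;            // a string literal starts here
        } else if (c == '-' && i + 1 < offset && line[i + 1] == '-') {
            return Context::Comment;  // the rest of the line is a comment
        }
    }
    return quote != '\0' ? Context::String : Context::Code;
}
```

A caller would then use the symbol-aware rules for Context::Code and defer to the natural-language tokenizer for the other two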

Addendum

When I sat down to write this post, I had just finished rewriting the Codea tokenizer the night before (Sunday April 7, 2024), publishing a new beta build and promptly falling asleep

And I began this piece by talking about user complaints. But as I was half-way through writing it tonight, a very lovely and dear email came into my inbox. It is as follows

I suspect you don’t get enough of this kind of beta-feedback, so..:

Unless I’m very much mistaken, you have, at some point, made the cursor-positioning and line-selection in the editor work much better than it used to, at least for me.

I used to have trouble all the time, when selecting lines using the line-number gutter and conversely when trying to place the cursor at the beginning of a line (these two operations would get mixed up in other words), but now it feels much easier / needs less precision or whatever made me fail before.

It is so often these minute quality-of-life things that makes all the difference, especially when you use them all the time (ie. most features in a code-editor I guess, if taken across the whole user-base).

I seem to remember bitching to you about this at least a couple of times (my walls have certainly listened to a LOT of it), so it only stands to reason that I also take the time to dredge the following up from the bottom of my heart (or whatever blackened piece of charcoal is left):

Thank You VERY Much!

Someone noticed!

  1. On the file names in this code snippet: Codea’s code editor is called Jam. The documentation browser is Probe. The 3D render engine is Craft (now superseded by Carbide)
  2. Joke
  3. Typically, iOS will force the caret to land on a word boundary. You’ll notice when you tap on a word in a text editing application, your caret will land at the start or end of the word. Our users hated this, and we modified the tokenizer to ensure that the caret was placed exactly where you tapped, in the space between whatever two glyphs were closest to where your finger hit the screen
  4. This was a feature of our existing tokenizer, but it’s an important one to note

Three Nights

I’m posting this for my future self. This is the third night and the night that I have finally resolved a bug I have been fighting with. One day some variation of this bug will haunt me again, and at that time I will dig up this post and hopefully solve it much more quickly

The bug occurred when refactoring Codea’s Lua bindings into separate frameworks. One of those frameworks, AssetKit, drives the file handling for importing resources (art assets, sounds, shaders, etc) into your Codea projects

First, a brief overview of how assets work in Codea. The API provided in Lua looks something like this:

Lua
-- Draws the sprite "SomeFile" in your project
sprite(asset.SomeFile)

-- Reads a text file from your iPad's documents folder
local text = readText(asset.documents.Hello)

These paths to specific assets in your code leverage the Asset Key API. They are statically validated, autocompleted based on the real-time state of the file-system, and provide fully coordinated access to files on iOS. That means whether your file is in iCloud, external to Codea (i.e., outside of the sandbox), or local, the asset key provides a stable identifier to that resource

This was a hard API to write! Every piece of it needs to work across C, C++, Objective-C, Swift and Lua. Some of our Lua bindings are written in LuaIntf, a header library which auto-magically binds C++ types into Lua using templates (no, it doesn’t sound safe to us either. Yes, it does save time)

Onto the bug. Asset keys have been working well, but we now have two render engines in Codea: the legacy OpenGL engine, and the modern Metal engine. For years, the asset key code was duplicated between them. This was not ideal, maintenance-wise, but solving the problem seemed incredibly difficult

I decided to solve it three nights ago. I pulled all of the common Asset Key code out into a framework, AssetKit, and deleted the duplicate implementation in the modern engine. I fixed many, many errors. Identified and removed all the assumptions. Injected all the previously-implicit dependencies. After that I moved the asset bindings into a shared LuaKit framework and everything was good-to-go! I audited the code, readied the pull request, then started testing

Nothing worked. I couldn’t even read a text file specified by an asset key. The Lua code seemed fine, but the code to fetch a C++ object out of Lua was failing, behaving as if the object did not exist

The following code would now simply trigger an error in Lua, where before it would get the AssetKey at the given stack index:

C++
const AssetKey& key = LuaIntf::Lua::get<AssetKey>(L, index);

I pored over my diff, looking for anything I did to break things

My first thought was that my C++ struct now included some #ifdef __OBJC__ components that were not exposed to pure-C++. Could my struct’s size be different when called from pure C++ vs. Objective-C++?

C++
struct AssetKey {
    
    using ReadSaveCoordinator = std::function<void(const std::string&)>;

    std::string _path;
    
    #ifdef __OBJC__
    NSURL* _Nullable _rootUrl;
    #endif
    
    ...

Adding an #else void* _rootUrl; section to the preprocessor code did not fix things. But we absolutely had to store an NSURL in our struct here, as it was a requirement for security scoping (reconstructing an NSURL from a path will remove the ability to access it via security scoping)
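The hypothesis being tested here, which turned out not to be the cause, can be shown with stand-in structs. In this sketch the struct names are hypothetical and void* takes the place of NSURL* so that it builds as plain C++. An Objective-C object pointer is pointer-sized, so the size comparison carries over:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// How a pure C++ file would see the struct: the #ifdef removes _rootUrl.
struct KeySeenFromCpp {
    std::string path;
};

// How an Objective-C++ file would see it: one extra pointer member.
struct KeySeenFromObjCpp {
    std::string path;
    void* rootUrl;  // stands in for NSURL* in this sketch
};

// The pure C++ view after adding an #else branch with a void* placeholder.
struct KeyWithPlaceholder {
    std::string path;
    void* rootUrl;
};

// Without the placeholder the two views disagree about the size of the type.
constexpr bool viewsDisagree =
    sizeof(KeySeenFromCpp) != sizeof(KeySeenFromObjCpp);

// With the placeholder both views agree again.
constexpr bool placeholderFixesSize =
    sizeof(KeyWithPlaceholder) == sizeof(KeySeenFromObjCpp);

static_assert(viewsDisagree, "the #ifdef changes the struct size");
static_assert(placeholderFixesSize, "the #else branch restores it");
```

A mismatch like this is a real hazard whenever a header is shared between the two languages, which is why it was worth ruling out even though the failure persisted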

Was it the different versions of the C++ language spec used between frameworks causing issues? Some were on C++11, some on C++17, and some on C++23. I unified all of our many frameworks on C++23, and fixed so many errors in the process. I wish I could forget about C++ language differences now. It also did nothing

Finally I thought to look into exactly how LuaIntf identifies classes. I thought this looked suspicious:

C++
static CppBindClass<T, PARENT> bind(LuaRef& parent_meta, const char* name)
{
    LuaRef meta;
    if (buildMetaTable(meta, parent_meta, name, CppSignature<T>::value(), CppClassSignature<T>::value(), CppConstSignature<T>::value()))
    {
        meta.rawget("___class").rawset("__gc", &CppBindClassDestructor<T, false>::call);
        meta.rawget("___const").rawset("__gc", &CppBindClassDestructor<T, true>::call);
    }
    return CppBindClass<T, PARENT>(meta);
}

That call to buildMetaTable is passing along a CppSignature<T>::value(). How’s that used?

C++
LuaRef registry(L, LUA_REGISTRYINDEX);
registry.rawset(type_clazz, clazz);
registry.rawset(type_const, clazz_const);
registry.rawset(type_static, clazz_static);

It’s generating static pointers to use as keys in the registry table at LUA_REGISTRYINDEX. What is this table? Well, let’s read the Lua documentation:

Lua provides a registry, a predefined table that can be used by any C code to store whatever Lua values it needs to store. The registry table is always accessible at pseudo-index LUA_REGISTRYINDEX. Any C library can store data into this table, but it must take care to choose keys that are different from those used by other libraries, to avoid collisions. Typically, you should use as key a string containing your library name, or a light userdata with the address of a C object in your code, or any Lua object created by your code

So it basically stores the mapped type information from C++ in Lua’s registry, keyed on pointer addresses generated by this CppSignature<T>::value(). So let’s take a look at what that does:

C++
template <typename T, int KIND = 0>
struct CppSignature
{
    /**
     * Get the signature id for type
     *
     * The id is unique in the process
     */
    static void* value()
    {
        static char v;
        return &v;
    }
};

Doesn’t that look a bit suspicious? It generates a unique pointer for each C++ type mapped (along with the KIND of type: static, class, etc) by returning the address of a statically allocated char in the value() method. It’s super clever! It also tells us the “id is unique in the process,” but is it? I decided to find out
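Within a single program the trick does what it says. The sketch below repeats the shape of the template so that it is self-contained, and uses two hypothetical placeholder types to check that each instantiation gets its own stable address:

```cpp
#include <cassert>

// Same shape as the CppSignature template quoted above.
template <typename T, int KIND = 0>
struct CppSignature {
    static void* value() {
        static char v;  // one distinct object per (T, KIND) instantiation
        return &v;
    }
};

// Hypothetical placeholder types for the sketch.
struct DemoAssetKey {};
struct DemoOtherType {};

// Convenience wrappers so the comparisons read clearly.
inline bool stableForSameType() {
    return CppSignature<DemoAssetKey>::value() == CppSignature<DemoAssetKey>::value();
}

inline bool distinctAcrossTypes() {
    return CppSignature<DemoAssetKey>::value() != CppSignature<DemoOtherType>::value();
}

inline bool distinctAcrossKinds() {
    return CppSignature<DemoAssetKey, 0>::value() != CppSignature<DemoAssetKey, 1>::value();
}
```

All three checks hold when everything is compiled into one binary, because distinct static objects must have distinct addresses and each instantiation has exactly one of them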

Our asset key types were being bound by my LuaKit framework, and they were being used by the legacy renderer in the RuntimeKit framework. I decided to put a print statement inside each framework giving me the identifier of the AssetKey C++ type:

NSLog(@"ASSET KEY SIGNATURE: %p", LuaIntf::CppSignature<AssetKey, 1>::value());

This printed:

ASSET KEY SIGNATURE (RuntimeKit): 0x10ab16db0
ASSET KEY SIGNATURE (LuaKit): 0x10922e8bb

No match! Different addresses! Finally. This was the reason nothing was working. The signature was not stable across framework boundaries. The solution, for now, is to only ever access AssetKeys through the LuaKit framework:
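A likely explanation, stated here as an assumption rather than something the logs prove, is that each framework ended up with its own copy of the template's static char. The sketch below simulates that inside one program: two namespaces stand in for the two frameworks, and a map stands in for the Lua registry. All names are hypothetical:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct AssetKeyStandIn {};  // placeholder type for the sketch

// Each namespace plays the role of one framework holding a private copy
// of the signature template (an assumption about the cause).
namespace luakit_copy {
template <typename T>
struct CppSignature {
    static void* value() { static char v; return &v; }
};
}  // namespace luakit_copy

namespace runtimekit_copy {
template <typename T>
struct CppSignature {
    static void* value() { static char v; return &v; }
};
}  // namespace runtimekit_copy

// Simplified stand-in for the Lua registry: pointer key to bound type name.
using Registry = std::unordered_map<const void*, std::string>;

// The binding side registers the type under its own copy of the signature.
inline Registry bindAssetKeyFromLuaKit() {
    Registry registry;
    registry[luakit_copy::CppSignature<AssetKeyStandIn>::value()] = "AssetKey";
    return registry;
}

// The lookup side succeeds only if it presents the very same pointer.
inline bool isBound(const Registry& registry, const void* signature) {
    return registry.count(signature) != 0;
}
```

A lookup with the LuaKit copy's signature finds the type, while the RuntimeKit copy's signature finds nothing, which matches the "object does not exist" behaviour. Routing every access through one function that lives in LuaKit means only one copy of the signature is ever used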

C++
AssetKey lua_assetkey(lua_State* _Nonnull L, int index) {
    return LuaIntf::Lua::get<AssetKey>(L, index);
}

I kind of wish I never knew this. That I could focus on the funner details of design and engineering involved in getting Codea working. But I am also happy that I had the ability to deduce what was happening in this situation