Three Nights

I’m posting this for my future self. This is the third night, and the night I finally resolved a bug I have been fighting. One day some variation of this bug will haunt me again, and when it does I will dig up this post and hopefully solve it much more quickly.

The bug appeared while I was refactoring Codea’s Lua bindings into separate frameworks. One of those frameworks, AssetKit, drives the file handling for importing resources (art assets, sounds, shaders, etc.) into your Codea projects.

First, a brief overview of how assets work in Codea. The API provided in Lua looks something like this:

Lua
-- Draws the sprite "SomeFile" in your project
sprite(asset.SomeFile)

-- Reads a text file from your iPad's documents folder
local text = readText(asset.documents.Hello)

These paths to specific assets in your code leverage the Asset Key API. They are statically validated, autocompleted based on the real-time state of the file-system, and provide fully coordinated access to files on iOS. That means whether your file is in iCloud, external to Codea (i.e., outside of the sandbox), or local, the asset key provides a stable identifier to that resource.

This was a hard API to write! Every piece of it needs to work across C, C++, Objective-C, Swift and Lua. Some of our Lua bindings are written in LuaIntf, a header library which auto-magically binds C++ types into Lua using templates (no, it doesn’t sound safe to us either. Yes, it does save time).

On to the bug. Asset keys have been working well, but we now have two render engines in Codea: the legacy OpenGL engine, and the modern Metal engine. For years, the asset key code was duplicated between them. This was not ideal, maintenance-wise, but solving the problem seemed incredibly difficult.

I decided to solve it three nights ago. I pulled all of the common Asset Key code out into a framework, AssetKit, and deleted the duplicate implementation in the modern engine. I fixed many, many errors. Identified and removed all the assumptions. Injected all the previously implicit dependencies. After that I moved the asset bindings into a shared LuaKit framework and everything was good-to-go! I audited the code, readied the pull request, then started testing.

Nothing worked. I couldn’t even read a text file specified by an asset key. The Lua code seemed fine, but the code to fetch a C++ object out of Lua was failing, behaving as if the object did not exist.

The following code would now simply trigger an error in Lua, where before it would get the AssetKey at the given stack index:

C++
const AssetKey& key = LuaIntf::Lua::get<AssetKey>(L, index);

I pored over my diff, looking for anything I had done that could have broken things.

My first thought was that my C++ struct now included some #ifdef __OBJC__ members that were not visible to pure C++. Could my struct’s size be different when compiled as pure C++ vs. Objective-C++?

C++
struct AssetKey {
    
    using ReadSaveCoordinator = std::function<void(const std::string&)>;

    std::string _path;
    
    #ifdef __OBJC__
    NSURL* _Nullable _rootUrl;
    #endif
    
    ...

Adding an #else branch declaring void* _rootUrl; to the preprocessor block did not fix things. But we absolutely had to store an NSURL in our struct here, as it was a requirement for security scoping (reconstructing an NSURL from a path removes the ability to access it via security scoping).
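The size worry is easy to make concrete. Here is a minimal sketch, not Codea’s code: the two structs are hypothetical stand-ins for what a pure C++ file and an Objective-C++ file would each see of the same type when a member is guarded by #ifdef __OBJC__, with void* standing in for NSURL* so it builds without Foundation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the struct as a pure C++ file sees it:
// the __OBJC__-guarded member is compiled out.
struct KeySeenByCpp {
    std::string _path;
};

// Hypothetical stand-in for the struct as an Objective-C++ file sees it.
// void* replaces NSURL* so the sketch builds without Foundation.
struct KeySeenByObjCpp {
    std::string _path;
    void* _rootUrl;
};

// Number of bytes by which the two views of the "same" type disagree.
inline std::size_t layoutMismatchBytes() {
    return sizeof(KeySeenByObjCpp) - sizeof(KeySeenByCpp);
}

// The views differ by at least one pointer, so code compiled against one
// view would misread objects created by code compiled against the other.
inline void checkLayoutsDiffer() {
    assert(layoutMismatchBytes() >= sizeof(void*));
}
```

That mismatch is why the #else branch was worth trying: it keeps both views the same size. It just was not the bug this time.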

Was it the different versions of the C++ language standard used between frameworks causing issues? Some were on C++11, some on C++17, and some on C++23. I unified all of our many frameworks on C++23, and fixed so many errors in the process. I wish I could forget about C++ language differences now. Unifying them also did nothing.

Finally I thought to look into exactly how LuaIntf identifies classes. This looked suspicious:

C++
static CppBindClass<T, PARENT> bind(LuaRef& parent_meta, const char* name)
{
    LuaRef meta;
    if (buildMetaTable(meta, parent_meta, name, CppSignature<T>::value(), CppClassSignature<T>::value(), CppConstSignature<T>::value()))
    {
        meta.rawget("___class").rawset("__gc", &CppBindClassDestructor<T, false>::call);
        meta.rawget("___const").rawset("__gc", &CppBindClassDestructor<T, true>::call);
    }
    return CppBindClass<T, PARENT>(meta);
}

That call to buildMetaTable is passing along a CppSignature<T>::value(). How’s that used?

C++
LuaRef registry(L, LUA_REGISTRYINDEX);
registry.rawset(type_clazz, clazz);
registry.rawset(type_const, clazz_const);
registry.rawset(type_static, clazz_static);

It’s using static pointers as keys in the Lua registry, the table found at pseudo-index LUA_REGISTRYINDEX. What is this table? Well, let’s read the Lua documentation:

Lua provides a registry, a predefined table that can be used by any C code to store whatever Lua values it needs to store. The registry table is always accessible at pseudo-index LUA_REGISTRYINDEX. Any C library can store data into this table, but it must take care to choose keys that are different from those used by other libraries, to avoid collisions. Typically, you should use as key a string containing your library name, or a light userdata with the address of a C object in your code, or any Lua object created by your code

So it basically stores the mapped type information from C++ in Lua’s registry, keyed on pointer addresses generated by this CppSignature<T>::value(). So let’s take a look at what that does:

C++
template <typename T, int KIND = 0>
struct CppSignature
{
    /**
     * Get the signature id for type
     *
     * The id is unique in the process
     */
    static void* value()
    {
        static char v;
        return &v;
    }
};

Doesn’t that look a bit suspicious? It generates a unique pointer for each C++ type mapped (along with the KIND of type: static, class, etc.) by returning the address of a statically allocated char in the value() method. It’s super clever! It also tells us the “id is unique in the process,” but is it? I decided to find out.
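To see what the trick promises when everything lives in one binary, here is a standalone sketch of the same idea. It is a re-creation for illustration, not LuaIntf’s code: TypeTag, Apple and Pear are hypothetical names.

```cpp
#include <cassert>

// Hypothetical re-creation of the address-of-a-static trick.
template <typename T, int KIND = 0>
struct TypeTag {
    static void* value() {
        static char v;  // one object per (T, KIND) instantiation in this binary
        return &v;
    }
};

// Two unrelated types to tag.
struct Apple {};
struct Pear {};

// Within a single binary the tag is stable for a given type and kind,
// and distinct for every other type or kind.
inline void checkTagsWithinOneBinary() {
    assert(TypeTag<Apple>::value() == TypeTag<Apple>::value());
    assert(TypeTag<Apple>::value() != TypeTag<Pear>::value());
    assert((TypeTag<Apple, 0>::value() != TypeTag<Apple, 1>::value()));
}
```

Inside one binary this behaves exactly as advertised. The open question is what happens when the same template is instantiated in more than one dynamically loaded framework.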

Our asset key types were being bound by my LuaKit framework, and they were being used by the legacy renderer in the RuntimeKit framework. I decided to put a print statement inside each framework giving me the identifier of the AssetKey C++ type:

NSLog(@"ASSET KEY SIGNATURE: %p", LuaIntf::CppSignature<AssetKey, 1>::value());

This printed:

ASSET KEY SIGNATURE (RuntimeKit): 0x10ab16db0
ASSET KEY SIGNATURE (LuaKit): 0x10922e8bb

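No match! Different addresses! Finally. This was the reason nothing was working. Each framework evidently had its own copy of that static char, so the signature was not stable across framework boundaries: LuaKit registered AssetKey under one address and RuntimeKit looked it up under another. The solution, for now, is to only ever access AssetKeys through the LuaKit framework: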

C++
AssetKey lua_assetkey(lua_State* _Nonnull L, int index) {
    return LuaIntf::Lua::get<AssetKey>(L, index);
}
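Why routing every lookup through one framework helps can be shown with a small simulation. This is a sketch under stated assumptions, not the real binding code: a std::unordered_map stands in for the Lua registry, and the two namespaces (hypothetical names) stand in for two frameworks that each ended up with a private copy of the signature static.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for the Lua registry: a table keyed by pointer identity.
using Registry = std::unordered_map<const void*, std::string>;

// Each namespace simulates one framework's private copy of the
// signature static. The names are hypothetical.
namespace lua_kit {
inline const void* assetKeySignature() { static char v; return &v; }
}
namespace runtime_kit {
inline const void* assetKeySignature() { static char v; return &v; }
}

// Binding side: the type is registered under lua_kit's address.
inline void bindAssetKey(Registry& registry) {
    registry[lua_kit::assetKeySignature()] = "AssetKey";
}

// Lookup side: succeeds only if the caller presents the address used to bind.
inline bool isBound(const Registry& registry, const void* signature) {
    return registry.count(signature) != 0;
}

// A lookup with lua_kit's signature hits; one with runtime_kit's misses.
inline void checkLookupAcrossFrameworks() {
    Registry registry;
    bindAssetKey(registry);
    assert(isBound(registry, lua_kit::assetKeySignature()));
    assert(!isBound(registry, runtime_kit::assetKeySignature()));
}
```

A wrapper like lua_assetkey means the lookup is always compiled into the same framework that did the binding, so both sides present the same address.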

I kind of wish I never knew this. That I could focus on the funner details of design and engineering involved in getting Codea working. But I am also happy that I had the ability to deduce what was happening in this situation.