Kalt: Dockerizing TensorFlow for Accessibility

(K)elleen’s Alternative Text

TLDR; Here’s the link to the extension. After loading in Firefox, unzip and run install.sh with sudo.

Last Friday my wife mentioned offhand that she wanted some help setting up a program. Naturally I was curious and excited when I found out that she wanted help spinning up an AI to help her caption images for accessibility purposes. This felt delightfully futuristic and I was thrilled to take up the challenge.

A High-End Braille Reader

I know from a developer perspective that accessibility is not always treated as a first-class citizen during content creation. People tend to take the path of least resistance, and if I can build a tool that makes it easier for developers to keep the internet accessible, people who would normally be excluded can join the conversation.

The guide that my wife was trying to use relied on Azure’s Machine Learning program to spin up and train a competitor of 2015’s Show and Tell captioning competition. I’m far from a machine learning expert, but I know that training is an intensive process that requires more data, time, and expertise than I possess. Plus the guide wanted her to rent out an NC6. That’s 6 cores, 56 gigs of RAM, a tanky SSD and nearly a thousand USD per month!

Azure NC6 is Serious Hardware

But the seed of auto-generated captions had been planted. I dug up the GitHub repo that the guide was using and took a look. Turns out they offered a pretrained model which lowers the computing cost to something that’s feasible on any laptop. Once I sorted the Python build of that repo with the pre-built model, I asked my wife:

  • “What would make generating captions as easy for you as possible?”

She told me that in her normal workflow she would use Firefox to gather images for publishing, write some text relating to the image and then have to repeat the process to generate the alternative text. So in her ideal world, she would be able to pick her image, start writing her blurb relating to that image, and have the alternative text magically appear without having to reiterate her description.

Firefox Native Extensions

Way back in the day I made a Firefox extension that scanned your RFID card and entered a password mapped to that card as a strange sort of password manager. My dim memories from that time pointed me towards the native extension feature of Firefox extensions that allows communication with a local application. If I used the Show and Tell model as the native application for a Firefox extension, I thought I could create a caption generation solution that would fit naturally into my wife’s workflow.

Right Click Context Menu

I could hook into the browser’s existing right click context menus for a relatively seamless experience. My wife could find an image that she wanted to use for a post, right click it to kick off the caption generation, and write a post while the caption was generating. All that’s needed for a Firefox extension to communicate with a native application is a manifest.json file placed in a special location and the appropriate privileges.

Structure of a Native Firefox Extension

.
├── containers
│   └── Dockerfile
├── containers.zip
├── Dockerfile
├── go.sh
├── icons
│   └── icon.png
├── im2txt.js
├── im2txt.json
├── install.sh
├── manifest.json
├── README.md
└── run_inference_wrapper.py

The docs state that for Firefox to recognize a native application, the developer must place a JSON manifest file named after the native application (here, im2txt.json) in a special folder. This manifest has some metadata, but the important part is a path to the actual application, which will be called by the extension’s background script.
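To make the manifest layout concrete, here is a minimal Python sketch that builds one. The field names (name, description, path, type, allowed_extensions) are the ones Mozilla documents for native messaging manifests. The extension ID, description text, and install path are placeholder assumptions for illustration, not necessarily the values Kalt ships with.

```python
import json


def build_native_manifest(app_name, host_path, extension_id):
    """Return a native messaging manifest as a plain dict.

    extension_id is a placeholder here; it has to match the ID that the
    extension declares in its own manifest.json.
    """
    return {
        "name": app_name,                      # the file is saved as <app_name>.json
        "description": "Caption images with a local Show and Tell model",
        "path": host_path,                     # absolute path to the native application
        "type": "stdio",                       # messages travel over stdin/stdout
        "allowed_extensions": [extension_id],  # extensions allowed to connect
    }


if __name__ == "__main__":
    manifest = build_native_manifest(
        "im2txt",
        "/usr/lib/mozilla/native-messaging-hosts/go.sh",
        "kalt@example.org",  # hypothetical extension ID (assumption)
    )
    print(json.dumps(manifest, indent=2))
```

The path entry is the piece that matters most: it is what Firefox executes when the extension opens a connection to the native application.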

The interesting thing about native extensions is that Mozilla does not give developers much guidance when it comes to how they should approach the install process. As a developer, it’s up to you to find a good solution for reliable cross-platform builds of the native application.

Docker: Repeatable Builds for All Platforms

Without a formal mechanism to have Firefox run code for installation, I was forced to improvise to create a stable and reliable install process for a native application.

One mature project in the repeatable development space that I didn’t have much experience with was Docker. I’d done some configuration review of Docker applications for work and saw this as a perfect opportunity to provide a repeatable and low maintenance install process.

I started by creating the following image as the base for the Show and Tell model:


FROM python:3.6.9

WORKDIR /usr/src/kalt

COPY im2txt_model_parameters.tar.gz .

# Setup pre-trained model
RUN apt-get update && apt-get install -y tar zip wget npm python3-pip && \
    wget https://github.com/HughKu/Im2txt/archive/master.zip &&  \
    unzip master.zip && mv Im2txt-master Im2txt && \
    mkdir -p Im2txt/im2txt/model/Hugh/train && \
    # Extract the pre-built model
    tar xvzf im2txt_model_parameters.tar.gz && \
    # Move the model to the right place (the target directory was created above)
    mv modelParameters/* Im2txt/im2txt/model/Hugh/train/ && \
    # Install build requirements
    npm install -g @bazel/bazelisk && \
    bazelisk

#Build AI engine
WORKDIR /usr/src/kalt/Im2txt/
RUN pip3 install --no-cache-dir -r requirement.txt
WORKDIR /usr/src/kalt/Im2txt/im2txt/
RUN bazel build run_inference

From a brief reading of the recommended developer guidance for Docker, it was clear that every RUN instruction in a Dockerfile adds another layer to the image, and that build performance can vary with the number of layers. Here you can see that I chained a number of the setup commands together to avoid any performance issues related to the number of layers.

The gist here is that I expect that the pretrained model will be in the same directory as the Dockerfile and move it into the image at build time. This will have to be taken care of by a separate pre-build script.

Next we download the source of the Show and Tell model, distribute the checkpoint files to the correct location and build the captioning agent using Bazel. There’s something amusing about running a repeatable build program in a repeatable build program.

Containers, but briefly

Once we have this base image created, we need to create a container. The best guidance that I can find about creating containers says that we should try to keep them as ephemeral as possible. Using the above image for most of our hard configuration we are freed to create this slim Dockerfile for our container:

# Download And Analyze A User Provided Image Using KALT

FROM kalt

VOLUME "/var/log/kalt"

WORKDIR /usr/src/kalt/Im2txt/

# Source URL of the file to run through the engine
# (no spaces around "=", otherwise the value would begin with a stray "= ")
ENV kalt="https://maxfieldchen.com/images/profile.jpg"

#Run the engine to generate captions
ENTRYPOINT wget $kalt -O im2txt/data/images/toParse.jpeg; bazel-bin/im2txt/run_inference ...

Here we can use a single ENTRYPOINT instruction in this container Dockerfile to launch the previously defined kalt image, generate some captions, return the output on stdout, and clean up the launched container when the command is complete. This gives us one single executable to call from the Firefox native manifest (very convenient).

As a new Docker user it was a little confusing to figure out whether I should be using the ARG or ENV instruction to take user input, since my general programmer intuition tells me that I should be using ARG for arguments. The TLDR is that you should probably use ENV for runtime arguments and ARG for anything that you know at build time: ARG values exist only while the image is being built, while ENV values persist into the running container and so can affect commands executed later.
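As an illustration of passing a runtime argument through ENV, here is a minimal Python sketch of how a wrapper could start the captioning container. The image name caption and the variable name kalt come from the files above. The --rm flag and the function names are assumptions for this sketch, not necessarily what run_inference_wrapper.py does.

```python
import subprocess


def caption_command(image_url, docker_bin="docker", image="caption"):
    """Build the argv for a one-shot captioning container.

    --rm removes the container once it exits, and -e overrides the
    default value of the `kalt` ENV variable baked into the image.
    """
    return [docker_bin, "run", "--rm", "-e", f"kalt={image_url}", image]


def run_caption(image_url):
    """Run the container and return whatever it printed on stdout."""
    result = subprocess.run(
        caption_command(image_url),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Building the command as a list rather than a shell string means the image URL is never interpreted by a shell, which matters when the URL comes from an arbitrary web page.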

Wrapping it all Together

Once I had the image and container put together, I wrote a shell script to get the pre-built model in the current directory, build the docker image and move the native extension files to their position in the magic location prescribed by Mozilla.


#!/bin/bash

function loring () {
    printf "=====================\n"
    printf "$1\n"
    printf "=====================\n"
}

# Make sure that you have docker, wget, and python before you run this, okay?
# This script is meant to be run from the directory that it shipped in, if you move
# it, then you'll have a lot of paths to update. Your call.
if ! hash docker &> /dev/null
then
    loring "Looks like you don't have docker, but that's no problem friend, I'll just go ahead and grab it for you. My treat, I insist."
    wget -O - https://get.docker.com/ | bash
fi

if ! hash python3 &> /dev/null
then
    loring "You don't seem to have python3 installed there partner. Go take a gander around https://www.python.org/. \n Now you come back real soon, you hear?"
    exit 1
fi

function getModel () {
    loring "Fetching us a prebuilt model so we don't have to sit on our butts."
    loring "That said, it's a quarter gig, so this may take a moment..."
    wget "https://maxfieldchen.com/raw/im2txt_model_parameters.tar.gz"
    loring "Done!, hopefully you're still with me."
}

# If the model is not already here, download it.
[ -f "./im2txt_model_parameters.tar.gz" ] || getModel

# Builds docker image / container

loring "Building the AI jail now..."
docker build . -t kalt && \
docker build ./containers/ -t caption
loring "AI's are secure. Probably."

# Linux by default, how about that...
nativeManifestPath='/usr/lib/mozilla/native-messaging-hosts/'
manifestName='im2txt.json'

if [ "$(uname)" == "Darwin" ]; then
    nativeManifestPath='/Library/Application Support/Mozilla/NativeMessagingHosts/'
fi

# Installs native trampoline and manifest files
# Python script will call the caption docker container
mkdir -p "$nativeManifestPath"
chmod +x ./go.sh
# Quote the destination: the macOS path contains a space
cp ./go.sh "$nativeManifestPath"
cp ./run_inference_wrapper.py "$nativeManifestPath"

# Set the path for the native file now that we know our platform
# (python3 is the interpreter checked for above; the path already ends in "/")
python3 -c "import sys; sys.stdout.write(sys.stdin.read().replace('FIXME', \"${nativeManifestPath}go.sh\"))" < im2txt.json > "$nativeManifestPath$manifestName"

loring "Install the manifest.json file to your browser."

exit 0

Although I would have preferred that Mozilla had more support for installation of native extensions, using a single build script with Dockerized containers has proven to be a pretty solid user experience. The only hurdle that I’ve run into so far was related to OS X being unable to find the docker path when called from the native manifest file. This was easily solved by using the full path instead of the relative path.
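For readers who hit the same path problem, here is a minimal Python sketch of one way to resolve an absolute path to docker before calling it. The fallback locations and the function name are assumptions for illustration. The underlying guess, that a process launched by the browser may see a shorter PATH than an interactive shell, is also an assumption rather than something I confirmed.

```python
import os
import shutil

# Common install locations; these are assumptions, not an exhaustive list.
FALLBACK_DOCKER_PATHS = ("/usr/local/bin/docker", "/usr/bin/docker")


def find_docker(which=shutil.which, is_executable=None):
    """Return an absolute path to the docker binary, or None if not found.

    `which` and `is_executable` are injectable so the lookup can be tested
    without docker being installed.
    """
    if is_executable is None:
        is_executable = lambda path: os.access(path, os.X_OK)
    found = which("docker")  # succeeds when PATH includes docker's directory
    if found:
        return os.path.abspath(found)
    for candidate in FALLBACK_DOCKER_PATHS:  # otherwise probe known locations
        if is_executable(candidate):
            return candidate
    return None
```

The returned path can then be used in place of the bare docker command name when launching the container.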

Inter-Extension Communication

So now that I can run the Show and Tell model at will, I need to feed it input from the browser. To generate captions we need to add a context menu to images the user clicks on, download that image, and send it to the docker container. After the docker container runs, we need to take the output and display it to the user (I chose to show it to them in a notification that copies to their clipboard on interaction).

Component Diagram

I broke these high level tasks into two components: a background script (im2txt.js) that has privileges to add to Firefox’s right click context menu and retrieve the URL of the selected image, and a Python script (run_inference_wrapper.py) that will download the image and run it through the Docker container, feeding the output back to the background script.
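The two components talk over Firefox’s native messaging protocol: each message is UTF-8 encoded JSON preceded by a 32-bit length in native byte order, sent over the native application’s stdin and stdout. Here is a minimal Python sketch of that framing. The function names and the {"url": ...} message shape are assumptions for illustration, not taken from run_inference_wrapper.py.

```python
import io
import json
import struct


def read_message(stream):
    """Read one length-prefixed JSON message; return None at end of input."""
    raw_length = stream.read(4)
    if len(raw_length) < 4:
        return None
    (length,) = struct.unpack("=I", raw_length)  # 4 bytes, native byte order
    return json.loads(stream.read(length).decode("utf-8"))


def write_message(stream, payload):
    """Write payload as UTF-8 JSON preceded by its 32-bit length."""
    body = json.dumps(payload).encode("utf-8")
    stream.write(struct.pack("=I", len(body)))
    stream.write(body)
    stream.flush()


if __name__ == "__main__":
    # Round-trip through an in-memory buffer instead of real stdin/stdout.
    buffer = io.BytesIO()
    write_message(buffer, {"url": "https://example.com/cat.jpg"})
    buffer.seek(0)
    print(read_message(buffer))
```

In a real native application the streams would be sys.stdin.buffer and sys.stdout.buffer, since the length prefix is binary data rather than text.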

Firefox was having some trouble launching Python for reasons that I’m still unable to debug. As a workaround I used a tiny shell script (go.sh) which just runs the Python script in a subshell. This seemed to fix the pathing issue, but it is definitely something that I want to improve since it’s a bit of a hack.

Going Official & Results

I was able to load my extension on multiple machines using the temporary extension functionality but every time I closed the browser my extension would be unloaded. To avoid this I needed to get my extension signed by Mozilla.

The Signing Process

This process was painless. I just bundled up all my files into a tar archive, uploaded it to the developer hub, and in around ten minutes my extension was approved. They returned a signed .xpi archive, and now my extension can be loaded into browsers for more than the span of a single session.

I started this project on a Friday and had the binary signed and production ready on the following Saturday. Mozilla has done us all a huge service by exposing these APIs, and it was really cool to see how a little bit of code can improve your workflow and automate simple tasks.

If you liked this article feel free to check out the source and give it a try for yourself! Just remember to unzip the archive and run the install script with sudo so the native application is configured. If projects like this are interesting to you, shoot me an email.

I’d love to chat with you.