Deploying Databricks Dolly as an API on Azure Functions

Colby T. Ford, Ph.D.
Apr 23, 2023 · 7 min read

Obviously, ChatGPT has taken the world by storm and has certainly taken over my LinkedIn feed. While OpenAI’s GPT models are state-of-the-art, many open-source large language models (LLMs) have been popping up as competition.

Azure OpenAI is a deployable version of the OpenAI API in the cloud that includes some privacy and security offerings that will make your IT people happy. However, the cost of using GPT-3 or GPT-4 across your entire company may be a little prohibitive, which we’ll discuss in a bit.

Databricks has entered the chat, pun intended, with Dolly, an open-source LLM that can be used commercially or be retrained on your own data. A goal of Dolly is to allow research and commercial organizations to use LLMs without paying for API access or sharing data with third parties. So, all you pay for is the compute infrastructure to house it.

“A realistic sheep animal wearing a large blonde Dolly Parton wig” — Generated by BlueWillow AI

You can read more about Dolly 2.0 here: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

Today, I’ll quickly show you how to deploy Dolly 2.0, for testing in its most basic form, as an Azure Function App.

Building the Docker Image

For this post, I’ve provided a Docker image, Dockerfile, and companion logic for creating an Azure Function Docker image with the 3B Dolly 2.0 model inside. You can either use my image as-is or rebuild with your own customizations.

GitHub Repo: https://github.com/colbyford/dolly-on-azure-functions

Option 1: Use The Image As-Is

If you clone the repo, you’ll see a script called GenerateText/instruct_pipeline.py that defines the input prompt and expected response. Note that the default behavior is for Dolly to return a response that completes the requested input task. This is a bit vague, but good for general-purpose use.

INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"
INTRO_BLURB = (
"Below is an instruction that describes a task. Write a response that appropriately completes the request."
)
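
For context, the pipeline stitches these pieces together into the full prompt roughly like this (a paraphrase of the template; see instruct_pipeline.py in the repo for the exact format):

## Roughly how the prompt is assembled before being sent to the model
PROMPT_FORMAT = """{intro}

{instruction_key}
{instruction}

{response_key}
"""

prompt = PROMPT_FORMAT.format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="Explain what a large language model is in one sentence.",
    response_key=RESPONSE_KEY,
)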

If this is sufficient for what you want to do, you can use my public DockerHub image and skip to the next section to deploy the Azure Function.

Option 2: Rebuild With A Custom Purpose

If you want to customize the behavior of Dolly, you can clone the repo and modify the GenerateText/instruct_pipeline.py script to fit your needs. For example, you could request that Dolly respond as if it (she?) were a chatbot or maybe summarize something and return information in bullet points.
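
For instance, a summarization-focused variant might only need a different intro blurb (a hypothetical tweak, not something included in the repo):

## Hypothetical replacement for INTRO_BLURB in GenerateText/instruct_pipeline.py
## to nudge Dolly toward bullet-point summaries
INTRO_BLURB = (
    "Below is a passage of text. Summarize the key points of the passage "
    "as a short bulleted list."
)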

Also, if you want to use a larger version of Dolly 2.0, you can uncomment lines in the get_dolly.py file and rebuild the image.

from transformers import AutoModelForCausalLM, AutoTokenizer

## 3B Model (~5GB)
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", device_map="auto")

## 7B Model (~14GB)
# tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
# model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto")

## 12B Model (~24GB)
# tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
# model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto")

## Save Tokenizer and Model
tokenizer.save_pretrained('/home/site/wwwroot/dolly/tokenizer')
model.save_pretrained('/home/site/wwwroot/dolly/model')

Once you’ve made your modifications, you can rebuild the image. (Note that the image will be quite large. The 12B model alone is over 24GB. Even the image with the 3B model packaged inside is >25GB uncompressed, or ~14GB compressed. Just keep this in mind when you go to build/push.)

docker build -t dollyaf .

docker tag dollyaf <YOUR-REPO>/dolly-v2-3b-azurefunction:latest
docker push <YOUR-REPO>/dolly-v2-3b-azurefunction:latest

Be sure to either push your image to an Azure Container Registry or to DockerHub so that Azure Functions can pull it in the next step.

Creating a Function

I’ve included the Azure CLI steps for deploying the Resource Group, Storage Account, App Service Plan, and Function App for the Dolly 2.0 API. Please change the names, region, etc. of the resources to fit your needs.

az login
az group create --name dolly-dev-rg --location eastus
az storage account create --name dollyst --location eastus --resource-group dolly-dev-rg --sku Standard_LRS
az functionapp plan create --resource-group dolly-dev-rg --name dolly-asp --location eastus --number-of-workers 1 --sku P3V3 --is-linux
az functionapp create --name dolly-func --storage-account dollyst --resource-group dolly-dev-rg --plan dolly-asp --functions-version 4 --os-type Linux --image cford38/dolly-v2-3b-azurefunction:latest

Note that this is not a GPU compute context, which results in less than optimal performance for the text generation. If you’d like to explore GPU-based Functions on Azure Kubernetes Service, you can still use my Docker image, just follow this tutorial: https://github.com/puthurr/python-azure-function-gpu

Once the Function app is deployed, you can locate the URL and App key from the service’s screen in the Azure Portal.

Testing It Out

I’ve included a Postman collection in the repo for you to quickly use your API endpoint.

If you don’t use Postman and prefer Python, here’s some sample code for you:

import requests
import json

## Function App URL and key (from the Azure Portal)
url = "https://<your-app-name>.azurewebsites.net/api/GenerateText?code=<app-key>"

## The request body is a single JSON object containing your prompt
payload = json.dumps({"prompt": "<your prompt>"})
headers = {'Content-Type': 'application/json'}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
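
If you plan to poke at it repeatedly, a small wrapper makes that a little nicer (a convenience sketch, assuming the function returns the generated text as the response body, which is what the code above prints):

def ask_dolly(prompt: str) -> str:
    """Send a prompt to the deployed Dolly function and return the response body."""
    resp = requests.post(url, headers=headers, data=json.dumps({"prompt": prompt}), timeout=300)
    resp.raise_for_status()  # surface bad keys, cold starts, or timeouts as errors
    return resp.text

print(ask_dolly("Write a short country-style verse about cloud computing."))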

And just like that, you can ask Dolly whatever you’d like…

Interestingly, as Dolly (the singer, not the model, not the sheep) fans know, she isn’t from Indiana and didn’t write 2/3 of those songs. Also, I’m pretty sure Dolly (the sheep, not the singer, not the model) never gave a TED talk. (I did.)

“That red headed hussy” — Dolly Parton

If this had popped out with another verse of Jolene, I would have been thoroughly impressed. However, this made me chuckle nonetheless.

Cost Comparisons

As stated on the Azure Calculator, the Azure OpenAI service costs (at a minimum) $0.03/1k tokens. Tokens are defined as “common sequences of characters found in text” and are used by the models to calculate the statistical relationships between words.

If we asked for 4 paragraphs of text as output (roughly 750 words, or ~3,000 characters), that equals about 1,000 tokens. (So, roughly 3/4 of a word, or 4 characters, per token…ish.)

You can play with OpenAI’s token counter here: https://platform.openai.com/tokenizer
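
If you’d rather count tokens in code, the tiktoken package gives comparable counts (a quick sketch; tiktoken isn’t part of this post’s repo, and the right encoding depends on which OpenAI model you’re pricing):

import tiktoken

## cl100k_base is the encoding used by GPT-3.5/GPT-4; older GPT-3 models use others
encoding = tiktoken.get_encoding("cl100k_base")
text = "Below is an instruction that describes a task."
print(len(encoding.encode(text)))  # number of tokens in the text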

While not insanely expensive, you can see how costs could get out of hand if everyone at a large organization had unrestricted access to the Azure OpenAI API. Or, if you were running a GPT-backed chatbot on your public-facing website, you could have thousands of users sending in requests via chat per day. Thus, it may be advantageous to run a self-hosted LLM like Dolly where you can control the costs by simply paying for the compute infrastructure rather than monitoring/worrying about usage.

For a Premium v3 App Service Plan (the P3v3 used above, with 8 vCPU and 32GB of memory), we’re looking at just under $550/month. This is fairly pricey for a less-than-optimal deployment of Dolly, but it is a decent option for development and testing purposes. For a production deployment, you’d want to use a GPU-based environment, such as Azure Kubernetes Service or Azure Machine Learning. (If there’s enough interest in this post, I might cover that next.)
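
To make the comparison concrete, here’s a quick back-of-the-envelope calculation (the request volume and tokens-per-response are assumptions for illustration; the per-token price and plan cost are the figures quoted above):

## Back-of-the-envelope: pay-per-token vs. a flat App Service Plan
## (assumed usage: 1,000 requests/day at ~1,000 output tokens per response)
price_per_1k_tokens = 0.03     # Azure OpenAI minimum rate
requests_per_day = 1_000       # assumption for illustration
tokens_per_response = 1_000    # ~4 paragraphs, per the estimate above
days_per_month = 30

openai_monthly = requests_per_day * (tokens_per_response / 1_000) * price_per_1k_tokens * days_per_month
print(f"Azure OpenAI (output tokens only): ~${openai_monthly:,.0f}/month")  # ~$900/month
print("Self-hosted Dolly on a Premium plan: ~$550/month, flat")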

Even if you don’t end up using this much on Azure Functions in the cloud, the Dockerized API version of Dolly I’ve included for this post is quite useful for local testing.

Final Thoughts and Caveats

At first glance, the 3B version of Dolly 2.0 is considerably inferior to ChatGPT in terms of the accuracy/robustness of its responses, as should be expected. Dolly was not meant to compete directly at the level of ChatGPT, but rather to exhibit similar behavior at a smaller, more consumer-trainable scale. In fact, Databricks states, “dolly-v2-12b is not a state-of-the-art model, but does exhibit surprisingly high quality instruction following behavior not characteristic of the foundation model on which it is based.”

Do I think Dolly 2.0 is the best open-source LLM out there? No, but it is the one getting the most press. I also found it interesting that they act as if these models are friendly to consumer hardware. I have an NVIDIA GeForce GTX TITAN X in my home workstation (which has 12GB of memory), and I couldn’t run anything larger than the 3B model locally. I also couldn’t ask it to return very long responses without running out of GPU memory. So, we’re still a ways away from having these LLMs work on non-data-center hardware, especially ones with double-digit billions of parameters that respond as well as GPT-3/4.
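
One partial workaround I didn’t use here: loading the weights in half precision roughly halves the GPU memory footprint (a sketch, assuming a recent torch/transformers install; it still won’t squeeze the 12B model onto a 12GB card):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

## Load the 3B model in float16 to roughly halve its GPU memory footprint
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    torch_dtype=torch.float16,
    device_map="auto",
)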

If you’re interested in trying another LLM that is hosted on HuggingFace, the approach will be quite similar. Simply modify my repo to pre-pull the tokenizer and model of your desired LLM during the Docker build step and update the code in the GenerateText API method to match your LLM’s documentation/logic.
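
For example, in get_dolly.py the change is essentially just the model identifier (the placeholder below is hypothetical; check your chosen model’s card for the recommended tokenizer settings):

## Hypothetical swap in get_dolly.py: pre-pull a different Hugging Face model
tokenizer = AutoTokenizer.from_pretrained("<some-org>/<some-model>", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("<some-org>/<some-model>", device_map="auto")

tokenizer.save_pretrained('/home/site/wwwroot/dolly/tokenizer')
model.save_pretrained('/home/site/wwwroot/dolly/model')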

I’m excited to see what people come up with when using open-source models like Dolly. It will also be interesting to see where these models fit in the market alongside OpenAI’s GPT4 and Google’s Bard APIs.

Stay Curious… 🐑
