Deploy & Scale Pre-trained NLP Models in Minutes with Watson Machine Learning & Hugging Face Transformers

Jerome Kafrouni
IBM Data Science in Practice
10 min read · Jan 25, 2021


In this blog post, I will show you how to access the latest pre-trained open source NLP models and deploy them in a couple of lines of code with Watson Machine Learning, whether on IBM Cloud or on-premise.

Why deploy a pre-trained model?

As a Data Scientist working with clients every day, I try to simplify the scope of a project before starting the implementation. Building custom solutions is fun and exciting, but, sometimes, looking for an easy-to-use solution that is already out there is key to the success of a project, at least to get started. It’s also a good practice to deploy a first version of your solution early on and integrate it with your business application before iterating further.

Luckily, in the modern machine learning world, “easy-to-use and implement” does not necessarily mean a naive approach¹. It’s now very easy to find high quality, open source implementations of the latest models. In this post, I will show you how to leverage one of the most popular NLP frameworks out there: the Python library Transformers, developed by Hugging Face (commonly referred to using the 🤗 emoji).

Testing a Transformers pipeline in a notebook

Before moving to the deployment, let’s look at how easy it is to use a pre-trained model in a notebook. First, we pip install the transformers package and load a model. 🤗 Transformers has the amazing Pipeline interface that abstracts away preprocessing and model inference into one callable object. To use a pre-trained pipeline, just import and call pipeline() and provide the task you want your pipeline to perform. In this example, we’ll use the “fill-mask” task, which tries to guess missing words in a sentence. We use the framework argument to select the TensorFlow implementation (I used TensorFlow 2.1.0 to write this article), but you could swap it for PyTorch and adapt the rest of this tutorial accordingly.

!pip install transformers==3.1
from transformers import pipeline
unmasker = pipeline('fill-mask', framework='tf')
print(unmasker("Watson Machine Learning is a <mask> tool!"))

This will output a list of potential missing words and their scores, starting with “powerful” (Thanks Transformers 😉). Under the hood, this unmasker pipeline uses RoBERTa by default, one of the recent variants of BERT with an improved training methodology: more information about this model and the Hugging Face implementation can be found here.
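
For reference, each candidate comes back as a dictionary containing at least the filled-in sequence and its score. The shape of the output looks roughly like this (the scores are made up for illustration, and the exact keys depend on the transformers version):

[{'sequence': 'Watson Machine Learning is a powerful tool!', 'score': 0.27, ...},
 {'sequence': 'Watson Machine Learning is a great tool!', 'score': 0.11, ...},
 ...]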

4 lines of code, and we have access to an end-to-end pipeline running the latest state-of-the-art model — pretty sweet! Now let’s look at how we can deploy this model in Watson Machine Learning.

What is Watson Machine Learning, how do I get started?

TL;DR: If you have used Watson Machine Learning before and want to try the end-to-end code right away, jump to the last section of this article, “Putting all the steps together”. If you haven’t, this tutorial will make you a WML expert!

WML, short for Watson Machine Learning, is a service that lets you run ML models anywhere and across any cloud. My favorite thing about WML is that it is a “self-service” tool: Data scientists can easily submit models for deployment and start using them in minutes, without help from a machine learning engineer for example.

The rest of this tutorial assumes that you have access to a WML instance, which could be anywhere (IBM Cloud, other Cloud vendors e.g. AWS, on-premise…). Except for the credentials, the code will be exactly the same. If you are just getting started with WML, and IBM Cloud in general, pause here and set yourself up first: create an IBM Cloud account, provision a Watson Machine Learning service instance, generate an API key, and create a deployment space.

One last point before we look at some code: you can run the code below from a notebook or script running anywhere. Because they integrate well with each other, I wrote the following code from a notebook in Watson Studio, WML’s counterpart for model development.

Connecting to Watson Machine Learning

In order to deploy the pipeline we’ve tried above, we will perform these 5 steps:

  • Import the WML python client and connect to our WML instance
  • Create a custom “software specification” i.e. a runtime that will include 🤗 Transformers
  • Wrap the pipeline in a simple Python function and store it in the WML repository
  • Deploy the stored function and test it
  • Optional: Scale it up!

First, we install and import the ibm_watson_machine_learning package. If you are running your code from within Watson Studio, this package is pre-installed.

!pip install ibm_watson_machine_learning -q

import os
from ibm_watson_machine_learning import APIClient

# If running WML on Cloud:
wml_credentials = {
    "apikey": "replaceme",
    "url": "https://us-south.ml.cloud.ibm.com"
}

# If running WML on Cloud Pak for Data on-premise:
wml_credentials = {
    'token': os.environ['USER_ACCESS_TOKEN'],
    'url': os.environ['RUNTIME_ENV_APSX_URL'],
    'version': '3.5',
    'instance_id': 'openshift'
}

client = APIClient(wml_credentials)
client.set.default_space("replaceme")

WML is divided into different Deployment Spaces which help you organize different deployments. A Space contains both the assets ready to be deployed (models, functions, but also scripts, data assets, data connections…) and the deployments themselves. In the snippet above, I’m using the .set.default_space() method to associate the WML client with a target deployment space I had already created. You can find more information about spaces in this part of the documentation.
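
If you are not sure of your space’s ID, you can look it up from the client before setting it as the default. A quick sketch (on IBM Cloud, creating a new space is usually easiest from the Deployment Spaces UI, since it needs to be associated with storage and a WML instance):

# Print the deployment spaces you have access to, along with their IDs
client.spaces.list()
# Use one of the listed IDs as the default space for this client (placeholder ID below)
client.set.default_space("replace-with-your-space-id")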

A short guide to using the WML python client

When you read the sections below, you will notice certain patterns when using the WML python client. Before jumping to the actual code, I will show you three useful patterns when using this package.

First, actions are performed in WML using the following pattern: client.<a-type-of-asset>.<an-action-on-this-type-of-asset>. For example, client.deployments.list(), or client.repository.store_model(). Using tab (for auto-completion) after writing client. will help you navigate through everything you can do with the client.

[Screenshot: hitting tab after typing client. shows the available methods.]
When you’re not sure what action you can perform, just hit tab!

Second, most functions take a set of properties in a dictionary called a meta_prop. The keys of this dictionary are defined by WML objects called MetaNames, and you can check which MetaNames are mandatory, and see sample values, by calling the .show() and .get_example_values() methods. For example, when storing a model:

client.repository.ModelMetaNames.show()
client.repository.ModelMetaNames.get_example_values()

Finally, when you deploy models or functions, the payloads expected for scoring follow this structure: {'input_data': [{'fields': ['col_1', 'col_2'], 'values': [[1, 2], [3, 4]]}]}. You can send multiple dictionaries of data to be scored, each dictionary having a 'values' key which contains a list of samples to be scored, and an optional 'fields' key containing the names of the columns (only needed for certain frameworks).
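
As a concrete (hypothetical) example, a payload with two columns and two rows could be built like this; we will reuse the same structure when scoring the deployed function later:

# Hypothetical scoring payload with two columns and two rows
scoring_payload = {
    'input_data': [{
        'fields': ['col_1', 'col_2'],  # optional for some frameworks
        'values': [[1, 2], [3, 4]]     # two samples to score
    }]
}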

If you’d like to look at a few simple examples before going further, take a look at this excellent GitHub repository which contains various sample notebooks using the WML client.

Now that you know the basics, let’s get started!

Adding the right dependencies

WML can deploy models and functions in a set of predefined software specifications (i.e. runtimes), which are listed here and include most of the common ML and DL frameworks (scikit-learn, Spark MLlib, TensorFlow, PyTorch…). When you need a specific package for your deployment, you can create your own custom software specification, based on a pre-existing one.
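
You can check which specifications are available on your instance directly from the client; the call below simply prints a table of specification names and IDs:

# List the predefined (and any custom) software specifications on this instance
client.software_specifications.list()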

First, we create what is called a package extension, which we will then attach to a software specification. We do this by creating a conda yaml file that will be applied on top of the base pre-defined packages (so we only need to add what is not included in WML by default, i.e. here only the transformers package).

%%writefile environment.yml
channels:
- empty
- nodefaults
dependencies:
- pip:
  - transformers==3.1

We add this yaml file to the package extension with the following code.

meta_props = {
    client.package_extensions.ConfigurationMetaNames.NAME: "transformers",
    client.package_extensions.ConfigurationMetaNames.TYPE: "conda_yml"
}
pkg_extn_details = client.package_extensions.store(meta_props, "./environment.yml")
pkg_extn_id = client.package_extensions.get_id(pkg_extn_details)

We are now going to create a software specification based on a pre-existing one called default_py3.7 and point it to the package extension we just created.

base_id = client.software_specifications.get_id_by_name('default_py3.7')
meta_props = {
    client.software_specifications.ConfigurationMetaNames.NAME: "default with transformers",
    client.software_specifications.ConfigurationMetaNames.PACKAGE_EXTENSIONS: [{'guid': pkg_extn_id}],
    client.software_specifications.ConfigurationMetaNames.BASE_SOFTWARE_SPECIFICATION: {'guid': base_id}
}
sw_spec_details = client.software_specifications.store(meta_props)
sw_spec_id = client.software_specifications.get_id(sw_spec_details)

Creating and storing a function calling Transformers

Now that we’ve created the environment that will run our code, we define what will actually run in it. The transformers pipeline we tested earlier is not technically a model, so we can’t deploy it as a plain TensorFlow or PyTorch model. Therefore, we won’t be deploying the code as an actual model but rather as a function. In WML, you can deploy lightweight Python functions without worrying about the backend. All we need is to create a function with the following structure:

def my_masking_function():
    from transformers import pipeline
    unmasker = pipeline('fill-mask', framework='tf')

    def score(payload):
        # we assume only one batch is sent
        to_score = payload['input_data'][0]['values']
        preds = unmasker(to_score)
        return {'predictions': [{'fields': ['sequence', 'score'],
                                 'values': [[x['sequence'] for x in preds],
                                            [x['score'] for x in preds]]}]}
    return score

Basically, you need a nested function called score() inside an outer function with an arbitrary name (here, called my_masking_function()). When WML creates the deployment, the code of the outer function (my_masking_function) will be run once, including the definition of the score function. Then, for each scoring request, WML will run the score function only. This type of nested function is called a Python closure. Note that the nested score function has access to the variables defined in the outer function, i.e. it has access to the unmasker object.
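
Because this is plain Python, you can sanity-check the closure locally before storing it. A quick sketch, using the same payload structure described earlier:

# Calling the outer function loads the pipeline once and returns the score closure...
local_score = my_masking_function()
# ...which we can then call exactly as WML would
test_payload = {'input_data': [{'values': ['Watson Machine Learning is a <mask> tool!']}]}
print(local_score(test_payload))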

Now that we have created this function, we can store it in the WML Deployment Space. To store it, we point to the sw_spec_id (software specification id) created earlier, and also pass the my_masking_function object to the store_function() method.

function_props = {
    client.repository.FunctionMetaNames.NAME: 'huggingface masking',
    client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: sw_spec_id
}
function_details = client.repository.store_function(my_masking_function, function_props)
function_id = client.repository.get_function_id(function_details)
print(function_id)
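
To double-check that the function landed in the deployment space, you can list the stored functions (assuming the list_functions method available in recent versions of the client):

# Print the functions stored in the current deployment space, with their IDs
client.repository.list_functions()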

Creating and testing the function deployment

Now that the function calling Transformers is stored in the Deployment Space, we can deploy it by running the following code. All we need is to point to the function_id and specify in the meta props that this will be an ONLINE deployment.

deployment_props = {
    client.deployments.ConfigurationMetaNames.NAME: 'huggingface masking deployment',
    client.deployments.ConfigurationMetaNames.ONLINE: {},
}
deployment_details = client.deployments.create(function_id, deployment_props)
deployment_id = client.deployments.get_id(deployment_details)

We now have a REST API endpoint ready to be called and give us predictions on new data! Accessing this endpoint is a standard POST request, where the input data is structured as I showed earlier in the “short guide”. To make it easier, the WML python client has a .score() method that takes care of that request for you:

client.deployments.score(deployment_id, {'input_data': [{'values': ['Watson Machine Learning is a <mask> tool!']}]})

You can also head over to the WML User Interface, and test your deployment from there! This will help you make sure that your function is working.

[Screenshot: the Hugging Face masking deployment tested from the UI.]
Testing the deployment from the UI. Looks like Transformers likes this tutorial :)

Finally, you don’t have to use the WML client to integrate this deployment into your business application. The WML UI gives you auto-generated code snippets in various languages, showing you how to call that REST API endpoint.
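
For illustration, here is a rough sketch of what a direct REST call could look like on IBM Cloud using the requests library; the exact scoring URL and version parameter should be copied from the auto-generated snippet in the UI, and the API key below is a placeholder:

import requests

API_KEY = "replaceme"

# Exchange the API key for an IAM access token
token_response = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    data={"grant_type": "urn:ibm:params:oauth:grant-type:apikey", "apikey": API_KEY},
)
mltoken = token_response.json()["access_token"]

# Scoring URL pattern for WML on IBM Cloud; check the UI snippet for the exact version value
scoring_url = f"https://us-south.ml.cloud.ibm.com/ml/v4/deployments/{deployment_id}/predictions?version=2020-09-01"
payload = {"input_data": [{"values": ["Watson Machine Learning is a <mask> tool!"]}]}
response = requests.post(scoring_url, json=payload, headers={"Authorization": f"Bearer {mltoken}"})
print(response.json())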

[Screenshot: auto-generated scoring code snippets for the deployment.]
Once your function is deployed, you can grab auto-generated scoring code from the UI.

Bonus: Scaling up!

You’ve created the function, stored and deployed it, and integrated it into an application where your end users can send queries to get new predictions. What’s next? Obviously, as your application usage grows, you will need to serve more concurrent requests.

Scaling up the deployment to more replicas can be done either from code or directly from the UI. From code, you provide new meta props that specify the number of replicas (num_nodes) and the size of each replica. More information about scaling deployments is available here.

deployment_update_props = {
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {'name': 'S', 'num_nodes': 2}
}
client.deployments.update(deployment_id, deployment_update_props)

[Screenshot: scaling up the deployment from the UI.]
Scaling up the deployment can be done directly in the UI, by clicking on the “Copies” section in the info panel.

When you update a deployment, WML takes care of the backend details, and the scoring endpoint remains the same. This is true for scaling up or down, but also for swapping out a model with its latest revision, meaning that we could, for example, replace this function with one using a different BERT variant, without worrying about downtime or about pointing our front end to a new scoring endpoint.
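
For instance, swapping the deployed function for a newly stored one might look roughly like the sketch below. This assumes your version of the client exposes the ASSET meta name for deployment updates, and new_function_id is a hypothetical ID of another stored function:

# Hypothetical: point the existing deployment at a newly stored function asset.
# The scoring endpoint stays the same while WML rolls out the new revision.
swap_props = {
    client.deployments.ConfigurationMetaNames.ASSET: {'id': new_function_id}
}
client.deployments.update(deployment_id, swap_props)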

Putting all the steps together

To summarize what we’ve done in this article, here is the end-to-end code needed to deploy a masking pipeline to a Watson Machine Learning instance, assuming you already have an instance ready and have created a Deployment Space.

End to end code needed to deploy a Hugging Face Transformers pipeline in Watson Machine Learning
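
The embedded gist is not reproduced here, but the snippets from this post assemble into roughly the following script (IBM Cloud credentials shown; replace the placeholder API key and space ID with your own):

!pip install transformers==3.1 ibm_watson_machine_learning -q

import os
from ibm_watson_machine_learning import APIClient

# 1. Connect to WML (IBM Cloud credentials shown; see above for Cloud Pak for Data)
wml_credentials = {
    "apikey": "replaceme",
    "url": "https://us-south.ml.cloud.ibm.com"
}
client = APIClient(wml_credentials)
client.set.default_space("replaceme")  # ID of your deployment space

# 2. Create a custom software specification that adds the transformers package
with open("environment.yml", "w") as f:
    f.write("channels:\n- empty\n- nodefaults\ndependencies:\n- pip:\n  - transformers==3.1\n")

pkg_extn_details = client.package_extensions.store({
    client.package_extensions.ConfigurationMetaNames.NAME: "transformers",
    client.package_extensions.ConfigurationMetaNames.TYPE: "conda_yml"
}, "./environment.yml")
pkg_extn_id = client.package_extensions.get_id(pkg_extn_details)

base_id = client.software_specifications.get_id_by_name('default_py3.7')
sw_spec_details = client.software_specifications.store({
    client.software_specifications.ConfigurationMetaNames.NAME: "default with transformers",
    client.software_specifications.ConfigurationMetaNames.PACKAGE_EXTENSIONS: [{'guid': pkg_extn_id}],
    client.software_specifications.ConfigurationMetaNames.BASE_SOFTWARE_SPECIFICATION: {'guid': base_id}
})
sw_spec_id = client.software_specifications.get_id(sw_spec_details)

# 3. Define and store the function wrapping the fill-mask pipeline
def my_masking_function():
    from transformers import pipeline
    unmasker = pipeline('fill-mask', framework='tf')

    def score(payload):
        to_score = payload['input_data'][0]['values']
        preds = unmasker(to_score)
        return {'predictions': [{'fields': ['sequence', 'score'],
                                 'values': [[x['sequence'] for x in preds],
                                            [x['score'] for x in preds]]}]}
    return score

function_details = client.repository.store_function(my_masking_function, {
    client.repository.FunctionMetaNames.NAME: 'huggingface masking',
    client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: sw_spec_id
})
function_id = client.repository.get_function_id(function_details)

# 4. Deploy the stored function and score against it
deployment_details = client.deployments.create(function_id, {
    client.deployments.ConfigurationMetaNames.NAME: 'huggingface masking deployment',
    client.deployments.ConfigurationMetaNames.ONLINE: {},
})
deployment_id = client.deployments.get_id(deployment_details)

print(client.deployments.score(deployment_id, {
    'input_data': [{'values': ['Watson Machine Learning is a <mask> tool!']}]
}))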

Last words

In this blog post, I showed you how easy it is to both use state-of-the-art models, thanks to the amazing work of Hugging Face, and deploy them in minutes with Watson Machine Learning. Tools like these help you automate parts of your Data Science and Application Development workflow, so that you can focus on the interesting and challenging parts!

If you enjoyed this article, feel free to connect with me on LinkedIn or GitHub. I’m part of the Data Science and AI Elite team at IBM, where we combine open source frameworks with technologies like Watson Machine Learning every day.

To learn how to kick-start your data science project with the right expertise, tools and resources, the Data Science and AI Elite (DSE) is here to help. The DSE team can plan, co-create and prove the project with you based on our proven Agile AI methodology. Visit ibm.biz/datascienceelite to connect with us, explore our resources and learn more about Data Science and AI Elite. Request a complimentary consultation: ibm.co/DSE-Consultation.

[1] For some problems, typically when working with tabular data, you would want to start with rule-based approaches, then simple linear models, and build up to more complex approaches if needed. In fields such as Natural Language Processing (NLP), however, you can benefit from large pre-trained language models right from the start.
