Getting Started

Docker Images

The easiest way to get started locally is to use one of the docker images we publish on DockerHub:

  • latest - includes CUDA and libcudnn bindings to support GPU- and CPU-accelerated inference.
  • latest-cpu - a minimal image, lacking the CUDA + libcudnn libraries, that allows for CPU-only inference.
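For example, to fetch the CPU-only image ahead of time (the image name below mirrors the one used in the run command that follows; check DockerHub for the exact repository and tag):

docker pull anansi:embeddings-latest-cpu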

Once you have a docker image, you can then simply run it using docker-compose or docker:

docker run -p 0.0.0.0:50051:50051 \
  -p 0.0.0.0:50052:50052 \
  -v $(pwd)/.cache:/app/.cache \
  anansi:embeddings-latest-cpu

With the above setup, you can then issue gRPC requests to 50051 or send HTTP/1 requests to 50052.
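If you prefer docker-compose, a minimal sketch of an equivalent service definition is shown below; the file and service names are arbitrary, and the image, ports, and volume mirror the docker run command above:

# docker-compose.yaml - minimal sketch mirroring the docker run command above
services:
  embedds:
    image: anansi:embeddings-latest-cpu
    ports:
      - "50051:50051"   # gRPC
      - "50052:50052"   # HTTP
    volumes:
      - ./.cache:/app/.cache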

Configuration

Environment Variables

embedds can be configured using environment variables. All supported environment variables are prefixed with EMBEDDS_ and are outlined below, along with their effects:

| Environment Variable | Usage |
| --- | --- |
| EMBEDDS_GRPC_PORT | Port on which to listen for and serve gRPC requests [default=50051] |
| EMBEDDS_HTTP_PORT | Port on which to listen for and serve HTTP requests [default=50052] |
| EMBEDDS_CONFIG_FILE | Filepath that stores the runtime configuration for models - more on this file below [default=/app/config.yaml] |
| EMBEDDS_CACHE_FOLDER | Folder in which to store the cached model files - these are typically on the order of 100s of MBs and can grow to GBs if you bin-pack different types of models [default=/app/.cache] |
| EMBEDDS_ALLOW_ADMIN | Whether or not to honor Initialize(..) requests, which load a model into the current ONNX runtime. We recommend that models be loaded once on startup; however, this convenience is included for experimentation [default=false] |
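For example, to override the defaults when running the published image (the image name mirrors the docker run example above; the port values here are arbitrary):

docker run -p 0.0.0.0:9090:9090 \
  -p 0.0.0.0:9091:9091 \
  -e EMBEDDS_GRPC_PORT=9090 \
  -e EMBEDDS_HTTP_PORT=9091 \
  -e EMBEDDS_ALLOW_ADMIN=true \
  -v $(pwd)/.cache:/app/.cache \
  anansi:embeddings-latest-cpu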

Config Files

The `EMBEDDS_CONFIG_FILE` variable points to an accessible filepath that stores the list of models to instantiate on startup of the process. If any of these models are missing, embedds will attempt to download them and store them in the folder pointed to by EMBEDDS_CACHE_FOLDER. An example configuration is outlined below:

models:
  # class must match one of the available models, defined at:
  # https://github.com/infrawhispers/anansi/blob/main/embeddings/proto/api.proto
  - name: VIT_L_14_336_OPENAI
    class: ModelClass_CLIP
    # [optional] set to zero or leave empty for parallelism to be determined
    num_threads: 4
    # [optional] enable | disable parallel execution of the onnx graph, which may improve
    # performance at the cost of memory usage.
    parallel_execution: true
  - name: INSTRUCTOR_LARGE
    class: ModelClass_INSTRUCTOR
  - name: INSTRUCTOR_LARGE
    class: ModelClass_INSTRUCTOR

This configuration would create ONE VIT_L_14_336_OPENAI and TWO INSTRUCTOR_LARGE models. This is useful for running multiple embedding models on a single GPU. The list of devices and available models can be found here. By default, embedds will instantiate one instance of INSTRUCTOR_LARGE.
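As a sketch, you can mount a custom configuration into the container at the default EMBEDDS_CONFIG_FILE location (the paths below are assumptions based on the defaults listed above):

docker run -p 0.0.0.0:50051:50051 \
  -p 0.0.0.0:50052:50052 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -v $(pwd)/.cache:/app/.cache \
  anansi:embeddings-latest-cpu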

Issuing Requests

With the server running, you can now issue requests against it. Take a look at the swagger-api docs for the HTTP methods available to you.
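As an illustration only - the endpoint path and payload below are hypothetical, and the actual routes and request shape are defined in the swagger-api docs - an HTTP/1 request against the HTTP port might look like:

# /encode and the JSON body are placeholders; consult the swagger-api docs for the real route and schema
curl -X POST http://localhost:50052/encode \
  -H "Content-Type: application/json" \
  -d '{"text": ["hello world"]}'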


You can also pull the proto definition from source to build your gRPC client. Native clients are on the roadmap, and we are also open to pull requests 🤩.
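For quick experimentation without a generated client, a tool like grpcurl can issue requests directly against the gRPC port; the proto path, payload, and the service/method name below are placeholders - use the names actually defined in api.proto:

# api.Api/Encode is a placeholder service/method; check api.proto for the real names
grpcurl -plaintext \
  -proto proto/api.proto \
  -d '{"text": ["hello world"]}' \
  localhost:50051 api.Api/Encode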