Document Classifier

This content transformer uses an LLM to extract further structured metadata out of an unstructured text. The resulting fields must become a dictionary or map in form of a JSON. This dictionary is then added to the document metadata.

image-20260206-094434.png

Configuration Parameters

  1. Transformer Stage Type: choose Classifier

  2. Prompt: Displays the prompt which is sent, together with the document body, to the LLM. You can adjust it to your needs. However, please make sure that a proper JSON is returned which comprises key-value pairs. Supported values are lists and scalars (strings, numbers, etc.)

  3. Length limitation: Here you can enter a fixed number of characters to reduce the load on the LLM. If you leave this value to 0 or a negative value, the entire document is included.

  4. LLM Configuration

    1. Open Llama configuration

      1. Embedding model: here you can provide the name of the embedding model, you want to use. For example mxbai-embed-large

      2. Use authentication. If enabled, the Suite can use basic authentication for communicating with the embedding endpoint. Please provide an according username and password.

      3. Public keys for SSL certificates: this configuration is needed, if you run the environment with self-signed certificates, or certificates which are not known to the Java key store.
        We use a straight-forward approach to validate SSL certificates. In order to render a certificate valid, add the modulus of the public key into this text field. You can access this modulus by viewing the certificate within the browser.

      image-20240928-201320.png

    2. Azure OpenAI GPT configuration

      image-20241005-080407.png

      1. GPT Endpoint: Offer the endpoint such as <https://<baseUrl>>.openai.azure.com/openai/deployments/<deploymentName>/chat/embeddings?api-version=<version>

      2. Password: here please add your API key which you can configure in the OpenAI configuration in the Azure portal.