Text Extractor

This stage uses Apache Tika to extract the textual content of a given binary, text, or HTML document. It also adds the metadata generated during text extraction to the document metadata.
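
The behavior described above can be pictured with a minimal sketch. It is not the stage's actual implementation. The `Document` class, the `text_extractor_stage` function, and the `fake_extract` stand-in are hypothetical names introduced only for illustration. In the real stage the extraction is done by Apache Tika; here a plain callable takes its place so the sketch runs without Tika installed. The sketch also assumes that existing metadata keys are kept when a key collides, which this page does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Document:
    """Hypothetical document model: raw content, extracted text, metadata."""
    content: bytes
    metadata: Dict[str, str] = field(default_factory=dict)
    text: str = ""


# An extractor returns the extracted text plus the metadata it generated.
# In the real stage this role is played by Apache Tika.
Extractor = Callable[[bytes], Tuple[str, Dict[str, str]]]


def text_extractor_stage(doc: Document, extract: Extractor) -> Document:
    """Store the extracted text and add the extraction metadata to the document."""
    text, extracted_meta = extract(doc.content)
    doc.text = text
    for key, value in extracted_meta.items():
        # Assumption: metadata already on the document wins on a key collision.
        doc.metadata.setdefault(key, value)
    return doc


def fake_extract(raw: bytes) -> Tuple[str, Dict[str, str]]:
    """Stand-in for Tika: decodes UTF-8 and reports a fixed content type."""
    text = raw.decode("utf-8", errors="replace")
    return text, {"Content-Type": "text/plain", "Content-Encoding": "UTF-8"}


doc = Document(content=b"Hello, pipeline!", metadata={"source": "example.txt"})
result = text_extractor_stage(doc, fake_extract)
print(result.text)
print(result.metadata)
```

After the stage runs, the document carries both its original metadata (`source`) and the keys produced during extraction.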

This stage does not have additional configuration parameters.

This stage does not remove HTML tokens from an HTML document. To strip them, use Html Token Remover.