Documentation

Text Extractor

This stage uses Apache Tika to extract the text content of a binary, text or HTML document. It also adds the metadata generated during text extraction to the document metadata.
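
The effect on a document can be pictured with a minimal sketch. Everything named here is hypothetical and only illustrates the behaviour described above: the `Document` class, the `run_text_extractor` function and the `fake_extract` stand-in are not part of the product, and the real stage delegates extraction to Apache Tika. The sketch also assumes that an extracted value overwrites an existing metadata entry with the same key, which this page does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Document:
    """Hypothetical document model: raw bytes, extracted text, metadata."""
    raw: bytes
    text: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)


# An extractor takes raw bytes and returns (text, metadata).
# In the real stage this role is played by Apache Tika.
Extractor = Callable[[bytes], Tuple[str, Dict[str, str]]]


def run_text_extractor(doc: Document, extract: Extractor) -> Document:
    """Store the extracted text and merge extraction metadata into the document."""
    text, extracted_meta = extract(doc.raw)
    doc.text = text
    # Assumption: existing keys are kept; on a key clash the extracted value wins.
    doc.metadata.update(extracted_meta)
    return doc


def fake_extract(raw: bytes) -> Tuple[str, Dict[str, str]]:
    """Stand-in for Tika: decodes UTF-8 and reports a content type."""
    return raw.decode("utf-8"), {"Content-Type": "text/plain"}


doc = Document(raw=b"Hello, world", metadata={"source": "upload"})
run_text_extractor(doc, fake_extract)
print(doc.text)
print(doc.metadata)
```

Metadata that was already on the document (here `source`) is kept, and the entries produced during extraction are added next to it.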

This stage does not have additional configuration parameters.

This stage does not remove HTML tokens from an HTML document. To remove them, use the Html Token Remover stage.