ApacheCon US 2008 Session

Content analysis for ECM with Apache Tika

Apache Tika is an extensible content analysis toolkit designed for detecting and extracting metadata and structured text content from a large number of document formats. It provides a higher-level layer over existing parser libraries. World-class content management systems, and above all those focused on enterprise document management, must always face the challenge of detecting, extracting and indexing as many different media content types as possible. This session provides a technical presentation on how you can integrate Tika inside an Enterprise Document Management System in order to centralize the media type detection implementation and leverage dedicated parsers behind a common extraction layer. Participants will also learn how Tika is integrated with Lucene in order to provide high-performance document indexing and searching features. As a real-life demonstration, the most recently supported media types, the Office Open XML (OOXML) formats, will be detected, extracted and indexed.
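To make the idea of centralized detection in front of dedicated parsers more concrete, here is a minimal sketch in plain Java. It does not use the Tika API: the names ExtractionLayerSketch, ContentParser, register, detect and extract are hypothetical and chosen for illustration only, and detection is reduced to checking a few leading magic bytes, which is far simpler than what a real toolkit does.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are Tika classes.
final class ExtractionLayerSketch {

    /** Common contract that every format-specific parser adapter implements. */
    interface ContentParser {
        String extractText(byte[] content);
    }

    // Media type -> dedicated parser hidden behind the common layer.
    private final Map<String, ContentParser> parsers = new LinkedHashMap<>();

    void register(String mediaType, ContentParser parser) {
        parsers.put(mediaType, parser);
    }

    /** The single place where media types are detected (leading magic bytes only). */
    static String detect(byte[] content) {
        if (startsWith(content, new byte[] {'%', 'P', 'D', 'F'})) {
            return "application/pdf";
        }
        if (startsWith(content, new byte[] {'P', 'K', 3, 4})) {
            // OOXML files are ZIP packages; telling them apart from other
            // ZIP archives would require inspecting the package contents.
            return "application/zip";
        }
        // Assumption for this sketch: anything unrecognized is plain text.
        return "text/plain";
    }

    /** Detects the media type, then delegates to the parser registered for it. */
    String extract(byte[] content) {
        String mediaType = detect(content);
        ContentParser parser = parsers.get(mediaType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mediaType);
        }
        return parser.extractText(content);
    }

    private static boolean startsWith(byte[] content, byte[] prefix) {
        if (content.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (content[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ExtractionLayerSketch layer = new ExtractionLayerSketch();
        layer.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        byte[] doc = "hello ApacheCon".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(doc) + ": " + layer.extract(doc));
    }
}
```

Under these assumptions, the calling system only ever talks to extract; supporting a new format means registering one more parser rather than touching the callers, which is the kind of decoupling the session describes.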