Email Parser is a Java Spring Boot application that extracts text from emails and files with Apache Tika, converts the parsed content into a searchable document structure, and indexes it in Elasticsearch.
The project is useful as a small ingestion service in document and email processing pipelines where raw files must be parsed, normalized, and indexed for full-text search.
- Extract text content from email-related files and documents.
- Use Apache Tika to parse different file formats through a common parser layer.
- Convert extracted content into JSON-compatible data.
- Send parsed documents to Elasticsearch for indexing.
- Expose the application as a Spring Boot service.
- Provide a Maven-based build and packaging workflow.
- Keep the parser runnable as a standalone Java application through a single entry point.
| Area | Technology |
|---|---|
| Runtime | Java 8 |
| Framework | Spring Boot |
| Build tool | Maven |
| Text extraction | Apache Tika Core, Apache Tika Parsers |
| Search engine integration | Elasticsearch REST client |
| JSON processing | Gson, json-simple, fluent-json |
| Utilities | Apache Commons IO, Apache Commons Lang, Commons Validator |
| Logging | Log4j |
| IDE metadata | NetBeans configuration files |
emailparser/
├── arch/
├── src/
│   └── main/
│       ├── java/
│       │   └── com/
│       │       └── fruitfactory/
│       │           └── OFEmailParser.java
│       └── resources/
│           └── application.properties
├── pom.xml
├── nbactions.xml
├── nb-configuration.xml
└── .gitignore
The application works as a lightweight ingestion pipeline. The main responsibility is to receive or read email/file content, extract readable text, transform the extracted result into a document model, and index it in Elasticsearch.
| Layer / Component | Responsibility |
|---|---|
| Application entry point | Starts the Spring Boot application and configures runtime properties |
| Parser layer | Uses Apache Tika to detect file type and extract text/metadata |
| Transformation layer | Converts parsed content into a JSON/search document structure |
| Elasticsearch client | Sends parsed documents to Elasticsearch indexes |
| Configuration | Stores runtime settings such as server port and application-level properties |
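The way these components hand work to each other can be sketched with plain interfaces. This is a minimal illustration, not the repository's code: `TextExtractor`, `DocumentIndexer`, and `IngestionPipeline` are hypothetical names, and the real project would back them with Apache Tika and the Elasticsearch REST client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interfaces: none of these names come from the repository.
interface TextExtractor {
    // Stands in for the parser layer (Apache Tika in the real project).
    String extract(byte[] rawContent);
}

interface DocumentIndexer {
    // Stands in for the Elasticsearch client.
    void index(Map<String, String> document);
}

class IngestionPipeline {
    private final TextExtractor extractor;
    private final DocumentIndexer indexer;

    IngestionPipeline(TextExtractor extractor, DocumentIndexer indexer) {
        this.extractor = extractor;
        this.indexer = indexer;
    }

    // Runs the three steps in order: extract, transform, index.
    Map<String, String> ingest(String sourceFileName, byte[] rawContent) {
        String text = extractor.extract(rawContent);

        // Transformation step: build a flat document with a stable field order.
        Map<String, String> document = new LinkedHashMap<>();
        document.put("sourceFileName", sourceFileName);
        document.put("text", text.trim());

        indexer.index(document);
        return document;
    }
}
```

Keeping the extractor and indexer behind interfaces lets each layer be replaced with an in-memory stub in unit tests, without a running Elasticsearch instance.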
flowchart TB
User[User / External Caller]
FileSource[Email content / Files / Attachments]
subgraph App[Email Parser Application]
SpringBoot[Spring Boot App\nOFEmailParser]
Parser[Parser Layer\nApache Tika]
Transformer[Document Transformer\nJSON payload builder]
ESClient[Elasticsearch Client\nREST API integration]
end
Elasticsearch[(Elasticsearch)]
User -->|Submit / trigger parsing| SpringBoot
FileSource -->|Raw file or email content| SpringBoot
SpringBoot --> Parser
Parser -->|Extracted text + metadata| Transformer
Transformer -->|Indexable JSON document| ESClient
ESClient -->|Index document| Elasticsearch
sequenceDiagram
participant Caller as Caller / File Source
participant App as Email Parser App
participant Tika as Apache Tika
participant Json as JSON Builder
participant ES as Elasticsearch
Caller->>App: Provide email/file content
App->>Tika: Parse file stream/content
Tika-->>App: Extracted text and metadata
App->>Json: Build indexable document
Json-->>App: JSON payload
App->>ES: Send document for indexing
ES-->>App: Index response
App-->>Caller: Parsing/indexing result
The application entry point is com.fruitfactory.OFEmailParser. It sets an application log folder and then starts the Spring Boot runtime.
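One way to prepare a log folder before the framework starts is to create the directory and publish its path as a system property. The sketch below only illustrates that ordering: `LogFolderSetup` and the property key `emailparser.log.dir` are assumptions, and the actual entry point may do this differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LogFolderSetup {
    // Hypothetical property key; the repository may use a different one.
    static final String LOG_DIR_PROPERTY = "emailparser.log.dir";

    // Creates the folder if it is missing and exposes its absolute path
    // as a system property that later configuration can read.
    static Path configure(Path logDir) {
        Path absolute = logDir.toAbsolutePath().normalize();
        try {
            Files.createDirectories(absolute);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot create log folder " + absolute, e);
        }
        System.setProperty(LOG_DIR_PROPERTY, absolute.toString());
        return absolute;
    }

    public static void main(String[] args) {
        Path dir = configure(Paths.get(System.getProperty("java.io.tmpdir"), "emailparser-logs"));
        System.out.println("Log folder: " + dir);
        // A Spring Boot application would start the framework at this point,
        // after the property has been set.
    }
}
```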
Apache Tika is used for content extraction. This allows the application to parse multiple document types through a unified API instead of writing a separate parser for every file format.
Supported formats depend on the Apache Tika parsers available on the classpath and typically include:
- plain text files;
- HTML/XML files;
- PDF documents;
- Microsoft Office documents;
- email-like or attachment-based content;
- other Tika-supported document formats.
After text extraction, the application prepares the parsed result as a JSON-compatible document. This document can contain extracted text, metadata, source information, timestamps, file names, or other fields needed for search.
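The transformation step can be illustrated with a small serializer for flat string fields. The project lists Gson among its dependencies, so real code would more likely delegate to it; this sketch deliberately uses only the JDK so it runs standalone, and `JsonDocumentBuilder` is a hypothetical name.

```java
import java.util.Map;

class JsonDocumentBuilder {
    // Escapes the characters that JSON requires to be escaped inside strings.
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Serializes a flat map of string fields as one JSON object,
    // in the iteration order of the map.
    static String toJson(Map<String, String> fields) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (!first) {
                json.append(',');
            }
            json.append('"').append(escape(field.getKey())).append("\":\"")
                .append(escape(field.getValue())).append('"');
            first = false;
        }
        return json.append('}').toString();
    }
}
```

Passing a `LinkedHashMap` keeps fields in insertion order, which makes the payload easier to compare in tests. Nested objects such as a metadata block are outside the scope of this sketch.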
The parsed document is sent to Elasticsearch using the Elasticsearch REST client. Elasticsearch then stores the extracted text as searchable indexed content.
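At the HTTP level, indexing one document is a POST of the JSON payload to the index's document endpoint. The sketch below shows that request with the JDK's `HttpURLConnection` instead of the Elasticsearch REST client the project actually uses. The class name, base URL, and index name are assumptions, and the `/_doc` path assumes a cluster version that accepts typeless document endpoints.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class SimpleIndexClient {
    // Builds the document endpoint, e.g. http://localhost:9200/emails/_doc
    static String indexUrl(String baseUrl, String indexName) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return trimmed + "/" + indexName + "/_doc";
    }

    // POSTs one JSON document and returns the HTTP status code.
    static int index(String baseUrl, String indexName, String json) throws IOException {
        URL url = new URL(indexUrl(baseUrl, indexName));
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        try {
            connection.setRequestMethod("POST");
            connection.setRequestProperty("Content-Type", "application/json");
            connection.setDoOutput(true);
            try (OutputStream body = connection.getOutputStream()) {
                body.write(json.getBytes(StandardCharsets.UTF_8));
            }
            // Callers should treat any non-2xx status as a failed indexing attempt.
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }
}
```

A call such as `SimpleIndexClient.index("http://localhost:9200", "emails", json)` needs a reachable Elasticsearch node; the host and the index name `emails` are placeholders.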
From the repository root:
mvn clean package

This compiles the project, resolves Maven dependencies, and packages the application.
From the repository root:
mvn spring-boot:run

Or run the packaged application:

java -jar target/emailparser-1.0-SNAPSHOT.jar

The application configuration defines the server port as:

server.port=11223

After startup, the application should be available locally on:
http://localhost:11223
Before using the application in a production-like environment, review the following areas:
- Elasticsearch host, port, index name, and authentication settings.
- Maximum accepted file size and supported file types.
- Error handling for unsupported or corrupted files.
- Retry policy for failed Elasticsearch indexing requests.
- Logging location and log rotation.
- Input validation for uploaded files or email content.
- Security controls if parsing endpoints are exposed over HTTP.
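As an illustration of the retry item in the list above, a failed indexing call can be repeated with an exponentially growing delay. This is a generic sketch under stated assumptions, not behavior the project currently has: `RetryingCaller` is a hypothetical helper, and the attempt count and delays are arbitrary.

```java
import java.util.function.Supplier;

class RetryingCaller {
    // Calls the action up to maxAttempts times and doubles the wait after
    // each failure. The last failure is rethrown when all attempts fail.
    static <T> T callWithRetry(Supplier<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException failure) {
                lastFailure = failure;
                if (attempt < maxAttempts) {
                    sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw lastFailure;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }
}
```

In practice only transient failures, such as timeouts or temporary unavailability, are worth retrying. A malformed document would fail the same way on every attempt.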
- Add a clear REST API contract for upload/parse/index operations.
- Add Docker Compose with Elasticsearch and the application service.
- Add unit tests for parser and JSON transformation logic.
- Add integration tests using a test Elasticsearch container.
- Add structured logging.
- Add OpenAPI/Swagger documentation for HTTP endpoints.
- Add configuration through environment variables.
- Add bulk indexing support for processing many files efficiently.
- Add dead-letter or retry storage for failed documents.
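The environment-variable item in the list above could start from a small lookup helper with defaults. This is a sketch only: `EnvSettings` and the variable names `EMAILPARSER_ES_URL` and `EMAILPARSER_PORT` are hypothetical and are not read by the current application.

```java
import java.util.Map;

class EnvSettings {
    private final Map<String, String> environment;

    // Accepting a map keeps the class testable without touching the real environment.
    EnvSettings(Map<String, String> environment) {
        this.environment = environment;
    }

    static EnvSettings fromSystem() {
        return new EnvSettings(System.getenv());
    }

    // Returns the trimmed value, or the default when the variable is unset or blank.
    String get(String name, String defaultValue) {
        String value = environment.get(name);
        if (value == null || value.trim().isEmpty()) {
            return defaultValue;
        }
        return value.trim();
    }

    // Same lookup for integers; a non-numeric value is reported with the variable name.
    int getInt(String name, int defaultValue) {
        String value = get(name, null);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(name + " must be an integer, got: " + value, e);
        }
    }
}
```

For example, `EnvSettings.fromSystem().getInt("EMAILPARSER_PORT", 11223)` would fall back to the port from application.properties when the variable is not set.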
A normalized Elasticsearch document could look like this:
{
  "sourceFileName": "message.eml",
  "contentType": "message/rfc822",
  "title": "Example email subject",
  "text": "Extracted plain text content from the email or attachment...",
  "metadata": {
    "author": "sender@example.com",
    "createdAt": "2026-05-29T10:00:00Z"
  }
}

# Build
mvn clean package
# Run
mvn spring-boot:run
# Run tests, if test cases are added
mvn test

java spring-boot apache-tika elasticsearch email-parser document-parser text-extraction indexing maven full-text-search