Email Parser

Email Parser is a Java-based application for extracting text from email content and files, converting the parsed content into a searchable document structure, and sending it to Elasticsearch.

The project is useful as a small ingestion service for document/email processing pipelines where raw files need to be parsed, normalized, and indexed for full-text search.

Repository description

Java Spring Boot application for extracting text from emails and files with Apache Tika and indexing parsed content into Elasticsearch.

Main capabilities

Extract text content from email-related files and documents.
Use Apache Tika to parse different file formats through a common parser layer.
Convert extracted content into JSON-compatible data.
Send parsed documents to Elasticsearch for indexing.
Expose the application as a Spring Boot service.
Provide a Maven-based build and packaging workflow.
Run as a standalone Java application through its main entry point.

Technology stack

Area	Technology
Runtime	Java 8
Framework	Spring Boot
Build tool	Maven
Text extraction	Apache Tika Core, Apache Tika Parsers
Search engine integration	Elasticsearch REST client
JSON processing	Gson, json-simple, fluent-json
Utilities	Apache Commons IO, Apache Commons Lang, Commons Validator
Logging	Log4j
IDE metadata	NetBeans configuration files

Project structure

emailparser/
├── arch/
├── src/
│   └── main/
│       ├── java/
│       │   └── com/
│       │       └── fruitfactory/
│       │           └── OFEmailParser.java
│       └── resources/
│           └── application.properties
├── pom.xml
├── nbactions.xml
├── nb-configuration.xml
└── .gitignore

Architecture overview

The application works as a lightweight ingestion pipeline. Its main responsibility is to receive or read email/file content, extract readable text, transform the extracted result into a document model, and index that document in Elasticsearch.

Layer responsibilities

Layer / Component	Responsibility
Application entry point	Starts the Spring Boot application and configures runtime properties
Parser layer	Uses Apache Tika to detect file type and extract text/metadata
Transformation layer	Converts parsed content into a JSON/search document structure
Elasticsearch client	Sends parsed documents to Elasticsearch indexes
Configuration	Stores runtime settings such as server port and application-level properties

Architecture diagram

flowchart TB
    User[User / External Caller]
    FileSource[Email content / Files / Attachments]

    subgraph App[Email Parser Application]
        SpringBoot[Spring Boot App\nOFEmailParser]
        Parser[Parser Layer\nApache Tika]
        Transformer[Document Transformer\nJSON payload builder]
        ESClient[Elasticsearch Client\nREST API integration]
    end

    Elasticsearch[(Elasticsearch)]

    User -->|Submit / trigger parsing| SpringBoot
    FileSource -->|Raw file or email content| SpringBoot
    SpringBoot --> Parser
    Parser -->|Extracted text + metadata| Transformer
    Transformer -->|Indexable JSON document| ESClient
    ESClient -->|Index document| Elasticsearch

Processing flow

sequenceDiagram
    participant Caller as Caller / File Source
    participant App as Email Parser App
    participant Tika as Apache Tika
    participant Json as JSON Builder
    participant ES as Elasticsearch

    Caller->>App: Provide email/file content
    App->>Tika: Parse file stream/content
    Tika-->>App: Extracted text and metadata
    App->>Json: Build indexable document
    Json-->>App: JSON payload
    App->>ES: Send document for indexing
    ES-->>App: Index response
    App-->>Caller: Parsing/indexing result
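
The flow in the sequence diagram can be sketched as a small orchestration class. This is an illustration only, not code from the repository: `IngestionPipelineSketch`, `TextExtractor`, and `DocumentIndexer` are hypothetical names, and the two interfaces merely stand in for the Tika-based parser layer and the Elasticsearch client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical three-stage pipeline: extract text, build a document, index it. */
final class IngestionPipelineSketch {

    /** Stand-in for the parser layer (Apache Tika in the real application). */
    interface TextExtractor {
        String extract(byte[] rawContent);
    }

    /** Stand-in for the Elasticsearch client; returns whatever the indexer reports back. */
    interface DocumentIndexer {
        String index(Map<String, Object> document);
    }

    private final TextExtractor extractor;
    private final DocumentIndexer indexer;

    IngestionPipelineSketch(TextExtractor extractor, DocumentIndexer indexer) {
        this.extractor = extractor;
        this.indexer = indexer;
    }

    /** Runs the stages in the same order as the sequence diagram. */
    String process(String sourceFileName, byte[] rawContent) {
        // App -> Tika: extract readable text from the raw bytes.
        String text = extractor.extract(rawContent);

        // App -> JSON builder: assemble the indexable document.
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("sourceFileName", sourceFileName);
        document.put("text", text);

        // App -> Elasticsearch: hand the document to the indexer and return its response.
        return indexer.index(document);
    }
}
```

Keeping each stage behind a small interface would also allow the flow to be unit-tested with in-memory fakes, without a running Elasticsearch instance.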

Components

Spring Boot application

The application entry point is com.fruitfactory.OFEmailParser. It sets the application log folder and then starts the Spring Boot runtime.

Apache Tika parser

Apache Tika is used for content extraction. This allows the application to parse multiple document types through a unified API instead of writing a separate parser for every file format.

The supported formats depend on which Apache Tika parser modules are on the classpath and may include:

plain text files;
HTML/XML files;
PDF documents;
Microsoft Office documents;
email messages and their attachments;
other Tika-supported document formats.

JSON transformation

After text extraction, the application prepares the parsed result as a JSON-compatible document. This document can contain extracted text, metadata, source information, timestamps, file names, or other fields needed for search.
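
As a rough sketch of this step, the snippet below builds a document as a map and serializes it to JSON. It is deliberately dependency-free so it can run on its own; in the project itself one of the listed JSON libraries (for example Gson) would normally do the serialization. `JsonDocumentSketch`, `buildDocument`, `toJson`, and `quote` are hypothetical names, and the field names are borrowed from the example document in this README rather than from the actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, dependency-free JSON builder for a parsed document. */
final class JsonDocumentSketch {

    /** Assembles the document fields in a stable order. */
    static Map<String, Object> buildDocument(String sourceFileName, String contentType,
                                             String text, Map<String, String> metadata) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("sourceFileName", sourceFileName);
        doc.put("contentType", contentType);
        doc.put("text", text);
        doc.put("metadata", new LinkedHashMap<>(metadata));
        return doc;
    }

    /** Serializes null, numbers, booleans, strings, and nested maps; other types use toString(). */
    static String toJson(Object value) {
        if (value == null) return "null";
        if (value instanceof Number || value instanceof Boolean) return value.toString();
        if (value instanceof Map) {
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                if (!first) sb.append(',');
                first = false;
                sb.append(quote(String.valueOf(entry.getKey())))
                  .append(':')
                  .append(toJson(entry.getValue()));
            }
            return sb.append('}').toString();
        }
        return quote(value.toString());
    }

    /** Wraps a string in double quotes and escapes characters that JSON does not allow raw. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    // Remaining control characters become four-digit hex escapes.
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

Extracted text often contains quotes, backslashes, and line breaks, which is why the escaping step matters; a library serializer handles these cases (and arrays, non-finite numbers, and so on) more completely than this sketch.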

Elasticsearch indexing

The parsed document is sent to Elasticsearch using the Elasticsearch REST client. Elasticsearch then indexes the extracted text so that it can be queried with full-text search.
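
To show what an index request looks like on the wire, here is a minimal sketch that posts one JSON document with the JDK's `HttpURLConnection` instead of the Elasticsearch REST client the project uses. Everything named here is an assumption for illustration: `IndexRequestSketch` and its methods are hypothetical, the base URL and index name are supplied by the caller, and the `/<index>/_doc` path follows the document API of Elasticsearch 7 and later (older versions expect a mapping type in the path).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that indexes a single JSON document over plain HTTP. */
final class IndexRequestSketch {

    /** Builds the endpoint for indexing a document with an auto-generated id. */
    static String indexUrl(String baseUrl, String indexName) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return trimmed + "/" + indexName + "/_doc";
    }

    /** POSTs the JSON body and returns the HTTP status code reported by the server. */
    static int indexDocument(String baseUrl, String indexName, String json) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(indexUrl(baseUrl, indexName)).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setConnectTimeout(5000);   // fail fast if the node is unreachable
            conn.setReadTimeout(10000);
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

For example, `IndexRequestSketch.indexDocument("http://localhost:9200", "emails", "{\"text\":\"hello\"}")` would target a local node and an index called `emails`; both values are placeholders, since the README does not state the actual host or index name.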

Build

From the repository root:

mvn clean package

This resolves the Maven dependencies, compiles the project, and packages the application into the target/ directory.

Run locally

From the repository root:

mvn spring-boot:run

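Or run the packaged application (the JAR name depends on the artifactId and version declared in pom.xml, so adjust it to match the file produced in target/):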

java -jar target/emailparser-1.0-SNAPSHOT.jar

The application configuration defines the server port as:

server.port=11223

After startup, the application should be available locally at:

http://localhost:11223
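
Spring Boot reads `server.port` from application.properties on its own, so no code is needed for that. The sketch below only illustrates how the setting resolves to the local URL, which can be handy in a small tool or smoke test that needs to find the service. `PortConfigSketch` and its methods are hypothetical names; the fallback value is passed in by the caller (Spring Boot's own default is 8080).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper that resolves server.port from properties text. */
final class PortConfigSketch {

    /** Returns the configured port, or defaultPort when the key is absent. */
    static int serverPort(String propertiesText, int defaultPort) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty("server.port");
        return raw == null ? defaultPort : Integer.parseInt(raw.trim());
    }

    /** Builds the local base URL for the given port. */
    static String localBaseUrl(int port) {
        return "http://localhost:" + port;
    }
}
```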

Configuration notes

Before using the application in a production-like environment, review the following areas:

Elasticsearch host, port, index name, and authentication settings.
Maximum accepted file size and supported file types.
Error handling for unsupported or corrupted files.
Retry policy for failed Elasticsearch indexing requests.
Logging location and log rotation.
Input validation for uploaded files or email content.
Security controls if parsing endpoints are exposed over HTTP.
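
One possible shape for the retry policy mentioned in the list above is a bounded retry with exponential backoff. This is a sketch under assumptions, not the project's behaviour: `RetrySketch` and `withRetry` are hypothetical names, and the attempt count and delays are placeholders to be tuned.

```java
import java.util.function.Supplier;

/** Hypothetical bounded retry with exponential backoff. */
final class RetrySketch {

    /**
     * Calls the action until it succeeds or maxAttempts is reached.
     * The wait doubles after every failed attempt, starting at initialDelayMillis.
     */
    static <T> T withRetry(Supplier<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                lastFailure = e;   // remember the failure and maybe try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delay *= 2;
            }
        }
        // All attempts failed: surface the most recent failure to the caller.
        throw lastFailure;
    }
}
```

In practice it is usually worth retrying only transient failures (timeouts, connection errors, HTTP 429 or 503 responses) and failing immediately on errors such as a rejected or malformed document.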

Suggested improvements

Add a clear REST API contract for upload/parse/index operations.
Add Docker Compose with Elasticsearch and the application service.
Add unit tests for parser and JSON transformation logic.
Add integration tests using a test Elasticsearch container.
Add structured logging.
Add OpenAPI/Swagger documentation for HTTP endpoints.
Add configuration through environment variables.
Add bulk indexing support for processing many files efficiently.
Add dead-letter or retry storage for failed documents.
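
For the bulk indexing suggestion, the Elasticsearch `_bulk` API expects a newline-delimited body: one action line followed by one document line per item, with a trailing newline after the last line. A minimal sketch of building that body is shown below. `BulkBodySketch` and `bulkBody` are hypothetical names, and the sketch assumes each document is already serialized as single-line JSON and that the index name needs no escaping.

```java
import java.util.List;

/** Hypothetical builder for an Elasticsearch _bulk request body (NDJSON). */
final class BulkBodySketch {

    /** Emits an "index" action line plus the document line for every entry. */
    static String bulkBody(String indexName, List<String> jsonDocuments) {
        StringBuilder body = new StringBuilder();
        for (String document : jsonDocuments) {
            // Action line: tells Elasticsearch which index receives the next document.
            body.append("{\"index\":{\"_index\":\"").append(indexName).append("\"}}\n");
            // Source line: the document itself, which must not contain raw line breaks.
            body.append(document).append('\n');
        }
        return body.toString();
    }
}
```

The resulting string would be sent as the body of a `POST /_bulk` request with the `Content-Type: application/x-ndjson` header.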

Example indexing document

A normalized Elasticsearch document could look like this:

{
  "sourceFileName": "message.eml",
  "contentType": "message/rfc822",
  "title": "Example email subject",
  "text": "Extracted plain text content from the email or attachment...",
  "metadata": {
    "author": "sender@example.com",
    "createdAt": "2026-05-29T10:00:00Z"
  }
}

Development workflow

# Build
mvn clean package

# Run
mvn spring-boot:run

# Run tests, if test cases are added
mvn test
