Yariki/emailparser

Email Parser

Email Parser is a Java-based application for extracting text from email content and files, converting the parsed content into a searchable document structure, and sending it to Elasticsearch.

The project is useful as a small ingestion service for document/email processing pipelines where raw files need to be parsed, normalized, and indexed for full-text search.

Repository description

Java Spring Boot application for extracting text from emails and files with Apache Tika and indexing parsed content into Elasticsearch.

Main capabilities

  • Extract text content from email-related files and documents.
  • Use Apache Tika to parse different file formats through a common parser layer.
  • Convert extracted content into JSON-compatible data.
  • Send parsed documents to Elasticsearch for indexing.
  • Expose the application as a Spring Boot service.
  • Provide a Maven-based build and packaging workflow.
  • Keep the parser runnable as a standalone Java application through a single entry point.

Technology stack

| Area | Technology |
|---|---|
| Runtime | Java 8 |
| Framework | Spring Boot |
| Build tool | Maven |
| Text extraction | Apache Tika Core, Apache Tika Parsers |
| Search engine integration | Elasticsearch REST client |
| JSON processing | Gson, json-simple, fluent-json |
| Utilities | Apache Commons IO, Apache Commons Lang, Commons Validator |
| Logging | Log4j |
| IDE metadata | NetBeans configuration files |

Project structure

emailparser/
├── arch/
├── src/
│   └── main/
│       ├── java/
│       │   └── com/
│       │       └── fruitfactory/
│       │           └── OFEmailParser.java
│       └── resources/
│           └── application.properties
├── pom.xml
├── nbactions.xml
├── nb-configuration.xml
└── .gitignore

Architecture overview

The application works as a lightweight ingestion pipeline. The main responsibility is to receive or read email/file content, extract readable text, transform the extracted result into a document model, and index it in Elasticsearch.

Layer responsibilities

| Layer / Component | Responsibility |
|---|---|
| Application entry point | Starts the Spring Boot application and configures runtime properties |
| Parser layer | Uses Apache Tika to detect file type and extract text/metadata |
| Transformation layer | Converts parsed content into a JSON/search document structure |
| Elasticsearch client | Sends parsed documents to Elasticsearch indexes |
| Configuration | Stores runtime settings such as server port and application-level properties |
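
The layer split above can be sketched as three small interfaces wired into one pipeline. This is only an illustration: `IngestionPipelineSketch`, `TextExtractor`, `DocumentTransformer`, and `DocumentIndexer` are hypothetical names, not classes from the repository, and the real parser layer would delegate to Apache Tika rather than decode bytes directly.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical wiring of the layers; none of these names come from the repository. */
final class IngestionPipelineSketch {

    /** Parser layer: raw bytes in, extracted text out (Apache Tika would sit behind this). */
    interface TextExtractor { String extract(byte[] raw); }

    /** Transformation layer: extracted text to an indexable JSON payload. */
    interface DocumentTransformer { String toJson(String fileName, String text); }

    /** Elasticsearch client layer: sends the payload and reports whether it was accepted. */
    interface DocumentIndexer { boolean index(String json); }

    private final TextExtractor extractor;
    private final DocumentTransformer transformer;
    private final DocumentIndexer indexer;

    IngestionPipelineSketch(TextExtractor extractor,
                            DocumentTransformer transformer,
                            DocumentIndexer indexer) {
        this.extractor = extractor;
        this.transformer = transformer;
        this.indexer = indexer;
    }

    /** Runs the three steps in order: parse, transform, index. */
    boolean process(String fileName, byte[] raw) {
        String text = extractor.extract(raw);
        String json = transformer.toJson(fileName, text);
        return indexer.index(json);
    }

    public static void main(String[] args) {
        // Stand-in implementations so the sketch runs without Tika or Elasticsearch.
        IngestionPipelineSketch pipeline = new IngestionPipelineSketch(
            raw -> new String(raw, StandardCharsets.UTF_8).trim(),
            (name, text) -> "{\"sourceFileName\":\"" + name + "\",\"text\":\"" + text + "\"}",
            json -> { System.out.println("would index: " + json); return true; });
        pipeline.process("note.txt", "  hello  ".getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping each layer behind its own interface means the parser and the indexer can be replaced with stubs in unit tests, which fits the testing suggestions later in this document.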

Architecture diagram

flowchart TB
    User[User / External Caller]
    FileSource[Email content / Files / Attachments]

    subgraph App[Email Parser Application]
        SpringBoot[Spring Boot App\nOFEmailParser]
        Parser[Parser Layer\nApache Tika]
        Transformer[Document Transformer\nJSON payload builder]
        ESClient[Elasticsearch Client\nREST API integration]
    end

    Elasticsearch[(Elasticsearch)]

    User -->|Submit / trigger parsing| SpringBoot
    FileSource -->|Raw file or email content| SpringBoot
    SpringBoot --> Parser
    Parser -->|Extracted text + metadata| Transformer
    Transformer -->|Indexable JSON document| ESClient
    ESClient -->|Index document| Elasticsearch

Processing flow

sequenceDiagram
    participant Caller as Caller / File Source
    participant App as Email Parser App
    participant Tika as Apache Tika
    participant Json as JSON Builder
    participant ES as Elasticsearch

    Caller->>App: Provide email/file content
    App->>Tika: Parse file stream/content
    Tika-->>App: Extracted text and metadata
    App->>Json: Build indexable document
    Json-->>App: JSON payload
    App->>ES: Send document for indexing
    ES-->>App: Index response
    App-->>Caller: Parsing/indexing result

Components

Spring Boot application

The application entry point is com.fruitfactory.OFEmailParser. It starts the Spring Boot runtime and sets an application log folder before the service starts.

Apache Tika parser

Apache Tika is used for content extraction. This allows the application to parse multiple document types through a unified API instead of writing a separate parser for every file format.

Typical supported formats depend on the Apache Tika parser set and may include:

  • plain text files;
  • HTML/XML files;
  • PDF documents;
  • Microsoft Office documents;
  • email-like or attachment-based content;
  • other Tika-supported document formats.

JSON transformation

After text extraction, the application prepares the parsed result as a JSON-compatible document. This document can contain extracted text, metadata, source information, timestamps, file names, or other fields needed for search.
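
As a rough illustration of this step, the sketch below turns a map of fields into a JSON string. The project lists Gson, which would normally do this work. The hand-written escaping here is a substitute used only to keep the example dependency-free, and `JsonDocumentSketch` and the field names are assumptions rather than the repository's actual model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, dependency-free stand-in for the JSON transformation step. */
final class JsonDocumentSketch {

    /** Escapes the characters that JSON requires to be escaped inside a string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    // Remaining control characters use the six-character hex form.
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Serializes string values and nested maps; insertion order is preserved. */
    @SuppressWarnings("unchecked")
    static String toJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v instanceof Map) {
                sb.append(toJson((Map<String, Object>) v));
            } else if (v == null) {
                sb.append("null");
            } else {
                sb.append('"').append(escape(String.valueOf(v))).append('"');
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("author", "sender@example.com");
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("sourceFileName", "message.eml");
        doc.put("text", "Extracted plain text");
        doc.put("metadata", metadata);
        System.out.println(toJson(doc));
    }
}
```

Extracted text often contains quotes, tabs, and line breaks, so whichever library builds the payload must escape them. Otherwise Elasticsearch rejects the document as malformed JSON.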

Elasticsearch indexing

The parsed document is sent to Elasticsearch using the Elasticsearch REST client. Elasticsearch then stores the extracted text as searchable indexed content.
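
To show what the indexing request looks like on the wire, here is a minimal sketch that posts a JSON document over plain HTTP. It deliberately uses the JDK's `HttpURLConnection` instead of the Elasticsearch REST client so that it has no dependencies. It assumes an Elasticsearch version that accepts `POST /{index}/_doc`. The class name, base URL, and index name are illustrative and are not taken from the repository.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical, dependency-free stand-in for the Elasticsearch client layer. */
final class EsIndexSketch {

    /** Builds the document-index URL; base URL and index name are supplied by the caller. */
    static String indexUrl(String baseUrl, String index) {
        String base = baseUrl.endsWith("/")
            ? baseUrl.substring(0, baseUrl.length() - 1)
            : baseUrl;
        return base + "/" + index + "/_doc";
    }

    /** POSTs the JSON payload and returns the HTTP status code of the response. */
    static int index(String baseUrl, String index, String json) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(indexUrl(baseUrl, index)).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setConnectTimeout(5000);   // assumed timeouts, tune per environment
            conn.setReadTimeout(10000);
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

With a local node, a call such as `EsIndexSketch.index("http://localhost:9200", "emails", json)` would be expected to return a 2xx status when the document is accepted. Any other status should be treated as a failed indexing attempt.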

Build

From the repository root:

mvn clean package

This removes previous build output, resolves Maven dependencies, compiles the project, and packages the application as a JAR under target/.

Run locally

From the repository root:

mvn spring-boot:run

Or run the packaged application:

java -jar target/emailparser-1.0-SNAPSHOT.jar

The application configuration defines the server port as:

server.port=11223

After startup, the application should be available locally at:

http://localhost:11223

Configuration notes

Before using the application in a production-like environment, review the following areas:

  • Elasticsearch host, port, index name, and authentication settings.
  • Maximum accepted file size and supported file types.
  • Error handling for unsupported or corrupted files.
  • Retry policy for failed Elasticsearch indexing requests.
  • Logging location and log rotation.
  • Input validation for uploaded files or email content.
  • Security controls if parsing endpoints are exposed over HTTP.
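
One item from the list above, the retry policy for failed indexing requests, could be approached as in the following sketch. It retries an action with exponential backoff. `RetrySketch`, the attempt count, and the delays are assumptions chosen for illustration. The repository does not document a retry mechanism.

```java
import java.util.function.Supplier;

/** Hypothetical retry helper with exponential backoff; not part of the repository. */
final class RetrySketch {

    /**
     * Calls the action until it succeeds or maxAttempts is reached.
     * The wait doubles after every failed attempt, starting at initialDelayMs.
     */
    static <T> T withRetry(Supplier<T> action, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting to retry", ie);
                    }
                    delay *= 2;
                }
            }
        }
        // All attempts failed; rethrow the most recent failure.
        throw last;
    }
}
```

An indexing call could be wrapped as `RetrySketch.withRetry(() -> sendToElasticsearch(json), 3, 500L)`, where `sendToElasticsearch` is a placeholder for whatever method performs the request. Documents that still fail after the last attempt are candidates for the dead-letter storage mentioned under suggested improvements.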

Suggested improvements

  • Add a clear REST API contract for upload/parse/index operations.
  • Add Docker Compose with Elasticsearch and the application service.
  • Add unit tests for parser and JSON transformation logic.
  • Add integration tests using a test Elasticsearch container.
  • Add structured logging.
  • Add OpenAPI/Swagger documentation for HTTP endpoints.
  • Add configuration through environment variables.
  • Add bulk indexing support for processing many files efficiently.
  • Add dead-letter or retry storage for failed documents.
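
For the bulk indexing suggestion, the sketch below assembles a request body in the newline-delimited format used by the Elasticsearch `_bulk` endpoint: an action line followed by the document source, each terminated by a newline. It assumes an Elasticsearch version where the action line needs no document type, and it assumes every document is already serialized as single-line JSON. `BulkBodySketch` and the index name are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical builder for a _bulk request body; not part of the repository. */
final class BulkBodySketch {

    /**
     * Emits one action line and one source line per document.
     * Each document must be single-line JSON, because newlines delimit entries.
     */
    static String bulkBody(String index, List<String> jsonDocs) {
        StringBuilder sb = new StringBuilder();
        for (String doc : jsonDocs) {
            sb.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
            sb.append(doc).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("{\"text\":\"first\"}", "{\"text\":\"second\"}");
        System.out.print(bulkBody("emails", docs));
    }
}
```

The resulting body would be posted to `/_bulk` with the `application/x-ndjson` content type. The response reports success or failure per item, so partial failures still need to be checked individually.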

Example indexing document

A normalized Elasticsearch document could look like this:

{
  "sourceFileName": "message.eml",
  "contentType": "message/rfc822",
  "title": "Example email subject",
  "text": "Extracted plain text content from the email or attachment...",
  "metadata": {
    "author": "sender@example.com",
    "createdAt": "2026-05-29T10:00:00Z"
  }
}

Development workflow

# Build
mvn clean package

# Run
mvn spring-boot:run

# Run tests, if test cases are added
mvn test

Suggested GitHub topics

java spring-boot apache-tika elasticsearch email-parser document-parser text-extraction indexing maven full-text-search
