A Node.js application that scans Gmail MBOX files (which contain all inbox and sent email messages). The process verifies the email message count and the email address count, validates each email address, and finally exports all email addresses to a TXT file.
Built in 2019 with stream-based processing for efficient handling of large email archives.
- 📧 Extracts email addresses from MBOX files (Gmail exports)
- 🔍 Dual-scan verification using multiple NPM packages
- 🔄 Recursive merge and deduplication of email addresses
- ✅ Email validation with detailed reporting
- 📊 Comprehensive statistics and progress tracking
- 💾 Memory-efficient stream processing for large files
- 📁 Organized output with multiple file formats
- 🔒 Configurable limits and safety checks
- 📈 Real-time progress display
- 🗂️ Backup functionality for application data
graph TD
A[MBOX File in sources/] --> B[Setup Step]
B --> C[Initiate Step]
C --> D[Scan Step]
D --> E[Confirm Step]
E --> F[Crawl Step]
F --> G[Merge Step]
G --> H[Validation Step]
H --> I[Finalize Step]
I --> J[Summary Step]
J --> K[Output Files in dist/]
D --> D1[Line-by-Line Scan]
D --> D2[Node-MBOX Scan]
D1 --> D3[Count Lines & Emails]
D2 --> D3
F --> F1[Stream Processing]
F1 --> F2[Extract Email Addresses]
F2 --> F3[Write to Temp TXT Files]
G --> G1[Round 1 Merge]
G1 --> G2[Round 2 Merge]
G2 --> G3[Continue Until Single File]
G3 --> G4[Remove Duplicates]
H --> H1[Validate Each Email]
H1 --> H2[Split Valid/Invalid]
K --> K1[final_list_view.txt]
K --> K2[final_merge_view.txt]
K --> K3[final_valid.txt]
K --> K4[final_invalid.txt]
K --> K5[final_summary.txt]
style A fill:#e1f5ff
style K fill:#e1ffe1
style D fill:#fff4e1
style F fill:#fff4e1
style G fill:#fff4e1
style H fill:#fff4e1
sequenceDiagram
participant User
participant Script
participant Scanner
participant Crawler
participant Merger
participant Validator
participant Output
User->>Script: npm start
Script->>Script: Load Settings
Script->>Scanner: Scan MBOX File
Scanner->>Scanner: Count Lines
Scanner->>Scanner: Count Messages
Scanner->>Scanner: Count Email Addresses
Scanner-->>Script: Scan Results
Script->>Crawler: Begin Crawl
loop For Each Line
Crawler->>Crawler: Extract Emails
Crawler->>Crawler: Write to TXT (every 100)
end
Crawler-->>Script: Crawl Complete
Script->>Merger: Begin Merge
loop Until Single File
Merger->>Merger: Read TXT Files
Merger->>Merger: Remove Duplicates
Merger->>Merger: Write Merged TXT
end
Merger-->>Script: Merge Complete
Script->>Validator: Validate Emails
loop For Each Email
Validator->>Validator: Check Validity
end
Validator-->>Script: Valid/Invalid Lists
Script->>Output: Generate Output Files
Output->>Output: List View
Output->>Output: Merge View
Output->>Output: Valid Emails
Output->>Output: Invalid Emails
Output->>Output: Summary Stats
Output-->>User: Process Complete
- Node.js (v12 or higher)
- npm or yarn
- Sufficient disk space (at least 2x MBOX file size)
- Clone the repository:
    git clone https://github.com/orassayag/mbox-crawler.git
    cd mbox-crawler
- Install dependencies:
    npm install
- Configure settings in src/settings/settings.js (optional)
Edit src/settings/settings.js to customize:
{
MAXIMUM_EMAIL_ADDRESSES_COUNT_PER_MBOX_FILE: 5000000,
EMAIL_ADDRESSES_CRAWL_LIMIT_COUNT: 100,
EMAIL_ADDRESSES_MERGE_LIMIT_COUNT: 100,
MAXIMUM_MERGE_ROUNDS_COUNT: 10,
// ... see INSTRUCTIONS.md for full list
}
- Place your MBOX file in the sources/ directory
- Run the crawler:
    npm start
- Find output files in the dist/ directory
# Crawl MBOX files and extract email addresses
npm start
# Create backup of the application
npm run backup
# Stop the running process (Windows only)
npm run stop
After processing, you'll find these files in dist/:
- {name}_final_list_view_{date}.txt - All unique emails (one per line)
- {name}_final_merge_view_{date}.txt - All unique emails (comma-separated)
- {name}_final_valid_{date}.txt - Valid emails only
- {name}_final_invalid_{date}.txt - Invalid emails only
- {name}_final_summary_{date}.txt - Process statistics and summary
mbox-crawler/
├── src/
│ ├── core/
│ │ ├── enums/ # Enumeration constants
│ │ └── models/ # Data models
│ ├── logics/ # Process orchestration
│ │ ├── backup.logic.js
│ │ └── crawl.logic.js
│ ├── scripts/ # Entry points
│ │ ├── backup.script.js
│ │ ├── crawl.script.js
│ │ └── error.script.js
│ ├── services/ # Business logic
│ │ ├── confirm.service.js
│ │ ├── crawl.service.js
│ │ ├── finalize.service.js
│ │ ├── initiate.service.js
│ │ ├── merge.service.js
│ │ ├── scan.service.js
│ │ ├── summary.service.js
│ │ └── validate.service.js
│ ├── settings/ # Configuration
│ │ └── settings.js
│ └── utils/ # Utility functions
│ ├── email.utils.js
│ ├── file.utils.js
│ ├── log.utils.js
│ ├── stream.utils.js
│ ├── text.utils.js
│ └── validation.utils.js
├── sources/ # Place MBOX files here
├── dist/ # Output files
├── package.json
└── README.md
- Validates settings and configuration
- Scans the sources/ directory for MBOX files
- Displays file information table
- Uses the line-by-line package to count lines and extract emails
- Uses the node-mbox package to count email messages
- Cross-validates results from both methods
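The cross-validation in the scan step can be pictured roughly as follows. This is a minimal sketch, assuming messages are delimited by lines that begin with `From ` (the usual MBOX convention). `countMessages` and `crossValidate` are hypothetical names, not the project's API, and `parserCount` stands in for whatever count the second parser reports.

```javascript
// Hypothetical helpers for illustration; not the project's actual code.

// Counts messages by counting MBOX separator lines ("From " at line start).
function countMessages(lines) {
  let count = 0;
  for (const line of lines) {
    if (line.startsWith('From ')) {
      count += 1;
    }
  }
  return count;
}

// Compares the line-based count with the count from a second parser.
function crossValidate(lineScanCount, parserCount) {
  return {
    lineScanCount,
    parserCount,
    isMatch: lineScanCount === parserCount,
  };
}
```

A mismatch between the two counts is a hint that the file is truncated or not a well-formed MBOX export, which is presumably why the scan uses two independent methods.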
- Verifies file size within configured limits
- Checks available disk space
- Validates counts against maximum thresholds
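The confirm checks above amount to comparing the scan results against a few thresholds. A minimal sketch, assuming the limits stated elsewhere in this README (10GB file size, 2x free disk space, 5,000,000 email addresses); the `LIMITS` object, its field names, and `confirmScan` are hypothetical and only illustrate the idea.

```javascript
// Hypothetical limits; field names are assumptions for illustration.
const LIMITS = {
  maxFileSizeBytes: 10 * 1024 ** 3, // 10GB, the documented (configurable) ceiling
  maxEmailAddressesCount: 5000000, // mirrors MAXIMUM_EMAIL_ADDRESSES_COUNT_PER_MBOX_FILE
  diskSpaceFactor: 2, // free space should be at least 2x the MBOX file size
};

// Returns a list of problems; an empty list means the scan may proceed.
function confirmScan({ fileSizeBytes, freeBytes, emailAddressesCount }, limits = LIMITS) {
  const problems = [];
  if (fileSizeBytes > limits.maxFileSizeBytes) {
    problems.push('File size exceeds the configured limit');
  }
  if (freeBytes < fileSizeBytes * limits.diskSpaceFactor) {
    problems.push('Not enough free disk space');
  }
  if (emailAddressesCount > limits.maxEmailAddressesCount) {
    problems.push('Email addresses count exceeds the configured maximum');
  }
  return problems;
}
```

In a real run, `freeBytes` would come from a disk query such as the check-disk-space package, whose promise resolves to an object with `free` and `size` fields.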
- Streams through MBOX file line by line
- Extracts email addresses using regex
- Writes emails to temporary TXT files (100 emails per file)
- Real-time progress display
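The crawl step's extract-and-flush behavior can be sketched as below. This is an illustration under assumptions: the regular expression is a simplified pattern, not the one the project uses, and `extractEmails` and `createBatcher` are hypothetical names. The `flush` callback stands in for writing one temporary TXT file.

```javascript
// Simplified pattern for illustration; an assumption, not the project's regex.
const EMAIL_PATTERN = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// Returns every email-like token found in a single line.
function extractEmails(line) {
  return line.match(EMAIL_PATTERN) || [];
}

// Collects addresses line by line and calls flush(batch) whenever
// `limit` addresses have accumulated, so memory use stays bounded.
function createBatcher(limit, flush) {
  let batch = [];
  return {
    push(line) {
      for (const email of extractEmails(line)) {
        batch.push(email);
        if (batch.length >= limit) {
          flush(batch);
          batch = [];
        }
      }
    },
    // Flushes whatever is left after the last line.
    end() {
      if (batch.length > 0) {
        flush(batch);
        batch = [];
      }
    },
  };
}
```

With `limit` set to 100 (the documented value of EMAIL_ADDRESSES_CRAWL_LIMIT_COUNT), each `flush` call would correspond to one temporary file; the line source itself could be any line-by-line reader over the MBOX stream.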
- Recursively merges TXT files
- Removes duplicate email addresses
- Continues until single file remains
- Progress tracking for each merge round
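The round-based merge described above can be modeled in memory, with arrays standing in for the temporary TXT files. A rough sketch under assumptions: `mergeRound` and `mergeAll` are hypothetical names, `groupSize` is an illustrative parameter for how many files are combined per merge, and `maxRounds` loosely mirrors MAXIMUM_MERGE_ROUNDS_COUNT.

```javascript
// One round: combine every `groupSize` files into one deduplicated file.
function mergeRound(files, groupSize) {
  const merged = [];
  for (let i = 0; i < files.length; i += groupSize) {
    const group = files.slice(i, i + groupSize);
    merged.push([...new Set(group.flat())]);
  }
  return merged;
}

// Repeats rounds until a single file remains, with a safety cap on rounds.
function mergeAll(files, groupSize, maxRounds) {
  let current = files;
  let rounds = 0;
  while (current.length > 1) {
    if (rounds >= maxRounds) {
      throw new Error('Exceeded the maximum number of merge rounds');
    }
    current = mergeRound(current, groupSize);
    rounds += 1;
  }
  // Deduplicate once more so a single input file is also cleaned.
  const emails = current.length === 1 ? [...new Set(current[0])] : [];
  return { emails, rounds };
}
```

Because each round shrinks the file count by a factor of `groupSize`, the number of rounds grows only logarithmically with the number of temporary files; a `groupSize` below 2 never converges and hits the round cap.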
- Validates each email address format
- Separates valid and invalid emails
- Delay between validations (configurable)
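Splitting the merged list into valid and invalid addresses might look like the sketch below. `splitByValidity` is a hypothetical name; the validity check is passed in as a function so the example stays dependency-free, and the configurable delay between validations is not modeled here.

```javascript
// Partitions addresses using the supplied predicate.
// `isValid` is any function that takes a string and returns a boolean.
function splitByValidity(emailAddresses, isValid) {
  const valid = [];
  const invalid = [];
  for (const emailAddress of emailAddresses) {
    if (isValid(emailAddress)) {
      valid.push(emailAddress);
    } else {
      invalid.push(emailAddress);
    }
  }
  return { valid, invalid };
}

// Possible usage with the validator package (assumes it is installed):
// const validator = require('validator');
// const { valid, invalid } = splitByValidity(list, (e) => validator.isEmail(e));
```

The two resulting arrays correspond to the final_valid and final_invalid output files.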
- Cleanup temporary files
- Generate final output files in multiple formats
- Create comprehensive summary statistics
- Display results table
- Memory efficient: Stream-based processing avoids loading entire file into memory
- Incremental writes: Email addresses written to disk every 100 entries
- Optimized merge: Exponential merge strategy reduces I/O operations
- Progress tracking: Real-time display of processing status
- Scalable: Handles MBOX files up to 10GB (configurable)
- Extract email addresses from Gmail exports for backup
- Build contact lists from historical email data
- Migrate contacts between email systems
- Analyze email communication patterns
- Clean and deduplicate email address lists
- Create mailing lists from email archives
- Node.js - Runtime environment
- line-by-line - Efficient line-by-line file reading
- node-mbox - MBOX file parsing
- validator - Email address validation
- fs-extra - Enhanced file system operations
- check-disk-space - Disk space verification
- table - Console table formatting
- ESLint - Code quality
All errors include unique error codes for easy troubleshooting:
- Format: Error message (1000XXX)
- Codes range from 1000001 to 1000011
- Detailed error context logged to console
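For scripts that need to produce or read messages in this format, helpers along these lines may be useful. Both functions are hypothetical and not part of the project; the sample message text in the usage comment is made up.

```javascript
// Builds a message in the documented "Error message (1000XXX)" shape.
function formatError(message, code) {
  return `${message} (${code})`;
}

// Extracts a trailing 1000XXX code from a message, or null if none is found.
function parseErrorCode(text) {
  const match = /\((1000\d{3})\)$/.exec(text);
  return match ? Number(match[1]) : null;
}

// Example (made-up message text):
// formatError('Invalid settings', 1000001) returns 'Invalid settings (1000001)'
```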
Contributions are welcome! Please read CONTRIBUTING.md for details on the process for submitting pull requests.
- INSTRUCTIONS.md - Detailed setup and usage instructions
- CONTRIBUTING.md - Contribution guidelines
- Or Assayag - Initial work - orassayag
- Or Assayag orassayag@gmail.com
- GitHub: https://github.com/orassayag
- StackOverflow: https://stackoverflow.com/users/4442606/or-assayag?tab=profile
- LinkedIn: https://linkedin.com/in/orassayag
This project is licensed under the MIT License - see the LICENSE file for details.
- Built for personal email management needs
- Inspired by the need to maintain contact lists from Gmail exports
- Designed for efficiency and reliability with large email archives