The Text Processing Pipeline is a modular rule-based NLP project that converts unstructured resume text into structured profile data.
The pipeline processes raw text through multiple stages including cleaning, tokenization, entity extraction, skill extraction, validation, and profile generation.
The goal of this project is to demonstrate how raw textual information can be transformed into machine-readable structured data using a clean and modular architecture.
Convert raw resume text into structured profile information using a reusable text processing pipeline.
Raw Text
↓
Text Cleaner
↓
Tokenizer
↓
Entity Extraction
↓
Skill Extraction
↓
Validation
↓
Profile Builder
↓
Structured Output
text-processing-pipeline/
│
├── data/
│
├── docs/
│ ├── engineering_log.md
│ └── engineering_note.md
│
├── notebooks/
│
├── src/
│ ├── text_cleaner.py
│ ├── tokenizer.py
│ ├── entity_extractor.py
│ ├── skill_extractor.py
│ ├── validation.py
│ ├── structure.py
│ └── pipeline_runner.py
│
├── tests/
│
├── main.py
├── README.md
└── requirements.txt
Responsible for preprocessing raw text.
Features:
- Convert text to lowercase
- Remove unwanted characters
- Preserve useful symbols required for extraction
- Normalize whitespace
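The features above can be sketched as a single function. This is a minimal illustration, not the project's actual implementation: the set of preserved symbols (`@ . + - #`) is an assumption about what later stages need for emails, phone numbers, and values like "3+ years".

```python
import re

# Assumption: these are the "useful symbols" kept for later extraction.
# Everything else that is not a lowercase letter, digit, or whitespace is dropped.
_UNWANTED = re.compile(r"[^a-z0-9@.+\-#\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw_text: str) -> str:
    """Lowercase the text, replace unwanted characters, normalize whitespace."""
    text = raw_text.lower()
    text = _UNWANTED.sub(" ", text)  # replace with a space so words stay separated
    return _WHITESPACE.sub(" ", text).strip()


print(clean_text("  John   Doe!!  <john@Example.com>  "))
# john doe john@example.com
```

Replacing unwanted characters with a space rather than deleting them avoids accidentally gluing two words together.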
Example:
cleaned_text = clean_text(raw_text)
Converts cleaned text into tokens.
Features:
- Whitespace tokenization
- Handles empty input
- Removes trailing punctuation from tokens
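A tokenizer with these three behaviors might look like the sketch below. It is an illustration under assumptions: the punctuation set `".,!?"` is taken from the fix recorded in this document, and `str.strip` also trims those characters from the start of a token, not only the end.

```python
def tokenize_text(cleaned_text: str) -> list[str]:
    """Split on whitespace and trim surrounding punctuation from each token."""
    # Empty or whitespace-only input yields no tokens.
    if not cleaned_text or not cleaned_text.strip():
        return []

    tokens = []
    for raw_token in cleaned_text.split():
        token = raw_token.strip(".,!?")
        if token:  # skip tokens that were punctuation only, e.g. "..."
            tokens.append(token)
    return tokens


print(tokenize_text("python, sql and docker."))
# ['python', 'sql', 'and', 'docker']
```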
Example:
tokens = tokenize_text(cleaned_text)
Extracts structured entities using regular expressions and controlled vocabularies.
Extracted Entities:
- Email addresses
- Phone numbers
- Experience years
- Job roles
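The four entity types above could be extracted roughly as follows. Only the experience pattern comes from this document; the email and phone patterns and the `ROLES` vocabulary are hypothetical stand-ins chosen for illustration, and the `tokens` parameter is accepted only to mirror the call shown in the pipeline.

```python
import re

# Hypothetical patterns, except EXPERIENCE_RE, which is the regex quoted in this README.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")
EXPERIENCE_RE = re.compile(r"\d+\+?\s+years?")

# Hypothetical controlled vocabulary of job roles.
ROLES = ("backend developer", "machine learning engineer")


def extract_entities(cleaned_text: str, tokens: list[str]) -> dict[str, list[str]]:
    """Return every match per entity type; tokens is unused in this sketch."""
    return {
        "email": EMAIL_RE.findall(cleaned_text),
        "phone_number": PHONE_RE.findall(cleaned_text),
        "experience_years": EXPERIENCE_RE.findall(cleaned_text),
        # Multi-word roles are matched against the full text, not single tokens.
        "roles": [role for role in ROLES if role in cleaned_text],
    }


sample = (
    "john doe is a backend developer with 3+ years of experience. "
    "contact john.doe@example.com phone +1 234 567 8901"
)
print(extract_entities(sample, sample.split()))
```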
Example Output:
{
"email": ["john@example.com"],
"phone_number": ["+1 234 567 8901"],
"experience_years": ["3+ years"],
"roles": ["backend developer"]
}
Extracts technical skills from text using a predefined skills database.
Supported Categories:
- Programming Languages
- Databases
- Frameworks
- Data & AI Skills
- Tools
- Cloud Technologies
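A skills database organized by these categories could be a plain dictionary of sets. The entries in `SKILLS_DB` below are hypothetical examples, not the project's real vocabulary, and the split between token matching and text matching is one possible way to handle multi-word skills.

```python
# Hypothetical skills database: category name -> known skills (lowercase).
SKILLS_DB: dict[str, set[str]] = {
    "programming_languages": {"python", "java"},
    "databases": {"sql", "postgresql"},
    "frameworks": {"django", "flask"},
    "data_ai": {"machine learning", "pandas"},
    "tools": {"docker", "git"},
    "cloud": {"aws", "azure"},
}


def extract_skills(cleaned_text: str, tokens: list[str]) -> list[str]:
    """Return known skills found in the input, grouped in category order."""
    token_set = set(tokens)
    found: list[str] = []
    for skills in SKILLS_DB.values():
        for skill in sorted(skills):  # sorted for a deterministic result
            if " " in skill:
                matched = skill in cleaned_text  # multi-word: search full text
            else:
                matched = skill in token_set  # single word: exact token match
            if matched and skill not in found:
                found.append(skill)
    return found


text = "experienced in python, sql, docker and aws"
tokens = ["experienced", "in", "python", "sql", "docker", "and", "aws"]
print(extract_skills(text, tokens))
# ['python', 'sql', 'docker', 'aws']
```

Matching single-word skills against tokens instead of raw text avoids false positives such as finding "git" inside "digital".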
Example Output:
[
"python",
"sql",
"docker",
"aws"
]
Validates extracted entities before profile generation.
Validation Rules:
Email checks:
- Contains "@"
- Contains "."
Phone number checks:
- Minimum digit length
Experience checks:
- Numeric value exists
- Experience range between 0 and 50 years
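The validation rules above translate into three small predicates. The function names and the `min_digits=10` default are assumptions made for this sketch; the document does not state the actual minimum digit length.

```python
import re


def is_valid_email(email: str) -> bool:
    """Email rule: must contain both "@" and "."."""
    return "@" in email and "." in email


def is_valid_phone(phone: str, min_digits: int = 10) -> bool:
    """Phone rule: enough digits. min_digits=10 is an assumed threshold."""
    digit_count = sum(ch.isdigit() for ch in phone)
    return digit_count >= min_digits


def is_valid_experience(experience: str) -> bool:
    """Experience rule: a number must exist and lie between 0 and 50 years."""
    match = re.search(r"\d+", experience)
    if match is None:
        return False
    return 0 <= int(match.group()) <= 50


print(is_valid_email("john@example.com"))   # True
print(is_valid_phone("+1 234 567 8901"))    # True
print(is_valid_experience("75 years"))      # False
```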
Builds the final structured profile.
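One way the builder could assemble a profile is sketched here. The signature `build_profile(entities, skills)` is hypothetical, since the document elides the real arguments, and taking the first extracted value for single-valued fields is an assumption. A missing value becomes `None`, matching the behavior reported in the test scenarios.

```python
from typing import Optional


def _first(values: list[str]) -> Optional[str]:
    """Return the first extracted value, or None when nothing was found."""
    return values[0] if values else None


def build_profile(entities: dict[str, list[str]], skills: list[str]) -> dict:
    """Hypothetical signature: collapse entity lists into a flat profile."""
    return {
        "email": _first(entities.get("email", [])),
        "phone_number": _first(entities.get("phone_number", [])),
        "experience_years": _first(entities.get("experience_years", [])),
        "skills": list(skills),
        "roles": list(entities.get("roles", [])),
    }


entities = {
    "email": [],  # no email found in the input
    "phone_number": ["+1 234 567 8901"],
    "experience_years": ["3+ years"],
    "roles": ["backend developer"],
}
print(build_profile(entities, ["python", "sql"]))
```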
Example Output:
{
"email": "john@example.com",
"phone_number": "+1 234 567 8901",
"experience_years": "3+ years",
"skills": ["python", "sql"],
"roles": ["backend developer"]
}
Integrates all pipeline stages into a complete workflow.
Pipeline Execution:
cleaned_text = clean_text(text)
tokens = tokenize_text(cleaned_text)
entities = extract_entities(cleaned_text, tokens)
skills = extract_skills(cleaned_text, tokens)
profile = build_profile(...)
John Doe is a backend developer with 3+ years of experience in python and SQL.
Contact:
john.doe@example.com
Phone:
+1 234 567 8901
{
"email": "john.doe@example.com",
"phone_number": "+1 234 567 8901",
"experience_years": "3+ years",
"skills": ["python", "sql"],
"roles": ["backend developer"]
}
The pipeline was tested using multiple scenarios.
Normal Resume Input
Result:
- Passed
Missing Email
Result:
- Passed
Observation:
- Profile builder correctly stored email as None.
Invalid Email
Result:
- Passed
Observation:
- Invalid email was rejected while other entities were processed successfully.
Multiple Skills and Multi-word Roles
Result:
- Passed
Observation:
- Extracted multiple skills successfully.
- Extracted role: machine learning engineer
Noisy Text Input
Result:
- Passed
Observation:
- Pipeline remained functional despite excessive symbols and formatting noise.
Skill Extraction Failure
Problem:
The token "docker." was not matched as "docker".
Root Cause:
Trailing punctuation survived tokenization.
Fix:
token.strip(".,!?")
Added token normalization.
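The effect of this fix can be reproduced in a few lines. The `SKILLS` set is a hypothetical stand-in for the skills database; the `strip` call is the one shown above.

```python
# Hypothetical stand-in for the skills database.
SKILLS = {"python", "docker"}

raw_tokens = "python and docker.".split()  # ['python', 'and', 'docker.']
normalized_tokens = [token.strip(".,!?") for token in raw_tokens]

# Exact matching fails on "docker." but succeeds after normalization.
before = [token for token in raw_tokens if token in SKILLS]
after = [token for token in normalized_tokens if token in SKILLS]

print(before)  # ['python']
print(after)   # ['python', 'docker']
```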
Experience Extraction Limitation
Problem:
Regex only supported single-digit experience values.
Failed Examples:
10 years
12+ years
15 years
Fix:
Updated regex:
r"\d+\+?\s+years?"
to support multi-digit experience values.
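A quick check of the updated pattern against the previously failing examples is sketched below. The surrounding sentence template is made up for the demonstration.

```python
import re

# The updated pattern quoted above: one or more digits, optional "+",
# whitespace, then "year" or "years".
EXPERIENCE_RE = re.compile(r"\d+\+?\s+years?")

samples = ["3+ years", "10 years", "12+ years", "15 years", "1 year"]

for sample in samples:
    # Hypothetical sentence wrapped around each sample value.
    match = EXPERIENCE_RE.search(f"over {sample} of experience")
    print(sample, "->", match.group() if match else None)
```

Each sample is matched in full, including the multi-digit values that the earlier single-digit pattern missed.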
Current implementation is rule-based.
Limitations:
- No Named Entity Recognition (NER)
- No machine learning models
- Limited role vocabulary
- Limited skills database
- Exact matching required for skills
- English text only
Possible future enhancements:
- spaCy-based Named Entity Recognition
- Machine Learning skill extraction
- Configurable skills database
- Resume parsing from files
- JSON export support
- Improved phone number handling
- Expanded role detection
This project demonstrates:
- Regular Expressions
- String Processing
- Tokenization
- Rule-Based Information Extraction
- Data Validation
- Modular Software Design
- Pipeline Architecture
- End-to-End Testing
- Debugging and Root Cause Analysis
Version 1 Complete
All planned modules have been implemented, integrated, tested, and documented.
The project successfully converts raw resume text into structured profile information using a modular rule-based processing pipeline.