Universal Document Parser

A high-performance C++ library with Python bindings for parsing various document formats.

It extracts both the content and the metadata of each supported document type and exposes the result through a small Python API.

Features

Fast C++ parsing engine
Memory efficient
Extensible architecture
Comprehensive metadata extraction
Easy-to-use Python API

Supported Formats

Plain text (.txt)
CSV files (.csv)
JSON files (.json)
XML/HTML files (.xml, .html, .htm)
Markdown files (.md, .markdown)
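As an illustration of the list above, the following sketch maps each listed extension to a format label. The extensions come from this README; the labels, the `EXTENSION_TO_FORMAT` table, and the `guess_format` helper are hypothetical and are not part of the library. Whether the library itself detects formats by extension or by content is not stated here.

```python
from pathlib import Path

# Extensions copied from the "Supported Formats" list.
# The labels on the right are illustrative, not the library's own names.
EXTENSION_TO_FORMAT = {
    ".txt": "text",
    ".csv": "csv",
    ".json": "json",
    ".xml": "xml",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
}


def guess_format(path):
    """Return an illustrative format label, or None if the extension is not listed."""
    # Compare case-insensitively so "NOTES.MD" and "notes.md" are treated alike.
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())
```

For the authoritative answer, prefer `docparser.supported_formats()` and `docparser.can_parse_file()`, which are shown in the Usage section.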

Installation

You can install the library from the root of the project directory using pip:

pip install .

This will compile the C++ extension and install the Python package.
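One way to confirm that the install step succeeded is to check that the module can be located. This is a minimal sketch using only the standard library; the `is_installed` helper is a hypothetical name, and the module name `docparser` is taken from the import in the Usage section.

```python
import importlib.util


def is_installed(module_name="docparser"):
    """Return True if the top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("docparser importable:", is_installed())
```

A `False` result usually means the build failed or the package was installed into a different Python environment.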

Usage

The following example lists the supported formats, writes a small text file, parses it, and prints the content, format, and metadata of the result:

import docparser
import os

# Get a list of supported formats
print("Supported formats:", docparser.supported_formats())

# Create a dummy file to parse
file_to_parse = "example.txt"
with open(file_to_parse, "w") as f:
    f.write("Hello, this is a test.")

# Check if the file can be parsed
if docparser.can_parse_file(file_to_parse):
    print(f"\n'{file_to_parse}' can be parsed.")

    # Parse the file
    parsed_document = docparser.parse_file(file_to_parse)

    # Print the results
    print("\n--- Parsed Document ---")
    print("Content:", parsed_document['content'])
    print("Format:", parsed_document['format'])
    print("Metadata:", parsed_document['metadata'])
else:
    print(f"'{file_to_parse}' cannot be parsed.")

# Clean up the dummy file
os.remove(file_to_parse)
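The same check-then-parse pattern extends to a whole directory. The sketch below is one possible way to do that; `parse_directory` is a hypothetical helper, not part of the library. It takes the two callables as arguments so it does not assume anything about the library beyond the `can_parse_file` and `parse_file` functions used above.

```python
from pathlib import Path


def parse_directory(directory, can_parse, parse):
    """Parse every regular file in `directory` that `can_parse` accepts.

    Returns a dict mapping each file name to whatever `parse` returns.
    """
    results = {}
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and anything the parser reports it cannot handle.
        if path.is_file() and can_parse(str(path)):
            results[path.name] = parse(str(path))
    return results
```

With the library installed, a call might look like `parse_directory("docs", docparser.can_parse_file, docparser.parse_file)`, where `"docs"` is a placeholder directory name.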

Building from Source

This project uses pybind11 and setuptools to build the Python bindings. You need a C++ compiler that supports C++17.

For development, install the package in editable mode:

pip install -e .
