AshrafGalibShaik/Universal-Document-Parser-Library


Universal Document Parser

A high-performance C++ library with Python bindings for parsing various document formats.

It extracts content and metadata from plain text, CSV, JSON, XML/HTML, and Markdown files through a small Python API.

Features

  • Fast C++ parsing engine
  • Memory efficient
  • Extensible architecture
  • Comprehensive metadata extraction
  • Easy-to-use Python API

Supported Formats

  • Plain text (.txt)
  • CSV files (.csv)
  • JSON files (.json)
  • XML/HTML files (.xml, .html, .htm)
  • Markdown files (.md, .markdown)

Installation

You can install the library from the root of the project directory using pip:

pip install .

This will compile the C++ extension and install the Python package.
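To confirm that the installation succeeded, one option is to check that the `docparser` module is visible to the interpreter. This sketch uses only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name="docparser"):
    """Return True if `module_name` can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed():
    print("docparser is importable")
else:
    print("docparser not found; run `pip install .` from the project root")
```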

Usage

The example below lists the supported formats, writes a small text file, parses it, and prints the result:

import docparser
import os

# Get a list of supported formats
print("Supported formats:", docparser.supported_formats())

# Create a dummy file to parse
file_to_parse = "example.txt"
with open(file_to_parse, "w") as f:
    f.write("Hello, this is a test.")

# Check if the file can be parsed
if docparser.can_parse_file(file_to_parse):
    print(f"\n'{file_to_parse}' can be parsed.")

    # Parse the file
    parsed_document = docparser.parse_file(file_to_parse)

    # Print the results
    print("\n--- Parsed Document ---")
    print("Content:", parsed_document['content'])
    print("Format:", parsed_document['format'])
    print("Metadata:", parsed_document['metadata'])
else:
    print(f"'{file_to_parse}' cannot be parsed.")

# Clean up the dummy file
os.remove(file_to_parse)
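The same check-then-parse pattern extends to a whole directory. The sketch below is one possible approach, not part of the library: `parse_directory` is a hypothetical helper, and it takes the check and parse functions as arguments so that it does not depend on `docparser` being importable.

```python
from pathlib import Path


def parse_directory(directory, can_parse, parse):
    """Parse every supported regular file directly inside `directory`.

    `can_parse` and `parse` are callables that take a file path string.
    Returns (results, skipped): a dict mapping path -> parsed document,
    and a list of paths that `can_parse` rejected.
    """
    results, skipped = {}, []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # subdirectories are ignored in this sketch
        name = str(path)
        if can_parse(name):
            results[name] = parse(name)
        else:
            skipped.append(name)
    return results, skipped
```

With the library installed, it could be called as `parse_directory("docs", docparser.can_parse_file, docparser.parse_file)`, where `"docs"` is a placeholder directory name.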

Building from Source

This project uses pybind11 and setuptools to build the Python bindings. Ensure you have a C++ compiler that supports C++17.

For development, install the package in editable mode:

pip install -e .
