A high-performance C++ library with Python bindings for parsing various document formats.
It extracts content and metadata from supported document types through a small Python API.
Features:

- Fast C++ parsing engine
- Memory efficient
- Extensible architecture
- Comprehensive metadata extraction
- Easy-to-use Python API

Supported formats:

- Plain text (.txt)
- CSV files (.csv)
- JSON files (.json)
- XML/HTML files (.xml, .html, .htm)
- Markdown files (.md, .markdown)
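
As a rough illustration of how the extensions above could be used to pre-filter files before handing them to the library, here is a minimal sketch. The `EXTENSION_TO_FORMAT` table and the `guess_format` helper are hypothetical and not part of `docparser`. The format labels are illustrative and may differ from what `docparser.supported_formats()` reports.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table built from the extensions listed above.
# The labels on the right are illustrative, not the library's own names.
EXTENSION_TO_FORMAT = {
    ".txt": "text",
    ".csv": "csv",
    ".json": "json",
    ".xml": "xml",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
}


def guess_format(path: str) -> Optional[str]:
    """Return an illustrative format label for `path`, or None if the
    extension is not in the list above. Matching is case-insensitive."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["notes.md", "data.CSV", "archive.zip"]:
        print(name, "->", guess_format(name))
```

A lookup like this only inspects the file name. For the authoritative answer, prefer the library's own `docparser.can_parse_file`.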
You can install the library from the root of the project directory using pip:

    pip install .

This will compile the C++ extension and install the Python package.
Here is a simple example of how to use the library:

    import docparser
    import os

    # Get a list of supported formats
    print("Supported formats:", docparser.supported_formats())

    # Create a dummy file to parse
    file_to_parse = "example.txt"
    with open(file_to_parse, "w") as f:
        f.write("Hello, this is a test.")

    # Check if the file can be parsed
    if docparser.can_parse_file(file_to_parse):
        print(f"\n'{file_to_parse}' can be parsed.")

        # Parse the file
        parsed_document = docparser.parse_file(file_to_parse)

        # Print the results
        print("\n--- Parsed Document ---")
        print("Content:", parsed_document['content'])
        print("Format:", parsed_document['format'])
        print("Metadata:", parsed_document['metadata'])
    else:
        print(f"'{file_to_parse}' cannot be parsed.")

    # Clean up the dummy file
    os.remove(file_to_parse)

This project uses pybind11 and setuptools to build the Python bindings. Ensure you have a C++ compiler that supports C++17.
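
Before building, it can help to confirm that a C++ compiler is available on your `PATH`. The sketch below is one possible check using only the Python standard library. The `find_cxx_compiler` helper and the list of candidate executable names are assumptions, not part of this project, and the check does not verify C++17 support.

```python
import shutil
from typing import Optional, Sequence

# Common compiler executable names; this list is an assumption and may
# need adjusting for your platform or toolchain.
DEFAULT_CANDIDATES = ("c++", "g++", "clang++", "cl")


def find_cxx_compiler(
    candidates: Sequence[str] = DEFAULT_CANDIDATES,
) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    compiler = find_cxx_compiler()
    if compiler is None:
        print("No C++ compiler found on PATH.")
    else:
        print("Found C++ compiler:", compiler)
```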
To build the project for development, install it in editable mode:

    pip install -e .
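
After either install command, a quick way to confirm that the package can be found by Python is sketched below. The `is_importable` helper is hypothetical. The module name `docparser` is taken from the usage example.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # `docparser` is the module name used in the usage example.
    if is_importable("docparser"):
        print("docparser is installed.")
    else:
        print("docparser is not importable; try re-running the install.")
```

This only checks that the module can be located. It does not load the compiled extension, so an `import docparser` in a Python session remains the definitive test.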