Skip to content

theslava/shi

Repository files navigation

shi — Huffman Compression Implementation

A complete, production-ready Huffman compression/decompression tool written in C.

Overview

shi (Slava's Huffman Implementation) implements a full Huffman coding pipeline in C — from frequency analysis and tree construction to bitstream encoding/decoding. It reads any binary file, builds an optimal prefix code, and produces a compressed output that can be perfectly reconstructed.

This project was created to:

  • Demonstrate core data structures (trees, heaps, linked lists) in C
  • Provide a working reference implementation of Huffman coding
  • Exercise low-level I/O, bit manipulation, and memory management patterns
  • Serve as a learning tool for understanding lossless compression algorithms

Architecture

The codebase is organized into modular layers:

Layer Modules Responsibility
I/O file_io.c/h Buffered file reader (fr_fd) and writer (fw_fd)
Data Structures node.c/h, tree.c/h, list.c/h Huffman tree node, tree builder, sorted doubly-linked list
Sorting sort.c/h Insertion sort for ordering nodes by frequency weight (heapsort functions declared but unused)
Metrics metric.c/h Byte-frequency counter over input data
Bitstream bitstream.c/h, bitarray.c/h MSB-first bit-level reader/writer with auto-flush
Core compress.c/h, decompress.c/h High-level compression & decompression pipelines

Compression Pipeline

Input file → frequency counting → build Huffman tree → generate codes
           → write header → bitstream encode → Output file

Decompression Pipeline

Compressed input → read header → reconstruct tree → bitstream decode → Output file

File Format & Versioning

Compressed files use magic bytes "SHI<version>" (4B) where the 4th byte encodes the format version. Decompression validates the magic and dispatches to the appropriate version handler.

Version Magic Description
0 (0x00) SHI\x00 Current format — per-symbol code storage
1 (0x01) SHI\x01 Planned — flat tree header (see Roadmap)

Header Format (v0)

The compressed file header stores:

  • Magic bytes: "SHI\x00" (4B, 0x53, 0x48, 0x49, 0x00)
  • num_symbols (4B LE) — number of unique bytes in the original
  • file_size (4B LE) — original file size (to stop decoding at the right point)
  • Per symbol: byte_value (1B) + code_length (1B) + code_value (4B LE)

Tools Used

  • Language: C23
  • Build System: CMake 3.15+ (primary and only build system)
  • Compiler Flags: -Wall -Wextra -Wpedantic -g (GCC/Clang), /W4 (MSVC)
  • Testing: CTest — 12 test executables, 114 test cases

Prerequisites

  • CMake ≥ 3.15
  • A C compiler: GCC, Clang, or MSVC (Visual Studio Build Tools)
  • Ninja (optional, recommended for faster builds)

Build

Quick Start

# Configure and build in one step
cmake -B build
cmake --build build

The main executable shi will be located in the build/ directory.

Build Options

# Specify build type
cmake -B build -DCMAKE_BUILD_TYPE=Release

# Available build types:
#   Debug            — Full debug info, no optimization
#   Release          — Full optimization, no debug info
#   RelWithDebInfo   — Full optimization + debug info (recommended for most use cases)
#   MinSizeRel       — Optimize for size

# Verbose build (show full compiler commands)
cmake --build build --verbose

# Build with Ninja (if installed)
cmake -B build -G Ninja
cmake --build build

Full Build Workflow

# 1. Configure the project
cmake -B build -DCMAKE_BUILD_TYPE=Debug

# 2. Build the project
cmake --build build

# 3. Run tests
ctest --test-dir build

# 4. Clean up
cmake --build . --target clean-build

Install

# Install to system (default: /usr/local on Unix, C:\Program Files on Windows)
cmake --install build

Running Tests

Run All Tests

# From the project root
ctest --test-dir build --output-on-failure

# Convenience target (equivalent)
cmake --build . --target run-tests

Run a Specific Test

# Using ctest directly
ctest --test-dir build -R test_compress --output-on-failure

# Using convenience target
cmake --build . --target run-test-compress

Available Test Suites

Test Target Convenience Target Description
test_bitstream run-test-test_bitstream Bitstream reader/writer (7 tests)
test_compress run-test-test_compress Compression roundtrip (13 tests)
test_file_reader run-test-test_file_reader File reader (5 tests)
test_file_writer run-test-test_file_writer File writer (4 tests)
test_list run-test-test_list Linked list (5 tests)
test_tree run-test-test_tree Huffman tree (3 tests)
test_utils run-test-test_utils Utility functions (2 tests)
test_decompress_version run-test-test_decompress_version Version handling in decompression (10 tests)
test_args run-test-test_args CLI argument parsing (43 tests)
test_generate_codes run-test-test_generate_codes Huffman code generation (6 tests)
test_reconstruct_tree run-test-test_reconstruct_tree Tree reconstruction from codes (7 tests)
test_integration run-test-test_integration Full pipeline integration (9 tests)

Test Output

# Run all tests with detailed output
ctest --test-dir build --output-on-failure -V

# Run tests in a specific build configuration
ctest --test-dir build -C Release --output-on-failure

Usage

Compress a file

./shi -c -f input.txt
# outputs: input.txt.shi

Decompress a file

./shi -d -f input.txt.shi
# outputs: input.txt

Options

# Specify file format version (default: 0)
./shi -c -f input.txt --version 0

# Verbose output
./shi -c -f input.txt --verbose

# Show help
./shi --help
Option Description
-c, --compress Compress mode
-d, --decompress Decompress mode
-f, --file <path> Input file path (required)
-v, --verbose Enable verbose output
-V, --version <N> Format version N (default: 0)
-h, --help Show help message

Note: The output file is derived automatically from the input file. For compression, .shi is appended. For decompression, .shi is stripped.

Project Status

Core Implementation: Complete

  • ✅ Full compression pipeline (compress_file())
  • ✅ Full decompression pipeline (decompress_file())
  • ✅ Error handling with NULL checks and status codes throughout
  • ✅ All 12 test suites passing (100%)
  • ✅ Magic byte validation on decompression

Known Limitations

  • Single-symbol edge case is handled but could be more robust

Future Enhancements (v2+)

  • Custom buffer sizes via CLI
  • stdin/stdout support
  • Flat tree header (v1) — replace per-symbol code storage with serialized flat tree for faster decompression (see Roadmap)

File Index

include/
├── core/
│   ├── compress.h      — compress_file, write_header, compress_data, read_header, etc.
│   ├── decompress.h    — decompress_file, reconstruct_tree_from_codes
│   └── version.h       — version constants, magic bytes, per-version dispatch
├── data_structures/
│   ├── bitarray.h      — bit array operations
│   ├── bitstream.h     — bit-level reader + writer
│   ├── list.h          — doubly-linked sorted list
│   ├── node.h          — Huffman tree node
│   └── tree.h          — tree builder + code generator
├── io/
│   └── file_io.h       — buffered reader + writer
└── utils/
    ├── metric.h        — frequency counter
    └── sort.h          — insertion sort for node ordering (heapsort helpers declared but unused)

src/
├── core/
│   ├── compress.c
│   └── decompress.c
├── data_structures/
│   ├── bitarray.c
│   ├── bitstream.c
│   ├── list.c
│   ├── node.c
│   └── tree.c
├── io/
│   └── file_io.c
├── utils/
│   ├── metric.c
│   └── sort.c
├── cli/
│   ├── args.c          — CLI argument parsing (version, verbose, help, flags)
│   └── args.h          — CLI args API (shi_args_t, shi_parse_args, etc.)
└── main.c              — CLI entry point (command dispatch)

tests/
├── test_compress.c     — 13 tests (roundtrip, empty, repeated, single-byte, single-symbol, binary, null-byte, bad magic, truncated header, zero symbols, bad num_symbols, truncated data, empty file)
├── test_bitstream.c    — 7 tests (reader/writer, EOF, NULL)
├── test_file_reader.c  — 5 tests
├── test_file_writer.c  — 4 tests
├── test_list.c         — 5 tests
├── test_tree.c         — 3 tests
├── test_utils.c        — 2 tests
├── test_decompress_version.c — 10 tests (version handling in decompression)
├── test_args.c         — 43 tests (CLI argument parsing)
├── test_generate_codes.c — 6 tests (Huffman code generation)
├── test_reconstruct_tree.c — 7 tests (tree reconstruction from codes)
└── test_integration.c  — 9 tests (full pipeline integration)

docs/
├── Architecture.md     — module responsibilities, data flow, design decisions
├── Changelog.md        — history of bug fixes and changes
└── Roadmap.md          — phased work items and future plans

CMakeLists.txt          — CMake build configuration

License

This project is provided as-is for educational and reference purposes.

About

Slava's Huffman Implementation

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors