A complete, production-ready Huffman compression/decompression tool written in C.
shi (Slava's Huffman Implementation) implements a full Huffman coding pipeline in C — from frequency analysis and tree construction to bitstream encoding/decoding. It reads any binary file, builds an optimal prefix code, and produces a compressed output that can be perfectly reconstructed.
This project was created to:
- Demonstrate core data structures (trees, heaps, linked lists) in C
- Provide a working reference implementation of Huffman coding
- Exercise low-level I/O, bit manipulation, and memory management patterns
- Serve as a learning tool for understanding lossless compression algorithms
The codebase is organized into modular layers:
| Layer | Modules | Responsibility |
|---|---|---|
| I/O | file_io.c/h |
Buffered file reader (fr_fd) and writer (fw_fd) |
| Data Structures | node.c/h, tree.c/h, list.c/h |
Huffman tree node, tree builder, sorted doubly-linked list |
| Sorting | sort.c/h |
Insertion sort for ordering nodes by frequency weight (heapsort functions declared but unused) |
| Metrics | metric.c/h |
Byte-frequency counter over input data |
| Bitstream | bitstream.c/h, bitarray.c/h |
MSB-first bit-level reader/writer with auto-flush |
| Core | compress.c/h, decompress.c/h |
High-level compression & decompression pipelines |
Input file → frequency counting → build Huffman tree → generate codes
→ write header → bitstream encode → Output file
Compressed input → read header → reconstruct tree → bitstream decode → Output file
Compressed files use magic bytes "SHI<version>" (4B) where the 4th byte encodes the format version. Decompression validates the magic and dispatches to the appropriate version handler.
| Version | Magic | Description |
|---|---|---|
| 0 (0x00) | SHI\x00 |
Current format — per-symbol code storage |
| 1 (0x01) | SHI\x01 |
Planned — flat tree header (see Roadmap) |
The compressed file header stores:
- Magic bytes:
"SHI\x00"(4B,0x53, 0x48, 0x49, 0x00) num_symbols(4B LE) — number of unique bytes in the originalfile_size(4B LE) — original file size (to stop decoding at the right point)- Per symbol:
byte_value(1B) +code_length(1B) +code_value(4B LE)
- Language: C23
- Build System: CMake 3.15+ (primary and only build system)
- Compiler Flags:
-Wall -Wextra -Wpedantic -g(GCC/Clang),/W4(MSVC) - Testing: CTest — 12 test executables, 114 test cases
- CMake ≥ 3.15
- A C compiler: GCC, Clang, or MSVC (Visual Studio Build Tools)
- Ninja (optional, recommended for faster builds)
# Configure and build in one step
cmake -B build
cmake --build buildThe main executable shi will be located in the build/ directory.
# Specify build type
cmake -B build -DCMAKE_BUILD_TYPE=Release
# Available build types:
# Debug — Full debug info, no optimization
# Release — Full optimization, no debug info
# RelWithDebInfo — Full optimization + debug info (recommended for most use cases)
# MinSizeRel — Optimize for size
# Verbose build (show full compiler commands)
cmake --build build --verbose
# Build with Ninja (if installed)
cmake -B build -G Ninja
cmake --build build# 1. Configure the project
cmake -B build -DCMAKE_BUILD_TYPE=Debug
# 2. Build the project
cmake --build build
# 3. Run tests
ctest --test-dir build
# 4. Clean up
cmake --build . --target clean-build# Install to system (default: /usr/local on Unix, C:\Program Files on Windows)
cmake --install build# From the project root
ctest --test-dir build --output-on-failure
# Convenience target (equivalent)
cmake --build . --target run-tests# Using ctest directly
ctest --test-dir build -R test_compress --output-on-failure
# Using convenience target
cmake --build . --target run-test-compress| Test Target | Convenience Target | Description |
|---|---|---|
test_bitstream |
run-test-test_bitstream |
Bitstream reader/writer (7 tests) |
test_compress |
run-test-test_compress |
Compression roundtrip (13 tests) |
test_file_reader |
run-test-test_file_reader |
File reader (5 tests) |
test_file_writer |
run-test-test_file_writer |
File writer (4 tests) |
test_list |
run-test-test_list |
Linked list (5 tests) |
test_tree |
run-test-test_tree |
Huffman tree (3 tests) |
test_utils |
run-test-test_utils |
Utility functions (2 tests) |
test_decompress_version |
run-test-test_decompress_version |
Version handling in decompression (10 tests) |
test_args |
run-test-test_args |
CLI argument parsing (43 tests) |
test_generate_codes |
run-test-test_generate_codes |
Huffman code generation (6 tests) |
test_reconstruct_tree |
run-test-test_reconstruct_tree |
Tree reconstruction from codes (7 tests) |
test_integration |
run-test-test_integration |
Full pipeline integration (9 tests) |
# Run all tests with detailed output
ctest --test-dir build --output-on-failure -V
# Run tests in a specific build configuration
ctest --test-dir build -C Release --output-on-failure./shi -c -f input.txt
# outputs: input.txt.shi./shi -d -f input.txt.shi
# outputs: input.txt# Specify file format version (default: 0)
./shi -c -f input.txt --version 0
# Verbose output
./shi -c -f input.txt --verbose
# Show help
./shi --help| Option | Description |
|---|---|
-c, --compress |
Compress mode |
-d, --decompress |
Decompress mode |
-f, --file <path> |
Input file path (required) |
-v, --verbose |
Enable verbose output |
-V, --version <N> |
Format version N (default: 0) |
-h, --help |
Show help message |
Note: The output file is derived automatically from the input file. For compression,
.shiis appended. For decompression,.shiis stripped.
Core Implementation: Complete ✅
- ✅ Full compression pipeline (
compress_file()) - ✅ Full decompression pipeline (
decompress_file()) - ✅ Error handling with NULL checks and status codes throughout
- ✅ All 12 test suites passing (100%)
- ✅ Magic byte validation on decompression
- Single-symbol edge case is handled but could be more robust
- Custom buffer sizes via CLI
- stdin/stdout support
- Flat tree header (v1) — replace per-symbol code storage with serialized flat tree for faster decompression (see Roadmap)
include/
├── core/
│ ├── compress.h — compress_file, write_header, compress_data, read_header, etc.
│ ├── decompress.h — decompress_file, reconstruct_tree_from_codes
│ └── version.h — version constants, magic bytes, per-version dispatch
├── data_structures/
│ ├── bitarray.h — bit array operations
│ ├── bitstream.h — bit-level reader + writer
│ ├── list.h — doubly-linked sorted list
│ ├── node.h — Huffman tree node
│ └── tree.h — tree builder + code generator
├── io/
│ └── file_io.h — buffered reader + writer
└── utils/
├── metric.h — frequency counter
└── sort.h — insertion sort for node ordering (heapsort helpers declared but unused)
src/
├── core/
│ ├── compress.c
│ └── decompress.c
├── data_structures/
│ ├── bitarray.c
│ ├── bitstream.c
│ ├── list.c
│ ├── node.c
│ └── tree.c
├── io/
│ └── file_io.c
├── utils/
│ ├── metric.c
│ └── sort.c
├── cli/
│ ├── args.c — CLI argument parsing (version, verbose, help, flags)
│ └── args.h — CLI args API (shi_args_t, shi_parse_args, etc.)
└── main.c — CLI entry point (command dispatch)
tests/
├── test_compress.c — 13 tests (roundtrip, empty, repeated, single-byte, single-symbol, binary, null-byte, bad magic, truncated header, zero symbols, bad num_symbols, truncated data, empty file)
├── test_bitstream.c — 7 tests (reader/writer, EOF, NULL)
├── test_file_reader.c — 5 tests
├── test_file_writer.c — 4 tests
├── test_list.c — 5 tests
├── test_tree.c — 3 tests
├── test_utils.c — 2 tests
├── test_decompress_version.c — 10 tests (version handling in decompression)
├── test_args.c — 43 tests (CLI argument parsing)
├── test_generate_codes.c — 6 tests (Huffman code generation)
├── test_reconstruct_tree.c — 7 tests (tree reconstruction from codes)
└── test_integration.c — 9 tests (full pipeline integration)
docs/
├── Architecture.md — module responsibilities, data flow, design decisions
├── Changelog.md — history of bug fixes and changes
└── Roadmap.md — phased work items and future plans
CMakeLists.txt — CMake build configuration
This project is provided as-is for educational and reference purposes.