hparadiz/ext-gnu-grep

grep

grep is a greenfield PHP extension project implemented in C and built as a shared object with phpize.

This repository keeps the upstream GNU grep source in vendor/grep and builds a native PHP module around a separate extension entrypoint. It is not a PHP userspace wrapper around the grep CLI.

Vendored upstream commit:

  • 071ac3aa76a575dd55dc184570da2192adafe267

Licensing

GNU grep is GPLv3-or-later. If this extension links against, embeds, or adapts GNU grep internals, the resulting combined work has GPL implications for distribution. That constraint is intentional and should stay explicit in project documentation and release artifacts.

The repository-level license notice is in LICENSE, and the full GNU GPLv3 text is vendored in vendor/grep/COPYING.

Current scope

The tree now contains a real PHP extension skeleton:

  • config.m4
  • php_grep.h
  • php_grep.c
  • tests/*.phpt
  • vendor/grep

The current vertical slice is intentionally small:

  • module loads as grep.so
  • exposes grep_version(): array
  • exposes GNUGrep\Engine
  • exposes GNUGrep\Pattern
  • supports fixed-string matching via GNU grep's upstream Fcompile/Fexecute
  • supports basic and extended regular expressions via GNU grep's upstream GEAcompile/EGexecute
  • supports common PHP-style regex shorthands like \d, \D, \s, \S, \w, \W, \h, and \H on the GNU basic/extended regex path
  • supports GNUGrep\Engine::run() for a substantial subset of the switches listed in grep --help

Implemented run() option slices:

  • pattern modes: -G, -E, -F
  • matcher controls: -i, -v, -w, -x
  • recursive search: -r, -R, -d skip|recurse
  • binary handling: -I, -a, -U, --binary-files=without-match|text|binary
  • result shaping: -n, -c, -l, -L, -m
  • output controls: -b, -H, -h, -o, -Z
  • pattern sources: -e, -f
  • file selection: --include, --exclude, --exclude-dir, --exclude-from
  • context controls: -A, -B, -C, -NUM, --group-separator, --no-group-separator
  • stdin and record modes: --label, -z
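
For readers less familiar with these switches, here is a small sketch of what a few of them do in the upstream grep CLI. It runs standalone GNU grep, not the extension, and the sample log text and variable names are made up for illustration. Whether the extension returns equivalent rows for the same argv is what the parity tests are meant to establish.

```shell
#!/bin/sh
# Illustrative input only; not taken from the repository.
log='boot ok
fatal: disk full
retry 1
idle
fatal: disk full
done'

# Result shaping: -c counts matching lines, -m NUM stops after NUM matches.
match_count=$(printf '%s\n' "$log" | grep -c 'fatal')
first_only=$(printf '%s\n' "$log" | grep -n -m 1 'fatal')

# Output control: -o prints only the matched part of each line.
only_matched=$(printf '%s\n' "$log" | grep -o 'disk')

# Context control: -A 1 adds one trailing line per match, and
# --group-separator replaces the default "--" between non-adjacent groups.
with_context=$(printf '%s\n' "$log" | grep -A 1 --group-separator='==' 'fatal')

printf '%s\n' "$match_count" "$first_only" "$only_matched" "$with_context"
```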

PCRE mode (-P), output-oriented flags such as -T and --line-buffered, and colorized CLI formatting are still follow-up work. The extension is being built out engine-first, with matcher parity and benchmark harnesses added slice by slice.

Build

tools/build_upstream_grep.sh
phpize
./configure --enable-grep
make

The built module will be written to modules/grep.so.

Run tests

make test

The PHPT suite uses --EXTENSIONS-- grep, so the tests execute against the freshly built module.

Compare Against Upstream GNU grep

Build standalone upstream GNU grep from the vendored source:

tools/build_upstream_grep.sh

Then run a side-by-side correctness and timing check against the extension:

tools/compare_with_upstream.sh

That harness:

  • generates a deterministic benchmark fixture tree
  • runs standalone GNU grep with -RnI 'abstract class'
  • runs the extension on the same fixture
  • diffs the normalized outputs
  • records repeated wall-clock timings for both paths

That benchmark now exercises the global ggrep() helper for the extension side, so it measures the actual short userspace entrypoint instead of an older internal helper path.
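
As a rough sketch of the upstream half of that comparison, the snippet below builds a tiny fixture tree, runs grep -RnI 'abstract class' over it, and sorts the output so it can be diffed reliably. The fixture layout, file names, and the sort-based normalization are assumptions for illustration; tools/compare_with_upstream.sh may do each step differently.

```shell
#!/bin/sh
set -e

# Hypothetical fixture location and layout; the real harness may differ.
fixture=$(mktemp -d)
trap 'rm -rf "$fixture"' EXIT
mkdir -p "$fixture/src/Model"

printf '<?php\nabstract class AlphaBase {}\n' > "$fixture/src/AlphaBase.php"
printf '<?php\nfinal class Leaf {}\n' > "$fixture/src/Leaf.php"
printf '<?php\n\nabstract class BetaBase {}\n' > "$fixture/src/Model/BetaBase.php"
# A file containing a NUL byte, which -I treats as binary and skips.
printf 'abstract class Hidden\0\n' > "$fixture/src/blob.bin"

# Run from inside the fixture so paths are relative, then sort in the C
# locale so traversal order does not affect a later diff.
upstream_out=$(cd "$fixture" && grep -RnI 'abstract class' src | LC_ALL=C sort)

printf '%s\n' "$upstream_out"
```

The extension side would produce its own normalized listing for the same tree, and the two listings would then be compared with diff.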

If you want to benchmark in-memory grep work directly, use:

php -n -d extension=modules/grep.so tools/benchmark_ggrep_pipe.php \
  '-iE fatal|panic|timeout' \
  /path/to/captured-output.log \
  100 \
  'captured-output'

That is useful for shell_exec() / pipeline-style usage where startup and filesystem traversal are not the main cost.

Load manually

php -d extension=/absolute/path/to/modules/grep.so -r 'var_dump(grep_version());'

Minimal API

For a full PHP-visible reference, see docs/USERSPACE_API_REFERENCE.md.

<?php

$pattern = GNUGrep\Pattern::fixedString('TODO');

var_dump($pattern->matches("TODO: wire GNU grep internals\n"));
var_dump(GNUGrep\Engine::versionInfo());
var_dump(GNUGrep\Engine::match(
    'abstract class (Alpha|Beta)Base',
    "abstract class BetaBase\n",
    GNUGrep\Pattern::MODE_EXTENDED_REGEXP
));
var_dump(GNUGrep\Engine::run(['-RnI', 'abstract class', __DIR__ . '/src']));
var_dump(GNUGrep\Engine::run(['-Rniw', 'model', __DIR__ . '/src']));
var_dump(GNUGrep\Engine::run(['-Rn', '-e', 'alpha', '-e', 'beta', __DIR__ . '/src']));

Quick Start

ggrep() is now the shortest userspace entrypoint. Pass GNU grep-style args as a string or array, then give it either paths or in-memory text:

<?php

$literalMatches = ggrep(
    '-F lamb',
    'Mary had a little lamb'
);

$errorMatches = ggrep(
    '-iE fatal|panic|timeout',
    // shell_exec() can return null or false; the cast turns both into ''.
    (string) shell_exec('php artisan about 2>&1'),
    'artisan about'
);

$httpBlob = <<<HEADERS
GET /checkout HTTP/1.1
Host: payments.internal.example
Authorization: Bearer redacted-token
X-Forwarded-For: 203.0.113.9
X-Request-Id: req-7f3a

HTTP/1.1 503 Service Unavailable
Set-Cookie: session=secret; HttpOnly; Secure
Content-Security-Policy: default-src 'self'
Strict-Transport-Security: max-age=31536000
CF-Ray: 89abc123-sea
HEADERS;

$headerMatches = ggrep(
    '-niE ^(Host|Authorization|X-Forwarded-For|X-Request-Id|Set-Cookie|Content-Security-Policy|Strict-Transport-Security|CF-Ray):',
    $httpBlob,
    'checkout trace'
);

$tokenMatches = ggrep(
    ['-nE', '-e', '\d+', '-e', 'alpha\w+', '-e', 'space\shere'],
    "id=42\nslug=alpha_beta\nspace here\n",
    'token demo'
);

// Folder search, equivalent to a practical grep -RnI style search.
$classMatches = ggrep(
    '-RnI abstract class',
    __DIR__ . '/src'
);

// PHP 8.5 pipe operator works cleanly with a tiny wrapper.
$findLamb = fn(string $input): array => ggrep('-F lamb', $input);
$pipedMatches = 'Mary had a little lamb' |> $findLamb(...);

On the GNU basic and extended regex modes, the extension also accepts common PHP-style shorthand tokens such as \d, \D, \s, \S, \w, \W, \h, and \H.
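
For comparison, the token demo above can be written for the upstream grep CLI with POSIX bracket expressions. This sketch assumes \d, \w, and \s correspond to [[:digit:]], [[:alnum:]_], and [[:space:]]; the source does not spell out how the extension translates each token, so treat the mapping as an approximation rather than the extension's documented behavior.

```shell
#!/bin/sh
# Same sample buffer as the token demo.
input='id=42
slug=alpha_beta
space here'

# POSIX bracket expressions standing in for \d, \w, and \s (assumed mapping).
posix_matches=$(printf '%s\n' "$input" | grep -nE \
  -e '[[:digit:]]+' \
  -e 'alpha[[:alnum:]_]+' \
  -e 'space[[:space:]]here')

printf '%s\n' "$posix_matches"
```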

For the common "grep folders like grep -RnI" case, the class helpers still exist:

<?php

use GNUGrep\Engine;
use GNUGrep\Pattern;

$matches = Engine::grep('abstract class', __DIR__ . '/src');

$matches = Engine::grep('BetaLeaf', [
    __DIR__ . '/src',
    __DIR__ . '/tests',
], Pattern::MODE_FIXED_STRING);

$matches = Engine::grepFixed('TODO', [
    __DIR__ . '/src',
    __DIR__ . '/docs',
]);

These helpers assume:

  • recursive traversal with -R-style directory handling
  • line-numbered results
  • binary files treated as without-match

Use ggrep() when you want the shortest userspace form. Use GNUGrep\Engine::run(array $argv) when you want exact CLI-style argv control. Use the class helpers when you want explicit path-only or buffer-only intent.

Example: Search PSR-4 Autoload Trees

If you want a quick inventory of PHP type declarations across one or more PSR-4 autoload roots, point GNUGrep\Engine::run() at those folders and search for the declaration forms you care about:

<?php

use GNUGrep\Engine;

$autoloadRoots = [
    __DIR__ . '/src',
    __DIR__ . '/modules/Billing/src',
];

$matches = Engine::run([
    '-RnI',
    '-E',
    '-e', '^(abstract|final|readonly)[[:space:]]+class[[:space:]]+',
    '-e', '^class[[:space:]]+',
    '-e', '^interface[[:space:]]+',
    '-e', '^trait[[:space:]]+',
    '-e', '^enum[[:space:]]+',
    '--include=*.php',
    ...$autoloadRoots,
]);

// The |> pipe operator requires PHP 8.5 or later.
$lines = $matches
    |> (static fn(array $rows): array => array_map(
        static fn(array $match): string => sprintf(
            '%s:%d %s',
            $match['path'],
            $match['line'],
            $match['text']
        ),
        $rows
    ));

echo implode(PHP_EOL, $lines), PHP_EOL;

That gives you a grep-style scan of the PSR-4 code roots while ignoring binary files and non-PHP assets. It is a good fit for codebase audits like "show me every abstract class, interface, trait, enum, or concrete class we autoload from these roots."
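
The same argv can be tried against the upstream grep CLI to check the patterns before wiring them into PHP. The sketch below uses a throwaway fixture tree with made-up file names; it shows only what standalone grep prints and makes no claim about the exact rows the extension returns.

```shell
#!/bin/sh
set -e

# Throwaway tree standing in for two PSR-4 roots (hypothetical contents).
root=$(mktemp -d)
trap 'rm -rf "$root"' EXIT
mkdir -p "$root/src" "$root/modules/Billing/src"

printf '<?php\nfinal class Invoice {}\n' > "$root/modules/Billing/src/Invoice.php"
printf '<?php\ninterface Payable {}\n' > "$root/src/Payable.php"
printf '<?php\nenum Status {}\n' > "$root/src/Status.php"
# Not a .php file, so --include='*.php' filters it out.
printf 'class NotPhp\n' > "$root/src/notes.txt"

declarations=$(cd "$root" && grep -RnI -E \
  -e '^(abstract|final|readonly)[[:space:]]+class[[:space:]]+' \
  -e '^class[[:space:]]+' \
  -e '^interface[[:space:]]+' \
  -e '^trait[[:space:]]+' \
  -e '^enum[[:space:]]+' \
  --include='*.php' \
  src modules/Billing/src | LC_ALL=C sort)

printf '%s\n' "$declarations"
```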

Next integration steps

  1. Add a native compiled-program abstraction that owns GNU grep matcher state instead of rebuilding per call.
  2. Close the remaining regex bridge gap for anchored ^...$ semantics on the generic path where -x is not set.
  3. Add the remaining major CLI slices such as -P, plus any text-rendering-only flags that need a dedicated formatted-output API instead of structured arrays.
  4. Keep adding parity PHPTs and side-by-side benchmarks before expanding the CLI surface further.

About

Uses upstream GNU grep to build a native PHP extension.
