
Products Parser

Supplier product list processor in modern native PHP that streams large CSV/TSV, JSON/NDJSON, and XML catalogs, normalizes fields, and counts unique product combinations with optional PCNTL-powered parallelism.

01. Project Overview

Purpose

Build a lightweight CLI that can ingest massive supplier catalogs in multiple formats, normalize field names, and emit both raw products and aggregated unique combinations without exhausting memory.

Challenge

Support CSV/TSV delimiter auto-detection, JSON arrays/objects/NDJSON, and streaming XML while keeping processing consistent across Windows and POSIX systems, with opt-in PCNTL parallelism for heavy workloads.

Solution

Implemented streaming parsers, a flexible field mapper, and a unique combination counter that can export CSV/JSON/XML. A seeder generates deterministic synthetic datasets, and parallel modes accelerate both parsing and generation when PCNTL is available.

02. Key Features

Streaming Parsers

CSV/TSV with delimiter auto-detection, JSON arrays or NDJSON, and XML via XMLReader, all processed as generators to keep memory flat on huge files.
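As a rough illustration of the generator approach, the sketch below streams a delimited file one row at a time. The function names and the "most frequent candidate in the header line" heuristic are assumptions made for this example, not the project's actual detection logic.

```php
<?php
declare(strict_types=1);

/**
 * Guess the delimiter by counting candidate characters in the header line.
 * Hypothetical helper; the project's own detection may work differently.
 */
function detectDelimiter(string $headerLine): string
{
    $best = ',';
    $bestCount = -1;
    foreach ([',', "\t", ';', '|'] as $candidate) {
        $count = substr_count($headerLine, $candidate);
        if ($count > $bestCount) {
            $best = $candidate;
            $bestCount = $count;
        }
    }
    return $best;
}

/**
 * Yield one associative row at a time, so only a single row is held in memory.
 *
 * @return Generator<int, array<string, string>>
 */
function streamDelimited(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }
    try {
        $first = fgets($handle);
        if ($first === false) {
            return; // empty file: nothing to yield
        }
        $delimiter = detectDelimiter($first);
        $header = str_getcsv(rtrim($first, "\r\n"), $delimiter, '"', '');
        while (($row = fgetcsv($handle, null, $delimiter, '"', '')) !== false) {
            if ($row === [null] || count($row) !== count($header)) {
                continue; // skip blank or malformed lines
            }
            yield array_combine($header, $row);
        }
    } finally {
        fclose($handle);
    }
}
```

Because the function is a generator, a caller can iterate a multi-gigabyte file with foreach while memory use stays close to the size of one row.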

Field Normalization

Maps supplier headers (brand_name, model_name, gb_spec_name, etc.) into canonical fields: make, model, colour, capacity, network, grade, condition.
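A minimal sketch of how such a mapping might look. Only brand_name, model_name, and gb_spec_name come from this write-up; the other aliases and the function name are hypothetical, and which canonical field each alias targets is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Map supplier-specific headers onto the canonical field names.
 *
 * @param array<string, mixed> $row raw row keyed by supplier header
 * @return array<string, ?string> canonical row; missing or empty values are null
 */
function normalizeFields(array $row): array
{
    static $aliases = [
        'brand_name'     => 'make',
        'model_name'     => 'model',
        'gb_spec_name'   => 'capacity',
        'colour_name'    => 'colour',    // hypothetical alias
        'network_name'   => 'network',   // hypothetical alias
        'grade_name'     => 'grade',     // hypothetical alias
        'condition_name' => 'condition', // hypothetical alias
    ];

    $canonical = array_fill_keys(
        ['make', 'model', 'colour', 'capacity', 'network', 'grade', 'condition'],
        null
    );

    foreach ($row as $header => $value) {
        $key = strtolower(trim((string) $header));
        $field = $aliases[$key] ?? $key; // already-canonical headers pass through
        if (array_key_exists($field, $canonical)) {
            $value = trim((string) $value);
            $canonical[$field] = $value === '' ? null : $value;
        }
    }

    return $canonical;
}
```

Unknown columns are dropped rather than rejected, which keeps the mapper tolerant of suppliers that ship extra fields.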

Unique Combination Counter

Aggregates normalized products and exports CSV, JSON, or XML summaries of unique make/model/capacity/colour/network/grade/condition combinations.
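The aggregation can be pictured as a keyed tally. The sketch below is an assumption about the general shape, not the project's UniqueCounter: it builds a key from the ordered field values and increments a counter per key.

```php
<?php
declare(strict_types=1);

/**
 * Count identical field combinations in a stream of normalized rows.
 *
 * @param iterable<array<string, ?string>> $products
 * @return array<string, array{fields: array<string, ?string>, count: int}>
 */
function countCombinations(iterable $products): array
{
    $fields = ['make', 'model', 'capacity', 'colour', 'network', 'grade', 'condition'];
    $counts = [];

    foreach ($products as $product) {
        $values = [];
        foreach ($fields as $field) {
            $values[$field] = $product[$field] ?? null;
        }
        // JSON-encoding the ordered values gives an unambiguous key,
        // unlike joining with a separator that might appear in the data.
        $key = json_encode(array_values($values), JSON_THROW_ON_ERROR);
        $counts[$key] ??= ['fields' => $values, 'count' => 0];
        $counts[$key]['count']++;
    }

    return $counts;
}
```

Since the input is any iterable, a generator from a streaming parser can be passed in directly; memory then grows with the number of distinct combinations rather than with the number of rows.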

Parallel Modes

Optional PCNTL-powered workers partition inputs and merge results for faster parsing or seeding on POSIX systems; sequential mode runs everywhere.

Synthetic Data Seeder

Generates deterministic sample catalogs in CSV, TSV, JSON, NDJSON, and XML for benchmarks or fixtures. In parallel mode, worker outputs are merged zero-copy.

Memory Efficient

Generator-driven parsing keeps RAM steady on large catalogs; optional chunk flushing avoids blowups on extreme cardinality.

03. Technology & Requirements

Runtime & Libraries

  • PHP 8.4+

    Strict typing, readonly classes, modern enums.

  • XMLReader, DOM, SimpleXML

    Native XML extensions, with XMLReader handling the streaming parse.

  • Composer

    Install dependencies with composer install --no-interaction --ignore-platform-reqs.

Parallelism & OS Support

  • PCNTL (optional)

    Parallel modes for parsing/seeding on POSIX systems; sequential mode works on Windows.

  • Input/Output Paths

    Inputs at data/input/; outputs at data/output/; seeded defaults go to data/input/products.<type>.

  • Format Coverage

    CSV, TSV, JSON, NDJSON, XML with consistent normalized fields.

04. CLI Usage

Parser

Stream products and optionally write aggregated unique combinations.

composer install --no-interaction --ignore-platform-reqs

php parser.php --file=data/input/products_comma_separated.csv --unique-combinations=data/output/combination_count.csv

php parser.php --file=data/input/products.json --unique-combinations=data/output/combination_count.json

php parser.php --file=data/input/products.xml --unique-combinations=data/output/combination_count.xml

php parser.php --file=data/input/products.csv --unique-combinations=data/output/results.json --parallel=4

  • --file is required; the input format is auto-detected.
  • --unique-combinations writes CSV, JSON, or XML, chosen by the output file's extension.
  • --parallel enables PCNTL workers and takes a worker count of at least 2.
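The rules in the list above could be enforced roughly as follows. This is a sketch under assumptions: the function name and return shape are invented for illustration, and the input is taken to be the array produced by PHP's getopt('', ['file:', 'unique-combinations:', 'parallel:']).

```php
<?php
declare(strict_types=1);

/**
 * Validate raw long options as returned by getopt().
 * Hypothetical stand-in for the project's option validation.
 *
 * @param array<string, mixed> $raw
 * @return array{file: string, output: ?string, workers: int}
 */
function validateParserOptions(array $raw): array
{
    $file = $raw['file'] ?? null;
    if (!is_string($file) || $file === '') {
        throw new InvalidArgumentException('--file is required');
    }

    $output = $raw['unique-combinations'] ?? null;
    if ($output !== null) {
        if (!is_string($output)) {
            throw new InvalidArgumentException('--unique-combinations may be given only once');
        }
        $extension = strtolower(pathinfo($output, PATHINFO_EXTENSION));
        if (!in_array($extension, ['csv', 'json', 'xml'], true)) {
            throw new InvalidArgumentException('--unique-combinations must end in .csv, .json or .xml');
        }
    }

    $workers = 1; // sequential unless --parallel is given
    if (isset($raw['parallel'])) {
        $workers = filter_var($raw['parallel'], FILTER_VALIDATE_INT);
        if ($workers === false || $workers < 2) {
            throw new InvalidArgumentException('--parallel expects an integer >= 2');
        }
    }

    return ['file' => $file, 'output' => $output, 'workers' => $workers];
}
```

Keeping validation in a pure function, separate from the getopt() call, makes it straightforward to unit test without spawning the CLI.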

Seeder

Generate deterministic sample catalogs for benchmarks or fixtures.

php seeder.php --type=csv --count=1000

php seeder.php --type=json --count=50000

php seeder.php --type=xml --count=100 --output=custom/data.xml

php seeder.php --type=ndjson --count=10000

php seeder.php --type=csv --count=1000000 --parallel=8

  • --type and --count are required; output defaults to data/input/products.<type>.
  • --output sets a custom target path.
  • Parallel generation merges worker outputs zero-copy.
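How the project achieves its zero-copy merge is not described here. One plausible reading, sketched below with a hypothetical function name, is that each worker writes a part file and the parent appends the parts with stream-to-stream copies, so rows are never decoded back into PHP values during the merge.

```php
<?php
declare(strict_types=1);

/**
 * Append each worker's part file to the target with stream_copy_to_stream().
 * Illustrative only; suited to line-oriented formats such as CSV or NDJSON,
 * where plain concatenation yields a valid file.
 *
 * @param list<string> $partPaths part files in output order
 * @return int total bytes written
 */
function mergeParts(array $partPaths, string $targetPath): int
{
    $target = fopen($targetPath, 'wb');
    if ($target === false) {
        throw new RuntimeException("Cannot open {$targetPath} for writing");
    }

    $bytes = 0;
    try {
        foreach ($partPaths as $partPath) {
            $part = fopen($partPath, 'rb');
            if ($part === false) {
                throw new RuntimeException("Cannot open part {$partPath}");
            }
            try {
                $copied = stream_copy_to_stream($part, $target);
                if ($copied === false) {
                    throw new RuntimeException("Copy failed for {$partPath}");
                }
                $bytes += $copied;
            } finally {
                fclose($part);
            }
        }
    } finally {
        fclose($target);
    }

    return $bytes;
}
```

For JSON arrays or XML, a merge along these lines would also have to write the opening and closing wrapper once, around the concatenated parts.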

05. Architecture Highlights

Parsers

CsvParser (auto-detect delimiter), JsonParser (arrays, wrapped objects, NDJSON), and XmlParser (XMLReader) all implement FileParserInterface and yield products via generators.
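To make the shared contract concrete, here is a sketch of an interface with one generator-based implementation for NDJSON. FileParserInterface is named in this write-up, but its method name and signature below are assumptions, and NdjsonLineParser is a hypothetical class used only for this example.

```php
<?php
declare(strict_types=1);

interface FileParserInterface
{
    /**
     * Assumed contract: lazily produce one raw row per product.
     *
     * @return iterable<int, array<string, mixed>>
     */
    public function parse(string $path): iterable;
}

final class NdjsonLineParser implements FileParserInterface
{
    /** Yield one decoded object per non-empty line, keyed by line number. */
    public function parse(string $path): Generator
    {
        $handle = fopen($path, 'rb');
        if ($handle === false) {
            throw new RuntimeException("Cannot open {$path}");
        }
        try {
            $lineNo = 0;
            while (($line = fgets($handle)) !== false) {
                $lineNo++;
                $line = trim($line);
                if ($line === '') {
                    continue; // tolerate blank lines between records
                }
                try {
                    $row = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
                } catch (JsonException $e) {
                    throw new RuntimeException("Invalid JSON on line {$lineNo}", 0, $e);
                }
                if (!is_array($row)) {
                    throw new RuntimeException("Line {$lineNo} is not a JSON object");
                }
                yield $lineNo => $row;
            }
        } finally {
            fclose($handle);
        }
    }
}
```

Returning a Generator from a method declared as iterable is allowed by PHP's covariant return types, so callers can depend on the interface alone.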

Field Mapper

FieldMapper normalizes supplier headers (brand_name, model_name, gb_spec_name, etc.) into canonical fields: make, model, colour, capacity, network, grade, condition.

Product VO

Immutable Product validates required fields and generates unique keys used by UniqueCounter to aggregate combinations safely.
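A value object along these lines might look like the sketch below. Which fields are required and how the key is derived are assumptions made for illustration; the method name uniqueKey is also hypothetical.

```php
<?php
declare(strict_types=1);

final class Product
{
    public function __construct(
        public readonly string $make,
        public readonly string $model,
        public readonly ?string $colour = null,
        public readonly ?string $capacity = null,
        public readonly ?string $network = null,
        public readonly ?string $grade = null,
        public readonly ?string $condition = null,
    ) {
        // Assumption: make and model are the required fields.
        if (trim($make) === '' || trim($model) === '') {
            throw new InvalidArgumentException('make and model are required');
        }
    }

    /** Equal field values always produce the same key. */
    public function uniqueKey(): string
    {
        $values = [
            $this->make, $this->model, $this->colour, $this->capacity,
            $this->network, $this->grade, $this->condition,
        ];
        return hash('sha256', json_encode($values, JSON_THROW_ON_ERROR));
    }
}
```

On PHP 8.2 and later the same immutability can be declared once with a readonly class instead of per-property readonly modifiers.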

UniqueCounter

Aggregates normalized products, supports optional chunk flushing for extreme cardinality, and exports CSV/JSON/XML summaries.

Parallel

ParallelProcessor and ParallelSeeder split work across PCNTL workers on POSIX, then merge results (zero-copy for seeding); sequential paths run on Windows.
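The fork-and-merge pattern can be sketched as follows. This is not the project's ParallelProcessor: the function name, the use of temp files for passing results back, and the shape of the partial results (key to count) are all assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Run $work over each partition and sum the per-key counts it returns.
 * Uses forked workers when PCNTL is available, otherwise runs sequentially.
 *
 * @param list<mixed> $partitions
 * @param callable(mixed): array<string, int> $work
 * @return array<string, int>
 */
function countInParallel(array $partitions, callable $work): array
{
    $merged = [];
    $merge = static function (array $partial) use (&$merged): void {
        foreach ($partial as $key => $count) {
            $merged[$key] = ($merged[$key] ?? 0) + $count;
        }
    };

    if (!function_exists('pcntl_fork')) {
        foreach ($partitions as $partition) {
            $merge($work($partition)); // sequential fallback, e.g. on Windows
        }
        return $merged;
    }

    $children = []; // pid => temp file holding that worker's result
    foreach ($partitions as $partition) {
        $tmp = tempnam(sys_get_temp_dir(), 'part');
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('pcntl_fork failed');
        }
        if ($pid === 0) {
            // Child process: compute, persist the result, and exit.
            file_put_contents($tmp, json_encode($work($partition), JSON_THROW_ON_ERROR));
            exit(0);
        }
        $children[$pid] = $tmp;
    }

    foreach ($children as $pid => $tmp) {
        pcntl_waitpid($pid, $status);
        if (!pcntl_wifexited($status) || pcntl_wexitstatus($status) !== 0) {
            throw new RuntimeException("Worker {$pid} failed");
        }
        $merge(json_decode((string) file_get_contents($tmp), true, 512, JSON_THROW_ON_ERROR));
        unlink($tmp);
    }

    return $merged;
}
```

Because counting is associative, partial tallies can be summed in any order, which is what makes the sequential and parallel paths return the same totals.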

Factories & CLI

ParserFactory and SeederFactory pick implementations; ParserOptions and SeederOptions validate CLI flags; output writers centralize help text and console output.

06. Testing & Quality

Coverage

  • 68 passing tests and 21 skipped (PCNTL-dependent tests on Windows), with 185 assertions.
  • Unit suites for ParserFactory, Product, FieldMapper, ParserOptions, UniqueCounter, and ParallelProcessor.
  • Integration coverage for the CSV, JSON, and XML parsers and for end-to-end parser CLI flows.

Commands

composer test

composer test:parallel

composer format

composer check-format

Parallel test targets skip gracefully on Windows when PCNTL is unavailable.
