Products Parser
01. Project Overview
Purpose
Build a lightweight CLI that can ingest massive supplier catalogs in multiple formats, normalize field names, and emit both raw products and aggregated unique combinations without exhausting memory.
Challenge
Support CSV/TSV delimiter auto-detection, JSON arrays/objects/NDJSON, and streaming XML while keeping processing consistent across Windows and POSIX systems, with opt-in PCNTL parallelism for heavy workloads.
Solution
Implemented streaming parsers, a flexible field mapper, and a unique combination counter that can export CSV/JSON/XML. A seeder generates deterministic synthetic datasets, and parallel modes accelerate both parsing and generation when PCNTL is available.
02. Key Features
Streaming Parsers
CSV/TSV with delimiter auto-detection, JSON arrays or NDJSON, and XML via XMLReader, all processed as generators to keep memory flat on huge files.
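A minimal sketch of how delimiter auto-detection and generator-based streaming could fit together. The function names `detectDelimiter` and `streamDelimitedRows`, the candidate delimiter list, and the "count occurrences in the header line" heuristic are assumptions for illustration, not the project's actual implementation.

```php
<?php
/**
 * Guess the delimiter by counting candidate characters in the header line.
 * Hypothetical helper: the real detection logic is not shown in the source.
 */
function detectDelimiter(string $headerLine, array $candidates = [',', "\t", ';', '|']): string
{
    $best = ',';
    $bestCount = 0;
    foreach ($candidates as $candidate) {
        $count = substr_count($headerLine, $candidate);
        if ($count > $bestCount) {
            $best = $candidate;
            $bestCount = $count;
        }
    }
    return $best;
}

/**
 * Yield one associative row at a time, so only a single row is held in memory.
 */
function streamDelimitedRows(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }
    try {
        $firstLine = fgets($handle);
        if ($firstLine === false) {
            return; // empty file: nothing to yield
        }
        $delimiter = detectDelimiter($firstLine);
        // The escape argument is passed explicitly; relying on its default is deprecated in PHP 8.4.
        $headers = str_getcsv(rtrim($firstLine, "\r\n"), $delimiter, '"', '');
        while (($row = fgetcsv($handle, null, $delimiter, '"', '')) !== false) {
            if (count($row) !== count($headers)) {
                continue; // skip blank or malformed lines
            }
            yield array_combine($headers, $row);
        }
    } finally {
        fclose($handle);
    }
}
```

Because the function is a generator, a caller can `foreach` over a multi-gigabyte file and memory use stays roughly at the size of one row.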
Field Normalization
Maps supplier headers (brand_name, model_name, gb_spec_name, etc.) into canonical fields: make, model, colour, capacity, network, grade, condition.
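One way such a mapping could look is a plain alias table applied to each row. This is a sketch under assumptions: only `brand_name`, `model_name`, and `gb_spec_name` appear in the text, the pairing of `gb_spec_name` with `capacity` is inferred, and the remaining aliases and the `normalizeRow` function are hypothetical.

```php
<?php
// Hypothetical alias table: supplier header => canonical field.
const FIELD_ALIASES = [
    'brand_name'     => 'make',
    'model_name'     => 'model',
    'gb_spec_name'   => 'capacity',  // inferred pairing
    'colour_name'    => 'colour',    // assumed alias
    'network_name'   => 'network',   // assumed alias
    'grade_name'     => 'grade',     // assumed alias
    'condition_name' => 'condition', // assumed alias
];

/**
 * Rename known supplier headers to canonical fields and drop everything else.
 * Headers that are already canonical (e.g. "make") pass through unchanged.
 */
function normalizeRow(array $row, array $aliases = FIELD_ALIASES): array
{
    $canonical = array_unique(array_values($aliases));
    $normalized = [];
    foreach ($row as $header => $value) {
        $key = strtolower(trim((string) $header));
        $field = $aliases[$key] ?? $key;
        if (in_array($field, $canonical, true)) {
            $normalized[$field] = trim((string) $value);
        }
    }
    return $normalized;
}
```

Keeping the aliases in data rather than in code means a new supplier format would only need new table entries.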
Unique Combination Counter
Aggregates normalized products and exports CSV, JSON, or XML summaries of unique make/model/capacity/colour/network/grade/condition combinations.
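The aggregation can be pictured as a keyed tally over the seven canonical fields. The sketch below assumes rows are already normalized arrays; `countCombinations`, the `count` column name, and the separator-based key are illustrative choices rather than the project's UniqueCounter.

```php
<?php
const COMBINATION_FIELDS = ['make', 'model', 'capacity', 'colour', 'network', 'grade', 'condition'];

/**
 * Tally how often each combination of canonical fields occurs.
 * Accepts any iterable, so it can consume a parser generator directly.
 */
function countCombinations(iterable $products): array
{
    $counts = [];
    foreach ($products as $product) {
        $values = [];
        foreach (COMBINATION_FIELDS as $field) {
            $values[$field] = (string) ($product[$field] ?? '');
        }
        // The ASCII unit separator is unlikely to occur in catalog data,
        // so joined values should not collide across field boundaries.
        $key = implode("\x1F", $values);
        if (!isset($counts[$key])) {
            $counts[$key] = $values + ['count' => 0];
        }
        $counts[$key]['count']++;
    }
    return array_values($counts);
}
```

The returned list of rows can then be handed to `json_encode`, `fputcsv`, or an XML writer, depending on the requested output extension.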
Parallel Modes
Optional PCNTL-powered workers partition inputs and merge results for faster parsing or seeding on POSIX systems; sequential mode runs everywhere.
Synthetic Data Seeder
Generates deterministic sample catalogs across CSV, TSV, JSON, NDJSON, and XML for benchmarks or fixtures, with zero-copy merges in parallel mode.
Memory Efficient
Generator-driven parsing keeps RAM steady on large catalogs; optional chunk flushing avoids blowups on extreme cardinality.
03. Technology & Requirements
Runtime & Libraries
- PHP 8.4+: Strict typing, readonly classes, modern enums.
- XMLReader, DOM, SimpleXML: Native extensions for streaming XML parsing.
- Composer: Install dependencies with composer install --no-interaction --ignore-platform-reqs.
Parallelism & OS Support
- PCNTL Optional: Parallel modes for parsing/seeding on POSIX systems; sequential mode works on Windows.
- Input/Output Paths: Inputs at data/input/; outputs at data/output/; seeded defaults go to data/input/products.<type>.
- Format Coverage: CSV, TSV, JSON, NDJSON, XML with consistent normalized fields.
04. CLI Usage
Parser
Stream products and optionally write aggregated unique combinations.
composer install --no-interaction --ignore-platform-reqs
php parser.php --file=data/input/products_comma_separated.csv --unique-combinations=data/output/combination_count.csv
php parser.php --file=data/input/products.json --unique-combinations=data/output/combination_count.json
php parser.php --file=data/input/products.xml --unique-combinations=data/output/combination_count.xml
php parser.php --file=data/input/products.csv --unique-combinations=data/output/results.json --parallel=4
- --file required; format auto-detected.
- --unique-combinations writes CSV/JSON/XML based on extension.
- --parallel enables PCNTL workers (>=2).
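The flag rules above can be expressed as a small validation step. This is a sketch, not the project's ParserOptions class: `detectOutputFormat`, `validateParserOptions`, and the returned array keys are hypothetical names, and the input is assumed to be an array shaped like the result of PHP's `getopt`.

```php
<?php
/** Pick the summary format from the output path's extension (csv, json, or xml). */
function detectOutputFormat(string $path): string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return match ($ext) {
        'csv', 'json', 'xml' => $ext,
        default => throw new InvalidArgumentException("Unsupported output format: {$path}"),
    };
}

/** Enforce the documented rules: --file is required, --parallel must be at least 2. */
function validateParserOptions(array $opts): array
{
    if (!isset($opts['file']) || $opts['file'] === '') {
        throw new InvalidArgumentException('--file is required');
    }
    $parallel = 1; // sequential unless --parallel is given
    if (isset($opts['parallel'])) {
        $parallel = (int) $opts['parallel'];
        if ($parallel < 2) {
            throw new InvalidArgumentException('--parallel must be >= 2');
        }
    }
    $output = $opts['unique-combinations'] ?? null;
    return [
        'file'         => $opts['file'],
        'output'       => $output,
        'outputFormat' => $output === null ? null : detectOutputFormat($output),
        'parallel'     => $parallel,
    ];
}
```

In a script it could be fed with `getopt('', ['file:', 'unique-combinations:', 'parallel:'])`, which accepts the `--flag=value` form used in the commands above.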
Seeder
Generate deterministic sample catalogs for benchmarks or fixtures.
php seeder.php --type=csv --count=1000
php seeder.php --type=json --count=50000
php seeder.php --type=xml --count=100 --output=custom/data.xml
php seeder.php --type=ndjson --count=10000
php seeder.php --type=csv --count=1000000 --parallel=8
- --type and --count required; output defaults to data/input/products.<type>.
- --output customizes target path.
- Parallel generation merges worker outputs zero-copy.
05. Architecture Highlights
Parsers
CsvParser (auto-detect delimiter), JsonParser (arrays, wrapped objects, NDJSON), and XmlParser (XMLReader) all implement FileParserInterface and yield products via generators.
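For the XML case, streaming with XMLReader typically means advancing to each record element and expanding only that subtree. The sketch below assumes each record is a `<product>` element with flat child elements; the element name and the `streamXmlProducts` function are assumptions, since the catalog schema is not given here.

```php
<?php
/**
 * Yield one product per matching element without loading the whole document.
 * Only the current element's subtree is materialized via SimpleXML.
 */
function streamXmlProducts(string $path, string $elementName = 'product'): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open {$path}");
    }
    try {
        // Advance to the first matching element (or to the end of the file).
        while ($reader->read() && $reader->name !== $elementName) {
            // intentionally empty
        }
        while ($reader->name === $elementName) {
            $node = simplexml_load_string($reader->readOuterXml());
            if ($node !== false) {
                $row = [];
                foreach ($node->children() as $child) {
                    $row[$child->getName()] = trim((string) $child);
                }
                yield $row;
            }
            // Jump to the next sibling with the same name, skipping this subtree.
            if (!$reader->next($elementName)) {
                break;
            }
        }
    } finally {
        $reader->close();
    }
}
```

The CSV and JSON parsers could expose the same generator shape, which is what lets downstream code treat all formats uniformly.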
Field Mapper
FieldMapper normalizes supplier headers (brand_name, model_name, gb_spec_name, etc.) into canonical fields: make, model, colour, capacity, network, grade, condition.
Product VO
Immutable Product validates required fields and generates unique keys used by UniqueCounter to aggregate combinations safely.
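An immutable value object of this kind might look roughly like the following. Which fields are required and how the key is derived are not stated in the text, so this sketch assumes `make` and `model` are mandatory and uses a SHA-256 hash of the JSON-encoded fields as the key; the `uniqueKey` method name is also an assumption.

```php
<?php
final class Product
{
    public function __construct(
        public readonly string $make,
        public readonly string $model,
        public readonly ?string $colour = null,
        public readonly ?string $capacity = null,
        public readonly ?string $network = null,
        public readonly ?string $grade = null,
        public readonly ?string $condition = null,
    ) {
        // Assumed rule: make and model must be non-empty.
        if (trim($make) === '' || trim($model) === '') {
            throw new InvalidArgumentException('make and model are required');
        }
    }

    /** Stable key: identical field values always produce the same hash. */
    public function uniqueKey(): string
    {
        // get_object_vars() returns properties in declaration order.
        return hash('sha256', json_encode(get_object_vars($this), JSON_THROW_ON_ERROR));
    }
}
```

Readonly properties mean a product cannot change after it has been counted, so its key stays valid for the lifetime of the object.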
UniqueCounter
Aggregates normalized products, supports optional chunk flushing for extreme cardinality, and exports CSV/JSON/XML summaries.
Parallel
ParallelProcessor and ParallelSeeder split work across PCNTL workers on POSIX, then merge results (zero-copy for seeding); sequential paths run on Windows.
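The split-then-merge idea can be sketched as follows. This is an illustration only: `partitionRange` and `runPartitioned` are hypothetical names, results are passed back through temporary files for simplicity, and the project's actual partitioning and zero-copy merge strategy are not described in enough detail to reproduce here.

```php
<?php
/** Split [0, $total) into up to $workers contiguous [start, end) ranges. */
function partitionRange(int $total, int $workers): array
{
    $workers = max(1, min($workers, $total));
    $base = intdiv($total, $workers);
    $extra = $total % $workers;
    $ranges = [];
    $start = 0;
    for ($i = 0; $i < $workers; $i++) {
        $size = $base + ($i < $extra ? 1 : 0); // spread the remainder over the first ranges
        $ranges[] = [$start, $start + $size];
        $start += $size;
    }
    return $ranges;
}

/**
 * Run $task(start, end) once per range and return the results in range order.
 * Forks one child per range when PCNTL is available, otherwise runs sequentially.
 */
function runPartitioned(int $total, int $workers, callable $task): array
{
    $ranges = partitionRange($total, $workers);
    if ($workers < 2 || !function_exists('pcntl_fork')) {
        // Sequential fallback, e.g. on Windows where PCNTL does not exist.
        return array_map(fn (array $r) => $task($r[0], $r[1]), $ranges);
    }
    $children = [];
    foreach ($ranges as $i => [$start, $end]) {
        $file = tempnam(sys_get_temp_dir(), 'part');
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child: compute its slice, hand the result back via the temp file, then stop.
            file_put_contents($file, serialize($task($start, $end)));
            exit(0);
        }
        $children[$pid] = [$i, $file];
    }
    $results = [];
    foreach ($children as $pid => [$i, $file]) {
        pcntl_waitpid($pid, $status); // block until this worker has finished
        $results[$i] = unserialize(file_get_contents($file));
        unlink($file);
    }
    ksort($results);
    return $results;
}
```

Keeping the sequential branch inside the same function means callers do not need to know whether PCNTL is present.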
Factories & CLI
ParserFactory and SeederFactory pick implementations; ParserOptions/SeederOptions validate CLI flags; output writers centralize help and console UX.
06. Testing & Quality
Coverage
- 68 passing tests, 21 skipped (PCNTL on Windows) with 185 assertions.
- Unit suites for ParserFactory, Product, FieldMapper, ParserOptions, UniqueCounter, ParallelProcessor.
- Integration coverage for CSV/JSON/XML parsers and end-to-end parser CLI flows.
Commands
composer test
composer test:parallel
composer format
composer check-format
Parallel test targets skip gracefully on Windows when PCNTL is unavailable.