Add structural rules to YML schema (#118)
Added a new section `structural_rules` to the YML schema to enforce
rules related to the structure of the CSV file. These rules ensure that
the columns in CSV files strictly follow the order specified in the
schema and disallow extra columns not specified in the schema.
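
To make the intent of the two rules concrete, here is a minimal sketch of what they mean for a header row. The helper names `findExtraColumns` and `followsSchemaOrder` are hypothetical and are not part of Csv-Blueprint; the sketch follows the wording of this description rather than the validator code in the commit.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers for illustration only; they are not part of Csv-Blueprint.

/** Header names present in the CSV but not declared in the schema. */
function findExtraColumns(array $schemaColumns, array $csvHeader): array
{
    return array_values(array_diff($csvHeader, $schemaColumns));
}

/** True when the columns shared by both lists appear in the same relative order. */
function followsSchemaOrder(array $schemaColumns, array $csvHeader): bool
{
    $inSchemaOrder = array_values(array_intersect($schemaColumns, $csvHeader));
    $inCsvOrder    = array_values(array_intersect($csvHeader, $schemaColumns));

    return $inSchemaOrder === $inCsvOrder;
}

$schema = ['id', 'name'];

var_dump(findExtraColumns($schema, ['id', 'name', 'comment']));   // one extra column: "comment"
var_dump(followsSchemaOrder($schema, ['id', 'name', 'comment'])); // bool(true)
var_dump(followsSchemaOrder($schema, ['name', 'id']));            // bool(false)
```

With `allow_extra_columns: false` the first call would count as a violation, and with `strict_column_order: true` the last one would. The sketch assumes header names are unique.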
SmetDenis authored Mar 31, 2024
1 parent d9078f0 commit f86acf2
Showing 23 changed files with 611 additions and 307 deletions.
353 changes: 208 additions & 145 deletions README.md

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions schema-examples/full.json
@@ -13,6 +13,11 @@
"bom" : false
},

"structural_rules" : {
"strict_column_order" : true,
"allow_extra_columns" : false
},

"columns" : [
{
"name" : "Column Name (header)",
5 changes: 5 additions & 0 deletions schema-examples/full.php
@@ -34,6 +34,11 @@
'bom' => false,
],

'structural_rules' => [
'strict_column_order' => true,
'allow_extra_columns' => false,
],

'columns' => [
[
'name' => 'Column Name (header)',
14 changes: 9 additions & 5 deletions schema-examples/full.yml
@@ -22,14 +22,12 @@ description: | # Any description of the CSV file. Not u
supporting a wide range of data validation rules from basic type checks to complex regex validations.
This example serves as a comprehensive guide for creating robust CSV file validations.
# Regular expression to match the file name. If not set, then no pattern check.
# This allows you to pre-validate the file name before processing its contents.
# Feel free to check parent directories as well.
# See https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
# See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
filename_pattern: /demo(-\d+)?\.csv$/i


# Here are default values to parse CSV file.
# You can skip this section if you don't need to override the default values.
csv:
@@ -40,6 +38,12 @@ csv:
  encoding: utf-8 # (Experimental) Only utf-8, utf-16, utf-32.
  bom: false # (Experimental) If the file has a BOM (Byte Order Mark) at the beginning.

# Structural rules for the CSV file. These rules are applied to the entire CSV file.
# They are not(!) related to the data in the columns.
# You can skip this section if you don't need to override the default values.
structural_rules: # Here are default values.
  strict_column_order: true # Ensure columns in CSV follow the same order as defined in this YML schema. It works only if "csv.header" is true.
  allow_extra_columns: false # Allow CSV files to have more columns than specified in this YML schema.

# Description of each column in CSV.
# It is recommended to present each column in the same order as presented in the CSV file.
@@ -74,7 +78,7 @@ columns:
allow_values: [ y, n, "" ] # Strict set of values that are allowed.
not_allow_values: [ invalid ] # Strict set of values that are NOT allowed.

# Any valid regex pattern. See https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
# Any valid regex pattern. See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
# Of course it's a super powerful tool to verify any sort of string data.
# Please, be careful. Regex is a powerful tool, but it can be very dangerous if used incorrectly.
# Remember that if you want to solve a problem with regex, you now have two problems.
@@ -407,7 +411,7 @@ columns:
contraharmonic_mean_max: 9.0 # x <= 9.0

# Root mean square (quadratic mean) The square root of the arithmetic mean of the squares of a set of numbers.
# See https://en.wikipedia.org/wiki/Root_mean_square
# See: https://en.wikipedia.org/wiki/Root_mean_square
root_mean_square_min: 1.0 # x >= 1.0
root_mean_square_greater: 2.0 # x > 2.0
root_mean_square_not: 5.0 # x != 5.0
4 changes: 4 additions & 0 deletions schema-examples/full_clean.yml
@@ -31,6 +31,10 @@ csv:
  encoding: utf-8
  bom: false

structural_rules:
  strict_column_order: true
  allow_extra_columns: false

columns:
  - name: 'Column Name (header)'
    description: 'Lorem ipsum'
31 changes: 31 additions & 0 deletions schema-examples/readme_sample.yml
@@ -0,0 +1,31 @@
#
# JBZoo Toolbox - Csv-Blueprint.
#
# This file is part of the JBZoo Toolbox project.
# For the full copyright and license information, please view the LICENSE
# file that was distributed with this source code.
#
# @license MIT
# @copyright Copyright (C) JBZoo.com, All rights reserved.
# @see https://github.com/JBZoo/Csv-Blueprint
#

name: Simple CSV Schema
filename_pattern: /my-favorite-csv-\d+\.csv$/i
csv:
  delimiter: ';'

columns:
  - name: id
    rules:
      not_empty: true
      is_int: true
    aggregate_rules:
      is_unique: true
      sorted: [ asc, numeric ]

  - name: name
    rules:
      length_min: 3
    aggregate_rules:
      count: 10
32 changes: 20 additions & 12 deletions src/Csv/ParseConfig.php
@@ -25,15 +25,17 @@ final class ParseConfig
public const ENCODING_UTF32 = 'utf-32';

private const FALLBACK_VALUES = [
'inherit' => null,
'bom' => false,
'delimiter' => ',',
'quote_char' => '\\',
'enclosure' => '"',
'encoding' => 'utf-8',
'header' => true,
'strict_column_order' => false,
'other_columns_possible' => false,
'inherit' => null,
'bom' => false,
'delimiter' => ',',
'quote_char' => '\\',
'enclosure' => '"',
'encoding' => 'utf-8',
'header' => true,

// Global validation rules
'strict_column_order' => true,
'allow_extra_columns' => false,
];

private Data $structure;
@@ -109,12 +111,18 @@ public function isHeader(): bool

public function isStrictColumnOrder(): bool
{
return $this->structure->getBool('strict_column_order', self::FALLBACK_VALUES['strict_column_order']);
return $this->structure->findBool(
'structural_rules.strict_column_order',
self::FALLBACK_VALUES['strict_column_order'],
);
}

public function isOtherColumnsPossible(): bool
public function isAllowExtraColumns(): bool
{
return $this->structure->getBool('other_columns_possible', self::FALLBACK_VALUES['other_columns_possible']);
return $this->structure->findBool(
'structural_rules.allow_extra_columns',
self::FALLBACK_VALUES['allow_extra_columns'],
);
}

public function getArrayCopy(): array
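
The getters above now read their values from a nested `structural_rules` section through a dotted path and fall back to a default when the key is absent. The lookup idea can be sketched roughly as follows. `findBoolByPath` is a hypothetical stand-in written for this note; it is not the `findBool` method used in the diff, whose type coercion may differ (this sketch accepts only real booleans).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: walk a nested array along a dot-separated path
 * and return the boolean found there, or $fallback if any step is missing.
 */
function findBoolByPath(array $config, string $path, bool $fallback): bool
{
    $node = $config;

    foreach (explode('.', $path) as $key) {
        if (!is_array($node) || !array_key_exists($key, $node)) {
            return $fallback; // path does not exist, use the default
        }
        $node = $node[$key];
    }

    return is_bool($node) ? $node : $fallback;
}

// The schema overrides only one of the two rules.
$schema = ['structural_rules' => ['allow_extra_columns' => true]];

var_dump(findBoolByPath($schema, 'structural_rules.allow_extra_columns', false)); // bool(true), from the schema
var_dump(findBoolByPath($schema, 'structural_rules.strict_column_order', true));  // bool(true), from the fallback
```

This is why a schema without a `structural_rules` section still gets `strict_column_order: true` and `allow_extra_columns: false`, the defaults listed in `FALLBACK_VALUES`.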
2 changes: 1 addition & 1 deletion src/Rules/Aggregate/ComboRootMeanSquare.php
@@ -31,7 +31,7 @@ public function getHelpMeta(): array
[
'Root mean square (quadratic mean) ' .
'The square root of the arithmetic mean of the squares of a set of numbers.',
'See https://en.wikipedia.org/wiki/Root_mean_square',
'See: https://en.wikipedia.org/wiki/Root_mean_square',
],
[],
];
2 changes: 1 addition & 1 deletion src/Rules/Cell/Regex.php
@@ -24,7 +24,7 @@ public function getHelpMeta(): array
{
return [
[
'Any valid regex pattern. See https://www.php.net/manual/en/reference.pcre.pattern.syntax.php',
'Any valid regex pattern. See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php',
"Of course it's a super powerful tool to verify any sort of string data.",
'Please, be careful. Regex is a powerful tool, but it can be very dangerous if used incorrectly.',
'Remember that if you want to solve a problem with regex, you now have two problems.',
16 changes: 16 additions & 0 deletions src/Utils.php
@@ -29,6 +29,22 @@ final class Utils
{
public const MAX_DIRECTORY_DEPTH = 10;

public static function isArrayInOrder(array $array, array $correctOrder): bool
{
$orderIndex = 0;

foreach ($array as $element) {
$foundIndex = \array_search($element, \array_slice($correctOrder, $orderIndex), true);
if ($foundIndex !== false) {
$orderIndex += (int)$foundIndex + 1;
} elseif (\in_array($element, $correctOrder, true)) {
return false;
}
}

return true;
}

public static function printList(null|array|bool|float|int|string $items, string $color = ''): string
{
if (!\is_array($items)) {
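
The behaviour of `isArrayInOrder` is easier to see with a few inputs. The block below is a standalone copy of the method from this hunk, rewritten as a plain function so it can run outside the `Utils` class; the sample column names are made up for illustration. In short, the elements of `$array` that also occur in `$correctOrder` must appear there in the same relative order, and elements missing from `$correctOrder` are ignored.

```php
<?php
declare(strict_types=1);

// Standalone copy of Utils::isArrayInOrder() from this commit, for experimentation.
function isArrayInOrder(array $array, array $correctOrder): bool
{
    $orderIndex = 0; // how far into $correctOrder we have already matched

    foreach ($array as $element) {
        // Search only in the part of $correctOrder that has not been consumed yet.
        $foundIndex = array_search($element, array_slice($correctOrder, $orderIndex), true);
        if ($foundIndex !== false) {
            $orderIndex += (int)$foundIndex + 1;
        } elseif (in_array($element, $correctOrder, true)) {
            return false; // element exists, but only before the current position
        }
    }

    return true;
}

var_dump(isArrayInOrder(['id', 'name', 'city'], ['id', 'name', 'city'])); // bool(true), identical order
var_dump(isArrayInOrder(['id', 'city'], ['id', 'extra', 'city']));        // bool(true), gaps are allowed
var_dump(isArrayInOrder(['name', 'id'], ['id', 'name']));                 // bool(false), order is swapped
var_dump(isArrayInOrder(['id', 'missing'], ['id']));                      // bool(true), unknown element is skipped
```

In the validator the schema header is passed as `$array` and the real CSV header as `$correctOrder`, so extra CSV columns between schema columns do not fail the order check on their own.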
88 changes: 55 additions & 33 deletions src/Validators/ValidatorCsv.php
@@ -97,6 +97,27 @@ private function validateHeader(bool $quickStop = false): ErrorSuite
}
}

if ($this->schema->getCsvStructure()->isStrictColumnOrder()) {
$realColumns = $this->csv->getHeader();
$schemaColumns = $this->schema->getSchemaHeader();

if (!Utils::isArrayInOrder($schemaColumns, $realColumns)) {
$error = new Error(
'strict_column_order',
"Real columns order doesn't match schema. " .
'Expected: <c>' . Utils::printList($schemaColumns) . '</c>. ' .
'Actual: <green>' . Utils::printList($realColumns) . '</green>',
'',
ValidatorColumn::FALLBACK_LINE,
);

$errors->addError($error);
if ($quickStop && $errors->count() > 0) {
return $errors;
}
}
}

return $errors;
}

@@ -196,8 +217,6 @@ private function validateFile(bool $quickStop = false): ErrorSuite
'filename_pattern',
'Filename "<c>' . Utils::cutPath($this->csv->getCsvFilename()) . '</c>" ' .
"does not match pattern: \"<c>{$filenamePattern}</c>\"",
'',
Error::UNDEFINED_LINE,
);

$errors->addError($error);
@@ -214,38 +233,41 @@ private function validateColumn(bool $quickStop): ErrorSuite
{
$errors = new ErrorSuite();

if ($this->schema->getCsvStructure()->isHeader()) {
$realColumns = $this->csv->getHeader();
$schemaColumns = $this->schema->getSchemaHeader();
$notFoundColums = \array_diff($schemaColumns, $realColumns);

if (\count($notFoundColums) > 0) {
$error = new Error(
'csv.header',
'Columns not found in CSV: ' . Utils::printList($notFoundColums, 'c'),
'',
ValidatorColumn::FALLBACK_LINE,
);

$errors->addError($error);
if ($quickStop) {
return $errors;
if (!$this->schema->getCsvStructure()->isAllowExtraColumns()) {
if ($this->schema->getCsvStructure()->isHeader()) {
$realColumns = $this->csv->getHeader();
$schemaColumns = $this->schema->getSchemaHeader();
$notFoundColums = \array_diff($schemaColumns, $realColumns);

if (\count($notFoundColums) > 0) {
$error = new Error(
'allow_extra_columns',
'Column(s) not found in CSV: ' . Utils::printList($notFoundColums, 'c'),
'',
ValidatorColumn::FALLBACK_LINE,
);

$errors->addError($error);
if ($quickStop) {
return $errors;
}
}
}
} else {
$schemaColumns = \count($this->schema->getColumns());
$realColumns = $this->csv->getRealColumNumber();
if ($realColumns < $schemaColumns) {
$error = new Error(
'csv.header',
'Real number of columns is less than schema: ' . $realColumns . ' < ' . $schemaColumns,
'',
ValidatorColumn::FALLBACK_LINE,
);

$errors->addError($error);
if ($quickStop) {
return $errors;
} else {
$schemaColumns = \count($this->schema->getColumns());
$realColumns = $this->csv->getRealColumNumber();
if ($realColumns < $schemaColumns) {
$error = new Error(
'allow_extra_columns',
"Schema number of columns \"<c>{$schemaColumns}</c>\" greater " .
"than real \"<green>{$realColumns}</green>\"",
'',
ValidatorColumn::FALLBACK_LINE,
);

$errors->addError($error);
if ($quickStop) {
return $errors;
}
}
}
}
34 changes: 17 additions & 17 deletions tests/Commands/ValidateCsvBasicTest.php
@@ -127,23 +127,23 @@ public function testValidateOneCsvWithInvalidSchemaNegative(): void
(1/1) Schema: ./tests/schemas/demo_invalid.yml
(1/1) CSV : ./tests/fixtures/demo.csv; Size: 123.34 MB
(1/1) Issues: 10
+------+------------------+--------------+------------------------- demo.csv -------------------------------------------------------------------+
| Line | id:Column | Rule | Message |
+------+------------------+--------------+------------------------------------------------------------------------------------------------------+
| 1 | | csv.header | Columns not found in CSV: "wrong_column_name" |
| 6 | 0:Name | length_min | The length of the value "Carl" is 4, which is less than the expected "5" |
| 11 | 0:Name | length_min | The length of the value "Lois" is 4, which is less than the expected "5" |
| 1 | 1:City | ag:is_unique | Column has non-unique values. Unique: 9, total: 10 |
| 2 | 2:Float | num_max | The value "4825.185" is greater than the expected "4825.184" |
| 1 | 2:Float | ag:nth_num | The N-th value in the column is "74", which is not equal than the expected "0.001" |
| 6 | 3:Birthday | date_min | The date of the value "1955-05-14" is parsed as "1955-05-14 00:00:00 +00:00", which is less than the |
| | | | expected "1955-05-15 00:00:00 +00:00 (1955-05-15)" |
| 8 | 3:Birthday | date_min | The date of the value "1955-05-14" is parsed as "1955-05-14 00:00:00 +00:00", which is less than the |
| | | | expected "1955-05-15 00:00:00 +00:00 (1955-05-15)" |
| 9 | 3:Birthday | date_max | The date of the value "2010-07-20" is parsed as "2010-07-20 00:00:00 +00:00", which is greater than |
| | | | the expected "2009-01-01 00:00:00 +00:00 (2009-01-01)" |
| 5 | 4:Favorite color | allow_values | Value "blue" is not allowed. Allowed values: ["red", "green", "Blue"] |
+------+------------------+--------------+------------------------- demo.csv -------------------------------------------------------------------+
+------+------------------+---------------------+---------------------- demo.csv ----------------------------------------------------------------------+
| Line | id:Column | Rule | Message |
+------+------------------+---------------------+------------------------------------------------------------------------------------------------------+
| 1 | | allow_extra_columns | Column(s) not found in CSV: "wrong_column_name" |
| 6 | 0:Name | length_min | The length of the value "Carl" is 4, which is less than the expected "5" |
| 11 | 0:Name | length_min | The length of the value "Lois" is 4, which is less than the expected "5" |
| 1 | 1:City | ag:is_unique | Column has non-unique values. Unique: 9, total: 10 |
| 2 | 2:Float | num_max | The value "4825.185" is greater than the expected "4825.184" |
| 1 | 2:Float | ag:nth_num | The N-th value in the column is "74", which is not equal than the expected "0.001" |
| 6 | 3:Birthday | date_min | The date of the value "1955-05-14" is parsed as "1955-05-14 00:00:00 +00:00", which is less than the |
| | | | expected "1955-05-15 00:00:00 +00:00 (1955-05-15)" |
| 8 | 3:Birthday | date_min | The date of the value "1955-05-14" is parsed as "1955-05-14 00:00:00 +00:00", which is less than the |
| | | | expected "1955-05-15 00:00:00 +00:00 (1955-05-15)" |
| 9 | 3:Birthday | date_max | The date of the value "2010-07-20" is parsed as "2010-07-20 00:00:00 +00:00", which is greater than |
| | | | the expected "2009-01-01 00:00:00 +00:00 (2009-01-01)" |
| 5 | 4:Favorite color | allow_values | Value "blue" is not allowed. Allowed values: ["red", "green", "Blue"] |
+------+------------------+---------------------+---------------------- demo.csv ----------------------------------------------------------------------+
Summary: