All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- handling of empty C5 and C5.1 data files
- updated get_months to fix crashes
- Removed Section_Type from COUNTER 5.1 TR
- test for max_id during metrics extraction
- make it possible to extract the year/month from a row and the other month/year from the header
- make C4 heuristics more tolerant
- parsing of COUNTER 5.1
- branch with undefined variable
- different offset bases for Coord ("start", "parser", "area")
- validator to force aligned dates
- validator to convert HH:MM:SS to seconds
- handle situation when there is no date in output CounterRecord more gracefully
- dynamic area offset calculations
- normalize date when it is parsed using date_pattern
- dynamic areas
- fallbacks for sources
- pydantic updated to 2.7.3
- parse item_authors and item_publication_date
- make sure that item_* are included in all parsers
- bump nigiri to 2.2.0
- dimensions_to_skip wasn't actually skipping records
- wrapping of xlrd error
- add Parent_Data_Type as a dimension for IR reports
- bump nigiri to 2.1.0
- counter-like output of nibbler-eat
- limit the number of lines parsed via ExtractParams.max_idx
- display better exception when Exclude_Monthly_Details=True is used for C5 reports
- parser for C5 IR reports
- pydantic updated to 2.6.4
- make counter-like output look more like counter data
- use ruff instead of black
- become compatible with nigiri >=2.0.0
- in item parsing of IR_M1 reports use CounterRecord.item instead of CounterRecord.title
- bump nigiri to 1.3.3
- bump nigiri to 1.3.1
- use nltk instead of jellyfish
- Don't close opened files in DictReader
- handle a situation when no date is provided in date validators
- store idx into exception when RecordError occurs
- reading xls files (optional dependency)
- implement DictReader abstraction over SheetReader
- make it possible to pass opened files to all Readers
- improved counter header detection
- value validator should strip content before number is parsed
- properly terminate when no metric is present for counter formats
- isbn13 and isbn10 only validators
- special value extractor MinutesToSeconds
- make it possible to override value validators per metric
- add parser_info for C5 Json parsers
- updated pydantic to 2.3.0
- small optimization of pydantic models loading
- using pydantic dataclasses for validators
- nibbler-eat: make platform an optional param
- extend NoParserMatchesHeuristics with information obtained from parsers (only C4 and C5)
- extract data and put it into Poop.extras (currently only C4 and C5 headers)
- performance regression introduced with PoopStats
- title detection for counter format parsing was fixed
- PoopStat.organizations updated to provide more info per organization
- Organization is extracted from C5 reports according to the standard
- using newer version of pydantic 2.1 with significant performance boost
- deprecated `get_metrics_dimensions_title_ids_months` function was removed
- pydantic error when reusing date_pattern attribute of DateSource
- CI linters
- make allowed_metrics work in data header processing
- aliases in data header stopped working
- fallback when processing title_ids
- generic parser and parser definition
- allow processing log-like files without values
- new action to data header parsing
- new `available_metrics` and `on_metric_check_failed` parser attributes
- wrong exception handling during data header processing
- bump nigiri to 1.3.1
- Poop.area_counter to count the number of records from each area
- Poop.records_with_stats function which fills Poop.current_stats while processing the records
- IsDateCondition can be used to check that field contains a date (using validators.Date)
- Conditions for areas - this enables dynamically detecting area positions based on a provided condition
- removed `parse_date` function to unite date parsing logic
- use `data_headers` for celus format (so the parsing behaves more similarly to regular parsers)
- removed double caching in `CounterHeaderArea.header_row`
- on_validation_error for ExtractParams
- the way errors are handled
- use calculate_dimensions in xlsx parsing properly (even for unsized worksheets)
- remedy for parsing of messed up xlsx files
- prefix and suffix to extract parameters
- configurable row offsets for parsers
- allow choosing between US (1/24/2022) and EU (24/1/2022) date formats
- allow manually picking the date format
- compose date from two cells (month cell and year cell)
- allow "Metric Type" as "Metric_Type" alias
- python 3.11 compatibility issues
- explicitly closing SheetReader
- don't link SheetReader to Source
- add function to merge PoopStats
- typo in SpecialExtraction.COMMA_SEPARATED_NUMBER
- use tuple in record hash calculation
- allow parsing numbers with comma-separated digits (e.g. '1,123,456')
- added `Poop.get_stats` to gather various info regarding output data
- make function `get_metrics_dimensions_title_ids_months` deprecated
- don't stop when an empty Platform dimension occurs during counter data parsing
- setting nested data while merging CounterRecord
- case insensitive metric_to_skip, titles_to_skip and dimensions_to_skip
- skip empty lines while parsing tabular counter reports
- add offset and limit to Poop.records()
- extend eat() function with same_check_size attribute to check for the same records
- extend Poop with data_format attribute
- allow using CoordRange in conditions
- make sure that the line length is not decreasing during XLSX -> CSV conversion
- don't check for the name of the report in C5 reports
- bump nigiri to 1.3.0
- don't convert CounterReport to csv before passing it to debug logger
- flag `--show-summary` to nibbler-eat
- flag `--no-output` to nibbler-eat
- make counter sources cached
- avoid seeking to the first position in a file when moving window of CsvReader forward
- optimize extract function - reduce the number of created Coords
- increase max header row size to 1000
- counter 5 parser is not ignoring extra dimensions
- cache default value Validator generation to speed up extraction
- add 'Printed_ISSN' and 'Printed ISSN' aliases for ISSN
- make dimension detection case insensitive for counter data
- remove constraint that title can't be a number
- don't store empty values to title_ids
- add missing `URI` to title_ids
- less strict validators (isbn, issn, doi)
- consider empty values as zeros in Tabular counter reports
- strip titles during parsing
- allow parsing JSON celus format with an empty header
- fix extractor caching between sheets
- wrong naming in DataHeaders rules
- searching for data columns is more customizable using a set of rules
- ability to skip or stop while searching for data columns
- proper parsing of dates in `Dec-21` format
- `nibbler-eat` is able to generate celus-format output
- parsing performance optimizations
- nigiri version bumped
- value_extract_params option added to CelusFormat parser and definition
- use default_metric instead of override_metric in celus format
- added dynamic parser for non-counter data in celus format
- use proper csv dialect detection when using nigiri
- JsonCounter5SheetReader for reading counter data in JSON format
- static TR, DR, PR, IR_M1 counter 5 Json parsers
- csv dialect detection is done using nigiri
- should not detect csv dialect when parsing XLSX files
- Print_ISSN and Online_ISSN were swapped
- exception logic
- dimensions in counter parsers
- metric_based parser and definition
- aggregators (now it aggregates the same records)
- date_metric_based parser and definition
- sheet_idx condition
- extraction logic (ExtractParams - regex, default, skip_validation, ...)
- function to list all definitions (kind attr distinguishes different definitions)
- mandatory function dimensions to Area (returns dimension names as a list)
- parsing of IR_M1 tabular counter format
- DummyParser definition (no longer required since more definitions exist now)
- some unused fields from Sources
- more definitions refactored
- code moved to separate files to avoid circular deps
- fixed definition renamed to non_counter.date_based
- counter parsers and definition redone
- reduced some repetitive parts of the code
- data_format field (of DataFormat class) is used instead of format_name
- parser names unified
- Or condition behavior
- counter definitions for dynamic parsers
- overrides for counter definitions (heuristics, metric column, ...)
- support for TR_B1 files
- definitions refactored
- every definition needs to have name specified (currently "fixed" or "counter")
- Title with None no longer crashes the validation
- use importlib.metadata instead of pkg_resources
- add function which converts error to dict
- use latest celus-nigiri (1.1.1)
- use CounterRecord from celus_nigiri
- organization to CounterRecord
- make error comparison more error-proof
- treat platform as ordinary dimension
- caching of reader lines
- implemented dimensions and metric aliases
- version of format into definition
- more tests for dynamic parsers
- split NoParserFound error to NoParserMatchesHeuristics and NoParserForPlatformFound
- improve error display
- fix Value and SheetAttr to TableException conversion
- Fill title IDs with empty string when they are empty (they were skipped before)
- Parser should contain a mandatory field `format_name` (unique id of output format)
- added `dynamic_parsers` attribute to `eat()` function
- Serializable definitions
- Serialization of Coord and CoordRange
- DynamicParser which is able to generate a parser based on a definition
- Value, SheetAttr classes added
- Value, SheetAttr, Coord became feeders for output data (like CoordRange)
- nibbler-eat can use dynamic parser by passing parser definition in a file (`-D`)
- `eat()` returns list of Poop or Exception (not raised)
- C5 format parsing (csv/tsv) - DR, PR, TR
- nibbler-eat binary args extended
- match parsers based on regex not just using startswith
- allow empty values in Dimensions data
- make title IDs and dimensions compatible with celus
- set default CounterRecord.title_ids and CounterRecord.dimension_data to empty dict
- function `eat()` now accepts both str and pathlib.Path as path argument
- "Print ISSN" is ISSN and "Online ISSN" is EISSN in counter 4 formats
- sheet_idx property added to Poop class (sheet index from which the Poop comes)
- get_months() function added to Poop class
- C4 format parsing (csv/tsv) - BR3, DB1, DB2, PR1, JR1, JR1a, JR1GOA, JR2, MR1
- CounterRecord.platform removed (should be used as a part of `dimensions_data`)
- C4 parsing is done in a more dynamic way (no hardcoded lines nor columns)
- eat() API extended - new options added: `parsers`, `check_platform`, `use_heuristics`
- eat() will return list which will contain None if nibbler is unable to parse the sheet
- simple BR2 format parsing
- added more platforms for BR1 format
- use python entry_points to add parsers
- more complex conditions for heuristics
- support for reading multiple tables from a single sheet
- simple counter BR1 data parser
- parsing title_ids
- parsing dimensions
- use chardet to detect encodings
- readers update - process files in streaming mode
- new eat and poop API introduced
- parsers reworked (new coords range abstraction)
- non-free parts removed
- relicensing nibbler to MIT
- `nibbler-eat` doesn't display debug log messages by default
- old API
- old parsing functionality
- `nibbler-eat` returns list of supported platforms
- two new parsers added
- `ignore_metrics` - option to ignore specific metrics
- parsing of .xlsx files
- multiple sheets handling within one document
- date parsing improvements
- `find_parser_and_parse` - returns list of list of records now
- using extra python classes instead of dict in parsing
- `find_new_metrics` - no longer used
- `get_supported_platforms` api call added
- `find_parser_and_parse` function returns only records
- fixed parsing of a couple of test files
- make sure that all test files are used for parsing
- parsing data to records
- finding parser based on the file and platform
- data validations
- nibbler binary implemented
- documentation of the formats added