Releases: IBM/unitxt
Unitxt 1.14.1 - Faster Unitxt 🚀
Important Change: Unitxt is Faster!
To improve Unitxt’s performance, we've made several optimizations:
- Operator Acceleration: Many operators have been sped up by removing unnecessary deep copying in their code, enhancing runtime efficiency.
- Caching Hugging Face Datasets: We added the option to cache Hugging Face datasets in loaders, which can prevent redundant loading operations. To enable this, you can either:
  - Set it globally in code:

    ```python
    import unitxt

    unitxt.settings.disable_hf_datasets_cache = False
    ```

  - Use the settings context:

    ```python
    from unitxt import settings

    with settings.context(disable_hf_datasets_cache=False):
        ...  # your code
    ```

  - Or set the environment variable:

    ```bash
    export UNITXT_DISABLE_HF_DATASETS_CACHE=False
    ```
- Eager Execution Mode: Running Unitxt without streaming, which can be faster in certain scenarios. Enable eager execution using the environment variable or directly in code:

  ```python
  import unitxt
  from unitxt import settings

  unitxt.settings.use_eager_execution = True

  # or
  with settings.context(use_eager_execution=True):
      ...  # your code
  ```
- Partial Stream Loading: This feature lets you load only the necessary data instances, avoiding full dataset loads when not required. Here's an example:

  ```python
  from unitxt import load_dataset

  dataset = load_dataset(
      card="cards.doc_vqa.lmms_eval",
      template="templates.qa.with_context.title",
      format="formats.models.llava_interleave",
      loader_limit=300,
      streaming=True,
  )

  print(next(iter(dataset["test"][0])))  # Loads only the first instance
  ```
Complete Example: Combining the optimizations above can lead to near 1000x faster dataset loading:

```python
from unitxt import load_dataset, settings

with settings.context(
    disable_hf_datasets_cache=False,
    use_eager_execution=True,
):
    dataset = load_dataset(
        card="cards.doc_vqa.lmms_eval",
        template="templates.qa.with_context.title",
        format="formats.models.llava_interleave",
        loader_limit=300,
        streaming=True,
    )
    print(next(iter(dataset["test"][0])))  # Loads only the first instance
```
- Execution Speed Tracking: A GitHub action has been added to monitor Unitxt’s execution speed in new pull requests, helping ensure that optimizations are maintained.
Summary
This release is focused on accelerating performance in Unitxt by introducing several key optimizations. Operator efficiency has been enhanced by removing deep copies, making operations faster. Users can now enable dataset caching for Hugging Face datasets to prevent redundant loading, configured directly in code or through environment variables. An optional eager execution mode has been added, bypassing streaming to increase speed in certain scenarios. Additionally, partial stream loading allows selective instance loading, reducing memory usage and improving response times. To maintain these improvements, a new GitHub action now monitors Unitxt’s execution speed in pull requests, ensuring consistent performance across updates.
All Changes
- Enhancements to inference engines by @lilacheden in #1243
- add post processor to convert log probs dictionary to probabilities of a specific class by @lilacheden in #1247
- CI for metrics other than main + Bugfix in RetrievalAtK by @lilacheden in #1246
- Add huggingface cache disabling option to unitxt settings by @elronbandel in #1250
- Make F1Strings faster by @elronbandel in #1248
- Fix duplicate column deletion bug in pandas serializer by @elronbandel in #1249
- revived no_deep just to compare performance by @dafnapension in #1254
- fixed scigen post-processor by @csrajmohan in #1253
- Add prediction length metric by @perlitz in #1252
- Fix faithfulness confidence intervals by @matanor in #1257
- Allow role names to be captialized in SerializeOpenAiFormatDialog by @yoavkatz in #1259
- Accelerate image example 1000X by @elronbandel in #1258
- Fix the empty few-shot target issue when using produce() by @marukaz in #1266
- fix postprocessors in turl_col_type taskcard by @csrajmohan in #1261
- Fix answer correctness confidence intervals by @matanor in #1256
- add BlueBench as a benchmark to the catalog by @shachardon in #1262
- Fix MultipleSourceLoader documentation by @marukaz in #1270
- Ignore unitxt-venv by @marukaz in #1269
- Add mmmu by @elronbandel in #1271
- A fix for a bug in metric pipeline by @elronbandel in #1268
- Added Tablebench taskcard by @csrajmohan in #1273
- Fix missing deep copy in MapInstanceValues by @yoavkatz in #1267
- Add stream name to generation of dataset by @elronbandel in #1276
- Fix demos pool inference by @elronbandel in #1278
- Fix quality github action by @elronbandel in #1281
- add operators for robustness check on tables by @csrajmohan in #1279
- Instruction in SystemFormet demo support. by @piotrhelm in #1274
- change the max_test_instances of bluebench.recipe.attaq_500 to 100 by @shachardon in #1285
- Add documentation for types and serializers by @elronbandel in #1286
- Add example for image processing with different templates by @elronbandel in #1280
- Integrate metrics team LLMaJ with current unitxt implemantation by @lilacheden in #1205
- performance profiler with visualization by @dafnapension in #1255
- Remove split arg to support old hf datasets versions by @elronbandel in #1288
- add post-processors for tablebench taskcard by @csrajmohan in #1289
- recursive copy seems safer here by @dafnapension in #1295
- Fix performance tracking action by @elronbandel in #1296
- try num of instances in nested global scores by @dafnapension in #1282
- Update version to 1.14.0 by @elronbandel in #1298
- expand performance table by @dafnapension in #1299
- Fix doc_vqa lmms_eval by @elronbandel in #1300
- prepare for int-ish group names and type names and add the exposing card by @dafnapension in #1303
- remove groups breakdowns from global score of grouped instance metrics by @dafnapension in #1306
- Update the safety metric batch size to 10 by @perlitz in #1305
New Contributors
- @piotrhelm made their first contribution in #1274
Full Changelog: 1.13.1...1.14.1
Unitxt 1.14.0 - Faster Unitxt
What's Changed
- Simplify qa example by @yoavkatz in #1234
- allow multiple references for f1 strings metric by @ShirApp in #1225
- Add bluebench recipes by @shachardon in #1237
- Allow templates dicts to be python dicts and fix a bug in the TemplatesDict definition by @elronbandel in #1240
- Deep copy artifacts that fetched twice by @elronbandel in #1239
- Adding of ANLS metric to doc_vqa and info_vqa datasets by @alfassy in #1241
- Update README.md by @elronbandel in #1242
- Update version to 1.13.1 by @elronbandel in #1244
- Enhancements to inference engines by @lilacheden in #1243
- add post processor to convert log probs dictionary to probabilities of a specific class by @lilacheden in #1247
- CI for metrics other than main + Bugfix in RetrievalAtK by @lilacheden in #1246
- Add huggingface cache disabling option to unitxt settings by @elronbandel in #1250
- Make F1Strings faster by @elronbandel in #1248
- Fix duplicate column deletion bug in pandas serializer by @elronbandel in #1249
- revived no_deep just to compare performance by @dafnapension in #1254
- fixed scigen post-processor by @csrajmohan in #1253
- Add prediction length metric by @perlitz in #1252
- Fix faithfulness confidence intervals by @matanor in #1257
- Allow role names to be captialized in SerializeOpenAiFormatDialog by @yoavkatz in #1259
- Accelerate image example 1000X by @elronbandel in #1258
- Fix the empty few-shot target issue when using produce() by @marukaz in #1266
- fix postprocessors in turl_col_type taskcard by @csrajmohan in #1261
- Fix answer correctness confidence intervals by @matanor in #1256
- add BlueBench as a benchmark to the catalog by @shachardon in #1262
- Fix MultipleSourceLoader documentation by @marukaz in #1270
- Ignore unitxt-venv by @marukaz in #1269
- Add mmmu by @elronbandel in #1271
- A fix for a bug in metric pipeline by @elronbandel in #1268
- Added Tablebench taskcard by @csrajmohan in #1273
- Fix missing deep copy in MapInstanceValues by @yoavkatz in #1267
- Add stream name to generation of dataset by @elronbandel in #1276
- Fix demos pool inference by @elronbandel in #1278
- Fix quality github action by @elronbandel in #1281
- add operators for robustness check on tables by @csrajmohan in #1279
- Instruction in SystemFormet demo support. by @piotrhelm in #1274
- change the max_test_instances of bluebench.recipe.attaq_500 to 100 by @shachardon in #1285
- Add documentation for types and serializers by @elronbandel in #1286
- Add example for image processing with different templates by @elronbandel in #1280
- Integrate metrics team LLMaJ with current unitxt implemantation by @lilacheden in #1205
- performance profiler with visualization by @dafnapension in #1255
- Remove split arg to support old hf datasets versions by @elronbandel in #1288
- add post-processors for tablebench taskcard by @csrajmohan in #1289
- recursive copy seems safer here by @dafnapension in #1295
- Fix performance tracking action by @elronbandel in #1296
- try num of instances in nested global scores by @dafnapension in #1282
- Update version to 1.14.0 by @elronbandel in #1298
New Contributors
- @alfassy made their first contribution in #1241
- @piotrhelm made their first contribution in #1274
Full Changelog: 1.13.0...1.14.0
Unitxt 1.13.1
Update version to 1.13.1 (#1244)
Unitxt 1.13.0 - Multi Modality and Types
New type handling capabilities
The most significant change in this release is the introduction of type serializers to unitxt.
Type serializers are in charge of taking a specific data structure, such as a Table or Dialog, and serializing it into a textual representation.
You can now define tasks in unitxt that have complex types such as Table or Dialog, and define serializers that handle their transformation to text.
This allows you to control the representation of different types from the recipe API:
```python
from unitxt import load_dataset
from unitxt.struct_data_operators import SerializeTableAsMarkdown

serializer = SerializeTableAsMarkdown(shuffle_rows=True, seed=0)
dataset = load_dataset(card="cards.wikitq", template_card_index=0, serializer=serializer)
```
If you want to serialize the table differently, you can switch to any of the many available table serializers.
Defining a New Type
If you wish to define a new type with custom serializers, you can do so using the Python typing library:
```python
from typing import Any, List, TypedDict

class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]
```
Once your type is ready, register it with unitxt's type handling in the code you are running:
```python
from unitxt.type_utils import register_type

register_type(Table)
```
Now your type can be used anywhere across unitxt (e.g., in task definitions or serializers).
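For example, here is a minimal sketch (the field layout and metric name are illustrative assumptions, not part of these release notes) of a task definition that uses the registered Table type:

```python
from unitxt.blocks import Task

# Hypothetical task whose input contains the custom Table type registered above.
table_qa_task = Task(
    input_fields={"table": Table, "question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],  # assumed catalog metric, for illustration only
)
```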
Defining a Serializer For a Type
If you want to define a serializer for your custom type, or for any typing type combination, you can do so as follows:
```python
from typing import Any, Dict

from unitxt.serializers import SingleTypeSerializer

class MySerializer(SingleTypeSerializer):
    serialized_type = Table

    def serialize(self, value: Table, instance: Dict[str, Any]) -> str:
        return "..."  # your code to turn a value of type Table into a string
```
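You can then pass your serializer to a recipe just like the built-in ones; a sketch reusing the wikitq card from the earlier example:

```python
from unitxt import load_dataset

# Use the custom serializer defined above when loading the dataset.
dataset = load_dataset(
    card="cards.wikitq",
    template_card_index=0,
    serializer=MySerializer(),
)
```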
Multi-Modality
You can now process image-text to text or image-audio to text datasets in unitxt.
For example, to load the doc_vqa dataset:
```python
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.models.llava_interleave",
    loader_limit=20,
)
```
Since unitxt already has data augmentation mechanisms, it is natural to use them for images as well. For example, if you want your images in grey scale:
```python
dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.models.llava_interleave",
    loader_limit=20,
    augmentor="augmentors.image.grey_scale",  # <= Just like the text augmenters!
)
```
Then, if you want to get the scores of a model on this dataset, you can use:
```python
from unitxt import evaluate
from unitxt.inference import HFLlavaInferenceEngine
from unitxt.text_utils import print_dict

inference_model = HFLlavaInferenceEngine(
    model_name="llava-hf/llava-interleave-qwen-0.5b-hf", max_new_tokens=32
)

test_dataset = dataset["test"].select(range(5))

predictions = inference_model.infer(test_dataset)
evaluated_dataset = evaluate(predictions=predictions, data=test_dataset)

print_dict(
    evaluated_dataset[0],
    keys_to_print=["source", "media", "references", "processed_prediction", "score"],
)
```
Multi-modality support in unitxt builds upon the type handling introduced in the previous section, with two new types: Image and Audio.
What's Changed
- add revision option to hf loader by @OfirArviv in #1189
- Support dataset field in nested JSON files by @antonpibm in #1188
- Add TURL Table column type annotation task card by @csrajmohan in #1186
- Update operators.py - copy edits (grammar, consistency, clarity) by @welisheva22 in #1187
- Numeric nlg postproc by @ShirApp in #1185
- Add support for Literal, TypedDict and NewType for unitxt type checking by @elronbandel in #1191
- Scarebleu metric: remove mecab_ko and mecab_ko_dic from metric requir… by @eladven in #1197
- Add rag dataset + openai format dialog operator by @OfirArviv in #1192
- Update README.md by @elronbandel in #1198
- add decorator with init warning by @MikolajCharchut in #1200
- Add mock inference mode setting and allow testing without gen ai key by @elronbandel in #1204
- Fix using OpenAiInferenceEngine for LLMAsJudge by @yifanmai in #1194
- Add TogetherAiInferenceEngine by @yifanmai in #1203
- Fix OpenAiInferenceEngine by @yifanmai in #1193
- Add serializers to templates and reorganize and unite all templates by @elronbandel in #1195
- Add demos to task_data by @elronbandel in #1206
- Move test_context_correctness by @matanor in #1207
- Add image-text to text datasets by @elronbandel in #1211
- Refactor augmentors to be more scaleable + add image aumgentors by @elronbandel in #1212
- Fix grey scale augmentor and add to image example by @elronbandel in #1213
- Add images to UI by @elronbandel in #1216
- add unified decorator for warnings and unit tests by @MikolajCharchut in #1209
- Add templates list option to standard recipe by @elronbandel in #1219
- Use read token for huggingface datasets reading by @elronbandel in #1223
- add Llava-next system prompt by @OfirArviv in #1221
- Improve performance for huggingface tokenizer based format by @elronbandel in #1224
- Fix compute expression to use the instance variables as globals by @elronbandel in #1217
- Add generic inference engine to allow dynamic selection by the user by @eladven in #1226
- A suggested PR for issue 1106: More meaningful error message when catalog consistency fails by @dafnapension in #1201
- Add random templates for bluebench by @perlitz in #1222
- A suggested PR for issue #1214: fixed a bug in score_prefix for grouped instance scores by @dafnapension in #1228
- Add control over serizliers from recipe + improve serializers construction + allow seed for table shuffling serizliers by @elronbandel in #1229
- Fix table tasks to use default table serializers by @elronbandel in #1230
- Add concurency_limit parameter to WMLInferenceEngine by @elronbandel in #1231
- Add wml and generic based llmaj metric by @perlitz in #1227
- Update version to 1.13.0 by @elronbandel in #1232
New Contributors
- @MikolajCharchut made their first contribution in #1200
Full Changelog: 1.12.4...1.13.0
1.12.4
Main changes
- Enable defining benchmarks in Unitxt by adding the ability to produce scores for groups based on task attributes and recipe metadata. For more information see https://www.unitxt.ai/en/latest/docs/benchmark.html by @elronbandel in #1130
- Enable inference/production APIs to support invocation by task without specifying a card. This enables using any task in the Unitxt catalog as an inference function. Check https://www.unitxt.ai/en/latest/docs/production.html for details (#957)
- Add support for multi-modality. For details see https://www.unitxt.ai/en/latest/docs/multimodality.html by @elronbandel in #1175
Additions to catalog
- Add ProvoQ dataset artifacts by @bnayahu in #1168
- Add Wikitq metric by @ShirApp in #1167
- Add more LLMs as judges ensembles by @pvn25 in #1171
- Add Scigen table2text task with llm_as_judge metric by @csrajmohan in #1134
New Features
- Add LLM as judge ensemble metrics, and add LLMaaJ ensemble example by @pvn25 in #1081
- Refactor RenameFields operator to Rename. The old operator is still supported but raises a deprecation warning by @elronbandel in #1123
Bug Fixes
- Make cache compatible with python 3.8 by @elronbandel in #1172
- Deprecated field used to print a warning message with the wrong reason, by @dafnapension in #1174
Documentation changes
- Update llm_as_judge.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1164
- Update formats.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1163
- Update loaders.py --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1162
- Update card.py - minor documentation changes by @welisheva22 in #1161
- Update adding_dataset.rst - a few more minor documentation changes by @welisheva22 in #1160
- Update artifact.py --- documentation edits (grammar, consistency, cla… by @welisheva22 in #1159
- Update glossary.rst --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1155
- Update helm.rst --- copy edits (grammar, consistency, clarity) by @welisheva22 in #1154
- Update operators.py --- copy edits (grammar, consistency, clarity) - take 2 by @welisheva22 in #1158
- Docfix: Fix typo in Installation doc by @yifanmai in #1181
1.12.3
Main changes
- New option to use multiple templates and/or num_demos in a single dataset recipe. Unitxt will randomly sample from the provided templates and possible numbers of demos for each instance (a sketch follows this list). See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_templates_num_demos.py
- A warning is now generated when a metric generates a score with the same name as that of another metric and overwrites it. See more details on how to deal with conflicting metric names in https://www.unitxt.ai/en/latest/docs/adding_metric.html#metric-outputs-with-multiple-metrics
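A rough sketch of the multiple-templates option (the card and template names below are illustrative assumptions, not taken from this release):

```python
from unitxt import load_dataset

# Each instance is assigned a template and a number of demos sampled from the lists.
dataset = load_dataset(
    card="cards.sst2",  # assumed card name, for illustration only
    template=[
        "templates.classification.multi_class.default",
        "templates.classification.multi_class.instruction",
    ],
    num_demos=[0, 1, 5],
    demos_pool_size=50,
)
```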
Non backward compatible changes in catalog
- change rag metrics name convention (e.g. "metrics.rag.mrr" -> "metrics.rag.context_correctness.mrr") - catalog non backward compatible change by @assaftibm in #1104
- Update summarization task and templates to support multiple reference summaries - by @yoavkatz in #1126
- Fix belebele due to new convention by @elronbandel in #1145
Additions to catalog
- Add DeepSeek-Coder format and system prompt by @oktie in #1105
- Add a metric to calculate the ratio of references included in the prediction by @marukaz in #1091
- adding RAG bge metrics by @assaftibm
New Features
- Add option to run multiple templates and/or num_demos in a single dataset recipe. It is now possible to give a list of templates or num_demos; Unitxt will randomly sample from the templates and assign each instance a random template from the list. By @elronbandel in #1110
- A warning is now generated when a metric generates a score with the same name as that of another metric and overwrites it, by @dafnapension in #1124
- The MetricPipeline field postpreprocess_steps has been renamed to postprocess_steps. The old field (postpreprocess_steps) still exists for backward compatibility but is deprecated. By @dafnapension in #1117
- Decrease runtime of demo examples
- Add tests for RAG metrics by @matanor
- Adding dedicated Unitxt warning and error classes to link online documentation by @yoavkatz in
- The code now uses a central controllable deepcopy function by @elronbandel in #1120
Bug Fixes
- Create a dedicated nltk mixin for downloading all versions of punkt needed by the metrics code, by @elronbandel in #1151
- For bulk instance metrics, replace the mean function with nanmean to support aggregation in case of nan scores, by @elronbandel in #1150
- Fix helm test by @elronbandel in #1109
- Fix bug with RAG metrics: Fix use of minilm model by @assaftibm in #1115
- Fix data classification of WML model to include 'public' classification by @yoavkatz in #1118
- Fix WMLInferenceEngine by @pawelknes in #1122
- Fix belebele HF path due to new convention by @elronbandel in #1145
Documentation changes
- Improve debugging.rst wording
- Improve examples.rst wording by @welisheva22 in #1138
- Improve data_classification_policy.rst wording by @welisheva22 in #1139
- Improve rag_support.rst wording by @welisheva22 in #1139
- Improve production.rst wording by @welisheva22 in #1148
- Improve the clarity of the code examples.
- Improve load_datasets.rst wording by @welisheva22
- Improve introduction.rst wording by @welisheva22
- Improve installation.rst wording by @welisheva22
- Improve adding_format.rst wording by @welisheva22
- Improve adding_task.rst wording by @welisheva22
- Improve adding_template.rst wording by @welisheva22
- Improve adding_dataset.rst wording by @hanansinger
- improve index.rst page by @yoavkatz
- Fix link to llama blog in adding_format.rst by @andersonm-ibm in #1113
- Added example of RAG response by @yoavkatz in #1121
New Contributors
- @andersonm-ibm made their first contribution in #1113 by @welisheva22 in #1152
Unitxt 1.12.2
Main changes
- Task "input"/"output" fields renamed to "input_fields" and "reference_fields" to be better reflect their meaning and the type of each field is now define by python class names and not strings (str vs "str") . See example of new syntax here:
https://www.unitxt.ai/en/latest/docs/adding_task.html (old syntax still allowed) - Ability create ensemble of judges . See example in https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
- Optimized Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variances in scores (well within the confidence interval)
- Added ability to select demonstrations that depend on the specific instance (and not only at random). See example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py. This change causes some changes in the random selection of demos due to seed changes, but should not have any aggregated effect beyond random fluctuations.
- For LLM as Judges, the input sent to the judge is now displayed in the score field called 'judge_raw_input'
- Support for arena hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py
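A minimal sketch of the new task syntax (the field names and metric here are illustrative assumptions, not taken from the release notes):

```python
from typing import List

from unitxt.blocks import Task

# Field types are given as Python classes (str, List[str], ...) rather than strings.
qa_task = Task(
    input_fields={"question": str, "choices": List[str]},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],  # assumed metric, for illustration only
)
```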
Non backward compatible changes
- Changed template method names to "input_fields" and "reference_fields" (affects only people who wrote custom template code) by @yoavkatz in #1030
- Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variances in scores (well within the confidence interval) by @yoavkatz in #1011
- Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by @yoavkatz in #1034
Changes in Catalog
- safety and regard metrics became instance metrics and named SafetyMetric and RegardMetric by @dafnapension in #1004
- Remove financebench card since it was removed from HF by @elronbandel in #1016
- add validation to tldr, remove shuffle from billsum by @alonh in #1038
- Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
- numeric nlg dataset template changes by @ShirApp in #1041
Additions to catalog
- Arena hard elad2 by @eladven and @OfirArviv in #1026
- Add flores101 by @perlitz in #1053
- Add metric "metrics.rag.retrieval_at_k" to catalog by @matanor in #1074
- Add Finqa dataset by @ShirApp in #962
- Allow rag context_id fields to be List[str] and not only List[int] by @perlitz in #1036
- Rag end to end task support (in progress) - by @benjaminsznajder in #1044, #1080
New Features
- Rename task fields "input"/"output" to "input_fields" and "reference_fields" by @luisaadanttas in #994
- Support for ensemble metrics by @eladven in #1047
- Additional inference parameters for openai and genai, and simplified InferenceEngine API param passing, by @pawelknes in #1019 and #1024
- Real types in tasks and metrics by @elronbandel in #1045
- Ability to create demo samplers based on instance by @yoavkatz in #1034
- add judge input to the LLM as Judge metric scores by @OfirArviv in #1064
Bug Fixes
- Solve problem with stripping format in LLM as a judge code, by @eladven in #1005
- Added seed to LLM as judges for consistent results by @yoavkatz in #1029
- Fixed issues with fresh install by @yoavkatz in #1037
- WML Inference Engine fix by @pawelknes in #1013
- replace type and type in type error message by @perlitz in #1035
- FinQA - filter problematic examples by @ShirApp in #1039
- demo's target prefix is now taken from demo instance by @dafnapension in #1031
- Make sure preparation times printed fully and nicely by @elronbandel in #1046
- Added prediction type to LLM as judge to avoid warning by @yoavkatz in #1072
- Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
- Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
- Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL, by @eladven in #1021
- Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
- Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
- fix - building test is not working. Updated Kaggle version. by @benjaminsznajder in #1055
Documentation changes
- Update error message and documentation on unitxt local and HF version conflict by @yoavkatz in #995
- Update llm_as_judge.rst by @yoavkatz in #1085
- Update introduction.rst add the word "a" before "variety" by @welisheva22 in #1015
- Example improvements by @yoavkatz in #1022
- Add a guide for using unitxt with lm-evaluation-harness by @elronbandel in #1020
- Fix some docs titles and links by @elronbandel in #1023
- Add example of meta evaluation of llm as judge by @yoavkatz in #1025
- Update introduction.rst - - copy edits (grammar, consistency, clarity) by @welisheva22 in #1063
- Added example for selection of demos by @yoavkatz in #1052
New Contributors
We want to thank the new contributors for their first contributions!
- @welisheva22 made their first contribution in #1015
- @luisaadanttas made their first contribution in #994
- @benjaminsznajder made their first contribution in #1055
- @hanansinger made their first contribution in #1057
Unitxt 1.12.0
Main changes
- Task "input"/"output" fields renamed to "input_fields" and "reference_fields" to be better reflect their meaning and the type of each field is now define by python class names and not strings (str vs "str") . See example of new syntax here:
https://www.unitxt.ai/en/latest/docs/adding_task.html (old syntax still allowed) - Ability create ensemble of judges . See example in https://www.unitxt.ai/en/latest/docs/examples.html#evaluate-using-ensemble-of-llm-as-a-judge-metrics
- Optimized Rouge and Meteor metrics to run faster; they now report confidence intervals by default. This causes very small variances in scores (well within the confidence interval)
- Added ability to select demonstrations that depend on the specific instance (and not only at random). See example in https://github.com/IBM/unitxt/blob/main/examples/evaluate_different_demo_selections.py. This change causes some changes in the random selection of demos due to seed changes, but should not have any aggregated effect beyond random fluctuations.
- For LLM as Judges, the input sent to the judge is now displayed in the score field called 'judge_raw_input'
- Support for arena hard benchmark. See example: https://github.com/IBM/unitxt/blob/main/examples/evaluate_a_model_using_arena_hard.py
Non backward compatible changes
- Changed template method names to "input_fields" and "reference_fields" (affects only people who wrote custom template code) by @yoavkatz in #1030
- Refactor Rouge and Meteor to InstanceMetric for faster score computation - this causes very small variances in scores (well within the confidence interval) by @yoavkatz in #1011
- Ability to create demo samplers based on instance (this causes changes in random selection of demos in normal mode) by @yoavkatz in #1034
Changes in Catalog
- safety and regard metrics became instance metrics and named SafetyMetric and RegardMetric by @dafnapension in #1004
- Remove financebench card since it was removed from HF by @elronbandel in #1016
- add validation to tldr, remove shuffle from billsum by @alonh in #1038
- Fix typo in japanese_llama system prompt (issue #964) by @bnayahu in #1056
- numeric nlg dataset template changes by @ShirApp in #1041
Additions to catalog
- Arena hard elad2 by @eladven and @OfirArviv in #1026
- Add flores101 by @perlitz in #1053
- Add metric "metrics.rag.retrieval_at_k" to catalog by @matanor in #1074
- Add Finqa dataset by @ShirApp in #962
- Allow rag context_id fields to be List[str] and not only List[int] by @perlitz in #1036
- Rag end to end task support (in progress) - by @benjaminsznajder in #1044, #1080
New Features
- Rename task fields "input"/"output" to "input_fields" and "reference_fields" by @luisaadanttas in #994
- Support for ensemble metrics by @eladven in #1047
- Additional inference parameters for openai and genai, and simplified InferenceEngine API param passing, by @pawelknes in #1019 and #1024
- Real types in tasks and metrics by @elronbandel in #1045
- Ability to create demo samplers based on instance by @yoavkatz in #1034
- add judge input to the LLM as Judge metric scores by @OfirArviv in #1064
Bug Fixes
- Solve problem with stripping format in LLM as a judge code, by @eladven in #1005
- Added seed to LLM as judges for consistent results by @yoavkatz in #1029
- Fixed issues with fresh install by @yoavkatz in #1037
- WML Inference Engine fix by @pawelknes in #1013
- replace type and type in type error message by @perlitz in #1035
- FinQA - filter problematic examples by @ShirApp in #1039
- demo's target prefix is now taken from demo instance by @dafnapension in #1031
- Make sure preparation times printed fully and nicely by @elronbandel in #1046
- Added prediction type to LLM as judge to avoid warning by @yoavkatz in #1072
- Fixed confidence interval inconsistency when some metrics compute ci and some do not by @dafnapension in #1065
- Fix bug in data classes and add support for field overriding in fields containing types or functions by @elronbandel in #1027
- Set LoadFromIBMCloud verify to be lazy, in order to allow preparing the cards without defining FMEVAL_COS_URL, by @eladven in #1021
- Added check of type of format and system prompt to LLM as judge by @yoavkatz in #1068
- Allow assigning None in overwrites when fetching artifacts with modifications by @dafnapension in #1062
- fix - building test is not working. Updated Kaggle version. by @benjaminsznajder in #1055
Documentation changes
- Update error message and documentation on unitxt local and HF version conflict by @yoavkatz in #995
- Update llm_as_judge.rst by @yoavkatz in #1085
- Update introduction.rst add the word "a" before "variety" by @welisheva22 in #1015
- Example improvements by @yoavkatz in #1022
- Add a guide for using unitxt with lm-evaluation-harness by @elronbandel in #1020
- Fix some docs titles and links by @elronbandel in #1023
- Add example of meta evaluation of llm as judge by @yoavkatz in #1025
- Update introduction.rst - - copy edits (grammar, consistency, clarity) by @welisheva22 in #1063
- Added example for selection of demos by @yoavkatz in #1052
New Contributors
We want to thank the new contributors for their first contributions!
- @welisheva22 made their first contribution in #1015
- @luisaadanttas made their first contribution in #994
- @benjaminsznajder made their first contribution in #1055
- @hanansinger made their first contribution in #1057
1.11.1
Non backward compatible changes
- The class InputOutputTemplate has the field input_format. This field is now required, meaning templates that do not use it must explicitly set it to None (a sketch follows this list). By @elronbandel in #982
- Fix MRR RAG metric - fix MRR wiring and allow context_ids to be a list of strings instead of a List[List[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7, and adds corresponding tests. This change may change the scores of the MRR metric. By @matanor in
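A minimal sketch of a template that sets input_format explicitly (the format strings are illustrative assumptions):

```python
from unitxt.templates import InputOutputTemplate

# input_format is now a required field; set it to None explicitly if the template does not use it.
template = InputOutputTemplate(
    input_format="Question: {question}",
    output_format="{answer}",
)
```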
New Features
- Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
- Add option for lazy load hf inference engine by @elronbandel in #980
- Added a format based on Huggingface format by @yoavkatz in #988
New Assets
- Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
Bug Fixes
- Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978
Documentation
- Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
- Improve the examples table documentation by @eladven in #976
Refactoring
- Delete empty metrics folder by @elronbandel in #984
Testing and CI/CD
New Contributors
- @lga-zurich made their first contribution in #978
Full Changelog: 1.10.1...1.10.2
1.11.0 (#996)
Non backward compatible changes
- The class InputOutputTemplate has the field input_format. This field is now required, meaning templates that do not use it must explicitly set it to None. By @elronbandel in #982
- Fix MRR RAG metric - fix MRR wiring and allow context_ids to be a list of strings instead of a List[List[str]]. This allows directly passing the list of predicted context ids, as was done in unitxt version 1.7, and adds corresponding tests. This change may change the scores of the MRR metric. By @matanor in
New Features
- Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
- Add option for lazy load hf inference engine by @elronbandel in #980
- Added a format based on Huggingface format by @yoavkatz in #988
New Assets
- Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
Bug Fixes
- Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978
Documentation
- Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
- Improve the examples table documentation by @eladven in #976
Refactoring
- Delete empty metrics folder by @elronbandel in #984
Testing and CI/CD
New Contributors
- @lga-zurich made their first contribution in #978
Full Changelog: 1.10.1...1.10.2