17 Dec 02:03

alexreinking

ac2fc94

v19.0.0 Latest

Major improvements

Halide is now available for both C++ and Python usage via Pip. Try pip install halide today!
The Vulkan backend has matured substantially.
The HTML "conceptual statement" output now supports dark mode viewing.
For developers, CMake 3.28 is now required and we no longer require an internet connection during the build.
Thread pool improvements mean that workloads that do a small number of small tasks in parallel (e.g. a cheap operation applied to a small image) are up to 3x faster. If you have schedules that do not use parallelism for small inputs because you found it didn't provide any speedup, you may want to re-benchmark.
You can now query properties of the compiled-for target as Exprs, simplifying helper code that wants to do different things depending on the target architecture. Example: f(x) = select(target_arch_is(Target::ARM), 3, 7). Helpers include target_arch_is, target_os_is, target_has_feature, target_bits, and target_natural_vector_size. These are resolved to constants at compile-time and simplified away. Use with care, as this (intentionally) results in different behavior on different platforms.
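
To make "resolved to constants at compile-time and simplified away" concrete, here is a minimal plain-Python sketch of the idea. It does not use Halide: `TargetArchIs`, `Select`, and `fold` are hypothetical stand-ins for the real IR nodes and simplifier, and the architecture names are plain strings chosen for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class TargetArchIs:
    """Hypothetical stand-in for a target_arch_is(...) query node."""
    arch: str


@dataclass(frozen=True)
class Select:
    """Hypothetical stand-in for select(cond, true_value, false_value)."""
    cond: Union[bool, TargetArchIs]
    true_value: int
    false_value: int


def fold(expr, target_arch: str):
    """Replace the target query with a constant, then drop the dead branch."""
    if isinstance(expr, Select):
        cond = expr.cond
        if isinstance(cond, TargetArchIs):
            # The query becomes a plain boolean once the target is known.
            cond = (cond.arch == target_arch)
        # With a constant condition, only one branch survives.
        return expr.true_value if cond else expr.false_value
    return expr


# Mirrors the example in the notes: select(target_arch_is(ARM), 3, 7)
expr = Select(TargetArchIs("arm"), 3, 7)
on_arm = fold(expr, "arm")
on_x86 = fold(expr, "x86")
print(on_arm, on_x86)
```

The same expression yields a different constant per target, which is the platform-dependent behavior the note warns about.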

Breaking changes

We now distribute libGenGen.a rather than GenGen.cpp.
- Downstream users should link to this library with /WHOLEARCHIVE: or -Wl,--whole-archive rather than build GenGen.cpp themselves.
- Users of the CMake package should be unaffected.
In keeping with our LLVM support policy, support for LLVM 16 has been removed.
We no longer use the le64/le32 generic targets for compiling runtime modules to LLVM. These targets were removed in LLVM upstream.

What's Changed

Apps and tests

Reschedule the matrix multiply performance app by @abadams in #8418
Update lesson_22_jit_performance.cpp by @abadams in #8438
Add threadpool performance test by @abadams in #8447
Don't allow internal_error to pass an error test by @alexreinking in #8458
Get more consistent distributions in parallel scenarios test by @abadams in #8451

Autoschedulers

Consider all Exprs a func uses, not just the RHS, in Li2018 by @abadams in #8326

Build system

Python_bindings-test-as-installed by @LebedevRI in #8355
Bump Halide version to 19 in main branch by @steven-johnson in #8357
Remove warning for unsupported compilers by @alexreinking in #8362
Bump CMake minimum version to 3.28 by @alexreinking in #8363
Quick CMake fixes enabled by 3.28 by @alexreinking in #8365
Distribute GenGen as a static library by @alexreinking in #8367
Clean up serialization build code by @alexreinking in #8369
List headers with target_sources FILE_SETS by @alexreinking in #8370
Clean up autoscheduler dependencies by @alexreinking in #8372
Use a Find module for V8 by @alexreinking in #8373
Use a Find module for NodeJS by @alexreinking in #8374
Move dependencies/wasm to use sites by @alexreinking in #8377
Replace FetchContent with a custom dependency provider by @alexreinking in #8378
Two more build fixes by @LebedevRI in #8371
Rework LLVM into Find module and enact new component policy. by @alexreinking in #8379
Reflow src/CMakeLists.txt in logical groups by @alexreinking in #8383
Introduce HalideFeatures system for optional components by @alexreinking in #8384
Scan generated export files to determine dependencies. by @alexreinking in #8385
Rewrite bundle_static to be much more efficient. by @alexreinking in #8386
Support using vcpkg to build dependencies on all platforms by @alexreinking in #8387
Fix bundling error on buildbots by @alexreinking in #8392
Support CMAKE_OSX_ARCHITECTURES by @alexreinking in #8390
Fix Homebrew LLVM 19 by @alexreinking in #8431
Fix CPack package naming when cross-compiling by @alexreinking in #8492
Fix Apple libtool detection in bundle_static by @alexreinking in #8495

CodeGen

Select condition vector lanes must match the true and false value by @abadams in #8465
Emit vscale_range() fn attribute in correct syntax by @steven-johnson in #8457
Fix #8455 (in combination with #8457) by @steven-johnson in #8456
Fix bonehead mistake in get_md_bool() by @steven-johnson in #8469
Propagate some facts about inequalities with min/max by @shoaibkamil in #8475
- This fixed an issue where predicates in .specialize() directives weren't able to eliminate select() cases. #8443

Debugging

Add LLDB pretty-printing by @alexreinking in #8460
Print constants in scientific precision by @antonysigma in #8506
Adaptive Dark colorscheme for Stmt HTML. Ability to programmatically export conceptual stmt files. by @mcourteaux in #8327

Documentation

Update README.md by @abadams in #8404
Big documentation update by @alexreinking in #8410
- Document how to find Halide from a pip installation by @alexreinking in #8411
- Link to PyPI from Doxygen index.html by @alexreinking in #8415
- Include our Markdown documentation in the Doxygen site. by @alexreinking in #8417
- Add missing backslash by @abadams in #8419

Frontend

Don't let users disguise RVars as Vars by @abadams in #8441
Add helper functions to query properties of the lowered Target (#8192) by @steven-johnson in #8359

Hardware backends

Fix injection of GPU buffers that do not go by a Func name (i.e. alloc groups). by @mcourteaux in #8333
Remove vestigial AMDGPU backend by @alexreinking in #8382
Add ARMv8.x feature flags by @steven-johnson in #4489
[vulkan] Fixes to address outstanding validation failures by @derek-gerstmann in #8448
[vulkan] Reduce descriptor sets, use official headers, improve allocator, remove module destructor by @derek-gerstmann in #8452
[vulkan] Skip async_copy_chain and gpu_allocation_cache correctness tests on Windows by @derek-gerstmann in #8503

LLVM

Don't use le32/le64 by @steven-johnson in #8344
Fix for the removed DataLayout constructor. by @mcourteaux in #8391
Drop support for LLVM 16 in main by @steven-johnson in #8358
Allow LLVM 20 by @steven-johnson in #8352
Fix for top-of-tree LLVM by @steven-johnson in #8421
Fix for top-of-tree LLVM by @steven-johnson in #8425
Fix for top-of-tree LLVM by @steven-johnson in #8442
Fix datalayout for osx-arm-64 by @abadams in #8449
Fix top of LLVM. by @mcourteaux in #8454
Replace all use of getPointerTo() with PointerType::get() by @steven-johnson in #8473

Python

Fix Numpy 2.0 compatibility bug in lesson 10 by @alexreinking in #8381
Pip packaging at last! by @alexreinking in #8405
- Update pip package metadata by @alexreinking in #8412
- Fix classifier spelling by @alexreinking in #8413
- Upgrade LLVM to 19.1.0 in pip package by @alexreinking in #8423
- Update PIP LLVM to 19.1.4 by @alexreinking in #8488
PythonExtensionGen: ~PyHalideBuffer should call device_free() (#8399) by @steven-johnson in #8439

Runtime

Fix profiler to report time spent on GPU kernels again instead of on 'wait for parallel tasks'. by @mcourteaux in #8453
Don't spin on the main mutex while waiting for new work by @abadams in #8433

Minor bugfixes / other cleanup

Remove remaining dregs of tuple_select (oops) by @steven-johnson in https://github.com/halid...

Contributors

LebedevRI, alexreinking, and 9 other contributors

17 Jul 20:31

steven-johnson

v18.0.0

8c651b4

Halide v18.0.0

Changes Of Note since Halide 17

Ring-buffering now supported in schedules (Func::ring_buffer()). This is distinct from fold_storage in that it folds across time (the loop variables) rather than folding across space (the pure vars of the Func).
Fixed a longstanding bug in lossless_cast()
Lots of fixes for Vulkan backend
OpenGLCompute is no longer supported
Added support for ARM SVE2
Added (basic) support for Intel APX and AVX10
Added support for Hexagon HVX v68
Added support for numpy's .npy format to .debug_to_file() and the code in halide_image_io.h
Python bindings now support bfloat and int64 properly
Hacky code that auto-named Funcs, Vars etc via DWARF introspection was removed
The profiler was revamped to behave better when multiple Halide pipelines are in flight at the same time.
Numerous lowering passes were sped up, resulting in faster compilation for large pipelines. However, time spent in LLVM is still the long pole for most pipelines.
Fixed-point instruction selection has been improved via tracking constant integer bounds of expressions.
Added feature detection for ARM CPUs to the runtime library and to the host target feature computation. Supports Windows, macOS, Linux, iOS, and Android.
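
The distinction drawn above for Func::ring_buffer(), folding across time rather than across space, can be sketched in plain Python. This is only an illustration of the idea and does not call Halide; `run_with_ring_buffer`, `slots`, and `tile` are hypothetical names, and the producer/consumer bodies are made up for the example.

```python
def run_with_ring_buffer(num_iterations: int, slots: int, tile: int):
    """Simulate a producer/consumer loop whose storage is folded over time."""
    # Storage is allocated once, outside the loop, with an extra leading
    # dimension of size `slots` that is indexed by the loop iteration.
    storage = [[0] * tile for _ in range(slots)]
    outputs = []
    for t in range(num_iterations):
        slot = t % slots  # fold across the loop variable t, not across x
        # Producer: fill this iteration's slot.
        for x in range(tile):
            storage[slot][x] = t * tile + x
        # Consumer: read the same slot. With an asynchronous producer, the
        # next iteration could already be filling slot (t + 1) % slots.
        outputs.append(sum(storage[slot]))
    footprint = slots * tile
    return outputs, footprint


outputs, footprint = run_with_ring_buffer(num_iterations=4, slots=2, tile=3)
print(outputs, footprint)
```

The storage footprint stays at `slots * tile` elements no matter how many iterations run, while consecutive iterations never share a slot.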

Deprecations / Removals

tuple_select() has been removed in favor of overloads to select().
Various fixed-point operators have been removed from the Halide::Internal namespace and are now in the public Halide namespace.

What's Changed

Detect ARM CPU features for host target and in runtime (#8298)
Scheduling directive to support ring buffering by @vksnk in #7967
Don't add ring_buffer semaphores if the function is not scheduled as async by @vksnk in #8015
Quick fix for crash that is occurring in SVE2 tests. by @zvookin in #8020
Don't use variable-length arrays by @steven-johnson in #8021
Set warnings on tests as well as src by @steven-johnson in #8022
Stronger chain detection in LoopCarry pass by @vksnk in #8016
adds mappings for f16 variants of halide float math by @mikewoodworth in #8029
Require LLVM >= 16.0 by @steven-johnson in #8003
Add test for #8029 by @steven-johnson in #8032
Tweak the Printer code in runtime for smaller code by @steven-johnson in #8023
Fix bounds_of_nested_lanes by @abadams in #8039
Track whether or not let expressions failed to solve in solver by @abadams in #7982
Fix type error in VectorizeLoops by @abadams in #8055
Update makefile to use test/common/terminate_handler.cpp by @abadams in #8066
add unsafe_promise_clamped by @wraith1995 in #8071
Don't require Halide_WebGPU when using wasm (#8063) by @steven-johnson in #8065
Outsmart the LLVM optimizer by @steven-johnson in #8073
Add hexagon_benchmarks app for CMake builds by @prasmish in #8069
Fix bool conversion bug in Vulkan code generator by @derek-gerstmann in #8067
Better validation of gpu schedules by @abadams in #8068
Add an easy way to print vectors in debug output. by @zvookin in #8072
[WebGPU] Update to latest native headers by @jrprice in #8081
Remove OpenGLCompute by @steven-johnson in #8077
Add checks to prevent people from using negative split factors by @abadams in #8076
Fix rfactor adding too many pure loops by @abadams in #8086
Forward the partition methods from generator outputs by @abadams in #8090
Parallelize some tests by @abadams in #8078
Allow disabling of multithreading in simd op check by @steven-johnson in #8096
clang does not support _Float16 when targeting i386 by @LebedevRI in #8085
tests: correctness/float16_t: mark __extendhfsf2 with default visibility by @LebedevRI in #8084
Fix reduce_expr_modulo of vector in Solve.cpp by @abadams in #8089
[Vulkan] Region allocator fixes for memory requirements and allocations by @derek-gerstmann in #8087
Ensure string(REPLACE) is called with the right number of arguments by @alexreinking in #8097
Strip asserts right at the end of lowering by @abadams in #8094
Fix clang-tidy error in runtime.printer.h (parameter shadows member) by @steven-johnson in #8074
Fix an issue where the Halide compiler hits an internal error for bool types in widening intrinsics. by @zvookin in #8099
Small Tutorial Fix by @2022tgoel in #8111
Optionally print the time taken by each lowering pass by @abadams in #8116
Do less redundant work in UnpackBuffers by @abadams in #8104
Avoid redundant scope lookups by @abadams in #8103
Add Intel APX and AVX10 target flags and LLVM attribute setting. by @zvookin in #8052
Use a caching version of stmt_uses_vars in TightenProducerConsumer nodes by @abadams in #8102
Fix hoist_storage not handling condition correctly. by @abadams in #8123
Rewrite the skip stages lowering pass by @abadams in #8115
Remove two dead vars from the Makefile by @abadams in #8125
Add support for setting the default allocator and deallocator functions in Halide::Runtime::Buffer. by @mcourteaux in #8132
Make realization order invariant to unique_name suffixes by @abadams in #8124
Make gpu thread and block for loop names opaque by @abadams in #8133
Add class template type deduction guides to avoid CTAD warning. by @zvookin in #8135
[vulkan] Add conform API methods to memory allocator to fix block allocations by @derek-gerstmann in #8130
Add sobel in hexagon benchmarks app for CMake builds by @prasmish in #8127
Handle loads of broadcasts in FlattenNestedRamps by @abadams in #8139
Use python itself to get the extension suffix, not python-config by @abadams in #8148
Rewrite the pass that adds mutexes for atomic nodes by @abadams in #8105
Feature: mark a Func as no_profiling, to prevent injection of profiling. (2nd implementation) by @mcourteaux in #8143
Bound allocation extents for hoist_storage using loop variables one-by-one by @vksnk in #8154
Support for ARM SVE2. by @zvookin in #8051
Fix two compute_with bugs. by @abadams in #8152
Python bindings: add_python_test(): do set HL_JIT_TARGET too by @LebedevRI in #8156
fix ub in lower rounding shift right by @abadams in #8173
Add some missing _Float16 support by @steven-johnson in #8174
Add conversion code for Float16 that was missed in #8174 by @steven-johnson in #8178
Tighten bounds of abs() by @rootjalex in #8168
Clarify the meaning of Shuffle::is_broadcast() by @abadams in #8158
Add .npy support to halide_image_io by @steven-johnson in #8175
Update Hexagon Install Instructions by @FabianSchuetze in #8182
Add .npy support to debug_to_file() by @steven-johnson in #8177
Don't print on parallel task entry/exit with -debug flag by @abadams in #8185
Fix corner case in if_then_else simplification by @abadams in #8189
Rewrite IREquality to use a more compact stack instead of deep recursion by @abadams in #8198
[HEXAGON] Keep support for hexagon_remote/Makefile by @aankit-quic in #8186
Faster substitute_facts by @abadams in #8200
Make Interval::is_single_point check for deep equality by @abadams in #8202
Refactor ConstantInterval by @abadams in #8179
Faster vars used tracking in simplify let visitor by @abadams in #8205
M...

Contributors

LebedevRI, alexreinking, and 19 other contributors

25 Jun 15:30

steven-johnson

v17.0.2

b2e6d2a

Halide v17.0.2

What's Changed

Backport a fix for the simpler bug in lossless_cast by @abadams in #8264
Fix Vulkan SIMT mappings for GPU loop vars; avoid formatting the GPU kernel to a string for Vulkan (since its binary SPIR-V needs to remain intact) by @derek-gerstmann in #8270

Full Changelog: v17.0.1...v17.0.2

Contributors

abadams and derek-gerstmann

20 Feb 19:50

steven-johnson

v17.0.1

5254117

Halide v17.0.1

What's Changed

Changes to make WebGPU code compliant with recent versions of Emscripten (#8106)
Fix rfactor adding too many pure loops (#8107)
Forward the partition methods from generator outputs (#8090)
Fix reduce_expr_modulo of vector in Solve.cpp (#8107)

Full Changelog: v17.0.0...v17.0.1

02 Feb 00:30

steven-johnson

v17.0.0

3577f88

Halide v17.0.0

Changes Of Note

ParamMap has been removed entirely from the public API. All users of ParamMap should migrate to Callable instead.
Halide::Parameter has been moved to the public Halide API (it was formerly "internal" and not intended for public use).
New scheduling primitives:
- Func::partition() and friends: Set the loop partition policy, which controls how/whether a loop is split into three loops (prologue/steady-state/epilogue). Loop partitioning can be useful to optimize boundary conditions (e.g. clamp_edge).
- Func::hoist_storage() and friends: Allows a function's storage to be moved to a given loop level. Unlike Func::store_at(), no optimizations are triggered (e.g. sliding window).
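
The three-way split that the loop partition policy controls (prologue/steady-state/epilogue) can be sketched in plain Python. This is an illustration only and does not use Halide; the 3-tap blur, the helper names, and the one-element boundary width are assumptions made for the example.

```python
def clamp(i: int, n: int) -> int:
    """Clamp an index into [0, n - 1], like a clamped-edge boundary."""
    return max(0, min(n - 1, i))


def blur_unpartitioned(src):
    """Single loop: every iteration pays for the boundary clamps."""
    n = len(src)
    return [src[clamp(x - 1, n)] + src[x] + src[clamp(x + 1, n)]
            for x in range(n)]


def blur_partitioned(src):
    """Same result, with the loop split into three partitions."""
    n = len(src)
    out = [0] * n
    # Prologue: the left edge still needs clamping.
    for x in range(0, min(1, n)):
        out[x] = src[clamp(x - 1, n)] + src[x] + src[clamp(x + 1, n)]
    # Steady state: all taps are in bounds, so no clamps are needed.
    for x in range(1, n - 1):
        out[x] = src[x - 1] + src[x] + src[x + 1]
    # Epilogue: the right edge still needs clamping.
    for x in range(max(1, n - 1), n):
        out[x] = src[clamp(x - 1, n)] + src[x] + src[clamp(x + 1, n)]
    return out


src = [1, 2, 3, 4, 5]
print(blur_partitioned(src))
```

The steady-state loop is where most iterations run, so removing the boundary handling there is the point of partitioning.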
New TailStrategy options for existing scheduling directives:
- ShiftInwardsAndBlend: Equivalent to ShiftInwards, but protects values that would be re-evaluated by loading the memory location that would be stored to, modifying only the elements not contained within the overlap, and then storing the blended result. Unlike ShiftInwards, this is valid to use in update definitions.
- RoundUpAndBlend: Equivalent to RoundUp, but protects values that would be written beyond the end by loading the memory location that would be stored to, modifying only the elements within the region being computed, and then storing the blended result. Unlike RoundUp, this is valid to use on non-outermost splits in update definitions.
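
A plain-Python sketch of why ShiftInwards is unsafe in an update definition and how the blended variant avoids the problem. It does not use Halide; the function names, the `buf[x] += 1` update, and the list-based "vector" are assumptions chosen to mirror the description above.

```python
def update_shift_inwards(buf, factor):
    """Apply buf[x] += 1 in tiles of `factor`, shifting the last tile inwards.

    Elements in the overlap between the last two tiles are updated twice.
    """
    n = len(buf)
    if n < factor:
        raise ValueError("extent must be at least the split factor")
    out = list(buf)
    num_tiles = (n + factor - 1) // factor
    for t in range(num_tiles):
        start = min(t * factor, n - factor)  # shift the tail tile inwards
        for i in range(start, start + factor):
            out[i] += 1
    return out


def update_shift_inwards_and_blend(buf, factor):
    """Same tiling, but only lanes outside the overlap are modified."""
    n = len(buf)
    if n < factor:
        raise ValueError("extent must be at least the split factor")
    out = list(buf)
    num_tiles = (n + factor - 1) // factor
    for t in range(num_tiles):
        base = t * factor
        start = min(base, n - factor)
        vec = out[start:start + factor]  # load what would be stored to
        for lane in range(factor):
            # Lanes below `base` were already updated by the previous tile.
            if start + lane >= base:
                vec[lane] += 1
        out[start:start + factor] = vec  # store the blended result
    return out


plain = update_shift_inwards([0] * 10, 4)
blended = update_shift_inwards_and_blend([0] * 10, 4)
print(plain)
print(blended)
```

With an extent of 10 and a factor of 4, the shifted tail tile covers elements 6 to 9, so the plain version increments elements 6 and 7 twice while the blended version leaves every element incremented exactly once.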
Substantially improved performance and display in the VizIR output.
Profiler improvements:
- Substantially nicer text output
- Injects timing into calls for copy_to_host and copy_to_device so you can measure host<->device copy overhead
- Allows optional sorting via the HL_PROFILER_SORT env var
Substantially faster codegen for several GPU backends.
Experimental serialization/deserialization feature allows for saving of Halide IR code.
Various bug fixes and improvements in the Anderson2021 autoscheduler.
Improved ARM codegen, including: better patterns for sdot/udot; improved shift/mul codegen.
Support for Zen4 architecture in the x86 backend.
Updates to the ONNX app.
Various fixes and improvements to sliding-window and storage-folding.
Improvements to slow gather operations for some x86 variants.
Improvements to correctness for the .async() scheduling directive.
Improved codegen for float16 conversion, especially on x86.
Several compile-time warnings of dubious usefulness disabled.
WebAssembly codegen now defaults to assuming that the saturating-float-to-int and sign-extension instruction sets are always available.
Target now does some reality-checking that it doesn't contain obviously nonsensical Feature combinations.

What's Changed

Misc changes and fixes to RISCV codegen
Revise LLVM fix to work when no V8 or WABT available by @steven-johnson in #7635
Be more careful about overflow in trim_bounds_using_alignment by @abadams in #7645
Add a compositing example app by @abadams in #7646
Get the ASAN toolchain working again by @steven-johnson in #7604
Upgrade clang-format and clang-tidy to use v16 by @steven-johnson in #7660
Enable the misc-use-anonymous-namespace clang-tidy check by @steven-johnson in #7661
Enable clang-tidy's modernize-use-default-member-init check by @steven-johnson in #7662
Update onnx app to Adams2019 autoscheduler and new autoscheduler API by @abadams in #7673
Remove ParamMap by @steven-johnson in #7675
Fix correctness_float16_t for ASAN builds by @steven-johnson in #7687
Add a select overload for tuples by @abadams in #7672
Add Sanitizer details to README_cmake.md by @steven-johnson in #7688
Fix quadratic algorithm in simplify_correlated_differences by @abadams in #7686
Fix float16 under asan, attempt #2 by @steven-johnson in #7691
Add a warning if a Generator declares any Outputs before the final Input (Fixes #7669) by @steven-johnson in #7697
Fixed the regularization for BGU. by @mcourteaux in #7684
Fix clang and llvm versions in scripts by @TH3CHARLie in #7702
Fix leaks caused by self-referential parameter constraints by @abadams in #7700
Fix float16 warning for older clangs by @abadams in #7701
Upgrade Halide main branch for LLVM18 by @steven-johnson in #7710
Improved profiler result printing. by @mcourteaux in #7709
Default WITH_TEST_FUZZ to OFF by @steven-johnson in #7695
Throw an error if split is called with the same older and inner var name by @TH3CHARLie in #7715
Making HLSL code-gen a couple orders of magnitude faster... by @slomp in #7719
Making Metal code-gen a bit faster by @slomp in #7720
Fix handling of thread features for scalars in Anderson2021 by @aekul in #7726
Change default generator timeout to infinite by @abadams in #7718
Remove unused using decl by @abadams in #7730
[Hexagon] - Fix problems in sim_host.cpp by @pranavb-ca in #7725
Fix RDom usage in anderson2021_test_apps_autoscheduler (Fixes #7729) by @steven-johnson in #7734
Fix leak on cloning functions with update defs by @abadams in #7735
Ignore code in src/runtime/hexagon_remote/bin/src for clang-format by @steven-johnson in #7736
Clean up really long line lengths in Anderson2021 by @steven-johnson in #7728
Revise labels on autoscheduler tests by @steven-johnson in #7732
Speedup the VizIR HTML. by @mcourteaux in #7713
Run clang-tidy on macOS runners instead of Linux by @steven-johnson in #7746
Fix infinite recursion in loop partitioning by @abadams in #7743
Fix leaks in test/correctness/memoize.cpp by @abadams in #7705
Allow optional sorting of profiler output via HL_PROFILER_SORT env var (Fixes #7638) by @steven-johnson in #7639
Permit llvm 15 on windows by @abadams in #7744
Revert accidental typo change in #7746 by @steven-johnson in #7747
[vulkan] Fix heap buffer overflow in Vulkan extension handling discovered by ASAN by @derek-gerstmann in #7740
[vulkan] Fix SPIR-V IR references causing leaks by @derek-gerstmann in #7739
Improve error-handling in Anderson2021, and ensure build deps are cor… by @steven-johnson in #7748
StmtViz: Search for tooltip only in the child node by @antonysigma in #7754
Experimental serializer by @TH3CHARLie in #7594
Define cast<i32>(u32) overflow behavior by @rootjalex in #7769
Fix vector reduce HTML by @mcourteaux in #7773
Remove fragile simd_op_check test for mlal/mlsl on ARM by @rootjalex in #7775
Speedup page loading of VizStmt. by @mcourteaux in #7755
Try to fix remaining ASAN-reported leaks by @steven-johnson in #7767
Fix out of bounds access in anderson2021_test_apps_autoscheduler by @aekul in #7771
Don't introduce reinterprets in find/lower intrinsics by @rootjalex in #7776
[Hexagon] -Build Hexagon runtime components using the Hexagon SDK (Clone of #7671) by @pranavb-ca in #7741
slice IRMatcher should only match on slices by @abadams in #7772
Don't inject undef() in the simplifier by @abadams in #7791
Fix for top-of-tree LLVM by @steven-johnson in #7798
[ARM] Distribute shifts as muls by @rootjalex in #7790
[ARM] support new udot/sdot patterns by @rootjalex in #7800
Remove some unused includes by @abadams in #7799
Add support to the makefile for serialization by @abadams in #7762
[wasm] Enable PIC for WebAssembly on LLVM v18.x by @derek-gerstmann in #7803
Update WebGPU to latest Emscripten/Dawn API by @steven-johnson in #7804
Add jump-buttons to get from Stmt directly to Assembly by @mcourteaux in #7793
Update clang-tidy action to stop breaking by @...

Contributors

steven-johnson, abadams, and 12 other contributors

24 Jun 01:10

derek-gerstmann

v16.0.0

027547f

Halide v16.0.0

What's Changed

General Notes

Support for the Vulkan API (w/SPIR-V codegen)
Support for WebGPU (experimental)
Improved Halide IR HTML Visualization
Fixed a regression in the Adams2019 auto-scheduler that disabled sub-tiling
Added GPU auto-scheduler (Anderson2021)

Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU
Luke Anderson, Andrew Adams, Karima Ma, Tzu-Mao Li, Tian Jin, Jonathan Ragan-Kelley
Proceedings of the ACM on Programming Languages (OOPSLA 2021)

Deprecations / Removals

OpenGLCompute has been deprecated
ParamMap has been deprecated
Deprecated HVX_shared_object feature has been removed
References to deprecated fixed-point operators have been removed
Deprecated halide_target_feature_disable_llvm_loop_opt has been removed
Deprecated MIPS device support has been removed

Notable Fixes & Changes

Generate dot() in the Metal backend by @vksnk in #7085
Add evaluate() and evaluate_may_gpu() to Python bindings by @steven-johnson in #7108
Add support for generating LLVM vector predication intrinsics. by @zvookin in #7111
RISC V vector predication intrinsics support by @zvookin in #7119
Add range-checking to Buffer objects in Python by @steven-johnson in #7128
Fix Python buffer handling by @steven-johnson in #7125
[WASM] Use rounding_mul_shift_right for q15mulr_sat_s pattern by @rootjalex in #7134
[x86] Generate AVX512 fixed-point instructions by @rootjalex in #7129
Fix readnone attribute for llvm 16 by @abadams in #7152
Call cache.clear between internal functions in CG_C by @steven-johnson in #7155
Add bfloat support to halide_type_to_string() by @steven-johnson in #7154
Factor simd_op_check into separate files by architecture. by @zvookin in #7163
Slightly improve error message for non-integer RDom min/extent by @abadams in #7151
Migrate from MCJIT to ORC JIT by @dkurt in #7166
Use n32:64 in RISC-V data layout by @dkurt in #7175
Don't attempt to use makecontext()/swapcontext() on Android by @steven-johnson in #7196
Add bridging for clang _Float16 type. by @zvookin in #7201
Fix issue with vector predicated comparison and select instructions. by @zvookin in #7205
Add RISC V zvl flag for LLVM version 16 or greater. by @zvookin in #7209
Extend LLVM IR type mangling to handle scalars. by @zvookin in #7212
Fix bitrot in PowerPC testing by @steven-johnson in #7211
Use aligned_alloc() as default allocator for HalideBuffer.h on most platforms by @steven-johnson in #7190
Tighten alignment promises for halide_malloc() by @steven-johnson in #7222
Fix some sources of signed integer overflow in the compiler by @abadams in #7231
Explicitly stage strided loads by @abadams in #7230
Remove deprecated halide_target_feature_disable_llvm_loop_opt by @steven-johnson in #7247
Conditional allocations shouldn't fail for size=0 in C++ backend (#7255) by @steven-johnson in #7256
Inline into extern function args during bounds inference by @abadams in #7261
Use ::aligned_alloc() instead of std::aligned_alloc() in HalideBuffer.h by @steven-johnson in #7268
Optimize Module::compile() for some edge cases by @steven-johnson in #7269
Drop support for MIPS (#7287) by @steven-johnson in #7289
Emit prototypes for destructor functions in C Backend by @steven-johnson in #7296
[HVX] Fix EliminateInterleaves by @rootjalex in #7279
Remove dependency on platform threads library by @alexreinking in #7297
Fix error of add_halide_generator in cross-compilation by @stevesuzuki-arm in #7283
Fix issue in add_halide_runtime in cross-compilation by @stevesuzuki-arm in #7284
Add workaround for the const-or-not user_context issue (#635) by @steven-johnson in #7291
[x86 & wasm] Split up double saturating-narrows from i32 by @rootjalex in #7280
Hoist vector slices using rewrite rules by @abadams in #7243
Improved halide_popcount by @Aelphy in #7225
halide_popcount<uint64_t> is broken by @steven-johnson in #7313
Fix segfault by nonconstant bound in Adams2019 by @stevesuzuki-arm in #7321
Make auto scheduler libs available in HalideHelpers package by @stevesuzuki-arm in #7285
Improve support for Arm baremetal compilation and runtime by @stevesuzuki-arm in #7286
Remove deprecated HVX_shared_object feature by @steven-johnson in #7331
Fix a subtle uninitialized-memory-read in Buffer::for_each_value() by @steven-johnson in #7330
Add a hook to Codegen_C::compile() by @steven-johnson in #7335
Tiny improvements in codegen in C backend by @steven-johnson in #7337
Devirtualize the protected compile() methods in Codegen_C by @steven-johnson in #7341
Fix tuple output bounds checks by @abadams in #7345
Change early-bound default args in Python bindings to late-bound by @steven-johnson in #7347
Fix Python error handling by @steven-johnson in #7352
Permit vectorization of non-recursive atomic operations by @abadams in #7346
Update WABT to 1.0.32; Increase stack size for WASM AOT apps by @steven-johnson in #7373
Bounds visitors for min/max were missing single_point mutated case by @abadams in #7377
Fix overflow in x86 absd lowering by @abadams in #7407
Add initial support for WebGPU by @jrprice in #6492
Use pmaddubsw for non-RDom horizontal widening adds by @abadams in #7440
Compute comparison masks in narrower types if possible by @abadams in #7392
Fix bugs in PyTorch codegen. by @Yongqi-Zhuo in #7443
Remove references to deprecated variants of fixed-point operators by @steven-johnson in #7457
Add GPU autoscheduler by @aekul in #6856
d3d12 runtime: replacing spinlocks by mutex objects by @slomp in #7489
Feature Enhancement: Halide IR HTML Visualization by @maaz139 in #7421
Deprecate ParamMap (#7121) by @steven-johnson in #7357
Forbid assigning to Buffer(Expr) by introducing an intermediate type. by @abadams in #7517
[vulkan phase2] Vulkan Runtime by @derek-gerstmann in #6924
Add libfuzzer compatible fuzz harness by @silvergasp in #7512
fuzz: Port correctness/cse fuzzer over to libfuzzer by @silvergasp in #7543
metal : replacing spinlock by mutex by @slomp in #7532
Fix save_tiff() PlanarConfig assignment for monochrome inputs by @philboske in #7568
Fix various compilation errors with AppleClang 14.0.3 by @steven-johnson in #7578
fuzz: Add libfuzzer compatible bounds fuzzer by @silvergasp in #7549
Significant change to RISC V and scalable vector code generation. by @zvookin in #7616
Fix inverted may_subtile checks by @abadams in #7626
Deprecate OpenGLCompute for Halide 16 by @shoaibkamil in #7627

New Contributors

@sashashura made their first contribution in #7136
@twesterhout made their first contribution in #7315
@terryheo made their first contribution in #7323
@adrian-lebioda made their first contribution in #7379
@Ttayu made their first contribution in #7402
@Yongqi-Zhuo made their f...

Contributors

alexreinking, steven-johnson, and 23 other contributors

07 Apr 23:21

steven-johnson

v15.0.1

4c63f1b

Halide v15.0.1

What's Changed

The Python binding of compile_to_callable() was not properly copying from device to host for output buffers, so output was typically black (or garbage) when used with a GPU target. (#7213)
The bin directory was missing from the installs.
Upgraded LLVM to 15.0.7
New in 15.0.0, but restated here for visibility: The target flag disable_llvm_loop_opt is deprecated, as it's now the default behavior. This means that we have turned off llvm's autovectorization and loop unrolling. This should not affect any schedules with manually-specified vectorization and unrolling, other than trimming code size a little. However, schedules that do not vectorize or unroll may slow down because they were (intentionally or not) relying on llvm to do it automatically. If you see a performance regression with Halide 15, try turning on the enable_llvm_loop_opt target flag.

06 Mar 23:38

steven-johnson

v15.0.0

d7651f4

Halide v15.0.0

What's Changed

General Notes

Support for RISC V Vector architectures.
Python-related:
- Halide builds for Python are now published to PyPI, so the Halide Python bindings can be installed simply by running pip install halide
- Major improvements were made to the Python bindings, with many missing or incomplete sections of the API added or filled in.
- We now support the use of Generators from Python (for both JIT and AOT usage).
- The standard CMake rules now support generating a Python extension directly.
- Support for Python was removed from Halide's Makefiles; you must use CMake to build the Python bindings
Halide::Func now allows you to (optionally) constrain the type(s) of Exprs that the Func can contain, and/or the dimensionality of the Func.
Added a new way to use the JIT (compile_to_callable) that allows calling a jitted function with the same syntax as for AOT-compiled functions, allowing more control over JIT lifespan, as well as thread-safe arguments without requiring ParamMap
General improvements to SIMD codegen
Several rarely-used parts of the C++ Generator API were deprecated, and the way that autoschedulers are specified for AOT compilation is now completely different (but better for future expandability).
CMake builds now require >= v3.22
WABT usage requires >= v1.0.30
LLVM 12 is no longer supported
The target flag disable_llvm_loop_opt is deprecated, as it's now the default behavior. This means that we have turned off llvm's autovectorization and loop unrolling. This should not affect any schedules with manually-specified vectorization and unrolling, other than trimming code size a little. However, schedules that do not vectorize or unroll may slow down because they were (intentionally or not) relying on llvm to do it automatically. If you see a performance regression with Halide 15, try turning on the enable_llvm_loop_opt target flag.

Notable bug fixes

Make Halide::round behave as documented (#7012)
Incorrect folding of saturating_sub (#6883)
The check for race conditions didn't consider where clauses (#6808)
Performance regression for x86 for certain LLVM versions (#6783)
Fusing a specialization drops compute_withs from generated code (#6770)
Incorrect output when realize condition depends on tuple call (#6915)
Python extensions should default to throwing exceptions rather than calling abort() for errors (#6986)
Python bindings didn't support bool buffers (#7006)
Python bindings didn't support float16 buffers (#7060)
Python extensions that executed on GPU didn't copy back to host properly (#6869)
Fix bugs in div_round_to_zero and fast_integer_divide_round_to_zero (#7008)
Bugs in add_requirement() (#7045)
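To illustrate why the Halide::round fix (#7012) can change results, the sketch below contrasts two common tie-breaking rules in plain Python. It assumes the documented behavior is round-half-to-even on ties, which these notes do not state, so check the Halide::round documentation before relying on it. The helper name round_half_up is hypothetical.

```python
import math


def round_half_up(v: float) -> int:
    """Round to the nearest integer, sending exact ties toward +infinity."""
    return math.floor(v + 0.5)


# Python's built-in round() sends exact ties to the nearest even integer.
for v in [0.5, 1.5, 2.5]:
    print(v, round(v), round_half_up(v))
```

The two rules agree everywhere except on exact ties such as 0.5 and 2.5, so only pipelines that round values landing exactly halfway between integers would see a difference.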

Major changes

Augment Halide::Func to allow for constraining Type and Dimensionality by @steven-johnson in #6734 and #6735
Add Target support for architectures with implementation specific vector size. by @zvookin in #6786
Add support for vscale vector code generation. by @zvookin in #6802
Remove Python bindings from Makefiles by @alexreinking in #6821
Add a new, alternate JIT-call convention by @steven-johnson in #6777
Pip packaging by @alexreinking in #6886 and #6938
Define a Generator framework in Python by @steven-johnson in #6764
Make Halide::round behave as documented by @abadams in #7012

Minor changes

-mtune=/-mcpu= support for x86 AMD CPU's by @LebedevRI in #6655
Enable deprecations warnings by @steven-johnson in #6555
Fix GPU depredication/scalarization by @shoaibkamil in #6669
Allow PyPipeline and PyFunc to realize() scalar buffers by @steven-johnson in #6674
Future-proof 'processortotune processor` by @LebedevRI in #6673
Fix ctors for Realization by @steven-johnson in #6675
-mtune=native CPU autodetection for AMD Zen 3 CPU by @LebedevRI in #6648
Clean up Python extensions in python_bindings by @steven-johnson in #6670
Halide::Tools::save_image() should accept buffers with const types by @steven-johnson in #6679
Fix "set but not used" warnings/errors by @steven-johnson in #6683
Drop support for LLVM12 by @steven-johnson in #6686
Upgrade to clang-format 13 by @steven-johnson in #6689
Always mark _ucon as 'unused' in Codegen_C by @steven-johnson in #6691
Add break to avoid 'possible unintentional fallthru' warning by @steven-johnson in #6694
Silence "unknown warning" in Clang 13 by @steven-johnson in #6693
Fixes for top-of-tree LLVM by @steven-johnson in #6697
Python: make Func implicitly convertible to Stage (#6702) by @steven-johnson in #6704
llvm no longer wants a type suffix on vst intrinsics by @abadams in #6701
Fix type-mangling for vst on arm32 for LLVM15 by @steven-johnson in #6705
Remove the last remaining call to getPointerElementType() by @steven-johnson in #6715
ARM vst mangling needs to be conditional on opaque ptrs by @steven-johnson in #6716
Combine string constants in combine_strings() by @steven-johnson in #6717
Update CodeGen_PTX_Dev to use new PassManager by @steven-johnson in #6718
Closure functions for parallel tasks should be internal, not external by @steven-johnson in #6720
Smarten type_of<> for fn ptrs; fix async_parallel for C backend by @steven-johnson in #6719
Remove legacy::FunctionPassManager usage in Codegen_PTX_Dev by @steven-johnson in #6722
get_amd_processor(): implement detection for the rest of supported AMD CPU's by @LebedevRI in #6711
Add Func::output_type() method by @steven-johnson in #6724
Grab-bag of minor Python fixes by @steven-johnson in #6725
Remove rounding_halving_sub and non-existent arm rhsub instructions by @rootjalex in #6723
Faster widening_mul(int16x, int16x) -> int32x for x86 (AVX2 and SSE2) by @rootjalex in #6677
Add missing #include in ThreadPool.h by @steven-johnson in #6738
Fix regression from #6734 by @steven-johnson in #6739
Add forwarding for the recently-added Func::output_type() method by @steven-johnson in #6741
Silence "unscheduled update stage" warnings in msan_generator.cpp by @steven-johnson in #6740
Add pycache to toplevel .gitignore file by @steven-johnson in #6743
Silence "may be used uninitialized" in Buffer::for_each_element() by @steven-johnson in #6747
Update WABT to 1.0.29 by @steven-johnson in #6748
Update hannk README link to hosted models page by @steven-johnson in #6749
Add a HalideError base class to Python bindings by @steven-johnson in #6750
Add GeneratorFactoryProvider to generate_filter_main() by @steven-johnson in #6755
Minor metadata-related cleanups by @steven-johnson in #6759
Expand the x86 SIMD variants tested in correctness_vector_reductions by @steven-johnson in #6762
Fix Param::set_estimate for T=void by @steven-johnson in #6766
add_python_aot_extension should use FUNCTION_NAME for the .so output … by @steven-johnson in #6767
Fix fundamental confusion about target/tune CPU by @LebedevRI in #6765
Fix annoying typo in Func.h by @steven-johnson in #6774
Add execute_generator() API by @steven-johnson in #6771
Allow overriding of ...

Contributors

LebedevRI, alexreinking, and 11 other contributors

07 Apr 23:02

alexreinking

v14.0.0

6b9ed2a

Halide 14.0.0

What's Changed

Major changes

@abadams
- Add ability to pass a user context in JIT mode (#6313)
- Reenable warning about unscheduled update definitions (#6602)
@alexreinking
- Add helper for cross-compiling Halide generators. (#6366)
@LebedevRI
- Implement SanitizerCoverage support (Refs. #6513) (#6517)
@steven-johnson
- Expand optional static-typing for Buffer to include dimensionality (#6574)
- Deprecate the Generator::build() method (#6580)
- Move GeneratorContext into a standalone class (#6618)
- Python Bindings didn't allow for zero-D Funcs, ImageParams, Buffers (#6633)
@zvookin
- Timer based profiler (#6642)

Minor changes

@abadams
- Deprecate JIT runtime override methods that take void * (#6344)
- Allow users to use their own cuda contexts and streams in JIT mode (#6345)
- Add --help flag to rungenmain, fixing #5323 (#6354)
- Do target-specific lowering of lerp (#6432)
- Reduce overhead of sampling profiler by having only one thread do it (#6433)
- Skip custom cuda context test on older GPUs (#6437)
- Avoid needless gather in fast_integer_divide lowering (#6441)
- Fixes for c++20 (#6446)
- Add a fast integer divide that rounds to zero (#6455)
- Let lerp lowering incorporate a final cast. (#6480)
- Try removing optional buffer added to closure (#6481)
- rounding shift rights should use rounding halving add (#6494)
- Make random faster by putting the innermost var last (#6504)
- Make it possible to interpret a wide type as multiple smaller elements (#6506)
- Handle mixed-width args to mul-shift-right (#6526)
- Attempted redo of faster noise (#6539)
- Better default lowering of absd (#6545)
- Make HALIDE_REGISTER_GENERATOR work with multiple template args (#6556)
- Rename Output to OutputFileType and deprecated Output (#6568)
- Remove incorrect not-multiple-of-16 claim (#6573)
- Fix bug in mul_shift_right matching (#6610)
@alexreinking
- Add super-build for cross-compiling HANNK (#6374)
- Fix empty INSTALL_COMMAND in hannk super-build (#6387)
- Remove halide_config.cmake from Makefile build. Fixes #6615 (#6616)
- Make IRComparer consider nans to be less than non-nans. (#6626)
@ashishUthama
- Include LICENSE.txt in package (#6428)
@dsharletg
- Fix description of rounding_shift_left/rounding_shift_right (#6549)
@Elarnon
- Only commutative reductions can be parallelized (#6609)
@jinderek
- Support new warp shuffle intrinsics after CUDA Volta architecture (#6505)
@knzivid
- python_bindings: Fix SIGSEGV in HalidePythonCompileTimeErrorReporter (#6635)
@LebedevRI
- [CMake] Deduplicate Halide_LLVM_VERSION and LLVM_PACKAGE_VERSION (#6646)
@masahi
- [APP] Fix hexagon_benchmarks build (use two-var prefetch) (#6563)
@mcleary
- Add support for AMX instructions (#5818)
@mcourteaux
- Include GPU source kernels in Stmt and StmtHtml file. (#6444)
- Syntax highlighting for embedded PTX code. (#6447)
@mgharbi
- Fixes the Pytorch Wrapper Codegen for CPU-only machines. (#6590)
@OmarEmaraDev
- Fix default device wrap native function (#6310)
- Fix wrong type in Ramp CodeGen for OpenGLCompute (#6349)
- Vectorize Ramp in OpenGLCompute backend (#6372)
- Support vectorization in OpenGLCompute backend (#6348)
- Support vectorized Select in OpenGLCompute backend (#6371)
@rootjalex
- Make bounds of let visitor use unique_name() (#6583)
- Remove incorrect docs on widening_add (#6625)
- Disallow Type::narrow() and Type::widen() from producing bitwidths between 1 and 8 bits (#6622)
- Wild match object should not be foldable (#6623)
- Clear bounds info on casts when value bounds are undefined for overflow types (#6640)
@slomp
- decommissioning StackPrinter (#6470)
@steven-johnson
- [hannk] Fix MeanOp (#6336)
- Add using OpVisitor::visit; to various OpVisitors to avoid overload warnings for some compilers (#6337)
- [hannk] Add a prepare() method for ops and interp (#6338)
- Fix WASM datalayout for top-of-tree LLVM (#6339)
- Make halide_type_t and halide_type_of constexpr (#6340)
- Harvest IWYU changes for LLVM, WABT (#6341)
- Fix HelloWasm (#6342)
- Fix Makefile for LLVM11 (injection from #5818) (#6343)
- [hannk] requantize() should never skip the operation (#6350)
- [hannk] augment SoftmaxOp to allow specifying axis (#6351)
- Use Node instead of d8 for Wasm AOT testing (#6356)
- [hannk] Add missing call to Interpreter::prepare in benchmark app (#6358)
- [hannk] Allow disabling TFLite+Delegate build in CMake (#6360)
- [hannk] Add support for building/running for wasm (#6361)
- Update Emscripten settings (#6362)
- [hannk] Clean up aliasing (v2) (#6364)
- [hannk] tests should only process .tflite files (#6368)
- Revamp Hannk IR (#6379)
- Fix for top-of-tree LLVM (#6380)
- Remove halide_assert() from halide_default_device_wrap_native (#6381)
- Rename halide_assert -> halide_abort_if_false (#6382)
- Convert various halide_assert -> static_assert (#6383)
- Fix for top-of-tree LLVM (#6386)
- Check results of all runtime function calls (#6389)
- Add halide_debug_assert() macro (#6390)
- [hannk] Have CMake emit .s, .stmt, .ll files (#6392)
- [hannk] Upgrade hannk to use TFLite 2.7.0 by default (#6393)
- Clean up CodeGen_LLVM names to match ASAN nomenclature changes (#6395)
- Drop support for LLVM11 (#6396)
- Move PyTorch test into standalone tests (#6397)
- Remove halide_abort_if_false() usage in runtime/metal (#6398)
- Fix OGLC debug builds (#6399)
- Add defensive checks to halide_buffer_copy_already_locked (#6401)
- _halide_buffer_crop() needs to check for runtime failures (v2) (#6403)
- Fix broken ASAN code (#6408)
- [hannk] Pacify clang-tidy (#6412)
- One more ASAN fix (#6413)
- [hannk] Fix lower_tflite_fullyconnected (#6414)
- Fix Introspection issues (#6424)
- Don't remap the function name or the target in the metadata (#6430)
- Set up SANITIZER_FLAGS and OPTIMIZE for apps/Makefile.inc (#6435)
- Ensure that halide_start_clock() is called before halide_current_time… (#6438)
- Codegen_C: buffer compilation needs to special-case scalar buffers (#6442)
- Add operator<< for Closure (#6443)
- Re-enable performance_async_gpu for D3D12Compute (#6450)
- Tweak Hexagon codegen output (#6461)
- Add LinkageType::ExternalPlusArgv (#6452) (#6463)
- Fix Closure API (#6464)
- Move null check from Printer to halide_string_to_string() (#6467)
- Deal with Printer::scratch (#6469) (#6472)
- Restore support for using V8 as the Wasm JIT interpreter (#6478)
- Fail if no_bounds_query specified for HL_JIT_TARGET (#6489)
- Document the usage of llvm::legacy::PassManager (#6491)
- Update WABT to 1.0.25 (#6497)
- Grab Bag of minor cleanups to LowerParallelTasks (#6498)
- Update simd_op_check for arm64 upz1 code generation (#6499) (#6500)
- Fix size_t -> int conversion warning (#6501)
- Fix simd-op-check for top-of-tree LLVM (#6529)
- Revert "Make random faster by putting the innermost var last" (#6538)
- Fix GeneratorOutput_Buffer::set_estimates() (#6540)
- Revert "Make it possible to interpret a wide type as multiple smaller elements" (#6541)
- Convert apps/hannk/Elementwise to use generate() (#6543)
- Fixes for top-of-tree LLVM (#6546) (#6548)
- Fix deprecation warnings in Python tutorials (#6552)
- Use add_halide_generator() everywhere in apps/ (#6554)
- Fix for top-of-tree LLVM (#6561)
- Enable simd_op_check test for wasm i8x16.popcnt (#6562)
- Revert "Fix for top-of-tree LLVM" (#6564)
- wasm simd cleanup (#6566)
- Add support for wasm-simd ops for integer-integer widening (#6567)
- Add explicit to a handful of Generator-related ctors. (#6569)
- Fix typo in comment in HalideBuffer.h (#6570)
- Allow calling scheduling methods on Output<Buffer[]> (#6577)
- Fix for top-of-tree LLVM (#6579)
- Fix Win32-specific breakage in top-of-tree LLVM (#6581)
- Convert apps/ to use static Buffer dims where useful (#6585)
- Various fixes to static-dimensioned Buffer (#6589)
- Convert Buffer<> usage in python_bindings/ to use static dimensions (#6591)
- Convert Buffer<> usage in test/generators to use static dimensions (#6592)
- Rename BufferDimsUnconstrained -> AnyDims (#6594)
- Allow building with LLVM15 (#6603)
- Update WasmExecutor for WABT API changes (#6612)
- Minor Generator cleanup (#6613)
- Unbreak WABT again by using main instead of a commit (#6614)
- Update apps/hannk to use TFLite 2.8.0 (#6617)
- Update WABT version to the just-released 1.0.27 (instead of main) (#6619)
- Clean up python_binding Makefile (#6634)
- Fix const-correctness in C/C++ backend (Issue #6636) (#6638)
- Convert most remaining Generators to prefer statically-dimensioned In… (#6641)
- Allow profiler feature under wasm iff wasm_threads is enabled (#6643)
- Fix UB in hannk FillWithRandom operation. (#6645)
- Update initialization of WABT store field to work with top-of-tree (#6649)
- Fix apparent typo in PR #6294 (#6653)
- Eliminate some unnecessary clamping in ClampUnsafeAccesses (#6297) (#6654)
- Python Bindings: fix Python bool -> Expr implicit conversion (#6657)
- Fix 'variable set but not used' warning/error (#6658)
- Allow make test_apps to work with ASAN (#6659)
- Add optional runtime H::R::Buffer access checks (#6660)
- Add ldscript code for Python extensions in CMake (#6665)
- Remove the nobuild/partialbuildmethod tests from...