24.09.0

@github-actions github-actions released this 21 Oct 15:00
· 77 commits to main since this release
98ab6c8

Features

  • Add support for optional payload encryption in the client SDK and CLI as a follow-up to #484 (#493)
  • Allow unicode characters in project (user group) names and domain names. (#1663)
  • Improve exception logging stability by pre-formatting exception objects instead of pickling/unpickling them (#1759)
  • Add new API to create new image from live session (#1973)
  • Clear error_logs records in the clear-history command (#1989)
  • Introduce the mgr schema dump-history and mgr schema apply-missing-revisions commands to ease major upgrades involving deviation of database migration histories (#2002)
  • Update image forget CLI command to untag image from registry before forgetting it from the database (#2010)
  • Update etcd-client-py to 0.3.0 (#2014)
  • Allow self-ssh in single-node single-container compute sessions. (#2032)
  • Prevent deleting mounted folders. (#2036)
  • Allow agent to report its internal registry snapshot via UNIX domain socket server (#2038)
  • New redis client (experimental) (#2041)
  • Expose user info to environment variables (#2043)
  • Introduce the rolling_count GraphQL field to provide the current rate limit counter for a keypair within the designated time window slice (#2050)
  • Deprecate the reliance on HTTP cookies for authenticating the pipeline service, switching to the use of HTTP headers instead (#2051)
  • Allow user to explicitly set filename of model definition YAML (#2063)
  • Add the backend.ai plugin scan command to inspect the plugin scan results from various entrypoint sources (#2070)
  • Bring back etcetra-backed Etcd as an option for distributed lock backend (#2079)
  • Enable distribute-lock configuration (#2080)
  • Cache volume objects in RootContext.get_volume (#2081)
  • Revamp images GQL query by changing image filtering from flag-based to feature set-based and add aliases field to customized image GQL schema (#2136)
  • Added missing fields for keypair_resource_policy in client-py, models, etc. (#2146)
  • Add parameters to check-presets SDK function (#2153)
  • Add relay-aware VirtualFolderNode GQL Query (#2165)
  • Also perform basic model service validation process when updating model service via ModifyEndpoint (#2167)
  • Add support for mounting arbitrary VFolders on model service session (#2168)
  • Add support for CentOS 8 based kernels (#2220)
  • Clear zombie routes automatically (#2229)
  • Add scaling_group.agent_count_by_status and scaling_group.agent_total_resource_slots_by_status GQL fields to query the count and the resource allocation of agents that belong to a scaling group. (#2254)
  • Allow modifying model service session's environment variable setup (#2255)
  • Add endpoint.runtime_variant column (#2256)
  • Add new API to show list of supported inference runtimes (#2258)
  • Add support for model service provisioning without model-definition.yaml (#2260)
  • Allow superadmins to force-update session status through destroy API. (#2275)
  • Add session status check & update API. (#2312)
  • Add support for fetching container logs of a specific kernel. (#2364)
  • Introduce Python native WSProxy (#2372)
  • Implement scanning plugin entrypoints of external packages (#2377)
  • Add row_id, type and container_registry fields to the GroupNode GQL schema. (#2409)
  • Add support for PureStorage RapidFiles Toolkit v2 (#2419)
  • Add API that extends lifespan of webserver's login session. (#2456)
  • Allow bulk association and disassociation of scaling groups with domains, user groups, and key pairs. (#2473)
  • Match container's timezone to container host OS when available (#2503)
  • Add a pre-setup configuration menu to the TUI installer to allow setting the public-facing address of Backend.AI components (#2541)
  • Now Backend.AI can run arbitrary container images without Backend.AI-specific metadata labels by introducing good default values and replacing intrinsic kernel-runner binaries with statically built ones (#2582)
  • Allow Bearer as valid token type on model service authentication (#2583)
  • Introduce automatic creation of a 'model-store' group upon inserting a new domain. (#2611)
  • Add support for declaring custom description field for GraphQL relay edge types. (#2643)
  • Add an enable_LLM_playground option to show/hide the LLM playground tab on the serving page. (#2677)
  • Add max_gaudi2_devices_per_container config on webserver (#2685)
  • Add max_atom_plus_device_per_container config on webserver (#2686)
  • Introduce Account-manager component. (#2688)
    • Add query depth limit config of GQL.
    • Add page size limit config of GQL Connection.
    • Set default page size of GQL Connection to 10. (#2709)
  • Add compute session GQL Relay query schema. (#2711)
  • Allow DataLoaderManager to get a loader function by function itself rather than function name. (#2717)
  • Allow filter and order in endpoint_list GQL requests. (#2723)
  • Add new vfolder API to update sharing status. (#2740)
  • Avoid raising a type error even if a particular table in the toml file is empty, as long as the default value for all settings exists. (#2782)
  • Add an explicit configuration scaling-group-type to agent.toml so that the agent could distinguish whether itself belongs to an SFTP resource group or not (#2796)
  • Add per-session priority attributes and ModifyComputeSession GraphQL mutation to update session names and priorities (#2840)
  • Add dependee/dependent/graph ComputeSessionNode connection queries (#2844)
  • Implement the priority-aware scheduler that applies to any arbitrary scheduler plugin (#2848)
  • Add support for setting a timeout when pulling Docker images and upgrade aiodocker to version 0.23.0. (#2852)
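
The priority-aware scheduling entry (#2848) can be pictured with a minimal sketch. This is not the Backend.AI implementation: PendingSession, pick_fifo and pick_with_priority are hypothetical names, and the sketch assumes that larger priority values are scheduled first. It only illustrates how a priority filter can wrap any "pick one session" policy.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PendingSession:
    # Hypothetical stand-in for a pending session record.
    name: str
    priority: int  # assumption: larger values are scheduled first


# Any "pick one session" policy can be plugged in here (FIFO, LIFO, ...).
Picker = Callable[[Sequence[PendingSession]], PendingSession]


def pick_fifo(candidates: Sequence[PendingSession]) -> PendingSession:
    """A trivial inner policy: take the first candidate in queue order."""
    return candidates[0]


def pick_with_priority(
    pending: Sequence[PendingSession],
    inner: Picker,
) -> PendingSession | None:
    """Keep only the highest-priority sessions, then delegate to `inner`."""
    if not pending:
        return None
    top = max(s.priority for s in pending)
    candidates = [s for s in pending if s.priority == top]
    return inner(candidates)
```

Because the priority filter runs before the inner policy, the inner policy never needs to know about priorities, which is one way a priority layer could apply to arbitrary scheduler plugins.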

Improvements

  • Enable robust DB connection handling by allowing pool-pre-ping setting. (#1991)
  • Enhance update mechanism of session & kernel status. (#2311)
  • Remove database-level foreign key constraints in vfolders.{user,group} columns to decouple the timing of vfolder deletion and user/group deletion. (#2404)
  • Implement storage-host RBAC interface. (#2505)
  • Optimize the query latency when fetching a large number of agents with stat metrics from Redis (#2558)
  • Split out ai.backend.logging package from the ai.backend.common to improve reusability and reduce the startup time (i.e., import latencies) (#2760)
  • Avoid using collections.OrderedDict when not necessary in the manager API and client SDK (#2842)
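
As background for the OrderedDict entry (#2842): since Python 3.7 the built-in dict preserves insertion order, so OrderedDict is only needed for its extra behaviors. The snippet below is a small illustration with made-up resource slot names, not code from the manager or the client SDK.

```python
from collections import OrderedDict

# Illustrative keys only; they are not taken from the Backend.AI code base.
plain = {"cpu": 4, "mem": 8}
plain["cuda.device"] = 1
ordered = OrderedDict([("cpu", 4), ("mem", 8), ("cuda.device", 1)])

# Both types iterate in insertion order.
same_order = list(plain) == list(ordered)

# One remaining difference: comparing two OrderedDicts is order-sensitive,
# while comparing two plain dicts is not.
reordered = {"mem": 8, "cpu": 4, "cuda.device": 1}
plain_equal = plain == reordered
ordered_equal = OrderedDict(plain) == OrderedDict(reordered)
```

A plain dict is therefore sufficient wherever only the iteration order matters and order-sensitive equality or methods such as move_to_end() are not used.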

Deprecations

  • Remove no longer used env-tester-{admin,user,user2}.sh scripts and all references (#1956)

Fixes

  • Merge kernels.role into sessions.session_type and check the image compatibility based on comparison with the ai.backend.role label (#1587)
  • Refactor PendingSession Scheduler into PendingSession scheduler and AgentSelector, and replace roundrobin flag with AgentSelectionStrategy.RoundRobin policy. (#1655)
  • Ensure that a session's occupying resources are updated in the DB when a kernel starts. (#1832)
  • Fix DDN command output handling when exceeding quotas. (#1901)
  • Explicitly specify the storage-side UID/GID when creating qtrees in the NetApp storage backend (#1983)
  • Sync mismatch between kernels.session_name and sessions.name and fix session-rename API to update session_name of sibling kernels atomically. (#1985)
  • Change function default arguments from mutable object to None. (#1986)
  • Revert some VFolder APIs response type to remove mismatch between Content-Type header and body. (#1988)
  • Upgrade pants to 2.21.0.dev4 for Python 3.12 support in their embedded pex/pip versions (#1998)
  • Fix Graylog log adapter not working after upgrading to Python 3.12 (#1999)
  • Fix compute_container GraphQL query resolver functions. (#2012)
  • Fix harbor v2 image scanner skipping the import of the remaining artifacts when any item does not include a tag (#2015)
  • Let external log viewers display more accurate, meaningful stack frames of logger invocations. (#2019)
  • Fix handling of undefined values in the ModifyImage GraphQL mutation. (#2028)
  • Fix container commit not working on certain docker engine versions (#2040)
  • Add the omitted client-to-manager request for deleting vfolders in the trash bin. (#2042)
  • Fix a buggy restriction on VFolder deletion due to wrong query condition (#2055)
  • Fix wrong usage of dataloader in GQL group resolver. (#2056)
  • Ensure that vfolders, including automount vfolders, are mounted during session creation only if their status is not set to "DEAD" (i.e., deleted). (#2059)
  • Fix wrong calculation of resource usage (#2062)
  • Fix VFolder file operations not working when the user has been granted access to a shared but deleted VFolder that has the same name as a normal one (#2072)
  • Add missing type argument in group query (#2073)
  • Let the backend.ai mgr clear-history command clear session records as well as kernel records (#2077)
  • Fix compute_session_list GQL query not responding on an abundant amount of sessions (#2084)
  • Fix VFolder invitations not being accepted when the invited VFolder shares its name with an already deleted one (#2093)
  • Fix orphan model service routes being created (#2096)
  • Fix initialization of the resource usage API's kernel-level usage aggregation (#2102)
  • Fix model server starting on every kernel (including sub-role kernels) in multi-container inference sessions (#2124)
  • Add missing commit_session_to_file to OP_EXC (#2127)
  • Fix wrong SQL query build for GQL Relay node (#2128)
  • Pass ImageRef.canonical in commit_session_to_file (#2134)
  • Handle fileset-already-exists response of create-fileset API request and make sure to wait between all GPFS job polling iterations (#2144)
  • Skip any possible redundant quota update requests when creating new quota (#2145)
    • Fix error when calling check_presets Client SDK API with an invalid group parameter
    • Rewrite Client SDK to access all APIConfig fields (#2152)
  • Ensure that all pending sessions are picked by schedulers (#2155)
  • Fix user creation error when any model-store does not exist. (#2160)
  • Fix buggy resolver of model_card GQL Query. (#2161)
  • Fix security vulnerability for sudo_session_enabled (#2162)
  • Rename endpoints.model_mount_destiation to model_mount_destination (#2163)
  • Wait for real quota scope directory creation after Netapp create_qtree() call (#2170)
  • Fix wrong per-user concurrency calculation logic (#2175)
  • Keep sync_container_lifecycles() bgtask alive in a loop. (#2178)
  • Fix missing check for group (project) vfolder count limit and error handling with an invalid group parameter (#2190)
  • Fix model service persisting in degraded status forever in rare cases when trying to delete the service (#2191)
  • Fix error when querying or mutating GraphQL using the BigInt field type (#2203)
  • Ensure that utilization idleness is checked after a set period. (#2205)
  • Fix backend.ai ssh command execution when packaged as SCIE/PEX (#2226)
    • Fix endpoints query not working when trying to load image_row.aliases
    • Fix endpoints.status reporting PROVISIONING when its status is in DESTROYING state (#2233)
  • Fix GQL raising error when trying to resolve endpoints.errors field occasionally (#2236)
  • Fix ZeroDivisionError in volume usage calculation by returning 0% when volume capacity is zero (#2245)
  • Fix GraphQL to support query to non-installed images (#2250)
  • Add missing push_image method implementation to Dummy Agent (#2253)
  • Rename no-op access_key parameter of endpoint_list GQL Query to user_uuid (#2287)
  • Fix ai.backend.service-ports label syntax broken when image does not expose built-in service port (#2288)
  • Improve stability of untag_image_from_registry mutation (#2289)
  • Fix SSH not working between kernels started with customized image (#2290)
  • Fix invalid container memory capacity being reported (#2291)
  • Corrected an issue where the resource_policy field in the user model was incorrectly mapped to domain_name. (#2314)
  • Skip cleaning up containerless kernels that are still creating their containers. (#2317)
  • Fix model service sessions created before 24.03.5 failing to spawn (#2318)
  • Fix image commit not working (#2319)
  • Fix model service session scheduler (scale_services()) failing when sessions bound to an active route are already marked as terminated (#2320)
  • Fix container metric collection halted on systems with Cgroups v1 (#2321)
  • Run batch execution after the batch session starts. (#2327)
  • Add support for configuring sync_container_lifecycles() task. (#2338)
  • Fix mismatches between responses of /services/_runtimes and new model service creation input (#2371)
  • Fix incorrect check of values returned from docker stat API. (#2389)
  • Shut down the agent properly by removing code that waits for a cancelled task. (#2392)
  • Restrict GraphQL query to user_nodes field to require superadmin privilege (#2401)
  • Handle all possible exceptions when scheduling single node session so that the status information of pending session is not empty. (#2411)
  • Utilize ExtendedJSONEncoder for error logging to handle UUID objects in extra_data (#2415)
  • Change outdated references in event module from kernels to sessions. (#2421)
  • Upgrade inquirer to remove dependency on deprecated distutils, which breaks up execution of the scie builds (#2424)
  • Allow vfolders in specific statuses to be queried for purging. (#2429)
  • Update the install-dev scripts to use pnpm instead of npm to speed up installation and resolve some peculiar version resolution issues related to esbuild. (#2436)
  • Fix a packaging issue in the backendai-webserver scie executable due to missing explicit requirement of setuptools (#2454)
  • Improve pruning of non-physical filesystems when measuring disk usage in agents (#2460)
  • Update the install-dev scripts to install pnpm if pnpm isn't installed. (#2472)
  • Improve error handling of initialization failures in the kernel runner (#2478)
  • Fix BACKEND_MODEL_NAME environment variable always being overwritten with the model name specified in the model definition (#2481)
  • Do not allow assigning preopen port which collides with image's own service port definition (#2482)
  • Fix GET requests with queryparams defined in API spec occasionally throwing 400 Bad Request error (#2483)
  • Check null value of user mutation by Undefined sentinel value rather than None. (#2506)
  • Do null check on groups.total_resource_slots and domains.total_resource_slots value. (#2509)
  • Fix heartbeat processing failing when an agent reports an image whose name is not compliant with Backend.AI's naming rule (#2516)
  • Corrected a typo (maanger corrected to manager) in the check_status() API response of the storage component (#2523)
  • Rename images.image_filters GQL Query argument to images.image_types (#2555)
  • Prevent session status from transitioning to PULLING when an image pull is not required (#2556)
  • Prevent other user's customized image from being listed as a response of images GQL query (#2557)
  • Skip resolving malformed ModelCard GQL items (#2570)
  • Delete sessions DB records when purging project. (#2573)
  • Initialize Redis connection pool objects with specified connection opts rather than ignoring them. (#2574)
  • Fix GET /func/folders/{folderName} API returning string literal "null" instead of null value on user and group fields (#2584)
  • Update GQLPrivilegeCheckMiddleware to align with upstream changes on graphql-core package (#2598)
  • Make the type check more robust when the idle checker fetches utilization data. (#2601)
  • Skip mounting zero-byte lxcfs files when lxcfs is activated to prevent crashes in session containers (#2604)
  • Fix typo in minilang query field spec and column map. (#2605)
  • Remove duplicate CPU quota arguments when creating containers (#2608)
  • Increase MAX_CMD_LEN of dropbear to improve compatibility with PyCharm debugger (#2613)
  • Silence false Redis timeout warnings when retrying blocking commands if the timeout does not exceed the expected command timeout (#2632)
  • Fix a regression of #2483 in the session-download API used by the backend.ai ssh command (#2635)
  • Implement missing StrEnumType handling in populate_fixture(). (#2648)
  • Let GET /resource/usage/period request contain data in query parameter rather than JSON body. (#2661)
  • Allow sudo-enabled container users to overwrite /usr/bin/scp and /usr/libexec/sftp-server by unifying the intrinsic ssh binaries to use the merged dropbearmulti executable. (#2667)
  • Update webserver logout API to respond with HTTP 200 OK (#2681)
  • Fix WSProxy not properly handling WebSocket request sent from Firefox (#2684)
  • Scan parent directory of created qtree to avoid creating quota on non-existing directory. (#2696)
  • Fix list_files, get_fstab_contents, get_performance_metric and shared_vfolder_info Python SDK functions failing with a ValidationError exception (#2706)
  • Resolve the issue where the vfolder id does not match in list_shared_vfolders. (#2731)
  • Handle OS Error when deleting vfolders. (#2741)
  • Fix typo in Virtual-folder status update code. (#2742)
  • Correct msgpack deserialization of ResourceSlot. (#2754)
  • Fix regression error of session create_from_template command. (#2761)
  • Silence model_ namespace warnings with pydantic-based model classes (#2765)
  • Change the initialization order of PackageContext to apply target_path correctly in the TUI installer (#2768)
  • Make the regex patterns to update configuration files working with multiline texts correctly in the TUI installer (#2771)
  • Omit null parameters when calling the usage-per-period API. (#2777)
  • Delete vfolder invitation and permission rows when deleting vfolders. (#2780)
  • Handle container port mismatch when creating kernel. (#2786)
  • Explicitly set the protected service ports depending on the resource group type and the service types (#2797)
  • Correct session status determiner function. (#2803)
  • Fix endpoint_list.total_count GQL field returning incorrect value (#2805)
  • Fix Service.create() SDK method and service create CLI command not working with UnboundLocalError exception (#2806)
  • Refresh the expiration time of the login session upon login. (#2816)
  • Fix kernel_id assignment for main kernel log retrieval (#2820)
  • Use a safer TLS version (v1.2) when creating SSL sockets in the logstash handler (#2827)
  • Fix wrong count of concurrent compute sessions. (#2829)
  • Create kernels with correct scaling_group value. (#2837)
  • Fix a regression in progress bar rendering of the TUI installer after upgrading the Textual library (#2867)
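
The mutable-default-argument fix (#1986) addresses a general Python pitfall, shown below in a minimal sketch. The function names and the vfolder-style arguments are hypothetical and do not come from the Backend.AI code base.

```python
from __future__ import annotations


def add_mount_buggy(name: str, mounts: list[str] = []) -> list[str]:
    # The default list is created once, when the function is defined,
    # so every call that omits `mounts` mutates the same shared object.
    mounts.append(name)
    return mounts


def add_mount(name: str, mounts: list[str] | None = None) -> list[str]:
    # Using None as the default and creating the list inside the body
    # gives each call its own fresh list.
    if mounts is None:
        mounts = []
    mounts.append(name)
    return mounts
```

With the buggy variant, state leaks between unrelated calls; the None-default variant keeps calls independent while still accepting an explicit list from the caller.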

Documentation Updates

  • Add note about installing client library with same version as server (#1976)
  • Remove deprecated version from the docker compose YAML templates in package installation docs. (#2035)
  • Fix a typo (a duplicate double quote) in the agent.toml example of the package-based installation guide (#2069)

External Dependency Updates

  • Upgrade the base runtime (CPython) version from 3.11.6 to 3.12.2 (#1994)
  • Upgrade aiodocker to v0.22.0 with minor bug fixes found by improved type annotations (#2339)
  • Update the halfstack containers to point the latest stable versions (#2367)
  • Upgrade aiodocker to 0.22.1 to fix error handling when trying to extract the log of non-existing containers (#2402)
  • Upgrade the base CPython from 3.12.2 to 3.12.4 (#2449)
  • Upgrade Python (3.12.4 -> 3.12.6) and common/tool dependencies to prepare for Python 3.13 and apply latest fixes (#2851)

Miscellaneous

  • Wrap RPC authentication error to custom error for better logging. (#1970)
  • Add requested_slots field to compute session GQL type. (#1984)
  • Allow pydantic.BaseModel as the API handler return schema. (#1987)
  • Fix incorrect version notation of GQL Field. (#1993)
  • Add max_pending_session_count field to Keypair resource policy GQL schema (#2013)
  • Handle container creation exception and start exception in separate try-except contexts. (#2316)
  • Fix the broken workflow call for the action that auto-assigns PR numbers to news fragments (#2358)
  • Finally stabilize the hanging tests in our CI due to docker-internal races on TCP port mappings to concurrently spawned fixture containers by introducing monotonically increasing TCP port numbers (#2379)
  • Further improve the monotonic port allocation logic for the test containers to remove maximum concurrency restrictions (#2396)
  • Add PEX, SCIE binary build configs for the plugin subsystem. (#2422)
    • Add POST /folders API endpoints to replace DELETE APIs that require request body.
    • Allow DELETE requests to have body data. (#2571)
  • Enhance type hints for potential None arguments (#2580)
  • Add ai.backend.manager.models.graphql module for better code base management. (#2669)
  • Remove Scheduler related types that are no longer used. (#2705)
  • Allow adding required GQL field argument to schema. (#2712)
  • Upgrade readthedocs build environment to Python 3.12 (#2814)

## 24.03.0rc1 (2024-03-31)

Features

  • Allow filtering the compute_session query by user_id. (#1805)
  • Allow overriding vfolder mount permissions in API calls and CLI commands to create new sessions, with addition of a generic parser of comma-separated "key=value" list for CLI args and API params (#1838)
  • Always enable ai.backend.accelerator.cuda_open in the scie-based installer (#1966)
  • Use config["pipeline"]["endpoint"] as default value of config["pipeline"]["frontend-endpoint"] when not provided (#1972)
  • Migrate container registry config storage from Etcd to PostgreSQL (#1917)
  • Implement ID-based client workflow to ContainerRegistry API. (#2615)
  • Refactor Base ContainerRegistry's scan_tag and implement MEDIA_TYPE_DOCKER_MANIFEST type handling. (#2620)
  • Support GitHub Container Registry. (#2621)
  • Support GitLab Container Registry. (#2622)
  • Support AWS ECR Public Container Registry. (#2623)
  • Support AWS ECR Private Container Registry. (#2624)
  • Replace rescan command's --local flag with local container registry record. (#2665)
  • Add project column to the images table and refactor ImageRef logic. (#2707)
  • Support docker image manifest v2 schema1. (#2815)
  • Add filter and order parameters to Group GQL Relay API. (#2863)
  • Add vast_use_auth_token config to utilize VASTData API token optionally. (#2901)
  • Use a valid value for the id field in the GQL schema query resolver for ContainerRegistry. (#2908)
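
The generic parser of comma-separated "key=value" lists mentioned in #1838 could look roughly like the sketch below. The function name parse_kv_list, its error handling, and the example folder names and permission values are assumptions for illustration; the actual parser in the CLI and API may behave differently.

```python
from __future__ import annotations


def parse_kv_list(text: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict (hypothetical helper, not the real API)."""
    result: dict[str, str] = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            # Tolerate empty input and stray commas.
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        result[key.strip()] = value.strip()
    return result


# Example input: per-vfolder permission overrides (values are illustrative).
overrides = parse_kv_list("data=ro, scratch=rw")
```

Splitting on the first "=" only (via str.partition) lets values themselves contain "=" characters, which is a deliberate choice in this sketch.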

Fixes

  • Set single agent per kernel resource usage. (#1725)
  • Abort container creation when duplicate container port definition exists (#1750)
  • To update image metadata, check if the min/max values in resource_limits are undefined. (#1941)
  • Explicitly disable the user-site package detection in the krunner python commands to avoid potential conflicts with user-installed packages in .local directories (#1962)
  • Fix caf54fcc17ab migration to drop a primary key only if it exists and in 589c764a18f1 migration, add missing table arguments. (#1963)
  • Explicitly wait for readiness of the Docker daemon and the compose stack before pouring database fixtures in install-dev.sh for when installing at the provisioning stage of Codespaces and integration tests in CI. (#2378)
  • Add missing implementation of wsproxy and manager CLI's log-level customization options (#2698)
  • Add missing batch execution call after session starts (#2884)
  • Fix a regression of the unicode-aware slug update that prevented creation of dot-prefixed (automount) vfolders (#2892)
  • Fix invalid image format log spam in Agent (#2894)
  • Fix wrong creation of raw_configs in _create_kernels_in_one_agent (#2896)
  • Assign valid value to id field in ContainerRegistryNode GQL schema query resolver. (#2899)
  • Update vast quota rather than raise error when quota exists. (#2900)
  • Calculate correct expiration time of VAST auth token and add vast_force_login config to enable login before every REST API call (#2911)
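
The duplicate container port check from #1750 can be illustrated with a short sketch. The helper names and the choice of ValueError are hypothetical; the agent's real validation code and exception types are not shown in these notes.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def find_duplicate_ports(ports: Iterable[int]) -> list[int]:
    """Return the sorted list of port numbers that appear more than once."""
    counts = Counter(ports)
    return sorted(port for port, n in counts.items() if n > 1)


def validate_container_ports(ports: Iterable[int]) -> None:
    """Raise before any container is created if a port is declared twice."""
    duplicates = find_duplicate_ports(ports)
    if duplicates:
        raise ValueError(f"duplicate container port definitions: {duplicates}")
```

Running the check up front means the creation request fails fast with a clear message instead of failing later inside the container runtime.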

Documentation Updates

  • Update docstrings in ai.backend.client.request.Request:fetch() and ai.backend.client.request.FetchContextManager as the support for synchronous context manager has been deprecated. (#1801)
  • Resize font-size of footer text in ethical ads in documentation hosted by read-the-docs (#1965)
  • Only resize font-size of footer text in ethical ads not in title of content in documentation (#1967)

Miscellaneous

  • Revert response type of service create API. (#1979)

Full Changelog

Check out the full changelog until this release (24.09.0).

Full Commit Logs

Check out the full commit logs between release (24.09.0rc1) and (24.09.0).