Implement Hardlink Feature for Cache Optimization and Data Deduplication #1953

Open
ChengyuZhu6 opened this issue Jan 24, 2025 · 0 comments · May be fixed by #1954


Description

I propose adding a hardlink feature to the caching mechanism to reduce memory usage, improve performance, and save disk space.

Background

The current caching system stores files in memory, which can lead to high memory usage, especially when dealing with large datasets. By utilizing hardlinks, we can reduce memory consumption and storage redundancy by allowing multiple references to the same file on disk without duplicating the file content.

Design

Key Components

  1. HardlinkManager: Manages the creation, validation, and persistence of hardlinks.
     • CreateLink: Attempts to create a hardlink for a given cache key.
     • HasHardlink: Checks if a hardlink exists for a given key.
     • Persist and Restore: Persist hardlink metadata to disk and restore it on startup.
  2. DirectoryCache: Implements the cache logic, including hardlink support.
     • CreateHardlink: Invokes the HardlinkManager to create a hardlink.
     • HasHardlink: Checks for the existence of a hardlink using the HardlinkManager.
  3. Configuration: The EnableHardlink flag in the configuration determines whether hardlinking is enabled.

Workflow

[Start] 
   |
   v
[Initialize Cache]
   |
   v
[Check if Hardlinking is Enabled]
   |
   v
[Access Cached File] 
   |
   v
[Check if Hardlink Exists] -- No --> [Create Hardlink]
   |                                   |
  Yes                                  v
   |                             [Verify Hardlink]
   v                                   |
[Use Hardlink]                         v
   |                             [Rename to Final Location]
   v                                   |
[Persist Hardlink State] <-------------|
   |
   v
[Restore Hardlink State on Startup]
   |
   v
[End]
  1. Cache Write:
     • When a file is added to the cache, the system checks if hardlinking is enabled.
     • If enabled, it attempts to create a hardlink for the cached file.
  2. Cache Read:
     • When accessing a cached file, the system checks if a hardlink exists.
     • If a hardlink exists, it uses the hardlink path to access the file.
  3. Persistence:
     • Hardlink metadata is periodically persisted to disk.
     • On startup, the system restores hardlink metadata from disk.

Benefits

  • Reduced Memory Usage: By leveraging hardlinks, we can significantly decrease the memory footprint of the caching system.
  • Improved Performance: Hardlinks allow for faster access to cached files, as they avoid the overhead of duplicating file data.
  • Data Deduplication: Hardlinks inherently support data deduplication by allowing multiple cache entries to reference the same physical file, reducing storage redundancy.
  • Scalability: This feature will enable the caching system to handle larger datasets more efficiently.
ChengyuZhu6 added a commit to ChengyuZhu6/stargz-snapshotter that referenced this issue Jan 24, 2025
- Adds the EnableHardlink configuration option
- Adds the HardlinkCapability interface
- Updates the directoryCache struct to support hardlinks
- Adds logging for hardlink configuration
- Updates the layer package to pass through hardlink configuration
- Concurrent access testing

Fixes: containerd#1953

Signed-off-by: ChengyuZhu6 <[email protected]>
@ChengyuZhu6 ChengyuZhu6 linked a pull request Jan 24, 2025 that will close this issue