[JSON] Support Rename Fields for JSON operator #1133

Open
ShihChun-H opened this issue Oct 9, 2024 · 12 comments
Assignees
Labels
component feature New feature or request hacktoberfest hacktoberfest2024 Component improvement issues for Hacktoberfest 2024 help-wanted Help from the community is appreciated improvement Improvement on existing features instill core

Comments

@ShihChun-H
Member

ShihChun-H commented Oct 9, 2024

Issue Description

Current State

  • It is very difficult to manipulate JSON data with the JSON operator.

Proposed Change

  • Please fetch this JSON Schema to implement the functions.
  • Manipulating JSON data

JSON schema pseudo code

JsonOperator:
  Task: Rename fields
  
  Input:
    data: 
      type: object
      description: Original data, which can be a JSON object or array of objects.
    fields: 
      type: array
      description: An array of objects specifying the fields to be renamed.
      items:
        type: object
        properties:
          currentField: 
            type: string
            description: The field name in the original data to be replaced, supports nested paths if "supportDotNotation" is true.
          newField: 
            type: string
            description: The new field name that will replace the currentField, supports nested paths if "supportDotNotation" is true.
#    supportDotNotation:
#      type: boolean
#      default: true
#      description: Determines whether to interpret field names as paths using dot notation. If false, fields are treated as literal keys.
    conflictResolution:
      type: string
      enum: [overwrite, skip, error]
      default: overwrite
      description: Defines how conflicts are handled when the newField already exists in the data.
  
  Output:
    data:
      type: object
      description: The modified data with the specified fields renamed.
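
As a rough illustration of the input contract above, the following sketch checks a payload against the pseudo schema before any renaming happens. The function name `validate_rename_input` and the error messages are hypothetical and not part of the actual component; the sketch only assumes the field names and the `overwrite` default shown in the schema.

```python
# Allowed values taken from the "conflictResolution" enum in the pseudo schema.
ALLOWED_CONFLICT_RESOLUTIONS = ("overwrite", "skip", "error")


def validate_rename_input(payload):
    """Return a normalized copy of the payload, or raise ValueError.

    Hypothetical helper: it mirrors the pseudo schema, not a real API.
    """
    data = payload.get("data")
    if not isinstance(data, (dict, list)):
        raise ValueError("'data' must be an object or an array of objects")

    fields = payload.get("fields")
    if not isinstance(fields, list):
        raise ValueError("'fields' must be an array")
    for i, item in enumerate(fields):
        if not isinstance(item, dict):
            raise ValueError(f"fields[{i}] must be an object")
        for key in ("currentField", "newField"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"fields[{i}].{key} must be a string")

    # The schema declares "overwrite" as the default strategy.
    mode = payload.get("conflictResolution", "overwrite")
    if mode not in ALLOWED_CONFLICT_RESOLUTIONS:
        raise ValueError(f"unsupported conflictResolution: {mode!r}")

    return {"data": data, "fields": fields, "conflictResolution": mode}
```

Validating up front means an unsupported strategy is rejected even when no conflict happens to occur in the data.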

Key Features:
conflictResolution: Renaming a field can collide with a field that already exists, especially with nested objects and dot notation, so conflicts must be handled explicitly to avoid data loss or unexpected behavior. Users specify how conflicts are resolved through the conflictResolution parameter ('overwrite' | 'skip' | 'error'). This:

  • Provides flexibility and control to the user.
  • Adapts to different use cases.

Here are different strategies to manage conflicts and some considerations for each.

1. Overwrite the Existing Field (Default Behavior)

Description: If the newField already exists in the object, overwrite its value with the value from currentField.
Pros:

  • Simple and straightforward.
  • Useful when the intention is to replace the existing value.

Cons:

  • Can lead to data loss if not used carefully.

Implementation:

# Assigning to new_key replaces any existing value, so no conflict check is needed.
obj[new_key] = obj.pop(current_key)

2. Skip the Renaming Operation

Description: If the newField already exists, skip the renaming operation for that particular field.
Pros:

  • Prevents accidental overwriting of data.
  • Safeguards against potential conflicts without altering the existing data.

Cons:

  • The currentField remains unchanged, which might not be the desired outcome.

Implementation:

# Rename only when new_key is free; otherwise leave both fields untouched.
if new_key not in obj:
    obj[new_key] = obj.pop(current_key)

3. Merge Values

Description: If both currentField and newField exist and contain objects or arrays, merge the two values. This approach is more complex but can be very powerful.
Pros:

  • Preserves both sets of data.
  • Useful for combining information rather than choosing one over the other.

Cons:

  • Can be complex to implement, especially if the data types of currentField and newField differ.
  • May require custom logic depending on how you want to merge the data (e.g., combining arrays, merging objects, etc.).

Implementation:

if new_key in obj:
    if isinstance(obj[new_key], dict) and isinstance(obj[current_key], dict):
        # Merge dictionaries
        obj[new_key].update(obj.pop(current_key))
    elif isinstance(obj[new_key], list) and isinstance(obj[current_key], list):
        # Merge lists
        obj[new_key].extend(obj.pop(current_key))
    else:
        # Handle other types (overwrite, append, etc.)
        obj[new_key] = obj.pop(current_key)
else:
    obj[new_key] = obj.pop(current_key)

4. Rename with a Suffix or Prefix

Description: If the newField already exists, rename the new field by appending a suffix or prefix (e.g., _1, _conflict) to avoid conflicts.
Pros:

  • Both original and new data are preserved.
  • Easy to track conflicts.

Cons:

  • The resulting data structure may become less predictable or harder to work with if many conflicts occur.

Implementation:

suffix = 1
original_new_key = new_key
while new_key in obj:
    new_key = f"{original_new_key}_{suffix}"
    suffix += 1
obj[new_key] = obj.pop(current_key)

5. Return an Error or Warning

Description: If a conflict is detected, stop the operation and return an error or warning to the user. This forces the user to address the conflict before proceeding.
Pros:

  • Prevents accidental data overwriting.
  • Makes the user aware of potential issues immediately.

Cons:

  • Halts the process, which might be undesirable in automated workflows.

Implementation:

if new_key in obj:
    raise ValueError(f"Conflict detected: '{new_key}' already exists.")
else:
    obj[new_key] = obj.pop(current_key)

Summary:

  • Overwrite: Simple and effective, but can lead to data loss.
  • Skip: Safe but may leave data unchanged.
  • Error/Warning: Forces user intervention; best for critical operations.

Choose the strategy that best aligns with your application's needs and the user's expectations. Implementing a combination of these strategies, such as providing a default behavior with options for customization, can offer the best balance between usability and robustness.
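
The three strategies in the schema's conflictResolution enum can be combined behind a single parameter. The sketch below does this for top-level keys only; `rename_key` is a hypothetical helper name, and returning a boolean to report whether a rename happened is an assumption of this example rather than part of the proposal.

```python
def rename_key(obj, current_key, new_key, conflict_resolution="overwrite"):
    """Rename a top-level key of a dict in place.

    Returns True if the key was renamed, False if nothing changed.
    Hypothetical helper; only the three enum values are supported.
    """
    if conflict_resolution not in ("overwrite", "skip", "error"):
        raise ValueError(f"unsupported conflictResolution: {conflict_resolution!r}")
    if current_key not in obj or current_key == new_key:
        return False  # nothing to rename

    if new_key in obj:
        if conflict_resolution == "skip":
            return False  # keep both fields as they are
        if conflict_resolution == "error":
            raise ValueError(f"Conflict detected: '{new_key}' already exists.")
        # "overwrite" falls through and replaces the existing value.

    obj[new_key] = obj.pop(current_key)
    return True


record = {"state": "conflict", "region": "CA"}
rename_key(record, "state", "region", "skip")  # no change, returns False
rename_key(record, "state", "region")          # default "overwrite", returns True
print(record)
```

Keeping the conflict check in one place means adding another strategy later (such as merge or suffix) would only touch the `if new_key in obj` branch.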

Example Usage:

Scenario: Input data as JSON object

// input
{
  "data": {
    "name": "John Doe",
    "age": 30,
    "address": {
      "street": "123 Main St",
      "city": "Anytown",
      "state": "CA"
    },
    "state": "conflict"
  },
  "fields": [
    {"currentField": "address.street", "newField": "address.road"},
    {"currentField": "state", "newField": "address.state"}
  ],
  // "supportDotNotation": true,
  "conflictResolution": "overwrite"
}

Conflict Resolution Scenarios:
1. Overwrite (Default):

  • The state field in data would be moved to address.state, overwriting the existing address.state field.
  • Final output:
{
  "data": {
    "name": "John Doe",
    "age": 30,
    "address": {
      "road": "123 Main St",
      "city": "Anytown",
      "state": "conflict"
    }
  }
}

2. Skip:

  • The renaming of state to address.state would be skipped, so both state and address.state remain unchanged.
  • Final output:
{
  "data": {
    "name": "John Doe",
    "age": 30,
    "address": {
      "road": "123 Main St",
      "city": "Anytown",
      "state": "CA"
    },
    "state": "conflict"
  }
}

3. Error:

  • The process would raise an error, stopping execution, because address.state already exists:

ValueError: Conflict detected: 'address.state' already exists.
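
The three scenarios above can be reproduced with a small dot-notation sketch. It handles nested objects only (no array indices), and it assumes that a rename is silently skipped when the source field or the destination's parent object does not exist; the issue does not specify that behavior. The names `_resolve`, `rename_path`, and `run` are hypothetical.

```python
import copy


def _resolve(root, path):
    """Return (parent_object, leaf_key) for a dot path, or (None, leaf_key)
    when some parent along the path is missing."""
    *parents, leaf = path.split(".")
    node = root
    for part in parents:
        if isinstance(node, dict) and part in node:
            node = node[part]
        else:
            return None, leaf
    return node, leaf


def rename_path(data, current_field, new_field, conflict_resolution="overwrite"):
    """Rename (or move) a field addressed by dot notation, in place."""
    src, src_leaf = _resolve(data, current_field)
    dst, dst_leaf = _resolve(data, new_field)
    if not isinstance(src, dict) or src_leaf not in src:
        return False  # assumption: a missing source field is skipped
    if not isinstance(dst, dict):
        return False  # assumption: missing parents are not created
    if src is dst and src_leaf == dst_leaf:
        return False  # same field, nothing to do
    if dst_leaf in dst:
        if conflict_resolution == "skip":
            return False
        if conflict_resolution == "error":
            raise ValueError(f"Conflict detected: '{new_field}' already exists.")
    dst[dst_leaf] = src.pop(src_leaf)
    return True


original = {
    "name": "John Doe",
    "age": 30,
    "address": {"street": "123 Main St", "city": "Anytown", "state": "CA"},
    "state": "conflict",
}
fields = [("address.street", "address.road"), ("state", "address.state")]


def run(mode):
    """Apply both renames to a fresh copy of the example data."""
    data = copy.deepcopy(original)
    for current_field, new_field in fields:
        rename_path(data, current_field, new_field, mode)
    return data


print(run("overwrite"))
print(run("skip"))
```

Calling `run("error")` raises the ValueError shown in the third scenario, because the first rename succeeds and the second one hits the existing address.state field.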

Scenario: Input Data as an Array of Objects

If the input data is an array of objects, the logic needs to be adapted to handle each object in the array individually. The schema and the function would process each object within the array according to the specified fields and conflictResolution rules.

Below is an example demonstrating how the "Rename Fields" operation would work with input data that is an array of objects.

Input

{
  "data": [
    {
      "name": "John Doe",
      "age": 30,
      "address": {
        "street": "123 Main St",
        "city": "Anytown",
        "state": "CA"
      },
      "contacts": [
        {
          "type": "email",
          "value": "[email protected]"
        }
      ]
    },
    {
      "name": "Jane Smith",
      "age": 28,
      "address": {
        "street": "456 Oak St",
        "city": "Othertown",
        "state": "NY"
      }
      // Note: Jane Smith does not have a "contacts" field
    }
  ],
  "fields": [
    {"currentField": "name", "newField": "fullName"},
    {"currentField": "address.street", "newField": "address.road"},
    {"currentField": "contacts.0.value", "newField": "contacts.0.contactInfo"},
    {"currentField": "age", "newField": "yearsOld"}
  ],
//  "supportDotNotation": true,
  "conflictResolution": "skip"
}

Explanation:

  • Field "name": The "name" field will be renamed to "fullName" for each object in the array.
  • Field "address.street": The "street" field inside the "address" object will be renamed to "road" for each object.
  • Field "contacts.0.value": The "value" field inside the first element of the "contacts" array will be renamed to "contactInfo" for the first object, but this step will be skipped for the second object because the "contacts" field does not exist.
  • Field "age": The "age" field will be renamed to "yearsOld" for each object.

Output:

{
  "data": [
    {
      "fullName": "John Doe",
      "yearsOld": 30,
      "address": {
        "road": "123 Main St",
        "city": "Anytown",
        "state": "CA"
      },
      "contacts": [
        {
          "type": "email",
          "contactInfo": "[email protected]"
        }
      ]
    },
    {
      "fullName": "Jane Smith",
      "yearsOld": 28,
      "address": {
        "road": "456 Oak St",
        "city": "Othertown",
        "state": "NY"
      }
      // The "contacts" field is not present, so no renaming occurs for "contacts.0.value"
    }
  ]
}
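
A minimal sketch of the array case follows, assuming numeric path segments such as the 0 in contacts.0.value index into arrays and that fields missing from a record are skipped, as in the explanation above. `rename_fields` and its helpers are hypothetical names, the email value is a placeholder, and returning a deep copy instead of mutating the input is a design choice of this example.

```python
import copy


def _resolve(root, path):
    """Return (parent_container, leaf_key) for a dot path, or (None, leaf_key)
    if a parent is missing. Numeric segments index into lists."""
    *parents, leaf = path.split(".")
    node = root
    for part in parents:
        if isinstance(node, dict) and part in node:
            node = node[part]
        elif isinstance(node, list) and part.isdigit() and int(part) < len(node):
            node = node[int(part)]
        else:
            return None, leaf
    return node, leaf


def _rename_one(record, current_field, new_field, mode):
    src, src_leaf = _resolve(record, current_field)
    dst, dst_leaf = _resolve(record, new_field)
    # Skip when the source field or the destination parent does not exist.
    if not isinstance(src, dict) or src_leaf not in src or not isinstance(dst, dict):
        return
    if src is dst and src_leaf == dst_leaf:
        return
    if dst_leaf in dst:
        if mode == "skip":
            return
        if mode == "error":
            raise ValueError(f"Conflict detected: '{new_field}' already exists.")
    dst[dst_leaf] = src.pop(src_leaf)


def rename_fields(data, fields, conflict_resolution="overwrite"):
    """Apply every rename to a single object or to each object in an array."""
    result = copy.deepcopy(data)
    records = result if isinstance(result, list) else [result]
    for record in records:
        for field in fields:
            _rename_one(record, field["currentField"], field["newField"],
                        conflict_resolution)
    return result


people = [
    {"name": "John Doe", "age": 30,
     "contacts": [{"type": "email", "value": "john@example.com"}]},
    {"name": "Jane Smith", "age": 28},  # no "contacts" field
]
renames = [
    {"currentField": "name", "newField": "fullName"},
    {"currentField": "contacts.0.value", "newField": "contacts.0.contactInfo"},
    {"currentField": "age", "newField": "yearsOld"},
]
renamed = rename_fields(people, renames, "skip")
print(renamed)
```

Because each record is processed independently, a conflict or a missing field in one object does not affect how the others are renamed.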

Rules for the Component Hackathon

  • Each issue will only be assigned to one person/team at a time.
  • You can only work on one issue at a time.
  • To express interest in an issue, please comment on it and tag @kuroxx, allowing the Instill AI team to assign it to you.
  • Ensure you address all feedback and suggestions provided by the Instill AI team.
  • If no commits are made within five days, the issue may be reassigned to another contributor.
  • Join our Discord to engage in discussions and seek assistance in the #hackathon channel. For technical queries, you can tag @chuang8511.

Component Contribution Guideline | Documentation | Official Go Tutorial

@ShihChun-H ShihChun-H added documentation Improvements for instill.tech/docs tutorial Improvements for instill.tech/tutorials need-triage Need to be investigated further labels Oct 9, 2024

linear bot commented Oct 9, 2024

@ShihChun-H ShihChun-H added help-wanted Help from the community is appreciated feature New feature or request instill core component hacktoberfest2024 Component improvement issues for Hacktoberfest 2024 improvement Improvement on existing features and removed documentation Improvements for instill.tech/docs tutorial Improvements for instill.tech/tutorials need-triage Need to be investigated further labels Oct 9, 2024
@Danbaba1

I am interested. Can I work on this issue?

@kuroxx
Collaborator

kuroxx commented Oct 11, 2024

Hello @Danbaba1, sure I have assigned this ticket to you! 🙌

@kuroxx kuroxx assigned kuroxx and Danbaba1 and unassigned kuroxx Oct 11, 2024
@kuroxx
Collaborator

kuroxx commented Oct 22, 2024

Hey @Danbaba1, I have removed you as an assignee because there has been no activity for the past 2 weeks 🙏 Please raise it again if you are still working on it, thanks.

@AkashJana18

I would like to give this issue a try. Can you please assign it to me?

@kuroxx
Collaborator

kuroxx commented Oct 23, 2024

@AkashJana18 Sounds good, I have assigned it to you!

@AkashJana18

Hey @chuang8511 @ShihChun-H Could you please guide me on where to make the changes for implementing JSON manipulation with the JsonOperator schema? I haven’t worked with this tech stack before, so any pointers on relevant files, modules, or general structure would be very helpful. Thanks in advance!

@gagan-bhullar-tech

I would like to work on this issue. Can you please assign it to me?

@AkashJana18

Hey @gagan-bhullar-tech, I am already working on it. Would you like to collaborate?

@chuang8511
Member

@AkashJana18
Sorry, I posted the wrong JSON schema.

Could you take a look at this?

We have already built the task definition, so you only have to work on the Golang implementation.

@AkashJana18

@chuang8511 So the Golang implementation needs to be done in the pipeline-backend repo?

@chuang8511
Member

@AkashJana18
Yes, please check the guideline.

Projects
Status: In Progress
Development

No branches or pull requests

6 participants