Member enrichment llm lfx 1712 #2688

Open

themarolt wants to merge 6 commits into main

Conversation

themarolt (Contributor) commented Nov 14, 2024

Changes proposed ✍️

What

copilot:summary

copilot:poem

Why

How

copilot:walkthrough

Checklist ✅

  • Label appropriately with Feature, Improvement, or Bug.
  • Add screenshots to the PR description for relevant FE changes
  • New backend functionality has been unit-tested.
  • API documentation has been updated if necessary (see the docs on API documentation).
  • Quality standards are met.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new llmPromptHistory table for tracking prompts and responses.
    • Added a member enrichment onboarding script for processing eligible members.
    • Enhanced member enrichment capabilities with new functions for handling multiple data sources.
    • Added support for new organization identity types and attribute sources.
  • Improvements

    • Streamlined error handling and data processing in various services.
    • Updated scheduling logic for member enrichment tasks based on environment variables.
    • Improved flexibility in email handling within the enrichment service.
  • Documentation

    • Added new interfaces and enumerations for better structure and clarity in member and organization data handling.

coderabbitai bot commented Nov 14, 2024

Walkthrough

The pull request introduces substantial changes to the member enrichment process, including the creation of a new database table llmPromptHistory to log prompts and responses from a language model. It also includes modifications to existing tables, backup operations, and various enhancements to the member enrichment worker service, such as new functions for processing member sources and improved error handling. Additionally, several enums and interfaces are added to support the new functionality, along with updates to import paths and service configurations.

Changes

File | Change Summary
backend/src/database/migrations/V1731052735__llm-prompt-history.sql | Created llmPromptHistory table with various fields and indexes; added backup operations for existing tables.
backend/src/database/migrations/V1731484783__drop-member-deprecated-columns.sql | Dropped columns oldEmails and oldWeakIdentities from members table.
services/apps/premium/members_enrichment_worker/package.json | Added new script script:onboarding and new dependency @crowd/common_services.
services/apps/premium/members_enrichment_worker/src/activities.ts | Added new method processMemberSources.
services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts | Introduced processMemberSources and updated findMemberEnrichmentCache to accept an array of sources.
services/apps/premium/members_enrichment_worker/src/activities/getMembers.ts | Updated import path for svc from ../main to ../service.
services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/*.ts | Updated import paths for svc from ../../main to ../../service.
services/apps/premium/members_enrichment_worker/src/activities/syncEnrichedData.ts | Updated import path for svc from ../main to ../service.
services/apps/premium/members_enrichment_worker/src/main.ts | Removed configuration setup and imported existing svc instance.
services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts | Updated import path for svc and modified scheduling logic.
services/apps/premium/members_enrichment_worker/src/service.ts | Introduced a new service.ts file with configuration for the service worker.
services/libs/common_services/src/repos/llmPromptHistory.repo.ts | Added LlmPromptHistoryRepository class with insertPromptHistoryEntry method.
services/libs/common_services/src/services/llm.service.ts | Introduced LlmService class for interacting with the language model and managing prompt history.
services/libs/types/src/enums/llm.ts | Added enums LlmModelType and LlmQueryType.
services/libs/types/src/premium/enrichment.ts | Added interfaces IMemberOrganizationData and IMemberOriginalData.
services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts | Added functions fetchMemberDataForLLMSquashing and resetMemberEnrichedAt; updated findMemberEnrichmentCacheDb.
services/libs/data-access-layer/src/organizations/organizations.ts | Added findOrCreateOrganization function for organization management.
services/libs/types/src/enums/organizations.ts | Updated OrganizationIdentityType and OrganizationAttributeSource enums with new values.
services/libs/types/src/index.ts | Added export for ./llm.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Worker
    participant LLMService
    participant Database

    User->>Worker: Start Enrichment Process
    Worker->>Database: Fetch Enrichable Members
    Database-->>Worker: Return Members
    Worker->>LLMService: Process Member Sources
    LLMService-->>Worker: Return Processed Data
    Worker->>Database: Update Member Data
    Database-->>Worker: Confirmation
    Worker->>User: Enrichment Complete

Suggested labels

Feature, Improvement

"In the burrow, we hop and play,
With prompts and data, bright as day.
Enrichment flows, a joyful spree,
New tables and scripts, oh so merry!
From members to models, we weave our tale,
In the world of data, we shall prevail! 🐇✨"


coderabbitai bot left a comment

Actionable comments posted: 18

🧹 Outside diff range and nitpick comments (27)
services/libs/types/src/enums/llm.ts (1)

1-4: Add documentation for model capabilities and versioning strategy.

The enum implementation is correct, but consider adding JSDoc comments to document:

  • Capabilities and differences between the models
  • Versioning strategy for handling model updates
  • Any specific requirements or limitations of each model

Example documentation:

/**
 * Supported LLM model types for AWS Bedrock Runtime.
 * @see https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html
 */
export enum LlmModelType {
  /** 
   * Claude 3 Sonnet - Balanced performance and cost.
   * Best for: [specific use cases]
   * Limitations: [any known limitations]
   */
  CLAUDE_3_5_SONNET = 'anthropic.claude-3-5-sonnet-20240620-v1:0',
  
  /** 
   * Claude 3 Opus - Highest capability model.
   * Best for: [specific use cases]
   * Limitations: [any known limitations]
   */
  CLAUDE_3_OPUS = 'anthropic.claude-3-opus-20240229-v1:0',
}
services/apps/premium/members_enrichment_worker/src/main.ts (1)

Line range hint 7-16: Add error handling and graceful shutdown.

While the initialization flow is correct, consider these essential improvements:

  1. Add try-catch blocks for error handling
  2. Implement graceful shutdown
  3. Add validation for the CROWD_EDITION environment variable

Consider this implementation:

 setImmediate(async () => {
+  try {
     await svc.init()
 
     await scheduleMembersEnrichment()
 
-    if (process.env['CROWD_EDITION'] === Edition.LFX) {
+    const edition = process.env['CROWD_EDITION']
+    if (edition && edition === Edition.LFX) {
       await scheduleMembersLFIDEnrichment()
     }
 
     await svc.start()
+  } catch (error) {
+    console.error('Failed to initialize service:', error)
+    process.exit(1)
+  }
 })
+
+process.on('SIGTERM', async () => {
+  try {
+    await svc.stop()
+    process.exit(0)
+  } catch (error) {
+    console.error('Error during shutdown:', error)
+    process.exit(1)
+  }
+})
services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getLFIDEnrichableMembers.ts (1)

Line range hint 11-14: Enhance error handling with more context

The current error handling simply wraps and rethrows the error. Consider adding more context about the operation being performed.

   try {
     const db = svc.postgres.reader
     rows = await fetchMembersForLFIDEnrichment(db, limit, afterId)
   } catch (err) {
-    throw new Error(err)
+    throw new Error(`Failed to fetch enrichable members for LFID: ${err.message}`)
   }
services/libs/types/src/index.ts (1)

64-64: Add newline at end of file.

Consider adding a newline at the end of the file to follow best practices.

 export * from './llm'
+
services/apps/premium/members_enrichment_worker/src/activities/syncEnrichedData.ts (2)

Line range hint 21-31: Improve error handling and function signature

The current implementation has several issues that should be addressed:

  1. Error handling loses the original stack trace
  2. Function returns null despite void return type
  3. Documentation could be more descriptive

Consider this improved implementation:

 /*
- syncMembersToOpensearch is a Temporal activity that sync a newly enriched member
- in database to OpenSearch.
+ * Temporal activity that synchronizes an enriched member from the database to OpenSearch.
+ * @param input - The member ID to synchronize
+ * @throws {Error} If synchronization fails
  */
 export async function syncMembersToOpensearch(input: string): Promise<void> {
   try {
     syncMembers.syncMembers(input)
   } catch (err) {
-    throw new Error(err)
+    throw err instanceof Error ? err : new Error(String(err))
   }
-
-  return null
 }

Line range hint 33-43: Apply consistent improvements to organization sync function

The organization sync function has the same issues as the member sync function.

Apply similar improvements:

 /*
- syncOrganizationsToOpensearch is a Temporal activity that sync newly enriched
- organizations in database to OpenSearch.
+ * Temporal activity that synchronizes enriched organizations from the database to OpenSearch.
+ * @param input - Array of organization IDs to synchronize
+ * @throws {Error} If synchronization fails
  */
 export async function syncOrganizationsToOpensearch(input: string[]): Promise<void> {
   try {
     syncOrganizations.syncOrganizations(input)
   } catch (err) {
-    throw new Error(err)
+    throw err instanceof Error ? err : new Error(String(err))
   }
-
-  return null
 }
services/libs/types/src/premium/enrichment.ts (2)

34-41: Consider using a more specific type for date fields

The interface correctly defines organization data structure, but date fields could benefit from more specific typing to ensure proper date handling.

Consider this improvement:

export interface IMemberOrganizationData {
  orgId: string
  orgName: string
  jobTitle: string
-  dateStart: string
-  dateEnd: string
+  dateStart: string | null  // ISO 8601 date string
+  dateEnd: string | null    // ISO 8601 date string
  source: string
}

43-54: LGTM! Well-structured interface with clear documentation

The interface effectively aggregates member data with clear section comments and appropriate type usage. The attributes field using Record<string, Record<string, unknown>> provides good flexibility while maintaining type safety.

Consider adding a type for the attribute values to improve type safety:

type AttributeValue = string | number | boolean | null;
type Attributes = Record<string, Record<string, AttributeValue>>;
services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/authenticateLFAuth0.ts (1)

Line range hint 29-33: Add environment variable validation.

The code relies on several critical environment variables for Auth0 configuration. Consider adding validation at startup to fail fast if any required variables are missing.

+const REQUIRED_ENV_VARS = [
+  'CROWD_LFX_AUTH0_DOMAIN',
+  'CROWD_LFX_AUTH0_CLIENT_ID',
+  'CROWD_LFX_AUTH0_CLIENT_ASSERTION_SIGNING_KEY',
+  'CROWD_LFX_AUTH0_AUDIENCE',
+];
+
+function validateEnvVars() {
+  const missing = REQUIRED_ENV_VARS.filter(key => !process.env[key]);
+  if (missing.length > 0) {
+    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
+  }
+}
+
 async function getTokenFromAuth0(): Promise<ITokenWithExpiration> {
+  validateEnvVars();
+
   const auth = new AuthenticationClient({
     domain: process.env['CROWD_LFX_AUTH0_DOMAIN'],
     clientId: process.env['CROWD_LFX_AUTH0_CLIENT_ID'],
     clientAssertionSigningKey: process.env['CROWD_LFX_AUTH0_CLIENT_ASSERTION_SIGNING_KEY'],
   })
services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/enrichLFAuth0.ts (2)

Line range hint 17-19: Improve error handling to preserve stack traces.

The current error handling pattern loses the original error stack trace, which could make debugging more difficult. Consider using the Error.cause property for proper error chaining.

Apply this pattern in both error handlers:

  } catch (err) {
-   throw new Error(err)
+   throw new Error('Failed to get identities', { cause: err })
  }
  } catch (err) {
-   throw new Error(err)
+   throw new Error('Failed to update member with enrichment data', { cause: err })
  }

Also applies to: 31-33


Line range hint 54-72: Consider adding timeout and error handling for the fetch call.

The mergeMembers function makes an HTTP request without:

  • Request timeout
  • Proper error handling for network failures
  • Response body parsing for error details

Consider enhancing the implementation:

 export async function mergeMembers(
   primaryMemberId: string,
   secondaryMemberId: string,
   tenantId: string,
 ): Promise<void> {
-  const res = await fetch(
-    `${process.env['CROWD_API_SERVICE_URL']}/tenant/${tenantId}/member/${primaryMemberId}/merge`,
-    {
-      method: 'PUT',
-      headers: {
-        Authorization: `Bearer ${process.env['CROWD_API_SERVICE_USER_TOKEN']}`,
-        'Content-Type': 'application/json',
-      },
-      body: JSON.stringify({
-        memberToMerge: secondaryMemberId,
-      }),
-    },
-  )
+  try {
+    const controller = new AbortController();
+    const timeoutId = setTimeout(() => controller.abort(), 30000);
+    
+    const res = await fetch(
+      `${process.env['CROWD_API_SERVICE_URL']}/tenant/${tenantId}/member/${primaryMemberId}/merge`,
+      {
+        method: 'PUT',
+        headers: {
+          Authorization: `Bearer ${process.env['CROWD_API_SERVICE_USER_TOKEN']}`,
+          'Content-Type': 'application/json',
+        },
+        body: JSON.stringify({
+          memberToMerge: secondaryMemberId,
+        }),
+        signal: controller.signal,
+      },
+    );
+    
+    clearTimeout(timeoutId);
 
-  if (res.status !== 200) {
-    throw new Error(`Failed to merge member ${primaryMemberId} with ${secondaryMemberId}!`)
+    if (!res.ok) {
+      const errorBody = await res.text();
+      throw new Error(
+        `Failed to merge member ${primaryMemberId} with ${secondaryMemberId}. Status: ${res.status}. Details: ${errorBody}`
+      );
+    }
+  } catch (err) {
+    if (err.name === 'AbortError') {
+      throw new Error('Request timeout while merging members', { cause: err });
+    }
+    throw new Error('Failed to merge members', { cause: err });
   }
 }
services/libs/types/src/llm.ts (2)

3-19: LGTM! Consider adding JSDoc comments.

The base interfaces are well-structured with a clear hierarchy. The use of generics in ILlmResult<T> provides good type safety.

Consider adding JSDoc comments to document:

  • The purpose of each interface
  • The meaning of the entityId and metadata fields in ILlmHistoryEntry
  • The units for inputTokenCount in ILlmResponse
+/**
+ * Represents an entry in the LLM prompt history, extending the base response with tracking information.
+ */
 export interface ILlmHistoryEntry extends ILlmResponse {
   type: LlmQueryType
+  /** ID of the entity (e.g., member, organization) this prompt was about */
   entityId?: string
+  /** Additional contextual information about this prompt */
   metadata?: Record<string, unknown>
 }

35-45: Consider implementing a pricing update mechanism.

The comment "to estimate costs - these numbers can change" suggests volatile pricing. Having this hardcoded in the codebase will require code changes for price updates.

Consider:

  1. Moving pricing to a configuration service or database
  2. Implementing a price update mechanism
  3. Adding price monitoring and alerting for cost control
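
As a rough sketch of suggestion 1, pricing could be resolved from configuration at runtime with a compiled-in fallback. The helper, the env var name, and the price values below are illustrative assumptions, not part of this PR:

// Illustrative only: resolve per-model pricing from an env-provided JSON
// override (e.g. populated from a config service), falling back to
// compiled-in defaults so a missing config never blocks the worker.
interface IModelPricing {
  costPer1kInputTokens: number
  costPer1kOutputTokens: number
}

const DEFAULT_PRICING: Record<string, IModelPricing> = {
  // Values are placeholders; real prices should come from configuration.
  'anthropic.claude-3-5-sonnet-20240620-v1:0': {
    costPer1kInputTokens: 0.003,
    costPer1kOutputTokens: 0.015,
  },
}

export function getModelPricing(modelId: string): IModelPricing {
  const override = process.env['LLM_PRICING_OVERRIDES'] // hypothetical env var
  if (override) {
    const parsed = JSON.parse(override) as Record<string, IModelPricing>
    if (parsed[modelId]) {
      return parsed[modelId]
    }
  }
  const fallback = DEFAULT_PRICING[modelId]
  if (!fallback) {
    throw new Error(`No pricing configured for model ${modelId}`)
  }
  return fallback
}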
services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/githubIdentities.ts (3)

Line range hint 12-39: Enhance error handling in database operations

The current error handling in both database functions simply wraps and rethrows errors, losing context and potentially exposing sensitive information.

Consider implementing more robust error handling:

 export async function getGithubIdentitiesWithoutSourceId(
   limit: number,
   afterId: string,
   afterUsername: string,
 ): Promise<IMemberIdentity[]> {
   let rows: IMemberIdentity[] = []
 
   try {
     const db = svc.postgres.reader
     rows = await getGithubIdentities(db, limit, afterId, afterUsername)
   } catch (err) {
-    throw new Error(err)
+    throw new Error(`Failed to fetch GitHub identities: ${err.message}`)
   }
 
   return rows
 }

 export async function updateIdentitySourceId(
   identity: IMemberIdentity,
   sourceId: string,
 ): Promise<void> {
   try {
     const db = svc.postgres.writer
     await updateSourceId(db, identity, sourceId)
   } catch (err) {
-    throw new Error(err)
+    throw new Error(
+      `Failed to update source ID for identity ${identity.id}: ${err.message}`
+    )
   }
 }

Line range hint 44-71: Improve environment variable handling and token security

The function directly accesses environment variables without validation and splits the token string without type checking.

Consider these security improvements:

 export async function checkTokens(): Promise<boolean> {
+  const tokens = process.env['CROWD_GITHUB_PERSONAL_ACCESS_TOKENS']
+  if (!tokens) {
+    throw new Error('GitHub tokens not configured')
+  }
+
   const cache = new RedisCache(`lfx-auth0`, svc.redis, svc.log)
   const tokenRotator = new GithubTokenRotator(
     cache,
-    process.env['CROWD_GITHUB_PERSONAL_ACCESS_TOKENS'].split(','),
+    tokens.split(',').filter(token => token.trim().length > 0),
   )

Also, consider:

  1. Moving token configuration to a secure configuration service
  2. Implementing token validation before usage
  3. Adding token rotation logs for audit trails
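
A minimal sketch of suggestion 2, validating the token list once before constructing the rotator; the helper name is an assumption:

// Hypothetical helper: fail fast on a missing or empty token list instead
// of passing a possibly-undefined value into GithubTokenRotator.
export function loadGithubTokens(): string[] {
  const raw = process.env['CROWD_GITHUB_PERSONAL_ACCESS_TOKENS']
  if (!raw) {
    throw new Error('CROWD_GITHUB_PERSONAL_ACCESS_TOKENS is not configured')
  }
  const tokens = raw
    .split(',')
    .map((token) => token.trim())
    .filter((token) => token.length > 0)
  if (tokens.length === 0) {
    throw new Error('CROWD_GITHUB_PERSONAL_ACCESS_TOKENS contains no usable tokens')
  }
  return tokens
}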

Line range hint 73-108: Improve GitHub API interaction robustness

The function needs several improvements for better reliability and maintainability.

Consider these enhancements:

 export async function findGithubSourceId(username: string): Promise<string | null> {
+  const RATE_LIMIT_STATUS = 429;
+  const REQUEST_TIMEOUT = 5000;
+
   try {
     const cache = new RedisCache(`lfx-auth0`, svc.redis, svc.log)
     const tokenRotator = new GithubTokenRotator(
       cache,
       process.env['CROWD_GITHUB_PERSONAL_ACCESS_TOKENS'].split(','),
     )

     const token = await tokenRotator.getToken()
+    const controller = new AbortController();
+    const timeoutId = setTimeout(() => controller.abort(), REQUEST_TIMEOUT);

     const response = await fetch(`https://api.github.com/users/${username}`, {
       headers: {
         Authorization: `Bearer ${token}`,
       },
+      signal: controller.signal,
     })
+    clearTimeout(timeoutId);
+
     if (!response.ok) {
-      console.error('HTTP error', response.status)
-      if (response.status === 429) {
+      svc.log.error('GitHub API error', { status: response.status, username })
+      if (response.status === RATE_LIMIT_STATUS) {
         throw new Error('Rate limit exceeded')
       }
       return null

Additional suggestions:

  1. Implement retry mechanism for transient failures
  2. Add request/response logging for debugging
  3. Consider using an API client library for better error handling
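
As a sketch of suggestion 1, a small retry-with-backoff wrapper could absorb transient failures; the attempt count and delays below are illustrative defaults, not project policy:

// Generic retry helper with exponential backoff (500ms, 1s, 2s, ...).
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      // Wait before the next attempt; the last failure is rethrown below.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt))
    }
  }
  throw lastError
}

// Usage sketch: const sourceId = await withRetries(() => findGithubSourceId(username))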
services/apps/premium/members_enrichment_worker/src/bin/onboarding.ts (2)

8-14: Consider using configuration file for environment setup

While the current setup works, hardcoding environment variables and service configurations directly in the code makes it less maintainable and harder to modify for different environments.

Consider moving these configurations to a separate config file:

-process.env['CROWD_TEMPORAL_TASKQUEUE'] = 'not-important'
-svc.config.envvars = []
-svc.config.producer = { enabled: false }
-svc.config.redis = { enabled: true }
-svc.config.temporal = { enabled: false }
-svc.config.questdb = { enabled: false }
-svc.options.opensearch = { enabled: false }
+import { onboardingConfig } from '../config/onboarding'
+
+svc.config = onboardingConfig

62-62: Remove commented-out debugging code

Remove the commented-out test data as it's not needed in production code.

-  // let members = ['8db6c61e-f8b0-400e-ac5d-cb550d8740c9']
services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (1)

Line range hint 33-105: Well-structured implementation with robust error handling

The changes effectively implement the new member enrichment architecture while maintaining good separation of concerns and error handling. The workflow handles multiple sources, caching, and data processing in a clean and maintainable way.

A few suggestions for future improvements:

  1. Consider adding logging for better observability of the enrichment process
  2. Add comments explaining the new processing flow for better maintainability
services/archetypes/worker/src/index.ts (2)

233-235: Use path.join for cross-platform compatibility.

Using path.resolve with string concatenation can be fragile across different operating systems. Consider using path.join for more robust path handling.

    const workflowBundle = await bundleWorkflowCode({
-     workflowsPath: path.resolve('./src/workflows'),
+     workflowsPath: path.join(__dirname, 'workflows'),
    })

245-245: Consider a more type-safe approach for loading activities.

Using require with path.resolve is not type-safe and could lead to runtime errors. Consider importing activities directly or using a more structured approach.

- activities: require(path.resolve('./src/activities')),
+ activities: await import(path.join(__dirname, 'activities')),
services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts (1)

282-289: Consider adding input sanitization for email arrays.

While the implementation handles both formats well, consider adding sanitization for the array case:

 let emails: string[]
 if (Array.isArray(data.email)) {
-  emails = data.email
+  emails = data.email.filter(isEmail)
 } else {
   emails = data.email.split(',').filter(isEmail)
 }

This ensures consistent email validation regardless of the input format.

services/libs/common_services/src/services/llm.service.ts (2)

152-153: Remove unnecessary ESLint disable comment

The ESLint disable comment @typescript-eslint/no-explicit-any appears to be unnecessary here, as there is no usage of any in the function parameters. Remove the comment to keep the code clean.

Apply this diff to remove the unnecessary comment:

         prompt: string,
-        // eslint-disable-next-line @typescript-eslint/no-explicit-any
       ): Promise<ILlmResult<LlmMemberEnrichmentResult>> {

97-99: Define appropriate types instead of using any

Casting settings.arguments to any and disabling the ESLint rule @typescript-eslint/no-explicit-any is not recommended. Define appropriate types for settings.arguments to maintain type safety.

Consider updating the code as follows:

         const command = new InvokeModelCommand({
           body: JSON.stringify({
             messages: [
               {
                 role: 'user',
                 content: [
                   {
                     type: 'text',
                     text: prompt,
                   },
                 ],
               },
             ],
-            // eslint-disable-next-line @typescript-eslint/no-explicit-any
-            ...(settings.arguments as any),
+            ...settings.arguments,
           }),
           modelId: settings.modelId,
           accept: 'application/json',
           contentType: 'application/json',
         })

Then, ensure that settings.arguments is properly typed in ILlmSettings. For example:

export interface ILlmSettings {
  modelId: string
  arguments?: Record<string, unknown> // Define a more specific type if possible
}
services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (3)

155-155: Use optional chaining for cleaner conditional checks

You can simplify the condition in line 155 by using optional chaining. Instead of if (cache && cache.data), use if (cache?.data) for cleaner and more concise code.

Apply this diff:

- if (cache && cache.data) {
+ if (cache?.data) {
🧰 Tools
🪛 Biome

[error] 155-155: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)


195-195: Address the TODO: Implement data squasher using LLM

There's a TODO comment indicating that the data squasher using LLM needs to be implemented. Ensuring this functionality is crucial for member data consolidation.

Would you like assistance in implementing the data squashing logic using LLM? I can help by providing a sample implementation or opening a GitHub issue to track this task.


331-331: Address the TODO: Update member data with LLM results

At line 331, there's a TODO to update the member data based on the LLM's output. Implementing this will ensure that high-confidence changes are correctly applied to the member profile.

Do you need assistance in writing the code to update the member data? I can help draft the implementation or open a GitHub issue for further tracking.
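
Purely as an illustration of the shape this step might take, only changes the LLM returned with high confidence would be applied; the payload fields and the 0.85 threshold are assumptions drawn from elsewhere in this review, not the actual contract:

// Hypothetical LLM output shape: a list of suggested field changes with
// per-change confidence scores.
interface ISuggestedChange {
  field: string
  value: unknown
  confidence: number
}

// Keep only high-confidence suggestions before writing anything back
// to the member profile.
function selectHighConfidenceChanges(
  changes: ISuggestedChange[],
  threshold = 0.85,
): ISuggestedChange[] {
  return changes.filter((change) => change.confidence >= threshold)
}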

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 730b954 and e255498.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (31)
  • backend/src/database/migrations/V1731052735__llm-prompt-history.sql (1 hunks)
  • backend/src/database/migrations/V1731484783__drop-member-deprecated-columns.sql (1 hunks)
  • services/apps/premium/members_enrichment_worker/package.json (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (4 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/getMembers.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/authenticateLFAuth0.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/enrichLFAuth0.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getEnrichmentLFAuth0.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getLFIDEnrichableMembers.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/githubIdentities.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/syncEnrichedData.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/bin/onboarding.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/main.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/service.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/types.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (3 hunks)
  • services/archetypes/worker/src/index.ts (2 hunks)
  • services/libs/common_services/package.json (1 hunks)
  • services/libs/common_services/src/repos/llmPromptHistory.repo.ts (1 hunks)
  • services/libs/common_services/src/services/index.ts (1 hunks)
  • services/libs/common_services/src/services/llm.service.ts (1 hunks)
  • services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (4 hunks)
  • services/libs/types/src/enums/activities.ts (0 hunks)
  • services/libs/types/src/enums/index.ts (1 hunks)
  • services/libs/types/src/enums/llm.ts (1 hunks)
  • services/libs/types/src/index.ts (1 hunks)
  • services/libs/types/src/llm.ts (1 hunks)
  • services/libs/types/src/premium/enrichment.ts (1 hunks)
💤 Files with no reviewable changes (1)
  • services/libs/types/src/enums/activities.ts
✅ Files skipped from review due to trivial changes (4)
  • backend/src/database/migrations/V1731484783__drop-member-deprecated-columns.sql
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getEnrichmentLFAuth0.ts
  • services/apps/premium/members_enrichment_worker/src/service.ts
  • services/apps/premium/members_enrichment_worker/src/types.ts
🧰 Additional context used
🪛 Biome
services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

[error] 155-155: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)


[error] 167-167: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 171-171: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 180-180: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 184-184: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)

services/libs/common_services/src/repos/llmPromptHistory.repo.ts

[error] 6-8: This constructor is unnecessary.

Unsafe fix: Remove the unnecessary constructor.

(lint/complexity/noUselessConstructor)

🔇 Additional comments (34)
services/libs/common_services/src/services/index.ts (2)

2-2: LGTM! Export follows consistent pattern

The new export follows the same pattern as other exports in the file, maintaining consistency.


2-2: Verify architectural placement of LLM service

Since this is being added to common_services, let's verify there are no circular dependencies and that this is the appropriate location for LLM functionality.

✅ Verification successful

Let me gather more information about the dependencies and imports to verify the architectural placement.


Based on the gathered information, I can now provide the final response:

LLM service placement in common_services is architecturally sound

The placement is appropriate because:

  • The service is a foundational utility that provides LLM capabilities through AWS Bedrock
  • It has no circular dependencies, only depending on core utilities and types
  • It's correctly consumed by higher-level services like merge_suggestions_worker and members_enrichment_worker
  • It shares the same level of abstraction as other common services (priority.service)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential circular dependencies and verify service placement

# Check for imports of llm.service across the codebase
echo "Checking usage of llm.service:"
rg -l "import.*llm\.service"

# Check if LlmService has any imports from other services
echo -e "\nChecking LlmService dependencies:"
rg -A 5 "import.*from.*services" "./services/libs/common_services/src/services/llm.service.ts"

# Look for other LLM-related code to ensure proper centralization
echo -e "\nChecking for other LLM-related code:"
rg -l "llm|LLM|bedrock|Bedrock" --type ts

Length of output: 2003


Script:

#!/bin/bash
# Check the content of llm.service.ts to understand its dependencies
echo "Checking llm.service.ts content:"
cat "./services/libs/common_services/src/services/llm.service.ts"

# Check what other services are in common_services
echo -e "\nListing other services in common_services:"
ls -1 "./services/libs/common_services/src/services/"

# Check if any other services depend on llm.service
echo -e "\nChecking imports of llm.service:"
rg "from.*llm\.service" -A 2

Length of output: 4983

services/libs/types/src/enums/llm.ts (1)

6-8: Verify if additional query types are planned.

The enum is correctly implemented. However, having a single member suggests that more query types might be added in the future.

Let's check if there are plans for additional query types:

✅ Verification successful

Based on the search results, I can see that LlmQueryType.MEMBER_ENRICHMENT is currently the only query type being used in the codebase, specifically in services/libs/common_services/src/services/llm.service.ts. There are no TODO comments or GitHub issues discussing additional query types.

No additional query types are currently planned.

The single member suggests this is a focused implementation for member enrichment functionality rather than a placeholder for future expansion. The enum is correctly implemented for its current purpose.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for TODO comments or discussions about additional query types

# Check for TODO comments mentioning LLM queries
rg -i "todo.*llm.*query" 

# Check for references to LlmQueryType to understand current usage
ast-grep --pattern 'LlmQueryType.$_'

# Check GitHub issues for planned LLM query types
gh issue list --search "llm query type"

Length of output: 188736

services/apps/premium/members_enrichment_worker/src/main.ts (1)

1-5: LGTM! Verify service initialization pattern.

The shift to a service-based pattern with centralized configuration is a good architectural improvement.

Let's verify the consistency of this pattern across the codebase:

✅ Verification successful

Let me check the service initialization pattern in more detail to verify its consistency.


Let me check the service initialization pattern in the specific service files.


Based on the verification results, I can now provide a final response:

Service initialization pattern is consistent and well-structured

The implementation follows the standard service worker pattern using @crowd/archetype-worker:

  • Properly configured with required environment variables
  • Consistent use of ServiceWorker with typed Config and Options
  • Clean separation of configuration from business logic
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent service initialization pattern
# Expected: All worker services should follow similar pattern

# Check for similar service.ts files in other workers
fd -e ts -g 'service.ts' services/apps/

# Check for consistent import pattern
rg -g '*.ts' "import \{ svc \} from './service'" services/apps/

Length of output: 676


Script:

#!/bin/bash
# Check the implementation of service.ts files
ast-grep --pattern 'export const svc = $_'

# Check for any other service initialization patterns
rg -g '*.ts' -B 2 -A 2 'createService|initializeService|ServiceConfig' services/apps/premium/members_enrichment_worker/

Length of output: 50155


Script:

#!/bin/bash
# Check the service implementation in members_enrichment_worker
cat services/apps/premium/members_enrichment_worker/src/service.ts

# Check for service worker pattern in other files
rg -g '*.ts' -B 2 -A 2 'class ServiceWorker' services/apps/

Length of output: 910

services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getLFIDEnrichableMembers.ts (1)

4-4: Verify service import path change across the codebase

The import path change from ../../main to ../../service appears to be part of a broader restructuring. Let's verify this change is consistent across the codebase.

✅ Verification successful

Let me verify the content of the main service file to ensure it's properly structured and contains the expected exports.


Let me check the content of main.ts and any references to it to ensure the migration is complete and correct.


Import path change is correct and consistent

The verification shows that:

  1. The service.ts file exists and exports the svc instance correctly
  2. All imports of svc across the codebase are using the correct path from service
  3. The main.ts file is still present but serves as the application entry point, importing svc from ./service
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining imports from 'main' that should be updated to 'service'
# and verify the new 'service' path exists and is properly structured

# Check for any remaining imports from 'main'
echo "Checking for remaining imports from 'main'..."
rg "from '\.\./\.\./main'" services/apps/premium/members_enrichment_worker/src/

# Verify the existence and structure of the new service path
echo "Verifying new service path..."
fd -t f "service.ts" services/apps/premium/members_enrichment_worker/src/

Length of output: 810


Script:

#!/bin/bash
# Check the content of the main service file
echo "Content of main service file:"
cat services/apps/premium/members_enrichment_worker/src/service.ts

# Check if there are any other files still importing from main.ts
echo -e "\nChecking for main.ts file:"
fd -t f "main.ts" services/apps/premium/members_enrichment_worker/src/

# Check for any imports referencing main.ts (with different path patterns)
echo -e "\nChecking for any main.ts imports:"
rg "from.*main'" services/apps/premium/members_enrichment_worker/src/

Length of output: 1275


Script:

#!/bin/bash
# Check the content of main.ts
echo "Content of main.ts:"
cat services/apps/premium/members_enrichment_worker/src/main.ts

# Look for any imports of 'svc' in the codebase
echo -e "\nChecking for svc imports:"
rg "import.*\{ ?svc ?}.*from" services/apps/premium/members_enrichment_worker/src/

Length of output: 1934

services/libs/types/src/enums/index.ts (1)

20-20: LGTM! Verify the referenced module.

The new export follows the established pattern and maintains alphabetical ordering.

Let's verify the existence of the referenced module and its contents:

✅ Verification successful

Export from './llm' is valid and contains the expected enums

The verification confirms:

  • The llm.ts module exists at the correct location
  • Contains both expected enums:
    • LlmModelType with Claude model variants
    • LlmQueryType with member enrichment type
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and contents of the llm enum module

# Check if the llm.ts file exists in the enums directory
fd -p "services/libs/types/src/enums/llm.ts$"

# If it exists, check its contents for the expected enums
if [ $? -eq 0 ]; then
  echo "Checking enum definitions..."
  ast-grep --pattern 'enum LlmModelType {
    $$$
  }' services/libs/types/src/enums/llm.ts
  
  ast-grep --pattern 'enum LlmQueryType {
    $$$
  }' services/libs/types/src/enums/llm.ts
fi

Length of output: 853

services/libs/common_services/package.json (2)

15-15: Verify security advisories for AWS SDK Bedrock client.

Let's ensure this version is free from known vulnerabilities.

✅ Verification successful

No security vulnerabilities found for @aws-sdk/client-bedrock-runtime@3.572.0

Both npm audit and GitHub security advisories show no known vulnerabilities for this package version.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for security advisories for the AWS SDK Bedrock client

# Check npm for security advisories
echo "Checking npm advisories for @aws-sdk/client-bedrock-runtime:"
npm audit --json @aws-sdk/client-bedrock-runtime@3.572.0 2>/dev/null | jq '.advisories'

# Check for GitHub security advisories
echo "Checking GitHub security advisories:"
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: NPM, package: "@aws-sdk/client-bedrock-runtime") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 659


15-15: LGTM! Verify AWS SDK version compatibility.

The addition of @aws-sdk/client-bedrock-runtime is appropriate for implementing LLM capabilities through AWS Bedrock. The caret versioning (^3.572.0) is a good practice for maintaining stability while allowing security updates.

Let's verify AWS SDK version compatibility:

backend/src/database/migrations/V1731052735__llm-prompt-history.sql (1)

15-17: Optimize index strategy

The current index setup might be redundant:

  1. The composite index ix_llmPromptHistory_type_entityId can already serve queries on the type column alone
  2. Having separate indexes on type and entityId might not provide additional performance benefits and could impact write performance

Consider keeping only the composite index and the index on entityId:

 create index "ix_llmPromptHistory_type_entityId" on "llmPromptHistory"("type", "entityId");
 create index "ix_llmPromptHistory_entityId" on "llmPromptHistory"("entityId");
-create index "ix_llmPromptHistory_type" on "llmPromptHistory"("type");

Let's verify if there are any existing queries that might specifically benefit from the separate type index:

✅ Verification successful

Index optimization confirmed as beneficial

Based on the codebase analysis, the current usage patterns show only insert operations (insertPromptHistoryEntry) and no explicit query patterns that would benefit from a separate index on the type column. The composite index ix_llmPromptHistory_type_entityId will efficiently cover queries filtering by type alone, making the separate type index redundant.

  • The repository only shows INSERT operations
  • No SELECT queries were found filtering by type alone
  • The composite index can already optimize type-based queries if needed in the future

Proceed with removing the separate type index as suggested, keeping only:

  1. ix_llmPromptHistory_type_entityId (covers type-based queries)
  2. ix_llmPromptHistory_entityId (for entity-specific lookups)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for queries that filter by type without entityId
rg -g '*.{ts,js}' 'llmPromptHistory.*where.*type\s*=' -A 5

Length of output: 60


Script:

#!/bin/bash
# Let's try a broader search for any queries involving llmPromptHistory table
rg -g '*.{ts,js}' 'llmPromptHistory.*where' -A 5

# Also search for any TypeORM repository usage or query builders
rg -g '*.{ts,js}' 'llmPromptHistory.*find' -A 5

# Search for any direct SQL queries
rg -g '*.{sql,ts,js}' 'SELECT.*FROM.*llmPromptHistory' -A 5

Length of output: 162


Script:

#!/bin/bash
# Let's check for the entity definition and repository to understand the query patterns
rg -g '*.{ts,js}' 'class.*LlmPromptHistory' -A 20

# Look for any imports or references to this table
rg -g '*.{ts,js}' 'llmPromptHistory' -i

# Check for any TypeORM decorators that might indicate query patterns
rg -g '*.{ts,js}' '@Index.*llmPromptHistory' -A 5

Length of output: 3035

services/apps/premium/members_enrichment_worker/src/activities/getMembers.ts (2)

Line range hint 1-28: Implementation looks good!

The overall implementation is well-structured with:

  • Clear separation of concerns
  • Proper use of factory pattern for enrichment services
  • Type-safe interfaces
  • Efficient member fetching with pagination support

9-9: Verify import path consistency across the service

The import path change from '../main' to '../service' appears to be part of a broader service restructuring.

✅ Verification successful

The verification results show:

  1. No remaining '../main' imports found in the codebase
  2. The service.ts file exists at the expected location
  3. The 'svc' is properly exported from the service module

Let me verify one more aspect - the actual import usage in the activities directory to ensure consistency:


Import path change is consistent and correctly implemented

The import path change from '../main' to '../service' has been consistently applied across all activity files, and the service module properly exports the required interface. All files in the activities directory are using the new import path, and there are no remaining references to the old '../main' import.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that all files in the service use the new import path consistently
# and that the service module exists and exports the expected interface

# Check if other files still use the old import path
echo "Checking for any remaining '../main' imports..."
rg "from '\.\./main'" services/apps/premium/members_enrichment_worker/

# Verify the existence of the service module
echo "Verifying '../service' module exists..."
fd -t f "service.ts" services/apps/premium/members_enrichment_worker/

# Check the service interface
echo "Checking service exports..."
rg "export.*svc" services/apps/premium/members_enrichment_worker/src/service.ts

Length of output: 1028


Script:

#!/bin/bash
# Check all imports in the activities directory
echo "Checking imports in activities directory..."
rg "from '\.\./service'" services/apps/premium/members_enrichment_worker/src/activities/

Length of output: 529

services/libs/types/src/index.ts (1)

64-64: LGTM! Verify the module path.

The new export follows the existing pattern and integrates well with the codebase structure.

Let's verify the module path and check for potential circular dependencies:

✅ Verification successful

Export path verified and no circular dependencies found

The module structure is valid:

  • llm.ts exists and imports from ./enums without circular dependencies
  • Contains well-structured interfaces and constants for LLM functionality
  • Follows TypeScript best practices for type definitions
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the llm module exists and check for circular dependencies

# Check if the llm module exists
fd -t f "llm.(ts|js)" "services/libs/types/src"

# Check for potential circular dependencies
rg -l "from '\.\.\/index'" "services/libs/types/src/llm"

Length of output: 273


Script:

#!/bin/bash
# Check for potential circular dependencies with corrected regex
rg "from '../index'" "services/libs/types/src/llm.ts"

# Check the content of llm.ts to understand its structure
cat "services/libs/types/src/llm.ts"

# Check if llm.ts re-exports from enums/llm.ts
rg "from.*enums/llm" "services/libs/types/src/llm.ts"

Length of output: 2678

services/apps/premium/members_enrichment_worker/src/activities/syncEnrichedData.ts (1)

4-4: Verify the service import path change across the codebase

The import path change from '../main' to '../service' appears to be part of a broader restructuring. Let's verify this change is consistent across related files.

✅ Verification successful

Let me verify the import path changes more thoroughly by checking the actual content of these files to ensure the service import is used consistently and correctly.


Import path change is consistent across the codebase

The import change from '../main' to '../service' has been consistently implemented across the codebase. All files that require the service instance are using the new '../service' import path, and there are no remaining references to the old '../main' import. The change appears to be part of a successful refactoring effort.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining '../main' imports that might need updating
# and verify the new '../service' import is used consistently

echo "Checking for any remaining old imports..."
rg -l "from '\.\./main'" services/apps/premium/members_enrichment_worker/

echo "Verifying new service import usage..."
rg -l "from '\.\./service'" services/apps/premium/members_enrichment_worker/

Length of output: 735


Script:

#!/bin/bash
# Check the actual import statements and their usage in the files
echo "Checking service import usage in all related files..."
rg "from '\.\./service'" services/apps/premium/members_enrichment_worker/src/ -A 2

# Also verify if there are any other variations of service imports
echo -e "\nChecking for other service-related imports..."
rg "import.*service" services/apps/premium/members_enrichment_worker/src/

Length of output: 4144

services/libs/common_services/src/repos/llmPromptHistory.repo.ts (2)

1-5: LGTM! Clean imports and proper class declaration.

The imports are well-organized and the class properly extends RepositoryBase with correct type parameters.


10-30: Consider adding createdAt to match database schema.

The implementation looks solid with proper SQL injection protection and metadata handling. However, according to the AI summary, the database table includes a createdAt column which isn't being set in this insert statement.

Let's verify the table schema:

If the column exists without a default value, consider adding it to the insert statement:

  insert into "llmPromptHistory"(
    type, model, "entityId", metadata, prompt, 
    answer, "inputTokenCount", "outputTokenCount", 
-   "responseTimeSeconds"
+   "responseTimeSeconds", "createdAt"
  )
  values(
    $(type), $(model), $(entityId), $(metadata), 
    $(prompt), $(answer), $(inputTokenCount), 
-   $(outputTokenCount), $(responseTimeSeconds)
+   $(outputTokenCount), $(responseTimeSeconds), NOW()
  );
services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/authenticateLFAuth0.ts (1)

5-5: LGTM! Verify the new service module path.

The import path change from main to service aligns with the broader restructuring effort.

Let's verify the new import path exists:

✅ Verification successful

Import path change is valid and correctly implemented

The new service.ts module exists at the correct location and exports the svc instance of ServiceWorker with proper configuration, maintaining the expected interface. The change from main to service is consistent with the codebase structure.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and exports of the new service module

# Check if the service.ts file exists
fd -t f "service.ts" "services/apps/premium/members_enrichment_worker/src"

# Check the exports from the service module
rg -A 5 "export.*svc" "services/apps/premium/members_enrichment_worker/src"

Length of output: 738


Script:

#!/bin/bash
# Let's verify the service module's interface and exports
cat services/apps/premium/members_enrichment_worker/src/service.ts

Length of output: 848

services/apps/premium/members_enrichment_worker/src/activities.ts (2)

9-9: LGTM! Clean addition of the new function export.

The changes follow good practices:

  • Function name is descriptive and follows naming conventions
  • Import and export are properly organized

Also applies to: 34-34


9-9: Verify the implementation of processMemberSources

Since this function processes multiple sources and uses LLM for consolidation, let's verify its implementation details for potential performance considerations and rate limiting.

Also applies to: 34-34

✅ Verification successful

Rate limiting and performance considerations are properly handled

The implementation of processMemberSources has several built-in safeguards and optimizations:

  1. Credit-based rate limiting through hasRemainingCredits check before data enrichment
  2. Redis-based caching of credit status with 60-second TTL
  3. Source-specific cache obsolescence checks
  4. LLM data consolidation with confidence thresholds (>= 0.85)
  5. Optimization of response size by prioritizing and filtering data

The function also includes proper error handling and logging for monitoring purposes.
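
An illustrative sketch of the caching pattern described above; the key name, TTL, and upstream call are assumptions, not the actual implementation:

// Cache the credit check in Redis with a short TTL so the upstream check
// is not repeated for every member processed in the same minute.
interface ISimpleCache {
  get(key: string): Promise<string | null>
  set(key: string, value: string, ttlSeconds: number): Promise<void>
}

async function hasRemainingCreditsCached(cache: ISimpleCache): Promise<boolean> {
  const cached = await cache.get('enrichment-credits')
  if (cached !== null) {
    return cached === 'true'
  }
  const hasCredits = await checkCreditsFromProvider() // hypothetical upstream call
  await cache.set('enrichment-credits', String(hasCredits), 60) // 60-second TTL per the review
  return hasCredits
}

// Stand-in for the real provider check.
async function checkCreditsFromProvider(): Promise<boolean> {
  return true
}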

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the implementation of processMemberSources for performance considerations

# Find and display the implementation
ast-grep --pattern 'function processMemberSources($_) {
  $$$
}'

# Look for rate limiting or throttling mechanisms
rg -A 5 'rate|limit|throttle' ./services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

Length of output: 177


Script:

#!/bin/bash
# Let's try a different approach to find the implementation

# Search for the file containing processMemberSources
fd "activities.ts|enrichment.ts" services/apps/premium/members_enrichment_worker/src/

# Once we find the file, let's look at its content
rg -A 20 "processMemberSources" services/apps/premium/members_enrichment_worker/src/

Length of output: 12521


Script:

#!/bin/bash
# Let's look at the complete implementation of processMemberSources and related functions

# Get the full content of enrichment.ts
cat services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

# Also check for any rate limiting configurations in the service
rg -l "rate|limit|throttle" services/apps/premium/members_enrichment_worker/src/

Length of output: 12253

services/apps/premium/members_enrichment_worker/package.json (1)

20-20: LGTM!

The new dependency @crowd/common_services follows the workspace package pattern and is correctly placed in alphabetical order.

services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/enrichLFAuth0.ts (1)

8-8: Verify the service import path change across the codebase.

The import path change from ../../main to ../../service appears to be part of a larger refactoring effort.

Let's verify this change is consistent across related files:

✅ Verification successful

Import path change is consistent across the codebase

The import change from ../../main to ../../service is part of a consistent pattern across the codebase. All files that import svc are now correctly importing it from the service module, with proper relative paths (e.g., ./service, ../service, or ../../service depending on their location in the directory structure). No remaining imports from ../../main were found.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining imports from '../../main'
# Expect: No results, indicating all imports have been updated

rg -l "from '../../main'" services/apps/premium/members_enrichment_worker/src/

Length of output: 86


Script:

#!/bin/bash
# Let's verify the import changes more thoroughly:

# 1. Check for any imports from 'service'
echo "=== Checking imports from 'service' ==="
rg "from ['\"].*service['\"]" services/apps/premium/members_enrichment_worker/src/

# 2. Check all import statements in the directory to understand the pattern
echo -e "\n=== Checking all import statements in the directory ==="
rg "^import.*from ['\"]\.\./" services/apps/premium/members_enrichment_worker/src/

# 3. Check specifically for the 'svc' import
echo -e "\n=== Checking for 'svc' imports ==="
rg "import.*\{ ?svc ?\}" services/apps/premium/members_enrichment_worker/src/

Length of output: 10016

services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (3)

Line range hint 8-38: LGTM! The schedule configuration is well-structured

The daily schedule at 6 AM with SKIP overlap policy is appropriate for a batch enrichment task. The retry configuration with a 15-second initial interval and maximum of 3 attempts provides good resilience while preventing excessive retries.
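For reference, a schedule with these characteristics would look roughly like this with the Temporal TypeScript SDK; the schedule id, workflow name, and task queue below are illustrative assumptions rather than the repo's actual values:

import { Client, ScheduleOverlapPolicy } from '@temporalio/client'

// Sketch of a daily 6 AM schedule with SKIP overlap and bounded retries.
async function scheduleDailyEnrichment(client: Client): Promise<void> {
  await client.schedule.create({
    scheduleId: 'members-enrichment', // assumed id
    spec: {
      cronExpressions: ['0 6 * * *'], // every day at 6 AM
    },
    policies: {
      overlap: ScheduleOverlapPolicy.SKIP, // skip a run while the previous one is still going
    },
    action: {
      type: 'startWorkflow',
      workflowType: 'getMembersToEnrich', // assumed workflow name
      taskQueue: 'members-enrichment',
      retry: {
        initialInterval: '15s',
        maximumAttempts: 3,
      },
    },
  })
}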


5-5: Verify the consistency of service imports across the codebase

The import path has been changed from '../main' to '../service'. Let's ensure this change is consistent across all files.


Line range hint 40-71: Review the scheduling frequency and retry configuration

A few concerns about the current configuration:

  1. The every-two-months schedule (0 0 1 */2 *) in production seems unusually infrequent. Consider whether this aligns with business requirements for LFID enrichment.
  2. The BUFFER_ONE overlap policy can queue a run behind a long-running one, causing resource contention if tasks take longer than expected.
  3. The 2-second retry interval might be too aggressive and could lead to unnecessary load during failures.

Consider these adjustments:

  1. Review if a monthly or weekly schedule would be more appropriate for production
  2. Consider using ScheduleOverlapPolicy.SKIP if tasks might take longer than expected
  3. Increase the retry initial interval to at least 5-10 seconds to prevent overwhelming the system during failures
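Expressed as Temporal schedule option fragments, the suggested adjustments would look roughly like this (values are recommendations, not the current configuration):

import { ScheduleOverlapPolicy } from '@temporalio/client'

// Suggested adjustments, sketched as schedule option fragments (illustrative values).
const adjusted = {
  spec: { cronExpressions: ['0 0 1 * *'] }, // monthly rather than every two months
  policies: { overlap: ScheduleOverlapPolicy.SKIP }, // skip instead of buffering overlapping runs
  retry: { initialInterval: '10s', maximumAttempts: 3 }, // gentler than the current 2s
}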
services/libs/types/src/llm.ts (1)

91-112: LGTM! Well-structured enrichment result interface.

The LlmMemberEnrichmentResult interface provides a comprehensive structure for handling various aspects of member enrichment, including confidence scores, identity management, attributes, and organization relationships.

services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/githubIdentities.ts (1)

9-9: Verify service import path changes across the codebase

The import path change from ../../main to ../../service suggests a service layer restructuring.

✅ Verification successful

The results show that there are no remaining imports from '../../main', and all relevant files in the lf-auth0 directory consistently use the new '../../service' import path. This confirms that the import path change has been applied consistently across the codebase.

Import path changes are consistent across the codebase

The migration from ../../main to ../../service has been properly implemented across all related files in the members enrichment worker service.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining old import paths and verify consistency
# Look for any remaining imports from 'main'
rg -l "from '../../main'" "services/apps/premium/members_enrichment_worker/src/"

# Look for new service imports to verify consistency
rg -l "from '../../service'" "services/apps/premium/members_enrichment_worker/src/"

Length of output: 648

services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (2)

104-104: Verify integration with the new processing approach

The code now uses processMemberSources instead of data squashing. Ensure that all necessary data transformations are handled correctly in the new processing flow.

✅ Verification successful

Let me search for the implementation of processMemberSources to understand how it handles the data transformations.


Let me check the implementation of processMemberSources in the enrichment activities file to understand how it handles the data processing.


Based on the implementation of processMemberSources, I can now provide a final response:

The new processing approach is properly implemented

The code correctly handles data transformations through a well-structured process:

  • Fetches and normalizes enrichment data from different sources
  • Uses LLM-based data consolidation for multiple sources with confidence scoring
  • Implements proper data cleaning (removing contributions and reach data to manage context size)
  • Follows strict verification rules for identities and data consolidation
  • Includes safeguards like confidence thresholds (0.85) for applying changes

The removal of the old squashing approach is intentional, replaced with a more sophisticated LLM-based consolidation system that provides better data quality control.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for the implementation and usage of processMemberSources
ast-grep --pattern 'async function processMemberSources($_) {
  $$$
}'

# Search for any remaining references to the old squashing approach
rg -i 'squash|normalize' --type ts

Length of output: 35172


Script:

#!/bin/bash
# Search for the implementation of processMemberSources
ast-grep --pattern 'async function processMemberSources($_) {
  $$$
}'

# Also search for any direct references to this function
rg 'processMemberSources' --type ts -A 3

Length of output: 2866


Script:

#!/bin/bash
# Get the implementation of processMemberSources from the enrichment activities file
cat services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

Length of output: 11587


22-22: Verify timeout settings for processing multiple sources

The timeout of 20 seconds might be insufficient when processing multiple enrichment sources simultaneously. Consider adjusting the startToCloseTimeout based on the expected processing time for multiple sources.

Also applies to: 23-31

services/archetypes/worker/src/index.ts (2)

206-249: LGTM! Well-structured Temporal worker initialization.

The implementation correctly:

  • Handles TLS configuration with proper certificate encoding
  • Configures worker settings with appropriate defaults
  • Sets up workflow bundling and activity registration

103-103: Verify the impact of the method signature change.

The addition of the optional initWorker parameter is a good enhancement for flexibility. However, since this modifies a method signature in what appears to be a base worker class, we should verify that all inheriting classes handle this change correctly.

✅ Verification successful

No impact from the method signature change

The search results show that while many services import and use ServiceWorker, none of them explicitly override the init method. The only two init method implementations found are:

  1. In the ServiceWorker class itself (the file under review)
  2. In the Consumer archetype which is a different base class

Since no inheriting classes override the init method, the addition of the optional parameter with a default value of true maintains backward compatibility and has no breaking impact on the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for classes that extend ServiceWorker and override init
ast-grep --pattern 'class $_ extends ServiceWorker {
  $$$
  init($$$) {
    $$$
  }
  $$$
}'

Length of output: 95


Script:

#!/bin/bash
# Let's try a different approach to find classes extending ServiceWorker
# First, find all TypeScript files
echo "=== Searching for ServiceWorker imports ==="
rg "import.*ServiceWorker" -A 2

echo -e "\n=== Searching for classes extending ServiceWorker ==="
rg "extends\s+ServiceWorker" -A 5

echo -e "\n=== Searching for init method implementations ==="
rg "override\s+async\s+init" -A 3

Length of output: 4159

services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts (2)

282-288: LGTM: Improved email handling with better type safety.

The changes improve the robustness of email handling by supporting both array and string formats. The implementation correctly:

  • Handles type variations using Array.isArray()
  • Maintains email validation using isEmail
  • Preserves the existing comma-separation logic for string inputs
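The described handling boils down to a normalization along these lines; isEmail stands in for the validator the service already uses:

// Sketch of the defensive email normalization described above.
declare function isEmail(value: string): boolean

function normalizeEmails(email: string | string[] | undefined): string[] {
  if (!email) {
    return []
  }
  // accept both an array of addresses and a comma-separated string
  const candidates = Array.isArray(email) ? email : email.split(',')
  return candidates.map((e) => e.trim()).filter((e) => isEmail(e))
}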

282-289: Verify email handling in upstream data sources.

The change in email handling logic suggests that the data format from Crustdata API might have changed.

✅ Verification successful

Let me gather more specific information about the email handling in these files.


Let me check one more thing to see the Crustdata response type definition specifically for the email field.


Email handling change appears to be a bug fix, not an API change

The Crustdata API type definition clearly shows email is defined as a string (email: string), but the code handles it as both array and comma-separated string. This suggests the change is actually fixing a bug in the code to handle malformed data, rather than reflecting an API change. Other enrichment services (Progai, Clearbit) handle emails consistently with their respective type definitions.

  • services/apps/premium/members_enrichment_worker/src/sources/crustdata/types.ts: Type definition shows email: string
  • services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts: Code defensively handles both array and comma-separated string formats
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other places where Crustdata email data is handled
rg -l "data\.email" --type ts

# Look for Crustdata API response type definitions
ast-grep --pattern 'interface $_Response {
  $$$
  email: $_
  $$$
}'

Length of output: 712


Script:

#!/bin/bash
# Check email handling in other enrichment services
rg -A 5 "data\.email" services/apps/premium/members_enrichment_worker/src/sources/progai/service.ts
rg -A 5 "data\.email" services/apps/premium/members_enrichment_worker/src/sources/clearbit/service.ts

# Look for type definitions in Crustdata
rg -A 10 "interface.*Response" services/apps/premium/members_enrichment_worker/src/sources/crustdata/

Length of output: 1923


Script:

#!/bin/bash
# Look for the main Crustdata response type that includes email
rg -B 5 -A 15 "email.*:" services/apps/premium/members_enrichment_worker/src/sources/crustdata/types.ts

Length of output: 672

services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (3)

533-536: Ensure consistent transaction usage

In the resetMemberEnrichedAt function, you're accepting a DbConnOrTx object as tx. Ensure that when calling this function within a transaction, you're consistently using the transaction object to maintain atomicity.


549-550: Confirm the impact of resetting lastEnriched after cache updates

After inserting or updating the member enrichment cache in insertMemberEnrichmentCacheDb and updateMemberEnrichmentCacheDb, you're resetting the lastEnriched field to null (lines 549–550 and 568–569). Verify that this behavior aligns with the application's logic and that downstream processes correctly handle a null lastEnriched value.

Also applies to: 568-569


588-590: Verify all usages of findMemberEnrichmentCacheDb accommodate the new parameters

The findMemberEnrichmentCacheDb function now accepts an array of sources (lines 588–590) and returns multiple results. Ensure that all calls to this function have been updated to pass an array of MemberEnrichmentSource and properly handle the array of results to prevent runtime errors.

You can run the following script to identify usages that may need updates:

This will help you locate all instances where the function is called and verify that they conform to the new signature.

✅ Verification successful

Let me check the implementation and usage context more thoroughly.


Let me try a different search pattern to get the implementation details.


Let me try one final search to find the enum definition and any other usages.


Function usage is properly aligned with the new signature

The verification shows that the only usage of findMemberEnrichmentCacheDb is in the findMemberEnrichmentCache function in services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts, and it correctly:

  • Accepts an array of MemberEnrichmentSource as the first parameter
  • Passes it along with the correct parameter order (connection, memberId, sources)
  • Expects an array of IMemberEnrichmentCache as the return type

The implementation and usage are consistent with the changes, and no updates are needed in other parts of the codebase.
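For reference, the updated call shape inferred from these findings is roughly as follows; the enum and cache types are simplified stand-ins:

// Sketch of the updated usage, not the exact code.
type MemberEnrichmentSource = string
interface IMemberEnrichmentCache {
  source: MemberEnrichmentSource
  data: unknown
}

declare const connection: unknown
declare function findMemberEnrichmentCacheDb(
  conn: unknown,
  memberId: string,
  sources: MemberEnrichmentSource[],
): Promise<IMemberEnrichmentCache[]>

async function example(memberId: string, sources: MemberEnrichmentSource[]) {
  // one cache entry is returned per requested source
  const caches = await findMemberEnrichmentCacheDb(connection, memberId, sources)
  return caches.find((c) => c.source === sources[0])
}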

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all calls to `findMemberEnrichmentCacheDb` and check for the correct usage of the new parameters.

# Search for function calls
rg 'findMemberEnrichmentCacheDb\(' -A 3

Length of output: 504


Script:

#!/bin/bash
# Get the implementation of findMemberEnrichmentCacheForAllSources to see how it uses the result
ast-grep --pattern 'export async function findMemberEnrichmentCacheForAllSources($_) {
  $$$
}'

# Get the implementation of findMemberEnrichmentCacheDb to verify the signature change
ast-grep --pattern 'export async function findMemberEnrichmentCacheDb($_) {
  $$$
}'

# Check for any tests that might show the usage pattern
rg "findMemberEnrichmentCacheDb" -g "*.test.ts" -g "*.spec.ts" -A 5

Length of output: 252


Script:

#!/bin/bash
# Get more context around the function call to see how it's being used
rg 'findMemberEnrichmentCacheDb' -B 5 -A 10

# Get the type definition of MemberEnrichmentSource to understand the parameter
rg 'type MemberEnrichmentSource' -A 5

Length of output: 5520


Script:

#!/bin/bash
# Find the enum definition and any other usages
rg 'enum MemberEnrichmentSource' -A 10

# Look for any other calls to the function in the codebase
rg -g '!*.{md,json}' 'findMemberEnrichmentCacheDb\(' -B 3 -A 3

Length of output: 1410

Comment on lines +6 to +8
public constructor(dbStore: DbStore, parentLog: Logger) {
super(dbStore, parentLog)
}

🛠️ Refactor suggestion

Remove unnecessary constructor.

Since the constructor only forwards parameters to the parent class without additional initialization, it can be safely removed. TypeScript will automatically handle the parent constructor call.

-  public constructor(dbStore: DbStore, parentLog: Logger) {
-    super(dbStore, parentLog)
-  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// current (to be removed)
public constructor(dbStore: DbStore, parentLog: Logger) {
  super(dbStore, parentLog)
}
🧰 Tools
🪛 Biome

[error] 6-8: This constructor is unnecessary.

Unsafe fix: Remove the unnecessary constructor.

(lint/complexity/noUselessConstructor)

@@ -6,6 +6,7 @@
"start:debug": "CROWD_TEMPORAL_TASKQUEUE=members-enrichment SERVICE=members-enrichment-worker LOG_LEVEL=trace tsx --inspect=0.0.0.0:9232 src/main.ts",
"dev:local": "nodemon --watch src --watch ../../../libs --ext ts --exec pnpm run start:debug:local",
"dev": "nodemon --watch src --watch ../../../libs --ext ts --exec pnpm run start:debug",
"script:onboarding": "SERVICE=script tsx --inspect src/bin/onboarding.ts",

🛠️ Refactor suggestion

Specify a unique debug port for the onboarding script

The --inspect flag without a port number will use the default Node.js debug port (9229), which could conflict with other services running in debug mode. Consider specifying a unique port to prevent debugging conflicts.

-    "script:onboarding": "SERVICE=script tsx --inspect src/bin/onboarding.ts",
+    "script:onboarding": "SERVICE=script tsx --inspect=0.0.0.0:9233 src/bin/onboarding.ts",
📝 Committable suggestion


Suggested change
"script:onboarding": "SERVICE=script tsx --inspect src/bin/onboarding.ts",
"script:onboarding": "SERVICE=script tsx --inspect=0.0.0.0:9233 src/bin/onboarding.ts",

Comment on lines +84 to +89
export interface LlmAttributeValue {
// eslint-disable-next-line @typescript-eslint/no-explicit-any
default?: any
// eslint-disable-next-line @typescript-eslint/no-explicit-any
[source: string]: any
}

🛠️ Refactor suggestion

Improve type safety by avoiding any.

The use of any with eslint-disable comments reduces type safety.

Consider using a more specific type or a generic type parameter:

 export interface LlmAttributeValue {
-  // eslint-disable-next-line @typescript-eslint/no-explicit-any
-  default?: any
-  // eslint-disable-next-line @typescript-eslint/no-explicit-any
-  [source: string]: any
+  default?: unknown
+  [source: string]: unknown
 }

Or even better, make it generic:

export interface LlmAttributeValue<T = unknown> {
  default?: T
  [source: string]: T | undefined
}

Comment on lines +58 to +72
export interface LlmIdentity {
t: string // type
v: string // value
p: string // platform
ve: boolean // verification status
}

export interface LlmOrganizationConnection {
orgId: string
t: string // title
ds: string // dateStart
de: string // dateEnd
s: string // source
}


🛠️ Refactor suggestion

Improve property naming for better code readability.

The single-letter property names (t, v, p, ve, ds, de, s) make the code harder to understand and maintain.

Consider using full descriptive names:

 export interface LlmIdentity {
-  t: string // type
-  v: string // value
-  p: string // platform
-  ve: boolean // verification status
+  type: string
+  value: string
+  platform: string
+  isVerified: boolean
 }

 export interface LlmOrganizationConnection {
   orgId: string
-  t: string // title
-  ds: string // dateStart
-  de: string // dateEnd
-  s: string // source
+  title: string
+  dateStart: string
+  dateEnd: string
+  source: string
 }
📝 Committable suggestion


Suggested change

// current
export interface LlmIdentity {
  t: string // type
  v: string // value
  p: string // platform
  ve: boolean // verification status
}

export interface LlmOrganizationConnection {
  orgId: string
  t: string // title
  ds: string // dateStart
  de: string // dateEnd
  s: string // source
}

// suggested
export interface LlmIdentity {
  type: string
  value: string
  platform: string
  isVerified: boolean
}

export interface LlmOrganizationConnection {
  orgId: string
  title: string
  dateStart: string
  dateEnd: string
  source: string
}

Comment on lines +47 to +56
export const LLM_SETTINGS: Record<LlmQueryType, ILlmSettings> = {
[LlmQueryType.MEMBER_ENRICHMENT]: {
modelId: LlmModelType.CLAUDE_3_5_SONNET,
arguments: {
max_tokens: 200000,
anthropic_version: 'bedrock-2023-05-31',
temperature: 0,
},
},
}

💡 Codebase verification

Adjust max_tokens configuration to align with actual usage patterns

The verification reveals that the max_tokens: 200000 setting in services/libs/types/src/llm.ts is significantly higher than what's being used in actual implementations. Other LLM workflows in the codebase (mergeOrganizationsWithLLM.ts and mergeMembersWithLLM.ts) use a much more reasonable value of 2000 tokens.

  • services/libs/types/src/llm.ts: Reduce max_tokens from 200000 to 2000 for the MEMBER_ENRICHMENT query type to maintain consistency with other implementations
  • Consider extracting the common LLM configuration (including anthropic_version and temperature) into a shared configuration object since these values are identical across implementations
🔗 Analysis chain

Review the LLM configuration settings.

The max_tokens value of 200,000 seems unusually high; it matches Claude's 200k context window and looks like it was confused with the context size rather than an intended output cap. This could lead to:

  1. Increased costs due to token consumption
  2. Longer processing times
  3. Potential timeout issues

Additionally, the hardcoded anthropic_version might need periodic updates.

Consider:

  1. Moving these configurations to environment variables or a configuration service
  2. Implementing a mechanism to update the Anthropic version without code changes
  3. Setting a more conservative max_tokens limit based on actual usage patterns
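A sketch of the first recommendation, reading the limits from the environment with conservative fallbacks; the variable names are made up for illustration:

// Illustrative sketch: source the LLM arguments from environment variables
// (names are hypothetical) instead of hardcoding them.
const memberEnrichmentSettings = {
  modelId: process.env.LLM_MODEL_ID ?? 'anthropic.claude-3-5-sonnet-20240620-v1:0',
  arguments: {
    max_tokens: Number(process.env.LLM_MAX_TOKENS ?? 2000),
    anthropic_version: process.env.LLM_ANTHROPIC_VERSION ?? 'bedrock-2023-05-31',
    temperature: 0,
  },
}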

Run this script to analyze token usage patterns:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find LLM token usage patterns in the codebase
# Look for token limits in tests or documentation

# Search for token-related configurations
rg -i "max.*tokens|tokens.*limit" 

# Search for test cases involving tokens
rg -i "test.*tokens|tokens.*test"

Length of output: 330


Script:

#!/bin/bash
# Let's analyze the context around these token configurations and their usage

# Check the context of token configurations
rg -B 5 -A 5 "max_tokens"

# Look for any LLM response handling or error cases
rg -i "response.*timeout|timeout.*error|token.*limit.*exceed"

# Find any documentation about token limits
fd "README|CONTRIBUTING" --exec grep -i -l "token"

# Look for any configuration patterns
ast-grep --pattern 'max_tokens: $_'

Length of output: 4020

Comment on lines +138 to +143
try {
await this.repo.insertPromptHistoryEntry(type, settings.modelId, result, entityId, metadata)
} catch (err) {
this.log.error(err, 'Failed to save LLM prompt history entry!')
throw err
}

🛠️ Refactor suggestion

Handle prompt history saving errors without interrupting the main flow

Currently, if saving the prompt history entry fails, an error is thrown, which stops the main operation. Since saving history is optional, consider logging the error but not throwing it, so that the main response can still be returned to the user.

Apply this diff to prevent throwing an error when history saving fails:

         if (saveHistory) {
           try {
             await this.repo.insertPromptHistoryEntry(type, settings.modelId, result, entityId, metadata)
           } catch (err) {
             this.log.error(err, 'Failed to save LLM prompt history entry!')
-            throw err
           }
         }
📝 Committable suggestion


Suggested change

// current
try {
  await this.repo.insertPromptHistoryEntry(type, settings.modelId, result, entityId, metadata)
} catch (err) {
  this.log.error(err, 'Failed to save LLM prompt history entry!')
  throw err
}

// suggested
try {
  await this.repo.insertPromptHistoryEntry(type, settings.modelId, result, entityId, metadata)
} catch (err) {
  this.log.error(err, 'Failed to save LLM prompt history entry!')
}

Comment on lines +156 to +157
const result = JSON.parse(response.answer)


🛠️ Refactor suggestion

Add error handling for JSON parsing of LLM response

Parsing response.answer without a try-catch block could lead to unhandled exceptions if the JSON is invalid. Add proper error handling to manage parsing errors gracefully.

Apply this diff to add error handling:

-        const result = JSON.parse(response.answer)
+        let result: LlmMemberEnrichmentResult
+        try {
+          result = JSON.parse(response.answer)
+        } catch (err) {
+          this.log.error(err, 'Failed to parse LLM response as JSON')
+          throw new Error('Invalid JSON response from LLM')
+        }
📝 Committable suggestion


Suggested change

// current
const result = JSON.parse(response.answer)

// suggested
let result: LlmMemberEnrichmentResult
try {
  result = JSON.parse(response.answer)
} catch (err) {
  this.log.error(err, 'Failed to parse LLM response as JSON')
  throw new Error('Invalid JSON response from LLM')
}

Comment on lines +105 to +114
let res: InvokeModelCommandOutput
try {
res = await client.send(command)
} catch (err) {
this.log.error(err, { settings, prompt }, 'Failed to query LLM!')
throw err
}

const body = JSON.parse(res.body.transformToString())
const responseTimeSeconds = end()

⚠️ Potential issue

Verify response body parsing across AWS SDK versions

In recent versions of AWS SDK v3, InvokeModelCommandOutput.body is a Uint8ArrayBlobAdapter, so calling transformToString() on it is valid. On older SDK versions, however, the body can come back as a readable stream, in which case it must be collected before parsing. If those versions need to be supported, a small helper can read the stream first:

+        import { Readable } from 'stream'
+        const streamToString = (stream: Readable): Promise<string> =>
+          new Promise((resolve, reject) => {
+            const chunks: Buffer[] = []
+            stream.on('data', (chunk) => chunks.push(chunk))
+            stream.on('error', reject)
+            stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')))
+          })

         let res: InvokeModelCommandOutput
         try {
           res = await client.send(command)
         } catch (err) {
           this.log.error(err, { settings, prompt }, 'Failed to query LLM!')
           throw err
         }

-        const body = JSON.parse(res.body.transformToString())
+        const bodyString = await streamToString(res.body as unknown as Readable)
+        const body = JSON.parse(bodyString)

Either way, consider wrapping the JSON.parse call in a try-catch, since a malformed model response would currently surface as an unhandled exception.

Committable suggestion skipped: line range outside the PR's diff.

for (const n of normalized) {
if (n.contributions) {
normalizedContributions.push(n.contributions)
delete n.contributions

🛠️ Refactor suggestion

Avoid using the delete operator to enhance performance

Using the delete operator can negatively impact performance. Instead, consider setting the properties to undefined or using object destructuring to create new objects without those properties.

Apply these changes:

- delete n.contributions
+ n.contributions = undefined

- delete n.reach
+ n.reach = undefined

- delete normalized.contributions
+ normalized.contributions = undefined

- delete normalized.reach
+ normalized.reach = undefined

Also applies to: 171-171, 180-180, 184-184

🧰 Tools
🪛 Biome

[error] 167-167: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)

Comment on lines 16 to 74
export async function fetchMemberDataForLLMSquashing(
db: DbStore,
memberId: string,
): Promise<IMemberOriginalData | null> {
const result = await db.connection().oneOrNone(
`
with member_orgs as (select distinct mo."memberId",
mo."organizationId" as "orgId",
o."displayName" as "orgName",
mo.title as "jobTitle",
mo."dateStart",
mo."dateEnd",
mo.source
from "memberOrganizations" mo
inner join organizations o on mo."organizationId" = o.id
where mo."memberId" = $(memberId)
and mo."deletedAt" is null
and o."deletedAt" is null)
select m."displayName",
m.attributes,
m."manuallyChangedFields",
coalesce(
(select json_agg(
(select row_to_json(r)
from (select mi.type,
mi.platform,
mi.value) r)
)
from "memberIdentities" mi
where mi."memberId" = m.id
and verified = true), '[]'::json) as identities,
coalesce(
nullif(
json_agg(
(select row_to_json(r)
from (select mo."orgId",
mo."orgName",
mo."jobTitle",
mo."dateStart",
mo."dateEnd",
mo.source) r)
)::jsonb,
jsonb_build_array(null)
)::json,
'[]'::json
) as organizations
from members m
left join member_orgs mo on mo."memberId" = m.id
where m.id = $(memberId)
and m."deletedAt" is null
group by m.id, m."displayName", m.attributes, m."manuallyChangedFields";
`,
{
memberId,
},
)

return result
}

🛠️ Refactor suggestion

Optimize the GROUP BY clause for better performance

In the fetchMemberDataForLLMSquashing function, the GROUP BY clause includes m.id, m."displayName", m.attributes, and m."manuallyChangedFields" (line 66). Since m.id is the primary key, the other columns are functionally dependent on it and PostgreSQL allows them to be omitted, so grouping by m.id alone is sufficient. Including large text or JSON fields like attributes and manuallyChangedFields can negatively impact query performance.

Consider modifying the GROUP BY clause as follows:

-      group by m.id, m."displayName", m.attributes, m."manuallyChangedFields";
+      GROUP BY m.id;
📝 Committable suggestion


Suggested change
export async function fetchMemberDataForLLMSquashing(
db: DbStore,
memberId: string,
): Promise<IMemberOriginalData | null> {
const result = await db.connection().oneOrNone(
`
with member_orgs as (select distinct mo."memberId",
mo."organizationId" as "orgId",
o."displayName" as "orgName",
mo.title as "jobTitle",
mo."dateStart",
mo."dateEnd",
mo.source
from "memberOrganizations" mo
inner join organizations o on mo."organizationId" = o.id
where mo."memberId" = $(memberId)
and mo."deletedAt" is null
and o."deletedAt" is null)
select m."displayName",
m.attributes,
m."manuallyChangedFields",
coalesce(
(select json_agg(
(select row_to_json(r)
from (select mi.type,
mi.platform,
mi.value) r)
)
from "memberIdentities" mi
where mi."memberId" = m.id
and verified = true), '[]'::json) as identities,
coalesce(
nullif(
json_agg(
(select row_to_json(r)
from (select mo."orgId",
mo."orgName",
mo."jobTitle",
mo."dateStart",
mo."dateEnd",
mo.source) r)
)::jsonb,
jsonb_build_array(null)
)::json,
'[]'::json
) as organizations
from members m
left join member_orgs mo on mo."memberId" = m.id
where m.id = $(memberId)
and m."deletedAt" is null
GROUP BY m.id;
`,
{
memberId,
},
)
return result
}

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 21

🧹 Outside diff range and nitpick comments (38)
services/libs/types/src/enums/llm.ts (1)

1-4: Consider improving version management and documentation for LLM models.

The hardcoded version strings could make future model updates more challenging. Consider:

  1. Adding JSDoc comments describing each model's capabilities and use cases
  2. Moving version strings to constants for easier updates
+/** Constants for LLM model versions */
+const CLAUDE_VERSIONS = {
+  SONNET: '20240620-v1:0',
+  OPUS: '20240229-v1:0',
+} as const;
+
+/**
+ * Supported LLM model types
+ * @remarks
+ * - CLAUDE_3_5_SONNET: Lighter, faster model suitable for routine tasks
+ * - CLAUDE_3_OPUS: More powerful model for complex reasoning
+ */
-export enum LlmModelType {
-  CLAUDE_3_5_SONNET = 'anthropic.claude-3-5-sonnet-20240620-v1:0',
-  CLAUDE_3_OPUS = 'anthropic.claude-3-opus-20240229-v1:0',
-}
+// string enum members cannot be computed, so a const object replaces the enum
+export const LlmModelType = {
+  CLAUDE_3_5_SONNET: `anthropic.claude-3-5-sonnet-${CLAUDE_VERSIONS.SONNET}`,
+  CLAUDE_3_OPUS: `anthropic.claude-3-opus-${CLAUDE_VERSIONS.OPUS}`,
+} as const;
+export type LlmModelType = (typeof LlmModelType)[keyof typeof LlmModelType];
services/apps/premium/members_enrichment_worker/src/main.ts (2)

4-4: Good architectural improvement moving configuration to a service module!

The refactoring to move configuration into a dedicated service module improves separation of concerns and maintainability.


Line range hint 6-16: Consider adding error handling for initialization failures

While the initialization flow is well-structured, it would benefit from proper error handling to ensure graceful service shutdown in case of initialization failures.

Consider wrapping the initialization in a try-catch block:

 setImmediate(async () => {
+  try {
     await svc.init()
 
     await scheduleMembersEnrichment()
 
     if (process.env['CROWD_EDITION'] === Edition.LFX) {
       await scheduleMembersLFIDEnrichment()
     }
 
     await svc.start()
+  } catch (error) {
+    console.error('Failed to initialize members enrichment worker:', error)
+    process.exit(1)
+  }
 })
services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getLFIDEnrichableMembers.ts (1)

Line range hint 9-14: Improve error handling to preserve stack traces

The current error handling loses the original stack trace by creating a new Error. Consider using error chaining to preserve the error context.

  try {
    const db = svc.postgres.reader
    rows = await fetchMembersForLFIDEnrichment(db, limit, afterId)
  } catch (err) {
-   throw new Error(err)
+   throw new Error('Failed to fetch members for LF ID enrichment', { cause: err })
  }
services/libs/common_services/package.json (1)

15-15: Consider AWS Bedrock cost management strategies.

The addition of AWS Bedrock Runtime client indicates usage of AWS foundation models, which can incur significant costs. Consider implementing:

  1. Rate limiting and throttling mechanisms
  2. Cost monitoring and alerting
  3. Usage quotas per user/organization
  4. Caching strategies for repeated requests
  5. Fallback mechanisms for service disruptions

Would you like assistance in implementing any of these cost control mechanisms?
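As one example of the first item, a minimal in-process token-bucket throttle in front of Bedrock calls could look like this; it is a sketch, not wired to the codebase's actual service setup:

// Minimal token-bucket throttle sketch for Bedrock invocations; names are illustrative.
class TokenBucket {
  private tokens: number
  private lastRefill = Date.now()

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number,
  ) {
    this.tokens = capacity
  }

  tryAcquire(): boolean {
    const now = Date.now()
    const elapsedSeconds = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond)
    this.lastRefill = now
    if (this.tokens >= 1) {
      this.tokens -= 1
      return true
    }
    return false
  }
}

// e.g. allow bursts of 10 invocations, refilling at 5 per second
const bedrockLimiter = new TokenBucket(10, 5)

async function invokeWithThrottle<T>(invoke: () => Promise<T>): Promise<T> {
  if (!bedrockLimiter.tryAcquire()) {
    throw new Error('Bedrock rate limit exceeded, try again later')
  }
  return invoke()
}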

services/libs/common_services/src/repos/llmPromptHistory.repo.ts (1)

6-8: Remove unnecessary constructor.

The constructor can be removed as it only forwards parameters to the parent class without additional initialization.

-  public constructor(dbStore: DbStore, parentLog: Logger) {
-    super(dbStore, parentLog)
-  }
🧰 Tools
🪛 Biome

[error] 6-8: This constructor is unnecessary.

Unsafe fix: Remove the unnecessary constructor.

(lint/complexity/noUselessConstructor)

services/libs/types/src/premium/enrichment.ts (2)

34-41: Consider using a more specific type for date fields.

The dateStart and dateEnd fields are currently typed as string. Consider using a more specific type to ensure consistent date formatting and validation.

export interface IMemberOrganizationData {
  orgId: string
  orgName: string
  jobTitle: string
-  dateStart: string
-  dateEnd: string
+  dateStart: string | null  // ISO 8601 format
+  dateEnd: string | null    // ISO 8601 format
  source: string
}

43-54: Enhance type safety and add JSDoc documentation.

The interface could benefit from:

  1. More specific typing for the attributes field
  2. Documentation explaining the purpose and requirements of each field
+/**
+ * Represents the original member data aggregated from multiple sources
+ * @property displayName - The member's display name
+ * @property attributes - Key-value pairs of member attributes grouped by source
+ * @property manuallyChangedFields - List of fields that were manually modified
+ * @property identities - List of member's platform identities
+ * @property organizations - List of member's organization affiliations
+ */
 export interface IMemberOriginalData {
   // members table data
   displayName: string
-  attributes: Record<string, Record<string, unknown>>
+  attributes: Record<string, Record<string, string | number | boolean | null>>
   manuallyChangedFields: string[]

   // memberIdentities table data
   identities: IMemberIdentity[]

   // memberOrganizations table data
   organizations: IMemberOrganizationData[]
 }
services/apps/merge_suggestions_worker/src/workflows/mergeMembersWithLLM.ts (3)

44-46: Good architectural improvement moving LLM configuration out of the workflow.

The removal of hardcoded LLM parameters and the use of LlmQueryType makes the code cleaner and more maintainable. This change suggests better separation of concerns with centralized LLM configuration management.


44-51: Consider improving error handling and type safety.

The current implementation could benefit from:

  1. Explicit typing for the verdict
  2. Error handling for LLM failures
  3. Logging for audit and debugging purposes

Consider applying this improvement:

-    const verdict = await commonActivitiesProxy.getLLMResult(
+    const verdict: boolean = await commonActivitiesProxy.getLLMResult(
       LlmQueryType.MEMBER_MERGE,
       members.map((member) => removeEmailLikeIdentitiesFromMember(member)),
     )
 
+    // Log the decision for audit purposes
+    console.log(`LLM merge decision for members ${suggestion[0]}, ${suggestion[1]}: ${verdict}`)
+
     if (verdict) {
       await commonActivitiesProxy.mergeMembers(suggestion[0], suggestion[1], args.tenantId)
+    } else {
+      console.log(`Skipping merge for members ${suggestion[0]}, ${suggestion[1]} based on LLM verdict`)
     }

Line range hint 24-24: Consider making SUGGESTIONS_PER_RUN configurable.

The hardcoded batch size of 10 suggestions per run could be moved to a configuration value for better flexibility in different environments.

Consider moving this to a configuration:

-  const SUGGESTIONS_PER_RUN = 10
+  const SUGGESTIONS_PER_RUN = config.get('mergeSuggestions.batchSize', 10)
services/apps/premium/members_enrichment_worker/src/activities.ts (1)

34-34: LGTM! Consider documenting the new capability

The export is properly added, maintaining symmetry with the import section. Since this introduces a new capability for processing multiple member sources, consider adding JSDoc comments in the implementation file to document its purpose and usage.
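For example, a JSDoc block along these lines would capture the intent; the wording and parameter names are suggestions, since the exact signature lives in the implementation file:

/**
 * Fetches cached enrichment data for a member from the given sources,
 * normalizes it, and consolidates it into a single update using an LLM.
 *
 * @param memberId - id of the member whose cached source data should be processed
 * @param sources - enrichment sources to consider for this member
 */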

services/libs/types/src/llm.ts (2)

35-45: Consider making pricing configuration external

The pricing information is currently hardcoded, but as noted in the comment, "these numbers can change". Consider moving this to an external configuration for easier updates.

Consider moving the pricing to environment variables or a configuration file:

-export const LLM_MODEL_PRICING_MAP: Record<LlmModelType, ILlmPricing> = {
+export const getLlmModelPricing = (): Record<LlmModelType, ILlmPricing> => ({
   [LlmModelType.CLAUDE_3_OPUS]: {
-    costPer1000InputTokens: 0.015,
-    costPer1000OutputTokens: 0.075,
+    costPer1000InputTokens: Number(process.env.CLAUDE_3_OPUS_INPUT_COST ?? 0.015),
+    costPer1000OutputTokens: Number(process.env.CLAUDE_3_OPUS_OUTPUT_COST ?? 0.075),
   },
   // ... similar changes for other models
-}
+})

67-67: Address the TODO comment

The TODO comment needs attention: "TODO uros check if this is ok"

Would you like me to help create a GitHub issue to track this TODO item?

services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/enrichLFAuth0.ts (3)

Line range hint 52-71: Enhance error handling and robustness of the merge operation

While the basic error handling is in place, consider the following improvements for better reliability:

  1. Add request timeout handling
  2. Implement retry mechanism for transient failures
  3. Add detailed error logging including response body
  4. Validate environment variables before usage
  5. Add audit logging for merge operations

Here's a suggested implementation:

 export async function mergeMembers(
   primaryMemberId: string,
   secondaryMemberId: string,
   tenantId: string,
 ): Promise<void> {
+  const apiUrl = process.env['CROWD_API_SERVICE_URL'];
+  const apiToken = process.env['CROWD_API_SERVICE_USER_TOKEN'];
+  
+  if (!apiUrl || !apiToken) {
+    throw new Error('Required environment variables are not set');
+  }
+
+  const controller = new AbortController();
+  const timeout = setTimeout(() => controller.abort(), 30000);
+
   const res = await fetch(
-    `${process.env['CROWD_API_SERVICE_URL']}/tenant/${tenantId}/member/${primaryMemberId}/merge`,
+    `${apiUrl}/tenant/${tenantId}/member/${primaryMemberId}/merge`,
     {
       method: 'PUT',
       headers: {
-        Authorization: `Bearer ${process.env['CROWD_API_SERVICE_USER_TOKEN']}`,
+        Authorization: `Bearer ${apiToken}`,
         'Content-Type': 'application/json',
       },
       body: JSON.stringify({
         memberToMerge: secondaryMemberId,
       }),
+      signal: controller.signal,
     },
-  )
+  ).finally(() => clearTimeout(timeout));
 
-  if (res.status !== 200) {
-    throw new Error(`Failed to merge member ${primaryMemberId} with ${secondaryMemberId}!`)
+  const responseBody = await res.text();
+  
+  if (!res.ok) {
+    const error = new Error(
+      `Failed to merge member ${primaryMemberId} with ${secondaryMemberId}. ` +
+      `Status: ${res.status}. Response: ${responseBody}`
+    );
+    svc.logger.error(error);
+    throw error;
   }
+  
+  svc.logger.info(`Successfully merged member ${secondaryMemberId} into ${primaryMemberId}`);
 }

Line range hint 13-21: Implement consistent error handling across functions

The current error handling approach loses valuable context and stack traces. Consider implementing a unified error handling strategy:

  1. Create specific error types for different failure scenarios
  2. Preserve error context and stack traces
  3. Add proper error logging

Example implementation:

// Create specific error types
class MemberEnrichmentError extends Error {
  constructor(message: string, public cause?: Error) {
    super(message);
    this.name = 'MemberEnrichmentError';
  }
}

// Update error handling in functions
export async function getIdentitiesExistInOtherMembers(
  tenantId: string,
  excludeMemberId: string,
  identities: IMemberIdentity[],
): Promise<IMemberIdentity[]> {
  try {
    const db = svc.postgres.reader;
    return await getIdentitiesExistInOthers(db, tenantId, excludeMemberId, identities);
  } catch (err) {
    const error = new MemberEnrichmentError(
      `Failed to get existing identities for tenant ${tenantId}`,
      err instanceof Error ? err : undefined
    );
    svc.logger.error(error);
    throw error;
  }
}

Also applies to: 28-43


Line range hint 1-71: Address potential security concerns

Several security considerations should be addressed:

  1. Add input validation for member IDs and tenant IDs to prevent injection
  2. Avoid exposing sensitive data in error messages
  3. Implement rate limiting for merge operations
  4. Add request validation middleware

Consider implementing these security measures at the service layer to ensure consistent protection across all operations. This could include:

  1. Input validation middleware
  2. Rate limiting middleware
  3. Secure error handling that sanitizes sensitive data
  4. Audit logging for all member operations

Would you like me to provide specific implementation examples for any of these security measures?

backend/src/database/migrations/V1731052735__llm-prompt-history.sql (1)

1-13: Consider adding additional constraints for data integrity

While the table structure is generally good, consider these improvements:

  1. Add check constraints for numeric fields to ensure valid ranges:
    • inputTokenCount and outputTokenCount should be positive
    • responseTimeSeconds should be positive
  2. Specify precision for the decimal field responseTimeSeconds
  3. Consider adding a check constraint for valid type values
 create table "llmPromptHistory" (
     id                    bigserial primary key,
-    type                  varchar(255) not null,
+    type                  varchar(255) not null check (type in ('organization_merge_suggestion', 'member_merge_suggestion')),
     model                 text         not null,
     "entityId"            text         null,
     metadata              jsonb        null,
     prompt                text         not null,
     answer                text         not null,
-    "inputTokenCount"     int          not null,
-    "outputTokenCount"    int          not null,
-    "responseTimeSeconds" decimal      not null,
+    "inputTokenCount"     int          not null check ("inputTokenCount" > 0),
+    "outputTokenCount"    int          not null check ("outputTokenCount" > 0),
+    "responseTimeSeconds" decimal(10,3) not null check ("responseTimeSeconds" > 0),
     "createdAt"           timestamptz  not null default now()
 );
services/apps/merge_suggestions_worker/src/workflows/testMergingEntitiesWithLLM.ts (2)

71-76: Consider refactoring to reduce code duplication

The organization merging logic closely mirrors the member merging logic. Consider extracting the common pattern into a reusable function.

+async function processEntityMerge<T>(
+  entities: T[],
+  queryType: LlmQueryType,
+  entityType: string
+) {
+  if (entities.length !== 2) {
+    console.log(`Failed getting ${entityType} data in suggestion. Skipping suggestion.`);
+    return;
+  }
+
+  const res = await commonActivitiesProxy.getLLMResult(queryType, entities);
+  console.log(`Raw res: `);
+  console.log(res.answer);
+  return res;
+}

 // Then use it in both places:
-const res = await commonActivitiesProxy.getLLMResult(
-  LlmQueryType.ORGANIZATION_MERGE,
-  organizations,
-);
-console.log(`Raw res: `);
-console.log(res.answer);
+const res = await processEntityMerge(
+  organizations,
+  LlmQueryType.ORGANIZATION_MERGE,
+  'organization'
+);

Line range hint 80-84: Consider enhancing token counting metrics

The token counting logic could be enhanced to provide more detailed metrics:

+interface TokenMetrics {
+  totalTokens: number;
+  averageTokens: number;
+  maxTokens: number;
+  minTokens: number;
+  promptCount: number;
+}
+
+function calculateTokenMetrics(tokenCounts: number[]): TokenMetrics {
+  const total = tokenCounts.reduce((sum, count) => sum + count, 0);
+  return {
+    totalTokens: total,
+    averageTokens: Math.floor(total / tokenCounts.length),
+    maxTokens: Math.max(...tokenCounts),
+    minTokens: Math.min(...tokenCounts),
+    promptCount: tokenCounts.length
+  };
+}

-console.log(`Total input token count: ${totalInputTokenCount}`);
-console.log(
-  `Average input token count per prompt: ${Math.floor(totalInputTokenCount / promptCount)}`,
-);
+const metrics = calculateTokenMetrics(tokenCounts);
+console.log('Token usage metrics:', metrics);
services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/githubIdentities.ts (4)

Line range hint 12-24: Improve error handling to preserve context

The current error handling loses the original error context. Consider using a more descriptive error message and preserving the error chain.

   try {
     const db = svc.postgres.reader
     rows = await getGithubIdentities(db, limit, afterId, afterUsername)
   } catch (err) {
-    throw new Error(err)
+    throw new Error(`Failed to fetch GitHub identities: ${err.message}`, { cause: err })
   }

Line range hint 44-67: Add environment variable validation and error logging

The function accesses environment variables without validation and silently removes invalid tokens.

   const cache = new RedisCache(`lfx-auth0`, svc.redis, svc.log)
+  // renamed to avoid clashing with the `tokens` array fetched from the rotator below
+  const patTokens = process.env['CROWD_GITHUB_PERSONAL_ACCESS_TOKENS']
+  if (!patTokens) {
+    throw new Error('CROWD_GITHUB_PERSONAL_ACCESS_TOKENS environment variable is not set')
+  }
   const tokenRotator = new GithubTokenRotator(
     cache,
-    process.env['CROWD_GITHUB_PERSONAL_ACCESS_TOKENS'].split(','),
+    patTokens.split(','),
   )

   const tokens = await tokenRotator.getAllTokens()

   for (const token of tokens) {
     try {
       await tokenRotator.updateRateLimitInfoFromApi(token, GithubAPIResource.CORE)
     } catch (e) {
       // something is wrong with the token, remove it from the list
+      svc.log.warn(`Removing invalid GitHub token: ${e.message}`)
       tokenRotator.removeToken(token)
     }
   }

Line range hint 69-106: Enhance API interaction robustness and maintainability

Several improvements can be made to make the code more maintainable and reliable:

  1. Extract API URL to a constant
  2. Use constants for status codes
  3. Consistent error handling
  4. Add request timeout
+const GITHUB_API_BASE_URL = 'https://api.github.com'
+const RATE_LIMIT_STATUS = 429
+const API_TIMEOUT_MS = 5000

 export async function findGithubSourceId(username: string): Promise<string | null> {
   try {
     const cache = new RedisCache(`lfx-auth0`, svc.redis, svc.log)
     const tokenRotator = new GithubTokenRotator(
       cache,
       process.env['CROWD_GITHUB_PERSONAL_ACCESS_TOKENS'].split(','),
     )

     const token = await tokenRotator.getToken()

-    const response = await fetch(`https://api.github.com/users/${username}`, {
+    const controller = new AbortController()
+    const timeout = setTimeout(() => controller.abort(), API_TIMEOUT_MS)
+    
+    const response = await fetch(`${GITHUB_API_BASE_URL}/users/${username}`, {
       headers: {
         Authorization: `Bearer ${token}`,
       },
+      signal: controller.signal,
     })
+    clearTimeout(timeout)
+    
     if (!response.ok) {
-      console.error('HTTP error', response.status)
-      if (response.status === 429) {
-        throw new Error('Rate limit exceeded')
+      svc.log.error('GitHub API error', { status: response.status, username })
+      if (response.status === RATE_LIMIT_STATUS) {
+        throw new Error('GitHub API rate limit exceeded')
       }
       return null
     }

Line range hint 1-106: Consider architectural improvements for better maintainability

The file handles multiple concerns that could be better separated:

  1. Token management could be moved to a dedicated service
  2. API interactions could be abstracted into a GitHub API client class
  3. Configuration could be centralized using a config service

This separation would improve testability and maintenance.

Consider restructuring the code into these modules:

// githubApiClient.ts
class GitHubApiClient {
  constructor(private tokenRotator: GithubTokenRotator) {}
  async getUser(username: string): Promise<IGithubUser>
}

// tokenService.ts
class TokenService {
  constructor(private cache: RedisCache) {}
  async rotateTokens(): Promise<void>
}

// config.ts
interface Config {
  github: {
    apiUrl: string
    tokens: string[]
    timeout: number
  }
}
services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (1)

41-42: Consider optimizing cache lookup for single source case

The change from single to batch lookup is good for extensibility, but might be less efficient when we know we're only looking up one source. Consider adding a fast path for single source lookups.

-    const caches = await findMemberEnrichmentCache([source], input.id)
-    const cache = caches.find((c) => c.source === source)
+    const cache = sources.length === 1 
+      ? (await findMemberEnrichmentCache([source], input.id))[0]
+      : (await findMemberEnrichmentCache([source], input.id)).find((c) => c.source === source)
services/libs/types/src/members.ts (1)

Line range hint 163-176: Consider adding JSDoc comments to document the interface purpose.

The addition of the id field is correct. However, since this interface appears to be a transformed version of ILLMConsumableMemberDbResult, it would be helpful to document the relationship and purpose.

Add documentation like this:

+/**
+ * Represents a cleaned/transformed version of ILLMConsumableMemberDbResult
+ * for LLM consumption with simplified identity fields.
+ */
export interface ILLMConsumableMember {
  id: string
  // ... rest of the interface
services/archetypes/worker/src/index.ts (1)

233-246: Avoid hardcoded paths in worker configuration

The hardcoded paths './src/workflows' and './src/activities' might break in different environments. Consider making these configurable.

+const WORKFLOW_PATH = process.env['CROWD_WORKFLOW_PATH'] || './src/workflows'
+const ACTIVITIES_PATH = process.env['CROWD_ACTIVITIES_PATH'] || './src/activities'

 const workflowBundle = await bundleWorkflowCode({
-  workflowsPath: path.resolve('./src/workflows'),
+  workflowsPath: path.resolve(WORKFLOW_PATH),
 })

 this._worker = await TemporalWorker.create({
   // ... other config
-  activities: require(path.resolve('./src/activities')),
+  activities: require(path.resolve(ACTIVITIES_PATH)),
   // ... rest of config
 })
services/apps/merge_suggestions_worker/src/activities/common.ts (2)

Line range hint 77-81: Use the logging service instead of console.log in mergeMembers

In the catch block of mergeMembers, consider using the application's logging service (svc.log) instead of console.log for consistent logging and better log management.

Also, correct the typo in the error message.

Apply this diff to update the logging:

} catch (error) {
-  console.log(`Failed merging member wit status [${error.response.status}]. Skipping!`)
+  svc.log.error(`Failed merging member with status [${error.response.status}]. Skipping!`, error)
}

Line range hint 100-104: Use the logging service instead of console.log in mergeOrganizations

Similarly, in the mergeOrganizations function, replace console.log with svc.log.error for consistent logging.

Apply this diff to update the logging:

} catch (error) {
-  console.log(`Failed merging organization with status [${error.response.status}]. Skipping!`)
+  svc.log.error(`Failed merging organization with status [${error.response.status}]. Skipping!`, error)
}
services/libs/common_services/src/services/llm.service.ts (1)

29-191: Consider adding unit tests for the LlmService methods

The LlmService class is central to your application's interaction with LLMs. Adding unit tests for its methods—queryLlm, consolidateMemberEnrichmentData, and mergeSuggestion—will enhance code reliability and facilitate future maintenance. Unit tests can help catch bugs early and ensure that changes do not introduce regressions.

Would you like assistance in setting up unit tests for this service? I can help outline test cases or provide example implementations.
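
As a starting point, a minimal Jest-style sketch — the construction shortcut and the mergeSuggestion signature below are assumptions drawn from this review, and it presumes mergeSuggestion only touches queryLlm internally:

// llm.service.spec.ts — hedged sketch, not the verified API.
import { LlmService } from '@crowd/common_services'
import { LlmQueryType } from '@crowd/types'

describe('LlmService.mergeSuggestion', () => {
  it('parses the model verdict from the raw answer', async () => {
    // Skip the real constructor (Bedrock client, DB store) for a pure unit test.
    const service = Object.create(LlmService.prototype) as LlmService

    // Stub the low-level Bedrock call so the test stays offline.
    ;(service as any).queryLlm = jest.fn().mockResolvedValue({
      prompt: 'stubbed',
      answer: '{"verdict": true, "confidence": 0.9}',
      inputTokenCount: 120,
      outputTokenCount: 15,
      responseTimeSeconds: 0.4,
    })

    const suggestion = [{ id: 'a' }, { id: 'b' }] as any
    const result = await service.mergeSuggestion(LlmQueryType.MEMBER_MERGE, suggestion)

    // Assumes ILlmResult<T> exposes the parsed answer under `result`.
    expect((result as any).result.verdict).toBe(true)
  })
})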

services/libs/data-access-layer/src/old/apps/merge_suggestions_worker/memberMergeSuggestions.repo.ts (1)

262-272: Improve query performance by indexing JSONB fields used in WHERE clauses

Accessing JSONB fields in the WHERE clause, such as lsv.metadata ->> 'secondaryId', can lead to performance issues if the appropriate indexes are not in place. To optimize query performance, consider adding an index on the metadata field for the 'secondaryId' key.

Example of creating an index:

CREATE INDEX idx_llmPromptHistory_secondaryId ON "llmPromptHistory" ((metadata ->> 'secondaryId'));

This index will help expedite queries filtering on metadata ->> 'secondaryId'.

services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (6)

Line range hint 4-11: Consider updating import paths from 'old' directory

The functions are being imported from @crowd/data-access-layer/src/old/apps/premium/members_enrichment_worker. Importing from an 'old' directory suggests reliance on deprecated or outdated code. Consider updating to the latest versions or refactoring the code to use the current data access layer.


148-148: Consider using a more descriptive variable name

The variable toBeSquashed may benefit from a more descriptive name, such as normalizedDataBySource, to improve code readability and maintainability.


154-154: Simplify condition using optional chaining

The condition if (cache && cache.data) can be simplified using optional chaining:

-if (cache && cache.data) {
+if (cache?.data) {
🧰 Tools
🪛 Biome

[error] 154-154: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)


160-174: Remove commented-out code to improve readability

There is a large block of commented-out code which may reduce readability. If this code is no longer needed, consider removing it or preserving it through version control instead of commenting it out.


184-184: Review the TODO comment regarding LLM data squasher

The TODO comment indicates that the data squasher using LLM needs implementation, but the code appears to instantiate and use llmService.consolidateMemberEnrichmentData. Please update or remove the TODO comment to reflect the current state of the code.


195-305: Consider externalizing the LLM prompt for maintainability

The LLM prompt is extensive and embedded directly in the code. Consider moving it to a separate file or using a template system to improve readability and maintainability.
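
One lightweight option is a small template module, sketched here with hypothetical names; the real prompt text and placeholders would be lifted from the existing inline string:

// prompts/memberEnrichment.prompt.ts — hypothetical module; the placeholder
// variables are illustrative, not the actual prompt inputs.
export const buildMemberEnrichmentPrompt = (memberJson: string, enrichedJson: string): string =>
  `
You are consolidating member profile data.

Current member:
${memberJson}

New enrichment data, grouped by source:
${enrichedJson}

Return one consolidated JSON object.
`.trim()

The activity would then import buildMemberEnrichmentPrompt instead of carrying the string inline, which also makes prompt changes reviewable as isolated diffs.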

services/apps/merge_suggestions_worker/src/activities/organizationMergeSuggestions.ts (1)

Line range hint 361-373: Refactor attribute extraction to improve readability and maintainability

The current implementation repeats the pattern of filtering and mapping attributes to extract values. Consider abstracting this logic into helper functions to reduce redundancy and enhance code clarity.

Here's how you could refactor the code:

First, define helper functions:

function getAttributeValue(attributes: any[], name: string): string {
  return attributes.find((a) => a.name === name)?.value || '';
}

function getAttributeValues(attributes: any[], name: string): string[] {
  return attributes.filter((a) => a.name === name).map((a) => a.value);
}

Then, update the return statement:

 return {
     id: base.id,
     displayName: base.displayName,
     description: base.description,
-    phoneNumbers: attributes.filter((a) => a.name === 'phoneNumber').map((a) => a.value),
+    phoneNumbers: getAttributeValues(attributes, 'phoneNumber'),
     logo: base.logo,
     tags: base.tags,
     location: base.location,
     type: base.type,
-    geoLocation: attributes.find((a) => a.name === 'geoLocation')?.value || '',
-    ticker: attributes.find((a) => a.name === 'ticker')?.value || '',
-    profiles: attributes.filter((a) => a.name === 'profile').map((a) => a.value),
     headline: base.headline,
     industry: base.industry,
     founded: base.founded,
-    alternativeNames: attributes.filter((a) => a.name === 'alternativeName').map((a) => a.value),
+    geoLocation: getAttributeValue(attributes, 'geoLocation'),
+    ticker: getAttributeValue(attributes, 'ticker'),
+    profiles: getAttributeValues(attributes, 'profile'),
+    alternativeNames: getAttributeValues(attributes, 'alternativeName'),
     identities: identities.map((i) => ({
       platform: i.platform,
       value: i.value,
     })),
 }

This refactoring reduces code duplication and makes the code easier to maintain.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 730b954 and f4f3784.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (45)
  • backend/src/database/migrations/V1731052735__llm-prompt-history.sql (1 hunks)
  • backend/src/database/migrations/V1731484783__drop-member-deprecated-columns.sql (1 hunks)
  • services/apps/merge_suggestions_worker/package.json (1 hunks)
  • services/apps/merge_suggestions_worker/src/activities.ts (1 hunks)
  • services/apps/merge_suggestions_worker/src/activities/common.ts (2 hunks)
  • services/apps/merge_suggestions_worker/src/activities/memberMergeSuggestions.ts (2 hunks)
  • services/apps/merge_suggestions_worker/src/activities/organizationMergeSuggestions.ts (1 hunks)
  • services/apps/merge_suggestions_worker/src/types.ts (0 hunks)
  • services/apps/merge_suggestions_worker/src/workflows/mergeMembersWithLLM.ts (2 hunks)
  • services/apps/merge_suggestions_worker/src/workflows/mergeOrganizationsWithLLM.ts (2 hunks)
  • services/apps/merge_suggestions_worker/src/workflows/testMergingEntitiesWithLLM.ts (3 hunks)
  • services/apps/premium/members_enrichment_worker/package.json (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (4 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/getMembers.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/authenticateLFAuth0.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/enrichLFAuth0.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getEnrichmentLFAuth0.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getLFIDEnrichableMembers.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/githubIdentities.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/syncEnrichedData.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/main.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/service.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/sources/crustdata/service.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (3 hunks)
  • services/archetypes/worker/src/index.ts (2 hunks)
  • services/libs/common_services/package.json (1 hunks)
  • services/libs/common_services/src/repos/llmPromptHistory.repo.ts (1 hunks)
  • services/libs/common_services/src/services/index.ts (1 hunks)
  • services/libs/common_services/src/services/llm.service.ts (1 hunks)
  • services/libs/data-access-layer/src/old/apps/merge_suggestions_worker/llmSuggestionVerdicts.repo.ts (0 hunks)
  • services/libs/data-access-layer/src/old/apps/merge_suggestions_worker/memberMergeSuggestions.repo.ts (3 hunks)
  • services/libs/data-access-layer/src/old/apps/merge_suggestions_worker/organizationMergeSuggestions.repo.ts (4 hunks)
  • services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (4 hunks)
  • services/libs/types/src/enums/activities.ts (0 hunks)
  • services/libs/types/src/enums/index.ts (1 hunks)
  • services/libs/types/src/enums/llm.ts (1 hunks)
  • services/libs/types/src/enums/merging.ts (0 hunks)
  • services/libs/types/src/index.ts (1 hunks)
  • services/libs/types/src/llm.ts (1 hunks)
  • services/libs/types/src/members.ts (2 hunks)
  • services/libs/types/src/merging.ts (0 hunks)
  • services/libs/types/src/organizations.ts (1 hunks)
  • services/libs/types/src/premium/enrichment.ts (1 hunks)
💤 Files with no reviewable changes (5)
  • services/apps/merge_suggestions_worker/src/types.ts
  • services/libs/data-access-layer/src/old/apps/merge_suggestions_worker/llmSuggestionVerdicts.repo.ts
  • services/libs/types/src/enums/activities.ts
  • services/libs/types/src/enums/merging.ts
  • services/libs/types/src/merging.ts
✅ Files skipped from review due to trivial changes (6)
  • backend/src/database/migrations/V1731484783__drop-member-deprecated-columns.sql
  • services/apps/premium/members_enrichment_worker/src/activities/getMembers.ts
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/authenticateLFAuth0.ts
  • services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getEnrichmentLFAuth0.ts
  • services/apps/premium/members_enrichment_worker/src/activities/syncEnrichedData.ts
  • services/apps/premium/members_enrichment_worker/src/service.ts
🧰 Additional context used
🪛 Biome
services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

[error] 154-154: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)

services/libs/common_services/src/repos/llmPromptHistory.repo.ts

[error] 6-8: This constructor is unnecessary.

Unsafe fix: Remove the unnecessary constructor.

(lint/complexity/noUselessConstructor)

🔇 Additional comments (49)
services/libs/common_services/src/services/index.ts (1)

2-2: LGTM! Verify the referenced service file.

The export statement follows the established barrel file pattern and correctly exposes the LLM service.

Let's verify the existence and structure of the referenced LLM service:

✅ Verification successful

Export and implementation of LlmService verified successfully

The LlmService class is properly implemented and exported:

  • Extends LoggerBase with correct AWS Bedrock integration
  • Implements core LLM functionality (queryLlm, consolidateMemberEnrichmentData, mergeSuggestion)
  • Contains necessary type definitions and configurations
  • Includes proper error handling and logging
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and structure of the LLM service file

# Check if the service file exists and show its basic structure
fd --type f "llm.service.ts" --exec cat {} \;

# Verify exports in the service file
ast-grep --pattern 'export class LlmService extends $_'

Length of output: 8321

services/libs/types/src/enums/llm.ts (1)

6-10: LGTM! Verify the enum usage across the codebase.

The LlmQueryType enum is well-structured with clear, consistent naming. Let's verify its integration with other components.

✅ Verification successful

Enum usage is properly integrated across the codebase

The verification shows consistent usage of LlmQueryType enum:

  • Correctly used in LLM service configuration (services/libs/types/src/llm.ts)
  • Properly integrated in merge suggestion workers for both members and organizations
  • Correctly utilized in LLM service implementation (services/libs/common_services/src/services/llm.service.ts)
  • String literals only appear in the enum definition and a database migration file where they're expected
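
Putting the verified pieces together, the enum plausibly reads as below; the string literals come from the searches above, while the pairing of each member with its literal is an inference, not quoted source:

// services/libs/types/src/enums/llm.ts — reconstructed sketch
export enum LlmQueryType {
  MEMBER_MERGE = 'member_merge_suggestion',
  ORGANIZATION_MERGE = 'organization_merge_suggestion',
  MEMBER_ENRICHMENT = 'member_enrichment',
}
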
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of LlmQueryType enum across the codebase
# Expected: Should find references in merge suggestion workers and member enrichment code

# Search for direct enum usage
echo "=== Direct enum usage ==="
rg "LlmQueryType\." -A 2

# Search for string literals to ensure we haven't missed any hardcoded values
echo -e "\n=== Potential hardcoded values ==="
rg "'member_merge_suggestion'|'organization_merge_suggestion'|'member_enrichment'"

Length of output: 7015

services/apps/premium/members_enrichment_worker/src/main.ts (1)

4-4: Verify consistent service module usage across the codebase

Let's ensure that all related files are using the new service module correctly.

✅ Verification successful

Let me gather more information about the service module usage to ensure a complete verification.


Service module import and usage is consistent

The service module is properly defined in service.ts and correctly imported in main.ts. The exported svc instance is consistently used across the codebase for accessing various services (redis, postgres, opensearch, etc.) and logging. No issues were found with the service module implementation or its usage.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent service module imports and potential missing updates

# Search for any remaining direct config/options usage that should use the service module
rg -g '!*.{json,md}' -l "const config.*=.*Config" "services/apps/premium/members_enrichment_worker"
rg -g '!*.{json,md}' -l "const options.*=.*Options" "services/apps/premium/members_enrichment_worker"

# Verify service module imports in related files
rg -g '!*.{json,md}' "import.*from.*['\"]./service['\"]" "services/apps/premium/members_enrichment_worker"

Length of output: 404


Script:

#!/bin/bash
# Check the content of service.ts and look for any exports
cat services/apps/premium/members_enrichment_worker/src/service.ts

# Look for any usage of 'svc' in the codebase
rg "svc\." services/apps/premium/members_enrichment_worker/src/

Length of output: 8593

services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/getLFIDEnrichableMembers.ts (1)

4-4: Verify the new service import path

The import path has been changed from ../../main to ../../service. Let's verify this change is consistent across the codebase.

✅ Verification successful

Based on the shell script results, I can see that:

  1. The new service.ts file exists at services/apps/premium/members_enrichment_worker/src/service.ts
  2. There are no remaining imports from the old path (../../main)
  3. Multiple files are consistently using the new import path (../../service)
  4. The service.ts file exports the svc instance that's being imported

Therefore, I can generate the final response:

Import path change is correctly implemented

The change from ../../main to ../../service is consistent across the codebase, with the new service.ts file properly exporting the required svc instance. All related files have been updated to use the new import path.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the service import path change and its consistency

# Check if the new service file exists
fd --type f "service.ts" --parent-dir "services/apps/premium/members_enrichment_worker/src"

# Check for any remaining imports from the old path
rg "from '../../main'" "services/apps/premium/members_enrichment_worker/src"

Length of output: 441


Script:

#!/bin/bash
# Description: Verify the service import path change and its consistency

# Check if the new service file exists
fd "service.ts" services/apps/premium/members_enrichment_worker/src

# Check for any remaining imports from the old path
rg "from '../../main'" services/apps/premium/members_enrichment_worker/src

# Check for imports from the new path
rg "from '../../service'" services/apps/premium/members_enrichment_worker/src

# Look at the content of service.ts if it exists
fd "service.ts" services/apps/premium/members_enrichment_worker/src -x cat {}

Length of output: 41985

services/libs/types/src/enums/index.ts (1)

20-20: LGTM! Verify module existence and breaking changes.

The new export statement follows the existing pattern and maintains alphabetical order.

Let's verify the module existence and check for any breaking changes:

✅ Verification successful

Export statement is valid and safe to use

The verification confirms:

  • The llm.ts module exists and contains two new enums: LlmModelType and LlmQueryType
  • The old LLMSuggestionVerdictType enum has been completely removed from merging.ts
  • No references to LLMSuggestionVerdictType exist in the codebase, indicating a clean removal
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of llm.ts and check for potential breaking changes

# Check if llm.ts exists in the expected location
fd -t f "llm.ts" services/libs/types/src/enums

# Check for any remaining references to LLMSuggestionVerdictType that might be affected
rg "LLMSuggestionVerdictType" -A 5

Length of output: 118


Script:

#!/bin/bash
# Let's check the content of llm.ts and look for any references to the old enum
cat services/libs/types/src/enums/llm.ts

# Let's also check the content of merging.ts to verify the removal
fd -t f "merging.ts" services/libs/types/src/enums
cat services/libs/types/src/enums/merging.ts

# Let's search for any potential usages of LLMSuggestionVerdictType in the entire codebase
# Using -w for word boundary to ensure exact matches
rg -w "LLMSuggestionVerdictType"

Length of output: 1257

services/libs/common_services/package.json (1)

15-15: Verify AWS SDK version compatibility and security.

Let's ensure the AWS SDK version is secure and compatible.

✅ Verification successful

AWS SDK version is secure but newer versions available

The version 3.572.0 has no known vulnerabilities, but it's several versions behind the latest (3.691.0). Since the caret (^) is used, it will accept compatible updates which is reasonable for AWS SDK clients as they maintain backward compatibility within major versions.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for latest versions and security advisories for @aws-sdk/client-bedrock-runtime

# Check npm registry for latest versions
echo "Latest version information:"
npm view @aws-sdk/client-bedrock-runtime versions --json

# Check for known vulnerabilities
echo -e "\nVulnerability scan:"
npm audit --json @aws-sdk/client-bedrock-runtime@3.572.0

Length of output: 2236

services/libs/types/src/index.ts (2)

63-64: LGTM! Export statement follows consistent pattern.

The new export for the LLM module is correctly placed and follows the established pattern in the barrel file.


64-64: Verify LLM types implementation.

Let's ensure that all necessary LLM-related types are properly defined and exported from the llm module.

✅ Verification successful

Based on the shell script results, I can now generate the final response since I have sufficient information about the LLM types implementation.

LLM types are properly implemented and being used correctly

The LLM module exports well-defined interfaces and enums that cover:

  • Core interfaces: ILlmResponse, ILlmResult, ILlmSettings, ILlmHistoryEntry
  • Model configuration: LLM_MODEL_REGION_MAP, LLM_MODEL_PRICING_MAP, LLM_SETTINGS
  • Enums: LlmModelType, LlmQueryType

The types are being correctly imported and used in multiple workflow files for member and organization merging operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of LLM types and their exports

# Check the LLM types file structure and content
fd -t f "llm.ts" -X cat

# Find all files importing LLM types to understand usage patterns
rg -l "import.*from.*['\"].*types.*llm['\"]"

# Check for any potential type-related issues in files using LLM types
ast-grep --pattern 'import { $_ } from "@crowd/types"'

Length of output: 8579

services/libs/common_services/src/repos/llmPromptHistory.repo.ts (1)

1-5: LGTM! Clean imports and proper class declaration.

The imports are well-organized and the class properly extends RepositoryBase with correct type parameters.

services/apps/merge_suggestions_worker/src/activities.ts (2)

Line range hint 1-40: LGTM! Clean organization of imports and exports.

The imports and exports are well-organized into logical groups (members, organizations, LLM-related functions) and maintain good symmetry.


1-1: Verify the removal of saveLLMVerdict.

The removal of saveLLMVerdict from imports appears to be part of the architectural changes involving the new llmPromptHistory table. Let's verify there are no remaining references to this function.

✅ Verification successful

The removal of saveLLMVerdict import is safe

No references to saveLLMVerdict were found in the TypeScript codebase, confirming that its removal from the imports is appropriate and won't cause any issues.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to saveLLMVerdict that might need updating

# Search for any remaining references to saveLLMVerdict
echo "Searching for remaining references to saveLLMVerdict..."
rg "saveLLMVerdict" --type ts

# Search for any TODO or FIXME comments related to LLM verdicts
echo "Checking for related TODO/FIXME comments..."
rg "TODO.*LLM.*verdict" --type ts
rg "FIXME.*LLM.*verdict" --type ts

# Look for potential calls to this function that might need updating
echo "Checking for potential function calls..."
ast-grep --pattern 'saveLLMVerdict($_)'

Length of output: 1066

services/apps/merge_suggestions_worker/src/workflows/mergeMembersWithLLM.ts (1)

3-3: Verify consistent usage of the new LlmQueryType across the codebase.

The migration from LLMSuggestionVerdictType to LlmQueryType suggests an API redesign. Let's ensure this change is consistently applied.

Also applies to: 7-7

✅ Verification successful

Based on the results, I can now generate the final response:

The migration from LLMSuggestionVerdictType to LlmQueryType is consistently implemented

The verification shows:

  • No remaining references to the old LLMSuggestionVerdictType
  • Consistent usage of LlmQueryType.MEMBER_MERGE across the codebase:
    • Defined in types (services/libs/types/src/llm.ts)
    • Used in data access layer
    • Properly integrated in LLM service
    • Correctly used in merge suggestion workflows
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining usage of the old type and verify new type adoption

# Check for any remaining references to the old type
echo "Checking for remaining LLMSuggestionVerdictType references:"
rg "LLMSuggestionVerdictType"

# Verify the usage pattern of the new type
echo -e "\nVerifying LlmQueryType usage pattern:"
rg "LlmQueryType\.MEMBER_MERGE" -A 2

Length of output: 3164

services/apps/merge_suggestions_worker/package.json (1)

19-19: LGTM! Verify workspace dependency.

The addition of @crowd/common_services as a workspace dependency aligns with the integration of LLM services.

Let's verify the dependency exists in the workspace:

✅ Verification successful

Dependency verified and properly integrated

The @crowd/common_services package:

  • Exists as a workspace package at services/libs/common_services
  • Contains the expected LLM service implementation at src/services/llm.service.ts
  • Is properly referenced by multiple services including the merge suggestions worker
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the @crowd/common_services package exists in the workspace
# and contains the expected LLM service implementation

# Check if the package exists in the workspace
fd -t f "package.json" -x grep -l "@crowd/common_services"

# Verify LLM service implementation
fd -t f "llm.service.ts" services/libs/common_services

Length of output: 617

services/apps/premium/members_enrichment_worker/src/activities.ts (1)

9-9: LGTM! Verify the implementation of processMemberSources

The import is properly added and maintains consistency with the module's structure.

Let's verify the implementation of the new function:

✅ Verification successful

Implementation of processMemberSources is verified and working as expected

The implementation is found in services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts and is properly integrated:

  • The function is correctly implemented with proper error handling and logging
  • It processes multiple enrichment sources for a member by:
    1. Retrieving existing enrichment caches for the member
    2. Normalizing and consolidating data from different sources
    3. Using LLM service to squash/merge data when multiple sources exist
  • The function is properly imported and exported in the activities module
  • It's correctly used in the enrichment workflow (enrichMember.ts)
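
In outline, the verified flow corresponds to a sketch like the following; the function and type names are taken from this review, while import paths, exact signatures, and bodies are simplified assumptions:

// Simplified outline of processMemberSources — not the actual implementation.
import { LlmService } from '@crowd/common_services'
import { IMemberEnrichmentDataNormalized, MemberEnrichmentSource } from '@crowd/types'

import { findMemberEnrichmentCache, normalizeEnrichmentData } from './enrichment'

export async function processMemberSourcesOutline(
  llmService: LlmService,
  memberId: string,
  sources: MemberEnrichmentSource[],
): Promise<void> {
  // 1. Pull whatever cached raw payloads exist for this member.
  const caches = await findMemberEnrichmentCache(sources, memberId)

  // 2. Normalize each source's payload into the shared shape.
  const normalized: IMemberEnrichmentDataNormalized[] = []
  for (const cache of caches) {
    if (cache?.data) {
      normalized.push(await normalizeEnrichmentData(cache.source, cache.data))
    }
  }

  // 3. With more than one source, let the LLM consolidate conflicting fields.
  if (normalized.length > 1) {
    await llmService.consolidateMemberEnrichmentData(memberId, normalized)
  }
}
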
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of processMemberSources
# Expected: Find the implementation and its test coverage

# Find the implementation
echo "Looking for processMemberSources implementation:"
ast-grep --pattern 'function processMemberSources' -A 10

# Check for test coverage
echo -e "\nLooking for corresponding tests:"
fd -e test.ts -e spec.ts | xargs rg "processMemberSources"

Length of output: 305


Script:

#!/bin/bash
# Let's try a different approach to find the implementation
# First, let's find the enrichment file that should contain the implementation
echo "Looking for the enrichment file:"
fd enrichment.ts

# Then let's examine its content
echo -e "\nChecking the content of enrichment files:"
fd enrichment.ts --exec cat {}

# Let's also try searching for any file containing processMemberSources
echo -e "\nSearching for files containing processMemberSources:"
rg -l "processMemberSources"

# And get the actual implementation with context
echo -e "\nGetting implementation context:"
rg -A 10 "processMemberSources"

Length of output: 37359

services/apps/merge_suggestions_worker/src/workflows/mergeOrganizationsWithLLM.ts (4)

Line range hint 22-63: LGTM! Robust workflow implementation

The workflow maintains a solid structure with:

  • Proper batch processing using SUGGESTIONS_PER_RUN
  • Good error handling for missing organization data
  • Correct usage of Temporal's continueAsNew pattern for processing large datasets

Line range hint 54-59: Verify verdict persistence mechanism

The removal of saveLLMVerdict suggests a change in how verdicts are persisted. Please ensure that verdict tracking/auditing is handled appropriately in the new implementation.

#!/bin/bash
# Description: Verify how verdicts are now persisted

echo "Checking for new verdict persistence mechanisms..."
rg -A 5 "saveLLMVerdict|llmPromptHistory|storeLLMResult"

echo "Checking getLLMResult implementation..."
ast-grep --pattern 'getLLMResult($_)'

49-51: Verify LLM configuration management

The removal of hardcoded model configuration (REGION, MODEL_ID, etc.) suggests these are now managed elsewhere.

#!/bin/bash
# Description: Verify where LLM configuration is now managed

echo "Searching for LLM configuration..."
rg -A 5 "MODEL_ID|MODEL_ARGS|REGION" 

echo "Checking LlmQueryType.ORGANIZATION_MERGE usage..."
ast-grep --pattern 'LlmQueryType.ORGANIZATION_MERGE'

3-3: Verify consistent usage of the new LlmQueryType

The change from LLMSuggestionVerdictType to LlmQueryType suggests a broader refactoring of LLM-related types.

Also applies to: 7-7

✅ Verification successful

Type change is correctly implemented and consistent

The change from LLMSuggestionVerdictType to LlmQueryType is part of a broader refactoring where:

  1. No traces of LLMSuggestionVerdictType remain in the codebase
  2. LlmQueryType is consistently used across all relevant files
  3. The enum is properly defined with all necessary values: MEMBER_MERGE, ORGANIZATION_MERGE, and MEMBER_ENRICHMENT
  4. The implementation in mergeOrganizationsWithLLM.ts correctly uses the new type
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that LLMSuggestionVerdictType is completely replaced with LlmQueryType
# and there are no remaining usages of the old type.

echo "Checking for any remaining usage of LLMSuggestionVerdictType..."
rg "LLMSuggestionVerdictType"

echo "Verifying consistent usage of new LlmQueryType..."
rg "LlmQueryType"

Length of output: 9012


Script:

#!/bin/bash
# Let's search for both types more specifically and include some context

echo "Searching for LLMSuggestionVerdictType..."
rg -A 3 "LLMSuggestionVerdictType" --type typescript

echo "Searching for LlmQueryType usage..."
rg -A 3 "LlmQueryType" --type typescript

echo "Checking type definitions..."
fd -e ts -e d.ts -x cat {} \; -E "node_modules" | rg -A 5 "export (type|enum|interface) (LLMSuggestionVerdictType|LlmQueryType)"

Length of output: 753


Script:

#!/bin/bash
# Let's try without type specification and search the entire codebase

echo "Searching for any remaining LLMSuggestionVerdictType..."
rg -A 3 "LLMSuggestionVerdictType"

echo "Checking all usages of LlmQueryType..."
rg -A 3 "LlmQueryType"

# Let's also check the specific file for the usage context
echo "Checking the specific workflow file..."
cat services/apps/merge_suggestions_worker/src/workflows/mergeOrganizationsWithLLM.ts

Length of output: 15280

services/apps/premium/members_enrichment_worker/package.json (1)

20-20: LGTM: New dependency aligns with PR objectives.

The addition of @crowd/common_services as a workspace dependency is appropriate for the LLM-related functionalities mentioned in the PR summary.

services/libs/types/src/llm.ts (3)

3-28: LGTM! Well-structured interfaces with clear type definitions.

The interfaces are well-designed with appropriate use of TypeScript features:

  • Good use of interface extension for ILlmHistoryEntry
  • Proper generic type parameter in ILlmResult<T>
  • Clear and specific type definitions
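
For readers without the file open, the shapes implied elsewhere in this review are roughly as follows; ILlmResponse's fields come from the llmPromptHistory insert in this PR, while the composition of the other two is inferred:

// Reconstructed sketch, not quoted source (import path assumed).
import { LlmModelType, LlmQueryType } from './enums/llm'

export interface ILlmResponse {
  prompt: string
  answer: string
  inputTokenCount: number
  outputTokenCount: number
  responseTimeSeconds: number
}

// The generic parameter carries the parsed answer alongside the raw response.
export interface ILlmResult<T> extends ILlmResponse {
  result: T
}

// History entries add persistence context on top of the response.
export interface ILlmHistoryEntry extends ILlmResponse {
  type: LlmQueryType
  model: LlmModelType
  entityId?: string
  metadata?: Record<string, unknown>
}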

67-68: Verify the high token limit for member enrichment

The token limit of 200,000 seems unusually high. Please verify if this is intentional and within the model's capabilities.

#!/bin/bash
# Search for any other token limits in the codebase for comparison
rg -g '!{dist,node_modules}' 'max_tokens'

30-33: Verify optimal regions for LLM models

Different regions are specified for different models:

  • CLAUDE_3_OPUS: us-west-2
  • CLAUDE_3_5_SONNET: us-east-1

Consider the impact on latency if your application servers are in different regions.

✅ Verification successful

Let me search for more context about the LLM configuration and usage to understand the application's primary region and LLM integration better.


Let me search for the actual usage of these LLM services to understand the deployment context better.


Based on the gathered information from the executed scripts, I can now provide a final response:

Region configuration appears intentional and aligned with service architecture

The region mapping in LLM_MODEL_REGION_MAP is used by the LLM service (llm.service.ts) to initialize Bedrock clients per region. The different regions for different models appear to be a deliberate design choice:

  • The service maintains a client region map cache
  • Bedrock clients are initialized on-demand for specific regions
  • The application's other AWS services (S3, SQS) are configured independently

This suggests a distributed architecture where LLM services are intentionally accessed in their respective optimal regions, separate from other application services.
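
A per-region client cache of the kind described typically amounts to a few lines; here is a hedged sketch (the map and function names are illustrative, not the service's actual internals):

import { BedrockRuntimeClient } from '@aws-sdk/client-bedrock-runtime'

// Lazily create one Bedrock client per region and reuse it afterwards.
const clientsByRegion = new Map<string, BedrockRuntimeClient>()

function getBedrockClient(region: string): BedrockRuntimeClient {
  let client = clientsByRegion.get(region)
  if (!client) {
    client = new BedrockRuntimeClient({ region })
    clientsByRegion.set(region, client)
  }
  return client
}

This keeps each model call in its mapped region (per LLM_MODEL_REGION_MAP) without paying client construction costs on every request.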

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any AWS region configurations or environment variables
# that might indicate the primary application region

rg -g '!{dist,node_modules}' -i 'AWS_REGION|AWS_DEFAULT_REGION'

Length of output: 1403


Script:

#!/bin/bash
# Search for LLM configuration and AWS client initialization
rg -g '!{dist,node_modules}' -B 3 -A 3 'new Bedrock|new BedrockRuntimeClient'

# Search for any configuration files or environment variables related to LLM services
rg -g '!{dist,node_modules}' -i 'BEDROCK.*REGION|LLM.*REGION'

# Look for any documentation about LLM configuration
fd README.md --exec cat {} \; | rg -i "llm|bedrock|region"

Length of output: 1208


Script:

#!/bin/bash
# Search for LLM service initialization and usage
ast-grep --pattern 'class $_ {
  $$$
  constructor($$$) {
    $$$
    LLM_MODEL_REGION_MAP
    $$$
  }
  $$$
}'

# Search for environment configurations related to service deployment
rg -g '!{dist,node_modules}' -B 2 -A 2 'DEPLOYMENT_REGION|SERVICE_REGION|APP_REGION'

# Look for deployment configuration files
fd -e yaml -e yml --exec cat {} \; | rg -i 'region:|aws_region'

Length of output: 269

services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/enrichLFAuth0.ts (1)

8-8: Verify the service layer restructuring

The import path has been updated from ../../main to ../../service, which appears to be part of a service layer restructuring.

Let's verify the consistency of this change across the codebase:

✅ Verification successful

Based on the shell script results, I can generate the final response as we have sufficient information:

Import path change is consistent and valid

The verification shows:

  • No remaining imports from the old ../../main path exist
  • The new service.ts file exists at the correct location
  • The new import pattern is consistently used across multiple files in the same directory:
    • authenticateLFAuth0.ts
    • getEnrichmentLFAuth0.ts
    • getLFIDEnrichableMembers.ts
    • githubIdentities.ts
    • enrichLFAuth0.ts
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining imports from the old path and verify the new path exists

# Check if there are any remaining imports from the old path
echo "Checking for remaining imports from old path..."
rg "from '../../main'" services/apps/premium/members_enrichment_worker/

# Verify the new service file exists
echo "Verifying new service file exists..."
fd "service.ts" services/apps/premium/members_enrichment_worker/

# Check for consistency in import patterns
echo "Checking import patterns across the codebase..."
rg "from '../../service'" services/apps/premium/members_enrichment_worker/

Length of output: 1645

services/apps/premium/members_enrichment_worker/src/schedules/getMembersToEnrich.ts (3)

Line range hint 8-39: LGTM! The schedule configuration is well-structured

The schedule configuration for member enrichment is appropriate with:

  • Daily execution at 6 AM
  • Skip policy for overlapping runs
  • Reasonable retry policy with exponential backoff
  • Proper error handling for existing schedules
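
For reference, a schedule with these properties looks roughly like the sketch below using Temporal's TypeScript client; the schedule ID, workflow type, and task queue are placeholders, not the repository's actual values:

import { Client, ScheduleAlreadyRunning, ScheduleOverlapPolicy } from '@temporalio/client'

async function createEnrichmentSchedule(client: Client): Promise<void> {
  try {
    await client.schedule.create({
      scheduleId: 'members-enrichment',
      spec: { cronExpressions: ['0 6 * * *'] }, // daily at 6 AM
      policies: { overlap: ScheduleOverlapPolicy.SKIP }, // skip overlapping runs
      action: {
        type: 'startWorkflow',
        workflowType: 'getMembersToEnrich',
        taskQueue: 'members-enrichment',
        retry: { initialInterval: '15s', backoffCoefficient: 2, maximumAttempts: 3 },
      },
    })
  } catch (err) {
    // Re-creating an existing schedule on redeploy is expected; rethrow the rest.
    if (!(err instanceof ScheduleAlreadyRunning)) throw err
  }
}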

Line range hint 41-80: Review the LFID enrichment schedule configuration changes

Several significant changes require attention:

  1. The schedule ID change to '-2' suggests a breaking change. Was the old schedule properly deprecated?
  2. The production schedule (every 2 months) seems quite infrequent compared to the dev/test schedule (every 2 minutes).
  3. The overlap policy has been changed to BUFFER_ONE, which could affect the workflow execution pattern.
#!/bin/bash
# Description: Check for references to the old schedule ID and verify the workflow handling

# Check for references to the old schedule ID
echo "Checking for references to old schedule ID:"
rg "members-lfid-enrichment[^-]" --type ts

# Check the workflow implementation for timing dependencies
echo -e "\nChecking workflow implementation:"
rg -A 10 "getMembersForLFIDEnrichment" --type ts

Consider the following recommendations:

  1. Document the rationale for the 2-month production schedule interval
  2. Consider adding a staging environment with an intermediate schedule (e.g., daily) to catch issues earlier
  3. Add monitoring and alerts for failed enrichments to compensate for the infrequent production runs

5-5: Verify the service import path change across the codebase

The import path has been changed from '../main' to '../service'. Let's verify this change is consistent across the codebase.

✅ Verification successful

Import path change is consistent and valid

The import path change from '../main' to '../service' appears to be part of a consistent refactoring:

  • No files are currently importing from '../main'
  • Multiple files are already using the '../service' import
  • Both service.ts and main.ts exist, confirming this is a valid refactoring
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining imports from '../main' that might need updating
# and verify the new '../service' path exists and is imported consistently

# Check for any remaining '../main' imports
echo "Checking for remaining '../main' imports:"
rg "from '\.\./main'" --type ts

# Check for consistency of '../service' imports
echo -e "\nChecking '../service' import usage:"
rg "from '\.\./service'" --type ts

Length of output: 5755


Script:

#!/bin/bash
# Let's check for both import patterns and also verify the existence of the service file

# Check for any files importing from '../main'
echo "Files importing from '../main':"
rg -l "from '\.\./main'" --type ts services/apps/premium/members_enrichment_worker/src/

# Check for files importing from '../service'
echo -e "\nFiles importing from '../service':"
rg -l "from '\.\./service'" --type ts services/apps/premium/members_enrichment_worker/src/

# Check if service.ts exists in the expected location
echo -e "\nChecking for service.ts file:"
fd "service.ts" services/apps/premium/members_enrichment_worker/src/

# Check if main.ts exists in the expected location
echo -e "\nChecking for main.ts file:"
fd "main.ts" services/apps/premium/members_enrichment_worker/src/

Length of output: 1497

backend/src/database/migrations/V1731052735__llm-prompt-history.sql (2)

15-18: Well-designed indexing strategy!

The indexes are well-chosen and will support efficient querying:

  • Composite index supports queries filtering by both type and entityId
  • Individual indexes support single-column filters
  • Partial index on JSON data optimizes merge suggestion queries

54-62: Excellent safety check before table drop!

The verification logic before dropping the table is a great safety measure. It ensures:

  1. No data loss by checking for remaining rows
  2. Clear error message with actual row count
  3. Atomic operation within DO block
services/apps/merge_suggestions_worker/src/workflows/testMergingEntitiesWithLLM.ts (2)

3-4: LGTM! Clean import addition

The addition of LlmQueryType import enhances type safety for LLM query operations.


44-50: Verify handling of removed LLM parameters

The streamlined getLLMResult call looks good, but we should verify that the removed parameters (modelId, prompt, region, modelSpecificArgs) are properly handled internally.

✅ Verification successful

Removed parameters are properly handled through centralized settings

The verification shows that the removed parameters are now managed through the LLM_SETTINGS configuration object, which provides type-specific settings including:

  • modelId is pre-configured as CLAUDE_3_OPUS for member merges
  • Model-specific arguments are defined with appropriate defaults
  • Prompts are handled internally in the mergeSuggestion method

The streamlined API is an improvement as it enforces consistent configurations through the type system.
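
The centralized settings object implied here would look roughly like the sketch below; the member-merge model comes from this verification, while the enrichment entry and the argument values are assumptions (the 200,000 max_tokens figure is discussed elsewhere in this review):

import { LlmModelType, LlmQueryType } from '@crowd/types'

// Hedged reconstruction of LLM_SETTINGS: per-query-type defaults so callers
// only pass the query type and payload.
export const LLM_SETTINGS: Record<
  LlmQueryType,
  { modelId: LlmModelType; modelSpecificArgs: Record<string, unknown> }
> = {
  [LlmQueryType.MEMBER_MERGE]: {
    modelId: LlmModelType.CLAUDE_3_OPUS,
    modelSpecificArgs: { max_tokens: 2000, temperature: 0 },
  },
  [LlmQueryType.ORGANIZATION_MERGE]: {
    modelId: LlmModelType.CLAUDE_3_OPUS,
    modelSpecificArgs: { max_tokens: 2000, temperature: 0 },
  },
  [LlmQueryType.MEMBER_ENRICHMENT]: {
    modelId: LlmModelType.CLAUDE_3_5_SONNET, // assumption; the region map lists this model
    modelSpecificArgs: { max_tokens: 200_000, temperature: 0 },
  },
}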

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for the getLLMResult implementation to verify parameter handling
ast-grep --pattern 'getLLMResult($type, $data) {
  $$$
}'

# Check for any default configurations that might handle these parameters
rg -l "modelId|prompt|region|modelSpecificArgs" --type typescript

Length of output: 162


Script:

#!/bin/bash
# Search for getLLMResult implementation with correct file extension
ast-grep --pattern 'getLLMResult($type, $data) {
  $$$
}'

# Search for configuration files and usages with correct extension
rg -l "modelId|prompt|region|modelSpecificArgs" -g '*.ts' -g '*.tsx'

# Search for the common activities interface/class
rg -A 5 "interface.*CommonActivities|class.*CommonActivities" -g '*.ts' -g '*.tsx'

# Look for LLM related configuration files
fd -e ts -e tsx -e json "llm|LLM" --exec rg -l "modelId|prompt|region|config"

Length of output: 1222


Script:

#!/bin/bash
# Check the LLM service implementation and types
rg -A 10 "getLLMResult|LlmQueryType" services/libs/common_services/src/services/llm.service.ts services/libs/types/src/llm.ts

# Look for the implementation details of member merge type
rg -A 5 "MEMBER_MERGE" services/libs/types/src/llm.ts services/libs/common_services/src/services/llm.service.ts

# Check the LLM prompt history repository for default configurations
cat services/libs/common_services/src/repos/llmPromptHistory.repo.ts

Length of output: 9840

services/apps/premium/members_enrichment_worker/src/activities/lf-auth0/githubIdentities.ts (1)

9-9: Verify service import path consistency across the codebase

The import path has been changed from ../../main to ../../service. Let's ensure this change is consistent across all files.

✅ Verification successful

Import path change appears to be a deliberate service-specific pattern

The verification reveals a clear pattern:

  • All files in premium/members_enrichment_worker consistently use ../../service
  • Other worker services consistently use ../../main

This suggests the import path change is a deliberate architectural decision specific to the premium members enrichment worker service, not an inconsistency that needs to be fixed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining imports from '../../main'
rg "from '\.\./\.\./main'" --type ts

# Search for the new pattern to confirm consistency
rg "from '\.\./\.\./service'" --type ts

Length of output: 2462

services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (3)

Line range hint 63-88: LGTM! Robust handling of multiple GitHub identities

The implementation properly handles the edge case of multiple verified GitHub identities by selecting the one with the most activities. This is a good approach to resolve potential identity conflicts.


22-22: Verify the implementation of the new processMemberSources activity

The addition of this new activity appears to replace the previous normalization step. Let's ensure it's properly implemented.


104-104: Verify data consistency after removing normalization

The removal of the normalization step in favor of direct processing with processMemberSources simplifies the flow. However, we should ensure that any necessary data transformations are handled within the new activity.

✅ Verification successful

Let me gather more specific information about the data transformations and normalization logic.


Let me check the actual implementation of processMemberSources and any data transformation logic within it.


Let me check the implementation of processMemberSources and normalizeEnrichmentData to understand how the data transformation is handled.


Based on the code analysis, I can now provide a final response about the data consistency concern:

Data consistency is maintained through source-specific normalization

The removal of the normalization step in favor of direct processMemberSources is safe because:

  • Each enrichment source (SERP, ProgAI, Clearbit, etc.) implements its own normalize method that handles data transformation
  • The normalizeEnrichmentData function is still called within processMemberSources to transform source-specific data into a standardized IMemberEnrichmentDataNormalized format
  • Common normalization utilities like normalizeAttributes and normalizeSocialIdentity are used consistently across sources
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any existing normalization logic that might need to be preserved
echo "Searching for existing normalization patterns..."
rg -l "normalize.*Data" 

# Search for data transformation in the new process
echo "Searching for data transformation in processMemberSources..."
ast-grep --pattern 'function processMemberSources($_, $_) {
  $$$
}'

Length of output: 955


Script:

#!/bin/bash
# Check the implementation of processMemberSources
echo "Checking processMemberSources implementation..."
rg -A 10 "processMemberSources" services/apps/premium/members_enrichment_worker/src/

# Check the removed normalization code from git history
echo "Checking removed normalization code..."
git log -p | rg -A 10 -B 10 "normalization"

# Check the actual normalization implementations
echo "Checking normalization implementations..."
rg -A 10 "normalizeData" services/apps/premium/members_enrichment_worker/src/

Length of output: 29661


Script:

#!/bin/bash
# Check the implementation details of processMemberSources
echo "Checking processMemberSources implementation details..."
rg -B 5 -A 20 "export async function processMemberSources" services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

# Check for any data transformation logic in the sources
echo "Checking source-specific transformations..."
rg -A 10 "transform|normalize" services/apps/premium/members_enrichment_worker/src/sources/

Length of output: 62287


Script:

#!/bin/bash
# Check the full implementation of processMemberSources and normalizeEnrichmentData
echo "Checking processMemberSources and normalizeEnrichmentData implementations..."
rg -B 5 -A 30 "export async function processMemberSources|export async function normalizeEnrichmentData" services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

# Check for any data transformation in the enrichment activity
echo "Checking enrichment activity transformations..."
rg -A 10 "transform|normalize" services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

Length of output: 4914

services/libs/types/src/members.ts (1)

Line range hint 147-159: LGTM! The addition of the id field enhances type safety.

The id field is correctly placed and typed, ensuring proper identification of member entities in the database results.

services/libs/types/src/organizations.ts (1)

169-169: LGTM! Addition of required id field aligns with interface design patterns.

The addition of the required id field to ILLMConsumableOrganization brings consistency with other organization interfaces in the codebase and ensures proper organization tracking in LLM operations.

services/archetypes/worker/src/index.ts (2)

103-103: LGTM: Good separation of concerns

The optional initWorker parameter with a default value of true maintains backward compatibility while providing flexibility to initialize connections without starting the worker.


206-252: LGTM: Clean conditional initialization

The worker initialization is properly encapsulated in a conditional block with consistent error handling patterns.

services/libs/data-access-layer/src/old/apps/merge_suggestions_worker/organizationMergeSuggestions.repo.ts (3)

Line range hint 208-214: Documentation accurately reflects architectural changes

The updated documentation clearly explains the transition to using the llmPromptHistory table while maintaining clarity about the method's functionality.


5-5: Verify type replacement across the codebase

The change from LLMSuggestionVerdictType to LlmQueryType appears to be part of a broader refactoring. Let's verify all usages have been updated consistently.

✅ Verification successful

Type replacement has been consistently updated across the codebase

The verification shows that:

  • No references to the old LLMSuggestionVerdictType exist in the codebase
  • LlmQueryType is properly defined in services/libs/types/src/enums/llm.ts
  • All usages of LlmQueryType are consistent across the codebase, including:
    • Merge suggestion workers
    • Data access layer repositories
    • Common services
    • Type definitions
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining references to the old type
rg "LLMSuggestionVerdictType"

# Search for new type usage to ensure consistent implementation
rg "LlmQueryType"

Length of output: 3741


257-267: Consider performance optimization for JSON field access

The queries now use JSON extraction (metadata ->> 'secondaryId') in WHERE clauses, which might impact performance. Consider:

  1. Adding an index on the JSON field if not already present
  2. Alternatively, consider storing secondaryId as a separate column if it's frequently queried

If you'd like to improve query performance, here's a suggested index:

CREATE INDEX idx_llm_prompt_history_metadata_secondary_id 
ON "llmPromptHistory" ((metadata->>'secondaryId'));

Let's verify if this index exists:

Also applies to: 277-287

services/apps/merge_suggestions_worker/src/activities/memberMergeSuggestions.ts (2)

310-310: LGTM: Addition of primary member ID field

The addition of the id field for the primary member aligns with the interface requirements and provides necessary identification for LLM processing.


327-327: LGTM: Addition of secondary member ID field

The addition of the id field for the secondary member maintains consistency with the primary member structure and completes the interface requirements.

Let's verify the interface compliance:

✅ Verification successful

The verification results show that:

  1. The ILLMConsumableMember interface in services/libs/types/src/members.ts explicitly requires the id field as a string.
  2. The implementation for the primary member includes all required fields including id.

However, let me verify the secondary member implementation to ensure complete interface compliance:


Let me try a different pattern to locate the secondary member implementation:


LGTM: Secondary member object correctly implements ILLMConsumableMember interface

The addition of id: secondaryMember.id is correct and the secondary member object fully implements the ILLMConsumableMember interface, matching the structure of the primary member. The implementation includes all required fields: id, displayName, joinedAt, attributes, identities, and organizations with their respective properties.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that ILLMConsumableMember interface requires the id field
# and that all implementations include it

# Search for the interface definition and its usage
ast-grep --pattern 'interface ILLMConsumableMember {
  $$$
}'

# Search for all object constructions matching the interface pattern
ast-grep --pattern '{
  id: $_,
  displayName: $_,
  joinedAt: $_,
  $$$
}'

Length of output: 2787


Script:

#!/bin/bash
# Search for the secondary member object construction
ast-grep --pattern '{
  id: secondaryMember.id,
  displayName: secondaryMember.$_,
  joinedAt: secondaryMember.$_,
  $$$
}'

Length of output: 125


Script:

#!/bin/bash
# Search for the secondary member object construction with a more flexible pattern
rg -A10 "secondaryMember\.id" services/apps/merge_suggestions_worker/src/activities/memberMergeSuggestions.ts

Length of output: 608

services/apps/merge_suggestions_worker/src/activities/common.ts (1)

44-46: Verify all calls to getLLMResult are updated

The function signature of getLLMResult has changed to include the type parameter of LlmQueryType. Ensure that all calls to getLLMResult in the codebase are updated to match the new signature.

Run the following script to find usages of getLLMResult:

✅ Verification successful

Based on the rg search results, I can generate the final response as I have enough information to verify the concern.

All calls to getLLMResult are properly updated

All 4 calls to getLLMResult in the codebase are already passing the correct type parameter (LlmQueryType.MEMBER_MERGE or LlmQueryType.ORGANIZATION_MERGE) as the first argument, matching the new function signature.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all usages of 'getLLMResult' in the codebase.

rg 'getLLMResult\(' -A 2

Length of output: 1966

services/libs/data-access-layer/src/old/apps/merge_suggestions_worker/memberMergeSuggestions.repo.ts (2)

6-6: Importing LlmQueryType is appropriate for updated query handling

The addition of LlmQueryType in the import statements correctly reflects the updated types used for query operations in this file.


262-272: ⚠️ Potential issue

Ensure safe casting of JSONB field to UUID to prevent runtime errors

When casting (lsv.metadata ->> 'secondaryId')::uuid, it assumes that the metadata JSONB field always contains the 'secondaryId' key with a valid UUID value. If this key is missing or the value is not a valid UUID, the cast will fail and result in a runtime error.

Run the following script to check for entries where secondaryId is missing or invalid:

Consider modifying the query to safely handle cases where secondaryId might be missing or invalid:

- (lsv.metadata ->> 'secondaryId')::uuid = mtmr."toMergeId" and
+ (lsv.metadata ->> 'secondaryId') IS NOT NULL AND
+ lsv.metadata ->> 'secondaryId' = mtmr."toMergeId"::text and

This change avoids casting to uuid and compares the values as text after ensuring secondaryId is not NULL.

services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (3)

16-68: Function fetchMemberDataForLLMSquashing is properly implemented

The newly added function fetchMemberDataForLLMSquashing correctly retrieves the member data along with associated organizations and verified identities. The SQL query is well-structured, uses parameter binding to prevent SQL injection, and aggregates the data as expected.


527-529: Function resetMemberEnrichedAt is correctly added

The resetMemberEnrichedAt function effectively resets the lastEnriched field to null for a given member, which is essential for triggering re-enrichment.


582-595: ⚠️ Potential issue

Update all callers of findMemberEnrichmentCacheDb due to signature change

The function findMemberEnrichmentCacheDb now accepts an array of MemberEnrichmentSource as the sources parameter instead of a single source. Ensure that all function calls throughout the codebase are updated to pass an array of sources to prevent runtime errors.

Run the following script to locate and review all calls to findMemberEnrichmentCacheDb:

Comment on lines +10 to +16
  public async insertPromptHistoryEntry(
    type: LlmQueryType,
    model: LlmModelType,
    result: ILlmResponse,
    entityId?: string,
    metadata?: Record<string, unknown>,
  ): Promise<void> {

🛠️ Refactor suggestion

Add input validation for required fields.

The method should validate that all required fields in the ILlmResponse object are present before proceeding with the insert operation.

  public async insertPromptHistoryEntry(
    type: LlmQueryType,
    model: LlmModelType,
    result: ILlmResponse,
    entityId?: string,
    metadata?: Record<string, unknown>,
  ): Promise<void> {
+   const requiredFields = ['prompt', 'answer', 'inputTokenCount', 'outputTokenCount', 'responseTimeSeconds'];
+   const missingFields = requiredFields.filter(field => !(field in result));
+   if (missingFields.length > 0) {
+     throw new Error(`Missing required fields in ILlmResponse: ${missingFields.join(', ')}`);
+   }

Comment on lines +17 to +30
await this.db().none(
`
insert into "llmPromptHistory"(type, model, "entityId", metadata, prompt, answer, "inputTokenCount", "outputTokenCount", "responseTimeSeconds")
values($(type), $(model), $(entityId), $(metadata), $(prompt), $(answer), $(inputTokenCount), $(outputTokenCount), $(responseTimeSeconds));
`,
{
type,
model,
entityId,
metadata: metadata ? JSON.stringify(metadata) : null,
...result,
},
)
}

🛠️ Refactor suggestion

Add error handling with specific error types.

The database operation should be wrapped in a try-catch block with appropriate error handling.

+   try {
      await this.db().none(
        `
          insert into "llmPromptHistory"(type, model, "entityId", metadata, prompt, answer, "inputTokenCount", "outputTokenCount", "responseTimeSeconds")
          values($(type), $(model), $(entityId), $(metadata), $(prompt), $(answer), $(inputTokenCount), $(outputTokenCount), $(responseTimeSeconds));
        `,
        {
          type,
          model,
          entityId,
          metadata: metadata ? JSON.stringify(metadata) : null,
          ...result,
        },
      )
+   } catch (error) {
+     this.log.error('Failed to insert LLM prompt history', { error, type, model, entityId });
+     throw new Error('Failed to insert LLM prompt history');
+   }

@@ -6,6 +6,7 @@
"start:debug": "CROWD_TEMPORAL_TASKQUEUE=members-enrichment SERVICE=members-enrichment-worker LOG_LEVEL=trace tsx --inspect=0.0.0.0:9232 src/main.ts",
"dev:local": "nodemon --watch src --watch ../../../libs --ext ts --exec pnpm run start:debug:local",
"dev": "nodemon --watch src --watch ../../../libs --ext ts --exec pnpm run start:debug",
"script:onboarding": "SERVICE=script tsx --inspect src/bin/onboarding.ts",

💡 Codebase verification

Script path does not exist and needs to be created

The src/bin/onboarding.ts path referenced in the package.json script does not exist in the codebase. The script cannot be executed until:

  1. The bin directory is created under src
  2. The onboarding.ts script is implemented

Additionally:

  • Remove the --inspect flag from the production script (see the sketch below)
  • Consider adding a corresponding test script
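
One possible split that keeps --inspect behind a dedicated debug entry (script names assumed, not part of this PR):

"script:onboarding": "SERVICE=script tsx src/bin/onboarding.ts",
"script:onboarding:debug": "SERVICE=script tsx --inspect src/bin/onboarding.ts",
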
🔗 Analysis chain

Verify the onboarding script path and consider improvements.

The new script uses the --inspect flag which enables debugging. Consider:

  1. Adding a non-debug version for production use
  2. Adding a corresponding test script

Let's verify the script path exists:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the onboarding script path exists
# Expected: The file should exist at the specified path

fd "onboarding.ts" "services/apps/premium/members_enrichment_worker/src/bin"

Length of output: 217


Script:

#!/bin/bash
# Let's first check if the directory structure exists
fd "bin" "services/apps/premium/members_enrichment_worker"

# Also search for onboarding.ts anywhere in the service directory
fd "onboarding.ts" "services/apps/premium/members_enrichment_worker"

# Check if there are any recent changes related to onboarding.ts
git diff HEAD~1 HEAD -- "services/apps/premium/members_enrichment_worker"

Length of output: 6863


Script:

#!/bin/bash
# Let's check if there's a src directory and its structure
ls -la services/apps/premium/members_enrichment_worker/src

# Check if there are any TypeScript files in the root src directory
# that might be related to onboarding
fd -e ts . "services/apps/premium/members_enrichment_worker/src"

Length of output: 3834

Comment on lines 47 to 73
export const LLM_SETTINGS: Record<LlmQueryType, ILlmSettings> = {
[LlmQueryType.MEMBER_MERGE]: {
modelId: LlmModelType.CLAUDE_3_OPUS,
arguments: {
max_tokens: 2000,
anthropic_version: 'bedrock-2023-05-31',
temperature: 0,
},
},
[LlmQueryType.ORGANIZATION_MERGE]: {
modelId: LlmModelType.CLAUDE_3_OPUS,
arguments: {
max_tokens: 2000,
anthropic_version: 'bedrock-2023-05-31',
temperature: 0,
},
},
[LlmQueryType.MEMBER_ENRICHMENT]: {
modelId: LlmModelType.CLAUDE_3_5_SONNET,
arguments: {
// TODO uros check if this is ok
max_tokens: 200000,
anthropic_version: 'bedrock-2023-05-31',
temperature: 0,
},
},
}

🛠️ Refactor suggestion

Refactor duplicate configuration

The merge operations (MEMBER_MERGE and ORGANIZATION_MERGE) share identical configuration. Consider extracting this to reduce duplication.

+const mergeConfig: ILlmSettings = {
+  modelId: LlmModelType.CLAUDE_3_OPUS,
+  arguments: {
+    max_tokens: 2000,
+    anthropic_version: 'bedrock-2023-05-31',
+    temperature: 0,
+  },
+}
+
 export const LLM_SETTINGS: Record<LlmQueryType, ILlmSettings> = {
-  [LlmQueryType.MEMBER_MERGE]: {
-    modelId: LlmModelType.CLAUDE_3_OPUS,
-    arguments: {
-      max_tokens: 2000,
-      anthropic_version: 'bedrock-2023-05-31',
-      temperature: 0,
-    },
-  },
-  [LlmQueryType.ORGANIZATION_MERGE]: {
-    modelId: LlmModelType.CLAUDE_3_OPUS,
-    arguments: {
-      max_tokens: 2000,
-      anthropic_version: 'bedrock-2023-05-31',
-      temperature: 0,
-    },
-  },
+  [LlmQueryType.MEMBER_MERGE]: mergeConfig,
+  [LlmQueryType.ORGANIZATION_MERGE]: mergeConfig,

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 20 to 52
insert into "llmPromptHistory"(type, model, "entityId", metadata, prompt, answer, "inputTokenCount", "outputTokenCount", "responseTimeSeconds")
select 'organization_merge_suggestion',
model,
"primaryId",
json_build_object(
'secondaryId', "secondaryId"
),
prompt,
verdict,
"inputTokenCount",
"outputTokenCount",
"responseTimeSeconds"
from "llmSuggestionVerdicts"
where type = 'organization';

delete from "llmSuggestionVerdicts" where type = 'organization';

insert into "llmPromptHistory"(type, model, "entityId", metadata, prompt, answer, "inputTokenCount", "outputTokenCount", "responseTimeSeconds")
select 'member_merge_suggestion',
model,
"primaryId",
json_build_object(
'secondaryId', "secondaryId"
),
prompt,
verdict,
"inputTokenCount",
"outputTokenCount",
"responseTimeSeconds"
from "llmSuggestionVerdicts"
where type = 'member';

delete from "llmSuggestionVerdicts" where type = 'member';

⚠️ Potential issue

Add transaction safety to prevent data loss

The current migration lacks transaction safety. If the script fails between insert and delete operations, we might lose data or end up in an inconsistent state.

Wrap the operations in transactions and add validation:

+ begin;
+
+ -- Store counts for validation
+ do $$
+ declare
+     org_count int;
+     member_count int;
+ begin
+     select count(*) into org_count from "llmSuggestionVerdicts" where type = 'organization';
+     select count(*) into member_count from "llmSuggestionVerdicts" where type = 'member';
+
  insert into "llmPromptHistory"(type, model, "entityId", metadata, prompt, answer, "inputTokenCount", "outputTokenCount", "responseTimeSeconds")
  -- ... (organization insert query remains same) ...
  where type = 'organization';

+ -- Validate organization migration
+ if (select count(*) from "llmPromptHistory" where type = 'organization_merge_suggestion') != org_count then
+     raise exception 'Organization data migration count mismatch';
+ end if;

  delete from "llmSuggestionVerdicts" where type = 'organization';

  insert into "llmPromptHistory"(type, model, "entityId", metadata, prompt, answer, "inputTokenCount", "outputTokenCount", "responseTimeSeconds")
  -- ... (member insert query remains same) ...
  where type = 'member';

+ -- Validate member migration
+ if (select count(*) from "llmPromptHistory" where type = 'member_merge_suggestion') != member_count then
+     raise exception 'Member data migration count mismatch';
+ end if;

  delete from "llmSuggestionVerdicts" where type = 'member';
+ end $$;
+
+ commit;

@@ -212,6 +212,7 @@ class MemberMergeSuggestionsRepository {
const result: ILLMConsumableMemberDbResult[] = await this.connection.manyOrNone(
`
select
mem.id,

⚠️ Potential issue

Include mem.id in the GROUP BY clause to prevent SQL errors

The mem.id column has been added to the SELECT statement but is not included in the GROUP BY clause. In PostgreSQL, when using aggregation functions like jsonb_agg, all selected columns that are not aggregated must be included in the GROUP BY clause to avoid SQL errors.

Apply this diff to include mem.id in the GROUP BY clause:

 group by 
+ mem.id,
  mI."memberId",
  mem.attributes,
  mem."displayName",
  mem."joinedAt"

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 307 to 309
const result = await llmService.consolidateMemberEnrichmentData(memberId, prompt)
svc.log.info({ memberId }, 'LLM result')
} else {

⚠️ Potential issue

Add error handling for LLM service call

Currently, there is no error handling for the call to llmService.consolidateMemberEnrichmentData. If the LLM service fails or returns invalid data, it could cause unintended behavior. Consider adding try-catch blocks and appropriate error handling.
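
A minimal sketch of the guard, using only the calls already quoted above; whether to rethrow or skip the member on failure is an open question:

try {
  const result = await llmService.consolidateMemberEnrichmentData(memberId, prompt)
  svc.log.info({ memberId }, 'LLM result')
  // ... apply result ...
} catch (err) {
  svc.log.error(err, { memberId }, 'Failed to consolidate member enrichment data via LLM!')
  throw err
}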


const result = await llmService.consolidateMemberEnrichmentData(memberId, prompt)
svc.log.info({ memberId }, 'LLM result')

⚠️ Potential issue

Avoid logging potentially sensitive data

The log statement svc.log.info({ memberId }, 'LLM result') currently records only the memberId, but if the LLM result object is later added to this call it could expose personal information. Ensure that sensitive data is never written to logs.

Comment on lines +554 to +563
const res = await tx.query(
`UPDATE "memberEnrichmentCache"
SET
"updatedAt" = NOW(),
"data" = $2
WHERE "memberId" = $1 and source = $3;`,
[memberId, dataSanitized, source],
)
await resetMemberEnrichedAt(tx, memberId)
return res

🛠️ Refactor suggestion

Ensure error handling in updateMemberEnrichmentCacheDb

Similarly, in updateMemberEnrichmentCacheDb, ensure that resetMemberEnrichedAt is called only if the update operation affects rows. This adds robustness to your function by preventing unintended state changes.

Apply this diff to enhance error handling:

const res = await tx.query(
  `UPDATE "memberEnrichmentCache"
    SET
      "updatedAt" = NOW(),
      "data" = $2
    WHERE "memberId" = $1 and source = $3;`,
  [memberId, dataSanitized, source],
)
+if (res.rowCount > 0) {
+  await resetMemberEnrichedAt(tx, memberId)
+}
return res

Comment on lines +538 to +544
const res = await tx.query(
`INSERT INTO "memberEnrichmentCache" ("memberId", "data", "createdAt", "updatedAt", "source")
VALUES ($1, $2, NOW(), NOW(), $3);`,
[memberId, dataSanitized, source],
)
await resetMemberEnrichedAt(tx, memberId)
return res

🛠️ Refactor suggestion

Ensure error handling in insertMemberEnrichmentCacheDb

In insertMemberEnrichmentCacheDb, consider adding error handling to ensure that resetMemberEnrichedAt is only called if the insertion into "memberEnrichmentCache" is successful. This prevents resetting the lastEnriched timestamp when the cache update fails.

Apply this diff to enhance error handling:

const res = await tx.query(
  `INSERT INTO "memberEnrichmentCache" ("memberId", "data", "createdAt", "updatedAt", "source")
    VALUES ($1, $2, NOW(), NOW(), $3);`,
  [memberId, dataSanitized, source],
)
+if (res.rowCount > 0) {
+  await resetMemberEnrichedAt(tx, memberId)
+}
return res

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🧹 Outside diff range and nitpick comments (19)
backend/src/database/migrations/V1731052735__llm-prompt-history.sql (1)

15-17: Optimize index strategy

  1. The single-column index on type is redundant: the composite index ix_llmPromptHistory_type_entityId already serves queries that filter on type, its leading column. The separate entityId index is still needed for lookups that do not filter on type.
  2. Consider adding NULLS LAST for better null handling in the composite index
-create index "ix_llmPromptHistory_type_entityId" on "llmPromptHistory"("type", "entityId");
+create index "ix_llmPromptHistory_type_entityId" on "llmPromptHistory"("type", "entityId" NULLS LAST);
 create index "ix_llmPromptHistory_entityId" on "llmPromptHistory"("entityId");
-create index "ix_llmPromptHistory_type" on "llmPromptHistory"("type");
services/libs/types/src/llm.ts (3)

35-45: Enhance pricing documentation

The comment "to estimate costs - these numbers can change" could be more informative. Consider:

  • Adding a link to the official pricing documentation
  • Including the last update date
  • Documenting how to calculate total cost using these rates

Example documentation:

// Pricing as of April 2024 (see https://aws.amazon.com/bedrock/pricing/)
// Total cost calculation:
// Input cost = (input_tokens / 1000) * costPer1000InputTokens
// Output cost = (output_tokens / 1000) * costPer1000OutputTokens
// Total cost = Input cost + Output cost
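
The same arithmetic as a small helper (a sketch; property names mirror the LLM_MODEL_PRICING_MAP entries used elsewhere in this PR):

function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricing: { costPer1000InputTokens: number; costPer1000OutputTokens: number },
): number {
  // cost = tokens / 1000 * per-1000 rate, summed over input and output
  return (
    (inputTokens / 1000) * pricing.costPer1000InputTokens +
    (outputTokens / 1000) * pricing.costPer1000OutputTokens
  )
}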

91-112: Add JSDoc comments for the enrichment result interface

The LlmMemberEnrichmentResult interface would benefit from documentation explaining:

  • The meaning of the confidence score range
  • The difference between 'update' and 'new' arrays
  • The purpose of each nested structure

Example documentation:

/**
 * Represents the result of an LLM-based member enrichment operation
 * @property confidence - Score between 0-1 indicating the LLM's confidence in the enrichment
 * @property changes - Structured changes to be applied to the member
 * @property changes.identities - New identities to add and existing ones to update
 * @property changes.attributes - New attributes to add and existing ones to update
 * @property changes.organizations - New organizations and connections to be created
 */

30-56: Consider environment-specific configuration management

The hardcoded region map and model settings might need to vary across different environments (development, staging, production). Consider:

  • Moving these configurations to environment variables
  • Creating a configuration service
  • Adding validation for required environment variables (see the sketch below)
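
A minimal sketch of that validation (the variable name is an assumption, not part of this PR):

function requireEnv(name: string): string {
  const value = process.env[name]
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`)
  }
  return value
}

const bedrockRegion = requireEnv('CROWD_BEDROCK_REGION') // name assumed
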
services/libs/common_services/src/services/llm.service.ts (1)

28-62: Consider implementing cleanup for AWS clients

The client caching strategy is good for performance, but consider implementing a cleanup method to properly dispose of AWS clients when they're no longer needed. This prevents potential resource leaks, especially in long-running applications.

 export class LlmService extends LoggerBase {
   private readonly repo: LlmPromptHistoryRepository
   private readonly clientRegionMap: Map<string, BedrockRuntimeClient>
+
+  public async dispose(): Promise<void> {
+    for (const client of this.clientRegionMap.values()) {
+      await client.destroy()
+    }
+    this.clientRegionMap.clear()
+  }
services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (1)

535-537: Add error handling to resetMemberEnrichedAt

The function should handle the case where the update affects no rows, as this might indicate a potential issue.

Apply this diff to add error handling:

 export async function resetMemberEnrichedAt(tx: DbConnOrTx, memberId: string): Promise<void> {
-  await tx.none(`update members set "lastEnriched" = null where id = $(memberId);`, { memberId })
+  const result = await tx.result(`update members set "lastEnriched" = null where id = $(memberId);`, { memberId })
+  if (result.rowCount === 0) {
+    console.warn(`No member found with id ${memberId} when resetting lastEnriched`)
+  }
 }
services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (1)

219-341: Consider extracting the LLM prompt template

The LLM prompt is quite long and contains complex validation rules. Consider extracting it to a separate template file for better maintainability.

Create a new file llm-templates/member-enrichment.ts:

export const getMemberEnrichmentPrompt = (
  existingMemberData: any,
  enrichedData: any,
  memberIdentityTypes: string[],
  organizationIdentityTypes: string[],
  organizationSources: string[],
) => `
You are a data consolidation expert...
// Rest of the prompt template
`

Then update the code to use the template:

+ import { getMemberEnrichmentPrompt } from './llm-templates/member-enrichment'

- const prompt = `...`
+ const prompt = getMemberEnrichmentPrompt(
+   existingMemberData,
+   toBeSquashed,
+   Object.values(MemberIdentityType),
+   Object.values(OrganizationIdentityType),
+   Object.values(OrganizationSource),
+ )
services/apps/data_sink_worker/src/service/organization.service.ts (4)

Line range hint 85-85: Consider reusing the existing OrganizationService instance instead of creating a new one

At line 85, a new instance of OrganizationService is created within the processOrganizationEnrich method:

const txService = new OrganizationService(txStore, this.log)

Since OrganizationService is already instantiated, consider reusing the existing instance by passing the transactional store to the required methods or refactoring to support transactions without creating a new instance. This would improve maintainability and reduce unnecessary instantiation.


Line range hint 116-125: Avoid mutating input parameters directly

In the loop starting at line 116, the organization.identities array is being modified by pushing existing identities:

for (const i of existingIdentities) {
  if (
    organization.identities.find(
      (oi) =>
        oi.type === i.type &&
        oi.value === i.value &&
        oi.platform === i.platform &&
        oi.verified === i.verified,
    )
  ) {
    continue
  }

  organization.identities.push(i)
}

Mutating the organization object directly can lead to unintended side effects, especially if organization is used elsewhere. Consider creating a new array or cloning the object before making modifications to prevent side effects.

Apply this diff to avoid mutating the input parameter:

-  organization.identities.push(i)
+  const updatedIdentities = [...organization.identities, i]

And use updatedIdentities in place of organization.identities in subsequent code.


Line range hint 95-95: Ensure the primary identity selection logic is robust

At line 95, the primary identity is selected as the first verified identity:

const primaryIdentity = verifiedIdentities[0]

Ensure that this logic aligns with business requirements. If there are multiple verified identities, selecting the first one may not always be appropriate. Consider implementing logic to determine the primary identity based on specific criteria (e.g., priority of identity types, recency, etc.).
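
A sketch of a priority-based pick; the ordering is illustrative, not a product rule:

const TYPE_PRIORITY: string[] = ['primary-domain', 'domain', 'email', 'username'] // assumed order

function pickPrimaryIdentity<T extends { type: string }>(identities: T[]): T | undefined {
  const rank = (t: string) => {
    const i = TYPE_PRIORITY.indexOf(t)
    return i === -1 ? Number.MAX_SAFE_INTEGER : i // unknown types sort last
  }
  return [...identities].sort((a, b) => rank(a.type) - rank(b.type))[0]
}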


Line range hint 108-109: Consider handling the case when no organization is found more explicitly

When no existing organization is found, the method logs a debug message and returns:

this.log.debug(
  'No organization found for enriching. This organization enrich process had no affect.',
)

If the absence of an organization to enrich is expected behavior, consider whether additional actions are required. If it's an unexpected scenario, you may want to log a warning or handle it accordingly.

services/apps/premium/members_enrichment_worker/src/bin/onboarding.ts (5)

19-21: Enhance error messaging for missing arguments

Currently, if the required tenantId argument is missing, the script exits without any explanation. Consider adding an error message to inform the user about the correct usage.

Apply this diff to add informative error messaging:

 if (processArguments.length !== 1) {
+  console.error('Error: Exactly one argument (tenantId) is required')
+  console.error('Usage: node onboarding.ts <tenantId>')
   process.exit(1)
 }

23-23: Validate tenantId format

To prevent potential issues with invalid tenantId values, consider adding a validation check to ensure it conforms to the expected UUID v4 format.

Apply this diff to validate the tenantId:

 const tenantId = processArguments[0]
+if (!tenantId.match(/^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i)) {
+  console.error('Error: Invalid tenantId format. Expected UUID v4')
+  process.exit(1)
+}

25-26: Make configuration parameters adjustable

The minMemberActivities and maxConcurrentProcessing values are currently hard-coded. Consider making these parameters configurable via environment variables or command-line arguments to enhance flexibility.

Apply this diff to make the parameters configurable:

-const minMemberActivities = 100
-const maxConcurrentProcessing = 5
+const minMemberActivities = process.env.MIN_MEMBER_ACTIVITIES
+  ? parseInt(process.env.MIN_MEMBER_ACTIVITIES, 10)
+  : 100
+const maxConcurrentProcessing = process.env.MAX_CONCURRENT_PROCESSING
+  ? parseInt(process.env.MAX_CONCURRENT_PROCESSING, 10)
+  : 5

You may also update your documentation to reflect these new environment variables.
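
For example (values illustrative): MIN_MEMBER_ACTIVITIES=250 MAX_CONCURRENT_PROCESSING=10 pnpm run script:onboarding <tenantId>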


63-154: Simplify concurrency management using p-limit

To enhance code readability and simplify concurrency control, consider using the p-limit library to manage the maximum number of concurrent promises.

Apply this diff to refactor the processing loop:

+import pLimit from 'p-limit'

 const sources = Object.values(MemberEnrichmentSource) as MemberEnrichmentSource[]

 setImmediate(async () => {
   await svc.init(false)

-  let processingCount = 0
   let updatedMembersCount = 0
   let skippedMembersCount = 0
   let failedMembersCount = 0

   let totalProcessingTime = 0
   let processedMembersCount = 0
   const REPORT_INTERVAL = 10

   const pageSize = 100
   let members = await getEnrichableMembers(pageSize)
-  let pagePromises: Promise<void>[] = []
+
+  const limit = pLimit(maxConcurrentProcessing)
   while (members.length > 0) {
     svc.log.info({ memberCount: members.length }, 'Processing members!')
     // process members
-    for (const memberId of members) {
-      while (processingCount >= maxConcurrentProcessing) {
-        await timeout(100)
-      }
-
-      processingCount++
+    const tasks = members.map((memberId) =>
+      limit(async () => {
         const startTime = Date.now()
-        const promise = processMemberSources(memberId, sources)
-          .then((res) => {
-            processingCount--
-            if (res) {
-              const processingTime = Date.now() - startTime
-              totalProcessingTime += processingTime
-              processedMembersCount++
-
-              updatedMembersCount++
-            } else {
-              skippedMembersCount++
+        try {
+          const res = await processMemberSources(memberId, sources)
+          if (res) {
+            const processingTime = Date.now() - startTime
+            totalProcessingTime += processingTime
+            processedMembersCount++
+            updatedMembersCount++
+          } else {
+            skippedMembersCount++
+          }
+        } catch (err) {
+          svc.log.error(err, { memberId }, 'Error while processing member enrichment sources!')
+          failedMembersCount++
         }
-
-            // Report average processing time every REPORT_INTERVAL members
-            if (processedMembersCount > 0 && processedMembersCount % REPORT_INTERVAL === 0) {
-              const averageProcessingTime = totalProcessingTime / processedMembersCount
-              svc.log.info(
-                {
-                  averageProcessingTime: `${(averageProcessingTime / 1000).toFixed(2)}s`,
-                  processedMembers: processedMembersCount,
-                  updatedMembers: updatedMembersCount,
-                  skippedMembers: skippedMembersCount,
-                  failedMembers: failedMembersCount,
-                },
-                'Processing time statistics',
-              )
-            }
-          })
-          .catch((err) => {
-            processingCount--
-            svc.log.error(err, { memberId }, 'Error while processing member enrichment sources!')
-            failedMembersCount++
-          })
-        pagePromises.push(promise)
-      }
-
-      await Promise.all(pagePromises)
-      pagePromises = []
+        // Report average processing time every REPORT_INTERVAL members
+        if (
+          processedMembersCount > 0 &&
+          processedMembersCount % REPORT_INTERVAL === 0
+        ) {
+          const averageProcessingTime = totalProcessingTime / processedMembersCount
+          svc.log.info(
+            {
+              averageProcessingTime: `${(averageProcessingTime / 1000).toFixed(2)}s`,
+              processedMembers: processedMembersCount,
+              updatedMembers: updatedMembersCount,
+              skippedMembers: skippedMembersCount,
+              failedMembers: failedMembersCount,
+            },
+            'Processing time statistics',
+          )
+        }
+      }),
+    )
+
+    await Promise.all(tasks)
     // load next page
     members = await getEnrichableMembers(pageSize, members[members.length - 1])

     svc.log.info(
       {
         updatedMembersCount,
         skippedMembersCount,
         failedMembersCount,
         averageProcessingTime: `${(totalProcessingTime / processedMembersCount / 1000).toFixed(2)}s`,
       },
       'Current statistics!',
     )
   }

This refactoring removes the manual concurrency management and leverages p-limit for better readability and maintainability.


63-154: Replace setImmediate with an async function

Using setImmediate is unnecessary in this context. Consider wrapping your main logic inside an async function for clarity.

Apply this diff to refactor the entry point:

-setImmediate(async () => {
+async function main() {
   await svc.init(false)

   // ... rest of the code ...

-  process.exit(0)
-})
+  process.exit(0)
+}
+
+main().catch((error) => {
+  svc.log.error({ error }, 'An unexpected error occurred')
+  process.exit(1)
+})

This change improves readability and provides a clear entry point for your script.

services/libs/data-access-layer/src/organizations/organizations.ts (3)

480-485: Optimize organization lookup by querying all verified identities at once

Currently, the code loops over verifiedIdentities and queries the database for each identity until a match is found. This can be inefficient, especially with many identities.

Consider modifying findOrgByVerifiedIdentity to accept multiple identities and perform a single database query. This would reduce database load and improve performance:

// Modify the function to accept an array of identities
async function findOrgByVerifiedIdentities(
  qx: QueryExecutor,
  tenantId: string,
  identities: IDbOrgIdentity[],
): Promise<IDbOrganization | null> {
  // Implementation to query all identities at once
}

// Use the new function
const existing = await findOrgByVerifiedIdentities(qe, tenantId, verifiedIdentities)

544-556: Batch insert organization identities to improve performance

The code inserts each identity individually using addOrgIdentity, which can be inefficient for multiple identities.

Consider modifying addOrgIdentity to accept an array of identities and perform a batch insert:

// Modify the function to accept multiple identities
async function addOrgIdentities(
  qe: QueryExecutor,
  identities: {
    organizationId: string
    tenantId: string
    identities: IDbOrgIdentity[]
    integrationId?: string
  },
): Promise<void> {
  // Implementation of batch insert
}

// Use the new function
await addOrgIdentities(qe, {
  organizationId: id,
  tenantId,
  identities: data.identities,
  integrationId,
})

560-563: Enhance error handling with more descriptive messages

The catch block logs the error and rethrows it, but additional context could aid in debugging.

Consider adding more context to the error message or handling specific error types:

catch (err) {
  log.error({ err, data }, 'Error while upserting an organization!')
  throw new Error(`Failed to upsert organization: ${err.message}`)
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f4f3784 and 1aba439.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (13)
  • backend/src/database/migrations/V1731052735__llm-prompt-history.sql (1 hunks)
  • services/apps/data_sink_worker/src/service/organization.service.ts (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (3 hunks)
  • services/apps/premium/members_enrichment_worker/src/bin/onboarding.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/types.ts (1 hunks)
  • services/libs/common_services/src/services/llm.service.ts (1 hunks)
  • services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (4 hunks)
  • services/libs/data-access-layer/src/organizations/attributesConfig.ts (1 hunks)
  • services/libs/data-access-layer/src/organizations/organizations.ts (2 hunks)
  • services/libs/types/src/enums/llm.ts (1 hunks)
  • services/libs/types/src/enums/organizations.ts (1 hunks)
  • services/libs/types/src/llm.ts (1 hunks)
  • services/libs/types/src/premium/enrichment.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • services/apps/premium/members_enrichment_worker/src/types.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • services/libs/types/src/enums/llm.ts
  • services/libs/types/src/premium/enrichment.ts
🧰 Additional context used
🪛 Biome
services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

[error] 171-171: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)


[error] 182-182: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 186-186: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 195-195: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 199-199: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)

🔇 Additional comments (17)
services/libs/types/src/enums/organizations.ts (2)

35-35: Verify the distinction between email-based identifiers

The addition of EMAIL to OrganizationIdentityType looks good, but please verify that there's a clear distinction between this and OrganizationAttributeSource.EMAIL to avoid confusion in the codebase.

✅ Verification successful

No overlap concerns between email-related enums

The enums serve distinct purposes with clear separation:

  • OrganizationIdentityType.EMAIL represents an identity type for organizations (like usernames or domains)
  • OrganizationAttributeSource.EMAIL indicates the source of organization attributes (like GitHub or custom sources)

The distinction is further reinforced by their usage patterns:

  • Identity types are used in identity management and organization structure
  • Attribute sources are used in enrichment and data attribution contexts
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usages of both email-related enums to ensure proper distinction
echo "Searching for OrganizationIdentityType.EMAIL usage:"
rg "OrganizationIdentityType.EMAIL" -A 3

echo -e "\nSearching for OrganizationAttributeSource.EMAIL usage:"
rg "OrganizationAttributeSource.EMAIL" -A 3

Length of output: 2525


Script:

#!/bin/bash
# Let's search for the actual usage context of both email-related enums
echo "Searching for files containing OrganizationIdentityType:"
rg -l "OrganizationIdentityType" 

echo -e "\nSearching for files containing OrganizationAttributeSource:"
rg -l "OrganizationAttributeSource"

echo -e "\nLet's look at the enum definitions:"
cat services/libs/types/src/enums/organizations.ts

Length of output: 5087


43-43: Consider using specific enrichment sources instead of a generic one

The ENRICHMENT value seems redundant when there are already specific enrichment sources defined in OrganizationSource (e.g., ENRICHMENT_PROGAI, ENRICHMENT_CLEARBIT). This could lead to ambiguity about which enrichment source was actually used.

Consider either:

  1. Using the specific enrichment sources directly, or
  2. Adding a comment explaining when to use this generic source versus the specific ones
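
For the second option, a sketch of such a comment (enum values as quoted elsewhere in this review; wording illustrative):

export enum OrganizationAttributeSource {
  // ... other sources ...
  // Generic fallback: use only when the concrete enrichment provider is unknown;
  // otherwise record the provider-specific source (e.g. OrganizationSource.ENRICHMENT_PROGAI).
  ENRICHMENT = 'enrichment',
}
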
services/libs/types/src/llm.ts (1)

1-19: Well-structured base interfaces!

The base interfaces are well-designed with:

  • Clear hierarchy through interface extension
  • Descriptive property names
  • Proper type definitions
services/libs/common_services/src/services/llm.service.ts (2)

1-27: LGTM! Well-structured imports and interface definition.

The imports are properly organized, and the IBedrockClientCredentials interface is clear and focused.


149-163: 🛠️ Refactor suggestion

Replace any type with proper typing

The method uses any type which bypasses TypeScript's type checking. Consider defining proper types for better type safety.

   public async consolidateMemberEnrichmentData(
     memberId: string,
     prompt: string,
-    // eslint-disable-next-line @typescript-eslint/no-explicit-any
-  ): Promise<ILlmResult<LlmMemberEnrichmentResult>> {
+  ): Promise<ILlmResult<LlmMemberEnrichmentResult>> {
     const response = await this.queryLlm(LlmQueryType.MEMBER_ENRICHMENT, prompt, memberId)
 
     const result = JSON.parse(response.answer)
 
     return {
-      result,
+      result: result as LlmMemberEnrichmentResult,
       ...response,
     }
   }

Likely invalid or redundant comment.

services/libs/data-access-layer/src/organizations/attributesConfig.ts (2)

234-234: LGTM! Appropriate priority placement for enrichment source.

The placement of OrganizationAttributeSource.ENRICHMENT at the lowest priority is correct, ensuring that enriched data doesn't override data from more authoritative sources like custom inputs or direct integrations.


234-234: Verify the OrganizationAttributeSource enum update.

Let's ensure the ENRICHMENT value is properly defined in the types package.

✅ Verification successful

OrganizationAttributeSource.ENRICHMENT is properly defined in the types package

The verification confirms that ENRICHMENT is correctly defined as 'enrichment' in the OrganizationAttributeSource enum within services/libs/types/src/enums/organizations.ts.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the ENRICHMENT value is properly defined in the OrganizationAttributeSource enum

# Check the enum definition
rg -A 10 "export enum OrganizationAttributeSource" "services/libs/types/src/enums/organizations.ts"

Length of output: 259

services/libs/data-access-layer/src/old/apps/premium/members_enrichment_worker/index.ts (2)

546-552: LGTM: Changes align with existing review comments

The integration of resetMemberEnrichedAt is correct, and the existing review comments about error handling are still applicable.

Also applies to: 562-571


589-603: Verify callers of findMemberEnrichmentCacheDb

The function signature has changed to accept an array of sources instead of a single source. This change requires verification of all callers to ensure they've been updated accordingly.

✅ Verification successful

Let me search for more context about the callers by examining the implementation.


Based on the context gathered:

Function signature change is correctly handled

The only caller of findMemberEnrichmentCacheDb is a wrapper function findMemberEnrichmentCache in services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts, which already accepts an array of sources and correctly passes them to the modified function. The change is consistent throughout the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for function calls to findMemberEnrichmentCacheDb
ast-grep --pattern 'findMemberEnrichmentCacheDb($$$)'

Length of output: 839


Script:

#!/bin/bash
# Get context around the caller
rg -B 5 -A 5 "findMemberEnrichmentCacheDb" services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts

Length of output: 872

services/apps/premium/members_enrichment_worker/src/activities/enrichment.ts (3)

1-32: LGTM! Import statements are well-organized

The new imports are properly structured and necessary for the added functionality.


113-116: LGTM! Function signature updated to handle multiple sources

The change to accept an array of sources aligns well with the new multi-source data consolidation approach.


161-204: 🛠️ Refactor suggestion

Optimize performance by avoiding the delete operator

The data preparation logic uses the delete operator which can impact performance. Consider using object destructuring or setting properties to undefined instead.

Apply these changes:

- delete n.contributions
+ n.contributions = undefined

- delete n.reach
+ n.reach = undefined

- delete normalized.contributions
+ normalized.contributions = undefined

- delete normalized.reach
+ normalized.reach = undefined

Likely invalid or redundant comment.

🧰 Tools
🪛 Biome

[error] 171-171: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)


[error] 182-182: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 186-186: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 195-195: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 199-199: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)

services/apps/data_sink_worker/src/service/organization.service.ts (1)

36-42: Ensure that error handling is appropriately managed after removing the try-catch block

In the findOrCreate method, the previous implementation included a try-catch block for error handling within the transaction. With the current changes, that block has been removed, so ensure that any exceptions thrown within the transaction are handled at a higher level.

services/apps/premium/members_enrichment_worker/src/bin/onboarding.ts (1)

7-16: Service configuration is appropriately customized

The service environment variables and configurations have been correctly set to ensure that only the necessary components are enabled for this script.

services/libs/data-access-layer/src/organizations/organizations.ts (3)

1-2: Imports are appropriate for new functionality

The added imports generateUUIDv1, websiteNormalizer, and getServiceChildLogger are necessary for the new findOrCreateOrganization function and enhance modularity.


516-527: Validate displayName assignment for new organizations

When creating a new organization, displayName is set to firstVerified.value, which may not always be the desired display name.

Ensure that firstVerified.value is appropriate for the displayName. If value might not represent a suitable display name (e.g., a URL or code), consider checking and assigning a more user-friendly name if available.
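
A sketch of such a guard; the heuristic is illustrative only:

function toDisplayName(value: string): string {
  // Strip protocol and path only when the value looks like a URL
  if (/^https?:\/\//i.test(value)) {
    return value.replace(/^https?:\/\//i, '').split('/')[0]
  }
  return value
}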


498-501: ⚠️ Potential issue

Ensure the organization is updated regardless of recent updates

The updateOrganization function checks if the organization's updatedAt timestamp is older than one minute ago, which may prevent updates if the organization was recently modified.

Consider removing or adjusting the timestamp check to ensure updates are applied as intended:

// In updateOrganization function, remove the updatedAt condition
- where id = $(organizationId) and "updatedAt" <= $(oneMinuteAgo)
+ where id = $(organizationId)

Alternatively, if the timestamp check is required for concurrency control, implement optimistic locking or handle the case when the update doesn't occur due to the timestamp condition.

Likely invalid or redundant comment.

Comment on lines +19 to +50
-- backup members table
create table members_backup_14_11_2024 as
select *
from members
with no data;

-- Copy all data
insert into members_backup_14_11_2024
select *
from members;

-- backup memberIdentities table
create table member_identities_backup_14_11_2024 as
select *
from "memberIdentities"
with no data;

-- Copy all data
insert into member_identities_backup_14_11_2024
select *
from "memberIdentities";

-- backup memberOrganizations table
create table member_organizations_backup_14_11_2024 as
select *
from "memberOrganizations"
with no data;

-- Copy all data
insert into member_organizations_backup_14_11_2024
select *
from "memberOrganizations";

🛠️ Refactor suggestion

Improve backup strategy

  1. Use dynamic date generation instead of hardcoding
  2. Add row count validation
  3. Consider cleanup strategy for old backups
+do $$
+declare
+    backup_date text := to_char(current_timestamp, 'DD_MM_YYYY');
+    members_count int;
+    member_identities_count int;
+    member_organizations_count int;
+    backup_count int;
+    backup_table text;
+begin
+    -- Get original counts
+    select count(*) into members_count from members;
+    select count(*) into member_identities_count from "memberIdentities";
+    select count(*) into member_organizations_count from "memberOrganizations";
+
     -- backup members table
-    create table members_backup_14_11_2024 as
+    execute format('create table members_backup_%s as
     select *
     from members
-        with no data;
+        with no data', backup_date);

     -- Copy all data
-    insert into members_backup_14_11_2024
+    execute format('insert into members_backup_%s
     select *
-    from members;
+    from members', backup_date);

+    -- Validate backup (an "if" cannot run inside execute, so select the count first)
+    execute format('select count(*) from members_backup_%s', backup_date) into backup_count;
+    if backup_count != members_count then
+        raise exception 'Members backup validation failed';
+    end if;

     -- Similar changes for other tables...

+    -- Cleanup old backups (older than 30 days)
+    for backup_table in
+        select tablename
+        from pg_tables
+        where tablename like 'members_backup_%'
+           or tablename like 'member_identities_backup_%'
+           or tablename like 'member_organizations_backup_%'
+    loop
+        if to_date(right(backup_table, 10), 'DD_MM_YYYY') < current_date - interval '30 days' then
+            execute format('drop table if exists %I', backup_table);
+        end if;
+    end loop;
+end $$;

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +1 to +13
create table "llmPromptHistory" (
id bigserial primary key,
type varchar(255) not null,
model text not null,
"entityId" text null,
metadata jsonb null,
prompt text not null,
answer text not null,
"inputTokenCount" int not null,
"outputTokenCount" int not null,
"responseTimeSeconds" decimal not null,
"createdAt" timestamptz not null default now()
);

🛠️ Refactor suggestion

Consider additional constraints and optimizations

  1. The type column could be more efficient as an enum type
  2. The model column should have a CHECK constraint to validate allowed models
  3. The metadata column should have validation to ensure JSON structure
+create type llm_prompt_type as enum (
+    'organization_merge_suggestion',
+    'member_merge_suggestion'
+    -- add other types as needed
+);
+
 create table "llmPromptHistory" (
     id                    bigserial primary key,
-    type                  varchar(255) not null,
+    type                  llm_prompt_type not null,
     model                 text         not null,
+                         check (model in ('anthropic.claude-v2', 'anthropic.claude-v1')),
     "entityId"            text         null,
-    metadata              jsonb        null,
+    metadata              jsonb        null check (jsonb_typeof(metadata) = 'object'),
     prompt                text         not null,
     answer                text         not null,
     "inputTokenCount"     int          not null,
     "outputTokenCount"    int          not null,
     "responseTimeSeconds" decimal      not null,
     "createdAt"           timestamptz  not null default now()
 );

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +63 to +147
public async queryLlm(
type: LlmQueryType,
prompt: string,
entityId?: string,
metadata?: Record<string, unknown>,
saveHistory = true,
): Promise<ILlmResponse> {
const settings = LLM_SETTINGS[type]
if (!settings) {
throw new Error(`No settings found for LLM query type: ${type}`)
}

const client = this.client(settings)

const start = performance.now()
const end = () => {
const end = performance.now()
const duration = end - start
return Math.ceil(duration / 1000)
}

const command = new InvokeModelCommand({
body: JSON.stringify({
messages: [
{
role: 'user',
content: [
{
type: 'text',
text: prompt,
},
],
},
],
// eslint-disable-next-line @typescript-eslint/no-explicit-any
...(settings.arguments as any),
}),
modelId: settings.modelId,
accept: 'application/json',
contentType: 'application/json',
})

let res: InvokeModelCommandOutput
try {
res = await client.send(command)
} catch (err) {
this.log.error(err, { settings, prompt }, 'Failed to query LLM!')
throw err
}

const body = JSON.parse(res.body.transformToString())
const responseTimeSeconds = end()

const inputTokenCount = body.usage.input_tokens
const outputTokenCount = body.usage.output_tokens
const answer = body.content[0].text

const pricing = LLM_MODEL_PRICING_MAP[settings.modelId]

const inputCost = (inputTokenCount / 1000) * pricing.costPer1000InputTokens
const outputCost = (outputTokenCount / 1000) * pricing.costPer1000OutputTokens
const totalCost = inputCost + outputCost

this.log.info({ type, entityId, inputCost, outputCost, totalCost }, 'Estimated LLM cost!')

const result = {
prompt,
answer,
inputTokenCount,
outputTokenCount,
responseTimeSeconds,
model: settings.modelId,
}

if (saveHistory) {
try {
await this.repo.insertPromptHistoryEntry(type, settings.modelId, result, entityId, metadata)
} catch (err) {
this.log.error(err, 'Failed to save LLM prompt history entry!')
throw err
}
}

return result
}

🛠️ Refactor suggestion

Add rate limiting and timeout configuration

The LLM query implementation should include:

  1. Rate limiting to prevent overwhelming the AWS Bedrock service
  2. Timeout configuration to handle slow responses
+import { setTimeout } from 'timers/promises'
+
 export class LlmService extends LoggerBase {
+  private static readonly DEFAULT_TIMEOUT_MS = 30000
+  private static readonly DEFAULT_RETRY_COUNT = 3
+  private static readonly DEFAULT_RETRY_DELAY_MS = 1000
+
   public async queryLlm(
     type: LlmQueryType,
     prompt: string,
     entityId?: string,
     metadata?: Record<string, unknown>,
     saveHistory = true,
+    timeoutMs = LlmService.DEFAULT_TIMEOUT_MS,
   ): Promise<ILlmResponse> {
     // ... existing code ...

     let res: InvokeModelCommandOutput
+    let lastError: Error
+
+    for (let attempt = 1; attempt <= LlmService.DEFAULT_RETRY_COUNT; attempt++) {
       try {
-        res = await client.send(command)
+        res = await Promise.race([
+          client.send(command),
+          setTimeout(timeoutMs).then(() => {
+            throw new Error(`LLM query timed out after ${timeoutMs}ms`)
+          }),
+        ])
+        break
       } catch (err) {
-        this.log.error(err, { settings, prompt }, 'Failed to query LLM!')
-        throw err
+        lastError = err
+        if (attempt === LlmService.DEFAULT_RETRY_COUNT) {
+          this.log.error(err, { settings, prompt, attempts: attempt }, 'Failed to query LLM after all retries!')
+          throw err
+        }
+        this.log.warn(
+          { settings, prompt, attempt, error: err.message },
+          'LLM query failed, retrying...',
+        )
+        await setTimeout(LlmService.DEFAULT_RETRY_DELAY_MS)
       }
     }

Committable suggestion skipped: line range outside the PR's diff.
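The diff above covers timeouts and retries but leaves rate limiting open. A minimal in-process sketch of a concurrency cap (the Semaphore class and the limit of 5 are illustrative assumptions, not an existing utility; limiting across multiple workers would need external coordination, e.g. Redis):

// Sketch: cap concurrent Bedrock calls from this process.
// Wrap the send call as: res = await limiter.run(() => client.send(command))
class Semaphore {
  private available: number
  private readonly waiters: Array<() => void> = []

  constructor(maxConcurrent = 5) {
    this.available = maxConcurrent
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--
      return
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve))
  }

  private release(): void {
    const next = this.waiters.shift()
    if (next) {
      // Hand the slot directly to the next waiter so newcomers cannot barge past the queue.
      next()
    } else {
      this.available++
    }
  }

  public async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire()
    try {
      return await task()
    } finally {
      this.release()
    }
  }
}

const limiter = new Semaphore(5)

Note that Promise.race does not cancel the in-flight request; if true cancellation matters, the AWS SDK v3 send call also accepts an abort signal as a second argument (client.send(command, { abortSignal })), which pairs naturally with the timeout.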

Comment on lines +1 to +163

public async consolidateMemberEnrichmentData(
memberId: string,
prompt: string,
// eslint-disable-next-line @typescript-eslint/no-explicit-any
): Promise<ILlmResult<LlmMemberEnrichmentResult>> {
const response = await this.queryLlm(LlmQueryType.MEMBER_ENRICHMENT, prompt, memberId)

const result = JSON.parse(response.answer)

return {
result,
...response,
}
}
}

💡 Codebase verification

Add data sanitization and redaction for sensitive LLM data logging

The LLM service currently logs sensitive information without proper sanitization, including:

  • Complete prompt content
  • Full response content
  • Token counts and cost details
  • Entity IDs and metadata

While the rest of the codebase consistently uses sanitization utilities like sanitizeHtml and redactNullByte, the LLM service lacks similar protections. To maintain security best practices, you should:

  1. Create a sanitization helper for LLM data (a sketch follows this list) that:

    • Redacts sensitive content from prompts and responses
    • Masks token counts and cost information
    • Sanitizes entity IDs and metadata before logging
  2. Update the logging calls in llm.service.ts to use the sanitization helper:

this.log.info({ 
  type, 
  entityId,
  // Sanitize cost info
  costs: maskSensitiveData({ inputCost, outputCost, totalCost })
}, 'Estimated LLM cost!')
  3. Consider moving sensitive data logging to debug level instead of info level for better control.
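A minimal sketch of such a helper (the name maskSensitiveData and the redaction rules are assumptions, not an existing codebase utility):

// Sketch: redact sensitive values from a log context before structured logging.
// Extend SENSITIVE_KEYS to match whatever the log context actually carries.
const SENSITIVE_KEYS = new Set(['prompt', 'answer', 'inputCost', 'outputCost', 'totalCost'])

export function maskSensitiveData(ctx: Record<string, unknown>): Record<string, unknown> {
  const masked: Record<string, unknown> = {}
  for (const [key, value] of Object.entries(ctx)) {
    if (!SENSITIVE_KEYS.has(key)) {
      masked[key] = value
    } else if (typeof value === 'string') {
      // Keep only enough to correlate log lines, never the payload itself.
      masked[key] = `<redacted: ${value.length} chars>`
    } else {
      masked[key] = '<redacted>'
    }
  }
  return masked
}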
🔗 Analysis chain

Enhance security measures for sensitive data handling

Consider implementing these security enhancements:

  1. Sanitize sensitive data before logging
  2. Implement AWS credential rotation
  3. Add input validation for prompts to prevent injection attacks

Let me analyze the logging patterns in the LLM service code and compare it with the rest of the codebase to verify the security concerns.

From the logs search results, I can see that:

  1. The LLM service logs sensitive information like:
  • Full prompt content
  • Full response content
  • Cost information including token counts
  • Entity IDs and metadata
  2. The logging patterns in other services show they generally:
  • Redact sensitive data
  • Log at appropriate levels (debug/info/warn/error)
  • Include structured metadata
  • Focus on operational data rather than payload content

Let me verify if there are any sanitization helpers being used in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for potential sensitive data logging
rg -i "log\.(info|warn|error|debug)" -A 2 -B 2

Length of output: 492624


Script:

#!/bin/bash
# Look for sanitization/redaction utilities
rg -g '!dist' -g '!node_modules' -g '!*.map' "sanitize|redact|mask" 

Length of output: 18758
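For the third point above (prompt input validation), a minimal sketch (the length cap and the stripped character ranges are assumptions; tune both to the actual prompt templates):

// Sketch: basic guards before sending user-influenced text to the model.
const MAX_PROMPT_LENGTH = 100_000

export function validatePrompt(prompt: string): string {
  if (prompt.trim().length === 0) {
    throw new Error('Prompt must not be empty')
  }
  if (prompt.length > MAX_PROMPT_LENGTH) {
    throw new Error(`Prompt exceeds ${MAX_PROMPT_LENGTH} characters`)
  }
  // Strip control characters that have no business in a prompt.
  // eslint-disable-next-line no-control-regex
  return prompt.replace(/[\u0000-\u0008\u000b\u000c\u000e-\u001f]/g, '')
}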

Comment on lines +16 to +76
export async function fetchMemberDataForLLMSquashing(
db: DbStore,
memberId: string,
): Promise<IMemberOriginalData | null> {
const result = await db.connection().oneOrNone(
`
with member_orgs as (select distinct mo."memberId",
mo."organizationId" as "orgId",
o."displayName" as "orgName",
mo.title as "jobTitle",
mo."dateStart",
mo."dateEnd",
mo.source
from "memberOrganizations" mo
inner join organizations o on mo."organizationId" = o.id
where mo."memberId" = $(memberId)
and mo."deletedAt" is null
and o."deletedAt" is null)
select m."displayName",
m.attributes,
m."manuallyChangedFields",
m."tenantId",
coalesce(
(select json_agg(
(select row_to_json(r)
from (select mi.type,
mi.platform,
mi.value) r)
)
from "memberIdentities" mi
where mi."memberId" = m.id
and verified = true), '[]'::json) as identities,
case
when exists (select 1 from member_orgs where "memberId" = m.id)
then (
select json_agg(
(select row_to_json(r)
from (select mo."orgId",
mo."orgName",
mo."jobTitle",
mo."dateStart",
mo."dateEnd",
mo.source) r)
)
from member_orgs mo
where mo."memberId" = m.id
)
else '[]'::json
end as organizations
from members m
where m.id = $(memberId)
and m."deletedAt" is null
group by m.id, m."displayName", m.attributes, m."manuallyChangedFields";
`,
{
memberId,
},
)

return result
}

🛠️ Refactor suggestion

Optimize query performance in fetchMemberDataForLLMSquashing

The query structure could be optimized for better performance:

  1. The GROUP BY clause includes unnecessary columns: m.id is the primary key, so the remaining columns are functionally dependent on it and grouping by m.id alone is sufficient.
  2. PostgreSQL does not support index hints, so instead make sure supporting indexes exist for the JOIN columns.

Apply this diff to optimize the query:

-    group by m.id, m."displayName", m.attributes, m."manuallyChangedFields";
+    group by m.id;

Additionally, consider adding the following indexes if they don't exist:

CREATE INDEX IF NOT EXISTS idx_member_organizations_member_id ON "memberOrganizations" ("memberId");
CREATE INDEX IF NOT EXISTS idx_member_identities_member_id ON "memberIdentities" ("memberId");

Answer with JSON only and nothing else. Ensure the response is complete and valid JSON.
`

const data = await llmService.consolidateMemberEnrichmentData(memberId, prompt)

⚠️ Potential issue

Add error handling for LLM service call

The LLM service call lacks error handling. Add a try-catch block to handle potential failures gracefully, declaring data with let outside the try so it stays in scope for the code that follows.

Add error handling:

- const data = await llmService.consolidateMemberEnrichmentData(memberId, prompt)
+ let data: ILlmResult<LlmMemberEnrichmentResult>
+ try {
+   data = await llmService.consolidateMemberEnrichmentData(memberId, prompt)
+ } catch (error) {
+   svc.log.error({ memberId, error }, 'Failed to consolidate member enrichment data')
+   return false
+ }
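Relatedly, consolidateMemberEnrichmentData parses the model's answer with a bare JSON.parse, which throws on truncated or non-JSON output even though the prompt asks for JSON only. A minimal guard sketch (the fence-stripping fallback and the null return are assumptions):

// Sketch: tolerate malformed LLM output instead of letting JSON.parse throw.
function parseLlmAnswer<T>(answer: string, log: { warn: (ctx: object, msg: string) => void }): T | null {
  try {
    return JSON.parse(answer) as T
  } catch {
    // Models sometimes wrap JSON in markdown fences; try stripping them once.
    const stripped = answer.replace(/^```(?:json)?\s*/i, '').replace(/\s*```$/, '')
    try {
      return JSON.parse(stripped) as T
    } catch {
      log.warn({ answerLength: answer.length }, 'LLM answer was not valid JSON')
      return null
    }
  }
}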

Comment on lines +210 to +217
const llmService = new LlmService(
svc.postgres.writer,
{
accessKeyId: process.env['CROWD_AWS_BEDROCK_ACCESS_KEY_ID'],
secretAccessKey: process.env['CROWD_AWS_BEDROCK_SECRET_ACCESS_KEY'],
},
svc.log,
)

⚠️ Potential issue

Validate AWS credentials before LLM service initialization

The code directly uses AWS credentials from environment variables without validation. Consider adding validation to prevent runtime errors.

Add credential validation:

+ const awsAccessKeyId = process.env['CROWD_AWS_BEDROCK_ACCESS_KEY_ID']
+ const awsSecretKey = process.env['CROWD_AWS_BEDROCK_SECRET_ACCESS_KEY']
+ if (!awsAccessKeyId || !awsSecretKey) {
+   throw new Error('AWS Bedrock credentials are not configured')
+ }

const llmService = new LlmService(
  svc.postgres.writer,
  {
-   accessKeyId: process.env['CROWD_AWS_BEDROCK_ACCESS_KEY_ID'],
-   secretAccessKey: process.env['CROWD_AWS_BEDROCK_SECRET_ACCESS_KEY'],
+   accessKeyId: awsAccessKeyId,
+   secretAccessKey: awsSecretKey,
  },
  svc.log,
)

Comment on lines +347 to +483
platform: toUpdate.p,
type: toUpdate.t as MemberIdentityType,
value: toUpdate.v,
verified: toUpdate.ve,
}),
)
} else {
svc.log.warn({ memberId, toUpdate }, 'Unknown identity type!')
}
}
}

if (data.result.changes.identities.new) {
for (const toAdd of data.result.changes.identities.new) {
if (identityTypes.includes(toAdd.t as MemberIdentityType)) {
svc.log.info({ memberId, toAdd }, 'Adding new identity!')
promises.push(
upsertMemberIdentity(qx, {
memberId,
tenantId: existingMemberData.tenantId,
platform: toAdd.p,
type: toAdd.t as MemberIdentityType,
value: toAdd.v,
verified: toAdd.ve,
}),
)
} else {
svc.log.warn({ memberId, toAdd }, 'Unknown identity type!')
}
}
}
}

// process organizations
if (data.result.changes.organizations) {
const sources = Object.values(OrganizationSource)

if (data.result.changes.organizations.newConns) {
for (const conn of data.result.changes.organizations.newConns) {
if (sources.includes(conn.s as OrganizationSource)) {
svc.log.info({ memberId, conn }, 'Adding new connection to existing organization!')
promises.push(
insertWorkExperience(
tx.transaction(),
memberId,
conn.orgId,
conn.t,
conn.ds,
conn.de,
conn.s as OrganizationSource,
),
)
} else {
svc.log.warn({ memberId, conn }, 'Unknown organization source!')
}
}
}

if (data.result.changes.organizations.newOrgs) {
for (const org of data.result.changes.organizations.newOrgs) {
svc.log.info({ memberId, org }, 'Adding new organization!')
promises.push(
findOrCreateOrganization(
qx,
existingMemberData.tenantId,
OrganizationAttributeSource.ENRICHMENT,
{
displayName: org.n,
identities: org.i.map((i) => {
return {
type: i.t as OrganizationIdentityType,
platform: i.p,
value: i.v,
verified: i.ve,
}
}),
},
).then((orgId) =>
insertWorkExperience(
tx.transaction(),
memberId,
orgId,
org.conn.t,
org.conn.ds,
org.conn.de,
org.conn.s as OrganizationSource,
),
),
)
}
}
}

// also touch members.lastEnriched date
promises.push(
updateLastEnrichedDate(tx.transaction(), memberId, existingMemberData.tenantId),
)

await Promise.all(promises)
})

⚠️ Potential issue

Add transaction error handling and sanitize logs

The transaction block handles multiple database operations but lacks structured error handling. Additionally, some log statements might expose sensitive data.

Add error handling and sanitize logs:

+ try {
  await svc.postgres.writer.transactionally(async (tx) => {
    // ... existing code ...
  })
+ } catch (error) {
+   svc.log.error({ memberId, error: error.message }, 'Failed to process member data updates')
+   throw error
+ }

  // Update log statements to exclude sensitive data
- svc.log.info({ memberId, toUpdate }, 'Updating verified flag for identity!')
+ svc.log.info({ memberId, platform: toUpdate.p }, 'Updating verified flag for identity!')

- svc.log.info({ memberId, toAdd }, 'Adding new identity!')
+ svc.log.info({ memberId, platform: toAdd.p }, 'Adding new identity!')

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 28 to 60
async function getEnrichableMembers(limit: number, lastMemberId?: string): Promise<string[]> {
const query = `
-- only use members that have more than one enrichment source
with members_with_sources as (select distinct "memberId", count(*)
from "memberEnrichmentCache"
group by "memberId"
having count(*) > 1),
-- also only use members that have more than 10 activities
members_with_activities as (select distinct msa."memberId", sum("activityCount") as total_activities
from members_with_sources ms
inner join "memberSegmentsAgg" msa on msa."memberId" = ms."memberId"
-- only consider subprojects otherwise we count some activities multiple times
inner join segments s on s.id = msa."segmentId" and s."tenantId" = $(tenantId) and s.type = 'subproject'
group by msa."memberId"
having sum("activityCount") > $(minMemberActivities))
select m.id
from members m
inner join members_with_activities ma on m.id = ma."memberId"
where m."deletedAt" is null and m."tenantId" = $(tenantId)
${lastMemberId ? `and m.id > $(lastMemberId)` : ''}
and (m."lastEnriched" is null
or m."lastEnriched" < now() - interval '3 months')
order by ma.total_activities desc, m.id
limit $(limit)
`

return (
await svc.postgres.writer
.connection()
.any(query, { lastMemberId, limit, tenantId, minMemberActivities })
).map((row) => row.id)
}


⚠️ Potential issue

Add error handling for database query

The function getEnrichableMembers lacks error handling for the database query. To make the script more robust, consider adding a try-catch block to handle potential query failures.

Apply this diff to add error handling:

 async function getEnrichableMembers(limit: number, lastMemberId?: string): Promise<string[]> {
   const query = `
   -- SQL query omitted for brevity
   `
+  try {
     return (
       await svc.postgres.writer
         .connection()
         .any(query, { lastMemberId, limit, tenantId, minMemberActivities })
     ).map((row) => row.id)
+  } catch (error) {
+    svc.log.error({ error }, 'Failed to fetch enrichable members')
+    throw error
+  }
 }

Comment on lines +468 to +475
for (const identity of data.identities.filter((i) =>
[
OrganizationIdentityType.PRIMARY_DOMAIN,
OrganizationIdentityType.ALTERNATIVE_DOMAIN,
].includes(i.type),
)) {
identity.value = websiteNormalizer(identity.value, false)
}

🛠️ Refactor suggestion

Avoid mutating input data directly to prevent unintended side effects

The loop normalizes website identities by modifying identity.value directly in data.identities. This could lead to unintended side effects if data is used elsewhere after this function.

Consider creating a copy of the identities or the values before normalization to maintain data integrity:

- for (const identity of data.identities.filter((i) =>
+ const identitiesCopy = data.identities.map(identity => ({ ...identity }))
+ for (const identity of identitiesCopy.filter((i) =>
    [
      OrganizationIdentityType.PRIMARY_DOMAIN,
      OrganizationIdentityType.ALTERNATIVE_DOMAIN,
    ].includes(i.type),
  )) {
    identity.value = websiteNormalizer(identity.value, false)
  }
+ // Use identitiesCopy for further processing instead of data.identities

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
services/apps/premium/members_enrichment_worker/src/bin/onboarding.ts (1)

9-15: Consider extracting configuration to a dedicated config file

While the configuration setup is clear, having environment variables and service configuration inline makes it less maintainable. Consider extracting this to a dedicated configuration file. Note that opensearch is set on svc.options rather than svc.config, so it needs its own object:

+// config/worker.ts
+export const workerConfig = {
+  envvars: [],
+  producer: { enabled: false },
+  redis: { enabled: true },
+  temporal: { enabled: false },
+  questdb: { enabled: false },
+}
+
+export const workerOptions = {
+  opensearch: { enabled: false },
+}

-process.env['CROWD_TEMPORAL_TASKQUEUE'] = 'not-important'
-svc.config.envvars = []
-svc.config.producer = { enabled: false }
-svc.config.redis = { enabled: true }
-svc.config.temporal = { enabled: false }
-svc.config.questdb = { enabled: false }
-svc.options.opensearch = { enabled: false }
+import { workerConfig, workerOptions } from './config/worker'
+process.env['CROWD_TEMPORAL_TASKQUEUE'] = 'not-important'
+Object.assign(svc.config, workerConfig)
+Object.assign(svc.options, workerOptions)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1aba439 and d47bed7.

📒 Files selected for processing (3)
  • services/apps/premium/members_enrichment_worker/package.json (2 hunks)
  • services/apps/premium/members_enrichment_worker/src/bin/onboarding.ts (1 hunks)
  • services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • services/apps/premium/members_enrichment_worker/package.json
  • services/apps/premium/members_enrichment_worker/src/workflows/enrichMember.ts
🔇 Additional comments (2)
services/apps/premium/members_enrichment_worker/src/bin/onboarding.ts (2)

17-24: Refer to previous review comment about command-line argument handling

The previous review comment about enhancing command-line argument handling with proper validation and error messaging is still applicable.


29-61: Refer to previous review comments about query optimization and error handling

The previous review comments about adding type definitions, query optimization, and error handling are still applicable.

Comment on lines +26 to +27
const minMemberActivities = 100
const maxConcurrentProcessing = 5

🛠️ Refactor suggestion

Make constants configurable and add documentation

The hardcoded values should be configurable and documented:

-const minMemberActivities = 100
-const maxConcurrentProcessing = 5
+/**
+ * Minimum number of activities a member must have to be considered for enrichment
+ * @default 100
+ */
+const minMemberActivities = Number(process.env.MIN_MEMBER_ACTIVITIES ?? 100)
+
+/**
+ * Maximum number of members that can be processed concurrently
+ * @default 5
+ */
+const maxConcurrentProcessing = Number(process.env.MAX_CONCURRENT_PROCESSING ?? 5)
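With the env-driven defaults in place, a run could then be tuned per invocation (the package-manager command is illustrative; the script:onboarding entry is the one added in this PR):

MIN_MEMBER_ACTIVITIES=250 MAX_CONCURRENT_PROCESSING=10 npm run script:onboarding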

Comment on lines +65 to +156
setImmediate(async () => {
await svc.init(false)

let processingCount = 0
let updatedMembersCount = 0
let skippedMembersCount = 0
let failedMembersCount = 0

let totalProcessingTime = 0
let processedMembersCount = 0
const REPORT_INTERVAL = 10

const pageSize = 100
let members = await getEnrichableMembers(pageSize)
let pagePromises: Promise<void>[] = []
while (members.length > 0) {
svc.log.info({ memberCount: members.length }, 'Processing members!')
// process members just like in enrichMember workflow
for (const memberId of members) {
while (processingCount >= maxConcurrentProcessing) {
await timeout(100)
}

processingCount++
const startTime = Date.now()

const promise = processMemberSources(memberId, sources)
.then((res) => {
processingCount--
if (res) {
const processingTime = Date.now() - startTime
totalProcessingTime += processingTime
processedMembersCount++

updatedMembersCount++
} else {
skippedMembersCount++
}

// Report average processing time every REPORT_INTERVAL members
if (processedMembersCount > 0 && processedMembersCount % REPORT_INTERVAL === 0) {
const averageProcessingTime = totalProcessingTime / processedMembersCount
svc.log.info(
{
averageProcessingTime: `${(averageProcessingTime / 1000).toFixed(2)}s`,
processedMembers: processedMembersCount,
updatedMembers: updatedMembersCount,
skippedMembers: skippedMembersCount,
failedMembers: failedMembersCount,
},
'Processing time statistics',
)
}
})
.catch((err) => {
processingCount--
svc.log.error(err, { memberId }, 'Error while processing member enrichment sources!')
failedMembersCount++
})
pagePromises.push(promise)
}

await Promise.all(pagePromises)
pagePromises = []
// load next page
members = await getEnrichableMembers(pageSize, members[members.length - 1])

svc.log.info(
{
updatedMembersCount,
skippedMembersCount,
failedMembersCount,
averageProcessingTime: `${(totalProcessingTime / processedMembersCount / 1000).toFixed(2)}s`,
},
'Current statistics!',
)
}

svc.log.info(
{
updatedMembersCount,
skippedMembersCount,
failedMembersCount,
averageProcessingTime: `${(totalProcessingTime / processedMembersCount / 1000).toFixed(2)}s`,
totalProcessedMembers: processedMembersCount,
totalProcessingTime: `${(totalProcessingTime / 1000).toFixed(2)}s`,
},
'Final statistics!',
)

process.exit(0)
})

⚠️ Potential issue

Add graceful shutdown handling

The script should handle process signals (SIGTERM, SIGINT) gracefully to ensure proper cleanup and completion of in-progress tasks. The counters the handler reads (processingCount and the statistics) must be hoisted to module scope so both the handler and the main loop can see them.

+let isShuttingDown = false
+
+// Hoisted to module scope so the shutdown handler can read them;
+// remove the corresponding `let` declarations inside setImmediate.
+let processingCount = 0
+let updatedMembersCount = 0
+let skippedMembersCount = 0
+let failedMembersCount = 0
+let totalProcessingTime = 0
+let processedMembersCount = 0
+
+async function shutdown() {
+  if (isShuttingDown) return
+  isShuttingDown = true
+
+  svc.log.info('Graceful shutdown initiated')
+
+  // Wait for in-progress tasks
+  while (processingCount > 0) {
+    svc.log.info({ processingCount }, 'Waiting for in-progress tasks to complete')
+    await timeout(1000)
+  }
+
+  // Log final statistics
+  svc.log.info(
+    {
+      updatedMembersCount,
+      skippedMembersCount,
+      failedMembersCount,
+      averageProcessingTime: `${(totalProcessingTime / processedMembersCount / 1000).toFixed(2)}s`,
+      totalProcessedMembers: processedMembersCount,
+      totalProcessingTime: `${(totalProcessingTime / 1000).toFixed(2)}s`,
+    },
+    'Final statistics before shutdown!',
+  )
+
+  await svc.destroy()
+  process.exit(0)
+}
+
+process.on('SIGTERM', shutdown)
+process.on('SIGINT', shutdown)
+
 setImmediate(async () => {
   await svc.init(false)
 
   // ... existing code, minus the counter declarations now at module scope ...
 
   while (members.length > 0) {
+    if (isShuttingDown) break
     // ... existing code ...
   }
 
   // ... existing code ...
 })
