feat: scheduled task for importing video learning events #1921

Merged
4 commits merged on Oct 23, 2024

Conversation

jo-elimu
Member

Issue Number

Purpose

  • Import event data from CSV files to the webapp's database.

Technical Details

  • Implemented a batch job running once per hour.

Testing Instructions

  • Change to @Scheduled(cron="00 * * * * *") in VideoLearningEventImportScheduler to run the code once per minute.

Screenshots

  • N/A

Format Checks

Note

Files in PRs are automatically checked for format violations with mvn spotless:check.

If this PR contains files with format violations, run mvn spotless:apply to fix them.

@jo-elimu jo-elimu requested a review from a team as a code owner October 23, 2024 09:58
@jo-elimu jo-elimu linked an issue Oct 23, 2024 that may be closed by this pull request
6 tasks
@jo-elimu jo-elimu self-assigned this Oct 23, 2024

codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 60.25641% with 31 lines in your changes missing coverage. Please review.

Project coverage is 15.78%. Comparing base (e65a7d4) to head (7188202).
Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...s/analytics/VideoLearningEventImportScheduler.java | 6.45% | 29 Missing ⚠️ |
| ...i/elimu/util/csv/CsvAnalyticsExtractionHelper.java | 94.44% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1921      +/-   ##
============================================
+ Coverage     15.05%   15.78%   +0.72%     
- Complexity      457      477      +20     
============================================
  Files           250      252       +2     
  Lines          7731     7806      +75     
  Branches        806      816      +10     
============================================
+ Hits           1164     1232      +68     
- Misses         6517     6524       +7     
  Partials         50       50              


Contributor

coderabbitai bot commented Oct 23, 2024

Walkthrough

This pull request adds the VideoLearningEventDao interface and its JPA implementation, VideoLearningEventDaoJpa. A new service class, VideoLearningEventImportScheduler, automates the import of video learning events from CSV files into the database, using a new extraction method added to CsvAnalyticsExtractionHelper. The getAndroidId method of LearningEvent now returns the Android ID without obfuscation. A test class covering the CSV extraction is also included.

Changes

| File Path | Change Summary |
|---|---|
| src/main/java/ai/elimu/dao/VideoLearningEventDao.java | Added interface VideoLearningEventDao with method read(Calendar timestamp, String androidId, String packageName, String videoTitle) |
| src/main/java/ai/elimu/dao/jpa/VideoLearningEventDaoJpa.java | Added class VideoLearningEventDaoJpa implementing VideoLearningEventDao with method read(...) |
| src/main/java/ai/elimu/model/analytics/LearningEvent.java | Updated getAndroidId method to return the Android ID without obfuscation |
| src/main/java/ai/elimu/tasks/analytics/VideoLearningEventImportScheduler.java | Added class VideoLearningEventImportScheduler with method execute() for scheduled CSV import |
| src/main/java/ai/elimu/util/csv/CsvAnalyticsExtractionHelper.java | Added method extractVideoLearningEvents(File csvFile) for extracting video learning events from CSV files |
| src/test/java/ai/elimu/util/csv/CsvAnalyticsExtractionHelperTest.java | Added test class CsvAnalyticsExtractionHelperTest with method testExtractVideoLearningEvents() |
| src/test/java/ai/elimu/dao/VideoLearningEventDaoJpaTest.java | Added test class VideoLearningEventDaoJpaTest with method testRead() for validating DAO functionality |

Assessment against linked issues

Objective Addressed Explanation
Create scheduled task importing CSVs to database
Include unit tests

Possibly related PRs

  • feat(rest): video learning events #1904: The VideoLearningEventsRestController interacts with VideoLearningEventDao, which is directly related to the new VideoLearningEventDao interface and its implementation in the main PR.
  • feat(entity): video learning event #1911: The introduction of the VideoLearningEvent class is directly related to the VideoLearningEventDao interface, as it defines the entity that the DAO will manage.
  • Include additional data during data collection #1914: The addition of the additionalData field to the LearningEvent class may relate to the VideoLearningEvent as it extends LearningEvent, potentially impacting how data is handled in the context of video learning events.

Suggested reviewers

  • vrudas
  • Souvik-Cyclic
  • nya-elimu
  • jpatel3
  • alexander-kuruvilla


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 18

🧹 Outside diff range and nitpick comments (10)
src/main/java/ai/elimu/dao/VideoLearningEventDao.java (2)

9-9: Add JavaDoc documentation for the read method.

The method signature could benefit from documentation explaining:

  • The purpose and expected behavior
  • Parameter requirements and constraints
  • When DataAccessException is thrown
  • Expected behavior when no matching event is found
+    /**
+     * Reads a video learning event based on unique identifying criteria.
+     *
+     * @param timestamp The time when the event occurred
+     * @param androidId The ID of the Android device
+     * @param packageName The package name of the video application
+     * @param videoTitle The title of the video
+     * @return The matching VideoLearningEvent or null if not found
+     * @throws DataAccessException if there's an error accessing the data store
+     */
     VideoLearningEvent read(Calendar timestamp, String androidId, String packageName, String videoTitle) throws DataAccessException;

7-10: Consider adding batch operations for CSV import efficiency.

Given that this DAO will be used in a scheduled task for importing CSV data, consider adding methods to support efficient batch operations:

  1. Batch existence check
  2. Batch insert

Example additions:

    /**
     * Checks for existing events in bulk to optimize CSV import.
     *
     * @param events List of events to check
     * @return Map of composite keys to existing events
     */
    Map<String, VideoLearningEvent> readBatch(List<VideoLearningEvent> events);

    /**
     * Persists multiple events efficiently in a single batch.
     *
     * @param events List of events to persist
     */
    void createBatch(List<VideoLearningEvent> events);
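
The batch existence check suggested above could be backed by a lookup keyed on the four identifying fields. The following is only a minimal in-memory sketch under stated assumptions: the `Event` class, `compositeKey`, and `filterNew` are hypothetical stand-ins and not part of the elimu.ai codebase, and the key format is illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: de-duplicating imported events by a composite key. */
final class BatchDedupSketch {

    /** Simplified stand-in for VideoLearningEvent (an assumption, not the real entity). */
    static final class Event {
        final long timestampMillis;
        final String androidId;
        final String packageName;
        final String videoTitle;

        Event(long timestampMillis, String androidId, String packageName, String videoTitle) {
            this.timestampMillis = timestampMillis;
            this.androidId = androidId;
            this.packageName = packageName;
            this.videoTitle = videoTitle;
        }
    }

    /** Joins the four identifying fields; the "|" separator is an arbitrary choice. */
    static String compositeKey(Event e) {
        return e.timestampMillis + "|" + e.androidId + "|" + e.packageName + "|" + e.videoTitle;
    }

    /** Returns the events whose key is neither already stored nor repeated within the batch. */
    static List<Event> filterNew(List<Event> batch, Set<String> existingKeys) {
        Set<String> seen = new HashSet<>(existingKeys); // copy so the caller's set is untouched
        List<Event> fresh = new ArrayList<>();
        for (Event e : batch) {
            if (seen.add(compositeKey(e))) { // add() returns false for a key seen before
                fresh.add(e);
            }
        }
        return fresh;
    }
}
```

A string key like this can collide if a field itself contains the separator, so a real implementation would more likely query on the individual columns or use a proper key type.
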
src/main/java/ai/elimu/dao/jpa/VideoLearningEventDaoJpa.java (3)

27-27: Use parameterized logging instead of string concatenation

Replace string concatenation with parameterized logging ({} placeholders, as supported by both SLF4J and Log4j 2) so the message is only built when the log level is enabled.

-            logger.info("VideoLearningEvent (" + timestamp.getTimeInMillis() + ", " + androidId + ", " + packageName + ", \"" + videoTitle + "\") was not found");
+            logger.info("VideoLearningEvent ({}, {}, {}, \"{}\") was not found", timestamp.getTimeInMillis(), androidId, packageName, videoTitle);

14-25: Consider adding an index for query performance

The query performs exact matches on four columns. For better performance, consider adding a composite index on these columns.

Example index creation SQL:

CREATE INDEX idx_video_learning_event 
ON video_learning_event (timestamp, android_id, package_name, video_title);

26-29: Consider using DEBUG level for expected "not found" case

The current log level is INFO, but a "not found" result is an expected scenario during normal operation. Consider using DEBUG level instead.

-            logger.info("VideoLearningEvent (" + timestamp.getTimeInMillis() + ", " + androidId + ", " + packageName + ", \"" + videoTitle + "\") was not found");
+            logger.debug("VideoLearningEvent ({}, {}, {}, \"{}\") was not found", timestamp.getTimeInMillis(), androidId, packageName, videoTitle);
src/main/java/ai/elimu/model/analytics/LearningEvent.java (1)

68-68: Document the security considerations for Android ID handling.

Add a comment explaining why the Android ID is no longer obfuscated and any security measures that should be considered when handling this data.

     public String getAndroidId() {
+        // Note: Android ID is returned without obfuscation to support proper event correlation
+        // during data imports. Ensure proper access controls and encryption are in place when
+        // handling this identifier.
         return androidId;
     }
src/test/java/ai/elimu/util/csv/CsvAnalyticsExtractionHelperTest.java (2)

23-23: Make logger field private static final.

Following best practices for logger declarations in Java classes.

-    private Logger logger = LogManager.getLogger();
+    private static final Logger LOGGER = LogManager.getLogger();

21-21: Consider adding integration test.

Since this is part of a scheduled task for importing video learning events, consider adding an integration test that verifies:

  1. The scheduler picks up new CSV files
  2. Events are properly persisted to the database
  3. Duplicate files/events are handled correctly

This could be implemented as a separate integration test class using @SpringBootTest and an embedded database.

src/main/java/ai/elimu/util/csv/CsvAnalyticsExtractionHelper.java (1)

8-8: Remove unused import.

The WordLearningEvent import is not used in this file.

-import ai.elimu.model.analytics.WordLearningEvent;
src/main/java/ai/elimu/tasks/analytics/VideoLearningEventImportScheduler.java (1)

83-85: Enhance log message with event details when skipping duplicates

Including event details in the log message can aid in debugging and provide more context.

Apply this diff to improve the log message:

                         logger.warn("The event has already been stored in the database. Skipping data import.");
+                        logger.warn("Duplicate event details: Timestamp={}, Android ID={}, Package Name={}, Video Title={}",
+                            event.getTimestamp(), event.getAndroidId(), event.getPackageName(), event.getVideoTitle());
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e65a7d4 and e81b357.

⛔ Files ignored due to path filters (3)
  • pom.xml is excluded by !**/*.xml
  • src/main/webapp/WEB-INF/spring/applicationContext-jpa.xml is excluded by !**/*.xml
  • src/test/resources/ai/elimu/util/csv/e387e38700000001_3001018_video-learning-events_2024-10-09.csv is excluded by !**/*.csv
📒 Files selected for processing (6)
  • src/main/java/ai/elimu/dao/VideoLearningEventDao.java (1 hunks)
  • src/main/java/ai/elimu/dao/jpa/VideoLearningEventDaoJpa.java (1 hunks)
  • src/main/java/ai/elimu/model/analytics/LearningEvent.java (1 hunks)
  • src/main/java/ai/elimu/tasks/analytics/VideoLearningEventImportScheduler.java (1 hunks)
  • src/main/java/ai/elimu/util/csv/CsvAnalyticsExtractionHelper.java (3 hunks)
  • src/test/java/ai/elimu/util/csv/CsvAnalyticsExtractionHelperTest.java (1 hunks)
🧰 Additional context used
🔇 Additional comments (3)
src/main/java/ai/elimu/dao/jpa/VideoLearningEventDaoJpa.java (1)

1-9: LGTM: Class structure follows DAO pattern best practices

The class properly extends GenericDaoJpa and implements the DAO interface, following the standard DAO pattern implementation.

src/main/java/ai/elimu/model/analytics/LearningEvent.java (1)

68-68: Verify security implications of exposing full Android ID.

The removal of Android ID obfuscation logic means that device identifiers are now exposed in their complete form. While this change appears necessary for proper event correlation during CSV imports, we should verify the security implications.

Consider implementing the following security measures if not already in place:

  1. Ensure Android IDs are encrypted at rest in the database
  2. Implement proper access controls for endpoints exposing this data
  3. Add audit logging for access to unobfuscated Android IDs
✅ Verification successful

Based on the shell script results, I can now provide a comprehensive analysis of how Android IDs are handled across the codebase. Let me generate the final response:

Android ID handling appears secure with appropriate safeguards in place

The codebase demonstrates several security measures for handling Android IDs:

  1. Database Security:

    • Android IDs are stored with proper constraints (@NotNull, @Column(unique=true))
    • The Device table has a unique constraint on the androidId column
    • All database queries use parameterized statements preventing SQL injection
  2. Access Controls:

    • Android IDs are only exposed in admin-facing JSPs under /WEB-INF/jsp/analytics/
    • REST endpoints handling Android IDs implement proper validation
  3. Data Minimization:

    • The code masks Android IDs in logs and notifications (e.g., 7161***51cd)
    • Android IDs are only used where necessary for analytics correlation
  4. File System Security:

    • Analytics files containing Android IDs are stored in structured directories with proper access controls
    • Files are organized by language and version code, limiting exposure

The removal of Android ID obfuscation in LearningEvent.getAndroidId() is acceptable because:

  1. Other security measures remain in place
  2. Full Android IDs are necessary for proper analytics correlation
  3. The change aligns with the system's data processing requirements
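
The masked form mentioned above (e.g. 7161***51cd) could be produced by a helper along these lines. This is a hedged sketch: the class and method names are hypothetical, and the project's actual masking code may differ.

```java
/** Hypothetical sketch: masking an Android ID for logs and notifications. */
final class AndroidIdMaskSketch {

    /** Keeps the first and last four characters and hides the rest. */
    static String mask(String androidId) {
        // IDs too short to mask meaningfully are hidden completely (an assumption).
        if (androidId == null || androidId.length() <= 8) {
            return "***";
        }
        return androidId.substring(0, 4) + "***" + androidId.substring(androidId.length() - 4);
    }
}
```

With a made-up 16-character ID such as `7161a85a0e4751cd`, `mask` returns `7161***51cd`, so the full identifier never reaches the log output.
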
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how Android IDs are handled across the codebase
# to ensure consistent and secure handling

# Check for any existing security measures or obfuscation of Android IDs
echo "Checking for Android ID handling patterns..."
rg -l "androidId" | while read -r file; do
  echo "=== $file ==="
  rg -A 5 -B 5 "androidId" "$file"
done

# Look for privacy/security related documentation
echo -e "\nChecking for privacy documentation..."
rg -l -g "*.md" -g "*.txt" -i "privacy|gdpr|data protection"

Length of output: 39915

src/main/java/ai/elimu/util/csv/CsvAnalyticsExtractionHelper.java (1)

108-170: Verify the impact on the scheduled task.

The method is used by the VideoLearningEventImportScheduler. Ensure that the added validations and error handling align with the scheduler's requirements for handling failed imports.

✅ Verification successful

Let me check the scheduler implementation to understand how it handles errors from the extraction method.


No changes needed for error handling alignment

The review comment's concern about error handling alignment can be resolved. Here's why:

  • The scheduler already has a robust error handling approach:
    • It logs warnings for duplicate events and skips them gracefully
    • The extractVideoLearningEvents method returns an empty list on IOException, which the scheduler can handle safely
    • The scheduler processes files independently, so a failure in one file doesn't affect others

The current implementation of extractVideoLearningEvents aligns well with the scheduler's requirements as it:

  • Handles IOExceptions gracefully by logging them
  • Returns a valid (though empty) list even in error cases
  • Maintains the same contract that the scheduler expects
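
The contract described above (log the IOException and still return a valid, empty list) can be illustrated with a simplified sketch. It does not reproduce extractVideoLearningEvents: the class name, the `extractRows` method, the naive comma split, and the assumption that the first line is a header are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: CSV extraction that degrades to an empty list on I/O failure. */
final class CsvExtractionSketch {

    /** Returns the data rows (header skipped), or an empty list if the file cannot be read. */
    static List<String[]> extractRows(Path csvFile) {
        List<String[]> rows = new ArrayList<>();
        try {
            List<String> lines = Files.readAllLines(csvFile, StandardCharsets.UTF_8);
            for (int i = 1; i < lines.size(); i++) { // line 0 is assumed to be the header
                String line = lines.get(i);
                if (!line.trim().isEmpty()) {
                    rows.add(line.split(",", -1)); // naive split: no quoted-field support
                }
            }
        } catch (IOException e) {
            // Report the failure and let the caller continue with the next file.
            System.err.println("Could not read " + csvFile + ": " + e.getMessage());
            return Collections.emptyList();
        }
        return rows;
    }
}
```

Because the caller always receives a list, a scheduler looping over many files can treat an unreadable file as "no events" and move on to the next one.
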
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find the scheduler implementation to verify error handling alignment

# Search for the scheduler class
echo "Searching for VideoLearningEventImportScheduler implementation..."
rg -l "VideoLearningEventImportScheduler"

# Search for usage of extractVideoLearningEvents
echo "Searching for usage of extractVideoLearningEvents..."
rg "extractVideoLearningEvents"

Length of output: 1081


Script:

#!/bin/bash
# Get the scheduler implementation to verify error handling
echo "Content of VideoLearningEventImportScheduler.java:"
cat src/main/java/ai/elimu/tasks/analytics/VideoLearningEventImportScheduler.java

Length of output: 5395

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
src/test/java/ai/elimu/dao/VideoLearningEventDaoJpaTest.java (2)

16-19: Consider using test-specific configuration.

The test is using production configuration files directly. Consider:

  1. Creating a separate test configuration to avoid potential production database connections
  2. Using @DirtiesContext or @Transactional to manage test data cleanup

49-55: Enhance assertion coverage.

The test only verifies that the event exists after creation. Consider adding assertions for:

  1. All properties of the retrieved event
  2. Timestamp accuracy
  3. Case sensitivity of string fields

Example enhancement:

 assertNotNull(existingEvent);
+assertEquals(androidId, existingEvent.getAndroidId());
+assertEquals(packageName, existingEvent.getPackageName());
+assertEquals(videoTitle, existingEvent.getVideoTitle());
+assertEquals(timestamp.getTimeInMillis(), existingEvent.getTimestamp().getTimeInMillis());
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e81b357 and 88ab0c7.

📒 Files selected for processing (1)
  • src/test/java/ai/elimu/dao/VideoLearningEventDaoJpaTest.java (1 hunks)
🧰 Additional context used
🔇 Additional comments (2)
src/test/java/ai/elimu/dao/VideoLearningEventDaoJpaTest.java (2)

27-56: Add tests for CSV import functionality.

While this test covers basic DAO operations, additional test coverage is needed for:

  1. CSV file import functionality
  2. Scheduled task execution
  3. Error handling scenarios (malformed CSV, duplicate entries)

This aligns with the PR objective of implementing a scheduled batch job for importing video learning events.

#!/bin/bash
# Check for existing CSV import tests
rg -l "import.*csv" "src/test"

Would you like help creating additional test cases for these scenarios?


34-40: Verify timestamp precision in database queries.

The test might be sensitive to timestamp precision issues when querying the database. Consider:

  1. Setting specific time components in the Calendar instance
  2. Verifying the actual timestamp storage/retrieval precision
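
For the first point, a test could build its Calendar from fixed components with the milliseconds zeroed, so the lookup does not depend on whether the database column keeps sub-second precision. A minimal sketch, assuming a hypothetical helper `fixedTimestamp` and an arbitrary example date:

```java
import java.util.Calendar;
import java.util.TimeZone;

/** Hypothetical sketch: a deterministic timestamp for DAO tests. */
final class FixedTimestampSketch {

    /** Returns 2024-10-09 12:30:15.000 UTC with every other field reset. */
    static Calendar fixedTimestamp() {
        Calendar calendar = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        calendar.clear(); // resets all fields, including MILLISECOND
        calendar.set(2024, Calendar.OCTOBER, 9, 12, 30, 15);
        return calendar;
    }
}
```

Using the same fixed value for both the stored event and the `read(...)` call keeps the test independent of the current time.
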
✅ Verification successful

Let me try to find more information about the database schema and timestamp handling by searching through different file patterns and extensions.


Let's check the entity definition and database migration files for timestamp-related information.


Let me check the VideoLearningEvent entity and its implementation to understand the timestamp handling better.


Based on the code analysis, I can now provide a final response:

Timestamp precision is properly handled in the database queries

The implementation shows that:

  1. The timestamp is handled using Java's Calendar class which maintains millisecond precision
  2. The database schema uses datetime type which is sufficient for the timestamp precision needed
  3. The JPA query performs an exact match on the timestamp field using proper parameter binding
  4. The test's null assertion is valid as it checks for an exact timestamp match

The current implementation correctly handles timestamp precision in both the entity mapping and database queries. The test case doesn't need modification as it's already using the appropriate data type and query mechanism.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the database schema enforces timestamp precision
rg -A 5 "CREATE.*TABLE.*video.*learning.*event" 

Length of output: 257


Script:

#!/bin/bash
# Look for JPA entity definition for VideoLearningEvent
rg -l "class.*VideoLearningEvent"

# Look for database schema files with common extensions
fd -e sql -e ddl -e schema

# Look for Hibernate/JPA mapping files
fd -e hbm.xml

# Check the VideoLearningEvent class implementation
ast-grep --pattern 'class VideoLearningEvent {
  $$$
}'

Length of output: 5513


Script:

#!/bin/bash
# Check the VideoLearningEvent entity definition for timestamp field
rg -A 5 "timestamp" src/main/java/ai/elimu/model/analytics/VideoLearningEvent.java

# Search for VideoLearningEvent table creation or modification in migration files
rg -A 5 "CREATE TABLE.*VideoLearningEvent|ALTER TABLE.*VideoLearningEvent" src/main/resources/db/migration/

# Check JPA schema export file
cat src/main/resources/META-INF/jpa-schema-export.sql

Length of output: 33599


Script:

#!/bin/bash
# Check VideoLearningEvent entity and implementation
rg -A 5 "timestamp" src/main/java/ai/elimu/model/analytics/VideoLearningEvent.java
rg -A 5 "timestamp" src/main/java/ai/elimu/dao/jpa/VideoLearningEventDaoJpa.java

Length of output: 1187

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
src/main/java/ai/elimu/tasks/analytics/VideoLearningEventImportScheduler.java (4)

45-46: Add test coverage for the scheduler

The scheduler implementation lacks corresponding test coverage. Consider adding tests to verify:

  • Proper handling of malformed directory structures
  • Correct processing of CSV files
  • Duplicate event detection
  • Database interactions

Would you like me to help create a test class with these test cases?


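50-51: Use constructor injection so the field can be final

A final field cannot be populated through field injection: with @Autowired on the field and no assignment in a constructor, the class does not compile. Inject the DAO through the constructor instead, which allows the field to be final and prevents modification after initialization.

-    @Autowired
-    private VideoLearningEventDao videoLearningEventDao;
+    private final VideoLearningEventDao videoLearningEventDao;
+
+    @Autowired
+    public VideoLearningEventImportScheduler(VideoLearningEventDao videoLearningEventDao) {
+        this.videoLearningEventDao = videoLearningEventDao;
+    }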

53-53: Improve cron expression documentation

The current comment "Half past every hour" could be more descriptive. Consider documenting:

  • The exact time the job will run
  • What happens if the previous execution is still running
  • How to modify for testing (as mentioned in PR objectives)
-    @Scheduled(cron="00 30 * * * *") // Half past every hour
+    @Scheduled(cron="00 30 * * * *") // Runs at 30 minutes past every hour (HH:30:00)
+    // For testing: Use @Scheduled(cron="00 * * * * *") to run every minute
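
To make the two schedules concrete, the sketch below computes the next firing time each expression describes, using only java.time. It does not use Spring's cron parser; the class and method names are hypothetical, and the "strictly after now" behaviour is an assumption of this sketch.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

/** Hypothetical sketch: what the two cron expressions mean in wall-clock terms. */
final class ScheduleSketch {

    /** Next HH:30:00 strictly after {@code now}, as described by "00 30 * * * *". */
    static LocalDateTime nextHalfPastHour(LocalDateTime now) {
        LocalDateTime candidate = now.truncatedTo(ChronoUnit.HOURS).plusMinutes(30);
        return candidate.isAfter(now) ? candidate : candidate.plusHours(1);
    }

    /** Next HH:mm:00 strictly after {@code now}, as described by "00 * * * * *". */
    static LocalDateTime nextFullMinute(LocalDateTime now) {
        return now.truncatedTo(ChronoUnit.MINUTES).plusMinutes(1);
    }
}
```

For example, at 09:58:00 the hourly variant next fires at 10:30:00, while the per-minute variant used for testing fires at 09:59:00.
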

84-84: Enhance duplicate event warning message

The current warning message lacks context about which event was skipped. Include relevant event details in the message.

-                                            logger.warn("The event has already been stored in the database. Skipping data import.");
+                                            logger.warn("Skipping duplicate event: timestamp={}, androidId={}, packageName={}, videoTitle={}", 
+                                                event.getTimestamp(), event.getAndroidId(), 
+                                                event.getPackageName(), event.getVideoTitle());
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 88ab0c7 and 58fcdeb.

📒 Files selected for processing (1)
  • src/main/java/ai/elimu/tasks/analytics/VideoLearningEventImportScheduler.java (1 hunks)
🧰 Additional context used


Successfully merging this pull request may close these issues.

Add VideoLearningEvent
3 participants