We are busy grading the submissions. Here are a few comments after looking at some of them:
- The 1st grading guide is strictly adhered to.
- Data leakage seems to be common in the submissions.
Here's a ChatGPT explanation of data leakage:
Data leakage in machine learning refers to a situation where information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates. This can result in models that perform well during testing but fail in real-world applications because they rely on data that wouldn’t be available during actual predictions.
Here are a few common examples of data leakage:
- Target Leakage. This occurs when the training dataset includes features that directly or indirectly contain information about the target variable. For instance, if you're predicting whether a patient has a disease and you include a feature that measures a related treatment received after the diagnosis, the model might learn from this post-diagnosis information, leading to misleading accuracy during validation.
- Train-Test Contamination. If the test set is influenced by the training set, this is another form of leakage. For example, if you normalize your data using the mean and standard deviation of the entire dataset (including both training and test data), the model may inadvertently have access to test data characteristics during training. Instead, you should compute normalization parameters using only the training data.
- Temporal Leakage. In time-series forecasting, using future data to predict past events can lead to leakage. For example, if you build a model to predict stock prices using data that includes future prices, the model will appear to perform well, but it wouldn’t be viable in a real-world scenario where future data isn’t available.
- Feature Engineering Leakage. If you create features that use information from the target variable or future information, this can introduce leakage. For example, if you’re predicting customer churn and create a feature that sums up customer interactions from the future, it could skew the results.

Prevention Strategies
To prevent data leakage, you can take the following steps:
- Careful Feature Selection: Ensure that features are derived only from the training data and don’t contain information about the target variable.
- Proper Data Splitting: Always split your data into training and test sets before any preprocessing to avoid contamination.
- Cross-Validation: Use techniques like k-fold cross-validation, ensuring that the splits respect the temporal order in time-series data.
- Monitoring: Regularly evaluate the model performance using separate validation datasets that are representative of real-world conditions.
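The splitting and normalization advice above can be sketched as follows. This is a minimal illustration on made-up data, not part of the provided `solution.py`; the column name `feature`, the toy values, and the fold sizes are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy data: 10 business days around the test-period start (values are made up).
dates = pd.bdate_range("2023-12-25", periods=10)
df = pd.DataFrame({"date": dates, "feature": np.arange(10, dtype=float)})

# 1. Split chronologically BEFORE any preprocessing.
split_date = pd.Timestamp("2024-01-01")
train = df[df["date"] < split_date].copy()
test = df[df["date"] >= split_date].copy()

# 2. Compute normalization parameters from the training rows only.
mu = train["feature"].mean()
sigma = train["feature"].std()

# 3. Apply the same training-derived parameters to both sets.
train["feature_z"] = (train["feature"] - mu) / sigma
test["feature_z"] = (test["feature"] - mu) / sigma

# 4. Walk-forward validation inside the training period: each fold
#    validates on a day that comes strictly after its fitting window.
folds = []
for end in range(3, len(train)):
    fit_idx = train.index[:end]
    val_idx = train.index[end:end + 1]
    folds.append((fit_idx, val_idx))
```

The test rows never influence `mu` or `sigma`, and no fold validates on a day earlier than the data it was fitted on.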
Welcome to the Prescient Coding Challenge 2024!
You have been provided with price and financial data for 100 US stocks. Your task is to generate 1-day-ahead trading signals for each stock. Additionally, you need to select the top 10 stocks each day to form a portfolio. The performance of your selected portfolio will be evaluated based on the total return index over the evaluation (test) period.
This type of trading is known as swing trading. You are only allowed long positions, i.e. your matrix of buys will only contain 1's and 0's.
You are given the following files:
- `README.md` - this file
- `data0.csv` - 1st data file
- `data1.csv` - 2nd data file
- `returns.csv` - returns file
- `solution.py` - a skeleton structure with sample solution for the problem description
The data provided is a mix of daily, monthly, and yearly data. Where possible, the data is provided daily; otherwise it has been forward filled to match the availability of the pricing data.
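To make "forward filled" concrete, here is a small sketch of how a lower-frequency value is typically carried onto a daily calendar with pandas. The numbers and reporting dates are made up, and this is not necessarily how the challenge files were produced.

```python
import pandas as pd

# Hypothetical ratio values reported on two dates (made-up numbers).
monthly = pd.Series(
    [15.0, 16.5],
    index=pd.to_datetime(["2024-01-01", "2024-02-01"]),
    name="ratio_pe",
)

# Daily pricing calendar (business days) that the ratio must be aligned to.
daily_index = pd.bdate_range("2024-01-01", "2024-02-02")

# Forward fill: each day carries the most recent value known on or before it.
daily = monthly.reindex(daily_index, method="ffill")
```

Forward filling only looks backwards in time, so a value filled this way does not reveal anything that was unknown on that day.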
- The file `data0.csv` contains the security sector data.
- The file `data1.csv` contains price, historical returns, financial ratios and the 1-day-ahead price change label for each security and trading day.
A brief description of the columns in the data:
- `date` - close of business day
- `security` - the instrument code, in this case it's the stock ticker
- `sector` - the security's sector classification
- `price` - closing day price in USD
- `ratio_pe` - price to earnings ratio
- `ratio_pcf` - price to cash flow ratio
- `ratio_de` - debt to equity ratio
- `ratio_roe` - return on equity
- `ratio_roa` - return on assets
- `label` - a 0 or 1 label, with 0 indicating a loss taking bet and 1 a positive winning bet
We are interested in the total payoff for the buys in the testing period `2024-01-01` to `2024-06-30`. The high level steps are:
- Generate buy-signals
- Create a buy-matrix of 1s (buys) and 0s (don't buy) with each row summing to 10 (10 buys)
- Generate payoff chart
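The second step above can be sketched as follows: given one score per security per day, keep the 10 highest-scoring securities in each row. The random scores, ticker names, and variable names are placeholders for illustration; the skeleton in `solution.py` may structure this differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy example is repeatable

# Hypothetical signal scores: rows are trading days, columns are 100 securities.
dates = pd.bdate_range("2024-01-01", periods=5)
tickers = [f"S{i:03d}" for i in range(100)]
signals = pd.DataFrame(rng.random((5, 100)), index=dates, columns=tickers)

# Rank within each day (1 = strongest signal); method="first" breaks ties
# by column order so every row ends up with exactly 10 buys.
ranks = signals.rank(axis=1, method="first", ascending=False)
buy_matrix = (ranks <= 10).astype(int)
```

Using `method="first"` matters when scores tie: other tie-breaking methods can assign the same rank to several securities and leave a row with more or fewer than 10 buys.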
Your buy-matrix will create your payoff chart using the `plot_payoff` function.
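For intuition about what a payoff curve measures, here is one common way to turn a buy-matrix and a matrix of 1-day returns into a total return index. It assumes equal weighting across the day's buys and made-up returns; it is not the implementation of `plot_payoff`, and aligning signal dates with return dates depends on how `returns.csv` is laid out, which is not shown here.

```python
import numpy as np

# Made-up 1-day returns: 3 days x 4 securities.
returns = np.array([
    [0.01, -0.02, 0.03, 0.00],
    [0.02, 0.01, -0.01, 0.00],
    [-0.01, 0.00, 0.02, 0.01],
])

# Toy buy-matrix with 2 buys per day (the challenge uses 10 out of 100).
buys = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

# Equal-weighted portfolio return per day: mean return of the bought stocks.
daily_return = (returns * buys).sum(axis=1) / buys.sum(axis=1)

# Index level after each day, compounding from a starting level of 1.0.
tri = np.cumprod(1.0 + daily_return)
```

Under these assumptions the final element of `tri` is the compounded growth of one unit invested over the period.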
- You may use a subset of features.
- You may engineer features using the existing features.
- You may use pure rule-based, quant, or ML methods.
- You may create more than 1 model to generate buy-signals
- ChatGPT is allowed
- Sign in or sign up to GitHub.
- On the Coding Challenge repo page, fork the repo as shown below.
- Once the project shows as a repo on your GitHub profile, clone the repo.
- Since this is on your personal GitHub profile, you can work on your `main` branch.
- Assuming you are working on your `main` branch, run `git add .`, then `git commit -m 'Your Team Name'`, then `git push origin main`.
- Make sure that your changes are only in one of either a `solution.py` or a `solution.R`.
- You should see your changes on your repo.
- On the "Pull Requests" tab, select "New pull request"
- The GitHub summary should mention only 1 file change
- Select "Create pull request"
- Add your team name and a short description of how you solved the problem. Confirm with "Create pull request".
- You should now be able to see your team's pull request on our repository's list of pull requests.
The table below is the 1st grading guide.
Step | Criteria | Action |
---|---|---|
1 | Submitted on 5 October 2024 before 2pm? | Yes - next step, no - disqualified |
2 | No tampering with data sets and no additional data imports of any kind? | Yes - next step, no - disqualified |
3 | Script runs without intervention from us? | Yes - next step, no - disqualified |
4 | Script runs within 10 minutes? | Yes - next step, no - disqualified |
5 | Does it produce the same solution on consecutive runs? Simulation and stochastic estimation need to be highly stable. | Yes - next step, no - disqualified |
6 | Does not contain look-ahead bias? Stock picking in this case will be considered look-ahead because you can see future prices. | Yes - next step, no - disqualified |
7 | Successfully feeds into TRI function and produces desired chart? | Yes - next step, no - disqualified |
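For step 5, one common way to get identical results on consecutive runs is to seed every random number generator the script uses. The sketch below assumes a script that relies on Python's `random` module and NumPy; the function name and the seed value are hypothetical, and other libraries usually expose their own seed or `random_state` parameter.

```python
import random

import numpy as np

SEED = 42  # any fixed integer; the value itself is arbitrary


def noisy_signals(seed: int) -> np.ndarray:
    """Hypothetical stochastic step: returns scores for 100 securities."""
    random.seed(seed)                  # seed Python's built-in RNG
    rng = np.random.default_rng(seed)  # seeded NumPy generator
    return rng.normal(size=100)


run_1 = noisy_signals(SEED)
run_2 = noisy_signals(SEED)
```

With the same seed both runs produce the same scores, so the resulting buy-matrix is the same as well.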
The 2nd grading is a combination of the classification score, the final TRI level, and the solution's originality, as decided by the Prescient Investment Management Team.