In the provided Order Record and Trade Record tables, the order type (`OrderType`) field does not distinguish between the kinds of market order, which makes further analysis difficult. To resolve this problem, we will:
- Organize the Order and Trade Tables:
  - Mark the price for limit orders.
  - Mark the number of price levels for market orders.
- Sorting:
  - Order Record Sorting: Sort order records (market orders, limit orders, self-best orders) by timestamp and order index so that order execution and trade status can be tracked accurately.
  - Cancellation Record Sorting: Sort cancellation records by timestamp and original data order so that the cancellation process can be reconstructed consistently and reliably for further analysis.
- Method Selection: The task requires the use of HDFS + MapReduce to implement the solution.
  - HDFS: Transfer the relevant files into HDFS using Docker for input/output.
  - MapReduce: Design and implement the corresponding Mapper and Reducer to address the issues in the Order Record and Trade Record tables.
- Field Sources: Output a table with 7 fields based on the Order Record and Trade Record tables:
  - `TIMESTAMP`: The time of the order or of the cancellation execution. Which one applies is determined by the `ExecType` field in the trade record, and the timestamp is taken from the `TransactTime` or `TradeTime` field.
  - `PRICE`: Records the price for limit orders, taken directly from the `Price` field in the order table.
  - `BUY_SELL_FLAG`: Records the direction of the order (1-buy, 2-sell), determined from the `Side` field in the order table and, in the trade table, by checking which of `BidApplSeqNum` and `OfferApplSeqNum` is non-zero.
  - `ORDER_TYPE`: The type of the order (1-market order, 2-limit order, U-self-best order), taken directly from the `OrderType` field in the order table.
  - `ORDER_ID`: A unique identifier for the order, extracted from the `ApplSeqNum` field. For cancellations, it takes the non-zero value of the `BidApplSeqNum` and `OfferApplSeqNum` fields.
  - `MARKET_ORDER_TYPE`: Records the number of price levels for market orders. This is calculated by matching trade records with the order records using `ApplSeqNum`. A `HashSet` or similar data structure counts the number of distinct prices for the same `ApplSeqNum`, excluding cancellations.
  - `CANCEL_TYPE`: Indicates whether the order is canceled (1-canceled, 2-not canceled).
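To make the cancellation rule for `ORDER_ID` and `BUY_SELL_FLAG` concrete, here is a minimal plain-Java sketch. It assumes that `ExecType` equal to `"4"` marks a cancellation and that a non-zero `BidApplSeqNum` means a buy while a non-zero `OfferApplSeqNum` means a sell. The class and method names (`CancelFields`, `fromTrade`) are hypothetical and not part of the project.

```java
import java.util.Optional;

/** Hypothetical helper: derives ORDER_ID and BUY_SELL_FLAG for a cancellation row. */
final class CancelFields {

    /** The two output fields derived from one cancellation trade record. */
    static final class Result {
        final long orderId;
        final int buySellFlag; // 1-buy, 2-sell

        Result(long orderId, int buySellFlag) {
            this.orderId = orderId;
            this.buySellFlag = buySellFlag;
        }
    }

    /**
     * Returns the fields for a cancellation, or empty if the trade record
     * is not a cancellation or carries no order index at all.
     */
    static Optional<Result> fromTrade(String execType, long bidApplSeqNum, long offerApplSeqNum) {
        // Assumption: ExecType "4" marks a cancellation record.
        if (!"4".equals(execType)) {
            return Optional.empty();
        }
        if (bidApplSeqNum != 0) {
            return Optional.of(new Result(bidApplSeqNum, 1)); // canceled buy order
        }
        if (offerApplSeqNum != 0) {
            return Optional.of(new Result(offerApplSeqNum, 2)); // canceled sell order
        }
        return Optional.empty();
    }
}
```

For example, `CancelFields.fromTrade("4", 0L, 1234L)` yields order ID 1234 with flag 2 (sell) under these assumptions.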
- Overall Task: Output all order records and the cancellation records from the trade record table, sorted by `TIMESTAMP`. For the same `TIMESTAMP`, record orders before cancellations. If there are multiple order or cancellation records, sort by `ORDER_ID`.
- Field Complexity: The large number of irrelevant fields in both tables makes it difficult to filter and organize the necessary data. Field cleanup is required to simplify the process.
- Missing Data:
  - Cancellations without an order record should still be recorded.
  - Trades without an order record should not be recorded.
  - Orders without trade records should still be recorded.
- Background Knowledge: Stock market-related knowledge is needed to understand the problem correctly.
- Data Issues: The data is not entirely clean; it contains noise, such as trade records from pre-market auctions, that may confuse the analysis.
- Process: The solution is divided into three phases: Map, Reduce, and Sort.
- Case Handling: Handle the following four cases:

| TIMESTAMP | PRICE | SIZE | BUY_SELL_FLAG | ORDER_TYPE | ORDER_ID | MARKET_ORDER_TYPE | CANCEL_TYPE |
|---|---|---|---|---|---|---|---|
| Order Time | null | Order Size | 1-buy, 2-sell | 1 (market) | Order Index | 0, 1, 2, ... | 2 |
| Order Time | Price | Order Size | 1-buy, 2-sell | 2 (limit) | Order Index | null | 2 |
| Order Time | null | Order Size | 1-buy, 2-sell | U (self-best) | Order Index | null | 2 |
| Cancel Time | null | Trade Size | 1-buy, 2-sell | 4 (marker used for convenience) | Non-zero index | null | 1 |
- Field Sources:
  - Cancellation data comes entirely from the trade table.
  - Limit orders and self-best orders come entirely from the order table.
  - For market orders, the number of levels comes from the trade table; all other fields come from the order table.
- Process:
  - Map: The order and trade records are loaded by different Mappers. Only records within continuous trading time and for the selected stock (Ping An Bank) are processed; all others are filtered out.
  - Reduce: If a cancellation is present, record the cancellation; if an order record is present, record the order. Limit orders and self-best orders can be recorded directly from the order table, while market orders require matching trade records to determine the number of levels.
  - Sort: Sort by timestamp in ascending order, then by cancellation type in descending order (so that orders, with `CANCEL_TYPE` 2, come before cancellations, with `CANCEL_TYPE` 1), and finally by order index in ascending order. The header should contain the column names.
- Map:
  - Mapper1: Process the `order` table, using the order index as the key:
    1. Filter by continuous trading time and Ping An Bank.
    2. Extract the relevant fields: "O, Order Time, Price, Order Size, Buy/Sell Flag, Order Type".
  - Mapper2: Process the `trade` table, using the buy or sell order index as the key:
    1. Filter by continuous trading time and Ping An Bank.
    2. Extract the relevant fields: "T, Trade Time, Trade Price, Cancellation Flag, Buy/Sell Flag, Trade Size".
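The order-side map step can be sketched in plain Java, without the Hadoop `Mapper` wrapper, roughly as follows. Several details here are assumptions rather than facts from the project: timestamps are taken to look like `yyyyMMddHHmmssSSS`, continuous trading is taken to be 09:30 to 11:30 and 13:00 to 14:57 (half-open intervals), Ping An Bank is identified by security ID `000001`, and the input is assumed to be already split into named fields. The names `OrderMapSketch`, `inContinuousTrading`, and `mapOrder` are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

/** Hypothetical plain-Java sketch of the Mapper1 logic (no Hadoop types). */
final class OrderMapSketch {

    // Assumption: Ping An Bank is selected by this security ID.
    static final String PING_AN_BANK = "000001";

    /** Assumed timestamp layout: yyyyMMddHHmmssSSS, e.g. 20190102093000000. */
    static boolean inContinuousTrading(String timestamp) {
        if (timestamp == null || timestamp.length() < 12) {
            return false;
        }
        int hhmm = Integer.parseInt(timestamp.substring(8, 12));
        // Assumed sessions: 09:30-11:30 and 13:00-14:57, end exclusive.
        boolean morning = hhmm >= 930 && hhmm < 1130;
        boolean afternoon = hhmm >= 1300 && hhmm < 1457;
        return morning || afternoon;
    }

    /**
     * Emits (order index, "O,time,price,size,side,type") for records that
     * pass both filters; returns empty for everything else.
     */
    static Optional<Map.Entry<String, String>> mapOrder(
            String securityId, String applSeqNum, String transactTime,
            String price, String orderQty, String side, String orderType) {
        if (!PING_AN_BANK.equals(securityId) || !inContinuousTrading(transactTime)) {
            return Optional.empty();
        }
        String value = String.join(",", "O", transactTime, price, orderQty, side, orderType);
        return Optional.of(new SimpleEntry<>(applSeqNum, value));
    }
}
```

The trade-side mapper would follow the same shape, tagging values with `T` and keying on whichever of the buy or sell order index is non-zero.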
- Reduce:
  - Reducer:
    1. Iterate through the values:
       (a) Store the order and trade records.
       (b) Handle cancellations.
    2. If no order record exists: skip.
    3. Based on the order type (OrderType O5), determine the trade type:
       (a) Limit: Write directly from the order table fields.
       (b) Market: Add the trade prices to a HashSet. If the HashSet is empty, write the type as 0; otherwise, write the type as the HashSet size.
       (c) Self-best: Write directly from the order table fields.
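The reducer steps above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the project's actual reducer: values are the comma-separated `O,...` and `T,...` strings described in the Map step, the cancellation flag in a trade value is assumed to be `"4"`, and prices are compared as strings (so `10.5` and `10.50` would count as different levels). The class name `ReduceSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical plain-Java sketch of the reducer logic for one order index. */
final class ReduceSketch {

    /**
     * Output columns: TIMESTAMP,PRICE,SIZE,BUY_SELL_FLAG,ORDER_TYPE,
     * ORDER_ID,MARKET_ORDER_TYPE,CANCEL_TYPE.
     */
    static List<String> reduce(String orderId, List<String> values) {
        List<String> out = new ArrayList<>();
        String[] order = null;
        Set<String> tradePrices = new HashSet<>();

        for (String value : values) {
            String[] f = value.split(",", -1);
            if ("O".equals(f[0])) {
                order = f; // O,time,price,size,side,type
            } else if ("T".equals(f[0])) {
                // T,time,price,cancelFlag,side,size
                if ("4".equals(f[3])) { // assumption: "4" marks a cancellation
                    // Cancellations are written even without an order record.
                    out.add(String.join(",", f[1], "null", f[5], f[4], "4", orderId, "null", "1"));
                } else {
                    tradePrices.add(f[2]); // distinct prices = price levels
                }
            }
        }

        if (order == null) {
            return out; // trades without an order record are not written
        }
        String type = order[5];
        String price = "2".equals(type) ? order[2] : "null"; // only limit orders carry a price
        String levels = "1".equals(type) ? String.valueOf(tradePrices.size()) : "null";
        out.add(String.join(",", order[1], price, order[3], order[4], type, orderId, levels, "2"));
        return out;
    }
}
```

A market order with no matching trades gets level count 0 automatically, because the set is empty.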
- Sorting: Use Java’s built-in sorting with a custom `compare` method to sort by timestamp in ascending order, cancellation type in descending order, and order index in ascending order.
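One way to express this three-key ordering is with `java.util.Comparator` combinators instead of a hand-written `compare` method; the two are equivalent in effect. The sketch below assumes that timestamps and order indexes fit in a `long`, and the `OutputRowSort` and `Row` names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the final sort over output rows. */
final class OutputRowSort {

    /** Only the three sort keys are modeled here. */
    static final class Row {
        final long timestamp;
        final int cancelType; // 1-canceled, 2-not canceled
        final long orderId;

        Row(long timestamp, int cancelType, long orderId) {
            this.timestamp = timestamp;
            this.cancelType = cancelType;
            this.orderId = orderId;
        }
    }

    // Timestamp ascending, cancel type descending (orders before
    // cancellations), then order index ascending.
    static final Comparator<Row> ORDER =
            Comparator.comparingLong((Row r) -> r.timestamp)
                    .thenComparing(Comparator.comparingInt((Row r) -> r.cancelType).reversed())
                    .thenComparingLong(r -> r.orderId);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Row> sorted(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(ORDER);
        return copy;
    }
}
```

Because `CANCEL_TYPE` is 2 for orders and 1 for cancellations, sorting that key in descending order places orders first within the same timestamp.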
- Upload the files to HDFS under `/project/data`.
- Transfer the JAR file to Docker.
- Run the command:
  `hadoop jar project-1.0-SNAPSHOT.jar driver.StockAnalysisDriver /project/data/order/am_hq_order_spot.txt /project/data/order/pm_hq_order_spot.txt /project/data/trade/am_hq_trade_spot.txt /project/data/trade/pm_hq_trade_spot.txt /project/output /project/output/part-r-00000`
- To re-run the process, remove the output directory:
  `hdfs dfs -rm -r /project/output`