Check if sorting implementation is stable #177

yenslife · 2024-03-22T13:56:07Z

Introduced node numbering in qtest.c to evaluate q_sort's stability. The modification assigns a unique identifier to each node so that stability can be checked after sorting. If nodes with identical key values end up with their identifiers out of ascending order, an error is reported, flagging the sort as unstable.

jserv · 2024-03-22T23:41:57Z

Can you avoid modifying the definition of element_t?

yenslife · 2024-03-23T12:34:31Z

Absolutely! I've found an alternative approach that doesn't require modifying the definition of element_t. The updated version tests the stability of the sort by tracking node pointers and their original order, all within qtest.c, so the original data structure stays intact. One caveat: the stability test caps the number of elements, currently at MAX_NO, to prevent long testing times. This keeps the changes localized while still achieving the stability-testing goal. The value of MAX_NO is open to discussion because of space constraints; I am not certain of the optimal setting.

qtest.c

jserv

Shorten the code.

qtest.c

jserv · 2024-03-24T18:03:39Z

qtest.c

+/* If the number of elements is too large, it may take a long time to check the
+ * stability of the sort. So, MAX_NODE is used to limit the number of elements
+ * to check the stability of the sort. */
+#define MAX_NODE 100000


Undefine MAX_NODE when it is not used. Also, rename it to MAX_NODES.
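A minimal sketch of what scoping the macro could look like, assuming the limit is only needed by one helper. The helper name within_check_limit and the use of <= are assumptions for illustration, not the code in this PR.

```c
#include <stdbool.h>
#include <stddef.h>

/* The limit is defined only for the code that needs it. */
#define MAX_NODES 100000

/* Hypothetical helper: true if a queue of n nodes is small enough to check. */
static bool within_check_limit(size_t n)
{
    return n <= MAX_NODES;
}

/* Drop the macro so it cannot leak into unrelated code further down. */
#undef MAX_NODES
```

After the #undef, any later accidental use of MAX_NODES fails at compile time instead of silently picking up the value.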

Explain why 100000 was picked in a scientific way.

In the previous implementation, comparing each pair of duplicated strings cost $2 \cdot O(n)$, for a worst-case total of $2(n-1) \cdot O(n)$, which is significant. In the shortened version (commit a078970), each such comparison costs $O(n)$, so the worst case takes $(n-1) \cdot O(n)$: half the previous bound, although still quadratic in $n$.

In traces/trace-14-perf.cmd, the largest number of nodes used with the sort command is 2,000,000. In traces/trace-15-perf.cmd, the second-largest number of nodes used with the sort command is 100,000. I set MAX_NODES to 100,000 with the intent of covering that second-largest case, although choosing exactly that value was a mistake on my part. Setting MAX_NODES to 2,000,000 causes a segmentation fault on my computer, so I opted to skip that case. I then measured the time taken by the sort command for different numbers of duplicated nodes.

The test script is as follows. I set MAX_NODES to 1,000,001 for the test. Only the number of nodes inserted at the head changes between runs, and each run is timed with perf stat ./qtest < test.cmd.

test.cmd

new
ih a 10000
sort
quit

| node count | elapsed time (seconds) |
|-----------:|-----------------------:|
| 1000 | 0.027906238 |
| 10000 | 0.058188249 |
| 100000 | 2.445569385 |
| 150000 | 5.453358783 |
| 200000 | 9.671944194 |
| 300000 | 21.581793918 |
| 400000 | 40.176436261 |
| 500000 | 61.437037665 |

As shown above, exceeding 100,000 nodes in the test would lead to significant performance degradation🥹. While alternative approaches might be considered, I found that in the end, it was more straightforward to directly add a member to element_t. However, this approach deviates from the requirements of the assignment. Therefore, it involved a trade-off in implementation.
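To make the degradation concrete, here is a small sketch that checks the timings above against the quadratic cost discussed earlier. It uses only the measured values from this comment. The struct and function names are hypothetical, and taking the 100,000-node run as the baseline is my own choice.

```c
#include <stddef.h>

/* One measurement: node count and elapsed seconds reported by perf stat. */
struct sample {
    double nodes;
    double seconds;
};

/* Timings copied from the measurements above, starting at the baseline. */
static const struct sample samples[] = {
    {100000, 2.445569385},  {150000, 5.453358783},  {200000, 9.671944194},
    {300000, 21.581793918}, {400000, 40.176436261}, {500000, 61.437037665},
};

/* Measured slowdown relative to base, divided by the slowdown that a cost
 * proportional to n^2 would predict. A result near 1.0 means the timing is
 * consistent with quadratic growth. */
static double quadratic_fit(const struct sample *base, const struct sample *s)
{
    double scale = s->nodes / base->nodes;
    return (s->seconds / base->seconds) / (scale * scale);
}
```

For every run from 150,000 to 500,000 nodes, quadratic_fit against the 100,000-node baseline stays within a few percent of 1.0, so doubling the node count roughly quadruples the elapsed time.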

If the reasons provided above are deemed sufficient, I will create another commit to change MAX_NODES = 100000 to MAX_NODES = 100001, and will explain the rationale in the commit message.

Use git rebase -i to rework the commits.

qtest.c

yenslife · 2024-03-25T16:46:16Z

Thank you for your review!

jserv

Squash the commits and refine the git commit messages.
Ensure the proposed change works for both ascending and descending order, whichever the user selects via the "option" command.

yenslife · 2024-03-26T09:54:02Z

I have confirmed that this approach detects instability for both ascending and descending sorts. Besides passing my own tests, the proposed change only examines adjacent nodes with identical values and compares the indices they had before q_sort. Assuming the sort itself succeeds, nodes with identical values are always adjacent, regardless of direction, so checking stability at that point is meaningful.
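The check described above can be sketched as follows. This is a simplified illustration rather than the code in qtest.c: struct node is a hypothetical singly linked stand-in for the real queue element, and snapshot is assumed to hold the node pointers in their pre-sort order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a queue element; not the real element_t. */
struct node {
    const char *value;
    struct node *next;
};

/* Linear search for a node's position in the pre-sort snapshot. */
static size_t original_index(struct node *const *snapshot, size_t n,
                             const struct node *target)
{
    for (size_t i = 0; i < n; i++)
        if (snapshot[i] == target)
            return i;
    return n; /* not found */
}

/* Return true if every adjacent pair of equal values kept its pre-sort
 * relative order. Only equal neighbours are compared, so the direction of
 * the sort (ascending or descending) does not matter. */
static bool is_stable(const struct node *head, struct node *const *snapshot,
                      size_t n)
{
    for (const struct node *cur = head; cur && cur->next; cur = cur->next) {
        if (strcmp(cur->value, cur->next->value) != 0)
            continue;
        if (original_index(snapshot, n, cur) >
            original_index(snapshot, n, cur->next))
            return false;
    }
    return true;
}
```

Each lookup in this sketch is a linear scan, which is where the $O(n)$ cost per pair of duplicates comes from.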

jserv · 2024-03-26T13:12:43Z

qtest.c

+/* If the number of elements is too large, it may take a long time to check the
+ * stability of the sort. So, MAX_NODES is used to limit the number of elements
+ * to check the stability of the sort. */
+#define MAX_NODES 100000


Can you use the sliding window technique to track partial nodes instead of using a predefined number of nodes during a customized sorting routine?

If we were to use the sliding window technique, how would we ensure the relative order of nodes? Without additional data structures, the information about the relative order of nodes would be lost after sorting. Can the sliding window still be utilized under these circumstances, or have I misunderstood your suggestion?

Think of the facilities the queue operations already provide. We can allocate temporary nodes on demand. What I care about is the fixed length used for checking.

Since we're considering avoiding the fixed length for checking purposes, I'm thinking of trying to use the node's address as the key for hashing. With an additional data structure, we can determine the original order of the queue. This approach can handle a large amount of test data efficiently due to the properties of hashing, with reduced time complexity. However, we'll need to decide on a fixed array length if we're using an array as a hash table. I'm a bit stuck on this point.
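To make the idea concrete, here is a rough sketch of a pointer-keyed hash table. One possible way around the fixed array length, which I am only assuming here, is to size the table from the queue length at run time instead of a compile-time constant. All names (ptr_map, map_init, map_put, map_get) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct slot {
    const void *key; /* node address; NULL marks an empty slot */
    size_t index;    /* position of the node before sorting */
};

struct ptr_map {
    struct slot *slots;
    size_t capacity; /* always a power of two */
};

static size_t hash_ptr(const void *p, size_t capacity)
{
    uintptr_t x = (uintptr_t) p >> 4; /* low bits are mostly alignment */
    x *= 2654435761u;
    x ^= x >> 16;
    return (size_t) x & (capacity - 1);
}

/* Allocate at least 2 * n slots so that linear probing always terminates.
 * Relies on calloc's zero bytes reading back as NULL keys, which holds on
 * mainstream platforms. */
static bool map_init(struct ptr_map *m, size_t n)
{
    size_t cap = 16;
    while (cap < 2 * n)
        cap <<= 1;
    m->slots = calloc(cap, sizeof(*m->slots));
    m->capacity = cap;
    return m->slots != NULL;
}

static void map_put(struct ptr_map *m, const void *key, size_t index)
{
    size_t i = hash_ptr(key, m->capacity);
    while (m->slots[i].key && m->slots[i].key != key)
        i = (i + 1) & (m->capacity - 1);
    m->slots[i].key = key;
    m->slots[i].index = index;
}

static bool map_get(const struct ptr_map *m, const void *key, size_t *index)
{
    size_t i = hash_ptr(key, m->capacity);
    while (m->slots[i].key) {
        if (m->slots[i].key == key) {
            *index = m->slots[i].index;
            return true;
        }
        i = (i + 1) & (m->capacity - 1);
    }
    return false;
}
```

With this, recording the original order is one map_put per node before sorting, and each stability lookup becomes an expected constant-time map_get instead of a linear scan. The table still has to be released with free(m.slots) once the check is done.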

jserv

Ensure that you write complete sentences in git commit messages.

Introduced node numbering in qtest.c to evaluate the stability of q_sort's sorting algorithm. When the algorithm encounters two nodes with the same value, it searches for the address of the node record in the nodes array. It then compares the found node to the current node (cur). If the found node is the same as the current node, it indicates that these two duplicate nodes have not been swapped in position after sorting. However, if the found node is cur->next, it means that the position of the nodes has been swapped. That is, the sorting implementation is unstable.

The performance of the testing code was evaluated by measuring the elapsed time for q_sort's operation on different numbers of nodes with duplicate values. Node counts ranging from 1000 to 500,000 were examined. Specifically, for the 1000-node count, the elapsed time was recorded as 0.0279 seconds, and for the 500,000-node count, it was 61.44 seconds. For the 100,000-node count, the elapsed time was 2.45 seconds. The elapsed time showed a significant increase starting from the 100,000-node count, underscoring potential performance issues with larger datasets.

This method relies on auxiliary data structures to track node pointers and their original order, avoiding alterations to the structure in queue.h. However, stability testing is limited to a maximum of 100,000 elements (MAX_NODES) to address potential performance concerns.

jserv · 2024-04-01T03:52:13Z

Thank @yenslife for contributing!

jserv changed the title ~~Enhance q_sort Stability Check~~ Check if sorting implementation is stable Mar 23, 2024

jserv reviewed Mar 23, 2024
