Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if sorting implementation is stable #177

Merged
merged 1 commit into from
Apr 1, 2024

Conversation

yenslife
Copy link
Contributor

Introduced node numbering in qtest.c to evaluate q_sort's stability. The modification assigns unique identifiers to each node, facilitating stability checks during sorting operations. If nodes with identical key values are found out of ascending order, an error is reported to maintain stable sorting.

@jserv
Copy link
Contributor

jserv commented Mar 22, 2024

Can you avoid modifying the definition of element_t?

@yenslife
Copy link
Contributor Author

Absolutely! I've found an alternative approach that doesn't require modifying the definition of element_t. In the updated version, I've improved stability testing in the sort function by tracking node pointers and their original order, all within qtest.c. This ensures we maintain the integrity of the original data structure. Just a heads up, though: the stability test has a limit on the number of elements, currently set as MAX_NO, to prevent long testing times. This method achieves our stability testing goals while keeping changes localized. The value for MAX_NO may be subject to discussion due to spatial constraints, as I am not certain of the optimal setting.

@jserv jserv changed the title Enhance q_sort Stability Check Check if sorting implementation is stable Mar 23, 2024
qtest.c Outdated Show resolved Hide resolved
qtest.c Outdated Show resolved Hide resolved
Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shorten the code.

qtest.c Outdated Show resolved Hide resolved
qtest.c Outdated
/* If the number of elements is too large, it may take a long time to check the
* stability of the sort. So, MAX_NODE is used to limit the number of elements
* to check the stability of the sort. */
#define MAX_NODE 100000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undefine MAX_NODE when it is not used. Also, rename it to MAX_NODES.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain why 100000 was picked in a scientific way.

Copy link
Contributor Author

@yenslife yenslife Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous implementation, the time complexity for comparing each pair of duplicated strings was $2 * O(n)$, resulting in a total time complexity of $2*(n-1)*O(n)$, which is significant. However, in the shortened version commit a078970, the time complexity for comparing each pair of duplicated strings was reduced to $O(n)$. In the worst-case scenario, the comparison would take $(n-1)*O(n)$ time, which is much smaller compared to the previous implementation.

In the test traces/trace-14-perf.cmd, the maximum number of nodes and the sorting command script used is 2,000,000. In the traces/trace-15-perf.cmd, the second highest number of nodes used with the sort command is 100,000. Therefore, I set MAX_NODES to 100,000, as I made a mistake in setting it to cover the second highest case. Setting MAX_NODES to 2,000,000 would cause a segmentation fault on my computer, so I opted to skip that case. As a result, I measured the time taken using the sort command in various scenarios with different numbers of duplicated nodes.

The test script is as follows. I set MAX_NODES to 1,000,001 for the test. I will only change the number of nodes inserted at the head, and then use perf stat ./qtest < test.cmd to measure the time.

test.cmd

new
ih a 10000
sort
quit
node count elapsed time (seconds)
1000 0.027906238
10000 0.058188249
100000 2.445569385
150000 5.453358783
200000 9.671944194
300000 21.581793918
400000 40.176436261
500000 61.437037665

As shown above, exceeding 100,000 nodes in the test would lead to significant performance degradation🥹. While alternative approaches might be considered, I found that in the end, it was more straightforward to directly add a member to element_t. However, this approach deviates from the requirements of the assignment. Therefore, it involved a trade-off in implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the reasons provided above are deemed sufficient, I will create another commit to change MAX_NODES = 100000 to MAX_NODES = 100001, and will explain the rationale in the commit message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the reasons provided above are deemed sufficient, I will create another commit to change MAX_NODES = 100000 to MAX_NODES = 100001, and will explain the rationale in the commit message.

Use git rebase -i to rework the commits.

qtest.c Outdated Show resolved Hide resolved
qtest.c Outdated Show resolved Hide resolved
qtest.c Outdated Show resolved Hide resolved
@yenslife
Copy link
Contributor Author

Thank you for your review!

Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Squash the commits and refine the git commit messages.
  2. Ensure the proposed change fits both ascending and descending order once a user specifies via "option" command.

@yenslife
Copy link
Contributor Author

I ensure that this approach enables stable detection for both ascending and descending sorts. Besides my own successful testing, this proposed change only examines adjacent nodes with identical values and compares their indices before q_sort. Assuming successful sorting, nodes with identical values will always be adjacent. Therefore, stability checking at this juncture becomes meaningful.

/* If the number of elements is too large, it may take a long time to check the
* stability of the sort. So, MAX_NODES is used to limit the number of elements
* to check the stability of the sort. */
#define MAX_NODES 100000
Copy link
Contributor

@jserv jserv Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the sliding window technique to track partial nodes instead of using a predefined number of nodes during a customized sorting routine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to use the sliding window technique, how would we ensure the relative order of nodes? Without additional data structures, the information about the relative order of nodes would be lost after sorting. Can the sliding window still be utilized under these circumstances, or have I misunderstood your suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to use the sliding window technique, how would we ensure the relative order of nodes? Without additional data structures, the information about the relative order of nodes would be lost after sorting. Can the sliding window still be utilized under these circumstances, or have I misunderstood your suggestion?

Think of the facility of queue operations. We are ready to allocate the temporary nodes on demand. What I care is the fixed length for checking purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're considering avoiding the fixed length for checking purposes, I'm thinking of trying to use the node's address as the key for hashing. With an additional data structure, we can determine the original order of the queue. This approach can handle a large amount of test data efficiently due to the properties of hashing, with reduced time complexity. However, we'll need to decide on a fixed array length if we're using an array as a hash table. I'm a bit stuck on this point.

截圖 2024-03-27 上午2 49 15

@yenslife yenslife force-pushed the stable-sort-test branch 3 times, most recently from 192ae40 to 12be675 Compare March 28, 2024 17:53
Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure that you write complete sentences in git commit messages.

Introduced node numbering in qtest.c to evaluate the stability of
q_sort's sorting algorithm. When the algorithm encounters two nodes
with the same value, it searches for the address of the node record in
the nodes array. It then compares the found node to the current node
(cur). If the found node is the same as the current node, it indicates
that these two duplicate nodes have not been swapped in position after
sorting. However, if the found node is cur->next, it means that the
position of the nodes has been swapped. That is, the sorting
implementation is unstable.

The performance of the testing code was evaluated by measuring the
elapsed time for q_sort's operation on different numbers of nodes with
duplicate values. Node counts ranging from 1000 to 500,000 were
examined. Specifically, for the 1000-node count, the elapsed time
was recorded as 0.0279 seconds, and for the 500,000-node count, it
was 61.44 seconds. For the 100,000-node count, the elapsed time was
2.45 seconds. The elapsed time showed a significant increase starting
from the 100,000-node count, underscoring potential performance issues
with larger datasets.

This method relies on auxiliary data structures to track node pointers
and their original order, avoiding alterations to the structure in
queue.h. However, stability testing is limited to a maximum of 100,000
elements (MAX_NODES) to address potential performance concerns.
@jserv jserv merged commit 017ec7d into sysprog21:master Apr 1, 2024
1 of 2 checks passed
@jserv
Copy link
Contributor

jserv commented Apr 1, 2024

Thank @yenslife for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants