Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix missing table rows #110

Merged
merged 5 commits into from
Aug 19, 2024
Merged

Fix missing table rows #110

merged 5 commits into from
Aug 19, 2024

Conversation

Holzhaus
Copy link
Owner

@Holzhaus Holzhaus commented Aug 9, 2024

@Holzhaus Holzhaus added the bug Something isn't working label Aug 9, 2024
@Holzhaus
Copy link
Owner Author

Holzhaus commented Aug 9, 2024

@Swiftb0y I would appreciate your review :)

Copy link
Contributor

@Swiftb0y Swiftb0y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple mostly superficial comments. I have not dug deeply into Deep-Symmetry/crate-digger#32 so I can't really assert how well this implementation aligns with the spec.

Very happy to see some activity here btw.

let actual_row_count: usize = pages
.into_iter()
.flat_map(|page| page.row_groups.into_iter())
.map(|row_group| row_group.present_rows().collect::<Vec<Row>>().len())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid temporary objects.

Suggested change
.map(|row_group| row_group.present_rows().collect::<Vec<Row>>().len())
.map(|row_group| row_group.present_rows().count())

pub struct RowGroup {
/// An offset which points to a row in the table, whose actual presence is controlled by one of the
/// bits in `row_present_flags`. This instance allows the row itself to be lazily loaded, unless it
/// is not present, in which case there is no content to be loaded.
#[br(args { count: num_rows.into(), inner: FilePtrArgs { inner: (page_type,), offset: page_heap_offset }})]
rows: Vec<FilePtr16<Row>>,
rows: [Option<FilePtr16<Row>>; Self::MAX_ROW_COUNT],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/pdb/mod.rs Outdated
Comment on lines 442 to 451
} else {
needs_seek = true;
}
}

Ok(RowGroup {
rows,
row_presence_flags,
unknown,
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to restore the seek position afterwards? To where?

src/pdb/mod.rs Outdated
Comment on lines 422 to 429
for i in (0..RowGroup::MAX_ROW_COUNT).rev() {
let row_present = row_presence_flags & (1 << i) != 0;
if row_present {
if needs_seek {
let index = u64::try_from(i).map_err(|_| binrw::Error::AssertFail {
pos: stream_position,
message: format!("Failed to calculate row index {}", i),
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the try_from seems unnecessary. just cast it to u64 upfront. Interestingly rustc does not catch integer overflow when casting constants to narrower types... That seems like a compiler bug? Anyways, we're widening here so its fine IMO.

Suggested change
for i in (0..RowGroup::MAX_ROW_COUNT).rev() {
let row_present = row_presence_flags & (1 << i) != 0;
if row_present {
if needs_seek {
let index = u64::try_from(i).map_err(|_| binrw::Error::AssertFail {
pos: stream_position,
message: format!("Failed to calculate row index {}", i),
})?;
for index in (0..RowGroup::MAX_ROW_COUNT as u64).rev() {
let row_present = row_presence_flags & (1 << index) != 0;
if row_present {
if needs_seek {

});
}

let mut needs_seek = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the "seek caching" necessary here? Can we simplify the code by always seeking unconditionally even if it turns out to be a noop?

src/pdb/mod.rs Outdated
Comment on lines 316 to 323
let estimated_number_of_row_groups = usize::from(num_rows)
.checked_sub(1)
.map(|x| x / RowGroup::MAX_ROW_COUNT)
.and_then(|x| x.checked_add(1))
.ok_or_else(|| binrw::Error::AssertFail {
pos: stream_position,
message: "Failed to calculate number of row groups".to_string(),
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't see how this could go wrong. Do you?

Suggested change
let estimated_number_of_row_groups = usize::from(num_rows)
.checked_sub(1)
.map(|x| x / RowGroup::MAX_ROW_COUNT)
.and_then(|x| x.checked_add(1))
.ok_or_else(|| binrw::Error::AssertFail {
pos: stream_position,
message: "Failed to calculate number of row groups".to_string(),
})?;
let estimated_number_of_row_groups = usize::from(num_rows).div_ceil(RowGroup::MAX_ROW_COUNT);

Comment on lines +327 to +335
u64::try_from(row_groups.len())
.ok()
.and_then(|index| index.checked_mul(36))
.and_then(|x| stream_position.checked_sub(x))
.map(SeekFrom::Start)
.ok_or_else(|| binrw::Error::AssertFail {
pos: stream_position,
message: format!("Failed to calculate seek position for row group {}", i),
})?,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm a little narrow minded from too much C++ programming but this feels overcautious. row_groups.len() is a usize and will pretty much always fit into a u64, right? Are there architectures where usize is bigger than a u64? can we static_assert that (is there such a thing in rust)?

@brunchboy
Copy link

brunchboy commented Aug 12, 2024

The questions Jan Holthuis raised in the Crate Digger issue make me concerned that we are not calculating the number of row groups correctly, and perhaps we simply can’t calculate that in advance at all, because we don’t know how many rows have been invalidated in each row group, so we may simply need to scan row groups until we have seen num_rows non-deleted rows. But it sounded like he found a database where num_rows was 7 and there were simply not that many non-deleted rows in the page at all. Perhaps it was corrupt?

@Swiftb0y
Copy link
Contributor

fyi, I can't resolve my review comments, so take the entire review as addressed after merging #111

* rename `stream_position` to `row_group_end_position`
* formulate `needs_seek` in terms of `row_present`, making it clear
  that reading the next row is only needed when the current one
  is not not present.
@Holzhaus Holzhaus merged commit acfdcfb into main Aug 19, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing rows table rows.
3 participants