Entire Section text version #171

Open
TimTCM opened this issue Oct 2, 2024 · 4 comments
@TimTCM

TimTCM commented Oct 2, 2024

Even with the current Microcomp format, would GPO be willing to publish a text file of each of the Congressional Record's four sections in its entirety?

Entire sections are available in PDF, but not in text.

One can create a combined version from the downloaded zip file, and I am doing so right now, but confidence in the product would be greater if the official source made this available.
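For illustration, the combining step might look roughly like the sketch below. It is a minimal sketch under stated assumptions, not GPO's method: it assumes the daily zip holds one `.htm` file per granule, that file names end in a page marker such as `PgS6397-3`, that a marker without a suffix is the first granule on its page, and that the text sits inside a `<pre>` element. The function name `combine_section` is hypothetical.

```python
import html
import re
import zipfile

# Assumed file naming inside the zip, e.g. "...-PgS6397-3.htm".
# The letter is the section (H, S, E, or D); the optional suffix is the
# granule's position on that page.
GRANULE_RE = re.compile(r"Pg([HSED])(\d+)(?:-(\d+))?\.html?$")


def combine_section(zip_source, section):
    """Concatenate one section's granule texts in page order."""
    parts = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            match = GRANULE_RE.search(name)
            if match is None or match.group(1) != section:
                continue
            page = int(match.group(2))
            # Assumption: a missing suffix means the first granule on the page.
            seq = int(match.group(3) or 1)
            raw = archive.read(name).decode("utf-8", errors="replace")
            # Assumption: the text lives in a <pre> element; fall back to the whole file.
            body = re.search(r"<pre[^>]*>(.*?)</pre>", raw, re.S | re.I)
            inner = body.group(1) if body else raw
            text = html.unescape(re.sub(r"<[^>]+>", "", inner))
            parts.append((page, seq, text.strip()))
    parts.sort(key=lambda part: (part[0], part[1]))
    return "\n\n".join(text for _, _, text in parts)
```

Calling `combine_section("CREC-2024-09-25.zip", "S")` would then return the Senate section as one string, provided the archive follows the assumed layout (the zip file name here is also only an example).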

I don't plan to store the Record indefinitely, so if content later needs re-caching, I'd like to be able to pull it from the official source without reprocessing the whole zip file on the fly each time.

This would be especially helpful for things that combine pages like House Morning Hour debate, one-minute speeches, and then even more so in the Senate where a Senate speaker's remarks can cross multiple pages as they are currently divided.

Thank you

@jonquandt jonquandt self-assigned this Oct 2, 2024
@jonquandt
Member

@TimTCM - thanks for the suggestion. We'll look at feasibility of this.

From an acceptance criteria point of view, would having a complete package html/text file meet the need? This would include the text for the entire daily issue.

For this:

> This would be especially helpful for things that combine pages like House Morning Hour debate, one-minute speeches, and then even more so in the Senate where a Senate speaker's remarks can cross multiple pages as they are currently divided.

Could you provide an example where a single Senate speaker's remarks are split across multiple granules? That will help us understand that portion a bit better.

If they are speaking on different subjects, it makes sense to me that they would have separate granules in GovInfo, but perhaps there's a scenario that I'm not thinking of at the moment.

@TimTCM
Author

TimTCM commented Oct 19, 2024

> From an acceptance criteria point of view, would having a complete package html/text file meet the need? This would include the text for the entire daily issue.

That would work. Thank you!

> This would be especially helpful for things that combine pages like House Morning Hour debate, one-minute speeches, and then even more so in the Senate where a Senate speaker's remarks can cross multiple pages as they are currently divided.
>
> Could you provide an example where a single Senate speaker's remarks are split across multiple granules? That will help us understand that portion a bit better.
>
> If they are speaking on different subjects, it makes sense to me that they would have separate granules in GovInfo, but perhaps there's a scenario that I'm not thinking of at the moment.

Almost every day, the Senate leaders' remarks at the beginning of the day, which usually cover different subjects, get broken up by topic. When remarks are divided this way, pages after the first one sometimes lack the speaker's name at the beginning of the remarks.

In the following examples,

  • the timestamp link is to a splicing of the pages together
  • the topical links have links to the GPO granules at the bottom of the page

Here are some examples, from 7/31:

11:07 a.m. - Majority Leader Schumer spoke about anti-Semitism, the Tax Relief for American Families and Workers Act of 2024, and the Vacca nomination.

11:18 a.m. - Republican Leader McConnell spoke about border security and judicial nominations.

Here, from 9/25, the China page doesn't have the speaker name on it:

10:25 a.m. - Republican Leader McConnell spoke about the filibuster and China.

If they're not broken up by topic, then that's because they get rolled into a long Legislative Session or Executive Session page. For example, from 7/25:

LEGISLATIVE SESSION (S5489-7, 1,144 lines)

Either way, I've never seen a single link from the source to a leader's full remarks and only full remarks, notwithstanding the times a leader only spoke on one topic.

It's not just leaders that get their remarks split up over multiple pages. Here are some other examples, from 9/25:

11:20 a.m. - Senator Schatz spoke about disaster relief and honored Appropriations Committee staffer, Dabney Hegg.

7:25 p.m. - Senator Kennedy thanked the Bloomberg Foundation and honored departing floor staffer Katherine Foster. Senator Kennedy spoke about tax record privacy and asked unanimous consent to take up and pass H.R. 8292 regarding tax records. Senator Wyden objected.

That last example with Senator Kennedy is mixed. He starts by saying, "Three quick points," and then his remarks get split over three pages with only the first having his name at the beginning. What's mixed about this one is his third point becomes a unanimous consent request. UC debates have multiple Senators speaking and it makes sense to have the whole debate on one page.

One way to tell from GPO's output whether sometimes-long pages have been broken up into shorter ones by speaker is whether the title is in all caps. All-caps "SESSION" pages can be very long, while if the title words are mostly lowercase, it seems the Reporters of Congressional Debate added more dividers in the content. The Senate on 9/24 has lots of lowercase titles. In contrast, the Senate on 9/19 has a lengthy page with lots of things combined:

LEGISLATIVE SESSION (S6192-4, 1,109 lines) — 1 Vote: 247 — Speakers: Schumer (D-NY) • McConnell (R-KY) • Thune (R-SD) • Daines (R-MT) • Sullivan (R-AK) • Murray (D-WA) • Paul (R-KY) • Durbin (D-IL) — S. 5074 • H.R. 9468
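The all-caps observation can be turned into a quick check. This is only a sketch of the heuristic described above, and the function name `is_all_caps_title` is hypothetical; it says nothing about how GPO or the Reporters actually decide where to divide content.

```python
def is_all_caps_title(title):
    """Return True when the title has letters and all of them are upper-case."""
    letters = [ch for ch in title if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)
```

Under this check, "LEGISLATIVE SESSION" counts as all caps and is a candidate for a long combined page, while "Unanimous Consent Request--H.R. 8281 (Executive Calendar)" does not.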

Topical divisions instead of speaker divisions tend to cause speaker names to be missing from the beginning of subsequent pages. What happens is a Senator gives a speech, another Senator arrives on the Senate floor to talk about something else, listens to the current speaker, and then when the next Senator goes to speak, the Senator first starts with some comments about the previous speaker before speaking on their main topic. Then, in the Record, the subsequent speaker's name is on the previous page, and not at the beginning—or sometimes even at all—on the page where the main substance of their remarks is found.

For instance, from 9/18, there's no Senator name on the second page because it already appeared on the first:

4:56 p.m. - Senator Welch spoke on disaster relief.

5:08 p.m. - Senator Sanders spoke on Israel.
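With a comprehensive version in page order, the missing-name case can be handled by carrying the last named speaker forward. The sketch below is a rough illustration, assuming each page's text has already had its heading lines stripped and that a speaker turn opens a paragraph with a title and an upper-case surname (for example `Mr. SCHUMER.`). The regular expression and the function name `attribute_pages` are hypothetical, and the pattern does not cover every form of speaker heading, such as the Presiding Officer or names followed by a State.

```python
import re

# Assumed convention: a turn starts a line with "Mr.", "Ms.", or "Mrs.",
# then a surname ending in capitals, then a period ("Mr. McCONNELL.").
# "Mr. President," does not match because "President" is not upper-case.
SPEAKER_RE = re.compile(
    r"^\s*(?:Mr|Ms|Mrs)\. ([A-Za-z'\-]*[A-Z]{2,}(?: [A-Z]{2,})*)\.",
    re.M,
)


def attribute_pages(pages):
    """Return, for each page in order, the speaker talking when it begins.

    A page that opens with a speaker heading gets that speaker. Otherwise
    it inherits the last speaker named on an earlier page, or None if no
    speaker has been seen yet.
    """
    results = []
    last_speaker = None
    for text in pages:
        opening = SPEAKER_RE.match(text)
        results.append(opening.group(1) if opening else last_speaker)
        named = SPEAKER_RE.findall(text)
        if named:
            last_speaker = named[-1]
    return results
```

The design choice here is to trust page order: a page with no opening name is assigned to whoever was last recognized on a preceding page, which matches the pattern described above where the next Senator's name lands at the end of the previous page.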

UC requests don't always get their own pages, either.

Sometimes they do, like on 9/25:

Unanimous Consent Request--Executive Calendar (Executive Session) (S6397-3, 103 lines)
Unanimous Consent Requests--Executive Calendar (Executive Session) (S6398, 90 lines)
Unanimous Consent Request--H.R. 8281 (Executive Calendar) (S6398-3, 623 lines)
Unanimous Consent Request--S. 1398 (Executive Calendar) (S6403, 185 lines)

Sometimes they don't, like on 9/17:

LEGISLATIVE SESSION (S6075-2, 1,510 lines) — UC debate excerpt

To bring it back to your question about the usefulness of a comprehensive page that has everything in succession, particularly for the Senate, yes, that would help deal with the many different ways pages and the same types of content get divided in the Congressional Record. I realize some of these things may be artifacts of how the content appears in print. Adding the name on pages where it often seems missing may not happen because of this. Having a comprehensive version makes it easier to divide content by speaker.

@jonquandt
Member

Thanks for the additional detail. From the content originator's perspective, breaking by subject was the original request, but I see where consolidating in a different manner would be helpful. At this time, we are looking at providing the text files at a package or book level.

@jonquandt
Member

This is something we are looking at as a March 2025 item.
