tv_grab_uk_tvguide can't configure #214
Comments
I have experienced almost the same problem with V1.2.1 running on Windows:

C:\xmltv>tv_grab_uk_tvguide --configure
No channels found in TVGuide
This may be related to issue #185 (ownership of the tvguide.co.uk business/website having changed in May of 2022).
It was the "HTTP error: 301 Moved Permanently" that made me think this may need some attention :)
Are you able to edit the script after line 1408? See if that helps.
The whole site has been updated. Not sure how compatible (if at all) the changes will be with the existing grabber. There's an API (whoop!) which spits out JSON when called via, for example: ... I notice it's proxied via Cloudflare now (was it before? I hadn't noticed if it was), which could be problematic in future depending on whether or not they enable the bot protection, but I think for now this is not a problem.
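For anyone wanting to experiment with that JSON output, a minimal Perl sketch of calling such an endpoint (the path below is a hypothetical placeholder, since the example URL didn't survive in this thread):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::PP qw(decode_json);

# NB: hypothetical placeholder path -- substitute the real API endpoint
my $url = 'https://www.tvguide.co.uk/api/PLACEHOLDER';

my $ua  = LWP::UserAgent->new( timeout => 30 );
my $res = $ua->get($url);
die 'Fetch failed: ' . $res->status_line . "\n" unless $res->is_success;

# Decode the JSON body into a Perl data structure for inspection
my $data = decode_json( $res->decoded_content );
```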
Excellent. Presuming that the API is (mostly) stable, and is intended to be (and stay) publicly available, that should make things a lot easier going forward. Even better if the company has a spec available for developers.
Obviously a JSON ingester and a screen scraper have little in common on the input side, even if the data acquired and output will have a lot in common. Whoever steps up to write a replacement may find enough salvageable and reusable code that the git diff will not be a complete rip-and-replace (at least the POD is likely to have strong re-usability).
I have not looked in quite some time, but I suspect this is a somewhat recent change (I recall that when I did look, it had a large RTT to the site from my location in the US, which correlated with it being hosted in the UK and not just in the local Cloudflare DC).
So far I've worked out the following...
...but that last JSON blob doesn't tell me things like the category. I've tried URIs like ... Any idea what the route might be?
Does anyone have a solution?
Not at this time. If you want to help, contact the tvguide.co.uk site (email, mail, or perhaps walk into their office) and request access to the developer API spec to share with this project.
@garybuhrmaster
It means nobody has done it yet. I'm going to look at putting something together this weekend, as I'm running out of guide data for my MythTV system.
I've been looking at this today. One thing I can't seem to find on the new site is the episode name. For example: Family Guy S21E5 is on later on ITV2, but I can't find its name 'Unzipped Code' via the new TVGuide website. In MythTV speak this is I suspect the TVGuide API has probably only been written to do enough to support their needs and nothing more. |
Looking at the website programme (details) screen for each programme there is no sign that the episode name is displayed for any programme on any channel. If correct there is no way you can grab them. |
Did anyone request the spec for the API? If not, I am willing to do so and share it. |
I haven't and I suspect nobody else has either. I guess it can't hurt to try, although my feeling is that there probably isn't more to the API than can be seen currently. I could be wrong, of course and that could be a good thing. |
I have submitted a request for the API spec and will provide an update here when I hear anything. Update 19 September 2023: still haven't heard anything from my request for the API spec, not sure if we still need it? |
I tried using the channel ID (in UUID format?) in the .conf file, but the script is expecting an integer so that doesn't work. However, entering the URL into a browser in the form https://www.tvguide.co.uk/channel/2a548fcc-55e9-561d-9a77-f485fb69dad1/ (note that the trailing slash is important) brings up the listings for that channel (defaulting to day zero). It also adds the channel name, so that the URL appears as https://www.tvguide.co.uk/channel/2a548fcc-55e9-561d-9a77-f485fb69dad1/bbc-one-london/0 for the returned page. Is that a possible temporary workaround for the screen scraping, using the new-format channel IDs rather than the old integer ones? Tedious to edit the config file, but it may only need to be carried out once. I am competent in many computer languages but unfortunately Perl isn't one of them, so I'm not sure where to start with trying out this new approach in the code!
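If anyone wants to prototype that workaround, a minimal sketch (the UUID is the BBC One London example quoted above; the parsing step is left as a stub since the new markup is still being mapped):

```perl
use strict;
use warnings;
use LWP::UserAgent;

# BBC One London, from the URL above; the trailing slash matters and
# the site defaults to day 0 (today)
my $channel_id = '2a548fcc-55e9-561d-9a77-f485fb69dad1';
my $url        = "https://www.tvguide.co.uk/channel/$channel_id/";

my $ua  = LWP::UserAgent->new( timeout => 30 );
my $res = $ua->get($url);
die 'Fetch failed: ' . $res->status_line . "\n" unless $res->is_success;

my $html = $res->decoded_content;
# TODO: extract programme rows from $html once the new page structure
# is understood well enough to scrape reliably
```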
I have been doing some playing around and found that the channel listing from the URL https://www.tvguide.co.uk/channel/ looks to be in a completely different format to what the existing Perl is expecting (no surprise there, probably). Attached is the HTML returned using curl to retrieve it (in a ZIP file; I couldn't attach the HTML directly). The key part seems to be where the channel list starts, snippet below: ...
Apologies that the above is not pretty-printed; not sure why that is, but the full HTML file attached will be easier to read. There no longer appears to be a separate "channel" indicator; it appears that it would have to be deduced from the href tag contents. The channel names seem to be easily identifiable via the ...
Kinda got something working, but it's messy code, doesn't use any of the XML-TV library, and takes some work to set up. It's also crashed once this morning reading a programme details page, but picked up again after... so I guess Cloudflare got in the way or my broadband dropped. UPDATE: switched to LWP::UserAgent::Determined in case of these dropouts.
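For reference, a minimal sketch of the LWP::UserAgent::Determined approach (the retry timings here are illustrative, not necessarily what was used):

```perl
use strict;
use warnings;
use LWP::UserAgent::Determined;

# Drop-in subclass of LWP::UserAgent that retries failed requests;
# timing() takes a comma-separated list of pauses (seconds) between tries
my $ua = LWP::UserAgent::Determined->new( timeout => 30 );
$ua->timing('5,15,60');

my $res = $ua->get('https://www.tvguide.co.uk/channel/');
die 'Fetch failed after retries: ' . $res->status_line . "\n"
    unless $res->is_success;
```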
I have been playing with the code for fetching the channels, and the code below creates the attached .conf file:

```perl
sub fetch_all_channel_ids {
    ...
}
```

Apologies that the code is not appearing properly; I included it through the 'code' option but it doesn't seem to have applied to the whole snippet. There are probably shorter, prettier ways of doing this, but as I haven't learnt Perl I will offer this to others to examine and incorporate properly if they wish. The code for fetching the programme details will also need to be updated to incorporate the new channel IDs.
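Since the body of that sub was lost above, here is a rough sketch of the approach it describes (deducing channels from the href contents of the /channel/ page); the regex and the channel= output format are assumptions, not the original code:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Sketch: scrape the channel index and deduce channels from the hrefs,
# which appear to look like /channel/<uuid>/<channel-slug>
sub fetch_all_channel_ids {
    my $ua  = LWP::UserAgent->new( timeout => 30 );
    my $res = $ua->get('https://www.tvguide.co.uk/channel/');
    die 'Fetch failed: ' . $res->status_line . "\n" unless $res->is_success;

    my %channels;
    my $html = $res->decoded_content;
    while ( $html =~ m{/channel/([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})/([a-z0-9-]+)}g ) {
        $channels{$1} = $2;    # uuid => slug
    }
    return \%channels;
}

# Emit one line per channel in a .conf-like format (format assumed)
my $channels = fetch_all_channel_ids();
print "channel=$_\n" for sort keys %$channels;
```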
BTW... someone should probably mention that non-profit schedulesdirect.org does have UK guide data. It's not free, but is pretty cheap and is high quality, via an API less likely to break with upstream changes :) There's a free 7-day trial that can hold you over until you can write a new scraper. (No CC info is requested or ever stored, and there's no auto-renewal.)
The xmltv grabbers tv_grab_zz_sdjson and tv_grab_zz_sdjson_sqlite can be used to get data for the UK (and other countries).
Disclaimer: I'm president of Schedules Direct and a founding board member. SD was formed by the leaders of a number of open source projects when our free US/Canada data source went away.
Robert
I've uploaded a new version of this grabber for beta testing. You will need to create a new config-file with --configure. Programme categories are being elusive at the mo: they are available via a webpage call for every prog in the schedule, but I'd rather not do that if possible. Let me know how it goes.
@honir Thanks for this. I created a .conf file (Freeview, London, All channels) without any problem (very impressed with your handling of the options!) but get an error when I grab a listing. For convenience I always run xmltv from a batch file like this test one: ... But the result is this: ... Line 380 is in this group: ... Any idea what I am doing wrong? PS ...
@honir Unfortunately you removed the error message but the result is the same: ... I noticed that it always writes a single cache entry (attached) for BBC One London, even after I deleted that channel from the .conf list. That seems to be the problem, because it is correctly generating the XML for that channel.
My apologies @honir, your script is working correctly. I was confused because the command I posted above finished in a few seconds and produced only a single entry in the cache. However, it generated an XML for 3 days with all the Freeview channels except BBC One London, which as mentioned previously I deleted from the .conf channel list. I assume the change from grabbing the old EPG is that it now omits the detailed programme descriptions, categories, etc. Thanks again for producing the new script so quickly.
@honir thank you so much for fixing this! I run a Java program daily on a Raspberry Pi to mimic the defunct Sky Never Miss service, and this grabber is invaluable. I had to modify my Java code slightly to accommodate the season and episode data changes, but it is all working now. I would have liked the Freeview and Sky channels all in one config file; does that look like it might be possible (I could raise a new issue for this enhancement), or will it not be possible due to the way that tvguide.co.uk splits everything by platform?
Yes, it would have to be two separate runs. But you can join or merge the two files together before feeding your RPi: see tv_cat and tv_merge.
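For example, with tv_cat (a sketch; the config and output file names here are illustrative):

```sh
# Grab each platform with its own config file, then concatenate the output
tv_grab_uk_tvguide --config-file freeview.conf --output freeview.xml
tv_grab_uk_tvguide --config-file sky.conf      --output sky.xml
tv_cat freeview.xml sky.xml > combined.xml
```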
@honir, thanks for the swift reply, I shall investigate!
I also added the fullstops back in my implementation (which I abandoned), but one issue with this is that it'll add fullstops to lines ending with a question or exclamation mark. I used: ...
You might not need the first line; that comes from another grabber I'm working on. It seems (I think) that the XML writer trims trailing white space (and leading?), but that's not helpful if you dump another non-whitespace character after it. :-)
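The snippet itself was lost above, but from the description it would have been something like the following two lines (the variable name is an assumption; note the caveat that the bare append also fires after '?' and '!'):

```perl
$desc =~ s/\s+$//;                     # first line: trim trailing whitespace
$desc .= '.' unless $desc =~ /\.$/;    # append a full stop if missing;
                                       # as noted, this also fires after '?' or '!'
```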
Thanks @mkbloke, I've updated the script with your code ;)
Just tried the latest version of the Perl script and got this: ...
Don't know if it is Cloudflare, tvguide.co.uk tightening access to the API, or a temporary glitch. Anyone else seeing this?
Yep, looks like they don't want us to play. They are blocking access from XMLTV.
That's a shame, but thanks for trying.
It's doable, but they'd probably just block us there as well. Maybe you could persuade them to allow access for XMLTV, but I doubt it. The new owners don't seem to have their heart set on providing quality guide data. (And their app is broken and hasn't been updated in over 3 years.)
I'm happy to recommend the guide service from Schedules Direct (SD). I've been using them myself for over 5 years and, apart from a few niggles -- such as them working on the server overnight in Texas time, which is of course 6-10 a.m. in the UK, and means my daily automated fetch fails and I have to remember to try again later -- the service has been excellent. Fault reporting could be a bit more transparent (you only get to see your own reports rather than a public fault log) but it's rarely needed. The data are generally good, and cover pretty much all UK channels. The data provider (Gracenote) is much improved from its early Tribune Media Services (TMS) days in the 00s. You can use either of the project's approved grabbers: zz_sdjson or zz_sdjson_sqlite. The dollar exchange rate has bumped up the cost in recent years, but at £28 a year it's good value (54p a week). Disclosure: I have no affiliation with (or commission from!) SD, although I do know the people who run the organisation (as they include one of the elders of this XMLTV project). But my comments reflect the quality, and cost, of the data rather than any personal connections.
A major benefit of the old EPG (but not the current version) was the detailed information, including very long programme descriptions, categories, ratings, etc. Does SD provide any of that for the UK channels?
Absolutely. e.g. Film4: ...
Or an E4 example: ...
I believe the data come straight from the TV networks (Gracenote have contracts for supply with most of them (excl. Virgin?)). They also provide unique identifiers, which helps with your database! e.g. see dd_progid=MV002829430000 (MV = movie) and dd_progid=EP012708730158 (EP = episode). I think they offer a 7-day free trial, so you can have a look without cost. It's worth a look, I think.
Quality guide data costs a fair amount of money to acquire and curate. With the exception of locations where the regulatory agencies require broadcasters to offer such guide data for free, and someone else (the broadcaster) pays the organizations, those organizations expect to be paid for their work by the consumers of that guide data. In some cases that means limiting access to their own customers (placing the guide data behind a subscriber portal), or monetizing the guide data by providing the content on a web page filled with ads, with the expectation (and typically the requirement) that you must consume (see) the ads (which is sometimes expressed as a "no screen scraping allowed" TOS). Of course, in the longer run, while the underlying data about the show/movie/event will still be monetizable (because people will want to search out available romcom movies, for example), many people are choosing to consume content in ways other than linear TV (via various on-demand/streaming services), so linear TV guide data likely has a declining revenue model, and that almost ensures that investment in such linear TV guide data is also declining.
I too am a satisfied customer of Schedules Direct, and have been since its inception (2007), when the feed from TMS (zap2it labs) ended in the US. I liked its data so much I authored one of the XMLTV grabbers using their new API, partially because the PVR I use could migrate to pure XMLTV guide data loading, and partially because the guide data available via the new API was richer than the previous data. Having guide data that just works has had value to me.
For scheduled activities and site-wide faults, Schedules Direct's forum provides a more public view and can be set to notify you (for those who wish to share; Schedules Direct clearly should not share your issues publicly without your permission, but for larger unplanned issues one tends to see a bunch of people posting).
Gracenote at some point decided to leverage their extensive data about content for worldwide availability. As I understand it, they partner with local organizations where possible for the local schedule data. Sometimes partners are not available (not all countries are supported), and sometimes the local partners' quality of data is poor (and countries disappear), but as I understand it, for most of the European countries the data is high quality, and they have processes to combine local showings with their existing rich content so that one tends to get high-quality data. The underlying data is arguably richer (in at least some cases) than what XMLTV grabbers can make available, due to the XMLTV schema.
The important issue is value. I clearly feel that I receive value for my yearly fee (which, being in the US, currently means $35/yr). I understand some will not see value in guide data "that just works" at any price, or because that price really exceeds what they can afford.
FD: I also have no affiliation with Schedules Direct itself. As the developer of one of the XMLTV grabbers I do occasionally work with Schedules Direct staff more directly due to an unusual issue or occurrence, and if requested will make sure my app still works with some planned change in their services (Schedules Direct staff will sometimes contact "the usual suspects" (i.e. other API developers) to let them know about upcoming changes which they believe should be transparent, but testing is always better; I certainly appreciate the heads-up).
I'm not sure exactly how tv_imdb works, but could it be used to provide the missing programme details? EDIT: ignore this, I have now tried it and the data is either very old or non-existent.
By changing the User Agent details for XMLTV::Get_nice just before it is called, I have managed to get the grabber to work. The default XMLTV User Agent is ... The line I added to the fetch_listings subroutine was: ...
just before: ...
Does that help? It may need updating from time to time if their API sees the "browser" as incompatible, e.g. due to age. Does this work for others, and if so, is this a permanent fix (at least for now)?
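The exact lines were lost above, but the approach described amounts to something like this (assuming Get_nice exposes its package-level $ua, which grabbers commonly adjust; the agent string is just an example and will age):

```perl
use XMLTV::Get_nice;

# Present a browser-like User-Agent instead of the default XMLTV one,
# before any pages are fetched
$XMLTV::Get_nice::ua->agent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
  . '(KHTML, like Gecko) Chrome/118.0 Safari/537.36'
);
```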
Worked for me too. Well done!!
My previous testing seemed to indicate that any user agent string containing the substring ... It is a solution for personal use. I'm not sure the XMLTV project will want to officially support a grabber that evades a site's blocking, though.
Fair point; perhaps the ... I also thought after my post that a similar change is probably needed during the configure phase.
I am not a decider, but the project has in the past decided not to facilitate circumvention of a site's choices for access/blocking, as the project wants to be considered a good net citizen. If you believe that the site should allow xmltv to access the content, you should ask them what the project needs to do to be considered an acceptable user of their data (sometimes that can be to increase the time between requests, or to use a different endpoint). If they do not offer, or are not willing to discuss, a path forward, there is likely no viable path forward for the xmltv project.
No website likes scraping, and all would prefer to stop it. Most don't because it's not possible or too much trouble. In any case, if the scraping is solely for personal use, websites which make their data openly available (i.e. not behind a paywall) cannot use legal or copyright rules to control how the individual user chooses to consume them. Up to now all TVG websites targeted by your scripts have fallen into the "most don't" category. spider3838 is probably right that they detected the new script as a bot and simply attempt to block all bots. If you really believe the "good net citizen" line, XMLTV should write to EVERY site it targets and ask if they mind you grabbing their TVG. I doubt you will get a welcome from any of them.
Devs are supposed to check the site's terms of service to see if there is anything against scraping before a grabber is added to the project. If the TOS changes, we don't periodically recheck it. With the agent string, we're upfront about who we are. If they want to block us, we don't fight it or start an arms race. I don't think we need to get permission from every site in advance... as you can see from this thread, getting hold of a human is difficult.
Gave it a try this weekend and it's really pretty good. The configure process needs some work: it hangs for 3-4 mins at the postcode stage, then only returns a handful of channels for the UK... so I got it to pull all channels instead, then disabled them all and used sqlite to enable the 50 I wanted. Also opened support ticket 22778 because of a lineup issue, which will hopefully be fixed before my trial expires.
Until/unless someone provides a working grabber for tv_grab_uk_tvguide, it has been removed from the builds.
XMLTV Version?
(Please specify release version or git commit ID)
V1.2.1 (Fedora repo)
XMLTV Component?
(Grabber name or utility)
tv_grab_uk_tvguide
Perl Version
This is perl 5, version 36, subversion 0 (v5.36.0).
Operating System
Linux 6.4.13-100.fc37.x86_64
What happened?
Running "tv_grab_uk_tvguide --configure" produces an error.
$ tv_grab_uk_tvguide --configure
tv_grab_uk_tvguide uses a cache with files that it has already downloaded. Please specify where the cache shall be stored.
Directory to store the cache in: [/tmp/xmltv/cache]
Fetching channels: 100% [======================================================================================================================================]No channels found in TVGuide
Trying alternative method 1
Fetching channels: 0% [ ]HTTP error: 302 Found
Retrying URL: https://www.tvguide.co.uk/mychannels.asp?gw=1242 (attempt 1 of 5)
HTTP error: 302 Found
Retrying URL: https://www.tvguide.co.uk/mychannels.asp?gw=1242 (attempt 2 of 5)
HTTP error: 302 Found
Retrying URL: https://www.tvguide.co.uk/mychannels.asp?gw=1242 (attempt 3 of 5)
HTTP error: 302 Found
Retrying URL: https://www.tvguide.co.uk/mychannels.asp?gw=1242 (attempt 4 of 5)
HTTP error: 302 Found
Retrying URL: https://www.tvguide.co.uk/mychannels.asp?gw=1242 (attempt 5 of 5)
HTTP error: 302 Found
Can't call method "look_down" on an undefined value at /usr/bin/tv_grab_uk_tvguide line 932.
What did you expect to happen?
Configure runs normally.
The tvguide website seems to have been updated. It may have different channel IDs now.