Large uploads 'fail' but have succeeded #852
Comments
Narrowed this down to storage speed. I moved it from a 2 Gbps SAN to 6 Gbps DAS and the issue went away. I could push/pull ~100 MB/s synchronously on this connection, and the uploads are limited to my 40 Mbps / 5 MB/s throughput, so it must be something to do with latency sensitivity, or maybe checksumming?
Tested more and it's not storage, but I was close: it's the upload speed. If I upload to it from a 40 Mbps-up connection, I get this issue; if I upload from a 100 Mbps-up connection, it's fine (the server itself is on gigabit enterprise fibre). Most services in Australia max out at 40 Mbps upload, and that's only if they have fibre or REALLY good copper, so this would likely affect many people.
I think it's timing out for you. You can try playing with the chunk duration, but it's simpler to increase the timeouts, because fine-tuning the chunk duration so that it doesn't time out is a nightmare. There is currently a hard-coded timeout value for the 'short' tasks, which I addressed in my PR.

You can also completely eliminate all other timeouts by setting them in your local settings file. For testing, I wanted things to never time out (don't do this in production):

```python
CELERY_TASK_SOFT_TIME_LIMIT = None
CELERY_TASK_TIME_LIMIT = None
CELERY_SOFT_TIME_LIMIT = None
CELERYD_TASK_SOFT_TIME_LIMIT = None
```

You can also set the time limits to something other than None rather than disabling them entirely.

Also, how are you deploying mediacms at the moment? If you're using nginx and uwsgi, you'll at minimum need to increase the timeouts in those places as well:

```ini
[uwsgi]
http-timeout = 86400    ; set HTTP timeout to 24 hours
socket-timeout = 86400  ; set socket timeout to 24 hours
```

And if you are also using nginx in front of uwsgi, its timeouts and client_max_body_size need to be raised too (see the sketch below).
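As a rough illustration of the nginx side (the directive values here are assumptions to tune for your setup, not the exact config from this thread):

```nginx
# Illustrative values only; place in the server/location block that fronts MediaCMS.
client_max_body_size 0;         # 0 removes nginx's request-body size limit
proxy_read_timeout   86400s;    # keep long-running proxied requests alive
proxy_send_timeout   86400s;
send_timeout         86400s;
uwsgi_read_timeout   86400s;    # relevant when nginx talks to uwsgi via the uwsgi protocol
uwsgi_send_timeout   86400s;
```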
Configure things so they work best for you, but I was able to ingest 90 GB files stored on s3fs with these changes. Also, the error message in the UI ideally should provide more context into why things are failing. It took a lot of debugging and trial and error for me to ingest large files.
Oh nice one - I'll grab that PR and change the configs. Many thanks!
Hi, thanks for the report and for the insightful comments @tobocop2 and @platima. I think that the suggestions for uwsgi are valid; there might be something that has to be optimized in this case. And I also believe that the CELERY settings are irrelevant (going to comment on the PR as well), because there are different uses of the word 'chunk' here.
As a recap, the problems here seem to be related to uploading and not to post-processing of files. On a side note, @tobocop2 I am curious to learn what a 90-gigabyte file that you've uploaded could be! I haven't tested the software with such big files, but I would be very interested to learn more about this case: what type of videos/workflows you have here, and what infrastructure processes them. The software is definitely not optimized for that type of video, but I'm happy to read it doesn't fail. One thing I can think of is the command that produces the sprites file (the small images shown when you hover over the video duration bar); I know that this command fails on a vanilla MediaCMS for videos longer than 1-2 hours and needs a tweak to the command that produces it (something related to ImageMagick, if I remember well). For sure there will be other issues or edge cases here, because again the software is not tested on such big files. Regards
@mgogoulos the chunk jobs were all failing for me. You have 300 seconds as the hard-coded timeout in the supervisord file and in the tasks file, and the order of precedence is keyword arguments first, then the command line. Right now there is no way to override the timeout because it's hard-coded, hence I made #856. Regarding the sprites, I overwrote the ImageMagick policy.xml file to raise its resource limits, along the lines of the sketch below.
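As a rough illustration (the values are assumptions, not the exact policy used above), relaxing ImageMagick's resource limits in policy.xml looks something like this:

```xml
<!-- Raise the resource ceilings that sprite generation tends to hit on long videos. -->
<policymap>
  <policy domain="resource" name="memory" value="4GiB"/>
  <policy domain="resource" name="map"    value="8GiB"/>
  <policy domain="resource" name="disk"   value="16GiB"/>
  <policy domain="resource" name="width"  value="64KP"/>
  <policy domain="resource" name="height" value="64KP"/>
</policymap>
```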
I'm able to ingest 90 gig files no problem using the basic docker-compose infrastructure you provided, with the addition of an s3fs service in my docker-compose. My config adds that s3fs service alongside the stock services; a sketch of its general shape follows.
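A minimal sketch of what such an s3fs sidecar can look like in docker-compose; the image name, environment variable names, and mount path are assumptions based on a common community s3fs image, not the exact file used here:

```yaml
# Sketch only: verify the image name, env vars, and mount propagation against the image's docs.
services:
  s3fs:
    image: efrecon/s3fs:1.93              # assumed community s3fs-fuse image
    restart: unless-stopped
    devices:
      - /dev/fuse                         # FUSE device needed to mount inside the container
    cap_add:
      - SYS_ADMIN                         # s3fs needs mount privileges
    environment:
      AWS_S3_BUCKET: my-mediacms-media    # assumed variable names
      AWS_S3_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_S3_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    volumes:
      - ./media_files:/opt/s3fs/bucket    # share the bucket mount with the web/celery containers
```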
This is just a POC, but it works for 90 gig files given all my modifications to the uwsgi, ImageMagick, nginx, and Django confs. My local settings file pulls the relevant overrides together; a sketch follows.
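A sketch of the kind of cms/local_settings.py overrides discussed in this thread; only the Celery limits come from the comments above, FRONTEND_HOST is a standard MediaCMS setting with a placeholder value, and the HLS flag is a hypothetical name:

```python
# cms/local_settings.py -- sketch, not the exact file used in this setup.

# Celery time limits discussed earlier in the thread.
# None disables them entirely; use finite values in production.
CELERY_TASK_SOFT_TIME_LIMIT = None
CELERY_TASK_TIME_LIMIT = None
CELERY_SOFT_TIME_LIMIT = None
CELERYD_TASK_SOFT_TIME_LIMIT = None

FRONTEND_HOST = "https://media.example.com"  # placeholder portal URL

DISABLE_HLS = True  # hypothetical flag name; HLS conversion was disabled in this setup
```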
Note I disabled the HLS conversion because I don't need it. For convenience I also wrapped all the docker-compose commands in a Makefile:

```makefile
SHELL = /bin/sh

UID := $(shell id -u)
GID := $(shell id -g)
export UID
export GID

# Path to the mediacms directory after cloning
MEDIACMS_DIR := mediacms

# Make 'build' target a default goal
.DEFAULT_GOAL := build

# 'build' target will copy files and run 'up'
.PHONY: build
build: copy-files up

# Rule for cloning 'mediacms' repository
.PHONY: clone
clone:
	if [ ! -d "$(MEDIACMS_DIR)" ]; then git clone https://github.com/mediacms-io/mediacms.git $(MEDIACMS_DIR); fi

# Rule for copying the compose file, .env_file, and nginx conf file
copy-files:
	# Copy 'docker-compose-letsencrypt-s3.yaml', '.env_file', and 'client_max_body_size.conf' to 'mediacms' directory
	cp docker-compose-letsencrypt-s3.yaml .env_file requirements.txt clear_cache.sh $(MEDIACMS_DIR)/
	# Copy 'client_max_body_size.conf' to the reverse_proxy directory
	mkdir -p $(MEDIACMS_DIR)/deploy/docker/reverse_proxy/web
	mkdir -p $(MEDIACMS_DIR)/deploy/docker/reverse_proxy/nginx
	cp web/client_max_body_size.conf $(MEDIACMS_DIR)/deploy/docker/reverse_proxy/web/
	cp nginx/client_max_body_size.conf $(MEDIACMS_DIR)/deploy/docker/reverse_proxy/nginx/
	cp local_settings.py nginx.conf uwsgi.ini nginx_http_only.conf policy.xml $(MEDIACMS_DIR)/deploy/docker/
	cp tasks.py $(MEDIACMS_DIR)/files/
	cp local_settings.py $(MEDIACMS_DIR)/cms/
	cp supervisord-celery_short.conf $(MEDIACMS_DIR)/deploy/docker/supervisord/

# Targets for managing the Docker Compose setup
.PHONY: up
up:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml up --build -d

.PHONY: down
down:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml down --remove-orphans
	umount $(MEDIACMS_DIR)/media_files

logs_nginx:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --tail=1000 --follow nginx-proxy

logs_web:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --tail=1000 --follow web

logs_db:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --tail=1000 --follow db

logs_redis:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --tail=1000 --follow redis

logs_s3fs:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --follow --tail=1000 s3fs

logs_celery_worker:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --follow --tail=1000 celery_worker

logs_celery_beat:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --follow --tail=1000 celery_beat

logs_s3fs_cron:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --follow --tail=1000 s3fs-cron

logs_acme-companion:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --follow --tail=1000 acme-companion

logs_migrations:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml logs --follow --tail=1000 migrations

web_sh:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml exec web /bin/bash

celery_sh:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml exec celery_worker /bin/bash

nginx_sh:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml exec nginx-proxy /bin/bash

s3fs_sh:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml exec s3fs /bin/sh

s3fs_cron_sh:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml exec s3fs-cron /bin/sh

restart_web:
	cd $(MEDIACMS_DIR) && docker-compose -f docker-compose-letsencrypt-s3.yaml restart web
```
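With that Makefile in place, day-to-day usage is just the targets above, for example:

```sh
make clone        # clone mediacms into ./mediacms (build does not run this for you)
make              # default goal 'build': copy the custom files, then bring the stack up
make logs_web     # follow the web container logs
make down         # stop the stack and unmount media_files
```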
Always happy to help, and thanks for commenting on this ticket. Looking forward to a better uploader then :P
I'm quite regularly running into this issue as well. Thanks for the suggestions, everyone! I will try some of this and see if it helps. It would be great if some of this information were added to the admin documentation.
This ticket stays open until some of this info is moved to the admin docs. Next, how is s3fs performance when playing these videos? Any observations? And also, out of curiosity, what are these 90GB files, and what sector are you working in?
I completely abandoned s3fs for my needs. The ingestion throughput was poor and unsuitable for my workload, and I ultimately devised a solution that simply uses aws s3 sync via a cron job (sketched below); I am going to migrate this to AWS DataSync eventually. I was getting single-stream uploads to S3 at roughly 250 MB/s, and I was doing this across multiple workers, so I was seeing over 1TB/s throughput via multiple s3 sync commands. I think my throughput needs are just too demanding for s3fs, so I had to abandon it. It was a very interesting recommendation, but just not suitable for the scale and file sizes I'm dealing with. I possibly could have spent the time to tune s3fs, since it has a wide variety of options; I tried experimenting with the cache option, but ultimately I was just not seeing my files ingested in reasonable amounts of time, so I moved away from it. While s3fs is a really awesome tool, I found the simpler approach of maintaining a cron job that runs the aws s3 sync command hourly to be a more sustainable and manageable solution.

The 90 gig files are broadcast media files (MXF format; I had to add my own mimetype for them to work with mediacms). For playback, I'm using the MEDIA_URL setting in Django and serving all of my files from S3 over CloudFront. This is breaking static assets in some places, but I am OK with that for the time being; I see an existing PR which should fix that issue for me.

In order to actually support the throughput and scale I needed, I had to deploy mediacms in two configurations: A) an ingest mode (EC2 + EBS with GP3 max IOPS and max throughput + local Redis + shared RDS Postgres), and B) a simpler serving mode on ECS/EFS. Ingest is on demand, so year round I'll only have the simple ECS/EFS configuration deployed. In my ingest configuration I have one worker running on the same machine as the web app, and I give the web app roughly 30% of my CPU. Distributed workers would be great, but the throughput needs that I have are unreasonable for NFS: encoding was taking unreasonable amounts of time when reading from NFS, and uploading was similarly slow.

I have successfully ingested over 60 terabytes of media through mediacms with some modifications to the deployment configurations and my own AWS infrastructure code using Terraform. Thank you as well as all of the contributors to this project @mgogoulos
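A minimal sketch of the hourly aws s3 sync cron job described above; the paths and bucket name are placeholders:

```sh
# /etc/cron.d/mediacms-s3-sync (sketch; adjust paths, bucket, and user)
# Push new/changed media files to S3 at the top of every hour.
0 * * * * root aws s3 sync /srv/mediacms/media_files s3://my-mediacms-media/media_files --only-show-errors
```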
Thanks for the very interesting information and insights on the usage of s3fs, the customizations, and your use case. I'll try to review the PRs and merge them soon.
I'd like to note that I have not found any of the suggestions here to work for me. I still get this problem when large files are uploaded: the uploader shows the upload as failed and the file does not appear in the user's media list, but a while later the file shows up anyway. I have not had the time to trace it down. I've just told users who upload huge files that it's a known issue, and that if it says failed it probably worked, so they need to check back later.
Setup
I've found that many videos, usually ones over 1GB, upload successfully but show as failed in the uploader.
Retry does nothing in this case, unlike an upload that has failed part of the way through. If I then go look at the media file, it has uploaded perfectly fine and encoding is finished. On the disk it shows just the one file, i.e. no spare chunks.
I have tried reducing the chunk sizing, but that did not help.
I can repro this easily, but am unsure which logs to check.
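For reference, the services most worth watching in the docker-compose deployments discussed above (names taken from the Makefile earlier in this thread) are roughly these:

```sh
docker-compose logs --tail=200 --follow web             # uwsgi / Django upload handling
docker-compose logs --tail=200 --follow celery_worker   # chunk and encode task errors
docker-compose logs --tail=200 --follow nginx-proxy     # proxy timeouts, 413/504 responses
```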