
Checking the size of a file that's being downloaded by the browser causes it to get duplicated

I'm writing a script that needs to download a file by hitting a URL. It would be easy enough to curl the URL, but I need to be logged in to the site when doing it, and I've given up on trying to solve the problem of sending a curl request as a logged-in user. So what I've settled on doing is opening the URL in the browser and monitoring the downloads folder until the new file appears.
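(In case it helps to see what I mean by that: the kind of thing I gave up on looks roughly like this, in Python terms. The URL and the cookie name/value are placeholders I'd have to copy out of the browser, and it's exactly that session handling that I never got working.)

import requests  # third-party: pip install requests

# Placeholder URL and session cookie; the real values would have to be
# copied out of a logged-in browser session.
url = 'https://example.com/export/example_file.csv'
cookies = {'sessionid': 'PASTE_SESSION_COOKIE_HERE'}

response = requests.get(url, cookies=cookies)
response.raise_for_status()

with open('example_file.csv', 'wb') as f:
    f.write(response.content)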

I already wrote a working version of this in bash some time ago, but now I'm in the process of rewriting everything in Python. This was my bash solution:

#! /bin/bash

get_latest_csv_in_downloads() {
    find ~/Downloads -maxdepth 1 -iname '*.csv' -printf '%B@ %f\0' | sort -znr | head -zn 1 | cut -zd ' ' -f 2- | head -c -1
}

url="$1"
initial_file="$(get_latest_csv_in_downloads)"

# Open the file in the browser which will initiate a download.
python -m webbrowser "$url" > /dev/null

# We'll try to obtain the most recent file in the downloads folder,
# until it's a different one than before we started downloading.
while latest_file="$(get_latest_csv_in_downloads)"; [[ "$initial_file" == "$latest_file" ]]; do :; done

# When the file is created it's sometimes empty for a bit.
# At some point it jumps to being fully written, without any in-between.
# So this waits for the file size to not be zero.
# NOTE: [[ ! -s "..." ]] would be a lot nicer,
# but for some reason it sometimes creates a copy of the file and messes everything up.
while (( "$(stat --format="%s" ~/Downloads/"$latest_file")" == 0 )); do :; done

echo "got the file $latest_file"

This works; it prints got the file example_file.csv. But as you can see, I had to do a bit of hacking here. Pay special attention to the NOTE above the second while loop. [[ -s <path> ]] would be the clean way to check that the file is nonempty, but sometimes, for some mysterious reason, it causes two files to be created in my downloads folder:

example_file(1).csv    4KB (or whatever size)
example_file.csv       0KB (empty)

This causes the first while loop to find example_file.csv and then move on to the second loop, where it gets stuck forever because the file remains empty and the contents are actually written to the mysterious copy example_file(1).csv.

This right here is my problem. I've already solved it in bash, but now that I'm rewriting this in Python, the same issue happens and I cannot figure out how to solve it. Here is my Python version:

#! python
import os
import webbrowser
import sys
import glob


def get_latest_csv_in_downloads():
    files_in_downloads = glob.glob(os.path.join(os.path.expanduser('~'), 'Downloads', '*.csv'))
    latest_file = max(files_in_downloads, key=os.path.getctime, default=None)
    return latest_file


url = sys.argv[1]
initial_file = get_latest_csv_in_downloads()

webbrowser.open(url)

while True:
    latest_file = get_latest_csv_in_downloads()
    if initial_file != latest_file:
        break

while os.path.getsize(latest_file) == 0:
    pass

print(f'got the file {latest_file}')

In bash what helped was using stat, so I tried os.stat(latest_file).st_size, but that didn't solve it either (probably because os.path.getsize just calls os.stat anyway).
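(As far as I can tell the two really are the same thing; the getsize docstring says the size is "reported by os.stat()", and a quick check like this agrees:)

import os
import tempfile

# Sanity check that os.path.getsize and os.stat report the same size.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'hello')
    path = f.name

print(os.path.getsize(path))    # 5
print(os.stat(path).st_size)    # 5

os.remove(path)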

I thought I could solve this more cleanly by obtaining an exclusive lock on the file, so that my script would block until the browser closes its handle, but it turns out that getting exclusive access to a file portably is surprisingly hard, and the libraries I tried didn't do the job.
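(If anyone wants to suggest the locking route anyway: since I'm on Windows, see below, the non-portable version I have in mind would be something along these lines with msvcrt, but I haven't verified that it actually conflicts with whatever handle the browser keeps open while it's writing.)

import msvcrt  # Windows-only
import os
import time


def wait_until_unlocked(path):
    # Keep trying to take (and immediately release) a non-blocking lock on
    # the first byte of the file; return once that succeeds.
    while True:
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError:
            time.sleep(0.1)  # the file may not be openable yet
            continue
        try:
            msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
            msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
            return
        except OSError:
            time.sleep(0.1)  # someone else still holds a lock, retry
        finally:
            os.close(fd)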

I'm using Firefox. From my tests this issue doesn't reproduce in Edge.

I've verified that this issue reproduces no matter what I try to download or from which website.

And in case it's relevant, I'm on Windows.

Any ideas what could possibly be causing the file to get duplicated, and how to prevent it? Thanks.

