Pulling files off a shared host (CPanel) with a 10K file FTP limit using a python web scraper

This post demonstrates the use of a web scraper to circumvent an imposed limit and download a bunch of the files.

I’ll use a recent case as an example where I had to migrate a client’s site to a new host. The old shared host was running an ancient version of CPanel and had a 10K file limit for FTP. There was no SSH or other tools, almost no disk quota left, and no support that could possibly change any CPanel settings for me. The website had a folder of user uploads with 30K+ image files.

I decided to use a web scraper to pull all of the images. In order to create links to all of the images that I wanted to scrape, I wrote a simple throwaway PHP script to link to all of the files in the uploads folder. I now had a list of all 30K+ files for the first time — no more 10K cap:

$directory = dirname(__FILE__) . '/_image_uploads';
$dir_contents = array_diff(scandir($directory), array('..', '.'));

echo '<h3>' . count($dir_contents) . '</h3>';
echo '<h5>' . $directory . '</h5>';

echo "<ul>\n";
$counter = 0;
foreach ($dir_contents as $file) {
  echo '<li>' . $counter++ . ' - <a href="/_image_uploads/'. $file . '">' . $file . "</a></li>\n";
echo "</ul>";

Next, to get the files, I used a python script to scrape the list of images using the popular urllib and shutil python3 libraries.

I posted a gist containing a slightly more generalized version of the script. It uses the BeautifulSoup library to parse the response from the above PHP script’s URL to build a list of all the image URLs that it links to. This script can be easily modified to suit a variety of applications, such as downloading lists of PDF’s or CSV’s that might be linked to from any arbitrary web page.

The gist is embedded below:

If you need to install the BeautifulSoup library with pip use: pip install beautifulsoup4

In the gist, note the regex in the line soup.findAll('a', attrs={'href': re.compile("^http://")}). This line and its regex can be modified to suit your application, e.g. to filter for certain protocols, file types, etc.