How to download all image files from a web page

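While looking for a way to fetch the image files in a web page, I found a script published on GitHub.

Since Microsoft has acquired GitHub, I was worried that the information might disappear, so I decided to write it down here.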

<?php
date_default_timezone_set('Asia/Tokyo');

// Defined up front so the finally block always has something to print.
$message = '';

try
{
    // The target URL is passed as the first command-line argument.
    if (!isset($argv[1]))
    {
        throw new Exception('The first argument is required for url.');
    }

    $url = $argv[1];
    if (!preg_match('/^https?:\/\//', $url))
    {
        throw new Exception('Invalid url.');
    }

    // get_headers() returns false on failure, so check that before indexing.
    $header = @get_headers($url);
    if ($header === false || !preg_match('/^HTTP\/\S+\s+200(\s|$)/i', $header[0]))
    {
        throw new Exception('Target page not found.');
    }

    $html_source = @file_get_contents($url);
    if ($html_source === false || $html_source === '')
    {
        throw new Exception('Failed to get html source from url.');
    }

    // Collect every src="..." value that ends in a known image extension.
    // [^"]* keeps a match from running past the closing quote of the attribute.
    preg_match_all('/src="([^"]*?(\.jpg|\.jpeg|\.gif|\.png))"/i', $html_source, $matches);

    echo strlen($html_source)."\n";

    if (!isset($matches[1]) || count($matches[1]) === 0)
    {
        throw new Exception('No image file in url.');
    }

    // Build the scheme://host prefix used for relative image paths.
    $base_tmp = explode('/', $url);
    $base = sprintf('%s/%s/%s', $base_tmp[0], $base_tmp[1], $base_tmp[2]);
    echo $base."\n";

    $save_dir = sprintf('./save_img_%s', date('YmdHis'));
    if (!file_exists($save_dir))
    {
        mkdir($save_dir);
    }

    $save_cnt = $duplicate_cnt = $error_cnt = 0;
    $saved_list = [];
    foreach ($matches[1] as $k => $img_url)
    {
        // Prefix the index so files with the same name do not overwrite each other.
        $fpath = sprintf('%s/%s_%s', $save_dir, $k, basename($img_url));

        if (preg_match('/^\/\//', $img_url))
        {
            // Protocol-relative URL (//host/path): reuse the scheme of the page.
            $img_url = $base_tmp[0].$img_url;
        }
        elseif (!preg_match('/^https?:\/\//', $img_url))
        {
            // Relative paths are resolved against the site root only.
            $img_url = sprintf('%s/%s', $base, ltrim($img_url, '/'));
        }
        if (in_array($img_url, $saved_list))
        {
            $duplicate_cnt++;
            continue;
        }

        $data = @file_get_contents($img_url);
        if ($data)
        {
            @file_put_contents($fpath, $data);
        }

        if (!file_exists($fpath))
        {
            $error_cnt++;
        }
        else
        {
            $save_cnt++;
            $saved_list[] = $img_url;
        }
    }

    $message = sprintf('end. {all:%s, saved:%s, duplicate:%s, error:%s}', count($matches[1]), $save_cnt, $duplicate_cnt, $error_cnt);
}
catch (Exception $e)
{
    $message = $e->getMessage();
}
finally
{
    echo $message. "\n";
}
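As an aside, the regular expression in the script only picks up src attributes written with double quotes, and it does not check which tag the attribute belongs to. The following is a rough sketch of an alternative that reads img tags with PHP's DOM extension instead. The function name extract_image_urls is hypothetical and is not part of the script from GitHub; the sketch assumes the DOM extension is enabled, which is the default in most PHP builds.

```php
<?php
// Hypothetical helper (not in the original script): collect image URLs from
// <img> tags with the DOM extension instead of a regular expression.
function extract_image_urls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img)
    {
        $src = $img->getAttribute('src');
        // Same extension filter as the script: jpg, jpeg, gif, png.
        if (preg_match('/\.(jpe?g|gif|png)$/i', $src))
        {
            $urls[] = $src;
        }
    }
    return $urls;
}

// Sample input with double-quoted, single-quoted, and non-matching src values.
$html = '<html><body><img src="/a.png"><img src=\'img/b.JPG\'><img src="c.svg"></body></html>';
print_r(extract_image_urls($html));
```

The returned array could take the place of $matches[1] in the main script. Note that this version looks at img tags only, so it would skip images referenced from other tags or from CSS.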

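Save the main script as get_all_image.php.

To run it, pass the URL of the target page as the first argument:

php get_all_image.php http://example.com

A folder named save_img_YmdHis (year, month, day, hour, minute, second) is created in the current directory, and the images are saved into it.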
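One limitation worth knowing: the script joins every relative image path onto scheme://host, so a path such as img/a.png found on a page inside a subdirectory is looked up at the site root and may fail to download. Below is a minimal sketch of path-aware resolution using parse_url. The function name resolve_image_url is hypothetical (it is not in the original script), and the sketch is deliberately simplified: it does not normalize "../" segments and it ignores any base tag in the page.

```php
<?php
// Hypothetical helper (not in the original script): resolve an image path
// against the URL of the page it was found on.
function resolve_image_url(string $page_url, string $img_url): string
{
    // Already absolute: nothing to do.
    if (preg_match('/^https?:\/\//i', $img_url))
    {
        return $img_url;
    }

    $parts  = parse_url($page_url);
    $scheme = $parts['scheme'];
    $origin = $scheme.'://'.$parts['host'].(isset($parts['port']) ? ':'.$parts['port'] : '');

    // Protocol-relative: //cdn.example.com/a.png
    if (strpos($img_url, '//') === 0)
    {
        return $scheme.':'.$img_url;
    }
    // Root-relative: /img/a.png
    if (strpos($img_url, '/') === 0)
    {
        return $origin.$img_url;
    }
    // Document-relative: resolve against the directory part of the page path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin.$dir.$img_url;
}

echo resolve_image_url('http://example.com/blog/post.html', 'img/a.png')."\n";
```

In the main script, a call such as resolve_image_url($url, $img_url) could replace the sprintf that prepends $base.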


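About the author

よし

I work as a systems engineer (SE) at a company. I have also been running a server at home since 2005, and have rebuilt it on the following hardware over the years:
・玄箱HG (November 2005 to May 2007)
・OpenMicroServer (May 2007 to July 2008)
・MP965-D (July 2008 to May 2011)
・SuperMicro (May 2011 to December 2014)
・D54250WYB (December 2014 to present)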
