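Site URL: https://www.woyaogexing.com/touxiang/qinglv/new/
Browsing the page shows that every image links to another page.
So we first need to collect, from the listing page, the URL of the HTML page behind each image, and then extract the images from those pages.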
    import requests

    response = requests.get('https://www.woyaogexing.com/touxiang/qinglv/new/')
    response.encoding = 'utf-8'
    print(response.text)
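The URLs we need sit in href attributes in the HTML, for example /touxiang/qinglv/2021/1142841.html.
Use a regular expression to filter out the URLs we need: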
    import re
    import requests

    response = requests.get('https://www.woyaogexing.com/touxiang/qinglv/new/')
    response.encoding = 'utf-8'
    html = response.text

    # Capture the relative path of each detail page,
    # e.g. /touxiang/qinglv/2021/1142841.html
    pattern = re.compile(r'href="(/touxiang/qinglv/20\d+/\d+\.html)"', re.S)
    urls = re.findall(pattern, html)
    for url in urls:
        print(url)
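To see what the pattern picks up without hitting the site, here is a small offline sketch. The HTML snippet is made up for illustration (the real markup may differ), and `extract_detail_urls` is a hypothetical helper, not part of the original script. It also turns the relative paths into absolute URLs with `urllib.parse.urljoin` and drops duplicates, which the listing page may well contain.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.woyaogexing.com/touxiang/qinglv/new/"

# Made-up snippet standing in for the listing page's HTML (assumption).
SAMPLE_HTML = '''
<div class="list">
  <a href="/touxiang/qinglv/2021/1142841.html" class="img">...</a>
  <a href="/touxiang/qinglv/2021/1142841.html" class="title">...</a>
  <a href="/touxiang/qinglv/2021/1142799.html" class="img">...</a>
  <a href="/touxiang/nv/2021/1142000.html" class="img">...</a>
</div>
'''

DETAIL_LINK = re.compile(r'href="(/touxiang/qinglv/20\d+/\d+\.html)"')


def extract_detail_urls(html, base_url=BASE_URL):
    """Return absolute detail-page URLs in page order, without duplicates."""
    seen = set()
    result = []
    for path in DETAIL_LINK.findall(html):
        absolute = urljoin(base_url, path)  # resolves the leading "/" against the host
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


detail_urls = extract_detail_urls(SAMPLE_HTML)
for detail_url in detail_urls:
    print(detail_url)
```

The link under /touxiang/nv/ is skipped because the pattern only accepts the qinglv category.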
For each of these URLs, fetch the HTML once more:
    import requests

    url = '/touxiang/qinglv/2021/1142841.html'
    # url already starts with "/", so the host is given without a trailing slash
    response = requests.get('https://www.woyaogexing.com' + url)
    response.encoding = 'utf-8'
    html = response.text
    print(html)
The image URLs are visible in this page's HTML.
Filter them out with a regular expression:
    import re
    import requests

    url = '/touxiang/qinglv/2021/1142841.html'
    response = requests.get('https://www.woyaogexing.com' + url)
    response.encoding = 'utf-8'
    html = response.text

    # Capture protocol-relative image URLs such as //img<N>.woyaogexing.com/20xx/....jpeg
    pattern = re.compile(r'href="(//img\d\.woyaogexing\.com/20\d\d.*?\.jpeg)"', re.S)
    pic_urls = re.findall(pattern, html)
    for pic_url in pic_urls:
        print(pic_url)
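The captured image URLs start with `//`, so they have no scheme yet. A minimal offline sketch of how they could be normalized follows. The HTML snippet and the file names in it are invented for illustration, and `extract_image_urls` is a hypothetical helper. It also uses `[^"]*?` instead of `.*?` so a match cannot run past the closing quote of the attribute, which is a slightly stricter variant than the pattern above.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "https://www.woyaogexing.com/touxiang/qinglv/2021/1142841.html"

# Invented snippet standing in for a detail page's HTML (assumption).
SAMPLE_HTML = '''
<a href="//img2.woyaogexing.com/2021/07/04/abc123.jpeg" class="swipebox">...</a>
<a href="//img2.woyaogexing.com/2021/07/04/def456.jpeg" class="swipebox">...</a>
<a href="/touxiang/qinglv/2021/1142799.html">next</a>
'''

# [^"]*? keeps the match inside a single href value.
IMAGE_LINK = re.compile(r'href="(//img\d\.woyaogexing\.com/20\d\d[^"]*?\.jpeg)"')


def extract_image_urls(html, page_url=PAGE_URL):
    """Return absolute image URLs, borrowing the scheme from the page URL."""
    return [urljoin(page_url, link) for link in IMAGE_LINK.findall(html)]


image_urls = extract_image_urls(SAMPLE_HTML)
for image_url in image_urls:
    print(image_url)
```

`urljoin` treats a reference that begins with `//` as "same scheme, different host", so the images get `https:` here because the page itself was fetched over HTTPS.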
All that remains is to save the images to disk.
Complete code:
    import os
    import re
    import requests

    i = 0  # running counter used to name the saved files


    def get_one_page(url):
        # Download a page and return its HTML as text
        response = requests.get(url)
        response.encoding = 'utf-8'
        return response.text


    def get_urls(html):
        # Relative paths of the detail pages linked from the listing page
        pattern = re.compile(r'href="(/touxiang/qinglv/20\d+/\d+\.html)"', re.S)
        return re.findall(pattern, html)


    def get_pic_url(html):
        # Protocol-relative image URLs found on a detail page
        pattern = re.compile(r'href="(//img\d\.woyaogexing\.com/20\d\d.*?\.jpeg)"', re.S)
        return re.findall(pattern, html)


    def save_pic(url, pic_path):
        global i
        # Create the target directory on first use
        if not os.path.exists(pic_path):
            os.makedirs(pic_path)
        with open(os.path.join(pic_path, str(i) + '.jpg'), 'wb') as f:
            f.write(requests.get(url).content)
        i += 1


    def main():
        html = get_one_page('https://www.woyaogexing.com/touxiang/qinglv/new/')
        urls = get_urls(html)
        for url in urls:
            sub_html = get_one_page('https://www.woyaogexing.com' + url)
            pic_urls = get_pic_url(sub_html)
            for pic_url in pic_urls:
                # pic_url starts with "//", so prepend the scheme
                save_pic('http:' + pic_url, 'D:\\test\\')


    if __name__ == '__main__':
        main()
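One possible variation on `save_pic`, sketched under assumptions: instead of a global counter, name each file after the last segment of its URL, and pass the download step in as a function so the saving logic can be tried without network access. The names `filename_from_url`, `save_images`, and `fake_fetch` are hypothetical, and the demo URLs are invented.

```python
import os
import tempfile
from urllib.parse import urlsplit


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlsplit(url).path)


def save_images(urls, directory, fetch):
    """Write fetch(url) (bytes) for every URL into `directory`; return the paths."""
    os.makedirs(directory, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(directory, filename_from_url(url))
        with open(path, "wb") as f:
            f.write(fetch(url))
        saved.append(path)
    return saved


def fake_fetch(url):
    """Stand-in downloader for the demo; returns dummy bytes instead of an image."""
    return b"dummy bytes for " + url.encode("utf-8")


# Invented URLs, used only to exercise the functions offline.
demo_urls = [
    "https://img2.woyaogexing.com/2021/07/04/abc123.jpeg",
    "https://img2.woyaogexing.com/2021/07/04/def456.jpeg",
]
demo_dir = tempfile.mkdtemp()
saved_paths = save_images(demo_urls, demo_dir, fake_fetch)
saved_names = [os.path.basename(p) for p in saved_paths]
print(saved_names)
```

In the real script the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).content`, where the timeout keeps a stalled connection from hanging the whole run. Naming files after the URL also means a rerun overwrites the same files rather than restarting the numbering from 0.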
Result: running the script saves the images in D:\test\ as 0.jpg, 1.jpg, 2.jpg, and so on.
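Copyright notice: this is an original article by CSDN blogger "qq_51459600", released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/qq_51459600/article/details/118460442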