PowerShell + Selenium Web Crawler -- Proxies (03)

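The previous post covered basic Selenium operations. To actually crawl a target site with Selenium, some additional disguise is needed, for example configuring the browser to reach the target site through a proxy. This makes it less likely that the site identifies the crawler and adds your own public IP to its back-end blacklist, which could leave that IP permanently blocked from the site or restricted from specific content.
To that end, we found some free proxy listing sites online. Accessing the target site through the free proxy IPs they publish is comparatively safer, because your own public IP is not exposed to the target site.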
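As a minimal sketch of how one proxy entry could be handed to Chrome, the helper below formats the `--proxy-server` switch that ChromeOptions accepts. The function name `Get-ProxyArgument` and the address `192.0.2.10` (taken from the documentation-only range) are assumptions for illustration, not part of the original script.

```powershell
# Hypothetical helper (not from the original script): formats the Chrome
# --proxy-server switch for a single proxy entry.
function Get-ProxyArgument {
    param(
        [Parameter(Mandatory)][string]$IPAddress,
        [Parameter(Mandatory)][ValidateRange(1, 65535)][int]$Port,
        [string]$Scheme = 'http'
    )
    '--proxy-server={0}://{1}:{2}' -f $Scheme, $IPAddress, $Port
}

# Placeholder address; substitute a proxy that has passed validation.
$proxyArg = Get-ProxyArgument -IPAddress '192.0.2.10' -Port 8080

# Pass it to ChromeOptions before the driver is created, for example:
# $ChromeOption.AddArgument($proxyArg)
```

The switch has to be added before `ChromeDriver` is constructed, since the options are only read when the browser starts.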
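At this point, let me recap the logic so far:
1. Decide on the target URL to crawl
2. Disguise yourself with a proxy IP when accessing the target URL
3. Maintain a proxy IP pool, which needs further validation and updating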
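Step 3 could be sketched roughly as follows: re-test every proxy in the pool and keep only those that still respond. The function name `Select-WorkingProxy`, its parameters, and the `IPAddress`/`Port` property names are assumptions for illustration. The tester is passed in as a script block so that the check can be swapped out or stubbed; the default simply requests the test URL through the proxy with `Invoke-WebRequest`.

```powershell
# Hypothetical helper: returns only the proxies that pass the tester.
function Select-WorkingProxy {
    param(
        [Parameter(Mandatory)][AllowEmptyCollection()][object[]]$Proxy,
        [string]$TestUrl = 'https://www.baidu.com',
        # The tester receives a proxy URL and the test URL and returns $true or $false.
        [scriptblock]$Tester = {
            param($ProxyUrl, $Url)
            try {
                $response = Invoke-WebRequest -Uri $Url -Proxy $ProxyUrl -TimeoutSec 3 -ErrorAction Stop
                $response.StatusCode -eq 200
            }
            catch { $false }
        }
    )
    foreach ($entry in $Proxy) {
        $entryUrl = 'http://{0}:{1}' -f $entry.IPAddress, $entry.Port
        if (& $Tester $entryUrl $TestUrl) { $entry }
    }
}

# Example call (performs real network requests with the default tester):
# $alive = Select-WorkingProxy -Proxy (Import-Csv d:\ProxyServerList-20210829.csv)
```

Running this periodically against the saved list is one way to keep the pool fresh, since free proxies tend to stop working quickly.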

# Load the Selenium assemblies if they are not already loaded in the session
#ipmo D:\tools\Selenium\WebDriver.Support.dll
#ipmo D:\tools\Selenium\WebDriver.dll

$proxyurl = 'http://www.66ip.cn/'
$testurl = "https://www.baidu.com"

$ChromeOption = New-Object OpenQA.Selenium.Chrome.ChromeOptions
$ChromeOption.AddExcludedArgument("enable-automation")    # Hide the "controlled by automated test software" infobar
$ChromeOption.AddArguments("--start-maximized")           # Open Chrome with a maximized window
$ChromeOption.AddArgument('--disable-blink-features=AutomationControlled')    # Make window.navigator.webdriver report false
#$ChromeOption.AddArgument('--proxy-server=http://219.159.38.200:56210')      # Access the target site through a proxy
$ChromeDriver = New-Object OpenQA.Selenium.Chrome.ChromeDriver($ChromeOption)
$ChromeDriver.Navigate().GoToUrl($proxyurl)
sleep 5

#region https://www.89ip.cn
<#
$i = 0
$proxyIPs = @()
while ($true)
{
    $i++
    if ($i -ne 1)
    {
        # '下一页' is the site's "Next page" link; click it to advance
        $ChromeDriver.FindElementByLinkText('下一页').Click() | Out-Null
        sleep 3
    }
    $trs = $ChromeDriver.FindElementsByCssSelector('tbody tr')
    if ($trs.Count -gt 0)
    {
        $j = 0
        foreach ($tr in $trs)
        {
            $j++
            $w = $j.ToString() + '/' + $trs.Count.ToString()
            $percent = "{0:0.0%}" -f ($j/$trs.Count)
            Write-Progress -Activity "Process test proxy address" -Status "Please wait. Current page $i, $w, $percent" -PercentComplete ($j/($trs.count) * 100)
            # Row text is "IP Port Region ISP Date Time", separated by spaces
            $trinfo = $tr.Text -split ' '
            $recordtime = $trinfo[4] + " " + $trinfo[5]
            try
            {
                # Keep the proxy only if the test URL answers with HTTP 200 through it
                $testproxy = "http://{0}:{1}" -f ($trinfo[0]), ($trinfo[1])
                $testresult = Invoke-WebRequest -Uri $testurl -Proxy $testproxy -TimeoutSec 3 -ErrorAction Stop
                if ($testresult.StatusCode -eq 200)
                {
                    Write-Host $testproxy
                    $obj = New-Object psobject
                    $obj | Add-Member -MemberType NoteProperty -Name IP -Value $trinfo[0] -Force
                    $obj | Add-Member -MemberType NoteProperty -Name Port -Value $trinfo[1] -Force
                    $obj | Add-Member -MemberType NoteProperty -Name Region -Value $trinfo[2] -Force
                    $obj | Add-Member -MemberType NoteProperty -Name ISP -Value $trinfo[3] -Force
                    $obj | Add-Member -MemberType NoteProperty -Name RecordTime -Value $recordtime -Force
                    #$obj | epcsv d:\ProxyServerList-20210828.csv -Encoding UTF8 -Append -Force -NoTypeInformation
                    $proxyIPs += $obj
                }
            }
            catch
            {
                #$errormsg = $_.Exception.Message
                #Write-Host "$testproxy Test Failed "
            }
        }
    }
    else
    {
        # No rows on this page: stop paging
        break
    }
}
$proxyIPs | epcsv d:\ProxyServerList-20210829.csv -Encoding UTF8 -Force -NoTypeInformation
#>
#endregion

#region http://www.66ip.cn/
$proxylist = @()
# The last 34 <li> elements on the page are the region links
$regionnames = ($ChromeDriver.FindElementsByTagName('li') | select text -Last 34).Text
foreach ($regionname in $regionnames)
{
    $ChromeDriver.FindElementByLinkText($regionname).Click()
    sleep 3
    # Skip the first 3 <tr> rows, which are not proxy entries
    $trcount = ($ChromeDriver.FindElementsByTagName('tr') | measure | select count).count
    $filtercount = $trcount - 3
    $iplist = $ChromeDriver.FindElementsByTagName('tr') | select Text -Last $filtercount
    $j = 0
    if ($iplist.Count -gt 0)
    {
        foreach ($ipstring in $iplist.Text)
        {
            $ipinfo = $ipstring -split ' '
            $ipaddress = $ipinfo[0]
            $ipport = $ipinfo[1]
            $ipregion = $ipinfo[2]
            $iptype = $ipinfo[3]
            $j++
            $w = $j.ToString() + '/' + $iplist.Count.ToString()
            $percent = "{0:0.0%}" -f ($j/$iplist.Count)
            Write-Progress -Activity "Process test proxy address" -Status "Please wait. Current $ipregion $w, $percent" -PercentComplete ($j/($iplist.count) * 100)
            try
            {
                # Keep the proxy only if the test URL answers with HTTP 200 through it
                $testproxy = "http://{0}:{1}" -f $ipaddress, $ipport
                $testresult = Invoke-WebRequest -Uri $testurl -Proxy $testproxy -TimeoutSec 3 -ErrorAction Stop
                if ($testresult.StatusCode -eq 200)
                {
                    Write-Host $testproxy
                    $obj = New-Object psobject
                    $obj | Add-Member -MemberType NoteProperty -Name IPAddress -Value $ipaddress -Force
                    $obj | Add-Member -MemberType NoteProperty -Name Port -Value $ipport -Force
                    $obj | Add-Member -MemberType NoteProperty -Name Region -Value $ipregion -Force
                    $proxylist += $obj
                }
            }
            catch {}    # Unreachable or slow proxies are silently skipped
        }
    }
}
# Show the de-duplicated working proxies in a grid view
$proxylist | select IPAddress,Port,Region -Unique | ogv
#endregion
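The row-splitting and `Add-Member` steps in the script could plausibly be condensed into one small function. The sketch below is a suggestion rather than the author's method: `ConvertFrom-ProxyRow` is a hypothetical name, the sample rows are made up, and it assumes each row's text starts with "IP Port Region" separated by whitespace, as the script's indexing implies. Rows whose second field is not a number (such as a header row) are skipped.

```powershell
# Hypothetical helper: turns the text of one table row into a proxy object.
function ConvertFrom-ProxyRow {
    param([Parameter(Mandatory)][string]$RowText)

    # Split on runs of whitespace (the script above splits on a single space)
    $fields = $RowText.Trim() -split '\s+'
    if ($fields.Count -lt 3) { return }    # too few fields to be a data row

    $port = 0
    if (-not [int]::TryParse($fields[1], [ref]$port)) { return }    # e.g. a header row

    [pscustomobject]@{
        IPAddress = $fields[0]
        Port      = $port
        Region    = $fields[2]
    }
}

# Made-up sample rows: one data row and one header-like row
$rows = '192.0.2.10 8080 Beijing HTTP', 'ip port location type'
$parsed = $rows | ForEach-Object { ConvertFrom-ProxyRow -RowText $_ }
```

Building the object with `[pscustomobject]@{...}` does the same job as `New-Object psobject` followed by several `Add-Member` calls, in one expression.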

