在Linux(Centos 8)上使用Headless Chrome无头浏览器采集的真实体验

最近在在采集微信文章的时候，遇到了点棘手的问题，通过搜狗搜索的微信搜索模式，使用普通的直接抓取页面的方式，无法绕过搜狗搜索的验证，因此使用gorequest成功的采集到微信文章。

选择chromedp + Headless Chrome

面对采集目的没实现的问题，我就这么放弃了吗？显然是不可能的。你不就是有签名验证嘛，只要不是要求输入验证码，总还是有办法解决的（蒽，一般的验证码也可以解决）。于是祭出golang下采集神器：chromedp。

简单点说，chromedp就是golang语言用于调用 Chrome Debugging Protocol 来实现模拟浏览器行为的可以使用简单的方式来驱动浏览器的一个包。使用它的前提只有简单的一个，那就是在你的电脑上安装上Chrome浏览器就可以了。

Chrome浏览器在的安装，在Windows下和macOS下都没问题，在桌面版的Linux下也问题不大。但想要在Linux服务器版上安装Chrome，就不是那么简单的事情了，至少目前来说Chrome还没有可以直接在服务器上安装的软件包。

但刚想好的法子又要放弃了吗？当然也是不可能的。翻翻chromedp的文档，刚好发现了一段话：

The simplest way is to run the Go program that uses chromedp inside the chromedp/headless-shell image. That image contains headless-shell, a smaller headless build of Chrome, which chromedp is able to find out of the box.

他的意思就是说：最简单的方法是使用 chromedp 调用 chromedp/headless-shell 镜像。 chromedp/headless-shell 是一个包含较小的Chrome无头浏览器的docker镜像。

好嘞，既然是一个docker镜像，我们就用docker安装它。

首先登录到我们的linux服务器，以下操作是认为你已经登录了Linux服务器了。

安装docker 和安装 chromedp/headless-shell

如果你的服务器已经安装了docker，则这一步跳过。

使用yum安装docker

yum install docker

安装完了，这时候docker还不能直接使用，需要执行下面这个命令来激活

systemctl daemon-reload

service docker restart

接着安装 chromedp/headless-shell 镜像

docker pull chromedp/headless-shell:latest

等待安装完毕，接着运行 docker镜像

docker run -d -p 9222:9222 --rm --name headless-shell chromedp/headless-shell

运行起来后，测试下chrome是否正常：

curl http://127.0.0.1:9222/json/version

如果看到类似以下内容，表示chrome浏览器工作正常

{ "Browser": "Chrome/96.0.4664.110", "Protocol-Version": "1.3", "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36", "V8-Version": "9.6.180.21", "WebKit-Version": "537.36 (@d5ef0e8214bc14c9b5bbf69a1515e431394c62a6)", "webSocketDebuggerUrl": "ws://127.0.0.1:9222/devtools/browser/a41cd42b-99ef-4d5b-b9e6-37d634aa719a" }

chromedp 代码调用 chromedp/headless-shell 采集微信公众号文章内容

上面已经可以正常的在 linux下使用 Headless Chrome 无头浏览器了。剩下的就是代码调用它的事了。

下面我们开始编写采集微信文章用到的chrome代码：

定义 keyword、artile `struct.go`

package main

type Keyword struct {
    Id           int64  `json:"id"`
    Name         string `json:"name"`
    Level        int    `json:"level"`
    ArticleCount int    `json:"article_count"`
    LastTime     int64  `json:"last_time"` //上次执行时间
}

type Article struct {
    Id          int64  `json:"id"`
    KeywordId   int64  `json:"keyword_id"`
    Title       string `json:"title"`
    Keywords    string `json:"keywords"`
    Description string `json:"description"`
    OriginUrl   string `json:"origin_url"`
    Status      int    `json:"status"`
    CreatedTime int    `json:"created_time"`
    UpdatedTime int    `json:"updated_time"`
    Content     string `json:"content"`
    ContentText string `json:"-"`
}

编写核心代码 `core.go`

package main

import (
    "context"
    "fmt"
    "github.com/chromedp/cdproto/cdp"
    "github.com/chromedp/chromedp"
    "log"
    "net"
    "net/url"
    "strings"
    "time"
)

func CollectArticleFromWeixin(keyword *Keyword) []*Article {
    timeCtx, cancel := context.WithTimeout(GetChromeCtx(false), 30*time.Second)
    defer cancel()

    var collectLink string
    err := chromedp.Run(timeCtx,
        chromedp.Navigate(fmt.Sprintf("https://weixin.sogou.com/weixin?p=01030402&query=%s&type=2&ie=utf8", keyword.Name)),
        chromedp.WaitVisible(`//ul[@class="news-list"]`),
        chromedp.Location(&collectLink),
    )
    if err != nil {
        log.Println("读取搜狗搜索列表失败1：", keyword.Name, err.Error())
        return nil
    }
    log.Println("正在采集列表：", collectLink)
    var aLinks []*cdp.Node
    if err := chromedp.Run(timeCtx, chromedp.Nodes(`//ul[@class="news-list"]//h3//a`, &aLinks)); err != nil {
        log.Println("读取搜狗搜索列表失败2：", keyword.Name, err.Error())
        return nil
    }

    var articles []*Article
    for i := 0; i < len(aLinks); i++ {
        href := aLinks[i].AttributeValue("href")
        href, _ = joinURL("https://weixin.sogou.com/", href)
        article := &Article{}
        err = chromedp.Run(timeCtx,
            chromedp.Navigate(href),
            chromedp.WaitVisible(`#js_article`),
            chromedp.Location(&article.OriginUrl),
            chromedp.Text(`#activity-name`, &article.Title, chromedp.NodeVisible),
            chromedp.InnerHTML("#js_content", &article.Content, chromedp.ByID),
            chromedp.Text("#js_content", &article.ContentText, chromedp.ByID),
        )
        if err != nil {
            log.Println("读取搜狗搜索列表失败3：", keyword.Name, err.Error())
        }
        log.Println("采集文章：", article.Title, len(article.Content), article.OriginUrl)
        articles = append(articles, article)
    }

    return articles
}

// 重组url
func joinURL(baseURL, subURL string) (fullURL, fullURLWithoutFrag string) {
    baseURL = strings.TrimSpace(baseURL)
    subURL = strings.TrimSpace(subURL)
    baseURLObj, _ := url.Parse(baseURL)
    subURLObj, _ := url.Parse(subURL)
    fullURLObj := baseURLObj.ResolveReference(subURLObj)
    fullURL = fullURLObj.String()
    fullURLObj.Fragment = ""
    fullURLWithoutFrag = fullURLObj.String()
    return
}

//检查是否有9222端口，来判断是否运行在linux上
func checkChromePort() bool {
    addr := net.JoinHostPort("", "9222")
    conn, err := net.DialTimeout("tcp", addr, 1*time.Second)
    if err != nil {
        return false
    }
    defer conn.Close()
    return true
}

// ChromeCtx 使用一个实例
var ChromeCtx context.Context
func GetChromeCtx(focus bool) context.Context {
    if ChromeCtx == nil || focus {
        allocOpts := chromedp.DefaultExecAllocatorOptions[:]
        allocOpts = append(allocOpts,
            chromedp.DisableGPU,
            chromedp.Flag("blink-settings", "imagesEnabled=false"),
            chromedp.UserAgent(`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36`),
            chromedp.Flag("accept-language", `zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6`),
        )

        if checkChromePort() {
            // 不知道为何，不能直接使用 NewExecAllocator ，因此增加 使用 ws://127.0.0.1:9222/ 来调用
            c, _ := chromedp.NewRemoteAllocator(context.Background(),  "ws://127.0.0.1:9222/")
            ChromeCtx, _ = chromedp.NewContext(c)
        } else {
            c, _ := chromedp.NewExecAllocator(context.Background(), allocOpts...)
            ChromeCtx, _ = chromedp.NewContext(c)
        }
    }

    return ChromeCtx
}

编写入口文件 `main.go`

package main

import "log"

func main() {
    word := Keyword{
        Name:         "golang",
    }

    result := CollectArticleFromWeixin(&word)

    for i, v := range result {
        log.Println(i, v.Title, len(v.Content), v.OriginUrl)
        log.Println("纯内容：", v.ContentText)
    }
}

运行结果测试：

很棒结果出来了。

微信视频投屏：
1、在手机端微信中会拦截投屏功能，需要首先点击视频页面右上角“...”图标，选择“在浏览器中打开”，在列表中选取具备投屏功能的浏览器，推荐使用QQ浏览器
2、在新打开的浏览器视频页面里，点击播放按钮，可在视频框右上角看到一个“TV”投屏小图标，只要电视和手机在同一WiFi环境下，点击按钮即刻享受大屏观感！
本站资源声明：
1、如需免费下载云盘资源，请先点击页面右上角的“登录”按钮，注册并登录您的账号后即可查看到网盘资源下载地址；
2、本站所有资源信息均由网络爬虫自动抓取，以非人工方式自动筛选长效资源并更新发布，资源内容只作交流和学习使用，本站不储存、复制、传播任何文件，其资源的有效性和安全性需要您自行判断；
3、本站高度重视知识产权保护，如有侵犯您的合法权益或违法违规，请立即向网盘官方举报反馈，并提供相关有效书面证明与侵权页面链接联系我们处理；
4、作为非盈利性质网站，仅提供网络资源的免费搜索和检测服务，无需额外支付其他任何费用，学习和交流的同时请小心防范网络诈骗。

chrome 微信采集

选择chromedp + Headless Chrome

安装docker 和安装 chromedp/headless-shell

chromedp 代码调用 chromedp/headless-shell 采集微信公众号文章内容

定义 keyword、artile struct.go

编写核心代码 core.go

编写入口文件 main.go

运行结果测试：

发表回复 取消回复

定义 keyword、artile `struct.go`

编写核心代码 `core.go`

编写入口文件 `main.go`

发表回复取消回复