Why can't I connect to MongoDB Atlas from GitHub Actions?
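I want to create a workflow that automatically scrapes app reviews from the Google Play Store at a scheduled time every day and stores them in a collection in MongoDB Atlas. So first, I created a Python script named scraping_daily.py that scrapes 5,000 new reviews and filters out any that were collected previously. The script works fine when I test it and run it manually. Here's what the script looks like: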
# Import libraries
import numpy as np
import pandas as pd
from google_play_scraper import Sort, reviews, reviews_all, app
from pymongo import MongoClient
# Create a connection to MongoDB
client = MongoClient("mongodb+srv://<MY_USERNAME>:<MY_PASSWORD>@project1.lpu4kvx.mongodb.net/?retryWrites=true&w=majority")
db = client["vidio"]
collection = db["google_play_store_reviews"]
# Load the data from MongoDB
df = pd.DataFrame(list(collection.find()))
df = df.drop("_id", axis=1)
df = df.sort_values("at", ascending=False)
# Collect 5000 new reviews
result = reviews(
    "com.vidio.android",
    lang="id",
    country="id",
    sort=Sort.NEWEST,
    count=5000
)
new_reviews = pd.DataFrame(result[0])
new_reviews = new_reviews.fillna("empty")
# Filter the scraped reviews to exclude any that were previously collected
common = new_reviews.merge(df, on=["reviewId", "userName"])
new_reviews_sliced = new_reviews[(~new_reviews.reviewId.isin(common.reviewId)) & (~new_reviews.userName.isin(common.userName))]
# Update MongoDB with any new reviews that were not previously scraped
if len(new_reviews_sliced) > 0:
    new_reviews_sliced_dict = new_reviews_sliced.to_dict("records")
    batch_size = 1_000
    num_records = len(new_reviews_sliced_dict)
    num_batches = num_records // batch_size
    if num_records % batch_size != 0:
        num_batches += 1
    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = min(start_idx + batch_size, num_records)
        batch = new_reviews_sliced_dict[start_idx:end_idx]
        if batch:
            collection.insert_many(batch)
Next, I wanted to schedule my script with GitHub Actions. Following a YouTube tutorial, I created an actions.yml file in the .github/workflows folder. Here's what the YAML file looks like:
name: Scraping Google Play Reviews
on:
  schedule:
    - cron: 50 16 * * * # At 16:50 every day
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: check out the repository content
        uses: actions/checkout@v2
      - name: set up python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: install requirements
        run:
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: execute the script
        run: python -m scraping_daily.py
However, the workflow always throws an error when it executes my script. The error message is:
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "/home/runner/work/vidio_google_play_store_reviews/vidio_google_play_store_reviews/scraping_daily.py", line 16, in <module>
df = pd.DataFrame(list(collection.find()))
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/cursor.py", line 1248, in next
if len(self.__data) or self._refresh():
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/cursor.py", line 1139, in _refresh
self.__session = self.__collection.database.client._ensure_session()
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1740, in _ensure_session
return self.__start_session(True, causal_consistency=False)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1685, in __start_session
self._topology._check_implicit_session_support()
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/topology.py", line 538, in _check_implicit_session_support
self._check_session_support()
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/topology.py", line 554, in _check_session_support
self._select_servers_loop(
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pymongo/topology.py", line 238, in _select_servers_loop
raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: ac-dc8axn9-shard-00-01.lpu4kvx.mongodb.net:27017: connection closed,ac-dc8axn9-shard-00-02.lpu4kvx.mongodb.net:27017: connection closed,ac-dc8axn9-shard-00-00.lpu4kvx.mongodb.net:27017: connection closed, Timeout: 300.0s, Topology Description: <TopologyDescription id: 641dd5b78e0efba394e00ffc, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('ac-dc8axn9-shard-00-00.lpu4kvx.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-dc8axn9-shard-00-00.lpu4kvx.mongodb.net:27017: connection closed')>, <ServerDescription ('ac-dc8axn9-shard-00-01.lpu4kvx.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-dc8axn9-shard-00-01.lpu4kvx.mongodb.net:27017: connection closed')>, <ServerDescription ('ac-dc8axn9-shard-00-02.lpu4kvx.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-dc8axn9-shard-00-02.lpu4kvx.mongodb.net:27017: connection closed')>]>
Error: Process completed with exit code 1.
I tried increasing the timeout by adding serverSelectionTimeoutMS=300000 to MongoClient(), but I still get the same error. How can I fix this?
By the way, I'm on a Windows machine (though I'm not sure whether that matters).
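To reach your MongoDB Atlas database from GitHub Actions, you need to add an entry to the project's IP access list: under Network Access, add an IP address and choose the Allow Access from Anywhere option. GitHub-hosted runners don't have a fixed IP address, so an access list that only contains your own machine's address will reject the runner's connection.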
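As a rough illustration of why that option works: "Allow Access from Anywhere" adds the CIDR block 0.0.0.0/0 to the access list, and that block matches every IPv4 address, including whichever one the runner happens to get. The sketch below uses only Python's standard ipaddress module. The helper name is_allowed and the sample addresses are made up for the example, and this is not a claim about how Atlas evaluates the list internally.

```python
import ipaddress


def is_allowed(client_ip, access_list):
    """Return True if client_ip falls inside any CIDR entry in access_list."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(entry) for entry in access_list)


# Hypothetical runner address (taken from a documentation-only range).
runner_ip = "203.0.113.7"

# An access list holding a single, different address rejects the runner...
print(is_allowed(runner_ip, ["198.51.100.23/32"]))
# ...while 0.0.0.0/0 matches every IPv4 address.
print(is_allowed(runner_ip, ["0.0.0.0/0"]))
```

The trade-off is that 0.0.0.0/0 lets any host attempt a connection, so the database credentials become the only barrier.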
Just to add to @jonrsharpe's answer: this approach can be simplified by using the MongoDB Atlas CLI GitHub Action.
# Grant temporary MongoDB access to this GitHub Actions runner IP address
- name: Get the public IP of this runner
  id: get_gh_runner_ip
  shell: bash
  run: |
    echo "ip_address=$(curl https://checkip.amazonaws.com)" >> "$GITHUB_OUTPUT"
- name: Setup MongoDB Atlas cli
  uses: mongodb/[email protected]
- name: Add runner IP to MongoDB access list
  shell: bash
  run: |
    atlas accessLists create ${{ steps.get_gh_runner_ip.outputs.ip_address }} --type ipAddress --projectId ${{ env.MONGODB_ATLAS_PROJECT_ID }} --comment "Temporary access for GH Action"
Then, at the end of the workflow:
- name: Remove GH runner IP from MongoDB access list
  shell: bash
  run: |
    atlas accessLists delete ${{ steps.get_gh_runner_ip.outputs.ip_address }} --projectId ${{ env.MONGODB_ATLAS_PROJECT_ID }} --force
Rather than allowing any IP to access the cluster, you can add the runner's IP for the duration of the job:
- Provide API credentials with the Project Owner role as secrets. Note that you must also allow any IP to access the Atlas Administration API itself (docs).
- Make a request to e.g. https://checkip.amazonaws.com to find out the public IP of the specific runner:
    - name: Get the public IP of the runner
      id: get-ip
      shell: bash
      run: |
        echo "ip-address=$(curl https://checkip.amazonaws.com)" >> "$GITHUB_OUTPUT"
- Make a request to the MongoDB Atlas API POST /groups/{groupId}/accessList to allow access from that IP:
    - name: Allow the runner to access MongoDB Atlas
      id: allow-ip
      shell: bash
      run: |
        curl \
          --data '[{"ipAddress": "${{ steps.get-ip.outputs.ip-address }}", "comment": "GitHub Actions Runner"}]' \
          --digest \
          --header 'Accept: application/vnd.atlas.2023-02-01+json' \
          --header 'Content-Type: application/json' \
          --user "$USERNAME:$PASSWORD" \
          "https://cloud.mongodb.com/api/atlas/v2/groups/$GROUP_ID/accessList"
      env:
        GROUP_ID: ${{ secrets.ATLAS_GROUP_ID }}
        PASSWORD: ${{ secrets.ATLAS_PRIVATE_KEY }}
        USERNAME: ${{ secrets.ATLAS_PUBLIC_KEY }}
- Once the access is finished with, whether the job succeeded or failed, make a request to DELETE /groups/{groupId}/accessList/{entryValue} to revoke the access:
    - name: Revoke the runner's access to MongoDB Atlas
      if: always() && steps.allow-ip.outcome == 'success'
      shell: bash
      run: |
        curl \
          --digest \
          --header 'Accept: application/vnd.atlas.2023-02-01+json' \
          --request 'DELETE' \
          --user "$USERNAME:$PASSWORD" \
          "https://cloud.mongodb.com/api/atlas/v2/groups/$GROUP_ID/accessList/${{ steps.get-ip.outputs.ip-address }}"
      env:
        GROUP_ID: ${{ secrets.ATLAS_GROUP_ID }}
        PASSWORD: ${{ secrets.ATLAS_PRIVATE_KEY }}
        USERNAME: ${{ secrets.ATLAS_PUBLIC_KEY }}
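For anyone who would rather drive these two API calls from Python than from curl, here is a minimal sketch of how the same requests could be put together with the standard library. It only builds the request objects and never sends them. The function names, the project ID and the IP address are placeholders invented for the example. Actually sending the requests would also need HTTP digest authentication with the API key pair (the equivalent of curl's --digest and --user), for instance via urllib.request.HTTPDigestAuthHandler, which is left out here.

```python
import json
import urllib.request

ATLAS_BASE = "https://cloud.mongodb.com/api/atlas/v2"
HEADERS = {"Accept": "application/vnd.atlas.2023-02-01+json"}


def build_allow_request(group_id, ip):
    """Build (but do not send) the POST that adds ip to the access list."""
    body = json.dumps([{"ipAddress": ip, "comment": "GitHub Actions Runner"}])
    return urllib.request.Request(
        f"{ATLAS_BASE}/groups/{group_id}/accessList",
        data=body.encode("utf-8"),
        headers={**HEADERS, "Content-Type": "application/json"},
        method="POST",
    )


def build_revoke_request(group_id, ip):
    """Build (but do not send) the DELETE that removes ip again."""
    return urllib.request.Request(
        f"{ATLAS_BASE}/groups/{group_id}/accessList/{ip}",
        headers=HEADERS,
        method="DELETE",
    )


# Placeholder values, not a real project ID or runner address.
allow = build_allow_request("example-project-id", "203.0.113.7")
revoke = build_revoke_request("example-project-id", "203.0.113.7")
print(allow.get_method(), allow.full_url)
print(revoke.get_method(), revoke.full_url)
```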
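An alternative, which also takes care of applying the post-job step automatically, is to do this in a custom JavaScript action (using runs.post to make the DELETE request). I've published such an action to the Marketplace.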
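The idea behind an automatic post-job step (grant access, do the work, then revoke whether or not the work succeeded) is the same shape as a try/finally block. The Python sketch below shows only that pattern and is not the published action's code. temporary_access is a made-up helper, and the two lambdas are stand-ins that record what happened instead of calling the Atlas API.

```python
from contextlib import contextmanager


@contextmanager
def temporary_access(allow, revoke, ip):
    """Call allow(ip), run the body, then always call revoke(ip)."""
    allow(ip)  # if this raises, revoke is never called (nothing to undo)
    try:
        yield ip
    finally:
        revoke(ip)  # runs on success and on failure of the body


# Stand-in callables that just record the order of events.
events = []
try:
    with temporary_access(
        lambda ip: events.append(f"allow {ip}"),
        lambda ip: events.append(f"revoke {ip}"),
        "203.0.113.7",  # hypothetical runner address
    ):
        events.append("run job")
        raise RuntimeError("simulated job failure")
except RuntimeError:
    events.append("job failed")

print(events)
# ['allow 203.0.113.7', 'run job', 'revoke 203.0.113.7', 'job failed']
```

Because the revoke call sits in the finally clause, it runs before the failure propagates, which mirrors the if: always() && steps.allow-ip.outcome == 'success' condition in the workflow version.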