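1. Idea
I have recently been moving problems onto my personal OJ, but copying the Markdown one problem at a time is too slow, so I decided to write a downloader that fetches problem statements in bulk.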
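2. Request URL
If you append ?_contentOnly to any URL on the Luogu main site, Luogu returns the page data as JSON. Note that the User-Agent must not be empty; otherwise Luogu simply sends back a 404 (presumably lin_toto designed it this way to keep crawlers out?).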
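As a minimal sketch of the request described above, the snippet below builds the ?_contentOnly URL and attaches a non-empty User-Agent using only the standard library. The helper name build_request and the User-Agent string are assumptions made for illustration; the actual program in this post uses requests together with fake_useragent.

```python
import urllib.request

LUOGU_BASE = "https://www.luogu.com.cn"

def build_request(path, user_agent="Mozilla/5.0 (compatible; example-downloader)"):
    """Return a Request for a Luogu page with ?_contentOnly appended."""
    # An empty User-Agent is rejected by the site, so refuse it up front.
    if not user_agent:
        raise ValueError("User-Agent must not be empty")
    url = f"{LUOGU_BASE}{path}?_contentOnly"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = build_request("/problem/P7960")
print(req.full_url)
# Sending it would be urllib.request.urlopen(req).read() (not executed here).
```

Keeping the URL construction in one helper makes it easy to swap the transport (urllib, requests, or anything else) without touching the rest of the downloader.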
3. Parsing the JSON
Taking https://www.luogu.com.cn/problem/P7960?_contentOnly as an example:
{
"code": 200,
"currentTemplate": "ProblemShow",
"currentData": {
"problem": {
"background": "",
"description": "报数游戏是一个广为流传的休闲小游戏。参加游戏的每个人要按一定顺序轮流报数,但如果下一个报的数是 $7$ 的倍数,或十进制表示中含有数字 $7$,就必须跳过这个数,否则就输掉了游戏。\n\n在一个风和日丽的下午,刚刚结束 SPC20nn 比赛的小 r 和小 z 闲得无聊玩起了这个报数游戏。但在只有两个人玩的情况下计算起来还是比较容易的,因此他们玩了很久也没分出胜负。此时小 z 灵光一闪,决定把这个游戏加强:任何一个十进制中含有数字 $7$ 的数,它的所有倍数都不能报出来!\n\n形式化地,设 $p(x)$ 表示 $x$ 的十进制表示中是否含有数字 $7$,若含有则 $p(x) = 1$,否则 $p(x) = 0$。则一个正整数 $x$ 不能被报出,当且仅当存在正整数 $y$ 和 $z$ ,使得 $x = yz$ 且 $p(y) = 1$。\n\n\n例如,如果小 r 报出了 $6$ ,由于 $7$ 不能报,所以小 z 下一个需要报 $8$;如果小 r 报出了 $33$,则由于 $34 = 17 \\times 2$,$35 = 7 \\times 5$ 都不能报,小 z 下一个需要报出 $36$ ;如果小 r 报出了 $69$,由于 $70 \\sim 79$ 的数都含有 $7$,小 z 下一个需要报出 $80$ 才行。\n\n现在小 r 的上一个数报出了 $x$,小 z 想快速算出他下一个数要报多少,不过他很快就发现这个游戏可比原版的游戏难算多了,于是他需要你的帮助。当然,如果小 r 报出的 x 本身是不能报出的,你也要快速反应过来小 r 输了才行。\n\n由于小 r 和小 z 玩了很长时间游戏,你也需要回答小 z 的很多个问题。",
"inputFormat": "第一行,一个正整数 $T$ 表示小 z 询问的数量。\n\n接下来 $T$ 行,每行一个正整数 $x$,表示这一次小 r 报出的数。",
"outputFormat": "输出共 $T$ 行,每行一个整数,如果小 r 这一次报出的数是不能报出的,输出 $-1$,否则输出小 z 下一次报出的数是多少。",
"samples": [
[
"4\n6\n33\n69\n300\n",
"8\n36\n80\n-1\n"
],
[
"5\n90\n99\n106\n114\n169\n",
"92\n100\n109\n-1\n180\n"
],
[
"见附件中的 number/number3.in",
"见附件中的 number/number3.ans"
],
[
"见附件中的 number/number4.in",
"见附件中的 number/number4.ans"
]
],
"hint": "**【样例解释 #1】**\n\n这一组样例的前 $3$ 次询问在题目描述中已有解释。\n\n对于第 $4$ 次询问,由于 $300 = 75 \\times 4$,而 $75$ 中含有 $7$ ,所以小 r 直接输掉了游戏。\n\n**【数据范围】**\n\n对于 $10\\%$ 的数据,$T \\leq 10$,$x \\leq 100$。 \n对于 $30\\%$ 的数据,$T \\leq 100$,$x \\leq 1000$。 \n对于 $50\\%$ 的数据,$T \\leq 1000$,$x \\leq 10000$。 \n对于 $70\\%$ 的数据,$T \\leq 10000$,$x \\leq 2 \\times {10}^5$。 \n对于 $100\\%$ 的数据,$1 \\le T \\leq 2 \\times {10}^5$,$1 \\le x \\leq {10}^7$。",
"provider": {
"uid": 19,
"name": "CCF_NOI",
"slogan": "",
"badge": null,
"isAdmin": false,
"isBanned": false,
"color": "Gray",
"ccfLevel": 0,
"background": ""
},
"attachments": [
{
"downloadLink": "/fe/api/problem/downloadAttachment/nuacg4zh",
"size": 1332243,
"uploadTime": 1637419301,
"id": "nuacg4zh",
"filename": "number.zip"
}
],
"canEdit": false,
"limits": {
"time": [
1000,
1000,
1000,
1000,
1000,
1000,
1000,
1000,
1000,
1000
],
"memory": [
524288,
524288,
524288,
524288,
524288,
524288,
524288,
524288,
524288,
524288
]
},
"stdCode": "",
"tags": [
58,
83,
108
],
"wantsTranslation": false,
"totalSubmit": 67236,
"totalAccepted": 14342,
"flag": 5,
"pid": "P7960",
"title": "[NOIP2021] 报数",
"difficulty": 3,
"fullScore": 100,
"type": "P"
},
"contest": null,
"discussions": [
{
"id": 680342,
"title": "全re",
"forum": {
"id": 65422,
"name": "P7960 [NOIP2021] 报数",
"slug": "P7960"
}
},
{
"id": 675589,
"title": "样例都过不了求助!",
"forum": {
"id": 65422,
"name": "P7960 [NOIP2021] 报数",
"slug": "P7960"
}
},
{
"id": 648474,
"title": "两样例全过结果10分求助",
"forum": {
"id": 65422,
"name": "P7960 [NOIP2021] 报数",
"slug": "P7960"
}
},
{
"id": 642828,
"title": "tle优化求助",
"forum": {
"id": 65422,
"name": "P7960 [NOIP2021] 报数",
"slug": "P7960"
}
},
{
"id": 632904,
"title": "70pts后三个点WA的人看过来",
"forum": {
"id": 65422,
"name": "P7960 [NOIP2021] 报数",
"slug": "P7960"
}
},
{
"id": 632586,
"title": "求助RE",
"forum": {
"id": 65422,
"name": "P7960 [NOIP2021] 报数",
"slug": "P7960"
}
}
],
"bookmarked": false,
"vjudgeUsername": null,
"recommendations": [
{
"pid": "P7113",
"title": "[NOIP2020] 排水系统",
"difficulty": 4,
"fullScore": 100,
"type": "P"
},
{
"pid": "P7913",
"title": "[CSP-S 2021] 廊桥分配",
"difficulty": 4,
"fullScore": 100,
"type": "P"
},
{
"pid": "P7915",
"title": "[CSP-S 2021] 回文",
"difficulty": 4,
"fullScore": 100,
"type": "P"
},
{
"pid": "P7962",
"title": "[NOIP2021] 方差",
"difficulty": 6,
"fullScore": 100,
"type": "P"
},
{
"pid": "P7961",
"title": "[NOIP2021] 数列",
"difficulty": 5,
"fullScore": 100,
"type": "P"
}
],
"lastLanguage": 0,
"lastCode": "",
"privilegedTeams": [],
"userTranslation": null
},
"currentTitle": "[NOIP2021] 报数",
"currentTheme": null,
"currentTime": 1694678931
}
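This problem returned 196 lines of data, among which:
code: status code; 200 means success, 403 means no permission to view
currentData.problem.title: problem title
currentData.problem.difficulty: difficulty (1~7 correspond to “入门” (beginner) through “NOI/NOI+/CTSC”; 0 means “暂无评定” (not yet rated))
currentData.problem.totalSubmit / currentData.problem.totalAccepted: total submissions / total accepted submissions
currentData.problem.limits: lists of per-test-case limits
currentData.problem.background: problem background
currentData.problem.description: problem description
currentData.problem.inputFormat: input format
currentData.problem.outputFormat: output format
currentData.problem.hint: hints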
Based on these fields, write a handler function, resolve_problem(res) -> str, that takes the parsed JSON object and returns the finished Markdown content.
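Before building the full Markdown, it can help to see how the fields are reached. The following is only an illustrative sketch: summarize_problem is a hypothetical helper, not part of the downloader, and the sample dict is a trimmed copy of the P7960 response shown above.

```python
def summarize_problem(res):
    """Pick a few fields out of a parsed ?_contentOnly response."""
    if res["code"] != 200:
        raise ValueError(f"request failed with code {res['code']}")
    problem = res["currentData"]["problem"]
    limits = problem["limits"]
    submitted = problem["totalSubmit"]
    return {
        "title": problem["title"],
        "difficulty": problem["difficulty"],
        # Guard against problems that nobody has submitted to yet.
        "accept_rate": problem["totalAccepted"] / submitted if submitted else 0.0,
        "max_time_ms": max(limits["time"]),
        # The downloader in this post divides by 1024 and labels the result MiB.
        "max_memory_mib": max(limits["memory"]) / 1024,
    }

# Trimmed copy of the sample response; all other fields are omitted.
sample = {
    "code": 200,
    "currentData": {
        "problem": {
            "title": "[NOIP2021] 报数",
            "difficulty": 3,
            "totalSubmit": 67236,
            "totalAccepted": 14342,
            "limits": {"time": [1000] * 10, "memory": [524288] * 10},
        }
    },
}

summary = summarize_problem(sample)
print(summary["title"], summary["max_time_ms"], summary["max_memory_mib"])
```

Checking code first keeps a 403 response from surfacing later as a confusing KeyError.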
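4. Getting around Luogu's anti-crawler limits
When requests come in too fast, Luogu sometimes responds with a JavaScript challenge similar to the following: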
<script>
var _$daewqwskl=["\x64\x6f\x63\x75\x6d\x65\x6e\x74"];
var _$oopopdwskl=["\x6f\x6e\x4d\x6f\x75\x73\x65\x4d\x6f\x76\x65","\x62\x6f\x64\x79","\x64\x6f\x63\x75\x6d\x65\x6e\x74","\x72\x65\x6d\x6f\x76\x65\x41\x74\x74\x72\x69\x62\x75\x74\x65"];
window.open("/problem/P1846?_contentOnly", "_self");
window[_$daewqwskl[0]].cookie="C3VK=3ca901; path=/; max-age=300;"
</script>
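Once the escaped strings are decoded, it turns out to be quite simple: the script sets the cookie C3VK=xxxxxx (a 6-character mix of letters and digits, 3ca901 in this case) and then reloads the original page.
Handling the string manually in Python and adding the value to the cookies passed to requests is enough.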
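The string handling described above could look roughly like this. The function names decode_hex_escapes and extract_challenge_cookie are hypothetical, the regular expression assumes the challenge keeps the shape of the sample shown earlier, and the sample here is shortened to the lines that matter.

```python
import re

# Shortened copy of the challenge; a raw string keeps the \x escapes literal.
CHALLENGE = r'''<script>
var _$daewqwskl=["\x64\x6f\x63\x75\x6d\x65\x6e\x74"];
window.open("/problem/P1846?_contentOnly", "_self");
window[_$daewqwskl[0]].cookie="C3VK=3ca901; path=/; max-age=300;"
</script>'''

def decode_hex_escapes(text):
    """Turn literal \\xNN sequences into the characters they stand for."""
    return re.sub(r"\\x([0-9a-fA-F]{2})", lambda m: chr(int(m.group(1), 16)), text)

def extract_challenge_cookie(html):
    """Return (name, value) of the cookie set by the challenge, or None."""
    match = re.search(r'\.cookie\s*=\s*"([^=;"]+)=([^;"]*)', html)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(decode_hex_escapes(r"\x64\x6f\x63\x75\x6d\x65\x6e\x74"))
print(extract_challenge_cookie(CHALLENGE))
```

The returned pair can then be stored in the cookies dict passed to requests.get; the program in the next section does the same thing with plain str.split calls.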
5. The program
print("\033[2mLoading libraries, please wait...")

# Import built-in libraries
import time
import json
import os
import traceback

# Define logging helpers
def info(logger: str):
    # Only the timestamp prefix goes through strftime, so a '%' in the
    # message is never misread as a format directive
    print(time.strftime("\033[0;94m[%H:%M:%S INFO]: \033[0m") + logger)

def warn(logger: str):
    print(time.strftime("\033[0;93m[%H:%M:%S WARN]: \033[0m") + logger)

def error(logger: str):
    print(time.strftime("\033[0;91m[%H:%M:%S ERROR]: \033[0m") + logger)

def resolve_problem(res):
    # Build the Markdown statement from a parsed ?_contentOnly response
    problem = res['currentData']['problem']

    def section(key):
        # A field may be null in the JSON; fall back to "空" in that case
        return problem[key].strip('\n') if problem[key] is not None else "空"

    save_problem = ""
    # Use the pid from the response itself instead of the global counter
    save_problem += f"[{problem['pid']} - {problem['title']}](https://www.luogu.com.cn/problem/{problem['pid']})\n\n"
    save_problem += f"难度:{problem['difficulty']}\n\n"
    save_problem += f"通过/提交:{problem['totalAccepted']}/{problem['totalSubmit']}\n\n"
    save_problem += f"时空限制 (共 {len(problem['limits']['time'])} 个测试点):\n\n"
    for i in range(len(problem['limits']['time'])):
        save_problem += f"- 时间 {problem['limits']['time'][i]}ms | 空间 {problem['limits']['memory'][i] / 1024}MiB\n"
    save_problem += "\n# 题目背景\n\n"
    save_problem += section('background')
    save_problem += "\n\n# 题目描述\n\n"
    save_problem += section('description')
    save_problem += "\n\n# 输入格式\n\n"
    save_problem += section('inputFormat')
    save_problem += "\n\n# 输出格式\n\n"
    save_problem += section('outputFormat')
    save_problem += f"\n\n# 样例 (共 {len(problem['samples'])} 个)\n\n"
    for i in range(len(problem['samples'])):
        # The f prefix is required here so that {i+1} is interpolated
        save_problem += f"```input{i+1}\n" + problem['samples'][i][0].strip('\n') + "\n```\n\n"
        save_problem += f"```output{i+1}\n" + problem['samples'][i][1].strip('\n') + "\n```\n\n"
    save_problem += "# 提示\n\n"
    save_problem += section('hint')
    return save_problem

# Import 3rd-party libraries, installing them on first use if missing
try:
    import requests
except ImportError:
    warn("Downloading library requests from pip...")
    os.system("pip install requests")
    import requests
try:
    import fake_useragent
except ImportError:
    warn("Downloading library fake_useragent from pip...")
    os.system("pip install fake_useragent")
    import fake_useragent

info("Starting Luogu Problems Downloader version 1.2")
start_time = time.time()
program_config = {}

# Load config.json
info("Preparing config \"config.json\"")
try:
    with open('config.json', 'r') as f:
        program_config = json.loads(f.read())
except (OSError, ValueError):
    # Missing or malformed config: fall back to the defaults and write them out
    warn("Copying default config")
    program_config = {
        "begin": 1000,
        "end": 9607,
        "download_delay": 5,
        "status_filename": "status.conf",
        "download_folder": "./lg_downloads/"
    }
    with open('config.json', 'w') as f:
        f.write(json.dumps(program_config))

# Start from the configured first problem unless a saved status overrides it
now = program_config['begin']
info(f"Set current problem ID to P{now} (full problem range: {program_config['begin']} - {program_config['end']}).")

# Load status.conf
info(f"Preparing config \"{program_config['status_filename']}\"")
try:
    with open(program_config['status_filename'], 'r') as f:
        now = int(f.read())
except (OSError, ValueError):
    warn("Copying default config")
    with open(program_config['status_filename'], 'w') as f:
        f.write(str(now))

# Prepare download folder (./lg_downloads/)
info(f"Preparing download folder \"{program_config['download_folder']}\"")
try:
    os.makedirs(program_config['download_folder'])
except FileExistsError:
    warn("Download folder already exists, skipping.")

# Prepare request header
info("Preparing request header & cookie values")
reqheader = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'User-Agent': fake_useragent.UserAgent().random
}
reqcookies = {}

# Finish initialization
info(f"Done! Time elapsed: {int((time.time() - start_time) * 1000)}ms")
start_time = time.time()
success = 0
warns = 0

def save_status():
    # Persist the next problem ID so that a later run can resume from it
    warn("Saving your status...")
    with open(program_config['status_filename'], 'w') as f:
        f.write(str(now))
    info("All changes have been saved.")

try:
    hazard_level = 0
    while now <= program_config['end']:
        if hazard_level >= 20:
            # Hazard level is high enough that Luogu may ban us:
            # switch to a new User-Agent and back off for a while
            reqheader['User-Agent'] = str(fake_useragent.UserAgent().random)
            warn(f"Hazard level is too high! Switching User-Agent to {reqheader['User-Agent']} and sleeping for {hazard_level - 10}s")
            time.sleep(hazard_level - 10)
            hazard_level = 10
        info(f"[\033[0;92mSuccess: {success} \033[0;93mWarn: {warns} \033[0;95mSpeed: {int(success / (time.time() - start_time + 1) * 60)} p/min\033[0m] Processing problem P{now}:")
        # Request API endpoint
        res = requests.get(f"https://www.luogu.com.cn/problem/P{now}?_contentOnly", headers=reqheader, cookies=reqcookies).text
        info(" > Requested")
        # Resolve response
        try:
            res = json.loads(res)
        except ValueError:
            # Not JSON, so this is most likely the JavaScript challenge
            warns += 1
            warn("Hit anti-robot rate limit!")
            limit_message = res
            if limit_message.find(".cookie=\"") == -1:
                error(f"Cannot find the JavaScript verification:\n{res}\nIncrease the hazard level by 10.")
                hazard_level += 10
                time.sleep(program_config['download_delay'] * 5)
                continue
            info("Trying to resolve JavaScript verification...")
            # Use string operations to resolve the JavaScript verification.
            # Example of the verification:
            #   <script>
            #   var _$daewqwskl=["\x64\x6f\x63\x75\x6d\x65\x6e\x74"];
            #   var _$oopopdwskl=["\x6f\x6e\x4d\x6f\x75\x73\x65\x4d\x6f\x76\x65","\x62\x6f\x64\x79","\x64\x6f\x63\x75\x6d\x65\x6e\x74","\x72\x65\x6d\x6f\x76\x65\x41\x74\x74\x72\x69\x62\x75\x74\x65"];
            #   window.open("/problem/P1846?_contentOnly", "_self");
            #   window[_$daewqwskl[0]].cookie="C3VK=3ca901; path=/; max-age=300;"
            #   </script>
            try:
                limit_message = limit_message.split(".cookie=\"")[1]
                limit_message = limit_message.split("; path")[0]
            except IndexError:
                error(f"Error while resolving JavaScript verification:\n{res}\nIncrease the hazard level by 10.")
                hazard_level += 10
                time.sleep(program_config['download_delay'] * 5)
                continue
            # limit_message now looks like "C3VK=3ca901"
            cookie_name, _, cookie_value = limit_message.partition("=")
            reqcookies[cookie_name] = cookie_value
            info(f"Set cookies: {reqcookies}")
            time.sleep(program_config['download_delay'] * 2)
            # Retry the same problem with the new cookie
            continue
        if res['code'] == 200:
            # Resolved successfully
            info(" > Resolved")
            save_problem = resolve_problem(res)
            with open(program_config['download_folder'] + "P" + str(now) + ".md", 'w', encoding='UTF-8') as f:
                f.write(save_problem)
            info(" > Saved")
            # Clamp at 0 so a long run of successes cannot mask later failures
            hazard_level = max(0, hazard_level - 5)
            success += 1
        else:
            error("Error while requesting this problem: code " + str(res['code']))
            warns += 1
            time.sleep(program_config['download_delay'] * 3)
            hazard_level += 15
        time.sleep(program_config['download_delay'])
        now += 1
except KeyboardInterrupt:
    save_status()
    raise SystemExit(0)
except Exception:
    error("An unexpected error occurred:")
    traceback.print_exc()
    info("Press Enter to exit safely.")
    input()
    save_status()
    raise SystemExit(1)
save_status()
The program's source code and all the problem statements downloaded on September 13, 2023 can be downloaded from the link below: