Luogu Problem Statement Batch Downloader

1. Motivation

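I have recently been moving problems onto my personal OJ, but copying the Markdown one problem at a time is too slow, so I wrote a downloader to fetch problem statements in bulk.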

2. Request URL

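If you append ?_contentOnly to any URL on the Luogu main site, Luogu returns the page data as JSON. Note that the User-Agent header must not be empty; otherwise Luogu simply returns a 404 (presumably lin_toto designed it this way to deter crawlers?).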
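As a minimal sketch of the request described above, the snippet below builds the ?_contentOnly URL for a problem and attaches a non-empty User-Agent. It uses only the standard library so it can be read on its own; the helper names build_request and fetch_problem_json and the DEFAULT_UA string are made up for this illustration, and the full program in section 5 uses requests with fake_useragent instead.

```python
import json
import urllib.request

LUOGU_BASE = "https://www.luogu.com.cn"
# Hypothetical User-Agent string for illustration; any non-empty value should do.
DEFAULT_UA = "Mozilla/5.0 (X11; Linux x86_64) luogu-statement-downloader-example"


def build_request(pid: str, user_agent: str = DEFAULT_UA) -> urllib.request.Request:
    """Build (but do not send) a request for a problem's JSON page data."""
    if not user_agent:
        # An empty User-Agent is reported above to yield a 404.
        raise ValueError("User-Agent must not be empty")
    url = f"{LUOGU_BASE}/problem/{pid}?_contentOnly"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_problem_json(pid: str) -> dict:
    """Send the request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(build_request(pid), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_problem_json("P7960") would be expected to return a dictionary shaped like the response in section 3, provided the request is not answered with the JavaScript challenge from section 4, in which case json.loads raises an error.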

3. Parsing the JSON

Take https://www.luogu.com.cn/problem/P7960?_contentOnly as an example:

{
    "code": 200,
    "currentTemplate": "ProblemShow",
    "currentData": {
        "problem": {
            "background": "",
            "description": "报数游戏是一个广为流传的休闲小游戏。参加游戏的每个人要按一定顺序轮流报数,但如果下一个报的数是 $7$ 的倍数,或十进制表示中含有数字 $7$,就必须跳过这个数,否则就输掉了游戏。\n\n在一个风和日丽的下午,刚刚结束 SPC20nn 比赛的小 r 和小 z 闲得无聊玩起了这个报数游戏。但在只有两个人玩的情况下计算起来还是比较容易的,因此他们玩了很久也没分出胜负。此时小 z 灵光一闪,决定把这个游戏加强:任何一个十进制中含有数字 $7$ 的数,它的所有倍数都不能报出来!\n\n形式化地,设 $p(x)$ 表示 $x$ 的十进制表示中是否含有数字 $7$,若含有则 $p(x) = 1$,否则 $p(x) = 0$。则一个正整数 $x$ 不能被报出,当且仅当存在正整数 $y$ 和 $z$ ,使得 $x = yz$ 且 $p(y) = 1$。\n\n\n例如,如果小 r 报出了 $6$ ,由于 $7$ 不能报,所以小 z 下一个需要报 $8$;如果小 r 报出了 $33$,则由于 $34 = 17 \\times 2$,$35 = 7 \\times 5$ 都不能报,小 z 下一个需要报出 $36$ ;如果小 r 报出了 $69$,由于 $70 \\sim 79$ 的数都含有 $7$,小 z 下一个需要报出 $80$ 才行。\n\n现在小 r 的上一个数报出了 $x$,小 z 想快速算出他下一个数要报多少,不过他很快就发现这个游戏可比原版的游戏难算多了,于是他需要你的帮助。当然,如果小 r 报出的 x 本身是不能报出的,你也要快速反应过来小 r 输了才行。\n\n由于小 r 和小 z 玩了很长时间游戏,你也需要回答小 z 的很多个问题。",
            "inputFormat": "第一行,一个正整数 $T$ 表示小 z 询问的数量。\n\n接下来 $T$ 行,每行一个正整数 $x$,表示这一次小 r 报出的数。",
            "outputFormat": "输出共 $T$ 行,每行一个整数,如果小 r 这一次报出的数是不能报出的,输出 $-1$,否则输出小 z 下一次报出的数是多少。",
            "samples": [
                [
                    "4\n6\n33\n69\n300\n",
                    "8\n36\n80\n-1\n"
                ],
                [
                    "5\n90\n99\n106\n114\n169\n",
                    "92\n100\n109\n-1\n180\n"
                ],
                [
                    "见附件中的 number/number3.in",
                    "见附件中的 number/number3.ans"
                ],
                [
                    "见附件中的 number/number4.in",
                    "见附件中的 number/number4.ans"
                ]
            ],
            "hint": "**【样例解释 #1】**\n\n这一组样例的前 $3$ 次询问在题目描述中已有解释。\n\n对于第 $4$ 次询问,由于 $300 = 75 \\times 4$,而 $75$ 中含有 $7$ ,所以小 r 直接输掉了游戏。\n\n**【数据范围】**\n\n对于 $10\\%$ 的数据,$T \\leq 10$,$x \\leq 100$。  \n对于 $30\\%$ 的数据,$T \\leq 100$,$x \\leq 1000$。  \n对于 $50\\%$ 的数据,$T \\leq 1000$,$x \\leq 10000$。  \n对于 $70\\%$ 的数据,$T \\leq 10000$,$x \\leq 2 \\times {10}^5$。  \n对于 $100\\%$ 的数据,$1 \\le T \\leq 2 \\times {10}^5$,$1 \\le x \\leq {10}^7$。",
            "provider": {
                "uid": 19,
                "name": "CCF_NOI",
                "slogan": "",
                "badge": null,
                "isAdmin": false,
                "isBanned": false,
                "color": "Gray",
                "ccfLevel": 0,
                "background": ""
            },
            "attachments": [
                {
                    "downloadLink": "/fe/api/problem/downloadAttachment/nuacg4zh",
                    "size": 1332243,
                    "uploadTime": 1637419301,
                    "id": "nuacg4zh",
                    "filename": "number.zip"
                }
            ],
            "canEdit": false,
            "limits": {
                "time": [
                    1000,
                    1000,
                    1000,
                    1000,
                    1000,
                    1000,
                    1000,
                    1000,
                    1000,
                    1000
                ],
                "memory": [
                    524288,
                    524288,
                    524288,
                    524288,
                    524288,
                    524288,
                    524288,
                    524288,
                    524288,
                    524288
                ]
            },
            "stdCode": "",
            "tags": [
                58,
                83,
                108
            ],
            "wantsTranslation": false,
            "totalSubmit": 67236,
            "totalAccepted": 14342,
            "flag": 5,
            "pid": "P7960",
            "title": "[NOIP2021] 报数",
            "difficulty": 3,
            "fullScore": 100,
            "type": "P"
        },
        "contest": null,
        "discussions": [
            {
                "id": 680342,
                "title": "全re",
                "forum": {
                    "id": 65422,
                    "name": "P7960 [NOIP2021] 报数",
                    "slug": "P7960"
                }
            },
            {
                "id": 675589,
                "title": "样例都过不了求助!",
                "forum": {
                    "id": 65422,
                    "name": "P7960 [NOIP2021] 报数",
                    "slug": "P7960"
                }
            },
            {
                "id": 648474,
                "title": "两样例全过结果10分求助",
                "forum": {
                    "id": 65422,
                    "name": "P7960 [NOIP2021] 报数",
                    "slug": "P7960"
                }
            },
            {
                "id": 642828,
                "title": "tle优化求助",
                "forum": {
                    "id": 65422,
                    "name": "P7960 [NOIP2021] 报数",
                    "slug": "P7960"
                }
            },
            {
                "id": 632904,
                "title": "70pts后三个点WA的人看过来",
                "forum": {
                    "id": 65422,
                    "name": "P7960 [NOIP2021] 报数",
                    "slug": "P7960"
                }
            },
            {
                "id": 632586,
                "title": "求助RE",
                "forum": {
                    "id": 65422,
                    "name": "P7960 [NOIP2021] 报数",
                    "slug": "P7960"
                }
            }
        ],
        "bookmarked": false,
        "vjudgeUsername": null,
        "recommendations": [
            {
                "pid": "P7113",
                "title": "[NOIP2020] 排水系统",
                "difficulty": 4,
                "fullScore": 100,
                "type": "P"
            },
            {
                "pid": "P7913",
                "title": "[CSP-S 2021] 廊桥分配",
                "difficulty": 4,
                "fullScore": 100,
                "type": "P"
            },
            {
                "pid": "P7915",
                "title": "[CSP-S 2021] 回文",
                "difficulty": 4,
                "fullScore": 100,
                "type": "P"
            },
            {
                "pid": "P7962",
                "title": "[NOIP2021] 方差",
                "difficulty": 6,
                "fullScore": 100,
                "type": "P"
            },
            {
                "pid": "P7961",
                "title": "[NOIP2021] 数列",
                "difficulty": 5,
                "fullScore": 100,
                "type": "P"
            }
        ],
        "lastLanguage": 0,
        "lastCode": "",
        "privilegedTeams": [],
        "userTranslation": null
    },
    "currentTitle": "[NOIP2021] 报数",
    "currentTheme": null,
    "currentTime": 1694678931
}

The response for this problem is 196 lines long. The relevant fields are:

  1. code: status code; 200 means success, 403 means you are not allowed to view the problem
  2. currentData.problem.title: problem title
  3. currentData.problem.difficulty: difficulty (1 to 7 run from “入门” (Beginner) to “NOI/NOI+/CTSC”; 0 means “暂无评定” (not yet rated))
  4. currentData.problem.totalSubmit / currentData.problem.totalAccepted: total submissions / total accepted
  5. currentData.problem.limits: list of per-test-point limits
  6. currentData.problem.background: problem background
  7. currentData.problem.description: problem description
  8. currentData.problem.inputFormat: input format
  9. currentData.problem.outputFormat: output format
  10. currentData.problem.hint: hints

Based on these fields, write a handler function, resolve_problem(res) -> str, that takes the parsed JSON object and returns the finished Markdown.
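Before turning the response into Markdown, it can help to see the field access in isolation. The sketch below pulls the fields listed above out of a parsed response; summarize_problem and the trimmed SAMPLE dictionary are illustrative only and are not part of the original program.

```python
def summarize_problem(res: dict) -> dict:
    """Collect the fields of interest from a parsed ?_contentOnly response."""
    if res.get("code") != 200:
        raise ValueError(f"unexpected status code: {res.get('code')}")
    problem = res["currentData"]["problem"]
    submitted = problem["totalSubmit"]
    accepted = problem["totalAccepted"]
    limits = problem["limits"]
    text_keys = ("background", "description", "inputFormat", "outputFormat", "hint")
    return {
        "title": problem["title"],
        "difficulty": problem["difficulty"],
        # Guard against division by zero for problems nobody has submitted to.
        "acceptance": accepted / submitted if submitted else 0.0,
        # Collapse identical per-test-point (time, memory) pairs.
        "limits": sorted(set(zip(limits["time"], limits["memory"]))),
        # Treat a missing or null text field as empty.
        "sections": {k: (problem.get(k) or "").strip("\n") for k in text_keys},
    }


# A trimmed-down copy of the response shown above.
SAMPLE = {
    "code": 200,
    "currentData": {
        "problem": {
            "title": "[NOIP2021] 报数",
            "difficulty": 3,
            "totalSubmit": 67236,
            "totalAccepted": 14342,
            "limits": {"time": [1000, 1000], "memory": [524288, 524288]},
            "background": "",
            "hint": None,
        }
    },
}

summary = summarize_problem(SAMPLE)
print(summary["title"], summary["limits"])
```

The memory values appear to be in KiB, since the program in section 5 divides them by 1024 and labels the result MiB; that unit is an assumption carried over from the program rather than something the response states.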

4. Bypassing Luogu's anti-crawler limits

When requests come too fast, Luogu sometimes responds with a JavaScript challenge like the following:

<script>
    var _$daewqwskl=["\x64\x6f\x63\x75\x6d\x65\x6e\x74"];
    var _$oopopdwskl=["\x6f\x6e\x4d\x6f\x75\x73\x65\x4d\x6f\x76\x65","\x62\x6f\x64\x79","\x64\x6f\x63\x75\x6d\x65\x6e\x74","\x72\x65\x6d\x6f\x76\x65\x41\x74\x74\x72\x69\x62\x75\x74\x65"];
    window.open("/problem/P1846?_contentOnly", "_self");
    window[_$daewqwskl[0]].cookie="C3VK=3ca901; path=/; max-age=300;"
</script>

Once decoded, it is quite simple: it sets the cookie C3VK=xxxxxx (six mixed letters and digits, 3ca901 here) and then redirects back to the original page.

Extract the value with a little manual string handling in Python and add it to the cookies passed to requests.
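The extraction step can be sketched as follows. This version uses a regular expression rather than the split calls in the full program, and extract_challenge_cookie is a name chosen for this example. It assumes the challenge always assigns a single name=value pair to .cookie, as in the sample above.

```python
import re

# Matches: .cookie="NAME=VALUE; ..." and captures NAME and VALUE.
COOKIE_RE = re.compile(r'\.cookie\s*=\s*"([^=;"]+)=([^;"]*)')


def extract_challenge_cookie(body: str):
    """Return (name, value) from a JavaScript challenge page, or None if absent."""
    match = COOKIE_RE.search(body)
    if match is None:
        return None
    return match.group(1), match.group(2)


# The challenge shown above, kept as a raw string so the \x escapes stay literal.
SAMPLE = r'''<script>
    var _$daewqwskl=["\x64\x6f\x63\x75\x6d\x65\x6e\x74"];
    window.open("/problem/P1846?_contentOnly", "_self");
    window[_$daewqwskl[0]].cookie="C3VK=3ca901; path=/; max-age=300;"
</script>'''

print(extract_challenge_cookie(SAMPLE))
```

The returned pair can then be stored in the dictionary passed as cookies= to requests.get before retrying the same URL.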

5. The program

print("\033[2mLoading libraries, please wait...")

# Import built-in libraries
import time
import json
import os
import traceback

# Define functions
# Only the prefix goes through strftime, so a '%' inside the message is never read as a format directive
def info(logger : str):
    print(time.strftime("\033[0;94m[%H:%M:%S INFO]: \033[0m") + logger)

def warn(logger : str):
    print(time.strftime("\033[0;93m[%H:%M:%S WARN]: \033[0m") + logger)

def error(logger : str):
    print(time.strftime("\033[0;91m[%H:%M:%S ERROR]: \033[0m") + logger)

def resolve_problem(res):
    #print(res)
    save_problem = ""
    save_problem += f"[P{now} - {res['currentData']['problem']['title']}](https://www.luogu.com.cn/problem/P{now})\n\n"
    save_problem += f"难度:{res['currentData']['problem']['difficulty']}\n\n"
    save_problem += f"通过/提交:{res['currentData']['problem']['totalAccepted']}/{res['currentData']['problem']['totalSubmit']}\n\n"
    save_problem += f"时空限制 (共 {len(res['currentData']['problem']['limits']['time'])} 个测试点):\n\n"
    for i in range(len(res['currentData']['problem']['limits']['time'])):
        save_problem += f"- 时间 {res['currentData']['problem']['limits']['time'][i]}ms | 空间 {res['currentData']['problem']['limits']['memory'][i] / 1024}MiB\n"
    save_problem += f"\n# 题目背景\n\n"
    save_problem += res['currentData']['problem']['background'].strip('\n') if res['currentData']['problem']['background'] != None else "空"
    save_problem += f"\n\n# 题目描述\n\n"
    save_problem += res['currentData']['problem']['description'].strip('\n') if res['currentData']['problem']['description'] != None else "空"
    save_problem += f"\n\n# 输入格式\n\n"
    save_problem += res['currentData']['problem']['inputFormat'].strip('\n') if res['currentData']['problem']['inputFormat'] != None else "空"
    save_problem += f"\n\n# 输出格式\n\n"
    save_problem += res['currentData']['problem']['outputFormat'].strip('\n') if res['currentData']['problem']['outputFormat'] != None else "空"
    save_problem += f"\n\n# 样例 (共 {len(res['currentData']['problem']['samples'])} 个)\n\n"
    for i in range(len(res['currentData']['problem']['samples'])):
        # f-strings are required here so that {i+1} is interpolated into the fence label
        save_problem += f"```input{i+1}\n" + res['currentData']['problem']['samples'][i][0].strip('\n') + "\n```\n\n"
        save_problem += f"```output{i+1}\n" + res['currentData']['problem']['samples'][i][1].strip('\n') + "\n```\n\n"
    save_problem += f"# 提示\n\n"
    save_problem += res['currentData']['problem']['hint'].strip('\n') if res['currentData']['problem']['hint'] != None else "空"
    return save_problem

# Import 3rd-party functions
try:
    import requests
except(ImportError):
    warn("Downloading library requests from pip...")
    os.system("pip install requests")
    import requests
try:
    import fake_useragent
except(ImportError):
    warn("Downloading library fake_useragent from pip...")
    os.system("pip install fake_useragent")
    import fake_useragent

info("Starting Luogu Problems Downloader version 1.2")

start_time = time.time()

program_config = {}

# Load config.json
info("Preparing config \"config.json\"")
try:
    with open('config.json', 'r') as f:
        program_config = json.loads(f.read())
except:
    warn("Copying default config")
    program_config = {
        "begin": 1000,
        "end": 9607,
        "download_delay": 5,
        "status_filename": "status.conf",
        "download_folder": "./lg_downloads/"
    }
    with open('config.json', 'w') as f:
        f.write(json.dumps(program_config))
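# Start from the configured first problem unless a saved status overrides it below
now = program_config['begin']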
info(f"Set now problem ID to P{now} (Full problems' range is: {program_config['begin']} - {program_config['end']}).")

# Load status.conf
info(f"Preparing config \"{program_config['status_filename']}\"")
try:
    with open(program_config['status_filename'], 'r') as f:
        now = int(f.read())
except:
    warn("Copying default config")
    with open(program_config['status_filename'], 'w') as f:
        f.write(str(now))

# Prepare download folder (./lg_downloads/)
info(f"Preparing download folder \"{program_config['download_folder']}\"")
try:
    os.makedirs(program_config['download_folder'])
except FileExistsError:
    warn("Download folder already exists; skipping.")

# Prepare request header
info("Preparing Request Header & Cookies Value")
reqheader = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7', 'User-Agent': fake_useragent.UserAgent().random}
reqcookies = {}

# Initialization finished
info(f"Done! Time elapsed: {int((time.time() - start_time) * 1000)}ms")

start_time = time.time()
success = 0
warns = 0

try:
    hazard_level = 0
    while(now <= program_config['end']):
        if(hazard_level >= 20):
            # Hazard level is too high and Luogu may ban us: rotate the User-Agent and back off
            reqheader['User-Agent'] = str(fake_useragent.UserAgent().random)
            warn(f"Hazard Level is too high! Reload the User Agent {reqheader['User-Agent']} and sleep for {hazard_level - 10}s")
            time.sleep(hazard_level - 10)
            hazard_level = 10
        info(f"[\033[0;92mSuccess: {success} \033[0;93mWarn: {warns} \033[0;95mSpeed: {int(success / (time.time() - start_time + 1) * 60)} p/min\033[0m] Proceed problem P{now}:")
        # Request API endpoint
        res = requests.get(f"https://www.luogu.com.cn/problem/P{now}?_contentOnly", headers=reqheader, cookies=reqcookies).text
        info(" > Requested")
        # Resolve response
        try:
            res = json.loads(res)
        except:
            warns += 1
            warn("Hit the anti-bot rate limit!")
            limit_message = res
            if(limit_message.find(".cookie=\"") == -1):
                error(f"Cannot find the JavaScript verification:\n{res}\nIncrease the hazard level by 10.")
                hazard_level += 10
                time.sleep(program_config['download_delay'] * 5)
                continue
            info("Trying to resolve JavaScript verification...")
            # Use string actions to resolve JavaScript verification
            # The example of verification:
            """
            <script>
                var _$daewqwskl=["\x64\x6f\x63\x75\x6d\x65\x6e\x74"];
                var _$oopopdwskl=["\x6f\x6e\x4d\x6f\x75\x73\x65\x4d\x6f\x76\x65","\x62\x6f\x64\x79","\x64\x6f\x63\x75\x6d\x65\x6e\x74","\x72\x65\x6d\x6f\x76\x65\x41\x74\x74\x72\x69\x62\x75\x74\x65"];
                window.open("/problem/P1846?_contentOnly", "_self");
                window[_$daewqwskl[0]].cookie="C3VK=3ca901; path=/; max-age=300;"
            </script>
            """
            try:
                limit_message = limit_message.split(".cookie=\"")[1]
                limit_message = limit_message.split("; path")[0]
            except:
                error(f"Error while resolving JavaScript verification:\n{res}\nIncrease the hazard level by 10.")
                hazard_level += 10
                time.sleep(program_config['download_delay'] * 5)
                continue
            reqcookies[limit_message.split("=")[0]] = limit_message.split("=")[1]
            info(f"Set cookies: {reqcookies}")
            time.sleep(program_config['download_delay'] * 2)
            continue
        if(res['code'] == 200):
            # Resolved successfully
            info(" > Resolved")
            save_problem = resolve_problem(res)
            #print(save_problem)
            with open(program_config['download_folder'] + "P" + str(now) + ".md", 'w', encoding='UTF-8') as f:
                f.write(save_problem)
                info(" > Saved")
                hazard_level -= 5
                success += 1
        else:
            error("Request failed with code " + str(res['code']))
            warns += 1
            time.sleep(program_config['download_delay'] * 3)
            hazard_level += 15
        time.sleep(program_config['download_delay'])
        now += 1
except KeyboardInterrupt:
    warn("Saving your status...")
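    # Write to the configured status file so the next run resumes from the right place
    with open(program_config['status_filename'], 'w') as f:
        f.write(str(now))
    info("All changes have been saved.")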
    exit(0)
except Exception as e:
    error("An unexpected error occurred:")
    traceback.print_exc()
    info("Press Enter to exit safely.")
    input()
    warn("Saving your status...")
    with open(program_config['status_filename'], 'w') as f:
        f.write(str(now))
    info("All changes have been saved.")
    exit(1)
    
warn("Saving your status...")
with open(program_config['status_filename'], 'w') as f:
    f.write(str(now))
info("All changes have been saved.")

The program's source code and all problem statements downloaded on September 13, 2023 are available at the link below:

Luogu Problem Statement Batch Downloader | CodeZhangBorui Downloads
