# Basic requirements

Crawl every Chinese character that Baidu Hanyu (百度汉语) has an entry for, together with all the words formed from each character.

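## Process

1. Enumerate all Chinese characters by looking up their Unicode code-point ranges (the ranges behind their UTF-8 encodings).

2. Query the target website to fetch the entry for each character.

![image-5](uploads/7da1a7ac78d3cb26fab80569dd74c855/image-5.png)
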
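
Step 1 can be sketched as follows. This is a minimal illustration, not a prescribed method: the two ranges below (the basic CJK Unified Ideographs block and Extension A) are an assumption about which blocks to cover, and the name `iter_cjk_characters` is made up for this example. Not every code point in these blocks necessarily has an entry on the target site, so the crawler still has to handle missing pages.

```python
from typing import Iterator, List, Tuple

# Assumed ranges (not specified in this document):
# CJK Unified Ideographs and CJK Unified Ideographs Extension A.
CJK_RANGES: List[Tuple[int, int]] = [
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
]


def iter_cjk_characters(ranges: List[Tuple[int, int]] = CJK_RANGES) -> Iterator[str]:
    """Yield one character per code point in each inclusive range."""
    for start, end in ranges:
        for code_point in range(start, end + 1):
            yield chr(code_point)


chars = list(iter_cjk_characters())
first = chars[0]
# Show the first candidate with its code point and its UTF-8 bytes.
print(first, hex(ord(first)), first.encode("utf-8").hex())
```

Further extension blocks can be added to `CJK_RANGES` if the third grading criterion is targeted.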
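If a character has more than one pronunciation (多音字), treat each reading as a separate character. Store the information for each character in JSON format:
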
```javascript
{
    // Required fields - 1.5 points
    "character": "$character",
    "pronounciation": "$pinyin",
    "explanation": // basic senses of the character, as a list
    [
        ("$sense1", ["$example1", "$example2", ...]), // in each example, replace the character with an English (half-width, not full-width) ~
        ("$sense2", ["$example1", "$example2", ...]),
        ...
    ],
    "detailed": // detailed senses of the character, grouped by part of speech
    {
        "$part_of_speech1": // part of speech: 名 (noun), 动 (verb), etc.
        [
            ("$sense1", [("$source", "$example_sentence"), ...]), // if there is no source, $source is None; replace the character in each example with an English (half-width) ~
            ...
        ]
    },
    "external": // Baidu Baike
    [
        ("$sense1", ["$example_sentence1", "$example_sentence2", ...]), // replace the character in each example with an English (half-width) ~
        ...
    ],
    "translation": // English translations
    ["$translation1", "$translation2", ...],
    // Optional fields - 0.5 points
    "radical": "$radical",
    "traditional": "$traditional_form",
    "encoding": "$utf8_encoding",
    "puzzles":
    ["$riddle1", "$riddle2"]
}
```

For example:

```javascript
{
    "character": "意",
    "pronounciation": "yì",
    "explanation": [
        ("意思", ["来~"]),
        ("愿望", ["满~"]),
        ...
    ],
    "detailed":
    {
        "名":
        [
            ("(会意。从心从音。本义:心志。心意)", []),
            ("同本义", [("《说文》","~,志也。"),("《春秋繁露·循天之道》","心之所谓~。"), ...]),
            ...
        ],
        "动":
        [
            ("思念;放在心上 。",[(None, "如:~悬悬(忐忑不安;提心吊胆);~悬(挂念);~顾(挂念)")]),
            (
                "意料;猜测",
                [
                    ("《管子·小问》","而小人善~。臣~之也。"),
                    ("《玉台新咏·古诗为焦仲卿妻作》","何~致不厚。"),
                    ...
                ]
            ),
            ...
        ]
    },
    "external":
    [
        ("儒家谓人对事物的思想与情态和对事物的态度。先秦儒家非常重视意对外界事物的看法,认为人对事物与行为好坏的看法都是由意造成的。", ["检其邪心,守其正~。"])
    ],
    "translation": []
}
```
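
The notation above uses tuples and `None`, which plain JSON does not have. One possible way to store such a record, sketched here with Python's standard `json` module, is to let tuples become JSON arrays and `None` become `null`. The record below is a shortened copy of the 意 example; whether arrays and `null` are the expected on-disk form is an assumption, not something this document states.

```python
import json

# Shortened copy of the example record above (most senses omitted).
record = {
    "character": "意",
    "pronounciation": "yì",
    "explanation": [("意思", ["来~"]), ("愿望", ["满~"])],
    "detailed": {
        "动": [
            (
                "思念;放在心上 。",
                [(None, "如:~悬悬(忐忑不安;提心吊胆);~悬(挂念);~顾(挂念)")],
            ),
        ],
    },
    "external": [],
    "translation": [],
}

# ensure_ascii=False keeps the Chinese text readable instead of \uXXXX escapes.
text = json.dumps(record, ensure_ascii=False)

# Reading it back: tuples come back as lists, None comes back from null.
loaded = json.loads(text)
print(loaded["explanation"][0])
print(loaded["detailed"]["动"][0][1][0][0] is None)
```

A consumer of the stored file therefore sees each `(sense, examples)` pair as a two-element array.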

The detailed senses are only shown after clicking `更多` ("More").

Tip: look at how the URL is constructed.

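
As an illustration of the tip, the sketch below builds a query URL for one character and decodes it again. The base URL and the parameter name `wd` are placeholders invented for this example; the real values have to be read from the target site's address bar. Only the percent-encoding of the Chinese text is the point here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: replace with the URL observed on the target site.
BASE_URL = "https://example.invalid/search"


def build_query_url(text: str, base_url: str = BASE_URL, param: str = "wd") -> str:
    """Return base_url with `text` percent-encoded (UTF-8) as a query parameter."""
    return f"{base_url}?{urlencode({param: text})}"


url = build_query_url("意")
print(url)                           # https://example.invalid/search?wd=%E6%84%8F
print(parse_qs(urlsplit(url).query))  # {'wd': ['意']}
```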
Another example:

```javascript
{
    "character":"擦",
    ...,
    "external":
    [
        ("物体在移动中相接触", ["摩~", "~火柴","摩拳~掌","手~破了皮。"]),
        ("用布、手巾等摩擦使干净",["~脸","~汗","~玻璃","~汗~手","~桌椅"]),
        ...
    ],
    "translation":[]
}
```

3. Open the "更多" ("More") option shown in the figure:

![image-6](uploads/fc294787142af016d82b0d8d9d74cbeb/image-6.png)

and collect all the words formed from the character:

![image-7](uploads/28e99d7a02e9d130939a767d81bb7637/image-7.png)

4. Store the information for each word in JSON format:

```javascript
{
    "word": "$word",
    "pronounciation": "$pinyin",
    "explanation": // basic senses of the word, as a list
    [
        ("$sense1", ["$example1", "$example2", ...]), // in each example, replace the word with an English (half-width, not full-width) ~
        ("$sense2", ["$example1", "$example2", ...]),
        ...
    ],
    "detailed": // detailed senses of the word, as a list
    [
        ("$sense1", [("$source", "$example_sentence"), ...]), // if there is no source, $source is None; replace the word in each example with an English (half-width) ~
        ...
    ],
    "external": // Baidu Baike
    [
        ("$sense1", ["$example_sentence1", "$example_sentence2", ...]), // replace the word in each example with an English (half-width) ~
        ...
    ],
    "translation": // English translations
    ["$translation1", "$translation2", ...]
}
```

For example:

![image-8](uploads/95123723816c82126b85ad2de2f76902/image-8.png)

```javascript
{
    "word":"创意",
    "pronounciation":"chuàng yì",
    "explanation":
    [
        ("有创造性的想法、构思等",["颇具~", "这个设计风格保守,毫无~可言"]),
        ("提出有创造性的想法、构思等", ["这项活动由工会~发起。"])
    ],
    "detailed":[
        ("亦作“剙意”。谓创立新意。",
            [
                ("汉 王充 《论衡·超奇》","孔子 得史记以作《春秋》,及其立义~,褒贬赏诛,不復因史记者,眇思自出於胸中也。"),
                ("宋 程大昌 《演繁露·纳粟拜爵》","秦始皇 四年,令民纳粟千石,拜爵一级,按此即 鼂错 之所祖效,非 错 剙意也。") // "创意" does not appear literally in this sentence, so nothing is replaced
            ]
        ),
        ...
    ],
    "external":[],
    "translation":["create new meanings"]
}
```
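
The `~` substitution used in the examples above can be sketched as a plain string replacement. The helper name `mask_headword` is made up for this illustration; it replaces only literal occurrences of the headword, so a sentence that contains a variant form such as 剙意 is left unchanged, as in the 创意 example.

```python
def mask_headword(example: str, headword: str) -> str:
    """Replace every literal occurrence of the headword with an ASCII tilde."""
    return example.replace(headword, "~")


# Example sentences taken from the 创意 record above.
print(mask_headword("颇具创意", "创意"))
print(mask_headword("这个设计风格保守,毫无创意可言", "创意"))

# "创意" does not occur literally here, so the text comes back unchanged.
unchanged = "秦始皇 四年,令民纳粟千石,拜爵一级,按此即 鼂错 之所祖效,非 错 剙意也。"
print(mask_headword(unchanged, "创意") == unchanged)
```

The same helper works for single characters, for instance masking 意 in 满意.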
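
## Grading criteria

1. Crawl the commonly used Chinese characters in the GB2312 standard (1 point)
2. Crawl all the words formed from those commonly used characters (1 point)
3. Crawl all Chinese characters in Unicode and the words formed from them (where entries exist; 1 point)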
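
For the first criterion, one possible way to obtain the GB2312 character list is to walk the standard's 94×94 code table and decode each position with Python's built-in `gb2312` codec. This is a sketch under assumptions: GB2312 places its 6,763 Chinese characters in rows 16 to 87 (level 1 in rows 16 to 55, level 2 in rows 56 to 87), and this document does not say whether "commonly used" means level 1 only or both levels, so the function below returns both. The name `gb2312_hanzi` is made up for this example.

```python
from typing import List


def gb2312_hanzi(first_row: int = 16, last_row: int = 87) -> List[str]:
    """Return the Chinese characters of GB2312 in code-table order."""
    chars = []
    for row in range(first_row, last_row + 1):
        for cell in range(1, 95):
            # EUC-CN encoding: each byte is the row/cell number plus 0xA0.
            raw = bytes([0xA0 + row, 0xA0 + cell])
            try:
                chars.append(raw.decode("gb2312"))
            except UnicodeDecodeError:
                # Unassigned position in the table (for example the end of row 55).
                continue
    return chars


hanzi = gb2312_hanzi()
print(len(hanzi), hanzi[0])
```

Passing `last_row=55` restricts the result to the level-1 characters.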