# Building the seed word list

Start by collecting a batch of seed vocabulary. The following sources are good starting points:

* [GluonNLP GBW Dataset](https://gluon-nlp.mxnet.io/api/modules/data.html#gluonnlp.data.GBWStream)
* [Original GBW Dataset](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)
* TOEFL word list

Then extract every single word from these sources to form the seed word list. Note: phrases do not need to be considered at this stage.

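
The extraction step can be sketched as follows. This is only a minimal illustration: the function name `extract_seed_words` and the rule for what counts as a single word (ASCII letters with optional internal hyphens or apostrophes, no whitespace) are assumptions, not part of the assignment.

```python
import re
from typing import Iterable, List

# Assumption: a "word" is one run of ASCII letters, optionally joined by
# internal hyphens or apostrophes. Anything with whitespace is a phrase.
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")


def extract_seed_words(entries: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated, lowercased single words in `entries`."""
    seeds = set()
    for entry in entries:
        candidate = entry.strip()
        if WORD_RE.match(candidate):
            seeds.add(candidate.lower())
    return sorted(seeds)


if __name__ == "__main__":
    sample = ["come", "Come", "come across", "well-being", "2019", ""]
    print(extract_seed_words(sample))
```

Entries that contain whitespace, such as `come across`, are skipped here because phrases are handled in the next stage.
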
# Expanding the seed word list and building the seed phrase list

Using the seed word list, expand the vocabulary through [WordNet](http://wordnetweb.princeton.edu/perl/webwn), and build the phrase list at the same time.

![image](uploads/49d1d98fe97d2b7f8f17d9988768baf3/image.png)

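
One possible shape for the expansion step is sketched below. The `lookup` callable is a hypothetical hook that returns the related entries for a word; with NLTK, for instance, it could collect `lemma_names()` over `wordnet.synsets(word)`, but that tooling choice is an assumption. The sketch also assumes that multiword entries are separated by `_` or a space.

```python
from typing import Callable, Iterable, Set, Tuple


def expand_seeds(
    seeds: Iterable[str],
    lookup: Callable[[str], Iterable[str]],
) -> Tuple[Set[str], Set[str]]:
    """Split everything `lookup` returns for the seeds into words and phrases."""
    words: Set[str] = set()
    phrases: Set[str] = set()
    for seed in seeds:
        words.add(seed)
        for name in lookup(seed):
            # Assumption: multiword entries use "_" or " " as the separator.
            term = name.replace("_", " ").strip().lower()
            if not term:
                continue
            if " " in term:
                phrases.add(term)
            else:
                words.add(term)
    return words, phrases


if __name__ == "__main__":
    # Hypothetical stand-in for a real WordNet query.
    fake_wordnet = {"come": ["come", "arrive", "come_up", "come_across"]}
    words, phrases = expand_seeds(["come"], lambda w: fake_wordnet.get(w, []))
    print(sorted(words))
    print(sorted(phrases))
```

Keeping the lookup pluggable lets the same function be tested offline and later pointed at whichever WordNet interface is actually used.
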
# Building the structured dictionary

Crawl the target site: https://www.merriam-webster.com/

![image-2](uploads/0f6993ba7fbfdfba48071a24328ae271/image-2.png)

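
A crawl needs one URL per word or phrase. The sketch below assumes that entries live under `/dictionary/<term>` on the target site and that a fixed pause between requests is acceptable; the names `entry_url` and `fetch_pages`, the User-Agent string, and the delay are all illustrative. Parsing the returned HTML is left out because it depends on the page layout.

```python
import time
import urllib.request
from typing import Iterable, Iterator, Tuple
from urllib.parse import quote

# Assumption: each entry is served at /dictionary/<term>.
BASE_URL = "https://www.merriam-webster.com/dictionary/"


def entry_url(term: str) -> str:
    """Build the entry URL for a word or phrase, percent-encoding spaces."""
    normalized = " ".join(term.split())
    return BASE_URL + quote(normalized, safe="")


def fetch_pages(terms: Iterable[str], delay_seconds: float = 1.0) -> Iterator[Tuple[str, str]]:
    """Yield (term, html) pairs, pausing between requests."""
    for term in terms:
        request = urllib.request.Request(
            entry_url(term),
            headers={"User-Agent": "dictionary-crawler-sketch"},  # hypothetical value
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            yield term, response.read().decode("utf-8", errors="replace")
        time.sleep(delay_seconds)


if __name__ == "__main__":
    for term in ["come", "come across"]:
        print(entry_url(term))
```
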
## Word storage format

```js
{
    "word": "$word",
    "explanation": // basic definitions of the word, as a list
    [
        {
            "type": "$type1", // part of speech, e.g. transitive verb, noun
            "detailed": [
                [["$definition1-1", "$definition1-2"], ["$example1", "$example2"], ["$path_to_ex_1", "$path_to_ex_2"]],
                [["$definition2-1", "$definition2-2"], ["$example1", "$example2"], ["$path_to_ex_1", "$path_to_ex_2"]],
            ]
        },
        {
            "type": "$type2",
        },
    ],
}
```

For example:

```js
{
    "word": "come",
    "explanation":
    [
        {
            "type": "intransitive verb",
            "detailed": [
                [["to move toward something", "approach"], ["come here."], ["Entry 1", "1", "a"]],
                [["to move or journey to a vicinity with a specified purpose"], ["come see us.", "come and see what's going on."], ["Entry 1", "1", "b"]],
            ]
        },
        {
            "type": "transitive verb",
        },
    ],
}
```

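
A crawled word entry can be checked against this layout before it is saved. The validator below is a minimal sketch: the function name `validate_word_entry` is hypothetical, and it assumes that `detailed` may be omitted (as in the `$type2` group) and that every sense is exactly three lists of strings: definitions, examples, and path.

```python
from typing import Any, List


def validate_word_entry(entry: Any) -> List[str]:
    """Return the problems found in a word entry; an empty list means it looks valid."""
    if not isinstance(entry, dict):
        return ["entry is not an object"]
    problems: List[str] = []
    if not isinstance(entry.get("word"), str) or not entry.get("word"):
        problems.append("'word' must be a non-empty string")
    explanation = entry.get("explanation")
    if not isinstance(explanation, list):
        return problems + ["'explanation' must be a list"]
    for i, group in enumerate(explanation):
        if not isinstance(group, dict) or not isinstance(group.get("type"), str):
            problems.append(f"explanation[{i}] needs a string 'type'")
            continue
        # Assumption: 'detailed' may be omitted, as in the "$type2" group.
        detailed = group.get("detailed", [])
        if not isinstance(detailed, list):
            problems.append(f"explanation[{i}].detailed must be a list")
            continue
        for j, sense in enumerate(detailed):
            ok = (
                isinstance(sense, list)
                and len(sense) == 3
                and all(isinstance(part, list) for part in sense)
                and all(isinstance(s, str) for part in sense for s in part)
            )
            if not ok:
                problems.append(
                    f"explanation[{i}].detailed[{j}] must be three lists of strings"
                )
    return problems


if __name__ == "__main__":
    entry = {
        "word": "come",
        "explanation": [
            {
                "type": "intransitive verb",
                "detailed": [
                    [["to move toward something", "approach"], ["come here."], ["Entry 1", "1", "a"]],
                ],
            },
            {"type": "transitive verb"},
        ],
    }
    print(validate_word_entry(entry))
```
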
## Phrase storage format

```js
{
    "phrase": "$phrase",
    "center word": "$center_word", // e.g. the center word of "come up" is "come"
    "explanation": // basic definitions of the phrase, as a list
    [
        {
            "type": "$type1", // part of speech, e.g. transitive verb, noun
            "detailed": [
                [["$definition1-1", "$definition1-2"], ["$example1", "$example2"], ["$path_to_ex_1", "$path_to_ex_2"]],
                [["$definition2-1", "$definition2-2"], ["$example1", "$example2"], ["$path_to_ex_1", "$path_to_ex_2"]],
            ]
        },
        {
            "type": "$type2",
        },
    ],
    "external_explation": // basic definitions obtained from a second, direct lookup, as a list
    [
        {
            "type": "$type1", // part of speech, e.g. transitive verb, noun
            "detailed": [
                [["$definition1-1", "$definition1-2"], ["$example1", "$example2"], ["$path_to_ex_1", "$path_to_ex_2"]],
                [["$definition2-1", "$definition2-2"], ["$example1", "$example2"], ["$path_to_ex_1", "$path_to_ex_2"]],
            ]
        },
        {
            "type": "$type2",
        },
    ]
}
```

For example:

```js
{
    "phrase": "come a cropper",
    "center word": "come",
    "explanation":
    [
        {
            "type": "transitive verb",
            "detailed": [
                [["to fail completely"], ["The plan came a cropper."], ["Entry 1", "2"]]
            ]
        }
    ],
    "external_explation": []
}
```

Another example:

```js
{
    "phrase": "come across",
    "center word": "come",
    "explanation":
    [
        {
            "type": "transitive verb",
            "detailed": [
                [["to meet, find, or encounter especially by chance"], ["Researchers have come across important new evidence."]],
            ]
        },
    ],
    "external_explation": [
        {
            "type": "intransitive verb",
            "detailed": [
                [["to give over or furnish something demanded"], [], ["1"]],
                [["to produce an impression"], ["comes across as a good speaker"], ["2"]],
                [["come through"], [], ["3"]]
            ]
        }
    ]
}
```

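Note: a phrase's `explanation` is the phrase expansion found inside its center word's entry. In the example above, the `explanation` of come across comes from the come across definition listed under come. The `external_explation` field, by contrast, is obtained by looking up come across directly.
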
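
The two sources can be kept apart when a phrase record is assembled. The helper below is a hedged sketch: `build_phrase_entry` and its parameter names are hypothetical, and only the output keys follow the phrase storage format.

```python
import json
from typing import Any, Dict, List, Optional


def build_phrase_entry(
    phrase: str,
    center_word: str,
    from_center_entry: List[Dict[str, Any]],
    from_direct_lookup: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Assemble a phrase record from its two sources.

    from_center_entry:  senses found under the center word's entry.
    from_direct_lookup: senses found by looking the phrase up directly,
                        or None when that lookup returns nothing.
    """
    return {
        "phrase": phrase,
        "center word": center_word,
        "explanation": list(from_center_entry),
        "external_explation": list(from_direct_lookup or []),
    }


if __name__ == "__main__":
    entry = build_phrase_entry(
        "come a cropper",
        "come",
        [{
            "type": "transitive verb",
            "detailed": [[["to fail completely"], ["The plan came a cropper."], ["Entry 1", "2"]]],
        }],
    )
    print(json.dumps(entry, indent=4))
```
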
## Additional requirements (Bonus)

### Verb inflections (1 point)

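This mainly covers the past tense, past participle, present participle, and third-person singular.

![image-3](uploads/b84f12e7a87783e93f3ed143344cd6a9/image-3.png)

![image-4](uploads/6b459e98833a1da1d933e09c0d313e7a/image-4.png)

As the screenshots show, the site does not present verb inflections in a consistent format. Store the inflections in the following format:
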
```js
{
    "verb": "$verb",
    "past": "$past_tense", // past is short for past tense
    "pp": "$past_participle", // pp is short for past participle
    "present": "$present_participle", // present is short for present participle
    "ts": "$third_person_singular", // ts is short for present tense third-person singular
}
```

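
Because the number of forms listed varies from verb to verb, mapping them by position alone is fragile. The sketch below classifies forms by shape instead. It rests on several assumptions that should be checked against real pages: the forms have already been scraped into a list, the present participle ends in `ing`, the third-person singular is the base form plus `s`, `es`, or `ies`, and a missing past participle equals the past tense. The input lists are hypothetical, and a form that is not found stays `None` rather than being guessed.

```python
from typing import Dict, List, Optional, Set


def third_person_candidates(verb: str) -> Set[str]:
    """Naive spellings of the third-person singular (an assumption, not a rule set)."""
    candidates = {verb + "s", verb + "es"}
    if verb.endswith("y"):
        candidates.add(verb[:-1] + "ies")
    return candidates


def normalize_inflections(verb: str, forms: List[str]) -> Dict[str, Optional[str]]:
    """Map a variable-length list of scraped forms onto the storage format."""
    record: Dict[str, Optional[str]] = {
        "verb": verb, "past": None, "pp": None, "present": None, "ts": None,
    }
    rest: List[str] = []
    ts_options = third_person_candidates(verb)
    for raw in forms:
        form = raw.strip()
        if not form:
            continue
        if form.endswith("ing") and record["present"] is None:
            record["present"] = form
        elif form in ts_options and record["ts"] is None:
            record["ts"] = form
        else:
            rest.append(form)
    if rest:
        record["past"] = rest[0]
        # Assumption: when only one remaining form is listed, past == pp.
        record["pp"] = rest[1] if len(rest) > 1 else rest[0]
    return record


if __name__ == "__main__":
    print(normalize_inflections("come", ["came", "come", "coming"]))
    print(normalize_inflections("walk", ["walked", "walking", "walks"]))
```
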
## Grading criteria

- Crawl at least 30,000 words (1 point)
- Crawl the phrases related to each word (1 point)
- Store all verb inflections for the crawled entries (1 point)