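Paper background
Title: SPECTER: Document-level Representation Learning using Citation-informed Transformers
Abstract: Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, the embeddings power strong performance on end tasks. We propose SPECTER, a new method to generate document-level embeddings of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
Authors: 【ArmanAI】Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, Daniel S. Weld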
CCF: A
Venue: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)
Model details
Installing Transformers
Transformers is tested on Python 3.6+, PyTorch 1.1.0+, TensorFlow 2.0+, and Flax.
Installing from the GitHub repository
1. Download
## git clone git@github.com:allenai/specter.git
git clone https://github.com/allenai/specter.git
cd specter
## Downloading archive.tar.gz through a browser may be faster than wget
wget https://ai2-s2-research-public.s3-us-west-2.amazonaws.com/specter/archive.tar.gz
tar -xzvf archive.tar.gz
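2. Set up the environment
## Install conda first. Do not run the Anaconda.sh installer as root, or it installs to /root by default. Choosing another path during installation lets a regular user use it, but running the code may still fail, so to be safe install it under the regular user account.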
conda create --name specter python=3.7 setuptools
conda activate specter
# if you don't have gpus, remove cudatoolkit argument
#conda install pytorch cudatoolkit=10.1 -c pytorch
conda install pytorch cpuonly -c pytorch
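## On a poor network, it is better to split the requirements.txt install into separate commands, such as the one below, and run them one at a time, so that a failure in one does not force re-running what already succeeded
## pip install dill jsonlines pandas scikit-learn
## If the https URL fails to download, try replacing https with git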
pip install -r requirements.txt
python setup.py install
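3. Fix the package dependency bug
Running the code raises an error because a package version is wrong. The installed versions were:
allennlp 0.9.0
overrides 7.3.1
The overrides version is too new for allennlp 0.9.0; it should be downgraded to 3.1.0.
Error message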
(specter) user@ubuntu:~/model/model2/specter$ python scripts/embed.py \
> --ids data/sample.ids --metadata data/sample-metadata.json \
> --model ./model.tar.gz \
> --output-file output.jsonl \
> --vocab-dir data/vocab/ \
> --batch-size 16 \
> --cuda-device -1
Traceback (most recent call last):
File "specter/predict_command.py", line 14, in <module>
from allennlp.commands import ArgumentParserWithDefaults
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 8, in <module>
from allennlp.commands.configure import Configure
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/commands/configure.py", line 26, in <module>
from allennlp.service.config_explorer import make_app
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/service/config_explorer.py", line 24, in <module>
from allennlp.common.configuration import configure, choices
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/common/configuration.py", line 17, in <module>
from allennlp.data.dataset_readers import DatasetReader
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/data/__init__.py", line 1, in <module>
from allennlp.data.dataset_readers.dataset_reader import DatasetReader
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/data/dataset_readers/__init__.py", line 10, in <module>
from allennlp.data.dataset_readers.ccgbank import CcgBankDatasetReader
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/data/dataset_readers/ccgbank.py", line 9, in <module>
from allennlp.data.dataset_readers.dataset_reader import DatasetReader
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 8, in <module>
from allennlp.data.instance import Instance
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/data/instance.py", line 3, in <module>
from allennlp.data.fields.field import DataArray, Field
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/data/fields/__init__.py", line 7, in <module>
from allennlp.data.fields.array_field import ArrayField
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/data/fields/array_field.py", line 10, in <module>
class ArrayField(Field[numpy.ndarray]):
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/allennlp/data/fields/array_field.py", line 49, in ArrayField
@overrides
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/overrides/overrides.py", line 83, in overrides
return _overrides(method, check_signature, check_at_runtime)
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/overrides/overrides.py", line 170, in _overrides
_validate_method(method, super_class, check_signature)
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/overrides/overrides.py", line 189, in _validate_method
ensure_signature_is_compatible(super_method, method, is_static)
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/overrides/signature.py", line 102, in ensure_signature_is_compatible
ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
File "/home/user/anaconda3/envs/specter/lib/python3.7/site-packages/overrides/signature.py", line 303, in ensure_return_type_compatibility
f"{method_name}: return type `{sub_return}` is not a `{super_return}`."
TypeError: ArrayField.empty_field: return type `None` is not a `<class 'allennlp.data.fields.field.Field'>`.
Fix
pip install overrides==3.1.0
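After changing package versions, it may be worth confirming what is actually installed in the active environment. The following is only a minimal sketch, not part of SPECTER: `installed_version`, `version_tuple`, and `is_compatible` are hypothetical helper names, and the rule that overrides should be no newer than 3.1.0 comes from this post's own experience with allennlp 0.9.0, not from an official compatibility table.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Python 3.7 (as in the conda environment here) needs the backport package.
    import importlib_metadata as metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '3.1.0' into (3, 1, 0); non-numeric parts are ignored."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def is_compatible(overrides_version, newest_known_good="3.1.0"):
    """True if overrides is no newer than the version this post found to work."""
    return version_tuple(overrides_version) <= version_tuple(newest_known_good)


if __name__ == "__main__":
    for name in ("allennlp", "overrides"):
        print(name, installed_version(name))
    found = installed_version("overrides")
    if found is not None and not is_compatible(found):
        print("overrides looks too new; try: pip install overrides==3.1.0")
```

Run it inside the activated specter environment; with the versions listed in this post, 7.3.1 is reported as too new and 3.1.0 passes.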
4. Run the code
Run the scripts/embed.py command from step 3 again; the model now works normally.
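For repeated runs, the same command can be assembled in Python so that the arguments live in one place. This is a hedged sketch: the flags and default paths are copied from the command in this post, while `build_embed_command` and `missing_inputs` are hypothetical helper names, and the list of files checked before launching is an assumption about what the script needs.

```python
import subprocess
import sys
from pathlib import Path


def build_embed_command(ids="data/sample.ids",
                        metadata="data/sample-metadata.json",
                        model="./model.tar.gz",
                        output_file="output.jsonl",
                        vocab_dir="data/vocab/",
                        batch_size=16,
                        cuda_device=-1):
    """Assemble the scripts/embed.py call as an argument list (no shell quoting)."""
    return [
        sys.executable, "scripts/embed.py",
        "--ids", ids,
        "--metadata", metadata,
        "--model", model,
        "--output-file", output_file,
        "--vocab-dir", vocab_dir,
        "--batch-size", str(batch_size),
        "--cuda-device", str(cuda_device),  # -1 means CPU in the command above
    ]


def missing_inputs(paths, root="."):
    """Return the input paths that do not exist under root."""
    return [p for p in paths if not (Path(root) / p).exists()]


if __name__ == "__main__":
    # Assumed inputs: the script itself plus the paths passed on the command line.
    inputs = ["scripts/embed.py", "data/sample.ids",
              "data/sample-metadata.json", "model.tar.gz", "data/vocab"]
    missing = missing_inputs(inputs)
    if missing:
        print("Missing inputs; run this from the specter directory:", missing)
    else:
        subprocess.run(build_embed_command(), check=True)
```

Passing the arguments as a list avoids the line-continuation issues of a long shell command, and the existence check fails early when the script is started from the wrong directory.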
The script's output has roughly the following format:
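One way to see that format on your own machine is to load the output file and print the keys of the first record. The sketch below assumes only that `output.jsonl` is a JSON Lines file (one JSON object per line), as the extension suggests. The field name `embedding` is an assumption to check against your own file, and `load_records` and `summarize` are hypothetical helper names.

```python
import json


def load_records(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(record, vector_key="embedding"):
    """Describe one record: its keys and, if present, the vector length."""
    vector = record.get(vector_key)
    length = len(vector) if isinstance(vector, list) else None
    return {"keys": sorted(record), "vector_length": length}


if __name__ == "__main__":
    import os
    if os.path.exists("output.jsonl"):
        records = load_records("output.jsonl")
        print(len(records), "records")
        if records:
            print(summarize(records[0]))
```

If `vector_length` comes back as `None`, the vector is stored under a different key; the printed `keys` list shows which names are actually present.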
Package versions in the working virtual environment (archived for reference)
(specter) user@ubuntu:~/model/model2/specter$ pip list
Package Version
----------------------------- ---------
alabaster 0.7.13
allennlp 0.9.0
attrs 23.1.0
Babel 2.12.1
blis 0.2.4
boto3 1.26.158
botocore 1.29.158
certifi 2022.12.7
charset-normalizer 3.1.0
click 8.1.3
conllu 1.3.1
cycler 0.11.0
cymem 2.0.7
dill 0.3.6
docutils 0.19
editdistance 0.6.2
exceptiongroup 1.1.1
flaky 3.7.0
Flask 2.2.5
Flask-Cors 3.0.10
fonttools 4.38.0
ftfy 6.1.1
gevent 22.10.2
greenlet 2.0.2
h5py 3.8.0
idna 3.4
imagesize 1.4.1
importlib-metadata 6.7.0
iniconfig 2.0.0
itsdangerous 2.1.2
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.2.0
jsonlines 3.1.0
jsonnet 0.20.0
jsonpickle 3.0.1
kiwisolver 1.4.4
MarkupSafe 2.1.3
matplotlib 3.5.3
murmurhash 1.0.9
nltk 3.8.1
numpy 1.21.6
numpydoc 1.5.0
overrides 3.1.0
packaging 23.1
pandas 1.3.5
parsimonious 0.10.0
Pillow 9.5.0
pip 22.3.1
plac 0.9.6
pluggy 1.2.0
preshed 2.0.1
protobuf 4.23.3
Pygments 2.15.1
pyparsing 3.1.0
pytest 7.3.2
python-dateutil 2.8.2
pytorch-pretrained-bert 0.6.2
pytorch-transformers 1.1.0
pytz 2023.3
PyYAML 6.0
regex 2023.6.3
requests 2.31.0
responses 0.23.1
s3transfer 0.6.1
scikit-learn 1.0.2
scipy 1.7.3
sentencepiece 0.1.99
setuptools 65.6.3
six 1.16.0
snowballstemmer 2.2.0
spacy 2.1.9
specter 0.0.1
Sphinx 5.3.0
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
sqlparse 0.4.4
srsly 1.0.6
tensorboardX 2.6.1
thinc 7.0.8
threadpoolctl 3.1.0
tomli 2.0.1
torch 1.13.1
tqdm 4.65.0
types-PyYAML 6.0.12.10
typing_extensions 4.1.1
Unidecode 1.3.6
urllib3 1.26.16
wasabi 0.10.1
wcwidth 0.2.6
Werkzeug 2.2.3
wheel 0.38.4
word2number 1.1
zipp 3.15.0
zope.event 4.6
zope.interface 6.0
Source: cnblogs (博客园)