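2023 China Collegiate Computing Contest, Big Data Challenge: Paper Subject Classification (hosted by Tsinghua University)

Official site: www.heywhale.com/home/compet…

The project source code is linked at the end of this article.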

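1. Competition overview

  • Background. Since late 2022, large language models have been widely applied across industries, and a number of influential applications have grown up around academic tooling, such as ChatPDF. On 14 March 2023, Zhipu AI and Tsinghua University jointly released the open-source ChatGLM-6B model, which attracted more than 1 million downloads in less than a month. The model ranked first on the Hugging Face (HF) global model download chart for 12 consecutive days and had a considerable impact on open-source communities in China and abroad.

To make the most of the open-source ChatGLM-6B model in advancing research tools, we partnered with AMiner, China's most influential academic platform, to launch the "ChatGLM Practice Competition: Academic Applications". The central theme is how to use ChatGLM-6B to improve academic tools. We hope the competition gives enthusiasts who want to work on large-model research and development a platform for hands-on practice. The competition offers 3 scenarios and 7 tracks: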

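  • Scenario 1: Paper reading

    • Track 1: Paper subject classification (Easy). Classify each paper into 40 disciplines based on its title and abstract; a paper may be single-discipline or interdisciplinary. Target accuracy is above 90%.

    • Track 2: Q&A research knowledge base (Medium). Upload PDF papers to build a vectorized research knowledge base and support free-form Q&A over it; answers should be reasonably professional and cite the relevant files.

    • Track 3: Paper review and comparative analysis (Medium). Given the titles, abstracts, or full texts of several papers, summarize or compare their background, problem, method, experiments, and conclusions.

  • Scenario 2: Submission and peer review

    • Track 4: Journal and conference recommendation (Medium). Recommend the Top K suitable journals and conferences from the title and abstract, and give a reason for each recommendation based on how well it matches.

    • Track 5: Review response (Medium). Fine-tune a review-response model on Openreview data.

  • Scenario 3: Paper discovery

    • Track 6: Paper retrieval (Hard). Single and mixed retrieval by a given concept, a given question, a given entity, and so on.

    • Track 7: Paper recommendation and technology briefing generation (Hard). Based on a user profile (subscribed keywords plus search and reading behavior), select one or more relevant papers from each day's new papers, then fine-tune a large model to generate a technology briefing from the paper information (title, authors, abstract, and optionally other information). The form and depth of the briefing are up to the participant.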


  • Organization

Host: Zhipu AI

Co-organizer: Heywhale (和鲸科技)

Data provider: AMiner technical team

Organizational support: Huggingface

Compute support: 揽睿星舟, AWS

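2. Paper subject classification track: task overview

  • Task description

Classify each paper into 40 disciplines based on its title and abstract; a paper may be single-discipline or interdisciplinary. Target accuracy is above 90%.

  • Data description

Dataset: titles and abstracts of 500 papers for each of the 40 disciplines, plus about 1,000 interdisciplinary papers.

Test set: 500 papers, scored objectively; the evaluation metric is accuracy (Acc).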
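The task statement names the metric (Acc) but does not spell out how interdisciplinary papers are scored. A minimal sketch, assuming a prediction counts as correct only when its label set exactly matches the reference set; the function name `exact_match_accuracy` is hypothetical and the official scorer may differ:

```python
def exact_match_accuracy(gold, pred):
    """Fraction of samples whose predicted label set equals the gold label set.

    Assumption: exact set match. The official scorer may use another rule.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for g, p in zip(gold, pred) if set(g) == set(p))
    return hits / len(gold)


# Toy data: label order does not matter, but extra or missing labels do.
gold = [["社会学"], ["社会学", "石油工程"], ["化学"], ["光学"]]
pred = [["社会学"], ["石油工程", "社会学"], ["化学", "物理学"], ["物理学"]]
acc = exact_match_accuracy(gold, pred)
print(acc)
```

Under this rule a partially correct interdisciplinary prediction earns no credit, which is worth keeping in mind when choosing a prediction threshold.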

Sample of the raw dataset:

{"id":155,"title":"Modeling heterogeneous network user route and departure time responses to dynamic pricing","abstract":"The ability to realistically capture trip-makersresponses to time-varying road charges is essential for network equilibrium assignment models typically applied to predict network flows in the presence of dynamic road (congestion) pricing. User responses to pricing are governed by individual trip-makerspreferences, such as their value of time (VOT), and the cost they attach to late vs. early arrival relative to the destination. These behavioral characteristics vary across users. This paper presents a joint route and departure time network equilibrium assignment model explicitly considering heterogeneous users with different preferred arrival times at destinations, VOT, and values of early and late schedule delays (VOESD and VOLSD). The model is formulated as an infinite-dimensional variational inequality and solved by a column generation-based algorithmic framework that embeds: (i) an extreme non-dominated alternative-generating algorithm to obtain combinations of VOT, VOESD, and VOLSD subintervals (or breakpoints) that define multiple user classes, and the corresponding least trip cost alternative (joint departure time and path) for each user class, (ii) a traffic simulator to capture traffic flow dynamics and determine experienced travel costs; and (iii) a multi-class alternative flow updating scheme to solve the reduced multi-class simultaneous route and departure time user equilibrium problem defined by a subset of feasible alternatives. Application to an actual network illustrates the properties of the algorithm, and underscores the importance of capturing user heterogeneity and temporal shifts in the appraisal of dynamic pricing schemes.","subject_name":["交通运输工程"]}
{"id":156,"title":"Duration-dependent effect of transient neonatal hypothyroidism on sertoli and germ cell number, and plasma and testicular interstitial fluid androgen binding protein concentration.","abstract":"The impact of transient neonatal hypothyroidism on growth and function of puberal testis during different milestones of postnatal testicular development was studied in Wister rats. Rat pups were made hypothyroid for 10, 15, 30, 40 and 60 days of postnatal age from birth by providing 0.05% (W\/V) methimazole (MMI) in the drinking water of the mother, from day 1 postpartum till weaning (25 days postpartum) and thereafter in the drinking water. Control rats were raised without MMI treatment. Sertoli cell number and its function was assessed on day 60 postpartum. Sertoli cell number increased consistently in 10, 15, 30 and 40 days transient hypothyroid rats but decreased in rats subjected to continuous hypothyroidism from birth to 60 days postpartum. Rats subjected to continuous hypothyroidism from birth showed spermatogenic arrest at puberty and had only a single layer of spermatogonia. Transient neonatal hypothyroidism for 10 (or) 15 days from birth increased spermatocytes (pachytene and zygotene), spermatids (elongated and round) whereas, that of 30 and 40 days decreases the number of germ cells. Plasma androgen binding protein (ABP) concentration decreased in puberal rats belonging to all groups, whereas the testicular interstitial fluid (TIF) concentration of ABP increased significantly in 10 and 15 days hypothyroid rats while it decreased in all other groups. These findings indicate that the mitogenic activity of Sertoli cell is increased irrespective of the duration of transient neonatal hypothyroidism. However, the functional activity of Sertoli cells (ABP production) in these puberal rats varies depending upon the postnatal period at which the animals were in hypothyroid state.","subject_name":["临床医学"]}
  • train.json format

{"id":0,"title":"title0","abstract":"abstract0","subject_name":["社会学"]}
{"id":1,"title":"title1","abstract":"abstract1","subject_name":["社会学","石油工程"]}
  • test.json format (the official file has not been released, so I made up a few simple records for testing)
{"id":0,"title":"title0","abstract":"abstract0"}
{"id":1,"title":"title1","abstract":"abstract1"}
{"id":0,"title":"Oxidative coupling of methane in the redox cyclic mode over the catalysts on the basis of CeO2 and La2O3","abstract":"The 1% CeO 2 , 9% La 2 O 3 \/SiO 2 and 2% CeO 2 , 8% La 2 O 3 \/SiO 2 catalysts show reliable efficiency in the OCM reaction, as well as stable work in the redox cyclic mode. Selectivity to C 2 products remarkably increases if preliminary reduction of the catalyst by a small amount of hydrogen is used."}
{"id":1,"title":"Tissue engineering: strategies, stem cells and scaffolds.","abstract":"Tissue engineering scaffolds are designed to influence the physical, chemical and biological environment surrounding a cell population. In this review we focus on our own work and introduce a range of strategies and materials used for tissue engineering, including the sources of cells suitable for tissue engineering: embryonic stem cells, bone marrow-derived mesenchymal stem cells and cord-derived mesenchymal stem cells. Furthermore, we emphasize the developments in custom scaffold design and manufacture, highlighting laser sintering, supercritical carbon dioxide processing, growth factor incorporation and zoning, plasma modification of scaffold surfaces, and novel multi-use temperature-sensitive injectable materials."}
{"id":2,"title":"Enhancement of Forced Convection Subcooled Film Boiling Heat Transfer Using Gas Sheet Collapse by Electric Field Application","abstract":"Enhancement of forced-convection boiling heat transfer by electriceld is investigated experimentally. When a high-temperature horizontallament is immersed in water, a gas sheet is formed around and the abovelament due to liquid boiling, in the early immersion process. This gas-sheet markedly decreases the boiling cooling rate of thelament. Here, forced collapse of the gas sheet is attempted by imposing an electriceld to enhance the boiling cooling rate, In the experiments, a horizontal platinum wire of 0.5mm in diameter is immersed in pure water under atmospheric pressure, and a DC voltage up to 600V is applied between the wire surface and an electrode made of glass placed 10mm apart. The whole boiling curve is measured under different applied voltages and wire-falling velocities in 0.5 to 2.0m\/s range, and at subcooling of 60 K. The experimental results show that the electric field is effective in promoting the disintegration of the gas sheet. Under the tested conditions, boiling cooling rate increased two-fold for an applied electriceld of 600 V\/cm. This result shows that the use of an electriceld to break up the gas-sheet has resulted in a remarkable increase in the cooling rate at high superheats during initial cooling period, which is even greater than that used in the existing material manufacturing processes by the rapid cooling method, and therefore, this method may contribute to developing new materials."}
{"id":195,"title":"Speciation of some heavy metals in bottom sediments of the Ob and Yenisei estuarine zones","abstract":"The speciation of Fe, Mn, Zn, Cu, Co, Ni, Cr, Pb, and Cd was studied in 52 samples of bottom sediments collected during Cruise 49 of the R\/V Dmitrii Mendeleev in estuaries of the Ob and Yenisei rivers in the southwestern Kara Sea. Immediately after sampling, the samples were subjected to on-board consecutive extraction to separate metal species according to their modes of occurrence in the sediments: (1) adsorbed, (2) amorphous Fe-Mn hydroxides and related metals, (3) organic + sulfide, and (4) residual, or lithogenic. The atomic absorption spectroscopy of the extracts was carried out at a stationary laboratory. The distribution of Fe, Zn, Cu, Co, Ni, Cr, Pb, and Cd species is characterized by the predominance of lithogenic or geochemically inert modes (70–95% of the bulk content), in which the metals are bound in terrigenous and clastic mineral particles and organic detritus. About half of the total Mn amount and 15–30% Zn and Cu is contained in geochemically mobile modes. The spatiotemporal variations in the proportions of metal species in the surface layer of sediments along the nearly meridional sections and through the vertical sections of bottom sediments cores testify that Mn and, to a lesser extent, Cu are the most sensitive to changes in the sedimentation environment. The role of their geochemically mobile species notably increases under reducing conditions."}
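These files are read one JSON object per line (JSON Lines). A minimal sketch of parsing them and counting single-discipline versus interdisciplinary records; `read_jsonl` is a hypothetical helper, and an in-memory string stands in for train.json:

```python
import io
import json


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for train.json, using the placeholder records shown above.
sample = io.StringIO(
    '{"id":0,"title":"title0","abstract":"abstract0","subject_name":["社会学"]}\n'
    '{"id":1,"title":"title1","abstract":"abstract1","subject_name":["社会学","石油工程"]}\n'
)
records = list(read_jsonl(sample))
single = sum(1 for r in records if len(r["subject_name"]) == 1)
cross = len(records) - single
print(single, cross)
```

With a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8")`.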

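3. Data conversion

Process the official data into the input format the model requires. The converted result is given directly here for reference.

The 40 categories are: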

{'材料科学与工程', '临床医学', '电气工程', '数学', '化学', '地质工程', '地理学', '食品科学与工程', '医学', '生物学', '核科学与技能', '地球物理学', '水产', '药学', '交通运输工程', '体育学', '生物医学工程', '护理', '物理学', '心理学', '社会学', '神经科学', '计算机科学', '建筑学', '环境科学与工程', '机械工程', '航空航天工程', '石油工程', '免疫与微生物学', '矿业', '通讯与信息科学', '光学', '历史学', '地质学', '教育学', '海洋工程', '公共管理学', '仪器科学与技能', '经济学', '音乐'}

The data is split into training and test (dev) sets at a ratio of 0.8 : 0.2.

{"id": 11810, "text_a": "Restoration of the shear capacity for RC beams with web openings using precast SHCC plates Providing web opening in the shear-span zone of RC beams results in significant reduction in the shear capacity of such beams. Thus, an efficient restoration technique has to be found out and implemented in order to compensate the developed reduction. The main target of the current paper is to introduce and validate an innovative restoration technique for the new construction making use of the Strain-Hardening Cementitious Composites (SHCC) material. Accordingly, precast thin SHCC plates having the required opening were cast and cured for about 3 weeks to eliminate the volumetric change issues, and then placed inside the formwork at both sides before casting the RC beams included web openings. The chosen thickness of the SHCC plates was 20 mm in order to be easily accommodated in the concrete cover. For the considered openings, the opening depth was kept constant to be 0.30 of the beam effective depth, while the opening length was varied considering three values; 150, 300, and 450 mm. Besides, small amount of internal reinforcement in the form of steel wire mesh was provided inside some SHCC plates in order to enhance their shear strength and ductility. Experimental results showed that the provided SHCC layers enabled the strengthened beams to exhibit distinguished performance in terms of ultimate capacity, ductility and decreased shear crack width. In addition, the gain in shear capacity due to the SHCC plates is decreased with the increase of the opening width. Finally, comparisons between the obtained experimental results and the predicted shear capacities stipulated by the ACI 318-19 and JSCE codes were performed. The comparisons revealed that the estimated shear capacities are in satisfactory agreement with the experimental results, however, these estimations tend to be overestimated with the increase of the opening length.", "choices": ["交通运输工程", "体育学", "机械工程", "水产", "建筑学", "公共管理学", "医学", "地质学", "地球物理学", "生物学", "临床医学", "数学", "物理学", "化学", "石油工程", "历史学", "地质工程", "音乐", "核科学与技能", "护理", "经济学", "航空航天工程", "海洋工程", "社会学", "药学", "心理学", "矿业", "材料科学与工程", "电气工程", "教育学", "神经科学", "地理学", "光学", "环境科学与工程", "计算机科学", "生物医学工程", "通讯与信息科学", "免疫与微生物学", "食品科学与工程", "仪器科学与技能"], "labels": [4]}
{"id": 5984, "text_a": "Caractristiques et valuation des symptmes de la rhinite allergique : Rsultats de l’enqute CESAR Des recommandations sont publies depuis plusieurs annes pour la prise en charge de la rhinite allergique. Avec le temps, les concepts visant  dfinir les entits chroniques et celles qui se manifestent sur de plus courtes priodes ont volu. Nous sommes ainsi passs du couple  perannuelle/saisonnire   celui de  persistante/intermittente . La svrit des symptmes et leur rpercussion sur la vie quotidienne des patients sont prises aussi en compte dans ces nouvelles recommandations. L’enqute observationnelle CESAR  Caractristiques et Evaluation des Symptmes de la rhinite AlleRgique  vise  valuer le paysage de la rhinite allergique en France sur ces nouveaux critres ainsi qu’ mieux apprhender les modalits de prise en charge des patients en mdecine gnrale.", "choices": ["交通运输工程", "体育学", "机械工程", "水产", "建筑学", "公共管理学", "医学", "地质学", "地球物理学", "生物学", "临床医学", "数学", "物理学", "化学", "石油工程", "历史学", "地质工程", "音乐", "核科学与技能", "护理", "经济学", "航空航天工程", "海洋工程", "社会学", "药学", "心理学", "矿业", "材料科学与工程", "电气工程", "教育学", "神经科学", "地理学", "光学", "环境科学与工程", "计算机科学", "生物医学工程", "通讯与信息科学", "免疫与微生物学", "食品科学与工程", "仪器科学与技能"], "labels": [10, 6]}
{"id": 4707, "text_a": "Dynamic analysis of OWT foundation with large diameter monopile under transient storm loading To investigate the stability of monopile foundation under dynamic loading, a comprehensive numerical model for the analysis of offshore wind turbines (OWT) foundation under a general transient storm loading is presented in this study. The dynamic stiffness and soil deformation around the large-diameter monopile is simulated using this method. During the numerical analysis, a dynamic boundary surface model of soil is derived instead of the empirical strength degradation. Along the axis direction of the monopile, an intensive study about deformation law of the seabed soil is analysed, moreover, some parameters which may affect the OWT stability and the dynamic stiffness are discussed. Some conclusions can be drawn that the dynamic stiffness and the lateral displacement of the monopile foundation can be obviously improved by increasing the buried depth than the diameter, and the proposed failure mode can well describe the failure law of soil around the monopile due to the dynamic loading.", "choices": ["交通运输工程", "体育学", "机械工程", "水产", "建筑学", "公共管理学", "医学", "地质学", "地球物理学", "生物学", "临床医学", "数学", "物理学", "化学", "石油工程", "历史学", "地质工程", "音乐", "核科学与技能", "护理", "经济学", "航空航天工程", "海洋工程", "社会学", "药学", "心理学", "矿业", "材料科学与工程", "电气工程", "教育学", "神经科学", "地理学", "光学", "环境科学与工程", "计算机科学", "生物医学工程", "通讯与信息科学", "免疫与微生物学", "食品科学与工程", "仪器科学与技能"], "labels": [22]}
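The conversion itself is not shown in the article. The sketch below is one way it could be done, inferred only from the input and output records above: the title and abstract are joined with a space into `text_a`, and `labels` holds the positions of the subjects within `choices`. The helper names, the shuffle, and the seed are assumptions, not the author's code.

```python
import json
import random


def to_utc_example(record, choices):
    """Convert one official record into the text_a / choices / labels layout shown above."""
    index = {name: i for i, name in enumerate(choices)}
    return {
        "id": record["id"],
        "text_a": record["title"] + " " + record["abstract"],
        "choices": choices,
        "labels": [index[name] for name in record["subject_name"]],
    }


def split_examples(examples, train_ratio=0.8, seed=1000):
    """Shuffle deterministically and split into (train, dev)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def write_jsonl(path, examples):
    """Write one JSON object per line, keeping Chinese labels unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


# Tiny illustrative input; a real run would use all 40 subjects and every training record.
choices = ["社会学", "石油工程", "化学"]
records = [
    {"id": i, "title": f"title{i}", "abstract": f"abstract{i}", "subject_name": ["社会学"]}
    for i in range(9)
]
records.append({"id": 9, "title": "title9", "abstract": "abstract9",
                "subject_name": ["社会学", "石油工程"]})

examples = [to_utc_example(r, choices) for r in records]
train, dev = split_examples(examples)
print(len(train), len(dev))
# To produce the training files: write_jsonl("train.txt", train); write_jsonl("dev.txt", dev)
```

A subject name missing from `choices` raises `KeyError`, which surfaces label typos early instead of silently dropping them.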

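4. Model training and prediction

For multi-task training, convert the data for each task separately and then mix it. Many "generalized classification" tasks fit this setup: general classification, review sentiment analysis, semantic similarity, textual entailment, multiple-choice reading comprehension, and so on.

## Code structure
├── deploy/simple_serving/ # model deployment scripts
├── utils.py               # data processing utilities
├── run_train.py           # fine-tuning script
├── run_eval.py            # evaluation script
├── label_studio.py        # data format conversion script
├── label_studio_text.md   # data annotation guide
└── README.md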

4.1 Model fine-tuning

# Install the latest version of paddlenlp
!pip install --upgrade paddlenlp
# Copy the dataset into place
!cp /home/aistudio/input/train.txt /home/aistudio/data
!cp /home/aistudio/input/dev.txt /home/aistudio/data
# Single-GPU launch:
!python run_train.py  \
    --device gpu \
    --logging_steps 100 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 1000 \
    --model_name_or_path utc-base \
    --output_dir ./checkpoint/model_best \
    --dataset_path ./data/ \
    --max_seq_length 512  \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model macro_f1 \
    --load_best_model_at_end  True \
    --save_total_limit 1 \
    --save_plm

Because this example sets --do_eval, evaluation runs automatically once training finishes.
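The command above combines per_device_train_batch_size 32 with gradient_accumulation_steps 8 on a single GPU. A small worked example of the resulting effective batch size, assuming gradient accumulation behaves in the usual way (gradients from several batches are accumulated before one optimizer update). The sample count of 10,000 is made up for illustration, and both function names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples that contribute to a single optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


def updates_per_epoch(num_samples, per_device_batch, accumulation_steps, num_devices=1):
    """Optimizer updates per epoch, assuming the last partial batch is rounded up."""
    batch = effective_batch_size(per_device_batch, accumulation_steps, num_devices)
    return math.ceil(num_samples / batch)


batch = effective_batch_size(32, 8)           # values taken from the training command
updates = updates_per_epoch(10_000, 32, 8)    # 10_000 is a hypothetical sample count
print(batch, updates)
```

If GPU memory is tight, halving the per-device batch size and doubling the accumulation steps keeps the effective batch size unchanged.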

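Configurable parameters:

  • single_label: whether each sample is assigned exactly one label. Defaults to False, meaning multi-label classification.
  • device: training device, either 'cpu' or 'gpu'; defaults to GPU.
  • logging_steps: number of steps between log prints during training; default 10.
  • save_steps: number of steps between checkpoint saves during training; default 100.
  • eval_steps: number of steps between evaluations during training; default 100.
  • seed: global random seed; default 42.
  • model_name_or_path: pretrained model used for few-shot training. Defaults to "utc-base"; options are "utc-xbase", "utc-base", "utc-medium", "utc-mini", "utc-micro", "utc-nano", "utc-pico".
  • output_dir: required; directory where the model is saved after training or compression; default None.
  • dataset_path: directory containing the dataset files; default ./data/.
  • train_file: training set file suffix; default train.txt.
  • dev_file: dev set file suffix; default dev.txt.
  • max_seq_len: maximum text length. When the input, including the labels, exceeds this length, the input text is split automatically; the label portion cannot be split. Default 512. Note that the training command above spells this flag --max_seq_length.
  • per_device_train_batch_size: training batch size per GPU core/CPU; default 8.
  • per_device_eval_batch_size: evaluation batch size per GPU core/CPU; default 8.
  • num_train_epochs: number of training epochs; 100 is a reasonable choice when using early stopping; default 10.
  • learning_rate: peak learning rate; 1e-5 is recommended for UTC; default 3e-5.
  • do_train: whether to fine-tune; pass the flag to enable it; off by default.
  • do_eval: whether to evaluate; pass the flag to enable it; off by default.
  • do_export: whether to export a static graph; pass the flag to enable it; off by default.
  • export_model_dir: static graph export directory; default None.
  • overwrite_output_dir: if True, overwrite the contents of the output directory. If output_dir points to a checkpoint directory, use this to continue training.
  • disable_tqdm: whether to disable the tqdm progress bar.
  • metric_for_best_model: metric used to select the best model; macro_f1 is recommended for UTC; default None.
  • load_best_model_at_end: whether to load the best model when training ends; usually used together with metric_for_best_model; default False.
  • save_total_limit: if set, limits the total number of checkpoints by deleting older checkpoints from the output directory; default None.
  • save_plm: save the model for inference and deployment.

NOTE:

To resume training, set init_from_ckpt, for example init_from_ckpt=checkpoint/model_state.pdparams.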

4.2 Model evaluation

Run the following command to evaluate the model:

# Evaluate on the dev set
!python run_eval.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/dev.txt \
    --per_device_eval_batch_size 32 \
    --max_seq_len 512 \
    --output_dir ./checkpoint_test
 99%|██████████████████████████████████████████▌| 98/99 [00:31<00:00,  3.30it/s][2023-06-15 22:19:30,758] [    INFO] - ***** test metrics *****
[2023-06-15 22:19:30,758] [    INFO] -   test_loss               =     1.8884
[2023-06-15 22:19:30,758] [    INFO] -   test_macro_f1           =     0.8427
[2023-06-15 22:19:30,758] [    INFO] -   test_micro_f1           =     0.9849
[2023-06-15 22:19:30,759] [    INFO] -   test_runtime            = 0:00:34.16
[2023-06-15 22:19:30,759] [    INFO] -   test_samples_per_second =     92.189
[2023-06-15 22:19:30,759] [    INFO] -   test_steps_per_second   =      2.897
100%|███████████████████████████████████████████| 99/99 [00:33<00:00,  2.94it/s]
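The log reports a macro F1 of 0.8427 against a micro F1 of 0.9849. The sketch below shows how the two averages are commonly defined for multi-label output and why they can diverge. It is an illustration on toy data, not the code that run_eval.py runs, and the function names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred, labels):
    """Return (macro_f1, micro_f1) for multi-label predictions."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)
    return macro, micro


labels = ["化学", "物理学", "音乐"]
gold = [["化学"], ["化学"], ["化学"], ["物理学"], ["音乐"]]
pred = [["化学"], ["化学"], ["化学"], ["物理学"], ["化学"]]
macro, micro = macro_micro_f1(gold, pred, labels)
print(round(macro, 4), round(micro, 4))
```

In the toy data a single missed rare label pulls the macro score well below the micro score, the same direction as the gap in the log. That sensitivity to weak labels is a plausible reason for selecting the best checkpoint by macro_f1.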

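Configurable parameters:

  • model_path: path to the model folder to evaluate; it must contain the weights file model_state.pdparams and the config file model_config.json.
  • test_path: test set file to evaluate on.
  • per_device_eval_batch_size: batch size; adjust it to your hardware; default 16.
  • max_seq_len: maximum text length; inputs that exceed it are split automatically; default 512.
  • single_label: whether each sample is assigned exactly one label. Defaults to False, meaning multi-label classification.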

4.3 Model prediction

Load the custom model with paddlenlp.Taskflow, using task_path to point at the model weights; that directory must contain the trained weights file model_state.pdparams.

import json
from paddlenlp import Taskflow
import pandas as pd
# Read the test file and merge each title and abstract into one input text
data = []
ids = []
with open('/home/aistudio/input/test.json', 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line.strip())
        text = record['title'] + ' ' + record['abstract']
        data.append(text)
        ids.append(record['id'])
schema = ['材料科学与工程', '临床医学', '电气工程', '数学', '化学', '地质工程', '地理学', '食品科学与工程', '医学', '生物学', '核科学与技能', '地球物理学', '水产', '药学', '交通运输工程', '体育学', '生物医学工程', '护理', '物理学', '心理学', '社会学', '神经科学', '计算机科学', '建筑学', '环境科学与工程', '机械工程', '航空航天工程', '石油工程', '免疫与微生物学', '矿业', '通讯与信息科学', '光学', '历史学', '地质学', '教育学', '海洋工程', '公共管理学', '仪器科学与技能', '经济学', '音乐']
my_cls = Taskflow("zero_shot_text_classification", model="utc-base", schema=schema, task_path='/home/aistudio/checkpoint/model_best/plm')
results=my_cls(data)
# Collect the predicted labels
labels = []
for prediction in results:
    label_list = []
    for item in prediction['predictions']:
        label_list.append(item['label'])
    labels.append(label_list)
result = pd.DataFrame({'id': ids, 'subject_name': labels})
print(result)
# Save the output
result.to_csv('result.csv', index=False)
# "w" creates the file if it does not exist and overwrites it if it does
with open("/home/aistudio/output/output.txt", "w", encoding='UTF-8') as f:
    for res in results:
        print(res)
        # json.dumps escapes non-ASCII characters by default; ensure_ascii=False keeps the Chinese labels readable
        line = json.dumps(res, ensure_ascii=False)
        f.write(line + "\n")
print("数据成果已导出")  # prints a "results exported" message
[2023-06-16 14:58:56,470] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'utc-base'.
[2023-06-16 14:58:56,472] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/utc-base/utc_base_vocab.txt
[2023-06-16 14:58:56,495] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/utc-base/tokenizer_config.json
[2023-06-16 14:58:56,500] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/utc-base/special_tokens_map.json
[2023-06-16 14:58:56,502] [    INFO] - Assigning ['[O-MASK]'] to the additional_special_tokens key of the tokenizer
    id subject_name
0    0         [化学]
1    1         [光学]
2    2        [物理学]
3  195         [化学]
{'predictions': [{'label': '化学', 'score': 0.9999739312596861}], 'text_a': 'Oxidative coupling of methane in the redox cyclic mode over the catalysts on the basis of CeO2 and La2O3 The 1% CeO 2 , 9% La 2 O 3 /SiO 2 and 2% CeO 2 , 8% La 2 O 3 /SiO 2 catalysts show reliable efficiency in the OCM reaction, as well as stable work in the redox cyclic mode. Selectivity to C 2 products remarkably increases if preliminary reduction of the catalyst by a small amount of hydrogen is used.'}
{'predictions': [{'label': '光学', 'score': 0.5380524000927461}], 'text_a': 'Tissue engineering: strategies, stem cells and scaffolds. Tissue engineering scaffolds are designed to influence the physical, chemical and biological environment surrounding a cell population. In this review we focus on our own work and introduce a range of strategies and materials used for tissue engineering, including the sources of cells suitable for tissue engineering: embryonic stem cells, bone marrow-derived mesenchymal stem cells and cord-derived mesenchymal stem cells. Furthermore, we emphasize the developments in custom scaffold design and manufacture, highlighting laser sintering, supercritical carbon dioxide processing, growth factor incorporation and zoning, plasma modification of scaffold surfaces, and novel multi-use temperature-sensitive injectable materials.'}
{'predictions': [{'label': '物理学', 'score': 0.8062627429802265}], 'text_a': 'Enhancement of Forced Convection Subcooled Film Boiling Heat Transfer Using Gas Sheet Collapse by Electric Field Application Enhancement of forced-convection boiling heat transfer by electriceld is investigated experimentally. When a high-temperature horizontallament is immersed in water, a gas sheet is formed around and the abovelament due to liquid boiling, in the early immersion process. This gas-sheet markedly decreases the boiling cooling rate of thelament. Here, forced collapse of the gas sheet is attempted by imposing an electriceld to enhance the boiling cooling rate, In the experiments, a horizontal platinum wire of 0.5mm in diameter is immersed in pure water under atmospheric pressure, and a DC voltage up to 600V is applied between the wire surface and an electrode made of glass placed 10mm apart. The whole boiling curve is measured under different applied voltages and wire-falling velocities in 0.5 t
{'predictions': [{'label': '化学', 'score': 0.6280942049702516}], 'text_a': 'Speciation of some heavy metals in bottom sediments of the Ob and Yenisei estuarine zones The speciation of Fe, Mn, Zn, Cu, Co, Ni, Cr, Pb, and Cd was studied in 52 samples of bottom sediments collected during Cruise 49 of the R/V Dmitrii Mendeleev in estuaries of the Ob and Yenisei rivers in the southwestern Kara Sea. Immediately after sampling, the samples were subjected to on-board consecutive extraction to separate metal species according to their modes of occurrence in the sediments: (1) adsorbed, (2) amorphous Fe-Mn hydroxides and related metals, (3) organic + sulfide, and (4) residual, or lithogenic. The atomic absorption spectroscopy of the extracts was carried out at a stationary laboratory. The distribution of Fe, Zn, Cu, Co, Ni, Cr, Pb, and Cd species is characterized by the predominance of lithogenic or geochemically inert modes (70–95% of the bulk content), in which the metals are bound in terrigeno
数据成果已导出
# Produce output in the format the organizers require
import json
from paddlenlp import Taskflow
import pandas as pd
# The backend runs in the project directory; if unsure of a path, use an absolute one such as '/home/mw/project/xxx'
def invoke(input_data_path):
    data = []
    ids = []
    with open(input_data_path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line.strip())
            text = record['title'] + ' ' + record['abstract']
            data.append(text)
            ids.append(record['id'])
    schema = ['材料科学与工程', '临床医学', '电气工程', '数学', '化学', '地质工程', '地理学', '食品科学与工程', '医学', '生物学', '核科学与技能', '地球物理学', '水产', '药学', '交通运输工程', '体育学', '生物医学工程', '护理', '物理学', '心理学', '社会学', '神经科学', '计算机科学', '建筑学', '环境科学与工程', '机械工程', '航空航天工程', '石油工程', '免疫与微生物学', '矿业', '通讯与信息科学', '光学', '历史学', '地质学', '教育学', '海洋工程', '公共管理学', '仪器科学与技能', '经济学', '音乐']
    my_cls = Taskflow("zero_shot_text_classification", model="utc-base", schema=schema, task_path='/home/aistudio/checkpoint/model_best/plm')
    results=my_cls(data)
    # Remember to adjust the pred_threshold threshold as needed
    # Extract the predicted labels from the results
    labels = []
    for prediction in results:
        label_list = []
        for item in prediction['predictions']:
            label_list.append(item['label'])
        labels.append(label_list)
    # Build the output
    result = pd.DataFrame({'id': ids, 'subject_name': labels})
    return result
input_data_path="/home/aistudio/input/test.json"
result=invoke(input_data_path)
print(result)
[2023-06-16 14:59:12,503] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'utc-base'.
[2023-06-16 14:59:12,505] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/utc-base/utc_base_vocab.txt
[2023-06-16 14:59:12,528] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/utc-base/tokenizer_config.json
[2023-06-16 14:59:12,531] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/utc-base/special_tokens_map.json
[2023-06-16 14:59:12,535] [    INFO] - Assigning ['[O-MASK]'] to the additional_special_tokens key of the tokenizer
    id subject_name
0    0         [化学]
1    1         [光学]
2    2        [物理学]
3  195         [化学]
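The comment in the function above is a reminder to tune the prediction threshold. One way to experiment without re-running the model is to filter the returned scores afterwards. The sketch below works on dicts shaped like the results printed earlier; `select_labels`, the 0.6 threshold, and the scores are made up for illustration. It can only tighten the selection, because labels the model already dropped are not in the results.

```python
def select_labels(result, threshold=0.5):
    """Keep labels scoring at or above threshold; fall back to the single best label.

    `result` follows the shape printed above:
    {'predictions': [{'label': ..., 'score': ...}, ...], 'text_a': ...}
    """
    predictions = result.get("predictions", [])
    if not predictions:
        return []  # nothing was returned for this sample
    kept = [p["label"] for p in predictions if p["score"] >= threshold]
    if kept:
        return kept
    best = max(predictions, key=lambda p: p["score"])
    return [best["label"]]


# Hypothetical results with invented scores.
results = [
    {"predictions": [{"label": "化学", "score": 0.99}], "text_a": "..."},
    {"predictions": [{"label": "光学", "score": 0.54},
                     {"label": "生物医学工程", "score": 0.51}], "text_a": "..."},
    {"predictions": [], "text_a": "..."},
]
labels = [select_labels(r, threshold=0.6) for r in results]
print(labels)
```

The fallback keeps one label whenever any prediction exists, which matters if scoring requires an exact match: an empty answer can never be correct for a paper that has at least one subject.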

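5. Summary

Track 1: Paper subject classification (Easy). Classify each paper into 40 disciplines based on its title and abstract; a paper may be single-discipline or interdisciplinary. Target accuracy is above 90%. The task itself is fairly simple and took only a few hours, but I lost a lot of time on the official image and my submission failed as a result. I have to complain here: the provided base image includes only TF and torch, not paddle, and I was never able to publish a newly built image, so the submission died before it could be made. I am open-sourcing the baseline first; feel free to experiment with it.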

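5.1 Improvement strategy

  • Preprocess the abstract, for example by summarizing it to extract the key content. The model currently handles a maximum length of 512, so part of the information is lost.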
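A full summarization model is one option. A much simpler heuristic, sketched below, keeps the title plus the opening and closing sentences of the abstract within a word budget, on the assumption that abstracts tend to state the problem first and the conclusion last. This is a swapped-in illustration rather than the author's method; `head_tail_truncate` is a hypothetical name, and a word budget is only a rough proxy for the model's token limit.

```python
import re


def head_tail_truncate(title, abstract, max_words=300):
    """Keep the title plus leading and trailing abstract sentences within max_words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]
    budget = max_words - len(title.split())
    head, tail = [], []
    used = 0
    left, right = 0, len(sentences) - 1
    take_head = True
    # Alternate between the front and the back until the budget runs out.
    while left <= right:
        sentence = sentences[left] if take_head else sentences[right]
        cost = len(sentence.split())
        if used + cost > budget:
            break
        if take_head:
            head.append(sentence)
            left += 1
        else:
            tail.insert(0, sentence)
            right -= 1
        used += cost
        take_head = not take_head
    return " ".join([title] + head + tail)


abstract = "A one two. B three four. C five six. D seven eight."
short = head_tail_truncate("T", abstract, max_words=7)
full = head_tail_truncate("T", abstract, max_words=100)
print(short)
print(full)
```

Sentences are kept in their original order, so the shortened text still reads naturally. If even the first sentence exceeds the budget, only the title is returned.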


Project source code (cloud notebook): www.heywhale.com/mw/project/…
