Knowledge Center - 北京概普生物科技有限公司 (GapTech)

PLOS Biology papers examine how much preprints differ from their final published versions | April bioRxiv bioinformatics picks

Bioinformatics Essentials (生信干货) · Montreal · May 6, 2022, 14:49

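One of the most important open questions about preprints is how much a preprint differs from its final published version. The answer directly shapes how we treat preprints, and by extension how we cite them and whether we build collaborations on them. Because many COVID-19 papers are posted as preprints first, and many media outlets report preprint results directly, studying how reliable preprint results are matters a great deal right now.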

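Recently, the well-regarded open-access journal PLOS Biology published two back-to-back papers on this question. Using different methods, they reached broadly similar conclusions: the final published version differs little from the preprint. In one paper, researchers from the University of Pennsylvania used machine learning to analyze nearly 20,000 preprints posted on bioRxiv. The other, a collaboration among researchers from several countries, relied on manual analysis: a detailed examination of more than 180 preprints found that 82.8% showed no major difference from the final published version, and among studies unrelated to COVID-19 that share rose to 92.8%. Incidentally, it is no accident that the journal favors this topic. Back in 2018, PLOS Biology took the lead in reaching an agreement with bioRxiv that lets authors have their manuscript posted to bioRxiv automatically at submission, setting a precedent for the field.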

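Although these two papers are often read as showing that "peer review has little effect on preprints," the following points still call for caution: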

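1. The analyses do not account for preprints that never reached publication, and those manuscripts are more likely to have met stronger resistance during peer review.

2. Agreement between the two versions may simply mean that the authors were unwilling to make changes, or that the reviewers did not spot the paper's problems.

3. A published version is not necessarily "flawless."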

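The third point above highlights an emerging idea, post-publication peer review (PPPR): publication is not the end point. A paper should remain open to criticism from everyone, and its assessment should keep pace with the times. We will come back to this topic in a future post.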

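If we think of a public-account post as a kind of "publication," then even after drafting by the writer and review by the editor-in-chief, many posts still end up with slips. For example, in the previous installment of our preprint picks, we mistakenly described the caecilian (蚓螈), which is an amphibian, as a reptile and misled readers. We apologize for the error. Below are our April bioRxiv bioinformatics picks.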


1. [Epigenomics] Machine learning in epigenomics: quantitative models come out slightly ahead

Evaluating deep learning for predicting epigenomic profiles

Deep learning has been successful at predicting epigenomic profiles from DNA sequences. Most approaches frame this task as a binary classification relying on peak callers to define functional activity. Recently, quantitative models have emerged to directly predict the experimental coverage values as a regression. As new models continue to emerge with different architectures and training configurations, a major bottleneck is forming due to the lack of ability to fairly assess the novelty of proposed models and their utility for downstream biological discovery. Here we introduce a unified evaluation framework and use it to compare various binary and quantitative models trained to predict chromatin accessibility data. We highlight various modeling choices that affect generalization performance, including a downstream application of predicting variant effects. In addition, we introduce a robustness metric that can be used to enhance model selection and improve variant effect predictions. Our empirical study largely supports that quantitative modeling of epigenomic profiles leads to better generalizability and interpretability.
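The difference between the two framings mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' evaluation framework: the coverage values, the fixed threshold standing in for a peak caller, and the function names are all assumptions made for illustration. It only shows that binarizing coverage discards signal strength, while a regression-style score such as Pearson correlation keeps it.

```python
from math import sqrt


def binarize(coverage, threshold):
    """Binary framing: label a region 1 if its coverage reaches the threshold.

    The fixed threshold is a crude, hypothetical stand-in for a peak caller.
    """
    return [1 if c >= threshold else 0 for c in coverage]


def accuracy(labels, predictions):
    """Fraction of regions whose binary label matches the prediction."""
    matches = sum(1 for a, b in zip(labels, predictions) if a == b)
    return matches / len(labels)


def pearson(x, y):
    """Quantitative framing: Pearson correlation between observed and predicted coverage."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical coverage values for four genomic regions (not real data).
observed = [0.0, 2.0, 10.0, 40.0]
predicted = [1.0, 3.0, 12.0, 35.0]

observed_labels = binarize(observed, threshold=5.0)
predicted_labels = binarize(predicted, threshold=5.0)

print("binary labels:", observed_labels)
print("binary accuracy:", accuracy(observed_labels, predicted_labels))
print("pearson r:", round(pearson(observed, predicted), 3))
```

In the binary view, regions with coverage 10 and 40 receive the same label, so a model cannot be rewarded for telling them apart. The regression view scores that distinction directly.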

2. [Manifolds] Delft University of Technology, the Netherlands: a manifold alignment framework applied to single-cell data analysis

TopoGAN: unsupervised manifold alignment of single-cell data

Results: We present TopoGAN, a method for unsupervised manifold alignment of single-cell datasets with non-overlapping cells or features. We use topological autoencoders to obtain latent representations of each modality separately. A topology-guided Generative Adversarial Network then aligns these latent representations into a common space. We show that TopoGAN outperforms state-of-the-art manifold alignment methods in complete unsupervised settings. Interestingly, the topological autoencoder for individual modalities also showed better performance in preserving the original structure of the data in the low-dimensional representations when compared to using UMAP or a variational autoencoder. Taken together, we show that the concept of topology preservation might be a powerful tool to align multiple single modality datasets, unleashing the potential of multi-omic interpretations of cells. Availability and implementation: Implementation available on GitHub (https://github.com/AkashCiel/TopoGAN). All datasets used in this study are publicly available.

3. [Tree building] Building phylogenetic trees directly from reads, made easy

Read2Tree: scalable and accurate phylogenetic trees from raw reads

The inference of phylogenetic trees from raw sequencing reads is foundational to biology. However, state-of-the-art phylogenomics requires running complex pipelines, at significant computational and labour costs, with additional constraints in sequencing coverage, assembly and annotation quality. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes. In a benchmark encompassing a broad variety of datasets, our assembly-free approach was 10-100x faster than conventional approaches, and in most cases more accurate—the exception being when sequencing coverage was high and reference species very distant. To illustrate the broad applicability of the tool, we reconstructed a yeast tree of life of 435 species spanning 590 million years of evolution. Applied to Coronaviridae samples, Read2Tree accurately classified highly diverse animal samples and near-identical SARS-CoV-2 sequences on a single tree—thereby exhibiting remarkable breadth and depth. The speed, accuracy, and versatility of Read2Tree enables comparative genomics at scale.

4. [Re-contact] University of Helsinki, Finland: re-contacting biobank participants

Re-contacting biobank participants: lessons from a pilot study within FinnGen

Results: The overall participation rate was 18.6% (23.1% among individuals aged 18-69). A second reminder letter yielded an additional 9.7% participation rate in those who did not respond to the first invitation. Re-contacting participants via an online healthcare portal yielded lower participation than re-contacting via physical letter. The completion rate of questionnaire and cognitive tests was high (92% and 85%, respectively), and measurements were overall reliable among participants. For example, the correlation (r) between self-reported body mass index and that collected by the biobanks was 0.92. Conclusions: In summary, this pilot suggests that re-contacting FinnGen participants with the goal to collect a wide range of cognitive, behavioral and lifestyle information without additional engagement, results in a low participation rate, but with reliable data. We suggest that such information be collected at enrollment, if possible, rather than via post-hoc re-contacting.

5. [Selection] Tokyo Institute of Technology: how do you select the best model among AlphaFold2 predictions?

How to select the best model from AlphaFold2 structures?

Among the methods for protein structure prediction, which is important in biological research, AlphaFold2 has demonstrated astonishing accuracy in the 14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP14). The accuracy is close to the level of experimental structure determination. Furthermore, AlphaFold2 predicts three-dimensional structures and estimates the accuracy of the predicted structures. AlphaFold2 outputs two model accuracy estimation scores, pLDDT, and pTM, enabling the user to judge the reliability of the predicted structures. Original research of AlphaFold2 showed that those scores had good correlations to actual prediction accuracy. However, it was unclear whether we could select a structure close to the native structure when multiple structures are predicted for a single protein. In this study, we generated several hundred structures with different combinations of parameters for 500 proteins and verified the performance of the accuracy estimation scores of AlphaFold2. In addition, we compared those scores with existing accuracy estimation methods. As a result, pLDDT and pTM showed better performance than the existing accuracy estimation methods for AlphaFold2 structures. However, the estimation performance of relative accuracy of the scores was still insufficient, and the improvement would be needed for further utilization of AlphaFold2.
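As a point of reference for the selection problem described in the abstract, here is a minimal sketch of the naive baseline: rank candidate structures by their mean pLDDT. It relies on the fact that AlphaFold2 writes per-residue pLDDT into the B-factor column of its PDB output. Everything else is an assumption made for illustration: the function names, the choice to average over C-alpha atoms only, and the `make_ca_line` helper that fabricates toy ATOM records. This is not the procedure used in the paper, which reports that ranking by these scores alone is still insufficient.

```python
def make_ca_line(serial, resseq, plddt):
    """Build a toy fixed-width PDB ATOM record for a C-alpha atom (demo helper only)."""
    return (
        f"ATOM  {serial:5d}  CA  ALA A{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def mean_plddt(pdb_text):
    """Average the B-factor column (columns 61-66) over C-alpha ATOM records."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16 of an ATOM record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)


def select_best(models):
    """Return the name of the model with the highest mean pLDDT.

    `models` maps a model name to the text of its PDB file.
    """
    return max(models, key=lambda name: mean_plddt(models[name]))


# Two hypothetical two-residue models with made-up pLDDT values.
model_a = "\n".join([make_ca_line(1, 1, 90.0), make_ca_line(2, 2, 80.0)])
model_b = "\n".join([make_ca_line(1, 1, 70.0), make_ca_line(2, 2, 60.0)])
candidates = {"model_a": model_a, "model_b": model_b}

for name, text in candidates.items():
    print(name, mean_plddt(text))
print("selected:", select_best(candidates))
```

In practice the PDB text would be read from the files AlphaFold2 produces rather than built by hand, and the abstract's caveat applies: a higher mean pLDDT does not guarantee the structure closest to the native one.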

6. [Antibodies] IgFold: a tool from Johns Hopkins University that reportedly surpasses AlphaFold at antibody structure prediction

Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies

Antibodies have the capacity to bind a diverse set of antigens, and they have become critical therapeutics and diagnostic molecules. The binding of antibodies is facilitated by a set of six hypervariable loops that are diversified through genetic recombination and mutation. Even with recent advances, accurate structural prediction of these loops remains a challenge. Here, we present IgFold, a fast deep learning method for antibody structure prediction. IgFold consists of a pre-trained language model trained on 558M natural antibody sequences followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under one minute). Accurate structure prediction on this timescale makes possible avenues of investigation that were previously infeasible. As a demonstration of IgFold’s capabilities, we predicted structures for 105K paired antibody sequences, expanding the observed antibody structural space by over 40 fold.

See our separate post for details.

7. [Binning] Binning eukaryotic genomes from metagenomes: challenges and opportunities

Recovery of 447 Eukaryotic bins reveals major challenges for Eukaryote genome reconstruction from metagenomes

An estimated 8.7 million eukaryotic species exist on our planet. However, recent tools for taxonomic classification of eukaryotes only dispose of 734 reference genomes. As most Eukaryotic genomes are yet to be sequenced, the mechanisms underlying their contribution to different ecosystem processes remain untapped. Although approaches to recover Prokaryotic genomes have become common in genome biology, few studies have tackled the recovery of Eukaryotic genomes from metagenomes. This study assessed the reconstruction of Eukaryotic genomes using 215 metagenomes from diverse environments using the EukRep pipeline. We obtained 447 eukaryotic bins from 15 classes (e.g., Saccharomycetes, Sordariomycetes, and Mamiellophyceae) and 16 orders (e.g., Mamiellales, Saccharomycetales, and Hypocreales). More than 73% of the obtained eukaryotic bins were recovered from samples whose biomes were classified as host-associated, aquatic and anthropogenic terrestrial. However, only 93 bins showed taxonomic classification to (9 unique) genera and 17 bins to (6 unique) species. A total of 193 bins contained completeness and contamination measures. Average completeness and contamination were 44.64% (σ=27.41%) and 3.97% (σ=6.53%), respectively. Micromonas commoda was the most frequent taxa found while Saccharomyces cerevisiae presented the highest completeness, possibly resulting from a more significant number of reference genomes. However, mapping eukaryotic bins to the chromosomes of the reference genomes suggests that completeness measures should consider both single-copy genes and chromosome coverage. Recovering eukaryotic genomes will benefit significantly from long-read sequencing, intron removal after assembly, and improved reference genomes databases.

8. [Phages] Analyzing bacteriophage sequences in metagenomes

MetaPhage: an automated pipeline for analyzing, annotating, and classifying bacteriophages in metagenomics sequencing data

In the last decades, a great interest has emerged in the study and characterisation of the microbiota, especially the human gut microbiota, demonstrating that commensal microorganisms play a pivotal role in normal anatomical development and physiological function of the human body. To better understand the complex bacterial dynamics that characterize different environments, bacteriophage predation and gene transfer need to be considered as well, as they are important factors that may contribute to controlling the density, diversity, and network interactions among bacterial communities. To date, a variety of bacteriophage identification tools have been developed, differing on phage mining strategies, input files requested and results produced; however, new users approaching the bacteriophage analysis might struggle in untangling the variety of methods and comparing the different results produced. Here we present MetaPhage, a comprehensive reads-to-report pipeline that streamlines the use of multiple miners and generates an exhaustive report to both summarize and visualize the key findings and to enable further exploration of specific results with interactive filterable tables. The pipeline is implemented in Nextflow, a widely adopted workflow manager, that enables an optimized parallelization of the tasks on different premises, from local server to the cloud, and ensures reproducible results using containerized packages. MetaPhage is designed to allow scalability, reproducibility and to be easily expanded with new miners and methods, in a field that is constantly expanding. MetaPhage is freely available under a GPL-3.0 license at https://github.com/MattiaPandolfoVR/MetaPhage.

9. [Butterflies] Researchers in Puerto Rico: a pan-genome reveals how chromatin accessibility evolves in butterflies

A butterfly pan-genome reveals a large amount of structural variation underlies the evolution of chromatin accessibility

Despite insertions and deletions being the most common structural variants (SVs) found across genomes, not much is known about how much these SVs vary within populations and between closely related species, nor their significance in evolution. To address these questions, we characterized the evolution of indel SVs using genome assemblies of three closely related Heliconius butterfly species. Over the relatively short evolutionary timescales investigated, up to 18.0% of the genome was composed of indels between two haplotypes of an individual H. charithonia butterfly and up to 62.7% included lineage-specific SVs between the genomes of the most distant species (11 Mya). Lineage-specific sequences were mostly characterized as transposable elements (TEs) inserted at random throughout the genome and their overall distribution was similarly affected by linked selection as single nucleotide substitutions. Using chromatin accessibility profiles (i.e., ATAC-seq) of head tissue in caterpillars to identify sequences with potential cis-regulatory function, we found that out of the 31,066 identified differences in chromatin accessibility between species, 30.4% were within lineage-specific SVs and 9.4% were characterized as TE insertions. These TE insertions were localized closer to gene transcription start sites than expected at random and were enriched for several transcription factor binding site candidates with known function in neuron development in Drosophila. We also identified 24 TE insertions with head-specific chromatin accessibility. Our results show high rates of structural genome evolution that were previously overlooked in comparative genomic studies and suggest a high potential for structural variation to serve as raw material for adaptive evolution.

10. [COVID-19] Complutense University of Madrid: Omicron infection in pets

The Omicron (B.1.1.529) SARS-CoV-2 variant of concern also affects companion animals

The recent emergence of the Omicron variant (B.1.1.529) has brought with it a large increase in the incidence of SARS-CoV-2 disease worldwide. However, there is hardly any data on the incidence of this new variant in companion animals. In this study, we have detected the presence of this new variant in domestic animals such as dogs and cats living with owners with COVID19 in Spain that have been sampled at the most optimal time for the detection of the disease. None of the RT-qPCR positive animals (10.13%) presented any clinical signs and the viral loads detected were very low. In addition, the shedding of viral RNA lasted a short period of time in the positive animals. Infection with the Omicron variant of concern (VOC) was confirmed by a specific RT-qPCR for the detection of this variant and by sequencing. These outcomes suggest a lower virulence of this variant in infected cats and dogs. This study demonstrates the transmission of this new variant from infected humans to domestic animals and highlights the importance of doing active surveillance as well as genomic research to detect the presence of VOCs or mutations associated with animal hosts.