Knowledge Center - 北京概普生物科技有限公司 (GapTech)

I. Abstract

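Obtaining and analysing data from nearly all individuals of a species is becoming increasingly important in classical genetics. But even though sequencing is now extremely cheap, sequencing a population-scale number of individuals in one go is still unaffordable for non-model organisms.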

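For now, sequencing many individuals as a single pool looks like a good alternative to sequencing each individual separately, because it is more cost-effective. As dedicated Pool-seq software has been developed, the technique has become well suited to population-scale studies of both model and non-model organisms. This review therefore introduces Pool-seq in detail.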

II. Background

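You may share this feeling: ten years ago, sequencing an entire genome was guaranteed headline news. A lot has changed in the decade since. With advances in NGS, third-generation sequencing and assembly algorithms, producing a genome for a species has become routine, and it no longer carries the weight it once did. The genome of a single individual is only one representative of its species and cannot fully capture it, so the current trend is to sequence a whole population at once, for example 1,000 people or 2,000 chickens. This raises two problems. First, it is expensive: manageable for large institutions and for model organisms, but simply unrealistic for less-studied species. Second, do 1,000 individuals actually represent the species? Clearly not. This is especially true for humans, where there is little tolerance for error and even a few thousand individuals often cannot settle the question. In the words of the original article: The analysis of human diseases and other complex traits indicated that even the analysis of several thousand individuals frequently turned out to be insufficient to determine the underlying genetic architecture. Both problems show that Pool-seq has a real role to play.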

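This review discusses whole-genome Pool-seq of multiple individuals, which costs less than sequencing individuals separately and is more cost-effective than the other approaches. It also covers the strengths and weaknesses of the method and the related software tools and workflows.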

III. Pool-seq is cost-effective

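With Pool-seq, allele frequency estimates of the same accuracy cost less. Put differently, sequencing individuals on a small scale still leaves sampling noise in the estimate, whereas Pool-seq sequences not a handful of samples but a large population, which makes the estimate much more accurate. The figure below shows this advantage at equal sequencing cost.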

It shows that, compared with sequencing 50 individuals separately, pooling 100 or 500 individuals gives clearly better results.

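In addition, as sequencing itself keeps getting cheaper, the cost of library preparation can no longer be neglected, and Pool-seq reduces that cost substantially.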

IV. Comparison with other methods

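To cut costs relative to sequencing every individual separately, the options include reduced-representation sequencing, exome sequencing and RNA-level sequencing (RNA-seq, which covers only coding regions). The sequencing strategies are sketched in the figure below; the experimental details were covered in earlier articles and are not repeated here.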

The strengths and weaknesses of these approaches compare as follows:

Pool-seq clearly has major advantages, above all its very high cost-effectiveness.

The factors that limit the use of Pool-seq are:

1. Pool size

2. Linkage disequilibrium

3. Sequencing errors

4. Differential representation of individuals in the pool

5. Alignment problems

The table shows which cases it is and is not suited to:

V. Applications of Pool-seq

1. Genotype-phenotype mapping

Mapping-by-sequencing of induced mutations using bulk segregant analysis
Mapping of naturally occurring functionally diverged alleles using bulk segregant analysis
Pool genome-wide association studies
Evolve and resequence

2. Reverse ecology

3. Domestication

4. Genome evolution

Recombination landscape and polymorphism
Transposable elements
Selective sweeps

5. Trajectories of selected alleles

Clonal interference
Plateauing of selected alleles
Dynamics of ecologically diverged clones

6. Cancer genomics

VI. Table of related software

VII. Recommended workflow and strategies

The analysis of data obtained by whole-genome sequencing of pools of individuals (Pool-seq) is a rapidly growing field, and new tools are continuously being developed. Therefore, we caution that recommendations listed here are also a moving target that needs to be continuously challenged, preferentially by validation studies. Furthermore, the optimal experimental design will depend on the biological systems being investigated and the purpose of the study.

Number of individuals included in a pool: >40

The accuracy of Pool-seq increases with the number of individuals included in the pool because the sampling error and the influence of unequal representation of individuals in the pool are reduced. At least 40 diploid individuals should be used (REFS 11,12,38).

Depth of coverage: >50×

Reliable allele frequency estimates require a sufficiently high sequencing coverage to reduce the sampling error, which in turn depends on the allele frequency. Furthermore, a higher coverage not only facilitates the identification of sequencing errors but also provides more power to detect allele frequency differences. Therefore, we recommend a minimum coverage of at least 50-fold for single-nucleotide polymorphism (SNP)-based tests and caution that some applications may require a 200-fold coverage (REF. 110). A lower coverage is sufficient if windows containing multiple SNPs (REF. 39) or large inversions (REF. 111) are analysed.
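
The way pool size and coverage jointly limit accuracy can be illustrated with an idealized two-stage sampling model. The sketch below is not a formula taken from the review. It assumes that every individual contributes equally to the pool, that reads are drawn independently, and that there are no sequencing errors; the function name `poolseq_freq_variance` is hypothetical.

```python
def poolseq_freq_variance(p: float, n_diploid: int, coverage: int) -> float:
    """Variance of a Pool-seq allele frequency estimate (idealized model).

    Stage 1: 2 * n_diploid chromosomes are sampled from the population.
    Stage 2: `coverage` reads are sampled with replacement from the pool.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_diploid <= 0 or coverage <= 0:
        raise ValueError("n_diploid and coverage must be positive")
    chromosomes = 2 * n_diploid
    pool_term = 1.0 / chromosomes                          # sampling of individuals
    read_term = (1.0 - 1.0 / chromosomes) / coverage       # sampling of reads
    return p * (1.0 - p) * (pool_term + read_term)


if __name__ == "__main__":
    # Compare a few illustrative designs for an allele at frequency 0.1.
    for n, cov in [(40, 50), (40, 200), (500, 50), (500, 200)]:
        sd = poolseq_freq_variance(0.1, n, cov) ** 0.5
        print(f"n={n:4d}  coverage={cov:4d}  sd={sd:.4f}")
```

Under these assumptions, neither a larger pool nor a deeper coverage helps much on its own once the other term dominates, which is one way to read the joint recommendation on pool size and coverage.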

Sequencing technology: using a read length of >75 nucleotides and paired-end reads

As mapping accuracy is improved by longer paired-end reads, we recommend using paired-end reads of at least 75 nucleotides. Furthermore, PCR duplicates are more reliably identified if paired-end reads are used.

Preprocessing of reads: trimming

The increased error rate towards the 3ʹ end of Illumina reads could impair downstream analyses such as variant calling (REF. 112). Therefore, we suggest trimming reads with one of the available software tools (REFS 39,113).
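
To make the idea of 3ʹ trimming concrete, here is a deliberately simple sketch. It is not the algorithm of any particular trimming tool: it only removes trailing bases whose quality falls below a threshold. The Phred+33 encoding and the threshold of 20 are assumptions, and both function names are hypothetical.

```python
from typing import List, Tuple


def decode_phred33(quality_string: str) -> List[int]:
    """Convert a FASTQ quality string to Phred scores, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def trim_3prime(seq: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Drop bases from the 3' end while their quality is below min_q."""
    if len(seq) != len(quals):
        raise ValueError("sequence and qualities must have the same length")
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


if __name__ == "__main__":
    read = "ACGTAC"
    quals = decode_phred33("IIII+&")
    print(trim_3prime(read, quals))
```

Dedicated trimmers typically use more robust criteria, such as sliding windows or running sums, so this sketch should be read as an illustration of the principle only.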

Mapping: using conspecific reference genome and global alignment; allowing for gaps and disabling seeding

Whenever possible, heterologous reference genomes should not be used, as even closely related species often harbour diverged genomic regions that may cause alignment artefacts (REFS 83,114). For non-model organisms with large genome sizes, RNA-sequencing-based de novo assemblies may be a viable strategy (REF. 72). Soft clipping (the exclusion of terminal bases with mismatches) should be avoided, as this leads to biased allele frequency estimates (REFS 39,115). Thus, semi-global alignment algorithms should be used (as implemented in BWA ALN (REF. 35) and Bowtie2 (REF. 116)). In addition, allowing for gaps increases the mapping accuracy (REF. 39). Realignment of unmapped reads could improve the coverage of diverged regions, but soft clipping will be introduced for these reads (an example of a realignment tool that uses soft clipping is BWA SAMPE (REF. 35)). The ‘seeding’ step, which was introduced as a heuristic to accelerate mapping, should be avoided because it discriminates against diverged reads and could possibly introduce bias into allele frequency estimates.

Filtering: using proper pairs and a mapping quality of >20

The mapping precision is higher when both reads of a read pair can be mapped (that is, when they are proper pairs); therefore, broken pairs should be filtered. Rather than relying on uniquely mapped reads, it is preferable to filter reads by mapping quality, as this takes the base quality of mismatches into account (REF. 35). We recommend a minimum mapping quality of 20.
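
This filter can be sketched directly on SAM text records. The example below assumes tab-separated SAM lines in which column 2 is the FLAG and column 5 is the MAPQ, and it treats the proper-pair bit (0x2) as the definition of a proper pair; the function names are hypothetical and the code is an illustration rather than a replacement for an established tool.

```python
from typing import Iterable, Iterator

PROPER_PAIR_FLAG = 0x2  # SAM FLAG bit: each segment properly aligned


def keep_alignment(sam_line: str, min_mapq: int = 20) -> bool:
    """Return True for a proper-pair alignment with MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    return bool(flag & PROPER_PAIR_FLAG) and mapq >= min_mapq


def filter_sam(lines: Iterable[str], min_mapq: int = 20) -> Iterator[str]:
    """Yield header lines unchanged and only those alignments that pass the filter."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line


if __name__ == "__main__":
    records = [
        "@HD\tVN:1.6",
        "r1\t99\tchr1\t100\t30\t50M\t=\t300\t250\tACGT\tIIII",   # proper pair, MAPQ 30
        "r2\t99\tchr1\t100\t10\t50M\t=\t300\t250\tACGT\tIIII",   # MAPQ too low
        "r3\t65\tchr1\t100\t60\t50M\tchr2\t300\t0\tACGT\tIIII",  # not a proper pair
    ]
    for kept in filter_sam(records):
        print(kept)
```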

Indels: realigning reads spanning indels or ignoring regions around indels

Reads mapped to insertions and deletions (indels) are frequently misaligned, especially if the ends of reads span an indel (REF. 33). To avoid false SNPs, we recommend either realigning reads covering an indel (REFS 117,118) or excluding bases flanking the indel (REF. 39).
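
The second option, excluding bases that flank an indel, can be sketched by walking the CIGAR string of each alignment. The flank width of 5 bases is an arbitrary illustrative default rather than a value from the review, and `indel_mask` is a hypothetical name. Coordinates are 1-based and inclusive, as for the SAM POS field.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance the reference position


def indel_mask(pos: int, cigar: str, flank: int = 5) -> List[Tuple[int, int]]:
    """Return reference intervals to exclude around the indels of one alignment."""
    masked = []
    ref_pos = pos
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I":
            # The insertion sits between ref_pos - 1 and ref_pos.
            masked.append((max(1, ref_pos - flank), ref_pos + flank - 1))
        elif op == "D":
            # The deletion removes reference bases ref_pos .. ref_pos + length - 1.
            masked.append((max(1, ref_pos - flank), ref_pos + length - 1 + flank))
        if op in REF_CONSUMING:
            ref_pos += length
    return masked


if __name__ == "__main__":
    print(indel_mask(100, "10M2D10M"))
    print(indel_mask(100, "10M3I10M"))
```

In practice the intervals from all reads in both pools would be merged and the SNPs falling inside them discarded.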

Duplicates: removing duplicates

It is frequently recommended to remove PCR duplicates (REF. 12), but only preferential amplification of one allele will result in a biased allele frequency estimate.

CNV: filtering for CNVs or using a maximum coverage

Copy number variations (CNVs) may lead to false-positive SNPs when multiple slightly diverged copies of a genomic region are collapsed during mapping. CNVs may be detected either with specialized software (REF. 40) or by excess sequence coverage and should be removed from the analyses.
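
A maximum-coverage filter can be as simple as a cutoff relative to the typical depth. The sketch below flags sites whose depth exceeds a multiple of the median; the factor of 2 is an assumption chosen for illustration, not a recommendation from the review, and `filter_excess_coverage` is a hypothetical name.

```python
from statistics import median
from typing import Dict, Tuple


def filter_excess_coverage(
    depth_by_site: Dict[int, int], max_fold: float = 2.0
) -> Tuple[Dict[int, int], float]:
    """Keep sites whose depth is at most max_fold times the median depth."""
    cutoff = max_fold * median(depth_by_site.values())
    kept = {site: depth for site, depth in depth_by_site.items() if depth <= cutoff}
    return kept, cutoff


if __name__ == "__main__":
    depths = {1: 50, 2: 52, 3: 48, 4: 150, 5: 51}
    kept, cutoff = filter_excess_coverage(depths)
    print(f"cutoff={cutoff}, kept sites={sorted(kept)}")
```

A suitable cutoff depends on the coverage distribution of the data set, so in a real analysis it would be chosen after inspecting that distribution.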

Coverage heterogeneity

Heterogeneous sequence coverage results in unequal power to detect allele frequency differences if it is not accounted for. Thus, it is recommended either to use more complex models that account for this (REF. 79) or to subsample to a homogeneous coverage over the entire genome.
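
Subsampling to a homogeneous coverage can be sketched per site as drawing a fixed number of reads without replacement. The example assumes biallelic sites summarized as reference and alternative read counts; `subsample_counts` is a hypothetical name, and sites below the target coverage are simply reported as `None` so that the caller can drop them.

```python
import random
from typing import Optional, Tuple


def subsample_counts(
    ref_count: int, alt_count: int, target: int, rng: random.Random
) -> Optional[Tuple[int, int]]:
    """Draw `target` reads without replacement; return (ref, alt) or None if too shallow."""
    if ref_count + alt_count < target:
        return None
    reads = [0] * ref_count + [1] * alt_count  # 0 = reference read, 1 = alternative read
    alt_sampled = sum(rng.sample(reads, target))
    return target - alt_sampled, alt_sampled


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the illustration is reproducible
    for ref, alt in [(80, 20), (300, 100), (10, 5)]:
        print((ref, alt), "->", subsample_counts(ref, alt, 50, rng))
```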

Variant detection: using a variant-calling algorithm that accounts for strand bias

In addition to ad hoc strategies (that is, strategies in which a minimum sequence quality is combined with a minimum fraction of reads supporting a SNP), it is also possible to use one of the several available tools for variant detection (TABLE 3). We note that it is also important to take other features that are frequently associated with false SNPs into account: only SNPs that are occurring at similar frequencies on both strands (that is, those not displaying strand bias; REF. 33) and that are also located in the central region of a read should be considered reliable. Examples of suitable variant callers include the GATK Unified Genotyper (REF. 118) and VarScan (REF. 119).
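
One possible way to test for strand bias is a two-sided Fisher's exact test on the 2×2 table of allele by strand. The sketch below implements the test with the standard library only. It is an illustration and not the procedure used internally by GATK or VarScan; the significance threshold of 0.01 and both function names are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = table_prob(a)
    total = sum(
        prob
        for prob in (table_prob(x) for x in range(lowest, highest + 1))
        if prob <= observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, total)


def has_strand_bias(
    ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int, alpha: float = 0.01
) -> bool:
    """Flag a SNP whose alleles are unevenly distributed across the two strands."""
    return fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev) < alpha


if __name__ == "__main__":
    print(has_strand_bias(50, 50, 15, 15))  # alternative allele on both strands
    print(has_strand_bias(50, 50, 30, 0))   # alternative allele on one strand only
```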

VIII. Glossary of related concepts

Next-generation sequencing

(NGS; also known as second-generation sequencing). An umbrella term for different sequencing platforms delivering millions of short DNA sequence reads.

Reads

DNA sequences that are generated by next-generation sequencing.

Pool-seq

A sequencing technique in which sequencing libraries are not prepared from DNA of a single individual or cell but from a mixture of DNA fragments originating from different individuals or cells. In the context of this Review, Pool-seq is used to describe the unbiased sequencing of the entire genome.

Coverage

The number of reads that span a given genomic position.

Sequencing libraries

Sets of fragmented DNA extracted from one or more individuals that serve as the template for subsequent sequencing.

Exome sequencing

A sequencing approach in which the complexity of the genome is reduced through hybridization to exonic sequences, which results in a higher sequence coverage of protein-coding regions.

Restriction-site-associated DNA markers

Sequence polymorphisms in close proximity to a restriction enzyme recognition site.

Linkage disequilibrium

(LD). Nonrandom association between alleles at two loci. In outcrossing diploid individuals, the genotypes need to be sorted into haplotypes in a statistical procedure called phasing.

Genetic markers

Polymorphic loci that could be scored with a genotyping technique.

F2 analysis

Analysis of mapping populations generated by the F2 design. The F1 progeny from crossing two phenotypically different parental strains are themselves crossed to produce an F2 population that is segregating for the phenotype of interest. The F2 mapping population may carry up to three genotypes at every marker and therefore allows the detection of additive and dominance effects, as well as interactions between loci.

Phased genomic sequences

Genome sequences for which the haplotype phase (that is, the combination of alleles or genetic markers that coexist on a single chromosome) has been determined.

Imputation

In statistics, it refers to the replacement of missing data with values. In genomics, it describes the use of haplotype sequences to fill in missing sequence information.

Haplotypes

The combination of alleles or genetic markers that coexist on a single chromosome. Chromosomal regions carrying a haplotype are inherited as intact physical units until they are broken up by recombination.

Pool genome-wide association studies

(Pool-GWASs). Genotype–phenotype mapping studies in which phenotypically extreme individuals are grouped and sequenced as pools. Causative variants are identified by contrasting the allele frequencies between the pools.

Evolve and resequence studies

Studies that combine experimental evolution with next-generation sequencing. They make use of controlled environmental, demographic and selective variables to facilitate genotype–phenotype mapping.

Forward genetics

An approach in which mutations induced by random mutagenesis that lead to the disruption of gene function are identified based on their phenotypes. The causative mutation is traditionally identified by positional cloning or by a candidate-gene approach.

Bulk segregant analysis

(BSA). Analysis in which offspring from diverged parents are phenotyped and the DNA of individuals from opposing tails of the phenotypic distribution is combined (pooled). Causative variants are identified by contrasting allele frequency differences among the pools.
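
The contrast between pools can be sketched as a per-SNP difference in allele frequency. The example assumes each pool is summarized as a mapping from a SNP identifier to (alternative reads, total reads); the data layout, the SNP names and both function names are hypothetical, and a real analysis would add a statistical test and a correction for coverage.

```python
from typing import Dict, List, Tuple

PoolCounts = Dict[str, Tuple[int, int]]  # SNP id -> (alternative reads, total reads)


def allele_frequency_difference(high_pool: PoolCounts, low_pool: PoolCounts) -> Dict[str, float]:
    """Alternative-allele frequency in the high pool minus that in the low pool."""
    differences = {}
    for snp in high_pool.keys() & low_pool.keys():
        alt_high, total_high = high_pool[snp]
        alt_low, total_low = low_pool[snp]
        if total_high == 0 or total_low == 0:
            continue  # no coverage in one pool: frequency undefined
        differences[snp] = alt_high / total_high - alt_low / total_low
    return differences


def top_candidates(differences: Dict[str, float], n: int = 1) -> List[str]:
    """SNPs with the largest absolute frequency difference between the pools."""
    return sorted(differences, key=lambda snp: abs(differences[snp]), reverse=True)[:n]


if __name__ == "__main__":
    high = {"snp1": (45, 50), "snp2": (25, 50)}
    low = {"snp1": (5, 50), "snp2": (24, 50)}
    print(top_candidates(allele_frequency_difference(high, low)))
```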

Epistatic interactions

Non-additive interactions between genes in which the effect of an allele at one locus is modified by the genotypes at other loci in the genome. The resulting phenotype is different from that expected by summing the independent effects of the individual loci.

Introgress

Introducing a genomic region from one strain or species into that of another by repeated backcrossing. By selecting for the phenotype of interest, the genomes become isogenic except for the chromosomal regions causing the selected phenotype.

Paired-end reads

DNA fragments that were sequenced from both ends, yielding pairs of reads that are separated by a defined distance that is dependent on the library preparation protocol.

Soft clipping

Substrings at either end of reads that were not aligned with a local alignment algorithm and are thereby excluded in the subsequent analysis.

Proper pairs

Paired-end reads where both reads of a pair can be mapped to the same chromosome within a distance pre-specified by the insert size chosen during library preparation.

Broken pairs

Paired-end reads that do not map as proper pairs.

Mapping quality

Log (base 10) transformed measure of the probability that a read is incorrectly mapped multiplied by -10.

Base quality

Log (base 10) transformed measure of the probability that a given base call is incorrect multiplied by -10.
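
Mapping quality and base quality are both Phred-scaled, that is, Q = -10 · log10(P), where P is the error probability. The short sketch below converts in both directions; the function names are hypothetical.

```python
from math import log10


def error_prob_to_phred(error_prob: float) -> float:
    """Phred-scaled quality for an error probability in (0, 1]."""
    if not 0.0 < error_prob <= 1.0:
        raise ValueError("error_prob must lie in (0, 1]")
    return -10.0 * log10(error_prob)


def phred_to_error_prob(quality: float) -> float:
    """Error probability for a Phred-scaled quality."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q} -> error probability {phred_to_error_prob(q):g}")
```

A minimum mapping quality of 20 therefore corresponds to an estimated probability of at most 1 in 100 that the read is placed incorrectly.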

Insertions and deletions

(Indels). DNA sequences that have been inserted or deleted from a genomic region. As only phylogenetic analysis allows the distinction between insertions and deletions, indel has been used as an indifferent term.

Strand bias

A variant that is significantly more likely to occur within reads that originate from one of the two strands of DNA.

GWASs

Trait mapping studies that rely on a statistical test to determine associations between sequence variants and a given phenotype in natural populations.

Cline

The gradual change in phenotypes or allele frequencies along a geographical or environmental gradient.

Hitchhiking

The population genetic mechanism by which a neutral, or in some cases slightly deleterious, mutation increases in population frequency solely as a result of physical linkage with a positively selected mutation.

References

Schlötterer C, Tobler R, Kofler R, et al. Sequencing pools of individuals - mining genome-wide polymorphism data without big funding. Nature Reviews Genetics, 2014, 15(11): 749.
