
ncRNA Annotation

Bioinformatics Tips · 生信人 · August 15, 2015, 19:06

1. ncRNA (Non-coding RNA) Analysis

1.1 RNA Classification

ncRNA analysis is the analysis of non-coding RNA. Our current pipeline focuses on tRNA, rRNA, and small RNAs (miRNA, sRNA, snRNA).

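1.2 tRNA Analysis

■ Introduction to tRNA

A tRNA consists of 70-90 nucleotides. Its main function is to serve as the adapter molecule that accurately translates the information in mRNA into the amino acid sequence of a protein during protein biosynthesis. It carries amino acids and is named after the amino acid it carries. It also plays important roles in the initiation of protein biosynthesis, in reverse transcription (DNA synthesis from an RNA template), and in other metabolic regulation. Cells contain many kinds of tRNA, and each amino acid has one or more corresponding tRNAs.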

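■ Introduction to tRNAscan-SE

Identifying tRNA genes is simpler than identifying protein-coding genes, and the problem of predicting tRNA genes computationally is now largely solved. tRNAscan-SE combines several detection and analysis programs: it analyzes the conserved sequences of promoter elements, the tRNA secondary structure, and transcription control elements, and then applies a filtering step that removes the great majority of false positives. It is reported to identify 99% of true tRNA genes, at a search speed of up to 30 kb per second. The program is suited to large-scale analysis of human genome sequences and can also be used on other DNA sequences.

■ Using tRNAscan-SE

Command-line usage: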

Usage: tRNAscan-SE [-options] <FASTA file(s)>

Scan a sequence file for tRNAs using tRNAscan, EufindtRNA & tRNA covariance models --defaults to use with eukaryotic sequences (use -B, -A, -O or -G to scan other types of sequences)

Basic Options

-B or -P : search for bacterial tRNAs (use bacterial tRNA model)

-A : search for archaeal tRNAs (use archaeal tRNA model)

-O : search for organellar (mitochondrial/chloroplast) tRNAs

-G : use general tRNA model (cytoplasmic tRNAs from all 3 domains included)

-C : search using Cove analysis only (max sensitivity, very slow)

-o <file> : save final results in <file>

-f <file> : save tRNA secondary structures to <file>

-a : output results in ACeDB output format instead of default tabular format

-m <file> : save statistics summary for run in <file> (speed, # tRNAs found in each part of search, etc)

-H : show both primary and secondary structure components to covariance model bit scores

-q : quiet mode (credits & run option selections suppressed)

-h : print full list (long) of available options

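■ Key Parameters

-B or -P search for bacterial tRNAs

-A search for archaeal tRNAs

-O search for organellar tRNAs, such as those of mitochondria and chloroplasts

-G use the general tRNA model (cytoplasmic tRNAs from all three domains)

-o final result file, in tabular format by default

-f output file for the tRNA secondary structures

-m output file for the run statistics

-a write the results in ACeDB format

Command line: tRNAscan-SE -o *.tRNA -f *.tRNA.structure <FASTA file(s)>

Note: because the program assumes by default that the input is a eukaryotic genome sequence, only the "-o" and "-f" parameters need to be set in that case. For other types of genome sequence, choose the option that matches the source organism of the input sequence.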
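The choice of model flag described in the note can be sketched in Python as follows. This is only an illustration: the function name, the domain labels, and the file names are hypothetical, and the flags are limited to those listed in the usage text above.

```python
import shlex

# Model flags copied from the usage text; the eukaryotic model is the default
# and needs no flag. The dictionary keys are labels invented for this sketch.
DOMAIN_FLAGS = {
    "eukaryotic": None,
    "bacterial": "-B",
    "archaeal": "-A",
    "organellar": "-O",
    "general": "-G",
}


def build_trnascan_command(fasta, prefix, domain="eukaryotic"):
    """Return the argument list for one tRNAscan-SE run (hypothetical helper)."""
    if domain not in DOMAIN_FLAGS:
        raise ValueError(f"unknown domain: {domain}")
    cmd = ["tRNAscan-SE"]
    flag = DOMAIN_FLAGS[domain]
    if flag is not None:
        cmd.append(flag)
    # -o writes the tabular result, -f the secondary structures.
    cmd += ["-o", f"{prefix}.tRNA", "-f", f"{prefix}.tRNA.structure", fasta]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) once tRNAscan-SE is installed.
    print(shlex.join(build_trnascan_command("genome.fa", "sample", "bacterial")))
```

Building the command as a list keeps file names with spaces safe when it is later handed to subprocess.run.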

■ Interpreting the tRNA Prediction Results

1. *.tRNA: the tRNA result file produced by the -o parameter. Its contents are as follows:

Each predicted tRNA is reported with its position (tRNA Bounds Begin and End), the type of amino acid it carries and its anticodon (tRNA Type and Anti Codon), the position of any intron (Intron Bounds Begin and End), and the prediction score (Cove Score).

(1) The Cove Score in the last column is the score of the detected tRNA. It varies with the model used for the search (the eukaryotic model by default; -B or -P searches for bacterial tRNAs; -A searches for archaeal tRNAs; -O searches for organellar tRNAs; -G uses the general tRNA model).

(2) If a tRNA meets the criteria for a pseudogene, it is labeled "Pseudo" in the "tRNA Type" column.
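As an illustration of how these fields might be read programmatically, here is a minimal Python sketch. It assumes the nine-column tabular layout of tRNAscan-SE 1.x (sequence name, tRNA number, then the fields listed above); the sample rows are made up for the example and are not real predictions.

```python
def parse_trnascan_table(text):
    """Parse tRNAscan-SE tabular output into a list of dicts.

    Assumes nine whitespace-separated columns per data row; header and
    separator lines are skipped because their second field is not a number.
    """
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9 or not fields[1].isdigit():
            continue
        records.append({
            "seq_name": fields[0],
            "trna_num": int(fields[1]),
            "begin": int(fields[2]),
            "end": int(fields[3]),
            "trna_type": fields[4],
            "anticodon": fields[5],
            "intron_begin": int(fields[6]),
            "intron_end": int(fields[7]),
            "cove_score": float(fields[8]),
            # Pseudogenes are flagged in the tRNA Type column.
            "is_pseudo": fields[4] == "Pseudo",
        })
    return records


# Made-up rows, used only to exercise the parser.
SAMPLE = (
    "Sequence\t\ttRNA\tBounds\ttRNA\tAnti\tIntron Bounds\tCove\n"
    "Name\ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore\n"
    "--------\t------\t----\t------\t----\t-----\t-----\t----\t------\n"
    "scaffold1\t1\t1200\t1272\tAla\tTGC\t0\t0\t68.50\n"
    "scaffold1\t2\t5400\t5320\tPseudo\tGTA\t0\t0\t22.10\n"
)

if __name__ == "__main__":
    for rec in parse_trnascan_table(SAMPLE):
        print(rec["seq_name"], rec["trna_type"], rec["cove_score"], rec["is_pseudo"])
```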

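2. *.tRNA.structure: the tRNA secondary structure file produced by the -f parameter.

The secondary structure of tRNA has a "cloverleaf" shape with a number of common structural features. It is generally divided into five arms and four loops, comprising the amino acid acceptor region, the anticodon region, the dihydrouridine region, the TΨC region, and the variable region. Every region except the amino acid acceptor region contains one loop and one arm. The figure below shows the secondary structure of tRNA.

The contents of the tRNA secondary structure file are shown below.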

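1.3 rRNA Analysis

rRNA is the most abundant RNA in the cell, accounting for about 82% of total RNA. rRNA does not perform its function on its own. It combines with many proteins to form the ribosome, which acts as the "assembly machine" for protein biosynthesis. Prokaryotic rRNA falls into three classes: 5S rRNA, 16S rRNA, and 23S rRNA. Eukaryotic rRNA falls into four classes: 5S rRNA, 5.8S rRNA, 18S rRNA, and 28S rRNA. Both prokaryotic and eukaryotic ribosomes consist of a large and a small subunit.

■ rRNA Prediction Methods

There are currently two methods for predicting rRNA.

Method 1: homology-based prediction. rRNAs are found by BLAST alignment against a library of known rRNAs. The rRNAs found this way are accurate, but the result is not comprehensive.

This method requires the client to provide a very closely related reference sequence. Once the reference has been chosen, download its rRNA sequences from NCBI (the *.frn file on NCBI) and extract the relevant rRNAs.

The IDs in the reference rRNA file must follow the format L78479#rRNA_28S. The part before "#" is the sequence ID. These IDs must be unique and consist of letters, digits, and underscores. The part after "#" is the rRNA type.

The final reference rRNA file looks like this: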

>CM_M.grisea1#rRNA_5S

TAACGCACACCAACGTACACGTGCAGGCTGATTAATTGGGTAGGCAAGCCATATGTT

>CM_M.grisea4#rRNA_5S

TGACGCACACCAACGTTTACGTGCAGGCAAATTGATTGGGTAGGAGAGCCATATATT
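The ID rules above can be checked before running BLAST. The following Python sketch is a hypothetical helper, not part of the pipeline. It enforces the "#" separator and ID uniqueness as stated in the text. The text allows letters, digits, and underscores, and the sketch also accepts a period, purely because the sample IDs shown above contain one (an assumption).

```python
import re

# Letters, digits and underscores per the text; the period is accepted only
# because the sample IDs above (e.g. CM_M.grisea1) contain one (assumption).
ID_PATTERN = re.compile(r"^[A-Za-z0-9_.]+$")


def check_reference_headers(fasta_text):
    """Return a list of problems found in the FASTA headers (empty if none)."""
    problems = []
    seen = set()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        header = line[1:].strip()
        seq_id, sep, rna_type = header.partition("#")
        if not sep or not rna_type:
            problems.append(f"{header}: missing '#<rRNA type>' suffix")
            continue
        if not ID_PATTERN.match(seq_id):
            problems.append(f"{header}: ID contains disallowed characters")
        if seq_id in seen:
            problems.append(f"{header}: duplicate ID {seq_id}")
        seen.add(seq_id)
    return problems


if __name__ == "__main__":
    example = ">CM_M.grisea1#rRNA_5S\nTAACGCACACC\n>CM_M.grisea4#rRNA_5S\nTGACGCACACC\n"
    print(check_reference_headers(example) or "all headers look valid")
```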

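■ Result Description

BLAST output contains a great deal of useful information but is not easy to read, so the BLAST result is first converted into a tabular file, and the tabular file is then converted into a standard GFF file.

BLAST parameters: path/blast -p blastn -e 1e-5 -v 10000 -b 10000

The run produces three files: *.blast, *.tab, and *.tag.gff.

*.blast: the original BLAST alignment result

*.tab: the tabular file converted from the BLAST result

The tabular file has 16 columns separated by "\t". If the value of a column is empty, it is replaced by "--". Every column is derived from the original BLAST result. The columns are: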

1:Query_id 2:Query_length 3:Query_start 4:Query_end 5:Subject_id

6:Subject_length 7:Subject_start 8:Subject_end 9:Identity 10:Positive 11:Gap

12:Align_length 13:Score 14:E_value 15:Query_annotation 16:Subject_annotation
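A minimal Python sketch of reading one row of this 16-column file follows. The column names are taken from the list above. The function name and the sample row are made up for illustration, and values are kept as strings because the text does not specify their exact formats.

```python
TAB_COLUMNS = [
    "Query_id", "Query_length", "Query_start", "Query_end", "Subject_id",
    "Subject_length", "Subject_start", "Subject_end", "Identity", "Positive",
    "Gap", "Align_length", "Score", "E_value", "Query_annotation",
    "Subject_annotation",
]


def parse_tab_line(line):
    """Map one tab-separated row to a dict; '--' (empty value) becomes None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(TAB_COLUMNS):
        raise ValueError(f"expected {len(TAB_COLUMNS)} columns, got {len(fields)}")
    return {
        name: (None if value == "--" else value)
        for name, value in zip(TAB_COLUMNS, fields)
    }


if __name__ == "__main__":
    # Made-up row, used only to exercise the parser.
    row = "\t".join([
        "scaffold1", "50000", "100", "215", "CM_M.grisea1#rRNA_5S",
        "116", "1", "116", "0.98", "--", "0", "116", "210", "1e-50", "--", "--",
    ])
    record = parse_tab_line(row)
    print(record["Query_id"], record["Subject_id"], record["E_value"])
```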

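*.gff: the final result file converted from the tabular file, in GFF format.

Method 2: RNAmmer prediction. RNAmmer predicts rRNA using hidden Markov models. The prediction is performed de novo on the assembly.

■ Software Usage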

usage():

path/rnammer -S arc/bac/euk (-multi) (-m tsu,lsu,ssu) (-f) (-k) (-gff [gff file]) (-xml [xml file]) (-f [fasta file]) (-h [HMM report]) [sequence]

OPTIONS

-S Specifies the super kingdom of the input sequence. Can be either 'arc', 'bac', or 'euk'.

-gff output gff file Specifies filename for output in GFF version 2 output

-multi Runs all molecules and both strands in parallel

-f fasta Specifies filename for output fasta file of predicted rRNA genes

-h hmmreport Specifies filename for output HMM report.

-m Molecule type can be 'tsu' for 5/8s rRNA, 'ssu' for 16/18s rRNA, 'lsu' for 23/28s rRNA or any combination separated by comma.

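■ Key Parameters

-S specifies the superkingdom of the input sequence: 'arc', 'bac', or 'euk'.

'arc': archaea; 'bac': bacteria; 'euk': eukaryotes

-m rRNA type. 'tsu': 5/8s rRNA; 'ssu': 16/18s rRNA; 'lsu': 23/28s rRNA

-gff specifies the name of the output GFF file

-f specifies the name of the output FASTA file containing the predicted rRNAs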
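The parameters above can be combined into a single RNAmmer invocation. The Python sketch below only assembles the argument list. The function name and file names are hypothetical, the executable is assumed to be on the PATH as rnammer, and the .fq and .gff extensions follow the result files named in this article.

```python
VALID_KINGDOMS = ("arc", "bac", "euk")
VALID_MOLECULES = ("tsu", "ssu", "lsu")


def build_rnammer_command(sequence, prefix, kingdom, molecules=VALID_MOLECULES):
    """Return the argument list for one RNAmmer run (hypothetical helper)."""
    if kingdom not in VALID_KINGDOMS:
        raise ValueError(f"kingdom must be one of {VALID_KINGDOMS}")
    unknown = [m for m in molecules if m not in VALID_MOLECULES]
    if unknown:
        raise ValueError(f"unknown molecule type(s): {unknown}")
    return [
        "rnammer",
        "-S", kingdom,               # superkingdom of the input sequence
        "-m", ",".join(molecules),   # rRNA types to search for
        "-gff", f"{prefix}.gff",     # GFF output
        "-f", f"{prefix}.fq",        # FASTA output of the predicted rRNAs
        sequence,
    ]


if __name__ == "__main__":
    print(" ".join(build_rnammer_command("assembly.fa", "sample", "euk")))
```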

■ Result Description

1. *.fq: the file produced by the -f parameter, in FASTA format.

>rRNA_scaffold164_671644-671759_DIR+ /molecule=8s_rRNA /score=36.0

ACGACCAGAGGACAATGAAATCAGGGCTTCCCGTCCGCTCAGCCATACTTAAGC

ATTGTACCGGTGGATTAGTAGTTAGGTGGGAGACCACTAGCGAATACCCGCTGC

CGTATGTT

2. *.gff: the file produced by the -gff parameter.

seqname source feature start end score +/- frame attribute

scaffold164 RNAmmer-1.2 rRNA 671644 671759 36.0 + . 8s_rRNA

scaffold9 RNAmmer-1.2 rRNA 720308 720423 32.3 - . 8s_rRNA
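A short Python sketch of reading this GFF output is given below, using the two rows shown above as sample data. It assumes nine whitespace-separated fields per row with no spaces inside the attribute column. The function name is made up for the example.

```python
def parse_rnammer_gff(text):
    """Parse RNAmmer GFF rows into dicts, skipping comments and the header row."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#") or line.startswith("seqname"):
            continue
        fields = line.split()
        if len(fields) != 9:
            continue
        seqname, source, feature, start, end, score, strand, frame, attribute = fields
        records.append({
            "seqname": seqname,
            "source": source,
            "feature": feature,
            "start": int(start),
            "end": int(end),
            "score": float(score),
            "strand": strand,
            "frame": frame,
            "attribute": attribute,
        })
    return records


SAMPLE = """seqname source feature start end score +/- frame attribute
scaffold164 RNAmmer-1.2 rRNA 671644 671759 36.0 + . 8s_rRNA
scaffold9 RNAmmer-1.2 rRNA 720308 720423 32.3 - . 8s_rRNA
"""

if __name__ == "__main__":
    for rec in parse_rnammer_gff(SAMPLE):
        length = rec["end"] - rec["start"] + 1
        print(rec["seqname"], rec["attribute"], rec["strand"], length)
```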

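1.4 miRNA, snRNA, and sRNA Analysis

■ Prediction Principle

miRNA, snRNA, and sRNA are currently predicted by alignment against the Rfam database.

Rfam is a comprehensive, non-redundant database of non-coding RNA families, represented by multiple sequence alignments and profile stochastic context-free grammars. It aims to facilitate the identification and classification of known sequence families.

■ Method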

First, align the query sequence against the database sequences with BLAST to find any possible hits. Then cut the aligned fragment out of the query sequence and run cmsearch on it with the matched Rfam family.

The file Rfam.thr contains the Rfam ID, RNA name, threshold, max length, and status. The threshold is the recommended bit-score cutoff for cmsearch, and the max length is the recommended length of the fragment cut from the query.
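The threshold step might look like the following Python sketch. It rests on several assumptions that the article does not confirm: Rfam.thr is taken to be tab-separated with its five fields in the order listed above, the hit dictionaries and function names are invented for the example, and the sample values are made up.

```python
def load_thresholds(text):
    """Parse Rfam.thr-style text into {rfam_id: (threshold, max_length)}.

    Assumes five tab-separated fields per line: Rfam ID, RNA name,
    threshold, max length, status. Lines that do not fit are skipped.
    """
    table = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue
        rfam_id, _name, threshold, max_length, _status = fields[:5]
        try:
            table[rfam_id] = (float(threshold), int(max_length))
        except ValueError:
            continue  # header line or malformed row
    return table


def confident_hits(hits, thresholds):
    """Keep cmsearch hits whose bit score reaches the family's threshold."""
    kept = []
    for hit in hits:
        entry = thresholds.get(hit["rfam_id"])
        if entry is not None and hit["bit_score"] >= entry[0]:
            kept.append(hit)
    return kept


if __name__ == "__main__":
    # Made-up threshold table and hits, used only to exercise the functions.
    thr = load_thresholds("RF00001\t5S_rRNA\t50.0\t150\tFAMILY\n")
    hits = [
        {"rfam_id": "RF00001", "bit_score": 62.4},
        {"rfam_id": "RF00001", "bit_score": 31.0},
    ]
    print(confident_hits(hits, thr))
```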

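■ Result Description

BLAST parameters: path/blast -p blastn -W 7 -e 1 -v 10000 -b 10000 -m 8

*.m8: the original BLAST alignment result

*.m8.filter: the BLAST alignment result after redundancy removal; only the best alignment (the one with the smallest E-value) is kept

*.all.align: the raw prediction result

*.gff: the raw prediction result, in GFF3 format

*.confident.gff: the prediction result after filtering by threshold, in GFF3 format

*.confident.nr.gff: the final result after removing redundancy from *.confident.gff, in GFF3 format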
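The redundancy removal that produces *.m8.filter can be sketched in Python as below. The twelve-column layout is the standard BLAST -m 8 tabular format. The redundancy rule is an assumption, since the article only says that the alignment with the smallest E-value is kept: here, hits that overlap on the same query sequence are treated as redundant. The function names and sample rows are made up.

```python
def parse_m8(text):
    """Parse BLAST -m 8 tabular output (12 tab-separated columns per hit)."""
    hits = []
    for line in text.splitlines():
        f = line.split("\t")
        if len(f) != 12:
            continue
        q_start, q_end = sorted((int(f[6]), int(f[7])))
        hits.append({
            "query": f[0],
            "subject": f[1],
            "q_start": q_start,
            "q_end": q_end,
            "evalue": float(f[10]),
            "bit_score": float(f[11]),
        })
    return hits


def filter_best(hits):
    """Greedy filter: visit hits from smallest E-value up and drop any hit
    that overlaps an already kept hit on the same query (assumed rule)."""
    kept = []
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        overlaps = any(
            k["query"] == hit["query"]
            and k["q_start"] <= hit["q_end"]
            and hit["q_start"] <= k["q_end"]
            for k in kept
        )
        if not overlaps:
            kept.append(hit)
    return kept


# Made-up hits: the first two overlap on scaffold1, the third does not.
SAMPLE_M8 = (
    "scaffold1\tRF00001_a\t95.00\t100\t5\t0\t100\t199\t1\t100\t1e-10\t150\n"
    "scaffold1\tRF00002_b\t85.00\t80\t12\t0\t150\t229\t1\t80\t1e-03\t60\n"
    "scaffold1\tRF00003_c\t90.00\t50\t5\t0\t1000\t1049\t1\t50\t1e-05\t70\n"
)

if __name__ == "__main__":
    for hit in filter_best(parse_m8(SAMPLE_M8)):
        print(hit["query"], hit["subject"], hit["q_start"], hit["q_end"], hit["evalue"])
```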
