I. Terms to define

Bioinformatics
Dotplot algorithm
Molecular clock
Hidden Markov model (HMM)
Gene Ontology (GO)
Molecular phylogenetic tree
Sequence alignment
Gap penalty
Linear gap penalty / constant gap penalty
Multiple sequence alignment
Relational database
Dayhoff mutation data matrix
BLOSUM matrix (blocks substitution matrix)
SCOP (Structural Classification of Proteins) protein structure classification database
CATH protein structure classification database
Phylogenetic tree
Species tree
Gene tree
Rooted tree, unrooted tree
Maximum likelihood method
Protein structure prediction by homology modeling
Ab initio protein structure prediction
Protein folding
FASTA-ALL
NCBI
EBI
GenBank
Entrez
SRS system
Homology, identity, similarity
Neutral theory of molecular evolution
Least squares method
Neighbor-joining method
Maximum parsimony
Genome annotation
Genomics
Proteomics
PDB
MEGA software
PHYLIP software
Dynamic programming algorithm
Smith-Waterman algorithm
Needleman-Wunsch algorithm
BLAST, BLASTn, BLASTp
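II. Review questions

1. What is bioinformatics? What are its main applications?
2. Briefly describe the major landmark achievements in the history of bioinformatics.
3. Some say that biology will be the hotbed of the next technological revolution. In what ways do you think bioinformatics will contribute to the industrialization of biology?
4. What is a biological database? Give examples.
5. What is the difference between a primary database and a secondary database? Give examples.
6. What search routes does Entrez provide?
7. Why perform sequence alignment? Using pairwise nucleic acid sequence alignment as an example, briefly describe the basic principle of sequence alignment.
8. Given two sequences, catgt and acgctg, use dynamic programming to perform a global sequence alignment (complete the alignment matrix and find the best alignment; scoring: match = 2, mismatch = -1, gap penalty = -1).

               0    1    2    3    4    5    6
                    a    c    g    c    t    g
     0         0   -1   -2   -3   -4   -5   -6
     1    c   -1
     2    a   -2
     3    t   -3
     4    g   -4
     5    t   -5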
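As a way to check a hand-filled matrix for the global alignment exercise above, the following is a minimal sketch of Needleman-Wunsch style dynamic programming with a linear gap penalty. The function name `needleman_wunsch` and its parameter names are illustrative assumptions, not part of the course material, and the traceback returns only one of possibly several equally good alignments.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-1):
    """Global alignment of a (rows) and b (columns) with a linear gap penalty.

    Returns the filled score matrix and one optimal alignment.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # First column and first row: aligning a prefix against gaps only.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Each cell takes the best of: diagonal (match/mismatch), up (gap in b),
    # left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right corner; ties are broken in the order
    # diagonal, up, left, so only one optimal alignment is reported.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return score, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    matrix, aligned_a, aligned_b = needleman_wunsch("catgt", "acgctg")
    for row in matrix:
        print(" ".join(f"{v:3d}" for v in row))
    print(aligned_a)
    print(aligned_b)
```

The same function can be called with other scoring schemes, for example `needleman_wunsch("CACGA", "CGA", match=1, mismatch=0, gap=-1)`.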
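9. Given two sequences, CACGA and CGA, use the Smith-Waterman algorithm to perform a sequence alignment (build the alignment matrix and find the best alignment; scoring: match = 1, mismatch = 0, gap penalty = -1).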
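For the local alignment exercise above, a minimal sketch of the Smith-Waterman recurrence may help when checking a hand-built matrix. The function name `smith_waterman` and its parameters are assumptions made for illustration. Cells are floored at 0, the traceback starts at the highest-scoring cell and stops at the first cell with score 0, and only one optimal local alignment is reported.

```python
def smith_waterman(a, b, match=1, mismatch=0, gap=-1):
    """Local alignment of a (rows) and b (columns) with a linear gap penalty.

    Returns the best local score and one pair of aligned substrings.
    """
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                      # start a new local alignment here
                h[i - 1][j - 1] + s,    # extend diagonally
                h[i - 1][j] + gap,      # gap in b
                h[i][j - 1] + gap,      # gap in a
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the maximum cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("CACGA", "CGA"))
```

Unlike global alignment, the first row and column stay at 0, so unaligned flanking residues cost nothing.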
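10. Given two sequences, CACGA and CGA, use the dynamic programming method of S. Needleman and C. Wunsch to perform a sequence alignment (build the alignment matrix and find the best alignment; scoring: match = 1, mismatch = 0, gap penalty = -1).
11. Briefly describe the workflow for protein secondary structure prediction.
12. Briefly describe the workflow for predicting protein tertiary structure by homology modeling.
13. Why perform protein structure alignment? Briefly describe its basic principle.
14. Why is the higher-order structure of a protein said to be determined by its primary structure?
15. Briefly describe the workflow for predicting protein-coding genes.
16. Briefly describe the basic workflow of genome annotation.
17. How can eukaryotic protein-coding genes be predicted ab initio?
18. Briefly describe the basic idea of building a phylogenetic tree with the neighbor-joining method.
19. Briefly describe the basic idea of building a phylogenetic tree with the UPGMA method.
20. Briefly describe the basic idea of building a phylogenetic tree with the maximum parsimony method.
21. Given four sequences, A: TAGG; B: TACG; C: AAGC; D: AGCC, build a phylogenetic tree using the UPGMA method.
22. Given four sequences, A: TAGG; B: TACG; C: AAGC; D: AGCC, build a phylogenetic tree using the neighbor-joining method.
23. What is the neutral theory? What are its implications for molecular evolution?
24. In your view, what basic computing, biology, and mathematics background is needed to study bioinformatics?
25. What is the molecular clock hypothesis?
26. Briefly describe the steps for building a phylogenetic tree.

III. Literature reading

1. Welcome to the UCSC Genome Browser website. This site contains the reference sequence and working draft assemblies for a large collection of genomes. It also provides portals to the ENCODE and Neandertal projects. We encourage you to explore these sequences with our tools. The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide. The Gene Sorter shows expression, homology and other information on groups of genes that can be related in many ways. Blat quickly maps your sequence to the genome. The Table Browser provides convenient access to the underlying database. VisiGene lets you browse through a large collection of in situ mouse and frog images to examine expression patterns. Genome Graphs allows you to upload and display genome-wide data sets.

2. WebLogo is a web-based application designed to make the generation of sequence logos as easy and painless as possible. Click here to create your own sequence logos.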
Sequence logos are a graphical representation of an amino acid or nucleic acid multiple sequence alignment developed by Tom Schneider and Mike Stephens. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. In general, a sequence logo provides a richer and more precise description of, for example, a binding site, than would a consensus sequence.
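The relationship between stack height and symbol height described above can be made concrete with a short sketch. It assumes the usual information-content measure for sequence logos: the stack height at a position is log2(alphabet size) minus the Shannon entropy of the observed frequencies, and each symbol's height is its frequency times the stack height. The function name `logo_heights` and the toy alignment are hypothetical, gap characters are not handled, and the small-sample correction used by real logo tools is deliberately omitted.

```python
import math
from collections import Counter


def logo_heights(seqs, alphabet="ACGT"):
    """Per-position symbol heights (in bits) for an ungapped alignment.

    Assumes every sequence has the same length and contains only
    characters from `alphabet`.
    """
    max_bits = math.log2(len(alphabet))  # 2 bits for DNA, about 4.32 for proteins
    columns = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        total = sum(counts[ch] for ch in alphabet)
        freqs = {ch: counts[ch] / total for ch in alphabet if counts[ch] > 0}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        stack_height = max_bits - entropy           # conservation at this position
        columns.append({ch: f * stack_height for ch, f in freqs.items()})
    return columns


if __name__ == "__main__":
    toy_alignment = ["ACGT", "AAGT", "ATGC", "AGGA"]  # hypothetical data
    for pos, heights in enumerate(logo_heights(toy_alignment), start=1):
        print(pos, {ch: round(h, 3) for ch, h in heights.items()})
```

A fully conserved DNA position yields a single symbol 2 bits tall, while a position where all four bases are equally frequent yields a stack of height 0, which is why a logo conveys more than a consensus sequence.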
3. The National Center for Biotechnology Information (NCBI) is one of the world's premier Web sites for biomedical and bioinformatics research. Based within the National Library of Medicine at the National Institutes of Health, USA, the NCBI hosts many databases used by biomedical and research professionals. The services include PubMed, the bibliographic database; GenBank, the nucleotide sequence database; and the BLAST algorithm for sequence comparison, among many others.

4. KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from genomic and molecular-level information. It is a computer representation of the biological system, consisting of molecular building blocks of genes and proteins (genomic information) and chemical substances (chemical information) that are integrated with the knowledge on molecular wiring diagrams of interaction, reaction and relation networks (systems information). The KEGG website at www.kegg.jp has become the primary site of the KEGG database developed by Kanehisa Laboratories. The GenomeNet website at www.genome.jp operated by Kyoto University Bioinformatics Center will continue to mirror the KEGG database and provide additional KEGG-based analysis services.

5. The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced at the National Center for Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to grow at an exponential rate, doubling every 10 months.
Release 134, produced in February 2003, contained over 29.3 billion nucleotide bases in more than 23.0 million sequences. GenBank is built by direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centers.
6. PubMed is a database developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), one of the institutes of the National Institutes of Health (NIH). The database was designed to provide access to citations (with abstracts) from biomedical journals. Subsequently, a linking feature was added to provide access to full-text journal articles at Web sites of participating publishers, as well as to other related Web resources. PubMed is the bibliographic component of the NCBI's Entrez retrieval system.

7. Bioinformatics consists of a computational approach to biomedical information management and analysis. It is being used increasingly as a component of research within both academic and industrial settings and is becoming integrated into both undergraduate and postgraduate curricula. The new generation of biology graduates is emerging with experience in using bioinformatics resources and, in some cases, programming skills.
8. The resources provided by NCBI for studying the three-dimensional (3D) structures of proteins center around two databases: the Molecular Modeling Database (MMDB), which provides structural information about individual proteins; and the Conserved Domain Database (CDD), which provides a directory of sequence and structure alignments representing conserved functional domains within proteins (CDs). Together, these two databases allow scientists to retrieve and view structures, find structurally similar proteins to a protein of interest, and identify conserved functional sites. To enable scientists to accomplish these tasks, NCBI has integrated MMDB and CDD into the Entrez retrieval system (Chapter 15). In addition, structures can be found by BLAST, because sequences derived from MMDB structures have been included in the BLAST databases (Chapter 16). Once a protein structure has been identified, the domains within the protein, as well as domain "neighbors" (i.e., those with similar structure) can be found. For novel data not yet included in Entrez, there are separate search services available. Protein structures can be visualized using Cn3D, an interactive 3D graphic modeling tool. Details of the structure, such as ligand-binding sites, can be scrutinized and highlighted. Cn3D can also display multiple sequence alignments based on sequence and/or structural similarity among related sequences, 3D domains, or members of a CDD family. Cn3D images and alignments can be manipulated easily and exported to other applications for presentation or further analysis.

9. R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

10. EMBL-EBI is a pivotal partner in ELIXIR, the European life sciences infrastructure for biological information, as part of the European Strategy on Research Infrastructures (ESFRI) process. On behalf of ELIXIR, EMBL-EBI coordinates BioMedBridges, which develops technical bridges for interoperability between data and services in the ESFRI biomedical sciences research infrastructures. These collaborative projects are undertaken with our partners in the European Member States.

11. PubMed Central (PMC) is the National Library of Medicine's digital archive of full-text journal literature. Journals deposit material in PMC on a voluntary basis. Articles in PMC may be retrieved either by browsing a table of contents for a specific journal or by searching the database. Certain journals allow the full text of their articles to be viewed directly in PMC. These are always free, although there may be a time lag of a few weeks to a year or more between publication of a journal issue and when it is available in PMC. Other journals require that PMC direct users to the journal's own Web site to see the full text of an article. In this case, the material will always be available free to any user no more than 1 year after publication but will usually be available only to the journal's subscribers for the first 6 months to 1 year. To increase the functionality of the database, a variety of links are added to the articles in PMC: between an article correction and the original article; from an article to other articles in PMC that cite it; from a citation in the references section to the corresponding abstract in PubMed and to its full text in PMC; and from an article to related records in other Entrez databases such as Reference Sequences, OMIM, and Books.

12. The primary data produced by genome sequencing projects are often highly fragmented and sparsely annotated. This is especially true for the Human Genome Project as a result of its policy of releasing sequence data to the public sequence databases every day (1, 2). So that individual researchers do not have to piece together extended segments of a genome and then relate the sequence to genetic maps and known genes, NCBI provides annotated assemblies of public genome sequence data.
NCBI assimilates data of various types, from numerous sources, to provide an integrated view of a genome, making it easier for researchers to spot informative relationships that might not have been apparent from looking at the primary data. The annotated genomes can be explored using Map Viewer (Chapter 20) to display different types of data side-by-side and to follow links between related pieces of data. This chapter describes the series of steps, the "pipeline", that produces NCBI's annotated genome assembly from data deposited in the public sequence databases. A variant of the annotation process developed for the human genome is used to annotate the mouse genome, and similar procedures will be applied to other genomes (Box 1).
Box 1. Annotation of other genomes. NCBI may assemble a genome prior to annotation, add annotations to a genome assembled elsewhere, or simply process an annotated genome to produce RefSeqs and maps for display in Map Viewer (Chapter 20). The basic procedures ...

NCBI constantly strives to improve the accuracy of its human genome assembly and annotation, to make the data displays more informative, and to enhance the utility of our access tools. Each run through the assembly and annotation procedure, together with feedback from outside groups and individual users, is used to improve the process, refine the parameters for individual steps, and add new features. Consequently, the details of the assembly and annotation process change from one run to the next. This chapter, therefore, describes the overall human genome assembly and annotation process and provides short descriptions of the key steps, but it does not detail specific procedures or parameters. However, sufficient detail is provided to enable users of our assembly and annotations to become familiar with the complexities and possible limitations of the data we provide.