Renci
Renaissance Computing Institute



The Carolina Center for Exploratory Genetic Analysis

The Carolina Center for Exploratory Genetic Analysis (CCEGA) is developing an interdisciplinary cyberinfrastructure to identify the complex genetic traits that underlie human diseases, bringing together data from clinical studies, population studies and model systems. Funded by the National Institutes of Health, CCEGA is a collaboration among RENCI; the UNC departments of Biostatistics, Genetics, Epidemiology, and Computer Science; the schools of Pharmacy and Information and Library Science; and the Health Sciences Library.

  • Overview
  • CCEGA website
  • HAP-SAMPLE

The Carolina Center for Exploratory Genetic Analysis (CCEGA) is developing an interdisciplinary infrastructure to identify the complex genetic traits that underlie human diseases, bringing together data from clinical studies, population studies and model systems. CCEGA believes the next breakthroughs in our understanding of biology and disease will be made possible by the integrated analysis of genetic data and its expression as phenotypes. CCEGA's work centers on enabling this kind of multidisciplinary, multi-investigator research. The center involves three complementary groups of scientists at the University of North Carolina at Chapel Hill: (a) experimental geneticists, (b) quantitative experts in statistics and biostatistics, and (c) computer scientists with expertise in algorithm development, software construction, and high-performance computing.

Phase one of CCEGA focuses on building a community of investigators and deploying a prototype infrastructure for analyzing relationships among genotypes and phenotypes in three contexts:

  • Family linkage studies, which examine the relationship between genotypes and susceptibility to specific diseases and conditions, in this case alcohol addiction.
  • Gene expression profile studies, which develop a picture of genes and cellular activity in order to identify patterns and signatures related to disease, in this case breast cancer.
  • Public health studies, which look at communities and their risk factors for diseases, in this case atherosclerosis.

The RENCI Contribution
To accommodate the diverse, multi-investigator databases necessary to answer these complex questions, RENCI is working with scientists to develop a prototype, extensible data model and provide access to data via a portal constructed using the Open Grid Computing Environment toolkit. The newest methods of integrated data analysis will be incorporated into a portal-based workflow. These include new techniques in linkage analysis (oligogenic analysis, multivariate linkage analysis, epistasis, and genotype by environment interaction), subspace clustering, and association analysis (quantitative trait and nucleotide analysis).

RENCI and its scientific partners are also exploring new visualization techniques for examining and interacting with large data sets, as well as high-performance computing for implementing computationally intensive analysis techniques. To reduce the barriers between data providers and data analyzers, CCEGA and RENCI conduct intensive, specialized workshops, colloquia and intramural meetings.

Funding
National Institutes of Health/National Center for Research Resources, Grant Number 5-P20-RR020751-01-02

Co-Principal Investigators at UNC-Chapel Hill
    James Evans, Terry Magnuson, Karen Mohlke, Fernando Manuel Pardo, Charles Perou, Patrick Sullivan, David Threadgill, Kirk Wilhelmsen, Department of Genetics
    Susan Paulsen, Jan Prins, Wei Wang, Department of Computer Science
    Fred Wright, Fei Zou, Department of Biostatistics
    Bradley Hemminger, School of Information and Library Science
    Andrew Nobel, Department of Statistics
    Kari North, Department of Epidemiology
    Alexander Tropsha, School of Pharmacy
    K.T.L. Vaughan, Health Sciences Library
RENCI Team
    Xiaojun Guan
    Kevin Gamiel
    Clark Jeffries
    Jeff Tilson

Publications

Fred A. Wright, Hanwen Huang, Xiaojun Guan, Kevin Gamiel, Clark Jeffries, William T. Barry, Fernando Pardo-Manuel, Patrick F. Sullivan, Kirk C. Wilhelmsen, and Fei Zou. Simulating Association Studies: a Data-based Resampling Method for Candidate Regions or Whole Genome Scans (accepted for publication in Bioinformatics), 2007.

Jeffries, C. Hairpin Database: Why and How? Genomic Impact of Eukaryotic Transposable Elements conference, Asilomar, CA, April 2006.

Jeffries, C. Bipartite and tripartite systems and matrices from genetic control research, Linear Algebra and its Applications 409 (2005) 70-78.

Jeffries, C., Jarstfer, M., Perkins, D.: Folded RNA from an intron of one gene might inhibit expression of a competing gene, In Silico Biology 5 (2005), 0037.

Jeffries, C., Perkins, D., Jarstfer, M.: Systematic discovery of the grammar of translational inhibition by RNA hairpins, Journal of Theoretical Biology (accepted for publication).

J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel, J. Prins, "Mining Approximate Frequent Itemsets In the Presence of Noise: Algorithm and Analysis", SIAM Conference on Data Mining (SDM), 2006.

J. Liu, S. Paulsen, W. Wang, A. Nobel, J. Prins, "Mining Approximate Frequent Itemsets from Noisy Data", Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 2005.

Hemminger BM, Saelim B, Sullivan PF. TAMAL: An integrated approach to choosing SNPs for genetic studies of human complex traits. Bioinformatics 2006.

Presentations

From the First CCEGA Workshop, January 21, 2005

Introduction and Context Dan Reed, Chancellor's Eminent Professor, Vice-Chancellor for Information Technology and CIO, and Director, Renaissance Computing Institute (RENCI)

Workshop Format Kirk Wilhelmsen, Department of Genetics

Addiction Family Study Kirk Wilhelmsen, Department of Genetics

Strong Heart Kari North, Epidemiology

Diabetes, Fusion Karen Mohlke, Department of Genetics

CATIE (Clinical Antipsychotic Trial of Intervention Effectiveness), Schizophrenia Pat Sullivan, Department of Genetics

Cystic Fibrosis Mike Knowles, Department of Medicine

Cancer Epidemiology Bob Millikan, Epidemiology

Head and Neck Epidemiology Andy Olshan, Epidemiology

Renal Disease Gene Expression Ron Falk, Department of Medicine

ELSI/Prospective Studies Jim Evans, Department of Genetics

CCEGA Analysis Methods Workshop, May 4, 2005

Introduction NIH Site Visit, May 4, 2005

Linkage analysis / family-based association studies Kari North, Epidemiology

Model system for evaluation of data mining techniques Susan Paulsen, Computer Science

Subspace clustering methods Wei Wang, Computer Science

Visualization of high-dimensional data Leonard McMillan, Computer Science

Complex phenotypes: schizophrenia and ventricle morphology Guido Gerig, Psychiatry and Computer Science

Realistic simulation of genotypes Fred Wright, Biostatistics

Genetics viewpoint Pat Sullivan, Genetics

NIH Site Visit, May 4, 2005

Introduction Dan Reed, Chancellor's Eminent Professor, Vice-Chancellor for Information Technology and CIO, and Director, Renaissance Computing Institute (RENCI)

Project Overview Kirk Wilhelmsen, Department of Genetics

ELSI Working Group Jim Evans, Department of Genetics

Informatics Working Group Brad Hemminger, Information and Library Science

Analysis Working Group Jan Prins, Department of Computer Science

NIH Roadmap Program Greg Farber, NIH

CCEGA Workshop, Feb 2, 2007

Introduction Kirk Wilhelmsen, Department of Genetics

Data Modeling, Informatics Working Group Brad Hemminger, School of Information and Library Science

Realistic Simulation of Genotypes Fred Wright, William Barry, Department of Biostatistics

Random Forest on a Culled Set of SNPs Susan Paulsen, Jan Prins, Department of Computer Science

Preliminary Statistical Analysis of Bakeoff Data Fei Zou, Seunggeun Lee, Department of Biostatistics

Analysis of Simulated Genetic Data Based on Goodness of Fit Chi-square Test Alex Tropsha, Alexander Golbraikh, School of Pharmacy, Steve Marron, Department of Statistics

Bakeoff Summary Fred Wright, William Barry, Department of Biostatistics

 

Partners
RENCI
University of North Carolina at Chapel Hill

Activities

Working Groups
There are three working groups that meet every week to have discussions and presentations on specific topics.

  • Informatics Working Group - Meets 12:30 - 1:30 every Thursday, contact: Bradley Hemminger, School of Information and Library Science, bmh@ils.unc.edu, 919-966-2998.

    Workgroup Presentations
    PCaP-Epidemiology Specimen Tracking System Roger Akers, Feb. 17, 2005
    Lab Data Management Systems Kirk Wilhelmsen, Feb. 24, 2005
    Demo of Lab Data and Clinical Data Management Systems Kirk Wilhelmsen, Mar. 3, 2005
    Generalized Model (Modeling Genetics and Proteomics studies) Brad Hemminger, Mar. 10, 2005
    Review Draft Model Brad Hemminger, Mar. 24, 2005 (meeting minutes).
    Review Draft Model Brad Hemminger, Mar. 31, 2005 (meeting minutes).
    Review Draft Model All, Apr. 7, 2005 (meeting minutes).
    BSP (BioSpecimen Project) Facility Peter DeSaix, Paul Brown, May 19, 2005 (meeting minutes).
    Knowles lab and their databases related to cystic fibrosis (CF) Mike Knowles, Hemant Kelkar, Annie Xu and David Fargo, May 26, 2005 (meeting minutes).
    Identify genes that influence cardiovascular disease Kari North, Jun 2, 2005 (meeting minutes).
    Data management issues of Melanoma project Dennis Simpson, July 7, 2005 (meeting minutes).
    Compare and merge database schemas of UNC labs Offsite meeting at the Friday Center, Nov 8, 2005
    Review initial draft of the schema for the common data model March 24, 2006 (meeting minutes).
  • Analysis Working Group - Meets 11:00 - 12:00 every Tuesday, contact: Jan Prins, Department of Computer Science, prins@cs.unc.edu, 919-962-1913.

    Workgroup Presentations
    Linkage Analysis Ethan Lange, Feb. 3, 2005
    Family Based Association Studies Kirk Wilhelmsen, Feb. 8, 2005 (summary).
    A Model of Genetic Data Fred Wright, Feb. 22, 2005 (summary).
    A Model of Genetic Data Fred Wright, Mar. 8, 2005 (summary).
    Survey of Data Mining Techniques for Genotype-Phenotype Association Studies Pat Sullivan (slides) and Susan Paulsen (slides), Mar. 22, 2005 (summary).
    Survey of Data Mining Techniques for Genotype-Phenotype Association Studies Pat Sullivan (discussion), Mar. 29, 2005
    Quantitative Genotype Phenotype Relationships (QGPR): Can we learn from Quantitative Structure Activity Relationships (QSAR) modeling? Alex Tropsha (slides), Apr. 5, 2005 (summary).
    Application of neural networks to find functions of selected, weighted combinations of measurable parameters Clark Jeffries (slides), Apr. 12, 2005 (summary).
    Microarray Data and Analysis Charles Perou (slides), Apr. 19, 2005 (summary).
    Classification Accuracy Criteria as Target Functions in QSAR Alexander Golbraikh (slides), Apr. 26, 2005


  • ELSI Working Group - contact: James Evans, Department of Genetics, jpevans@med.unc.edu, 919-966-2276.

Tutorials
As part of the Carolina Center for Exploratory Genetic Analysis (CCEGA), we have organized a tutorial series. The goal of the series is to provide cross-disciplinary education and facilitate interdisciplinary collaboration.

Tutorial Presentations (slides and audio)
Genotyping, (slides), (slides and audio) Bob Millikan, Apr. 19, 2005
XML, (slides), (slides and audio) Barrie Hayes, Apr. 19, 2005

Data Sets

We provide simulated data sets generated according to certain genetic models to allow computational scientists to evaluate various algorithms for mapping genotype-phenotype relationships.

Developmental Model 1: Diffusion-Threshold
Two different sets of "Case-Control" data are available. In the Null data, there is no connection between phenotype and genotype. In the Developmental Model 1 data, phenotype is determined by genotype through a simulated developmental process dependent upon a subset of the genetic loci. The two types of data sets share the same population genetic parameters. Each collection includes 1000 files, one for each replicate data set. Each collection is packaged into three zip files for easy downloading, and one sample from each collection has been provided.

Contact: Susan Paulsen, Department of Computer Science, paulsen@cs.unc.edu

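
To make the difference between the two collections concrete, the following is a minimal sketch of how one null replicate and one model replicate could be generated. It is an illustration only: the function name `simulate_replicate`, the minor allele frequency `maf`, and the simple additive threshold rule are assumptions made here, and the rule is a toy stand-in, not the Diffusion-Threshold developmental process used to produce the downloadable files.

```python
import random

def simulate_replicate(n_individuals, n_loci, causal_loci=None,
                       threshold=2, maf=0.3, seed=0):
    """Return (genotypes, phenotypes) for one hypothetical replicate.

    Genotypes are minor-allele counts (0, 1 or 2) per locus.
    causal_loci=None gives a null replicate: phenotype is drawn
    independently of genotype. Otherwise phenotype is 1 when the
    minor-allele count summed over the causal loci reaches `threshold`
    (a toy rule assumed for illustration).
    """
    rng = random.Random(seed)
    # Each genotype is the sum of two independent allele draws, so both
    # kinds of replicate share the same population genetic parameters.
    genotypes = [
        [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_loci)]
        for _ in range(n_individuals)
    ]
    if causal_loci is None:
        # Null data: no connection between phenotype and genotype.
        phenotypes = [rng.randint(0, 1) for _ in range(n_individuals)]
    else:
        # Model data: phenotype depends only on a subset of the loci.
        phenotypes = [
            int(sum(row[j] for j in causal_loci) >= threshold)
            for row in genotypes
        ]
    return genotypes, phenotypes

null_genotypes, null_phenotypes = simulate_replicate(100, 20, seed=1)
model_genotypes, model_phenotypes = simulate_replicate(
    100, 20, causal_loci=[0, 3, 7], seed=1
)
```

An association-mapping algorithm under evaluation should report no signal on the null replicate and should recover the causal loci on the model replicate.
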
Participants

Daniel A. Reed, Renaissance Computing Institute
919-966-1585

Daniel A. Reed, Ph.D., is the Chancellor's Eminent Professor and the founding director of the Renaissance Computing Institute. He also serves as the Vice-Chancellor for Information Technology for the University of North Carolina at Chapel Hill. His research interests are in high-performance computing, computational Grids, scientific collaboration and computer systems.

Terry Magnuson, Department of Genetics
919-843-6475

Terry Magnuson, Ph.D., is the Sarah Graham Kenan Professor and founding chair of the Department of Genetics at UNC. He also heads the Carolina Center for Genome Sciences (CCGS). The CCGS includes experimental, social and analytical genomics divisions, with the latter unit including specialists in basic and applied biomedical computing.

Bradley Hemminger, School of Information and Library Science
919-966-2998

Bradley Hemminger, Ph.D., is an Assistant Professor in the School of Information and Library Science. His interests are medical and bio-informatics, computer-human interfaces, digital libraries and open archives, and information visualization.

James Evans, Department of Genetics
919-966-2276

James Evans, M.D., is an Associate Professor in the Department of Genetics and an Associate Director of the CCGS. He is a board-certified medical geneticist with a special interest in cancer genetics. He is heading a UNC development project to integrate the collection of genetic information and materials throughout the UNC campus.

Andrew Nobel, Department of Statistics
919-962-1352

Andrew Nobel, Ph.D., is an Associate Professor of Statistics and an adjunct faculty member in the Computer Science Department. He will lead the statistical analysis of subspace clustering techniques and assess the validity of their results. Andrew has been collaborating with Chuck Perou's laboratory on the analysis of gene expression data and also has research programs in pattern recognition and machine learning.

Kari North, Department of Epidemiology
919-966-2148

Kari North, Ph.D., is an Assistant Professor in the Department of Epidemiology and a member of the CCGS. She is a statistical geneticist with extensive experience in genetic epidemiology and linkage analysis. She has practical experience in large genetic study design and will contribute this expertise as a collaborator on the project.

Fernando Manuel Pardo, Department of Genetics
919-843-5403

Fernando Manuel Pardo, Ph.D., is an Assistant Professor in the Department of Genetics and the CCGS. His interests include quantitative genetic analysis in mice. He has expertise in genetic meiotic segregation and genetic diversity of mice.

Karen Mohlke, Department of Genetics
919-966-2913

Karen Mohlke, Ph.D., is an Assistant Professor in the Department of Genetics and the CCGS. She is interested in the genetic analysis of complex traits. She has been working, and will continue to work, on the positional cloning of loci for diabetes.

Charles Perou, Department of Genetics
919-843-5740

Charles Perou, Ph.D., is an Assistant Professor in the Department of Genetics and the CCGS. He is interested in transcriptional profiling and the genetic epidemiology of cancer.

Jan Prins, Department of Computer Science
919-962-1913

Jan Prins, Ph.D., is a Professor of Computer Science. He has directed or co-directed several large projects to integrate high-performance computing techniques into computational science. He will lead the development of high-performance interactive implementations of subspace clustering methods.

Patrick Sullivan, Department of Genetics
919-966-3358

Patrick Sullivan, M.D., is a Professor in the Departments of Genetics and Psychiatry and the CCGS. His interest is principally in behavioral genetics. He is actively involved in large collaborative projects on schizophrenia, smoking dependence and chronic fatigue syndrome.

David Threadgill, Department of Genetics
919-843-6472

David Threadgill, Ph.D., is an Assistant Professor in the Department of Genetics and the CCGS. His principal interest is in quantitative genetic analysis of mice and transcriptional profiling.

Alexander Tropsha, School of Pharmacy
919-966-2955

Alexander Tropsha, Ph.D., is a Professor in the Division of Medicinal Chemistry and Products in the School of Pharmacy and an Associate Director of the CCGS. His principal interests are in biomolecular informatics.

K.T.L. Vaughan, Health Sciences Library
919-966-8011

K.T. Vaughan, M.S.L.S., is an Assistant Librarian in the Health Sciences Library. Her research interests include the integration of medical and bio-informatics into clinical, research, and teaching practices and the use of library services by interdisciplinary communities of practice. As the Librarian for Bioinformatics and Pharmacy, K.T. coordinates Library services for faculty, staff, students, and the general public in areas of genetics, pharmaceutics, and basic biomedicine.

Fred Wright, Department of Biostatistics
919-843-3655

Fred Wright, Ph.D., is an Associate Professor in the Department of Biostatistics and the CCGS. His principal interests are statistical genetics and the development of analytic methods.

Wei Wang, Department of Computer Science
919-962-1744

Wei Wang, Ph.D., is an Assistant Professor of Computer Science. She is an expert in data mining and has developed key algorithms for subspace clustering, as well as for mining sequence, spatial and structured data. She is a member of the CCGS and collaborates on several driving biological problems.

Kirk Wilhelmsen, Department of Genetics
919-966-1373

Kirk Wilhelmsen, M.D., Ph.D., is an Associate Professor in the Departments of Genetics and Neurology, the CCGS, and the Bowles Center for Alcohol Studies. His interest is principally in behavioral genetics. He has directed several large-scale family studies related to addiction, has directed a high-throughput genotyping laboratory, and has collaborated on the genetic analysis for all the studies in which he has participated.

Fei Zou, Department of Biostatistics
919-843-4822

Fei Zou, Ph.D. is an Assistant Professor in the Department of Biostatistics and the CCGS. Her interest is in methods of linkage analysis of quantitative traits and association analysis.


Acknowledgements

The project described was supported by Grant Number 5-P20-RR020751-01-02 from the National Center for Research Resources. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health/National Center for Research Resources.

Introduction
HAP-SAMPLE is a web application for simulating SNP genotypes for case-control and affected-child trio studies by resampling from Phase I/II HapMap SNP data. The user provides a list of SNPs to be "genotyped," along with a disease model file that describes causal SNPs and their effect sizes. The simulation tool is appropriate for candidate regions or whole-genome scans.
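
As an illustration of the general idea of resampling under a disease model, the sketch below draws pairs of haplotypes with replacement from a small reference panel, sums them into diploid genotypes, and assigns disease status with a logistic model on the causal SNPs. It is a simplified sketch under stated assumptions, not HAP-SAMPLE's actual algorithm or file format: the function `simulate_case_control`, the tuple-based panel, the `causal_effects` mapping, and the logistic risk model are all hypothetical choices made for this example.

```python
import math
import random

def simulate_case_control(haplotypes, causal_effects, baseline_log_odds,
                          n_cases, n_controls, seed=0):
    """Resample genotypes from a haplotype panel (hypothetical sketch).

    haplotypes: list of equal-length tuples of 0/1 alleles.
    causal_effects: dict mapping SNP index to the log odds ratio
        contributed by each copy of the risk allele (assumed model).
    Returns (cases, controls), each a list of genotype tuples with
    values 0, 1 or 2 per SNP.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # Draw two haplotypes with replacement and pair them.
        h1, h2 = rng.choice(haplotypes), rng.choice(haplotypes)
        genotype = tuple(a + b for a, b in zip(h1, h2))
        # Logistic disease model on the causal SNPs only.
        log_odds = baseline_log_odds + sum(
            beta * genotype[i] for i, beta in causal_effects.items()
        )
        p_disease = 1.0 / (1.0 + math.exp(-log_odds))
        affected = rng.random() < p_disease
        # Keep the individual only if that group still needs members.
        if affected and len(cases) < n_cases:
            cases.append(genotype)
        elif not affected and len(controls) < n_controls:
            controls.append(genotype)
    return cases, controls

# Toy panel of four haplotypes over three SNPs; SNP 0 is causal.
panel = [(0, 0, 1), (1, 0, 0), (1, 1, 0), (0, 1, 1)]
cases, controls = simulate_case_control(panel, {0: 1.0}, -1.0, 50, 50)
```

Because whole haplotypes are resampled, correlations between neighboring SNPs in the panel carry over to the simulated genotypes, which is the main appeal of data-based resampling over drawing each SNP independently.
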

Acknowledgements
This project is supported by Grant 5-P20-RR020751-01-02 from the National Institutes of Health National Center for Research Resources as part of the Carolina Center for Exploratory Genetic Analysis. Other sources of support include the Carolina Environmental Research Center (EPA RD-83272001), NIGMS R01 GM074175, and CF Foundation Zou05P0. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or the National Center for Research Resources.

Link to www.hapsample.org
Renaissance Computing Institute | 100 Europa Drive Suite 540 | Chapel Hill, North Carolina 27517
phone: 919-445-9640 | fax: 919-445-9669