Nucleus: TensorFlow toolkit for Genomics (TensorFlow Dev Summit 2018)


Key Takeaways

The video introduces Nucleus, a TensorFlow toolkit for genomics, and its role in creating DeepVariant, an open-source program for genome variant discovery. Nucleus is a library of Python code for reading, writing, and filtering common genomics file formats for conversion to TensorFlow examples.
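As a rough illustration of what reading and filtering a genomics file format involves, the sketch below parses the tab-separated data lines of a VCF file into simple records and keeps only the ones that pass filtering. It uses only the Python standard library and is not the Nucleus API: the names `Variant`, `parse_vcf_line`, `read_variants`, and `passing` are hypothetical, and the sample lines are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical stand-in for a parsed variant record."""
    chrom: str
    pos: int                  # 1-based position, as written in the VCF text
    ref: str
    alts: Tuple[str, ...]
    filters: Tuple[str, ...]


def parse_vcf_line(line: str) -> Variant:
    """Parse the fixed leading columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
    return Variant(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alts=() if alt == "." else tuple(alt.split(",")),
        filters=() if flt == "." else tuple(flt.split(";")),
    )


def read_variants(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield variants, skipping blank lines and '#' header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


def passing(variants: Iterable[Variant]) -> Iterator[Variant]:
    """Keep variants whose FILTER column is PASS or unset ('.')."""
    return (v for v in variants if v.filters in ((), ("PASS",)))


# Made-up sample data for illustration only.
SAMPLE_VCF = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t250\t.\tC\tT,G\t12\tLowQual\t.\n",
    "chr2\t75\t.\tG\tGA\t99\t.\t.\n",
]

kept = list(passing(read_variants(SAMPLE_VCF)))
for v in kept:
    print(v.chrom, v.pos, v.ref, ",".join(v.alts))
```

A library such as Nucleus does this parsing in compiled code and returns richer records. The point here is only the shape of the task: text lines in, structured and filterable records out.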

Full Transcript

[Music]

Hello everyone. My name is Cory McLean, and I'm an engineer on the genomics team in Google Brain. Today I'm excited to tell you about Nucleus, a library we've released today to make it easy to bring genomics data to TensorFlow.

Genomics is the study of the structure and function of genomes. In every cell in your body you have two copies of the genome, one from each parent. The genome is made of strings of DNA, which is written in a four-letter alphabet, and there are about three billion letters in the genome. Here is a snapshot of about 150,000 letters on chromosome 1. We can see that a number of things are known about this area. First, there are functional elements, like the genes depicted in the second row. Biological measurements allow us to analyze what is active in cells, so in the third row we can see gene expression quantified across different tissue types. And at the bottom, by sequencing many people, we can identify places where there is variation across individuals.

There are many different computational and algorithmic challenges in developing that image. They start on the experimental data generation side: can we better take the output of these physical measurements to get accurate DNA readings, or reduce noise in the experiments that quantify this expression? Can we take the DNA sequence and interpret where the functional elements like these genes are, or predict how active they are in different tissue types? Can we identify places where individuals vary compared to a reference, and how is that different for small variants versus, say, in cancer? And how do those changes influence human traits?

One thing that is really exciting for us is that there are many opportunities for deep learning in genomics. A lot of that is driven by the increase in the amount of data available. This graph shows the dramatic reduction in the cost to sequence a million bases of DNA over the past decade. But there is also a lot of structure in these datasets that is often complex and difficult to represent with relatively simple models, but it may display convolutional structure, so we can use techniques from image classification as well as sequence models. There have been a number of proven successes in applying deep learning to problems in genomics, such as DeepVariant, a tool our group developed to identify small variants using convolutional neural networks.

Our goals in genomics are multifaceted. One is to make it easy to apply TensorFlow to problems in genomics, and we do this by creating libraries that make it easy to work with genomics data. We're also interested in developing tools and pushing the boundaries on some of the scientific questions using the things we've built. And we want to make all of that publicly available as tools that can be used by the community. Today I'll focus on the first part: making it easy to bring genomics data to TensorFlow.

So what is the problem? One major difficulty is that many different types of data are generated for genomics research. Here on the right is a subset of the different types used. These file formats have varying amounts of support and, in general, no uniform APIs. We also have some concerns about efficiency and language support: we would like to express some manipulations in Python, but we need to go through this data efficiently in ways that native Python wouldn't make possible.

To address these challenges we developed Nucleus, a C++ and Python library for reading and writing genomics data. It makes it easy to bring that data to TensorFlow models and then feed it through the tf.data API that Derek talked about earlier today, for training models on your particular tasks of interest. In this release we support reading many of the most common data formats in genomics and provide a unified API across the different data types. We can iterate through the records of these different types and query specific regions of the genome to access the data there.

The way we developed this uses protocol buffers under the hood, so that we can implement all of the general parsing in C++ and then make it available to other languages like Python. For those of you familiar with genomics, we use htslib, the canonical parser for high-throughput sequencing formats like aligned reads and variants, and wrap it to generate the protocol buffers. We then use CLIF on top of this to make the data available to Python. Finally, we use some of the TensorFlow core libraries so that we can write out these data as TFRecords, which can be ingested by the tf.data API.

The data types we currently support range from general genome annotation to reference genomes, sequence reads (whether direct off the sequencer or mapped), and genetic variants.

To give an example, the reading API is quite straightforward. This is a toy example, but it is essentially similar to what is used for DeepVariant, where we want to train a model to identify actual genome variations based on mapped sequence reads and a reference genome. We need three different data types, so we import the different reader types. Then, for the region we're interested in, we can issue queries to each of the reader types and get iterables of these protocol buffers as output, which we can then manipulate and turn into TensorFlow examples.

On the writing side it's similarly straightforward. If we have a list of variants for the common VCF format, we'll have an associated header that provides metadata about them. We open a writer with that header, then just loop through the variants and write them. Note that we support writing to block gzip format, which is common for subsequent indexing by other tools. We can also write directly to TFRecords, using a very similar API, and we provide some convenience methods to write out sharded data, which we found helps avoid certain hotspots in the genome.

Finally, we have been working with the Google Cloud team, which has some tools for analyzing variant data. They have developed a tool called Variant Transforms, which allows you to load VCF files to BigQuery using Apache Beam, and then you can do structured queries over that data. We're working now to integrate with it, so that Nucleus provides the generation of the variants under the hood. To learn more about that tool, you can go to the link below.

To summarize, we have developed Nucleus, a C++ and Python library that makes it easy to bring genomics data to TensorFlow for your models of interest for genomic problems. We have the ability to interoperate with Cloud Genomics and are being integrated into Variant Transforms at the moment. Nucleus ended up being the foundation of our CNN-based variant caller, which is also available as open source at the link below. With that, I would like to thank you all for your attention today.

[Applause] [Music]
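The region queries and sharded output described in the talk can be sketched in plain Python. This is a minimal illustration, not Nucleus code. `Region`, `Record`, `query`, `shard_name`, and `assign_to_shards` are hypothetical names, and the zero-based half-open coordinates, the round-robin assignment, and the `-00000-of-00002` file-name pattern are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # zero-based, inclusive (an assumption for this sketch)
    end: int    # exclusive


@dataclass(frozen=True)
class Record:
    """Hypothetical record; 'kind' marks the data type it came from."""
    kind: str   # e.g. "read" or "variant"
    chrom: str
    start: int
    end: int


def overlaps(record: Record, region: Region) -> bool:
    """True if the two half-open intervals share at least one base."""
    return (record.chrom == region.chrom
            and record.start < region.end
            and region.start < record.end)


def query(records: Iterable[Record], region: Region) -> Iterator[Record]:
    """Linear scan for overlapping records; real readers rely on an index."""
    return (r for r in records if overlaps(r, region))


def shard_name(base: str, index: int, total: int) -> str:
    """Build a shard file name such as 'base-00000-of-00002'."""
    return f"{base}-{index:05d}-of-{total:05d}"


def assign_to_shards(records: Iterable[Record], total: int) -> List[List[Record]]:
    """Spread records round-robin so that neighbours land in different shards."""
    shards: List[List[Record]] = [[] for _ in range(total)]
    for i, record in enumerate(records):
        shards[i % total].append(record)
    return shards


# Made-up records standing in for reads and variants from different files.
records = [
    Record("read", "chr1", 90, 190),
    Record("read", "chr1", 200, 300),
    Record("variant", "chr1", 150, 151),
    Record("variant", "chr2", 150, 151),
]
region = Region("chr1", 100, 200)

hits = list(query(records, region))
shards = assign_to_shards(hits, total=2)
names = [shard_name("examples.tfrecord", i, 2) for i in range(2)]
```

The same `query` call works for every record type, which is the idea behind a unified API. Distributing records across shards is one plausible way to avoid a single output file holding all the data from a dense region of the genome.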

Original Description

Cory McLean announces the launch of Nucleus, a library of Python code for reading, writing, and filtering common genomics file formats for conversion to TensorFlow examples. Cory briefly describes its role in creating DeepVariant, an open-source TensorFlow CNN-based program for genome variant discovery that substantially improves upon prior methods.

TensorFlow Dev Summit 2018 All Sessions playlist → https://goo.gl/Lsaq1R
Subscribe to the TensorFlow channel → https://goo.gl/ht3WGe

The video introduces Nucleus, a TensorFlow toolkit for genomics, and its role in creating DeepVariant, an open-source program for genome variant discovery. Nucleus enables the conversion of common genomics file formats to TensorFlow examples, facilitating deep learning in genomics.

Key Takeaways
  1. Install Nucleus using Python
  2. Use Nucleus to read and write genomics file formats
  3. Convert genomics file formats to TensorFlow examples
  4. Use DeepVariant for genome variant discovery
💡 Nucleus facilitates the application of deep learning in genomics by enabling the conversion of common genomics file formats to TensorFlow examples.
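
The third takeaway, converting genomics data into model inputs, might look roughly like the sketch below. It one-hot encodes a window of reference sequence over the four-letter DNA alphabet. The function names `one_hot_encode` and `window` are hypothetical, the treatment of unknown bases as all-zero rows is an assumption, and wrapping the result in a TensorFlow example is left out so that the code runs with the standard library alone.

```python
from typing import List

BASES = "ACGT"  # the four-letter DNA alphabet


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode each base as a 4-element row; unknown bases become all zeros."""
    rows = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


def window(sequence: str, center: int, half_width: int) -> str:
    """Return the bases around a zero-based center, clipped to the sequence."""
    start = max(0, center - half_width)
    end = min(len(sequence), center + half_width + 1)
    return sequence[start:end]


# Made-up reference sequence; 'N' marks an unknown base.
reference = "ACGTNACGT"
features = one_hot_encode(window(reference, center=4, half_width=2))
for row in features:
    print(row)
```

Each row is one position in the window and each column is one base, so the result has the kind of grid layout that convolutional models consume.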
