Stephanie Juneau - Jupyter-enabled astrophysical analysis for researchers and students | JupyterCon 2020
Key Takeaways
The talk shows how Jupyter supports astrophysical analysis for researchers and students. It uses the Astro Data Lab and the DESI collaboration as examples of science platforms that co-locate large datasets with pre-installed software libraries, and that use Jupyter notebooks for research, training, and education.
Full Transcript
Hi everyone, my name is Stephanie Juneau. I'm an associate astronomer at NSF's NOIRLab. I'm also a scientist at the Astro Data Lab and a member of the DESI team, so I'm excited to be telling you about Jupyter-enabled analysis for researchers and students. Before moving on, I want to acknowledge some colleagues, in particular from the Astro Data Lab team at NOIRLab and also from DESI.

There has been a big change in the way that we do research in astronomy. It used to be that the most typical scenario would be to have either individual researchers or maybe a small team go to the telescope to gather their data, get their observations, and then upload those to their computers, where they have also installed software to process and analyze the data locally. This still required some fairly sophisticated software, because in astronomy we tend to be looking for a very faint signal, so we have to do a very accurate sky subtraction and also an accurate correction for any source of noise, including detector noise and so on. But regardless, the main way of working used to be very local, and would then result in papers published in professional journals.

There have been a few major changes over the years. One big change has been the tendency to have larger and larger teams or collaborations. The main difference there is that several team members will need access to the same data. One way to do this is that people can simply download the data, whether from FTP or through a web browser, and then most typically people would still work locally with their software and pipelines on their computers. An interesting difference is that people would more often share intermediate or advanced data products. For example, someone could generate a catalog of measurements, and then other team members might need the same table or catalog. As a result, there will be multiple papers and
multiple results that are published in journals. Another way to do research, which has become more and more prevalent, is so-called data-driven or archive-driven astronomy. In the most extreme case, people no longer even need to go to the telescope, because the data to answer a question may already exist, stored in an archive or data center. If the data volume becomes so large that it's prohibitive to simply download the data, then in that regime what one needs to do instead is work directly on a science platform. A science platform is basically a suite of tools and services that are connected and can work with the data without having to download a large amount of it.

A science platform, then, is motivated by growth in both data volume and data complexity, and its goal is to maximize the scientific output of the community. This means that the data need to be not only available but easily discoverable, and there need to be a lot of versatile tools and also software sophisticated and advanced enough to do the required analysis, which can include machine learning algorithms or others. We also want to make sure that the framework is conducive to doing very robust science, so it has to be reproducible. Potentially, a way to do this is for researchers to start publishing not just the results but also their actual workflow. The role of the science platform is also to train, and to make it as easy as possible for people to hop on board. We want to lower the barrier to entry, we want to make sure there are good tutorials and documentation, and we really want to target the workforce at all career stages: early-career researchers, but also more advanced researchers who maybe have not used those tools before.

The first example science platform I'm going to talk about is the Astro Data Lab. It is at NSF's National Optical-
Infrared Astronomy Research Laboratory, and the main motivation behind the birth of the Astro Data Lab was basically the explosion of very wide-area surveys that started covering more and more of the sky. This is what happened in the last decade. You can see here these maps: the ellipses represent the full sky coverage projected onto the ellipse, and from 2011 to 2019 you can see the rapid jump in how much of the sky has been covered. This is what the color scheme is showing here: going to green and then red means more and more exposures taken at these areas of the sky. Here you can see the sky map coverage from these mid-scale observatories, which are four-meter telescopes that have been used with wide-field cameras.

This is another way to show this growth of data volume: here is the size of the data holdings in the NOIRLab Astro Data Archive. Similarly, you can see from 2011 to now this very rapid rise, and as of 2017 we've reached the petabyte scale in terms of data holdings. This comprises raw images, processed images, as well as high-level data products.

So really, our mission at the Data Lab is to empower researchers so that they can get new results and make new discoveries. To enable this, we want to have the tools, software, and services co-located with the data. We have very different types of data, including catalogs or tables. These are mostly stored in databases, and they range from a few hundred thousand rows up to our largest table, which right now has 68 billion rows, so that's a very large table. We also have images, which are basically 2D arrays, as well as file collections. These collections are actually going to be a mix: there will be catalogs, images, and possibly spectra. On the right-hand side you can see an example. Recently we started serving 1D spectra, so this is basically the trace of the spectrum. You can see here the picture with six different spectra, mostly of galaxies, and in the future
we will also have 2D spectra. This is the same idea, spectroscopy dispersing the light, but it's the image of the spectrum as opposed to just a trace. In the future we will also have the next level up, which is in 3D: 3D spectral data cubes.

So how does Jupyter come into play? We use Jupyter at the Astro Data Lab in a few different ways. One is to directly enable research. Most of our users are professional astronomers or graduate students, so we have a notebook server that they can use to conduct their research. We also use example notebooks as part of our training and tutorials. For training and tutorials we have a user manual and the example notebooks, and we have been giving online webinars and also in-person tutorials. The last category for which we use notebooks is education. I'm listing here a couple of different highlights. One of them is the Teen Astronomy Cafe program. This is for middle to high school students and it's extracurricular, so typically students will do this on a Saturday. It includes a computer activity, typically with an educational notebook, and also some hands-on activities. More advanced, we now have notebooks from the La Serena School for Data Science, which targets undergraduate and graduate students learning to use data science for astronomy. This is very exciting and brand new: we're just now starting to put the notebooks into our collection. This was motivated because the 2020 school was cancelled, as it would have been in person, so now we're migrating the content to JupyterLab so it can also be done remotely and virtually. The last example is using notebooks directly in the classroom. We don't often go into classrooms ourselves as a team, but professors and teachers have been getting in touch with us, and we definitely work with them to help them get started.

The Jupyter server is running on our end at the Data Lab, which means that users don't need to install
anything. They just need to point their browser to the Data Lab website, log in, and then they can spin up the Jupyter notebook server.

For research, when researchers come to use notebooks and Jupyter at Data Lab, it comes with pre-installed software packages. We have a lot of the commonly used default libraries, such as NumPy, Matplotlib, and pandas, and for astronomy specifically there's a package called Astropy, which is astronomy Python, as well as a datalab package. I will describe why we need the datalab package, but basically it's to connect Jupyter to the other functionality that we have at Data Lab. All users have the same installation by default, and I will describe a bit later how they can use a datalab command to connect with data that are on the server on the back end. Then the notebook can be used for analysis and visualization. We do provide some example cases that users can follow, and we also have a help desk and email to support researchers. The goal and the hope is that they're able to generate high-quality, publication-ready figures that they can then put in their journal articles.

So what's missing? Right now we don't have a way for users to tailor their own environment and install software. Instead, as I mentioned before, all the users have the same installation at the moment. We have been responsive: if somebody asks us for a package that they're missing, we would normally look at the package and see if we can install it and add it.

This slide shows a schematic of the three main clients that are part of the datalab package. With a notebook, a user can use the authentication client, which is called authClient. This means that the user can sign in to their Data Lab account, which gives them access to storage. Each user account comes with one terabyte of virtual storage for files, and it also comes with storage for databases in the MyDB space. The other way that users will
connect from the notebook to the Data Lab is the query client, which sends queries. Queries can be written in SQL or ADQL, which are query languages. Basically, the query client transfers the query to the query manager, which then retrieves the data from the database directly and sends it back into the notebook. Lastly, the store client allows users to store files, or to read files that they might have previously stored in their account. The order can be shuffled around a little depending on the exact use case. Again, as I mentioned before, the hope is that at the end the user has managed to do their full analysis and produce the figures for their papers.

We have written a Jupyter notebook collection, which has been put together by several members of the Data Lab team, and every user account comes pre-populated with notebooks. They get a full copy of the collection, whatever was the latest collection that was created. There is a way for users to then update their collection: they can always see a read-only version of our latest notebooks, and I will show you how those are organized. This means that if a user doesn't connect to their account for two or three or six months, and they come back and see there's a bunch of new notebooks, they can make a copy that they can then work on and edit as they please.

The notebooks are currently organized in folders, as shown on the slide here. We start with the most basic level, which is Getting Started. This is the equivalent of the 101 level: how to use Python, how to use a notebook, how to use SQL, and so on. Moving on, there is a data access notebook, which shows the most commonly used, typical but still somewhat basic ways of interacting with the data at Data Lab. Then we move on to more fully fleshed-out examples, the science use cases. These science example notebooks are
grouped by scientific theme, for example finding dwarf galaxies or exploring the large-scale structure of the universe, and more. I will show you those on the next slide. We also have notebooks that we call how-tos. Those tend to be a bit more technical, so they will be something like how to use the query client or how to use the store client, but they are more thorough and explore the different keywords and options. They're more technical tutorials. Then we have contrib, which stands for contributed notebooks. This is fairly recent: users can now contribute notebooks back to our collection. We provide a template and instructions, we first review the notebook, and once we're happy with it we merge it into the collection. The next category is EPO, which stands for education and public outreach. These include the education cases I mentioned before. And then we have tests, which is a way to automatically test the notebooks to make sure that everything is still running as expected.

For the science example notebooks, we have a lot of different datasets at Data Lab, for example the Dark Energy Survey and other surveys, so we try to make sure that there's a minimum of one example notebook per dataset. We think this helps users not only to know what the datasets are but also how to actually use them for their science. On the left-hand side is what you would see on GitHub, and on the right-hand side is the view in a notebook server, so you can see the folders that are organized per science theme.

This is one particular case from the example collection. The notebook is called Exploring SMASH DR2. This survey targets the Magellanic Clouds: there's the Large and the Small Magellanic Clouds, which are two very small neighbor galaxies of the Milky Way. The survey was designed with mosaics around these two galaxies, but also somewhat in between and
outside of them, because there are stellar streams: stars that were basically torn from these galaxies or from other galaxies by tidal, that is gravitational, interaction.

This notebook follows a typical structure. There's a title, and then it imports packages, so you can see here a typical example of possible packages. In this particular case, what I'm showing is a part of the notebook that does a query with a group-by per HEALPix. HEALPix is a way to tile the sky, and a different number of sides makes finer or larger HEALPix tiles. In this case it counts how many objects there are per HEALPix, to then produce a number density map. In the map shown on the right-hand side, projected onto the sky, so onto a sphere in this case, the color is on a log scale: yellow shows where there are many more objects per area, in this case stars, and we easily see the Small and the Large Magellanic Clouds, where there are many more stars. Overall, this query returned 193 million and some stars in total for this map.

I'm not going to go into the details, but I just want to point out that we have a way to automatically test a notebook. There's a poster on this topic by my colleague Robert Nikutta that goes through the whole end-to-end notebook life cycle. The testing runs automatically on the collection and shows us what has passed or failed.

Now I want to move on to the second example astronomy platform, which is used for the DESI collaboration. DESI is the Dark Energy Spectroscopic Instrument, and this instrument is going to be used over the course of five years to survey over a third of the sky. This part of the sky has already been covered with images, and from the images so far we have large catalogs with 1.6 billion objects, which include stars, galaxies, and quasars. The idea behind DESI is, among these
1.6 billion objects, to select 35 million galaxies and quasars in order to measure their distance from us. We do this with spectroscopy, by measuring the redshift, which we can convert into a distance. What we get, after we have a distance, is these 35 million points in 3D, and therefore the largest 3D map of the universe. This is how we can learn about dark energy. We don't know the exact nature of dark energy yet, but we know that it's driving the expansion of the universe. Depending on the exact nature and properties of dark energy, it will produce a different 3D map. That's why we need to first measure this map and then compare it to different cosmological models in order to infer what dark energy is. So this is pretty exciting. There are already over 800 researchers in the collaboration, spread across 93 institutions in different countries.

The DESI instrument is installed on the four-meter telescope at Kitt Peak, Arizona. The telescope is shown here; this was before DESI was actually installed, but that's the telescope inside the dome. What happens is that the light hits the primary mirror and goes to the DESI instrument, where there are 5,000 optical fibers. Each fiber is attached to a small robot, a positioner, which positions the fiber on the sky so that it's exactly in line with a star, a galaxy, or a quasar, whichever object it is going to take a spectrum of. On the right side, what you see is a picture of the sky with these 5,000 little regions where the fibers can go, and then the spectrum from one of these fibers. This is real data obtained at first light. You can see that one spectrum actually has three portions, because the spectrographs have three arms to cover the big range of colors from blue to infrared.

All right, so how is Jupyter part of this DESI project? Most of the software
is on GitHub. For DESI there are over 60 repositories and over 100 contributors. Jupyter is a way to have pre-installed software libraries that can talk to the data. There are tutorials written in notebooks, and those are super useful to train new collaboration members, or for current members to learn new techniques or new tasks. Some notebooks are also used for testing the different software releases. This all works at NERSC, so there's a supercomputer on the back end, which is great. So far the data already include a lot of images, catalogs, and tables, and we have a little bit of spectra, but there's much more coming over the next five years when we get the full survey going. Cosmology is the main goal because, like I said, it's a dark energy experiment, but because we're going to have all these data, it's going to be a treasure trove for discovery: stars, galaxies, black holes, and so on.

This is an example of accessing the data for DESI. This needs to be run at NERSC again. People can already look at some of the early data that were obtained as part of the early survey validation, or they can generate or work with simulated data that take into account the observing conditions and so on. What you can see here is one of the tools that works in JupyterLab. It's called prospect, and it's an interactive spectral viewer. You can recognize again the three arms, in three colors, of the spectrum here, but this lets you zoom in, zoom out, plot a model, and so on. It is based on Bokeh and it runs in JupyterLab.

Another use of Jupyter is for tutorials, which I've already mentioned. Here I'm showing a screenshot, but you can browse the whole list on GitHub. Same thing: these need to be run at NERSC. And this is an example of the testing use of Jupyter. There's a notebook where you simply
say which version you want to test. Some jobs take longer, so instead of waiting for the cell to run, it sends a job to the background, and the output is logged into a log file, also in the background. Even with sending jobs to the background, the whole notebook takes about two hours to run. The last thing I wanted to say about how Jupyter is used as part of DESI is for education and public outreach. There's a program called DESI High, and there's going to be a talk about this by Michael Wilson at JupyterCon, so I encourage you to take a look at that.

To summarize why Jupyter and notebooks work so well for us: it allows us to have pre-installed software, and in some cases we can have multiple kernels so we can choose what to work with. Both teams have been writing packages to connect Jupyter to the other portions of the science platform, for data access and also for storage. It's very nice to be able to use one notebook that contains an entire workflow or full example, including the documentation and the code, which of course is what Jupyter notebooks are great for. This has been useful in multiple avenues, including tutorials, teaching, and testing software, but also for actual scientific analysis. We're pretty excited about the possibility to containerize notebooks.

This brings me to the next slide, about the wish list, because we haven't quite done that yet. On the slide are also topics I would really like to hear about from other JupyterCon participants or presenters, if you have ideas. One thing is that we're still trying to keep improving how we teach and onboard new users. We still think that the hardest step is the very first step, from zero experience: how to become comfortable and learn to use Jupyter notebooks. In our case, with the science platform, it's not just a Jupyter notebook; we also have Python, SQL, and so on. Any ideas about training and reducing this learning curve would be great,
and then the last point I want to mention, which I did not talk about yet, is that we don't yet have a good way for people to collaborate on the same notebooks. I feel like this would be one of the most exciting possibilities. For example, there's Google Colab, which allows people to work together, so we would like something similar, except working directly on the Astro Data Lab server, with ways for people to share their data and their notebooks.

So how does Jupyter play a role in astrophysical platforms? It basically ties together the software, the tools, and the tutorials, co-located with the data, which is very useful as the data grow in both volume and complexity. There are other astronomy science platforms being developed. This becomes a great opportunity, because we can share common technologies and capabilities across these platforms, which also means that users might be able to have more portable tools and be able to work across platforms.

I'll leave you with this slide, so you can get in touch through either the website or email. You can also find us on Twitter, and this is also my Twitter handle. Thanks for your attention.
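The per-HEALPix counting query described in the talk for the Exploring SMASH DR2 notebook can be illustrated with a small, self-contained sketch. This is not the Data Lab query client, nor the notebook's actual query: it uses Python's built-in sqlite3 module as a stand-in database, and the table name `objects`, the column `hpix`, the toy pixel values, and the choice of nside = 64 are all hypothetical. The only HEALPix fact relied on is that a map with a given nside has 12 × nside² equal-area pixels covering the full sky.

```python
import math
import sqlite3


def healpix_npix(nside: int) -> int:
    """Number of equal-area HEALPix pixels on the full sky for a given nside."""
    return 12 * nside * nside


def healpix_pixel_area_deg2(nside: int) -> float:
    """Area of a single HEALPix pixel in square degrees."""
    full_sky_deg2 = 4 * math.pi * (180 / math.pi) ** 2  # about 41253 sq. deg
    return full_sky_deg2 / healpix_npix(nside)


# Hypothetical stand-in for a survey object table: one row per object,
# with a precomputed HEALPix pixel index (column name is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, hpix INTEGER)")
toy_pixels = [7, 7, 7, 7, 12, 12, 40]
conn.executemany(
    "INSERT INTO objects VALUES (?, ?)",
    [(i, pix) for i, pix in enumerate(toy_pixels)],
)

# Count objects per pixel: the "group by HEALPix" idea from the talk.
QUERY = "SELECT hpix, COUNT(*) AS nb FROM objects GROUP BY hpix ORDER BY hpix"
counts = dict(conn.execute(QUERY).fetchall())

# Convert counts to a log-scale number density (objects per square degree),
# which is the quantity a density map like the one described would color by.
nside = 64  # assumed resolution for this toy example
area = healpix_pixel_area_deg2(nside)
log_density = {pix: math.log10(nb / area) for pix, nb in counts.items()}
```

Doing the aggregation in the database means only one row per pixel comes back to the notebook rather than one row per object, which is the point of running the query next to the data. A larger nside gives more, smaller pixels and therefore a finer map.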
Original Description
Brief Summary
Astronomy science platforms are changing the way researchers work. Enabled by Jupyter, they allow geographically distributed users to access petabytes of data and perform complex analyses with pre-installed software libraries. Using NSF’s Astro Data Lab as an example, I will highlight how Jupyter is used in cutting-edge astronomy research, and share our wishlist for Jupyter development.
Outline
In this talk, I will present example Jupyter applications as part of astrophysical science platforms to illustrate how Jupyter Notebooks/JupyterLab can be embedded within a workflow, how they influence the way researchers work and how they can be used to train students and professional astronomers who may not be experienced in data science. The main goal is to share experiences, and facilitate follow-up discussion about possible future Jupyter developments to improve current functionality. No astronomy background is needed as I will focus on the general concept of a science platform as a suite of online tools and services that include access to datasets, visualization and analysis software, compute resources, and storage capabilities.
In astronomy, the need for science platforms is driven by not only the ever-increasing data volume but also by the complexity of datasets, which require highly specialized and diverse software libraries to be co-located with the data. Therefore, there is a marked advantage in being able to connect Jupyter Notebooks or JupyterLab to large datasets to perform analysis and/or data visualization efficiently. I will showcase two different astronomy projects as ongoing successful example applications. Namely, the Astro Data Lab at NSF’s NOIRLab (National Optical-Infrared Astronomy Research Laboratory) is an online astronomy science platform serving large public astronomical datasets including databases with tables ranging from 500,000 to 65 billion rows. Since opening its doors in 2017, the Data Lab now has over 1,300 registered users