SXSW 2017: Data-Driven Applications with Capital One DevExchange's Hydrograph

The New Stack · Intermediate · 📊 Data Analytics & Business Intelligence · 9y ago

Key Takeaways

The video discusses Hydrograph, an open-source ETL tool developed by Capital One, which provides a graphical interface for data manipulation and integration, and is designed to be accessible to developers, making it easier for them to manage data and work with analysts. The tool leverages about 1,400 other projects and is available to run in various scales, from small environments like laptops to massive cloud environments that handle billions of records, utilizing technologies such as Eclipse R

Full Transcript

Hello, welcome to The New Stack Makers, a podcast where we talk about at-scale application development, deployment, and management. [Music] Thanks to Capital One, our sponsor here at South by Southwest, where we're doing podcasts live from the Capital One House at Antone's. [Music]

Alex Williams: Hey, it's Alex Williams of The New Stack, here at the Capital One House at the South by Southwest conference with Clark Farrey. Clark is with Capital One, and he is the person we go to to learn about the Hydrograph project, which he is helping lead. So the Hydrograph project, just for people's sake, is what exactly? Clark, can you explain it to us?

Clark Farrey: It's an open-source project that helps people manage all their data. If you're trying to integrate data from a lot of different sources, if you're trying to enrich data, if you're trying to do ETL, to use the traditional term, it's the tool that allows you to take data from all these different places, put it together, and present it to your data analysts or to an application, in a way that's meaningful for that particular use of the data. Examples include building databases for analysts or loading data to be used for a certificate. So, all the uses for ETL.

Alex Williams: We were talking earlier about the uses for something like Hydrograph. From a very top-level look, you think about how difficult it is to pull different data sources together, from the simplest use all the way up to the more complex. ETL is the technology that's been used for years to do this, but it's mostly been the domain of database administrators, right? And ETL, as you know, is extract, transform, and load. So now we're seeing this new use for developers. What is the use for developers that Hydrograph offers?

Clark Farrey: Yes. This technology has been around for a while, but in big commercial implementations that are very expensive. We're trying to make it accessible to more folks, and for developers it's really about productivity. We provide a really nice graphical interface that allows you to manipulate data in the way you think about data. If you're going to draw your process on the whiteboard or the chalkboard, you say: I want these two inputs, I need to join them together on the account key, then I want to filter out all the pending payments so I just have the full payments left, I want to roll them up in case somebody's made two payments, and then I put that into a database. Think about the way you draw that process up. We provide an interface that lets you code in the same way you think and draw up a process, and this is really what Hydrograph is all about. It's productivity: it allows you to think through the problem and then turn that into real code, something that is executable and can run in everything from a very small environment like your laptop to a massive cloud environment that takes billions and billions of records and pumps them through. So it's available to all at various scales, and it's really simple, accessible, and very productive. An analyst can follow along with the developer and think through it with them, and then you can just take it to production.

Alex Williams: So this is an open-source project, and you just open-sourced it. Engaging with the open-source community seems to be something Capital One is doing more these days. What is the importance of that for Capital One, and of making this open source?

Clark Farrey: I'd say a couple of things. Capital One is fully invested in the open-source community. We contribute back to projects, but we also leverage everybody else's projects, and Hydrograph is a great example: it leverages about 1,400 other projects to bring this to market.

Alex Williams: That's amazing.

Clark Farrey: Yeah, and we think about it in terms of: if we're using other people's work in this way, we really should give back as well. We think it's important to do that. We've built something here that is significant, and we think that by working together we can all get to a better place faster. We built the components we need, and we know there are things here that are very universal, but it's also a continually evolving environment, and there are going to be new things that come into play. Somebody else might deliver a new component before us, or we might deliver it before they do. By all of us working together, we're going to get to a much better product, one that's more inclusive of technologies, developers, and companies, so that more people can really get benefit out of this.

Alex Williams: The scale moves so quickly now, it's so complex, and it requires massive amounts of data. Open source seems to be almost the only answer. If software is at the center of business, and open source is at the center of software, it just makes sense to do this.

Clark Farrey: Absolutely. Most of the tools that all of us use today really are open source. Think of everything from our operating systems to the programming languages we use, even things like Git, down to the way we all browse the internet. Open source is everywhere. It is the infrastructure; it's the pipes that allow you to browse the web and really do everything. Try to think of something today that doesn't have some component of open source tied into it, and it's really hard. It's everywhere, and it's the future.

Alex Williams: I'd like to talk about three things: the technology architecture, the use cases, and the roadmap going forward. So let's look at the technology architecture behind Hydrograph.

Clark Farrey: We've been very conscious to build a modular system here, because we believe the world continues to change. We want to be able to swap out pieces and be independent of those technology choices. So the architecture has three parts. There's an RCP, a rich client platform, as the front end, the user interface. There's an XML layer in the middle that is really the rule set: when I deploy code, I deploy this configuration set, which acts as a bridge between the two and allows me to swap either one. And then there's the back end, where right now we support both the Cascading framework, for things like MapReduce, and Spark. So I've got my three layers, and all of them can be changed, but that bridge in the middle means, and we did this recently when we added Spark because we thought Spark was the future, that we can do it without breaking the front end and without breaking the existing applications. I think that has been really powerful for us.

Alex Williams: Okay, great. So you have that back-end architecture, and you use Cascading and Spark. Now tell us a little bit about the use cases. Well, actually, let me back up a little before I get to the use cases. What are some of the different modular capabilities that come with Hydrograph? On the front end, you're using Eclipse, as I understand it, right?

Clark Farrey: Yeah, Eclipse RCP. Eclipse is the common thing that everybody's heard of; it's really RCP. But there's so much that we leverage as part of that. People have already built the plugins for [unclear] and other things within that. At the end of the day, we're not UI developers. It's not our core capability; we're really more about the data. So we wanted to leverage something on the front end that gave us a fast jump start and already had all the things we needed built in, and then focus most of our attention on how you work with data, which is really where our expertise is. Capital One is a data company. Now, we have a lot of people in other spaces who do great UI, but my area is all about data, [unclear], and how you deliver it. So RCP on the front end was our jump start into that.

Alex Williams: Great. Now tell us a little bit about the use cases you're seeing. What are the early use cases, what do you think are going to be some of the core use cases, and what are some of the use cases you're just starting to discover?

Clark Farrey: One of the biggest things we're doing is using this tool set to move our infrastructure off the old on-prem environment. We have data centers, we have lots of servers, and a lot of them run ETL tools. As we move to the cloud, and we're fully in on the Amazon cloud, as Rob Alexander, our CIO, has talked about a lot, we wanted to change out the technologies to be much more flexible. For the migration of those things from the old legacy stack to the cloud, this is our bridge to get to the cloud. That's use case number one, a very important one. We're also using it for new things. There are new batch interfaces, and new files are delivered all the time, and this is a great common tool that we've tied into so much of the way we manage, govern, and secure our data that we're using it for those brand-new file feeds as well. So legacy was probably the starting point, but the new things are also a big part of it. We're using it for a variety of projects. We have one where we're working with, I believe, HR-type data and loading that into a very specific application.

Alex Williams: This actually ties back into the modular capabilities that come with Hydrograph, because the use cases are really for developers, aren't they? The developers who are using Hydrograph can use it for whatever purpose they want. They may have a big data warehouse, or they may not. Isn't that the intent here?

Clark Farrey: Yeah, this is the ultimate in flexibility. This is about the developer, and it is really a development tool. You don't have to go out there and write a low-level job to move three files around and merge them together; you can do it in a flowchart-style interface. You can use it for a massive data warehouse, or you can use it for something very small. A great example: I talked to a gentleman in healthcare who was looking at it for his not-for-profit. It's not a massive amount of data, but he was thinking about how to put all this data together and present it back to some of his customers, and this is perfect for that type of application as well. So not just things on the scale of multiple billions of rows, but things on the scale of hundreds or thousands of rows.

Alex Williams: Right. So tell us about the roadmap going forward.

Clark Farrey: Absolutely. Capital One is always looking at new technologies: different databases, different file formats, and the different things that our end users need. Our end users are a lot of times data scientists, our digital product owners, people who want to do cool things with data and build applications, like we've seen so much of here at South by Southwest. So we're always looking at new file formats, new databases, and new ways to present data. At the same time, we spend a lot of time thinking about the way the input data is changing. A great example is streaming, which is becoming much more common: those regular feeds of data that come through APIs. So we're looking at everything, from the front end, the latency and how data arrives, which is getting much shorter, to all the different targets on the back end. The great thing about this is that everything can change. That's what's exciting about data technologies: it's all constantly evolving, and we're prepared to go where our customer, whether that's our data scientist, our digital product owner, or our application space, needs us to go. So streaming is the no-brainer, and so are different target technologies from a database and data-environment standpoint. Those are the two that are guaranteed to happen. Everything else, that's the fun of being in the space.

Alex Williams: Well, Clark, thank you very much for taking some time to talk with us about Hydrograph, an exciting new open-source project. We look forward to learning more about it in the months and year ahead.

Clark Farrey: Thank you. It's been fun.

[Music] Thanks to Capital One, our sponsor for our podcast from South by Southwest at the Capital One House. Thanks for joining us. [Music]
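
The whiteboard process described in the transcript (two inputs, a join on the account key, a filter that drops pending payments, a roll-up in case an account has several payments, and a load into a database) can be sketched in plain Python. This is only an illustration of the data flow, not Hydrograph code: the sample records, field names, function names, and the `payment_totals` table are all assumptions made for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample inputs; field names are illustrative, not from Hydrograph.
accounts = [
    {"account_key": "A1", "owner": "Ada"},
    {"account_key": "A2", "owner": "Grace"},
]
payments = [
    {"account_key": "A1", "amount": 50.0, "status": "full"},
    {"account_key": "A1", "amount": 25.0, "status": "full"},
    {"account_key": "A2", "amount": 40.0, "status": "pending"},
    {"account_key": "A2", "amount": 10.0, "status": "full"},
]


def join_on_account_key(accounts, payments):
    """Inner-join each payment to its account record on account_key."""
    by_key = {a["account_key"]: a for a in accounts}
    return [
        {**by_key[p["account_key"]], **p}
        for p in payments
        if p["account_key"] in by_key
    ]


def drop_pending(rows):
    """Filter out pending payments so only full payments remain."""
    return [r for r in rows if r["status"] != "pending"]


def roll_up(rows):
    """Sum payments per account, in case an account has more than one."""
    totals = defaultdict(float)
    owners = {}
    for r in rows:
        totals[r["account_key"]] += r["amount"]
        owners[r["account_key"]] = r["owner"]
    return [(key, owners[key], totals[key]) for key in sorted(totals)]


def load(rows, conn):
    """Write the rolled-up rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payment_totals "
        "(account_key TEXT PRIMARY KEY, owner TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO payment_totals VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Chain the steps in the order they would be drawn on the whiteboard."""
    rows = roll_up(drop_pending(join_on_account_key(accounts, payments)))
    load(rows, conn)
    return rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(run_pipeline(conn))
    conn.close()
```

Each function corresponds to one box in the drawn process, which is roughly the mapping a flowchart-style ETL tool makes visual. An in-memory SQLite database stands in for the target database here purely to keep the sketch self-contained.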

Original Description

Capital One recently released Hydrograph, an open source ETL tool for developers. In this sponsored podcast, we spoke with Clark Farrey, Data Engineering Leader, about how Hydrograph works and how it could be used by developers to aggregate multiple data sources into a single clean dataset. Listen on SoundCloud: https://soundcloud.com/thenewstackmakers/sxsw-2017-data-driven-applications-with-capital-one-devexchanges-hydrograph


The video introduces Hydrograph, an open-source ETL tool that provides a graphical interface for data manipulation and integration, and discusses its features, architecture, and use cases, highlighting its accessibility to developers and its ability to handle large-scale data processing. The tool is designed to meet the needs of data scientists and digital product owners, and is part of Capital One's investment in the open-source community. By using Hydrograph, developers can streamline their da

Key Takeaways
  1. Install and set up Hydrograph
  2. Use the graphical interface to design data pipelines
  3. Configure data sources and destinations
  4. Run and monitor data pipelines
  5. Use Hydrograph to integrate data from multiple sources
  6. Deploy Hydrograph in various environments, including cloud and on-premises
💡 Hydrograph provides a unique graphical interface for data manipulation and integration, making it easier for developers to manage data and work with analysts, and its modular architecture and open-source nature make it a versatile and scalable tool for data-driven applications.
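
The modular design described in the transcript puts an XML layer between the graphical front end and the execution back end, so that either side can be swapped without breaking the other. A minimal sketch of that idea follows. The XML shown is a made-up job format, not Hydrograph's actual schema, and `LocalEngine`, `ClusterEngine`, `parse_job`, and `build_plan` are hypothetical names used only to show how a fixed intermediate definition lets the engine behind it change.

```python
import xml.etree.ElementTree as ET

# Hypothetical job definition. This is NOT Hydrograph's real XML schema;
# it only stands in for "the configuration set in the middle".
JOB_XML = """
<job name="payments">
  <input id="payments" path="payments.csv"/>
  <filter id="full_only" from="payments" field="status" equals="full"/>
  <output id="totals" from="full_only" table="payment_totals"/>
</job>
"""


def parse_job(xml_text):
    """Turn the XML definition into an engine-neutral list of steps."""
    root = ET.fromstring(xml_text)
    return [(step.tag, dict(step.attrib)) for step in root]


class LocalEngine:
    """Stand-in back end that only describes the steps it would run."""

    name = "local"

    def plan(self, steps):
        return [f"{self.name}:{tag}:{attrs['id']}" for tag, attrs in steps]


class ClusterEngine(LocalEngine):
    """Stand-in for a distributed back end with the same interface."""

    name = "cluster"


def build_plan(xml_text, engine):
    """The XML is the fixed contract; the engine behind it is swappable."""
    return engine.plan(parse_job(xml_text))


if __name__ == "__main__":
    for engine in (LocalEngine(), ClusterEngine()):
        print(build_plan(JOB_XML, engine))
```

Because both engines consume the same parsed steps, adding a new back end in this sketch means writing one more class with a `plan` method, and neither the job definition nor whatever produced it has to change.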
