SXSW 2017: Data-Driven Applications with Capital One DevExchange's Hydrograph
Key Takeaways
The video discusses Hydrograph, an open-source ETL tool developed by Capital One. It provides a graphical interface for data manipulation and integration and is designed to be accessible to developers, making it easier for them to manage data and work with analysts. The tool leverages about 1,400 other projects and can run at various scales, from small environments like laptops to massive cloud environments that handle billions of records, utilizing technologies such as Eclipse R
Full Transcript
Hello, welcome to The New Stack Makers, a podcast where we talk about at-scale application development, deployment and management. [Music] Thanks to Capital One, our sponsor here at South by Southwest, where we're doing podcasts live from the Capital One House at Antone's. [Music]

Alex Williams: Hey, it's Alex Williams of The New Stack, here at the Capital One House at the South by Southwest conference, here with Clark Farrey. Clark is with Capital One, and Clark is the person we go to to learn about the Hydrograph project; he's really helping lead it. And the Hydrograph project, just for people's sake, is what exactly? Clark, can you explain it to us?

Clark Farrey: It's an open-source project that helps people manage all their data. So if you're trying to integrate data from a lot of different sources, if you're trying to enrich data, if you're trying to do ETL, to use the traditional language, it's the tool that allows you to take data from all these different places, put it together, and present it for your data analysts or for an application to use, but put it together in a way that's meaningful for that particular use of data. So examples include building databases for analysts or loading data to be used for a certificate. So, all uses for ETL.

Alex Williams: So we were talking about this earlier, about the uses for something like Hydrograph. From a very top-level look, you think about how difficult it is to pull different data sources together, from the most simple use all the way up to the more complex. ETL is the technology that's been used for years to do this, but it's mostly the domain of database administrators, right? And ETL, as you know, is extract, transform and load. So now we're seeing this new use for developers. What is the use for developers here that Hydrograph offers?

Clark Farrey: Yes. So this technology has been around for a while, but the big commercial implementations are very expensive. We're trying to make it more accessible to more folks. For developers it's really about productivity. We provide a really nice graphical interface that allows you to manipulate data in the way you think about data. If you're going to draw your process on the whiteboard or the chalkboard, you talk about it like this: I want these two inputs, and I need to join them together on the account key, and then I want to filter out all the pending payments so I just have the full payments left, and I want to roll them up in case somebody's made two payments, and then I put that into a database. Think about that process, the way you draw that up. We really provide an interface that allows you to code in the same way you think and draw up a process, and this is really what Hydrograph is all about. It's productivity that allows you to think through the problem and then turn that into real code, something that is executable and can run in everything from a very small environment like your laptop to a massive cloud environment that takes billions and billions of records and pumps them through. So it's available to all at various scales, and it makes it really simple, accessible, very productive. So now an analyst can follow along with the developer and think through it with them, and then you can just take it to production.

Alex Williams: So this is an open source project, and you just did open source it. That seems to be something that Capital One is really engaging in more these days, the open source community. What is the importance of that for Capital One, and of making this open source?

Clark Farrey: Yeah, so I'd say a couple of things. Capital One is fully invested in the open source community. We contribute back to projects, but we also leverage everybody else's projects, and Hydrograph's a great example. Hydrograph leverages about 1,400 other projects to bring this to market.

Alex Williams: That's amazing.

Clark Farrey: Yeah, and we really think about it in these terms: if we're using other people's work in this type of way, we really should give back as well. We think it's important to do that. We've built something here that is significant, and we think that by working together we can all get to a better place faster. We built the components we need, and we know there are things that are very universal here, but there's also a continually evolving environment, and there are going to be new things that come into play. Somebody else might get there before us, or we might be able to deliver that new component before they do. By all of us working together we're going to get to a much better product, and one that's more inclusive of technologies, developers and companies, so that more people can really get benefit out of this type of tool.

Alex Williams: The scale is so quick now, it's so complex, it requires massive amounts of data. Open source seems to be almost like the only answer. It's like software's at the center of business, and open source is at the center of software. It just makes sense to do this.

Clark Farrey: Absolutely. I mean, most of the tools that all of us use today really are open source. You think of everything from our operating systems to the programming languages you use, even things like Git, getting into the way we all browse the internet. Open source is everywhere. It is the infrastructure; it's the pipes that allow you to browse the web and really do everything. Try to think of something today that doesn't have some component of open source tied into it, and it's really hard to think of something, right? It's everywhere, and it's the future.

Alex Williams: So I'd like to talk about three things: the technology architecture, the use cases, and the roadmap going forward. So let's look at the technology architecture behind Hydrograph.

Clark Farrey: Yeah, so we've been very conscious to build a modular system here, and that is because we believe the world continues to change. We want to be able to swap out pieces and have that be independent of those technology choices. So the architecture is three-part. There's an RCP, Rich Client Platform, as the front end, the user interface. There's an XML layer in the middle that is really the rule set: as I deploy code, I deploy this configuration set that acts as a bridge between the two and allows me to swap either one. And then there's the back end, where right now we support both the Cascading framework, for things like MapReduce, and also Spark. So I've got my three layers, and all of them can be changed, but that bridge in the middle means, and recently we did this, we added Spark because we thought Spark was the future, and we did that without breaking the front end and without breaking the existing applications. So I think that has been really, really powerful for us.

Alex Williams: OK, great. So you have that back-end architecture; you use Cascading and Spark. Now tell us a little bit about the use cases. Well, actually, let me back up a little bit before I get to the use cases. What are some of the different modular capabilities that come with Hydrograph? On the front end you're using Eclipse, as I understand, right?

Clark Farrey: Yeah, RCP. Eclipse is the common thing that everybody's heard of, right, but it's really RCP. And there's so much that we leverage as part of that. People have already built the plugins [unclear] and other things within that. So at the end of the day, we're not UI developers, right? It's not our core capability. We're really more about the data. We wanted to leverage something on the front end that really gave us a fast jump start but also had all those things we really needed built in already, and then focus most of our attention on how you work with data, which is really what our expertise is. Capital One is a data company. While we have a lot of people in other spaces who do great UI, my area is all about data, [unclear], and how you deliver it, right? So RCP on the front end was our jump start into that.

Alex Williams: Great, great. Now tell us a little bit about the use cases that you're seeing. What are the early use cases? What do you think are going to be some of the core use cases? And what are some of the use cases that you're really just starting to discover?

Clark Farrey: Yeah, so one of the biggest things we're doing is using this tool set to move our infrastructure from the old on-prem. We have data centers, we have lots of servers, and a lot of them run data ETL tools. As we move to the cloud, and we're fully in on the Amazon cloud, and Rob Alexander, our CIO, has talked about that a lot, we wanted to change out the technologies to be much more flexible. This is the migration of those things from the old legacy stack to the cloud. This is our bridge to get to the cloud. That's use case number one, a very important one. We're also using it for new things. There are new batch interfaces, new files are delivered all the time, and this is just a great common tool that we've tied into so much of the way we manage data, the way we govern our data and the way we secure our data, so we're using it for those brand new file feeds as well. So legacy was probably the starting point, but the new things are also a big part of it. We're using it for a variety of projects. We have one where we're working with, I believe it's HR-type data, and loading that into a very specific application for that.

Alex Williams: So this ties back into the modular capabilities that come with Hydrograph, because the use cases are really for developers, aren't they? The developers who are using Hydrograph can really use it for whatever purpose they want. They may have a big data warehouse, or they may not, right? Isn't that the intent here?

Clark Farrey: Yeah, I mean, this is the ultimate in flexibility. This is about that developer, and it is really a development tool. You don't have to go out there and write a low-level job to move three files around and merge them together. You can do it very much in a flowchart-style interface, and you can use it for a massive data warehouse or you can use it for something very small. A great example: I talked to a gentleman in healthcare who was looking at it for his not-for-profit. It's not a massive amount of data, but he was thinking about how to put all this data together and present it back to some of his customers, and this is perfect for that type of application as well. Not just the things on the multiple-billions-of-rows scale, but things on the hundreds or thousands of rows.

Alex Williams: Right. So tell us about the roadmap going forward.

Clark Farrey: Yeah, absolutely. Capital One is always looking at new technologies, things like different databases and different file formats, and the different things that our end users want, end users like data scientists and our digital product owners, people who want to do cool things with data and build applications like we've seen so much of here at South by Southwest. So we're always looking at new file formats and new ways to present data, new databases. But at the same time we spend a lot of time thinking about the way the input data is changing. A great example is that streaming is becoming much more common, those regular feeds of data that come through APIs. So we're looking at everything, from that front end, the latency and how data arrives, with that getting much shorter, but also the different targets on the back end. So really everything. And the great thing about this is that everything can change. What's exciting about data technologies is that it's all constantly evolving, and we're prepared to go where our customer, our data scientist customer, our digital product owner or our application space, needs us to go. So streaming is the no-brainer, and different target technologies, from a database standpoint and from a data environment standpoint. Those are the two that are guaranteed to happen. Everything else, that's the fun of being in the space.

Alex Williams: Well, Clark, thank you very much for taking some time to talk with us about Hydrograph, an exciting new open source project. We look forward to learning more about it in the months and year ahead.

Clark Farrey: Thank you. [unclear] Have fun.

[Music] Thanks to Capital One, our sponsor for our podcast from South by Southwest at the Capital One House. Thanks for joining us. [Music]
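The whiteboard process described in the interview (join two inputs on the account key, filter out pending payments, roll up multiple payments per account) can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not Hydrograph code or its API: the sample records, the field names (`account_key`, `status`, `amount`) and every function name here are assumptions made up for the example, and the final database load is left out.

```python
from collections import defaultdict

# Hypothetical sample inputs; the field names are assumptions for illustration.
accounts = [
    {"account_key": "A1", "customer": "Ada"},
    {"account_key": "A2", "customer": "Grace"},
]
payments = [
    {"account_key": "A1", "amount": 50.0, "status": "full"},
    {"account_key": "A1", "amount": 25.0, "status": "full"},
    {"account_key": "A2", "amount": 40.0, "status": "pending"},
    {"account_key": "A2", "amount": 10.0, "status": "full"},
]


def join_on_account_key(accounts, payments):
    """Inner join: attach account fields to each payment with a matching key."""
    by_key = {a["account_key"]: a for a in accounts}
    return [
        {**by_key[p["account_key"]], **p}
        for p in payments
        if p["account_key"] in by_key
    ]


def filter_out_pending(rows):
    """Keep only rows whose payment is not pending."""
    return [r for r in rows if r["status"] != "pending"]


def roll_up(rows):
    """Sum payment amounts per account, so two payments become one row."""
    totals = defaultdict(float)
    customers = {}
    for r in rows:
        totals[r["account_key"]] += r["amount"]
        customers[r["account_key"]] = r["customer"]
    return [
        {"account_key": k, "customer": customers[k], "total_paid": totals[k]}
        for k in sorted(totals)
    ]


def run_pipeline(accounts, payments):
    """Chain the steps in the order they are drawn on the whiteboard."""
    return roll_up(filter_out_pending(join_on_account_key(accounts, payments)))


result = run_pipeline(accounts, payments)
```

Each function corresponds to one box in the flowchart, which is the property the interview emphasizes: the code is shaped like the drawing.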
Original Description
Capital One recently released Hydrograph, an open source ETL tool for developers. In this sponsored podcast, we spoke with Clark Farrey, Data Engineering Leader, about how Hydrograph works and how it could be used by developers to aggregate multiple data sources into a single clean dataset.
Listen on SoundCloud:
https://soundcloud.com/thenewstackmakers/sxsw-2017-data-driven-applications-with-capital-one-devexchanges-hydrograph
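The three-part design described in the interview (a front end, an XML layer in the middle, and swappable back ends) rests on one idea: the job is described once as configuration, and any engine that understands that configuration can run it. The sketch below illustrates the idea with Python's standard library only. The XML element names are invented for this example and are not Hydrograph's real schema, and the two engine classes are stand-ins, not Cascading or Spark.

```python
import xml.etree.ElementTree as ET
from itertools import groupby
from operator import itemgetter

# Hypothetical job description. This is NOT Hydrograph's real XML schema.
JOB_XML = """
<job name="full_payments">
  <filter field="status" equals="full"/>
  <sum field="amount" by="account_key"/>
</job>
"""


def parse_job(xml_text):
    """Turn the XML into an engine-neutral list of (step, attributes) pairs."""
    root = ET.fromstring(xml_text.strip())
    return [(step.tag, dict(step.attrib)) for step in root]


class LoopEngine:
    """Stand-in back end number one: plain loops and a dict."""

    def run(self, steps, rows):
        for tag, a in steps:
            if tag == "filter":
                rows = [r for r in rows if r[a["field"]] == a["equals"]]
            elif tag == "sum":
                totals = {}
                for r in rows:
                    totals[r[a["by"]]] = totals.get(r[a["by"]], 0) + r[a["field"]]
                rows = [{a["by"]: k, a["field"]: v} for k, v in sorted(totals.items())]
            else:
                raise ValueError(f"unknown step: {tag}")
        return rows


class SortGroupEngine:
    """Stand-in back end number two: sort, then group with itertools.groupby."""

    def run(self, steps, rows):
        for tag, a in steps:
            if tag == "filter":
                rows = list(filter(lambda r: r[a["field"]] == a["equals"], rows))
            elif tag == "sum":
                key = itemgetter(a["by"])
                rows = [
                    {a["by"]: k, a["field"]: sum(r[a["field"]] for r in grp)}
                    for k, grp in groupby(sorted(rows, key=key), key=key)
                ]
            else:
                raise ValueError(f"unknown step: {tag}")
        return rows


payments = [
    {"account_key": "A1", "amount": 50.0, "status": "full"},
    {"account_key": "A1", "amount": 25.0, "status": "full"},
    {"account_key": "A2", "amount": 40.0, "status": "pending"},
    {"account_key": "A2", "amount": 10.0, "status": "full"},
]

steps = parse_job(JOB_XML)
loop_result = LoopEngine().run(steps, payments)
group_result = SortGroupEngine().run(steps, payments)
```

Both engines read the same `steps` list, so a new engine can be added without touching the job definition or whatever tool produced it. That mirrors the point made in the interview about adding Spark without breaking the front end or existing applications.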