6: The Modern Data Stack with Tristan Zajonc

Tristan Zajonc

Where do I think things are going to go? At the very foundational level, I think there should be and will be a convergence around what an ML pipeline looks like. What does a pipeline around training and inference and monitoring look like?

And, you know, TFX kind of hints at that. What Google did with TFX can set the canonical steps of an ML pipeline. So I think there needs to be, and should be, attention to what those canonical steps are, and to standardizing the inputs and outputs of those steps, so that you can then build the higher-level experiences on top of that.
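To make the idea of canonical steps with standardized inputs and outputs concrete, here is a minimal sketch in plain Python. It is not TFX's API: the `Step` class, the `run_pipeline` helper, and the step names (`ingest`, `train`, `evaluate`) are all hypothetical, chosen only to illustrate steps that declare which named artifacts they consume and produce.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """A pipeline step that declares its inputs and outputs by name (hypothetical)."""
    name: str
    inputs: List[str]
    outputs: List[str]
    fn: Callable[..., Dict[str, Any]]


def run_pipeline(steps: List[Step], artifacts: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order, checking each step's declared contract."""
    store = dict(artifacts)
    for step in steps:
        missing = [key for key in step.inputs if key not in store]
        if missing:
            raise KeyError(f"{step.name} is missing inputs: {missing}")
        produced = step.fn(**{key: store[key] for key in step.inputs})
        if set(produced) != set(step.outputs):
            raise ValueError(f"{step.name} produced {sorted(produced)}, "
                             f"expected {sorted(step.outputs)}")
        store.update(produced)
    return store


# Toy step implementations, for illustration only.
def ingest(raw):
    return {"examples": [(float(x), float(y)) for x, y in raw]}


def train(examples):
    # Least-squares fit of y = w * x (a line through the origin).
    num = sum(x * y for x, y in examples)
    den = sum(x * x for x, _ in examples)
    return {"model": num / den}


def evaluate(examples, model):
    mse = sum((model * x - y) ** 2 for x, y in examples) / len(examples)
    return {"metrics": {"mse": mse}}


PIPELINE = [
    Step("ingest", ["raw"], ["examples"], ingest),
    Step("train", ["examples"], ["model"], train),
    Step("evaluate", ["examples", "model"], ["metrics"], evaluate),
]

result = run_pipeline(PIPELINE, {"raw": [(1, 2), (2, 4), (3, 6)]})
print(result["model"], result["metrics"])
```

Because every step states its contract up front, a higher-level tool could inspect `PIPELINE` and add monitoring or comparison between steps without knowing what each step does internally.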

So I do think that is something that will happen, and that it will move from a complete pipeline-jungle world to a more structured set of pipelines, even for ML engineering. I'm talking here about the ML engineer. Our view at Continual is that that's probably the right approach for the sophisticated ML engineer who really needs complete control over everything, but that there really needs to be another step, which I would call data-first approaches to ML.

So, if I look at that pipeline, I think once you start to look at ML pipelines and you start to think about the inputs and outputs of them, you realize they are quite standardized. And then the question becomes, well, what is unique to the problem at hand?

You know, with the ML algorithms, the trend in research is overwhelmingly towards convergence: algorithms that can handle many different data types with standard architectures. Even if you're using multiple model types, honestly, AutoML approaches are getting better and better, particularly when you have complex data with multimodal features. If you have that, you don't need to get down into the weeds. I think automated approaches to the models themselves, pre-training... all of that is going to cause fewer and fewer people to want to go all the way down to that level.

And then once you start to realize that there is a standardized pipeline at a higher level, you'll also say, well, that's also just standardized: training, monitoring, profiling, testing, performance comparisons, tuning. All of those steps are pretty much common. And so what you're left with is, okay, what's your data?

What are you trying to predict, namely, what are the outputs of your machine learning models, and what are the inputs? What are the features and the signals that you can bring to bear on those models? And so I think that we're very early on this, but ultimately there really is going to be a next generation of products that starts with the idea that your workflows should be centered on your data and tries to, at least aspirationally, automate the rest.
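One way to picture a data-first workflow is a declaration that names only the target and the features, with everything else handled by generic code. The sketch below is an assumption-laden illustration, not Continual's product or API: `ModelSpec` and `build` are hypothetical names, and the "automated" part is reduced to a trivial mean-value baseline that ignores the features, so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass(frozen=True)
class ModelSpec:
    """What the user declares: what to predict and which signals to use (hypothetical)."""
    name: str
    target: str
    features: Sequence[str]


def build(spec: ModelSpec, rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Generic stand-in for the automated steps: validate, split, fit, evaluate."""
    # Validate that every declared column exists in every row.
    for column in [spec.target, *spec.features]:
        if any(column not in row for row in rows):
            raise KeyError(f"column {column!r} is missing from the data")

    # Simple ordered 80/20 split; fall back to the training rows if the test set is empty.
    cut = max(1, int(len(rows) * 0.8))
    train_rows = rows[:cut]
    test_rows = rows[cut:] or train_rows

    # Placeholder "model": predict the training mean of the target.
    # A real system would search over models that actually use spec.features.
    prediction = sum(row[spec.target] for row in train_rows) / len(train_rows)
    mae = sum(abs(row[spec.target] - prediction) for row in test_rows) / len(test_rows)

    return {
        "model": spec.name,
        "features_declared": list(spec.features),
        "prediction": prediction,
        "mae": mae,
    }


# Illustrative data and spec; the column names are made up.
rows = [
    {"visits": 1, "plan": "free", "spend": 10.0},
    {"visits": 2, "plan": "free", "spend": 20.0},
    {"visits": 3, "plan": "pro", "spend": 30.0},
    {"visits": 4, "plan": "pro", "spend": 40.0},
    {"visits": 5, "plan": "pro", "spend": 50.0},
]
spec = ModelSpec(name="customer_spend", target="spend", features=["visits", "plan"])
report = build(spec, rows)
print(report)
```

The design point is that the only problem-specific part is the `ModelSpec` and the data; the body of `build` could be swapped for any standardized training and evaluation routine without changing what the user writes.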
