Transcript Generated by Easy Cloud AI’s Beluga

You’ve heard of large language models like ChatGPT. It can answer questions, write stories, and even engage in conversation. But if you want to build a business that uses this technology, you’ll need to ask yourself an important question. How do I take this raw model and this raw intelligence and actually customize this to my use case?

How do I make it really good for my user so that it’s differentiated and better than what’s out there? This is Raza Habib. His company, HumanLoop, enables large language models to have even greater superpowers. We can help you build differentiated applications and products on top of these models. The range of use cases is like, now feels to be more limited by imagination than it is limited by technology.

You can replicate your exact writing style, customize tone, fact check answers, and train the model on your company’s unique data. We really hope that this is a platform on top of which, you know, the next million developers can build LLM applications. In our conversation, we explore the secrets to building an app that stands out. What made it so good that a million users signed up in five days was a fine tuning exercise.

The impact of generative AI on developers. They’re, you know, finding a significant fraction of their code is being written by a large language model. And what the future of large language models might bring to society as a whole. It’s an ethical minefield. There are going to be societal consequences on the path to AGI. Potential benefits are huge as well, but we do need to tread very carefully.

Let’s start, like, basics and high level. Like, what is a large language model? And why is it that they’ve suddenly sort of made a splash? I assume they’ve been around a lot longer than the past year or two. Yeah, so language models themselves are a really old concept and old technology. And really all it is is a statistical model of words in English language.

So you take a big bunch of texts and you try to predict what is the word that’ll come next, given a few previous words. So the cat sat on the… mat is the most likely word. And then you have a distribution over all the other words in your vocabulary. As you scale the language models, both in terms of the number of parameters they have, but also in the size of the data set that they’re trained on, it turns out that they continue to get better and better at this prediction task.
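As an illustration of the next-word prediction idea described above, here is a minimal sketch of a bigram model, which conditions on only one previous word. The corpus and function names are invented for this example; real large language models use neural networks and far longer contexts.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Turn the raw follow-counts for `prev` into probabilities that sum to 1."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in followers.items()}


# Toy corpus, invented for this example.
corpus = "a cat sat on the mat . a dog sat on the mat . a bird sat on the rug ."
model = train_bigram_model(corpus)

# Distribution over the words seen after "the" in the corpus.
dist = next_word_distribution(model, "the")
most_likely = max(dist, key=dist.get)
print(most_likely, dist)
```

With this toy corpus, "mat" follows "the" twice and "rug" once, so "mat" is the most likely next word and "rug" keeps the remaining probability.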

Eventually you have to start doing things like having world knowledge. You know, early on the language model is learning letter frequencies and word frequencies. And that’s fairly straightforward. And that’s kind of what we’re used to from predictive texts in our phones. But if the language model is gonna be able to finish the sentence, today the president of the United States acts.

It has to have learned who the president of the United States is. If it’s gonna finish a sentence that’s a math problem, it has to be able to solve that math problem. And so where we are today is that, you know, I think starting from GPT-1 and GPT-2, but then GPT-3 was really the one that I think everyone said, okay, something is very, very different here.

We now have these models of language that they’re just models of the words, right? They don’t know anything about the outside world. There’s loads of debates about whether they actually understand language, but they are able to do this task extremely well. And the only way to do that is to have gotten better at some form of reasoning and some form of knowledge.

What are some of the challenges of using a pre-trained model like ChatGPT? So one of the big ones is that they have a tendency to confidently bullshit or hallucinate stuff. I think Nat Friedman described it as alternating between spooky and kooky. Sometimes it’s so good that you cannot believe the large language model was able to do that.

And then just occasionally it’s horrendously wrong. And that’s just to do with how the models are originally trained. They’re trained to do next word prediction. And so they don’t necessarily know that they shouldn’t be dishonest. Yeah, sometimes they get it wrong. Sometimes they get it wrong, but the danger is that they confidently get it wrong. So, and very persuasively, very authoritatively, they get it wrong.

And so people might mistakenly trust these models. So there’s a couple of ways that you can hopefully fix that. And it’s an open research question, but the way we can help you with HumanLoop to do this today is we make it very easy to pull in a factual context to the prompt that you give to the model.
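To make the idea of pulling factual context into the prompt concrete, here is a minimal sketch. It is not HumanLoop's implementation: the word-overlap scoring, the prompt wording, and the sample documents are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(question, passage):
    """Crude relevance score: number of words shared by question and passage."""
    return len(tokenize(question) & tokenize(passage))


def build_prompt(question, passages, top_k=1):
    """Put the most relevant passages in front of the question."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    context = "\n".join(f"- {p}" for p in ranked[:top_k])
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical company documents, invented for this example.
documents = [
    "Refunds are available within 30 days of purchase.",
    "Support is open Monday to Friday.",
]
prompt = build_prompt("How many days do I have to request refunds?", documents)
print(prompt)
```

The resulting string would then be sent to whichever model is in use. A production system would likely rank passages with embeddings rather than word overlap.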

And so the model’s much more likely to use that rather than make something up. And so we’ve seen that as a very successful technique for reducing hallucinations. Terrific. And this is an element to building a differentiated model for your use case. Absolutely. And an element for making it safe and reliable. Right. Yeah. And I think when ChatGPT came out, there was a lot of frustration from people who didn’t like its personality.

The tone was a bit obsequious and it’s, you know, it’ll defer, it doesn’t want to give strong opinions on things. And to me, that demonstrates the need for, you know, many different types of models and tone and customizations depending on the use case and depending on the audience. And we can help you do that. Can you talk a little bit about what it means to fine tune a model and why that’s important?

If you look at what the difference is between ChatGPT or the most recent OpenAI text-davinci-003 model and what’s been in the platform for two years and has not gotten as much attention, the difference is fine tuning. Like it’s the same base model more or less. They took, you can see it on the OpenAI website.

It’s one of their code pre-trained models. And what made it so good that a million users signed up in five days was a fine tuning exercise. And so what fine tuning is, is gathering examples of the outputs you want for the tasks that you are trying to do. And then doing a little bit of extra training on top of this base model to specialize it to that task.
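As a rough sketch of what "gathering examples of the outputs you want" can look like in practice, the snippet below writes input and output pairs as JSON Lines, a common interchange format for fine-tuning data. The field names `prompt` and `completion` and the sample records are assumptions for illustration; check the exact schema your provider expects.

```python
import json

# Hypothetical examples of the task: an instruction and the output we want.
examples = [
    {"prompt": "Write a subject line for a demo follow-up email.",
     "completion": "Great meeting you today: next steps"},
    {"prompt": "Write a subject line for a renewal reminder.",
     "completion": "Your plan renews next week"},
    {"prompt": "Write a subject line for a welcome email.",
     "completion": ""},  # incomplete record, filtered out below
]


def to_jsonl(records):
    """Serialize complete records as one JSON object per line."""
    lines = []
    for record in records:
        if not record["prompt"].strip() or not record["completion"].strip():
            continue  # skip records with a missing input or output
        lines.append(json.dumps(
            {"prompt": record["prompt"], "completion": record["completion"]}
        ))
    return "\n".join(lines)


jsonl_text = to_jsonl(examples)
print(jsonl_text)
```

The extra training step itself happens on the provider's side or in a training framework; this only shows the data preparation.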

What OpenAI, I think, did first and others have followed to do is to first do a fine tuning round of these models on input and output pairs that are actually instructions and the results that you would like from the instructions. So those are human generated pairs of data. And then to further fine tune the model, using something called reinforcement learning from human feedback, where you get human preference data.

So you show people a few different generations from the model, ask them to rank them or choose which of two they prefer. And then use that to train a signal that can ultimately fine tune the model. And it turns out that reinforcement learning from human feedback makes a huge difference to performance. Like it’s really hard to overstate that.
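One common way to turn "which of two do you prefer" data into a training signal is a pairwise logistic loss on a reward model's scores. The sketch below shows only that loss, with made-up scores standing in for a reward model's outputs; it is an illustration of the general idea, not a description of any particular company's training code.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the preferred output scores higher than the
    rejected one, and large when the ordering is wrong.
    """
    diff = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical (score of preferred output, score of other output) pairs.
comparisons = [(2.0, 0.5), (0.1, 1.2)]
losses = [preference_loss(chosen, rejected) for chosen, rejected in comparisons]
print(losses)
```

In the first pair the scores agree with the human preference, so the loss is low; in the second they disagree, so the loss is higher. Training would adjust the reward model to reduce this loss, and the reward model then guides the fine-tuning of the language model.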

In the InstructGPT paper that OpenAI released, they compared a one or two billion parameter model with instruction tuning and RLHF to the full GPT-3 model and people preferred that. Despite the fact it was a hundred times smaller. Anthropic had this very exciting paper just a couple of weeks ago where actually we’re able to get similar results to RLHF without the H. So just actually having a second model provide the evaluation feedback as well.

And that’s obviously a lot more scalable. And what data do developers need to bring in order to fine tune a model? So there’s kind of two types of fine tuning you might do. They might just show up with a corpus of books or some background. They just want to fine tune for tone. They have their company’s chat logs, or tone of voice from marketing communications.

And they just want to adjust the tone. Or all the emails they’ve sent us. All the emails they’ve sent, for example. That’s kind of almost extra pre-training. I would think about it as, but it’s fine tuning as well. And then the other fine tuning data comes actually from in production usage. So once they have their app being used, they’re capturing the data that their customers are providing.

They’re capturing feedback data from that. And in some sense it’s being automated at this point. Like HumanLoop is taking care of that data capture for you and it’s making the fine tuning easy. So you have an interaction with a customer that the LLM produces and the customer sort of gives a thumbs up or thumbs down as to whether that was helpful.

To give you a concrete example, imagine you give the email example. Imagine that you’re helping someone draft a sales email. And so you generate a first draft for them and then they either send it or they don’t. So that’s like a very interesting piece of feedback that you can capture. They probably edit it. So you can capture the edited text and they may get a response or they don’t get a response.
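The sales email example can be sketched as a small data structure. Everything here is hypothetical, including the class, field, and function names; it is not HumanLoop's API, only one way the three signals (sent or not, edited text, reply or not) could be recorded and turned into fine-tuning examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftFeedback:
    """Feedback captured for one generated email draft (hypothetical schema)."""
    prompt: str
    generated_draft: str
    sent: bool
    edited_text: Optional[str] = None   # what the user actually sent, if edited
    got_reply: Optional[bool] = None    # unknown until later


def to_training_example(feedback):
    """Keep drafts the user sent; prefer the user's edited text as the target."""
    if not feedback.sent:
        return None  # an unsent draft is treated as a negative signal
    target = feedback.edited_text or feedback.generated_draft
    return {"prompt": feedback.prompt, "completion": target}


# Invented interaction log for this example.
log = [
    DraftFeedback("Pitch our CRM to a dentist", "Hi, buy our CRM.", sent=False),
    DraftFeedback("Pitch our CRM to a lawyer", "Hello, here is our CRM.",
                  sent=True, edited_text="Hi Sam, here is how our CRM helps law firms.",
                  got_reply=True),
    DraftFeedback("Pitch our CRM to a florist", "Hi, our CRM tracks orders.", sent=True),
]
training_examples = [ex for ex in map(to_training_example, log) if ex is not None]
print(training_examples)
```

A real pipeline might also weight examples by whether a reply arrived, which this sketch records but does not use.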

So all of those bits of feedback are things we would capture and then use to drive improvements of the underlying model. Got it. If a developer is trying to build an app using a large language model and is doing it for the first time, what problems are they likely to encounter and how do you guys help them address some of those problems?

Yeah, so we typically help developers with kind of three key problems. One is prototyping, evaluation and finally customization. Maybe I can sort of talk about each of those. So at the early stages of developing a new large language model product, you have to try and get a good prompt that works well for your use case. That tends to be highly iterative.

You have hundreds of different versions of these things lying around. Managing the complexity of that, versioning, experimenting, that’s something we help with. Then the use cases that people are building now tend to be a lot more subjective than you might have done with machine learning before. And so evaluation is a lot harder. You can’t just calculate accuracy on a test set.

And so helping developers understand how well is my app working with my end customers is the next thing that we really make easy. And finally customization. Everyone has access to the same base models. Everyone can use GPT-3. But if you wanna build something differentiated, you need to find a way to customize the model to your use case, to your end users, to your context.

And we make that much easier both through fine tuning and also through a framework for running experiments. We can help you get a product to market faster, but most importantly, once you’re there, we can help you make something that your users prefer over the base models. That seems pretty fundamental. I mean, it’s prototyping, getting you the first versions out, testing and evaluation, and then differentiation.

This seems pretty fundamental to building something great. I think so. I mean, we really hope that this is a platform on top of which the next million developers can build LLM applications. And we worked really closely with some of the first companies to realize the importance of this, understood the pain points they had, and in a proper YC approach have tried to build something that those people really wanted.

And I think we got to a point that now we’re seeing from others that it really does solve acute pain points for them. And it doesn’t really matter to us what base language model you’re using. We can help you with the data feedback collection, with fine tuning, with prototyping, and those problems are gonna be very similar across different models.

And really, we just wanna help you get to the best result for your use case. And sometimes that’ll mean choosing a different model. I wanted to ask, how is the job or role of a developer likely to change in the future because of this technology? This is interesting. I thought about this a lot. I think in the short term, it augments developers, right?

You can do the same thing you could do faster. To me, the most impressive application we’ve seen of the large language model so far is GitHub Copilot. I think that they cracked a really novel UX and figured out how to apply a large language model in a way that’s now used by, I think, 100 million developers. And many people I speak to say that they’re finding a significant fraction of their code is being written by a large language model.

And I think if you’d asked people two years ago whether that would happen, no one would have thought so. One thing that is surprising to me is that the people who say to me they use it the most are some of the people I consider to be better or more senior developers. You might have thought this tool would help juniors more.

But I think people who are more accustomed to editing and reading code actually benefit more from the completions. So short term, it just accelerates us and allows us to do more. On a longer time horizon, you could imagine developers becoming more like product managers in that they’re writing the spec, they’re writing the documentation, but more of the grunt work and more of the boilerplate is taken care of by models.

I don’t know, long enough time horizon. I mean, there’s very few jobs that can be done so much through just text, right? We’ve really pushed it to the extreme. We’ve got GitHub and you have remote work. Engineers can do a lot of their jobs entirely sitting at a computer screen. And so when we do get towards things that look like AGI, I suspect that developers will actually be one of the first jobs to see large fractions of their job be automated, which I think is very counterintuitive, but also predicting the future is hard.

Yeah, what do you think the next breakthroughs will be in LLM technology? So I actually think here the roadmap is like, quite well known almost. Like I think there’s a bunch of things that are coming that are kind of baked in. We know they’re coming. We just have to wait for it to be achieved. One thing that I think developers will really care about is the context window.

So at the moment, when you use these models, there’s a limit to how much information you can feed it every time you use it, and extending that context window is going to add a lot more capabilities. One thing that I’m really excited about is actually augmenting large language models with the ability to take actions. And so we’ve seen a few examples of this.

That’s a startup called that are doing this and a few others, where you essentially let the large language model decide to take some tasks so it can output a string that says, search the internet for this thing. And then off the basis of the result, generate some more and repeats. You actually start treating these large language models much more like agents than just text generation machines.
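The loop described above (the model outputs a string requesting an action, the result is fed back, and generation continues) can be sketched as follows. The `SEARCH:` / `ANSWER:` / `RESULT:` text protocol, the stand-in model, and the stand-in search function are all invented for this example; a real system would call an actual language model and an actual search service.

```python
def fake_model(transcript):
    """Stand-in for a language model: asks for one search, then answers."""
    if "RESULT:" not in transcript:
        return "SEARCH: capital of France"
    return "ANSWER: Paris"


def fake_search(query):
    """Stand-in for an internet search, backed by a tiny lookup table."""
    knowledge = {"capital of France": "Paris is the capital of France."}
    return knowledge.get(query, "No results found.")


def run_agent(question, model, search, max_steps=5):
    """Let the model alternate between requesting searches and answering."""
    transcript = f"QUESTION: {question}\n"
    for _ in range(max_steps):
        output = model(transcript)
        transcript += output + "\n"
        if output.startswith("SEARCH:"):
            # The model asked for an action: run it and append the result.
            query = output[len("SEARCH:"):].strip()
            transcript += f"RESULT: {search(query)}\n"
        elif output.startswith("ANSWER:"):
            return output[len("ANSWER:"):].strip(), transcript
    return None, transcript  # gave up after max_steps


answer, trace = run_agent("What is the capital of France?", fake_model, fake_search)
print(answer)
print(trace)
```

The `max_steps` cap is a design choice in this sketch so that a model that never produces an answer cannot loop forever.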

Well, something we have to sort of expect or look forward to is AI taking actions. Can this technology just fundamentally be steered in a safe and ethical direction? And how? Oh, gosh, that’s a tough question. I certainly hope so. I think we need to spend more time thinking about this and working on it than we currently do, because as the capabilities increase, it becomes more pressing.

There’s a lot of different angles to that. So there are people who worry about just end safety. So people like Eliezer Yudkowsky, in order to distinguish himself from just normal AI safety, he talks about AI not-kill-everyone-ism. Like he thinks the risks are potentially so large that this could be an existential threat. And then there are just the shorter term threats, the social disruption.

People feel threatened by these models. There are gonna be societal consequences, even to the weaker versions on the path to AGI that raise serious ethical questions. The models bake in biases and preferences that were in the model and the data and the team that built it at the time that it was being constructed. So there are, it’s an ethical minefield.

I don’t think that means we shouldn’t do it because I think the potential benefits are huge as well, but we do need to tread very carefully. How strong is the network effect with these models? In other words, is it the case that in the future, there may be one model that sort of rules them all because it will be bigger and hence smarter than anything anyone else could build?

Or is that not the dynamic that’s at play here? So I don’t think that’s the dynamic that’s at play here. Like to me, the barriers to entry of training one of these models are mostly capital and talent. Like the people needed are still very specialized and very smart and you need lots of money to pay for GPUs.

But beyond that, I don’t see that much secret sauce, right? OpenAI, for all the criticism they get, they actually have been pretty open and DeepMind have been pretty open. They’ve published a lot about how they’ve achieved what they’ve achieved. And so the main barrier to replicating something like GPT-3 is can you get enough compute and can you get smart people and can you get the data?

And more people are following on their heels. There’s some question about whether or not the feedback data might give them a flywheel. I’m a little bit skeptical of that, that it would give them so much that no one could catch up. Why? That seems pretty compelling. If they have a two-year head start and thousands and thousands of apps get built, then the lead they have in terms of feedback data would seem to be pretty compelling.

So I think the feedback data is great for narrower applications, right? Like if you’re building an end user application, then I think you can get a lot of differentiation through feedback and customization. But they’re building this very general model that has to be good at everything. And so they can’t kind of like let it become bad at code whilst it gets good at something else, which others can do.

I see, got it. Now let me ask you probably the hardest question here. OpenAI’s mission is to build AGI, artificial general intelligence, so that machines can be at the cognitive level of humans, if not better. Do you think that’s within reach? Like the breakthroughs recently mean that that’s closer than people thought, or is this still for the time being science fiction?

So there’s a huge amount of uncertainty here. And if you poll experts, you get a wide range of opinions, even if you poll the people who are closest to it, if you chat to folks at OpenAI or other companies, opinions differ. But I think compared to most people’s perception in the public, people think it’s plausible sooner than I think a lot of us thought.

So there are prediction markets on this, Metaculus sort of polls people on how likely they think AGI will be. And I think the median estimate’s something like 2040. And even if you think that that’s plausible, that’s remarkably soon for a technology that might upend almost all of society. What is very clear is that we are still gonna see very dramatic improvements in the short term.

And even before AGI, a lot of societal transformation, a lot of economic benefit, but also questions that we’re gonna have to wrestle with to make sure that this is a positive for society. So yeah, I think on the short end of timelines, there are people who think 2030 is plausible, but those same people will accept there’s some probability that it won’t happen for hundreds of years.

There’s a distribution. If you take it seriously, I think you should take it seriously. And it’s very hard to take it seriously, even having made that choice of like, I’m gonna accept that by 2030, it’s plausible that we will have machines that can do all the cognitive tasks that humans can do and more. And then you ask me like, okay, Raza, are you building your company in a way that’s obviously gonna make sense in that world?

It’s like, I’m trying, but it’s really hard to internalize that intuitively. Stuart Russell has a point where he says, if I told you an alien civilization was gonna land on earth in 50 years, you wouldn’t do nothing. And there’s some possibility that, we’ve got something like an alien arriving soon. Right, soon, an alien arriving soon. Yeah, you heard it here first.

So let me ask you, what does this new technology mean for startups? Oh man, it’s unbelievably exciting. It’s really difficult to articulate. There’s so many things that previously you required a research team for and that felt just impossible that now you just ask the model. Like honestly stuff that during my PhD, I didn’t think would be possible for years or that I spent trying to solve problems on where you wanna have a system that can generate questions or can do something, it’ll be a really good chatbot like ChatGPT, like a realistic one that can understand context over long ranges of time, not like Alexa or Siri, that’s a single message.

The range of use cases is like, now feels to be more limited by imagination than it is limited by technology. And when there is a technology change this abrupt where something has improved so much, and YC teaches this, right, there’s sort of a few different things that open up opportunities for new applications. And we’re beginning to see it, a sort of Cambrian explosion of new startups.

I think the latest YC batch has many more startups. We see it at HumanLoop, we get a lot of inbound interest from companies that are at the beginning of their explorations and trying to figure out how do I take this raw model and this raw intelligence and actually turn that into a differentiated product. Hopefully we have some AI engineers or aspiring AI engineers listening today and might be interested in working at HumanLoop.

Are you guys hiring and what kind of culture and company you’re trying to build? We absolutely are hiring. We’re hoping to build a platform that’s, potentially one of the most disruptive technologies we’ve ever had, and that ideally will be used by millions of developers in the future. And there’s gonna be a lot of doing stuff for the first time and also inventing novel UX or UI experiences.

So full stack developers who are comfortable, like genuinely really comfortable up and down the stack and who deeply care about the end user experience who will enjoy speaking to our customers. And they’re fun customers to work with because we’re working with startups and AI companies who are really on the cutting edge, they’re really innovators. You know, if that sounds exciting to you, it will be very hard.

Lots of it will be very new, but it’ll also be very rewarding. Well, this has been really fascinating. I think what my crystal ball says is like one day in the future, literally millions of developers will be using your tools to build great applications using AI technology. So I wish you luck and thank you again for your time.

Thank you, Ali. It’s been an absolute pleasure.