[00:00:00] Rohit Agarwal: I think this is super exciting to be doing this. Especially excited to have Rohit. This is gonna be confusing continuously: when I say Rohit, I mean Rohit Chatter, and when Rohit says Rohit, he means me. So we'll keep doing this joke throughout. But it's super exciting to have you with us. I feel that there isn't enough spotlight put on LLMs in production, specifically on the real ROI that enterprises are deriving from LLMs and the generative AI wave. And when I heard your story, and this was almost 4, 5 months back, of how Walmart's really focusing on semantic caching, I felt, hey, this is something amazing, and when you launch, we'd love to talk about it and share your journey. So really excited to be doing this with you, Rohit. Just a quick introduction for everybody: Rohit is the chief software architect at Walmart. He's been working with AI for the last 10 plus years and has had a fantastic run. For this talk we'll specifically be talking about semantic caching, Rohit's use of LLMs and gen AI, and how this new wave of ML and AI has lowered the barriers to entry for newer people to come in and explore. Rohit, why don't you come on and take us through your journey for the first 2 minutes, and then we'll go right into the session.
[00:01:28] Rohit Chatter: Hi. I'm Rohit Chatter. I've been in the industry for almost 27 plus years. I joined Walmart almost 2 and a half years back as chief software architect for the Walmart.com ecommerce wing. Walmart is pretty big, as all of you know; it's Fortune 1, with stores and online, and, you name it, Walmart has it. In the US it's significantly large, and it's also expanding in multiple countries. Actually, at Walmart I'm working out of Bangalore. It's a little surprising having the chief software architect of Walmart.com working from Bangalore, but I feel the world has changed post COVID: you can be anywhere if you wanna create an impact for the organization. Before joining Walmart, I was at InMobi. I joined InMobi more as an architect, but, coincidentally, fortunately, unfortunately, however we want to put it, I grew into management and became the CTO. When I became CTO, I realized things are not as technical. We all believe that CTO is all tech, but it's actually tech plus management plus strategy plus everything else. And I realized my energy was not in a positive zone, because I like talking about tech and learning from people, so I decided to become an IC again. Who knows, maybe I'll become a manager again somewhere; that cycle will continue, because it has happened with me before. Before InMobi, I was at Yahoo. There, I joined as an entry level manager and ended up becoming more like an architect for Yahoo.
[00:03:05] Rohit Chatter: Before that, I was in startups and such. So along my journey, architect or manager, as long as it teaches me skills, as long as it keeps me in tech, that has been my attraction all this while. When I joined Walmart, in just the recent last 6 months we had an opportunity to try out gen AI for search, and that's what I'm gonna talk about today: gen AI and semantic caching. And we'll see if I can survive the questions coming from Rohit, or we'll find new opportunities to learn new things and I'll go back to my learning time. Yeah. Rohit, over to you.
[00:03:47] Rohit Agarwal: Absolutely. Thanks for that, Rohit. And I think it's great. I don't think I've met many people who've gone to the level of CTO of a large company like InMobi and then decided, no, let me come back to an IC role and build something for real. I think this shows the true engineer spirit. A lot of us do have that. All of the smart engineers who've gotten into people management roles to some extent are like, no, we wanna go back to IC, because there's a lot more impact that can come out from us. So interesting to hear that journey, Rohit. Awesome. Cool. For folks in the audience: we've done 2 of these collaborative sessions before, so feel free to ask questions in the chat window. I'll try to pick up as many as I can during the conversation, and we'll pick up everything that's remaining towards the end. Towards the end we'll also bring a bunch of you up on stage for Q&A with Rohit as well. Perfect. Rohit, we'll just ease into this. I know we have a very exciting demo that I'm sure a lot of us are dying to see, on how generative search really works on Walmart and what's behind it. But can you start off with what were the core problem statements? What was Walmart doing earlier, what is it now, and what was the big press release that happened, what, 4 days back at CES?
[00:05:12] Rohit Chatter: So, around June... actually, every company is spending energy and time to figure out how to get more and more users onto their app, have them come on board and do product search. We are aspiring to be the place for all product searches. And with that said, you get all kinds of searches, and there's a long tail in how you categorize different queries. We wanted to improve on the tail side of the queries, but it's very ambiguous; you really don't know what exactly the customer is looking for. For example, people type "party supplies", "camping essentials", "football watch party", or "birthday gift for dad, 60 years old". Imagine somebody typing that in. There's a plethora of options to present, and which options should be presented first? Most people have invested significantly on the NLP side: look at the engagement data, look at the crawling piece, and then figure out, for a given query, after doing all the tokenizing and such, how do you retrieve the right set of products? That's been the biggest thing. Now, obviously, tail queries have very low engagement data, so knowing what is selling more or what people like becomes a little sparse. That's the problem we were presented with: fix the tail queries, which we categorize as 30 percent. It's a long tail where you don't know whether the conversion will happen or not. Solving that kind of problem is not a straight problem. We thought, why not take this tough problem, because if we do better, then those customers of ours who came there just browsing or checking will be pleasantly surprised if we present something nice.
[00:07:02] Rohit Chatter: And that's been our aspiration. We eventually ended up leveraging generative AI for that, but before that, there have been all kinds of models that use CTR, ATC, conversion, semantic matching, and to an extent token matching, exact match, fuzzy match, every word that has come up. That has been the journey. But now at CES, Doug himself, the CEO of Walmart globally, unveiled on stage, along with Satya Nadella, that Walmart at its scale has been the first one to release generative AI capabilities for its customers to be able to shop a little better. I'll give you an example of what that means.
[00:07:45] Rohit Agarwal: So just quickly, what is a semantic cache? A regular cache is where the exact same input comes in: we've basically hashed that input, stored it as a cache key, and we're able to retrieve from the cache using an exact match on the input. That is the regular, simple cache, and it's used all across engineering. A semantic cache is where similar queries can be served the same output. For example, I can ask "who's the president of the United States" versus "who's the president of the USA". These are both the same query, but a simple cache will not serve both of them because they're not an exact match, whereas a semantic cache will be able to serve both together. So that's what semantic caching is as a concept. It's really useful especially for ecommerce, which is why Walmart delivered on it, and even in regular search, because now you're able to cache queries that are very similar to each other, and you don't have to go through the entire ML pipeline again.
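To make that contrast concrete, here's a minimal sketch of both caches, assuming a sentence-transformers embedding model and an illustrative 0.9 cosine threshold; none of this is Walmart's actual stack:

```python
# Exact-match cache vs. semantic cache, side by side (illustrative only).
import hashlib
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

exact_cache: dict[str, str] = {}                    # sha256(query) -> answer
semantic_cache: list[tuple[np.ndarray, str]] = []   # (embedding, answer)

def key(query: str) -> str:
    return hashlib.sha256(query.lower().encode()).hexdigest()

def put(query: str, answer: str) -> None:
    exact_cache[key(query)] = answer
    emb = model.encode(query, normalize_embeddings=True)
    semantic_cache.append((emb, answer))

def exact_get(query: str) -> str | None:
    return exact_cache.get(key(query))

def semantic_get(query: str, threshold: float = 0.9) -> str | None:
    q = model.encode(query, normalize_embeddings=True)
    for emb, answer in semantic_cache:
        if float(q @ emb) >= threshold:   # cosine similarity (normalized)
            return answer
    return None

put("who's the president of the United States", "<cached answer>")
print(exact_get("who's the president of the USA"))     # None: different hash
print(semantic_get("who's the president of the USA"))  # hit: same meaning
```

The two "president" queries hash to different keys, so the exact cache misses, while their embeddings clear the threshold and the semantic cache hits.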
[00:08:55] Rohit Chatter: As you can see at the top, I've typed "football watch party", and then you will see the results. You get combined results: there will be snacks and chips, party drinks, party supplies, Super Bowl apparel, and televisions. Like in India, when there's a World Cup happening, people buy large TVs with better resolution and good sound. In each of these categories, we also try to explain what the meaning is. Right? We'll say, "easy to serve snacks", and so on. So we try to educate: when you search for something and we are presenting a category, we'll also educate, and then you can keep scrolling. Essentially, we're trying to understand that "football watch party" largely means a few folks at your home, and you are going to have snacks, drinks, games, a bunch of those things.
[00:09:53] Rohit Chatter: Right now, we have limited it to, say, 5 product groupings, but it can work with any number of product groupings. We also make sure the groupings are mutually exclusive and collectively exhaustive, the MECE principle, which we follow to enable our customers to explore the range of products available to them. Similarly, I tested it myself: "I'm planning for a camping trip. I already have a tent, what else do you suggest?" You can actually express yourself, and it will show everything else but a tent. So it opens up the opportunity for customers to come and express their own needs without doing what we've all been trained to do: search by keyword. Today, if you go on Google, the query in your brain needs to be transferred into some main keywords, and then it becomes keyword engineering; you figure it out after typing. We're saying, don't type too many things. Just express. It can be a whole sentence, and we'll figure it out.
[00:11:11] Rohit Agarwal: Very interesting. In fact, I had seen a Walmart video a long time back where they were saying that converting these to natural interfaces makes so much sense: when you walk up to a Walmart employee, you're not saying "party supplies". You're not speaking in keywords in the real world. It's just that we've become used to, or Google's made us used to, searching in keywords, and now we're able to express ourselves better. Does it also have a voice component to it, Rohit?
[00:11:41] Rohit Chatter: We have a separate voice assistant that enables certain things, but not for search yet. But work is in progress to enable a bunch of things like voice and image, all combined together. Yeah.
[00:11:55] Rohit Agarwal: Very interesting. Okay. Let's get a little technical, Rohit. To the extent that you're able to share, we'd love to know how this is implemented. Let's talk about search first. I'm guessing there's some flavor of embeddings here and some flavor of caching. So let's talk about the embeddings first. How are you getting similar products? Let's take this example: I have a tent, I'm going camping, what other supplies do I need? What happens behind the scenes with the new Walmart generative search that makes it possible to do something
[00:12:29] Rohit Chatter: like this. So, most people already know this pattern: we take our large catalog, generate embeddings for it, and store them in a vector DB. We also have a Solr-based lexical search available, because in some cases semantic, ANN-based matching is good, and in other cases exact or lexical match is good. Obviously, years' worth of investment has gone into making the existing stack work, so we try to take the goodness of both sides of the world. Given that we have a catalog, you have a query or question: "I have a tent. What else do you suggest?" You translate that: the customer is saying the tent is already there and the rest needs to be bought. Basically, that means camping essentials: it could be lights, it could be cookware, it could be a sleeping bag. And then you have the tent. When we convert this into the query, we say all of this, and not that. In the earlier world, we'd have used NLP to translate it and construct the negative part, you know, minus the tent. With generative AI, we are able to create product groupings for all of it, and gen AI helps us to not create a product grouping for the tent.
[00:13:53] Rohit Chatter: Now we have a product grouping for each of these: sleeping bags, cookware, and so on. We know, for camping, what kind of sleeping bags and what kind of cookware are required, based on a certain understanding of the products and their features. So we formulate that semantic query, or you can call it an ANN query, fire it, and get those products retrieved from the catalog based on the vector match. We also come up with an internal rewritten version of the query, what we call query expansion: given a query and a product grouping, how would you query this if you had to do a lexical match? Take both sides, combine them, and then, obviously, rerank.
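A toy, self-contained sketch of that flow, with hard-coded product groupings standing in for the LLM's plan and an in-memory catalog standing in for the vector DB; the model choice and data are illustrative assumptions:

```python
# "I have a tent, what else?" — groupings drive ANN retrieval over catalog
# embeddings, with the grouping the customer already owns excluded.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

catalog = ["2-person dome tent", "LED camping lantern",
           "cast iron camping cookware", "mummy sleeping bag"]
catalog_emb = model.encode(catalog, normalize_embeddings=True)

# Groupings an LLM might plan for the query above:
groupings = ["camping lights", "camping cookware", "sleeping bag", "tent"]
owned = {"tent"}  # the negative part: don't build a grouping for this

for g in groupings:
    if g in owned:
        continue
    q = model.encode(g, normalize_embeddings=True)
    scores = catalog_emb @ q               # cosine similarity (normalized)
    best = int(np.argmax(scores))
    print(f"{g}: {catalog[best]} ({scores[best]:.2f})")
```

In the real system the lexical (query-expansion) results would be merged with these ANN hits before the final rerank.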
[00:14:39] Rohit Agarwal: So the string that's converted to the query, the natural language input that came in: is there a smaller model you're using to convert it, since you're not using traditional NLP this time? What is used to convert the natural language sentence to the actual query?
[00:14:56] Rohit Chatter: So we use a combination of things. We definitely use NLP for our own existing stack; there's goodness in the existing stack. And for the new stack, we are using open source models which are fine tuned with our own engagement data, so that retrieval is biased toward your catalog, because your catalog is your selling point. You don't want random stuff coming out and then not having anything to sell. It doesn't make sense, and it's a very bad experience for the customer. We take our customer engagement data at an aggregate level and fine tune the model, so it has a much higher recall set and better precision. The open source models are largely from HuggingFace; the community is pretty big. We use the leaderboard to understand which model will suit us the most. Obviously, you have to look at the size of the model, its embedding dimension, the kind of GPU it will fit on, and all that; we'll talk about it a little later. With all those considerations plus the latency, you decide which is the right model for you. Then you take your custom data, prepare it, and train or fine-tune your own model.
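A hedged sketch of that fine-tuning recipe, using the standard sentence-transformers in-batch-negatives setup; the (query, engaged product) pairs, model choice, and output path are made-up stand-ins for the aggregated engagement data described:

```python
# Fine-tune an open-source embedding model on engagement pairs so that
# retrieval is biased toward the sellable catalog.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Aggregate engagement signals: queries paired with product titles that
# customers actually clicked / added to cart (illustrative data).
pairs = [
    ("football watch party", "tortilla chips party size"),
    ("camping essentials", "LED camping lantern"),
]
train_data = [InputExample(texts=[q, title]) for q, title in pairs]
loader = DataLoader(train_data, shuffle=True, batch_size=2)

# In-batch negatives: other products in the batch act as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned-retail-embedder")  # hypothetical output path
```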
[00:16:04] Rohit Agarwal: Let's talk about this a little bit. How do you pick which model to start testing with? What was the process to test and choose the embedding model? And you mentioned that these leaderboards keep updating; there's always a new model coming up. How have you architected the system so that you can update either the fine-tune or the model itself and stay forward compatible?
[00:16:30] Rohit Chatter: Pretty loaded question. I'll try to do it justice. How do we identify which model? It's always a POC, and most people know that without a POC in engineering, nothing is possible. So we started with a BERT-based model, if I'm not forgetting, the MiniLM L6 v2; I can't pronounce the whole thing. We started with that, and then we realized a few queries were not doing well. You have some table of test queries that you define, and you wanna see how they do. One query I can tell you that didn't work for us was "TV for living room". If you go and search, sometimes, or mostly, you'll get TV stands. That's one example. How do you change that? Or try "tech gadgets for men" or things like that, and you'll see that some of these don't translate into the right results at all. So you try different models and see which one works. Before you try, you also look at which dataset they were trained on: you go to the model card and see what data was used to train it.
[00:17:33] Rohit Chatter: Once you know, okay, this dataset seems to match your retail need, which is very basic, you still can't just take that model and go live, because then you get all kinds of surprises in the wild. You take that model and start running your queries through it. Any model will perform well if you change the query, so the question is, first, what part of the query needs to be changed, and how will you change it so that the model does well? And second, once you've figured out the right form of the query to send for ANN, what needs to be tuned on the similarity side, which is cosine similarity, to retrieve the right set of products? We tried a bunch of these, as I said: MiniLM, then E5 small v2, then BGE, and then GTE Large. There's a plethora of them, but we tried these four, and then we settled on one of them for semantic match.
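A sketch of that POC loop: run a small golden set of tricky queries through each candidate and compare what ranks first. The four candidate names are the public HuggingFace versions of the models mentioned; the catalog, queries, and scoring are illustrative:

```python
# Compare candidate embedding models on a golden set of hard queries.
from sentence_transformers import SentenceTransformer

candidates = ["sentence-transformers/all-MiniLM-L6-v2",
              "intfloat/e5-small-v2",
              "BAAI/bge-small-en-v1.5",
              "thenlper/gte-large"]
golden = ["TV for living room", "tech gadgets for men"]
catalog = ["55 inch 4K smart TV", "TV stand with storage",
           "wireless earbuds", "smart watch for men"]

for name in candidates:
    model = SentenceTransformer(name)
    cat_emb = model.encode(catalog, normalize_embeddings=True)
    for q in golden:
        # E5 models expect a "query: " prefix; harmless to skip for others.
        text = f"query: {q}" if "e5" in name else q
        q_emb = model.encode(text, normalize_embeddings=True)
        ranked = sorted(zip(cat_emb @ q_emb, catalog), reverse=True)
        print(name, "|", q, "->", ranked[0][1])
```

The failure mode described above shows up directly: a weak model ranks "TV stand with storage" first for "TV for living room".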
[00:18:37] Rohit Agarwal: Two more things on top of that. How do you keep it updated? Do you keep fine tuning the model continuously? And how do you measure that, okay, this embedding model performed better than the other?
[00:18:50] Rohit Chatter: So, once you decide which model is good, you don't revisit it until it's really doing badly or it stops covering your use cases. Once you freeze, you probably don't invest in comparing models anymore; you just keep fine tuning the model you've decided to invest in. That's one decision point you have to make: once you decide and it's working for you, you invest in making it work. Then, how often do we tune it? We make sure the engagement data is significant enough; it can go anywhere between a month and a quarter. As you fine-tune, you usually don't get a significant incremental gain until the types of queries change significantly. And we believe that with the release of this whole gen AI based search, customer behavior will change; they will come and type different queries. All the tuning so far has happened around keyword based queries, you know, 4 or 5 keywords. The moment you type more than 6 or 7 keywords, no matter which search engine you use, you'll see it starts to not do as great. Obviously, as the queries change, the tuning will become significantly different. As we ramp up the traffic, we might have to fine-tune monthly initially, and then, once it reaches a stable point, maybe we'll move to a quarterly kind of cadence.
[00:20:23] Rohit Agarwal: Interesting. And what are some evaluation metrics? For the sake of the audience, to recap: we've discussed what generative search is, how Walmart is leveraging it, and given some examples. Now we're discussing how this generative search experience is created. We've discussed the embeddings and that we're fine tuning them. The last part would be: how are you evaluating for accuracy? Is it better recall? Is it human based? Is it AI? How are you measuring that this is performing well?
[00:21:00] Rohit Chatter: All of it! You definitely have recall based metrics, so you look at what recall set you're getting. What you also do is precision evaluation, because you'll still get some recall, but what if it's not great enough? So you improve the precision by feeding the right data into it. You also send results for human eval, because without human eval you won't know how good you really are. And relevance actually changes by location too, to be honest; different regions may have a different understanding of what you say and what you mean by it. Forget different languages: within English alone, the same thing can mean something different, and then language complexity kicks in. So you send some percentage for human eval; you can't send all of it, obviously, because it's very expensive. And there will be an error rate in that human eval too, so you have to account for that.
[00:21:55] Rohit Chatter: Then you use another model where you feed in the results you thought were good and get them validated in an automated way; and what the humans said, you validate that as well. And you send the negative cases too. You create a list so that when you are tuning the model, the data is pristine, and if the data is pristine, your recall will be better, and eventually the precision. Then you have a cross encoder built on top of it, which improves both recall and precision. The right mix is higher recall and higher precision; that's the nirvana state. But, obviously, when you push for precision, recall reduces, and when you increase recall, precision reduces.
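A sketch of that cross-encoder stage: the bi-encoder's high-recall candidate set is rescored pairwise against the query to sharpen precision. The reranker model is a common open-source choice assumed for illustration, not Walmart's:

```python
# Cross-encoder reranking: score (query, candidate) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "football watch party"
candidates = ["tortilla chips party size", "65 inch 4K TV",
              "garden hose", "NFL team jersey"]

# Unlike a bi-encoder, the cross-encoder sees query and product together,
# which is slower but more precise; that's why it runs on a short list.
scores = reranker.predict([(query, c) for c in candidates])
for score, item in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:+.2f}  {item}")
```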
[00:22:36] Rohit Agarwal: Correct. And what's the benchmark that you set, saying we won't go to production or we won't release before we hit these particular percentages or numbers on precision and recall?
[00:22:47] Rohit Chatter: So we had an existing NDCG score for the existing search. The new one has to either equal or beat it; you have to be NDCG positive for the top 10. We do NDCG at positions 1, 5, and 10. [Rohit Agarwal: Can you explain NDCG a little bit?] Okay. Yeah. It's called the NDCG metric: normalized discounted cumulative gain. What happens is, the first position is supposed to be the most relevant, the second the second most relevant, and so on. The moment a less relevant item lands in a high position, you get significantly penalized, because you wanna make sure the top 10 NDCG scores are pretty high. So it's an NDCG score we compare against, and it has to improve over a period of time; that's how we ended up marking whether we were ready for production. We had a little bit of a challenge there, and I don't know how many people will relate to it. In traditional search, it's a homogeneous scroll: people are used to doing a vertical scroll, whether you go on Amazon, Walmart, or Insta.
[00:24:00] Rohit Chatter: The moment we introduce the notion of product grouping, there's a bit of a shift. You typed, say, "football watch party", and suddenly you see a television. Some people have snacks and drinks in mind; somebody else has apparel. Most of the time, party means snacks and drinks, but then there's apparel, and theme based party supplies like plates and dinnerware, and so on. So initially we had this challenge: the tool we had for NDCG did not understand the notion of product grouping, and we were getting heavily penalized. So we started saying, hey, the query is not just the query; we also append the product grouping to the query so that we at least get a fair NDCG evaluation. That challenge can exist for others too: NDCG doesn't account for product grouping in its scoring; it assumes a homogeneous list.
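For reference, NDCG@k as described, in a few lines; the graded relevance labels (0 to 3) and the served list are made up:

```python
# NDCG@k: DCG discounts relevance by log2 of position and is normalized
# by the ideal ordering's DCG, so a relevant item shown low gets penalized.
import math

def dcg(relevances: list[float], k: int) -> float:
    return sum(rel / math.log2(i + 2)          # i=0 is position 1
               for i, rel in enumerate(relevances[:k]))

def ndcg(relevances: list[float], k: int) -> float:
    ideal = dcg(sorted(relevances, reverse=True), k)  # best possible order
    return dcg(relevances, k) / ideal if ideal else 0.0

# Judged relevance of what the engine returned, in returned order:
served = [3, 2, 3, 0, 1, 2]
for k in (1, 5, 10):                           # the positions named above
    print(f"NDCG@{k} = {ndcg(served, k):.3f}")
```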
[00:24:56] Rohit Agarwal: How long did it take, Rohit, from a POC saying, okay, this works, to actually going into production? How long was that period?
[00:25:03] Rohit Chatter: I was talking to one of my product and tech guys about this: we built the whole thing end to end in 4 months, working day and night, crazy. It took us 3 months to get to the point where we could say the ATC is good, the GMV is right, and relevance is good. I feel building is not the thing; what's more important is that once you build it, you put it out in the wild and see how customers react to it. We had tons of learnings, and that itself took a decent amount of time to productionize. And we are still working on productionizing and scaling up, and we have our own challenges that will continue. One question people will ask here too: what about the latency? Because gen AI is not quick enough. We'll get to that when the question comes in. Yeah.
[00:25:56] Rohit Agarwal: Yeah, that was going to be my next question: latency and cost. Traditional search is fast and cheap; gen AI is both expensive and slow. How are you dealing with that? But before we go there, there's one very pointed question on this topic: the response that you're evaluating, is it a list assortment of SKUs, or is it evaluated point wise? I'm guessing the question is whether the evaluation is done on the entire list or on individual items.
[00:26:31] Rohit Chatter: It's a little complicated. You generate product groupings that are not necessarily completely context aware of your catalog, because it's a general query. So you'll end up with a product grouping which doesn't match your catalog, and you get items which fall into a slightly different product grouping. You have a product grouping generated by generative AI, and then you have product groupings generated on the fly from your catalog. What we do is use simple clustering techniques combined with topic modeling, cluster the products together, and then rank the product groupings by likelihood of ATC. For example, for "football party", obviously snacks and chips or drinks should come at the top. If you put the TV first, it's going to be like, are you planning to sell TVs more than snacks and chips? Not everybody will be buying a TV. So, yes, it's a combination of those things; it's not a straightforward point wise or list wise answer.
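A sketch of that on-the-fly grouping step, with plain KMeans standing in for the clustering-plus-topic-modeling combination and hypothetical add-to-cart priors ordering the groups:

```python
# Cluster retrieved products by embedding, then order groups by an
# engagement prior so snacks outrank TVs for a party query.
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
products = ["tortilla chips", "salsa dip", "65 inch 4K TV",
            "soundbar", "NFL jersey", "team logo cap"]
emb = model.encode(products, normalize_embeddings=True)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
groups = defaultdict(list)
for label, name in zip(labels, products):
    groups[label].append(name)

# Hypothetical per-item add-to-cart likelihoods; a group scores the mean.
atc_prior = {"tortilla chips": 0.30, "salsa dip": 0.28, "65 inch 4K TV": 0.05,
             "soundbar": 0.06, "NFL jersey": 0.12, "team logo cap": 0.10}
ranked = sorted(groups.values(),
                key=lambda items: np.mean([atc_prior[i] for i in items]),
                reverse=True)
for grouping in ranked:
    print(grouping)
```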
[00:27:36] Rohit Agarwal: Yeah. I can imagine that at some point this might also play in well with advertising: if it's the Super Bowl happening, pick out the list of advertisers who are also advertising on, say, Walmart or some other ecom player, and boost those.
[00:27:51] Rohit Chatter: More than tech, the business is very important in this question. People who are deep in retail will know that tech alone can't do this, because it has a lot of real world, human connotations. For example, Walmart is not just online but in stores also. Now, when you search for something and put a filter of "in store", suddenly your number of items will be much smaller, because we try to match your nearest store so pickup or delivery is easy, a bunch of those things. And as a user, you might feel, oh, Walmart has so little inventory to offer me for my party. That can happen because it depends on where you live and what inventory is available in real time among those nearby stores. So it's more complex than just getting the list of products right in front of the customer.
[00:28:40] Rohit Agarwal: Okay. Got it. We have so many questions, but let's keep on track. So, how do cost and latency work out? What's the difference, and is the trade off worth it?
[00:28:51] Rohit Chatter: Oh, this is, I would say, a billion dollar question, not a million dollar question, because everybody is struggling with it. You know that the people making GPUs have skyrocketing stock prices, and getting GPUs is not straightforward; it's not easy, and it's very expensive. And over a period of 2, 3 years, they change so much that you don't know whether those GPUs will continue to serve your needs. So if you're strategic and forward looking and you have 3 years in mind, it's definitely worth thinking about, but you need a good business use case; it's not gonna come cheap. Earlier, the world was all about AI and ML: hiring those folks, getting the right data, investing, trying, POCs, validating. It takes its own sweet time, and you see frustration building up in the business about why something isn't working quickly. That investment gets replaced to some extent when you use generative AI. So we're at that place in the ecosystem where, if you have an idea... I'm not gonna answer the pointed question directly, because it's not straightforward. Yes, the cost is high.
[00:30:03] Rohit Chatter: Is it worth it? Yes, but it depends on for whom and for what. Say your company has been struggling to try out certain things, and you'd otherwise need, say, 20 salespeople. You don't know whether it will work, but you have to invest upfront to get those 20 people, give them the idea, and have them start working on it. With generative AI, you can actually do a proof of concept quickly, so that validation has become fast and easy. And it doesn't require too many tech folks to spend time, because that knowledge has been assimilated for you to be able to generate something out of it. Now, converting that into a business proposition at scale is a daunting challenge; it's not going to be straightforward. One, you need to plan: do you go for, say, a 7B model, or a 13B, or 20, 50? The larger the model, the bigger the GPU; the bigger the GPU, the higher the cost. And it's not only the cost. Say you can afford it, you have some 50 million worth of CapEx or OpEx. What about the latency? Customers cannot come to your app, type something, and wait 30 seconds; it's not that you're offering something so drastically different that it will change their world. Right? They will drop off if you don't serve them in less than 2, 3 seconds. And that's the first challenge you have to solve, because it's search. Right?
[00:31:30] Rohit Chatter: Imagine you're searching on an app and it says, wait for 30 seconds, and at the end it just shows snacks and chips. Are you crazy? That latency is unacceptable there. So you have to bring it down to less than, say, 2 to 3 seconds; we are working towards sub second. Now, how do you do that? That's one of the questions people ask. You have to be very careful when the query comes in. You don't need the completely formed response to retrieve the first product results. So you can say, I'm gonna do a not-so-perfect job in getting the first iteration out, and as the customer engages, in the second iteration, which is maybe 15 seconds later, I'm gonna enrich the same query with results which are more powerful, more appropriate, and more relevant. You think that way because that's how you engage the customer: hey, come and explore more products that I have, while you deal with the full latency. Is it ready for production? As I said, there are a lot of surprises. You can't just say, hey, I have an LLM and a query, and I'll give you something. You have to have guardrails. You have to follow the legalities of what can and cannot be searched, like profanity, people coming up with random queries, some queries that are not allowed. You should be able to not serve those, or convert them into products that are safe. So there are a lot of human based things you need to take care of beyond the tech piece, and that actually ends up taking a significant amount of time and investment.
[00:33:13] Rohit Agarwal: Yeah, that makes sense. So the learnings are: it's not sub second yet, but you're trying to get there, and there's this progressive enhancement of results. Streaming is also, to some extent, the same idea: you're giving users a chance to look at some information while the rest is being processed, and that makes us as humans a little more comfortable with waiting. So, first: which vector database, if you're open to sharing, open source or not? And then, where does the semantic cache come in? I realize we're almost 40 minutes in and we're only now starting on semantic cache, but all of this was supreme context; it was needed to get to this stage.
[00:34:03] Rohit Chatter: Yeah. So for the vector DB, we are leveraging a combination of things, but for now we are using Milvus. That's what we're using today, and we're trying a bunch of other things; we're also looking at other open source options. These are POCs that keep going; you have to keep trying different things. We're also using Vertex AI, and we're also using some Faiss based things. Different teams will leverage different things for different trials and POCs. Are we advocating any one specific option? Probably not, because you have to make sure it works for your use cases. Now, moving to the semantic cache. People who have worked on the query side know: one, queries are unpredictable; people can come and type anything.
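A minimal Milvus sketch, assuming the pymilvus MilvusClient convenience API with an embedded Milvus Lite file; collection name, data, and deployment are illustrative, not Walmart's setup:

```python
# Index catalog embeddings in Milvus and run an ANN search.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = MilvusClient("catalog_demo.db")          # local Milvus Lite file
client.create_collection("products", dimension=384)

titles = ["LED camping lantern", "mummy sleeping bag", "camping cookware"]
client.insert("products", [
    {"id": i, "vector": model.encode(t, normalize_embeddings=True).tolist(),
     "title": t}
    for i, t in enumerate(titles)
])

q = model.encode("camping essentials", normalize_embeddings=True).tolist()
hits = client.search("products", data=[q], limit=2, output_fields=["title"])
for hit in hits[0]:
    print(hit["entity"]["title"], hit["distance"])
```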
[00:34:49] Rohit Chatter: Second, when you come to a tail query, tail means it's not going to be repetitive. Now, if it's not going to be repetitive, what's the point of doing a semantic cache, or caching at all? The balance we actually found is this: the semantic meaning of the queries may not be as different as the queries themselves. If you create clusters of queries, each cluster ends up having a few queries in it which have almost the same meaning. That's where we thought we could improve. We don't wanna send every query to our LLM, because if we do that, first, it's unaffordable. Two, it's a concurrency challenge, with its own latency challenge. Three, it's not the right thing to do, because you don't know what comes out each time; tomorrow you change the model, there will be a bit of a change, and consistency can go very off at the same time. So we asked, what is the right balance? We came up with a similarity score; obviously, we use cosine similarity. And we ask, what should be the threshold beyond which it's a new query? For example, if you type "football watch party" and then "football watch party with friends", they mean the same; they fall within the same similarity threshold, and it will be served as a cache hit. We started out thinking we'd get 10, 20 percent caching for the tail queries, but to our surprise, we actually got close to the upper side of 50 percent. We were pleasantly surprised, because with these queries, you never know what will show up. For example: "toys for toddler", "toddler toys", "toys for my toddler". All of these mean the same. We have tried this, and it has given us a very significant gain, because spelling mistakes are taken care of, since semantically they still match. Prepositions, pronouns, all the things people keep adding, also get denoised, and you get the central meaning. Then people asked me, what if we scale it up beyond the tail? And I was like, if it worked for the tail, I'm ready, because the tail queries are the hardest to match. If you just said "toys", I know exactly what you're asking about. So, yeah, the caching is just gonna get better and better.
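A compact sketch of that threshold decision, assuming an illustrative model and a 0.85 cutoff you'd tune against your own traffic; the "LLM plan" is a stand-in for the expensive call being avoided:

```python
# Semantic cache for tail queries: embed, compare against cached query
# embeddings, and treat anything above the threshold as the same meaning.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
THRESHOLD = 0.85   # below this it's a new query; above, a cache hit

cache: list[tuple[np.ndarray, str]] = []   # (query embedding, LLM response)

def serve(query: str) -> str:
    q = model.encode(query, normalize_embeddings=True)
    for emb, response in cache:
        if float(q @ emb) >= THRESHOLD:    # cosine similarity (normalized)
            return f"HIT:  {response}"
    response = f"<LLM plan for {query!r}>"  # stand-in for the expensive call
    cache.append((q, response))
    return f"MISS: {response}"

for q in ["toys for toddler", "toddler toys", "toys for my toddler",
          "football watch party", "football watch party with friends"]:
    print(serve(q))
```

The toddler variants and the two party queries should collapse onto their first cached entries, which is exactly the denoising of prepositions, pronouns, and spelling variants described above.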
[00:37:30] Rohit Agarwal: Interesting. And 50 percent is just insane. To be honest, these are numbers even we're seeing with some of our customers, and when I tell people you can get a cache hit rate of 50, 60 percent, they say, no, that's just not possible, because we've not seen those kinds of cache hit rates before. But for fact based semantic caching, this is great validation: you can already see 50 percent on tail traffic. Head traffic will just get into the easy seventies, eighties; exact caching will already be very high there, but semantic caching will be even better. Another question on this: the real world moving parts of an organization like Walmart, things in inventory, new items coming in, old items going out. How is that handled as part of the cache?
[00:38:21] Rohit Chatter: So on the ANN side, people have worked on this, and it's not easy. If you're re-embedding a billion or 10 billion items daily, it's just not gonna be possible; imagine the amount of embedding generation and ingestion it would take. You have to be really crafty and separate out fast changing items versus slow changing items. For example, if you put inventory as part of your ANN, you're gonna be struggling. So you figure out, like in data warehousing, the slowly changing dimensions and the fast changing dimensions, the slow changing data and the fast changing data. For the fast changing data, you still keep it, but you take only the static part of it into the embedding. So you largely use ANN for making sure the relevance matching is good, and then on top of it you apply everything else. Otherwise, yeah, it's gonna be challenging. I have a few other techniques; I haven't tried them, and maybe I can't share them because it's a little proprietary, how we are making ANN able to take the fast changing data as well.
[00:39:29] Rohit Agarwal: Is there also some amount of metadata filtering that happens? So if something goes out of stock or a vendor is not available, those items just get filtered out by default, and you're not running an ANN over them?
[00:39:44] Rohit Chatter: You'll still go ahead and fetch them, but somewhere in the later part of the ranking stage, you demote them because the inventory was not available, something of that sort. You always want the product to be seen, because when you show the product, there's an option to say "shop similar", and that way you don't lose your customer there.
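A sketch of that late-stage demotion, where out-of-stock items keep their place in the candidate set but sink in the ranking; the scores and demotion factor are made up:

```python
# Demote (rather than drop) unavailable items so "shop similar" still works.
def rerank_with_inventory(hits: list[tuple[float, str]],
                          in_stock: set[str],
                          demote: float = 0.25) -> list[tuple[float, str]]:
    """hits: (relevance_score, item_id) pairs; unavailable items have
    their score multiplied by `demote` instead of being filtered out."""
    adjusted = [(score * (1.0 if item in in_stock else demote), item)
                for score, item in hits]
    return sorted(adjusted, reverse=True)

hits = [(0.92, "tent-A"), (0.88, "lantern-B"), (0.80, "stove-C")]
print(rerank_with_inventory(hits, in_stock={"lantern-B", "stove-C"}))
```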
[00:40:01] Rohit Agarwal: How similar are these 2 clusters for you, Rohit: the semantic cache cluster and the generative search cluster? Because this is similar information encoding; I'm guessing the same embeddings and vector DB. How closely related are these 2 embedding sets?
[00:40:16] Rohit Chatter: They're actually very different. The generative AI side is more like general knowledge, like what we used to call general knowledge as kids: generalized, not specialized. The other is fine tuned for your need; everything is very transactional for your business needs, very specific to what comes to you. You can think of a small model and a big model: the small model is fine tuned purely on your data, and the big model guides it to say, hey, for this context, this would be best. That's it.
[00:40:57] Rohit Agarwal: Got it. For semantic caching especially, I've heard from many people the question of how you segment data so you're not leaking from the cache. And this also goes to another question being asked: do you look at the past history of a user and maybe use that in what's returned, or are there plans to do that in the future? And if you do, how do you prevent, or segment, the cache so it doesn't leak?
[00:41:24] Rohit Chatter: It's not a straightforward answer, and we have a whole backlog created on how we're gonna personalize. First it's personalized, then hyper personalized. For example, you have something in the cart versus you already bought it, and the query is the same. Say "football watch party": you bought a TV from Walmart, so me showing you a TV makes no sense, right? And say you have apparel in the cart. How do I create the mix where I leverage that opportunity to showcase the right product and satisfy what you're looking for? It's not straightforward, and obviously it's gonna be a little proprietary, even thought process wise, because that's where we differentiate. But the point is, yes, there are initiatives and thinking around how to personalize each query. So if you type "football watch party" and I type "football watch party", we should be getting slightly different responses over a period of time as we enhance it further.
[00:42:29] Rohit Agarwal: Got it. And when you do that, how will you segment your cache so that one Rohit's recommendations are different from another Rohit's recommendations? Because now my history should not get leaked to somebody else while I'm still being served from the cache.
[00:42:52] Rohit Chatter: I wouldn't say it's history getting leaked. It will be a combination of RAG and a two stage retrieval. For example, if you come and search for "football watch party" or "camping trip", I know I have historical information, so I will change the context before I go ahead and retrieve the products, because finally the products are retrieved. So I wouldn't call it a leak; it's more like biasing towards what you're more likely to buy.
[00:43:23] Rohit Agarwal: Okay. Interesting. So it will end up being retrieval anyway, with context, and this context, I'm guessing, will be part of the cache system in any case.
[00:43:34] Rohit Chatter: It will be a combination of a one time cache, a semantic cache, and some hyper personalized information. So it's not gonna be a straightforward single embedding; it's gonna be multiple embeddings.
[00:43:44] Rohit Agarwal: Interesting. Okay. We're almost out of time. There are a lot of questions, and I've been trying to get a lot of them in.
[00:43:51] Rohit Chatter: I'm so sorry we couldn't do justice to all the questions.
[00:43:55] Rohit Agarwal: But this has just been amazing. There's a whole host of questions that have come in, and I think we've answered a lot of them over this one hour. Just one final question, and maybe, Rohit, if you have time, feel free to stick around and we'll catch up after this. So, this entire setup, and you mentioned there's a big backlog of things to be done: how is Walmart viewing generative AI, generative search, and semantic caching in its broader engineering roadmap? We've seen a lot of large organizations spending significant money, effort, and resources on really tuning for this, and it seems like this is a big win announced at CES. So what is the outlook for Walmart Labs specifically for generative AI going forward?
[00:44:47] Rohit Chatter: There's a huge focus coming from all quarters. There's a 3 year investment focus and a plan being carved out. Obviously, it includes the GPU, people, and product perspectives; all across the board, there are a bunch of things. I would say in almost every nook and corner where it's possible, cheaper, applicable, and will add value, gen AI is being considered. For example, if you are in a store and you have a question, we should be able to answer those store questions for you. For people working in stores, we should be able to have an operating manual for them so they don't need to train for that long; they can just quickly ask questions. So there's an enormous number of use cases, especially for Walmart, because it's both online and offline. There's huge opportunity in almost every aspect of the business to improve.
[00:45:46] Rohit Agarwal: Yeah. Interesting. What is next? Like, what comes right after this?
[00:45:52] Rohit Chatter: I do not know; it's a pretty big question. All I know is this year we will be investing significantly in gen AI. I just had a conversation before this meeting with one of my counterparts in the US: how do we enable AR, VR, and generative AI all put together? For example, and I'm just going to leave this thought in everybody's mind: you're at your friend's place, and say you really like a clock on the wall or something, and you've just bought a house. I'm going beyond the query here; you're not searching for anything. You take a picture: I wanna see how it would look in your house in a 3D modeled world. First, it's 3D modeling, so I should be able to place that particular thing. I should be able to search for that item or a similar item, one with the same texture, color, and so on. Then I should also be able to tell you, if you're buying this, maybe there's, let's say, that painting, and then a sofa. And how do you place that sofa there? Now I should be able to search for an L-shaped sofa if you're looking for one. And if you have some texture in mind, you can augment it, saying, hey, this sofa, but I need it in brown, not beige. And the rest you don't say.
[00:47:12] Rohit Chatter: You just assume you have the L-shaped sofa, and you say, I need this in brown. So now you not only have the image, you're also adding text context onto the image. When the search happens, it's gonna be an L-shaped sofa, leather or not leather, whatever we can detect, and on top of that, it has to be brown. Then you place it into the 3D house you're imagining, with the dimensions and everything, and you keep adding, and then you save it. It's more like an interior designing experience. The world is just open out there. It's no more active selling; it's living. So that's what I would say, but everything requires effort.
[00:48:00] Rohit Agarwal: Yeah. And you were mentioning this the last time we chatted: generative AI has enabled builders like us to even think about such possibilities. We've always had these fantasies of, what if I could point my camera here and start swiping through different TVs to see how they would look on my table. I'm sure we're gonna see these demos live in 2024. So that's a fantastic vision to have, and it also seems very doable now.
[00:48:31] Rohit Chatter: Yep. It's very much been fast tracked, at least.
[00:48:36] Rohit Agarwal: Yeah. Perfect. Cool. Rohit, thank you so much for doing this. This was 60 minutes packed with a lot of information. Thanks a lot for sharing all that you could. There are a bunch of questions; we're gonna compile all of them and share them with you, and we'll also try to figure out a way to share the answers back with everybody who joined. People who are in here, feel free to hang around; we can have a conversation after this. Rohit, if you have time, hang around too; we could bring some people up from the audience for questions. No compulsion; I know it's the middle of the day for you, and there's so much to be done. So feel free. Yeah.
[00:49:19] Rohit Chatter: Yeah, I'll take the leap. Thank you all. I appreciate Rohit and Vrushank for giving me this opportunity. Hope it was useful. I couldn't cover much, obviously, with some restrictions on my end, but I'll try to answer the questions I get right now as best as possible. So thank you all; I appreciate the time you gave.
[00:49:39] Rohit Agarwal: Yeah. Absolutely. We're gonna have you back in a while; we'll do part 2 of semantic caching with all the learnings you have from production. And you'll also have to reflect on your journey going from CTO to IC, and then probably back to management again. That's also an interesting journey, but for another day. Thanks so much for joining in.