Today, we sit down and talk with Abhi Samantapudi, Associate Product Manager here at Salesforce, about Bulk API 2.0 – what it means for developers, what’s in pilot, and what’s on the roadmap. 

Bulk API 2.0 is not just an iteration on an old friend, the original Bulk API. Abhi discusses the advantages of moving to the 2.0 version, which he describes as the optimal way to perform asynchronous, large-scale CRUD (create, read, update, and delete) operations with Salesforce.

We also talk about the Associate Product Manager (APM) program at Salesforce, which allows new graduates in fields related to STEM and business to join Salesforce as Associate Product Managers.

Show Highlights:

  • The benefits of the Associate Product Manager program
  • New features of the Bulk API 2.0
  • What is PK Chunking?
  • Developing with Bulk API vs. Bulk API 2.0
  • Tips for developing queries that go against large datasets
  • How to figure out the right tool for the job
  • Other things on the Bulk API 2.0 roadmap
  • How the composite graph payload works (now in pilot)
  • Examples of where you might run into locks and how to minimize them
  • How auto-handling of locks works

Links:

Abhi on LinkedIn: https://www.linkedin.com/in/abhilash-samantapudi/ 

Episode Transcript

Abhi Samantapudi:
Basically allows you to study two different majors at once. So in my case, that was electrical engineering, computer science, and business administration.

Josh Birk:
That is Abhi Samantapudi, an associate product manager here at Salesforce. I’m Josh Birk, your host for the Salesforce Developer Podcast. And here on the podcast, you’ll hear stories and insights from developers for developers. Today, we sit down and talk with Abhi about Bulk API 2.0: what it means for developers, what’s in pilot, and what’s on the roadmap. But we start, as we often do, with his early years, talking specifically here about the associate product manager program itself.

Abhi Samantapudi:
Yeah, absolutely. So the inspiration for the program actually came through Bret Taylor, who co-founded the program with Heather Conklin. Bret, back in I believe 2003 or 2004, was part of the original cohort of APMs at Google. And obviously he had a fantastic experience there, and his other cohort members have gone on to do really incredible things at Google and other tech companies. And so he wanted to bring that opportunity to Salesforce. It’s a program that essentially allows new graduates in fields related to STEM and business to join Salesforce as associate product managers. We essentially rotate on three different teams. So I’m currently in the middle of my first rotation and I’m going to be rotating to another team at the end of March.

Josh Birk:
And do you get to have a choice with the rotation or does Salesforce… Or it’s just like, “The API has this role, here’s your first rotation. We’re going to figure out your next one.”

Abhi Samantapudi:
So the first rotation is sort of chosen for us. But they do take into account what you’ve worked on in the past, what your interests are, and what experience you’ve had. From then on, you have the opportunity to network yourself and find opportunities. And then those teams will then all apply. And we sort of have a big networking day every eight months. And we all go in and get a chance to network with various teams. And we get to mention what our preferences are, and the other side as well mentions their preferences for APMs. So it’s sort of like a matching process that goes on. So I guess in summary, the first rotation is chosen for us. And from then on it’s based on our preferences and fit with teams that have opportunities. What’s great about that is that all of the projects have sort of VP or high-level approval. And so you know it’s going to be something that is of a high business need and has opportunity for you to grow and make an impact.

Josh Birk:
That’s brilliant. That really is. I’m looking back at my first real corporate job. After I left college, I did some freelancing, freelancing didn’t go so well, and I landed a job at a big insurance company. And I have to say, one of the things I don’t think college really properly prepped me for was the concept of networking. Knowing what part of your day should be devoted to it. And this is going to tell people just how old I am, but like, “Don’t call that person on the phone, get away from your desk, go to their desk and sit down and just have an impromptu meeting with them, so that you’re not just sending an email, but you’re actually having a proper conversation. That’s going to go faster.” So that’s really nice that they’re putting layers like that in, that early in the process.

Abhi Samantapudi:
Absolutely. I think just getting exposure to different parts of the business is also very important at large organizations. When you get exposure to all of the different sort of departments and efforts, you can collectively see what’s going on, and it allows you to make connections that other employees might not be able to make. I think it’s one of the big draws of the program. Hopefully, collectively, the APMs through this experience, when we move on to more full-time roles at Salesforce, we’re able to draw from our rotations and say, “Hey, I did this in Sales Cloud. And even though I’m in platform, here’s how I can apply my experience there.” Or we can say like, “I have a friend over in Marketing Cloud, here’s how my team can connect with them.” So, having the ability to network and make those connections, I think, ultimately will make us better product managers.

Josh Birk:
That’s awesome. Okay. So our big goal today is to make sure that people know that Bulk API 2.0 is not just an iteration on an old friend. High-level this for me: what are some of the big advantages of moving to the 2.0 version?

Abhi Samantapudi:
Yeah, absolutely. So Bulk API 2.0, we believe, is the optimal way to perform asynchronous, large-scale CRUD operations with Salesforce. The big difference there is that while with the original Bulk API, the clients or the customers must prepare and manage batches themselves, with Bulk API 2.0, we do the batching ourselves internally. And anyone that’s worked with the service knows how big of a time saver this is. And it also saves API calls. Also, we have simplified limits. So when you’re working with the original Bulk API, you’re very conscientious of the 15,000 batch limit per 24 hours. Versus when you’re working with Bulk 2.0, you’re working with the 150 million record limit. And so this simplification allows you to stop worrying about sort of the internal processing and just focus on getting your records into Salesforce. Beyond that, we also have some additional improvements in processing, such as automatically performing PK Chunking. And it also gets a substantial portion of our future investment. So wherever there are feature gaps, we’re working to actively close those out. And we’re also working on continuous enhancements to these features.

Josh Birk:
Got it. So you’re moving towards a future where there’s virtually no reason to use Bulk API 1.0, because Bulk API 2.0 will have feature parity with it and give you all the [inaudible 00:06:31] whistles.

Abhi Samantapudi:
So that’s certainly the long term vision for Bulk API 2.0.

Josh Birk:
Got you. Before I forget, I have to ask and I’m probably going to get this wrong. I was very unfamiliar with the term. What is PK crunching?

Abhi Samantapudi:
Yeah. So PK Chunking is a feature that we have with the Bulk API and Bulk API 2.0, and Bulk 2.0 does it automatically. Essentially what it does is it splits the bulk queries for large tables into chunks. And we do this based on the record IDs, or primary keys. So by splitting these large queries over very large tables into these chunks, we’re able to really improve the processing and ensure that queries have a greater chance of being successful.
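
The idea Abhi describes can be sketched in a few lines. This is a conceptual illustration only, with a hypothetical helper and numeric IDs standing in for Salesforce record IDs; it is not how the platform implements PK Chunking internally:

```python
# Conceptual sketch of PK Chunking: one bulk query over a huge table is split
# into sub-queries, each covering a contiguous range of primary keys (record
# IDs), so every chunk stays small enough to process reliably.

def pk_chunk_ranges(first_id: int, last_id: int, chunk_size: int):
    """Yield (start, end) primary-key ranges covering [first_id, last_id]."""
    start = first_id
    while start <= last_id:
        end = min(start + chunk_size - 1, last_id)
        yield (start, end)
        start = end + 1

def chunked_queries(table: str, first_id: int, last_id: int, chunk_size: int):
    """Turn one large query into one WHERE-bounded query per chunk."""
    return [
        f"SELECT Id FROM {table} WHERE Id >= {lo} AND Id <= {hi}"
        for lo, hi in pk_chunk_ranges(first_id, last_id, chunk_size)
    ]

queries = chunked_queries("Account", 1, 250_000, 100_000)
# Splits 250,000 IDs into three chunked queries
```

Each chunked query touches a bounded slice of the table, which is why a query that would time out whole can still succeed piecewise.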

Josh Birk:
Got it. And since you used the word primary key, I can now skip the question of what PK might stand for. Got that one. Okay. So walk me through the distinction for developers and data admins: is the starting point for working with the API 2.0 different from the 1.0? Since they’re not doing the batching, what’s step one for making sure they’re using 2.0?

Abhi Samantapudi:
So the starting point for using Bulk API or Bulk API 2.0 is largely the same for customers. You’re going to make your first API call, which is the same, just to create the job. Where the differences start coming up is in the batch management. So after you create a job via both processes, instead of creating the batches, as you do with the original Bulk API, you simply upload your data for your job. And then you close out the job and we handle all of the processing. So in summary, the starting point for both processes is the same, but in the middle, the way that you manage your data and upload it is a little bit different.
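
The three-step flow Abhi walks through (create the job, upload the data, close the job) looks roughly like this. The endpoint paths follow the public Bulk API 2.0 REST routes; the helper function, job ID, and CSV values are illustrative, and the sketch just builds each request as data rather than sending it:

```python
# Sketch of the Bulk API 2.0 ingest flow: create the job, upload the CSV in
# one shot, then mark the upload complete. No client-side batch management.

API_VERSION = "v54.0"  # assumption: any recent API version works the same way

def bulk2_ingest_requests(job_id: str, sobject: str, csv_data: str):
    """Return the three requests as (method, path, body) tuples."""
    base = f"/services/data/{API_VERSION}/jobs/ingest"
    return [
        # 1. Create the job (Salesforce returns the job id; shown as input here).
        ("POST", base, {"object": sobject, "operation": "insert",
                        "contentType": "CSV"}),
        # 2. Upload all the data in one PUT; Salesforce does the batching.
        ("PUT", f"{base}/{job_id}/batches", csv_data),
        # 3. Close the job; processing starts server-side.
        ("PATCH", f"{base}/{job_id}", {"state": "UploadComplete"}),
    ]

steps = bulk2_ingest_requests("750xx0000000001", "Account",
                              "Name\nAcme\nGlobex\n")
```

Compare that with the original Bulk API, where step 2 would be a loop creating and uploading each batch yourself.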

Josh Birk:
So the handshaking is simpler and easier and faster.

Abhi Samantapudi:
Exactly. There’s a few differences when it comes to headers, but for the most part, the process is very similar.

Josh Birk:
Got it. And if I am a developer and I’m looking at my data set, is there a magic number where I’m just like, “Nope, this is not for my REST APIs. Let me go look up Bulk API 2.0.”

Abhi Samantapudi:
Yeah, absolutely. This is a really important question. It’s one that we get quite a bit at our various conferences. So right now in our public documentation, you’ll find that it’s 2,000 records. And I believe that if you are watching this in early 2022 and you’re looking to start a bulk workload, this is the appropriate number for you. I’ll just caveat that by saying that we’re working on some internal optimizations as well. And they’re on our roadmap, and that may enable us to sort of move that number down a bit. And so please, whenever you’re watching this, make sure that you visit our public documentation so that you can get the most up-to-date number.
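
That rule of thumb can be captured in a tiny hypothetical helper. The 2,000-record threshold is the documented value as of early 2022 and, as Abhi says, may move, so treat the constant as something to check against the current docs rather than a fixed fact:

```python
# Rule-of-thumb helper reflecting the guidance above: below the documented
# threshold, prefer the synchronous REST APIs; at or above it, consider
# Bulk API 2.0 for asynchronous, large-scale processing.

BULK_THRESHOLD = 2_000  # assumption: documented value in early 2022; verify

def suggested_api(record_count: int) -> str:
    """Suggest which API family fits a given record count."""
    return "Bulk API 2.0" if record_count >= BULK_THRESHOLD else "REST API"
```

So `suggested_api(500)` points you at the REST API, while `suggested_api(50_000)` points you at Bulk API 2.0.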

Josh Birk:
Got it. And anytime a product manager says roadmap, everybody should have in their head: asterisk, forward-looking statement, Safe Harbor, et cetera, et cetera.

Abhi Samantapudi:
Yeah, absolutely.

Josh Birk:
Speaking of developers though, and maybe this would be a whole other episode, so you can also give me your favorites if you want. Are there tips that you would have for developers for developing queries that are going to go against large data sets?

Abhi Samantapudi:
The first tip that I would give is to really understand the size of the data set that you’re working over. And this may sound very basic, but at an organization as big as Salesforce, and with customers having very large orgs, sometimes we find that customers are running bulk queries over very small data sets and extracting a small amount of data. And that is obviously not really the intention behind bulk, and you’re sort of unnecessarily using up your limits, which could be better utilized on larger data sets. Also keep in mind that bulk is an asynchronous process. And what that means is that we don’t provide the same SLA guarantees that are provided to a synchronous process. So it might actually be a disadvantage to you.

Abhi Samantapudi:
It certainly is a disadvantage to you to use this asynchronous processing when it is unnecessary. So first, yeah, understand the size of the data set and how much you expect to extract with your queries. Second of all, understand the frequency at which you’re running these queries. Again, because it is an asynchronous process, if you’re running a query very regularly, like let’s say every few minutes, you should think about whether running it asynchronously is the best option for that. So yeah, those two general tips, understanding the size of your data set as well as the number of times that you’re running these queries, are probably important to keep in mind.

Josh Birk:
Got it. So don’t use the sledgehammer to swat the fly, and play within your limits.

Abhi Samantapudi:
Yes. Absolutely.

Josh Birk:
Okay. So there is now a landscape of data tools out there, from old friends of ours like Dataloader and Workbench to new players like Dataloader.io. What are your thoughts on people using the right tool for the right job?

Abhi Samantapudi:
I admit that sometimes it’s a little difficult to do so, and to find the right tool for the job. What I can say is that with each of these tools, it’s really important to understand what each is optimized for and what the limits are for each. And I use “limits” in the literal sense and also the figurative sense. Sort of running through some of the tools that are the most popular, let’s start with Dataloader. So Dataloader is of course a very popular tool for data management. It’s included in our admin training. And if you’re trying to extract or query a large number of records, the limits are very conducive towards that. It’s something that admins are very familiar with. But what I’ll say is, if you need certain functionality such as the ability to schedule jobs, or maybe you just want a little bit more automation, or you have an external data store that you need to import data from, there is additional functionality in Dataloader.io that you might want to take advantage of.

Abhi Samantapudi:
So that’s a potential use case there where you want to use Dataloader.io. Dataloader.io is also web-based. And so if your organization doesn’t want to sort of deal with an application that you need to download, and then sort of need to worry about the more regular updates, that’s also a use case there. What I think is a tool that’s also a bit underutilized, almost, is Data Import Wizard. Data Import Wizard is very useful for your main sort of bread-and-butter objects that you want to import data against. And so it has a very intuitive flow and generous limits towards that. And so I think that if you just need to import some data into the most popular sObjects, I’d really consider using that as a first option before moving to some of the more advanced tools for other more generic use cases.

Josh Birk:
Got it. Makes sense. Now we’re going to talk a little bit about the future here and some of the future investments that you’re putting in. And once again, asterisk, forward-looking statement, Safe Harbor, et cetera, et cetera. But I think you’ve got your eyes on the Bulk Setup page itself. So what kind of updates are you considering there?

Abhi Samantapudi:
Yeah, absolutely. So to start off, I guess if folks aren’t familiar with the Bulk Setup page, it’s a page intuitively within setup where our customers can go in and they can see overall how much of their limits they have consumed in the past 24 hours. So these are of course 24-hour rolling limits. So you sort of see, okay, like how many batches have I used up? How many records have I uploaded? How many queries have I run, et cetera. And also, you get to see your jobs that have been processing for the last seven days and jobs that are processing right now. You can go in and see more information about those specific jobs. So one of the items within our roadmap is to sort of improve that page, to make it more usable for our customers and overall just a more intuitive process.

Abhi Samantapudi:
So I would say the biggest change that we’re looking to make is just to update the UI to be more intuitive and sort of align with what you would expect from the other Salesforce products that you use. I think that in and of itself, based on our feedback from admins and developers, will just make it a much more enjoyable process to go on there and investigate. Beyond that, some of the other changes that we’re looking to make to the page include providing more information about the specific jobs when you enter the page, such as being able to see which client the job was created from, and also, when you go into the page, having a better understanding, by API, of which limits you have consumed. So I don’t want to get into too many of the specific details. Overall, I can say that these changes, among others, will make the page much more intuitive and easy for developers, admins, and other relevant users to utilize.

Josh Birk:
Got it. Awesome. And you’ve also got a couple features that are currently in closed pilots and let’s start with one of them that I think I kind of understand, but I’m not really sure. Start with composite payloads. How exactly would they work with the API?

Abhi Samantapudi:
Yeah, absolutely. So the composite graph payload within Bulk API 2.0 is one of the closed pilots that we’re currently running. And folks that are familiar with the composite graph API will know that the composite payload allows you to sort of run a series of REST API calls all within one payload. With the REST API, of course, you can only work with one sObject at a time. And you can imagine in situations where you want to create a series of related sObjects, this can be pretty cumbersome, because you need a separate API call for each. And so through the composite graph API, you can just sort of put all these related sObjects into one payload and then ingest all of them at one time. So by providing this functionality in Bulk API 2.0, what we’re allowing customers to do is run these composite graphs asynchronously.

Abhi Samantapudi:
So this really allows you to take advantage of some very large payloads. By switching from the synchronous processing to asynchronous processing of this composite graph payload, we’re removing the limit of 500 maximum nodes per payload. And now what you work with in Bulk API 2.0 is a max payload size of 150 megabytes, which is much, much larger. And what I can tell you is that if you need to ingest a large number of composite graphs, you can do so with this API. So in fact, over 24 hours, you can ingest up to 100 million nodes, which is a multitude greater than what you can do synchronously.

Abhi Samantapudi:
And what’s important to note here, however, is that that 100 million number is the shared batch limit across all bulk jobs. So I guess if you’re running this processing very, very heavily with the composite graph payload, you would have some reduced ability to run other jobs. Regardless, I guess the main point here is that if you need to run some very large composite graph payloads, and you want to run them asynchronously to ensure that there is a greater chance of their success, this is the correct API to use.
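
To make the "related sObjects in one payload" idea concrete, here is a minimal sketch of a composite graph payload: one graph containing an Account and a Contact that references it. The shape follows the Composite Graph API's `graphs` / `compositeRequest` structure; the helper function, field values, and IDs are illustrative only:

```python
# Sketch of a composite graph payload: a parent Account and a child Contact
# created together in one graph, with the child referencing the parent by
# referenceId instead of needing a separate round-trip for the Account Id.

API_VERSION = "v54.0"  # assumption

def account_with_contact_graph(graph_id: str, account_name: str,
                               contact_last_name: str) -> dict:
    base = f"/services/data/{API_VERSION}/sobjects"
    return {
        "graphId": graph_id,
        "compositeRequest": [
            {"method": "POST", "url": f"{base}/Account",
             "referenceId": "newAccount",
             "body": {"Name": account_name}},
            # The child points at the parent via "@{newAccount.id}", which
            # Salesforce resolves once the parent node has been created.
            {"method": "POST", "url": f"{base}/Contact",
             "referenceId": "newContact",
             "body": {"LastName": contact_last_name,
                      "AccountId": "@{newAccount.id}"}},
        ],
    }

payload = {"graphs": [account_with_contact_graph("g1", "Acme", "Smith")]}
```

With the Bulk API 2.0 pilot Abhi describes, many such graphs would be submitted together and processed asynchronously.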

Josh Birk:
Got it. And so when we say composite graph, and I think this speaks to the scale of things, are we talking about just like relationships? So instead of having to insert all my accounts, get those IDs, insert all my contacts, get those IDs, insert all my opportunities, get all those… we just do all that at once.

Abhi Samantapudi:
Absolutely. Yeah. So there’s the composite API, which means that you can submit that payload of related sObjects. The composite graph API sort of extends this by allowing you to create multiple of these payloads of related sObjects as graphs, and then submit all of them. And so the real benefit here is, you can imagine with these graphs, there’s no sort of need to process them synchronously. If it’s a very large payload, we can process the graphs asynchronously to ensure there’s a greater chance of their success.

Josh Birk:
Because in that last scenario, size really would matter, because it’s like you’re grouping multiple objects, and you’re grouping multiple categories of multiple objects, possibly.

Abhi Samantapudi:
Absolutely. Yeah. So it’s really relevant when you have a very large number of these graphs. I’ll say generally, and I’m not the product manager for the composite API, but I know through our documentation that we recommend you create more smaller graphs rather than a small number of very large graphs, for [crosstalk 00:20:46] processing. So if you are in a situation where you need to take advantage of the 75 graphs that we allow in one payload, which is a very large number, by the way, and you really want to take advantage of it, and you find that your jobs running synchronously are not successful, then maybe you can consider taking advantage of this in Bulk API 2.0. Again, this is a closed pilot as of right now. If this sounds interesting to you, please reach out to your related account team and they can submit a nomination for this pilot.

Josh Birk:
Awesome. And another pilot that has me kind of excited is, and I love saying this phrase, auto-handling of locks. But first, let’s talk about locks in general. Like, walk me through a scenario where a bulk load will run into a problem with locks.

Abhi Samantapudi:
Absolutely. So like any other relational database, Salesforce uses these locks to ensure referential integrity. And it’s important to know that by and large, if you’re using Salesforce, you don’t really interact with these locks. They’re held for very short periods of time. And really the issue that we run into is when you’re working with large data volumes, such as bulk level volumes. So an example of where you might run into locks is if you have two ingest operations that share parent sObjects.

Abhi Samantapudi:
So as an example, if you have two orders that come in, and again, these can be completely different orders that come in through a bulk process, but they share the same account and contract object. Then you’re going to run into locking issues, because one of these processes is going to come in first and place a lock on the parent account and contract objects. And then when the other order bulk request comes in and it tries to place a lock itself, it’ll return an error because that lock is already-

Josh Birk:
Already there.

Abhi Samantapudi:
Locks on the parents have already been placed.

Josh Birk:
Got you. And currently, how should… Are there ways that developers can think about that in order to get around the locks?

Abhi Samantapudi:
Yeah. So there are some general suggestions that we give to developers in order to reduce locks. These include organizing your batches in the original Bulk API to minimize lock contention; being aware of operations that increase lock contention, such as creating new users, updating ownership for records with private sharing, updating user roles, et cetera; as well as being aware of locking when you’re creating your data model, such that perhaps if you’re creating custom objects with parent-child relationships, you ensure that one object maybe doesn’t have too many children. So I know there’s an ample amount of material overall provided regarding locking, but those are just some of the tips that we provide specifically with bulk.
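
The first tip, organizing batches to minimize lock contention, can be sketched as grouping records by their shared parent so that two batches never compete for a lock on the same parent record. This is a hypothetical helper illustrating the idea, not part of any Salesforce tooling:

```python
# Group records so that all records sharing a parent (e.g. the same Account)
# land in the same batch. Batches then never contend for the same parent lock.

from collections import defaultdict

def batches_by_parent(records, parent_field):
    """Return one batch (list of records) per distinct parent value."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[parent_field]].append(rec)
    return list(groups.values())

orders = [
    {"Id": "o1", "AccountId": "a1"},
    {"Id": "o2", "AccountId": "a2"},
    {"Id": "o3", "AccountId": "a1"},
]
batches = batches_by_parent(orders, "AccountId")
# o1 and o3 share account a1, so they end up in the same batch; o2 is alone
```

In practice you would also cap batch sizes, but keeping same-parent records together is the part that addresses the locking scenario Abhi described.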

Josh Birk:
Got it. And that was going to be one of my follow-up questions, because, and I’m going to point out that this reference is possibly as much as eight or nine years old, but I remember coding a very specific user generation flow, and I can’t remember what some of the permissions were, but whatever that user generation flow was, it caused issues. Because I think it was like self-registering. Like it would just give the user an email and a temporary password sort of thing. And that process took longer. And so you couldn’t always expect the system to be like, “Okay, I’m done,” whereas in a normal user generation, you could. Like I said, they’ll probably have fixed that, but [crosstalk 00:24:37] it’s been nearly a decade.

Abhi Samantapudi:
I guess one thing I should note is that ultimately the biggest tool that we have in dealing with locking, if these sort of preventative tools don’t work, is to run the process in serial mode. And that sort of enables you to process the records one at a time, or the jobs one at a time, rather, I should say, and sort of avoid these situations in the first place in many cases.
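
For reference, serial mode is requested when creating a job with the original Bulk API via the `concurrencyMode` field. The helper below is a hypothetical sketch of such a job-creation body; field names follow the original Bulk API job resource:

```python
# Sketch of an original Bulk API job request that opts into serial processing.
# Serial mode processes batches one at a time instead of in parallel, trading
# throughput for far fewer lock collisions on shared parent records.

def serial_job_request(sobject: str) -> dict:
    """Build a job-creation body with serial concurrency (illustrative)."""
    return {
        "object": sobject,
        "operation": "update",
        "contentType": "CSV",
        "concurrencyMode": "Serial",  # default is "Parallel"
    }

req = serial_job_request("Order")
```

Because throughput drops, serial mode is usually the fallback after the batch-organization tips above, not the first resort.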

Josh Birk:
Got it. That makes a lot of sense. Now tell me a little bit about the future. How would auto-handling of locks work?

Abhi Samantapudi:
So the Bulk API 2.0 pilot feature that we discussed, auto-handling of locking, handles it like this: if there is a locking error detected while you’re using the ingest operation, it will handle that automatically in order to ensure the greatest chance of success. This essentially is sort of providing an extra security blanket for developers, such that they have to manage fewer of their locking errors themselves, and hopefully the system handles most of them for them. It’s just sort of part of our overall idea behind Bulk API 2.0: just sort of submit your data and let Salesforce handle the rest.

Josh Birk:
And go get a cup of coffee.

Abhi Samantapudi:
Yes. Go get a cup of coffee.

Josh Birk:
Or tea in certain parts of the world. Awesome. And I will totally respect if the answer to this question is no, but is there anything else on the roadmap that you want to give a shout out to?

Abhi Samantapudi:
Yeah, absolutely. So some of the things that we are currently working on that are coming up in the next few releases are ingest support for big objects. So this is an example of a feature parity item between the original Bulk API and Bulk API 2.0. Currently, if you want to ingest a big object, you have to use the original Bulk API. So we’re providing ingest support now in this release, in 238. Beyond that, in the next few releases, we’re also working on PK Chunking improvements. What I mean by this is, for Bulk 2.0, we of course do PK Chunking automatically, but what we want to do is improve our internal process to make that even more efficient for Bulk API 2.0, as well as enable PK Chunking for even more objects. Beyond that, we’re also working to sort of simplify our limits internally and also expand them for customers. And without getting into too many specifics, the ultimate goal there is just to make the limits that you face with Bulk API 2.0 even simpler.

Abhi Samantapudi:
And just allow you to, again, think about it more in terms of records uploaded and queried, and not have to worry about more of the internal processing. We also have some other items on the roadmap, such as providing additional content types. So one of the feature parity items I was referencing is that if you want to use XML and JSON for ingest, you currently have to use the original Bulk API. So we’re looking to expand the content types that we provide access to with Bulk API 2.0. We’ve already talked about how we’re looking to improve the Bulk Setup page. This will be for both bulk APIs, obviously, but customers can definitely expect a more comfortable and reliable experience on that page. Also, we are always sort of improving our documentation; we’re looking to cover more of our use cases, more best practices, and also more descriptions of how to handle common errors. So that’s another, I guess, opportunity for us to connect with developers and just make their life a bit easier.

Abhi Samantapudi:
I will say that one of the things we are exploring, and it’s not on a roadmap or something we’ve committed to, but one thing we’re looking to improve is these clients that we’ve discussed, our data management tools: Dataloader, Dataloader.io, Workbench, Data Import Wizard, et cetera. We believe that there is an opportunity there long term to improve these tools and provide a sort of consolidated and better tool for all admins and developers to use. It’s certainly more of a long-term project, and one that we’re just exploring right now. And again, it’s not on a roadmap, but we think there’s an exciting opportunity there. And we certainly invite admin and developer input.

Josh Birk:
That’s our show. Now, before we go, I did ask after Abhi’s favorite non-technical hobby. And Abhi, if you’re listening, trust me, you’re much better at these sports than I probably am.

Abhi Samantapudi:
Yeah. I have to say it’s basketball. Nice. It’s… I’m not very good at basketball. Actually, I take that back. I would say it’s a split between basketball and volleyball. I was never very good at basketball, but I absolutely love watching it. Huge fan of the NBA. I’m a Pistons fan, but we haven’t had much success, at least recently. So I guess I sort of hopped on the Warriors bandwagon when I moved out to Berkeley. But yeah, anyway, I really enjoyed playing basketball just with friends throughout college. We’d go oftentimes on Saturday just to hang out there at the gym. But more recently I took up volleyball during the pandemic. Just a bit more, I guess, social distancing; you can do it outdoors, a little bit safer. Right. And it’s a lot of fun. I mean, I’m still very horrible at that as well, but it is so much fun.

Josh Birk:
I want to thank Abhi for the great conversation and information. And as always, I want to thank you for listening. Now, if you want to learn more about this show, head on over to developer.salesforce.com/podcast, where you can hear old episodes, see the show notes, and find links to your favorite podcast service. Thanks again, everybody. And I’ll talk to you next week.

Get notified of new episodes with the new Salesforce Developers Slack app.