Getting Started Contributing to Apache Spark – From PR, CR, JIRA, and Beyond SAIS NA

With the community continuously working on preparing the next versions of Apache Spark, you may be asking yourself ‘how do I get involved in contributing to this?’ or ‘how do I make sure my voice is heard?’ With such a large volume of contributions, it can be hard to know how to begin contributing yourself. Holden Karau offers a developer-focused head start, walking you through how to find good issues, format code, find reviewers, and what to expect in the code review process. In addition to looking at how to contribute code, we explore some of the other ways you can contribute to Apache Spark, from helping test release candidates to doing the all-important code reviews, bug triage, and more (like answering questions).


Video Transcript

– Thanks for joining me. I’m coming to you from my bedroom in San Francisco, literally underneath my bed, and I’m gonna be talking about getting started contributing to Apache Spark, from PRs to code reviews to JIRA and beyond. I wanna be really clear: I am on the PMC for Spark, but these are my own personal views, not necessarily those of my employer or the project.

So I’m Holden, my preferred pronouns are she/her. I am a co-author of “Learning Spark” and “High Performance Spark.” You can follow me on Twitter @holdenkarau, and I also live stream code and code reviews on YouTube you can check out. I’m a Spark developer in the Bay Area, no longer at Google.

So in addition to who I am professionally, I like to talk about who I am outside of Spark. I’m trans, queer, a Canadian in America. I got my green card this year, which is pretty exciting, and I’m part of the broader leather community. And this isn’t particularly related, but I think it’s important for those of us who are building machine learning or data information systems to have diverse teams, and one of the things we have to do to have diverse teams is actually talk about where we’re all from and what our backgrounds are. And if we realize that we’re all the same, then it’s probably time to go and find some people who can give us additional input. I’m hoping you’ll be able to provide some of that to the Spark project. I’m not saying that if you’re a trans queer Canadian in San Francisco you shouldn’t participate, but I’m hoping we can get more people to contribute.

What we are going to explore together!

So we’re gonna explore the current state of the Apache Spark community, reasons why you might wanna contribute to Spark, different ways that you can contribute, the best ways to find places to contribute, and the tooling that’s gonna be involved in being successful with these.

So I’m hoping you’re nice people; this is kind of weird doing it recorded, I just really hope you’re nice people. If you don’t like pictures of cats, this is not the talk for you; there are many other ones and you should go to one of them. I’m assuming that you might know something about Spark because you’re interested in contributing to Spark, but if you’re completely new to Spark, don’t feel like you’re not welcome here. I totally would love to have more new users of Spark contributing as well.

Why I’m assuming you might want to contribute:

So why might you wanna contribute to Spark? I think one of the most stereotypical reasons people contribute to open source is fixing your own bugs. That’s one of the great things about open source: if I find something that’s broken, I can try and fix it. That’s pretty cool. Maybe you want to learn more about distributed systems, either for your personal enjoyment or profit. Or maybe you just wanna improve your experience, maybe you want some credibility for jobs. Or maybe you just like doing it. I think a lot of us are here just because it’s fun. These are all good reasons.

What’s the state of the Spark dev community?

So the Spark dev community has a lot of contributors, and that’s really awesome. However, we have a lot of PMC members and committers concentrated especially in the Bay Area. Not exclusively, but certainly we could be better about having more people from different areas here. And so if you’re not from the Bay Area, please do get involved. Some of the conversations happen in the Pacific time zone, so sometimes it’s difficult, but we will do our best.

How can we contribute to Spark?

So how can we contribute to Spark? There’s a bunch of different ways. I think when people think of contributing to open source projects, we normally think of contributing code. We think of pull requests, we think of bug fixes, but there are all sorts of other things that you can do. There’s improving the documentation, there’s code reviews, there’s fixing the code that Spark itself depends on (I like to think of that as yak shaving), and there’s helping users out. That one’s really important: there are so many Spark users, so many more users than developers, that having more people available to help out really matters. And the other one is testing and release validation. I’m not sure if Spark 3.0 is gonna be released by the time you’re seeing this; if it’s not, please go and test out the current release candidate for Spark 3.0. And we’ll talk more about why this is one of the most impactful, easiest things you can do as a Spark user.

Which is right for you?

So how do we decide what is right for us to do? What do we want to do? Direct code contributions to Spark have very high visibility, but they can be kind of slow. Fixing packages built on top of Spark can often be a lot faster, but they can have a little bit less impact; things like formats can have a really big impact, but they have a smaller user base. And yak shaving is really great and amazing, but because Spark upgrades its dependencies so slowly, it can take a while for anything you fix in something that Spark depends on to actually be visible to Spark users.

Which is right for you? (continued)

Code reviews are really high visibility to the PMC; the PMC is the Project Management Committee. What this means is that the people who are merging code are far more likely to read your reviews than perhaps any of these other activities. Code reviews are how a lot of the PMC spends their days, day in and day out. So by participating in that part of the process, you engage more with the people who are making decisions around the project. Documentation improvements are also super important, and there are lots of places to contribute; unfortunately, some places don’t value that as much, so it may not align with your goals if you’re trying to get credibility to get a new job. Some places won’t count that in the same way. And of course I think advocacy is important, but the same caveat applies there: some places don’t consider that important, which I think is complete hogwash.

Testing/Release Validation

So let’s talk about testing and release validation and how to go about participating in this. The first step is to join the developer mailing list; you can just go to the mailing list page and join right there. Then you look for the threads that say [VOTE] on them. That is your sign that we are making a decision. Most of the decisions that happen on the dev list are votes about whether or not we should make a release. There are some other ones, and I’d encourage you to read all of them, but the release ones are probably the most important. What you can do here as a Spark user is take the proposed release and see if your job works on it, see if it works in your environment, and if it doesn’t, let people know. Spark is deployed in so many different kinds of systems that it’s impossible for us to have complete test coverage that’s gonna catch all of these bugs. By participating in this part, you are more likely to ensure that when Spark is released, your application continues to work and you’re not having to hotfix something or do a last-minute rollback. You are essentially getting a chance to make sure that what we build meets your needs. So this is, I think, amazingly impactful for both you the user and for us as a community; it helps us release better software. And one of the things you can do is contribute automated testing here, so that we have better automated release validation, and that can be really great. There’s pretty good coverage for the Kubernetes stuff compared to other things, but a lot of other things really don’t have any automated integration testing.
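To make the “see if it works in your environment” step concrete, here’s a minimal smoke-test sketch. It assumes you’ve installed the release candidate’s PySpark build into a fresh environment (the install instructions are posted in each [VOTE] thread); the tiny DataFrame job at the end is just a hypothetical stand-in for your real workload.

```python
# Minimal RC smoke-test sketch. Assumes the release candidate's pyspark
# build has been installed into this environment; the DataFrame job below
# is a stand-in for whatever you actually run in production.
import importlib.util

if importlib.util.find_spec("pyspark") is None:
    print("pyspark not found; install the release candidate build first")
else:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rc-smoke-test").getOrCreate()
    print("Testing against Spark", spark.version)

    # Replace this with your real job; fail loudly if anything regresses.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    assert df.filter(df.id > 1).count() == 1
    spark.stop()
```

If something breaks, reply on the [VOTE] thread with the version you tested and what failed; that’s exactly the feedback the release manager needs.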

Helping users

Helping users is also super important. We all started using Spark at some point, and it is huge; figuring out what’s going on is really hard, especially right now. It’s really hard to go and ask someone a question when we can’t physically go to people, so people are turning to the mailing list more and more. You can join the user list, you can answer people’s questions, you can do the same thing on Stack Overflow, you can write blog posts, you can do all that kind of stuff. And this is also really impactful; it helps us grow the community. If you’re not an asshole, it helps us set a good example for what we want. And if you are an asshole, please become not an asshole or leave.

And it makes the experience so much better for new people. Remember how hard it was getting started with Spark? Now imagine you could make that better for someone else. Of course you want to. Now, don’t burn yourself out trying to do this; it can be really easy to get burnt out on this and then become kind of jaded. I’ve had those moments, so don’t feel like you have to do this. But if you want to, please, please do.

Contributing Code Directly to Spark

And so, contributing code directly to Spark: I think that’s probably why a lot of you are here in this talk. Maybe you’ve got a bug you wanna fix, maybe there’s a feature you need for your environment. Let’s get into it. But take a look at the contributing guide first, before you start writing your code. Because if you just go ahead and start writing code, it’s not gonna be a great experience.

The different pieces of Spark 3+?

So the first thing that we should do is look at the different pieces of Spark and figure out where we wanna make our contribution. If you’ve found a bug, you probably wanna fix it in the component where you’ve encountered it. But it might turn out that it’s actually in a different component further down the stack.

On the other hand, if you’re just like, “I wanna do something in Spark but I don’t know what,” let’s look at what some of these different options would imply.

Choosing a component?

If we decide to work in Core, it is going to take a lot of time to get our code reviewed. That’s because some pieces of Core aren’t tested as well as they should be, and everything depends on Core, so making changes here could have a very big impact, and that could be good or that could be really bad. So people are a little bit more conservative here, and it’s not the area that I’d recommend when you’re starting out. ML and MLlib: well, MLlib is on its way out. But for our machine learning stuff, improving existing algorithms is very much welcome; adding new algorithms has not been encouraged for a long time. If you want to know why, the reasons more or less come back to maintenance, but we can talk about it offline. Structured Streaming: lots of fun, lots of changes happening there. It’s a little bit difficult to contribute because the API is changing so rapidly, but I think it’s a really great area. SQL is also pretty cool, not just the streaming SQL part but the base SQL layer. Lots of fun stuff, very active. I don’t do that much work there, but I know many people who do and they enjoy it. And I’m gonna try and convince you to contribute to improving the Python support, ’cause that’s one of the things that I care about a lot, and it’s one of the easiest areas to get started; we’ll talk about that later, in fact.

And then we can also contribute to the different scheduler backends. If you’re a YARN expert, please come; if you’re an expert in Mesos, please come, we really need a Mesos expert who’s contributing. Kubernetes: lots of active work, lots of reviewers, it’s a lot of fun; you can come hang out and contribute to Spark on Kubernetes. Standalone: there are fewer people excited about it, because increasingly Spark is deployed on top of one of these other systems.

Onto JIRA – Issue tracking funtimes

So, a project like Spark is huge, right? Keeping track of all of the issues is difficult; we use the ASF JIRA. You should sign up for an account on it early, because it can take a little bit of time, since sometimes the mail servers don’t have a nice day. It’s better to have an account and not need it than to want to file an issue and not be able to because the mail server is having a bad day.

What we can do with ASF JIRA?

So with the ASF JIRA, we can search for issues that other people have reported, we can report our own issues, we can comment on issues, and we can ask people for clarification or help. @mentions work here too, so if there’s someone who would be really good at solving a particular issue, you can loop them in.

But there are a few things that we can’t do with the ASF JIRA that you might be used to doing with your own bug tracker at home. One of them is assigning issues to ourselves. Because we don’t want issues to get stuck assigned to someone who ends up getting distracted by something else they need to do, we don’t assign issues to people until the issue is resolved, or maybe sometimes during the pull request phase. Posting long design documents: JIRA isn’t great for that, though certainly people do it. I’d encourage you to use a Google Doc or something where people can comment more interactively. And there are tags, and in theory you can tag issues, but in practice the tags often get removed, so I wouldn’t bother putting tags on issues.

If you haven’t seen JIRA before, this is what the ASF JIRA looks like. We can see there’s an issue, “database can’t be changed if it’s specified in URL.” To me, that sounds like not an issue, but the person who reported it thinks it’s a major bug, so I probably don’t know enough of the details. I’d just be like “Oh, I don’t know,” and I’d start reading to see what it is they wanted.

Finding a good “starter” issue:

So that’s great, but there are thousands of issues; how do we find a good one to start with? There’s a starter issue tag, but as I was mentioning with tags earlier, they’re inconsistently applied. Some people tag things as starter issues that are definitely not starter issues, and vice versa. So I think, unfortunately, the realistic answer right now is that you just have to read through a whole bunch of issues and look for something that looks easy. And there are some other things that we should look for in addition to easy. It’s really good if the reporter, or someone commenting on the issue, is a committer. You can look at the Spark committer list; you can find that on Google really quickly, or on the Spark website. Because if there’s a committer who’s interested in an issue, once you’ve got your code ready there’s a good chance that committer will be available to review it, and that’s a lot less work than having to go out and find someone to review it. Right here, right now, if Spark 3.0 isn’t released, one of the things you can try is looking for old issues that were closed because we couldn’t fix them at the time while maintaining binary API compatibility. That being said, if Spark 3.0 has been released, that’s not a good idea anymore; we’ll save those for Spark 4.0.

Going beyond reported issues:

So that’s cool and all, but the reported issues aren’t the definitive source of truth for all the work that needs to be done on Spark. One of the great things that you can do, especially if you’re helping out on the user list answering questions, is look for things which might be issues that haven’t been reported yet in JIRA. There are also the TODOs in the code; these are often things that people felt were really small or were just gonna be a follow-up later on. I’ve resolved TODOs that were years old and said “just a quick follow-up”; we all know how that goes. Looking for deprecations is another one: Spark depends on a lot of different libraries, things get deprecated, and we don’t always fix them. These are all issues that might not have been reported but can still be important, and if you find one, you can open an issue and start working on it.

While we are here: Bug Triage

While we’re figuring out what to work on, one of the important things you can do is triage. Now, I know I’m saying that tags aren’t super useful, and that’s true. The one exception I would make is: if you find a starter issue, tag it as a starter issue, and eventually maybe we’ll actually have a good list of starter issues. The other thing is, there are a lot of things that end up in the bug tracker that don’t really belong there. Some people will be like “Hey, how do I do this in Spark? I’ll file a JIRA.” That’s not the right way to do it, and you can redirect them to the appropriate list, normally the user list. You can also take a look and make sure that things are being reported at the correct level. If you see something that says “Hey, this occasionally reports the wrong number” and it’s tagged as pretty minor, that’s probably actually a major issue, possibly a blocker, and you can change the level and try to make sure people are aware of it, because we don’t wanna miss important issues. Another thing is that our issue tracker sometimes gets stale. Sometimes people report duplicates. Sometimes an issue gets fixed incidentally: this thing that used to be a problem is no longer a problem, but the person who fixed it didn’t realize they were also fixing that issue at the same time, so they didn’t close it. So if you see something old and you’re like “I think that works in Spark now,” you can try it, see if it works, and close it if it does. And if it doesn’t, you can leave a comment like “Hey, just verified this still impacts version XYZ.”

Finding SPIPS:

We can also find the SPIPs, and I wanna be clear: I put this in here because I don’t want you to start here. That’s a little weird. I think it’s great to go and read the SPIPs, the big proposed changes to Spark; trying to have those be your introduction to contributing to Spark is a bad idea. But they’re a great way to look at how people are thinking about the future of Spark.

Here’s some SPIPs I think, oh no, here’s some starter issues.

Cool. So before we go too far: we wanna maintain compatibility between releases, and that means if you see something and you’re like, “what is this parameter, I don’t think we need it”…

But before we get too far:

There might be a reason why that parameter is still there, and it might be for API compatibility. That being said, if we really don’t need it, we can deprecate it; we can still do things about it. We just can’t change everything we want as soon as we see it. And even though we’re working on Spark 3.0, or Spark 3.0 may have been released by the time you’re watching this, Spark 3.0 will be pretty close to being released, hopefully. So it’s probably not the time for any last-minute breaking changes anymore. If you have really big breaking changes, the sad news is those are probably gonna have to wait for Spark 4.0; I’m sorry.

Getting at the code: yay for GitHub 🙂

Getting the code, yay GitHub. So hopefully you’re familiar with GitHub; if not, there are tons of great videos. It’s pretty easy: you can just go to the Apache Spark GitHub, fork it, clone it locally, and start working on it.

This is the GitHub repo; this is from a while ago, but we can see: yay! Lots of contributors.

Building Spark

Once we’ve got Spark checked out locally, it’s time to build Spark. Now, the build of reference is Maven, and that’s for various reasons. I tend to do my builds with SBT locally, because I can leave SBT running and redo my compilations a little bit faster. I think it works a little better for me, but sometimes I will have to do a Maven build. So if you’re doing something really fiddly that might be impacted by the build, do try it with Maven as well. Oh, and also, a shout-out for you Python people: yes, you do still have to build the Java code even if you wanna contribute to Python or R, sorry.

What about documentation changes?

So we still track documentation changes in the same way, we use JIRA to track them. Most of the Docs live in Markdown files.

It’s pretty great; Markdown’s pretty common. If you’re not familiar with Markdown, once again there are lots of great videos and tutorials about it. I use Emacs, but I know there are tons of great tutorials for Markdown in other editors as well.

Building Spark’s docs

Building Spark’s docs: we don’t just share raw Markdown with the world, we want to give people nice things to look at, so you need to build the docs with Jekyll. There’s information about how to install Jekyll in the docs README. Build the docs and make sure that your change looks the way you think it looks, because sometimes our editors are really great, but sometimes they get things a little wrong, and we don’t want to end up with a broken table.

Finding your way around the project

So besides the docs directory, Spark is generally organized into these components, with a subdirectory for each project. IntelliJ is really popular; people find it a really easy way to find their way around the project. I use Magit to find my way around. I do not recommend that, but if you’re an Emacs user it’s not the worst; it is really fast.

Testing the issue

So hopefully we’ve figured out what we wanna do, we’ve found the issue, and we know where to look for the code. We can make our changes, and now we should test to make sure that our changes work. Unless you did test-driven development, in which case you wrote the test first, which, that’s awesome. I like using the spark-shell for doing quick manual verification; I think it’s pretty cool. But do please write a unit test, don’t just test it manually. Unit tests, yay!

Manual verification, sad!
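If your fix lends itself to a unit test, the pattern is usually small. Here’s a minimal pytest-style sketch; the function under test, `parse_memory_string`, is a hypothetical helper, a stand-in for whatever your change actually touches.

```python
# Minimal unit-test sketch (pytest style). parse_memory_string is a
# hypothetical helper; the point is to pin the fix down, including the
# edge case that motivated it, so it can't silently regress.

def parse_memory_string(s):
    """Convert strings like '512m' or '2g' to a number of megabytes."""
    units = {"m": 1, "g": 1024}
    return int(s[:-1]) * units[s[-1].lower()]

def test_parse_memory_string():
    assert parse_memory_string("512m") == 512
    assert parse_memory_string("2g") == 2048
    # The regression case: upper-case suffixes used to be rejected.
    assert parse_memory_string("1G") == 1024
```

Once the test is checked in, it runs with the rest of the suite on every future PR, which is exactly what manual spark-shell verification can’t give you.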

While we get our code working:

So once we’ve got our code working, or while we’re getting it to work, there is a style guide that we’ll wanna format our code to. We don’t have to do this all the time during our development process, but before we make our PR we’ll want to get it to conform to the style guide. Please always add tests, and if you’re changing the API, please make sure that it passes MiMa. Now, if you don’t know what MiMa is, that’s okay, I’ll talk more about it later. But it’s important for API compatibility checking.

A bit more on MiMa

So MiMa is how Spark attempts to maintain binary API compatibility. We wish to do this in all of our non-experimental components, but this doesn’t mean that if MiMa gives us an error we can’t make our change. MiMa is often a little bit too sensitive, and it doesn’t know which components are experimental. So if you get a MiMa error, don’t get too worried, and if you can’t resolve it, just ask someone else on the project to help you resolve it. If it does come up, it is a thing that you need to pay attention to; you can’t just ignore it. But we can add the error to the excludes if you have in fact not broken API compatibility.

Making the change:

So, making the change: as I mentioned, I’m an Emacs user. Don’t argue about which editor is the best; they’re all terrible, all software is terrible, it’s fine. IntelliJ is great for finding where you’re going. I use grep, and GitHub actually has pretty good search nowadays. There’s all kinds of wonderful tooling to help you find the code that you need to change.

Python API change parity update?

As I was saying, I think one of the easiest ways to make code changes to Spark is Python API parity updates, and often that’s just things like “Oh hey, we have this function in Scala and forgot to add it in Python.” It’s pretty simple to do, because you can generally open up the corresponding Python file, look at how the other functions are implemented, and just kind of guess and then try it; you can do that a few times and eventually it’ll start working. And I think this is really cool, because we have Python interfaces for all of our components. So if you’re interested in a specific component, you can audit the API and make sure that it’s consistent between Python and Scala for that specific component you want to know more about. And in that process you’ll learn a lot about what’s changing. I think that’s pretty cool.
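To make that concrete, here’s a rough sketch of the forwarding pattern many parity additions follow. In real PySpark the wrapper calls through a Py4J gateway to the JVM-side function; here `jvm_functions` is a plain dict standing in for that gateway, so this example runs anywhere and everything in it is hypothetical.

```python
# Sketch of the parity-wrapper pattern: a thin Python-level function that
# forwards to the corresponding JVM function. jvm_functions below is a
# stand-in for the real Py4J gateway, not the actual PySpark internals.

def _make_wrapper(name, jvm_functions):
    """Create a Python function forwarding to the JVM function `name`."""
    def wrapper(col):
        return jvm_functions[name](col)
    wrapper.__name__ = name
    wrapper.__doc__ = f"Forwards to the JVM-side `{name}` function."
    return wrapper

# Hypothetical stand-in for the JVM side of the gateway.
jvm_functions = {"upper": lambda s: s.upper()}

upper = _make_wrapper("upper", jvm_functions)
print(upper("spark"))  # prints SPARK
```

The real wrappers also carry docstrings with examples, which is a big part of what makes a parity PR easy to review.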

Yay! Let’s make a PR

Yay! Okay, so it’s time to make our PR. You’re going to push your branch, visit GitHub, and create your PR. Now, please put the JIRA ID in your title. Nowadays, thankfully, there is a template that auto-fills the PR description box with some information. Please read it and fill it in; it tells you what to do, and if you follow that template it makes our lives so much easier. Now, if you’re like “I got this 90% of the way there, but I’m done, I can’t do this anymore,” that’s okay too. You can still make a PR; just put “Work in Progress” in the title so other people can find your PR, and you can still be listed as one of the contributors when we do the merge eventually. It doesn’t have to be finished. And by putting work-in-progress in the title, it tells people “Hey, don’t spend too much time on this,” unless they’re really interested in the next item.

Code review time

Now it’s time for code reviews. Now I have feelings about when code reviews should be done. I know some of your employers may have different feelings about when code reviews should be done. There are some companies that prefer to do their code reviews internally and then create their PRs and then go through the open source process. I think it’s better to do it all in the open so that we can all learn.

My personal views and the views of your manager may not align, especially in this time. You may wanna pick your manager’s views here because having a job is pretty useful. Like having health insurance.

And now onto the actual code review…

So, doing the actual code review: most often a committer will review your code, but this may not happen very quickly; other people can help out too. People can be really busy, especially near release cycles. Like right now, I don’t have a lot of time to review code, because I’m just checking to make sure the five things that I need to get merged in, so that I can cut the 2.4.6 release candidate, are in.

I’m not looking to see what new PRs are coming in; I’m tracking specific items. But that’s okay: if there is something where you need me or another committer, and we might not be paying as much attention, you can @mention us in your PR and help us realize that we should spend some time looking at it. And other people can review too, besides committers. You do, of course, eventually need a committer to commit your change, but just because the people reviewing it aren’t committers doesn’t mean their comments aren’t useful or helpful.

What does the review look like?

What does a review look like? At the end of your review, you’ll hopefully get a “Looks good to me” (LGTM). That’s your thumbs up, “I’m gonna merge this code,” and that’s great. SGTM, “sounds good to me,” does not mean the same thing; normally it just means that one part sounds good. Sometimes things get sent back to the drawing board; not everything gets in, and that’s okay. I made a PR last week where, looking at the feedback, I don’t know if it’s gonna get in, and that’s totally normal. You’ll get a mixture of in-line and overall comments. And if you’re interested in seeing more about the process, there’s a link down at the bottom where I do live code reviews, and you can see what the code review process is like from the committer’s side.

That being said, I’m gonna show you one of my older PRs and how it went. This one made Spark’s Hadoop util support switching to YARN. It’s not super important what’s going on, but we can see there were four participants.

Okay, cool. Some comments: I failed the style tests; that was my bad, I should have caught that beforehand.

Then some more comments. I was so excited: “Looks good to me. Only trivial things.” Oh yeah! It’s a party.

Wooo, then someone else, Marcelo, shows up and is like “Hey, actually there are some other issues that matter, so let’s dig into those first.” I’m like, oh yeah, okay, let’s really do that. And here we go: back and forth, trying to get on the same page of understanding, pushing some new code.

Marcelo comes back and is like, yeah, I’d move these things around, I think that would make it a little bit better; this belongs in a final (indistinct); the way you’re doing your class check here isn’t ideal. And then finally: cool, looks good to me, merged to master. And so that’s happy.

That’s a pretty standard small PR

That’s pretty standard for a small PR, and it still took a really long time and went on for a lot of pages. Review cycles are long, so don’t just make a PR and then stop and wait for it to get in. Go on and do other things. Cut your hair with some hair clippers, drink some coffee; we all have our quarantine stress projects. Pick one of them up and start working on it, or pick up another JIRA and start working on that.

Don’t get discouraged

Don’t get discouraged. I know, actually, in this PR I got a little bit discouraged, ’cause I got an “LGTM except minor issues” that then became not-LGTM. I was so close. Ooh, sadness. So don’t feel sad like this cat; please, please keep your happiness. I know it’s hard during these times, but things take a long time and that’s okay. That’s not you, that’s just the state of the world.

When things don’t go well.

When things don’t go so well, sometimes it can be because people don’t know how to say no. This is a thing I think many people struggle with. The Spark project has been getting better about it; in the past, sometimes we would say no by just never bothering to review a PR we didn’t want. I don’t think that’s great. I think we should be clear about what we don’t wanna do. If you’re in a place where no one’s willing to take a look, feel free to reach out to me, and I will at least hopefully be able to get someone to tell you no.

While we are waiting

While we’re waiting, it’s important to keep our PR up to date. This is important for many reasons, but also because of how committers tend to look at things. There’s a review dashboard that we use, and if Jenkins can’t run your PR, I’m not gonna bother to review it; that’s just not worth it to me.

Open Source Code Reviews are a lot like Mermaid School

And so, relatedly, you can participate in the code review process, and open source code reviews are a lot like mermaid school. This is because they help you grow your skills: mermaid school helped me improve my swimming, and both build on my existing skills, swimming on the one hand, Scala and Python on the other. You get better with time, but you need to start somewhere, and people often don’t wanna pay you for them, although I did manage to get my manager to pay for three-quarters of my mermaid class. And coffee makes it better. Yes, important.

Why the community needs you?

So the community needs you to help us with our code reviews. Many projects suffer from maintainer burnout, and the Spark project is no different. Most people don’t wake up in the morning excited to review code; we wake up excited to write code. So if you can help and give back with some of your review time, that’s really awesome. And if you are the kind of person who wakes up excited to review code, please, please come and join us. We have a super huge number of open PRs; it’s hard to keep track of all of them, it’s just not possible. I would love to see more diverse reviewers so that we can have more diverse solutions. As someone who has been working on Spark for years, there are often times where I’ll look at something and go “oh yeah, that looks fine to me,” and someone new will be like “no Holden, that’s very silly, I don’t understand,” and I’ll realize, oh yeah, I guess we started doing that because of Java 7, or it’ll be some random thing that I just haven’t revisited and it just looks right to me. By having someone new involved in the process, you can catch issues that people who have been around forever won’t notice. The last one is that you can represent the user. I work on Spark a lot, but I don’t build a lot of things with Spark right now; I mostly build Spark itself.

And one of the dangers of any infrastructure project is building infrastructure that isn’t useful. So if there are things where it’s just like, “Oh, this is such a cool idea, but if it was just a little bit different it would be so much easier to use,” that kind of feedback is amazing, and I want your voices to be involved in the project. I wanna hear that from you all.

The other part is, if we look at the time from PR creation to the first PR comment, there’s a considerable gap. Now, this is across, I think, all of GitHub, or it might have just been Scala projects; it’s been a while since I ran this, as you can see. There’s kind of a big gap.

By contributing to open source code reviews you can grow your skills in the same ways that you would by contributing code or documentation. You can see the world, although right now, who knows? And you also may get faster recognition from committers and PMC members, because we tend to be more involved with code reviews.

On seeing more of the world: when you’re working on a small starter issue, it’s only gonna touch a little narrow piece of Spark. But when you’re doing code reviews, you can be reviewing code that touches all kinds of different pieces of Spark. That’s really cool; you’ll get to understand all of these different components and how they work together, and this really helps you grow your skills faster. It also makes it a lot easier to start contributing if you know what good Python looks like. It’s okay if you don’t know Spark so well, because you can look at some code and think, “That’s not good Python, I would rewrite it this way,” but of course say it nicer, like, “I think we could improve this by doing XYZ.” That’s great: you’ll learn a little bit about Spark, and you’ll help the person know a little bit more about Python. Everyone wins, yay!

I’ve had a little bit of coffee.

Possible Faster Recognition

Faster recognition is definitely very possible. Generally we have more people contributing code than we do reviewing code, so I tend to remember the reviewers more. I’m not saying this is true for everyone or always, but I think this is a way to gain recognition faster in a project.

And I also think it’s easier to control your time. If we look back to however many slides ago, when I was trying to make that pull request for that one simple change, it took me a very long time to get that in. But the thing with code review is, it provides value even if you just come by once and help someone improve their Python a little bit, or their Scala a little bit. You give value just in that small time-boxed interaction, whereas making code changes is kind of hard, and it can take a really long time for them to land in the project itself and have impact.

Finding a good first PR to review

Finding good first PRs to review is pretty similar to the challenge of finding good first issues to contribute to. I think the best way to do it is by looking at the Spark PRs link down here; it’s one of the easier ways. I also think reviewing new PRs is easier than late-stage PRs. On late-stage PRs people tend to be a little bit more invested; on early PRs, in the first day or two, people are more open to comments. It’s a lot easier to make suggestions, and they’re probably feeling less overwhelmed.
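One lightweight way to surface those early-stage PRs is to pull the open PR list and keep only the freshly opened ones. The sketch below uses GitHub’s public REST API (which is real); the helper names and the two-day freshness window are my own illustrative choices.

```python
# List freshly opened apache/spark PRs -- often the easiest to leave a first review on.
import json
from datetime import datetime, timedelta, timezone
from urllib.request import urlopen

PULLS_URL = ("https://api.github.com/repos/apache/spark/pulls"
             "?state=open&sort=created&direction=desc&per_page=50")

def fetch_open_prs():
    """Fetch currently open Spark PRs (requires network access)."""
    with urlopen(PULLS_URL) as resp:
        return json.load(resp)

def fresh_prs(prs, days=2, now=None):
    """Keep PRs opened within the last `days` days -- the early-stage PRs
    where authors tend to be most open to suggestions."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [
        pr for pr in prs
        if datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00")) >= cutoff
    ]

# Offline demo with hand-made sample data (call fetch_open_prs() for real PRs):
sample = [
    {"number": 1, "title": "Old refactor", "created_at": "2024-01-01T00:00:00Z"},
    {"number": 2, "title": "New SQL fix", "created_at": "2024-01-09T12:00:00Z"},
]
for pr in fresh_prs(sample, now=datetime(2024, 1, 10, tzinfo=timezone.utc)):
    print(f'#{pr["number"]} {pr["title"]} ({pr["created_at"]})')
```

Sorting by creation date and filtering on a short window is just one heuristic; the spark-prs dashboard the talk mentions does something similar with more signals.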

When doing a first review, feel free to leave comments like, “I’m new to the project; I think this is intending to do X, maybe we can add a comment here.” That’s really important. We want Spark to be easy for people to contribute to, and by helping us have better-commented code, you can help us have better contribution experiences for everyone. Look for when changes are getting out of sync and be like, “Hey, what’s up, the docs are out of date.” There are all kinds of things that you can do here that don’t depend on you knowing Spark super well in depth.

Communicate carefully please

I remember early on when I was starting to contribute to Spark, there were some reviewers where, when I saw their name in my email from GitHub, it was like, “Oh God, I don’t know if I can open this email.” It was just so hard sometimes. And I don’t want people to have that experience. I want people to be excited to contribute to our project. I want them to be like, “Yeah, I’m gonna contribute to Spark. They’re gonna help me make my code better. It’s gonna be awesome.” So if you’re gonna help out with code reviews, please remember you’re an ambassador for the project. Help people have a good experience contributing so they continue to do our work for free. And understand folks can get defensive about their work sometimes; that’s pretty natural. Being told something isn’t right can be hard to hear, especially if you put a lot of time into it, and it’s okay to be scared. I’m not saying never leave a comment, never tell someone something’s not good or needs improvement.

It’s totally normal to be scared of giving people feedback about their code. Giving feedback is really hard, and it’s a skill; by doing this you will improve at that skill.

Phrasing matters a lot

So one of the things that we can do is some simple phrasing changes: things like changing “this is slow” to “could we do this faster?” Saying “this library you’re using sucks” hurts; offering alternatives and asking people what they think of them works much better. There’s a lot you can do here, and just these small phrasing changes can make such a big difference in how someone receives what it is you’re trying to say.

OSS reviews videos (live & recorded):

If you’re interested in open source code reviews, we’re not gonna do one now, but you can go check out my YouTube channel; there are tons of open source code reviews there for you to watch.

What about when we want to make big changes?

If you wanna make big changes in Spark, that’s totally cool. But don’t just show up with a pull request; talk with the community first on the dev@ and user@ mailing lists. And maybe don’t start off with a big change as your first thing. There’s a reason why starter issues are small: it’s better to start small and grow to big. I’m not a mountain climber, but I’m pretty sure people start with smaller mountains first.

How about yak shaving?

Yak shaving is really great, though for a lot of the yaks that need to be shaved, Python isn’t the place to do it, because of how we manage our dependencies there.

There are some great books about Spark. I’m a co-author of several of these, but I am most interested in my newest Spark book, “High Performance Spark.”

High Performance Spark!

You can buy it today on the Internet. And the reason why I am most interested in this one is that it’s the one where I learnt you can negotiate royalties, so I get more money, and everyone needs more money. So buy several copies of this with your corporate credit card today. Or not, it’s fine; don’t get fired on my account.

Another project I’m doing, which I think is a lot of fun, is distributed computing for kids. I sort of got a first draft of this together before quarantine hit, and I’m hoping to have some time to iterate on it in the coming weeks. Definitely go sign up with your email so I can keep you in the loop, and we’ll make a book for kids about how to use Spark; it’s gonna be super cool. I hope this session has been good, and I hope I’ve convinced you to contribute to Spark. If not, that’s okay, I totally get it; this is a really weird time for everyone, for everything. So thank you so much for choosing to spend it with me, and I hope you have a wonderful day.

About Holden Karau


Holden is a transgender Canadian open source developer with a focus on Apache Spark, Airflow, Kubeflow, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and Kubeflow for Machine Learning. She is a committer and PMC member on Apache Spark. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.