Solving the Challenges of Collaborating Around Data Science Notebooks
Paul Shin
February 3, 2021
I have had 40 Zoom calls with data science teams from various industries: asset management, healthcare, e-commerce, adtech, NBA teams and more.
One of the most common challenges they brought to us was collaboration, and three main types of collaboration problem emerged:
- Coder – to – coder
- Coder – to – non-coder
- Coder – to – IT
Coder – to – coder Collaboration
Data scientists specialize in pulling insights from large data sets using languages like Python, R and Scala. The most common collaboration problem among this group arose when individuals were using open-source tools like Jupyter or RStudio on their laptops. Naturally, each team member uses the tools and libraries they are most comfortable with on their own hardware and OS. This approach often causes compatibility issues when teammates need to access, review or edit each other’s projects.
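To make the compatibility issue concrete, here is a minimal sketch of how two teammates might compare their local Python setups. It uses only the standard library; the function names `environment_snapshot` and `diff_packages` are hypothetical and chosen for illustration, not taken from any tool mentioned in this post.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Describe this machine's Python version, OS, and installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no recorded name
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "packages": dict(sorted(packages.items())),
    }


def diff_packages(mine, theirs):
    """Map each mismatched package to a (my_version, their_version) pair.

    A version of None means the package is missing from that snapshot.
    """
    names = set(mine["packages"]) | set(theirs["packages"])
    diffs = {}
    for name in sorted(names):
        a = mine["packages"].get(name)
        b = theirs["packages"].get(name)
        if a != b:
            diffs[name] = (a, b)
    return diffs


if __name__ == "__main__":
    # Save this output and send it to a teammate to compare against theirs.
    print(json.dumps(environment_snapshot(), indent=2))
```

Each teammate could run the script, exchange the JSON output, and pass the two loaded snapshots to `diff_packages` to see which library versions differ before opening each other’s notebooks.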
Team managers found it especially difficult to track the progress of their team members’ projects, because each person often worked in a different environment based on their requirements. Git flows often became overwhelming, and data security and the audit process became almost impossible. Someone would quit, and the code would leave the organization with the laptop (read more about the problems around storing notebooks in GitHub from our customer Yonder).
Coder – to – non-coder
Once data scientists build and train models or finish an analysis, they are often expected to explain their findings to business teams that may not speak their language (the non-coders). Most open-source tools that run locally offer visualization libraries as well as cell or paragraph structures for telling the story around the data and the analysis. However, as mentioned above, if coders have a hard time loading another coder’s Jupyter notebook, we can imagine that it’s basically impossible for non-coders to read a notebook file. It’s not Excel! Additionally, passing files and screenshots over email is painful and makes it nearly impossible to keep versions up to date.
This is where a handoff is often made from a code-based tool to a visualization/BI tool, for which non-coders purchase seat licenses in order to consume the insights. However, problems still arise when your organization doesn’t have the budget to provide seats for every business consumer, or when you work with a team that requires the most up-to-date output on demand. For example, sales or customer support teams need client deliverables very frequently, but data scientists are not able to respond quickly to these ad hoc requests.
A data scientist once told me that he actually tried helping his customer-facing team members set up a Jupyter notebook instance on their own laptops so they could run his code … but that didn’t end well. It’s no surprise that this question on Stack Overflow, “How can I share Jupyter notebooks with non-programmers?”, was viewed over 130,000 times!
Coder – to – IT
Different organizations place different demands on their data science teams. When a data scientist deals with big data and complex models, her laptop won’t provide the compute power needed. If her organization has an IT department that is already used to supporting the data science team, they may turn around a solution for her easily. However, if she is the first person making this type of request, or the IT team has limited resources and has to prioritize another department over hers, she’s out of luck. The same problem occurs for data scientists who are supported by Engineering: instead of submitting tickets to IT, the data science team has to wait for a two-week sprint to end before they can get their project moving.
I often hear, “We want to have nothing to do with our IT team, if possible.”
Collaborate in Zepl
Zepl solves the collaboration problem by building an intuitive cloud-based notebook platform that provides a flexible and consistent environment for any organization to bring its coders and non-coders together without having to involve IT. Its built-in Plot.ly library, notebook publishing capabilities, and permissions and authentication methods allow data science insights to be shared throughout the organization securely and easily. By building a flexible compute infrastructure into the cloud environment, it also eliminates the data teams’ dependencies on IT or engineering and speeds up project cycles.