8 Red Flags To Watch Out For When Hiring Data Scientists

CoderPad – Data Science | https://coderpad.io/blog/data-science/8-red-flags-to-watch-out-for-when-hiring-data-scientists/ | Published Fri, 06 Oct 2023

Analytics is 50% math and 50% communication. If a person cannot express their ideas in written or presentation format, it doesn’t matter if they can do the math.

Mia Umanos, CEO of Clickvoyant

A CV filled with impressive credentials can capture attention, but it’s the subtleties during an interview that reveal the most about a candidate. Unlike many other fields, data science requires a unique blend of technical expertise, business acumen, and interpersonal skills.

Every hiring manager knows the gravity of a wrong hire, especially in a domain as critical as data science. 

A misfit can not only hinder project progress but can also disrupt team dynamics, making the interview process all the more crucial. 

It’s not just about assessing the candidate’s knowledge of algorithms or programming languages, but also understanding their problem-solving approach, communication style, and adaptability.

In this landscape, knowing what to look for during interviews becomes paramount. Red flags can sometimes be subtle, easily masked by a candidate’s confidence or eloquence. However, a keen eye can spot these signs, which often hint at deeper underlying issues.

While no single interview technique guarantees a perfect hire, being aware of potential pitfalls can significantly enhance the hiring process’s effectiveness. 

This post delves into eight red flags that job candidates might display during data science interviews, helping you make informed decisions and securing the best talent for your team.

1. They build models without business context

Many technical projects for data science interviews involve having the candidate work with real or simulated data to solve an actual business problem the hiring company may face.

This is a great way to see how a candidate would work on your team by seeing what actual insights they can deliver, given some information about your business. 

However, some candidates will ignore the business problem and instead focus on showing off their modeling skills in an effort to show you what kind of insights they can deliver with the little bit of information you gave them.

The problem with this is precisely that they’re only working with a little bit of information. 

Unless you’ve spent a few hours with them going over your business model, all the various pieces of data you collect, the nuances of the business, and all the relevant business contexts of the data, then their model is going to be useless at best or drive harmful business decisions at worst. 

When candidates create predictive models without knowing the business, they display a lack of humility and an inclination to jump to conclusions based on possibly faulty assumptions. 

This careless behavior can waste a lot of resources for your team and your business.

2. They show a lack of curiosity about stakeholders

This is a requirement for every data role – a data scientist who doesn’t understand internal stakeholders and customers will fail to produce valuable data insights.

The logic behind this is similar to the first red flag. Without learning about how the business operates and who the primary users are, the candidate is forced to rely upon assumptions about the context of the data within the company.

[Four-panel comic: a confident presenter stands at a whiteboard promising "amazing dashboards" and "advanced predictive modeling techniques" for stakeholders, "all without stakeholder input." In the final panel he looks unsure as he rereads that last line.]

Without input from the people actually utilizing the data, this candidate would be working in a black box with zero feedback from others. That’s a recipe for disaster and will undoubtedly lead to useless data insights. 

3. They seem unwilling to learn and grow

Data science is a healthy balance of programming, stakeholder communication, good judgment, and some applied statistics. 

No matter how senior, the candidate should show a willingness to improve those skills.

You can usually gauge this in an interview by asking them what they’re currently learning about or about a lesson they recently learned based on a mistake they made. If they are unwilling to learn or can’t tell you a story about improving on their mistakes, that is a noticeable red flag.

4. They are unwilling to receive feedback

Often, a data scientist is a black box to stakeholders. “I don’t know how they do it, but they make these models that predict the future, and it’s basically magic to me, but it works” is a sentiment a data scientist has likely heard at least once.

Data scientists, then, have to accept that stakeholders will regularly ask them to explain their output and conclusions in an easy-to-understand way – this is especially true when they deliver insights that go against common business intuition.

They will need to be able to field these kinds of curious questions as well as handle constructive feedback from others. If they can't – if they shut down or react with defensive anger – it shows an unwillingness to either defend their ideas or admit, with humility, that they might be wrong.

You also probably won’t see these candidates interested in working with other teams or seeking feedback about their work if they were to join your team. Be careful if you choose to hire them.

5. They are unable to communicate with non-technical stakeholders

Non-technical stakeholders will always be involved in some aspect of data science work – whether as consumers of the data insights or as the people responsible for sharing the context behind a new data source.

The ability to break down very technical information into a format that won’t overload people is crucial. 

Some stakeholders won’t have the knowledge base to understand (or care to understand) the statistical methods behind a conclusion.

Frequently, they’re busy enough that they just want to know what insights your candidate can provide to make their lives easier or the company’s finances better.

Data science candidates should be willing and able to break down complex information for teams outside their own – whether for accounts payable, sales, marketing, or any other department that needs to utilize the information.

If candidates can’t do that, it’s a massive red flag because it means they likely won’t be able to hold on to stakeholder trust for long. If stakeholders don’t trust the source of their new insights, they won’t be willing to act on them, and you have a big problem.

🔖 Related resource: Mastering Jupyter Notebooks: Best Practices for Data Science

6. They cannot justify their technical decisions

Just like being unable to communicate with non-technical stakeholders, if a candidate can’t describe and reasonably defend their choices at a technical level, then they will not be able to hold on to technical stakeholder trust.

A data scientist should be able to describe steps taken to clean, transform, and operate over data at a reasonably technical level. Some examples candidates could use:

  • With the help of a developer or data engineer, they used SQL queries to clean up the data they wanted to use.
  • They noticed missing values in the data used for model training, so they imputed the column median wherever data was missing (under ~15% of all rows). They explain that doing this mitigates the bias those gaps would otherwise introduce into the final model’s predictions.
  • They rescaled the metric they want to predict to a range between 0 and 1 (min-max normalization) so that the prediction output is more easily interpretable.
  • They coded all categorical columns into a sparse dataset of 0s and 1s to include non-numerical predictors in the model, some of which help raise prediction accuracy.
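The last three preparation steps in the list above can be sketched in a few lines of plain Python. The dataset, column names, and values here are invented for illustration; a real interview answer would likely use pandas or scikit-learn, but the underlying logic is the same.

```python
from statistics import median

# Hypothetical toy dataset: one numeric feature with gaps, one categorical column.
ages = [25, 31, None, 40, None, 52]           # missing values to impute
plans = ["basic", "pro", "basic", "enterprise", "pro", "basic"]

# Median imputation: fill gaps with the median of the observed values.
observed = [a for a in ages if a is not None]
med = median(observed)                         # median of 25, 31, 40, 52 -> 35.5
ages_filled = [a if a is not None else med for a in ages]

# Min-max normalization: map the metric onto [0, 1] for easier interpretation.
lo, hi = min(ages_filled), max(ages_filled)
ages_scaled = [(a - lo) / (hi - lo) for a in ages_filled]

# One-hot encoding: turn the categorical column into sparse 0/1 indicator columns.
categories = sorted(set(plans))                # ['basic', 'enterprise', 'pro']
one_hot = [[1 if p == c else 0 for c in categories] for p in plans]

print(ages_filled)   # [25, 31, 35.5, 40, 35.5, 52]
print(one_hot[0])    # [1, 0, 0]
```

A strong candidate can walk through each of these transformations and explain why it was chosen, not just that it was applied.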

Fortunately, this red flag is pretty easy to detect in an interview – you set up your question in something like a Jupyter Notebook, hand it off to the candidate, and then have them walk you through the logic behind their models or algorithms. You can ask them to explain things that don’t make sense to you or that you would have done differently, and if they can’t explain it, you may want to move on to the next candidate. 

7. They lack proficiency in SQL and don’t understand databases

This goes along with the previous point, but anyone working in data should understand how to query that data and how it is collected and stored. 

[Two-panel comic: a speaker asks a crowd, "Who wants to be a data scientist?" and every hand goes up. He then asks, "Who wants to learn SQL?" and every hand goes down.]

For junior candidates, you may want to include a few SQL questions in the coding portion of the interview. For more senior candidates, a few verbal questions about database design or query structure should suffice – handing them a technical question that is too simple wastes their time and may even insult them.
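As one hedged example of the kind of junior-level SQL screen this might look like: aggregate revenue per customer from an orders table. The table, data, and expected answer below are invented for the demo; Python's built-in sqlite3 module is just a convenient way to run it anywhere.

```python
import sqlite3

# Set up a throwaway in-memory database with a toy orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('acme', 120.0), ('globex', 75.5), ('acme', 30.0), ('initech', 200.0);
""")

# The interview question: total revenue per customer, highest first.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer
    ORDER BY revenue DESC
""").fetchall()

print(rows)  # [('initech', 200.0), ('acme', 150.0), ('globex', 75.5)]
```

A candidate who is comfortable with GROUP BY, aggregates, and ordering can answer this in a minute; a candidate who isn't has revealed the red flag early.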

8. They have shiny object syndrome

Part of the appeal of getting into data science these days is all the new technologies and tools you work with.

That’s fine. In fact, that curiosity can be a boon to your team. 

However, data scientist candidates should also be willing to show how they’ve done the tedious but essential grind work that often comes with algorithm and model development.

If they’re always worried about learning the newest technologies at the expense of doing necessary work (i.e., shiny object syndrome), you may want to pass on adding them to your team.

Conclusion

Detecting these red flags in an interview is easier than you think.

You can easily test a candidate’s basic programming skills and their ability to understand and communicate data by using a tool like Jupyter Notebooks in your interviews.

CoderPad has an integration that allows you to do just that – check out the pad below for an example question you can use in your own data science interviews.

Some parts of this blog post were written with the assistance of ChatGPT.

The Differences Between Data Science And Data Engineering Job Roles

CoderPad – Data Science | https://coderpad.io/blog/data-science/the-differences-between-data-science-and-data-engineering-job-roles/ | Published Thu, 28 Sep 2023

Whether charting a career, hiring the right talent, or staying updated with industry shifts, it is important to understand the similarities and differences between data science and data engineering.

The goal here is simple: to help you understand what data scientists and data engineers do, especially when you’re scrolling through job listings and the terminology starts to blur together. 

Even though these roles have “data” in the name, they’re not the same animal. And to make things even more interesting, the lines between them are getting fuzzier, thanks to all sorts of tech advancements.

In this article, we’ll break down the nitty-gritty of each role, explore why they’re starting to overlap, and give you some honest advice on keeping up in this ever-changing field. Ready to jump in?

🔖 Related resource: Jupyter Notebook for realistic data science interviews

1: The traditional understanding of data science and data engineering

The roles and responsibilities traditionally associated with data engineering and data science serve as a guiding framework for professionals and organizations. However, these traditional descriptions are rooted in a “classic” view – we will see later that both fields are ever-evolving, and the boundaries are more flexible than they once were.

Data science

Data science is an interdisciplinary field that employs various techniques, algorithms, and processes to extract valuable insights from structured and unstructured data. 

However, this field goes beyond the mere application of algorithms and machine learning models; it also incorporates scientific methods that require a deep understanding of the data. Data science involves formulating hypotheses, applying appropriate statistical tests, and interpreting the results to derive actionable insights.

Data science is not merely a technical discipline but an integration of methods that require a scientific temperament. Beyond the algorithms and codes lies the essence of inquiry, exploration, and discovery. It’s about asking the right questions, challenging assumptions, and iterating based on evidence. In this light, data science transcends its technical roots, encapsulating a holistic methodology reminiscent of traditional scientific fields.

Data engineering

Data engineering, on the other hand, focuses on the practical aspects of data handling. This field creates and maintains the architecture, allowing data collection, transformation, and storage. 

Data engineering serves as the foundational layer upon which data science operates. Data engineers often play a significant role in setting the stage for the smooth deployment of applications. However, deployment itself can be a collaborative effort. Sometimes, roles such as machine learning engineers and specialized DevOps teams come into play. 

The frameworks and pipelines data engineers establish make it easier for data scientists to perform their analyses efficiently. Data engineering can be viewed as the backbone that supports the data science function, ensuring that data is readily available, clean, and in a format that can be easily manipulated for analytical purposes.

The radar chart below represents a general perspective of the competencies of data engineering and data science roles. Both professions share foundational data skills, with data engineering leaning toward infrastructure tasks and data science leaning toward analytical and scientific expertise. 

2: Shifting boundaries between data engineering and data science

The traditional definitions we discussed earlier provide a solid starting point, but the landscape is shifting. The roles of data scientists and data engineers are not strictly categorized.

The gray areas

In the evolving data ecosystem, one can’t help but notice the increasing overlap between data science and data engineering skills. 

It’s becoming less rare to encounter data scientists who are not just analyzing data and training models but are also involved in setting up and managing data pipelines. For instance, consider a scenario where a data scientist is building a real-time machine-learning model for fraud detection in online transactions. To ensure the timely processing of each transaction, the scientist might need to design a custom data pipeline that efficiently fetches, preprocesses, and feeds transaction data to the model.
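The fraud-detection scenario above can be sketched as a minimal fetch–preprocess–score pipeline. Everything here is invented for illustration (the field names, the stand-in scoring rule, the queue); a production system would use a stream processor and a trained model, but the shape of the custom pipeline a data scientist might build is the same.

```python
def fetch(queue):
    """Pull the next raw transaction off an incoming queue."""
    return queue.pop(0) if queue else None

def preprocess(txn):
    """Normalize the fields the model expects: dollars, lowercase country code."""
    return {"amount": txn["amount_cents"] / 100, "country": txn["country"].lower()}

def score(features):
    """Stand-in for a trained model: flag large foreign transactions."""
    return features["amount"] > 1000 and features["country"] != "us"

incoming = [
    {"amount_cents": 250_000, "country": "FR"},
    {"amount_cents": 4_200, "country": "US"},
]

flags = []
while (txn := fetch(incoming)) is not None:
    flags.append(score(preprocess(txn)))

print(flags)  # [True, False]
```

The point is not the toy rule but the division of labor: the data scientist here owns the preprocessing and feeding stages that would traditionally have fallen to a data engineer.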

Conversely, a data engineer working on a platform for personalized content recommendations might need a foundational understanding of data analytics and statistical modeling. This would allow them to design a system that can efficiently gather user behavior data and feed it into recommendation algorithms, ensuring that data scientists can fine-tune models effectively.

But what’s driving this convergence of job responsibilities?

The catalysts of change

Technological advancements play a central role in reshaping the dynamics between these professions. The surge in innovative tools and platforms has democratized many aspects of data handling, allowing professionals to venture beyond their traditional scopes.

Take cloud computing, for instance. AWS, Google Cloud, and Azure have revolutionized data storage and processing. With their user-friendly interfaces and myriad services, data scientists find it easier than ever to set up databases, manage data streams, or even deploy machine learning models without heavy reliance on data engineers.

Moreover, the rise of platforms offering “Data Science as a Service” (DSaaS) further blurs the lines. These platforms provide end-to-end solutions, from data ingestion to model deployment, demanding a more holistic skill set from their users. AI’s influence in enhancing these platforms, offering real-time suggestions and automation capabilities, cannot be overstated.

3: Adapting to a rapidly changing landscape

If there’s one constant in the world of data, it’s change. The pace of innovation and the introduction of new technologies ensure that the data landscape is continually evolving. The dynamic field brings opportunities and challenges, requiring data professionals to be agile and committed to continuous learning. 

The importance of continuous learning

The ability to adapt and grow is not just an asset. It’s a necessity. The onus is on professionals to keep their skills updated and stay attuned to industry shifts. At the same time, recruiters must recognize and value these evolving skills, understanding that today’s sought-after expertise might differ from tomorrow’s.

Tips for keeping up-to-date

So, how do you stay up to date, whether you’re hiring or job seeking? Here are some actionable tips:

  1. Online courses: Many platforms offer specialized data science and data engineering courses. These can be invaluable for learning new skills or updating your existing knowledge. Pro tip: just make sure the course you choose is up to date. For those looking to sharpen their coding skills or experiment in a real-time environment, consider tinkering in the CoderPad sandbox.
  2. Workshops and boot camps: These provide hands-on experience and are often more focused and cutting-edge, helping you quickly gain practical skills.
  3. Industry publications: Staying current means knowing what’s happening in the industry. Follow reputable publications and journals to inform yourself about the latest trends and research.
  4. Networking: Connect with peers and industry experts through social networks, conferences, and webinars. These interactions can provide insights and expose you to different perspectives.
  5. Technological curiosity: New tools and methods emerge constantly. Keep an eager eye on evolving technologies and methodologies. For recruiters, this means understanding the significance of new tools when they appear on resumes. For data professionals, it’s about mastering them.

Building on timeless foundations

Despite the constant evolution of the data realm, certain foundational pillars remain unchanged. These core competencies are the bedrock that stabilizes one’s journey through the shifting sands of data science and data engineering.

  1. Essentials of data structures and algorithms: A robust understanding of data structures and fundamental algorithms ensures optimal data manipulation and processing, irrespective of the tools in use. Tackle programming challenges on platforms like CodinGame.
  2. Statistical and mathematical mastery: The essence of data science lies in statistics and mathematics. Grasping core statistical concepts and linear algebra forms the basis for more advanced techniques.
  3. Programming proficiency: Command over programming languages, be it Python, R, SQL, C++, or JavaScript, is indispensable. While tools may evolve, the logic and reasoning behind programming endure. 
  4. Data intuition: It’s about developing a sixth sense for data: recognizing patterns, understanding its nuances, and asking the right questions. It’s an art honed over time. 
  5. Foundational system design: Understanding system interactions and scalability is crucial for data engineers. As data demands grow, this knowledge ensures the infrastructure evolves in tandem. Study architectural designs of successful systems, attend workshops, and seek mentorship from seasoned professionals.
  6. Ethical grounding: With great data power comes great responsibility. An ethical foundation, considering biases, privacy, and broader societal impacts, is paramount.
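To make point 1 concrete: the right data structure changes the cost of an everyday data task. Finding repeated IDs with a set is a single O(n) pass, whereas the naive nested-loop version is O(n²). The IDs below are toy data for the demo.

```python
def duplicates(ids):
    """Return the sorted set of IDs that appear more than once."""
    seen, dupes = set(), set()
    for i in ids:
        if i in seen:        # O(1) average-case membership test, vs O(n) for a list
            dupes.add(i)
        seen.add(i)
    return sorted(dupes)

print(duplicates([7, 3, 7, 1, 3, 9]))  # [3, 7]
```

This kind of reasoning about structure and cost outlives any particular tool, which is exactly why it belongs among the timeless foundations.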

While adapting to the changing landscape is vital, anchoring oneself in these foundational skills ensures resilience and depth in one’s data journey. They provide the stability needed to navigate the complexities of ever-evolving technologies.

4: The complementary nature of data skills

Understanding data science and data engineering as separate fields is only one puzzle piece. These roles exist within a broader data ecosystem, functioning as complementary pieces, each empowering the other. 

Above all, success isn’t merely determined by distinct roles but by the competencies individuals bring to the table. A seamless fusion of data engineering and data science skills is of prime importance for driving impactful results.

How data engineering competencies support data science tasks

For data professionals: Understanding the symbiotic relationship between data engineering competencies (like crafting robust data infrastructures) and data science tasks (like deriving insights) is crucial. 

Imagine trying to analyze data that’s poorly structured or not easily accessible. It’s akin to deciphering a book with jumbled pages. Mastery in both areas ensures that data initiatives have a solid foundation and deliver actionable and timely insights.

For recruiters: When hiring, it’s essential to look beyond traditional role titles and focus on the competencies candidates bring. Overemphasizing one skill set while neglecting the other can lead to challenges:

  • Inefficient data processing: Without competencies in setting up agile data pipelines, real-time analytics or handling of large datasets can become cumbersome, regardless of one’s proficiency in data analysis.
  • Poor data quality: Effective data analysis relies on clean, structured data. Lacking data curation and management competencies can result in flawed or misleading insights.
  • Increased costs: Addressing issues after they arise is often more resource-intensive. If someone adept in data analysis has to pause to address infrastructure issues, it prolongs timelines and increases costs.

As the lines between roles blur, a holistic approach emphasizing a blend of competencies in data engineering and science becomes the gold standard for successful data initiatives.

A spectrum of data competencies

It’s not just a binary distinction between data engineering and data science competencies. There’s a spectrum of skills and expertise that individuals might possess. For instance, many professionals, regardless of their title, might lean heavily into data interpretation, bridging the gap between raw data and actionable insights. This competency, often associated with data analysts, emphasizes understanding and translating data patterns into tangible business strategies.

Similarly, another sought-after competency is the ability to code algorithms that enable computers to learn from data and make decisions. While traditionally linked to machine learning (ML) engineers, this skill is becoming essential across many data roles. It involves building on foundational data handling and analysis competencies to create automated, data-driven solutions.

Enter MLOps competencies, which have become increasingly critical to an organization’s data policies and processes. These skills revolve around streamlining the machine learning lifecycle, from model development to deployment and monitoring. 

Professionals adept in MLOps ensure that machine learning models are accurate, scalable, maintainable, and seamlessly integrated into production environments. Additionally, with the modern trend of agile workflows in data teams, mastering Continuous Integration and Continuous Deployment (CI/CD) practices has become a necessary competency, facilitating rapid and reliable model updates in production.

This visualization presents a comparative view across four pivotal data fields: data engineering, data analysis, data science, and machine learning engineering, rather than specific roles such as data engineer or data scientist. 

This distinction is vital because thinking about fields provides a broader perspective on the skill landscape. An individual can excel in specific competencies within an area without fitting neatly into a predefined role. Each field displays unique strengths with noticeable overlaps, highlighting the significance of interdisciplinary skills in the data domain.

For data professionals and recruiters, it’s essential to recognize that possessing every skill in the vast data landscape is nearly impossible. Just as in soccer, where not everyone is a scorer and lacking a goalkeeper spells trouble, it’s about building a balanced team. The focus should be on ensuring complementary skills at the team level, where each member’s expertise fills the gaps of another.

Domain expertise: The forgotten hero of data science

While foundational data skills are crucial, one of the distinguishing attributes of an exceptional data scientist is their expertise in the domain of the data they work with. 

For instance, a data scientist in the healthcare sector should possess a deep understanding of health data, medical terminologies, patient care processes, and the nuances of clinical trials. Similarly, a data scientist working in the financial sector should be well-versed in trading dynamics, financial instruments, and regulatory norms. 

This domain knowledge is not just about understanding the data. It’s about grasping its real-world implications, challenges, and opportunities.

Such domain expertise is invaluable. It ensures that insights derived are contextually relevant, biases and limitations of the data are recognized, and the solutions proposed are practical and actionable. Moreover, domain expertise fosters effective communication with other stakeholders, ensuring that data-driven solutions align with broader organizational goals and industry-specific challenges.

Data professionals should delve into their respective domains, leveraging industry-specific resources and collaborating with experts to amplify their analysis. For recruiters, assessing candidates on technical prowess and their understanding and experience within the industry is vital. Fusing data skills with domain expertise is the benchmark for effective data science in any sector.

5: Navigating job listings: A skill-centric approach

With a grasp on the spectrum of data competencies, you’re better equipped to discern job listings. However, translating this understanding to the practical realm of job applications and interviews can be tricky. Here’s a guide to help you focus on the skills and expertise employers are seeking.

Spotting key competencies in listings

Navigating job listings requires an astute eye for competencies, especially since the traditional boundaries of roles have become less distinct. Here’s a breakdown of terms and competencies to watch out for:

For data analysis and modeling:

  • Statistical modeling: Harnessing data to predict or understand trends.
  • Data analysis: Interpreting data to uncover meaningful insights.
  • Machine learning: Leveraging algorithms for predictive modeling.
  • Insight generation: Converting data into actionable strategies.
  • Inference or predictive analytics: Foreseeing future events based on historical data.
  • Data visualization: Graphically representing data insights.

For data infrastructure and management:

  • Data pipelines: Creating and managing processes for data flow.
  • Data storage: Tasks centered around maintaining databases.
  • ETL processes: Extract, Transform, and Load processes for data preparation.
  • Database management: Ensuring optimal database performance.
  • Cloud computing: Expertise in AWS, Azure, or Google Cloud platforms.
  • Data architecture: Overseeing the design and management of data structures.
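The ETL item in the list above can be sketched end to end with stdlib pieces. The CSV contents, table name, and cleaning rules are invented for the demo; real pipelines would use an orchestrator and a warehouse, but the extract–transform–load pattern is the same.

```python
import csv
import io
import sqlite3

# A raw source with messy whitespace and inconsistent casing (invented data).
raw_csv = "user,signup_date\nAda, 2023-01-05 \nGrace,2023-02-11\n"

# Extract: read raw records from the source (here, an in-memory CSV).
records = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: strip whitespace and normalize the fields the target store expects.
cleaned = [(r["user"].strip().lower(), r["signup_date"].strip()) for r in records]

# Load: write the cleaned rows into the target store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE signups (user TEXT, signup_date TEXT)")
db.executemany("INSERT INTO signups VALUES (?, ?)", cleaned)

count = db.execute("SELECT COUNT(*) FROM signups").fetchone()[0]
print(count)  # 2
```

Even at this toy scale, the three stages are cleanly separable, which is what makes ETL pipelines testable and maintainable as they grow.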

Other essential competencies:

  • Structured Query Language (SQL): Directly relates to database querying or management.
  • Git: Highlights collaborative coding and software development aspects.
  • Python (or other programming language): Indicates scripting, data manipulation, or machine learning tasks.
  • Distributed File System (DFS): Managing datasets across multiple servers.
  • Spark: Processing substantial datasets efficiently.
  • Docker & Kubernetes: Pointing towards deployment and scaling of applications.
  • Application Programming Interfaces (APIs): The role may involve software tool integration or data exchange.

Moreover, as previously emphasized, domain expertise is a cornerstone for many of these roles, especially for data scientists. A deep understanding of the specific industry or sector can be as crucial as technical acumen. Given this complex landscape, it becomes even more imperative to focus on competencies and domain knowledge rather than adhering strictly to job titles.

Beyond these competencies, don’t forget the multifaceted nature of roles like machine learning engineers, MLOps specialists, and data analysts. While each position has a distinct emphasis, there’s a significant convergence of skills. Also bear in mind that the skills listed here are incomplete and may soon become outdated.

Key questions to unearth role expectations

Whether you’re evaluating potential hires or considering a job offer, probing with the right questions can demystify role expectations. Here are some competency-focused queries:

For roles emphasizing data analysis and modeling:

  1. What are some recent projects that involved data modeling or predictive analysis?
    • Intent: Gauge the depth and breadth of hands-on experience with core analytical tasks.
  2. How collaborative is the process between teams focusing on data infrastructure and data analysis?
    • Intent: Understand the candidate’s/team’s exposure to interdisciplinary collaboration, which can be crucial in ensuring seamless data operations.
  3. Which statistical tools and programming environments are predominantly used?
    • Intent: Ascertain familiarity with industry-standard tools and adaptability to potential new tools.
  4. What is your approach when faced with data anomalies or inconsistencies during analysis?
    • Intent: Gauge problem-solving skills and diligence in ensuring data integrity.
  5. How do you communicate complex data findings to non-technical stakeholders?
    • Intent: Understand the ability to translate technical insights into actionable business recommendations.
  6. Can you share an instance where your domain expertise significantly influenced a data analysis or modeling decision?
    • Intent: Assess the candidate’s domain knowledge’s depth and practical application in data tasks.

For roles emphasizing data infrastructure and management:

  1. What systems are in place for data storage and retrieval?
    • Intent: Ascertain understanding and experience with data architecture, especially concerning scalability and efficiency.
  2. How does the organization ensure data quality during ETL tasks?
    • Intent: Assess the rigor and robustness of data processing and validation methods.
  3. Can you describe a time when you had to scale a data pipeline to handle a significant increase in data volume?
    • Intent: Understand experience with scalability challenges and solutions.
  4. How do you handle data security and compliance, especially with regulations like GDPR or CCPA?
    • Intent: Gauge awareness and implementation of data security best practices and regulatory compliance.
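A strong answer to the ETL data-quality question above usually names concrete validation gates. As a minimal, hypothetical sketch (the column names and thresholds are illustrative, not from any particular pipeline):

```python
import pandas as pd

# Hypothetical batch of records arriving from an ETL step
batch = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [9.99, None, 5.00, -3.50],
})

# Simple quality gates a candidate might describe: duplicates,
# missing values, and out-of-range values, counted per batch
issues = {
    "duplicate_ids": int(batch["order_id"].duplicated().sum()),
    "null_amounts": int(batch["amount"].isna().sum()),
    "negative_amounts": int((batch["amount"] < 0).sum()),
}
print(issues)  # {'duplicate_ids': 1, 'null_amounts': 1, 'negative_amounts': 1}
```

In practice these checks would feed an alerting or quarantine step rather than a print statement, but the shape of the answer is the same.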

Such pointed questions clarify the role’s emphasis on specific competencies and shed light on the organization’s broader data strategy and culture.

Conclusion

Recognizing the evolving interplay and convergence between data science and data engineering domains is crucial. This article has highlighted the unique competencies of each while emphasizing their increasing intertwinement. 

The key takeaway is the importance of adaptability and a continuous learning mindset based on solid, timeless foundations. As the lines between roles blur and technology advances, professionals should prioritize a competency-focused approach. It’s less about the exact role one holds and more about the diverse skills one brings.

In sum, the data realm is in constant flux, and success hinges on an interdisciplinary blend of skills, constant upskilling, and a deep appreciation of the changing data world. Embracing this holistic perspective ensures individuals and organizations remain at the forefront of data-driven innovation.

Additional resources to stay updated

  1. Reddit communities:
    • /datascience: A vibrant community discussing various facets of data science, from beginner topics to advanced research.
    • /dataengineering: Dive into the technical challenges and solutions that data engineers encounter, shared by both novices and experts.
  2. Online publications:
    • Towards Data Science on Medium: This publication spans various articles on data science, machine learning, AI, and more, contributed by professionals and researchers globally.
  3. Podcasts:
    • Data Skeptic: Kyle Polich hosts this insightful podcast, delving into the intricacies of data science, statistics, machine learning, and AI, often enriched by interviews with industry experts.
  4. Strong fundamentals with books: 
    • “The Pragmatic Programmer” by Andrew Hunt and David Thomas: A timeless classic that offers practical advice on software development best practices, this book is invaluable for anyone involved in coding, including data professionals.
    • “The Visual Display of Quantitative Information” by Edward R. Tufte: Often regarded as one of the best books on data visualization, Tufte’s work is a must-read for understanding the art and science behind visually representing data.
    • “Think Stats: Probability and Statistics for Programmers” by Allen B. Downey: A practical guide to statistics and probability using Python. This book focuses on real-world examples and exercises to teach programmers and data scientists essential statistical concepts.
  5. Keep your data science skills up-to-date by practicing a CoderPad Interview data question, like the one below.
From Data Chaos to Actionable Insights: My Team’s Quest to Hire our First Data Scientist
https://coderpad.io/blog/data-science/from-data-chaos-to-actionable-insights-my-teams-quest-to-hire-our-first-data-scientist/
Thu, 21 Sep 2023

Editor’s Note: This article was guest-authored by CoderPad’s Senior Principal Engineer Jonathan Geggatt about his data science hiring experience with his previous company, HotelTonight.

In the fast-paced business landscape, the ability to harness data effectively can be a game-changer. 

This became quickly apparent to me during my time as the Director of Data Science and Engineering at HotelTonight, when my team and I identified specific challenges that needed the expertise of a data scientist. Our goal was simple: to transition from a reactionary data team to one that could proactively generate actionable insights and keep a finger on the business’s pulse.

Identifying the Need for a Data Scientist

Initially, our data team functioned in a reactionary mode, scrambling to respond to requests for data insights. This approach often resulted in missed opportunities for strategic planning and proactive decision-making. To counter this, we embarked on a journey to build a data science team that could unravel ambiguous tasks and extract valuable insights to steer the business in the right direction.

🔖 Related resource: Jupyter Notebook for realistic data science interviews

Crafting a Comprehensive Hiring Process

The first step in our journey was to establish a hiring process that could identify candidates who aligned not only with the technical requirements but also with our organizational culture. Our Talent Acquisition team took charge of the initial screening, filtering resumes based on interests and salary expectations, bypassing the technical pre-screen to dive deeper into potential fits.

There would then be a phone screen where the hiring manager introduced the candidates to the team dynamics and the expectations tied to the role. Following this, we conducted a technical interview to assess the foundational skills in SQL and Python. A cultural fit interview ensued, allowing us to gauge how well the candidates could integrate into our organizational ethos.

The highlight of the process was the data science project exercise, where we presented candidates with high-level questions crafted by our team. This stage was crucial in evaluating a candidate’s ability to handle real-world scenarios using tools like Snowflake and Jupyter Notebooks, offering a realistic and immersive experience with real, anonymized, and scrubbed data.

We would spend about 20 minutes going over the question with them, including an overview of the data dictionary. We also gave them the code for a database connection to reduce the amount of time they spent on trivial tasks.

They would then spend one to two hours on the task and present the results to us as screenshots or in a Jupyter Notebook.

🔖 Related read: Mastering Jupyter Notebooks: Best Practices for Data Science

Learning and Adjusting Along the Way

In the early stages, we faced the challenge of not knowing the “right” answers to our high-level questions — a gap we were hoping to fill with the new hire. The criteria we set for evaluating the candidates were centered around their past projects and their ability to convey complex insights to non-tech stakeholders. We were cautious about candidates who ventured into building predictive models, considering their limited knowledge about our business.

The Evolution of the Hiring Process

As the data science team took shape, our hiring process matured. We could now focus on the techniques used by candidates to tackle problems, expanding our hiring spectrum to include fresh graduates and individuals transitioning into data science from unique backgrounds. This transition also equipped us with the discernment to sift through the applicants effectively, identifying those who could genuinely add value to our business as opposed to those who merely echoed what clients wanted to hear.

Leveraging Technology for a Competitive Edge

The integration of Jupyter Notebooks in our hiring process turned out to be a significant advantage. By allowing candidates to use familiar tools, we set them up for success, giving them the opportunity to showcase their best work. The use of our Snowflake data set not only offered a realistic experience but also allowed us to flaunt our tech stack, giving us an edge in attracting top talent.

Key Takeaways: Finding the Right Blend of Technical and Soft Skills

As we navigated this journey, it became evident that crafting questions aligning with our business objectives was crucial in finding the right person. The ultimate role of a data scientist in our team extended beyond technical expertise to include the ability to simplify complex data for stakeholders, making personal and communication skills a vital aspect of the selection criteria.

In conclusion, my journey at HotelTonight in hiring our first data scientist was one of learning and evolution. As we move forward, the focus remains on finding individuals who can decomplexify data, turning it into actionable insights and fostering a proactive, data-driven business culture.

Improve your own data science interviews

CoderPad makes it easy to assess data science candidates with our new Jupyter Notebook-integrated pads. Check it out for yourself in the pad below.

2 Interview Questions for Vetting Data Science Candidates
https://coderpad.io/blog/data-science/2-interview-questions-for-vetting-data-science-candidates/
Mon, 18 Sep 2023

As artificial intelligence and machine learning technologies continue to boom, the search for proficient data scientists has become increasingly difficult as you try to evaluate – or even locate – the ideal candidate.

This translates to a potential increase in the time and financial resources expended in recruiting for these roles. To avoid squandering both time and money, it is critical to ensure that you’re selecting the right data scientist for your organization. Repeating the hiring process multiple times is undeniably an unwise utilization of both time and resources.

Therefore, if you’re intent on hiring the best data scientist for your team, a thoughtful evaluation of the appropriate data science competencies is essential.

Most importantly, it’s vital to scrutinize both their statistical analysis capabilities and their expertise in data management and machine learning as those skills pertain to your team’s needs. Depending on your hiring criteria, this might entail an assessment of specialized competencies such as proficiency in Python or R, or more general skills like data visualization and predictive modeling.

Equally important – given your company’s data architecture, databases, and cloud storage – is a deep understanding of big data technologies, data mining techniques, and knowledge of data privacy and ethics.

One of the superior methods to gauge these crucial skills in prospective employees is through the initiation of collaborative data projects or case studies within a realistic environment. That means devising insightful technical interview questions is a crucial aspect of the interview procedure, warranting particular focus.

In this post, we will delve into two data science interview questions that can serve as tools to gauge the aptitude of your candidates. Although initially set in specific analytical frameworks, you can modify them to align with your specific technical requirements – the principles are broad enough that the exact toolkit is not important.

🔖 Related resource: Jupyter Notebook for realistic data science interviews

Question 1: Iris Exploratory Analysis

Context

The Iris dataset is a well known, heavily studied dataset hosted for public use by the UCI Machine Learning Repository.

The dataset includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are:

  • id
  • sepal_length_cm
  • sepal_width_cm
  • petal_length_cm
  • petal_width_cm
  • class: the species of Iris

The sample CSV data looks like this:

sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
5.8,2.7,5.1,1.9,Iris-virginica

Directions

Using any analysis method you choose, either build a classifier or produce a data visualization that shows how the available data can be leveraged to predict the species of Iris.

Initial cell contents

Use this starter code to access the Iris dataset in this pad. Feel free to use either pandas or native Python for your work.

You may install additional packages by using pip in this Notebook’s terminal.

import pandas as pd
import pprint

# Load the results as a pandas data frame
result_df = pd.read_csv('iris.csv')

# Preview the results as a data frame
result_df.head()

# Convert to a native list of dictionaries (NaN becomes None)
result = result_df.where(pd.notnull(result_df), None).to_dict('records')

# Preview the results as a native list of dictionaries
pprint.pprint(result)

Success criteria

At minimum, a candidate should be able to conduct a basic analysis showing that they explored the data and found a way to separate the unique characteristics of each flower from the other.

For example:

  • Does one species of iris have longer petals than the other?
  • Can the candidate pose questions about the dataset and explore the data for answers to those questions?
  • Are the methods the candidate uses to explore the data reasonable? This question primarily requires some basic analysis and data visualization.  If a candidate starts off with a more complex approach, there may be a missed opportunity for fast, early lessons from the data, aka “low-hanging fruit.”
  • Can the candidate support any observations with plots?
  • How does the candidate form inferences from the data and how well does that candidate apply statistics to defend their inferences?
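To make the “low-hanging fruit” point concrete, here is a minimal exploratory sketch of the kind of first pass a strong candidate might make. The values below are a tiny in-memory stand-in consistent with the sample rows above, not the actual dataset:

```python
import pandas as pd

# Illustrative stand-in for iris.csv, a few plausible rows per species
df = pd.DataFrame({
    "petal_length_cm": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 5.1, 5.9, 5.6],
    "class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
})

# First question: does petal length alone separate the species?
summary = df.groupby("class")["petal_length_cm"].agg(["mean", "min", "max"])
print(summary)

# Setosa's longest petal falls well below versicolor's shortest, so a
# simple threshold separates it -- matching the "linearly separable"
# note in the context above
print(summary.loc["Iris-setosa", "max"] < summary.loc["Iris-versicolor", "min"])  # True
```

A candidate who starts with a summary like this, then backs it up with a plot and a simple classifier, is demonstrating exactly the progression the criteria above describe.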

Question 2: Forecasting Future Grocery Store Sales

Context

This example question uses one of the Getting Started competitions on Kaggle. The goal is to forecast future store sales for Corporación Favorita, a large Ecuadorian-based grocery retailer.

Data

  • train.csv: The training data, comprising time series of the features store_nbr, family, and onpromotion, as well as the target sales
  • test.csv: The test data, with the same features as the training data; it starts after the end date of the training data and covers 15 dates. You must predict the target sales for those dates
  • stores.csv: Store metadata, including city, state, type, and cluster (a grouping of similar stores)
  • oil.csv: Daily oil price data, included because Ecuador’s economy is susceptible to volatility in the oil market
  • holidays_events.csv: Data on holidays and events in Ecuador

Directions

  • You are expected to do at least one completed time series analysis that predicts future sales.
  • You are expected to show any data transformations and exploratory analysis.
  • You have full flexibility to use the provided data as desired, but at minimum the date and sales numbers need to be used.

Initial cell contents

Please review the context and data overview in the Instructions panel in this pad to gain a basic understanding of the available data and this exercise.

# Following code loads useful libraries 

# Useful for out of the box time series function libraries
install.packages('fpp3')
library(fpp3)

library(tsibble)
library(tsibbledata)
library(tidyverse)
library(ggplot2)
# Reading all the input datasets into memory

df_train <- read_csv("/home/coderpad/app/store sales files/train.csv",show_col_types = FALSE) %>%
  mutate(store_nbr = as.factor(store_nbr))

df_test <- read_csv("/home/coderpad/app/store sales files/test.csv",show_col_types = FALSE) %>%
  mutate(store_nbr = as.factor(store_nbr))

df_stores <- read_csv("/home/coderpad/app/store sales files/stores.csv",show_col_types = FALSE) %>%
  mutate(store_nbr = as.factor(store_nbr))

df_transactions <- read_csv("/home/coderpad/app/store sales files/transactions.csv",show_col_types = FALSE)

df_oil <- read_csv("/home/coderpad/app/store sales files/oil.csv",show_col_types = FALSE)

df_holidays_events <- read_csv("/home/coderpad/app/store sales files/holidays_events.csv",show_col_types = FALSE)
# Show training data 
head(df_train)
# Example visual of total daily sales

# Converting the data frame into a tsibble object
train_tsbl <- df_train %>%
  as_tsibble(key = c(store_nbr, family), index = date) %>%
  fill_gaps(.full = TRUE)

train_tsbl[is.na(train_tsbl)] <- 0

# aggregate data by stores
train_tsbl <- train_tsbl %>%
  aggregate_key(store_nbr, sales = sum(sales))

options(repr.plot.width = 18, repr.plot.height = 6)
train_tsbl %>%
  filter(is_aggregated(store_nbr)) %>%
  ggplot(aes(x = date, y = sales)) + 
   geom_line(aes(group=1), colour="dark green") +
  labs(title = "Total Sales")

Success criteria

At minimum, a candidate should be able to conduct a basic time series analysis showing that they explored the data, transformed it appropriately for a time series analysis, considered a confounding factor like seasonality, and interpreted results in a reasonably accurate way.

For example:

  • Does the candidate know to address auto-correlated data?
  • Does the candidate explore the data to find any necessary transformations/clean up needed ahead of the analysis?
  • Can the candidate identify seasonal patterns among store sales?
  • Is the candidate able to justify their analysis approach and conclusions?
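One quick check behind the seasonality and auto-correlation questions above can be sketched in a few lines. This example uses synthetic daily “sales” with a built-in weekly cycle purely for illustration, not the actual Kaggle data:

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with a weekly cycle plus noise (illustration only)
rng = np.random.default_rng(42)
days = pd.date_range("2017-01-01", periods=126, freq="D")
sales = pd.Series(
    100 + 25 * np.sin(2 * np.pi * days.dayofweek / 7) + rng.normal(0, 2, len(days)),
    index=days,
)

# A strong lag-7 autocorrelation relative to nearby lags is a quick hint
# of weekly seasonality worth modeling explicitly
print(round(sales.autocorr(lag=7), 2), round(sales.autocorr(lag=3), 2))
```

A candidate who runs this kind of check before fitting a model, and then chooses a seasonal specification accordingly, is demonstrating exactly the exploratory rigor the criteria above look for.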

🧑‍💻 You can access this question in a CoderPad sandbox here.

Conclusion

Assessing data scientists involves more than just scrutinizing their technical skills. To more accurately determine their fit for your team, we recommend consulting the supplementary interview guides found in the Related Posts section below.

Some parts of this blog post were written with the assistance of ChatGPT.

Mastering Jupyter Notebooks: Best Practices for Data Science
https://coderpad.io/blog/data-science/mastering-jupyter-notebooks-best-practices-for-data-science/
Wed, 06 Sep 2023

Ah, Jupyter Notebooks—a data scientist’s trusty companion, as reliable as a warm croissant in a French bakery. 

If you’ve ever dipped your toes into data science, machine learning, or even scientific computing, chances are you’ve encountered a Jupyter Notebook. These powerful tools have become ubiquitous for good reasons: they blend code, narrative text, equations, and visualizations in a single document, making data storytelling a breeze. 

The beauty of a Jupyter Notebook lies in its layers—the layers of information, code, and commentary that give context to raw data and findings. 

But as we all know, beauty can quickly turn to chaos without proper care. 

This becomes particularly crucial in team environments where clarity and readability are not just courtesies, but necessities. 

This article delves into best practices for working with Jupyter Notebooks, guided by a simple daily sales analysis project from a French bakery (French bakery daily sales | Kaggle). The accompanying Notebook is accessible here: Daily Sales Analysis Best Practices Demo. Alternatively you can use the sandbox embedded at the bottom of the page. 

In this article, you’ll discover why and how to keep your Notebooks focused, the role of Markdown for readability, discipline in cell execution, the importance of modular programming, and tips for optimized data loading and memory management.  

All right, let’s get this trip started. You’ll come away from the article with a newfound proficiency in Jupyter Notebook. 

🔖 Related resource: Jupyter Notebook for realistic data science interviews

1. Ensure your Notebook stays focused 

The Dilemma: One comprehensive Notebook vs. Multiple specialized Notebooks 

So you’re asked to work on daily sales data for a French bakery—croissants, baguettes, and all those delicious goodies. Would you put every analysis—customer behavior, seasonal trends, inventory levels—into one grand, all-encompassing Notebook? Or would you break it down into bite-sized notebooks, each catering to a specific question? Every time I start a new project, I question things like this. 

Strategies for deciding between one large Notebook and multiple smaller ones 

When initiating a data science project, the scope of your analysis can significantly influence whether you opt for a single, all-encompassing notebook or multiple specialized ones. 

It often makes sense to split the work into several notebooks for projects covering various topics or analyses. This way, each Notebook can stay focused and more readily understandable, making it easier to collaborate with others and revisit the work later on. 

Another crucial factor is your audience. A series of focused notebooks might be more beneficial if your audience comprises experts looking for a deep dive into the data. On the other hand, if the audience is looking for a comprehensive overview, a single notebook that brings multiple analyses together might be more effective. 

There’s no straightforward answer to this dilemma. There are pros and cons to both focused and comprehensive notebooks. Let’s take a closer look at each option. 

The value of a focused Notebook 

Having a focused notebook is similar to following a well-organized recipe. It allows your readers, or even a future version of yourself, to navigate easily through the steps without getting sidetracked by irrelevant details. 

A well-defined objective paves a straightforward path from raw data to valuable insights, enhancing your work’s clarity and readability. Each cell and section is tailored to a specific role, which minimizes clutter and adds efficiency. 

This focus doesn’t just help the author; it also simplifies sharing and collaboration. With clear objectives, team members can quickly grasp the Notebook’s purpose, making it easier to either extend the existing work or offer constructive critiques. 

The advantages of a comprehensive Notebook 

While specialized notebooks offer modularity and focus, there are scenarios where a single, comprehensive Notebook may be more appropriate. 

For instance, when your analyses are deeply interconnected or when you aim to present a unified narrative, a single Notebook allows for seamless storytelling and data exploration. This centralized approach also minimizes the risk of duplicating data preparation steps across multiple files, thereby making your analysis more efficient.  

This table provides a quick overview of the advantages and disadvantages of how to focus your Notebooks, which can help you decide between choosing one large Notebook or multiple smaller ones. 

        One Big Notebook               Many Small Notebooks
 Pros   Centralized analysis           Enhanced clarity
        Easy to see the big picture    Better modularity
 Cons   Reduced clarity                Management overhead
        Performance issues             Redundancy in data prep

The need for adaptability 

It’s worth noting that the choice between one large Notebook and multiple smaller ones is rarely set in stone. As a project evolves, so too can its documentation. 

You may start with one Notebook and later find it beneficial to split the work into more specialized notebooks as the scope expands. Your Notebook’s structure should be flexible enough to cater to different project needs and audience expectations. 

Just remember the mantra: “Adaptable, Clear, and Purpose-Driven.” 

Best practices for keeping your Notebooks well organized 

Organizing your notebooks can be challenging, even with the previous explanations. To make it easier, here are some practical tips you can use: 

  1. Clearly define the objective at the start: Before you even begin coding, outline what you aim to achieve with the Notebook. A clear objective sets the stage for focused analyses and helps your audience quickly understand the Notebook’s purpose. 
  2. Limit tangential analyses: You may encounter exciting side routes as you dive into your data. While it’s tempting to go off on tangents, these can dilute the primary focus of your Notebook. If a tangential analysis starts to take on a life of its own, it may warrant a separate notebook. 
  3. Use a Table of Contents (TOC) or an index: Navigation can become a challenge in larger notebooks. Implementing a table of contents or an index can significantly improve the Notebook’s usability, helping you and your collaborators find relevant sections more efficiently.

By following these guidelines and deciding strategically between one large and multiple smaller ones, you can make your Jupyter Notebook projects more organized, focused, and collaborative. 

2. Mastering Markdown usage 

Now that you have chosen the scope of your Notebook, you need to pay attention to the format. 

Imagine you’ve crafted the perfect baguette. It’s not just the ingredients that matter, but also how they’re presented—the golden crust, the soft interior. Similarly, a well-structured Jupyter Notebook isn’t solely about the code and data. It’s also about presentation and narrative, which is where Markdown shines. 

The value of structured Notebooks 

Markdown helps you create structured notebooks by allowing you to include headers, lists, images, links, and even mathematical equations. These elements contribute to the Notebook’s readability and flow, making it easier for anyone to understand your work process. 

How Markdown text improves narrative and flow 

Think of the Markdown text as the storyline that weaves through your Notebook. It guides the reader through your thought process, explains the rationale behind code snippets, and adds context to your data visualizations. 

Keeping audience and purpose in mind 

A notebook aimed at technical experts might focus more on code and algorithms, but one intended for business stakeholders may benefit from more narrative explanations and visualizations. Markdown lets you tailor your Notebook to its intended audience. 

Demonstrating Markdown capabilities 

For our purposes, the following Markdown elements will be useful in creating an eye-popping narrative for your data. You can see how they render in the Notebook in the image below.

  • Headers: The use of ## generates a subheading, Average Revenue per Transaction, which serves as a navigational landmark in the Notebook. 
  • Anchor links: The <a class="anchor" id="avg-revenue-per-transaction"></a> is an HTML tag used to create a hyperlink anchor for more straightforward navigation within the Notebook. 
  • Math equations: The formula for the KPI is displayed using LaTeX notation enclosed within dollar signs $. This allows for a clear presentation of mathematical concepts. 
  • Inline code: The backticks around Average Revenue per Transaction set this particular text apart, typically indicating that it refers to a code element or technical term.

This example offers just a snapshot of what Markdown can do. The possibilities for enriching your Notebook with Markdown are extensive, enabling you to turn a collection of code cells into a well-organized, compelling narrative. 

Making your Notebooks presentation-ready 

When we talk about a “presentation-ready” Jupyter Notebook, we’re referring to a notebook transcending the role of a mere draft or a playground for data tinkering. 

A presentation-ready notebook is polished, well-organized, and easy to follow, even for someone who didn’t participate in its creation. It should be able to “tell the story” on its own, meaning it can be handed off to a colleague, a stakeholder, or even your future self with minimal guidance – and still make complete sense. 

A presentation-ready notebook typically displays several key characteristics that set it apart from a rough draft: 

  1. Well-organized sections and subsections: The Notebook should be logically segmented into parts that guide the reader through your thought process. Each section should flow naturally into the next, like chapters in a book. 
  2. Descriptive comments and Markdown text: A good Notebook uses the markdown cells effectively to describe what each code cell is doing. It’s not just the code that speaks; the text around it elucidates why a particular analysis is essential, what the results signify, or why a specific coding approach was taken. 
  3. Effective use of visualizations: Charts, graphs, and other visual aids should not just be addressed as afterthoughts. Instead, they should be integral parts of the narrative, aiding in understanding the data and the points you are trying to convey. 
  4. Significance in data storytelling: A presentation-ready notebook is particularly vital for data storytelling. In many ways, your Notebook is the story—the narrative explaining your data-driven insights. 

Next, you’ll find a screenshot that exemplifies the use of Markdown to enhance the comprehension of data visualization. 

The screenshot shows a Notebook section where Markdown text provides a description and a plot analysis. This serves as a real-world demonstration of how Markdown can make visual data more meaningful and contribute to a presentation-ready Notebook.  

By focusing on structured organization, effective Markdown usage, and meaningful visualizations, you transform your Notebook from a mere collection of code into a powerful tool for both analysis and decision-making. The Markdown text, in particular, elevates your narrative by clarifying not just the “what” but also the “why” and “how,” adding depth to your data storytelling. 

3. Cell execution order discipline 

It’s great to have a well-organized and nicely presented notebook, but having one that runs smoothly is even better! 

Sequential flow and logic

When analyzing sales data for our cozy French bakery, we could hop between different cells to explore new ideas, debug issues, or revisit past analyses. 

One of the most unique features of Jupyter Notebooks is the ability to execute cells out of order, which is often a double-edged sword. 

While this non-linear execution gives Jupyter Notebooks their interactive power, it can also lead to confusion and errors if we’re not careful. This ability allows for greater flexibility and exploration but also opens the door to logical inconsistencies. 

Potential pitfalls of out-of-order cell execution 

Imagine you’re calculating the average sales of croissants for the last week. You execute a cell that deletes outliers, but then you return to a previous cell to adjust some parameters and re-run it. 

You’ve now got a Frankenstein dataset—part cleaned, part raw. This kind of scenario makes it extremely difficult to replicate your results or debug issues. 

Maintaining logical flow and data integrity in your Notebook 

Maintaining a logical progression in your Notebook is crucial for ensuring it remains a reliable tool for data analysis. 

Each cell should build upon the output of the previous one, creating a coherent flow of information and analysis. However, this comes with the risk of stale or overwritten variables. If you modify a variable in one cell but forget to update it in subsequent cells that rely on it, inconsistencies can occur, leading to misleading results. 

Let’s say you’re preparing a sales report and accidentally run the “total sales calculation” cell before updating the “monthly discount” cell. Your total sales figure ends up being incorrect, and you only realize it during a team meeting—embarrassing, right? 

A disciplined approach could have prevented this. To mitigate such risks, it is advisable to temporarily duplicate some of your data for testing purposes. This allows you to make changes and run tests without affecting the original variables, reducing the potential for errors and headaches. 
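The duplicate-for-testing idea can be as simple as an explicit copy. A minimal sketch (the sales figures are made up for illustration):

```python
import pandas as pd

# Hypothetical raw croissant sales, with one obvious outlier
raw = pd.DataFrame({"croissant_sales": [12, 15, 14, 250, 13]})

# Clean into an explicit copy, so re-running earlier cells never sees
# a half-mutated "Frankenstein" dataset
clean = raw[raw["croissant_sales"] < 100].copy()
clean["croissant_sales"] = clean["croissant_sales"] * 1.1  # tweaks touch only the copy

print(len(raw), len(clean))  # 5 4
```

Because the original `raw` frame is never mutated, re-running the outlier-removal cell after adjusting parameters always starts from the same data.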

Restart and Run All: A best friend 

Make a habit of regularly restarting the kernel to clear the Notebook’s state. This helps in catching hidden dependencies or assumptions about prior cell executions. Also, use the “Run All” feature to verify that your Notebook flows logically from start to finish. Keep an eye on the execution count indicator to help track the order in which cells have been run. 

To reiterate this important message: maintaining discipline in cell execution order is not just good practice; it’s a necessity for creating reliable, shareable, and replicable notebooks. 

Whether working solo on a pet project or collaborating on a critical business analysis, disciplined execution ensures that your Notebook is as dependable as it is powerful. 

4. Optimized data loading and memory management 

Although optimizing data loading and memory management isn’t particularly relevant to our study of French bakery sales, it’s still important to be aware of the pitfalls that can arise from a lack of attention to these issues. 

Challenges of working with large datasets in Jupyter Notebooks 

Challenges abound when handling large datasets, as limited system memory and performance degradation can seriously hamper your work. 

Overlooking data size can slow your analysis, freeze your Notebook, or even crash the system. Being mindful of these factors is crucial; for example, attempting to load a 10GB dataset on a machine with only 8GB of RAM is a recipe for failure. Therefore, understanding and managing these challenges is integral to a smooth and productive workflow in Jupyter Notebooks. 

Efficient memory use techniques 

1. Data sampling: Working with representative subsets for exploratory analysis 

When initially exploring data, consider loading only a subset of rows and columns you need. This can be quickly done using the nrows and usecols parameters in pandas.read_csv() or similar functions in other data-loading libraries. 

# Example: Load only first 1000 rows and selected columns
df_sample = pd.read_csv(
    "bakery sales.csv", nrows=1000, usecols=["date", "time", "article"]
)

For preliminary analyses, sometimes working with a representative sample of your data can be as good as using the entire dataset. This also significantly reduces memory usage. 

# Example: Random sampling of 10% of the dataset
df_sample = df.sample(frac=0.1)

2. Monitoring memory usage with functions like pandas.DataFrame.info()

Using pandas.DataFrame.info() provides a detailed summary of a DataFrame, including its memory usage, which helps you keep an eye on computational resources. 
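For example (with a small hypothetical frame), passing memory_usage="deep" makes pandas measure the real size of string columns rather than report an estimate:

```python
import pandas as pd

df = pd.DataFrame({"article": ["croissant"] * 1000, "unit_price": [1.2] * 1000})

# Prints dtypes plus the memory footprint; memory_usage="deep" also
# measures the actual size of object (string) data instead of estimating it
df.info(memory_usage="deep")

# memory_usage() returns the per-column figures in bytes
print(df.memory_usage(deep=True))
```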

3. Using suitable data types

One of the easiest ways to reduce memory usage is by converting data types. For example, float64 can often be safely downcast to float32, and int64 can often become int32 or even smaller data types. For instance, we convert unit_price from string to float and then downcast it to the smallest possible float type. 

# 'unit_price' is in string format with the Euro symbol, and comma as
#  a decimal separator
# remove the Euro symbol
df_copy["unit_price"] = df_copy["unit_price"].str.replace(" €", "")
# replace comma with dot
df_copy["unit_price"] = df_copy["unit_price"].str.replace(",", ".")

# Downcasting unit_price column
df_copy["unit_price"] = pd.to_numeric(df_copy["unit_price"], downcast="float")

Also, columns with a low number of unique values (low cardinality) can be converted to the category data type. 

df_copy['article'] = df_copy['article'].astype('category')

Thanks to these downcasts, we’ve freed up storage, reducing the size of the DataFrame by over 50%, from 68.94 MB to 32.37 MB. 

4. Techniques like dimensionality reduction and aggregating data

Sometimes, you can reduce your data size by aggregating it at a higher level. For instance, if you have transaction-level data, summarizing it daily or weekly can significantly reduce data size without losing the overall trend information. 
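A quick sketch of this aggregation (the transaction-level frame below is hypothetical):

```python
import pandas as pd

# Hypothetical transaction-level data
transactions = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
        "quantity": [2, 3, 5],
    }
)

# Summarize at the daily level: far fewer rows, same overall trend
daily = transactions.groupby("date", as_index=False)["quantity"].sum()
print(daily)
```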

5. Deleting large variables to free up memory

If you’ve created large intermediate variables that are no longer needed, you can free up memory by deleting them using the del keyword in Python. 
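For instance (with a hypothetical intermediate frame), del drops the reference so the garbage collector can reclaim the memory:

```python
import gc

import pandas as pd

# A large intermediate frame we only need once
df_raw = pd.DataFrame({"x": range(1_000_000)})
df_clean = df_raw[df_raw["x"] % 2 == 0]

# Drop the reference and nudge the garbage collector to reclaim the memory
del df_raw
gc.collect()
```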

☝ Do you remember your best friend “Restart and Run All”? Because you’ll be running your notebooks a lot if you follow this advice, even a small reduction in execution time thanks to memory optimization can save you a lot of time in the long run. 

Effective data and memory management in Jupyter Notebooks is not merely a good-to-have but necessary for achieving a smooth workflow. Being judicious about what data to load, optimizing code to be memory-efficient, and systematically freeing up resources make for a more robust and hiccup-free data analysis experience. 

ℹ Mastering memory management is a discipline in itself, and what we’ve touched upon here is just the tip of the iceberg regarding the techniques and practices that can be employed. Check out this article on Python Memory Management for further information.

5. Refactor and create modules 

As we come to the end of this article, there is one more critical aspect to address for mastering Jupyter Notebook: refactoring and modularity. 

Role of modular programming in data science 

In data science, modular programming serves as a cornerstone for creating efficient and streamlined workflows. 

This approach is centered around compartmentalizing your code into distinct, functional modules, much like organizing a complex machine into its fundamental, working parts. It provides a two-fold advantage: First, it elevates the readability of your Notebook by organizing code into more understandable and navigable units. Second, it enriches the reusability of your code, enabling it to serve as a set of building blocks for future projects. 

Continuing our earlier discussion on efficient data handling and memory optimization, let’s take those principles further by incorporating them into a refactored and modular code design. 

I’d like to introduce the function optimize_memory, which serves as a practical example of code refactoring. This function encapsulates all the various techniques we’ve discussed for memory optimization into a single, reusable piece of code. 

Instead of manually applying data type conversions and downcasting in multiple places throughout your Notebook, you can now make a single call to optimize_memory. 

def optimize_memory(df, obj_cols_to_optimize=None, in_place=False):
    """Downcast numeric columns and convert selected low-cardinality
    object columns to 'category' to shrink a DataFrame's memory footprint."""
    # Avoid a mutable default argument
    if obj_cols_to_optimize is None:
        obj_cols_to_optimize = []

    if not in_place:
        df = df.copy()

    # Downcast integer columns
    int_cols = df.select_dtypes(include=["int"]).columns
    df[int_cols] = df[int_cols].apply(pd.to_numeric, downcast="integer")

    # Downcast float columns
    float_cols = df.select_dtypes(include=["float"]).columns
    df[float_cols] = df[float_cols].apply(pd.to_numeric, downcast="float")

    # Convert specified object columns to the 'category' dtype
    for col in obj_cols_to_optimize:
        # Check that the column exists and is actually of object dtype
        if col in df.columns and df[col].dtype == "object":
            df[col] = df[col].astype("category")

    return None if in_place else df


df_memory_optimized = optimize_memory(
    df_copy, obj_cols_to_optimize=["article"], in_place=False
)

Creating and organizing custom modules for notebooks 

While many notebooks commence as a straightforward list of code cells, the truly effective ones break free from this basic form. 

By leveraging custom Python modules, they encapsulate intricate tasks or repetitive actions. Take, for example, our optimize_memory function. This function could be an excellent candidate for exporting to a custom module. By doing so, you can effortlessly integrate it into your Notebook whenever needed, akin to slotting a new gear into a well-oiled machine. 

This level of organization keeps your main document streamlined, allowing you to concentrate on high-level logic and data interpretation rather than getting mired in the details of code syntax. 

Effortless integration with team collaboration in mind 

Incorporating custom modules like one containing optimize_memory is generally as simple as bringing in standard Python libraries. 

Simple commands like import my_module or from my_module import optimize_memory can help weave these external Python assets into the fabric of your Notebook. To ensure seamless collaboration, remember to document any external dependencies in a requirements.txt file. 
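A minimal sketch of this workflow (the file name my_module.py and the stub function body are illustrative only, standing in for a real helper module):

```python
import os
import sys
from pathlib import Path

# Write a tiny helper module next to the notebook (illustrative stub)
Path("my_module.py").write_text(
    "def optimize_memory(df):\n"
    "    # placeholder for the real downcasting logic\n"
    "    return df\n"
)

# Make sure the working directory is importable, then import as usual
sys.path.insert(0, os.getcwd())
from my_module import optimize_memory

result = optimize_memory([1, 2, 3])
print(result)  # [1, 2, 3]
```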

Modular notebooks are easier to decipher, more straightforward to extend, and thus ideal for team-based projects. A notebook arranged in this fashion transforms from a personal sandbox into an enterprise-ready tool that multiple individuals can comfortably use for various aspects of the project. This transformation naturally fosters better collaboration and makes the Notebook a more versatile asset in the data science toolkit. 

Refactoring and modularizing your Jupyter Notebook is like refining a good recipe. The result is something more efficient, shareable, and enjoyable. Both for you, the chef, and those lucky enough to sample your culinary (or data science) creations. 

Conclusion 

Recap of best practices 

Throughout this article, we’ve dived deep into best practices for Jupyter Notebooks, ranging from maintaining focus and clarity in your notebooks to mastering Markdown for enhanced readability. 

We’ve also discussed the importance of disciplined cell execution and how to handle data loading and memory management for smoother notebook operation. Lastly, we tackled the advantages of modular programming.  

The value of each practice in professional data science settings 

In a professional setting, these practices are not just recommendations but necessities. 

A well-organized and focused notebook ensures that your work is understandable and replicable, not just by you but by anyone on your team. Efficient memory management and modular code can dramatically speed up development time and make the maintenance of long-term projects more sustainable. 

A collaborative endeavor 

The ultimate goal of adhering to these best practices is to facilitate better teamwork and collaboration. 

In a shared environment, a disciplined approach to Notebook usage ensures that everyone can follow along, contribute, and derive value from what has been done. 

But let’s not forget that collaboration isn’t just about working well with others. It’s also about being kind to your future self. After all, future you will undoubtedly appreciate the meticulous organization and readability when revisiting your Notebook! 

As Jupyter Notebooks become increasingly central in data science and related fields, mastering these best practices is more critical than ever. If you’re using Jupyter Notebooks in an interview setting or simply collaborating on a project, these tips can be your secret weapon for efficient and practical analysis. 

Now that you’re equipped with these best practices, why not put them to the test? Experience the efficiency and clarity of a well-organized Jupyter Notebook during your next project or even in data science interviews. 

And there you have it! Best practices for Jupyter Notebooks that can elevate your data science projects from good to exceptional. Thank you for reading, and happy coding! 

Simon Bastide is a data scientist with a Ph.D. in Human Movement Sciences and nearly five years of experience. Specializing in human-centric data, he’s worked in both academic and corporate settings. He has also been an independent technical writer for the past two years, distilling complex data topics into accessible insights. Visit his website or connect with him on LinkedIn for collaborations and discussions.

5 Key Criteria to Consider When Hiring a Data Scientist
https://coderpad.io/blog/data-science/how-to-hire-a-data-scientist/ (Tue, 14 Feb 2023)

In the last decade, the field of data science has experienced massive growth. According to the Global Data Science Platform Market Research Report, the data science market is expected to grow from USD 95.3 billion in 2021 to USD 322.9 billion by 2026.

To keep up with this demand, businesses are turning to data scientists to drive their success. The role is highly sought after, with the LinkedIn U.S. Emerging Jobs Report naming data scientist the second-fastest-growing job. In this guide, we’ll explore the key factors to consider when hiring a data scientist, including the essential skills and the interview process.

Oh, by the way, we’ve put all of our articles and resources on data science interviews here. 

In a nutshell: the Data Scientist

A Data Scientist combines technical and analytical skills to explore, spot and solve complex problems and patterns in a company’s data. Long gone are the days of leaving this part of the business to the short-staffed IT team, a Data Scientist’s job, day in day out, is to plug away at all data leads.

Data Scientists are insanely curious and thrive when backed into tough problem solving corners. They need to stay one step ahead of analytical techniques (such as machine learning, the statistical learning algorithms used to extract intelligence from data) in order to collect and transform vast amounts of data into functioning formats to help business objectives.

What skills to look for in a Data Scientist?

The ideal candidate

The ideal Data Scientist candidate will have skills that fall under these umbrellas: mathematics/stats, engineering/programming and business/strategy.

Data Scientist skill set

Mathematics/stats

A junior candidate who has a degree in mathematics but no experience in a data science role will have a strong foundation in basic statistics – such as correlation and regression analysis – even if they don’t yet have the machine learning element of the role.

Someone with a more specific data science degree or even job experience in a similar role should also tick these boxes easily.

Engineering/programming

To hire a great Data Scientist, the second skill set to examine is their engineering and programming skills. Why? Simply, because a lot of skills that Data Scientists use day to day are the same as a software engineer or a programmer. For example, both roles rely heavily on knowledge of databases and how these are queried using SQL in order to pull data.

A Data Scientist’s role, in its simplest form, is all about taking and transforming data into something that is meaningful for the business. So, experience in engineering and programming shows that the candidate knows how to get the data and how to transform it (writing code). Looking out for skills in Python or R, to mention just two, means a tick in this box. A candidate may have more general software development skills such as Java, which still shows a solid programming foundation with the aptitude for taking on other languages more useful in data science.

Business/strategy

Last but not least, the candidate must a) know what the overall business goal and/or strategy is in relation to their work and b) be able to communicate their findings in a way that is easy to digest for teams within the company that don’t know anything about data science.

There is absolutely no business use in a Data Scientist plugging away on a certain project without checking in with the project lead and understanding the overall objective of their work. Strong communication skills are also something to look out for.

Which candidates to consider?

So, we’ve looked at the skills that a Data Scientist candidate should be able to display in one way or another, but what title do they present themselves under? Here are the top 4 titles that might be covering your next Data Scientist hire.

Data Scientist

An obvious one to start off with! If you’re hiring for a Data Scientist, the industry is old enough for you to attract, you’ve got it, Data Scientists.

Pros: they know the job!

Cons: depending on time spent in their current role, they might be lacking in good all-round experience. On the flip side, if they dazzle you, they’ll probably come with a hefty price tag.

Software developer/engineer

Software developer/engineer is the job role most closely linked to data science by the set of skills it requires. If this candidate’s CV is on your desk, look out for the transferable engineering/programming experience listed above.

Pros: a lot of similar skills and at a minimum, usually evidence that the candidate could easily pick up the needed statistics-focused languages.

Cons: are they 100% committed to shifting roles, learning new skills and tools and focusing on manipulating big data?

Machine learning tops the list of skills developers are interested in learning (our 2023 developer survey). As the world of artificial intelligence (AI) and machine learning jobs grows, this statistic shows that developers are keen to improve (and aware that they need to improve) their skills for working with data in this way. With a solid data science skill set already in place, developers and engineers can easily transition into Data Scientist roles.

Graduate

When we talked about mathematics and statistical skills, we had graduates in mind. Graduates fresh out of school/university with a degree in statistics or maths (or even physics or economics) could make for great Data Scientists.

Pros: they’re keen and mouldable with the freshest skills.

Cons: other than work experience placements, they’ll have never had to grind in a job before, working alongside lots of different people and having to pick up the jobs at the bottom of the pile for some time.

Business analyst

Business analysts will probably have a good looking CV and perform well in interviews. Their role in a business is to be able to seamlessly present data so their communication should be top notch.

Pros: they might already work using SQL, in which case they have a foundation to progress into the other needed Data Scientist skills.

Cons: they might not have what it takes. Do they have the intelligence to learn to program?

5 steps to assess a Data Scientist’s skills

You’ve highlighted some candidates via their profile and past job roles, now it’s time to assess whether they have the right skill set for the job or not. Here are 5 steps you should follow to check whether your candidate is a good fit.

Company overview

Kick off the interview by getting the candidate excited about their prospective company and data science team. Talk them through the size of the team, how different teams work alongside other departments and give them examples of previous projects. Ask them why they want to work for your company. A simple question that can throw out some interesting answers! Have they already worked in a data science team? How do they envisage their role in their new team?

Recent projects

This is an open question, which allows the candidate to display their communication skills along with their experience. Ask the candidate to talk you through a recent project. What the candidate focuses on and how they present their story is a mini preview of what they might be like in the role; strategy awareness and analytical skills can both be equally important.

Skills

It’s time to test the candidate’s skills.

Your candidate should have an ‘okay’ to strong knowledge of SQL – this is definitely something to test at the interview stage. How do their actual skills match up to what’s on their CV? One option is to use a simple test like the one below:

[Sample table: temperatures recorded in various countries]

Write a SQL query to create a table that shows, for each country, the value of the highest temperature in the country.
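One acceptable answer combines GROUP BY with MAX. The sketch below checks that query using Python’s built-in sqlite3 module; the table name and sample rows are hypothetical, since the original sample table isn’t reproduced here:

```python
import sqlite3

# Hypothetical schema standing in for the sample table above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temperatures (city TEXT, country TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO temperatures VALUES (?, ?, ?)",
    [
        ("Paris", "France", 34.5),
        ("Lyon", "France", 36.1),
        ("Berlin", "Germany", 33.0),
        ("Munich", "Germany", 31.2),
    ],
)

# Highest temperature per country
rows = conn.execute(
    "SELECT country, MAX(temperature) FROM temperatures "
    "GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('France', 36.1), ('Germany', 33.0)]
```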

Even better, though, would be to use an assessment tool geared specifically towards testing SQL (test sample here) and Data Science (test sample here) skills. With CoderPad Screen you cut right to the chase and effectively test the skills that your Data Scientist candidate should be able to display, with the tool holding your hand through the whole process, from assessment to results.

🔖 Related read: 2 Interview Questions for Vetting Data Science Candidates

Interpreting data

Interpreting data is at the heart of a Data Scientist’s role and arguably the most important element. Present the candidate with a set of data – for example, in a table or a graph – and assess how they talk you through it, how they process the different data clusters and what their next steps would be. Alarm bells should ring if a candidate really struggles with this step, as even though other skills are great within the role, being confident with data interpretation is a must.

Any questions?

Always end the interview by asking the candidate if they have any questions. They should jump in on varying topics, from job role to company culture. If they have nothing, it could ring alarm bells as to how interested they are in the position.

Let’s wrap up

That brings you to the end of our guide on how to hire a Data Scientist. The most important elements to leave with are these:

  • Skills are key. If you get the skill set (or ‘potential’ with certain skills) right with a hire, you shouldn’t go far wrong
  • A great Data Scientist could have been many different things before they walked through your door, so keep an open mind within your parameters
  • Work with an assessment tool so that you’re not in any doubt at the final stage of hiring

🔖 Related resource: Jupyter Notebook for realistic data science interviews

Enhance your data science interviews with Jupyter Notebook

CoderPad makes it easy to assess a data science candidate’s ability to communicate their data observations and conclusions with our new Jupyter Notebook-integrated pads. Check it out for yourself in the pad below:
