{"id":36339,"date":"2023-09-18T09:33:23","date_gmt":"2023-09-18T16:33:23","guid":{"rendered":"https:\/\/coderpad.io\/?p=36339"},"modified":"2023-11-13T07:42:47","modified_gmt":"2023-11-13T15:42:47","slug":"2-interview-questions-for-vetting-data-science-candidates","status":"publish","type":"post","link":"https:\/\/coderpad.io\/blog\/data-science\/2-interview-questions-for-vetting-data-science-candidates\/","title":{"rendered":"2 Interview Questions for Vetting Data Science Candidates"},"content":{"rendered":"\n<p>As artificial intelligence and machine learning continue to boom, it has become increasingly difficult to evaluate \u2013 or even locate \u2013 the ideal data science candidate.<\/p>\n\n\n\n<p>That difficulty can increase the time and money spent recruiting for these roles. To avoid wasting both, it is critical to select the right data scientist for your organization the first time; repeating the hiring process is a poor use of resources.<\/p>\n\n\n\n<p>Therefore, if you&#8217;re intent on hiring the best data scientist for your team, a thoughtful evaluation of the relevant data science competencies is essential.<\/p>\n\n\n\n<p>Most importantly, scrutinize candidates&#8217; statistical analysis capabilities and their expertise in data management and machine learning as those skills pertain to your team&#8217;s needs. 
Depending on your hiring criteria, this might entail an assessment of specialized competencies such as proficiency in Python or R, or more general skills like data visualization and predictive modeling.<\/p>\n\n\n\n<p>Equally important \u2013 given your company&#8217;s data architecture, databases, and cloud storage \u2013 is a deep understanding of big data technologies and data mining techniques, along with knowledge of data privacy and ethics.<\/p>\n\n\n\n<p>One of the best ways to gauge these skills in prospective employees is to have them work through collaborative data projects or case studies in a realistic environment. Devising insightful technical interview questions is therefore a crucial part of the interview process.<\/p>\n\n\n\n<p>In this post, we look at two data science interview questions that you can use to gauge the aptitude of your candidates. Although the questions are initially set in specific analytical frameworks, you can modify them to align with your own technical requirements \u2013 the principles are broad enough that the exact toolkit is not important.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\">\n<p> \ud83d\udd16&nbsp;<strong>Related resource<\/strong>: <a href=\"https:\/\/coderpad.io\/use-case\/jupyter-notebook-data-science-interview\/\">Jupyter Notebook for realistic data science interviews<\/a><\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">Question 1:&nbsp;Iris Exploratory Analysis<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Context<\/strong><\/h3>\n\n\n\n<p>The Iris dataset is a well-known, heavily studied dataset hosted for public use by the UCI Machine Learning Repository.<\/p>\n\n\n\n<p>The dataset includes three iris species with 50 samples each, along with several measurements of each flower. 
One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.<\/p>\n\n\n\n<p>The columns in this dataset are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>id<\/code><\/li>\n\n\n\n<li><code>sepal_length_cm<\/code><\/li>\n\n\n\n<li><code>sepal_width_cm<\/code><\/li>\n\n\n\n<li><code>petal_length_cm<\/code><\/li>\n\n\n\n<li><code>petal_width_cm<\/code><\/li>\n\n\n\n<li><code>class<\/code>&nbsp;<em>this is the species of Iris<\/em><\/li>\n<\/ul>\n\n\n\n<p>The sample CSV data looks like this:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"CSS\" data-shcb-language-slug=\"css\"><span><code class=\"hljs language-css shcb-wrap-lines\"><span class=\"hljs-selector-tag\">sepal_length_cm<\/span>,<span class=\"hljs-selector-tag\">sepal_width_cm<\/span>,<span class=\"hljs-selector-tag\">petal_length_cm<\/span>,<span class=\"hljs-selector-tag\">petal_width_cm<\/span>,<span class=\"hljs-selector-tag\">class<\/span>\n5<span class=\"hljs-selector-class\">.1<\/span>,3<span class=\"hljs-selector-class\">.5<\/span>,1<span class=\"hljs-selector-class\">.4<\/span>,0<span class=\"hljs-selector-class\">.2<\/span>,<span class=\"hljs-selector-tag\">Iris-setosa<\/span>\n7<span class=\"hljs-selector-class\">.0<\/span>,3<span class=\"hljs-selector-class\">.2<\/span>,4<span class=\"hljs-selector-class\">.7<\/span>,1<span class=\"hljs-selector-class\">.4<\/span>,<span class=\"hljs-selector-tag\">Iris-versicolor<\/span>\n5<span class=\"hljs-selector-class\">.8<\/span>,2<span class=\"hljs-selector-class\">.7<\/span>,5<span class=\"hljs-selector-class\">.1<\/span>,1<span class=\"hljs-selector-class\">.9<\/span>,<span class=\"hljs-selector-tag\">Iris-virginica<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">CSS<\/span> <span 
class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">css<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<h3 class=\"wp-block-heading\" id=\"markdown-directions\"><strong>Directions<\/strong><\/h3>\n\n\n\n<p>Using any analysis method you choose, either build a classifier or produce a data visualization that shows how the available data can be used to predict the species of iris.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Initial cell contents<\/strong><\/h4>\n\n\n\n<p>Use this starter code to access the Iris dataset in this pad. Feel free to use either pandas or native Python for your work.<\/p>\n\n\n\n<p>You may install additional packages by using <code>pip<\/code> in this Notebook&#8217;s terminal.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-2\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python shcb-wrap-lines\"><span class=\"hljs-keyword\">import<\/span> pandas <span class=\"hljs-keyword\">as<\/span> pd\n<span class=\"hljs-keyword\">import<\/span> pprint\n\n<span class=\"hljs-comment\"># Result as pandas data frame<\/span>\n\nresult_df = pd.read_csv(<span class=\"hljs-string\">'iris.csv'<\/span>)\n\n<span class=\"hljs-comment\"># Preview results output as a data frame<\/span>\n\nresult_df.head()\n\n<span class=\"hljs-comment\"># Result as pythonic list of dictionaries<\/span>\n\nresult = result_df.where(pd.notnull(result_df), <span class=\"hljs-literal\">None<\/span>).to_dict(<span class=\"hljs-string\">'records'<\/span>)\n\n<span class=\"hljs-comment\"># Preview results output as a native list of dictionaries<\/span>\n\npprint.pprint(result)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-2\"><span class=\"shcb-language__label\">Code language:<\/span> <span 
class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<h3 class=\"wp-block-heading\"><strong>Success criteria<\/strong><\/h3>\n\n\n\n<p>At minimum, a candidate should be able to conduct a basic analysis showing that they explored the data and found a way to distinguish each species by its characteristics.<\/p>\n\n\n\n<p>For example:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Does one species of iris have longer petals than the others?<\/li>\n\n\n\n<li>Can the candidate pose questions about the dataset and explore the data for answers to those questions?<\/li>\n\n\n\n<li>Are the methods the candidate uses to explore the data reasonable? This question primarily requires some basic analysis and data visualization. A candidate who starts off with a more complex approach may miss the fast, early lessons the data offers, aka \u201clow-hanging fruit.\u201d<\/li>\n\n\n\n<li>Can the candidate support any observations with plots?<\/li>\n\n\n\n<li>How does the candidate form inferences from the data, and how well do they apply statistics to defend those inferences?<\/li>\n<\/ul>\n\n\n<section class=\"\n    text-image-hero-block\n    text-image-hero-block--align-right\n            text-image-hero-block--no-image\n    \"\ndata-block-name=\"text-image-hero-block\">\n\t<div class=\"inner\">\n\t\t<div class=\"the-content\">\n\t\t\t\t\t\t\t<h1 class=\"headline\">Try out this question<\/h1>\n\t\t\t\n\t\t\t\n\t\t\t\t\t\t\t<a href=\"https:\/\/embed.coderpad.io\/sandbox?question_id=257738&#038;use_question_button\" class=\"the-cta js-cta--%f0%9f%a7%91%f0%9f%92%bb-you-can-access-this-question-in-a-coderpad-sandbox-here\" data-ga-category=\"CTA\" data-ga-label=\"Primary|\ud83e\uddd1\u200d\ud83d\udcbb You can access this question in a CoderPad sandbox here.\" 
>\ud83e\uddd1\u200d\ud83d\udcbb You can access this question in a CoderPad sandbox here.<\/a>\n\t\t\t\n\t\t\t\t\t<\/div>\n\n\t\t\t<\/div>\n\n<\/section>\n\n\n\n<h2 class=\"wp-block-heading\">Question 2: Forecasting Future Grocery Store Sales<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Context<\/strong><\/h3>\n\n\n\n<p>This example question uses one of the <a href=\"https:\/\/www.kaggle.com\/competitions\/store-sales-time-series-forecasting\" target=\"_blank\" rel=\"noopener\">Getting Started competitions on Kaggle<\/a>. The goal is to forecast future store sales for Corporaci\u00f3n Favorita, a large Ecuador-based grocery retailer.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Data<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code><strong>train.csv<\/strong><\/code>: The training data, comprising time series of the features store_nbr, family, and onpromotion, as well as the target, sales<\/li>\n\n\n\n<li><code><strong>test.csv<\/strong><\/code>: The test data, which has the same features as the training data and covers the 15 dates after the training data ends. 
You must predict the target sales for the dates in this file<\/li>\n\n\n\n<li><code><strong>stores.csv<\/strong><\/code>: Store metadata, including city, state, type, and cluster (a grouping of similar stores)<\/li>\n\n\n\n<li><code><strong>oil.csv<\/strong><\/code>: Oil price data, included because Ecuador&#8217;s economy is susceptible to oil market volatility<\/li>\n\n\n\n<li><code><strong>holidays_events.csv<\/strong><\/code>: Data on holidays and events in Ecuador<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Directions<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You are expected to complete at least one time series analysis that predicts future sales.<\/li>\n\n\n\n<li>You are expected to show any data transformations and exploratory analysis.<\/li>\n\n\n\n<li>You have full flexibility to use the provided data as desired, but at minimum the date and sales numbers need to be used.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Initial cell contents<\/strong><\/h4>\n\n\n\n<p><em>Please review the context and data overview in the Instructions panel in this pad to gain a basic understanding of the available data and this exercise.<\/em><\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-3\" data-shcb-language-name=\"R\" data-shcb-language-slug=\"r\"><span><code class=\"hljs language-r shcb-wrap-lines\"><span class=\"hljs-comment\"># The following code loads useful libraries<\/span>\n\n<span class=\"hljs-comment\"># fpp3 provides out-of-the-box time series functions<\/span>\ninstall.packages(<span class=\"hljs-string\">'fpp3'<\/span>)\n<span class=\"hljs-keyword\">library<\/span>(fpp3)\n\n<span class=\"hljs-keyword\">library<\/span>(tsibble)\n<span class=\"hljs-keyword\">library<\/span>(tsibbledata)\n<span class=\"hljs-keyword\">library<\/span>(tidyverse)\n<span class=\"hljs-keyword\">library<\/span>(ggplot2)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-3\"><span 
class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">R<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">r<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-4\" data-shcb-language-name=\"R\" data-shcb-language-slug=\"r\"><span><code class=\"hljs language-r shcb-wrap-lines\"><span class=\"hljs-comment\"># Reading all the input datasets into memory<\/span>\n\ndf_train &lt;- read_csv(<span class=\"hljs-string\">\"\/home\/coderpad\/app\/store sales files\/train.csv\"<\/span>,show_col_types = <span class=\"hljs-literal\">FALSE<\/span>) %&gt;%\n  mutate(store_nbr = as.factor(store_nbr))\n\ndf_test &lt;- read_csv(<span class=\"hljs-string\">\"\/home\/coderpad\/app\/store sales files\/test.csv\"<\/span>,show_col_types = <span class=\"hljs-literal\">FALSE<\/span>) %&gt;%\n  mutate(store_nbr = as.factor(store_nbr))\n\ndf_stores &lt;- read_csv(<span class=\"hljs-string\">\"\/home\/coderpad\/app\/store sales files\/stores.csv\"<\/span>,show_col_types = <span class=\"hljs-literal\">FALSE<\/span>) %&gt;%\n  mutate(store_nbr = as.factor(store_nbr))\n\ndf_transactions &lt;- read_csv(<span class=\"hljs-string\">\"\/home\/coderpad\/app\/store sales files\/transactions.csv\"<\/span>,show_col_types = <span class=\"hljs-literal\">FALSE<\/span>)\n\ndf_oil &lt;- read_csv(<span class=\"hljs-string\">\"\/home\/coderpad\/app\/store sales files\/oil.csv\"<\/span>,show_col_types = <span class=\"hljs-literal\">FALSE<\/span>)\n\ndf_holidays_events &lt;- read_csv(<span class=\"hljs-string\">\"\/home\/coderpad\/app\/store sales files\/holidays_events.csv\"<\/span>,show_col_types = <span class=\"hljs-literal\">FALSE<\/span>)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-4\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">R<\/span> <span 
class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">r<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-5\" data-shcb-language-name=\"R\" data-shcb-language-slug=\"r\"><span><code class=\"hljs language-r shcb-wrap-lines\"><span class=\"hljs-comment\"># Show training data <\/span>\nhead(df_train)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-5\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">R<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">r<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/d2h1bfu6zrdxog.cloudfront.net\/wp-content\/uploads\/2023\/09\/image.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"415\" height=\"243\" src=\"https:\/\/d2h1bfu6zrdxog.cloudfront.net\/wp-content\/uploads\/2023\/09\/image.png\" alt=\"\" class=\"wp-image-36366\" srcset=\"https:\/\/coderpad.io\/wp-content\/uploads\/2023\/09\/image.png 415w, https:\/\/coderpad.io\/wp-content\/uploads\/2023\/09\/image-300x176.png 300w, https:\/\/coderpad.io\/wp-content\/uploads\/2023\/09\/image-18x12.png 18w\" sizes=\"auto, (max-width: 415px) 100vw, 415px\" \/><\/a><\/figure>\n<\/div>\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-6\" data-shcb-language-name=\"R\" data-shcb-language-slug=\"r\"><span><code class=\"hljs language-r shcb-wrap-lines\"><span class=\"hljs-comment\"># Example visual of total daily sales<\/span>\n\n<span class=\"hljs-comment\"># Convert the data frame into a tsibble object<\/span>\ntrain_tsbl &lt;- df_train %&gt;%\n  as_tsibble(key = c(store_nbr, family), index = date) %&gt;%\n  fill_gaps(.full = <span class=\"hljs-literal\">TRUE<\/span>)\n\ntrain_tsbl&#91;is.na(train_tsbl)] &lt;- <span 
class=\"hljs-number\">0<\/span>\n\n<span class=\"hljs-comment\"># aggregate data by stores<\/span>\ntrain_tsbl &lt;- train_tsbl %&gt;%\n  aggregate_key(store_nbr, sales = sum(sales))\n\noptions(repr.plot.width = <span class=\"hljs-number\">18<\/span>, repr.plot.height = <span class=\"hljs-number\">6<\/span>)\ntrain_tsbl %&gt;%\n  filter(is_aggregated(store_nbr)) %&gt;%\n  ggplot(aes(x = date, y = sales)) + \n   geom_line(aes(group=<span class=\"hljs-number\">1<\/span>), colour=<span class=\"hljs-string\">\"dark green\"<\/span>) +\n  labs(title = <span class=\"hljs-string\">\"Total Sales\"<\/span>)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-6\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">R<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">r<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/d2h1bfu6zrdxog.cloudfront.net\/wp-content\/uploads\/2023\/09\/image-1.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"645\" height=\"211\" src=\"https:\/\/d2h1bfu6zrdxog.cloudfront.net\/wp-content\/uploads\/2023\/09\/image-1.png\" alt=\"\" class=\"wp-image-36367\" srcset=\"https:\/\/coderpad.io\/wp-content\/uploads\/2023\/09\/image-1.png 645w, https:\/\/coderpad.io\/wp-content\/uploads\/2023\/09\/image-1-300x98.png 300w, https:\/\/coderpad.io\/wp-content\/uploads\/2023\/09\/image-1-18x6.png 18w\" sizes=\"auto, (max-width: 645px) 100vw, 645px\" \/><\/a><\/figure>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\"><strong>Success criteria<\/strong><\/h3>\n\n\n\n<p>At minimum, a candidate should be able to conduct a basic time series analysis showing that they explored the data, transformed it appropriately for a time series analysis, considered a confounding factor like seasonality, and interpreted results in a reasonably 
accurate way.<\/p>\n\n\n\n<p>For example:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Does the candidate know to address autocorrelated data?<\/li>\n\n\n\n<li>Does the candidate explore the data to find any transformations or cleanup needed ahead of the analysis?<\/li>\n\n\n\n<li>Can the candidate identify seasonal patterns in store sales?<\/li>\n\n\n\n<li>Is the candidate able to justify their analysis approach and conclusions?<\/li>\n<\/ul>\n\n\n\n<blockquote class=\"wp-block-quote\">\n<p>\ud83e\uddd1\u200d\ud83d\udcbb <b><a href=\"https:\/\/embed.coderpad.io\/sandbox?question_id=258632&amp;use_question_button\">You can access this question in a CoderPad sandbox here.<\/a><\/b><\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Assessing data scientists involves more than just scrutinizing their technical skills. To more accurately determine their fit for your team, we recommend consulting the supplementary interview guides found in the Related Posts section below.<\/p>\n\n\n\n<p><em>Some parts of this blog post were written with the assistance of ChatGPT.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Read on to find two curated technical interview questions designed to help you assess the expertise and problem-solving abilities of your data science 
candidates.<\/p>\n","protected":false},"author":12,"featured_media":36370,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[72],"tags":[],"persona":[27],"blog-programming-language":[37,38],"keyword-cluster":[],"class_list":["post-36339","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science"],"acf":[],"_links":{"self":[{"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/posts\/36339","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/comments?post=36339"}],"version-history":[{"count":68,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/posts\/36339\/revisions"}],"predecessor-version":[{"id":37336,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/posts\/36339\/revisions\/37336"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/media\/36370"}],"wp:attachment":[{"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/media?parent=36339"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/categories?post=36339"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/tags?post=36339"},{"taxonomy":"persona","embeddable":true,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/persona?post=36339"},{"taxonomy":"blog-programming-language","embeddable":true,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/blog-programming-language?post=36339"},{"taxonomy":"keyword-cluster","embeddable":true,"href":"https:\/\/coderpad.io\/wp-json\/wp\/v2\/keyword-cluster?post=36339"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templa
ted":true}]}}