Level: Hard
Description: To celebrate the end of the first season of #JustKNIMEIt, this challenge will be a fully open-ended one! We invite you to create a reusable component that handles any niche operation that you see fit! See our community component collections for inspiration: Best Practices, Winter 2021, Spring 2022, and Summer 2022.
Need more inspiration for this challenge? A component on regular expressions could be fun, but there are honestly no limits here.
Author: Victor Palacios
Note: As this is an open-ended challenge, we will not be supplying our own solution as we usually do. However, we recommend looking at some of these unique components made by our own KNIME team such as this OCR Python project or Gonhaddock's hypothesis testing node.
Level: Easy
Description: In this challenge you will play the role of educational journalist investigating “educational gag orders”. These are defined as “state legislative efforts to restrict teaching about topics such as race, gender, American history, and LGBTQ+ identities in K–12 and higher education.”
We encourage you to use the Google Authentication node, so start by making a copy of the original dataset, which can be found here.
This data needs to be copied into your own Google Sheets under the email account that you will use for the Google Authentication node. In the Google Sheets Reader node, select the sheet "All Legislation Introduced Since Jan 2021" to create a workflow that examines which states have the most gag orders and which topics are discussed most overall.
Author: Victor Palacios
Data: PEN America Index of Educational Gag Orders
Solution Summary: After accessing the “All Legislation Introduced Since Jan 2021” sheet, which was the focus of the challenge, we generated a bar chart to check which US state has the most gag orders (Missouri) and a tag cloud over the sheet’s descriptions to identify the most popular topics in discussion (e.g., schools, race, sex).
Solution Details: After entering credentials into the Google Authentication node, we used the Google Sheets Connection node to access all the sheets in the dataset for this challenge. Next, we used the Google Sheets Reader node to select the “All Legislation Introduced Since Jan 2021” sheet, which was the focus of this challenge. We then prepared two visualizations to tackle the tasks in the challenge. First, we used the Bar Chart node to check which US state has the most gag orders (Missouri). Second, we combined the Strings to Document node and the Keyword Search component to properly format the dataset’s descriptions and then compute their term co-occurrences, finally feeding this information into the Tag Cloud node to identify which topics were the most discussed (e.g., schools, race, sex).
Level: Medium
Description: You are a researcher studying how words are used and in what contexts. Given a target word, one of your tasks is to extract the 3 words that appear before and the 3 words that appear after it -- if applicable (some target words are not preceded by any word, for instance).
In this challenge, you will be given various examples that contain common traps for these kinds of searches, such as line breaks, capitalization issues, punctuation, and no preceding or following words. Your solution should extract words while (1) retaining capitalization, (2) being case insensitive, and (3) capturing multiple instances of the word depending on the window size. Please ignore misspellings. Bonus: Make a component so others can play with your creation!
Example input and output where "eggs" is the target word:
(INPUT) Eggs are great -> (OUTPUT) Eggs are great
(INPUT) I like eggz. -> (OUTPUT) ? (empty since no exact match)
(INPUT) I really do like eggs that (line break) are covered in ketchup. -> (OUTPUT) really do like eggs that are covered (line break can be retained if preferred)
(INPUT) I love eggs, but they need salt. -> (OUTPUT) I love eggs, but they need
(INPUT) Is KNIME secretly in love with EGGS??? -> (OUTPUT) in love with EGGS
Author: Victor Palacios
Data: Word Context Dataset in the KNIME Hub
Solution Summary: After reading the dataset containing the input strings, we continued by building a component to find the word windows. In the component’s configuration dialog, the target word and the window size can be defined. We then built a regular expression considering the two input variables, and then used this regex to find all the matching terms. The results are returned as an array. Finally, to better process the terms, we split the results array so that each term is in a single column cell.
Solution Details: After reading the input file using the Excel Reader node, we continued by building a customizable component to find the word windows. We started by adding two String Input nodes: the first allows you to define the target word in the configuration window of the component, and the second defines the window size. In our case, the target word is “eggs”, and the window size is 3. Next, we used the String Manipulation (Variable) node to build a regular expression to find the word windows. The target word variable and the window size variable are included as flow variables. We then sent the resulting regex to the Regex Find All component. This component finds and returns all the terms that match the given regex. The results are returned as an array. Finally, to better process the terms, we used the Split Collection Column node to split the results array so that each term is in a single column cell.
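For readers who want to prototype the same word-window idea outside KNIME, here is a minimal Python sketch. The pattern below is an illustrative assumption that mirrors the examples above; it is not the exact regular expression built by the String Manipulation (Variable) node.

import re

def word_windows(text, target="eggs", window=3):
    # Up to `window` words before and after each case-insensitive
    # occurrence of `target`, with the original casing retained.
    pattern = (rf"(?:\S+\s+){{0,{window}}}"     # preceding words
               rf"\b{re.escape(target)}\b\S*"   # the target word (plus attached punctuation)
               rf"(?:\s+\S+){{0,{window}}}")    # following words
    return [m.group(0).replace("\n", " ")
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

print(word_windows("I really do like eggs that\nare covered in ketchup."))
# ['really do like eggs that are covered']
print(word_windows("I like eggz."))
# []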
Level: Easy
Description: You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be an encoding issue with the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In most cases like this, you are not aiming for perfect removal of text, but instead are aiming for a cost-effective approach that eliminates a large chunk of the duplication. Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.
Author: Victor Palacios
Data: PDF in Swedish in the KNIME Hub
Solution Summary: After reading the downloaded PDF and performing optical character recognition (OCR) to ingest the data, we split it into lines and removed those that were duplicates. After that, we rejoined the remaining lines and obtained a cleaner text as a result.
Solution Details: We started our solution by reading and performing OCR with the Tika Parser node. Next, we used the Cell Splitter node to break the text into lines (one per column), using the period character ('.') as the delimiter. For a more convenient deduplication, we then used the Transpose node to turn the generated columns into rows. This allowed us to use the Duplicate Row Filter node to remove the redundant lines in the text. Finally, we used the GroupBy node to rejoin the remaining text back together.
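The same low-cost deduplication idea can be sketched in a few lines of Python. Splitting on the period character mirrors the Cell Splitter configuration described above; this is an illustration, not the workflow itself.

def deduplicate(text, sep="."):
    # Split the text into sentences, drop exact duplicates while keeping
    # the original order, and stitch the remaining sentences back together.
    seen, kept = set(), []
    for sentence in text.split(sep):
        key = sentence.strip()
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    return (sep + " ").join(kept) + sep

print(deduplicate("Hej världen. Hej världen. Detta är ett test."))
# Hej världen. Detta är ett test.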
Level: Medium
Description: KNIME has just released a new textbook on time series: to celebrate it, we will do a little time series analysis in this challenge. Check out the book (and discount!) here.
For this challenge you will perform time alignment on data that contains two types of gaps: regular gaps by the nature of the data, and irregular gaps. For example, daily stock market data regularly skips Saturdays and Sundays, and irregularly skips some other weekdays due to public holidays. Your concrete task for this challenge is to introduce the missing timestamps that correspond to weekdays into the given data, while omitting those for weekends. The data contains the daily exchange rates of US dollar vs Swiss franc from 1980 to 1998. Hint: Check out our verified components for time series analysis. Hint 2: For reference, if you get access to the book, check out Chapter 3 (Preparing Data for Time Series Analysis) which this challenge is based on.
Author: Maarit Widmann
Data: Daily Exchange Rates Data in the KNIME Hub
Solution Summary: In our solution, we started by extracting the year (week-based) and week associated with each date in the dataset. Next, we grouped the data by week and, for each group, we inserted sequential timestamps for those weekdays that were missing. The final dataset thus included (1) the original data on a daily granularity, and (2) missing entries for those weekdays that were not in the original data.
Solution Details: After reading the dataset for this challenge with the CSV Reader node, we transformed its dates into timestamps with the String to Date&Time node. Next, we extracted each date's year (week-based) and week with the Extract Date&Time Fields node. We then grouped this new dataset into weeks with the Group Loop Start node, and added missing weekday timestamps to one week at a time with the Timestamp Alignment component. Finally, we used the Loop End node to collect the time-aligned data.
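The core of this time alignment (reindexing daily data onto a weekday-only calendar so that missing weekdays appear as empty rows while weekends stay excluded) can also be sketched in pandas. The tiny table below is hypothetical and only illustrates the idea:

import pandas as pd

# Hypothetical daily exchange rates with an irregular gap on Tuesday, 1980-01-08.
rates = pd.DataFrame(
    {"USD_CHF": [1.58, 1.59, 1.57, 1.60]},
    index=pd.to_datetime(["1980-01-04", "1980-01-07", "1980-01-09", "1980-01-10"]),
)

# A business-day range contains weekdays only, so reindexing inserts the
# missing Tuesday as an empty row without adding Saturday or Sunday.
aligned = rates.reindex(pd.bdate_range(rates.index.min(), rates.index.max()))
print(aligned)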
Level: Hard
Description: In this challenge we will create a network graph using your LinkedIn data. Here are a few instructions on how to request your LinkedIn data, so that you can tackle this challenge:
If you do not have a LinkedIn account or any LinkedIn data, feel free to use the Excel sheet we supply below. Once you have your chosen data at hand, your task is to make networks that show where your contacts work, and link those contacts to a workplace node. A simple example would be:
Paolo ---- KNIME ----- Elizabeth
The above network graph shows that both of your contacts (Paolo and Elizabeth) work at KNIME. After the graph is built, visualize the top 3 companies where most of your contacts work. Hint: We used the Network Creator node and Object Inserter node to create our graph.
Author: Alida Brizzante
Data: LinkedIn Data in the KNIME Hub
Solution Details: Using the sample dataset available for this challenge, we created a component that derives its network graph, and creates network visualizations for the top 3 companies where most contacts in the dataset work (KNIME, Amazon, and Meta). As a first step, we removed missing data from the original dataset, and processed it to just keep the top 3 companies. Next, we created color codes for each company: they were later used to properly color the contacts in the graph. Upon joining the color codes and the cleaned sample dataset, we aggregated the “First Name” and “Last Name” columns into a single column for graph node identifiers (column “Concatenate”). As a last preprocessing step, we also removed emojis from our contact information. After that, using the Network Creator, Object Inserter, and Feature Inserter nodes, we created color-coded graphs based on our clean dataset, covering the top 3 companies. We then sent these graphs to instances of the Network Viewer node, through which it was possible to visualize the contact networks for the top 3 companies. Instances of the Text Output Widget node were also used to map the color codes onto their corresponding company names.
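As a rough, non-KNIME illustration of the underlying graph structure, the same contact-to-company network can be sketched with networkx (the third contact and company below are made up):

import networkx as nx

# Contacts and their employers; Paolo and Elizabeth mirror the example above.
contacts = [("Paolo", "KNIME"), ("Elizabeth", "KNIME"), ("Ana", "Amazon")]

G = nx.Graph()
for person, company in contacts:
    G.add_node(person, kind="contact")
    G.add_node(company, kind="company")
    G.add_edge(person, company)    # link each contact to a workplace node

# Rank companies by how many contacts are linked to them.
companies = [n for n, d in G.nodes(data=True) if d["kind"] == "company"]
print(sorted(companies, key=G.degree, reverse=True))
# ['KNIME', 'Amazon']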
Level: Medium
Description: One of your tasks at work is to train a model using sentences with the correct word context (i.e., words in a sentence following a meaningful and correct order). However, to train such a model, you also need to create a dataset of words used in an incorrect context. You can think of this task as a version of Negative Sampling - a neat technique for training the famous Word2Vec model. Concretely, in this challenge you will create a workflow that takes a sentence and scrambles the order of its words. You can create a small sample of sentences to test your work with the Table Creator node.
Input
I like cats.
Output
cats. like I
Hint: Our simple solution only uses 5 nodes, but the permutations are not exactly random. In contrast, our more complex solution uses more than 15 nodes and 2 loops, as well as the Random Numbers Generator node, to create truly scrambled sentences. Bonus: Create a solution with true randomization without using any loops.
Author: Victor Palacios
Solution Summary: We started by filtering our data so that only the first 100 data rows were considered and converted the input strings that were supposed to be randomized to documents. The solution contains two nested loops. The first loop iterates over the data rows so that one input row is processed at a time. For each document, a Bag of Words is created, and the total term count is obtained. Now, to introduce total randomization, a random number is assigned to each word. The words are then ordered based on the random number and ultimately concatenated back together into a sentence. This randomization part is encapsulated by the second loop and repeated n=10 times.
Solution Details: After reading a Kaggle Dataset with tweets from consumers to airlines, we began by filtering the data to the first 100 data rows using the Row Filter node. With the Strings To Document node we converted the tweets from strings to documents. The following Chunk Loop Start node marks the start of the first loop, which is responsible for processing one data row at a time. For each input row, we created a Bag of Words using the Bag Of Words Creator node, and we calculated the total number of terms using the GroupBy node. The value obtained by the GroupBy node is needed later in order to determine how many random values need to be created. The value is converted to a variable using the Table Row to Variable node. The following Counting Loop Start node marks the start of the second loop, where the number of iterations is defined as 10. Within the second loop, we created random values of arbitrary range using the Random Number Generator node. In the next step, the randomly created numbers needed to be assigned to a term. For that, we first used the RowID node to reset the row IDs of the table containing the random values. In addition, we converted the terms contained in the table that resulted from the Bag Of Words Creator node back to strings using the Term To String node. Then, both tables were combined using the Column Appender node. In the next step, we sorted the terms according to their randomly assigned value in ascending order using the Sorter node. Lastly, with the help of the GroupBy node we concatenated the terms back into a sentence. The following two Loop End nodes mark the end of the two loops, where the first Loop End node corresponds to the Counting Loop Start node, and the second Loop End node to the Chunk Loop Start node, respectively.
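Stripped of the looping machinery, the scrambling step boils down to assigning a random number to every token and sorting by it. A minimal Python sketch of that idea, for illustration only:

import random

def scramble(sentence, seed=None):
    # Assign a random key to every token and reorder by that key -- the same
    # idea as sorting a bag of words by randomly generated numbers.
    rng = random.Random(seed)
    tokens = sentence.split()
    return " ".join(sorted(tokens, key=lambda _: rng.random()))

print(scramble("I like cats."))    # prints the three tokens in a random order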
Level: Hard
Description: In America the prices of medical procedures can vary greatly, so savvy Americans tend to shop around for a good deal. In this challenge you will take the role of a data journalist trying to investigate price differences among medical procedures. More specifically, you want to find the top 5 procedures that show the largest variety in terms of pricing from hospital to hospital (in statistics, you would call this high standard deviation). The data you have at hand for this investigation is not uniform and requires inspection in order to be properly read and processed. For simplicity, compare all average charges for the 25 most common outpatient procedures performed by hospitals from the Kaiser Foundation Hospitals and Sutter Hospitals only. For Kaiser Foundation, the relevant data is in Excel files with "...Common25..." in their names; for Sutter, the relevant Excel files contain "...CDM_All..." in their names. Beware of hidden sheets.
Author: Victor Palacios
Data: Link for Medical Procedures Dataset
Solution Summary: We started our solution by filtering the spreadsheets in order to just consider those that concern Kaiser Foundation and Sutter hospitals. After extracting hospital names, we read their corresponding Excel sheets and processed the codes and prices of their medical procedures. Finally, we computed a few statistics for the price distribution of each procedure and picked the top 5 procedures with the largest standard deviation. They were laparoscopic cholecystectomy; excision, breast lesion; hernia repair, inguinal, 5 years and older; carpal tunnel surgery; and arthroscopy, knee, with meniscectomy (medial or lateral).
Solution Details: The first step in our solution was to read the spreadsheets in the unzipped dataset folder with the List Files/Folders node. Next, we used the Row Filter node to concentrate our analysis on Kaiser Foundation and Sutter hospitals. We then used the Path to String and String Manipulation nodes to extract hospital names: these were the input for the Read Excel Sheets component, and its output was the data for all the hospitals concatenated in a single data table. For readability, we used the Column Rename node to change column "A" to "procedure", column "B" to "code", and column "C" to "price". Next, to allow for a statistical analysis, we used the String to Number node to turn prices and codes into integers. We then used the Rule-based Row Filter, GroupBy, and Joiner nodes to remove rows with missing values while preserving all relevant information. The next step was then to turn each medical procedure into a different column, with rows corresponding to their prices at different hospitals. This shape modification facilitated our statistical analysis: we just had to use the Statistics node to calculate the standard deviation for the prices in each column (that is, the prices for each procedure). Finally, we used the Top k Selector node to pick the 5 procedures with the largest standard deviation.
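Once the prices are in a single long table, the statistical core of the task (standard deviation per procedure, then the top 5) is a small aggregation. A pandas sketch with hypothetical column names and values:

import pandas as pd

# Hypothetical long-format table: one row per (hospital, procedure, price).
charges = pd.DataFrame({
    "hospital":  ["Kaiser A", "Kaiser B", "Sutter A", "Kaiser A", "Sutter A"],
    "procedure": ["Carpal tunnel surgery"] * 3 + ["Hernia repair, inguinal"] * 2,
    "price":     [4200, 6100, 9800, 7300, 7500],
})

top5 = (charges.groupby("procedure")["price"]
               .std()                        # price spread across hospitals
               .sort_values(ascending=False)
               .head(5))
print(top5)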
Level: Easy
Description: A company buys their material from different manufacturers and stores all relevant information in one Excel file. In this challenge your goal is to change the structure of the Excel sheet so that as little information as possible is stored repeatedly. Use as few nodes as possible to change the structure of the Excel file to the output shown below (our solution has only 5 nodes, no scripting nodes, no components).
Input
2019 1 ABC comp Wood 141 148
2019 1 ABC comp Glass 144 134
...
Output
2019 1 ABC comp Glass 144 134 2019 1 ABC comp Wood 141 148
...
Author: Victor Palacios
Data: Excel table in the KNIME Hub
Solution Summary: After reading the Excel table, we tackled the proposed challenge by first removing rows with undefined prices ("..."). Next, we grouped the data by company and material and fixed their RowIDs in order to properly output the data.
Solution Details: To read the Excel table related to this challenge, we used the Excel Reader node. Next, we used the Row Filter node to remove rows with undefined prices. Finally, we used the Group Loop Start and Loop End (Column Append) nodes to cycle through the data, grouping rows by company and material and outputting them in the format specified above. Within the loop, we used the RowID node to relabel the rows' IDs, facilitating their concatenation in the right format.
Level: Easy
Description: What is this delicacy? For this challenge, you'll extract text from a web page, perform a keyword search, and create a tag cloud of the keywords. The solution should contain the tag cloud of the keywords and your guess of the recipe!
Hint: This challenge only requires 5 nodes in total (two are verified components for Text Processing).
Author: Maarit Widmann
Data: URL of the recipe in the KNIME Hub
Solution Summary: We started by reading the recipe URL and extracting the website's text. To ease further text processing, we converted the text string to document format. Then, we extracted the most relevant keywords from the website's text and their weights. Lastly, we visualized the keywords in a word cloud.
Solution Details: After reading the recipe URL with the CSV Reader node, we used the Web Text Scraper component to extract the website's text. The text was captured in one cell as a string. For further text processing, we required the data to be in document format, so we applied the String to Document node. In order to get the relevant keywords, we used the Keyword Search component, which extracts the most relevant English keywords in a corpus. The component also outputs each keyword's weight. Lastly, the view of the Tag Cloud node visualizes the word cloud of the most relevant keywords of the recipe, where each word's size represents its relevance.
Level: Easy
Description: For this challenge, you’ll compare the number of foreign students in different German states on a choropleth map of Germany. The output should contain an interactive view with a map showing the number of foreign students by German states, a widget to change the considered year, and a refresh button. What are the top 3 states with the most foreign students now? And in 2000?
Hint: Check out the verified components for visualization on the KNIME Hub.
Author: Ahmad Varasteh
Data: Student Data and List of State Abbreviations in the KNIME Hub
Solution Summary: After reading the student data, we joined it with the country codes dataset. We then accumulated the number of students by state and year. Lastly, we created an interactive component composite view that displays the number of students by state and allows for interactive year selection.
Solution Details: We started by reading the student data and the country codes. We joined both datasets on the state using the Joiner node. We did this to avoid erroneous visualization results later, as the Choropleth Map component works best when using state abbreviations. Then we used the Pivoting node to accumulate the number of students by state and year. Lastly, we created a component containing the Choropleth Map component to visualize the data by state, a Value Selection Widget node to select a year, and a Refresh Button Widget node to apply the selection.
Level: Easy
Description: You are a senior student in a US high school, and you are going to apply for college soon. The costs of undergraduate studies in the US can be pretty high, and you want to understand what factors influence the costs -- and how. To make an informed and financially responsible decision, you decided to perform some Exploratory Data Analysis and cost modeling. What conclusions do you reach?
Author: Jinwei Sun
Dataset: Cost of Undergraduate Studies Data in the KNIME Hub
Solution Summary: To tackle this challenge, we used a variety of plots to check how the average undergraduate cost in the US varies depending on different factors (state, year, type of school, etc.). We also used a Simple Regression Tree model to see how the cost values generalize per institution. The idea is for students to use this information to help determine what school they should attend.
Solution Details: In our solution, we start by grouping the cost data per state with the GroupBy node, then sorting it in descending order with the Sorter node, and then visualizing this information with the Bar Chart node. The most expensive region of the US in terms of undergraduate cost is DC. Next, we group the data again per year (GroupBy node) and use the Line Plot node to visualize the cost time series. The data shows that from 2013 to 2020 the cost of an undergraduate education has risen considerably, on average. We also use the Bar Chart node to visualize average undergraduate cost per school type, which indicates that Public In-State schools are the cheapest for a student to attend. Finally, we partition the data (Partitioning node), train and test a Simple Regression Tree model (Simple Regression Tree Learner and Predictor nodes), and assess its quality with the Numeric Scorer node. The attributes in the data explain about 95% of the variability in the test data (R-squared score of 0.952), suggesting that factors such as school type and state are great indicators of undergraduate cost in the US, and thus are very likely to help students make sound decisions as to where they go to college.
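Outside KNIME, the modeling part of this solution corresponds to fitting a regression tree and scoring it with R². The scikit-learn sketch below assumes a file name and column names that may differ from the actual dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

costs = pd.read_csv("undergraduate_costs.csv")        # assumed file name
X = pd.get_dummies(costs[["State", "Type", "Year"]])  # assumed predictor columns
y = costs["Cost"]                                     # assumed target column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("R^2 on the test split:", r2_score(y_test, model.predict(X_test)))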
Level: Easy
Description: In this challenge, you will do a little bit of logic building and produce the star triangle below. The output will be such that each row in the pattern will be in the corresponding row of the table.
P.S. do not hardcode the pattern.
*
**
***
****
*****
******
Bonus Challenge: Do not use scripting nodes or components.
Author: Ali Marvi
Solution Summary: We started by creating a table containing one white space character. We then looped over the table, iteratively adding star characters to the left of the string so that its length equals the current iteration value, and replaced the white space character with a star character. We collected all rows and lastly removed the first row from the table, as it is a duplicate.
Solution Details: We started by creating a table containing one white space character using the Table Creator node. We then looped over the table using a Counting Loop Start node and a Loop End node that collects the intermediate results row-wise. In the Counting Loop Start node, we set the number of loops to 7. Within the loop body, we used the String Manipulation node to build the star pattern: in each iteration, we added star characters to the left of the string so that its length equals the current iteration value (padLeft function) and replaced the white space character with a star character (replace function). Lastly, we used the Row Filter node to remove the first row from the table, as it is a duplicate.
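For comparison, the pattern-building logic can be expressed in a few lines of Python, just as a sketch of what the loop above computes:

# Each iteration pads the string with stars on the left until its length
# equals the iteration value, mirroring the padLeft/replace idea above.
row = ""
for i in range(1, 7):
    row = row.rjust(i, "*")
    print(row)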
Level: Easy
Description: You have a dataset containing information on US citizens who donated blood in the last year, including addresses and blood types. The O- blood type, also known as "universal donor", is perhaps the most valuable blood in the world because it can be transfused to nearly any blood type holder. Your goal here is to help a group of researchers find the number of citizens with O- blood type per US state. Unfortunately, the address column comes in a single line, so to extract the state information you will have to perform some data wrangling. They also asked you to create a choropleth map of the US to visualize the results.
Author: Ahmad Varasteh
Dataset: Synthetic Blood Type Data in the KNIME Hub
Solution Summary: After reading the dataset on rare blood types, we first extracted the state from the address line. We then removed the unnecessary address column and renamed the name and state columns for better readability. We then filtered the data to only keep samples where the blood type equals "O-" and grouped the data by state. Lastly, to visualize the data on a map, we used the Choropleth Map component.
Solution Details: After using the CSV Reader node to read the dataset on rare blood types, we used the Regex Split node to extract the state from the address line into a separate column called "split_0". With the help of the Table Manipulator node, we removed the address column and renamed the columns “name” and “split_0”. Then, we used the Row Filter node to only keep samples where the blood type equals "O-". To calculate the number of citizens per state we used the GroupBy node, grouped by state, and used Count as the aggregation method. Lastly, we used the Choropleth Map component to visualize the results, which produced an interactive map of the United States.
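The wrangling core (extract the state from a one-line address, keep only O- donors, count per state) can be sketched in pandas. The address pattern and column names below are assumptions about the data format, shown on made-up rows:

import pandas as pd

donors = pd.DataFrame({
    "name":       ["Ann Lee", "Bo Kim", "Cy Day"],
    "address":    ["12 Oak St, Austin, TX 78701",
                   "9 Elm Ave, Reno, NV 89501",
                   "4 Pine Rd, Dallas, TX 75201"],
    "blood_type": ["O-", "A+", "O-"],
})

# Assumed pattern: a two-letter state code sits right before the ZIP code.
donors["state"] = donors["address"].str.extract(r",\s*([A-Z]{2})\s+\d{5}", expand=False)
print(donors[donors["blood_type"] == "O-"].groupby("state").size())
# TX    2  (in this made-up sample)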
Level: Easy
Description: To wrap up our series of data classification challenges, consider again the following churning problem: a telecom company wants you to predict which customers are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. The target class to be predicted in the test data is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). You have already found a good model for the problem and have already engineered the training data to increase the performance a bit. Now, your task is to communicate the results you found visually. Concretely, build a dashboard that:
Author: Aline Bessa
Dataset: Training and Test Data in the KNIME Hub
Solution Summary: To tackle this challenge, we (1) created an interactive visualization to compare different performance metrics for both classes; (2) used the Global Feature Importance verified component to check which features were most indicative of churning for the test set; and (3) created a component based on the Local Explanation View verified component to better understand why our model generated a given false positive (the user did not churn but the model predicted churning) and a given false negative (the user churned but the model did not predict it).
Solution Details: After oversampling the minority class in the training data, and using AutoML to find a suitable ML model for our problem, we created a component to interactively compare different metrics (e.g., recall and precision) for classes Churn = 0 and Churn = 1. This comparison is visual and is based on the Bar Chart node. Next, we used the Global Feature Importance verified component to check which features were most indicative of churning — the top 2 were DayMins (total number of calling minutes during the day) and CustServ Calls (number of calls placed to customer service). Finally, we created a component to visualize local explanations for one false negative and one false positive example. These explanations were generated with the Local Explanation View verified component. In the case of the false negative example, the desired class was Churn = 1, and in our execution we noticed that the instance’s value for Night Calls (billed cost for nighttime calls) was likely responsible for the misclassification. As for the false positive, the desired class was Churn = 0 — here, the value for CustServ Calls probably contributed a lot to the misclassification.
Level: Hard
Description: In this challenge series, the goal is to predict which customers of a certain telecom company are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. Here, the target class to be predicted is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do).
After automatically picking a classification model for the task, you achieved an accuracy of about 95% for the test data, but the model does not perform uniformly for both classes. In fact, it is better at predicting when a customer will not churn (Churn = 0) than when they will (Churn = 1). This imbalance can be verified by looking at how precision and recall differ for these two classes, or by checking how the Cohen’s kappa metric is a bit lower than 80% despite a very high accuracy. How can you preprocess and re-sample the training data in order to make the classification a bit more powerful for class Churn = 1? Note 1: Need more help to understand the problem? Check this blog post out. Note 2: This problem is hard: do not expect to see a major performance increase for class Churn = 1. Also, verifying if the performance increase is statistically significant will not be trivial. Still... give this challenge your best try!
Author: Aline Bessa
Dataset: Training and Test Data in the KNIME Hub
Solution Summary: To tackle this challenge, we again relied on our AutoML component to pick a suitable model, and also played with different strategies to oversample the minority class using our SMOTE node. We were able to increase Cohen's kappa to 81.1% while maintaining basically the same accuracy we had before. The best solution we found again involved Gradient Boosted Trees -- this time, with 90 trees -- and using SMOTE with the 10 nearest neighbors to oversample the minority class. Note: We did not assess whether this improvement in Cohen's kappa is statistically significant.
Solution Details: After using two instances of the CSV Reader node to read both training and test datasets, we decided to use the SMOTE node to oversample the minority class in the training data. We then fed the oversampled data to the AutoML component and, just like last week, sent the best found model and the test data to the Workflow Executor node, which generated churn predictions. We then used the Scorer node to check how well the model performed over the test data. Since the SMOTE node has a parameter to control the number of nearest neighbors used to generate new samples, we decided to experiment with it. To this end, we used the Table Creator node to come up with a range of numbers of nearest neighbors, and then placed the body of this workflow (SMOTE/learning/predicting/scoring) in a loop that would generate solutions for these different numbers of nearest neighbors. The loop used the Table Row to Variable Loop Start and Loop End nodes. After executing this loop, we used the Top k Selector node to get the solution with the best Cohen's Kappa value, and then only kept informative columns associated with this solution by using the Column Filter node.
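As a rough non-KNIME analogue of this experiment, the sketch below loops over candidate neighbor counts for SMOTE (via imbalanced-learn) and scores a gradient boosting model with Cohen's kappa. File and column names are assumptions, and only numeric attributes are used for simplicity:

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import cohen_kappa_score

train = pd.read_csv("churn_train.csv")     # assumed file names
test = pd.read_csv("churn_test.csv")
X_train = train.drop(columns="Churn").select_dtypes("number")   # numeric attributes only
y_train = train["Churn"]
X_test, y_test = test[X_train.columns], test["Churn"]

kappas = {}
for k in (3, 5, 10, 15):                   # candidate nearest-neighbor counts for SMOTE
    X_res, y_res = SMOTE(k_neighbors=k, random_state=0).fit_resample(X_train, y_train)
    model = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)
    kappas[k] = cohen_kappa_score(y_test, model.predict(X_test))

best_k = max(kappas, key=kappas.get)       # neighbor count with the best Cohen's kappa
print(best_k, kappas[best_k])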
Level: Medium
Description: Just like in last week’s challenge, a telecom company wants you to predict which customers are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. One of your colleagues said that she was able to achieve a bit over 95% accuracy for the test data without modifying the training data at all, and using all given attributes exactly as they are. Again, the target class to be predicted is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). What model should you train over the training dataset to obtain this accuracy over the test dataset? Can this decision be automated? Note 1: A simple, automated solution to this challenge consists of a mix of 1 component and 4 nodes. Note 2: In this challenge, do not change the statistical distribution of any attribute or class in the datasets, and use all available attributes. Note 3: Need more help to understand the problem? Check this blog post out.
Author: Aline Bessa
Dataset: Training and Test Data in the KNIME Hub
Solution Summary: We used the AutoML verified component to experiment with a variety of classifiers (and hyperparameter values) in order to find the best suited model for this problem. The chosen model was Gradient Boosted Trees, in a technical tie with XGBoost Trees and H2O’s GBM. This model achieved an accuracy of 95.1% over the test data. Note: We would need more data than what we currently have to robustly verify if Gradient Boosted Trees is statistically better than the other models. Small variations on the training dataset, for example, could lead to a different model choice.
Solution Details: After reading the training and test datasets with two instances of the CSV Reader node, we used the AutoML verified component to detect the best suited model based on the training dataset at hand. Next, we used the Workflow Executor node to apply the chosen model over the test data. Finally, we used the Scorer node to check how well the model classified the test instances.
Level: Easy
Description: A telecom company wants you to predict which customers are going to churn (that is, are going to cancel their contracts) based on attributes of their accounts. To this end, you are expected to use a decision tree classifier. The company gives you two datasets (training and test), both with many attributes and the class ‘Churn’ to be predicted (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). You should train the decision tree classifier with the training data, and assess its quality over the test data (calculate the accuracy, precision, recall, and confusion matrix for example). Note 1: This challenge is a simple introduction to predictive problems, focusing on classification. You are expected to just apply a decision tree classifier (and get an accuracy of about 92%). A simple solution should consist of 5 nodes. Note 2: In this challenge, do not change the statistical distribution of any attribute or class in the datasets, and use all available attributes. Note 3: Need more help to understand the problem? Check this blog post out.
Author: Aline Bessa
Dataset: Training and Test Data in the KNIME Hub
Solution Summary: Using the learner-predictor paradigm, we trained a decision tree classifier over the training data and assessed its performance over the test data. When training the decision tree, we used Gini index as a metric for the quality of the decision tree, pruned it using the MDL method, and kept at least 6 records per node. By doing this, we achieved an accuracy of about 94%.
Solution Details: After reading the training and test datasets with two instances of the CSV Reader node, we used the Decision Tree Learner node to train a decision tree classifier, and the Decision Tree Predictor node to apply it over the test data in order to assess its performance. Finally, we used the Scorer node to check how well the model classified the test instances. Note: Decision tree models have a number of parameters that can be tuned in order to generate better models. We'll be discussing parameter tuning and model selection later in this series of challenges.
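For readers who want to reproduce the ballpark result outside KNIME, here is a scikit-learn sketch of the same learner-predictor pair. File and column names are assumptions, only numeric attributes are used, and scikit-learn does not offer MDL pruning, so the settings only roughly mirror the workflow:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

train = pd.read_csv("churn_train.csv")     # assumed file names
test = pd.read_csv("churn_test.csv")
X_train = train.drop(columns="Churn").select_dtypes("number")   # numeric attributes only
y_train = train["Churn"]
X_test, y_test = test[X_train.columns], test["Churn"]

# Gini criterion and a minimum of 6 records per leaf, roughly matching the
# Decision Tree Learner settings described above.
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=6, random_state=0)
tree.fit(X_train, y_train)

pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))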
Level: Easy
Description: In this challenge you will create visualizations to inspect different sexual orientation laws around the world, as reported by the State-Sponsored Homophobia report. The purpose of this report is to work as a useful tool for LGBTQIA+ human rights defenders. Here are a few questions that your visualizations should help answer:
1. In what decade did most countries decriminalize homosexuality?
2. Is the decriminalization of homosexuality becoming more common over time?
3. In what countries is same-sex marriage legal?
4. Is there a continent in which the legality of same-sex marriage is more common? And how about civil unions?
Author: Aline Bessa
Solution Summary: After reading the dataset on sexual orientation laws, we binned the temporal information associated with the decriminalization of homosexuality, and then grouped countries by period. This information was then used to analyze whether the decriminalization of homosexuality is becoming more common over time. The data indicates that the 20th and 21st centuries concentrate most of the decriminalizations -- in particular, the period from 1990 to 1999. We also used maps to better understand the current legality of same-sex marriages and civil unions. Both are still illegal in most of the world, and legality seems to be more common in Oceania and in the Americas.
Solution Details: After using the CSV Reader node to read the dataset for the challenge, we used the Rule Engine node to bin the decriminalization information per decade of the 20th century, and then used the GroupBy node to group countries by decriminalization period. Next, we used the Row Filter node to remove periods associated with no decriminalization, and then used the Top k Selector node to sort the grouped data: periods with more decriminalization ranked higher in this sorted dataset. We also used a sequence of Rule Engine and GroupBy nodes to group countries by larger periods of time -- the goal was to visualize less granular trends in decriminalization. Here, we needed to use the String to Number node in the beginning of the pipeline for formatting. Finally, we used the Column Expressions node to create different codes for countries, depending on the legality of same-sex marriages or civil unions in them. In the end, we sent these three different sources of information to the Visualize LGBTQIA+ Rights component, which implements bar plots for decriminalization periods and choropleth maps for the legality of same-sex marriages and civil unions across the world.
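The decade binning behind the first two questions can be mirrored in a couple of pandas lines; the column name and years below are made up for illustration:

import pandas as pd

laws = pd.DataFrame({"country": ["A", "B", "C", "D"],
                     "decriminalized": [1968, 1993, 1994, 2016]})

# Floor each year to its decade, then count countries per decade.
laws["decade"] = laws["decriminalized"] // 10 * 10
print(laws.groupby("decade")["country"].count().sort_values(ascending=False))
# in this toy sample, the 1990s come out on top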
Level: Easy
Description: You are interning for a travel agency that wants to know about their top spenders' eating habits. As a trial run, you are given a tiny dataset with 6647 rows containing information on 1011 unique citizens traveling with different purposes. In this challenge you will:
Note: Sometimes the ending balance is more than the starting balance. Assume this was a mistake when calculating money spent.
Author: Ahmad Varasteh
Dataset: Eating Out Data in the KNIME Hub
Solution Summary: Using date and time manipulation, we were able to calculate the time spent per activity as well as the amount spent. We then aggregated our data and filtered only those entries that were related to eating. We then took the top 10 spenders and the top 10 time takers and compared their ids with a join. The results of the join showed that eating a lot does not necessarily relate to spending a lot of money on it.
Solution Details: Critical to this analysis was to change the checkInTime and checkOutTime columns to the datetime type. To this end, we used the String to Date&Time node -- but note that we could also have changed the type of the columns within the CSV Reader node by clicking on the Transformation tab and then selecting the appropriate type. Once we had our data in a date format, we were able to use the Date&Time Difference node, which made it possible to calculate the difference between the check-in and check-out times. With the Math Formula node, we were able to subtract the balance columns to see how much money was spent on each visit. We noticed that some of these events led to monetary gains (which was probably a mistake, so we used the "abs" function to derive the absolute value of the spent amount). Next, we used the GroupBy node to get the sum of money and time spent per customer. Then we used the Nominal Value Row Filter to limit our data to eating events only. The Top k Selector node then allowed us to get a list of the top 10 money/time spenders. Finally, using the Joiner node we compared whether the participant ids matched. We noted there was little overlap, suggesting that spending more time eating does not necessarily mean that more money will be spent.
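The date arithmetic and the final overlap check can be sketched with pandas as follows; the checkInTime/checkOutTime names follow the description above, while the remaining column names and the file name are assumptions:

import pandas as pd

visits = pd.read_csv("eating_out.csv",                          # assumed file name
                     parse_dates=["checkInTime", "checkOutTime"])

# Time spent per visit, in hours.
visits["hours"] = (visits["checkOutTime"] - visits["checkInTime"]).dt.total_seconds() / 3600
# Some ending balances exceed the starting ones; treat that as a mistake and
# take the absolute difference, as in the workflow.
visits["spent"] = (visits["startingBalance"] - visits["endingBalance"]).abs()   # assumed columns

eating = visits[visits["purpose"] == "Eating"]                  # assumed column and value
per_person = eating.groupby("participantId")[["hours", "spent"]].sum()

top_time = set(per_person.nlargest(10, "hours").index)
top_money = set(per_person.nlargest(10, "spent").index)
print(top_time & top_money)   # a small overlap suggests eating longer != spending more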
Level: Hard
Description: You work for a hospital and they have data for each time a patient was seen. In this challenge, you will calculate the difference between each time a patient was seen excluding weekends (called "network days"). Once you calculate the network days, calculate the average network days per patient. For the challenge, experiment with the input and output below.
Input
Patient Date
Aline 11/01/2022
Aline 12/02/2022
Aline 25/02/2022
Aline 15/04/2022
Victor 05/02/2022
Victor 25/02/2022
Victor 15/03/2022
Victor 30/03/2022
Output
Patient Date Network Days Mean
Aline 11/01/2022 ? 23.333
Aline 12/02/2022 24 23.333
Aline 25/02/2022 10 23.333
Aline 15/04/2022 36 23.333
Victor 05/02/2022 ? 13.333
Victor 25/02/2022 15 13.333
Victor 15/03/2022 13 13.333
Victor 30/03/2022 12 13.333
Note: if you simply use the Date&Time Difference node, you will mix patient data/dates and will also end up counting weekends. Bonus Challenge: Create a solution without using loops.
Author: Victor Palacios
Solution Summary: First, we changed our strings to datetime format so that we could calculate temporal differences. To avoid loops, we then used lag columns to tag when our patients change. Afterwards, we used another lag column to calculate the differences between days. Next, we used a component found in the KNIME Forum that allowed us to calculate network days using math formulas. Finally, we calculated averages using the GroupBy node and then joined that with our data.
Solution Details: Transforming our data with the String to Date&Time node allows us to use operational nodes like Date&Time Difference. This was useful since the key to this challenge, if you wanted to avoid loops, was to make use of the Lag Column node. The Lag Column node generates a duplicate column that is off by one or more rows. This allows for direct comparison between one column's data and its previous data. That is, we were able to calculate the difference between when a patient was seen last by comparing their date seen and their next date seen. Not only that, but you can use the lag column to pinpoint when a patient changes using a Rule Engine node, which simply compares the current row with the lagged row. While some people used another programming language in their network day calculation, we found a component on the KNIME Forum which had all the mathematical rules for calculating network days. This component used several nodes like Date&Time Difference, Math Formula, and Rule Engine. To keep patient data from being mixed, we then used another Rule Engine node to remove data calculated across patients. With the GroupBy node we were then able to find the mean number of network days per patient. Finally, we joined the grouped data with our original data and used the Column Filter node to show only relevant columns in the final output.
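For reference, the loop-free arithmetic can be reproduced in a few lines of Python: lag the dates per patient, count weekdays between consecutive visits, and average per patient. The snippet uses the example table above; counting the weekdays in the closed interval between consecutive visits matches the values shown there.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Patient": ["Aline"] * 4 + ["Victor"] * 4,
    "Date": pd.to_datetime(
        ["11/01/2022", "12/02/2022", "25/02/2022", "15/04/2022",
         "05/02/2022", "25/02/2022", "15/03/2022", "30/03/2022"], dayfirst=True),
})

# Lag the dates within each patient (no loop needed), then count the weekdays
# in the closed interval between consecutive visits.
prev = df.groupby("Patient")["Date"].shift(1)
df["Network Days"] = [
    np.busday_count(p.date(), (d + pd.Timedelta(days=1)).date()) if pd.notna(p) else np.nan
    for p, d in zip(prev, df["Date"])
]
df["Mean"] = df.groupby("Patient")["Network Days"].transform("mean")
print(df)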
Level: Medium
Description: A survey was done to ask employees to rate themselves on a scale of 1-5 on different technologies and skills. We would like you to analyze this data and present it in a list where we are providing primary, secondary, and tertiary skills. Primary skills can be defined as skills rated 5; secondary skills rated 3 and 4; and tertiary skills rated 2. The final list should appear as it does on the KNIME forum where this challenge was suggested. Note: There may be cases in which employees have not rated themselves more than 3 on any of the skills (i.e., they have no primary skills).
Author: Jyotendra
Dataset: Skill Survey Data in the KNIME Hub
Solution Summary: We started our solution by restructuring the input dataset, turning the skill columns into values of a single column. Next, we filtered the data to only keep the primary skills (rated 5) for each employee. We did something similar for secondary (rated 3 or 4) and tertiary skills, and then joined this information at the end. This led to a dataset in which it is possible to quickly see all primary, secondary, and tertiary skills of each employee.
Solution Details: After reading the challenge’s dataset with the Excel Reader node, we used the Unpivoting node to turn each skill column into a value of a single column. This facilitated the grouping of skills per employee. To perform this grouping for primary skills, for example, we used (1) the Range Filter node to only keep the employees’ skills that were rated 5, (2) the GroupBy node to concatenate all employees’ primary skills into a single row, and then (3) the Column Rename node to rename the concatenated column as “Primary Skills”. We used the same sequence of nodes for secondary and tertiary skills, ending up with columns “Secondary Skills” and “Tertiary Skills” respectively. We then used two Joiner nodes to join all skill tables into a single dataset, such that all primary, secondary, and tertiary skills of each employee are listed in a single row.
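The unpivot-filter-group pattern maps directly onto a melt and groupby in pandas; the toy survey below is hypothetical and only illustrates the reshaping:

import pandas as pd

# Hypothetical survey: one column per skill, self-ratings from 1 to 5.
survey = pd.DataFrame({
    "Employee": ["Ana", "Ben"],
    "Python":   [5, 3],
    "SQL":      [4, 5],
    "Excel":    [2, 4],
})

ratings = survey.melt(id_vars="Employee", var_name="Skill", value_name="Rating")

def skills_between(lo, hi, label):
    # Concatenate, per employee, all skills whose rating falls in [lo, hi].
    subset = ratings[ratings["Rating"].between(lo, hi)]
    return subset.groupby("Employee")["Skill"].apply(", ".join).rename(label)

result = pd.concat([skills_between(5, 5, "Primary Skills"),
                    skills_between(3, 4, "Secondary Skills"),
                    skills_between(2, 2, "Tertiary Skills")], axis=1)
print(result)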
Level: Medium
Description: A common problem in mechanical and medical data is associating notes with a category. In this challenge, you will automate the process of sorting mechanical notes into their correct category. Here’s an example:
INPUT
--List of Categories--
1. Scratch
2. Crack
3. Defect
--Notes--
1. The product was defective.
2. A crack was caused by client.
3. Many scratches noted.
OUTPUT
Note Category
1. The product was defective. Defect
2. A crack was caused by client. Crack
3. Many scratches noted. Scratch
Don't worry about using fancy machine learning or natural language processing models. This problem can be handled reasonably well using a total of 5 nodes (simple solution), and a more refined solution involves just 8 nodes (complex solution). Also don't worry about getting 100% accuracy.
Author: Victor Palacios
Datasets: Datasets with Mechanical Notes and Categories in the KNIME Hub
Solution Summary: After reading the inspection notes and the categories, a simple solution consisted of running a similarity search between categories and inspection notes to find which of the former best corresponded to each of the latter. A more complex solution involved lowercasing both categories and notes to improve matching, and then running a regular expression matcher to find all categories that correspond to each note (instead of just one category). Note: We did not implement any spellchecking in our solution, which would further increase matching quality.
Solution Details: (Simple solution) After reading the inspection notes and categories with two Excel Reader nodes, we used the Similarity Search node to find the category that best matched each note. Next, we used the Joiner node to actually perform the match between most similar categories and notes, and then used the Column Filter node to remove temporary columns. (Complex Solution) We started by reading and lowercasing the inspection notes and categories with two Excel Reader and two String Manipulation nodes. Note that lowercasing both inputs was a cheap way of finding more matches between categories and notes. Next, we used the GroupBy node to create a concatenated regular expression containing all categories, and used the Table Row to Variable node to convert this regular expression into a variable. We then used the shared Regex Find All component to find all categories that corresponded to each inspection note, and finally used the Split Collection Column node to put each matched category into a separate column.
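The gist of the more complex matching (lowercase everything, then look for every category stem inside each note) fits in a few lines of Python. The stem matching below is an illustrative assumption, not the exact regular expression built in the workflow:

import re

categories = ["Scratch", "Crack", "Defect"]
notes = ["The product was defective.",
         "A crack was caused by client.",
         "Many scratches noted."]

# One alternation over the lowercased category stems, so "defective" matches
# "defect" and "scratches" matches "scratch".
pattern = re.compile("|".join(re.escape(c.lower()) for c in categories))

for note in notes:
    hits = {m.group(0).capitalize() for m in pattern.finditer(note.lower())}
    print(note, "->", ", ".join(sorted(hits)) or "?")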
Level: Medium
Description: Housing prices in America are a bit out of control. In this challenge, you will see how housing prices have changed through time. Here are the tasks for this challenge: