Level: Hard
Description: To celebrate the end of the first season of #JustKNIMEIt, this challenge will be a fully open-ended one! We invite you to create a reusable component that handles any niche operation that you see fit! See our community component collections for inspiration: Best Practices, Winter 2021, Spring 2022, and Summer 2022.
Need more inspiration for this challenge? A component on regular expressions could be fun, but there are honestly no limits here.
Author: Victor Palacios
Note: As this is an open-ended challenge, we will not be supplying our own solution as we usually do. However, we recommend looking at some of these unique components made by our own KNIME team such as this OCR Python project or Gonhaddock's hypothesis testing node.
Level: Easy
Description: In this challenge you will play the role of educational journalist investigating “educational gag orders”. These are defined as “state legislative efforts to restrict teaching about topics such as race, gender, American history, and LGBTQ+ identities in K–12 and higher education.”
We encourage you to use the Google Authentication node, so start by making a copy of the original dataset, which can be found here.
This data needs to be copied into your own Google Sheets under the email account that you will use for the Google Authentication node. In the Google Sheets Reader node, select the sheet "All Legislation Introduced Since Jan 2021" to create a workflow that examines which states have the most gag orders and which topics are discussed most overall.
Author: Victor Palacios
Data: PEN America Index of Educational Gag Orders
Solution Summary: After accessing the “All Legislation Introduced Since Jan 2021” sheet, which was the focus of the challenge, we generated a bar chart to check which US state has the most gag orders (Missouri) and a tag cloud over the sheet’s descriptions to identify the most popular topics in discussion (e.g., schools, race, sex).
Solution Details: After entering credentials into the Google Authentication node, we used the Google Sheets Connection node to access all the sheets in the dataset for this challenge. Next, we used the Google Sheets Reader node to select the “All Legislation Introduced Since Jan 2021” sheet, which was the focus of this challenge. We then prepared two visualizations to tackle the tasks in the challenge. First, we used the Bar Chart node to check which US state has the most gag orders (Missouri). Second, we combined the Strings to Document node and the Keyword Search component to properly format the dataset’s descriptions and then compute their term co-occurrences, finally feeding this information into the Tag Cloud node to identify which topics were the most discussed (e.g., schools, race, sex).
Level: Medium
Description: You are a researcher studying how words are used and in what contexts. Given a target word, one of your tasks is to extract the 3 words that appear before and the 3 words that appear after it -- if applicable (some target words are not preceded by any word, for instance).
In this challenge, you will be given various examples that contain common traps for these kinds of searches, such as line breaks, capitalization issues, punctuation, and no preceding or following words. Your solution should extract words while (1) retaining capitalization, (2) being case insensitive, and (3) capturing multiple instances of the word depending on the window size. Please ignore misspellings. Bonus: Make a component so others can play with your creation!
Example input and output where "eggs" is the target word:
(INPUT) Eggs are great -> (OUTPUT) Eggs are great
(INPUT) I like eggz. -> (OUTPUT) ? (empty since no exact match)
(INPUT) I really do like eggs that (line break) are covered in ketchup. -> (OUTPUT) really do like eggs that are covered (line break can be retained if preferred)
(INPUT) I love eggs, but they need salt. -> (OUTPUT) I love eggs, but they need
(INPUT) Is KNIME secretly in love with EGGS??? -> (OUTPUT) in love with EGGS
Author: Victor Palacios
Data: Word Context Dataset in the KNIME Hub
Solution Summary: After reading the dataset containing the input strings, we continued by building a component to find the word windows. In the component’s configuration dialog, the target word and the window size can be defined. We then built a regular expression considering the two input variables, and then used this regex to find all the matching terms. The results are returned as an array. Finally, to better process the terms, we split the results array so that each term is in a single column cell.
Solution Details: After reading the input file using the Excel Reader node, we continued by building a customizable component to find the word windows. We started by adding two String Input nodes: the first allows you to define the target word in the configuration window of the component, and the second defines the window size. In our case, the target word is “eggs”, and the window size is 3. Next, we used the String Manipulation (Variable) node to build a regular expression to find the word windows. The target word variable and the window size variable are included as flow variables. We then sent the resulting regex to the Regex Find All component. This component finds and returns all the terms that match the given regex. The results are returned as an array. Finally, to better process the terms, we used the Split Collection Column node to split the results array so that each term is in a single column cell.
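For readers who want to prototype the same word-window idea outside KNIME, here is a minimal Python sketch. The pattern below is an illustrative assumption that mirrors the examples above; it is not the exact regular expression built by the String Manipulation (Variable) node.

import re

def word_windows(text, target="eggs", window=3):
    # Up to `window` words before and after each case-insensitive
    # occurrence of `target`, with the original casing retained.
    pattern = (rf"(?:\S+\s+){{0,{window}}}"     # preceding words
               rf"\b{re.escape(target)}\b\S*"   # the target word (plus attached punctuation)
               rf"(?:\s+\S+){{0,{window}}}")    # following words
    return [m.group(0).replace("\n", " ")
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

print(word_windows("I really do like eggs that\nare covered in ketchup."))
# ['really do like eggs that are covered']
print(word_windows("I like eggz."))
# []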
Level: Easy
Description: You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be an encoding issue with the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In most cases like this, you are not aiming for perfect removal of text, but instead are aiming for a cost-effective approach that eliminates a large chunk of the duplication. Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.
Author: Victor Palacios
Data: PDF in Swedish in the KNIME Hub
Solution Summary: After reading the downloaded PDF and performing optical character recognition (OCR) to ingest the data, we split it into lines and removed those that were duplicates. After that, we rejoined the remaining lines and obtained a cleaner text as a result.
Solution Details: We started our solution by reading and performing OCR with the Tika Parser node. Next, we used the Cell Splitter node to break the text into lines (one per column), using the period character ('.') as the delimiter. For a more convenient deduplication, we then used the Transpose node to turn the generated columns into rows. This allowed us to use the Duplicate Row Filter node to remove the redundant lines in the text. Finally, we used the GroupBy node to rejoin the remaining text back together.
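The same low-cost deduplication idea can be sketched in a few lines of Python. Splitting on the period character mirrors the Cell Splitter configuration described above; this is an illustration, not the workflow itself.

def deduplicate(text, sep="."):
    # Split the text into sentences, drop exact duplicates while keeping
    # the original order, and stitch the remaining sentences back together.
    seen, kept = set(), []
    for sentence in text.split(sep):
        key = sentence.strip()
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    return (sep + " ").join(kept) + sep

print(deduplicate("Hej världen. Hej världen. Detta är ett test."))
# Hej världen. Detta är ett test.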
Level: Medium
Description: KNIME has just released a new textbook on time series: to celebrate it, we will do a little time series analysis in this challenge. Check out the book (and discount!) here.
For this challenge you will perform time alignment on data that contains two types of gaps: regular gaps by the nature of the data, and irregular gaps. For example, daily stock market data regularly skips Saturdays and Sundays, and irregularly skips some other weekdays due to public holidays. Your concrete task for this challenge is to introduce the missing timestamps that correspond to weekdays into the given data, while omitting those for weekends. The data contains the daily exchange rates of US dollar vs Swiss franc from 1980 to 1998. Hint: Check out our verified components for time series analysis. Hint 2: For reference, if you get access to the book, check out Chapter 3 (Preparing Data for Time Series Analysis) which this challenge is based on.
Author: Maarit Widmann
Data: Daily Exchange Rates Data in the KNIME Hub
Solution Summary: In our solution, we started by extracting the year (week-based) and week associated with each date in the dataset. Next, we grouped the data by week and, for each group, we inserted sequential timestamps for those weekdays that were missing. The final dataset thus included (1) the original data on a daily granularity, and (2) missing entries for those weekdays that were not in the original data.
Solution Details: After reading the dataset for this challenge with the CSV Reader node, we transformed its dates into timestamps with the String to Date&Time node. Next, we extracted each date's year (week-based) and week with the Extract Date&Time Fields node. We then grouped this new dataset into weeks with the Group Loop Start node, and added missing weekday timestamps to one week at a time with the Timestamp Alignment component. Finally, we used the Loop End node to collect the time-aligned data.
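The core of this time alignment (reindexing daily data onto a weekday-only calendar so that missing weekdays appear as empty rows while weekends stay excluded) can also be sketched in pandas. The tiny table below is hypothetical and only illustrates the idea:

import pandas as pd

# Hypothetical daily exchange rates with an irregular gap on Tuesday, 1980-01-08.
rates = pd.DataFrame(
    {"USD_CHF": [1.58, 1.59, 1.57, 1.60]},
    index=pd.to_datetime(["1980-01-04", "1980-01-07", "1980-01-09", "1980-01-10"]),
)

# A business-day range contains weekdays only, so reindexing inserts the
# missing Tuesday as an empty row without adding Saturday or Sunday.
aligned = rates.reindex(pd.bdate_range(rates.index.min(), rates.index.max()))
print(aligned)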
Level: Hard
Description: In this challenge we will create a network graph using your LinkedIn data. Here are a few instructions on how to request your LinkedIn data, so that you can tackle this challenge:
If you do not have a LinkedIn account or any LinkedIn data, feel free to use the Excel sheet we supply below. Once you have your chosen data at hand, your task is to make networks that show where your contacts work, and link those contacts to a workplace node. A simple example would be:
Paolo ---- KNIME ----- Elizabeth
The above network graph shows that both of your contacts (Paolo and Elizabeth) work at KNIME. After the graph is built, visualize the top 3 companies where most of your contacts work. Hint: We used the Network Creator node and Object Inserter node to create our graph.
Author: Alida Brizzante
Data: LinkedIn Data in the KNIME Hub
Solution Details: Using the sample dataset available for this challenge, we created a component that derives its network graph, and creates network visualizations for the top 3 companies where most contacts in the dataset work (KNIME, Amazon, and Meta). As a first step, we removed missing data from the original dataset, and processed it to just keep the top 3 companies. Next, we created color codes for each company: they were later used to properly color the contacts in the graph. Upon joining the color codes and the cleaned sample dataset, we aggregated the “First Name” and “Last Name” columns into a single column for graph node identifiers (column “Concatenate”). As a last preprocessing step, we also removed emojis from our contact information. After that, using the Network Creator, Object Inserter, and Feature Inserter nodes, we created color-coded graphs based on our clean dataset, covering the top 3 companies. We then sent these graphs to instances of the Network Viewer node, through which it was possible to visualize the contact networks for the top 3 companies. Instances of the Text Output Widget node were also used to map the color codes onto their corresponding company names.
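As a rough, non-KNIME illustration of the underlying graph structure, the same contact-to-company network can be sketched with networkx (the third contact and company below are made up):

import networkx as nx

# Contacts and their employers; Paolo and Elizabeth mirror the example above.
contacts = [("Paolo", "KNIME"), ("Elizabeth", "KNIME"), ("Ana", "Amazon")]

G = nx.Graph()
for person, company in contacts:
    G.add_node(person, kind="contact")
    G.add_node(company, kind="company")
    G.add_edge(person, company)    # link each contact to a workplace node

# Rank companies by how many contacts are linked to them.
companies = [n for n, d in G.nodes(data=True) if d["kind"] == "company"]
print(sorted(companies, key=G.degree, reverse=True))
# ['KNIME', 'Amazon']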
Level: Medium
Description: One of your tasks at work is to train a model using sentences with the correct word context (i.e., words in a sentence following a meaningful and correct order). However, to train such a model, you also need to create a dataset of words used in an incorrect context. You can think of this task as a version of Negative Sampling - a neat technique for training the famous Word2Vec model. Concretely, in this challenge you will create a workflow that takes a sentence and scrambles the order of its words. You can create a small sample of sentences to test your work with the Table Creator node.
Input
I like cats.
Output
cats. like I
Hint: Our simple solution only uses 5 nodes, but the permutations are not exactly random. In contrast, our more complex solution uses more than 15 nodes and 2 loops, as well as the Random Numbers Generator node, to create truly scrambled sentences. Bonus: Create a solution with true randomization without using any loops.
Author: Victor Palacios
Solution Summary: We started by filtering our data so that only the first 100 data rows were considered and converted the input strings that were supposed to be randomized to documents. The solution contains two nested loops. The first loop iterates over the data rows so that one input row is processed at a time. For each document, a Bag of Words is created, and the total term count is obtained. Now, to introduce total randomization, a random number is assigned to each word. The words are then ordered based on the random number and ultimately concatenated back together into a sentence. This randomization part is encapsulated by the second loop and repeated n=10 times.
Solution Details: After reading a Kaggle Dataset with tweets from consumers to airlines, we began by filtering the data to the first 100 data rows using the Row Filter node. With the Strings To Document node we converted the tweets from strings to documents. The following Chunk Loop Start node marks the start of the first loop, which is responsible for processing one data row at a time. For each input row, we created a Bag of Words using the Bag Of Words Creator node, and we calculated the total number of terms using the GroupBy node. The value obtained by the GroupBy node is needed later in order to determine how many random values need to be created. The value is converted to a variable using the Table Row to Variable node. The following Counting Loop Start node marks the start of the second loop, where the number of iterations is defined as 10. Within the second loop, we created random values of arbitrary range using the Random Number Generator node. In the next step, the randomly created numbers needed to be assigned to a term. For that, we first used the RowID node to reset the row IDs of the table containing the random values. In addition, we converted the terms contained in the table that resulted from the Bag Of Words Creator node back to strings using the Term To String node. Then, both tables were combined using the Column Appender node. In the next step, we sorted the terms according to their randomly assigned value in ascending order using the Sorter node. Lastly, with the help of the GroupBy node we concatenated the terms back into a sentence. The following two Loop End nodes mark the end of the two loops, where the first Loop End node corresponds to the Counting Loop Start node, and the second Loop End node to the Chunk Loop Start node, respectively.
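Stripped of the looping machinery, the scrambling step boils down to assigning a random number to every token and sorting by it. A minimal Python sketch of that idea, for illustration only:

import random

def scramble(sentence, seed=None):
    # Assign a random key to every token and reorder by that key -- the same
    # idea as sorting a bag of words by randomly generated numbers.
    rng = random.Random(seed)
    tokens = sentence.split()
    return " ".join(sorted(tokens, key=lambda _: rng.random()))

print(scramble("I like cats."))    # prints the three tokens in a random order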
Level: Hard
Description: In America the prices of medical procedures can vary greatly, so savvy Americans tend to shop around for a good deal. In this challenge you will take the role of a data journalist trying to investigate price differences among medical procedures. More specifically, you want to find the top 5 procedures that show the largest variety in terms of pricing from hospital to hospital (in statistics, you would call this high standard deviation). The data you have at hand for this investigation is not uniform and requires inspection in order to be properly read and processed. For simplicity, compare all average charges for the 25 most common outpatient procedures performed by hospitals from the Kaiser Foundation Hospitals and Sutter Hospitals only. For Kaiser Foundation, the relevant data is in Excel files with "...Common25..." in their names; for Sutter, the relevant Excel files contain "...CDM_All..." in their names. Beware of hidden sheets.
Author: Victor Palacios
Data: Link for Medical Procedures Dataset
Solution Summary: We started our solution by filtering the spreadsheets in order to just consider those that concern Kaiser Foundation and Sutter hospitals. After extracting hospital names, we read their corresponding Excel sheets and processed the codes and prices of their medical procedures. Finally, we computed a few statistics for the price distribution of each procedure and picked the top 5 procedures with the largest standard deviation. They were laparoscopic cholecystectomy; excision, breast lesion; hernia repair, inguinal, 5 years and older; carpal tunnel surgery; and arthroscopy, knee, with meniscectomy (medial or lateral).
Solution Details: The first step in our solution was to read the spreadsheets in the unzipped dataset folder with the List Files/Folders node. Next, we used the Row Filter node to concentrate our analysis on Kaiser Foundation and Sutter hospitals. We then used the Path to String and String Manipulation nodes to extract hospital names: these were the input for the Read Excel Sheets component, and its output was the data for all the hospitals concatenated in a single data table. For readability, we used the Column Rename node to change column "A" to "procedure", column "B" to "code", and column "C" to "price". Next, to allow for a statistical analysis, we used the String to Number node to turn prices and codes into integers. We then used the Rule-based Row Filter, GroupBy, and Joiner nodes to remove rows with missing values while preserving all relevant information. The next step was then to turn each medical procedure into a different column, with rows corresponding to their prices at different hospitals. This shape modification facilitated our statistical analysis: we just had to use the Statistics node to calculate the standard deviation for the prices in each column (that is, the prices for each procedure). Finally, we used the Top k Selector node to pick the 5 procedures with the largest standard deviation.
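Once the prices are in a single long table, the statistical core of the task (standard deviation per procedure, then the top 5) is a small aggregation. A pandas sketch with hypothetical column names and values:

import pandas as pd

# Hypothetical long-format table: one row per (hospital, procedure, price).
charges = pd.DataFrame({
    "hospital":  ["Kaiser A", "Kaiser B", "Sutter A", "Kaiser A", "Sutter A"],
    "procedure": ["Carpal tunnel surgery"] * 3 + ["Hernia repair, inguinal"] * 2,
    "price":     [4200, 6100, 9800, 7300, 7500],
})

top5 = (charges.groupby("procedure")["price"]
               .std()                        # price spread across hospitals
               .sort_values(ascending=False)
               .head(5))
print(top5)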
Level: Easy
Description: A company buys their material from different manufacturers and stores all relevant information in one Excel file. In this challenge your goal is to change the structure of the Excel sheet so that as little information as possible is stored repeatedly. Use as few nodes as possible to change the structure of the Excel file to the output shown below (our solution has only 5 nodes, no scripting nodes, no components).
Input
2019 1 ABC comp Wood 141 148
2019 1 ABC comp Glass 144 134
...
Output
2019 1 ABC comp Glass 144 134 2019 1 ABC comp Wood 141 148
...
Author: Victor Palacios
Data: Excel table in the KNIME Hub
Solution Summary: After reading the Excel table, we tackled the proposed challenge by first removing rows with undefined prices ("..."). Next, we grouped the data by company and material and fixed their RowIDs in order to properly output the data.
Solution Details: To read the Excel table related to this challenge, we used the Excel Reader node. Next, we used the Row Filter node to remove rows with undefined prices. Finally, we used the Group Loop Start and Loop End (Column Append) nodes to cycle through the data, grouping rows by company and material and outputting them in the format specified above. Within the loop, we used the RowID node to relabel the rows' IDs, facilitating their concatenation in the right format.
Level: Easy
Description: What is this delicacy? For this challenge, you'll extract text from a web page, perform a keyword search, and create a tag cloud of the keywords. The solution should contain the tag cloud of the keywords and your guess of the recipe!
Hint: This challenge only requires 5 nodes in total (two are verified components for Text Processing).
Author: Maarit Widmann
Data: URL of the recipe in the KNIME Hub
Solution Summary: We started by reading the recipe URL and extracting the website's text. To ease further text processing, we converted the text string to document format. Then, we extracted the most relevant keywords from the website's text and their weights. Lastly, we visualized the keywords in a word cloud.
Solution Details: After reading the recipe URL with the CSV Reader node, we used the Web Text Scraper component to extract the website's text. The text was captured in one cell as a string. For further text processing, we required the data to be in document format, so we applied the String to Document node. In order to get the relevant keywords, we used the Keyword Search component, which extracts the most relevant English keywords in a corpus. The component also outputs each keyword's weight. Lastly, the view of the Tag Cloud node visualizes the word cloud of the most relevant keywords of the recipe, where each word's size represents its relevance.
Level: Easy
Description: For this challenge, you’ll compare the number of foreign students in different German states on a choropleth map of Germany. The output should contain an interactive view with a map showing the number of foreign students by German states, a widget to change the considered year, and a refresh button. What are the top 3 states with the most foreign students now? And in 2000?
Hint: Check out the verified components for visualization on the KNIME Hub.
Author: Ahmad Varasteh
Data: Student Data and List of State Abbreviations in the KNIME Hub
Solution Summary: After reading the student data, we joined it with the country codes dataset. We then accumulated the number of students by state and year. Lastly, we created an interactive component composite view that displays the number of students by state and allows for interactive year selection.
Solution Details: We started by reading the student data and the country codes. We joined both datasets on the state using the Joiner node. We did this to avoid erroneous visualization results later, as the Choropleth Map component works best when using state abbreviations. Then we used the Pivoting node to accumulate the number of students by state and year. Lastly, we created a component containing the Choropleth Map component to visualize the data by state, a Value Selection Widget node to select a year, and a Refresh Button Widget node to apply the selection.
Level: Easy
Description: You are a senior student in a US high school, and you are going to apply for college soon. The costs of undergraduate studies in the US can be pretty high, and you want to understand what factors influence the costs -- and how. To make an informed and financially responsible decision, you decided to perform some Exploratory Data Analysis and cost modeling. What conclusions do you reach?
Author: Jinwei Sun
Dataset: Cost of Undergraduate Studies Data in the KNIME Hub
Solution Summary: To tackle this challenge, we used a variety of plots to check how the average undergraduate cost in the US varies depending on different factors (state, year, type of school, etc.). We also used a Simple Regression Tree model to see how the cost values generalize per institution. The idea is for students to use this information to help determine what school they should attend.
Solution Details: In our solution, we start by grouping the cost data per state with the GroupBy node, then sorting it in descending order with the Sorter node, and then visualizing this information with the Bar Chart node. The most expensive region of the US in terms of undergraduate cost is DC. Next, we group the data again per year (GroupBy node) and use the Line Plot node to visualize the cost time series. The data shows that from 2013 to 2020 the cost of an undergraduate education has risen considerably, on average. We also use the Bar Chart node to visualize average undergraduate cost per school type, which indicates that Public In-State schools are the cheapest for a student to attend. Finally, we partition the data (Partitioning node), train and test a Simple Regression Tree model (Simple Regression Tree Learner and Predictor nodes), and assess its quality with the Numeric Scorer node. The attributes in the data explain about 95% of the variability in the test data (R-squared score of 0.952), suggesting that factors such as school type and state are great indicators of undergraduate cost in the US, and thus are very likely to help students make sound decisions as to where they go to college.
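Outside KNIME, the modeling part of this solution corresponds to fitting a regression tree and scoring it with R². The scikit-learn sketch below assumes a file name and column names that may differ from the actual dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

costs = pd.read_csv("undergraduate_costs.csv")        # assumed file name
X = pd.get_dummies(costs[["State", "Type", "Year"]])  # assumed predictor columns
y = costs["Cost"]                                     # assumed target column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("R^2 on the test split:", r2_score(y_test, model.predict(X_test)))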
Level: Easy
Description: In this challenge, you will do a little bit of logic building and produce the star triangle below. The output will be such that each row in the pattern will be in the corresponding row of the table.
P.S. do not hardcode the pattern.
*
**
***
****
*****
******
Bonus Challenge: Do not use scripting nodes or components.
Author: Ali Marvi
Solution Summary: We started by creating a table containing one white space character. We then looped over the table, iteratively adding star characters to the left of the string so that its length equals the current iteration value, and replaced the white space character with a star character. We collected all rows and lastly removed the first row from the table, as it is a duplicate.
Solution Details: We started by creating a table containing one white space character using the Table Creator node. We then looped over the table using a Counting Loop Start node and a Loop End node that collects the intermediate results row-wise. In the Counting Loop Start node, we set the number of loops to 7. Within the loop body, we used the String Manipulation node to build the star pattern: in each iteration, we added star characters to the left of the string so that its length equals the current iteration value (padLeft function) and replaced the white space character with a star character (replace function). Lastly, we used the Row Filter node to remove the first row from the table, as it is a duplicate.
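For comparison, the pattern-building logic can be expressed in a few lines of Python, just as a sketch of what the loop above computes:

# Each iteration pads the string with stars on the left until its length
# equals the iteration value, mirroring the padLeft/replace idea above.
row = ""
for i in range(1, 7):
    row = row.rjust(i, "*")
    print(row)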
Level: Easy
Description: You have a dataset containing information on US citizens who donated blood in the last year, including addresses and blood types. The O- blood type, also known as "universal donor", is perhaps the most valuable blood in the world because it can be transfused to nearly any blood type holder. Your goal here is to help a group of researchers find the number of citizens with O- blood type per US state. Unfortunately, the address column comes in a single line, so to extract the state information you will have to perform some data wrangling. They also asked you to create a choropleth map of the US to visualize the results.
Author: Ahmad Varasteh
Dataset: Synthetic Blood Type Data in the KNIME Hub
Solution Summary: After reading the dataset on rare blood types, we first extracted the state from the address line. We then removed the unnecessary address column and renamed the name and state columns for better readability. We then filtered the data to only keep samples where the blood type equals "O-" and grouped the data by state. Lastly, to visualize the data on a map, we used the Choropleth Map component.
Solution Details: After using the CSV Reader node to read the dataset on rare blood types, we used the Regex Split node to extract the state from the address line into a separate column called "split_0". With the help of the Table Manipulator node, we removed the address column and renamed the columns “name” and “split_0”. Then, we used the Row Filter node to only keep samples where the blood type equals "O-". To calculate the number of citizens per state we used the GroupBy node, grouped by state, and used Count as the aggregation method. Lastly, we used the Choropleth Map component to visualize the results, which produced an interactive map of the United States.
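The wrangling core (extract the state from a one-line address, keep only O- donors, count per state) can be sketched in pandas. The address pattern and column names below are assumptions about the data format, shown on made-up rows:

import pandas as pd

donors = pd.DataFrame({
    "name":       ["Ann Lee", "Bo Kim", "Cy Day"],
    "address":    ["12 Oak St, Austin, TX 78701",
                   "9 Elm Ave, Reno, NV 89501",
                   "4 Pine Rd, Dallas, TX 75201"],
    "blood_type": ["O-", "A+", "O-"],
})

# Assumed pattern: a two-letter state code sits right before the ZIP code.
donors["state"] = donors["address"].str.extract(r",\s*([A-Z]{2})\s+\d{5}", expand=False)
print(donors[donors["blood_type"] == "O-"].groupby("state").size())
# TX    2  (in this made-up sample)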
Level: Easy
Description: To wrap up our series of data classification challenges, consider again the following churning problem: a telecom company wants you to predict which customers are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. The target class to be predicted in the test data is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). You have already found a good model for the problem and have already engineered the training data to increase the performance a bit. Now, your task is to communicate the results you found visually. Concretely, build a dashboard that:
Author: Aline Bessa
Dataset: Training and Test Data in the KNIME Hub
Solution Summary: To tackle this challenge, we (1) created an interactive visualization to compare different performance metrics for both classes; (2) used the Global Feature Importance verified component to check which features were most indicative of churning for the test set; and (3) created a component based on the Local Explanation View verified component to better understand why our model generated a given false positive (the user did not churn but the model predicted churning) and a given false negative (the user churned but the model did not predict it).
Solution Details: After oversampling the minority class in the training data, and using AutoML to find a suitable ML model for our problem, we created a component to interactively compare different metrics (e.g., recall and precision) for classes Churn = 0 and Churn = 1. This comparison is visual and is based on the Bar Chart node. Next, we used the Global Feature Importance verified component to check which features were most indicative of churning — the top 2 were DayMins (total number of calling minutes during the day) and CustServ Calls (number of calls placed to customer service). Finally, we created a component to visualize local explanations for one false negative and one false positive example. These explanations were generated with the Local Explanation View verified component. In the case of the false negative example, the desired class was Churn = 1, and in our execution we noticed that the instance’s value for Night Calls (billed cost for nighttime calls) was likely responsible for the misclassification. As for the false positive, the desired class was Churn = 0 — here, the value for CustServ Calls probably contributed a lot to the misclassification.
Level: Hard
Description: In this challenge series, the goal is to predict which customers of a certain telecom company are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. Here, the target class to be predicted is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do).
After automatically picking a classification model for the task, you achieved an accuracy of about 95% for the test data, but the model does not perform uniformly for both classes. In fact, it is better at predicting when a customer will not churn (Churn = 0) than when they will (Churn = 1). This imbalance can be verified by looking at how precision and recall differ for these two classes, or by checking how the Cohen’s kappa metric is a bit lower than 80% despite a very high accuracy. How can you preprocess and re-sample the training data in order to make the classification a bit more powerful for class Churn = 1? Note 1: Need more help to understand the problem? Check this blog post out. Note 2: This problem is hard: do not expect to see a major performance increase for class Churn = 1. Also, verifying if the performance increase is statistically significant will not be trivial. Still... give this challenge your best try!
Author: Aline Bessa
Dataset: Training and Test Data in the KNIME Hub
Solution Summary: To tackle this challenge, we again relied on our AutoML component to pick a suitable model, and also played with different strategies to oversample the minority class using our SMOTE node. We were able to increase Cohen's kappa to 81.1% while maintaining basically the same accuracy we had before. The best solution we found again involved Gradient Boosted Trees -- this time, with 90 trees -- and using SMOTE with the 10 nearest neighbors to oversample the minority class. Note: We did not assess whether this improvement in Cohen's kappa is statistically significant.
Solution Details: After using two instances of the CSV Reader node to read both training and test datasets, we decided to use the SMOTE node to oversample the minority class in the training data. We then fed the oversampled data to the AutoML component and, just like last week, sent the best found model and the test data to the Workflow Executor node, which generated churn predictions. We then used the Scorer node to check how well the model performed over the test data. Since the SMOTE node has a parameter to control the number of nearest neighbors used to generate new samples, we decided to experiment with it. To this end, we used the Table Creator node to come up with a range of numbers of nearest neighbors, and then placed the body of this workflow (SMOTE/learning/predicting/scoring) in a loop that would generate solutions for these different numbers of nearest neighbors. The loop used the Table Row to Variable Loop Start and Loop End nodes. After executing this loop, we used the Top k Selector node to get the solution with the best Cohen's Kappa value, and then only kept informative columns associated with this solution by using the Column Filter node.
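As a rough non-KNIME analogue of this experiment, the sketch below loops over candidate neighbor counts for SMOTE (via imbalanced-learn) and scores a gradient boosting model with Cohen's kappa. File and column names are assumptions, and only numeric attributes are used for simplicity:

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import cohen_kappa_score

train = pd.read_csv("churn_train.csv")     # assumed file names
test = pd.read_csv("churn_test.csv")
X_train = train.drop(columns="Churn").select_dtypes("number")   # numeric attributes only
y_train = train["Churn"]
X_test, y_test = test[X_train.columns], test["Churn"]

kappas = {}
for k in (3, 5, 10, 15):                   # candidate nearest-neighbor counts for SMOTE
    X_res, y_res = SMOTE(k_neighbors=k, random_state=0).fit_resample(X_train, y_train)
    model = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)
    kappas[k] = cohen_kappa_score(y_test, model.predict(X_test))

best_k = max(kappas, key=kappas.get)       # neighbor count with the best Cohen's kappa
print(best_k, kappas[best_k])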
Level: Medium
Description: Just like in last week’s challenge, a telecom company wants you to predict which customers are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. One of your colleagues said that she was able to achieve a bit over 95% accuracy for the test data without modifying the training data at all, and using all given attributes exactly as they are. Again, the target class to be predicted is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). What model should you train over the training dataset to obtain this accuracy over the test dataset? Can this decision be automated? Note 1: A simple, automated solution to this challenge consists of a mix of 1 component and 4 nodes. Note 2: In this challenge, do not change the statistical distribution of any attribute or class in the datasets, and use all available attributes. Note 3: Need more help to understand the problem? Check this blog post out.
Author: Aline Bessa
Dataset: Training and Test Data in the KNIME Hub
Solution Summary: We used the AutoML verified component to experiment with a variety of classifiers (and hyperparameter values) in order to find the best suited model for this problem. The chosen model was Gradient Boosted Trees, in a technical tie with XGBoost Trees and H2O’s GBM. This model achieved an accuracy of 95.1% over the test data. Note: We would need more data than what we currently have to robustly verify if Gradient Boosted Trees is statistically better than the other models. Small variations on the training dataset, for example, could lead to a different model choice.
Solution Details: After reading the training and test datasets with two instances of the CSV Reader node, we used the AutoML verified component to detect the best suited model based on the training dataset at hand. Next, we used the Workflow Executor node to apply the chosen model over the test data. Finally, we used the Scorer node to check how well the model classified the test instances.
Level: Easy
Description: A telecom company wants you to predict which customers are going to churn (that is, are going to cancel their contracts) based on attributes of their accounts. To this end, you are expected to use a decision tree classifier. The company gives you two datasets (training and test), both with many attributes and the class ‘Churn’ to be predicted (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). You should train the decision tree classifier with the training data, and assess its quality over the test data (calculate the accuracy, precision, recall, and confusion matrix for example). Note 1: This challenge is a simple introduction to predictive problems, focusing on classification. You are expected to just apply a decision tree classifier (and get an accuracy of about 92%). A simple solution should consist of 5 nodes. Note 2: In this challenge, do not change the statistical distribution of any attribute or class in the datasets, and use all available attributes. Note 3: Need more help to understand the problem? Check this blog post out.
Author: Aline Bessa
Dataset: Training and Test Data in the KNIME Hub
Solution Summary: Using the learner-predictor paradigm, we trained a decision tree classifier over the training data and assessed its performance over the test data. When training the decision tree, we used Gini index as a metric for the quality of the decision tree, pruned it using the MDL method, and kept at least 6 records per node. By doing this, we achieved an accuracy of about 94%.
Solution Details: After reading the training and test datasets with two instances of the CSV Reader node, we used the Decision Tree Learner node to train a decision tree classifier, and the Decision Tree Predictor node to apply it over the test data in order to assess its performance. Finally, we used the Scorer node to check how well the model classified the test instances. Note: Decision tree models have a number of parameters that can be tuned in order to generate better models. We'll be discussing parameter tuning and model selection later in this series of challenges.
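For readers who want to reproduce the ballpark result outside KNIME, here is a scikit-learn sketch of the same learner-predictor pair. File and column names are assumptions, only numeric attributes are used, and scikit-learn does not offer MDL pruning, so the settings only roughly mirror the workflow:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

train = pd.read_csv("churn_train.csv")     # assumed file names
test = pd.read_csv("churn_test.csv")
X_train = train.drop(columns="Churn").select_dtypes("number")   # numeric attributes only
y_train = train["Churn"]
X_test, y_test = test[X_train.columns], test["Churn"]

# Gini criterion and a minimum of 6 records per leaf, roughly matching the
# Decision Tree Learner settings described above.
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=6, random_state=0)
tree.fit(X_train, y_train)

pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))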
Level: Easy
Description: In this challenge you will create visualizations to inspect different sexual orientation laws around the world, as reported by the State-Sponsored Homophobia report. The purpose of this report is to work as a useful tool for LGBTQIA+ human rights defenders. Here are a few questions that your visualizations should help answer:
1. In what decade did most countries decriminalize homosexuality?
2. Is the decriminalization of homosexuality becoming more common over time?
3. In what countries is same-sex marriage legal?
4. Is there a continent in which the legality of same-sex marriage is more common? And how about civil unions?
Author: Aline Bessa
Solution Summary: After reading the dataset on sexual orientation laws, we binned the temporal information associated with the decriminalization of homosexuality, and then grouped countries by period. This information was then used to analyze whether the decriminalization of homosexuality is becoming more common over time. The data indicates that the 20th and 21st centuries concentrate most of the decriminalizations -- in particular, the period from 1990 to 1999. We also used maps to better understand the current legality of same-sex marriages and civil unions. Both are still illegal in most of the world, and legality seems to be more common in Oceania and in the Americas.
Solution Details: After using the CSV Reader node to read the dataset for the challenge, we used the Rule Engine node to bin the decriminalization information per decade of the 20th century, and then used the GroupBy node to group countries by decriminalization period. Next, we used the Row Filter node to remove periods associated with no decriminalization, and then used the Top k Selector node to sort the grouped data: periods with more decriminalization ranked higher in this sorted dataset. We also used a sequence of Rule Engine and GroupBy nodes to group countries by larger periods of time -- the goal was to visualize less granular trends in decriminalization. Here, we needed to use the String to Number node in the beginning of the pipeline for formatting. Finally, we used the Column Expressions node to create different codes for countries, depending on the legality of same-sex marriages or civil unions in them. In the end, we sent these three different sources of information to the Visualize LGBTQIA+ Rights component, which implements bar plots for decriminalization periods and choropleth maps for the legality of same-sex marriages and civil unions across the world.
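The decade binning behind the first two questions can be mirrored in a couple of pandas lines; the column name and years below are made up for illustration:

import pandas as pd

laws = pd.DataFrame({"country": ["A", "B", "C", "D"],
                     "decriminalized": [1968, 1993, 1994, 2016]})

# Floor each year to its decade, then count countries per decade.
laws["decade"] = laws["decriminalized"] // 10 * 10
print(laws.groupby("decade")["country"].count().sort_values(ascending=False))
# in this toy sample, the 1990s come out on top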
Level: Easy
Description: You are interning for a travel agency that wants to know about their top spenders' eating habits. As a trial run, you are given a tiny dataset with 6647 rows containing information on 1011 unique citizens traveling with different purposes. In this challenge you will:
Note: Sometimes the ending balance is more than the starting balance. Assume this was a mistake when calculating money spent.
Author: Ahmad Varasteh
Dataset: Eating Out Data in the KNIME Hub
Solution Summary: Using date and time manipulation, we were able to calculate the time spent per activity as well as the amount spent. We then aggregated our data and filtered only those entries that were related to eating. We then took the top 10 spenders and the top 10 time takers and compared their ids with a join. The results of the join showed that eating a lot does not necessarily relate to spending a lot of money on it.
Solution Details: Critical to this analysis was to change the checkInTime and checkOutTime columns to the datetime type. To this end, we used the String to Date&Time node -- but note that we could also have changed the type of the columns within the CSV Reader node by clicking on the Transformation tab and then selecting the appropriate type. Once we had our data in a date format, we were able to use the Date&Time Difference node, which made it possible to calculate the difference between the check-in and check-out times. With the Math Formula node, we were able to subtract the balance columns to see how much money was spent on each visit. We noticed that some of these events led to monetary gains (which was probably a mistake, so we used the "abs" function to derive the absolute value of the spent amount). Next, we used the GroupBy node to get the sum of money and time spent per customer. Then we used the Nominal Value Row Filter to limit our data to eating events only. The Top k Selector node then allowed us to get a list of the top 10 money/time spenders. Finally, using the Joiner node we compared whether the participant ids matched. We noted there was little overlap, suggesting that spending more time eating does not necessarily mean that more money will be spent.
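The date arithmetic and the final overlap check can be sketched with pandas as follows; the checkInTime/checkOutTime names follow the description above, while the remaining column names and the file name are assumptions:

import pandas as pd

visits = pd.read_csv("eating_out.csv",                          # assumed file name
                     parse_dates=["checkInTime", "checkOutTime"])

# Time spent per visit, in hours.
visits["hours"] = (visits["checkOutTime"] - visits["checkInTime"]).dt.total_seconds() / 3600
# Some ending balances exceed the starting ones; treat that as a mistake and
# take the absolute difference, as in the workflow.
visits["spent"] = (visits["startingBalance"] - visits["endingBalance"]).abs()   # assumed columns

eating = visits[visits["purpose"] == "Eating"]                  # assumed column and value
per_person = eating.groupby("participantId")[["hours", "spent"]].sum()

top_time = set(per_person.nlargest(10, "hours").index)
top_money = set(per_person.nlargest(10, "spent").index)
print(top_time & top_money)   # a small overlap suggests eating longer != spending more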
Level: Hard
Description: You work for a hospital and they have data for each time a patient was seen. In this challenge, you will calculate the difference between each time a patient was seen excluding weekends (called "network days"). Once you calculate the network days, calculate the average network days per patient. For the challenge, experiment with the input and output below.
Input
Patient Date
Aline 11/01/2022
Aline 12/02/2022
Aline 25/02/2022
Aline 15/04/2022
Victor 05/02/2022
Victor 25/02/2022
Victor 15/03/2022
Victor 30/03/2022
Output
Patient Date Network Days Mean
Aline 11/01/2022 ? 23.333
Aline 12/02/2022 24 23.333
Aline 25/02/2022 10 23.333
Aline 15/04/2022 36 23.333
Victor 05/02/2022 ? 13.333
Victor 25/02/2022 15 13.333
Victor 15/03/2022 13 13.333
Victor 30/03/2022 12 13.333
Note: if you simply use the Date&Time Difference node, you will mix patient data/dates and will also end up counting weekends. Bonus Challenge: Create a solution without using loops.
Author: Victor Palacios
Solution Summary: First, we changed our strings to datetime format so that we could calculate temporal differences. To avoid loops, we then used lag columns to tag when our patients change. Afterwards, we used another lag column to calculate the differences between days. Next, we used a component found in the KNIME Forum that allowed us to calculate network days using math formulas. Finally, we calculated averages using the GroupBy node and then joined that with our data.
Solution Details: Transforming our data with the String to Date&Time node allows us to use operational nodes like Date&Time Difference. This was useful since the key to this challenge, if you wanted to avoid loops, was to make use of the Lag Column node. The Lag Column node generates a duplicate column that is off by one or more rows. This allows for direct comparison between one column's data and its previous data. That is, we were able to calculate the difference between when a patient was seen last by comparing their date seen and their next date seen. Not only that, but you can use the lag column to pinpoint when a patient changes using a Rule Engine node, which simply compares the current row with the lagged row. While some people used another programming language in their network day calculation, we found a component on the KNIME Forum which had all the mathematical rules for calculating network days. This component used several nodes like Date&Time Difference, Math Formula, and Rule Engine. To keep patient data from being mixed, we then used another Rule Engine node to remove data calculated across patients. With the GroupBy node we were then able to find the mean number of network days per patient. Finally, we joined the grouped data with our original data and used the Column Filter node to show only relevant columns in the final output.
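For reference, the loop-free arithmetic can be reproduced in a few lines of Python: lag the dates per patient, count weekdays between consecutive visits, and average per patient. The snippet uses the example table above; counting the weekdays in the closed interval between consecutive visits matches the values shown there.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Patient": ["Aline"] * 4 + ["Victor"] * 4,
    "Date": pd.to_datetime(
        ["11/01/2022", "12/02/2022", "25/02/2022", "15/04/2022",
         "05/02/2022", "25/02/2022", "15/03/2022", "30/03/2022"], dayfirst=True),
})

# Lag the dates within each patient (no loop needed), then count the weekdays
# in the closed interval between consecutive visits.
prev = df.groupby("Patient")["Date"].shift(1)
df["Network Days"] = [
    np.busday_count(p.date(), (d + pd.Timedelta(days=1)).date()) if pd.notna(p) else np.nan
    for p, d in zip(prev, df["Date"])
]
df["Mean"] = df.groupby("Patient")["Network Days"].transform("mean")
print(df)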
Level: Medium
Description: A survey was done to ask employees to rate themselves on a scale of 1-5 on different technologies and skills. We would like you to analyze this data and present it in a list where we are providing primary, secondary, and tertiary skills. Primary skills can be defined as skills rated 5; secondary skills rated 3 and 4; and tertiary skills rated 2. The final list should appear as it does on the KNIME forum where this challenge was suggested. Note: There may be cases in which employees have not rated themselves more than 3 on any of the skills (i.e., they have no primary skills).
Author: Jyotendra
Dataset: Skill Survey Data in the KNIME Hub
Solution Summary: We started our solution by restructuring the input dataset, turning the skill columns into values of a single column. Next, we filtered the data to only keep the primary skills (rated 5) for each employee. We did something similar for secondary (rated 3 or 4) and tertiary skills, and then joined this information at the end. This led to a dataset in which it is possible to quickly see all primary, secondary, and tertiary skills of each employee.
Solution Details: After reading the challenge’s dataset with the Excel Reader node, we used the Unpivoting node to turn each skill column into a value of a single column. This facilitated the grouping of skills per employee. To perform this grouping for primary skills, for example, we used (1) the Range Filter node to only keep the employees’ skills that were rated 5, (2) the GroupBy node to concatenate all employees’ primary skills into a single row, and then (3) the Column Rename node to rename the concatenated column as “Primary Skills”. We used the same sequence of nodes for secondary and tertiary skills, ending up with columns “Secondary Skills” and “Tertiary Skills” respectively. We then used two Joiner nodes to join all skill tables into a single dataset, such that all primary, secondary, and tertiary skills of each employee are listed in a single row.
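The unpivot-filter-group pattern maps directly onto a melt and groupby in pandas; the toy survey below is hypothetical and only illustrates the reshaping:

import pandas as pd

# Hypothetical survey: one column per skill, self-ratings from 1 to 5.
survey = pd.DataFrame({
    "Employee": ["Ana", "Ben"],
    "Python":   [5, 3],
    "SQL":      [4, 5],
    "Excel":    [2, 4],
})

ratings = survey.melt(id_vars="Employee", var_name="Skill", value_name="Rating")

def skills_between(lo, hi, label):
    # Concatenate, per employee, all skills whose rating falls in [lo, hi].
    subset = ratings[ratings["Rating"].between(lo, hi)]
    return subset.groupby("Employee")["Skill"].apply(", ".join).rename(label)

result = pd.concat([skills_between(5, 5, "Primary Skills"),
                    skills_between(3, 4, "Secondary Skills"),
                    skills_between(2, 2, "Tertiary Skills")], axis=1)
print(result)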
Level: Medium
Description: A common problem in mechanical and medical data is associating notes with a category. In this challenge, you will automate the process of sorting mechanical notes into their correct category. Here’s an example:
INPUT
--List of Categories--
1. Scratch
2. Crack
3. Defect
--Notes--
1. The product was defective.
2. A crack was caused by client.
3. Many scratches noted.
OUTPUT
Note Category
1. The product was defective. Defect
2. A crack was caused by client. Crack
3. Many scratches noted. Scratch
Don't worry about using fancy machine learning or natural language processing models. This problem can be handled reasonably well using a total of 5 nodes (simple solution), and a more refined solution involves just 8 nodes (complex solution). Also don't worry about getting 100% accuracy.
Author: Victor Palacios
Datasets: Datasets with Mechanical Notes and Categories in the KNIME Hub
Solution Summary: After reading the inspection notes and the categories, a simple solution consisted of running a similarity search between categories and inspection notes to find which of the former best corresponded to each of the latter. A more complex solution involved lowercasing both categories and notes to improve matching, and then running a regular expression matcher to find all categories that correspond to each note (instead of just one category). Note: We did not implement any spellchecking in our solution, which would further increase matching quality.
Solution Details: (Simple solution) After reading the inspection notes and categories with two Excel Reader nodes, we used the Similarity Search node to find the category that best matched each note. Next, we used the Joiner node to actually perform the match between most similar categories and notes, and then used the Column Filter node to remove temporary columns. (Complex Solution) We started by reading and lowercasing the inspection notes and categories with two Excel Reader and two String Manipulation nodes. Note that lowercasing both inputs was a cheap way of finding more matches between categories and notes. Next, we used the GroupBy node to create a concatenated regular expression containing all categories, and used the Table Row to Variable node to convert this regular expression into a variable. We then used the shared Regex Find All component to find all categories that corresponded to each inspection note, and finally used the Split Collection Column node to put each matched category into a separate column.
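The gist of the more complex matching (lowercase everything, then look for every category stem inside each note) fits in a few lines of Python. The stem matching below is an illustrative assumption, not the exact regular expression built in the workflow:

import re

categories = ["Scratch", "Crack", "Defect"]
notes = ["The product was defective.",
         "A crack was caused by client.",
         "Many scratches noted."]

# One alternation over the lowercased category stems, so "defective" matches
# "defect" and "scratches" matches "scratch".
pattern = re.compile("|".join(re.escape(c.lower()) for c in categories))

for note in notes:
    hits = {m.group(0).capitalize() for m in pattern.finditer(note.lower())}
    print(note, "->", ", ".join(sorted(hits)) or "?")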
Level: Medium
Description: Housing prices in America are a bit out of control. In this challenge, you will see how housing prices have changed through time. Here are the tasks for this challenge: