Prove your KNIME knowledge and practice your workflow building skills by solving our weekly challenges! Can you become one of our top Just KNIME It! KNinjas?
Level: Medium
Description: You are a data scientist working for a grocery store that focuses on wellness and health. One of your first tasks in your new job is to go over the grocery's inventory and find patterns in the items they sell, based on nutritional composition. This will help them assess if they need to tweak their offerings, and where, to match their ethos of wellness and health.
Beginner-friendly objectives: 1. Load and normalize the grocery data. 2. Cluster the data based on its numeric values using an unsupervised learning algorithm such as k-Means. 3. Denormalize the data after clustering it.
Intermediate-friendly objectives: 1. Visualize the clustering results using scatter plots and analyze the distribution of clusters. Use flow variables to dynamically control the scatterplot and enhance interactivity. 2. Perform dimensionality reduction using PCA to simplify the dataset while retaining essential information. 3. Visualize the results with scatterplots as well.
What patterns can you find? What recommendations and insights can you come up with based on these patterns?
Author: Aline Bessa
Dataset: Groceries Dataset in KNIME Community Hub
Solution Summary: The solution involves clustering the normalized data to find data groupings based on nutritional attributes. We then create two components for the visualization of results: one uses an important dimensionality reduction technique named PCA to project the data onto two dimensions of high variance, and the other implements an interactive scatterplot for users to check the clustered data using different nutritional attributes as axes.
Solution Details: The workflow starts with the CSV Reader node configured to read grocery data from a file named "food.csv". The Normalizer (PMML) node is used to apply Min-Max normalization to all numeric columns, scaling them between 0.0 and 1.0. Next, the k-Means node clusters the data into three groups using nutritional attributes, with centroids initialized from the first rows. The data is then denormalized to facilitate visualization and interpretation. In one component, the PCA node reduces the data to two dimensions of very high variance, retaining the original columns in the output. The Column Filter node retains only the PCA dimensions and cluster information for visualization, and an interactive scatter plot is created using the Scatter Plot (JavaScript) node, configured to display PCA results and clustering outcomes. In a second component, Single Selection Widget nodes allow users to pick two different nutrients to work as axes in a scatterplot of the data points, which are plotted in their assigned cluster color. The final steps of both components involve sorting and sampling the data to provide insights into the grocery items, with results displayed in Table View nodes for easy exploration.
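For readers who want to see the normalize, cluster, denormalize sequence outside of KNIME, here is a minimal pure-Python sketch. It is not the workflow itself: the nutrient rows are made up for illustration (they do not come from food.csv), the function names are hypothetical, and the k-means loop is a plain textbook version that only mirrors the "centroids initialized from the first rows" setting described above.

```python
# Hypothetical nutrient rows (calories, fat_g, protein_g); NOT taken from food.csv.
rows = [
    [50.0, 0.5, 1.0],
    [400.0, 30.0, 5.0],
    [150.0, 3.0, 25.0],
    [60.0, 0.2, 1.5],
    [420.0, 28.0, 6.0],
    [160.0, 4.0, 22.0],
]


def min_max_fit(data):
    """Return per-column minima and maxima."""
    cols = list(zip(*data))
    return [min(c) for c in cols], [max(c) for c in cols]


def normalize(data, mins, maxs):
    """Scale every column to the range 0.0 to 1.0 (constant columns become 0.0)."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in data
    ]


def denormalize(data, mins, maxs):
    """Invert the min-max scaling to recover the original units."""
    return [[v * (hi - lo) + lo for v, lo, hi in zip(row, mins, maxs)] for row in data]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(data, k, max_iter=100):
    """Plain k-means with centroids initialized from the first k rows."""
    centroids = [list(r) for r in data[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c])) for row in data
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


mins, maxs = min_max_fit(rows)
scaled = normalize(rows, mins, maxs)
labels, centroids = k_means(scaled, k=3)
restored = denormalize(scaled, mins, maxs)

for row, lab in zip(restored, labels):
    print(f"cluster_{lab}", [round(v, 2) for v in row])
```

Clustering on the scaled values keeps a large-range attribute such as calories from dominating the distance computation, while denormalizing afterwards lets the clusters be read in the original units.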
Level: Easy
Description: You are a linguist studying linguistic diversity around the world. You have found a dataset that includes information about countries, such as the number of languages spoken, area, and population. The dataset also contains a column called MGS, which refers to the mean growing season in each country (i.e., for how many months per year crops can be grown on average). What are the top 5 countries by the number of languages spoken? What are the top 5 countries by the ratio of languages spoken to population? What are the top 5 countries by the ratio of languages spoken to land area? Finally, do you notice any patterns between the numbers of languages spoken and the MGS values?
Objective 1 (Easy): Learn how to import a CSV file into KNIME.
Objective 2 (Easy): Perform ratio calculations between columns (e.g., number of languages spoken and population size ratio).
Objective 3 (Easy): Sort the resulting table using specific criteria to select top 5 countries.
Objective 4 (Easy): Filter the top rows based on your selected criteria.
Author: Michele Bassanelli
Dataset: Linguistic in the KNIME Community Hub
Solution Summary: We solve this challenge by computing the ratios between the number of languages spoken in a country and its population and area, and then ranking the countries.
Solution Details: After reading the linguistic dataset with the CSV Reader node, we answer the first question using the Top K Row Filter node, sorting by the "Lang" column. For the second question, an Expression node is used to calculate the ratio of languages to population, followed by another Top K Row Filter node to sort by the newly calculated ratio.
The third question is addressed with a similar approach, but the ratio is calculated between the number of languages spoken and the country’s area.
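The three rankings described above can be sketched in plain Python as a rough analogue of the Expression and Top K Row Filter steps. The country rows are invented for illustration, and the "Population" and "Area" column names are assumptions; only the "Lang" column name comes from the solution text.

```python
# Hypothetical rows; NOT the values from the Linguistic dataset.
# "Population" and "Area" are assumed column names.
countries = [
    {"Country": "A", "Lang": 120, "Population": 4_000_000, "Area": 300_000},
    {"Country": "B", "Lang": 800, "Population": 200_000_000, "Area": 1_900_000},
    {"Country": "C", "Lang": 10, "Population": 50_000, "Area": 1_000},
    {"Country": "D", "Lang": 300, "Population": 30_000_000, "Area": 900_000},
    {"Country": "E", "Lang": 50, "Population": 10_000_000, "Area": 100_000},
    {"Country": "F", "Lang": 5, "Population": 60_000_000, "Area": 500_000},
    {"Country": "G", "Lang": 200, "Population": 2_000_000, "Area": 450_000},
]


def top_k(rows, key, k=5):
    """Return the k rows with the largest key value, in descending order."""
    return sorted(rows, key=key, reverse=True)[:k]


def names(rows):
    """Extract just the country names from a list of rows."""
    return [r["Country"] for r in rows]


# Question 1: most languages spoken.
by_lang = top_k(countries, key=lambda r: r["Lang"])

# Question 2: languages per inhabitant.
by_lang_per_pop = top_k(countries, key=lambda r: r["Lang"] / r["Population"])

# Question 3: languages per unit of land area.
by_lang_per_area = top_k(countries, key=lambda r: r["Lang"] / r["Area"])

print("Top 5 by languages:        ", names(by_lang))
print("Top 5 by languages/pop.:   ", names(by_lang_per_pop))
print("Top 5 by languages/area:   ", names(by_lang_per_area))
```

Note how a small country with few languages (C in this made-up data) can lead both ratio rankings even though it is absent from the raw count ranking.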
This challenge was adapted from Statistics for Linguists and uses a modified version of the dataset from Nettle 1999. In this case, the columns that were initially log-transformed are restored to their original values.
Level: Medium
Description: You have an EV and want to live in a place that has many available charging stations, and where it is also cheap to charge your vehicle. Given a dataset on chargers around the world, you need to find the ten cities that have the most EV chargers. You also want to consider which of those ten cities offer, on average, the cheapest cost per kWh. After taking the costs into account, you should narrow your choice down to five cities.
Objective 1 (Easy): Clean data by removing addresses without real city names and extract country out of their addresses.
Objective 2 (Easy): Count the total number of EV charging stations by city and find the top ten cities.
Objective 3 (Easy): Of the top ten cities, find out which have the cheapest average cost to charge per kWh and show the five cheapest cities.
Objective 4 (Medium): Create a bar chart that allows you to compare the top ten cities in terms of average cost to charge per kWh. Create a widget that lets you select the cities you want to see in this plot and control the plotting with flow variables.
Author: Thor Landstrom
Dataset: EV data in the KNIME Community Hub
Solution Summary: We solve this problem by grouping the data by city, so that we can count every city's unique EV stations and also calculate their average cost. We sort and filter the data and create visualizations that allow users to compare cities' average EV prices interactively.
Solution Details: After reading the dataset with the CSV Reader node, we preprocess the data. First, we remove rows without proper addresses (Row Filter node) and then extract the addresses in the remaining rows for grouping (Expression node). The next step is to group the data by address with the GroupBy node, and sort the resulting data in descending order by count (Sorter node). We then extract the top 10 cities with the most EV charging stations with the Row Filter node. We use the Sorter and the Row Filter nodes again to extract the top 5 cheapest cities out of these 10, and visualize their average costs with the Table View node. In parallel, we create a component that allows users to compare the top 10 cities with the most EV charging stations in terms of average charging cost. This component has a widget that lets users select the cities they want to see in this plot, which turns into a flow variable that controls the plotting.
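The group, sort, and filter logic above can be illustrated with a short pure-Python sketch. This is only an analogue of the GroupBy, Sorter, and Row Filter steps, not the workflow itself: the station records and city names are invented, and the cutoffs are reduced to a top 3 and a cheapest 2 so that the tiny sample stays readable (the challenge uses 10 and 5).

```python
from collections import defaultdict

# Hypothetical (city, cost per kWh) records; NOT taken from the EV dataset.
stations = [
    ("CityA", 0.30), ("CityA", 0.40), ("CityA", 0.50),
    ("CityB", 0.10), ("CityB", 0.20),
    ("CityC", 0.25), ("CityC", 0.25), ("CityC", 0.25), ("CityC", 0.25),
    ("CityD", 0.05),
]


def summarize_by_city(records):
    """Group records by city and compute station count and average cost."""
    costs = defaultdict(list)
    for city, cost in records:
        costs[city].append(cost)
    return [
        {"city": city, "count": len(vals), "avg_cost": sum(vals) / len(vals)}
        for city, vals in costs.items()
    ]


def most_stations(summary, n):
    """Keep the n cities with the most stations (descending by count)."""
    return sorted(summary, key=lambda r: r["count"], reverse=True)[:n]


def cheapest(summary, n):
    """Keep the n cities with the lowest average cost (ascending)."""
    return sorted(summary, key=lambda r: r["avg_cost"])[:n]


summary = summarize_by_city(stations)
top_cities = most_stations(summary, n=3)       # the challenge uses n=10
cheapest_cities = cheapest(top_cities, n=2)    # the challenge uses n=5

for row in cheapest_cities:
    print(row["city"], row["count"], round(row["avg_cost"], 3))
```

The order of the two filters matters: CityD has the lowest price in this made-up data, but it never reaches the price comparison because it is dropped by the station-count cutoff first.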
These challenges are a great way to prepare for our certifications.
KNIME community members are working hard to solve the latest "Just KNIME It!" challenge - and some of you have solved dozens of them already! Who are the KNIME KNinjas who have completed the most challenges? Click over to the leaderboard on the KNIME Forum to find out! How many challenges have you solved?