Prove your KNIME knowledge and practice your workflow building skills by solving our weekly challenges! Can you become one of our top Just KNIME It! KNinjas?
Challenge 30
Level: Medium
Description: Christmas is right around the corner, and many of those who celebrate it are currently looking for souvenirs and gifts for their loved ones. To facilitate their search, let’s create a data app that allows people to find nearby Christmas markets given an address and a maximum distance. Data app users should be able to visualize the locations, along with their distances from the given address. In our example, we focus on the Berlin Christmas markets of 2023, but feel free to experiment with other geospatial datasets that have information on Christmas markets. Hint: Since the dataset is in German, you can use the Translator component from KNIME Community Hub to translate the addresses.
Author: Sanket Joshi
Dataset: Christmas Market Data in the KNIME Community Hub
Solution Summary: To solve this challenge, we read the list of Christmas markets in Berlin, translate the column names from German to English, transform the market addresses into geometries that can be plotted on a map and used in search, and then create a dashboard that allows users to type an address and get a list of the closest Christmas markets. This dashboard also details the distances between the given address and the markets, and plots the markets on a map.
Level: Medium
Description: As a data scientist working in finance, you are tasked with researching the current crypto market. Perform a GET request on the provided API to obtain up-to-date information, convert the data so that each token/crypto has its own row, and extract the name of each crypto, 'marketCapUSD', and 'changePercent24Hr' from the JSON body. With this data at hand, obtain the top 10 gainers and losers in the current crypto market and visualize them with a simple tree map. The size of each crypto should be scaled by its market cap, and its color should be determined by whether the percent change is positive (green) or negative (red). Hint 1: Use the Generic ECharts View node for the tree map. Here's an example of a simple tree map visualization that can be useful. Hint 2: Tree maps usually do not display negative values. One simple way to handle this is to use absolute values.
Author: Thor Landstrom
Solution Summary: We solve this challenge by performing a GET request to the given API, and then cleaning the result to isolate the names of each crypto and their percent change in value over the previous 24 hours. Next, we select the top 10 crypto markets that gained the most in value and the top 10 crypto markets that lost the most. With this information, we create a simple tree map visualization with colors encoding the gaining and losing markets, such that their sizes in the tree map are also proportional to their gains or losses.
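For readers who want to see the underlying logic outside of KNIME, here is a minimal Python sketch of the same steps: fetch the API response, flatten it to one row per token, and pick the top 10 gainers and losers. The URL is a hypothetical placeholder (the challenge's API is not reproduced here), and the field names follow the challenge text, so they may need adjusting against the live response.

```python
import pandas as pd
import requests

API_URL = "https://api.example.com/assets"  # hypothetical placeholder for the provided API

payload = requests.get(API_URL, timeout=30).json()
df = pd.DataFrame(payload["data"])          # assumed response layout: {"data": [{...}, ...]}
df = df[["name", "marketCapUSD", "changePercent24Hr"]].astype(
    {"marketCapUSD": float, "changePercent24Hr": float}
)

gainers = df.nlargest(10, "changePercent24Hr").assign(color="green")
losers = df.nsmallest(10, "changePercent24Hr").assign(color="red")

# Size each tile by market cap; color encodes the sign of the 24h change.
# (If you size by the percent change instead, take its absolute value first,
# since tree maps cannot render negative areas.)
treemap_input = pd.concat([gainers, losers]).assign(size=lambda d: d["marketCapUSD"].abs())
print(treemap_input.sort_values("size", ascending=False))
```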
Level: Easy
Description: The 2024 World Happiness Report focuses on the happiness of people at different stages of life. The later stages of life are often portrayed as less happy, but happiness research shows a more nuanced picture, and one that is changing over time. What are the top 10 happiest countries in 2024 for people in different age groups? Are there any interesting overlaps in these rankings?
Author: Sanket Joshi
Dataset: World Happiness Data in the KNIME Community Hub
Solution Summary: We solve this challenge by grouping the data based on four different age groups: youngest, lower middle age, upper middle age, and oldest. For each group, we get the geometries for the top 10 reported happiest countries. Finally, we create a composite visualization with maps highlighting the top 10 happiest countries per age group.
Solution Details: After reading the given dataset with the CSV Reader node, we use the GroupBy node to get the happiness country ranking for the youngest group of people. We then use the Top k Row Filter node to restrict our analysis to the top 10 countries for this group. With the Geocoding node, we get the geometries for the 10 countries. In parallel, we use this same sequence of nodes to get the top 10 happiest countries, along with their geometries, for the other three age groups: lower middle age, upper middle age, and oldest people. We send these groups' geometries to a component that, with the Geospatial View node, plots the happiest countries on a map.
Level: Easy
Description: This week we propose a delicious (and, for some people, scary!) challenge: find the most uncommon pizza toppings. Given a dataset containing pizza options from several restaurants, along with their prices and ingredient lists, your task is to find the ingredients that are most rarely used -- i.e. the ones that appear the least in the ingredient lists. Go a step further if you will and extract the rarest ingredient pairs as well. Buon appetito!
Author: Emilio Silvestri
Dataset: Pizza Data in the KNIME Community Hub
Solution Summary: We solve this challenge in three different ways. First, we get a list of unique ingredients, count their occurrences, and extract those that occur fewer than three times. Second, we create a list with unique pairs of ingredients and count their occurrences, allowing us to check which pairs are the rarest. Finally, we use a loop-based approach to create ingredient pairs for each pizza, and only after we have pairings for all pies do we group them to verify which pairs are the rarest.
Level: Easy
Description: You work for a political think tank that closely monitors the US presidential campaigns. You are tasked with designing a KNIME component that visualizes the results of the last three US presidential elections (2012, 2016, 2020) on a map for each state. Which insights can you get about this data with this component?
Author: Michele Bassanelli
Dataset: US Elections Data in the KNIME Community Hub
Solution Summary: To solve this challenge, we start by preprocessing the dataset to restrict our scope to the three most recent presidential elections (2012, 2016, 2020). For each state, we identify the candidate who received the highest number of votes. We then integrate this with geographical data using the OSM Boundary Map to obtain the geometry of each US state and the District of Columbia. The election results are joined with the geographical data, which are then combined into a KNIME component. Finally, the results are visualized on a map using the Geospatial view.
Solution Details: After reading the dataset for the challenge with the CSV Reader node, we group the data by state with the GroupBy node and then standardize its format with the Expression node. Next, we use a loop (Table Row to Variable Loop Start and Loop End nodes) to iterate over each state, getting their geometries with the OSM Boundary Map node. We filter these geometries with the Row Filter node to keep only information about the states. In parallel, we perform a similar loop for Washington and New York because data was previously missing for these two states. With the Concatenate node, we combine the information for all 50 states and also for the District of Columbia (DC). As a last step, we send the geometries and the election data to a component that allows us to select an election year (2012, 2016, or 2020) and then check which candidate and party won in each state. If a state is red, it means that the Republican party won there; if it is blue, it means that the Democratic party won there.
Level: Medium
Description: Your company wants to perform a cohort analysis to find customers that started purchasing goods in the same month (cohorts), and then track their persistence over time. To solve this challenge, you must explore a transactions dataset from 2016, identify its cohorts, calculate the customer retention ratio, separate the purchases month by month, and then visualize the persistence of the cohorts over time. Note: Customer retention rate (CRR) is the percentage of customers a company retains over a period of time. If E is the number of customers at the end of 2016, N is the number of new customers during the period, and S is the number of customers at the beginning of 2016, then CRR = [(E - N) / S] * 100.
Author: Aline Bessa
Dataset: Transactions Data in the KNIME Community Hub
Solution Summary: To solve this challenge, we first process the data in order to obtain customer cohorts on a monthly basis. Next, we compute the CRR for the different cohorts, normalize the data so that we can compare cohorts that started in different months, and then finally visualize the different cohorts to check their persistence over time. 55% of the cohort with the highest retention ratio (January 2016) came back during the first month, and 17% came back even 11 months after that.
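As a quick illustration of the CRR formula from the challenge description, here is a tiny Python example. The figures are made up and are not taken from the dataset.

```python
def customer_retention_rate(end_customers, new_customers, start_customers):
    """CRR = [(E - N) / S] * 100"""
    return (end_customers - new_customers) / start_customers * 100

# Hypothetical 2016 figures, purely for illustration:
E, N, S = 950, 300, 800
print(f"CRR = {customer_retention_rate(E, N, S):.1f}%")  # CRR = 81.2%
```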
Level: Easy to Medium
Description: You are a social scientist who needs to create some synthetic data for an imaginary population consisting of 1000 people, including attributes age, height, and weight. Start by generating a Gaussian age distribution using a mean of 40 and a standard deviation of 10, then bin people into four age groups: 'Children', 'Young Adults', 'Adults', and 'Seniors'. For each group, generate heights using a beta distribution with realistic parameters. Categorize heights into three groups: '< 160cm', '> 180cm', and 'rest'. Based on the binned height information, generate weights using a gamma distribution that accurately models weight distributions per age group. Visualize the relationships and identify patterns and correlations within this synthetic population.
Author: Keerthan Shetty
Solution Summary: We solve this challenge by generating a pool of 1,000 people, assigning ages to them by drawing numbers from a Gaussian distribution, and binning them into age groups. Next, we use a beta distribution to generate heights for the population, and then bin people according to their height. Finally, we use a gamma distribution to generate synthetic weight values for people and then visualize correlations between height, weight, and age. There's a slight correlation between height and weight for this synthetic data generation process: the taller a person is, the more they tend to weigh.
Solution Details: We start our solution by generating a pool of 1,000 ids (1,000 population instances) using the Random Number Generators node. Next, we use the Gaussian Distribution Assigner and Number Rounder nodes to generate ages (as natural numbers) for this population. With the Numeric Binner node, we group this population by age: "Children" are below 18; "Young Adults" are equal to or above 18 and below 35; "Adults" are equal to or above 35 and below 54; and "Seniors" are equal to or above 54. After the population's ages are determined, we use the Beta Distributed Assigner node to generate heights. We set the minimum bound as 120 and the maximum bound as 190, and assign different peaks and scaling parameter values p for each age group. As a result, we have customized height distributions for each age group. We again use the Numeric Binner node to group the population based on height: "Bin 1" comprises individuals with height lower than 160cm; "Bin 2" comprises individuals with height equal to or higher than 160cm and lower than 180cm; and "Bin 3" comprises individuals with height equal to or higher than 180cm. We use the Gamma Distributed Assigner node to generate different weight distributions for the different height bins, setting different peaks and scaling parameter values p for each bin. Once the weights are generated for all individuals, we are ready to explore correlations between age, height, and weight using the Scatter Plot Matrix node.
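For reference, the same generation pipeline can be sketched in a few lines of NumPy/pandas. The distribution parameters below are illustrative stand-ins rather than the exact peaks and scaling values configured in the KNIME nodes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

pop = pd.DataFrame({"id": range(n)})
pop["age"] = np.round(rng.normal(loc=40, scale=10, size=n)).astype(int)
pop["age_group"] = pd.cut(
    pop["age"], bins=[-np.inf, 18, 35, 54, np.inf],
    labels=["Children", "Young Adults", "Adults", "Seniors"], right=False
)

# Beta-distributed heights rescaled to the 120-190 cm range; one (a, b) pair per age group.
beta_params = {"Children": (2, 5), "Young Adults": (5, 2), "Adults": (5, 3), "Seniors": (4, 3)}
pop["height"] = [120 + 70 * rng.beta(*beta_params[g]) for g in pop["age_group"]]

# Height bins: "< 160cm", the "rest" (160-180cm), and "> 180cm".
pop["height_bin"] = pd.cut(
    pop["height"], bins=[-np.inf, 160, 180, np.inf],
    labels=["< 160cm", "160-180cm", "> 180cm"], right=False
)

# Gamma-distributed weights with a different shape per height bin (illustrative values).
gamma_params = {"< 160cm": (55, 1.0), "160-180cm": (70, 1.0), "> 180cm": (85, 1.0)}
pop["weight"] = [
    rng.gamma(shape=gamma_params[b][0], scale=gamma_params[b][1]) for b in pop["height_bin"]
]

print(pop[["age", "height", "weight"]].corr())  # inspect pairwise correlations
```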
Level: Medium
Description: You work as a data analyst for an e-commerce company, and are asked to build an interactive dashboard using the Generic ECharts View node. Your dashboard must include a Funnel Chart representing the events of customer conversion and a Sankey Diagram representing the expenditure of the top 10 most loyal and valuable customers. Leverage KNIME’s AI assistant, K-AI, to customize the plots' aesthetic appearance (e.g., color, spacing, label positions, titles, etc.). Finally, for each plot, type a short description of the key findings and play around with the layout to display the descriptions.
Author: Roberto Cadili
Dataset: E-Commerce Data in the KNIME Community Hub
Solution Summary: To solve this challenge, we group the data based on event types and create percentages for each one of them in order to create a conversion funnel graph. Event type 'view', for example, is always present in a purchase, so its percentage is 100% at the top of the funnel. Event type 'cart' corresponds to the portion of customers that also match 'view': 7.56% of them. Finally, only 5.04% of the customers that initially viewed items "convert" to event type 'purchase'. In parallel, we process the data to only keep information about the top 10 most loyal and valuable customers, separating them into expenditure categories to create an expenditure flow graph. Most of these special customers spend the most on the category path computers -> components -> videocards.
Level: Medium
Description: Recently you have been feeling a bit nostalgic about Pokémon, and decided to use the PokeAPI to check which of them are the most common these days. Through the API, find the occurrences of each Pokémon, filtering out all irrelevant information along the way. What are the five most common Pokémon types?
Author: Thor Landstrom
Solution Summary: To tackle this challenge, we fetch information on all Pokémon using GET requests, filter the data to keep only Pokémon types, and visualize the most common types with bar and pie charts. The top 5 most common Pokémon types are water, normal, grass, flying, and psychic.
Solution Details: We start our solution with a relatively complex component named Get all Pokemon. Through a combination of Recursive Loop and Chunk Loop nodes, we perform GET requests to the Pokémon API (GET Request node) and then extract the responses with the JSON Path node. After the component is done executing, we have a table with all Pokémon and their corresponding API links. We can optionally use the Row Sampling node to restrict our analysis to a smaller amount of data, speeding up execution but potentially compromising the quality of the results. Next, we use the GET Request node to get more specific information on each Pokémon through their links. We then use the Column Filter node to filter out irrelevant columns and the JSON Path node to filter out Pokémon data and focus on their types. With the Ungroup node, we split these types to have a single one per row, and then send this information to a simple component named View Results, which shows the most common types with the Pie Chart and Bar Chart nodes.
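For comparison, the following is a rough plain-Python version of the same pipeline against the public PokeAPI (https://pokeapi.co). The endpoint and field names reflect the API as we understand it; the small limit below plays the same role as the optional Row Sampling step, keeping the run short at the cost of completeness.

```python
from collections import Counter
import requests

BASE = "https://pokeapi.co/api/v2/pokemon"

# Fetch the Pokémon listing (name + detail URL per entry); limit kept small on purpose.
listing = requests.get(BASE, params={"limit": 200}, timeout=30).json()["results"]

type_counts = Counter()
for entry in listing:
    detail = requests.get(entry["url"], timeout=30).json()  # follow each detail link
    for t in detail["types"]:                                # one entry per (Pokémon, type)
        type_counts[t["type"]["name"]] += 1

print(type_counts.most_common(5))                            # five most common types
```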
Level: Medium
Description: For this challenge, you will leverage KNIME’s reporting capabilities to create a neat PDF document containing summaries and visual analyses of a movie dataset. Movies from each decade should be analyzed separately, one decade per page.
Within each decade:
(1) what average budget was spent per genre?
(2) what was the average revenue per genre?
(3) what were the main features of its top 5 movies?
Do your best to enhance the report visualization (e.g., use format managers, colors, and header/footer). Hint: Use the KNIME Reporting extension to solve this challenge.
Author: Armin Rudd
Dataset: Movie Data in the KNIME Community Hub
Solution Summary: To tackle this challenge, we start by preprocessing the data to extract the movies' genres and get the movies' decades based on their release years. Next, we loop over each decade to get their top three genres based on average revenue, and to create a PDF page summarizing their cinematic information. The loop concatenates the created pages, which are then turned into a single report.
Level: Easy
Description: Sports analytics is a growing field, and Formula 1 analytics is a part of it. Let’s perform a comprehensive data analysis to gather insights into both historical team performances and the dynamic shifts in driver positions throughout the thrilling 2021 season. Explore the given datasets and build visualizations that highlight the top 5 2021 teams and driver position changes. For the latter, create a dynamic visualization showing how driver positions changed after each round. Hint: All datasets available for this challenge should be used.
Author: Keerthan Shetty
Dataset: Formula 1 Data in the KNIME Community Hub
Solution Summary: To tackle this challenge, we start by combining data from constructors (e.g., BMW or Ferrari) with results data. After some preprocessing, we are able to visualize how many points each one of the top 5 constructors has accumulated. In parallel, we combine data from drivers, races, and results to dynamically visualize, for the 2021 season, how the drivers' ranking has changed from race to race.
Solution Details: To visualize all points each top 5 constructor has accumulated, we start by using two instances of the CSV Reader node to read constructors and results data. We join these datasets with the Joiner node and then group it by constructor name, summing all their corresponding points (GroupBy node). We then select the top 5 constructors based on their points (Top k Row Filter node), sort them (Sorter node), and visualize the points total for them with the Bar Chart node. We add different colors to the constructors in this visualization with the Color Manager node, mostly for aesthetic reasons. In parallel, we use two instances of the CSV Reader node to ingest data on drivers and races. We combine these datasets with the results data using two instances of the Joiner node. Next, we use the Row Filter node to focus exclusively on the 2021 season, and the Column Filter node to only keep information on drivers, how many points they made per race, and race dates. With this information, we use the Animated Bar Chart component to visualize how the drivers' ranking changes race by race, based on their points.
Level: Hard
Description: You work as a researcher creating models to identify whether a breast tumor is benign or malignant, based on anonymized patient data. Besides obtaining a classifier that works very well for both benign and malignant cases, you must be able to explain how different feature values impact your results. Experiment with LIME and visualization techniques to explain your predictions and make your research more transparent. Hint: Learn more about this problem's data attributes here.
Author: Keerthan Shetty
Dataset: Breast Tumor Data in the KNIME Community Hub
Solution Summary: To tackle this problem, we start by lightly preprocessing the data and then building a random forest classifier to predict whether a tumor is benign ('B') or malignant ('M'). We then sample two instances categorized with very high confidence (one originally labeled as benign and the other as malignant) and use LIME to see how the different instance features account for their predictions. At the end of our solution, we create visualizations to better interpret the impact of the features.
Level: Medium
Description: Birds like to travel and often fly in large groups to different parts of the globe depending on the season. This week, your goal is to identify patterns in data about birds migrating from North America. You should isolate the initial and final locations of each bird and plot them on a map to visualize their migration path. Additionally, create a map visualization to determine which continent all the birds ultimately migrated to.
Author: Keerthan Shetty
Dataset: Bird Migration Data in the KNIME Community Hub
Solution Summary: To solve this challenge, we start by grouping the bird migration data per bird, using their local identifiers. We manipulate the data further and create geometries for the starting point and final destination of every bird's migration. We then visualize these geometries for each bird.
Level: Medium
Description: You are the architect behind an innovative voice assistant device, and your initial goal is to process user reviews and uncover insights about 'sound quality'. To this end, you decide to use 4-grams to discover frequently mentioned words near the term "sound quality". What trends emerge from the 4-gram frequencies? What is the top 4-gram? Hint: The NGram Creator node can come in handy here!
Author: Michele Bassanelli
Dataset: User Reviews Data in the KNIME Community Hub
Solution Summary: After reading the reviews and preprocessing them to remove punctuation and stopwords, we create 4-grams and filter them to only keep those that contain the term "sound quality". The most frequent 4-gram with this term is "pleasantly surprised sound quality".
Solution Details: To tackle this challenge, we use the File Reader node to read the reviews. Next, we transform them into type Document with the String to Document node: this is usually the first step for text mining in KNIME. We preprocess the reviews to remove punctuation (Punctuation Eraser node), numbers (Number Filter node), and stopwords (Stop Word Filter node) inside the Preprocessing metanode. We then use the NGram Creator node to create 4-grams for all preprocessed reviews, only keeping those that contain the term "sound quality" (Row Filter node). We sort the NGram corpus by frequency and find out that the 4-gram "pleasantly surprised sound quality" is the most popular.
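To illustrate the 4-gram idea outside of the NGram Creator node, here is a minimal Python sketch: slide a four-word window over each cleaned review and keep only the 4-grams that contain "sound quality". The two sample reviews are made up and stand in for the preprocessed documents produced by the workflow.

```python
from collections import Counter

# Illustrative reviews, already lowercased and stripped of punctuation/stopwords.
reviews = [
    "pleasantly surprised sound quality great device",
    "sound quality excellent battery life short",
]

counts = Counter()
for review in reviews:
    tokens = review.split()
    for i in range(len(tokens) - 3):
        fourgram = " ".join(tokens[i:i + 4])     # sliding window of four words
        if "sound quality" in fourgram:          # keep only 4-grams mentioning the term
            counts[fourgram] += 1

print(counts.most_common(10))
```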
Level: Easy
Description: You are a social scientist researching childcare in the European Union. The data you are currently examining contains the percentage of children under 3 years old that are cared for by formal arrangements other than the family, in various European countries. Is the overall trend increasing or decreasing? In what country is childcare the most common? And in what country is it the least common? Check here for more information on the data.
Author: Emilio Silvestri
Dataset: Childcare Data in the KNIME Community Hub
Solution Summary: To tackle this challenge, we pivot the data to get yearly childcare percentages for different countries. We then use different visualization strategies to see trends in childcare for the countries.
Solution Details: After reading the European Union childcare data and its metadata with the CSV Reader node, we use the Value Lookup node to replace country codes with country names, making the data easier to understand. Next, we use the Pivot node to get yearly childcare percentages for each one of the countries separately -- that is, each country turns into a separate column in the data, with yearly percentages corresponding to rows. We use the Box Plot node to visualize how these percentages are distributed per country (Denmark has the highest median value) and the Stacked Area Chart node to plot temporal trends for all countries. Overall, this last visualization suggests that certain countries are relying less on childcare in recent years, probably a lingering consequence of the COVID-19 pandemic.
Level: Medium
Description: You are a data scientist asked to analyze an avocado dataset by your team. The task at hand is to pick a specific avocado type in the whole of the US and forecast its daily average prices. To do that, you should train, apply, and score an ARIMA model. Do you see any seasonality in the line plot or autocorrelation plots? Do you think a seasonal ARIMA (SARIMA) would perform better? For your model, visualize forecasts and compute scoring metrics.
Authors: Roberto Cadili, Swetha Kannan, and Corey Weisinger
Dataset: Avocado Price Data in the KNIME Community Hub
Solution Summary: To tackle this problem, we focus on "conventional" avocado types, aggregate their prices per day, and then visually search for autocorrelation patterns that may indicate seasonality. The plots do not suggest strong seasonality patterns for the prices of avocados, but we still use a SARIMA model to learn and forecast the prices. We use a line plot to contrast the actual prices of avocados in the data we held out for testing against the forecasted prices generated by the model, and the predictions are rather close -- especially up to April.
Solution Details: After reading the data with the CSV Reader node, we filter out any rows that do not correspond to "conventional" avocado types and also sort the data by date (Data Preprocessing metanode). We use the Date&Time Aggregator node to get the average price per date, and then partition the data (70% for training, 30% for testing) using the Partitioning node. We send the training data to the SARIMA Learner node, setting the AR Order (p) as 2, the I Order (d) as 1, the MA Order (q) as 4, and the Seasonal Period (s) as 2. Configuring these parameters requires experimenting with different values, and then checking which ones lead to the best forecasts over the test data. We use the SARIMA Predictor node to generate forecasts, join them with the original values using the Joiner node, and then assess the quality of the forecasts visually, with the Line Plot node, and more analytically with the Numeric Scorer node.
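For readers more comfortable with code, here is a rough statsmodels equivalent of the fit/forecast/score steps. The non-seasonal orders (p=2, d=1, q=4) and the seasonal period (s=2) mirror the values above, while the seasonal P/D/Q values and the synthetic price series are illustrative placeholders rather than anything taken from the dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Stand-in for the aggregated daily average price of "conventional" avocados.
rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", periods=200, freq="D")
daily_avg_price = pd.Series(
    1.4 + 0.1 * np.sin(np.arange(200) / 7) + rng.normal(0, 0.03, 200), index=dates
)

split = int(len(daily_avg_price) * 0.7)            # 70% train / 30% test, in time order
train, test = daily_avg_price[:split], daily_avg_price[split:]

model = SARIMAX(train, order=(2, 1, 4), seasonal_order=(1, 0, 1, 2)).fit(disp=False)
forecast = model.forecast(steps=len(test))

mae = float(np.mean(np.abs(test.values - np.asarray(forecast))))   # simple scoring metric
print(f"MAE: {mae:.4f}")
```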
Level: Easy to Medium
Description: You work in the contracts department of a software company and are asked to detect fraudulent (or wrong) contracts based on their contract value. Given the PDF versions of the contracts, you need to extract their contract value (and, optionally, any other fields you find useful) and detect outliers among them. You can either use simpler outlier detection techniques, such as those based on statistics or visualization, or more advanced ones based on machine learning.
Author: Lada Rudnitckaia
Dataset: Contract data in the KNIME Community Hub
Solution Summary: After ingesting and processing the contracts in PDF format, and isolating their contract values, we identify outliers via the IQR (interquartile range) method, as well as box plot and histogram visualizations. Through these techniques, two contracts seem to be fraudulent based on their unusually large values.
Solution Details: We start our solution by parsing the PDF contracts with the PDF Parser node. Next, the Document Data Extractor node extracts text from each PDF. In parallel, we use sequences of Cell Splitter nodes to isolate contractual information, including their values. We then use the Column Appender node to add these new columns with extracted information to the initial contract table, and apply the Table Manipulator node to filter out and rename columns. At this point of the workflow, we only keep information on contract IDs and their corresponding values. This simplified table is then sent to a component named Outliers Detection, which uses a combination of the Numeric Outliers, Tile View, Box Plot, and Histogram nodes to extract numeric outliers in different ways -- visually and statistically.
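As a side note, the statistical part of the outlier detection boils down to the classic IQR rule. Here is a compact Python sketch with made-up contract values standing in for the figures extracted from the PDFs.

```python
import pandas as pd

# Made-up contract values; two are deliberately much larger than the rest.
contracts = pd.DataFrame({
    "contract_id": [f"C{i:03d}" for i in range(1, 11)],
    "value": [12000, 13500, 12800, 14100, 11900, 13200, 12500, 98000, 13900, 87000],
})

q1, q3 = contracts["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # standard 1.5*IQR fences

outliers = contracts[(contracts["value"] < lower) | (contracts["value"] > upper)]
print(outliers)                                     # the suspiciously large contracts
```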
Level: Medium
Description: Did you know that August 2nd is International Beer Day? To celebrate one of the world’s oldest beverages, you are tasked with retrieving all worldwide brewery data from the OpenBreweryDB REST API, which provides data on the breweries in paginated responses. Build a loop that makes successive requests to each page of the API until all data is retrieved.
Not interested in beer? Feel free to use any API of your choice! Find more public APIs here: https://publicapis.io/
Author: Lada Rudnitckaia
Solution Summary: To tackle this challenge, we get the total number of breweries, calculate how many pages there are, and create a loop that extracts all brewery data page by page. We finish our solution with a paginated table view for the brewery data.
Solution Details: After getting the number of breweries with the GET Request and JSON Path nodes, we use the Math Formula node to compute the total number of pages (which is 42), given that the number of results per page is 200. With the Table Row to Variable node we turn the computed number of pages into a flow variable, and then use it to parametrize a loop (Counting Loop Start and Loop End nodes). Inside the loop, we get the number of the current page and add it to the base of a URL using the Math Formula (Variable) and String Manipulation (Variable) nodes. After that, a second instance of the GET Request node is used to retrieve data from the current page, which is then parsed and ungrouped (JSON Path and Ungroup nodes). When the loop ends, information for all pages has been collected. This information is sent to a visualization component that, with the Table Manipulator and Table View nodes, creates a simple tabulated interface for the brewery data.
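The same pagination logic can be sketched in a few lines of Python against OpenBreweryDB. The endpoint paths and the total-count metadata field reflect our reading of the API and may need adjusting; the page-count arithmetic mirrors the Math Formula step above.

```python
import math
import requests

BASE = "https://api.openbrewerydb.org/v1/breweries"   # assumed endpoint paths
PER_PAGE = 200

# Total number of breweries from the metadata endpoint, then the number of pages.
total = int(requests.get(f"{BASE}/meta", timeout=30).json()["total"])
pages = math.ceil(total / PER_PAGE)

breweries = []
for page in range(1, pages + 1):
    batch = requests.get(BASE, params={"page": page, "per_page": PER_PAGE}, timeout=30).json()
    breweries.extend(batch)                            # each response is a JSON list of breweries

print(f"Retrieved {len(breweries)} breweries across {pages} pages")
```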
Level: Easy
Description: The 2024 Summer Olympics start this week: are you ready for some on-theme trivia? Here are three questions to test your knowledge:
1. Which country has the highest average number of medals (gold, silver, and bronze combined) per athlete, across all Summer Olympics?
2. Which sport made its debut at the Rio 2016 Olympics?
3. Who holds the title of the oldest gold medalist of all time?
If you're not sure about the answers, don't sweat it: we've got a handy dataset that covers the entire history of the Games for you. Before you know it, you'll be standing on the virtual podium as the trivia champion, ready to take home the gold!
Author: Michele Bassanelli
Dataset: Olympics Data on KNIME Community Hub
Solution Summary: To answer question 1, we group the data to get (1) the number of medals per country and (2) the number of athletes per country. We then compute the ratio between (1) and (2) and get 'Russia' as the answer. For question 2, we group the data by year and sport, get all sports that were part of the Olympics up until 2012, and then compare this list with the sports that were part of the 2016 Summer Olympics. The sport that is present in the latter but not in the former, 'Rugby 7', started in 2016. Finally, for question 3 we filter the data to only take gold medalists into account and sort them by age. This leads us to the name of 'Charles Jacobus' as the answer.
Solution Details: After reading all datasets with CSV Reader nodes, we tackle question 1 by joining them on region/team with the Joiner node, creating a new column that indicates what athletes are medalists (Rule Engine node), and then computing the ratio between the number of medals per team and the number of athletes per team (GroupBy and Math Formula nodes). For question 2, we use the GroupBy node to aggregate the athletes' data based on sport and year, use two instances of the Row Filter node to isolate sports that were present up to 2012 and those that were present in 2016, and then compare these two lists to identify those sports that only started in 2016 (Joiner node). Finally, for question 3 we use the Row Filter node to identify all athletes that are gold medalists, and then sort them based on their age with the Sorter node.
Level: Medium
Description: You work as a data analyst for a delivery company, and some packages were not delivered last week due to address typos. Thanks to the postal carriers, addresses that were not found due to typos were marked as such. Given a dataset with successful deliveries (due to no typos) and unsuccessful ones (due to typos), your goal is to automatically fix the incorrect addresses by leveraging the correct ones.
Author: Aline Bessa
Dataset: Postal Data on KNIME Community Hub
Solution Summary: To tackle this challenge, we first separate the addresses that have typos from those that do not. Next, for each address with a typo, we find the correct address that is the most similar to it, and then replace it.
Solution Details: After reading the dataset with postal addresses with the CSV Reader node, we use the Row Splitter node to separate those with typos from those that are correct. Next, we remove duplicate addresses with the Duplicate Row Filter node and use the String Matcher node to identify, for each incorrect address, the most similar correct one. This information is used to fix each incorrect address with the String Replacer (Dictionary) node.
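To make the matching idea concrete, here is a minimal Python sketch using difflib (a simpler matcher than the String Matcher node): for each address flagged as having a typo, find the most similar correct address and use it as the replacement. The sample addresses are made up.

```python
import difflib

correct = ["123 Main Street", "45 Oak Avenue", "9 Birch Road"]
with_typos = ["123 Mian Street", "45 Oak Avenu"]

fixes = {}
for addr in with_typos:
    # Closest correct address above a similarity cutoff, if any.
    match = difflib.get_close_matches(addr, correct, n=1, cutoff=0.6)
    if match:
        fixes[addr] = match[0]

print(fixes)  # expected: {'123 Mian Street': '123 Main Street', '45 Oak Avenu': '45 Oak Avenue'}
```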
Level: Medium
Description: Recently you became more interested in finance, and since you want to learn more about web scraping for work, you decided to unite both interests. Using the KNIME Web Interaction extension, can you navigate to the Economic News section on Yahoo Finance, extract the headers of only the most recent topics that pop up on the webpage, and then make sense of the results visually? Remember to filter out any ads or unrelated banners/headers/content. Hint: Find class tags in the news' XML that are unique to the content you are scraping.
Author: Thor Landstrom
Solution Summary: To tackle this challenge, we connect to a browser through KNIME Analytics Platform, fetch the most recent content from the Economic News section on Yahoo Finance, extract its headers, and then visualize their corresponding topics as a table.
Solution Details: We start our solution by connecting to a browser with the Web Interaction Start node. We then navigate to the Economic News page on Yahoo Finance using the Navigator node. We retrieve the XML content of this page, including heading tags and text, with the Content Retriever node. In parallel, since we do not need a browser connection anymore, we close it with the Web Interaction End node. As for the retrieved content, we use the Row Filter node to remove headings that are not tagged as "h3", and then a combination of the XPath node and a second instance of the Row Filter node to identify and remove rows that contain ads or unrelated pages. Finally, we isolate the heading texts with the Column Filter node and visualize them as a table with the Table View node.
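For a rough idea of what the scraping steps do, here is a requests/BeautifulSoup sketch: fetch the page, keep only the h3 headings, and drop empty ones. The URL is an assumption, Yahoo Finance's markup changes frequently, and a plain HTTP request may not render the same content as the browser session driven by the Web Interaction nodes.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://finance.yahoo.com/topic/economic-news/"   # assumed section URL
html = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text

soup = BeautifulSoup(html, "html.parser")
headlines = [
    h3.get_text(strip=True)
    for h3 in soup.find_all("h3")     # keep only <h3> headings, as in the workflow
    if h3.get_text(strip=True)        # drop empty headings
]

for title in headlines:
    print(title)
```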
Level: Medium
Description: You are reorganizing a data warehouse in your company, working with a filesystem that creates parent folders if you give it a reference for a child folder. For example, if you ask the filesystem to create “folder1/folder2” and neither folder1 nor folder2 exists, it will create both, with folder2 inside folder1, without raising an error. Given a list of folders, you want to keep only the longest unique child folders, filtering out, for efficiency, references to parent folders that will be generated anyway.
Here's an example of an initial list of folders:
- folder1/folder3
- folder1/folder3/folder22
- folder1/folder3/folder22/folder47
After executing your workflow, the list above should only contain a reference for folder1/folder3/folder22/folder47.
Author: Emilio Silvestri
Datasets: Folder Data in the KNIME Community Hub
Solution Summary: After reading the list of folders, we calculate the depth (number of levels) for each one of them. Next, we iterate over all folders, searching for parent folders and making sure that we only keep the deepest reference among paths that share a common prefix. We also keep a record of the excluded paths, indicating which child/subfolder made each of them redundant.
Solution Details: After reading the list of folders with the Table Reader node, we split them into columns with the Cell Splitter node and, with a combination of nodes (Column Aggregator, Math Formula, Column Filter), end up with the number of levels (depth, or cardinality) present in each folder reference. Next, we use the Table Row to Variable Loop Start and Variable Loop End nodes to iterate over all folders. At each iteration, we use the String Manipulation and Row Filter nodes to find and isolate all parent folders of the folder in question. By combining the Table Row to Variable and Rule Engine Variable nodes, we keep the folder in question only if it is the deepest, using the identified parent folders for comparison.
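The core filtering rule can also be expressed in a few lines of Python: keep a path only if no other path in the list extends it, i.e., only if it is not a parent of anything else. The folder list matches the example from the challenge description.

```python
folders = [
    "folder1/folder3",
    "folder1/folder3/folder22",
    "folder1/folder3/folder22/folder47",
]

def is_parent_of_another(path, all_paths):
    """True if some other path starts with this path plus a separator."""
    return any(other != path and other.startswith(path + "/") for other in all_paths)

# Keep only the deepest unique child folders.
deepest_only = [p for p in folders if not is_parent_of_another(p, folders)]
print(deepest_only)  # ['folder1/folder3/folder22/folder47']
```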
Level: Medium
Description: You work as a freelance photo reporter for wildlife magazines. In your daily work you take a lot of pictures, usually in .JPG format and in different sizes. To be able to sell your photographs to magazines, you need to accommodate their different sizing and formatting requests. To streamline this process, you decide to build a workflow that automates the following, sequentially: (1) Image resizing -- create a configurable component with three options: do nothing, reduce to fixed size (150x150), or reduce size keeping ratio; (2) image format conversion -- create a configurable component with two options: .PNG or .SVG; (3) save edited images on your machine.
Author: Roberto Cadili
Datasets: Image Data in the KNIME Community Hub
Solution Summary: Our solution to this challenge contains two configurable components that let users (1) resize their images in different ways (do nothing, keep the ratio, or use a fixed 150x150 format), and then (2) convert them into .PNG or .SVG format. The images are then saved locally.
Solution Details: We start our solution by using the List Files/Folders node to get a list of local images in .JPG format. We then use the Path to String node to facilitate the reading of these images, and import them to KNIME Analytics Platform with the Image Reader (Table) node. The images are sent to our first component, Image Resizer. With the Single Selection Configuration node, we allow users to configure this component based on resizing option (do nothing, keep the ratio, or use a fixed 150x150 format). A CASE Switch Start node gets the chosen option and activates one of three branches: doing nothing and going straight to the CASE Switch End node, resizing the images to a fixed size with the Image Resizer node, or resizing them with a ratio of 0.3 using another instance of the Image Resizer node. The resized images are then passed to a second component, named Image Converter. This component also uses an instance of the Single Selection Configuration node to let users pick the format they want the images to be in (.PNG or .SVG). The images inside the component are initially converted to .PNG and then passed as input to an instance of the CASE Switch Start node. If the chosen option is .PNG, this node activates a branch that simply ungroups the images with the Ungroup node. If the chosen option is .SVG, the images are also ungrouped with another instance of the Ungroup node but are turned into .SVG with the Renderer to Image node. Both branches meet as inputs for the CASE Switch End node. Outside this second component, we use the String Manipulation node to create a filename column, and then save the edited images locally with the Image Writer (Table) node.
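As a loose, non-KNIME analogue of the two components, here is a Pillow sketch of the resizing options (do nothing, fixed 150x150, or scale by a ratio) followed by saving the result as .PNG. The folder name is hypothetical, and Pillow cannot write .SVG, so the vector branch of the Image Converter component has no direct equivalent here.

```python
from pathlib import Path
from PIL import Image

def resize(img: Image.Image, mode: str) -> Image.Image:
    if mode == "fixed":
        return img.resize((150, 150))          # fixed 150x150 output
    if mode == "ratio":
        w, h = img.size
        return img.resize((int(w * 0.3), int(h * 0.3)))   # keep aspect ratio, scale to 30%
    return img                                  # "none": do nothing

for jpg in Path("photos").glob("*.jpg"):        # hypothetical local folder of .JPG photos
    edited = resize(Image.open(jpg), mode="ratio")
    edited.save(jpg.with_suffix(".png"))        # converted copy saved next to the original
```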
Level: Medium
Description: As the 2024 European Football Championship (UEFA) unfolds, let's dive into football history with a data challenge. Today you are asked to create a data app that allows users to check, for any timeframe, what the top three teams with the most football victories were. Who are the top three teams of all time? And who were the top three teams in the 1980s?
Author: Michele Bassa
Datasets: Football Data in the KNIME Community Hub
Solution Summary: After reading the football data and determining wins, losses, and ties, we create a data app that allows users to pick a temporal interval and then check which three teams had the most victories.
Solution Details: We start our solution by reading the football data with the CSV Reader node, transforming dates into Date format in the node's Transformation tab. Next, we use the Rule Engine node to determine wins, losses, and ties for home teams. This data is then sent to a component (data app) that allows for the temporal filtering of the data. Two instances of the Date&Time Widget node let users select the start and end dates of a temporal period, for which a team ranking will be calculated. The selected dates are passed to two instances of the Date&Time-based Row Filter node, reducing the data to a specific period. After that, two parallel branches use the Row Filter, Column Filter, and GroupBy nodes to select those matches in which the home team (top branch) or away team (bottom branch) wins. Both victory numbers are combined with the Joiner node, and then the Top k Row Filter node selects the top three teams for the selected period. This information is then plotted with the Bar Chart node.
Level: Medium
Description: As a member of a think tank, your task is to craft a report on LGBTQIA+ representation in political discourse. Given an EU dataset gathering responses from LGBTQIA+ individuals across all member states, you decide to start your work by investigating the answers to the following question: "In your opinion, how widespread is offensive language about lesbian, gay, bisexual, and/or transgender people by politicians in the country where you live?".
Use a map to present the results effectively.
Author: Michele Bassa
Datasets: LGBTQIA+ Survey Data in the KNIME Community Hub
Solution Summary: To tackle this challenge, we reduce the scope of the data to the question "In your opinion, how widespread is offensive language about lesbian, gay, bisexual and/or transgender people by politicians in the country where you live?". We then filter the answers and only keep the most common ones: "rare" and "widespread". This facilitates the understanding of trends and patterns across countries. We compute the percentages of the answer "widespread" for every country and also compute their map coordinates. Finally, we join the geospatial information and the computed percentages and plot them on a map.
Solution Details: After reading the survey dataset with the CSV Reader node, we prepare the data by reducing it to question "In your opinion, how widespread is offensive language about lesbian, gay, bisexual and/or transgender people by politicians in the country where you live?", and to its two most common answers, "rare" and "widespread". We also group the data by country, keeping the totals for both answers. We loop over this data (Group Loop Start and Loop End nodes) to compute the percentages of answer "widespread" for every country, using the Math Formula node (we compute the denominator for these percentages with the Moving Aggregator node). Next, we run another loop (Table Row to Variable Loop Start and Loop End nodes) to find the map coordinates of each country with the OSM Boundary Map node. We join the previously calculated percentages to the data with the map coordinates (Joiner node), use the Projection node to improve formatting for visualization, filter irrelevant data with the Row Filter node, and then finally plot the computed information with the Geospatial View node.
Level: Medium
Description: You work for the United Nations and want to discuss how the causes of death vary across the European Union (EU). You know how to analyze data and generate insightful visualizations, but the data you have at hand is a bit challenging: the meaning of its different columns and codes is not clear. To conclude your work well, you will have to integrate this data with some metadata in XML format, making sense of the different death causes and data attributes. What patterns can you find in the different countries?
Author: Emilio Silvestri
Datasets: Demographic Data from the EU in the KNIME Community Hub
Solution Summary: Our solution to this challenge can be split into two steps. First, we identify the code for the top cause of death in each country, regardless of sex or age; next, we match these codes with metadata describing what they are and sort the countries based on these descriptions. For 27 (out of 35) countries, "diseases of the circulatory system" is the main cause of death; for 8 (out of 35) countries, the top cause of death is "neoplasms".
Solution Details: With the CSV Reader node, we ingest the dataset on EU death causes in 2021. Next, with a series of Column Filter and Row Filter nodes, we reduce the dataset to what is pertinent to the analysis. It lists codes for causes of death per country regardless of sex and age. We then use a loop (Group Loop Start and Loop End nodes) to identify the top code for cause of death per country, employing the Top k Row Filter node. At the end of this branch, we have the codes that correspond to top death causes all over the EU, but cannot make sense of them yet. To this end, in parallel, we ingest metadata on the death causes with the XML Reader node. Using a series of XPath nodes, we extract column names, descriptions, and other values from the metadata. The descriptions and values come in lists, and to facilitate their subsequent matching with death cause codes from the original dataset, we use the Ungroup node to break the lists into single tokens. We filter the resulting data to only keep rows that correspond to causes of death (Row Filter node), and then use the Value Lookup node to match these causes with their codes in the original dataset. Finally, we sort the data with the Sorter node and conclude that the top cause of death in most EU countries has to do with diseases of the circulatory system.
Level: Easy
Description: You are a real estate agent working in a new city, and to perform well your first task involves understanding the houses in the region better. A colleague shares a dataset with you and now it’s time for you to explore it. What has been the average housing price, lot size (in acres), and living space (in sqft) in this city, according to her dataset? How are prices distributed and correlated with housing features? What other insights can you gather from this dataset?
Author: Thor Landstrom
Dataset: Real Estate Data in the KNIME Community Hub
Solution Summary: To tackle this challenge, we compute some general statistics of the dataset such as average price, lot size, and living space. We also calculate the linear correlation for all pairs of numerical features, uncovering which housing attributes are most strongly correlated with price. On average, central Seattle is the priciest area in the region, but there are a few other relevant clusters to the south and to the east.
Solution Details: After ingesting the housing data with the CSV Reader node, we compute Pearson's linear correlation for all pairs of numerical attributes with the Linear Correlation node. The results are plotted with the Heatmap (JavaScript) node, revealing which housing attributes relate the most to price. In parallel, we use the Column Filter node to remove unnecessary columns, and convert the lot size information into acres with the Math Formula node. We use the Statistics View node to get important housing summaries, including average lot size and price, and group the data by zipcode with the GroupBy node. In the aggregation, we calculate the average housing price per zipcode and the median latitude and longitude values. The Lat/Lon to Geometry node uses the median values per zipcode to generate geometries, which are then visualized with the Spatial Heatmap node.
Level: Easy
Description: You are a climate scientist studying CO₂ emissions. To make your research insights more accessible to your colleagues, and then write a paper about it, you decide to build a report-enabled component in KNIME that allows users to check how emissions vary for different regions and sources. What are the most alarming insights illustrated in such a report?
Authors: Armin Ghassemi Rudd and Marina Kobzeva
Dataset: CO₂ Emissions Data in the KNIME Community Hub
Solution Summary: To tackle this challenge, we manually select the country that ranks highest in terms of CO₂ emissions and create a PDF report showing its historical emissions, how they vary per capita throughout the years, and what sources they are mostly tied to. Different countries can be selected based on their ranking, leading to different visualizations and reports.
Solution Details: After reading the dataset with the Table Reader node, we use the Row Filter node to select a country based on its CO₂ emissions' ranking. Next, we finish our preprocessing by using the Number Format Manager node, selecting how many decimals we want to use in the CO₂ and CO₂ per capita numbers of our report. We create a component named "Report" that contains a few visualizations for our data: two line plots (Line Plot node) for the historical emissions of CO₂ and CO₂ per capita, and a bar chart (Bar Chart node) showing a breakdown of these emissions for different sources. To turn these visualizations into a PDF report, we feed this component with a report template (A4 Landscape) that is specified with the Report Template Creator node. After the component executes, its visualizations are saved as a PDF report with the Report PDF Writer node.
Level: Easy
Description: You work in finance and one of your clients wants to understand the value of different company stocks over time. Given a dataset of stock prices, you decide to use simple moving averages (window length = 20) to tackle this task. What companies have an upward trend for the most recent data? And what companies have a downward trend?
Author: Thor Landstrom
Dataset: Stock Data in the KNIME Community Hub
Solution Summary: We propose two different solutions to this challenge. The simpler one involves manually filtering the data for a specific company, calculating its moving average, and then visualizing it with a line plot. The second one relies on a simple data app: a company is selected from a dropdown box, its stock prices are filtered, a moving average is computed, and the resulting points are plotted as a line plot.
Solution Details: Both solutions have a core part in common. After the rows for a company are selected, we use the Column Filter node to isolate dates and close prices, do some typecasting with the String to Date&Time node, sort the data from oldest to most recent with the Sorter node, and then use the Moving Average node to compute simple moving averages (window length = 20). Next, we visualize the results with the Line Plot node. In the simpler solution, we use the configuration of the Row Filter node to select the data for a company. In the more complex solution, we get all company names with the "Get company names" metanode, and then pass them, along with the original data, to the "Visualize company stock prices" component. Inside this component, a Single Selection Widget node allows the selection of one of the company names, which in turn is used to control an instance of the Row Filter node. After that, this solution is basically equivalent to the simpler one.
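The shared core of both solutions corresponds to a few lines of pandas: filter one company, sort by date, and compute a 20-point simple moving average of the closing price. The file name, company, and column names below are illustrative.

```python
import pandas as pd

stocks = pd.read_csv("stock_prices.csv", parse_dates=["date"])   # hypothetical file and columns
one_company = (
    stocks[stocks["company"] == "ACME"]        # pick one company
    .sort_values("date")                       # oldest to most recent
    .assign(sma_20=lambda d: d["close"].rolling(window=20).mean())   # 20-point moving average
)

one_company.plot(x="date", y=["close", "sma_20"])   # line plot of price vs. moving average
```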
⇒ We post a challenge on Wednesday.
⇒ You create a solution with KNIME Analytics Platform.
⇒ Upload it to your public KNIME Community Hub Space.
⇒ Check your rank on the Just KNIME It Leaderboard.
Our solution to the challenge comes out on the following Tuesday.
These challenges are a great way of preparing for our certifications.
KNIME community members are working hard to solve the latest "Just KNIME It!" challenge - and some of you have solved dozens of them already! Who are the KNIME KNinjas who have completed the most challenges? Click over to the leaderboard on the KNIME Forum to find out! How many challenges have you solved?