Picture of Insights | Joining data sets |II_004

Insights | Joining data sets |II_004

In this tutorial, we'll take a look at how to link data within your standalone data tables within your platform. For example, in this case we wish to display the name of each product ordered and the customer that ordered it. These two pieces of data do not exist in the same data set, so we need to link data sets to achieve this. This example will require the linking of 4 data sets: Order Orderline Customer Product

Tutorial Video Transcript

A transcript of our tutorial video, ensuring you can find exactly what you need, when you need it.

You can search this page to find the relevant time-stamp in the video. Also, this text can be used as part of the tutorial search feature.

00:00:07:04 - 00:00:09:22
In this tutorial, we'll take a look
at how to link data

00:00:09:22 - 00:00:13:18
within your standalone data tables
within your platform.

00:00:13:20 - 00:00:16:21
For example, in this case
we wish to display

00:00:16:21 - 00:00:19:21
the name of each product
that's been ordered.

00:00:19:27 - 00:00:21:16
And the next to that we want a column

00:00:21:16 - 00:00:25:12
which has the customer name or customer
email that ordered it.

00:00:25:14 - 00:00:29:17
Now, these two pieces of data
do not exist in the same dataset,

00:00:29:19 - 00:00:33:19
so we need to link
data sets to actually achieve this

00:00:33:21 - 00:00:36:08
combination of data.

00:00:36:08 - 00:00:39:13
And this example would require the linking
of four data sets.

00:00:39:21 - 00:00:41:25
So we're going to have order,

00:00:41:25 - 00:00:50:19
we have order line customer and product.

00:00:50:22 - 00:00:53:04
So we'll start off
by making a new question

00:00:53:04 - 00:00:56:26
from within Metabase.

00:00:56:28 - 00:01:01:03
And then I'm going to
use one of my raw data sets

00:01:01:06 - 00:01:02:04
and I'm going to start off

00:01:02:04 - 00:01:06:06
just with my order line information.

00:01:06:08 - 00:01:07:16
So this will show me information

00:01:07:16 - 00:01:12:19
on all of the individual order lines.

00:01:12:21 - 00:01:15:27
Now we want to refine the displayed
column list.

00:01:16:00 - 00:01:18:01
So we're using this little arrow
icon here.

00:01:18:01 - 00:01:22:22
You can choose the columns of this dataset
to be displayed in the final visual.

00:01:22:25 - 00:01:26:08
This helps to reduce the sheer
amount of data that's being displayed,

00:01:26:10 - 00:01:29:10
particularly
when you start to link datasets together

00:01:29:15 - 00:01:33:01
as you start to get a vast amount
of columns and repeated data

00:01:33:01 - 00:01:38:09
a lot of the time as well.

00:01:38:12 - 00:01:40:21
We're then going to click on the Join Data

00:01:40:21 - 00:01:44:10
Icon so we can select another dataset
to link to our already

00:01:44:10 - 00:01:48:00
selected dataset.

00:01:48:03 - 00:01:56:02
And in this case,
I'm going to go for product.

00:01:56:05 - 00:01:59:01
The next
step is to identify matching columns.

00:01:59:01 - 00:02:03:14
So when joining two datasets, you
will need to identify a column in each set

00:02:03:16 - 00:02:05:09
which has matching data.

00:02:05:09 - 00:02:09:25
So in this case, for example, and
it's done it automatically, in this case

00:02:09:28 - 00:02:12:26
we can link
the product ID column of the order

00:02:12:26 - 00:02:16:23
line table to the ID
column of the product table.

00:02:16:29 - 00:02:19:29
Now, those are all the same
bits of information in both tables.

00:02:20:00 - 00:02:23:20
And by linking those together,
it knows how to relate one set of data

00:02:23:23 - 00:02:29:12
to the other sets of data.

00:02:29:15 - 00:02:31:13
Again, with our newly joined dataset,

00:02:31:13 - 00:02:35:19
we can choose the information
that we wish to display.

00:02:35:22 - 00:02:42:25
In this case,
I might just want the product name.

00:02:42:28 - 00:02:43:17
And then again,

00:02:43:17 - 00:02:47:21
I'm going to join another dataset so
multiple datasets can be linked together.

00:02:47:22 - 00:02:50:15
You aren't just limited to

00:02:50:17 - 00:02:52:12
the same process.

00:02:52:12 - 00:02:55:07
This time I'm going to go for my order

00:02:55:07 - 00:02:59:27
information,

00:02:59:29 - 00:03:02:29
look at the data
I want to display for the order

00:03:03:03 - 00:03:06:09
and then make sure I've got a link
between appropriate

00:03:06:09 - 00:03:13:05
columns.

00:03:13:08 - 00:03:16:18
Now, the order table is useful
in this respect as we're able to identify

00:03:16:18 - 00:03:20:21
the link between the order line
and the order in which it was part of.

00:03:20:24 - 00:03:23:15
And this will eventually help us
identify the customer

00:03:23:15 - 00:03:33:01
which is associated with the order itself.

00:03:33:03 - 00:03:35:29
Going to join one final dataset.

00:03:35:29 - 00:03:38:19
This is going to be our customer dataset.

00:03:38:19 - 00:03:41:09
Now again,
I'm going to repeat the process.

00:03:41:09 - 00:03:44:04
You can see this time
that it's not been able to actually

00:03:44:04 - 00:03:47:16
automatically define
a link between the datasets.

00:03:47:18 - 00:03:49:29
So what I need to do
is to manually pick these up.

00:03:49:29 - 00:03:52:16
So I'm starting off with

00:03:52:18 - 00:03:56:02
my previous
results and something to link to customer.

00:03:56:02 - 00:03:58:23
So I'm going to need
customer ID from somewhere.

00:03:58:23 - 00:04:01:16
Now I know this customer ID
is not in the order

00:04:01:16 - 00:04:05:05
line dataset,
which is identifying by default.

00:04:05:07 - 00:04:08:00
I'm going to collapse that list,

00:04:08:00 - 00:04:10:14
find my appropriate dataset.

00:04:10:14 - 00:04:13:14
So order ID

00:04:13:19 - 00:04:23:19
then in here I should be able to find

00:04:23:22 - 00:04:27:11
customer ID.

00:04:27:13 - 00:04:30:13
Then within the customer table
it's simply the ID number.

00:04:30:13 - 00:04:34:22
And then we've formed our link
between the datasets.

00:04:34:25 - 00:04:37:09
Again, I'm going to control
what information is displayed.

00:04:37:09 - 00:04:39:20
I don't need all of my columns
in this case.

00:04:39:20 - 00:04:47:20
I just want my email

00:04:47:23 - 00:04:48:13
and that's good.

00:04:48:13 - 00:04:50:24
So that's all the data joining.
We need to do.

00:04:50:24 - 00:04:53:17
The final thing I'm going to do in
my example is the same I've done in

00:04:53:17 - 00:04:56:06
previous
ones is just to cut down the results.

00:04:56:06 - 00:05:00:04
So we're only seeing results
from one individual storefront.

00:05:00:06 - 00:05:04:01
This is not a necessary step
if you are wanting to see data

00:05:04:01 - 00:05:07:26
from all of the storefronts.

00:05:07:28 - 00:05:08:05
And I'm

00:05:08:05 - 00:05:11:25
just going to go ahead and click Visualize

00:05:11:27 - 00:05:15:00
and what you can see there
because of all that joined up data,

00:05:15:00 - 00:05:20:07
we have a single table displayed combining
data from those multiple datasets.

00:05:20:09 - 00:05:25:20
So we have the product name,
which is coming from the product table,

00:05:25:22 - 00:05:30:12
and we're associating that with the email
by going through those alternate dataset.

00:05:30:14 - 00:05:35:06
So from products we've gone to order
line from order line we're going to order.

00:05:35:08 - 00:05:37:09
And then for order we've gone to customer.

 

Incomplete