One of our key goals as data analysts is to merge multiple data sets into a single
data set. The way we do this is by querying our data (also referred to as merging or
blending data). At the end of the day, if you understand the basic concepts behind
writing a query, then you can master any tool that is used to arrange that data into
a single data set.
Querying Data
You may find that the first step you perform as an analyst is querying the data.
You will create your data set and then you will begin the data transformations,
like combining names or creating functions. You will use joins when working
with any two or more data sets that need to be combined into a single data set
for reporting.
Let’s start with a list of invoices that we have in our database. This invoice table contains
very valuable invoice details, but it’s missing customer details. It only contains the
CustomerID, which is a foreign key (meaning CustomerID is the primary key in the
customer table). For the data set to be more meaningful, it will need to include the
customer’s name and their geography details, so that we cannot only report on these
elements but also map that information in later visualizations. Because the information
exists in two different tables, we will use a query to join them.
Consider these two data sets. Each table contains only the data it is designed to
store for efficiency’s sake. To combine this data into a single, more meaningful set,
it must include the necessary key fields from each table. In our example, we can
use the CustomerID field in each of the data sets to join the customers to the order
information.
We bring this data into a query either through an SQL statement or a query
designer, such as what’s used in tools like SQL Server Management Studio,
Microsoft Access, or Tableau. You will leverage the key fields, which in this example
is the CustomerID, to bring all the records together. Below you will see a screenshot
of one such query designer in the process of merging the two data sets.
When a relationship in the database exists between the tables you are using in your
query, it will automatically show the join line between the two tables on the key
field. When a relationship doesn’t exist, the analyst will need to create the join line
by dragging and dropping the key field from one table to the key field in the other
table. In the case of our screenshot, we would drag and drop CustomerID from
the Customers table to CustomerID in the Orders table. The initial join type will be
an inner join. This join type defaults to showing records that exist in both tables,
which means it will only show customers that have made orders (and will not show
customers that have not made an order).
Types Of Joints
There are multiple join types that you can use when querying data, and it can be
hard to understand which join would be best in any given situation. The decision
about which join type to use should be based on what results in you need from your
data set.
- Cross join/Cartesian join: Data doesn’t have a direct join on a key field.
- Inner join: Data is joined so that only records that exist in both tables appear in
the result. - Left outer join: The left table displays all results of the left table, while only
matching records in the other (right) table appear in the result. - Right outer join: The right table displays all results of the right table, while only
matching records in the other (left) table appear in the result. - Full outer join: Data is joined so that all records, matched or unmatched, show
in the results.
The Impact of Each Join to Data
Because joins ultimately control what data shows in the results, it is important to
understand the impact of each type. Let’s return to our example, in which we want
to query data from the customers data set and the invoices data set, and break
down each outcome using different join types
Cross Join
When tables are added to a query and no join is specified, the end result will be that
each record in each table is tied together. In the case of our query above, this would
associate each customer to every order in the orders table, meaning every order that
has been placed will be associated to every customer regardless of whether its their
order. When it does not know how to join the records through the key, it will default to
joining every record to every record. Because of this, cross joins are rarely used.
Inner Join
Inner joins are typically the default join type for most programs that query. The
result of an inner join for this data set is that for every customer who has placed an
order, there will be a record in the results. This means that by using an inner join,
you will automatically exclude any customer who has not placed an order
Left Outer Join
Left outer joins specify that all records from the table on the left side and only
matching records from the right table show in the results. In this case, the left table
is the Customer table, which means all customers will show in the results whether
they have no orders, one order, or multiple orders
When a customer who has no order shows up in the results, it will produce a null
for all the order fields (because that order doesn’t exist).
Right Outer Join
Right outer joins function the same way as left outer joins, except in this type all
records from the right table and only matching records in the left are shown in the
results. This join type will also show Nulls when a match doesn’t exist. With this join
type, the results would make clear any instances where we have orders that did not
have a customer record.
Full Outer Join
A full outer join is like creating a left outer and right outer at the same time,
meaning it displays the same results you would see if you run both a left and right
join query, but in one single set. This is a great way to troubleshoot potential bad
records, or records due to be corrected. The result here would show all customers
regardless of their orders, and also all orders.
Conclusion
We have got to learn how to query multiple data at one go. We also have learned what are joints, what is the use of joints, what are the types of joints, and what each joint describes.