Scenario
You work for an AI research organization that tracks research papers and their contributing authors.
Task
Write a Snowflake SQL query that:
- Joins
{{ ref("authors") }} with {{ ref("research_papers") }} on paper_id
- Assigns a sequential
row_number to each author, partitioned by paper_id and ordered by author_id
- Returns the columns:
paper_id, author_id, name, row_number
Schema
research_papers
| Column |
Type |
Description |
| paper_id |
String |
Unique identifier for the paper |
| title |
String |
Title of the research paper |
| year |
Integer |
Year the paper was published |
authors
| Column |
Type |
Description |
| paper_id |
String |
Paper this author contributed to |
| author_id |
String |
Unique identifier for the author |
| name |
String |
Full name of the author |
Example
research_papers:
| paper_id |
title |
year |
| P1 |
Scalable Data Pipelines |
2021 |
| P2 |
Edge Computing Strategies |
2023 |
| P3 |
Privacy in Distributed Systems |
2022 |
authors:
| paper_id |
author_id |
name |
| P1 |
A1 |
Nora Chen |
| P1 |
A2 |
Leo Park |
| P2 |
A3 |
Diana Ruiz |
| P2 |
A4 |
Marcus Holt |
| P2 |
A5 |
Priya Nair |
| P3 |
A6 |
Sam Ortega |
| P3 |
A7 |
Tina Farah |
Expected Output:
| paper_id |
author_id |
name |
row_number |
| P1 |
A1 |
Nora Chen |
1 |
| P1 |
A2 |
Leo Park |
2 |
| P2 |
A3 |
Diana Ruiz |
1 |
| P2 |
A4 |
Marcus Holt |
2 |
| P2 |
A5 |
Priya Nair |
3 |
| P3 |
A6 |
Sam Ortega |
1 |
| P3 |
A7 |
Tina Farah |
2 |
Note: The row_number resets to 1 for each new paper_id. Within a paper, authors are numbered in ascending order by author_id.
SELECT
a.paper_id,
a.author_id,
a.name,
ROW_NUMBER() OVER (
PARTITION BY a.paper_id
ORDER BY a.author_id
) AS row_number
FROM {{ ref("authors") }} a
INNER JOIN {{ ref("research_papers") }} rp
ON a.paper_id = rp.paper_id
Explanation
Step 1: Joining the two tables
FROM {{ ref("authors") }} a
INNER JOIN {{ ref("research_papers") }} rp
ON a.paper_id = rp.paper_id
An INNER JOIN links each author row to its corresponding paper. This ensures only authors with a valid paper appear in the result. If an author referenced a paper_id that did not exist in research_papers, that row would be excluded.
Step 2: Assigning row numbers with ROW_NUMBER()
ROW_NUMBER() OVER (
PARTITION BY a.paper_id
ORDER BY a.author_id
) AS row_number
ROW_NUMBER() is a window function that produces a sequential integer for each row within a partition. PARTITION BY a.paper_id resets the counter to 1 for every new paper. ORDER BY a.author_id controls the sequence within each partition, so authors are numbered in ascending ID order.
Step 3: ROW_NUMBER vs RANK vs DENSE_RANK
Because author_id values are unique within a paper, ROW_NUMBER(), RANK(), and DENSE_RANK() would all produce the same result here. The distinction matters when there are ties: RANK() leaves gaps after ties (1, 2, 2, 4), DENSE_RANK() does not (1, 2, 2, 3), and ROW_NUMBER() always assigns distinct values even for tied rows. For this problem any of the three would work, but ROW_NUMBER() is the most explicit choice for sequential numbering.
Step 4: Alternative with CTE
You can also wrap the query in a CTE for readability:
WITH joined AS (
SELECT
a.paper_id,
a.author_id,
a.name,
ROW_NUMBER() OVER (
PARTITION BY a.paper_id
ORDER BY a.author_id
) AS row_number
FROM {{ ref("authors") }} a
INNER JOIN {{ ref("research_papers") }} rp
ON a.paper_id = rp.paper_id
)
SELECT * FROM joined
CTEs become more useful when you need to chain multiple transformation steps on top of the result.