It is our collective responsibility to enforce this Style Guide.
We use SQLFluff as our linter which enforces a majority of our Style Guide, although there are still some limitations.
Over time, we expect an increasing number of these guidelines will be handled by the SQLFluff lint checks. Areas that are known to be checked by SQLFluff are marked as such.
The Dev Standards guide now has guidance for SQL code, including SQLFLuff settings file defaults.**
SQLFLuff has a Rules Reference with descriptions of all rules and many helpful examples. We recommend everyone read or scan through the rules documentation (and the other SQLFluff docs) at least once.
You can also become familiar with these rules by installing the SQLFluff VS Code extension which will give real time lint feedback and which has autoformat capabilities.
These are enforced by SQLFLuff:
These are enforced by SQLFLuff:
An id
, name
, or generally ambiguous value such as type
should always be prefixed by what it is identifying or naming
-- Good
SELECT
id AS account_id,
name AS account_name,
type AS account_type,
...
-- Bad
SELECT
id,
name,
type,
...
When joining to any data from a different source, a field should be prefixed with the data source, e.g. sfdc_account_id
, to avoid ambiguity
-- Good
SELECT
sfdc_account.account_id AS sfdc_account_id,
zuora_account.account_id AS zuora_account_id
FROM sfdc_account
LEFT JOIN zuora_account ON ...
-- Bad
SELECT
sfdc_account.account_id,
zuora_account.account_id AS zuora_id
FROM sfdc_account
LEFT JOIN zuora_account ON ...
When joining tables and referencing columns from both, strongly prefer to reference the full table name instead of an alias. When the table name is long (~20), try to rename the CTE if possible, and lastly consider aliasing to something descriptive.
-- Good
SELECT
budget_forecast_cogs_opex.account_id,
-- 15 more columns
date_details.fiscal_year,
date_details.fiscal_quarter,
date_details.fiscal_quarter_name,
cost_category.cost_category_level_1,
cost_category.cost_category_level_2
FROM budget_forecast_cogs_opex
LEFT JOIN date_details
ON date_details.first_day_of_month = budget_forecast_cogs_opex.accounting_period
LEFT JOIN cost_category
ON budget_forecast_cogs_opex.unique_account_name = cost_category.unique_account_name
-- Ok, but not preferred. Consider renaming the CTE in lieu of aliasing
SELECT
bfcopex.account_id,
-- 15 more columns
date_details.fiscal_year,
date_details.fiscal_quarter,
date_details.fiscal_quarter_name,
cost_category.cost_category_level_1,
cost_category.cost_category_level_2
FROM budget_forecast_cogs_opex bfcopex
LEFT JOIN date_details
ON date_details.first_day_of_month = bfcopex.accounting_period
LEFT JOIN cost_category
ON bfcopex.unique_account_name = cost_category.unique_account_name
-- Bad
SELECT
a.*,
-- 15 more columns
b.fiscal_year,
b.fiscal_quarter,
b.fiscal_quarter_name,
c.cost_category_level_1,
c.cost_category_level_2
FROM budget_forecast_cogs_opex a
LEFT JOIN date_details b
ON b.first_day_of_month = a.accounting_period
LEFT JOIN cost_category c
ON b.unique_account_name = c.unique_account_name
All field names should be snake-cased
-- Good
SELECT
dvcecreatedtstamp AS device_created_timestamp,
account_id
FROM table
-- Bad
SELECT
dvcecreatedtstamp AS DeviceCreatedTimestamp,
account_id
FROM table
Boolean field names should start with has_
, is_
, or does_
-- Good
SELECT
deleted AS is_deleted,
sla AS has_sla
FROM table
-- Bad
SELECT
deleted,
sla
FROM table
When transforming source data, use double quotes to identify case sensitive columns or columns that contain special characters different than “$” or “_”. Double quotes aren’t needed for capitalized field names, as this is how Snowflake identifiers are handled internally.
-- Good
SELECT "First_Name_&_" AS first_name,
-- Bad
SELECT "FIRST_NAME" AS first_name,
Dates
_at
, e.g. deal_closed_at
, and should always be in UTC_date
, e.g. deal_closed_date
deal_closed_month
date
or month
as a column namedate_part
, but prefer date_part
over extract
, e.g. DAYOFWEEK(created_at)
> DATE_PART(dayofweek, 'created_at')
> EXTRACT(dow FROM created_at)
date_trunc('month', created_at)
will produce the calendar month (‘2019-01-01’ for ‘2019-01-25’) while SELECT date_part('month', '2019-01-25'::date)
will produce the number 1SELECT DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')
returns 1
even though the timestamps are different by one millisecond.SELECT DATEDIFF('days', '2001-12-01 00:00:00.001', '2001-12-01 23:59:59.999')
return 0
even though the timestamps are nearly an entire day apart.DATEDIFF
function will ensure you are getting the right results. For example, DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')
will provide a 1 day interval
and DATEDIFF('ms', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')
will provide a 1 millisecond interval
.replace_sfdc_account_id_with_master_record_id
and prefer a shorter name with a comment in the CTE. This will help avoid table aliasing in joinsCTEs should be formatted as follows:
WITH events AS ( -- think of these select statements as your import statements.
...
), filtered_events AS ( -- CTE comments go here
...
)
SELECT * -- you should always aim to "select * from final" for your last model
FROM filtered_events
Within a CTE, the entire SQL statement should be indented 4 spaces
-- Good
WITH my_data AS (
SELECT *
FROM prod.my_data
WHERE filter = 'my_filter'
)
-- Bad
WITH my_data AS (
SELECT *
FROM prod.my_data
WHERE filter = 'my_filter'
)
Indentation within a query (e.g. columns, JOIN
clauses, multi-line GROUP BY
, etc.) should also be 4 spaces
-- Good
SELECT
table_1.column_name1,
table_1.column_name2,
table_1.column_name3
FROM table_1
JOIN table_2
ON table_1.id = table_2.id
WHERE table_2.clouds = TRUE
AND table_2.gem = TRUE
GROUP BY 1, 2, 3
HAVING table_1.column_name1 > 0
AND table_1.column_name2 > 0
-- Bad
SELECT
column_name1,
column_name2,
column_name3
FROM table_1
JOIN table_2
ON table_1.id = table_2.id
WHERE clouds = true
AND gem = true
GROUP BY 1,2,3
HAVING column_name1 > 0
AND column_name2 > 0
When SELECT
ing, always give each column its own row, with the exception of SELECT *
which can be on a single row
DISTINCT
should be included on the same row as SELECT
The AS
keyword should be used when projecting a field or table name
Fields should be stated before aggregates / window functions
Ordering and grouping by a number (eg. GROUP BY 1, 2) is preferred
group_by
macroPrefer WHERE
to HAVING
when either would suffice
Prefer accessing JSON using the bracket syntax, e.g. data_by_row['id']::bigint as id_value
Never use USING
in joins because it produces inaccurate results in Snowflake. Create an account to view the forum discussion on this topic.
Prefer UNION ALL
to UNION
. This is because a UNION
could indicate upstream data integrity issue that are better solved elsewhere.
Prefer !=
to <>
. This is because !=
is more common in other programming languages and reads like “not equal” which is how we’re more likely to speak
Consider performance. Understand the difference between LIKE
vs ILIKE
, IS
vs =
, and NOT
vs !
vs <>
. Use appropriately
Prefer LOWER(column) LIKE '%match%'
to column ILIKE '%Match%'
. This lowers the chance of stray capital letters leading to an unexpected result
Familiarize yourself with the DRY Principal. Leverage CTEs, jinja and macros in dbt, and snippets in Sisense. If you type the same line twice, it needs to be maintained in two places
DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP. BRAIN TIME IS EXPENSIVE.
NUMBER
instead of DECIMAL
, NUMERIC
, INTEGER
, BIGINT
, etc.FLOAT
instead of DOUBLE
, REAL
, etc.VARCHAR
instead of STRING
, TEXT
, etc.TIMESTAMP
instead of DATETIME
The exception to this is for timestamps. Prefer TIMESTAMP
to TIME
. Note that the default for TIMESTAMP
is TIMESTAMP_NTZ
which does not include a time zone.
IFNULL
TO NVL
IFF
to a single line CASE
statementIFF
to selecting a boolean statement (amount < 10) AS is_less_than_ten
Consider simplifying a repetitive CASE
statement where possible:
-- OK
CASE
WHEN field_id = 1 THEN 'date'
WHEN field_id = 2 THEN 'integer'
WHEN field_id = 3 THEN 'currency'
WHEN field_id = 4 THEN 'boolean'
WHEN field_id = 5 THEN 'variant'
WHEN field_id = 6 THEN 'text'
END AS field_type
-- Better
CASE field_id
WHEN 1 THEN 'date'
WHEN 2 THEN 'integer'
WHEN 3 THEN 'currency'
WHEN 4 THEN 'boolean'
WHEN 5 THEN 'variant'
WHEN 6 THEN 'text'
END AS field_type
LEFT JOIN
instead of JOIN
. (Default joins are INNER
)Specify the order of a join with the FROM table first and JOIN table second:
-- Good
FROM source
LEFT JOIN other_source
ON source.id = other_source.id
WHERE ...
-- Bad
FROM source
LEFT JOIN other_source
ON other_source.id = source.id
WHERE ...
Putting it all together:
WITH my_data AS (
SELECT *
FROM prod.my_data
WHERE filter = 'my_filter'
),
some_cte AS (
SELECT DISTINCT
id,
other_field_1,
other_field_2
FROM prod.my_other_data
),
final AS (
SELECT
my_data.field_1 AS detailed_field_1,
my_data.field_2 AS detailed_field_2,
my_data.detailed_field_3 AS detailed_field_3,
my_data.data_by_row AS id_field,
CASE
WHEN my_data.cancellation_date IS NULL
AND my_data.expiration_date IS NOT NULL
THEN my_data.expiration_date
WHEN my_data.cancellation_date IS NULL
THEN my_data.start_date + 7
ELSE my_data.cancellation_date
END AS cancellation_date,
LAG(my_data.detailed_field_3) OVER (
PARTITION BY
my_data.id_field,
my_data.detailed_field_1
ORDER BY cancellation_date
) AS previous_detailed_field_3,
SUM(my_data.field_4) AS field_4_sum,
MAX(my_data.field_5) AS field_5_max
FROM my_data
LEFT JOIN some_cte
ON my_data.id = some_cte.id
WHERE my_data.field_1 = 'abc'
AND (my_data.field_2 = 'def' OR my_data.field_2 = 'ghi')
GROUP BY 1, 2, 3, 4, 5
HAVING COUNT(*) > 1
ORDER BY 4 DESC
)
SELECT *
FROM final
--
syntax/* */
syntaxTODO
comments, create new issues for improvement