To replace the null or not replace the null?

that is my question. When you have nulls in a databse, do you replace them with “unknown” or other values? If so when and why? We’ve been discussing internal.

A few notes on it we’ve found so far

Kimball

Kimball outlines 3 main areas where you’ll run into nulls

and advocates for replacing null Dimensions

Putting a null in these fields can be confusing to the user, as it will appear as a blank on reports and pull-down menus, and require special query syntax to find. Instead, we recommend substituting an appropriately descriptive string, like “Unknown” or “Not provided.”

But I think most BI products (atleast Chartio) handle nulls fine these days - we don’t just have it “appear as a blank” in menus.

Fishtown

In How we Structure our DBT Projects the Fishtown team alludes to actually creating more nulls!

  • Light cleansing, such as replacing empty string with NULL values, has occurred.

How about you? Any frameworks for making the call here?

At Fractal River, we do use NULL values within the data warehouse layers - at least in Staging, Prep, and Intermediate. I believe NULL values are very useful as they behave differently than sentinel values, avoid potential conflicts/bugs, and allow you to use constructs like count(), proper comparisons, etc.

The suggestion of inserting “Unknown” into NULL values, for example, feels like something that should be done only if necessary, and as close to the Visualization layer (Marts or better yet, within the BI tool itself - LookML, for example).

FWIW, As a Kimball purist, I used to always replace NULL values. However, once I started using Looker at my reporting tool, it became less necessary to do that since Looker displays everything nicely to the user.

In an ideal world, I would prefer no nulls in my database. The problem for me is if I see a null in a field, is it because a) its intentionally null b) something went wrong in the data engineering? There really isn’t a good way to tell from just looking at the data so you need to trace back through the lineage to the source.

I wish there were standard values for “business null” and “physical null”

I have worked with clients that have a mixed BI tool environment so we put add a layer of report ready tables within the warehouse. And it was in that layer that we would replace dimension NULLs with ’
Unknown’. That put the same business logic in all of the tools and then if a value was NULL it was purposeful for the end user.