Analyzing a column with unique, n_distinct, and table

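This tutorial shows several ways to summarize a single column: listing its unique values, counting how many distinct values it has, and tabulating how often each value occurs.

Let’s bring in a sample dataset.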

data("iris")
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Let’s look at the unique species names.

unique(iris$Species)

## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica
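The "Levels" line in that output is a reminder that Species is a factor, and a factor keeps all of its levels even when some no longer appear in the data. The sketch below illustrates the difference on a subset; the name setosa_only is just an illustrative choice.

```r
data("iris")

# Keep only the setosa rows (setosa_only is an illustrative name)
setosa_only <- iris[iris$Species == "setosa", ]

# unique() returns the one value present, still carrying all three levels
unique(setosa_only$Species)

# levels() reports every level the factor was defined with
levels(setosa_only$Species)

# Convert to character to get plain values without the levels attached
as.character(unique(setosa_only$Species))

# droplevels() removes levels that no longer occur
levels(droplevels(setosa_only$Species))
```

If you only care about the values that actually occur, unique() (or droplevels() first) is usually the safer choice than levels().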

Let’s see how many distinct species are in the iris dataset. This is much more useful on a bigger dataset, where you can’t count the unique values by eye.

library(tidyverse)  # loads dplyr, which provides n_distinct()

n_distinct(iris$Species)

## [1] 3
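If you would rather not load the tidyverse for a single count, the same number can be obtained in base R by combining length() and unique(). The sketch below also shows how missing values affect the count; the vector x is made up purely for illustration.

```r
data("iris")

# Base R equivalent of n_distinct(iris$Species)
length(unique(iris$Species))

# Hypothetical vector with a missing value, for illustration only
x <- c("a", "b", "a", NA)

# NA is counted as its own distinct value here
length(unique(x))

# Drop NA first if missing values should not count
length(unique(x[!is.na(x)]))
```

n_distinct() also takes an na.rm argument, so n_distinct(x, na.rm = TRUE) should give the same result as the last line.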

Let’s break down the species by the number of times each one shows up in the dataset.

table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50
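Because every species in iris appears exactly 50 times, the counts are not very interesting to sort. As a rough sketch, here is the same idea on the built-in mtcars dataset (my choice, not part of the original example), where the counts differ, plus a way to turn a table into a data frame for further work.

```r
data("mtcars")
data("iris")

# Count how many cars have each number of cylinders
cyl_counts <- table(mtcars$cyl)

# Put the most frequent value first
sort(cyl_counts, decreasing = TRUE)

# Convert a table to a data frame; for a one-way table the
# columns are named Var1 (the value) and Freq (the count)
species_df <- as.data.frame(table(iris$Species))
species_df
```

Sorting is handy when a column has many distinct values and you only want to see the most common ones.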

Let’s now break the counts down by proportion. Note that prop.table() returns fractions that sum to 1, not percentages.

prop.table(table(iris$Species))

## 
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333
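To show these as percentages, one simple approach is to multiply the proportions by 100 and round. This is a minimal sketch; the name species_pct and the choice of one decimal place are arbitrary.

```r
data("iris")

# Proportions scaled to percentages, rounded to one decimal place
species_pct <- round(100 * prop.table(table(iris$Species)), 1)
species_pct
```

Keep in mind that rounded percentages may not add up to exactly 100.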