Rescaling Data in R

23rd September 2019

TR

A simple methodology for rescaling multiple series of data via min-max normalisation in R, then plotting.

Recently I had to work with a set of colour information data extracted from some photographs. The software programme I used extracted colour frequency data for the RGB (Red, Green, Blue), HSV (Hue, Saturation, Value) and L*a*b (Lightness, Green-Magenta, Blue-Yellow) channels and recorded them in a CSV file with the following structure:

barcode date line treatment trait value label
B2018010t0c00n153 5/09/2019 0 0 blue_frequencies 0 0
B2018010t0c00n153 5/09/2019 0 0 blue_frequencies 0 1
B2018010t0c00n153 5/09/2019 0 0 blue_frequencies 1 2
B2018010t0c00n153 5/09/2019 0 0 blue_frequencies 8 3
B2018010t0c00n153 5/09/2019 0 0 blue_frequencies 32 4
n …255

The trait attribute describes the channel that was being measured (so in the above case blue_frequencies is the blue channel from the RGB colour model). The trait column contained nine different descriptors:

• blue-yellow_frequencies (b)
• blue_frequencies (B)
• green-magenta_frequencies (a)
• green_frequencies (G)
• hue_frequencies (H)
• lightness_frequencies (L)
• red_frequencies (R)
• saturation_frequencies (S)
• value_frequencies (V)

For each of these, the label column contained a numeric value for frequency whilst the value column contained a count of the number of pixels in the image that were measured at that frequency within the named colour model.

My aim was to graph all nine histograms on a single plot in R. The problem was that the different colour models were all using different scales:

• The RGB model has a scale between 0 and 255 for red, green and blue.
• The L*a*b model has a scale between -128 and 128 for blue-yellow, green-magenta and lightness.
• The HSV model has a scale between 1 and 359 for hue and 0 and 100 for saturation and value.
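Before rescaling, it can help to confirm the range that each channel actually spans. The following is a minimal base R sketch on a made-up data frame (the name toy and its values are assumptions for illustration, not the real data), using the same trait and label column names as the CSV:

```r
# Hypothetical toy data: three channels on three different scales
toy <- data.frame(
  trait = rep(c("blue_frequencies",
                "green-magenta_frequencies",
                "saturation_frequencies"), each = 3),
  label = c(0, 128, 255,     # RGB-style scale
            -128, 0, 128,    # L*a*b-style scale
            0, 50, 100)      # saturation-style scale
)

# Minimum and maximum label for each trait (one column per trait)
ranges <- sapply(split(toy$label, toy$trait), range)
print(ranges)
```

Each column of ranges holds the smallest and largest label for one trait, which makes the mismatch between scales easy to see.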

Clearly plotting these on the same chart would result in a histogram that was very messy and difficult to read:

The solution was to rescale the data using a normalisation calculation to place all the values into bins. The mathematics for this is as follows:

$x' = a + \frac{(x-\min(x))(b-a)}{\max(x)-\min(x)}$

where x is the original value (label), x’ is the normalised value, a is the desired minimum value and b is the desired maximum value. Since I wanted to bin my values between 0 and 255, a=0 and b=255.

To achieve the same result, the equation can be simplified:

$x' = \frac{x-\min(x)}{\max(x)-\min(x)} \times 255$
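As a quick check of the formula, here is a small base R sketch. The function name rescale_minmax is my own label for illustration, not part of any package, and the input values are made up:

```r
# Hypothetical helper: min-max rescale a numeric vector to the range [a, b]
rescale_minmax <- function(x, a = 0, b = 255) {
  a + (x - min(x)) * (b - a) / (max(x) - min(x))
}

# Three labels on a -128..128 scale
lab_labels <- c(-128, 0, 128)
rescaled <- rescale_minmax(lab_labels)
print(rescaled)  # 0, 127.5, 255
```

Note that if every value in x is identical, max(x) - min(x) is zero and the result is NaN, so this sketch assumes each series spans more than one value.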

Normalising the data in R

The real challenge was to translate this equation into R code. Here’s how I did it using the dplyr package:

My table of values was loaded into a data frame called “table1”. From this, I made a new data frame (called “table2”) containing the transformed data, which I generated using the mutate function:

library(dplyr)
table2 <- table1 %>% group_by(trait) %>% mutate(bin = (label - min(label)) / (max(label) - min(label)) * 255)

The transformed values were stored in a new column called bin.
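To see why group_by(trait) matters, here is a minimal sketch on made-up data (the data frame toy and its values are assumptions for illustration). Because of the grouping, min(label) and max(label) are computed separately for each trait, so every channel is stretched across the full 0 to 255 range:

```r
library(dplyr)

# Hypothetical toy data: two traits on different scales
toy <- data.frame(
  trait = rep(c("hue_frequencies", "blue_frequencies"), each = 3),
  label = c(1, 180, 359, 0, 51, 255),
  value = c(10, 20, 30, 5, 15, 25)
)

toy_scaled <- toy %>%
  group_by(trait) %>%   # min/max are now taken within each trait
  mutate(bin = (label - min(label)) / (max(label) - min(label)) * 255) %>%
  ungroup()             # drop the grouping once the column is built

print(toy_scaled)
```

Without the group_by step, the minimum and maximum would be taken over all traits at once, and the hue labels would no longer start at 0.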

Unfortunately some of the bin values were decimals and I wanted my values to be integers between 0 and 255. This was easy to fix with a round function:

table2 <- table1 %>% group_by(trait) %>% mutate(bin = round((label - min(label)) / (max(label) - min(label)) * 255, 0))
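One side effect of rounding is worth checking: a channel with more than 256 distinct labels (hue, for example) will have some labels land in the same bin. The sketch below uses made-up hue data with a count of 1 per label to show this, and then sums the counts per bin with base R's aggregate. The summing step is an optional extra that I am suggesting here, not part of the workflow above:

```r
# Hypothetical hue series: labels 1..359, each with a made-up count of 1
hue <- data.frame(
  trait = "hue_frequencies",
  label = 1:359,
  value = 1
)

# Same rescale-and-round calculation as above, written in base R
hue$bin <- round((hue$label - min(hue$label)) /
                 (max(hue$label) - min(hue$label)) * 255, 0)

n_bins <- length(unique(hue$bin))  # fewer distinct bins than labels
print(n_bins)

# Optional: combine rows that share a bin by summing their pixel counts
hue_binned <- aggregate(value ~ trait + bin, data = hue, FUN = sum)
```

After aggregation there is one row per trait and bin, which avoids several points sharing the same x position when the series is plotted as a line.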

Next, I stored the bin column in its own variable:

bin_value <- table2$bin

Next, I could use the ggplot2 package to plot my histograms:

library(ggplot2)
ggplot(table2) + geom_line(aes(x = bin_value, y = value, color = trait))

The final result (with more elaborate ggplot2 code*):

The data series can easily be exported into a new CSV for further statistical analysis using the following command:

write.csv(table2, "c:/path/to/results.csv", row.names = TRUE)

Importing large CSV files

CSV files of this sort can be massive and take a long time to load into R. A good tip for quickly importing large CSV files is to use the data.table package:

library(data.table)
my.data <- fread("c:/path/to/my_data.csv", data.table=FALSE)

*The ggplot2 code

Making pretty graphs with ggplot2 is a bit of an art. Here’s the code that I used:

ggplot(table2) +
  geom_line(aes(x = bin_value, y = value, color = trait), size = 1.1) +
  labs(x = "Bins", y = "Pixels") +
  theme(
    axis.text.x = element_text(angle = 0, vjust = 0.2, hjust = 0.0, size = 12),
    axis.text.y = element_text(size = 12),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    title = element_text(size = 12)
  ) +
  scale_x_continuous(breaks = seq(min(bin_value), max(bin_value), by = 25)) +
  scale_color_manual(
    name = "Colour Channel",
    labels = c("Blue-Yellow [b]", "Blue [B]", "Green-Magenta [a]", "Green [G]", "Hue [H]", "Lightness [L]", "Red [R]", "Saturation [S]", "Value [V]"),
    values = c("gold2", "royalblue2", "magenta", "springgreen4", "purple4", "ivory4", "red2", "darkturquoise", "gold4")
  )