Rescaling Data in R
A simple methodology for rescaling multiple series of data via min-max normalisation in R, then plotting.
Recently I had to work with a set of colour information data extracted from some photographs. The software programme I used extracted colour frequency data for the RGB (Red, Green, Blue), HSV (Hue, Saturation, Value) and L*a*b (Lightness, Green-Magenta, Blue Yellow) channels and recorded them in a CSV file with the following structure:
barcode | date | line | treatment | trait | value | label |
---|---|---|---|---|---|---|
B2018010t0c00n153 | 5/09/2019 | 0 | 0 | blue_frequencies | 0 | 0 |
B2018010t0c00n153 | 5/09/2019 | 0 | 0 | blue_frequencies | 0 | 1 |
B2018010t0c00n153 | 5/09/2019 | 0 | 0 | blue_frequencies | 1 | 2 |
B2018010t0c00n153 | 5/09/2019 | 0 | 0 | blue_frequencies | 8 | 3 |
B2018010t0c00n153 | 5/09/2019 | 0 | 0 | blue_frequencies | 32 | 4 |
… | … | … | … | … | …n | …255 |
The trait
attribute describes the channel that was being measured (so in the above case blue_frequencies is the blue channel from the RGB colour model). The trait
column contained nine different descriptors:
- blue-yellow_frequencies (b)
- blue_frequencies (B)
- green-magenta_frequencies (a)
- green_frequencies (G)
- hue_frequencies (H)
- lightness_frequencies (L)
- red_frequencies (R)
- saturation_frequencies (S)
- value_frequencies (V)
For each of these, the label
column contained a numeric value for frequency whilst the value
column contained a count of the number of pixels in the image that were measured at that frequency within the named colour model.
My aim was to graph all nine histograms on a single plot in R. The problem was that the different colour models were all using different scales:
- The RGB model has a scale between 0 and 255 for red, green and blue.
- The L*a*b model has a scale between -128 and 128 for blue-yellow, green-magenta and lightness.
- The HSV model has a scale between 1 and 359 for hue and 0 and 100 for saturation and value.
Clearly plotting these on the same chart would result in a histogram that was very messy and difficult to read:
The solution was to rescale the data using a normalisation calculation to place all the values into bins. The mathematics for this is as follows:
\[x’=a +{(x-min(x))(b-a)\over(max(x)-min(x))}\]
where x is the original value (label
), x’ is the normalised value, a is the desired minimum value and b is the desired maximum value. Since I wanted to bin my values between 0 and 255, a=0 and b=255.
To achieve the same result, the equation can be simplified:
\[x’={(x-min(x))\over(max(x)-min(x))}\times255\]
Normalising the data in R
The real challenge was to translate this equation into R code. Here’s how I did it using the dplyr package:
My table of values were loaded into a data frame called “table1”. From this, made a new data frame (called “table2”) that contained the new series of transformed data that I generated using the mutate
function:
library(dplyr)
table2 <- table1 %>% group_by(trait) %>% mutate(bin = (label - min(label)) / (max(label) - min(label)) * 255)
The new column of data was listed in a column called bin
.
Unfortunately some of the bin values were decimals and I wanted my values to be integers between 0 and 255. This was easy to fix with a round
function:
table2 <- table1 %>% group_by(trait) %>% mutate(bin = round((label - min(label)) / (max(label) - min(label)) * 255, 0))
Next, I added bin
as a variable:
bin_value <- table2$bin
Next, I could use the ggplot2 package to plot my histograms:
library(ggplot2)
ggplot(table2) + geom_line(aes(x = bin_value, y = value, color = trait))
The final result (with more elaborate ggplot2
code*):
The data series can easily be exported into a new CSV for further statistical analysis using the following command:
write.csv(table2,"c:\path\to\results.csv", row.names = TRUE)
Importing large CSV files
CSV files of this sort can be massive and take a long time to load into R. A good tip for quickly importing large CSV files is to use the data.table package:
library(data.table)
my.data <- fread("c:/path/to/my_data.csv", data.table=FALSE)
*The ggplot2 code
Making pretty graphs with ggplot2 is a bit of an art. Here’s the code that I used:
ggplot(table2) + geom_line(aes(x = bin_value, y = value, color = trait), size=1.1) + labs(x="Bins", y="Pixels") + theme(axis.text.x = element_text(angle = 0, vjust = 0.2, hjust = 0.0, size=12), axis.text.y = element_text(size=12), axis.title.x = element_text(size=12, face="bold"), axis.title.y = element_text(size=12, face="bold"), title = element_text(size=12)) + scale_x_continuous(breaks = seq(min(bin_value), max(bin_value), by = 25)) + scale_color_manual(name="Colour Channel", labels = c("Blue-Yellow [b]", "Blue [B]", "Green-Magenta [a]", "Green [G]", "Hue [H]", "Lightness [L]", "Red [R]", "Saturation [S]", "Value [V]"), values=c("gold2", "royalblue2", "magenta", "springgreen4", "purple4", "ivory4", "red2", "darkturquoise", "gold4"))
Comments
No comments have yet been submitted. Be the first!