from analysis.load_spectra import load_acetonitrile_spectra
from lmfit.models import VoigtModel
from ramanalysis.calibrate import ACETONITRILE_PEAKS_CM1
# Load acetonitrile spectra and their corresponding spectrometers
= load_acetonitrile_spectra()
acetonitrile_spectra, acetonitrile_dataframe
# Set parameters for peak finding
= 4
num_peaks = 25
window_size = (750, 3100) wavenumber_range_cm1
We used ChatGPT to help write code and help clarify and streamline text that we wrote. We also used Claude to help write code and help clarify and streamline text that we wrote.
Motivation
At Arcadia, we’re interested in using spontaneous Raman spectroscopy as an agnostic, high-throughput tool for biological phenotyping. To implement this approach effectively, we evaluated a variety of Raman spectroscopy systems to determine which would best align with our research requirements. By testing the same research-relevant specimens across all instruments, we could directly compare their performance characteristics, including resolution, sensitivity, and signal-to-noise ratios under real experimental conditions. To facilitate data processing and characterization, we also created an open-source Python package, ramanalysis
, to process, normalize, and help interpret the spectral data from different manufacturers. We hope that others getting started with similar systems can use this notebook and the open-source Python package associated with it to assist in their analyses. Our analysis has several caveats based on the acquisition parameters and samples we used, and as such, isn’t intended to generate a definitive ranking or absolute comparison of the instruments themselves.
Instrumentation
Our goal for this study was to test a variety of spontaneous Raman systems with various capabilities to determine which instrument is best suited for our samples and applications. In total, we tested five different systems:
Prior to the collection of a full dataset, we conducted preliminary measurements on a subset of samples to calibrate and optimize the acquisition parameters for each spectrometer. We provide these acquisition parameters in the tables below. For each sample, the laser power listed in the tables is either provided by the instrument or measured at the sample surface. Unless stated otherwise, we focused the laser spot by eye when obtaining measurements and collected spectra from cells growing on agar plates. Note that the OpenRAMAN is a spectrometer we’ve previously used and described in our research (Avasthi et al., 2024; Braverman et al., 2025).
Sample preparation
Acetonitrile
We chose acetonitrile, a common organic solvent, as a reference material for calibration. It has multiple easily resolvable, well-documented peaks in regions of Raman spectra relevant to our sample set. We used acetonitrile, ≥ 99.5% from Avantor/VWR, and analyzed it in a capped quartz cuvette.
Chlamydomonas strains
We compared three wild-type strains of Chlamydomonas grown on two types of solid media for a total of six treatments. There were two strains of Chlamydomonas reinhardtii, CC-124 (mating type -) and CC-125 (mating type +), and one strain of Chlamydomonas smithii, CC-1373 (mating type +).
Media components
We made all solid media with 1.5% agar. We made “water” plates with ultrapure water from a Milli-Q EQ 7008 system equipped with a 0.22 μm filter. Individual media components are as follows:
- TAP (tris-acetate-phosphate) media
We purchased complete liquid media from UTEX. - M-N (minimal media lacking nitrogen) media
Made in-house following the above protocol.
Protocols
We maintained cells on TAP or M-N media plates with 1.5% agar under a 12/12 h light-dark cycle1 at ambient temperature for 5–15 days before use. Cells were densely streaked to cover about half of the plate, creating a lawn. To acquire spectra from cells, we focused the laser on the portion of the plate with cells. To acquire background spectra from media and agar, we focused the laser on the area without cells.
1 Note that we didn’t account for the timing within the light phase of the cycle when measurements were taken.
To prepare slides with cells for microscopy (this was done only for the Renishaw InVa Qontor), we drew a wax circle on a #1.5 glass coverslip to create a hydrophobic barrier for each sample, making six total. We pipetted 3 µL of either liquid TAP or M-N media within the wax circles, resulting in three coverslips with each media type. We then transferred a small number of cells to the liquid droplet using a clean pipette. After transfer, clumps of cells were removed with forceps, resulting in a more continuous distribution of cells in each droplet. We repeated this for each of the three strains. Then, using forceps, we flipped each coverslip, placed it onto an ethanol-cleaned glass slide, and sealed each edge with melted VALAP (1:1:1 mixture of Vaseline, lanolin, and paraffin wax) using a paintbrush.
Analysis & results
Characterizing spectral resolution with acetonitrile
We acquired and analyzed measurements of the same sample of acetonitrile on each instrument to assess system performance. By analyzing the spectra and comparing them to established references, we’re able to characterize each instrument’s approximate spectral resolution and accuracy (the accuracy of the observed peak positions with respect to reference peaks).
We’ll start by loading the acetonitrile spectrum from each instrument as well as the standard Raman peaks for acetonitrile (Shimanouchi, 1972). While five standard peaks are listed in the NIST database as medium, strong, or very strong, the peak at 2,999 cm–1 can be hard to detect, particularly with 532 nm excitation. For this reason, we’ll only characterize the four most prominent peaks and fit each with a Voigt model2 (a convolution of a Gaussian distribution with a Lorentzian distribution) with a window size of 25.
2 A Voigt profile is commonly used in peak fitting applications within spectroscopy because it accurately models spectral line shapes by accounting for both Doppler broadening (Gaussian component, due to thermal motion) and Lorentzian broadening (from collisional or lifetime effects). This makes it ideal for fitting experimental spectra where multiple broadening mechanisms contribute to the observed peak shape.
Figure 1 shows the results of the peak fitting. Because the quartz cuvette of acetonitrile yields relatively strong and clean spectra, very little preprocessing is applied to the spectra. If you expand the code cell below, you’ll see that the only preprocessing steps applied are cropping the spectra to the wavenumber range (750–3,100 cm–1) and min-max normalization. Only the OpenRAMAN spectrum is smoothed via median filtering to suppress hot pixels and read noise from the detector.
Another strategy to deal with hot pixels from our acquisition would have been to acquire and subtract a dark spectrum (a spectrum with the laser powered off) to capture only the detector’s fixed pattern noise.
Source code for this figure:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from analysis.plotting import darken, get_custom_colorpalette, get_default_plotly_layout
from plotly.subplots import make_subplots
from ramanalysis.peak_fitting import find_n_most_prominent_peaks
# Store peak fitting results
= []
peak_fitting_records
# Create figure
= get_custom_colorpalette()
color_palette = acetonitrile_dataframe.groupby(["λ_nm"]).count()["instrument"].max().item()
num_rows = acetonitrile_dataframe["λ_nm"].nunique()
num_cols = make_subplots(
fig =num_rows,
rows=num_cols,
cols=True,
shared_xaxes=True,
shared_yaxes=0.02,
horizontal_spacing=0.05,
vertical_spacing=["785 nm", "532 nm"],
column_titles
)
# Map each (instrument, wavelength) combo to a subplot
= { # (instrument, λ_nm) -> (col, row)
subplot_indices "horiba", 785): (1, 1),
("renishaw", 785): (1, 2),
("wasatch", 785): (1, 3),
("openraman", 532): (2, 2),
("wasatch", 532): (2, 3),
(
}
for i, dataframe_row in acetonitrile_dataframe.iterrows():
# Get the raw spectrum and the instrument and wavelength with which it was recorded
= dataframe_row["λ_nm"]
wavelength_nm = dataframe_row["instrument"]
instrument = acetonitrile_spectra[i]
raw_spectrum = subplot_indices[(instrument, wavelength_nm)]
subplot_col, subplot_row
# Apply median filtering on OpenRAMAN data to filter out hot pixels from the detector
if instrument == "openraman":
= raw_spectrum.smooth(kernel_size=3)
raw_spectrum # Crop and normalize spectra for peak detection
= raw_spectrum.between(*wavenumber_range_cm1).normalize()
preprocessed_spectrum
# Plot each acetonitrile spectrum
= f"{instrument.capitalize()}" if instrument != "openraman" else "OpenRAMAN"
name
fig.add_trace(
go.Scatter(=preprocessed_spectrum.wavenumbers_cm1,
x=preprocessed_spectrum.intensities,
y=name,
name="skip",
hoverinfo={"color": color_palette[(instrument, wavelength_nm)]},
marker
),=subplot_row,
row=subplot_col,
col
)
# Find and get the indices of the most prominent peaks in the spectrum
= find_n_most_prominent_peaks(
peak_indices =num_peaks
preprocessed_spectrum.intensities, num_peaks
)for i_peak in peak_indices:
# Create window around each peak
= preprocessed_spectrum.interpolate(i_peak)
peak_cm1, peak_height = (i_peak - window_size / 2, i_peak + window_size / 2)
window_range = np.linspace(*window_range, num=500)
window # Interpolate the window along the spectrum
= preprocessed_spectrum.interpolate(window)
x_data_cm1, y_data
# Fit the model to each peak
= VoigtModel()
model = model.make_params(amplitude=1, center=peak_cm1, sigma=1, gamma=0.5)
initial_fit_params = model.fit(y_data, initial_fit_params, x=x_data_cm1)
fit_result
# Find the offset from acetonitrile reference peaks
= fit_result.params["center"].value
peak_center = np.abs(ACETONITRILE_PEAKS_CM1 - peak_center).argmin()
reference_peak_index = ACETONITRILE_PEAKS_CM1[reference_peak_index]
reference_peak = peak_center - reference_peak
offset
# Record the peak fitting results
= {**dataframe_row.to_dict(), **fit_result.params.valuesdict()}
peak_fitting_results "reference"] = reference_peak
peak_fitting_results["offset"] = offset
peak_fitting_results[
peak_fitting_records.append(peak_fitting_results)
# Plot and annotate each peak
= f"{peak_fitting_results['fwhm']:.1f} cm<sup>-1</sup>"
annotation_text = ["fwhm", "height", "offset"]
hovertext_fields = [
hovertext f"{field}: {peak_fitting_results[field]:.3g}<br>" for field in hovertext_fields
]
fig.add_trace(
go.Scatter(=[peak_center],
x=[peak_fitting_results["height"]],
y=False,
showlegend="".join(hovertext),
hovertext=annotation_text,
text="markers+text",
mode="top center",
textposition={"color": darken(color_palette[(instrument, wavelength_nm)])},
marker
),=subplot_row,
row=subplot_col,
col
)
# Plot fitted Voigt profile for each peak
= np.linspace(peak_center - window_size, peak_center + window_size, 500)
x_data = [peak_fitting_results[p] for p in ["amplitude", "center", "sigma", "gamma"]]
model_params = model.func(x_data, *model_params)
y_data
fig.add_trace(
go.Scatter(=x_data,
x=y_data,
y=False,
showlegend="skip",
hoverinfo={"color": darken(color_palette[(instrument, wavelength_nm)])},
marker
),=subplot_row,
row=subplot_col,
col
)
# Create pandas DataFrame of peak fitting results
= pd.DataFrame.from_records(peak_fitting_records).drop("gamma", axis=1)
acetonitrile_peaks_dataframe
# Configure plotly layout
= get_default_plotly_layout()
layout =450)
fig.update_layout(layout, heightrange=[wavenumber_range_cm1[0] - 100, wavenumber_range_cm1[1] + 100])
fig.update_xaxes(range=[-0.05, 1.3])
fig.update_yaxes(="Wavenumber (cm<sup>-1</sup>)", row=num_rows, col=1)
fig.update_xaxes(title_text="Wavenumber (cm<sup>-1</sup>)", row=num_rows, col=2)
fig.update_xaxes(title_text="Intensity (normalized)", row=2, col=1)
fig.update_yaxes(title_text fig.show()
By analyzing the same acetonitrile sample across instruments, we can control for sample-dependent variables, allowing us to make meaningful comparisons of the relative performance of each spectrometer based on full width at half maximum (FWHM)3 measurements. The reference peak at ~918 cm–1 is comparatively isolated, so the FWHM won’t be impacted by other components of the spectrum. Let’s choose this peak to compare FWHM values, noting that the observed FWHM is influenced by both instrumental factors (e.g., quality of optical components) and sample characteristics (e.g., molecular dynamics and environmental conditions).
3 While the FWHM serves as a useful indicator of spectral quality, it’s important to note that true spectral resolution is defined by the minimum separation at which two adjacent peaks can be distinguished rather than by individual peak widths alone.
"reference == 918").set_index(["instrument", "λ_nm"]).round(1) acetonitrile_peaks_dataframe.query(
amplitude | center | sigma | fwhm | height | reference | offset | ||
---|---|---|---|---|---|---|---|---|
instrument | λ_nm | |||||||
horiba | 785 | 10.9 | 918.5 | 4.7 | 11.7 | 0.8 | 918 | 0.5 |
renishaw | 785 | 6.2 | 921.8 | 3.2 | 8.2 | 0.7 | 918 | 3.8 |
wasatch | 785 | 6.5 | 921.4 | 4.8 | 11.9 | 0.5 | 918 | 3.4 |
openraman | 532 | 4.0 | 921.4 | 8.7 | 21.0 | 0.2 | 918 | 3.4 |
wasatch | 532 | 1.9 | 927.1 | 5.9 | 14.5 | 0.1 | 918 | 9.1 |
The peak fitting results in Table 1 show that the Renishaw has the lowest FWHM of any instrument at 8.2 cm-1. The Horiba and the 785 nm Wasatch are close behind at 11.7 cm-1 and 11.9 cm-1, respectively. Additionally, we found that the Horiba is the best calibrated in this spectral range as it has the smallest offset in peak position from the reference. While spectral accuracy is important, we’re less concerned here, given that we can always recalibrate spectra as part of a post-processing workflow using reference materials. Furthermore, for most of our research, we’re more interested in relative differences among spectra than absolute differences.
Let’s continue the comparison by exploring spectra from biological samples.
Characterizing SNR with C. reinhardtii spectra
We’re shifting our focus from a standard sample (acetonitrile) to biological samples, specifically unicellular algae, to determine how the different spectrometers perform for our experiments of interest. We collected Raman spectra of Chlamydomonas reinhardtii (CC-124) — a model organism we frequently use in the lab — from each instrument. Since biological samples introduce more complexity than pure compounds, we need to assess the signal-to-noise ratio (SNR) of these spectra to determine which instruments offer the highest sensitivity and are best suited for detecting subtle spectral features. This characterization will help us identify the most sensitive spectrometer for future experiments where detecting faint biochemical signals is critical. In addition to SNR measurements, we’ll make (relatively crude) peak-finding calculations to get a rough sense of the number of features we can resolve from each spectrum. We’ll use the find_peaks
function from SciPy
, which finds the local maxima in a 1D signal, and filter this set of local maxima to the peaks we’re interested in resolving by defining parameters such as the relative prominence and the distance between peaks.
We’ll also need to perform baseline subtraction to derive meaningful SNR measurements, as raw Raman spectra often contain fluorescence and other background signals that obscure meaningful peaks. In a separate analysis, we evaluated different baseline estimation algorithms. We found that adaptive smoothness penalized least squares (asPLS) performed the best at removing background while preserving key spectral features (Zhang et al., 2020). We use the implementation of asPLS in pybaselines
with the smoothing parameter changed to lam = 1e4
, as we found empirically that this value provided better baseline estimation for our dataset.
from analysis.load_spectra import load_cc124_tap_spectra
from pybaselines.whittaker import aspls
from ramanalysis import RamanSpectrum
from scipy.signal import find_peaks
= load_cc124_tap_spectra()
cc124_tap_spectra, cc124_tap_dataframe
# Define a function for calculating peak SNR
def compute_peak_snr(
spectrum: RamanSpectrum,tuple[float, float],
noisy_region_cm1: -> float:
) """Peak SNR measurement."""
= spectrum.normalize().between(*noisy_region_cm1).intensities.std()
noise = 1 / noise
peak_snr return peak_snr
# Parameters for baseline subtraction
= 1e4
lam = (520, 3200)
wavenumber_range_cm1
# Parameters for SNR calculation
= (1700, 2500)
noisy_region_cm1
# Parameters for peak finding
= {
relative_prominences "horiba", 785): 0.5,
("renishaw", 785): 0.1,
("wasatch", 785): 0.1,
("openraman", 532): 0.5,
("wasatch", 532): 0.1,
(
}= 10 peak_separation
Figure 2 shows the peak SNR measurements of each instrument and the approximate number of peaks.
Source code for this figure:
# Store peak SNR measurement results
= []
snr_measurements
# Create figure
= cc124_tap_dataframe.groupby(["λ_nm"]).count()["instrument"].max().item()
num_rows = cc124_tap_dataframe["λ_nm"].nunique()
num_cols = make_subplots(
fig =num_rows,
rows=num_cols,
cols=True,
shared_xaxes=True,
shared_yaxes=0.02,
horizontal_spacing=0.05,
vertical_spacing=["785 nm", "532 nm"],
column_titles
)
# Map each (instrument, wavelength) combo to a subplot
= { # (instrument, λ_nm) -> (col, row)
subplot_indices "horiba", 785): (1, 1),
("renishaw", 785): (1, 2),
("wasatch", 785): (1, 3),
("openraman", 532): (2, 2),
("wasatch", 532): (2, 3),
(
}
for i, dataframe_row in cc124_tap_dataframe.iterrows():
# Get the raw spectrum and the instrument and wavelength with which it was recorded
= dataframe_row["λ_nm"]
wavelength_nm = dataframe_row["instrument"]
instrument = cc124_tap_spectra[i]
raw_spectrum = subplot_indices[(instrument, wavelength_nm)]
subplot_col, subplot_row
# Crop and normalize spectra for baseline estimation
= raw_spectrum.between(*wavenumber_range_cm1).normalize()
normalized_spectrum # Apply median filtering on OpenRAMAN data to filter out hot pixels from the detector
if instrument == "openraman":
= normalized_spectrum.smooth(kernel_size=3)
normalized_spectrum
# Estimate and subtract baseline
= aspls(normalized_spectrum.intensities, lam=lam)
baseline_estimate, _params = normalized_spectrum.intensities - baseline_estimate
baseline_subtracted_intensities = RamanSpectrum(
baseline_subtracted_spectrum =normalized_spectrum.wavenumbers_cm1,
wavenumbers_cm1=baseline_subtracted_intensities,
intensities
).normalize()
# Crop out saturated region from Wasatch 532 nm spectrum
if (instrument == "wasatch") and (wavelength_nm == 532):
= baseline_subtracted_spectrum.between(0, 1800).normalize()
baseline_subtracted_spectrum
# Calculate peak SNR
= compute_peak_snr(baseline_subtracted_spectrum, noisy_region_cm1)
snr
snr_measurements.append(snr)
# Find prominent peaks
= find_peaks(
peaks, _peak_properties
baseline_subtracted_spectrum.intensities,
relative_prominences[(instrument, wavelength_nm)],=peak_separation,
distance
)= baseline_subtracted_spectrum.interpolate(peaks)
peaks_cm1, peak_heights
# Plot each raw (normalized) spectrum
fig.add_trace(
go.Scatter(=normalized_spectrum.wavenumbers_cm1,
x=normalized_spectrum.intensities,
y="skip",
hoverinfo=False,
showlegend={"color": color_palette[(instrument, wavelength_nm)]},
marker=0.5,
opacity
),=subplot_row,
row=subplot_col,
col
)
# Plot each baseline estimate
fig.add_trace(
go.Scatter(=normalized_spectrum.wavenumbers_cm1,
x=baseline_estimate - 0.05,
y="skip",
hoverinfo=False,
showlegend={"color": darken(color_palette[(instrument, wavelength_nm)])},
marker=0.5,
opacity
),=subplot_row,
row=subplot_col,
col
)
# Plot baseline subtracted spectra
= f"{instrument.capitalize()}" if instrument != "openraman" else "OpenRAMAN"
name
fig.add_trace(
go.Scatter(=baseline_subtracted_spectrum.wavenumbers_cm1,
x=baseline_subtracted_spectrum.intensities,
y=name,
name="skip",
hoverinfo={"color": color_palette[(instrument, wavelength_nm)]},
marker
),=subplot_row,
row=subplot_col,
col
)
# Plot peaks
fig.add_trace(
go.Scatter(=peaks_cm1,
x=peak_heights,
y=name,
name="markers",
mode={"color": color_palette[(instrument, wavelength_nm)]},
marker=False,
showlegend
),=subplot_row,
row=subplot_col,
col
)
# Plot SNR and num peaks annotation
= f"Peak SNR: {snr:.1f}<br>Num peaks: {peaks.size:.0f}"
text
fig.add_trace(
go.Scatter(=[2000],
x=[0.6],
y=False,
showlegend=text,
text="text",
mode="middle right",
textposition=name,
name
),=subplot_row,
row=subplot_col,
col
)
# Create pandas DataFrame of peak fitting results
"snr"] = snr_measurements
cc124_tap_dataframe[
# Configure plotly layout
=450)
fig.update_layout(layout, heightrange=[wavenumber_range_cm1[0] - 100, wavenumber_range_cm1[1] + 100])
fig.update_xaxes(range=[-0.05, 1.3])
fig.update_yaxes(="Wavenumber (cm<sup>-1</sup>)", row=num_rows, col=1)
fig.update_xaxes(title_text="Wavenumber (cm<sup>-1</sup>)", row=num_rows, col=2)
fig.update_xaxes(title_text="Intensity (normalized)", row=2, col=1)
fig.update_yaxes(title_text fig.show()
These measurements indicate that the Renishaw and 785 nm Wasatch systems provide the highest peak SNR values. The OpenRAMAN exhibits the lowest SNR, which aligns with expectations given its relatively low-cost components. In particular, the OpenRAMAN has by far the weakest laser power of any system and likely the most read noise from the detector; in contrast to the other systems, the camera is not cooled. Perhaps not surprisingly, the two instruments with the highest SNR also have the highest number of discernible spectral features.
The excitation wavelength greatly affects the usable signal. While it also has a relatively high SNR, the number of features in the 532 nm Wasatch spectrum is much lower. This suggests that 532 nm may be less optimal for analyzing these particular types of samples. It should be noted that the number of peaks detected in each spectrum is only an approximation.
Now, we’ll move on to applications requiring large spectral datasets, such as batch processing and machine learning (ML) techniques. These approaches will help us evaluate the reproducibility of each instrument and assess their sensitivity to subtle variations in differentiating strains and growth conditions of our cell samples, for example.
Batch preprocessing of Chlamydomonas spectra
The C. reinhardtii spectra characterized above come from a much larger dataset of Chlamydomonas spectra. Three strains of Chlamydomonas, representing two species, were cultured in two different media conditions, as described in Section 3. With each instrument, we acquired Raman spectra from plates (or, in the case of the Renishaw, glass slides) of cells representing each of these six treatments.
Let’s load the full dataset of spectra and associated metadata — a random sample of metadata from ten spectra from this dataset is shown in Table 2.
from analysis.load_spectra import load_chlamy_spectra
# Load all the spectra of Chlamydomonas and the corresponding metadata
= load_chlamy_spectra()
chlamy_spectra, chlamy_dataframe
# Show a sample of spectra metadata
print(f"Total number of Chlamydomonas spectra: {len(chlamy_dataframe)}")
10, random_state=57).reset_index(drop=True) chlamy_dataframe.sample(
Total number of Chlamydomonas spectra: 536
instrument | λ_nm | species | strain | medium | |
---|---|---|---|---|---|
0 | renishaw | 785 | C. reinhardtii | CC-124 | TAP |
1 | renishaw | 785 | C. reinhardtii | CC-125 | TAP |
2 | wasatch | 532 | C. smithii | CC-1373 | TAP |
3 | openraman | 532 | C. reinhardtii | CC-125 | TAP |
4 | openraman | 532 | C. smithii | CC-1373 | M-N |
5 | wasatch | 532 | C. smithii | CC-1373 | TAP |
6 | wasatch | 785 | C. reinhardtii | CC-125 | M-N |
7 | wasatch | 785 | C. reinhardtii | CC-124 | M-N |
8 | openraman | 532 | C. reinhardtii | CC-125 | M-N |
9 | wasatch | 785 | C. reinhardtii | CC-125 | M-N |
Since these spectra capture biochemical variations between species and environmental influences, we aim to apply machine-learning techniques to distinguish between groups. However, before diving into the modeling, we need to perform batch preprocessing to correct for background fluorescence, noise from the detectors, and other experimental inconsistencies. This ensures that our spectral differences reflect biological variability (as much as possible) rather than artifacts from acquisition conditions, improving the reliability of our classification models. The preprocessing pipeline is comprised of the following steps:
- Denoising: Smooth spectra with Savitzky-Golay filter.
- Cropping: Crop spectra to the region of interest (300–1,800 cm–1).
- Baseline subtraction: Fit and remove the fluorescence baseline using the asPLS algorithm.
- Normalization: Apply min-max normalization to scale the spectra for consistency across samples.
Because data acquired with the Renishaw instrument was collected from glass slides, we need to perform an additional background subtraction. Glass has a distinct Raman spectrum that overlaps with spectral regions of interest from our biological samples (Fikiet et al., 2020), requiring a separate correction to prevent it from confounding our analysis. Figure 3 shows the results of our preprocessing pipeline on the Chlamydomonas dataset.
When we collected data on the Horiba instrument, our focus wasn’t on ML classification tasks, so we didn’t design the dataset with that goal in mind. As a result, we don’t have a sufficient number of samples to support meaningful classification analysis. Given these limitations, we excluded this dataset from our classification efforts.
Source code for this figure and the preprocessing pipeline:
from scipy.signal import savgol_filter
# Preprocessing parameters
= 9
savgol_window_length = 3
savgol_polynomial_order = (300, 1800)
fingerprint_region_cm1
# Load spectrum from Renishaw glass slide control
= "data/Renishaw_Qontor/glass_slide_background.txt"
txt_filepath = RamanSpectrum.from_renishaw_txtfile(txt_filepath)
glass_slide_control_spectrum
# Store preprocessed spectra
= []
preprocessed_chlamy_spectra
for i, dataframe_row in chlamy_dataframe.iterrows():
# Get the raw spectrum and its associated metadata
= chlamy_spectra[i]
raw_spectrum = dataframe_row["λ_nm"]
wavelength_nm = dataframe_row["instrument"]
instrument = dataframe_row["strain"]
strain = dataframe_row["medium"]
medium
# Preprocessing steps
# -------------------
# 0.5) Apply median filtering on OpenRAMAN data to filter out hot pixels from the detector
if instrument == "openraman":
= raw_spectrum.smooth(kernel_size=3)
raw_spectrum # 1) Denoising
= savgol_filter(
smoothed_intensities
raw_spectrum.intensities,=savgol_window_length,
window_length=savgol_polynomial_order,
polyorder
)# 1.5) Glass slide background subtraction for the Renishaw
if instrument == "renishaw":
-= glass_slide_control_spectrum.intensities
smoothed_intensities # 2) Crop spectral range to the fingerprint region
= RamanSpectrum(
cropped_spectrum =raw_spectrum.wavenumbers_cm1,
wavenumbers_cm1=smoothed_intensities,
intensities*fingerprint_region_cm1)
).between(# 2.5) Additional cropping for 532 nm Wasatch
if (instrument == "wasatch") and (wavelength_nm == 532):
= cropped_spectrum.between(0, 1700)
cropped_spectrum # 3) Baseline subtraction
= aspls(cropped_spectrum.intensities, lam=lam)
baseline_estimate, _params = cropped_spectrum.intensities - baseline_estimate
baseline_subtracted_intensities # 4) Compose into RamanSpectrum and normalize
= RamanSpectrum(
preprocessed_spectrum =cropped_spectrum.wavenumbers_cm1,
wavenumbers_cm1=baseline_subtracted_intensities,
intensities
).normalize()
preprocessed_chlamy_spectra.append(preprocessed_spectrum)
# Create figure
= len(chlamy_dataframe.groupby(["instrument", "λ_nm"]).nunique()) + 1
num_rows = 2
num_cols = make_subplots(
fig =num_rows,
rows=num_cols,
cols=[0.5, 0.5, 0.1, 0.5, 0.5], # add space to separate 532 vs 785 subplots
row_heights=True,
shared_xaxes=True,
shared_yaxes=0.02,
horizontal_spacing=0.05,
vertical_spacing=["Raw", "Preprocessed"],
column_titles
)# Map each (instrument, wavelength) combo to a subplot
= { # (instrument, λ_nm) -> row
subplot_rows "renishaw", 785): 1,
("wasatch", 785): 2,
("openraman", 532): 4,
("wasatch", 532): 5,
(
}
for (instrument, wavelength_nm), dataframe_group in chlamy_dataframe.groupby(
"instrument", "λ_nm"]
[
):# Only show a sample of spectra to avoid rendering issues
= min(50, len(dataframe_group))
num_samples for i, dataframe_row in dataframe_group.sample(num_samples).iterrows():
# Get the raw and preprocessed spectra and its associated metadata
= chlamy_spectra[i]
raw_spectrum = preprocessed_chlamy_spectra[i]
preprocessed_spectrum = dataframe_row["strain"]
strain = dataframe_row["medium"]
medium = subplot_rows[(instrument, wavelength_nm)]
subplot_row
# Plot each raw spectrum
fig.add_trace(
go.Scatter(=raw_spectrum.wavenumbers_cm1,
x=raw_spectrum.normalize().intensities,
y="skip",
hoverinfo=strain + medium,
legendgroup=False,
showlegend={"color": color_palette[(strain, medium)]},
marker=0.5,
opacity
),=subplot_row,
row=1,
col
)
# Plot each preprocessed spectrum
fig.add_trace(
go.Scatter(=preprocessed_spectrum.wavenumbers_cm1,
x=preprocessed_spectrum.intensities,
y="skip",
hoverinfo=strain + medium,
legendgroup=False,
showlegend={"color": color_palette[(strain, medium)]},
marker=0.5,
opacity
),=subplot_row,
row=2,
col
)
# Plot fake traces for legend
= [
ordered_groups "CC-124", "TAP"),
("CC-125", "TAP"),
("CC-1373", "TAP"),
("CC-124", "M-N"),
("CC-125", "M-N"),
("CC-1373", "M-N"),
(
]for strain, medium in ordered_groups:
fig.add_trace(
go.Scatter(=[None],
x=[None],
y="lines",
mode=f"{strain} | {medium}",
name={"color": color_palette[(strain, medium)], "width": 3},
line=strain + medium,
legendgroup=True,
showlegend
)
)
# Configure plotly layout
=600)
fig.update_layout(layout, height="Wavenumber (cm<sup>-1</sup>)", row=num_rows, col=1)
fig.update_xaxes(title_text="Wavenumber (cm<sup>-1</sup>)", row=num_rows, col=2)
fig.update_xaxes(title_textfor (instrument, wavelength_nm), row in subplot_rows.items():
= f"{instrument.capitalize()}<br>{wavelength_nm} nm"
title =title, row=row, col=1)
fig.update_yaxes(title_text fig.show()