---
title:  "1. An Introduction to readJDX"
author:
  - Bryan A. Hanson^[Professor Emeritus of Chemistry & Biochemistry, DePauw University, Greencastle IN USA., hanson@depauw.edu]
date:  "`r Sys.Date()`"
output:
    bookdown::html_document2: # use for pkgdown site
    #bookdown::pdf_document2: # use for CRAN to keep size down; breaks GHA
      toc: yes
      toc_depth: 3
      fig_caption: yes
      number_sections: true
# Required: Vignette metadata for inclusion in a package.
vignette: >
    %\VignetteIndexEntry{1. An Introduction to readJDX}
    %\VignetteKeyword{JCAMP-DX, NMR, IR, Raman, MS, UV-Vis}
    %\VignettePackage{readJDX}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
link-citations: yes
bibliography: JCAMP-DX.bib
#biblio-style: plain
pkgdown:
  as_is: true
---

```{r results = "hide", echo = FALSE}
library("readJDX")
rm(list = ls())

desc <- packageDescription("readJDX")
set.seed(999)
knitr::opts_chunk$set(tidy = TRUE,
  fig.height = 7, fig.width = 7,
  fig.fullwidth = TRUE, echo = FALSE)
```

This vignette is based on `readJDX` version `r desc$Version`.

# Background

The JCAMP-DX format was developed as an manufacturer-independent means of sharing spectroscopic [data](http://www.jcamp-dx.org).  The standard is described in a series of publications [@McDonald1988; @Grasselli1991; @Davies1993; @Lampen1994; @Lampen1999;  @Bumbach2001; @Cammack2006; @Woollett2012].  There is a recent overview of the standard [@Davies2022]. JCAMP-DX was developed during a time when data storage was expensive, and hence makes extensive use of compression schemes.  The original application was to IR spectroscopy, but the standard has evolved over time to accommodate other spectroscopies.

# File Structure

JCAMP-DX files consist of two parts:

* A more-or-less human readible set of metadata which is needed to understand the data and verify the accuracy of any needed decompression.  Besides required basic information about the data itself, most files contain instrument and manufacturer-specific parameters in the metadata.
* A variable list, compressed in various ways.

# Challenges When Reading Files

The JCAMP-DX standard allows a lot of flexibility and instrument manufacturers have written widely varying export functions.  Some of the challenges in reading a JCAMP-DX file include:

* JCAMP-DX files can contain different kinds of data, including non-spectroscopic data [@Gasteiger1991] and more than one type of spectroscopic data.
* JCAMP-DX files can contain more than one spectrum in the file.
* Instruments may be configured to use . or , as the decimal point when writing files.  This is generally a geographical / cultural nuance.
* Numbers may be written using E to signify exponent, but only in some compression formats.
* The variable list can be presented in several possible formats.
* Some manufacturers take liberties with the required format.


# Supported Formats

* Variable lists can be presented in several different formats.  The supported formats are:
    * **XYDATA=(X++(Y..Y))** Each line starts with an $x$ value, and is followed by as many $y$ values as can fit within the 80 character per line limit.  Subsequent $x$ values are incremented according to the $x$ resolution and the number of $y$ values that fit on the previous line (which in turn depends upon the compression scheme).
    * **DATA TABLE=(X++(R..R))** As above.  The real data from a 1D NMR spectrum.
    * **DATA TABLE=(X++(I..I))** As above.  The imaginary data from a 1D NMR spectrum.
    * **DATA TABLE=(F2++(Y..Y))** As above.  Format used for the slices of a 2D NMR spectrum.
    * **PEAK TABLE=(XY..XY)** Entries are $x$,$y$ pairs separated by spaces or semicolons.  No compression is used.  Used for example for single MS spectra.
    * **DATA TABLE=(XI..XI)** Entries are $x$,$y$ pairs separated by spaces or semicolons.  No compression is used.  No compression is used.  Used for instance in LC-MS data sets.
    * **XYPOINTS= (XY..XY)** Entries are $x$,$y$ pairs separated by spaces or semicolons.  No compression is used. Used for many types of spectra.
* Within a variable list, several different compression schemes can be employed.  The following are supported:
    * **AFFN**: ASCII numbers separated by at least one space, or + or -.
    * **PAC**: Numbers separated by exactly one space, + or -.
    * **SQZ**: Delimiter, leading digit and sign are replaced by a pseudo-digit. A pseudo-digit is typically a letter.
    * **DIF**: DIF uses a SQZ pseudo-digit for the first $y$ value, but subsequent $y$ entries are differences between each data value after the first.  Sometimes referred to as SQZDIF.
    * **DUP**: Not a format, but a method of signifying repeated values.
    * **DIFDUP**: A combination of DIF and DUP.  Widely used, as it permits the greatest amount of compression.

# Formats That are Not Supported

* Mixed spectroscopic types and non-spectroscopic entries (such as structures) are not supported by `readJDX` and will not be supported in the future.
* Compound files: JCAMP-DX files may contain more than one spectrum in the file.  The following JCAMP-DX standards require a compound file and are therefore not directly supported, however a utility function `splitMultiblockDX` can separate these compound files into separate files that can be read by `readJDX`:
    * EMR, EPR, ESR spectroscopy [@Cammack2006]
    * CD spectroscopy [@Woollett2012]
* `readJDX` is geared toward raw spectral data.  Therefore variable lists formats representing derived information like PEAK ASSIGNMENTS are not supported (but your pull requests are welcomed!).

# Practical Matters

`readJDX` tries its best to deal with all these options.  If you have a file that you believe should be supported but gives an error, please file an issue at GitHub.  Be sure to attach the file that is giving you problems.

Before release, `readJDX` is tested against a large collection of files with varying formats.  A few of these files were obtained locally.  Others were collected from publically available sources (e.g. [www.jcamp-dx.org](http://www.jcamp-dx.org/testdata.html)).  These files are not included with the package to save space, and in addition, while they are publically available, for many of them the licensing status is unclear (i.e. the OWNER entry).

The JCAMP standard requires a number of checks on the integrity of the data decompression process.  `readJDX` implements most of these either directly or indirectly.  Verification is important, and we have found JCAMP files that were not written correctly in the process of checking integrity.  For details about how data decompression is checked, please see the original source files.

# References