File: _AAREADME.txt
Database: TUH EEG Seizure Corpus (TUSZ)
Version: 1.5.2
-------------------------------------------------------------------------------

----
Change Log:

(20220331) created a permanent archive by expanding links; updated stats
(20200329) the dev and eval sets were manually reviewed by multiple annotators
(20200427) the feat_f1w2 was created based on frame=1sec and window = 2 sec
(20200427) the feat_f5w10 was created based on frame=5sec and window = 10 sec
----

This file contains some basic statistics about the TUH EEG Seizure
Corpus, a corpus developed to motivate the development of high
performance seizure detection algorithms using machine learning. This
corpus is a subset of the TUH EEG Corpus and contains sessions that
are known to contain seizure events. To balance the corpus, some
sessions are provided that do not contain seizure events, so that the
false alarm performance of a system can be tested.

When you use this specific corpus in your research or technology
development, we ask that you reference the corpus using this
publication:

 Shah, V., von Weltin, E., Lopez, S., McHugh, J., Veloso, L.,
 Golmohammadi, M., Obeid, I., and Picone, J. (2018). The Temple University
 Hospital Seizure Detection Corpus. Frontiers in Neuroinformatics. 12:83.
 doi: 10.3389/fninf.2018.00083

This publication can be retrieved from:

https://www.isip.piconepress.com/publications/journals_refereed/2018/frontiers_neuroscience/tuh_eeg_seizure

Our preferred reference for the TUH EEG Corpus, from which this
seizure corpus was derived, is:

 Obeid, I., & Picone, J. (2016). The Temple University Hospital EEG
 Data Corpus. Frontiers in Neuroscience, Section Neural Technology,
 10, 196. http://doi.org/10.3389/fnins.2016.00196

v1.5.2 of the TUH EEG Seizure Corpus was based on v1.1.0 of the
TUH EEG Corpus.

There are two main directories in this release: dev and train. 
The train directory contains data you are allowed to use for the
development of your technology. The dev data is disjoint from the
training set and should only be used for testing.

There is also a blind evaluation set that we are reserving for open-source
competitions.

The top-level directories:

 edf/dev/01_tcp_ar
 edf/dev/02_tcp_le
 edf/dev/03_tcp_ar_a
 edf/train/01_tcp_ar
 edf/train/02_tcp_le
 edf/train/03_tcp_ar_a

refer to the appropriate channel configurations for the
EEGs. For example, 01_tcp_ar refers to an averaged reference (AR)
configuration, with annotations that use the TCP montage described below.

The pathname of a typical EEG file can be explained as follows:

 Filename:
  edf/dev/01_tcp_ar/002/00000258/s002_2003_07_21/00000258_s002_t000.edf

 Components:
  edf: contains the edf data

  dev: part of the development test set (as opposed to train)

  01_tcp_ar: data that follows the averaged reference (AR) configuration,
             while annotations use the TCP channel configuration

  002: a three-digit identifier meant to keep the number of subdirectories
       in a directory manageable. This follows the TUH EEG v1.1.0 convention.

  00000258: official patient number that is linked to v1.1.0 of TUH EEG

  s002_2003_07_21: session two (s002) for this patient. The session
                   was archived on 07/21/2003.

  00000258_s002_t000.edf: the actual EEG file. Each recording is split into
                          a series of files named t000.edf, t001.edf, ...
                          These are pruned EEGs: the original EEG was split
                          into these segments, and uninteresting parts of
                          the original recording were deleted (common in
                          clinical practice).
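
As an illustration of the naming convention above, here is a minimal Python
sketch that splits a pathname into its components. The names EegPath and
parse_eeg_path are hypothetical (they are not part of any tool shipped with
this release), and the sketch assumes every path follows exactly the
seven-part layout shown in the example.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class EegPath(NamedTuple):
    """Components of a TUSZ pathname (hypothetical helper type)."""
    split: str     # "dev" or "train"
    montage: str   # e.g. "01_tcp_ar"
    group: str     # three-digit subdirectory, e.g. "002"
    patient: str   # patient number, e.g. "00000258"
    session: str   # session id, e.g. "s002"
    date: str      # archive date rewritten as YYYY-MM-DD
    segment: str   # pruned segment id, e.g. "t000"


def parse_eeg_path(path: str) -> EegPath:
    """Split a path such as edf/dev/01_tcp_ar/002/.../xxx_t000.edf."""
    parts = PurePosixPath(path).parts
    if len(parts) != 7 or parts[0] != "edf":
        raise ValueError(f"unexpected path layout: {path}")
    _, split, montage, group, patient, session_dir, filename = parts
    # The session directory looks like s002_2003_07_21.
    session, year, month, day = session_dir.split("_")
    # The file stem looks like 00000258_s002_t000; keep the last field.
    segment = PurePosixPath(filename).stem.split("_")[-1]
    return EegPath(split, montage, group, patient, session,
                   f"{year}-{month}-{day}", segment)


example = parse_eeg_path(
    "edf/dev/01_tcp_ar/002/00000258/s002_2003_07_21/00000258_s002_t000.edf")
print(example)
```

A path that does not have seven components raises ValueError rather than
being silently misparsed.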

The easiest way to access the annotations is through the spreadsheet
provided (_SEIZURES_*.xlsx). This contains the start and stop time
of each seizure event in an easy to understand format. Convert the
file to .csv if you need a machine-readable version.

There are six types of files in this release:

 *.edf:    the EEG sampled data in European Data Format (edf)
 *.txt:    the EEG report corresponding to the patient and session
 *.tse:    term-based annotations using all available seizure type classes
 *.tse_bi: same as *.tse except bi-class annotations (seizure/background) 
 *.lbl:    event-based annotations using all available seizure type classes
 *.lbl_bi: same as *.lbl except bi-class annotations (seizure/background)

Event-based annotations are per-channel. This means the annotation contains,
in addition to a start and stop time, a channel index. Seizures often can
be observed on one or more channels and then spread to other channels.
Event-based annotations capture this.

Term-based annotations use one label that applies to all channels. These
are most useful for machine learning research in which we tend to worry
only about the overall classification of a segment and are not concerned
about individual channels.

Bi-class annotations use two labels: seizure and background. The multi-class
annotations use all available seizure types. These are described in the
spreadsheet (_SEIZURES_*.xlsx).

Time-synchronous event (TSE) files use a simple format that looks like this:

 0.0000 490.0000 bckg 1.0000

The fields are: start time in secs, stop time in secs, label and probability
(by default, set to 1.0).
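
The four-field layout above can be read with a few lines of Python. This is
only a sketch: TseEvent, parse_tse_line and seizure_seconds are hypothetical
names, the second sample line (with the label "seiz") is invented for
illustration, and any header or blank lines a real file may contain are not
handled here.

```python
from typing import NamedTuple


class TseEvent(NamedTuple):
    """One annotation line: start, stop, label, probability."""
    start: float   # start time in secs
    stop: float    # stop time in secs
    label: str     # e.g. "bckg"
    prob: float    # probability, 1.0 by default


def parse_tse_line(line: str) -> TseEvent:
    """Parse a single whitespace-separated TSE annotation line."""
    start, stop, label, prob = line.split()
    return TseEvent(float(start), float(stop), label, float(prob))


def seizure_seconds(events, background_label="bckg"):
    """Sum the duration of every event not labeled as background.

    Assumption: any label other than the background label counts as
    seizure time.
    """
    return sum(e.stop - e.start for e in events
               if e.label != background_label)


sample_lines = [
    " 0.0000 490.0000 bckg 1.0000",    # the line shown above
    " 490.0000 520.0000 seiz 1.0000",  # invented line for illustration
]
events = [parse_tse_line(line) for line in sample_lines]
print(events[0])
print(seizure_seconds(events))
```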

Label files (LBL) are more complicated and essentially describe a graph
that can represent a hierarchical annotation (e.g., FNSZ and GNSZ map to
SEIZ). They contain the start and stop times, a channel index, a level
index and probabilities for each class or symbol.

Clinical EEGs use a variety of channel configurations. In the larger
TUH EEG Corpus, there are over 40 different channel configurations. In
this subset, there are two types of EEGs: averaged reference (AR) and
linked ears reference (LE). Fortunately, all files in this subset
contain the standard channels you would expect from a 10/20
configuration, and all files can be converted to a TCP montage (which
is what we use internally for our processing).

What is somewhat confusing is that some patients have sessions listed
under both 01_tcp_ar and 02_tcp_le. There are 50 unique patients in
the development test set and 266 patients in the training set. But a find
command will return slightly higher numbers:

 find dev_test -mindepth 2 -maxdepth 2 | wc
     65      65    1570
 find train -mindepth 2 -maxdepth 2 | wc
     303     303    7675

because some patients appear in multiple montages:

 ls -1 -d */*/*/*/00002991
  edf/train/01_tcp_ar/029/00002991
  edf/train/02_tcp_le/029/00002991
  edf/train/03_tcp_ar_a/029/00002991
 ls -1 -d */*/*/*/00002297
  edf/dev/02_tcp_le/022/00002297
  edf/dev/03_tcp_ar_a/022/00002297

To learn more about this, please consult the following publication:

 Lopez, S., Gross, A., Yang, S., Golmohammadi, M., Obeid, I., &
 Picone, J. (2016). An Analysis of Two Common Reference Points for
 EEGs. In IEEE Signal Processing in Medicine and Biology Symposium
 (pp. 1–4). Philadelphia, Pennsylvania, USA. Available at:
 https://www.isip.piconepress.com/publications/conference_proceedings/2016/ieee_spmb/montages/.
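
The gap between the number of patient directories and the number of unique
patients can be checked by collapsing the directories on the patient number.
The Python sketch below does this for the five directories listed above;
patients_by_split is a hypothetical helper, and it assumes each entry has
the form edf/<split>/<montage>/<group>/<patient>.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def patients_by_split(patient_dirs):
    """Map each split (dev/train) to its set of unique patient numbers."""
    patients = defaultdict(set)
    for d in patient_dirs:
        parts = PurePosixPath(d).parts
        # parts is ("edf", split, montage, group, patient)
        if len(parts) != 5:
            raise ValueError(f"unexpected directory layout: {d}")
        patients[parts[1]].add(parts[4])
    return dict(patients)


dirs = [
    "edf/train/01_tcp_ar/029/00002991",
    "edf/train/02_tcp_le/029/00002991",
    "edf/train/03_tcp_ar_a/029/00002991",
    "edf/dev/02_tcp_le/022/00002297",
    "edf/dev/03_tcp_ar_a/022/00002297",
]
unique = patients_by_split(dirs)
for split, ids in sorted(unique.items()):
    print(split, len(ids), sorted(ids))
```

Five directories collapse to two patients here, one per split, because each
patient appears under more than one montage.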

The channel number in .lbl and .lbl_bi files refers to the channels
defined using a standard ACNS TCP montage. This is our preferred way
of viewing seizure data. The montage is defined as follows:

 montage =  0, FP1-F7: EEG FP1-REF --  EEG F7-REF
 montage =  1, F7-T3:  EEG F7-REF  --  EEG T3-REF
 montage =  2, T3-T5:  EEG T3-REF  --  EEG T5-REF
 montage =  3, T5-O1:  EEG T5-REF  --  EEG O1-REF
 montage =  4, FP2-F8: EEG FP2-REF --  EEG F8-REF
 montage =  5, F8-T4:  EEG F8-REF  --  EEG T4-REF
 montage =  6, T4-T6:  EEG T4-REF  --  EEG T6-REF
 montage =  7, T6-O2:  EEG T6-REF  --  EEG O2-REF
 montage =  8, A1-T3:  EEG A1-REF  --  EEG T3-REF
 montage =  9, T3-C3:  EEG T3-REF  --  EEG C3-REF
 montage = 10, C3-CZ:  EEG C3-REF  --  EEG CZ-REF
 montage = 11, CZ-C4:  EEG CZ-REF  --  EEG C4-REF
 montage = 12, C4-T4:  EEG C4-REF  --  EEG T4-REF
 montage = 13, T4-A2:  EEG T4-REF  --  EEG A2-REF
 montage = 14, FP1-F3: EEG FP1-REF --  EEG F3-REF
 montage = 15, F3-C3:  EEG F3-REF  --  EEG C3-REF
 montage = 16, C3-P3:  EEG C3-REF  --  EEG P3-REF
 montage = 17, P3-O1:  EEG P3-REF  --  EEG O1-REF
 montage = 18, FP2-F4: EEG FP2-REF --  EEG F4-REF
 montage = 19, F4-C4:  EEG F4-REF  --  EEG C4-REF
 montage = 20, C4-P4:  EEG C4-REF  --  EEG P4-REF
 montage = 21, P4-O2:  EEG P4-REF  --  EEG O2-REF

For example, channel 1 is the difference between electrodes F7 and T3,
computed as the arithmetic difference (F7-REF)-(T3-REF) of two channels
contained in the EDF file.
For files in the 02_tcp_le montage, the channels are named with an LE
suffix instead of REF (e.g., EEG P4-LE). All channel derivations are the
same.
For files in the 03_tcp_ar_a montage the derivations EEG A1-REF and 
EEG A2-REF are not included.
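
The derivation described above can be sketched in Python as follows. The
table is transcribed from the montage definition in this file; the function
derive_channel and the dictionary layout (channel name mapped to a list of
samples) are assumptions made for illustration, and reading the samples out
of the EDF file is left to whatever EDF reader you use. The sketch does not
address how channels are numbered for 03_tcp_ar_a files; it simply raises
KeyError when a required channel is missing.

```python
# Electrode pairs for montage indices 0..21, in the order listed above.
TCP_MONTAGE = [
    ("FP1", "F7"), ("F7", "T3"), ("T3", "T5"), ("T5", "O1"),
    ("FP2", "F8"), ("F8", "T4"), ("T4", "T6"), ("T6", "O2"),
    ("A1", "T3"), ("T3", "C3"), ("C3", "CZ"), ("CZ", "C4"),
    ("C4", "T4"), ("T4", "A2"),
    ("FP1", "F3"), ("F3", "C3"), ("C3", "P3"), ("P3", "O1"),
    ("FP2", "F4"), ("F4", "C4"), ("C4", "P4"), ("P4", "O2"),
]


def derive_channel(signals, index, suffix="REF"):
    """Return montage channel `index` as a sample-wise difference.

    signals: dict mapping EDF channel names (e.g. "EEG F7-REF") to
             equal-length lists of samples (assumed layout).
    suffix:  "REF" for AR files, "LE" for 02_tcp_le files.
    """
    first, second = TCP_MONTAGE[index]
    a = signals[f"EEG {first}-{suffix}"]
    b = signals[f"EEG {second}-{suffix}"]
    return [x - y for x, y in zip(a, b)]


# Tiny made-up signals, two samples per channel.
signals = {"EEG F7-REF": [5.0, 7.0], "EEG T3-REF": [2.0, 10.0]}
print(derive_channel(signals, 1))  # channel 1 is F7-T3
```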

A spreadsheet is provided that classifies each seizure by type. This 
spreadsheet contains a legend that explains these fields.

Finally, here are some basic descriptive statistics about the data:

EVAL SET (EVAL)

total files: 1023
total sessions: 152
total patients: 50

files with seizures: 235
sessions with seizures: 78
patients with seizures: 42
total number of seizures: 511

total duration: 542,125.00 secs
total duration of files with seizures: 186,119.00 secs (34.33%)
total background duration: 505,250.36920 secs (93.20%)
*total seizure duration: 37,884.63080 secs (6.98%)

Note: The background duration is calculated from the duration of the
bckg labels, which when summed, is slightly larger than (100% - seiz dur).
The former is 93.20%, while the latter is 93.02%. However, this
difference is insignificant and a result of the end of file not coinciding
exactly with the end of the last bckg label.
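
The percentages in this note can be reproduced from the durations listed
above. The short Python sketch below uses only the EVAL figures from this
file; label_shares is a hypothetical helper name.

```python
def label_shares(total_secs, bckg_secs, seiz_secs):
    """Return background %, seizure %, and labeled time beyond the total."""
    bckg_pct = 100.0 * bckg_secs / total_secs
    seiz_pct = 100.0 * seiz_secs / total_secs
    # Positive when the labels sum to more than the file durations.
    excess_secs = bckg_secs + seiz_secs - total_secs
    return bckg_pct, seiz_pct, excess_secs


# EVAL figures copied from the statistics above.
bckg_pct, seiz_pct, excess_secs = label_shares(
    total_secs=542125.00,
    bckg_secs=505250.36920,
    seiz_secs=37884.63080,
)
print(f"background share:  {bckg_pct:.3f}%")
print(f"seizure share:     {seiz_pct:.3f}%")
print(f"labels minus total: {excess_secs:.1f} secs")
```

For the EVAL set the two label durations sum to about 1,010 secs more than
the total duration, which is the small discrepancy the note refers to.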

-----------------------------

DEVELOPMENT SET (DEV)

total files: 1013
total sessions: 238
total patients: 50

files with seizures: 280
sessions with seizures: 104
patients with seizures: 40
total number of seizures: 673

total duration: 613,232.00 secs
total duration of files with seizures: 230,031.00 secs (37.51%)
total background duration: 554,786.89340 secs (90.47%)
total seizure duration: 58,445.10660 secs (9.53%)

-----------------------------

TRAINING SET (TRAIN)

total files: 4599
total sessions: 1185
total patients: 592

files with seizures: 869
sessions with seizures: 343
patients with seizures: 202
total number of seizures: 2,377

total duration: 2,726,212.00 secs
total duration of files with seizures: 639,300.00 secs (23.45%)
total background duration: 2,540,689.25700 secs (93.19%)
total seizure duration: 169,793.74300 secs (6.23%)

Note: The background duration is calculated from the duration of the
bckg labels, which when summed, is slightly smaller than (100% - seiz dur).
The former is 93.19%, while the latter is 93.77%. However, this
difference is insignificant and a result of the end of file not coinciding
exactly with the end of the last bckg label.

-----------------------------

If you have any additional comments or questions about the data,
please direct them to help@nedcdata.org.

Best regards,

Joe Picone