Class FastqRecordReader

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public abstract class FastqRecordReader
    extends org.apache.hadoop.mapreduce.RecordReader<Void,org.apache.hadoop.io.Text>
    A record reader for the interleaved FASTQ format. Reads an input file and parses each interleaved FASTQ read pair into a single Text value. That value is then passed to FastqConverter, which converts it into two Alignments.
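As a rough illustration of the interleaved layout described above, the following sketch splits one eight-line block of text into its two four-line reads. It uses plain Java strings rather than Hadoop's Text, it assumes every record occupies exactly four lines, and the names InterleavedPairSketch and splitPair are hypothetical, not part of this API.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of this API: illustrates the interleaved layout. */
final class InterleavedPairSketch {

    /**
     * Splits an interleaved pair (eight lines) into two four-line FASTQ records.
     * Assumes each record is: name line, bases, '+' separator, qualities.
     */
    public static String[] splitPair(String pair) {
        String[] lines = pair.split("\n");
        if (lines.length != 8) {
            throw new IllegalArgumentException("Expected 8 lines, found " + lines.length);
        }
        String first = String.join("\n", Arrays.copyOfRange(lines, 0, 4));
        String second = String.join("\n", Arrays.copyOfRange(lines, 4, 8));
        return new String[] { first, second };
    }

    public static void main(String[] args) {
        String pair = "@read/1\nACGT\n+\nIIII\n@read/2\nTGCA\n+\nJJJJ\n";
        String[] reads = splitPair(pair);
        System.out.println(reads[0]);
        System.out.println(reads[1]);
    }
}
```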
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int DEFAULT_MAX_READ_LENGTH
      Default maximum read length, 10,000 bp.
      protected long end
      First index value beyond the slice, i.e. slice is in range [start,end).
      protected boolean isCompressed
      True if the underlying data is compressed.
      protected boolean isSplittable
      True if the underlying data is splittable.
      static String MAX_READ_LENGTH_PROPERTY
      Maximum read length property name.
      protected long pos
      Current position in file.
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected FastqRecordReader(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapreduce.lib.input.FileSplit split)
      Builds a new record reader given a Hadoop configuration and an input split.
    • Method Summary

      Methods
      Modifier and Type Method Description
      protected abstract boolean checkBuffer(int bufferLength, org.apache.hadoop.io.Text buffer)
      Checks to see whether the buffer is positioned at a valid record.
      void close()
      Close this RecordReader to future operations.
      Void getCurrentKey()
      FASTQ has no keys, so we return null.
      org.apache.hadoop.io.Text getCurrentValue()
      Returns the last interleaved FASTQ record.
      float getProgress()
      How much of the input has the RecordReader consumed?
      void initialize(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
      protected boolean lowLevelFastqRead(org.apache.hadoop.io.Text readName, org.apache.hadoop.io.Text value)
      Parses a read from an interleaved FASTQ file.
      protected String makePositionMessage()
      Produces a debugging message with the file position.
      protected abstract boolean next(org.apache.hadoop.io.Text value)
      Reads from the input split.
      boolean nextKeyValue()
      Seeks ahead in our split to the next key-value pair.
      protected int positionAtFirstRecord(org.apache.hadoop.fs.FSDataInputStream stream, org.apache.hadoop.io.compress.CompressionCodec codec)
      Position the input stream at the start of the first record.
      static void setMaxReadLength(org.apache.hadoop.conf.Configuration conf, int maxReadLength)
      Set the maximum read length property to maxReadLength.
    • Field Detail

      • DEFAULT_MAX_READ_LENGTH

        public static final int DEFAULT_MAX_READ_LENGTH
        Default maximum read length, 10,000 bp.
        See Also:
        Constant Field Values
      • MAX_READ_LENGTH_PROPERTY

        public static final String MAX_READ_LENGTH_PROPERTY
        Maximum read length property name.
        See Also:
        Constant Field Values
      • end

        protected long end
        First index value beyond the slice, i.e. slice is in range [start,end).
      • pos

        protected long pos
        Current position in file.
      • isSplittable

        protected boolean isSplittable
        True if the underlying data is splittable.
      • isCompressed

        protected boolean isCompressed
        True if the underlying data is compressed.
    • Constructor Detail

      • FastqRecordReader

        protected FastqRecordReader(org.apache.hadoop.conf.Configuration conf,
                                    org.apache.hadoop.mapreduce.lib.input.FileSplit split)
                             throws IOException
        Builds a new record reader given a Hadoop configuration and an input split.
        Parameters:
        conf - The Hadoop configuration object. Used for gaining access to the underlying file system.
        split - The file split to read.
        Throws:
        IOException
    • Method Detail

      • setMaxReadLength

        public static void setMaxReadLength(org.apache.hadoop.conf.Configuration conf,
                                            int maxReadLength)
        Set the maximum read length property to maxReadLength.
        Parameters:
        conf - configuration
        maxReadLength - maximum read length, in base pairs (bp)
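The pattern of a property name paired with a default can be sketched as follows. This is a stand-in only: java.util.Properties replaces Hadoop's Configuration so the example runs on its own, the key string is a made-up placeholder (the real property name is listed under Constant Field Values and is not shown on this page), and the getter is a hypothetical counterpart that this class does not document.

```java
import java.util.Properties;

/** Hypothetical stand-in, not this class's implementation. */
final class MaxReadLengthSketch {

    /** Placeholder key (assumption); the real property name is not shown here. */
    public static final String MAX_READ_LENGTH_KEY = "example.fastq.max-read-length";

    /** Mirrors the documented default of 10,000 bp. */
    public static final int DEFAULT_MAX_READ_LENGTH = 10_000;

    /** Stores the maximum read length, in base pairs, under the key. */
    public static void setMaxReadLength(Properties conf, int maxReadLength) {
        conf.setProperty(MAX_READ_LENGTH_KEY, Integer.toString(maxReadLength));
    }

    /** Hypothetical getter: falls back to the default when the key is unset. */
    public static int getMaxReadLength(Properties conf) {
        String raw = conf.getProperty(MAX_READ_LENGTH_KEY);
        return raw == null ? DEFAULT_MAX_READ_LENGTH : Integer.parseInt(raw);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(getMaxReadLength(conf)); // default applies
        setMaxReadLength(conf, 50_000);
        System.out.println(getMaxReadLength(conf)); // explicit value applies
    }
}
```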
      • checkBuffer

        protected abstract boolean checkBuffer(int bufferLength,
                                               org.apache.hadoop.io.Text buffer)
        Checks to see whether the buffer is positioned at a valid record.
        Parameters:
        bufferLength - The length of the line currently in the buffer.
        buffer - A buffer containing a peek at the first line in the current stream.
        Returns:
        Returns true if the buffer contains the first line of a properly formatted FASTQ record.
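One plausible shape for such a check, sketched in plain Java. It assumes that a record's first line is a name line starting with '@'. The stricter variant, which also requires a /1 suffix for the first read of a pair, is an assumption about how a subclass might behave and is not stated on this page. A byte array stands in for Hadoop's Text, and FirstLineCheckSketch and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, not this class's implementation. */
final class FirstLineCheckSketch {

    /** True if the first bufferLength bytes look like a FASTQ name line. */
    public static boolean looksLikeNameLine(int bufferLength, byte[] buffer) {
        return bufferLength > 0
            && bufferLength <= buffer.length
            && buffer[0] == '@';
    }

    /** Stricter variant (assumption): also require the "/1" suffix of a first-of-pair read. */
    public static boolean looksLikeFirstOfPair(int bufferLength, byte[] buffer) {
        return bufferLength >= 3
            && bufferLength <= buffer.length
            && buffer[0] == '@'
            && buffer[bufferLength - 2] == '/'
            && buffer[bufferLength - 1] == '1';
    }

    public static void main(String[] args) {
        byte[] line = "@read/1".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeNameLine(line.length, line));
        System.out.println(looksLikeFirstOfPair(line.length, line));
    }
}
```

Because '@' is also a legal quality character, a check on the first byte alone can accept a quality line, so a check of this kind is only a first filter.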
      • positionAtFirstRecord

        protected final int positionAtFirstRecord(org.apache.hadoop.fs.FSDataInputStream stream,
                                                  org.apache.hadoop.io.compress.CompressionCodec codec)
                                           throws IOException
        Position the input stream at the start of the first record.
        Parameters:
        stream - The stream to reposition.
        Throws:
        IOException
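This page does not describe how the first record is located, so the following is only a sketch of one common heuristic, written over an in-memory string instead of an FSDataInputStream. A split can begin in the middle of a record, and a quality line can itself begin with '@', so the sketch accepts a line as a record start only if it begins with '@' and the line two below it begins with '+'. FirstRecordSketch and offsetOfFirstRecord are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, not this class's implementation. */
final class FirstRecordSketch {

    /**
     * Returns the byte offset of the first line that starts with '@' and is
     * followed two lines later by a line starting with '+', or -1 if none.
     */
    public static int offsetOfFirstRecord(String data) {
        // Keep trailing empty strings so line indices match the raw input.
        String[] lines = data.split("\n", -1);
        int offset = 0;
        for (int i = 0; i + 2 < lines.length; i++) {
            if (lines[i].startsWith("@") && lines[i + 2].startsWith("+")) {
                return offset;
            }
            // Advance past this line and its newline terminator.
            offset += lines[i].getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The split begins on the tail of a previous record's quality line.
        String split = "@III\n@r2/1\nACGT\n+\nIIII\n";
        System.out.println(offsetOfFirstRecord(split));
    }
}
```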
      • initialize

        public final void initialize(org.apache.hadoop.mapreduce.InputSplit split,
                                     org.apache.hadoop.mapreduce.TaskAttemptContext context)
                              throws IOException,
                                     InterruptedException
        Specified by:
        initialize in class org.apache.hadoop.mapreduce.RecordReader<Void,org.apache.hadoop.io.Text>
        Throws:
        IOException
        InterruptedException
      • getCurrentKey

        public final Void getCurrentKey()
        FASTQ has no keys, so we return null.
        Specified by:
        getCurrentKey in class org.apache.hadoop.mapreduce.RecordReader<Void,org.apache.hadoop.io.Text>
        Returns:
        Always returns null.
      • getCurrentValue

        public final org.apache.hadoop.io.Text getCurrentValue()
        Returns the last interleaved FASTQ record.
        Specified by:
        getCurrentValue in class org.apache.hadoop.mapreduce.RecordReader<Void,org.apache.hadoop.io.Text>
        Returns:
        The text corresponding to the last read pair.
      • nextKeyValue

        public final boolean nextKeyValue()
                                   throws IOException,
                                          InterruptedException
        Seeks ahead in our split to the next key-value pair. Triggers the read of an interleaved FASTQ read pair, and populates internal state.
        Specified by:
        nextKeyValue in class org.apache.hadoop.mapreduce.RecordReader<Void,org.apache.hadoop.io.Text>
        Returns:
        True if reading the next read pair succeeded.
        Throws:
        IOException
        InterruptedException
      • close

        public final void close()
                         throws IOException
        Close this RecordReader to future operations.
        Specified by:
        close in interface AutoCloseable
        Specified by:
        close in interface Closeable
        Specified by:
        close in class org.apache.hadoop.mapreduce.RecordReader<Void,org.apache.hadoop.io.Text>
        Throws:
        IOException
      • getProgress

        public final float getProgress()
        How much of the input has the RecordReader consumed?
        Specified by:
        getProgress in class org.apache.hadoop.mapreduce.RecordReader<Void,org.apache.hadoop.io.Text>
        Returns:
        A value in [0.0, 1.0] giving the fraction of the total bytes to read that have been read so far.
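A fraction of this kind is typically derived from the slice bounds and the current position. The sketch below assumes a start offset alongside the documented pos and end fields, with the slice being [start,end). The start parameter, the handling of an empty slice, and the name ProgressSketch are assumptions for illustration, not details taken from this page.

```java
/** Hypothetical sketch, not this class's implementation. */
final class ProgressSketch {

    /**
     * Fraction of the slice [start, end) consumed at position pos,
     * clamped to the range [0.0, 1.0].
     */
    public static float progress(long start, long pos, long end) {
        if (end <= start) {
            return 0.0f; // assumption: report no progress for an empty slice
        }
        float fraction = (pos - start) / (float) (end - start);
        return Math.max(0.0f, Math.min(1.0f, fraction));
    }

    public static void main(String[] args) {
        System.out.println(progress(100L, 150L, 200L));
    }
}
```

Clamping matters because a reader may finish the record that straddles the end of its slice, which leaves pos beyond end.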
      • makePositionMessage

        protected final String makePositionMessage()
        Produces a debugging message with the file position.
        Returns:
        Returns a string containing {filename}:{index}.
      • lowLevelFastqRead

        protected final boolean lowLevelFastqRead(org.apache.hadoop.io.Text readName,
                                                  org.apache.hadoop.io.Text value)
                                           throws IOException
        Parses a read from an interleaved FASTQ file. Only reads a single record.
        Parameters:
        readName - Text record containing read name. Output parameter.
        value - Text record containing full record. Output parameter.
        Returns:
        Returns true if read was successful (did not hit EOF).
        Throws:
        RuntimeException - If the FASTQ record is not properly formatted (e.g., the record doesn't start with @).
        IOException
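The contract above (two output parameters, false at EOF, a RuntimeException on a malformed record) can be sketched with standard Java I/O. A BufferedReader and StringBuilders stand in for the Hadoop stream and Text. Appending to value rather than overwriting it, storing the name without its leading '@', and treating a record cut off part-way as malformed are all assumptions of this sketch, and SingleRecordReadSketch and readRecord are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical sketch, not this class's implementation. */
final class SingleRecordReadSketch {

    /**
     * Reads one four-line FASTQ record. Both builders are output parameters:
     * readName is overwritten, value is appended to.
     * Returns false if EOF is reached before a record starts.
     */
    public static boolean readRecord(BufferedReader in, StringBuilder readName,
                                     StringBuilder value) throws IOException {
        String name = in.readLine();
        if (name == null) {
            return false;
        }
        if (!name.startsWith("@")) {
            throw new RuntimeException("Record does not start with '@': " + name);
        }
        String bases = in.readLine();
        String separator = in.readLine();
        String qualities = in.readLine();
        // Assumption: a record cut off part-way is treated as malformed.
        if (bases == null || separator == null || qualities == null
                || !separator.startsWith("+")) {
            throw new RuntimeException("Malformed record: " + name);
        }
        readName.setLength(0);
        readName.append(name, 1, name.length()); // name without the leading '@'
        value.append(name).append('\n')
             .append(bases).append('\n')
             .append(separator).append('\n')
             .append(qualities).append('\n');
        return true;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
            new StringReader("@r/1\nACGT\n+\nIIII\n@r/2\nTGCA\n+\nJJJJ\n"));
        StringBuilder name = new StringBuilder();
        StringBuilder pair = new StringBuilder();
        // Two single-record reads accumulate one interleaved pair in 'pair'.
        boolean ok = readRecord(in, name, pair) && readRecord(in, name, pair);
        System.out.println(ok);
        System.out.print(pair);
    }
}
```

Appending to value lets a caller build an interleaved pair from two consecutive single-record reads, as the main method does.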
      • next

        protected abstract boolean next(org.apache.hadoop.io.Text value)
                                 throws IOException
        Reads from the input split.
        Parameters:
        value - Text record to write input value into.
        Returns:
        True if the read was successful.
        Throws:
        IOException
        See Also:
        lowLevelFastqRead(Text, Text)