
jkbonfield/io_lib 30

Staden Package "io_lib" (sometimes referred to as libstaden-read by distributions). This contains code for reading and writing a variety of Bioinformatics / DNA Sequence formats.

daviesrob/bwa 2

Burrows-Wheeler Aligner

daviesrob/hts-specs 1

Specifications of SAM/BAM and related high-throughput sequencing file formats

daviesrob/mmonitor 1

Very simple memory monitor

daviesrob/pthread_mon 1

Pthreads locking monitor

daviesrob/samtools 1

This is not the official development repository for samtools. For that you should use samtools/samtools.

daviesrob/bambi 0

A set of programs to manipulate SAM/BAM/CRAM files, using HTSLIB

daviesrob/bamtools 0

C++ API & command-line toolkit for working with BAM data

daviesrob/bcftools 0

This is the official development repository for BCFtools. To compile, the develop branch of htslib is needed: git clone --branch=develop git://github.com/samtools/htslib.git htslib

daviesrob/hts-binaries 0

Build script to make binary distributions of HTSlib, samtools and bcftools

pull request comment samtools/htslib

Capturing Error Messages

I guess this might work if you started a thread yourself, and then used it exclusively for a single file. That doesn't really match how HTSlib operates, where thread pool workers will pick up decoding jobs as they arrive, and don't have any link to a specific file.

I think what you really need is a way of linking the messages to a specific file handle, or possibly API call. I'm afraid it might not be too easy, given the constraints on how the library works at the moment. We'll have to think about it, but I fear that there won't be an easy solution to this problem.

BickfordA

comment created time in an hour

pull request comment samtools/htslib

Capturing Error Messages

This looks like messages from threads could be trapped inside buffers that can't be easily accessed. I think it might be better if the messages from all threads could be combined into a stream.

This could be done now without making any changes to HTSlib by redirecting STDERR to a self-pipe. It would be necessary to ensure that the reading end of the pipe got drained promptly.
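A rough sketch of the self-pipe idea, assuming a POSIX system. The `stderr_capture` struct and the `capture_*` helpers are hypothetical names made up for this illustration; none of them exist in HTSlib.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper state, not part of HTSlib. */
typedef struct {
    int saved_stderr;  /* duplicate of the original fd 2 */
    int read_fd;       /* read end of the pipe */
} stderr_capture;

/* Point fd 2 at a pipe.  Returns 0 on success, -1 on failure. */
static int capture_start(stderr_capture *c) {
    int fds[2];
    if (pipe(fds) < 0)
        return -1;
    fflush(stderr);
    c->saved_stderr = dup(STDERR_FILENO);
    if (c->saved_stderr < 0) {
        close(fds[0]); close(fds[1]);
        return -1;
    }
    if (dup2(fds[1], STDERR_FILENO) < 0) {
        close(c->saved_stderr);
        close(fds[0]); close(fds[1]);
        return -1;
    }
    close(fds[1]);  /* fd 2 now holds the only write end */
    /* Non-blocking reads, so draining never stalls the caller. */
    int fl = fcntl(fds[0], F_GETFL);
    fcntl(fds[0], F_SETFL, fl | O_NONBLOCK);
    c->read_fd = fds[0];
    return 0;
}

/* Copy whatever is pending into buf (NUL-terminated); returns bytes read. */
static ssize_t capture_drain(stderr_capture *c, char *buf, size_t len) {
    assert(len > 0);
    fflush(stderr);
    ssize_t n = read(c->read_fd, buf, len - 1);
    if (n < 0)
        n = 0;  /* nothing pending, or a read error */
    buf[n] = '\0';
    return n;
}

/* Restore the original stderr and release the pipe. */
static void capture_stop(stderr_capture *c) {
    fflush(stderr);
    dup2(c->saved_stderr, STDERR_FILENO);
    close(c->saved_stderr);
    close(c->read_fd);
}
```

In real use `capture_drain()` would probably need to run from a dedicated thread or an event loop, because writers to stderr will block once the pipe's kernel buffer fills up.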

If we do make changes to HTSlib, we could try making a global thread-safe ring buffer to capture all of the messages. Or we could allow the messages to be redirected to an hFILE, and make some sort of plugin to allow the messages to be captured into a memory buffer.
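To give an idea of what the ring buffer option might look like, here is a minimal sketch using a pthread mutex. The `msg_ring` type, its fixed sizes, and the `ring_push()` / `ring_pop()` functions are all assumptions for illustration; nothing like this exists in HTSlib at the moment.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MSG_SLOTS 8    /* illustrative capacity */
#define MSG_LEN   128  /* illustrative maximum message length */

/* Hypothetical message ring; oldest entries are overwritten when full. */
typedef struct {
    pthread_mutex_t lock;
    char msgs[MSG_SLOTS][MSG_LEN];
    unsigned head;   /* index of the oldest stored message */
    unsigned count;  /* number of stored messages */
} msg_ring;

#define MSG_RING_INIT { PTHREAD_MUTEX_INITIALIZER, {{0}}, 0, 0 }

/* Append a message, truncating it to MSG_LEN-1 characters. */
static void ring_push(msg_ring *r, const char *msg) {
    assert(r && msg);
    pthread_mutex_lock(&r->lock);
    unsigned slot = (r->head + r->count) % MSG_SLOTS;
    snprintf(r->msgs[slot], MSG_LEN, "%s", msg);
    if (r->count < MSG_SLOTS)
        r->count++;
    else
        r->head = (r->head + 1) % MSG_SLOTS;  /* dropped the oldest */
    pthread_mutex_unlock(&r->lock);
}

/* Remove the oldest message into out.  Returns 1 if one was available. */
static int ring_pop(msg_ring *r, char *out, size_t len) {
    int got = 0;
    pthread_mutex_lock(&r->lock);
    if (r->count) {
        snprintf(out, len, "%s", r->msgs[r->head]);
        r->head = (r->head + 1) % MSG_SLOTS;
        r->count--;
        got = 1;
    }
    pthread_mutex_unlock(&r->lock);
    return got;
}
```

Note that this still does not say which file a message came from, so it only addresses the "trapped in per-thread buffers" part of the problem.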

BickfordA

comment created time in 20 hours

pull request comment samtools/htslib

Progress and Cancelation for Indexing

I guess you could do this by probing from another thread. It's the thread-safe part of it that would be hard...

BickfordA

comment created time in 21 hours

pull request comment samtools/htslib

Progress and Cancelation for Indexing

We have htell and bgzf_tell, but currently no combined hts_tell. However, the cases here are different from the general one because the interfaces consume the entire file before returning. You either need a callback (as here), or a completely different indexing interface that gets called in a loop so you can update progress each time it returns.
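For illustration only, the loop-driven alternative might have roughly this shape. `toy_indexer` and `toy_indexer_step()` are invented for the sketch and simply count bytes; they stand in for whatever a real incremental indexing interface would do and are not an HTSlib API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical incremental indexer state. */
typedef struct {
    size_t pos;    /* bytes consumed so far */
    size_t total;  /* total bytes to consume */
    size_t chunk;  /* bytes handled per step */
} toy_indexer;

/* Do one bounded unit of work.  Returns 1 if more remains, 0 when done. */
static int toy_indexer_step(toy_indexer *ix) {
    assert(ix->chunk > 0);
    size_t left = ix->total - ix->pos;
    ix->pos += left < ix->chunk ? left : ix->chunk;
    return ix->pos < ix->total;
}

/* Caller-side loop: progress reporting and cancellation live out here. */
static int run_with_progress(toy_indexer *ix, int max_steps) {
    int steps = 0, more;
    do {
        more = toy_indexer_step(ix);
        steps++;
        printf("progress: %zu / %zu\n", ix->pos, ix->total);
        if (more && steps >= max_steps)
            return -1;  /* caller chose to stop early */
    } while (more);
    return steps;
}
```

Because control returns to the caller after every step, no callback is needed; the cost is that the indexing state has to be kept in a struct between calls.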

BickfordA

comment created time in a day

pull request comment samtools/htslib

Progress and Cancelation for Indexing

This is a nice idea, but currently it changes the signature of tbx_index(), which would change both the API and ABI. Instead, you should make a tbx_index2() with the extra parameters, and leave tbx_index() as it was.

The file_progress_func typedef would be better named hts_file_progress_func (or maybe hts_progress_callback?) to make name clashes with other progress callbacks less likely.
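The pattern I have in mind is roughly the following. To keep the sketch self-contained it uses stand-in functions `demo_index()` and `demo_index2()` rather than the real `tbx_index()`, and the callback signature, the `-2` return code, and the `cancel_after()` helper are all assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type; returning non-zero requests cancellation. */
typedef int (*hts_progress_callback)(void *data, long done, long total);

/* New entry point carrying the extra parameters. */
int demo_index2(long n_records, hts_progress_callback cb, void *cb_data) {
    assert(n_records >= 0);
    for (long i = 0; i < n_records; i++) {
        /* ... index record i here ... */
        if (cb && cb(cb_data, i + 1, n_records) != 0)
            return -2;  /* cancelled by the caller (illustrative code) */
    }
    return 0;
}

/* The original function keeps its signature, so API and ABI are unchanged. */
int demo_index(long n_records) {
    return demo_index2(n_records, NULL, NULL);
}

/* Example callback: cancel once it has been called `limit` times. */
typedef struct { long calls, limit; } cancel_state;

static int cancel_after(void *data, long done, long total) {
    cancel_state *s = data;
    (void)done; (void)total;
    return ++s->calls >= s->limit;
}
```

Existing callers of the old function are unaffected, and new callers opt in to progress reporting by calling the `2` variant.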

BickfordA

comment created time in a day

pull request comment samtools/htslib

Update docs: Iterate through bam1_t records

Yes, I think you're right. Although looking at it, the entire for loop should also have if (recs) { ... } around it due to the test on line 905.

BickfordA

comment created time in a day

Pull request review comment samtools/samtools

Add a "samtools consensus" sub-command.

+/*  bam_consensus.c -- consensus subcommand.++    Copyright (C) 1998-2001,2003 Medical Research Council (Gap4/5 source)+    Copyright (C) 2003-2005,2007-2021 Genome Research Ltd.++    Author: James Bonfield <jkb@sanger.ac.uk>++The primary work here is GRL since 2021, under an MIT license.+Sections derived from Gap5, which include calculate_consensus_gap5()+associated functions, are mostly copyright Genome Research Limited from+2003 onwards.  These were originally under a BSD license, but as GRL is+copyright holder these portions can be considered to also be under the+same MIT license below:+++Permission is hereby granted, free of charge, to any person obtaining a copy+of this software and associated documentation files (the "Software"), to deal+in the Software without restriction, including without limitation the rights+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell+copies of the Software, and to permit persons to whom the Software is+furnished to do so, subject to the following conditions:++The above copyright notice and this permission notice shall be included in+all copies or substantial portions of the Software.++THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER+DEALINGS IN THE SOFTWARE.  */++/*+ * The Gap5 consensus algorithm was in turn derived from the earlier Gap4+ * tool, developed by the Medical Research Council as part of the+ * Staden Package.  
It is unsure how much of this source code is still+ * extant, without deep review, but the license used was a compatible+ * modified BSD license, included below.+ */++/*+Modified BSD license for any legacy components from the Staden Package:++Copyright (c) 2003 MEDICAL RESEARCH COUNCIL+All rights reserved++Redistribution and use in source and binary forms, with or without+modification, are permitted provided that the following conditions are met:++   . Redistributions of source code must retain the above copyright notice,+this list of conditions and the following disclaimer.++   . Redistributions in binary form must reproduce the above copyright notice,+this list of conditions and the following disclaimer in the documentation+and/or other materials provided with the distribution.++   . Neither the name of the MEDICAL RESEARCH COUNCIL, THE LABORATORY OF+MOLECULAR BIOLOGY nor the names of its contributors may be used to endorse or+promote products derived from this software without specific prior written+permission.++THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR+ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON+ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.+*/+++// FIXME: also use strand to spot possible basecalling errors.+//        Specifically het calls where mods are predominantly on one+//        strand.  
So maybe require + and - calls and check concordance+//        before calling a het as confident.  (Still call, but low qual?)++// TODO: call by kmers rather than individual bases?  Or use kmers to skew+// quality at least.  It can identify variants that are low quality due to+// neighbouring edits that aren't consistently correlated.++// TODO: pileup callback ought to know when it's the last in the region /+// chromosome.  This means the caller code doesn't have to handle the+// termination phase and deduplicates the code.  (Changing from+// one chr to the next is the same as ending the last.)+//+// TODO: track which reads contribute to multiple confirmed (HQ) differences+// vs which contribute to only one (LQ) difference.  Correlated changes+// are more likely to be real.  Ie consensus more of a path than solely+// isolated columns.+//+// Either that or a dummy "end of data" call is made to signify end to+// permit tidying up.  Maybe add a "start of data" call too?++// Eg 50T 20A seems T/A het,+// but 30T+ 20T- 18A+ 2A- seems like a consistent A miscall on one strand+// only, while T is spread evenly across both strands.++#include <config.h>++#include <stdio.h>+#include <stdlib.h>+#include <math.h>+#include <limits.h>+#include <float.h>+#include <ctype.h>++#include <htslib/sam.h>++#include "samtools.h"+#include "sam_opts.h"+#include "bam_plbuf.h"+#include "consensus_pileup.h"++#ifdef __SSE__+#   include <xmmintrin.h>+#else+#   define _mm_prefetch(a,b)+#endif++#ifndef MIN+#  define MIN(a,b) ((a)<(b)?(a):(b))+#endif+#ifndef MAX+#  define MAX(a,b) ((a)>(b)?(a):(b))+#endif++// Minimum cutoff for storing mod data; => at least 10% chance+#define MOD_CUTOFF 0.46++enum format {+    FASTQ,+    FASTA,+    PILEUP+};++typedef unsigned char uc;++typedef struct {+    // User options+    char *reg;+    int use_qual;+    int min_qual;+    int adj_qual;+    int use_mqual;+    double scale_mqual;+    int nm_adjust;+    int nm_halo;+    int low_mqual;+    int high_mqual;+    int 
min_depth;+    double call_fract;+    double het_fract;+    int gap5;+    enum format fmt;+    int cons_cutoff;+    int ambig;+    int line_len;+    int default_qual;+    int het_only;+    int all_bases;+    int show_del;+    int show_ins;+    int excl_flags;+    int incl_flags;+    int min_mqual;++    // Internal state+    samFile *fp;+    FILE *fp_out;+    sam_hdr_t *h;+    hts_idx_t *idx;+    hts_itr_t *iter;+    kstring_t ks_line;+    kstring_t ks_ins_seq;+    kstring_t ks_ins_qual;+    int last_tid;+    hts_pos_t last_pos;+} consensus_opts;++/* --------------------------------------------------------------------------+ * A bayesian consensus algorithm that analyses the data to work out+ * which hypothesis of pure A/C/G/T/absent and all combinations of two+ * such bases meets the observations.+ *+ * This has its origins in Gap4 (homozygous) -> Gap5 (heterozygous)+ * -> Crumble (tidied up to use htslib's pileup) -> here.+ *+ */++#define CONS_DISCREP    4+#define CONS_ALL        15++#define CONS_MQUAL      16++typedef struct {+    /* the most likely base call - we never call N here */+    /* A=0, C=1, G=2, T=3, *=4 */+    int call;++    /* The most likely heterozygous base call */+    /* Use "ACGT*"[het / 5] vs "ACGT*"[het % 5] for the combination */+    int het_call;++    /* Log-odds for het_call */+    int het_logodd;++    /* Single phred style call */+    int phred;++    /* Sequence depth */+    int depth;++    /* Discrepancy search score */+    float discrep;+} consensus_t;++#define P_HET 1e-4++#define LOG10            2.30258509299404568401+#define TENOVERLOG10     4.34294481903251827652+#define TENLOG2OVERLOG10 3.0103++#ifdef __GNUC__+#define ALIGNED(x) __attribute((aligned(x)))+#else+#define ALIGNED(x)+#endif++static double prior[25]    ALIGNED(16);  /* Sum to 1.0 */+static double lprior15[15] ALIGNED(16);  /* 15 combinations of {ACGT*} */++/* Precomputed matrices for the consensus algorithm */+static double pMM[101] ALIGNED(16);+static double p__[101] 
ALIGNED(16);+static double p_M[101] ALIGNED(16);++static double e_tab_a[1002]  ALIGNED(16);+static double *e_tab = &e_tab_a[500];+static double e_tab2_a[1002] ALIGNED(16);+static double *e_tab2 = &e_tab2_a[500];+static double e_log[501]     ALIGNED(16);++/*+ * Lots of confusing matrix terms here, so some definitions will help.+ *+ * M = match base+ * m = match pad+ * _ = mismatch+ * o = overcall+ * u = undercall+ *+ * We need to distinguish between homozygous columns and heterozygous columns,+ * done using a flat prior.  This is implemented by treating every observation+ * as coming from one of two alleles, giving us a 2D matrix of possibilities+ * (the hypotheses) for each and every call (the observation).+ *+ * So pMM[] is the chance that given a call 'x' that it came from the+ * x/x allele combination.  Similarly p_o[] is the chance that call+ * 'x' came from a mismatch (non-x) / overcall (consensus=*) combination.+ *+ * Examples with observation (call) C and * follows+ *+ *  C | A  C  G  T  *          * | A  C  G  T  *+ *  -----------------          -----------------+ *  A | __ _M __ __ o_         A | uu uu uu uu um+ *  C | _M MM _M _M oM         C | uu uu uu uu um+ *  G | __ _M __ __ o_         G | uu uu uu uu um+ *  T | __ _M __ __ o_         T | uu uu uu uu um+ *  * | o_ oM o_ o_ oo         * | um um um um mm+ *+ * In calculation terms, the _M is half __ and half MM, similarly o_ and um.+ *+ * Relative weights of substitution vs overcall vs undercall are governed on a+ * per base basis using the P_OVER and P_UNDER scores (subst is+ * 1-P_OVER-P_UNDER).+ *+ * The heterozygosity weight though is a per column calculation as we're+ * trying to model whether the column is pure or mixed. 
Hence this is done+ * once via a prior and has no affect on the individual matrix cells.+ */++static void consensus_init(double p_het) {+    int i;++    for (i = -500; i <= 500; i++)+        e_tab[i] = exp(i);+    for (i = -500; i <= 500; i++)+        e_tab2[i] = exp(i/10.);+    for (i = 0; i <= 500; i++)+        e_log[i] = log(i);++    // Heterozygous locations+    for (i = 0; i < 25; i++)+        prior[i] = p_het / 20;+    prior[0] = prior[6] = prior[12] = prior[18] = prior[24] = (1-p_het)/5;++    lprior15[0]  = log(prior[0]);+    lprior15[1]  = log(prior[1]*2);+    lprior15[2]  = log(prior[2]*2);+    lprior15[3]  = log(prior[3]*2);+    lprior15[4]  = log(prior[4]*2);+    lprior15[5]  = log(prior[6]);+    lprior15[6]  = log(prior[7]*2);+    lprior15[7]  = log(prior[8]*2);+    lprior15[8]  = log(prior[9]*2);+    lprior15[9]  = log(prior[12]);+    lprior15[10] = log(prior[13]*2);+    lprior15[11] = log(prior[14]*2);+    lprior15[12] = log(prior[18]);+    lprior15[13] = log(prior[19]*2);+    lprior15[14] = log(prior[24]);+++    // Rewrite as new form+    for (i = 1; i < 101; i++) {+        double prob = 1 - pow(10, -i / 10.0);++        // May want to multiply all these by 5 so pMM[i] becomes close+        // to -0 for most data. 
This makes the sums increment very slowly,+        // keeping bit precision in the accumulator.+        pMM[i] = log(prob/5);+        p__[i] = log((1-prob)/20);+        p_M[i] = log((exp(pMM[i]) + exp(p__[i]))/2);+    }++    pMM[0] = pMM[1];+    p__[0] = p__[1];+    p_M[0] = p_M[1];+}++static inline double fast_exp(double y) {+    if (y >= -50 && y <= 50)+        return e_tab2[(int)(y*10)];++    if (y < -500)+        y = -500;+    if (y > 500)+        y = 500;++    return e_tab[(int)y];+}++/* Taylor (deg 3) implementation of the log */+static inline double fast_log2(double val)+{+    // FP representation is exponent & mantissa, where+    // value = 2^E * M.+    // Hence log2(value) = log2(2^E * M)+    //                   = log2(2^E)+ log2(M)+    //                   =        E + log2(M)+    int64_t *const exponent = ((int64_t*)&val);+    int64_t x = *exponent;+    const int E = ((x >> 52) & 2047) - 1024; // exponent E+    // Initial log2(M) based on mantissa+    x &= ~(2047LL << 52);+    x +=   1023LL << 52;+    *exponent = x;++    val = ((-1/3.) * val + 2) * val - 2/3.;++    return E + val;+}++#define ph_log(x) (-TENLOG2OVERLOG10*fast_log2((x)))+++int nins(const bam1_t *b){+    int i, indel = 0;+    uint32_t *cig = bam_get_cigar(b);+    for (i = 0; i < b->core.n_cigar; i++) {+        int op = bam_cigar_op(cig[i]);+        if (op == BAM_CINS || op == BAM_CDEL)+            indel += bam_cigar_oplen(cig[i]);+    }+    return indel;+}++// Return the local NM figure within halo (+/- HALO) of pos.+// This local NM is used as a way to modify MAPQ to get a localised MAPQ+// score via an adhoc fashion.+double nm_local(const pileup_t *p, const bam1_t *b, int pos) {+    int *nm = (int *)p->cd;+    if (!nm)+        return 0;+    pos -= b->core.pos;+    if (pos < 0)+        return nm[0];+    if (pos >= b->core.l_qseq)+        return nm[b->core.l_qseq-1];++    return nm[pos] / 10.0;+}++/*+ * Initialise a new sequence appearing in the pileup.  
We use this to+ * precompute some metrics that we'll repeatedly use in the consensus+ * caller; the localised NM score.+ *+ * We also directly amend the BAM record (which will be discarded later+ * anyway) to modify qualities to account for local quality minima.+ *+ * Returns 0 (discard) or 1 (keep) on success, -1 on failure.+ */+int nm_init(void *client_data, samFile *fp, sam_hdr_t *h, pileup_t *p) {+    consensus_opts *opts = (consensus_opts *)client_data;+    if (!opts->use_mqual)+        return 1;++    const bam1_t *b = &p->b;+    int qlen = b->core.l_qseq, i;+    int *local_nm = calloc(qlen, sizeof(*local_nm));+    if (!local_nm)+        return -1;+    p->cd = local_nm;++    if (opts->adj_qual) {+#if 0+        // Tweak by localised quality.+        // Quality is reduced by a significant portion of the minimum quality+        // in neighbouring bases, on the pretext that if the region is bad, then+        // this base is bad even if it claims otherwise.+        uint8_t *qual = bam_get_qual(b);+        const int qhalo = 8; // 2?+        int qmin = 50; // effectively caps PacBio qual too+        for (i = 0; i < qlen && i < qhalo; i++) {+            local_nm[i] = qual[i];+            if (qmin > qual[i])+                qmin = qual[i];+        }+        for (;i < qlen-qhalo; i++) {+            //int t = (qual[i]*1   + 3*qmin)/4; // good on 60x+            int t = (qual[i]   + 5*qmin)/4; // good on 15x+            local_nm[i] = t < qual[i] ? 
t : qual[i];+            if (qmin > qual[i+qhalo])+                qmin = qual[i+qhalo];+            else if (qmin <= qual[i-qhalo]) {+                int j;+                qmin = 50;+                for (j = i-qhalo+1; j <= i+qhalo; j++)+                    if (qmin > qual[j])+                        qmin = qual[j];+            }+        }+        for (; i < qlen; i++) {+            local_nm[i] = qual[i];+            local_nm[i] = (local_nm[i] + 6*qmin)/4;+        }++        for (i = 0; i < qlen; i++) {+            qual[i] = local_nm[i];++            // Plus overall rescale.+            // Lower becomes lower, very high becomes a little higher.+            // Helps deep GIAB, but detrimental elsewhere.  (What this really+            // indicates is quality calibration differs per data set.)+            // It's probably something best accounted for somewhere else.++            //qual[i] = qual[i]*qual[i]/40+1;+        }+        memset(local_nm, 0, qlen * sizeof(*local_nm));+#else+        // Skew local NM by qual vs min-qual delta+        uint8_t *qual = bam_get_qual(b);+        const int qhalo = 8; // 4+        int qmin = 99;+        for (i = 0; i < qlen && i < qhalo; i++) {+            if (qmin > qual[i])+                qmin = qual[i];+        }+        for (;i < qlen-qhalo; i++) {+            int t = (qual[i]   + 5*qmin)/4; // good on 15x+            local_nm[i] += t < qual[i] ? (qual[i]-t) : 0;+            if (qmin > qual[i+qhalo])+                qmin = qual[i+qhalo];+            else if (qmin <= qual[i-qhalo]) {+                int j;+                qmin = 99;+                for (j = i-qhalo+1; j <= i+qhalo; j++)+                    if (qmin > qual[j])+                        qmin = qual[j];+            }+        }+        for (; i < qlen; i++) {+            int t = (qual[i]   + 5*qmin)/4; // good on 15x+            local_nm[i] += t < qual[i] ? 
(qual[i]-t) : 0;+        }+#endif+    }++    // Adjust local_nm array by the number of edits within+    // a defined region (pos +/- halo).+    const int halo = opts->nm_halo;+    const uint8_t *md = bam_aux_get(b, "MD");+    if (!md)+        return 1;+    md = (const uint8_t *)bam_aux2Z(md);++    // Handle cost of being near a soft-clip+    uint32_t *cig = bam_get_cigar(b);+    int ncig = b->core.n_cigar;++    if ( (cig[0] & BAM_CIGAR_MASK) == BAM_CSOFT_CLIP ||+        ((cig[0] & BAM_CIGAR_MASK) == BAM_CHARD_CLIP && ncig > 1 &&+         (cig[1] & BAM_CIGAR_MASK) == BAM_CSOFT_CLIP)) {+        for (i = 0; i < halo && i < qlen; i++)+            local_nm[i]+=10;+        for (; i < halo*2 && i < qlen; i++)+            local_nm[i]+=5;+    }+    if ( (cig[ncig-1] & BAM_CIGAR_MASK) == BAM_CSOFT_CLIP ||+        ((cig[ncig-1] & BAM_CIGAR_MASK) == BAM_CHARD_CLIP && ncig > 1 &&+         (cig[ncig-2] & BAM_CIGAR_MASK) == BAM_CSOFT_CLIP)) {+        for (i = qlen-1; i >= qlen-halo && i >= 0; i--)+            local_nm[i]+=10;+        for (; i >= qlen-halo*2 && i >= 0; i--)+            local_nm[i]+=5;+    }++    // Now iterate over MD tag+    int pos = 0;+    while (*md) {+        if (isdigit(*md)) {+            uint8_t *endptr;+            long i = strtol((char *)md, (char **)&endptr, 10);+            md = endptr;+            pos += i;+            continue;+        }++        // deletion.+        // Should we bump local_nm here too?  Maybe+        if (*md == '^') {+            while (*++md && !isdigit(*md))+                continue;+            continue;+        }++        // substitution+        for (i = pos-halo*2 >= 0 ? 
pos-halo*2 : 0; i < pos-halo; i++)+            local_nm[i]+=5;+        for (; i < pos+halo && i < qlen; i++)+            local_nm[i]+=10;+        for (; i < pos+halo*2 && i < qlen; i++)+            local_nm[i]+=5;+        md++;+    }++    return 1;+}+++static+int calculate_consensus_gap5(int pos, int flags, int depth,+                             const pileup_t *plp, consensus_opts *opts,+                             consensus_t *cons, int default_qual) {+    int i, j;+    static int init_done =0;+    static double q2p[101], mqual_pow[256];+    double min_e_exp = DBL_MIN_EXP * log(2) + 1;++    double S[15] ALIGNED(16) = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};+    double sumsC[6] = {0,0,0,0,0,0}; // A C G T * N++    // Small hash on seq to check for uniqueness of surrounding bases.+    // If it's frequent, then it's more likely to be correctly called than+    // if it's rare.+    // Helps a bit on deep data, especially with K2=3, but detrimental on+    // shallow and (currently) quite a slow down.++//#define K2 2+#ifdef K2+    int hashN[1<<(K2*4+2)] = {0};+    int hash1[1<<2] = {0};+#endif++    /* Map the 15 possible combinations to 1-base or 2-base encodings */+    static int map_sing[15] ALIGNED(16) =+        {0, 5, 5, 5, 5,+            1, 5, 5, 5,+               2, 5, 5,+                  3, 5,+                     4};+    static int map_het[15] ALIGNED(16) =+        {0,  1,  2,  3,  4,+             6,  7,  8,  9,+                12, 13, 14,+                    18, 19,+                        24};++    if (!init_done) {+        init_done = 1;+        consensus_init(P_HET);++        for (i = 0; i <= 100; i++) {+            q2p[i] = pow(10, -i/10.0);+        }++        for (i = 0; i < 255; i++) {+            //mqual_pow[i] = 1-pow(10, -(i+.01)/10.0);+            mqual_pow[i] = 1-pow(10, -(i*.9)/10.0);+            //mqual_pow[i] = 1-pow(10, -(i/3+.1)/10.0);+            //mqual_pow[i] = 1-pow(10, -(i/2+.05)/10.0);+        }+        // unknown mqual+        mqual_pow[255] = 
mqual_pow[10];+    }++    /* Initialise */+    int counts[6] = {0};++    /* Accumulate */++#ifdef K2+    const pileup_t *ptmp = plp;+    for (; ptmp; ptmp = ptmp->next) {+        const pileup_t *p = ptmp;+        if (p->qual < opts->min_qual)+            continue;++        int hb = 0;+#define _ 0+        static int X[16] = {_,0,1,_,2,_,_,_,3,_,_,_,_,_,_,_};+        uint8_t *seq = bam_get_seq(&p->b);+        int i, base1 = X[p->base4];+        hash1[base1]++;+        for (i = p->seq_offset-K2; i <= p->seq_offset+K2; i++) {+            int base = i >= 0 && i < p->b.core.l_qseq ? X[bam_seqi(seq,i)] : _;+            hb = (hb<<2)|base;+        }+        hashN[hb]++;+#undef _+    }+#endif++    int td = depth; // original depth+    depth = 0;+    for (; plp; plp = plp->next) {+        const pileup_t *p = plp;++        if (p->next)+            _mm_prefetch(p->next, _MM_HINT_T0);++        if (p->qual < opts->min_qual)+            continue;++        if (p->ref_skip)+            continue;++#ifdef K2+        int hb = 0;+#define _ 0+        static int X[16] = {_,0,1,_,2,_,_,_,3,_,_,_,_,_,_,_};+        int i, base1 = X[p->base4];+        for (i = p->seq_offset-K2; i <= p->seq_offset+K2; i++) {+            int base = i >= 0 && i < p->b.core.l_qseq ? 
X[bam_seqi(seq,i)] : _;+            hb = (hb<<2)|base;+        }+        //        fprintf(stderr, "%c: %d %d of %d\t%d %d\n", p->base, hashN[hb], hash1[base1], td, p->qual, p->qual * hashN[hb] / hash1[base1]);+#undef _+#endif++        const bam1_t *b = &p->b;+        uint8_t base = p->base4;+        uint8_t *qual_arr = bam_get_qual(b);+        uint8_t qual = p->qual;+        //qual = qual*qual/40+1;+        if (qual == 255 || (qual == 0 && *qual_arr == 255))+            qual = default_qual;++#ifdef K2+        //qual = qual * hashN[hb] / hash1[base1];+        qual -= -TENOVERLOG10*log(hashN[hb] / (hash1[base1]+.1));+        if (qual < 1)+            qual = 1;+#endif++        // =ACM GRSV TWYH KDBN *+        static int L[32] = {+            5,0,1,5, 2,5,5,5, 3,5,5,5, 5,5,5,5,+            4,4,4,4, 4,4,4,4, 4,4,4,4, 4,4,4,4,+        };++        // convert from sam base to acgt*n order.+        base = L[base];++        double MM, __, _M, qe;++        // Correction for mapping quality.  Maybe speed up via lookups?+        // Cannot nullify mapping quality completely.  Lots of (true)+        // SNPs means low mapping quality.  (Ideally need to know+        // hamming distance to next best location.)++        if (flags & CONS_MQUAL) {+            int mqual = b->core.qual;+            if (opts->nm_adjust) {+                mqual /= (nm_local(p, b, pos)+1);+                mqual *= 1 + 2*(0.5-(td>30?30:td)/60.0); // depth fudge+            }++            // higher => call more; +FP, -FN+            // lower  => call less; -FP, +FN+            mqual *= opts->scale_mqual;++            // Drop these?  
They don't seem to ever help.+            if (mqual < opts->low_mqual)+                mqual = opts->low_mqual;+            if (mqual > opts->high_mqual)+                mqual = opts->high_mqual;++            double _p = 1-q2p[qual];+            double _m = mqual_pow[mqual];+            qual = ph_log(1-(_m * _p + (1 - _m)/4)); // CURRENT+            //qual = ph_log(1-_p*_m); // testing+            //qual *= 6/sqrt(td);+        }++        /* Quality 0 should never be permitted as it breaks the maths */+        if (qual < 1)+            qual = 1;++        __ = p__[qual];       // neither match+        MM = pMM[qual] - __;  // both match+        _M = p_M[qual] - __;  // one allele only (half match)++        if (flags & CONS_DISCREP) {+            qe = q2p[qual];+            sumsC[base] += 1 - qe;+        }++        counts[base]++;++        switch (base) {+        case 0: // A+            S[0] += MM;+            S[1] += _M;+            S[2] += _M;+            S[3] += _M;+            S[4] += _M;+            break;++        case 1: // C+            S[1] += _M;+            S[5] += MM;+            S[6] += _M;+            S[7] += _M;+            S[8] += _M;+            break;++        case 2: // G+            S[ 2] += _M;+            S[ 6] += _M;+            S[ 9] += MM;+            S[10] += _M;+            S[11] += _M;+            break;++        case 3: // T+            S[ 3] += _M;+            S[ 7] += _M;+            S[10] += _M;+            S[12] += MM;+            S[13] += _M;++            break;++        case 4: // *+            S[ 4] += _M;+            S[ 8] += _M;+            S[11] += _M;+            S[13] += _M;+            S[14] += MM;+            break;++        case 5: /* N => equal weight to all A,C,G,T but not a pad */+            S[ 0] += MM;+            S[ 1] += MM;+            S[ 2] += MM;+            S[ 3] += MM;+            S[ 4] += _M;++            S[ 5] += MM;+            S[ 6] += MM;+            S[ 7] += MM;+            S[ 8] += _M;++            S[ 9] 
+= MM;+            S[10] += MM;+            S[11] += _M;++            S[12] += MM;+            S[13] += _M;+            break;+        }++        depth++;++        if (p->eof && p->cd)+            free(p->cd);+    }++    /* We've accumulated stats, so now we speculate on the consensus call */+    double shift, max, max_het, norm[15];+    int call = 0, het_call = 0;+    double tot1 = 0, tot2 = 0;++    /*+     * Scale numbers so the maximum score is 0. This shift is essentially+     * a multiplication in non-log scale to both numerator and denominator,+     * so it cancels out. We do this to avoid calling exp(-large_num) and+     * ending up with norm == 0 and hence a 0/0 error.+     *+     * Can also generate the base-call here too.+     */+    shift = -DBL_MAX;+    max = -DBL_MAX;+    max_het = -DBL_MAX;++    for (j = 0; j < 15; j++) {+        S[j] += lprior15[j];+        if (shift < S[j])+            shift = S[j];++        /* Only call pure AA, CC, GG, TT, ** for now */+        if (j != 0 && j != 5 && j != 9 && j != 12 && j != 14) {+            if (max_het < S[j]) {+                max_het = S[j];+                het_call = j;+            }+            continue;+        }++        if (max < S[j]) {+            max = S[j];+            call = j;+        }+    }++    /*+     * Shift and normalise.+     * If call is, say, b we want p = b/(a+b+c+...+n), but then we do+     * p/(1-p) later on and this has exceptions when p is very close+     * to 1.+     *+     * Hence we compute b/(a+b+c+...+n - b) and+     * rearrange (p/norm) / (1 - (p/norm)) to be p/norm2.+     */+    for (j = 0; j < 15; j++) {+        S[j] -= shift;+        double e = fast_exp(S[j]);+        S[j] = (S[j] > min_e_exp) ? 
e : DBL_MIN;+        norm[j] = 0;+    }++    for (j = 0; j < 15; j++) {+        norm[j]    += tot1;+        norm[14-j] += tot2;+        tot1 += S[j];+        tot2 += S[14-j];+    }++    /* And store result */+    if (!depth || depth == counts[5] /* all N */) {+        cons->call = 4; /* N */+        cons->het_call = 0;+        cons->het_logodd = 0;+        cons->phred = 0;+        cons->depth = 0;+        cons->discrep = 0;+        return 0;+    }++    cons->depth = depth;++    /* Call */+    if (norm[call] == 0) norm[call] = DBL_MIN;+    // Approximation of phred for when S[call] ~= 1 and norm[call]+    // is small.  Otherwise we need the full calculation.+    int ph;+    if (S[call] == 1 && norm[call] < .01)+        ph = ph_log(norm[call]) + .5;+    else+        ph = ph_log(1-S[call]/(norm[call]+S[call])) + .5;++    cons->call     = map_sing[call];+    cons->phred = ph < 0 ? 0 : ph;++    if (norm[het_call] == 0) norm[het_call] = DBL_MIN;+    ph = TENLOG2OVERLOG10 * (fast_log2(S[het_call])+                             - fast_log2(norm[het_call])) + .5;++    cons->het_call = map_het[het_call];+    cons->het_logodd = ph;++    /* Compute discrepancy score */+    if (flags & CONS_DISCREP) {+        double m = sumsC[0]+sumsC[1]+sumsC[2]+sumsC[3]+sumsC[4];+        double c;+        if (cons->het_logodd > 0)+            c = sumsC[cons->het_call%5] + sumsC[cons->het_call/5];+        else+            c = sumsC[cons->call];+        cons->discrep = (m-c)/sqrt(m);+    }++    return 0;+}+++/* --------------------------------------------------------------------------+ * Main processing logic+ */++static void dump_fastq(consensus_opts *opts,+                       const char *name,+                       const char *seq, size_t seq_l,+                       const char *qual, size_t qual_l) {+    enum format fmt = opts->fmt;+    int line_len = opts->line_len;+    FILE *fp = opts->fp_out;++    fprintf(fp, "%c%s\n", ">@"[fmt==FASTQ], name);+    size_t i;+    for (i = 0; i < seq_l; 
i += line_len)+        fprintf(fp, "%.*s\n", (int)MIN(line_len, seq_l - i), seq+i);++    if (fmt == FASTQ) {+        fprintf(fp, "+\n");+        for (i = 0; i < seq_l; i += line_len)+            fprintf(fp, "%.*s\n", (int)MIN(line_len, seq_l - i), qual+i);+    }+}++//---------------------------------------------------------------------------++/*+ * Reads a single alignment record, using either the iterator+ * or a direct sam_read1 call.+ */+static int readaln2(void *dat, samFile *fp, sam_hdr_t *h, bam1_t *b) {+    consensus_opts *opts = (consensus_opts *)dat;++    for (;;) {+        int ret = opts->iter+            ? sam_itr_next(fp, opts->iter, b)+            : sam_read1(fp, h, b);+        if (ret < 0)+            return ret;++        // Apply hard filters+        if (opts->incl_flags && !(b->core.flag & opts->incl_flags))+            continue;+        if (opts->excl_flags &&  (b->core.flag & opts->excl_flags))+            continue;+        if (b->core.qual < opts->min_mqual)+            continue;++        return ret;+    }+}++/* --------------------------------------------------------------------------+ * A simple summing algorithm, either pure base frequency, or by+ * weighting them according to their quality values.+ *+ * This is crude, but easy to understand and fits with several+ * standard pileup criteria (eg COG-UK / CLIMB Covid-19 seq project).+ *+ *+ * call1 / score1 / depth1 is the highest scoring allele.+ * call2 / score2 / depth2 is the second highest scoring allele.+ *+ * Het_fract:  score2/score1+ * Call_fract: score1 or score1+score2 over total score+ * Min_depth:  minimum total depth of utilised bases (depth1+depth2)+ * Min_score:  minimum total score of utilised bases (score1+score2)+ *+ * Eg het_fract 0.66, call_fract 0.75 and min_depth 10.+ * 11A, 2C, 2G (14 total depth) is A.+ * 9A, 2C, 2G  (12 total depth) is N as depth(A) < 10.+ * 11A, 5C, 5G (21 total depth) is N as 11/21 < 0.75 (call_fract)+ *+ *+ * 6A, 5G, 1C  (12 total depth) is AG het as 
depth(A)+depth(G) >= 10+ *                              and 5/6 >= 0.66 and 11/12 >= 0.75.+ *+ * 6A, 5G, 4C  (15 total depth) is N as (6+5)/15 < 0.75 (call_fract).+ *+ *+ * Note for the purpose of deletions, a base/del has an ambiguity+ * code of lower-case base (otherwise it is uppercase).+ */+static int calculate_consensus_simple(const pileup_t *plp,+                                      consensus_opts *opts, int *qual) {+    int i, min_qual = opts->min_qual;++    // Map "seqi" nt16 to A,C,G,T compatibility with weights on pure bases.+    // where seqi is A | (C<<1) | (G<<2) | (T<<3)+    //                        * A C M  G R S V  T W Y H  K D B N+    static int seqi2A[16] = { 0,8,0,4, 0,4,0,2, 0,4,0,2, 0,2,0,1 };+    static int seqi2C[16] = { 0,0,8,4, 0,0,4,2, 0,0,4,2, 0,0,2,1 };+    static int seqi2G[16] = { 0,0,0,0, 8,4,4,1, 0,0,0,0, 4,2,2,1 };+    static int seqi2T[16] = { 0,0,0,0, 0,0,0,0, 8,4,4,2, 8,2,2,1 };++    // Ignore ambiguous bases in seq for now, so we don't treat R, Y, etc+    // as part of one base and part another.  Based on BAM seqi values.+    // We also use freq[16] as "*" for gap.+    int freq[17] = {0};  // base frequency, aka depth+    int score[17] = {0}; // summation of base qualities++    // Accumulate+    for (; plp; plp = plp->next) {+        const pileup_t *p = plp;+        if (p->next)+            _mm_prefetch(p->next, _MM_HINT_T0);++        int q = p->qual;+        if (q < min_qual)+            // Should we still record these in freq[] somewhere so+            // we can use them in the fracts?+            // Difference between >= X% of high-qual bases calling Y+            // and >= X% of all bases are high-quality Y calls.+            continue;++        //int b = p->is_del ? 16 : bam_seqi(bam_get_seq(&p->b), p->seq_offset);+        int b = p->base4;++        // Map ambiguity codes to one or more component bases.+        if (b < 16) {+            int Q = seqi2A[b] * (opts->use_qual ? 
q : 1);+            freq[1]  += Q?1:0;+            score[1] += Q?Q:0;+            Q = seqi2C[b] * (opts->use_qual ? q : 1);+            freq[2]  += Q?1:0;+            score[2] += Q?Q:0;+            Q = seqi2G[b] * (opts->use_qual ? q : 1);+            freq[4]  += Q?1:0;+            score[4] += Q?Q:0;+            Q = seqi2T[b] * (opts->use_qual ? q : 1);+            freq[8]  += Q?1:0;+            score[8] += Q?Q:0;+        } else { /* * */+            freq[16] ++;+            score[16]+=8 * (opts->use_qual ? q : 1);+        }+    }++    // Total usable depth+    int tscore = 0;+    for (i = 0; i < 5; i++)+        tscore += score[1<<i];++    // Best and second best potential calls+    int call1  = 15, call2 = 15;+    int depth1 = 0,  depth2 = 0;+    int score1 = 0,  score2 = 0;+    for (i = 0; i < 5; i++) {+        int c = 1<<i; // A C G T *+        if (score1 < score[c]) {+            depth2 = depth1;+            score2 = score1;+            call2  = call1;+            depth1 = freq[c];+            score1 = score[c];+            call1  = c;+        } else if (score2 < score[c]) {+            depth2 = freq[c];+            score2 = score[c];+            call2  = c;+        }+    }++    // Work out which best and second best are usable as a call+    int used_score = score1;+    int used_depth = depth1;+    int used_base  = call1;+    if (score2 >= opts->het_fract * score1 && opts->ambig) {+        used_base  |= call2;+        used_score += score2;+        used_depth += depth2;+    }++    // N is too shallow, or insufficient proportion of total+    if (used_depth < opts->min_depth ||+        used_score < opts->call_fract * tscore) {+        used_depth = 0;+        // But note shallow gaps are still called gaps, not N, as+        // we're still more confident there is no base than it is+        // A, C, G or T.+        used_base = call1 == 16 /*&& depth1 >= call_fract * depth*/+            ? 16 : 0; // * or N+    }++    // Our final call.  "?" 
shouldn't be possible to generate+    const char *het =+        "NACMGRSVTWYHKDBN"+        "*ac?g???t???????";++    //printf("%c %d\n", het[used_base], used_depth);+    if (qual)+        *qual = used_base ? 100.0 * used_score / tscore : 0;++    return het[used_base];+}++static int empty_pileup2(consensus_opts *opts, sam_hdr_t *h, int tid,+                         hts_pos_t start, hts_pos_t end) {+    const char *name = sam_hdr_tid2name(h, tid);+    hts_pos_t i;++    int err = 0;+    for (i = start; i < end; i++)+        err |= fprintf(opts->fp_out, "%s\t%"PRIhts_pos"\t0\t0\tN\t0\t*\t*\n", name, i+1) < 0;++    return err ? -1 : 0;+}++/*+ * Returns 0 on success+ *        -1 on failure+ */+static int basic_pileup(void *cd, samFile *fp, sam_hdr_t *h, pileup_t *p,+                        int depth, int pos, int nth, int is_insert) {+    unsigned char *qp, *cp;+    char *rp;+    int ref, cb, cq;+    consensus_opts *opts = (consensus_opts *)cd;+    int tid = p->b.core.tid;++//    opts->show_ins=0;+//    opts->show_del=1;+    if (!opts->show_ins && nth)+        return 0;++    if (opts->iter) {+        if (opts->iter->beg >= pos || opts->iter->end < pos)+            return 0;+    }++    if (opts->all_bases) {+        if (tid != opts->last_tid && opts->last_tid >= 0) {+            hts_pos_t len = sam_hdr_tid2len(opts->h, opts->last_tid);+            if (opts->iter)+                len =  MIN(opts->iter->end, len);+            if (empty_pileup2(opts, opts->h, opts->last_tid, opts->last_pos,+                              len) < 0)+                return -1;+            if (tid >= 0) {+                if (empty_pileup2(opts, opts->h, tid,+                                  opts->iter ? 
opts->iter->beg : 0,+                                  pos-1) < 0)+                    return -1;+            }+        }+        if (opts->last_pos >= 0 && pos > opts->last_pos+1) {+            if (empty_pileup2(opts, opts->h, p->b.core.tid, opts->last_pos,+                              pos-1) < 0)+                return -1;+        } else if (opts->last_pos < 0) {+            if (empty_pileup2(opts, opts->h, p->b.core.tid,+                              opts->iter ? opts->iter->beg : 0, pos-1) < 0)+                return -1;+        }+    }++    if (opts->gap5) {+        consensus_t cons;+        calculate_consensus_gap5(pos, opts->use_mqual ? CONS_MQUAL : 0,+                                 depth, p, opts, &cons, opts->default_qual);+        if (cons.het_logodd > 0 && opts->ambig) {+            cb = "AMRWa" // 5x5 matrix with ACGT* per row / col+                 "MCSYc"+                 "RSGKg"+                 "WYKTt"+                 "acgt*"[cons.het_call];+            cq = cons.het_logodd;+        } else{+            cb = "ACGT*"[cons.call];+            cq = cons.phred;+        }+        if (cq < opts->cons_cutoff && cb != '*') {+            cb = 'N';+            cq = 0;+        }+    } else {+        cb = calculate_consensus_simple(p, opts, &cq);+    }+    if (cb < 0)+        return -1;++    if (!p)+        return 0;++    if (!opts->show_del && cb == '*')+        return 0;++    /* Ref, pos, nth, score, seq, qual */+    kstring_t *ks = &opts->ks_line;+    ks->l = 0;+    ref = p->b.core.tid;+    rp = (char *)sam_hdr_tid2name(h, ref);++    int err = 0;+    err |= kputs(rp, ks)    < 0;+    err |= kputc_('\t', ks) < 0;+    err |= kputw(pos, ks)   < 0;+    err |= kputc_('\t', ks) < 0;+    err |= kputw(nth, ks)   < 0;+    err |= kputc_('\t', ks) < 0;+    err |= kputw(depth, ks) < 0;+    err |= kputc_('\t', ks) < 0;+    err |= kputc_(cb, ks)   < 0;+    err |= kputc_('\t', ks) < 0;+    err |= kputw(cq, ks)    < 0;+    err |= kputc_('\t', ks) < 0;+    if (err)+        
return -1;++    /* Seq + qual at predetermined offsets */+    ks_resize(ks, ks->l + depth*2 + 2);

The return value of ks_resize() needs to be checked here.
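As a rough illustration of the kind of check meant, here is a minimal self-contained sketch. It does not use HTSlib itself: `buf_t`, `buf_resize()` and `reserve_seq_qual()` are hypothetical stand-ins that mimic the `ks_resize()` convention of returning 0 on success and a negative value on allocation failure. The size expression is copied from the hunk above.

```c
#include <stdlib.h>

/* Hypothetical stand-in for kstring_t: used length, allocated size, buffer. */
typedef struct {
    size_t l, m;
    char *s;
} buf_t;

/* Hypothetical analogue of ks_resize(): grow the buffer to at least
 * `size` bytes.  Returns 0 on success, -1 if the allocation fails
 * (in which case the old buffer is left untouched). */
static int buf_resize(buf_t *b, size_t size) {
    if (b->m < size) {
        char *tmp = realloc(b->s, size);
        if (!tmp)
            return -1;
        b->s = tmp;
        b->m = size;
    }
    return 0;
}

/* Reserve space using the same size expression as the reviewed line
 * (l + depth*2 + 2), and propagate failure to the caller instead of
 * writing into a buffer that may not have grown. */
static int reserve_seq_qual(buf_t *b, int depth) {
    if (buf_resize(b, b->l + (size_t)depth * 2 + 2) < 0)
        return -1;
    return 0;
}
```

In basic_pileup() the equivalent change would presumably be to return -1 when the resize fails, matching how the earlier kputs()/kputw() errors are handled.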

jkbonfield

comment created time in a day


Pull request review comment samtools/samtools

Add a "samtools consensus" sub-command.

+/*  consensus__pileup.h -- Pileup orientated data per consensus column++    Copyright (C) 2013-2016, 2020-2021 Genome Research Ltd.++    Author: James Bonfied <jkb@sanger.ac.uk>++Permission is hereby granted, free of charge, to any person obtaining a copy+of this software and associated documentation files (the "Software"), to deal+in the Software without restriction, including without limitation the rights+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell+copies of the Software, and to permit persons to whom the Software is+furnished to do so, subject to the following conditions:++The above copyright notices and this permission notice shall be included in+all copies or substantial portions of the Software.++THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER+DEALINGS IN THE SOFTWARE.  */++#include <config.h>+#include <htslib/sam.h>++#ifdef __SSE__+#   include <xmmintrin.h>+#else+#   define _mm_prefetch(a,b)+#endif++#include "consensus_pileup.h"++#define MIN(a,b) ((a)<(b)?(a):(b))+#define bam_strand(b)  (((b)->core.flag & BAM_FREVERSE) != 0)++/*+ * START_WITH_DEL is the mode that Gap5 uses when building this. It prepends+ * all cigar strings with 1D and decrements the position by one. (And then+ * has code to reverse this operation in the pileup handler.)+ *+ * The reason for this is that it means reads starting with an insertion work.+ * Otherwise the inserted bases are silently lost. 
(Try it with "samtools+ * mpileup" and you can see it has the same issue.)+ *+ * However it's probably not want most people expect.+ */+//#define START_WITH_DEL++/* --------------------------------------------------------------------------+ * The pileup code itself.+ *+ * This consists of the external pileup_loop() function, which takes a+ * sam/bam samfile_t pointer and a callback function. The callback function+ * is called once per column of aligned data (so once per base in an+ * insertion).+ *+ * Current known issues.+ * 1) zero length matches, ie 2S2S cause failures.+ * 2) Insertions at starts of sequences get included in the soft clip, so+ *    2S2I2M is treated as if it's 4S2M+ * 3) From 1 and 2 above, 1S1I2S becomes 2S2S which fails.+ */+++/*+ * Fetches the next base => the nth base at unpadded position pos. (Nth can+ * be greater than 0 if we have an insertion in this column). Do not call this+ * with pos/nth lower than the previous query, although higher is better.+ * (This allows it to be initialised at base 0.)+ *+ * Stores the result in base and also updates is_insert to indicate that+ * this sequence still has more bases in this position beyond the current+ * nth parameter.+ *+ * Returns 1 if a base was fetched+ *         0 if not (eg ran off the end of sequence)+ */+static int get_next_base(pileup_t *p, int pos, int nth, int *is_insert) {+    bam1_t *b = &p->b;+    int op = p->cigar_op;++    p->start -= p->start>0;+    if (p->first_del && op != BAM_CPAD)+        p->first_del = 0;++    *is_insert = 0;++    /* Find pos first */+    while (p->pos < pos) {+        p->nth = 0;++        if (p->cigar_len == 0) {+            if (p->cigar_ind >= b->core.n_cigar) {+                p->eof = 1;+                return 0;+            }++            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+        }++        if ((op == BAM_CMATCH || op == 
BAM_CEQUAL || op == BAM_CDIFF)+            && p->cigar_len <= pos - p->pos) {+            p->seq_offset += p->cigar_len;+            p->pos += p->cigar_len;+            p->cigar_len = 0;+        } else {+            switch (op) {+            case BAM_CMATCH:+            case BAM_CEQUAL:+            case BAM_CDIFF:+                p->seq_offset++;+                /* Fall through */+            case BAM_CDEL:+            case BAM_CREF_SKIP:+                p->pos++;+                p->cigar_len--;+                break;++            case BAM_CINS:+            case BAM_CSOFT_CLIP:+                p->seq_offset += p->cigar_len;+                /* Fall through */+            case BAM_CPAD:+            case BAM_CHARD_CLIP:+                p->cigar_len = 0;+                break;++            default:+                fprintf(stderr, "Unhandled cigar_op %d\n", op);+                return -1;+            }+        }+    }++    /* Now at pos, find nth base */+    while (p->nth < nth) {+        if (p->cigar_len == 0) {+            if (p->cigar_ind >= b->core.n_cigar) {+                p->eof = 1;+                return 0; /* off end of seq */+            }++            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+        }++        switch (op) {+        case BAM_CMATCH:+        case BAM_CEQUAL:+        case BAM_CDIFF:+        case BAM_CSOFT_CLIP:+        case BAM_CDEL:+        case BAM_CREF_SKIP:+            goto at_nth; /* sorry, but it's fast! 
*/++        case BAM_CINS:+            p->seq_offset++;+            /* Fall through */+        case BAM_CPAD:+            p->cigar_len--;+            p->nth++;+            break;++        case BAM_CHARD_CLIP:+            p->cigar_len = 0;+            break;++        default:+            fprintf(stderr, "Unhandled cigar_op %d\n", op);+            return -1;+        }+    }+ at_nth:++    /* Fill out base & qual fields */+    p->ref_skip = 0;+    if (p->nth < nth && op != BAM_CINS) {+        //p->base = '-';+        p->base = '*';+        p->base4 = 16;+        p->padding = 1;+        if (p->seq_offset < b->core.l_qseq)+            p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+        else+            p->qual = 0;+    } else {+        p->padding = 0;+        switch(op) {+        case BAM_CDEL:+            p->base = '*';+            p->base4 = 16;+            if (p->seq_offset+1 < b->core.l_qseq)+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+            else+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;+            break;++        case BAM_CPAD:+            //p->base = '+';+            p->base = '*';+            p->base4 = 16;+            if (p->seq_offset+1 < b->core.l_qseq)+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+            else+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;+            break;++        case BAM_CREF_SKIP:+            p->base = '.';+            p->base4 = 0;+            p->qual = 0;+            /* end of fragment, but not sequence */+            p->eof = p->eof ? 
2 : 3;+            p->ref_skip = 1;+            break;++        default:+            if (p->seq_offset < b->core.l_qseq) {+                p->qual = p->b_qual[p->seq_offset];+                p->base4 = p->b_seq[p->seq_offset/2] >>+                    ((~p->seq_offset&1)<<2) & 0xf;+                p->base = "NACMGRSVTWYHKDBN"[p->base4];+            } else {+                p->base = 'N';+                p->base4 = 15;+                p->qual = 0xff;+            }++            break;+        }+    }++    /* Handle moving out of N (skip) into sequence again */+    if (p->eof && p->base != '.') {+        p->start = 1;+        p->ref_skip = 1;+        p->eof = 0;+    }++    /* Starting with an indel needs a minor fudge */+    if (p->start && p->cigar_op == BAM_CDEL) {+        p->first_del = 1;+    }++    /* Check if next op is an insertion of some sort */+    if (p->cigar_len == 0) {+        if (p->cigar_ind < b->core.n_cigar) {+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+            if (op == BAM_CREF_SKIP) {+                p->eof = 3;+                p->ref_skip = 1;+            }+        } else {+            p->eof = 1;+        }+    }++    switch (op) {+    case BAM_CPAD:+    case BAM_CINS:+        *is_insert = p->cigar_len;+        break;++    case BAM_CSOFT_CLIP:+        /* Last op 'S' => eof */+        p->eof = (p->cigar_ind == b->core.n_cigar ||+                  (p->cigar_ind+1 == b->core.n_cigar &&+                   (p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK)+                   == BAM_CHARD_CLIP))+            ? 1+            : 0;+        break;++    case BAM_CHARD_CLIP:+        p->eof = 1;+        break;++    default:+        break;+    }++    return 1;+}++/*+ * Loops through a set of supplied ranges producing columns of data.+ * When found, it calls func with clientdata as a callback. 
Func should+ * return 0 for success and non-zero for failure. seq_init() is called+ * on each new entry before we start processing it. It should return 0 or 1+ * to indicate reject or accept status (eg to filter unmapped data).+ * If seq_init() returns -1 we abort the pileup_loop with an error.+ * seq_init may be NULL.+ *+ * Returns 0 on success+ *        -1 on failure+ */+int pileup_loop(samFile *fp,+                sam_hdr_t *h,+                int (*seq_fetch)(void *client_data,+                                 samFile *fp,+                                 sam_hdr_t *h,+                                 bam1_t *b),+                int (*seq_init)(void *client_data,+                                samFile *fp,+                                sam_hdr_t *h,+                                pileup_t *p),+                int (*seq_add)(void *client_data,+                               samFile *fp,+                               sam_hdr_t *h,+                               pileup_t *p,+                               int depth,+                               int pos,+                               int nth,+                               int is_insert),+                void *client_data) {+    int ret = -1;+    pileup_t *phead = NULL, *p, *pfree = NULL, *last, *next, *ptail = NULL;+    pileup_t *pnew = NULL;+    int is_insert, nth = 0;+    int col = 0, r;+    int last_ref = -1;++    /* FIXME: allow for start/stop boundaries rather than consuming all data */++    if (NULL == (pnew = calloc(1, sizeof(*p))))+        return -1;++    do {+        bam1_t *b;+        int pos, last_in_contig;++        //r = scram_next_seq(fp, &pnew->b);+        r = seq_fetch(client_data, fp, h, &pnew->b);+        //r = sam_read1(fp, h, &pnew->b); // FIXME: use readaln+        if (r < -1) {+            fprintf(stderr, "bam_next_seq() failure.\n");+            return -1;+        }++        b = &pnew->b;++        /* Force realloc */+        //fp->bs = NULL;+        //fp->bs_size = 0;++        //r = 
samread(fp, pnew->b);+        if (r >= 0) {+            if (b->core.flag & BAM_FUNMAP)+                continue;++            if (b->core.tid == -1) {+                /* Another indicator for unmapped */+                continue;+            } else if (b->core.tid == last_ref) {+                pos = b->core.pos+1;+                //printf("New seq at pos %d @ %d %s\n", pos, b->core.tid,+                //       bam_name(b));+                last_in_contig = 0;+            } else {+                //printf("New ctg at pos %d @ %d\n",b->core.pos+1,b->core.tid);+                pos = (b->core.pos > col ? b->core.pos : col)+1;

Shouldn't this be another pos = col + 1? This b->core.pos is for another chromosome, so comparing to col is meaningless. Also, the purpose of last_in_contig might be easier to understand if it was named something like finish_last_contig. Or maybe set pos = HTS_POS_MAX; so the last contig gets ejected without needing it?
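To make the first suggestion concrete, here is a small sketch of how the position selection could look if the new-contig branch ignored the incoming record's coordinate. The helper name `flush_target()` and its parameters are hypothetical and only mirror the variables in the hunk above (`r`, `b->core.tid`, `last_ref`, `b->core.pos`, `col`). It is one reading of the comment, not the code that was merged.

```c
/* Hypothetical helper: pick the column that the pileup should be
 * processed up to before the new record is added.
 *
 *   r        result of seq_fetch (negative at end of input)
 *   tid      reference id of the new record
 *   last_ref reference id of the contig currently being piled up
 *   rec_pos  0-based position of the new record
 *   col      current pileup column
 */
static int flush_target(int r, int tid, int last_ref,
                        int rec_pos, int col, int *last_in_contig) {
    if (r >= 0 && tid == last_ref) {
        /* Same contig: process columns up to the new record's start. */
        *last_in_contig = 0;
        return rec_pos + 1;
    }

    /* New contig or end of input: rec_pos belongs to a different
     * reference, so only col is meaningful here. */
    *last_in_contig = 1;
    return col + 1;
}
```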

jkbonfield

comment created time in 4 days

Pull request review comment samtools/samtools

Add a "samtools consensus" sub-command.

+/*  consensus__pileup.h -- Pileup orientated data per consensus column++    Copyright (C) 2013-2016, 2020-2021 Genome Research Ltd.++    Author: James Bonfied <jkb@sanger.ac.uk>++Permission is hereby granted, free of charge, to any person obtaining a copy+of this software and associated documentation files (the "Software"), to deal+in the Software without restriction, including without limitation the rights+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell+copies of the Software, and to permit persons to whom the Software is+furnished to do so, subject to the following conditions:++The above copyright notices and this permission notice shall be included in+all copies or substantial portions of the Software.++THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER+DEALINGS IN THE SOFTWARE.  */++#include <config.h>+#include <htslib/sam.h>++#ifdef __SSE__+#   include <xmmintrin.h>+#else+#   define _mm_prefetch(a,b)+#endif++#include "consensus_pileup.h"++#define MIN(a,b) ((a)<(b)?(a):(b))+#define bam_strand(b)  (((b)->core.flag & BAM_FREVERSE) != 0)++/*+ * START_WITH_DEL is the mode that Gap5 uses when building this. It prepends+ * all cigar strings with 1D and decrements the position by one. (And then+ * has code to reverse this operation in the pileup handler.)+ *+ * The reason for this is that it means reads starting with an insertion work.+ * Otherwise the inserted bases are silently lost. 
(Try it with "samtools+ * mpileup" and you can see it has the same issue.)+ *+ * However it's probably not want most people expect.+ */+//#define START_WITH_DEL++/* --------------------------------------------------------------------------+ * The pileup code itself.+ *+ * This consists of the external pileup_loop() function, which takes a+ * sam/bam samfile_t pointer and a callback function. The callback function+ * is called once per column of aligned data (so once per base in an+ * insertion).+ *+ * Current known issues.+ * 1) zero length matches, ie 2S2S cause failures.+ * 2) Insertions at starts of sequences get included in the soft clip, so+ *    2S2I2M is treated as if it's 4S2M+ * 3) From 1 and 2 above, 1S1I2S becomes 2S2S which fails.+ */+++/*+ * Fetches the next base => the nth base at unpadded position pos. (Nth can+ * be greater than 0 if we have an insertion in this column). Do not call this+ * with pos/nth lower than the previous query, although higher is better.+ * (This allows it to be initialised at base 0.)+ *+ * Stores the result in base and also updates is_insert to indicate that+ * this sequence still has more bases in this position beyond the current+ * nth parameter.+ *+ * Returns 1 if a base was fetched+ *         0 if not (eg ran off the end of sequence)+ */+static int get_next_base(pileup_t *p, int pos, int nth, int *is_insert) {+    bam1_t *b = &p->b;+    int op = p->cigar_op;++    p->start -= p->start>0;+    if (p->first_del && op != BAM_CPAD)+        p->first_del = 0;++    *is_insert = 0;++    /* Find pos first */+    while (p->pos < pos) {+        p->nth = 0;++        if (p->cigar_len == 0) {+            if (p->cigar_ind >= b->core.n_cigar) {+                p->eof = 1;+                return 0;+            }++            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+        }++        if ((op == BAM_CMATCH || op == 
BAM_CEQUAL || op == BAM_CDIFF)+            && p->cigar_len <= pos - p->pos) {+            p->seq_offset += p->cigar_len;+            p->pos += p->cigar_len;+            p->cigar_len = 0;+        } else {+            switch (op) {+            case BAM_CMATCH:+            case BAM_CEQUAL:+            case BAM_CDIFF:+                p->seq_offset++;+                /* Fall through */+            case BAM_CDEL:+            case BAM_CREF_SKIP:+                p->pos++;+                p->cigar_len--;+                break;++            case BAM_CINS:+            case BAM_CSOFT_CLIP:+                p->seq_offset += p->cigar_len;+                /* Fall through */+            case BAM_CPAD:+            case BAM_CHARD_CLIP:+                p->cigar_len = 0;+                break;++            default:+                fprintf(stderr, "Unhandled cigar_op %d\n", op);+                return -1;+            }+        }+    }++    /* Now at pos, find nth base */+    while (p->nth < nth) {+        if (p->cigar_len == 0) {+            if (p->cigar_ind >= b->core.n_cigar) {+                p->eof = 1;+                return 0; /* off end of seq */+            }++            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+        }++        switch (op) {+        case BAM_CMATCH:+        case BAM_CEQUAL:+        case BAM_CDIFF:+        case BAM_CSOFT_CLIP:+        case BAM_CDEL:+        case BAM_CREF_SKIP:+            goto at_nth; /* sorry, but it's fast! 
*/++        case BAM_CINS:+            p->seq_offset++;+            /* Fall through */+        case BAM_CPAD:+            p->cigar_len--;+            p->nth++;+            break;++        case BAM_CHARD_CLIP:+            p->cigar_len = 0;+            break;++        default:+            fprintf(stderr, "Unhandled cigar_op %d\n", op);+            return -1;+        }+    }+ at_nth:++    /* Fill out base & qual fields */+    p->ref_skip = 0;+    if (p->nth < nth && op != BAM_CINS) {+        //p->base = '-';+        p->base = '*';+        p->base4 = 16;+        p->padding = 1;+        if (p->seq_offset < b->core.l_qseq)+            p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+        else+            p->qual = 0;+    } else {+        p->padding = 0;+        switch(op) {+        case BAM_CDEL:+            p->base = '*';+            p->base4 = 16;+            if (p->seq_offset+1 < b->core.l_qseq)+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+            else+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;+            break;++        case BAM_CPAD:+            //p->base = '+';+            p->base = '*';+            p->base4 = 16;+            if (p->seq_offset+1 < b->core.l_qseq)+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+            else+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;+            break;++        case BAM_CREF_SKIP:+            p->base = '.';+            p->base4 = 0;+            p->qual = 0;+            /* end of fragment, but not sequence */+            p->eof = p->eof ? 
2 : 3;+            p->ref_skip = 1;+            break;++        default:+            if (p->seq_offset < b->core.l_qseq) {+                p->qual = p->b_qual[p->seq_offset];+                p->base4 = p->b_seq[p->seq_offset/2] >>+                    ((~p->seq_offset&1)<<2) & 0xf;+                p->base = "NACMGRSVTWYHKDBN"[p->base4];+            } else {+                p->base = 'N';+                p->base4 = 15;+                p->qual = 0xff;+            }++            break;+        }+    }++    /* Handle moving out of N (skip) into sequence again */+    if (p->eof && p->base != '.') {+        p->start = 1;+        p->ref_skip = 1;+        p->eof = 0;+    }++    /* Starting with an indel needs a minor fudge */+    if (p->start && p->cigar_op == BAM_CDEL) {+        p->first_del = 1;+    }++    /* Check if next op is an insertion of some sort */+    if (p->cigar_len == 0) {+        if (p->cigar_ind < b->core.n_cigar) {+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+            if (op == BAM_CREF_SKIP) {+                p->eof = 3;+                p->ref_skip = 1;+            }+        } else {+            p->eof = 1;+        }+    }++    switch (op) {+    case BAM_CPAD:+    case BAM_CINS:+        *is_insert = p->cigar_len;+        break;++    case BAM_CSOFT_CLIP:+        /* Last op 'S' => eof */+        p->eof = (p->cigar_ind == b->core.n_cigar ||+                  (p->cigar_ind+1 == b->core.n_cigar &&+                   (p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK)+                   == BAM_CHARD_CLIP))+            ? 1+            : 0;+        break;++    case BAM_CHARD_CLIP:+        p->eof = 1;+        break;++    default:+        break;+    }++    return 1;+}++/*+ * Loops through a set of supplied ranges producing columns of data.+ * When found, it calls func with clientdata as a callback. 
Func should
+ * return 0 for success and non-zero for failure. seq_init() is called
+ * on each new entry before we start processing it. It should return 0 or 1
+ * to indicate reject or accept status (eg to filter unmapped data).
+ * If seq_init() returns -1 we abort the pileup_loop with an error.
+ * seq_init may be NULL.
+ *
+ * Returns 0 on success
+ *        -1 on failure
+ */
+int pileup_loop(samFile *fp,
+                sam_hdr_t *h,
+                int (*seq_fetch)(void *client_data,
+                                 samFile *fp,
+                                 sam_hdr_t *h,
+                                 bam1_t *b),
+                int (*seq_init)(void *client_data,
+                                samFile *fp,
+                                sam_hdr_t *h,
+                                pileup_t *p),
+                int (*seq_add)(void *client_data,
+                               samFile *fp,
+                               sam_hdr_t *h,
+                               pileup_t *p,
+                               int depth,
+                               int pos,
+                               int nth,
+                               int is_insert),
+                void *client_data) {
+    int ret = -1;
+    pileup_t *phead = NULL, *p, *pfree = NULL, *last, *next, *ptail = NULL;
+    pileup_t *pnew = NULL;
+    int is_insert, nth = 0;
+    int col = 0, r;
+    int last_ref = -1;
+
+    /* FIXME: allow for start/stop boundaries rather than consuming all data */
+
+    if (NULL == (pnew = calloc(1, sizeof(*p))))
+        return -1;
+
+    do {
+        bam1_t *b;
+        int pos, last_in_contig;
+
+        //r = scram_next_seq(fp, &pnew->b);
+        r = seq_fetch(client_data, fp, h, &pnew->b);
+        //r = sam_read1(fp, h, &pnew->b); // FIXME: use readaln
+        if (r < -1) {
+            fprintf(stderr, "bam_next_seq() failure.\n");
+            return -1;
+        }
+
+        b = &pnew->b;
+
+        /* Force realloc */
+        //fp->bs = NULL;
+        //fp->bs_size = 0;
+
+        //r = samread(fp, pnew->b);
+        if (r >= 0) {
+            if (b->core.flag & BAM_FUNMAP)
+                continue;
+
+            if (b->core.tid == -1) {
+                /* Another indicator for unmapped */
+                continue;
+            } else if (b->core.tid == last_ref) {
+                pos = b->core.pos+1;
+                //printf("New seq at pos %d @ %d %s\n", pos, b->core.tid,
+                //       bam_name(b));
+                last_in_contig = 0;
+            } else {
+                //printf("New ctg at pos %d @ %d\n",b->core.pos+1,b->core.tid);
+                pos = (b->core.pos > col ? b->core.pos : col)+1;
+                last_in_contig = 1;
+            }
+        } else {
+            last_in_contig = 1;
+            pos = col+1;
+        }
+
+        if (col > pos) {
+            fprintf(stderr, "BAM/SAM file is not sorted by position. "
+                    "Aborting\n");
+            return -1;
+        }
+
+        /* Process data between the last column and our latest addition */
+        while (col < pos && phead) {
+            struct pileup *eof_head = NULL, *eofp = NULL;
+            int v, ins, depth = 0;
+            //printf("Col=%d pos=%d nth=%d\n", col, pos, nth);
+
+            /* Pileup */
+            is_insert = 0;
+            pileup_t *pnext = phead ? phead->next : NULL;
+            for (p = phead, last = NULL; p; p = pnext) {
+#if 0
+                // Simple prefetching
+                pnext = p->next;
+                if (pnext)
+                    _mm_prefetch(pnext, _MM_HINT_T0);
+#else
+                // More complex prefetching => more instructions, but
+                // no usually faster.
+                pnext = p->next;
+                if (pnext) {
+                    // start memory fetches; a big help on very deep data
+                    if (pnext->next)
+                        // struct 2 ahead
+                        _mm_prefetch(pnext->next, _MM_HINT_T0);
+                    // seq/qual 1 ahead
+                    _mm_prefetch(pnext->b_qual + pnext->seq_offset,
+                                 _MM_HINT_T0);
+                    _mm_prefetch(pnext->b_seq  + pnext->seq_offset/2,
+                                 _MM_HINT_T0);
+                }
+#endif
+
+                if (!get_next_base(p, col, nth, &ins))
+                    p->eof = 1;
+                if (p->eof == 1) {
+                    if (eofp)
+                        eofp->eofn = p;
+                    eofp = p;
+                    eofp->eofl = last;
+                    if (!eof_head)
+                        eof_head = eofp;
+                } else {
+                    last = p;
+                }
+
+                if (is_insert < ins)
+                    is_insert = ins;
+
+                depth++;
+            }
+            if ((ptail = last) == NULL)
+                ptail = phead;
+
+            /* Call our function on phead linked list */
+            v = seq_add(client_data, fp, h, phead, depth,
+#ifdef START_WITH_DEL
+                        col-1,
+#else
+                        col,
+#endif
+                        nth, is_insert);
+
+            /* Remove dead seqs */
+            for (p = eof_head ; p; p = p->eofn) {
+                next = p->next;

This appears to be unused...

jkbonfield

comment created time in 4 days

Pull request review comment samtools/samtools

Add a "samtools consensus" sub-command.

 check test: samtools $(BGZIP) $(TEST_PROGRAMS)
 	test/merge/test_bam_translate test/merge/test_bam_translate.tmp
 	test/merge/test_rtrans_build
 	test/merge/test_trans_tbl_init
-	cd test/mpileup && AWK="$(AWK)" ./regression.sh mpileup.reg
-	cd test/mpileup && AWK="$(AWK)" ./regression.sh depth.reg
+	cd test/mpileup && AWK="$(AWK)" ../regression.sh mpileup.reg
+	cd test/mpileup && AWK="$(AWK)" ../regression.sh depth.reg

consensus.reg is missing here. Adding it results in a few failures, which look like they're due to a missing expected output file.

jkbonfield

comment created time in 4 days

Pull request review comment samtools/samtools

Add a "samtools consensus" sub-command.

+/*  consensus__pileup.h -- Pileup orientated data per consensus column
+
+    Copyright (C) 2013-2016, 2020-2021 Genome Research Ltd.
+
+    Author: James Bonfied <jkb@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notices and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <config.h>
+#include <htslib/sam.h>
+
+#ifdef __SSE__
+#   include <xmmintrin.h>
+#else
+#   define _mm_prefetch(a,b)
+#endif
+
+#include "consensus_pileup.h"
+
+#define MIN(a,b) ((a)<(b)?(a):(b))
+#define bam_strand(b)  (((b)->core.flag & BAM_FREVERSE) != 0)
+
+/*
+ * START_WITH_DEL is the mode that Gap5 uses when building this. It prepends
+ * all cigar strings with 1D and decrements the position by one. (And then
+ * has code to reverse this operation in the pileup handler.)
+ *
+ * The reason for this is that it means reads starting with an insertion work.
+ * Otherwise the inserted bases are silently lost. (Try it with "samtools
+ * mpileup" and you can see it has the same issue.)
+ *
+ * However it's probably not want most people expect.
+ */
+//#define START_WITH_DEL
+
+/* --------------------------------------------------------------------------
+ * The pileup code itself.
+ *
+ * This consists of the external pileup_loop() function, which takes a
+ * sam/bam samfile_t pointer and a callback function. The callback function
+ * is called once per column of aligned data (so once per base in an
+ * insertion).
+ *
+ * Current known issues.
+ * 1) zero length matches, ie 2S2S cause failures.
+ * 2) Insertions at starts of sequences get included in the soft clip, so
+ *    2S2I2M is treated as if it's 4S2M
+ * 3) From 1 and 2 above, 1S1I2S becomes 2S2S which fails.
+ */
+
+
+/*
+ * Fetches the next base => the nth base at unpadded position pos. (Nth can
+ * be greater than 0 if we have an insertion in this column). Do not call this
+ * with pos/nth lower than the previous query, although higher is better.
+ * (This allows it to be initialised at base 0.)
+ *
+ * Stores the result in base and also updates is_insert to indicate that
+ * this sequence still has more bases in this position beyond the current
+ * nth parameter.
+ *
+ * Returns 1 if a base was fetched
+ *         0 if not (eg ran off the end of sequence)
+ */
+static int get_next_base(pileup_t *p, int pos, int nth, int *is_insert) {
+    bam1_t *b = &p->b;
+    int op = p->cigar_op;
+
+    p->start -= p->start>0;
+    if (p->first_del && op != BAM_CPAD)
+        p->first_del = 0;
+
+    *is_insert = 0;
+
+    /* Find pos first */
+    while (p->pos < pos) {
+        p->nth = 0;
+
+        if (p->cigar_len == 0) {
+            if (p->cigar_ind >= b->core.n_cigar) {
+                p->eof = 1;
+                return 0;
+            }
+
+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;
+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;
+            p->cigar_ind++;
+        }
+
+        if ((op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF)
+            && p->cigar_len <= pos - p->pos) {
+            p->seq_offset += p->cigar_len;
+            p->pos += p->cigar_len;
+            p->cigar_len = 0;
+        } else {
+            switch (op) {
+            case BAM_CMATCH:
+            case BAM_CEQUAL:
+            case BAM_CDIFF:
+                p->seq_offset++;
+                /* Fall through */
+            case BAM_CDEL:
+            case BAM_CREF_SKIP:
+                p->pos++;
+                p->cigar_len--;
+                break;
+
+            case BAM_CINS:
+            case BAM_CSOFT_CLIP:
+                p->seq_offset += p->cigar_len;
+                /* Fall through */
+            case BAM_CPAD:
+            case BAM_CHARD_CLIP:
+                p->cigar_len = 0;
+                break;
+
+            default:
+                fprintf(stderr, "Unhandled cigar_op %d\n", op);
+                return -1;
+            }
+        }
+    }
+
+    /* Now at pos, find nth base */
+    while (p->nth < nth) {
+        if (p->cigar_len == 0) {
+            if (p->cigar_ind >= b->core.n_cigar) {
+                p->eof = 1;
+                return 0; /* off end of seq */
+            }
+
+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;
+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;
+            p->cigar_ind++;
+        }
+
+        switch (op) {
+        case BAM_CMATCH:
+        case BAM_CEQUAL:
+        case BAM_CDIFF:
+        case BAM_CSOFT_CLIP:
+        case BAM_CDEL:
+        case BAM_CREF_SKIP:
+            goto at_nth; /* sorry, but it's fast! */
+
+        case BAM_CINS:
+            p->seq_offset++;
+            /* Fall through */
+        case BAM_CPAD:
+            p->cigar_len--;
+            p->nth++;
+            break;
+
+        case BAM_CHARD_CLIP:
+            p->cigar_len = 0;
+            break;
+
+        default:
+            fprintf(stderr, "Unhandled cigar_op %d\n", op);
+            return -1;
+        }
+    }
+ at_nth:
+
+    /* Fill out base & qual fields */
+    p->ref_skip = 0;
+    if (p->nth < nth && op != BAM_CINS) {
+        //p->base = '-';
+        p->base = '*';
+        p->base4 = 16;
+        p->padding = 1;
+        if (p->seq_offset < b->core.l_qseq)
+            p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;
+        else
+            p->qual = 0;
+    } else {
+        p->padding = 0;
+        switch(op) {
+        case BAM_CDEL:
+            p->base = '*';
+            p->base4 = 16;
+            if (p->seq_offset+1 < b->core.l_qseq)
+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;
+            else
+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;
+            break;
+
+        case BAM_CPAD:
+            //p->base = '+';
+            p->base = '*';
+            p->base4 = 16;
+            if (p->seq_offset+1 < b->core.l_qseq)
+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;
+            else
+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;
+            break;
+
+        case BAM_CREF_SKIP:
+            p->base = '.';
+            p->base4 = 0;
+            p->qual = 0;
+            /* end of fragment, but not sequence */
+            p->eof = p->eof ? 2 : 3;
+            p->ref_skip = 1;
+            break;
+
+        default:
+            if (p->seq_offset < b->core.l_qseq) {
+                p->qual = p->b_qual[p->seq_offset];
+                p->base4 = p->b_seq[p->seq_offset/2] >>
+                    ((~p->seq_offset&1)<<2) & 0xf;
+                p->base = "NACMGRSVTWYHKDBN"[p->base4];
+            } else {
+                p->base = 'N';
+                p->base4 = 15;
+                p->qual = 0xff;
+            }
+
+            break;
+        }
+    }
+
+    /* Handle moving out of N (skip) into sequence again */
+    if (p->eof && p->base != '.') {
+        p->start = 1;
+        p->ref_skip = 1;
+        p->eof = 0;
+    }
+
+    /* Starting with an indel needs a minor fudge */
+    if (p->start && p->cigar_op == BAM_CDEL) {
+        p->first_del = 1;
+    }
+
+    /* Check if next op is an insertion of some sort */
+    if (p->cigar_len == 0) {
+        if (p->cigar_ind < b->core.n_cigar) {
+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;
+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;
+            p->cigar_ind++;
+            if (op == BAM_CREF_SKIP) {
+                p->eof = 3;
+                p->ref_skip = 1;
+            }
+        } else {
+            p->eof = 1;
+        }
+    }
+
+    switch (op) {
+    case BAM_CPAD:
+    case BAM_CINS:
+        *is_insert = p->cigar_len;
+        break;
+
+    case BAM_CSOFT_CLIP:
+        /* Last op 'S' => eof */
+        p->eof = (p->cigar_ind == b->core.n_cigar ||
+                  (p->cigar_ind+1 == b->core.n_cigar &&
+                   (p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK)
+                   == BAM_CHARD_CLIP))
+            ? 1
+            : 0;
+        break;
+
+    case BAM_CHARD_CLIP:
+        p->eof = 1;
+        break;
+
+    default:
+        break;
+    }
+
+    return 1;
+}
+
+/*
+ * Loops through a set of supplied ranges producing columns of data.
+ * When found, it calls func with clientdata as a callback. Func should
+ * return 0 for success and non-zero for failure. seq_init() is called
+ * on each new entry before we start processing it. It should return 0 or 1
+ * to indicate reject or accept status (eg to filter unmapped data).
+ * If seq_init() returns -1 we abort the pileup_loop with an error.
+ * seq_init may be NULL.
+ *
+ * Returns 0 on success
+ *        -1 on failure
+ */
+int pileup_loop(samFile *fp,
+                sam_hdr_t *h,
+                int (*seq_fetch)(void *client_data,
+                                 samFile *fp,
+                                 sam_hdr_t *h,
+                                 bam1_t *b),
+                int (*seq_init)(void *client_data,
+                                samFile *fp,
+                                sam_hdr_t *h,
+                                pileup_t *p),
+                int (*seq_add)(void *client_data,
+                               samFile *fp,
+                               sam_hdr_t *h,
+                               pileup_t *p,
+                               int depth,
+                               int pos,
+                               int nth,
+                               int is_insert),
+                void *client_data) {
+    int ret = -1;
+    pileup_t *phead = NULL, *p, *pfree = NULL, *last, *next, *ptail = NULL;
+    pileup_t *pnew = NULL;
+    int is_insert, nth = 0;
+    int col = 0, r;
+    int last_ref = -1;
+
+    /* FIXME: allow for start/stop boundaries rather than consuming all data */
+
+    if (NULL == (pnew = calloc(1, sizeof(*p))))
+        return -1;
+
+    do {
+        bam1_t *b;
+        int pos, last_in_contig;
+
+        //r = scram_next_seq(fp, &pnew->b);
+        r = seq_fetch(client_data, fp, h, &pnew->b);
+        //r = sam_read1(fp, h, &pnew->b); // FIXME: use readaln
+        if (r < -1) {
+            fprintf(stderr, "bam_next_seq() failure.\n");
+            return -1;

Possible leak here: pnew is still allocated when this returns. Should it be goto error; instead?
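For illustration only, here is a minimal sketch of the single-exit cleanup pattern the comment is hinting at. It is not the samtools code: `node_t`, `node_new()`, `node_free()`, `demo_fetch()`, `loop_with_cleanup()` and the `live_nodes` counter are all hypothetical stand-ins, with `node_t` playing the role of `pileup_t` and the fetch callback mimicking the "`< -1` means error" convention seen in the hunk above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for pileup_t; only what the sketch needs. */
typedef struct node {
    struct node *next;
    int value;
} node_t;

static int live_nodes = 0;  /* allocation counter, for the demo only */

static node_t *node_new(void) {
    node_t *n = calloc(1, sizeof(*n));
    if (n) live_nodes++;
    return n;
}

static void node_free(node_t *n) {
    if (n) { free(n); live_nodes--; }
}

/* Hypothetical fetch callback: >= 0 record read, -1 EOF, < -1 error. */
typedef int (*fetch_fn)(void *client_data, int *value);

/* Demo fetch: yields st[0] records, then returns st[1] (-1 EOF, -2 error). */
int demo_fetch(void *client_data, int *value) {
    int *st = client_data;
    if (st[0] > 0) { *value = st[0]--; return 0; }
    return st[1];
}

int loop_with_cleanup(fetch_fn fetch, void *client_data) {
    int ret = -1, r;
    node_t *head = NULL, *tail = NULL, *pnew = NULL, *p, *next;

    if (!(pnew = node_new()))
        return -1;          /* nothing else allocated yet: plain return is fine */

    do {
        r = fetch(client_data, &pnew->value);
        if (r < -1) {
            fprintf(stderr, "fetch failure\n");
            goto error;     /* not "return -1": pnew and the list are live */
        }
        if (r >= 0) {
            /* Append the filled node and allocate a fresh one to read into. */
            if (tail) tail->next = pnew; else head = pnew;
            tail = pnew;
            if (!(pnew = node_new()))
                goto error;
        }
    } while (r >= 0);

    ret = 0;
 error:
    /* Success and failure share one exit, so everything is freed once. */
    node_free(pnew);
    for (p = head; p; p = next) {
        next = p->next;
        node_free(p);
    }
    return ret;
}
```

Calling `loop_with_cleanup(demo_fetch, state)` with `state = {3, -2}` takes the error path after three records; in this sketch `live_nodes` is back to 0 afterwards, which is the property an early `return -1` would break.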

jkbonfield

comment created time in 4 days

Pull request review comment samtools/samtools

Add a "samtools consensus" sub-command.

+        /* Process data between the last column and our latest addition */
+        while (col < pos && phead) {
+            struct pileup *eof_head = NULL, *eofp = NULL;
+            int v, ins, depth = 0;
+            //printf("Col=%d pos=%d nth=%d\n", col, pos, nth);
+
+            /* Pileup */
+            is_insert = 0;
+            pileup_t *pnext = phead ? phead->next : NULL;
+            for (p = phead, last = NULL; p; p = pnext) {
+#if 0
+                // Simple prefetching
+                pnext = p->next;
+                if (pnext)
+                    _mm_prefetch(pnext, _MM_HINT_T0);
+#else
+                // More complex prefetching => more instructions, but
+                // no usually faster.

Should that "no" be there?

jkbonfield

comment created time in 4 days

Pull request review comment samtools/samtools

Add a "samtools consensus" sub-command.

+/*  consensus__pileup.h -- Pileup orientated data per consensus column++    Copyright (C) 2013-2016, 2020-2021 Genome Research Ltd.++    Author: James Bonfied <jkb@sanger.ac.uk>++Permission is hereby granted, free of charge, to any person obtaining a copy+of this software and associated documentation files (the "Software"), to deal+in the Software without restriction, including without limitation the rights+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell+copies of the Software, and to permit persons to whom the Software is+furnished to do so, subject to the following conditions:++The above copyright notices and this permission notice shall be included in+all copies or substantial portions of the Software.++THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER+DEALINGS IN THE SOFTWARE.  */++#include <config.h>+#include <htslib/sam.h>++#ifdef __SSE__+#   include <xmmintrin.h>+#else+#   define _mm_prefetch(a,b)+#endif++#include "consensus_pileup.h"++#define MIN(a,b) ((a)<(b)?(a):(b))+#define bam_strand(b)  (((b)->core.flag & BAM_FREVERSE) != 0)++/*+ * START_WITH_DEL is the mode that Gap5 uses when building this. It prepends+ * all cigar strings with 1D and decrements the position by one. (And then+ * has code to reverse this operation in the pileup handler.)+ *+ * The reason for this is that it means reads starting with an insertion work.+ * Otherwise the inserted bases are silently lost. 
(Try it with "samtools
+ * mpileup" and you can see it has the same issue.)
+ *
+ * However it's probably not want most people expect.
+ */
+//#define START_WITH_DEL
+
+/* --------------------------------------------------------------------------
+ * The pileup code itself.
+ *
+ * This consists of the external pileup_loop() function, which takes a
+ * sam/bam samfile_t pointer and a callback function. The callback function
+ * is called once per column of aligned data (so once per base in an
+ * insertion).
+ *
+ * Current known issues.
+ * 1) zero length matches, ie 2S2S cause failures.
+ * 2) Insertions at starts of sequences get included in the soft clip, so
+ *    2S2I2M is treated as if it's 4S2M
+ * 3) From 1 and 2 above, 1S1I2S becomes 2S2S which fails.
+ */
+
+
+/*
+ * Fetches the next base => the nth base at unpadded position pos. (Nth can
+ * be greater than 0 if we have an insertion in this column). Do not call this
+ * with pos/nth lower than the previous query, although higher is better.
+ * (This allows it to be initialised at base 0.)
+ *
+ * Stores the result in base and also updates is_insert to indicate that
+ * this sequence still has more bases in this position beyond the current
+ * nth parameter.
+ *
+ * Returns 1 if a base was fetched
+ *         0 if not (eg ran off the end of sequence)
+ */
+static int get_next_base(pileup_t *p, int pos, int nth, int *is_insert) {
+    bam1_t *b = &p->b;
+    int op = p->cigar_op;
+
+    p->start -= p->start>0;
+    if (p->first_del && op != BAM_CPAD)
+        p->first_del = 0;
+
+    *is_insert = 0;
+
+    /* Find pos first */
+    while (p->pos < pos) {
+        p->nth = 0;
+
+        if (p->cigar_len == 0) {
+            if (p->cigar_ind >= b->core.n_cigar) {
+                p->eof = 1;
+                return 0;
+            }
+
+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;
+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;
+            p->cigar_ind++;
+        }
+
+        if ((op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF)
+            && p->cigar_len <= pos - p->pos) {
+            p->seq_offset += p->cigar_len;
+            p->pos += p->cigar_len;
+            p->cigar_len = 0;
+        } else {
+            switch (op) {
+            case BAM_CMATCH:
+            case BAM_CEQUAL:
+            case BAM_CDIFF:
+                p->seq_offset++;
+                /* Fall through */
+            case BAM_CDEL:
+            case BAM_CREF_SKIP:
+                p->pos++;
+                p->cigar_len--;
+                break;
+
+            case BAM_CINS:
+            case BAM_CSOFT_CLIP:
+                p->seq_offset += p->cigar_len;
+                /* Fall through */
+            case BAM_CPAD:
+            case BAM_CHARD_CLIP:
+                p->cigar_len = 0;
+                break;
+
+            default:
+                fprintf(stderr, "Unhandled cigar_op %d\n", op);
+                return -1;
+            }
+        }
+    }
+
+    /* Now at pos, find nth base */
+    while (p->nth < nth) {
+        if (p->cigar_len == 0) {
+            if (p->cigar_ind >= b->core.n_cigar) {
+                p->eof = 1;
+                return 0; /* off end of seq */
+            }
+
+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;
+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;
+            p->cigar_ind++;
+        }
+
+        switch (op) {
+        case BAM_CMATCH:
+        case BAM_CEQUAL:
+        case BAM_CDIFF:
+        case BAM_CSOFT_CLIP:
+        case BAM_CDEL:
+        case BAM_CREF_SKIP:
+            goto at_nth; /* sorry, but it's fast! */
+
+        case BAM_CINS:
+            p->seq_offset++;
+            /* Fall through */
+        case BAM_CPAD:
+            p->cigar_len--;
+            p->nth++;
+            break;
+
+        case BAM_CHARD_CLIP:
+            p->cigar_len = 0;
+            break;
+
+        default:
+            fprintf(stderr, "Unhandled cigar_op %d\n", op);
+            return -1;
+        }
+    }
+ at_nth:
+
+    /* Fill out base & qual fields */
+    p->ref_skip = 0;
+    if (p->nth < nth && op != BAM_CINS) {
+        //p->base = '-';
+        p->base = '*';
+        p->base4 = 16;
+        p->padding = 1;
+        if (p->seq_offset < b->core.l_qseq)
+            p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;
+        else
+            p->qual = 0;
+    } else {
+        p->padding = 0;
+        switch(op) {
+        case BAM_CDEL:
+            p->base = '*';
+            p->base4 = 16;
+            if (p->seq_offset+1 < b->core.l_qseq)
+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;
+            else
+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;
+            break;
+
+        case BAM_CPAD:
+            //p->base = '+';
+            p->base = '*';
+            p->base4 = 16;
+            if (p->seq_offset+1 < b->core.l_qseq)
+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;
+            else
+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;
+            break;
+
+        case BAM_CREF_SKIP:
+            p->base = '.';
+            p->base4 = 0;
+            p->qual = 0;
+            /* end of fragment, but not sequence */
+            p->eof = p->eof ? 2 : 3;
+            p->ref_skip = 1;
+            break;
+
+        default:
+            if (p->seq_offset < b->core.l_qseq) {
+                p->qual = p->b_qual[p->seq_offset];
+                p->base4 = p->b_seq[p->seq_offset/2] >>
+                    ((~p->seq_offset&1)<<2) & 0xf;
+                p->base = "NACMGRSVTWYHKDBN"[p->base4];
+            } else {
+                p->base = 'N';
+                p->base4 = 15;
+                p->qual = 0xff;
+            }
+
+            break;
+        }
+    }
+
+    /* Handle moving out of N (skip) into sequence again */
+    if (p->eof && p->base != '.') {
+        p->start = 1;
+        p->ref_skip = 1;
+        p->eof = 0;
+    }
+
+    /* Starting with an indel needs a minor fudge */
+    if (p->start && p->cigar_op == BAM_CDEL) {
+        p->first_del = 1;
+    }
+
+    /* Check if next op is an insertion of some sort */
+    if (p->cigar_len == 0) {
+        if (p->cigar_ind < b->core.n_cigar) {
+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;
+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;
+            p->cigar_ind++;
+            if (op == BAM_CREF_SKIP) {
+                p->eof = 3;
+                p->ref_skip = 1;
+            }
+        } else {
+            p->eof = 1;
+        }
+    }
+
+    switch (op) {
+    case BAM_CPAD:
+    case BAM_CINS:
+        *is_insert = p->cigar_len;
+        break;
+
+    case BAM_CSOFT_CLIP:
+        /* Last op 'S' => eof */
+        p->eof = (p->cigar_ind == b->core.n_cigar ||
+                  (p->cigar_ind+1 == b->core.n_cigar &&
+                   (p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK)
+                   == BAM_CHARD_CLIP))
+            ? 1
+            : 0;
+        break;
+
+    case BAM_CHARD_CLIP:
+        p->eof = 1;
+        break;
+
+    default:
+        break;
+    }
+
+    return 1;
+}
+
+/*
+ * Loops through a set of supplied ranges producing columns of data.
+ * When found, it calls func with clientdata as a callback. Func should
+ * return 0 for success and non-zero for failure. seq_init() is called
+ * on each new entry before we start processing it. It should return 0 or 1
+ * to indicate reject or accept status (eg to filter unmapped data).
+ * If seq_init() returns -1 we abort the pileup_loop with an error.
+ * seq_init may be NULL.
+ *
+ * Returns 0 on success
+ *        -1 on failure
+ */
+int pileup_loop(samFile *fp,
+                sam_hdr_t *h,
+                int (*seq_fetch)(void *client_data,
+                                 samFile *fp,
+                                 sam_hdr_t *h,
+                                 bam1_t *b),
+                int (*seq_init)(void *client_data,
+                                samFile *fp,
+                                sam_hdr_t *h,
+                                pileup_t *p),
+                int (*seq_add)(void *client_data,
+                               samFile *fp,
+                               sam_hdr_t *h,
+                               pileup_t *p,
+                               int depth,
+                               int pos,
+                               int nth,
+                               int is_insert),
+                void *client_data) {
+    int ret = -1;
+    pileup_t *phead = NULL, *p, *pfree = NULL, *last, *next, *ptail = NULL;
+    pileup_t *pnew = NULL;
+    int is_insert, nth = 0;
+    int col = 0, r;
+    int last_ref = -1;
+
+    /* FIXME: allow for start/stop boundaries rather than consuming all data */
+
+    if (NULL == (pnew = calloc(1, sizeof(*p))))
+        return -1;
+
+    do {
+        bam1_t *b;
+        int pos, last_in_contig;
+
+        //r = scram_next_seq(fp, &pnew->b);
+        r = seq_fetch(client_data, fp, h, &pnew->b);
+        //r = sam_read1(fp, h, &pnew->b); // FIXME: use readaln
+        if (r < -1) {
+            fprintf(stderr, "bam_next_seq() failure.\n");
+            return -1;
+        }
+
+        b = &pnew->b;
+
+        /* Force realloc */
+        //fp->bs = NULL;
+        //fp->bs_size = 0;
+
+        //r = samread(fp, pnew->b);
+        if (r >= 0) {
+            if (b->core.flag & BAM_FUNMAP)
+                continue;
+
+            if (b->core.tid == -1) {
+                /* Another indicator for unmapped */
+                continue;
+            } else if (b->core.tid == last_ref) {
+                pos = b->core.pos+1;
+                //printf("New seq at pos %d @ %d %s\n", pos, b->core.tid,
+                //       bam_name(b));
+                last_in_contig = 0;
+            } else {
+                //printf("New ctg at pos %d @ %d\n",b->core.pos+1,b->core.tid);
+                pos = (b->core.pos > col ? b->core.pos : col)+1;
+                last_in_contig = 1;
+            }
+        } else {
+            last_in_contig = 1;
+            pos = col+1;
+        }
+
+        if (col > pos) {
+            fprintf(stderr, "BAM/SAM file is not sorted by position. "
+                    "Aborting\n");
+            return -1;

More possible leakage.
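
For illustration only, here is a minimal sketch of one common way to avoid this kind of leak: route every exit through a single clean-up label instead of returning directly after `calloc()`. It is not the PR's code. `item_t`, `process_all()`, `countdown_fetch()` and `failing_fetch()` are hypothetical stand-ins for `pileup_t`, `pileup_loop()` and the `seq_fetch` callback, and the fetch return convention (>= 0 record, -1 end of input, < -1 error) is assumed from the hunk above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for pileup_t; the real struct embeds a bam1_t. */
typedef struct item {
    struct item *next;
    int value;
} item_t;

/* Free every node on a singly linked list. */
static void free_list(item_t *head) {
    while (head) {
        item_t *next = head->next;
        free(head);
        head = next;
    }
}

/*
 * Hypothetical loop. fetch() is assumed to return >= 0 for a record,
 * -1 at end of input and < -1 on error.
 * Returns 0 on success, -1 on failure; nothing is leaked either way.
 */
static int process_all(int (*fetch)(void *cd, item_t *it), void *cd) {
    int ret = -1;
    item_t *head = NULL, *pnew = NULL;

    for (;;) {
        int r;

        if (!(pnew = calloc(1, sizeof(*pnew))))
            goto cleanup;

        r = fetch(cd, pnew);
        if (r < -1) {
            fprintf(stderr, "fetch failure\n");
            goto cleanup;          /* pnew and the list are freed below */
        }
        if (r < 0)
            break;                 /* end of input; spare pnew freed below */

        pnew->next = head;         /* the list now owns the node */
        head = pnew;
        pnew = NULL;
    }
    ret = 0;

 cleanup:
    free(pnew);                    /* free(NULL) is a no-op */
    free_list(head);
    return ret;
}

/* Example callback: yields *cd records, then signals end of input. */
static int countdown_fetch(void *cd, item_t *it) {
    int *remaining = cd;
    if (*remaining <= 0)
        return -1;
    it->value = (*remaining)--;
    return 0;
}

/* Example callback that always reports an error. */
static int failing_fetch(void *cd, item_t *it) {
    (void) cd;
    (void) it;
    return -2;
}
```

Calling `process_all(countdown_fetch, &n)` with `n = 3` consumes three records and returns 0, while `process_all(failing_fetch, NULL)` returns -1. Both paths free the spare node and the list.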

jkbonfield

comment created time in 4 days

PullRequestReviewEvent

Pull request review comment samtools/samtools

Add a "samtools consensus" sub-command.

+/*  consensus__pileup.h -- Pileup orientated data per consensus column
+
+    Copyright (C) 2013-2016, 2020-2021 Genome Research Ltd.
+
+    Author: James Bonfied <jkb@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notices and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <config.h>
+#include <htslib/sam.h>
+
+typedef struct pileup {
+    // commonly used things together, to fit in a cache line (64 bytes)
+    struct pileup *next;  // A link list, for active seqs
+    void *cd;             // General purpose per-seq client-data
+    int  eof;             // True if this sequence has finished
+    int  qual;            // Current qual (for active seq only)
+    char start;           // True if this is a new sequence
+    char base;            // Current base (for active seq only) in ASCII
+    char ref_skip;        // True if the cause of eof or start is cigar N
+    char padding;         // True if the base was added due to another seq
+    int  base4;            // Base in 4-bit notation (0-15)
+    int  pos;             // Current unpadded position in seq

This should be type hts_pos_t, and the same type needs to be used in other places that pos gets passed to.
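
As a rough sketch of what that change implies (not the PR's code): in HTSlib 1.10 and later `hts_pos_t` is a signed 64-bit type with a matching `PRIhts_pos` format macro, so a position that does not fit in `int` has to stay 64-bit through the struct, any arithmetic and any formatting. To keep the example self-contained, `sketch_pos_t`, `SKETCH_PRIpos`, `sketch_pileup_t`, `pos_fits_int()` and `format_column()` below are hypothetical stand-ins rather than HTSlib names.

```c
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for hts_pos_t / PRIhts_pos, assumed to be 64-bit as in HTSlib 1.10+. */
typedef int64_t sketch_pos_t;
#define SKETCH_PRIpos PRId64

/* Cut-down, hypothetical version of the pileup struct with a 64-bit position. */
typedef struct sketch_pileup {
    struct sketch_pileup *next;
    sketch_pos_t pos;              /* current unpadded position (0-based) */
} sketch_pileup_t;

/* Returns 1 if pos could be stored in a plain int without loss, 0 otherwise. */
static int pos_fits_int(sketch_pos_t pos) {
    return pos >= INT_MIN && pos <= INT_MAX;
}

/*
 * Writes the 1-based position into buf using the 64-bit format macro.
 * Returns the value from snprintf(), i.e. the number of characters needed.
 */
static int format_column(char *buf, size_t len, const sketch_pileup_t *p) {
    return snprintf(buf, len, "%" SKETCH_PRIpos, p->pos + 1);
}
```

With a 32-bit `int`, a 0-based position of 2999999999 fails `pos_fits_int()`, yet `format_column()` still writes the correct 1-based value because the field, the `+ 1` and the format specifier are all 64-bit.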

jkbonfield

comment created time in 4 days

PullRequestReviewEvent

delete branch daviesrob/samtools

delete branch : pr1542m

delete time in 5 days

PR merged samtools/samtools

Add new `view --fetch-pairs` option

Proof-of-principle implementation that pulls reads from regions together with their mates, even when the mates fall outside the regions or are unmapped.

Note

  • so far this has been tested on small data only
  • tests need to be added
  • everything is stored in memory; with many regions, external temporary storage may be needed
  • some unmapped reads may be missed; this depends on the success of pull request https://github.com/samtools/htslib/pull/1352
+702 -236

17 comments

7 changed files

pd3

pr closed time in 5 days

push event samtools/samtools

Petr Danecek

commit sha 487f2019c9bb8aa0c065679af28d46f33c16e9ed

Add new `view --fetch-pairs` option

The new `-P, --fetch-pairs` option allows to pull reads from regions including their mates even when they fall outside the regions or are unmapped.

Note - everything is stored in memory, with many regions an external temporary storage may be needed

Rob Davies

commit sha 5feb1b06afb41d050424c45c7f32893327186cc1

Restore initial subsam_frac setting

Making it negative ensures no sub-sampling happens.

Rob Davies

commit sha 106917e45c71e6004fdfa37084dd37ebfa5e7dc5

Error handling improvements

Adjust some error messages

Add more error checking, mostly for memory allocation and to ensure that iterators have finished correctly.

Ensures that errors bubble up to main() with proper clean-up instead of calling exit directly. Mainly benefits pysam, which embeds samtools in a way that leaks resources if they aren't cleaned up properly.

Rob Davies

commit sha 1c13b02cc1966b49f7b1ab815906022ed1e5d3e0

Use multi_region_view() for the old region iterator

The code was duplicated between the multi- and original iterator loops, so multi_region_view() can be used for both - as long as samview_settings_t::hts_idx is left intact between calls.

Rob Davies

commit sha f792ea614a81870dc61a48de098cee97e31e26b0

Remove more duplicated code

Common code for filtering and writing alignment records is now used for the streaming, and single- and multi- iterator cases.

Rob Davies

commit sha 28331fd7596f8f1c6a1da1459a5135d1e4381816

Split aux tag adjustment from read filtering

Pull the part that implements the `--keep-tags` and `--remove-tags` options out of process_aln() and put it in a new adjust_tags() function. This means the tag adjustment only has to be done when actually needed - so for example it can be skipped in counting mode.

Invalid aux tags detected in adjust_tags() are now reported immediately as errors. This just means they're trapped a bit earlier, as before they would have caused the writer to fail.

Note that as this code used to come after all the process_aln() filters had run, the tags were only ever edited if the read passed filtering. For compatibility the new version has been made to work the same way, and notes have been added to the manual page stating that the options only affect passed reads.

Rob Davies

commit sha 55c36b72dcebc87110a602666025e87dacc9ced5

Make filtering work on all potential output reads for --fetch-pairs

Ensure filters are consistently applied by using them for found mate pairs as well as reads in the original search region. This fixes some potential anomalies that could occur when the mate pairs were always included.

The first pass filter also skips records where the BAM_FPAIRED bit is unset, to avoid going on a wild-goose chase looking for pair records that don't exist.

Change the filtering test to one that excludes DUP reads.

Update the manual page with more information about how the option works and revised text for how it interacts with other options.

Rob Davies

commit sha 6339b7cea36d612dd220999661b7e6d233d0204e

Merge PR Add new view --fetch-pairs option (#1542)

push time in 5 days

create branch daviesrob/samtools

branch : pr1542m

created branch time in 5 days

pull request comment samtools/samtools

Add new `view --fetch-pairs` option

Now squashed to a slightly tidier set of commits.

pd3

comment created time in 5 days

push event pd3/samtools

Rob Davies

commit sha 5feb1b06afb41d050424c45c7f32893327186cc1

Restore initial subsam_frac setting

Making it negative ensures no sub-sampling happens.

Rob Davies

commit sha 106917e45c71e6004fdfa37084dd37ebfa5e7dc5

Error handling improvements

Adjust some error messages

Add more error checking, mostly for memory allocation and to ensure that iterators have finished correctly.

Ensures that errors bubble up to main() with proper clean-up instead of calling exit directly. Mainly benefits pysam, which embeds samtools in a way that leaks resources if they aren't cleaned up properly.

Rob Davies

commit sha 1c13b02cc1966b49f7b1ab815906022ed1e5d3e0

Use multi_region_view() for the old region iterator

The code was duplicated between the multi- and original iterator loops, so multi_region_view() can be used for both - as long as samview_settings_t::hts_idx is left intact between calls.

Rob Davies

commit sha f792ea614a81870dc61a48de098cee97e31e26b0

Remove more duplicated code

Common code for filtering and writing alignment records is now used for the streaming, and single- and multi- iterator cases.

Rob Davies

commit sha 28331fd7596f8f1c6a1da1459a5135d1e4381816

Split aux tag adjustment from read filtering

Pull the part that implements the `--keep-tags` and `--remove-tags` options out of process_aln() and put it in a new adjust_tags() function. This means the tag adjustment only has to be done when actually needed - so for example it can be skipped in counting mode.

Invalid aux tags detected in adjust_tags() are now reported immediately as errors. This just means they're trapped a bit earlier, as before they would have caused the writer to fail.

Note that as this code used to come after all the process_aln() filters had run, the tags were only ever edited if the read passed filtering. For compatibility the new version has been made to work the same way, and notes have been added to the manual page stating that the options only affect passed reads.

Rob Davies

commit sha 55c36b72dcebc87110a602666025e87dacc9ced5

Make filtering work on all potential output reads for --fetch-pairs

Ensure filters are consistently applied by using them for found mate pairs as well as reads in the original search region. This fixes some potential anomalies that could occur when the mate pairs were always included.

The first pass filter also skips records where the BAM_FPAIRED bit is unset, to avoid going on a wild-goose chase looking for pair records that don't exist.

Change the filtering test to one that excludes DUP reads.

Update the manual page with more information about how the option works and revised text for how it interacts with other options.

push time in 5 days

pull request comment samtools/samtools

Add new `view --fetch-pairs` option

Another small adjustment to check the BAM_FPAIRED flag bit. If this is unset, there shouldn't be a mate, and following RNEXT and PNEXT is unlikely to lead anywhere useful.
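
A minimal sketch of the kind of test involved, not the code in the PR: the SAM FLAG bit 0x1 (exposed by HTSlib as `BAM_FPAIRED`) marks a read as paired, so a record without it can be skipped before any mate lookup. `SKETCH_FPAIRED`, `sketch_rec_t` and `worth_chasing_mate()` are hypothetical names used only to keep the example self-contained.

```c
#include <stdint.h>

/* Same value as SAM FLAG 0x1 (BAM_FPAIRED in HTSlib); redefined here so the
 * sketch compiles without the HTSlib headers. */
#define SKETCH_FPAIRED 0x1

/* Hypothetical cut-down record holding only the FLAG field. */
typedef struct {
    uint16_t flag;
} sketch_rec_t;

/*
 * Returns 1 if the record claims to be paired, so following its mate
 * reference and position could find something; 0 if there is no mate to find.
 */
static int worth_chasing_mate(const sketch_rec_t *rec) {
    return (rec->flag & SKETCH_FPAIRED) != 0;
}
```

A record with FLAG 0x41 (paired, first in pair) passes the check, while one with FLAG 0x10 (reverse strand only) does not.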

pd3

comment created time in 6 days

push event pd3/samtools

Rob Davies

commit sha 80f8df0884b300de53318c85fc41563fd024a182

Only look for pairs on records that are actually paired

Skip records where the BAM_FPAIRED bit is unset, to avoid going on a wild-goose chase looking for pair records that don't exist.

push time in 6 days

delete branch daviesrob/samtools

delete branch : new_year_2022

delete time in 6 days

PR opened samtools/samtools

Copyright updates for new year.
+3 -3

0 comments

2 changed files

pr created time in 6 days

create branch daviesrob/samtools

branch : new_year_2022

created branch time in 6 days

delete branch daviesrob/htslib

delete branch : new_year

delete time in 7 days
