profile
viewpoint
James Bonfield jkbonfield Wellcome Trust Sanger Institute Cambridge, UK

jkbonfield/rans_static 60

rANS coder (derived from https://github.com/rygorous/ryg_rans)

jkbonfield/io_lib 30

Staden Package "io_lib" (sometimes referred to as libstaden-read by distributions). This contains code for reading and writing a variety of Bioinformatics / DNA Sequence formats.

jkbonfield/crumble 29

Exploration of controlled loss of quality values for compressing CRAM files

mklarqvist/djinn 17

C++ library for analysing and storing large-scale cohorts of sequence variant data

jkbonfield/fqzcomp 11

Fastq compression tool

samtools/htscodecs 6

Custom compression for CRAM and others.

jkbonfield/htslib 5

C library for high-throughput sequencing data formats

jkbonfield/samtools 4

Samtools with CRAM support (inherited from staden package io_lib).

Pull request review commentsamtools/samtools

Add a "samtools consensus" sub-command.

+/*  consensus__pileup.h -- Pileup orientated data per consensus column++    Copyright (C) 2013-2016, 2020-2021 Genome Research Ltd.++    Author: James Bonfied <jkb@sanger.ac.uk>++Permission is hereby granted, free of charge, to any person obtaining a copy+of this software and associated documentation files (the "Software"), to deal+in the Software without restriction, including without limitation the rights+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell+copies of the Software, and to permit persons to whom the Software is+furnished to do so, subject to the following conditions:++The above copyright notices and this permission notice shall be included in+all copies or substantial portions of the Software.++THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER+DEALINGS IN THE SOFTWARE.  */++#include <config.h>+#include <htslib/sam.h>++#ifdef __SSE__+#   include <xmmintrin.h>+#else+#   define _mm_prefetch(a,b)+#endif++#include "consensus_pileup.h"++#define MIN(a,b) ((a)<(b)?(a):(b))+#define bam_strand(b)  (((b)->core.flag & BAM_FREVERSE) != 0)++/*+ * START_WITH_DEL is the mode that Gap5 uses when building this. It prepends+ * all cigar strings with 1D and decrements the position by one. (And then+ * has code to reverse this operation in the pileup handler.)+ *+ * The reason for this is that it means reads starting with an insertion work.+ * Otherwise the inserted bases are silently lost. (Try it with "samtools+ * mpileup" and you can see it has the same issue.)+ *+ * However it's probably not want most people expect.+ */+//#define START_WITH_DEL++/* --------------------------------------------------------------------------+ * The pileup code itself.+ *+ * This consists of the external pileup_loop() function, which takes a+ * sam/bam samfile_t pointer and a callback function. The callback function+ * is called once per column of aligned data (so once per base in an+ * insertion).+ *+ * Current known issues.+ * 1) zero length matches, ie 2S2S cause failures.+ * 2) Insertions at starts of sequences get included in the soft clip, so+ *    2S2I2M is treated as if it's 4S2M+ * 3) From 1 and 2 above, 1S1I2S becomes 2S2S which fails.+ */+++/*+ * Fetches the next base => the nth base at unpadded position pos. (Nth can+ * be greater than 0 if we have an insertion in this column). Do not call this+ * with pos/nth lower than the previous query, although higher is better.+ * (This allows it to be initialised at base 0.)+ *+ * Stores the result in base and also updates is_insert to indicate that+ * this sequence still has more bases in this position beyond the current+ * nth parameter.+ *+ * Returns 1 if a base was fetched+ *         0 if not (eg ran off the end of sequence)+ */+static int get_next_base(pileup_t *p, int pos, int nth, int *is_insert) {+    bam1_t *b = &p->b;+    int op = p->cigar_op;++    p->start -= p->start>0;+    if (p->first_del && op != BAM_CPAD)+        p->first_del = 0;++    *is_insert = 0;++    /* Find pos first */+    while (p->pos < pos) {+        p->nth = 0;++        if (p->cigar_len == 0) {+            if (p->cigar_ind >= b->core.n_cigar) {+                p->eof = 1;+                return 0;+            }++            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+        }++        if ((op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF)+            && p->cigar_len <= pos - p->pos) {+            p->seq_offset += p->cigar_len;+            p->pos += p->cigar_len;+            p->cigar_len = 0;+        } else {+            switch (op) {+            case BAM_CMATCH:+            case BAM_CEQUAL:+            case BAM_CDIFF:+                p->seq_offset++;+                /* Fall through */+            case BAM_CDEL:+            case BAM_CREF_SKIP:+                p->pos++;+                p->cigar_len--;+                break;++            case BAM_CINS:+            case BAM_CSOFT_CLIP:+                p->seq_offset += p->cigar_len;+                /* Fall through */+            case BAM_CPAD:+            case BAM_CHARD_CLIP:+                p->cigar_len = 0;+                break;++            default:+                fprintf(stderr, "Unhandled cigar_op %d\n", op);+                return -1;+            }+        }+    }++    /* Now at pos, find nth base */+    while (p->nth < nth) {+        if (p->cigar_len == 0) {+            if (p->cigar_ind >= b->core.n_cigar) {+                p->eof = 1;+                return 0; /* off end of seq */+            }++            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+        }++        switch (op) {+        case BAM_CMATCH:+        case BAM_CEQUAL:+        case BAM_CDIFF:+        case BAM_CSOFT_CLIP:+        case BAM_CDEL:+        case BAM_CREF_SKIP:+            goto at_nth; /* sorry, but it's fast! */++        case BAM_CINS:+            p->seq_offset++;+            /* Fall through */+        case BAM_CPAD:+            p->cigar_len--;+            p->nth++;+            break;++        case BAM_CHARD_CLIP:+            p->cigar_len = 0;+            break;++        default:+            fprintf(stderr, "Unhandled cigar_op %d\n", op);+            return -1;+        }+    }+ at_nth:++    /* Fill out base & qual fields */+    p->ref_skip = 0;+    if (p->nth < nth && op != BAM_CINS) {+        //p->base = '-';+        p->base = '*';+        p->base4 = 16;+        p->padding = 1;+        if (p->seq_offset < b->core.l_qseq)+            p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+        else+            p->qual = 0;+    } else {+        p->padding = 0;+        switch(op) {+        case BAM_CDEL:+            p->base = '*';+            p->base4 = 16;+            if (p->seq_offset+1 < b->core.l_qseq)+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+            else+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;+            break;++        case BAM_CPAD:+            //p->base = '+';+            p->base = '*';+            p->base4 = 16;+            if (p->seq_offset+1 < b->core.l_qseq)+                p->qual = (p->qual + p->b_qual[p->seq_offset+1])/2;+            else+                p->qual = (p->qual + p->b_qual[p->seq_offset])/2;+            break;++        case BAM_CREF_SKIP:+            p->base = '.';+            p->base4 = 0;+            p->qual = 0;+            /* end of fragment, but not sequence */+            p->eof = p->eof ? 2 : 3;+            p->ref_skip = 1;+            break;++        default:+            if (p->seq_offset < b->core.l_qseq) {+                p->qual = p->b_qual[p->seq_offset];+                p->base4 = p->b_seq[p->seq_offset/2] >>+                    ((~p->seq_offset&1)<<2) & 0xf;+                p->base = "NACMGRSVTWYHKDBN"[p->base4];+            } else {+                p->base = 'N';+                p->base4 = 15;+                p->qual = 0xff;+            }++            break;+        }+    }++    /* Handle moving out of N (skip) into sequence again */+    if (p->eof && p->base != '.') {+        p->start = 1;+        p->ref_skip = 1;+        p->eof = 0;+    }++    /* Starting with an indel needs a minor fudge */+    if (p->start && p->cigar_op == BAM_CDEL) {+        p->first_del = 1;+    }++    /* Check if next op is an insertion of some sort */+    if (p->cigar_len == 0) {+        if (p->cigar_ind < b->core.n_cigar) {+            op=p->cigar_op  = p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK;+            p->cigar_len = p->b_cigar[p->cigar_ind] >> BAM_CIGAR_SHIFT;+            p->cigar_ind++;+            if (op == BAM_CREF_SKIP) {+                p->eof = 3;+                p->ref_skip = 1;+            }+        } else {+            p->eof = 1;+        }+    }++    switch (op) {+    case BAM_CPAD:+    case BAM_CINS:+        *is_insert = p->cigar_len;+        break;++    case BAM_CSOFT_CLIP:+        /* Last op 'S' => eof */+        p->eof = (p->cigar_ind == b->core.n_cigar ||+                  (p->cigar_ind+1 == b->core.n_cigar &&+                   (p->b_cigar[p->cigar_ind] & BAM_CIGAR_MASK)+                   == BAM_CHARD_CLIP))+            ? 1+            : 0;+        break;++    case BAM_CHARD_CLIP:+        p->eof = 1;+        break;++    default:+        break;+    }++    return 1;+}++/*+ * Loops through a set of supplied ranges producing columns of data.+ * When found, it calls func with clientdata as a callback. Func should+ * return 0 for success and non-zero for failure. seq_init() is called+ * on each new entry before we start processing it. It should return 0 or 1+ * to indicate reject or accept status (eg to filter unmapped data).+ * If seq_init() returns -1 we abort the pileup_loop with an error.+ * seq_init may be NULL.+ *+ * Returns 0 on success+ *        -1 on failure+ */+int pileup_loop(samFile *fp,+                sam_hdr_t *h,+                int (*seq_fetch)(void *client_data,+                                 samFile *fp,+                                 sam_hdr_t *h,+                                 bam1_t *b),+                int (*seq_init)(void *client_data,+                                samFile *fp,+                                sam_hdr_t *h,+                                pileup_t *p),+                int (*seq_add)(void *client_data,+                               samFile *fp,+                               sam_hdr_t *h,+                               pileup_t *p,+                               int depth,+                               int pos,+                               int nth,+                               int is_insert),+                void *client_data) {+    int ret = -1;+    pileup_t *phead = NULL, *p, *pfree = NULL, *last, *next, *ptail = NULL;+    pileup_t *pnew = NULL;+    int is_insert, nth = 0;+    int col = 0, r;+    int last_ref = -1;++    /* FIXME: allow for start/stop boundaries rather than consuming all data */++    if (NULL == (pnew = calloc(1, sizeof(*p))))+        return -1;++    do {+        bam1_t *b;+        int pos, last_in_contig;++        //r = scram_next_seq(fp, &pnew->b);+        r = seq_fetch(client_data, fp, h, &pnew->b);+        //r = sam_read1(fp, h, &pnew->b); // FIXME: use readaln+        if (r < -1) {+            fprintf(stderr, "bam_next_seq() failure.\n");+            return -1;+        }++        b = &pnew->b;++        /* Force realloc */+        //fp->bs = NULL;+        //fp->bs_size = 0;++        //r = samread(fp, pnew->b);+        if (r >= 0) {+            if (b->core.flag & BAM_FUNMAP)+                continue;++            if (b->core.tid == -1) {+                /* Another indicator for unmapped */+                continue;+            } else if (b->core.tid == last_ref) {+                pos = b->core.pos+1;+                //printf("New seq at pos %d @ %d %s\n", pos, b->core.tid,+                //       bam_name(b));+                last_in_contig = 0;+            } else {+                //printf("New ctg at pos %d @ %d\n",b->core.pos+1,b->core.tid);+                pos = (b->core.pos > col ? b->core.pos : col)+1;

A reasonable comment on last_in_contig. It's set when we change contigs, so isn't so much that this read is the last but that the previous one was, so a change of name is OK.

As for col+1, oddly it makes no difference as when last_in_contig is set we have pos++ later on for this case and both col and pos march up until we hit the end of the contig. Hence any value of pos is sufficient providing it's above col. The simpler definition you suggest is therefore the obvious one. (I'm struggling to remember precisely why that increment is there, but it's needed. That probably means I should have commented that bit!)

HTS_POS_MAX is a bad idea though. Yes it does work, but only because it's defined to be 0x7fffffff80000000 rather than 0x7fffffffffffffff, which luckily avoids the overflow when we increment pos beyond HTS_POS_MAX.

jkbonfield

comment created time in 10 minutes

PullRequestReviewEvent

issue commentsamtools/hts-specs

Specifications for Base Calling Accuracies Across Platforms - Suggestion

The problem with base calling software is they typically output in FASTQ, which has no real concept of meta-data and headers. Consequentially the downstream processes that create SAM et al usually lose track of the upstream software that produced the data unless someone is being very conscious of data provenance and retrofitting these fields after aligning.

I absolutely agree though it would be great to track such thing and there is already provision to do this via @PG. Unfortunately realistically I don't see it happening, irrespective of whether we make specific recommendations. The methods exist already, if people are interested enough. "You can lead a horse to water, but you cannot make it drink".

There is a "Recommended Practice for the SAM Format" section of the SAM spec, which perhaps could be strengthedn with more recommendations, such as emphasising the important of data provenance via PG lines indicating both software names and versions, and my own pet peeve would be to recommend UR and/or M5 strings for SQ lines so the reference sequences used are unambiguous rather than some anonymous "chr1". Generally this section of the spec has been quite weak though as we've kept clear from more discussion oriented things.

husamia

comment created time in 3 hours

pull request commentsamtools/htslib

Progress and Cancelation for Indexing

That's a valid point about single long-running function vs many repeated calls to (eg) read, merge, etc. I was thinking more broadly, including those that manipulate samtools. (Although frankly the way tools like pysam pretend samtools is a library rather than a subprocess has never really sat well with me.)

BickfordA

comment created time in 3 hours

pull request commentsamtools/htslib

Progress and Cancelation for Indexing

Maybe we need to know what the actual problem is you're trying to solve. For example why is being able to pause computation via a callback a beneficial feature?

My worry about doing this for indexing, is what's next? General reading through a file, needing a new API with a call back there? Then for VCF too? Then in the synced reader VCF interface? You can see there is inevitable room for feature creep. You personally may only need indexing, but once precedence has been set it's inevitable other people will start to ask for a callback interface in every part of the library.

BickfordA

comment created time in 19 hours

push eventjkbonfield/io_lib

James Bonfield

commit sha 910bb8028a2b900e8aa06d860b9562bc0b11b98b

Add use of FQZComp_qual SEQ context for PB/ONT data. This is a bit messy because the cram_decode_seq code automatically decodes quality at the same time. If sequence is used as a context for quality decoding, then that code needs separating. Our solution is to twiddle the required-fields parameter to disable QS decoding, and then explicitly decode it afterwards. Note this only works if we're not using any of the compound seq/qual CRAM feature codes, but I think we'd likely have to forbid their usage anyway within the specification as we'd need len(seq) == len(qual) for the context to make any sense. An example of a local PacBio file with 100k records: Before: 187305340 bytes (145419949 QS) After: 165605250 bytes (123719859 QS)

view details

push time in 20 hours

issue commentsamtools/hts-specs

Specifications for Base Calling Accuracies Across Platforms - Suggestion

I accept that it can be hard to directly compare figures between different manufacturers, as they may not be calibrated, but the meaning of the Phred score is defined and poorly calibrated implementations isn't something the spec should be compensating for. Claims of improved accuracy are generally considered in the light of their own previous base-callers on the same technology.

There is however some room for nuance in what an error really means. For example PacBio uses qualities all the way up into the 90s. So that's 1 in 10^9 chance of an error. That's unrealistic, both in terms of real accuracy (ie poor calibration), but also there are likely to be library preparation errors that cause a denovo base mutation at a higher rate than 1 in 10^9. So what is the error really describing - the total chance of an error of DNA collection to BAM file, or just the final sequencing component post library creation? If the qualities are reasonably calibrated though I doubt that becomes a major issue.

husamia

comment created time in 20 hours

pull request commentsamtools/htslib

Progress and Cancelation for Indexing

I'm not making myself clear. I'm talking about separate UI threads from computation threads. This is often how many GUIs are written. So it doesn't matter that the function consumes all input before finishing, provided it's possible to query the progress from another thread and feed back.

That's why I say a thread safe interface to query position in the input file. (With also a thread safe way of determining the start / end positions for purposes of range queries)

BickfordA

comment created time in 20 hours

create barnchjkbonfield/io_lib

branch : fqz_seq

created branch time in 20 hours

push eventjkbonfield/htscodecs

James Bonfield

commit sha e9598eb4b30ea2bd5b5f9a635cf181bb793ec83b

Added the ability to use sequence bases as a quality context. The idea of using sequence as a context for quality value encoding goes back a long way. Certainly it was discussed during the SequenceSqueeze contest: https://encode.su/threads/1409-Compression-Competition-15-000-USD?p=28377&viewfull=1#post28377 Ultimately it didn't help much in fqzcomp because sequence is not a major player in Illumina quality value compression. The sequence (in a wider context than just 1 bp) is however highly indicative of quality in PacBio CLR and ONT data. An example of adding sequence as a context to some PB data: ./tests/fqzcomp_qual -x 0x660000000000000 ~/tmp/PB.qual Total output = 14266164 ./tests/fqzcomp_qual -x 0x660000000000862 ~/tmp/PB.qual ~/tmp/PB.seq Total output = 11903124 Unfortunately this isn't available within CRAM yet as it means added correlations between the streams. Although it's curious this saves 2363040 bytes on 26885686 values, so 0.7 bit per value. Still considerably lower than the size of compressed sequence (about 1.7bit/base for this example), but there may even be occasions where storing a squashed sequence hash in the stream can improve compression. TODO: figure out how to add to CRAM 3.1 / 4.0. - We need to cope with reversing quality. This also needs revcomp on seq. - How to detect when it's useful. Long variable read length as a proxy?

view details

James Bonfield

commit sha 24feae11af2e42a3873ed16c01ac82239631239b

Add FQZcomp GFLAG_USE_SEQ. This is an easy way to detect that sequence contexts are in use, to aid CRAM decoding. The alternative would be a different FQZ name, eg FQS, so we do it by codec variant. However the current solution is more in line with O0/O1 for rANS, albeit that's not something the decoder needs to be aware of.

view details

James Bonfield

commit sha 1a7f56922f4a850ee2349603e07e5d655bfb03fb

Improve GFLAG_USE_SEQ flag. If it's not set, the format is as before (no base bits / loc flags) for compatibility.

view details

push time in 20 hours

pull request commentsamtools/htslib

Progress and Cancelation for Indexing

I wonder if there is a general way this can be handled. It's rather specific to indexing only. What about any other task iterating over a file? It seems to me that file offset, if known, is a more workable progression marker. We don't necessarily know the start / end of the region in question, but maybe that too can be made accessable by analysing the index query results.

We could almost do this right now by using ftell type functions from a separate thread, were it not for potential thread memory access clashes. If there was a thread-safe API function that could be called though then the issue is solved.

It's a different model of operation, requiring a separate UI thread that periodically polls to perform updates, but this feels cleaner than adding callbacks to potentially more and more pieces of code.

BickfordA

comment created time in a day

issue commentsamtools/samtools

Format that cannot be usefully indexed

Please open new issues rather than simply replying to long-closed ones.

In answer to your question though, that error occurs when the program doesn't believe the input to be either BCF or VCF format. Is it actually VCF? If you zless it (or zcat file.gz|more, etc) then does it look like a VCF? My guess is not. Or if it does, possibly it's gzipped and not bgzipped and the software detected the difference. I haven't explored what error it would report then.

The htsfile command should also give some hints as to what we believe the format to be.

flaviahodel

comment created time in a day

issue commentsamtools/htslib

cram_io.h error

What compiler are you using?

Also, how are you including this? Is it simply building htslib, or are you attempting to include this (internal) file directly from your own program? (If so, it's not intended for use that way, but if we have missing dependencies in the include file then we should still fix them.)

wuyunqiq

comment created time in a day

pull request commentsamtools/htslib

Provide a definition of ssize_t when compiling with MSVC [more robust alternative to #1375]

I did read your commit message, but I thought I'd experimentally try it without an explicit include just to see how often it was a problem in practice.

I'll also add, as alluded to elsewhere (not sure which issue now), most compilers have the option to do a pre-include forcing it before the compilation. Sure enough MSVC is no exception: https://docs.microsoft.com/en-us/cpp/build/reference/fi-name-forced-include-file?view=msvc-170

What this means is that basically the choice here is between documented appropriate CFLAGS that may be useful when using MSVC against our public headers, vs modifying the headers themselves. I can see both sides of this argument and it's unclear what's most appropriate.

jmarshall

comment created time in 4 days

PR opened samtools/htslib

Improve windows build.

Specifically we create the extra files needed for MSVC linkage, and document the MSYS2/MINGW setup process.

Also added a win-dist target which attempts to produce a directory structure suitable for binary distribution. This isn't executed by default, but is a good aide-memoire and to simplify testing compatibility with things like MSVC.

Ideally we'd have a similar mechanism for all platforms to permit easy creation of binary distributions (see #533).

+59 -5

0 comment

4 changed files

pr created time in 4 days

pull request commentsamtools/htslib

Provide a definition of ssize_t when compiling with MSVC [more robust alternative to #1375]

I can confirm that this PR works, but I can also confirm that -Dssize_t=intptr_t also works with MSVC:

for i in htslib/*.h
do
    echo $i
    echo -e "#include <$i>\\nint main(void){return 0;}" > x.c
    cl.exe -Dssize_t=intptr_t x.c -I.
done

So while this PR simplifies things a bit, it's not the only viable method. We could instead simply document the lack of ssize_t and recommend apropriate compiler variables are set to compensate for lack of POSIX compliance. I'm OK with both, although I note this may accidentally lead to breakage in the future while we have to remember to add this boilerplate any time we use ssize_t in a new file.

Ie that's a policy decision I'll punt upwards. :)

jmarshall

comment created time in 4 days

create barnchjkbonfield/htslib

branch : win-build

created branch time in 5 days

pull request commentsamtools/htslib

Provide a definition of ssize_t when compiling with MSVC [more robust alternative to #1375]

I take that last statement back. It's been solong since I installed MINGW that I forgot the nuances. Thanks to @daviesrob for reminding me that both the MSYS environment matters and also the package names (pacman doesn't use the environment to work out the repository to install from).

With $MSYSTEM of MINGW64 it does produce a library that msvc can link against. I hacked up a copy of test_view with a few bits tweaked to get it to build, and compiled that natively using MSVC and linking agains the MINGW64 flavour of hts-3.dll and it runs.

So this does go back to the main thrust of this PR. However it needed a lot more than just ssize_t to get it to work. I need to work out what's required yet by my test program (ie test_view) and what was in the library headers only. hts_verbose is one source of error though. I think it needs extra declspec stuff to work in MSVC. (I just commented it out for the sake of testing linkage.)

jmarshall

comment created time in 5 days

issue commentsamtools/bcftools

bcftools mpileup "didn't recognize" sorted file

Does this mean that I can exclude the ivar step or else? Thanks

I cannot answer that question. We provide the basic tools, rather than define pipelines and workflows. If your experiment requires primer trimming then you'll need it or something comparable (eg samtools ampliconclip).

It's possible (but I don't know) that ivar may work on unsorted data, in which case the easy solution is to swap the order of the ivar trim and the samtools sort steps. If it does require coordinate sorted data and you wish to continue using ivar, then you can just run the sort step a second time.

Closing the issue though as it's working as intended.

fransdany

comment created time in 5 days

issue closedsamtools/bcftools

bcftools mpileup "didn't recognize" sorted file

Hi everyone,

I'm a novice in bioinformatics and I'm trying to make variant calling of SARS-CoV-2 by using one SRA file.

I'd like to ask about why bcftools doesn't recognized the sorted file while in fact I sorted it beforehand.

I ran this command: $ bcftools mpileup -Ou -f reference-genome-sars-cov-2.fasta SRA_clean.bam -o SRA.pileup

and got the following results:

[mpileup] 1 samples in 1 input files [mpileup] maximum number of reads per input file set to -d 250 [E::bam_plp_push] The input is not sorted (reads out of order)

Fyi, prior to this, I did the following in order:

  1. Alignment and sorting the file by Coordinate using STAR,
  2. Duplication removal with Picard,
  3. samtools sort for the output file resulted from the Picard's step and samtools index accordingly, and
  4. primer trimming with ivar, with SRA_clean.bam as the output

The mpileup still ran though but the file size (SRA.pileup) was only 2 kb while that of SRA_clean.bam was around 40 Mb.

Am I missing something or did I do the wrong steps? Any inputs will be highly appreciated.

Thank you

closed time in 5 days

fransdany

pull request commentsamtools/htslib

Provide a definition of ssize_t when compiling with MSVC [more robust alternative to #1375]

Given you're already willing to use a bunch of unix tools, headers and libraries in order to compile, what is your rationale for using the MSVC compiler specifically rather than using msys2/mingw?

It's been a good 20 years since I had to use MSVC in anger, and I'm now remembering why I expunged all I learnt about it back then. Actually I'm starting to think using mingw libs from msvc is just plain not possible! I'm sure I've seen it done before, but I can't get it to work. I'd still like to know your thinking though too.

I've been trying to get MSVC to link against a mingw prebuilt library, but it's challenging. The steps I've tried to take are:

  1. Link using gcc (mingw) -Wl,--enable-auto-import -Wl,--out-implib=hts.dll.a -o hts.dll. I theory I'd like to go direct from the .a or .so to a .def and .lib and/or .dll, but that's even more problematic from what I can see.
  2. gendef.exe to turn the .dll into a .def
  3. llvm-dlltool to convert the .def to a .lib
  4. Use cl to link against hts.lib

All of that works and produces me a binary. However that binary then fails on first malloc (I think). Eg an htsfile equivalent built that way displays usage, but cannot parse a BAM file:

$ ./htsfile.exe test/colons.bam
      0 [main] htsfile (13536) child_copy: cygheap read copy failed, 0x180348408..0x180359210, done 0, windows pid 13536, Win32 error 6
    337 [main] htsfile (13536) C:\msys64\home\jkbon\htslib\htsfile.exe: *** fatal error - ccalloc would have returned NULL

The other issue is that it would require a ghastly number of ancillary dlls:

$ ldd hts.dll
        ntdll.dll => /c/WINDOWS/SYSTEM32/ntdll.dll (0x7ffefb140000)
        KERNEL32.DLL => /c/WINDOWS/System32/KERNEL32.DLL (0x7ffef9f20000)
        KERNELBASE.dll => /c/WINDOWS/System32/KERNELBASE.dll (0x7ffef8630000)
        msvcrt.dll => /c/WINDOWS/System32/msvcrt.dll (0x7ffef91f0000)
        msys-2.0.dll => /usr/bin/msys-2.0.dll (0x180040000)
        msys-curl-4.dll => /usr/bin/msys-curl-4.dll (0x83d2420000)
        msys-bz2-1.dll => /usr/bin/msys-bz2-1.dll (0x6e61220000)
        msys-lzma-5.dll => /usr/bin/msys-lzma-5.dll (0x49fa70000)
        msys-z.dll => /usr/bin/msys-z.dll (0x522fe0000)
        msys-gssapi-3.dll => /usr/bin/msys-gssapi-3.dll (0x42f53c0000)
        msys-crypto-1.1.dll => /usr/bin/msys-crypto-1.1.dll (0x478b980000)
        msys-brotlidec-1.dll => /usr/bin/msys-brotlidec-1.dll (0x438fc10000)
        msys-psl-5.dll => /usr/bin/msys-psl-5.dll (0x4a92e0000)
        msys-idn2-0.dll => /usr/bin/msys-idn2-0.dll (0x4322bb0000)
        msys-ssh2-1.dll => /usr/bin/msys-ssh2-1.dll (0x2a8c350000)
        msys-nghttp2-14.dll => /usr/bin/msys-nghttp2-14.dll (0x4f20660000)
        msys-zstd-1.dll => /usr/bin/msys-zstd-1.dll (0x48b870000)
        msys-asn1-8.dll => /usr/bin/msys-asn1-8.dll (0x6c901d0000)
        msys-ssl-1.1.dll => /usr/bin/msys-ssl-1.1.dll (0xd8f950000)
        msys-com_err-1.dll => /usr/bin/msys-com_err-1.dll (0x6219420000)
        msys-heimntlm-0.dll => /usr/bin/msys-heimntlm-0.dll (0x31fe970000)
        msys-hcrypto-4.dll => /usr/bin/msys-hcrypto-4.dll (0x50a4450000)
        msys-heimbase-1.dll => /usr/bin/msys-heimbase-1.dll (0x33889a0000)
        msys-krb5-26.dll => /usr/bin/msys-krb5-26.dll (0x20e6ec0000)
        msys-brotlicommon-1.dll => /usr/bin/msys-brotlicommon-1.dll (0x6ff5210000)
        msys-roken-18.dll => /usr/bin/msys-roken-18.dll (0x35fe2a0000)
        msys-iconv-2.dll => /usr/bin/msys-iconv-2.dll (0x5603f0000)
        msys-unistring-2.dll => /usr/bin/msys-unistring-2.dll (0x43b990000)
        msys-intl-8.dll => /usr/bin/msys-intl-8.dll (0x430b30000)
        msys-gcc_s-seh-1.dll => /usr/bin/msys-gcc_s-seh-1.dll (0xde8160000)
        msys-wind-0.dll => /usr/bin/msys-wind-0.dll (0x7e97010000)
        msys-wind-0.dll => /usr/bin/msys-wind-0.dll (0xd60000)
        msys-hx509-5.dll => /usr/bin/msys-hx509-5.dll (0x727f610000)
        msys-sqlite3-0.dll => /usr/bin/msys-sqlite3-0.dll (0x1d798a0000)
        msys-crypt-0.dll => /usr/bin/msys-crypt-0.dll (0x43dbf0000)
        advapi32.dll => /c/WINDOWS/System32/advapi32.dll (0x7ffefae90000)
        sechost.dll => /c/WINDOWS/System32/sechost.dll (0x7ffef9e50000)
        RPCRT4.dll => /c/WINDOWS/System32/RPCRT4.dll (0x7ffefad70000)
        CRYPTBASE.DLL => /c/WINDOWS/SYSTEM32/CRYPTBASE.DLL (0x7ffef7e30000)
        bcryptPrimitives.dll => /c/WINDOWS/System32/bcryptPrimitives.dll (0x7ffef8f60000

That would be a rediculous amount of cruft to have to ship with your dll. I don't know why so many, but my money would be on libcurl. (I can't see any other reason for thins like zstd and sqlite.) We could produce a cut down libary, but that then invites all sorts of bug reports about why it can't do X, Y and Z, so I reject that idea. The alternative is to make a static .lib instead, but static libs in Unix at least are normally just the .o files and the person doing the linking then has to add all those dependencies themselves.

So I'm concluding the MSVC just isn't sufficiently compatible with mingw to be viable. That sort of renders htslib as useless, unless you're also another mingw user (which realistically means you're a unix user and it's more likely you'll be using unix direct or via WSL within windows, rendering a native windows binary copy of the library somewhat moot).

I'd love to hear otherwise though!

So the alternative is to fully support MSVC directly. That's a bit of a pain, and I'm not sure any of us really have the energy to actively maintain such a thing ourselves, not being Microsoft developers. If there was a PR showing all the necessary changes that are needed to get the package to cleanly build under MSVC then we could review what it entails, but my gut instinct here having tried it is it's too much given it'd require shipping lots of third party things such as a pthreads and getopt implementations.

I would recommend instead WIndows users learn to love WSL instead.

jmarshall

comment created time in 6 days

pull request commentsamtools/htslib

Provide a definition of ssize_t when compiling with MSVC [more robust alternative to #1375]

I think @jmarshall's changes are to the public headers, meaning when using a pre-built htslib from within MSVC. Your changes are for compiling htslib itself from MSVC. That's an entirely new level of pain.

I don't think it's workable without a lot more fiddling with the Makefile (for generating all that extra stuff in config.h) and/or your cmake config (although surely that's also well out of a limb as far as most MSVC users go - they'd want a project file), plus lots of extra work for import libraries and the like.

Given you're already willing to use a bunch of unix tools, headers and libraries in order to compile, what is your rationale for using the MSVC compiler specifically rather than using msys2/mingw?

jmarshall

comment created time in 6 days

issue commentsamtools/bcftools

bcftools mpileup "didn't recognize" sorted file

Normally the "input is not sorted" error is correct.

Given you also did samtools sort, that possibly implies that your step 4 broke the sort order again. I know nothing about ivar, but I'm assuming if it's doing primer trimming that it could adjust the left end of the reads, softclipping some bases, and therefore changing the sort order again? Try running samtools sort once more, and possibly raising this as a bug with ivar if it breaks the sort order (unless they explicitly state it does this and the documentation tells you to sort again).

fransdany

comment created time in 6 days

pull request commentsamtools/htslib

Remove static from array type

So this is indeed more than the minimal changes in this PR.

That said, if it weren't for this use of "static" the entire rest of the changes could pretty much be forced into CFLAGS (I assume MSVC has a pre-include style option to force an include file to be used before every compile, but I haven't checked that). We did discuss replacing that static with something like HTS_STATIC_ARRAY that can be defined in an OS specific way, but it was rejected, so I think this is probably dead in the water. Sorry

BickfordA

comment created time in 6 days

pull request commentsamtools/htslib

Remove static from array type

How are you driving MSVC? I tried it briefly and it blows up all over the shop:

  • lack of pthreads
  • no M_LN2, M_LN10, M_SQRT in math.h
  • no STDIN_FILENO and STDOUT_FILENO
  • no SSIZE_MAX (as well as ssize_t obviously)

I haven't got much further than those so far, but it looks like a big uphill struggle. This was the latest 2022 cl.exe compiler. Maybe there are specific command line arguments to enable POSIX support that I'm not using?

BickfordA

comment created time in 7 days

pull request commentsamtools/htslib

Provide a definition of ssize_t when compiling with MSVC [more robust alternative to #1375]

Thanks. It's a relatively small change, but I'll delay reviewing and merging for now until I've tested MSVC for real.

Basically I want to know if this is the only issue it finds. If the entire library can be linked against and public headers #included provided this one change is made, then I think it's justified. My recollections of having used an admittedly very old MSVC in the past is that it won't be that trivial. If it's one of many such problems, such as missing symbols, needing underscore added to dozens of POSIX function names, wrappers around functions which take different numbers of arguments (mkdir rings a bell), etc then it's probably not worth the hassle of supporting them all, which also means this change becomes redundant.

jmarshall

comment created time in 7 days

PR closed samtools/htslib

Remove static from array type

While this is valid in C99 I believe it was made optional in C11. While this works in gcc and clang it does not work in MSVC. I know the officially supported build platform is MSYS on windows, but MSVC now supports compiling C11 and C17 and this would be one of the few changes needed to support it.

+1 -1

5 comments

1 changed file

BickfordA

pr closed time in 7 days

pull request commentsamtools/htslib

Remove static from array type

We discussed this in our weekly meeting and decided to close it.

The decision was we only support POSIX/C99 compatible compilers for building the software. There was some discussion whether we should be a bit more flexible about non-standard compilers in third party tools that use the library (so linking against it and including public headers), but it's a moot point here as this is an internal header.

BickfordA

comment created time in 7 days

issue commentsamtools/samtools

mpileup output-QNAME

Closing as we believe this to be an issue of inappropriate duplicate removal, and workarounds have been provided.

I still don't understand why you see differences in 1.9 to me, but it's somewhat irrelevant given the current version behaves the same for both of us.

apallav

comment created time in 7 days

issue closedsamtools/samtools

mpileup output-QNAME

Are you using the latest version of samtools and HTSlib? If not, please specify.

(run samtools --version)

samtools 1.9 Using htslib 1.9 Copyright (C) 2018 Genome Research Ltd.

Please describe your environment.

  • OS (run uname -sr on Linux/Mac OS or wmic os get Caption, Version on Windows)

Linux 3.10.0-514.6.2.el7.x86_64

  • machine architecture (run uname -m on Linux/Mac OS or wmic os get OSArchitecture on Windows)

x86_64

  • compiler (run gcc --version or clang --version)

gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

Hi,

I have a question regarding the correctness of output format given out in the context of insertion/ deletions when using mpileup with --output-QNAME option.

I use following command line :

samtools-1.9/bin/samtools mpileup --output-QNAME -d 10000000 -x -A -s -B -a -Q 0 -C 0 -q 0 -r chr17:56435159-56435159 -f reference.fa bamfile.sorted.bam

Issue is two fold:

  1. The number of read Ids do not match with number of entries in the bases field ( field 5) when there are indels

for example:

for position in the commanline, I get 197576 in DP filed ( 4th) and 197576 read ids. But I encounter 198018 bases in column#5 with 442 insertion or deletion bases.

  1. the read Id entries for the insertion or deletions for the next base do not get encoded in the current position, but # bases do. This leads to mis-alignment of the readIDs with the bases in the 5 th field.

is there a way you could also incorporate the readIDs with indels in the current position as well as next position so the read depth and number of read IDs match?

Or Could you give me any pointers to edit the code to make the correction? I am vaguely familiar with C++.

Or do you have any other suggestions for me.

closed time in 7 days

apallav
more