Heng Li (lh3), DFCI & Harvard University, Boston, MA, USA, http://liheng.org

attractivechaos/klib 3462

A standalone and lightweight C library

attractivechaos/kann 581

A lightweight C library for artificial neural networks

lh3/bioawk 489

BWK awk modified for biological data

attractivechaos/plb 258

Programming language benchmarks

chhylp123/hifiasm 221

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads

attractivechaos/k8 144

k8 Javascript shell

lh3/biofast 143

Benchmarking programming languages/implementations for common tasks in Bioinformatics

GFA-spec/GFA-spec 128

Graphical Fragment Assembly (GFA) Format Specification

lh3/bedtk 109

A simple toolset for BED files (warning: CLI may change before bedtk becomes stable)

haowenz/chromap 90

Fast alignment and preprocessing of chromatin profiles

issue comment xfengnefx/hifiasm-meta

hifiasm-meta produces redundant assemblies?

Hifiasm-meta sometimes may completely separate strains with a couple of percent divergence. Many Illumina reads would be multiply mapped to such strains.

fplaza

comment created time in 18 hours

push event lh3/HPP_Year1_Assemblies

Heng Li

commit sha 4d071a5abf8a586feb71cedf1f735c5a019da72a

fixed a typo in link

push time in 2 days

create branch lh3/HPP_Year1_Assemblies

branch: agc

created branch time in 2 days

fork lh3/HPP_Year1_Assemblies

Assemblies from HPP Year 1 production

fork in 2 days

push event lh3/HPP_Year1_Assemblies

Heng Li

commit sha 63e3f5945f50ecd5a2a0fb08c92652909b701499

added links and instructions about AGC

push time in 2 days

fork lh3/HPP_Year1_Assemblies

Assemblies from HPP Year 1 production

fork in 4 days

issue comment chhylp123/hifiasm

Too many contigs and larger genome size compared with ONT sequencing

The assembly looks ok. No need to run purge_dups.

Aannaw

comment created time in 4 days

issue comment chhylp123/hifiasm

The kmer graph distribution is strange

This is probably not HiFi data.

user6i

comment created time in 4 days

issue comment lh3/minimap2

minimap2: align.c:125: mm_fix_cigar: Assertion `qoff == r->qe - r->qs && toff == r->re - r->rs' failed.

Which version are you using? Could you provide me with the reference and query sequences? Thanks.

jia-wu-feng

comment created time in 5 days

issue comment lh3/minimap2

[E::sam_parse1] query name too long

Then you can generate SAM first and then see if there are long query names. The error report is from samtools.

williamrowell

comment created time in 5 days

issue closed lh3/minimap2

[E::sam_parse1] query name too long

I'm attempting to align one haplotype of a hifiasm assembly (v0.15) to reference. This worked fine for haplotype 1, but there's a query name too long error for haplotype 2. The FASTA files were generated by gfatools gfa2fasta (v0.4), followed by compression with bgzip.

> minimap2 -t 20 -L --secondary=no --eqx -ax asm5 -R '@RG\tID:NA19240_hap2\tSM:NA19240' reference/human_GRCh38_no_alt_analysis_set.fasta <(samples/NA19240/hifiasm/NA19240.asm.bp.hap2.p_ctg.fasta.gz) | samtools sort -@ 3 -m 8G > samples/NA19240/hifiasm/NA19240.hap2.GRCh38.bam

[M::mm_idx_gen::51.372*1.47] collected minimizers
[M::mm_idx_gen::59.833*2.23] sorted minimizers
[M::main::59.833*2.23] loaded/built the index for 195 target sequence(s)
[M::mm_mapopt_update::68.990*2.06] mid_occ = 136
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 195
[M::mm_idx_stat::73.779*1.99] distinct minimizers: 214664461 (92.23% are singletons); average occurrences: 1.372; average spacing: 10.529; total length: 3099922541
[E::sam_parse1] query name too long

I started with minimap2 v2.17, and this actually worked fine, but I hit a SegFault attempting to align another sample so I decided to update to the most recent minimap2.

> minimap2 --version
2.24-r1122

The query names are all 12 characters long. I've tried uncompressing the input first.

> zgrep '>' samples/NA19240/hifiasm/NA19240.asm.bp.hap2.p_ctg.fasta.gz | awk '{print length($1)}' | datamash min 1 max 1 mean 1 median 1
12      12      12      12

closed time in 5 days

williamrowell

issue comment lh3/minimap2

[E::sam_parse1] query name too long

> <(samples/NA19240/hifiasm/NA19240.asm.bp.hap2.p_ctg.fasta.gz)

Remove <().

williamrowell

comment created time in 5 days

issue comment lh3/minimap2

how is minimap2 selecting output hit between equally good hits (mapq=0)?

Random with the random seed determined by the query name.
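
As a rough illustration of that idea (not minimap2's actual C code), the sketch below seeds a random generator from the query name and picks one of several equally good hits. The function name `pick_among_ties`, the SHA-256 hash, and the placeholder target names are assumptions made for the example; minimap2 uses its own hashing internally. The point is only that the choice looks random across reads but is reproducible for the same read name.

```python
import hashlib
import random
from typing import Sequence


def pick_among_ties(query_name: str, candidates: Sequence[str]) -> str:
    """Pick one of several equally good hits, reproducibly for a given query name.

    Hypothetical helper: illustrates name-seeded tie-breaking only.
    """
    if not candidates:
        raise ValueError("need at least one candidate hit")
    # Derive a stable integer seed from the query name. Python's built-in hash()
    # is salted per process for strings, so a digest is used instead.
    digest = hashlib.sha256(query_name.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    return rng.choice(list(candidates))


# Same read name and same tied targets: the same target is chosen every time.
tied_targets = ["CP000721.1", "hypothetical_target_2", "hypothetical_target_3"]
first = pick_among_ties("NZ_CP006777.1-10238/1", tied_targets)
second = pick_among_ties("NZ_CP006777.1-10238/1", tied_targets)
print(first == second)
```

Seeding from the name rather than from the clock means that rerunning the same input gives the same primary hit, even though the choice among ties is arbitrary.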

katrinakalantar

comment created time in 6 days

issue closed lh3/minimap2

how is minimap2 selecting output hit between equally good hits (mapq=0)?

Hello,

Thanks for maintaining this great tool! I'm running into a question that seems similar to one posted previously here (https://github.com/lh3/minimap2/issues/370) and hoping to get a bit more clarity.

When running minimap2, it seems that it can output secondary mappings that are of lower alignment quality (via --secondary=yes). However, it seems that exact matches (multi-mappers, identified by mapq = 0) are not output when using the --secondary=yes option. Is this correct? If so, how is minimap2 selecting the output target when there are multiple equivalent matches? Is it random?

To be specific, when I see a line in the output .paf file that looks like this: NZ_CP006777.1-10238/1 150 0 150 - CP000721.1 6000632 2389307 2389457 150 150 0 NM:i:0;ms:i:300;AS:i:300;nn:i:0;tp:A:P;cm:i:8;s1:i:124;s2:i:124;de:f:0;rl:i:0;cg:Z:150M

...the read mapped to CP000721.1, but was a multi-mapper (mapq=0). How did minimap2 select CP000721.1 amongst the other possible, equally-good, matches?

closed time in 6 days

katrinakalantar

issue comment lh3/minimap2

Question: Different AS for different preset?

These are Smith-Waterman scores.
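
For context, an alignment score of this kind is a sum of per-column scores, so a perfect, gap-free alignment scores read length times the match score. The sketch below assumes a match score of 1 for map-hifi and 2 for map-pb; these values are assumptions chosen to be consistent with the AS values of 10000 and 20000 reported in the question for a 10 kb read. The function name and the mismatch penalty are illustrative, and gap costs are left out entirely.

```python
def ungapped_alignment_score(matches: int, mismatches: int,
                             match_score: int, mismatch_penalty: int) -> int:
    """Score of a gap-free alignment: reward each match, penalize each mismatch."""
    return matches * match_score - mismatches * mismatch_penalty


# Assumed match scores per preset (not stated in this thread).
ASSUMED_MATCH_SCORE = {"map-hifi": 1, "map-pb": 2}
ASSUMED_MISMATCH_PENALTY = 4  # illustrative; has no effect when mismatches == 0

READ_LENGTH = 10_000  # a 10 kb read aligned end to end with no differences
for preset, match_score in ASSUMED_MATCH_SCORE.items():
    score = ungapped_alignment_score(READ_LENGTH, 0, match_score,
                                     ASSUMED_MISMATCH_PENALTY)
    print(preset, score)
```

Under these assumptions AS depends on the scoring parameters as well as on read length and NM, which is why two presets can report different scores for the same perfect alignment.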

kevfengler227

comment created time in 7 days

issue closed lh3/minimap2

Question: Different AS for different preset?

Is it expected behavior for different presets to give different alignment scores? For example, a 10 kb HiFi read with perfect alignment will have an AS of 10000 with map-hifi and 20000 with map-pb. Shouldn't these be the same because it is based on read length and NM?

Thanks, KF

closed time in 7 days

kevfengler227

push event lh3/minigraph

Heng Li

commit sha a8f436884bca2bfdc6c0f870beb13642757646df

r434: added --cap-kalloc which defaults to 1G

push time in 8 days

issue comment lh3/minimap2

Memory leak when using Python and threads

> Anyway, a ThreadBuffer only grows and never shrinks

Actually a ThreadBuffer may shrink:

https://github.com/lh3/minimap2/blob/06fedaadd0f88074bd68527e6e34634ffe21273e/map.c#L367-L378

The default opt->cap_kalloc is 1GB in v2.24.

cjw85

comment created time in 12 days

issue comment lh3/minimap2

Memory leak when using Python and threads

Sorry, I don't use Python threads and I don't know how Python threads handle global and thread-local memory. Anyway, a ThreadBuffer only grows and never shrinks, until it gets destroyed. It is intended to be used throughout the life span of a thread. Minimap2 allocates one ThreadBuffer inside a newly spawned thread and uses the same buffer for the multiple reads the thread processes. Minimap2 deallocates the buffer towards the end of the thread.

cjw85

comment created time in 12 days

issue comment chhylp123/hifiasm

Too many contigs and larger genome size compared with ONT sequencing

In addition, 1) use p_ctg, not p_utg and 2) nextdenovo is known to produce smaller-than-expected assemblies.

Aannaw

comment created time in 12 days

issue closed lh3/bwa

bwa mem - fastq sequence order

Dear bwa team,

I observed a few differences in the number of variant calls when I compared calls generated from the original FASTQ with calls from a FASTQ converted from the bwa-generated BAM.

I used the link below to convert the bwa-aligned BAM back to a FASTQ file. https://gist.github.com/darencard/72ddd9e6c08aaff5ff64ca512a04a6dd

I ran the alignment again using bwa mem and then called variants using GATK4. When I compared variant calls generated from the BAM (from the original FASTQ) and the BAM (from the converted FASTQ), I observed a few differences in the number of variants. I found that this difference is due to the order of sequence identifiers in the FASTQ files. I then sorted the sequence IDs in the converted FASTQ into the same order as the original FASTQ using the BBMap filterbyname.sh function and observed 100% overlap in variant calls between the original and converted FASTQ.

Could you please let me know whether this difference is expected when sequence identifiers in the FASTQ are in a different order?

Thanks in advance, Fazulur Rehaman

closed time in 22 days

Fazulur

issue comment lh3/bwa

bwa mem - fastq sequence order

Yes, expected. Also, that gist is suboptimal. You should use samtools collate instead.

Fazulur

comment created time in 22 days

issue comment lh3/minimap2

Missing 1 Mbp of alignment on Chromosome 16 between T2T-CHM13 v1.0 and GRCh38

Sorry, I meant to say chr3, towards the end of chr3. I haven't checked that chr8 inversion yet.

mrvollger

comment created time in 23 days

issue comment lh3/minimap2

Missing 1 Mbp of alignment on Chromosome 16 between T2T-CHM13 v1.0 and GRCh38

@mrvollger Although v2.23 fixed the chr16 inversion, I noticed minimap2 missed another inversion on chr8. The new v2.24 release is using a more robust solution and has this fixed.

mrvollger

comment created time in 23 days

push event lh3/minimap2

Heng Li

commit sha 06fedaadd0f88074bd68527e6e34634ffe21273e

typo on simde

push time in 23 days

release lh3/minimap2

v2.24

released time in 23 days

created tag lh3/minimap2

tag v2.24

A versatile pairwise aligner for genomic and spliced nucleotide sequences

created time in 23 days

push event lh3/minimap2

Heng Li

commit sha fe35e679e95d936698e9e937acc48983f16253d6

Release minimap2-2.24 (r1122)

push time in 23 days

issue comment lh3/minimap2

Cookbook recipe for ONT long-read overlaps generates errors

Thanks!

plattsad

comment created time in 23 days
