Reporting of alignments

Nicole_G

New member
Hi Benjamin,

we've been playing around with the parameters that can be used to limit the reported alignments. We can't write all alignments to file due to hard drive limitations. So at the moment we are using thek or the top parameter. However, we do run into problems when we have species with a lot of sequenced strains (e.g. E. coli). If we set k to 25 but a read maps to all the strains, that read would only be assigned to E. coli regardless of any other potential hits in other species. That leads to some misassigned reads.

Our idea is to have a parameter that returns all hits until e.g. 10 different species are detected. It would basically mean to report a read only once per species even if it hits 100 strains within that species. Does that make sense? That way we would get rid of the strain bias in nr and the LCA can assign the read to the correct taxonomic level.

Cheers,
Nicole
 

Benjamin Buchfink

Administrator
Staff member
Hi Nicole,
so it would be easy to have an option to report at most 1 read per species, but I should know if you want to use that together with --top such that still only the the hits within 10% of the best hit score are reported. Alternatively you can stop after e.g. 10 species but that way would be more computationally expensive so I'd like to understand what would be better for you.
 

Nicole_G

New member
Hi Benjamin,

we would definitely need a cutoff of some kind so that only alignmentsthat actually make sense are reported. At the moment we use quite a stringent e-value cutoff for that and then use the top argument in the LCA step. That way you can change that parameter without redoing the mapping step.

Does that answer your question?

Cheers,
Nicole
 

Benjamin Buchfink

Administrator
Staff member
Ok, so I will implement it such that you can get all alignments and apply the top filter afterwards. This is not a difficult modification so hopefully I'll be able to do it in the next days.
 

Benjamin Buchfink

Administrator
Staff member
The latest github commit now has the option --taxon-k to limit the number of hits per species. So for example with --taxon-k 1 and -k 25 you would get hits for at most 25 species. If you use -k it will be substantially slower than normal however, so I'd recommend using it with --top instead. (you could use the highest value for --top that you'd consider and filter it for lower values later)

I did not test this very extensively so let me know if it works for you.
 
Top