faq

Differences

This shows you the differences between two versions of the page.

faq [2018/12/19 02:40] flack
faq [2018/12/21 15:00] flack [Can I get a Mcule database SDF file in smaller chunks?]
Line 18: Line 18:
 To split a smi.gz / smiles.gz file into multiple **gzip compressed chunks** use a command like this:
 <code>
-gzip -dc your.smi.gz | split --verbose --lines=<size> --numeric-suffixes --suffix-length=<suffix_length> --additional-suffix='.smi' --filter='gzip -9> $FILE' - your__
+gzip -dc your.smi.gz | split --verbose --lines=<size> --numeric-suffixes --suffix-length=<suffix_length> --additional-suffix='.smi.gz' --filter='gzip -9> $FILE' - your__
 </code>
  
 For example to split the Mcule Purchasable (Full) smi.gz file into 1M **gzip compressed chunks** use:
 <code>
-gzip -dc mcule_purchasable_full_180817.smi.gz | split --verbose --lines=1000000 --numeric-suffixes --suffix-length=10 --additional-suffix='.smi' --filter='gzip -9> $FILE' - mcule_purchasable_full_180817__
+gzip -dc mcule_purchasable_full_180817.smi.gz | split --verbose --lines=1000000 --numeric-suffixes --suffix-length=10 --additional-suffix='.smi.gz' --filter='gzip -9> $FILE' - mcule_purchasable_full_180817__
 </code>
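
The corrected suffix can be checked on a small scale before running it against the full database. The following is a minimal sketch, assuming GNU coreutils `split` (the `--filter` and `--additional-suffix` options are GNU extensions); the file name `demo.smi.gz`, the prefix `demo__` and the five sample lines are hypothetical and exist only for illustration.

```shell
# Hypothetical demo: build a tiny 5-line SMILES file, then split it into
# gzip-compressed chunks of 2 lines each (assumes GNU coreutils split).
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'C m1\nCC m2\nCCC m3\nCCCC m4\nCCCCC m5\n' | gzip -9 > demo.smi.gz

gzip -dc demo.smi.gz | split --lines=2 --numeric-suffixes --suffix-length=3 \
    --additional-suffix='.smi.gz' --filter='gzip -9 > $FILE' - demo__

# Each chunk should hold at most 2 lines, and the totals should add up to 5.
total=0
for chunk in demo__*.smi.gz; do
    n=$(gzip -dc "$chunk" | wc -l | tr -d ' ')
    echo "$chunk: $n lines"
    total=$((total + n))
done
echo "total: $total"
```

Because `$FILE` already carries the additional suffix, naming it `.smi.gz` makes the chunk names match their gzip-compressed content.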
  
Line 33: Line 33:
 If you have access to a unix based system and awk you can use the below commands to split large, gzipped SDF files into smaller chunks.
  
-To split an sdf.gz file into multiple uncompressed chunks, use a command like this:
+To split an sdf.gz file into multiple **uncompressed chunks**, use a command like this:
 <code>
 gzip -dc your.sdf.gz | awk -v name=<chunk_name> -v ext=sdf -v size=<size> 'BEGIN{size=size}(NR==1){file1=sprintf("%s%0.10d.%s",name,counter,ext)}{print $0 > file1}{if($0=="$$$$"){file2=sprintf("%s%0.10d.%s",name,int(++counter/size),ext);{if(file1!=file2){close(file1);file1=file2}}}}'
Line 40: Line 40:
 Just replace your.sdf.gz with your filename, <chunk_name> with the name of the files you want and <size> with the intended chunk size.
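
For readers who find the one-liner hard to follow, the same record-counting idea can be written out as a commented shell function. This is a sketch and not the wiki's own command: the function name `split_sdf`, the prefix `demo__` and the three fake records are hypothetical, and the records are placeholders rather than valid molfiles.

```shell
# Hypothetical helper: split SDF records read from stdin into uncompressed
# chunk files named <name><10-digit index>.sdf, <size> records per chunk.
split_sdf() {
    awk -v name="$1" -v size="$2" '
        # Open the first chunk when the first line arrives.
        NR == 1 { out = sprintf("%s%010d.sdf", name, 0) }
        { print > out }
        # A "$$$$" line ends a record; switch files after every <size> records.
        $0 == "$$$$" {
            next_out = sprintf("%s%010d.sdf", name, int(++count / size))
            if (next_out != out) { close(out); out = next_out }
        }
    '
}

# Demo on three fake records (not valid molfiles), two records per chunk.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'mol1\n$$$$\nmol2\n$$$$\nmol3\n$$$$\n' | split_sdf demo__ 2

# Count the record terminators in each chunk.
grep -c -x -F '$$$$' demo__*.sdf
```

With real data the input would come from `gzip -dc your.sdf.gz` instead of `printf`. Closing each finished chunk keeps the number of open files constant, however many chunks are written.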
  
-For example to split the Mcule Purchasable (Full) sdf.gz file into 1M uncompressed chunks use:
+For example to split the Mcule Purchasable (Full) sdf.gz file into 1M **uncompressed chunks** use:
 <code>
 gzip -dc mcule_purchasable_full_180817.sdf.gz | awk -v name=mcule_purchasable_full_180817__ -v ext=sdf -v size=1000000 'BEGIN{size=size}(NR==1){file1=sprintf("%s%0.10d.%s",name,counter,ext)}{print $0 > file1}{if($0=="$$$$"){file2=sprintf("%s%0.10d.%s",name,int(++counter/size),ext);{if(file1!=file2){close(file1);file1=file2}}}}'
Line 46: Line 46:
  
  
-To split an sdf.gz file into multiple gzip compressed chunks, use a command like this:
+To split an sdf.gz file into multiple **gzip compressed chunks**, use a command like this:
 <code>
 gzip -dc your.sdf.gz | awk -v name=<chunk_name> -v ext=sdf.gz -v size=<size> 'BEGIN{size=size}(NR==1){file1=sprintf("%s%0.10d.%s",name,counter,ext)}{print $0 | "gzip -9 > "file1""}{if($0=="$$$$"){file2=sprintf("%s%0.10d.%s",name,int(++counter/size),ext);{if(file1!=file2){close("gzip -9 > "file1"");file1=file2}}}}'
Line 53: Line 53:
 Just replace your.sdf.gz with your filename, <chunk_name> with the name of the files you want and <size> with the intended chunk size.
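
After splitting into compressed chunks, it can be useful to confirm that every chunk holds the expected number of records. The sketch below counts `$$$$` terminator lines per chunk; the function name `count_sdf_records` and the two hand-made demo chunks are hypothetical and stand in for the output of the split command.

```shell
# Hypothetical check: print each compressed chunk with its record count,
# where a record is terminated by a line consisting only of "$$$$".
count_sdf_records() {
    for chunk in "$@"; do
        printf '%s\t%s\n' "$chunk" "$(gzip -dc "$chunk" | grep -c -x -F '$$$$')"
    done
}

# Demo: two hand-made chunks holding 2 and 1 fake records respectively.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'mol1\n$$$$\nmol2\n$$$$\n' | gzip -9 > demo__0000000000.sdf.gz
printf 'mol3\n$$$$\n' | gzip -9 > demo__0000000001.sdf.gz
count_sdf_records demo__*.sdf.gz
```

Every chunk except possibly the last should report exactly `<size>` records.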
  
-For example to split the Mcule Purchasable (Full) sdf.gz file into 1M gzip compressed chunks use:
+For example to split the Mcule Purchasable (Full) sdf.gz file into 1M **gzip compressed chunks** use:
 <code>
 gzip -dc mcule_purchasable_full_180817.sdf.gz | awk -v name=mcule_purchasable_full_180817__ -v ext=sdf.gz -v size=1000000 'BEGIN{size=size}(NR==1){file1=sprintf("%s%0.10d.%s",name,counter,ext)}{print $0 | "gzip -9 > "file1""}{if($0=="$$$$"){file2=sprintf("%s%0.10d.%s",name,int(++counter/size),ext);{if(file1!=file2){close("gzip -9 > "file1"");file1=file2}}}}'
faq.txt · Last modified: 2018/12/21 15:00 by flack