Trennen is German for separate
- Use machine learning to find out if we can do a good job predicting the angle of optical activity for a given enantiomer.
- Determine factors which affect optical activity.
- [ ]
- Use machine learning to predict the EE% of a reaction
- Determine factors of enantioselectivity.
- [ ]
Input: solvent, reactants, and catalysts as positions in 3d space
Output: EE%
- First, I downloaded the QM9 dataset.
- This includes about 130K different organic molecules with their XYZ coordinates, SMILES, and InChI.
- That's cool.
- But some of these organic molecules in their SMILES form did not have any stereochemistry.
- We need molecules with stereochemistry because all planar molecules (2D molecules) can be flipped in 3D space to undo a reflection.
- Generally speaking, an n-dimensional figure is achiral in n+1 dimensions (I postulate).
- However, being achiral in n+1 dimensions does not by itself make a figure chiral in n dimensions.
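The planar case can be checked numerically: mirroring a flat shape across the y-axis gives the same points as rotating it 180° about the y-axis in 3D. A minimal sketch, using a made-up scalene triangle rather than a real molecule:

```python
import math

def reflect_2d(points):
    """Mirror planar points across the y-axis: (x, y) -> (-x, y)."""
    return [(-x, y) for x, y in points]

def rotate_about_y(points, angle):
    """Rotate 3D points about the y-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]

# A scalene triangle has no mirror symmetry, so it is chiral within the plane.
triangle = [(0.0, 0.0), (3.0, 0.0), (0.0, 1.0)]

mirrored = reflect_2d(triangle)
flipped = rotate_about_y([(x, y, 0.0) for x, y in triangle], math.pi)

# The 3D rotation lands exactly on the 2D mirror image (z stays ~0).
same = all(
    math.isclose(mx, fx, abs_tol=1e-9)
    and math.isclose(my, fy, abs_tol=1e-9)
    and math.isclose(fz, 0.0, abs_tol=1e-9)
    for (mx, my), (fx, fy, fz) in zip(mirrored, flipped)
)
print(same)
```

This only illustrates the n = 2 case; it is not a proof of the general postulate.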
- We need stereochemistry for optical activity, so I simply moved all files with stereochemistry into a new directory as follows: `find . -type f -exec grep -F '@' {} \; -exec mv -t files_with_stereochemistry/ {} +`
- This is because the SMILES format uses the `@` symbol to denote stereochemistry.
- OK.
- That's cool.
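The same check can be sketched in Python on SMILES strings directly. This is only an illustration with example strings, not the command used above, and it assumes (like the `grep`) that an `@` anywhere in the string is enough; double-bond stereo written with `/` and `\` is not covered.

```python
def has_stereo_marker(smiles: str) -> bool:
    """True if the SMILES string carries an '@' or '@@' tetrahedral tag."""
    return "@" in smiles

# Example strings: alanine with a tagged center, ethanol, and tagged 2-butanol.
examples = ["C[C@H](N)C(=O)O", "CCO", "C[C@@H](O)CC"]
with_stereo = [s for s in examples if has_stereo_marker(s)]
print(with_stereo)
```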
- But we don't know if all of these molecules have chiral centers.
- Remember, our goal is to filter all these molecules to just be enantiomers.
- To find the chiral centers and filter them into a new directory, we can use RDKit, a Python cheminformatics library.
- So I wrote a simple Python script titled "find_files_with_chiral_centers.py".
- After executing it (took about 20 minutes), we have a new directory with ~97K molecules which contain chiral centers.
- OK.
- That's cool.
- But we need only molecules which are chiral.
- As a sidenote, I do know that diastereomers are sometimes optically active, but for the purpose of this project we are considering only enantiomers.
- At the time of doing this project, I was learning group theory and how I might possibly determine if a molecule is chiral or not.
- I had checked many places online but I couldn't really find a definitive explanation.
- However, I had received a reply from ChirBase allowing me to sample their database which contained about 13K chiral compounds (the full database contains over 300K compounds).
- So I ran with the idea and began by exporting the database as an Excel worksheet with the SMILES strings (thankfully the isomeric SMILES was included).
- I then removed the excess junk and created a single txt file named "CHIRBASE_SEPARATION.txt" (not included), holding all the SMILES strings, in a directory called chirbase_chiral_molecules.
- However, this list contained duplicate SMILES because the same compound appeared with different information depending on the researcher.
- Therefore, a simple script named "remove_duplicate_entries.py" was written to remove all the duplicates.
- After running it, the new file was named "CHIRBASE_SEPARATION_UPDATED.txt".
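A minimal sketch of that kind of de-duplication, keeping the first occurrence and the original order. This is not the contents of remove_duplicate_entries.py; the function name and sample strings are made up, and it compares SMILES as plain text, so two different spellings of the same molecule would both survive.

```python
def remove_duplicates(lines):
    """Drop blank and repeated entries, keeping first occurrences in order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

smiles = ["C[C@H](N)C(=O)O", "CCO", "C[C@H](N)C(=O)O\n", "", "CCO"]
deduplicated = remove_duplicates(smiles)
print(deduplicated)
```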
- Now, the final step in setting up the data was to retrieve the optical activity for each of the chiral compounds.
- To do this, I broke it up in two steps.
- First, we would retrieve and determine if the compound exists on Chemsp***r.
- Since Chemsp***r redirects their links immediately to the compound if it is found, a simple script named "get_chemsp***r_links.py" was written to automatically determine the redirect link and save it to a file.
- This script took about 12 hours to execute since there was a 1 second delay included in the script to prevent overloading the chemsp***r servers as well as to prevent the chemsp***r overlords banning my IP :)
- OK.
- That's cool.
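The throttling idea can be sketched without touching the network. Everything here is hypothetical: `lookup` stands in for whatever resolves a query to its redirect URL (returning `None` when there is no redirect), and the `sleep` parameter exists only so the demo can record pauses instead of actually waiting.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

def resolve_all(
    queries: Iterable[str],
    lookup: Callable[[str], Optional[str]],
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, Optional[str]]]:
    """Resolve each query in turn, pausing `delay_s` seconds between requests."""
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            sleep(delay_s)  # be polite to the server
        results.append((query, lookup(query)))
    return results

# Offline demo: a fake lookup table, and a list that records the pauses.
fake_site = {"CCO": "https://example.org/compound/1"}
pauses = []
links = resolve_all(["CCO", "C"], fake_site.get, sleep=pauses.append)
print(links, pauses)
```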
- We have a list of all the links to chiral compounds with their respective chiral molecules.
- By the way, the actual links file was cleaned up to remove the "no redirects" and links which did not get sent to an actual molecule name.
- In total, we have 6K chiral compounds which have a valid chemsp***r link.
- For those interested, the commands in vim were `:%s/^no redirect\n//g` followed by `:%s/^.*@.*$\n//g` followed by `:%s/^.*C\/.*\n//g` followed by `:%s/^.*C(.*\n//g` followed by `:%s/^.*C=.*\n//g` followed by `:%s/^.*=O.*\n//g` followed by `:%s/?rid.*$//g` followed by `:%s/b'//g`.
- WAIT
- I just shot myself in the foot.
- I executed all the find and replace and removed all the "no redirects" but now I don't know the smiles format for the structures.
- RIP.
- I guess I have to run this again to determine exact smiles structures . . .
- Next time, I should just leave a blank line or a line with a specific character (such as #) to specify that it is a placeholder for an invalid link.
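Here is a small sketch of the placeholder idea, assuming one link per line in the same order as the SMILES list. The function names and the `#` marker are illustrative choices, not what the original script did.

```python
import os
import tempfile

PLACEHOLDER = "#"  # marks a SMILES entry that has no valid link

def write_links(links, path):
    """Write one line per SMILES entry; a missing link becomes the placeholder."""
    with open(path, "w", encoding="utf-8") as fh:
        for link in links:
            fh.write((link if link else PLACEHOLDER) + "\n")

def read_pairs(smiles, path):
    """Pair SMILES with links by line number, skipping placeholder lines."""
    with open(path, encoding="utf-8") as fh:
        links = [line.rstrip("\n") for line in fh]
    return [(s, link) for s, link in zip(smiles, links) if link != PLACEHOLDER]

smiles = ["CCO", "C[C@H](N)C(=O)O", "ClCCl"]
links = [None, "https://example.org/compound/2", None]  # made-up URL

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "links.txt")
    write_links(links, path)
    pairs = read_pairs(smiles, path)
print(pairs)
```

Because the line count never changes, the links file can be cleaned up later without losing track of which SMILES each line belongs to.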
- However, we can run this again in conjunction with part 2 which is to actually retrieve the optical rotation direction.
- This can simply be done by extracting the page title or a synonym from the respective chemsp***r link, since the direction is included in the molecule's name.
- To do this, get_chemspider_link.py was completely rewritten.
- The end result should be that CHIRBASE_SEPARATION_LINKS.txt is in sync with CHIRBASE_SEPARATION_UPDATED.txt, such that the SMILES and links correspond if a valid molecule exists.
- Additionally, the CHIRBASE_SEPARATION_DIRECTION.txt file should include a list of arrays with the respective SMILES, URL, and optical direction.
- As I am working on this file, I just realized that we don't even need the chirbase database. We just need a large set of molecules, and we can simply check whether each title contains the (-) or (+) indicator to determine its chirality.
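A hedged sketch of that title check using a regular expression. The pattern and the function name are my own; it only recognises the ASCII forms `(+)` and `(-)`, so a title using a Unicode minus sign or a racemic `(+/-)` prefix returns `None`.

```python
import re

# Matches a standalone "(+)" or "(-)" descriptor, as in "(+)-Limonene".
# Charge-like fragments such as "(CH+)" do not match, because nothing may
# sit between the opening parenthesis and the sign.
SIGN_RE = re.compile(r"\(([+-])\)")

def rotation_sign(name):
    """Return '+', '-', or None if the name has no (+)/(-) descriptor."""
    match = SIGN_RE.search(name)
    return match.group(1) if match else None

names = ["(+)-Limonene", "(-)-Menthol", "Dichloromethane", "(CH+)", "(+/-)-Menthol"]
signs = [rotation_sign(n) for n in names]
print(signs)
```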
- Since this script checks the redirect link as well as retrieving the link, the script took about [blank] hours to execute.
- After all of this, we finally had a list of chiral molecules with their optical rotation direction.
- Depending on how much data we are able to extract from these molecules, we have two options.
- If we have a lot of data (relatively), we will begin writing our machine learning model.
- If we do not have a lot of data, we should ideally find a larger dataset with more organic molecules and run all of the above steps again.
- In either case, we have one more necessary step for our data science part.
- We need to artificially generate the stereoisomer of each molecule if it is not present.
- And after considering this, I believe it would make the most sense to generate these molecules before running the above script and check them on chemspider.
- This is because I don't see a trivial way of generating the chiral enantiomers with multiple chiral centers.
- Furthermore, it appears that some data in the "chirbase" database does not contain only chiral molecules.
- For example, dichloromethane appears in the data . . .
- Also, it appears that the chirbase database is too small.
- So . . . we are going to transition our work to the
- OK.
- So first I moved all the files in files_With_chiral_centers into a subdirectory named files.
- Then I copied the files/ directory into files_with_optical_rotation directory.
- Then I began working on the script in the directory.
- It seems that it would be easier to first create a giant file with the list of SMILES as well as their stereoisomers to be searched in chemsp***r.
- OK.
- LET's DO THIS
- BRUH
- ok
- Just like undo the past 50 lines or something.
- I was reading something online from Jun 2000!
- And they said you could just simply reflect the mol file across the origin to get the enantiomer.
- So yeah.
- From that, we can determine the chirality of a molecule.
- So I basically wrote two functions and made a pull request to RDKit.
- So yeah lol.
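The reflection trick can be sketched in plain Python: negate every coordinate, then confirm that the handedness (the sign of a signed volume) has flipped. This is not the code from the RDKit pull request, and the coordinates below are made up for illustration.

```python
def invert(atoms):
    """Invert every atom through the origin: (x, y, z) -> (-x, -y, -z)."""
    return [(sym, -x, -y, -z) for sym, x, y, z in atoms]

def signed_volume(a, b, c, d):
    """Six times the signed volume of tetrahedron abcd; the sign encodes handedness."""
    u = [a[i] - d[i] for i in range(3)]
    v = [b[i] - d[i] for i in range(3)]
    w = [c[i] - d[i] for i in range(3)]
    return (
        u[0] * (v[1] * w[2] - v[2] * w[1])
        - u[1] * (v[0] * w[2] - v[2] * w[0])
        + u[2] * (v[0] * w[1] - v[1] * w[0])
    )

# Hypothetical coordinates: four different substituents around a center.
mol = [("F", 1.0, 0.0, 0.0), ("Cl", 0.0, 1.0, 0.0),
       ("Br", 0.0, 0.0, 1.0), ("H", -1.0, -1.0, -1.0)]
mirror = invert(mol)

before = signed_volume(*[atom[1:] for atom in mol])
after = signed_volume(*[atom[1:] for atom in mirror])
print(before, after)
```

Inversion through the origin is an improper operation in 3D, which is why the sign flips. If the inverted structure can be rotated back onto the original, the molecule is achiral.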
- We're just going to use the QM9 dataset (isomeric smiles format) and filter out only chiral molecules.
- Then we'll generate the enantiomers and write a smart function that, when an enantiomer is missing on chemsp***r, assigns it the opposite direction of the other enantiomer.
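One way such a function might look, assuming we already have enantiomer pairs and a dictionary of the signs that were found. The names and labels below are placeholders, not real measurements.

```python
OPPOSITE = {"+": "-", "-": "+"}

def fill_missing_signs(pairs, known):
    """Give an unlabeled enantiomer the opposite sign of its labeled partner.

    pairs: list of (molecule, mirror_image) identifiers.
    known: dict mapping identifier -> '+' or '-'.
    """
    filled = dict(known)
    for a, b in pairs:
        if a in filled and b not in filled:
            filled[b] = OPPOSITE[filled[a]]
        elif b in filled and a not in filled:
            filled[a] = OPPOSITE[filled[b]]
    return filled

# Placeholder identifiers and signs, purely for illustration.
pairs = [("mol_A", "mol_A_mirror"), ("mol_B", "mol_B_mirror")]
known = {"mol_A": "+"}
labels = fill_missing_signs(pairs, known)
print(labels)
```

Pairs where neither enantiomer has a known sign are left unlabeled.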
- OK, just generated all the enantiomers of molecules with chirality in the QM9 dataset.
- This means we have a big list of chiral molecules (enantiomers)!!!
- Took about 60 minutes to execute (find_files_with_chirality.py).
- UPDATE: Seems like there are going to be a lot of enantiomers!! At 27%, we already had 40K enantiomers! More Data = Better chances at beating MIT
- Now, we just get the relevant optical rotation value from chemsp***r.
- Fortunately, we already wrote a script to do that!
- Ok so after not working on this for two weeks, here is my progress: All is vanity. Everything in the useless/data/ folder is vanity. Waste of time. Completely.
- At least I learned a lot though. Even got a PR on rdkit. Anyways. . .
- So basically chemsp***er is pretty bad since (1) it's slow (2) IT GAVE LIKE 100 OPTICAL ROTATION VALUES AFTER RUNNING IT FOR OVER 9000 compounds.
- That's like a 1% extraction rate and we'll never be able to compete with the 70K molecules ChiRO used.
- We're going to use pubchem.
- And after searching, I came across this article: https://www.ncbi.nlm.nih.gov/Class/PubChem/essentials/limits.html
- Basically, you can obtain all compounds in the pubchem database by their chirality and . . . now we have a dataset of ~17 million chiral compounds (YOO).
- By choosing the export type as the synonyms, we can simply search for molecules with the (+) or (-) synonym and extract the CID number. Then, we use the CID number to obtain the isomeric smiles/mol file.
- Let's go.
- GG.
- Ok so apparently the download failed and only got 4 million compounds out of the possible 18 million compounds.
- But.... good news
- On our computer now (4million.txt), we have approximately 15 thousand compounds labeled with their (-) or (+) indicator, without artificially generating the enantiomer.
- Not bad.
- We'll retry the download to see if we can get all 18 million compounds.
- Alright, so I made a video essentially explaining what I did.
- The download was incredibly slow and a terrible process.
- So I used the esearch api to repeat the search I did on the ncbi site and got all the CIDs in the CIDs.txt file.
- Then I used PUG REST to retrieve all the synonyms of the CIDs.
- Since doing individual ones was slow, I wrote a script, pubchem.py, which essentially sends a POST request with a bunch of CIDs separated by commas.
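A rough sketch of the batching step, not the actual pubchem.py. It only builds the POST requests and sends nothing. The endpoint path and the `cid=` form field follow my reading of the PUG REST documentation, and the batch size of 100 is an arbitrary assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed PUG REST endpoint for synonyms; the CIDs travel in the POST body.
SYNONYMS_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/synonyms/JSON"

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_requests(cids, batch_size=100):
    """Build one POST request per batch of comma-separated CIDs."""
    requests = []
    for batch in chunked(cids, batch_size):
        body = urlencode({"cid": ",".join(str(c) for c in batch)}).encode("ascii")
        requests.append(Request(SYNONYMS_URL, data=body, method="POST"))
    return requests

reqs = build_requests(list(range(1, 251)), batch_size=100)
print(len(reqs), reqs[0].get_method())
```

Each request could then be passed to `urllib.request.urlopen`, ideally with a pause between calls to stay within PubChem's usage limits.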
- Then, I retrieved the smiles and placed them in a file.
- After some regex magic, we ended up with the files ilovesmiles.txt and ilovejson.txt.
- Note that the data in ilovedata.txt is NOT all chiral since it also includes compounds with lines such as (CH+).
- The generate_enantiomers.py script sorts these compounds and creates a new file with enantiomers and only chiral compounds.