MoleculeNet

The DeepChem library ships with the MoleculeNet suite of datasets. Finding a suitable dataset is one of the most important steps in any machine learning application. MoleculeNet curates a wide range of datasets and loads them into DeepChem dc.data.Dataset objects for convenience.
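As a minimal sketch of what loading looks like in practice, the snippet below wraps one MoleculeNet loader, dc.molnet.load_delaney, in a small helper. The helper name load_delaney_splits is hypothetical, and the featurizer and splitter choices are only illustrative. The first call downloads the data, so it needs network access.

```python
def load_delaney_splits():
    """Load the Delaney (ESOL) solubility dataset through its MoleculeNet loader.

    Hypothetical helper for illustration; only dc.molnet.load_delaney is
    part of DeepChem itself.
    """
    # Imported inside the function because DeepChem is a heavy dependency.
    import deepchem as dc

    # MoleculeNet loaders return the task names, a tuple of
    # (train, valid, test) dc.data.Dataset objects, and the list of
    # transformers that were applied to the data.
    tasks, datasets, transformers = dc.molnet.load_delaney(
        featurizer="ECFP", splitter="scaffold")
    train, valid, test = datasets
    return tasks, (train, valid, test), transformers


# Example usage (downloads the dataset on first call):
#   tasks, (train, valid, test), transformers = load_delaney_splits()
#   print(tasks, train.X.shape, train.y.shape)
```

The other load functions in deepchem.molnet follow the same return pattern, so the same unpacking applies to them.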

Contributing a new dataset to MoleculeNet

To propose a new dataset for inclusion in the MoleculeNet benchmarking suite, follow the steps below. Before contributing, please review the datasets already available in MolNet.

  1. Read the Contribution guidelines.

  2. Open an issue to discuss the dataset you want to add to MolNet.

  3. Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader and implements a create_dataset method. See the _QM9Loader for a simple example.

  4. Write a load_dataset function that documents the dataset and add your load function to deepchem.molnet.__init__.py for easy importing.

  5. Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.

  6. Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.

  7. Add documentation for your loader to the MoleculeNet docs.

  8. Submit a [WIP] PR (work-in-progress pull request) following the PR template.
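The packaging step (step 5) can be sketched with the Python standard library alone. The helper names package_dataset and list_archive, and the file name my_dataset.csv, are hypothetical and not part of DeepChem. This is one reasonable way to build the archive, not a required procedure.

```python
import tarfile
from pathlib import Path


def package_dataset(data_path, archive_path):
    """Bundle a single dataset file into a gzip-compressed tar archive."""
    data_path = Path(data_path)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores only the file name, so the archive does not
        # leak the local directory layout.
        tar.add(data_path, arcname=data_path.name)
    return archive_path


def list_archive(archive_path):
    """Return the member names of the archive, as a check before upload."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Example usage with a hypothetical file:
#   archive = package_dataset("my_dataset.csv", "my_dataset.tar.gz")
#   print(list_archive(archive))
```

Listing the members before handing the file over makes it easy to confirm that the archive holds exactly the files the load function expects to find after extraction.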

BACE Dataset

BBBC Datasets

BBBP Datasets

BBBP stands for Blood-Brain-Barrier Penetration.

Cell Counting Datasets

Chembl Datasets

Chembl25 Datasets

Clearance Datasets

Clintox Datasets

Delaney Datasets

Factors Datasets

HIV Datasets

HOPV Datasets

HOPV stands for the Harvard Organic Photovoltaic Dataset.

HPPB Datasets

KAGGLE Datasets

Kinase Datasets

Lipo Datasets

Materials Datasets

Materials datasets include inorganic crystal structures, chemical compositions, and target properties like formation energies and band gaps. Machine learning problems in materials science commonly involve predicting a continuous (regression) or categorical (classification) property of a material from its chemical composition or crystal structure. “Inverse design”, in which ML methods generate crystal structures that have a desired property, is also of great interest. ML is also applicable to discovering new or modified phenomenological models that describe material behavior.

MUV Datasets

NCI Datasets

PCBA Datasets

PDBBIND Datasets

PPB Datasets

QM7 Datasets

QM8 Datasets

QM9 Datasets

SAMPL Datasets

SIDER Datasets

Thermosol Datasets

Tox21 Datasets

Toxcast Datasets

USPTO Datasets

UV Datasets

ZINC15 Datasets

Platinum Adsorption Dataset