Data¶
DeepChem's dc.data module provides APIs for handling your data.
If your data is stored in files such as CSV or SDF, you can use DeepChem's **Data Loaders**.
A data loader reads your data, converts it into features (for example, SMILES strings into ECFP fingerprints),
and stores those features in a Dataset object.
If your data is already held in Python objects such as NumPy arrays or pandas DataFrames, you can load it directly into a Dataset.
Datasets¶
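The :code:`dc.data.Dataset` object is one of the core building blocks of DeepChem programs. A :code:`Dataset` object holds a representation of data for machine learning and is used widely throughout DeepChem. The goal of the :code:`Dataset` class is to interoperate with other common representations of machine learning datasets. To that end, :code:`Dataset` provides methods for converting to and from pandas DataFrames, TensorFlow Datasets, and PyTorch datasets.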
NumpyDataset¶
The dc.data.NumpyDataset class is an in-memory implementation of the abstract Dataset class.
It stores its data in numpy.ndarray objects.
DiskDataset¶
The dc.data.DiskDataset class allows for the storage of larger
datasets on disk. Each DiskDataset is associated with a
directory in which it writes its contents. Because a
DiskDataset can be very large, utility methods that load a whole
field into memory at once (for example, reading the full X array) can be prohibitively expensive.
ImageDataset¶
The dc.data.ImageDataset class is designed for
convenient processing of image-based datasets.
Data Loaders¶
Constructing a dc.data.Dataset object from large amounts of raw input data can take a fair amount of custom code.
The dc.data.DataLoader classes simplify this process.
They provide utilities for loading and featurizing large amounts of data.
JsonLoader¶
JSON is a flexible, human-readable, lightweight file format that is more compact than other open standard formats such as XML. A JSON object resembles a Python dictionary of key-value pairs. All keys must be strings, but a value can be a string, number, object, array, boolean, or null, so the format is more flexible than CSV. JSON is used to describe structured data and to serialize objects. Pandas DataFrames can be read from and written to JSON with pandas.read_json and pandas.DataFrame.to_json.
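The pandas round trip mentioned above can be sketched as follows. The column names and values are illustrative only, and the line-delimited layout (one JSON object per row) is just one of the layouts pandas supports.

```python
import io

import pandas as pd

# Illustrative toy records.
df = pd.DataFrame({"smiles": ["CCO", "CCC"], "label": [0.5, 1.25]})

# Serialize as line-delimited JSON: one JSON object per row.
json_text = df.to_json(orient="records", lines=True)

# pandas.read_json expects a path or file-like object, so wrap the string.
restored = pd.read_json(io.StringIO(json_text), orient="records", lines=True)
```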
InMemoryLoader¶
The dc.data.InMemoryLoader is designed to facilitate the processing of large datasets
where you already hold the raw data in memory (say, in a pandas DataFrame).
Data Classes¶
DeepChem featurizers often transform each data point into a “data class”. A data
class holds all the information needed to train a model on that data
point. Models then convert these objects into tensors for training in their
default_generator methods.
Graph Data¶
This section documents the data classes used for graph convolutions.
We plan to consolidate the existing classes (ConvMol, MultiConvMol, WeaveMol)
into a single data representation (GraphData) for all graph convolutions in a future version of DeepChem,
so these APIs may not remain stable.
Graph convolution models that inherit from KerasModel depend on ConvMol, MultiConvMol, or WeaveMol.
Graph convolution models that inherit from TorchModel depend on GraphData.
Base Classes (for developers)¶
Dataset¶
The dc.data.Dataset class is the abstract parent class for all
datasets. This class should never be directly initialized, but
contains a number of useful method implementations.
DataLoader¶
The dc.data.DataLoader class is the abstract parent class for all
dataloaders. This class should never be directly initialized, but
contains a number of useful method implementations.