Call for Papers

Data is the most valuable ingredient of machine learning models, which help researchers and companies make informed decisions. However, access to rich, diverse, and clean datasets is not always possible. One reason for the lack of rich datasets is the substantial amount of time needed for data collection, especially when dealing with rare events or when manual annotation is required. Another reason is the need to protect privacy whenever raw data contains sensitive information about individuals and hence cannot be shared directly.

A powerful solution that can address both of these challenging scenarios is generating synthetic data. Thanks to recent advances in generative models, it is possible to create realistic synthetic samples that closely match the distribution of complex, real data. When labeled data is limited, synthetic data can be used to augment the training data and mitigate overfitting. When privacy must be protected, data curators can share synthetic data instead of the original data, preserving the utility of the original data while protecting privacy.

Despite the substantial benefits of using synthetic data, synthetic data generation remains an ongoing technical challenge. Although the two scenarios of limited data and privacy concerns share similar technical challenges, such as quality and fairness, they are often studied separately. We invite researchers to submit papers that discuss challenges and advances in synthetic data generation, including but not limited to the following topics:

How can we evaluate the quality of synthetically generated datasets?
How can we handle mixed-type datasets such as tabular data with both categorical and continuous variables?
How can we generate synthetic samples to augment rare samples or limited labeled data?
How can we address privacy violations, measure privacy leakage, and provide provable privacy guarantees?
How can we retain semantic meaning of original samples in the synthetic data?
What are the right datasets/applications/benchmarks to propel this research area forward?
How can we measure and mitigate biases, and thereby ensure fairness in data synthesis?

Important Dates

Submission: March 5, 2021 (Anywhere on Earth)
Notification: March 31, 2021
Workshop: May 7, 2021

Submission Requirements

Submissions in the form of extended abstracts must be at most 4 pages long (not including references; additional supplementary material may be submitted but may be ignored by reviewers), be anonymized, and adhere to the ICLR format. We encourage submissions of work that is new to the synthetic data generation community. Submissions based solely on work that has been previously published in machine learning conferences or relevant venues are not suitable for the workshop. On the other hand, we allow submission of work currently under submission and relevant work recently published in relevant venues. The workshop will not have formal proceedings, but authors of accepted abstracts can choose to have a link to arXiv or a PDF added to the workshop webpage.

Submission Link: https://cmt3.research.microsoft.com/SDGICLRW2021

Please email any enquiries to email