Interview with Milind Jadhav, Lead Data Scientist at Fractal
Milind Jadhav is a Lead Data Scientist in Fractal’s AIML team. He has a decade of experience solving business problems for Fortune 500 clients in areas such as insurance, banking and technology. Drawing on this expertise, Milind leads data science delivery for client engagements, creating automated decision-making solutions through the application of AI and machine learning. He also leads the development of domain-independent capabilities that address needs common to businesses across industries. In addition, Milind acts as a subject-matter expert (SME) in business development initiatives and actively participates in shaping the careers of data scientists at Fractal.
A few months ago, Milind began leading Fractal’s research into what can arguably be considered the hottest area of AI – GANs, Generative Adversarial Networks. Analytics India Magazine caught up with him to understand the current state of the field and the potential it holds.
AIM: How have GANs developed over the years?
Milind: The concept of GANs is relatively new. It dates back to 2014, when Ian Goodfellow and his colleagues first introduced it in a paper. Since then, many other researchers have come up with their own variants, and we now have several popular types of GANs, such as conditional GANs and deep convolutional GANs. GANs started with image generation in particular, but their scope has since expanded to include tabular data generation, audio signal generation, style transfer from one image to another, semantic segmentation for autonomous driving, and training classifiers to recognize adversarial examples and resist attacks.
AIM: What are the real use cases for GANs?
Milind: There are many use cases across industries. The one that interests me most is using GANs to synthesize new, stable chemical compounds. I’ve also read about GANs being used to convert user photos into personal emojis. In photography, SRGANs (super-resolution GANs) are used to generate high-resolution versions of old photos. Text-to-image generation is also done with GANs. At Fractal, we started with the use case of generating high-quality structured enterprise data using GANs.
AIM: How are GANs used at Fractal?
Milind: Having been in the field for more than 20 years, Fractal has become a leader in AI. We have a lot of experience dealing with business data, most of which comes in tabular form. However, we have also found that some clients struggle with data scarcity, data validity, and data security issues. So we began researching how to generate tabular business data to solve these problems.
To generate complex, multidimensional business data with maximum accuracy, it is very important to understand the relationships between the dimensions. What we did at Fractal was first solve the basic problem of recreating one dimension of data over a period of time as accurately as possible. Then we built a utility around it in two versions. In the first, clients supply the actual data they have, and the utility automatically iterates over multiple GAN architectures, tunes dozens of hyperparameters, and selects the best configuration based on validation metrics to ultimately generate synthetic data. The utility lets us do this without exposing our clients to the complicated training process. As a result, the client receives the output data along with a report validating the quality of the synthetic data.
In the second version, we try to ease the GAN training process for data scientists. We offer code accelerators that help data scientists train the best GAN architectures and generate synthetic data quickly. In this version, the data scientist has full flexibility over the training process, including choosing their own sets, specifying grids for the hyperparameters they want to tune, and so on, without having to worry about the configuration of the underlying Python code.
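As a rough illustration of the selection loop described above (not Fractal's actual utility), the sketch below expands a hyperparameter grid and keeps the configuration with the best validation score. The grid values, the `toy_score` function and every name here are hypothetical; in practice `toy_score` would be replaced by a routine that trains a GAN with the given configuration and scores its synthetic output.

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def select_best(grid, train_and_score):
    """Return the (config, score) pair with the lowest validation score."""
    best_config, best_score = None, float("inf")
    for config in expand_grid(grid):
        score = train_and_score(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical grid: the architectures and values are placeholders.
GRID = {
    "architecture": ["vanilla", "conditional"],
    "learning_rate": [1e-4, 2e-4],
    "batch_size": [64, 128],
}


def toy_score(config):
    """Stand-in for 'train a GAN, then score its synthetic data' (lower is better)."""
    penalty = 0.0 if config["architecture"] == "conditional" else 0.1
    return (
        penalty
        + abs(config["learning_rate"] - 2e-4) * 1000
        + abs(config["batch_size"] - 128) / 1000
    )


best_config, best_score = select_best(GRID, toy_score)
print(best_config, best_score)
```

Exhaustive grid search is only one option. With dozens of hyperparameters the number of combinations grows quickly, so random or adaptive search may be more practical.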
AIM: What are the flaws of GANs?
Milind: I wouldn’t call them flaws, but by construction, GANs are very difficult to train. Because two models, the generator and the discriminator, are always competing against each other, improvements to one come at the expense of the other. In practice, many data scientists struggle to keep the two in balance during training.
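The competition described above can be made concrete with the minimax value function from the original 2014 GAN formulation, which the discriminator tries to maximize and the generator tries to minimize. The sketch below estimates that value from discriminator outputs. The probabilities are made-up numbers chosen for illustration, and the function name is an assumption.

```python
import math


def gan_value(d_real, d_fake):
    """Estimate the minimax value V(D, G) from discriminator outputs.

    d_real: probabilities D(x) assigned to real samples.
    d_fake: probabilities D(G(z)) assigned to generated samples.
    The discriminator wants this value high; the generator wants it low.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Illustrative numbers only: the discriminator confidently spots the fakes...
strong_discriminator = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# ...and then the generator improves, so the fakes look more like real data.
fooled_discriminator = gan_value(d_real=[0.9, 0.8], d_fake=[0.5, 0.6])

# The same quantity moves in opposite directions for the two players.
print(strong_discriminator > fooled_discriminator)
```

Because one number is pushed up by one model and down by the other, a gain for the generator is by definition a loss for the discriminator, which is why balance is hard to maintain.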
Several pitfalls can arise when training GANs. Some of them are:
- The most difficult is the case where multiple generator inputs result in the same output, so the generated samples show limited variety. This is called mode collapse.
- There is also no universally accepted validation metric for telling whether a GAN is performing well during training.
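Two simple checks can hint at these problems for tabular data. The sketch below is a generic illustration and not the metric Fractal developed: it measures how varied the generated rows are, and how far a categorical column's distribution in the synthetic data is from the real one. The column values and function names are hypothetical.

```python
from collections import Counter


def unique_ratio(rows):
    """Fraction of distinct rows; values near 0 hint at limited variety."""
    return len(set(rows)) / len(rows)


def total_variation(real_values, synthetic_values):
    """Total variation distance between two categorical columns.

    0 means identical category frequencies, 1 means no overlap at all.
    """
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )


# Hypothetical 'product' column from real data and from a collapsed generator.
real = ["auto", "home", "life", "auto", "home", "life"]
collapsed = ["auto"] * 6

print(unique_ratio([(value,) for value in collapsed]))
print(total_variation(real, collapsed))
```

Checks like these look at one column or one notion of variety at a time, so they cannot confirm that relationships between columns are preserved. That gap is part of why no single metric is universally accepted.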
What we’re trying to do, through rigorous experimentation on multiple datasets, is automate this entire training process while mitigating issues such as those above as much as possible. For validation in particular, we developed our own metric, which we observed to work quite well in our experiments.
AIM: What’s in store for us in terms of GANs?
Milind: Businesses today spend a lot of money to collect and store huge amounts of data. Once GANs for structured data mature, we could envision companies generating highly accurate samples in a secure manner whenever needed. For unstructured data generation, too, the applications are huge and constantly evolving.
Overall, the use of GANs is growing, but one remaining problem is the enormous computing power needed to train them. They typically require GPU resources to accurately learn the underlying joint distribution and replicate it as closely as possible. In the future, as infrastructure costs decrease further and bring down the cost of training, I think we could see more organizations investing in GAN research for a variety of applications.