Uncovering the true value of data
Current AI poorly exploits the goldmine of information present in rare data, extracting trivial notions rather than high-level concepts.
We hypothesize that this is due to several key weaknesses. We make simplistic assumptions about how data are distributed that do not reflect the real world, and our training processes seem to favour frequent, common data over rare data (a minimal sketch of this frequency bias is given below). Moreover, models sometimes extract poor information seemingly at random, either because they learn suboptimal data representations or because they latch onto non-useful artefacts. We want to understand why such inconsistency exists, and we propose to devise methods that combat it and hence improve how we optimize learning objectives.

We propose to introduce stronger (causal) assumptions to robustly extract high-level concepts. There is an as-yet-unexploited opportunity here: rare data may reveal unique causal relationships, a tantalising prospect that we will investigate thoroughly. We lay herein the underpinnings of an AI that is data-efficient and robust. We will stress-test our ideas and methods on synthetic and benchmark data, and we will explore key healthcare applications on multimodal cancer datasets to illustrate potential gains and provide material for follow-up work.
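To make the frequency-bias hypothesis above concrete, here is a minimal, self-contained sketch. The dataset sizes, the per-sample loss values, and the inverse-frequency reweighting countermeasure are illustrative assumptions chosen for exposition; they are not results or methods from this proposal.

```python
# Minimal sketch: a plain average-loss objective is dominated by frequent data.
# All numbers are illustrative, not experimental results.
import numpy as np

# Hypothetical imbalanced dataset: 990 samples of a common class, 10 of a rare one.
labels = np.array([0] * 990 + [1] * 10)

# Assumed per-sample cross-entropy losses: rare samples are typically fit worse.
per_sample_loss = np.where(labels == 0, 0.3, 2.0)

# Standard empirical risk: an unweighted average over samples.
erm_loss = per_sample_loss.mean()
common_share = per_sample_loss[labels == 0].sum() / per_sample_loss.sum()
print(f"ERM loss {erm_loss:.3f}; common class contributes {common_share:.1%} of it")
# ~94% of the objective (and hence of its gradient) comes from the common class,
# so training barely "sees" the 10 rare samples.

# One textbook countermeasure: inverse-frequency reweighting, which gives every
# class the same total weight mass, so contributions no longer scale with counts.
class_counts = np.bincount(labels)
weights = (len(labels) / (len(class_counts) * class_counts))[labels]
weighted_loss = weights * per_sample_loss
rare_share = weighted_loss[labels == 1].sum() / weighted_loss.sum()
print(f"Reweighted loss {weighted_loss.mean():.3f}; "
      f"rare class now contributes {rare_share:.1%}")
```

Reweighting appears here only as the standard baseline fix; the point of the sketch is that an unweighted average objective weighs each sample, not each concept, equally, which is precisely the frequency bias this proposal targets.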