Using machine learning to build datasets for machine learning

Using the same datasets over and over when training leads to neural networks that generate repetitive results. I am trying different approaches to control the kind of images my networks generate. Previously, I have tried under-training a CycleGAN to hybridize two images, and using face detection to build a dataset of faces from early Renaissance artworks.

This week I am using Google's BigQuery service and Cloud Vision API, along with the Metropolitan Museum of Art's recently released online collection, to gather large numbers of high-quality images of particular kinds of artworks without manually sorting hundreds of thousands of images.

The Met's online collection includes images of more than 200,000 items spanning cultures around the world. The Met's portal allows visitors to download any image in the collection that is in the public domain. They have also made a catalogue of sorts available as a CSV, but its roughly half a million rows are too much for Numbers or Google Sheets, and make Excel grind almost to a halt and crash pretty frequently. This amount of data calls for a database approach. Google makes the same information available on its cloud-based BigQuery service, which is queried with SQL. It is clearly intended for businesses rather than individual users, but free accounts are available for a limited number of queries. As with many Google products, the technology is impressive, but the user experience can be baffling.
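
To get a sense of the scale before diving in, a single query will do what Excel couldn't. A minimal sketch, using the objects table from the public the_met dataset queried throughout this post:

#standardSQL
-- count the rows in the objects table
SELECT COUNT(*) AS n_items
FROM `bigquery-public-data.the_met.objects`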

As an example of the baffling: the Met's artwork dataset is divided among three tables – objects, images and vision_api_data – and it is not initially clear how they relate. images appears to include at least some paintings, but so does objects. objects has detailed information on its items, while images has just six fields. It turns out that all three tables refer to the same items – the 200,000 artworks mentioned earlier – and can be cross-referenced with SQL's JOIN using the item number (object_id) they share. This is not documented anywhere, but left as an exercise for the curious. There is no obvious reason why this couldn't be a single table; none of the three has that many fields.
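
Here is a minimal sketch of that cross-referencing, joining two of the tables on the shared object_id field (all field names here appear in the full query at the end of this post):

#standardSQL
-- pair each object's period with its image URL via the shared object_id
SELECT o.object_id, o.period, i.original_image_url
FROM `bigquery-public-data.the_met.objects` o
JOIN `bigquery-public-data.the_met.images` i
  ON i.object_id = o.object_id
LIMIT 10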

Another red herring lurks in the helpful comment #standardSQL found at the top of the sample query provided, which suggests that the syntax of the SQL being used is standard. It is not quite. Suppose, for example, that you want to exclude items with no entry under the heading "period". A natural first attempt is:

where period != null

In BigQuery, this generates a syntax error. (Strictly speaking, standard SQL would not accept it either – the portable form is where period is not null.) What works here is putting quotes around "null", presumably because this dataset stores missing values as the literal string "null" rather than as true SQL nulls. But how would anyone know that? I was capably assisted by Zach Peyton, who figured all this out. I connected with Zach through Codementor.io, which pairs students with experienced programmers for teaching and troubleshooting help at rates averaging $20 per 15 minutes.
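
For the record, the filter that worked looks like this – the quoted "null" is the surprising part:

where period != "null"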

There is no substitute for learning through experimentation, but sometimes you just get stuck and the forums are no help. It's good to have the option of asking an expert at times like that. With the leg up Zach gave me, I was able to assemble queries that return, for example, the URLs of images of objects made of rock, which I could then use to train a GAN:

#standardSQL
-- URLs of public-domain images whose Vision API labels include 'rock'
SELECT i.object_id, o.period, i.original_image_url, v.description
FROM `bigquery-public-data.the_met.objects` o
JOIN `bigquery-public-data.the_met.images` i
  ON i.object_id = o.object_id
JOIN (
  -- vision_api_data holds an array of labels per object;
  -- UNNEST flattens it into one row per label
  SELECT
    label.description AS description,
    object_id
  FROM `bigquery-public-data.the_met.vision_api_data`,
    UNNEST(labelAnnotations) label
) v ON v.object_id = o.object_id
WHERE o.is_public_domain = TRUE
  AND v.description = 'rock'
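
Swapping 'rock' for any other label the Vision API has assigned yields a URL list for a differently themed training set, with no manual sorting required.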