I tried creating my own dataset from internet image resources this week, and it worked well.
Get a list of URLs
Go to Google Images and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.
Now you need to run some JavaScript code in your browser which will save the URLs of all the images you want for your dataset. Press Ctrl+Shift+J on Windows/Linux or Cmd+Opt+J on Mac, and a small window, the JavaScript 'Console', will appear. That is where you will paste the JavaScript commands.
You will need to get the URL of each image. You can do this by running the following commands:
// Collect the image URLs from the Google Images results page
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
// Save them to a text file, one URL per line
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
You will then have a txt file containing the URLs of all the images:
Create your own dataset by downloading the images:
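A minimal sketch of this step using fastai v1's download_images and verify_images (the dataset root and class names here are just placeholders, not the ones I actually used):

from fastai.vision import *

path = Path('data/mydataset')                      # hypothetical dataset root
for folder in ['class_a', 'class_b']:              # placeholder class names
    dest = path/folder
    dest.mkdir(parents=True, exist_ok=True)
    # urls_class_a.txt etc. are the URL files saved from the browser console
    download_images(path/f'urls_{folder}.txt', dest, max_pics=200)
    # delete any file that cannot actually be opened as an image
    verify_images(dest, delete=True, max_size=500)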
View data:
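Continuing from the snippet above, a typical fastai v1 version of this step, assuming the usual 20% validation split and 224-pixel images rather than my exact settings:

np.random.seed(42)
data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
data.show_batch(rows=3, figsize=(7, 8))            # show a grid of labelled samples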
Create a CNN and train it for 4 epochs:
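A sketch of this step, assuming the standard fastai v1 transfer-learning setup with a ResNet-34 backbone:

learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(4)                             # 4 epochs with the one-cycle policy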
The error_rate goes down and then up again, possibly because the learning rate is not appropriate. So we can use the learning rate finder to find a better one:
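The tool is fastai's learning rate finder:

learn.lr_find()                                    # mock training run with an increasing learning rate
learn.recorder.plot()                              # pick a range where the loss is still dropping steeply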
I chose a better learning rate, between 1e-4 and 1e-3, and the result improved:
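The chosen range is passed to fit_one_cycle as a slice, so earlier layers train with the smaller rate and later layers with the larger one (the epoch count and checkpoint name below are just examples):

learn.fit_one_cycle(4, max_lr=slice(1e-4, 1e-3))
learn.save('stage-2')                              # hypothetical checkpoint name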
Here are the top loss images after improvement:
And here is the Confusion Matrix:
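Both of these plots come from fastai's ClassificationInterpretation object; roughly:

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(11, 11))        # images the model got most wrong, with loss and probability
interp.plot_confusion_matrix()                     # actual vs. predicted classes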
Conclusion
This is like a "Hello World!" project that shows how to use image resources on the internet for training. The dataset is not well curated; I believe we can improve the prediction accuracy by pruning some confusing images.
I also read this paper this week: https://arxiv.org/pdf/1905.09773.pdf
What I find very inspiring is that they visualize a voice as a human face image without any other input, just like what our brains do when we are on a phone call.
Speech2Face: Learning the Face Behind a Voice
They reconstruct an image of a person's face from a short audio segment of speech; no other input is needed.
- Voice encoder network
The voice encoder module is a convolutional neural network that turns the spectrogram of a short input speech segment into a pseudo face feature, which is subsequently fed into the face decoder to reconstruct the face image.
- Face decoder network
Reconstructs the image of a face from a low-dimensional face feature; a rough structural sketch of the two-stage pipeline follows this list.
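To make the two-stage design concrete, here is a minimal structural sketch in PyTorch of how I read the pipeline. The layer sizes and names are my own placeholders, not the paper's actual architecture; the 4096-dimensional face feature matches the VGG-Face feature size used in the paper.

import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    # CNN that maps a speech spectrogram to a pseudo face feature
    def __init__(self, feature_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, feature_dim)

    def forward(self, spectrogram):                # (batch, 1, freq, time)
        x = self.conv(spectrogram).flatten(1)
        return self.fc(x)                          # pseudo face feature

# The face decoder (Cole et al.) is pre-trained and frozen; only the voice encoder
# is trained, so that its output matches the face feature extracted from the video frame.
voice_encoder = VoiceEncoder()
face_feature = voice_encoder(torch.randn(8, 1, 257, 400))   # dummy spectrogram batch
print(face_feature.shape)                                    # torch.Size([8, 4096])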
Implementation
- Use the face decoder model of Cole et al. to reconstruct a canonical face image.
- Only need to train the voice encoder that predicts the face features.
- To train their model, they use the AVSpeech dataset, composed of millions of video segments from YouTube with more than 100,000 different people speaking. Their method is trained in a self-supervised manner, i.e., it simply uses the natural co-occurrence of speech and faces in videos, without requiring additional information such as human annotations.
Ablation Studies
- Their own loss function works better.
- Feeding longer speech as input at test time leads to improvement in reconstruction quality.
- With BN (Batch Normalization) the results contain much richer facial information.
- They also study the effect of language and accent.