A week ago we launched www.mymoustache.net, a fun website that uses machine learning to measure your facial hair and promote men’s health.
And now we’re starting to look at the data behind the scenes and learn some interesting, amusing facts.
How do we look at the data?
Every time a photo is analyzed, we generate a small piece of text data with details such as how many faces we found, their average age, gender, the size of the moustache or beard found, and the approximate location where the photo was submitted (very high level and anonymous, so there’s no personally identifiable data at all). We send this to some big pipes we have behind the scenes called Azure Event Hubs. These are meant for super-high-scale data pumping, like IoT scenarios for example. From there, we persist that data into our cloud storage. Then we read it back using Azure Stream Analytics, where we can parse it and send it to our reporting tool, Power BI.
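To make the "small piece of text data" concrete, here is a minimal Python sketch of what building one such event could look like. The field names are taken from the Stream Analytics query in this post, but the exact payload layout, the `build_event` helper and the sample values are assumptions for illustration, not the site's actual code.

```python
import json
from datetime import datetime, timezone


def build_event(faces, submission_method, latitude, longitude, country):
    """Build the JSON text for one analyzed photo (field layout is assumed)."""
    return json.dumps({
        "Timestamp": datetime.now(timezone.utc).isoformat(),
        "Latitude": latitude,      # coarse location only, nothing personal
        "Longitude": longitude,
        "Country": country,
        "AnalyzeResults": {
            "SubmissionMethod": submission_method,
            "Faces": faces,        # one entry per face found in the photo
        },
    })


# Hypothetical sample: one photo with a single face in it.
sample_face = {
    "faceId": "face-1",
    "donate": True,
    "attributes": {
        "gender": "male",
        "age": 34,
        "MoustacheLength": 0.4,
        "BeardLength": 0.1,
    },
}
event_json = build_event([sample_face], "upload", -23.5, -46.6, "Brazil")
print(event_json)
```

The resulting string is what would be handed to an Event Hubs client. The sending step is left out here so the sketch runs without any Azure dependency.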
More or less like this:
This architecture allows us to analyze data in real time, and to aggregate and report on it in all sorts of ways. For example, in Stream Analytics I take that text data (saved in JSON format) and transform it this way:
with data as (
    select
        GetArrayLength(logs.AnalyzeResults.Faces) AS faceCount,
        arrayElement.ArrayValue.faceId as faceId,
        arrayElement.ArrayValue.attributes.gender as gender,
        arrayElement.ArrayValue.attributes.age as age,
        arrayElement.ArrayValue.attributes.beardPercentile as beardpPercentile,
        arrayElement.ArrayValue.attributes.moustachePercentile as moustachePercentile,
        arrayElement.ArrayValue.attributes.BeardLength as beardLength,
        arrayElement.ArrayValue.attributes.BeardConfidence as beardConfidence,
        arrayElement.ArrayValue.attributes.MoustacheLength as moustacheLength,
        arrayElement.ArrayValue.attributes.MoustacheConfidence as moustacheConfidence,
        arrayElement.ArrayValue.donate as donate,
        logs.AnalyzeResults.SubmissionMethod as submissionMethod,
        cast(logs.Timestamp as datetime) as eventDateTime,
        cast(DATETIMEFROMPARTS(DATEPART(yyyy, logs.Timestamp), DATEPART(mm, logs.Timestamp), DATEPART(dd, logs.Timestamp), 0, 0, 0, 0) as datetime) as eventDate,
        cast(DATETIMEFROMPARTS(2015, 11, 07, DATEPART(hh, logs.Timestamp), DATEPART(mi, logs.Timestamp), DATEPART(ss, logs.Timestamp), 0) as datetime) as eventTime,
        logs.Latitude as Latitude,
        logs.Longitude as Longitude,
        logs.Country as Country
    FROM logs as logs
    CROSS APPLY GetArrayElements(logs.AnalyzeResults.Faces) AS arrayElement
)
select *
into output
from data
If you know SQL you will understand most of this. But we’re not running this against a database. Instead, we’re running it against streaming data, and we’re parsing and remodeling this data according to our needs. For example, in this particular case I have a JSON payload that may contain an array of faces (a single photo may contain many faces), so I need to turn each face into its own record by doing a CROSS APPLY against that array.
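For readers who don't use Stream Analytics, here is a rough Python equivalent of what the query does to a single payload: one output record per face, plus the split of the timestamp into a date part and a time-of-day part pinned to a fixed date. It is only a sketch of the idea. The `flatten_faces` helper and the sample payload are made up for illustration, and only a subset of the columns is shown.

```python
from datetime import datetime


def flatten_faces(log):
    """Yield one flat record per face, similar to CROSS APPLY GetArrayElements."""
    faces = log["AnalyzeResults"]["Faces"]
    ts = datetime.fromisoformat(log["Timestamp"])
    for face in faces:
        attrs = face.get("attributes", {})
        yield {
            "faceCount": len(faces),
            "faceId": face.get("faceId"),
            "gender": attrs.get("gender"),
            "age": attrs.get("age"),
            "moustacheLength": attrs.get("MoustacheLength"),
            "beardLength": attrs.get("BeardLength"),
            "donate": face.get("donate"),
            "eventDateTime": ts,
            # Date with the time zeroed out, like eventDate in the query.
            "eventDate": ts.replace(hour=0, minute=0, second=0, microsecond=0),
            # Time of day pinned to the fixed date 2015-11-07, like eventTime.
            "eventTime": ts.replace(year=2015, month=11, day=7, microsecond=0),
            "Country": log.get("Country"),
        }


# Hypothetical payload: one photo containing two faces.
sample_log = {
    "Timestamp": "2015-11-10T14:30:05",
    "Country": "Brazil",
    "AnalyzeResults": {
        "Faces": [
            {"faceId": "a", "donate": True,
             "attributes": {"gender": "male", "age": 34,
                            "MoustacheLength": 0.4, "BeardLength": 0.1}},
            {"faceId": "b", "donate": False,
             "attributes": {"gender": "female", "age": 29,
                            "MoustacheLength": 0.0, "BeardLength": 0.0}},
        ],
    },
}
records = list(flatten_faces(sample_log))
print(len(records))  # 2: one record per face
```

Pinning every time of day to the same date is what lets a report compare activity by hour regardless of which day the photo was submitted.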
Then I set up Stream Analytics to push this to Power BI, where I get this super nice reporting tool that tells us a lot. And what have we learned?
- Total faces analyzed in a week: 49,254
- Total men: 37,534
- Total women: 11,720 (you might wonder why women would care about analyzing their faces on a mustache site, but it turns out we have an auto-stache feature that adds mustaches to them)
- Average mustache length from all photos (scale from 0 to 1): 0.28
- Average beard length from all photos (scale from 0 to 1): 0.27
Countries with the biggest mustaches on average (and here I was somewhat hoping I’d find Mexico right at the top so I could have some nice jokes with my friends, but that didn’t happen at all. Actually, and quite unexpectedly, Brazil, my home country, was one of the top ones):
And here are the countries with the shortest mustaches, where there seem to be a few Asian ones:
And also average mustache length by age:
Not many 10-year-olds with beards, as you would expect 🙂 A couple of bars there were quite off, though. I wonder why…
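The per-country and per-age charts are just grouped averages over the flattened face records. In our case Power BI does this for us, but the same calculation can be sketched in a few lines of Python. The `average_by` helper and the sample records below are made up for illustration and are not the site's data.

```python
from collections import defaultdict


def average_by(records, key, value):
    """Return {group: mean of value} over records, skipping missing values."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for rec in records:
        if rec.get(key) is None or rec.get(value) is None:
            continue
        totals[rec[key]][0] += rec[value]
        totals[rec[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Made-up records for illustration only.
sample = [
    {"age": 10, "Country": "Brazil", "moustacheLength": 0.0},
    {"age": 30, "Country": "Brazil", "moustacheLength": 0.25},
    {"age": 30, "Country": "Japan", "moustacheLength": 0.5},
]
by_age = average_by(sample, "age", "moustacheLength")
by_country = average_by(sample, "Country", "moustacheLength")
print(by_age)
print(by_country)
```

One thing a sketch like this makes obvious: a group with very few records gets an average that swings a lot, which may be one reason a bar or two looks out of place.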
And that’s it…
Ah, I also decided to write a custom dashboard that shows some of this data across the world in 3D. Because everything looks more fun in 3D: http://stacheistics.azurewebsites.net
The taller the bars are, the higher the number is (they also look lighter):
One interesting fact I got from there: how many people checked the “donate to science” checkbox and authorized us to use their photos to improve our machine learning, per region?
It turns out (I don’t know why) some regions were off the charts. The northeast of Brazil, for example, is one. Egypt also shows very high. The north of Japan and Belarus also seem pretty high.
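For the curious, a per-region figure like this can be computed by bucketing records into a latitude/longitude grid and taking the share of faces with the donate flag set. The Python sketch below shows that idea. The grid approach, the 10-degree cell size and the `donate_rate_by_cell` helper are assumptions for illustration and not necessarily how the dashboard does it.

```python
import math
from collections import defaultdict


def donate_rate_by_cell(records, cell_degrees=10):
    """Return {(lat_cell, lon_cell): share of faces with donate=True}."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [donated, total]
    for rec in records:
        # Snap each coordinate down to the south-west corner of its grid cell.
        cell = (
            math.floor(rec["Latitude"] / cell_degrees) * cell_degrees,
            math.floor(rec["Longitude"] / cell_degrees) * cell_degrees,
        )
        counts[cell][0] += 1 if rec["donate"] else 0
        counts[cell][1] += 1
    return {cell: donated / total for cell, (donated, total) in counts.items()}


# Made-up records for illustration only.
sample = [
    {"Latitude": -8.0, "Longitude": -35.0, "donate": True},
    {"Latitude": -5.0, "Longitude": -38.0, "donate": False},
    {"Latitude": 30.0, "Longitude": 31.0, "donate": True},
]
rates = donate_rate_by_cell(sample)
print(rates)
```

Note that a cell with only a handful of photos can easily show a very high share, so sparse regions might stand out for that reason alone.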
A disclaimer: don’t take this data very seriously. It could very well be a programming error on my side. But looking at this data is still pretty cool! 🙂