Using Azure Machine Learning to understand user communities on Office 365

At the recent 2017 European SharePoint Conference in Dublin, my colleague @VelinGeorgiev and I built a solution for the conference hackathon that uses insights gathered from machine learning to create communities of users on an Office365 tenant who have similar behavior.

I came up with the idea and built the machine learning part of it. Velin built the collaboration logic using SharePoint, Forms and Flow. He blogged about that portion on his own blog.

We won the hackathon!

The Idea

I have been working in the team that develops collaboration solutions for a very large US company. We deliver collaboration solutions on a few platforms, one of them being Office365: using SharePoint Online as the backbone, but also utilizing other Office365 services like Flow, PowerApps and PowerBI and Microsoft Azure services for additional customization.

Office365 tracks most actions in most of the services, and writes out log entries for every action performed by every user. This data includes things like which workload (SharePoint, Email, OneDrive, PowerApps, Flow, etc.) the action occurred in, what the action was and when it was performed.

This is very useful for auditing and security, but I have been thinking for a while about how we could use the Office365 audit log data for machine learning applications.

One of the issues I’ve been thinking about is how an Office365 tenant owner can identify communities of users on a tenant. In my organization (and I suspect in most other organizations) user communities are created from assumptions. Alternatively, users are all treated the same and given the same content and training.

Wouldn’t it be useful if we could find communities of people by their behavior? We could discover which people use advanced features, which people are struggling, who helps other people, who the administrators are, and many other behavioral groupings.

I thought that it would be interesting to combine the log data in Office365 with some machine learning and see what we can do.

Approach to machine learning

The part of this project where I feel like I have the most left to learn is definitely the analytics part. I found this awesome series of videos where Microsoft addresses data science noobs like myself and explains the foundations of data science and how you can use Azure Machine Learning Studio. Armed with the knowledge that the type of problem I want to solve is a clustering problem, I spent some time in the Azure Machine Learning Studio samples seeing how other people approached clustering projects.

Based on what I learnt there, I decided that I would try an unsupervised learning algorithm (K-Means clustering) on the log data and see how that works out. I decided that the dataset I would use for the clustering is a matrix of users, with the count of each type of action per user in the tenant.

Getting log data

The easiest way to analyze the log data from your Office365 tenant is to use the Azure Log Analytics tool. You need an Azure subscription; once you have created your analytics workspace, you can install a plugin that forwards all log data from the Office tenant to the log analytics solution.

Once you have the log analytics plug-in aggregating the data, the next step is to inspect that data, and figure out what the activity on your tenant looks like. I used a simple query to pull in the list of distinct actions that have been performed in my tenant:
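Assuming the Office 365 data lands in the standard OfficeActivity table of the Log Analytics solution (table and column names may differ per workspace version), a query along these lines does the job:

```kusto
OfficeActivity
| distinct Operation
```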

That query showed that 136 different operations had been performed on my tenant. This is far too many dimensions for the clustering algorithm to work properly, so I used Excel to map those 136 activities into similar groupings, aggregating them into 15 groups of related actions. I had the following categories of activity:

  • General Usage
  • Azure Development
  • Directory Administration
  • Mail Administration
  • Email Usage
  • OneDrive Usage
  • OneDrive Advanced Usage
  • Development
  • SharePoint Usage
  • SharePoint Content Creation
  • SharePoint Administration
  • SharePoint Content Access
  • SharePoint Advanced Content Creation
  • SharePoint Site Administration
  • SharePoint E-Discovery

I created a spreadsheet that generated the lines of a from/to mapping table, which I used in the next query.

Once I had my mapping table, I used a Log Analytics query to count my revised set of actions for each user. Log Analytics allows you to project an arbitrary table and join it to the analytics data in the portal.
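A sketch of that query follows; the operation names here are only a hypothetical few of the 136, while OfficeActivity and UserId are the standard names used by the Office 365 Log Analytics solution:

```kusto
let OperationMap = datatable (Operation: string, Category: string)
[
    "UserLoggedIn", "General Usage",
    "FileAccessed", "SharePoint Content Access",
    "FileModified", "SharePoint Content Creation",
    "Set-Mailbox", "Mail Administration"
];
OfficeActivity
| join kind=inner OperationMap on Operation
| summarize Count = count() by UserId, Category
| evaluate pivot(Category, sum(Count))
```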

This query takes the list of operations copied and pasted from Excel, uses the “datatable” operator in Log Analytics to turn that data into a table, and then joins this table to the Office log records. It counts occurrences per activity per user, and then uses the “evaluate pivot” feature to create the matrix, with users running down the left and counts of the revised actions across the top.

Clustering the data

Once I had the data, I needed to adjust it so that one type of action didn’t overwhelm another. Because some types of action (AD login, for example) occur a LOT, the clustering algorithm would ascribe more weight to them, even though they don’t necessarily mean anything special. To fix that, I used Excel and ran some formulae on the data in order to normalise the counts within each column.

I normalised the data using this formula: normalised(x) = (x − mean(x)) / std(x).
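This per-column z-score is easy to express in code; a sketch in plain Python (the user counts here are hypothetical):

```python
from statistics import mean, stdev

def normalise_column(values):
    # Z-score: subtract the column mean, divide by the column standard deviation.
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical per-user counts for one action category (e.g. AD logins).
logins = [500, 480, 20]
scaled = normalise_column(logins)
```

After scaling, the values in every column are centred on zero with a standard deviation of one, regardless of how common that action was.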

Using this approach I ended up with a dataset where, within each column, users were still as different from one another as before, but the columns were now normalised so that each would appear equally important to the clustering algorithm.

I created an Azure ML workspace that ran the K-Means clustering algorithm on our data. Here is what it looked like:
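Azure ML Studio wires this step up graphically, but the underlying algorithm can be sketched in plain Python (a hand-rolled K-Means over hypothetical normalised profiles, not the Studio module itself):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    # Plain K-Means: repeatedly assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return [min(range(k), key=lambda i: dist2(p, centroids[i])) for p in points]

# Hypothetical normalised (mail usage, SharePoint admin) profiles for four users.
profiles = [(9.0, 0.1), (8.5, 0.3), (0.2, 7.9), (0.4, 8.2)]
labels = kmeans(profiles, k=2)
```

The first two users end up in one cluster and the last two in another, which is exactly the kind of behavioural grouping we were after.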

The CSV output from the clustering algorithm is a file containing all of our users, the cluster each is most associated with, and a measure of how close they are to the other clusters.

Surveys and Yammer Groups

Note that at this stage we don’t know what type of user is in a cluster; we just know that this is a group of statistically similar users on the Office tenant. The next step was to upload this CSV file into the user-facing part of our hack, where Velin built a SharePoint site on which we could survey the users (using MS Forms) and figure out what the clusters are using PowerBI.

Velin created a process whereby, once the survey told us what each group represents, we could add the users to Yammer groups for them to collaborate.


Workplace and team motivation

I’ve been building answers for the 2016 edition of Advent of Code, and I have been having a blast with building answers for the puzzles. My solutions are up on Github if anyone is interested in them.

It is a reminder of how much I love coding, with just a tinge of nostalgia, I suppose. I can certainly recall a few days in my architecture career where I was mired hip-deep in politics and longed for an IDE and a well-articulated requirement…

While building these I was thinking about how a well designed challenge can enthuse people to dust off old skills or learn new ones, and the topic of enthusiasm is what I wanted to address in this blog post.

The Scaled Agile Framework (SAFe) has some thoughts on unlocking the intrinsic motivation of knowledge workers, and it makes the point that, by definition, knowledge workers know more about doing their work than the people “managing” them, so a dictatorial style of management is almost inevitably doomed to fail.

So then, as a manager, what are you to do in order to ensure a happy and productive team? Beyond the basic hygiene factors (adequate pay, decent office space, ability to work from home), what else can the manager of a software team do in order to get the most out of their team?

SAFe suggests giving people agency and responsibility, and I wholeheartedly agree. Someone who feels ownership and responsibility for a project outcome is much less likely to try and do the minimum, and more likely to try their best. This is one of the side-effects of the collective ownership from the agile world that makes proponents of agile methodologies so excited.

Putting on my Grinch hat, I have heard it said that agile works best if you have a very strong team, but then anything works well if you have a really strong team. That leads me to some thoughts on how to build your team’s skills.

With new frameworks and even new languages popping up in a relentless march of progress, it would add lots of value to your team if they were proactively skilling themselves up and adopting new ways of working and technology improvements as they become available.

What about making the team responsible for coming up with competitions, puzzles and hackathons to learn and practice new skills? How about gamification of the workplace, in the same way that Advent of Code awards points for solving puzzles quickly?




Node.js is awesome

I’ve been dabbling with full stack javascript development.

I really like AngularJS (I have spent a bit of effort on it), and the power of what you can do inside the browser is really cool. Also, the ability to change things quickly is amazing. It does have a steep learning curve, but there are oodles of info out there for any situation you may find yourself in.

I have been looking at Meteor.js, and it seems really cool as well; with its Blaze templating engine it might obviate the need for Angular for small to medium complexity UIs. I still have to explore it a bit more though; maybe that will be the subject of my next post.

In the meantime I am still continuing with my freelance work, and today I had a scenario that Node KILLED at. I was writing a C# data transfer service that reads data from a source I didn’t have access to (an HTTP GET, basically), and then writes this data to two other locations (an HTTP POST to another location, and a TCP write).

It literally took me 10 minutes to knock together two servers in Node.js: one to simulate the HTTP data source and destination, and another to simulate the TCP server.

I did have a bit of a moment where I spent 30 minutes debugging what I thought was a problem in my C# code, which turned out to be a problem in my Node server 🙂

This usage of Node to mock up a TCP or HTTP server is incredibly powerful. Building the equivalent services in .NET would have taken me at least twice as long as the node ones did.

I am definitely going to keep Node.js in my tool belt for all kinds of rapid prototyping.

TechEd 2013 Overview – Wednesday

On Wednesday, the real meat of the conference started. There were competitions and promos, nerds everywhere, and stacks and stacks of sessions.


The Durban Convention Centre was swarming with nerds when we got there on Wednesday morning. Any time a slightly interesting piece of swag appeared, the distributor was mobbed. The only decent coffee was available at two stands – one was a coffee company (Lavazzo) and the other was the Gijima stand. Both were doing a roaring trade, with people queuing 20 deep (literally!) for coffee. I think another two or three coffee stands wouldn’t have gone amiss.

There were all sorts of competitions running. One company (FlowGear) had a speaking area where they were promoting their product, and after every one-hour demo session they drew a seat number and gave someone an iPad mini. Masterful marketing (lifted from the timeshare industry): they probably got their message to more ears than the rest of TechEd combined.

By the end of TechEd, the Lavazzo coffee makers had served over 4,000 cups of coffee. I hope they don’t have any repetitive strain injuries from making that much coffee!

Session 1 – Becoming Agile and Lean (Martin Cronjé)

This was a practical session where Martin shared some aspects of his methodology for lean / agile development, and stacks of examples of task boards and what they mean to the organisation that is running them. He had a few examples of agile gone bad too. Martin was an excellent speaker, and the examples he gave were engaging and interesting.

I was impressed by his pragmatic attitude and his contention that agile should be implemented as works for the organisation, not based on a textbook without consideration of the factors that matter to the organisation that is implementing it.

I don’t think that the industry is mature enough to use Agile throughout (especially when it comes to business to business engagements where one company is developing one half of the system and another company the other), but it’s interesting to keep an eye on and I think that the advantages of using Agile where it makes sense are real.

Session 2 – Introduction to open and flexible PaaS with Windows Azure (Beat Schwengler)

Beat presented some of the same strategic content that we had covered in the Cloud Strategy day on Tuesday, and then proceeded to give some demos of what Windows Azure can do for you. He demonstrated deploying cloud apps directly from Source Control (works for TFS and GitHub), and spent a bit of time demonstrating Microsoft’s Hadoop offering on Azure – called HDInsight. I was very interested by that last one, it definitely warrants some more attention.

He then demonstrated Windows Azure Mobile Services, which are a powerful mechanism for creating rich services for use with mobile apps (Windows Phone, Android, iOS or HTML5), with integrated push notifications for Windows Store apps (unfortunately not it seems for Android or iOS apps.)

The awesome thing about these mobile services is that you can run 10 of them for free on Azure (with the proviso that you don’t get a guaranteed uptime until you start paying and deploying more nodes).

Session 3 – ASP.NET MVC : Tips for improved maintainability (Peter Wilmot)

Peter seems to be in the same boat as me. He’s a back-end developer forced to participate in a world where most of the work is increasingly moving to the client. He put together this session about how to write MVC code that is maintainable. One highlight of the presentation was when, coming at MVC from a code-centric perspective, he put together a brilliant slide depicting which parts of MVC development need what type of skills, and where most change will happen versus where you want fewer changes.

He also had a few practical rules for how to structure your projects and which features of MVC to use to make things easier (use data annotations on your view models, etc.).

This was one of the most valuable presentations that I attended in the Dev track, it was full of practical advice presented in an accessible manner. I will certainly be covering this content with the developers at my employer.

Session 4 – What did your last presentation die of? (Johan Klut, Blessing Sibanyoni, Jaco Benade, Robbi Laurenson, Rupert Nicolay)

Each participant told a story of a bad mistake they had made in a presentation, explained how they got there, and described what they should have done (and have done ever since) to prevent that particular problem. It was very engaging.

The presenters covered a technique for story telling called the CAST process, as explained on

Session 5 – Panel Discussion: Modern developer practices – the theory and the reality

This session was a group discussion among a few speakers about how development should be done, how developers must take personal responsibility, etc. I didn’t really take anything away from it.

The Flowgear Challenge

End of Wednesday. While sitting down with my colleagues for a quick chat before we went for supper, I saw a tweet from Flowgear: they would give an iPad mini to the person who wrote the best implementation of a session reminder system for TechEd sessions containing a particular keyword. That sounded right up my alley, so I decided to write a Windows Azure implementation of the challenge. I spent the evening getting to grips with Azure, writing a web site where someone could request reminders, and a worker role that inspected the data the web site wrote and periodically sent the requested reminders. I’ll post the details of my implementation, as well as some things I learnt about Azure in the process, in a follow-up post.

The above took me until Thursday morning 1AM.

I was going to write about Thursday and Friday but somehow that never happened. It is now three years later and I have long since lost my notes.

Oh well, we live and learn I suppose (I did win the iPad :-P)

TechEd 2013 Overview – Tuesday

My employer sent me to TechEd Africa in Durban, and I had a whale of a time. I thought it would be fun to chronicle the whole experience, with a few spin-off blog posts about topics I found particularly exciting or rewarding (more on that later!)

Let’s go through day by day, starting with Tuesday.

Cloud Strategy Day

We got invited to the Microsoft Cloud Strategy day, hosted by Beat Schwengler. He is a director in the cloud strategy group of Microsoft, and other than having the coolest twitter handle ever (@cloudbeatsch), he had some very interesting content to share with regards to how Microsoft is viewing cloud in its Azure product, and what they are doing to drive adoption.

Some key takeaways from his session (for me) were:
MS is embracing Open Source technologies in the cloud.
Ubuntu and CentOS are both first-class citizens of Azure (in terms of the IaaS offering), and Azure has a very interesting Hadoop offering – more on that later.

Computing costs are coming way down.
A user can now host 10 web sites and 10 app APIs for free on Azure, provided you are happy to work without an uptime SLA until you start paying for redundancy. For hackers putting apps together, that is a major benefit. You can try ideas for apps and services, and if they don’t work, you just move on without having expended anything other than your time.

Business models and monetisation are critically important.
Beat stressed the importance of being clever about how you monetise your application, and he had some interesting stats on the income generated through advertising versus the income generated through subscription services. He also had us do an exercise where every table in the conference room put together a business plan, which I found worthwhile.

Opening KeyNote

After dashing through and checking into our hotel, our little group of colleagues went to register for the conference at the convention centre in Durban, and attended the opening keynote.

They had two dancers performing a very gymnastic routine, which was interesting to watch if a bit out of tune with the rest of the proceedings.

The keynote consisted mainly of the MS guys working through a modern computing scenario (a business owner requesting an app feature, a developer adding that feature, it being deployed by an IT pro, followed by the business user reporting on the expenditure in Dynamics AX).

At the time I thought that the scenarios were a bit stilted, but on reflection afterwards it wasn’t so bad, they covered a lot of ground and opened a few avenues of interest for me when deciding which sessions to attend.

We opted to skip the opening party, and went to have a pizza at a nice Italian pizza joint we found at a previous TechEd.

Stay tuned for Wednesday’s post…

Visual Studio 2008 and 2010 networking differences for TFS

Someone at work inherited my old development machine and was setting up Visual Studio 2008 and 2010 for his identity.

They could add our TFS server from the VS2010 instance, but not from Visual Studio 2008 – VS gave an error saying that the Team Foundation Server does not exist.

I stepped in to help get it configured, and as a debugging step asked him to run Wireshark and catch the network traffic.

As soon as we couldn’t find the TFS server’s IP address in the trace, I could see what the issue was: VS 2008 uses the system proxy settings, and asks the proxy server to connect to TFS. This fails on our network, because our proxy does not have visibility of our internal network, for security reasons.

As soon as we put in a proxy bypass for local addresses, we could add the TFS Server and all was well.
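The actual fix was in the system proxy settings, but the same effect can be expressed in .NET configuration terms (which Visual Studio honours via devenv.exe.config); something like this, where bypassonlocal tells System.Net to skip the proxy for local addresses:

```xml
<configuration>
  <system.net>
    <defaultProxy>
      <proxy usesystemdefault="true" bypassonlocal="true" />
    </defaultProxy>
  </system.net>
</configuration>
```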

I would imagine that VS2010 either has its own local address check and then bypasses the proxy, or bypasses the proxy on failure. Either way, it wasn’t bothered by the proxy configuration being incorrect.

Maximum length of .NET command line params

Ever wondered what the maximum length of a command line param to a .NET application is?

I am investigating simple integration options between processes and wanted to know the limitation of this method of passing info from one process to another.

I found some documentation on MSDN which states that the maximum command-line param could be 32,768 characters.

So then I put together this .NET app to test that:

Screenshot of resulting two applications running

You can specify the length of data that you want passed to the other process.
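The same round-trip experiment is easy to sketch in other stacks; here is a hypothetical Python equivalent (the hard Windows limit applies to the whole command line, and other operating systems have different limits):

```python
import subprocess
import sys

def roundtrip(n):
    # Start a child process with an n-character argument and return how many
    # characters the child actually received.
    payload = "x" * n
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(len(sys.argv[1]))", payload],
        capture_output=True,
        text=True,
    )
    return int(result.stdout)

print(roundtrip(10_000))
```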

The code that launches the other process looks like this:

And then I modified the other application’s Program.cs file as follows:

Changes to the other process's program.cs file

The result of this test?

I can pass 32,763 characters to the other process, which is 5 fewer characters than the API call allows (4 if you subtract the null terminator on the string).

Does anyone know where the additional 4 characters have gone?