Angus Lam / All blog posts

My internship project at Optimizely 2017

I had the opportunity this summer to work as a Software Engineer Intern at Optimizely. I worked on the Web Squad, who owns the Optimizely web frontend dashboard located on app.optimizely.com. In just 12 weeks, I met new friends, worked on a project that removed a major pain point for Optimizely customers, and hacked on an awesome internal tool for onboarding. It was a really fulfilling summer and the best part was I was right in the heart of San Francisco, a city I've never visited before!

I wanted an internship that gave me insight on the work a full-time software engineer did and allowed me to work on meaningful projects. Optimizely was able to provide me with exactly that.

Cross origin tracking

My main project for the summer was implementing a revamped version of a setting called Cross-Origin Tracking¹, found in Optimizely X Web projects. The origin (heh) of this setting stems from the implementation² of Optimizely's embedded JavaScript snippet for shared storage between domains. In order for Optimizely customers to track their visitors across multiple domains, subdomains, protocols, and ports, they'll need to add an entry to the cross-origin tracking settings for each site they own and want to track. However, setting it up correctly requires Optimizely customers to have some level of technical expertise to implement and use Optimizely on their websites correctly.

Now, you're probably thinking, why don't you just track every event the customer's JavaScript snippet ran on? It'll only exist on the customers' respective sites because why would anyone run the script on anything else. Now they don't even need to know the concept of cross-origin tracking! Problem solved.

Well, the snippet can exist on pages the customers don't want to track:

The snippet code is public. Anyone can put the snippet on their page and it'll track visitor data.
Staging environments. These sites usually run code identical to those that'll be deployed eventually, but since it's subject to change they most likely don't want to track data on those pages.

These are two cases where customers definitely don't want to track visitor data. For the first case, we don't want the origin entries to be overly permissive or else there could be misrepresented data. In the second case, that means the customer would want a way to manually override any currently determined origins, in the case that they're still working on anything under a specific domain or subdomain. Considering these two things, there's no easy way to solve this problem without some user guidance, so we're going to need a way to let the customer tell us which sites to track events.

The problem

The old cross-origin tracking settings took a restrictive-first approach in handling entries--customers needed to add entries manually to the list. Optimizely does provided some nice ways to whitelist a group of domains with match types³, but this was still somewhat confusing to users. Organizations are now deploying their sites to multiple domains and protocols (especially with TLS/HTTPS), so it is necessary to be able to track visitor data between all of those sites correctly. This became a pain point for Optimizely customers, and they ultimately need help from Customer Support Engineers to sort out their problem with tracking data.

To minimize interactions with cross-origin tracking and reduce the instances where customers are inputting the same piece of information twice, we need a way to automatically determine the origin entries for customers with the information (URLs in the pages they added for experimentation) they already provided. Additionally, for power users where they have staging environments or any pages they don't want to track on, they'll need the flexibility to override the automatically determined origins.

Another problem the old cross-origin tracking had was that the settings on the dashboard was inconsistent with the behavior of the JavaScript snippet. The previous settings were located on a per-project basis, where you can only view and change the cross-origin tracking for one project at a time. This sounds useful, because it reduces the amount of information noise the customer has to deal with. Unfortunately, since there is only one JavaScript snippet per account, the cross-origin settings from every projects are combined into one list, then shared between all projects (If you are using the shiny new Custom Snippets⁴, cross-origin tracking is shared between all of your snippets!).

As a result, there's a visual disconnect between the customer dashboard and the snippet they use on their site. In the case where there are cross-origin tracking entries buried in one project that is affecting another project, it quickly gets out of hand.

This means before I can make cross-origin tracking smart, I have to make cross-origin tracking right. And, these are the steps:

Move cross-origin tracking from a project-based setting to an account-based setting.

Implement an intelligent way of discovering the URLs from experiments and determine their cross-origin tracking entries automatically.

Now that we're automatically discovering origins, there needs to be a way for users to remove entries from cross-origin tracking.

Housekeeping

Starting off, the old cross-origin settings data is associated with the projects of an account, even though there is only one snippet per account. That means the data model needed to be modified for both projects and accounts, moving the field from projects to accounts. Since the old cross-origin tracking was an existing feature that many, if not all, customers use rely on for their experiments, the current data within the account will need to be preserved and eventually moved to the project settings. This mean that the old settings cannot be simply removed and replaced with the new version.

The solution we chose was to implement the new cross-origin tracking first in parallel with the old one, along with a feature flag on accounts that determines if the account is ready to use the project-based setting. Once the cross-origin tracking data from each project within an account is migrated to the account model, we can safely flip the switch on the account. The feature flag will determine which set of origins will be built into the account's JavaScript snippet and which dashboard UI gets displayed for the customer. This is an elegant solution that makes the transition seamless to customers.

To accomplish this, I needed to write a migration script that took the data from the old fields, remove the duplicates (in theory, an account can have multiple projects with the same origin entries), and insert them into the new account-based fields. Once the data migrated successfully, I can flip the feature flag on the account to switch them over to the new settings.

Once cross-origin tracking data exist on the account-level, I can continue to implement the functionality to automatically determine the cross-origins for an account. When an experiment is created on Optimizely, the customer is assumed to want to take advantage of the features of Optimizely and enable cross-origin tracking for all subdomains and protocols on the same domain. Sounds simple enough, so let's move onto the next step.

Domains

Like I mentioned earlier, the entries for cross-origin tracking have a property called match types, which helps customers identify and whitelist a set of URLs given a pattern. I can leverage this property match types have to help customers automatically set their origin settings. There is a match type called Substring match that, as the name implies, will match and whitelist any website with a given substring. For example:

Substring match "map"
-> http://maps.google.com
-> https://maps.apple.com
-> https://mapquest.com
-> ...

If I were to enable cross-origin tracking for all subdomains and protocols on the same domain, substring match works well. I can take an experiment's URL and add it as an entry. However, there is a case where Substring match will match with unintended URLs. As an example:

Experiment URL: http://optimizely.com/some/path/here
Origin extracted: optimizely.com
Substring match "optimizely.com"
-> http://www.optimizely.com
-> https://app.optimizely.com
-> https://help.optimizely.com
-> https://notoptimizely.com
-> https://optimizely.companyevil.com

Now, clearly notoptimizely.com and companyevil.com are unintended matches as they exists outside of the optimizely.com domain. We really only want to match the suffix of a root domain and not just the substring in this case. The solution is to introduce a new match type, Suffix match, that prevents matching any domains outside of a given URL's root domain.

The way that Substring match is implemented is basically a preset Regex match, so Suffix match should be implemented the same way. The changes are to insert an explicit start or a period _(^|\\.) at the beginning of the origin and insert an explicit end $ at the end of the origin. This will make sure the root domain of any URL will always be the origin and nothing else.

By implementing Suffix match, it can significantly reduce the amount of data noise customers need to see on their dashboard (as opposed to seeing every entry), while maintaining useful information about their cross-origin tracking settings.

Security

Now that the Suffix match type is set up for use, the last part is to backtrack a little and actually determine the origin to extract from a given experiment URL. A naive solution is to take the root domain from all the experiments the customer is running, dedupe the list, and add them as cross-origin tracking entries with Suffix match. However, there are cases where the user might want to run an experiment on a URL where they don't own the entire root domain, but a subdomain (e.g. GitHub pages, Heroku instances, AWS instances etc). For example:

Experiment URL: http://optimizely.github.io/some/more/paths
Naive solution yields: github.io

In this case we definitely don't want to set github.io as an origin entry. This can even extend into further subdomain "chunks" (turns out there's not really a word to describe the subdomain parts separated by periods), so we need to be a generalized solution. We want the find the set of least permissive origins for the customer's experiments that accomplishes the goal of removing the configuration step.

As to how an automatically discovered origin entry is determined, I decided that if an account runs two or more experiments with URLs that contain the same subdomains chunks, I can assume that they control the subdomain chunk one level lower (can be the root domain too), and the algorithm will provide a cross-origin tracking entry that encompasses those URLs.

That's a lot of words and might not make sense, but the behavior is actually quite simple. Here's a few examples:

Experiment URL: http://three.two.one.happynewyear.com
Auto Origin: three.two.one.happynewyear.com

Experiment URLs:
http://three.two.one.happynewyear.com
https://m.two.one.happynewyear.com
www.two.one.happynewyear.com
Auto Origin: two.one.happynewyear.com

Experiment URLs:
https://three.two.one.happynewyear.com
https://www.happynewyear.com
Auto Origin: happynewyear.com

Thankfully this step can be completed in linear complexity, with a dictionary of root domains to n-ary trees, where the n-ary tree stored the subdomain chunks into each node. Once all of the URLs are broken down into chunks and inserted into the right nodes, it traversed the tree to find the deepest node with only one child for each domain. The result was then added as a Suffix match entry into an account's cross-origin tracking setting. This allows us to help the customer set the least permissive origins that encompasses all of their domains. Yay! We solved cross-origin tracking!

Environments

Some customers have staging or development environments set up on the same root domain (i.e. staging.company.com or dev.company.com), and they don't want to the JavaScript snippet to be enabled on those environments. The last thing is to allow a customer to remove an automatically discovered origin entry. We don't want customers to have the ability remove URLs from the list of automatically discovered URLs, since the automatically discovered origins are should be drawn from the collection of URLs customers have in their account. If users have the ability to remove from the auto-origins list, we'd have to answer questions like how should we keep track of it in our database, how that get should be presented to the user on the frontend, what happens when the customer adds another experiment, what happens when they remove the experiment from the list, etc.

Instead, a more sensible approach is to introduce Blocked Origins, a set of origin entries that trump and block the automatically discovered origins and manually added origins. In the off chance that a good majority of the automatically discovered origins are not suitable, the customer can always disable the feature entirely and revert back to manual input.

With all of these parts combined, this makes up the new version of Cross-Origin Tracking for Optimizely X Web! The feature is now live on Optimizely's dashboard, and hopefully you'll never need to touch the settings page! You can find Cross-Origin Tracking settings today on the Optimizely X's account settings page.

That's all

I had a lot of fun working on my project! I was given a lot of responsibility and at the end it was very rewarding. Although I was the primary engineer for this feature, it wouldn't have been possible without the help and advice from my mentor, other engineers, product managers, and my manager. This post is getting a little long, so I'll end it here. I am really grateful that Optimizely brought me out to San Francisco this summer and I will never forget my experience here.

The 2017 Optimizely interns Zach, Angus, Caitlin, Derek, and Flora sitting on the intern couch in a candid manner

Last updated November 2nd, 2017