The key takeaways from my session are that recent features in PySpark enable new orders of magnitude of processing power for existing Python libraries, and that we are leveraging these capabilities to build massive-scale data products at Zynga. At Zynga, we’ve used Pandas UDFs to scale Python libraries to much larger data sets and have automated much of our propensity modeling pipeline. This feature enables data scientists to define how to partition a problem, use well-known Python libraries to implement the logic, and achieve massive scale. The broader takeaway is that libraries such as Featuretools, which require Pandas dataframes, can be scaled to massive data sets.

He received his PhD in computer science from UC Santa Cruz.

We use Pandas UDFs in combination with the Featuretools library to perform feature generation on tens of millions of users. Featuretools is a Python library that uses deep feature synthesis to perform feature generation. It generates the wide space of feature transformations and aggregations that a data scientist would explore when manually engineering features, but does so programmatically. We sample data from prior weeks to create a training data set and apply the feature transformations to these players to establish baseline metrics for model performance. Performing this transformation involves loading the Featuretools library, running deep feature synthesis (dfs), and outputting a sample of the results. The resulting table can be used as input to train propensity models.
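To make the idea of generated aggregation features concrete, here is a minimal sketch written in plain Pandas rather than with the Featuretools API. It builds by hand a few of the count, mean, and sum features that deep feature synthesis would enumerate automatically. The event log, its column names, and the `aggregate_features` helper are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical event log: one row per player event (column names are assumptions).
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "session_length": [10.0, 20.0, 5.0, 15.0, 10.0, 30.0],
    "purchase_amount": [0.0, 4.99, 0.0, 0.0, 0.99, 0.0],
})


def aggregate_features(events: pd.DataFrame) -> pd.DataFrame:
    """Build one row per user from count/mean/sum aggregations of the event log."""
    features = events.groupby("user_id").agg(
        event_count=("session_length", "count"),
        mean_session_length=("session_length", "mean"),
        total_purchase_amount=("purchase_amount", "sum"),
    )
    # Return user_id as an ordinary column so the table can be joined to labels.
    return features.reset_index()


player_features = aggregate_features(events)
print(player_features)
```

Each output row describes one player, which is the shape a propensity model would need as training input. A feature generation library automates the step of choosing and stacking aggregations like these instead of writing each one by hand.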
The first phase in our modeling pipeline is extracting data from our data lake and making it accessible as a Spark dataframe. User-defined functions (UDFs) are executed on worker nodes, enabling existing Python code to be executed at tremendous scale. We’ve been able to use PySpark and new features on this platform to overcome all of these challenges.

Principal Data Scientist at Zynga.

The result of the feature generation step is a transformed dataframe with our generated features, and a set of feature descriptors that we can use to transform additional dataframes. The fifth phase in our data pipeline is publishing our propensity model scores to our real-time database.
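The partition-then-apply pattern behind a grouped Pandas UDF can be sketched locally without a Spark cluster. The example below is an illustration under stated assumptions, not the pipeline used at Zynga: it assigns each user to a partition, runs an ordinary Pandas function on each partition, and concatenates the results, which is roughly what Spark does across worker nodes. The `generate_features` function, the `partition_id` column, and the sample data are hypothetical.

```python
import pandas as pd


def generate_features(partition: pd.DataFrame) -> pd.DataFrame:
    """Per-partition logic: takes and returns an ordinary Pandas dataframe."""
    return partition.groupby("user_id", as_index=False).agg(
        event_count=("value", "count"),
        value_sum=("value", "sum"),
    )


# Hypothetical input: one row per user event.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Assign every user to one of a fixed number of partitions (an assumption;
# any key that keeps all of a user's rows together would work).
num_partitions = 2
partitioned = events.assign(partition_id=events["user_id"] % num_partitions)

# Local stand-in for the distributed step: apply the function to each
# partition independently, then combine the per-partition results.
results = (
    pd.concat(
        [generate_features(part) for _, part in partitioned.groupby("partition_id")]
    )
    .sort_values("user_id")
    .reset_index(drop=True)
)
print(results)
```

On Spark 3.0 or later, a function of this shape can be handed to `groupBy("partition_id").applyInPandas(generate_features, schema=...)` on a Spark dataframe, with a schema that matches the returned columns. Because the function only sees Pandas dataframes, it can call any library that expects them.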
