The Decolonial Turn in Data and Stories: the Global South and Beyond

How do we define the Global South and its role in decolonization? The Global South cannot be defined simply by which land masses and peoples lie below the Earth’s equator. Rather, there is a plurality of Souths, which exist in every corner of the world, even in the far North. This article will explore the decolonial turn and how it applies to the Global South. It will then turn to case studies from the Inuit in Northern Canada: one showing how decolonization can be implemented at the research stage of collecting data, and another illustrating digital colonialism’s effects on communities who have traditionally lived far-from-digital lives.

The Decolonial Turn and Pluralizing the Souths 

In a recent article, Nick Couldry and Ulises A. Mejias discuss the decolonial turn in detail, linking data extraction everywhere, whether in the Global North or the Global South, to the colonial underpinnings of capitalism, viewed through “. . . the long-term historical lens of attempts to justify the unequal distribution of the world’s resources that began in earnest 500 years ago” (Couldry & Mejias, 2021). This, they argue, is crucial to acknowledge when addressing contemporary discourses such as Big Data and AI for Social Good. Researchers, corporations, and governments need to account for the implications of governing human life and freedom through data extraction practices that are colonial in nature. (Couldry & Mejias, 2021)

In another article, Stefania Milan and Emiliano Treré complicate the idea of the Global South by pluralizing it into the Souths, since tracing disempowerment geographically through data does not always follow a North–South dichotomy. There are nonetheless undeniable inequalities wherever datafication is weaponized by institutions and corporations to manage people, hitting hardest where human rights and laws are most fragile. (Milan & Treré, 2019)

The plurality of the Souths extends into the North and West, for discrimination and inequality know no boundaries, and people who are in any way different or silenced can be found everywhere. That is why, in this discussion of the Global South, I will actually be reviewing case studies from the far North, among the Inuit in Northern Canada, to shed light on counter-powers in the form of decolonizing research and on the impact of datafication and digitization on Indigenous communities.

Decolonizing Research to Decolonize Data: Digital Storytelling

Decolonizing data starts with decolonizing research: the process of collecting and exhibiting data while ensuring representation and lowering harm. Digital storytelling is a process involving immersive workshops in which the researcher–researched relationship transforms into one of teller–listener, and personal stories and narratives are related through a mix of voice, video, photographs, artwork, and music to create a sort of first-person mini-movie. (Willox et al., 2013)

In the study by Willox et al., a mix of Indigenous and non-Indigenous individuals teamed up in 2009 in northern Canada to engage with a remote community and develop a digital narrative method that examined the connection between climate change, health, and well-being, uniting digital media with storytelling as a way to celebrate the individual and the collective. (Willox et al., 2013) The authors found that “. . . by uniting the finished stories together, a rich, detailed, and nuanced tapestry of voices emerge providing context and depth to localized narratives and collective experiences.” (Willox et al., 2013)

Digital storytelling requires a high level of trust and has great potential as a more participatory and democratic form of social research. However, it still raises many personal and political questions while creating this raw form of narrative “data.” Researchers have to decide which stories to share, and whether they should shape how the community is represented in order to give voice to peoples and issues that are generally silenced, without perpetuating stereotypes or misunderstandings. Digital storytelling can disrupt, alter, and/or reverse power dynamics in narrative research by removing the researcher as the teller of others’ stories, building a powerful source of data that comes straight from the lived experiences of individuals and creating “. . . the opening to listen, reflect, learn, trust, and then listen again.” (Willox et al., 2013)

Digital Colonialism and IQ

Inuit traditional knowledge is referred to as IQ (Inuit Qaujimajatuqangit) and represents a set of skills to be constantly practiced and adapted to a changing world. Rather than knowledge that can be held or recited, IQ is cultivated through experiences in the natural world and learned from elders. Unlike Western forms of objective knowledge, there is no separation between knowledge and knowledge-holder (Laugrand and Oosten, 2010), and it cannot be learned from reading or watching videos. (Young, 2019) In Jason C. Young’s research in Igloolik, he heard repeatedly in interviews the sentiment that the internet is killing IQ culture, because community members spend less time on the land or with elders and more time online. (Young, 2019)

“Digital engagement is undermining two key aspects of the IQ system—embodied socialization and experiential learning out on the land. Time spent online can trade off with embodied play outdoors, visits at the homes of other Iglulingmiut, and visits to elders outside of the community.” (Young, 2019)

Because IQ is about adapting to change, community members seek ways to balance technology and lived experience. One example is a Facebook group called “Nunavut Hunting Stories of the Day”, which allows Nunavummiut to share hunting stories and knowledge meant to inspire others, mainly youth, to get out on the land themselves and have their own learning experiences. (Young, 2019) I took a look at the group, or rather a different version of the one cited in the article, called “Inuit Hunting Stories of the Day”, and found it very educational. I noted that in the comments on some posts, people from different Inuit communities throughout the North shared the names of the animals in their own dialects. The open sharing of culture and language in this format is a way to carve out space for Inuit knowledge and data. It is also a space for people to ask and answer questions about hunting and Inuit knowledge.

Conclusion

There are ways that data, AI, and related technologies can be used to promote culture and Indigenous knowledge; however, this does not come automatically. What comes automatically are the patterns laid forth by colonialism, capitalism, consumerism, and development, which serve to further separate the ideas of Global North or West and Global South. When considering Indigenous Data Sovereignty, in the Global South and beyond, as a core tenet for designing global data governance, it is important to see it from all sides. Data should be shared back with the communities to whom it belongs, and it should benefit them, not harm them. It should serve to bring people together, not drive them further apart and further from the land on which they live. It should be a complement to life, not what governs life itself.


Resources

Couldry, N., & Mejias, U. A. (2021). The decolonial turn in data and technology research: what is at stake and where is it heading? Information, Communication & Society. https://doi.org/10.1080/1369118X.2021.1986102

Milan, S., & Treré, E. (2019). Big Data from the south(s): Beyond data universalism. Television & New Media, 20(4), 319–335. https://doi.org/10.1177/1527476419837739 

Cunsolo Willox, A., Harper, S. L., & Edge, V. L. (2013). Storytelling in a digital age: digital storytelling as an emerging narrative method for preserving and promoting indigenous oral wisdom. Qualitative Research, 13(2), 127–147. https://doi.org/10.1177/1468794112446105

Young, J. C. (2019). The new knowledge politics of Digital Colonialism. Environment and Planning A: Economy and Space, 51(7), 1424–1441. https://doi.org/10.1177/0308518x19858998 

Environmental Concerns and Indigenous Data Sovereignty

What is the connection between data governance, the environment, and the rights of Indigenous peoples?

In the intersection between rising technologies based on data and the need to protect the earth and all its peoples, we find Indigenous Data Sovereignty (IDS). Indigenous communities are disproportionately affected by extraction and exploitation practices throughout the world, which harms them and the earth. (IPCC, 2022) How can Indigenous Data Sovereignty help?

Before colonization, Indigenous peoples had sovereignty over their data, in forms such as art and storytelling, and one thing that links the diverse populations of Indigenous communities is their inherent connection to the natural world. These connections to data and the earth have been corrupted by colonization, which has historically legitimized and legalized the exploitation of land, resources, and the people themselves.

We highlight the example of mining metals from the earth to reflect on the term data mining. On the surface, mining appears to be a normal part of our modern world: it is how we extract the metals necessary for many things we use daily, such as smartphones and computers. When we look closer, however, we can see the complexity of this extraction process and the harms it brings to communities and the natural environment. Data mining differs from metal mining in several ways, one being that data does not occur naturally but must be produced, by people. Data mining is often compared to extracting resources, or described as a sort of modern-day land grab. (Couldry and Mejias, 2021) Metal mining also has negative implications for Indigenous data protection and for the protection of Indigenous peoples’ health and environment, a further indication of the colonial nature of the practice. IDS would help to mitigate harm in these areas.

In my last article, global data law was discussed, focusing on the inclusion of the digital Non-Aligned Movement and Indigenous Data Sovereignty. In this post, the connection to how we steer global data governance towards protecting the natural environment will be explored, examining case studies on metal mining in Mexico and exploitation of resources in Colombia. First, we will review the current status of the environment and climate change, including how data and law are used to help Indigenous communities in contemporary times.

Tackling Climate Change with ML

The world is aware that we are in a climate crisis, and we have the technology and means to solve it. Microsoft Research published an article titled “Tackling Climate Change with Machine Learning”, which offers suggestions for switching to low-carbon electricity sources, lowering carbon emissions, removing CO2 from the atmosphere, and generally incorporating ML practices to improve standards for climate impact in the areas of education, transportation, finance, agriculture, and manufacturing. (Rolnick et al., 2019) Meanwhile, the recent UN report on climate change provides the data and research needed to turn the ship around, if only policymakers and corporate powers can get on board and make serious changes, and soon. The report based its findings on scientific knowledge as well as Indigenous and local knowledge to reduce risks from human-induced climate change while evaluating climate adaptation processes and actions. (IPCC, 2022)

The path forward includes a lot of moving parts, and I would like to highlight the importance of Indigenous rights in this process. Indigenous ideologies are based on the core tenet that there is really no separation between people and the earth. On the other hand, governments and laws globally have a long colonial history of treating land as a commodity to be exploited for profit, endorsed by development discourses. (Rojas-Páez and O’Brien, 2021)

I want to stress the importance of mitigating undue harm to Indigenous peoples when applying data and machine learning technologies to help the planet. Sometimes when we set out to do good, we end up hurting people unintentionally and reproducing colonial constructs. It is a delicate balance when outside researchers approach issues thinking they know the answers and what is best for others; but if we combine efforts, take responsibility for our shared planet, and spend more time listening and less time prescribing, I think that ultimately is what helps all people the most.

The Native American Rights Fund (NARF) stands to protect the rights of Indigenous populations in the US, and its current environmental work primarily concerns climate change. On the international level, NARF has represented the National Tribal Environmental Council and the National Congress of American Indians (NCAI) via the United Nations Framework Convention on Climate Change, ensuring the protection of Indigenous rights in international treaties and agreements governing greenhouse gas emissions. NARF represents tribes in court cases, such as a case representing Alaska Native tribes against energy companies for damages. It also helps tribes relocate when necessary, as the impact of climate change in Alaska is immense and immediate. (NARF, 2019) NARF uses the law to help tribes in America, standing against those who benefit from exploiting their lands and resources.

Environmental problems including climate change, habitat destruction, mining wastes, air pollution, hazardous waste disposal, illegal dumping, and surface/groundwater contamination cause an array of health and welfare risks to Indigenous peoples. NARF is one organization helping tribes protect the environment as a top priority and helps to enforce laws such as the Safe Drinking Water Act, the Clean Air Act, and the Clean Water Act. (NARF, 2019)

NARF has not yet helped to enforce laws around data protection and mining practices for Indigenous people. These laws are nascent in their development and governments globally are only beginning to find consensus on their implementation. NARF is well-positioned to lead the way or join the movement that is beginning to ensure indigenous data, history, land, and representation is protected in the digital future.


Indigenous data policy and case studies in Mexico and Colombia

Indigenous Data Sovereignty (IDS) is not only relevant but necessary for creating fairer governance and a more prosperous future for Indigenous peoples (Rodriguez, 2021, p. 139), as was shown repeatedly throughout the book Indigenous Data Sovereignty and Policy (2021). We will rely on two case studies from the book to exemplify this further and connect them to climate change and environmental concerns.

The chapter by Oscar Luis Figueroa Rodriguez focuses on IDS in Mexico and examines unresolved priorities from history that involve the use of data and the particular ways that information and knowledge have been generated, transformed, controlled, and exploited across many contexts, indicating a need for IDS. (Rodriguez, 2021, p. 139) Another chapter highlights Colombia’s struggle with IDS (Rojas-Páez and O’Brien, 2021) and introduces a 2019 ruling by the Jurisdicción Especial para la Paz (“Special Jurisdiction for Peace”, or JEP) which declared nature to be a victim of Colombia’s conflict, pointing to affected ecosystems, including rivers, and the need for them to gain legal protection. (Rojas-Páez and O’Brien, 2021)

Access to resources such as water is globally justified as something to be controlled selectively and exclusively, as nature is commodified; Indigenous narratives, however, stand in opposition to this. (Johnson et al., 2016; Rojas-Páez and O’Brien, 2021) By legally protecting rivers and other ecosystems, historically normalized exploitation practices must come to an end. A century ago, in the 1920s, the implementation of large-scale economic projects led to the legitimization of direct violence against Indigenous peoples. In the Amazon and the northern region of Santander, for example, oil and rubber plantations resulted in the disappearance of several Indigenous communities through enslavement and assassination. (Rojas-Páez and O’Brien, 2021)

This wasn’t a practice of Indigenous data mining but of erasure. Not only were Indigenous peoples and their knowledge misrepresented, but they were being wiped out. The only interest at the time was on what resources could be extracted from the land and how much it was worth. As there is now more of an interest in collecting Indigenous data as another form of resource mining, IDS holds great importance in regard to Indigenous communities for it stands to mitigate “. . . demands for territorial rights, food sovereignty and access to natural resources such as water.” (Rojas-Páez and O’Brien, 2021)

Indigenous worldviews which hold that humans are a part of the land and cannot be separated from it have been undermined by the ideology that land is a commodity to be exploited for economic purposes. Rojas-Páez and O’Brien bring up the question, “. . . why is the human cost of the expansion of the extractive economy not challenged in countries whose Indigenous communities are still facing extermination?” (2021) The authors turn to scholar Julia Suarez-Krabbe, who commented on the invisibility of the impact of colonial practices in places like Colombia and explained that “. . . the force of colonial discourse lies in how it succeeds in concealing how it establishes and naturalizes ontological and epistemological perspectives and political practices that work to protect its power” (Suárez-Krabbe, 2016). (Rojas-Páez and O’Brien, 2021) However, recent rulings like the JEP work to recognize Indigenous ontologies, how their data is represented, and to protect the land.

‘The protection of personal data is a constitutional and fundamental right in Colombia,’ according to Carolina Pardo, a partner in the corporate department of Baker McKenzie in Colombia. Her article, Colombia Data Protection Overview, in DataGuidance, notes that the Congress of Colombia enacted Statutory Law 1581 of 2012, which issues General Provisions for the Protection of Personal Data (‘the Data Protection Law’) and ‘develop[s] the constitutional right of all persons to know, update, and rectify information that has been collected on them in databases or files, and other rights, liberties, and constitutional rights referred to in Article 15 of the Political Constitution.’

Indigenous inspectors have been named to monitor natural resources on reservations since 1987. In 1991, Colombia approved its new Constitution recognizing Indigenous rights, including ethnic and cultural diversity, languages, communal lands, archeological treasures, and the parks and reservations which Indigenous peoples have traditionally occupied, as well as measures to adopt programs to manage, preserve, replace, and exploit their own natural resources. (University of Minnesota Human Rights Library, 1995)

The Colombian government’s efforts and commitments to strengthen the dialogue on human rights have been recognized by political figures of the European Union. Patricia Llombart, the EU’s Ambassador to Colombia, stated that Colombia shares values with the EU and is seen as a reliable and stable partner. Where the EU has been involved, international agreements which include protecting Indigenous rights, as well as labor rights and rights for children, have been signed in Andean countries. (Blanco Gaitan, 2019)

Turning now to the Mexican chapter, we see the same history echoed. Without the knowledge or permission of local Indigenous peoples, external actors have historically conducted research to better understand the value of natural resources in Indigenous territories, demonstrating a lack of understanding of what exploiting things such as minerals, timber, wildlife, plants, and water means for the people who live there, in terms of health and environmental consequences, infrastructure, and investments. (Rodriguez, 2021, pp. 140–141)

For example, extractive practices such as mineral mining profoundly impact Indigenous communities, yet they have been actively promoted by recent presidents in Mexico. In the last 12 years, 7% of Indigenous territories have been lost to mining alone, frequently without Indigenous communities even being informed. (Valladares, 2018, p. 3; Rodriguez, 2021, pp. 140–141)

Mining metals from the earth is necessary for much of the technology we know and love today; however, there is a price to pay, and I am not referring to the cost of the latest iPhone. The major cost falls on the people who live where these metals are extracted, or, rather, where they used to live, if their territories were taken away for the purpose of mining. (Rodriguez, 2021, pp. 140–141)

We use mining as an example because it clearly shows the need for IDS and the consequences of its absence: communities were neither considered nor informed about the extremely invasive methods and exploitation techniques involved in metal mining on their land in Mexico, including not only the use of heavy machinery but massive lixiviation processes, mainly with sodium cyanide, which several European countries have forbidden. (Boege, 2013; Rodriguez, 2021)

In January of this year, the Mexican Congress voted to approve the Federal Law for the Protection of the Cultural Heritage of Indigenous and Afro-Mexican Peoples and Communities. (Hermosillo and Soria, 2022) This law includes protecting Indigenous communities and their rights to property and collective intellectual property, traditional knowledge, and cultural expressions, including cultural heritage, in an “. . . attempt to harmonize national legislation with international legal instruments on the matter, trying to give a seal of ‘inclusivity’ to minorities”. (Schmidt, 2022)

“Intangible cultural heritage is defined as the uses, representations, expressions, knowledge, and techniques; together with the instruments, objects, artifacts, and cultural spaces that are inherent to them; recognized by communities, groups, and, in some cases, individuals as an integral part of their cultural heritage.” (Schmidt, 2022)

These new and relevant definitions, such as “cultural heritage,” “misappropriation,” and “collective property right”, help guide third parties in identifying whether authorization is necessary for the use of Indigenous or Afro-Mexican cultural heritage, as failure to obtain it could result in infringements and/or felonies under Mexican law. (Hermosillo and Soria, 2022)

We can see how the representation of cultural heritage in both of these examples, in Mexico as well as Colombia, has gained importance and legal protection, which is vital to the conversation about data, as cultural heritage represents data about a collective. This further relates to protecting the natural environment, because by protecting cultural heritage, the lands and natural resources of Indigenous communities are also protected.

There is still a place for Indigenous Data Sovereignty as the laws change to protect Indigenous rights.

“IDS could help fill the gap regarding the lack of evaluations as an appropriate approach in the design and implementation of monitoring, evaluation, and learning (MEL) local systems, controlled and used by Indigenous communities.” (Rodriguez, 2021, p. 143)

Rodriguez went on to list recommendations from the Organization for Economic Co-operation and Development (OECD) to move forward on these issues.

The OECD recommends four main areas to strengthen Indigenous economies:

1. improving Indigenous statistics and data governance

2. creating an enabling environment for Indigenous entrepreneurship and small business development at regional and local levels

3. improving the Indigenous land tenure system to facilitate opportunities for economic development

4. adapting policies and governance to implement a place-based approach to economic development that improves policy coherence and empowers Indigenous communities

(OECD, 2019, p. 5; Rodriguez, 2021, p. 143)

Lists like this are helpful; however, they must be approached with caution and in communication with the people whom they aim to help. These steps must be implemented by Indigenous peoples themselves, with the support of organizations such as the OECD.

Through exploring these case studies from Mexico and Colombia, it is clear that, in considering public policies for data governance for Indigenous peoples, there are three main data challenges to remediate: data collection, data access, and relevance, so that Indigenous peoples can access, use, and control their own data and information. (Rodriguez, 2021, p. 144) This must be understood for data governance around the world, noting that different regions have different local concerns, but that all have been negatively influenced and impacted by long-standing colonial and exploitation practices. It is important that we continue to educate ourselves and question broader narratives that stem from colonial roots.

You can stay up to date with Accel.AI workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

www.accel.ai

Join us in driving #AI for #SocialImpact initiatives around the world!

References

Blanco Gaitan, D. (2019, July 2). Challenges of Colombian data protection framework towards a European adequate level of protection. DUO. Retrieved April 24, 2022, from https://www.duo.uio.no/handle/10852/68578

Boege, E. (2013). El despojo de los indígenas de sus territorios en el siglo XXI. Movimiento Mesoamericano Contra el Modelo Extractivo Minero. Retrieved from https://movimientom4.org/2013/06/el-despojo-de-los-indigenas-de-sus-territorios-en-el-siglo-xxi/

Couldry, N., & Mejias, U. A. (2021). The decolonial turn in data and technology research: what is at stake and where is it heading? Information, Communication & Society. https://doi.org/10.1080/1369118X.2021.1986102

Cunsolo Willox, A., Harper, S. L., & Edge, V. L. (2013). Storytelling in a digital age: digital storytelling as an emerging narrative method for preserving and promoting indigenous oral wisdom. Qualitative Research, 13(2), 127–147. https://doi.org/10.1177/1468794112446105

IPCC, 2022: Summary for Policymakers [H.-O. Pörtner, D.C. Roberts, E.S. Poloczanska, K. Mintenbeck, M. Tignor, A. Alegría, M. Craig, S. Langsdorf, S. Löschke, V. Möller, A. Okem (eds.)]. In: Climate Change 2022: Impacts, Adaptation, and Vulnerability. Contribution of Working Group II to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [H.-O. Pörtner, D.C. Roberts, M. Tignor, E.S. Poloczanska, K. Mintenbeck, A. Alegría, M. Craig, S. Langsdorf, S. Löschke, V. Möller, A. Okem, B. Rama (eds.)]. Cambridge University Press. In Press.

Johnson, H., South, N., & Walters, R. (2016). The commodification and exploitation of fresh water: property, human rights, and green criminology. International Journal of Law, Crime and Justice, 44(2016), 146–162.

NARF. (2019, January 25). Protect tribal natural resources. Native American Rights Fund. Retrieved April 8, 2022, from https://www.narf.org/our-work/protection-tribal-natural-resources/

OECD. (2019). Linking Indigenous Communities with Regional Development, OECD Rural Policy Reviews. Paris: OECD Publishing. doi:10.1787/3203c082-en.

Pool, I. (2016). Colonialism’s and post-colonialism’s fellow traveller: the collection, use and misuse of data on indigenous peoples. In: Indigenous Data Sovereignty: Toward an Agenda. ANU Press.

Rodriguez, O. L. F. (2021). Indigenous policy and Indigenous data in Mexico. In Indigenous Data Sovereignty and Policy (pp. 130–147). Routledge.

Schmidt, L. C. (2022). New general law for the protection of cultural heritage of Indigenous and Afro-Mexican peoples and communities in Mexico. The National Law Review. Retrieved April 24, 2022, from https://www.natlawreview.com/article/new-general-law-protection-cultural-heritage-indigenous-and-afro-mexican-peoples-and?msclkid=8dc1a6c6c39511ec987d4aa0940a1ab5

Suarez-Krabbe, J. (2016). Race, Rights and Rebels: Alternatives to Human Rights and Development from the Global South. Lanham: Rowman and Littlefield.

Valladares, de la C. L. R. (2018). El asedio a las autonomías indígenas por el modelo minero extractivo en México. Iztapalapa. Revista de Ciencias Sociales y Humanidades, 85(39), 103–131, julio-diciembre de 2018. ISSN: 2007-9176. Retrieved from http://www.scielo.org.mx/pdf/izta/v39n85/2007-9176-izta-39-85-103.pdf

University of Minnesota Human Rights Library. Minneapolis, Minn.: University of Minnesota Human Rights Center, 1995. Retrieved from http://hrlibrary.umn.edu/iachr/indig-col-ch11.html

Introduction to Machine Learning with Decision Trees

A machine learning model is a program that combs through data to learn, find patterns, and make predictions. A model is trained on a set of examples called training data; given an algorithm, it can reason and learn from that data. For example, suppose you want to build an application that can recognize whether a voice is male or female. You can train a model by providing it with various voices labeled male or female; the algorithm will learn differences in pitch and speech patterns so it can recognize the speaker of a new voice. While there are various models in machine learning, in this tutorial we will begin with one called the decision tree. Decision trees are the basic building block for some of the best models in data science, and they are easy to pick up.

Decision Tree

When it comes to decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision-making. It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. First, we need to understand decision nodes and leaves.

The leaf node gives the final outcome; the decision node is where the data splits. There are two main types of decision trees: classification and regression. Classification trees are used when the target variable takes a discrete set of values. For these tree structures, the leaves represent class labels and the branches represent the features that lead to those class labels. Regression trees are used when the target variable takes continuous values, usually numbers. For simplicity, we will begin with a fairly simple decision tree.

Making the model

Exploring the data

When beginning any machine learning project, the first step is familiarizing yourself with the data. For this, we will use the pandas library, the primary tool data scientists use for exploring and manipulating data. It is brought in with the import pandas as pd command below. To follow along, click this link to the jupyter notebook.

Demonstration of the import command.
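For reference, a minimal sketch of the command the image showed:

```python
# pandas is conventionally imported under the alias pd
import pandas as pd
```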

A vital part of the pandas library is the DataFrame, where data is represented in a tabular format similar to a sheet in Excel or a table in a SQL database. Pandas has powerful methods that are useful for this kind of data. In this tutorial, we’ll look at a dataset that contains housing prices. You can find this dataset on Kaggle.

We will use the pandas function describe(), which gives us a summary of statistics for the columns in our dataset. The summary covers only the columns containing numerical values, which are easier to use with most machine learning models. Loading and understanding your data is a very important part of making a machine learning model.

We load and explore the data with the commands below:

Demonstration of the describe() command.
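A minimal sketch of this step, assuming the dataset was saved as melb_data.csv (the file name and path are assumptions; point pd.read_csv at wherever you saved the Kaggle CSV):

```python
# Hypothetical path to the housing dataset downloaded from Kaggle
file_path = "melb_data.csv"

# Load the data into a DataFrame
home_data = pd.read_csv(file_path)

# Summary statistics for the numerical columns
home_data.describe()
```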

The summary results for the columns can be understood this way.

  • Count shows how many rows have no missing values.

  • The mean is the average.

  • Std is the standard deviation, which measures how numerically spread out the values are.

  • To interpret the min, 25%, 50%, 75%, and max values, imagine sorting each column from the lowest to the highest value. The first (smallest) value is the min. If you go a quarter way through the list, you’ll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced “25th percentile”). The 50th and 75th percentiles are defined analogously, and the max is the largest number.

Selecting data for modeling

Datasets sometimes have a lot of variables that make it difficult to get an accurate prediction. We can use our intuition to pare down this overwhelming information by picking only a few of the variables. To choose variables/columns, we’ll need to see a list of all columns in the dataset. That is done with the command below.

Demonstration of the .columns command.
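A sketch of the command:

```python
# List every column name in the dataset
home_data.columns
```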

There are many ways to select a subset of your data but we will focus on two approaches for now.

  1. Dot notation, which we use to select the “prediction target”

  2. Selecting with a column list, which we use to select the “features”

Selecting The Prediction Target

We can pull out a variable with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data. We’ll use the dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices is:

Demonstration of the variable `y` assignment to housing price.
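A sketch of the assignment, assuming the target column is named Price (adjust to your dataset’s actual column name):

```python
# The prediction target: the column we want the model to predict
y = home_data.Price
```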

Choosing Features

The variables/columns chosen to be fed into the model and later used to make predictions are referred to as features. For this tutorial, they will be the columns that determine home prices. Sometimes you may use all your columns as features; other times fewer features are preferred.

Our model will use fewer features. We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

It looks like this:

Demonstration of the house_features list.
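A sketch of such a list; the column names here are assumptions and depend on the dataset you downloaded:

```python
# Hypothetical feature columns for a housing dataset
house_features = ["Rooms", "Bathroom", "Landsize", "Lattitude", "Longtitude"]
```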

By convention, this data is called X.

Demonstration of the variable `X` assignment to the selected housing features.
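A sketch of the selection:

```python
# Select the feature columns from the DataFrame
X = home_data[house_features]
```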

You can review the data in the features using the .head() command:

Demonstration of the .head() command.
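For example:

```python
# Preview the first five rows of the selected features
X.head()
```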

The Model

When creating our model we will use the library Scikit-learn which is easily the most popular library for modeling typical data frames. More on Scikit-learn.

There are steps to building and using an effective model, they are as follows:

  • Defining — There are various types of models other than the decision tree. Picking the right model and the parameters that go with it is key

  • Train/Fit — this is when patterns are learned from the data.

  • Predict — the model will make predictions from the patterns it learned when training.

  • Evaluation — Check the accuracy of the predictions

Below is an example of a decision tree model defined with scikit-learn and fitted with the features and the target variable.

Example of a decision tree model.
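A minimal sketch of the define-and-fit step, assuming the variables X and y from above:

```python
from sklearn.tree import DecisionTreeRegressor

# Define the model; random_state makes the run reproducible
house_model = DecisionTreeRegressor(random_state=1)

# Fit (train) the model on the features X and the target y
house_model.fit(X, y)
```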

The library is written as sklearn. Many machine learning models allow some randomness during training; by specifying a number for random_state, you ensure you get the same results on each run, which is considered good practice.

We could predict the first few rows of the training data using the predict function.

Demonstration of the .predict() function.
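A sketch of that check:

```python
print("Making predictions for the first 5 houses:")
print(X.head())
print("The predictions are:")
print(house_model.predict(X.head()))
```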

How good is our model?

Measuring the quality of our model is imperative to improving it. A fitting measure of a model’s quality is its predictive accuracy. There are many metrics for summarizing model quality, but we’ll begin with Mean Absolute Error (MAE for short). For MAE, each prediction error is converted to a positive number by taking its absolute value, and the average of these is computed. Simply put, on average our predictions are off by this value.

This is how to calculate the mean absolute error.

Calculating the mean absolute error.
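A minimal sketch, assuming the fitted model from above:

```python
from sklearn.metrics import mean_absolute_error

# Compare the model's predictions with the true prices
predicted_prices = house_model.predict(X)
print(mean_absolute_error(y, predicted_prices))
```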

What we calculated above is called the “in-sample” score: we used the same sample of house-price data both to build and to evaluate our model. Since the patterns were learned from the training data, the model appears accurate on that data. This is bad, because those patterns won’t necessarily hold when new data is introduced. Because a model’s practical value comes from making predictions on new data, we measure performance on data that wasn’t used to build the model. The way to do this is to exclude some data from the model-building process and then use it to test the model’s accuracy on data it hasn’t seen before. This data is called validation data.

The scikit-learn library has the function train_test_split to break up the data into two pieces. We’ll use some of that data as training data to fit the model, and we’ll use the other data as validation data to calculate mean_absolute_error. Here is the code:

Calculating mean_absolute_error.
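A minimal sketch of the split-train-evaluate workflow:

```python
from sklearn.model_selection import train_test_split

# Split the data; random_state guarantees the same split on every run
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define and fit the model on the training portion only
house_model = DecisionTreeRegressor(random_state=1)
house_model.fit(train_X, train_y)

# Evaluate on data the model has never seen
val_predictions = house_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
```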

This is the difference between a model that is almost exactly right and one that is unusable for most practical purposes.

Overfitting and Underfitting

Overfitting refers to when a model fits the training data too closely. Inaccuracies and random fluctuations in the training data are picked up as patterns and concepts by the model. The problem is that these patterns do not apply to new data, which negatively impacts the model’s ability to generalize.

Overfitting can be reduced by:

  • Increasing the training data

  • Reducing the models’ complexity

  • Using a resampling technique to estimate model accuracy.

  • Holding back validation data

  • Limiting the depth of the decision tree with parameters (see below)

There are various ways to control tree depth but here we will look at the max_leaf_nodes argument. This will control the depth of our tree. Below we will create a function to compare MAE scores from different values for max_leaf_nodes:

Function to compare MAE scores.
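A sketch of such a function (the name get_mae is an assumption):

```python
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)
```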

We can follow this by using a for-loop to compare the accuracy of the model with different values for max_leaf_nodes.

for-loop to compare the accuracy of the model.
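A sketch of the loop; the candidate values are illustrative:

```python
# Compare validation MAE for trees of increasing size
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {my_mae:,.0f}")
```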

Here, 50 leaf nodes is optimal, since it yields the lowest MAE score.

Underfitting is when a model cannot learn the patterns in the data. When this happens, it is because the model or algorithm does not fit the data well. It usually happens when you do not have enough data to build an accurate model.

Underfitting can be reduced by:

  • Increasing the models’ complexity

  • Increasing the number of features

  • Increasing the duration of training

While both overfitting and underfitting can lead to poor model performance, overfitting is the more common problem.

Conclusion

There are many ways to improve this model, such as experimenting to find better features or different model types. A random forest algorithm would also suit this model well as it has better predictive accuracy than a single decision tree and it works well with default parameters. As you keep modeling you will learn to use even more sophisticated models and the parameters that go with them.

Follow me here for more AI, Machine Learning, and Data Science tutorials to come!

References

https://towardsdatascience.com/build-your-first-machine-learning-model-with-python-in-7-minutes-30b9e1a3eafa

https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/

Global Data Law and Decolonisation

Anywhere on earth with an internet connection, AI systems can be accessed from the cloud, and teams from different countries can work together to develop AI models, relying on many datasets from across the planet and cutting-edge machine learning techniques. (Engler, 2022)


This global nature of AI can perpetuate the ongoing marginalisation and exploitation of the people behind the data that machine learning relies on. Data protection laws vary per country, and in the U.S., per state, making it difficult for businesses that are trying to do the right thing (Barber, 2021), while simultaneously creating targets, mostly in the under-protected ‘Global South,’ for large corporations to mine data freely and gain power (Couldry & Mejias, 2021), entrenching inequality. What can we do about this? Is the development of a Global Data Law the answer?


As the US and the EU work to align on data protection (Engler, 2022), there is a small call for a re-enlivening of the Non-Aligned Movement (NAM) in the digital sphere. (Mejias, 2020) The NAM is an anti-colonial, anti-imperialist movement, currently consisting of 120 countries, which are not aligned with any major world power. It was founded in 1961 to oppose military blocs in the Cold War and has not yet been adapted to oppose data colonialism; however, some say it should be. (Reddy & Soni, 2021) I love this idea, because we are working with a plethora of cultures with varying value systems, and aligning with big government and corporate powers is not beneficial across the board; in fact, it can be quite harmful. When approaching the idea of global data law, we must support the rights of the people who are most marginalised. “What we need is a Non-Aligned Technologies Movement (NATM), an alliance not necessarily of nations, but of multiple actors already working towards the same goals, coming together to declare their non-alignment with the US and China.” (Mejias, 2020) Mejias calls for NATM to be driven by society and communities rather than state powers, which makes it all the more viable for the current situation.


Central to this approach to global data law is the concept of decolonization. If we are not careful, neo-colonialism will lead to yet another stage of capitalism that is fueled by data. That is why it is essential to involve Indigenous and marginalised peoples, not at the margins but at the centre of debates on global standards for law and data, or else efforts to decolonize will only reinforce colonialism. (Couldry & Mejias, 2021) 


This is tricky. Colonialism and neo-colonialism are strong systems which feed imperial powers, whether they be government or corporate. At the end of the day, ‘good business’ wins out over what is actually good for everyone involved. This fundamentally needs to change. 


What is Global Data Law?

Global data law is an area of great tension, for it is necessary to regulate and protect sensitive data from around the world, whether people live in the EU under the protection of the General Data Protection Regulation (GDPR), in other countries with strong data protection laws, or in areas with less data protection. Creating a “one-size-fits-all” system of global data laws will not work in our diverse world, with its varying tolerances for oversight and surveillance.

 

The GDPR is being used as a model for data protection, but it only protects the privacy of EU citizens, no matter where in the world the data is used. (Reddy & Soni, 2021, p. 8) Compliance with the GDPR is a must, requiring complete transformation in the way organisations collect, process, store, share, and wipe personal data securely, or face exorbitant fines in the tens of millions of euros. (DLA Piper, 2022) Several other countries scattered across the world, such as Canada and Brazil, have developed specific data protection laws in the last few years; however, the protections vary greatly.

 

We can also consider the UN recommendations, such as moving away from data ownership and towards data stewardship for data collectors, while protecting privacy and ensuring peoples’ self-determination over their own data. (The UN, 2022) They stress the need to protect peoples’ basic right not to have their data used or sold without permission, or in ways that could cause undue harm, with respect for what this means across cultures.

 

The UN Roadmap for Digital Cooperation highlights: 

-global digital cooperation

-digital trust and security

-digital human rights

-human and institutional capacity building

-an inclusive digital economy and society

(The UN, 2022)

 

However, what we are seeing in global data law is governments developing their own systems unilaterally, making compliance complicated. For example, in the first years of the GDPR, thousands of online newspapers in the US simply decided to block users from the EU rather than face compliance risks. (Freuler, 2020; South, 2018) Those in the EU were no longer able to access information previously available to them. Businesses that rely on customers from the EU had to trade off the risk of compliance liability against the loss of income.

 

Global data law is indeed complex, and to complicate matters further, next we will briefly dive deeper into the Digital Non-Alignment movement. 

Situating the Digital Non-Aligned Movement

The original Non-Aligned Movement (NAM) was formed by leaders of many countries, mainly from the Global South, who sought a political space to counter central powers through coordinated solidarity and to exercise strategic autonomy, resisting control from the US and the USSR during the Cold War. (Reddy & Soni, 2021; Freuler, 2020) Now, in the digital age, there is a call to recentre on the NAM in the digital realm to protect against not only government powers but Big Tech as well. (Freuler, 2020)

A Non-Aligned Technologies movement would empower civil societies across the globe to act in consort to meet their shared objectives while putting pressure on their respective governments to change the way they deal with Big Tech. The primary goal of NATM would be to transition from technologies that are against the interest of society to technologies that are in the interest of society. (Mejias, 2020)

Current members of the Non-Aligned Movement are in dark blue. The light-blue colour denotes countries with observer-status.

By Maxronneland - https://en.wikipedia.org/wiki/File:Map_of_NAM_Members_and_Observer_states.svg, CC0, https://commons.wikimedia.org/w/index.php?curid=105867196

 

“NAM must once again come together to ensure the free flow of technology and data, while simultaneously guaranteeing protection to the sovereign interests of nations.” (Reddy & Soni, 2021, p.4)

This is incredibly valid, and the countries represented within the NAM need to have a voice in this discussion about global data law. The US, China, the EU, and other wealthy nations should not be solely responsible for regulating open data and sovereignty globally. However, sovereignty is not just for states, which is why Indigenous Data Sovereignty (ID-SOV) should also guide global data law towards decolonization: if we are going to decolonize, we must centre on the rights of those who have been the most colonised.

Turning to Indigenous Data Sovereignty to Inform Global Data Law

Indigenous Peoples’ focus on self-determination is continuously burdened with the implications of data collected and used against them. The UN Declaration on the Rights of Indigenous Peoples (UNDRIP) states that the authority to control Indigenous cultural heritage (i.e. Indigenous data: their languages, knowledge, practices, technologies, natural resources, and territories) should belong to Indigenous communities. (Carroll et al., 2020) It proves extremely difficult to break free from colonial and neo-colonial structures of power imbalance; however, this is exactly what must be the focus in order to decolonise data practices and data law.


We are up against a long history of extraction and exploitation of value through data, representing a new form of resource appropriation comparable to the historical colonial land grab, in which not only land and resources but human bodies and labour were seized, often very violently. The lack of upfront violence in today’s data colonialism doesn’t negate its danger. “The absence of physical violence in today’s data colonialism merely confirms the multiplicity of means by which dispossession can, as before, unfold.” (Couldry & Mejias, 2021)


Contemporary data relations are laced with unquestionable racism (Couldry & Mejias, 2021), along with intersectional discrimination against all those considered marginalised, which is why we turn next to a report that addresses these issues directly, with a focus on surveillance and criminalization via data in the US. 


Highlighting Technologies for Liberation


I love this report, titled Technologies for Liberation, which arose from the need to better understand the disproportionate surveillance and criminalization of Queer, Trans, Two-Spirit, Black, Indigenous, and People of Color (QT2SBIPOC) communities, and to provide a resource for these communities to push back and protect themselves at all levels, from the state-endorsed to the corporate-led. (Neves & Srivastava, 2020) Technologies for Liberation aims to decolonise at a grassroots, community level, focusing on organisers and movement technologists who envision demilitarised, community-driven technologies that support movements of liberation. This is transformative justice at work, centring on safety and shifting power to communities. (Neves & Srivastava, 2020) This is the bottom-up influence on data protection that we need to turn towards to inform a global data law that won’t leave people on the margins.


Conclusion

There are endless areas that need consideration when discussing global data law and decolonisation. We have already touched on a few areas and highlighted movements such as the digital NAM, ID-SOV, and Technologies for Liberation, after introducing the GDPR and the UN recommendations. This article has discussed potential avenues for solutions, including organisational principles and values which are key to this discussion. There are large power imbalances that need to be addressed and deeply rooted systems that need to be reimagined, and we must start with listening to the voices who have been the most silenced, and guiding everyone involved to do the right thing.


“The time has come for us to develop a set of basic principles on which countries can agree so that consumers worldwide are protected and businesses know what is required of them in any geography.” (Barber, 2021)


Resources

Barber, D. (2021, October 2). Navigating data privacy legislation in a global society. TechCrunch. Retrieved March 26, 2022, from https://techcrunch.com/2021/10/02/navigating-data-privacy-legislation-in-a-global-society/

DLA Piper. (2022). EU General Data Protection Regulation - key changes: DLA Piper Global Law Firm. DLA Piper. Retrieved April 8, 2022, from https://www.dlapiper.com/en/asiapacific/focus/eu-data-protection-regulation/key-changes/ 

Couldry, N., & Mejias, U. A. (2021). The decolonial turn in data and technology research: what is at stake and where is it heading? Information, Communication & Society. https://doi.org/10.1080/1369118X.2021.1986102

 

Engler, A. (2022, March 9). The EU and U.S. are starting to align on AI Regulation. Brookings. Retrieved March 26, 2022, from https://www-brookings-edu.cdn.ampproject.org/c/s/www.brookings.edu/blog/techtank/2022/02/01/the-eu-and-u-s-are-starting-to-align-on-ai-regulation/amp/

 

Freuler, J. O. (2020, June 27). The case for a Digital non-aligned Movement. openDemocracy. Retrieved March 26, 2022, from https://www.opendemocracy.net/en/oureconomy/case-digital-non-aligned-movement/ 

Mejias, U. A. (2020, September 8). To fight data colonialism, we need a non-aligned Tech Movement. Science and Technology | Al Jazeera. Retrieved April 7, 2022, from https://www.aljazeera.com/opinions/2020/9/8/to-fight-data-colonialism-we-need-a-non-aligned-tech-movement 

Neves, B. S., & Srivastava, M. (2020). Technologies for Liberation: Toward abolitionist futures. Retrieved March 26, 2022, from https://www.astraeafoundation.org/FundAbolitionTech/

Reddy, L., & Soni, A. (2021, September). Is there space for a Digital Non-Aligned Movement? - HCSS.NL. New Conditions and Constellations in Cyber . Retrieved April 2, 2022, from https://hcss.nl/wp-content/uploads/2021/09/Is-There-Space-for-a-Digital-Non-Aligned-Movement.pdf 

South, J. (2018, August 7). More than 1,000 U.S. news sites are still unavailable in Europe, two months after GDPR took effect. Nieman Lab. Retrieved April 3, 2022, from https://www.niemanlab.org/2018/08/more-than-1000-u-s-news-sites-are-still-unavailable-in-europe-two-months-after-gdpr-took-effect/ 

The UN. (2022). UN secretary-General's Data Strategy. United Nations. Retrieved April 3, 2022, from https://www.un.org/en/content/datastrategy/index.shtml 


PCA Algorithm Tutorial in Python

Principal Component Analysis (PCA)

Principal Component Analysis is an essential dimensionality reduction algorithm: it lowers the dimensionality of a data set by reducing the number of variables while keeping the most crucial information. This makes it a helpful tool for several applications, such as data compression or simplifying business decisions. Keep reading if you want to learn more about this algorithm.

The focus of this algorithm is to reduce the data set to make it simpler while retaining the most valuable possible information. Simplifying the dataset makes it easier to manipulate and visualize the data, resulting in quicker analysis.

Already understand how PCA works? Jump forward to the code!

How is PCA possible?

There are multiple ways to calculate PCA. We’re going to explain the most widely used: eigendecomposition of the covariance matrix.

An eigendecomposition is the factorization of a matrix in linear algebra. We can represent a matrix mathematically through its eigenvalues and eigenvectors; these concepts are explained in section 3, “Calculate the Eigendecomposition of the Covariance Matrix”. With this in mind, let’s dive into the five steps to compute the PCA algorithm:

1. Standardization

Standardization is a form of scaling in which the values are centered around the mean with a unit standard deviation, allowing us to work, for example, with unrelated metrics.

This means every feature is standardized to have a mean of 0 and a variance of 1, putting them all on the same scale. In this manner, each feature can contribute equally to the analysis, regardless of the variance of the variables or whether they are of different types. This step prevents variables with ranges of 0 to 100 from dominating ones with values of 0 to 1, yielding a less biased and more accurate outcome.

Standardization is done mathematically by subtracting the mean and dividing by the standard deviation for each variable value: z = (x − μ) / σ, where the letter z represents the standard score.

Once standardized, every feature will be on the same scale.
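As a minimal NumPy sketch (the toy matrix and its values are invented for illustration):

```python
import numpy as np

# Toy data matrix: one row per observation, one column per feature
X = np.array([[1.0, 40.0],
              [2.0, 80.0],
              [3.0, 60.0]])

# z = (x - mean) / std, applied column by column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```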

2. Covariance Matrix Computation

The covariance matrix is a square matrix that displays the relation between two random features, avoiding duplicated information since the variables we’re working with sometimes can be strongly related.

For example, imagine we have milk and coffee, two drinks that people commonly consume. In a survey, we can ask some people which beverage gives them more energy, noting their score from 0 to 5. The results are the data points we collect to build our covariance matrix. The following table represents data collected from three subjects, with two variables, “C” and “M”, identifying coffee and milk respectively.

This table can be represented in a graph like the following.

As we can see, the horizontal axis represents coffee, while the vertical axis represents milk. We can see vectors created from the values in the previous table, each labeled according to the subjects’ answers. This graph can be summarized as a covariance matrix, which compares how the values of each pair of features vary together.

This covariance matrix is calculated by multiplying the transpose of the data matrix A by A itself, then dividing by the number of samples n, as in the formula below:

C = (AᵀA) / n

Resulting in matrices like these:

We have two instances of covariance matrices here: one is a 2x2 matrix, while the other is a 3x3 matrix, each with a number of variables according to its dimension. These matrices are symmetric with respect to the main diagonal. Why?

The covariance of a variable with itself represents the main diagonal of each matrix.

cov(x,x)= var(x)

Also, we have to take into account that covariance is commutative.

cov(x,y)= cov(y,x)

That way, we see that the top and bottom triangular parts are therefore equal, making the covariance matrix symmetric with respect to the main diagonal.

At this point, we need to take the sign of the covariance into account: if it’s positive, the two variables increase or decrease together (correlated); if it’s negative, one increases as the other decreases (inversely correlated).
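A minimal sketch of this computation in NumPy, using a made-up standardized matrix A (note that np.cov divides by n − 1 rather than n, a common alternative convention):

```python
import numpy as np

# A: standardized data, one row per subject, one column per feature (C, M)
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.5],
              [ 0.0, -0.5]])

C_manual = A.T @ A / A.shape[0]     # transpose A, multiply by A, divide by n
C_numpy = np.cov(A, rowvar=False)   # library version, which divides by n - 1
print(C_manual)
print(C_numpy)
```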

3. Calculate the Eigendecomposition of the Covariance Matrix

In this phase, we will compute the eigenvectors and eigenvalues of the matrix from the previous step, and with them obtain the principal components. Before entering into the process of how we do this, let’s talk about these linear algebra concepts.

In the realm of data science, eigenvectors and eigenvalues are both essential notions that scientists employ. These have a mathematical basis.

An eigenvector of a matrix is a nonzero vector whose direction is unchanged when the matrix is applied to it; the transformation only scales it. The scaling factor is the eigenvalue, a number that stretches or shrinks the vector. Eigenvectors and eigenvalues come in pairs, with the number of pairs equaling the number of dimensions of the matrix.

Principal components, on the other hand, being the major focus of this approach, are variables formed as linear combinations of the starting features. They are constructed to be uncorrelated, with the first component capturing as much of the data’s variance as possible, the second capturing as much of the remaining variance as possible, and so on until all of it is accounted for.

This procedure will assist you in reducing dimensionality while retaining as much information as possible and discarding the components with low information.

Example of a Principal Component, showing the contributions of variables in Python. Image Source: plot — Contributions of variables to PC in python — Stack Overflow

Now that we have all the concepts clear, the real question is: how does the PCA algorithm construct the principal components?

All of the magic in this computation is due to the eigenvectors and eigenvalues. Eigenvectors represent the directions of the axes along which the data varies the most; these are referred to as principal components. Eigenvalues, as we know, are values attached to the eigenvectors, giving the amount of variance that every principal component carries.

If you rank the eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance, just like in the following picture.

Eigenvectors and eigenvalues of a 2-dimensional covariance matrix. Image source: A Step-by-Step Explanation of Principal Component Analysis (PCA) | Built In

With the principal components established, the only thing left to do is calculate the proportion of variance accounted for by each component. This is done by dividing the eigenvalue of each component by the sum of all eigenvalues.
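Here is a small sketch of this step in NumPy, with a made-up 2x2 covariance matrix; np.linalg.eigh is the eigendecomposition routine for symmetric matrices:

```python
import numpy as np

C = np.array([[1.0, 0.6],
              [0.6, 1.0]])                     # toy 2x2 covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigh is made for symmetric matrices
order = np.argsort(eigenvalues)[::-1]          # rank highest to lowest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]          # columns are the principal components

explained = eigenvalues / eigenvalues.sum()    # variance share of each component
print(explained)                               # [0.8 0.2] for this toy matrix
```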

4. Feature Vector

A feature vector is a matrix containing multiple vectors. In this case, it is formed from the most significant principal components, that is, the eigenvectors corresponding to the highest eigenvalues. How many principal components we add to our feature vector is up to us and to the problem we’re going to solve.

Following the same example from the last step, and knowing that λ1 > λ2, our feature vector can be written this way:

Feature Vector with two Principal Components. Image source: A Step-by-Step Explanation of Principal Component Analysis (PCA) | Built In

Also, since our second vector carries less relevant information, we can skip it and build our feature vector with only the first principal component.

Feature Vector with only a Principal Component. Image Source: A Step-by-Step Explanation of Principal Component Analysis (PCA) | Built In

We need to take into consideration that reducing dimensionality causes some information loss, which affects the outcome of the problem. We can discard principal components when it is convenient, that is, when the information they carry is insignificant for the problem at hand.
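Continuing the sketch from step 3, keeping only the top component might look like this (the values are hypothetical):

```python
import numpy as np

eigenvalues = np.array([1.6, 0.4])         # already ranked highest to lowest
eigenvectors = np.array([[0.71, -0.71],
                         [0.71,  0.71]])   # columns are the eigenvectors

k = 1                                      # how many components to keep is up to us
feature_vector = eigenvectors[:, :k]       # drop the less significant column
```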

5. Recast the Data Along the Axes of the Principal Components

Because the input data is still in terms of the starting variables, we must utilize the Feature Vector to reorient the axes to those represented by our matrix’s Principal Components.

It can be done by multiplying the transpose of the feature vector by the transpose of the standardized original dataset:

FinalDataSet = FeatureVectorᵀ × StandardizedOriginalDataSetᵀ
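As a sketch, with Z the standardized data from step 1 and W the feature vector from step 4 (made-up values again):

```python
import numpy as np

Z = np.array([[ 1.2, -0.8],
              [-1.0,  1.1],
              [-0.2, -0.3]])    # standardized data (step 1)
W = np.array([[ 0.71],
              [-0.71]])         # feature vector (step 4), one component kept

final = (W.T @ Z.T).T           # recast the data; equivalent to Z @ W
print(final)                    # one column per kept principal component
```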

Now that we finally have the Final Data Set, that’s it! We’ve completed a PCA Algorithm.

PCA Python Tutorial

When you have basic data, such as the previous example, it is quite straightforward to compute by hand. When working with massive amounts of data, however, scientists are continually looking for efficient ways to compute this algorithm. That’s why we’re going to complete an example in Python; you can follow along in this Jupyter Notebook.

The first thing we’re going to do is import all the datasets and functions we’re going to use. For a high-level explanation of the scientific packages: NumPy is a library of mathematical functions that will help us operate on matrices (also provided by this library).

Seaborn and Matplotlib are both used to generate plots and graphics, and both work directly with NumPy arrays.

sklearn is the import name for scikit-learn, a machine learning library for Python; it ships with several loaded datasets and various machine learning algorithms.

Importing Data and Functions
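The notebook’s import cell isn’t reproduced here, but based on the description above it would look something like this:

```python
import numpy as np                 # mathematical functions and matrices
import matplotlib.pyplot as plt    # plots and graphics
import seaborn as sns              # plots and graphics
from sklearn import datasets       # scikit-learn's loaded datasets
```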

Next, we will declare a class called PCA, which will have all the steps we learned previously in this blog.

PCA Class and Functions

Functions for Eigenvectors and Projection
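The class screenshots aren’t shown here, so below is a minimal sketch of how such a class could be organized, with one method per step of the algorithm; method names like fit_transform are illustrative, not the notebook’s exact code:

```python
import numpy as np

class PCA:
    def __init__(self, n_components):
        self.n_components = n_components

    def standardize(self, X):
        # Step 1: center each feature at 0 with unit standard deviation
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def covariance(self, Z):
        # Step 2: covariance matrix of the standardized data
        return np.cov(Z, rowvar=False)

    def eigendecomposition(self, C):
        # Step 3: eigenpairs of the symmetric covariance matrix,
        # ranked from the highest eigenvalue to the lowest
        eigenvalues, eigenvectors = np.linalg.eigh(C)
        order = np.argsort(eigenvalues)[::-1]
        return eigenvalues[order], eigenvectors[:, order]

    def fit_transform(self, X):
        Z = self.standardize(X)
        C = self.covariance(Z)
        eigenvalues, eigenvectors = self.eigendecomposition(C)
        W = eigenvectors[:, :self.n_components]          # Step 4: feature vector
        self.explained_variance_ratio_ = (
            eigenvalues[:self.n_components] / eigenvalues.sum()
        )
        return Z @ W                                     # Step 5: recast the data
```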

These functions are pretty intuitive and easy to follow. The next step is to implement this PCA Class. We’re going to initialize the variables we’re going to use.

Initialization of Variables

Here we can notice we’re using the datasets module from scikit-learn, then assigning those values to our variables. Scikit-learn comes with various datasets; we’re going to use one of the well-known ‘toy datasets’. These are small, standard datasets, used mostly in algorithm examples or tutorials.

Specifically, we’re loading the diabetes dataset. According to the authors, this data is based on “Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.”
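Loading it looks like this (the variable names are illustrative):

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data      # shape (442, 10): the ten baseline variables
y = diabetes.target    # disease progression one year after baseline
```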

Following the code, we now initialize the class and start calling its functions. When running the code, we get the following outcome.

PCA Initialization

PCA Example in Python
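A hypothetical run of the class sketched above on the diabetes data:

```python
pca = PCA(n_components=2)             # the class sketched earlier
X_reduced = pca.fit_transform(X)      # X loaded in the previous cell
print(X_reduced.shape)                # (442, 2)
print(pca.explained_variance_ratio_)
```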

To see the differences from the original dataset, we can call just the first two functions of the class, producing the following scatterplot:

Scatterplot Graph of Diabetes Dataset

If we do the whole PCA process, we can better visualize the data in this way:

PCA Graph of Diabetes Dataset
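A graph like this can be produced with Matplotlib, assuming X_reduced and y from the cells above:

```python
import matplotlib.pyplot as plt

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)   # color by disease progression
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.colorbar(label="Disease progression")
plt.show()
```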

Now we’ve learned the key computing procedures, which included some crucial linear algebra basics.

In Conclusion

Principal Component Analysis is a straightforward yet powerful algorithm for reducing, compressing, and untangling high-dimensional data. It allows us to see the structure of the data more clearly and use it for various machine learning methods.

Since the original data was reduced in dimensionality while retaining trends and patterns, the final output of this algorithm is easier to manipulate, making further analysis much easier and faster for the machine learning algorithm and letting us set aside unnecessary variables and problems like the curse of dimensionality.

The next step is to use your new transformed variables with other machine learning algorithms to better understand your data and finally finish your research. If you want a general guide on the existing different machine learning methods and when to use them, you can check my previous entry on “Machine Learning Algorithms”.

Follow me here for more AI, Machine Learning, and Data Science tutorials to come!

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.


Data Processing in Python

Generally speaking, data processing consists of gathering and manipulating data elements to return useful, potentially valuable information. Different encoding types will have various processing formats. The best-known formats for encodings are XML, CSV, JSON, and HTML.

With Python, you can manage many encoding processes, and it’s well suited for data processing thanks to its simple syntax, scalability, and cleanliness, which allow different complex problems to be solved in multiple ways. All you’re going to need are some libraries or modules to make those encoding methods work, for example, Pandas.

Why is Data processing essential?

Data processing is a vital part of data science. Having inaccurate and bad-quality data can be damaging to processes and analysis. Good clean data will boost productivity and provide great quality information for your decision-making.

What is Pandas?

When we talk about Pandas, most people associate the name with the black and white bear from Asia. But in the tech world, it’s a recognized open-source Python library, developed as an extension of NumPy. Its role is to support data analysis, processing, and manipulation, offering data structures and operations for numerical tables and time series.

With this said, we can agree that Pandas is a powerful, essential programming tool for those interested in the machine learning field.

Processing CSV Data

Most data scientists rely on CSV files (which stands for “comma-separated values”) in their day-to-day work, because of the simplicity of storing tabular data as plain text, which makes it easy to read and comprehend.

CSV files are easy to create. We can use Notepad or another text editor to make a file, for example:

Then, save the file using the .csv extension (example.csv), selecting the All Files (*.*) option in the save dialog. Now you have a CSV data file.

In the Python environment, you will use the Pandas library to work with this file. The most basic function is reading the CSV data.
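A minimal sketch, assuming the example.csv file created above sits next to the script:

```python
import pandas as pd

df = pd.read_csv("example.csv")   # the file created in the text editor above
print(df)
```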

Processing Data using Pandas

We will use a simple dataset for this tutorial, i.e., the highest-grossing movies dataset. You can download this and other datasets from kaggle.com.

To start working with Pandas, we will import the library into our Jupyter Notebook, which you can find here to follow along with this tutorial.

Pandas is one of the more notable libraries essential to the data science workflow as it provides you with the means to process and wrangle the data. This is vital as many consider the data pre-processing stage to occupy as much as 80% of a data scientist’s time.

Import dataset

The next step is to import the dataset; for this we will use read_csv(), which is a Pandas function. Since the dataset is in a tabular format, Pandas will convert it to a dataframe called data. A DataFrame is a two-dimensional, mutable data structure in Python, a combination of rows and columns like an Excel sheet.
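Something like the following, where the filename is an assumption based on the Kaggle dataset described below:

```python
import pandas as pd

# Hypothetical filename for the highest-grossing movies dataset from Kaggle
data = pd.read_csv("highest_grossing_movies.csv")
```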

This dataset contains data on the highest-grossing movies of each year. When working with datasets it is important to consider: where did the data come from? Some will be machine-generated data. Some of them will be data that’s been collected via surveys. Some could be data that are recorded from human observations. Some may be data that’s been scraped from websites or pulled via APIs. Don’t jump right into the analysis; take the time to first understand the data you are working with.

Exploring the data

The head() function is a built-in Pandas dataframe function used to display the first rows of the dataset; by default, it displays the first five rows. We can specify the number of rows by giving the number within the parentheses.

Here we also get to see what data is in the dataset we are working with. As we can see there are not a lot of columns which makes the data easier to work with and explore.

We can also see how the last five rows look using the tail() function.
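For example, assuming the data dataframe loaded above:

```python
data.head()     # first five rows by default
data.head(10)   # or specify how many rows to display
data.tail()     # last five rows
```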

The function memory_usage() returns a Pandas Series with the memory usage (in bytes) of each column in the dataframe. Knowing the memory usage of a dataframe helps when tackling errors like MemoryError in Python.

In datasets, the information is presented in tabular form, organized in rows and columns. Each column has a name, a data type, and other properties, and knowing how to manipulate the data in the columns is quite useful. We can continue and check the columns we have.

Keep in mind, because this is a simple dataset there are not a lot of columns.
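Both checks look like this on our dataframe:

```python
data.memory_usage()   # bytes used by each column (plus the index)
data.columns          # the column labels of the dataframe
```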

loc[:] can be used to access specific rows and columns as per what you require. If, for instance, you want the first 2 columns and the last 3 rows, you can access them with loc[:]. One can use the labels or row and column numbers with the loc[:] function.

The above code will return the “YEAR”, “MOVIE”, and “TOTAL IN 2019 DOLLARS” columns for the first 5 movies. Keep in mind that the index starts from 0 in Python and that loc[:] is inclusive of both values mentioned. So 0:4 will mean indices 0 to 4, both included.
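The call described would look like this, using the column names from the dataset:

```python
# Rows 0 through 4 (inclusive) and three named columns
data.loc[0:4, ["YEAR", "MOVIE", "TOTAL IN 2019 DOLLARS"]]
```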

sort_values() is used to sort values in a column in ascending or descending order.

The ‘inplace’ attribute here is False, but by setting it to True you can make the change in the original dataframe.
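For example, sorting by year (the column name is taken from the dataset above):

```python
# Descending sort; inplace=False returns a sorted copy instead of
# modifying the original dataframe
data.sort_values(by="YEAR", ascending=False, inplace=False)
```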

You can look at basic statistics of your data using the simple dataframe function describe(), which helps to better understand your data.

value_counts() returns a Pandas Series containing the counts of unique values. value_counts() helps in identifying the number of occurrences of each unique value in a Series. It can be applied to columns containing data.

value_counts() can also be used to plot bar graphs of categorical and ordinal data; the syntax is below.
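Sketches of all three calls on our dataframe:

```python
data.describe()                                # basic statistics per numeric column
data["YEAR"].value_counts()                    # occurrences of each unique value
data["YEAR"].value_counts().plot(kind="bar")   # bar graph of those counts
```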

Finding and Rebuilding Missing Data

Pandas has functions for finding null values if any are in your data. There are several ways to find missing values, and we will look at all of them.

isnull() function: This function provides the boolean value for the complete dataset to know if any null value is present or not.

isna() function: This is the same as the isnull() function.

isna().any() function: This function also gives a boolean value if any null value is present or not, but it gives results column-wise, not in tabular format.

isna().sum() function: This function gives the sum of the null values present in the dataset, column-wise.

isna().any().sum() function: This function condenses the result into a single value indicating whether any null is present. In this case, there is no null value.

When there are null values present in the dataset, the fillna() function will fill them with a value of your choosing, such as 0. Below is the syntax.
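Here are all of the calls above in one place, including the fillna() syntax:

```python
data.isnull()            # True for every cell holding a missing value
data.isna()              # identical to isnull()
data.isna().any()        # per column: does it contain any nulls?
data.isna().sum()        # per column: how many nulls?
data.isna().any().sum()  # single number: how many columns contain nulls
data.fillna(0)           # replace missing values, here with 0
```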

De-Duplicate

De-duplication means removing all duplicate values. When analyzing data, duplicate values affect the accuracy and efficiency of the results. To find duplicate values, the function duplicated() is used, as seen below.

While this dataset does not contain any duplicate values, if a dataset does, they can be removed using the drop_duplicates() function.

Below is the syntax of this function:
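```python
data.duplicated()        # True for each row that repeats an earlier one
data.drop_duplicates()   # returns the dataframe with duplicate rows removed
```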

As we have seen here, we can already conduct fairly interesting data analysis with Pandas, which provides various useful functionalities that are fairly straightforward and easy to use. Different approaches can be used for many different kinds of datasets to find patterns and trends, and to apply more advanced machine learning techniques in the future.

Follow me here for more AI, Machine Learning, and Data Science tutorials to come!


Is Data Mining Ethical?

The idea of data mining is one that sends a chill down my spine. The collection and use of data that relies on peoples’ production and sharing of personal and sensitive information has a certain creep factor. Specifically, when data mining is used in ways inconsiderate of the people behind the data, the creep factor increases dramatically.

The media, researchers, and non-governmental organizations continue to access and reuse sensitive data without consent from Indigenous governing bodies. This has been done recently amidst the COVID-19 pandemic where tribal data in the United States was released by government entities without permission or knowledge of the tribes themselves. There is an effort to address gaps in data and data invisibility of Indigenous peoples in America. However, this can result in unintentional harm while ignoring Indigenous sovereign rights, which need to be protected. (RDA COVID-19 Indigenous Data WG, 2020).

In this article, we will review case studies on data mining in African communities, and on contact tracing for COVID-19 in South Korea and Brazil to demonstrate how ethical AI strategies work in different scenarios and cultures to impart a global perspective. These projects appear beneficial on the surface level, however, they embody a colonial nature that is deeply embedded in our world structures. We will be discussing these cases within the framework of top-down, bottom-up, and hybrid models of ethics in artificial intelligence (AI) which you can read more about here. Before we review the case studies, we will review what data mining is in this context.

Defining Data Mining

What is the difference between data sharing and data mining?

Data sharing implies that there is an owner of the data and openness or agreement to share information. Data mining gives the impression of taking without asking, with no acknowledgment or compensation, while the miners of the data are the sole beneficiaries. However, can data sharing and data mining be one and the same?

Data mining is closely tied to data colonialism, an enactment of neo-colonialism in the digital world which uses data as a means of power and manipulation. Manipulation runs rampant in this age of misinformation, which we have seen heavily at play in recent times as well as throughout history, playing on emotions to steer public opinion.

Case Study 1: Data Mining in the African Context

Data sharing is a prime example of conflicting principles of AI ethics. On one hand, it is the epitome of transparency and a crucial element to scientific and economic growth. On the other hand, it brings up serious concerns about privacy, intellectual property rights, organizational and structural challenges, cultural and social contexts, unjust historical pasts, and potential harms to marginalized communities. (Abebe et al., 2021)

The term data colonialism can be used to describe some of the challenges of data sharing, or data mining, which reflect the historical and present-day colonial practices such as in the African and Indigenous context. (Couldry and Mejias, 2019) When we use terms such as ‘mining’ to discuss how data is collected from people, the question remains, who benefits from the data collection?

The use of data can paradoxically be harmful to the communities it is collected from. Establishing trust is challenging due to the historical actions taken by data collectors while mining data from indigenous populations. What barriers exist that prevent data from being of benefit to African and indigenous people? We must address the entrenched legacies of power disparities concerning what challenges they present for modern data sharing. (Abebe et al., 2021)

One problematic example is non-government organizations (NGOs) that try to ‘fix’ problems for marginalized ethnic groups and can end up causing more harm than good. For instance, a Europe-based NGO attempted to address the problem of access to clean potable water in Buranda, while testing new water accessibility technology and online monitoring of resources. (Abebe et al., 2021)

The NGO failed to understand the perspective of the community on the true central issues and potential harms. Sharing the data publicly, including geographic locations put the community at risk, as collective privacy was violated and trust was lost. In the West we often think of privacy as a personal concern, however, collective identity serves as great importance to a multitude of African and Indigenous communities. (Abebe et al., 2021)

Another case study, in Zambia, observed that up to 90% of health research funding comes from external funders, meaning the bargaining power leaves little room for Zambian scholars. In the study, power imbalances were reported in everything from funding to agenda-setting, data collection, analysis, interpretation, and reporting of results. (Vachnadze, 2021) This example further exhibits the understanding that trust cannot be formed on a foundation of such imbalances of power.

Many of these research projects lead with good intentions, yet there is a lack of forethought into the ethical use of data, during and after the project, which can create unforeseen and irreparable harms to the wellbeing of communities. This creates a hostile environment to build relationships of respect and trust. (Abebe et al., 2021)

To conclude the reflection of this case study, we can pose the ethical question, is data sharing good/beneficial? First and foremost, local communities must be the primary beneficiaries of responsible data-sharing practices. It is important to specify who benefits from data sharing and to make sure that it is not doing any harm to the people behind the data.

Case Study 2: Data Sharing for Contact Tracing during COVID-19

Contact tracing for the COVID-19 pandemic is another example of a complex ethical case of data collection.

Contact tracing can be centralized or non-centralized, which directly relates to top-down and bottom-up methods of data collection. Depending on the country and government, some have taken a more centralized top-down approach, and some have utilized a hybrid approach of government recommendations and bottom-up implementation via self-reporting.

The centralized approach was deployed in South Korea, where, by law and for the purposes of infectious disease control, the national authority is permitted to collect and use the information of all COVID-19 patients and their contacts. In 2020, Germany and Israel tried and failed to adopt centralized approaches, due to a lack of exceptions for public health emergencies in their privacy laws. Getting past the legal barriers can be a lengthy and complex process, not conducive to deploying a centralized contact tracing system during an outbreak. (Sagar, 2021)

Justin Fendos, a professor of cell biology from South Korea, wrote that in supporting the public health response to COVID-19, Korea had the political willingness to use technological tools to its full potential. The Korean government had collected massive amounts of transaction data to investigate tax fraud even before the COVID-19 outbreak. Korea’s government databases hold records of literally every credit card and bank transaction, and this information was repurposed during the outbreak to retroactively track individuals. In Korea, 95% of adults own a smartphone and many use cashless tools everywhere they go, including on buses and subways. (Fendos, 2020) Hence, contact tracing in Korea was extremely effective.

Public opinion about surveillance in Korea has been stated to be overwhelmingly positive. Fatalities in Korea due to COVID-19 were a third of the global average as of April 2020, when it was also said that they were one of the few countries to have successfully flattened the curve. There have been concerns, despite the success, regarding the level of personal details released by health authorities, which have motivated updated surveillance guidelines for sensitive information. (Fendos, 2020)

Non-centralized approaches to contact tracing are essentially smartphone apps that track proximal coincidence with less invasive data collection methods. These approaches have been adopted by many countries, as they don’t face the same cultural and political obstacles as centralized approaches, avoiding legal pitfalls and legislative reform. (Sagar, 2021) Because of this and other reasons, contact tracing doesn’t always work as it did in Korea.

One study focused on three heavily impacted cities in Brazil that had the most deaths from COVID-19 through the first half of 2021. A methodology for applying data mining as a public health management tool included identifying variables of climate and air quality in relation to the number of COVID-19 cases and deaths. They provided forecasting models of new COVID-19 cases and daily deaths in the three Brazilian cities studied. However, the researchers noted that the counting of cases in Brazil was affected by high underreporting due to low testing, as well as technical and political problems, including the spread of misinformation; hence, the study stated that cases may have been up to 12 times greater than investigations indicated. (Barcellos et al., 2021)

We can see from these examples that contact tracing has worked very differently in countries that have contrasting systems of government, and the same approach wouldn’t work for all countries. A lack of trust comes into play as well, and contact tracing didn’t work in many places simply because people didn’t trust the technology or the government behind it, often reflecting judgments based on misinformation. In Brazil, the spread of misinformation was coming from the government, which doesn’t inspire trust.

In America, a July 2020 study found that 41% said they would likely not speak on the phone or text with a public health official and 27% were unlikely to share names of recent contacts (McClain, 2020), which are vital steps that create a bottleneck in the process of contact tracing adoption. While there are concerns with contact tracing and privacy, there is a contradiction and hypocrisy when it comes to the prolific use of social media apps and how much data is freely shared on them on a daily basis. Yet, when it comes to participation in a tracking system for a global pandemic that is built with fundamental principles to protect personal privacy, it can be seen as a threat.

Conclusion

Data ethics issues across the planet are complex and this article only offers a couple of examples of areas of use and tensions. We must keep in mind that data represents real people and collecting or mining data from indigenous communities can be at their detriment, often unknown to the data scientists and companies who reap the benefits. This is not a new story, just a new setting, and we must be cognizant of these instances of colonialism that still penetrate our relations across cultures and across the world.

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

www.accel.ai

Join us in driving #AI for #SocialImpact initiatives around the world!

References

Abebe, R., Aruleba, K., Birhane, A., Kingsley, S., Obaido, G., Remy, S. L., & Sadagopan, S. (2021). Narratives and Counternarratives on Data Sharing in Africa. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 329–341. https://doi.org/10.1145/3442188.3445897

Anane‐Sarpong, E., Wangmo, T., Ward, C. L., Sankoh, O., Tanner, M., & Elger, B. S. (2018). “You cannot collect data using your own resources and put it on open access”: Perspectives from Africa about public health data‐sharing. Developing World Bioethics, 18(4), 394–405. https://doi.org/10.1111/dewb.12159

Barcellos, D. da S., Fernandes, G. M. K., & de Souza, F. T. (2021). Data based model for predicting COVID-19 morbidity and mortality in metropolis. Scientific Reports, 11(1), 24491. https://doi.org/10.1038/s41598-021-04029-6

Bezuidenhout, L., & Chakauya, E. (2018). Hidden concerns of sharing research data by low/middle-income country scientists. Global Bioethics, 29(1), 39–54. https://doi.org/10.1080/11287462.2018.1441780

Chilisa, B. (2012). Indigenous Research Methodologies. SAGE.

Couldry, N., & Mejias, U. A. (2019). Data Colonialism: Rethinking Big Data’s Relation to the Contemporary Subject. Television & New Media, 20(4), 336–349. https://doi.org/10.1177/1527476418796632

Fendos, J. (2020). How surveillance technology powered South Korea’s COVID-19 response. Brookings.

Hooker, S. (2018). Why “data for good” lacks precision. Medium.

Maxmen, A. (2019). Can tracking people through phone-call data improve lives? Nature, 569(7758), 614–617. https://doi.org/10.1038/d41586-019-01679-5

McClain, C. (2020, November 13). Key findings about Americans’ views on COVID-19 contact tracing. Pew Research Center.

RDA COVID-19 Indigenous Data WG. “Data sharing respecting Indigenous data sovereignty.” In RDA COVID-19 Working Group (2020). Recommendations and guidelines on data sharing. Research Data Alliance. https://doi.org/10.15497/rda00052

Sagar, R. (2021). What is Hybrid AI? Analytics India Magazine.

Walsh, A., Brugha, R., & Byrne, E. (2016). “The way the country has been carved up by researchers”: ethics and power in north–south public health research. International Journal for Equity in Health, 15(1), 204. https://doi.org/10.1186/s12939-016-0488-4

Walter, M., Kukutai, T., Carroll, S. R., & Rodriguez-Lonebear, D. (2020). Indigenous Data Sovereignty and Policy (M. Walter, T. Kukutai, S. R. Carroll, & D. Rodriguez-Lonebear, Eds.). Routledge. https://doi.org/10.4324/9780429273957

Big Nudging and Misinformation in the Era of COVID-19

There are many worries about the Information Age, or the Misinformation Age, in which we find ourselves, and how living in the digital world is driving us further away from democracy and self-determination. In my last post, I introduced neo-colonialism, which is enforced through data colonialism and digital colonialism. In this post, I will review these terms as a precursor to discussing how Big Nudging and misinformation in the era of COVID-19 are affecting our free will. I argue that if we can become aware of these forces and work together, perhaps we can move toward democracy and not away from it. To do this, we can take some tips from the US Surgeon General, which I review below.

Data Mining and Big Nudging Help to Spread Misinformation

Data mining is a term used to describe the act of data collection in a manner that is reminiscent of colonialism. Data colonialism is when Big Data is collected and used as a way to control or manipulate populations. (Couldry and Mejias, 2019) Digital colonialism is a parallel term that covers the use of digital technologies being used for social, economic, and political domination. (Kwet, 2021)

Big Nudging could be considered data colonialism in action, although who holds the reins of power is not always clear. Is Big Nudging merely a tool that can be used for control, or can it also be used for good?

The concept of nudging is akin to ‘influence with an agenda’: external forces influence individual or group behaviors and decisions. Nudge theory was first made popular by Richard Thaler, a behavioral economist, and Cass Sunstein, a legal scholar. Nudging coaxes behavior without forcing, tweaking the environments in which we make decisions by utilizing insights about our mental processes. It can be used on family, say to remind a loved one to take their daily medicine, or on a larger scale, by requiring people to opt out of organ donation as opposed to opting in. The idea is that we still have the choice, without any economic or other incentives, and without forced mandates. (Yates, 2020) When this psychological tool relies on Big Data, it is called Big Nudging. This can be subtle and dangerous when people are unaware that they are being nudged, believing wrongly that they are acting of their own free will.

Political campaigners are massive culprits in this, combining profiling with big nudging to target which demographic groups individuals belong to, gathering data on what issues are most significant in order to procure support for their propositions. Big nudging has been strongly suspected to be used in many large political campaigns, such as Brexit and the 2016 US presidential election. (Wong, 2019)

“The term ‘big nudging’ has emerged to represent using big data and AI to exploit psychological weaknesses to steer decisions — creating problems such as damaging social cohesion, democratic principles, and even human rights.” (Vinuesa et al., 2020, p. 3)

Big Nudging plays on our emotions, and works almost too well, especially with spreading misinformation. This may explain why one study found that false news stories were 70% more likely to be shared than true stories (Vosoughi et al. 2018), and why they often go viral. During the pandemic, nudging has been used alongside mandates for things like mask-wearing and social distancing, with varying results. (Dudás & Szántó, 2021) Some efforts were indeed used for good, such as handwashing campaigns, however, the threats of Big Nudging spreading misinformation appear to outweigh the benefits.

What can be done about Misinformation for COVID-19?

Recently, the Surgeon General of the United States, Dr. Vivek H. Murthy, put out a report on the dangers of misinformation about COVID-19 as a public health concern. As the next step, Murthy has put out a request for information “. . . on the impact and prevalence of health misinformation in the digital information environment during the COVID-19 pandemic.” (Lesko, 2022).

In the report, Murthy listed several reasons for the rapid spread of misinformation, as well as calls to action for a whole-of-society effort to combat misinformation for the pandemic and beyond. This is extremely useful and could help to curb Big Nudging on multiple fronts.

Here are the reasons misinformation tends to spread so quickly on online platforms:

  1. The emotional and sensational nature heightens psychological responses like anxiety and produces a sense of urgency to react and share.

  2. Incentivization for likes, comments, etc. rewards engagement over accuracy.

  3. Popularity and similarity to previous content are favored by algorithms, which can cause confusion and reinforce misunderstanding. (Murthy, 2021)

Distrust of the government and/or the healthcare system can further cause misinformation to flourish. It is especially prevalent in areas of significant societal division and political polarization, and for those who have experienced racism or other inequities, misinformation can spread even easier. (Murthy, 2021)

The US healthcare system is privatized and has shown bias along socioeconomic lines and against minorities, so it is not difficult to understand people’s mistrust of it. However, the over-reliance on emotionally charged misinformation leaves everyone confused, not knowing what to trust or believe. A recent analysis found that a widely used algorithm in US hospitals, which helps manage the care of about 200 million people each year, has been systematically discriminating against Black people. However, by finding variables other than healthcare costs to calculate individual medical needs, biases were reduced by 84%. This shows that more diversity is needed in algorithm design teams, and more testing needs to be done before using these algorithms in people’s lives. (Ledford, 2019)

How can we address health misinformation, and hopefully prevent misinformation in other spheres going forward?

The Surgeon General listed some recommendations for taking action:

Equip Americans with the tools to identify misinformation, make informed choices about what information they share, and address health misinformation in their communities, in partnership with trusted local leaders.

Expand research that deepens our understanding of health misinformation, including how it spreads and evolves; how and why it impacts people; who are most susceptible; and which strategies are most effective in addressing it.

Implement product design and policy changes on technology platforms to slow the spread of misinformation.

Invest in longer-term efforts to build resilience against health misinformation, such as media, science, digital, data, and health literacy programs and training for health practitioners, journalists, librarians, and others.

Convene federal, state, local, territorial, tribal, private, nonprofit, and research partners to explore the impact of health misinformation, identify best practices to prevent and address it, issue recommendations, and find common ground on difficult questions, including appropriate legal and regulatory measures that address health misinformation while protecting user privacy and freedom of expression (Murthy, 2021)

The US Surgeon General provided many tips for healthcare workers, educators, journalists, tech companies, governments, and the public on how to combat health misinformation, including an emphasis on creating resilience to misinformation. (Murthy, 2021) Misinformation exists independently of colonialism in all of its forms, yet has been used as a tool to keep people controlled and to nudge people towards decisions that feed systems of control. These systems have been adopted by the algorithms that direct what we see online, and our own emotions do the rest of the work.

My question is this: can we apply Dr. Murthy’s advice in order to decolonize ourselves and the digital world, by building resistance to misinformation and Big Nudging and truly making our own democratic decisions for the pandemic and in the future? Can we learn from all of this and move forward stronger, armed with the knowledge that systems that are made to benefit people but are built like a business, such as the US healthcare system, are not working for us, and democratically call for better systems that truly serve all people? If we can figure out how to combat misinformation and Big Nudging, perhaps we can move toward democracy and not away from it, but to do that we must educate ourselves and be able to recognize what is false and what is manipulative and call it out, shut it out, and move on.

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

Join us in driving #AI for #SocialImpact initiatives around the world!

References

Couldry, N., & Mejias, U. A. (2019). Data Colonialism: Rethinking Big Data’s Relation to the Contemporary Subject. Television & New Media, 20(4), 336–349. https://doi.org/10.1177/1527476418796632

Dudás, L., & Szántó, R. (2021). Nudging in the time of coronavirus? Comparing public support for soft and hard preventive measures, highlighting the role of risk perception and experience. PLOS ONE, 16(8). https://doi.org/10.1371/journal.pone.0256241

Gramacho, W., Turgeon, M., Kennedy, J., Stabile, M., & Mundim, P. S. (2021). Political Preferences, Knowledge, and Misinformation About COVID-19: The Case of Brazil. Frontiers in Political Science, 3. https://doi.org/10.3389/fpos.2021.646430

Kwet, M. (2019). Digital colonialism: US empire and the new imperialism in the Global South. Race & Class, 60(4), 3–26. https://doi.org/10.1177/0306396818823172

Ledford, H. (2019). Millions of black people affected by racial bias in health-care algorithms. Nature, 574(7780), 608–609. https://doi.org/10.1038/d41586-019-03228-6

Lesko, M. (2022). Impact of health misinformation in the digital information … hhs.gov. Retrieved March 10, 2022, from https://www.federalregister.gov/documents/2022/03/07/2022-04777/impact-of-health-misinformation-in-the-digital-information-environment-in-the-united-states

Lucero, V. 2022. From CTA/CTT to voter tracing? risk of data misuse in the Philippines. (February 2022). Retrieved February 16, 2022 from https://engagemedia.org/2022/philippines-contact-voter-tracing/

Murthy, V. H. (2021). Confronting health misinformation — hhs.gov. hhs.gov. Retrieved March 10, 2022, from https://www.hhs.gov/sites/default/files/surgeon-general-misinformation-advisory.pdf

Vinuesa, R., Azizpour, H., Leite, I., Balaam, M., Dignum, V., Domisch, S., Felländer, A., Langhans, S. D., Tegmark, M., & Fuso Nerini, F. (2020). The role of artificial intelligence in achieving the Sustainable Development Goals. Nature Communications, 11(1), 233. https://doi.org/10.1038/s41467-019-14108-y

Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news online. Science, 359, 1146–1151. http://doi.org/10.1126/science.aap9559

Wong, S. 2019. Filter bubbles and big nudging: Impact on Data Privacy and Civil Society. (September 2019). Retrieved February 22, 2022 from http://www.hk-lawyer.org/content/filter-bubbles-and-big-nudging-impact-data-privacy-and-civil-society#:~:text=Similar%20to%20filter%20bubbles%2C%20big%20nudging%20also%20involves,of%20nudge%20with%20the%20use%20of%20Big%20Data.

Yates, T. (2020, March 13). Why is the government relying on nudge theory to fight coronavirus? . The Guardian. Retrieved March 12, 2022, from https://www.theguardian.com/commentisfree/2020/mar/13/why-is-the-government-relying-on-nudge-theory-to-tackle-coronavirus



An Introduction to the Ethics of Neo-Colonialism

There are many frameworks to think about and describe ethics applied to Artificial Intelligence, but my writing on this topic thus far has lacked consideration of colonialism, which at its core and in practice is completely void of ethics. Colonialism is a deeply rooted world system of power and control that plays out in ways that become normal, yet are far from anything that would be considered ethical. In today’s world that relies so heavily on technology, colonialism within data and the digital world is a fundamental problem.

There is a strong separation between the dominant powers and the people and communities that they profit from. This is often framed by seeing the Global North as separate from the Global South. In reality, there is a separation between urban centers which are largely in the Global North, and everyone else, with those in the Global South bearing the brunt of the power imbalance. We use the terminology of Global North and Global South broadly, but this review references examples not specific to this framework. One such instance regarding digital colonialism affecting Inuit communities in Northern Canada is key to our exploration. This case study will appear in a future article.

There are two ways neo-colonialism is being discussed in sociotechnical language: digital colonialism and data colonialism. These are parallel terms and may be considered one and the same, however, we will look at how they have been described independently.

When digital technology is used for social, economic, and political domination over another nation or territory, it is considered digital colonialism. Dominant powers have ownership over digital infrastructure and knowledge, which perpetuates a state of dependency within the hierarchy, situating Big Tech firms at the top and hosting an extremely unequal division of labor, which further defines digital colonialism. (Kwet, 2021)

Data colonialism addresses Big Data in the context of the predatory practices of colonialism. Capitalism depends on the data from the Global South, which represents a new type of appropriation attached to current infrastructures of connection. (Couldry and Mejias, 2019, p. 1) We see a pulley system of interdependence; however, the concentration of power is clear.

We cannot address colonialism without also addressing capitalism. Colonialism came first, and historical colonialism, with its violence and brutality, paved the way for capitalism. In order to decolonize, we need to fully overhaul the systems of capitalism and consumerism. We cannot add on little bits of law or regulations to govern data and the digital world in an attempt to decolonize. We need a full system change, and it is going to take a lot of work.

We are at the dawn of a new stage of capitalism, following the path laid out by data colonialism, just as historical colonialism paved the way for industrial capitalism. We can’t yet imagine what this will look like, but we know that at its core is the appropriation of human life through data. (Couldry and Mejias, 2019, pp. 1–2)

Not only is this a problem because it creates global inequality; capitalism also notably threatens the natural environment. Its structural imperative is based on an insatiable appetite for growth and profit, causing overconsumption of Earth’s material resources, not to mention overheating the planet. (Kwet, 2020) Mining cobalt in the Congo has detrimental effects not just on the earth but on people’s lives, relying on harsh child labor (Lawson, 2021). The Congo is the source of over 50% of the world’s cobalt, an essential raw mineral found in cell phones, computers, and electric vehicles, as well as in lithium batteries, which will see an increase in demand alongside renewable energy systems. (Banza Lubaba Nkulu et al., 2018) Data mining causes harm to people and the environment not only in how data is collected but also in how it is stored long-term: data centers alone account for 2% of human carbon emissions, rivaling airlines. There are plans and efforts to lower emissions from data centers, which need to be pursued across industries, alongside efforts to address the underlying issues of dependence due to capitalism and consumerism.

“What decolonial thinking, in particular, can help us grasp is that colonialism — whether in its historic or new form — can only be opposed effectively if it is attacked at its core: the underlying rationality that enables continuous appropriation to seem natural, necessary, and somehow an enhancement of, not a violence to, human development.” (Couldry and Mejias, 2019, p. 16)

Conclusion

This is merely an introduction to the topics of data colonialism and digital colonialism. In future posts, we will provide many examples that explore various corners of the world and the impact of digital and data colonialism in different ways, including data mining, and case studies in the African indigenous context as well as the scenario topical to most, contact tracing for the COVID-19 pandemic. Within data mining, we will discuss how or even if data mining is different from data sharing, as well as contextualize data mining alongside resource mining from the Earth.

Further examples include the impact of internet usage in indigenous communities such as the Inuit as well as in South America, where their local knowledge is waning due to the influence of digital colonialism. In order to have a truly ethical AI, there needs to be a large shift in the ethics of society, and the decolonization of data and the digital world is a good starting point.

References

Abebe, R., Aruleba, K., Birhane, A., Kingsley, S., Obaido, G., Remy, S. L., & Sadagopan, S. (2021). Narratives and Counternarratives on Data Sharing in Africa. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 329–341. https://doi.org/10.1145/3442188.3445897

Banza Lubaba Nkulu, C., Casas, L., Haufroid, V., De Putter, T., Saenen, N. D., Kayembe-Kitenge, T., Musa Obadia, P., Kyanika Wa Mukoma, D., Lunda Ilunga, J.-M., Nawrot, T. S., Luboya Numbi, O., Smolders, E., & Nemery, B. (2018, September). Sustainability of artisanal mining of cobalt in DR Congo. Nature Sustainability. Retrieved February 12, 2022, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6166862/

Couldry, N., & Mejias, U. A. (2019). Data Colonialism: Rethinking Big Data’s Relation to the Contemporary Subject. Television & New Media, 20(4), 336–349. https://doi.org/10.1177/1527476418796632

Data Centers. Data Centers | Better Buildings Initiative. (2021). Retrieved February 12, 2022, from https://betterbuildingssolutioncenter.energy.gov/sectors/data-centers

Fendos, J. (2020). How surveillance technology powered South Korea’s COVID-19 response. Brookings.

Kwet, M. (2021, May 6). Digital colonialism: The evolution of US empire. Longreads. Retrieved 08/02/2022, from https://longreads.tni.org/digital-colonialism-the-evolution-of-us-empire

Lawson, M. F. F. (2021). The DRC mining industry: Child labor and formalization of small-scale mining. Wilson Center. Retrieved February 12, 2022, from https://www.wilsoncenter.org/blog-post/drc-mining-industry-child-labor-and-formalization-small-scale-mining#:~:text=Of%20the%20255%2C000%20Congolese%20mining,own%20tools%2C%20primarily%20their%20hands.

Walter, M., Kukutai, T., Carroll, S. R., & Rodriguez-Lonebear, D. (2020). Indigenous Data Sovereignty and Policy (M. Walter, T. Kukutai, S. R. Carroll, & D. Rodriguez-Lonebear, Eds.). Routledge. https://doi.org/10.4324/9780429273957

Young, J. C. (2019). The new knowledge politics of digital colonialism. Environment and Planning A: Economy and Space, 51(7), 1424–1441. https://doi.org/10.1177/0308518X19858998




Neural Networks

What is a neural network?

A neural network is a system inspired by the human brain that is designed to recognize patterns. Simply put, it is a mathematical function that maps a given input, in conjunction with information from other nodes, to develop an output.

What can a neural network do?

Neural networks have a wide range of real-world and industrial applications; a few examples are:

  • Guidance systems for self-driving cars

  • Customer behavior modeling in business analytics

  • Adaptive learning software for education

  • Facial recognition technology

And that’s just scratching the surface.

How does a neural network work?



A simple neural network includes an input layer, an output layer, and, in between, a hidden layer.

  • Input layer- this is where data is fed and passed to the next layer

  • Hidden layer- this does all kinds of calculations and feature extractions described below

  • Output layer- this delivers the final result

Each layer is connected via nodes. The nodes are activated when data is fed to the input layer. It is then passed to the hidden layer where processing and calculations take place through a system of weighted connections. Finally, the hidden layers link to the output layer — where the outputs are retrieved.
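To make this flow of data concrete, here is a minimal sketch of a single forward pass in NumPy; the layer sizes and the tanh activation are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.2, 0.7, 1.0])    # data fed to the input layer (3 features)
W1 = rng.normal(size=(3, 4))     # weighted connections, input -> hidden
W2 = rng.normal(size=(4, 2))     # weighted connections, hidden -> output

hidden = np.tanh(x @ W1)         # hidden layer: calculations on the inputs
output = hidden @ W2             # output layer: the final result
print(output)
```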



What python libraries or packages can you use?

There are a variety of useful Python libraries that can be used:

  • Pytorch- Apart from Python, PyTorch also has support for C++ with its C++ interface if you’re into that.

  • Keras- one of the most popular open-source neural network libraries for Python.

  • Tensorflow- is one of the best libraries available for working with Machine Learning on Python

  • Scikit-learn- It includes easy integration with different ML programming libraries like NumPy and Pandas

  • Theano- is a powerful Python library enabling easy defining, optimizing, and evaluation of powerful mathematical expressions

  • Numpy- concentrates on handling extensive multi-dimensional data and the intricate mathematical functions operating on the data.

  • Pandas- is a Python data analysis library and is used primarily for data manipulation and analysis.

How to implement a neural network with a single hidden layer in basic Python to solve an XOR logic problem?

The Python language can be used to build neural networks, from simple ones to the most complex. In this tutorial, you will learn how to solve an XOR logic problem. The XOR problem is a valuable challenge because it is the simplest linearly inseparable problem that exists. It is a classic problem that helps in understanding the basics of Deep Learning. XOR (Exclusive OR) compares two input bits and generates one output bit; in other words, the output is 1 when you have one input or the other but not both. The table below represents what we will implement in the code.

A | B | A XOR B
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

To begin, we will use the Python library NumPy, which provides great functions and simplifies calculations. You can follow along in this linked Jupyter Notebook, or in your favorite text editor, and begin by defining parameters.

Next is to define the weights and biases between each layer. Weights and biases are the learnable parameters of neural networks and other machine learning models. Weights determine how much influence an input has on the output. Biases are constant values: they have no incoming connections, but they have outgoing connections with their own weights, and they guarantee activation even when all inputs are zero. In this case, I did not use a bias. The weights have been set to random values.

I will then create functions for the activation of the neuron. Here we use the sigmoid function, which is normally used when the model is predicting probability. After activation, forward propagation begins; this is the calculation of the predicted output. After we have calculated the output activations, they are returned to be used in further calculations. All values calculated during forward propagation are stored, as they will be required during backpropagation. Backpropagation is the method our neural network uses to learn: it determines how much we should adjust the weights to improve the accuracy of our predictions.

In the end, we run the neural network for 10,000 epochs and view the loss function. An epoch is one full cycle through the training data, in which the data is passed both forward and backward; a forward pass and a backward pass together count as one, and the data is used just once per epoch. A loss function quantifies the difference between the expected outcome and the outcome produced by the neural network.
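The notebook’s code isn’t reproduced here, so below is a runnable sketch of the network described above, with a single hidden layer of two neurons and sigmoid activations. One flagged deviation: the post says it used no biases, but XOR is generally not learnable by a network this small without them, so this sketch adds bias terms.

```python
import numpy as np

np.random.seed(0)

# XOR truth table: four input pairs and their expected outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Learnable parameters: weights set to random values, plus bias terms
# (a deviation from the original walkthrough, as noted above)
w_hidden = np.random.uniform(-1, 1, (2, 2))
b_hidden = np.zeros((1, 2))
w_output = np.random.uniform(-1, 1, (2, 1))
b_output = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
EPOCHS = 10000
for _ in range(EPOCHS):
    # Forward propagation: calculate the predicted output,
    # storing the activations for use in backpropagation
    hidden = sigmoid(X @ w_hidden + b_hidden)
    output = sigmoid(hidden @ w_output + b_output)

    # Loss: mean squared difference between expectation and prediction
    loss = np.mean((y - output) ** 2)

    # Backpropagation: determine how much to adjust each weight
    d_output = (output - y) * output * (1 - output)
    d_hidden = (d_output @ w_output.T) * hidden * (1 - hidden)
    w_output -= lr * hidden.T @ d_output
    b_output -= lr * d_output.sum(axis=0, keepdims=True)
    w_hidden -= lr * X.T @ d_hidden
    b_hidden -= lr * d_hidden.sum(axis=0, keepdims=True)

print(round(loss, 4))
print(output.round(2))   # should approach [0, 1, 1, 0]; if training gets stuck
                         # in a local minimum, rerun with a different seed
```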

Following are the predictions of the neural network on test inputs:



We know that for XOR, inputs 1,0 and 0,1 give output 1, and inputs 1,1 and 0,0 give output 0. Another way to replicate this experiment is with TensorFlow, which differs from NumPy in one major respect: TensorFlow is designed for use in machine learning and AI applications, and so has libraries and functions designed for those applications. PyTorch is also great because it has good control over the Graphics Processing Unit, which is very useful for visualizing data to get additional insights into your work. In conclusion, writing the code and experimenting with the various ways to build this neural network will strengthen your skills and boost your knowledge of the inner workings of neural networks.








You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.








Hybrid Models of Ethics in AI

Hybrid approaches are a blend of top-down and bottom-up methodologies for AI ethics. In this article, we dive deeper into hybrid models of ethics for AI and give two examples of how they can be applied. We will explore why hybrid models are more hopeful than top-down or bottom-up methodologies on their own for ethical AI development, and ask questions regarding what problems they may face in the future.

First, we will delve into MIT’s moral machine as one example of hybrid ethics being taught to systems for self-driving vehicles. Then we will explore a study of hybrid ethics being trained on ethical medical situations.

We conclude this exploration by further examining the meaning and construct of hybrid ethics for AI while linking the case studies as an exercise in exploring the potential positive and negative impacts of hybrid ethical AI approaches.

How do we define a hybrid model of ethics for AI?

A hybrid model of top-down and bottom-up ethics for AI has a base of rules or instructions, but is then also fed data to learn from. Real-world human ethics are complex, and a hybrid approach may minimize the limitations of top-down and bottom-up approaches to machine ethics, combining rule-based cognition and protracted ethical learning. (Suresh et al., 2014)

Hybrid AI combines the most desirable aspects of bottom-up approaches, such as neural networks, and top-down approaches, also referred to as symbolic AI. Neural networks extract patterns from huge combined data sets; then, rule-based systems can manipulate and retrieve that information, utilizing algorithms to understand symbols. (Nataraj et al., 2021) Research has observed the complementary strengths and weaknesses of bottom-up and top-down strategies. Recently, a hybrid program synthesis approach has been developed, improving top-down inference by utilizing bottom-up analysis for web data extraction. (Raza et al., 2021) When we apply this to ethics and values, ethical concerns that arise from outside of the entity are emphasized by top-down approaches, and the cultivation of implicit values arising from within the entity is addressed by bottom-up approaches.

MIT’s Moral Machine as a Hybrid Model for AI Ethics

MIT’s Moral Machine is a hybrid model of AI ethics. It is an online judgment platform geared toward citizens from around the world portraying the moral dilemmas of unavoidable accidents involving automated vehicles (AVs), and what choices individuals would assign for them to respond. Examples include whether to spare humans versus pets or pedestrians versus passengers, with many factors to consider such as gender, age, fitness, and social status. The Moral Machine collects this data and maps it regionally to compare homogeneous vectors of moral preferences in order to provide data to engineers and policymakers in the development of AVs and to improve trust in AI. (Awad et al., 2018) This research is a hybrid of top-down and bottom-up because it collects data from citizens in a bottom-up manner, while also considering top-down morals, principles, and fundamental rules of driving.

Example from the Moral Machine, where we see the choice between hitting the group on the left or the right. Which would you choose?

However, if the data shows that most people prefer to spare children over a single older adult, would it then become more dangerous for an elderly individual to walk around alone? What if we were to see a series of crashes that avoid groups of schoolchildren but run over an unsuspecting lone elder? The situations given in the simulations ask you to choose between two scenarios, each resulting in unavoidable death. These decisions are made from the comfort of one's home and may be made differently in the heat of the moment. Is it better to collect these decisions in this way, versus observing what people do in real scenarios? Where would a researcher acquire this data for training purposes? Would police accident reports or insurance claims data offer insights?

It is useful to collect this data; however, it must also be viewed alongside other considerations. Real-life scenarios will not always be so black and white. I personally despise the 'trolley problem' and emulations of it, which make us choose who deserves to live and who will be sacrificed. We may think we would hit one person to save a group of people, but who would truly want to be making that decision? It feels awful to be in this position; however, this is the point behind the simulation. In order to build trust in machines, ordinary people need to make these decisions to better understand their complexity and ramifications. Considering the MIT Moral Machine has collected data from over 40 million people, does this take the responsibility away from a single individual?

What they found was that although there are differences across countries and sections of the globe, there is a general preference to spare human lives over animals, spare more lives over fewer, and spare younger lives over older. Looking at the world in sections, there are trends that emerged in the West versus South versus East. For instance, in the Eastern cluster, there was more of a preference for sparing the law-abiding over the law-breaking, and in the Southern cluster, there were more tendencies toward sparing women over men. Policymakers should note that differences abound between individualistic versus collectivist cultures. Individualistic cultures value sparing the many and the young, whereas collectivist cultures value the lives of elders. How these preferences will be understood and considered by policymakers is yet to be determined. (Awad et al., 2018)

Hybrid Ethics for Moral Medical Machines

The second example we will examine is an experiment that used six simulations to test a moral machine that would emulate the decisions of an ethical medical practitioner in specific situations, such as a patient refusing life-saving treatment. The decisions were based on ethics defined by Buchanan and Brock (1989), and the moral machine would copy the actions of the medical professional in each circumstance. (van Rysewyk & Pontier, 2015)

It appears straightforward to run an experiment based on a theoretical case study and tell the machine what a human would do, and then the machine can simply copy the same actions. However, how many simulations would it need to be trained on before it could be trusted to act on its own in real-life situations?

We may indeed come across patients refusing life-saving medication, whether due to irrational fear or religion or a host of other reasons. Additional outlying considerations include whether relatives or primary caregivers have opposing opinions to the treatment. Additionally, if there are financial constraints, there could be other complications that make each situation unique. A human medical professional would be able to consider all factors involved and approach each case anew. A moral machine would be basing predictions on past data, which may or may not be sufficient to address the unique needs of each real-life scenario.

Theoretically, the machine would learn more and more over time, and potentially even perform better at ethical dilemmas than a human agent. However, this experiment with six basic simulations doesn't inspire the utmost confidence that we are getting there quickly. Nonetheless, it gives us a good example of hybrid ethics for AI in action, since the system acts within a rule-based framework while also learning from case-based reasoning.

In these cases, the system balances the beneficence, non-maleficence, and autonomy of the patient. (Pontier & Hoorn, 2012; van Rysewyk & Pontier, 2015) Another paper on this topic added a fourth consideration: justice. It went on to describe a medical version of the trolley problem, in which five people need organ transplants and one person in a coma has all the organs the five need to live. Would you kill one to save five? (Pontier et al., 2012)

Conclusion

Could a hybrid of top-down and bottom-up methodologies be the best application for ethical AI systems? Perhaps; however, we must be aware of the challenges it presents. We must examine the problems posed by hybrid approaches when meshing a combination of diverse philosophies and dissimilar architectures. (Allen et al., 2005) Still, many agree that a hybrid of top-down and bottom-up would be the most effective model for ethical AI development. Simultaneously, we need to question the ethics of people, both as the producers and consumers of technology, whilst we assess morality in AI.

Additionally, while hybrid systems that lack effective or advanced cognitive faculties will appear functional across many domains, it is essential to recognize when additional capabilities will be required. (Allen et al., 2005)

Regarding MIT's Moral Machine, it is interesting to collect this data in service of creating more ethical driverless vehicles and of promoting more public trust in them; however, its usefulness is yet to be proven. (Rahwan, 2018) AVs will be a part of our daily lives, so it is valuable to know that public opinions are being considered.

In the field of medicine, there is a broader sense of agreement on ethics than in something like business ethics; however, healthcare in the United States is a business, which causes decisions for and against the treatment of a patient to become ethically blurred.

It will be vital as we move forward to identify when additional capabilities will be necessary, however functional hybrid systems may be across a variety of domains, even with limited cognitive faculties. (Allen et al., 2005) AI development must be something we keep a close eye on as it learns and adapts; we must note where it can thrive, and where a human is irreplaceable.

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

Join us in driving #AI for #SocialImpact initiatives around the world!


References

Allen, C., Smit, I., & Wallach, W. (2005). Artificial Morality: Top-down, Bottom-up, and Hybrid Approaches. Ethics and Information Technology, 7(3), 149–155. https://doi.org/10.1007/s10676-006-0004-4

Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J. F., & Rahwan, I. (2018). The Moral Machine experiment. Nature, 563(7729), 59–64. https://doi.org/10.1038/s41586-018-0637-6

Buchanan, A. E., & Brock, D. W. (1989). Deciding for Others: The Ethics of Surrogate Decision Making. Cambridge University Press.

Pontier, M. A., & Hoorn, J. F. (2012). Toward machines that behave ethically better than humans do. Proceedings of the Annual Meeting of the Cognitive Science Society.

Pontier, M. A., Widdershoven, G. A. M., & Hoorn, J. F. (2012). Moral Coppélia: Combining Ratio with Affect in Ethical Reasoning.

Rahwan, I. (2018). Society-in-the-loop: programming the algorithmic social contract. Ethics and Information Technology, 20(1), 5–14. https://doi.org/10.1007/s10676-017-9430-8

Suresh, T., Assegie, T. A., Rajkumar, S., & Komal Kumar, N. (2022). A hybrid approach to medical decision-making: diagnosis of heart disease with machine-learning model. International Journal of Electrical and Computer Engineering (IJECE), 12(2), 1831. https://doi.org/10.11591/ijece.v12i2.pp1831-1838

Vachnadze, G. (2021). Reinforcement learning: Bottom-up programming for ethical machines. Marten Kaas. Medium.

van Rysewyk, S. P., & Pontier, M. (2015). A Hybrid Bottom-Up and Top-Down Approach to Machine Medical Ethics: Theory and Data (pp. 93–110). https://doi.org/10.1007/978-3-319-08108-3_7

Wallach, W., Allen, C., & Smit, I. (2020). Machine morality: bottom-up and top-down approaches for modelling human moral faculties. In Machine Ethics and Robot Ethics (pp. 249–266). Routledge. https://doi.org/10.4324/9781003074991-23


Machine Learning Algorithms Cheat Sheet

Machine learning is a subfield of artificial intelligence (AI) and computer science that focuses on using data and algorithms to mimic the way people learn, progressively improving accuracy. As such, machine learning is one of the most interesting methods in computer science these days, applied behind the scenes in products and services we consume in everyday life.

In case you want to know what Machine Learning algorithms are used in different applications, or if you are a developer and you’re looking for a method to use for a problem you are trying to solve, keep reading below and use these steps as a guide.

Machine Learning Algorithms Cheat Sheet by LatinX in AI™. Download the pdf: https://github.com/latinxinai/AI-Educational-Resources/raw/master/CheatSheets/Machine%20Learning%20Cheat%20Sheet.pdf

Machine Learning can be divided into three different types of learning: Unsupervised Learning, Supervised Learning, and Semi-supervised Learning.


Unsupervised learning uses data that is not labeled, so the machine must work without guidance, finding patterns, similarities, and differences on its own.

On the other hand, supervised learning involves the presence of a "teacher", who is in charge of training the machine by labeling the data it works with. The machine then receives examples that allow it to produce a correct outcome.

There is also a hybrid of these types of learning: semi-supervised learning works with both labeled and unlabeled data. This method uses a tiny set of labeled data to train, labels the rest of the data with corresponding predictions, and finally produces a solution to the problem.

To begin, you need to know the number of dimensions you're working with, meaning the number of inputs in your problem (also known as features). If you're working with a large dataset or many features, you can opt for a Dimension Reduction algorithm.


Unsupervised Learning: Dimension Reduction

A large number of dimensions in a data collection can have a significant influence on machine learning algorithms' performance. The "curse of dimensionality" is a term used to describe the troubles high dimensionality can cause, for example, the "distance concentration" problem in clustering, where the distances between different data points converge to the same value as the dimensionality of the data increases.

Techniques for minimizing the number of input variables in training data are referred to as “Dimension Reduction”. 


Now you need to be familiar with the concepts of Feature Extraction and Feature Selection to keep going. Feature extraction is the process of translating raw data into numerical features that can be processed while keeping the information in the original data set. It produces better outcomes than applying machine learning to raw data directly.

Feature extraction underlies three well-known algorithms for dimensionality reduction: Principal Component Analysis, Singular Value Decomposition, and Linear Discriminant Analysis. You need to know exactly which tool you want to use to find patterns or infer new information from the data.

If you're not looking to combine the variables of your data, and instead want to remove unneeded features by keeping only the important ones, then you can use the Principal Component Analysis algorithm.


PCA (Principal Component Analysis)

It's a mathematical algorithm for reducing the dimension of data sets to simplify the number of variables while retaining most of the information. This trade-off of accuracy for simplicity is extensively used to find patterns in large data sets.


In terms of linear connections, it has a wide range of uses when large amounts of data are present, such as media editing, statistical quality control, and portfolio analysis, as well as applications like face recognition and image compression.
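As a quick illustration, here is a minimal PCA sketch using scikit-learn on the classic Iris data (our choice of dataset, purely for demonstration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 4 features per sample
pca = PCA(n_components=2)             # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2): fewer variables, most information kept
print(pca.explained_variance_ratio_)  # how much variance each component retains
```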

Alternatively, if you want an algorithm that works by combining the variables of your data, a simple PCA may not be the best tool for you to use. Next, you can have a probabilistic model or a non-probabilistic one. Probabilistic data involves an element of random selection and is preferred by most scientists for more accurate results, while non-probabilistic data doesn't involve that randomness.

If you are working with non-probabilistic data, you should use the Singular Value Decomposition algorithm.

SVD (Singular Value Decomposition)

In the realm of machine learning, SVD allows data to be transformed into a space where categories can be easily distinguished. This algorithm decomposes a matrix into three different matrices. In image processing, for example, a reduced number of vectors are used to rebuild a picture that is quite close to the original.

Compression of an image with a given number of components. Source: Singular Value Decomposition | SVD in Python (analyticsvidhya.com)

Compared with the PCA algorithm, both can reduce the dimensionality of the data. But while PCA skips the less significant components, SVD turns them into special data, represented as three different matrices, that are easier to manipulate and analyze.
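A minimal sketch of that idea with NumPy's built-in SVD, reconstructing a matrix from only its top components (the matrix here is random, standing in for, say, an image):

```python
import numpy as np

A = np.random.rand(100, 50)                  # stand-in for, e.g., a grayscale image
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10                                        # keep only the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # low-rank reconstruction

error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"relative reconstruction error with {k} components: {error:.3f}")
```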

When it comes to probabilistic approaches, it’s better to use the Linear Discriminant Analysis algorithm for more abstract problems.

LDA (Linear Discriminant Analysis)

Linear Discriminant Analysis (LDA) is a classification approach in which two or more groups have previously been identified, and fresh observations are categorized into one of them based on their features. It’s different from PCA since LDA discovers a feature subspace that optimizes group separability while the PCA ignores the class label and focuses on capturing the dataset's highest variance direction.

This algorithm uses Bayes’ Theorem, a probabilistic theorem used to determine the likelihood of an occurrence based on its relationship to another event. It is frequently used in face recognition, customer identification, and medical fields to identify the patient’s disease status.

Distribution of 170 face images of five subjects (classes) randomly selected from the UMIST database in (a) PCA-based subspace, (b) D-LDA-based subspace, and (c) DF-LDA-based subspace. Source: (PDF) Face recognition using LDA-based algorithms (researchgate.net)
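For a concrete starting point, here is a minimal scikit-learn sketch of LDA on labeled data (the dataset is chosen purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)    # unlike PCA, LDA uses the class labels y
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)     # projection that maximizes class separability

print(X_proj.shape)                  # (150, 2)
print(lda.predict(X[:3]))            # LDA can also classify fresh observations
```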



The next step is to decide whether or not you want your algorithm to learn from responses, meaning whether you want to develop a predictive model based on labeled data to teach your machine. If you'd rather use unlabeled data so your machine can work without guidance and search for similarities, you may use the Clustering techniques.

On the other hand, the process of picking a subset of relevant features (variables, predictors) for use in model creation is known as feature selection. It helps in the simplicity of models to make them easier to comprehend for researchers and users, as well as the reduction of training periods and the avoidance of the dimensionality curse.

These branches include the Clustering, Regression, and Classification methods.

Unsupervised Learning: Clustering 

Clustering is a technique for separating groups with similar characteristics and assigning them to clusters. If you're looking for a hierarchical algorithm:

Hierarchical Clustering

This type of clustering is one of the most popular techniques in Machine Learning. Hierarchical Clustering helps an organization classify data to identify similarities and distinct groupings and features, so that pricing, goods, services, marketing messages, and other aspects of the business can be better targeted. The hierarchy displays the data in a tree data structure known as a Dendrogram. There are two ways of grouping the data: agglomerative and divisive.

Agglomerative clustering is a "bottom-up" approach. To put it another way, each item is first thought of as a single-element cluster (leaf). The two clusters that are the most comparable are joined into a new larger cluster at each phase of the method (nodes). This method is repeated until all points belong to a single large cluster (root).

Divisive clustering works in a "top-down" way. It starts at the root, where all items are grouped in a single cluster, then separates the most diverse items into two clusters at each iteration. The procedure repeats until every item is in its own group.
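A minimal sketch of agglomerative ("bottom-up") clustering with SciPy, building the linkage tree and then cutting it into flat clusters (the toy data and cluster count are our choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)            # toy data
Z = linkage(X, method="ward")        # agglomerative ("bottom-up") merge tree

# Cut the dendrogram into 3 flat clusters; scipy's dendrogram(Z) would draw it.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```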

In case you’re not looking for a hierarchical solution, you must determine whether your method requires you to specify the number of clusters to be used. You can utilize the Density-based Spatial Clustering of Applications with Noise algorithm if you don't need to define it.


DBSCAN (Density-based Spatial Clustering of Applications with Noise)

When it comes to arbitrary-shaped clusters or detecting outliers, it’s better to use Density-Based Clustering. DBSCAN is a method for detecting those arbitrary-shaped clusters and the ones with noise by grouping points close to each other based on two parameters: eps and minPoints.

The eps parameter specifies the maximum distance between two points for them to be considered part of the same cluster, while minPoints is the minimum number of points needed to create a cluster. This algorithm is used, for example, in the analysis of outliers among Netflix servers. The streaming service runs thousands of servers, and normally less than one percent of them become unhealthy, which degrades streaming performance. The real problem is that unhealthy servers aren't easily visible; to solve this, Netflix uses DBSCAN by specifying a metric to be monitored, collecting data, and finally passing it to the algorithm to detect the server outliers.



One everyday use case is e-commerce product recommendation, applying DBSCAN to the data on products the user has bought before.
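A minimal DBSCAN sketch with scikit-learn, using a toy two-moons dataset to show the two parameters and the noise label (the values are chosen only for illustration; scikit-learn calls minPoints min_samples):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: arbitrary-shaped clusters a K-Means would mangle.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # eps = neighborhood radius, min_samples = minPoints

print(set(db.labels_))               # cluster ids; -1 marks noise/outlier points
```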

In case you need to specify the number of clusters, there are three existing algorithms you could use, including K-Modes, K-Means, and Gaussian Mixture Model. Next, you need to know if you’re going to work with categorical variables, which are discrete variables that capture qualitative consequences by grouping observations (or levels). If you’re going to use them, you may opt for K-Modes.


K-Modes

This approach is used to group categorical variables. We determine the total mismatches between these types of data points. The fewer the differences between our data points, the more similar they are. The main difference between K-Modes and K-Means is that for categorical data points we can’t calculate the distance since they aren’t numeric values.

This algorithm is used for text mining applications, document clustering, topic modeling (where each cluster group represents a specific subject), fraud detection systems, and marketing.
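A minimal K-Modes sketch; note it relies on the third-party kmodes package rather than scikit-learn, and the toy categorical data is invented for illustration:

```python
import numpy as np
from kmodes.kmodes import KModes     # third-party package: pip install kmodes

X = np.array([
    ["red",  "small", "cotton"],
    ["red",  "small", "wool"],
    ["blue", "large", "wool"],
    ["blue", "large", "cotton"],
])
km = KModes(n_clusters=2, init="Huang", n_init=5)
labels = km.fit_predict(X)           # similarity = count of matching categories
print(labels, km.cluster_centroids_)
```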

For numeric data, you should use K-Means clustering.

K-Means

Data is clustered into a k number of groups in such a manner that data points in the same cluster are related while data points in other clusters are further apart. This distance is frequently measured with the Euclidean distance. In other words, the K-Means algorithm tries to minimize distances within a cluster and maximize the distance between different clusters.

Search engines, consumer segmentation, spam/ham detection systems, academic performance, defects diagnosis systems, wireless communications, and many other industries use k-means clustering.
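A minimal K-Means sketch with scikit-learn (random toy data; k is our choice):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])               # cluster assignment per point
print(km.cluster_centers_)           # centroids minimizing within-cluster distance
```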

If the intended result is based on probability, then the Gaussian Mixture Model should be used.

GMM (Gaussian Mixture Model)

This approach implies the presence of many Gaussian distributions, each of which represents a cluster. The algorithm will determine the probability of each data point belonging to each of the distributions for a given batch of data.

GMM differs from K-Means in that in GMM we don't know whether a data point belongs to a specified cluster, so we use probability to express this uncertainty, whereas the K-Means method is certain about the assignment of each data point as it iterates over the whole data set. The Gaussian Mixture Model is frequently used in signal processing, language recognition, anomaly detection, and genre classification of music.
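A minimal Gaussian Mixture sketch with scikit-learn, highlighting the soft, probabilistic assignments that distinguish it from K-Means:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

print(gmm.predict(X[:5]))            # hard labels, as K-Means would give
print(gmm.predict_proba(X[:5]))      # soft per-cluster probabilities, unlike K-Means
```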

If you use labeled data to train your machine, you first need to specify whether it is going to predict numbers. If it is, you can choose Regression Algorithms.


Supervised Learning: Regression

Regression is a machine learning algorithm in which the outcome is predicted as a continuous numerical value. This method is commonly used in banking, investment, and other fields.

Here, you need to decide whether you'd rather have speed or accuracy. If you're looking for speed, you can use a Decision Tree algorithm or a Linear Regression algorithm.

Decision Tree

A decision tree is a flowchart-like tree data structure. Here, the data is continuously split according to a given parameter. Each parameter lives in a tree node, while the outcomes of the whole tree are located in the leaves. There are two types of decision trees:

  • Classification trees (Yes/No types), here the decision variable is categorical.

  • Regression trees (Continuous data types), where the decision or the outcome variable is continuous.

Decision trees come in handy when there are intricate interactions between the features and the output variables. They perform better than other methods when there are missing features, a mix of categorical and numerical features, or a large variance in the scale of features.

This algorithm is used to enhance the accuracy of promotional campaigns, detect fraud, and detect serious or preventable diseases in patients.
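A minimal regression-tree sketch with scikit-learn (the dataset and depth are chosen only for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)          # continuous target -> regression tree
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict(X[:3]))                     # predictions are read off the leaves
```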

Linear Regression

Based on a given independent variable, this method predicts the value of a dependent variable. As a result, this regression approach determines if there is a linear connection between the input (independent variable) and the output (dependent variable). Hence, the term Linear Regression was coined.

Linear regression is ideal for datasets in which the features and the output variable have a linear relationship. It's usually used for forecasting (which is particularly useful for small firms to understand the sales effect), understanding the link between advertising expenditure and revenue, and in the medical profession to understand the correlations between medicine dose and patient blood pressure.
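A minimal linear-regression sketch with scikit-learn; the advertising-spend numbers are invented to echo the forecasting example above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented numbers: advertising spend (input) vs revenue (output).
spend = np.array([[1.0], [2.0], [3.0], [4.0]])
revenue = np.array([2.1, 3.9, 6.2, 8.1])

reg = LinearRegression().fit(spend, revenue)
print(reg.coef_, reg.intercept_)     # slope and intercept of the fitted line
print(reg.predict([[5.0]]))          # forecast for a new spend level
```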


Alternatively, if you need accuracy for your algorithm you can use the following three algorithms: Neural Network, Gradient Boosting Tree, and Random Forest.


Neural Network

A Neural Network is required to learn the intricate non-linear relationship between the features and the target. It’s an algorithm that simulates the workings of neurons in the human brain. There are several types of Neural Networks, including the Vanilla Neural Network (that handles structured data only), as well as Recurrent Neural Network and Convolutional Neural Network which both can work with unstructured data.


When you have a lot of data (and processing capacity), and accuracy is important to you, you'll almost certainly utilize a neural network. This algorithm has many applications, such as paraphrase detection, text classification, semantic parsing, and question answering.


Gradient Boosting Tree

Gradient Boosting Tree is a method for merging the outputs of separate trees to perform regression or classification. Like Random Forest below, it is a supervised method that incorporates a large number of decision trees to lessen the danger of overfitting (a statistical modeling mistake that happens when a function is fitted too tightly to a small number of data points, reducing the predictive power of the model) that each tree confronts alone. This algorithm employs Boosting, which entails consecutively combining weak learners (typically decision trees with just one split, known as decision stumps) so that each new tree corrects the preceding one's faults.

We usually employ the Gradient Boosting Algorithm when we wish to reduce the bias error, which is the amount by which a model's prediction varies from the target value. Gradient boosting is most beneficial when the data has fewer dimensions, a basic linear model performs poorly, interpretability is not critical, and there is no stringent latency limit.


It has been used in many studies, such as a gender prediction algorithm based on the motivations of masters athletes, which used gradient-boosted decision trees to explore their capacity to predict gender from psychological dimensions evaluating reasons to participate in masters sports.

Random Forest

Random Forest is a method for resolving regression and classification problems. It makes use of ensemble learning, a technique for solving complicated problems by combining several classifiers. It consists of many decision trees, and the outcomes of all of them are combined into the final result by taking the average or majority decision. The greater the number of trees, the better the precision of the outcome.

Random Forest is appropriate when we have a huge dataset and interpretability is not a key problem, as it becomes increasingly difficult to grasp as the dataset grows larger. This algorithm is used in stock market analysis, diagnosis of patients in the medical field, to predict the creditworthiness of a loan applicant, and in fraud detection.
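To compare the three "accuracy" options side by side, here is a minimal scikit-learn sketch scoring each with cross-validation (the dataset and hyperparameters are our choices, purely illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
models = {
    "neural network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
    ),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()  # mean R^2 across folds
    print(f"{name}: {score:.3f}")
```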

For non-numerical predictions, you can choose the Classification methods over regression.

Supervised Learning: Classification

As with the regression methods, you need to choose whether you'd rather have speed or accuracy in your outcomes.

If you’re looking for accuracy, you not only may opt for the Kernel Support-Vector Machine, but you can use other algorithms that were mentioned previously, such as Neural Network, Gradient Boosting Tree, and Random Forest. Now, let’s introduce this new algorithm.

Kernel Support-Vector Machine

To bridge linearity and non-linearity, the kernel technique is commonly utilized in the Support-Vector Machine model. To understand this, it is essential to know that the SVM method learns how to separate different groups by forming decision boundaries.

But when we face a data set of higher dimensions where computing costs are expensive, it is recommended to use this kernel method. It enables us to work in the original feature space without having to compute the data's coordinates in a higher-dimensional space.

It’s mostly used in text classification problems since most of them can be linearly separated.
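A minimal kernel-SVM sketch with scikit-learn, using a toy dataset that no straight line can separate:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line separates the classes.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)    # the kernel trick handles the non-linearity
print(clf.score(X, y))               # near-perfect separation on this toy data
```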

When speed is needed, we need to see whether the technique we're going to employ is explainable, meaning it can explain what happens in your model from start to finish. In that case, we might use a Decision Tree algorithm or Logistic Regression.

Logistic Regression

Logistic Regression is used when the dependent variable is categorical. Through probability estimation, it aids in understanding the relationship between the dependent variable and one or more independent variables.

There are three different types of Logistic Regression:

  • Binary Logistic Regression, where the response only has two possible values.

  • Multinomial Logistic Regression, three or more outcomes with no order.

  • Ordinal Logistic Regression, three or more categories with ordering.

The Logistic Regression algorithm is widely used in hotel booking: it shows you (through statistical research) options you may want to add to your booking, such as a particular hotel room, journeys in the area, and more.
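A minimal logistic-regression sketch with scikit-learn on a binary outcome, showing the probability estimates behind the predicted categories (the dataset is chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)       # binary (two-valued) outcome
clf = LogisticRegression(max_iter=5000).fit(X, y)

print(clf.predict(X[:5]))                        # predicted categories
print(clf.predict_proba(X[:5])[:, 1])            # probability estimates behind them
```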

If you're only interested in the input and output of your problem, you can check whether the data you're working with is very large. If it is, you can use a Linear Support-Vector Machine.


Linear Support-Vector Machine

Linear SVM is used for linearly separable data, that is, data whose classes can be separated with a single straight line. This straight line is the linear SVM classifier's decision boundary, separating user behaviors or outcomes for a stated problem.
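A minimal linear-SVM sketch with scikit-learn on linearly separable toy data:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # linearly separable groups
clf = LinearSVC().fit(X, y)
print(clf.coef_, clf.intercept_)     # parameters of the separating straight line
```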

Since texts are often linearly separable and have a lot of features, Linear SVM is the best option for classifying them. Our next algorithm can be used whether the data is large or not.


Naïve Bayes

This algorithm is based on Bayes' Theorem and makes predictions through objects' probabilities. It's called Naïve because it assumes that the appearance of one feature is unrelated to the appearance of other features.

This method is well-liked because it can surpass even highly sophisticated classification approaches. Furthermore, it is simple to construct and can be built rapidly. Due to its ease of use and efficiency, it's used to make real-time decisions. Gmail, for example, uses this algorithm to decide whether an email is spam or not.

The Gmail spam detector picks a set of words, or 'tokens', to identify spam email (this method is also used in text classification and is commonly known as bag of words). Next, it compares those tokens against known spam and non-spam emails. Finally, using the Naive Bayes algorithm, it calculates the probability that the email is spam.
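A minimal sketch of that pipeline with scikit-learn: a bag-of-words vectorizer followed by Multinomial Naive Bayes (the toy emails are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "cheap money offer", "meeting at noon", "project notes attached"]
labels = [1, 1, 0, 0]                       # 1 = spam, 0 = not spam

vec = CountVectorizer()                     # the "bag of words" tokenization step
X = vec.fit_transform(emails)
clf = MultinomialNB().fit(X, labels)

test = vec.transform(["cheap offer now"])
print(clf.predict(test), clf.predict_proba(test))  # spam verdict and its probability
```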

In Conclusion

Machine Learning is a widely utilized technology whose many applications often go unrecognized because they are such a regular occurrence. In this article, we have not only distinguished between the different approaches to machine learning but also shown how to choose among them according to the data we're working with and the problem we want to solve.

To learn Machine Learning, you need some knowledge of calculus, linear algebra, and statistics, as well as programming skills. You can use different programming languages to implement these algorithms, from Python to C++ and R. It's up to you to make the best decision and start learning along with your machine.








Practical Principles for AI Ethics

Principles of AI are a top-down approach to ethics for artificial intelligence (AI). Recently, we have been seeing lists of principles for AI ethics popping up everywhere. They are very useful, not only for AI and its impact but also on a larger social level. Because of AI, people are thinking about ethics in a whole new way: How do we define and digest ethics in order to codify it? 

Previously I have written an analysis of top-down and bottom-up approaches to ethics for AI, and then we explored the bottom-up method of reinforcement learning for teaching AI ethics. In this segment, we will address AI principles as a top-down method for working towards an ethical AI. 

Ethical AI Principles

Principles can be broken into two categories: principles for the people who program AI systems to follow, and principles for the AI itself.

Some of the principles for people, mainly programmers and data scientists, read like commandments. For instance, The Institute for Ethical AI & ML has a list of eight principles geared toward technologists. These include human augmentation, to keep a human in the loop; bias evaluation, to continually monitor bias; explainability and justification, to improve transparency; reproducibility, to ensure infrastructure that is reasonably reproducible; displacement strategy, to mitigate the impact on workers due to automation; practical accuracy, to align with domain-specific applications; trust by privacy, to protect and handle data; and data risk awareness, to consider data and model security. (The Institute for Ethical Ai & Machine Learning)


The Responsible Machine Learning Principles:

  • Human Augmentation

  • Bias Evaluation

  • Explainability and Justification

  • Reproducibility

  • Displacement Strategy

  • Practical Accuracy

  • Trust by Privacy

  • Data Risk Awareness

Other lists of principles are geared toward the ethics of AI systems themselves and what they should adhere to. One such list consists of four principles, published by the National Institute of Standards and Technology (NIST), intended to promote explainability. The first of these concerns explanation: a system should provide evidence and reasons for its processes and outputs, be readable by a human, and explain its algorithms. The remaining three expand on this. The second recommends that explanations be meaningful and understandable, with methods to evaluate their meaningfulness. The third principle is explanation accuracy: a system must correctly reflect the reason(s) for its generated output. Finally, the fourth is knowledge limits: ensuring that a system only operates under conditions for which it was designed and does not give overly confident answers in areas where it has limited knowledge; for example, a system programmed to classify birds being used to classify an apple. (Marengo, 2021)

Many of the principles overlap across corporations and agencies. We can see a detailed graphic and information published by the Berkman Klein Center for Internet and Society at Harvard, found here. This gives a great overview of forty-seven principles that various organizations, corporations, and other entities are adopting, where they overlap, and their definitions. 

The authors provide many lists and descriptions of ethical principles for AI and categorize them into eight thematic trends: Privacy, Accountability, Safety and security, Transparency and explainability, Fairness and non-discrimination, Human control of technology, Professional responsibility, and Promotion of human values. (Fjeld and Nagy, 2020)

The Illusion of Ethical AI

One particular principle that I see as missing from these lists regards taking care of the non-human world. As Boddington states in her book, Toward a Code of Ethics for Artificial Intelligence (2018), “. . . we are changing the world, AI will hasten these changes, and hence, we’d better have an idea of what changes count as good and what counts as bad.” (Boddington, 2018) We will all have different opinions on this, but it needs to be part of the discussion. We can’t continue to destroy the planet while trying to create super AI and still be under the illusion that our ethical principles are saving the world. 

This will also be a cautionary tale, for many of these principles are theoretically sound yet act as a veil that presents the illusion of ethics. This can be dangerous because it makes us feel like we are practicing ethics while business carries on as usual. Part of the reason for this is that the field of ethical AI development is so new, and not a lot of research has yet been done to ensure the overall impact is a benefit to society. “Despite the proliferation of these ‘AI principles,’ there has been little scholarly focus on understanding these efforts either individually or as contextualized within an expanding universe of principles with discernible trends.” (Fjeld and Nagy, 2020)

Principles are a double-edged sword. On one hand, making a stated effort to follow a set of ethical principles is good. It is beneficial for people to be thinking about doing what is right and ethical, and not just blindly writing code that could be detrimental in unforeseen ways.

Some principles are simple in appearance yet incredibly challenging in practice. For example, if we look at the commonly adopted principle of transparency, there is quite a difference between saying that algorithms and machine learning should be explainable and actually developing ways to see inside the black box. As datasets get bigger, this presents more and more technical challenges. (Boddington, 2018)

Furthermore, some of the principles can conflict with each other, which can land us in a less ethical place than where we started. For example, transparency can conflict with privacy, another popular principle. We can run into a lot of complex problems around this, and I hope to see this addressed quickly and thoroughly as we move forward.

Overall, we want these concepts in people's minds: Fairness, Accountability, and Transparency. These are the core tenets and namesake of the FAccT conference, which addresses these principles in depth. It is incredibly important for corporations and programmers to be concerned about the commonly addressed themes of bias, discrimination, oppression, and systemic violence. And yet… what can happen is that these principles make us feel like we are doing the right thing. How much does writing out these ideals actually change things?

The AI Ethical Revolution We Need

In order for AI to be ethical, A LOT has to change, and not just in the tech world. There seems to be an omission of the unspoken principles: the value of money for corporations and those in power, and convenience for those who can afford it. If we are trying to create fairness, accountability, and transparency in AI, we need to do some serious work on society to adjust our core principles away from money and convenience and toward taking care of everyone's basic needs and the Earth.

Could AI be a tool that has the side effect of starting an ethics revolution? 

How do we accomplish this? The language that we use is important, especially when it comes to principles. Moss and Metcalf point out the importance of using market-friendly terms. If we want morality to win out, we need to justify the organizational resources necessary, when more often than not, companies will choose profit over social good. (Moss and Metcalf, 2019)

Whittlestone et al. describe the need to focus on areas of tension in ethics in AI, and point out the ambiguity of terms like ‘fairness’, ‘justice’, and ‘autonomy’. The authors prompt us to question how these terms might be interpreted differently across various groups and contexts. (Whittlestone et al. 2019)

They go on to say that principles need to be formalized into standards, codes, and ultimately regulation in order to be useful in practice. Attention is drawn to the importance of acknowledging tensions between high-level goals of ethics, which can differ and even contradict each other. To be effective, it is vital to include a measure of guidance on how to resolve different scenarios. And to reflect genuine agreement, there must be acknowledgment and accommodation of different perspectives and values as much as possible. (Whittlestone et al. 2019)

The authors then introduce four reasons that discussing tensions is beneficial and important for AI ethics:

  1. Bridging the gap between principles and practice

  2. Acknowledging differences in values

  3. Highlighting areas where new solutions are needed

  4. Identifying ambiguities and knowledge gaps

Each of these needs to be considered on an ongoing basis, as these tensions don't get solved overnight. Creating a bridge between principles and practice is particularly important, as I have argued above.

To wrap up, I will share this direct quote because it is incredibly profound:

“We need to balance the demand to make our moral reasoning as robust as possible, with safeguarding against making it too rigid and throwing the moral baby out with the bathwater by rejecting anything we can’t immediately explain. This point is highly relevant both to drawing up codes of ethics and to the attempts to implement ethical reasoning in machines.” (Boddington, 2018, pp. 18–19)

In conclusion, codes of ethics, or ethical principles, for AI are important to have, and I like the conversations being started because of their existence. However, it can't stop there. I am excited to see more and more ways that these principles are put into action, and to see technologists and theorists working together to investigate ways to make them work. I would also hope that we can open minds to ideas beyond making money for corporations and creating conveniences, and instead toward addressing tensions and truly creating a world that works for everyone.

Citations

ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT). (2021). Retrieved January 7, 2022, from https://facctconference.org/

AI Principles. Future of Life Institute. (2021, December 15). Retrieved December 30, 2021, from https://futureoflife.org/2017/08/11/ai-principles/

Berkman Klein Center Media Library. (n.d.). Retrieved January 8, 2022, from https://wilkins.law.harvard.edu/misc/ 

Boddington, P. (2018). Towards a Code of Ethics for Artificial Intelligence. Springer International Publishing.

Fjeld, J., & Nagy, A. (2020). Principled Artificial Intelligence. Berkman Klein Center. Retrieved December 30, 2021, from https://cyber.harvard.edu/publication/2020/principled-ai

Marengo, F. (2021). Four principles of explainable AI. LinkedIn. Retrieved January 7, 2022, from https://www.linkedin.com/posts/fmarengo_four-principles-of-explainable-ai-activity-6878970042382356480-updf/

Moss, E., & Metcalf, J. (2019, November 14). The ethical dilemma at the heart of Big Tech companies. Harvard Business Review. Retrieved December 13, 2021, from https://hbr.org/2019/11/the-ethical-dilemma-at-the-heart-of-big-tech-companies

The Institute for Ethical AI & Machine Learning. (n.d.). The Machine Learning Principles: The 8 principles for responsible development of AI & Machine Learning systems. Retrieved December 30, 2021, from https://ethical.institute/principles.html

Whittlestone, J., Cave, S., Alexandrova, A., & Nyrup, R. (2019). The role and limits of principles in AI Ethics: Towards a … Retrieved December 13, 2021, from http://lcfi.ac.uk/media/uploads/files/AIES-19_paper_188_Whittlestone_Nyrup_Alexandrova_Cave.pdf.


Reinforcement Learning as a Methodology for Teaching AI Ethics

One of the biggest questions when considering ethics for artificial intelligence (AI) is how to implement something so complex and un-agreed-upon into machines that are contrastingly good at precision. Some say this is impossible. “Ethics is not a technical enterprise, there are no calculations or rules of thumb that we could rely on to be ethical. Strictly speaking, an ethical algorithm is a contradiction in terms.” (Vachnadze, 2021)

Dismissing ethics training for AI because it presents many technical challenges is not going to help our society when the technology is advancing regardless and ethical concerns are continuing to arise.

So, we turn to the bottom-up approach of reinforcement learning as a promising avenue to explore how to move forward towards AI for positive social impact.

What is Reinforcement Learning and Where Does it Originate?

“Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences.” (Bhatt, 2018)

Reinforcement learning is different from other forms of learning that rely on top-down rules. Rather, this system learns as it goes, making many mistakes but learning from them, and adapting through sensing the environment. It is trained in a simulated environment using reward systems, with either positive or negative feedback, so the agent can try lots of different actions within its environment with no real-world consequence until it gets it right.
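To make the reward loop concrete, here is a minimal tabular Q-learning sketch on an invented five-state corridor (all names and numbers are illustrative, not from the cited sources): the agent tries actions, receives reward feedback, and gradually improves its estimates.

```python
import numpy as np

# Tabular Q-learning on an invented 5-state corridor: start at state 0, goal at state 4.
n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != 4:
        # Trial and error: mostly exploit the best-known action, sometimes explore.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0  # positive feedback only at the goal
        # Update the value estimate from the reward signal (environment feedback).
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[:4].argmax(axis=1))  # learned policy in non-terminal states: all 1 ("right")
```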

We commonly see RL used to train algorithms to play games, such as AlphaGo and chess. In the beginning, RL was studied in animals, as well as in early computers.

“The history of reinforcement learning has two main threads, both long and rich, that were pursued independently before intertwining in modern reinforcement learning. One thread concerns learning by trial and error that started in the psychology of animal learning. This thread runs through some of the earliest work in artificial intelligence and led to the revival of reinforcement learning in the early 1980s. The other thread concerns the problem of optimal control and its solution using value functions and dynamic programming. For the most part, this thread did not involve learning. Although the two threads have been largely independent, the exceptions revolve around a third, less distinct thread concerning temporal-difference methods. . . All three threads came together in the late 1980s to produce the modern field of reinforcement learning,” (Sutton and Barto, 2015)

It is interesting to note that this form of learning originated partially in the training of animals, and is often used for teaching human children as well. It is something that has been in existence and evolving for many decades.

How is reinforcement learning used for machine learning?

Now let’s explore more of this bottom-up approach to programming and how it functions for artificial intelligence. “Instead of explicit rules of operation, RL uses a goal-oriented approach where a ‘rule’ would emerge as a temporary side-effect of an effectively resolved problem. That very same rule could be discarded at any moment later on, where it proves no longer effective. The point of RL modeling is to help the A.I. mimic a living organism as much as possible, thereby compensating for what we commonly held to be the main draw-back of Machine-Learning: The impossibility of Machine-Training, which is precisely what RL is supposed to be.” (Vachnadze, 2021)

This style of learning that throws the rule book out the window could be promising for something like ethics, where the rules are not overly consistent or even agreed upon. Ethics is more situation-dependent, therefore teaching a broad rule is not always sufficient. Could RL be the answer?

“Reinforcement learning problems involve learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. . . These three characteristics — being closed-loop in an essential way, not having direct instructions as to what actions to take, and where the consequences of actions, including reward signals, play out over extended time periods — are the three most important distinguishing features of reinforcement learning problems.” (Sutton and Barto, 2015)

Turning ethics into numerical rewards can pose many challenges but may be a hopeful consideration for programming ethics into AI systems. The authors go on to say that “. . . the basic idea is simply to capture the most important aspects of the real problem facing a learning agent interacting with its environment to achieve a goal. Clearly, such an agent must be able to sense the state of the environment to some extent and must be able to take actions that affect the state.” (Sutton and Barto, 2015)

There are many types of machine learning, and it may be promising to look at using more than one type in conjunction with RL in order to approach the ethics question. One paper used RL along with inverse reinforcement learning (IRL). IRL learns from human behaviors but is limited to learning what people do online, so it only gets a partial picture of actual human behavior. Still, this in combination with RL might cover some blind spots and is worth testing out.

Source: https://towardsdatascience.com/inverse-reinforcement-learning-6453b7cdc90d

Can reinforcement learning be methodized in ethics for AI?

One of the ways that RL can work in an ethical sense, and avoid pitfalls, is by utilizing systems that keep a human in the loop. “Interactive learning constitutes a complementary approach that aims at overcoming these limitations by involving a human teacher in the learning process.” (Najar and Chetouani, 2021)

Keeping a human in the loop is critical for many issues, including those around transparency. The human can be thought of as a teacher or trainer, however, an alternative way to bring in a human agent is as a critic.

“Actor-Critic architectures constitute a hybrid approach between value-based and policy-gradient methods by computing both the policy (the actor) and a value function (the critic) (Barto et al., 1983). The actor can be represented as a parameterized softmax distribution. . . The critic computes a value function that is used for evaluating the actor.” (Najar and Chetouani, 2021)

Furthermore, I like the approach of moral uncertainty, because there isn't ever one answer or solution to an ethical question; admitting uncertainty leaves the system open to continued questioning that can lead us to answers that may be complex and decentralized. This path could possibly create a system that can adapt to meet the ethical considerations of everyone involved.

“While ethical agents could be trained by rewarding correct behavior under a specific moral theory (e.g. utilitarianism), there remains widespread disagreement about the nature of morality. Acknowledging such disagreement, recent work in moral philosophy proposes that ethical behavior requires acting under moral uncertainty, i.e. to take into account when acting that one’s credence is split across several plausible ethical theories.” (Ecoffet and Lehman, 2021)

Moral uncertainty needs to be considered, purely because ethics is an area of vast uncertainty, and is not an answerable math problem with predictable results.

There are plentiful limitations, and many important considerations to attend to along the way. There is no easy answer to this, rather there are many answers that depend on a lot of factors. Could an RL program eventually learn how to compute all the different ethical possibilities?

“The fundamental purpose of these systems is to carry out actions so as to improve the lives of the inhabitants of our planet. It is essential, then, that these agents make decisions that take into account the desires, goals, and preferences of other people in the world while simultaneously learning about those preferences.” (Abel et. al, 2016)

The Limitations of Reinforcement Learning for Ethical AI

There are many limitations to consider, and some would say that reinforcement learning is not a plausible answer for an ethical AI.

“In his recent book Superintelligence, Bostrom (2014) argues against the prospect of using reinforcement learning as the basis for an ethical artificial agent. His primary claim is that an intelligent enough agent acting so as to maximize reward in the real world would effectively cheat by modifying its reward signal in a way that trivially maximizes reward. However, this argument only applies to a very specific form of reinforcement learning: one in which the agent does not know the reward function and whose goal is instead to maximize the observation of reward events.” (Abel et. al, 2016)

This may take a lot of experimentation. It is important to know the limitations, while also remaining open to being surprised. We worry a lot about the unknowns of AI: Will it truly align with our values? Only through experimentation can we find out.

“What we should perhaps consider is exploring the concept of providing a ‘safe learning environment’ for the RL System in which it can learn, where models of other systems and the interactions with the environment are simulated so that no harm can be caused to humans, assets, or the environment. . . However, this is often complicated by issues around the gap between the simulated and actual environments, including issues related to different societal/human values.” (Bragg and Habli, 2018)

It is certainly a challenge to then take these experiments from their virtual environments and utilize them in the real world, and many think this isn’t achievable.

“In the real world, full state awareness is impossible, especially when the desires, beliefs, and other cognitive content of people is a critical component of the decision-making process.” (Abel et. al, 2016)

What do I think, as an anthropologist? I look around and see that we are in a time of great social change, on many fronts. Ethical AI is not only possible, it is an absolute necessity. It is worth exploring reinforcement learning and other hybrid models that include RL, most definitely. The focus on rewards is a bit troubling to me, as the goal of a ‘reward’ is not always the most ethical. There is much in the terminology that troubles me, however, I don’t think that AI is inherently doomed. It is not going anywhere, so we need to work together to make it ethical.

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

Join us in driving #AI for #SocialImpact initiatives around the world!


Citations

Abel, D., MacGlashan, J., & Littman, M. L. (2016). Reinforcement Learning as a Framework for Ethical Decision Making. aaai.org. Retrieved December 20, 2021, from https://www.aaai.org/ocs/index.php/WS/AAAIW16/paper/viewFile/12582/12346

Bhatt, S. (2018). Reinforcement learning 101. Medium. Retrieved December 20, 2021, from https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292

Bragg, J., & Habli, I. (2018). What is acceptably safe for reinforcement learning? whiterose.ac.uk. Retrieved December 20, 2021, from https://eprints.whiterose.ac.uk/133489/1/RL_paper_5.pdf

Gonfalonieri, A. (2018, December 31). Inverse reinforcement learning. Medium. Retrieved December 20, 2021, from https://towardsdatascience.com/inverse-reinforcement-learning-6453b7cdc90d

Ecoffet, A., & Lehman, J. (2021). Reinforcement learning under moral uncertainty. arxiv.org. Retrieved December 20, 2021, from https://arxiv.org/pdf/2006.04734v3.pdf

Najar, A., & Chetouani, M. (2021, January 1). Reinforcement learning with human advice: A survey. Frontiers. Retrieved December 20, 2021, from https://www.frontiersin.org/articles/10.3389/frobt.2021.584075/full

Noothigattu, R., Bouneffouf, D., Mattei, N., Chandra, R., Madan, P., Varshney, K. R., Campbell, M., Singh, M., & Rossi, F. (2020). Teaching AI Agents Ethical Values Using Reinforcement Learning and Policy Orchestration. Ritesh Noothigattu — Publications. Retrieved December 20, 2021, from https://www.cs.cmu.edu/~rnoothig/publications.html

Sutton, R. S., & Barto, A. G. (2015). Reinforcement Learning: An Introduction. Retrieved December 20, 2021, from http://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf

Vachnadze, G. (2021, February 7). Reinforcement learning: Bottom-up programming for ethical machines. Marten Kaas. Medium. Retrieved December 20, 2021, from https://medium.com/nerd-for-tech/reinforcement-learning-bottom-up-programming-for-ethical-machines-marten-kaas-ca383612c778


Understanding Top-Down and Bottom-Up Ethics in AI Part 2

In part two of this investigation into top-down and bottom-up ethics in Artificial Intelligence (AI), I would like to explore three different angles, including the technical perspective, the ethical viewpoint, and through a political lens while also discussing individual and hybrid approaches to implementation.

The first angle is to understand the technical perspective, broken down into programming and applied machine learning: essentially, how to implement algorithmic policies with balanced data that will lead to fair and desirable outcomes.

The next angle is the theoretical ethics viewpoint: ethics can work from the top down, coming from rules, philosophies, and the like, or from the bottom up, looking at the behaviors of people and what is socially acceptable for individuals as well as groups, which varies by culture.

Third, I want to come back to my original hypothesis: that top-down implies ethics dictated from the powers that be, while bottom-up ethics can only be derived from the demands of the people. We might call this the political perspective.

Finally, we will connect them all back together and split them apart again into top-down, bottom-up, and hybrid models of how ethics functions for AI. This is an exercise in exploration, meant to reach a deeper understanding. How ethics for AI works in reality is a blend of all of these theories and ideas acting on, and in conjunction with, one another.

Technical Machine Learning Top-Down vs Bottom-Up

The technical angle of this debate is admittedly the most foreign to me; however, in my research I have found some basic examples that I hope are helpful.

“In simple terms and in the context of AI, it is probably easiest to imagine ‘Top-down AI’ to be based on a decision tree. For example, a call center chatbot is based on a defined set of options and, depending on the user input, it guides the caller through a tree of options. What we typically refer to as AI these days — for applications such as self-driving cars or diagnostic systems in health care — would be defined as ‘Bottom-up AI’ and is based on machine learning (ML) or deep learning (DL). These are applications of AI that provide systems with the ability to automatically learn and improve from experience without being explicitly programmed.” (Eckart, 2020)
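To make the distinction concrete, here is a minimal sketch (in Python) of what a top-down, rule-based chatbot might look like under the hood: a hand-coded decision tree. The states, prompts, and menu options below are all invented for illustration, not taken from any real system.

```python
# A minimal sketch of 'top-down AI' as a hand-coded decision tree,
# in the spirit of the call-center chatbot described above.
# All states, prompts, and options are invented for illustration.

DECISION_TREE = {
    "start": {
        "prompt": "Say 'billing' for billing or 'support' for technical support.",
        "options": {"billing": "billing", "support": "support"},
    },
    "billing": {
        "prompt": "Say 'balance' to hear your balance or 'agent' for a human.",
        "options": {"balance": "end_balance", "agent": "end_agent"},
    },
    "support": {
        "prompt": "Say 'reset' to reset your device or 'agent' for a human.",
        "options": {"reset": "end_reset", "agent": "end_agent"},
    },
}

def chatbot() -> None:
    """Walk the caller through a fixed tree of options.

    Every behavior is explicitly programmed; nothing is learned. That is
    what makes the system predictable, and brittle when a caller says
    something the tree's author never anticipated.
    """
    node = "start"
    while not node.startswith("end_"):
        choice = input(DECISION_TREE[node]["prompt"] + " ").strip().lower()
        # Unrecognized input exposes the limits of top-down design:
        # the bot can only repeat the prompt, never adapt.
        node = DECISION_TREE[node]["options"].get(choice, node)
    print(f"Reached terminal state: {node}")

if __name__ == "__main__":
    chatbot()
```

Everything the bot can do is enumerated in advance, which is exactly what makes it predictable, auditable, and brittle all at once.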

Top-down systems can be very useful for tasks that machines can be explicitly programmed to do, like the call-center example above. However, if they are not monitored, they can make mistakes, and it is up to us humans to catch those mistakes and correct them. They may also lack exposure to sufficient data to make a decision or prediction, leading to system failure. This is the value of having a ‘human in the loop’. Things get more complicated when we move into the more theoretical world of ethics.

Bottom-up is essentially a description of machine learning: the system is given data to learn from, and it uses that information from the past to predict and make decisions about the future. This can work quite well for many tasks. It can also have many flaws built in, because the world it learns from is flawed. Consider the classic example of harmful bias being learned and applied, for instance in deciding who gets a job or a loan, because data from the past reflects biased systems in our society.
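As a sketch of how this happens, consider the toy example below. A model is trained on synthetic ‘historical’ loan decisions in which one group was approved less often, and it faithfully reproduces that bias; every feature name and number here is invented for illustration.

```python
# A toy sketch of bottom-up learning inheriting bias from its data.
# The dataset is synthetic; every name and number is invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# 'group' stands in for a protected attribute. In this synthetic past,
# 40% of otherwise-qualified applicants from group 1 were denied.
group = rng.integers(0, 2, n)
income = rng.normal(50, 15, n)
qualified = income + rng.normal(0, 5, n) > 55
past_approval = qualified & ~((group == 1) & (rng.random(n) < 0.4))

X = np.column_stack([income, group])
model = LogisticRegression(max_iter=1000).fit(X, past_approval)

# The model faithfully learns the historical discrimination:
# identical incomes, different predicted approval rates by group.
test_income = np.full(1000, 55.0)
for g in (0, 1):
    X_test = np.column_stack([test_income, np.full(1000, g)])
    print(f"group {g}: predicted approval rate {model.predict(X_test).mean():.2f}")
```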

Here we should also mention the hybrid of top-down and bottom-up: a system with a base of rules or instructions that is also fed data to learn from as it goes. This method promises the best of both worlds, covering some of the shortcomings of both the top-down and bottom-up models. For instance, self-driving cars can be programmed with the laws and rules of the road while also learning from observing human drivers.
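A minimal sketch of that hybrid layering might look like the following, where hard-coded rules are checked first and a learned policy handles everything the rules don’t cover. The thresholds, action names, and the learned_policy stub are all hypothetical.

```python
# A hedged sketch of the hybrid idea: explicit rules of the road are
# checked first (top-down); a learned policy (bottom-up) covers the rest.
# The thresholds, action names, and learned_policy stub are hypothetical.

SPEED_LIMIT_KMH = 50.0  # invented legal limit

def learned_policy(obs: dict) -> str:
    """Stand-in for a model trained on observations of human drivers."""
    return "maintain_speed" if obs["gap_m"] > 30 else "slow_down"

def hybrid_controller(obs: dict) -> str:
    # Top-down layer: explicit, non-negotiable rules.
    if obs["pedestrian_ahead"]:
        return "emergency_brake"
    if obs["speed_kmh"] > SPEED_LIMIT_KMH:
        return "slow_down"
    # Bottom-up layer: learned behavior fills in everything else.
    return learned_policy(obs)

# The rule layer is silent here, so the learned layer decides.
print(hybrid_controller({"pedestrian_ahead": False, "speed_kmh": 45.0, "gap_m": 12.0}))
```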

Theoretical Ethics Top-Down vs Bottom-Up

Now let’s move on to ethics. The first thing to note in this part of the analysis is that ethics has historically been made by and for people, and people are complex in how they understand and apply ethics, especially top-down ethics.

“Top-down ethical systems come from a variety of sources including religion, philosophy, and literature. Examples include the Golden Rule, the Ten Commandments, consequentialist or utilitarian ethics, Kant’s moral imperative and other duty-based theories, legal codes, Aristotle’s virtues, and Asimov’s three laws for robots.” (Wallach et al., 2005)

The one exception on this list that doesn’t apply to people is of course Asimov’s laws, which were written precisely for machines. However, Asimov himself concluded that they were flawed.

“When thinking of rules for robots, Asimov’s laws come immediately to mind. On the surface, these three laws, plus a ‘zeroth’ law that he added in 1985 to place humanity’s interest above that of any individual, appear to be intuitive, straightforward, and general enough in scope to capture a broad array of ethical concerns. But in story after story Asimov demonstrates problems of prioritization and potential deadlock inherent in implementing even this small set of rules (Clark, 1994). Apparently, Asimov concluded that his laws would not work, and other theorists have extended this conclusion to encompass any rule-based ethical system implemented in AI (Lang, 2002).” (Wallach et al., 2005)


A lot of science fiction doesn’t predict the future as much as warn us against its possibilities. Furthermore, the top-down approach is tricky for AI in different ways than how it is tricky for humans.

As humans, we learn ethics as we go, from those practiced by our families and community, how we react to our environment, and how others react to us. One paper made the case that “. . . while one can argue that individuals make moral choices on the basis of this or that philosophy, actual humans first acquire moral values from those who raise them, and then modify these values as they are exposed to various inputs from new groups, cultures, and subcultures, gradually developing their own personal moral mix.” (Etzioni, 2017)

This personal moral mix could be thought of as a hybrid model for ethics for humans. The question is, how easy and practical is it to take human ethics and apply them to machines?

Political Ethics Top-Down vs Bottom-Up

When I hear top-down, I imagine government or big business/Big Tech figureheads sitting in a room, making decisions for everyone else. That has always left a bad taste in my mouth. It is how our world works, in some ways more than others, and we are seeing it again in how Big Tech has approached ethics in AI.

Here are some examples of top-down ethics from the powers that be: “The Asilomar AI principles, developed in 2017 in conjunction with the Asilomar conference for Beneficial AI, outline guidelines on how research should be conducted, ethics and values that use of AI must respect, and important considerations for thinking about long-term issues (Future of Life Institute 2017). . . Around the same time, the US Association for Computing Machinery (ACM) issued a statement and set of seven principles for Algorithmic Transparency and Accountability, addressing a narrower but closely related set of issues (ACM US Public Policy Council 2017).” (Whittlestone et al. 2019)

We are also seeing some crowd-collected considerations about ethics in AI, and this is what I think of when I hear bottom-up: decisions being called for by the people. This is the grassroots ethics that I think we need to be paying attention to, especially the voices of marginalized and minoritized groups.

“Bottom-up data institutions are seen by some as mechanisms that could be revolutionary for rebalancing power between big tech corporations and communities. It was argued that there is a widespread assumption that bottom-up data institutions will always be benign and will represent everyone in society, and these assumptions underpin their promotion. It was discussed whether bottom-up data institutions are, by definition, only representative of the particular communities included within their data subjects rather than of general societal values.” (ODI, 2021)

This is an important point to keep in mind when thinking about bottom-up, grassroots ethics: different groups of people will always have different ethics, and the details of application are where disagreements abound.

The Top-Down Method of AI Being Taught Ethics

Now we can recombine all of the top-down angles: the technical, the theoretical, and the political.

If we teach AI ethical core principles and expect it to live by human values and virtues, I imagine we will be sorely disappointed. There just isn’t a foreseeable way to make this work for everyone.

“Many of the principles proposed in AI ethics are too broad to be action-guiding. For example, ensuring that AI is used for “social good” or “the benefit of humanity” is a common thread among all sets of principles. These are phrases on which a great majority can agree exactly because they carry with them few if any real commitments.” (Whittlestone et al. 2019)

Furthermore, if these principles are administered by Big Tech or the government, a lot could slip by simply because it sounds good. In my previous article, we worked through the example of fairness. Fairness is something we can all agree is good, but we cannot all agree on what it means in practice. What is fair for one person or group could be deeply unfair to another.

“The strength of top-down theories lies in their defining ethical goals with a breadth that subsumes countless specific challenges. But this strength can come at a price: either the goals are defined so vaguely or abstractly that their meaning and application are subject for debate, or they get defined in a manner that is static and fails to accommodate or may even be hostile to new conditions.” (Wallach et. al, 2005)

A machine doesn’t implicitly know what ‘fairness’ means. So how can we teach it a singular definition when fairness holds a different context for everyone?

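One way to see the problem concretely is that even formal definitions of fairness conflict with one another. In the invented numbers below, the same set of decisions satisfies demographic parity (equal selection rates across groups) while violating equal opportunity (unequal approval rates among the qualified).

```python
# Invented numbers showing that two standard fairness definitions can
# disagree about the very same decisions.
import numpy as np

group    = np.array([0] * 10 + [1] * 10)                          # protected attribute
label    = np.array([1,1,1,1,1,0,0,0,0,0, 1,1,0,0,0,0,0,0,0,0])  # truly qualified?
decision = np.array([1,1,1,1,0,1,0,0,0,0, 1,1,1,1,1,0,0,0,0,0])  # approved?

for g in (0, 1):
    sel = group == g
    selection_rate = decision[sel].mean()        # demographic parity view
    tpr = decision[sel & (label == 1)].mean()    # equal opportunity view
    print(f"group {g}: selection rate {selection_rate:.2f}, "
          f"true-positive rate {tpr:.2f}")

# Both groups are selected at rate 0.50, so demographic parity holds;
# yet qualified applicants are approved at 0.80 vs 1.00, so equal
# opportunity fails. A single 'fairness' target cannot satisfy both here.
```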

The Bottom-up Method of AI Being Taught Ethics

The bottom-up approach isn’t as easy to wrap up. Sometimes we see crowd-sourced principles (bottom-up politically, yet top-down theoretically) being called for, possibly implemented as a hybrid model (technically) in applied AI. If it were purely bottom-up, learning from the ethics of people, I fear disappointment would be the end result: we humans haven’t quite mastered ethics ourselves, let alone standardized it into something codifiable.

One paper describes bottom-up approaches as “. . . those that do not impose a specific moral theory, but which seek to provide environments in which appropriate behavior is selected or rewarded. These approaches to the development of moral sensibility entail piecemeal learning through experience, either by unconscious mechanistic trial and failure of evolution, the tinkering of programmers or engineers as they encounter new challenges or the educational development of a learning machine.” (Allen et al., 2005)

This is very challenging and time-consuming. And as we know, AI doesn’t learn the way humans do; it lacks a solid foundation. Building on top of that, applying band-aid after band-aid, is not going to help.

“Bottom-up strategies hold the promise of giving rise to skills and standards that are integral to the overall design of the system, but they are extremely difficult to evolve or develop. Evolution and learning are filled with trial and error — learning from mistakes and unsuccessful strategies. This can be a slow task, even in the accelerated world of computer processing and evolutionary algorithms.” (Allen et al., 2005)
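To illustrate what this piecemeal, trial-and-error learning can look like in code, here is a minimal Q-learning sketch on an invented grid world. No moral theory is programmed in; the agent learns to detour around a ‘harm’ cell only because the reward signal penalizes entering it. The environment, rewards, and hyperparameters are all made up for illustration.

```python
# A minimal Q-learning sketch of the bottom-up recipe in the quote above:
# 'appropriate' behavior is never stated as a rule, only rewarded or
# penalized. Environment, rewards, and hyperparameters are invented.
import numpy as np

rng = np.random.default_rng(0)
ROWS, COLS = 2, 4
START, GOAL, HARM = (0, 0), (0, 3), (0, 1)   # HARM sits on the direct path
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

Q = np.zeros((ROWS, COLS, len(ACTIONS)))
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(3000):                   # trial and error, many times over
    s = START
    while s != GOAL:
        a = rng.integers(4) if rng.random() < epsilon else int(Q[s].argmax())
        s2 = (min(max(s[0] + ACTIONS[a][0], 0), ROWS - 1),
              min(max(s[1] + ACTIONS[a][1], 0), COLS - 1))
        # The only 'ethics' the agent ever sees is this reward signal.
        r = 10.0 if s2 == GOAL else (-5.0 if s2 == HARM else -0.1)
        Q[s][a] += alpha * (r + gamma * Q[s2].max() - Q[s][a])
        s = s2

# After thousands of trials the greedy policy detours around the harm
# cell, learned entirely from penalties, never from a stated principle.
path, s = [START], START
while s != GOAL and len(path) < 10:
    a = int(Q[s].argmax())
    s = (min(max(s[0] + ACTIONS[a][0], 0), ROWS - 1),
         min(max(s[1] + ACTIONS[a][1], 0), COLS - 1))
    path.append(s)
print(path)
```

Even in this tiny world it takes thousands of simulated episodes for the penalty to shape behavior, which hints at how slow and data-hungry this approach becomes at any realistic scale.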

The Hybrid of Bottom-Up and Top-Down Ethics for AI

So if top-down is flawed and bottom-up isn’t promising, what about a hybrid model? “If no single approach meets the criteria for designating an artificial entity as a moral agent, then some hybrid will be necessary. Hybrid approaches pose the additional problems of meshing both diverse philosophies and dissimilar architectures.” (Allen et al., 2005)

According to most experts, a hybrid model is the better choice. Rules and structure are helpful, but only up to a point, and sometimes they contradict each other. AI is good at following rules, but it struggles with ethics, which is subjective and often contradictory.
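One common shape for such a hybrid, sketched below with invented rules and action names, is to let a learned model propose actions bottom-up while a small top-down rule layer vetoes any proposal that violates a hard constraint, deferring to a human when nothing survives.

```python
# A sketch of a hybrid 'shield': a learned model proposes actions
# bottom-up; a small top-down rule layer vetoes proposals that break a
# hard constraint. Rules, actions, and the propose() stub are invented.

HARD_RULES = [
    # Each rule returns True when the proposed action is forbidden.
    lambda action, ctx: action == "share_data" and not ctx.get("user_consent"),
    lambda action, ctx: action == "deny_service" and ctx.get("appeal_pending"),
]

def propose(ctx: dict) -> list:
    """Stand-in for a learned model's ranked action proposals."""
    return ["share_data", "recommend", "do_nothing"]

def shielded_decision(ctx: dict) -> str:
    for action in propose(ctx):
        if not any(rule(action, ctx) for rule in HARD_RULES):
            return action          # first proposal that survives the rules
    return "defer_to_human"        # nothing survived; escalate

print(shielded_decision({"user_consent": False}))  # -> 'recommend'
```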

Bringing it All Together

We have taken apart top-down and bottom-up ethics in AI in three ways: technically, theoretically, and politically. Then we took a step back and looked at top-down, bottom-up, and hybrid models of ethics for AI. It still seems pretty messy, and we all need to be doing much more research and work in this area, but I hope this has helped illuminate the various angles of the analysis. To leave us with a final thought: “Ethical issues are never solved, they are navigated and negotiated as part of the work of ethics owners.” (Moss and Metcalf, 2019)


You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

Join us in driving #AI for #SocialImpact initiatives around the world!


Citations

Allen, C., Wallach, W., & Smit, I. (2005). Artificial morality: Top-down, bottom-up, and hybrid approaches. Retrieved December 3, 2021, from https://www.researchgate.net/profile/Wendell-Wallach/publication/225850648_Artificial_Morality_Top-down_Bottom-up_and_Hybrid_Approaches/links/02bfe50d1c8d2c733e000000/Artificial-Morality-Top-down-Bottom-up-and-Hybrid-Approaches.pdf

Eckart, P. (2020, May 29). Top-down AI: The simpler, data-efficient AI. 10EQS. Retrieved December 13, 2021, from https://www.10eqs.com/knowledge-center/top-down-ai-or-the-simpler-data-efficient-ai/

Etzioni, A., & Etzioni, O. (2017). Incorporating ethics into artificial intelligence. PhilPapers. Retrieved November 30, 2021, from https://philpapers.org/archive/ETZIEI.pdf

Google. (2021). #OPEN roundtable summary note: Experimentalism — Le Guin part 2. Google Docs. Retrieved December 13, 2021, from https://docs.google.com/document/d/1cMhm4Kz4y-l__2TQANClVVMLCd9X3X8qH3RfhAGghHw/edit?pli=1#

Moss, E., & Metcalf, J. (2019, November 14). The ethical dilemma at the heart of big tech companies. Harvard Business Review. Retrieved December 13, 2021, from https://hbr.org/2019/11/the-ethical-dilemma-at-the-heart-of-big-tech-companies

Wallach, W., Smit, I., & Allen, C. (2005). Machine morality: Bottom-up and top-down approaches for modeling human moral faculties. AAAI. Retrieved December 3, 2021, from https://www.aaai.org/Papers/Symposia/Fall/2005/FS-05-06/FS05-06-015.pdf

Whittlestone, J., Cave, S., Alexandrova, A., & Nyrup, R. (2019). The role and limits of principles in AI ethics: Towards a … Retrieved December 13, 2021, from http://lcfi.ac.uk/media/uploads/files/AIES-19_paper_188_Whittlestone_Nyrup_Alexandrova_Cave.pdf
