Data Types in Python
List
- Ordered
- Mutable
- Allows Duplicates
- Stores references (pointers) to objects in memory
- Heterogeneous (can store a string, an integer, and another list)
- .append(x): adds x to the end
- .extend(iterable): appends every item from the iterable
- .insert(i, x): inserts x at position i; i = 0 inserts at the beginning
- .remove(x): removes the first occurrence of x
- .pop(i): removes and returns the item at position i
- .pop(): removes and returns the last item
- Using a list as a stack (efficient)
- Using a list as a queue (highly inefficient)
list_from_string = list("hello")
# ['h', 'e', 'l', 'l', 'o']
stack = []
stack.append(4)
last_item = stack.pop()
from collections import deque
queue = deque()  # double-ended queue: O(1) appends and pops at both ends
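The list methods and the stack/queue distinction above can be sketched as follows. This is a minimal illustration; the variable names and values are made up for the example.

```python
from collections import deque

items = [1, 2]
items.append(3)            # [1, 2, 3]
items.extend([4, 5])       # [1, 2, 3, 4, 5]
items.insert(0, 0)         # [0, 1, 2, 3, 4, 5]
items.remove(3)            # [0, 1, 2, 4, 5]
popped_at = items.pop(1)   # removes and returns 1 -> [0, 2, 4, 5]
last = items.pop()         # removes and returns 5 -> [0, 2, 4]

# Stack (LIFO): append and pop at the end of a list are O(1).
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()          # "b"

# Queue (FIFO): list.pop(0) shifts every remaining element, so it is O(n).
# deque.popleft() does the same job in O(1).
queue = deque(["a", "b"])
queue.append("c")
first = queue.popleft()    # "a"
```

For small lists the difference is negligible; it starts to matter when a queue holds many items or is drained in a tight loop.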
Dictionary
- Organizer | Collection of key-value pairs
- Database records, configurations, JSON from APIs
- Insertion Ordered
- NO Duplicate keys (Old value will be overwritten)
- Keys must be hashable (in practice, immutable types): strings, numbers, booleans, and tuples of hashable items. Lists, sets, and dicts cannot be keys.
- Hash Table | Hash Map: O(1) lookup time on average
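A short sketch of the key rules listed above: duplicate keys overwrite, keys must be hashable, and insertion order is kept. The names `settings` and `grid` are hypothetical and exist only for this example.

```python
# A repeated key keeps only the last value assigned to it.
settings = {"mode": "dev", "mode": "prod"}

# Tuples of hashable values are valid keys; lists are not.
grid = {(0, 0): "origin", (1, 2): "point"}

try:
    bad = {[0, 0]: "origin"}
    list_key_allowed = True
except TypeError:
    # Python reports: unhashable type: 'list'
    list_key_allowed = False

# Iteration follows insertion order (guaranteed since Python 3.7).
key_order = list(grid)
```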
empty_dict = {}
user = {
"username": "alex",
"id": 103,
"is_active": True
}
# dict() Constructor
user = dict(username="alex", id=103, is_active=True)
# zip() trick
keys = ["fruit", "vegetable", "grain"]
values = ["apple", "broccoli", "rice"]
food_map = dict(zip(keys, values))
# zip() returns a zip object
#################################
# Accessing
d[key]
# raises KeyError if key does not exist
d.get(key, default) # safer way to access a potentially missing key
# EXAMPLE
config = {"retries": 3}
config["timeout"] # raises KeyError
config.get("timeout", 30) # returns the default (30); config itself is not modified
#################################
.keys() # dict_keys view object
.values() # dict_values view object
.items() # dict_items view object of (key, value) tuples
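As a rough illustration of the three view methods, the sketch below reuses the `user` dictionary from earlier in these notes. The `email` field is an assumption added only to show that views stay in sync with the dictionary.

```python
user = {"username": "alex", "id": 103, "is_active": True}

keys = list(user.keys())      # ['username', 'id', 'is_active']
values = list(user.values())  # ['alex', 103, True]

# .items() yields (key, value) tuples, which unpack cleanly in a loop.
lines = [f"{key}={value}" for key, value in user.items()]

# Views are live: they reflect later changes to the dictionary.
key_view = user.keys()
user["email"] = "alex@example.com"  # hypothetical extra field
has_email = "email" in key_view     # True
```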
Set
- Unordered, Unique, Hashable Elements ONLY
- Deduplication: Converting a list to a set and back to a list is a simple, efficient way to remove duplicate elements, provided the elements are hashable and the original order does not matter. This is a fundamental pattern in data cleaning.
- Membership testing is faster in a set than in a list: O(1) on average vs O(n).
empty_set = set()
vowels = {'a', 'e', 'i', 'o', 'u'}
user_ids = [_, _, _]
unique_user_ids = list(set(user_ids)) # order not guaranteed
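A small sketch of deduplication and membership testing, using made-up ID values. It also shows `dict.fromkeys()` as one way to drop duplicates while keeping first-seen order, which the set round trip does not promise.

```python
user_ids = [103, 101, 103, 102, 101]  # illustrative values only

# set() removes duplicates but does not keep the original order.
unique_unordered = set(user_ids)

# dict.fromkeys() also removes duplicates and keeps first-seen order.
unique_ordered = list(dict.fromkeys(user_ids))  # [103, 101, 102]

# Membership tests hash the value instead of scanning every element.
is_known = 102 in unique_unordered    # True
is_missing = 999 in unique_unordered  # False
```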
Data Manipulation Operations
A list of dictionaries is a very common data structure, often resulting from reading a CSV file or parsing a JSON API response. Filtering this structure is a daily task for data analysts and engineers. Python offers several ways to accomplish this, with list comprehensions being the most idiomatic.
- List Comprehension (Preferred Method)
employees = []  # a list of dicts, each with 'role' and 'salary' keys
engineers = [emp for emp in employees if emp['role'] == "Engineer"]
high_earning_engineers = [
    emp for emp in employees
    if emp['role'] == "Engineer" and emp['salary'] > 100000
]
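To make the filtering pattern concrete, here is a runnable sketch with hypothetical sample records. The `name` field and all values are assumptions; only `role` and `salary` come from the notes above.

```python
# Hypothetical sample data standing in for rows from a CSV or JSON response.
employees = [
    {"name": "Ana", "role": "Engineer", "salary": 120000},
    {"name": "Ben", "role": "Designer", "salary": 95000},
    {"name": "Cy", "role": "Engineer", "salary": 90000},
]

engineers = [emp for emp in employees if emp["role"] == "Engineer"]

high_earning_engineers = [
    emp for emp in employees
    if emp["role"] == "Engineer" and emp["salary"] > 100000
]

# .get() avoids a KeyError when a record is missing the field entirely.
with_role = [emp for emp in employees if emp.get("role") == "Engineer"]

engineer_names = [emp["name"] for emp in engineers]  # ['Ana', 'Cy']
```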
- Filtering a Dictionary
grades = {'John': 85, 'Mary': 92, 'Matt': 78, 'Michael': 95, 'Laura': 88}
# get students with >= 90 score
top_performers = {name: score for name, score in grades.items() if score >= 90}
# Pattern: new_dict = {key: value for key, value in old_dict.items() if <condition on key or value>}
m_students = {name: score for name, score in grades.items() if name.startswith('M')}
Sorting
- list.sort(): in-place sorting; returns None
- sorted(iterable): returns a new, sorted list; accepts any iterable
- Sorting a list of dictionaries
employees = []  # a list of dicts, each with 'role' and 'salary' keys
sorted_by_salary_asc = sorted(employees, key=lambda emp: emp['salary'])
sorted_by_salary_desc = sorted(employees, key=lambda emp: emp['salary'], reverse=True)
# Complex sort with tie-breaking: role ascending, then salary descending
sorted_complex = sorted(employees, key=lambda emp: (emp['role'], -emp['salary']))
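The sorting calls above can be tried on a small hypothetical dataset. The records and the `name` field are assumptions; `operator.itemgetter` is shown as a standard-library alternative to a lambda, not as the preferred approach.

```python
from operator import itemgetter

# Hypothetical sample data.
employees = [
    {"name": "Ana", "role": "Engineer", "salary": 120000},
    {"name": "Ben", "role": "Designer", "salary": 95000},
    {"name": "Cy", "role": "Engineer", "salary": 90000},
    {"name": "Di", "role": "Designer", "salary": 99000},
]

# sorted() returns a new list and leaves the input untouched.
by_salary = sorted(employees, key=itemgetter("salary"))

# Tuple key: role ascending, then salary descending via negation.
by_role_then_pay = sorted(employees, key=lambda emp: (emp["role"], -emp["salary"]))
order = [emp["name"] for emp in by_role_then_pay]  # ['Di', 'Ben', 'Ana', 'Cy']

# list.sort() reorders the list in place and returns None.
result = employees.sort(key=itemgetter("salary"))
```

Negating the value only works for numeric fields. For a descending sort on strings, one option is to sort twice, relying on the fact that Python's sort is stable.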
Aggregating, Grouping | from collections import defaultdict
Grouping is a cornerstone of data aggregation and analysis. It is the process of taking a flat list of items and restructuring it into a nested data structure—typically a dictionary of lists—where items are categorized based on a common property or key.
transactions = [
{'id': 't1', 'category': 'books', 'amount': 25},
{'id': 't2', 'category': 'electronics', 'amount': 120},
{'id': 't3', 'category': 'books', 'amount': 15},
{'id': 't4', 'category': 'clothing', 'amount': 50},
{'id': 't5', 'category': 'electronics', 'amount': 85},
]
# input: a list of dictionaries
# returns: a dictionary with key=category and value = list of transactions in the specific category
from collections import defaultdict
grouped_transactions = defaultdict(list)
for transaction in transactions:
    cat = transaction['category']
    grouped_transactions[cat].append(transaction)
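Grouping is usually followed by an aggregation step. The sketch below reuses the transactions from these notes and adds a per-category total with `defaultdict(int)`; the names `totals` and `ids_by_category` are introduced here for illustration.

```python
from collections import defaultdict

transactions = [
    {'id': 't1', 'category': 'books', 'amount': 25},
    {'id': 't2', 'category': 'electronics', 'amount': 120},
    {'id': 't3', 'category': 'books', 'amount': 15},
    {'id': 't4', 'category': 'clothing', 'amount': 50},
    {'id': 't5', 'category': 'electronics', 'amount': 85},
]

# defaultdict(list): a missing key starts as an empty list.
grouped = defaultdict(list)
# defaultdict(int): a missing key starts at 0, handy for running totals.
totals = defaultdict(int)

for transaction in transactions:
    category = transaction['category']
    grouped[category].append(transaction)
    totals[category] += transaction['amount']

# Reduce each group to just its transaction ids.
ids_by_category = {cat: [t['id'] for t in items] for cat, items in grouped.items()}
```

With a plain dict, the same loop would need an explicit check or `setdefault()` before each append.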
sorted(iterable, key=None, reverse=False)
Taming Nested Data: From APIs and JSON to Python Objects
- Deserialization: Converting JSON to Python Object
- Serialization: Converting a Python Object to JSON
json Module
- json.loads(json_string): JSON-formatted string to Python object; deserialization
- json.load(file_object): Reads from a file-like object (e.g., a file opened in read mode) containing JSON data and returns the corresponding Python object.
- json.dumps(python_object, indent=None): Python object to JSON-formatted string; serialization
- json.dump(python_object, file_object): Takes a Python object and writes it to a file-like object in JSON format.
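The string-based pair, `json.loads` and `json.dumps`, can be tried without touching the file system. The JSON text below is a made-up example.

```python
import json

# Hypothetical JSON text, as it might arrive from an API.
json_string = '{"username": "alex", "id": 103, "tags": ["a", "b"], "manager": null}'

# Deserialization: JSON object -> dict, array -> list, null -> None.
record = json.loads(json_string)

# Serialization: Python objects -> JSON text.
compact = json.dumps(record)
pretty = json.dumps(record, indent=2, sort_keys=True)

# A round trip through JSON text gives back an equal object.
round_trip = json.loads(pretty)
```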
import json
# load from file
with open("./data.json", mode='r') as file:
    read_as_dict = json.load(file)
# write to file
with open("./data.json", mode='w') as file:
    json.dump(read_as_dict, file)  # any JSON-serializable object works here
API Response
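import requests
url = "something"
try:
    response = requests.get(url, timeout=10)  # timeout value is an arbitrary choice
    response.raise_for_status()  # raises HTTPError on 4xx/5xx responses
    data = response.json()  # parsed JSON: usually a dict or a list
except (requests.exceptions.RequestException, ValueError) as e:
    # RequestException covers network and HTTP errors; ValueError covers invalid JSON
    print(e)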
Miscellaneous
from collections import Counter
sentence = "the quick brown fox jumps over the lazy dog"
words = sentence.split()
word_counts = Counter(words)
most_common = word_counts.most_common(3)
from collections import deque # O(1) appends and pops at either end
task_queue = deque()
task_queue.append("Task 1")
task_queue.append("Task 2")
next_task = task_queue.popleft() # 'Task 1'
- Think in Patterns: Recognize that most data manipulation tasks are variations of a few fundamental patterns: filtering, sorting, grouping, and transforming. By identifying the pattern, one can apply the appropriate and most Pythonic tool for the job.
- Choose the Right Tool for the Job: Do not default to a list. Before writing a line of code, consider the access patterns the data requires. Does it need positional access? Fast key-based lookups? Uniqueness and set logic? A conscious choice between a list, dict, and set is the first step toward writing efficient and clean code.
- Embrace Comprehensions: Make list, dictionary, and set comprehensions the default tool for creating new collections from existing iterables. They are more than just syntactic sugar; they are a core part of the Pythonic idiom, leading to more concise, readable, and often more performant code.
- Master Nested Navigation: In an API-driven world, data is rarely flat. Practice safe and efficient navigation of nested dictionaries and lists. Make robust patterns like the .get() method and try-except blocks second nature to handle the inevitable inconsistencies of real-world data.
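The nested-navigation advice in the last point can be sketched as follows. The payload shape and every field name are hypothetical; real API responses will differ.

```python
# Hypothetical API payload with a missing 'address' and an incomplete order.
response_data = {
    "user": {
        "profile": {"name": "alex"},
        "orders": [{"id": "o1", "total": 25}, {"id": "o2"}],
    }
}

# Chained .get() with empty-dict defaults does not raise on a missing key.
name = response_data.get("user", {}).get("profile", {}).get("name", "unknown")
city = response_data.get("user", {}).get("address", {}).get("city", "unknown")

# try/except covers missing keys, out-of-range indexes, and wrong types.
try:
    third_order_id = response_data["user"]["orders"][2]["id"]
except (KeyError, IndexError, TypeError):
    third_order_id = None

# Per-item defaults inside a comprehension.
order_totals = [order.get("total", 0) for order in response_data["user"]["orders"]]
```

One caveat: if a key is present but holds `None` (JSON `null`), the chained `.get()` raises `AttributeError`, so the try/except form may be the safer choice for data of unknown quality.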